With the introduction of Xilinx Virtex-4
FPGAs in September 2004, the world of DSP
design witnessed a dramatic leap in programmable
logic DSP: higher performance, lower
cost, lower power, and maximum flexibility.
At the same time, this shift asks
DSP hardware engineers to change their traditional
way of designing and embrace a different
approach. These great improvements have been
made possible by the XtremeDSP slice.
The XtremeDSP Slice
The XtremeDSP slice (also referred to as the
DSP48) is a high-performance multiplier and
arithmetic unit with great flexibility that can
form the building block of many DSP algorithms
implemented in FPGAs. A detailed
diagram of the DSP48 structure is shown in
Figure 1.
The XtremeDSP slice comprises four main
sections:
- I/O registers
- 18 x 18 signed multiplier
- Three-input adder/subtractor
- Op-mode multiplexers
The I/O registers ensure a maximum clock
performance of 500 MHz in the fastest speed
grade device (400 MHz in the slowest speed
grade), also ensuring support for higher sample
rates. The dynamic op-mode multiplexers are
key to the functionality of the structure; they are
responsible for the DSP48's great flexibility. For
example, in a simple MACC engine, you set the
X and Y MUX to multiply and select the feedback
path from the registered output P as the Z
MUX input to the arithmetic unit.
In the Virtex-4 architecture, XtremeDSP
slices are arranged in columns. The most important
aspect of the column is the cascade logic
and routing between adjacent slices, which exists on
both the input and output stages of each slice.
This dedicated routing enables a number of
filters and other functions to be built entirely
within the XtremeDSP slice, thus removing the
need for signals to be routed through the FPGA
interconnect or logic fabric.
However, because the cascaded slices form an
adder chain, you must take this configuration
into account when designing functions that
exploit the XtremeDSP slice.
Herein lies the fundamental change in the
approach to filter design. The simple, traditional
adder-tree approach limited the performance
and extensibility of a given filter
implementation. By using adder-chain-style
implementations, these limitations are lifted
and the huge benefits Virtex-4 FPGAs offer
are possible.
The embedded nature of the XtremeDSP
slice has also had a radical impact on reducing
the power consumed by high-speed multiply
and add functions. Figure 2 illustrates
this dramatic reduction, showing that the
dynamic power consumption is 1/17 that of
Virtex-II Pro devices, with a specification
of 2.9 mW/100 MHz. As a designer, you
should migrate as much functionality into
these embedded functions as possible.
Filter Techniques
During the last ten years, hardware and
FPGA designers have created a wide variety
of filter architectures to efficiently exploit
the building blocks that the current generation
of technology offers. With the introduction
of Virtex-4 FPGAs and the
XtremeDSP slice, filter implementations
must change to most efficiently exploit this
latest FPGA offering. Filters are prolific in
DSP designs and nearly always form the
starting point for analyzing an architecture.
The general FIR filter equation is a
summation of products (also known as an
inner product) defined in the equation:

$$y_n = \sum_{i=0}^{N-1} x_{n-i} \, h_i$$

In this equation, a set of N coefficients h_i is
multiplied by N respective data samples x_{n-i},
and the results are summed to form an
individual result y_n. The values of the coefficients
determine the characteristics of the
filter: low-pass, band-pass, or high-pass.
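As a point of reference, the inner product can be sketched in a few lines of Python. This is a plain software model, not a hardware description. The function names `fir_output` and `fir_filter` are illustrative, and samples before the start of the signal are assumed to be zero.

```python
def fir_output(coeffs, samples):
    """Return one FIR result: the sum of coeffs[i] * samples[i].

    samples[i] is assumed to hold x[n - i], the sample i steps in the past.
    """
    if len(coeffs) != len(samples):
        raise ValueError("need one data sample per coefficient")
    return sum(h * x for h, x in zip(coeffs, samples))


def fir_filter(coeffs, signal):
    """Filter a whole signal, treating samples before the start as zero."""
    history = [0] * len(coeffs)           # history[i] holds x[n - i]
    results = []
    for x in signal:
        history = [x] + history[:-1]      # newest sample enters at index 0
        results.append(fir_output(coeffs, history))
    return results
```

Feeding a single 1 followed by zeros through `fir_filter` returns the coefficients in order, which is a quick way to check a coefficient set.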
The Semi-Parallel FIR Filter
Even within the filter world, you can
implement a wide variety of filters. The key
parameters that tell us which FIR filter
implementation we will construct are:
- Number of coefficients (N)
- Sample rate (Fs)
Let's examine a particular filter structure
to demonstrate the key design techniques
that can help you maximize the benefits of
Virtex-4 devices. Our filter has 20 coefficients
and a sample rate of 74.25 MHz.
As noted earlier, the maximum clock
speed of the XtremeDSP slice is 400
MHz in the slowest speed grade (-10).
Dividing 400 MHz by the 74.25 MHz sample
rate gives 5.39, so we have a total of five
whole clock cycles per input sample to
perform the required 20 multiply
and adds to form the result.
This equation determines how many
multipliers to use for a particular semi-parallel
architecture:
Number of Multipliers = (Maximum Input Sample Rate x Number of Coefficients) / Clock Speed
For our example, (74.25 x 20) / 400 is approximately
3.7, which rounds up to four multipliers. Once we have
determined the required number of multipliers,
there is an extendable architecture
using the XtremeDSP slices that can serve
as the basis for the filter.
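The sizing arithmetic can be sketched as follows. The function names are hypothetical, and rounding the cycle count down and the multiplier count up are assumptions made here so that both values are whole numbers.

```python
import math


def cycles_per_sample(clock_mhz, sample_rate_mhz):
    """Whole clock cycles available to process each input sample."""
    return math.floor(clock_mhz / sample_rate_mhz)


def multipliers_needed(sample_rate_mhz, num_coeffs, clock_mhz):
    """Number of multipliers for a semi-parallel FIR, rounded up."""
    return math.ceil(sample_rate_mhz * num_coeffs / clock_mhz)


# Values from the example filter: 20 coefficients, 74.25 MHz samples,
# 400 MHz clock limit in the slowest speed grade.
cycles = cycles_per_sample(400.0, 74.25)
mults = multipliers_needed(74.25, 20, 400.0)
print(cycles, mults)
```

With the example values this gives five cycles per sample and four multipliers, so each multiplier handles five of the 20 coefficients.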
XtremeDSP arithmetic units are
designed to be chained together easily and
efficiently thanks to dedicated routing
between slices. Figure 3 illustrates how the
four XtremeDSP multiply and add elements
are cascaded together to form the
main part of the filter.
It is critical to highlight the usage of the
adder chain here rather than the more traditional
adder tree. The adder chain has a profound
impact on the control logic required
for the filter, as well as its efficiency, because
of the mapping to the XtremeDSP slice.
Continuing to analyze the filter structure,
an extra XtremeDSP slice is required to perform
the accumulation of the partial results,
thus creating the final result. A new result is
created every five clock cycles. This means
that once every five cycles the accumulation
must be reset to the first inner product of the
next result. This reset (or load) is achieved by
changing the op-mode value of the
XtremeDSP slice for a single cycle, from
0010010 to 0010000 (this is just a single bit
change). At the same time, the capture register
is enabled and the final result stored on
the output.
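The load-then-accumulate behavior can be sketched as a small behavioral model. It is an idealized sketch: pipeline latency is ignored, the name `semi_parallel_fir_result` is hypothetical, and the assignment of five consecutive coefficients to each multiply-add element is an assumption. Only the two op-mode values come from the text.

```python
OPMODE_ACCUMULATE = 0b0010010  # normal cycles: add the chain output to P
OPMODE_LOAD = 0b0010000        # first cycle of a result: restart the sum


def semi_parallel_fir_result(coeffs, samples, num_mults=4, cycles=5):
    """Compute one FIR result with num_mults multipliers over several cycles."""
    if not (len(coeffs) == len(samples) == num_mults * cycles):
        raise ValueError("expected num_mults * cycles coefficients and samples")
    p = 0              # accumulator register P of the extra slice
    opmodes = []       # op-mode applied on each cycle, for inspection
    for cycle in range(cycles):
        # Adder chain: each element contributes one product per cycle.
        chain = 0
        for m in range(num_mults):
            idx = m * cycles + cycle   # assumed coefficient-to-element mapping
            chain += coeffs[idx] * samples[idx]
        opmode = OPMODE_LOAD if cycle == 0 else OPMODE_ACCUMULATE
        p = chain if opmode == OPMODE_LOAD else p + chain
        opmodes.append(opmode)
    return p, opmodes
```

After five cycles `p` equals the full 20-term inner product, and the op-mode sequence shows one load followed by four accumulates.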
The Control Logic
The control is the most important and complicated
aspect of semi-parallel FIR filters;
getting it right is crucial to filter operation.
Because the XtremeDSP slice is most efficiently
used in adder chains, memory
addressing is necessary to provide the delay
for each multiply-add element that the adder
chain causes. Figure 4 illustrates the control
logic required to create memory addressing.
The counter creates the fundamental
zero through four count. This is then
delayed by one cycle by the use of a register
in the control path. Each successive delay
is used to address both the coefficient memory
and the data buffer, along with their respective
multiply-add elements. Hence, a single
delay is required for the second multiply-add
element, two delays for the third
multiply-add element, and so on. Note that
this is extensible control logic for M number
of multipliers.
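The delayed addressing can be sketched as a counter feeding a chain of registers. This is a behavioral illustration only; the name `address_sequences` is hypothetical, and `None` is used here to mark clocks before a delayed address has arrived.

```python
def address_sequences(num_mults=4, cycles=5, num_clocks=10):
    """Return, per multiply-add element, the address seen on each clock.

    Element 0 sees the raw counter; element m sees it delayed by m registers.
    """
    seqs = [[] for _ in range(num_mults)]
    delay_line = [None] * (num_mults - 1)   # registers in the control path
    count = 0
    for _ in range(num_clocks):
        taps = [count] + delay_line          # address at each element this clock
        for m in range(num_mults):
            seqs[m].append(taps[m])
        # Clock edge: shift the registers and advance the counter.
        delay_line = ([count] + delay_line)[:num_mults - 1]
        count = (count + 1) % cycles         # counts 0 .. cycles - 1, then wraps
    return seqs


for element, seq in enumerate(address_sequences()):
    print(element, seq)
```

Each row is the previous row shifted right by one clock, which is why adding a multiplier only adds one more register to the control path.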
Figure 4 also shows write enable
sequencing. A relational operator is
required to determine when the count-limited
counter resets its count. This signal
is high for one clock cycle every five
cycles, reflecting the input and output
data rates. The clock enable signal is
delayed by a single register just like the
coefficient address; each delayed version
of the signal is tied to the respective section
of the filter.
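The enable sequencing follows the same pattern and can be sketched the same way. The name `write_enable_taps` is hypothetical, and asserting the enable on the clock where the counter holds its final value is an assumption about when the relational operator fires.

```python
def write_enable_taps(num_mults=4, cycles=5, num_clocks=10):
    """Return one row per clock: the enable seen by each filter section."""
    rows = []
    delay_line = [0] * (num_mults - 1)    # one register per extra section
    count = 0
    for _ in range(num_clocks):
        # Relational operator: high when the counter is about to reset.
        enable = 1 if count == cycles - 1 else 0
        rows.append([enable] + delay_line)
        delay_line = ([enable] + delay_line)[:num_mults - 1]
        count = (count + 1) % cycles
    return rows
```

Each section sees a pulse one clock after its neighbor, and each pulse repeats every five clocks.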
The filter and control logic are
extremely cascadable. Each SRL16E data buffer
and its coefficient memory share the same
address, which is a delayed version of the
previous element's address.
The performance and resource utilization
for our filter are specified in Table 1. In
the table, you can see how logic slice utilization
dramatically drops when using the
XtremeDSP slice. Clock frequency performance
approximately doubles over
Virtex-II Pro FPGAs.
Three Important Design Points
This new filter architecture, along with
Virtex-4 devices and the XtremeDSP slice,
addresses the demanding needs of current and
future DSP designs. However, it is only one
filter in an extremely large array of possible
implementations, not to mention other DSP
functions such as IIRs, FFTs, and DCTs.
Knowing this, you can take away three
very important design questions that will
enable you to exploit the XtremeDSP slice
and Virtex-4 device as designed.
- Is the design running as fast as possible?
The fastest speed grade (-12) should
run at 500 MHz. If your design is
running at 50 MHz, you have
room to reduce resource utilization
(and cost) by raising the clock rate
and making more efficient
use of the FPGA resources. The
faster a particular function operates,
the smaller it becomes. Our semi-parallel
FIR filter, for example, used
five XtremeDSP slices running at about 375
MHz instead of 20 XtremeDSP slices
running at 74.25 MHz.
- Are there any XtremeDSP slices left?
If you are not using them all up, you
can probably add some functionality.
This can lead to logic slice reduction
and lower power consumption.
- Are you using adder chains instead
of adder trees?
DSP algorithms must aim to exploit
adder chain-based implementations
wherever possible, as this will lead to
the best utilization of the XtremeDSP
slice. Such implementations will result
in performance gains, power reduction,
and logic slice reduction.
Conclusion
For more information, see the XtremeDSP
Slice Design Considerations User Guide,
which provides in-depth details on other filter
implementations and DSP functions, at
www.xilinx.com/bvdocs/userguides/ug073.pdf.
There are also other HDL and System
Generator for DSP reference designs to get
you started.
Printable PDF version of this article with graphics. (1/15/05) 430 KB