|
Most FPGAs today have all the discrete elements
essential for DSP design. By utilizing
these logic fabrics, FPGA designers
have successfully managed to tape out
countless DSP projects for both prototyping
and production runs.
Conventional DSP Design in FPGAs
Before dedicated DSP blocks, the conventional
DSP design methodology was no
different than any other FPGA methodology.
You may still use some typical ASIC
design techniques: identifying the most
common critical block; creating the optimal
design solution for that particular
block; and making that block reusable and
available for your colleagues. You can
characterize critical DSP blocks such as
multipliers, accumulators, and even coefficient
storage, and perform this individual
block characterization technique at
both the RTL and gate level.
In addition, you may also use techniques
such as special floorplanning, block-based
design (bottom-up methodologies),
and special constraints files to meet design
requirements. After going through these
design processes, you still have to surmount
a few big hurdles, such as minimizing place
and route iterations and interconnect
delays. Unfortunately, unlike logic delays,
you have little control over these obstacles.
Typically, the FPGA design timing
budget has already factored in these routing/interconnect delays. For many DSP
design applications, however, exceeding
certain routing delay limits is simply unacceptable.
That is why dedicated DSP
blocks in high-end FPGAs – such as the
Xilinx® XtremeDSP™ slice (also referred
to as DSP48) in Virtex™-4 devices – are
playing a critical role in designing high-performance
DSP systems.
How to Compete with ASICs
How important is it to achieve a maximum
operating bandwidth (sometimes referred
to as “custom silicon performance”) when
designing a complex DSP system using
current FPGAs? The obvious answer, of
course, is “very important.” Until recently,
the only available dedicated arithmetic
block was the multiplier, a key element for
DSP functions.
To be more competitive with ASICs in
this space, however, all major DSP elements
– including multipliers, adders, subtractors,
pipeline registers, and other
arithmetic operations – must perform at
close to custom silicon levels. Access to a
synthesis tool that allows you to take full
advantage of this advanced DSP silicon
functionality is also important. In this article,
we’ll use the DSP48 dedicated block
available in Virtex-4 devices and the
Mentor Graphics Precision Synthesis tool
as examples to illustrate relevant challenges
and solutions.
Dedicated DSP48 blocks, combined
with the advanced compiler in Precision
Synthesis, enable you to seamlessly implement
various high-performance DSP functions.
Applications benefiting from these
features include digital media and broadcasting
and VoIP, among others.
Not long ago, data paths were an obstacle
when implementing DSP functions in
FPGAs. The difficulty came mostly from
arithmetic operations such as multiplication,
addition, and subtraction. A simple
math equation of Y = (C ± (A * B) + CIN) can
generate a plethora of LUTs with multiple
levels of logic. On the other hand, a well-integrated
DSP48 slice coupled with the
powerful new features of an advanced tool
like Precision Synthesis can enable you to
extract very close to custom silicon performance
on your next DSP design.
DSP48 Slice: Features and Functions
A DSP48 tile comprises two DSP48 slices,
a shared 48-bit C bus, and internal dedicated
interconnect. The DSP48 slice itself
comprises the all-important elements for
DSP functions (Figure 1). The math portion
of the DSP48 slice comprises an 18 x
18-bit two’s complement multiplier followed by three 48-bit datapath multiplexers,
followed by a three-input 48-bit adder
and subtractor. The data and control inputs
feed directly to the arithmetic portions, or
are optionally registered once or twice (by
AREG and BREG) to accommodate the
construction of various highly pipelined
DSP applications.
Multiplier
The multiplier accepts two 18-bit two’s
complement operands producing a 36-bit
two’s complement result. The result is sign-extended
to 48 bits and can optionally be
fed into an adder/subtractor to form a
MULT-ADD arithmetic function. The
adder/subtractor accepts three 48-bit two’s
complement operands, and produces a 48-bit two’s complement result.
When the result is sign-extended from
36 bits to 48 bits, the most significant bit
(MSB) is simply copied 12 times to make a
48-bit result. For example, a “36’b 0101
1111 1111 1010 1101 1110 1111 1000
1110” result, whose MSB is 0, would become
“48’b 0000 0000 0000 0101 1111 1111 1010
1101 1110 1111 1000 1110”.
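The same rule applies when the MSB is 1. The following is a minimal Python sketch of the sign extension described above, not RTL; the function names are ours and are used only for illustration. It extends a negative 36-bit product to 48 bits and confirms that the two's complement value is unchanged.

```python
def sign_extend(pattern: int, from_bits: int = 36, to_bits: int = 48) -> int:
    """Sign-extend an unsigned `from_bits`-wide bit pattern to `to_bits` bits."""
    if not 0 <= pattern < (1 << from_bits):
        raise ValueError("pattern does not fit in from_bits")
    msb = (pattern >> (from_bits - 1)) & 1
    if msb:
        # Copy the MSB into each of the (to_bits - from_bits) new upper bits.
        upper = ((1 << (to_bits - from_bits)) - 1) << from_bits
        return pattern | upper
    return pattern  # MSB is 0, so the new upper bits stay 0


def to_signed(pattern: int, bits: int) -> int:
    """Interpret a `bits`-wide bit pattern as a two's complement integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


# A negative multiplier result: -3 * 5 = -15, held as a 36-bit pattern.
product_36 = (-3 * 5) & ((1 << 36) - 1)
product_48 = sign_extend(product_36)
print(f"36-bit: {product_36:09X}  48-bit: {product_48:012X}")
print(to_signed(product_36, 36), to_signed(product_48, 48))
```

Both patterns decode to the same signed value, which is why the extended product can feed the 48-bit adder/subtractor directly.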
Precision Synthesis supports this automatic
sign extension when any of the
DSP48 operators are inferred. This is a
simple concept, but very powerful, because
it allows you to perform wider arithmetic
operations without having to manually set
the correct bus width for the output result.
Accumulator (Adder/Subtractor)
The adder/subtractor stages are functions
of the inputs, which are driven by the
upstream multiplexers, carry-select logic,
and multiplier arrays. The CIN, X multiplexer
output, and Y multiplexer output
are always added together. You can control
this combined result to be selectively added
to or subtracted from the Z multiplexer
output: Adder Out = Z ± (X + Y + CIN).
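As a rough illustration of this equation, here is a small Python reference model of the three-input adder/subtractor. It is a sketch under stated assumptions: the helper names are ours, the operands are treated as already sign-extended 48-bit values, and the result is assumed to wrap around at 48 bits as ordinary two's complement arithmetic does.

```python
MASK48 = (1 << 48) - 1


def wrap48(value: int) -> int:
    """Reduce an integer to a signed 48-bit two's complement value."""
    value &= MASK48
    return value - (1 << 48) if value >> 47 else value


def adder_out(z: int, x: int, y: int, cin: int = 0, subtract: bool = False) -> int:
    """Adder Out = Z +/- (X + Y + CIN), wrapped to 48 bits."""
    combined = x + y + cin  # CIN, X, and Y are always added together
    return wrap48(z - combined if subtract else z + combined)


# Z carries an accumulated value; X carries a product; Y is unused (zero) here.
print(adder_out(z=100, x=3 * 4, y=0, cin=1))                 # add mode
print(adder_out(z=100, x=3 * 4, y=0, cin=1, subtract=True))  # subtract mode
```

A model like this can serve as a golden reference when checking simulation results from the synthesized design.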
Pipeline Registers
The built-in pipeline registers are a unique advantage of the DSP48
block compared to other DSP FPGA architectures.
Each DSP48 slice contains the
following pipeline registers at each stage:
- One or two pipeline registers for A and
B inputs (AREG and BREG)
- One pipeline register at the multiplier
output stage (MREG)
- One pipeline register at the output
stage (PREG)
- One pipeline register at the C input
- One pipeline register for opmode and
other control signals
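To see how these registers add up to latency, consider the following Python sketch of the multiplier path. It is a simplified model, not a description of the silicon: the class name is hypothetical, A and B are assumed to use the same number of input registers, and the whole pipeline is modeled as a delay line on the product.

```python
from collections import deque


class MultiplyPipelineModel:
    """Cycle-level model of A*B through the A/B, M, and P register stages."""

    def __init__(self, ab_regs: int = 1, mreg: bool = True, preg: bool = True):
        if ab_regs not in (0, 1, 2):
            raise ValueError("A and B take zero, one, or two input registers")
        # Total clock edges between presenting A, B and seeing the product on P.
        self.latency = ab_regs + int(mreg) + int(preg)
        # Products still in flight; all registers are assumed to reset to zero.
        self._in_flight = deque([0] * max(self.latency - 1, 0))

    def clock(self, a: int, b: int) -> int:
        """Present a and b, apply one clock edge, and return the new P value."""
        product = a * b
        if self.latency == 0:
            return product  # fully combinational path
        self._in_flight.append(product)
        return self._in_flight.popleft()


dsp = MultiplyPipelineModel(ab_regs=1, mreg=True, preg=True)
outputs = [dsp.clock(a, 2) for a in (3, 0, 0, 0)]
print(dsp.latency, outputs)
```

With one input register, MREG, and PREG enabled, the product of the first operands appears on the third clock edge.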
DSP48 Ports
The DSP48 slice input and output ports
support many common DSP and math
algorithms. Two direct 18-bit input data
ports are labeled A and B. As shown in
Figure 1, two DSP48 slices within a
DSP48 tile share a 48-bit input data port
labeled C. Each of them has one direct 48-bit output port labeled P, a cascaded input
datapath (B cascade), and a cascaded output
datapath (P cascade), providing a cascaded
input and output stream between
adjacent DSP48 slices.
Operating Mode
The 7-bit operating mode (opmode) inputs
provide a way for the design to change its
functionality from clock cycle to clock cycle
if desired. There are more than 40 dynamically
controlled opmodes, although you cannot
set all possible combinations (as
described in the XtremeDSP user guide for Virtex-4 FPGAs, www.xilinx.com/bvdocs/userguides/ug073.pdf). The
opmode bits can be optionally registered
under the control of the configuration
memory cells. Precision Synthesis automatically
assigns a correct opmode for
each DSP48 operator being inferred. The
synthesis tool uses a simple control signal
– along with the arithmetic operations in
the HDL source code – to accurately
determine each DSP48 function.
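To make the opmode idea concrete, the sketch below splits a 7-bit opmode into its X, Y, and Z multiplexer selections. The field positions (X in bits 1:0, Y in bits 3:2, Z in bits 6:4) and the selection names reflect our understanding of the user guide and should be treated as assumptions; verify them against UG073 before relying on them. The function name is hypothetical.

```python
# Assumed multiplexer encodings; check against the user guide.
X_MUX = {0b00: "zero", 0b01: "multiplier", 0b10: "P", 0b11: "A:B"}
Y_MUX = {0b00: "zero", 0b01: "multiplier", 0b11: "C"}
Z_MUX = {
    0b000: "zero",
    0b001: "PCIN",
    0b010: "P",
    0b011: "C",
    0b101: "shift(PCIN)",
    0b110: "shift(P)",
}


def decode_opmode(opmode: int) -> dict:
    """Split a 7-bit opmode into X, Y, and Z selections, rejecting unused codes."""
    if not 0 <= opmode < 128:
        raise ValueError("opmode is a 7-bit value")
    x = opmode & 0b11
    y = (opmode >> 2) & 0b11
    z = (opmode >> 4) & 0b111
    if y not in Y_MUX or z not in Z_MUX:
        raise ValueError(f"opmode {opmode:07b} uses an undefined selection")
    # The multiplier output occupies both X and Y, so they are selected together.
    if (x == 0b01) != (y == 0b01):
        raise ValueError("multiplier must be selected on both X and Y")
    return {"x": X_MUX[x], "y": Y_MUX[y], "z": Z_MUX[z]}


print(decode_opmode(0b0000101))  # plain multiply
print(decode_opmode(0b0100101))  # multiply-accumulate: P fed back on Z
```

The rejected codes illustrate why not every one of the 128 bit patterns is a usable opmode.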
Precision Synthesis with Virtex-4 FPGAs
In advanced FPGA architectures, the
basic DSP building blocks (for delay, data
storage, multiplication, addition, subtraction,
summation, and accumulation) are
no longer built using discrete components.
To tightly integrate
these essential DSP components
in high-end FPGAs, a
well-planned DSP block is
designed as part of the FPGA
chip’s dedicated resources. A
Virtex-4 DSP48 block supports
many independent
functions, including multiplier,
multiplier-accumulator
(MAC), multiplier followed
by adder, three-input adder,
barrel shifter, and pipeline.
To take full advantage of
these DSP blocks without having
to learn about the implementation
details in depth, any
good FPGA synthesis tool must
provide intelligent and accurate
inference and mapping capabilities
for DSP functions. Using
the Precision Synthesis tool,
you can focus your design time
more effectively on more
important tasks and critical
deliverables and meet increasingly
tight project schedules.
Transposed FIR Filter
Figure 2 shows a block diagram of a transposed
FIR filter structure, and Figure 3 illustrates
an example coding style. Table 1 lists the
post-place-and-route area and timing results.
For this design, Precision Synthesis infers
the MULT_ADD operator (block diagram
shown in Figure 4) and maps the multiplier,
its input pipeline registers, and the
adder logic into DSP blocks.
As you can see from Table 1, Precision
Synthesis uses the same RTL design to target
different Virtex families, starting from
the first Virtex device (2.5V, 0.22 µm, five-layer
metal process) to the latest Virtex-4
device (1.5V, 90 nm copper process). You
may notice that the exponential improvement
in quality of results (QoR) begins with the Virtex-II
family; most of the improvement came
from the integration of a dedicated
multiplier in Virtex-II FPGAs.
Since then, these dedicated resources
have been continuously enhanced and
eventually transformed into the advanced
DSP48 slice in the Virtex-4 family. Based
on the results in Table 1, it would be difficult
for anyone to doubt the tremendous
QoR improvements that the Virtex-4
device has brought to today’s FPGA DSP
design community. The advantages of dedicated
DSP48 slices over discrete DSP elements
are also quite clear.
The transposed FIR filter structure in
Figure 2 is optimal for use with the
DSP48 slice. Precision Synthesis can
absorb the entire filter into DSP48 slices,
including pipeline registers, adders, and
multipliers, leaving no logic in the FPGA fabric.
You may choose to use one of many
different approaches to code this DSP
design block. Let’s describe one approach
where you only have to implement a simple
MULT-ADD operator and use Verilog
2001 to generate the rest of the blocks.
The design specifications are:
- Signed 18-bit input samples (B(n))
- Signed 18-bit input coefficients (h(n))
– Use registers to store coefficients
- Signed 48-bit output stream (P(n))
- 16 taps
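The specification above can be captured in a short behavioral model before any RTL is written. The following Python sketch is our own illustration, not the article's Figure 3 code: the class name is hypothetical, each tap is modeled as one MULT-ADD stage with a 48-bit register, and input pipeline registers are omitted, so the latency is shorter than in hardware.

```python
MASK48 = (1 << 48) - 1


def wrap48(value: int) -> int:
    """Reduce an integer to a signed 48-bit two's complement value."""
    value &= MASK48
    return value - (1 << 48) if value >> 47 else value


class TransposedFIR:
    """Transposed FIR filter: the sample is broadcast to every tap."""

    def __init__(self, coeffs):
        if any(not -(1 << 17) <= h < (1 << 17) for h in coeffs):
            raise ValueError("coefficients must fit in signed 18 bits")
        self.coeffs = list(coeffs)
        self.regs = [0] * len(self.coeffs)  # one 48-bit adder-chain register per tap

    def clock(self, sample: int) -> int:
        """Apply one clock edge with the given 18-bit sample; return P(n)."""
        if not -(1 << 17) <= sample < (1 << 17):
            raise ValueError("sample must fit in signed 18 bits")
        taps = len(self.coeffs)
        next_regs = []
        for k in range(taps):
            # MULT-ADD: h(k) * B(n) plus the registered sum from the next tap.
            chain_in = self.regs[k + 1] if k + 1 < taps else 0
            next_regs.append(wrap48(self.coeffs[k] * sample + chain_in))
        self.regs = next_regs
        return self.regs[0]


coeffs = list(range(1, 17))  # 16 illustrative coefficients
fir = TransposedFIR(coeffs)
samples = [5, -3, 7] + [0] * 20
outputs = [fir.clock(s) for s in samples]
print(outputs[:4])
```

Every tap performs the same multiply-add on the same broadcast sample, which is why a single MULT-ADD operator replicated by a generate loop is enough to describe the filter.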
Precision Synthesis supports the Verilog
2001 generate statement, which you can
use to generate this MULT-ADD operator
as shown by the coding example (Figure 3)
and technology schematic (Figure 5). The
pros to using the transposed FIR filter far
outweigh the cons.
Advantages:
- Low latency – the maximum latency
never exceeds the pipelining time
through the slice containing the first
coefficient. Typically, this is three clock
cycles from the time data is input until
the result appears at the output.
- Efficient mapping to the DSP48 slice –
mapping is enabled by the adder chain
structure of the transposed FIR filter.
- No external logic – no external
FPGA fabric is required, enabling
you to achieve the highest possible
performance.
Disadvantages:
- Performance may be limited by a high
fan-out input signal if a large number
of taps exist.
Mapping Beyond Multipliers
Precision Synthesis automatically infers
and maps all multiplier and arithmetic
operators into DSP48 where possible. In
DSP designs, however, critical data paths
do not always run through multipliers.
Adders, subtractors, counters, and
other operators can also be the source of
the critical timing path. To help deal with
these situations, Precision Synthesis lets
you control the mapping of each arithmetic
operator individually.
These features deliver two main advantages:
- An extra boost in solving timing problems
when necessary.
- DSP mapping controllability when
desired because of resource availability
and timing requirements.
Figure 6 shows the RTL schematic of a
DEC operator. Precision Synthesis gives you
the option to map the following operators
into a DSP48 slice (the default mapping is
listed for each):
- Adder/subtractor/addsub – default to
CARRY CHAIN and/or LOGIC
- INC/DEC/INCDEC – default to
CARRY CHAIN and/or LOGIC
- EQ/NEQ/LT/LTE/GT/GTE – default
to CARRY CHAIN and/or LOGIC
- Counter – default to CARRY CHAIN
and/or LOGIC
- Mult/mult-add/mult-acc – default to
DSP48
You can map the DEC operator into a
DSP48 slice by using the Precision Synthesis
GUI (Figure 7) or interactive command line
as follows:
- From the RTL hierarchical browser,
right-click on the DEC operator. Select
Set Attributes > New.
- Enter attribute name and value.
- You can use the command line to accomplish
the same task:
set_attribute -design rtl -name use_resource -value DSP48 -instance rtlc_1_dec_0
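When several operators need the same treatment, the commands can be generated rather than typed. The Python sketch below simply formats the set_attribute command shown above for a list of instances. Only rtlc_1_dec_0 comes from this article; the other instance names and the helper name are hypothetical, and you would take the real names from your own RTL hierarchical browser.

```python
def use_resource_command(instance: str, resource: str = "DSP48") -> str:
    """Format the set_attribute command that maps one operator instance."""
    return (
        "set_attribute -design rtl -name use_resource "
        f"-value {resource} -instance {instance}"
    )


# rtlc_1_dec_0 is the instance from the article; the others are made up.
instances = ["rtlc_1_dec_0", "rtlc_2_add_0", "rtlc_3_inc_0"]
script_lines = [use_resource_command(name) for name in instances]
print("\n".join(script_lines))
```

The printed lines can be pasted into the interactive command line or saved to a script file for later runs.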
Conclusion
For MULT-ACC with adder/subtractor and
multiple stages of internal input pipeline
registers, the Xilinx Virtex-4 architecture
has specific advantages over other vendors’
architectures. For wide multipliers without
accumulators, you may find that the DSP48
in Virtex-4 devices and other vendors’ DSP
blocks compete more closely in terms of
performance. But
regardless of your design application, it
would be beneficial to test-drive the DSP48
functions in tandem with the Mentor
Graphics Precision Synthesis tool for your
next design project.
By supporting advanced DSP48 inferencing
capabilities, Precision Synthesis makes it
possible to easily model DSP RTL behavior
at a very high level. It also provides added
flexibility with RTL coding styles. The
DSP48 inference capability in Precision
Synthesis, combined with the advanced
DSP48 slice in Virtex-4 FPGAs, provides a
powerful system solution if you are considering
DSP design with today’s advanced
FPGA architectures.
Printable PDF version of this article with graphics. (7/11/05) 345 KB
|