Xcell Journal Online
Designing with DSP48 Blocks Using Precision Synthesis



by Douang Phanthavong, Technical Marketing Engineer, Mentor Graphics Corporation
douang_phanthavong@mentor.com (7/11/05)


Achieve close to custom silicon performance in FPGA DSP designs.


Most FPGAs today have all the discrete elements essential for DSP design. By utilizing these logic fabrics, FPGA designers have successfully managed to tape out countless DSP projects for both prototyping and production runs.

Conventional DSP Design in FPGAs
Before dedicated DSP blocks, the conventional DSP design methodology was no different than any other FPGA methodology. You may still use some typical ASIC design techniques: identifying the most common critical block; creating the optimal design solution for that particular block; and making that block reusable and available for your colleagues. You can characterize critical DSP blocks such as multipliers, accumulators, and even coefficient storage, and perform this individual block characterization technique at both the RTL and gate level.

In addition, you may also use techniques such as special floorplanning, block-based design (bottom-up methodologies), and special constraints files to meet design requirements. After going through these design processes, you still have to surmount a few big hurdles, such as minimizing place and route iterations and interconnect delays. Unfortunately, unlike logic delays, you have little control over these obstacles.

Typically, the FPGA design timing budget has already factored in these routing/interconnect delays. For many DSP design applications, however, exceeding certain routing delay limits is simply unacceptable. That is why dedicated DSP blocks in high-end FPGAs – such as the Xilinx® XtremeDSP™ slice (also referred to as DSP48) in Virtex™-4 devices – are playing a critical role in designing high-performance DSP systems.

How to Compete with ASICs
How important is it to achieve a maximum operating bandwidth (sometimes referred to as “custom silicon performance”) when designing a complex DSP system using current FPGAs? The obvious answer, of course, is “very important.” Until recently, the only available dedicated arithmetic block was the multiplier, a key element for DSP functions.

To be more competitive with ASICs in this space, however, all major DSP elements – including multipliers, adders, subtractors, pipeline registers, and other arithmetic operations – must perform at close to custom silicon levels. Access to a synthesis tool that allows you to take full advantage of this advanced DSP silicon functionality is also important. In this article, we’ll use the DSP48 dedicated block available in Virtex-4 devices and the Mentor Graphics Precision Synthesis tool as examples to illustrate relevant challenges and solutions.

Dedicated DSP48 blocks, combined with the advanced compiler in Precision Synthesis, enable you to seamlessly implement various high-performance DSP functions. Applications benefiting from these features include digital media and broadcasting and VoIP, among others.

Not long ago, data paths were an obstacle when implementing DSP functions in FPGAs, mostly because of arithmetic operations such as multiplication, addition, and subtraction. A simple math equation such as Y = C ± (A * B) + CIN can generate a plethora of LUTs with multiple levels of logic. On the other hand, a well-integrated DSP48 slice coupled with the powerful new features of an advanced tool like Precision Synthesis can enable you to extract very close to custom silicon performance on your next DSP design.

DSP48 Slice: Features and Functions
A DSP48 tile comprises two DSP48 slices, a shared 48-bit C bus, and internal dedicated interconnect. The DSP48 slice itself comprises the all-important elements for DSP functions (Figure 1). The math portion of the DSP48 slice comprises an 18 x 18-bit two’s complement multiplier followed by three 48-bit datapath multiplexers, followed by a three-input 48-bit adder and subtractor. The data and control inputs feed directly to the arithmetic portions, or are optionally registered once or twice (by AREG and BREG) to accommodate the construction of various highly pipelined DSP applications.

Multiplier
The multiplier accepts two 18-bit two’s complement operands and produces a 36-bit two’s complement result. The result is sign-extended to 48 bits and can optionally be fed into an adder/subtractor to form a MULT-ADD arithmetic function. The adder/subtractor accepts three 48-bit two’s complement operands and produces a 48-bit two’s complement result. When the result is sign-extended from 36 bits to 48 bits, the most significant bit (MSB) is simply copied 12 times to make a 48-bit result. For example, a “36’b 0101 1111 1111 1010 1101 1110 1111 1000 1110” result would become “48’b 0000 0000 0000 0101 1111 1111 1010 1101 1110 1111 1000 1110”.

Precision Synthesis supports this automatic sign extension when any of the DSP48 operators are inferred. This is a simple concept, but very powerful, because it allows you to perform wider arithmetic operations without having to manually set the correct bus width for the output result.
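The sign-extension rule can be sketched in a few lines of Python. This is only a behavioral illustration, not anything produced by Precision Synthesis; the function name `sign_extend` and the sample values are hypothetical and chosen for this example.

```python
def sign_extend(value: int, from_bits: int = 36, to_bits: int = 48) -> int:
    """Sign-extend a two's complement bit pattern from from_bits to to_bits.

    The result is returned as an unsigned bit pattern of width to_bits.
    """
    value &= (1 << from_bits) - 1            # keep only the low from_bits bits
    msb = (value >> (from_bits - 1)) & 1     # sign bit of the narrow value
    if msb:
        # Copy the sign bit into the upper (to_bits - from_bits) positions.
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value


# A positive 36-bit product (MSB = 0): the 12 upper bits are filled with zeros.
positive = sign_extend(0x5FFADEF8E)
# A negative 36-bit product (MSB = 1): the 12 upper bits are filled with ones.
negative = sign_extend(0x800000000)

print(f"{positive:012X}")
print(f"{negative:012X}")
```

A negative product is the more interesting case: filling the upper 12 bits with ones keeps the 48-bit value numerically equal to the 36-bit one.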

Accumulator (Adder/Subtractor)
The adder/subtractor stages are functions of the inputs, which are driven by the upstream multiplexers, carry-select logic, and multiplier arrays. The CIN, X multiplexer output, and Y multiplexer output are always added together. You can control this combined result to be selectively added to or subtracted from the Z multiplexer output: Adder Out = Z ± (X + Y + CIN).
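As a rough behavioral sketch of the adder/subtractor equation, the following Python model computes Z ± (X + Y + CIN) and wraps the result to 48-bit two’s complement. The names `adder_out` and `to_signed48` are hypothetical, and the wrap-on-overflow behavior is an assumption of this model rather than a statement from the article.

```python
MASK48 = (1 << 48) - 1


def to_signed48(value: int) -> int:
    """Interpret the low 48 bits of value as a two's complement number."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def adder_out(z: int, x: int, y: int, cin: int = 0, subtract: bool = False) -> int:
    """Model Adder Out = Z +/- (X + Y + CIN) with a 48-bit result.

    X, Y, and CIN are always summed first; the subtract flag selects whether
    that sum is added to or subtracted from Z (assumed to wrap on overflow).
    """
    partial = x + y + cin
    result = z - partial if subtract else z + partial
    return to_signed48(result)


print(adder_out(100, 6, 0, cin=1))                 # Z + (X + Y + CIN)
print(adder_out(100, 6, 0, cin=1, subtract=True))  # Z - (X + Y + CIN)
```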

Pipeline Registers
This is a unique advantage of the DSP48 block compared to other DSP FPGA architectures. Each DSP48 slice contains the following pipeline registers at each stage:

  • One or two pipeline registers for A and B inputs (AREG and BREG)
  • One pipeline register at the output multiplier stage (MREG)
  • One pipeline register at the output stage (PREG)
  • One pipeline register at the C input
  • One pipeline register for opmode and other control signals

DSP48 Ports
The DSP48 slice input and output ports support many common DSP and math algorithms. Two direct 18-bit input data ports are labeled A and B. As shown in Figure 1, two DSP48 slices within a DSP48 tile share a 48-bit input data port labeled C. Each of them has one direct 48-bit output port labeled P, a cascaded input datapath (B cascade), and a cascaded output datapath (P cascade), providing a cascaded input and output stream between adjacent DSP48 slices.

Operating Mode
The 7-bit operating mode (opmode) inputs provide a way for the design to change its functionality from clock cycle to clock cycle if desired. There are more than 40 dynamically controlled opmodes, although you cannot set all possible combinations (as described in the Virtex-4 user guide, www.xilinx.com/bvdocs/userguides/ug073.pdf). The opmode bits can be optionally registered under the control of the configuration memory cells. Precision Synthesis automatically assigns a correct opmode for each DSP48 operator being inferred. The synthesis tool uses a simple control signal – along with the arithmetic operations in the HDL source code – to accurately determine each DSP48 function.

Precision Synthesis with Virtex-4 FPGAs
In advanced FPGA architectures, the basic DSP building blocks (for delay, data storage, multiplication, addition, subtraction, summation, and accumulation) are no longer built using discrete components. To tightly integrate these essential DSP components in high-end FPGAs, a well-planned DSP block is designed as part of the FPGA chip’s dedicated resources. A Virtex-4 DSP48 block supports many independent functions, including multiplier, multiplier-accumulator (MAC), multiplier followed by adder, three-input adder, barrel shifter, and pipeline.

To take full advantage of these DSP blocks without having to learn about the implementation details in depth, any good FPGA synthesis tool must provide intelligent and accurate inference and mapping capabilities for DSP functions. Using the Precision Synthesis tool, you can focus your design time more effectively on more important tasks and critical deliverables and meet increasingly tight project schedules.

Transposed FIR Filter
Figure 2 shows a block diagram of a transposed FIR filter structure, and Figure 3 illustrates an example coding style for it. The post-place and route area and timing results are shown in Table 1. For this design, Precision Synthesis infers the MULT_ADD operator (block diagram shown in Figure 4) and maps the multiplier, its input pipeline registers, and the adder logic into DSP blocks.

As you can see from Table 1, Precision Synthesis uses the same RTL design to target different Virtex families, starting from the first Virtex device (2.5V, 0.22 µm, five-layer metal process) to the latest Virtex-4 device (1.5V, 90 nm copper process). You may notice that the exponential QoR improvement begins with the Virtex-II family; most of the improvement came from the integration of a dedicated multiplier in Virtex-II FPGAs.

Since then, these dedicated resources have been continuously enhanced and eventually transformed into the advanced DSP48 slice in the Virtex-4 family. Based on the results in Table 1, it would be difficult for anyone to doubt the tremendous QoR improvements that the Virtex-4 device has brought to today’s FPGA DSP design community. The advantages of dedicated DSP48 slices over discrete DSP elements are also quite clear. The transposed FIR filter structure in Figure 2 is optimal for use with the DSP48 slice. Precision Synthesis can absorb all FPGA fabric into DSP48 slices, including pipeline registers, adders, and multipliers.

You may choose to use one of many different approaches to code this DSP design block. Let’s describe one approach where you only have to implement a simple MULT-ADD operator and use Verilog 2001 to generate the rest of the blocks. The design specifications are:

  • Signed 18-bit input samples (B(n))
  • Signed 18-bit input coefficients (h(n))
    – Use registers to store coefficients
  • Signed 48-bit output stream (P(n))
  • 16 taps
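
The specification above can be modeled behaviorally before writing any RTL. The Python sketch below is not the Verilog of Figure 3; it is a hypothetical reference model (the names `TransposedFir`, `step`, and `wrap48` are invented here) that treats each tap as one MULT-ADD stage feeding a 48-bit register, and it ignores the extra latency of input and multiplier pipeline registers.

```python
from typing import List

MASK48 = (1 << 48) - 1


def wrap48(value: int) -> int:
    """Wrap an integer to 48-bit two's complement (assumed overflow behavior)."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def check_signed18(value: int) -> int:
    """Reject values that do not fit a signed 18-bit operand."""
    if not -(1 << 17) <= value < (1 << 17):
        raise ValueError(f"{value} does not fit in signed 18 bits")
    return value


class TransposedFir:
    """Transposed FIR: every tap multiplies the same input sample and adds
    the registered partial sum coming from the next tap in the chain."""

    def __init__(self, coeffs: List[int]) -> None:
        self.coeffs = [check_signed18(c) for c in coeffs]
        self.regs = [0] * len(self.coeffs)   # one 48-bit register per tap

    def step(self, sample: int) -> int:
        """Clock in one sample and return the registered output P(n)."""
        sample = check_signed18(sample)
        taps = len(self.coeffs)
        new_regs = []
        for k in range(taps):
            # The last tap has no upstream stage, so its chain input is zero.
            chain_in = self.regs[k + 1] if k + 1 < taps else 0
            new_regs.append(wrap48(self.coeffs[k] * sample + chain_in))
        self.regs = new_regs
        return self.regs[0]


# 16 taps with illustrative coefficients 1..16; an impulse at the input
# should play the coefficients back at the output, one per clock.
coeffs = list(range(1, 17))
fir = TransposedFir(coeffs)
impulse = [1] + [0] * 15
response = [fir.step(s) for s in impulse]
print(response)
```

Because the same sample is broadcast to all taps, the model also makes the fan-out concern visible: `sample` is read by every stage in each call to `step`.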

Precision Synthesis supports the Verilog 2001 generate statement, which you can use to generate this MULT-ADD operator as shown by the coding example (Figure 3) and technology schematic (Figure 5). The pros to using the transposed FIR filter far outweigh the cons.

Advantages:

  • Low latency – the maximum latency never exceeds the pipelining time through the slice containing the first coefficient. Typically, this is three clock cycles from the time data is input to the displayed result.
  • Efficient mapping to the DSP48 slice – mapping is enabled by the adder chain structure of the transposed FIR filter.
  • No external logic – no external FPGA fabric is required, enabling you to achieve the highest possible performance.

Disadvantages:

  • Performance may be limited by a high fan-out input signal if a large number of taps exist.

Mapping Beyond Multipliers
Precision Synthesis automatically infers and maps all multiplier and arithmetic operators into DSP48 where possible. In DSP designs, however, critical data paths do not always run through multipliers. Adders, subtractors, counters, and other operators could also be the source of the critical timing path. To help deal with these situations, Precision Synthesis lets you control the mapping of each arithmetic operator individually. These features deliver two main advantages:

  • An extra boost in solving timing problems when necessary.
  • DSP mapping controllability when desired because of resource availability and timing requirements.

Figure 6 shows the RTL schematic of a DEC operator. Precision Synthesis gives you the option to map the following operators into a DSP48 slice:

  • Adder/subtractor/addsub – default to CARRY CHAIN and/or LOGIC
  • INC/DEC/INCDEC – default to CARRY CHAIN and/or LOGIC
  • EQ/NEQ/LT/LTE/GT/GTE – default to CARRY CHAIN and/or LOGIC
  • Counter – default to CARRY CHAIN and/or LOGIC
  • Mult/mult-add/mult-acc – default to DSP48

You can map the DEC operator into a DSP48 slice by using the Precision Synthesis GUI (Figure 7) or interactive command line as follows:

  • From the RTL hierarchical browser, right-click on the DEC operator. Select Set Attributes > New.
  • Enter attribute name and value.
  • You can use the command line to accomplish the same task: set_attribute -design rtl -name use_resource -value DSP48 -instance rtlc_1_dec_0

Conclusion
For MULT-ACC with adder/subtractor and multiple stages of internal input pipeline registers, the Xilinx Virtex-4 architecture has specific advantages over other vendors’ architectures. For wide multipliers without accumulators, you may find that DSP48 in Virtex-4 devices and other vendors’ DSP blocks compete more closely in terms of performance. But regardless of your design application, it would be beneficial to test-drive the DSP48 functions in tandem with the Mentor Graphics Precision Synthesis tool for your next design project.

By supporting advanced DSP48 inferencing capabilities, Precision Synthesis makes it possible to easily model DSP RTL behavior at a very high level. It also provides added flexibility with RTL coding styles. The DSP48 inference capability in Precision Synthesis, combined with the advanced DSP48 slice in Virtex-4 FPGAs, provides a powerful system solution if you are considering designing in DSP with today’s advanced FPGA architectures.

© 1994-2008 Xilinx, Inc. All Rights Reserved.