Optimizing Virtex-4 High-Performance Designs



by Carlos Abraham, FPGA Synthesis CAE, Synopsys, Inc.
carlos.abraham@synopsys.com
and
Yanbing Li, Corporate Applications Engineering Manager, Synopsys, Inc.
yanbing.li@synopsys.com
(1/15/05)


Synopsys Design Compiler FPGA can take your high-speed design to the next level of performance.


Synopsys® Design Compiler® FPGA (DC FPGA) allows you to meet your high-performance design goals by using a powerful set of optimization algorithms and features specifically tuned for the Xilinx® Virtex-4™ architecture. These algorithms use special Virtex-4 resources such as the DSP48 block and block RAM to achieve the lowest overall area utilization and the optimal circuit timing performance.

Design Compiler FPGA Overview
Designs that target complex devices such as Virtex-4 FPGAs require the same power and flexibility in synthesis that only ASIC designers had access to in the past. DC FPGA is built on Design Compiler’s industry-leading ASIC synthesis technology and then customized to include FPGA-specific optimizations to handle even the most challenging designs. These optimizations enable optimal mapping to basic FPGA primitives such as LUTs and to complex components such as RAM, multipliers, and DSP blocks.

DC FPGA includes innovative Adaptive Optimization™ (AO) technology, which dynamically tunes the synthesis algorithms based on the design context and timing constraints to provide faster synthesis runtime and optimal timing. DC FPGA inherits Design Compiler’s reliability, proven through the development of more than 125,000 ASIC designs, and brings that ASIC-strength synthesis to FPGA designs.

In addition to AO technology, DC FPGA deploys a rich set of optimizations to achieve the best timing Quality of Results (QoR) for FPGA devices. These include:

  • Constraint-driven synthesis and design space exploration
  • Automatic finite state machine (FSM) extraction and optimization
  • Automatic inference of special FPGA resources, such as RAM, ROM, multipliers, DSP blocks, shift registers, and global clock buffers
  • Advanced datapath optimizations and module generation
  • Logic and register duplication
  • Register retiming and pipelining
  • Critical path re-synthesis
  • Across-boundary optimization
  • Automatic gated-clock transformation

DC FPGA is part of a family of products from Synopsys that work in conjunction with the Xilinx ISE™ tool to streamline the FPGA design process. In this article, we’ll show how DC FPGA optimizes for high performance in Xilinx Virtex-4 FPGAs.

Constraint-Driven Synthesis
DC FPGA uses a true timing-driven synthesis engine. You can greatly influence the final implementation choice by specifying appropriate timing and design-specific constraints during synthesis. Therefore, we recommend that you drive DC FPGA synthesis with the same set of constraints as the Xilinx ISE tool.

At a minimum, you should specify appropriate design timing constraints such as clock frequency, I/O offsets, and any timing exceptions applicable to your design (such as multicycle and false paths). You can also specify other design-specific constraints, such as those controlling special FPGA resource usage. For best results, do not overconstrain your design; overconstraining can in some cases lead to unnecessary increases in area.

Without any timing constraints, DC FPGA will perform area-based optimizations with good timing results. With proper timing constraints, DC FPGA applies the AO technology to explore the area-timing tradeoffs of various optimizations, selecting the final implementation that best fits your constraints.

For example, your timing goals enable DC FPGA to decide whether distributed RAM, block RAM, or a LUT- and register-based implementation is sufficient for an inferred memory component in your design. Without timing goals, DC FPGA optimizes for the lowest area utilization possible.
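
As an illustration of the kind of memory description this decision applies to, here is a minimal sketch of a generic single-port RAM in the same Verilog style as the listings later in this article. The module name, parameter names, and sizes are hypothetical, and the article does not state which coding patterns DC FPGA recognizes or which resource it would select for this one; the sketch only shows a commonly used inference-friendly pattern with a synchronous write and a registered read.

```verilog
// Hypothetical example: generic single-port RAM description.
// Names and sizes are illustrative and not taken from the article.
module ram_sketch ( dout, din, addr, we, clk );
   parameter DATA_WIDTH = 16;
   parameter ADDR_WIDTH = 9;

   output [DATA_WIDTH-1:0] dout;
   input  [DATA_WIDTH-1:0] din;
   input  [ADDR_WIDTH-1:0] addr;
   input                   we, clk;

   // Storage array: 2**ADDR_WIDTH words of DATA_WIDTH bits
   reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];
   reg [DATA_WIDTH-1:0] dout;

   always @( posedge clk )
   begin
      if (we)
         mem[addr] <= din;   // synchronous write
      dout <= mem[addr];     // registered read (returns the old word on a write)
   end
endmodule
```

Because both the write and the read are clocked, a description like this leaves the synthesis tool free to choose among memory resources; an asynchronous read would generally rule out a block RAM mapping.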

Table 1 shows two implementations of a small sub-module under two different clock constraints. The module is the timing-critical one in a larger design of about 8,600 slices. The design contains a single clock domain, and only one clock period constraint is specified in DC FPGA.

In the first case, the module is constrained at 10 ns. DC FPGA already exceeds the timing requirement with its area-based implementation and does not invoke the timing optimization phase. The critical path of the design runs through a chain of carry logic.

In the second case, when a much tighter constraint (3 ns) is applied, DC FPGA performs aggressive timing optimizations and replaces the carry logic on its critical paths with parallel circuit structures built from LUTs. The result is a design that has a slightly larger area but meets the new timing requirement, which was impossible to achieve with the carry logic structure. At the overall design level, a 29% timing improvement is achieved with a minor area increase of 11 slices.

Flexible FSM Support
DC FPGA contains sophisticated FSM extraction and optimization algorithms to ensure an optimal high-performance implementation of state logic. Once the FSM is detected and extracted from the RTL code, DC FPGA’s state machine optimization engine applies various optimization schemes, such as optimizing unreachable states or removing duplicate states, to produce the best logic implementation to meet timing.

At the same time, you have the flexibility to select a different FSM encoding style, such as one-hot, binary, Gray, or zero-one-hot, for an individual state machine, for a design, or globally. This FSM encoding exploration flexibility allows you to customize the synthesis script to address design bottlenecks.

For an FPGA implementation, one-hot state implementations typically provide the best timing QoR for most designs at the expense of a higher register-to-LUT ratio. For most designs this is not a problem because of the register-rich architecture of FPGA devices.
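
The following is a minimal sketch of a state machine written with symbolic state names, which is the usual way to leave the final encoding open to a synthesis tool. The module, signal, and state names are hypothetical, and the article does not specify which RTL styles DC FPGA's FSM extraction recognizes, so treat this as a generic pattern and not as a documented requirement.

```verilog
// Hypothetical example: FSM with symbolic state names.
// The binary values below are only the RTL starting point; a synthesis
// tool that extracts the FSM may re-encode the states (for example, one-hot).
module fsm_sketch ( done, start, ready, clk, rst );
   output done;
   input  start, ready, clk, rst;

   parameter ST_IDLE   = 2'd0,
             ST_LOAD   = 2'd1,
             ST_RUN    = 2'd2,
             ST_FINISH = 2'd3;

   reg [1:0] state, next_state;

   // State register with synchronous reset
   always @( posedge clk )
   begin
      if (rst)
         state <= ST_IDLE;
      else
         state <= next_state;
   end

   // Next-state logic; the default assignment avoids inferred latches
   always @( state or start or ready )
   begin
      next_state = state;
      case (state)
         ST_IDLE:   if (start) next_state = ST_LOAD;
         ST_LOAD:   next_state = ST_RUN;
         ST_RUN:    if (ready) next_state = ST_FINISH;
         ST_FINISH: next_state = ST_IDLE;
         default:   next_state = ST_IDLE;
      endcase
   end

   assign done = (state == ST_FINISH);
endmodule
```

With a one-hot encoding, this four-state machine would use four state registers instead of two, which is the register-to-LUT tradeoff described above.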

High-Performance DSP Inference Capability
The availability of special FPGA resources such as block RAM, dedicated DSP slices, and carry logic, combined with your specified design and timing constraints, guides DC FPGA’s specialized optimization algorithms in determining the optimal circuit implementation.

DC FPGA is highly capable of inferring complex circuit topology from your design’s RTL coding structure, effectively deciding the final implementation that best exploits the resources of the targeted FPGA. DC FPGA minimizes overall resource usage while providing the best circuit performance possible.

This optimization feature allows DC FPGA to infer complex logic configurations and map them into special resources such as the Virtex-4 dedicated DSP48 slice. As an illustration, Figure 1 shows a simple multiply-accumulate (MAC) logic structure in which the registered A and B input signals are multiplied. The registered intermediate multiplier output is then accumulated in the last adder stage, which feeds the registered Q output signal.

The RTL code for this simple MAC function is:

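   // Multiply-accumulate: Q accumulates the registered product of A and B.
   // Q has no reset in this listing, so in simulation it stays unknown
   // unless it is initialized.
   module test ( Q, A, B, clk );
      output [47:0] Q;
      input  [16:0] A, B;
      input         clk;

      reg [47:0] Q;              // accumulator output register
      reg [16:0] A_reg, B_reg;   // input pipeline registers
      reg [33:0] mult;           // registered 17 x 17 product

      always @( posedge clk )
      begin
         A_reg <= A;
         B_reg <= B;
         mult  <= A_reg * B_reg;
         Q     <= Q + mult;
      end
   endmodule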

DC FPGA is able to implement the logic configuration shown in Figure 1 in a single DSP48 slice, recognizing and taking advantage of the DSP48’s embedded 18 x 18 signed multiplier, its accumulate adder mode, and its integrated pipeline registers to obtain the highest system clock speed.

Figure 2 shows the final DC FPGA implementation, which uses a single DSP48 and no other logic resources. The OPMODE control input of the DSP48 element is set to “0100101” to realize the MAC functionality intended by the circuit topology, while the AREG, BREG, MREG, and PREG attributes are each set to “1” to select a single register stage at each of those pipeline points.
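
To see why this OPMODE value corresponds to a MAC, it helps to split it into its multiplexer select fields. The simulation-only sketch below assumes the DSP48 field layout described in the Xilinx documentation (Z select in bits 6:4, Y select in bits 3:2, X select in bits 1:0), which the article itself does not spell out; the module and signal names are hypothetical.

```verilog
// Hypothetical, simulation-only sketch: split the OPMODE value from the
// article into its X, Y, and Z multiplexer select fields.
// Assumed field layout: Z = OPMODE[6:4], Y = OPMODE[3:2], X = OPMODE[1:0].
module opmode_fields;
   parameter [6:0] OPMODE = 7'b0100101;

   wire [2:0] z_sel = OPMODE[6:4];   // 010: feed back the P output register
   wire [1:0] y_sel = OPMODE[3:2];   // 01 : multiplier partial product
   wire [1:0] x_sel = OPMODE[1:0];   // 01 : multiplier partial product

   initial
      #1 $display("Z=%b Y=%b X=%b", z_sel, y_sel, x_sel);
      // prints: Z=010 Y=01 X=01
endmodule
```

Under that assumption, the X and Y fields together route the multiplier product into the adder and the Z field routes the P register back as the other adder operand, which gives P = P + A x B.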

Furthermore, the high-performance DSP inference feature in DC FPGA supports very complex design topologies. Such topologies are used extensively in DSP-intensive designs such as the digital FIR filters commonly found in wireless communication applications.

Figure 3 shows the schematic of a four-tap systolic FIR digital filter structure. DC FPGA uses advanced DSP inference to implement this design in only four DSP48 slices without the use of external logic resources. The integrated pipeline registers are further exploited to raise the clock rate, and therefore the throughput, of this type of filter structure.

The following shows the RTL code for the systolic FIR filter:

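   // Four-tap systolic FIR filter: each tap multiplies a delayed sample by
   // its coefficient and adds the result to the previous tap's partial sum.
   module test ( Yn, Xn, h0, h1, h2, h3, clk );
      output [47:0] Yn;
      input  [15:0] Xn, h0, h1, h2, h3;
      input         clk;

      reg  [15:0] X [7:1];       // sample delay line
      wire [15:0] h [3:0];       // filter coefficients
      reg  [32:0] mult [3:0];    // registered products
      reg  [47:0] pcout [3:0];   // registered partial sums
      wire [47:0] Yn;
      integer i;

      assign h[3] = h3, h[2] = h2, h[1] = h1, h[0] = h0;

      always @( posedge clk )
      begin
         // Tap 0: no incoming partial sum
         X[1]     <= Xn;
         mult[0]  <= h[0] * X[1];
         pcout[0] <= mult[0];

         // Taps 1 to 3: two additional sample delays per tap
         for (i=1; i <= 3; i=i+1)
         begin: my_for_loop_block0
            X[2*i]   <= X[2*i-1];
            X[2*i+1] <= X[2*i];
            mult[i]  <= h[i] * X[2*i+1];
            pcout[i] <= pcout[i-1] + mult[i];
         end //my_for_loop_block0
      end

      assign Yn = pcout[3];

   endmodule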
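
One way to check the FIR listing above before synthesis is to simulate it with a single-sample impulse. The testbench below is a minimal sketch that is not part of the original article: it assumes the FIR module named test is compiled alongside it, and the testbench name, coefficient values, and cycle counts are hypothetical choices made only for illustration.

```verilog
// Hypothetical testbench sketch for the FIR listing (module "test").
// Coefficient values are arbitrary and chosen to be easy to tell apart.
`timescale 1ns/1ps
module fir_tb;
   reg         clk = 1'b0;
   reg  [15:0] Xn  = 16'd0;
   wire [47:0] Yn;
   integer     cycle;

   test dut ( .Yn(Yn), .Xn(Xn),
              .h0(16'd1), .h1(16'd2), .h2(16'd3), .h3(16'd4),
              .clk(clk) );

   always #5 clk = ~clk;   // 10 ns clock period

   initial
   begin
      // The listing has no reset, so clock in zeros first to flush
      // unknown values out of the delay line and partial sums.
      repeat (10) @(negedge clk);

      // Apply a one-cycle impulse, then return the input to zero.
      Xn = 16'd1;
      @(negedge clk);
      Xn = 16'd0;

      // Print the output for enough cycles to cover the pipeline latency.
      for (cycle = 0; cycle < 12; cycle = cycle + 1)
      begin
         @(negedge clk);
         $display("cycle %0d: Yn = %0d", cycle, Yn);
      end
      $finish;
   end
endmodule
```

If the listing behaves as a four-tap filter, the four coefficient values should appear on Yn in order on successive cycles once the impulse has passed through the pipeline, with zeros before and after.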

DC FPGA can also implement other complex logic configurations in a DSP48 slice. Table 2 shows a sample of some of these complex logic structures.

The designs shown in Table 2 were synthesized using DC FPGA and placed and routed using Xilinx ISE 6.3i Service Pack 2, targeting an XC4VFX20-11 Virtex-4 device. The purpose of this exercise is to show the performance and area improvements achieved by DC FPGA’s advanced DSP inference capability. Each design was synthesized with and without DSP inference enabled.

Conclusion
Complex devices such as Virtex-4 require a flexible ASIC-strength synthesis solution. The advanced optimization engine in Synopsys Design Compiler FPGA efficiently utilizes the special resources available in Virtex-4 devices to provide the highest performance design possible.

DC FPGA gives you the freedom to modify synthesis scripts to address design bottlenecks, to implement different FSM encoding styles, or to explore other design optimizations to reach your design goals. Now you have access to the power and flexibility of Design Compiler to implement your complex FPGA designs.

DC FPGA is an integral part of the complete ASIC-strength prototyping solution from Synopsys. Other tools supported in the Xilinx flow are Formality™ for formal verification, DesignWare® Library IP, Leda® for RTL design and code checking, PrimeTime® for static timing analysis, VCS® for simulation, Module Compiler™ for datapath synthesis, and HSPICE™ for analysis of multigigabit serial I/Os.

DC FPGA has a rapidly growing base of more than 100 customers. For more information about Design Compiler FPGA, visit www.synopsys.com/products/dcfpga/dcfpga.html.
