HDL Coding Practices to Accelerate Design Performance



by Philippe Garrault, Technical Marketing Engineer, Xilinx, Inc.
philippe.garrault@xilinx.com
and
Brian Philofsky, Technical Marketing Engineer, Xilinx, Inc.
brian.philofsky@xilinx.com (12/1/05)


Small code changes can make a big difference.


You can achieve increases in design performance by selecting the right hardware platform and silicon features, being familiar with the device architecture, or having the proper settings and features in your implementation tools. But one of the most overlooked ways to increase design performance is to write HDL code that is very efficient for the targeted device. In this article, we’ll present coding style tips to accelerate design performance.

Use of Resets and Performance
Few system-wide choices have as profound an effect on performance, area, and power as the choice of reset. Some system architects specify a global asynchronous reset for the system. Whether or not it is truly needed, the ramifications of this choice are not always understood. With Xilinx® FPGA architecture, the use and type of reset can have serious effects on the performance of your code.

SRLs
In all current Xilinx FPGA architectures, LUT (look-up table) elements are configurable as logic, ROM/RAM, or a shift register (SRL, or shift register LUT). Synthesis tools can infer any one of these structures from RTL code. However, for the LUT to be used as a shift register, the code must not describe a reset, because the SRL does not have one. This means that shift registers coded with resets result in a suboptimal implementation (requiring several flip-flops and the associated routing between them), while code without resets results in a fast and compact implementation (using SRLs).
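As a minimal sketch of the reset-free style described above, the following delay line has no reset branch, so a synthesis tool is free to map it onto an SRL. The module name, port names, and the depth of 16 are assumptions made for illustration; whether an SRL is actually inferred depends on the tool and its settings.

```verilog
// Hypothetical example: a 1-bit delay line written without a reset.
// All names and the default depth are illustrative assumptions.
module delay_line_srl #(
    parameter DEPTH = 16            // must be at least 2 for the slice below
) (
    input  wire clk,
    input  wire ce,                 // clock enable
    input  wire din,
    output wire dout
);
    reg [DEPTH-1:0] shift_reg;

    // No reset is described: each stage simply takes the value of the
    // previous stage, which is all a LUT-based shift register can do.
    always @(posedge clk) begin
        if (ce)
            shift_reg <= {shift_reg[DEPTH-2:0], din};
    end

    assign dout = shift_reg[DEPTH-1];
endmodule
```

Adding an `if (rst) shift_reg <= 0;` branch to the same process would describe a reset on every stage, which is the case the text above says forces a flip-flop implementation.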

For these two cases the effect on area and power is obvious, but the effect on performance is less clear. In general, a shift register built out of flip-flops is not going to be the critical path in a design, because the timing path between its registers is not normally long enough to be the longest path in the design. However, the added consumption of resources (flip-flops and routing) can have a negative influence on the placement and routing choices for other portions of the design, possibly resulting in longer routing paths.

Dedicated Multipliers and RAM Blocks
Multipliers are generally thought of for DSP designs. But because Xilinx FPGA architectures contain dedicated resources for multiplication, multipliers can be found in many types of designs, performing multiplication as well as other functions. Similarly, virtually every FPGA design uses RAMs of various sizes, regardless of the application.

Xilinx FPGAs contain several block RAM elements that can be used in a design as RAM, ROM, a large LUT, or even general logic. Using both multiplier and RAM resources can result in more compact and higher performing designs, but the reset can have either a positive or a negative performance impact, depending on its type. Both RAM and multiplier blocks contain only synchronous resets; thus, if an asynchronous reset is coded for these functions, the registers within these blocks cannot be used. The effect on performance can be severe. For example, a fully pipelined multiplier targeting Virtex™-4 devices and coded with an asynchronous reset can reach 200 MHz. Changing the code to a synchronous reset can more than double design performance, to 500 MHz.

The issues with RAMs are twofold. First, as with the multipliers, Virtex-4 block RAMs have optional output registers which, when used, can reduce the clock-to-out times of the RAMs and increase overall design speed. These registers offer synchronous resets but not asynchronous resets, and thus cannot be used if the code describes registers with an asynchronous reset.
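As a sketch of the multiplier case described above, the following module registers the inputs, the product, and the output, and resets all of them synchronously. The module name, port names, and the 18 x 18 operand width are assumptions chosen for illustration; how the registers are actually packed into a dedicated multiplier block is up to the synthesis tool.

```verilog
// Hypothetical example: pipelined multiplier with a synchronous reset.
// Names and widths are illustrative assumptions.
module mult18_sync_reset (
    input  wire               clk,
    input  wire               rst,      // synchronous, active high
    input  wire signed [17:0] a,
    input  wire signed [17:0] b,
    output reg  signed [35:0] p
);
    reg signed [17:0] a_r, b_r;         // input registers
    reg signed [35:0] m_r;              // product register

    // rst is sampled only on the clock edge, so it is not listed in the
    // sensitivity list. This matches the reset type the text says the
    // dedicated blocks provide.
    always @(posedge clk) begin
        if (rst) begin
            a_r <= 18'sd0;
            b_r <= 18'sd0;
            m_r <= 36'sd0;
            p   <= 36'sd0;
        end else begin
            a_r <= a;
            b_r <= b;
            m_r <= a_r * b_r;
            p   <= m_r;
        end
    end
endmodule
```

Writing the process as `always @(posedge clk or posedge rst)` would instead describe an asynchronous reset, which is the form the text warns keeps these registers out of the dedicated block.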

A secondary issue comes to light when using the RAMs as a LUT or general logic. At times, it is advantageous for both area and performance reasons to condense several LUTs configured as ROM or general logic into a single block RAM. This can be done either by manually specifying these structures or by letting the tools automatically map portions of the logical design to unused block RAM resources. Because the block RAM has a synchronous reset, general logic can be mapped this way without changing the specified functionality of the design, provided a synchronous reset (or no reset) is used. If an asynchronous reset is described, this is not possible.

General Logic
Probably the least-known effect of asynchronous resets is on general logic structures. Because all Xilinx FPGA general-purpose registers can have their set/reset programmed as either asynchronous or synchronous, you might think that there is no penalty for using asynchronous resets. That assumption is often wrong. If an asynchronous reset is not used, the set/reset logic can be configured as synchronous logic, which frees up added resources for logic optimization. To illustrate how asynchronous resets can inhibit optimization, let’s look at the following suboptimal code examples:
VHDL Example #1 and Verilog Example #1

To implement this code, the synthesis tool has no choice but to infer two LUTs for the data path, because there are five signals used to create this logic. A possible implementation of the above code would look like Figure 1.

If, however, this same code is rewritten for a synchronous reset, area is reduced and performance improves, as in the following corrected examples:
VHDL Example #2 and Verilog Example #2

The synthesis tool now has more flexibility as to how this function can exist. A possible implementation of the preceding code would look like Figure 2.

In this implementation, the synthesis tool can identify that any time A is active high, Q is always a logic one (the OR function). With the register's set/reset now configured as a synchronous operation, the set is free to be used as part of the synchronous data path. This reduces the amount of logic necessary to implement the function and shortens the data path delays for the D and E signals compared with the previous example. Logic could also have been shifted to the reset side, had the code been written in a way that made that the more beneficial implementation. Consider the following addition to these examples:
VHDL Example #3 and Verilog Example #3

Now that there are eight signals that contribute to the logic function, a minimum of three LUTs would be needed to implement this function. A possible implementation of the above code would look like Figure 3. If the same code is written with a synchronous reset:
VHDL Example #4 and Verilog Example #4

A possible implementation of the above code would look like Figure 4. Again, the resulting implementation not only uses fewer LUTs to implement the same logic function, but also could potentially result in a faster design because of the reduction of logic levels for practically every signal that creates this function.

These examples are simple, but they illustrate how asynchronous resets force all synchronous data signals onto the data input of the register, possibly resulting in more logic levels and a less optimal implementation. In general, the more signals that fan into a logic function, the more effective synchronous sets/resets (or no resets at all) are at minimizing logic resources or maximizing design performance.
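As a hedged illustration of the asynchronous versus synchronous comparison above, the pair of modules below assumes a hypothetical five-input function, Q = A or (B and C and D and E). The function, module names, and signal names are invented here for illustration and are not the article's original listings; the mapping described in the comments is one a tool may choose, not a guaranteed result.

```verilog
// Hypothetical example, asynchronous reset: rst is in the sensitivity list,
// so the register's set/reset is taken by the asynchronous reset and all
// five data signals must be combined in LUTs ahead of the D input.
module q_async_reset (
    input  wire clk, rst, a, b, c, d, e,
    output reg  q
);
    always @(posedge clk or posedge rst) begin
        if (rst)
            q <= 1'b0;
        else
            q <= a | (b & c & d & e);
    end
endmodule

// Hypothetical example, synchronous reset: the same function, but rst is
// sampled on the clock edge. A tool may now place `a` on the register's
// synchronous set, leaving a four-input function on the D input.
module q_sync_reset (
    input  wire clk, rst, a, b, c, d, e,
    output reg  q
);
    always @(posedge clk) begin
        if (rst)
            q <= 1'b0;
        else
            q <= a | (b & c & d & e);
    end
endmodule
```

The only source difference is the sensitivity list, yet it decides whether the set/reset logic of the register is available to the data path.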

Adder Chains Instead of Adder Trees
Many signal processing algorithms perform an arithmetic operation on an input stream of samples, followed by a summation of all outputs of this arithmetic operation. The adder tree structure is typically used to implement the summation in parallel architectures such as FPGAs.

One difficulty with the adder tree concept is the varying nature of its size. The number of adders is dependent on the number of inputs in the adder tree. The more inputs in the adder tree, the more adders you need, which increases both the number of logic resources and power consumption. Larger trees also mean larger adders in the last stages of the tree, which further reduces system performance.

To reduce power consumption and maintain high performance, adder trees should be implemented as dedicated silicon resources. But placing a number of fixed-size adder tree components in silicon is not efficient because you would have to use logic resources when the fixed number of additions is exceeded or even go to a larger FPGA, thereby increasing the cost of the device.

With its columns of DSP48 dedicated silicon, the Virtex-4 device family takes a different approach in implementing summations. It involves computing the summation incrementally using chained adders instead of adder trees. This approach is a departure from any existing FPGA and is key to maximizing performance and lowering power for DSP algorithms because both logic and interconnect are contained entirely within the dedicated silicon.

When pipelined, performance of the DSP48 block is 500 MHz – independent of the number of adders. As illustrated in Figure 5, cascading ports combined with the 48-bit resolution of the adder/accumulator allow computing of the current sample calculation, along with the summation of all computed samples so far.

To take advantage of the Virtex-4 adder chain structure in the RTL, simply replace the adder tree description with an adder chain description. This process of converting a direct form filter to a transposed or systolic form is detailed in the XtremeDSP Design Considerations User Guide. Once the conversion is complete, you may find that the algorithm runs much faster than your application needs. In that case, you could further reduce device utilization and power consumption by using either folding or multi-channeling techniques. Both techniques help implement designs in smaller devices or allow you to add functionality to a design using the freed resources.
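As a rough sketch of an adder chain description, the following transposed-form FIR filter computes one multiply-add per stage and registers each partial sum before passing it to the next stage. The module name, parameters, and the packed coefficient port are assumptions made for illustration; the 48-bit accumulator width follows the adder resolution mentioned above, and whether each stage lands in a DSP48 block depends on the synthesis tool.

```verilog
// Hypothetical example: transposed-form FIR written as an adder chain.
// Names, widths, and the tap count are illustrative assumptions.
module fir_adder_chain #(
    parameter TAPS = 4,   // number of taps
    parameter DW   = 18,  // sample width
    parameter CW   = 18,  // coefficient width
    parameter AW   = 48   // partial-sum width
) (
    input  wire                 clk,
    input  wire signed [DW-1:0] x,
    input  wire [TAPS*CW-1:0]   coeffs,  // packed; stage 0 in the low bits
    output wire signed [AW-1:0] y
);
    reg signed [AW-1:0] acc [0:TAPS-1];
    integer i;

    // Each stage adds its own product to the registered sum of the previous
    // stage, so every path contains exactly one multiplier and one adder.
    always @(posedge clk) begin
        acc[0] <= $signed(coeffs[0 +: CW]) * x;
        for (i = 1; i < TAPS; i = i + 1)
            acc[i] <= acc[i-1] + $signed(coeffs[i*CW +: CW]) * x;
    end

    assign y = acc[TAPS-1];
endmodule
```

In this arrangement the coefficient in stage i multiplies the sample that arrived TAPS-1-i cycles before the newest one, so the coefficients are packed in reverse order relative to the impulse response. Adding taps lengthens the chain but leaves the logic between any two registers unchanged.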

Multi-channeling is a process that leverages very fast math elements across multiple input streams (channels) with much lower sample rates. This technique increases silicon efficiency by a factor almost equal to the number of channels. Multi-channel filtering can be looked at as time-multiplexing single-channel filters. For example, in a typical multi-channel filtering scenario with eight input channels, each channel is filtered using a separate digital filter. Taking advantage of the Virtex-4 DSP48 block, you could instead use a single digital filter to filter all eight input channels by clocking the single filter with an 8x clock. This reduces the number of FPGA resources needed by almost 8x.
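One way to sketch the time-multiplexing idea is a transposed-form filter in which each partial sum is delayed by as many clock cycles as there are channels, so that it meets the next sample of the same channel. This is an illustrative assumption about how such a filter could be written, not the structure described in the user guide; all names, widths, and the interleaved input format are hypothetical.

```verilog
// Hypothetical example: one FIR filter shared by CH interleaved channels.
// clk runs at CH times the per-channel sample rate, and x carries the
// channels in a fixed repeating order (ch0, ch1, ..., ch0, ch1, ...).
module fir_multichannel #(
    parameter TAPS = 4,
    parameter CH   = 8,   // number of interleaved channels
    parameter DW   = 18,
    parameter CW   = 18,
    parameter AW   = 48
) (
    input  wire                 clk,
    input  wire signed [DW-1:0] x,
    input  wire [TAPS*CW-1:0]   coeffs,  // packed; stage 0 in the low bits
    output wire signed [AW-1:0] y        // results, interleaved like x
);
    // dly[t*CH + k] is register k of the CH-deep delay that follows stage t.
    reg signed [AW-1:0] dly [0:TAPS*CH-1];
    integer t, k;

    always @(posedge clk) begin
        // Multiply-add of each stage feeds the first register of its delay.
        dly[0] <= $signed(coeffs[0 +: CW]) * x;
        for (t = 1; t < TAPS; t = t + 1)
            dly[t*CH] <= dly[t*CH-1] + $signed(coeffs[t*CW +: CW]) * x;

        // The remaining CH-1 registers of each delay just shift.
        for (t = 0; t < TAPS; t = t + 1)
            for (k = 1; k < CH; k = k + 1)
                dly[t*CH+k] <= dly[t*CH+k-1];
    end

    assign y = dly[TAPS*CH-1];
endmodule
```

With CH set to 1 the delays collapse to a single register per stage and the module reduces to an ordinary single-channel adder chain. The multipliers and adders are shared by all channels; only the delay storage grows with the channel count.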

Maximize Block RAM Performance
When inferring memory elements, factors affecting performance include:

  • using dedicated blocks or distributed RAMs
  • using the output pipeline register
  • not using asynchronous resets

Two lesser-known areas, HDL coding style and synthesis tool settings, can also substantially impact memory performance.

HDL Coding Style
When inferring dual-port block memories, it is possible that both ports could try to access the same memory cell at the same time. If both ports simultaneously write different values to the same memory cell, this creates a collision and the memory cell content cannot be guaranteed. But what happens if one port reads while the other port is writing to the same address? Well, it depends on the target device. The latest Virtex and Spartan™ families have three programmable operating modes to govern memory output while a write operation is occurring. Additional information about these operating modes is provided in the device user guides.

Note that the different modes affect how the memory outputs behave and also affect the performance of the memory. As illustrated in the following example, your coding style determines in which mode the memory is operating:
Code Style Example
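To show how coding style can select the behavior of the output during a write, the sketch below describes a single-port memory in three ways, corresponding to the modes commonly called read-first, write-first, and no-change. The module name, the `MODE` parameter, and the sizes are assumptions made for illustration, and a single port is used only to keep the example short; which mode a tool actually infers should be checked in its report.

```verilog
// Hypothetical example: one memory, three ways to describe the output
// during a write. Names, sizes, and the MODE encoding are assumptions.
module ram_write_modes #(
    parameter AW   = 10,  // address width
    parameter DW   = 16,  // data width
    parameter MODE = 0    // 0: read-first, 1: write-first, 2: no-change
) (
    input  wire          clk,
    input  wire          en,
    input  wire          we,
    input  wire [AW-1:0] addr,
    input  wire [DW-1:0] din,
    output reg  [DW-1:0] dout
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    generate
        if (MODE == 0) begin : g_read_first
            // The output always shows the old contents of the cell.
            always @(posedge clk)
                if (en) begin
                    if (we) mem[addr] <= din;
                    dout <= mem[addr];
                end
        end else if (MODE == 1) begin : g_write_first
            // During a write, the output shows the data being written.
            always @(posedge clk)
                if (en) begin
                    if (we) begin
                        mem[addr] <= din;
                        dout      <= din;
                    end else
                        dout <= mem[addr];
                end
        end else begin : g_no_change
            // During a write, the output keeps its previous value.
            always @(posedge clk)
                if (en) begin
                    if (we) mem[addr] <= din;
                    else    dout <= mem[addr];
                end
        end
    endgenerate
endmodule
```

The three processes differ only in what is assigned to `dout` when `we` is high, which is exactly the distinction the operating modes make.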

Add Pipeline Levels
Another way to increase performance is to restructure long data paths made of several levels of logic, breaking them up over multiple clock cycles. This method allows for a faster clock cycle and increased data throughput, at the expense of latency and pipeline management overhead logic. Because FPGAs are register-rich, the additional registers and overhead logic are usually not an issue.

Because the data now takes several clock cycles to pass through the path, you must take special care in the rest of the design to account for the added latency. The following example presents a coding style to add five levels of registers on the output of a 32 x 32 multiplier. The synthesis tool will move these registers into the pipeline registers available in the Virtex-4 DSP48 block so as to maximize data throughput.
Code Example
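A minimal sketch of this style, under the assumption of signed operands, places a parameterized number of registers after the multiplication and leaves their distribution to the synthesis tool. The module name, port names, and the `STAGES` parameter are hypothetical; the registers are written without a reset so that the tool remains free to move them.

```verilog
// Hypothetical example: 32 x 32 multiplier followed by five register levels.
// Names and the STAGES parameter are illustrative assumptions.
module mult32_pipelined #(
    parameter STAGES = 5            // register levels after the multiply
) (
    input  wire               clk,
    input  wire signed [31:0] a,
    input  wire signed [31:0] b,
    output wire signed [63:0] p
);
    reg signed [63:0] pipe [0:STAGES-1];
    integer i;

    // The product is captured in the first register and then shifted through
    // the remaining ones. The result appears STAGES clock cycles after the
    // operands are presented.
    always @(posedge clk) begin
        pipe[0] <= a * b;
        for (i = 1; i < STAGES; i = i + 1)
            pipe[i] <= pipe[i-1];
    end

    assign p = pipe[STAGES-1];
endmodule
```

Any logic that consumes `p` has to be delayed by the same number of cycles, for example by passing an accompanying valid flag through an equally deep register chain.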

Nests in the Code
Try not to nest too deeply in the code, for example with nested if and case statements. Too many if statements inside other if statements can make lines too long and can inhibit synthesis optimizations. By following this guideline, your code is generally more readable and more portable.

When describing “for-loops” in HDL, it is preferable to place at least one register in the data path, especially when there are arithmetic or other logic-intensive operations. During compilation, the synthesis tool will unroll the loops. Without these synchronous elements, it will concatenate the logic created at each iteration of the loop, resulting in a very long combinatorial path that may limit design performance.

Conclusion
Recent advances in synthesis and place and route algorithms have made achieving the best performance out of a particular device much more straightforward. Synthesis tools are able to infer and map complex arithmetic and memory descriptions onto the dedicated hardware blocks. They will also perform optimizations such as retiming and logic and register replication. Based on timing constraints, the place and route tool can now restructure the netlist and perform timing-driven packing and placement to minimize placement and routing congestion.

However, today (just as yesterday), there is only so much the tools can do to maximize performance. If you need more performance out of your design, a very efficient way to proceed is to learn more about the target device and the synthesis tool, and to use the coding guidelines illustrated in this article.
