VHDL Example #4 and Verilog Example #4
A possible implementation of the above
code would look like Figure 4. Again, the
resulting implementation not only uses
fewer LUTs to implement the same logic
function, but also could potentially result
in a faster design because it reduces the
number of logic levels for practically every
signal that feeds this function.
These examples are simple, but they do
illustrate how asynchronous resets force all
synchronous data signals onto the data input
of the register, possibly resulting in more
logic levels and a less optimal
implementation. In general, the more signals
that fan into a logic function, the more
effective synchronous sets/resets (or no
resets at all) are in minimizing logic
resources or maximizing design performance.
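As a rough sketch of this point (not the article's original Example #4), assume a register fed by a five-input function of the form q = a OR (b AND c AND d AND e). The module and signal names are hypothetical. Because the description has no asynchronous reset, the synthesis tool may move `a` onto the register's synchronous set input, which would leave only the four-input AND in the LUT.

```verilog
// Hypothetical example: names and the logic function are assumptions
// made for illustration, not taken from the article.
module sync_set_sketch (
    input  wire clk,
    input  wire a, b, c, d, e,
    output reg  q
);
    // No asynchronous reset: every term is sampled on the clock edge,
    // so 'a' is a candidate for the register's synchronous set pin.
    always @(posedge clk)
        q <= a | (b & c & d & e);
endmodule
```

Whether the tool actually uses the set pin depends on the synthesis tool and its settings; the description only makes that mapping possible.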
Adder Chains Instead of Adder Trees
Many signal processing algorithms perform
an arithmetic operation on an input stream
of samples, followed by a summation of all
outputs of this arithmetic operation. The adder tree structure is typically used to
implement the summation in parallel
architectures such as FPGAs.
One difficulty with the adder tree concept
is that its size varies. The number of adders
depends on the number of inputs to the tree:
summing N inputs takes N - 1 two-input
adders. The more inputs in the adder tree,
the more adders you need, which increases
both logic resource usage and power
consumption. Larger trees also mean
larger adders in the last stages of
the tree, which further reduces
system performance.
To reduce power consumption
and maintain high performance,
adder trees should be
implemented as dedicated silicon
resources. But placing a number
of fixed-size adder tree components
in silicon is not efficient: whenever a
design exceeds the fixed number of
additions, you would have to fall back on
logic resources or even move to a larger
FPGA, thereby increasing the cost of the
device.
With its columns of DSP48 dedicated
silicon, the Virtex-4 device family takes a
different approach in implementing summations.
It involves computing the summation
incrementally using chained adders
instead of adder trees. This approach is a
departure from any existing FPGA and is
key to maximizing performance and lowering
power for DSP algorithms because
both logic and interconnect are contained
entirely within the dedicated silicon.
When pipelined, performance of the
DSP48 block is 500 MHz – independent
of the number of adders. As illustrated in
Figure 5, cascading ports combined with
the 48-bit resolution of the adder/accumulator
allow each block to compute the current
sample calculation and add it to the
summation of all samples computed so far.
To take advantage of the Virtex-4 adder
chain structure in the RTL, simply replace
the adder tree description with an adder
chain description. This process of converting
a direct form filter to a transposed or
systolic form is detailed in the XtremeDSP
Design Considerations User Guide.
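The shape of an adder chain description can be sketched as follows. This is a minimal illustration, not taken from the user guide: the four-tap length, the 18-bit data width, and the coefficient values H0 to H3 are assumptions chosen for the example. Each stage multiplies the current sample by one coefficient and adds the partial sum registered by the previous stage, so no adder ever has more than two operands.

```verilog
// Sketch of a 4-tap transposed-form FIR filter (adder chain).
// Tap count, widths, and coefficients are illustrative assumptions.
// An adder tree would instead compute (p0 + p1) + (p2 + p3) from
// four products of delayed samples.
module fir_chain (
    input  wire               clk,
    input  wire signed [17:0] x,
    output wire signed [47:0] y
);
    localparam signed [17:0] H0 = 18'sd1;
    localparam signed [17:0] H1 = 18'sd2;
    localparam signed [17:0] H2 = 18'sd3;
    localparam signed [17:0] H3 = 18'sd4;

    // One 48-bit partial-sum register per stage
    reg signed [47:0] s1 = 0, s2 = 0, s3 = 0, s4 = 0;

    always @(posedge clk) begin
        s1 <= x * H3;        // first stage: product only
        s2 <= s1 + x * H2;   // each later stage: previous sum + product
        s3 <= s2 + x * H1;
        s4 <= s3 + x * H0;
    end

    assign y = s4;           // y = H0*x[n] + H1*x[n-1] + H2*x[n-2] + H3*x[n-3]
endmodule
```

The output appears one clock after the newest sample. A description of this shape is intended to suit cascaded multiply-add blocks, though the actual mapping depends on the synthesis tool and its settings.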
Once the conversion is complete, you
may find that the algorithm runs much faster
than your application needs. In that case,
you could further reduce device utilization
and power consumption by using either folding or multi-channeling techniques.
Both techniques help implement designs in
smaller devices or allow you to add functionality
to a design using the freed resources.
Multi-channeling is a process that leverages
very fast math elements across multiple
input streams (channels) with much
lower sample rates. This technique increases
silicon efficiency by a factor almost equal
to the number of channels. Multi-channel
filtering can be looked at as time-multiplexing
single-channel filters. For example,
in a typical multi-channel filtering scenario
with eight input channels, each channel is
filtered by a separate digital filter.
Taking advantage of the Virtex-4
DSP48 block, you could instead use a single
digital filter for all eight input channels
by clocking that filter at eight times the
per-channel sample rate (an 8x clock).
This reduces the number of FPGA
resources needed by almost 8x.
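One way to picture time-multiplexing is to take a single-channel adder chain and stretch every one-sample delay into a delay of CH clock cycles, where CH is the number of interleaved channels. The sketch below assumes a two-tap filter with made-up coefficients and samples that arrive round-robin, one channel per clock; none of these details come from the article.

```verilog
// Sketch of a 2-tap multi-channel filter sharing one arithmetic path.
// CH, the tap count, and the coefficients are illustrative assumptions.
// Input samples are interleaved: ch0, ch1, ..., ch(CH-1), ch0, ...
module fir_chain_mc #(
    parameter CH = 8
) (
    input  wire               clk,
    input  wire signed [17:0] x,
    output reg  signed [47:0] y
);
    localparam signed [17:0] H0 = 18'sd1;
    localparam signed [17:0] H1 = 18'sd2;

    // The single partial-sum register of the one-channel filter
    // becomes a CH-deep delay line.
    reg signed [47:0] dly [0:CH-1];
    integer i;

    initial begin
        y = 0;
        for (i = 0; i < CH; i = i + 1)
            dly[i] = 0;
    end

    always @(posedge clk) begin
        dly[0] <= x * H1;
        for (i = 1; i < CH; i = i + 1)
            dly[i] <= dly[i-1];
        // dly[CH-1] holds H1 times the previous sample of this channel
        y <= dly[CH-1] + x * H0;
    end
endmodule
```

The output stream is interleaved in the same channel order as the input, delayed by one clock. The multiplier and adder are shared by all channels; only the delay storage grows with CH.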
Maximize Block RAM Performance
When inferring memory elements, factors
affecting performance include:
- using dedicated blocks or distributed
RAMs
- using the output pipeline register
- not using asynchronous resets
There are also a couple of lesser known
areas – HDL coding style and synthesis
tool settings – that can substantially
impact memory performance.
HDL Coding Style
When inferring dual-port block memories,
it is possible that both ports could try to
access the same memory cell at the same
time. If both ports simultaneously write
different values to the same memory
cell, a collision occurs and the content of
that cell cannot be guaranteed. But
what happens if one port reads while the
other port is writing at the same address?
Well, it depends on the target device. The
latest Virtex and Spartan™ families have
three programmable operating modes to
govern memory output while a write operation
is occurring. Additional information
about these operating modes is provided in
the device user guides.
Note that the different modes affect
how the memory outputs behave and also
affect the performance of the memory. As
illustrated in the following example, your
coding style determines in which mode the
memory is operating:
Code Style Example
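The article's original listing is not reproduced here. As a stand-in, the sketch below shows three commonly used single-port inference templates, selected by a hypothetical MODE parameter. The names read-first, write-first, and no-change follow the usual Xilinx terminology for the three operating modes, and the 512 x 16 size is an arbitrary choice. The only difference between the variants is what `dout` is assigned during a write.

```verilog
// Sketch only: MODE and its encoding are assumptions of this example.
// 0 = read-first, 1 = write-first, 2 = no-change
module ram_modes #(
    parameter MODE = 0
) (
    input  wire        clk,
    input  wire        we,
    input  wire [8:0]  addr,
    input  wire [15:0] din,
    output reg  [15:0] dout
);
    reg [15:0] mem [0:511];

    generate
        if (MODE == 0) begin : g_read_first
            always @(posedge clk) begin
                if (we)
                    mem[addr] <= din;
                dout <= mem[addr];      // old contents appear during a write
            end
        end else if (MODE == 1) begin : g_write_first
            always @(posedge clk) begin
                if (we) begin
                    mem[addr] <= din;
                    dout <= din;        // new data appears during a write
                end else
                    dout <= mem[addr];
            end
        end else begin : g_no_change
            always @(posedge clk) begin
                if (we)
                    mem[addr] <= din;
                else
                    dout <= mem[addr];  // dout holds its value during a write
            end
        end
    endgenerate
endmodule
```

Whether a given template is recognized and mapped to block RAM in the intended mode depends on the synthesis tool, so the synthesis report is worth checking after any change in style.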
Add Pipeline Levels
Another way to increase performance is to
restructure long data paths made of several
levels of logic, breaking them up over
multiple clock cycles. This method allows
for a faster clock cycle and increased data
throughput, at the expense of latency and
pipeline management overhead logic.
Because FPGAs are register-rich, the additional
registers and overhead logic are usually
not an issue.
Because the data now takes several clock
cycles to reach the output, the rest of the
design must account for the added latency.
The following example presents a coding style
that adds five levels of registers on the
output of a 32 x 32 multiplier. The synthesis
tool will move these registers into the
registers available in the Virtex-4 DSP48
block so as to maximize data throughput.
Code Example
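The original listing is missing from this copy of the article. A minimal sketch of the idea, under the assumption of signed operands and with hypothetical module and signal names, is a multiplier followed by a plain shift of five registers with no resets, which leaves the tool free to retime them.

```verilog
// Sketch: 32 x 32 signed multiplier followed by five register levels.
// Names and the signed interpretation are assumptions of this example.
module mult32_pipe (
    input  wire               clk,
    input  wire signed [31:0] a,
    input  wire signed [31:0] b,
    output wire signed [63:0] p
);
    localparam LEVELS = 5;

    reg signed [63:0] pipe [0:LEVELS-1];
    integer i;

    always @(posedge clk) begin
        pipe[0] <= a * b;                 // first register level
        for (i = 1; i < LEVELS; i = i + 1)
            pipe[i] <= pipe[i-1];         // remaining levels, no reset
    end

    assign p = pipe[LEVELS-1];            // valid five clocks after a, b
endmodule
```

Surrounding logic has to delay any control or side-band signals by the same five clocks so that they stay aligned with `p`.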
Nests in the Code
Try not to make too many nests in the
code, such as nested if and case statements.
If you have too many if statements inside of
other if statements, it can make the line
length too long, as well as inhibit synthesis
optimizations. By following this guideline,
your code is generally more readable and
more portable.
When describing “for-loops” in HDL, it
is preferable to place at least one register in
the data path, especially when the loop
contains arithmetic or other logic-intensive
operations.
During compilation, the synthesis
tool will unroll the loops. Without these
synchronous elements, it will concatenate
logic created at each iteration of the loop,
resulting in a very long combinatorial path
that may limit design performance.
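As an illustration, consider a hypothetical 32-bit population count (the function and all names are assumptions chosen for this sketch). Written as a single loop, it unrolls into a chain of 32 dependent additions. The version below splits the loop per byte and registers each partial count, so the longest unrolled chain is eight additions, followed by a registered final sum.

```verilog
// Sketch: count the ones in a 32-bit word using for-loops with a
// register between the two levels of addition.
module popcount32 (
    input  wire        clk,
    input  wire [31:0] d,
    output reg  [5:0]  count
);
    reg [3:0] part [0:3];   // registered per-byte counts (0 to 8)
    reg [3:0] t;            // combinational temporary inside the loop
    integer i, j;

    always @(posedge clk) begin
        for (i = 0; i < 4; i = i + 1) begin
            t = 0;
            for (j = 0; j < 8; j = j + 1)
                t = t + d[8*i + j];
            part[i] <= t;   // register breaks the unrolled chain
        end
        // Second stage: sum of the registered partial counts
        count <= part[0] + part[1] + part[2] + part[3];
    end
endmodule
```

The result is available two clocks after `d` is presented, which is the latency cost of the shorter combinatorial path.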
Conclusion
Recent advances in synthesis and place and
route algorithms have made achieving the
best performance out of a particular device
much more straightforward. Synthesis
tools are able to infer and map complex
arithmetic and memory descriptions onto
the dedicated hardware blocks. They will
also perform optimizations such as retiming
and logic and register replication.
Based on timing constraints, the place and
route tool can now restructure the netlist
and perform timing-driven packing and
placement to minimize placement and
routing congestion.
However, today (just as yesterday), there
is only so much the tools can do to maximize
performance. If you need more performance
out of your design, a very efficient way to
proceed is to learn more about the target
device and the synthesis tool, and to apply
the coding guidelines illustrated
in this article.
Printable PDF version of this article with graphics.
(12/1/05) 275 KB