By approaching FPGA designs as
three-dimensional endeavors,
you can radically reduce
device size – and cost.
"Performance + Time = Memory" may
sound like an odd formula, but when you
understand it, you can realize significantly
lower implementation costs within Xilinx
FPGAs. In this article, I’ll show you how to
use three-dimensional (3-D) design to
accomplish a 15X reduction in the number
of logic blocks in a sensing application.
Although the formula is vital for DSP
applications, I really like the way it can be
applied to so many other designs. It is particularly
useful for applications that are suited to
the range of Spartan™ devices, where
cost savings are always welcome for high-volume
applications.
But let’s understand the formula first.
2-D Parallel Design
In most hardware designs, we treat the
Xilinx FPGA as a two-dimensional (2-D)
fabric, as shown in Figure 1. Complex logic
blocks (CLBs) provide the logical functions
and blocks of RAM are used for buffers,
such as first-in first-out (FIFO) memories.
The tendency is for a design to become
larger as more functionality is required.
Therefore, it must use larger devices. The
clock speed can often be well below 100
MHz, and many of the functions are
clock-enabled at even lower rates.
Because the cost of a design is proportional
to the size of the device, parallel
implementations, even if well optimized,
will be relatively expensive. They cannot
be avoided where maximum performance
is required.
Applications such as bus interfaces that
need a predefined number of pins and clock
rates are also fundamentally constrained
in the way they can be implemented.
However, when processing functions need
only be completed in a relatively long time
period, such 2-D design is wasteful and
unnecessarily expensive.
A parallel design provides logic for
each and every function that must be implemented. This means that there is
actually a zero requirement for memory,
because a signal (wire) exists for every value
to be calculated. The addition tree example
in Figure 2 shows how the value
“A+B+C+D” is created. Because the value is
immediately applied to the final adder, however,
the value does not need to be stored.
Of course, the parallel implementation
offers the very highest performance. The
adder tree can easily exceed 100 MHz in a
Spartan-II device, which is equivalent to
more than 700 million additions per second.
However, such a structure cannot
benefit from having more time to complete
the required operation, other than consuming
less power if it is clocked slower.
If there is 1 ms available to perform the
addition tree, then it can be clocked at 1
kHz. It will work, but it really is a waste of
the Spartan-II silicon performance potential.
Even worse, the more values that need
to be added, the larger the circuit becomes – and this increases the cost of your product.
Processors Obey 3-D Formula
Now, take a closer look at the familiar world
of processors. A processor is a very good
time-sharing engine. The ALU is directed
to perform many different operations over
many clock cycles to (it is hoped) complete
the desired process in the required time
period. The higher the performance of the
processor, the faster the ALU will be
clocked, and hence, the more that ALU can
be time-shared to achieve the algorithmic
process, as illustrated in Figure 3.
For example, given that a particular
process must be completed in a maximum
time of 1 ms, the number of clock cycles
available for the processor to exploit
depends on the performance:
- A clock speed of 1 MHz provides
1,000 clock cycles per 1 ms.
- A clock speed of 100 MHz provides
100,000 clock cycles per 1 ms.
- A clock speed of 200 MHz provides
200,000 clock cycles per 1 ms.
This is all very obvious, but less obvious
is the direct link this has to memory.
Suppose the available clock cycles are
used by the ALU to perform the trivial task
of summing data values. In a 1 ms time period, a 1 MHz clock rate means that the
processor has the ability to sum 1,000 data
values. It will have to get these values from
somewhere, and that place will be memory.
As the clock increases to 200 MHz, it can
then use the same amount of logic to sum
200,000 data values – and it now needs a
memory to hold 200,000 words.
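The link between clock rate, time, and memory can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the function name cycles_available is my own, and it assumes one data value is summed on every clock cycle.

```python
def cycles_available(clock_hz: float, window_s: float) -> int:
    """Number of clock cycles that fit into one processing window."""
    return round(clock_hz * window_s)


WINDOW_S = 1e-3  # the 1 ms processing period used in the text

for clock_hz in (1e6, 100e6, 200e6):
    cycles = cycles_available(clock_hz, WINDOW_S)
    # Summing one value per cycle means one stored word per cycle.
    print(f"{clock_hz / 1e6:5.0f} MHz: {cycles:7,d} cycles, {cycles:7,d} words")
```

The same logic does more work as the clock rises, and the memory needed to feed it grows in step.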
In a more realistic case, a process tends
to apply multiple instructions to each data
set, so the memory requirement to store
data is not so high; all the same, there is a
very strong relationship.
3-D Sequential Design
Making the decision to operate the logic
functions at a higher rate than the processing
rate allows operations to be achieved
sequentially. As with a processor, logic is
time-shared over multiple clock cycles.
Because "Performance + Time = Memory,"
we also need to use memory to hold all the
values not being used in a given clock
cycle, as well as partial/temporary results
created during the processing. See Figure 4
for a 3-D rendering.
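To make the trade concrete, here is a small Python model that contrasts the two styles. It is an illustration only, not hardware code: parallel_sum counts how many adders a tree would need, and sequential_sum counts how many clock cycles a single time-shared adder would need. The eight input values are an assumption chosen to match the seven adders implied by the 700 million additions per second quoted earlier.

```python
def parallel_sum(values):
    """Adder tree: every pairwise sum gets its own adder, results are wires."""
    adders = 0
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:          # an odd value passes straight through
            nxt.append(level[-1])
        level = nxt
    return level[0], adders


def sequential_sum(memory):
    """One adder, time-shared: one word is read from memory per clock cycle."""
    accumulator = 0
    cycles = 0
    for word in memory:
        accumulator += word
        cycles += 1
    return accumulator, cycles


data = [1, 2, 3, 4, 5, 6, 7, 8]     # hypothetical input values
print(parallel_sum(data))            # result and number of adders
print(sequential_sum(data))          # result and number of clock cycles
```

Both give the same answer. The first spends area (adders), the second spends time (cycles) and needs a memory to hold the values it is not using in a given cycle.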
The FPGA can now be thought of as a
3-D volume to be filled. The best part is
that you just pay for the 2-D fabric being
occupied. The only limits to "building"
upwards are the maximum clock rate of
the device and the amount of RAM available
in a given block. In addition to the
dedicated blocks of RAM, each CLB can
be used to provide distributed RAM,
allowing the correct amount of memory
to be allocated in each position. This prevents
memory access bottlenecks from
forming in your design.
3-D Approach to Design
When any function is implemented, two
basic questions should be asked:
1. How much time is available to
complete the process?
2. Given the performance of the
selected Xilinx device, what clock
rate will be used?
The answer to the first question comes
from the design specification. The way
you partition the design into functions
can have quite an impact, so consider
some alternatives. As to the performance
of Xilinx FPGA devices, this has more to
do with "design comfort" than the actual
peak performance of the devices.
Regarding the second question, I personally
like to see devices clocked above
75 MHz, and I find this relatively easy
to achieve. However, the higher the
clock rate is, the more challenging
the design is. Anything lower than a
50 MHz clock rate is very slow and
wasteful of the performance potential
offered by Xilinx logic devices.
Remember that the embedded DLL
(delay locked loop) and DCM (digital
clock manager) blocks can be used to
create internal clocks of a higher rate
than those available on the PCB.
The answers to the first two questions
will let you know if there is any potential
for time-sharing of logic resources. This
leads to a third and final question:
3. How can the memory resources of
the device be utilized to reduce the
size of implementation?
Now the engineering starts. It does take
some practice, so what follows is a design
specification for you to consider.
Design Challenge
The challenge is to design a small box to be
used in factories processing items such as
fruits and vegetables. The card inside it is used to
collect data from light sensors located on
various conveyor belts, along which the
fruits and vegetables pass as they are sorted
for type, quality, and size.
The initial design concept is to employ a
microcontroller (or similar small processor)
to collate the information and communicate
it via serial (RS-232) links to a PC in
the factory control room.
An FPGA is being considered to interface
the processor to the sensors. The product
is required in high volume (50K to
100K units), so a Xilinx Spartan-II FPGA
is the target for a cost-effective solution.
The card supports 64 sensors. A logic "1"
signal is generated when the light beam is
broken by a passing object. The maximum
speed of the conveyor belt is 1 meter/sec.
The minimum width of a single item is 3
cm, and there is a minimum 10 cm between
items on the belts.
The pulses from each sensor are recorded
by a separate counter, which can support a
maximum value of 4,095 (12 bits). A simple interface
to the microcontroller is then able to read
the value of any of the 64 counters in the
card by supplying a 6-bit address.
Initial Observations
Taking a very direct approach to the
design, we could simply identify the need
to implement 64 counters of 12 bits followed by a 64:1 data multiplexer. In fact,
this is a direct representation of the block
diagram shown in Figure 5.
However, we need to apply some fundamentally
good engineering here, because we
certainly wouldn’t want to have 64 independent
clocks in a design. Such a design
would lead to very poor utilization of the
device and have a high probability of unreliable
operation. The signal inputs really
should be synchronized to a single internal
clock, and then clock enables should be
used with the counters.
First Estimate
Given a basic understanding of the device
architecture, you can easily make an estimate
of the device resources used.
- Counters – Because each slice of an
FPGA can implement a 2-bit counter,
six slices are required to implement a
12-bit counter. Therefore, a total of 384
slices are required for all 64 counters.
(Two slices form a CLB within the
Virtex™ and Spartan-II FPGA families.)
- Multiplexer – Each slice contains two
lookup tables and a dedicated multiplexer
(MUXF5), enabling a 4:1 multiplexer
to be implemented. However,
each pair of slices within a CLB share
an additional dedicated multiplexer
(MUXF6), enabling a complete 8:1
multiplexer to be implemented in two
slices. Nine of these 8:1 multiplexers
are required to construct a 64:1 multiplexer,
which then must be replicated
12 times to support the data width of
the counters. The total size of the multiplexer
is then 2 x 9 x 12 = 216 slices.
- Synchronizing Logic – At this stage,
we have not designed the logic to capture
the input signals and synchronize
them to the internal clock. For now,
we will allow a slice per input (two
flip-flops and some gates). This gives
us a total of 64 slices.
Based on these major building blocks
of the design, our estimate is for 664
slices. Thus, a Spartan-II XC2S50 device
is suitable with its 768 slices, providing a
surplus of 104 slices to complete the
processor interface.
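The estimate is easy to reproduce. The short Python sketch below simply restates the figures given above; the variable names are mine.

```python
SENSORS = 64
COUNTER_BITS = 12
XC2S50_SLICES = 768

counter_slices = SENSORS * (COUNTER_BITS // 2)  # one slice holds 2 counter bits
mux_slices = 2 * 9 * COUNTER_BITS               # nine 8:1 muxes of 2 slices, per bit
sync_slices = SENSORS * 1                       # one slice allowed per input

total_slices = counter_slices + mux_slices + sync_slices
surplus = XC2S50_SLICES - total_slices

print(counter_slices, mux_slices, sync_slices)  # the three building blocks
print(total_slices, surplus)                    # total estimate and spare slices
```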
There are many ways to implement the
"Performance + Time = Memory" formula –
and we will look at just one. But as long as
you can significantly lower the cost, you are
well on your way to improving the profit
margins on your own designs in the future.
Remember, the target to beat is 664
slices in a Spartan-II XC2S50 device, which
was the result of a full parallel 2-D design.
Implementing a 3-D Design
We must begin our 3-D design process by
asking the right questions that relate to the
"Performance + Time = Memory" formula.
How Much Time Is Available?
Taking the minimum fruit size and minimum
spacing between fruit passing on a
belt at the maximum speed of 1 meter/sec,
we derive the timing of the fastest pulses
from a light sensor.
We discover that the pulses are of a long
duration and that the pulse rate is very low.
In fact, the maximum pulse rate is less than
8 Hz, which is very slow indeed. However,
we must consider that there are 64 sensors
to be monitored; we could be unlucky
enough to have them all triggered at the
same time. So, all 64 sensors must be serviced
in a maximum of 30 ms, and the
aggregate data rate is more like 500 Hz.
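These figures follow directly from the specification. As a quick check, assuming the worst case is the smallest item followed by the smallest gap at full belt speed:

```python
BELT_SPEED_M_S = 1.0   # maximum conveyor speed
MIN_ITEM_M = 0.03      # minimum item width (3 cm)
MIN_GAP_M = 0.10       # minimum gap between items (10 cm)
SENSORS = 64

pulse_width_s = MIN_ITEM_M / BELT_SPEED_M_S                 # beam broken
pulse_period_s = (MIN_ITEM_M + MIN_GAP_M) / BELT_SPEED_M_S  # item plus gap
max_pulse_rate_hz = 1.0 / pulse_period_s
aggregate_rate_hz = SENSORS * max_pulse_rate_hz

print(f"pulse width    : {pulse_width_s * 1e3:.0f} ms")
print(f"max pulse rate : {max_pulse_rate_hz:.2f} Hz per sensor")
print(f"aggregate rate : {aggregate_rate_hz:.0f} Hz for {SENSORS} sensors")
```

The 30 ms pulse width is the shortest event we must not miss, which is where the 30 ms service deadline comes from.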
What Performance Is Available?
We know that a Spartan-II FPGA is the
target architecture. This device is capable of
operation above 100 MHz, so device performance
should not limit us at all in this
case. Although we want to get the most out
of the silicon, there is no point overdoing it
and burning power unnecessarily. In this
case, it is better to work out the minimum
clock rate required to process all 64 channels, and then tie this rate in with a suitable
clock source on the PCB.
Looking at the timing waveform shown
in Figure 6, the pulse width caused by the
smallest fruit breaking the light beam is
the most demanding. We must guarantee
that we observe each sensor at least once
every 30 ms.
If, however, the 64 sensors are observed
and processed sequentially, rather than in
parallel, then 30 ms divided by 64 is the
maximum time that can be allocated to
each sensor. This means that the minimum
processing rate is 2,133 Hz. Obviously,
this is still desperately slow, but it only
emphasizes that "Performance + Time =
Memory" must be a valid formula to be
applied in this case.
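The minimum rate is a one-line calculation. A sketch, using my own variable names:

```python
SENSORS = 64
SERVICE_PERIOD_S = 0.030   # every sensor must be observed once per 30 ms

slot_time_s = SERVICE_PERIOD_S / SENSORS   # time budget per sensor
min_rate_hz = SENSORS / SERVICE_PERIOD_S   # sensors visited per second

print(f"time per sensor : {slot_time_s * 1e6:.2f} us")
print(f"minimum rate    : {int(min_rate_hz)} Hz")
```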
Replacing Counters with Memory
We have "Time" and we have ample
"Performance," so now it is a case of working
out how to make the whole thing a
sequential 3-D design. How can the memory
resources of the device be utilized to
reduce the size of implementation?
Because memory is used to hold data
values, we must identify where the data is
in the system. These may be complete values
or partial values, so we must have a
good look through the block diagram and
identify where the data values are. In this
system, they are fairly obvious in that the
counters each hold a value. In the parallel
implementation, they are distributed
across the 384 slices, forming the 64 counters,
but we want to consolidate them into
a single memory.
We can choose between distributed
(CLB) memory and dedicated (block)
memory, and we could really use either to
form storage for 64 values of 12 bits.
However, as the dedicated block RAM
isn’t required for anything else, let’s take
that option. Configured as 256 words of
16 bits, a single block provides more than
adequate storage.
The counter functionality is then
replaced with a single increment function,
as shown in Figure 7. A "count value" is
read from the RAM, passed through the
increment block, and then written back
into the RAM at the same location. This is best organized as a two-cycle process (read, then write), which is
no issue given the "Performance + Time"
that is available.
Although we could selectively access only
the count values whose light beam has just
been broken, it is much easier to scan
sequentially through all 64 count values and
increment only those that need it before each
value is written back into the RAM. This
reduces the address generation to a simple
6-bit counter.
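The scan can be modeled in software to see how little logic is involved. The Python below is a behavioral sketch, not HDL: a list stands in for the block RAM, a loop index stands in for the 6-bit address counter, and the function name scan_once is hypothetical. The specification does not say what happens above 4,095, so wrapping to zero here is my assumption.

```python
COUNT_MASK = 0xFFF   # 12-bit count values, maximum 4,095


def scan_once(ram, increment_flags):
    """Visit every count value: read it, increment if flagged, write it back.

    Returns the number of clock cycles used (two per location).
    """
    cycles = 0
    for address in range(len(increment_flags)):   # simple address counter
        value = ram[address]                      # cycle 1: read from RAM
        cycles += 1
        if increment_flags[address]:
            value = (value + 1) & COUNT_MASK      # assumed wrap at 4,095
        ram[address] = value                      # cycle 2: write back
        cycles += 1
    return cycles


ram = [0] * 256            # one block RAM configured as 256 words
flags = [False] * 64       # only the first 64 words are used
flags[3] = True            # pretend sensor 3 has just seen a new item
print(scan_once(ram, flags), ram[3])
```

One increment function serves all 64 channels, and the only state outside the RAM is the address counter.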
At this stage, we have replaced 384
slices with one block RAM and just nine
slices of logic (six for the increment and
three for the address counter). This is a
huge savings. Now, however, we must find
ways to connect the inputs and outputs to
this 3-D processing engine.
Eliminate the Data Multiplexer
The parallel data multiplexer is simply not
required in this design. We save 216 slices
instantly because the count values are now
held in one consolidated memory. The
dual-port nature of the block RAM really
makes it very easy to connect the external
processor.
As illustrated in Figure 8, the memory
also offers the opportunity for the processor
to have a write mode to reset count values
or set test values. As with the parallel
implementation, there is a risk that the
processor will try to read a count value that
is in the process of being modified.
However, it’s very easy to allocate time for
the recording process and time for the
processor to read values.
Although a clock rate of a few kHz is
adequate for the processing, a clock of 2
MHz (or similar clock rate associated with
the microcontroller) would achieve a count
value update scan in 64 µs, leaving nearly
all of the 30 ms processing period available
for the microcontroller to read or write
count values.
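The 64 µs figure comes from two clock cycles per count value. A quick check, with the 2 MHz clock taken from the text and the variable names my own:

```python
CLOCK_HZ = 2e6             # example clock rate from the text
SENSORS = 64
CYCLES_PER_VALUE = 2       # read, then write back
SERVICE_PERIOD_S = 0.030

scan_cycles = SENSORS * CYCLES_PER_VALUE
scan_time_s = scan_cycles / CLOCK_HZ
busy_fraction = scan_time_s / SERVICE_PERIOD_S

print(f"scan time : {scan_time_s * 1e6:.0f} us")
print(f"RAM busy  : {busy_fraction:.2%} of each 30 ms period")
```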
Connecting the Sensors
At some point in all 3-D designs, the parallel
world must be interfaced to the
sequential processing engine. This does not
have to be difficult, and often a simple
method is adequate, as seen in Figure 9.
The counter used to access each count
value from the RAM can be used to select
the associated sensor via a 64:1 multiplexer.
Although this requires multiplexer logic, it
is just for one bit and therefore only
requires 18 slices (nine CLBs).
Each sensor still requires its own logic.
This is partly to synchronize the input
signals, but is also required to ensure that
each "beam broken" pulse is only used to
record a count value once. For this reason,
the one slice per sensor is unlikely to
be reduced.
When you see that the logic size is
increasing because the function is becoming
more parallel, it is worth looking to see
if anything else can be time-shared and
moved into memory. In this case, we can
indeed improve things.
We can replace the multiplexer with a
64-bit parallel-to-serial converter (32
slices), which converts the parallel domain
into a serial sequential process, as demonstrated
in Figure 10. To detect only the
start of a new pulse, a memory is used to
remember the last state of each of the 64
sensors. Because the operation is so predictable,
we can use the SRL16E memory
mode, which requires just two slices.
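The "start of a new pulse" test is simply a comparison of each sensor's current state with the state remembered from the previous scan. The sketch below models that behavior in Python. It is not a description of the actual SRL16E circuit: a plain list stands in for the last-state memory, and the function name scan_sensors is hypothetical.

```python
def scan_sensors(sensor_bits, last_state):
    """Visit the sensors one at a time and flag only 0 -> 1 transitions.

    last_state is updated in place so the next scan sees this scan's values.
    """
    new_pulse = []
    for channel, current in enumerate(sensor_bits):
        previous = last_state[channel]
        new_pulse.append(current == 1 and previous == 0)
        last_state[channel] = current
    return new_pulse


last_state = [0] * 64
sensors = [0] * 64
sensors[5] = 1                                   # an item breaks beam 5
print(scan_sensors(sensors, last_state)[5])      # first scan sees the edge
print(scan_sensors(sensors, last_state)[5])      # beam still broken, no new count
```

The list of flags produced here is exactly what the increment block needs, so each item is counted once no matter how many scans take place while the beam is broken.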
Dramatic Cost Reduction
So was it worth it? I think the diagrams in
Figure 11 speak for themselves.
To reduce the function from 332 CLBs
to just 22 CLBs is a dramatic change: 15
times smaller. Our design now fits in the
smallest Spartan-II device (XC2S15) – and
actually only uses 25% of that.
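The headline numbers can be checked as follows. The CLB counts for the two designs are from this article; the figure of 96 CLBs for the XC2S15 is taken from the Spartan-II data sheet rather than from the text, so treat it as an outside assumption.

```python
PARALLEL_CLBS = 332      # 664 slices, two slices per CLB
SEQUENTIAL_CLBS = 22
XC2S15_CLBS = 96         # assumed from the Spartan-II data sheet

reduction = PARALLEL_CLBS / SEQUENTIAL_CLBS
utilization = SEQUENTIAL_CLBS / XC2S15_CLBS

print(f"reduction   : {reduction:.1f}x")
print(f"utilization : {utilization:.0%} of an XC2S15")
```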
This reduction in size and cost is not
just specific to this particular design. For
example, much of 3G wireless processing is
involved with "chip rates" of 1.2288 MHz
and 3.84 MHz. This provides the time to
allow the performance and memory of
Virtex devices to process at least 32 channels
sequentially, in just the same way as
our simple fruit counter.
Final Considerations
The Spartan-II XC2S15 has only 86 user
I/Os, and our design has high I/O
demands. Having used 64 for sensor
inputs and applied a clock, only 21 I/Os
are left for the microcontroller interface.
Given an 8-bit data bus, it is possible to
connect to the microcontroller, but it
does illustrate how I/O can limit a design
once these highly efficient techniques are
employed.
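A rough pin budget shows how tight this is. The totals come from the text; the split of the remaining pins into address, data, and control is only one hypothetical arrangement, not a stated part of the design.

```python
USER_IO = 86             # XC2S15 user I/Os
SENSOR_INPUTS = 64
CLOCK_INPUTS = 1

remaining = USER_IO - SENSOR_INPUTS - CLOCK_INPUTS

# Hypothetical microcontroller interface
ADDRESS_PINS = 6         # selects one of 64 count values
DATA_PINS = 8            # 8-bit data bus
spare_for_control = remaining - ADDRESS_PINS - DATA_PINS

COUNT_BITS = 12
reads_per_count = -(-COUNT_BITS // DATA_PINS)   # ceiling division

print(remaining, spare_for_control, reads_per_count)
```

With an 8-bit bus, each 12-bit count value takes two reads, so some of the spare pins would be needed for byte selection and strobes.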
Of course, it would be a pity for 75% of
the XC2S15 to be completely wasted. It
would be nice to embed the microcontroller
and the UART in the same device.
This is also possible, but it’s a topic for
another article.
Meanwhile, once you discover that 3-D
designs are possible, you are well on your
way to improving the profit margins on
your own designs.
[Editor’s note: This article was derived
from a two-part TechXclusive on the
support.xilinx.com website. To see the
original TechXclusive, go to
support.xilinx.com/support/techxclusives/3-D-techX22.htm and
support.xilinx.com/support/techxclusives/3-D2-techX23.htm.
To see more TechXclusives,
go to support.xilinx.com and search for
"TechXclusives," then click on "Xilinx
TechXclusives Home."]
Printable PDF version of this article. (07/15/03) 405 KB