
Performance + Time = Memory

by Ken Chapman, Senior Staff Engineer, Xilinx, UK
ken.chapman@xilinx.com (07/15/03)

By approaching FPGA designs as three-dimensional endeavors, you can radically reduce device size – and cost.

"Performance + Time = Memory" may sound like an odd formula, but when you understand it, you can realize significantly lower implementation costs within Xilinx FPGAs. In this article, I’ll show you how to use three-dimensional (3-D) design to accomplish a 15X reduction in the number of logic blocks in a sensing application.

Although the formula is vital for DSP applications, I really like the way it can be applied to so many other designs. It is particularly useful for applications suited to the range of Spartan™ devices, where cost savings are always welcome in high-volume products.

But let’s understand the formula first.

2-D Parallel Design
In most hardware designs, we treat the Xilinx FPGA as a two-dimensional (2-D) fabric, as shown in Figure 1. Configurable logic blocks (CLBs) provide the logical functions, and blocks of RAM are used for buffers, such as first-in first-out (FIFO) memories.

The tendency is for a design to become larger as more functionality is required. Therefore, it must use larger devices. The clock speed can often be well below 100 MHz, and many of the functions are clock-enabled at even lower rates.

Because the cost of a design is proportional to the size of the device, parallel implementations, even if well optimized, will be relatively expensive. They cannot be avoided where maximum performance is required.

Applications such as bus interfaces that need a predefined number of pins and clock rates are also fundamentally constrained in the way they can be implemented. However, when processing functions need only be completed in a relatively long time period, such 2-D design is wasteful and unnecessarily expensive.

A parallel design provides logic for each and every function that must be implemented. This means that there is actually a zero requirement for memory, because a signal (wire) exists for every value to be calculated. The addition tree example in Figure 2 shows how the value “A+B+C+D” is created. Because the value is immediately applied to the final adder, however, it does not need to be stored.

Of course, the parallel implementation offers the very highest performance. The adder tree can easily exceed 100 MHz in a Spartan-II device, which is equivalent to more than 700 million additions per second. However, such a structure cannot benefit from having more time to complete the required operation, other than consuming less power if it is clocked slower.

If there is 1 ms available to perform the addition tree, then it can be clocked at 1 kHz. It will work, but it really is a waste of the Spartan-II silicon performance potential. Even worse, the more values that need to be added, the larger the circuit becomes – and this increases the cost of your product.
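The throughput figures quoted above can be checked with a few lines of Python. This is a minimal sketch, assuming the tree in Figure 2 has eight inputs (the figure is not reproduced here, but eight inputs give the seven adders implied by "more than 700 million additions per second"). The name `adder_tree_stats` is purely illustrative.

```python
def adder_tree_stats(num_inputs: int, clock_hz: int) -> dict:
    """Return adder count and throughput of a fully parallel adder tree."""
    # A tree of two-input adders that sums N values needs N - 1 adders.
    adders = num_inputs - 1
    # Every adder produces one result on every clock cycle.
    return {"adders": adders, "additions_per_second": adders * clock_hz}


# Assumption: eight inputs, as implied by the 700 million figure in the text.
at_full_speed = adder_tree_stats(num_inputs=8, clock_hz=100_000_000)
at_required_rate = adder_tree_stats(num_inputs=8, clock_hz=1_000)

print(at_full_speed)
print(at_required_rate)
```

The same seven adders occupy the same area in both cases; only the clock rate, and therefore the useful work extracted from that area, changes.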

Processors Obey 3-D Formula
Now, take a closer look at the familiar world of processors. A processor is a very good time-sharing engine. The ALU is directed to perform many different operations over many clock cycles to (it is hoped) complete the desired process in the required time period. The higher the performance of the processor, the faster the ALU will be clocked, and hence, the more that ALU can be time-shared to achieve the algorithmic process, as illustrated in Figure 3.

For example, given that a particular process must be completed in a maximum time of 1 ms, the number of clock cycles available for the processor to exploit depends on the performance:

  • A clock speed of 1 MHz provides 1,000 clock cycles per 1 ms.
  • A clock speed of 100 MHz provides 100,000 clock cycles per 1 ms.
  • A clock speed of 200 MHz provides 200,000 clock cycles per 1 ms.

This is all very obvious, but less obvious is the direct link this has to memory.

Suppose the available clock cycles are used by the ALU to perform the trivial task of summing data values. In a 1 ms time period, a 1 MHz clock rate means that the processor has the ability to sum 1,000 data values. It will have to get these values from somewhere, and that place will be memory. As the clock increases to 200 MHz, it can then use the same amount of logic to sum 200,000 data values – and it now needs a memory to hold 200,000 words.
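The link between clock rate, time, and memory can be tabulated directly. The following is a small sketch, assuming (as in the summing example) that one data word is consumed per clock cycle; the helper name `cycles_in_window` is an illustration, not part of any tool.

```python
def cycles_in_window(clock_hz: int, window_us: int) -> int:
    """Clock cycles available inside a time window given in microseconds."""
    return clock_hz * window_us // 1_000_000


WINDOW_US = 1_000  # the 1 ms budget used in the text

for clock_mhz in (1, 100, 200):
    cycles = cycles_in_window(clock_mhz * 1_000_000, WINDOW_US)
    # Summing one value per cycle means one memory word per cycle.
    print(f"{clock_mhz:>3} MHz: {cycles:>7,} cycles -> {cycles:>7,} words of memory")
```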

In a more realistic case, a process tends to apply multiple instructions to each data set, so the memory requirement to store data is not so high; all the same, there is a very strong relationship.

3-D Sequential Design
Making the decision to operate the logic functions at a higher rate than the processing rate allows operations to be achieved sequentially. As with a processor, logic is time-shared over multiple clock cycles. Because "Performance + Time = Memory," we also need to use memory to hold all the values not being used in a given clock cycle, as well as partial/temporary results created during the processing. See Figure 4 for a 3-D rendering.

The FPGA can now be thought of as a 3-D volume to be filled. The best part is that you just pay for the 2-D fabric being occupied. The only limits to "building" upwards are the maximum clock rate of the device and the amount of RAM available in a given block. In addition to the dedicated blocks of RAM, each CLB can be used to provide distributed RAM, allowing the correct amount of memory to be allocated in each position. This prevents memory access bottlenecks from forming in your design.
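To make the trade concrete, here is a behavioral sketch in Python (not hardware description code) that contrasts the two styles for a simple summation. The function names and the cost dictionaries are illustrative assumptions; the point is only that the sequential version swaps adders for memory words and clock cycles.

```python
def parallel_cost(num_values: int) -> dict:
    """2-D style: one adder per operation, no storage, a result every cycle."""
    return {"adders": num_values - 1, "memory_words": 0, "cycles_per_result": 1}


def sequential_sum(memory: list) -> tuple:
    """3-D style: a single adder reused every cycle, operands held in memory."""
    accumulator = 0
    cycles = 0
    for value in memory:          # one memory read and one addition per cycle
        accumulator += value
        cycles += 1
    cost = {"adders": 1, "memory_words": len(memory), "cycles_per_result": cycles}
    return accumulator, cost


values = list(range(1, 9))        # eight hypothetical data values, 1..8
total, cost = sequential_sum(values)
print(total, cost)
print(parallel_cost(len(values)))
```

With eight values, the sequential version needs one adder, eight words of memory, and eight cycles, where the parallel tree needs seven adders and no memory.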

3-D Approach to Design
When any function is implemented, two basic questions should be asked:

1. How much time is available to complete the process?

2. Given the performance of the selected Xilinx device, what clock rate will be used?

The answer to the first question comes from the design specification. The way you partition the design into functions can have quite an impact, so consider some alternatives. As to the performance of Xilinx FPGA devices, this has more to do with "design comfort" than the actual peak performance of the devices.

Regarding the second question, I personally like to see devices clocked above 75 MHz, and I find this relatively easy to achieve. However, the higher the clock rate is, the more challenging the design is. Anything lower than a 50 MHz clock rate is very slow and wasteful of the performance potential offered by Xilinx logic devices. Remember that the embedded DLL (delay locked loop) and DCM (digital clock manager) blocks can be used to create internal clocks of a higher rate than those available on the PCB.

The answers to the first two questions will let you know if there is any potential for time-sharing of logic resources. This leads to a third and final question:

3. How can the memory resources of the device be utilized to reduce the size of implementation?

Now the engineering starts. It does take some practice, so what follows is a design specification for you to consider.

Design Challenge
The challenge is to design a small box to be used in factories processing items such as fruits and vegetables. The card is used to collect data from light sensors located on various conveyor belts, along which the fruits and vegetables pass as they are sorted for type, quality, and size.

The initial design concept is to employ a microcontroller (or similar small processor) to collate the information and communicate it via serial (RS-232) links to a PC in the factory control room.

An FPGA is being considered to interface the processor to the sensors. The product is required in high volume (50K to 100K units), so a Xilinx Spartan-II FPGA is the target for a cost-effective solution.

The card supports 64 sensors. A logic "1" signal is generated when the light beam is broken by a passing object. The maximum speed of the conveyor belt is 1 meter/sec. The minimum width of a single item is 3 cm, and there is a minimum 10 cm between items on the belts.

Each pulse is recorded by a separate counter, which can support a maximum value of 4,095 (12 bits). A simple interface to the microcontroller is then able to read the value of any of the 64 counters in the card by supplying a 6-bit address.

Initial Observations
Taking a very direct approach to the design, we could simply identify the need to implement 64 counters of 12 bits followed by a 64:1 data multiplexer. In fact, this is a direct representation of the block diagram shown in Figure 5.

However, we need to apply some fundamentally good engineering here, because we certainly wouldn’t want to have 64 independent clocks in a design. Such a design would lead to very poor utilization of the device and have a high probability of unreliable operation. The signal inputs really should be synchronized to a single internal clock, and then clock enables should be used with the counters.

First Estimate
Given a basic understanding of the device architecture, you can easily make an estimate of the device resources used.

  • Counters – Because each slice of an FPGA can implement a 2-bit counter, six slices are required to implement a 12-bit counter. Therefore, a total of 384 slices are required for all 64 counters. (Two slices form a CLB within the Virtex™ and Spartan-II FPGA families.)
  • Multiplexer – Each slice contains two lookup tables and a dedicated multiplexer (MUXF5), enabling a 4:1 multiplexer to be implemented. However, each pair of slices within a CLB shares an additional dedicated multiplexer (MUXF6), enabling a complete 8:1 multiplexer to be implemented in two slices. Nine of these 8:1 multiplexers are required to construct a 64:1 multiplexer, which then must be replicated 12 times to support the data width of the counters. The total size of the multiplexer is then 2 x 9 x 12 = 216 slices.
  • Synchronizing Logic – At this stage, we have not designed the logic to capture the input signals and synchronize them to the internal clock. For now, we will allow a slice per input (two flip-flops and some gates). This gives us a total of 64 slices.

Based on these major building blocks of the design, our estimate is for 664 slices. Thus, a Spartan-II XC2S50 device is suitable with its 768 slices, providing a surplus of 104 slices to complete the processor interface.
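The estimate above is simple enough to reproduce as a short calculation. This sketch uses only the figures given in the three bullets; the constant names are my own.

```python
SENSORS = 64
COUNTER_BITS = 12

COUNTER_BITS_PER_SLICE = 2   # each slice implements a 2-bit counter
SLICES_PER_MUX8 = 2          # an 8:1 multiplexer fills two slices (one CLB)
MUX8_PER_MUX64 = 9           # eight first-level 8:1 muxes plus one to combine them
SYNC_SLICES_PER_INPUT = 1    # allowance: two flip-flops and some gates
XC2S50_SLICES = 768

counter_slices = SENSORS * (COUNTER_BITS // COUNTER_BITS_PER_SLICE)
mux_slices = SLICES_PER_MUX8 * MUX8_PER_MUX64 * COUNTER_BITS
sync_slices = SENSORS * SYNC_SLICES_PER_INPUT
total_slices = counter_slices + mux_slices + sync_slices
surplus_slices = XC2S50_SLICES - total_slices

print(f"counters: {counter_slices}, multiplexer: {mux_slices}, sync: {sync_slices}")
print(f"total: {total_slices} slices, surplus in XC2S50: {surplus_slices}")
```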

There are many ways to implement the "Performance + Time = Memory" formula – and we will look at just one. But as long as you can significantly lower the cost, you are well on your way to improving the profit margins on your own designs in the future.

Remember, the target to beat is 664 slices in a Spartan-II XC2S50 device, which was the result of a full parallel 2-D design.

Implementing a 3-D Design
We must begin our 3-D design process by asking the right questions that relate to the "Performance + Time = Memory" formula.

How Much Time Is Available?
Taking the minimum fruit size and minimum spacing between fruit passing on a belt at the maximum speed of 1 meter/sec, we derive the timing of the fastest pulses from a light sensor.

We discover that the pulses are of a long duration and that the pulse rate is very low. In fact, the maximum pulse rate is less than 8 Hz (one 3 cm item plus a 10 cm gap passes every 130 ms), which is very slow indeed. However, we must consider that there are 64 sensors to be monitored; we could be unlucky enough to have them all triggered at the same time. So, all 64 sensors must be serviced within the 30 ms for which the smallest item breaks the beam, and the aggregate data rate is more like 500 Hz.
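These timing figures follow directly from the belt speed and item dimensions in the specification. A minimal sketch of the arithmetic, with constant names chosen here for illustration and lengths in millimetres to keep the numbers exact:

```python
SENSORS = 64
BELT_SPEED_MM_PER_S = 1_000   # 1 meter/sec
MIN_ITEM_WIDTH_MM = 30        # 3 cm
MIN_GAP_MM = 100              # 10 cm

# Time the beam stays broken by the smallest item.
pulse_width_ms = MIN_ITEM_WIDTH_MM * 1_000 / BELT_SPEED_MM_PER_S
# Time from the start of one item to the start of the next.
pulse_period_ms = (MIN_ITEM_WIDTH_MM + MIN_GAP_MM) * 1_000 / BELT_SPEED_MM_PER_S
max_pulse_rate_hz = BELT_SPEED_MM_PER_S / (MIN_ITEM_WIDTH_MM + MIN_GAP_MM)
aggregate_rate_hz = SENSORS * max_pulse_rate_hz

print(f"shortest pulse: {pulse_width_ms:.0f} ms, period: {pulse_period_ms:.0f} ms")
print(f"per-sensor rate: {max_pulse_rate_hz:.2f} Hz, aggregate: {aggregate_rate_hz:.0f} Hz")
```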

What Performance Is Available?
We know that a Spartan-II FPGA is the target architecture. This device is capable of operation above 100 MHz, so device performance should not limit us at all in this case. Although we want to get the most out of the silicon, there is no point overdoing it and burning power unnecessarily. In this case, it is better to work out the minimum clock rate required to process all 64 channels, and then tie this rate in with a suitable clock source on the PCB.

Looking at the timing waveform shown in Figure 6, the pulse width caused by the smallest fruit breaking the light beam is the most demanding. We must guarantee that we observe each sensor at least once every 30 ms.

If, however, the 64 sensors are observed and processed sequentially, rather than in parallel, then 30 ms divided by 64 is the maximum time that can be allocated to each sensor. This means that the minimum processing rate is 2,133 Hz. Obviously, this is still desperately slow, but it only emphasizes that "Performance + Time = Memory" must be a valid formula to be applied in this case.
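The minimum processing rate can be worked out in two lines. This sketch only restates the division described above; the variable names are illustrative.

```python
SENSORS = 64
SERVICE_WINDOW_US = 30_000    # every sensor must be observed once per 30 ms

slot_us = SERVICE_WINDOW_US / SENSORS                     # time slot per sensor
min_rate_hz = SENSORS * 1_000_000 / SERVICE_WINDOW_US     # sensors serviced per second

print(f"time per sensor: {slot_us} us, minimum processing rate: {min_rate_hz:.0f} Hz")
```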

Replacing Counters with Memory
We have "Time" and we have ample "Performance," so now it is a case of working out how to make the whole thing a sequential 3-D design. How can the memory resources of the device be utilized to reduce the size of implementation?

Because memory is used to hold data values, we must identify where the data is in the system. These may be complete values or partial values, so we must have a good look through the block diagram and identify where the data values are. In this system, they are fairly obvious in that the counters each hold a value. In the parallel implementation, they are distributed across the 384 slices, forming the 64 counters, but we want to consolidate them into a single memory.

We can choose between distributed (CLB) memory and dedicated (block) memory, and we could really use either to form storage for 64 values of 12 bits. However, as the dedicated block RAM isn’t required for anything else, let’s take that option. Configured as 256 words of 16 bits, a single block provides more than adequate storage.

The counter functionality is then replaced with a single increment function, as shown in Figure 7. A "count value" is read from the RAM, passed through the increment block, and then written back into the RAM at the same location. This is best organized as a two-cycle process, but is no issue given the "Performance + Time" that is available.

Although we could selectively access the count values to be recorded as a corresponding light beam is broken, it is much easier to scan sequentially through all 64 count values and record only those which must be increased before the value is written back into the RAM. This reduces the address generation to a simple 6-bit counter.
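The read, increment, write-back scan can be modeled behaviorally in Python. This is a sketch of the idea rather than the actual design: the class name `CounterRam`, the `increment` flag list, and the wrap-around at 4,095 are assumptions made for illustration.

```python
class CounterRam:
    """Behavioral model of 64 counters held in one RAM with one incrementer."""

    def __init__(self, channels: int = 64, bits: int = 12, depth: int = 256):
        self.channels = channels
        self.mask = (1 << bits) - 1      # 12 bits give a maximum value of 4,095
        self.ram = [0] * depth           # block RAM of 256 words, only 64 used

    def scan(self, increment: list) -> int:
        """Visit every count value once; return the clock cycles consumed."""
        cycles = 0
        for address in range(self.channels):   # simple 6-bit address counter
            value = self.ram[address]          # cycle 1: read the count value
            if increment[address]:
                # Assumption: the count wraps to 0 after 4,095.
                value = (value + 1) & self.mask
            self.ram[address] = value          # cycle 2: write it back
            cycles += 2
        return cycles


counters = CounterRam()
flags = [False] * 64
flags[3] = flags[10] = True    # hypothetical: beams 3 and 10 were just broken
cycles_used = counters.scan(flags)
print(cycles_used, counters.ram[3], counters.ram[10], counters.ram[4])
```

Every location is read and written on every scan, whether or not it changes, which is what keeps the address generation down to a plain counter.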

At this stage, we have replaced 384 slices with one block RAM and just nine slices of logic (six for the increment and three for the address counter). This is a huge savings. Now, however, we must find ways to connect the inputs and outputs to this 3-D processing engine.

Eliminate the Data Multiplexer
The parallel data multiplexer is simply not required in this design. We save 216 slices instantly because the count values are now held in one consolidated memory. The dual-port nature of the block RAM really makes it very easy to connect the external processor.

As illustrated in Figure 8, the memory also offers the opportunity for the processor to have a write mode to reset count values or set test values. As with the parallel implementation, there is a risk that the processor will try to read a count value that is in the process of being modified. However, it’s very easy to allocate time for the recording process and time for the processor to read values.

Although a clock rate of a few kHz is adequate for the processing, a clock of 2 MHz (or a similar clock rate associated with the microcontroller) would complete a count value update scan in 64 µs (64 values at two cycles each), leaving nearly all of the 30 ms processing period available for the microcontroller to read or write count values.
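The 64 µs figure follows from the two-cycle read and write-back described earlier. A short sketch of the time budget, using names of my own choosing:

```python
COUNT_VALUES = 64
CYCLES_PER_VALUE = 2          # one cycle to read, one to write back
CLOCK_HZ = 2_000_000
PROCESSING_PERIOD_US = 30_000

scan_cycles = COUNT_VALUES * CYCLES_PER_VALUE
scan_time_us = scan_cycles * 1_000_000 // CLOCK_HZ
free_time_us = PROCESSING_PERIOD_US - scan_time_us
busy_percent = 100 * scan_time_us / PROCESSING_PERIOD_US

print(f"scan: {scan_cycles} cycles = {scan_time_us} us ({busy_percent:.2f}% of the period)")
print(f"left for the microcontroller: {free_time_us} us")
```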

Connecting the Sensors
At some point in all 3-D designs, the parallel world must be interfaced to the sequential processing engine. This does not have to be difficult, and often a simple method is adequate, as seen in Figure 9. The counter used to access each count value from the RAM can be used to select the associated sensor via a 64:1 multiplexer. Although this requires multiplexer logic, it is just for one bit and therefore requires only nine 8:1 multiplexers (18 slices).

Each sensor still requires its own logic. This is partly to synchronize the input signals, but is also required to ensure that each "beam broken" pulse is only used to record a count value once. For this reason, the one slice per sensor is unlikely to be reduced.

When you see that the logic size is increasing because the function is becoming more parallel, it is worth looking to see if anything else can be time-shared and moved into memory. In this case, we can indeed improve things.

We can replace the multiplexer with a 64-bit parallel-to-serial converter (32 slices), which converts the parallel domain into a serial sequential process, as demonstrated in Figure 10. To detect only the start of a new pulse, a memory is used to remember the last state of each of the 64 sensors. Because the operation is so predictable, we can use the SRL16E memory mode, which requires just two slices.
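The "remember the last state" idea can be illustrated with a small behavioral model. This Python sketch is not the SRL16E circuit itself; `find_new_pulses` and the list-based `last_state` memory are stand-ins chosen to show how a serial pass over the sensors detects only the start of each pulse.

```python
def find_new_pulses(sensor_levels: list, last_state: list) -> list:
    """Serially compare each sensor with its remembered state.

    Returns one flag per sensor that is True only when the beam has
    just been broken (0 -> 1), and updates last_state in place.
    """
    flags = []
    for channel, level in enumerate(sensor_levels):   # one sensor bit per step
        is_new_pulse = level == 1 and last_state[channel] == 0
        flags.append(is_new_pulse)
        last_state[channel] = level                   # remember for the next scan
    return flags


last_state = [0] * 64
beam = [0] * 64

beam[5] = 1                                   # hypothetical: an item arrives at sensor 5
first_scan = find_new_pulses(beam, last_state)
second_scan = find_new_pulses(beam, last_state)   # item still in the beam
beam[5] = 0                                   # item has passed
third_scan = find_new_pulses(beam, last_state)
beam[5] = 1                                   # next item arrives
fourth_scan = find_new_pulses(beam, last_state)

print(first_scan[5], second_scan[5], third_scan[5], fourth_scan[5])
# prints: True False False True
```

However many scans see the same item in the beam, only the first one raises the flag, so each item is counted exactly once.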

Dramatic Cost Reduction
So was it worth it? I think the diagrams in Figure 11 speak for themselves.

To reduce the function from 332 CLBs to just 22 CLBs is a dramatic change: 15 times smaller. Our design now fits in the smallest Spartan-II device (XC2S15) – and actually only uses 25% of that.
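One way to arrive at these totals is to add up the slice counts quoted in the preceding sections. The tally below is my own reconstruction, not a breakdown given in the article, and the XC2S15 capacity of 96 CLBs is an assumption taken from the Spartan-II data sheet rather than from the text.

```python
import math

SLICES_PER_CLB = 2
XC2S15_CLBS = 96   # assumption: Spartan-II data sheet figure, not stated in the article

parallel_slices = 664
sequential_slices = {
    "increment": 6,
    "address counter": 3,
    "parallel-to-serial converter": 32,
    "last-state memory (SRL16E)": 2,
}

parallel_clbs = parallel_slices // SLICES_PER_CLB
sequential_total = sum(sequential_slices.values())
sequential_clbs = math.ceil(sequential_total / SLICES_PER_CLB)
reduction = parallel_clbs / sequential_clbs
utilization_percent = 100 * sequential_clbs / XC2S15_CLBS

print(f"2-D: {parallel_clbs} CLBs, 3-D: {sequential_clbs} CLBs plus one block RAM")
print(f"reduction: {reduction:.1f}x, XC2S15 utilization: {utilization_percent:.0f}%")
```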

This reduction in size and cost is not just specific to this particular design. For example, much of 3G wireless processing is involved with "chip rates" of 1.2288 MHz and 3.84 MHz. This provides the time to allow the performance and memory of Virtex devices to process at least 32 channels sequentially, in just the same way as our simple fruit counter.

Final Considerations
The Spartan-II XC2S15 has only 86 user I/Os, and our design has high I/O demands. Having used 64 for sensor inputs and applied a clock, only 21 I/Os are left for the microcontroller interface. Given an 8-bit data bus, it is possible to connect to the microcontroller, but it does illustrate how I/O can limit a design once these highly efficient techniques are employed.
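The I/O budget can be checked in the same way. In the sketch below, only the 86, 64, 1, and 8-bit data bus figures come from the article; the rest of the microcontroller interface (address, byte select, and control strobes) is a hypothetical pin allocation included purely to show that an 8-bit interface can fit.

```python
USER_IOS = 86
SENSOR_INPUTS = 64
CLOCK_INPUTS = 1

available = USER_IOS - SENSOR_INPUTS - CLOCK_INPUTS

# Hypothetical microcontroller interface, for illustration only.
interface_pins = {
    "data bus": 8,                    # from the article
    "count address": 6,               # selects one of 64 count values
    "byte select": 1,                 # assumption: 12-bit value read as two bytes
    "read, write, chip select": 3,    # assumption: typical control strobes
}
needed = sum(interface_pins.values())
spare = available - needed

print(f"available: {available}, needed: {needed}, spare: {spare}")
```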

Of course, it would be a pity for 75% of the XC2S15 to be completely wasted. It would be nice to embed the microcontroller and the UART in the same device. This is also possible, but it’s a topic for another article.

Meanwhile, once you discover that 3-D designs are possible, you are well on your way to improving the profit margins on your own designs.

[Editor’s note: This article was derived from a two-part TechXclusive on the support.xilinx.com website. To see the original TechXclusive, go to
support.xilinx.com/support/techxclusives/3-D-techX22.htm and
support.xilinx.com/support/techxclusives/3-D2-techX23.htm.

To see more TechXclusives, go to support.xilinx.com and search for "TechXclusives," then click on "Xilinx TechXclusives Home."]
