Xcell Journal Online Article

FPGAs Have the Multiprocessing I/O Infrastructure to Meet 3G Base Station Design Goals
by Peter Galicki, CEO, CrossBow Technologies Inc. peter.galicki@crossbowip.com (02/15/03)

Two-dimensional fabric efficiently links arrays of processors inside Virtex-II Pro devices to enable parallel processing of data.

With increased data traffic and new multiuser detection and adaptive beam-forming algorithms, data processing requirements of 3G base stations will increase by as much as 100 times relative to current equipment. This increase in processing capacity must be matched by low power consumption, as the new picocell base stations mounted on building sides will not be using forced air cooling.

Arrays of small and specialized processors (Figures 1 and 2) will provide a power-efficient method of increasing performance, more so than can be obtained by increasing the features of larger, general-purpose super processors.

The evolution of current standards and introduction of new standards currently force base station operators to perform frequent upgrades to their wireless infrastructure, often requiring board replacements. To reduce field maintenance, 3G equipment must be upgradable without board swapping. The high cost of 3G equipment often leaves wireless infrastructure manufacturers with thin profit margins; costs will have to come down to enable large-scale deployment.

Finally, OEMs cannot abandon their current design methods to start designing 3G equipment from scratch. They must be able to reuse semiconductor IP, code, and development tools to hit market windows and to obtain a return on investment.

Meeting these seemingly mutually exclusive goals requires top-notch process technology combined with a comprehensive component library and efficient data communication methods, both inside the chip and chip-to-chip. Reaching 3G design goals hinges on achieving the right balance between the size and number of data processing components to keep most of the chip busy at all times, while reducing the overall distance that the data has to travel inside ICs. System efficiency is heavily influenced by design partitioning, optimization of individual data processing components, and streamlining of data flow between components. To keep data processing components busy, inter-processor data transfers must have low latency and be precisely deterministic; otherwise components will waste valuable processing cycles while waiting for data.

Low latency requires the removal of data communication bottlenecks by spreading the data flow over the entire area of a chip. Transfer determinism is achieved by a combination of low latency and a uniform data communications structure inside the chip.

Data Processing Elements

Each 3G chip is likely to contain hundreds of processing elements, many representing autonomous processors with their own data processing flows, control flows, memory, and communications ports. Some data processing flows may be augmented with dedicated DSP blocks. Virtex™-II FPGAs support DSP functions and a MicroBlaze™ soft processor. Virtex-II Pro™ devices also feature embedded IBM PowerPC™ processors. Depending on the task at hand, smaller processors may be better suited for simpler functions, and larger processors may be a better fit for more complex algorithms.

In order to work in parallel, processors must be able to easily communicate with each other. An efficient way for processors to communicate is through a fabric dispersed across the entire design that looks to individual processors like conventional memory (Figure 3). This approach enables each processing element to be developed and verified individually, yet easily exchange data with other processors.

Two-Dimensional Data Communications

An effective data interconnect fabric must support low latency and deterministic data transfers occurring simultaneously among multiple processing elements. It must also be flexible and scalable to allow for the addition of new elements or the removal of unwanted elements without affecting the rest of the design. Finally, it should be compatible with existing processors and be as easy to use as accessing memory.

Memory-Like Interface
Using conventional bus cycles to transfer data between processors dispenses with exotic and hard-to-implement communications peripherals and protocols in favor of a simple memory-like interface. As shown in Figure 4, 2D-fabric from CrossBow appears to processors as a memory-mapped peripheral on an IBM CoreConnect™ bus. PowerPC and MicroBlaze processors can issue conventional read and/or write bus cycles to their local 2D-fabric peripherals to communicate with other processors on the chip (Figures 5 and 6). The payload for each transfer is derived from the data bus. The destination location and the initial direction of travel are derived from the address bus. The transfers are totally transparent to the sending and receiving processors: the sender launches a transfer with a write cycle, and the receiver terminates it with a read cycle.
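
The address-based scheme described above can be illustrated with a small model. This is only a sketch under stated assumptions: the base address, the field widths, and the names `FABRIC_BASE`, `Direction`, `encode_write_address`, and `decode_write_address` are hypothetical and are not taken from the CrossBow register map, which the article does not give. It shows how a destination location and an initial direction of travel could be packed into the address of an ordinary write cycle.

```python
from enum import IntEnum


class Direction(IntEnum):
    """Four possible exit directions; the numeric codes are assumed."""
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3


# Hypothetical layout, chosen only for illustration.
FABRIC_BASE = 0x8000_0000  # assumed base of the local fabric peripheral
COL_BITS = 4               # assumed width of the destination column field
ROW_BITS = 4               # assumed width of the destination row field
DIR_BITS = 2               # two bits cover four directions
WORD_SHIFT = 2             # keep addresses aligned to 32-bit words


def encode_write_address(dest_row, dest_col, direction):
    """Pack destination and exit direction into a word-aligned address."""
    if not 0 <= dest_row < (1 << ROW_BITS):
        raise ValueError("dest_row out of range")
    if not 0 <= dest_col < (1 << COL_BITS):
        raise ValueError("dest_col out of range")
    direction = Direction(direction)  # rejects codes outside 0..3
    offset = (int(direction) << (ROW_BITS + COL_BITS)) | (dest_row << COL_BITS) | dest_col
    return FABRIC_BASE + (offset << WORD_SHIFT)


def decode_write_address(address):
    """Recover (row, column, direction), as the local peripheral might."""
    offset = (address - FABRIC_BASE) >> WORD_SHIFT
    dest_col = offset & ((1 << COL_BITS) - 1)
    dest_row = (offset >> COL_BITS) & ((1 << ROW_BITS) - 1)
    direction = Direction((offset >> (ROW_BITS + COL_BITS)) & ((1 << DIR_BITS) - 1))
    return dest_row, dest_col, direction
```

In this model, a processor sending a word to the node at row 2, column 5, leaving eastward, would write its payload to `encode_write_address(2, 5, Direction.EAST)`. The payload itself travels on the data bus and is not part of the address.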

Routing of data from source to destination, as well as arbitration with other data traffic, is performed autonomously by the interconnected 2D-fabric peripherals.

A 2D Array of Data Transport Links
Efficient 3G designs will feature global communication fabrics using single sets of lines to transfer all kinds of data, including payloads, control words, and configuration data. Duplication of data transfer lines reduces overall system efficiency.

As shown in Figure 7, 2D-fabric peripherals of adjacent processors are interconnected with a single mesh of horizontal and vertical data transport links. Individual bus cycles are autonomously converted to small packets that travel between source processors and destination processors through chains of 2D-fabric peripherals of the intermediate processors along the way. Short point-to-point links reduce power consumption.
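
To make the idea of a packet passing through a chain of nodes concrete, here is a minimal sketch of one possible path through the mesh. The article does not state which routing algorithm 2D-fabric uses, so the column-first, then row ordering below (often called dimension-order routing) is an assumption, as are the function name `xy_route` and the `(row, column)` coordinate convention.

```python
def xy_route(src, dst):
    """Return the nodes a packet enters after leaving src, ending at dst.

    Assumed policy: travel along the row until the destination column is
    reached, then travel along that column. Nodes are (row, column) tuples.
    """
    row, col = src
    path = []
    # Horizontal links first.
    while col != dst[1]:
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    # Then vertical links.
    while row != dst[0]:
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path
```

For a packet sent from node (0, 0) to node (2, 1), this sketch returns `[(0, 1), (1, 1), (2, 1)]`: three point-to-point links, each connecting adjacent fabric peripherals.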

Small packets with single word payloads reduce data transfer latencies, enabling data and control packets to share common transfer lines. The same lines can also be used for system initialization and configuration.

Scalability
Scalability is an important requirement for the design effort and product field upgrades. Constantly changing standards may require adding or removing processors late in the design cycle or even after field deployment.

In the past, adding or removing processors has always been difficult when using centralized DMAs for movement of data. In any centralized I/O structure, removing or adding new components is likely to affect other system components. Two-dimensional I/O structures are much less sensitive to design changes. Adding another processor to a chip is as simple as wrapping it with a 2D-fabric peripheral and connecting the respective data transport links to the existing fabric. This can be easily done without affecting any hardware or software already in place.

Low Latency and Deterministic Data Transfers

In computing environments where hundreds of processors are simultaneously exchanging data, how can you guarantee that any one of those transfers will arrive at its destination within a fixed amount of time? Buses, crossbars, and other centralized I/O structures force all data traffic through one central location, creating huge traffic jams. Two-dimensional I/O structures, however, can easily guarantee data delivery by spreading out data traffic across the design. As shown in Figure 7, a two-dimensional data transport grid dispersed across the entire design area removes communication bottlenecks to allow individual transfers to complete on time, without interfering with other transfers.

Individual processors must use worst-case transfer latency when planning data transfers. Although it is acceptable for data to wait to be transferred, processors waiting for data are wasting precious processing cycles. Total transfer latency depends on the worst-case latency across one processing node and the number of intermediate processing nodes between the source and destination nodes.

Worst-Case Latency Across One Processing Node
A 50 ns packet latency across one node is the time that elapses from when the packet starts entering the node to when it starts exiting that node. Packet delay time is the time from when the packet starts entering the node to when it has completely emerged. Thus, a 100 ns packet delay time is the 50 ns latency plus another 50 ns for the packet to fully emerge from the node.

If packets exiting from a given output port can arrive from three different sources, the worst-case latency for any one packet is 250 ns. This is equal to the best-case latency of 50 ns plus two packet-delay slots of 100 ns each.
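
The arithmetic in this example can be written as a one-line function. This is a sketch of the calculation as the text describes it: a packet may have to wait for one full packet-delay slot per competing source before its own 50 ns latency applies. The function name and parameter names are illustrative only.

```python
def worst_case_node_latency_ns(latency_ns=50, packet_delay_ns=100, sources=3):
    """Worst-case latency across one node, in nanoseconds.

    A packet can be held behind one packet from each of the other sources
    that share the same output port, so it waits (sources - 1) delay slots
    and then incurs its own best-case latency.
    """
    if sources < 1:
        raise ValueError("at least one source is required")
    return latency_ns + (sources - 1) * packet_delay_ns
```

With the default values from the text, the function returns 250 ns. With a single source and no contention, it reduces to the best-case latency of 50 ns.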

Worst-Case Latency Across Several Processing Nodes
If the worst-case latency for crossing one processing node is 250 ns, the worst-case latency for an entire transfer chain of two nodes, for example, amounts to 500 ns. Thus, a packet launched from a source processor two nodes away from its destination will take a maximum of 500 ns to arrive at its destination processor, regardless of any other data traffic in the system (Figure 8).

Total Latency
Because 2D-fabric appears to processors as if it were memory, and because transfer latency increases with the geographical distance from the source, processors can treat transfer latency as memory wait states for the purpose of scheduling the transfers. In a fully deterministic way, the further you go, the more wait states will be required to complete a transfer (Figure 4).

Although actual latency for the above example is most likely to be closer to the best-case latency of 100 ns (two nodes at 50 ns each), the worst-case latency should always be used when planning data transfers between processors. In some I/O fabrics, worst-case latency can be further reduced by launching packets in specific routing directions to avoid interference with other packets, thus reducing the number of packet delay slots from two to one, or even down to zero.
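
The effect of removing packet delay slots can be tabulated for the two-node example. The sketch below assumes, as the text does, that every node in the chain contributes the same worst-case latency and that the 50 ns and 100 ns figures apply at each node. The names `chain_latency_ns` and `latency_table` are illustrative.

```python
NODE_LATENCY_NS = 50    # best-case latency across one node (from the text)
PACKET_DELAY_NS = 100   # one packet delay slot (from the text)


def chain_latency_ns(nodes, delay_slots):
    """Worst-case latency for a chain of identical nodes, in nanoseconds."""
    return nodes * (NODE_LATENCY_NS + delay_slots * PACKET_DELAY_NS)


def latency_table(nodes, max_slots=2):
    """Map each delay-slot count, from max_slots down to 0, to chain latency."""
    return {slots: chain_latency_ns(nodes, slots)
            for slots in range(max_slots, -1, -1)}
```

For a chain of two nodes, `latency_table(2)` gives 500 ns with two delay slots, 300 ns with one, and 100 ns with none. The last figure equals the best-case latency, because a packet that never waits behind another packet sees only the 50 ns latency at each node.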

As shown in Figure 4, 2D-fabric allows processors to easily determine the worst-case transfer latency for any destination inside the chip by simply counting the number of intermediate nodes. 2D-fabric also enables packets to be launched in any one of four possible directions by encoding exit directions in the address field of each data write cycle.

Conclusion

Two-dimensional inter-processor interfaces enable fast, easy, and efficient data communications among hundreds of data processing elements of 3G functions implemented inside Virtex-II FPGAs. In addition to 3G, two-dimensional I/O also benefits voice-over-packet, routers, medical imaging, radar, and sonar applications. Linking processing elements with 2D-fabric increases system performance by enabling multiple processors to process data in parallel. At the same time, 2D-fabric reduces power consumption by minimizing the total distance that data has to travel inside chips.

And because it looks to the processors like conventional memory, 2D-fabric does not force system programmers to change their programming methods to benefit from higher performance. Serial programming code investment is preserved, because each processing element has only one processor. Finally, system designers can now drastically increase processing throughput and I/O bandwidth while retaining current processor architectures and design tools.

For more information on the 2D-fabric parallel-processing interface, go to www.xilinx.com/products/logicore/alliance/crossbow/crossbow.htm.
