FPGAs Have the Multiprocessing I/O Infrastructure to Meet 3G Base Station Design Goals
Two-dimensional fabric efficiently links arrays of processors inside Virtex-II Pro devices to enable parallel processing of data.
With increased data traffic and new
multiuser detection and adaptive beam-forming
algorithms, data processing
requirements of 3G base stations will
increase by as much as 100 times relative
to current equipment. This increase in
processing capacity must be matched by
low power consumption, as the new picocell
base stations mounted on building
sides will not be using forced air cooling.
Arrays of small and specialized
processors (Figures 1 and 2) will provide
a power-efficient method of increasing
performance, more so than can be
obtained by increasing the features of
larger, general-purpose super processors.
The evolution of existing standards and the
introduction of new standards currently
force base station operators to perform
frequent upgrades to their wireless infrastructure,
often requiring board replacements.
To reduce field maintenance, 3G
equipment must be upgradable without
board swapping. The high cost of 3G
equipment often leaves wireless infrastructure
manufacturers with thin profit
margins; costs will have to come down to
enable large-scale deployment.
Finally, OEMs cannot abandon their
current design methods to start designing
3G equipment from scratch. They must be
able to reuse semiconductor IP, code, and
development tools to hit market windows
and to obtain a return on investment.
Meeting these seemingly mutually exclusive goals
requires a combination of top-notch process
technology, a comprehensive component
library, and efficient data communication
methods, both inside the chip and chip-to-chip.
Reaching 3G design goals hinges on
achieving the right balance between the size
and number of data processing components
to keep most of the chip busy at all times,
while reducing the overall distance that the
data has to travel inside ICs. System efficiency
is heavily influenced by design partitioning,
optimization of individual data
processing components, and streamlining
data flow between components. To keep data
processing components busy, inter-processor
data transfers must have low latency and be
precisely deterministic – otherwise components
will waste valuable processing cycles
while waiting for data.
Low latency requires the removal of data
communication bottlenecks by spreading
the data flow over the entire area of a chip.
Transfer determinism is achieved by a combination
of low latency and a uniform data
communications structure inside the chip.
Data Processing Elements
Each 3G chip is likely to contain hundreds
of processing elements, many representing
autonomous processors with their own data
processing flows, control flows, memory,
and communications ports. Some data processing
flows may be augmented with dedicated DSP blocks. Virtex™-II FPGAs support
DSP functions and a MicroBlaze™
soft processor. Virtex-II Pro™ devices also
feature embedded IBM PowerPC™ processors.
Depending on the task at hand, smaller
processors may be better suited for
simpler functions, and larger processors may
be a better fit for more complex algorithms.
In order to work in parallel, processors
must be able to easily communicate with
each other. An efficient way for processors
to communicate is through a fabric dispersed
across the entire design that looks to
individual processors like conventional
memory (Figure 3). This approach enables
each processing element to be developed
and verified individually, yet easily
exchange data with other processors.
Two-Dimensional Data Communications
An effective data interconnect fabric must
support low latency and deterministic data
transfers occurring simultaneously among
multiple processing elements. It must also
be flexible and scalable to allow for the
addition of new elements or the removal of
unwanted elements without affecting the
rest of the design. Finally, it should be compatible
with existing processors and be as
easy to use as accessing memory.
Memory-Like Interface
Using conventional bus cycles to transfer
data between processors dispenses with
exotic and hard-to-implement communications
peripherals and protocols in favor
of a simple memory-like interface. As
shown in Figure 4, 2D-fabric from
CrossBow appears to processors as a memory-mapped peripheral on an IBM
CoreConnect™ bus. PowerPC and
MicroBlaze processors can issue conventional
read and/or write bus cycles to their
local 2D-fabric peripherals to communicate
with other processors on the chip
(Figures 5 and 6). The payload for each
transfer is derived from the data bus. The
destination location and the initial direction
of travel are derived from the address
bus. The transfers are transparent to
the sending and receiving processors, which
launch transfers with write cycles and
terminate them with read cycles.
Routing of data from source to destination,
as well as arbitration with other data traffic,
is performed autonomously by the interconnected
2D-fabric peripherals.
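As a rough illustration of how a destination and an initial direction of travel could be carried in the address of a write cycle, consider the Python sketch below. The article does not give the actual CrossBow address layout, so the base address, the field positions and widths, and the names Direction, encode_address, and decode_address are all hypothetical.

```python
from enum import IntEnum


class Direction(IntEnum):
    """Hypothetical 2-bit codes for the four possible exit directions."""
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3


# ASSUMED layout: none of these constants come from the article.
FABRIC_BASE = 0x8000_0000  # assumed base address of the local 2D-fabric peripheral
DIR_SHIFT = 2              # bits [3:2]: initial direction (bits [1:0] stay zero)
COL_SHIFT = 4              # bits [7:4]: destination column
ROW_SHIFT = 8              # bits [11:8]: destination row
FIELD_MASK = 0xF           # 4-bit fields, so at most a 16 x 16 grid


def encode_address(dest_col, dest_row, direction):
    """Build the address a processor would write to in order to launch a transfer."""
    if not (0 <= dest_col <= FIELD_MASK and 0 <= dest_row <= FIELD_MASK):
        raise ValueError("destination outside the assumed 16 x 16 grid")
    return (FABRIC_BASE
            | (dest_row << ROW_SHIFT)
            | (dest_col << COL_SHIFT)
            | (Direction(direction) << DIR_SHIFT))


def decode_address(address):
    """Recover (column, row, direction) the way a fabric peripheral might."""
    offset = address - FABRIC_BASE
    return ((offset >> COL_SHIFT) & FIELD_MASK,
            (offset >> ROW_SHIFT) & FIELD_MASK,
            Direction((offset >> DIR_SHIFT) & 0x3))
```

Under this assumed layout, a write to encode_address(3, 5, Direction.EAST) would send the word on the data bus toward the node at column 3, row 5, leaving the source node eastward. The processor itself sees only an ordinary memory write.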
A 2D Array of Data Transport Links
Efficient 3G designs will feature global
communication fabrics using single sets of
lines to transfer all kinds of data, including
payloads, control words, and configuration
data. Duplication of data transfer lines
reduces overall system efficiency.
As shown in Figure 7, 2D-fabric peripherals
of adjacent processors are interconnected
with a single mesh of horizontal and
vertical data transport links. Individual bus
cycles are autonomously converted to small
packets that travel between source processors
and destination processors through chains of
2D-fabric peripherals of the intermediate
processors along the way. Short point-to-point
links reduce power consumption.
Small packets with single word payloads
reduce data transfer latencies, enabling data
and control packets to share common transfer
lines. The same lines can also be used for
system initialization and configuration.
Scalability
Scalability is an important requirement for
the design effort and product field upgrades.
Constantly changing standards may require
adding or removing processors late in the
design cycle or even after field deployment.
In the past, adding or removing processors
has always been difficult when using
centralized DMAs for movement of data.
In any centralized I/O structure, removing
or adding new components is likely to
affect other system components. Two-dimensional
I/O structures are much less
sensitive to design changes. Adding another
processor to a chip is as simple as wrapping
it with a 2D-fabric peripheral and
connecting the respective data transport
links to the existing fabric. This can be easily
done without affecting any hardware or
software already in place.
Low Latency and Deterministic Data Transfers
In computing environments where hundreds
of processors are simultaneously
exchanging data, how can you guarantee
that any one of those transfers will
arrive at its destination within a
fixed amount of time? Buses, crossbars,
and other centralized I/O structures force
all data traffic through one central location,
creating huge traffic jams. Two-dimensional
I/O structures, however, can
easily guarantee data delivery by spreading
out data traffic across the design. As
shown in Figure 7, a two-dimensional
data transport grid dispersed across the
entire design area removes communication
bottlenecks to allow individual transfers
to complete on time, without
interfering with other transfers.
Individual processors must use worst-case transfer latency when planning data
transfers. Although it is acceptable for data
to wait to be transferred, processors waiting
for data are wasting precious processing
cycles. Total transfer latency depends on
the worst-case latency across one processing
node and the number of intermediate
processing nodes between the source and
destination nodes.
Worst-Case Latency Across
One Processing Node
A 50 ns packet latency across one node
is the time that elapses from when the
packet starts entering the node to when
it starts exiting that node. The packet
delay time is the time from when the packet
starts entering the node to when it has
completely emerged. Thus, a 100 ns
packet delay time is the 50 ns latency plus
another 50 ns for the packet to fully emerge
from the node.
If packets exiting from a given output
port can arrive from three different sources,
the worst-case latency for any one packet is
250 ns. This is equal to the best-case latency
of 50 ns plus two packet-delay slots of
100 ns each.
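The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and extending the three-source example to any number of sources sharing one output port is an assumption rather than something the article states.

```python
def worst_case_node_latency_ns(best_case_ns, packet_delay_ns, sources):
    """Worst-case latency across one node, in nanoseconds.

    ASSUMPTION: a packet can be held up by at most one packet from each
    of the other sources sharing its output port, and each of those
    costs one full packet delay slot.
    """
    if sources < 1:
        raise ValueError("at least one source must feed the output port")
    delay_slots = sources - 1
    return best_case_ns + delay_slots * packet_delay_ns


# Figures from the text: 50 ns latency, 100 ns delay slots, three sources.
print(worst_case_node_latency_ns(50, 100, 3))  # 250
```

With a single source there are no delay slots, and the result falls back to the 50 ns best case.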
Worst-Case Latency Across
Several Processing Nodes
If the worst-case latency for crossing one
processing node is 250 ns, the worst-case
latency for a transfer chain of two
nodes, for example, amounts to 500
ns. Thus, a packet launched from a
source processor two nodes away from its
destination takes a maximum of 500
ns to arrive at its destination processor,
regardless of any other data traffic in the system
(Figure 8).
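The chain calculation is a simple multiplication, sketched below in Python. The function name is hypothetical, and the sketch assumes every node in the chain has the same per-node latency, as in the example above.

```python
def chain_latency_ns(nodes_crossed, per_node_ns):
    """Latency of a transfer that crosses `nodes_crossed` nodes in sequence.

    ASSUMPTION: every node contributes the same latency, so the total
    is the per-node figure times the number of nodes crossed.
    """
    if nodes_crossed < 0:
        raise ValueError("nodes_crossed cannot be negative")
    return nodes_crossed * per_node_ns


WORST_CASE_NODE_NS = 250  # per-node worst case from the text
BEST_CASE_NODE_NS = 50    # per-node best case from the text

print(chain_latency_ns(2, WORST_CASE_NODE_NS))  # 500
print(chain_latency_ns(2, BEST_CASE_NODE_NS))   # 100
```

The same function gives both bounds for the two-node example: 500 ns in the worst case and 100 ns when no other packet competes for the output ports.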
Total Latency
Because 2D-fabric appears to processors as if
it were memory, and because transfer latency
increases with the geographical distance
from the source, processors can treat transfer
latency as memory wait states for the purpose
of scheduling the transfers. The relationship
is fully deterministic: the farther the
destination, the more wait states are required
to complete a transfer (Figure 4).
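To make the wait-state analogy concrete, the Python sketch below converts a worst-case transfer latency into a number of bus clock cycles. The article does not specify a bus clock, so the 10 ns period (100 MHz) used here is purely an assumed figure, and the function name is hypothetical.

```python
def wait_states(latency_ns, bus_clock_period_ns):
    """Number of whole bus clock cycles needed to cover a transfer latency."""
    if bus_clock_period_ns <= 0:
        raise ValueError("clock period must be positive")
    # Ceiling division: a partial cycle still costs one full wait state.
    return -(-latency_ns // bus_clock_period_ns)


ASSUMED_CLOCK_PERIOD_NS = 10  # ASSUMPTION: 100 MHz bus, not from the article

print(wait_states(250, ASSUMED_CLOCK_PERIOD_NS))  # 25, one node away
print(wait_states(500, ASSUMED_CLOCK_PERIOD_NS))  # 50, two nodes away
```

A scheduler could budget these cycle counts the same way it budgets accesses to slow memory.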
Although actual latency for the above
example is most likely to be closer to the
best-case latency of 100 ns, the worst-case
latency should always be used when planning
data transfers between processors. In
some I/O fabrics, worst-case latency can be
further reduced by launching packets in specific
routing directions to avoid interference
with other packets, thus reducing the number
of packet delay slots from two to one, or
even down to zero.
As shown in Figure 4, 2D-fabric allows
processors to easily determine the worst-case
transfer latency for any destination
inside the chip by simply counting the
number of intermediate nodes. 2D-fabric
also enables packets to be launched in any
one of four possible directions by encoding
exit directions in the address field of each
data write cycle.
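Counting nodes on a grid can be sketched as follows in Python. The sketch rests on several assumptions not stated in the article: nodes are identified by (column, row) coordinates, packets follow a shortest path along the horizontal and vertical links, and every node crossed costs the 250 ns worst case from the earlier example. The function names are hypothetical.

```python
def nodes_away(src, dst):
    """Nodes a packet must cross between two (column, row) grid positions.

    ASSUMPTION: packets take a shortest path along the mesh links, so
    the count is the horizontal distance plus the vertical distance.
    """
    (src_col, src_row), (dst_col, dst_row) = src, dst
    return abs(dst_col - src_col) + abs(dst_row - src_row)


def worst_case_transfer_ns(src, dst, per_node_worst_ns=250):
    """Worst-case latency for one transfer, using the 250 ns example figure."""
    return nodes_away(src, dst) * per_node_worst_ns


print(nodes_away((0, 0), (1, 1)))              # 2
print(worst_case_transfer_ns((0, 0), (1, 1)))  # 500
print(worst_case_transfer_ns((0, 0), (3, 2)))  # 1250
```

Because the count depends only on the two grid positions, each processor can work out its latency budget for every destination ahead of time.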
Conclusion
Two-dimensional inter-processor interfaces
enable fast, easy, and efficient data communications
among hundreds of data processing
elements of 3G functions implemented
inside Virtex-II FPGAs. In addition to 3G,
two-dimensional I/O also benefits voice-over-packet, routers, medical imaging, radar,
and sonar applications. Linking processing
elements with 2D-fabric increases system
performance by enabling multiple processors
to process data in parallel. At the same time,
2D-fabric reduces power consumption by
minimizing the total distance that data has to
travel inside chips.
And because it looks to the processors like
conventional memory, 2D-fabric does not
force system programmers to change their
programming methods to benefit from higher
performance. The investment in serial program
code is preserved, because each processing
element has only one processor.
Finally, system designers can now drastically
increase processing throughput and I/O
bandwidth while retaining current processor
architectures and design tools.
For more information on the 2D-fabric
parallel-processing interface, go to
www.xilinx.com/products/logicore/alliance/crossbow/crossbow.htm.
Printable PDF version of this article. (02/15/03) 300 KB