Performance in today’s systems is defined
by more than FPGA clock rates. Every system
has different requirements, and the
maximum achievable performance is determined
by various factors such as logic fabric
performance, I/O bandwidth,
embedded processing, and DSP performance,
among others. These requirements
can also be subject to power restrictions, as
well as signal integrity and cost budgets.
Xilinx® developed the Virtex™-4
FPGA family after consulting hundreds
of customers to address these requirements
and make it easier than ever to
meet system performance goals. In this
article, we’ll look at how Virtex-4 FPGAs
provide new and unique capabilities to
help you meet diverse requirements for
system performance.
System Design Challenges
With each new generation of devices, semiconductor
vendors are able to offer higher
clock rates, due to shrinking process
geometries. However, today’s system performance
challenges go beyond traditional
glue logic and maximized clock rates. In a
PC, for example, the real system performance
bottleneck lies not in clock frequency
but in how the other blocks of the system
work together at the desired frequency.
Let’s consider these challenges in the perspective
of applications employing high-performance
FPGAs. Seemingly diverse
applications like video stream processing,
packet data processing, storage systems,
wireless base stations, and many others
incorporate similar functions, including:
- Incoming and outgoing data streams
- Bridging multiple connectivity
standards
- Arithmetic and DSP (signal conditioning
and data processing)
- External memory interfacing
- State machines
- Data buffering
- Embedded processing (Figure 1)
To facilitate these applications, Virtex-4
FPGAs include common building blocks as
embedded – yet parameterizable – hard IP.
The integration of complex functions like
DSP slices, embedded CPUs, dedicated
I/O circuitry, and on-chip RAM (block
RAM, FIFOs) provides you with unprecedented
capabilities to build programmable
systems within a single FPGA device.
Meeting system requirements takes the
right combination of I/O bandwidth, programmable
logic, on-chip RAM, DSP, and
embedded processing. To provide the ideal
combination of functions, Virtex-4 FPGAs
come in three flavors (LX, SX, and FX platforms)
comprising 17 devices.
Virtex-4 FPGAs offer not only
enhanced logic fabric capabilities, but also
customized XtremeDSP™ MACs and
embedded PowerPC™ processors that give
you enough performance headroom to
reach your design performance goals.
I/O bandwidth is often the limiting factor
in the quest for performance. To remove
I/O bottlenecks, Virtex-4 FPGAs have
unique built-in 1 Gbps ChipSync™ source-synchronous
circuitry and 622 Mbps to
10.3125 Gbps serial transceivers that can
help you achieve bandwidth targets.
System Performance Categories
Let’s look at various aspects of performance
and Virtex-4 FPGAs in the context of seven major performance categories: logic fabric,
embedded processing, DSP, on-chip RAM,
high-speed serial, I/O memory bandwidth,
and I/O LVDS bandwidth. Figure 2 offers
a comparison with the nearest 90 nm
FPGA vendor in each of these categories.
Logic Fabric Performance
Xilinx enhanced the performance of its
already fast programmable logic fabric by
building Virtex-4 devices with advanced 90 nm technology. A flexible
look-up table (LUT) architecture
(with the ability to convert any
LUT into a 16-bit RAM or 16-bit
shift register), a high-speed carry
chain, and arithmetic blocks provide
further performance gains.
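
To make the LUT flexibility concrete, here is a minimal behavioural sketch in Python. It treats a 4-input LUT as 16 bits of storage and reuses that same storage as a logic function, a 16-bit RAM, or a 16-bit shift register. The class name `Lut16` and its methods are hypothetical and exist only for illustration; this is not a description of the actual silicon.

```python
class Lut16:
    """Behavioural model of a 4-input LUT as 16 bits of storage (hypothetical)."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFF  # 16 configuration/storage bits

    def read(self, addr):
        """Logic or RAM read: the 4 inputs select one of the 16 stored bits."""
        return (self.bits >> (addr & 0xF)) & 1

    def write(self, addr, value):
        """RAM mode: overwrite the single addressed bit."""
        addr &= 0xF
        self.bits = (self.bits & ~(1 << addr) & 0xFFFF) | ((value & 1) << addr)

    def shift_in(self, value):
        """Shift-register mode: new bit enters at tap 0, oldest bit drops off."""
        self.bits = ((self.bits << 1) | (value & 1)) & 0xFFFF


# Logic mode: the pattern 0x8000 behaves as a 4-input AND
# (only address 15, all inputs high, returns 1).
and4 = Lut16(0x8000)

# Shift-register mode: after shifting in 1, 0, 0 the '1' sits at tap 2.
srl = Lut16()
for bit in (1, 0, 0):
    srl.shift_in(bit)
```

The point of the sketch is that one small storage element serves three roles, which is why no extra fabric is needed to build short delay lines or tiny memories.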
The 500 MHz global clocking
structure, the key driver behind
logic performance, is fully differential
to reduce skew, jitter, and duty-cycle
distortion. Virtex-4 FPGAs
also provide a hierarchical clocking
structure (global and regional
clocks) and clock management circuitry.
Evaluations of logic fabric
performance using a suite of real-world
designs demonstrate a performance
advantage of as much as
70% over our nearest 90 nm competitor.
Averaged across this suite of designs, the
Virtex-4 performance advantage is 15%.
This performance boost means that Virtex-4 devices effectively provide an extra speed-grade
advantage.
Embedded Processing
Virtex-4 FX platform FPGAs provide up
to two enhanced PowerPC 405 cores,
each delivering 702 DMIPS performance at 450 MHz, while consuming only 0.45
mW/MHz. This is more than three times
the performance of the best soft microprocessor
cores.
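
The processor figures above can be turned into per-megahertz and per-core numbers with simple arithmetic. The short Python sketch below uses only the values quoted in this article; the variable names are our own, and the soft-core bound is merely what "more than three times" implies, not a measured figure.

```python
DMIPS = 702          # per PowerPC 405 core, from the article
CLOCK_MHZ = 450      # clock rate at which that figure is quoted
MW_PER_MHZ = 0.45    # power per megahertz, from the article

# Efficiency of the core, independent of clock rate.
dmips_per_mhz = DMIPS / CLOCK_MHZ

# Power drawn by one core at the quoted clock rate, in milliwatts.
core_power_mw = MW_PER_MHZ * CLOCK_MHZ

# "More than three times the best soft core" implies the soft core
# delivers less than this many DMIPS.
soft_core_upper_bound_dmips = DMIPS / 3

print(f"{dmips_per_mhz:.2f} DMIPS/MHz, {core_power_mw:.1f} mW per core")
```

Under these numbers, each core works out to about 1.56 DMIPS/MHz and roughly 200 mW at 450 MHz.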
Moreover, the new Auxiliary Processor
Unit (APU) controller makes it easy to reach
even higher levels of performance by integrating
custom co-processors and hardware
accelerators. The APU controller provides a
low-latency path for connecting co-processor
modules implemented in the FPGA to the
embedded PowerPC processor. These user-defined,
configurable hardware accelerator
functions operate as extensions to the
PowerPC 405, offloading the CPU from
demanding computational tasks. For example,
implementing floating-point calculations
in hardware improves performance by a
factor of 20 over software emulation. A
10/100/1000 Mbps tri-mode Ethernet
MAC implemented alongside a PowerPC
processor enables Ethernet connectivity.
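
How much a 20x hardware floating-point speedup helps your whole application depends on how much of the run time is floating-point work. The sketch below applies Amdahl's law, which is our own addition and not part of the article's measurements; the function name and the 50% workload split are assumptions chosen only to illustrate the effect.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: whole-program speedup when only a fraction is accelerated.

    accelerated_fraction: share of original run time that the co-processor
                          takes over (0.0 to 1.0), an assumed workload property.
    factor:               speedup of that share (20 for the floating-point
                          example quoted in the article).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical workload: half of the run time is floating-point emulation.
half_fp = overall_speedup(0.5, 20)
print(f"Overall speedup: {half_fp:.2f}x")
```

With half the run time accelerated, the overall gain is a little under 2x, which is why it pays to profile the software before deciding what to move behind the APU.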
DSP Performance
The XtremeDSP™ slice is a versatile,
user-configurable block providing twice
the DSP performance of previous implementations
while drawing less than 1/7th
the power. Each slice contains a dedicated
two’s-complement, signed 18 x 18 bit multiplier
and a three-input adder/subtracter/accumulator with a feedback path.
With as many as 512 XtremeDSP slices
running at 500 MHz, a single Virtex-4
FPGA delivers 256 GigaMAC/s (18 x 18
GMACs) performance.
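
The 256 GigaMAC/s figure follows directly from the slice count and the clock rate, assuming every slice completes one multiply-accumulate per clock cycle. A minimal Python check (the function name is ours):

```python
def peak_gmacs(dsp_slices, clock_mhz):
    """Peak multiply-accumulates per second, in units of 1e9 (GMAC/s).

    Assumes each slice completes one 18 x 18 MAC on every clock cycle.
    """
    return dsp_slices * clock_mhz * 1e6 / 1e9


largest_device = peak_gmacs(512, 500)
print(f"{largest_device:.0f} GMAC/s")
```

This is a peak number; a real design reaches it only if all 512 slices are kept busy on every cycle.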
You can configure the XtremeDSP slices
to implement multipliers, counters, multiply-
accumulators, and many more functions,
all without consuming logic fabric resources.
The ability to implement complex systolic
functions without incurring the delay of fabric
routing provides significant performance
gains. For example, in a 32-tap FIR implementation,
the Virtex-4 FPGA outperforms
competing devices by 40%.
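
To show why a chain of MAC stages maps so naturally onto cascaded slices, here is a small cycle-by-cycle Python model of a FIR filter built from one multiply-add stage per tap. It uses the transposed form, which is one of several possible structures and is our choice for illustration; the article does not say which structure the benchmark used, and the function name is hypothetical.

```python
def fir_mac_chain(samples, coeffs):
    """Cycle-by-cycle model of a FIR filter with one multiply-add stage per tap.

    Transposed form: on each clock, stage k computes
        coeffs[k] * x + (register of stage k + 1)
    and stores the result in its own register. Stage 0 produces the output.
    """
    taps = len(coeffs)
    regs = [0] * taps  # one accumulator register per stage
    outputs = []
    for x in samples:
        # All stages update in parallel from the previous register values.
        regs = [
            coeffs[k] * x + (regs[k + 1] if k + 1 < taps else 0)
            for k in range(taps)
        ]
        outputs.append(regs[0])
    return outputs


# Impulse response of a 3-tap example reproduces the coefficients.
impulse_response = fir_mac_chain([1, 0, 0, 0], [1, 2, 3])
```

Each stage talks only to its neighbour, so in hardware the partial sums can travel over dedicated cascade paths between slices instead of general fabric routing.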
On-Chip Memory Performance
The Virtex-4 family carries forward the size
and basic structure of on-chip memory, 18
Kb dual-port block RAM (proven in previous
generations), but adds a data-output
pipeline register to increase speed to 500
MHz. The two ports still have individual
width control, and in write mode you can
choose between automatically reading the
previously stored data or the new data. Two
neighboring block RAMs, when combined,
form a 32K x 1 RAM without loss of speed,
or a 512-deep 64-wide RAM with automatic
Hamming error correction – without
using any extra logic.
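
The two combined configurations can be checked with a little arithmetic. The Python sketch below assumes a standard single-error-correcting, double-error-detecting (SECDED) extended Hamming code, which is our assumption about the scheme; under that assumption a 64-bit word needs 8 check bits, and the resulting 72-bit words exactly fill two 18 Kb blocks.

```python
def hamming_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


BLOCK_RAM_BITS = 18 * 1024          # one 18 Kb block RAM
PAIR_BITS = 2 * BLOCK_RAM_BITS      # two neighbouring blocks combined

# Assumed SECDED code: Hamming check bits plus one overall parity bit.
check_bits = hamming_check_bits(64) + 1
word_bits = 64 + check_bits
ecc_ram_bits = 512 * word_bits      # 512-deep ECC memory

# The 32K x 1 configuration uses data bits only.
wide_address_bits = 32 * 1024 * 1

print(check_bits, word_bits, ecc_ram_bits == PAIR_BITS)
```

The check bits live in storage the block pair already provides, which is consistent with the claim that error correction costs no extra logic.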
Each block RAM also contains its own
FIFO controller, a unique Virtex-4 FPGA
feature that provides 500 MHz functionality
without additional logic resources.
Compared to competing devices, the
block RAMs provide at least 20% better
performance.
But getting your FPGA internal blocks to
run fast is only half the battle. Maximum system
performance requires efficient interaction
between the FPGA and other
components in your system. Virtex-4 FPGAs
offer the flexibility to achieve the highest
possible bandwidth for chip-to-chip, board-to-board, and box-to-box connectivity.
High-Speed Serial I/O
As designs move to faster interface speeds,
serial interconnect saves power and board
space while reducing design complexity and
cost. Virtex-4 RocketIO™ MGTs offer
performance from 622 Mbps to 10.3125
Gbps, one of the broadest ranges offered by
any device. The transceivers are fully programmable
and can implement a myriad of
speeds and serial standards. Link-layer IP is
available for such standards as PCI Express,
Serial-ATA, Fibre Channel, Gigabit
Ethernet, and Aurora.
Memory I/O Bandwidth
The great majority of systems today need a
data buffer external to the FPGA for temporary
storage. This buffer’s bandwidth
can be the critical factor in determining
overall performance.
Memory interfaces like DDR2
SDRAM, QDR II SRAM, or RLDRAM II
are source-synchronous, with per-pin data
rates of more than 533 Mbps. Memory
bandwidth is determined not only by the
per-pin data rate but also by the width of
the bus. The ChipSync circuitry built into
every I/O simplifies the physical layer
interface and provides the capability to
implement buses three times wider than
other programmable solutions, for bandwidths
as high as 260 Gbps.
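
Because bandwidth is simply the per-pin data rate multiplied by the bus width, you can estimate what a given interface delivers, or how many data pins a bandwidth target implies. In the Python sketch below the 72-bit bus is a hypothetical example, and the pin count assumes every data pin runs at exactly 533 Mbps, which the article does not state.

```python
import math


def bus_bandwidth_gbps(per_pin_mbps, bus_width_bits):
    """Raw interface bandwidth in Gbps: per-pin data rate times bus width."""
    return per_pin_mbps * bus_width_bits / 1000.0


# Hypothetical 72-bit interface at 533 Mbps per pin.
example_gbps = bus_bandwidth_gbps(533, 72)

# Data pins implied by the 260 Gbps figure if each pin ran at 533 Mbps.
pins_for_260_gbps = math.ceil(260_000 / 533)

print(f"{example_gbps:.1f} Gbps, {pins_for_260_gbps} pins")
```

The second number shows why bus width matters as much as pin speed: reaching the headline figure takes several hundred data pins spread across multiple wide interfaces.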
To enable reliable data capture,
ChipSync circuitry also includes built-in
delay elements, adjustable in 75 ps increments,
to ensure the proper alignment
between clock and data signals. The
unique capability to calibrate timing at
run time, rather than at design time, substantially
improves design margins. Xilinx
also provides hardware-verified reference
designs, development systems, and software
tools to further speed up the implementation
of memory interfaces.
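
Run-time calibration can be pictured as a sweep: step the delay through its taps, note which settings capture data reliably, and park the delay in the middle of the widest good window. The Python sketch below illustrates that idea only. The `sample_ok` callback, the function name, and the tap count of 64 are assumptions for illustration; only the 75 ps step comes from the article, and this is not Xilinx's reference design.

```python
TAP_PS = 75     # delay per tap, from the article
NUM_TAPS = 64   # assumed number of taps, for illustration only


def calibrate(sample_ok, num_taps=NUM_TAPS):
    """Return the tap at the centre of the widest passing window, or None.

    sample_ok(tap) is a hypothetical callback that returns True when data
    is captured correctly with the delay set to the given tap.
    """
    best_start, best_len = None, 0
    start = None
    # One extra iteration closes a window that runs to the last tap.
    for tap in range(num_taps + 1):
        ok = tap < num_taps and sample_ok(tap)
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical data eye: capture succeeds for taps 10 through 20.
chosen_tap = calibrate(lambda tap: 10 <= tap <= 20)
chosen_delay_ps = chosen_tap * TAP_PS
```

Because the sweep runs on the actual board, the chosen setting absorbs trace-length and process variation that a design-time estimate would have to cover with extra margin.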
LVDS I/O Bandwidth
ChipSync technology simplifies the design
of differential parallel bus interfaces, with
embedded SERDES blocks that serialize
and de-serialize parallel interfaces to match
the data rate to the speed of the internal
FPGA circuits. Additionally, this technology
provides per-bit and per-channel de-skew
for increased design margins, simplifying
the design of interfaces such as SPI-4.2,
XSBI, and SFI-4, as well as RapidIO.
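
The rate-matching role of a SERDES block is easy to model: a fast serial bit stream is regrouped into parallel words that the fabric handles at a fraction of the line rate. In the Python sketch below the 8-bit width and the first-bit-is-LSB ordering are assumptions chosen for illustration, and the function names are hypothetical.

```python
def serialize(words, width):
    """Flatten parallel words into a bit stream, least significant bit first."""
    return [(word >> pos) & 1 for word in words for pos in range(width)]


def deserialize(bits, width):
    """Regroup a bit stream into parallel words; leftover bits are dropped."""
    words = []
    for i in range(0, len(bits) - width + 1, width):
        word = 0
        for pos, bit in enumerate(bits[i:i + width]):
            word |= (bit & 1) << pos
        words.append(word)
    return words


def fabric_clock_mhz(line_rate_mbps, width):
    """Parallel-side word rate needed to keep up with the serial line."""
    return line_rate_mbps / width


round_trip = deserialize(serialize([0xA5, 0x3C], 8), 8)
word_rate = fabric_clock_mhz(1000, 8)  # 1 Gbps line, assumed 8-bit words
```

With these assumptions a 1 Gbps pin presents the fabric with 8-bit words at 125 MHz, well within the internal clock rates discussed earlier.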
Virtex-4 FPGAs incorporate ChipSync
technology into every I/O, providing the
most flexible I/O solution available. This
enables wider 1 Gbps LVDS buses for up to
480 Gbps bandwidth, 60% higher than the
competition.
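
A quick back-of-the-envelope check, using only the numbers quoted above, shows what the 480 Gbps figure implies. The variable names are ours, and the competitor value is derived from the "60% higher" statement rather than taken from a datasheet.

```python
LVDS_GBPS_PER_PAIR = 1.0   # per differential pair, from the article
TOTAL_GBPS = 480.0         # aggregate LVDS bandwidth, from the article
ADVANTAGE = 0.60           # "60% higher than the competition"

# Differential pairs needed to reach the aggregate figure.
pairs_needed = TOTAL_GBPS / LVDS_GBPS_PER_PAIR

# Aggregate bandwidth this implies for the competing device.
implied_competitor_gbps = TOTAL_GBPS / (1.0 + ADVANTAGE)

print(f"{pairs_needed:.0f} pairs, competitor about {implied_competitor_gbps:.0f} Gbps")
```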
Other Performance Challenges
Achieving the desired system performance
with your FPGA is often impeded by signal
integrity, cost, and power budget
restrictions.
The innovative Application Specific
Modular Block (ASMBL) architecture
enables I/O, clock, power, and ground pins
to be located anywhere on the silicon chip,
not just along the periphery. This architecture
alleviates the problems associated with
I/O and array dependency, power and
ground distribution, and hard-IP scaling.
Furthermore, the Virtex-4 FPGA packaging
technology, SparseChevron, enables
distribution of power and ground pins
evenly across the package. The benefit to
you is improved signal integrity. As
demonstrated by Dr. Howard Johnson,
Virtex-4 FPGA devices have seven times
less simultaneously switching output
(SSO) noise and crosstalk when compared
to competing devices.
The ASMBL architecture, with its column-based implementation of programmable
logic, DSP slices, block RAM, I/O
columns, MGTs, clocking, and PowerPC
embedded cores, provides another significant
benefit in that it allows a more flexible allocation
of resources. This enables Xilinx to
offer three Virtex-4 FPGA platforms: the LX
platform, optimized for logic resources; the
SX platform, optimized for DSP; and the FX
platform, optimized for embedded processing
and high-speed serial applications.
Device power budgets impose an additional
impediment to meeting performance
goals. Because power consumption increases
with clock rate, you may exceed your
power budget at frequencies below your
performance target, even if your chosen
device has more performance on tap.
Selecting a device with low power consumption
will help you achieve performance
goals while staying within your power
budget, and can deliver the additional benefits
of lower system cost and higher reliability
through reduced power supply and
cooling requirements.
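
The interaction between power budget and clock rate can be sketched with a simple model: a fixed static term plus a dynamic term that grows linearly with frequency. The Python example below uses that simplified model, and every number in it is hypothetical; it is meant only to show how lower static and dynamic power translate into usable clock rate.

```python
def max_clock_mhz(budget_w, static_w, dynamic_w_per_mhz):
    """Highest clock rate that keeps static + dynamic power within budget.

    Simplified model (an assumption): total power = static_w
    + dynamic_w_per_mhz * clock_mhz.
    """
    if budget_w <= static_w:
        return 0.0
    return (budget_w - static_w) / dynamic_w_per_mhz


TARGET_MHZ = 500   # hypothetical performance target
BUDGET_W = 5.0     # hypothetical device power budget

# Hypothetical device A: higher static and dynamic power.
device_a_mhz = max_clock_mhz(BUDGET_W, static_w=1.0, dynamic_w_per_mhz=0.010)

# Hypothetical device B: lower static and dynamic power.
device_b_mhz = max_clock_mhz(BUDGET_W, static_w=0.5, dynamic_w_per_mhz=0.008)

print(device_a_mhz >= TARGET_MHZ, device_b_mhz >= TARGET_MHZ)
```

In this made-up case device A runs out of power budget at about 400 MHz, short of the target, while device B clears 500 MHz with the same budget.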
Virtex-4 FPGAs incorporate unique
triple-oxide 90 nm technology that significantly
reduces static power. Additionally, by
implementing commonly used functions
as embedded IP, Virtex-4 FPGAs further
reduce dynamic power when compared
to previous generations or competing
devices. Measurements and analysis comparing Xilinx
tools and silicon with competing tools and silicon show
that Virtex-4 FPGAs consume 1 to 5 W less
than the competition’s 90 nm FPGAs.
Conclusion
Virtex-4 FPGAs incorporate innovative
built-in silicon features, extensive embedded
IP, triple-oxide 90 nm technology, and
unique packaging to provide designers with
capabilities that enable breakthrough performance
at the lowest cost.
For more information about getting
started with your Virtex-4 FPGA design,
visit www.xilinx.com/virtex4.