|
Hardware acceleration speeds the entire design cycle,
and the combination of Tarari’s Content Processing Platform
and Celoxica’s DK Design Suite makes it fast and painless.
Traditional methods of porting algorithms
for hardware acceleration usually
require many steps that are both difficult
and time-consuming. The algorithms
must be ported to a hardware language,
simulated by themselves, and then prototyped
on FPGAs.
During this prototype stage, the
FPGA must also connect to the software
environment. The complete task often
requires a team of hardware engineers
well-versed in FPGA development,
another hardware engineer to develop the
platform, and a software engineer to write
the interfaces from software to hardware.
In addition, software developers must
wait for the hardware to be completed
and tested before integration.
Software developers can get a head start
on developing and debugging software using
the Tarari Content Processing Platform™
(CPP) and the Celoxica DK Design Suite to
co-simulate the algorithm and quickly convert
new or existing C code to hardware for
significant acceleration. Celoxica’s C-to-gates
technology eliminates the language
conversion step. Tarari’s CPP eliminates the
need to understand the complexities of
interconnects and software interfaces,
enabling developers to concentrate on performance
rather than functionality.
Tarari Content Processing Platform
The concept of hardware acceleration is a
simple one: Create a piece of hardware dedicated
to completing tasks much faster than
software or general-purpose hardware. The
reality of creating that piece of dedicated
hardware, however, is much more complex.
Consider just a PCI interface. It has a
bus with several possible configurations,
such as 32- or 64-bit buses, or 33- or 66-MHz clocks. During operation there are
many states and conditions to keep track
of and control. Even SDRAM has somewhat
complex controls. You could buy
each of these elements of the design, but
then you would have to create interfaces
and control mechanisms between them
and tie them all together.
All of this work has been completed
with the Tarari CPP.
Figure 1 shows a block diagram of the
Tarari CPP that includes:
- A half-length PCI-compliant card
- 32- or 64-bit bus width operating at
33- or 66-MHz bus speeds
- 256 MB of DDR SDRAM for high-capacity
global storage
- LEDs for status and debug
- 4 MB of ZBT SSRAM for high-speed,
low-latency local storage
- A 2M-gate XCV2000 Virtex™-II
Platform FPGA performing as the content
processing controller (CPC) that:
– Provides all connectivity between the
PCI Bus, DDR SDRAM, and content
processing engines (CPEs)
– Dynamically reconfigures FPGAs
– Hosts core content processing logic
– Has maximum total data throughput
of 4 GBps
– Supports bus-master direct-memory-access (DMA) mode.
- Two independent content processing
engines (CPE) each consisting of:
– A 1M-gate XC2V1000 Virtex-II
Platform FPGA
– Independent busses for fast algorithmic
processing and continuous operation
during reconfiguration
– Dynamic reconfigurability, allowing
upgrades via software.
The Tarari CPP is a fully developed PCI
card with three Virtex-II FPGAs. The three
Xilinx FPGAs are located under the fan/heat
sink/shroud assembly as shown in Figure 2.
This placement ensures that the components
don’t overheat. The FPGAs require external
cooling because they are pushed to their performance
limits in terms of clock speed and
there is little to no airflow to cool them.
All data flow and control logic is implemented
on the CPC. Using DMA technology,
the CPC can act as a PCI bus master,
allowing data transfer between the host system
memory and the onboard DDR
SDRAM. The bandwidth of the DDR
SDRAM is approximately four times the
bandwidth of the PCI or CPE buses. This
high bandwidth allows the CPC to multiplex
accesses to the memory and makes it
appear that multiple devices are simultaneously
accessing the memory. Finally, the
CPC controls programming of the CPEs
for fast reconfigurability.
The Tarari CPP contains the entire support
infrastructure needed to deploy a PCI
solution, allowing you to focus on algorithm
development rather than on hardware implementation
details. Another key advantage is
time to market; using the Tarari CPP allows
you to skip the work involved in designing
the PCI infrastructure and bypass all of the
manufacturing steps as well. The card is delivered to you ready for production
deployment. Your algorithm firmware can
be downloaded to the card at runtime, and
no hardware customization is required.
Plug it in and start accelerating.
Development Flow
The first step is to select an algorithm for
hardware acceleration. Selection is usually
done using profiling tools that identify
software bottlenecks. The next step is to
determine memory usage, dataflow, and
suitability of the identified code sections
that will be implemented in an FPGA. The
following step is to estimate the size, speed,
and performance metrics at each stage of
the design cycle. With so many parameters
to consider, designers increasingly rely
upon hardware/software co-design, co-simulation,
and prototyping as essential
elements of the development flow.
The Tarari CPP is built on the development
framework shown in Figure 3 to meet
these needs. The flow emphasizes rapid
development with a common C-based language
for both hardware and software components.
This allows maximum
experimentation and exploration of the
design space for better quality of design.
Using this framework, you can:
- Create system models directly from the
original software application and verify
the functionality and characteristics of
a system.
- Quickly identify and extract critical
portions of the algorithms as candidates
for hardware implementation.
- Make informed partitioning decisions
for optimum use of hardware resources
and verify performance and functionality
in simulation.
- Perform cycle-accurate hardware simulations.
- Directly implement C-based hardware
designs onto the Tarari CPP.
This approach reduces the time it takes
to design accelerated applications, thus
lowering cost, improving time to market,
and minimizing risks.
Applications
The best candidate algorithms for hardware
acceleration with Tarari’s CPP board
fit one or more of the following criteria:
- High operation-per-byte ratio
- Benefits from running multiple
instantiations (parallel processing,
multi-threading)
- Capable of being pipelined
- Takes advantage of wide data widths
- CPU execution time far exceeding data
transfer time over PCI.
Where these criteria are met, there has
been considerable success with FPGA
solutions for the following applications:
- Security/encryption – RSA, DES/3-DES, SHA, MD5, AES/Rijndael
- Streaming applications – video processing,
pattern matching/filtering,
compression/decompression
- Vector engines.
Candidate products and applications
incorporating these basic hardware blocks
include:
- Network security
- Media streaming
- XML/Web services accelerations
- Image processing
- High-performance computing
applications.
Algorithm Example
The GZIP (GNU zip) algorithm is a
commonly used utility that compresses
and decompresses data files using the
"Deflate" and other patent-free algorithms
at www.gzip.org. The key feature
of GZIP decompression is that it operates
on a 1-bit wide stream of data, requiring
that all operations reading from the input
data stream are performed sequentially.
Although GZIP decompression fulfills a
limited number of the algorithm selection
criteria described above, the results
demonstrate that the general technique of
offloading a CPU and exploiting algorithm
parallelism can improve system
performance.
Porting GZIP C Source Code to Handel-C
To reduce the time required to implement
the GZIP decompression, GNU C source
code was used for the example. The majority
of the algorithm was used with little or no
modification to provide an immediate basis
for prototyping and a controlled test framework
in which to enhance and optimize the
algorithm. Making use of Handel-C’s simple
support for parallelism via the par{} construct
is an example.
The process for porting the GZIP
algorithm to hardware was to write the
code for one "module" of the system at a
time in Handel-C. Note that the modules
correspond to algorithmic stages in the
decompression algorithm, as shown in
Table 1, rather than more traditional
hardware blocks.
Table 1 - Modules and descriptions for Handel-C GZIP decompression algorithm
| | Module | Function |
| 1 | Main | Detects the type of a compressed data block, and calls the appropriate modules to decompress it. Coordinates operation
of all other modules in the application. |
| 2 | Optimized block memory access | Provides a simple interface for pipelined access to block
RAM on Xilinx FPGAs. Required to achieve high clock rates.
Used to provide RAMs for Huffman tables and LZ77 sliding
window.
| | 3 | Bitstream reader | Reads a variable number of bits from the input bitstream.
| | 4 | Builder for tree description Huffman decoder | For dynamic compressed blocks. Uses the tree
description generated from the input bitstream to build a
Huffman decoder table for the tree descriptions for the main
Huffman decoders.
| | 5 | Tree description Huffman decoder | For dynamic compressed blocks. Decodes the Huffman
codes from the input bitstream, generating tree descriptions
for the "literal/length" and "distance" Huffman decoders.
| | 6 | Builder for "literal/length" Huffman decoder | Uses the tree description generated from the input
bitstream or the static bit-length codes to build a Huffman
decoder table for the "literal/length" Huffman decoder.
| | 7 | Builder for "distance" Huffman decoder | Uses the tree description generated from the input
bitstream or the static bit-length codes to build a Huffman
decoder table for the "distance" Huffman decoder.
| | 8 | "Literal/length" and "distance" Huffman decoders | Decodes the Huffman codes from the input bitstream,
generating "literal" values and "length/distance" pairs.
| | 9 | LZ77 decompress | Uses the LZ77 algorithm on the stream of "literal" values
and length/distance pairs to reconstruct the original uncompressed
data. |
Optimized Block Memory Accesses
The Handel-C language supports the use
of block RAM by adding the specification
with {block=1} after a RAM declaration.
This specification results in using block
memory to build the RAM and generating
a RAM clock at double the algorithm
clock rate, which is used to implement
single-cycle access to the block RAM.
This process provides a quick route to
hardware prototype for the design.
However, using the single-cycle block
RAM access limits the algorithm clock
rate to half of the RAM clock rate. Some
portions of the GZIP decompression
algorithm are sequential in nature, so a
high clock rate is desirable to achieve
maximum performance. Consequently,
for this improved performance, a simple
interface eliminated the double clock
constraint by using pipelined access to
the block RAM.
LZ77 Decompressor
The main component in the hardware
implementation is the "sliding window,"
which is 32 KB in size with byte-wide
access. The sliding window is built using
the pipelined block RAM access macros
mentioned above. Copy operations
resulting from a length/distance pair are
performed using a pipelined loop as
shown in the code in Figure 4, which
reads from one port of the sliding window
block RAM and writes back to the
other port. This results in fast operation
by processing one byte every clock cycle
once the pipeline is full.
Simulation Testing of GZIP Modules
A significant advantage of developing a
GZIP decompression algorithm in
Handel-C is that the simulator can be used
to test the functionality of the design
before implementation. These simulations
are performed rapidly – a model of the
entire GZIP decompression algorithm can
simulate 100 KB of compressed data in 25
seconds or less on a Pentium™ III 850
MHz system. During development of each
GZIP module in Handel-C, we conducted
simulations to verify their operation. The
software GZIP source code was modified
to produce test harnesses implemented as
C functions called directly from the
Handel-C code.
Data Throughput
Given the maximum clock rate of the
design to be 133 MHz – verified by using
the Xilinx ISE 5.1i timing analyzer – the
data throughput of the GZIP decompression
algorithm was calculated from simulation results in terms of the number of clock
cycles required to decompress given files.
Table 2 shows the theoretical peak
and maximum sustained decompressed
data output rate from the GZIP decompression
algorithm. The actual data output
rate depends on the level of
compression – highly compressed files
result in a higher rate.
Table 2 - Performance results for GZIP decompression
| | GZIP Decompression Rate (Mbytes/s) |
|
Theoretical peak | 130 |
| Maximum sustained | 116 |
In comparison, software GZIP has a
peak data output rate of approximately
25 MBps on an average Pentium III-based
system.
Implementation
The Handel-C GZIP decompression algorithm
was synthesized directly to EDIF
using the compiler. Xilinx ISE 5.1i tools
were used to place-and-route the circuitry
and to configure the Xilinx FPGAs on the
Tarari CPP. A board-support package
(BSP) of function calls is provided in the
Tarari Content Processor Development
Kit. The BSP provides a layer of abstraction
that shields users from "low-level detail"
and is used during both simulation and
implementation of the design.
The final implementation size was 40%
of one Virtex-II 1000 Platform FPGA
with speed exceeding the 133 MHz
required for maximum DDR data access
on the Tarari CPP card.
Conclusion
Application acceleration is simplified using
the combination of the Tarari CPP and the
Celoxica DK Design Suite. Together they
provide the ability to deliver hardware
acceleration products with much lower
NRE costs than traditional methods and
with a quicker time to market.
The GZIP decompression algorithm
was used to demonstrate the platform’s
ability to deliver significant acceleration
and an overall performance improvement
for applications. The Celoxica DK Design
Suite allowed interactive simulation and
debugging of the GZIP decompression
algorithm. Interactive simulation delivers
both rapid proof of concept and the completion
of development in only two
months. The Celoxica DK Design Suite –
along with the other design tools, examples,
and documentation from Celoxica –
enable you to quickly develop and deploy
highly accelerated products.
You could continue to optimize performance
for a specific algorithm or, more
important, incorporate additional algorithms
to deliver even higher application
performance benefits. Systems deployed at
customer sites with Tarari CPPs can take
advantage of these enhancements due to
the built-in reconfiguration capabilities of
the Virtex-II FPGAs, resulting in an overall
lower cost of ownership.
To find out more about hardware acceleration
using the Tarari Content Processor
Development Kit, visit www.tarari.com/products-cpdk.html. To discuss how your
applications can be accelerated, contact
Tarari at sales@tarari.com.
Printable PDF version of this article. (05/02/03) 200 KB |