Support|documentation
 
 
Home : Publications : Xcell Journal Online : Article

Tarari and Celoxica Deliver Fast and Easy Algorithm Acceleration
   
     
   
   
   
 
  Xcell Journal Home
  Xcell Archives
   
  Subscription
  Comments & Suggestions
  Write Articles for Xcell
   
   
   
   
 
by Dale Riley, Systems Engineer, Tarari, Inc.
dale.riley@tarari.com (05/02/03)

Hardware acceleration speeds the entire design cycle, and the combination of Tarari’s Content Processing Platform and Celoxica’s DK Design Suite makes it fast and painless.

Traditional methods of porting algorithms for hardware acceleration usually require many steps that are both difficult and time-consuming. The algorithms must be ported to a hardware language, simulated by themselves, and then prototyped on FPGAs.

During this prototype stage, the FPGA must also connect to the software environment. The complete task often requires a team of hardware engineers well-versed in FPGA development, another hardware engineer to develop the platform, and a software engineer to write the interfaces from software to hardware. In addition, software developers must wait for the hardware to be completed and tested before integration.

Software developers can get a head start on developing and debugging software using the Tarari Content Processing Platform™ (CPP) and the Celoxica DK Design Suite to co-simulate the algorithm and quickly convert new or existing C code to hardware for significant acceleration. Celoxica’s C-to-gates technology eliminates the language conversion step. Tarari’s CPP eliminates the need to understand the complexities of interconnects and software interfaces, enabling developers to concentrate on performance rather than functionality.

Tarari Content Processing Platform

The concept of hardware acceleration is a simple one: Create a piece of hardware dedicated to completing tasks much faster than software or general-purpose hardware. The reality of creating that piece of dedicated hardware, however, is much more complex.

Consider just a PCI interface. It has a bus with several possible configurations, such as 32- or 64-bit buses, or 33- or 66-MHz clocks. During operation there are many states and conditions to keep track of and control. Even SDRAM has somewhat complex controls. You could buy each of these elements of the design, but then you would have to create interfaces and control mechanisms between them and tie them all together.

All of this work has been completed with the Tarari CPP.

Figure 1 shows a block diagram of the Tarari CPP that includes:

  • A half-length PCI-compliant card
  • 32- or 64-bit bus width operating at 33- or 66-MHz bus speeds
  • 256 MB of DDR SDRAM for high-capacity global storage
  • LEDs for status and debug
  • 4 MB of ZBT SSRAM for high-speed, low-latency local storage
  • A 2M-gate XCV2000 Virtex™-II Platform FPGA performing as the content processing controller (CPC) that:
    – Provides all connectivity between the PCI Bus, DDR SDRAM, and content processing engines (CPEs)
    – Dynamically reconfigures FPGAs
    – Hosts core content processing logic
    – Has maximum total data throughput of 4 GBps
    – Supports bus-master direct-memory-access (DMA) mode.
  • Two independent content processing engines (CPE) each consisting of:
    – A 1M-gate XC2V1000 Virtex-II Platform FPGA
    – Independent busses for fast algorithmic processing and continuous operation during reconfiguration
    – Dynamic reconfigurability, allowing upgrades via software.
The Tarari CPP is a fully developed PCI card with three Virtex-II FPGAs. The three Xilinx FPGAs are located under the fan/heat sink/shroud assembly as shown in Figure 2. This placement ensures that the components don’t overheat. The FPGAs require external cooling because they are pushed to their performance limits in terms of clock speed and there is little to no airflow to cool them.

All data flow and control logic is implemented on the CPC. Using DMA technology, the CPC can act as a PCI bus master, allowing data transfer between the host system memory and the onboard DDR SDRAM. The bandwidth of the DDR SDRAM is approximately four times the bandwidth of the PCI or CPE buses. This high bandwidth allows the CPC to multiplex accesses to the memory and makes it appear that multiple devices are simultaneously accessing the memory. Finally, the CPC controls programming of the CPEs for fast reconfigurability.

The Tarari CPP contains the entire support infrastructure needed to deploy a PCI solution, allowing you to focus on algorithm development rather than on hardware implementation details. Another key advantage is time to market; using the Tarari CPP allows you to skip the work involved in designing the PCI infrastructure and bypass all of the manufacturing steps as well. The card is delivered to you ready for production deployment. Your algorithm firmware can be downloaded to the card at runtime, and no hardware customization is required. Plug it in and start accelerating.

Development Flow

The first step is to select an algorithm for hardware acceleration. Selection is usually done using profiling tools that identify software bottlenecks. The next step is to determine memory usage, dataflow, and suitability of the identified code sections that will be implemented in an FPGA. The following step is to estimate the size, speed, and performance metrics at each stage of the design cycle. With so many parameters to consider, designers increasingly rely upon hardware/software co-design, co-simulation, and prototyping as essential elements of the development flow.

The Tarari CPP is built on the development framework shown in Figure 3 to meet these needs. The flow emphasizes rapid development with a common C-based language for both hardware and software components. This allows maximum experimentation and exploration of the design space for better quality of design.

Using this framework, you can:

  • Create system models directly from the original software application and verify the functionality and characteristics of a system.
  • Quickly identify and extract critical portions of the algorithms as candidates for hardware implementation.
  • Make informed partitioning decisions for optimum use of hardware resources and verify performance and functionality in simulation.
  • Perform cycle-accurate hardware simulations.
  • Directly implement C-based hardware designs onto the Tarari CPP.
This approach reduces the time it takes to design accelerated applications, thus lowering cost, improving time to market, and minimizing risks.

Applications

The best candidate algorithms for hardware acceleration with Tarari’s CPP board fit one or more of the following criteria:

  • High operation-per-byte ratio
  • Benefits from running multiple instantiations (parallel processing, multi-threading)
  • Capable of being pipelined
  • Takes advantage of wide data widths
  • CPU execution time far exceeding data transfer time over PCI.
Where these criteria are met, there has been considerable success with FPGA solutions for the following applications:
  • Security/encryption – RSA, DES/3-DES, SHA, MD5, AES/Rijndael
  • Streaming applications – video processing, pattern matching/filtering, compression/decompression
  • Vector engines.
Candidate products and applications incorporating these basic hardware blocks include:
  • Network security
  • Media streaming
  • XML/Web services accelerations
  • Image processing
  • High-performance computing applications.
Algorithm Example

The GZIP (GNU zip) algorithm is a commonly used utility that compresses and decompresses data files using the "Deflate" and other patent-free algorithms at www.gzip.org. The key feature of GZIP decompression is that it operates on a 1-bit wide stream of data, requiring that all operations reading from the input data stream are performed sequentially. Although GZIP decompression fulfills a limited number of the algorithm selection criteria described above, the results demonstrate that the general technique of offloading a CPU and exploiting algorithm parallelism can improve system performance.

Porting GZIP C Source Code to Handel-C
To reduce the time required to implement the GZIP decompression, GNU C source code was used for the example. The majority of the algorithm was used with little or no modification to provide an immediate basis for prototyping and a controlled test framework in which to enhance and optimize the algorithm. Making use of Handel-C’s simple support for parallelism via the par{} construct is an example.

The process for porting the GZIP algorithm to hardware was to write the code for one "module" of the system at a time in Handel-C. Note that the modules correspond to algorithmic stages in the decompression algorithm, as shown in Table 1, rather than more traditional hardware blocks.

Table 1 - Modules and descriptions for Handel-C GZIP decompression algorithm
 ModuleFunction
1 MainDetects the type of a compressed data block, and calls the appropriate modules to decompress it. Coordinates operation of all other modules in the application.
2 Optimized block memory accessProvides a simple interface for pipelined access to block RAM on Xilinx FPGAs. Required to achieve high clock rates. Used to provide RAMs for Huffman tables and LZ77 sliding window.
3 Bitstream reader Reads a variable number of bits from the input bitstream.
4 Builder for tree description Huffman decoderFor dynamic compressed blocks. Uses the tree description generated from the input bitstream to build a Huffman decoder table for the tree descriptions for the main Huffman decoders.
5 Tree description Huffman decoderFor dynamic compressed blocks. Decodes the Huffman codes from the input bitstream, generating tree descriptions for the "literal/length" and "distance" Huffman decoders.
6 Builder for "literal/length" Huffman decoderUses the tree description generated from the input bitstream or the static bit-length codes to build a Huffman decoder table for the "literal/length" Huffman decoder.
7 Builder for "distance" Huffman decoder Uses the tree description generated from the input bitstream or the static bit-length codes to build a Huffman decoder table for the "distance" Huffman decoder.
8 "Literal/length" and "distance" Huffman decoders Decodes the Huffman codes from the input bitstream, generating "literal" values and "length/distance" pairs.
9 LZ77 decompress Uses the LZ77 algorithm on the stream of "literal" values and length/distance pairs to reconstruct the original uncompressed data.

Optimized Block Memory Accesses
The Handel-C language supports the use of block RAM by adding the specification with {block=1} after a RAM declaration. This specification results in using block memory to build the RAM and generating a RAM clock at double the algorithm clock rate, which is used to implement single-cycle access to the block RAM. This process provides a quick route to hardware prototype for the design. However, using the single-cycle block RAM access limits the algorithm clock rate to half of the RAM clock rate. Some portions of the GZIP decompression algorithm are sequential in nature, so a high clock rate is desirable to achieve maximum performance. Consequently, for this improved performance, a simple interface eliminated the double clock constraint by using pipelined access to the block RAM.

LZ77 Decompressor
The main component in the hardware implementation is the "sliding window," which is 32 KB in size with byte-wide access. The sliding window is built using the pipelined block RAM access macros mentioned above. Copy operations resulting from a length/distance pair are performed using a pipelined loop as shown in the code in Figure 4, which reads from one port of the sliding window block RAM and writes back to the other port. This results in fast operation by processing one byte every clock cycle once the pipeline is full.

Simulation Testing of GZIP Modules
A significant advantage of developing a GZIP decompression algorithm in Handel-C is that the simulator can be used to test the functionality of the design before implementation. These simulations are performed rapidly – a model of the entire GZIP decompression algorithm can simulate 100 KB of compressed data in 25 seconds or less on a Pentium™ III 850 MHz system. During development of each GZIP module in Handel-C, we conducted simulations to verify their operation. The software GZIP source code was modified to produce test harnesses implemented as C functions called directly from the Handel-C code.

Data Throughput
Given the maximum clock rate of the design to be 133 MHz – verified by using the Xilinx ISE 5.1i timing analyzer – the data throughput of the GZIP decompression algorithm was calculated from simulation results in terms of the number of clock cycles required to decompress given files.

Table 2 shows the theoretical peak and maximum sustained decompressed data output rate from the GZIP decompression algorithm. The actual data output rate depends on the level of compression – highly compressed files result in a higher rate.

Table 2 - Performance results for GZIP decompression
 GZIP Decompression Rate (Mbytes/s)
Theoretical peak 130
Maximum sustained 116

In comparison, software GZIP has a peak data output rate of approximately 25 MBps on an average Pentium III-based system.

Implementation
The Handel-C GZIP decompression algorithm was synthesized directly to EDIF using the compiler. Xilinx ISE 5.1i tools were used to place-and-route the circuitry and to configure the Xilinx FPGAs on the Tarari CPP. A board-support package (BSP) of function calls is provided in the Tarari Content Processor Development Kit. The BSP provides a layer of abstraction that shields users from "low-level detail" and is used during both simulation and implementation of the design.

The final implementation size was 40% of one Virtex-II 1000 Platform FPGA with speed exceeding the 133 MHz required for maximum DDR data access on the Tarari CPP card.

Conclusion

Application acceleration is simplified using the combination of the Tarari CPP and the Celoxica DK Design Suite. Together they provide the ability to deliver hardware acceleration products with much lower NRE costs than traditional methods and with a quicker time to market.

The GZIP decompression algorithm was used to demonstrate the platform’s ability to deliver significant acceleration and an overall performance improvement for applications. The Celoxica DK Design Suite allowed interactive simulation and debugging of the GZIP decompression algorithm. Interactive simulation delivers both rapid proof of concept and the completion of development in only two months. The Celoxica DK Design Suite – along with the other design tools, examples, and documentation from Celoxica – enable you to quickly develop and deploy highly accelerated products.

You could continue to optimize performance for a specific algorithm or, more important, incorporate additional algorithms to deliver even higher application performance benefits. Systems deployed at customer sites with Tarari CPPs can take advantage of these enhancements due to the built-in reconfiguration capabilities of the Virtex-II FPGAs, resulting in an overall lower cost of ownership.

To find out more about hardware acceleration using the Tarari Content Processor Development Kit, visit www.tarari.com/products-cpdk.html. To discuss how your applications can be accelerated, contact Tarari at sales@tarari.com.

Printable PDF version of this article. PDF logo (05/02/03) 200 KB

 
/csi/footer.htm