FF Xcell Journal Online -- Bioinformatics article - Issue 53

  Xcell Journal Online
  Xcell Journal Archives
   
  Writing for Xcell
  Advertising in Xcell
  FREE Subscription
   
  Partner Yellow Pages
  Reference Pages
  Contact Us

    

Home : Documentation : Xcell Journal Online : Article
Implementing Bioinformatics Algorithms on Nallatech-Configurable Multi-FPGA Systems



by Keith Regester, Systems Applications Engineer, Nallatech
k.regester@nallatech.com

Jong-Ho Byun, Research Assistant, University of North Carolina at Charlotte
jbyun1@uncc.edu

Arindam Mukherjee, Assistant Professor, University of North Carolina at Charlotte
amukherj@uncc.edu
and
Arun Ravindran, Assistant Professor, University of North Carolina at Charlotte
aravindr@uncc.edu  (3/15/05)



A computational acceleration of 200X is impressive, but the future of bioinformatics requires more.
article link to PDF
Article PDF 940 KB


In the past decade there has been an explosive growth of biological data, including genome projects, proteomics, protein structure determination, cellular regulatory mechanisms, and the rapid expansion in digitization of patient biological data. This has led to the emergence of a new area of research: bioinformatics, where powerful computational techniques are used to store, analyze, simulate, and predict biological information.

Although raw computational power as predicted by “Moore’s Law” has led to the number of transistors that can be integrated doubling every 18 months, the genomic data at GenBank (the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences) is doubling every six months. Proteomic and cellular imaging data is expected to grow even faster. Post-genomic-era bioinformatics will require high-performance computing power of the order of several hundreds of teraflops or more.

In recent years, FPGAs have emerged as high-performance computing accelerators capable of implementing fine-grained, massively parallelized versions of computationally intensive algorithms. The reprogrammability of FPGAs enables algorithmspecific computing architectures to be implemented using the same hardware resource across a range of algorithms. In this article, we’ll describe the design flow, test results, and system optimization for implementing a Smith-Waterman algorithm using a scaleable Nallatech FPGA computing architecture.

Sequence Matching Overview
The objective of bioinformatic sequence matching is to identify similarities between subsequences of strings as far as possible. The dissimilarity of two sequences of nucleotides or amino acids can be defined as the minimum change required in one to get the other one, among all possible alignments between them – this is the edit distance of the pair of sequences.

Note that it depends on the choice of scoring scheme. For instance, a possible scoring scheme can assign 0 values to matches of nucleotides at certain positions in the sequences, 1 to gaps (insertions or deletions) at those positions, and 2 to mismatches (substitutions) at those positions.

The Smith-Waterman algorithm computes the local alignments or similarities between substrings of two sequences A and B of lengths m and n, respectively, using a dynamic programming approach. Dynamic programming is a strategy of building a solution gradually using simple recurrences.

The key observation for the alignment problem is that the similarity between sequences A[1..n] and B[1..m] can be computed by taking the maximum of the three following values:

  • The similarity of A[1..n -1] and B[1..m -1] plus the score of substituting (sub) A[n] for B[m]
  • The similarity of A[1..n] and B[1..m- 1] plus the score of deleting (del) A[n]
  • The similarity of A[1..n] and B[1..m - 1] plus the score of inserting (ins) B[m]
To solve the problem with this recurrence, the algorithm builds an (n +1) x (m +1) matrix M, where each M[i, j] represents the similarity between sequences A[1..i] and B[1..j]. The first row and the first column represent alignments of one sequence with spaces. M[0, 0] represents the alignment of two empty sequences, and is set to zero. All other entries are computed with the following formula:
M[i, j] = min {M[i -1, j -1] + sub (A[i], B[j]); M[i -1, j] + del (A[i]); M[i, j -1] + ins (B[j])}
Because there are (m+1) x (n+1) positions to compute and each takes a constant amount of work, this algorithm has a time complexity of O (n2). Clearly, it also has quadratic space complexity because it needs to keep the entire matrix in memory.

FPGA Architectures and Networking Tools
The FPGA network hardware comprises Nallatech FPGA computing cards containing between one and seven FPGAs per card. Several such cards can be plugged into the PCI slots of a desktop server and networked using Nallatech’s DIMEtalk networking software, greatly increasing the available computing capability. Let’s briefly describe the different FPGA network components.

High-Capacity Motherboard – BenNUEY
The BenNUEY motherboard features a Xilinx® Virtex™-II FPGA and module sites for as many as six additional FPGAs. The PCI/control and low-level drivers abstract the PCI interfacing, resulting in a simplified design process for designs/applications.

Virtex-II Expansion Module – BenBlue-II
The BenBlue-II DIME-II module provides a substantial logic resource ideal for implementing applications that have a large number of processing elements. Through support for as many as two on-board Xilinx XC2V8000 FPGAs, the BenBlue-II can provide more than 200,000 logic cells on a single module. A Virtex-II Pro version of this module is also available, providing two XC2VP100 devices.

Multi-FPGA Management – DIMEtalk
To manage the large silicon resource pool provided by the hardware, the DIMEtalk tool accelerates the design flow for creating a reconfigurable data network by providing a communications channel between FPGAs and the host user environment.

FUSE Tcl/Tk Control and C++ APIs
FUSE is a reconfigurable operating system that allows flexible and scalable control of the FPGA network directly from applications using the C++ development API, which is complemented by a Tcl/Tk toolset for scripting base control.

Merging Algorithms and Networks
The design flow consists of two interdependent tasks:

  • Capture of the Smith-Waterman algorithm in VHDL
  • Creation of the FPGA network
The user VHDL design maps the Smith-Waterman algorithm to a systolic array structure. As shown in Figure 1, the array structure comprises computational units known as processing elements (PEs) that have local interconnections between them. Each element of the query sequence is hardcoded into a PE to achieve an area-efficient realization of approximately four slices.

Note that the length of the maximum length of the query sequence corresponds to the maximum number of PEs that can fit inside a given Virtex-II/Pro device. A C-based routine instantiates the VHDL code of a PE multiple times to create the array of interconnected PEs. Additional VHDL code is then inserted to interface with the FPGA network.

DIMEtalk networks are defined early in the application development phase using the supplied network design software tool, DTdesign. This allows the structure of the network to be outlined across the FPGAs in a system.

Having defined the network, Smith-Waterman blocks of design (VHDL or VHDL-wrapped source) are imported using the custom component insertion wizard and interconnected to the network in the DTdesign workspace.

FPGA I/O ports for the whole design can be mapped to the device pins using the high-level drag-and-drop device editor. DTdesign then automatically generates VHDL code, user constraint files, and a VHDL test bench and configuration script files.

DIMEtalk components include:

  • Routers that direct data around the network and interconnect all other component types
  • Bridges that move data between different physical FPGA devices
  • Nodes serving as the user interface to the network that can be connected to user FPGA designs, through node interfaces such as block memories (BRAMs), queues (FIFOs), and memory maps
  • Edges serving as the user interface to the host computer
Note that the sizes of memory elements, such as BRAMs and FIFOs, depend on the application requirements and are easily interchangeable in the DIMEtalk GUI environment. Figure 2 shows a block diagram of components within the FPGA.

The entire network, including the Smith-Waterman application, is then synthesized using the Xilinx ISE™ 6.2i flow, and the place and route timing netlist is generated. Functional and timing simulations verify the operation of the design. Thereafter, the bitstreams for the entire multiple FPGA network are generated within the DIMEtalk environment and placed in corresponding work directories. Figure 3 shows a block diagram of a multi-FPGA network using five XC2V6000 devices as designed in DIMEtalk.

The bit files are then downloaded to the designated FPGAs using FUSE C++ API calls. In runtime, dedicated Tcl/Tk software API functions provide runtime host connectivity. This enables direct communication from the host system to all nodes within the network.

The database sequences downloaded from the GenBank database are formatted using a C-routine for input to the FPGA network. The formatting consists of coding the individual nucleotide bases using 4 bits ((A – ux00, C – ux01, G – ux10, T – ux11), where the bit denoted by “u” is undefined and the one denoted by “x” is set to 1 for the starting nucleotide base of a sequence (set to 0 otherwise). The sequences are then packed into groups of eight to fully use the 32-bit DIMEtalk networks. The length of an individual sequence could be well over 200,000 nucleotide bases long.

Test Results
The initial test utilizes three of the five FPGAs from the network. The resulting multi-FPGA network comprises a BenNUEY motherboard with a Virtex-II XC2V3000-4 FPGA and an attached BenBLUE-II daughterboard with two Virtex- II XC2V6000-4 FPGAs.

The query sequences are hard-coded into the FPGAs, while the individual database sequence files are loaded in the main memory. The database sequences are first written across the DIMEtalk network into the 16K read-FIFOs of the individual FPGAs. Note that a single XC2V6000 device is capable of as many as 1.26 trillion cell updates per second. At the end of the processing, the final edit distance is written into a write-FIFO.

Table 1 gives the runtimes for the different GenBank databases in the case of a single XC2V6000 FPGA, and two XC2V6000 FPGAs in the network. The databases used were the GBUNA (3 MB, 1,276 sequences in one database file), GBPRI20 (187 MB, 2,036 sequences with each sequence in a separate database file) and GBROD10 (245 MB, 1,200 sequences with each sequence in a separate database file), all from the GenBank database.

Table 1 – Performance of the FPGA network for GenBank databases
Computation Platform Time (GBUNA) Time (GBPRI20) Time (GBROD10)
Single XC2V6000120 seconds 294 seconds 216 seconds
Network of two XC2V6000 90 seconds 220 seconds 175 seconds
SunFire 280R 265 seconds 11 hours, 24 minutes 7 hours, 45 minutes

The results are compared with those obtained using a SunFire 280R (two UltraSPARC III processors clocking at 1.05 GHz, with 8 MB L2 cache and 8 GB memory, running Solaris 9). Each XC2V6000 FPGA holds PEs corresponding to a 5,000-nucleotide-bases-long query sequence, besides a 10% overhead for the interface logic. We achieved an overall FPGA utilization of more than 80%.

Systems Optimization
The time required to compare a query sequence against a database sequence for the individual FPGA configuration is calculated as:

equation 1

Where |D| is the number of elements in a given database sequence D; |Q| is the length of the query sequence Q (equal to the number of processing elements); T is the number of nucleotide sequences in the database; N is the number of FPGAs in the network; and tclk is the clock period in the FPGA network.

A truly parallel FPGA network would have a processing time given by:

equation 2

while a serial FPGA network would have a processing time given by: equation 3

In tasks where the processing is I/O-limited (|D| > |Q|), the FPGA network acts as a serial system, whereas in tasks where the processing is computation-limited (|Q| > |D|), the network behaves as a parallel system. Note that the use of large on-board memories relieves the I/O bottleneck.

For example, the Nallatech BenData-II Module and BenNUEYPCI- 4e motherboard adds 1.5 GB of on-board memory. In addition, the BenNUEY-PCI-4e motherboard has 4 GB Ethernet ports and eight Xilinx RocketIO™ ports, allowing direct connection to storage area networks (SANs), networked memory devices, and interlacing of MGTs between boards.

These Nallatech boards, which became available in early 2005, would be ideally suited for real-world bioinformatics applications (Figure 4).

Automatically evolving computational resource partitioners are currently being developed at UNC Charlotte for optimal resource allocation for a given computational problem and an FPGA network configuration. Figure 5 shows a system-level view of mapping an enormous sequence-matching application to a multi-device FPGA computing platform.

Conclusion
Looking forward, using Nallatech’s scalable FPGA networks will help realize high-performance computing capability for bioinformatics applications at a low cost. A practical system will be evaluated on:

  • Scaleability to execute diverse complex algorithms
  • Ease of implementation and integration
  • Price/performance demands
Our prototype implementation of the Smith-Waterman sequence alignment algorithm provides a computational speed improvement of more than two orders of magnitude (200X) while running on a three-FPGA network.

To find out more about Nallatech’s scalable FPGA computing architectures and networking tools, visit www.nallatech.com. To discuss how Nallatech can support your FPGA acceleration and systems development needs, please contact Nallatech at contact@nallatech.com or call (877) 44-NALLA.

Printable PDF version of this article with graphics. PDF logo (3/15/05) 940 KB

 
Jobs Events Webcasts News Investors Feedback Legal Privacy Trademarks Sitemap
© 1994-2008 Xilinx, Inc. All Rights Reserved.