|
In the past decade there has been an explosive
growth of biological data, including
genome projects, proteomics, protein
structure determination, cellular regulatory
mechanisms, and the rapid expansion in
digitization of patient biological data. This
has led to the emergence of a new area of
research: bioinformatics, where powerful
computational techniques are used to
store, analyze, simulate, and predict biological
information.
Although raw computational power as
predicted by “Moore’s Law” has led to the
number of transistors that can be integrated
doubling every 18 months, the genomic
data at GenBank (the NIH genetic
sequence database, an annotated collection
of all publicly available DNA sequences) is
doubling every six months. Proteomic and
cellular imaging data is expected to grow
even faster. Post-genomic-era bioinformatics
will require high-performance computing
power of the order of several hundreds
of teraflops or more.
In recent years, FPGAs have emerged as
high-performance computing accelerators
capable of implementing fine-grained,
massively parallelized versions of computationally
intensive algorithms. The reprogrammability
of FPGAs enables algorithmspecific
computing architectures to be
implemented using the same hardware
resource across a range of algorithms. In
this article, we’ll describe the design flow,
test results, and system optimization for
implementing a Smith-Waterman algorithm
using a scaleable Nallatech FPGA
computing architecture.
Sequence Matching Overview
The objective of bioinformatic sequence
matching is to identify similarities between
subsequences of strings as far as possible.
The dissimilarity of two sequences of
nucleotides or amino acids can be defined
as the minimum change required in one to
get the other one, among all possible alignments
between them – this is the edit distance
of the pair of sequences.
Note that it depends on the choice of
scoring scheme. For instance, a possible
scoring scheme can assign 0 values to matches of nucleotides at certain positions
in the sequences, 1 to gaps (insertions or
deletions) at those positions, and 2 to mismatches
(substitutions) at those positions.
The Smith-Waterman algorithm computes
the local alignments or similarities
between substrings of two sequences A and
B of lengths m and n, respectively, using a
dynamic programming approach. Dynamic
programming is a strategy of building a
solution gradually using simple recurrences.
The key observation for the alignment
problem is that the similarity between
sequences A[1..n] and B[1..m] can be computed
by taking the maximum of the three
following values:
- The similarity of A[1..n -1] and
B[1..m -1] plus the score of substituting
(sub) A[n] for B[m]
- The similarity of A[1..n] and B[1..m-
1] plus the score of deleting (del) A[n]
- The similarity of A[1..n] and B[1..m -
1] plus the score of inserting (ins) B[m]
To solve the problem with this recurrence,
the algorithm builds an (n +1) x (m
+1) matrix M, where each M[i, j] represents
the similarity between sequences
A[1..i] and B[1..j]. The first row and the
first column represent alignments of one
sequence with spaces. M[0, 0] represents
the alignment of two empty sequences, and
is set to zero. All other entries are computed
with the following formula:
M[i, j] = min {M[i -1, j -1] + sub (A[i], B[j]);
M[i -1, j] + del (A[i]); M[i, j -1] + ins (B[j])}
Because there are (m+1) x (n+1) positions
to compute and each takes a constant
amount of work, this algorithm has
a time complexity of O (n2). Clearly, it
also has quadratic space complexity
because it needs to keep the entire matrix
in memory.
FPGA Architectures and Networking Tools
The FPGA network hardware comprises
Nallatech FPGA computing cards containing
between one and seven FPGAs per
card. Several such cards can be plugged
into the PCI slots of a desktop server and
networked using Nallatech’s DIMEtalk networking software, greatly increasing
the available computing capability. Let’s
briefly describe the different FPGA network
components.
High-Capacity Motherboard – BenNUEY
The BenNUEY motherboard features a
Xilinx® Virtex™-II FPGA and module sites
for as many as six additional FPGAs. The
PCI/control and low-level drivers abstract
the PCI interfacing, resulting in a simplified
design process for designs/applications.
Virtex-II Expansion Module – BenBlue-II
The BenBlue-II DIME-II module provides a
substantial logic resource ideal for implementing
applications that have a large number
of processing elements. Through support
for as many as two on-board Xilinx
XC2V8000 FPGAs, the BenBlue-II can provide
more than 200,000 logic cells on a single
module. A Virtex-II Pro version of this
module is also available, providing two
XC2VP100 devices.
Multi-FPGA Management – DIMEtalk
To manage the large silicon resource pool
provided by the hardware, the DIMEtalk
tool accelerates the design flow for creating
a reconfigurable data network by providing
a communications channel between
FPGAs and the host user environment.
FUSE Tcl/Tk Control and C++ APIs
FUSE is a reconfigurable operating system
that allows flexible and scalable control of
the FPGA network directly from applications
using the C++ development API,
which is complemented by a Tcl/Tk toolset
for scripting base control.
Merging Algorithms and Networks
The design flow consists of two interdependent
tasks:
- Capture of the Smith-Waterman
algorithm in VHDL
- Creation of the FPGA network
The user VHDL design maps the Smith-Waterman algorithm to a systolic array
structure. As shown in Figure 1, the array
structure comprises computational units
known as processing elements (PEs) that
have local interconnections between them.
Each element of the query sequence is hardcoded
into a PE to achieve an area-efficient
realization of approximately four slices.
Note that the length of the maximum
length of the query sequence corresponds
to the maximum number of PEs that can
fit inside a given Virtex-II/Pro device. A C-based
routine instantiates the VHDL code
of a PE multiple times to create the array of
interconnected PEs. Additional VHDL
code is then inserted to interface with the
FPGA network.
DIMEtalk networks are defined early in
the application development phase using
the supplied network design software tool,
DTdesign. This allows the structure of the
network to be outlined across the FPGAs
in a system.
Having defined the network, Smith-Waterman blocks of design (VHDL or
VHDL-wrapped source) are imported
using the custom component insertion
wizard and interconnected to the network
in the DTdesign workspace.
FPGA I/O ports for the whole design can
be mapped to the device pins using the high-level drag-and-drop device editor.
DTdesign then automatically generates
VHDL code, user constraint
files, and a VHDL test bench and
configuration script files.
DIMEtalk components include:
- Routers that direct data around
the network and interconnect
all other component types
- Bridges that move data
between different physical
FPGA devices
- Nodes serving as the user
interface to the network that
can be connected to user
FPGA designs, through node
interfaces such as block memories
(BRAMs), queues
(FIFOs), and memory maps
- Edges serving as the user interface
to the host computer
Note that the sizes of memory
elements, such as BRAMs and
FIFOs, depend on the application
requirements and are easily interchangeable
in the DIMEtalk GUI
environment. Figure 2 shows a
block diagram of components
within the FPGA.
The entire network, including
the Smith-Waterman application,
is then synthesized using the Xilinx
ISE™ 6.2i flow, and the place and
route timing netlist is generated.
Functional and timing simulations verify
the operation of the design. Thereafter, the
bitstreams for the entire multiple FPGA
network are generated within the
DIMEtalk environment and placed in corresponding
work directories. Figure 3
shows a block diagram of a multi-FPGA
network using five XC2V6000 devices as
designed in DIMEtalk.
The bit files are then downloaded to the
designated FPGAs using FUSE C++ API
calls. In runtime, dedicated Tcl/Tk software
API functions provide runtime host
connectivity. This enables direct communication
from the host system to all nodes
within the network.
The database sequences downloaded
from the GenBank database are formatted
using a C-routine for input to the FPGA
network. The formatting consists of coding
the individual nucleotide bases using 4 bits
((A – ux00, C – ux01, G – ux10, T – ux11),
where the bit denoted by “u” is undefined
and the one denoted by “x” is set to 1 for
the starting nucleotide base of a sequence
(set to 0 otherwise). The sequences are then
packed into groups of eight to fully use the
32-bit DIMEtalk networks. The length of
an individual sequence could be well over
200,000 nucleotide bases long.
Test Results
The initial test utilizes three of the five
FPGAs from the network. The resulting
multi-FPGA network comprises a
BenNUEY motherboard with a
Virtex-II XC2V3000-4 FPGA
and an attached BenBLUE-II
daughterboard with two Virtex-
II XC2V6000-4 FPGAs.
The query sequences are
hard-coded into the FPGAs,
while the individual database
sequence files are loaded in the
main memory. The database
sequences are first written
across the DIMEtalk network
into the 16K read-FIFOs of
the individual FPGAs. Note
that a single XC2V6000
device is capable of as many as
1.26 trillion cell updates per
second. At the end of the processing,
the final edit distance
is written into a write-FIFO.
Table 1 gives the runtimes
for the different GenBank databases
in the case of a single
XC2V6000 FPGA, and two
XC2V6000 FPGAs in the network.
The databases used were
the GBUNA (3 MB, 1,276
sequences in one database file),
GBPRI20 (187 MB, 2,036
sequences with each sequence
in a separate database file) and
GBROD10 (245 MB, 1,200
sequences with each sequence
in a separate database file), all
from the GenBank database.
Table 1 – Performance of the FPGA network for GenBank databases
|
Computation Platform | Time (GBUNA) | Time (GBPRI20) | Time (GBROD10) |
| Single XC2V6000 | 120 seconds | 294 seconds | 216 seconds |
| Network of two XC2V6000 | 90 seconds | 220 seconds | 175 seconds |
| SunFire 280R | 265 seconds | 11 hours, 24 minutes | 7 hours, 45 minutes |
The results are compared
with those obtained using a SunFire 280R
(two UltraSPARC III processors clocking at
1.05 GHz, with 8 MB L2 cache and 8 GB
memory, running Solaris 9). Each
XC2V6000 FPGA holds PEs corresponding
to a 5,000-nucleotide-bases-long
query sequence, besides a 10% overhead for
the interface logic. We achieved an overall
FPGA utilization of more than 80%.
Systems Optimization
The time required to compare a query
sequence against a database sequence for
the individual FPGA configuration is calculated
as:

Where |D| is the number of elements in
a given database sequence D;
|Q| is the length of the query sequence
Q (equal to the number of processing elements);
T is the number of nucleotide
sequences in the database; N is the number
of FPGAs in the network; and tclk is the
clock period in the FPGA network.
A truly parallel FPGA network would
have a processing time given by:

while a serial FPGA network would
have a processing time given by:

In tasks where the processing is
I/O-limited (|D| > |Q|), the FPGA
network acts as a serial system, whereas
in tasks where the processing is
computation-limited (|Q| > |D|), the
network behaves as a parallel system.
Note that the use of large on-board
memories relieves the I/O bottleneck.
For example, the Nallatech
BenData-II Module and BenNUEYPCI-
4e motherboard adds 1.5 GB of
on-board memory. In addition, the
BenNUEY-PCI-4e motherboard has 4 GB
Ethernet ports and eight Xilinx
RocketIO™ ports, allowing direct connection
to storage area networks (SANs), networked
memory devices, and interlacing of
MGTs between boards.
These Nallatech boards, which became
available in early 2005, would be ideally
suited for real-world bioinformatics applications
(Figure 4).
Automatically evolving computational
resource partitioners are currently being
developed at UNC Charlotte for optimal
resource allocation for a given computational
problem and an FPGA network configuration.
Figure 5 shows a system-level
view of mapping an enormous sequence-matching
application to a multi-device
FPGA computing platform.
Conclusion
Looking forward, using Nallatech’s scalable
FPGA networks will help realize high-performance
computing capability for bioinformatics
applications at a low cost. A
practical system will be evaluated on:
- Scaleability to execute diverse complex
algorithms
- Ease of implementation and integration
- Price/performance demands
Our prototype implementation of the
Smith-Waterman sequence alignment algorithm
provides a computational speed
improvement of more than two orders of
magnitude (200X) while running on a
three-FPGA network.
To find out more about Nallatech’s scalable
FPGA computing architectures and
networking tools, visit www.nallatech.com.
To discuss how Nallatech can support your
FPGA acceleration and systems development
needs, please contact Nallatech at
contact@nallatech.com or call (877) 44-NALLA.
Printable PDF version of this article with graphics. (3/15/05) 940 KB |