Design to Win with Integrated FPGA and Microprocessor Solutions
Using “software-compiled system design” for programmable systems, we show how you
can combine software and hardware design methodologies – and development tools –
from system-level specification to direct implementation and run-time reconfiguration.
As FPGAs have developed from logic prototyping
devices into fundamental system
elements, there has been enthusiasm for the
concept of using high-performance processors
closely coupled to or immersed inside
the FPGA fabric for applications that
require unrivalled levels of performance
and flexibility.
In this architecture, the microprocessor
typically runs system applications while the
FPGA manages computationally intensive
tasks. Offloading processor-intensive tasks
to hardware reduces the load on the processor
and delivers greater bandwidth. It can
also remove bottlenecks by migrating algorithms
to hardware. In short, FPGAs have
evolved into fully programmable systems
and fast co-processors, rather than just flexible
relations of the ASIC.
Existing design examples that combine
Xilinx FPGA solutions with development
tools from Celoxica and Wind River
Systems already provide unique and tangible
proof that this concept and design flow
work. They form a core element of programmable
system co-design, delivering a
quick, efficient, and verifiable route to
device-optimized implementation.
“Software-compiled system design” provides
the capability to drive partitioning,
co-verification, and direct implementation
from the system specification. Moreover, it
allows engineers to jump-start their system
and software application development
before actual hardware is available, thereby
enabling concurrent design, saving valuable
development effort and delivering the best
time to market. Starting at the system level,
verification becomes a whole-design life
cycle activity, and by enabling system-level
partitioning, you can realize a better quality
of design (QoD) – right the first time, more
of the time.
A Design Example
An early design example – developed by
Celoxica, Wind River, and Xilinx – focused
on the design methodology, tools, and run-time
environments that can be applied to
programmable systems. Specifically, we
developed a triple-DES encryption and
decryption engine to compare a programmable
system solution with an alternative
software implementation. A compressed
video stream formed the basis of test data.
Hardware
The selected hardware architecture was
initially based around a discrete IBM
PowerPC™ processor and a Xilinx
Virtex™ FPGA – effectively a first-generation
Virtex-II Pro™ prototyping
platform (Figure 1). We used a PPMC750
single-board computer from Wind River
and Celoxica’s RC1000 – a Virtex-based
PCI card with a Xilinx FPGA and 8 MB of
local memory (Figure 2).
Subsequently, we deployed a newer
reference platform using the PowerPC
405GP processor (Figure 3). In addition
to PCI and PMC (PCI mezzanine card)
connectors, this platform also featured a
custom connector that allowed an FPGA
daughtercard to be plugged directly onto
the processor peripheral bus, thus providing
even closer coupling, lower latency,
and higher throughput.
Various FPGA daughtercards can be
used with this reference platform, such as
the ADM-XRC from Alpha Data
Systems, Xilinx Durango, or the Proteus
card from Wind River.
Wind River’s Proteus card is equipped
with a Xilinx Virtex device and
4 MB of on-board SSRAM. The
FPGA PMC can interface with any standard
PMC slot (with an image containing
a PCI soft core) or the microprocessor
local bus on Wind River’s SBC405GP
single board computer. There is a substantial
performance boost from direct
processor bus connection, compared
with PCI.
The design platform is completed
by a simple DAC interface, enabling
the FPGA card to drive a video monitor
or a flat-panel LCD for standalone
demonstrations.
Development Tools
The 405GP processor runs Wind River’s
VxWorks™ real-time operating system
(RTOS), together with hardware bring-up
tools that allow close control of the boot
cycle for the board during the time period
before control is passed to the RTOS.
The PAVE Framework API from
Xilinx was used to program the FPGA
with configuration files.
The system partition was determined,
and the application content for the FPGA
was developed, using Celoxica’s Nexus co-design
environment and DK Design Suite.
Nexus and DK
Nexus is a powerful co-design environment
for programmable systems. It supports
system partitioning, co-verification, and
co-simulation. Nexus allows you to fully
explore the design space to identify
optimal system partitioning. System
functionality can be co-simulated
across hardware and software using
multiple languages such as C, C++,
Handel-C, SystemC, and HDLs.
These models can be used throughout
the design for co-verification. Nexus
communicates directly during simulation
with popular, third-party, hardware
RTL simulators and software ISS
environments.
Using DK, the resulting code may be
debugged using a familiar integrated
development environment (IDE), and
applications are compiled directly to the
FPGA fabric using device-optimized
synthesis. VHDL and Verilog output
is also supported for traditional RTL
synthesis.
Handel-C
We selected the Handel-C language for
hardware implementation, as it provides a
common level of abstraction and a common
language base for both the hardware
and software. The language has simple
extensions to ANSI-C (Figure 4) that can
be leveraged to quickly create applications
that fully exploit the capabilities of a programmable
system, without compromising
performance or area.
Because Handel-C is a fully synthesizable language,
everything that can be described in it has a
translation to hardware (Figure 5). The code
illustrates concepts and extensions, such as
par, chan, synchronization, functions,
pointers, structures, interfacing, and externing
pure C functions for simulation.
With a simple timing model, each
assignment in a program takes one clock
cycle to execute, giving you full control
over what is happening in the design at any
point in time. Results are predictable and
controllable, and the facility for complex
sequential control flows means there are no
state machines to design.
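The timing model can be illustrated with a small cycle-counting sketch.
It is written in Python rather than Handel-C, and the tuple-based
program representation and the cycles function are hypothetical names
introduced only for this illustration. It assumes the rule stated above
(one clock cycle per assignment) and that a par block completes when
its slowest branch completes.

```python
# Toy cycle counter for a Handel-C-like timing model (illustrative only).
# Assumptions: every assignment costs one clock cycle; a "seq" block runs
# its children one after another; a "par" block runs its children side by
# side and finishes when the slowest branch finishes.

def cycles(node):
    """Return the number of clock cycles a program fragment takes."""
    if node == "assign":
        return 1
    kind, children = node
    child_cycles = [cycles(child) for child in children]
    if kind == "seq":
        return sum(child_cycles)
    if kind == "par":
        return max(child_cycles, default=0)
    raise ValueError(f"unknown block kind: {kind!r}")


# Three assignments written one after another.
three_in_sequence = ("seq", ["assign", "assign", "assign"])
# The same three assignments wrapped in a par block.
three_in_parallel = ("par", ["assign", "assign", "assign"])
# One assignment, then a par block whose longer branch has two assignments.
mixed = ("seq", ["assign", ("par", ["assign", ("seq", ["assign", "assign"])])])

print(cycles(three_in_sequence))  # 3
print(cycles(three_in_parallel))  # 1
print(cycles(mixed))              # 3
```

Under these assumptions, moving independent assignments into a par
block is what trades area for clock cycles.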
Run-Time Environment
Typically, the FPGA is connected to the
microprocessor in a memory-mapped or
programmed I/O fashion, but this creates
the challenge of needing to develop and
redevelop individual communications protocols
and data-marshalling routines for
each application. This problem is overcome
by using DSM (Data Streaming Manager),
a portable co-design API developed for
hardware/software integration in programmable
systems.
Data Streaming Manager
DSM is a portable hardware/software co-design
API that offers a simple and transparent
interface for transferring multiple
independent streams of data between hardware
and software. DSM supports system
partitioning and final implementation; it is
both bus/interconnect and OS-independent;
and for the developer, it simplifies the integration
between the hardware and software
(Figure 6).
As an example, the hardware function
reads parameters from an input port and
then writes the results to an output port.
All the complexities of receiving commands
over the PCI or bus, routing parameters
to the appropriate hardware function,
and then routing the responses back to the
calling software thread are handled transparently
by the hardware side of the
DSM.
On the software side, there are two
main parts to the DSM: the control/setup
phase and then the specific usage of
the custom hardware function.
Essentially, information about the FPGA
configuration and available functions is
retrieved by reading a memory-mapped
register. User-defined
identifiers (called function
addresses) are assigned to
each available hardware
function, and these
function addresses are
later used to communicate
between the
application software and
the functions implemented
in hardware.
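The routing idea can be sketched as a toy model. This is not the DSM
API: the class ToyStreamManager, its methods, and the two example
functions are hypothetical names invented for illustration, and the
"hardware" functions are ordinary Python generators that read from an
input stream and write to an output stream.

```python
# Toy model of routing by function address (NOT the real DSM API).
# Every name below is hypothetical and exists only for this sketch.

class ToyStreamManager:
    def __init__(self):
        # Maps a user-defined function address to a hardware function model.
        self._functions = {}

    def register(self, address, hw_function):
        """Assign a function address to one available hardware function."""
        if address in self._functions:
            raise ValueError(f"function address {address} already in use")
        self._functions[address] = hw_function

    def available(self):
        """Stand-in for querying which functions the FPGA image provides."""
        return sorted(self._functions)

    def call(self, address, params):
        """Route parameters to a function and collect its response stream."""
        hw_function = self._functions[address]
        return list(hw_function(iter(params)))


def adder(in_port):
    # Reads parameters two at a time and writes each sum to the output port.
    for a, b in zip(in_port, in_port):
        yield a + b


def doubler(in_port):
    # Reads one parameter at a time and writes twice its value.
    for value in in_port:
        yield 2 * value


dsm = ToyStreamManager()
dsm.register(0, adder)
dsm.register(1, doubler)

print(dsm.available())            # [0, 1]
print(dsm.call(0, [1, 2, 3, 4]))  # [3, 7]
print(dsm.call(1, [5, 6]))        # [10, 12]
```

The application code only ever names a function address; how the
parameters physically reach the function is hidden behind call, which
is the property that lets a partition be moved without rewriting the
caller.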
Using this methodology, the
optimal system partition can be identified
by porting blocks of software to
Handel-C, for hardware prototyping, testing,
and verification. DSM’s portability
means that multiple partitions can be rapidly
evaluated, tested, and verified with the
software used as a testbench throughout.
DSM also provides a functionally accurate
simulation environment that allows
ANSI-C programs and Handel-C applications
to interact using the DSM (Figure 7).
The ANSI-C program is run as a native executable
on the PC. The Handel-C application
is run using the simulation and
debugging capability of Celoxica’s co-design
environment. A utility is provided through
which the data passing between the applications
may be monitored to assist with
debugging (Figure 8). All of the API functions
are provided, allowing complete system
development to begin – without the
development platform being available. Once
working, the application can be easily transferred
to the target platform for final testing.
Triple DES Encryption
Our design example was based around
streaming of compressed and encrypted
video data. The Autodesk FLI file format
was used to compress the video, and an FLI
player, developed by Celoxica, was implemented in FPGA hardware connected to
the processor via the PCI bus. To benchmark
the design, we loaded a cartoon animation
into the memory on the processor
board. A triple DES algorithm described in
C ran on the PowerPC microprocessor.
The same C source code was ported to
Handel-C, optimized in terms of controlling
parallelism and timing, and compiled
to a gate-level design that was device-optimized
for the target FPGA.
A 64-bit key was used for the encryption,
which subsequently allowed correct decryption
of the video stream. Implementing
three DES algorithms in sequence (triple
DES encryption) provided further increases
in this standard’s security. Three 64-bit keys
were used for an encrypt/decrypt/encrypt
cycle in a triple DES pass, and the same keys,
applied in reverse order, allowed a
decrypt/encrypt/decrypt cycle to recover the data.
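The encrypt/decrypt/encrypt structure can be sketched as follows. The
single-block cipher here is a deliberately trivial stand-in, not DES,
and offers no security; toy_encrypt, toy_decrypt, and the key values
are assumptions made for illustration. Only the way the three stages
are composed, and reversed for decryption, reflects the scheme
described above.

```python
# Sketch of the encrypt/decrypt/encrypt (EDE) composition on 64-bit blocks.
# toy_encrypt/toy_decrypt are an invertible placeholder, NOT real DES.

MASK64 = (1 << 64) - 1


def _rotl(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64


def _rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK64


def toy_encrypt(block, key):
    """Placeholder for single-DES encryption of one 64-bit block."""
    return (_rotl(block ^ key, 7) + key) & MASK64


def toy_decrypt(block, key):
    """Exact inverse of toy_encrypt for the same key."""
    return _rotr((block - key) & MASK64, 7) ^ key


def triple_encrypt(block, k1, k2, k3):
    # Encrypt with k1, decrypt with k2, encrypt with k3.
    return toy_encrypt(toy_decrypt(toy_encrypt(block, k1), k2), k3)


def triple_decrypt(block, k1, k2, k3):
    # Undo the stages in reverse: decrypt k3, encrypt k2, decrypt k1.
    return toy_decrypt(toy_encrypt(toy_decrypt(block, k3), k2), k1)


plaintext = 0x0123456789ABCDEF
keys = (0x1111111111111111, 0x2222222222222222, 0x3333333333333333)

ciphertext = triple_encrypt(plaintext, *keys)
recovered = triple_decrypt(ciphertext, *keys)
print(recovered == plaintext)  # True
```

Because each stage depends on the output of the previous one, a
software loop processes one block at a time, whereas hardware can keep
several blocks in flight, one per stage.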
This was a robust test of performance.
The algorithm was inherently sequential in
software, but it could be heavily pipelined
for a hardware implementation.
To measure the performance improvement,
we played a cartoon with each compressed
frame being encrypted, decrypted,
and displayed on a VGA monitor. Both
hardware and software implementations
were displayed together. They were triggered
to start simultaneously, with the
hardware version programmed to cycle
continuously until the software implementation
finished. Processed data from the
microprocessor was fed to the FPGA,
which, as well as performing DES encryption/decryption, was programmed to generate
VGA signals. Output from both the
hardware and software implementations
was merged to form a composite image on
the monitor.
Performance Comparison
A test harness enabled triple DES performance
to be benchmarked by streaming
data into either the software or
hardware encryption algorithm.
Theoretical performance for the
FPGA was calculated as follows: The
triple DES implementation produced a
64-bit word every 19 clock cycles, giving
a data throughput of 85.6 Mbps for a
device running at 25.4 MHz.
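The arithmetic behind this figure can be checked with a few lines of
Python. The function name throughput_mbps is introduced here for
illustration, and Mbps is taken to mean 10^6 bits per second.

```python
# Theoretical throughput: bits produced per result, divided by the cycles
# each result takes, multiplied by the clock frequency.

def throughput_mbps(bits_per_result, cycles_per_result, clock_hz):
    """Return throughput in megabits per second (1 Mbps = 1e6 bit/s)."""
    return bits_per_result * clock_hz / cycles_per_result / 1e6


theoretical = throughput_mbps(bits_per_result=64,
                              cycles_per_result=19,
                              clock_hz=25.4e6)
print(f"{theoretical:.1f} Mbps")  # 85.6 Mbps
```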
Actual performance was profiled using
WindView, a diagnostic tool from Wind
River that enables visualization and
analysis of performance and timing
issues in embedded systems. It allowed
triggers to be set at different points in the
code, and then provided accurate
timing information for each trigger
event. Performance statistics are detailed
in Table 1.

Table 1 - Performance statistics for triple DES encryption/decryption
This test scenario showed that triple DES
throughput was about 13 times higher in
FPGA hardware than in software running on a
PowerPC microprocessor. Nevertheless, it
was still about a quarter of the theoretical
maximum rate. This indicated that the full
benefits of placing core routines in hardware
might be compromised by other system
bottlenecks. Further analysis showed
that the overheads of
offloading functionality into hardware
stemmed from
RAM access latency and/or bus speeds.
We also calculated the performance of
hardware and software encryption in the
cartoon demonstration. Results demonstrated
that the hardware performed 22
times faster on a 15 times slower clock, as
shown in Table 2.
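Taken together, the two ratios imply how much more work the hardware
completed per clock cycle. The short calculation below treats the
rounded figures quoted above as exact, so the result is indicative
only; work_per_clock_ratio is a name introduced for this sketch.

```python
# If hardware finishes the job `speedup` times sooner while its clock runs
# `clock_slowdown` times slower, then per clock cycle it completes
# speedup * clock_slowdown times as much work as the processor.

def work_per_clock_ratio(speedup, clock_slowdown):
    return speedup * clock_slowdown


ratio = work_per_clock_ratio(speedup=22, clock_slowdown=15)
print(ratio)  # 330
```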

Table 2 - Performance comparison
of hardware and software encryption
Following more detailed partitioning
analysis, performance closer to the theoretical
limits might be realized by removing
code and functionality that are not directly
associated with the triple DES algorithm
(for example, the FLI decoder, frame
buffer, and VGA driver). Better performance
would also be achievable by connecting
the FPGA directly to the processor bus
in a memory-mapped fashion rather than
across the PCI bus.
Conclusion
The performance analysis results demonstrated
significant improvements in overall
system performance and quality of design.
The results were achieved using a software-compiled
system design methodology – specifically developed for programmable
systems – that consistently delivered the
fastest time to market (some 50% to 75%
advantage in design time) without compromising
performance or area.
For example, using the selected development
tools and run-time environment,
the FLI player took 10 person-days to
implement, as did the triple-DES
functionality. By contrast, integrating
these two blocks to produce the
cartoon demonstration took just half a
day. Moreover, you can very quickly
explore the design space, experiment, and
analyze different hardware/software
trade-offs, and rapidly implement and
prototype the system.
Coupling Celoxica’s co-design technology
with high-performance profiling
tools in the development tool chain
enabled further performance boosts and
time-to-market efficiencies. Overall
improvements in the quality of design
were realized by more informed and accurate
partitioning decisions, better up-front
system verification, and by
maximizing the speed gains of hardware
implementation while minimizing the
negative impact of transferring data
between the FPGA and microprocessor.
The bottom line is that these system-level
design qualities offer real and competitive
advantages for designers of
programmable systems who want to move
to volume production.