High-end electronic design flows have traditionally
included the creation of
Verilog™/VHDL representations by hand.
These manual methods were effective in the
past, but the algorithms used in many of
today’s new designs are so complex that traditional
design practices are now inadequate.
Meanwhile, FPGAs are increasingly
attractive because companies can avoid
time-consuming, exorbitant mask re-spins
and other risks associated with ASICs. The
emergence of multimillion-gate, 1,000+
pin, “ASIC-like” devices incorporating
embedded processors and innovative memory
architectures calls for a system-level
approach to programmable logic design.
FPGAs have already moved beyond
their traditional applications into new
domains such as digital signal processing
(DSP). Unfortunately, creating register
transfer level (RTL) implementations for
high-end FPGAs can become as error-prone
and time-consuming as when targeting
an ASIC, thereby negating much of
their inherent value.
You can now prevent these problems by
adopting a design flow based on the simulation
and synthesis of C representations.
By using pure untimed C++ to describe
functional intent, your design teams can
move up to a far more productive abstraction
level for designing hardware, thus
reducing implementation efforts by as
much as 20 times while creating a more
repeatable and reliable design flow.
An important outcome of this approach
is that you can produce designs of better
quality than those created with traditional RTL methods by
identifying fundamentally superior micro-architectural
solutions.
In this article, we’ll examine the conventional
design flow and its associated
problems, and highlight some alternative
approaches to hardware design based on
the use of C/C++, comparing the pros and
cons of SystemC™ and the pure, untimed
C++ used by Mentor Graphics®
Catapult™ C Synthesis tool.
Traditional Design Flow
Many high-end designs in the communications
or video/image processing industries
are based on extremely complex
algorithms. The first step in a conventional design flow involves modeling
and proving the design
functionality at the algorithmic
level of abstraction, using
tools such as MATLAB™
from The MathWorks or
plain C/C++ modeling.
MATLAB is good for initial
algorithm proof-of-concept
and validation, although
many design teams also
develop C/C++ models to
facilitate high-speed system-level
verification beyond
what MATLAB can provide.
For subsequent discussion,
we’ll use the term “untimed”
to represent those algorithms
written either in MATLAB
or pure ANSI C/C++.
Based on project requirements,
system architects
then partition the design
into blocks to be implemented
either in hardware or software. For the
hardware blocks, a floating-point algorithm
represents the functionality. Next,
either the system or hardware designer
quantizes the floating-point algorithm
into an integral or fixed-point algorithm.
These fixed-point algorithms are represented
in MATLAB, Simulink™, or
untimed C++ using bit-accurate types
(SystemC 2.0). After validating the fixed-point
algorithm, the hardware designer
starts the long and tedious manual process
of creating Verilog or VHDL for the RTL
abstraction. This process can be divided
into three distinct phases:
- Micro-Architecture Definition. Decide
on the structure of the data path, control,
and interfaces. This is typically done on
paper or perhaps in a Microsoft™
Excel™ spreadsheet. The resulting
micro-architecture has a significant
impact on the overall speed/area of the
hardware. Designs can easily swing by
10 times in area or performance based
on the decisions made.
- RTL Design. Manually write the
RTL to represent the defined micro-architecture.
- RTL Area/Timing Optimization.
Iterate through RTL synthesis to meet
design goals.
In some cases, the hardware engineers
manually translate the floating-point
untimed algorithm into bit-accurate RTL,
either Verilog or VHDL. This RTL is subsequently
synthesized into a gate-level
netlist using traditional RTL synthesis
technology (Figure 1).
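As a rough illustration of the quantization step in this flow, the sketch below converts a floating-point value to a signed fixed-point integer with round-to-nearest and saturation. The helper names `quantize` and `dequantize` and the bit-width parameters are hypothetical and chosen only for this example; bit-accurate libraries such as the SystemC data types offer their own rounding and overflow modes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: quantize a real value to a signed fixed-point integer
// with `total_bits` bits overall and `frac_bits` fractional bits.
// Rounds to nearest and saturates on overflow.
inline std::int32_t quantize(double x, int total_bits, int frac_bits) {
    assert(total_bits > 1 && total_bits <= 32);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const double scaled = std::round(x * std::ldexp(1.0, frac_bits));
    const double max_v = std::ldexp(1.0, total_bits - 1) - 1.0;  // 127 for 8 bits
    const double min_v = -std::ldexp(1.0, total_bits - 1);       // -128 for 8 bits
    if (scaled > max_v) return static_cast<std::int32_t>(max_v);
    if (scaled < min_v) return static_cast<std::int32_t>(min_v);
    return static_cast<std::int32_t>(scaled);
}

// Convert the fixed-point integer back to a real value for error analysis.
inline double dequantize(std::int32_t q, int frac_bits) {
    return std::ldexp(static_cast<double>(q), -frac_bits);
}
```

Comparing `dequantize(quantize(x, 8, 6), 6)` against `x` over a set of test vectors gives a quick view of the error a given word length introduces, before any RTL exists.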
The main problems associated with this
traditional flow are:
- Functional Intent. A significant conceptual
and representational divide
exists between the system architects
working with untimed algorithms and
the hardware designers working with
the timed RTL in VHDL/Verilog. As a
result, the original design intent specified
by the system architect is easily
misinterpreted, causing functional
errors in the end product. In addition,
it is relatively easy to implement and
evaluate specification changes in the
untimed algorithm, but very painful
and time-consuming to subsequently
fold these changes into the RTL. This
is a serious consideration in wireless
applications, because broadcast standards
and protocols constantly evolve
and change.
- Meeting Requirements. Predicting
design performance (area, delay,
power) is difficult until RTL is done.
Therefore, system-level partitioning
and the resulting block-level design
goals are inaccurate at best. Many
system-level timing closure problems
are directly related to poor macro-architectural
choices and unrealistic
goals placed on the hardware engineer
designing the hardware blocks.
- Design Complexity. Because the
untimed algorithmic domain and RTL
domain are dissimilar, the manual
translation from untimed algorithms to
RTL is prolonged and error-prone. In
addition, RTL uses technology-dependent
coding styles and “hard-codes”
the micro-architecture.
Evaluating alternative implementations is impractical because modifying and
re-verifying RTL for a series of
“what-if” analyses of alternative micro-architecture
implementations takes too long.
Such evaluations
may include performing certain
operations in parallel versus sequentially;
pipelining portions of the design
versus non-pipelining; or sharing common
resources. Because of the amount
of time involved, design teams are limited
in the number of evaluations they
can perform, which can result in a
non-optimal implementation. The
complexity of high-end, compute-intensive
applications exemplifies the
difficulties associated with traditional
hand-coded RTL.
- RTL Reuse. Using the same RTL for
an ASIC and FPGA implies that the
ASIC implementation is sub-optimal
due to inherent FPGA performance
limitations. Conversely, users can realize
performance goals in an FPGA
through massive parallelism; however,
this parallelism may not be necessary
for an ASIC. This makes it extremely
difficult, if not impossible, to re-target
a complex RTL design to create a
tuned representation for the technology
node. Finally, because RTL hard-codes
the micro-architecture, using the
same RTL for a 10 MHz application
(for example) versus a high-performance
400 MHz application will result
in sub-optimal hardware.
- Functional Verification. Using traditional
logic simulation to verify a large
design represented in RTL is computationally
expensive and extremely slow.
The most important challenge facing the
designer is that all of the implementation
“intelligence” associated with the design is
hard-coded into the RTL, which therefore
becomes rigid and implementation-specific.
Next-Generation C-Based Flow
An examination of the conventional flow
reveals three stages:
- Untimed algorithm evaluation in
MATLAB or C/C++, including quantization
and integral/fixed-point analysis
- Algorithm (untimed) to RTL (timed)
translation, including verification and
“what-if” implementation analysis
- RTL to gate-level netlist using industry-standard
RTL synthesis
The front-end untimed algorithm evaluations
and the back-end RTL-to-netlist
synthesis are both well known and efficient.
The bottleneck is the manual creation
of the RTL, including performing
“what-if” evaluations, implementing specification
changes, and verifying the RTL.
An ideal flow should be based on
industry-standard ANSI C/C++, the language
of choice for software and system-level
modeling for many years. The pure,
untimed C/C++ written by system designers
is an excellent source for creating hardware
because it is devoid of implementation
details. This gives the synthesis tool
maximum flexibility and provides a source that is
“liquid,” capable of targeting ASICs,
FPGAs, highly compact small solutions,
and highly parallel fast solutions.
Translation from MATLAB to C/C++ is
still manual, but because these domains are
conceptually very close, the translation is
relatively quick and easy.
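To make "pure, untimed" concrete, here is a minimal sketch of a small FIR filter written the way a system designer might model it, together with a plain C++ test bench. The function names, the filter length `kTaps`, and the coefficient values are illustrative assumptions, not taken from any real design. Note what the source does not say: nothing about clocks, the number of multipliers, or whether the delay line lives in registers or RAM.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTaps = 4;  // illustrative filter length

// Untimed FIR filter: one call consumes one input sample and produces one
// output sample. No clock, port, or pipeline information appears here.
inline int fir(int sample, const std::array<int, kTaps>& coeffs,
               std::array<int, kTaps>& delay_line) {
    // Shift the delay line by one position and insert the new sample.
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        delay_line[i] = delay_line[i - 1];
    }
    delay_line[0] = sample;

    // Multiply-accumulate across all taps.
    int acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += coeffs[i] * delay_line[i];
    }
    return acc;
}

// Plain C++ test bench: a unit impulse must return the coefficients in
// order, followed by zero once the impulse has left the delay line.
inline void fir_testbench() {
    const std::array<int, kTaps> coeffs{{1, 2, 3, 4}};
    std::array<int, kTaps> delay_line{{0, 0, 0, 0}};
    for (std::size_t n = 0; n < kTaps; ++n) {
        const int out = fir(n == 0 ? 1 : 0, coeffs, delay_line);
        assert(out == coeffs[n]);
    }
    assert(fir(0, coeffs, delay_line) == 0);
}
```

Because this is ordinary C++, the test bench compiles and runs with any standard compiler, which is what makes system-level validation at this level fast.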
The untimed C/C++ adds significant
value by providing much faster simulation
than the MATLAB/Simulink environment,
and is thus ideally suited for system-level
validation. Following verification, the C
representation is used to automatically generate
RTL, which in turn
drives existing RTL synthesis technology
(Figure 2).
With this flow, you can synthesize the
untimed C/C++ directly into a gate-level
netlist. However, generating the intermediate
RTL provides a timed “comfort zone”
for existing flows by allowing you to validate
the implementation decisions made by
the C synthesis tool.
Furthermore, RTL is a useful point to
“stitch” the various functional blocks
together. Large portions of today’s designs
exist as IP blocks represented at the RTL
level of abstraction. This means that RTL is
a useful point in the design flow for integrating
and verifying the entire hardware
system. Your design teams can thus take
full advantage of existing, mature, and
robust RTL design tools for tasks such as test insertion
or power analysis.
The ideal flow based on algorithmic
synthesis of pure, untimed C/C++ addresses
all of the traditional bottlenecks:
- Functional Intent. Almost no conceptual
gap exists because system architects
and hardware designers use the
same untimed C/C++ source. Their
worlds are connected for the first time.
Moreover, sharing one source all but eliminates
misinterpretation by the hardware
designer, thereby reducing errors and
improving overall reliability. The new
flow also easily accommodates design
specification changes.
- Meeting Requirements. Algorithmic C
synthesis provides accurate metrics up
front, shortening lengthy RTL synthesis
runtimes and manual RTL optimization.
You can leverage these metrics to
make system-level macro-architecture
partitioning decisions, thus creating a
design that is better architected to meet
system performance.
- Design Complexity. You can address
the design complexity issue by using
algorithmic C synthesis to thoroughly
explore any highly complex design
space. C is fast and efficient to create
and verify, providing additional benefits
around system-level validation and
integration. Whereas RTL uses technology-dependent
coding styles and hard-codes
the micro-architecture, the
ideal flow makes evaluating alternative implementations
fast and efficient. You
can modify and re-verify C to effectively
perform a series of “what-if”
evaluations of alternative algorithms.
Thus, your design teams are no longer limited
in the number of evaluations they
can perform, which leads to a better-optimized
implementation.
- RTL Reuse. A key feature of this ideal
flow is that the C representation is completely
abstracted from the final implementation.
Therefore, as opposed to
embedding implementation “intelligence”
into the C representation, designers
can instead use such intelligence to
drive the C to the RTL implementation
through a series of “soft” constraints. In
turn, this means that they can easily re-target
the same C representation for different
micro-architectures and
ASIC/FPGA implementations.
- Functional Verification. Verifying C is
fast and efficient. A pure untimed C
representation will simulate as much as
10,000 times faster than an equivalent
RTL representation (the larger the
design, the faster C is compared to its
RTL counterpart).
Let’s examine alternatives to hardware
design based on the use of C/C++. These
include SystemC and the synthesizable subset
of pure untimed C++ used by the
Mentor Graphics Catapult C Synthesis tool.
SystemC-Based Flow
Two main SystemC-based design flows
exist. Both require the untimed algorithm
representation to be manually translated
into its SystemC counterpart. Following
verification via simulation, you can automatically
translate the SystemC representation
into an RTL equivalent for use with
existing synthesis technology. Alternatively,
you can directly synthesize the SystemC
representation into a gate-level netlist
(Figure 3).
Because it was specifically created to
represent hardware, SystemC is equipped
with hardware-centric data types, including
integral and fixed-point entities with
rounding and overflow modes. SystemC
also includes system-level simulation
capabilities, including support for
abstract data transactions. Although powerful,
SystemC is an extremely complex
language. Moreover, the pseudo-timed
constructs required for SystemC synthesis
and simulation are foreign to both system-level and hardware designers.
One advantage of SystemC is that it
simulates as much as 100 times faster than
an equivalent RTL representation specified
at the same level of abstraction. However,
to make a SystemC representation suitable
for RTL generation or direct C synthesis,
designers would need to write it at nearly
the same level of abstraction as hand-translated
RTL, which largely negates the
advantages of using it in the first place.
Even worse, all of the implementation
“intelligence” associated with the design
has to be hard-coded into the SystemC representation,
which therefore becomes
implementation-specific. This means that a
SystemC representation intended for an
FPGA is not suitable for a subsequent ASIC
realization, and vice versa. Finally, it is not
possible to re-target the SystemC representation
to a compact or highly parallel solution
because the micro-architecture is
hard-coded.
Another SystemC approach
“wraps” the untimed C++ algorithm
in a timed interface. This approach
may have some advantages in system-level
integration; however, the resulting
source is now pseudo-timed and
hard-coded to the hardware interface.
Therefore, the notion of interface
exploration is not practical.
For example, targeting the C source
to a streaming I/O model versus a single-port
memory implies re-coding the
interface wrapper, which is difficult and
time-consuming. In addition, the source is
no longer the pure, untimed C++
description already validated and
proven by the system designer. Thus,
any interface changes will require re-verification
and possibly introduce foreign
coding concepts to the pure C++
representation.
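The cost of hard-coding the interface can be sketched in plain C++ (not SystemC, and without any timing). The core function `scale_and_offset` and the two wrappers below are hypothetical names invented for this illustration. The point is only that moving from an addressed, memory-style interface to an ordered, streaming-style one means writing and re-verifying a different wrapper around the same core.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Core algorithm (hypothetical): the part the system designer validated.
inline int scale_and_offset(int x) { return 3 * x + 1; }

// Wrapper 1: memory-style interface. Data is addressed by index, as it
// would be when the block reads from and writes to a RAM.
inline void process_memory(const std::vector<int>& in, std::vector<int>& out) {
    assert(in.size() == out.size());
    for (std::size_t addr = 0; addr < in.size(); ++addr) {
        out[addr] = scale_and_offset(in[addr]);
    }
}

// Wrapper 2: streaming-style interface. Data arrives and leaves in order,
// one item at a time, with no addresses.
inline void process_stream(std::queue<int>& in, std::queue<int>& out) {
    while (!in.empty()) {
        out.push(scale_and_offset(in.front()));
        in.pop();
    }
}
```

Both wrappers compute the same results, yet neither can be derived from the other without editing source code, which is why interface exploration becomes impractical once the wrapper is part of the design description.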
Finally, the degree of interface
detail is extremely critical. Too much
information stifles the behavioral synthesis
tool and results in sub-optimal designs. Too
little means the tool doesn’t have the minimum
information needed to synthesize the
design, resulting in functional errors.
Catapult C-Based Flow
As noted previously, the most significant
problem with existing C-based design flows
is that the implementation “intelligence”
associated with the design has to be hard-coded
into the C representation, which
then becomes implementation-specific.
Avoiding this hard-coding is the key differentiator of the
Catapult C-based design flow from Mentor
Graphics. In this flow, the C code is very
close to what a system designer would write
to model functional behavior without any
preconceived hardware implementation or
target device architecture in mind.
As opposed to adding “intelligence” to
the source code (thereby locking it into a
target implementation), all of the intelligence
is provided by controlling the
Catapult C engine itself (Figure 4).
Catapult C uses industry-standard
C++ source code augmented with
SystemC data types that allow specific
bit-widths to be associated with variables
and constants. An advantage is
that many companies already create an
untimed C/C++ representation of their
designs for algorithmic validation. They
do this because a pure C representation
is easy and compact to write and simulates
100 to 10,000 times faster than an
equivalent RTL representation.
The only modification typically
required to use this model with
Catapult C is to add a single pragma to
the source code to indicate the top of
the functional portion of the design –
anything conceptually above this point
is considered part of the test bench.
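A minimal sketch of what such a marked-up source might look like follows. The pragma spelling shown is an assumption and may differ between tool versions, so check the tool documentation for the exact form; the function names are hypothetical. A standard C++ compiler ignores pragmas it does not recognize (possibly with a warning), so the same file still builds and runs as an ordinary software model.

```cpp
#include <cassert>

// Assumed pragma spelling; the exact form depends on the tool and version.
// It marks the function below as the top of the hardware design.
#pragma hls_design top
int accumulate4(const int in[4]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        sum += in[i];
    }
    return sum;
}

// Everything that calls the marked function is test bench, not hardware.
inline int run_testbench() {
    const int stimulus[4] = {1, 2, 3, 4};
    const int result = accumulate4(stimulus);
    assert(result == 10);
    return result;
}
```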
Another Catapult C advantage is its
intuitive interface. Once the tool has read
the source code, you can immediately perform
micro-architecture tradeoffs and evaluate
their effects in terms of size and speed.
Catapult C easily associates ports with registers
or RAM blocks. It identifies constructs
like loops and allows you to specify
– on an individual basis – whether they
should be unrolled, partially unrolled, or
left alone. You can also specify if you wish
to perform resource sharing on specific entities,
pipeline loops, and other constructs.
All of these evaluations are done within
a few seconds or minutes depending on
design size. Catapult C then reports total
size/area and latency in terms of clock
cycles or input-to-output delays (or
throughput time/cycles in the case of
pipelined designs). You can name, save,
and reuse any of these “what-if” scenarios.
It would be almost impossible to perform
these tradeoffs in a timely manner
using a conventional hand-coded RTL-based
flow.
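To show what a choice such as "partially unrolled" means, the sketch below writes the same dot product twice: once as a rolled loop and once unrolled by two by hand. The names and the trip count `kN` are illustrative assumptions, and this is not tool output. In a constraint-driven flow you would keep only the first form and ask the tool for the second; in hand-coded RTL, each variant is a separate design to write and verify.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 8;  // illustrative trip count

// Rolled loop: one multiply-accumulate per iteration. In hardware this
// would typically map to one multiplier reused over kN iterations.
inline int dot_rolled(const int a[kN], const int b[kN]) {
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    for (std::size_t i = 0; i < kN; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Same loop unrolled by 2, written out by hand: two multiply-accumulates
// per iteration, so half as many iterations but room for two multipliers
// to work in parallel.
inline int dot_unrolled2(const int a[kN], const int b[kN]) {
    static_assert(kN % 2 == 0, "hand-unrolled form assumes an even trip count");
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    for (std::size_t i = 0; i < kN; i += 2) {
        acc += a[i] * b[i];
        acc += a[i + 1] * b[i + 1];
    }
    return acc;
}
```

Both functions return the same value for the same inputs; they differ only in how much work each iteration exposes, which is exactly the size/speed tradeoff being evaluated.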
More importantly, the fact that the C
source code used by Catapult C is not
required to contain any implementation
“intelligence” – and that all such intelligence
is supplied by controlling the Catapult C
engine itself – means that your design teams can easily re-target the same source code to
alternative micro-architectures and different
implementation technologies.
Conclusion
The fundamental difference between the various
C-based design flows is the level of synthesis
abstraction they support (Figure 5).
SystemC offers significant system-level simulation
capabilities, but its synthesizable
subset is at a lower abstraction level, so the only way
to change the results is to modify the source.
This lack of synthesis abstraction causes
the SystemC representations to be
implementation-specific. This makes them
difficult to create and modify, and significantly
reduces their flexibility with regard
to performing “what-if” evaluations and
re-targeting them toward alternative
implementation technologies.
By comparison, Catapult C employs
models represented in standard C++ and
supports a high level of synthesis abstraction.
Because they are not implementation-specific,
Catapult C models are compact
and can thus be easily created and modified.
By means of the Catapult C engine
itself, you can quickly perform “what-if”
evaluations and re-target the design toward
alternative implementation technologies.
The end result is that the Catapult C-based
design flow dramatically speeds
implementation, improves design flow reliability,
and increases design flexibility when
compared to other C-based flows or traditional
hand-coded RTL methods.
Catapult C Synthesis has already been
instrumental in many successful tapeouts
from major hardware design companies
worldwide. The mature, second-generation
algorithmic synthesis environment
unites two distinct domains – system-level
design and hardware design – and when
combined with Mentor Graphics
ModelSim™ simulation tools, lays the
foundation for next-generation electronic
system level (ESL) design.
To learn more about how Catapult C
Synthesis can address your hardware design
needs, call Mentor Graphics to schedule a
complete product demonstration, or visit
our website for the latest product news and
case studies at www.mentor.com/c-design/.
Printable PDF version of this article with graphics. (8/1/04) 355 KB