Xcell Journal Online
Algorithmic C Synthesis Optimizes ESL Design Flows



by Shawn McCloud, High-Level Synthesis Product Manager, Mentor Graphics Corporation
shawn_mccloud@mentor.com (8/1/04)


Using pure, untimed algorithmic C dramatically speeds implementation and increases design flexibility when compared to other C-based flows.


High-end electronic design flows have traditionally included the creation of Verilog™/VHDL representations by hand. These manual methods were effective in the past, but the algorithms used in many of today’s new designs are so complex that traditional design practices are now inadequate.

Meanwhile, FPGAs are increasingly attractive because companies can avoid time-consuming, exorbitant mask re-spins and other risks associated with ASICs. The emergence of multimillion-gate, 1,000+ pin, “ASIC-like” devices incorporating embedded processors and innovative memory architectures calls for a system-level approach to programmable logic design.

FPGAs have already moved beyond their traditional applications into new domains such as digital signal processing (DSP). Unfortunately, creating register transfer level (RTL) implementations for high-end FPGAs can become as error-prone and time-consuming as when targeting an ASIC, thereby negating much of their inherent value.

You can now prevent these problems by adopting a design flow based on the simulation and synthesis of C representations. By using pure untimed C++ to describe functional intent, your design teams can move up to a far more productive abstraction level for designing hardware, reducing implementation effort by as much as a factor of 20 while creating a more repeatable and reliable design flow.

An important outcome of this approach is that, by identifying fundamentally superior micro-architectural solutions, you can produce designs of better quality than those created with traditional RTL methods.

In this article, we’ll examine the conventional design flow and its associated problems, and highlight some alternative approaches to hardware design based on the use of C/C++, comparing the pros and cons of SystemC™ and the pure, untimed C++ used by Mentor Graphics® Catapult™ C Synthesis tool.

Traditional Design Flow
Many high-end designs in the communications or video/image processing industries are typically based on extremely complex algorithms. The first step in a conventional design flow involves modeling and proving the design functionality at the algorithmic level of abstraction, using tools such as MATLAB™ from The MathWorks or plain C/C++ modeling.

MATLAB is good for initial algorithm proof-of-concept and validation, although many design teams also develop C/C++ models to facilitate high-speed system-level verification beyond what MATLAB can provide. For subsequent discussion, we’ll use the term “untimed” to represent those algorithms written either in MATLAB or pure ANSI C/C++.

Based on project requirements, system architects then partition the design into blocks to be implemented either in hardware or software. For the hardware blocks, a floating-point algorithm represents the functionality. Next, either the system or hardware designer quantizes the floating-point algorithm into an integral or fixed-point algorithm. These fixed-point algorithms are represented in MATLAB, Simulink™, or untimed C++ using bit-accurate types (SystemC 2.0). After validating the fixed-point algorithm, the hardware designer starts the long and tedious manual process of creating Verilog or VHDL for the RTL abstraction. This process can be divided into three distinct phases:

  • Micro-Architecture Definition. Decide on the structure of the data path, control, and interfaces. Typically done on paper or perhaps a Microsoft™ Excel™ spreadsheet. The resulting micro-architecture has a significant impact on the overall speed/area of the hardware. Designs can easily swing by 10 times in area or performance based on the decisions made.
  • RTL Design. Manually write the RTL to represent the defined micro-architecture.
  • RTL Area/Timing Optimization. Iterate through RTL synthesis to meet design goals.
In some cases, the hardware engineers manually translate the floating-point untimed algorithm into bit-accurate RTL, either Verilog or VHDL. This RTL is subsequently synthesized into a gate-level netlist using traditional RTL synthesis technology (Figure 1).
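The quantization step described above can be sketched in plain C++. This is only an illustration under stated assumptions: the Q1.15 format, the two-tap weighted sum, and the helper names `to_fixed`, `to_double`, `mac_float`, and `mac_fixed` are hypothetical and do not come from any tool or library. Ordinary integer types stand in for the bit-accurate types mentioned in the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical format for illustration: signed 16-bit value with 15 fractional bits (Q1.15).
constexpr int kFracBits = 15;

// Quantize a real value to Q1.15 with round-to-nearest and saturation.
inline std::int16_t to_fixed(double x) {
    double scaled = std::round(x * (1 << kFracBits));
    if (scaled > 32767.0) scaled = 32767.0;    // saturate on positive overflow
    if (scaled < -32768.0) scaled = -32768.0;  // saturate on negative overflow
    return static_cast<std::int16_t>(scaled);
}

// Convert a Q1.15 value back to a real number for comparison.
inline double to_double(std::int16_t q) {
    return static_cast<double>(q) / (1 << kFracBits);
}

// Floating-point reference: a two-tap weighted sum.
inline double mac_float(double a, double b, double c0, double c1) {
    return a * c0 + b * c1;
}

// Fixed-point version of the same computation on Q1.15 operands.
inline std::int16_t mac_fixed(std::int16_t a, std::int16_t b,
                              std::int16_t c0, std::int16_t c1) {
    // Each product has 30 fractional bits; accumulate in 64 bits so nothing overflows.
    std::int64_t acc = static_cast<std::int64_t>(a) * c0 +
                       static_cast<std::int64_t>(b) * c1;
    acc += std::int64_t{1} << (kFracBits - 1);  // add half an LSB to round to nearest
    acc >>= kFracBits;  // arithmetic shift (guaranteed since C++20, common practice before)
    assert(kFracBits == 15);                    // the saturation limits below assume Q1.15
    if (acc > 32767) acc = 32767;               // saturate the result to 16 bits
    if (acc < -32768) acc = -32768;
    return static_cast<std::int16_t>(acc);
}
```

Comparing `to_double(mac_fixed(...))` against `mac_float(...)` on the same stimulus is the kind of check a designer would run to validate the fixed-point algorithm before any RTL is written.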

The main problems associated with this traditional flow are:

  • Functional Intent. A significant conceptual and representational divide exists between the system architects working with untimed algorithms and the hardware designers working with the timed RTL in VHDL/Verilog. As a result, the original design intent specified by the system architect is easily misinterpreted, causing functional errors in the end product. In addition, it is relatively easy to implement and evaluate specification changes in the untimed algorithm, but very painful and time-consuming to subsequently fold these changes into the RTL. This is a serious consideration in wireless applications, because broadcast standards and protocols constantly evolve and change.
  • Meeting Requirements. Predicting design performance (area, delay, power) is difficult until RTL is done. Therefore, system-level partitioning and the resulting block-level design goals are inaccurate at best. Many system-level timing closure problems are directly related to poor macroarchitectural choices and unrealistic goals placed on the hardware engineer designing the hardware blocks.
  • Design Complexity. Because the untimed algorithmic domain and RTL domain are dissimilar, the manual translation from untimed algorithms to RTL is prolonged and error-prone. In addition, RTL uses technology-dependent coding styles and “hard-codes” the micro-architecture.

    Evaluating alternative implementations is impractical because modifying and re-verifying RTL for a series of “what-if” analyses of alternate micro-architectures takes too long. Such evaluations may include performing certain operations in parallel versus sequentially; pipelining portions of the design versus non-pipelining; or sharing common resources. Because of the amount of time involved, design teams are limited in the number of evaluations they can perform, which can result in a non-optimal implementation. The complexity of high-end, compute-intensive applications exemplifies the difficulties associated with traditional hand-coded RTL.

  • RTL Reuse. Using the same RTL for an ASIC and FPGA implies that the ASIC implementation is sub-optimal due to inherent FPGA performance limitations. Conversely, users can realize performance goals in an FPGA through massive parallelism; however, this parallelism may not be necessary for an ASIC. This makes it extremely difficult, if not impossible, to re-target a complex RTL design to create a tuned representation for the technology node. Finally, because RTL hard-codes the micro-architecture, using the same RTL for a 10 MHz application (for example) versus a high-performance 400 MHz application will result in sub-optimal hardware.
  • Functional Verification. Using traditional logic simulation to verify a large design represented in RTL is computationally expensive and extremely slow.
The most important challenge facing the designer is that all of the implementation “intelligence” associated with the design is hard-coded into the RTL, which therefore becomes rigid and implementation-specific.
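To make the "parallel versus sequential" choice from the Design Complexity item concrete, here is a minimal sketch written in C++ rather than RTL. The function names, the tap count, and the two-lane split are assumptions chosen for illustration. Both variants compute the same dot product, but each one bakes a different structure into the source, which is the rigidity the list above describes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaps = 8;  // hypothetical size; must be even for the two-lane variant
using Samples = std::array<std::int32_t, kTaps>;

// Variant 1: a single multiply-accumulate reused over kTaps steps (small, slow).
inline std::int64_t dot_sequential(const Samples& x, const Samples& c) {
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += static_cast<std::int64_t>(x[i]) * c[i];
    }
    return acc;
}

// Variant 2: two multiply-accumulate lanes handling even and odd taps side by side
// (roughly twice the multipliers, about half the steps).
inline std::int64_t dot_two_lanes(const Samples& x, const Samples& c) {
    static_assert(kTaps % 2 == 0, "two-lane variant needs an even tap count");
    std::int64_t acc_even = 0;
    std::int64_t acc_odd = 0;
    for (std::size_t i = 0; i < kTaps; i += 2) {
        acc_even += static_cast<std::int64_t>(x[i]) * c[i];
        acc_odd += static_cast<std::int64_t>(x[i + 1]) * c[i + 1];
    }
    return acc_even + acc_odd;  // final adder merges the two lanes
}
```

Switching from one variant to the other means rewriting and re-verifying the code. In hand-written RTL the same change also touches the control logic and registers, which is why few such alternatives get evaluated.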

Next-Generation C-based Flow
An examination of the conventional flow reveals three stages:

  • Untimed algorithm evaluation in MATLAB or C/C++, including quantization and integral/fixed-point analysis
  • Algorithm (untimed) to RTL (timed) translation, including verification and “what-if” implementation analysis
  • RTL to gate-level netlist using industry-standard RTL synthesis
The front-end untimed algorithm evaluations and the back-end RTL-to-netlist synthesis are both well known and efficient. The bottleneck is the manual creation of the RTL, including performing “what-if” evaluations, implementing specification changes, and verifying the RTL.

Any ideal flow should be based on industry-standard ANSI C/C++, the language of choice for software and system-level modeling for many years. The pure, untimed C/C++ written by system designers is an excellent source for creating hardware because it is void of implementation details. This maximizes flexibility to the synthesis tool and provides a source that is “liquid” – capable of targeting ASICs, FPGAs, highly compact small solutions, and highly parallel fast solutions. Translation from MATLAB to C/C++ is still manual, but because these domains are conceptually very close, the translation is relatively quick and easy.

The untimed C/C++ adds significant value by providing much faster simulation than the MATLAB Simulink environment, and is thus ideally suited for system-level validation. Following verification, the C representation is used to automatically generate RTL, which in turn is subsequently used to drive existing RTL synthesis technology (Figure 2).

With this flow, you can synthesize the untimed C/C++ directly into a gate-level netlist. However, generating the intermediate RTL provides a timed “comfort zone” for existing flows by allowing you to validate the implementation decisions made by the C synthesis tool.

Furthermore, RTL is a useful point to “stitch” the various functional blocks together. Large portions of today’s designs exist as IP blocks represented at the RTL level of abstraction. This means that RTL is a useful point in the design flow for integrating and verifying the entire hardware system. Your design teams can thus take full advantage of existing, mature, and robust RTL design tools such as test insertion or power analysis.

The ideal flow based on algorithmic synthesis of pure, untimed C/C++ addresses all of the traditional bottlenecks:

  • Functional Intent. Almost no conceptual gap exists because system architects and hardware designers use the same untimed C/C++ source. Their worlds are connected for the first time. Moreover, it eliminates any chance of misinterpretation by the hardware designer, thereby reducing errors and improving overall reliability. The new flow also easily accommodates design specification changes.
  • Meeting Requirements. Algorithmic C synthesis provides accurate metrics up front, shortening lengthy RTL synthesis runtimes and manual RTL optimization. You can leverage these metrics to make system-level macro-architecture partitioning decisions, thus creating a design that is better architected to meet system performance.
  • Design Complexity. You can address the design complexity issue by using algorithmic C synthesis to thoroughly explore any highly complex design space. C is fast and efficient to create and verify, providing additional benefits around system-level validation and integration. Whereas RTL uses technology-dependent coding styles and hard-codes the micro-architecture, evaluating alternative implementations in the ideal flow is fast and efficient. You can modify and re-verify C to effectively perform a series of “what-if” evaluations of alternative algorithms. Thus, your design teams are no longer limited in the number of evaluations they can perform, which leads to a better-optimized implementation.
  • RTL Reuse. A key feature of this ideal flow is that the C representation is completely abstracted from the final implementation. Therefore, as opposed to embedding implementation “intelligence” into the C representation, designers can instead use such intelligence to drive the C to the RTL implementation through a series of “soft” constraints. In turn, this means that they can easily retarget the same C representation for different micro-architectures and ASIC/FPGA implementations.
  • Functional Verification. Verifying C is fast and efficient. A pure untimed C representation will simulate as much as 10,000 times faster than an equivalent RTL representation (the larger the design, the faster C is compared to its RTL counterpart).
Let’s examine alternatives to hardware design based on the use of C/C++. These include SystemC and the synthesizable subset of pure untimed C++ used by the Mentor Graphics Catapult C Synthesis tool.

SystemC-Based Flow
Two main SystemC-based design flows exist: both require the untimed algorithm representation to be manually translated into its SystemC counterpart. Following verification via simulation, you can automatically translate the SystemC representation into an RTL equivalent for use with existing synthesis technology. Alternatively, you can directly synthesize the SystemC representation into a gate-level netlist (Figure 3).

Because it was specifically created to represent hardware, SystemC is equipped with hardware-centric data types, including integral and fixed-point entities with rounding and overflow modes. SystemC also includes system-level simulation capabilities, including support for abstract data transactions. Although powerful, SystemC is an extremely complex language. Moreover, the pseudo-timed constructs required for SystemC synthesis and simulation are foreign to both system-level and hardware designers.

One advantage of SystemC is that it simulates as much as 100 times faster than an equivalent RTL representation specified at the same level of abstraction. However, to make a SystemC representation suitable for RTL generation or direct C synthesis, designers would need to write it at nearly the same level of abstraction as hand-translated RTL, which largely negates the advantages of using it in the first place.

Even worse, all of the implementation “intelligence” associated with the design has to be hard-coded into the SystemC representation, which therefore becomes implementation-specific. This means that a SystemC representation intended for an FPGA is not suitable for a subsequent ASIC realization, and vice versa. Finally, it is not possible to re-target the SystemC representation to a compact or highly parallel solution because the micro-architecture is hard-coded.

Another SystemC approach “wraps” the untimed C++ algorithm in a timed interface. This approach may have some advantages in system-level integration; however, the resulting source is now pseudo-timed and hard-coded to the hardware interface. Therefore, the notion of interface exploration is not practical.

For example, targeting the C source to a streaming I/O model versus a single-port memory implies re-coding the interface wrapper (difficult and time consuming). In addition, the source is no longer the pure, untimed C++ description already validated and proven by the system designer. Thus, any interface changes will require re-verification and possibly introduce foreign coding concepts to the pure C++ representation.

Finally, the degree of interface detail is extremely critical. Too much information stifles the behavioral synthesis tool and results in sub-optimal designs. Too little means the tool doesn’t have the minimum information needed to synthesize the design, resulting in functional errors.

Catapult C-Based Flow
As noted previously, the most significant problem with existing C-based design flows is that the implementation “intelligence” associated with the design has to be hard-coded into the C representation, which then becomes implementation-specific. Avoiding this problem is the key differentiator of the Catapult C-based design flow from Mentor Graphics. In this flow, the C code is very close to what a system designer would write to model functional behavior without any preconceived hardware implementation or target device architecture in mind.

As opposed to adding “intelligence” to the source code (thereby locking it into a target implementation), all of the intelligence is provided by controlling the Catapult C engine itself (Figure 4).

Catapult C uses industry-standard C++ source code augmented with SystemC data types that allow specific bit-widths to be associated with variables and constants. An advantage is that many companies already create an untimed C/C++ representation of their designs for algorithmic validation. They do this because a pure C representation is easy and compact to write and simulates 100 to 10,000 times faster than an equivalent RTL representation.
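As a rough illustration of what associating a specific bit-width with a variable means, the sketch below defines a small stand-in type in plain C++. It is an assumption made for this example, not the SystemC implementation: `FixedWidthInt` is a hypothetical name, and it models only two's-complement wrap-around at W bits for addition.

```cpp
#include <cstdint>

// Hypothetical stand-in for a bit-accurate signed integer such as SystemC's sc_int<W>.
// It only shows wrap-around at W bits; it is not how SystemC is implemented.
template <int W>
class FixedWidthInt {
    static_assert(W >= 1 && W <= 32, "width must be between 1 and 32 bits");

public:
    FixedWidthInt(std::int64_t v = 0) : value_(wrap(v)) {}

    std::int64_t value() const { return value_; }

    // The sum is wrapped back into W bits, as a W-bit hardware adder would do.
    FixedWidthInt operator+(FixedWidthInt rhs) const {
        return FixedWidthInt(value_ + rhs.value_);
    }

private:
    static std::int64_t wrap(std::int64_t v) {
        const std::uint64_t mask = (std::uint64_t{1} << W) - 1;
        const std::uint64_t low = static_cast<std::uint64_t>(v) & mask;  // keep the low W bits
        const std::uint64_t sign_bit = std::uint64_t{1} << (W - 1);
        if (low & sign_bit) {
            // Sign bit set: interpret as negative by subtracting 2^W.
            return static_cast<std::int64_t>(low) - static_cast<std::int64_t>(mask) - 1;
        }
        return static_cast<std::int64_t>(low);
    }

    std::int64_t value_;
};
```

With a type like this, `FixedWidthInt<4>(7) + FixedWidthInt<4>(1)` wraps to -8, so the C++ simulation shows the same overflow a 4-bit datapath would produce.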

The only modification typically required to use this model with Catapult C is to add a single pragma to the source code to indicate the top of the functional portion of the design – anything conceptually above this point is considered part of the test bench.
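A minimal sketch of this arrangement follows. The pragma spelling is an assumption taken from later Catapult releases rather than from this article, so check the documentation for the tool version in use. The function `scale_and_offset`, the block size, and `run_testbench` are hypothetical names invented for the example.

```cpp
#include <cstddef>

constexpr std::size_t kLen = 4;  // hypothetical block size for illustration

// Assumed pragma spelling (not confirmed by this article). Compilers that do not
// recognize a pragma ignore it, so the same file still builds as ordinary C++.
#pragma hls_design top
void scale_and_offset(const int in[kLen], int out[kLen]) {
    for (std::size_t i = 0; i < kLen; ++i) {
        out[i] = 3 * in[i] + 1;  // pure function of the inputs: no clocks, no ports
    }
}

// Test bench: it sits above the top function in the call hierarchy, so it is
// treated as verification code rather than as hardware to be synthesized.
bool run_testbench() {
    const int stimulus[kLen] = {0, 1, 2, -3};
    const int expected[kLen] = {1, 4, 7, -8};
    int response[kLen] = {};
    scale_and_offset(stimulus, response);
    for (std::size_t i = 0; i < kLen; ++i) {
        if (response[i] != expected[i]) {
            return false;
        }
    }
    return true;
}
```

The functional code itself is unchanged by the pragma, which is why the system designer's validated model can serve directly as the synthesis source.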

Another Catapult C advantage is its intuitive interface. Once the tool has read the source code, you can immediately perform micro-architecture tradeoffs and evaluate their effects in terms of size and speed. Catapult C easily associates ports with registers or RAM blocks. It identifies constructs like loops and allows you to specify – on an individual basis – whether they should be unrolled, partially unrolled, or left alone. You can also specify if you wish to perform resource sharing on specific entities, pipeline loops, and other constructs.

All of these evaluations are done within a few seconds or minutes depending on design size. Catapult C then reports total size/area and latency in terms of clock cycles or input-to-output delays (or throughput time/cycles in the case of pipelined designs). You can name, save, and reuse any of these “what-if” scenarios. It would be almost impossible to perform these tradeoffs in a timely manner using a conventional hand-coded RTL-based flow.
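To give a feel for how loop constraints change latency, here is a first-order cycle model in C++. It is a simplified textbook estimate, not Catapult C's own estimator. The `LoopSpec` struct and the function names are hypothetical, and the model assumes independent iterations and enough hardware resources for the requested parallelism.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a loop: trip count and cycles per iteration.
struct LoopSpec {
    std::uint32_t iterations;
    std::uint32_t body_cycles;
};

// Loop left alone: iterations run back to back.
inline std::uint64_t latency_rolled(LoopSpec loop) {
    return static_cast<std::uint64_t>(loop.iterations) * loop.body_cycles;
}

// Loop unrolled by `factor`: that many iterations execute side by side per pass.
inline std::uint64_t latency_unrolled(LoopSpec loop, std::uint32_t factor) {
    assert(factor >= 1);
    const std::uint64_t passes =
        (static_cast<std::uint64_t>(loop.iterations) + factor - 1) / factor;  // ceiling
    return passes * loop.body_cycles;
}

// Loop pipelined with initiation interval `ii`: a new iteration starts every
// `ii` cycles, and the last one still needs body_cycles to finish.
inline std::uint64_t latency_pipelined(LoopSpec loop, std::uint32_t ii) {
    assert(ii >= 1);
    if (loop.iterations == 0) {
        return 0;
    }
    return loop.body_cycles +
           static_cast<std::uint64_t>(loop.iterations - 1) * ii;
}
```

For a 64-iteration loop with a 3-cycle body, this model gives 192 cycles rolled, 48 cycles unrolled by 4, and 66 cycles when pipelined with an initiation interval of 1. The area cost of each option is not modeled here.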

More importantly, the fact that the C source code used by Catapult C is not required to contain any implementation “intelligence” – and that all such intelligence is supplied by controlling the Catapult C engine itself – means that your design teams can easily re-target the same source code to alternative micro-architectures and different implementation technologies.

Conclusion
The fundamental difference between the various C-based design flows is the level of synthesis abstraction they support (Figure 5). SystemC offers significant system-level simulation capabilities, but its synthesizable subset is at a lower abstraction level, so results are driven by modifying the source itself.

This lack of synthesis abstraction causes the SystemC representations to be implementation-specific. This makes them difficult to create and modify, and significantly reduces their flexibility with regard to performing “what-if” evaluations and re-targeting them toward alternative implementation technologies.

By comparison, Catapult C employs models represented in standard C++ and supports a high level of synthesis abstraction. Because they are not implementation-specific, Catapult C models are compact and can thus be easily created and modified.

By means of the Catapult C engine itself, you can quickly perform “what-if” evaluations and re-target the design toward alternative implementation technologies.

The end result is that the Catapult C-based design flow dramatically speeds implementation, improves design flow reliability, and increases design flexibility when compared to other C-based flows or traditional hand-coded RTL methods.

Catapult C Synthesis has already been instrumental in many successful tapeouts from major hardware design companies worldwide. The mature, second-generation algorithmic synthesis environment unites two distinct domains – system-level design and hardware design – and when combined with Mentor Graphics ModelSim™ simulation tools, lays the foundation for next-generation electronic system level (ESL) design.

To learn more about how Catapult C Synthesis can address your hardware design needs, call Mentor Graphics to schedule a complete product demonstration, or visit our website for the latest product news and case studies at www.mentor.com/c-design/.

© 1994-2008 Xilinx, Inc. All Rights Reserved.