Optimizing Compute Units

Datawidth

One of the most important aspects for performance, if not the most important, is the datawidth required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ code might use only large data types such as integers, even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types such as 10-bit or 12-bit often suffice. To that end, it is beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when SDx™ maps an algorithm onto the FPGA, considerable processing is required to comprehend the C/C++/OpenCL structure and extract operational dependencies. Therefore, to perform this mapping, SDx generally partitions the source code into operational units, which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.

In the following table, the basic operations and their bitwidth are reported.

Look for typical bitwidths of 16, 32, and 64 bits, as commonly used in algorithmic descriptions, and verify whether the associated operation from the C/C++/OpenCL source actually needs to be this large. Reducing the width can considerably improve the implementation of the algorithm, because smaller operations require less computation time.
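To judge whether a 16-, 32-, or 64-bit operation is oversized, it can help to work out how many bits the value range really needs. The following is a minimal C++ sketch of that arithmetic. The helper names bits_required and product_bits are hypothetical and are not part of any Xilinx library; the value ranges are assumed for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: number of bits needed to hold every unsigned
// value in the range [0, max_value].
inline unsigned bits_required(std::uint64_t max_value) {
    unsigned bits = 1;  // the value 0 still occupies one bit
    while (max_value >>= 1) {
        ++bits;
    }
    return bits;
}

// Hypothetical helper: an unsigned product of an N-bit and an M-bit
// operand always fits in N + M bits.
inline unsigned product_bits(unsigned lhs_bits, unsigned rhs_bits) {
    return lhs_bits + rhs_bits;
}
```

For example, bits_required(1023) evaluates to 10 and bits_required(4095) to 12, and product_bits(10, 10) gives 20. Under these assumptions, a multiplication of two such samples needs a 20-bit result rather than the 64-bit result of a 32-bit by 32-bit multiply. In Vivado HLS, the reduced width would then typically be expressed with an arbitrary precision type such as ap_uint from ap_int.h.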

Fixed Point Arithmetic

Some applications use floating point computation only because they are optimized for other hardware architectures. As explained in “Deep Learning with INT8 Optimization on Xilinx Devices,” using fixed point arithmetic for applications like deep learning can significantly improve power efficiency and reduce area while keeping the same level of accuracy. It is recommended to explore fixed point arithmetic for your application before committing to using floating point operations.
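As a rough illustration of what fixed point arithmetic involves, the sketch below multiplies two values in a Q8.8 format (8 integer bits including the sign, 8 fractional bits) using only integer operations. The format, the type alias, and the function names are assumptions chosen for this example. Vivado HLS provides arbitrary precision fixed point types (ap_fixed) for this purpose; plain integers are used here so that the sketch compiles with any C++ compiler.

```cpp
#include <cstdint>

// Assumed format for illustration: Q8.8 stored in a 16-bit integer.
using fixed_q8_8 = std::int16_t;
constexpr int kFracBits = 8;

// Convert a real value to Q8.8, rounding to the nearest step.
// No range checking: the caller must keep values within the Q8.8 range.
inline fixed_q8_8 to_fixed(double x) {
    const double scaled = x * (1 << kFracBits);
    return static_cast<fixed_q8_8>(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

// Convert a Q8.8 value back to a real value.
inline double to_double(fixed_q8_8 x) {
    return static_cast<double>(x) / (1 << kFracBits);
}

// Multiply two Q8.8 values: widen to 32 bits, multiply, then shift the
// extra fractional bits away. The result is truncated, not rounded,
// and overflow of the 16-bit result is not handled.
inline fixed_q8_8 fixed_mul(fixed_q8_8 a, fixed_q8_8 b) {
    const std::int32_t wide =
        static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    return static_cast<fixed_q8_8>(wide >> kFracBits);
}
```

For example, fixed_mul(to_fixed(1.5), to_fixed(2.25)) converts back to 3.375, which matches the floating point product exactly because both operands are representable in Q8.8. Values that are not exactly representable incur a quantization error, so the accuracy should be checked against the requirements of the application.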

Macro Operations

It is sometimes advantageous to think about larger computational elements. The tool then operates on the selected code independently of the remaining source code, effectively mapping that part of the algorithm onto the FPGA without consideration of surrounding operations. When applied, SDx keeps operational boundaries, effectively creating macro operations for specific code. This utilizes the following principles:

  • Operational locality to the mapping process.
  • Reduction in complexity for the heuristics.
This might create vastly different results when applied. In C/C++, macro operations are created with the help of the following pragma:
#pragma HLS inline off
In OpenCL, the same kind of macro operation can be generated by not specifying the following attribute when defining a function:
__attribute__((always_inline))
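The following minimal C++ sketch shows where the pragma is typically placed: inside the body of the function that should remain a separate operational unit. The function names multiply_accumulate and dot4 are hypothetical and exist only for this illustration. A standard C++ compiler ignores the pragma, so the code can also be compiled and checked outside the tool.

```cpp
#include <cstdint>

// Hypothetical helper that should be kept as its own macro operation.
std::int32_t multiply_accumulate(std::int32_t acc, std::int16_t a,
                                 std::int16_t b) {
// Ask the tool not to inline this function into its callers.
#pragma HLS inline off
    return acc + static_cast<std::int32_t>(a) * b;
}

// Hypothetical caller: each loop iteration uses the helper above.
std::int32_t dot4(const std::int16_t x[4], const std::int16_t y[4]) {
    std::int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc = multiply_accumulate(acc, x[i], y[i]);
    }
    return acc;
}
```

Calling dot4 with x = {1, 2, 3, 4} and y = {5, 6, 7, 8} returns 70 regardless of whether the helper is inlined. The pragma changes only how the tool structures the hardware, not the functional result, so comparing reports with and without it is a reasonable way to see which variant suits the design.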

Utilizing Optimized Libraries

The OpenCL Specification provides a wealth of math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and typically perform better than the corresponding functions without the native_ prefix. The accuracy and, in some cases, the input ranges of these functions are implementation-defined. In the SDAccel™ environment, these native_ built-in functions use the equivalent functions in the Vivado® HLS Math library, which are already optimized for Xilinx® FPGAs in terms of area and performance. Xilinx recommends that you use the native_ built-in functions or the HLS Math library if the accuracy meets the application requirement.