Introduction to the Vitis Environment for Acceleration

Accelerated Flow Application Development Using the Vitis Software Platform

The Vitis™ unified software platform is a new tool that combines all aspects of Xilinx® software development into one unified environment. The Vitis software platform supports both the Vitis embedded software development flow, for Xilinx Software Development Kit (SDK) users looking to move into the next generation technology, and the Vitis application acceleration development flow, for software developers looking to use the latest in Xilinx FPGA-based software acceleration. This content is primarily concerned with the application acceleration flow, and use of the Vitis core development kit and Xilinx Runtime (XRT).

The Vitis application acceleration development flow provides a framework for developing and delivering FPGA-accelerated applications using standard programming languages for both software and hardware components. The software component, or host program, is developed using C/C++ to run on x86 or embedded processors, with OpenCL™ API calls to manage runtime interactions with the accelerator. The hardware component, or kernel, can be developed using C/C++, OpenCL C, or RTL. The Vitis software platform promotes concurrent development and test of the hardware and software elements of a heterogeneous application.

Figure 1: Vitis Unified Software Platform

As shown in the figure above, the Vitis unified software platform consists of the following features and elements:

  • Vitis technology targets acceleration hardware platforms, such as the Alveo™ Data Center accelerator cards, and Versal or Zynq® UltraScale+™ MPSoC-based embedded processor platforms.
  • XRT provides an API and drivers for your host program to connect with the target platform, and handles transactions between your host program and accelerated kernels.
  • Vitis core development kit provides the software development tool stack, such as compilers and cross-compilers, to build your host program and kernel code, analyzers to let you profile and analyze the performance of your application, and debuggers to help you locate and fix any problems in your application.
  • Vitis accelerated libraries provide performance-optimized FPGA acceleration with minimal code changes, and without the need to re-implement your algorithms to harness the benefits of Xilinx adaptive computing. Vitis accelerated libraries are available for common functions of math, statistics, linear algebra and DSP, and also for domain specific applications, like vision and image processing, quantitative finance, database, data analytics, and data compression. For more information on Vitis accelerated libraries, refer to https://xilinx.github.io/Vitis_Libraries/.

FPGA Acceleration

Xilinx FPGAs offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, resulting in better performance at lower power dissipation. When compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx device enable a high degree of parallelism in application execution.

To realize the advantages of software acceleration on a Xilinx device, you should look to accelerate large compute-intensive portions of your application in hardware. Implementing these functions in custom hardware gives you an ideal balance between performance and power.

For more information on how to architect an application for optimal performance and other recommended design techniques, review the Methodology for Accelerating Applications with the Vitis Software Platform.

Execution Model

In the Vitis core development kit, an application program is split between a host application and hardware accelerated kernels with a communication channel between them. The host program, written in C/C++ and using API abstractions like OpenCL, runs on a host processor (such as an x86 server or an Arm processor for embedded platforms), while hardware accelerated kernels run within the programmable logic (PL) region of a Xilinx device.

The API calls, managed by XRT, are used to process transactions between the host program and the hardware accelerators. Communication between the host and the kernel, including control and data transfers, occurs across the PCIe® bus or an AXI bus for embedded platforms. While control information is transferred between specific memory locations in the hardware, global memory is used to transfer data between the host program and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host first transfers data to be operated on by the kernel from host memory into global memory. The kernel subsequently operates on the data, storing results back to the global memory. Upon kernel completion, the host transfers the results back into the host memory. Data transfers between the host and global memory introduce latency, which can be costly to the overall application. To achieve acceleration in a real system, the benefits achieved by the hardware acceleration kernels must outweigh the added latency of the data transfers.

The target platform contains the FPGA accelerated kernels, global memory, and the direct memory access (DMA) for memory transfers. Kernels can have one or more global memory interfaces and are programmable. The Vitis core development kit execution model can be broken down into the following steps:

  1. The host program writes the data needed by a kernel into the global memory of the attached device through the PCIe interface on an Alveo Data Center accelerator card, or through the AXI bus on an embedded platform.
  2. The host program sets up the kernel with its input parameters.
  3. The host program triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading data from global memory, as necessary.
  5. The kernel writes data back to global memory and notifies the host that it has completed its task.
  6. The host program reads data back from global memory into the host memory and continues processing as needed.
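The steps above can be sketched in a host program using the OpenCL C++ wrapper API. This is a minimal sketch only: the kernel name vadd, the xclbin path, and the buffer sizes are hypothetical, and error checking is omitted for brevity. Running it requires XRT and an attached accelerator platform.

```cpp
// Minimal host-program sketch of the execution model steps.
// Kernel name "vadd", xclbin path, and sizes are hypothetical.
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // Locate the accelerator device; create a context and command queue.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
    cl::Context context(devices[0]);
    cl::CommandQueue q(context, devices[0]);

    // Load the device binary (.xclbin) and create the kernel object.
    std::ifstream f("vadd.xclbin", std::ios::binary);
    std::vector<unsigned char> bin(std::istreambuf_iterator<char>(f), {});
    cl::Program program(context, {devices[0]}, {bin});
    cl::Kernel krnl(program, "vadd");

    const size_t N = 4096;
    std::vector<int> a(N, 1), b(N, 2), out(N);

    // Step 1: write input data from host memory into device global memory.
    cl::Buffer buf_a(context, CL_MEM_READ_ONLY, N * sizeof(int));
    cl::Buffer buf_b(context, CL_MEM_READ_ONLY, N * sizeof(int));
    cl::Buffer buf_out(context, CL_MEM_WRITE_ONLY, N * sizeof(int));
    q.enqueueWriteBuffer(buf_a, CL_TRUE, 0, N * sizeof(int), a.data());
    q.enqueueWriteBuffer(buf_b, CL_TRUE, 0, N * sizeof(int), b.data());

    // Step 2: set the kernel's input parameters.
    krnl.setArg(0, buf_a);
    krnl.setArg(1, buf_b);
    krnl.setArg(2, buf_out);

    // Step 3: trigger execution of the kernel on the FPGA.
    q.enqueueTask(krnl);

    // Steps 4-5 run on the device; step 6: read results back to host memory.
    q.enqueueReadBuffer(buf_out, CL_TRUE, 0, N * sizeof(int), out.data());
    q.finish();
    return 0;
}
```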

The FPGA can accommodate multiple kernel instances on the accelerator, including different types of kernels and multiple instances of the same kernel. XRT transparently orchestrates the interactions between the host program and the kernels in the accelerator. XRT architecture documentation is available at https://xilinx.github.io/XRT/.

Data Center Application Acceleration Development Flow

The following diagram describes the steps needed to build and run an application for use on the Alveo Data Center accelerator cards. The steps are summarized below, and the details of each step can be found throughout this documentation.

Figure 2: Application Development Flow for Data Center Accelerator Cards
x86 Application Compilation
Compile the host application to run on the x86 processor using the G++ compiler to create a host executable file. The host program interacts with kernels in the PL region. For more information on writing the host application, refer to Developing Applications. For more information on compiling the host application, refer to Building the Host Program.
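As an illustration, a typical host compile command looks like the following sketch. The file names are assumptions, and the XILINX_XRT environment variable is typically set by sourcing the XRT setup script:

```shell
# Compile the x86 host program against XRT's OpenCL headers and library.
# host.cpp and the output name "host" are illustrative.
g++ -std=c++17 -Wall -O2 -o host host.cpp \
    -I${XILINX_XRT}/include \
    -L${XILINX_XRT}/lib -lOpenCL -pthread
```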
PL Kernel Compilation and Linking

PL kernels are compiled for implementation in the PL region of the target platform. PL kernels can be compiled into Xilinx object (XO) files using the Vitis compiler (v++) or Vitis HLS for C/C++ kernels, or the package_xo command for RTL kernels. For information on coding kernels, refer to C/C++ Kernels or RTL Kernels.

The Vitis compiler then links the kernel XO files with the target hardware platform, using the v++ --link command, to create a device binary file (.xclbin) that is loaded into the Xilinx device on the target platform. For more information, refer to Building the Device Binary.
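For example, the compile and link steps can be sketched as follows. The kernel name vadd, the file names, and the platform placeholder are illustrative:

```shell
# Compile a C/C++ kernel into a Xilinx object (.xo) file for the hardware target.
v++ -c -t hw --platform <target_platform> -k vadd -o vadd.xo vadd.cpp

# Link the .xo file with the target platform to produce the device binary.
v++ -l -t hw --platform <target_platform> -o vadd.xclbin vadd.xo
```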

Running the Application
For Alveo Data Center accelerator cards, the .xclbin file is the required build object for running the system. When running the application, you can run software emulation, hardware emulation, or run on the actual physical accelerator platform. For more information, refer to Running the Application Hardware Build.
  • When the build target is software or hardware emulation, the Vitis compiler generates simulation models of the kernels in the device binary. As described in Build Targets, emulation targets let you build, run, and iterate the design over relatively quick cycles while debugging the application and evaluating performance.
  • When the build target is the hardware system, the Vitis compiler generates the .xclbin using the Vivado Design Suite to run synthesis and implementation and resolve timing. This process is automated to generate high-quality results; however, hardware-savvy developers can fully leverage the Vivado tools in their design process.

Embedded Processor Application Acceleration Development Flow

The following diagram describes the steps needed to build and run an application using Arm® processors and kernels running in programmable logic regions of Versal ACAP, Zynq UltraScale+ MPSoC, and Zynq-7000 SoC devices. The steps are summarized below, and the details of each step can be found throughout this documentation.

Figure 3: Application Development Flow for Versal ACAP and Zynq UltraScale+ MPSoC Devices
PS Application Compilation
Compile the host application to run on the Cortex®-A72 or Cortex-A53 core processor using the GNU Arm cross-compiler to create an ELF file. The host program interacts with kernels in the PL and AI Engine regions of the device. For more information on writing the host application, refer to Developing Applications. For more information on compiling the host application, refer to Building the Host Program.
AI Engine Array (Optional for Versal AI Engine Core series only)
Some Versal ACAP devices incorporate an AI Engine array of very long instruction word (VLIW) processors with single instruction, multiple data (SIMD) vector units that are highly optimized for compute-intensive workloads such as 5G wireless and artificial intelligence (AI) applications. AI Engine graphs and kernels are built using Vitis tools such as the aiecompiler and aiesimulator, and can be integrated into the embedded processor application acceleration flow as described in the Versal ACAP AI Engine Programming Environment User Guide (UG1076).
PL Kernel Compilation and Linking

PL kernels are compiled for implementation in the PL region of the target platform. PL kernels can be compiled into Xilinx object (XO) files using the Vitis compiler (v++) or Vitis HLS for C/C++ kernels, or the package_xo command for RTL kernels. For more information on coding kernels, refer to C/C++ Kernels or RTL Kernels.

The Vitis compiler then links the kernel XO files with the target hardware platform, using the v++ --link command, to create a device binary file (.xclbin) that is loaded into the Xilinx device on the target platform. For more information, refer to Building the Device Binary.

System Package
Use the v++ --package command to gather the files required to configure and boot the system, and to load and run the application, including the host application and PL kernel binaries. This step builds the necessary package to run software or hardware emulation and debug, or to create an SD card to run your application on hardware. For more information, refer to Packaging the System.
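A packaging command for an SD card boot image might look like the following sketch. All file names are illustrative, and the exact --package options required depend on your platform and boot configuration:

```shell
# Gather boot, kernel, and application files into an SD card package.
v++ -p -t hw --platform <target_platform> vadd.xclbin \
    --package.out_dir ./package_output \
    --package.boot_mode sd \
    --package.rootfs ./rootfs.ext4 \
    --package.kernel_image ./Image \
    --package.sd_file ./host.elf
```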
Running the Application
When running the application, you can run software emulation, hardware emulation, or run on the actual physical accelerator platform. Running the application on embedded processor platforms is different from running on data center accelerator cards. For more information, refer to Running the Application Hardware Build.
  • When the build target is software or hardware emulation, the Vitis compiler generates simulation models of the kernels in the device binary. As described in Build Targets, emulation targets let you build, run, and iterate the design over relatively quick cycles while debugging the application and evaluating performance.
  • When the build target is the hardware system, the Vitis compiler generates the .xclbin using the Vivado Design Suite to run synthesis and implementation and resolve timing. This process is automated to generate high-quality results; however, hardware-savvy developers can fully leverage the Vivado tools in their design process.

Build Targets

The Vitis compiler build process generates the host program executable and the FPGA binary (.xclbin). The nature of the FPGA binary is determined by the build target.

  • When the build target is software or hardware emulation, the Vitis compiler generates simulation models of the kernels in the FPGA binary. These emulation targets let you build, run, and iterate the design over relatively quick cycles while debugging the application and evaluating performance.
  • When the build target is the hardware system, the Vitis compiler generates the .xclbin for the hardware accelerator, using the Vivado Design Suite to run synthesis and implementation. It uses these tools with predefined settings proven to provide good quality of results. Using the Vitis core development kit does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all the available features to implement kernels.

The Vitis compiler provides three different build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the host processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and getting initial performance estimates.
Hardware (hw)
The kernel code is compiled into a hardware model (RTL) and then implemented on the FPGA, resulting in a binary that will run on the actual FPGA.
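The build target is selected with the v++ -t option, so the same sources can be rebuilt for each target without modification. The kernel and file names below are illustrative:

```shell
# Build for software emulation; substitute hw_emu or hw to change targets.
v++ -c -t sw_emu --platform <target_platform> -k vadd -o vadd.xo vadd.cpp
v++ -l -t sw_emu --platform <target_platform> -o vadd.xclbin vadd.xo
```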

Tutorials and Examples

To help you quickly get started with the Vitis core development kit, you can find tutorials, example applications, and hardware kernels in the following repositories on GitHub, including https://github.com/xilinx/Vitis-Tutorials.

Vitis Application Acceleration Development Flow Tutorials
Provides a number of tutorials that can be worked through to teach specific concepts regarding the tool flow and application development.

The Getting Started pathway tutorials are an excellent place to start as a new user.

Vitis Examples
Hosts many examples that demonstrate good design practices, coding guidelines, design patterns for common applications, and, most importantly, optimization techniques to maximize application performance. The onboarding examples are divided into several main categories. Each category has various key concepts illustrated by individual examples in both OpenCL™ C and C/C++ frameworks, when applicable. All examples include a Makefile to enable building for software emulation, hardware emulation, and running on hardware, and a README.md file with a detailed explanation of the example.

Now that you have an idea of the elements of the Vitis core development kit and how to write and build an application for acceleration, review the best approach for your design problem.