Profiling the Application

The Vitis™ core development kit generates various system and kernel resource performance reports during compilation. These reports help you establish a performance baseline for your application, identify bottlenecks, and identify target functions that can be accelerated in hardware kernels as discussed in Methodology for Architecting a Device Accelerated Application. The Xilinx Runtime (XRT) also collects profiling data during application execution in both emulation and system mode configurations. Examples of the reported data include:

  • Host and device timeline events
  • OpenCL™ API call sequence
  • Kernel execution sequence
  • FPGA trace data including AXI transactions
  • Kernel start and stop signals

Together the reports and profiling data can be used to isolate performance bottlenecks in the application and optimize the design to improve performance. Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.

There are four distinct areas to consider when performing algorithm optimization in Vitis: system resource usage and performance, kernel optimization, host optimization, and data transfer optimization. The following Vitis reports and graphical tools support your efforts to profile and optimize these areas.

Reports are automatically generated after running the active build, either from the command line as described in Running an Application, or from the Vitis integrated design environment (IDE). Separate sets of reports are generated for all three build targets and can be found in the respective report directories. Refer to Directory Structure for more information on locating these reports.

Reports can be viewed in a web browser, a spreadsheet viewer, or from the Vitis IDE. To access these reports from the Vitis IDE, ensure the Assistant view is visible and double-click the desired report.

The following topics briefly describe the various reports and graphical visualization tools and how they can be used to profile your design.

Baselining Functionalities and Performance

It is very important to understand the performance of your application before you start any optimization effort. This is achieved by establishing a baseline for the application in terms of functionalities and performance.

Figure 1: Baselining Functionalities and Performance Flow

Identify Bottlenecks

The first step is to identify the bottlenecks of your application running on your target platform. The most effective way is to run the application with profiling tools, like valgrind, callgrind, and GNU gprof. The profiling data generated by these tools shows the call graph with the number of calls to all functions and their execution time.
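
For example, a minimal profiling run of the unaccelerated host application using GNU gprof might look like the following (the source and executable names are placeholders):

g++ -pg -O2 host_app.cpp -o host_app
./host_app
gprof ./host_app gmon.out > gprof_report.txt

Alternatively, running valgrind --tool=callgrind ./host_app produces a callgrind.out.<pid> file that can be examined with callgrind_annotate.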

Run Software and Hardware Emulation

Run software and hardware emulation on the accelerated application, as described in Running an Application, to verify functional correctness and to generate profiling data on the host code and the kernels. Analyze the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimate for timing, interval, and latency, and the resource utilization, such as DSP and block RAM usage.
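
As a sketch, a hardware emulation run might be launched as follows, assuming the host executable is named host, takes the .xclbin path as its argument, and the FPGA binary matches the design name used elsewhere in this section:

emconfigutil --platform xilinx_u200_xdma_201830_2
export XCL_EMULATION_MODE=hw_emu
./host mmult.hw_emu.xilinx_u200_xdma_201830_2.xclbin

The emconfigutil command generates the emconfig.json file read by XRT during emulation, and XCL_EMULATION_MODE selects the sw_emu or hw_emu target.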

Build and Run the Application

The last step in baselining is building and running the application on an FPGA acceleration card, like one of the Alveo™ Data Center accelerator cards. Analyze the reports from the system compilation, and the profiling data from application execution to see the actual performance and resource utilization on hardware.
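
For example, a system hardware build and run might look like the following (kernel object, output, and host executable names are placeholders, and the host program is assumed to take the .xclbin path as its argument):

v++ -l -t hw --platform xilinx_u200_xdma_201830_2 -o mmult.hw.xclbin mmult.xo
./host mmult.hw.xclbin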

TIP: Save all the reports during the baseline process, so that you can refer back to them and compare results during optimization.

Design Guidance

The Vitis core development kit has a comprehensive design guidance tool that provides immediate, actionable guidance to the software developer for issues detected in their designs. These issues might be related to the source code or to missed tool optimizations. The rules are generic, based on an extensive set of reference designs, and therefore might not be applicable to your specific design. It is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.

Guidance is generated by the Vivado HLS tool, the Vitis profiler, and the Vivado Design Suite when invoked by the v++ compiler. The generated design guidance can have several severity levels: errors, advisories, warnings, and critical warnings are provided during software emulation, hardware emulation, and system builds. The profile design guidance helps you interpret the profiling results, which allows you to focus on improving performance.

The guidance includes hyperlinks, examples, and links to documentation. This helps improve productivity by quickly highlighting issues and directing you to additional information on using the Vitis technology.

Design guidance is automatically generated after building or running a design in the Vitis IDE.

You can open the Guidance report as discussed in Using the Vitis Analyzer. To access the report, open the Compile Summary, Link Summary, or Run Summary, and then select the Guidance report. With the Guidance report open, hovering over the guidance highlights recommended resolutions.

The following image shows an example of the Guidance report displayed in the Vitis analyzer. Clicking a link displays an expanded view of the actionable guidance.

Figure 2: Design Guidance Example

There is one HTML guidance report for each run of the v++ command, including compile and link. The report files are located in the directory specified by --report_dir, under the specific output name. For example:

  • v++_compile_<output>_guidance.html for v++ compilation
  • v++_link_<output>_guidance.html for v++ linking

Opening the Guidance Report

When kernels are compiled and when the FPGA binary is linked, guidance reports are generated automatically by the v++ command. You can view these reports in the Vitis analyzer by opening the <output_filename>.compile.summary or the <output_filename>.link.summary for the application project. The <output_filename> is the output of the v++ command.

As an example, launch the Vitis analyzer and open the report using this command:

vitis_analyzer <output_filename>.link.summary

When the Vitis analyzer opens, it displays the link summary report, as well as the compile summary, and a collection of reports generated during the compile and link processes. Both the compile and link steps generate Guidance reports, which you can view by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.

Interpreting Guidance Data

The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.

In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. In the HTML report, they are listed individually. In addition, because the HTML report does not show tooltips, a Full Name column is included in the HTML report as well.

The following list describes all fields and their purpose as included in the HTML guidance reports.

Id
Each guidance rule is assigned a unique ID. Use this ID to uniquely identify a specific message from the guidance report.
Name
The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
Severity
The Severity column allows the easy identification of the importance of a guidance rule.
Full Name
The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
Categories
Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
Threshold
The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
Actual
The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
Details
The Details column provides a brief but specific message describing the specifics of the current rule.
Resolution
The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a pop-up window or the documentation with tips and code snippets that you can apply to the specific issue.

System Estimate Report

The build step with the longest execution time is building the hardware system and the FPGA binary to run on Xilinx devices. Build time is also affected by the target device and the number of compute units instantiated onto the FPGA fabric. Therefore, it is useful to estimate the performance of an application without needing to build it for the system hardware.

The System Estimate report provides estimates of FPGA resource usage and the estimated frequency at which the hardware accelerated kernels can operate. The report is automatically generated for hardware emulation and system hardware builds. The report contains high-level details of the user kernels, including resource usage and estimated frequency. This report can be used to guide design optimization.

You can also force the generation of the System Estimate report with the following option:

v++ .. --report_level estimate
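
For example, the option might be added to a link command such as the following (platform and file names are placeholders):

v++ -l -t hw_emu --platform xilinx_u200_xdma_201830_2 --report_level estimate -o mmult.xclbin mmult.xo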

An example report is shown in the figure:

Figure 3: System Estimate

Opening the System Estimate Report

The System Estimate report can be opened in the Vitis analyzer, a tool intended for viewing reports generated by the Vitis compiler when the application is built, and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.link.summary

The <output_filename> is the output of the v++ command. This opens the Link Summary for the application project in the Vitis analyzer tool. Then, select the System Estimate report. Refer to Using the Vitis Analyzer for more information.

TIP: Because the System Estimate report is a text file, you can also view it in a text editor.

Interpreting the System Estimate Report

The System Estimate report generated by the v++ command provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows:

  • Target device information
  • Summary of every kernel in the application
  • Detailed information on every binary container in the solution

The following example report file represents the information generated for the estimate report:

-------------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1


-------------------------------------------------------------------------------
OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828         

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
-------------------------------------------------------------------------------

Design and Target Device Summary

All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:

-------------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------

For the design summary, the information provided includes the following:

Target Device
Name of the Xilinx device on the target platform that runs the FPGA binary built by the Vitis compiler.
Target Clock
Specifies the target operating frequency for the compute units (CUs) mapped to the FPGA fabric.

Kernel Summary

This section lists all of the kernels defined for the application project. The following example shows the kernel summary:

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1

In addition to the kernel name, the summary also provides the execution target and type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™, C, and C/C++ source files, the type of kernel source file is specified.

The Kernel Summary section is the last summary information in the report. From here, detailed information on each compute unit binary container is presented.

Timing Information

For each binary container, the detail section begins with the execution target of all compute units (CUs). It also provides timing information for every CU. As a general rule, if the estimated frequency for the FPGA binary is higher than the target frequency, the CU will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the CU needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:

OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

It is important to understand the difference between the target and estimated frequencies. CUs are not placed in isolation into the FPGA fabric. CUs are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.

Because the CU custom logic is generated one kernel at a time, an estimated frequency that is higher than the target frequency indicates that the CU can run at the higher estimated frequency. Therefore, the CU should meet timing at the target frequency during implementation of the FPGA binary.

Latency Information

The latency information presents the execution profile of each CU in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for CUs targeted at the FPGA fabric. The following is an example of the latency report:

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828         

The latency report is divided into the following fields:

  • Start interval
  • Best case latency
  • Average case latency
  • Worst case latency

The start interval defines the amount of time that has to pass between invocations of a CU for a given kernel.

The best, average, and worst case latency numbers refer to how much time it takes the CU to generate the results of one NDRange data tile for the kernel. For cases where the kernel does not have data-dependent computation loops, the latency values will be the same. Data-dependent execution of loops introduces data-specific latency variation that is captured by the latency report.
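
For example, using the Timing and Latency tables shown above, a worst case latency of 828 cycles at the 300 MHz target clock corresponds to roughly 828 / 300 MHz ≈ 2.76 µs per invocation of the mmult_1 compute unit.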

The interval or latency numbers are reported as "undef" for kernels with one or more of the following conditions:

  • OpenCL kernels that do not have an explicit reqd_work_group_size(x,y,z) attribute
  • Kernels that have loops with variable bounds

Note: The latency information reflects estimates based on the analysis of the loop transformations and exploited parallelism of the model. Advanced transformations, such as pipelining and dataflow, can heavily change the actual throughput numbers. Therefore, latency numbers should only be used as relative guides between different runs.
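
As a minimal sketch of avoiding both conditions, the following OpenCL kernel (the kernel name, argument list, and size are illustrative) declares an explicit reqd_work_group_size attribute and uses a fixed loop bound:

#define SIZE 256
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vadd(__global const int *a, __global const int *b, __global int *c) {
    // Fixed loop bound (SIZE is illustrative), so a defined latency and interval can be reported
    for (int i = 0; i < SIZE; i++) {
        c[i] = a[i] + b[i];
    }
}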

Area Information

Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the Vitis compiler to generate the custom logic for each CU in the design. The quantity of fundamental resources needed to implement the custom logic for a single CU determines how many CUs can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a single CU:

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
-------------------------------------------------------------------------------

HLS Report

The HLS report provides details about the high-level synthesis (HLS) process of a user kernel and is generated in hardware emulation and system builds. This process translates the C/C++ and OpenCL kernel into the hardware description language used for implementing the kernel logic on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details provide many insights to guide kernel optimization.

When running from the command line, this report can be found in the following directory:

_x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report

The HLS report can be opened from the Vitis analyzer by opening the Compile Summary, or the Link Summary as described in Using the Vitis Analyzer. An example of the HLS report is shown.

Figure 4: HLS Report

Generating and Opening the HLS Report

IMPORTANT: The --save-temps option must be specified to preserve the intermediate files produced by Vivado HLS, including the reports. The HLS report and HLS guidance are generated only for hardware emulation and system builds of C/C++ and OpenCL kernels. They are not generated for software emulation builds or for RTL kernels.
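
For example, a compile command that preserves the intermediate HLS reports might look like the following (kernel, platform, and file names are placeholders):

v++ -c -t hw_emu --platform xilinx_u200_xdma_201830_2 --save-temps -k mmult -o mmult.xo mmult.cpp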

The HLS report can be viewed through the Vitis analyzer by opening the <output_filename>.compile.summary or the <output_filename>.link.summary for the application project. The <output_filename> is the output of the v++ command.

You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.compile.summary

When the Vitis analyzer opens, it displays the compile summary and a collection of reports generated during the compile process. The HLS report can be viewed by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.

Interpreting the HLS Report

The left pane of the HLS report shows the module hierarchy. Each module generated as part of the HLS run is represented in this hierarchy. You can select any of these modules to present the synthesis details of the module in the right side of the Synthesis Report window. The Synthesis Report is separated into several sections:

  • General Information
  • Performance Estimates (timing and latency)
  • Utilization Estimates
  • Interface Information

If this information is part of a hierarchical block, it will sum up the information of the blocks contained in the hierarchy. Therefore, the hierarchy can also be navigated from within the report when it is clear which instance contributes to the overall design.

CAUTION: The absolute counts of cycles and latency are based on estimates identified during HLS synthesis; especially with advanced transformations, such as pipelining and dataflow, these numbers might not accurately reflect the final results. If you encounter question marks in the report, this might be due to variable-bound loops, and you are encouraged to set trip counts for such loops to have some relative estimates presented in this report.

Profile Summary Report

When properly configured, the Vitis Runtime library collects profiling data on host applications and kernels. After the application finishes execution, the Profile Summary report is saved as a .csv file in the directory where the compiled host code is executed.

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine kernel execution and data transfer statistics.

TIP: The Profile Summary report can be generated for all build configurations. However, with the software emulation build, the report will not include any data transfer details under kernel execution efficiency and data transfer efficiency. This information is only generated in hardware emulation or system builds.

An example of the Profile Summary report is shown below.

Figure 5: Profile Summary

The report has multiple tabs that can be selected. A description of each tab is given here:

Top Operations
Kernels and Global Memory. This tab shows a summary of top operations. It displays the profile data for top data transfers between FPGA and device memory.
Kernels & Compute Units
Displays the profile data for all kernels and compute units.
Data Transfers
Host and Global Memory. This displays the profile data for all read and write transfers between the host and device memory through the PCIe link. It also displays data transfers between kernels and global memory, if enabled.
OpenCL APIs
Displays the profile data for all OpenCL C host API function calls executed in the host application.
Kernel Internals
This tab is reported during hardware emulation if you have enabled launch_waveform in the [Emulation] section of the xrt.ini, as described in xrt.ini File. Waveform data (.wdb) is generated, and the kernel-internal data is captured in the profile_kernels.csv and timeline_kernels.csv files, which populate the Kernel Internals tab. The reported information applies to C/C++ and OpenCL kernels, and is not reported for RTL kernels.

For details on the profile summary, see Interpreting the Profile Summary.

Generating and Opening the Profile Summary Report

Capturing the data required for the Profile Summary requires a few steps prior to actually running the application.

  1. The FPGA binary (xclbin) file is configured for capturing profiling data by default. However, using the v++ --profile_kernel option during the linking process enables a greater level of detail in the profiling data captured. See the Vitis Compiler Command for more information on the --profile_kernel option.
  2. The runtime requires the presence of an xrt.ini file, as described in xrt.ini File, that includes the keyword for capturing profiling data:
    [Debug]
    profile = true
  3. To enable the profiling of Kernel Internals data, you must also add the launch_waveform tag in the [Emulation] section of the xrt.ini:
    [Emulation]
    launch_waveform = batch

With profiling enabled in the FPGA binary and in the xrt.ini file, the runtime creates the profile_summary.csv report file when running the application, and also creates the profile_kernels.csv and timeline_kernels.csv files when Kernel Internals is enabled.
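
For reference, a complete xrt.ini combining steps 2 and 3 above is a small file placed in the same directory as the host executable:

[Debug]
profile = true

[Emulation]
launch_waveform = batch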

The CSV report can be viewed in a spreadsheet tool or utility, or it can be opened in the Vitis analyzer, a tool intended for viewing reports generated by the Vitis compiler when the application is built, and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer profile_summary.csv

Interpreting the Profile Summary

The Profile Summary includes a number of useful statistics for your OpenCL application. This provides a general idea of the functional bottlenecks in your application. The Profile Summary consists of the following sections:

  1. Top Operations
    • Top Data Transfer: Kernels and Global Memory: This table displays the profile data for top data transfers between FPGA and device memory.
      • Device: Name of device
      • Compute Unit: Name of compute unit
      • Number of Transfers: Sum of write and read AXI transactions monitored on device
      • Average Bytes per Transfer: (Total read bytes + Total write bytes) / (Total read AXI transactions + Total write AXI transactions)
      • Transfer Efficiency (%): (Average bytes per transfer) / min (4K, (Memory bit width / 8 * 256))

        AXI4 specification limits the maximum burst length to 256 and maximum burst size to 4K bytes.

      • Total Data Transfer (MB): (Total read bytes + Total write bytes) / 1.0E6
      • Total Write (MB): (Total write bytes) / 1.0E6
      • Total Read (MB): (Total read bytes) / 1.0E6
      • Transfer Rate (MB/s): (Total data transfer) / (Compute unit total time)
    • Top Kernel Execution
      • Kernel Instance Address: Host address of kernel instance (in hex)
      • Kernel: Name of kernel
      • Context ID: Context ID on host
      • Command Queue ID: Command queue ID on host
      • Device: Name of device where kernel was executed (format: <device>-<ID>)
      • Start Time (ms): Start time of execution (in ms)
      • Duration (ms): Duration of execution (in ms)
      • Global Work Size: NDRange of kernel
      • Local Work Size: Work group size of kernel
    • Top Memory Writes: Host and Device Global Memory
      • Buffer Address: Host address of buffer (in hex)
      • Context ID: Context ID on host
      • Command Queue ID: Command queue ID on host
      • Start Time (ms): Start time of write transfer (in ms)
      • Duration (ms): Duration of write transfer (in ms)
      • Buffer Size (KB): Size of write transfer (in KB)
      • Writing Rate (MB/s): Writing rate = (Buffer size) / (Duration)
    • Top Memory Reads: Host and Device Global Memory
      • Buffer Address: Host address of buffer (in hex)
      • Context ID: Context ID on host
      • Command Queue ID: Command queue ID on host
      • Start Time (ms): Start time of read transfer (in ms)
      • Duration (ms): Duration of read transfer (in ms)
      • Buffer Size (KB): Size of read transfer (in KB)
      • Reading Rate (MB/s): Reading rate = (Buffer size) / (Duration)
  2. Kernels & Compute Units
    • Kernel Execution (includes estimated device times): This displays the profile data summary for all kernel functions scheduled and executed.
      • Kernel: Name of kernel
      • Number of Enqueues: Number of times kernel is enqueued
      • Total Time (ms): Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model)
      • Minimum Time (ms): Minimum runtime of all enqueues
      • Average Time (ms): (Total time) / (Number of enqueues)
      • Maximum Time (ms): Maximum runtime of all enqueues
    • Compute Unit Utilization (includes estimated device times): This displays the summary profile data for all compute units on the FPGA.
      • Device: Name of device (format: <device>-<ID>)
      • Compute Unit: Name of compute unit
      • Kernel: Kernel this compute unit is associated with
      • Global Work Size: NDRange of kernel (format is x:y:z)
      • Local Work Size: Local work group size (format is x:y:z)
      • Number of Calls: Number of times the compute unit is called
      • Total Time (ms): Sum of runtimes of all calls
      • Minimum Time (ms): Minimum runtime of all calls
      • Average Time (ms): (Total time) / (Number of work groups)
      • Maximum Time (ms): Maximum runtime of all calls
      • Clock Frequency (MHz): Clock frequency used for a given accelerator (in MHz)
  3. Data Transfers
    • Data Transfer: Host and Global Memory: This displays the profile data for all read and write transfers between the host and device memory via PCI Express® link.
      • Context:Number of Devices: Context ID and number of devices in context
      • Transfer Type: READ or WRITE
      • Number of Transfers: Number of host data transfers
        Note: Can contain printf transfers
      • Transfer Rate (MB/s): (Total bytes sent) / (Total time in µs)

        where Total time includes software overhead

      • Average Bandwidth Utilization (%): (Transfer rate) / (Maximum transfer rate)

        where Maximum transfer rate = (256 / 8 bytes) * (300 MHz) = 9.6 GB/s

      • Average Size (KB): (Total KB sent) / (Number of transfers)
      • Total Time (ms): Sum of transfer times
      • Average Time (ms): (Total time) / (Number of transfers)
    • Data Transfer: Kernels and Global Memory: This displays the profile data for all read and write transfers between the FPGA and device memory.
      • Device: Name of device
      • Compute Unit/Port Name: <Name of compute unit>/<Name of port>
      • Kernel Arguments: List of arguments connected to this port
      • DDR Bank: DDR bank number this port is connected to
      • Transfer Type: READ or WRITE
      • Number of Transfers: Number of AXI transactions monitored on device
        Note: Might contain printf transfers
      • Transfer Rate (MB/s): (Total bytes sent) / (Compute unit total time)
        • Compute unit total time = Total execution time of compute unit
        • Total bytes sent = sum of bytes across all transactions
      • Average Bandwidth Utilization (%): (Transfer rate) / (0.6 * Maximum transfer rate); see the worked example following this list

        where Maximum transfer rate = (512 / 8 bytes) * (300 MHz) = 19200 MB/s

      • Average Size (KB): (Total KB sent) / (number of AXI transactions)
      • Average Latency (µs): (Total latency of all transactions) / (Number of AXI transactions)
  4. OpenCL API Calls: This displays the profile data for all OpenCL host API function calls executed in the host application.
    • API Name: Name of API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
    • Number of Calls: Number of calls to this API
    • Total Time (ms): Sum of runtimes of all calls
    • Minimum Time (ms): Minimum runtime of all calls
    • Average Time (ms): (Total time) / (Number of calls)
    • Maximum Time (ms): Maximum runtime of all calls
  5. Kernel Internals
    • Compute Units: Running Time and Stalls: Reports the running time for compute units in microseconds, and reports stall time as a percent of the running time.
      TIP: The Kernel Internals tab reports time in microseconds (µs), while the rest of the Profile Summary reports time in milliseconds (ms).
      • Compute Unit: Indicates the compute unit instance name.
      • Running Time (µs): Reports the total running time for the CU.
      • Intra-Kernel Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data between kernels.
      • External Memory Stalls (%): Reports the percentage of running time consumed in stalls for memory transfers outside the CU.
      • External Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU.
    • Functions: Running Time and Stalls: Reports the running time for executing functions inside the CU in microseconds, and reports stall time as a percent of the running time.
      • Compute Unit: Indicates the compute unit instance name.
      • Function: Indicates the function name inside the CU.
      • Running Time (µs): Reports the total running time for the function.
      • Intra-Kernel Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data between kernels.
      • External Memory Stalls (%): Reports the percentage of running time consumed in stalls for memory transfers outside the CU.
      • External Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU.
    • Compute Units: Port Data Transfer: Reports the data transfer for specific ports on the compute unit.
      • Compute Unit: Indicates the compute unit instance name.
      • Port: Indicates the port name on the compute unit.
      • Write Time (µs): Specifies the total data write time on the port.
      • Outstanding Write (%): Specifies the percentage of the runtime consumed in the write process.
      • Read Time (µs): Specifies the total data read time on the port.
      • Outstanding Read (%): Specifies the percentage of the runtime consumed in the read process.
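
As a worked example of the Data Transfer: Kernels and Global Memory metrics above: if a compute unit port moves a total of 64 MB while its compute unit is active for 10 ms, the transfer rate is 64 MB / 0.010 s = 6400 MB/s. With a maximum transfer rate of 19200 MB/s, the average bandwidth utilization is 6400 / (0.6 * 19200) ≈ 56%.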

Application Timeline

The Application Timeline collects and displays host and kernel events on a common timeline to help you understand and visualize the overall health and performance of your system. The graphical representation lets you see issues regarding kernel synchronization and concurrent execution efficiency. The displayed events include:

  • OpenCL API calls from the host code.
  • Device trace data, including compute unit and AXI transaction starts/stops.
  • Host events and kernel start/stops.

While this information is useful for debugging and profiling the application, the timeline and device trace data are not collected by default, because collecting this data adds time to the application execution. However, the trace data is collected with dedicated resources in the kernel, and does not affect kernel functionality. The data is offloaded only at the end of the run (v++ --trace_memory option).

The following is a snapshot of the Application Timeline window, which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel, and transferring data between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvement.

Figure 6: Application Timeline

Timeline data can be enabled and collected through the command line flow. However, viewing must be done in the Vitis analyzer as described in Using the Vitis Analyzer.

Generating and Opening the Application Timeline

To generate the Application Timeline report, you must complete the following steps to enable timeline and device trace data collection in the command line flow:

  1. Instrument the FPGA binary during linking, by adding Acceleration Monitors and AXI Performance Monitors to kernels using the v++ --profile_kernel option. This option has three distinct instrumentation options (data, stall, and exec), as described in the Vitis Compiler Command. As an example, add --profile_kernel to the v++ linking command line:
    v++ -g -l --profile_kernel data:all:all:all ...
  2. After the kernels are instrumented during the build process, data gathering must also be enabled during the application runtime execution by editing the xrt.ini file. Refer to xrt.ini File for more information.
    The following xrt.ini file will enable maximum information gathering when the application is run:
    [Debug]
    profile=true
    timeline_trace=true
    data_transfer_trace=coarse
    stall_trace=all
    
    TIP: If you are collecting a large amount of trace data, you might need to use the --trace_memory option with the v++ command, and the trace_buffer_size keyword in the xrt.ini.

    After running the application, the Application Timeline data is captured in a CSV file called timeline_trace.csv.

  3. The CSV report can be viewed in a spreadsheet tool or utility, or it can be opened in the Vitis analyzer, a tool intended for viewing reports generated by the Vitis compiler when the application is built, and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
    vitis_analyzer timeline_trace.csv

Interpreting the Application Timeline

The Application Timeline window displays host and device events on a common timeline. This information helps you understand details of application execution and identify potential areas for improvements. The Application Timeline report has two main sections: Host and Device. The Host section shows the trace of all the activity originating from the host side. The Device section shows the activity of the CUs on the FPGA.

The report has the following structure:

  • Host
    OpenCL API Calls
    All OpenCL API calls are traced here. The activity time is measured from the host perspective.
    General
    All general OpenCL API calls such as clCreateProgramWithBinary, clCreateContext, and clCreateCommandQueue, are traced here.
    Queue
    OpenCL API calls that are associated with a specific command queue are traced here. This includes commands such as clEnqueueMigrateMemObjects, and clEnqueueNDRangeKernel. If the user application creates multiple command queues, then this section shows all the queues and activities.
    Data Transfer
    In this section, the DMA transfers from the host to the device memory are traced. There are multiple DMA threads implemented in the OpenCL runtime, and there is typically an equal number of DMA channels. The DMA transfer is initiated by the user application by calling OpenCL APIs such as clEnqueueMigrateMemObjects. These DMA requests are forwarded to the runtime, which delegates them to one of the threads. Data transfers from the host to the device appear under Write, as they are written by the host, and transfers from the device to the host appear under Read.
    Kernel Enqueues
    The kernels enqueued by the host program are shown here. The kernels here should not be confused with the kernels/CUs on the device. Here, kernel refers to the NDRangeKernels and tasks created by the OpenCL commands clEnqueueNDRangeKernel and clEnqueueTask. These are plotted against the time measured from the host's perspective. Multiple kernels can be scheduled to execute at the same time, and they are traced from the point they are scheduled to run until the end of the kernel execution. This is the reason for multiple entries. The number of rows depends on the number of overlapping kernel executions.
    Note: Overlapping of the kernels should not be mistaken for actual parallel execution on the device as the process might not be ready to execute right away.
  • Device "name"
    Binary Container "name"
    Binary container name.
    Accelerator "name"
    Name of the compute unit (a.k.a., Accelerator) on the FPGA.
    User Functions
    In the case of Vivado HLS tool kernels, functions that are implemented as dataflow processes are traced here. The trace for these functions shows the number of active instances that are currently executing in parallel. These names are generated in hardware emulation when the waveform is enabled.
    Note: Function level activity is only possible in hardware emulation.
    • Function: "name a"
    • Function: "name b"
    Read
    A CU reads from the DDR over AXI-MM ports. The trace of data read by a CU is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used, and the name format is m_axi_<bundle name>(port).
    Write
    A CU writes to the DDR over AXI-MM ports. The trace of data written by a CU is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used, and the name format is m_axi_<bundle name>(port).
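
As a reference point for the Host rows described above, the following minimal host-code sketch (the file name vadd.xclbin, kernel name vadd, and buffer setup are placeholders, and error checking is omitted) shows the OpenCL calls that generate the General, Data Transfer, and Kernel Enqueues events:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Context and queue creation: clCreateContext and clCreateCommandQueue
       appear under Host > OpenCL API Calls > General. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    /* Load the FPGA binary: clCreateProgramWithBinary is also traced under General. */
    FILE *fp = fopen("vadd.xclbin", "rb");
    fseek(fp, 0, SEEK_END);
    size_t size = (size_t)ftell(fp);
    rewind(fp);
    unsigned char *binary = (unsigned char *)malloc(size);
    fread(binary, 1, size, fp);
    fclose(fp);
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &size,
        (const unsigned char **)&binary, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "vadd", NULL);

    /* Buffer migration and kernel execution: these enqueue calls appear under
       Host > Queue, with the DMA shown under Data Transfer (Write/Read) and the
       kernel under Kernel Enqueues. Only one argument is shown for brevity. */
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, 1024 * sizeof(int), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueMigrateMemObjects(queue, 1, &buf, 0, 0, NULL, NULL);                          /* host-to-device write */
    clEnqueueTask(queue, kernel, 0, NULL, NULL);                                           /* kernel execution */
    clEnqueueMigrateMemObjects(queue, 1, &buf, CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL); /* device-to-host read */
    clFinish(queue);

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    free(binary);
    return 0;
}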

Waveform View and Live Waveform Viewer

The Vitis core development kit can generate a Waveform view when running hardware emulation. It displays in-depth details at the system-level, CU level, and at the function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into performance bottlenecks from the system-level down to individual function calls to help optimize your application.

The Live Waveform Viewer is similar to the Waveform view; however, it provides even lower-level details with some degree of interactivity. The Live Waveform Viewer can also be opened using the Vivado logic simulator, xsim.

Note: The Waveform view allows you to examine the device transactions from within the Vitis analyzer, as described in Using the Vitis Analyzer. In contrast, the Live Waveform Viewer generates the Vivado simulation waveform viewer to examine the hardware transactions in addition to user selected signals.

Waveform View and Live Waveform Viewer data are not collected by default because it requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. Refer to Generating and Opening the Waveform Reports for instructions on enabling these features.

Figure 7: Waveform View

You can also open the waveform database (.wdb) file with the Vivado logic simulator through the Linux command line:

xsim -gui <filename.wdb> &
TIP: The .wdb file is located in the directory where the compiled host code is executed.

Generating and Opening the Waveform Reports

Follow these instructions to enable waveform data collection from the command line during hardware emulation and open the viewer:

  1. Enable debug code generation during compilation and linking using the -g option.
    v++ -c -g -t hw_emu ...
  2. Create an xrt.ini file in the same directory as the host executable with the following contents (see xrt.ini File for more information):
    [Debug]
    profile=true
    timeline_trace=true
    
    [Emulation]
    launch_waveform=batch
    For the Live Waveform Viewer, set launch_waveform as follows:
    [Emulation]
    launch_waveform=gui
    TIP: If Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.
  3. Run the hardware emulation build of the application as described in Running an Application. The hardware transaction data is collected in the waveform database file, <hardware_platform>-<device_id>-<xclbin_name>.wdb. Refer to Directory Structure for the location of this file.
  4. Open the Waveform view in the Vitis analyzer as described in Waveform View and Live Waveform Viewer.

Interpreting Data in the Waveform Views

The following image shows the Waveform view:

Figure 8: Waveform View

The Waveform and Live Waveform views are organized hierarchically for easy navigation.

  • The Waveform view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the viewer to descend all the way down to the individual signals responsible for the abstracted data. However, because the Waveform view is generated from the post-processed data, no additional signals can be added to the report, and some of the runtime analysis cannot be visualized, such as DATAFLOW transactions.
  • The Live Waveform Viewer displays the live Vivado logic simulator (xsim) run, so you can add extra signals and internals of the register transfer level (RTL) design to the live view. Refer to the Vivado Design Suite User Guide: Logic Simulation (UG900) for information on working with the Waveform viewer.

The hierarchy of the Waveform and Live Waveform views include the following:

Device "name"
Target device name.
Binary Container "name"
Binary container name.
Memory Data Transfers
For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host.
Kernel "name" 1:1:1
For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit.
Compute Unit: "name"
Compute unit name.
CU Stalls (%)
Stall signals are provided by the Vivado HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus shown in detailed kernel trace compiles all of the lowest level stall signals and reports the percentage that are stalling at any point in time. This provides a factor of how much of the kernel is stalling at any point in the simulation.

For example, if there are 100 lowest level stall signals and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it is 9%.

Data Transfers
This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
User Functions
This information is available for the HLS kernels and shows the user functions.
Function: "name"
Function name.
Dataflow/Pipeline Activity
This shows the number of parallel executions of the function if the function is implemented as a dataflow process.
Active Iterations
This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution.
StallNoContinue
This is a stall signal that tells if there were any output stalls experienced by the dataflow processes (function is done, but it has not received a continue from the adjacent dataflow process).
RTL Signals
These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process.
Function Stalls
Shows the different types of stalls experienced by the process.
External Memory
Stalls experienced while accessing the DDR memory.
Internal-Kernel Pipe
If the compute units communicate with each other through pipes, this shows the related stalls.
Intra-Kernel Dataflow
FIFO activity internal to the kernel.
Function I/O
Actual interface signals.
Function: "name"
Function name.
Function: "name"
Function name.