Hardware Tracing

The SDSoC environment supports hardware event tracing of accelerators cross-compiled using Vivado HLS, as well as of data transfers over AXI4-Stream connections. When the sdscc/sds++ linker is invoked with the -trace option, it automatically inserts hardware monitor IP cores into the generated system to log the following event types:

  • Accelerator start and stop, defined by ap_start and ap_done signals.
  • Data transfer start and stop, defined by AXI4-Stream handshake and TLAST signals.

Each of these events is independently monitored and timestamped by the same global timer used for software events. A hardware function that explicitly declares an AXI4-Lite control interface using the following pragma cannot be traced, because its ap_start and ap_done signals are not exposed on the IP interface:
#pragma HLS interface s_axilite port=foo
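As a sketch, the following hypothetical hardware function (the name and body are illustrative, not from the source) shows the untraceable case: with the s_axilite pragma applied to the control interface, the block-level ap_start/ap_done handshake is generated inside the AXI4-Lite adapter rather than at the IP boundary.

```cpp
#include <cassert>

// Hypothetical hardware function (name and body are illustrative).
// The s_axilite pragma on the control interface (port=return) moves the
// block-level ap_start/ap_done handshake behind the AXI4-Lite adapter,
// so an inserted trace monitor cannot observe accelerator start/stop.
void vadd(const int a[4], const int b[4], int out[4]) {
#pragma HLS interface s_axilite port=return
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}
```

To keep a function traceable, leave its control interface at the tool default instead of mapping it to AXI4-Lite. (In software compilation the pragma is simply an ignored unknown pragma, so the function still runs for emulation and testing.)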

The following table shows the approximate resource utilization of these hardware monitor cores on a Zynq-7000 (xc7z020-1clg400) device:

Core Name                 LUTs  FFs  BRAMs  DSPs
Accelerator                 79   18      0     0
AXI4-Stream (basic)         79   14      0     0
AXI4-Stream (statistics)   132  183      0     0

The AXI4-Stream monitor core has two modes: basic and statistics. Basic mode generates only the start/stop trace events. Statistics mode additionally enables an AXI4-Lite interface to two 32-bit registers: the register at offset 0x0 holds the word count of the current, in-progress transfer, and the register at offset 0x4 holds the word count of the previous transfer. As soon as a transfer completes, the current count is moved to the previous register. By default, the AXI4-Stream core is configured in basic mode. A future release will allow you to choose the mode; the core itself already supports statistics mode, so adventurous users could configure it manually in the Vivado tools, but doing so is not supported in the current release.
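Software can read this register pair through the AXI4-Lite interface once it is memory-mapped. A minimal sketch follows, assuming the monitor's register space has already been mapped into the process (the base address and the mmap() call are platform-specific and omitted here; the helper name is hypothetical):

```cpp
#include <cstdint>
#include <cstddef>

// Register offsets described above (statistics mode only).
constexpr std::size_t CURRENT_COUNT_OFFSET  = 0x0; // word count of in-progress transfer
constexpr std::size_t PREVIOUS_COUNT_OFFSET = 0x4; // word count of last completed transfer

struct StreamCounts {
    std::uint32_t current;   // on-going transfer
    std::uint32_t previous;  // previous transfer
};

// Read both word counts through a pointer to the monitor's AXI4-Lite
// register space. `base` is hypothetical; on real hardware it would come
// from memory-mapping the core's physical base address.
StreamCounts read_stream_counts(const volatile std::uint32_t *base) {
    StreamCounts c;
    c.current  = base[CURRENT_COUNT_OFFSET  / sizeof(std::uint32_t)];
    c.previous = base[PREVIOUS_COUNT_OFFSET / sizeof(std::uint32_t)];
    return c;
}
```

The volatile qualifier matters on hardware: both registers can change between reads as the stream transfers data, so the compiler must not cache or reorder the accesses.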

In addition to the hardware trace monitor cores, a single integration core combines the output trace event signals. This core has a parameterizable number of ports (from 1 to 63), and can thus support up to 63 individual monitor cores (either accelerator or AXI4-Stream). Its resource utilization depends on the number of ports enabled, and thus on the number of monitor cores inserted. The following table shows the resource utilization of this core for a Zynq-7000 (xc7z020-1clg400) device:

Number of Ports  LUTs  FFs   BRAMs  DSPs
 1                241   404      0     0
 2                307   459      0     0
 3                366   526      0     0
 4                407   633      0     0
 6                516   686      0     0
 8                644   912      0     0
16               1243  1409      0     0
32               2190  2338      0     0
63               3830  3812      0     0
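The incremental per-port cost of the integration core can be estimated directly from the table's endpoints. A small sketch of that arithmetic (the values are copied from the table above):

```cpp
#include <cassert>

// Integration-core utilization copied from the table above: {ports, LUTs, FFs}.
struct Row { int ports, luts, ffs; };
constexpr Row kSmallest = { 1,  241,  404};
constexpr Row kLargest  = {63, 3830, 3812};

// Average incremental cost per additional port between the two endpoints.
double luts_per_port() {
    return double(kLargest.luts - kSmallest.luts) / (kLargest.ports - kSmallest.ports);
}
double ffs_per_port() {
    return double(kLargest.ffs - kSmallest.ffs) / (kLargest.ports - kSmallest.ports);
}
```

This works out to roughly 58 LUTs and 55 FFs per additional port, on top of the fixed cost of the single-port configuration.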

Judging from the table, each additional port (and hence each additional monitor core) adds on average roughly 58 look-up tables (LUTs) and 55 flip-flops (FFs) to the integration core. At the system level, for example, the resource utilization for the matrix multiplication template application on the ZC702 platform (which uses the same xc7z020-1clg400 part) is shown in the table below:

System               LUTs    FFs     BRAMs  DSPs
Base (no trace)      16,433  21,426     46   160
Event trace enabled  17,612  22,829     48   160

Based on the results above, the difference between the two designs is approximately 1,200 LUTs, 1,400 FFs, and two BRAMs. This design has a single accelerator with three AXI4-Stream ports (two inputs and one output), so enabling event trace inserts four monitors (one accelerator monitor and three AXI4-Stream monitors), plus a single 4-port integration core and the associated read-out logic. From the resource estimates above, roughly 720 LUTs and 700 FFs belong to the trace monitoring hardware itself (monitors and integration core). The remaining roughly 460 LUTs, 710 FFs, and two BRAMs belong to the read-out logic, which converts the AXI4-Stream trace output to JTAG. The resource utilization of this read-out logic is static and does not vary with the design.
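The breakdown above can be checked directly against the earlier tables. A small sketch of the arithmetic:

```cpp
#include <cassert>

// System totals from the table above.
constexpr int kBaseLUTs  = 16433, kBaseFFs  = 21426;
constexpr int kTraceLUTs = 17612, kTraceFFs = 22829;

// Per-core estimates from the earlier tables: one accelerator monitor,
// three basic AXI4-Stream monitors, and a 4-port integration core.
constexpr int kMonitorLUTs = 79 + 3 * 79 + 407; // = 723
constexpr int kMonitorFFs  = 18 + 3 * 14 + 633; // = 693

int trace_cost_luts() { return kTraceLUTs - kBaseLUTs; }            // total added LUTs
int trace_cost_ffs()  { return kTraceFFs  - kBaseFFs;  }            // total added FFs
int readout_luts()    { return trace_cost_luts() - kMonitorLUTs; }  // read-out logic share
int readout_ffs()     { return trace_cost_ffs()  - kMonitorFFs;  }
```

The totals come to 1,179 LUTs and 1,403 FFs added overall, of which 456 LUTs and 710 FFs are left for the read-out logic after subtracting the monitor and integration-core estimates.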