Profiling and Instrumenting Code to Measure Performance

The first major task in creating a software-defined SoC is to identify portions of application code that are suitable for implementation in hardware and that significantly improve overall performance when run in hardware. Compute-intensive program hot spots are good candidates for hardware acceleration, especially when data can be streamed between hardware, the CPU, and memory so that computation overlaps with communication. Software profiling is a standard way to identify the most CPU-intensive portions of your program.

The SDSoC environment includes all of the performance and profiling capabilities of the Xilinx SDK, including gprof, the non-intrusive Target Communication Framework (TCF) Profiler, and the Performance Analysis perspective within Eclipse.

To run the TCF Profiler for a standalone application, perform the following steps:
  1. Set the active build configuration to SDDebug by right-clicking on the project in the Project Explorer and selecting Build Configurations > Set Active > SDDebug.
  2. In the SDSoC Project Overview window, click on Debug application.
    Note: The board must be connected to your computer and powered on. The application automatically breaks at the entry to main().
  3. Launch the TCF Profiler by selecting Window > Show View > Other > Debug > TCF Profiler.
  4. Start the TCF Profiler by clicking on the green Start button at the top of the TCF Profiler tab. Enable Aggregate per function in the Profiler Configuration dialog box.
  5. Start the profiling by clicking on the Resume button. The program runs to completion and breaks at the exit() function.
  6. View the results in the TCF Profiler tab.

Profiling provides a statistical method for finding hot spots: it samples the CPU program counter and correlates the samples with the program in execution. Another way to measure program performance is to instrument the application to determine the actual time that elapses between different points in the program's execution.
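To make the sampling idea concrete, the following is a minimal sketch of how program counter samples could be attributed to functions, similar in spirit to an "aggregate per function" view. It is not how the TCF Profiler is implemented; the function_range type, the aggregate_samples() helper, and the addresses used with them are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function occupying [begin, end).
struct function_range {
    std::string name;
    uint64_t begin;
    uint64_t end;
};

// Counts how many sampled program counter values fall inside each function.
// Samples that match no known function are ignored.
std::map<std::string, std::size_t> aggregate_samples(
    const std::vector<function_range>& symbols,
    const std::vector<uint64_t>& pc_samples)
{
    std::map<std::string, std::size_t> hits;
    for (uint64_t pc : pc_samples) {
        for (const function_range& fn : symbols) {
            if (pc >= fn.begin && pc < fn.end) {
                ++hits[fn.name];
                break;  // a sample belongs to at most one function
            }
        }
    }
    return hits;
}
```

Functions with the largest counts are where the CPU spent most of the sampled time, which is what makes them hot-spot candidates.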

The sds_lib library included in the SDSoC environment provides a simple time-stamping API, based on source code annotation, that can be used to measure application performance.

/**
 * @return value of free-running 64-bit Zynq(TM) global counter
 */
unsigned long long sds_clock_counter(void);

By using this API to collect timestamps and compute the differences between them, you can determine the duration of key parts of your program. For example, you can measure data transfer time or the overall round-trip execution time for hardware functions, as shown in the following code snippet:
#include <cstdint>
#include <iostream>
#include "sds_lib.h"

class perf_counter
{
public:
     uint64_t tot, cnt, calls;
     perf_counter() : tot(0), cnt(0), calls(0) {}
     inline void reset() { tot = cnt = calls = 0; }
     inline void start() { cnt = sds_clock_counter(); calls++; }
     inline void stop() { tot += (sds_clock_counter() - cnt); }
     // Average ticks per start()/stop() pair; 0 if nothing was measured.
     inline uint64_t avg_cpu_cycles() { return calls ? (tot / calls) : 0; }
};

extern void f();
void measure_f_runtime()
{
     perf_counter f_ctr;
     f_ctr.start();
     f();
     f_ctr.stop();
     std::cout << "Cpu cycles f(): " << f_ctr.avg_cpu_cycles()
               << std::endl;
}
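
The same time-stamping pattern can be used to time the data transfer and compute phases separately. The sketch below is meant to run on a host machine, so it replaces sds_clock_counter() with a deterministic stand-in. The names clock_counter_stub(), simulated_work(), phase_report, and measure_phases(), as well as the tick costs of 100 and 400, are assumptions made for illustration and are not part of sds_lib.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for sds_clock_counter() so the sketch runs off-target.
// On the target, include "sds_lib.h" and call sds_clock_counter() instead.
static uint64_t fake_ticks = 0;
static uint64_t clock_counter_stub() { return fake_ticks; }

// Hypothetical workload that "consumes" a fixed number of counter ticks.
static void simulated_work(uint64_t cost) { fake_ticks += cost; }

class perf_counter {
public:
    uint64_t tot = 0, cnt = 0, calls = 0;
    void start() { cnt = clock_counter_stub(); ++calls; }
    void stop() { tot += clock_counter_stub() - cnt; }
    uint64_t avg_cycles() const { return calls ? tot / calls : 0; }
};

struct phase_report {
    uint64_t avg_transfer;  // average ticks spent moving data
    uint64_t avg_compute;   // average ticks spent in the function itself
};

// Uses one counter per phase so the two costs are reported separately.
phase_report measure_phases(int iterations) {
    assert(iterations > 0);
    perf_counter transfer_ctr, compute_ctr;
    for (int i = 0; i < iterations; ++i) {
        transfer_ctr.start();
        simulated_work(100);  // stands in for a data transfer
        transfer_ctr.stop();

        compute_ctr.start();
        simulated_work(400);  // stands in for the hardware function call
        compute_ctr.stop();
    }
    return { transfer_ctr.avg_cycles(), compute_ctr.avg_cycles() };
}
```

Keeping one counter per phase makes it easier to see whether a function is limited by data movement or by computation, which in turn suggests where restructuring effort is best spent.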

The performance estimation feature within the SDSoC environment employs this API by automatically instrumenting functions selected for hardware implementation, measuring actual run-times by running the application on the target, and then comparing actual times with estimated times for the hardware functions.

Note: While off-loading CPU-intensive functions is probably the most reliable heuristic to partition your application, it is not guaranteed to improve system performance without algorithmic modification to optimize memory accesses. A CPU almost always has much faster random access to external memory than you can achieve from programmable logic, due to multi-level caching and a faster clock speed (typically 2x to 8x faster than programmable logic). Extensive manipulation of pointer variables over a large address range, for example, a sort routine that sorts indices over a large index set, while very well-suited for a CPU, may become a liability when moving a function into programmable logic. This does not mean that such compute functions are not good candidates for hardware, only that code or algorithm restructuring may be required. This issue is also well-known for DSP and GPU coprocessors.
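
The access-pattern concern in the note can be illustrated with a small host-side sketch that computes the same result in two ways: once by reading through a scattered index array, and once in a single sequential pass. The function names and the stride-based permutation are hypothetical, and the sketch makes no claim about actual timings on any device. It only contrasts the address patterns that a restructured algorithm would aim for.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Index-driven access: each read lands wherever idx points, so the
// addresses jump around. This is cheap for a cached CPU, but awkward
// to stream from programmable logic.
uint64_t sum_indirect(const std::vector<uint32_t>& data,
                      const std::vector<std::size_t>& idx) {
    uint64_t sum = 0;
    for (std::size_t i : idx) sum += data[i];
    return sum;
}

// Sequential access: addresses increase monotonically, which maps
// naturally onto burst or streaming transfers.
uint64_t sum_streaming(const std::vector<uint32_t>& data) {
    return std::accumulate(data.begin(), data.end(), uint64_t{0});
}

// Builds a scattered visiting order of 0..n-1. The result is a true
// permutation only when stride and n share no common factor.
std::vector<std::size_t> scattered_permutation(std::size_t n,
                                               std::size_t stride) {
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = (i * stride) % n;
    return idx;
}
```

When the order of accesses does not affect the result, as with this sum, replacing the indexed traversal with the sequential one is the kind of restructuring that can make a function a better fit for hardware.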