Lab 5: Task-Level Pipelining

This lab demonstrates how to modify your code to optimize the hardware-software system generated by the SDx IDE using task-level pipelining. You can observe the impact of pipelining on performance.

Note: This tutorial is divided into steps. Each step gives a general instruction, followed by supplementary detailed steps, so that you can make choices based on your skill level as you progress. If you need help completing a general instruction, consult the detailed steps; if you are confident, skip them and move on to the next general instruction.
Note: You can complete this tutorial even if you do not have a ZC702 board; when creating the SDSoC environment project, select your own board. The tutorial instructions ask you to add source files written for a ZC702 application. If your board contains a smaller Zynq-7000 device, after adding the source files you need to edit mmult_accel.cpp to reduce resource usage: in the accelerator source file, find the #pragma HLS array_partition directive that sets block factor=16, and change it to block factor=8.
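For reference, the edit looks like the following (the variable name and dim value here are illustrative; match whatever the pragma in your copy of mmult_accel.cpp actually uses):

```cpp
// Before (as shipped, sized for the ZC702's Zynq-7000 device):
#pragma HLS array_partition variable=A block factor=16 dim=2

// After (reduced to fit smaller Zynq-7000 devices):
#pragma HLS array_partition variable=A block factor=8 dim=2
```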

Task Pipelining

If there are multiple calls to an accelerator in your application, then you can structure your application such that you can pipeline these calls and overlap the setup and data transfer with the accelerator computation. In the case of the matrix multiply application, the following events take place:

  1. Matrices A and B are transferred from the main memory to accelerator local memories.
  2. The accelerator executes.
  3. The result, C, is transferred back from the accelerator to the main memory.

The following figure illustrates the matrix multiply design on the left and, on the right, a time chart of these events for two successive calls executing sequentially.

Figure: Sequential Execution of Matrix Multiply Calls

The following figure shows the two calls executing in a pipelined fashion. The data transfer for the second call starts as soon as the data transfer for the first call finishes, and it overlaps with the execution of the first call. To enable this pipelining, however, extra local memory is needed to store the second set of arguments while the accelerator is still computing with the first set. The SDSoC environment generates these memories, called multi-buffers, under user guidance.

Figure: Pipelined Execution of Matrix Multiply Calls

Specifying task-level pipelining requires rewriting the calling code using the pragmas async(id) and wait(id). The SDSoC environment includes a Matrix Multiply Pipelined example that demonstrates the use of the async pragmas, and this tutorial uses that example.

Task Pipelining in the Matrix Multiply Example

Learning Objectives

After you complete the tutorial, you should be able to:
  • Use the SDx IDE to optimize your application to reduce runtime by performing task-level pipelining.
  • Observe the performance impact of pipelining calls to an accelerator, where accelerator computation overlaps with input and output communication.