Pipeline a loop to improve latency and throughput. Although loop unrolling exposes concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Keeping the data path busy is necessary for maximizing kernel throughput and performance. Even in an unrolled loop, loop control dependencies can force operations to execute sequentially, which leaves hardware idle and reduces performance.
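The effect of overlapping iterations can be estimated with a simple cycle model. The sketch below is an illustration only, not output from the SDAccel tools: it assumes every iteration has the same latency (`depth`) and that a pipelined loop starts a new iteration every `ii` cycles (the initiation interval). The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Simplified cycle model (an assumption for illustration, not a tool report):
 * every iteration takes `depth` cycles from start to finish. */

/* Sequential execution: an iteration starts only after the previous one ends. */
unsigned long sequential_cycles(unsigned long trips, unsigned long depth)
{
    return trips * depth;
}

/* Pipelined execution: a new iteration starts every `ii` cycles, so
 * iterations overlap. The first result appears after `depth` cycles and
 * each remaining iteration adds `ii` cycles. */
unsigned long pipelined_cycles(unsigned long trips, unsigned long depth,
                               unsigned long ii)
{
    assert(ii >= 1); /* an initiation interval below 1 is not meaningful */
    if (trips == 0)
        return 0;
    return depth + (trips - 1) * ii;
}
```

With 32 iterations, an assumed iteration latency of 4 cycles, and an initiation interval of 1, this model gives 128 cycles for sequential execution and 35 cycles for pipelined execution.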

Xilinx addresses this issue with a vendor extension for loop pipelining on top of the OpenCL 2.0 specification. The Xilinx attribute for loop pipelining is xcl_pipeline_loop. By default, the SDAccel™ compiler automatically applies this attribute to the innermost loop when its trip count is greater than 64, or to its parent loop when the innermost loop's trip count is less than or equal to 64.
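The default rule can be restated as a small predicate. This is a sketch of the rule as described above, under the reading that the trip count being compared is that of the innermost loop. It is not compiler code, and the helper name is hypothetical.

```c
#include <stdbool.h>

/* Threshold quoted in the text above. */
#define PIPELINE_TRIP_THRESHOLD 64u

/* Hypothetical helper: returns true when, by default, the innermost loop
 * itself is the one pipelined, and false when its parent loop is pipelined
 * instead. Mirrors the documented rule only. */
bool default_pipelines_innermost(unsigned trip_count)
{
    return trip_count > PIPELINE_TRIP_THRESHOLD;
}
```

Applying xcl_pipeline_loop explicitly to a loop states the intent in the source instead of relying on this default.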


Place the attribute in the OpenCL source immediately before the loop definition:

__attribute__((xcl_pipeline_loop))

The following example pipelines LOOP_1 of function vaccum to improve performance:

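__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int* result)
{
  int tmp = 0;
  // Pipeline LOOP_1 so that successive iterations overlap in the data path
  __attribute__((xcl_pipeline_loop))
  LOOP_1: for (int i=0; i < 32; i++) {
    tmp += a[i] * b[i];
  }
  result[0] = tmp;
}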
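Pipelining changes how the loop iterations are scheduled, not the value the loop computes, so the kernel output can be checked against a plain C reference on the host. The following is a minimal sketch: the function name vaccum_reference and the length parameter n are illustrative and are not part of the SDAccel example, which hard-codes 32 iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C reference for the multiply-accumulate performed by LOOP_1.
 * Returns the sum of a[i] * b[i] over the first n elements. */
int vaccum_reference(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int tmp = 0;
    for (size_t i = 0; i < n; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}
```

Calling vaccum_reference(a, b, 32) on the same input buffers passed to the kernel should produce the value the kernel writes to result[0].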

See Also