You can pipeline a loop to reduce latency and maximize kernel throughput.

Although unrolling loops increases concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.
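To make the cost of sequential execution concrete, the following C sketch compares cycle counts for a loop that runs its iterations back to back with the same loop pipelined. It uses the common first-order model in which a pipelined loop of N iterations takes depth + II x (N - 1) cycles, where II is the initiation interval. The function names and the example figures are illustrative assumptions, not values reported by the compiler.

```c
#include <stdint.h>

/* Cycles for a loop whose iterations run back to back with no overlap:
 * each of the trip_count iterations occupies the data path for the full
 * iteration latency (depth). Hypothetical helper for illustration. */
uint64_t sequential_cycles(uint64_t trip_count, uint64_t depth)
{
    return trip_count * depth;
}

/* Cycles for the same loop when pipelined: the first iteration takes
 * depth cycles, and each later iteration starts ii cycles after the
 * previous one (ii = initiation interval). Hypothetical helper. */
uint64_t pipelined_cycles(uint64_t trip_count, uint64_t depth, uint64_t ii)
{
    if (trip_count == 0) {
        return 0;
    }
    return depth + ii * (trip_count - 1);
}
```

Under this model, a 32-iteration loop with an assumed iteration latency of 4 cycles takes 128 cycles sequentially and 35 cycles when pipelined with an II of 1.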

Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0 specification for loop pipelining: xcl_pipeline_loop.

By default, the XOCC compiler automatically pipelines loops with a trip count greater than 64 and unrolls loops with a trip count less than 64. These defaults should provide good results. However, to pipeline a loop that would otherwise be unrolled automatically, explicitly specify the nounroll attribute and the xcl_pipeline_loop attribute before the loop.


Place the attribute in the OpenCL source before the loop definition:

    __attribute__((xcl_pipeline_loop))

The following example pipelines LOOP_1 of function vaccum to improve performance:

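__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int* result)
{
    int tmp = 0;
    // Pipeline LOOP_1 so a new iteration can start before the previous one finishes.
    __attribute__((xcl_pipeline_loop))
    LOOP_1: for (int i=0; i < 32; i++) {
        tmp += a[i] * b[i];
    }
    result[0] = tmp;
}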
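
When checking the kernel's output on the host, a plain C reference of the same multiply-accumulate can be useful. The sketch below is only an assumption about how such a check might look. The function name vaccum_reference and the parameter n are hypothetical and not part of the Xilinx example.

```c
#include <stddef.h>

/* Hypothetical host-side reference for the vaccum kernel: accumulates
 * a[i] * b[i] over the first n elements and returns the sum that the
 * kernel would write to result[0] (n = 32 in the example above). */
int vaccum_reference(const int *a, const int *b, size_t n)
{
    int tmp = 0;
    for (size_t i = 0; i < n; i++) {
        tmp += a[i] * b[i];
    }
    return tmp;
}
```

Comparing the returned value with result[0] read back from the device gives a simple functional check that is independent of how the loop is scheduled.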

See Also