Optimizing the Performance

Host Optimization

This section focuses on optimization of the host program, which uses the OpenCL™ API to schedule the individual compute unit executions, and data transfers to and from the FPGA board. As a result, you need to think about concurrent execution of tasks through the OpenCL command queue(s). This section discusses common pitfalls, and how to recognize and address them.

Reducing Overhead of Kernel Enqueing

The OpenCL API execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by the OpenCL runtime multiple times and then scheduled to be executed on the device. You must send the command to start the kernel in one of two ways:

  • Using clEnqueueNDRange API for the data parallel case
  • Using clEnqueueTask for the task parallel case

The dispatching process is executed on the host processor, and the kernel commands and arguments need to be sent to the accelerator, over the PCIe® bus in the case of the Alveo card for instance. In the Xilinx Runtime (XRT) library, the overhead of dispatching the command and arguments to the accelerator can be between 30 µs and 60 µs, depending the number of arguments on the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed.

For the data parallel case, Xilinx recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the following code snippet:

size_t global = 1;
size_t local = 1;
clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
                       &global, &local, 2, write_events.data(),
                       &kernel_events[0]));
Note: For the task parallel case, Xilinx recommends minimizing the calls to clEnqueueTask. Ideally, you should finish all the workload in a single call to clEnqueueTask.

For more information on reducing overhead on kernel execution, see Kernel Execution.

Optimizing Data Movement

Figure 1: Optimizing Data Movement Flow

In the OpenCL API, all data is transferred from the host memory to the global memory on the device first and then from the global memory to the kernel for computation. The computation results are written back from the kernel to the global memory and lastly from the global memory to the host memory. A key factor in determining strategies for kernel optimization is understanding how data can be efficiently moved around.

Note: Optimize the data movement in the application before optimizing computation.

During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement. Xilinx recommends that you modify the host code and kernels with data transfer code only for this optimization step. The goal is to maximize the system level data throughput by maximizing PCIe bandwidth usage and DDR bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution on FPGAs to achieve this goal.

Overlapping Data Transfers with Kernel Computation

Applications, such as database analytics, have a much larger data set than can be stored in the available memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.

Below is the vadd kernel from the overlap example in the host category of Vitis Examples: Getting Started on GitHub.

#define BUFFER_SIZE 256
#define DATA_SIZE 1024

//TRIPCOUNT indentifier
const unsigned int c_len = DATA_SIZE / BUFFER_SIZE;
const unsigned int c_size = BUFFER_SIZE;

extern "C" {
void vadd(int *c, int *a, int *b, const int elements) {
#pragma HLS INTERFACE m_axi port = c offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = a offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = b offset = slave bundle = gmem

#pragma HLS INTERFACE s_axilite port = c bundle = control
#pragma HLS INTERFACE s_axilite port = a bundle = control
#pragma HLS INTERFACE s_axilite port = b bundle = control
#pragma HLS INTERFACE s_axilite port = elements bundle = control
#pragma HLS INTERFACE s_axilite port = return bundle = control

    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];
    for (int i = 0; i < elements; i += BUFFER_SIZE) {
       #pragma HLS LOOP_TRIPCOUNT min=c_len max=c_len
        int size = BUFFER_SIZE;
        if (i + size > elements)
            size = elements - i;
    readA:
        for (int j = 0; j < size; j++) {
           #pragma HLS PIPELINE II=1
           #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            arrayA[j] = a[i + j];
        }

    readB:
        for (int j = 0; j < size; j++) {
           #pragma HLS PIPELINE II=1
           #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            arrayB[j] = b[i + j];
        }

    vadd_writeC:
        for (int j = 0; j < size; j++) {
           #pragma HLS PIPELINE II=1
           #pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            c[i + j] = arrayA[j] + arrayB[j];
        }
    }
}
}

From the host perspective, there are four tasks to perform in this example:

  1. Write buffer a (Wa)
  2. Write buffer b (Wb)
  3. Execute vadd kernel
  4. Read buffer c (Rc)

Using an out-of-order command queue, data transfer and kernel execution can overlap as illustrated in the figure below. In the host code for this example, double buffering is used for all buffers so that the kernel can process one set of buffers while the host can operate on the other set of buffers.

The OpenCL event object provides an easy method to set up complex operation dependencies and synchronize host threads and device operations. Events are OpenCL objects that track the status of operations. Event objects are created by kernel execution commands, read, write, copy commands on memory objects or user events created using clCreateUserEvent. You can ensure an operation has completed by querying events returned by these commands. The arrows in the figure below show how event triggering can be set up to achieve optimal performance.

Figure 2: Event Triggering Set Up

The host code enqueues the four tasks in a loop to process the complete data set. It also sets up event synchronization between different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory objects values to clEnqueueMigrateMemObjects API. The event synchronization is achieved by having each API call wait for other event as well as trigger its own event when the API completes.

// THIS PAIR OF EVENTS WILL BE USED TO TRACK WHEN A KERNEL IS FINISHED WITH
// THE INPUT BUFFERS. ONCE THE KERNEL IS FINISHED PROCESSING THE DATA, A NEW
// SET OF ELEMENTS WILL BE WRITTEN INTO THE BUFFER.
vector<cl::Event> kernel_events(2);
vector<cl::Event> read_events(2);
cl::Buffer buffer_a[2], buffer_b[2], buffer_c[2];

for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) {
    int flag = iteration_idx % 2;

    if (iteration_idx >= 2) {
        OCL_CHECK(err, err = read_events[flag].wait());
    }

    // Allocate Buffer in Global Memory
    // Buffers are allocated using CL_MEM_USE_HOST_PTR for efficient memory and
    // Device-to-host communication
    std::cout << "Creating Buffers..." << std::endl;
    OCL_CHECK(err,
                buffer_a[flag] =
                    cl::Buffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                bytes_per_iteration,
                                &A[iteration_idx * elements_per_iteration],
                                &err));
    OCL_CHECK(err,
                buffer_b[flag] =
                    cl::Buffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                bytes_per_iteration,
                                &B[iteration_idx * elements_per_iteration],
                                &err));
    OCL_CHECK(err,
                buffer_c[flag] = cl::Buffer(
                    context,
                    CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                    bytes_per_iteration,
                    &device_result[iteration_idx * elements_per_iteration],
                    &err));

    vector<cl::Event> write_event(1);

    OCL_CHECK(err, err = krnl_vadd.setArg(0, buffer_c[flag]));
    OCL_CHECK(err, err = krnl_vadd.setArg(1, buffer_a[flag]));
    OCL_CHECK(err, err = krnl_vadd.setArg(2, buffer_b[flag]));
    OCL_CHECK(err, err = krnl_vadd.setArg(3, int(elements_per_iteration)));

    // Copy input data to device global memory
    std::cout << "Copying data (Host to Device)..." << std::endl;
    // Because we are passing the write_event, it returns an event object
    // that identifies this particular command and can be used to query
    // or queue a wait for this particular command to complete.
    OCL_CHECK(
        err,
        err = q.enqueueMigrateMemObjects({buffer_a[flag], buffer_b[flag]},
                                            0 /*0 means from host*/,
                                            NULL,
                                            &write_event[0]));
    set_callback(write_event[0], "ooo_queue");

    printf("Enqueueing NDRange kernel.\n");
    // This event needs to wait for the write buffer operations to complete
    // before executing. We are sending the write_events into its wait list to
    // ensure that the order of operations is correct.
    //Launch the Kernel
    std::vector<cl::Event> waitList;
    waitList.push_back(write_event[0]);
    OCL_CHECK(err,
                err = q.enqueueNDRangeKernel(
                    krnl_vadd, 0, 1, 1, &waitList, &kernel_events[flag]));
    set_callback(kernel_events[flag], "ooo_queue");

    // Copy Result from Device Global Memory to Host Local Memory
    std::cout << "Getting Results (Device to Host)..." << std::endl;
    std::vector<cl::Event> eventList;
    eventList.push_back(kernel_events[flag]);
    // This operation only needs to wait for the kernel call. This call will
    // potentially overlap the next kernel call as well as the next read
    // operations
    OCL_CHECK(err,
                err = q.enqueueMigrateMemObjects({buffer_c[flag]},
                                                CL_MIGRATE_MEM_OBJECT_HOST,
                                                &eventList,
                                                &read_events[flag]));
    set_callback(read_events[flag], "ooo_queue");

    OCL_CHECK(err, err = read_events[flag].wait());
}

The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd_1 is running constantly.

Figure 3: Data Transfer Time Hidden in Application Timeline View

Buffer Memory Segmentation

Allocation and deallocation of memory buffers can lead to memory segmentation in the DDR controllers. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.

This issue occurs most often when multiple pthreads for different compute units are used and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it might seem the processes are sleeping.

Each buffer allocated by runtime should be continuous in hardware. For large memory, it might take some time to wait for that space to be freed, when many buffers are allocated and deallocated. This can be resolved by allocating device buffer and reusing it between different enqueues of a kernel.

For more details on optimizing memory performance, see Reading and Writing by Burst.

Compute Unit Scheduling

Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.

Multiple In-Order Command Queues

The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.

Figure 4: Example with Two In-Order Command Queues

The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up multiple in-order command queues and enqueues commands into each queue:

    OCL_CHECK(err,
              cl::CommandQueue ordered_queue1(
                  context, device, CL_QUEUE_PROFILING_ENABLE, &err));
    OCL_CHECK(err,
              cl::CommandQueue ordered_queue2(
                  context, device, CL_QUEUE_PROFILING_ENABLE, &err));
...

    printf("[Ordered Queue 1]: Enqueueing scale kernel\n");
    OCL_CHECK(
        err,
        err = ordered_queue1.enqueueNDRangeKernel(
            kernel_mscale, offset, global, local, nullptr, &kernel_events[0]));

    set_callback(kernel_events[0], "scale");
...
    printf("[Ordered Queue 1]: Enqueueing addition kernel\n");
    OCL_CHECK(
        err,
        err = ordered_queue1.enqueueNDRangeKernel(
            kernel_madd, offset, global, local, nullptr, &kernel_events[1]));

    set_callback(kernel_events[1], "addition");
...
    printf("[Ordered Queue 2]: Enqueueing matrix multiplication kernel\n");
    OCL_CHECK(
        err,
        err = ordered_queue2.enqueueNDRangeKernel(
            kernel_mmult, offset, global, local, nullptr, &kernel_events[2]));
    set_callback(kernel_events[2], "matrix multiplication");

Single Out-of-Order Command Queue

The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must manually define event dependencies and synchronizations as required.

Figure 5: Example with Single Out-of-Order Command Queue

The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up a single out-of-order command queue and enqueues commands as needed:

    OCL_CHECK(
        err,
        cl::CommandQueue ooo_queue(context,
                                   device,
                                   CL_QUEUE_PROFILING_ENABLE |
                                       CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                   &err));

...
    printf("[OOO Queue]: Enqueueing scale kernel\n");
    OCL_CHECK(
        err,
        err = ooo_queue.enqueueNDRangeKernel(
            kernel_mscale, offset, global, local, nullptr, &ooo_events[0]));
    set_callback(ooo_events[0], "scale");
...
    // This is an out of order queue, events can be executed in any order. Since
    // this call depends on the results of the previous call we must pass the
    // event object from the previous call to this kernel's event wait list.
    printf("[OOO Queue]: Enqueueing addition kernel (Depends on scale)\n");

    kernel_wait_events.resize(0);
    kernel_wait_events.push_back(ooo_events[0]);

    OCL_CHECK(err,
              err = ooo_queue.enqueueNDRangeKernel(
                  kernel_madd,
                  offset,
                  global,
                  local,
                  &kernel_wait_events, // Event from previous call
                  &ooo_events[1]));
    set_callback(ooo_events[1], "addition");
...
    // This call does not depend on previous calls so we are passing nullptr
    // into the event wait list. The runtime should schedule this kernel in
    // parallel to the previous calls.
    printf("[OOO Queue]: Enqueueing matrix multiplication kernel\n");
    OCL_CHECK(err,
              err = ooo_queue.enqueueNDRangeKernel(
                  kernel_mmult,
                  offset,
                  global,
                  local,
                  nullptr, // Does not depend on previous call
                  &ooo_events[2]));
    set_callback(ooo_events[2], "matrix multiplication");

The Application Timeline view (as shown in the following figure) that the compute unit mmult_1 is running in parallel with the compute units mscale_1 and madd_1, using both multiple in-order queues and single out-of-order queue methods.

Figure 6: Application Timeline View Showing mult_1 Running with mscale_1 and madd_1

Kernel Optimization

One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput versus power consumption. The following guidelines help manage the design complexity and achieve the desired design goals.

Optimizing Kernel Computation

Figure 7: Optimizing Kernel Computation Flow

The goal of kernel optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric is the initiation interval (II), or the number of clock cycles before the kernel can accept new input data. Optimizing the II is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, data flowing, etc. For more information on kernel optimization, see Linking the Kernels.

Interface Attributes (Detailed Kernel Trace)

The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.

Figure 8: Accelerated Algorithm Kernel Trace

Most interesting with respect to performance are the fields:

Burst Length
Describes how many packages are sent within one transaction.
Burst Size
Describes the number of bytes being transferred as part of one package.

Given a burst length of 1 and just 4 bytes per package, it will require many individual AXI transactions to transfer any reasonable amount of data.

Note: The Vitis core development kit never creates burst sizes less than 4 bytes, even if smaller data is transmitted. In this case, if consecutive items are accessed without AXI bursts enabled, it is possible to observe multiple AXI reads to the same address.

Small burst lengths, as well as burst sizes, considerably less than 512 bits are therefore good opportunities to optimize interface performance.

Using Burst Data Transfers

Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.

Note: Infer burst transfers from successive requests of data from consecutive address locations. Refer to Reading and Writing by Burst for more details.

If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:

Figure 9: Burst Data Transfer with Detailed Kernel Trace

In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are actually implemented rather differently (shorter transaction time). Hover over these transactions, you would see that the AXI interconnect has packed the 16 x 4 byte transaction into a single package transaction of 1 x 64 bytes. This effectively uses the AXI4 bandwidth which is even more favorable. The next section focuses on this optimization technique in more detail.

Burst inference is heavily dependent on coding style and access pattern. However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:

void kernel(T in[1024], T out[1024]) {
    T tmpIn[1024];
    T tmpOu[1024];
    read(in, tmpIn);
    process(tmpIn, tmpOut);
    write(tmpOut, out);
}

In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes to the AXI output.

The isolation of the read and write function from the computation results in:

  • Simple control structures (loops) in the read/write function which makes burst detection simpler.
  • The isolation of the computational function away from the AXI interfaces, simplifies potential kernel optimization. See Kernel Optimization for more information.
  • The internal variables are mapped to on-chip memory, which allow faster access compared to AXI transactions. Acceleration platforms supported in the Vitis core development kit can have as much as 10 MB on-chip memories that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.

Using Full AXI Data Width

The user data width between the kernel and the memory controller can be configured by the Vitis compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types map to the full data width on the memory controller. The memory controller in all supported acceleration cards supports 512-bit user interface, which can be mapped to OpenCL vector data types, such as int16 or C/C++ arbitrary precision data type ap_int<512>.

As shown on the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit package size (Burst Size 64 bytes).

Figure 10: Burst AXI Transactions

This example shows good interface configuration as it maximizes AXI data width as well as actual burst transactions.

Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.

Note: Use simple structs for kernel arguments that can be packed to 32-bit boundary. Refer to the Custom Data Type Example in kernel_to_gmem category at Xilinx Getting Started Example on GitHub for the recommended method to use structs.
Setting Data Width Using OpenCL Attributes

The OpenCL API provides attributes to support a more automatic approach to incrementing AXI data width usage. The change of the interface data types, as stated above is supported in the API as well but will require the same code changes as C/C++ to the algorithm to accommodate the larger input vector.

To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm:

Examine the combined functionality on the following case:

__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
    size_t idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

In this case, the hard coded interface is a 32-bit wide data path (int *c, int* a, int *b), which drastically limits the memory throughput if implemented directly. However, the automatic widening and transformation is applied, based on the values of the three attributes.

__attribute__((vec_type_hint(int)))
Declares that int is the main type used for computation and memory transfer (32-bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32-bit. This implies that in theory, 16 values could be processed if vectorization can be applied.
__attribute__((reqd_work_group_size(X, Y, Z)))
Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items therefore defining the maximum possible vectorization factor which would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1=64.

The actual vectorization factor to be applied will be the greatest common divider of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.

The quotient of maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.

__attribute__((xcl_zero_global_work_offset))
The __attribute__((xcl_zero_global_work_offset)) instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).

It should be noted, that the application of these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of local arrays in which data is stored. This usually behaves nicely, but can interact poorly in rare situations.

For example:

  • For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor.
    • The resulting access requires a lot of multiplexers and will create a difficult issue for the scheduler (might severely increase memory usage and compilation time). Xilinx recommends using partitioning factors that are powers of two (as the vectorization factor is always a power of two).
  • If the loop being vectorized has an unrelated resource constraint, the scheduler complains about II not being met.
    • This is not necessarily correlated with a loss of performance (usually it is still performing better) because the II is computed on the unrolled loop (which has therefore a multiplied throughput for each iteration).
    • The scheduler informs you of the possible resources constraints and resolving those will further improve the performance.
    • Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in non-vectorizable method).

Reducing Kernel to Kernel Communication Latency with OpenCL Pipes

The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency. For more information, see Pipe Functions on Version 2.0 of the OpenCL C Specification from Khronos Group.

In the Vitis IDE, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not supported. The depth of a pipe must be specified by using the OpenCL attribute xcl_reqd_pipe_depth in the pipe declaration. For more information, see xcl_reqd_pipe_depth.

As specified in xcl_reqd_pipe_depth, the valid depth values are as follows: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768.

A given pipe can have one and only one producer and consumer in different kernels.
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));

Pipes can be accessed using standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode.

The status of pipes can be queried using OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.

The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.
int read_pipe_block (pipe gentype p, gentype *ptr) 
int write_pipe_block (pipe gentype p, const gentype *ptr) 

The following “dataflow/dataflow_pipes_ocl” from Xilinx Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using blocking read_pipe_block() and write_pipe_block() functions:

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
    __attribute__((xcl_pipeline_loop)) 
    mem_rd: for (int i = 0 ; i < size ; i++)
    {
        //blocking Write command to pipe P0
        write_pipe_block(p0, &input[i]);
    }
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result 
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
    __attribute__((xcl_pipeline_loop))
    execute: for(int i = 0 ; i < size ;  i++)
    {
        int input_data, output_data;
        //blocking read command to Pipe P0
        read_pipe_block(p0, &input_data);
        output_data = input_data + inc;
        //blocking write command to Pipe P1
        write_pipe_block(p1, &output_data);
    }
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
    __attribute__((xcl_pipeline_loop))
    mem_wr: for (int i = 0 ; i < size ; i++)
    {
        //blocking read command to Pipe P1
        read_pipe_block(p1, &output[i]);
    }
}

The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.

Figure 11: Device Traceline View

Optimizing Computational Parallelism

By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. On the other hand, the OpenCL API does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.

Coding Data Parallelism

To leverage computational parallelism during the implementation of an algorithm on the FPGA, it should be mentioned that the synthesis tool will need to be able to recognize computational parallelism from the source code first. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism as in some cases the Vitis technology might not be able to apply the desired transformation due to the structure of the source code.

It is quite common, that some computational parallelism might not be reflected in the source code to begin with. In this case, it will need to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in Using Full AXI Data Width. A 512-bit interface can be created using OpenCL vector data types such as int16 or C/C++ arbitrary precision data type ap_int<512>.

Note: These vector types can also be used as a powerful way to model data parallelism within a kernel, with up to 16 data paths operating in parallel in case of int16. Refer to the Median Filter Example in the vision category at Xilinx Getting Started Example on GitHub for the recommended method to use vectors.

Loop Parallelism

Loops are the basic C/C++/OpenCL API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:

  for(int i = 0; i<255; i++) {
    out[i] = in[i]+in[i+1];
  }
  out[255] = in[255];

This code iterates over an array of values and adds consecutive values, except the last value. If this loop is implemented as written, each loop iteration requires two cycles for implementation, which results in a total of 510 cycles for implementation. This can be analyzed in detail through the Schedule Viewer in the HLS Project:

Figure 12: Implemented Loop Structure in Schedule Viewer

This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:

Figure 13: Synthesis Results Performance Estimates

The key numbers here are the latency numbers and total LUT usage. For example, depending on the configuration, you could get latency of 511 and total LUT usage of 47. As a result, these values can vary based on the implementation choices. While this implementation will require very little area, it results in significant latency.

Unrolling Loops

Unrolling a loop enables the full parallelism of the model to be used. To perform this, mark a loop to be unrolled and the tool will create the implementation with the most parallelism possible. To mark a loop to unroll, an OpenCL loop can be marked with the UNROLL attribute:

__attribute__((opencl_unroll_hint))

Or a C/C++ loop can use the unroll pragma:

#pragma HLS UNROLL

For more information, see Loop Unrolling.

When applied to this specific example, the Schedule Viewer in the HLS Project will be:

Figure 14: Schedule Viewer

The following figure shows the estimated performance:

Figure 15: Performance Estimates

Therefore, the total latency was considerably improved to be 127 cycles and as expected the computational hardware was increased to 4845 LUTs, to perform the same computation in parallel.

However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface is used for the variable out. The Vitis core development kit uses dual port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus to see a fully parallel implementation, you must specify that the variable out should be kept in registers as in this example:

#pragma HLS array_partition variable= out complete dim= 0

For more information, see pragma HLS array_partition.

The results of this transformation can be observed in the following Schedule Viewer:

Figure 16: Transformation Results in Schedule Viewer

The associated estimates are:

Figure 17: Transformation Results Performance Estimates

Accordingly, this code can be implemented as a combinatorial function requiring only a fraction of the cycle to complete.

Pipelining Loops

Pipelining loops allow you to overlap iterations of a loop in time, as discussed in Loop Pipelining. Allowing loop iterations to operate concurrently is often a good approach, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are not unrolled.

Pipelining is enabled in C/C++ through the pragma HLS pipeline:

#pragma HLS PIPELINE

While the OpenCL API uses the xcl_pipeline_loop attribute:

__attribute__((xcl_pipeline_loop))
Note: The OpenCL API has an additional method of specifying loop pipelining, see xcl_pipeline_workitems. The reason is the work item loops are not explicitly stated and pipelining these loops require this attribute:
__attribute__((xcl_pipeline_workitems))

In this example, the Schedule Viewer in the HLS Project produces the following information:

Figure 18: Pipelining Loops in Schedule Viewer

With the overall estimates being:

Figure 19: Performance Estimates

Because each iteration of a loop consumes only two cycles of latency, there can only be a single iteration overlap. This enables the total latency to be cut into half compared to the original, resulting in 257 cycles of total latency. However, this reduction in latency was achieved using fewer resources when compared to unrolling.

In most cases, loop pipelining by itself can improve overall performance. Yet, the effectiveness of the pipelining depends on the structure of the loop. Some common limitations are:

  • Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval).
  • Loop-carry dependencies, such as those created by variable conditions computed in one iteration affecting the next, might increase the II of the pipeline.

These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to remove these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array, or breaking the dependencies all together.

Task Parallelism

Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.

See the following example:

void run (ap_uint<16> in[1024],
	  ap_uint<16> out[1024]
	  ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}

When this code is executed, the function processA and processB are executed sequentially 128 times in a row. Given the combined latency for processA and processB, the loop is set to 278 and the total latency can be estimated as:

Figure 20: Performance Estimates

The extra cycle is due to loop setup and can be observed in the Schedule Viewer.

For C/C++ code, task parallelism is performed by adding the DATAFLOW pragma into the for-loop:

#pragma HLS DATAFLOW

For OpenCL API code, add the attribute before the for-loop:

__attribute__ ((xcl_dataflow))

Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.

As illustrated by the estimates in the HLS report, applying the transformation will considerably improve the overall performance effectively using a double (ping pong) buffer scheme between the tasks:

Figure 21: Performances Estimates

The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:

(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles 

Using task parallelism is a powerful method to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Finally, the Vitis core development kit provides the Detailed Kernel Trace, which illustrates concurrent execution.

Figure 22: Detailed Kernel Trace

For this Detailed Kernel Trace, the tool displays the start of the dataflow loop, as shown in the previous figure. It illustrates how processA is starting up right away with the beginning of the loop, while processB waits until the completion of the processA before it can start up its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration, etc.

A more abstract representation of this information is presented in Application Timeline for the host and device activity.

Optimizing Compute Units

Data Width

One, if not the most important aspect for performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL API code might only use large data types such as integers even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types such as 10-/12-bit might often suffice. It is beneficial to check the size of basic operations in the HLS Synthesis report during optimization.

In general, when the Vitis core development kit maps an algorithm onto the FPGA, more processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the Vitis core development kit generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.

In the following figure, the basic operations and their bit-width are reported.

Figure 23: Operations Utilization Estimates

Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.

Fixed Point Arithmetic

Some applications use floating point computation only because they are optimized for other hardware architecture. Using fixed point arithmetic for applications like deep learning can save the power efficiency and area significantly while keeping the same level of accuracy.

Note: Xilinx recommends exploring fixed point arithmetic for your application before committing to using floating point operations.

Macro Operations

It is sometimes advantageous to think about larger computational elements. The tool will operate on the source code independently of the remaining source code, effectively mapping the algorithm without consideration of surrounding operations onto the FPGA. When applied, the Vitis technology keeps operational boundaries, effectively creating macro operations for specific code. This uses the following principles:

  • Operational locality to the mapping process
  • Reduction in complexity for the heuristics

This might create vastly different results when applied. In C/C++, macro operations are created with the help of #pragma HLS inline off. While in the OpenCL API, the same kind of macro operation can be generated by not specifying the following attribute when defining a function:

__attribute__((always_inline))

For more information, see pragma HLS inline.

Using Optimized Libraries

The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the native_ prefix). The accuracy and in some cases the input ranges of these functions is implementation-defined. In the Vitis technology, these native_ built-in functions use the equivalent functions in the Vivado HLS tool Math library, which are already optimized for Xilinx FPGAs in terms of area and performance.

Note: Xilinx recommends using native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirement.

Optimizing Memory Architecture

Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:


void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];

  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {

      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output
}

This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates:

Figure 24: Performance Estimates

The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This is observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.

Figure 25: Performance Estimates

However, this improvement is largely because of the process using both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:

Figure 26: Schedule Viewer

Two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result as this completely blocks the access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:

#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2

For more information, see pragma HLS array_partition.

This results in four array reads, all executed on different memories using a single port:

Figure 27: Executed Four Arrays Results

Using a total of 256 * 4 cycles = 1024 cycles for loop 2.

Figure 28: Performance Estimates

Alternatively, the memory can be reshaped into to a single memory with four words in parallel. This is performed through the pragma:

#pragma HLS array_reshape variable=inMem complete dim=2

For more information, see pragma HLS array_reshape.

This results in the same latency as when the array partitioning, but with a single memory using a single port:

Figure 29: Latency Result

Although, either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion making this the preferred solution.

Note: This completes array optimization, in a real design the latency could be further improved by employing loop parallelism (see Loop Parallelism).
void run (ap_uint<16> in[256][4],
	  ap_uint<16> out[256]
	  ) {
  ...

  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS array_reshape variable=inMem complete dim=2
  
  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output

}

Kernel SLR and DDR Memory Assignments

Kernel compute unit (CU) instance and DDR memory resource floorplanning are keys to meeting quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (a kernel instance) to SLRs and mapping CUs to DDR memory resources. When floorplanning, both CU resource usage and DDR memory bandwidth requirements need to be considered.

The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each stack is referred to as a super logic region (SLR) and has a fixed amount of resources and memory including DDR interfaces. Available device SLR resources which can be used for custom logic can be found in the Vitis 2019.2 Software Platform Release Notes, or can be displayed using the platforminfo utility described in platforminfo Utility.

You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, Flip-Flops, BRAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.

Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Ports to Global Memory and Assigning Compute Units to SLRs.

Note: While compute units can be connected to any available DDR memory resource, it is also necessary to account for the bandwidth requirements of the kernels when assigning to SLRs.

After allocating your CUs to SLRs, map any CU master AXI port(s) to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incurs a greater delay than a standard intra-SLR routing.

It might be necessary to cross an SLR region to connect to a DDR resource in a different SLR. However, if both the connectivity.sp and the connectivity.slr directives are explicitly defined, the tools automatically add additional crossing logic to minimize the effect of the SLL delay, and facilitates better timing closure.

Guidelines for Kernels that Access Multiple Memory Banks

The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Because the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.

Figure 30: Kernel and Memory in Same SLR
Note: The image on the left shows a single AXI interface mapped to a single memory bank. The image on the right shows multiple AXI interfaces mapped to the same memory bank.

As shown in the previous figure, when a kernel has a single AXI interface that maps only a single memory bank, the platforminfo utility described in platforminfo Utility lists the SLR that is associated with the memory bank of the kernel; therefore, the SLR where the kernel would be best placed. In this scenario, the design tools might automatically place the kernel in that SLR without need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:

  • If the design contains a large number of kernels accessing the same memory bank.
  • A kernel requires some specialized logic resources that are not available in the SLR of the memory bank.

When a kernel has multiple AXI interfaces and all of the interfaces of the kernel access the same memory bank, it can be treated in a very similar way to the kernel with a single AXI interface, and the kernel should reside in the same SLR as the memory bank that its AXI interfaces are mapping.

Figure 31: Memory Bank in Adjoining SLR
Note: The image on the left shows one SLR crossing is required when the kernel is placed in SLR0. The image on the right shows two SLR crossings are required for kernel to access memory banks.

When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown it the figure above). This minimizes the number of SLR crossings required by this kernel which leaves more SLR crossing resources available for other kernels in your design to reach your memory banks.

When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.

Figure 32: Memory Banks Two SLRs Away
Note: The image on the left shows two SLR crossings are required to access all of the mapped memory banks. The image on the right shows three SLR crossings are required to access all of the mapped memory banks.

As shown in the previous figure, when a platform contains more than two SLRs, it is possible that the kernel might map a memory bank that is not in the immediately adjacent SLR to its most commonly mapped memory bank. When this scenario arises, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs it might be better to place the kernel in an intermediate SLR where it only requires less expensive crossings into the adjacent SLRs.

Exploring Kernel Optimizations Using Vivado HLS

All kernel optimizations using OpenCL or C/C++ can be performed from within the Vitis core development kit. The primary performance optimizations, such as those discussed in this section (pipelining function and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, etc.), are performed by the Vivado HLS tool.

The Vitis core development kit automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the Vitis technology. Using the HLS tool in standalone mode, as discussed in Compiling Kernels Directly in Vivado HLS, enables the following enhancements to the optimization methodology:

  • The ability to focus solely on the kernel optimization because there is no requirement to execute emulation.
  • The skill to create multiple solutions, compare their results, and explore the solution space to find the most optimum design.
  • The competence to use the interactive Analysis Perspective to analyze the design performance.
IMPORTANT: Only the kernel source code is incorporated back into the Vitis core development kit. After exploring the optimization space, ensure that all optimizations are applied to the kernel source code as OpenCL attributes or C/C++ pragmas.

To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select Open HLS Project, as shown in the following figure.

Figure 33: Open HLS Project

Topological Optimization

This section focuses on the topological optimization. It looks at the attributes related to the rough layout and implementation of multiple compute units and their impact on performance.

Multiple Compute Units

Depending on available resources on the target device, multiple compute units of the same kernel (or different kernels) can be created to run in parallel, which improves the system processing time and throughput. For more details, see Creating Multiple Instances of a Kernel.

Using Multiple DDR Banks

Acceleration cards supported in Vitis technology provide one, two, or four DDR banks, and up to 80 GB/s raw DDR bandwidth. For kernels moving large amount of data between the FPGA and the DDR, Xilinx® recommends that you direct the Vitis compiler and runtime library to use multiple DDR banks.

In addition to DDR banks, the host application can access PLRAM to transfer data directly to a kernel. This feature is enabled using the connnectivity.sp option in a configuration file specified with the v++ --config option. Refer to Mapping Kernel Ports to Global Memory for more information on implementing this optimization and Memory Mapped Interfaces on data transfer to the global memory banks.

To take advantage of multiple DDR banks, you need to assign CL memory buffers to different banks in the host code as well as configure the xclbin file to match the bank assignment in v++ command line.

The following block diagram shows the Global Memory Two Banks (C) example in Vitis Examples on GitHub. This example connects the input pointer interface of the kernel to DDR bank 0, and the output pointer interface to DDR bank 1.

Figure 34: Global Memory Two Banks Example

Assigning DDR Bank in Host Code

Bank assignment in host code is supported by Xilinx vendor extension. The following code snippet shows the header file required, as well as assigning input and output buffers to DDR bank 0 and bank 1, respectively:

#include <CL/cl_ext.h>
…
int main(int argc, char** argv) 
{
…
    cl_mem_ext_ptr_t inExt, outExt;  // Declaring two extensions for both buffers
    inExt.flags  = 0|XCL_MEM_TOPOLOGY; // Specify Bank0 Memory for input memory
    outExt.flags = 1|XCL_MEM_TOPOLOGY; // Specify Bank1 Memory for output Memory
    inExt.obj = 0   ; outExt.obj = 0; // Setting Obj and Param to Zero
    inExt.param = 0 ; outExt.param = 0;

    int err;
    //Allocate Buffer in Bank0 of Global Memory for Input Image using Xilinx Extension
    cl_mem buffer_inImage = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX,
            image_size_bytes, &inExt, &err);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device Memory" << std::endl;
        return EXIT_FAILURE;
    }
    //Allocate Buffer in Bank1 of Global Memory for Input Image using Xilinx Extension
    cl_mem buffer_outImage = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX,
            image_size_bytes, &outExt, NULL);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device Memory" << std::endl;
        return EXIT_FAILURE;
    }
…
}

cl_mem_ext_ptr_t is a struct as defined below:

typedef struct{
    unsigned flags;
    void *obj;
    void *param;
  } cl_mem_ext_ptr_t;
  • Valid values for flags are:
    • XCL_MEM_DDR_BANK0
    • XCL_MEM_DDR_BANK1
    • XCL_MEM_DDR_BANK2
    • XCL_MEM_DDR_BANK3
    • <id> | XCL_MEM_TOPOLOGY
      Note: The <id> is determined by looking at the Memory Configuration section in the xxx.xclbin.info file generated next to the xxx.xclbin file. In the xxx.xclbin.info file, the global memory (DDR, PLRAM, etc.) is listed with an index representing the <id>.
  • obj is the pointer to the associated host memory allocated for the CL memory buffer only if CL_MEM_USE_HOST_PTR flag is passed to clCreateBuffer API, otherwise set it to NULL.
  • param is reserved for future use. Always assign it to 0 or NULL.

Assigning Global Memory for Kernel Code

Creating Multiple AXI Interfaces

OpenCL kernels, C/C++ kernels, and RTL kernels have different methods for assigning function parameters to AXI interfaces.

  • For OpenCL kernels, the --max_memory_ports option is required to generate one AXI4 interface for each global pointer on the kernel argument. The AXI4 interface name is based on the order of the global pointers on the argument list.

    The following code is taken from the example gmem_2banks_ocl in the kernel_to_gmem category from the Vitis Getting Started Examples on GitHub:

    __kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
    void apply_watermark(__global const TYPE * __restrict input, 
    __global TYPE * __restrict output, int width, int height) {
     ...
    }

    In this example, the first global pointer input is assigned an AXI4 name M_AXI_GMEM0, and the second global pointer output is assigned a name M_AXI_GMEM1.

  • For C/C++ kernels, multiple AXI4 interfaces are generated by specifying different “bundle” names in the HLS INTERFACE pragma for different global pointers. Refer to Kernel Interfaces and Memory Banks for more information.

    The following is a code snippet from the gmem_2banks_c example that assigns the input pointer to the bundle gmem0 and the output pointer to the bundle gmem1. The bundle name can be any valid C string, and the AXI4 interface name generated will be M_AXI_<bundle_name>. For this example, the input pointer will have AXI4 interface name as M_AXI_gmem0, and the output pointer will have M_AXI_gmem1. Refer to pragma HLS interface for more information.
    #pragma HLS INTERFACE m_axi port=input  offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
    
  • For RTL kernels, the port names are generated during the import process by the RTL kernel wizard. The default names proposed by the RTL kernel wizard are m00_axi and m01_axi. If not changed, these names have to be used when assigning a DDR bank through the connectivity.sp option in the configuration file. Refer to Mapping Kernel Ports to Global Memory for more information.
Assigning AXI Interfaces to DDR Banks
IMPORTANT: When using more than one DDR interface, Xilinx requires you to specify the DDR memory bank for each kernel/CU, and specify the SLR to place the kernel into. For more information, see Mapping Kernel Ports to Global Memory and Assigning Compute Units to SLRs.

The following is an example configuration file that specifies the connectivity.sp option, and the v++ command line that connects the input pointer (M_AXI_GMEM0) to DDR bank 0 and the output pointer (M_AXI_GMEM1) to DDR bank 1:

The config_sp.txt file:

[connectivity] 
sp=apply_watermark_1.m_axi_gmem0:DDR[0] 
sp=apply_watermark_1.m_axi_gmem1:DDR[1]

The v++ command line:

v++ apply_watermark --config config_sp.txt

You can use the Device Hardware Transaction view to observe the actual DDR Bank communication, and to analyze DDR usage.

Figure 35: Device Hardware Transaction View Transactions on DDR Bank
Assigning AXI Interfaces to PLRAM

Some platforms support PLRAMs. In these cases, use the same --connectivity.sp option as described in Assigning AXI Interfaces to DDR Banks, but use the name, PLRAM[id]. Valid names supported by specific platforms can be found in the Memory Configuration section of the xclibin.info file generated alongside xclbin.

Assigning Kernels to SLR Regions

Assigning ports to global memory banks requires the kernel to be physically routed on the FPGA, to connect to the assigned DDR, HBM, or block RAM. Currently, large FPGAs use stacked silicon devices with several super logic regions (SLRs). By default, the Vitis core development kit will place the compute units in the same SLR as the target platform. This is not always desirable, especially when the kernel connects to specific memory banks in a different SLR region. In this case, you will want to manually assign the kernel instance, or CU into the same SLR as the global memory. For more information, see Mapping Kernel Ports to Global Memory.

You can assign the CU instance to an SLR using the connectivity.slr option described in Assigning Compute Units to SLRs.

TIP: To better understand the platform attributes, such as the number of DDRs and SLR regions, you can detail the target platform using the platforminfo command described in platforminfo Utility.