Increasing System Parallelism and Concurrency

Increasing the level of concurrent execution is a standard way to increase overall system performance, and increasing the level of parallel execution is a standard way to increase concurrency. Programmable logic is well-suited to implementing architectures with application-specific accelerators that run concurrently, especially when they communicate through flow-controlled streams that synchronize data producers and consumers.

In the SDSoC environment, you influence the macro-architecture parallelism at the function and data mover level, and the micro-architecture parallelism within hardware accelerators. By understanding how the sdscc system compiler infers system connectivity and data movers, you can structure application code and apply pragmas as needed to control hardware connectivity between accelerators and software, data mover selection, number of accelerator instances for a given hardware function, and task level software control. You can control the micro-architecture parallelism, concurrency, and throughput for hardware functions within Vivado HLS or within the IPs you incorporate as C-callable/linkable libraries.

At the system level, the sdscc compiler chains together hardware functions when the data flow between them does not require transferring arguments out of programmable logic and back to system memory. For example, consider the code in the following figure, where mmult and madd functions have been selected for hardware.

Figure: Hardware/Software Connectivity with Direct Connection



Because the intermediate array variable tmp1 is used only to pass data between the two hardware functions, the sdscc system compiler chains the two functions together in hardware with a direct connection between them.
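The figure's listing is not reproduced here, so the following is a minimal software sketch of the code structure it describes. The matrix size, identifiers, and function bodies are illustrative assumptions, not the original example:

```cpp
#include <cstdio>

#define N 4  // hypothetical matrix dimension; the actual example may differ

// Candidate hardware functions: matrix multiply and matrix add.
void mmult(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

void madd(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}

void mmult_madd(float A[N][N], float B[N][N], float C[N][N], float D[N][N]) {
    float tmp1[N][N];
    // tmp1 is produced by mmult and consumed only by madd, so sdscc can
    // connect the two accelerators directly in programmable logic with no
    // round trip through system memory.
    mmult(A, B, tmp1);
    madd(tmp1, C, D);
}
```

The key property is that no software code reads or writes tmp1 between the two calls, which is what allows the direct connection.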

It is instructive to consider a timeline for the calls to hardware, as shown in the following figure.

Figure: Timeline for mmult/madd Function Calls



The compiled program preserves the original program semantics, but instead of the standard ARM procedure calling sequence, each hardware function call is broken into multiple phases involving setup, execution, and cleanup, both for the data movers (DM) and for the accelerators. The CPU sets up each hardware function (that is, the underlying IP control interface) and the data transfers for the function call using non-blocking APIs, and then waits for all calls and transfers to complete. In the example shown in the diagram, the mmult and madd functions run concurrently whenever their inputs become available. The ensemble of function calls is orchestrated in the compiled program by control code that sdscc generates automatically according to the program, data mover, and accelerator structure.
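The "launch without blocking, then wait for everything" pattern that the generated control code follows has a familiar software analogy. The sketch below uses std::async purely as an illustration of that pattern; the actual sdscc-generated code uses its own runtime APIs, not the C++ standard library:

```cpp
#include <future>

// Software analogy only: each "hardware call" is launched without
// blocking the caller, and the CPU then waits for all results.
int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    // Non-blocking launches, analogous to setting up the accelerators
    // and data movers for each call.
    std::future<int> fa = std::async(std::launch::async, square, a);
    std::future<int> fb = std::async(std::launch::async, square, b);
    // Wait for all calls and transfers to complete before using results.
    return fa.get() + fb.get();
}
```

As in the timeline, both launched tasks can run concurrently, and the caller only blocks at the final synchronization point.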

In general, the sdscc compiler cannot determine the side-effects of function calls in your application code (for example, it may have no access to the source code of functions in linked libraries), so any access to a variable occurring lexically between hardware function calls forces the compiler to transfer the data back to system memory. For example, an injudicious change as simple as uncommenting a debug print statement in the "wrong place", as shown in the following figure, can result in a significantly different data transfer graph and, consequently, an entirely different generated system with different application performance.

Figure: Hardware/Software Connectivity with Broken Direct Connection



A program can invoke a single hardware function from multiple call sites. In this case, the sdscc compiler behaves as follows: if any of the calls results in "direct connection" data flow, sdscc creates one instance of the hardware function to service all such direct connections, and another instance to service the remaining calls, which transfer data between system memory ("software") and programmable logic.

Structuring your application code with "direct connection" data flow between hardware functions is one of the best ways to achieve high performance in programmable logic. You can create deep pipelines of accelerators connected with data streams, increasing the opportunity for concurrent execution.

There is another way in which you can increase parallelism and concurrency using the sdscc compiler. You can direct the compiler to create multiple instances of a hardware function by inserting the following pragma immediately preceding a call to the function.
#pragma SDS resource(<id>) // <id> a non-negative integer

This pragma creates a hardware instance that is referenced by <id>.

A simple code snippet that creates two instances of a hardware function mmult is as follows.
{
#pragma SDS resource(1)
    mmult(A, B, C); // instance 1
#pragma SDS resource(2)
    mmult(D, E, F); // instance 2
}

The async mechanism (#pragma SDS async(<id>) paired with #pragma SDS wait(<id>)) gives the programmer the ability to manage "hardware threads" explicitly to achieve very high levels of parallelism and concurrency, but, like any explicit multi-threaded programming model, it requires careful attention to synchronization details to avoid non-deterministic behavior or deadlocks.
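Combining the async and resource pragmas, two instances of a hardware function can be launched without blocking and synchronized explicitly. The sketch below illustrates the pattern; the matrix size, function body, and data are illustrative assumptions, and in a pure software build the pragmas are simply ignored and the calls execute synchronously:

```cpp
#define N 4  // hypothetical matrix dimension

void mmult(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

void two_mmults(float A[N][N], float B[N][N], float C[N][N],
                float D[N][N], float E[N][N], float F[N][N]) {
#pragma SDS async(1)
#pragma SDS resource(1)
    mmult(A, B, C);   // instance 1 starts; the call returns immediately
#pragma SDS async(2)
#pragma SDS resource(2)
    mmult(D, E, F);   // instance 2 runs concurrently with instance 1
    // The CPU could do other useful work here while both instances run.
#pragma SDS wait(1)   // block until instance 1 has produced C
#pragma SDS wait(2)   // block until instance 2 has produced F
}
```

Note that reading C or F before the corresponding wait pragma would be exactly the kind of synchronization error this model makes possible.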