Optimizing Memory Architecture

Memory architecture is a key aspect of any implementation. Because memory access bandwidth is limited, it can heavily impact the overall performance. Consider the following example:

void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];

  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output
}

This code adds the four values associated with the inner dimension of the two-dimensional input array. If implemented without any additional modifications, it results in the following estimates:

The overall latency of 4608 cycles for Loop 2 is due to 256 iterations of 18 cycles each (16 cycles spent in the inner loop, plus one cycle to reset sum and one to write the output). This can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when the inner loop is unrolled.
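Unrolling only requires the UNROLL pragma inside the inner loop body; a minimal sketch of the modified loop nest (identical to the loop in the complete code at the end of this section) is:

  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }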

However, this improvement is largely due to the fact that this process uses both ports of a dual-port memory. This can be seen in the Schedule Viewer in the HLS Project:

As you can see, two read operations are performed per cycle to access all the values from the memory and calculate the sum. This is often an undesirable result, as it completely blocks access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2 

This results in four array reads, all executed on different memories using a single port:

Loop 2 now takes a total of 256 * 4 cycles = 1024 cycles.
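Like the reshape pragma in the complete code at the end of this section, the ARRAY_PARTITION pragma is placed inside the function, directly after the declaration of the local array it applies to; a minimal sketch of that placement is:

  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS ARRAY_PARTITION variable=inMem complete dim=2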

Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
This results in the same latency as with array partitioning, but with a single memory using a single port:
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making this the preferred solution. Note that this completes the array optimization; in a real design, the latency could be further improved by exploiting loop parallelism (see the Loop Parallelism section). The complete code with the reshape pragma and the unrolled inner loop is shown below:
void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...

  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS array_reshape variable=inMem complete dim=2
  
  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output

}