This lab demonstrates the performance advantages of accelerating beamforming calculations with the Xilinx® Vitis™ unified software platform. Vitis technology lets you apply just enough parallelism to meet your performance requirements, so that resource usage is tailored to the design. The lab also shows the importance of controlling the dataflow at the interface to the accelerated module: reshaping 2D arrays is key to optimizing the flow of data into and out of the module.
After completing this lab, you will be able to:
· Create a Vitis project using the GUI flow.
· Use Software Emulation mode to validate the functional correctness of the Beamformer example.
· Adjust the HLS pragmas/directives to do ‘what if’ analysis on parallelism and interface data flow.
· Use Hardware Emulation mode to verify the Kernel performance and resources predicted by HLS.
This lab will implement the Digital Beamforming calculations circled in the diagram below.
The complex sensor data (RX_I and RX_Q) is delivered to the Beamformer module as a 2D array [SAMPLES][CHANNELS]. Similarly, the adaptive weights (W_I and W_Q) are stored as a 2D array [BEAMS][CHANNELS]. Output beam data (beams_i and beams_q) is computed for each sample and stored as [BEAMS][SAMPLES].
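The array layout above can be illustrated with a plain C++ reference model. This is a minimal sketch, not the lab's kernel: it assumes the beam output is a complex multiply-accumulate of sensor data and weights across channels (without conjugating the weights), uses `float` data, and picks small hypothetical sizes. The name `beamform_reference` and the `BeamOutput` struct are introduced here for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sizes for illustration; the lab's 'reduce' and 'full'
// parameter sets are defined in the lab sources, not here.
constexpr std::size_t SAMPLES = 4;
constexpr std::size_t CHANNELS = 3;
constexpr std::size_t BEAMS = 2;

using Matrix = std::vector<std::vector<float>>;

struct BeamOutput {
    Matrix beams_i;  // in-phase part, indexed [BEAMS][SAMPLES]
    Matrix beams_q;  // quadrature part, indexed [BEAMS][SAMPLES]
};

// Assumed calculation: for every sample s and beam b, sum over channels c of
// (rx_i + j*rx_q)[s][c] * (w_i + j*w_q)[b][c].
BeamOutput beamform_reference(const Matrix& rx_i, const Matrix& rx_q,
                              const Matrix& w_i, const Matrix& w_q) {
    assert(rx_i.size() == SAMPLES && rx_q.size() == SAMPLES);
    assert(w_i.size() == BEAMS && w_q.size() == BEAMS);

    BeamOutput out{Matrix(BEAMS, std::vector<float>(SAMPLES, 0.0f)),
                   Matrix(BEAMS, std::vector<float>(SAMPLES, 0.0f))};

    for (std::size_t s = 0; s < SAMPLES; ++s) {
        for (std::size_t b = 0; b < BEAMS; ++b) {
            float acc_i = 0.0f;
            float acc_q = 0.0f;
            for (std::size_t c = 0; c < CHANNELS; ++c) {
                // Real and imaginary parts of one complex product.
                acc_i += rx_i[s][c] * w_i[b][c] - rx_q[s][c] * w_q[b][c];
                acc_q += rx_i[s][c] * w_q[b][c] + rx_q[s][c] * w_i[b][c];
            }
            out.beams_i[b][s] = acc_i;
            out.beams_q[b][s] = acc_q;
        }
    }
    return out;
}
```

A model like this is mainly useful as a golden reference: it makes the index order of each array explicit before any pragmas are applied.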
This beamformer example uses two sets of parameter settings for CHANNELS, BEAMS, and SAMPLES: a ‘reduce’ set and a ‘full’ set. The ‘reduce’ set, shown in Figure 1-3, simplifies software debug while maintaining the architectural data flow.
The figure below shows the ‘full’ set that is specified for the Radar example in WP452.
The number of SAMPLES represents the Pulse Repetition Interval (PRI). The Beamforming performance requirements are driven by the PRI. In this example, we need to achieve a PRI < 200 µs. See white paper WP452 for more background and details.
The overall application structure is represented in this block diagram with all the Beamformer calculations done by the Kernel C++ code in ‘beamformer()’.
The Host C++ code generates the RX (sensor) and W (weights) data and then self-checks the resulting B (beamforming) data returned from the Kernel. The Kernel code is first simulated in Software Emulation (Step 3) and is then targeted to run on a Xilinx Alveo U200 card in the Hardware Emulation phase (Step 5).
1-1. From a Linux prompt, type vitis to open the tool.
1-2. Select Workspace, then Launch.
1-3. Create new Application Project
1-4. Fill in Project name and select Platform
1-5. Select Empty Application and click Finish
1-6. Import the Host (host.cpp) and Kernel (beamformer.cpp) code into the project.
1-7. From Application Project Settings, select Add Hardware Function and click OK.
1-8. Completed Project setup should look like this
This section diagrams the dataflow between the Host and the Kernel. It spotlights the API commands in the Host code that control the data movement for steps 4-7 of the execution model shown here.
2-1. Allocate the buffers in Global Memory
2-2. Write the RX and Weights data to global memory.
2-3. Launch the Kernel accelerator
2-4. Copy results from Global memory to Host local memory
This section will show the steps used to functionally validate the Host and Kernel code using the Vitis software-emulation flow.
Remember that the Host code generates Weight (buffer_W) and Receiver (buffer_RX) stimulus and will verify the correctness of Beam (buffer_B) return data.
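The stimulus-and-check pattern described above can be sketched in plain C++. This is only an illustration of the idea, not the lab's host.cpp: the helper names `make_stimulus` and `count_mismatches`, the ramp-shaped stimulus, and the tolerance-based comparison are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stimulus generator: a repeating ramp in [-1, 1).
// Deterministic data makes a failing run easy to reproduce.
std::vector<float> make_stimulus(std::size_t count, std::size_t period) {
    assert(period > 0);
    std::vector<float> data(count);
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = 2.0f * static_cast<float>(i % period) /
                      static_cast<float>(period) - 1.0f;
    }
    return data;
}

// Counts the elements where the kernel result deviates from the golden
// (software reference) value by more than `tolerance`.
std::size_t count_mismatches(const std::vector<float>& kernel_result,
                             const std::vector<float>& golden,
                             float tolerance) {
    assert(kernel_result.size() == golden.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < golden.size(); ++i) {
        if (std::fabs(kernel_result[i] - golden[i]) > tolerance) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In a host program of this shape, a mismatch count of zero would be the pass condition. Comparing with a tolerance rather than exact equality is a design choice that allows for differences in floating-point operation order between the host model and the kernel.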
3-1. From the Vitis project view, select the build configuration ‘Emulation-SW’ and then click ‘Build’.
3-2. Clicking ‘Run’ will kick off the Emulation-SW simulation to verify functionality.
3-3. The validation is successful if you see the following in the Console.
3-4. If you have experimented with a code change and the simulation no longer passes, then enter the Debug Perspective and use normal software breakpoint, single-step, and variable display capabilities to determine the issue.
This section will use HLS to tune the performance of the beamformer kernel using pragmas. The most common Pragmas used are PIPELINE and ARRAY_PARTITION. Here is a review of these two pragmas.
PIPELINE: Instructs HLS to process variables continuously rather than waiting for a loop or function to complete. Here is an example of a loop with default processing versus PIPELINE.
In the example, you can see that PIPELINE allows a new read on each successive clock. This is referred to as an Initiation Interval of 1 (II=1). It is the best-case performance for data flow and is what we will strive for in the beamformer example.
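As a minimal sketch of how the pragma is written in source code, the loop below requests II=1 with `#pragma HLS PIPELINE`. The function names, the trip count `N`, and the integer data type are hypothetical. The two helper functions give only a rough cycle estimate (HLS reports may include a few extra clocks of overhead): a pipelined loop takes about one iteration latency plus II clocks for each additional iteration, instead of one full iteration latency per iteration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 8;  // hypothetical trip count, for illustration only

// Sums N values. The pragma asks HLS to start a new loop iteration every
// clock (II=1). A regular C++ compiler ignores the pragma, so the function
// computes the same result in software.
int accumulate(const int data[N]) {
    int sum = 0;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        sum += data[i];
    }
    return sum;
}

// Rough estimate without pipelining: each iteration finishes before the
// next one starts.
std::size_t sequential_cycles(std::size_t trip_count,
                              std::size_t iteration_latency) {
    return trip_count * iteration_latency;
}

// Rough estimate with pipelining: the first iteration takes its full
// latency, and each later iteration starts `ii` clocks after the previous.
std::size_t pipelined_cycles(std::size_t trip_count,
                             std::size_t iteration_latency, std::size_t ii) {
    assert(trip_count > 0 && ii > 0);
    return iteration_latency + ii * (trip_count - 1);
}
```

For example, with 8 iterations of 3 clocks each, the estimate drops from 24 clocks to 10 clocks at II=1.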
ARRAY_PARTITION: Can reshape arrays for wider data path access. Arrays can be partitioned on any dimension. This is a way to remove memory bottlenecks to get better II values.
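The sketch below shows one way ARRAY_PARTITION might be combined with PIPELINE. It is not taken from the lab's beamformer.cpp: the function name `weighted_sums`, the local copies `w_local` and `s_local`, the sizes, and the integer data type are assumptions for illustration. Partitioning the channel dimension completely lets every channel be read in the same clock, so the inner loop (which HLS unrolls when the enclosing loop is pipelined) is not limited by memory ports. Check the exact pragma syntax against the HLS version you are using.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t BEAMS = 2;     // hypothetical size
constexpr std::size_t CHANNELS = 4;  // hypothetical size

// Computes one weighted sum per beam for a single sample.
void weighted_sums(const int sample[CHANNELS],
                   const int weights[BEAMS][CHANNELS], int out[BEAMS]) {
    assert(sample != nullptr && weights != nullptr && out != nullptr);

    // Local copies; the pragmas split the channel dimension into individual
    // registers so all channels can be accessed in parallel.
    int w_local[BEAMS][CHANNELS];
#pragma HLS ARRAY_PARTITION variable=w_local complete dim=2
    int s_local[CHANNELS];
#pragma HLS ARRAY_PARTITION variable=s_local complete dim=1

    for (std::size_t c = 0; c < CHANNELS; ++c) {
        s_local[c] = sample[c];
        for (std::size_t b = 0; b < BEAMS; ++b) {
            w_local[b][c] = weights[b][c];
        }
    }

    for (std::size_t b = 0; b < BEAMS; ++b) {
#pragma HLS PIPELINE II=1
        int acc = 0;
        for (std::size_t c = 0; c < CHANNELS; ++c) {
            acc += s_local[c] * w_local[b][c];
        }
        out[b] = acc;
    }
}
```

Without the partitioning, an array mapped to block RAM typically offers only one or two accesses per clock, which is the kind of memory bottleneck that keeps II above 1.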
4-1. Create an HLS project with the Source and Test Bench files provided. Individual steps for HLS flow are not included in this lab. For a tutorial on HLS basics see UG871.
4-2. Initially, the ‘beamformer.cpp’ file has the CHANNELS, SAMPLES and BEAMS parameters set to the reduce set. The provided testbench works with these parameter settings. For now, leave the parameter settings at these values.
4-3. Experiment with applying the PIPELINE and ARRAY_PARTITION pragmas to the different loops and arrays shown here. One recommended sequence is to try the different solutions shown here. Each solution includes the pragmas from the previous solution.
4-4. Here are the Latency and Resource results for solutions 1-6 shown above.
4-5. Soln06_partition_A_B has the best results, with Latency = 80 clocks. The loop details are shown here. Note that II=1 is achieved for all the loops. This means that every loop can accept new data on every clock.
4-6. Now change the parameter settings to the ‘full’ values.
4-7. Use PIPELINE and ARRAY_PARTITION pragmas again to create a solution with the lowest latency and II=1. Here is an example solution sequence to follow. All solutions include the pragmas from the previous solution.
Here are the solution comparisons
Here are the loop details for the best solution, soln04_partition_B. The target clock rate is 333 MHz.
The performance takeaway for soln04_partition_B is:
Total latency = 55084 clocks
This will be used to determine whether we can meet our PRI requirement of < 200 µs.
4-8. Now make sure that all the pragmas used in soln04_partition_B are added to the ‘beamformer.cpp’ file. You can manually edit the file or add using HLS by double-clicking on the pragma and selecting ‘Source File’ instead of ‘Directive File’.
The modified ‘beamformer.cpp’ file with all the pragmas will be used in the HW-Emulation step.
5-1. Go back to the Vitis project and replace the ‘beamformer.cpp’ file with the one created in Step 4, which contains all the performance pragmas.
5-2. Now change to Emulation-HW and click Run
5-3. When Emulation-HW completes (might be as long as 2 hrs), click on ‘Link Summary’ to open the results.
5-4. The Vitis Analyzer will open. From here select the Kernel Estimate and view the latency clock and time results.
- Latency = 55086 clocks @ 300 MHz
- Latency absolute time = 184 µs (this meets our 200 µs requirement).
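The conversion from clocks to absolute time is a simple division: cycles divided by the clock frequency in MHz gives microseconds. The helper names below are hypothetical and are shown only to make the arithmetic explicit.

```cpp
#include <cassert>

// Converts a latency in clock cycles to microseconds.
// cycles / MHz = cycles / (1e6 cycles per second) = microseconds.
double latency_us(unsigned long cycles, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return static_cast<double>(cycles) / clock_mhz;
}

// True when the latency fits inside the required PRI (in microseconds).
bool meets_pri(unsigned long cycles, double clock_mhz, double pri_us) {
    return latency_us(cycles, clock_mhz) < pri_us;
}
```

With the numbers above, `latency_us(55086, 300.0)` is about 183.6 µs, which rounds to the reported 184 µs and is below the 200 µs requirement.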
For the Beamformer example in WP452, we were able to use Vitis to accelerate the beamforming matrix calculations and achieve the desired PRI spec of < 200 µs. The Vitis project targets an Alveo U200 board.
HLS was used to determine the pragmas that give the best latency results (achieving II=1 is key). Using HLS gives quick ‘what if’ results in just 1-2 minutes, whereas going through the Emulation-HW flow takes 1-2 hours.
Download source files here
Brian Stephens is a Senior FAE for Xilinx. He has worked with various A&D, Satellite and Networking companies during his 24-year career as an applications engineer in the Mid-Atlantic territory. His specialties include partial reconfiguration design, advanced timing closure techniques, DSP-Sygen implementations, and Vivado-HLS optimization.
Brian holds a B.S.E.E. degree from the University of Central Florida.