Deploying and Running the Model

Programming with VART

Vitis AI provides a C++ DpuRunner class with the following interfaces:

std::pair<uint32_t, int> execute_async(
    const std::vector<TensorBuffer*>& input,
    const std::vector<TensorBuffer*>& output);
Note: For historical reasons, this function is actually a blocking function, not an asynchronous non-blocking function.
  1. Submit input tensors for execution and output tensors to store results. The host pointer is passed using the TensorBuffer object. This function returns a job ID and the status of the function call.
    int wait(int jobid, int timeout);

    The job ID returned by execute_async is passed to wait() to block until the job is complete and the results are ready.

    TensorFormat get_tensor_format()
  2. Query the DpuRunner for the tensor format it expects.

    Returns DpuRunner::TensorFormat::NCHW or DpuRunner::TensorFormat::NHWC

    std::vector<Tensor*> get_input_tensors()
    std::vector<Tensor*> get_output_tensors()
  3. Query the DpuRunner for the shape and name of the input and output tensors it expects for its loaded Vitis AI model.
  4. To create a DpuRunner object, call the following:
    create_runner(const xir::Subgraph* subgraph, const std::string& mode = "")

    It returns the following:

    std::unique_ptr<Runner>

The input to create_runner is an XIR subgraph generated by the Vitis AI compiler.

TIP: To enable multi-threading with VART, create a runner for each thread.

C++ Example

// get dpu subgraph by parsing model file
auto runner = vart::Runner::create_runner(subgraph, "run");
// populate input/output tensors
auto job_data = runner->execute_async(inputs, outputs);
runner->wait(job_data.first, -1);
// process outputs

Vitis AI also provides a Python ctypes Runner class that mirrors the C++ class, using the C DpuRunner implementation:

class Runner:
    def __init__(self, path)
    def get_input_tensors(self)
    def get_output_tensors(self)
    def get_tensor_format(self)
    def execute_async(self, inputs, outputs)
    # differences from the C++ API:
    # 1. inputs and outputs are numpy arrays with C memory layout;
    #    the numpy arrays should be reused, as their internal buffer
    #    pointers are passed to the runtime. These buffer pointers
    #    may be memory-mapped to the FPGA DDR for performance.
    # 2. returns job_id, throws exception on error
    def wait(self, job_id)

Python Example

dpu_runner = runner.Runner(subgraph,"run")
# populate input/output tensors
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput
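
In both examples, the comment about populating the input/output tensors hides most of the per-model work. The following is a minimal hypothetical sketch of how the Python buffers could be built from the runner's tensor metadata; it assumes each Tensor object exposes a dims attribute with the batch dimension first, that the tensors use float32 data (check get_input_tensors() for the actual type), and that preprocessed_image is a numpy array already shaped to match a single input tensor.

import numpy as np

input_tensors = dpu_runner.get_input_tensors()
output_tensors = dpu_runner.get_output_tensors()

# Allocate one C-contiguous buffer per tensor and reuse it across calls,
# because the runtime holds on to the underlying buffer pointers.
fpgaInput = [np.empty(tuple(t.dims), dtype=np.float32, order="C")
             for t in input_tensors]
fpgaOutput = [np.empty(tuple(t.dims), dtype=np.float32, order="C")
              for t in output_tensors]

fpgaInput[0][0, ...] = preprocessed_image   # fill batch slot 0 of the first input

jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# fpgaOutput now holds the inference results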

DPU Debug with VART

This chapter demonstrates how to verify the DPU inference result with VART tools. TensorFlow ResNet50, Caffe ResNet50, and PyTorch ResNet50 networks are used as examples. The following are the four steps for debugging the DPU with VART:

  1. Generate a quantized inference model and reference result
  2. Generate a DPU xmodel
  3. Generate a DPU inference result
  4. Crosscheck the reference result and the DPU inference result

Before you start to debug the DPU result, ensure that you have set up the environment according to the instructions in the Getting Started section.

TensorFlow Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command. (A sketch of the input_fn module referenced by --input_fn is shown after this workflow.)
    The quantized model, quantize_eval_model.pb, is generated in the quantize_model folder.
    vai_q_tensorflow quantize                                    \
        --input_frozen_graph ./float/resnet_v1_50_inference.pb   \
        --input_fn input_fn.calib_input                          \
        --output_dir quantize_model                              \
        --input_nodes input                                      \
        --output_nodes resnet_v1_50/predictions/Reshape_1        \
        --input_shapes ?,224,224,3                               \
        --calib_iter 100
  2. Generate the reference result by running the following command.
    vai_q_tensorflow dump                                          \
        --input_frozen_graph quantize_model/quantize_eval_model.pb \
        --input_fn input_fn.dump_input                             \
        --output_dir=dump_gpu

    The following figure shows part of the reference data.

  3. Generate the DPU xmodel by running the following command.
    vai_c_tensorflow --frozen_pb quantize_model/quantize_eval_model.pb \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json       \
      --output_dir compile_model                                       \
      --net_name resnet50_tf
  4. Generate the DPU inference result by running the following command, which also compares the DPU inference result with the reference data automatically.
    env XLNX_ENABLE_DUMP=1  XLNX_ENABLE_DEBUG_MODE=1 XLNX_GOLDEN_DIR=./dump_gpu/dump_results_0 \
       xilinx_test_dpu_runner ./compile_model/resnet_v1_50_tf.xmodel \
       ./dump_gpu/dump_results_0/input_aquant.bin                    \
        2>result.log 1>&2
    For xilinx_test_dpu_runner, the usage is as follows:
    xilinx_test_dpu_runner  <model_file> <input_data> 

    After the above command runs, the DPU inference result and the comparison log result.log are generated. The DPU inference results are located in the dump folder.

  5. Crosscheck the reference result and the DPU inference result.
    1. View comparison results for all layers.
      grep --color=always 'XLNX_GOLDEN_DIR.*layer_name' result.log
    2. View only the failed layers.
      grep --color=always 'XLNX_GOLDEN_DIR.*fail ! layer_name' result.log

    If the crosscheck fails, use the following methods to determine the layer at which the crosscheck fails.

    1. Check the inputs of the DPU and GPU, and make sure they use the same input data.
    2. Use the xir tool to generate a picture of the network's structure.
      Usage: xir svg <xmodel> <svg>
      Note: In the Vitis AI docker environment, execute the following command to install the required library.
      sudo apt-get install graphviz

      When you open the picture you created, you can see many small boxes around the operators. Each box represents a layer on the DPU. You can use the last operator's name to find its corresponding entry in the GPU dump results. The following figure shows part of the structure.

    3. Submit the files to Xilinx.

      If a certain layer proves to be wrong on the DPU, prepare the quantized model, such as quantize_eval_model.pb, as one package and send it to Xilinx with a detailed description for further analysis.
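
The quantize and dump commands above both reference an input_fn module through --input_fn. The exact preprocessing depends on your model; the following is a minimal hypothetical sketch of what input_fn.py could look like, assuming a placeholder calibration folder named calib_images, resize-only preprocessing, and the input node name input used in the commands above. The required contract is only that each function takes the iteration number and returns a dict mapping input node names to numpy arrays.

# input_fn.py -- hypothetical sketch; adapt the preprocessing to your model.
import os
import cv2
import numpy as np

CALIB_DIR = "calib_images"   # placeholder calibration image folder
BATCH_SIZE = 1
IMAGES = sorted(os.listdir(CALIB_DIR))

def _preprocess(path):
    # Placeholder preprocessing: decode and resize to the 224x224 input shape.
    img = cv2.imread(path)
    return cv2.resize(img, (224, 224)).astype(np.float32)

def calib_input(iter):
    # Return one batch per calibration iteration, keyed by the input node name.
    batch = [_preprocess(os.path.join(CALIB_DIR,
             IMAGES[(iter * BATCH_SIZE + i) % len(IMAGES)]))
             for i in range(BATCH_SIZE)]
    return {"input": np.stack(batch)}

def dump_input(iter):
    # The dump step only needs a single deterministic batch.
    return calib_input(0)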

Caffe Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command.
    vai_q_caffe quantize -model float/test_quantize.prototxt \
    -weights float/trainval.caffemodel                       \
    -output_dir quantize_model                               \
    -keep_fixed_neuron                                       \
    2>&1 | tee ./log/quantize.log

    The following files are generated in the quantize_model folder.

    • deploy.caffemodel
    • deploy.prototxt
    • quantize_train_test.caffemodel
    • quantize_train_test.prototxt
  2. Generate the reference result by running the following command.
    DECENT_DEBUG=5 vai_q_caffe test -model quantize_model/dump.prototxt \
    -weights quantize_model/quantize_train_test.caffemodel              \
    -test_iter 1                                                        \
    2>&1 | tee ./log/dump.log

    This creates the dump_gpu folder and files as shown in the following figure.

  3. Generate the DPU xmodel by running the following command.
    vai_c_caffe --prototxt quantize_model/deploy.prototxt       \
    --caffemodel quantize_model/deploy.caffemodel               \
    --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json  \
    --output_dir compile_model                                  \
    --net_name resnet50
  4. Generate the DPU inference result by running the following command.
    env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1            \
        xilinx_test_dpu_runner ./compile_model/resnet50.xmodel \
        ./dump_gpu/data.bin 2>result.log 1>&2

    For xilinx_test_dpu_runner, the usage is as follows:

    xilinx_test_dpu_runner  <model_file> <input_data> 

    After the above command runs, the DPU inference result and the comparison log result.log are generated. The DPU inference results are located in the dump folder.

  5. Crosscheck the reference result and the DPU inference result.

    The crosscheck mechanism first makes sure the input(s) to a layer are identical to the reference and then checks that the output(s) are identical too. This can be done with commands such as diff, vimdiff, and cmp. If two files are identical, diff and cmp return nothing on the command line. A small scripted comparison sketch is shown after this workflow.

    1. Check the inputs of the DPU and GPU, and make sure they use the same input data.
    2. Use the xir tool to generate a picture of the network's structure.
      Usage: xir svg <xmodel> <svg>
      Note: In the Vitis AI docker environment, execute the following command to install the required library.
      sudo apt-get install graphviz

      The following figure is part of the ResNet50 model structure generated by xir_cat.

    3. View the xmodel structure image and find the name of the last layer of the model.
      Note: Check the last layer first. If the crosscheck of the last layer is successful, the crosscheck for all layers will pass and there is no need to crosscheck the other layers.

      For this model, the name of the last layer is `subgraph_fc1000_fixed_(fix2float)`.

      1. Search for the keyword fc1000 under dump_gpu and dump. You will find the reference result file fc1000.bin under dump_gpu and the DPU inference result 0.fc1000_inserted_fix_2.bin under dump/subgraph_fc1000/output/.
      2. Diff the two files.

        If the last layer's crosscheck fails, crosscheck from the first layer onward until you find the layer at which the crosscheck fails.

      Note: For layers that have multiple inputs or outputs (for example, res2a_branch1), check the correctness of the inputs first and then check the outputs.
    4. Submit the files to Xilinx if the DPU crosscheck fails.

      If a certain layer proves to be wrong on the DPU, prepare the following files as one package and send it to Xilinx with a detailed description for further analysis.

      • Float model and prototxt file
      • Quantized model, such as deploy.caffemodel, deploy.prototxt, quantize_train_test.caffemodel, and quantize_train_test.prototxt.
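
For scripted crosschecking of individual dump files, the diff/cmp comparison can be expressed in a few lines of Python. The following is a minimal hypothetical helper, not part of the Vitis AI tools; it assumes both files are raw dumps with the same data type and layout (int8 shown here; adjust the dtype to match your dumps).

# compare_dump.py -- hypothetical helper for crosschecking dump files.
# Usage: python compare_dump.py dump_gpu/fc1000.bin \
#            dump/subgraph_fc1000/output/0.fc1000_inserted_fix_2.bin
import sys
import numpy as np

def compare(ref_path, dpu_path, dtype=np.int8):
    ref = np.fromfile(ref_path, dtype=dtype)
    dpu = np.fromfile(dpu_path, dtype=dtype)
    if ref.size != dpu.size:
        print("size mismatch: %d vs %d" % (ref.size, dpu.size))
        return False
    mismatch = np.flatnonzero(ref != dpu)
    if mismatch.size == 0:
        print("identical")
        return True
    print("%d of %d values differ, first at index %d"
          % (mismatch.size, ref.size, mismatch[0]))
    return False

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])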

PyTorch Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command. (A sketch of what such a quantization script does internally is shown after this workflow.)
    python resnet18_quant.py --quant_mode calib --subset_len 200
  2. Generate the reference result by running the following command.
    python resnet18_quant.py --quant_mode test
  3. Generate the DPU xmodel by running the following command.
    vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json \
      -o /OUTPUTPATH -n netname
  4. Generate the DPU inference result.

    This step is the same as the corresponding step in the Caffe workflow.

  5. Crosscheck the reference result and the DPU inference result.

    This step is the same as the corresponding step in the Caffe workflow.
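
The resnet18_quant.py script wraps the vai_q_pytorch quantizer; its source is not reproduced in this guide. The following is a rough hypothetical sketch of the core of such a script using the pytorch_nndct API, so that the --quant_mode options can be related to the underlying calls. The torchvision ResNet-18, the random-input evaluation loop, and the 224x224 shape are placeholders; real scripts evaluate on an ImageNet subset selected by --subset_len.

# Hypothetical sketch of a resnet18_quant.py-style script (vai_q_pytorch).
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer

def evaluate(model, iterations):
    # Placeholder evaluation loop: real scripts iterate over a dataset subset.
    with torch.no_grad():
        for _ in range(iterations):
            model(torch.randn(1, 3, 224, 224))

def run(quant_mode, subset_len=200):
    model = resnet18(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    # "calib" collects activation statistics; "test" evaluates the quantized
    # model and allows exporting the xmodel plus reference dump data.
    quantizer = torch_quantizer(quant_mode, model, (dummy_input,))
    quant_model = quantizer.quant_model

    evaluate(quant_model, subset_len)

    if quant_mode == "calib":
        quantizer.export_quant_config()             # write quantization results
    else:
        quantizer.export_xmodel(deploy_check=True)  # dump xmodel and golden data

if __name__ == "__main__":
    run("calib")
    run("test")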

Multi-FPGA Programming

Most modern servers have multiple Xilinx® Alveo™ cards, and you may want to take advantage of them by scaling up and scaling out deep-learning inference. Vitis AI provides support for multi-FPGA servers using the following building blocks.

Xbutler

The Xbutler tool manages and controls Xilinx FPGA resources on a machine. As of the Vitis AI 1.0 release, installing Xbutler is mandatory for running a deep-learning solution. Xbutler is implemented as a server-client paradigm. It is an add-on library on top of Xilinx XRT that facilitates multi-FPGA resource management; it is not a replacement for Xilinx XRT. The feature list for Xbutler is as follows:

  • Enables multi-FPGA heterogeneous support
  • C++/Python API and CLI for the clients to allocate, use, and release resources
  • Enables resource allocation at FPGA, compute unit (CU), and service granularity
  • Automatic resource release
  • Multi-client support: enables requests from multiple clients, users, and processes
  • XCLBIN-to-DSA auto-association
  • Resource sharing amongst clients/users
  • Containerized support
  • User defined function
  • Logging support

Multi-FPGA, Multi-Graph Deployment with Vitis AI

Vitis AI provides different applications built using the Unified Runner APIs to deploy multiple models on single or multiple FPGAs. A detailed description and examples are available in the Vitis AI GitHub (Multi-Tenant Multi FPGA Deployment).

Xstream API

A typical end-to-end workflow involves heterogeneous compute nodes: FPGAs for accelerated services such as ML, video, and database acceleration, and CPUs for I/O with the outside world and for compute not implemented on the FPGA. Vitis AI provides a set of APIs and functions that enable the composition of streaming applications in Python. The Xstream APIs build on top of the features provided by Xbutler. The components of the Xstream API are as follows.

Xstream
Xstream ($VAI_PYTHON_DIR/vai/dpuv1/rt/xstream.py) provides a standard mechanism for streaming data between multiple processes and controlling execution flow and dependencies.
Xstream Channel
Channels are defined by an alphanumeric string. Xstream Nodes may publish payloads to channels and subscribe to channels to receive payloads. The default pattern is PUB-SUB, that is, all subscribers of a channel will receive all payloads published to that channel. Payloads are queued up on the subscriber side in FIFO order until the subscriber consumes them off the queue.
Xstream Payloads
Payloads contain two items: a blob of binary data and metadata. The binary blob and metadata are transmitted using Redis as an object store. The binary blob is meant for large data. The metadata is meant for smaller data such as IDs, arguments, and options. The object IDs are transmitted through ZMQ, which is used for stream flow control. The ID field is required in the metadata. An empty payload is used to signal the end of transmission.
Xstream Node
Each Xstream Node is a stream processor. It is a separate process that can subscribe to zero or more input channels and output to zero or more output channels. A node may perform computation on payloads received on its input channel(s); the computation can be implemented on a CPU, FPGA, or GPU. To define a new node, add a new Python file in vai/dpuv1/rt/xsnodes; see ping.py as an example. Every node should loop forever upon construction. On each iteration of the loop, it should consume payloads from its input channel(s) and publish payloads to its output channel(s). If an empty payload is received, the node should forward the empty payload to its output channels by calling xstream.end() and exit. A small standard-library analogue of this loop is sketched at the end of this section.
Xstream Graph
Use $VAI_PYTHON_DIR/vai/dpuv1/rt/xsnodes/grapher.py to construct a graph consisting of one or more nodes. When Graph.serve() is called, the graph spawns each node as a separate process and connects their input/output channels. The graph manages the life and death of all its nodes. See neptune/services/ping.py for a graph example. For example:
graph = grapher.Graph("my_graph")
graph.node("prep", pre.ImagenetPreProcess, args)
graph.node("fpga", fpga.FpgaProcess, args)
graph.node("post", post.ImagenetPostProcess, args)

graph.edge("START", None, "prep")
graph.edge("fpga", "prep", "fpga")
graph.edge("post", "fpga", "post")
graph.edge("DONE", "post", None)

graph.serve(background=True)
...
graph.stop()
Xstream Runner
The runner is a convenience class that pushes a payload to the input channel of a graph. The payload is submitted with a unique ID. The runner then waits for the output payload of the graph matching the submitted ID. The purpose of this runner is to provide the look and feel of a blocking function call. A complete standalone example of Xstream is available at ${VAI_ALVEO_ROOT}/examples/deployment_modes/xs_classify.py.
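
The node contract described above (loop forever, consume payloads from the input channels, publish to the output channels, forward the empty end-of-transmission payload, and exit) is easiest to see in code. The following is not the Xstream API: it is a small standard-library analogue using multiprocessing queues that mirrors the same loop structure, with None standing in for the empty payload and a (metadata, blob) tuple standing in for a payload.

# Standard-library analogue of an Xstream-style node loop; not the Xstream API.
from multiprocessing import Process, Queue

def node(in_channel, out_channel):
    # Stream processor: loop forever, consume payloads, publish results.
    while True:
        payload = in_channel.get()
        if payload is None:                      # empty payload: forward it and exit
            out_channel.put(None)
            return
        meta, blob = payload                     # payload = (metadata dict, binary blob)
        out_channel.put((meta, blob.upper()))    # placeholder computation

if __name__ == "__main__":
    a, b = Queue(), Queue()
    p = Process(target=node, args=(a, b))
    p.start()
    a.put(({"id": 1}, b"hello"))
    a.put(None)                                  # signal end of transmission
    print(b.get())                               # ({'id': 1}, b'HELLO')
    assert b.get() is None
    p.join()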

AI Kernel Scheduler

Real-world deep learning applications involve multi-stage data processing pipelines, which include many compute-intensive pre-processing operations such as data loading from disk, decoding, resizing, color space conversion, scaling, and cropping; multiple ML networks of different kinds, such as CNNs; and various post-processing operations, such as NMS.

The AI Kernel Scheduler (AKS) is an application that automatically and efficiently pipelines such graphs without much effort from the user. It provides various kinds of kernels for every stage of these complex graphs, which are plug-and-play and highly configurable: for example, pre-processing kernels such as image decode and resize, CNN kernels such as the Vitis AI DPU kernel, and post-processing kernels such as SoftMax and NMS. You can create graphs using these kernels and execute jobs seamlessly to get maximum performance.

For more details and examples, see the Vitis AI GitHub (AI Kernel Scheduler).

Neptune

Neptune provides a web server with a modular collection of nodes defined in Python. These nodes can be strung together in a graph to create a service. You can interact with the server to start and stop these services. You can extend Neptune by adding your own nodes and services. Neptune builds on top of the Xstream API. In the following picture, the user is running three different machine learning models on 16 videos from YouTube in real time. Through a single Neptune server, time and space multiplexing of the FPGA resources is enabled. Detailed documentation and examples can be found at ${VAI_ALVEO_ROOT}/neptune. Neptune is in the early access phase in this Vitis AI release.

Figure 1: Multi-stream, Multi-network Processing in Alveo

For more details, see the Vitis AI GitHub (Neptune).

Apache TVM and Microsoft ONNX Runtime

In addition to VART and related APIs, Vitis AI has integrated with the Apache TVM and Microsoft ONNX Runtime frameworks for improved model support and automatic partitioning. This work incorporates community-driven machine learning framework interfaces that are not available through the standard Vitis AI compiler and quantizers. In addition, it incorporates highly optimized CPU code for x86 and Arm CPUs for cases where certain layers may not yet be available on Xilinx DPUs.

TVM is currently supported on the following:

  • DPUCADX8G
  • DPUCZDX8G

ONNX Runtime is currently supported on the following:

  • DPUCADX8G

Apache TVM

Apache TVM is an open source deep learning compiler stack focused on building efficient implementations for a wide variety of hardware architectures. It includes model parsing from TensorFlow, TensorFlow Lite (TFLite), Keras, PyTorch, MXNet, ONNX, Darknet, and others. Through the Vitis AI integration with TVM, Vitis AI is able to run models from these frameworks. TVM incorporates two phases. The first is a model compilation/quantization phase, which produces the CPU/FPGA binary for your desired target CPU and DPU. Then, by installing the TVM runtime on your cloud or edge device, the TVM APIs in Python or C++ can be called to execute the model.
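
As an illustration of the second phase, the following Python sketch loads a compiled TVM module and runs inference through the TVM runtime. The module path, input name, and shape are placeholders, and depending on the TVM version the executor is tvm.contrib.graph_executor or the older tvm.contrib.graph_runtime; see the Vitis AI TVM tutorials linked below for the exact supported flow.

# Hypothetical sketch of executing a TVM-compiled model; names and paths are placeholders.
import numpy as np
import tvm
from tvm.contrib import graph_executor  # older TVM releases: tvm.contrib.graph_runtime

# Load the deployable library produced by the TVM compilation/quantization phase.
lib = tvm.runtime.load_module("compiled_model.so")
module = graph_executor.GraphModule(lib["default"](tvm.cpu()))

# Set the input by name, run, and fetch the first output.
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
output = module.get_output(0).asnumpy()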

To read more about Apache TVM, see https://tvm.apache.org.

Vitis AI provides tutorials and installation guides on Vitis AI and TVM integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tvm.

Microsoft ONNX Runtime

Microsoft ONNX Runtime is an open source inference accelerator focused on ONNX models. It is the platform Vitis AI has integrated with to provide first-class support for ONNX models, which can be exported from a wide variety of training frameworks. It incorporates easy-to-use runtime APIs in Python and C++ and can support models without the separate compilation phase that TVM requires. Included in ONNX Runtime is a partitioner that can automatically partition between the CPU and FPGA, further enhancing ease of model deployment. Finally, it also incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.
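
From the user's side, running a model looks like a regular ONNX Runtime session with a Vitis AI execution provider selected. The following is a hedged sketch: the model path, input name, and shape are placeholders, and the availability of the VitisAIExecutionProvider depends on how your onnxruntime package was built; see the Vitis AI ONNXRuntime tutorials linked below for the supported setup.

# Hypothetical sketch of running an ONNX model through ONNX Runtime.
import numpy as np
import onnxruntime as ort

# Fall back to the CPU provider if the Vitis AI provider is not available.
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)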

To read more about Microsoft ONNX Runtime, see https://microsoft.github.io/onnxruntime/.

Vitis AI provides tutorials and installation guides on Vitis AI and ONNXRuntime integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/onnxruntime.