Accelerating Subgraph with ML Frameworks

Partitioning is the process of splitting the inference execution of a model between the FPGA and the host. It is necessary for models that contain layers the FPGA does not support. Partitioning is also useful for debugging and for exploring alternative ways to partition and execute the computation graph to meet a target objective.

Note: This feature is currently only available for Alveo™ U200/U250 with use of DPUCADX8G.

Partitioning Functional API Call in TensorFlow

Graph partitioning has the following general flow:
  1. Create/initialize the partition class:
    from vai.dpuv1.rt.xdnn_rt_tf import TFxdnnRT
    xdnnTF = TFxdnnRT(args)
  2. Load the partitioned graph:
    graph = xdnnTF.load_partitioned_graph()
  3. Apply preprocessing and postprocessing as if the original graph were loaded.

Partitioner API

The main input arguments of the partitioner (for example, args in step 1 of the general flow above) are as follows:
tf.Graph, tf.GraphDef, or path to the network file
Saving protocol of the network file. Supported formats: pb (default), chkpt, txt, savedmodel
DPUCADX8G quantization file
Inference batch size. The default value is 1.
List of start nodes for the FPGA partition (optional; defaults to all placeholders)
List of final nodes for the FPGA partition (optional; defaults to all sink nodes)
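
The argument list above can be pictured as a small configuration record. The sketch below only illustrates the documented defaults (pb format, batch size 1, optional start and final node lists); it is not the TFxdnnRT signature. The class name PartitionerArgs and the field names network, load_format, and batch_size are hypothetical, while quant_cfgfile, startnode, and finalnode are the names used elsewhere in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

# Formats listed in this section; "pb" is the documented default.
SUPPORTED_FORMATS = ("pb", "chkpt", "txt", "savedmodel")


@dataclass
class PartitionerArgs:
    """Hypothetical container mirroring the documented partitioner arguments."""
    network: str                            # hypothetical name: path to the network file
    quant_cfgfile: str                      # DPUCADX8G quantization file
    load_format: str = "pb"                 # hypothetical name: saving protocol
    batch_size: int = 1                     # hypothetical name: inference batch size
    startnode: Optional[List[str]] = None   # None stands for "all placeholders"
    finalnode: Optional[List[str]] = None   # None stands for "all sink nodes"

    def __post_init__(self):
        # Reject values that the section does not list as supported.
        if self.load_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {self.load_format!r}")
        if self.batch_size < 1:
            raise ValueError("batch size must be at least 1")


# Only the two required values are given; everything else takes its default.
args = PartitionerArgs(network="model.pb", quant_cfgfile="quantizer.json")
print(args)
```

How such values are actually passed to TFxdnnRT (dictionary, namespace, or keyword arguments) is not specified here and should be checked against the installed package.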

Partitioning Steps

  1. Loading the original graph

    The partitioner can handle a frozen tf.Graph, a tf.GraphDef, or a path to the network file/folder. If a pb file is provided, the graph should be properly frozen. Other options include models stored using tf.train.Saver and tf.saved_model.

  2. Partitioning

    In this step, the subgraph specified by the startnode and finalnode sets is analyzed for FPGA acceleration. This is done in multiple phases:

    1. All graph nodes get partitioned into (FPGA) supported and unsupported sets using one of two methods. The default method (compilerFunc='SPECULATIVE') uses a rough estimate of the hardware operation tree. The second method (compilerFunc='DEFINITIVE') uses the hardware compiler. The latter is more accurate and can handle complex optimization schemes based on the specified options; however, it takes considerably more time to complete.
    2. Adjacent supported and unsupported nodes get merged into (fine grained) connected components.
    3. Supported partitions get merged into maximally connected components, while maintaining the DAG property.
    4. Each supported partition gets (re)compiled using the hardware compiler to create runtime code, quantization info, and relevant model parameters.
    5. Each supported partition subgraph is stored for visualization and debug purposes.
    6. Each supported subgraph gets replaced by a tf.py_func node (with the naming convention fpga_func_<partition_id>) that contains all the Python function calls needed to accelerate that subgraph on the FPGA.
  3. Freezing the modified graph

    The modified graph gets frozen and stored with a "-fpga" suffix.

  4. Run natively in TensorFlow

    The modified graph can be loaded using the load_partitioned_graph method of the partitioner class. It replaces the default TensorFlow graph and can be used in the same way as the original graph.
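
The first three partitioning phases (classify nodes, merge neighbors, keep the result a DAG) can be illustrated on a toy graph. The following is a minimal sketch and not the partitioner's actual algorithm: the op names, the SUPPORTED_OPS set, the greedy merge order, and the numeric partition ids are all assumptions made for illustration.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: return True if the directed graph has no cycle."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:
        n = ready.pop()
        visited += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return visited == len(indeg)


def partition(ops, edges, supported_ops):
    """Group supported nodes into partitions while keeping the partition graph a DAG."""
    # Phase 1: split nodes into supported and unsupported sets.
    supported = {n for n, op in ops.items() if op in supported_ops}
    part = {n: n for n in ops}  # every node starts in its own partition

    def condensed(mapping):
        # Edges between distinct partitions (self-loops dropped).
        return {(mapping[u], mapping[v]) for u, v in edges if mapping[u] != mapping[v]}

    # Phases 2 and 3: greedily merge adjacent supported nodes, rejecting any
    # merge that would introduce a cycle between partitions.
    for u, v in edges:
        if u in supported and v in supported and part[u] != part[v]:
            trial = {n: (part[u] if p == part[v] else p) for n, p in part.items()}
            if is_acyclic(set(trial.values()), condensed(trial)):
                part = trial

    groups = {}
    for n in ops:
        if n in supported:
            groups.setdefault(part[n], []).append(n)
    return list(groups.values())


# Hypothetical toy network: "custom" is an op the FPGA cannot run.
SUPPORTED_OPS = {"Conv", "BiasAdd", "Relu"}
OPS = {"input": "Placeholder", "conv1": "Conv", "bias1": "BiasAdd",
       "custom": "CustomOp", "relu2": "Relu", "out": "Softmax"}
EDGES = [("input", "conv1"), ("conv1", "bias1"), ("bias1", "custom"),
         ("custom", "relu2"), ("bias1", "relu2"), ("relu2", "out")]

for pid, nodes in enumerate(partition(OPS, EDGES, SUPPORTED_OPS)):
    # Numeric ids in the fpga_func_<partition_id> name are an assumption.
    print(f"fpga_func_{pid}", nodes)
```

In this toy graph, conv1 and bias1 end up in one partition, while relu2 stays in its own: merging it with them would route data out to custom and back into the same partition, which would break the DAG property.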

Practical Notes

The compiler optimizations can be modified by passing the applicable compiler arguments to the partitioner class TFxdnnRT, either as positional arguments or as option arguments. If the model is not properly frozen, the compiler might fail to optimize some operations, such as batchnorm.

The startnode and finalnode sets should be vertex separators. This means that removing the startnode set or the finalnode set should separate the graph into two distinct connected components (except when startnode is a subset of the graph placeholders).
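
A candidate cut set can be sanity-checked before it is handed to the partitioner. The sketch below is an illustrative check on a plain edge list, not part of the partitioner API: it removes the cut nodes and counts the weakly connected components that remain. The function names and the toy graph are hypothetical, and the placeholder exception mentioned above is not handled.

```python
from collections import defaultdict


def weak_components(nodes, edges):
    """Count connected components among `nodes`, ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first walk over one component
            n = stack.pop()
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
    return count


def is_vertex_separator(edges, cut):
    """True if removing `cut` leaves at least two disconnected pieces."""
    all_nodes = {n for edge in edges for n in edge}
    remaining = all_nodes - set(cut)
    return weak_components(remaining, edges) >= 2


# Toy graph with a skip connection from conv1 to fc.
EDGES = [("input", "conv1"), ("conv1", "relu1"), ("relu1", "fc"),
         ("conv1", "fc"), ("fc", "softmax")]

print(is_vertex_separator(EDGES, {"fc"}))     # fc cuts softmax off from the rest
print(is_vertex_separator(EDGES, {"relu1"}))  # the skip edge keeps the graph connected
```

Here fc is a valid cut node, while relu1 is not, because the skip connection bypasses it.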

Wherever possible, do not specify cut nodes between layers that are executed as a single macro layer. For example, for Conv(x) -> BiasAdd(x), placing Conv(x) in a different FPGA partition than BiasAdd(x) may result in suboptimal performance (throughput, latency, and accuracy).

The partitioner initialization requires quant_cfgfile to exist in order to create executable code for the FPGA. If FPGA execution is not intended, this requirement can be circumvented by setting quant_cfgfile="IGNORE".

Partitioning Support in Caffe

Xilinx has enhanced the Caffe package to automatically partition a Caffe graph. This function separates the FPGA-executable layers in the network and generates a new prototxt, which is used for inference. The subgraph cutter creates a custom Python layer to be accelerated on the FPGA. The following code snippet shows how the cutter is invoked:

from vai.dpuv1.rt.scripts.framework.caffe.xfdnn_subgraph \
    import CaffeCutter as xfdnnCutter
def Cut(prototxt):
    cutter = xfdnnCutter(
        ...
    )
# Cutting and generating a partitioned graph auto_cut_deploy.prototxt


The auto_cut_deploy.prototxt generated in the previous step has all the information needed to run inference. For example:

Notebook execution
Two example notebooks (image detection and image classification) are available in $VAI_ALVEO_ROOT/notebooks and walk through these steps in detail.
Script execution
A Python script can be used to run the models with default settings. It can be run using the following commands:
Prepare Phase
$VAI_ALVEO_ROOT/examples/caffe/ --prototxt <example prototxt> --caffemodel <example caffemodel> --prepare
Path to the prototxt of the model
Path to the caffemodel of the model
Path to save the quantization, compiler, and subgraph_cut files
Number of iterations to test the quantization
Number of calibration iterations used for quantization
Validate Phase
$VAI_ALVEO_ROOT/examples/caffe/ --validate
If output_dir was given in the prepare phase, pass the same argument and value to use the files generated in the prepare phase.
Number of batches to use for testing inference.