TensorFlow Version (vai_q_tensorflow)

vai_q_tensorflow Installation

vai_q_tensorflow can be obtained in the following two ways:

Docker container

Vitis AI provides a Docker container for the quantization tools, including vai_q_tensorflow. After starting a container, activate the conda environment vitis-ai-tensorflow. All requirements are pre-installed in this environment, so vai_q_tensorflow can be run directly.

conda activate vitis-ai-tensorflow

Install from Source Code

vai_q_tensorflow is a fork of TensorFlow based on branch "r1.15". It is open source in Vitis_AI_Quantizer. The vai_q_tensorflow build process is the same as that of TensorFlow 1.15. Refer to the TensorFlow documentation for details.

Steps to Run vai_q_tensorflow

Use the following steps to run vai_q_tensorflow.
  1. Prepare Float Model: Before running vai_q_tensorflow, prepare the frozen inference TensorFlow model in floating-point format and the calibration set, including the files listed in the following table.
    Table 1. Input Files for vai_q_tensorflow
    No. Name Description
    1 frozen_graph.pb Floating-point frozen inference graph. Ensure that the graph is the inference graph rather than the training graph.
    2 calibration dataset A subset of the training dataset containing 100 to 1000 images.
    3 input_fn An input function to convert the calibration dataset to the input data of the frozen_graph during quantize calibration. Usually performs data preprocessing and augmentation.

    For more information, see Getting the Frozen Inference Graph, Getting the Calibration Dataset and Input Function, and Custom Input Function.

  2. Run vai_q_tensorflow: Run the following command to quantize the model:
    $vai_q_tensorflow quantize \
        --input_frozen_graph frozen_graph.pb \
        --input_nodes ${input_nodes} \
        --input_shapes ${input_shapes} \
        --output_nodes ${output_nodes} \
        --input_fn input_fn \
        [options]


    For more information, see Setting the --input_nodes and --output_nodes and Setting the Options. A complete example command is shown after these steps.

  3. After successful execution of the above command, two files are generated in ${output_dir}:
    • quantize_eval_model.pb is used for evaluation on the CPU/GPU and can be used to simulate the results on hardware. You must run import tensorflow.contrib.decent_q explicitly to register the custom quantize operation, because tensorflow.contrib is now lazily loaded.
    • deploy_model.pb is used to compile for and deploy on the DPU; it is used as the input file to the Vitis AI compiler.
    Table 2. vai_q_tensorflow Output Files
    No. Name Description
    1 deploy_model.pb Quantized model for VAI compiler (extended Tensorflow format)
    2 quantize_eval_model.pb Quantized model for evaluation
  4. After deployment of the quantized model, sometimes it is necessary to compare the simulation results on the CPU/GPU and the output values on the DPU. vai_q_tensorflow supports dumping the simulation results with the quantize_eval_model.pb generated in step 3.

    Run the following command to dump the quantize simulation results:

    $vai_q_tensorflow dump \
        --input_frozen_graph quantize_results/quantize_eval_model.pb \
        --input_fn dump_input_fn \
        --max_dump_batches 1 \
        --dump_float 0 \
        --output_dir quantize_results


    The input_fn for dumping is similar to the input_fn for quantize calibration, but the batch size is often set to 1 to be consistent with the DPU results.

    After successful execution of the above command, dump results are generated in ${output_dir}. ${output_dir} contains one folder per batch of input data, and each folder holds the dump results for that batch. Within each folder, the results for each node are saved separately. For each quantized node, results are saved in *_int8.bin and *_int8.txt format. If dump_float is set to 1, the results for unquantized nodes are also dumped. The / symbol in node names is replaced by _ in the file names for simplicity. Examples of dump results are shown in the following table.

    Table 3. Examples for Dump Results
    Batch No. | Quantized | Node Name | Saved Files
    1 | Yes | resnet_v1_50/conv1/biases/wquant | {output_dir}/dump_results_1/resnet_v1_50_conv1_biases_wquant_int8.bin and {output_dir}/dump_results_1/resnet_v1_50_conv1_biases_wquant_int8.txt
    2 | No | resnet_v1_50/conv1/biases | {output_dir}/dump_results_2/resnet_v1_50_conv1_biases.bin and {output_dir}/dump_results_2/resnet_v1_50_conv1_biases.txt
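
As a concrete illustration of step 2, the following command quantizes a hypothetical classification model. The node names, input shape, and input function module are placeholders (you can estimate the node names with vai_q_tensorflow inspect) and must be replaced with values from your own model:

$ vai_q_tensorflow quantize \
    --input_frozen_graph frozen_graph.pb \
    --input_nodes input \
    --input_shapes ?,224,224,3 \
    --output_nodes resnet_v1_50/predictions/Reshape_1 \
    --input_fn my_input_fn.calib_input \
    --calib_iter 100 \
    --output_dir quantize_results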

Getting the Frozen Inference Graph

In most situations, training a model with TensorFlow gives you a folder containing a GraphDef file (usually ending with a .pb or .pbtxt extension) and a set of checkpoint files. What you need for mobile or embedded deployment is a single GraphDef file that has been “frozen”, or had its variables converted into inline constants so everything is in one file. To handle the conversion, Tensorflow provides freeze_graph.py, which is automatically installed with the vai_q_tensorflow quantizer.

An example of command-line usage is as follows:

$ freeze_graph \
    --input_graph  /tmp/inception_v1_inf_graph.pb \
    --input_checkpoint  /tmp/checkpoints/model.ckpt-1000 \
    --input_binary  true \
    --output_graph  /tmp/frozen_graph.pb \
    --output_node_names  InceptionV1/Predictions/Reshape_1

The --input_graph should be an inference graph rather than a training graph. Some operations behave differently during training and inference, such as dropout and batch normalization; ensure that they are in the inference phase when freezing the graph. For example, you can set the flag is_training=False when using tf.layers.dropout/tf.layers.batch_normalization. For models using tf.keras, call tf.keras.backend.set_learning_phase(0) before building the graph.
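
The following is a minimal sketch of these inference-phase settings; the placeholder shape and layer parameters are illustrative only:

import tensorflow as tf

# For tf.keras models: force inference behavior before building the graph.
tf.keras.backend.set_learning_phase(0)

# For tf.layers: pass training=False so dropout and batch normalization
# run in inference mode when the graph is frozen.
inputs = tf.placeholder(tf.float32, [None, 224, 224, 3], name="input")
net = tf.layers.batch_normalization(inputs, training=False)
net = tf.layers.dropout(net, rate=0.5, training=False)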

Because data preprocessing and loss operations are not needed for inference and deployment, the frozen_graph.pb should include only the main part of the model. In particular, data preprocessing operations should be performed in the input_fn to generate correct input data for quantize calibration.

Note: Type freeze_graph --help for more options.

The input and output node names vary depending on the model, but you can inspect and estimate them with the vai_q_tensorflow quantizer. See the following code snippet for an example:

$ vai_q_tensorflow inspect --input_frozen_graph=/tmp/inception_v1_inf_graph.pb

If the graph contains in-graph pre- and postprocessing, the estimated input and output nodes cannot be used directly for quantization, because some operations in these parts are not quantizable and might cause errors when compiled by the Vitis AI compiler if you deploy the quantized model to the DPU.

Another way to get the input and output names of the graph is to visualize the graph. Both TensorBoard and Netron can do this. The following example uses Netron:

$ pip install netron
$ netron /tmp/inception_v3_inf_graph.pb

Getting the Calibration Dataset and Input Function

The calibration set is usually a subset of the training/validation dataset or actual application images (at least 100 images are recommended for good performance). The input function is a Python-importable function that loads the calibration dataset and performs data preprocessing. The vai_q_tensorflow quantizer accepts an input_fn to perform any preprocessing that is not saved in the graph. If the preprocessing subgraph is saved in the frozen graph, the input_fn only needs to read the images from the dataset and return a feed_dict.

Custom Input Function

The input function format is module_name.input_fn_name (for example, my_input_fn.calib_input). The input_fn takes an int object as input, indicating the calibration step number, and returns a dict of (placeholder_name, numpy.array) pairs for each call, which is fed into the placeholder nodes of the model when running inference. The shape of each numpy.array must be consistent with the corresponding placeholder. See the following pseudocode example:

# my_input_fn.py
def calib_input(iter):
  """A function that provides input data for the calibration.
  Args:
    iter: An `int` object, indicating the calibration step number.
  Returns:
    dict(placeholder_name, numpy.array): a `dict` object, which will be fed into the model.
  """
  image = load_image(iter)
  preprocessed_image = do_preprocess(image)
  return {"placeholder_name": preprocessed_image}
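
For reference, a more complete input_fn might look like the following sketch. The calibration list file, batch size, image size, and normalization are assumptions and should be adapted to match your model and its placeholder:

# my_input_fn.py (sketch; file name, batch size, and preprocessing are assumed)
import cv2
import numpy as np

BATCH_SIZE = 32
with open("calib_list.txt") as f:
    IMAGE_PATHS = [line.strip() for line in f]

def calib_input(iter):
    """Returns a feed_dict for calibration step `iter`."""
    images = []
    for path in IMAGE_PATHS[iter * BATCH_SIZE:(iter + 1) * BATCH_SIZE]:
        image = cv2.imread(path)                  # HWC, BGR
        image = cv2.resize(image, (224, 224))     # match the placeholder shape
        image = image.astype(np.float32) / 255.0  # simple normalization (assumed)
        images.append(image)
    return {"input": np.array(images)}            # "input" is the assumed placeholder name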

Setting the --input_nodes and --output_nodes

The input_nodes and output_nodes arguments are the name lists of the input and output nodes of the quantize graph. They are the start and end points of quantization. The main graph between them is quantized if it is quantizable, as shown in the following figure.

Figure 1: Quantization Flow for TensorFlow

It is recommended to set --input_nodes to the last nodes of the preprocessing part and --output_nodes to the last nodes of the main graph part, because some operations in the pre- and postprocessing parts are not quantizable and might cause errors when compiled by the Vitis AI compiler if you deploy the quantized model to the DPU.

The input nodes might not be the same as the placeholder nodes of the graph. If there is no in-graph preprocessing part in the frozen graph, the placeholder nodes should be assigned to --input_nodes.

The input_fn should be consistent with the placeholder nodes.
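
If you are unsure which placeholder nodes the input_fn must feed, one way to list them is to read the frozen graph and print its Placeholder ops (a sketch assuming the graph is saved as frozen_graph.pb):

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# The Placeholder nodes are the ones the input_fn must provide data for.
print([node.name for node in graph_def.node if node.op == "Placeholder"])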

Setting the Options

In the command line, [options] stands for optional parameters. The most commonly used options are as follows:

  • weight_bit: Bit width for quantized weight and bias (default is 8).
  • activation_bit: Bit width for quantized activation (default is 8).
  • method: Quantization methods, including 0 for non-overflow and 1 for min-diffs. The non-overflow method ensures that no values are saturated during quantization. The results can be easily affected by outliers. The min-diffs method allows saturation for quantization to achieve a lower quantization difference. It is more robust to outliers and usually results in a narrower range than the non-overflow method.

Evaluate Quantized Model (Optional)

If you have scripts to evaluate the float model, such as those for the models in the Xilinx Model Zoo, apply the following two changes to evaluate the quantized model.
  • Add "from tensorflow.contrib import decent_q" at the top of the float evaluation script. This registers the quantize operations.
  • Replace the float model path in the script with the quantized model path "quantize_results/quantize_eval_model.pb".
Then run the modified script to evaluate the quantized model.
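
The following is a minimal sketch of such a modified evaluation script; the tensor names, input data, and accuracy computation are placeholders for your own evaluation code:

# eval_quantized.py (sketch only)
import numpy as np
import tensorflow as tf
from tensorflow.contrib import decent_q  # registers the custom quantize operations

graph_def = tf.GraphDef()
with tf.gfile.GFile("quantize_results/quantize_eval_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    in_tensor = graph.get_tensor_by_name("input:0")    # assumed placeholder name
    out_tensor = graph.get_tensor_by_name("output:0")  # assumed output tensor name
    with tf.Session(graph=graph) as sess:
        batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # replace with preprocessed images
        predictions = sess.run(out_tensor, feed_dict={in_tensor: batch})
        # compare predictions against labels to compute accuracy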

vai_q_tensorflow Usage

The options supported by vai_q_tensorflow are shown in the following tables.

Table 4. vai_q_tensorflow Options
Name Type Description
Common Configuration
--input_frozen_graph String TensorFlow frozen inference GraphDef file for the floating-point model, used for quantize calibration.
--input_nodes String The name list of input nodes of the quantize graph, used together with --output_nodes, comma separated. Input nodes and output nodes are the start and end points of quantization. The subgraph between them is quantized if it is quantizable.

It is recommended to set --input_nodes to be the last nodes of the preprocessing part and to set --output_nodes to be the last nodes before the postprocessing part, because some operations in the pre- and postprocessing parts are not quantizable and might cause errors when compiled by the Vitis AI compiler if you deploy the quantized model to the DPU. The input nodes might not be the same as the placeholder nodes of the graph.

--output_nodes String The name list of output nodes of the quantize graph, used together with --input_nodes, comma separated. Input nodes and output nodes are the start and end points of quantization. The subgraph between them is quantized if it is quantizable.

It is recommended to set --input_nodes to be the last nodes of the preprocessing part and to set --output_nodes to be the last nodes before the postprocessing part, because some operations in the pre- and postprocessing parts are not quantizable and might cause errors when compiled by the Vitis AI compiler if you deploy the quantized model to the DPU.

--input_shapes String The shape list of the input_nodes. Must be a 4-dimensional shape for each node, comma separated, for example 1,224,224,3; an unknown batch size is supported, for example ?,224,224,3. In the case of multiple input nodes, assign the shape list of each node separated by :, for example ?,224,224,3:?,300,300,1.
--input_fn String This function provides input data for the graph, used with the calibration dataset. The function format is module_name.input_fn_name (for example, my_input_fn.input_fn). The input_fn should take an int object as input, indicating the calibration step, and should return a dict of (placeholder_node_name, numpy.array) pairs for each call, which is then fed into the placeholder operations of the model.

For example, assign --input_fn to my_input_fn.calib_input, and write a calib_input function in my_input_fn.py as follows:

def calib_input(iter):
  # read images and do some preprocessing
  return {"placeholder_1": input_1_nparray, "placeholder_2": input_2_nparray}

Note: You do not need to perform in-graph preprocessing again in the input_fn, because the subgraph before --input_nodes remains during quantization.

The pre-defined input functions (including default and random) have been removed because they are not commonly used. Any preprocessing that is not included in the graph file should be handled in the input_fn.

Quantize Configuration
--weight_bit Int32 Bit width for quantized weight and bias.

Default: 8

--activation_bit Int32 Bit width for quantized activation.

Default: 8

--method Int32 The method for quantization.

0: Non-overflow method. Makes sure that no values are saturated during quantization. Sensitive to outliers.

1: Min-diffs method. Allows saturation for quantization to get a lower quantization difference. Higher tolerance to outliers. Usually ends with narrower ranges than the non-overflow method.

Choices: [0, 1]

Default: 1

--calib_iter Int32 The iterations of calibration. Total number of images for calibration = calib_iter * batch_size.

Default: 100

--ignore_nodes String The name list of nodes to be ignored during quantization. Ignored nodes are left unquantized during quantization.
--skip_check Int32 If set to 1, the check of the float model is skipped. This is useful when only part of the input model is quantized.

Choices: [0, 1]

Default: 0

--align_concat Int32 The strategy for aligning the input quantize positions of concat nodes. Set to 0 to align all concat nodes, 1 to align the output concat nodes, and 2 to disable alignment.

Choices: [0, 1, 2]

Default: 0

--simulate_dpu Int32 Set to 1 to enable simulation of the DPU. The behavior of the DPU for some operations is different from TensorFlow. For example, the division in LeakyRelu and AvgPooling is replaced by bit-shifting, so there might be a slight difference between DPU outputs and CPU/GPU outputs. The vai_q_tensorflow quantizer simulates the behavior of these operations if this flag is set to 1.

Choices: [0, 1]

Default: 1

--output_dir String The directory in which to save the quantization results.

Default: "./quantize_results"

--max_dump_batches Int32 The maximum number of batches for dumping.

Default: 1

--dump_float Int32 If set to 1, the float weights and activations will also be dumped.

Choices: [0, 1]

Default: 0

Session Configurations
--gpu String The ID of the GPU device used for quantization, comma separated.
--gpu_memory_fraction Float The GPU memory fraction used for quantization, between 0 and 1.

Default: 0.5

Others
--help Show all available options of vai_q_tensorflow.
--version Show vai_q_tensorflow version information.

vai_q_tensorflow Supported Operations and APIs

The following table lists the supported operations and APIs for vai_q_tensorflow.

Table 5. Supported Operations and APIs for vai_q_tensorflow
Type | Operation Type | tf.nn | tf.layers | tf.keras.layers
Convolution | Conv2D, DepthwiseConv2dNative | atrous_conv2d, conv2d, conv2d_transpose, depthwise_conv2d_native, separable_conv2d | Conv2D, Conv2DTranspose, SeparableConv2D | Conv2D, Conv2DTranspose, DepthwiseConv2D, SeparableConv2D
Fully Connected | MatMul | / | Dense | Dense
BiasAdd | BiasAdd, Add | bias_add | / | /
Pooling | AvgPool, Mean, MaxPool | avg_pool, max_pool | AveragePooling2D, MaxPooling2D | AveragePooling2D, MaxPool2D
Activation | Relu, Relu6 | relu, relu6, leaky_relu | / | ReLU, LeakyReLU
BatchNorm [1] | FusedBatchNorm | batch_normalization, batch_norm_with_global_normalization, fused_batch_norm | BatchNormalization | BatchNormalization
Upsampling | ResizeBilinear, ResizeNearestNeighbor | / | / | UpSampling2D
Concat | Concat, ConcatV2 | / | / | Concatenate
Others | Placeholder, Const, Pad, Squeeze, Reshape, ExpandDims | dropout [2], softmax [3] | Dropout [2], Flatten | Input, Flatten, Reshape, ZeroPadding2D, Softmax

  1. Only supports Conv2D/DepthwiseConv2D/Dense+BN. BN is folded to speed up inference.
  2. Dropout is deleted to speed up inference.
  3. There is no need to quantize softmax output and vai_q_tensorflow does not quantize it.
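
As an illustration of the table above, the following sketch builds a small graph using only listed operations from the tf.nn and tf.layers columns; the layer sizes are arbitrary.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 224, 224, 3], name="input")
net = tf.layers.conv2d(x, filters=32, kernel_size=3, padding="same")  # Convolution
net = tf.layers.batch_normalization(net, training=False)              # BatchNorm, folded into the conv (note 1)
net = tf.nn.relu(net)                                                 # Activation
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)            # Pooling
net = tf.layers.flatten(net)                                          # Others
logits = tf.layers.dense(net, units=10)                               # Fully Connected
probs = tf.nn.softmax(logits, name="output")                          # left unquantized (note 3)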

vai_q_tensorflow Quantize Finetuning

Generally, there is a small accuracy loss after quantization, but for some networks such as Mobilenets, the accuracy loss can be large. In this situation, quantize finetuning can be used to further improve the accuracy of quantized models.

APIs

There are three APIs for quantize finetuning in the Python package tf.contrib.decent_q.

tf.contrib.decent_q.CreateQuantizeTrainingGraph(config)

Converts the float training graph to a quantize training graph. This is done by in-place rewriting of the default graph.

Arguments:

  • config: A tf.contrib.decent_q.QuantizeConfig object, containing the configurations for quantization.
tf.contrib.decent_q.CreateQuantizeEvaluationGraph(config)

Converts the float evaluation graph to a quantize evaluation graph. This is done by in-place rewriting of the default graph.

Arguments:

  • config: A tf.contrib.decent_q.QuantizeConfig object, containing the configurations for quantization.
tf.contrib.decent_q.CreateQuantizeDeployGraph(checkpoint, config)

Freezes the checkpoint into the quantize evaluation graph and converts the quantize evaluation graph to the deploy graph.

Arguments:

  • checkpoint: A string object, the path to the checkpoint folder or file.
  • config: A tf.contrib.decent_q.QuantizeConfig object, containing the configurations for quantization.

Steps for Quantize Finetuning

Quantize finetuning is almost the same as float model finetuning. The difference is that the vai_q_tensorflow APIs are used to rewrite the float graph and convert it to a quantized graph before the training starts. The typical workflow is as follows.

Step 0: Preparation

Before finetuning, please prepare the following files:

Table 6. Checkpoints
No. Name Description
1 Checkpoint files Floating-point checkpoint files to start from. Can be omitted if training from scratch.
2 Dataset The training dataset with labels.
3 Train Scripts The python scripts to run float train/finetuning of the model.

Step 1 (Optional): Evaluate the Float Model

It is suggested to evaluate the float checkpoint files before doing quantize finetuning. This checks the correctness of the scripts and the dataset, and the accuracy and loss values of the float checkpoint can also serve as a baseline for the quantize finetuning.

Step 2: Modify the Training Scripts

To create the quantize training graph, modify the training scripts to call decent_q.CreateQuantizeTrainingGraph after the float graph is built. The following is an example:

# train.py

# ...

# Create the float training graph
model = model_fn(is_training=True)

# *Set the quantize configurations
from tensorflow.contrib import decent_q
q_config = decent_q.QuantizeConfig(input_nodes=['net_in'],
                                   output_nodes=['net_out'], 
                                   input_shapes=[[-1, 224, 224, 3]])
# *Call Vai_q_tensorflow api to create the quantize training graph
decent_q.CreateQuantizeTrainingGraph(config=q_config)

# Create the optimizer (the learning rate value here is only a placeholder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)

# start the training/finetuning, you can use sess.run(), tf.train, tf.estimator, tf.slim and so on
# ...

The QuantizeConfig contains the configurations for quantization.

Some basic configurations like input_nodes, output_nodes, input_shapes need to be set according to your model structure.

Other configurations such as weight_bit, activation_bit, and method have default values and can be modified as needed. See the "vai_q_tensorflow Usage" section for detailed information on all the configurations.

  • input_nodes/output_nodes: They are used together to determine the subgraph range you want to quantize. The preprocessing and postprocessing parts are usually not quantizable and should be outside this range. Note that input_nodes and output_nodes should be the same for the float training graph and the float evaluation graph so that the quantization operations can be matched correctly between them. Currently, operations with multiple output tensors (such as FIFO) are not supported; in that case, you can add a tf.identity node to create an alias for the input tensor, making a single-output input node (see the sketch after this list).
  • input_shapes: The shape list of input_nodes; each node must have a 4-dimensional shape, for example [[1,224,224,3], [1,128,128,1]]; an unknown batch size is supported, for example [[-1,224,224,3]].
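
The following is a minimal sketch of the tf.identity workaround mentioned in the first bullet; the queue and node names are illustrative only.

import tensorflow as tf

# Suppose the network input comes from an op with multiple output tensors,
# such as a FIFO queue dequeue. Adding a tf.identity alias creates a
# single-output node that can be named in input_nodes.
queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32, tf.int32],
                     shapes=[[224, 224, 3], []])
image, label = queue.dequeue()               # dequeue has multiple output tensors
net_in = tf.identity(image, name="net_in")   # use "net_in" as the input node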

Step 4: Evaluate the Quantized Model and Generate the Deploy Model

After quantize finetuning, the deploy model can be generated. Before that, the quantized graph usually needs to be evaluated with the checkpoint file. This can be done by calling the functions below after building the float evaluation graph. Because the deploy process runs on the quantize evaluation graph, the two functions are often called together.

# eval.py

# ...

# Create the float evaluation graph
model = model_fn(is_training=False)

# *Set the quantize configurations
from tensorflow.contrib import decent_q
q_config = decent_q.QuantizeConfig(input_nodes=['net_in'],
                                   output_nodes=['net_out'], 
                                   input_shapes=[[-1, 224, 224, 3]])
# *Call Vai_q_tensorflow api to create the quantize evaluation graph
decent_q.CreateQuantizeEvaluationGraph(config=q_config)
# *Call Vai_q_tensorflow api to freeze the model and generate the deploy model
decent_q.CreateQuantizeDeployGraph(checkpoint="path to checkpoint folder", config=q_config)

# start the evaluation, users can use sess.run, tf.train, tf.estimator, tf.slim and so on
# ...

Generated Files

After the above steps, the generated files are located in ${output_dir}, as listed below:

Table 7. Generated File Information
Name TensorFlow Compatible Usage Description
quantize_train_graph.pb Yes Train The quantize train graph.
quantize_eval_graph_{suffix}.pb Yes Evaluation with checkpoint The quantize evaluation graph with the quantize information frozen inside. It contains no weights and should be used together with the checkpoint file for evaluation.
quantize_eval_model_{suffix}.pb Yes 1. Evaluation; 2. Dump; 3. Input to VAI compiler (DPUCAHX8H) The frozen quantize evaluation graph; the weights from the checkpoint and the quantize information are frozen inside. It can be used to evaluate the quantized model on the host or to dump the outputs of each layer for cross-checking with DPU outputs. The XIR compiler uses it as input.
deploy_model_{suffix}.pb No Input to VAI compiler (DPUCZDX8G) The deploy model, in which operations and quantize information are fused. The DNNC compiler uses it as input.

The suffix contains the iteration information from the checkpoint file and the date information, making it easy to match the generated files to the checkpoint files. For example, if the checkpoint file is "model.ckpt-2000.*" and the date is 20200611, the suffix is "2000_20200611000000".

Tips

The following are some tips for quantize finetuning.

  1. Dropout: Experiments show that quantize finetuning works better without dropout ops. The tool does not currently support quantize finetuning with dropout; dropout ops should be removed or disabled before running quantize finetuning. This can be done by setting is_training=False when using tf.layers, or by calling tf.keras.backend.set_learning_phase(0) when using tf.keras.layers.
  2. Hyper-parameters: Quantize finetuning is similar to float finetuning, so the techniques used for float finetuning are also needed. The optimizer type and the learning rate schedule are important parameters to tune.