Libraries and Samples

The Vitis™ AI Library contains the following types of neural network libraries based on the Caffe framework:

  • Classification
  • Face detection
  • SSD detection
  • Pose detection
  • Semantic segmentation
  • Road line detection
  • YOLOv3 detection
  • YOLOv2 detection
  • Openpose detection
  • RefineDet detection
  • ReID detection
  • Multi-task
  • Face recognition
  • Plate detection
  • Plate recognition
  • Medical segmentation

The Vitis™ AI Library also contains the following types of neural network libraries based on the TensorFlow framework:

  • Classification
  • SSD detection
  • YOLOv3 detection
  • Medical detection

The Vitis™ AI Library also supports the following types of neural network libraries based on the PyTorch framework:

  • Classification
  • ReID detection
  • Face recognition
  • Semantic segmentation
  • Point cloud
  • Medical segmentation
  • 3D segmentation

The related libraries are open source and can be modified as needed. The source code is available on GitHub.

The Vitis™ AI Library provides image and video test samples for all of the above networks. In addition, the kit provides a corresponding performance test program for each. For video-based testing, raw video is recommended for evaluation, because decoding by software libraries on the Arm® CPU can have inconsistent decoding times, which can affect the accuracy of the evaluation.

Note: For Edge devices, the sample programs can only run on the target side, but they can be cross-compiled on the host side or compiled directly on the target side.

Model Library

After the model packet is installed on the target, all the models are stored under /usr/share/vitis_ai_library/models/. Each model is stored in a separate folder, which is composed of the following files by default:

  • [model_name].xmodel
  • [model_name].prototxt
Note: The ELF model format is not supported by the Vitis AI Library in VAI 1.3.

Take the "inception_v1" model as an example. inception_v1.xmodel is the model data. inception_v1.prototxt is the parameter of the model.

Note: The name of the model directory must be the same as the model name.
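
For example, the inception_v1 model occupies the following directory (an illustrative listing based on the files named above):

    /usr/share/vitis_ai_library/models/inception_v1/
        inception_v1.xmodel
        inception_v1.prototxt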

Model Type

Classification

The Classification library is used to classify images. These neural networks are trained on the ImageNet dataset for ILSVRC and can identify objects from its 1,000 classes. The Vitis AI Library integrates networks including, but not limited to, ResNet18, ResNet50, Inception_v1, Inception_v2, Inception_v3, Inception_v4, VGG, MobileNet_v1, MobileNet_v2, and SqueezeNet into Xilinx libraries. The input is a picture with an object, and the output is the top-K most probable categories.

Figure 1: Classification Example
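
The following is a minimal sketch of how a classification model can be invoked through the Vitis AI Library C++ API; the calling pattern follows the library's sample code, and the model name "resnet50" is only an example taken from the table below:

    #include <iostream>
    #include <opencv2/opencv.hpp>
    #include <vitis/ai/classification.hpp>

    int main(int argc, char* argv[]) {
      // Load the test image and create the model by its directory name.
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::Classification::create("resnet50");
      // Run inference; the result holds the top-K class indices and scores.
      auto result = model->run(image);
      for (const auto& r : result.scores) {
        std::cout << "index " << r.index << " score " << r.score << std::endl;
      }
      return 0;
    }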

The following table lists the classification models supported by the Vitis AI Library.

Table 1. Classification Models
No Model Name Framework
1 inception_resnet_v2_tf TensorFlow
2 inception_v1_tf
3 inception_v3_tf
4 inception_v4_2016_09_09_tf
5 mobilenet_v1_0_25_128_tf
6 mobilenet_v1_0_5_160_tf
7 mobilenet_v1_1_0_224_tf
8 mobilenet_v2_1_0_224_tf
9 mobilenet_v2_1_4_224_tf
10 resnet_v1_101_tf
11 resnet_v1_152_tf
12 resnet_v1_50_tf
13 vgg_16_tf
14 vgg_19_tf
15 mobilenet_edge_1_0_tf
16 mobilenet_edge_0_75_tf
17 inception_v2_tf
18 MLPerf_resnet50_v1.5_tf
19 resnet50_tf2
20 mobilenet_1_0_224_tf2
21 inception_v3_tf2
22 resnet_v2_50_tf
23 resnet_v2_101_tf
24 resnet_v2_152_tf
25 resnet50 Caffe
26 resnet18
27 inception_v1
28 inception_v2
29 inception_v3
30 inception_v4
31 mobilenet_v2
32 squeezenet
33 resnet50_pt PyTorch
34 squeezenet_pt
35 inception_v3_pt

Face Detection

The Face Detection library uses the DenseBox neural network to detect human faces. The input is a picture with the faces you want to detect, and the output is a vector of result structures containing the information of each detection box. The following image shows the result of face detection.

Figure 2: Face Detection Example
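
A minimal sketch of the corresponding call sequence, following the library's sample code; the model name densebox_640_360 is taken from the table below, and the box coordinates are assumed to be relative to the image size, as in the samples:

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/facedetect.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::FaceDetect::create("densebox_640_360");
      auto result = model->run(image);
      // Draw each detection box; the coordinates are relative, so scale
      // them by the image dimensions.
      for (const auto& face : result.rects) {
        int x = static_cast<int>(face.x * image.cols);
        int y = static_cast<int>(face.y * image.rows);
        int w = static_cast<int>(face.width * image.cols);
        int h = static_cast<int>(face.height * image.rows);
        cv::rectangle(image, cv::Rect(x, y, w, h), cv::Scalar(0, 255, 0), 2);
      }
      cv::imwrite("face_detect_result.jpg", image);
      return 0;
    }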

The following table lists the face detection models supported by the AI Library.

Table 2. Face Detection Models
No Model Name Framework
1 densebox_320_320 Caffe
2 densebox_640_360

Face Landmark Detection

The Face Landmark network is used to detect five key points on a human face: the left eye, the right eye, the nose, the left corner of the lips, and the right corner of the lips. This network is used to correct the face direction before face feature extraction; that is, if a face is not directly facing the camera (for example, tilted 20 degrees to the left or right), it is adjusted to face the camera directly. The input image should be a face that has been detected by the face detection network. The output of the network is the five key points, which are normalized. The following image shows the result of face landmark detection.

Figure 3: Face Landmark Detection Example
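
A minimal sketch, assuming the result's points member holds the five normalized (x, y) pairs as in the library's sample code:

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/facelandmark.hpp>

    int main(int argc, char* argv[]) {
      // The input is expected to be a face crop produced by face detection.
      cv::Mat face = cv::imread(argv[1]);
      auto model = vitis::ai::FaceLandmark::create("face_landmark");
      auto result = model->run(face);
      // The five key points are normalized to [0, 1]; scale them back to
      // pixel coordinates before drawing.
      for (const auto& p : result.points) {
        cv::circle(face,
                   cv::Point(static_cast<int>(p.first * face.cols),
                             static_cast<int>(p.second * face.rows)),
                   3, cv::Scalar(0, 0, 255), -1);
      }
      cv::imwrite("landmark_result.jpg", face);
      return 0;
    }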

The following table lists the face landmark models supported by the AI Library.

Table 3. Face Landmark Models
No Model Name Framework
1 face_landmark Caffe

SSD Detection

The SSD Detection library is commonly used with the SSD neural network. SSD is a neural network that is used to detect objects. The input is a picture with the objects you want to detect, and the output is a vector of result structures containing the information of each detection box. The following image shows the result of SSD detection.

Figure 4: SSD Detection Example
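
A minimal sketch of running an SSD model and reading back the detection boxes; the model name is taken from the table below, and the box fields follow the library's sample code:

    #include <iostream>
    #include <opencv2/opencv.hpp>
    #include <vitis/ai/ssd.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::SSD::create("ssd_adas_pruned_0_95");
      auto result = model->run(image);
      // Each element of bboxes describes one detection: class label,
      // confidence score, and a box in coordinates relative to the image.
      for (const auto& box : result.bboxes) {
        std::cout << "label " << box.label << " score " << box.score
                  << " box (" << box.x << ", " << box.y << ", "
                  << box.width << ", " << box.height << ")" << std::endl;
      }
      return 0;
    }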

The following table lists the SSD detection models supported by the Vitis AI Library.

Table 4. SSD Models
No Model Name Framework
1 ssd_mobilenet_v1_coco_tf TensorFlow
2 ssd_mobilenet_v2_coco_tf
3 ssd_resnet_50_fpn_coco_tf
4 mlperf_ssd_resnet34_tf
5 ssdlite_mobilenet_v2_coco_tf
6 ssd_inception_v2_coco_tf
7 ssd_pedestrian_pruned_0_97 Caffe
8 ssd_traffic_pruned_0_9
9 ssd_adas_pruned_0_95
10 ssd_mobilenet_v2

Pose Detection

The Pose Detection library is used to detect the posture of the human body. This library includes a neural network that can identify 14 key points on the human body. The input is a picture of a person that has been detected by a pedestrian detection neural network (for example, the SSD detection library). The output is a structure containing the coordinates of each key point. The following image shows the result of pose detection.

Figure 5: Pose Detection Example

The following table lists the pose detection models supported by the Vitis AI Library.

Table 5. Pose Detection Models
No Model Name Framework
1 sp_net Caffe
Note: If the input image is arbitrary and you do not know the exact location of the person, perform SSD detection first; see the test_jpeg_posedetect_with_ssd.cpp file. The input for test_jpeg_posedetect_with_ssd can be any image, with or without a person in it. If there is a person in the image, SSD first detects the person and then sends the position of the person as the input for pose detection. If SSD does not detect any person in the image, pose detection is not run. Because test_jpeg_posedetect only performs pose detection, its input image must contain at least one person; if you input an image without a person, it throws an error. See the test_jpeg_posedetect.cpp file.
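
The two-stage flow described in this note can be sketched as follows; the model names are examples from the tables in this chapter, and the pose14pt field name follows the library's sample code:

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/posedetect.hpp>
    #include <vitis/ai/ssd.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto ssd = vitis::ai::SSD::create("ssd_pedestrian_pruned_0_97");
      auto pose = vitis::ai::PoseDetect::create("sp_net");
      // Stage 1: find the people in the image with SSD.
      auto persons = ssd->run(image);
      for (const auto& box : persons.bboxes) {
        cv::Rect roi(static_cast<int>(box.x * image.cols),
                     static_cast<int>(box.y * image.rows),
                     static_cast<int>(box.width * image.cols),
                     static_cast<int>(box.height * image.rows));
        roi &= cv::Rect(0, 0, image.cols, image.rows);  // clip to the image
        // Stage 2: run pose detection on each person crop.
        cv::Mat crop = image(roi);
        auto result = pose->run(crop);
        // result.pose14pt holds the coordinates of the 14 key points.
      }
      return 0;
    }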

Semantic Segmentation

Semantic segmentation assigns a semantic category to each pixel in the input image; that is, it identifies pixels as part of an object such as a car, a road, a tree, or a horse. Libsegmentation is a segmentation library that can be used in ADAS applications. It offers simple interfaces for a developer to deploy segmentation tasks on a Xilinx® FPGA.

The following is an example of semantic segmentation, where "blue gray" denotes the sky, "green" denotes trees, "red" denotes people, "dark blue" denotes cars, "plum" denotes the road, and "gray" denotes structures.

Figure 6: Semantic Segmentation Example
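
A minimal sketch of running a segmentation model; the run_8UC1 and run_8UC3 interfaces follow the library's sample code, where the first returns one class ID per pixel and the second returns a false-color class map:

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/segmentation.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::Segmentation::create("fpn");
      // run_8UC3 returns a false-color visualization of the per-pixel
      // class map; run_8UC1 would return the raw class IDs instead.
      auto result = model->run_8UC3(image);
      cv::imwrite("segmentation_result.jpg", result.segmentation);
      return 0;
    }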

The following table lists the semantic segmentation models supported by the Vitis AI Library.

Table 6. Semantic Segmentation Models
No Model Name Framework
1 fpn Caffe
2 FPN-resnet18_Endov
3 semantic_seg_citys_tf2 TensorFlow
4 mobilenet_v2_cityscapes_tf
5 SemanticFPN_cityscapes_pt PyTorch
6 ENet_cityscapes_pt
7 unet_chaos-CT_pt

Road Line Detection

The Road Line Detection library is used to draw lane lines in ADAS applications. Each lane line is assigned a number that represents its category, and each line is represented as a vector<Point> used for drawing. In the test code, a color map is used so that different types of lane lines are drawn in different colors. The points are stored in a vector container, and the OpenCV polygon interface cv::polylines() is used to draw the lane lines. The following image shows the result of road line detection.

Figure 7: Road Line Detection Example

The following table lists the road line detection models supported by the Vitis AI Library.

Table 7. Road Line Detection Models
No Model Name Framework
1 vpgnet_pruned_0_99 Caffe
Note: The input image size is fixed at 480x640, and images of other sizes must be resized.
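
A minimal sketch of the drawing flow described above; the lines, type, and points_cluster fields follow the library's sample code, and the color palette is only illustrative:

    #include <opencv2/opencv.hpp>
    #include <vitis/ai/lanedetect.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::RoadLine::create("vpgnet_pruned_0_99");
      auto result = model->run(image);
      // One color per lane-line category (illustrative palette).
      std::vector<cv::Scalar> colors = {
          {0, 255, 0}, {0, 0, 255}, {255, 0, 0}, {0, 255, 255}};
      for (const auto& line : result.lines) {
        std::vector<std::vector<cv::Point>> poly{line.points_cluster};
        cv::polylines(image, poly, false,
                      colors[line.type % colors.size()], 2);
      }
      cv::imwrite("roadline_result.jpg", image);
      return 0;
    }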

YOLOv3 Detection

YOLO is a neural network that is used to detect objects. The version used here is v3. The input is a picture with one or more objects, and the output is a vector of result structures composed of the detection information. The following image shows the result of YOLOv3 detection.

Figure 8: YOLOv3 Detection Example
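
A minimal sketch, following the calling pattern of the library's sample code; yolov3_voc is one of the models from the table below:

    #include <iostream>
    #include <opencv2/opencv.hpp>
    #include <vitis/ai/yolov3.hpp>

    int main(int argc, char* argv[]) {
      cv::Mat image = cv::imread(argv[1]);
      auto model = vitis::ai::YOLOv3::create("yolov3_voc");
      auto result = model->run(image);
      // Each detected object carries a class label, a confidence score,
      // and a box in coordinates relative to the image size.
      for (const auto& box : result.bboxes) {
        std::cout << "label " << box.label << " score " << box.score
                  << " box (" << box.x << ", " << box.y << ", "
                  << box.width << ", " << box.height << ")" << std::endl;
      }
      return 0;
    }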

The following table lists the YOLOv3 detection models supported by the Vitis AI Library.

Table 8. YOLOv3 Detection Models
No Model Name Framework
1 yolov3_voc_tf TensorFlow
2 yolov3_adas_pruned_0_9 Caffe
3 yolov3_voc
4 yolov3_bdd
5 yolov4_leaky_spp_m
6 tiny_yolov3_vmss

YOLOv2 Detection

YOLOv2 performs the same task as YOLOv3; YOLOv3 is an upgraded version of YOLOv2. The following table lists the YOLOv2 detection models supported by the Vitis AI Library.

Table 9. YOLOv2 Detection Models
No Model Name Framework
1 yolov2_voc Caffe
2 yolov2_voc_pruned_0_66
3 yolov2_voc_pruned_0_71
4 yolov2_voc_pruned_0_77

Openpose Detection

The Openpose Detection library is used to detect the posture of the human body. The posture is represented by an array of 14 key points, as shown below:
 0: head, 1: neck, 2: L_shoulder, 3:L_elbow, 4: L_wrist, 5: R_shoulder,
 6: R_elbow, 7: R_wrist, 8: L_hip, 9: L_knee, 10: L_ankle, 11: R_hip,
 12: R_knee, 13: R_ankle

The network input size is 368x368. The following image shows the result of openpose detection.

Note: Use a square picture as input. If you need to detect pictures with other aspect ratios, use a network with the same input aspect ratio.

Figure 9: Openpose Detection Example

The following table lists the Openpose detection models supported by the Vitis AI Library.

Table 10. Openpose Detection Models
No Model Name Framework
1 openpose_pruned_0_3 Caffe

RefineDet Detection

RefineDet is a neural network that is used to detect human bodies. The input is a picture with the individuals that you would like to detect, and the output is a vector of result structures containing each box's information. The following image shows the result of RefineDet detection:

Figure 10: RefineDet Detection Example

The following table lists the RefineDet detection models supported by the Vitis AI Library.

Table 11. RefineDet Detection Models
No Model Name Framework
1 refinedet_pruned_0_8 Caffe
2 refinedet_pruned_0_92
3 refinedet_pruned_0_96
4 refinedet_baseline
5 refinedet_VOC_tf TensorFlow

ReID Detection

The task of person re-identification (ReID) is to identify a person of interest at any time or place. This is done by extracting image features and comparing them: images of the same person should have similar features with a small feature distance, while images of different persons should have a large feature distance. Given a query image and a set of candidate images, the candidate with the smallest feature distance is identified as the same person as the query image. The following table lists the ReID detection models supported by the Vitis AI Library.

Table 12. ReID Detection Models
No Model Name Framework
1 reid Caffe
2 personreid-res18_pt PyTorch
3 personreid-res50_pt
4 facereid-large_pt
5 facereid-small_pt
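
The feature-distance comparison can be sketched as follows; it assumes, as in the library's ReID demo, that run() returns a result whose feat member is a normalized feature vector (cv::Mat), so the cosine distance reduces to 1 minus the dot product:

    #include <iostream>
    #include <opencv2/opencv.hpp>
    #include <vitis/ai/reid.hpp>

    int main(int argc, char* argv[]) {
      auto model = vitis::ai::Reid::create("reid");
      // Extract one feature vector per person image.
      cv::Mat featA = model->run(cv::imread(argv[1])).feat;
      cv::Mat featB = model->run(cv::imread(argv[2])).feat;
      // The features are assumed normalized, so the cosine distance is
      // 1 - dot product; a smaller distance means a more likely match.
      double distance = 1.0 - featA.dot(featB);
      std::cout << "feature distance: " << distance << std::endl;
      return 0;
    }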

Multi-task

The Multi-task library is appropriate for a model that has multiple sub-tasks. The Multi-task model in the Vitis AI Library has two sub-tasks: semantic segmentation and SSD detection. The following table lists the multi-task models supported by the Vitis AI Library.

Table 13. Multi-task Models
No Model Name Framework
1 multi_task Caffe
2 MT-resnet18_mixed_pt PyTorch

Face Recognition

The face feature models are used for face recognition. They extract the features of a person's face, and the output of each model is a 512-dimensional feature vector. If you have two different images and want to know whether they show the same person, use these models to extract the features of the two faces, and then use the comparison and score-mapping functions to compute the similarity of the two images.

Figure 11: Face Recognition Example

The following table lists the face recognition models supported by the Vitis AI Library.

Table 14. Face Recognition Models
No Model Name Framework
1 facerec_resnet20 Caffe
2 facerec_resnet64
3 facerec-resnet20_mixed_pt PyTorch
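
For illustration, the similarity step can be computed as a cosine similarity over the two 512-dimensional feature vectors. The sketch below uses plain std::vector<float> inputs rather than a specific result type, because the exact structure returned by each face feature model is not described here:

    #include <cmath>
    #include <vector>

    // Cosine similarity of two face feature vectors. A value close to
    // 1.0 suggests the two faces belong to the same person; lower
    // values suggest different persons.
    float cosine_similarity(const std::vector<float>& a,
                            const std::vector<float>& b) {
      float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
      for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
      }
      return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-12f);
    }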

Plate Detection

The Plate Detection library uses the DenseBox neural network to detect license plates. The input is a picture of a vehicle that has been detected by SSD, and the output is a structure containing the plate location information. The following image shows the result of plate detection.

Figure 12: Plate Detection Example

The following table lists the plate detection models supported by the Vitis AI Library.

Table 15. Plate Detection Models
No Model Name Framework
1 plate_detect Caffe

Plate Recognition

The Plate Recognition library uses a classification network to recognize the license plate number (Chinese license plates only). The input is a picture of a license plate that has been detected by the plate detection network, and the output is a structure containing the license plate number information. The following image shows the result of plate recognition.

Figure 13: Plate Recognition Example

The following table lists the plate recognition models supported by the Vitis AI Library.

Table 16. Plate Recognition Models
No Model Name Framework
1 plate_num Caffe

Medical Segmentation

Endoscopy is a common clinical procedure for the early detection of cancers in hollow organs such as nasopharyngeal cancer, esophageal adenocarcinoma, gastric cancer, colorectal cancer, and bladder cancer. Accurate and temporally consistent localization and segmentation of diseased regions of interest enable precise quantification and mapping of lesions from clinical endoscopy videos, which is critical for monitoring and surgical planning.

The medical segmentation model is used to classify diseased regions of interest in the input image. A region can be classified into one of several categories, including BE, cancer, HGD, polyp, and suspicious.

Libmedicalsegmentation is a segmentation library that can be used for the segmentation of multi-class diseases in endoscopy. It offers simple interfaces for developers to deploy segmentation tasks on Xilinx FPGAs. The following is an example of medical segmentation, where the goal is to mark the diseased region.

Figure 14: Marking the Diseased Region

The following is an example of medical segmentation, where the goal is to predict a class label for each pixel in the image.

Figure 15: Medical Segmentation Example

The following table lists the medical segmentation models supported by the Vitis AI Library.

Table 17. Medical Segmentation Models
No Model Name Framework
1 FPN_Res18_Medical_segmentation Caffe

Medical Detection

The RefineDet model is based on VGG16. It is used for medical detection and can detect five types of diseases, namely BE, cancer, HGD, polyp, and suspicious, from an input endoscopy image such as those in the Endoscopy Disease Detection and Segmentation database (EDD2020).

Figure 16: Medical Detection Example

The following table lists the medical detection models supported by the Vitis AI Library.

Table 18. Medical Detection Models
No Model Name Framework
1 RefineDet-Medical_EDD_tf TensorFlow

Medical Cell Segmentation

The nucleus is an organelle present within all eukaryotic cells, including human cells. Aberrant nuclear shape can be used to identify cancer cells, for example, in pap smear tests for the diagnosis of cervical cancer. The medical cell segmentation models offer nuclear segmentation in digital microscopic tissue images, which enables extraction of high-quality features for nuclear morphometric and other analyses in computational pathology. The following images show the results of cell segmentation.

Figure 17: Medical Cell Segmentation Examples

The following table lists the Medical Cell Segmentation models supported by the Vitis AI Library.

Table 19. Medical Cell Segmentation Models
No Model Name Framework
1 medical_seg_cell_tf2 TensorFlow

Retinaface

The Retinaface network is used to detect human faces and face landmarks. The input is a picture with the faces you would like to detect, and the output contains the face positions, scores, and landmarks.

Figure 18: Retinaface Detection Example

The following table lists the retinaface detection models supported by the Vitis AI Library.

Table 20. Retinaface Detection Models
No Model Name Framework
1 retinaface Caffe

Face Quality

The Face Quality library uses the face quality network to compute the quality score of a face. If a face is clear and frontal, the score is high; a blurry or profile face gets a low score. The score ranges from 0 to 1. The network also provides the face landmark positions. The input is a face that has been detected by the face detection network, and the output contains the quality score and five landmark key points.

Figure 19: Face Quality Example

The following table lists the face quality models supported by the Vitis AI Library.

Table 21. Face Quality Models List
No Model Name Framework
1 face-quality Caffe
2 face-quality_pt PyTorch

Hourglass

The Hourglass library is used to detect the posture of the human body. The posture is represented by an array of 16 joint points, arranged in the following order:
0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle,
6 - pelvis, 7 - thorax, 8 - upper neck, 9 - head top, 10 - r wrist,
11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist

This network can detect the posture of only one person in the input image. The network input size is 256x256. The following image shows the result of hourglass detection.

Note: Use a square picture as input. If you need to detect pictures with other aspect ratios, use a network with the same input aspect ratio.

The following table lists the hourglass models supported by the Vitis AI Library.

Table 22. Hourglass Models
No Model Name Framework
1 hourglass-pe_mpii Caffe

Pointpillars

Object detection in point clouds is an important aspect of many robotics applications, such as autonomous driving. The PointPillars model is a deep network and encoder that can be trained end-to-end on LiDAR point clouds, and it is a strong architecture for 3D object detection from LiDAR. The following image shows the result of a PointPillars test.

Figure 20: Pointpillars Test Example

The following table lists the pointpillars models supported by the Vitis AI Library.

Table 23. Pointpillar Models
No Model Name Framework
1 pointpillars_kitti_12000_0_pt PyTorch
2 pointpillars_kitti_12000_1_pt PyTorch

3D Segmentation

The 3D Segmentation library supports the SalsaNext model, which is used for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real time. SalsaNext is the successor of SalsaNet; it has an encoder-decoder architecture in which the encoder unit is a set of ResNet blocks and the decoder unit combines upsampled features from the residual blocks.

The following table lists the 3D segmentation models supported by the Vitis AI Library.

Table 24. 3D Segmentation Models
No Model Name Framework
1 salsanext_pt PyTorch

Covid19 Segmentation

The Covid19 Segmentation library supports the COVID-Net model, a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images.

The following table lists the Covid19 segmentation models supported by the Vitis AI Library.

Table 25. Covid19 Segmentation Models
No Model Name Framework
1 FPN-resnet18_covid19-seg_pt PyTorch

Model Samples

Currently, there are 27 model samples located in ~/Vitis-AI/demo/Vitis-AI-Library/samples. Each sample provides the following four kinds of test programs:

  • test_jpeg_[model type]
  • test_video_[model type]
  • test_performance_[model type]
  • test_accuracy_[model type]

Take YOLOv3 as an example.

  1. Before running the YOLOv3 detection example, choose one of the following YOLOv3 models:
    1. yolov3_bdd
    2. yolov3_voc
    3. yolov3_voc_tf
  2. Ensure that the following test programs exist:
    1. test_jpeg_yolov3
    2. test_video_yolov3
    3. test_performance_yolov3
    4. test_accuracy_yolov3_bdd
    5. test_accuracy_yolov3_adas_pruned_0_9
    6. test_accuracy_yolov3_voc
    7. test_accuracy_yolov3_voc_tf

    If the executable program does not exist, cross-compile it on the host and then copy it to the target.

  3. To test the image data, execute the following command:
    #./test_jpeg_yolov3 yolov3_bdd sample_yolov3.jpg

    The result is printed on the terminal. Also, you can view the output image: sample_yolov3_result.jpg.

  4. To test the video data, execute the following command:
    #./test_video_yolov3 yolov3_bdd video_input.mp4 -t 8
  5. To test the model performance, execute the following command:
    #./test_performance_yolov3 yolov3_bdd test_performance_yolov3.list -t 8
    The result is printed on the terminal.
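
    The list file passed to the performance test is a plain-text file naming the test images, one path per line; a hypothetical example:

      image_001.jpg
      image_002.jpg
      image_003.jpg
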
  6. To test the model accuracy, prepare your own image dataset, an image list file, and the ground truth of the images. Then execute the following command:
    #./test_accuracy_yolov3_bdd [image_list_file] [output_file]

After the output file is generated, a script is needed to compare the results against the ground truth automatically; this comparison produces the final accuracy result.