Overview

The Xilinx Inference Server is the fastest new way to deploy XModels from your Vitis™ AI environment for inference. You no longer need to write custom logic with the Vitis AI Runtime libraries for each XModel. Instead, you can use the Vitis AI tools to compile and prepare your XModel (or grab a trained one from the Vitis AI Model Zoo), and then use the Inference Server to make the XModel available for servicing inference requests. These requests can be made easily with the included Python API, which provides methods to load your XModel and make an inference directly without touching any C++. In addition to ease of use, the Inference Server provides a high-performance, scalable solution that leverages all the FPGAs on your machine, or even in your cluster with Kubernetes and KServe. In the future, we plan to support other machine learning frameworks and even GPUs to create an all-in-one solution for heterogeneous machine learning inference.

The AMD Xilinx Inference Server is open-sourced on GitHub and under active development. Clone the repository and try it out! Take a look through the documentation to learn how to get started: set up the environment and walk through some examples.


How to Start

Say you wanted to make some inferences with a trained ResNet50 model on your Alveo™ U250 data center accelerator card. You’d be in luck, as there’s already a trained XModel for this platform in the Vitis AI Model Zoo. But before you can use the Inference Server, you need to prepare your host and board. Follow the instructions in the Vitis AI repository to install the Xilinx Runtime (XRT), the AMD Xilinx Resource Manager (XRM), and the target platform on the Alveo card. Once your host and card are set up, you’re ready to use the server. Note that the following example is adapted from the documentation, which always has the most up-to-date version of these instructions.

$ git clone https://github.com/Xilinx/inference-server.git
$ cd inference-server
$ ./proteus dockerize

First, we clone the repository and build the Docker image that runs the server. The resulting image contains all the dependencies to build, test, and run the Inference Server. By using containers, we can easily run the server and deploy it onto clusters.

$ ./proteus run --dev

Once the image is built, we can start a container with this command. It mounts our local directory into the container for development, passes along any FPGAs on the host, and drops us into a terminal inside it. The rest of these instructions are run inside the container.

$ proteus build --all

In the container, we can build the server executable. Once the executable is built, we’re ready to use it for inference. One easy way to do this is using a Python script, which we break down next.

import proteus

To simplify interacting with the server from Python, we provide a Python library that we can import into our script.

server = proteus.Server()
client = proteus.RestClient("127.0.0.1:8998")
server.start()
client.wait_until_live()

Next, we can create our server and client. We point our client to the address where the server is running (by default, the server will be running on the localhost at port 8998). Then, we can start our server and let our client wait until the server is live.

parameters = {"xmodel": path_to_xmodel}
response = client.load("Xmodel", parameters)
worker_name = response.html

while not client.model_ready(worker_name):
    pass

Since we want to run the ResNet50 XModel, we load the XModel worker and pass it the path to the XModel we downloaded from the Vitis AI Model Zoo. The server responds with an endpoint that we can use for subsequent interactions with this worker. We then wait until the worker is ready.
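The busy-wait loop above keeps a CPU core spinning while the worker loads. As a minimal variation, assuming client.model_ready keeps the signature shown above (the sleep interval and the 60-second timeout are illustrative additions, not part of the server's API), you can poll with a short sleep instead:

import time

# Poll the worker roughly ten times per second instead of spinning,
# and give up if it has not become ready within the timeout.
deadline = time.time() + 60
while not client.model_ready(worker_name):
    if time.time() > deadline:
        raise RuntimeError(f"worker {worker_name} did not become ready in time")
    time.sleep(0.1)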

images = []
for _ in range(batch_size):
    images.append(path_to_image)
images = preprocess(images)

request = proteus.ImageInferenceRequest(images, True)
response = client.infer(worker_name, request)

Now, we’re ready to make an inference. We can prepare a batch of images to send to the server and preprocess them in Python using custom logic. Finally, we can prepare the request using the preprocessed images and send it to the server for inference. The response can then be parsed, postprocessed, and evaluated.
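The preprocess and postprocess steps are user-defined, so they are not part of the server's API. As a rough sketch of what they might look like for a ResNet50 classifier, the helpers below assume OpenCV and NumPy are available, resize each image to 224x224, subtract placeholder mean values, and pick the top-scoring classes from a flat array of scores; check the Vitis AI Model Zoo documentation for the exact preprocessing parameters of the XModel you downloaded.

import cv2
import numpy as np

def preprocess(paths, size=(224, 224), mean=(104.0, 107.0, 123.0), scale=1.0):
    """Load images from disk and convert them to the layout the XModel expects.

    The mean and scale values here are placeholders; use the ones published
    for the specific ResNet50 XModel you downloaded.
    """
    batch = []
    for path in paths:
        image = cv2.imread(path)         # BGR, HWC, uint8
        image = cv2.resize(image, size)  # match the model's input size
        image = (image.astype(np.float32) - np.array(mean, np.float32)) * scale
        batch.append(image)
    return batch

def postprocess(scores, k=5):
    """Return the indices of the top-k classes from a flat array of scores."""
    scores = np.asarray(scores).flatten()
    return np.argsort(scores)[::-1][:k]

How the raw scores are read out of the response object depends on the version of the Python API, so refer to the examples in the repository for the exact fields to use.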


Next Steps

The example above shows the basic method of interacting with the AMD Xilinx Inference Server. Check out the documentation to learn more about automatic batching, the C++ API, deploying on a cluster, user-defined parallelism, and running end-to-end inferences. Stay tuned to the AMD Xilinx Inference Server repository for future updates!


About Bingqing Guo

Bingqing Guo is a SW & AI Product Marketing Manager at CPG AMD. Bingqing has worked in the marketing of AI acceleration solutions for years. Through her understanding of the market and effective promotion strategies, she has helped more users adopt AMD Vitis AI in their product development and recognize the performance improvements it brings.
