GPUs play an important role in the deployment and inference of ML models, especially large-scale models like GPT-3, BLOOM, or LLaMA. However, as ML practitioners, it is difficult to stay up to date with the latest research and techniques for serving these models. At Cerebrium, we consistently test and implement new techniques and technologies to enhance the performance of our users' underlying machine learning models. Our goal is to enable our users to focus on their end goal: creating great products and delivering value to their end-users.
In this tutorial, we introduce TensorRT, an SDK for high-performance deep learning inference from NVIDIA, as well as AITemplate from Meta AI, a unified inference system with separate acceleration backends for both AMD and NVIDIA GPU hardware.
With the rapid pace at which the ML space is advancing, developers are eager to try novel modelling techniques. However, high-performing GPU inference solutions tend to be inflexible: they are hardware-specific (AMD vs. NVIDIA) and ship as black-box runtimes. This lack of flexibility makes it difficult to iterate on and maintain the code behind these solutions, given the hardware dependencies in their complex runtime environments.
To address these industry challenges, Meta AI has developed and is open-sourcing AITemplate (AIT). It delivers close to hardware-native Tensor Core (NVIDIA GPU) and Matrix Core (AMD GPU) performance on a variety of widely used AI models such as convolutional neural networks, transformers, and diffusers. With AIT, it is now possible to run performant inference on hardware from both GPU providers. We’ve used AIT to achieve performance improvements up to 12x on NVIDIA GPUs and 4x on AMD GPUs compared with eager mode within PyTorch.
Currently, AITemplate only supports PyTorch models, but the team is open to exploring integrations with other frameworks, such as ONNX. You can look at the GitHub repo here.
TensorRT is a high-performance deep learning inference library developed by NVIDIA for optimizing deep learning models for deployment on NVIDIA GPUs. It is designed to maximize the performance and efficiency of deep learning inference applications by using advanced optimization techniques such as layer fusion, precision calibration, and kernel auto-tuning.
TensorRT is integrated with PyTorch, TensorFlow, ONNX and more, so you can achieve up to 6x faster inference with a single line of code.
In this tutorial we are going to run a Stable Diffusion model using AITemplate and TensorRT in order to see the impact on performance. As always, we will be running our experiment on an A10 from Lambda Labs.
As a baseline, when you run Stable Diffusion 2.1 with fp16 precision on an NVIDIA A10, the model typically takes 6–7 seconds to generate an image with 50 inference steps.
First, clone the AITemplate repo:
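git clone --recursive https://github.com/facebookincubator/AITemplate
The --recursive flag pulls in the third-party submodules that AITemplate builds against.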
Next, build AITemplate to run on our server using Docker, with the command:
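./docker/build.sh cuda
This should produce a Docker image tagged ait; if the script name or arguments have changed, the repo's README has the current build instructions.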
Run your Docker container with the command docker run ait and you should be inside your container. We then need to install the AITemplate Python wheel using the following:
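In practice the container needs GPU access, and the wheel is built from the source tree inside the image. The paths below are assumptions based on the AITemplate README, so adjust them if your layout differs:

```
docker run --gpus all -it ait

# inside the container: build and install the AITemplate wheel
cd /AITemplate/python
python setup.py bdist_wheel
pip install dist/aitemplate-*.whl
```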
Then go to the examples/05_stable_diffusion directory.
Run the following pip command to install the required libraries:
pip install transformers diffusers torch
Then run python3 scripts/download_pipeline.py --token ACCESS_TOKEN to download the weights for Stable Diffusion, where ACCESS_TOKEN is your Hugging Face token.
There are 3 main steps to get Stable Diffusion working on AIT: the CLIP text encoder, the UNet, and the VAE each need to be compiled into an AIT module.
In this repo, you will see the code that does this, and we can run the compile script:
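Assuming the same layout as the other scripts in this example, that is:
python3 scripts/compile.py
(The script name and any flags may differ between AITemplate versions, so check the example's README if it errors out.)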
If you get a NumPy error about typeDict, run the command below to work around it and then run the compile script again:
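pip install "numpy<1.24"
np.typeDict was removed in NumPy 1.24, so pinning to an earlier release avoids the error.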
You can then run the following command to generate an image with a prompt:
python3 scripts/demo.py --prompt "Mountain Rainier in van Gogh's world"
Your image should be generated in less than 1 second under the name example_ait.png. This was roughly an 83% speedup from implementing AITemplate.
First, let us install the required packages and frameworks.
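At a minimum, the code in this section needs PyTorch, TensorRT, and the Hugging Face libraries; exact version pins will depend on your CUDA setup, but something like this covers it:
pip install tensorrt torch diffusers transformers pillow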
Next, we need to convert the UNet from ONNX format to the format TensorRT expects. As mentioned above, Stable Diffusion has 3 parts: a variational autoencoder, a UNet, and a CLIP text encoder. If you run some basic tests, you will see that the UNet stage takes roughly 90% of the inference time, so in this tutorial we are only going to convert this part of the model.
Stable Diffusion is a PyTorch model and therefore needs to be converted into a TensorRT format. However, TensorRT has a built-in parser for converting models from ONNX, so we will be using an ONNX version of Stable Diffusion, as it's easier.
ONNX is an open format used to represent and exchange machine learning models.
Let us download the ONNX model that will be used:
!tar -xf models.tar.gz
We can then run the code below to convert our UNet model from ONNX to a TensorRT engine.
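Below is a minimal sketch of that conversion using the TensorRT Python API (TensorRT 8.x). The ONNX path, input names, and input shapes are assumptions based on a 512x512 Stable Diffusion 2.1 UNet export, so adjust them to match the model you downloaded.

```
import tensorrt as trt

# Paths are assumptions; point these at the UNet inside the extracted archive.
ONNX_PATH = "models/unet/model.onnx"
ENGINE_PATH = "unet.engine"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX UNet into a TensorRT network definition
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the UNet ONNX file")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB workspace
config.set_flag(trt.BuilderFlag.FP16)  # allow fp16 kernels

# Assumed shapes for a 512x512 SD 2.1 UNet with a batch of 2 (classifier-free guidance);
# SD 1.x uses a 768-wide text embedding instead of 1024.
assumed_shapes = {
    "sample": (2, 4, 64, 64),
    "timestep": (2,),
    "encoder_hidden_states": (2, 77, 1024),
}
profile = builder.create_optimization_profile()
for i in range(network.num_inputs):
    inp = network.get_input(i)
    if -1 in tuple(inp.shape):  # only dynamic inputs need a profile entry
        shape = assumed_shapes[inp.name]
        profile.set_shape(inp.name, shape, shape, shape)
config.add_optimization_profile(profile)

# Build and serialize the engine so it can be reused at inference time
serialized_engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized_engine)
```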
We are doing the following above: parsing the ONNX UNet into a TensorRT network, enabling fp16, defining an optimization profile for the UNet's inputs, and building and serializing the engine to disk.
This code takes about 10 minutes to execute, since the builder runs into a few memory issues on an A10 and has to try different tactics. You could enable CUDA lazy loading to defer loading data into GPU memory until it is actually needed; however, running this script as is still works.
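Lazy loading is controlled by an environment variable, set before running the conversion:
export CUDA_MODULE_LOADING=LAZY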
Below is the code we use to run the Stable Diffusion model using the TensorRT UNet engine. There are two things to note below:
1. Only the UNet runs through the TensorRT engine; the text encoder and VAE still run as regular PyTorch modules, since the UNet is the only part we converted.
2. We are manually running each step of the Stable Diffusion pipeline (text encoding, the denoising loop over the latents, decoding, etc.); this is done automatically under the hood by the diffusers library when you use StableDiffusionPipeline.from_pretrained().
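A condensed sketch of that manual loop is shown below. The model ID, engine path, binding order, and timestep dtype are assumptions (a Stable Diffusion 2.1 checkpoint at 512x512, with the UNet engine built above), so treat it as a starting point rather than a drop-in script.

```
import torch
import tensorrt as trt
from PIL import Image
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, EulerDiscreteScheduler

MODEL_ID = "stabilityai/stable-diffusion-2-1"  # assumption: match this to your ONNX export
ENGINE_PATH = "unet.engine"
device = "cuda"
torch.set_grad_enabled(False)

# 1. The parts we did NOT convert still run as regular PyTorch modules
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae").to(device)
scheduler = EulerDiscreteScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

# 2. Load the serialized TensorRT engine for the UNet
logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

def trt_unet(sample, timestep, encoder_hidden_states):
    """One UNet step through the TensorRT engine, using torch CUDA tensors as buffers."""
    out = torch.empty_like(sample)
    # Binding order (sample, timestep, encoder_hidden_states, out) is an assumption;
    # check the engine bindings for your export. With only the FP16 flag set, I/O stays fp32.
    bindings = [int(t.data_ptr()) for t in (sample, timestep, encoder_hidden_states, out)]
    context.execute_v2(bindings)
    return out

# 3. Encode the prompt plus an empty prompt for classifier-free guidance
prompt = "Mountain Rainier in van Gogh's world"
text_ids = tokenizer([prompt, ""], padding="max_length",
                     max_length=tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids.to(device)
text_embeddings = text_encoder(text_ids)[0]

# 4. The denoising loop, driving the TensorRT UNet manually
guidance_scale = 7.5
scheduler.set_timesteps(50, device=device)
latents = torch.randn(1, 4, 64, 64, device=device) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t).contiguous()
    timestep = torch.full((2,), float(t), device=device)  # dtype assumption: fp32 timesteps
    noise_pred = trt_unet(latent_in, timestep, text_embeddings)
    noise_cond, noise_uncond = noise_pred.chunk(2)
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 5. Decode the final latents with the VAE and save the image
image = vae.decode(latents / vae.config.scaling_factor).sample
image = ((image / 2 + 0.5).clamp(0, 1) * 255).to(torch.uint8)
Image.fromarray(image[0].permute(1, 2, 0).cpu().numpy()).save("example_trt.png")
```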
You should now be able to run the script above.
Within 3 seconds, you should have an image generated in your current directory. We achieved a 2x speedup by converting the Stable Diffusion UNet to a TensorRT engine.
Both AITemplate and TensorRT offer significant improvements in model inference speed with minimal upfront effort. Additionally, both are compatible with a variety of frameworks and hardware configurations, making them viable options to consider for your business.