GPUs play an important role in the deployment and inference of ML models, especially large-scale models like GPT-3, BLOOM, or LLaMA. However, as ML practitioners, it is difficult to stay up to date with the latest research and techniques for serving these models. At Cerebrium, we consistently test and implement new techniques and technologies to enhance the performance of our users' underlying machine learning models. Our goal is to enable our users to focus on their end goal: creating great products and delivering value to their end-users.
In this tutorial, we introduce TensorRT, an SDK for high-performance deep learning inference from NVIDIA, as well as AITemplate from Meta AI, a unified inference system with separate acceleration backends for both AMD and NVIDIA GPU hardware.
With the rapid pace at which the ML space is advancing, developers are eager to try novel modelling techniques. However, high-performing GPU inference solutions tend to be inflexible: they are hardware-specific (AMD vs. NVIDIA) and ship as black-box runtimes. This lack of flexibility makes it difficult to iterate on and maintain the code behind these solutions, given the hardware dependencies in their complex runtime environments.
To address these industry challenges, Meta AI has developed and is open-sourcing AITemplate (AIT). It delivers close to hardware-native Tensor Core (NVIDIA GPU) and Matrix Core (AMD GPU) performance on a variety of widely used AI models such as convolutional neural networks, transformers, and diffusers. With AIT, it is now possible to run performant inference on hardware from both GPU providers. We’ve used AIT to achieve performance improvements up to 12x on NVIDIA GPUs and 4x on AMD GPUs compared with eager mode within PyTorch.
Currently, AITemplate only supports PyTorch models, but the team is open to exploring integrations with other frameworks, such as ONNX. You can look at the GitHub repo here.
TensorRT is a high-performance deep learning inference library developed by NVIDIA for optimizing deep learning models for deployment on NVIDIA GPUs. It is designed to maximize the performance and efficiency of deep learning inference applications by using advanced optimization techniques such as layer fusion, precision calibration, and kernel auto-tuning.
TensorRT is integrated with PyTorch, TensorFlow, ONNX and more, so you can achieve up to 6x faster inference with a single line of code.
In this tutorial we are going to run a Stable Diffusion model using AITemplate and TensorRT in order to see the impact on performance. As always, we will be running our experiment on an A10 from Lambda Labs.
As a baseline, when you run Stable Diffusion 2.1 with fp16 precision on an NVIDIA A10, the model typically takes 6–7 seconds to generate an image with 50 inference steps.
First, clone the AITemplate repo:
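git clone --recursive https://github.com/facebookincubator/AITemplate
The --recursive flag pulls in the third-party submodules that AITemplate builds against.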
Next, build AITemplate to run on our server using Docker, with the command:
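./docker/build.sh cuda
This should produce a Docker image tagged ait; if the script name or arguments have changed, the repo's README has the current build instructions.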
Run your Docker container with the command docker run ait and you should be inside your container. We then need to install the AITemplate Python wheel using the following:
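In practice the container needs GPU access, and the wheel is built from the source tree inside the image. The paths below are assumptions based on the AITemplate README, so adjust them if your layout differs:

```
docker run --gpus all -it ait

# inside the container: build and install the AITemplate wheel
cd /AITemplate/python
python setup.py bdist_wheel
pip install dist/aitemplate-*.whl
```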
Then go to the examples/05_stable_diffusion directory.
Run the following pip command to install the required libraries:
pip install transformers diffusers torch
Then run python3 scripts/download_pipeline.py --token ACCESS_TOKEN to download the weights for Stable Diffusion, where ACCESS_TOKEN is your Hugging Face token.
There are 3 main steps to get Stable Diffusion working on AIT: the CLIP text encoder, the UNet, and the VAE each need to be compiled into an AIT module.
In this repo, you will see the code that does this, and we can run the compile script:
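Assuming the same layout as the other scripts in this example, that is:
python3 scripts/compile.py
(The script name and any flags may differ between AITemplate versions, so check the example's README if it errors out.)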
If you get a NumPy error about typeDict, run the command below to work around it and then run the compile script again:
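pip install "numpy<1.24"
np.typeDict was removed in NumPy 1.24, so pinning to an earlier release avoids the error.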
You can then run the following command to generate an image with a prompt:
python3 scripts/demo.py --prompt "Mountain Rainier in van Gogh's world"
Your image should be generated in less than 1 second under the name example_ait.png. This was roughly an 83% speedup from implementing AITemplate.
First, let us install the required packages and frameworks.
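At a minimum, the code in this section needs PyTorch, TensorRT, and the Hugging Face libraries; exact version pins will depend on your CUDA setup, but something like this covers it:
pip install tensorrt torch diffusers transformers pillow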
Next, we need to convert the UNet from ONNX format to the format TensorRT expects. As mentioned above, Stable Diffusion has 3 parts: a variational autoencoder, a UNet, and a CLIP text encoder. If you run some basic tests, you will see that the UNet stage takes roughly 90% of the inference time, so in this tutorial we are only going to convert this part of the model.
Stable Diffusion is a PyTorch model and therefore needs to be converted into a TensorRT format. However, TensorRT has a built-in parser for converting models from ONNX, so we will be using an ONNX version of Stable Diffusion, as it's easier.
ONNX is an open format used to represent and exchange machine learning models.
Let us download the ONNX model that will be used:
!tar -xf models.tar.gz
We can then run the code below to convert our UNet model from ONNX to a TensorRT engine.
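Below is a minimal sketch of that conversion using the TensorRT Python API (TensorRT 8.x). The ONNX path, input names, and input shapes are assumptions based on a 512x512 Stable Diffusion 2.1 UNet export, so adjust them to match the model you downloaded.

```
import tensorrt as trt

# Paths are assumptions; point these at the UNet inside the extracted archive.
ONNX_PATH = "models/unet/model.onnx"
ENGINE_PATH = "unet.engine"

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX UNet into a TensorRT network definition
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the UNet ONNX file")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB workspace
config.set_flag(trt.BuilderFlag.FP16)  # allow fp16 kernels

# Assumed shapes for a 512x512 SD 2.1 UNet with a batch of 2 (classifier-free guidance);
# SD 1.x uses a 768-wide text embedding instead of 1024.
assumed_shapes = {
    "sample": (2, 4, 64, 64),
    "timestep": (2,),
    "encoder_hidden_states": (2, 77, 1024),
}
profile = builder.create_optimization_profile()
for i in range(network.num_inputs):
    inp = network.get_input(i)
    if -1 in tuple(inp.shape):  # only dynamic inputs need a profile entry
        shape = assumed_shapes[inp.name]
        profile.set_shape(inp.name, shape, shape, shape)
config.add_optimization_profile(profile)

# Build and serialize the engine so it can be reused at inference time
serialized_engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized_engine)
```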
We are doing the following above: parsing the ONNX UNet into a TensorRT network, enabling fp16, defining an optimization profile for the UNet's inputs, and building and serializing the engine to disk.
This code takes about 10 minutes to execute, since the builder runs into a few memory issues on an A10 and has to try different tactics. You could enable CUDA lazy loading to defer loading data into GPU memory until it is actually needed; however, running this script as is still works.
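Lazy loading is controlled by an environment variable, set before running the conversion:
export CUDA_MODULE_LOADING=LAZY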
Below is the code we use to run the Stable Diffusion model using the TensorRT UNet engine. There are two things to note below:
1. Only the UNet runs through the TensorRT engine; the text encoder and VAE still run as regular PyTorch modules, since the UNet is the only part we converted.
2. We are manually running each step of the Stable Diffusion pipeline (text encoding, the denoising loop over the latents, decoding, etc.); this is done automatically under the hood by the diffusers library when you use StableDiffusionPipeline.from_pretrained().
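A condensed sketch of that manual loop is shown below. The model ID, engine path, binding order, and timestep dtype are assumptions (a Stable Diffusion 2.1 checkpoint at 512x512, with the UNet engine built above), so treat it as a starting point rather than a drop-in script.

```
import torch
import tensorrt as trt
from PIL import Image
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, EulerDiscreteScheduler

MODEL_ID = "stabilityai/stable-diffusion-2-1"  # assumption: match this to your ONNX export
ENGINE_PATH = "unet.engine"
device = "cuda"
torch.set_grad_enabled(False)

# 1. The parts we did NOT convert still run as regular PyTorch modules
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae").to(device)
scheduler = EulerDiscreteScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

# 2. Load the serialized TensorRT engine for the UNet
logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

def trt_unet(sample, timestep, encoder_hidden_states):
    """One UNet step through the TensorRT engine, using torch CUDA tensors as buffers."""
    out = torch.empty_like(sample)
    # Binding order (sample, timestep, encoder_hidden_states, out) is an assumption;
    # check the engine bindings for your export. With only the FP16 flag set, I/O stays fp32.
    bindings = [int(t.data_ptr()) for t in (sample, timestep, encoder_hidden_states, out)]
    context.execute_v2(bindings)
    return out

# 3. Encode the prompt plus an empty prompt for classifier-free guidance
prompt = "Mountain Rainier in van Gogh's world"
text_ids = tokenizer([prompt, ""], padding="max_length",
                     max_length=tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids.to(device)
text_embeddings = text_encoder(text_ids)[0]

# 4. The denoising loop, driving the TensorRT UNet manually
guidance_scale = 7.5
scheduler.set_timesteps(50, device=device)
latents = torch.randn(1, 4, 64, 64, device=device) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t).contiguous()
    timestep = torch.full((2,), float(t), device=device)  # dtype assumption: fp32 timesteps
    noise_pred = trt_unet(latent_in, timestep, text_embeddings)
    noise_cond, noise_uncond = noise_pred.chunk(2)
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 5. Decode the final latents with the VAE and save the image
image = vae.decode(latents / vae.config.scaling_factor).sample
image = ((image / 2 + 0.5).clamp(0, 1) * 255).to(torch.uint8)
Image.fromarray(image[0].permute(1, 2, 0).cpu().numpy()).save("example_trt.png")
```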
You should now be able to run the script above.
Within 3 seconds, you should have an image generated in your current directory. We achieved a 2x speedup by converting the Stable Diffusion UNet to a TensorRT engine.
Both AITemplate and TensorRT offer significant improvements in model inference speed with minimal upfront effort. Additionally, both are compatible with a variety of frameworks and hardware configurations, making them viable options to consider for your business.