Stable Diffusion is still one of the most popular large generative models that users are incorporating into production applications on the Cerebrium platform. Due to customer demand, and the impact that inference time has on the user experience, we set out as a team to improve the speed and scalability of our Stable Diffusion model.
Below we show a series of three optimisations that allow us to achieve roughly 2-second latency for Stable Diffusion with a negligible loss in output quality. This is down from an original inference time of 11.26s, more than a 5x speedup. This article is a continuation of our previous post on DeepSpeed, a deep learning optimisation library.
To run this demo, we used an NVIDIA A10G from Lambda Labs.
First, let us install the required libraries:
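The exact package set below is an assumption based on the steps that follow (PyTorch, diffusers, and DeepSpeed); pin versions that match your CUDA toolkit:

```bash
pip install torch diffusers transformers accelerate deepspeed
```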
Use torch.float16 instead of torch.float32 (half precision in PyTorch).
Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs. Since float16 values are half the size of float32 values, they can double the performance of bandwidth-bound kernels and reduce the memory required to run the model, allowing for larger models, larger batches, or larger inputs. This makes it easy to get the speed and memory benefits of lower-precision data types with a negligible impact on output quality.
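As a minimal sketch (the model ID and prompt below are placeholders, not necessarily what we use in production), loading the pipeline in half precision with diffusers looks like this:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline weights in FP16 rather than the default FP32.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model ID
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```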
Current inference: 4.57s
Result: a ~60% reduction in inference time
Use torch.inference_mode or torch.no_grad.
InferenceMode is a newer context manager, analogous to no_grad, to be used when you are certain your operations will have no interactions with autograd (i.e. you are not training the model). Code run under this mode gets better performance by disabling view tracking and version counter bumps.
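A sketch of wrapping the generation call, reusing the pipe object from the previous step:

```python
import torch

prompt = "a photo of an astronaut riding a horse on mars"

# inference_mode disables view tracking and version counter bumps
# on tensors created inside the block.
with torch.inference_mode():
    image = pipe(prompt).images[0]
```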
Current inference: 4.53s
Result: a <1% reduction in inference time
As covered in our previous articles, DeepSpeed is a deep learning optimisation library that is beneficial in both training and inference use cases.
DeepSpeed Inference introduces several features to efficiently serve transformer-based PyTorch models with custom fused GPU kernels. Behind the scenes, it replaces supported layers with optimised versions if they match DeepSpeed's internally registered layers.
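A sketch of the injection step, reusing the pipe and prompt from above (the exact keyword arguments accepted by init_inference vary between DeepSpeed versions):

```python
import torch
import deepspeed

# Wrap the pipeline so DeepSpeed can swap supported layers for its
# fused inference kernels. mp_size=1 means no model parallelism.
pipe = deepspeed.init_inference(
    pipe,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

image = pipe(prompt).images[0]
```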
Current inference: 2.68s
Result: a ~41% reduction in inference time
Your finished code should look something like this:
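The sketch below simply combines the three steps; the model ID, prompt, and DeepSpeed arguments are illustrative and may need adjusting for your diffusers and DeepSpeed versions.

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# 1. Load the pipeline weights in half precision (FP16).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model ID
    torch_dtype=torch.float16,
).to("cuda")

# 2. Inject DeepSpeed's fused inference kernels into the supported layers.
pipe = deepspeed.init_inference(
    pipe,
    mp_size=1,                          # no model parallelism
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# 3. Generate without autograd bookkeeping.
prompt = "a photo of an astronaut riding a horse on mars"
with torch.inference_mode():
    image = pipe(prompt).images[0]

image.save("output.png")
```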
There are some other optimisations you can look into:
If you would like to deploy our 2s Stable Diffusion model to Serverless GPUs with one click, you can do so on Cerebrium. Click here to get started.