Stable diffusion is still one of the most popular large generative models on the Cerebrium platform that users are incorporating into production applications. Due to customer demand and the impact that inference time has on the user experience, we have had to react as a team and improve the speed and scalability of our Stable Diffusion model.

Below we show a series of 3 optimisations that allow us to achieve a ~2 second latency time for Stable Diffusion with a negligible loss in output quality. This is down from an original inference time of 11.26s, roughly a 5.5x speedup. This article is a continuation of our previous post regarding DeepSpeed, a deep learning optimisation library.

To run this demo, we used an NVIDIA A10G from Lambda Labs.

First, let us install the required libraries:
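The exact packages and versions will depend on your environment; the list below is an assumption based on the steps that follow (PyTorch, the Hugging Face diffusers stack, and DeepSpeed):

```shell
pip install torch diffusers transformers accelerate deepspeed
```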

Use torch.float16 instead of torch.float32 with mixed precision from PyTorch.

Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs. Since float16 values are half the size of float32 values, they can double the performance of bandwidth-bound kernels and reduce the memory required to train a network, allowing for larger models, larger batches, or larger inputs. This makes it easy to get the speed and memory usage benefits of lower precision data types while preserving convergence behaviour.
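As a minimal illustration of the memory saving (plain PyTorch, no model weights involved):

```python
import torch

# A float32 tensor and its half-precision copy
x_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
x_fp16 = x_fp32.to(torch.float16)

# Each float16 element takes 2 bytes instead of 4, halving the
# memory footprint of weights and activations
print(x_fp32.element_size())  # 4
print(x_fp16.element_size())  # 2
```

In diffusers, applying this to the pipeline typically amounts to passing `torch_dtype=torch.float16` to `from_pretrained` and moving the pipeline to the GPU.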

**Current inference:** 4.57s

**Result**: This resulted in a 60% gain in inference speed

Use torch.inference_mode or torch.no_grad.

InferenceMode is a new context manager analogous to no_grad, to be used when you are certain your operations will have no interactions with autograd (i.e. you are running inference, not training). Code run under this mode gets better performance by disabling view tracking and version counter bumps.
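A minimal sketch of the context manager in action:

```python
import torch

x = torch.randn(4, requires_grad=True)

# Outside the context, operations are recorded for autograd
y_train = x * 2
print(y_train.requires_grad)  # True

# Inside inference_mode, no graph is built and view/version
# tracking is skipped, which shaves off per-op overhead
with torch.inference_mode():
    y_infer = x * 2
print(y_infer.requires_grad)  # False
```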

**Current inference:** 4.53s

**Result:** <1% gain in inference speed

From our previous articles you may know that DeepSpeed is a deep learning optimisation library that is beneficial in both training and inference use cases.

DeepSpeed introduces several features to efficiently serve transformer-based PyTorch models with custom fused GPU kernels. Behind the scenes, DeepSpeed Inference replaces supported layers with their optimised versions if they match DeepSpeed's internally registered layers.
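A sketch of how this might look for the Stable Diffusion pipeline. It assumes a CUDA device and the `deepspeed` package; the checkpoint ID is an example, and the `init_inference` arguments shown are the commonly used ones, so check the DeepSpeed documentation for your version:

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision on the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace supported modules in the UNet with DeepSpeed's
# fused-kernel implementations
pipe.unet = deepspeed.init_inference(
    pipe.unet,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

image = pipe("a photo of an astronaut riding a horse").images[0]
```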

**Current inference:** 2.68s

**Result:** 41% gain in inference speed

Your finished code should look something like this.
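Putting the three steps together, a sketch of the final version might look like the following (the checkpoint ID and prompt are placeholder examples, and a CUDA GPU is required):

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# Step 1: load the weights in half precision (float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Step 3: swap in DeepSpeed's fused kernels for the UNet
pipe.unet = deepspeed.init_inference(
    pipe.unet,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Step 2: run generation without autograd bookkeeping
with torch.inference_mode():
    image = pipe("a photo of an astronaut riding a horse").images[0]

image.save("output.png")
```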

There are some other optimisations you can look at doing:

- Look at using CUDA graphs. You can capture a sequence of GPU operations as a graph and replay it with a single launch, rather than launching each kernel individually. This reduces overhead because control does not return to Python between kernels.
- Look at VoltaML. They are also a deep learning optimization library and have some promising results.
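For the CUDA graphs idea, a minimal capture-and-replay sketch in PyTorch (requires a CUDA device; the small convolution here is a stand-in for a real model, following the warm-up pattern from the PyTorch CUDA graphs documentation):

```python
import torch

model = torch.nn.Conv2d(3, 8, 3, padding=1).cuda().half()
static_input = torch.randn(1, 3, 64, 64, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the captured input tensor, then
# relaunch the whole recorded kernel sequence with one call
static_input.copy_(torch.randn_like(static_input))
g.replay()
```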

If you would like to deploy our 2s Stable Diffusion model to Serverless GPUs with one click, you can on Cerebrium. Click here to get started.
