
How to speed up Stable Diffusion to a 2 second inference time (a ~5x improvement)
Michael Louis
Co-Founder & CEO

Stable Diffusion remains one of the most popular large generative models on the Cerebrium platform, and users are increasingly incorporating it into production applications. Because of customer demand, and the impact inference time has on the user experience, we had to react as a team and improve the speed and scalability of our Stable Diffusion model.

Below we show a series of 3 optimisations that allow us to achieve a ~2 second latency for Stable Diffusion with a negligible loss in output quality. This is down from an original inference time of 11.26s, roughly a 5.5x speedup. This article is a continuation of our previous post on DeepSpeed, a deep learning optimisation library.

To run this demo, we used an NVIDIA A10G from Lambda Labs.

Tutorial

Environment

First, let us install the required libraries:


pip install torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
pip install deepspeed==0.7.4 --upgrade
pip install diffusers==0.6.0 triton==2.0.0.dev20221031 --upgrade
pip install transformers[sentencepiece]==4.24.0 accelerate --upgrade


Optimisation 1:

Use torch.float16 instead of torch.float32 with mixed precision from PyTorch.

Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). On A100 GPUs, peak float16 matrix multiplication and convolution performance is 16x higher than peak float32 performance. Since float16 values are half the size of float32 values, they can double the throughput of bandwidth-bound kernels and reduce the memory required to run a network, allowing for larger models, larger batches, or larger inputs. This makes it easy to get the speed and memory benefits of a lower-precision data type while preserving output quality.


pipeline = DiffusionPipeline.from_pretrained(
    HF_MODEL_ID,
    torch_dtype=torch.float16,   # load the weights in half precision
    revision="fp16",             # use the fp16 branch of the model repo
    use_auth_token=HF_TOKEN,
).to("cuda")

Current inference: 4.57s

Result: a ~60% reduction in inference time
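
Note that the snippet above converts the stored weights themselves to float16. PyTorch's automatic mixed precision, torch.autocast, is a related option that keeps the weights in float32 and casts matrix multiplications and convolutions to float16 on the fly. A minimal sketch, assuming the same HF_MODEL_ID and HF_TOKEN and an example prompt (this is not the variant we timed above):


pipeline = DiffusionPipeline.from_pretrained(HF_MODEL_ID, use_auth_token=HF_TOKEN).to("cuda")

# autocast casts matmuls and convolutions to float16 at runtime while
# keeping the stored weights in float32
with torch.autocast("cuda", dtype=torch.float16):
    image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
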

Optimisation 2:

Use torch.inference_mode or torch.no_grad.

InferenceMode is a newer context manager, analogous to no_grad, to be used when you are certain your operations will have no interactions with autograd (i.e. you are running inference, not training). Code run under this mode gets better performance by disabling view tracking and version counter bumps.
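
For example, the generation call can simply be wrapped in the context manager. A minimal sketch, assuming the pipeline loaded in Optimisation 1 and an example prompt:


prompt = "a photo of an astronaut riding a horse on mars"

# disable autograd bookkeeping for the whole forward pass
with torch.inference_mode():
    image = pipeline(prompt).images[0]
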

Current inference: 4.53s
Result: <1% further reduction in inference time

Optimisation 3:

As covered in our previous articles, DeepSpeed is a deep learning optimisation library that is beneficial in both training and inference use cases.

DeepSpeed Inference introduces several features to efficiently serve transformer-based PyTorch models, including custom fused GPU kernels. Behind the scenes, DeepSpeed Inference replaces compatible layers with optimised versions whenever they match DeepSpeed's internally registered layers.


deepspeed.init_inference(
    model=getattr(pipeline, "model", pipeline),  # the Transformers model (falls back to the pipeline itself)
    mp_size=1,                         # number of GPUs to use for model parallelism
    dtype=torch.float16,               # dtype of the weights (fp16)
    replace_method="auto",             # let DeepSpeed automatically identify the layers to replace
    replace_with_kernel_inject=False,  # set to True to also inject DeepSpeed's fused kernels
)

Current inference: 2.68s
Result: a further ~41% reduction in inference time

Your finished code should look something like this:


from diffusers import DiffusionPipeline
import torch
import deepspeed
from time import perf_counter
import numpy as np

HF_MODEL_ID = "CompVis/stable-diffusion-v1-4"
HF_TOKEN = ""  # your hf token: https://huggingface.co/settings/tokens

pipe = DiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16, use_auth_token=HF_TOKEN).to("cuda")

# Helper used to measure pipeline latency
def measure_latency(pipe, prompt):
    latencies = []
    # warm up
    pipe.set_progress_bar_config(disable=True)
    for _ in range(2):
        _ =  pipe(prompt)
    # Timed run
    for _ in range(10):
        start_time = perf_counter()
        _ = pipe(prompt)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_s = np.mean(latencies)
    time_std_s = np.std(latencies)
    time_p95_s = np.percentile(latencies,95)
    return f"P95 latency (seconds) - {time_p95_s:.2f}; Average latency (seconds) - {time_avg_s:.2f} +\- {time_std_s:.2f};", time_p95_s


prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    deepspeed.init_inference(
        model=getattr(pipe, "model", pipe),   # the Transformers model (falls back to the pipeline itself)
        mp_size=1,                            # number of GPUs to use for model parallelism
        dtype=torch.float16,                  # dtype of the weights (fp16)
        replace_method="auto",                # let DeepSpeed automatically identify the layers to replace
        replace_with_kernel_inject=False,     # set to True to also inject DeepSpeed's fused kernels
    )
    ds_results = measure_latency(pipe, prompt)

    print(f"DeepSpeed model: {ds_results[0]}")


There are some other optimisations you can explore:

  • Look at using CUDA graphs. You can capture a sequence of GPU operations once and replay it as a single unit, rather than launching each kernel individually. This reduces overhead because execution does not return to Python between kernel launches (see the sketch after this list).
  • Look at VoltaML. It is another deep learning optimisation library and has shown some promising results.
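
As a rough illustration of the CUDA graphs API (torch.cuda.CUDAGraph, available since PyTorch 1.10), here is a minimal sketch using a toy module rather than the full diffusion pipeline; capturing the real pipeline additionally requires static input shapes and dedicated static input/output tensors:


import torch

model = torch.nn.Linear(512, 512).half().cuda()   # toy stand-in for a real network
static_input = torch.randn(16, 512, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture, as recommended by PyTorch
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input tensor, then launch the
# whole captured graph with a single call
static_input.copy_(torch.randn(16, 512, device="cuda", dtype=torch.float16))
g.replay()
print(static_output.float().mean())
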

If you would like to deploy our 2s Stable Diffusion model to Serverless GPUs with one click, you can do so on Cerebrium. Click here to get started.
