Stable Diffusion is coming up on its one-year anniversary since being released by Stability AI, and it has taken the world by storm with no signs of slowing down. A year into the release, we at Cerebrium are still seeing customers investing time and research into developing some phenomenal applications using Stable Diffusion.
Many of our customers wanted to have end-to-end pipelines as a way of implementing personalised models for their users and did not want to focus on scripts and cumbersome pipelines. They also wanted to rapidly iterate with different parameters and imagery to see what would generate the best results.
That’s why we at Cerebrium implemented an end-to-end pipeline to fine-tune Stable Diffusion v1.5 with just a few lines of code! Before we show you the workflow, it’s important to walk through some best practices and some theory behind the model so you can optimise your results. In this tutorial we will be fine-tuning a model based on our CEO, Michael, since he is the “face” of Cerebrium.
Theory
Stable Diffusion was trained on 5 billion image-text pairs that were derived from a Common Crawl dataset scraped from the Web. Thanks to this diverse training data, we only need to supply ~10–20 images in order to get phenomenal results.
The first tip is to give a diverse set of images that showcase different facial views (like front, profile, mixed angles), expressions (such as neutral, happy, sad), and diverse backgrounds. This allows our model to produce great results with a variety of prompts rather than a small subset.
Next is to tag them with a text prompt. According to DreamBooth’s guidelines, the prompt we will use should follow the format:
A [token name] [class noun]
where the [token name] will act as our personal reference, and the [class noun] represents a pre-existing category in the model’s vocabulary that pertains to us. So for Michael, a suitable prompt could be “A CerebriumCeo man.” Other possible class nouns could include woman, child, teenager, dog, or sunglasses. Using a unique identifier and class noun during training helps the model understand that we are trying to generate humans and not kitchen appliances.
However, before we refine this model, two prevalent issues need to be addressed:
1. Overfitting. Stable Diffusion will overfit on such a small set of images, regardless of how diverse they are. This means that our model will generate images in the poses and contexts present in the training pictures, which we might not want.
2. Language drift. The model replaces its notion of a class noun with the images we gave it rather than remembering what it was originally trained on. For example, using “a CerebriumCeo man” is going to make the model think that all men look like Michael.
DreamBooth’s authors proposed a solution for these two issues — a class-specific prior-preservation loss. Essentially this means that while the model is training, it tries to learn from our images and from images of the original class noun at the same time. It is recommended that around 200 prior class images be sampled for each training image we use.
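To make this more concrete, below is a minimal sketch of how prior class images could be generated with the base model before fine-tuning. This is only an illustration of the idea (Cerebrium’s pipeline handles this step for you), and the output directory name and image count are assumptions.

# Sketch: generating prior class images for prior-preservation (illustrative only).
import os
import torch
from diffusers import StableDiffusionPipeline

prior_class_prompt = "a man"           # the plain class noun prompt
num_prior_images = 200                 # ~200 per training image, per DreamBooth's guidance
output_dir = "./prior_class_images"    # placeholder output directory
os.makedirs(output_dir, exist_ok=True)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i in range(num_prior_images):
    image = pipe(prior_class_prompt).images[0]
    image.save(os.path.join(output_dir, f"prior_{i:04d}.jpg"))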
One final point to mention is that Cerebrium has built its fine-tuning Stable Diffusion pipeline using LoRA (Low-Rank Adaptation), which offers the following benefits:
The model is less likely to suffer from catastrophic forgetting because the original pre-trained weights are kept frozen
LoRA weights have far fewer parameters than the original model and are easily portable. This also leads to faster training times.
LoRA lets you control the extent to which the model is adapted toward the new training images (it supports interpolation; see the sketch after this list)
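As an illustration of that last point, here is a minimal sketch of how the LoRA weights could be blended into the frozen base model by a chosen amount using the diffusers library. The checkpoint path is the same one used in the main.py further down; the scale value is just an example.

# Sketch: interpolating between the frozen base model and the LoRA fine-tune.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs("./results/checkpoints/final/attn_procs/")

# scale=0.0 ignores the LoRA weights entirely; scale=1.0 applies them fully.
image = pipe(
    "a CerebriumCeo man",
    cross_attention_kwargs={"scale": 0.7},
).images[0]
image.save("interpolated.jpg")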
Quick Tips
Below are some quick tips, based on our experiments, to get the most out of your training.
Users have had most success with 10–12 images.
Make sure to crop your training images to a square, since images are automatically resized to the training resolution (512x512 in the config below); see the preprocessing sketch after these tips.
If your number of training iterations is too low, the model will underfit the subject’s images and won’t be able to reproduce it accurately during inference. If it’s too high, the model will overfit instead, making it unable to reproduce the subject with expressions, poses, or contexts outside of those in the training subset. A rule of thumb that has shown good results in our experiments is to use between 100 and 200 iterations per training image.
The guidance scale is a float that controls how much weight is given to the input text prompt. Lower values allow the model to take more artistic liberties, so we want a relatively high value for our use case.
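Below is a minimal sketch of the square crop and resize mentioned above, using Pillow. The directory names are placeholders (the output directory matches the train_image_dir used in the config later on).

# Sketch: center-cropping training images to a square and resizing to 512x512.
import os
from PIL import Image, ImageOps

src_dir, dst_dir = "./raw_images", "./images"  # placeholder directories
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    img = Image.open(os.path.join(src_dir, name)).convert("RGB")
    img = ImageOps.fit(img, (512, 512), Image.LANCZOS)  # center crop + resize
    img.save(os.path.join(dst_dir, name))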
Tutorial
To start, you will need to install the Cerebrium framework:
pip install --upgrade cerebrium
We will then create our base project for Stable Diffusion with the following line:
cerebrium init-trainer diffuser ./config.yaml
This initialises a new diffuser training project where the configuration is stored in the config.yaml file.
Update your config.yaml file based on the tips in the Quick Tips section above:
%YAML 1.2
---
training_type: diffuser # Type of training to run. Either "diffuser" or "transformer". In this case, "diffuser".
name: sd-test # Your name for the fine-tuning run.
hf_model_path: "runwayml/stable-diffusion-v1-5" # Path to the huggingface diffusion model to train.
train_prompt: "a CerebriumCeo man" # Your prompt to train.
log_level: "INFO" # Log level for logging. Can be "DEBUG", "INFO", "WARNING", "ERROR".
###############################################################
# Optional Parameters
###############################################################
# Diffuser params
prior_class_prompt: "a man" # Your prompt to train prior class images. Only use if you would like to train prior class images.
revision: "main" # Revision of the diffuser model to use.
validation_prompt: ~ # an optional validation prompt to use. If ~, will use the training prompt.
custom_tokenizer: "" # custom tokenizer from AutoTokenizer if required.
# Dataset params
train_image_dir: ./images/ # Directory of training images.
prior_class_image_dir: ~ # or "path/to/your/prior_class_images". Optional directory of images to use if you would like to train prior class images as well.
# Training params
training_args:
  # General training params
  learning_rate: 1.0E-5
  num_validation_images: 4 # Number of images to generate in validation.
  num_train_epochs: 1650
  seed: 1
  resolution: 512 # Resolution to train images at.
  center_crop: False # Whether to center crop images to resolution.
  train_batch_size: 2
  num_prior_class_images: 2200 # Number of prior class images to train on. If 0, will not generate any prior class images. Requires prior_class_prompt to be set.
  prior_class_generation_batch_size: 2
  prior_loss_weight: 1.0 # Weight of prior loss in the total loss if using.
  max_train_steps: ~ # Maximum training steps, which overrides the number of training epochs.
  validation_epochs: 5 # Number of epochs before running validation and checkpointing.
  # Training loop params
  gradient_accumulation_steps: 1
  lr_scheduler: "constant"
  lr_warmup_steps: 50
  lr_num_cycles: 1
  lr_power: 1.0
  allow_tf32: False
  max_grad_norm: 1.0
  mixed_precision: "no" # Whether to use mixed precision. Supports fp16 and bf16. Defaults to "no".
  prior_generation_precision: ~
  scale_lr: False
  use_8bit_adam: True
  use_xformers: True # Whether to use xformers memory efficient attention or not.
Cerebrium will then start a training job, and you can track its logs using the command:
cerebrium get-training-logs {JOB_ID}
Once the model has completed training, you should receive an email. You can then run the following command to download your model weights and use them in your deployment:
cerebrium download-model {JOB_ID}
Using our model's weights, we can then deploy our model for inference with the following main.py. Make sure you list the required libraries in your requirements.txt (a sample sketch follows the code).
from typing import Optional
from pydantic import BaseModel
from diffusers import (
    DiffusionPipeline,
    DPMSolverMultistepScheduler,
)
import torch
import io
import base64

# Boilerplate loading of model
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="main", torch_dtype=torch.float16
)
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline = pipeline.to("cuda")

# LOAD IN YOUR TRAINING RESULTS
# Load attention processors from where they are saved in
# your_results/checkpoints/final/attn_procs/pytorch_lora_weights.bin
pipeline.unet.load_attn_procs("./results/checkpoints/final/attn_procs/")


class Item(BaseModel):
    # Add your input parameters here
    prompt: str
    num_images: Optional[int] = 4
    negative_prompt: Optional[str] = ""
    guidance: Optional[float] = 6.0
    num_inference_steps: Optional[int] = 100
    seed: Optional[int] = 42


def image_to_base64(image) -> str:
    byte_arr = io.BytesIO()
    image.save(byte_arr, format='JPEG')  # format could be PNG or other formats too
    byte_data = byte_arr.getvalue()
    return base64.b64encode(byte_data).decode('utf-8')  # decode to get string from bytes


def predict(item, run_id, logger):
    item = Item(**item)
    # Run inference as you normally would
    generator = torch.Generator(device="cuda").manual_seed(item.seed)
    images_base64 = [
        image_to_base64(
            pipeline(
                item.prompt,
                num_inference_steps=item.num_inference_steps,
                generator=generator,
                guidance_scale=item.guidance,
                negative_prompt=item.negative_prompt,
            ).images[0]
        )
        for _ in range(item.num_images)
    ]
    return {"results": images_base64}  # return your results
We then deploy our model using the following command:
cerebrium deploy fine-tuned-sd --hardware A10
Once your model has been deployed, we can make the following curl request:
curl --location --request POST 'https://run.cerebrium.ai/v2/p-xxxx/fine-tuned-sd/predict' \
--header 'Authorization: public-xxxxxxx' \
--header 'Content-Type: application/json' \
--data '{"prompt": "A professional headshot photo of a CerebriumCeo man. close-up RAW photo, sharp focus, ultra-high pixel detail, intricate, realistic, movie scene, cinematic, high-quality, full colors, incredibly detailed, 4k, 8k, 16k, hyper-realistic, RAW photo, masterpiece, ultra-detailed, professionally color graded", "negative_prompt": "cartoon, anime, 3d, painting, b&w, worst quality, low quality, normal quality, low-res, skin spots, acne, skin blemishes, age spots, ugly, duplicate, morbid, mutilated, mutated hands, poorly drawn hands, blurry, bad anatomy, bad proportions, extra limbs, disfigured, missing arms, extra legs, fused fingers, too many fingers, unclear eyes, low-resolution, bad hands, missing fingers, bad hands, missing fingers, cartoon, low poly, text, signature, watermark, username"}'
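If you would rather call the endpoint from Python, here is a minimal sketch that sends the request and decodes the base64-encoded images in the response. The URL and API key placeholders mirror the curl example above, and depending on how Cerebrium wraps responses you may need to unwrap an outer key before reading "results".

# Sketch: calling the deployed endpoint and saving the returned images.
import base64
import requests

response = requests.post(
    "https://run.cerebrium.ai/v2/p-xxxx/fine-tuned-sd/predict",  # placeholder URL
    headers={
        "Authorization": "public-xxxxxxx",  # placeholder API key
        "Content-Type": "application/json",
    },
    json={"prompt": "A professional headshot photo of a CerebriumCeo man", "num_images": 2},
)
data = response.json()
for i, img_b64 in enumerate(data["results"]):
    with open(f"generated_{i}.jpg", "wb") as f:
        f.write(base64.b64decode(img_b64))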
You will see that the generated images may have some deformities around fingers, eyes, teeth, etc. You can read the article here on some potential techniques to resolve this, but one important one is to run a face-restoration model like CodeFormer after image generation. You can test it here; however, the implementation is out of scope for this tutorial.
Voila! We now have a fine-tuned Stable diffusion model that can generate images around our CEO, Michael.
Original Image (left), Generated Image (right)
There are many more optimisations we could have made to our model, such as using better-quality images, making use of in-painting, etc.; however, these were out of scope for this article. We recommend looking at the many resources on stable-diffusion-art.com to improve your generated images.
You should now be fully equipped to fine-tune Stable Diffusion models around any object or person you like. Fine-tuning is a craft, nuanced with its own quirks and tricks, so we recommend experimenting with many different types of images and hyper-parameter configurations in order to get the best results. Cerebrium makes the end-to-end process of fine-tuning easy and accessible to anyone, and we are very excited to see what the community builds! Please tag us in your image generations so we can share them with our community.