Over our last few posts we have discussed the hype around large language and generative AI models, and how you can decrease both inference and training time. As our users have begun to use and fine-tune these models, they naturally want to fine-tune and deploy models containing hundreds of billions of parameters, with the aim of boosting performance for their specific use cases.
Typically this would be a very demanding task, requiring large amounts of compute and the storage of 40GB checkpoints, which is infeasible on ordinary computer hardware. Besides the power and storage required, fine-tuning models of this nature takes a long time and is inherently very expensive, until now.
Introducing the PEFT library by Hugging Face, a library that supports Parameter-Efficient Fine-Tuning methods such as LoRA and Prefix Tuning. These methods enable the efficient adaptation of pre-trained language models to various downstream applications without fine-tuning all the model parameters, and they achieve performance comparable to that of full fine-tuning.
An important paradigm of natural language processing consists of large-scale pre-training on general domain data followed by adaptation to particular tasks or domains. With full fine-tuning, we update the entire set of model parameters for the target task. While full fine-tuning obtains good performance, it is memory-consuming during training because gradients and optimizer states for all parameters must be stored. Moreover, keeping a copy of the model parameters for each task during inference is inconvenient since pre-trained models are usually large.
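To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 storage (4 bytes per value) and an Adam-style optimizer that keeps two extra values per trainable parameter, and it ignores activations entirely. The function name and the 1B-parameter model are hypothetical and only there for illustration.

```python
BYTES_FP32 = 4  # assumption: weights, gradients and optimizer state all in fp32


def training_state_gb(total_params, trainable_params):
    """Approximate memory in GB for weights, gradients and Adam state.

    Activations and framework overhead are deliberately left out.
    """
    weights = total_params * BYTES_FP32        # every weight must be loaded
    grads = trainable_params * BYTES_FP32      # one gradient per trainable weight
    adam = trainable_params * 2 * BYTES_FP32   # first and second moment estimates
    return (weights + grads + adam) / 1e9


total = 1_000_000_000  # hypothetical 1B-parameter model

full_gb = training_state_gb(total, total)          # every parameter is trainable
peft_gb = training_state_gb(total, total // 1000)  # only 0.1% is trainable

print(f"full fine-tuning:        {full_gb:.3f} GB")
print(f"0.1% trainable params:   {peft_gb:.3f} GB")
```

Under these assumptions the weights themselves still have to be loaded, but the gradient and optimizer memory shrinks roughly in proportion to the number of trainable parameters.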
Currently the PEFT library supports 4 methods: LoRA, Prefix Tuning, P-Tuning and Prompt Tuning. While these methods have subtle differences, they revolve around a similar paradigm: they freeze the pre-trained model weights and train only a small number of parameters. Earlier parameter-efficient techniques often introduced inference latency by extending model depth, or reduced the model’s usable sequence length. They also often failed to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
For example, LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than full fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
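The idea can be sketched in a few lines of NumPy. This is a minimal illustration of a low-rank update on a single linear layer, not the PEFT library's implementation: the class name, the layer sizes, and the rank and alpha values are all made up for the example. It follows the initialisation described in the LoRA paper (A random, B zero, update scaled by alpha / r) and shows why merging the update back into the frozen weight adds no inference latency.

```python
import numpy as np


def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    return d_out * d_in, rank * (d_out + d_in)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero so the update starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus the low-rank path, computed separately (as during training).
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Fold the update into one matrix, so inference is a single matmul.
        return self.weight + self.scale * self.B @ self.A


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))                 # hypothetical pre-trained weight
layer = LoRALinear(W, rank=4, alpha=8.0)
layer.B = rng.normal(size=layer.B.shape)      # pretend training has updated B

x = rng.normal(size=(5, 32))
same = np.allclose(layer.forward(x), x @ layer.merged_weight().T)
full, lora = lora_param_counts(64, 32, rank=4)

print("merged output matches:", same)
print("trainable params, full vs LoRA:", full, lora)
```

Even on this toy 64 x 32 layer the trainable parameter count drops from 2048 to 384, and the gap widens quickly as the layer dimensions grow while the rank stays small.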
There is a nice example of how to use PEFT to train a Dreambooth model in the PEFT repository.
First, let us clone the repository with git clone git@github.com:huggingface/peft.git and go to the examples/lora_dreambooth directory.
Install the required dependencies: pip install -r requirements.txt . The requirements file does not include the PEFT package itself, so also run pip install peft .
Once the dependencies are installed you can set the following environment variables and run the code below:
I ran the above on an AWS g4dn.xlarge instance, which has an NVIDIA T4 GPU. I supplied 5 images of myself and trained the model to create business headshot photos (think LinkedIn profile photos). Running the above script takes about 35 minutes, but if you would like it to go faster, just reduce the number of training steps.
Running the above code trains the model on the images you provided and outputs the model weights, which you can then use for inference. Use the code below to run inference on your own model:
In the above snippet you will see that we have to load the checkpoints from our training and replace some of the layers in the base model's network architecture. You can see that in lines 14–26, where we add our LoRA config to the model's UNet.
I then ran the tutorial we did a few weeks ago, which trains a Dreambooth model the normal way (without LoRA), on an A10 GPU using the same parameters as above. Below are some of the resulting images for comparison:
The images above aren’t perfect: I only supplied 5 images, and most of them weren’t the best quality. However, I am very impressed with the results of LoRA, since they resemble me a lot better.
To see what the final output of this tutorial looks like, you can also check it out on a Hugging Face Space here.
As you can see, the difference between the images is negligible, so the only trade-off you would have to make is speed vs cost. PEFT methods are making fine-tuning large language and generative models more accessible to users who might not have the budget for, or access to, performant GPUs.
We are working on bringing this functionality to the Cerebrium platform, so if you would like to fine-tune large models such as BLOOM, GPT-NeoX 20B etc., please reach out to us. Additionally, please join our communities on Slack and Discord to stay up to date with our latest news.