Speed up training and inference of GPT-Neo 1.3B by 45+% using DeepSpeed

In this tutorial we are going to look at using DeepSpeed to speed up fine-tuning and inference of GPT-Neo 1.3B on a multi-GPU cluster.

There are many reasons why we wanted to cover this:

  1. We wanted to know more about the capabilities of DeepSpeed, given its rising popularity and the performance benefits we have heard about; notably, it was used to train BLOOM.
  2. We have had many users looking for ways to fine-tune GPT-J or GPT-Neo, as many have noticed superior performance from fine-tuned models on their use cases versus GPT-3.
  3. Due to the proprietary nature of GPT-3 and how expensive fine-tuning/experimentation can get on OpenAI, users are looking for alternative ways to test assumptions and increase the performance of their applications.
  4. Lastly, for clients deploying large multi-GPU instances we are looking at DeepSpeed as a potential solution to help decrease inference time.

What is DeepSpeed?

DeepSpeed is an easy-to-use deep learning optimization library that helps to reduce memory usage and speed up the training of large-scale models. It is built on top of PyTorch and provides state-of-the-art techniques such as model parallelism, the Zero Redundancy Optimizer (ZeRO), model compression, and memory optimization.

With DeepSpeed you can:

  • Train/Infer dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train/Infer on resource-constrained GPU systems
  • Achieve unprecedented low latency and high throughput for inference
  • Achieve extreme compression for unparalleled inference latency and model-size reduction at low cost

In our example we will fine-tune GPT-Neo on Netflix movie descriptions. Given a prompt about a movie, the fine-tuned model will generate a continuation of that movie's description.

Environment Setup

To run these experiments I used an Nvidia A10 instance from Lambda Labs.

First, install the required libraries by running pip install -r requirements.txt, and then run pip install deepspeed.

To start our training we set a seed so that our test runs are reproducible. We then download the tokenizer and model; you will notice that we set explicit beginning-of-sequence (bos), end-of-sequence (eos) and padding tokens. The reason for this is that the original GPT-2 tokenizer, which GPT-Neo reuses, defines special bos and eos tokens, so we need to configure them for fine-tuning. Note that you do not have to add these sequence tokens to your inputs yourself.
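
A minimal sketch of what this setup might look like, assuming the Hugging Face transformers library and the EleutherAI/gpt-neo-1.3B checkpoint (the exact special-token strings are illustrative):

    import torch
    from transformers import GPT2Tokenizer, GPTNeoForCausalLM

    # Fix the random seed so the runs are reproducible
    torch.manual_seed(42)

    # GPT-Neo reuses the GPT-2 tokenizer; we register explicit bos/eos/pad tokens
    tokenizer = GPT2Tokenizer.from_pretrained(
        "EleutherAI/gpt-neo-1.3B",
        bos_token="<|startoftext|>",
        eos_token="<|endoftext|>",
        pad_token="<|pad|>",
    )
    model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()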

We then set our training arguments, which (based on a bit of Googling) are good options to start with, but you should experiment to see what works best for your dataset. The one main thing here is that we pass a DeepSpeed config file to the trainer, which is what enables DeepSpeed. You can find the file here. Lastly we have to resize the model's token embeddings since we just added 3 new tokens.
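
A hedged sketch of this step, continuing from the code above. The DeepSpeed config and the hyperparameters below are illustrative placeholders, not the exact settings from the linked file:

    import json
    from transformers import TrainingArguments

    # An illustrative DeepSpeed config: ZeRO stage 2 with fp16, using "auto"
    # values so the Hugging Face Trainer fills in batch sizes for us.
    ds_config = {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {"stage": 2},
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }
    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)

    # Passing the config path via the deepspeed argument is what enables
    # DeepSpeed inside the Trainer.
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        warmup_steps=100,
        weight_decay=0.01,
        logging_dir="./logs",
        deepspeed="./ds_config.json",
    )

    # We added 3 new special tokens, so the embedding matrix must grow to match
    model.resize_token_embeddings(len(tokenizer))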

Next we ingest our data file, which contains information about movies (titles, descriptions, actors, genre, etc.). You can download the file here. After ingesting the Netflix file, we work out the length of the longest movie description in the dataset; we will use this later to pad our inputs via the tokenizer, since we want all inputs to be the same length.
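
For example, assuming the file is the public netflix_titles.csv export with a description column (the file name and column name are assumptions):

    import pandas as pd

    # The CSV contains titles, descriptions, cast, genre, etc.
    df = pd.read_csv("netflix_titles.csv")
    descriptions = df["description"].tolist()

    # Length (in tokens) of the longest description; used below to pad
    # every input to the same length.
    max_length = max(len(tokenizer.encode(d)) for d in descriptions)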

We set our training size to 90% of the dataset and use the remaining 10% as our evaluation set. We then pass all our arguments to the trainer to start training, and use a custom lambda function to create the batches of data. You can then run the fine-tuning step, which takes about 2 hours on my Nvidia A10 instance.
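
A minimal sketch of how the split, dataset and trainer could be wired together, continuing from the code above (the dataset class and collator below are illustrative rather than the exact code from the post):

    import torch
    from torch.utils.data import Dataset, random_split
    from transformers import Trainer

    class NetflixDataset(Dataset):
        """Tokenizes each description, padded to the same max length."""
        def __init__(self, texts, tokenizer, max_length):
            self.encodings = [
                tokenizer(text, truncation=True, max_length=max_length,
                          padding="max_length", return_tensors="pt")
                for text in texts
            ]

        def __len__(self):
            return len(self.encodings)

        def __getitem__(self, idx):
            return self.encodings[idx]

    dataset = NetflixDataset(descriptions, tokenizer, max_length)
    train_size = int(0.9 * len(dataset))
    train_ds, eval_ds = random_split(dataset, [train_size, len(dataset) - train_size])

    # The lambda collator stacks input_ids and reuses them as labels,
    # which is the standard setup for causal language modelling.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=lambda data: {
            "input_ids": torch.stack([d["input_ids"][0] for d in data]),
            "attention_mask": torch.stack([d["attention_mask"][0] for d in data]),
            "labels": torch.stack([d["input_ids"][0] for d in data]),
        },
    )
    trainer.train()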

Lastly we evaluate our results. We want every generated description to start with our beginning-of-sequence token, to sample only from the 50 most likely next tokens (top-k), and to restrict sampling to the smallest set of tokens whose cumulative probability exceeds 95% (top-p). We then set the temperature to 1.9; the higher the temperature, the more random the results, and 1.9 is a nice medium that is realistic but can still be creative. Finally we decode the generated token IDs back through the tokenizer to get the generated movie descriptions.
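
Sketched out, the generation step might look like this (the maximum length and number of returned samples are arbitrary choices here):

    # Prompt the fine-tuned model with just the beginning-of-sequence token
    generated = tokenizer("<|startoftext|>", return_tensors="pt").input_ids.cuda()

    sample_outputs = model.generate(
        generated,
        do_sample=True,
        top_k=50,          # sample only from the 50 most likely next tokens
        top_p=0.95,        # ...within a cumulative probability of 95%
        temperature=1.9,   # higher temperature -> more random, more creative
        max_length=300,
        num_return_sequences=3,
    )

    # Decode the generated token IDs back into text
    for output in sample_outputs:
        print(tokenizer.decode(output, skip_special_tokens=True))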

For comparison, we ran the same training function without DeepSpeed and it took 3.5 hours, so DeepSpeed gave us a roughly 43% saving in training time!

Inference using DeepSpeed

DeepSpeed can be used not only for training but also for inference, so we are going to run a basic test on the original model to see the effect DeepSpeed has on inference times. You can use DeepSpeed inference on any model of type torch.nn.Module.


To get started, we download and instantiate our model and tokenizer and try a prediction.
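
For example (the prompt is an arbitrary placeholder):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "EleutherAI/gpt-neo-1.3B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

    payload = "A lonely lighthouse keeper discovers a mysterious signal"
    inputs = tokenizer(payload, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, do_sample=True, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))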

We are going to create a latency baseline: a Python function that runs inference on our model and calculates the average, standard deviation and p95 latency, as in the function below:
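
A sketch of such a helper, continuing from the code above; the warm-up loop, the number of timed runs and the generation arguments are our own choices rather than the exact code from the post:

    import time
    import numpy as np
    import torch

    def measure_latency(model, tokenizer, payload, generation_args, device="cuda"):
        inputs = tokenizer(payload, return_tensors="pt").to(device)
        # Warm up so the first (slower) runs do not skew the measurement
        for _ in range(2):
            _ = model.generate(**inputs, **generation_args)
        latencies = []
        for _ in range(10):
            start = time.perf_counter()
            _ = model.generate(**inputs, **generation_args)
            torch.cuda.synchronize()
            latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
        avg, std, p95 = np.mean(latencies), np.std(latencies), np.percentile(latencies, 95)
        return f"P95 latency (ms) - {p95}; Average latency (ms) - {avg:.2f} +/- {std:.2f};"

    # Cap generation at 200 new tokens so the comparison is consistent
    generation_args = {"do_sample": True, "max_new_tokens": 200}
    print("Vanilla model: " + measure_latency(model, tokenizer, payload, generation_args))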

We run a test on our input by calling inference on the model without DeepSpeed enabled. We set the number of tokens to generate to 200 so that it is consistent across our tests. We get the following output:

Vanilla model: P95 latency (ms) - 2972.7617085512975; Average latency (ms) - 2893.36 +/- 61.89;

Now that we have a baseline, we are going to use the DeepSpeed InferenceEngine for GPU-based inference. We initialize the inference engine using the init_inference method, which expects the following parameters (a sketch follows the list):

  • model: The model to optimize
  • mp_size: The number of GPUs to use
  • dtype: The data type to use
  • replace_with_kernel_inject: Whether to inject DeepSpeed's optimized custom kernels
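
A sketch of what this could look like on our single-GPU setup, reusing the model, payload and latency helper from above (mp_size=1 and fp16 are our assumptions here):

    import torch
    import deepspeed

    # Wrap the model in DeepSpeed's inference engine
    ds_model = deepspeed.init_inference(
        model=model,
        mp_size=1,                        # single A10, so no tensor parallelism
        dtype=torch.float16,              # run inference in fp16
        replace_with_kernel_inject=True,  # swap in DeepSpeed's optimized kernels
    )

    # Re-run the same latency helper against the optimized model
    print("DeepSpeed model: " + measure_latency(ds_model.module, tokenizer, payload, generation_args))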

Lastly we run a latency test on the optimized model to check whether there is any performance improvement. One thing to note with model optimization is that there is usually some impact on the accuracy of the model; it is up to you to determine whether the loss in accuracy is negligible for your use case, or whether the latency reduction is worth it. From running inference we see the following results:

DeepSpeed model: P95 latency (ms) - 1450.9557320003296; Average latency (ms) - 1447.23 +/- 2.22;

As you can see, enabling DeepSpeed roughly halves the latency (2972 ms vs 1450 ms at P95). With model inference becoming ever more important for end users, it is extremely worthwhile looking at how you can use DeepSpeed in your own implementations.

To get more updates on the latest ML techniques, please subscribe to our page, follow us on Discord, or check out our ML framework at Cerebrium.
