Fine-tuning has many benefits over using closed-source models from OpenAI, Anthropic, and other providers. Businesses see benefits such as lower latency, lower cost per generation, better handling of edge cases, and visibility into what is going on inside the model so you can optimize it. Allowing users to fine-tune open-source models has been one of our most widely requested features at Cerebrium, and so the team has been working hard to get it into users' hands.
In this tutorial we are going to showcase how we used Cerebrium's fine-tuning capability to fine-tune a LLaMA-based model and create a Cerebrium customer support chatbot.
The foundation of a good AI model often hinges on the dataset it uses for learning and fine-tuning. The primary goal in this process is to generate a dataset that captures the nature of our customer queries and teaches the model how to extract information from the context and how to respond based on the nature of the question and the information at its disposal. One of the big misconceptions about fine-tuning is that you should fine-tune specific information into a model; this is not correct.
You want to teach the model how to handle certain use cases and how to present information to the user based on the context and the question the user asked.
Information about a product or business is almost always changing, so continuously fine-tuning a model is infeasible; instead, up-to-date information should be injected into the context for the model to use. Additionally, information around your use case can be sensitive, so you wouldn't want it ingrained in the model and accessible to everyone, but rather injected only for those with clearance. Lastly, most foundation models are trained on a trove of diverse data, so it's unlikely that fine-tuning a model on your pricing structure will make it forget all the previous pricing and business models it was trained on, which can lead to hallucinations.
We gathered our data from two sources:
I then built a quick script to send chunks of our Discord support messages to the GPT-4 API with the following prompt:
Please create a JSON array in the following format from our Discord channel that is used to answer customer support questions:
This is a very easy way to create an initial dataset that you can then go through to make sure the prompt/completion pairs are correct.
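The exact script isn't reproduced here, but a minimal sketch, assuming the pre-1.0 openai Python client that was current at the time and a hypothetical output format for the prompt/completion pairs, could look like this:

```python
import json
import openai  # pre-1.0 openai client; assumes OPENAI_API_KEY is set in the environment

# Hypothetical output format; the exact prompt and format used in the article are not shown here
SYSTEM_PROMPT = (
    "Please create a JSON array in the following format from our Discord channel "
    "that is used to answer customer support questions:\n"
    '[{"prompt": "<customer question>", "completion": "<support answer>"}]'
)

def chunk_to_pairs(discord_chunk: str) -> list:
    """Send one chunk of exported Discord messages to GPT-4 and parse the returned JSON array."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": discord_chunk},
        ],
    )
    return json.loads(response["choices"][0]["message"]["content"])

# Placeholder: in practice, these would be chunks of the exported support channel history
chunks = ["<chunk of Discord support messages>"]

pairs = []
for chunk in chunks:
    pairs.extend(chunk_to_pairs(chunk))

with open("dataset.json", "w") as f:
    json.dump(pairs, f, indent=2)
```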
To quickly find and set the relevant context our answers could have come from, I used LlamaIndex to chunk our documentation into a vector database and looped through the JSON dataset above to find the passages of text most similar to the completion parameter.
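A rough sketch of this step, assuming a recent llama-index release and hypothetical file names (a docs/ folder for the documentation and dataset.json for the pairs above):

```python
import json
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Chunk and embed the documentation into an in-memory vector index
# (by default this uses OpenAI embeddings, so OPENAI_API_KEY must be set)
documents = SimpleDirectoryReader("docs").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=2)

# For every prompt/completion pair, attach the documentation passages
# most similar to the completion
with open("dataset.json") as f:
    dataset = json.load(f)

for item in dataset:
    nodes = retriever.retrieve(item["completion"])
    item["context"] = "\n".join(n.node.get_content() for n in nodes)

with open("dataset_with_context.json", "w") as f:
    json.dump(dataset, f, indent=2)
```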
Your dataset should now look like:
You should go through the dataset to make sure that the context relates to the completion and that the answer is indeed hidden in the context. If the answer cannot be found in the context, I recommend you handle the use case accordingly:
Here is an example of a user asking how they can get access to our enterprise plan; however, the context we were able to retrieve is just the details of our three pricing plans, so there is no information in the context to answer this question. We are trying to teach the model to respond with "Contact the Cerebrium team" since it cannot answer the question based on the given context.
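The original example isn't reproduced here; a hypothetical entry of this shape might look like:

```json
{
  "prompt": "How do I get access to the enterprise plan?",
  "context": "Cerebrium offers three pricing plans: ... (plan details retrieved from our documentation)",
  "completion": "Please contact the Cerebrium team to discuss access to the enterprise plan."
}
```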
There are a few more steps we can take to curate our dataset even more by asking questions in different formats. Now that the dataset is curated, let’s train our model.
With our refined dataset in hand, we can move on to fine-tuning our LLaMA model. The Cerebrium platform makes this extremely easy, requiring me to only specify my training parameters while it takes care of dataset validation and all the infrastructure. We use the default training configuration found here and the following prompt template.
Our own prompt template specifies that if the model cannot answer the question from the provided context, it should tell the user to contact the Cerebrium team.
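The actual template isn't shown here; a hypothetical template in that spirit could look like:

```
Below is a question from a Cerebrium user, paired with context retrieved from our documentation.
Answer the question using only the provided context. If the answer cannot be found in the
context, tell the user to contact the Cerebrium team.

### Context:
{context}

### Question:
{prompt}

### Response:
{completion}
```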
To kick off the training job we run the following command:
Your training job should start and you can run the following command to see your training logs:
You should see the following output in your terminal:
You will be notified via email once the training job has completed, and then you can run the following code snippet to download your model checkpoint files:
To deploy this model to Cerebrium you can look at the example here. A LLaMA model isn't the quickest, so you can look at implementing the vLLM framework to speed up inference to ~0.9 tokens per second on an A10, which is 24x faster than the base LLaMA model.
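As a minimal sketch of the vLLM route, assuming a hypothetical local path to the merged fine-tuned weights:

```python
from vllm import LLM, SamplingParams

# Hypothetical local path to the merged fine-tuned LLaMA weights downloaded from Cerebrium
llm = LLM(model="./llama-cerebrium-support")
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)

# Prompt follows the same template used during fine-tuning
prompt = "### Context:\n...\n\n### Question:\nHow do I deploy a model on Cerebrium?\n\n### Response:\n"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```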
Now that our fine-tuned LLM is ready, we need information from our documentation to aid its responses, i.e. the context. We chose to use LlamaIndex for this as it's extremely easy to set up and has a wide variety of integrations and functionality. It allows us to chunk relevant data from our documentation and then insert it into the context of our fine-tuned LLM. Essentially, LlamaIndex simplifies the process of retrieving the right data at the right time, ensuring the chatbot's responses remain accurate and informed.
You can install LlamaIndex using pip install llama-index. We then have the following snippet of code:
Two things to notice:
LlamaIndex then allows you to connect your LLM deployed on Cerebrium to the vector store, so you can run the following code to get answers to your questions:
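The article's snippet isn't reproduced here; as a rough sketch, one way to wire the retriever to the deployed model is to call the endpoint directly over HTTP, assuming a hypothetical endpoint URL and response shape:

```python
import requests
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Build (or reload) the documentation index as before
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("docs").load_data())
retriever = index.as_retriever(similarity_top_k=2)

# Hypothetical Cerebrium endpoint for the fine-tuned model
CEREBRIUM_URL = "https://example.cerebrium.ai/llama-support/predict"

def ask(question: str) -> str:
    # Retrieve the documentation passages most relevant to the question
    context = "\n".join(n.node.get_content() for n in retriever.retrieve(question))
    # Fill the same template used during fine-tuning and call the deployed model
    prompt = f"### Context:\n{context}\n\n### Question:\n{question}\n\n### Response:\n"
    response = requests.post(CEREBRIUM_URL, json={"prompt": prompt})
    return response.json()["result"]  # hypothetical response shape

print(ask("What GPUs does Cerebrium support?"))
```

LlamaIndex can also wrap a remote model as a custom LLM and use it directly inside a query engine; the plain HTTP call above just keeps the sketch short.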
What is happening in the above code is:
As you can see, our result is pretty accurate! There are many improvements we could still make, such as including stop tokens in our training dataset and deployment, or fine-tuning a model such as Vicuna to get a large speedup, but these are out of scope for this article.
In this tutorial we have shown a simple way to use Cerebrium to fine-tune and deploy an LLM with just a few lines of code. It's important to remember that you don't necessarily want to fine-tune information into a model, but rather the characteristics you would like it to adhere to. If you are looking to lower latency, save on costs, or avoid sharing your data with OpenAI, fine-tuning is a great alternative to go forward with!