Fine-tuning and deploying a LLaMA-based support bot with Cerebrium and LlamaIndex

Michael Louis
Co-Founder & CEO

Fine-tuning open-source models has many benefits over using closed-source models from OpenAI, Anthropic and other providers. Businesses see benefits such as lower latency, lower cost per generation, the ability to handle edge cases, and the ability to inspect what is going on inside the model so they can optimize it. Letting users fine-tune open-source models has been one of the most widely requested features at Cerebrium, so the team has been working hard to get it into the hands of users.

In this tutorial we showcase how we used the Cerebrium fine-tuning capability to fine-tune a LLaMA-based model and create a Cerebrium customer support chatbot.

1. Dataset Curation for Fine-tuning:

The quality of an AI model often hinges on the dataset used for training and fine-tuning. Our primary goal in this process is to build a dataset that captures the nature of our customer queries and teaches the model how to extract data from context and how to respond based on the nature of the question and the information at its disposal. One of the big misconceptions about fine-tuning is that you should fine-tune specific information into a model. This is not correct.

You want to teach the model how to handle certain use cases and how to present information to the user based on the context and the question the user asked.

Information about a product or business will most likely always be changing, so continuously fine-tuning a model is infeasible. Rather, up-to-date information should be injected into the context for the model to use. Additionally, information about your use case can be sensitive, so you wouldn’t want it ingrained in the model and accessible to all, but rather injected only for those who have clearance. Lastly, most foundation models are trained on a trove of diverse data, so it’s unlikely that fine-tuning a model on your pricing structure will make it forget all the other pricing and business models it was trained on, which can lead to hallucinations.

How we gathered our dataset

We gathered our data from two sources:

  • Our documentation where all files are in markdown
  • Our support channel in Discord. We used a Chrome plugin to download all the messages.

I then built a quick script to send chunks of our Discord support messages to the GPT-4 API with the following prompt:

Please create a JSON array in the following format from our Discord channel that is used to answer customer support questions:

{"prompt": {question}, "completion": {completion}} where {question} is the question the user asked in the text below and {completion} is the answer to that question from the text below.

Discord messages:

This is a very easy way to bootstrap an initial dataset, which you should then go through to make sure the prompt/completion pairs are correct.
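A minimal sketch of such a script is shown below. It is illustrative only and not the script we actually ran: the function names, the default chunk size, and the `call_llm` callable (a stand-in for whatever client you use to reach GPT-4) are all assumptions.

```python
import json
from typing import Callable, Dict, List

# The instruction text mirrors the prompt shown above.
PROMPT_HEADER = (
    "Please create a JSON array in the following format from our Discord channel "
    "that is used to answer customer support questions:\n\n"
    '{"prompt": {question}, "completion": {completion}} where {question} is the '
    "question the user asked in the text below and {completion} is the answer to "
    "that question from the text below.\n\n"
    "Discord messages:\n"
)


def chunk_messages(messages: List[str], chunk_size: int = 50) -> List[List[str]]:
    """Split the exported messages into consecutive groups of at most chunk_size."""
    return [messages[i:i + chunk_size] for i in range(0, len(messages), chunk_size)]


def build_prompt(chunk: List[str]) -> str:
    """Append one chunk of messages to the instruction header."""
    return PROMPT_HEADER + "\n".join(chunk)


def extract_pairs(
    messages: List[str],
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's text reply
    chunk_size: int = 50,
) -> List[Dict[str, str]]:
    """Send each chunk to the model and collect well-formed prompt/completion pairs."""
    pairs: List[Dict[str, str]] = []
    for chunk in chunk_messages(messages, chunk_size):
        reply = call_llm(build_prompt(chunk))
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            continue  # reply was not valid JSON; leave this chunk for manual review
        if not isinstance(parsed, list):
            continue
        pairs.extend(
            p for p in parsed
            if isinstance(p, dict) and "prompt" in p and "completion" in p
        )
    return pairs
```

Passing the model call in as a function keeps the chunking and parsing logic testable without network access, and makes it easy to swap providers.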

Dataset Preparation

To quickly find and set the relevant context that our answers could have come from, I used LlamaIndex to chunk our documentation into a vector database and looped through the JSON dataset above to find the passages of text most similar to the completion parameter.

Your dataset should now look like:

    {
        "prompt": "What is Cerebrium?",
        "completion": "Cerebrium is an AWS Sagemaker alternative providing all the features you need to quickly build an ML product. We provide serverless compute and infrastructure as well as try abstract as much of the latest research so you can build ML based solutions without the hassle.",
        "context": "Cerebrium is an AWS Sagemaker alternative providing all the features you need to\nquickly build an ML product.\"\n---\n\nOur goal is to help companies create value through machine learning as quickly and as painlessly as possible by abstracting away a lot of the complexity and mundane infrastructure setup.\n\nSagemaker is feature rich, so we are developing the most requested functionality\nweekly based on your feedback. You can send us feedback requests at\n[support@cerebrium.ai](support@cerebrium.ai)\n\n## How we do this\n\n- \n    Infrastructure\n  \n  We abstract away all of the complexity around infrastructure so you don't have\n  to worry about CPUs/GPUs, Kubernetes, queues, monitoring, scaling etc. We take\n  care of this to create a robust and seamless developer experience.\n- \n    Research\n  \n  We try to implement the latest research towards your model as best we can in\n  order for you to deliver the best experience to your users. We automatically implement\n  quantization, LoRA, pruning and many more techniques to get your model to run faster\n  and cheaper without sacrificing performance.\n\n## Our users favorite features\n\n- \\<30 second cold-start times\n- Wide variety of GPUs\n- Automatic scaling from 1 to 10k requests in \\<15s\n- Define pip/conda container environments\n- Secrets manager\n- One-click deploys\n- Monitoring tool integration\n\nAll of this in just a few lines of code!\n\nTo get started, head to our [installation page](/installation)"
    }

You should go through the dataset to make sure that the context relates to the completion and that the answer can indeed be found in the context. If the answer cannot be found in the context, I recommend you handle the use case accordingly:

    {
        "prompt": "How can I apply for the unlimited service if I want to use it?",
        "completion": "Please contact a member of the Cerebrium team @Cerebrium Team",
        "context": "Cerebrium has 3 pricing plans.\nHobby: This is a free plan and you only pay for compute. You are only allowed to deploy 3 models maximum and can only have 5 max concurrent GPU workers.\nStandard: Deploy unlimited models and have up to 30 concurrent GPU workers. This plan costs $100 a month + compute.\nEnterprise: Contact us so we can understand your requirements better but essential you get dedicated support, unlimited GPU concurrency etc."
    }

Here a user is asking how they can get access to our enterprise plan, but the context we were able to retrieve contains only the details of our 3 pricing plans, so there is no information in the context that answers the question. We are trying to teach the model to respond with “Contact the Cerebrium team” when it cannot answer the question from the given context.
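Reviewing every record by hand gets tedious, so a rough heuristic can help decide which ones to look at first. The sketch below is an assumption on my part rather than part of our pipeline: it scores how many of the completion's words appear in the context and flags low-scoring records. The function names and the 0.5 threshold are hypothetical.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Distinct lowercased words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(completion: str, context: str) -> float:
    """Fraction of distinct completion words that also appear in the context."""
    completion_tokens = _tokens(completion)
    if not completion_tokens:
        return 0.0
    return len(completion_tokens & _tokens(context)) / len(completion_tokens)


def flag_ungrounded(records: List[Dict[str, str]], threshold: float = 0.5) -> List[Dict[str, str]]:
    """Return records whose completion shares few words with its context."""
    return [
        r for r in records
        if grounding_score(r["completion"], r["context"]) < threshold
    ]
```

A flagged record is not necessarily wrong. Fallback answers such as “Please contact a member of the Cerebrium team” are expected to score low, so the list is a starting point for manual review rather than a filter.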

There are a few more steps we could take to curate our dataset further, such as asking the same questions in different formats. Now that the dataset is curated, let’s train our model.

2. Fine-tuning and deploying a LLaMA model using Cerebrium:

With our refined dataset in hand, we can move on to fine-tuning our LLaMA model. The Cerebrium platform makes this extremely easy: I only need to specify my training parameters, and it takes care of dataset validation and all infrastructure. We use the default training configuration found here and the following prompt template:

    {
        "description": "Template for Cerebrium Support bot",
        "prompt_input": "Below is a question paired with an input that provides further context. Write a response that appropriately answers the question. Only use information from the context. If you can't answer a question from the context, ask the user to ask for help from the Cerebrium team.\n\n### Question:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
        "prompt_no_input": "### Your Question:\n{instruction}\n\n### Response:\n",
        "response_split": "### Response:"
    }

The dataset columns are mapped as follows:

    instruction_column: "prompt"
    context_column: "context"

We train with our own prompt template, which specifies that if the model cannot answer the question from the provided context, it should tell the user to contact the Cerebrium team.
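To make the template concrete, here is a sketch of how one dataset record could be turned into a training prompt. How Cerebrium applies the template internally is not shown in this article, so treat this as an assumption based on the Alpaca-style keys. The `render_prompt` and `extract_response` helpers are hypothetical names, and the template strings mirror the ones above.

```python
from typing import Dict

TEMPLATE = {
    "prompt_input": (
        "Below is a question paired with an input that provides further context. "
        "Write a response that appropriately answers the question. Only use "
        "information from the context. If you can't answer a question from the "
        "context, ask the user to ask for help from the Cerebrium team.\n\n"
        "### Question:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": "### Your Question:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:",
}


def render_prompt(
    record: Dict[str, str],
    instruction_column: str = "prompt",
    context_column: str = "context",
) -> str:
    """Fill the template, using the no-input variant when the record has no context."""
    instruction = record[instruction_column]
    context = record.get(context_column)
    if context:
        return TEMPLATE["prompt_input"].format(instruction=instruction, input=context)
    return TEMPLATE["prompt_no_input"].format(instruction=instruction)


def extract_response(generated: str) -> str:
    """Keep only the text after the last response marker in the model output."""
    return generated.split(TEMPLATE["response_split"])[-1].strip()
```

At inference time the same template has to be used, with the retrieved documentation passed as the input, otherwise the model sees a prompt format it was not trained on.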

To kick off the training job we run the following command:

cerebrium train --config-file ./config.yaml

Your training job should start and you can run the following command to see your training logs:

cerebrium get-training-logs {JOB_ID}

You should then see the training output in your terminal.

You will be notified via email once the training job has completed. You can then run the following command to download your model checkpoint files:

cerebrium download-model {JOB_ID}

To deploy this model to Cerebrium, you can look at the example here. A LLaMA model isn’t the quickest, so you can look at implementing the vLLM framework, which speeds up inference to ~0.9 tokens per second on an A10, 24x faster than the base LLaMA model.

3. Using LlamaIndex for Documentation Chunk Data:

Now that our fine-tuned LLM is ready, we need information from our documentation to aid its responses, i.e. the context. We chose LlamaIndex for this as it’s extremely easy to set up and has a wide variety of integrations and functionality. It allows us to chunk our documentation, retrieve the relevant chunks, and pass them into the context of our fine-tuned LLM. Essentially, LlamaIndex simplifies the process of retrieving the right data at the right time, ensuring the chatbot’s responses remain accurate and informed.
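For readers new to chunking, the idea can be illustrated in a few lines. This is a naive word-based sketch with overlapping windows, not LlamaIndex's actual splitter, which works on tokens and sentence boundaries. The function name and default sizes are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.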

You can install LlamaIndex using pip install llama-index. We then have the following snippet of code:

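import json
import os

# These import paths match the llama-index release used at the time of writing;
# newer releases expose the same names under llama_index.core.
from llama_index import Document, SimpleDirectoryReader, VectorStoreIndex


def file_metadata(url):
    """Metadata attached to every chunk so answers can link back to the docs."""
    return {'url': url}


def find_mdx_files(directory):
    """Load every .mdx file under `directory` as LlamaIndex documents."""
    docs = []
    # os.walk visits the directory and all of its subdirectories
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.mdx'):
                path = os.path.join(root, file)
                documents = SimpleDirectoryReader(input_files=[path]).load_data()
                for d in documents:
                    # Store the path relative to data/ as the document URL
                    d.metadata = file_metadata(path.replace('data/', ''))
                docs.extend(documents)
    return docs


all_docs = find_mdx_files('data/')

# `chats` is the curated list of prompt/completion pairs from section 1.
# The file name is a placeholder for wherever you saved that dataset.
with open('dataset.json') as f:
    chats = json.load(f)

for chat in chats:
    all_docs.append(Document(text=f"{chat['prompt']} {chat['completion']}"))

# Chunk and embed everything into a single vector store
index = VectorStoreIndex.from_documents(all_docs)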

Two things to notice:

  1. All our documentation files, which are in markdown format, are located in the data/ folder. The script walks the entire directory recursively, and LlamaIndex chunks each document and stores it in a vector store.
  2. I didn’t only chunk our documentation into LlamaIndex but also the question and completion pairs from the support channel that are not covered by our documentation, so they can be used for future questions.

LlamaIndex then allows you to connect your LLM deployed on Cerebrium to the vector store, so you can run the following code to get answers to your questions:

What is happening in the above code is:

  1. You are creating your Cerebrium endpoint where you deployed your model.
  2. You are creating a vector store from all your previous support messages and documentation, and setting Cerebrium as the default LLM for the vector store.
  3. Lastly, you are asking it a question and retrieving the answer.

As you can see, our result is pretty accurate! There are many improvements we could still make, such as including stop tokens in our training dataset and deployment, or fine-tuning a model such as Vicuna in order to get a large speedup, but these are out of scope for this article.

In this tutorial we have shown a simple way to use Cerebrium to fine-tune and deploy an LLM with just a few lines of code. It’s important to remember that you don’t necessarily want to fine-tune information into a model but rather the characteristics you would like it to adhere to. If you are looking to lower latency, save on costs, or avoid sharing your data with OpenAI, fine-tuning is a great alternative!
