Due to the cost, latency, and rate-limit issues associated with the OpenAI API, many of our customers have opted to fine-tune and deploy their own models and integrate them into their own applications. However, it can be challenging to judge the adequacy of the output these models generate, whether they come from OpenAI or from our customers, and to identify the prompts or edge cases that lead to poor performance. To address this challenge, I will show you a reliable method for monitoring model performance using Arize.
Starting today, Cerebrium is announcing its integration with Arize, a leading ML observability platform used by the likes of Instacart, Uber and many more! Arize has many great features, such as performance tracing, dashboards, and drift analysis, that make it a favourite tool. We are going to use the embeddings UMAP visualization that Arize offers to detect degradation in our fine-tuned GPT Neo model.
Vector embeddings are lists of numbers that represent data, such as text, so that various operations can be performed on it. Most commonly, vector embeddings are used in recommendation engines and voice assistants.
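To make this concrete, here is a minimal, self-contained sketch of the kind of operation embeddings enable. The tiny 3-dimensional vectors below are toy values, not real model embeddings, but the arithmetic is the same:

```python
import math

# Toy 3-dimensional "embeddings"; real model embeddings have hundreds of
# dimensions, but the operations are identical.
embeddings = {
    "a gritty crime drama": [0.9, 0.1, 0.2],
    "a dark detective thriller": [0.8, 0.2, 0.3],
    "a cheerful cooking show": [0.1, 0.9, 0.7],
}

def cosine_similarity(u, v):
    """Angle-based similarity between two vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Rank the other descriptions by similarity to a query description --
# this is the core move behind recommendation engines.
query = embeddings["a gritty crime drama"]
ranked = sorted(
    (k for k in embeddings if k != "a gritty crime drama"),
    key=lambda k: cosine_similarity(query, embeddings[k]),
    reverse=True,
)
print(ranked[0])  # -> "a dark detective thriller"
```

The same similarity computation, applied to embeddings of real text, is what lets Arize cluster and compare descriptions later in this tutorial.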
In this tutorial we are going to follow on from our previous tutorial here, fine-tuning a GPT Neo model on a Netflix dataset to help generate descriptions of movies. After developing the model, we will use Cerebrium for deployment and Arize to monitor the performance of our deployed model. Arize will surface issues in our unstructured-data model so we can improve its performance.
To get started, fine-tune a GPT Neo model on the Netflix dataset by following the instructions from our previous tutorial here.
To compare changes, perform analysis, and root-cause performance degradations, your model needs a model baseline. A model baseline is a reference dataset used to compare your current data against, whether training, validation, or a prior time period in production. Here, we log our descriptions to Arize, both the raw text and the vector embedding representation, along with whether each was a positive or negative generation. For this tutorial we assign these labels randomly: every generated description gets a predicted positive class (represented by 1, meaning a good generation), and whether it actually was a good generation is assigned at random. In a real application you would build this binary feedback into your product so users can provide it.
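The logging step can be sketched roughly as follows. The client keys, schema and column names, and the `gpt-neo-neflix` model id are illustrative assumptions based on the Arize pandas logger API, and the random 0/1 "actual" labels stand in for real user feedback as described above. The Arize call is guarded so the script still runs without the SDK or credentials:

```python
import random
import uuid

# Stand-in data: in the tutorial these come from the fine-tuned model and
# its embedding layer. Vectors here are random 8-dimensional placeholders.
descriptions = [f"Movie description {i}" for i in range(100)]
vectors = [[random.random() for _ in range(8)] for _ in descriptions]

records = [
    {
        "prediction_id": str(uuid.uuid4()),
        "description": d,
        "description_vector": v,
        "prediction_label": 1,                 # every generation predicted "good"
        "actual_label": random.randint(0, 1),  # random stand-in for user feedback
    }
    for d, v in zip(descriptions, vectors)
]

try:
    # The exact schema fields below follow the Arize pandas logger API but
    # should be treated as assumptions for this sketch.
    import pandas as pd
    from arize.pandas.logger import Client, Schema
    from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames

    schema = Schema(
        prediction_id_column_name="prediction_id",
        prediction_label_column_name="prediction_label",
        actual_label_column_name="actual_label",
        embedding_feature_column_names={
            "description_embedding": EmbeddingColumnNames(
                vector_column_name="description_vector",
                data_column_name="description",
            )
        },
    )
    Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY").log(
        dataframe=pd.DataFrame(records),
        model_id="gpt-neo-neflix",
        model_version="v1",
        model_type=ModelTypes.SCORE_CATEGORICAL,
        environment=Environments.TRAINING,  # logged as the model baseline
        schema=schema,
    )
except Exception:
    pass  # Arize not installed or no credentials; the records above still exist
```

Logging under the Training environment is what allows this dataset to be selected as the baseline in the next step.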
You will notice that for Arize, we specified the environment as Training, since we will set it as our model baseline. We then pass the movie descriptions we trained the model on, both as text and as vector embeddings, to Arize. Arize offers an interactive UMAP visualization that represents high-dimensional vectors in 2D or 3D space, helping to surface outliers in our dataset.
Arize usually takes 5–15 minutes to ingest the data we have just logged. Once the data is in Arize, click on your model and go to the Datasets tab. Click “Configure Baseline” in the top-right corner and select “Pre-production”. We have now set up our model baseline.
Deploying our fine-tuned model to Cerebrium with monitoring enabled is as easy as uploading a serialised model file of our fine-tuned model and adding our Arize Schema and platform arguments. Cerebrium will then automatically log all inputs and outputs to Arize.
Your fine-tuned GPT Neo model should now be deployed on serverless GPUs, with all monitoring sent to Arize. To read more about the capabilities we offer, see our documentation here.
To demonstrate how we can identify the cause of the degradation in our model's performance, we are going to use the embeddings feature from Arize. Below we do two things:
Below is the code snippet we used to generate this fake data:
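Since the original snippet isn't reproduced here, the following is a sketch of what such a script could look like. The 50/50 English/Spanish split, the schema fields, and the model id are illustrative assumptions; the Spanish prompts simulate the drift we want Arize to surface, and the logging call is guarded so the sketch runs without Arize credentials:

```python
import random
import uuid
from datetime import datetime, timedelta

# Stand-in "production" traffic: half English, half Spanish prompts, the
# latter simulating the cause of the degradation. Vectors are random
# 8-dimensional placeholders for the model's real embeddings.
english = [f"A thrilling story about character {i}" for i in range(50)]
spanish = [f"Una historia emocionante sobre el personaje {i}" for i in range(50)]

production_records = [
    {
        "prediction_id": str(uuid.uuid4()),
        "prediction_ts": datetime.now() - timedelta(days=random.randint(0, 13)),
        "description": text,
        "description_vector": [random.random() for _ in range(8)],
        "prediction_label": 1,
        "actual_label": random.randint(0, 1),
    }
    for text in english + spanish
]

try:
    # Same logging pattern as the baseline step, but with the environment
    # set to PRODUCTION and a timestamp column added; exact column and
    # schema names are assumptions for this sketch.
    import pandas as pd
    from arize.pandas.logger import Client, Schema
    from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames

    schema = Schema(
        prediction_id_column_name="prediction_id",
        timestamp_column_name="prediction_ts",
        prediction_label_column_name="prediction_label",
        actual_label_column_name="actual_label",
        embedding_feature_column_names={
            "description_embedding": EmbeddingColumnNames(
                vector_column_name="description_vector",
                data_column_name="description",
            )
        },
    )
    Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY").log(
        dataframe=pd.DataFrame(production_records),
        model_id="gpt-neo-neflix",
        model_version="v1",
        model_type=ModelTypes.SCORE_CATEGORICAL,
        environment=Environments.PRODUCTION,
        schema=schema,
    )
except Exception:
    pass  # Arize not installed or no credentials; the fake data above still exists
```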
In the above script, we use the same Arize client and schema that we used earlier for our training data. After about 10–15 minutes, you should see your data in Arize. In your Arize space, click the gpt-neo-neflix model, open the Embeddings tab, and then click on the Euclidean Distance number.
Above we see our data in Arize in a reduced dimensional space (UMAP). There are a few things to point out here:
Taking the 3 points above into consideration, Arize allows us to dig into why our fine-tuned model is performing differently in production than it did in training. Using the ability to filter by cluster, features, datasets, incorrect predictions, and more, we can conclude relatively quickly that users are entering Spanish prompts, which generate Spanish output. We can easily correct this by checking that users enter prompts in English.
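One lightweight way to add such a check is sketched below. This is a heuristic based on common Spanish function words; a production system would more likely use a proper language-identification library (e.g. langdetect or fastText), so treat the word list and threshold as illustrative:

```python
# Heuristic English check: flag prompts dominated by common Spanish
# function words. The stopword list and threshold are illustrative choices.
SPANISH_STOPWORDS = {
    "el", "la", "los", "las", "una", "uno", "que", "de",
    "en", "es", "por", "para", "con", "sobre", "historia",
}

def looks_english(prompt: str, threshold: float = 0.25) -> bool:
    """Return True when the share of Spanish stopwords stays below threshold."""
    words = [w.strip(".,!?").lower() for w in prompt.split()]
    if not words:
        return False
    spanish_ratio = sum(w in SPANISH_STOPWORDS for w in words) / len(words)
    return spanish_ratio < threshold

print(looks_english("A thrilling crime drama set in Madrid"))   # -> True
print(looks_english("Una historia emocionante sobre el amor"))  # -> False
```

Prompts failing the check can be rejected before they reach the model, preventing the Spanish cluster we saw in the UMAP view from forming in the first place.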
We have only scratched the surface in this article of how you can deploy and monitor your models in production using Cerebrium and Arize. If you would like to stay up to date with our latest releases and community, please join our Slack, Twitter and/or Discord.