Monitoring ML Models in Production using Arize — Part (2/2)

Michael Louis
Co-Founder & CEO

Following on from our previous article, where we showed you how to set up Arize to monitor your models in production, this tutorial works through three key use cases and shows you how to identify and resolve certain model issues. These use cases are:

  • A new untrained domain is introduced
  • Bad data is received in one of the features
  • The model is inaccurate during some time period

In order to run these use cases, let's create some production data in Arize. We will need to update the timestamps to align with the current day and time, so that the sample data shows up as recent in Arize.

import uuid
import datetime
from datetime import timedelta
from random import randrange

import pandas as pd

df_prod = pd.read_csv("test_transaction.csv")
# Only using 300 transactions per day over the last 30 days
df_prod = df_prod[:9000]

# Get model predictions
X_prod = df_prod[feature_column_names]
X_prod = pd.DataFrame(enc.transform(X_prod).toarray(), columns=enc.get_feature_names_out().reshape(-1))
X_prod["TransactionAmt"] = df_prod[["TransactionAmt"]].to_numpy()
prod_pred = xgb.predict(X_prod)
prod_pred_score = xgb.predict_proba(X_prod)

def random_date(start, end):
    """Return a random datetime between two datetimes."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)
# Adjusting dates for ease of visualization
END_DATE = datetime.date.today()
START_DATE = datetime.date.today() - timedelta(31)

def setPredictionIDandTime(df, start, end):
    for i in range(len(df)):
        df.loc[i, 'prediction_ts'] = int(random_date(start, end).strftime("%s"))
        val = randrange(10)
        # We are adding in a new value to show you data drift later
        df.loc[i, 'P_emaildomain'] = 'cerebrium.ai' if val < 8 else df.loc[i, 'P_emaildomain']

setPredictionIDandTime(df_prod, START_DATE, END_DATE)
test_prod_pred = pd.DataFrame(prod_pred, columns=['predictedFraud'])
# predict_proba returns one column per class; keep the positive-class score
test_prod_pred_score = pd.DataFrame(prod_pred_score[:, 1], columns=['predictedFraudScore'])
combined_prod_df = pd.concat([df_prod.reset_index(drop=True), test_prod_pred, test_prod_pred_score], axis=1)
# Define a Schema() object for Arize to pick up data from the correct columns for logging
# (column names below assume the same schema used for the training set in Part 1)
prod_schema = Schema(
    prediction_id_column_name="TransactionID",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predictedFraud",
    prediction_score_column_name="predictedFraudScore",
    feature_column_names=feature_column_names,
)

# Logging Production Prediction
prod_response = arize_client.log(
    dataframe=combined_prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=prod_schema,
)

# If successful, the server will return a status_code of 200
if prod_response.status_code != 200:
    print(f"logging failed with response code {prod_response.status_code}, {prod_response.text}")
else:
    print("✅ You have successfully logged production set to Arize")

Remember, Arize takes 10 minutes to index the data.

⚠️ DON’T SKIP: In order to move on to the next step, make sure your actuals and training/validation sets are loaded into the platform.

To check:

  1. Navigate to models from the left bar, locate and click on model fraud-detection-model
  2. Click the Overview Tab and make sure you see the actuals as shown below.
  3. Actual data will show up under Model Health. Once the number changes from 0 Actuals to x Actuals (with summary statistics such as cardinality listed in the tables), your production actuals will have been fully recorded on Arize!
  4. Verify the list of features under Model Health.

Setup proactive alerts

It is extremely important to monitor the performance of models used in automated processes, such as a model that decides whether a loan should be awarded. Arize can automatically configure the monitors that are best suited to your data. From the Overview tab, you will see a Monitors window on the right. Click Set Up Monitors and configure:

  1. Datasets: Training Version 1.0
  2. Default Metric: Accuracy, Trigger Alert When: Accuracy, Positive Class: 1
  3. Turn On Monitoring: Drift ✅, Data Quality ✅, Performance ✅

Through Arize you can notify users via email, Slack, or PagerDuty.

Data Quality

It’s important to monitor and immediately surface data quality issues to identify how your data quality maps to your model’s performance. Arize has great monitoring tools to analyse failures in your data quality pipeline, such as missing data or cardinality shifts.

To look at data quality, go to the “Overview” tab and scroll to the bottom. You should see all 7 features used, along with their PSI (Population Stability Index), cardinality, and other statistics. PSI is a metric that measures how much a variable's distribution has shifted between two samples over time: the larger the PSI, the less similar your distributions are.
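PSI is also easy to compute yourself. Below is a minimal sketch, assuming the baseline sample defines the bin edges and using a small epsilon to avoid division by zero; the synthetic samples are purely illustrative:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample."""
    eps = 1e-6  # avoid log(0) / division by zero for empty bins
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
print(psi(baseline, rng.normal(0, 1, 10_000)))  # near 0: same distribution
print(psi(baseline, rng.normal(1, 1, 10_000)))  # well above 0.25: shifted
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, which matches the 0.25 threshold mentioned below.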

From our dataset you can see that our PSI is high for 3 features (typically a threshold will be set at 0.25) and we have many empty values for each of them too — this isn’t good.

If you click on the “M1” feature, you can see drift over time, which shows the PSI as well as the distribution of values compared to the baseline model we uploaded. We can see that there was a sudden spike today. If we click on it, the distribution differs from our baseline in that there are no null values for M1 in our latest predictions. Typically this would be a good thing, but it is worth checking for ourselves.

There is also a data quality tab that shows the cardinality as well as the percentage of empty values for that feature.


  • Missing / Null values could be an indicator of issues from an upstream data source.
  • Cardinality is checked to ensure there are no spikes / drops in feature values.
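Both checks are easy to reproduce locally with pandas before the data ever reaches Arize. A minimal sketch, where the toy frame is hypothetical and the column names simply mirror our dataset:

```python
import pandas as pd

def data_quality_report(df, features):
    # Mirror the Data Quality tab: % empty values and cardinality per feature
    return pd.DataFrame({
        "pct_empty": df[features].isna().mean() * 100,
        "cardinality": df[features].nunique(),
    })

df = pd.DataFrame({
    "P_emaildomain": ["gmail.com", None, "gmail.com", "yahoo.com"],
    "M1": ["T", "F", None, None],
})
print(data_quality_report(df, ["P_emaildomain", "M1"]))
# pct_empty: 25.0 for P_emaildomain, 50.0 for M1; cardinality: 2 and 2
```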

Prediction Drift

Prediction Drift can surface when data drift has impacted your model. Drift is measured by PSI. When your data changes in a way the model is not expecting, it can impact the performance of your model negatively. This can be due to factors such as new data you are collecting which your model hasn’t seen before, or a change in user behaviour based on a change your business made, etc. Before we demonstrate this, we first need to know how accurate our model is.

In order to monitor the accuracy of your model, you need to log the actual output values: in our case, whether the transaction actually turned out to be fraudulent when we followed up with the customer. With Arize, you can log the actual results at a later time by referencing the prediction ID. We are going to simulate actual values to demonstrate this.

# Simulate actuals: set most transactions to actually be fraudulent
for i in range(9000):
    val = randrange(10)
    combined_prod_df.loc[i, 'isFraud'] = 1 if val < 8 else 0
# Define a Schema() object for Arize to pick up data from the correct columns for logging
actual_schema = Schema(
    prediction_id_column_name="TransactionID",  # REQUIRED
    actual_label_column_name="isFraud",
)

# arize_client.log returns a Response object from Python's requests module
response = arize_client.log(
    dataframe=combined_prod_df,
    model_id=model_id,
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=actual_schema,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"Step 4 ✅: You have successfully logged {len(combined_prod_df)} data points to Arize!")

On the “Data Ingestion” tab, Actuals should show a value of 9000. You might have to wait longer than 10 minutes for Arize to index the data.

Performance Analysis

Navigate to the “Performance Tracing” tab to monitor your model performance and gain an understanding of what caused the degradation. The accuracy (our default performance metric) is plotted over the 30 days, overlaid on bars that measure the volume of predictions. Our model currently has a very stable accuracy of ~50%.

If you scroll down, the Output Segmentation section includes a confusion matrix.

As we can see, we have a high False Negative rate (incorrectly predicting that a fraudulent transaction is not fraudulent) and a high True Negative rate (correctly predicting that a transaction is not fraudulent). This could mean our model is overfitting and simply predicting “Not Fraudulent” for most transactions. To confirm this assumption, let us compare against our baseline. At the top, you should see a button to add a comparison; select it and choose “Training” as the comparison dataset.

As you can see from the screenshot, 97% of transactions in our training set were correctly predicted as not fraudulent! This explains why our model is more than likely to predict “Not Fraudulent” for the majority of transactions. One thing we can do to fix this is to train the model on a more balanced dataset of fraudulent/non-fraudulent transactions. There are other model changes you could make; we will not go into that depth in this article, but you are welcome to try them and see the results.
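As a sketch of the rebalancing idea, one simple option is to oversample the minority (fraud) class with pandas before retraining. The toy frame below is hypothetical; in practice you would resample your real training set:

```python
import pandas as pd

# Toy imbalanced set: 5 fraudulent rows out of 100 (hypothetical numbers)
df = pd.DataFrame({
    "TransactionAmt": range(100),
    "isFraud": [1 if i < 5 else 0 for i in range(100)],
})

fraud = df[df["isFraud"] == 1]
legit = df[df["isFraud"] == 0]

# Sample the fraud rows with replacement until both classes are the same size
balanced = pd.concat(
    [legit, fraud.sample(len(legit), replace=True, random_state=0)]
).reset_index(drop=True)
print(balanced["isFraud"].mean())  # 0.5: a 50/50 class split
```

Oversampling keeps all the legitimate rows; downsampling the majority class or weighting classes at training time are equally valid alternatives.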

Model Performance Overview

As we continue to check in and improve our model’s performance, we want to be able to quickly and efficiently view all our important model metrics in a single pane. We can set up a Model Performance Dashboard to view our model’s most important metrics in a single configurable layout.

Navigate to the “Dashboard” and select the “Scored Model” template. From there select your model, the features you care to investigate, and the positive class which in this case is 1.

You can edit the dashboard to your liking by clicking the pencil in the top right corner, as the template does contain a lot of verbose information.

Business Impact

Sometimes, we need metrics other than traditional statistical measures to define model performance. Business Impact is a way to measure your scored model's payoff at different thresholds (i.e., the decision boundary for a scored model).

Predicting a fraudulent transaction when it's not, or predicting a transaction is not fraudulent when it indeed is, can add costs to a business. Arize provides functionality for you to create your own equation to represent the financial impact on the business. In the top right, pick the positive outcome of your model; in this case it is identifying a fraudulent transaction (i.e. 1). A formula bar will then appear, and the formula works as follows:

  • When the model correctly predicts fraud (TP), the business makes a $60 profit from the customer staying with us.
  • When the model incorrectly predicts fraud (FP), the business incurs a $200 loss in order to re-acquire the customer.
  • When the model correctly predicts no fraud (TN), the business makes on average $500 from money spent.
  • When the model incorrectly predicts no fraud (FN), we incur a $300 loss from the customer.

Our equation is: 60 * TP_COUNT + 500 * TN_COUNT - 200 * FP_COUNT - 300 * FN_COUNT
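The same payoff is easy to sanity-check locally from the confusion-matrix counts. A minimal sketch of the equation above, where the example counts are hypothetical:

```python
def business_impact(tp_count, tn_count, fp_count, fn_count):
    # Payoff per outcome, per the equation above
    return 60 * tp_count + 500 * tn_count - 200 * fp_count - 300 * fn_count

# Hypothetical confusion-matrix counts
print(business_impact(tp_count=100, tn_count=1500, fp_count=50, fn_count=40))  # 734000
```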

As you can see, our model is making a positive impact of around seven hundred thousand dollars (and this is while the model is at 50% accuracy 👀).


In this walkthrough we've shown how Arize can be used to log prediction and actual data for a model, how to set up alerts, and how to monitor and resolve model performance and data drift issues. Lastly, we showed you how to measure the business impact of your model and set up a model performance dashboard.
