Monitoring ML Models in Production using Arize — Part (2/2)

Michael Louis
Co-Founder & CEO

Following on from our previous article, where we showed you how to set up Arize to monitor your models in production, this tutorial works through three key use cases and shows you how to identify and resolve certain model issues. These use cases are:

  • A new untrained domain is introduced
  • Bad data is received in one of the features
  • The model is inaccurate during some time period

In order to run these use cases, let's create some production data in Arize. We will need to update the timestamps to align with the current day and time, so that the sample data shows up as recent in Arize.

import uuid
import datetime
from datetime import timedelta
from random import randrange

import pandas as pd

df_prod = pd.read_csv("test_transaction.csv")
# Only using 300 transactions per day over the last 30 days
df_prod = df_prod[:9000]

# Get model predictions
X_prod = df_prod[feature_column_names]
X_prod = pd.DataFrame(enc.transform(X_prod).toarray(), columns=enc.get_feature_names_out().reshape(-1))
X_prod["TransactionAmt"] = df_prod[["TransactionAmt"]].to_numpy()
prod_pred = xgb.predict(X_prod)
prod_pred_score = xgb.predict_proba(X_prod)

def random_date(start, end):
    """Return a random datetime between two datetimes."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)
# Adjusting dates for ease of visualization
END_DATE = datetime.date.today()
START_DATE = datetime.date.today() - timedelta(31)

def setPredictionIDandTime(df, start, end):
    for i in range(len(df)):
        df.loc[i, 'prediction_ts'] = int(random_date(start, end).strftime("%s"))
        val = randrange(10)
        # We are adding in a new value to show you data drift later
        df.loc[i, 'P_emaildomain'] = 'cerebrium.ai' if val < 8 else df.loc[i, 'P_emaildomain']

setPredictionIDandTime(df_prod, START_DATE, END_DATE)
test_prod_pred = pd.DataFrame(prod_pred, columns=['predictedFraud'])
# predict_proba returns one column per class; keep the positive-class score
test_prod_pred_score = pd.DataFrame(prod_pred_score[:, 1], columns=['predictedFraudScore'])
combined_prod_df = pd.concat([df_prod.reset_index(drop=True), test_prod_pred, test_prod_pred_score], axis=1)
# Define a Schema() object for Arize to pick up data from the correct columns for logging
# (column names below assume the same schema used for the training set in Part 1)
prod_schema = Schema(
    prediction_id_column_name="TransactionID",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predictedFraud",
    prediction_score_column_name="predictedFraudScore",
    feature_column_names=feature_column_names,
)

# Logging Production Prediction
prod_response = arize_client.log(
    dataframe=combined_prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=prod_schema,
)

# If successful, the server will return a status_code of 200
if prod_response.status_code != 200:
    print(f"logging failed with response code {prod_response.status_code}, {prod_response.text}")
else:
    print("✅ You have successfully logged production set to Arize")

Remember, Arize takes 10 minutes to index the data.

⚠️ DON’T SKIP: In order to move on to the next step, make sure your actuals and training/validation sets are loaded into the platform.

To check:

  1. Navigate to models from the left bar, locate and click on model fraud-detection-model
  2. Click the Overview Tab and make sure you see the actuals as shown below.
  3. Actual data will show up under Model Health. Once the number changes from 0 Actuals to x Actuals (with summary statistics such as cardinality listed in the tables), your production actuals will have been fully recorded on Arize!
  4. Verify the list of features under Model Health.

Setup proactive alerts

It is extremely important to monitor the performance of models used in automated processes, such as a model that decides whether a loan should be awarded. Arize can automatically configure the monitors that are best suited to your data. From the Overview tab, you will see a Monitors window on the right. Click Set Up Monitors and configure:

  1. Datasets: Training Version 1.0
  2. Default Metric: Accuracy, Trigger Alert When: Accuracy, Positive Class: 1
  3. Turn On Monitoring: Drift ✅, Data Quality ✅, Performance ✅

Through Arize you can notify users via email, Slack, or PagerDuty.

Data Quality

It’s important to monitor and immediately surface data quality issues to identify how your data quality maps to your model’s performance. Arize has great monitoring tools to analyse failures in your data quality pipeline, such as missing data or cardinality shifts.

To look at data quality, go to the “Overview” tab and scroll to the bottom. You should see all 7 features used, along with their PSI (Population Stability Index), cardinality, and other statistics. PSI is a metric that measures how much a variable's distribution has shifted between two samples over time: the larger the PSI, the less similar your distributions are.
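PSI is also easy to compute yourself. Below is a minimal sketch, assuming the baseline sample defines the bin edges and using a small epsilon to avoid division by zero; the synthetic samples are purely illustrative:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample."""
    eps = 1e-6  # avoid log(0) / division by zero for empty bins
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
print(psi(baseline, rng.normal(0, 1, 10_000)))  # near 0: same distribution
print(psi(baseline, rng.normal(1, 1, 10_000)))  # well above 0.25: shifted
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, which matches the 0.25 threshold mentioned below.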

From our dataset you can see that our PSI is high for 3 features (typically a threshold will be set at 0.25) and we have many empty values for each of them too — this isn’t good.

If you click on the “M1” feature, you can see drift over time, which shows the PSI as well as the distribution of values compared to the baseline model we uploaded. We can see that there was a sudden spike today. If we click on it, the distribution differs from our baseline in that there are no null values for M1 in our latest predictions. Typically this would be a good thing, but it is worth checking for ourselves.

There is also a data quality tab that shows the cardinality as well as the percentage of empty values for that feature.


  • Missing / Null values could be an indicator of issues from an upstream data source.
  • Cardinality is checked to ensure there are no spikes / drops in feature values.
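Both checks are easy to reproduce locally with pandas before the data ever reaches Arize. A minimal sketch, where the toy frame is hypothetical and the column names simply mirror our dataset:

```python
import pandas as pd

def data_quality_report(df, features):
    # Mirror the Data Quality tab: % empty values and cardinality per feature
    return pd.DataFrame({
        "pct_empty": df[features].isna().mean() * 100,
        "cardinality": df[features].nunique(),
    })

df = pd.DataFrame({
    "P_emaildomain": ["gmail.com", None, "gmail.com", "yahoo.com"],
    "M1": ["T", "F", None, None],
})
print(data_quality_report(df, ["P_emaildomain", "M1"]))
# pct_empty: 25.0 for P_emaildomain, 50.0 for M1; cardinality: 2 and 2
```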

Prediction Drift

Prediction Drift can surface when data drift has impacted your model. Drift is measured by PSI. When your data changes in a way the model is not expecting, it can impact the performance of your model negatively. This can be due to factors such as new data you are collecting which your model hasn’t seen before, or a change in user behaviour based on a change your business made, etc. Before we demonstrate this, we first need to know how accurate our model is.

In order to monitor the accuracy of your model, you need to log the actual output values: in our case, whether the transaction actually turned out to be fraudulent when we followed up with the customer. With Arize, you can log the actual results at a later time by referencing the prediction ID. We are going to simulate actual values to demonstrate this.

# Simulate actuals: set most transactions to actually be fraudulent
for i in range(9000):
    val = randrange(10)
    combined_prod_df.loc[i, 'isFraud'] = 1 if val < 8 else 0
# Define a Schema() object for Arize to pick up data from the correct columns for logging
actual_schema = Schema(
    prediction_id_column_name="TransactionID",  # REQUIRED
    actual_label_column_name="isFraud",
)

# arize_client.log returns a Response object from Python's requests module
response = arize_client.log(
    dataframe=combined_prod_df,
    model_id=model_id,
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=actual_schema,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"Step 4 ✅: You have successfully logged {len(combined_prod_df)} data points to Arize!")

On the “Data Ingestion” tab, Actuals should show a value of 9000. You might have to wait longer than 10 minutes for Arize to index the data.

Performance Analysis

Navigate to the “Performance Tracing” tab to monitor your model performance and gain an understanding of what caused the degradation. The accuracy (our default performance metric) is plotted over the 30 days, overlaid on bars that measure the volume of predictions. Our model currently has a very stable accuracy of ~50%.

If you scroll down, the Output Segmentation section includes a confusion matrix.

As we can see, we have a high False Negative rate (incorrectly predicting that a fraudulent transaction is not fraudulent) and a high True Negative rate (correctly predicting that a transaction is not fraudulent). This could mean our model is overfitting and simply predicting “Not Fraudulent” for most transactions. To confirm this assumption, let us compare against our baseline. At the top, you should see a button to add a comparison; select it and choose “Training” as the comparison dataset.

As you can see from the screenshot, 97% of transactions in our training set were correctly predicted as not fraudulent! This explains why our model is more than likely to predict “Not Fraudulent” for the majority of transactions. One thing we can do to fix this is to train the model on a more balanced dataset of fraudulent/non-fraudulent transactions. There are other model changes you could make; we will not go into that depth in this article, but you are welcome to try them and see the results.
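As a sketch of the rebalancing idea, one simple option is to oversample the minority (fraud) class with pandas before retraining. The toy frame below is hypothetical; in practice you would resample your real training set:

```python
import pandas as pd

# Toy imbalanced set: 5 fraudulent rows out of 100 (hypothetical numbers)
df = pd.DataFrame({
    "TransactionAmt": range(100),
    "isFraud": [1 if i < 5 else 0 for i in range(100)],
})

fraud = df[df["isFraud"] == 1]
legit = df[df["isFraud"] == 0]

# Sample the fraud rows with replacement until both classes are the same size
balanced = pd.concat(
    [legit, fraud.sample(len(legit), replace=True, random_state=0)]
).reset_index(drop=True)
print(balanced["isFraud"].mean())  # 0.5: a 50/50 class split
```

Oversampling keeps all the legitimate rows; downsampling the majority class or weighting classes at training time are equally valid alternatives.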

Model Performance Overview

As we continue to check in and improve our model’s performance, we want to be able to quickly and efficiently view all our important model metrics in a single pane. We can set up a Model Performance Dashboard to view our model’s most important metrics in a single configurable layout.

Navigate to the “Dashboard” and select the “Scored Model” template. From there select your model, the features you care to investigate, and the positive class which in this case is 1.

You can edit the dashboard to your liking by clicking the pencil in the top right corner, as the template does contain a lot of verbose information.

Business Impact

Sometimes, we need metrics other than traditional statistical measures to define model performance. Business Impact is a way to measure your scored model's payoff at different thresholds (i.e., the decision boundary for a scored model).

Predicting a fraudulent transaction when it's not, or predicting a transaction is not fraudulent when it indeed is, can add costs to a business. Arize provides functionality for you to create your own equation to represent the financial impact on the business. In the top right, pick the positive outcome of your model; in this case it is identifying a fraudulent transaction (i.e. 1). A formula bar will then appear, and the formula works as follows:

  • When the model correctly predicts fraud (TP), the business makes a $60 profit from the customer staying with us.
  • When the model incorrectly predicts fraud (FP), the business incurs a $200 loss in order to re-acquire the customer.
  • When the model correctly predicts no fraud (TN), the business makes on average $500 from money spent.
  • When the model incorrectly predicts no fraud (FN), we incur a $300 loss from the customer.

Our equation is: 60 * TP_COUNT + 500 * TN_COUNT - 200 * FP_COUNT - 300 * FN_COUNT
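The same payoff is easy to sanity-check locally from the confusion-matrix counts. A minimal sketch of the equation above, where the example counts are hypothetical:

```python
def business_impact(tp_count, tn_count, fp_count, fn_count):
    # Payoff per outcome, per the equation above
    return 60 * tp_count + 500 * tn_count - 200 * fp_count - 300 * fn_count

# Hypothetical confusion-matrix counts
print(business_impact(tp_count=100, tn_count=1500, fp_count=50, fn_count=40))  # 734000
```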

As you can see, our model is making a positive impact of around seven hundred thousand dollars (and this is while the model is at 50% accuracy 👀).


In this walkthrough we've shown how Arize can be used to log prediction and actual data for a model, how to set up alerts, and how to monitor and resolve model performance and data drift issues. Lastly, we showed you how to measure the business impact of your model and set up a model performance dashboard.
