Monitoring ML Models in Production using Arize — Part (1/2)

Michael Louis
Co-Founder & CEO

Once you have a model that is live and in production, the first thing you want to know is if it is performing as expected!

In this tutorial we will use Arize to build robust model observability, focusing on areas such as data drift and model decay, and demonstrate how to resolve these issues. We work through the common use case of detecting fraud in financial transactions and making sure the model behaves as expected over time.

By the end of this tutorial you will be able to:
  • Integrate the Arize API and log data
  • Configure a baseline model for measuring performance
  • Simulate and correct data drift
  • Notify the relevant parties of model decay
  • Set up model performance monitoring

To get started, the first thing we need is a model that we can deploy and monitor. Below are the (extremely) basic steps taken to train our fraud detection model. The data for our model can be found here.

!pip install -U xgboost -q
!pip install --upgrade twisted -q
# Don't worry if you get shown a warning here!
!pip install attrs==19.2.0 -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com -q
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv("train_transaction.csv")
# Balance the classes: undersample non-fraud rows to match the number of fraud rows
df = pd.concat([df[df.isFraud == 0].sample(n=len(df[df.isFraud == 1])), df[df.isFraud == 1]], axis=0)
feature_column_names = ["ProductCD", "P_emaildomain", "R_emaildomain", "card4", "M1", "M2", "M3"]
X = df[feature_column_names]
y = df.isFraud
# One-hot encode the categorical features (the encoder must be fitted before it can transform)
enc = OneHotEncoder(handle_unknown="ignore")
X = pd.DataFrame(enc.fit_transform(X).toarray(), columns=enc.get_feature_names_out().reshape(-1))
X["TransactionAmt"] = df[["TransactionAmt"]].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train a XGBoost Classifier
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
model = xgb.fit(X_train, y_train)
# Predict on test set
preds = xgb.predict(X_test)
print(accuracy_score(y_test, preds))

Integrating Arize API

Now that we have built our model, we need to set up model monitoring to make sure it behaves as expected in production and to identify root causes quickly if problems occur. Arize is a great model monitoring platform because it offers enough functionality for enterprises but is just as easy to set up for small businesses.

You will need to create an Arize account to continue. It has a free tier, which is sufficient for this tutorial. On Arize, go to “Space Settings” and on the right you will see your “Space Key” and “API Key”. We will need these below.

Below we install the Arize API to use in our project.

!pip install arize -q

We need to log our training data to Arize so that it can serve as a baseline. Typically you would use training data as your baseline and use that baseline to identify data drift and to root-cause model decay.

You may have noticed that we used one-hot encoding to handle the categorical variables in our dataset. We do not want to upload the encoded dataframe as is, because we would then have hundreds of features to monitor. For example, imagine we had a state variable is_state. With one-hot encoding it would be transformed into is_state_CA, is_state_NY, is_state_WA, etc. By uploading is_state as a single feature instead, we can monitor the data of all states in one place.
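To make the column blow-up concrete, here is a minimal sketch in plain Python. The `rows` data and the `one_hot` helper are hypothetical and only illustrate the idea; the tutorial itself uses scikit-learn's OneHotEncoder.

```python
# Hypothetical toy data: a single categorical feature with three distinct values.
rows = [{"state": "CA"}, {"state": "NY"}, {"state": "WA"}, {"state": "CA"}]


def one_hot(rows, column):
    """Expand one categorical column into one 0/1 column per distinct value."""
    categories = sorted({row[column] for row in rows})
    return [
        {f"{column}_{cat}": int(row[column] == cat) for cat in categories}
        for row in rows
    ]


encoded = one_hot(rows, "state")
# One raw feature becomes as many encoded features as there are distinct values.
print(len(rows[0]), "raw feature ->", len(encoded[0]), "encoded features")
print(encoded[0])
```

With seven categorical features, some of which (such as email domains) have many distinct values, the encoded column count grows quickly, which is why the raw columns are the ones worth monitoring.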

# Need to copy the original dataframe, since X_train is one-hot encoded
orig = df.copy()
train_orig = orig.iloc[X_train.index.values].reset_index(drop=True)
train_preds = xgb.predict(X_train)
train_pred = pd.DataFrame(train_preds, columns=['predictedFraud'])
combined_train_df = pd.concat([train_orig, train_pred], axis=1)
combined_train_df.fillna('', inplace=True)
from arize.pandas.logger import Client, Schema
from arize.utils.types import Environments, ModelTypes
SPACE_KEY = "SPACE_KEY"  # placeholder: replace with your Space Key
API_KEY = "API_KEY"  # placeholder: replace with your API Key
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "fraud-detection-tutorial"  # This is the model name that will show up in Arize
model_version = "v1.0"  # Version of model - can be any string
# Don't change the values here, we are just making sure you changed the keys to yours
if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Arize setup complete!")

Now that the Arize library is installed and our keys are added, we can load our training data into Arize.

You will see that we define a schema, which tells Arize which columns to pick up from the dataframe we pass to the log command:

  • Prediction id column name: The unique identifier of what we are trying to predict. In this case, the unique id of the customer
  • Prediction label column name: The name of the column that contains the prediction. In this case, whether the transaction is fraud or not fraud
  • Prediction score column name: The column that contains the score (the model's confidence) associated with our prediction. This is not the same thing as accuracy.
  • Actual label column name: The column that contains the actual result, i.e. whether the transaction was reported as fraud
  • Feature column names: The rest of the columns in your dataframe which are used as input features to the model

When we log our data to Arize, we give it the full dataframe. Based on the schema we set, it will automatically pick up the appropriate columns. We also pass information such as:

  • Model Version: As you continue to make updates to your model, you will want to keep track of which model version this data relates to.
  • Model Type: The type of model. There are only 4 options: Binary, Categorical, Score Categorical and Numeric. We will use Score Categorical.
  • Environment: Indicates whether we are logging training, validation or production data.
# Define a Schema() object for Arize to pick up data from the correct columns for logging
training_schema = Schema(
    ...  # column-name arguments, as described in the list above
)
# Logging Training DataFrame
training_response = arize_client.log(
    ...  # dataframe, schema, model ID, model version, model type and environment
)
# If successful, the server will return a status_code of 200
if training_response.status_code != 200:
    print(f"logging failed with response code {training_response.status_code}, {training_response.text}")
else:
    print("✅ You have successfully logged training set to Arize")
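Because the schema only names columns, a misspelled or missing column will not show up until logging fails. The sketch below shows one way to check column names before logging. `ColumnRoles` and `missing_columns` are hypothetical helpers written for this illustration and are not part of the Arize SDK; the `TransactionID` column name is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnRoles:
    """Hypothetical stand-in for a logging schema; NOT the Arize Schema class."""
    prediction_id: str
    prediction_label: str
    actual_label: Optional[str] = None
    prediction_score: Optional[str] = None
    features: List[str] = field(default_factory=list)

    def referenced_columns(self) -> List[str]:
        single = [self.prediction_id, self.prediction_label,
                  self.actual_label, self.prediction_score]
        return [c for c in single if c is not None] + list(self.features)


def missing_columns(roles: ColumnRoles, dataframe_columns: List[str]) -> List[str]:
    """Return the columns the schema refers to that the dataframe lacks."""
    available = set(dataframe_columns)
    return [c for c in roles.referenced_columns() if c not in available]


# Illustrative column names; "TransactionID" is an assumed ID column.
roles = ColumnRoles(
    prediction_id="TransactionID",
    prediction_label="predictedFraud",
    actual_label="isFraud",
    features=["ProductCD", "card4", "TransactionAmt"],
)
columns = ["TransactionID", "isFraud", "ProductCD", "card4", "predictedFraud"]
print(missing_columns(roles, columns))  # ['TransactionAmt']
```

With a pandas dataframe, the same check could be run against `list(combined_train_df.columns)` before calling the log command.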

Once you have run the above, navigate back to the Arize web app, click “Space Overview” and then click fraud-detection-tutorial. Navigate to the “Data Ingestion” tab, where you will see the data being ingested and processed by Arize. It can take up to 10 minutes for Arize to process this data. Once 10 minutes have passed you should see information on the “Datasets” tab. If your prediction volume isn’t ~6000, then Arize is still busy indexing the data.

We now want to set this training data as our baseline model, so let’s do that. On your fraud-detection-tutorial model, go to the “Datasets” tab and, in the right corner, click “Configure Baseline”.

To configure your baseline model, select the following on the prompts:

  • Set up a baseline: Pre-production
  • Set up a pre-production baseline: Training Version 1.0

Now that this is set up, any future logs that come into your production model will be compared to the baseline so we can monitor data and prediction drift.
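To give a feel for what comparing production data to a baseline involves, here is a minimal sketch of the population stability index (PSI), a common drift metric for a single categorical feature. This text does not say which metric Arize computes, so treat PSI here as an illustrative assumption. The sample values are made up.

```python
import math
from collections import Counter


def psi(baseline, production, eps=1e-6):
    """Population stability index between two samples of one categorical feature."""
    categories = set(baseline) | set(production)
    base_counts, prod_counts = Counter(baseline), Counter(production)
    score = 0.0
    for cat in categories:
        # eps keeps the logarithm finite when a category is absent from one sample
        p = max(base_counts[cat] / len(baseline), eps)
        q = max(prod_counts[cat] / len(production), eps)
        score += (q - p) * math.log(q / p)
    return score


# Made-up samples of a card-type feature.
baseline = ["visa"] * 60 + ["mastercard"] * 40
same_mix = ["visa"] * 30 + ["mastercard"] * 20    # same proportions as the baseline
shifted = ["visa"] * 20 + ["mastercard"] * 80     # proportions have moved

print(round(psi(baseline, same_mix), 4))  # identical proportions give 0.0
print(round(psi(baseline, shifted), 4))   # a shifted mix gives a clearly positive score
```

A score of zero means the production distribution matches the baseline; the further the proportions move apart, the larger the score becomes.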

Am I the only one feeling pretty badass right now? 👀

Simulate Production Deployment

We are going to simulate a production deployment in our notebook here. The same steps apply if you want to deploy the model and test the functionality against a live deployment.

In your deployed model, you want to log both the data sent in and every prediction returned so we can compare the results to our baseline model and determine if there is any data or prediction drift. It is also useful just to monitor incoming traffic.
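As a rough sketch of that pattern, the handler below scores one request and records the inputs together with the prediction. `StubModel`, `handle_request` and `log_record` are hypothetical names for illustration; in a real service the model would be the trained classifier and the logging callback would forward the record to your monitoring platform.

```python
import uuid


class StubModel:
    """Hypothetical stand-in for the trained classifier."""

    def predict_proba(self, features):
        # Fixed rule for illustration only: large amounts look more like fraud.
        p_fraud = 0.9 if features["TransactionAmt"] > 1000 else 0.1
        return [1 - p_fraud, p_fraud]


def handle_request(features, model, log_record):
    """Score one request and log the inputs together with the prediction."""
    probabilities = model.predict_proba(features)
    label = probabilities.index(max(probabilities))  # most likely class
    record = {
        "prediction_id": str(uuid.uuid4()),
        **features,
        "predictedFraud": label,
        "predictedFraudScore": max(probabilities),
    }
    log_record(record)  # stand-in for sending the record to the monitoring service
    return {"prediction_id": record["prediction_id"], "isFraud": label}


logged = []
response = handle_request(
    {"ProductCD": "W", "TransactionAmt": 2500.0}, StubModel(), logged.append
)
print(response["isFraud"], logged[0]["predictedFraudScore"])
```

Keeping a unique prediction id on every record is what later allows an actual outcome to be matched to the prediction it belongs to.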

Below we follow similar steps to what we had to do before for logging training data:

  • We define our schema. This time there is no actual label column, since we don’t yet know whether the transaction was indeed fraud.
  • We set the environment to production. Note that the model version and ID are the same.

Let us simulate sending data into our model and returning a result. Typically this information would be sent in via a REST endpoint. For simulation purposes we are just using data from the test set defined above.

#Predict a single prediction
pred = xgb.predict(X_test.head(1))
pred_score = [max(x) for x in xgb.predict_proba(X_test.head(1))]
# Prepare dataframe to upload to Arize
test_orig = orig.iloc[X_test.index.values].reset_index(drop=True)
test_pred = pd.DataFrame(pred, columns=['predictedFraud'])
test_pred_score = pd.DataFrame(pred_score, columns=['predictedFraudScore'])
combined_test_df = pd.concat([test_orig.head(1).reset_index(drop=True), test_pred, test_pred_score], axis=1)
combined_test_df.fillna('', inplace=True)

Then we log the production prediction:

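prod_schema = Schema(
    ...  # same column-name arguments as before, without the actual label column
)
# Logging Production Prediction
prod_response = arize_client.log(
    ...  # dataframe, schema, model ID, model version, model type and production environment
)
if prod_response.status_code != 200:
    print(f"logging failed with response code {prod_response.status_code}, {prod_response.text}")
else:
    print("✅ You have successfully logged your request to Arize")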

Once you have sent a request to your endpoint, you should see it in Arize. Navigate back to the “Data Ingestion” tab and you should see an additional request. It may take a few minutes.

That’s it for part 1 of our model monitoring tutorial. In part 2, we will look at how to identify model issues in Arize and how to resolve them. We will work through three key use cases:

  • A new untrained domain is introduced
  • Bad data is received in one of the features
  • The model is inaccurate during some time period
