Practical ways to measure the success of your ML Models in production

Michael Louis
Co-Founder & CEO

A challenge commonly faced by data and machine learning teams is convincing management of the effect their data- and ML-driven implementations can have on the business. This is often due to the long delay between implementing these machine learning projects and their realised ROI. Naturally, this is a tough sell to management.

Measuring the success of these projects is difficult. We have faced this problem ourselves and have heard from many others who share the same sentiment. We hope that by sharing our learnings and those of our peers we can help businesses have a higher success rate in implementing machine learning solutions within their business.

Model performance vs business performance

To start, it's important to make a distinction between model performance and business performance. Model performance refers to model metrics such as accuracy, precision, recall etc. Model performance has an impact on business performance.

Business performance is usually associated with a KPI or combination of KPIs that a business is trying to improve, such as “increase sign-ups” or “decrease default rates”.

Business performance is a function of many variables, not just model performance. Poor model performance will likely negatively impact business performance, but good model performance does not guarantee good business performance!

What are the situations that make it difficult to measure success?

Delayed feedback

A common problem in many machine learning models is delayed feedback: you don’t know how your model has been performing for a long period of time. This happens frequently in credit models, where you will not know whether your model assigned the right credit limit to a customer until they default, which could be a year or two later!

Recommended Actions

There are many machine learning products that recommend actions to the user, where following these actions will lead to an uplift or cost-saving. However, clients often don’t follow your recommendation, either because they feel differently about the situation or because they don’t react in a short enough time period. This can make it look as if your model is performing poorly when in fact the user or business is just not following your advice.

KPIs are qualitative and difficult to measure

In some instances, machine learning products optimise a process in a way that leads to a more enjoyable user experience, yet user experience is hard to quantify. An example of this would be Otter.AI, the automatic voice note-taking app. It allows me to concentrate on calls with customers, review call transcripts and read summaries of key discussion points, but what is it actually doing? Saving me time? Taking more detailed notes? It is hard to quantify those.

When multiple metrics could be used

With recommendation systems, there could be multiple measures of success. For example, on an e-commerce platform, some of the metrics you could measure are average time to add-to-cart, average time to checkout, the percentage of cart items that were recommended etc. So do you monitor all of these or just one of them?

Steps to take when measuring the success of a Model

1. Decide on a KPI that you have ‘easy visibility’ of

What I mean by ‘easy visibility’ is that the KPI should be easy to quantify. Too often teams come up with elaborate metrics that are hard to measure or quantify, or that are made up of a combination of metrics. Choose one metric that is your north star and have at most 2 secondary metrics.

2. Have a baseline or history regarding that metric

In order to quantify progress you need a baseline value of the metric to compare against. After selecting a KPI, decide whether you will compare it daily, weekly, monthly etc. If you don’t have a baseline or historical data to compare against, then I would revisit step 1.
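As a minimal sketch of what a baseline comparison could look like, the snippet below compares the average weekly value of a KPI after launch with the average over a historical baseline period. The sign-up numbers and the function name `weekly_uplift` are made up for illustration; in practice you would also want to account for seasonality and normal week-to-week noise.

```python
from statistics import mean


def weekly_uplift(baseline_weeks, current_weeks):
    """Relative change of the current KPI average versus the baseline average."""
    baseline = mean(baseline_weeks)
    current = mean(current_weeks)
    return (current - baseline) / baseline


# Hypothetical weekly sign-up counts before and after the model went live.
baseline_signups = [400, 420, 380, 400]
current_signups = [440, 460, 430, 430]

uplift = weekly_uplift(baseline_signups, current_signups)
print(f"Sign-ups changed by {uplift:.1%} versus baseline")
```

The same function works for daily or monthly values, as long as both lists use the same period.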

3. Have a good evaluation method

If your KPI is dependent on a user taking an action recommended by your system, i.e. “do this in order to optimise this process”, make sure you have a method to check that the user did in fact follow your recommendation. Too often, a model seems to underperform, but on investigation it turns out its suggestions were ignored by the user. This does raise the question of whether a model can be called performant if the user ignores its advice, but that is out of scope for this article.
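One simple way to do this, sketched below, is to log for every recommendation whether it was followed and then report the outcome separately for followed and ignored recommendations. The record fields (`followed`, `waste_reduction_pct`) and the values are hypothetical.

```python
from statistics import mean

# Hypothetical log: one record per recommendation made by the system.
records = [
    {"followed": True, "waste_reduction_pct": 12.0},
    {"followed": True, "waste_reduction_pct": 8.0},
    {"followed": False, "waste_reduction_pct": 1.0},
    {"followed": True, "waste_reduction_pct": 10.0},
    {"followed": False, "waste_reduction_pct": 3.0},
]


def summarise_by_adherence(records, outcome_key):
    """Split the outcome by whether the recommendation was followed."""
    followed = [r[outcome_key] for r in records if r["followed"]]
    ignored = [r[outcome_key] for r in records if not r["followed"]]
    return {
        "adherence_rate": len(followed) / len(records),
        "mean_outcome_followed": mean(followed) if followed else None,
        "mean_outcome_ignored": mean(ignored) if ignored else None,
    }


summary = summarise_by_adherence(records, "waste_reduction_pct")
print(summary)
```

A low adherence rate next to a good outcome for followed recommendations suggests the problem is adoption rather than the model. Note that users choose which recommendations to follow, so the two groups are not a controlled experiment.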

4. A change in the model should directly impact the KPI

At the beginning of the article we said that model performance does not guarantee business performance, which is true, but it does affect it. Therefore, we should be able to work out which model metric (accuracy, recall etc.) has the biggest impact on our north star KPI. This also gives us an indication of how much further we could improve a business metric with some work, or whether time spent improving a model will lead to suboptimal returns.
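A rough first check, assuming you have recorded both model metrics and the KPI for several past model versions, is to see which model metric moves most closely with the KPI. The sketch below uses a hand-written Pearson correlation on made-up numbers. With so few data points, and with other factors affecting the KPI, this is only a hint about where to look and not proof of cause and effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical history: one entry per deployed model version.
model_metrics = {
    "recall": [0.60, 0.65, 0.70, 0.75, 0.80],
    "precision": [0.80, 0.78, 0.82, 0.79, 0.81],
}
weekly_signups = [100, 110, 120, 130, 140]  # north star KPI per version

correlations = {
    name: pearson(values, weekly_signups)
    for name, values in model_metrics.items()
}
# The metric whose movement tracks the KPI most closely (by absolute value).
strongest = max(correlations, key=lambda name: abs(correlations[name]))
print(correlations, strongest)
```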

Example: Dependent on user action

A machine learning company helps manufacturing and production companies reduce waste, pick up machine defects and optimise processes. It recommends to business users the industrial settings that reduce waste, the machines that should be serviced and the optimisations that should be made to production processes. Initially, the company encountered the issue we mentioned above, whereby a customer was not implementing recommendations during the POC (Proof of Concept) stage, leading to no efficiency gain.

Eventually, the machine learning company tracked metrics regarding temperature, pressure, machine service dates etc., which gave it exact data to show when a recommendation was made and when, or if, it was implemented by the business. It also allowed the machine learning company to notify the business if a recommendation had not been implemented within a certain time limit and to quantify the resulting productivity losses. That company now generates efficiency gains of between 30–40% for companies in the manufacturing space.
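To illustrate the time-limit check described above, here is a minimal sketch that flags recommendations which have not been implemented within a given window. The field names (`made_at`, `implemented_at`), the IDs and the 7-day limit are assumptions for illustration and not the company's actual setup.

```python
from datetime import datetime, timedelta


def overdue_recommendations(recommendations, now, time_limit):
    """Return recommendations not yet implemented after `time_limit` has passed."""
    return [
        rec
        for rec in recommendations
        if rec["implemented_at"] is None and now - rec["made_at"] > time_limit
    ]


# Hypothetical recommendation log.
recommendations = [
    {"id": "service-press-3", "made_at": datetime(2024, 1, 1), "implemented_at": datetime(2024, 1, 3)},
    {"id": "lower-oven-temp", "made_at": datetime(2024, 1, 2), "implemented_at": None},
    {"id": "adjust-pressure", "made_at": datetime(2024, 1, 14), "implemented_at": None},
]

overdue = overdue_recommendations(
    recommendations,
    now=datetime(2024, 1, 15),
    time_limit=timedelta(days=7),
)
for rec in overdue:
    print(f"Reminder: '{rec['id']}' has not been implemented yet")
```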

Example: Delayed feedback in credit models

Most banks and fintechs implement credit models where the goal is to lend as much money as possible to consumers at the lowest possible default rate. The biggest issue these companies face is that clients only default 4 months to 2+ years later, so it is difficult to determine whether your model is working as intended right now.

A good evaluation method seems to be the most common way businesses have measured success, and they have used the following:

  • Backtesting: The testing of their predictive model on historical data. There is a lot more to say here in terms of bias, changes in consumer markets and creating good hold-out strategies, but we will not dive into that here.
  • Simulations: Some businesses have found success with Monte Carlo simulations, which are used to predict the probability of a variety of outcomes, essentially running hundreds of thousands of experiments and seeing how many times a client would default. However, this requires a lot of work to be effective.
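To make the simulation idea concrete, here is a deliberately simplified Monte Carlo sketch. It assumes the credit model outputs a default probability per client and that clients default independently of each other, which is rarely true in practice since defaults tend to rise together in a downturn. The portfolio, the probabilities and the 12% threshold are made up.

```python
import random


def simulate_default_rates(default_probs, n_trials, seed=0):
    """Simulate the portfolio default rate `n_trials` times.

    Each trial draws, for every client, whether they default according to
    their predicted probability (independence between clients is assumed).
    """
    rng = random.Random(seed)
    n_clients = len(default_probs)
    rates = []
    for _ in range(n_trials):
        defaults = sum(rng.random() < p for p in default_probs)
        rates.append(defaults / n_clients)
    return rates


# Hypothetical portfolio: 100 clients across four predicted risk levels.
default_probs = [0.02, 0.05, 0.10, 0.20] * 25

rates = simulate_default_rates(default_probs, n_trials=2000)
mean_rate = sum(rates) / len(rates)
# Share of simulated scenarios where the default rate exceeds 12%.
share_above_threshold = sum(r > 0.12 for r in rates) / len(rates)

print(f"Average simulated default rate: {mean_rate:.3f}")
print(f"Share of scenarios above 12% defaults: {share_above_threshold:.3f}")
```

The output is a distribution of possible default rates rather than a single number, which can be compared with actual defaults as they slowly come in.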


The points mentioned above may seem like a small part of the process of implementing machine learning solutions, and they are, but you would be surprised how many companies have struggled with them. The list above is also not exhaustive, and there are many more complex discussions to have about model and business evaluation at a later stage. We would love to hear about issues you faced when implementing your own ML-driven products.
