January 9, 2025
How to Deploy Machine Learning Models: A Comprehensive Guide
Michael Louis
CEO & Founder
Deploying machine learning (ML) models is a critical step in transforming AI projects into impactful applications. Whether you’re creating personalized recommendations, building agentic workflows, or analyzing and transforming large datasets, the success of your application depends on its deployment and its ability to scale. In this guide, we’ll discuss key considerations for deploying ML models and walk through an example tutorial using Cerebrium, a serverless AI infrastructure platform.
What to Consider When Deploying Machine Learning Models
1. Infrastructure and Scalability
Before deploying your ML model, consider where and how it will run. Will it operate on a cloud platform, on-premises servers, or edge devices? This choice is usually driven by requirements around security, latency, and the compute resources needed.
2. Latency Requirements
Real-time applications, like voice assistants or fraud detection systems, require low-latency deployment. For smaller models, running on edge devices is often the best option; if that doesn’t suit your requirements, look for platforms with efficient networking that can meet your response-time targets.
3. Model Performance
Evaluate your model’s computational needs, such as GPU or CPU requirements. Performance bottlenecks can arise if your deployment environment lacks the resources to handle inference workloads.
4. Serverless vs. Non-Serverless
If your traffic patterns are volatile, with many peaks and troughs, it may be worth exploring platforms that offer serverless capabilities, i.e., where you are charged only for usage instead of running instances 24/7. If your service also requires GPUs, serverless becomes even more compelling given how expensive GPUs are.
5. Monitoring and Logging
Monitoring and logging are key to any model deployment - you need to know how your model is performing over time and when something is going wrong. Select a platform that gives you the flexibility to implement your own logging to track predictions, errors, and resource usage, or that provides it out of the box, enabling quick identification and resolution of issues.
6. Security and Compliance
Ensure the deployment environment adheres to necessary security standards, such as encryption and secure APIs. For sensitive data, compliance with regulations like GDPR or HIPAA is crucial.
7. Cost Management
The cost of deploying an ML model can escalate with increasing inference requests and resource consumption. Optimize your deployment for cost-effectiveness without compromising performance. Levers you can experiment with here include serverless deployment and batching requests to use compute resources more efficiently.
Based on the above, we will show an example of how to deploy a model on Cerebrium - a serverless AI infrastructure platform. Cerebrium addresses all of the points above and allows you to deploy models onto serverless CPUs/GPUs that autoscale in the cloud. It also comes with monitoring and logging out of the box and is SOC 2, GDPR, and HIPAA compliant.
Tutorial: Deploying a Machine Learning Model with Cerebrium
In this tutorial we will deploy a simple DistilBERT model that classifies the sentiment of a sentence.
To get started, install the Cerebrium Python package:
Next, log in to your account. If you do not have an account, running the same command below will let you create one:
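```bash
cerebrium login
```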
Now create a base Cerebrium project with the name “text-classifier”:
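```bash
cerebrium init text-classifier
```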
You will see this command created a folder with two main files:
cerebrium.toml - this is responsible for our application environment setup: everything from hardware, Python and apt packages, shell commands, and scaling criteria.
main.py - this is where our Python code lives.
Let us first setup our cerebrium.toml:
Add the following to your cerebrium.toml
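A minimal configuration for this app might look like the following - the section and key names below follow Cerebrium’s default scaffold, so check the file that `cerebrium init` generated for the exact schema:

```toml
[cerebrium.deployment]
name = "text-classifier"
python_version = "3.11"

[cerebrium.hardware]
compute = "CPU"
cpu = 2
memory = 8.0

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5

[cerebrium.dependencies.pip]
transformers = "latest"
torch = "latest"
```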
You will see we use the default settings, which specify that this runs on a CPU and define the scaling criteria, i.e., the app scales down to zero with no traffic and up to a maximum of 5 instances. We also add the pip dependencies we need in our environment.
Next, update your main.py to reflect the following:
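A minimal sketch of main.py is shown below - the specific model checkpoint and the shape of the returned dictionary are our choices for illustration:

```python
from transformers import pipeline

# Load the model once at module import time so it is reused across requests
# instead of being reloaded on every call.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def run(text: str):
    # Classify the input sentence and return the label and confidence score.
    prediction = classifier(text)[0]
    return {"label": prediction["label"], "score": float(prediction["score"])}
```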
What is unique about Cerebrium is that there is no special syntax or way of doing things: each function is simply turned into an API endpoint, where the parameters of the function are your JSON parameters.
In order to deploy this code, we can run:
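```bash
cerebrium deploy
```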
Once deployment has been successful, you should see the endpoint URL printed out, as well as a link to your app dashboard. Click the link to your app dashboard.
You should be able to see the status of your application, analytics around its performance, logs, metrics, and much more! On the overview tab, you will see a curl request that you can use to call your model. Just update the end of the URL with the name of the function you would like to call - in this case, run. Your curl request should look something like this:
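For illustration, with placeholder values for the project ID and API key (copy the real base URL and key from your dashboard, as they will differ):

```bash
curl -X POST https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/text-classifier/run \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"text": "I love using Cerebrium!"}'
```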
You should then get the following output:
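Assuming a run function that returns the classifier’s label and score, the response should look something like this (the run ID and exact response envelope will differ for your deployment):

```json
{
  "run_id": "...",
  "result": {
    "label": "POSITIVE",
    "score": 0.9998
  }
}
```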
Conclusion
Deploying machine learning models doesn’t have to be daunting. By considering factors like infrastructure, latency, and monitoring, and leveraging platforms like Cerebrium, you can focus on creating impactful applications while leaving the complexities of infrastructure to the experts.