November 18, 2024

Top 5 Serverless GPU providers

Michael Louis

Founder & CEO

With the rise in demand for AI-powered workloads and the ever-improving state of both closed and open-source models, it is no surprise that the landscape of GPU infrastructure has evolved over the last few years. While GPUs are getting more efficient and decreasing in cost, companies are still looking for ways to optimize spend and gain access to the latest in-demand chips. Serverless GPU providers offer flexible and efficient solutions to developers and companies deploying AI applications.

Unlike traditional setups, serverless GPU platforms enable users to pay only for the compute time they actually use, which can be a cost-effective option for projects with fluctuating workloads or intermittent demand. In this article, we explore five prominent serverless GPU providers, highlighting their cold-start times, unique features and areas of specialization.

When is it best to use serverless GPUs?

Serverless GPUs can be a good fit for:

  1. Model Serving: Deploying and running AI models for inference with volatile traffic.

  2. Model Fine-tuning: Fine-tuning pre-trained models on custom datasets for specific use cases.

  3. Video and image processing: Accelerating tasks such as video encoding, image processing, and rendering on-demand.

  4. CI/CD: Streamlining GPU-intensive CI/CD pipelines for faster builds and testing.

  5. Batch Processing: Running large-scale simulations, predictions, or analytics workloads that don’t need constant resource allocation and for which you don’t want to set up infrastructure.

  6. Data Augmentation: Transforming, preprocessing, or augmenting datasets for machine learning pipelines, especially for computer vision tasks.

  7. Event-Driven Computing: Handling GPU-intensive tasks triggered by events, such as processing user uploads or on-demand computations.

1. Cerebrium

Cerebrium is a serverless AI infrastructure platform designed to support a wide range of AI applications: serving AI models, running real-time applications and batch jobs, and powering voice AI applications. Cerebrium offers over ten types of GPUs, giving users the flexibility to select the hardware that best matches their workload requirements.

Cerebrium takes your Python code and deploys it as is - no learning curve or migrations needed. Cerebrium focuses on performant use cases, with extremely low cold start times and minimal network latency added to each request. It also offers many abstractions that make building and deploying AI applications easy, such as support for batching, WebSockets, and bringing your own ASGI applications (Gradio, Streamlit, etc.). Their platform is suitable for a wide range of AI application use cases.
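
As a rough illustration, a deployment can be as small as a single Python file whose functions become callable endpoints when you deploy; the function name, model, and payload fields below are illustrative, and the exact project layout and config file are described in Cerebrium's docs.

    # main.py - a minimal sketch; Cerebrium runs plain Python and exposes functions on deploy
    from transformers import pipeline

    # Loaded once when the container starts, then reused across warm requests
    classifier = pipeline("sentiment-analysis")

    def predict(text: str):
        # Invoked per request; JSON payload fields map to the function's arguments
        return {"label": classifier(text)[0]["label"]}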

Average cold start times: 2-4 seconds

Pros:

  • Build times - 8-14 seconds

  • Over 10 GPU varieties

2. Replicate

Replicate offers serverless GPU-powered inference for a wide range of pre-trained models, as well as the ability to deploy custom models on GPUs behind a serverless endpoint.

Pre-trained Models

For most users, the main benefit of Replicate is its extensive library of pre-trained models that are ready to use. The details of which GPU resources each model needs are generally abstracted away from the user, who can simply specify the model name. This creates a low barrier to entry for users to call the models and incorporate them into their workloads.
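
For example, calling a hosted model with Replicate's Python client is roughly a one-liner; the model reference and input below are placeholders rather than a specific recommendation.

    # Requires a REPLICATE_API_TOKEN environment variable; the model reference is illustrative
    import replicate

    output = replicate.run(
        "owner/model-name:version-id",   # models are addressed as owner/name:version
        input={"prompt": "an astronaut riding a horse"},
    )
    print(output)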

Custom Models

Replicate also allows users to deploy custom models. In the context of Replicate, a “model” refers to a trained, packaged, and published software program that accepts inputs and returns outputs.

To create and deploy a custom model on Replicate, you use Cog, their open-source framework for packaging models. You specify your build requirements in a cog.yaml file and your prediction code in a predict.py file, and Cog builds, deploys, and scales this as a Docker container.
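
A sketch of what the prediction code might look like: predict.py defines a Predictor with a setup() method for loading weights and a predict() method that handles requests, while cog.yaml (not shown) lists the GPU and Python package requirements. The model here is a trivial stand-in.

    # predict.py - sketch of a Cog predictor
    from cog import BasePredictor, Input

    class Predictor(BasePredictor):
        def setup(self):
            # Runs once when the container starts; load real model weights here
            self.prefix = "echo: "

        def predict(self, prompt: str = Input(description="Text prompt")) -> str:
            # Runs for every request
            return self.prefix + prompt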

Average cold start times: 60+ seconds

Pros:

  • Extensive Model Library

  • Client SDKs in Node.js, Elixir, Python

3. RunPod

RunPod has two offerings:

  • A serverless GPU offering, called RunPod Serverless.

  • A non-serverless GPU offering, called RunPod Pods, which are virtual machines with GPUs.

RunPod Serverless lets you deploy custom endpoints with your choice of GPU via a couple different modalities:

  1. Quick Deploy: Pre-built custom endpoints for popular AI models.

  2. Handler Functions: Bring your own functions to run in the cloud.

  3. vLLM/SGLang Endpoint: Specify and run a Hugging Face model in the cloud.

RunPod is very Docker-centric, meaning you can bring your own Docker container or use containers contributed by the community. Much of the setup with RunPod is done via the web console: specifying your GPU, your scaling requirements, region, etc. From there, you can deploy your endpoint and call it via a request API, and RunPod will automatically handle the scaling. They have a large community, so asking for help or finding an existing Docker container for your use case is easy.
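
For the Handler Functions modality mentioned above, a worker is just a Python callable that receives a job payload; a minimal sketch, with the inference step stubbed out:

    # handler.py - a minimal RunPod serverless worker (sketch)
    import runpod

    def handler(job):
        # job["input"] carries the JSON body sent to the endpoint
        prompt = job["input"].get("prompt", "")
        # ... run GPU inference here; this stub just echoes the input ...
        return {"echo": prompt}

    # Starts the worker loop that pulls jobs from the endpoint's queue
    runpod.serverless.start({"handler": handler})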

Cold-start times: 6-12s

Pros:

  • Extensive GPU variety - the only provider on this list with support for AMD chips

  • Cheapest pricing

4. Baseten

Baseten is a serverless platform that is highly focused on model serving and inference.

They offer an open-source framework called Truss for configuring, packaging, and deploying models. To deploy a model on Baseten, you specify the resources you need in a config.yaml file, such as your hardware configuration, Python environment, scaling requirements, etc.

Truss then packages your model and creates a Docker image that it pushes to Baseten, where it can be deployed and run. Baseten automatically handles the auto-scaling of the endpoint based on the requirements you set, and you can interact with it via an API. Baseten has a really great platform UX, but it isn't as suitable if you have use cases outside of model serving and inference.
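
Alongside config.yaml, a Truss contains a model/model.py with a Model class that Baseten calls at startup and per request; a rough sketch, with the actual model loading replaced by a stand-in:

    # model/model.py - sketch of a Truss model
    class Model:
        def __init__(self, **kwargs):
            self._model = None

        def load(self):
            # Called once at startup; load weights onto the GPU here
            self._model = lambda prompt: f"echo: {prompt}"  # stand-in for a real model

        def predict(self, model_input):
            # Called per request with the parsed JSON body
            return {"output": self._model(model_input["prompt"])}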

Cold-start times: 8-12s

Pros:

  • Offers the option to deploy into your own cloud

  • Regional Support

5. Modal

Modal offers functions as a service - essentially the ability to deploy and run GPU-accelerated functions via their Python SDK.

Modal provides many abstractions that users take advantage of by adding decorators to their Python code: turning a function into an endpoint, adding WebSocket support, enabling batching, and much more. Modal is suitable for a wide range of use cases - not just model serving, fine-tuning, and training, but also other potentially GPU-accelerated workflows like CI/CD, since it allows you to chain functions together whether they are CPU- or GPU-based.
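
As a rough sketch, a Modal function that requests a GPU and is invoked from a local entrypoint looks something like this; the GPU type, image contents, and function body are illustrative:

    # A minimal Modal app (sketch)
    import modal

    app = modal.App("example-app")
    image = modal.Image.debian_slim().pip_install("torch")

    @app.function(gpu="A10G", image=image)
    def generate(prompt: str) -> str:
        # Runs in Modal's cloud on the requested GPU type
        return f"processed: {prompt}"

    @app.local_entrypoint()
    def main():
        # .remote() ships the call to the cloud; .local() would run it on your machine
        print(generate.remote("hello"))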

Cold starts: 2-4 seconds

Pros:

  • Hot GPU reloading

  • Data processing

© 2024 Cerebrium, Inc.