
How Distil Labs is Delivering 50% Lower Inference Costs with Production-Grade Autoscaling on Cerebrium
"
"
With Cerebrium, we built an inference platform that scales from zero to triple-digit RPS dynamically, maintains competitive pricing, and avoids the operational drag of managing GPU infrastructure internally.
Selim Nowicki
Co-Founder & CEO


Use case
LLM-to-SLM replacement, model distillation & fine-tuning, high-scale inference optimization
Location
Berlin, Germany
Customer since
November 2025
Features used
Dedicated AI onboarding engineer, Granular deployment controls, Limitless autoscaling across regions globally, Fast cold starts, Competitive on-demand pricing
Highlights
50% lower inference costs, Zero-to-150 RPS scaling, Faster time-to-production, Optimized latency, cold starts, & global deployment regions
Introduction
Distil Labs is the developer platform for building task-specific SLMs with LLM-level accuracy. Companies train models on Distil Labs in a matter of hours, replacing expensive large language models with smaller, faster, and dramatically more cost-efficient models. This lets their customers preserve output quality while significantly reducing inference costs and latency.
In production, their workloads span both training and inference, with usage patterns fluctuating from low, steady traffic to bursts reaching hundreds of requests per second per customer. For Distil Labs, maintaining steady inference operations through these bursts is not optional - it’s paramount.
The Challenge
As Distil Labs refined its product, it became clear that model optimizations alone would not unlock sustainable unit economics. Infrastructure would be integral to enabling cheaper, faster inference at scale. And managing sophisticated hyperscale-grade GPU infrastructure internally would require significant engineering effort - effort that would pull the team away from building their core product.
Their system needed to dynamically scale from zero to hundreds of requests per second in under a minute. It needed to support fully custom model deployments, including custom weights and API definitions. It needed access to a variety of GPU types and had to be distributed across regions globally to enable data sovereignty. And most critically, it needed fast cold starts and cost efficiency at scale.
Distil Labs quickly realized that solving their infrastructure challenges meant partnering with the right provider.
Evaluation & Implementation
After reviewing a multitude of vendors, including both cloud and on-prem solutions, Distil Labs selected Cerebrium as its key partner, combining their key requirements into one holistic platform solution:
Dedicated AI onboarding engineer
Granular deployment controls
Configurable autoscaling based on traffic patterns
Deployment across regions globally
Optimized cold starts
Competitive on-demand pricing
The onboarding process was highly collaborative. Within the first week, Cerebrium enabled Distil Labs to programmatically spin deployments up and down per customer request, monitor usage at a granular, per-deployment level, and meet all of their infrastructure needs.
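As a minimal sketch of what such a per-customer workflow could look like, the snippet below templates a customer-specific Cerebrium config and triggers a deployment via the Cerebrium CLI. The TOML field names, GPU type, and scaling values are illustrative assumptions based on Cerebrium's public tooling, not Distil Labs' actual configuration, and may differ from the current Cerebrium schema.

```python
# Illustrative sketch only: writes a per-customer cerebrium.toml and deploys it
# with the Cerebrium CLI. Field names and values are assumptions, not Distil
# Labs' actual setup.
import pathlib
import subprocess

CONFIG_TEMPLATE = """\
[cerebrium.deployment]
name = "slm-{customer_id}"
python_version = "3.11"

[cerebrium.hardware]
compute = "AMPERE_A10"   # assumed GPU type; chosen per model size
cpu = 2
memory = 16.0

[cerebrium.scaling]
min_replicas = 0         # scale to zero when a customer is idle
max_replicas = 20        # headroom to absorb traffic bursts
cooldown = 60            # seconds of low traffic before scaling back down
"""

def deploy_customer_model(customer_id: str, model_dir: str) -> None:
    """Write a customer-specific cerebrium.toml and trigger a deployment."""
    config_path = pathlib.Path(model_dir) / "cerebrium.toml"
    config_path.write_text(CONFIG_TEMPLATE.format(customer_id=customer_id))
    # `cerebrium deploy` builds and ships the app described by cerebrium.toml.
    subprocess.run(["cerebrium", "deploy"], cwd=model_dir, check=True)

if __name__ == "__main__":
    # Hypothetical customer and model directory, for illustration only.
    deploy_customer_model("acme", "./models/acme-support-slm")
```

Scaling to zero when a customer is idle while keeping a high replica ceiling is what lets a per-customer deployment stay cheap at low traffic yet absorb bursts, which is the pattern described above.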
The Results
Instead of investing development cycles into GPU orchestration, autoscaling logic, and reliability engineering, Distil Labs was able to concentrate immediately on its core USP: improving models and delivering customer value rather than maintaining infrastructure.
Dynamic scaling at both the deployment and per-instance level became essential to their roadmap. Without the ability to spin instances up and down easily - or to create multiple deployments as needed - they would not have been able to offer competitive pricing as easily. For Distil Labs, infrastructure wasn’t just a backend decision - it was the key to making efficient, small-model inference viable at scale.
Today, Distil Labs runs up to 150 requests per second per model during high-traffic periods, with multiple models deployed concurrently. Autoscaling handles traffic spikes smoothly, and reliability has reached production-grade stability. Upcoming GPU memory snapshot improvements promise even faster cold starts, further strengthening performance under bursty demand.
Beyond the platform itself, the working relationship has been a differentiator. The Cerebrium team integrates directly into Distil Labs’ Slack and remains highly responsive, acting less like a vendor and more like an infrastructure partner.
“Our customer was running at single-digit millions of requests per day at ~1s p99 latency. With Cerebrium, we reduced their inference costs by 50%, and increased accuracy from 83% to 92%, while keeping latency and reliability consistent at production scale.”
- Maciej, ML Engineer