October 24, 2025
The Shortcomings of Celery + Redis for ML Workloads and How Cerebrium Solves It

Michael Louis
Founder & CEO
Machine learning inference isn't like traditional web APIs. When a user uploads an image for classification or sends text for generation, the server might need 5-30 seconds of GPU compute time - sometimes for model loading, often for generation itself. A request that takes that long to start up or return a response causes several issues: most servers and browsers time out after a few seconds or minutes, users can't navigate away or perform other actions while waiting, and resources stay tied up for the entire duration of the request.
What do 100 requests, each running for 30 seconds, look like with this setup?
When 100 clients make requests to your model simultaneously and each call takes roughly 30 seconds, a single server working through them one at a time needs 100 × 30 s = 3,000 s, so the last client waits 50 minutes for all previous requests to complete. This is unacceptable for production applications. Task queues solve this by decoupling the API from the computation.
The typical architecture involves three components:

The API receives requests, creates async tasks, and returns task IDs immediately.
The message broker (usually Redis or RabbitMQ) stores tasks in queues.
Workers (managed by Celery) pull tasks from queues, execute GPU/CPU inference on something like Kubernetes, and store the results.
This is why ML teams turn to task queues. Queues help engineers fire requests off in quick succession and forget about them until they have finished processing. They also help offload long-running workloads by handling:
State tracking: Return a task ID immediately so the API can track job progress.
Async user experience: Users can close the browser or do other things while the model runs, and still track its progress.
Long-running operations: Handle tasks that take 15+ minutes (training, batch processing, video analysis) without hitting request limits.
Traffic spike handling: Scale compute resources independently based on queue depth rather than concurrent HTTP connections.
The standard solution: Offload the heavy inference work to asynchronous task queues using Celery and Redis, keeping the API responsive while workers process tasks in the background.
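A minimal sketch of that pattern is below, assuming Redis is reachable locally; the classify_image task, run_model helper, and /classify route are illustrative names rather than a reference implementation.

```python
# app.py - API and Celery task in one sketch (split into modules in a real project)
from celery import Celery
from fastapi import FastAPI

# Redis acts as both the message broker (queue) and the result backend.
app = Celery(
    "inference",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

def run_model(image_url: str) -> str:
    # Stand-in for the real GPU inference call (the 5-30 s of work).
    return "cat"

@app.task
def classify_image(image_url: str) -> dict:
    # Runs on a Celery worker process, not in the API process.
    return {"label": run_model(image_url)}

api = FastAPI()

@api.post("/classify")
def classify(image_url: str):
    task = classify_image.delay(image_url)  # enqueue and return immediately
    return {"task_id": task.id}             # the client polls for the result later
```

In practice the API runs under something like uvicorn while workers run via celery -A app worker, and the two are deployed, scaled, and monitored separately - which is exactly the operational surface the next section describes.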
The Operational Reality of Celery + Redis
Setting up a production-ready task queue for ML requires extensive configuration. Celery configuration includes broker URLs, result backends, serializers, worker prefetch settings, acknowledgment policies, task tracking, result expiration, priority queues, and retry behavior.
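As a rough illustration of what that list translates to, a Celery settings module might look like the sketch below; the values are illustrative, not recommendations.

```python
# celeryconfig.py - the kind of knobs the list above refers to (illustrative values)
broker_url = "redis://redis:6379/0"        # message broker
result_backend = "redis://redis:6379/1"    # result storage
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]

worker_prefetch_multiplier = 1   # stop one worker from hoarding long GPU tasks
task_acks_late = True            # re-queue a task if the worker dies mid-inference
task_track_started = True        # expose a STARTED state for progress tracking
result_expires = 3600            # drop stored results after an hour

task_default_queue = "inference"
task_time_limit = 900            # hard kill after 15 minutes
task_soft_time_limit = 870       # raise SoftTimeLimitExceeded slightly earlier

# Retry behavior and priority queues add further per-task and per-queue settings
# on top of this (autoretry_for, retry_backoff, queue arguments, ...).
```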
Teams report common challenges such as:
Cold Start Problems - Teams must override Celery's default task behavior to load models once per worker process, rather than per task. However, this still causes a cold start whenever new workers spin up, and models load multiple times when a worker runs multiple processes. Each worker process maintains its own model copy in memory. For large models this is painful.
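The usual workaround is to hook Celery's worker_process_init signal so each worker process loads the model once at startup instead of on every task; the sketch below uses a hypothetical load_model helper.

```python
# worker_model.py - load the model once per worker process (illustrative)
from celery import Celery
from celery.signals import worker_process_init

app = Celery("inference", broker="redis://localhost:6379/0")

_model = None  # one copy per worker process; tasks in that process reuse it

def load_model():
    # Hypothetical loader - in practice torch.load / from_pretrained / etc.
    return object()

@worker_process_init.connect
def init_model(**kwargs):
    # Runs once when each worker process starts: this is the cold start paid
    # again every time autoscaling adds a worker, and the per-process copy
    # that multiplies GPU memory use.
    global _model
    _model = load_model()

@app.task
def predict(payload: dict) -> dict:
    return {"output": repr(_model), "input": payload}
```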
Resource Management - Celery workers require careful configuration to avoid loading models multiple times in memory. With process-based worker pools, each process maintains its own model copy. This wastes GPU memory and requires precise worker configuration to match available GPU capacity.
Infrastructure Complexity - Production deployments require a message broker (like Redis or RabbitMQ), result storage, and Kubernetes for worker orchestration. Each service needs monitoring, scaling policies, and failure handling.
Scaling Coordination - Celery separates API traffic from worker scaling. The API scales with HTTP load, while workers scale with queue depth — meaning two separate autoscaling systems that don’t talk to each other. Cerebrium eliminates this mismatch by using a single autoscaler that monitors replica_concurrency across all instances, ensuring the API and inference layer scale together in real time.
Other Limitations:
Timeout ceilings - 15-minute limits on synchronous requests force result polling and callback logic for long-running tasks.
Latency tax - Redis introduces 100-500ms of broker overhead per task, which adds up in low-latency pipelines.
Per-region queues = per-region ops - duplicated deployments, fragmented capacity, manual spillover/anycast routing, and uneven backlogs.
Large payload limits - Celery tasks degrade beyond ~100 MB; Cerebrium supports payloads up to 1 GB natively.
How Cerebrium Handles Queuing
Cerebrium eliminates the need for external queue infrastructure by integrating it directly into the serverless platform. This means no Redis, no RabbitMQ, and no separate scaling logic — queuing and scaling are handled natively.
Here’s how it works:
The platform continuously monitors two key metrics:
Queue depth – the number of requests currently waiting for processing
Queue time – how long each request has been waiting
When either metric exceeds its threshold, Cerebrium automatically provisions new instances (workers) — typically within 2–4 seconds.
Unlike Celery, you don’t configure brokers or routing policies. Instead, you define how your application scales using a simple configuration:
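For example, the scaling section of cerebrium.toml might look roughly like this; the sketch uses only the parameters discussed in this post, and exact section or key names should be checked against the current Cerebrium docs.

```toml
[cerebrium.scaling]           # section name assumed; verify against the docs
replica_concurrency = 1       # max in-flight requests per replica (hard limit)
scaling_target = 70           # target utilization (%) before scaling out
scaling_buffer = 1            # warm replicas kept ready for instant responses
response_grace_period = 3600  # seconds a queued request may wait (1 hour default)
```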
The critical parameter here is replica_concurrency, which defines the maximum number of requests each replica can process concurrently. This value acts as a hard limit. Once it’s reached and additional requests are still in flight, the platform scales out automatically to handle the overflow.
For example, if replica_concurrency = 1 and three new requests arrive while no replicas are available, Cerebrium instantly spins up three new instances.
This single parameter replaces Celery's worker pool configuration, prefetch multipliers, and queue routing logic.
Key Capabilities
Built-in queuing: All incoming requests are automatically queued and processed as soon as resources become available — no external queue configuration required.
Queue persistence: If the maximum number of workers is reached, requests remain in the queue until a worker becomes available. Items can wait up to 1 hour by default, configurable via response_grace_period in your cerebrium.toml.
Long-running tasks: Tasks can be synchronous or asynchronous and can run for up to 12 hours.
Real-time visibility: Use the Cerebrium REST API to query request status, logs, and results directly.
Per-app isolation: Queues are defined per application, ensuring workloads don’t interfere with one another.
Ultra-low internal latency: Internal routing operates at ~5 ms latency, enabling near-instant request dispatch.
Large payload support: Requests up to 1 GB are supported natively — no need for external object storage hacks.
| Aspect | Traditional (Celery + Redis) | Cerebrium |
|---|---|---|
| Queue Infrastructure | Redis/RabbitMQ cluster to configure | Built-in, no configuration |
| Worker Management | Separate Celery worker pools | Replicas handle both API and processing |
| Scaling Configuration | Two autoscalers (K8s HPA + KEDA) | Single autoscaler with replica_concurrency |
| Configuration | 50+ lines across multiple files | 10 lines in cerebrium.toml |
| Cold Start Time | 30–60 seconds | 2–4 seconds |
| Scaling Response | 60–120 seconds | 2–4 seconds |
| Concurrency Control | Multiple parameters | Single replica_concurrency parameter |
| Services to Monitor | API, Redis, Workers, Queue | Containers only |
| Result Storage | Separate Redis/DB | Direct HTTP response |
What a request looks like on Celery/Redis vs. Cerebrium
🧱 Celery + Redis (Traditional Setup)
Each Celery worker runs as a Kubernetes pod, requiring separate orchestration for scaling, monitoring, and model loading. This adds operational complexity and GPU memory overhead as every pod loads its own model copy.
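Sketched from the client's side (hypothetical endpoint paths, reusing the /classify route from the earlier example), one request looks like this:

```python
# Client view of one request on Celery/Redis (hypothetical API routes)
import time
import requests

# 1. Submit: the API enqueues the task and returns a task ID immediately.
resp = requests.post(
    "https://api.example.com/classify",
    json={"image_url": "https://example.com/cat.jpg"},
)
task_id = resp.json()["task_id"]

# 2. Poll: the client (or a webhook/callback) checks the result backend via
#    the API until a worker has finished the GPU inference.
while True:
    status = requests.get(f"https://api.example.com/tasks/{task_id}").json()
    if status["state"] in ("SUCCESS", "FAILURE"):
        break
    time.sleep(1)

print(status.get("result"))
```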
⚡ Cerebrium (All-in-one Platform)
Cerebrium eliminates the need for Kubernetes orchestration entirely. The platform manages replicas, autoscaling, and queueing internally — so you deploy once, and it scales automatically without pods, YAMLs, or cluster tuning.
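The same request on Cerebrium is a single HTTP call to the endpoint you get on deploy; the URL and token below are placeholders.

```python
# One request on Cerebrium (placeholder endpoint URL and token)
import requests

resp = requests.post(
    "https://<your-cerebrium-endpoint>/classify",  # the URL shown after deploy
    headers={"Authorization": "Bearer <API_TOKEN>"},
    json={"image_url": "https://example.com/cat.jpg"},
    timeout=600,  # long-running synchronous calls stay open; async invocations return immediately
)
print(resp.json())
```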

Intelligent Autoscaling
Kubernetes autoscalers were designed for traditional CPU workloads, not machine learning inference. They rely on CPU or memory thresholds - metrics that fail to capture how ML workloads actually behave. A single model request might occupy a GPU for 10 seconds, while another finishes in 300 ms. Scaling based on CPU usage or average latency leads to over-provisioning, cold starts, and unpredictable response times.
Cerebrium’s autoscaler is built specifically for ML workloads. It looks at how models behave in production - not just how busy the hardware appears - and combines three scaling mechanisms to ensure consistent performance across any workload:
Concurrency Utilization: This is the default metric - and the heart of Cerebrium’s scaling logic. The autoscaler measures how much of each replica’s configured concurrency (replica_concurrency) is currently in use, averaged across all replicas. For example, with replica_concurrency = 1 and scaling_target = 70, Cerebrium maintains roughly 0.7 requests per replica, leaving 30% headroom for new traffic. This ensures scaling is proactive rather than reactive, avoiding overload before it happens (a rough sketch of the calculation follows this list).
CPU and Memory Utilization: For mixed workloads - such as preprocessing on CPUs or multimodal tasks - Cerebrium can also scale on CPU and memory utilization. This hybrid approach ensures the system adapts intelligently to the dominant bottleneck, whether it’s GPU concurrency, CPU load, or memory pressure.
Scaling Buffers: Some workloads - particularly real-time voice, video, or chat applications — can’t afford even a 2-second cold start. The scaling_buffer parameter keeps a fixed number of replicas warm and ready. This guarantees instant responsiveness, even during quiet periods or when models take time to initialize.
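As a rough model of the concurrency-utilization logic described above - an illustration, not Cerebrium's actual implementation - the replica count can be thought of like this:

```python
import math

def desired_replicas(in_flight: int, replica_concurrency: int, scaling_target: int) -> int:
    """Illustrative model of concurrency-based scaling, not the platform's real code.

    scaling_target is a percentage: with replica_concurrency = 1 and
    scaling_target = 70, each replica should carry roughly 0.7 requests.
    """
    # Effective per-replica budget is replica_concurrency * scaling_target / 100.
    return max(1, math.ceil(in_flight * 100 / (replica_concurrency * scaling_target)))

# 7 requests in flight at replica_concurrency = 1, scaling_target = 70 -> 10 replicas,
# i.e. about 0.7 requests per replica with ~30% headroom.
print(desired_replicas(7, 1, 70))
```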
Together, these mechanisms replace the patchwork of Kubernetes HPA policies, Redis queue metrics, and custom scaling scripts.
Cerebrium’s autoscaler is predictable, efficient, and GPU-aware, delivering the elasticity of serverless compute while maintaining the reliability and performance that production ML workloads demand.
Why It’s More Cost-Effective
Cerebrium removes the operational drag of managing multiple systems while cutting GPU costs nearly in half and reducing engineering overhead by more than 90% — letting teams focus entirely on building, not babysitting infrastructure. Here’s why Cerebrium helps teams cut costs:
No idle services — Containers spin up and down intelligently based on demand and usage.
A managed service — no need to manage your own Redis, Celery, or extra API containers. Make a request to your endpoint and Cerebrium handles all the complex infrastructure orchestration.
Usage-based billing — pay only for active compute time.
Minimal maintenance — autoscaling, monitoring and ensuring the tech stack is performant, stable and up to date are all handled by the platform.
Conclusion
Task queues solved a real need for ML teams — keeping APIs responsive during long-running inference, managing GPU utilization efficiently, and handling unpredictable traffic spikes. The Celery + Redis architecture achieved these goals, but it also carried over operational baggage from the world of traditional web development: brokers, result stores, scaling scripts, and endless tuning.
Modern serverless ML platforms take a different approach. They deliver the same outcomes — async execution, workload isolation, and scaling efficiency — but bake queue management directly into the infrastructure.
replica_concurrency replaces worker pool tuning. Internal queue telemetry replaces Redis configuration. Automatic scaling replaces custom HPA logic.
For teams focused on building ML products rather than managing infrastructure, the question isn’t whether you need a task queue — you do — but whether managing it yourself adds any value.
In the end, the best queue is the one you never have to configure.