Tutorial
Jun 25, 2025
Deploying a global-scale AI voice agent with 500ms latency.

Michael Louis
Founder & CEO
We recently hosted a webinar on Building Global, Low-Latency Voice Agents—and the response was incredible. Hundreds of builders, engineers, and product teams tuned in to learn how to assemble real-time speech pipelines with STT, LLMs, and TTS, all optimized for sub-500ms response times. The feedback was clear: this is a problem many teams are actively working on, and they want practical, scalable solutions.
So we decided to turn our learnings, suggestions, and architectural thinking into this post.
In this article, we’ll break down how to build a real-time, scalable voice agent that feels fast and human. We’ll cover the core components of any modern voice agent (speech-to-text, LLM, text-to-speech, media transport, and agent framework), walk through how we deploy these globally on Cerebrium for performance and compliance, and share tips on keeping costs low—even at scale. We’ll also include code examples using LiveKit to tie everything together.
You can find code examples for LiveKit here.
Core Components of a Real-Time Voice Agent
Most real-time voice agents are made up of five essential components:
Speech-to-Text (STT) to transcribe incoming audio from the client,
Large Language Model (LLM) to generate the response,
Text-to-Speech (TTS) to convert the reply back to audio,
Agent framework to define business logic and tie all services together,
Media transport to stream voice data between the client and agent in real time.
To achieve true low-latency performance—especially under 500ms—you need to think beyond just the model speed. Network latency plays a huge role. Every time your voice agent makes a request to a service (whether it’s STT, LLM, or TTS), there’s latency incurred from crossing regions or networks. Even if you’re in the same region as the service, each call can add 50ms or more. Multiply that across three components and your latency stacks up fast. With Cerebrium, we take advantage of our inter-cluster routing in order to make the network latency negligible – more on this later.
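To make the budget concrete, here is a rough back-of-the-envelope calculation (illustrative numbers only, roughly in line with the figures discussed below) of how per-hop network latency stacks on top of model time:

```python
# Rough voice-to-voice latency budget (illustrative numbers, not benchmarks).
# network_hop_ms is the cost of each remote call to STT, LLM and TTS; with
# co-located services it drops to low single digits.
def budget_ms(stt_ms, llm_ttft_ms, tts_ttfb_ms, network_hop_ms, hops=3):
    return stt_ms + llm_ttft_ms + tts_ttfb_ms + network_hop_ms * hops

remote = budget_ms(stt_ms=250, llm_ttft_ms=700, tts_ttfb_ms=150, network_hop_ms=50)
local = budget_ms(stt_ms=110, llm_ttft_ms=300, tts_ttfb_ms=80, network_hop_ms=2)

print(f"hosted APIs over the network: ~{remote} ms")  # ~1250 ms
print(f"co-located on one cluster:    ~{local} ms")   # ~496 ms
```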
Speech-to-Text (STT)
For STT, we have a partnership with Deepgram that allows you to deploy their STT model on Cerebrium. To do this, you need an enterprise account with Deepgram; you can then follow our instructions here on how to deploy their service on Cerebrium. The cost is Deepgram's license fee plus the Cerebrium compute you use to run the model. There are also open-source alternatives such as faster-whisper.
Typically, the TTFB (time-to-first-byte) from the Deepgram API is ~250ms, whereas hosting it on Cerebrium brings it down to ~110ms, a saving of roughly 140ms. This is thanks to Cerebrium's inter-cluster routing, which means requests don't incur cross-network latency. On Cerebrium, a Deepgram service can run on an A10 GPU and handle roughly 160-180 concurrent connections, so the cost is $1.44 per hour per instance, and Cerebrium manages all autoscaling.
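As a rough sketch of what the wiring looks like with LiveKit's Deepgram plugin, switching from Deepgram's hosted API to a self-hosted instance is mostly a matter of pointing the plugin at your own endpoint. The URL below is a placeholder and the base_url keyword is an assumption; check the exact option name in the livekit-plugins-deepgram version you are using.

```python
from livekit.plugins import deepgram

# Sketch: point the Deepgram STT plugin at a self-hosted instance instead of
# Deepgram's hosted API. The URL is a placeholder and `base_url` is assumed;
# check the exact option name in your livekit-plugins-deepgram version.
stt = deepgram.STT(
    model="nova-2",
    language="en",
    base_url="wss://your-deepgram-deployment.example.com/v1/listen",  # hypothetical endpoint
    api_key="not-needed-for-self-hosted",  # placeholder
)
```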
Large Language Model (LLM)
The LLM is where you can find the largest latency savings, and the most consistent latency, since a self-hosted model isn't affected by external traffic. The TTFT (time-to-first-token) of OpenAI's GPT-4o-mini varies from 700ms to 1.5s, whereas a self-hosted llama-3-8b or llama-3-70b can consistently achieve ~300ms.
The GPU you select usually depends on the model you wish to deploy and your expectations around:
TTFT
Concurrency
You can use vLLM, SGLang, or TensorRT-LLM as serving frameworks that have been optimized to reduce TTFT (a short TTFT measurement sketch follows the estimates below). Below are rough estimates for 2048 input tokens and 256 output tokens for the following models on different hardware:
llama-3.1-8b-fp8:
TTFT: 400ms
Concurrent requests: 9 requests
Hardware: 1xH100
Cost: $4.32 per hour
llama-3.1-70b-fp8:
TTFT: 400ms
Concurrent requests: 5 requests
Hardware: 2xH100
Cost: $8.64 per hour
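All three serving frameworks expose an OpenAI-compatible endpoint, so you can verify these TTFT numbers against your own deployment with a streaming request. A minimal sketch, where the base_url and model name are placeholders for your self-hosted deployment:

```python
import time

from openai import OpenAI

# Measure TTFT against a self-hosted, OpenAI-compatible endpoint (vLLM, SGLang
# and TensorRT-LLM all expose one). The URL and model name are placeholders.
client = OpenAI(base_url="http://your-llm-deployment.example.com/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```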
You will notice that you can get a significantly better price, and similar performance, from a proxy like OpenRouter, which is great for early prototyping. However, many applications outgrow its limitations quickly. Cerebrium offers a compelling alternative:
Consistent, low latency: Hosted APIs are subject to external traffic fluctuations. By deploying your own LLM on Cerebrium, you’re isolated from noisy neighbors, allowing for more predictable TTFT and end-to-end latency—critical for real-time applications like voice agents.
Full stack ownership: Unlike hosted APIs, Cerebrium gives you control over using fine-tuned model weights, using custom tokenizers, observability, and model routing—letting you optimize for performance, cost, and user experience in ways that black-box APIs simply don’t allow.
Global deployments: Cerebrium lets you deploy your model in multiple regions (US, EU, UAE, India), reducing latency for users and helping you meet data residency and compliance requirements.
Security and data control: Cerebrium is SOC 2 Type I and HIPAA compliant. You can control what data is stored, issue purge requests via API, and ensure sensitive information is handled according to your policies.
Text-to-Speech (TTS)
Lastly, for TTS you have a range of options: the Aura models via Deepgram, an ultra-realistic voice model from Rime Labs (whose pricing is generous and whose TTFB is ~80ms), or one of the open-source models like Sesame or Orpheus.
By deploying these models on Cerebrium, you can achieve ~80ms TTFB, as opposed to the ~150ms typical of external APIs, a saving of roughly 70ms. They usually require an A10 GPU, which costs $1.44 per hour and handles roughly 180 concurrent conversations.
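To sanity-check TTFB on a self-hosted TTS deployment, you can time the first audio chunk of a streaming response. A minimal sketch, where the endpoint URL and request payload are placeholders to adapt to whichever model you deploy:

```python
import time

import httpx

# Time-to-first-byte for a self-hosted, HTTP-streaming TTS endpoint.
# The URL and payload shape are placeholders; adapt them to your deployment.
URL = "https://your-tts-deployment.example.com/synthesize"  # hypothetical endpoint

start = time.perf_counter()
with httpx.stream("POST", URL, json={"text": "Hello there!", "voice": "default"}, timeout=30) as resp:
    for chunk in resp.iter_bytes():
        if chunk:
            print(f"TTFB: {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```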
Agent Framework
If you consider the above components collectively as the “model” layer in a traditional Model-View-Controller (MVC) architecture, an agent framework is akin to the “controller” or business logic layer. LiveKit’s open source Agents framework offers a rich plugin ecosystem, allowing you to easily integrate any STT, LLM, and TTS provider into a voice pipeline. It handles conversational dynamics like noise/echo cancellation, end-of-turn detection, and interruptions for you. You can also use it to build complex multi-agent workflows that leverage tool calls and dial in or out from phone numbers.
Unlike web applications, voice agents are stateful: persistently connected to the user, constantly processing incoming speech, and steadily building up conversational context. This architecture motivates a different approach to load balancing and scaling. LiveKit Agents automatically handles agent routing, concurrent conversations, failover, and context migration if a particular STT, LLM, or TTS service is slow or unavailable, or if the agent itself crashes.
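As a minimal sketch of what this looks like in code (class names follow the current LiveKit Agents Python framework and may differ between versions; the self-hosted LLM URL is a placeholder):

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero

# Minimal voice pipeline: VAD for turn detection, then STT -> LLM -> TTS.
# The base_url is a placeholder for a self-hosted, OpenAI-compatible LLM
# endpoint; swap the TTS plugin for Rime, Orpheus, etc. as needed.
async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="llama-3.1-8b", base_url="http://your-llm-deployment.example.com/v1"),
        tts=openai.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful, concise voice assistant."),
    )
    await ctx.connect()

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```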
Media Transport
To carry voice data between the user and the voice agent with as little latency as possible, we need an efficient network transport. LiveKit's media server, built on the WebRTC protocol, is one of the most popular options available. It integrates seamlessly with their Agents framework and offers client SDKs for virtually every platform.
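On the server side, all a client needs to join a room is a short-lived access token. A sketch using LiveKit's Python server SDK, with placeholder credentials, identity, and room name:

```python
from livekit import api

# Mint a short-lived join token for a client (placeholder credentials,
# identity and room name).
token = (
    api.AccessToken("LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")
    .with_identity("caller-123")
    .with_grants(api.VideoGrants(room_join=True, room="support-call"))
    .to_jwt()
)
print(token)
```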
Misc:
Other small improvements that lower latency:
Deploy all of these services within the same VPC/cluster to bring network latency down to single-digit milliseconds.
Place that VPC/cluster as close to the customer as possible to reduce network delay.
System Architecture on Cerebrium

The above diagram shows the architecture we used to achieve a voice-to-voice response latency of ~500ms. By deploying the services inside Cerebrium, we take advantage of inter-cluster routing, which brings the network latency between them down to ~2ms and alone saves us 150ms+.
You can think of each container as its own independent autoscaling service. This is required since each service has different concurrency capacity and hardware requirements. For example, Deepgram can handle 180 concurrent connections on an A10 GPU, whereas a self-hosted llama model handles roughly 10 requests and runs on an H100.
The other benefit of this architecture is that you can reuse these services for other parts of your workflow, such as Deepgram for sentiment analysis or the LLM for summarisation. Cerebrium handles the autoscaling of these services based on traffic.
Since this solution is self-contained, you can deploy it across multiple regions (US, EU, UAE, and India) to take full advantage of both low-latency performance and data residency requirements. Cerebrium automatically routes traffic to the closest region based on the user's location, but you can also explicitly control routing if needed. This keeps latency low (often under 500ms end-to-end) while meeting strict compliance requirements like GDPR or HIPAA by keeping data in-region, without the overhead of managing regional infrastructure yourself.
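If you do want explicit control over routing, one simple approach is to probe each regional deployment from the client (or an edge function) and pick the lowest-latency one. The endpoints below are purely hypothetical placeholders:

```python
import time

import httpx

# Hypothetical regional endpoints for the same self-contained deployment;
# pick whichever responds fastest from the caller's vantage point.
REGIONS = {
    "us": "https://us.your-voice-agent.example.com/health",
    "eu": "https://eu.your-voice-agent.example.com/health",
    "uae": "https://uae.your-voice-agent.example.com/health",
    "india": "https://in.your-voice-agent.example.com/health",
}

def fastest_region() -> str:
    timings = {}
    for region, url in REGIONS.items():
        start = time.perf_counter()
        try:
            httpx.get(url, timeout=2.0)
            timings[region] = time.perf_counter() - start
        except httpx.HTTPError:
            continue  # skip unreachable regions
    return min(timings, key=timings.get)

print(fastest_region())
```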
Cost
Below we break down what the cost looks like when running something like this at scale. To summarize:
STT: $1.44 per hour for 180 concurrent conversations
LLM: $8.64 per hour for 5 concurrent conversations
TTS: $1.44 per hour for 180 concurrent conversations
Media transport: $0.03 per hour per conversation
Converting this to a per-minute, per-call cost for comparison, we get the following:
STT and TTS combined: ~$0.000267 per minute per call
LLM: $0.0288 per minute per call
Media transport: $0.0005 per minute per call
Total: ~$0.0296 per minute per call.
Note: If using Deepgram, Rime or other proprietary models, then you will incur a licensing fee that is not incorporated in the above. The above is simply the infrastructure/compute costs.
Note: The above price is using Cerebrium’s on-demand price. We offer volume discounts and discounts for commitments.
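For transparency, the per-minute figures above come straight from the hourly rates and concurrency numbers earlier in the post:

```python
# Per-minute, per-call infrastructure cost, derived from the hourly rates above.
def per_minute_per_call(hourly_rate: float, concurrent_calls: int) -> float:
    return hourly_rate / concurrent_calls / 60

stt = per_minute_per_call(1.44, 180)    # ~$0.000133
tts = per_minute_per_call(1.44, 180)    # ~$0.000133
llm = per_minute_per_call(8.64, 5)      # $0.0288
media = 0.03 / 60                       # $0.0005

total = stt + tts + llm + media
print(f"${total:.4f} per minute per call")  # ~$0.0296
```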
By running this in Cerebrium you get the following benefits:
A platform that can do CPU and GPU deployments with extremely low cold start times (2 seconds)
Your services can autoscale based on your requirements, handling spikes in traffic as well as scaling to zero in periods of no usage; Cerebrium only charges you for the compute you use, billed to the second.
You can deploy this solution across a variety of regions (US, EU, UAE and India) adhering to all data residency and compliance requirements as well as keeping latency low
Code Examples:
We have a code repository using LiveKit that illustrates everything we spoke about above.
Building a real-time, low-latency voice agent that feels natural requires more than just fast models—it requires thoughtful orchestration, regional deployment, and tight infrastructure tuning. In this post, we broke down each core component (STT, LLM, TTS, media transport, and the agent framework), showed how to deploy them efficiently using Cerebrium, and shared benchmarks and pricing to help you make informed decisions. With inter-cluster routing, autoscaling GPU services, and support for global deployment, you can reliably hit sub-500ms latency while meeting data residency and compliance needs.
At around $0.03 per minute per call, this setup is both performant and cost-effective at scale. If you’re building a voice agent and want to explore this setup—or want to optimize what you already have—we’d love to hear from you.