Tutorial
Apr 28, 2025
Deploying Ultravox on Cerebrium for Ultra-low Latency Voice Applications

Kyle Gani
Senior Technical Product Manager
Voice applications are rapidly evolving, and the demand for real-time, ultra-low latency processing is higher than ever. Traditional voice AI pipelines involve multiple stages, including a Speech-to-text (STT) model, a large language model (LLM) and a Text-to-speech (TTS) model. This sequential approach introduces latency, making real-time interactions challenging.
Ultravox is a breakthrough multimodal LLM designed to eliminate part of this bottleneck by combining the STT and LLM stages into a single model, significantly reducing latency. By integrating directly with Cerebrium’s serverless AI infrastructure, developers can build and deploy highly responsive voice applications with minimal overhead. Using this pipeline, we are able to achieve an end-to-end latency (time to first audio) of just 600 ms. In this article, we’ll explore what makes Ultravox unique, how to set it up on Cerebrium using Pipecat, and how to get it running.
You can find the final code in our examples repository, here.
What Makes Ultravox Unique?
Ultravox is fundamentally different from traditional voice AI architectures due to its ability to process audio directly into an LLM without requiring a separate ASR stage. This unique design is built on research from models like AudioLM, SeamlessM4T, Gazelle, and SpeechGPT. Here are the key advantages of Ultravox:
Direct Audio-to-LLM Processing: Instead of transcribing speech into text before feeding it into an LLM, Ultravox uses a multimodal projector that maps audio directly into the high-dimensional space of the model. This reduces latency and eliminates potential ASR errors.
Low Latency Responses: The absence of an ASR stage allows for much faster response times, crucial for applications like AI voice assistants, real-time customer support, and interactive voice-based agents.
Scalability & Efficiency: With variants trained on Llama 3, Mistral, and Gemma, Ultravox is optimized for different hardware and latency requirements, making it adaptable to various use cases.
Customisation: The Ultravox team has published instructions in its GitHub repo for fine-tuning the open weights on your own data, improving the model’s accuracy for your use case. You can follow the instructions here.
Prerequisites
Before getting started, you’ll need access to the following:
A Cerebrium account (sign up here)
A Hugging Face account (sign up here) and an API key
A Daily account (sign up here) and an API key
A Cartesia account (sign up here) and an API key
Access to the Ultravox 0.5 Llama 3.1 8B model, which you can request here
Setting Up Ultravox on Cerebrium
First things first - let's create a new project directory called ultravox:
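The exact commands aren't shown in the post as extracted; assuming you have the Cerebrium CLI installed and are logged in, initializing the project looks something like this:

```bash
pip install cerebrium   # if you don't have the CLI yet
cerebrium login
cerebrium init ultravox
cd ultravox
```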
This creates:
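The original listing is missing here; a cerebrium init scaffold looks roughly like this:

```
ultravox/
├── main.py          # Application entrypoint
└── cerebrium.toml   # Build and deployment configuration
```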
Next, let us create a .env file with the following structure, entering your keys from the respective platforms:
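The original file contents aren't reproduced here; assuming one secret per provider, it would look something like this (the variable names are illustrative; use whatever names your code reads):

```
HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
DAILY_TOKEN=<YOUR_DAILY_API_KEY>
CARTESIA_API_KEY=<YOUR_CARTESIA_API_KEY>
```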
You can then navigate to the Secrets tab on your Cerebrium dashboard and upload your .env file.
To create our example application, we will use the Pipecat framework, which takes care of stringing the components together and handles functionality we might need, such as user interruptions and managing audio data. In this demo we use Daily meeting rooms, but you can change it to accept Twilio phone calls or similar (check out the tutorial here).
Let us add the following to the generated main.py file:
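The full listing lives in the examples repository; the condensed sketch below shows the overall shape under a few assumptions: Pipecat import paths vary between releases, UltravoxProcessor is a pass-through stand-in for the real STT+LLM processor, and the create-room endpoint (sketched later) is omitted.

```python
import os

from fastapi import FastAPI
from pydantic import BaseModel
from pipecat.frames.frames import Frame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.cartesia import CartesiaTTSService  # pipecat.services.cartesia.tts in newer releases
from pipecat.transports.services.daily import DailyParams, DailyTransport

app = FastAPI()


class UltravoxProcessor(FrameProcessor):
    """Stand-in for the real Ultravox processor in the repo, which buffers user
    audio frames, runs them through the Ultravox model, and pushes the generated
    text downstream to the TTS stage. This stub just passes frames through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        await self.push_frame(frame, direction)


# Instantiated at module level so the model loads into memory when the
# container starts, before any calls arrive.
ultravox_processor = UltravoxProcessor()


class RunRequest(BaseModel):
    room_url: str
    token: str


@app.post("/run")
async def run(req: RunRequest):
    transport = DailyTransport(
        req.room_url,
        req.token,
        "Ultravox bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="<YOUR_CARTESIA_VOICE_ID>",
    )

    # Ultravox covers both STT and LLM, so the pipeline needs only four stages.
    pipeline = Pipeline(
        [
            transport.input(),   # audio in from the Daily room
            ultravox_processor,  # speech -> LLM response text
            tts,                 # text -> speech via Cartesia
            transport.output(),  # audio out to the Daily room
        ]
    )

    # Blocks until the call ends or the request is cancelled
    await PipelineRunner().run(PipelineTask(pipeline))
    return {"status": "call ended"}
```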
Above, you will see we define the ultravox_processor outside of our functions, at the top of the file. This is because this code runs when the container starts on the Cerebrium platform; we want to load our Ultravox model into memory before we start allowing calls to come in, so that our endpoint is ready to process calls as soon as they arrive. One thing to note is that a single Ultravox model can handle multiple concurrent conversations depending on the GPU you are using, so increase the GPU compute if you would like more concurrency.
In the pipeline definition, we simply have Ultravox and Cartesia. This is because the Ultravox model handles the STT and LLM portions of our pipeline, while Cartesia handles the TTS portion. This is what helps reduce our overall latency.
We also have an endpoint that creates a room and token (through a separate helper function), which we’ll then pass to our main endpoint so that our bot can join the Daily room and start conversing. We’ll also use the returned room URL ourselves, so that we can join the room and talk to our bot. A sketch of such a helper follows.
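The repo's helper is the source of truth; using Daily's public REST API, a minimal version might look like this (DAILY_TOKEN matches the .env name assumed earlier):

```python
import os

import aiohttp


async def create_daily_room() -> dict:
    """Create a Daily room plus a meeting token the bot can use to join it."""
    headers = {"Authorization": f"Bearer {os.environ['DAILY_TOKEN']}"}
    async with aiohttp.ClientSession(headers=headers) as session:
        # Create the room
        async with session.post("https://api.daily.co/v1/rooms") as resp:
            room = await resp.json()
        # Mint a token scoped to that room
        async with session.post(
            "https://api.daily.co/v1/meeting-tokens",
            json={"properties": {"room_name": room["name"]}},
        ) as resp:
            token = (await resp.json())["token"]
    return {"room_url": room["url"], "token": token}
```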
Before we deploy, let us make sure we populate our cerebrium.toml with the correct values:
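The exact file is in the examples repo; an illustrative version matching the points below (section and key names are assumptions about Cerebrium's config schema, and the package list is indicative only) could look like:

```toml
[cerebrium.deployment]
name = "ultravox"
python_version = "3.11"

[cerebrium.hardware]
compute = "AMPERE_A10"

[cerebrium.scaling]
min_replicas = 0
scaling_buffer = 1

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

[cerebrium.dependencies.pip]
pipecat-ai = "latest"
fastapi = "latest"
uvicorn = "latest"
aiohttp = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"
```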
In the above you will see we do the following:
We set our hardware to an AMPERE_A10, which has enough VRAM to run our 8B model.
We set up our custom runtime (as this is a FastAPI application run through uvicorn).
We set scaling_buffer = 1, which means we will always have one extra container running. The reason is that as new connections come in, we don’t want to wait for the model to load, which typically takes 40s. Note: don't forget to scale this back down to 0 if you're not using your application; keeping your app running permanently could incur surprising costs.
Lastly, we install all the required pip and apt packages.
You can now deploy this with the following command:
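From the project directory, using the standard Cerebrium CLI command:

```bash
cerebrium deploy
```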
On the initial deploy, the model is downloaded to your Cerebrium volume, so give it a few minutes. From then on, deployments should be snappy. Once deployed, you should see confirmation output with your app’s base URL.

Now to get things running:
1. Make a request to your create-room endpoint to create a room for your application and a token, which the bot will use to join. Your application URL can be found in your dashboard or in the output from when you deployed your application; it looks like this: https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/create-room. After calling this endpoint, you'll receive a room_url and a token, which you'll make use of in the following steps.
2. Make an additional request to your run endpoint, passing the room_url and token returned in the previous step. Once submitted, this endpoint will continue to process until the request is either cancelled or the room is closed. This is normal, so keep it running in the background. The URL looks similar to the above: https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/run.
3. Join the room by opening the room_url returned in the first step in your browser.
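For example, with curl (the placeholders are hypothetical; your Cerebrium API key goes in the Authorization header):

```bash
# 1. Create a room and token
curl -X POST "https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/create-room" \
  -H "Authorization: Bearer <CEREBRIUM_API_KEY>"

# 2. Start the bot with the returned values (leave this running)
curl -X POST "https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/run" \
  -H "Authorization: Bearer <CEREBRIUM_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"room_url": "<ROOM_URL>", "token": "<TOKEN>"}'
```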
That’s it! You can now interact with your highly performant Ultravox bot and hear low-latency responses in real time.
Conclusion
Ultravox represents a major leap in real-time voice AI by removing ASR dependencies and directly mapping audio into LLMs. When deployed on Cerebrium, it enables developers to build ultra-low-latency voice applications with minimal infrastructure management. As Ultravox evolves, its ability to understand tone, emotion, and even generate natural speech responses will unlock even more powerful voice-first applications.
Start building with Ultravox on Cerebrium today and push the boundaries of real-time AI interaction!