Tutorial

Apr 28, 2025

Deploying Ultravox on Cerebrium for Ultra-low Latency Voice Applications

Kyle Gani

Senior Technical Product Manager

Voice applications are rapidly evolving, and the demand for real-time, ultra-low latency processing is higher than ever. Traditional voice AI pipelines involve multiple stages, including a Speech-to-text (STT) model, a large language model (LLM) and a Text-to-speech (TTS) model. This sequential approach introduces latency, making real-time interactions challenging.

Ultravox is a breakthrough multimodal LLM designed to eliminate a part of this bottleneck by combining the STT and LLM stages into one model, dramatically improving latency. By integrating directly with Cerebrium’s serverless AI infrastructure, developers can build and deploy highly responsive voice applications with minimal overhead. Using this pipeline, we are able to achieve an end-to-end latency (time to first audio) of just 600 ms. In this article, we’ll explore what makes Ultravox unique, how to set it up on Cerebrium using Pipecat, and how to get it running.
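To see why fusing stages helps, here is a back-of-the-envelope calculation. The per-stage latencies below are purely hypothetical, illustrative numbers; the real figures depend on your models, hardware, and network hops:

```python
# Hypothetical per-stage latencies (ms) for a cascaded voice pipeline.
# These numbers are illustrative only, not measurements.
stt_ms = 300              # speech-to-text produces a transcript
llm_first_token_ms = 350  # LLM starts generating a reply
tts_first_audio_ms = 150  # TTS emits the first audio chunk

# The stages run in series, so time-to-first-audio is the sum of all three.
cascaded = stt_ms + llm_first_token_ms + tts_first_audio_ms

# Ultravox fuses the STT and LLM stages, removing one hop entirely.
fused = llm_first_token_ms + tts_first_audio_ms

print(cascaded, fused)  # 800 500
```

Whatever the exact numbers, removing a serial stage (and its network round trip) is a structural win, not a tuning win.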

You can find the final code in our examples repository, here.

What Makes Ultravox Unique?

Ultravox is fundamentally different from traditional voice AI architectures due to its ability to process audio directly into an LLM without requiring a separate ASR stage. This unique design is built on research from models like AudioLM, SeamlessM4T, Gazelle, and SpeechGPT. Here are the key advantages of Ultravox:

  1. Direct Audio-to-LLM Processing: Instead of transcribing speech into text before feeding it into an LLM, Ultravox uses a multimodal projector that maps audio directly into the high-dimensional space of the model. This reduces latency and eliminates potential ASR errors.

  2. Low Latency Responses: The absence of an ASR stage allows for much faster response times, crucial for applications like AI voice assistants, real-time customer support, and interactive voice-based agents.

  3. Scalability & Efficiency: With variants trained on Llama 3, Mistral, and Gemma, Ultravox is optimized for different hardware and latency requirements, making it adaptable to various use cases.

  4. Customisation: Ultravox publishes fine-tuning instructions in its GitHub repo, so you can fine-tune the open weights to improve the model’s accuracy for your use case. You can follow the instructions here
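The projector idea in point 1 can be sketched as a toy example: a learned map takes audio-encoder frames into the LLM’s token-embedding space, so the model attends to speech directly, with no intermediate transcript. The shapes and the random weights here are made up for illustration; the real Ultravox projector is a trained module over a speech encoder’s outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 80-dim audio encoder frames -> 4096-dim LLM embeddings
AUDIO_DIM, LLM_DIM, N_FRAMES = 80, 4096, 50

audio_frames = rng.normal(size=(N_FRAMES, AUDIO_DIM))  # encoder output for ~1s of speech
W = rng.normal(size=(AUDIO_DIM, LLM_DIM)) * 0.02       # stand-in for the trained projector

# Project every audio frame into the LLM's embedding space. The resulting
# "audio tokens" sit alongside ordinary text-token embeddings in the LLM's
# input sequence, which is what removes the need for a separate ASR stage.
audio_tokens = audio_frames @ W
print(audio_tokens.shape)  # (50, 4096)
```

Because the projector’s output lives in the same space as text embeddings, errors no longer compound across an ASR-to-LLM boundary.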

Prerequisites

Before getting started, you’ll need access to the following:

  • A Cerebrium account (sign up here)

  • A Hugging Face account (sign up here) and an API key

  • A Daily account (sign up here) and an API key

  • A Cartesia account (sign up here) and an API key

  • Access to the Ultravox 0.5 Llama 3.1 8B model, which you can request here

Setting Up Ultravox on Cerebrium

First things first - let's install the Cerebrium CLI, log in, and create a new project:

pip install --upgrade cerebrium
cerebrium login
cerebrium init 11-ultravox

This creates:

11-ultravox/
├── main.py         # Our core application logic
└── cerebrium.toml  # Deployment configuration

Next, create a .env file with the following structure, filling in the keys from the respective platforms:

DAILY_TOKEN=<DAILY_TOKEN>
CARTESIA_API_KEY=<CARTESIA_TOKEN>
HF_TOKEN=<HUGGINGFACE_TOKEN>

You can then navigate to the Secrets tab on your Cerebrium dashboard and upload your .env file.

To build our example application, we will use the Pipecat framework, which strings all the components together and handles functionality we might otherwise have to build ourselves, such as user interruptions and audio data handling. In this demo we use Daily meeting rooms, but you can adapt it to accept Twilio phone calls or similar (check out the tutorial here).

Let us add the following to the generated main.py file:

import os
import time
from loguru import logger
import requests
from dotenv import load_dotenv

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.services.ultravox.stt import UltravoxSTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from fastapi import FastAPI

# Load our environment variables
load_dotenv()

# Initialize FastAPI
app = FastAPI()

# Load our Ultravox model
ultravox_processor = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)

# Endpoint which sets up our STT service and allows the bot to join a daily room
@app.post("/run")
async def run(room_url: str, token: str):
    transport = DailyTransport(
        room_url,
        token,
        "Respond bot",
        DailyParams(
            audio_out_enabled=True,
            transcription_enabled=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
            vad_audio_passthrough=True,
        ),
    )

    tts = CartesiaTTSService(
        api_key=os.environ.get("CARTESIA_API_KEY"),
        voice_id='97f4b8fb-f2fe-444b-bb9a-c109783a857a',
    )

    # Create pipeline using transport.input() and transport.output()
    pipeline = Pipeline([transport.input(), ultravox_processor, tts, transport.output()])
    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )
    runner = PipelineRunner()

    logger.info("Starting pipeline...")
    await runner.run(task)

# Helper endpoint for creating a room (Which we then pass to the above endpoint)
@app.post("/create-room")
def create_room():
    url = "https://api.daily.co/v1/rooms/"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_TOKEN')}",
    }
    data = {
        "properties": {
            "exp": int(time.time()) + 60 * 5,  # expires in 5 minutes
            "eject_at_room_exp": True,
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        room_info = response.json()
        token = create_token(room_info["name"])
        if token and "token" in token:
            room_info["token"] = token["token"]
        else:
            print("Failed to create token")
            return {
                "message": "There was an error creating your room",
                "status_code": 500,
            }
        return room_info
    else:
        data = response.json()
        if data.get("error") == "invalid-request-error" and "rooms reached" in data.get(
                "info", ""
        ):
            print("We are currently at capacity for this demo. Please try again later.")
            return {
                "message": "We are currently at capacity for this demo. Please try again later.",
                "status_code": 429,
            }
        print(f"Failed to create room: {response.status_code}")
        return {"message": "There was an error creating your room", "status_code": 500}

# Helper function for creating a token to authenticate against for our daily room
def create_token(room_name: str):
    url = "https://api.daily.co/v1/meeting-tokens"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_TOKEN')}",
    }
    data = {
        "properties": {
            "room_name": room_name,
            "is_owner": True,
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        token_info = response.json()
        return token_info
    else:
        print(f"Failed to create token: {response.status_code}")
        return None

# Health endpoint which lets Cerebrium know that our application is running
@app.get("/health")
async def health():
    return {"status": "ok"}

Above, you will see we define ultravox_processor outside of our functions, at the top of the file. This code runs when the container starts on the Cerebrium platform: we want to load the Ultravox model into memory before calls come in, so that our endpoint is ready to process them as soon as they arrive. One thing to note is that a single Ultravox model can handle multiple concurrent conversations, depending on the GPU you are using, so increase the GPU compute if you would like more concurrency.

In the pipeline definition we simply have Ultravox and Cartesia: the Ultravox model handles the STT and LLM portions of the pipeline, while Cartesia handles TTS. This is what reduces our overall latency.

We also have an endpoint that creates a room and token (through a separate helper function), which we then pass to our main endpoint so that our bot can join the Daily room and start conversing. We’ll also use the returned room URL ourselves to join the room and talk to the bot.

Before we deploy, let us make sure we populate our cerebrium.toml with the correct values:

[cerebrium.deployment]
name = "11-ultravox"
python_version = "3.11"
docker_base_image_url = "debian:bookworm-slim"
disable_auth = false
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']

[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 90
replica_concurrency = 1
response_grace_period = 900
scaling_metric = "concurrency_utilization"
scaling_target = 80
scaling_buffer = 1

[cerebrium.dependencies.pip]
transformers = "latest"
peft = "latest"
librosa = "latest"
"huggingface-hub[hf-transfer]" = "latest"
vllm = "latest"
pyaudio = "latest"
pydantic-settings = "latest"
textual = "latest"
loguru = "latest"
"pipecat-ai[cartesia,daily,silero,ultravox]" = "0.0.62"
fastapi = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"
libportaudio2 = "latest"
portaudio19-dev = "latest"

In the above you will see we do the following:

  • We set our hardware to an AMPERE_A10 which has enough VRAM to run our 8B model

  • Set up our custom runtime (As this is a FastAPI application which is run through uvicorn)

  • We set scaling_buffer = 1, which means we always keep one extra container running. The reason is that as new connections come in, we don’t want to wait for the model to load, which typically takes about 40 seconds. Note: don’t forget to scale this back to 0 when you’re not using your application; keeping it running permanently could incur surprising costs.

  • Lastly, we install all the required pip and apt packages.

You can now deploy this with the following command:

cerebrium deploy

On the initial deploy, the model will be downloaded to your Cerebrium volume, so give it a few minutes. From then on, deployments should be snappy.

Now to get things running:

  • Make a request to your create-room endpoint: this creates a room for your application and a token, which the bot will use to join. Your application URL can be found in your dashboard or in the output from when you deployed; it looks like this: https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/create-room. After calling this endpoint, you'll receive a room_url and a token, which you'll use in the following steps.

  • Make an additional request to your run endpoint: pass the room_url and token returned in the previous step. Once submitted, this endpoint will continue processing until the request is cancelled or the room is closed; this is normal, so keep it running in the background. The URL looks similar to the above: https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>/run

  • Join the room: open the room_url returned in the first step in your browser.
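The steps above can be sketched as a minimal Python client. The base URL is a placeholder you’d fill in from your dashboard, and we’re assuming your Cerebrium API key is passed as a bearer token. One detail worth noting: because the /run handler declares plain str parameters, FastAPI expects room_url and token as query parameters, not a JSON body.

```python
import requests

# Placeholder: copy your real app URL from the Cerebrium dashboard
BASE_URL = "https://api.cortex.cerebrium.ai/v4/<PROJECT_ID>/<APP_ID>"

def run_params(room_info: dict) -> dict:
    """Map the /create-room response onto the query params /run expects."""
    return {"room_url": room_info["url"], "token": room_info["token"]}

def start_bot(base_url: str, api_key: str) -> str:
    headers = {"Authorization": f"Bearer {api_key}"}

    # Step 1: create a room and a token for the bot
    room = requests.post(f"{base_url}/create-room", headers=headers, timeout=30).json()

    # Step 2: tell the bot to join. /run blocks until the room closes,
    # so in practice fire this from a background thread or separate terminal.
    requests.post(f"{base_url}/run", headers=headers,
                  params=run_params(room), timeout=600)

    # Step 3: open this URL in your browser to talk to the bot
    return room["url"]
```

Calling `start_bot(BASE_URL, "<YOUR_CEREBRIUM_API_KEY>")` performs steps 1 and 2 and returns the room URL for step 3.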

That’s it, now you’re able to interact with your highly-performant Ultravox bot and hear low-latency responses in real time!

Conclusion

Ultravox represents a major leap in real-time voice AI by removing ASR dependencies and directly mapping audio into LLMs. When deployed on Cerebrium, it enables developers to build ultra-low-latency voice applications with minimal infrastructure management. As Ultravox evolves, its ability to understand tone, emotion, and even generate natural speech responses will unlock even more powerful voice-first applications.

Start building with Ultravox on Cerebrium today and push the boundaries of real-time AI interaction!
