August 28, 2025
Orpheus TTS: How to Deploy Orpheus at Scale for Production Inference

Michael Louis
Founder & CEO
Text-to-speech technology has evolved from robotic-sounding output to human-sounding speech that rivals natural conversation. At the forefront of this shift stands Orpheus TTS, a groundbreaking open-source system led by Canopy Labs that combines cutting-edge language model technology with real-time streaming to deliver exceptional voice synthesis. Built on proven infrastructure and optimized for both research and production environments, Orpheus represents a significant leap forward in accessible, high-performance speech synthesis.
This guide explores the system from its technical foundations through practical implementation, walking through how to deploy Orpheus TTS on Cerebrium for scalable, low-latency inference in real-world environments. Future versions of Orpheus TTS are planned, including expanded language support, inference optimizations, and improved model versioning. You can find our final GitHub repository here.
What is Orpheus TTS
Orpheus TTS is a state-of-the-art open-source text-to-speech system developed by Canopy Labs, designed to meet the growing demand for high-quality, scalable voice synthesis. The platform is built on a Llama-3B language model backbone, a strong foundation for speech with natural intonation and human-like vocal characteristics, and it lets users specify the target model for inference requests.
What sets Orpheus apart in the competitive TTS landscape is its dual availability model. Organizations can use finetuned models optimized for immediate production deployment, while researchers and developers can use pretrained base models for customization and experimental work. The two differ significantly in dataset quality and training approach, which directly affects the performance, diversity, and adaptability of the system. Maintaining distinct model versions keeps deployments compatible across environments and supports ongoing enhancements.
The architecture incorporates advanced features like zero-shot voice cloning, enabling the system to generate speech in new voices from minimal training data. Orpheus TTS supports multiple languages, a variety of voices, emotive tags, and other functionality to meet diverse application needs. This capability, combined with simple tags and low-latency processing, positions Orpheus as a versatile solution for applications ranging from customer service automation to creative content generation.
Input prompts and customization options are designed for flexibility, but using a consistent prompt format is crucial for achieving optimal results in both training and inference. Typical usage involves establishing a connection, sending inference requests, and interacting with the system through the provided APIs or code repositories.
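For example, the finetuned models are prompted with a voice name followed by the text to speak, with emotive tags placed inline. The voice name tara and the tags shown below come from the finetuned models' documentation; treat the exact format and tag set as model-dependent:
tara: I just got the results back <gasp>... I can't believe we actually did it <laugh>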
How to deploy Orpheus on Cerebrium
Prerequisites:
Before getting started, you’ll need access to the following:
A Cerebrium account - sign up here and follow the quickstart to install our Python SDK.
We will be deploying based on the instructions from this GitHub repository, which has two parts:
A FastAPI service that:
Creates RESTful API endpoints
Handles request validation and processing
Manages communication with the Orpheus server
Provides streaming audio responses. For example, when sending a request, you can structure your prompt to include language and emotion tags, such as: {"prompt": "Hello world!<chuckle>"}.
An Orpheus model server that:
Hosts the Orpheus TTS model
Manages different voice models
Let's deploy the Orpheus model server first.
Orpheus Model Server:
First, let's create our Cerebrium app:
cerebrium init orpheus-server
This creates two files:
cerebrium.toml - where we set our container definition and scaling parameters
main.py - we won't be using this, so you can ignore or delete it for now
We will be transforming the Orpheus part of the Docker Compose file from the original repository into its own Dockerfile, called Dockerfile.llama.
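A minimal sketch of that Dockerfile, assuming llama.cpp's prebuilt CUDA server image (the image tag, model URL, and flag values here are illustrative, not the repository's exact ones):

FROM ghcr.io/ggml-org/llama.cpp:server-cuda

# Bake a GGUF build of the Orpheus model into the image at build time
# so replicas don't re-download it on cold start (placeholder URL)
ADD https://huggingface.co/<gguf-repo>/resolve/main/orpheus-3b-0.1-ft-q4_k_m.gguf /models/orpheus.gguf

# --parallel controls how many requests llama.cpp decodes concurrently;
# keep it in sync with replica_concurrency in cerebrium.toml
ENTRYPOINT ["/app/llama-server", \
  "--model", "/models/orpheus.gguf", \
  "--host", "0.0.0.0", "--port", "8080", \
  "--n-gpu-layers", "99", \
  "--parallel", "4", \
  "--threads-http", "4", \
  "--ctx-size", "8192"]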
This just defines our container and its dependencies, downloads the model, and runs it using llama.cpp. In order to use this Dockerfile and run the application, we need to point our cerebrium.toml at it.
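A minimal sketch of the relevant sections, using the custom-runtime keys from the Cerebrium docs (the hardware and scaling values are illustrative; choose them for your own workload):

[cerebrium.deployment]
name = "orpheus-server"

[cerebrium.runtime.custom]
# Build from our custom Dockerfile and expose the port llama-server listens on
dockerfile_path = "./Dockerfile.llama"
port = 8080
healthcheck_endpoint = "/health"

[cerebrium.hardware]
compute = "AMPERE_A10"
cpu = 4
memory = 16.0

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
# Should match the --parallel value in Dockerfile.llama
replica_concurrency = 4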
Above, I point to the Dockerfile and the port it should be listening on. Based on the GPU and hardware I selected, I also set the replica concurrency, which is the number of requests each container can receive at any given time. Depending on where you land on the price/performance tradeoff, you can change the parallel/HTTP threads values in the Dockerfile as well as replica_concurrency in cerebrium.toml, keeping the two in sync.
Once the above is complete, you can run:
cerebrium deploy
You should then see your application deployed to your Cerebrium dashboard under a url like:
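https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-server
(where p-xxxxxxxx is your own project ID; the exact base URL is shown on your dashboard)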
FastAPI Server:
Start by cloning the GitHub repository:
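# Substitute the repository URL linked above for the placeholder
git clone <fastapi-repo-url>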
Make sure this is not inside your orpheus-server folder; these should be two separate directories. To the repository folder you just cloned, add a cerebrium.toml file.
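A sketch of what it can look like, assuming a uvicorn entrypoint (the module path app.main:app, the port, and the scaling values are placeholders; match them to the repository's actual entrypoint):

[cerebrium.deployment]
name = "orpheus-fastapi"

[cerebrium.runtime.custom]
port = 8000
# Replace app.main:app with the repository's actual FastAPI module
entrypoint = ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

[cerebrium.hardware]
compute = "CPU"
cpu = 2
memory = 8.0

[cerebrium.scaling]
min_replicas = 1
max_replicas = 3
replica_concurrency = 10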
Similar to above, this sets the entrypoint of the FastAPI app, specifies that it should run on a CPU, and defines how it should scale. In requirements.txt, uncomment the line near the bottom that installs torch, torchvision, and torchaudio.
Update your .env file with the deployment URL from your Orpheus server deployment above. Once complete, upload your .env file to the secrets section of your Cerebrium dashboard; we import these values as environment variables at runtime.
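The variable name below is a placeholder; use whichever key the repository's .env defines for the upstream Orpheus server URL:

ORPHEUS_API_URL=https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-server/v1/completions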
Run the command:
cerebrium deploy
And that's it! You should now be able to make the following cURL request and receive audio back from the server.
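The endpoint path and payload shape shown here are placeholders; check the repository's README for the exact route:

curl -X POST "https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-fastapi/v1/audio/speech" \
  -H "Authorization: Bearer <YOUR_AUTH_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world! <chuckle>", "voice": "tara"}' \
  --output output.wav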
You can get the exact URL and auth token from the overview section of your dashboard. The response from this endpoint is streamed back as audio data in the specified format and is structured for real-time playback, so handle it accordingly in your client application. From our testing, this endpoint has a TTFB (time-to-first-byte) of ~100 ms, which is perfect for low-latency voice applications.
Conclusion
Orpheus TTS demonstrates how far open-source speech synthesis has come - delivering natural, human-like voices with the performance characteristics needed for real-world production. By combining advanced model design with features like zero-shot voice cloning, multi-language support, and real-time streaming, Orpheus moves beyond a research curiosity into a practical system for enterprise applications, creative work, and accessible technology.
Deploying Orpheus on Cerebrium makes it possible to scale this capability without the heavy lifting of managing GPUs, containers, multi-region support and autoscaling yourself. As the ecosystem evolves - with future releases bringing expanded language coverage, improved inference optimizations, and tighter integrations - the combination of Orpheus + Cerebrium offers a foundation for deploying next-generation voice systems at scale.
Explore the final GitHub repository to get started with your own deployment.