August 28, 2025
Orpheus TTS: How to Deploy Orpheus at Scale for Production Inference

Michael Louis
Founder & CEO
Text-to-speech technology has evolved from robotic-sounding output to human-sounding speech that rivals natural conversation. At the forefront of this shift stands Orpheus TTS, a groundbreaking open-source system led by Canopy Labs that combines cutting-edge language model technology with real-time streaming to deliver exceptional voice synthesis. Built on proven infrastructure and optimized for both research and production environments, Orpheus represents a significant leap forward in accessible, high-performance speech synthesis.
This guide explores the system from its technical foundations through practical implementation, walking through how to deploy Orpheus TTS on Cerebrium for scalable, low-latency inference in real-world environments. Future versions of Orpheus TTS are planned, including expanded language support, inference optimizations, and improved model versioning. You can find our final GitHub repository here.
What is Orpheus TTS
Orpheus TTS is a state-of-the-art open-source text-to-speech system developed by Canopy Labs, designed to meet the growing demand for high-quality, scalable voice synthesis. The platform is built on a Llama-3B language model backbone, a strong foundation for speech with natural intonation and human-like vocal characteristics, and it lets users specify the target model for inference requests.
What sets Orpheus apart in the competitive TTS landscape is its dual availability model. Organizations can use finetuned models optimized for immediate production deployment, while researchers and developers can use pretrained base models for customization and experimental work. The two differ significantly in dataset quality and training approach, which directly affects the performance, diversity, and adaptability of the system. Maintaining distinct model versions keeps deployments compatible across environments and supports ongoing enhancements.
The architecture incorporates advanced features like zero-shot voice cloning, enabling the system to generate speech in new voices from minimal training data. Orpheus TTS supports multiple languages, a variety of voices, emotive tags, and other functionality to meet diverse application needs. This capability, combined with simple tags and low-latency processing, positions Orpheus as a versatile solution for applications ranging from customer service automation to creative content generation.
Input prompts and customization options are designed for flexibility, but using a consistent prompt format is crucial for achieving optimal results in both training and inference. Typical usage involves establishing a connection, sending inference requests, and interacting with the system through the provided APIs or code repositories.
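For example, the finetuned models are prompted with a voice name followed by the text to speak, with emotive tags placed inline. The voice name tara and the tags shown below come from the finetuned models' documentation; treat the exact format and tag set as model-dependent:
tara: I just got the results back <gasp>... I can't believe we actually did it <laugh>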
How to deploy Orpheus on Cerebrium
Prerequisites:
Before getting started, you’ll need access to the following:
A Cerebrium account - sign up here and follow the quickstart to install our Python SDK.
We will be deploying based on the instructions from this GitHub repository, which has two parts:
A FastAPI service that:
Creates RESTful API endpoints
Handles request validation and processing
Manages communication with the Orpheus server
Provides streaming audio responses. For example, when sending a request, you can structure your prompt to include language and emotion tags, such as: {"prompt": "Hello world!<chuckle>"}.
An Orpheus model server that:
Hosts the Orpheus TTS model
Manages different voice models
Let's deploy the Orpheus model server first.
Orpheus Model Server:
First, let's create our Cerebrium app:
cerebrium init orpheus-server
This creates two files:
cerebrium.toml - where we set our container definition and scaling parameters
main.py - we won't be using this, so you can ignore or delete it for now
We will be transforming the Orpheus part of the Docker Compose file from the original repository into its own Dockerfile, called Dockerfile.llama.
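A minimal sketch of that Dockerfile, assuming llama.cpp's prebuilt CUDA server image (the image tag, model URL, and flag values here are illustrative, not the repository's exact ones):

FROM ghcr.io/ggml-org/llama.cpp:server-cuda

# Bake a GGUF build of the Orpheus model into the image at build time
# so replicas don't re-download it on cold start (placeholder URL)
ADD https://huggingface.co/<gguf-repo>/resolve/main/orpheus-3b-0.1-ft-q4_k_m.gguf /models/orpheus.gguf

# --parallel controls how many requests llama.cpp decodes concurrently;
# keep it in sync with replica_concurrency in cerebrium.toml
ENTRYPOINT ["/app/llama-server", \
  "--model", "/models/orpheus.gguf", \
  "--host", "0.0.0.0", "--port", "8080", \
  "--n-gpu-layers", "99", \
  "--parallel", "4", \
  "--threads-http", "4", \
  "--ctx-size", "8192"]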
This just defines our container and its dependencies, downloads the model, and runs it using llama.cpp. In order to use this Dockerfile and run the application, we need to point our cerebrium.toml at it.
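A minimal sketch of the relevant sections, using the custom-runtime keys from the Cerebrium docs (the hardware and scaling values are illustrative; choose them for your own workload):

[cerebrium.deployment]
name = "orpheus-server"

[cerebrium.runtime.custom]
# Build from our custom Dockerfile and expose the port llama-server listens on
dockerfile_path = "./Dockerfile.llama"
port = 8080
healthcheck_endpoint = "/health"

[cerebrium.hardware]
compute = "AMPERE_A10"
cpu = 4
memory = 16.0

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
# Should match the --parallel value in Dockerfile.llama
replica_concurrency = 4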
Above, I point to the Dockerfile and the port it should be listening on. Based on the GPU and hardware I selected, I also set the replica concurrency, which is the number of requests each container can receive at any given time. Depending on where you land on the price/performance tradeoff, you can change the parallel/HTTP threads values in the Dockerfile as well as replica_concurrency in cerebrium.toml, keeping the two in sync.
Once the above is complete, you can run:
cerebrium deploy
You should then see your application deployed to your Cerebrium dashboard under a url like:
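https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-server
(where p-xxxxxxxx is your own project ID; the exact base URL is shown on your dashboard)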
FastAPI Server:
Start by cloning the GitHub repository:
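# Substitute the repository URL linked above for the placeholder
git clone <fastapi-repo-url>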
Make sure this is not inside your orpheus-server folder; these should be two separate directories. To the repository folder you just cloned, add a cerebrium.toml file.
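A sketch of what it can look like, assuming a uvicorn entrypoint (the module path app.main:app, the port, and the scaling values are placeholders; match them to the repository's actual entrypoint):

[cerebrium.deployment]
name = "orpheus-fastapi"

[cerebrium.runtime.custom]
port = 8000
# Replace app.main:app with the repository's actual FastAPI module
entrypoint = ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

[cerebrium.hardware]
compute = "CPU"
cpu = 2
memory = 8.0

[cerebrium.scaling]
min_replicas = 1
max_replicas = 3
replica_concurrency = 10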
Similar to above, this sets the entrypoint of the FastAPI app, specifies that it should run on a CPU, and defines how it should scale. In requirements.txt, uncomment the line near the bottom that installs torch, torchvision, and torchaudio.
Update your .env file with the deployment URL from your Orpheus server deployment above. Once complete, upload your .env file to the secrets section of your Cerebrium dashboard; we import these values as environment variables at runtime.
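The variable name below is a placeholder; use whichever key the repository's .env defines for the upstream Orpheus server URL:

ORPHEUS_API_URL=https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-server/v1/completions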
Run the command:
cerebrium deploy
And that's it! You should now be able to make the following cURL request and receive audio back from the server.
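The endpoint path and payload shape shown here are placeholders; check the repository's README for the exact route:

curl -X POST "https://api.cortex.cerebrium.ai/v4/p-xxxxxxxx/orpheus-fastapi/v1/audio/speech" \
  -H "Authorization: Bearer <YOUR_AUTH_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world! <chuckle>", "voice": "tara"}' \
  --output output.wav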
You can get the exact URL and auth token from the overview section of your dashboard. The response from this endpoint is streamed back as audio data in the specified format and is structured for real-time playback, so handle it accordingly in your client application. From our testing, this endpoint has a TTFB (time-to-first-byte) of ~100 ms, which is perfect for low-latency voice applications.
Conclusion
Orpheus TTS demonstrates how far open-source speech synthesis has come - delivering natural, human-like voices with the performance characteristics needed for real-world production. By combining advanced model design with features like zero-shot voice cloning, multi-language support, and real-time streaming, Orpheus moves beyond a research curiosity into a practical system for enterprise applications, creative work, and accessible technology.
Deploying Orpheus on Cerebrium makes it possible to scale this capability without the heavy lifting of managing GPUs, containers, multi-region support and autoscaling yourself. As the ecosystem evolves - with future releases bringing expanded language coverage, improved inference optimizations, and tighter integrations - the combination of Orpheus + Cerebrium offers a foundation for deploying next-generation voice systems at scale.
Explore the final GitHub repository to get started with your own deployment.