XTTS2-Docker is a high-performance, standalone FastAPI server for XTTSv2, specifically optimized for the latest generation of NVIDIA GPUs (Blackwell / RTX 50xx series).
This project allows you to run high-quality Text-to-Speech (TTS) generation in an isolated Docker container, keeping your local system clean and avoiding complex dependency issues.
- Blackwell Optimization: Uses PyTorch nightly builds with CUDA 12.8 for maximum performance on RTX 50-series cards.
- Dockerized: Easy installation and clean separation from the host OS.
- Standalone: Completely independent of other workspaces or projects.
- DeepSpeed Support: Built-in acceleration for faster inference.
- Simple API: Compatible with projects like SillyTavern or MegaRAG.
- Docker Desktop (or Docker Engine on Linux)
- NVIDIA Container Toolkit (for GPU acceleration)
- An NVIDIA GPU (optimized for Blackwell / SM 12.0, but compatible with other CUDA-enabled cards).
- Clone this repository or download the files.
- Run the `start_standalone.bat` file.
- The script will automatically create the necessary folders (`xtts_models`, `speakers`, `output`).
- It builds the Docker image (this may take several minutes the first time) and starts the container.
- The server is now accessible at `http://localhost:8020`.
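For reference, the container setup that `start_standalone.bat` performs can be approximated by a Docker Compose file like the one below. This is an illustrative sketch only: the service name, in-container paths (`/app/...`), and build context are assumptions, and the batch script already handles all of this for you.

```yaml
# Illustrative docker-compose.yml sketch -- not shipped with the project.
services:
  xtts:
    build: .                      # build the image from the repo's Dockerfile
    ports:
      - "8020:8020"               # expose the FastAPI server
    volumes:
      - ./xtts_models:/app/xtts_models   # model storage (path assumed)
      - ./speakers:/app/speakers         # speaker .wav samples (path assumed)
      - ./output:/app/output             # generated audio (path assumed)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia      # requires the NVIDIA Container Toolkit
              count: 1
              capabilities: [gpu]
```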
- Models: Your XTTS models should be stored in `/xtts_models`.
- Speaker Samples: Place your `.wav` files in the `/speakers` folder.
- Output: Generated audio files will be saved in the `/output` folder.
Once the server is running, you can find the interactive API documentation (FastAPI's Swagger UI) at `http://localhost:8020/docs`.
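As a starting point, here is a minimal stdlib-only client sketch. The endpoint name `/tts_to_audio/` and the payload fields follow daswer123's xtts-api-server, which this project is based on; confirm the exact schema at `http://localhost:8020/docs` before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8020"  # default port used by this project

def build_payload(text: str, speaker_wav: str = "example",
                  language: str = "en") -> bytes:
    """Encode the JSON request body for the TTS endpoint."""
    return json.dumps({
        "text": text,
        "speaker_wav": speaker_wav,  # name of a sample in the /speakers folder
        "language": language,
    }).encode("utf-8")

def synthesize(text: str, **kwargs) -> bytes:
    """POST the text to the server and return the generated audio bytes."""
    req = urllib.request.Request(
        f"{BASE_URL}/tts_to_audio/",
        data=build_payload(text, **kwargs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (with the container running):
#   audio = synthesize("Hello from XTTS2-Docker!")
#   open("output/hello.wav", "wb").write(audio)
```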
This project is based on the great work of:
- daswer123 (xtts-api-server)
- Haurrus (Mantella / Blackwell Compatibility)
- Coqui AI (XTTSv2 Model)
The extraction, standalone configuration, and cleanup for this repository were performed in collaboration with Antigravity (my AI coding partner from Google DeepMind).
Happy generating!