TTS Engines Reference Tables

Engine Comparison

| Engine | Models | Size | TTS | SRT | VC | ASR | Training | License | Special Features | Languages |
|---|---|---|---|---|---|---|---|---|---|---|
| F5-TTS | Base, v1, E2TTS + 8 lang models | ~1.2GB each | | | | | | CC-BY-NC-4.0 | Targeted Word/Speech Editing, Speed control | 🇺🇸​🇩🇪​🇪🇸​🇫🇷​🇮🇹​🇯🇵​🇧🇷​🇵🇱​🇮🇳​🇹🇭 |
| ChatterBox | EN, DE×3, IT, FR, RU, HY, KA, JA, KO, NO | ~4.3GB | | | | | | MIT | Expressiveness slider | 🇺🇸​🇩🇪​🇫🇷​🇮🇹​🇯🇵​🇰🇷​🇷🇺​🇳🇴​🇦🇲​🇬🇪 |
| ChatterBox 23L | v1, v2, Vietnamese (Viterbox), Egyptian Arabic (oddadmix) | ~4.3GB | | | | | | MIT | 24 languages in single model, emotion tokens (v2 - doesn't work) | 🇺🇸​🇨🇳​🇩🇪​🇪🇸​🇫🇷​🇮🇹​🇯🇵​🇰🇷​🇷🇺​🇵🇹​🇵🇱​🇮🇳​🇪🇬​🇹🇷​🇹🇭​🇳🇴​🇻🇳​🇩🇰​🇫🇮​🇬🇷​🇮🇱​🇲🇾​🇳🇱​🇸🇪​🇰🇪(+9) |
| VibeVoice | 1.5B, 7B, KugelAudio-0 (7B), kugel-2 (7B), Hindi-1.5B/7B | 5.4GB / 18GB | | | | | | MIT (research-only per model card) | 90-min long-form, Native 4-speaker (Base models), Multilingual (KugelAudio variants), 4-bit quantization | 🇺🇸​🇨🇳​🇩🇪​🇪🇸​🇫🇷​🇮🇹​🇯🇵​🇰🇷​🇷🇺​🇧🇷​🇵🇱​🇮🇳​🇪🇬​🇹🇷​🇹🇭​🇳🇴​🇻🇳​🇦🇲​🇬🇪​🇩🇰​🇫🇮​🇬🇷​🇮🇱​🇲🇾​🇳🇱​🇸🇪​🇰🇪 |
| Higgs Audio 2 | 3B | ~9GB | | | | | | Boson Higgs Audio 2 Community License | 3 multi-speaker, CUDA graphs (55+ tokens/sec) | 🇺🇸​🇨🇳​🇩🇪​🇪🇸​🇰🇷 |
| IndexTTS-2 | IndexTTS-2 | ~4.7GB | | | | | | bilibili Model Use License | Emotion Control: 8 vectors, Text as reference, Audio as reference | 🇺🇸​🇨🇳​🇯🇵 |
| CosyVoice3 | 0.5B, 0.5B-RL | ~5.4GB | | | | | | Apache-2.0 | Paralinguistic tags | 🇺🇸​🇨🇳​🇯🇵​🇰🇷 |
| Qwen3-TTS | 0.6B, 1.7B (CustomVoice/VoiceDesign/Base) | ~3-6GB | | | | | | Apache-2.0 | Voice design, ASR (Automatic Speech Recognition) | 🇺🇸​🇨🇳​🇩🇪​🇪🇸​🇫🇷​🇮🇹​🇯🇵​🇰🇷​🇷🇺​🇵🇹 |
| Granite ASR | granite-4.0-1b-speech | ~4.6GB | | | | | | Apache-2.0 | ASR (Automatic Speech Recognition), Custom timestamps/SRT via reused Qwen forced aligner, Speech translation (experimental) | 🇺🇸​🇩🇪​🇪🇸​🇫🇷​🇯🇵​🇵🇹 |
| Step Audio EditX | 3B LLM + CosyVoice | ~7GB | | | | | | Apache-2.0 (verify before commercial use) | Second Pass Speech Editing Node: 14 emotions, 32 speaking styles, Paralinguistic effects | 🇺🇸​🇨🇳​🇯🇵​🇰🇷 |
| Echo-TTS | echo-tts-base + fish-s1-dac-min | ~5.3GB + ~1.8GB | | | | | | CC-BY-NC-SA-4.0 | Diffusion-based (~30s best), Force Speaker KV (speaker drift control) | 🇺🇸 |
| RVC | Community .pth | 100-300MB | | | | | | MIT (framework); community models vary | Real-time VC, Integrated training workflow, Pitch shift (±14), 6 HuBERT models, Language-independent | 🇺🇸​🇨🇳​🇩🇪​🇪🇸​🇫🇷​🇮🇹​🇯🇵​🇰🇷​🇷🇺​🇧🇷​🇵🇱​🇮🇳​🇪🇬​🇹🇷​🇹🇭​🇳🇴​🇻🇳​🇦🇲​🇬🇪​🇩🇰​🇫🇮​🇬🇷​🇮🇱​🇲🇾​🇳🇱​🇸🇪​🇰🇪 |