| F5-TTS | Base, v1, E2TTS + 8 lang models | ~1.2GB each | ✅ | ✅ | ❌ | ❌ | ❌ | CC-BY-NC-4.0 | Targeted Word/Speech Editing, Speed control | 🇺🇸🇩🇪🇪🇸🇫🇷🇮🇹🇯🇵🇧🇷🇵🇱🇮🇳🇹🇭 |
| ChatterBox | EN, DE×3, IT, FR, RU, HY, KA, JA, KO, NO | ~4.3GB | ✅ | ✅ | ✅ | ❌ | ❌ | MIT | Expressiveness slider | 🇺🇸🇩🇪🇫🇷🇮🇹🇯🇵🇰🇷🇷🇺🇳🇴🇦🇲🇬🇪 |
| ChatterBox 23L | v1, v2, Vietnamese (Viterbox), Egyptian Arabic (oddadmix) | ~4.3GB | ✅ | ✅ | ✅ | ❌ | ❌ | MIT | 24 languages in a single model, emotion tokens (v2 only, currently not working) | 🇺🇸🇨🇳🇩🇪🇪🇸🇫🇷🇮🇹🇯🇵🇰🇷🇷🇺🇵🇹🇵🇱🇮🇳🇪🇬🇹🇷🇹🇭🇳🇴🇻🇳🇩🇰🇫🇮🇬🇷🇮🇱🇲🇾🇳🇱🇸🇪🇰🇪(+9) |
| VibeVoice | 1.5B, 7B, KugelAudio-0 (7B), kugel-2 (7B), Hindi-1.5B/7B | 5.4GB / 18GB | ✅ | ✅ | ❌ | ❌ | ❌ | MIT (research-only per model card) | 90-min long-form, Native 4-speaker (Base models), Multilingual (KugelAudio variants), 4-bit quantization | 🇺🇸🇨🇳🇩🇪🇪🇸🇫🇷🇮🇹🇯🇵🇰🇷🇷🇺🇧🇷🇵🇱🇮🇳🇪🇬🇹🇷🇹🇭🇳🇴🇻🇳🇦🇲🇬🇪🇩🇰🇫🇮🇬🇷🇮🇱🇲🇾🇳🇱🇸🇪🇰🇪 |
| Higgs Audio 2 | 3B | ~9GB | ✅ | ✅ | ❌ | ❌ | ❌ | Boson Higgs Audio 2 Community License | 3 multi-speaker, CUDA graphs (55+ tokens/sec) | 🇺🇸🇨🇳🇩🇪🇪🇸🇰🇷 |
| IndexTTS-2 | IndexTTS-2 | ~4.7GB | ✅ | ✅ | ❌ | ❌ | ❌ | bilibili Model Use License | Emotion Control: 8 vectors, Text as reference, Audio as reference | 🇺🇸🇨🇳🇯🇵 |
| CosyVoice3 | 0.5B, 0.5B-RL | ~5.4GB | ✅ | ✅ | ✅ | ❌ | ❌ | Apache-2.0 | Paralinguistic tags | 🇺🇸🇨🇳🇯🇵🇰🇷 |
| Qwen3-TTS | 0.6B, 1.7B (CustomVoice/VoiceDesign/Base) | ~3-6GB | ✅ | ✅ | ❌ | ✅ | ❌ | Apache-2.0 | Voice design, ASR (Automatic Speech Recognition) | 🇺🇸🇨🇳🇩🇪🇪🇸🇫🇷🇮🇹🇯🇵🇰🇷🇷🇺🇵🇹 |
| Granite ASR | granite-4.0-1b-speech | ~4.6GB | ❌ | ✅ | ❌ | ✅ | ❌ | Apache-2.0 | ASR (Automatic Speech Recognition), Custom timestamps/SRT via reused Qwen forced aligner, Speech translation (experimental) | 🇺🇸🇩🇪🇪🇸🇫🇷🇯🇵🇵🇹 |
| Step Audio EditX | 3B LLM + CosyVoice | ~7GB | ✅ | ✅ | ❌ | ❌ | ❌ | Apache-2.0 (verify before commercial use) | Second Pass Speech Editing Node: 14 emotions, 32 speaking styles, Paralinguistic effects | 🇺🇸🇨🇳🇯🇵🇰🇷 |
| Echo-TTS | echo-tts-base + fish-s1-dac-min | ~5.3GB + ~1.8GB | ✅ | ✅ | ❌ | ❌ | ❌ | CC-BY-NC-SA-4.0 | Diffusion-based (~30s best), Force Speaker KV (speaker drift control) | 🇺🇸 |
| RVC | Community .pth | 100-300MB | ❌ | ❌ | ✅ | ❌ | ✅ | MIT (framework); community models vary | Real-time VC, Integrated training workflow, Pitch shift (±14), 6 HuBERT models, Language-independent | 🇺🇸🇨🇳🇩🇪🇪🇸🇫🇷🇮🇹🇯🇵🇰🇷🇷🇺🇧🇷🇵🇱🇮🇳🇪🇬🇹🇷🇹🇭🇳🇴🇻🇳🇦🇲🇬🇪🇩🇰🇫🇮🇬🇷🇮🇱🇲🇾🇳🇱🇸🇪🇰🇪 |