
Model Download Sources

All entries below are generated from a single source-of-truth YAML file. Use this page as the canonical list of model repositories and links for offline setup.
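For offline setup, one way to pre-fetch an entry is sketched below. It assumes the Source column holds Hugging Face repository IDs, which this page does not state explicitly (some entries, such as release archives, likely live elsewhere). The `models/<owner>/<name>` folder layout and both function names are hypothetical; the suite's own downloader may store files differently.

```python
from pathlib import Path


def local_model_dir(models_root: str, repo_id: str) -> Path:
    """Map a repo ID such as 'SWivid/F5-TTS' to a local folder.

    The <root>/<owner>/<name> layout is an assumption for illustration only.
    """
    owner, name = repo_id.split("/", 1)
    return Path(models_root) / owner / name


def prefetch(repo_id: str, models_root: str = "models") -> Path:
    """Download a full repository snapshot so it is available offline."""
    # Imported lazily so local_model_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(models_root, repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `prefetch("SWivid/F5-TTS")` on a machine with network access would fetch that repository once; the resulting folder can then be copied to the offline machine.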

F5-TTS

Component	Source	Size	Auto-Download	Notes
F5TTS_Base	SWivid/F5-TTS	~1.2GB	✅	English base model
F5TTS_v1_Base	SWivid/F5-TTS	~1.2GB	✅	English v1 model
E2TTS_Base	SWivid/E2-TTS	~1.2GB	✅	English E2-TTS model
F5-DE	aihpi/F5-TTS-German	~1.2GB	✅	German finetune
F5-ES	jpgallegoar/F5-Spanish	~1.2GB	✅	Spanish finetune
F5-FR	RASPIAUDIO/F5-French-MixedSpeakers-reduced	~1.2GB	✅	French finetune
F5-JP	Jmica/F5TTS	~1.2GB	✅	Japanese finetune
F5-Hindi-Small	SPRINGLab/F5-Hindi-24KHz	~632MB	✅	Hindi finetune
Vocos Mel-24kHz	charactr/vocos-mel-24khz	N/A	✅	Optional vocoder
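When planning disk space for an offline install, the Size column can be summed per engine. The sketch below is a rough estimate only: it uses the approximate sizes from the F5-TTS table above, treats 1 GB as 1000 MB, and skips entries without a numeric size (such as `N/A`). The helper names are hypothetical.

```python
import re

# Conversion factors to gigabytes (decimal units, an assumption for simplicity).
UNITS_IN_GB = {"GB": 1.0, "MB": 0.001}


def parse_size_gb(text):
    """Return the size in GB for strings like '~1.2GB' or '~632MB', else None."""
    match = re.fullmatch(r"~?(\d+(?:\.\d+)?)\s*(GB|MB)", text.strip())
    if match is None:
        return None
    return float(match.group(1)) * UNITS_IN_GB[match.group(2)]


def total_gb(sizes):
    """Sum all parseable sizes, ignoring entries such as 'N/A'."""
    parsed = (parse_size_gb(size) for size in sizes)
    return sum(gb for gb in parsed if gb is not None)


# Size column of the F5-TTS table, in row order.
F5_TTS_SIZES = [
    "~1.2GB", "~1.2GB", "~1.2GB", "~1.2GB",
    "~1.2GB", "~1.2GB", "~1.2GB", "~632MB", "N/A",
]
```

With these values, `total_gb(F5_TTS_SIZES)` comes to roughly 9 GB for the full F5-TTS set, not counting the vocoder, whose size is not listed.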

ChatterBox

Component	Source	Size	Auto-Download	Notes
English	ResembleAI/chatterbox	~2GB	✅	.pt model set
German	stlohrey/chatterbox_de	~4.3GB	✅	.safetensors model set
German (havok2)	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
German (SebastianBodza)	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Italian	niobures/Chatterbox-TTS	~4.3GB	✅	.pt model set
French	Thomcles/ChatterBox-fr	~4.3GB	✅	.safetensors model set
Russian	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Armenian	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Georgian	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Japanese	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Korean	niobures/Chatterbox-TTS	~4.3GB	✅	.safetensors model set
Norwegian	akhbar/chatterbox-tts-norwegian	~4.3GB	✅	.safetensors model set

ChatterBox 23L

Component	Source	Size	Auto-Download	Notes
Official 23-Lang (v1/v2)	ResembleAI/chatterbox	~4.3GB	✅	v1 + v2 files and tokenizer
Russian stress dictionary (Russian only)	Vuizur/add-stress-to-epub release	~1.5GB	✅	Auxiliary Russian stress-labeling data for Official 23-Lang; downloaded on demand only when Russian stress support is used
Vietnamese (Viterbox)	dolly-vn/viterbox	~4.3GB	✅	Vietnamese community finetune used by downloader
Egyptian Arabic (oddadmix)	oddadmix/chatterbox-egyptian-v0	~4.3GB	✅	Egyptian Arabic community finetune (architecture v2)

VibeVoice

Component	Source	Size	Auto-Download	Notes
vibevoice-1.5B	microsoft/VibeVoice-1.5B	~5.4GB	✅	Microsoft official model
vibevoice-7B	aoi-ot/VibeVoice-Large	~18GB	✅	Community mirror used by downloader
kugelaudio-0-open	kugelaudio/kugelaudio-0-open	~18GB	✅	KugelAudio multilingual 7B variant
kugel-2	kugelaudio/kugel-2	~18.7GB	✅	KugelAudio v2 merged 7B variant

Higgs Audio 2

Component	Source	Size	Auto-Download	Notes
higgs-audio-v2-3B	bosonai/higgs-audio-v2-generation-3B-base	~9GB	✅	Generation model
Audio tokenizer	bosonai/higgs-audio-v2-tokenizer	~200MB	✅	Tokenizer model

IndexTTS-2

Component	Source	Size	Auto-Download	Notes
IndexTTS-2	IndexTeam/IndexTTS-2	Multiple files	✅	Main TTS engine
w2v-bert-2.0	facebook/w2v-bert-2.0	~2GB	✅	Semantic feature extractor
qwen0.6bemo4-merge	Included with IndexTTS-2	Included	✅	Text emotion model bundle

CosyVoice3

Component	Source	Size	Auto-Download	Notes
Fun-CosyVoice3-0.5B / 0.5B-RL	FunAudioLLM/Fun-CosyVoice3-0.5B-2512	~5.4GB first variant (+~2GB second)	✅	Both variants share common files

Qwen3-TTS

Component	Source	Size	Auto-Download	Notes
CustomVoice 0.6B	Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice	~1.5GB	✅	Preset voices + instructions
CustomVoice 1.7B	Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice	~4.2GB	✅	Preset voices + instructions
VoiceDesign 1.7B	Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign	~4.2GB	✅	Text-to-voice design model
Base 0.6B	Qwen/Qwen3-TTS-12Hz-0.6B-Base	~1.5GB	✅	Zero-shot voice cloning
Base 1.7B	Qwen/Qwen3-TTS-12Hz-1.7B-Base	~4.2GB	✅	Zero-shot voice cloning
Qwen3-ASR-1.7B	Qwen/Qwen3-ASR-1.7B	N/A	✅	ASR transcription model
Qwen3-ForcedAligner-0.6B	Qwen/Qwen3-ForcedAligner-0.6B	N/A	✅	Word-level timestamps

Granite ASR

Component	Source	Size	Auto-Download	Notes
granite-4.0-1b-speech	ibm-granite/granite-4.0-1b-speech	~4.6GB	✅	Main Granite ASR / AST model
Qwen3-ForcedAligner-0.6B	Qwen/Qwen3-ForcedAligner-0.6B	N/A	✅	Optional custom word-level timestamps/SRT path; reused from Qwen folder
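Because the forced aligner is listed under both Qwen3-TTS and Granite ASR, an offline download plan only needs to fetch it once. The sketch below shows one way to spot such shared sources; the `ENGINE_SOURCES` mapping is a hand-copied, partial excerpt of the tables on this page, and `repos_to_engines` is a hypothetical helper, not part of the suite.

```python
# Partial excerpt of the Source columns, keyed by engine (illustrative only).
ENGINE_SOURCES = {
    "Qwen3-TTS": [
        "Qwen/Qwen3-ASR-1.7B",
        "Qwen/Qwen3-ForcedAligner-0.6B",
    ],
    "Granite ASR": [
        "ibm-granite/granite-4.0-1b-speech",
        "Qwen/Qwen3-ForcedAligner-0.6B",
    ],
}


def repos_to_engines(engine_sources):
    """Invert the mapping: each repo ID maps to the engines that use it."""
    usage = {}
    for engine, repos in engine_sources.items():
        for repo in repos:
            usage.setdefault(repo, []).append(engine)
    return usage


def shared_repos(engine_sources):
    """Return repo IDs used by more than one engine, in sorted order."""
    usage = repos_to_engines(engine_sources)
    return sorted(repo for repo, engines in usage.items() if len(engines) > 1)
```

Iterating over the keys of `repos_to_engines(ENGINE_SOURCES)` gives each repository exactly once, so a shared model is not downloaded twice.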

Step Audio EditX

Component	Source	Size	Auto-Download	Notes
Step-Audio-EditX	stepfun-ai/Step-Audio-EditX	~7GB	✅	Main 3B audio editing model
Step-Audio-Tokenizer	stepfun-ai/Step-Audio-Tokenizer	Included	✅	Tokenizer bundle used by Step EditX

Echo-TTS

Component	Source	Size	Auto-Download	Notes
echo-tts-base (model + PCA state)	jordand/echo-tts-base	~5.3GB	✅	pytorch_model.safetensors + pca_state.safetensors
fish-s1-dac-min (audio codec)	jordand/fish-s1-dac-min	~1.8GB	✅	pytorch_model.safetensors — audio codec required by Echo-TTS

RVC

Component	Source	Size	Auto-Download	Notes
RVC character pack	SayanoAI/RVC-Studio (RVC folder)	Varies	✅	Default auto-download characters: Claire, Sayano, Mae_v2, Fuji, Monika (extras also available)
RVC index pack (.index)	SayanoAI/RVC-Studio (.index folder)	Varies	✅	Optional FAISS indexes for improved voice similarity
content-vec-best.safetensors	lengyue233/content-vec-best	~300MB	✅	Voice feature model
rmvpe.pt	lj1995/VoiceConversionWebUI	~55MB	✅	Pitch extraction model
pretrained_v2 (f0 G/D pairs)	lj1995/VoiceConversionWebUI	~300MB total	✅	Training init checkpoints for 32k/40k/48k RVC runs; downloaded on first training use

Generated from tts_audio_suite_engines.yaml.