# TTS Audio Suite - Universal TTS for ComfyUI
# Comprehensive multi-engine TTS with Python 3.13 compatibility

# --- INSTALLATION METHOD ---
# This custom node uses install.py for intelligent dependency management.
# ComfyUI Manager automatically runs install.py, which handles:
# - Python 3.13 compatibility issues (MediaPipe → OpenSeeFace fallback)
# - NumPy version conflicts (constraints to avoid Numba issues)
# - Package dependency conflicts (selective --no-deps installation)
# - Bundled engines, including ChatterBox, F5-TTS, and Higgs Audio
# - Optional features: RVC voice conversion, mouth movement analysis

# --- CORE SAFE PACKAGES ---
# These packages rarely cause conflicts and install normally

# Foundation ML packages
torch>=2.0.0
torchaudio>=2.0.0
numpy>=1.26.4,<2.3.0  # Compatible with both numpy 1.26.4 and 2.x series
setuptools>=65.0.0    # Provides distutils compatibility for Python 3.12+ (required by FunASR bundled code)

# Audio processing (safe)
soundfile>=0.12.0
sounddevice>=0.4.0

# Text processing (safe)
jieba
pypinyin
unidecode
phonemizer              # IPA phonemization for multilingual TTS (requires espeak system dependency)
omegaconf>=2.3.0
transformers>=4.51.3,<=4.57.3  # Required for VibeVoice compatibility (4.51.3+). Transformers 5.0.0 breaks Qwen3-TTS tokenizer loading.

# ML utilities (safe)
accelerate
datasets
requests
dacite
bitsandbytes>=0.47.0     # 4-bit quantization support for VibeVoice memory efficiency

# Bundled engine dependencies (safe)
conformer>=0.3.2      # ChatterBox engine
x-transformers
torchdiffeq          # F5-TTS differential equations
wandb                # F5-TTS logging
ema-pytorch          # F5-TTS exponential moving average
vocos                # F5-TTS vocoder

# Echo-TTS engine (CUDA recommended)
echo-tts

# Audio restoration
# VoiceFixer is bundled in utils/voicefixer_bundled/
# Uses librosa (already required) for STFT/ISTFT instead of torchlibrosa, which reduces dependencies

# IndexTTS-2 engine dependencies (safe)
cn2an>=0.5.22         # Chinese number to Arabic number conversion
g2p-en>=2.1.0         # English grapheme-to-phoneme conversion
keras>=2.9.0          # Deep learning framework
modelscope>=1.27.0    # Chinese model hub for IndexTTS-2
munch>=4.0.0          # Dictionary access with dot notation
json5>=0.12.0         # JSON5 parsing for IndexTTS-2 config files
ninja>=1.11.0         # Build tool for CUDA kernel compilation (BigVGAN optimization)
sentencepiece>=0.2.1  # Text tokenization
textstat>=0.7.10      # Text statistics and readability
punctuators           # ONNX punctuation/truecase post-processing for ASR text

# Step Audio EditX engine dependencies (safe)
openai-whisper        # Mel spectrogram extraction for audio tokenizer
funasr>=1.1.3         # FunASR speech processing toolkit
nagisa>=0.2.11        # Japanese tokenizer required by Qwen3-ASR forced aligner
hyperpyyaml           # YAML configuration parser
protobuf>=3.20.0      # Protocol buffers (compatible with descript-audiotools)
# onnxruntime installed by install.py with --no-deps to avoid conflicts

# RVC voice conversion (safe)
monotonic-alignment-search
faiss-cpu>=1.7.4
praat-parselmouth>=0.4.6  # Praat-based f0 extraction for RVC (pm method)
pyworld>=0.3.5           # World vocoder for RVC harvest/dio methods
torchfcpe>=0.0.4         # Fast Context-based Pitch Estimation for RVC (fcpe method)

# Optional performance enhancements
# sageattention  # Optional: GPU-optimized mixed-precision attention for VibeVoice (requires CUDA SM80+)

# --- PROBLEMATIC PACKAGES ---
# These are installed by install.py with special handling (NOT here in requirements.txt):
# - librosa (--no-deps): Forces numpy downgrade
# - descript-audio-codec (--no-deps): Conflicts with protobuf
# - cached-path (--no-deps): Forces package downgrades
# - torchcrepe (--no-deps): Conflicts via librosa dependency
# - onnxruntime (--no-deps): Forces numpy 2.3.x, needed for OpenSeeFace
# - opencv-python (--no-deps): Forces numpy downgrade via numpy<2.3.0 constraint
# - gradio (--no-deps): Forces pydantic, pillow, pydantic-core downgrades

# --- PYTHON 3.13 NOTES ---
# [OK] All TTS engines work (ChatterBox, F5-TTS, Higgs Audio, CosyVoice3)
# [OK] RVC voice conversion works
# [OK] OpenSeeFace mouth movement analysis works (experimental alternative to MediaPipe)
# [NO] MediaPipe is incompatible (binary compatibility issue)

# CosyVoice3 engine dependencies (bundled in engines/cosyvoice/impl/)
# Most dependencies handled by install.py (diffusers, hydra-core, matplotlib, rich, uvicorn, wetext, onnxruntime)
inflect>=7.3.0           # Text normalization for English (used in frontend.py)

# --- BUNDLED ENGINES ---
# All engines are bundled to avoid external dependency conflicts:
# - ChatterBox: engines/chatterbox/ (modified for ComfyUI)
# - F5-TTS: engines/f5_tts/ (numpy 2.x compatible fork)
# - Higgs Audio: engines/higgs_audio/ (transformers 4.46+ compatible)
# - IndexTTS-2: engines/index_tts/ (emotion disentanglement TTS)
# - CosyVoice3: engines/cosyvoice/impl/ (multilingual zero-shot voice cloning)