
Releases: diodiogod/TTS-Audio-Suite

v4.25.1 - Integrated RVC Model Training

05 Apr 21:12


TTS Audio Suite v4.25.1

This release adds integrated RVC model training to TTS Audio Suite.

You can now prepare datasets, train RVC models, monitor progress in a live dashboard, and load the resulting models back into the normal RVC voice conversion workflow without leaving the suite.

New

  • Integrated RVC model training inside TTS Audio Suite
  • New 📦 RVC Dataset Prep node
  • New 🎛️ RVC Training Config node
  • New unified 🎓 Model Training node
  • New live training dashboard with progress, ETA, speed, and recent loss graph
  • Support for resume from real checkpoints
  • Support for continue_from using prior training artifacts or an existing 🎭 Load RVC Character Model
  • Safer interrupt handling with resumable checkpoint saving
  • Improved RVC model loading and index auto-detection
  • New RVC training workflow example
  • Updated README, model layout docs, and engine capability tables

Notes

  • resume continues the same training job from saved checkpoints
  • continue_from starts a new run from an already trained model or prior training artifacts
  • RVC is the first engine with integrated training support in the suite
  • The training architecture is unified so future engines can plug into the same workflow later
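The resume vs. continue_from distinction above can be sketched as follows. All names here are hypothetical illustrations, not the suite's actual node parameters:

```python
# Hypothetical sketch of resume vs. continue_from; names are illustrative,
# not the actual TTS Audio Suite node parameters.

def init_training(resume_state=None, continue_from_weights=None):
    """Return (weights, step) for a training run."""
    if resume_state is not None:
        # resume: the same job picks up its saved weights AND step count
        return resume_state["weights"], resume_state["step"]
    if continue_from_weights is not None:
        # continue_from: a NEW run starts from prior weights at step 0
        return continue_from_weights, 0
    return {"w": 0.0}, 0  # fresh run from scratch
```

In other words, resume keeps the whole job state (step count, schedule) intact, whereas continue_from only reuses the learned weights and begins a fresh run.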

Included workflow

  • RVC 🎓 Model Training

Restart ComfyUI after updating.

v4.22.0 - Echo-TTS Engine

01 Mar 04:15
5cdc6c8


🎧 Echo-TTS Engine

New engine contribution by @drphero!

✨ New Features

  • Echo-TTS - DiT-based voice cloning engine (English only)
  • Full integration with character switching, pause tags, and SRT timing
  • Auto-downloads models on first use (~7.1GB total)
  • License: CC-BY-NC-SA (non-commercial use only)

🙏 Credits

Full credit to @drphero for the original Echo-TTS integration.


TTS Audio Suite now includes 11 engines: ChatterBox, ChatterBox 23-Lang, F5-TTS, Higgs Audio 2, VibeVoice, IndexTTS-2, CosyVoice3, Qwen3-TTS, Echo-TTS, Step Audio EditX, and RVC.

v4.19.0 - Qwen3-TTS Engine with Voice Designer

28 Jan 14:22


🎨 Qwen3-TTS Engine - Create Voices from Text!

Major new engine addition! Qwen3-TTS brings a unique Voice Designer feature that lets you create custom voices from natural language descriptions. Plus three distinct model types for different use cases!

✨ New Features

Qwen3-TTS Engine

  • 🎨 Voice Designer - Create custom voices from text descriptions! "A calm female voice with British accent" → instant voice generation
  • Three model types with different capabilities:
    • CustomVoice: 9 high-quality preset speakers (Vivian, Serena, Dylan, Eric, Ryan, etc.)
    • VoiceDesign: Text-to-voice creation - describe your ideal voice and generate it
    • Base: Zero-shot voice cloning from audio samples
  • 10 language support - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Model sizes: 0.6B (low VRAM) and 1.7B (high quality) variants
  • Character voice switching with [CharacterName] syntax - automatic preset mapping
  • SRT subtitle timing support with all timing modes (stretch_to_fit, pad_with_silence, etc.)
  • Inline edit tags - Apply Step Audio EditX post-processing (emotions, styles, paralinguistic effects)
  • Sage attention support - Improved VRAM efficiency with sageattention backend
  • Smart caching - Prevents duplicate voice generation, skips model loading for existing voices
  • Per-segment parameters - Control [seed:42], [temperature:0.8] inline
  • Auto-download system - All 6 model variants downloaded automatically when needed
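As an illustration of how inline per-segment tags like [seed:42] and [temperature:0.8] might be consumed, here is a minimal parser sketch — a hypothetical helper, not the suite's actual implementation:

```python
import re

# Illustrative parser for inline per-segment parameter tags such as
# [seed:42] or [temperature:0.8]; a sketch, not the suite's real parser.
TAG_RE = re.compile(r"\[(seed|temperature|speed):([0-9.]+)\]")

def extract_params(text):
    """Return (clean_text, params) with recognized tags stripped out."""
    params = {}
    for key, value in TAG_RE.findall(text):
        # integers for seed-like values, floats for the rest
        params[key] = float(value) if "." in value else int(value)
    clean = TAG_RE.sub("", text).strip()
    return clean, params
```

For example, extract_params("[seed:42] [temperature:0.8] Hello there!") yields the cleaned text plus a parameter dict the engine call could consume per segment.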

🎙️ Voice Designer Node

The standout feature of this release! Create voices without audio samples:

  • Natural language input - Describe voice characteristics in plain English
  • Disk caching - Saved voices load instantly without regeneration
  • Standard format - Works seamlessly with Character Voices system
  • Unified output - Compatible with all TTS nodes via NARRATOR_VOICE format

Example descriptions:

  • "A calm female voice with British accent"
  • "Deep male voice, authoritative and professional"
  • "Young cheerful woman, slightly high-pitched"

📚 Documentation

  • YAML-driven engine tables - Auto-generated comparison tables
  • Condensed engine overview in README
  • Portuguese accent guidance - Clear documentation of model limitations and workarounds

🎯 Technical Highlights

  • Official Qwen3-TTS implementation bundled for stability
  • 24kHz mono audio output
  • Progress bars with real-time token generation tracking
  • VRAM management with automatic model reload and device checking
  • Full unified architecture integration
  • Interrupt handling for cancellation support

With Qwen3-TTS, the suite now totals 10 TTS engines, each with unique capabilities. Voice Designer is a first-of-its-kind feature in ComfyUI TTS extensions!

v4.16.0 - CosyVoice3 TTS Engine

30 Dec 01:17


🎙️ CosyVoice3 TTS Engine - Zero-Shot Voice Conversion!

Major new engine addition! CosyVoice3 brings powerful TTS capabilities AND zero-shot voice conversion. Previously, ChatterBox was the only zero-shot voice changer option (RVC requires training). Now you have another high-quality option with CosyVoice3 VC!

✨ New Features

CosyVoice3 Engine

  • Zero-shot voice conversion (VC) - Convert any voice to match another voice without training! Another option alongside ChatterBox VC
  • Iterative refinement cache - Improve VC quality through multiple passes
  • Multilingual TTS - 4 core languages (Chinese, English, Japanese, Korean) plus 5 additional languages
  • Three TTS modes: zero-shot, instruction, and cross-lingual voice cloning
  • Model variants: standard and RL-enhanced (improved quality, set as default)
  • Paralinguistic tag support for natural speech effects (laughter, breath, cough, etc.)
  • Character voice switching with [CharacterName] syntax
  • Language switching with [lang:code] syntax
  • SRT subtitle timing support for synchronized audio generation
  • Per-segment parameter control ([seed:42], [speed:1.5])
  • Live generation progress with token-by-token updates and ETA

🔧 Improvements

  • Fix ChatterBox and RVC model discovery to work with custom model paths (extra_model_paths.yaml)
  • Fix RVC audio chunking errors with short segments

📚 Documentation

  • New CosyVoice3 paralinguistic tags guide
  • Updated README with CosyVoice3 features and examples

🙏 Credits

Initial CosyVoice3 implementation by @tazztone

v4.15.0 - Step Audio EditX Engine & Universal Inline Edit Tags

13 Dec 00:15


TTS Audio Suite v4.15.0

🎉 Major New Features

⚙️ Step Audio EditX TTS Engine

A powerful new AI-powered text-to-speech engine with zero-shot voice cloning:

  • Clone any voice from just 3-10 seconds of audio
  • Natural-sounding speech generation
  • Memory-efficient with int4/int8 quantization options (uses less VRAM)
  • Character switching and per-segment parameter support

🎨 Step Audio EditX Audio Editor

Transform any TTS engine's output with AI-powered audio editing (post-processing):

  • 14 emotions: happy, sad, angry, surprised, fearful, disgusted, contempt, neutral, etc.
  • 32 speaking styles: whisper, serious, child, elderly, neutral, and more
  • Speed control: make speech faster or slower
  • 10 paralinguistic effects: laughter, breathing, sigh, gasp, crying, sniff, cough, yawn, scream, moan
  • Audio cleanup: denoise and voice activity detection
  • Universal compatibility: Works with audio from ANY TTS engine (ChatterBox, F5-TTS, Higgs Audio, VibeVoice)

🏷️ Universal Inline Edit Tags

Add audio effects directly in your text across all TTS engines:

  • Easy syntax: "Hello <Laughter> this is amazing!"
  • Works everywhere: Compatible with all TTS engines using Step Audio EditX post-processing
  • Multiple tag types: <emotion>, <style>, <speed>, and paralinguistic effects
  • Control intensity: <Laughter:2> for stronger effect, <Laughter:3> for maximum
  • Voice restoration: <restore> tag to return to original voice after edits
  • 📖 Read the complete Inline Edit Tags guide
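A minimal recognizer for these tags could look like the sketch below. It is purely illustrative — not the suite's actual tag parser — and only models the `<Tag>` / `<Tag:intensity>` shape described above:

```python
import re

# Illustrative recognizer for inline edit tags such as <Laughter>,
# <Laughter:3>, or <restore>; a sketch only, not the suite's parser.
EDIT_TAG_RE = re.compile(r"<([A-Za-z_]+)(?::([123]))?>")

def parse_edit_tags(text):
    """Return (tag, intensity) pairs in order of appearance.

    Intensity defaults to 1 when no :N suffix is given.
    """
    return [(m.group(1), int(m.group(2) or 1))
            for m in EDIT_TAG_RE.finditer(text)]
```

So "Hello <Laughter:2> this is amazing! <restore>" would produce a Laughter event at intensity 2 followed by a restore event.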

📝 Multiline TTS Tag Editor Enhancements

  • New tabbed interface for inline edit tag controls
  • Quick-insert buttons for emotions, styles, and effects
  • Better copy/paste compatibility with ComfyUI v0.3.75+
  • Improved syntax highlighting and text formatting

📦 New Example Workflows

  • Step Audio EditX Integration - Basic TTS usage examples
  • Audio Editor + Inline Edit Tags - Advanced editing demonstrations
  • Updated Voice Cleaning workflow with Step Audio EditX denoise option

🔧 Improvements

  • Better memory management and model caching across all engines

TTS Audio Suite v4.9.0 - IndexTTS-2 with Advanced Emotion Control

16 Sep 18:19


🌈 IndexTTS-2 with Advanced Emotion Control


This major release introduces IndexTTS-2, a revolutionary TTS engine with sophisticated emotion control capabilities that takes voice synthesis to the next level.

🎯 Key Features

🆕 IndexTTS-2 TTS Engine

  • New state-of-the-art TTS engine with advanced emotion control system
  • Multiple emotion input methods supporting audio references, text analysis, and manual vectors
  • Dynamic text emotion analysis with QwenEmotion AI and contextual {seg} templates
  • Per-character emotion control using [Character:emotion_ref] syntax for fine-grained control
  • 8-emotion vector system (Happy, Angry, Sad, Surprised, Afraid, Disgusted, Calm, Melancholic)
  • Audio reference emotion support including Character Voices integration
  • Emotion intensity control from neutral to maximum dramatic expression

📖 Documentation

  • Complete IndexTTS-2 Emotion Control Guide with examples and best practices
  • Updated README with IndexTTS-2 features and model download information
  • Timeline updated to v4.9.0 milestone

🚀 Getting Started

  1. Install/Update via ComfyUI Manager or manual installation
  2. Find IndexTTS-2 nodes in the TTS Audio Suite category
  3. Connect emotion control using any supported method (audio, text, vectors)
  4. Read the guide: docs/IndexTTS2_Emotion_Control_Guide.md

🌟 Emotion Control Examples

Welcome to our show! [Alice:happy_sarah] I'm so excited to be here!
[Bob:angry_narrator] That's completely unacceptable behavior.
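The [Character:emotion_ref] switching shown above could be split into per-character segments along these lines. This is a hypothetical sketch (including the "narrator" default), not the suite's actual implementation:

```python
import re

# Illustrative splitter for [Character:emotion_ref] switching syntax;
# a sketch only, not the suite's real parser. "narrator" default is assumed.
SWITCH_RE = re.compile(r"\[(\w+):(\w+)\]\s*")

def split_by_character(text, default=("narrator", None)):
    """Split text into (character, emotion_ref, segment) tuples."""
    segments = []
    pos = 0
    current = default
    for m in SWITCH_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((*current, chunk))
        current = (m.group(1), m.group(2))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((*current, tail))
    return segments
```

Applied to the first example line, this yields a default-narrator segment for "Welcome to our show!" followed by an Alice segment carrying the happy_sarah emotion reference.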

📋 Full Changelog

Added

  • New IndexTTS-2 engine with sophisticated emotion control system
  • Unified emotion control supporting multiple input methods (audio, text, vectors)
  • Dynamic text emotion analysis with QwenEmotion AI and contextual templates
  • Per-character emotion control using [Character:emotion_ref] syntax
  • 8-emotion vector control (Happy, Angry, Sad, Surprised, Afraid, Disgusted, Calm, Melancholic)
  • Audio reference emotion support including Character Voices integration
  • Emotion intensity control from neutral to maximum dramatic expression
  • Advanced caching system for improved performance
  • Complete IndexTTS-2 Emotion Control Guide: docs/IndexTTS2_Emotion_Control_Guide.md

Changed

  • Updated emotion control feature description in README
  • Enhanced API compatibility with modern PyTorch/transformers versions

📖 Full Documentation: IndexTTS-2 Emotion Control Guide
💬 Discord: https://discord.gg/EwKE8KBDqD
☕ Support: https://ko-fi.com/diogogo

TTS Audio Suite v4.8.6 - ChatterBox Official 23-Language Engine

06 Sep 03:19


🌍 ChatterBox Official 23-Language Multilingual Engine

This release introduces the powerful ChatterBox Official 23-Lang engine, bringing professional multilingual TTS capabilities to ComfyUI.

🆕 Major Features

ChatterBox Official 23-Lang Engine

  • 23 Languages Supported: Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Turkish
  • Complete SRT Support: Full subtitle processing in all 23 languages with character switching
  • Unified Architecture: Seamless integration with existing TTS Audio Suite workflows
  • Voice Conversion Support: Full integration with Voice Changer unified node for multilingual voice conversion with refinement passes

🔧 Improvements & Fixes

Stability & Performance

  • Enhanced cache invalidation for real-time parameter changes
  • Improved audio processing reliability for complex multilingual content
  • Better memory management and model loading optimization
  • Fixed sample rate consistency issues

🎯 Usage

The new ChatterBox Official 23-Lang engine is available through the standard TTS Audio Suite nodes:

  1. Configure engine with ChatterBox Official 23Lang Engine node
  2. Use TTS Text or TTS SRT nodes for generation
  3. Use Voice Changer node for multilingual voice conversion
  4. Supports all existing character switching and voice management features

Perfect for content creators, developers, and anyone needing professional multilingual TTS capabilities in ComfyUI.

v4.7.0 - ChatterBox: 11 Models/Languages

04 Sep 05:29


🌍 ChatterBox Language Expansion

New Languages Added (7 new, 11 total models)

  • 🇮🇹 Italian - Bilingual model with automatic [it] prefix for Italian text
  • 🇫🇷 French - 1400+ hours training dataset with voice cloning
  • 🇷🇺 Russian - Complete model with training artifacts
  • 🇦🇲 Armenian - Full model with unique architecture
  • 🇬🇪 Georgian - Full model with specialized features
  • 🇯🇵 Japanese - With proper Japanese tokenizer support
  • 🇰🇷 Korean - With Korean tokenizer support
  • 🇩🇪 German variants:
    • havok2 - Multi-speaker hybrid, best quality
    • SebastianBodza - Emotion control with <haha>, <wow> tags

Critical Fixes

  • Fixed tokenizer discovery for non-Latin languages (Japanese and Korean were using the English tokenizer)
  • Fixed vocabulary size mismatches for extended vocabularies
  • Fixed state dict key format issues for incomplete models
  • Italian prefix system for proper bilingual support

Technical Improvements

  • Unified model architecture support for Italian single-checkpoint model
  • Smart tokenizer discovery prioritizing language-specific files
  • Extended vocabulary support (1500 tokens for Italian)

All models auto-download from HuggingFace on first use.

TTS Audio Suite v4.6.16 - Complete VibeVoice Integration

30 Aug 17:18


🎉 Complete VibeVoice Integration Release

🆕 VibeVoice Engine - Now Fully Integrated!

This release marks the complete integration of Microsoft's VibeVoice engine into TTS Audio Suite, bringing professional-quality multilingual text-to-speech with advanced multi-speaker capabilities.

✨ What's New in VibeVoice

🎭 Dual Multi-Speaker Modes

  • Native Multi-Speaker Mode: Use VibeVoice's built-in 4-speaker system with "Speaker 1:", "Speaker 2:" format
  • Custom Character Switching: Full character voice management with unlimited speakers using your own voice references
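The native "Speaker N:" script format could be split along these lines — an illustrative helper for the format described above, not VibeVoice's actual parser:

```python
import re

# Illustrative splitter for VibeVoice's native "Speaker N:" script format;
# a hypothetical sketch, not the engine's real implementation.
def split_speakers(script):
    """Return (speaker_number, line_text) pairs from a multi-line script."""
    segments = []
    for line in script.strip().splitlines():
        m = re.match(r"Speaker (\d+):\s*(.*)", line)
        if m:
            segments.append((int(m.group(1)), m.group(2)))
    return segments
```

For instance, a two-line script "Speaker 1: Hi!" / "Speaker 2: Hello back." splits into one segment per speaker, ready to route to the corresponding voice slot.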

📝 Complete SRT Subtitle Support

  • Full subtitle timing with all modes: stretch_to_fit, pad_with_silence, smart_natural, concatenate
  • Multi-character subtitle processing with proper timing
  • Seamless integration with existing SRT workflows

🤖 Two Model Options Available

  • vibevoice-1.5B (~5.4GB) - Faster inference, great quality
  • vibevoice-7B (~18GB) - Maximum quality, slower inference
  • Auto-download with HuggingFace integration and legacy path support

🧠 Smart Memory Management

  • Proper integration with ComfyUI's "Clear VRAM" button
  • Automatic model unloading when memory is low
  • Consistent architecture with other TTS engines

💡 VibeVoice Pro Tips

⚠️ Text Length Matters: VibeVoice works best with medium to long texts. Short phrases may not reproduce the reference voice well; aim for at least 2-3 sentences for optimal results.

🎵 Watch for Music Mode: VibeVoice has built-in music/podcast detection. Avoid starting text with greetings like "Hello!" or "Welcome!" as these may trigger a different speaking style than intended.

🎯 Best Practices:

  • Use complete sentences rather than short phrases
  • Provide context in your text for better voice matching
  • Test different text lengths to find the sweet spot for your voice references

🌍 Supported Languages

VibeVoice supports English and Chinese with high-quality synthesis for both languages.

📋 How to Use VibeVoice

  1. Basic TTS: Use "TTS Text" node, select VibeVoice engine
  2. SRT Subtitles: Use "TTS SRT" node with VibeVoice engine
  3. Multi-Speaker: Choose between Native (4 speakers max) or Custom Character modes
  4. Voice References: Add your own voice samples via Character Voices node

🔧 Full Engine Lineup

TTS Audio Suite now includes 5 engines:

  • ChatterBox - Fast, efficient TTS with voice conversion
  • F5-TTS - Zero-shot voice cloning with reference audio
  • Higgs Audio 2 - Professional voice cloning and synthesis
  • VibeVoice - Multilingual TTS with multi-speaker support
  • RVC Integration - Voice conversion post-processing

🐛 Bug Fixes & Improvements

  • Improve VibeVoice memory management and unloading
  • Better integration with ComfyUI's memory management system
  • More reliable model unloading when using 'Clear VRAM' button
  • Consistent architecture across all TTS engines for better maintainability
  • Enhanced stability when switching between different models

Full Changelog: https://github.com/diodiogod/TTS-Audio-Suite/blob/main/CHANGELOG.md

Download: Install via ComfyUI Manager or clone from GitHub
Documentation: Check the folder and example workflows
Support: Report issues on GitHub Issues page

v4.5.5 - 🔧 Automatic Installation

22 Aug 17:46


🔧 Automatic Installation

Installation now works automatically through ComfyUI Manager with zero manual setup required.

What's New

  • 🤖 Automatic dependency resolution - All conflicts are handled automatically by install.py
  • 🐍 Python 3.13 full support - Works on latest Python versions
  • ⚡ Smart installation system - Prevents version conflicts that caused engine failures

For Users

Install or update through ComfyUI Manager and everything works. No manual steps needed.

✅ All 19 nodes load successfully without any manual intervention.