SwarmUI AudioLab Extension

A modular audio processing extension for SwarmUI that adds text-to-speech, speech-to-text, audio generation, voice conversion, and audio processing — all through a provider-based architecture integrated directly into the Generate tab.

Features

Text-to-Speech (TTS) — 16 providers: Chatterbox, Kokoro, Bark, Orpheus, Piper, Dia, F5-TTS, Fish Speech, Pocket TTS, Kyutai TTS, Qwen3, Zonos, CSM, VibeVoice, CosyVoice, and NeuTTS
Speech-to-Text (STT) — 5 providers: Whisper, Kyutai STT, Distil-Whisper, Moonshine, and RealtimeSTT
Audio Generation — ACE-Step 1.5 (6 DiT models, 6 task types, lyrics alignment, 50 languages), MusicGen (text-to-music with melody conditioning), and AudioGen (text-to-sound-effects)
Voice Conversion — RVC (re-voice existing audio), OpenVoice (tone/style transfer), GPT-SoVITS (TTS with cloned voice)
Audio Processing — Demucs (stem separation) and Resemble Enhance (audio enhancement/denoising)
Multi-Track DAW Editor — Full digital audio workstation with timeline, transport controls, per-track mute/solo/volume, clip arrangement via drag-and-drop, mixer panel, loop regions, undo/redo, and multi-track mixdown export (WAV/MP3/OGG/FLAC/AAC)
Video + Audio — Combine audio with video or extract audio from video via ffmpeg
Streaming TTS — Chunked text-to-speech with auto-play for immediate playback while generating
Generation Cancellation — Stop Generation and Stop All Generations buttons work for all audio providers
Pure C# Inference (no Python) — Audio models run in-process via the HartsyInference engine, hosted by the companion HartsyInference (Pure C# Inference) backend extension. No Python, no venvs, no Docker.
On-Demand Weights — Install only the models you need; the Install button downloads the model's .safetensors weights (STT models download themselves on first use)

Requirements

SwarmUI installed and working
The HartsyInference (Pure C# Inference) backend extension installed and added in Server > Backends (this is the in-process engine AudioLab runs models on)
ffmpeg on PATH (for audio decode/encode and video+audio features)
A CUDA or Vulkan GPU is recommended for the larger models; several (e.g. Kokoro, Moonshine) are CPU-capable

No Python or Docker required. Earlier versions ran each engine in a Python virtual environment; the audio engines now run as pure C# in-process. The per-engine capability tables below double as a coverage roadmap: a listed model becomes runnable once the C# engine adds support for it.

Installation

Clone into the SwarmUI extensions directory:

cd /path/to/SwarmUI/src/Extensions/
git clone https://github.com/Hartsy/SwarmUI-AudioLab.git

Restart SwarmUI. The extension loads automatically.
In SwarmUI, go to Server > Backends and add the Audio Backend.
Open the Generate tab, select an audio model, and use the Install button to install the model you want. Its weights are downloaded automatically; STT models fetch themselves on first use. Generation runs in-process on the C# engine.

Supported Engines

Text-to-Speech (16 Providers, 30+ Models)

Engine	Voice Reference	Streaming	VRAM	Notes
Chatterbox	Optional	Yes	~4 GB	Expressive with exaggeration/CFG controls
Kokoro	No	Yes	~1 GB	96x real-time on GPU, CPU-capable, multiple built-in voices
Pocket TTS	Optional	No	CPU (~200 MB)	100M params, 8 built-in voices, voice cloning, MIT license, ~6x real-time on CPU
Kyutai TTS	Optional	No	~8 GB	1.8B params, English+French, voice conditioning, 75x real-time, ~200ms latency
Piper	No	Yes	CPU only	CPU-only ONNX runtime, lightweight, auto-downloads voices
Bark	No	Yes	~5 GB	Multi-language, emotion/music/SFX support
Orpheus	No	Yes	~16 GB	3B params, emotion tags (`<laugh>`, `<sigh>`, etc.)
Dia	No	Yes	~10 GB	1.6B params, 2-speaker dialogue with nonverbal sounds
F5-TTS	Required	Yes	~4 GB	Flow-matching, zero-shot cloning from ~10s reference
Fish Speech	Optional	Yes	4–24 GB	80+ languages, inline prosody tags (`[whisper]`, `[emphasis]`, etc.)
Qwen3 TTS	Optional	Yes	4–8 GB	5 model variants: cloning, custom voices, voice design from descriptions
CSM	No	Yes	~4.5 GB	1B params, multi-turn conversational speech
VibeVoice	Optional	Yes	3–16 GB	3 sizes (0.5B–7B), multi-speaker, up to 90 min long-form
Zonos	Optional	Yes	~4 GB	Emotion control, transformer and hybrid variants (EN/JP/CN/FR/DE)
CosyVoice	Optional	Yes	~8 GB	Ultra-low latency streaming, multilingual
NeuTTS	Required	Yes	~2 GB	0.5B params, instant voice cloning, CPU-capable

Speech-to-Text (5 Providers, 14 Models)

Engine	Models	VRAM	Notes
Whisper	7 sizes (tiny–turbo)	1–10 GB	OpenAI Whisper, transcribe + translate, multi-language
Kyutai STT	1B (en+fr), 2.6B (en)	3–6 GB	Auto capitalization/punctuation, 1B has voice activity detection
Distil-Whisper	large-v3, large-v3.5	~2 GB	6x faster than Whisper large-v3
Moonshine	base, tiny	~1 GB / CPU	Lightweight, CPU-capable
RealtimeSTT	default	~2 GB	Real-time streaming with wake word detection (not yet runnable on the C# engine — use Whisper)

Audio Generation (3 Providers, 17 Models)

Engine	Models	VRAM	Notes
ACE-Step 1.5	6 DiT variants (turbo/sft/base)	8–10 GB	6 task types (text2music, cover, repaint, extract, lego, complete), lyrics alignment, 50 languages, optional LM planner
MusicGen	10 variants (mono/stereo/melody)	4–10 GB	Text-to-music with optional melody conditioning, sampling controls
AudioGen	medium (1.5B)	~4 GB	Text-to-sound-effect generation

Voice Conversion (3 Providers)

These engines transform voice characteristics. RVC and OpenVoice are post-processing tools that take existing audio and change the voice (audio in → audio out). GPT-SoVITS is different — it generates new speech from text in a cloned voice (text in → audio out).

Voice Conversion vs. TTS Voice Reference: Many TTS engines above (F5-TTS, Fish Speech, Zonos, etc.) also support voice cloning via a reference audio clip, but they remain TTS engines that generate speech from text. The engines below are designed specifically for voice transformation or voice-cloned speech synthesis.

Engine	Type	VRAM	Notes
RVC V2	Audio → Audio	~4 GB	Re-voices existing audio using a trained voice model (.safetensors). Supports pitch shift. F0 (pitch) extraction currently uses YIN; RMVPE, Harvest, and PM are planned. The ContentVec encoder downloads automatically.
OpenVoice V2	Audio → Audio	~2 GB	Transfers the tone/style of a reference voice onto existing audio. Zero-shot (no model training, just a wav clip).
GPT-SoVITS	Text → Audio	~4 GB	Generates new speech from text in a cloned voice using a reference clip + its transcript. English today (CJK pending).

Audio Processing (2 Providers, 5 Models)

Engine	Models	VRAM	Notes
Demucs	htdemucs, htdemucs_ft, htdemucs_6s	~2 GB	Source separation (vocals, drums, bass, other; 6-stem variant adds guitar + piano)
Resemble Enhance	denoise, enhance	~2 GB	Speech denoising and super-resolution to 44.1 kHz (engine support pending: requires a DeepSpeed checkpoint loader)

Usage

Add the Audio Backend — Go to Server > Backends and add "Audio Backend".
Install an Engine — In the Generate tab, browse the audio models and click Install for the engine you want. The extension downloads the model's weights and runs it in-process on the C# engine — no virtual environment, no pip, no Python. Progress streams in real time via WebSocket. (STT models and several others fetch their weights on first use.)
Select a Model — Choose an installed audio model from the model selector.
Set Parameters — The sidebar shows relevant parameter groups (TTS, STT, Audio Generation, Voice Conversion, Audio Processing) based on the selected model.
Generate — Enter your prompt and click Generate. Audio output appears in the output area with a waveform player.
Cancel — Click "Stop Generation" to cancel the current generation or "Stop All Generations" to cancel all active sessions. Works for all providers.

Streaming TTS

Set the Stream Chunk Size parameter to control how text is split for streaming:

Mode	Behavior
`word`	Each word generates separately
`phrase`	~5 words per chunk, snaps to nearby punctuation
`sentence`	Splits on `.` `!` `?` boundaries (respects abbreviations)
`paragraph`	Splits on double newlines, falls back to sentences

Each chunk generates and plays back immediately while the next chunk processes. The final output is a concatenated WAV file saved to the output directory.
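
The four modes can be pictured with a small text splitter. This is a minimal sketch and not the extension's code: the abbreviation list, the snap window of two words around the five-word target, and the rule that `paragraph` falls back to sentences only when no blank line is present are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; the extension's actual list is not documented here.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "e.g.", "i.e.", "vs."}
PUNCTUATION = ",;:.!?"


def split_sentences(text: str) -> list[str]:
    """Split after . ! ? followed by whitespace, re-joining known abbreviations."""
    sentences: list[str] = []
    for piece in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not piece:
            continue
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece  # "Dr." was not a real sentence end
        else:
            sentences.append(piece)
    return sentences


def split_phrases(text: str, target: int = 5, slack: int = 2) -> list[str]:
    """Cut every `target` words, moving the cut to punctuation within `slack` words."""
    words = text.split()
    chunks: list[str] = []
    start = 0
    while start < len(words):
        end = min(start + target, len(words))  # default cut position (exclusive)
        lo, hi = max(start + 1, end - slack), min(len(words), end + slack)
        snaps = [i for i in range(lo, hi + 1) if words[i - 1][-1] in PUNCTUATION]
        if snaps:
            end = min(snaps, key=lambda i: abs(i - end))  # closest punctuation wins
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def split_paragraphs(text: str) -> list[str]:
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumed fallback rule: without any blank-line break, chunk by sentence.
    return paragraphs if len(paragraphs) > 1 else split_sentences(text)


def chunk_text(text: str, mode: str) -> list[str]:
    splitters = {"word": str.split, "phrase": split_phrases,
                 "sentence": split_sentences, "paragraph": split_paragraphs}
    if mode not in splitters:
        raise ValueError(f"unknown chunk mode: {mode!r}")
    return splitters[mode](text)
```

Smaller chunks start playback sooner but give the model less context per call, so `sentence` is a reasonable middle ground for most prose.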

TTS Voice Reference (Voice Cloning in TTS)

Many TTS engines accept a reference audio file (WAV) and optional reference text (transcript of the reference audio). Upload a short clip (~5–15 seconds) of the target voice. The model generates new speech from your text prompt that sounds like the reference voice.

Supported by: F5-TTS, Fish Speech, Pocket TTS, Kyutai TTS, Qwen3, Zonos, VibeVoice, Chatterbox, CosyVoice, NeuTTS.

This is different from the Voice Conversion engines (RVC, OpenVoice), which take existing audio and change the voice without generating new speech.
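
Before uploading a reference clip, it can help to confirm that it falls in the suggested ~5–15 second range. The sketch below is a standalone client-side helper (not part of the extension) that reads the duration of a PCM WAV with Python's standard `wave` module; the function names and the hard 5 s / 15 s bounds are illustrative assumptions.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(source, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True when the clip length is inside the suggested reference range."""
    return min_s <= wav_duration_seconds(source) <= max_s


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(is_good_reference(make_silent_wav(8.0)))  # a clip inside the range
    print(is_good_reference(make_silent_wav(2.0)))  # a clip that is too short
```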

Video + Audio

Use the API endpoints or UI to combine generated audio with video files or extract audio tracks from video. Requires ffmpeg on PATH.

Replace mode swaps the video's audio track with your generated audio.
Mix mode blends the original and new audio tracks together.
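
To make the difference between the two modes concrete, here is one plausible way to express each as an ffmpeg command line. This is a sketch under assumptions: the extension's actual ffmpeg arguments are not documented here, and the function name and file names are hypothetical. The ffmpeg options themselves (`-map`, `-c:v copy`, `-shortest`, the `amix` filter) are standard.

```python
def build_combine_command(video: str, audio: str, output: str, mode: str = "replace") -> list[str]:
    """Return an ffmpeg argument list that combines `audio` with `video`."""
    base = ["ffmpeg", "-y", "-i", video, "-i", audio]
    if mode == "replace":
        # Keep the video stream of input 0 and take audio only from input 1.
        return base + ["-map", "0:v:0", "-map", "1:a:0",
                       "-c:v", "copy", "-shortest", output]
    if mode == "mix":
        # Blend the original and new audio with amix; the video is copied untouched.
        return base + ["-filter_complex",
                       "[0:a][1:a]amix=inputs=2:duration=first[aout]",
                       "-map", "0:v:0", "-map", "[aout]",
                       "-c:v", "copy", output]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(" ".join(build_combine_command("clip.mp4", "speech.wav", "out.mp4", "mix")))
```

Copying the video stream (`-c:v copy`) avoids re-encoding, so only the audio is processed in either mode.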

Multi-Track DAW Editor

Click Audio Lab on any audio output to open the DAW editor. The editor opens near-fullscreen with the audio loaded as the first clip on Track 1.

Layout:

Transport Bar — Record, rewind, play/stop, forward, loop toggle, time display, BPM, and zoom slider
Timeline Ruler — Canvas-rendered time ruler with beat grid, playhead indicator, and draggable loop region handles
Track Headers — Per-track controls: editable name, mute (M), solo (S), volume slider, arm (R), and remove (X) button
Clip Lanes — Drag clips horizontally to reposition, or drag across tracks to move between lanes. Right-click clips for context menu (split at playhead, delete, duplicate, rename, mute/unmute)
Bottom Panel — Tabbed panel with Clip Editor (details + actions for selected clip), Mixer (vertical faders, pan, mute/solo per track + master), and Apply to Model (set clip as voice reference for TTS)
Footer — Add Track, Import Audio, Export Mixdown (WAV/MP3/OGG/FLAC/AAC), and Close

Playback: Uses the Web Audio API (AudioBufferSourceNode) for sample-accurate multi-track synchronized playback. WaveSurfer.js provides visual-only waveform rendering per clip. Loop regions wrap playback between start and end markers.

Export: Multi-track mixdown renders via OfflineAudioContext with per-track gain and pan. WAV exports directly from the browser. MP3, OGG, FLAC, and AAC formats route through the backend ffmpeg conversion endpoint.
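
The arithmetic behind a gain-and-pan mixdown can be sketched outside the browser. The example below is an illustration only: it assumes mono clips that all start at time zero, and it uses the equal-power pan law that Web Audio's StereoPannerNode applies to mono input. The extension's actual OfflineAudioContext graph is not shown in this README, and the track dictionary layout is hypothetical.

```python
import math


def pan_gains(pan: float) -> tuple[float, float]:
    """Equal-power left/right gains for a pan value in [-1, 1]."""
    x = (max(-1.0, min(1.0, pan)) + 1.0) / 2.0  # map [-1, 1] to [0, 1]
    return math.cos(x * math.pi / 2), math.sin(x * math.pi / 2)


def mixdown(tracks: list[dict]) -> tuple[list[float], list[float]]:
    """Sum mono tracks into stereo. Each track has 'samples', 'gain', 'pan', optional 'muted'."""
    length = max((len(t["samples"]) for t in tracks), default=0)
    left, right = [0.0] * length, [0.0] * length
    for track in tracks:
        if track.get("muted"):
            continue  # muted tracks contribute nothing to the mix
        gain_l, gain_r = pan_gains(track["pan"])
        for i, sample in enumerate(track["samples"]):
            left[i] += sample * track["gain"] * gain_l
            right[i] += sample * track["gain"] * gain_r
    return left, right


if __name__ == "__main__":
    voice = {"samples": [1.0, 1.0], "gain": 0.5, "pan": -1.0}
    music = {"samples": [0.2, 0.2, 0.2], "gain": 1.0, "pan": 0.0}
    print(mixdown([voice, music]))
```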

Keyboard Shortcuts:

Key	Action
Space	Play / Stop
Ctrl+Z	Undo
Ctrl+Shift+Z	Redo
Delete	Delete selected clip

API Endpoints

All endpoints require authentication and use SwarmUI's permission system.

Audio Processing

Endpoint	Method	Description
`ProcessAudio`	POST	Generic entry point — routes to any provider by `provider_id`
`ProcessTTS`	POST	Text-to-speech with `text`, `voice`, `language`, `volume` params
`ProcessSTT`	POST	Speech-to-text with `audio_data` (base64), `language` params
`ProcessWorkflow`	POST	Chain multiple operations (e.g., STT then TTS) with ordered steps
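
A client can assemble request bodies for `ProcessTTS` and `ProcessSTT` as shown below. The field names `text`, `voice`, `language`, `volume`, and `audio_data` come from the table above. Everything else is an assumption to verify against a running instance: the `/API/<Endpoint>` URL shape and the `session_id` field follow SwarmUI's general API convention, the port is SwarmUI's usual default, and the default values and value types are placeholders.

```python
import base64
import json


def endpoint_url(base_url: str, endpoint: str) -> str:
    """Assumed URL shape: <base>/API/<EndpointName>."""
    return f"{base_url.rstrip('/')}/API/{endpoint}"


def build_tts_request(session_id: str, text: str, voice: str = "default",
                      language: str = "en", volume: float = 1.0) -> dict:
    """Body for ProcessTTS; defaults and the volume scale are placeholders."""
    return {"session_id": session_id, "text": text, "voice": voice,
            "language": language, "volume": volume}


def build_stt_request(session_id: str, wav_bytes: bytes, language: str = "en") -> dict:
    """Body for ProcessSTT; the audio is sent as a base64 string."""
    return {"session_id": session_id,
            "audio_data": base64.b64encode(wav_bytes).decode("ascii"),
            "language": language}


if __name__ == "__main__":
    url = endpoint_url("http://localhost:7801", "ProcessTTS")
    body = build_tts_request("SESSION-ID-HERE", "Hello from AudioLab")
    print(url)
    print(json.dumps(body, indent=2))
```

The builders only construct data, so they can be paired with any HTTP client.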

Engine Management

Endpoint	Method	Description
`AudioLabListEngines`	GET	List all engines with install status, models, and metadata
`AudioLabInstallEngine`	POST (WS)	Install an engine (download weights) with real-time WebSocket progress streaming
`AudioLabUninstallEngine`	POST	Remove engine from registry (optionally delete its weights)
`GetAllProvidersStatus`	GET	List all registered providers with metadata
`GetInstallationStatus`	GET	Per-provider install status
`GetInstallationProgress`	GET	Poll real-time installation/download progress

Audio Format Conversion

Endpoint	Method	Description
`ConvertAudioFormat`	POST	Convert WAV audio to MP3, OGG, FLAC, AAC, or M4A via ffmpeg. Used by DAW export.

Video + Audio

Endpoint	Method	Description
`CombineVideoAudio`	POST	Merge audio track into video (replace or mix mode), 200 MB video / 50 MB audio limit
`ExtractAudioFromVideo`	POST	Extract audio track as 16-bit PCM WAV at 44.1 kHz, 200 MB video limit
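
Because both endpoints reject oversized uploads, a client may want to check sizes before sending. The following sketch encodes the limits from the table; treating "MB" as 1024 × 1024 bytes is an assumption, and the helper name is hypothetical.

```python
MB = 1024 * 1024  # assumption: the documented limits are binary megabytes

# Limits taken from the endpoint table above.
LIMITS = {
    "CombineVideoAudio": {"video": 200 * MB, "audio": 50 * MB},
    "ExtractAudioFromVideo": {"video": 200 * MB},
}


def check_upload(endpoint: str, sizes: dict[str, int]) -> list[str]:
    """Return one message per file that exceeds its limit (empty list means OK)."""
    problems = []
    for kind, limit in LIMITS[endpoint].items():
        size = sizes.get(kind, 0)
        if size > limit:
            problems.append(f"{kind} is {size} bytes, limit is {limit} bytes")
    return problems


if __name__ == "__main__":
    print(check_upload("CombineVideoAudio", {"video": 250 * MB, "audio": 10 * MB}))
```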

Permissions

Permission	Level	Covers
`audio_process`	Power Users	ProcessAudio, ProcessTTS, ProcessSTT, ProcessWorkflow, CombineVideoAudio, ExtractAudioFromVideo, ConvertAudioFormat
`audio_manage_backends`	Power Users	AudioLabInstallEngine, AudioLabUninstallEngine
`audio_check_status`	Power Users	GetAllProvidersStatus, GetInstallationStatus, GetInstallationProgress, AudioLabListEngines

Architecture

SwarmUI-AudioLab/
├── AudioLab.cs                          # Extension entry point
├── AudioLabParams.cs                    # T2I parameter registration + BuildEngineArgs param→engine mapping
├── AudioAPI/
│   ├── AudioLabAPI.cs                   # API endpoints (process, install, status)
│   ├── AudioParameters.cs               # Shared parameter helpers
│   ├── AudioProgressTracking.cs         # Install/generation progress
│   └── VideoAudioEndpoints.cs           # Video+audio combining/extraction via ffmpeg
├── AudioBackends/
│   └── DynamicAudioBackend.cs           # Unified routing backend (model routing, streaming, cancellation, install)
├── AudioProviders/
│   ├── AudioProviderDefinitions.cs      # Provider registry (auto-discovers all IAudioProviderSource)
│   ├── KokoroProvider.cs                # One file per provider — engine-backed (Kokoro, Chatterbox, ACE-Step, …)
│   ├── ElevenLabsTTSProvider.cs         #   and cloud-API providers (ElevenLabs, Azure, OpenAI, …)
│   └── ...                              # ~56 provider files total
├── AudioProviderTypes/
│   ├── AudioCategory.cs                 # TTS, STT, AudioGeneration, VoiceConversion, AudioProcessing
│   ├── AudioProviderDefinition.cs       # Provider definition schema
│   ├── AudioModelDefinition.cs          # Per-model metadata (id, license, source URL, size, VRAM)
│   ├── AudioProviderDefinitionBuilder.cs # Fluent builder for provider definitions
│   └── IAudioProviderSource.cs          # Provider interface
├── AudioServices/                       # The in-process C# inference layer (HartsyInference engine)
│   ├── AudioEngine.cs                   # Dispatch table: provider id → handler; owns the compute device
│   ├── AudioWeights.cs / AudioWeightsRegistry.cs # Download URLs + on-disk weight resolution
│   ├── AudioServerManager.cs            # Routes a request to the engine (or a cloud API handler)
│   ├── AudioUnsupportedReasons.cs       # Precise "not runnable yet" messages
│   ├── Tts/  Stt/  Music/  Vc/  Fx/     # Per-category handlers + model descriptors
│   │   └── Models/                      #   one descriptor per model (repo + how to load + how to synth)
│   └── ApiHandlers/                     # Cloud-API providers (Azure, Google, Suno, Udio, …)
├── Assets/
│   ├── audio-core.js                    # Frontend UI (engine browser, param groups)
│   ├── audio-api.js                     # API client (backend communication)
│   ├── audio-player.js                  # Waveform player (WaveSurfer.js)
│   ├── audio-daw*.js                    # DAW editor (orchestrator, tracks, timeline, mixer)
│   ├── audio-integration.js             # SwarmUI integration hooks
│   ├── audio-lab.css                    # Styling (theme-aware)
│   └── lib/                             # WaveSurfer, Crunker, Timeline, Minimap plugins
└── README.md

The extension follows a two-layer architecture (no Python, no separate server process):

C# layer registers providers with a fluent builder API, manages the routing backend lifecycle, routes generation requests by model prefix, maps UI parameters to engine arguments (BuildEngineArgs), and runs inference in-process via the HartsyInference engine. AudioEngine holds a dispatch table from provider id to a per-category handler (TTS/STT/Music/VC/FX); each handler drives a per-model descriptor that knows the model's HuggingFace repo and how to load + run it. Cloud-API providers route through ApiHandlers/ instead. Weights download and cache to the Audio Model Root; pipelines stay resident in GPU/CPU memory between requests.
Frontend adds audio parameter groups to the Generate tab sidebar, provides a waveform-based audio player via WaveSurfer.js, and integrates with SwarmUI's generation lifecycle (model selection, parameter visibility, streaming playback, cancellation).
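
The dispatch-table idea described for AudioEngine can be illustrated in a few lines. This is a conceptual sketch only: the real implementation is C#, and every name below (`Engine`, `ModelDescriptor`, the `example/kokoro` repo string, the handler signature) is hypothetical. Python is used purely to keep the illustration short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelDescriptor:
    """Stand-in for a per-model descriptor: which repo to load and how to run it."""
    model_id: str
    repo: str  # placeholder repository id, not a real download location


Handler = Callable[[ModelDescriptor, dict], str]


class Engine:
    """Maps a provider id to the handler and descriptor that serve it."""

    def __init__(self) -> None:
        self._table: dict[str, tuple[Handler, ModelDescriptor]] = {}

    def register(self, provider_id: str, handler: Handler, model: ModelDescriptor) -> None:
        self._table[provider_id] = (handler, model)

    def run(self, provider_id: str, request: dict) -> str:
        entry = self._table.get(provider_id)
        if entry is None:
            raise KeyError(f"no handler registered for provider {provider_id!r}")
        handler, model = entry
        return handler(model, request)


def tts_handler(model: ModelDescriptor, request: dict) -> str:
    # A real handler would load weights and synthesize audio here.
    return f"[{model.model_id}] would speak: {request['text']}"


if __name__ == "__main__":
    engine = Engine()
    engine.register("kokoro", tts_handler, ModelDescriptor("kokoro", "example/kokoro"))
    print(engine.run("kokoro", {"text": "hello"}))
```

Keeping the table keyed by provider id means a new provider only needs one `register` call plus its descriptor, which matches the one-file-per-provider layout in the tree above.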

Cancellation

When the user clicks Stop Generation, SwarmUI fires the session's InterruptToken. The C# layer observes it and cancels in two ways:

Infrastructure — the in-process generation call is passed the cancellation token; the engine aborts at the next checkpoint and the result is discarded.
Cooperative — pipelines with iterative loops check the token periodically for fast mid-inference cancellation.

Both "Stop Generation" (current session) and "Stop All Generations" (all sessions) work through the same token mechanism. For single-shot calls (e.g. Bark), the computation may finish before the cancel arrives — the result is still discarded.
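
The two behaviors described above, polling inside an iterative loop and discarding the result of a call that could not be interrupted, can be sketched as follows. This is an analogy rather than the extension's code: a `threading.Event` stands in for SwarmUI's InterruptToken, and all function names are hypothetical.

```python
import threading


class GenerationCancelled(Exception):
    """Raised by a cooperative loop when it sees the cancel signal."""


def iterative_generate(steps: int, cancel: threading.Event, on_step=None) -> list[int]:
    """Stand-in for an iterative pipeline that polls the token once per step."""
    produced = []
    for step in range(steps):
        if cancel.is_set():
            raise GenerationCancelled(f"stopped before step {step}")
        produced.append(step)
        if on_step:
            on_step(step)
    return produced


def run_with_cancellation(generate, cancel: threading.Event):
    """Run `generate`; return its result, or None if the session was cancelled."""
    try:
        result = generate()
    except GenerationCancelled:
        return None
    # Single-shot calls cannot stop midway, so a late cancel just drops the result.
    return None if cancel.is_set() else result


if __name__ == "__main__":
    token = threading.Event()
    print(run_with_cancellation(lambda: iterative_generate(3, token), token))
    token.set()
    print(run_with_cancellation(lambda: iterative_generate(3, token), token))
```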

Backend Settings

Setting	Default	Description
Audio Model Root	`Models/audio`	Storage path for downloaded audio model weights
Auto Redownload Missing Weights	`true`	If a model's weights are missing at generation time (e.g. deleted to free space), re-download them automatically; when off, generation refuses with a clear message
Debug Mode	`false`	Enable verbose audio engine logging
Use Docker	`false`	Legacy flag from the old Python backend; the C# engine runs in-process and does not use it
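
The interaction between Audio Model Root and Auto Redownload Missing Weights amounts to a small decision at generation time. The sketch below shows that decision under assumptions: the function, the exception type, the injected `download` callable, and the example relative path are hypothetical, and the extension's real lookup is implemented in C#.

```python
from pathlib import Path
from typing import Callable


class MissingWeightsError(RuntimeError):
    """Raised when weights are absent and automatic re-download is disabled."""


def resolve_weights(model_root: Path, relative_path: str, auto_redownload: bool,
                    download: Callable[[Path], None]) -> Path:
    """Return the weight file path, re-downloading it first if allowed and needed."""
    path = model_root / relative_path
    if path.exists():
        return path
    if not auto_redownload:
        raise MissingWeightsError(
            f"weights not found at {path}; enable Auto Redownload Missing Weights "
            "or reinstall the model")
    download(path)  # caller supplies the actual download logic
    return path


if __name__ == "__main__":
    root = Path("Models/audio")  # the documented default Audio Model Root
    try:
        resolve_weights(root, "example/model.safetensors", False, lambda p: None)
    except MissingWeightsError as error:
        print(error)
```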

Troubleshooting

Engine install / weight download fails:

Check that you have a stable internet connection for downloading model weights
Check the SwarmUI server logs for detailed error output
Some models aren't runnable on the C# engine yet — the install/generate error names the exact missing piece

Gated model access denied:

Some models (e.g., certain Fish Speech or Qwen3 variants) require accepting a license agreement on HuggingFace
Go to the model's HuggingFace page, accept the agreement, then set your HuggingFace token in SwarmUI: Server > User Settings > API Keys
Get a token at https://huggingface.co/settings/tokens (needs "Read" permission)

A model says "not runnable in the C# engine yet":

That model's pipeline exists but is missing a specific prerequisite (a tokenizer asset, phonemizer, or confirmed weight layout). The message names it. Pick another model in the same category in the meantime.

No audio output:

Verify the engine is installed (check the model browser for audio models)
Check that the Audio Backend is running (Server > Backends)
Look at the SwarmUI server logs for [AudioLab] errors

Video+audio features not working:

Install ffmpeg and ensure it is on your system PATH
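
A quick way to confirm that ffmpeg is discoverable is a short standalone script such as the one below. It is a client-side diagnostic and not part of the extension, which needs no Python. Run it in the same environment (user account and shell) that launches SwarmUI, since PATH can differ between them.

```python
import shutil
from typing import Optional


def find_tool(executable: str = "ffmpeg") -> Optional[str]:
    """Return the full path of `executable` if it is on PATH, otherwise None."""
    return shutil.which(executable)


def describe(executable: str = "ffmpeg") -> str:
    path = find_tool(executable)
    return f"{executable} found at {path}" if path else f"{executable} is not on PATH"


if __name__ == "__main__":
    print(describe())
```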

Stop Generation not working:

Ensure the Audio Backend is running and healthy (check Server > Backends)
For single-call engines (e.g., Bark), the GPU computation may finish before the cancel signal arrives — the result is still discarded

License

MIT License - see LICENSE for details.

Acknowledgments

SwarmUI — Base platform
WaveSurfer.js — Audio waveform visualization (player + DAW clip rendering)
Crunker — Audio concatenation
FFMpegCore — FFmpeg wrapper for audio format conversion
