Introduction
"A free local alternative to ElevenLabs and WisprFlow — in one app."
This is article #103 in the Open Source Project of the Day series. Today's project is Voicebox — a local-first open-source AI voice studio built by independent developer Jamie Pine.
ElevenLabs produces excellent voice cloning. But subscriptions start at $22/month, character limits apply, and your voice data goes to their servers. WisprFlow handles voice dictation, with the same cloud dependency. Both are genuinely useful, and both have the same structural limitation.
Voicebox integrates both functions into one locally-running desktop application: 7 TTS engines, zero-shot voice cloning, a global dictation hotkey, a multi-track Stories editor, and MCP integration so AI agents can speak in voices you own. Every model runs on-device. No audio data leaves the machine.
32.3k Stars, MIT license, built with Tauri + Rust.
What You'll Learn
- The 7 built-in TTS engines: characteristics and best use cases (from the 82M Kokoro to Qwen3-TTS high-quality cloning)
- Zero-shot voice cloning: how a few seconds of audio becomes a reusable voice model
- Voice Personality: how to make AI speak in a specific persona
- Stories Editor: multi-track timeline for podcast dialogue production
- MCP integration: giving Claude Code and other agents a speaking voice
- MLX acceleration on Apple Silicon
Prerequisites
- Familiarity with TTS (text-to-speech) and voice cloning concepts
- Experience with ElevenLabs or similar voice AI tools helps for comparison
- Claude Code or MCP experience (for the agent voice integration section)
Project Background
What Is Voicebox?
Voicebox is a local-first AI voice studio — "a free and open-source alternative to ElevenLabs and WisprFlow in one app."
It integrates two function categories: voice output (TTS + voice cloning) and voice input (dictation + STT). These are usually separate tools; Voicebox puts them in one desktop application sharing the same local model infrastructure.
The technical core is the integration of Qwen3-TTS, Alibaba's open-source TTS model released in 2025, which for the first time brought zero-shot voice cloning quality close to or on par with ElevenLabs using local inference alone.
Author
- Author: Jamie Pine (jamiepine)
- Website: voicebox.sh
- License: MIT
- Version: v0.5.0, 25 releases, 588 commits
Project Stats
- ⭐ GitHub Stars: 32,300+
- 🍴 Forks: 3,900+
- 📄 License: MIT
- 💻 Stack: TypeScript 55% / Python 34% / Rust 9%
Core Features
7 TTS Engines
Seven engines are bundled, each suited to a different scenario:
| Engine | Languages | Key Strength |
|---|---|---|
| Qwen3-TTS | 10 | High-quality multilingual voice cloning (Alibaba open-source) |
| Qwen CustomVoice | 10 | 9 preset voices, natural-language delivery control |
| LuxTTS | English | Lightweight, 48kHz, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage (Arabic, Swahili, and more) |
| Chatterbox Turbo | English | Paralinguistic emotion tags ([laugh], [sigh]) |
| TADA (HumeAI) | 10 | 700+ seconds of coherent long-form audio generation |
| Kokoro | 8 | Tiny 82M model, 50 preset voices |
The differentiation is concrete: LuxTTS runs fast on laptops without GPUs; Chatterbox Turbo expresses laughter and sighs; TADA handles audiobooks without segment stitching; Kokoro has the smallest footprint.
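As a rough illustration, the table can be treated as a lookup keyed by what you need from an engine. This is a minimal sketch, not Voicebox code: the `Engine` class, the `strength` tags, and `pick_engine` are hypothetical names, and the rows are transcribed from the table above (English-only engines are counted as one language).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    languages: int   # number of supported languages, from the table above
    strength: str    # short illustrative tag (hypothetical wording)


# Rows transcribed from the engine table; the tags are labels made up for this sketch.
ENGINES = (
    Engine("Qwen3-TTS", 10, "cloning"),
    Engine("Qwen CustomVoice", 10, "preset-voices"),
    Engine("LuxTTS", 1, "cpu-speed"),
    Engine("Chatterbox Multilingual", 23, "language-coverage"),
    Engine("Chatterbox Turbo", 1, "emotion-tags"),
    Engine("TADA", 10, "long-form"),
    Engine("Kokoro", 8, "small-footprint"),
)


def pick_engine(need: str) -> Engine:
    """Return the first engine whose tag matches the stated need."""
    for engine in ENGINES:
        if engine.strength == need:
            return engine
    raise ValueError(f"no engine tagged {need!r}")
```

Calling `pick_engine("long-form")` returns the TADA row; an unknown tag raises `ValueError` instead of silently falling back to a default.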
Voice Cloning
Zero-shot cloning: upload a few seconds of reference audio, and Voicebox extracts the voice signature and creates a reusable Voice Profile.
Workflow:
1. Record or upload reference audio (3-10 seconds produces good results)
2. Create a Voice Profile (name it, add a description)
3. Select this profile for any TTS task
4. Combine multiple samples to improve clone accuracy
Reference audio can be recorded directly inside the app — no external recording file needed.
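If you prepare reference clips outside the app, it can help to check their length against the 3-10 second guideline before uploading. The sketch below uses only Python's standard `wave` module and assumes uncompressed WAV input; the function names are hypothetical and are not part of Voicebox.

```python
import wave


def reference_duration_seconds(source) -> float:
    """Duration of a PCM WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(duration: float, low: float = 3.0, high: float = 10.0) -> bool:
    """True when the clip falls inside the 3-10 second range mentioned above."""
    return low <= duration <= high
```

For example, `is_good_reference(reference_duration_seconds("sample.wav"))` tells you whether a clip is worth uploading, where `sample.wav` is a placeholder path.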
Voice Personality
This feature is rare in TTS tools: bind a persona description to a Voice Profile, and a local LLM rewrites the input text in that persona's voice before it reaches the TTS engine.
Example:
Voice Profile: "Alex - Tech podcast host"
Personality: "Enthusiastic about technical topics, occasional jargon,
likes to explain complex concepts with analogies"
Input text: "Quantum computers use qubits to process information."
LLM rewrite: "Quantum computers are fascinating — instead of ordinary 0s and 1s,
they use qubits that can be both 0 and 1 at the same time.
Think of it like Schrödinger's cat, but for computing."
Then the rewritten text goes to the TTS engine.
A local Qwen3 model (0.6B / 1.7B / 4B options) handles the rewrite. The whole pipeline runs offline.
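The rewrite-then-speak order can be sketched as a two-stage pipeline. This is an illustration under assumptions, not Voicebox's implementation: the prompt wording is invented, and `rewrite` and `synthesize` are hypothetical callables standing in for the local Qwen3 model and the chosen TTS engine.

```python
from typing import Callable


def build_rewrite_prompt(personality: str, text: str) -> str:
    """Compose an instruction for the rewriting LLM (prompt wording is hypothetical)."""
    return (
        "Rewrite the text below in the voice of this persona.\n"
        f"Persona: {personality}\n"
        f"Text: {text}\n"
        "Keep the meaning; change only the delivery."
    )


def speak_with_personality(
    text: str,
    personality: str,
    rewrite: Callable[[str], str],       # stand-in for the local LLM
    synthesize: Callable[[str], bytes],  # stand-in for the TTS engine
) -> bytes:
    """Rewrite the text in persona first, then hand the result to TTS."""
    rewritten = rewrite(build_rewrite_prompt(personality, text))
    return synthesize(rewritten)
```

Passing the two stages in as callables keeps the sketch independent of any particular model, which mirrors how the feature lets you swap between the 0.6B, 1.7B, and 4B options.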
Stories Editor
A multi-track timeline editor for producing conversational content — podcasts, roleplay, audiobook dialogue:
- Each track corresponds to a Voice Profile (different speaker)
- Audio segments arranged on a timeline
- Reorder and trim segments directly in the editor
- Export as a complete multi-speaker audio file
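One way to picture the timeline is as a list of segments, each tied to a Voice Profile and a start time. The data model below is a guess for illustration only; the `Segment` fields and helper names are hypothetical and do not come from the Voicebox codebase.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    profile: str     # Voice Profile name; one track per profile
    start: float     # position on the timeline, in seconds
    duration: float  # length after trimming, in seconds
    text: str


def export_order(segments: list[Segment]) -> list[Segment]:
    """Segments in playback order, earliest start first."""
    return sorted(segments, key=lambda s: s.start)


def total_length(segments: list[Segment]) -> float:
    """End time of the last-finishing segment, i.e. the exported file's length."""
    return max((s.start + s.duration for s in segments), default=0.0)
```

Reordering a segment in the editor then amounts to changing its `start`, and trimming amounts to changing its `duration`.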
Global Dictation Hotkey
The STT component uses Whisper (Base through Large, plus Turbo variant):
- Global hotkey that works in any application
- Push-to-Talk and Toggle modes
- Recognized text is typed directly into the active text field
- All recordings stored in the Captures tab with full transcripts
- Edit transcripts inline; regenerate audio from the corrected text
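The difference between the two hotkey modes is easiest to see as a small state machine. This is a sketch of the general technique rather than Voicebox's actual hotkey handling; the class and method names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"
    TOGGLE = "toggle"


class DictationState:
    """Tracks whether the microphone should be recording."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def on_key_down(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True                 # record while the key is held
        else:
            self.recording = not self.recording   # each press flips the state

    def on_key_up(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False                # releasing the key stops recording
        # In toggle mode a key release changes nothing.
```

In push-to-talk mode the recording lasts exactly as long as the key is held; in toggle mode it takes a second press to stop.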
Post-Processing Effects
Powered by Spotify's Pedalboard library; effects applied after generation:
- Pitch Shift, Reverb, Delay
- Chorus, Compression
- High-pass/Low-pass filters
- Effect presets for reuse
MCP Agent Voice Integration
Four MCP tools let Claude Code and other AI agents speak out loud:
# Add MCP server to Claude Code
claude mcp add voicebox \
--transport http \
--url http://127.0.0.1:17493/mcp \
--header "X-Voicebox-Client-Id: claude-code"
| MCP Tool | Function |
|---|---|
| voicebox.speak | Read text aloud using a specified Voice Profile |
| voicebox.transcribe | Transcribe an audio file |
| voicebox.list_captures | List saved recordings |
| voicebox.list_profiles | List available Voice Profiles |
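Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that message. The argument keys (`text`, `profile`) are assumptions, since the tools' input schemas are not shown here, and a real client also performs the MCP initialization handshake before calling tools.

```python
import json


def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Hypothetical argument names; check the tool's schema for the real ones.
message = tool_call(1, "voicebox.speak", {"text": "Deploy complete.", "profile": "Alex"})
```

In practice Claude Code builds these messages for you once the server is registered, so this is mainly useful for understanding or debugging the traffic.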
REST API works the same way:
curl -X POST http://127.0.0.1:17493/generate \
-H "Content-Type: application/json" \
-d '{"text": "Deploy complete.", "profile_id": "abc123", "language": "en"}'
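The same request can be made from Python with only the standard library. This sketch mirrors the curl call above, using the same URL and JSON fields; the function names are hypothetical. The response format is not stated here, so `generate` returns the raw response bytes without interpreting them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # host and port taken from the curl example


def build_generate_request(text: str, profile_id: str, language: str = "en") -> urllib.request.Request:
    """Build the POST /generate request without sending it."""
    body = json.dumps({"text": text, "profile_id": profile_id, "language": language})
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text: str, profile_id: str, language: str = "en") -> bytes:
    """Send the request; requires Voicebox to be running locally."""
    with urllib.request.urlopen(build_generate_request(text, profile_id, language)) as resp:
        return resp.read()
```

Separating request construction from sending makes the payload easy to inspect or log before anything reaches the backend.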
Deep Dive
Technical Architecture
Frontend (React + TypeScript + Tailwind)
↓
Tauri bridge (Rust)
↓
FastAPI backend (Python)
├── TTS engines (Qwen3-TTS / LuxTTS / Kokoro / ...)
├── Whisper STT
├── Qwen3 LLM (Personality rewrites)
├── Pedalboard post-processing
└── SQLite data storage
Tauri + Rust keeps the desktop application lightweight: the UI renders in the operating system's webview instead of a bundled Chromium runtime. The Python FastAPI backend handles the heavy model inference. This split, with web technology for the UI, Python for inference, and Tauri bridging them, is becoming standard for AI desktop applications.
Inference Backend by Hardware
| Hardware | Backend | Notes |
|---|---|---|
| macOS Apple Silicon | MLX / Metal | GPU acceleration via Metal, 4-5× faster than CPU PyTorch |
| Windows / Linux NVIDIA | PyTorch CUDA | Standard GPU inference |
| Linux AMD | PyTorch ROCm | AMD GPU support |
| Windows (any GPU) | DirectML | Microsoft universal GPU interface |
| Intel Arc | IPEX/XPU | Intel GPU support |
| No GPU | CPU fallback | LuxTTS runs at 150× realtime on CPU |
The MLX acceleration on Apple Silicon is a real advantage: MLX runs inference on the M-series GPU through Metal and uses unified memory, so model weights are not copied between CPU and GPU. TTS inference runs 4-5× faster than CPU PyTorch, which matters for real-time dictation and fast generation loops.
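The table can be read as a decision procedure. The function below is a hypothetical sketch of such a selection, not Voicebox's detection logic: the `gpu_vendor` labels and the priority order between overlapping rows (for example, Intel Arc on Windows) are assumptions made for illustration.

```python
from typing import Optional


def pick_backend(system: str, machine: str, gpu_vendor: Optional[str]) -> str:
    """Map a host description to a backend label from the table above."""
    if system == "Darwin" and machine == "arm64":
        return "MLX / Metal"
    if gpu_vendor == "nvidia" and system in ("Windows", "Linux"):
        return "PyTorch CUDA"
    if gpu_vendor == "amd" and system == "Linux":
        return "PyTorch ROCm"
    if gpu_vendor == "intel-arc":
        return "IPEX/XPU"
    if system == "Windows" and gpu_vendor is not None:
        return "DirectML"   # generic GPU path on Windows
    return "CPU fallback"
```

The first two arguments match what `platform.system()` and `platform.machine()` return in the standard library; detecting the GPU vendor is left out because it is platform-specific.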
The Captures System
All voice interactions are archived:
Captures tab
├── Audio recordings (dictation + deliberate recordings)
├── Auto-generated transcripts (Whisper)
├── Inline editing (modify the transcript directly)
├── Regenerate audio (with a different Voice Profile)
└── Export audio
This design moves Voicebox from "one-shot generation tool" to "voice content workbench." Old recordings can be regenerated with new voice profiles; transcripts can be edited and re-voiced; the whole thing acts as a lightweight voice content management system.
Unlimited Length Generation
TTS models typically cap single-generation length at a few hundred words. Voicebox handles long content:
- Split long text into sentence/paragraph chunks
- Generate each chunk separately with the TTS engine
- Stitch chunks together with crossfade smoothing
- Output a complete long audio file
The TADA engine natively supports 700+ seconds of coherent generation with no chunking required, which makes it particularly useful for audiobooks, where consistent tone and pacing across a long piece matter.
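For the engines that do need chunking, the split-and-stitch steps can be sketched as follows. This is a simplified illustration, not Voicebox's code: the 200-character limit, the function names, and the linear fade are assumptions, and audio is represented as plain lists of float samples where real code would likely use arrays.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as one oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, linearly blending the tail of a into the head of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[: len(a) - overlap] + mixed + b[overlap:]
```

Splitting on sentence boundaries keeps each chunk prosodically complete, and the overlap hides the small level or timbre jumps that can occur where two independently generated chunks meet.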
Quick Start
Install prerequisites (macOS):
# Bun (frontend package manager)
curl -fsSL https://bun.sh/install | bash
# Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Python 3.11+ (pyenv recommended)
pyenv install 3.11.9 && pyenv global 3.11.9
# Tauri prerequisites (macOS)
xcode-select --install
Clone and run:
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup # create Python venv, install all dependencies (downloads models on first run)
just dev # start backend + desktop app
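Before running the commands above, a quick check that the required tools are on your PATH can save a failed setup. The script below is a convenience sketch and is not part of the repository. The tool list is inferred from the commands in this section, including `just`, which the `just setup` and `just dev` commands require, and the `python3` executable name is an assumption.

```python
import shutil
from typing import Sequence

# Inferred from the commands in this section; adjust names for your system.
REQUIRED_TOOLS = ("git", "bun", "cargo", "python3", "just")


def missing_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> list[str]:
    """Return the tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```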
Links and Resources
Official Resources
- 🌟 GitHub: jamiepine/voicebox
- 🌐 Website: voicebox.sh
Related Projects
- Qwen3-TTS: Alibaba open-source TTS model
- Kokoro TTS: hexgrad/kokoro
- Chatterbox: Resemble AI open-source TTS
- Spotify Pedalboard: Python audio effects library
Conclusion
Voicebox consolidates several previously separate local voice AI tools into one application: TTS engines, voice cloning, dictation, audio post-processing, multi-track editing, and an agent voice interface.
The 32.3k stars, accumulated quickly relative to the project's age, reflect genuine demand: ElevenLabs's subscription model has kept people looking for alternatives, and the release of Qwen3-TTS marked the first time local cloning quality practically matched cloud-based solutions.
The architecture choices — Tauri, MLX, multi-engine design — fit the use case. UI stays lightweight, inference runs fast, hardware backends are selectable. Voice Personality and the Captures system show the author thinking about workflows, not just individual features.
For developers who generate large volumes of voiceover content, have privacy concerns about uploading audio, or want to give AI agents a speaking voice in their own stack, Voicebox is worth installing.