WonderLab

Posted on Jun 23

Open Source Project of the Day (#103): Voicebox — A Local, Open-Source Alternative to ElevenLabs

#tts #elevenlabs #voiceclone #opensource

Introduction

"A free local alternative to ElevenLabs and WisprFlow — in one app."

This is article #103 in the Open Source Project of the Day series. Today's project is Voicebox — a local-first open-source AI voice studio built by independent developer Jamie Pine.

ElevenLabs produces excellent voice cloning. But subscriptions start at $22/month, character limits apply, and your voice data goes to their servers. WisprFlow handles voice dictation, with the same cloud dependency. Both are genuinely useful, and both have the same structural limitation.

Voicebox integrates both functions into one locally-running desktop application: 7 TTS engines, zero-shot voice cloning, a global dictation hotkey, a multi-track Stories editor, and MCP integration so AI agents can speak in voices you own. Every model runs on-device. No audio data leaves the machine.

32.3k Stars, MIT license, built with Tauri + Rust.

What You'll Learn

The 7 built-in TTS engines: characteristics and best use cases (from the 82M Kokoro to Qwen3-TTS high-quality cloning)
Zero-shot voice cloning: how a few seconds of audio becomes a reusable Voice Profile
Voice Personality: how to make AI speak in a specific persona
Stories Editor: multi-track timeline for podcast dialogue production
MCP integration: giving Claude Code and other agents a speaking voice
MLX acceleration on Apple Silicon

Prerequisites

Familiarity with TTS (text-to-speech) and voice cloning concepts
Experience with ElevenLabs or similar voice AI tools helps for comparison
Claude Code or MCP experience (for the agent voice integration section)

Project Background

What Is Voicebox?

Voicebox is a local-first AI voice studio — "a free and open-source alternative to ElevenLabs and WisprFlow in one app."

It integrates two function categories: voice output (TTS + voice cloning) and voice input (dictation + STT). These are usually separate tools; Voicebox puts them in one desktop application sharing the same local model infrastructure.

The technical core is Qwen3-TTS, Alibaba's open-source TTS model released in 2025. It was the first locally runnable model to bring zero-shot voice cloning quality close to, or on par with, ElevenLabs.

Author

Author: Jamie Pine (jamiepine)
Website: voicebox.sh
License: MIT
Version: v0.5.0, 25 releases, 588 commits

Project Stats

⭐ GitHub Stars: 32,300+
🍴 Forks: 3,900+
📄 License: MIT
💻 Stack: TypeScript 55% / Python 34% / Rust 9%

Core Features

7 TTS Engines

Seven engines bundled, each suited to different scenarios:

Engine	Languages	Key Strength
Qwen3-TTS	10	High-quality multilingual voice cloning (Alibaba open-source)
Qwen CustomVoice	10	9 preset voices, natural-language delivery control
LuxTTS	English	Lightweight, 48kHz, 150x realtime on CPU
Chatterbox Multilingual	23	Broadest language coverage (Arabic, Swahili, and more)
Chatterbox Turbo	English	Paralinguistic emotion tags (`[laugh]`, `[sigh]`)
TADA (HumeAI)	10	700+ seconds of coherent long-form audio generation
Kokoro	8	Tiny 82M model, 50 preset voices

The differentiation is concrete: LuxTTS runs fast on laptops without GPUs; Chatterbox Turbo can express laughter and sighs; TADA generates long passages (700+ seconds) without segment stitching; Kokoro keeps the footprint small at 82M parameters.

Voice Cloning

Zero-shot cloning: upload a few seconds of reference audio, and Voicebox extracts the voice signature and creates a reusable Voice Profile. No training or fine-tuning step is involved.

Workflow:
1. Record or upload reference audio (3-10 seconds produces good results)
2. Create a Voice Profile (name it, add a description)
3. Select this profile for any TTS task
4. Combine multiple samples to improve clone accuracy

Reference audio can be recorded directly inside the app — no external recording file needed.
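
To make the workflow above concrete, here is a minimal sketch of how a Voice Profile with several reference samples could be modeled. The class and field names are hypothetical and are not taken from the Voicebox codebase; the 3-10 second range is the guideline from the workflow above.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceSample:
    path: str        # where the recorded or uploaded clip lives
    seconds: float   # clip duration


@dataclass
class VoiceProfile:
    # Hypothetical model: a named profile that owns one or more samples.
    name: str
    description: str = ""
    samples: list[ReferenceSample] = field(default_factory=list)

    def add_sample(self, path: str, seconds: float) -> None:
        if seconds <= 0:
            raise ValueError("sample duration must be positive")
        self.samples.append(ReferenceSample(path, seconds))

    def samples_in_recommended_range(self) -> list[ReferenceSample]:
        # The article suggests 3-10 seconds per clip gives good results.
        return [s for s in self.samples if 3.0 <= s.seconds <= 10.0]


profile = VoiceProfile("Alex", "Tech podcast host")
profile.add_sample("alex-intro.wav", 6.5)
profile.add_sample("alex-short.wav", 1.2)
```

Here only the first clip falls inside the recommended range; the second would still be stored but is probably too short to help.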

Voice Personality

This feature is rare in TTS tools: bind a persona description to a Voice Profile, then have a local LLM rewrite the input text in that persona's voice before it reaches the TTS engine.

Example:
Voice Profile: "Alex - Tech podcast host"
Personality: "Enthusiastic about technical topics, occasional jargon,
              likes to explain complex concepts with analogies"

Input text: "Quantum computers use qubits to process information."

LLM rewrite: "Quantum computers are fascinating — instead of ordinary 0s and 1s,
              they use qubits that can be both 0 and 1 at the same time.
              Think of it like Schrödinger's cat, but for computing."

Then the rewritten text goes to the TTS engine.

A local Qwen3 model (0.6B / 1.7B / 4B options) handles the rewrite. The whole pipeline runs offline.
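
The rewrite-then-speak pipeline can be sketched in a few lines. This is an illustration only: the prompt wording, function names, and the two callables standing in for the local LLM and the TTS engine are assumptions, not Voicebox's actual implementation.

```python
from typing import Callable


def build_rewrite_prompt(personality: str, text: str) -> str:
    # Hypothetical prompt; the real wording used by Voicebox is not documented here.
    return (
        "Rewrite the text below so it sounds like this persona, "
        "keeping the facts unchanged.\n"
        f"Persona: {personality}\n"
        f"Text: {text}\n"
        "Rewritten:"
    )


def speak_with_personality(
    text: str,
    personality: str,
    rewrite: Callable[[str], str],      # stand-in for the local LLM
    synthesize: Callable[[str], bytes],  # stand-in for the TTS engine
) -> bytes:
    rewritten = rewrite(build_rewrite_prompt(personality, text)).strip()
    # If the model returns nothing usable, speak the original text instead.
    return synthesize(rewritten or text)
```

Passing the LLM and TTS engine in as callables keeps the two stages swappable, which matches the idea of choosing between 0.6B, 1.7B, and 4B rewrite models.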

Stories Editor

A multi-track timeline editor for producing conversational content — podcasts, roleplay, audiobook dialogue:

Each track corresponds to a Voice Profile (different speaker)
Audio segments arranged on a timeline
Reorder and trim segments directly in the editor
Export as a complete multi-speaker audio file
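
As a rough mental model of a multi-track timeline, the sketch below places per-speaker segments at sample offsets, trims them, and sums them into one mono buffer. The `Segment` type and both functions are hypothetical; a real editor would also handle stereo, gain, and clipping.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    track: str            # speaker / Voice Profile name
    start: int            # position on the timeline, in samples
    samples: list[float]  # mono audio for this segment


def trim(segment: Segment, head: int = 0, tail: int = 0) -> Segment:
    # Drop `head` samples from the front and `tail` samples from the end.
    end = len(segment.samples) - tail
    return Segment(segment.track, segment.start, segment.samples[head:end])


def mixdown(segments: list[Segment]) -> list[float]:
    # Sum every segment into one buffer long enough to hold the last sample.
    length = max((s.start + len(s.samples) for s in segments), default=0)
    out = [0.0] * length
    for seg in segments:
        for i, value in enumerate(seg.samples):
            out[seg.start + i] += value
    return out


timeline = [
    Segment("Alex", 0, [0.25, 0.5, 0.25]),
    Segment("Sam", 2, [0.5, 0.25]),
]
mixed = mixdown(timeline)
```

Reordering a segment is then just a change to its `start` offset before the next mixdown.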

Global Dictation Hotkey

The STT component uses Whisper (Base through Large, plus a Turbo variant):

Global hotkey that works in any application
Push-to-Talk and Toggle modes
Recognized text is typed directly into the active text field
All recordings stored in the Captures tab with full transcripts
Edit transcripts inline; regenerate audio from the corrected text
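
The difference between Push-to-Talk and Toggle comes down to how key events map to the recording state. A minimal sketch, with hypothetical class and method names, assuming the hotkey layer delivers key-down and key-up events (including OS key-repeat events while the key is held):

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"
    TOGGLE = "toggle"


class HotkeyRecorder:
    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False
        self._held = False  # used to ignore key-repeat events

    def on_key_down(self) -> None:
        if self._held:
            return  # auto-repeat while the key stays pressed
        self._held = True
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True
        else:
            self.recording = not self.recording

    def on_key_up(self) -> None:
        self._held = False
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False
```

In Push-to-Talk the recording lasts exactly as long as the key is held; in Toggle the first press starts it and the next press stops it.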

Post-Processing Effects

Pitch Shift, Reverb, Delay
Chorus, Compression
High-pass/Low-pass filters
Effect presets for reuse

MCP Agent Voice Integration

Four MCP tools let Claude Code and other AI agents speak out loud, transcribe audio, and list saved captures and profiles:

# Add MCP server to Claude Code
claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

MCP Tool	Function
`voicebox.speak`	Read text aloud using a specified Voice Profile
`voicebox.transcribe`	Transcribe an audio file
`voicebox.list_captures`	List saved recordings
`voicebox.list_profiles`	List available Voice Profiles
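
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that payload; it skips the `initialize` handshake a real client performs first. The argument names (`text`, `profile_id`) are assumptions borrowed from the REST example, not a documented schema for `voicebox.speak`.

```python
import json


def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    # Standard MCP shape: method "tools/call" with the tool name and its arguments.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Argument names here are hypothetical.
payload = tool_call(1, "voicebox.speak", {"text": "Build finished.", "profile_id": "abc123"})
```

An agent would call `voicebox.list_profiles` first to discover valid profile identifiers rather than hard-coding one.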

REST API works the same way:

curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Deploy complete.", "profile_id": "abc123", "language": "en"}'
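
The same request can be made from Python with only the standard library. This sketch mirrors the curl call above (same URL and JSON fields); the helper names are my own, and because the response format is not described here, `send` just returns the raw bytes.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # local Voicebox backend, as in the curl example


def build_generate_request(text: str, profile_id: str, language: str = "en") -> urllib.request.Request:
    body = json.dumps(
        {"text": text, "profile_id": profile_id, "language": language}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> bytes:
    # Requires the Voicebox app to be running locally.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


request = build_generate_request("Deploy complete.", "abc123")
```

Calling `send(request)` at the end of a build script would announce the result in the chosen voice, assuming the app is running and `abc123` is replaced with a real profile ID.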

Deep Dive

Technical Architecture

Frontend (React + TypeScript + Tailwind)
    ↓
Tauri bridge (Rust)
    ↓
FastAPI backend (Python)
    ├── TTS engines (Qwen3-TTS / LuxTTS / Kokoro / ...)
    ├── Whisper STT
    ├── Qwen3 LLM (Personality rewrites)
    ├── Pedalboard post-processing
    └── SQLite data storage

Tauri renders the UI in the operating system's WebView instead of bundling a copy of Chromium, which keeps the desktop application lightweight. The Python FastAPI backend handles the heavy model inference. This split (web technology for the UI, Python for inference, a Rust shell connecting them) is becoming common for AI desktop applications.

Inference Backend by Hardware

Hardware	Backend	Notes
macOS Apple Silicon	MLX / Metal	GPU acceleration via Metal, 4-5× faster than CPU PyTorch
Windows / Linux NVIDIA	PyTorch CUDA	Standard GPU inference
Linux AMD	PyTorch ROCm	AMD GPU support
Windows (any GPU)	DirectML	Microsoft universal GPU interface
Intel Arc	IPEX/XPU	Intel GPU support
No GPU	CPU fallback	LuxTTS runs at 150× realtime on CPU

The MLX acceleration on Apple Silicon is a real advantage: MLX runs inference on the M-series GPU through Metal and relies on unified memory, so model weights do not have to be copied between CPU and GPU. TTS inference runs 4-5× faster than CPU PyTorch, which matters for real-time dictation and fast generation loops.
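
The table above amounts to a small decision function. Here is one way to express it, as a sketch: the function name, the `gpu_vendor` labels, and the precedence between rows (for example, preferring XPU over DirectML for Intel Arc on Windows) are my assumptions, not Voicebox's actual detection logic.

```python
from typing import Optional


def pick_backend(system: str, machine: str, gpu_vendor: Optional[str] = None) -> str:
    """Map a platform description to an inference backend name.

    `system` and `machine` follow platform.system() / platform.machine();
    `gpu_vendor` is a hypothetical label: "nvidia", "amd", "intel-arc", or None.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    if gpu_vendor == "nvidia" and system in ("Windows", "Linux"):
        return "cuda"
    if gpu_vendor == "amd" and system == "Linux":
        return "rocm"
    if gpu_vendor == "intel-arc":
        return "xpu"
    if system == "Windows" and gpu_vendor is not None:
        return "directml"
    return "cpu"
```

For example, `pick_backend("Windows", "AMD64", "amd")` falls through to DirectML because ROCm is listed for Linux only, and any machine without a recognized GPU ends up on the CPU path.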

The Captures System

All voice interactions are archived:

Captures tab
    ├── Audio recordings (dictation + deliberate recordings)
    ├── Auto-generated transcripts (Whisper)
    ├── Inline editing (modify the transcript directly)
    ├── Regenerate audio (with a different Voice Profile)
    └── Export audio

This design moves Voicebox from "one-shot generation tool" to "voice content workbench." Old recordings can be regenerated with new voice profiles; transcripts can be edited and re-voiced; the whole thing acts as a lightweight voice content management system.
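
Since the backend stores its data in SQLite, the Captures workflow can be pictured as two tables: one for recordings and their editable transcripts, one for audio regenerated from them. The schema and helper names below are hypothetical and only illustrate the idea; the real Voicebox schema may look quite different.

```python
import sqlite3

# Hypothetical schema: a capture keeps the original audio and its transcript;
# each render is audio regenerated from that transcript with some Voice Profile.
SCHEMA = """
CREATE TABLE captures (
    id INTEGER PRIMARY KEY,
    audio_path TEXT NOT NULL,
    transcript TEXT NOT NULL
);
CREATE TABLE renders (
    id INTEGER PRIMARY KEY,
    capture_id INTEGER NOT NULL REFERENCES captures(id),
    profile_id TEXT NOT NULL,
    audio_path TEXT NOT NULL
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_capture(conn: sqlite3.Connection, audio_path: str, transcript: str) -> int:
    cur = conn.execute(
        "INSERT INTO captures (audio_path, transcript) VALUES (?, ?)",
        (audio_path, transcript),
    )
    return cur.lastrowid


def edit_transcript(conn: sqlite3.Connection, capture_id: int, transcript: str) -> None:
    conn.execute("UPDATE captures SET transcript = ? WHERE id = ?", (transcript, capture_id))


def add_render(conn: sqlite3.Connection, capture_id: int, profile_id: str, audio_path: str) -> int:
    cur = conn.execute(
        "INSERT INTO renders (capture_id, profile_id, audio_path) VALUES (?, ?, ?)",
        (capture_id, profile_id, audio_path),
    )
    return cur.lastrowid
```

Keeping renders separate from captures means the original recording is never overwritten when a transcript is corrected and re-voiced.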

Unlimited Length Generation

TTS models typically cap single-generation length at a few hundred words. Voicebox handles long content:

Split long text into sentence/paragraph chunks
Generate each chunk separately with the TTS engine
Stitch chunks together with crossfade smoothing
Output a complete long audio file
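
The chunk-and-stitch steps above can be sketched with plain Python. This is a simplified illustration, not Voicebox's code: the sentence splitter is a naive regex, the 400-character limit is an arbitrary assumption, and the crossfade is a linear blend over raw float samples.

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    # Split on sentence-ending punctuation, then pack sentences into chunks.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    # Blend the last `overlap` samples of `a` into the first `overlap` of `b`.
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from `a` toward `b`
        blended.append(tail[i] * (1 - weight) + b[i] * weight)
    return head + blended + b[overlap:]


def stitch(pieces: list[list[float]], overlap: int) -> list[float]:
    out: list[float] = []
    for piece in pieces:
        out = crossfade(out, piece, overlap) if out else list(piece)
    return out
```

Each chunk would be synthesized separately and the resulting buffers passed to `stitch`; a sentence longer than `max_chars` is kept whole rather than cut mid-sentence.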

The TADA engine natively supports 700+ seconds of coherent generation with no chunking required, which makes it particularly useful for audiobooks, where consistent tone and pacing across a long piece matter.

Quick Start

Install prerequisites (macOS):

# Bun (frontend package manager)
curl -fsSL https://bun.sh/install | bash

# Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Python 3.11+ (pyenv recommended)
pyenv install 3.11.9 && pyenv global 3.11.9

# Tauri prerequisites (macOS)
xcode-select --install

# just (command runner needed for the `just setup` / `just dev` steps below)
cargo install just

Clone and run:

git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup   # create Python venv, install all dependencies (downloads models on first run)
just dev     # start backend + desktop app

Links and Resources

Official Resources

🌟 GitHub: jamiepine/voicebox
🌐 Website: voicebox.sh

Related Projects

Qwen3-TTS: Alibaba open-source TTS model
Kokoro TTS: hexgrad/kokoro
Chatterbox: Resemble AI open-source TTS
Spotify Pedalboard: Python audio effects library

Conclusion

Voicebox consolidates several previously separate local voice AI tools into one application: TTS engines, voice cloning, dictation, audio post-processing, multi-track editing, and an agent voice interface.

Reaching 32.3k stars this quickly for a project of its age reflects genuine demand. ElevenLabs's subscription model has kept people looking for alternatives, and the release of Qwen3-TTS was the first time local cloning quality came close to matching cloud-based solutions in practice.

The architecture choices — Tauri, MLX, multi-engine design — fit the use case. UI stays lightweight, inference runs fast, hardware backends are selectable. Voice Personality and the Captures system show the author thinking about workflows, not just individual features.

For developers who generate large volumes of voiceover content, have privacy concerns about uploading audio, or want to give AI agents a speaking voice in their own stack, Voicebox is worth installing.


Top comments (1)

Daniel Shilansky • Jun 24

Voicebox looks great for the own-your-voice, local-control crowd. One thing worth separating out for anyone here specifically trying to produce an audiobook (as opposed to general TTS or dictation): the TTS engine is maybe a third of the job. The rest is pronunciation of names and invented words, keeping one voice consistent across a whole book, mastering to ACX spec, and clean chapter markers in the M4B. All doable yourself with a tool like this if you enjoy the production side.

If you'd rather skip that and just get a finished audiobook back, there are done-for-you options (I've used tomevox.com: upload the manuscript, pick a voice, get a finished M4B plus per-chapter MP3/WAV, human-checked, first chapter free). Different trade-off than local DIY, you give up control but you're not the one assembling and mastering it. Comes down to whether the production process is the fun part or the part you want gone.