DEV Community

WonderLab

Open Source Project of the Day (#103): Voicebox — A Local, Open-Source Alternative to ElevenLabs

Introduction

"A free local alternative to ElevenLabs and WisprFlow — in one app."

This is article #103 in the Open Source Project of the Day series. Today's project is Voicebox — a local-first open-source AI voice studio built by independent developer Jamie Pine.

ElevenLabs produces excellent voice cloning. But subscriptions start at $22/month, character limits apply, and your voice data goes to their servers. WisprFlow handles voice dictation, with the same cloud dependency. Both are genuinely useful, and both have the same structural limitation.

Voicebox integrates both functions into one locally-running desktop application: 7 TTS engines, zero-shot voice cloning, a global dictation hotkey, a multi-track Stories editor, and MCP integration so AI agents can speak in voices you own. Every model runs on-device. No audio data leaves the machine.

32.3k Stars, MIT license, built with Tauri + Rust.

What You'll Learn

  • The 7 built-in TTS engines: characteristics and best use cases (from the 82M Kokoro to Qwen3-TTS high-quality cloning)
  • Zero-shot voice cloning: how a few seconds of audio becomes a reusable voice model
  • Voice Personality: how to make AI speak in a specific persona
  • Stories Editor: multi-track timeline for podcast dialogue production
  • MCP integration: giving Claude Code and other agents a speaking voice
  • MLX acceleration on Apple Silicon

Prerequisites

  • Familiarity with TTS (text-to-speech) and voice cloning concepts
  • Experience with ElevenLabs or similar voice AI tools helps for comparison
  • Claude Code or MCP experience (for the agent voice integration section)

Project Background

What Is Voicebox?

Voicebox is a local-first AI voice studio — "a free and open-source alternative to ElevenLabs and WisprFlow in one app."

It integrates two function categories: voice output (TTS + voice cloning) and voice input (dictation + STT). These are usually separate tools; Voicebox puts them in one desktop application sharing the same local model infrastructure.

The technical core is the integration of Qwen3-TTS, Alibaba's open-source TTS model released in 2025. It was the first model to bring locally run zero-shot voice cloning close to, or level with, ElevenLabs quality.

Author

  • Author: Jamie Pine (jamiepine)
  • Website: voicebox.sh
  • License: MIT
  • Version: v0.5.0, 25 releases, 588 commits

Project Stats

  • ⭐ GitHub Stars: 32,300+
  • 🍴 Forks: 3,900+
  • 📄 License: MIT
  • 💻 Stack: TypeScript 55% / Python 34% / Rust 9%

Core Features

7 TTS Engines

Seven engines are bundled, each suited to a different scenario:

Engine                   Languages  Key Strength
Qwen3-TTS                10         High-quality multilingual voice cloning (Alibaba open-source)
Qwen CustomVoice         10         9 preset voices, natural-language delivery control
LuxTTS                   English    Lightweight, 48 kHz, 150× realtime on CPU
Chatterbox Multilingual  23         Broadest language coverage (Arabic, Swahili, and more)
Chatterbox Turbo         English    Paralinguistic emotion tags ([laugh], [sigh])
TADA (HumeAI)            10         700+ seconds of coherent long-form audio generation
Kokoro                   8          Tiny 82M model, 50 preset voices

The differentiation is concrete: LuxTTS runs fast on laptops without GPUs; Chatterbox Turbo expresses laughter and sighs; TADA handles audiobooks without segment stitching; Kokoro has the smallest footprint.

Voice Cloning

Zero-shot cloning: upload a few seconds of reference audio, and Voicebox extracts the voice signature and creates a reusable Voice Profile.

Workflow:
1. Record or upload reference audio (3-10 seconds produces good results)
2. Create a Voice Profile (name it, add a description)
3. Select this profile for any TTS task
4. Combine multiple samples to improve clone accuracy

Reference audio can be recorded directly inside the app — no external recording file needed.

Voice Personality

This feature is rare in TTS tools: bind a persona description to a Voice Profile, then have a local LLM rewrite the input text in that persona's voice before it reaches the TTS engine.

Example:
Voice Profile: "Alex - Tech podcast host"
Personality: "Enthusiastic about technical topics, occasional jargon,
              likes to explain complex concepts with analogies"

Input text: "Quantum computers use qubits to process information."

LLM rewrite: "Quantum computers are fascinating — instead of ordinary 0s and 1s,
              they use qubits that can be both 0 and 1 at the same time.
              Think of it like Schrödinger's cat, but for computing."

Then the rewritten text goes to the TTS engine.

A local Qwen3 model (0.6B / 1.7B / 4B options) handles the rewrite. The whole pipeline runs offline.
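
The rewrite-then-speak order described above can be sketched in a few lines. This is only an illustration: `VoiceProfile`, `build_rewrite_prompt`, and `speak_with_personality` are hypothetical names, the prompt wording is invented, and the `rewrite` and `synthesize` callables are stand-ins for the local Qwen3 model and the chosen TTS engine rather than Voicebox's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceProfile:
    name: str
    personality: str  # free-text persona description bound to the profile


def build_rewrite_prompt(profile: VoiceProfile, text: str) -> str:
    """Assemble the instruction handed to the local LLM (wording is illustrative)."""
    return (
        f"You are {profile.name}. Persona: {profile.personality}\n"
        "Rewrite the following text in this persona's voice, keeping the facts unchanged.\n"
        f"Text: {text}"
    )


def speak_with_personality(
    profile: VoiceProfile,
    text: str,
    rewrite: Callable[[str], str],       # stand-in for the local Qwen3 LLM
    synthesize: Callable[[str], bytes],  # stand-in for the selected TTS engine
) -> bytes:
    """Rewrite first, then synthesize: the TTS engine only sees the rewritten text."""
    rewritten = rewrite(build_rewrite_prompt(profile, text))
    return synthesize(rewritten)
```

Passing the two models in as plain callables keeps the sketch independent of any particular engine, which mirrors the way a Voice Profile can be paired with any of the bundled TTS engines.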

Stories Editor

A multi-track timeline editor for producing conversational content — podcasts, roleplay, audiobook dialogue:

  • Each track corresponds to a Voice Profile (different speaker)
  • Audio segments arranged on a timeline
  • Reorder and trim segments directly in the editor
  • Export as a complete multi-speaker audio file
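
One way to picture the timeline is as a list of tracks, each holding timed segments. The sketch below is an assumed data model for illustration only; `Segment`, `Track`, `playback_order`, and `total_duration` are hypothetical names and say nothing about how Voicebox stores Stories internally.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float     # position on the timeline, in seconds
    duration: float  # length after trimming, in seconds
    text: str


@dataclass
class Track:
    profile_name: str  # one Voice Profile (speaker) per track
    segments: list[Segment] = field(default_factory=list)


def playback_order(tracks: list[Track]) -> list[tuple[float, str, str]]:
    """Flatten all tracks into (start, speaker, text) tuples sorted by start time."""
    items = [(s.start, t.profile_name, s.text) for t in tracks for s in t.segments]
    return sorted(items)


def total_duration(tracks: list[Track]) -> float:
    """Length of the exported file: the latest segment end across all tracks."""
    return max((s.start + s.duration for t in tracks for s in t.segments), default=0.0)
```

Reordering a segment is then just a change to its `start`, and trimming is a change to its `duration`.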

Global Dictation Hotkey

The STT component uses Whisper (Base through Large, plus the Turbo variant):

  • Global hotkey that works in any application
  • Push-to-Talk and Toggle modes
  • Recognized text is typed directly into the active text field
  • All recordings stored in the Captures tab with full transcripts
  • Edit transcripts inline; regenerate audio from the corrected text
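
The difference between Push-to-Talk and Toggle comes down to how key-down and key-up events change the recording state. Here is a minimal sketch of that logic; the `Mode` and `DictationHotkey` names are hypothetical, and real hotkey capture (which needs OS-level hooks) is left out.

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"
    TOGGLE = "toggle"


class DictationHotkey:
    """Tracks whether the microphone should be recording, given hotkey events."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def on_key_down(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = True  # record while the key is held
        else:
            self.recording = not self.recording  # each press flips the state

    def on_key_up(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            self.recording = False  # releasing the key stops recording
        # In toggle mode, key release is ignored.
```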

Post-Processing Effects

Effects are applied after generation using Spotify's Pedalboard library:

  • Pitch Shift, Reverb, Delay
  • Chorus, Compression
  • High-pass/Low-pass filters
  • Effect presets for reuse

MCP Agent Voice Integration

Four MCP tools let Claude Code and other AI agents speak out loud:

# Add MCP server to Claude Code
claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"
MCP Tool                Function
voicebox.speak          Read text aloud using a specified Voice Profile
voicebox.transcribe     Transcribe an audio file
voicebox.list_captures  List saved recordings
voicebox.list_profiles  List available Voice Profiles
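
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `voicebox.speak`. The envelope follows the MCP specification, but the argument names `text` and `profile` are assumptions made for illustration; the tool's real input schema should be read from the server's `tools/list` response.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Argument names "text" and "profile" are assumptions, not Voicebox's documented schema.
payload = tools_call_request(1, "voicebox.speak", {"text": "Deploy complete.", "profile": "Alex"})
```

In practice Claude Code builds this message itself once the server is registered; the sketch only shows what travels over the wire.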

The REST API exposes generation in the same way:

curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Deploy complete.", "profile_id": "abc123", "language": "en"}'
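
For scripts that would rather not shell out to curl, the same request can be built with Python's standard library. This sketch mirrors the curl call above field for field; `build_generate_request` is a hypothetical helper name, and since the response format is not described here, the code stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # host and port taken from the curl example


def build_generate_request(text: str, profile_id: str, language: str = "en") -> urllib.request.Request:
    """Build (but do not send) the POST /generate request."""
    body = json.dumps({"text": text, "profile_id": profile_id, "language": language})
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Deploy complete.", "abc123")
# To send it while Voicebox is running: urllib.request.urlopen(req)
```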

Deep Dive

Technical Architecture

Frontend (React + TypeScript + Tailwind)
    ↓
Tauri bridge (Rust)
    ↓
FastAPI backend (Python)
    ├── TTS engines (Qwen3-TTS / LuxTTS / Kokoro / ...)
    ├── Whisper STT
    ├── Qwen3 LLM (Personality rewrites)
    ├── Pedalboard post-processing
    └── SQLite data storage

Tauri + Rust keeps the desktop application lightweight: the UI renders in the operating system's WebView rather than in a bundled Chromium runtime, as it would with Electron. The Python FastAPI backend handles the heavy model inference. This split (web technology for the UI, Python for inference, Tauri bridging them) is becoming standard for AI desktop applications.

Inference Backend by Hardware

Hardware                Backend       Notes
macOS Apple Silicon     MLX / Metal   GPU acceleration through Metal, 4-5× faster than CPU PyTorch
Windows / Linux NVIDIA  PyTorch CUDA  Standard GPU inference
Linux AMD               PyTorch ROCm  AMD GPU support
Windows (any GPU)       DirectML      Microsoft universal GPU interface
Intel Arc               IPEX/XPU      Intel GPU support
No GPU                  CPU fallback  LuxTTS runs at 150× realtime on CPU

The MLX acceleration on Apple Silicon is a real advantage: MLX runs inference on the M-series GPU through Metal and uses unified memory, so tensors do not have to be copied between CPU and GPU. TTS inference runs 4-5× faster than CPU PyTorch, which matters for real-time dictation and fast generation loops.
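
The hardware table can be read as a small decision function. The sketch below is one possible encoding of it, not Voicebox's actual detection code: `pick_backend` and its string labels are hypothetical, and the precedence (for example, preferring XPU over DirectML for Intel Arc on Windows) is an assumption.

```python
from typing import Optional


def pick_backend(os_name: str, gpu: Optional[str]) -> str:
    """Map (OS, GPU vendor) to a backend name, following the hardware table.

    os_name: "macos", "windows", or "linux".
    gpu: "apple", "nvidia", "amd", "intel-arc", another vendor string,
    or None when no GPU is present.
    """
    if gpu is None:
        return "cpu"
    if os_name == "macos" and gpu == "apple":
        return "mlx"
    if gpu == "nvidia" and os_name in ("windows", "linux"):
        return "cuda"
    if gpu == "amd" and os_name == "linux":
        return "rocm"
    if gpu == "intel-arc":
        return "xpu"
    if os_name == "windows":
        return "directml"  # generic fallback for any other Windows GPU
    return "cpu"
```

Note that an AMD card on Windows falls through to DirectML here, since the table lists ROCm for Linux only.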

The Captures System

All voice interactions are archived:

Captures tab
    ├── Audio recordings (dictation + deliberate recordings)
    ├── Auto-generated transcripts (Whisper)
    ├── Inline editing (modify the transcript directly)
    ├── Regenerate audio (with a different Voice Profile)
    └── Export audio

This design moves Voicebox from "one-shot generation tool" to "voice content workbench." Old recordings can be regenerated with new voice profiles; transcripts can be edited and re-voiced; the whole thing acts as a lightweight voice content management system.

Unlimited Length Generation

TTS models typically cap single-generation length at a few hundred words. Voicebox handles long content:

  1. Split long text into sentence/paragraph chunks
  2. Generate each chunk separately with the TTS engine
  3. Stitch chunks together with crossfade smoothing
  4. Output a complete long audio file
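
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than Voicebox's implementation: audio is modelled as a list of float samples, the crossfade is a simple linear blend, the 200-character chunk limit is arbitrary, and `split_into_chunks`, `crossfade`, and `generate_long` are hypothetical names. The `synthesize` argument is a stand-in for a TTS engine call.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks no longer than max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, linearly blending the last/first `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = [
        tail[i] * (1 - (i + 1) / (overlap + 1)) + b[i] * ((i + 1) / (overlap + 1))
        for i in range(overlap)
    ]
    return head + mixed + b[overlap:]


def generate_long(text: str, synthesize, overlap: int = 4) -> list[float]:
    """Chunk the text, synthesize each chunk, and stitch the results with crossfades."""
    audio: list[float] = []
    for chunk in split_into_chunks(text):
        audio = crossfade(audio, synthesize(chunk), overlap)
    return audio
```

Splitting on sentence boundaries keeps each chunk prosodically complete, and the overlap hides small level or timbre jumps at the joins. A single sentence longer than `max_chars` is passed through as one oversized chunk in this sketch.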

The TADA engine natively supports 700+ seconds of coherent generation with no chunking required, which makes it particularly useful for audiobooks where consistent tone and pacing across a long piece matter.


Quick Start

Install prerequisites (macOS):

# Bun (frontend package manager)
curl -fsSL https://bun.sh/install | bash

# Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Python 3.11+ (pyenv recommended)
pyenv install 3.11.9 && pyenv global 3.11.9

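# Tauri prerequisites (macOS)
xcode-select --install

# just (command runner for the `just setup` / `just dev` recipes; Homebrew assumed)
brew install just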

Clone and run:

git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup   # create Python venv, install all dependencies (downloads models on first run)
just dev     # start backend + desktop app

Links and Resources

Official Resources

  • Website: voicebox.sh
  • GitHub: https://github.com/jamiepine/voicebox

Related Projects

  • Qwen3-TTS: Alibaba open-source TTS model
  • Kokoro TTS: hexgrad/kokoro
  • Chatterbox: Resemble AI open-source TTS
  • Spotify Pedalboard: Python audio effects library

Conclusion

Voicebox consolidates several previously separate local voice AI tools into one application: TTS engines, voice cloning, dictation, audio post-processing, multi-track editing, and an agent voice interface.

The 32.3k stars, accumulated quickly relative to the project's age, reflect genuine demand: ElevenLabs' subscription model has kept people looking for alternatives, and the release of Qwen3-TTS was the first time local cloning quality practically matched cloud-based solutions.

The architecture choices — Tauri, MLX, multi-engine design — fit the use case. UI stays lightweight, inference runs fast, hardware backends are selectable. Voice Personality and the Captures system show the author thinking about workflows, not just individual features.

For developers who generate large volumes of voiceover content, have privacy concerns about uploading audio, or want to give AI agents a speaking voice in their own stack, Voicebox is worth installing.

