Voicebox: The Open-Source AI Voice Studio That Just Hit 28K Stars

#ai #voice #opensource #programming

I've been watching the voice AI space for a while. ElevenLabs does voice cloning incredibly well. WisprFlow nails voice dictation. But both live in the cloud, both cost money every month, and both require uploading your voice data to someone else's server.

That's why Voicebox caught my attention. 28.5k GitHub stars, MIT license, and it runs entirely on your machine. It combines what ElevenLabs does (voice output) with what WisprFlow does (voice input), ties them together with a local LLM, and wraps everything in a polished desktop app.

Clone Your Voice in Seconds

Voice cloning needs only a few seconds of reference audio. Upload a short clip, and Voicebox builds a voice model that sounds like you. It covers 23 languages, including English, Chinese, Japanese, Arabic, Hindi, and Swahili.

Under the hood, Voicebox ships with 7 TTS engines:

Engine	Best For
Qwen3-TTS	High-quality multilingual cloning, natural-language delivery instructions
Chatterbox Turbo	Emotion tags (`[laugh]`, `[sigh]`, `[gasp]`) for expressive speech
LuxTTS	Lightweight (~1GB VRAM), 48kHz, 150x realtime on CPU
Kokoro	82M model, 50 curated preset voices, runs on CPU
TADA	HumeAI speech-language model, 700s+ coherent audio
Qwen CustomVoice	Delivery control without reference audio
Chatterbox Multilingual	23 languages, broadest coverage

If you don't want to clone anything, there are 50+ preset voices ready to go. And after generating audio, you get a full effects panel — reverb, delay, compression, pitch shift, chorus — all powered by Spotify's Pedalboard library, with real-time preview.

Give Your AI Agents a Voice

This is the feature that actually got me excited.

Voicebox ships a built-in MCP (Model Context Protocol) server. Any MCP-compatible agent — Claude Code, Cursor, Cline, Windsurf — can call it to speak. Setup takes one command:

claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

After that, your agent can speak through your cloned voice. "Tests passed, ready to merge" — in a voice you chose.
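For a sense of what travels over that endpoint, here is a rough sketch of the JSON-RPC 2.0 message an MCP client sends when it invokes a tool. The `tools/call` method and its `name`/`arguments` parameters come from the MCP specification. The tool name `speak` and the `text` and `voice` arguments are placeholders I made up for illustration, so check the Voicebox docs for the real tool names.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an id so replies can be matched


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "speak", "text", and "voice" are hypothetical names, not Voicebox's confirmed API.
request = build_tool_call(
    "speak",
    {"text": "Tests passed, ready to merge", "voice": "my-clone"},
)
body = json.dumps(request)
print(body)
```

This only builds the payload. A real client would first run the MCP initialize handshake and send the `X-Voicebox-Client-Id` header from the setup command, which is exactly the plumbing an MCP-aware agent handles for you.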

You can assign different voices to different agents: one voice for your code reviewer, another for your deployment bot. And the real kicker is voice personalities. Attach a persona description like "calm engineer" or "sarcastic code reviewer," and Voicebox's local LLM rewrites the agent's output to match that personality before synthesizing speech. Your agents don't just sound different. They talk differently.
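To make the per-agent idea concrete, here is a minimal sketch of how a client id could map to a voice and a persona, and how the persona could become a rewrite instruction for the local LLM. Everything here is my own guess at the shape of the flow, not Voicebox's implementation: the `VoiceProfile` class, the voice names, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    voice: str    # hypothetical voice name
    persona: str  # free-text personality description, may be empty


# Hypothetical mapping from agent/client id to a voice profile.
PROFILES = {
    "code-reviewer": VoiceProfile("dry-baritone", "sarcastic code reviewer"),
    "deploy-bot": VoiceProfile("steady-alto", "calm engineer"),
}
DEFAULT_PROFILE = VoiceProfile("default", "")


def prepare_speech(client_id: str, text: str) -> tuple[str, str]:
    """Return (voice, text_for_llm) for the given agent.

    With no persona, the text passes through unchanged. Otherwise it is
    wrapped in a rewrite instruction for the LLM to apply before synthesis.
    """
    profile = PROFILES.get(client_id, DEFAULT_PROFILE)
    if not profile.persona:
        return profile.voice, text
    prompt = (
        f"Rewrite the following message in the voice of a {profile.persona}, "
        f"keeping every fact intact:\n{text}"
    )
    return profile.voice, prompt


voice, prompt = prepare_speech("deploy-bot", "Deploy finished.")
print(voice)
print(prompt)
```

Keeping the voice and the persona in one record is the point of the design: the same lookup decides both how the agent sounds and how it phrases things.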

Dictation That Doesn't Leak Your Voice

Voicebox includes a global hotkey for dictation. Hold it, speak, release — text pastes into whatever text field you're focused on. On macOS, it uses the accessibility API for precise paste injection without touching your clipboard.

All dictation stays local. Whisper-based STT runs on your machine. An optional LLM refinement pass cleans up ums and stutters.
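To show what that cleanup pass is doing conceptually, here is a toy stand-in that uses regular expressions instead of an LLM. It is not how Voicebox does it; it is just the simplest version of "strip fillers and collapse stutters" I could write, and the filler list is my own assumption.

```python
import re

# Common filler sounds, plus an optional trailing comma/period and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|er+|ah+)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated one or more times ("I I", "this this this").
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def tidy_transcript(text: str) -> str:
    """Remove filler sounds and collapse stuttered word repeats."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(tidy_transcript("Um, so I I think, uh, this this works"))
# prints: so I think, this works
```

A regex like this also flattens legitimate repeats such as "had had", which is the kind of judgment call an LLM pass is better placed to make.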

Runs on Almost Anything

Hardware	Backend
Apple Silicon	MLX (Metal, 4-5x speed)
NVIDIA GPU	CUDA
AMD GPU	ROCm
Intel Arc	IPEX/XPU
CPU only	Kokoro 82M works fine

The app ships as a DMG for macOS and an MSI for Windows. First launch auto-downloads the model weights you need: Kokoro is an 82M-parameter model, so its download is small, while Qwen3-TTS runs to a few GB. The REST API and MCP server both listen on localhost:17493, with docs at http://127.0.0.1:17493/docs.
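If you want to script against the local server, a quick reachability check is a sensible first step. The sketch below uses only the standard library and the host, port, and `/docs` path mentioned above. The helper names are mine, and I'm assuming `/docs` answers a plain GET with a 200 when the app is running.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # host and port as given in the article


def endpoint(path: str) -> str:
    """Join a path onto the local Voicebox base URL."""
    return f"{BASE_URL}/{path.lstrip('/')}"


def is_running(timeout: float = 1.0) -> bool:
    """Return True if the docs page answers, False if the server is unreachable."""
    try:
        with urllib.request.urlopen(endpoint("/docs"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts, and HTTP errors
        return False


if __name__ == "__main__":
    print("Voicebox reachable:", is_running())
```

From there, the docs page at `/docs` is the place to look up the actual REST routes rather than guessing at them.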

The Bigger Picture

Voice I/O going local was always going to happen. Cloud convenience is real, but voice data is biometric data — losing it is closer to losing your fingerprint than losing your email. The fact that open-source TTS and STT models are now good enough to run on consumer hardware changes the equation.

Voicebox isn't just a useful tool. It's a proof point that agents don't have to be silent text boxes. They can speak, emote, and have personality — all without sending your voice to a data center.

GitHub: jamiepine/voicebox