# OmniVoice: Open-Source TTS with 600+ Languages and Zero-Shot Voice Cloning
The TTS landscape just shifted. On March 31, 2026, the k2-fsa team — the same group behind Kaldi and k2, with Daniel Povey as a core contributor — released OmniVoice: an Apache 2.0 licensed TTS model supporting 600+ languages zero-shot, with 40x real-time inference speed.
In just three weeks, it hit 3,775 GitHub stars and 460,000+ HuggingFace downloads. Here's why developers are paying attention and how to run it locally.
## The Numbers That Matter
| Metric | Value |
|---|---|
| Languages supported | 600+ (zero-shot) |
| RTF (Real-Time Factor) | 0.025 (40x faster than real-time) |
| License | Apache 2.0 (commercial use OK) |
| Base model | Qwen3-0.6B |
| Reference audio needed | 3–10 seconds (or none) |
| Hardware | Runs on consumer GPUs, Apple Silicon MPS supported |
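A note on reading RTF: it is generation time divided by output audio duration, so lower is better. A tiny stdlib-only sketch of the arithmetic (the numbers below are illustrative, not measured benchmarks):

```python
def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced.

    RTF < 1 means faster than real time; the speedup factor is 1 / RTF.
    """
    return gen_seconds / audio_seconds

# Illustrative: 0.25 s of compute for a 10 s clip
rtf = real_time_factor(0.25, 10.0)  # 0.025
speedup = 1 / rtf                   # 40x real time
```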
Compare this to commercial services:
- ElevenLabs Pro: $22/month, limited characters
- ElevenLabs Business: $99/month
- Azure TTS: $16/million characters
- Google Cloud TTS: $16/million characters
OmniVoice: zero cost after deployment. Unlimited usage on your own hardware.
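To put "zero cost after deployment" in context, here is a back-of-envelope helper using the pay-per-character rate quoted above (hardware and electricity are ignored, so treat it as illustrative):

```python
def cloud_tts_cost_usd(characters: int, usd_per_million: float = 16.0) -> float:
    """Pay-per-character cost at the Azure/Google rate quoted above."""
    return characters / 1_000_000 * usd_per_million

# A ~500k-word audiobook is roughly 3 million characters
book_cost = cloud_tts_cost_usd(3_000_000)  # 48.0 USD per full regeneration
```

Each iteration of a long-form project re-incurs that cost on a metered API; on your own hardware it does not.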
## Three Modes in One Model
OmniVoice supports three inference modes through a single unified API:
### 1. Voice Cloning
Clone a voice from a short reference audio clip. Whisper auto-transcribes the reference text if you don't provide it.
```python
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="This voice was cloned from a 3-second reference.",
    ref_audio="ref.wav",
    # ref_text optional - Whisper auto-transcribes
)

sf.write("cloned.wav", audio[0], 24000)
```
### 2. Voice Design
Design a voice from scratch using natural language attributes. No reference audio required.
```python
audio = model.generate(
    text="This is a designed voice.",
    instruct="female, low pitch, british accent",
)
```
Combine attributes freely: gender, age, pitch, speech speed, accent, dialect, emotional tone.
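If you generate instruct strings programmatically, a small helper keeps attribute combinations tidy. This is plain Python string sugar, not part of the OmniVoice API; the attribute wording follows the example above, and the model accepts free-form natural language:

```python
def build_instruct(*attributes: str) -> str:
    """Join voice-design attributes into the comma-separated style
    used by the instruct parameter. Purely cosmetic sugar."""
    return ", ".join(attributes)

instruct = build_instruct("female", "low pitch", "british accent")
# "female, low pitch, british accent"
```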
### 3. Auto Voice
No voice prompt at all. Fastest mode for quick prototyping.
```python
audio = model.generate(text="Quick test output.")
```
## Installation

Install PyTorch first, pinned to version 2.8.0.
```bash
# NVIDIA (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
  --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (M1/M2/M3)
pip install torch==2.8.0 torchaudio==2.8.0

# OmniVoice
pip install omnivoice
```
Verify GPU/MPS detection:

```python
import torch

print("CUDA:", torch.cuda.is_available())
print("MPS:", torch.backends.mps.is_available())
```
## Fast Path: Web Demo
The fastest way to validate your setup is the bundled Gradio demo.
```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```
Navigate to http://localhost:8001 and test all three modes through a UI.
## CLI Tools
Besides the Python API, OmniVoice ships with two CLI tools:
```bash
# Single inference
omnivoice-infer --model k2-fsa/OmniVoice \
  --text "Hello world." \
  --ref_audio ref.wav \
  --output hello.wav

# Multi-GPU batch inference
omnivoice-infer-batch --model k2-fsa/OmniVoice \
  --test_list test.jsonl \
  --res_dir results/
```
Batch format (JSONL):
```jsonl
{"id": "clip_001", "text": "First clip.", "ref_audio": "ref.wav"}
{"id": "clip_002", "text": "Second clip.", "ref_audio": "ref.wav"}
```
Perfect for audiobook generation or large-scale narration pipelines.
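If you assemble the test list programmatically, the stdlib `json` module is all you need. `write_test_list` here is a hypothetical helper, not an OmniVoice utility; it just emits one JSON object per line in the batch format shown above:

```python
import json

def write_test_list(path: str, clips: list[dict]) -> None:
    """Write batch entries as JSONL: one object per line with
    id, text, and ref_audio fields."""
    with open(path, "w", encoding="utf-8") as f:
        for clip in clips:
            f.write(json.dumps(clip, ensure_ascii=False) + "\n")

write_test_list("test.jsonl", [
    {"id": "clip_001", "text": "First clip.", "ref_audio": "ref.wav"},
    {"id": "clip_002", "text": "Second clip.", "ref_audio": "ref.wav"},
])
```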
## Expression Control with Inline Tokens
Drop non-verbal tokens anywhere in the text:
```python
text = "That's hilarious [laughter] but also a bit concerning [sigh]."
audio = model.generate(text=text, ref_audio="ref.wav")
```
Available tags include `[laughter]`, `[sigh]`, `[question-ah]`, and `[surprise-wa]`.
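For long scripts it can help to lint which expression tokens a line contains before synthesis. A small stdlib-only checker covering the four tags listed above (the full tag set may well be larger):

```python
import re

# The four tags named above; the model may support more.
EXPRESSION_TAGS = ("laughter", "sigh", "question-ah", "surprise-wa")
TAG_RE = re.compile(r"\[(" + "|".join(map(re.escape, EXPRESSION_TAGS)) + r")\]")

def find_expression_tokens(text: str) -> list[str]:
    """Return the inline expression tokens found in a script line."""
    return TAG_RE.findall(text)

tokens = find_expression_tokens(
    "That's hilarious [laughter] but also a bit concerning [sigh]."
)
# ['laughter', 'sigh']
```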
## Pronunciation Override
For homophones and proper nouns, override pronunciation directly.
### English (CMU notation)
```python
# "bass" as the musical instrument, not the low-frequency sound
audio = model.generate(text="He plays [B EY1 S] guitar.")
```
### Chinese (Pinyin with tone numbers)
```python
# Force a specific tone
audio = model.generate(text="打ZHE2")
```
## Architecture Notes
OmniVoice uses a Diffusion Language Model hybrid architecture. It's neither pure diffusion nor pure autoregressive — it combines the quality benefits of diffusion with the speed advantages of LLM-style generation. The base model is Qwen3-0.6B, making it light enough for consumer hardware while leveraging the language understanding of a modern LLM.
This is a different direction from previous open-source TTS projects (Bark, XTTS, F5-TTS), and it seems to be paying off in both quality and inference speed.
## Production Tips from Issue #44
The community has been active on GitHub Issue #44 discussing real-world usage.
**Voice Design consistency.** Each call produces slightly different timbre. Generate once, save the output, then reuse it as `ref_audio` to lock in a consistent voice for an entire project.
**Prompt caching.** Use `create_voice_clone_prompt` to precompute reference audio encodings once, then reuse the cached prompt for repeated generation. Critical for throughput on long-form content.
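The caching idea can be sketched with `functools.lru_cache`. The body below is a stand-in stub, not the real `create_voice_clone_prompt` (whose exact signature isn't documented here), so treat this as the shape of the pattern rather than the API:

```python
from functools import lru_cache

encode_calls = 0  # counts how often the expensive encode step runs

@lru_cache(maxsize=32)
def cached_clone_prompt(ref_audio_path: str):
    """Stand-in for precomputing a reference encoding; in practice the
    body would call the model's create_voice_clone_prompt once per path."""
    global encode_calls
    encode_calls += 1
    return ("prompt-for", ref_audio_path)

cached_clone_prompt("ref.wav")
cached_clone_prompt("ref.wav")  # cache hit: the encode step runs only once
```

Keying the cache on the reference path means a thousand generate calls against the same narrator pay the encoding cost once.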
**Number normalization.** Raw digits like "123" can produce inconsistent output. Normalize to words ("one hundred twenty-three") using WeTextProcessing or similar before passing text to the model.
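A toy normalizer for small integers shows the idea; for production use WeTextProcessing or num2words, since this sketch only handles 0-999:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-999 in English words (toy normalizer)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[rest]}" if rest else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")

number_to_words(123)  # "one hundred twenty-three"
```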
**Cross-lingual accent bleed.** If you use a Korean reference to generate English, the output carries a Korean accent. For a neutral accent in the target language, use native-speaker references.
## Who This Is For
OmniVoice fits several developer profiles well:
- Solo developers and indie hackers who were paying commercial TTS subscriptions just for hobby projects
- AI agent builders needing voice output without vendor lock-in
- Content creators doing multilingual localization (YouTube, podcasts)
- Voice cloning experiments where 3-second references unlock a lot of creative possibilities
- Low-resource language applications (the 600+ language coverage includes many languages with no good commercial option)
## What's Next
The k2-fsa ecosystem has a strong track record of long-term maintenance (Kaldi is still actively used 15+ years after release). That matters when you're deciding whether to build production infrastructure on a new model.
If you're evaluating TTS options, OmniVoice deserves a spot in the comparison. The combination of 600+ language support, 40x real-time inference, and Apache 2.0 licensing is genuinely rare in open source TTS today.
Have you tried OmniVoice yet? Would love to hear how it compares to your current TTS setup in the comments.