OmniVoice: Open-Source TTS with 600+ Languages and Zero-Shot Voice Cloning

By 정상록

The TTS landscape just shifted. On March 31, 2026, the k2-fsa team — the same group behind Kaldi and k2, with Daniel Povey as a core contributor — released OmniVoice: an Apache 2.0 licensed TTS model supporting 600+ languages zero-shot, with 40x real-time inference speed.

In just three weeks, it hit 3,775 GitHub stars and 460,000+ HuggingFace downloads. Here's why developers are paying attention and how to run it locally.

The Numbers That Matter

| Metric | Value |
| --- | --- |
| Languages supported | 600+ (zero-shot) |
| RTF (Real-Time Factor) | 0.025 (40x faster than real-time) |
| License | Apache 2.0 (commercial use OK) |
| Base model | Qwen3-0.6B |
| Reference audio needed | 3–10 seconds (or none) |
| Hardware | Consumer GPUs; Apple Silicon MPS supported |

At an RTF of 0.025, one minute of audio takes about 1.5 seconds to synthesize.

Compare this to commercial services:

  • ElevenLabs Pro: $22/month, limited characters
  • ElevenLabs Business: $99/month
  • Azure TTS: $16/million characters
  • Google Cloud TTS: $16/million characters

OmniVoice: zero cost after deployment. Unlimited usage on your own hardware.

Three Modes in One Model

OmniVoice supports three inference modes through a single unified API:

1. Voice Cloning

Clone a voice from a short reference audio clip. Whisper auto-transcribes the reference text if you don't provide it.

```python
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="This voice was cloned from a 3-second reference.",
    ref_audio="ref.wav",
    # ref_text optional - Whisper auto-transcribes
)

sf.write("cloned.wav", audio[0], 24000)
```

2. Voice Design

Design a voice from scratch using natural language attributes. No reference audio required.

```python
audio = model.generate(
    text="This is a designed voice.",
    instruct="female, low pitch, british accent",
)
```

Combine attributes freely: gender, age, pitch, speech speed, accent, dialect, emotional tone.
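For instance, a sketch stacking several attributes at once (the instruction strings are illustrative; the model takes free-form natural language, so exact phrasing is up to you):

```python
# Illustrative attribute combination - phrasing is free-form, not a fixed enum
audio = model.generate(
    text="Attribute combinations are expressed in plain language.",
    instruct="male, elderly, low pitch, slow speech speed, calm emotional tone",
)
```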

3. Auto Voice

No voice prompt at all. Fastest mode for quick prototyping.

```python
audio = model.generate(text="Quick test output.")
```

Installation

PyTorch must be installed first, with version pinned to 2.8.0.

```bash
# NVIDIA (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
    --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (M1/M2/M3)
pip install torch==2.8.0 torchaudio==2.8.0

# OmniVoice
pip install omnivoice
```

Verify GPU/MPS detection:

```python
import torch
print("CUDA:", torch.cuda.is_available())
print("MPS:", torch.backends.mps.is_available())
```
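If you want one script that runs on both backends, here's a minimal device-selection sketch. The `"mps"` value for `device_map` is an assumption mirroring the `"cuda:0"` usage above; float16 support on MPS is uneven, so it falls back to float32 there:

```python
import torch
from omnivoice import OmniVoice

# Pick the best available backend. The "mps" device string is an
# assumption based on the MPS support noted earlier; float32 on MPS
# is a conservative choice since float16 there can be flaky.
if torch.cuda.is_available():
    device, dtype = "cuda:0", torch.float16
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float32
else:
    device, dtype = "cpu", torch.float32

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map=device, dtype=dtype)
```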

Fast Path: Web Demo

The fastest way to validate your setup is the bundled Gradio demo.

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Navigate to http://localhost:8001 and test all three modes through a UI.

CLI Tools

Besides the Python API, OmniVoice ships with two CLI tools:

```bash
# Single inference
omnivoice-infer --model k2-fsa/OmniVoice \
    --text "Hello world." \
    --ref_audio ref.wav \
    --output hello.wav

# Multi-GPU batch inference
omnivoice-infer-batch --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

Batch format (JSONL):

{"id": "clip_001", "text": "First clip.", "ref_audio": "ref.wav"}
{"id": "clip_002", "text": "Second clip.", "ref_audio": "ref.wav"}
Enter fullscreen mode Exit fullscreen mode

Perfect for audiobook generation or large-scale narration pipelines.
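To build that file for a longer project, a small sketch (field names follow the batch format above; the chapter list is placeholder data):

```python
import json

# Placeholder chapter texts - swap in your real content
chapters = ["Chapter one text...", "Chapter two text..."]

# Write one JSON object per line, matching the batch format above
with open("test.jsonl", "w", encoding="utf-8") as f:
    for i, text in enumerate(chapters, start=1):
        record = {"id": f"clip_{i:03d}", "text": text, "ref_audio": "ref.wav"}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```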

Expression Control with Inline Tokens

Drop non-verbal tokens anywhere in the text:

text = "That's hilarious [laughter] but also a bit concerning [sigh]."
audio = model.generate(text=text, ref_audio="ref.wav")
Enter fullscreen mode Exit fullscreen mode

Available tags include [laughter], [sigh], [question-ah], [surprise-wa].

Pronunciation Override

For homophones and proper nouns, override pronunciation directly.

English (CMU notation)

# "bass" as musical instrument, not low-frequency sound
audio = model.generate(text="He plays [B EY1 S] guitar.")
Enter fullscreen mode Exit fullscreen mode

Chinese (Pinyin with tone numbers)

```python
# Force tone 2: 打折 ("discount") with 折 written as ZHE2
audio = model.generate(text="打ZHE2")
```

Architecture Notes

OmniVoice uses a Diffusion Language Model hybrid architecture. It's neither pure diffusion nor pure autoregressive — it combines the quality benefits of diffusion with the speed advantages of LLM-style generation. The base model is Qwen3-0.6B, making it light enough for consumer hardware while leveraging the language understanding of a modern LLM.

This is a different direction from previous open-source TTS projects (Bark, XTTS, F5-TTS), and it seems to be paying off in both quality and inference speed.

Production Tips from Issue #44

The community has been active on GitHub Issue #44 discussing real-world usage.

Voice Design consistency. Each call produces slightly different timbre. Generate once, save the output, then reuse as ref_audio to lock in a consistent voice for an entire project.
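A minimal sketch of that workflow, continuing the earlier examples (`model` and `sf` are already in scope; file names are placeholders):

```python
# 1) Design the voice once and save the output as a seed clip
seed = model.generate(
    text="This sentence locks in the designed voice.",
    instruct="female, low pitch, british accent",
)
sf.write("voice_seed.wav", seed[0], 24000)

# 2) Reuse the saved clip as the reference for every later call
audio = model.generate(text="Later lines reuse the same timbre.", ref_audio="voice_seed.wav")
```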

Prompt caching. Use create_voice_clone_prompt to precompute reference audio encodings once, then reuse the cached prompt for repeated generation. Critical for throughput on long-form content.
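A sketch of the idea: `create_voice_clone_prompt` is named in the discussion, but the arguments and the way the cached prompt is passed back in are assumptions here, so check the project docs for the exact signature:

```python
# ASSUMED signature - the function name comes from the project, but the
# kwargs are illustrative. Encode the reference audio once up front...
prompt = model.create_voice_clone_prompt(ref_audio="ref.wav")

# ...then reuse the cached prompt instead of re-encoding per call
for i, line in enumerate(["First line.", "Second line."]):
    audio = model.generate(text=line, prompt=prompt)
    sf.write(f"line_{i}.wav", audio[0], 24000)
```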

Number normalization. Raw digits like "123" can produce inconsistent output. Normalize to words ("one hundred twenty-three") using WeTextProcessing or similar before passing text to the model.
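If WeTextProcessing is heavier than you need for English, here's a lightweight sketch using the `num2words` package (an alternative, not the tool named above):

```python
import re
from num2words import num2words  # pip install num2words

def normalize_numbers(text: str) -> str:
    # Spell out each standalone digit run before passing text to TTS
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(normalize_numbers("Call 123 before 9."))
# -> "Call one hundred and twenty-three before nine."
```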

Cross-lingual accent bleed. If you use a Korean reference to generate English, the output has a Korean accent. For neutral target language accent, use native-speaker references.

Who This Is For

OmniVoice fits several developer profiles well:

  • Solo developers and indie hackers who were paying commercial TTS subscriptions just for hobby projects
  • AI agent builders needing voice output without vendor lock-in
  • Content creators doing multilingual localization (YouTube, podcasts)
  • Voice cloning experiments where 3-second references unlock a lot of creative possibilities
  • Low-resource language applications (the 600+ language coverage includes many languages with no good commercial option)

What's Next

The k2-fsa ecosystem has a strong track record of long-term maintenance (Kaldi is still actively used 15+ years after release). That matters when you're deciding whether to build production infrastructure on a new model.

If you're evaluating TTS options, OmniVoice deserves a spot in the comparison. The combination of 600+ language support, 40x real-time inference, and Apache 2.0 licensing is genuinely rare in open source TTS today.


Have you tried OmniVoice yet? Would love to hear how it compares to your current TTS setup in the comments.
