OmniVoice: Open-Source TTS with 600+ Languages and Zero-Shot Voice Cloning

By 정상록

The TTS landscape just shifted. On March 31, 2026, the k2-fsa team — the same group behind Kaldi and k2, with Daniel Povey as a core contributor — released OmniVoice: an Apache 2.0 licensed TTS model supporting 600+ languages zero-shot, with 40x real-time inference speed.

In just three weeks, it hit 3,775 GitHub stars and 460,000+ HuggingFace downloads. Here's why developers are paying attention and how to run it locally.

The Numbers That Matter

| Metric | Value |
| --- | --- |
| Languages supported | 600+ (zero-shot) |
| RTF (Real-Time Factor) | 0.025 (40x faster than real-time) |
| License | Apache 2.0 (commercial use OK) |
| Base model | Qwen3-0.6B |
| Reference audio needed | 3–10 seconds (or none) |
| Hardware | Consumer GPUs; Apple Silicon MPS supported |

At an RTF of 0.025, one minute of audio takes about 1.5 seconds to synthesize.

Compare this to commercial services:

  • ElevenLabs Pro: $22/month, limited characters
  • ElevenLabs Business: $99/month
  • Azure TTS: $16/million characters
  • Google Cloud TTS: $16/million characters

OmniVoice: zero cost after deployment. Unlimited usage on your own hardware.

Three Modes in One Model

OmniVoice supports three inference modes through a single unified API:

1. Voice Cloning

Clone a voice from a short reference audio clip. Whisper auto-transcribes the reference text if you don't provide it.

```python
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="This voice was cloned from a 3-second reference.",
    ref_audio="ref.wav",
    # ref_text optional - Whisper auto-transcribes
)

sf.write("cloned.wav", audio[0], 24000)
```

2. Voice Design

Design a voice from scratch using natural language attributes. No reference audio required.

```python
audio = model.generate(
    text="This is a designed voice.",
    instruct="female, low pitch, british accent",
)
```

Combine attributes freely: gender, age, pitch, speech speed, accent, dialect, emotional tone.
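For instance, a sketch stacking several attributes at once (the instruction strings are illustrative; the model takes free-form natural language, so exact phrasing is up to you):

```python
# Illustrative attribute combination - phrasing is free-form, not a fixed enum
audio = model.generate(
    text="Attribute combinations are expressed in plain language.",
    instruct="male, elderly, low pitch, slow speech speed, calm emotional tone",
)
```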

3. Auto Voice

No voice prompt at all. Fastest mode for quick prototyping.

```python
audio = model.generate(text="Quick test output.")
```

Installation

PyTorch must be installed first, with version pinned to 2.8.0.

```bash
# NVIDIA (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
    --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (M1/M2/M3)
pip install torch==2.8.0 torchaudio==2.8.0

# OmniVoice
pip install omnivoice
```

Verify GPU/MPS detection:

```python
import torch
print("CUDA:", torch.cuda.is_available())
print("MPS:", torch.backends.mps.is_available())
```
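If you want one script that runs on both backends, here's a minimal device-selection sketch. The `"mps"` value for `device_map` is an assumption mirroring the `"cuda:0"` usage above; float16 support on MPS is uneven, so it falls back to float32 there:

```python
import torch
from omnivoice import OmniVoice

# Pick the best available backend. The "mps" device string is an
# assumption based on the MPS support noted earlier; float32 on MPS
# is a conservative choice since float16 there can be flaky.
if torch.cuda.is_available():
    device, dtype = "cuda:0", torch.float16
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float32
else:
    device, dtype = "cpu", torch.float32

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map=device, dtype=dtype)
```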

Fast Path: Web Demo

The fastest way to validate your setup is the bundled Gradio demo.

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Navigate to http://localhost:8001 and test all three modes through a UI.

CLI Tools

Besides the Python API, OmniVoice ships with two CLI tools:

```bash
# Single inference
omnivoice-infer --model k2-fsa/OmniVoice \
    --text "Hello world." \
    --ref_audio ref.wav \
    --output hello.wav

# Multi-GPU batch inference
omnivoice-infer-batch --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

Batch format (JSONL):

{"id": "clip_001", "text": "First clip.", "ref_audio": "ref.wav"}
{"id": "clip_002", "text": "Second clip.", "ref_audio": "ref.wav"}
Enter fullscreen mode Exit fullscreen mode

Perfect for audiobook generation or large-scale narration pipelines.
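To build that file for a longer project, a small sketch (field names follow the batch format above; the chapter list is placeholder data):

```python
import json

# Placeholder chapter texts - swap in your real content
chapters = ["Chapter one text...", "Chapter two text..."]

# Write one JSON object per line, matching the batch format above
with open("test.jsonl", "w", encoding="utf-8") as f:
    for i, text in enumerate(chapters, start=1):
        record = {"id": f"clip_{i:03d}", "text": text, "ref_audio": "ref.wav"}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```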

Expression Control with Inline Tokens

Drop non-verbal tokens anywhere in the text:

text = "That's hilarious [laughter] but also a bit concerning [sigh]."
audio = model.generate(text=text, ref_audio="ref.wav")
Enter fullscreen mode Exit fullscreen mode

Available tags include [laughter], [sigh], [question-ah], [surprise-wa].

Pronunciation Override

For homophones and proper nouns, override pronunciation directly.

English (CMU notation)

# "bass" as musical instrument, not low-frequency sound
audio = model.generate(text="He plays [B EY1 S] guitar.")
Enter fullscreen mode Exit fullscreen mode

Chinese (Pinyin with tone numbers)

```python
# Force tone 2: 打折 ("discount") with 折 written as ZHE2
audio = model.generate(text="打ZHE2")
```

Architecture Notes

OmniVoice uses a Diffusion Language Model hybrid architecture. It's neither pure diffusion nor pure autoregressive — it combines the quality benefits of diffusion with the speed advantages of LLM-style generation. The base model is Qwen3-0.6B, making it light enough for consumer hardware while leveraging the language understanding of a modern LLM.

This is a different direction from previous open-source TTS projects (Bark, XTTS, F5-TTS), and it seems to be paying off in both quality and inference speed.

Production Tips from Issue #44

The community has been active on GitHub Issue #44 discussing real-world usage.

Voice Design consistency. Each call produces slightly different timbre. Generate once, save the output, then reuse as ref_audio to lock in a consistent voice for an entire project.
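A minimal sketch of that workflow, continuing the earlier examples (`model` and `sf` are already in scope; file names are placeholders):

```python
# 1) Design the voice once and save the output as a seed clip
seed = model.generate(
    text="This sentence locks in the designed voice.",
    instruct="female, low pitch, british accent",
)
sf.write("voice_seed.wav", seed[0], 24000)

# 2) Reuse the saved clip as the reference for every later call
audio = model.generate(text="Later lines reuse the same timbre.", ref_audio="voice_seed.wav")
```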

Prompt caching. Use create_voice_clone_prompt to precompute reference audio encodings once, then reuse the cached prompt for repeated generation. Critical for throughput on long-form content.
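A sketch of the idea: `create_voice_clone_prompt` is named in the discussion, but the arguments and the way the cached prompt is passed back in are assumptions here, so check the project docs for the exact signature:

```python
# ASSUMED signature - the function name comes from the project, but the
# kwargs are illustrative. Encode the reference audio once up front...
prompt = model.create_voice_clone_prompt(ref_audio="ref.wav")

# ...then reuse the cached prompt instead of re-encoding per call
for i, line in enumerate(["First line.", "Second line."]):
    audio = model.generate(text=line, prompt=prompt)
    sf.write(f"line_{i}.wav", audio[0], 24000)
```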

Number normalization. Raw digits like "123" can produce inconsistent output. Normalize to words ("one hundred twenty-three") using WeTextProcessing or similar before passing text to the model.
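If WeTextProcessing is heavier than you need for English, here's a lightweight sketch using the `num2words` package (an alternative, not the tool named above):

```python
import re
from num2words import num2words  # pip install num2words

def normalize_numbers(text: str) -> str:
    # Spell out each standalone digit run before passing text to TTS
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(normalize_numbers("Call 123 before 9."))
# -> "Call one hundred and twenty-three before nine."
```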

Cross-lingual accent bleed. If you use a Korean reference to generate English, the output has a Korean accent. For neutral target language accent, use native-speaker references.

Who This Is For

OmniVoice fits several developer profiles well:

  • Solo developers and indie hackers who were paying commercial TTS subscriptions just for hobby projects
  • AI agent builders needing voice output without vendor lock-in
  • Content creators doing multilingual localization (YouTube, podcasts)
  • Voice cloning experiments where 3-second references unlock a lot of creative possibilities
  • Low-resource language applications (the 600+ language coverage includes many languages with no good commercial option)

What's Next

The k2-fsa ecosystem has a strong track record of long-term maintenance (Kaldi is still actively used 15+ years after release). That matters when you're deciding whether to build production infrastructure on a new model.

If you're evaluating TTS options, OmniVoice deserves a spot in the comparison. The combination of 600+ language support, 40x real-time inference, and Apache 2.0 licensing is genuinely rare in open source TTS today.


Have you tried OmniVoice yet? Would love to hear how it compares to your current TTS setup in the comments.
