SupaCtx

How We Built a Free Voice Cloning Tool That Supports 646 Languages

If you've ever tried to add multilingual text-to-speech to your app, you know the pain: ElevenLabs caps at 32 languages, PlayHT at 132, and the pricing scales fast. We built OmniVoice — a free, open-source voice generator that covers 646 languages with zero-shot voice cloning. Here's what we learned.

The Problem

Most TTS APIs force you to choose between quality and coverage. Want natural-sounding English? Easy. Want the same quality in Yoruba, Kazakh, or Cantonese? Good luck. And if you need voice cloning across languages — where a speaker's voice stays consistent regardless of the language — you're basically out of options.

The Architecture

OmniVoice uses a non-autoregressive diffusion language model — a single-stage architecture that skips the typical two-step "text → tokens → audio" pipeline. Key design decisions:

  • Qwen3-0.6B as text encoder — LLM initialization dramatically improves intelligibility across languages
  • Full-codebook random masking — the diffusion process operates on all codebook levels simultaneously, avoiding the quality degradation of cascaded approaches
  • 581k hours of open-source training data — no proprietary datasets

The result: 2.85% WER (vs. ElevenLabs' 10.95%) and 0.830 speaker similarity (vs. 0.655) on standardized benchmarks.
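The "full-codebook random masking" choice is easier to see in code. Here's a toy sketch of the idea, not OmniVoice's actual internals (shapes, the mask token id, and the helper name are our assumptions): a single random mask is drawn over every (codebook level, frame) position at once, so denoising happens across all levels jointly rather than cascading coarse-to-fine.

```python
import numpy as np

def random_mask_all_codebooks(tokens, mask_ratio, mask_id=-1, rng=None):
    """Toy sketch: sample one Bernoulli mask over ALL codebook levels and
    frames simultaneously, instead of masking level-by-level the way
    cascaded coarse-to-fine pipelines do. Shapes and mask_id are
    illustrative assumptions, not OmniVoice internals."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(tokens.shape) < mask_ratio  # (levels, frames)
    return np.where(mask, mask_id, tokens), mask

# 8 codebook levels x 50 frames of fake audio tokens
tokens = np.arange(8 * 50).reshape(8, 50)
masked, mask = random_mask_all_codebooks(tokens, mask_ratio=0.5)
```

Because every level is corrupted and denoised together, there is no early level whose errors compound into later ones, which is the degradation mode the bullet above refers to.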

Voice Cloning in 3 Lines

from omnivoice import OmniVoice

engine = OmniVoice()
engine.tts(
    text="Hello from OmniVoice",
    reference_audio="speaker.wav",  # 3-30 seconds of audio
    output="output.wav"
)

That's it. No fine-tuning, no training, no API keys. The model clones the voice from 3-30 seconds of reference audio and works cross-lingually — record in English, generate in Japanese.
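Since the cloner expects 3-30 seconds of reference audio, a cheap sanity check before calling `engine.tts` can save a wasted run. This helper is our own sketch, not part of the OmniVoice API; only the 3-30 second window comes from the docs above. It uses the stdlib `wave` module:

```python
import wave

def reference_duration_ok(path, min_s=3.0, max_s=30.0):
    """Return True if a WAV reference clip falls inside the 3-30 s window
    the cloner expects. Helper name and keyword bounds are ours; the
    range itself is from the OmniVoice docs."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return min_s <= duration <= max_s
```

For compressed formats (MP3, FLAC) you would swap in a decoder such as soundfile or ffprobe, since `wave` only reads uncompressed WAV.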

Voice Design (No Audio Needed)

This is the feature that surprised us most during development. You can create entirely new voices from text descriptions:

engine.tts(
    text="Welcome to the future of speech",
    voice_design="A young female speaker with a British accent, medium pitch, calm and professional tone",
    output="designed_voice.wav"
)

Combine gender, age, pitch, accents (10 English variants, 12 Chinese dialects), and speaking styles freely.

Performance

On a single GPU, OmniVoice runs at RTF 0.025 (~40x real-time). A 10-second clip generates in ~250ms. For production deployments, the OpenAI-compatible REST API wrapper (OmniVoice-local) makes integration straightforward:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "voice": "reference_speaker",
    "model": "omnivoice"
  }'
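Because the wrapper speaks the OpenAI audio API shape, the curl call translates directly to Python. A stdlib-only sketch (the endpoint, port, and JSON field names mirror the curl example above; the helper name is ours):

```python
import json
from urllib import request

def build_speech_request(text, voice="reference_speaker", model="omnivoice",
                         base_url="http://localhost:8000"):
    """Build a POST to the OpenAI-compatible /v1/audio/speech endpoint,
    mirroring the curl example. Assumes OmniVoice-local is listening on
    localhost:8000."""
    payload = json.dumps({"input": text, "voice": voice, "model": model})
    return request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (requires a running OmniVoice-local server):
# with request.urlopen(build_speech_request("Hello world")) as resp:
#     open("speech.wav", "wb").write(resp.read())
```

Any OpenAI SDK client should also work by pointing its base URL at the local server, which is the point of keeping the endpoint shape compatible.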

One Caveat

The Higgs-audio tokenizer (from Boson AI) requires an extended license if you exceed 100k monthly active users. Below that threshold, it's fully free under Apache 2.0.


We'd love feedback from anyone working on multilingual apps, accessibility tools, or content localization. What languages or features would matter most for your use case?
