Samuel Oyerinde

Fine-Tuning OmniVoice for Yoruba Zero-Shot Voice Cloning: Lessons from 9.6 Hours of Speech Data

When you move beyond high-resource languages and start working with low-resource, tonal languages like Yoruba, many of the assumptions behind standard TTS pipelines begin to fail.

Over the past week, I fine-tuned OmniVoice for Yoruba using approximately 9.6 hours of speech data collected from 156 speakers. The goal was to build a model capable of zero-shot voice cloning while preserving the tonal characteristics that are fundamental to Yoruba speech.

This article documents the dataset construction process, the engineering decisions that mattered, why some popular multilingual baselines struggle with Yoruba, and how I evaluated synthesis quality.


Why Yoruba is Different

Unlike English, Yoruba is a tonal language.

Words that appear similar can have completely different meanings depending on tone:

Word | Meaning
ọkọ | husband
ọkọ́ | hoe
ọkọ̀ | vehicle

For humans, tone is obvious.

For TTS models, tone must be learned through the interaction between:

  • Text representations
  • Acoustic patterns
  • Prosodic modeling

A model that ignores tonal information will often generate speech that sounds clear but carries the wrong tones, and therefore the wrong words.

This creates unique challenges for text normalization, tokenization, and model training.


Dataset Construction

To create a balanced Yoruba speech corpus, I merged two complementary datasets:

1. Google Yoruba Speech Dataset (OpenSLR 86)

A collection of high-quality studio recordings with clean acoustic conditions.

Benefits:

  • Consistent microphone setup
  • Minimal background noise
  • Reliable pronunciation

2. IroyinSpeech

A crowd-sourced Yoruba speech dataset containing diverse accents and speaking styles.

Benefits:

  • Greater speaker diversity
  • Real-world recording conditions
  • More representative pronunciation variation

Final Dataset Statistics

Metric | Value
Total utterances | 6,989
Total duration | ~9.6 hours
Total speakers | 156
Sample rate | 24 kHz
Channels | Mono
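
As a quick sanity check, the averages implied by these figures can be worked out directly. This is a minimal sketch that uses only the numbers in the table; the variable names are illustrative, and because the 9.6 hours is approximate, the results are rough.

```python
# Rough averages implied by the dataset statistics (9.6 h is approximate).
total_utterances = 6_989
total_hours = 9.6
total_speakers = 156

total_seconds = total_hours * 3600
avg_utterance_seconds = total_seconds / total_utterances
avg_utterances_per_speaker = total_utterances / total_speakers
avg_minutes_per_speaker = total_hours * 60 / total_speakers

print(f"Average utterance length: {avg_utterance_seconds:.1f} s")          # 4.9 s
print(f"Average utterances per speaker: {avg_utterances_per_speaker:.1f}")  # 44.8
print(f"Average audio per speaker: {avg_minutes_per_speaker:.1f} min")      # 3.7 min
```

These are means only; how the audio is actually distributed across the 156 speakers is not captured by these three numbers.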

The idea was simple:

Use studio-quality recordings to establish acoustic stability while using crowd-sourced speech to improve speaker diversity and robustness.


Data Pipeline

The preprocessing pipeline looked like this:

Google OpenSLR 86 (Studio)
          +
IroyinSpeech (Crowd)
          │
          ▼
     24 kHz Mono WAV
          │
          ▼
 Discrete Audio Tokenization
 (HiggsAudio 8-Codebook VQ)
          │
          ▼
     OmniVoice Training

Rather than feeding raw waveforms directly into the language model, OmniVoice operates on discrete audio tokens generated through vector quantization.

For this project, I used the HiggsAudio 8-Codebook VQ tokenizer, which converts continuous speech into sequences of discrete acoustic symbols that can be modeled autoregressively.
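
To make the idea of vector quantization concrete, here is a toy sketch in plain Python. It is not the HiggsAudio tokenizer and makes no claim about how that tokenizer is built: the two-dimensional codebook, the frame values, and the function names are all made up for illustration. It only shows the core step of replacing each continuous frame with the index of its nearest codebook entry.

```python
# Toy vector quantization: map each continuous frame to the index of its
# nearest codebook entry. Illustrative only; NOT the HiggsAudio tokenizer.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frames, codebook):
    """Return one discrete token (a codebook index) per input frame."""
    tokens = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(frame, codebook[i]),
        )
        tokens.append(nearest)
    return tokens

# Hypothetical codebook with 4 entries in a 2-dimensional feature space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical continuous features for 5 audio frames.
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 0.9), (0.6, 0.1)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 3, 1]
```

A multi-codebook tokenizer would presumably emit several such indices per frame rather than one; the details of how the eight codebooks are arranged are outside the scope of this sketch.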


Understanding the Zero-Shot Paradigm

One common misconception about fine-tuning voice cloning models is that fine-tuning teaches the model a single voice.

That is not what happens here.

Even after fine-tuning, the model remains a zero-shot voice cloning system.

The architecture consists of:

  • Qwen3-0.6B as the autoregressive language model
  • Discrete audio tokens as targets
  • Reference audio for speaker conditioning

During training, the model learns:

Yoruba Text
     +
Audio Tokens
     ↓
Speech Generation

At inference time:

Text Prompt
     +
5–10 Second Reference Clip
     ↓
Cloned Speaker Output

The reference clip provides speaker identity dynamically.

This means the model can synthesize speech in voices it has never seen before, provided there is a short reference sample.


Diacritics Are Acoustic Features, Not Formatting

This was arguably the most important lesson from the project.

Many NLP pipelines treat diacritics as optional formatting.

For Yoruba TTS, this assumption is catastrophic.

Characters such as:

ẹ
ọ
ṣ
à
á
ì
ó

are not stylistic variations.

They encode phonological and tonal information directly tied to pronunciation.

Removing diacritics forces the model to map identical text sequences to multiple conflicting acoustic realizations.

The result is:

  • Tone confusion
  • Prosody degradation
  • Lower intelligibility
  • Inconsistent synthesis
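
The collision is easy to demonstrate. The sketch below is a minimal illustration using Python's standard `unicodedata` module; `strip_diacritics` is a hypothetical helper that mimics what a diacritic-unaware preprocessing step might do, not a function from this project's pipeline.

```python
import unicodedata

def strip_diacritics(text):
    """Drop every combining mark, as a diacritic-unaware pipeline might."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Three distinct Yoruba words that differ only in their marks.
words = {
    "ọkọ": "husband",
    "ọkọ́": "hoe",
    "ọkọ̀": "vehicle",
}

stripped = {strip_diacritics(word) for word in words}
print(stripped)  # {'oko'}
```

All three words collapse to the same string, so a model trained on stripped text would see one input paired with three different pronunciations.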

Safeguard #1: Unicode NFC Normalization

Before training, all transcripts were normalized using Unicode Normalization Form C (NFC).

Without normalization:

ẹ

can be represented as:

e + U+0323

instead of the single precomposed code point U+1EB9.

This can silently break tokenization and create inconsistent text representations.

Using NFC ensured that all Yoruba characters were encoded consistently across the corpus.
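
A minimal sketch of this step, using Python's standard `unicodedata` module, might look like the following. The function name `normalize_transcript` is illustrative and not taken from the project's code.

```python
import unicodedata

def normalize_transcript(text):
    """Return the transcript in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "e\u0323"  # 'e' followed by COMBINING DOT BELOW
composed = "\u1eb9"     # LATIN SMALL LETTER E WITH DOT BELOW

print(decomposed == composed)                        # False
print(normalize_transcript(decomposed) == composed)  # True

# The order in which combining marks were typed stops mattering too.
a = normalize_transcript("e\u0301\u0323")  # acute first, then dot below
b = normalize_transcript("e\u0323\u0301")  # dot below first, then acute
print(a == b, len(a))                                # True 2
```

Note the last line: some tone-marked vowels have no single precomposed code point, so NFC still leaves a combining mark after the base letter. What NFC guarantees is one consistent encoding per character, not one code point per character.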


Safeguard #2: Tokenizer Vocabulary Audit

Before launching training, I audited the tokenizer vocabulary to verify that Yoruba characters existed as native tokens.

Specifically, I checked support for:

ẹ
ọ
ṣ
à
á
è
é
ì
í
ò
ó
ù
ú

The objective was to ensure that none of these characters were being mapped to:

<UNK>

Unknown-token substitutions would effectively erase tonal information before the model even saw the text.
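
An audit along these lines can be sketched in a tokenizer-agnostic way. The code below is an assumption about how such a check could be written, not the script used in this project: `audit_characters` takes any `encode` and `decode` callables, and `ToyTokenizer` is a hypothetical character-level tokenizer included only so the example runs on its own.

```python
import unicodedata

YORUBA_CHARS = ["ẹ", "ọ", "ṣ", "à", "á", "è", "é", "ì", "í", "ò", "ó", "ù", "ú"]

def audit_characters(chars, encode, decode, unk_id=None):
    """Return characters that hit <UNK> or fail an encode/decode round trip."""
    problems = []
    for ch in chars:
        ch = unicodedata.normalize("NFC", ch)
        ids = encode(ch)
        hits_unk = unk_id is not None and unk_id in ids
        if hits_unk or decode(ids) != ch:
            problems.append(ch)
    return problems

class ToyTokenizer:
    """Hypothetical character-level tokenizer, used only to exercise the audit."""

    def __init__(self, vocab):
        vocab = unicodedata.normalize("NFC", vocab)
        self.unk_id = 0
        self.id_to_char = ["<UNK>"] + sorted(set(vocab))
        self.char_to_id = {c: i for i, c in enumerate(self.id_to_char)}

    def encode(self, text):
        return [self.char_to_id.get(c, self.unk_id) for c in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)

# 'ṣ' is deliberately left out of the toy vocabulary.
toy = ToyTokenizer("abcdefghijklmnopqrstuvwxyzẹọàáèéìíòóùú")
missing = audit_characters(YORUBA_CHARS, toy.encode, toy.decode, toy.unk_id)
print(missing)  # ['ṣ']
```

With a real tokenizer, its own encode and decode functions would be passed in, with special tokens disabled so the round trip is comparable. Byte-level BPE tokenizers typically never emit an unknown token at all, in which case the round-trip comparison is the part of the check that matters.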


Why Meta's MMS Struggles with Modern Yoruba

Meta's Massive Multilingual Speech (MMS) project is one of the most significant contributions to multilingual speech technology.

However, it exposes an important issue that affects many low-resource languages:

Domain Mismatch

A substantial portion of publicly available Yoruba speech data originates from:

  • Bible readings
  • Religious recordings
  • Formal narration

These datasets are valuable linguistically but represent a narrow style of speech.

As a result, models trained primarily on such data often produce speech that sounds:

  • Formal
  • Archaic
  • Unnaturally scripted

When generating contemporary Yoruba conversations, this mismatch becomes noticeable.


Why OpenSLR + IroyinSpeech Helped

The datasets used in this project contain:

  • News content
  • Modern writing
  • Everyday language patterns

This broader linguistic coverage produced synthesis that sounded substantially closer to modern spoken Yoruba.

The improvement was especially noticeable in:

  • Prosody
  • Phrase rhythm
  • Conversational flow

ASR-in-the-Loop Evaluation

Evaluating TTS remains one of the hardest problems in speech research.

The gold standard remains:

Mean Opinion Score (MOS)

Human listeners rate speech quality and naturalness.

The downside:

  • Expensive
  • Slow
  • Difficult to scale

To obtain a faster quantitative signal, I implemented an ASR-in-the-loop evaluation pipeline.

Text Prompt
      │
      ▼
[ Fine-Tuned OmniVoice ]
      │
      ▼
Synthesized Audio
      │
      ▼
[ Fine-Tuned Yoruba ASR ]
      │
      ▼
Word Error Rate (WER)

The ASR model used was:

ccibeekeocwhisper-small-yoruba-07-17
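
The scoring step at the end of this loop can be sketched as follows. This is a minimal, self-contained WER implementation under stated assumptions: the normalization shown here (NFC, lowercasing, stripping ASCII punctuation, keeping all diacritics) is a guess at what "normalized" means and may differ from the rules actually used, and the sentence pair is a made-up example rather than real model output.

```python
import string
import unicodedata

def normalize_text(text):
    """NFC-normalize, lowercase, strip ASCII punctuation; diacritics are kept."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize_text(reference), normalize_text(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)

# Made-up pair: the hypothesis loses the low tone on the first word.
reference = "Ọkọ̀ náà ti dé."
hypothesis = "ọkọ náà ti dé"
print(word_error_rate(reference, hypothesis))  # 0.25
```

Because diacritics survive normalization, a single dropped tone mark counts as a word error, which is what makes this metric sensitive to tonal mistakes rather than only to segmental ones.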


Results

The synthesized speech achieved:

Normalized WER ≈ 11.5%

This is not a replacement for MOS evaluation.

However:

If an independent, diacritic-aware Yoruba ASR system can accurately recover the original text from synthesized speech, that provides strong evidence that the speech is intelligible and linguistically faithful.

For rapid iteration, this evaluation loop proved extremely useful.


Key Takeaways

After training and evaluation, three conclusions stood out:

1. Data Quality Matters More Than Model Size

The combination of studio recordings and crowd-sourced diversity provided stronger gains than simply increasing parameter count.

2. Diacritics Must Be Preserved End-to-End

For Yoruba TTS:

Diacritics are acoustic features.

Treating them as optional formatting introduces ambiguity the model cannot reliably resolve.

3. Domain Coverage Matters

Corpora built from news and contemporary writing produce noticeably more natural synthesis than corpora dominated by religious or formal speech.


What's Next?

With the TTS pipeline stable, the next step is building the reverse direction:

Streaming Yoruba ASR

The current roadmap includes:

  • Low-latency transcription
  • WebSocket streaming
  • Silero VAD
  • Real-time inference
  • Code-switch handling
  • Speaker diarization

The long-term goal is a production-grade speech platform supporting:

  • Yoruba ASR
  • Yoruba TTS
  • Voice cloning
  • Conversational AI
  • Educational speech applications

Resources

Model

Sam4rano/omnivoice-yoruba-tts

Repository

Search for:

sam4rano_tts

Tags

#Yoruba
#TTS
#SpeechSynthesis
#VoiceCloning
#ASR
#MachineLearning
#DeepLearning
#NLP
#SpeechAI
#OpenSource
#LowResourceLanguages

If you're working on speech technology for African or other low-resource languages, I'd be interested in hearing how you're handling tone preservation, tokenizer design, and evaluation beyond traditional WER and MOS metrics.
