Samuel Oyerinde

Posted on Jul 2

Fine-Tuning OmniVoice for Yoruba Zero-Shot Voice Cloning: Lessons from 9.6 Hours of Speech Data

#ai #llm #gpt3

When you move beyond high-resource languages and start working with low-resource, tonal languages like Yoruba, many of the assumptions behind standard TTS pipelines begin to fail.

Over the past week, I fine-tuned OmniVoice for Yoruba using approximately 9.6 hours of speech data collected from 156 speakers. The goal was to build a model capable of zero-shot voice cloning while preserving the tonal characteristics that are fundamental to Yoruba speech.

This article documents the dataset construction process, the engineering decisions that mattered, why some popular multilingual baselines struggle with Yoruba, and how I evaluated synthesis quality.

Why Yoruba is Different

Unlike English, Yoruba is a tonal language.

Words that appear similar can have completely different meanings depending on tone:

Word	Meaning
ọkọ	husband
ọkọ́	hoe
ọkọ̀	vehicle
ọ̀kọ̀	spear

For humans, tone is obvious.

For TTS models, tone must be learned through the interaction between:

Text representations
Acoustic patterns
Prosodic modeling

A model that ignores tonal information will often generate speech that is acoustically clear but semantically wrong: the listener hears a different word from the one intended.

This creates unique challenges for text normalization, tokenization, and model training.

Dataset Construction

To create a balanced Yoruba speech corpus, I merged two complementary datasets:

1. Google Yoruba Speech Dataset (OpenSLR 86)

A collection of high-quality studio recordings with clean acoustic conditions.

Benefits:

Consistent microphone setup
Minimal background noise
Reliable pronunciation

2. IroyinSpeech

A crowd-sourced Yoruba speech dataset containing diverse accents and speaking styles.

Benefits:

Greater speaker diversity
Real-world recording conditions
More representative pronunciation variation

Final Dataset Statistics

Metric	Value
Total Utterances	6,989
Total Duration	~9.6 Hours
Total Speakers	156
Sample Rate	24 kHz
Channels	Mono
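As a rough sanity check, the averages implied by these figures can be worked out directly. The sketch below uses only the numbers in the table; because the total duration is approximate, the results are approximate too.

```python
# Figures taken from the statistics table; total duration is approximate.
total_hours = 9.6
total_utterances = 6989
total_speakers = 156

# Average utterance length in seconds.
avg_utterance_sec = total_hours * 3600 / total_utterances

# Average amount of audio per speaker in minutes.
avg_speaker_min = total_hours * 60 / total_speakers

print(f"~{avg_utterance_sec:.1f} s per utterance")
print(f"~{avg_speaker_min:.1f} min per speaker")
```

Roughly five seconds per utterance and under four minutes per speaker on average: a corpus of many short clips from many voices, which suits speaker-conditioned training better than single-speaker modeling.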

The idea was simple:

Use studio-quality recordings to establish acoustic stability while using crowd-sourced speech to improve speaker diversity and robustness.

Data Pipeline

The preprocessing pipeline looked like this:

Google OpenSLR 86 (Studio)
          +
IroyinSpeech (Crowd)
          │
          ▼
     24 kHz Mono WAV
          │
          ▼
 Discrete Audio Tokenization
 (HiggsAudio 8-Codebook VQ)
          │
          ▼
     OmniVoice Training

Rather than feeding raw waveforms directly into the language model, OmniVoice operates on discrete audio tokens generated through vector quantization.

For this project, I used the HiggsAudio 8-Codebook VQ tokenizer, which converts continuous speech into sequences of discrete acoustic symbols that can be modeled autoregressively.
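Before tokenization, it can be worth confirming that every file really matches the 24 kHz mono target, since merged corpora often arrive in mixed formats. The following is a minimal sketch using only the Python standard library; `find_format_mismatches` is a hypothetical helper name, not part of OmniVoice or the HiggsAudio tooling, and it assumes the files are plain PCM WAV.

```python
import wave
from pathlib import Path

TARGET_RATE = 24_000   # Hz, as stated in the dataset statistics
TARGET_CHANNELS = 1    # mono


def find_format_mismatches(wav_dir):
    """Return (filename, sample_rate, channels) for every WAV file under
    wav_dir that is not 24 kHz mono."""
    mismatches = []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        # The wave module only reads PCM WAV headers; other encodings
        # raise wave.Error and would need a different reader.
        with wave.open(str(path), "rb") as wf:
            rate = wf.getframerate()
            channels = wf.getnchannels()
        if rate != TARGET_RATE or channels != TARGET_CHANNELS:
            mismatches.append((path.name, rate, channels))
    return mismatches
```

An empty result means the directory is safe to hand to the tokenizer; anything else points at files that still need resampling or downmixing.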

Understanding the Zero-Shot Paradigm

One common misconception about fine-tuning voice cloning models is that fine-tuning teaches the model a single voice.

That is not what happens here.

Even after fine-tuning, the model remains a zero-shot voice cloning system.

The architecture consists of:

Qwen3-0.6B as the autoregressive language model
Discrete audio tokens as targets
Reference audio for speaker conditioning

During training, the model learns:

Yoruba Text
     +
Audio Tokens
     ↓
Speech Generation

At inference time:

Text Prompt
     +
5–10 Second Reference Clip
     ↓
Cloned Speaker Output

The reference clip provides speaker identity dynamically.

This means the model can synthesize speech in voices it never encountered during training, provided a short reference sample is available.

Diacritics Are Acoustic Features, Not Formatting

This was arguably the most important lesson from the project.

Many NLP pipelines treat diacritics as optional formatting.

For Yoruba TTS, this assumption is catastrophic.

Characters such as:

ẹ
ọ
ṣ
à
á
ì
ó

are not stylistic variations.

They encode phonological and tonal information directly tied to pronunciation.

Removing diacritics forces the model to map identical text sequences to multiple conflicting acoustic realizations.

The result is:

Tone confusion
Prosody degradation
Lower intelligibility
Inconsistent synthesis
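The collision is easy to reproduce. The sketch below is illustrative and not part of the training code: `strip_diacritics` and `find_collisions` are hypothetical helper names, and the word list is a small hand-picked set of tonal variants built on the same three letters.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove every combining mark (tone marks and under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(words):
    """Group words by their diacritic-free spelling and keep only the
    groups where several distinct words collapse onto one string."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_diacritics(word)].add(unicodedata.normalize("NFC", word))
    return {plain: forms for plain, forms in groups.items() if len(forms) > 1}


words = ["ọkọ", "ọkọ́", "ọkọ̀", "ọ̀kọ̀"]
for plain, forms in find_collisions(words).items():
    print(plain, "<-", sorted(forms))
```

Every variant collapses to the same bare spelling, so a model trained on stripped text would see one input paired with several different tonal contours.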

Safeguard #1: Unicode NFC Normalization

Before training, all transcripts were normalized using Unicode Normalization Form C (NFC).

Without normalization:

ẹ

can be represented as:

e + U+0323

instead of a single Unicode codepoint.

This can silently break tokenization and create inconsistent text representations.

Using NFC ensured that all Yoruba characters were encoded consistently across the corpus.
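A minimal sketch of this step with Python's standard `unicodedata` module is shown below; `normalize_transcript` is a hypothetical helper name, not necessarily how the project's preprocessing is organized. One detail worth keeping in mind: some Yoruba characters, such as an under-dotted vowel that also carries a tone mark, have no single precomposed codepoint, so NFC makes them consistent without making them single-codepoint.

```python
import unicodedata


def normalize_transcript(text):
    """Return the transcript in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0323"   # 'e' followed by COMBINING DOT BELOW
composed = "\u1eb9"      # precomposed LATIN SMALL LETTER E WITH DOT BELOW

# The two strings render identically but compare as different.
print(decomposed == composed)
print(normalize_transcript(decomposed) == composed)

# Under-dot plus acute has no precomposed form: NFC yields two codepoints,
# but always the same two, whatever order the marks were typed in.
a = normalize_transcript("e\u0323\u0301")
b = normalize_transcript("e\u0301\u0323")
print(a == b, len(a))
```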

Safeguard #2: Tokenizer Vocabulary Audit

Before launching training, I audited the tokenizer vocabulary to verify that Yoruba characters existed as native tokens.

Specifically, I checked support for:

ẹ
ọ
ṣ
à
á
è
é
ì
í
ò
ó
ù
ú

The objective was to ensure that none of these characters were being mapped to:

<UNK>

Unknown-token substitutions would effectively erase tonal information before the model even saw the text.
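An audit of this kind can be sketched independently of any particular tokenizer library. In the example below, `audit_characters` and `ToyTokenizer` are hypothetical names introduced for illustration; the toy tokenizer is a deliberately incomplete character-level vocabulary, not the tokenizer used in the project. The audit flags a character if it encodes to the unknown-token id or if it does not survive an encode/decode round trip.

```python
import unicodedata

# The characters listed above, normalized to NFC so the audit sees the
# same form as the training transcripts.
YORUBA_CHARS = list(unicodedata.normalize("NFC", "ẹọṣàáèéìíòóùú"))


def audit_characters(chars, encode, decode, unk_id=None):
    """Return the characters that map to unk_id or fail a round trip."""
    failures = []
    for ch in chars:
        ids = encode(ch)
        hits_unk = unk_id is not None and unk_id in ids
        if hits_unk or decode(ids) != ch:
            failures.append(ch)
    return failures


class ToyTokenizer:
    """Character-level tokenizer for demonstration only (hypothetical)."""

    def __init__(self, alphabet, unk_id=0):
        self.unk_id = unk_id
        self.id_to_char = {i + 1: c for i, c in enumerate(alphabet)}
        self.char_to_id = {c: i for i, c in self.id_to_char.items()}

    def encode(self, text):
        return [self.char_to_id.get(c, self.unk_id) for c in text]

    def decode(self, ids):
        return "".join(self.id_to_char.get(i, "<UNK>") for i in ids)


# Build a vocabulary that is missing U+1E63 (s with dot below) on purpose.
toy = ToyTokenizer("".join(c for c in YORUBA_CHARS if c != "\u1e63"))
missing = audit_characters(YORUBA_CHARS, toy.encode, toy.decode, toy.unk_id)
print(missing)
```

With a Hugging Face tokenizer, the same function could be called with `lambda s: tok.encode(s, add_special_tokens=False)`, `tok.decode`, and `tok.unk_token_id`. Byte-level BPE tokenizers typically never emit an unknown token, so for those the round-trip check is the more informative signal.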

Why Meta's MMS Struggles with Modern Yoruba

Meta's Massive Multilingual Speech (MMS) project is one of the most significant contributions to multilingual speech technology.

However, it exposes an important issue that affects many low-resource languages:

Domain Mismatch

A substantial portion of publicly available Yoruba speech data originates from:

Bible readings
Religious recordings
Formal narration

These datasets are valuable linguistically but represent a narrow style of speech.

As a result, models trained primarily on such data often produce speech that sounds:

Formal
Archaic
Unnaturally scripted

When generating contemporary Yoruba conversations, this mismatch becomes noticeable.

Why OpenSLR + IroyinSpeech Helped

The datasets used in this project contain:

News content
Modern writing
Everyday language patterns

This broader linguistic coverage produced synthesis that sounded substantially closer to modern spoken Yoruba.

The improvement was especially noticeable in:

Prosody
Phrase rhythm
Conversational flow

ASR-in-the-Loop Evaluation

Evaluating TTS remains one of the hardest problems in speech research.

The gold standard is:

Mean Opinion Score (MOS)

Human listeners rate speech quality and naturalness.

The downside:

Expensive
Slow
Difficult to scale

To obtain a faster quantitative signal, I implemented an ASR-in-the-loop evaluation pipeline.

Text Prompt
      │
      ▼
[ Fine-Tuned OmniVoice ]
      │
      ▼
Synthesized Audio
      │
      ▼
[ Fine-Tuned Yoruba ASR ]
      │
      ▼
Word Error Rate (WER)

The ASR model used was:

ccibeekeocwhisper-small-yoruba-07-17
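The scoring step at the end of this loop can be sketched in plain Python. The article does not spell out what "normalized" covers, so the normalization here (NFC, lowercasing, punctuation removal) is an assumption, and `normalize_for_wer` and `word_error_rate` are hypothetical helper names. Tone marks and under-dots are deliberately kept, since dropping them would hide exactly the errors that matter for Yoruba.

```python
import unicodedata


def normalize_for_wer(text):
    """Assumed normalization: NFC, lowercase, strip punctuation, split on
    whitespace. Combining marks are preserved."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return "".join(kept).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = normalize_for_wer(reference)
    hyp = normalize_for_wer(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# The ASR output drops the tone mark on the first word, which counts as
# one substitution out of four reference words.
print(word_error_rate("Ọkọ̀ náà ti dé.", "ọkọ náà ti dé"))
```

For a corpus-level figure, the usual convention is to sum the edit distances over all utterances and divide by the total number of reference words, not to average per-utterance rates.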

Results

The synthesized speech achieved:

Normalized WER ≈ 11.5%

This is not a replacement for MOS evaluation.

However:

If an independent, diacritic-aware Yoruba ASR system can accurately recover the original text from synthesized speech, that provides strong evidence that the speech is intelligible and linguistically faithful.

For rapid iteration, this evaluation loop proved extremely useful.

Key Takeaways

After training and evaluation, three conclusions stood out:

1. Data Quality Matters More Than Model Size

The combination of studio recordings and crowd-sourced diversity provided stronger gains than simply increasing parameter count.

2. Diacritics Must Be Preserved End-to-End

For Yoruba TTS:

Diacritics are acoustic features.

Treating them as optional formatting introduces ambiguity the model cannot reliably resolve.

3. Domain Coverage Matters

Corpora that cover modern, everyday language produce noticeably more natural synthesis than corpora dominated by religious or formal speech.

What's Next?

With the TTS pipeline stable, the next step is building the reverse direction:

Streaming Yoruba ASR

The current roadmap includes:

Low-latency transcription
WebSocket streaming
Silero VAD
Real-time inference
Code-switch handling
Speaker diarization

The long-term goal is a production-grade speech platform supporting:

Yoruba ASR
Yoruba TTS
Voice cloning
Conversational AI
Educational speech applications

Resources

Model

Sam4rano/omnivoice-yoruba-tts

Repository

Search for:

sam4rano_tts
