However, when you move beyond high-resource languages and start working with low-resource, tonal languages like Yoruba, many of the assumptions behind standard TTS pipelines begin to fail.
Over the past week, I fine-tuned OmniVoice for Yoruba using approximately 9.6 hours of speech data collected from 156 speakers. The goal was to build a model capable of zero-shot voice cloning while preserving the tonal characteristics that are fundamental to Yoruba speech.
This article documents the dataset construction process, the engineering decisions that mattered, why some popular multilingual baselines struggle with Yoruba, and how I evaluated synthesis quality.
Why Yoruba is Different
Unlike English, Yoruba is a tonal language.
Words that appear similar can have completely different meanings depending on tone:
| Word | Meaning |
|---|---|
| ọkọ | husband |
| ọkọ́ | hoe |
| ọkọ̀ | vehicle |
| ọ̀kọ̀ | spear |
For humans, tone is obvious.
For TTS models, tone must be learned through the interaction between:
- Text representations
- Acoustic patterns
- Prosodic modeling
A model that ignores tonal information will often generate speech that sounds clear but carries the wrong tones, and therefore says the wrong words.
This creates unique challenges for text normalization, tokenization, and model training.
Dataset Construction
To create a balanced Yoruba speech corpus, I merged two complementary datasets:
1. Google Yoruba Speech Dataset (OpenSLR 86)
A collection of high-quality studio recordings with clean acoustic conditions.
Benefits:
- Consistent microphone setup
- Minimal background noise
- Reliable pronunciation
2. IroyinSpeech
A crowd-sourced Yoruba speech dataset containing diverse accents and speaking styles.
Benefits:
- Greater speaker diversity
- Real-world recording conditions
- More representative pronunciation variation
Final Dataset Statistics
| Metric | Value |
|---|---|
| Total Utterances | 6,989 |
| Total Duration | ~9.6 Hours |
| Total Speakers | 156 |
| Sample Rate | 24 kHz |
| Channels | Mono |
The idea was simple:
Use studio-quality recordings to establish acoustic stability while using crowd-sourced speech to improve speaker diversity and robustness.
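A quick sanity check after merging two corpora is to recompute the headline statistics from the combined manifest. The sketch below assumes a hypothetical manifest format: a list of records with `speaker_id`, `duration_s`, and `source` fields. These names are illustrative and are not taken from either dataset's actual metadata.

```python
from collections import Counter

def summarize(manifest):
    """Summarize a merged manifest: utterances, hours, speakers, per-source counts.

    Each record is assumed (hypothetically) to look like
    {"speaker_id": "001", "duration_s": 4.0, "source": "openslr86"}.
    """
    total_seconds = sum(rec["duration_s"] for rec in manifest)
    # Key speakers by (source, id) so that two corpora reusing the same
    # speaker ID are not silently counted as one person.
    speakers = {(rec["source"], rec["speaker_id"]) for rec in manifest}
    return {
        "utterances": len(manifest),
        "hours": total_seconds / 3600,
        "speakers": len(speakers),
        "per_source": dict(Counter(rec["source"] for rec in manifest)),
    }

# Toy manifest for illustration only; the values are made up.
toy = [
    {"speaker_id": "001", "duration_s": 4.0, "source": "openslr86"},
    {"speaker_id": "001", "duration_s": 3.5, "source": "iroyinspeech"},
    {"speaker_id": "002", "duration_s": 2.5, "source": "iroyinspeech"},
]
stats = summarize(toy)
print(stats)
```

Applied to the figures in the table, the same arithmetic gives an average of roughly 4.9 seconds per utterance (9.6 × 3600 / 6,989) and about 45 utterances per speaker.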
Data Pipeline
The preprocessing pipeline looked like this:
Google OpenSLR 86 (Studio)
+
IroyinSpeech (Crowd)
│
▼
24 kHz Mono WAV
│
▼
Discrete Audio Tokenization
(HiggsAudio 8-Codebook VQ)
│
▼
OmniVoice Training
Rather than feeding raw waveforms directly into the language model, OmniVoice operates on discrete audio tokens generated through vector quantization.
For this project, I used the HiggsAudio 8-Codebook VQ tokenizer, which converts continuous speech into sequences of discrete acoustic symbols that can be modeled autoregressively.
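The text names the tokenizer but does not say how its eight codebooks are combined. A common arrangement in neural audio codecs is residual vector quantization, where each codebook quantizes whatever the previous one left over. The toy sketch below assumes that arrangement purely for illustration, using two tiny hand-written codebooks instead of eight learned ones. It is not the HiggsAudio implementation.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(frame, codebooks):
    """Quantize one feature frame, returning one index per codebook."""
    residual = list(frame)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # The next codebook only has to explain what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct a frame by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two hypothetical 2-dimensional codebooks (coarse, then fine).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
tokens = rvq_encode([1.4, 1.0], codebooks)
approx = rvq_decode(tokens, codebooks)
print(tokens, approx)
```

With eight codebooks, each audio frame becomes eight integer indices, and it is these indices, rather than waveform samples, that the language model learns to predict.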
Understanding the Zero-Shot Paradigm
One common misconception about fine-tuning voice cloning models is that fine-tuning teaches the model a single voice.
That is not what happens here.
Even after fine-tuning, the model remains a zero-shot voice cloning system.
The architecture consists of:
- Qwen3-0.6B as the autoregressive language model
- Discrete audio tokens as targets
- Reference audio for speaker conditioning
During training, the model learns:
Yoruba Text
+
Audio Tokens
↓
Speech Generation
At inference time:
Text Prompt
+
5–10 Second Reference Clip
↓
Cloned Speaker Output
The reference clip provides speaker identity dynamically.
This means the model can synthesize speech in voices it has never seen before, provided there is a short reference sample.
Diacritics Are Acoustic Features, Not Formatting
This was arguably the most important lesson from the project.
Many NLP pipelines treat diacritics as optional formatting.
For Yoruba TTS, this assumption is catastrophic.
Characters such as:
ẹ
ọ
ṣ
à
á
ì
ó
are not stylistic variations.
They encode phonological and tonal information directly tied to pronunciation.
Removing diacritics forces the model to map identical text sequences to multiple conflicting acoustic realizations.
The result is:
- Tone confusion
- Prosody degradation
- Lower intelligibility
- Inconsistent synthesis
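The collision is easy to reproduce. The sketch below uses a hypothetical `strip_diacritics` helper (the kind of cleanup step many text pipelines apply by default) and shows several distinct Yoruba spellings collapsing into one string once their marks are removed.

```python
import unicodedata

def strip_diacritics(text):
    """Remove all combining marks (tone marks and under-dots) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Four different words that differ only in their diacritics.
words = ["ọkọ", "ọkọ́", "ọkọ̀", "ọ̀kọ̀"]
stripped = {strip_diacritics(w) for w in words}
print(stripped)  # {'oko'}
```

After stripping, the model would see the single input `oko` paired with four different pronunciations, which is exactly the one-to-many mapping described above.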
Safeguard #1: Unicode NFC Normalization
Before training, all transcripts were normalized using Unicode Normalization Form C (NFC).
Without normalization:
ẹ
can be represented as:
e + U+0323 (combining dot below)
instead of the single precomposed code point U+1EB9.
This can silently break tokenization and create inconsistent text representations.
Using NFC ensured that all Yoruba characters were encoded consistently across the corpus.
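A minimal sketch of this step, using Python's standard `unicodedata` module. The `nfc` helper name is my own. The second half illustrates a detail worth knowing: some Yoruba letters, such as ẹ with a grave accent, have no single precomposed code point, so NFC still leaves a combining mark behind, but it always produces the same sequence regardless of how the input was typed.

```python
import unicodedata

def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

decomposed = "e\u0323"   # 'e' followed by COMBINING DOT BELOW
precomposed = "\u1eb9"   # LATIN SMALL LETTER E WITH DOT BELOW
print(decomposed == precomposed)       # False: same glyph, different code points
print(nfc(decomposed) == precomposed)  # True

# Four ways of typing "e with dot below and grave accent".
variants = [
    "e\u0323\u0300",   # base, dot below, grave
    "e\u0300\u0323",   # base, grave, dot below
    "\u00e8\u0323",    # precomposed e-grave, then dot below
    "\u1eb9\u0300",    # precomposed e-dot-below, then grave
]
canonical = {nfc(v) for v in variants}
print(len(canonical))  # 1
```

The same normalization has to be applied to any text passed in at inference time; otherwise prompts typed on a different keyboard layout may tokenize differently from the training transcripts.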
Safeguard #2: Tokenizer Vocabulary Audit
Before launching training, I audited the tokenizer vocabulary to verify that Yoruba characters existed as native tokens.
Specifically, I checked support for:
ẹ
ọ
ṣ
à
á
è
é
ì
í
ò
ó
ù
ú
The objective was to ensure that none of these characters were being mapped to:
<UNK>
Unknown-token substitutions would effectively erase tonal information before the model even saw the text.
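One way to sketch such an audit, assuming the vocabulary is available as a plain collection of token strings (for example, the keys of the dictionary returned by a Hugging Face tokenizer's `get_vocab()`). The `missing_chars` helper and the toy vocabulary are hypothetical.

```python
import unicodedata

# The characters listed above.
YORUBA_CHARS = "ẹọṣàáèéìíòóùú"

def missing_chars(vocab, chars=YORUBA_CHARS):
    """Return the NFC-normalized characters that have no entry in `vocab`."""
    wanted = unicodedata.normalize("NFC", chars)
    return [ch for ch in wanted if ch not in vocab]

# Toy vocabulary for illustration: covers only some of the characters.
toy_vocab = set(unicodedata.normalize("NFC", "abdeoẹọàá"))
print(missing_chars(toy_vocab))
```

One caveat: byte-level BPE tokenizers store their vocabulary in a byte-mapped form and fall back to raw bytes instead of emitting an unknown token, so for those a direct membership test is not meaningful. In that case, a more useful audit would be to encode each character, check that decoding returns it unchanged, and note how many tokens it was split into.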
Why Meta's MMS Struggles with Modern Yoruba
Meta's Massive Multilingual Speech (MMS) project is one of the most significant contributions to multilingual speech technology.
However, it exposes an important issue that affects many low-resource languages:
Domain Mismatch
A substantial portion of publicly available Yoruba speech data originates from:
- Bible readings
- Religious recordings
- Formal narration
These datasets are valuable linguistically but represent a narrow style of speech.
As a result, models trained primarily on such data often produce speech that sounds:
- Formal
- Archaic
- Unnaturally scripted
When generating contemporary Yoruba conversations, this mismatch becomes noticeable.
Why OpenSLR + IroyinSpeech Helped
The datasets used in this project contain:
- News content
- Modern writing
- Everyday language patterns
This broader linguistic coverage produced synthesis that sounded substantially closer to modern spoken Yoruba.
The improvement was especially noticeable in:
- Prosody
- Phrase rhythm
- Conversational flow
ASR-in-the-Loop Evaluation
Evaluating TTS remains one of the hardest problems in speech research.
The gold standard is still:
Mean Opinion Score (MOS)
Human listeners rate speech quality and naturalness.
The downside:
- Expensive
- Slow
- Difficult to scale
To obtain a faster quantitative signal, I implemented an ASR-in-the-loop evaluation pipeline.
Text Prompt
│
▼
[ Fine-Tuned OmniVoice ]
│
▼
Synthesized Audio
│
▼
[ Fine-Tuned Yoruba ASR ]
│
▼
Word Error Rate (WER)
The ASR model used was:
ccibeekeocwhisper-small-yoruba-07-17
Results
The synthesized speech achieved:
Normalized WER ≈ 11.5%
This is not a replacement for MOS evaluation.
However:
If an independent, diacritic-aware Yoruba ASR system can accurately recover the original text from synthesized speech, that provides strong evidence that the speech is intelligible and linguistically faithful.
For rapid iteration, this evaluation loop proved extremely useful.
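The scoring half of that loop can be sketched in a few lines. The synthesis and transcription calls are omitted because they depend on the specific models. What "normalized" covered in the run above is not spelled out here, so the sketch assumes NFC normalization, lowercasing, and punctuation removal, while deliberately keeping every diacritic so that tone errors count as word errors.

```python
import unicodedata

def normalize(text):
    """Lowercase, NFC-normalize, and drop punctuation, keeping diacritics."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = (
        ch for ch in text
        if ch.isalnum() or ch.isspace() or unicodedata.category(ch).startswith("M")
    )
    return "".join(kept).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# A transcript that loses the under-dots counts as one wrong word out of three.
print(wer("ọkọ mi dé", "oko mi dé"))
```

Because diacritics survive normalization, an ASR transcript that recovers the right consonants and vowels but the wrong tone marks is penalized, which is what makes the metric informative for a tonal language.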
Key Takeaways
After training and evaluation, three conclusions stood out:
1. Data Quality Matters More Than Model Size
The combination of studio recordings and crowd-sourced diversity provided stronger gains than simply increasing parameter count.
2. Diacritics Must Be Preserved End-to-End
For Yoruba TTS:
Diacritics are acoustic features.
Treating them as optional formatting introduces ambiguity the model cannot reliably resolve.
3. Domain Coverage Matters
Corpora covering news, modern writing, and everyday language produced noticeably more natural synthesis than corpora dominated by religious or formal speech.
What's Next?
With the TTS pipeline stable, the next step is building the reverse direction:
Streaming Yoruba ASR
The current roadmap includes:
- Low-latency transcription
- WebSocket streaming
- Silero VAD
- Real-time inference
- Code-switch handling
- Speaker diarization
The long-term goal is a production-grade speech platform supporting:
- Yoruba ASR
- Yoruba TTS
- Voice cloning
- Conversational AI
- Educational speech applications
Resources
Model
Repository
Search for:
sam4rano_tts
Tags
#Yoruba
#TTS
#SpeechSynthesis
#VoiceCloning
#ASR
#MachineLearning
#DeepLearning
#NLP
#SpeechAI
#OpenSource
#LowResourceLanguages
If you're working on speech technology for African or other low-resource languages, I'd be interested in hearing how you're handling tone preservation, tokenizer design, and evaluation beyond traditional WER and MOS metrics.