Servin Osmanov


My Journey Improving a TTS Model for the Crimean Tatar Language

When you work with machine learning, success often hides behind hours of frustration, countless errors, and broken pipelines. This project — improving the Crimean Tatar TTS (Text-to-Speech) model — was exactly that kind of journey. What started as a small experiment to fine-tune an existing model turned into a full-scale debugging adventure that taught me more about data integrity, audio processing, and patience than any tutorial could.

In my previous article, Why Language Tech Matters: Developing AI Tools for Small Languages, I explored how AI can empower low-resource languages. This piece continues that journey with a hands-on look at improving TTS models.

[Image: Sevil TTS model improvement]

The Starting Point: A Model That Worked — but Only Partially

My goal was simple: improve the voice model “Sevil” for the Crimean Tatar language. I had already worked with similar voices — “Arslan” and “Abibullah” — using Hugging Face datasets like speech-uk/tts-crh-arslan and speech-uk/tts-crh-abibullah.

The first training attempts went well — loss around 0.283, results acceptable. But something didn’t add up. The dataset had 1,566 audio files, yet the training logs showed only 415 were being used — about 26.5% of the total.

That meant almost three-quarters of my data was silently ignored.

At first, I thought it was a fluke. Then I realized it was a systemic problem in the Hugging Face datasets API when loading compressed audio from Parquet files.

Diagnosing the Problem

When I loaded the dataset through datasets.load_dataset(), most files threw errors:

Error: "Error while decoding audio"
Error: "Audio file appears to be empty"

That didn’t make sense — the audio bytes were clearly present in the Parquet files.

After checking manually with pandas.read_parquet(), I confirmed the data was there. The problem wasn’t in the files — it was in how the decoder handled them.
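
For anyone hitting the same wall, the check looked roughly like this. It's a minimal sketch: the shard file name is illustrative, and I'm assuming the usual Hugging Face Parquet layout where the audio column holds a dict with a bytes field.

import pandas as pd

# Read one Parquet shard directly, bypassing the datasets decoder
# (the shard file name here is illustrative).
df = pd.read_parquet("data/train-00000-of-00001.parquet")

audio = df.iloc[0]["audio"]  # typically a dict: {'bytes': b'...', 'path': '...'}
print(len(audio["bytes"]), "bytes present, despite the decoding errors")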

That’s when I realized: the datasets API couldn’t decode raw audio bytes correctly. The data was fine, but the pipeline was broken.

Turning Bytes into Sound

At this point, I tried everything.

I extracted the bytes manually, saved them as .raw files, and tried to load them with librosa and soundfile. Nothing worked.

Without proper metadata (sample rate, channels, encoding), the files were unreadable.
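
For reference, the extraction itself was the easy part. A rough sketch, under the same layout assumptions as the pandas check above, with illustrative file names:

import pandas as pd
from pathlib import Path

out_dir = Path("raw")
out_dir.mkdir(exist_ok=True)

df = pd.read_parquet("data/train-00000-of-00001.parquet")
for i, row in df.iterrows():
    # Dump the undecoded bytes as-is; without header metadata these
    # dumps are not playable yet, hence the .raw extension.
    (out_dir / f"sevil_{i:04d}.raw").write_bytes(row["audio"]["bytes"])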

Eventually, I discovered the solution: FFmpeg.

By using the known dataset parameters (sample rate: 16000 Hz, channels: mono, format: PCM 16-bit little-endian), I could convert all .raw files into clean .wav audio.

ffmpeg -f s16le -ar 16000 -ac 1 \
       -i sevil_0000.raw \
       sevil_0000.wav
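
That command handles one file; to process the whole set I wrapped it in a small loop. A minimal sketch, assuming the dumps live in raw/ and the cleaned files go to wav/:

import subprocess
from pathlib import Path

raw_dir = Path("raw")
wav_dir = Path("wav")
wav_dir.mkdir(exist_ok=True)

for raw_path in sorted(raw_dir.glob("*.raw")):
    wav_path = wav_dir / (raw_path.stem + ".wav")
    # Same FFmpeg flags as above: signed 16-bit little-endian PCM, 16 kHz, mono.
    subprocess.run(
        ["ffmpeg", "-y", "-f", "s16le", "-ar", "16000", "-ac", "1",
         "-i", str(raw_path), str(wav_path)],
        check=True, capture_output=True,
    )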

And just like that — 1,566 files successfully converted.

No corruption. No decoding errors. 100% validation success.

It was the breakthrough moment, the kind that makes you sit back, smile, and realize you've just solved a problem that had been haunting you for two days straight.

Training the Model — Again

With clean audio finally ready, I retrained the Sevil model from scratch, this time using all 1,566 recordings.

Training setup (based on my previous configs for “Arslan”):

num_train_epochs = 500
batch_size = 4
learning_rate = 1e-4
fp16 = True
warmup_steps = 2000
save_steps = 2000
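
For context, here is roughly how those values map onto a Hugging Face Seq2SeqTrainingArguments config. The output directory is a placeholder, and everything not listed above is left at its default:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5-crh-sevil",   # placeholder name
    num_train_epochs=500,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    fp16=True,
    warmup_steps=2000,
    save_steps=2000,
)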

The progress was promising:

  • Loss dropped from 1.14 → 0.80 → 0.50 → 0.27
  • The voice quality improved with every epoch
  • And then… it crashed at 78% completion.

The culprit? A familiar one for SpeechT5 users —

RuntimeError: torch.cat(): expected a non-empty list of Tensors

It turned out to be a bug in guided_attention_loss, a component that sometimes fails with uneven sequence lengths.

Fixing the Crash

Instead of starting over, I resumed training from the last checkpoint (step 16,000) and simply disabled guided attention:

model.config.use_guided_attention_loss = False

That one line saved the project.
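
With that flag off, resuming was a one-liner. A sketch, assuming a standard Seq2SeqTrainer built from the arguments above (the checkpoint path follows the placeholder output directory):

# Continue from the last saved checkpoint instead of restarting from scratch.
trainer.train(resume_from_checkpoint="speecht5-crh-sevil/checkpoint-16000")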

Training resumed, completed 98% of the full cycle, and stabilized with a final loss of 0.267 — a small numerical improvement, but a big qualitative one.

The model became more consistent and robust across new data.

Comparing Results

Model      Data Used             Loss    Success     Notes
Sevil v1   415 files (26.5%)     0.276   100% test   Trained on partial data
Sevil v2   1,566 files (100%)    0.267   100% test   Fully trained, stable

The difference wasn’t just in metrics — it was in confidence.

Sevil v2 generalized better, produced smoother intonation, and maintained pronunciation consistency even on unseen words.

Lessons Learned

  1. Never trust good metrics without checking data coverage.

    My “good” baseline was trained on just 26% of the data.

  2. FFmpeg is a lifesaver.

    It solved what specialized libraries couldn’t.

  3. Validate every single file.

    Automation saved hours of manual checking (a short validation sketch follows this list).

  4. Guided attention loss is optional.

    Sometimes stability matters more than theoretical accuracy.

  5. Document everything.

    By keeping track of every attempt — successful or not — I could understand the full story, not just the happy ending.
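
The validation from lesson 3 doesn't need anything fancy. A minimal sketch, assuming the converted files sit in wav/ and should all be 16 kHz:

import soundfile as sf
from pathlib import Path

# Flag any file that fails to open, is empty, or has the wrong sample rate.
bad = []
for wav_path in sorted(Path("wav").glob("*.wav")):
    try:
        audio, sr = sf.read(str(wav_path))
        if sr != 16000 or len(audio) == 0:
            bad.append(wav_path)
    except RuntimeError:
        bad.append(wav_path)

print(f"{len(bad)} problematic files found")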

Why This Project Matters

For me, this wasn’t just about fixing one dataset. It was about enabling a low-resource language — Crimean Tatar — to have a better voice in the digital world.

Improving the Sevil model means clearer pronunciation, smoother prosody, and better accessibility for learners and native speakers alike.

And for anyone working with custom TTS datasets:

Check your files, validate your data, and don’t give up when your model crashes at 78%.

That crash might be the best teacher you’ll ever have.


Author: Servin Osmanov

Lead Fullstack Python / ReactJS Engineer

AI researcher and TTS developer for low-resource languages

Project: servinosmanov/tts-crh-sevil-fixed on Hugging Face
