Chapter 0: Pilot
Recently, I trained a Bangla TTS model on a large dataset—hours of clean speech, plenty of variety, enough coverage that I felt confident it would generalize. When I ran the first proper tests, it sounded good. The voice was smooth, the pacing felt natural, and full sentences came out with the kind of clarity that makes you believe the hard part is over.
For a moment, I genuinely thought: this is done.
Then I gave it this:
“১২/০৮/২০২৪-এ ৩pm-এ ডা. রহমানের সাথে meeting আছে।”
And suddenly… the system didn’t sound confident anymore.
It didn’t completely crash. It didn’t produce silence.
It spoke — but the way it spoke felt off. Some parts sounded robotic. Some parts sounded like the model was guessing. And the sentence that looks perfectly normal to a Bangla reader turned into something that sounded like a confused reading of symbols.
That’s when I realized something important:
The model wasn’t failing. The text was.
Bangla writing is full of shortcuts. Humans understand them instantly. But a neural TTS model doesn’t “understand” text like we do — it learns patterns from what it sees.
To a human reader, this is effortless. It’s just a normal sentence about a schedule. But inside it, there are things that don’t exist in spoken Bangla the way they exist in writing: a date written in numeric form, a time written in a mixed style, an abbreviated honorific, and an English word sitting naturally inside a Bangla structure.
A neural TTS model doesn’t “understand” what those pieces mean. It learns how text maps to sound based on patterns. When the text contains compressed writing conventions, the model is forced to guess how those symbols should sound and the guess is often wrong, inconsistent, or unstable.
So the real problem wasn’t “The model was weak.”
The real problem was simpler:
Bangla orthography and Bangla speech are not the same thing.
This mismatch shows up everywhere in real-world Bangla text, especially in the kind of data people actually use:
social media captions, chat messages, voice assistant commands, news headlines, education content, and UI-heavy text.
And once I noticed it, I started seeing it everywhere.
So, this playbook is about fixing that gap.
It follows the journey of turning written Bangla into spoken-form Bangla using a structured, rule-based normalization pipeline, so the text becomes something a TTS system can pronounce naturally, consistently, and with confidence.
Chapter 1: Understanding the Understanding
After seeing how text can break the TTS output, I needed to understand the model itself. Modern Bangla TTS systems are not single monoliths; they're modular pipelines where each part depends on the previous one.
At a high level, text flows through a frontend, an acoustic model, and a vocoder. The frontend takes raw text and turns it into a sequence suitable for speech: normalizing numbers, abbreviations, and even mixed English words, then converting graphemes to phonemes. The acoustic model, built on architectures like VITS or Piper TTS, maps these phonemes to mel-spectrograms. Finally, the vocoder turns the spectrogram into audio that sounds natural to human ears.
These models have some important traits that make them both powerful and sensitive. They learn end-to-end, which reduces the need for hand-engineered features. They can produce highly natural prosody and intonation. They often support multiple languages through shared phoneme spaces, which is useful for Bangla mixed with English. And because the input is phoneme-based, the way text is normalized directly affects pronunciation. Optimized models like Piper even allow low-latency deployment without losing quality.
Understanding this pipeline made it clear: if the text is messy, no matter how advanced the model is, the output will suffer. Cleaning and structuring the text is not optional—it’s the first step to reliable speech.
Chapter 2: Challenges
Bangla presents several unique challenges for text-to-speech (TTS) synthesis that distinguish it from many languages. Understanding these challenges is critical for designing effective preprocessing and normalization pipelines.
2.1 Orthography vs Spoken-Form
One of the first challenges I noticed was how Bangla writing differs from spoken pronunciation. Historical spellings and complex consonant clusters make naïve character-by-character reading produce unnatural results. Grapheme-to-phoneme normalization is essential.
Examples of Orthography vs Spoken-Form Mismatch
| Orthographic Form | Spoken Form (Approx.) | Description |
|---|---|---|
| শিক্ষক | [শিক্-ষক] → [শিক্খক] | Written ক্ষ pronounced as kkh |
| কর্ম | [কর্ম] → [করমো] | র-ফলা alters vowel realization |
| বিদ্যালয় | [বিদ্যালয়] → [বিদ্দালয়] | Consonant clusters simplified in speech |
| রাজ্য | [রাজ্য] → [রাজ্জো] | জ্য conjunct realized as geminated jj |
Without this normalization, even common words get mispronounced and speech feels unnatural.
2.2 Numeric and Symbolic Conventions
Another challenge I faced was numbers and symbols. In Bangla, they can’t be read directly—they change depending on context. A cardinal number, a date, a year, a phone number, or currency all need different spoken forms. Without normalization, phonemizers stumble, mispronounce, or skip these tokens entirely.
| Written Form | Spoken Form | Context |
|---|---|---|
| ১২৩ | একশ তেইশ | Cardinal number |
| ২১শে | একুশে | Ordinal / date |
| ২০২৪ | দুই হাজার চব্বিশ | Year |
| 01712 | শূন্য এক সাত এক দুই | Phone number |
| ৳500 | পাঁচশত টাকা | Currency |
| 10 km | দশ কিলোমিটার | Unit / measure |
Without rule-based expansion, phonemizers cannot generate meaningful phonemes for these tokens, leading to mispronunciation or skipped content.
2.3 Mixed Bangla–English Text
Code-mixing is everywhere in real Bangla: social media, technical writing, urban messages. English words, numerals, abbreviations, and symbols appear inside Bangla sentences, and a standard G2P tool can’t handle them correctly.
Examples:
- “Meeting ৩pm-এ হবে।”
- “ব্যাংক একাউন্টের ব্যালান্স $500।”
These tokens need language-aware preprocessing and normalization so that phonemes are correct and the TTS output sounds natural.
2.4 Abbreviations and Honorifics
Bangla text is full of abbreviations and honorifics that are ambiguous without context. Naïve G2P systems either spell them out letter by letter or mispronounce them.
Examples:
| Abbreviation / Honorific | Expanded Form | Example Usage |
|---|---|---|
| ডঃ | ডাক্তার (Doctor) | ডঃ রহমান এসে গেলেন |
| মোঃ | মোহাম্মদ (Mohammad) | মোঃ সেলিম স্কুলে গেছে |
| Mr./Mrs. | মিষ্টার/মিসেস | Mr. রহমান বক্তব্য রাখলেন |
| am / pm | এ.এম./ পি.এম. | মিটিং ৩pm-এ হবে |
| লিঃ / ltd | লিমিটেড | কোম্পানি ABC লিঃ |
Normalizing these ensures the model produces natural, fluent speech instead of robotic or incorrect pronunciations.
2.5 Inconsistent Unicode
Bangla script has multiple Unicode ways to represent the same grapheme or vowel, and this can easily confuse phonemizers. Visually identical words may produce different phoneme sequences, breaking the naturalness of TTS output.
For example, consonants with a nukta can appear as a single precomposed character or as a base-plus-nukta sequence. “ড়” might arrive as U+09DC or as U+09A1 U+09BC (ড + combining nukta). A phonemizer that handles one encoding but ignores the combining nukta in the other may map the first correctly to /ɽ/ and misread the second as the bare /ɖ/.
Vowel signs vary too: the two-part sign in “কো” can be encoded as ক + U+09CB (the single vowel sign O) or as ক + U+09C7 + U+09BE (its canonically equivalent decomposed parts). A phonemizer keyed to one encoding may mishandle the other.
Even conjuncts are not immune. “ক্ক” is always encoded as the sequence U+0995 U+09CD U+0995, but stray zero-width joiners (U+200D) or non-joiners (U+200C) can slip inside such sequences, producing strings that look identical on screen yet no longer match the patterns the phonemizer expects.
Canonicalization, such as NFC normalization, is crucial. Without it, these variations propagate errors through the TTS system, reducing both intelligibility and naturalness.
Chapter 3: Pillars
3.1 The “Why”?
After seeing the same model sound great on clean sentences and fall apart on real-world Bangla, I stopped treating text normalization as a “nice-to-have.” It became a must-have.
Raw Bangla text carries multiple layers of ambiguity that directly affect phoneme accuracy. Some words look stable in writing but shift in pronunciation because of conjunctions and schwa deletion. Many sentences include mixed-script tokens like English words, numerals, and abbreviations. Symbols show up everywhere—currency, units, percentages—and none of them are meant to be spoken as written. On top of that, Unicode inconsistencies can make two identical-looking words behave differently for a phonemizer.
In a high-resource setting, a model might learn to survive some of this noise. But Bangla is often trained under low-resource constraints, and that makes the problem sharper: the model can’t learn what it never sees consistently.
This is also where standard grapheme-to-phoneme tools like eSpeak NG struggle. They work best when the input is already clean and predictable. When the text is messy, the phonemes become unreliable—and once phonemes are wrong, the audio will never fully recover.
So I built a structured, rule-based normalization pipeline to force raw text into one clear form: a deterministic, pronounceable, spoken-form representation. The goal wasn’t perfection. The goal was consistency—so phonemization becomes stable, coverage improves, and the TTS output becomes more natural and intelligible.
3.2 The Pipeline
I designed the pipeline as a sequence of small steps, where each step solves one category of failure I had already seen.
The flow looks like this:
Raw Text → Unicode Cleanup → Tokenization → Language Identification → Rule-Based Normalization → Spoken Bangla Text → Neural TTS
It starts with raw Bangla text, which may include mixed scripts, digits, abbreviations, symbols, and inconsistent Unicode forms. The first step is Unicode cleanup, where I normalize everything into a canonical NFC form and resolve variants like nukta characters or broken vowel signs.
Next comes tokenization. This step matters more than it sounds, because Bangla text often attaches numbers to suffixes, units, or punctuation. Simple whitespace splitting breaks meaning, so tokenization has to be aware of these patterns and handle mixed Bangla–English content properly.
After that, I do token-level language identification. I don’t need a heavy model here—just enough signal to decide whether a token should be treated as Bangla, English, or a symbol. This becomes crucial for code-mixed text and abbreviations.
Then comes the core: rule-based normalization. This is where I expand numbers, dates, time formats, currency, units, honorifics, and abbreviations into the form people actually say. English tokens that matter for pronunciation are transformed into Bangla-friendly spoken forms. The most important property here is determinism: the same input should always normalize the same way.
The output of this pipeline is spoken Bangla text that is ready for phonemization—clean, pronounceable, and free of raw digits or symbols that would confuse the TTS system.
3.3 Unicode Cleanup
The first thing I do in the pipeline is Unicode cleanup, because I learned the hard way that Bangla text can look identical on screen and still behave differently inside a phonemizer. Sometimes “ড়” arrives as a single character, sometimes as “র + nukta”. Sometimes conjuncts show up as decomposed sequences. The sentence looks the same, but the code points aren’t—and that tiny difference is enough to break rules and change phonemes.
So I normalize every input using Unicode NFC, which maps every string to a single canonical form and gives me one stable representation to work with. (One subtlety: for the Bangla nukta letters like ড়, which are composition exclusions, that canonical form is actually the decomposed sequence; what matters is not composed versus decomposed but that it is one consistent form.) That stability matters because everything downstream—tokenization, language detection, and spoken-form conversion—depends on Unicode behaving predictably.
Without this step, the pipeline becomes unreliable: the same word can produce different phoneme sequences, normalization rules may not trigger, and the TTS voice starts sounding inconsistent for no obvious reason. NFC doesn’t make text “better,” but it makes it deterministic, and that’s the foundation I need.
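As a sketch of this step, Python's standard `unicodedata` module is all it takes. The snippet also demonstrates the ড় variant problem from Chapter 2; for the nukta letters, the canonical form NFC settles on happens to be the decomposed sequence, which is fine, because consistency is the point:

```python
import unicodedata

def clean(text: str) -> str:
    """Canonicalize to NFC so identical-looking text gets identical code points."""
    return unicodedata.normalize("NFC", text)

# Two encodings of the same visible letter ড়:
precomposed = "\u09DC"        # single code point
decomposed = "\u09A1\u09BC"   # ড + combining nukta

print(precomposed == decomposed)                 # False: different code points
print(clean(precomposed) == clean(decomposed))   # True: one canonical form
```

Every rule later in the pipeline assumes its input has passed through `clean` first.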
3.4 Tokenization
After Unicode cleanup, I split the text into tokens that can be normalized safely. This sounds simple until you meet real Bangla input, where numbers and symbols love sticking to words as if they were glued together.
Whitespace tokenization fails immediately. In a sentence like:
“সে ৫kg চাল কিনেছে।”
a naïve tokenizer treats “৫kg” as one token, but I need it as two pieces—“৫” and “kg”—so I can later normalize it into something a human would actually say: “পাঁচ কেজি”.
The same problem shows up everywhere: “৩০%”, “২৫°C”, “১০০টাকা”. If I don’t split them correctly, normalization can’t expand them, and phonemization either guesses wrong or skips the token entirely.
So my tokenizer produces a structured stream: Bangla words, numbers, symbols, punctuation, and Latin-script words separated cleanly. Once I have that, normalization becomes predictable instead of fragile.
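A simplified version of that tokenizer fits in a single regex (a sketch using Python's `re` module, not the full production rule set). The ordering of the alternatives matters, because Bangla digits live inside the Bengali Unicode block:

```python
import re

# Digit runs are listed first so Bangla digits (U+09E6..U+09EF)
# are not swallowed by the general Bangla-script class below.
TOKEN_RE = re.compile(
    r"[0-9\u09E6-\u09EF]+"             # ASCII or Bangla digit run
    r"|[\u0980-\u09E5\u09F0-\u09FF]+"  # Bangla letters, signs, conjunct marks
    r"|[A-Za-z]+"                      # Latin-script word
    r"|\S"                             # any other symbol or punctuation mark
)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("সে ৫kg চাল কিনেছে।"))
# → ['সে', '৫', 'kg', 'চাল', 'কিনেছে', '।']
```

Note that “৫kg” comes out as two tokens, which is exactly what the later normalization rules need.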
3.5 Language Identification
Tokenization solves separation, but mixed text adds another trap: the same sentence can contain Bangla, English, digits, and symbols—sometimes all in one line.
This is where language identification becomes necessary. If I feed an English token into a Bangla phonemizer like eSpeak NG, it can generate nonsense phonemes. If I treat Bangla text as English, it can fail completely. Either way, the voice breaks, and the error spreads across the sentence.
I keep this step lightweight and deterministic by using script-based detection. Bangla tokens are usually inside the Bengali Unicode block, Latin tokens are English, digits are numbers, and symbols stay symbols. For example:
Input tokens:
["Meeting", "টা", "৩", "pm", "এ", "শুরু", "হবে"]
Tagged output:
[("Meeting", EN), ("টা", BN), ("৩", NUM), ("pm", EN), ("এ", BN), ("শুরু", BN), ("হবে", BN)]
Once I know what each token is, I can normalize it the right way—without guessing—and the phonemes stop falling apart in code-mixed sentences.
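A sketch of that script-based tagger (the tag names mirror the ones above; the rules are deliberately minimal):

```python
def tag(token: str) -> str:
    # Digits first: "৩" sits inside the Bengali block but must become NUM.
    # Python's str.isdigit() accepts Bangla digits (Unicode category Nd).
    if token and all(c.isdigit() for c in token):
        return "NUM"
    if any("\u0980" <= c <= "\u09FF" for c in token):
        return "BN"
    if any(c.isascii() and c.isalpha() for c in token):
        return "EN"
    return "SYM"

tokens = ["Meeting", "টা", "৩", "pm", "এ", "শুরু", "হবে"]
print([tag(t) for t in tokens])
# → ['EN', 'BN', 'NUM', 'EN', 'BN', 'BN', 'BN']
```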
3.6 The “Make It Speakable” Step
Rule-Based Normalization is the heart of the whole pipeline.
Because after Unicode cleanup, tokenization, and language tagging… I still have one big problem:
written Bangla is not automatically speakable Bangla.
Real-world text is full of shortcuts—digits, symbols, mixed English, abbreviations, compressed formats. Humans read them effortlessly. But a phonemizer (and even a strong neural TTS model) doesn’t “understand” them. It only sees tokens and patterns.
So this stage has one job:
Convert everything into a clean spoken-form Bangla sentence that a phonemizer can pronounce confidently.
And I keep it rule-based for very practical reasons:
Deterministic: same input → same output (no surprises)
Debuggable: I can fix one rule without retraining a model
Fast: almost zero latency
Low-resource friendly: no need for labeled normalization datasets
Once tokens are tagged (BN / EN / NUM / SYMBOL), I normalize them category by category.
3.6.1 Not all numbers are “numbers”
Numbers look simple, but their spoken form depends on what they mean.
So I classify them first, then expand them.
Cardinal: ১২৩ → একশ তেইশ
Ordinal/date style: ২১শে → একুশে
Ordinal suffix: ৫ম → পঞ্চম
Year: ২০২৪ → দুই হাজার চব্বিশ
Phone digits: ০১৭১২… → শূন্য এক সাত এক দুই…
IDs / mixed codes: ৪৫A৫২ → চার পাঁচ এ পাঁচ দুই
USSD / star-hash: *২২২# → স্টার দুই দুই দুই হ্যাশ
If I skip this, the phonemizer either breaks or reads the symbols like nonsense.
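The cardinal branch can be sketched like this. The helper names are mine, and the word table is deliberately partial: Bangla cardinals below one hundred are irregular (every value has its own word), so a real system carries a full 0–99 lookup.

```python
BN2ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

# Partial table (sketch): only the entries the demo needs are filled in.
WORDS = {0: "শূন্য", 1: "এক", 23: "তেইশ", 24: "চব্বিশ"}

def bn_to_int(s: str) -> int:
    return int(s.translate(BN2ASCII))

def cardinal(n: int) -> str:
    # Covers 0-999 once WORDS is complete.
    if n < 100:
        return WORDS[n]
    hundreds, rest = divmod(n, 100)
    head = "একশ" if hundreds == 1 else WORDS[hundreds] + "শ"
    return head if rest == 0 else f"{head} {WORDS[rest]}"

print(cardinal(bn_to_int("১২৩")))  # → একশ তেইশ
```

Years, phone digits, and ID codes each get their own branch on top of the same digit-conversion helper.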
3.6.2 Where TTS usually gets embarrassed
Dates and time show up in too many formats:
১২/০৮/২০২৪
12 Aug 2024
৩pm
০৯:১৫
So I normalize them into one spoken style:
Numeric date: ১২/০৮/২০২৪ → বারোই আগস্ট দুই হাজার চব্বিশ
Mixed-script date: 12 Aug 2024 → বারোই আগস্ট দুই হাজার চব্বিশ
Time (am/pm): ৩pm → তিন পিএম
Clock time: ০৯:১৫ → নয়টা পনের মিনিট
This is where the voice suddenly starts sounding “educated” instead of confused.
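The numeric-date rule can be sketched the same way. All three lookup tables here are deliberately partial placeholders; a real system expands years numerically and carries full month and day-ordinal tables.

```python
import re

DAY_ORDINALS = {12: "বারোই"}          # Bangla date ordinals are irregular
MONTHS = {8: "আগস্ট"}
YEARS = {2024: "দুই হাজার চব্বিশ"}

BN2ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")
DATE_RE = re.compile(r"([০-৯]{1,2})/([০-৯]{1,2})/([০-৯]{4})")

def spoken_date(text: str) -> str:
    def repl(m: re.Match) -> str:
        d, mo, y = (int(g.translate(BN2ASCII)) for g in m.groups())
        return f"{DAY_ORDINALS[d]} {MONTHS[mo]} {YEARS[y]}"
    return DATE_RE.sub(repl, text)

print(spoken_date("১২/০৮/২০২৪"))  # → বারোই আগস্ট দুই হাজার চব্বিশ
```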
3.6.3 Numbers with costumes
Most numbers in real text aren’t standalone. They come wearing units:
৳৫০০, $50, ₹200
৫kg, ১০km
২৫°C, ৩০%
So I expand both the number and the unit:
৳৫০০ → পাঁচশো টাকা
$50 → পঞ্চাশ ডলার
৫kg → পাঁচ কেজি
২৫°C → পঁচিশ ডিগ্রি সেলসিয়াস
৩০% → ত্রিশ শতাংশ
Without this, “৫kg” becomes a token the system can’t pronounce naturally.
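A sketch of the number-plus-unit rule, with partial tables; `NUM_WORDS` here stands in for the full number expander from the previous step:

```python
import re

# Partial tables (sketch); a production table covers far more units.
UNIT_WORDS = {"kg": "কেজি", "km": "কিলোমিটার", "%": "শতাংশ"}
NUM_WORDS = {"৫": "পাঁচ", "১০": "দশ", "৩০": "ত্রিশ"}

UNIT_RE = re.compile(r"([০-৯]+)\s*(kg|km|%)")

def expand_units(text: str) -> str:
    return UNIT_RE.sub(
        lambda m: f"{NUM_WORDS[m.group(1)]} {UNIT_WORDS[m.group(2)]}", text
    )

print(expand_units("সে ৫kg চাল কিনেছে"))  # → সে পাঁচ কেজি চাল কিনেছে
```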
3.6.4 Hidden meaning
Bangla writing uses abbreviations constantly:
ডঃ রহমান
মোঃ কাসেম
Mrs. / ltd
A phonemizer might spell them out weirdly or misread them entirely.
So I map them directly:
ডঃ → ডাক্তার
মোঃ → মোহাম্মদ
Mrs. → মিসেস
ltd → লিমিটেড
This one step improves naturalness a lot in formal text.
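Because these are closed-class tokens, a plain lookup table is enough (a sketch; the real table is much longer):

```python
# Direct abbreviation/honorific lookup (sketch, easy to extend).
ABBREVIATIONS = {
    "ডঃ": "ডাক্তার",
    "মোঃ": "মোহাম্মদ",
    "Mrs.": "মিসেস",
    "ltd": "লিমিটেড",
}

def expand_abbreviations(tokens: list[str]) -> list[str]:
    # Unknown tokens pass through unchanged.
    return [ABBREVIATIONS.get(t, t) for t in tokens]

print(expand_abbreviations(["ডঃ", "রহমান"]))  # → ['ডাক্তার', 'রহমান']
```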
3.6.5 Transliteration
Code-mixing is normal Bangla now:
“Meeting টা ৩pm-এ হবে।”
If I leave “Meeting” as English, a Bangla phonemizer will struggle.
So I do a phonetic transliteration, not a semantic translation.
Examples:
Meeting → মিটিং
Mobile → মোবাইল
Manager → ম্যানেজার
Exceptionally → এক্সসেপশানালি
The goal is simple: turn English tokens into the Bangla-script form people actually speak.
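A sketch of the lexicon lookup; a real system backs this with a trained transliterator for out-of-vocabulary words, while unknown tokens here simply pass through:

```python
# Lexicon-based phonetic transliteration (sketch, partial table).
TRANSLIT_LEXICON = {
    "meeting": "মিটিং",
    "mobile": "মোবাইল",
    "manager": "ম্যানেজার",
}

def transliterate(token: str) -> str:
    # Lowercase for the lookup; fall back to the token itself.
    return TRANSLIT_LEXICON.get(token.lower(), token)

print(transliterate("Meeting"))  # → মিটিং
```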
3.7 The Gold
After rule-based normalization, the output is no longer “raw text”.
It becomes spoken Bangla text — clean, pronounceable, phonemizer-ready.
What it looks like
No digits / symbols left unexpanded
Abbreviations become full words
Mixed-script becomes consistent
Punctuation stays (helps prosody and pauses)
Before → After examples
- ১২/০৮/২০২৪-এ ৩pm মিটিং আছে
  → বারোই আগস্ট দুই হাজার চব্বিশ-এ তিন পিএম মিটিং আছে
- ডঃ রহমান ৫kg ডাল নিয়েছে
  → ডাক্তার রহমান পাঁচ কেজি ডাল নিয়েছে
- আজ $50 খরচ হয়েছে
  → আজ পঞ্চাশ ডলার খরচ হয়েছে
- মোঃ কাসেম আজ ৩০% ছাড় পেয়েছেন
  → মোহাম্মদ কাসেম আজ ত্রিশ শতাংশ ছাড় পেয়েছেন
- Meeting টা ০৯:১৫-এ শুরু হবে
  → মিটিং টা নয়টা পনের মিনিট-এ শুরু হবে
- *222*3# নাম্বারে ডায়াল করলেই হবে
  → স্টার দুই দুই দুই স্টার তিন হ্যাশ নাম্বারে ডায়াল করলেই হবে
Now the phonemizer doesn’t need to “figure things out”.
It just converts clean spoken text into stable phonemes — and the TTS voice finally sounds confident.
Chapter 4: Closure
At first, my Bangla TTS model sounded so good that I thought the job was done. But the moment I fed it real-world text—dates, times, symbols, abbreviations, and mixed Bangla-English—it stopped sounding confident. Not because the model was weak, but because the text was messy in a way humans understand instantly and machines don’t.
This playbook solved that gap by turning raw Bangla into spoken-form Bangla through a structured normalization pipeline: Unicode cleanup, tokenization, language detection, and rule-based expansion. Once the text became truly pronounceable, phonemization became stable—and the voice stopped guessing.
In the end, the biggest upgrade wasn’t a new model.
It was giving the model the right version of the text to speak.