Urdu is spoken by over 230 million people. It is the national language of Pakistan, one of the 22 scheduled languages of India, and the lingua franca of a diaspora spanning three continents. And yet, if you try to build Urdu software today — real software, not a toy — you will hit the same wall every other developer hit before you: the tools do not exist.
I hit that wall building HamaariUrdu, an Urdu language learning platform. This post is about what I built to fix it.
## The bugs that no library could fix
I was not looking to build a library. I was looking to ship features. But the bugs kept piling up, and none of the available Urdu NLP libraries (UrduHack, URDUNLP, or anything else) could fix them.
**Bug 1: Search returning zero results for words that are obviously in the database.**
The database stored ہے using the correct Urdu ہ (U+06C1, Heh Goal). The user's keyboard typed Arabic ه (U+0647, Heh). The two look identical on screen in Naskh fonts, but U+06C1 !== U+0647. Zero results. No error. No warning. Just silence.
**Bug 2: String equality silently failing.**
"قلم" === "قلم" // false — why?!
One of those strings was copied from Microsoft Word and contains an invisible ZWNJ (Zero Width Non-Joiner, U+200C) that Word inserts automatically. You cannot see it. Your editor does not show it. But the comparison fails.
**Bug 3: TinyMCE destroying Izafat.**
In Urdu grammar, Izafat (اضافت) is a grammatical construction that links two words — like the English "of" but expressed as a marker on the first word. The marker is often an apostrophe-like character (U+2019, Right Single Quotation Mark).
TinyMCE — a very popular rich text editor — silently converts U+2019 to the HTML entity `&rsquo;` before saving. So a word like کتابِ (with Kasra) or a phrase using the Izafat apostrophe gets stored with an HTML entity in place of the marker. Every compound word lookup in the database then fails because the stored form doesn't match the queried form.
**Bug 4: Numbers overflowing.**
Urdu text frequently references South Asian scale: لاکھ (100,000), کروڑ (10,000,000), ارب (1,000,000,000). These are real everyday numbers in Pakistan — newspaper headlines, financial documents, government statistics.
Number.MAX_SAFE_INTEGER is 9,007,199,254,740,991 (about 9 × 10¹⁵). Sums and products at کروڑ and کھرب scale cross that boundary quickly, and once they do, JavaScript's number type silently gives you the wrong answer rather than an error.
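A quick demonstration of the cliff edge, using plain JavaScript semantics:

```typescript
// 2^53 = 9,007,199,254,740,992. One past MAX_SAFE_INTEGER, number silently rounds.
const asNumber = Number('9007199254740993'); // rounds down to 9007199254740992
const asBigint = BigInt('9007199254740993'); // keeps every digit

console.log(asNumber === 9_007_199_254_740_992); // true: the last digit was silently lost
console.log(asBigint.toString()); // '9007199254740993', exact
```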
**Bug 5: Sorting broken for every Urdu word list.**
No database and no JavaScript runtime has native Urdu collation. The Urdu alphabet has 39 letters in a specific order that does not match either Unicode codepoint order or any Latin-derived collation. Every sorted word list was wrong.
**Bug 6 — the worst one: Compound words destroying every downstream NLP task.**
This one deserves its own section.
## The compound word problem
Urdu مرکب الفاظ (compound words) are multi-word expressions that function as a single semantic unit but are written with spaces between their parts.
- کتاب خانہ → library (کتاب = book, خانہ = place)
- بے عزت → disrespectful (بے = without, عزت = honor)
- خوش قسمت → fortunate (خوش = well, قسمت = fate)
- علم و عمل → knowledge and practice (fixed expression)
- محنت مشقت → hard work (synonym compound)
A naive tokenizer sees spaces and splits them. The result:
Input: "اس نے کتاب خانہ بنایا" (with a space between the compound's components)

Wrong: ['اس', 'نے', 'کتاب', 'خانہ', 'بنایا'] — 5 tokens; "library" is split into "book" + "place"

Right: ['اس', 'نے', 'کتابخانہ', 'بنایا'] — 4 tokens; "library" is one semantic unit
The consequences ripple into every downstream NLP task:
| Task | What breaks |
|---|---|
| Search | کتاب خانہ doesn't match کتابخانہ — zero results |
| NER | امورِ خانہ داری (household affairs) split into 3 unrelated tokens |
| Sentiment | بے عزت (disrespectful) vs بے + عزت — polarity lost |
| Translation | رنگ برنگے (colorful) translated as "color" + unknown |
| Word count | Every compound inflates the count with phantom tokens |
## Why this is genuinely hard
Urdu compound words span four different morphological strategies simultaneously:
Strategy 1 — Affix-based: One word contains a known derivational morpheme (prefix or suffix):
کتاب + خانہ → library (خانہ = "place of" suffix)
بے + عزت → disrespectful (بے = "without" prefix)
خوش + قسمت → fortunate (خوش = "well" prefix)
کتاب + داری → librarianship (داری = "keeping" suffix)
Strategy 2 — Izafat: A grammatical linking marker appears in the text, written or implied:
کتابِ حسنہ (the good book) — Zer mark (◌ِ) on first word
روحِ رواں (driving spirit) — Hamza-above (◌ٔ) marker
علم و عمل (knowledge and practice) — Vav-e-atf (و) connector
Strategy 3 — Lexical: Neither word is morphologically special. You simply have to know these pairs:
محنت مشقت (hard work — synonym compound)
رنگ برنگے (colorful — echo compound)
صبر شکر (patient gratitude — near-synonym pair)
انسائیکلوپیڈیا آف اسلام (3-word fixed title)
Strategy 4 — Chains: Three or more words where each link is independently valid:
امورِ خانہ داری (household affairs — 3 words): امورِ carries the izafat marker, خانہ carries the affix, داری is the suffix.
Decomposition:
- امورِ + خانہ → izafat compound
- خانہ + داری → affix compound
- Merged: امورِ خانہ داری → one 3-word compound
No statistical model trained on general text reliably covers all four strategies. They operate at different linguistic levels and require different detection mechanisms.
## The approach: three deterministic layers
The few Urdu compound-detection libraries that exist at all treat this as a machine learning problem: feed training data into statistical models and hope the probabilities align.
That means:
- Results change unpredictably between corpus versions
- You cannot explain why a pair was or wasn't detected
- Edge cases (literary izafat, 3-word expressions, echo words) fail silently
- No deterministic guarantee across identical inputs
urdu-tools takes the opposite approach. Every detection is grounded in one of three verifiable, explainable rules:
Raw text
│
├─► Layer 1 — Affix (UAWL)
│ 100+ known Urdu prefix/suffix morphemes
│ خانہ گاہ پرست بے نا خوش شب غم …
│
├─► Layer 2 — Izafat
│ zer mark (◌ِ) · hamza-above (◌ٔ) · vav-e-atf (و)
│ کتابِ حسنہ · روحِ رواں · علم و عمل
│
└─► Layer 3 — Lexicon
3,262 root entries · N-word tails · greedy longest-match
محنت مشقت · رنگ برنگے · انسائیکلوپیڈیا آف اسلام
│
└─► Span chaining
امورِ خانہ + خانہ داری → امورِ خانہ داری
The same input always produces the same output, always with a reason.
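To make Layer 3 concrete, here is a minimal, hypothetical sketch of greedy longest-match over a root-to-tails lexicon in plain TypeScript. It is not the library's internals; the `Map` shape simply mirrors the root/tails format the lexicon uses:

```typescript
// Hypothetical lexicon shape: root word -> multi-word tails, longest tried first.
const lexicon = new Map<string, string[]>([
  ['انسائیکلوپیڈیا', ['آف اسلام']],
  ['علم', ['و عمل']],
]);

// Greedy longest-match: at each token, try the longest lexicon tail first.
// Returns the number of tokens the matched compound spans (0 if no match).
function matchAt(tokens: string[], i: number): number {
  const tails = lexicon.get(tokens[i]) ?? [];
  for (const tail of [...tails].sort((a, b) => b.length - a.length)) {
    const parts = tail.split(' ');
    if (parts.every((p, k) => tokens[i + 1 + k] === p)) return 1 + parts.length;
  }
  return 0;
}

const tokens = ['انسائیکلوپیڈیا', 'آف', 'اسلام', 'کا', 'حوالہ'];
console.log(matchAt(tokens, 0)); // 3: the full 3-word title wins over any 2-word overlap
```

The same deterministic property holds: identical tokens always produce the identical span, and the reason is always a concrete lexicon entry.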
This is the first open-source implementation of deterministic, multi-layer, N-gram Urdu compound detection in any language.
## Introducing urdu-tools
github.com/iamahsanmehmood/urdu-tools
A production-quality, zero-dependency Urdu text processing library. Available for TypeScript/JavaScript and C#/.NET, with identical APIs in both.
npm install @iamahsanmehmood/urdu-tools
dotnet add package UrduTools.Core
392 tests passing. 85 C# tests. 90%+ coverage enforced in CI.
## The compound detection API
import {
detectCompounds,
joinCompounds,
splitCompounds,
isCompound
} from '@iamahsanmehmood/urdu-tools/compound'
### Detecting compounds
// Layer 1: Affix — خانہ is a known place-suffix
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{
// text: 'کتاب خانہ',
// type: 'affix',
// components: ['کتاب', 'خانہ'],
// start: 0,
// end: 1
// }]
// Layer 1: Affix — بے is a known privative prefix
detectCompounds('بے عزت آدمی نہیں چاہیے')
// → [{ text: 'بے عزت', type: 'affix', components: ['بے', 'عزت'], ... }]
// Layer 2: Izafat — standalone و (vav-e-atf) between content words
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat', components: ['علم', 'و', 'عمل'], ... }]
// Layer 3: Lexicon — echo compound, neither word is an affix
detectCompounds('رنگ برنگے پھول کھلے ہیں')
// → [{ text: 'رنگ برنگے', type: 'lexicon', components: ['رنگ', 'برنگے'], ... }]
// Lexicon: synonym compound
detectCompounds('محنت مشقت کے بغیر کامیابی نہیں')
// → [{ text: 'محنت مشقت', type: 'lexicon', ... }]
// 3-word chain: izafat (zer on امورِ) + affix (داری suffix on خانہ)
detectCompounds('امورِ خانہ داری چلانا مشکل ہے')
// → [{ text: 'امورِ خانہ داری', type: 'affix', components: ['امورِ', 'خانہ', 'داری'], ... }]
// 3-word lexicon entry: greedy longest-match wins over any 2-word overlap
detectCompounds('انسائیکلوپیڈیا آف اسلام کا حوالہ')
// → [{ text: 'انسائیکلوپیڈیا آف اسلام', type: 'lexicon', ... }]
### The pipeline: join before tokenize
The critical downstream use case — bind compounds before tokenizing:
import { joinCompounds } from '@iamahsanmehmood/urdu-tools/compound'
import { tokenize } from '@iamahsanmehmood/urdu-tools'
const text = 'کتاب خانہ میں علم و عمل کی کتابیں ہیں'
// Without compound joining — naive tokenizer splits everything
tokenize(text)
// → ['کتاب', 'خانہ', 'میں', 'علم', 'و', 'عمل', 'کی', 'کتابیں', 'ہیں']
// ↑ split! ↑ split!
// With compound joining — semantic integrity preserved
const joined = joinCompounds(text)
// → 'کتابخانہ میں علموعمل کی کتابیں ہیں'
// ↑ ZWNJ (invisible, prevents tokenizer split)
tokenize(joined)
// → ['کتابخانہ', 'میں', 'علموعمل', 'کی', 'کتابیں', 'ہیں']
// ↑ one token ↑ one token ✓
The ZWNJ (Zero Width Non-Joiner, U+200C) is invisible but meaningful — the tokenizer sees it and keeps the word intact.
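Why the ZWNJ trick works can be shown with a toy regex tokenizer (illustrative only, not the library's tokenizer): if the word-character class includes U+200C, a ZWNJ-bound compound never splits.

```typescript
// Word characters: the Arabic Unicode block plus ZWNJ (U+200C).
const WORD = /[\u0600-\u06FF\u200C]+/g;

const split = 'کتاب خانہ'.match(WORD) ?? [];       // space-separated: two tokens
const joined = 'کتاب\u200Cخانہ'.match(WORD) ?? []; // ZWNJ-bound: one token

console.log(split.length, joined.length); // 2 1
```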
### Pair-level check
isCompound('کتاب', 'خانہ') // → { matched: true, type: 'affix' }
isCompound('محنت', 'مشقت') // → { matched: true, type: 'lexicon' }
isCompound('اخلاقِ', 'حسنہ') // → { matched: true, type: 'izafat' }
isCompound('اچھا', 'آدمی') // → { matched: false, type: null }
### Fine-grained control
// Use only specific layers
detectCompounds(text, { affix: true, izafat: false, lexicon: false })
detectCompounds(text, { affix: false, izafat: true, lexicon: true })
// Choose the binder character for joinCompounds
joinCompounds(text) // ZWNJ U+200C (default, invisible)
joinCompounds(text, { binder: 'nbsp' }) // Non-breaking space (visible)
joinCompounds(text, { binder: 'wj' }) // Word Joiner U+2060 (never line-breaks)
// Inverse — split back to spaces
splitCompounds('کتابخانہ') // → 'کتاب خانہ'
## The normalization pipeline
A 12-layer normalization pipeline — the foundation that every other module builds on.
import { normalize, fingerprint } from '@iamahsanmehmood/urdu-tools'
| Layer | What it does | Default |
|---|---|---|
| 1 — NFC | Unicode canonical form | ✅ |
| 2 — NBSP | Non-breaking space → regular space | ✅ |
| 3 — Alif Madda | آ → آ (precomposed) | ✅ |
| 4 — Numerals | ٠–٩ and ۰–۹ → ASCII 0–9 | ✅ |
| 5 — Zero-width | Strip ZWNJ, ZWJ, soft hyphen | ✅ |
| 6 — Diacritics | Strip zabar, zer, pesh, shadda, sukun, tanwin | ✅ |
| 7 — Honorifics | Strip Islamic honorific signs (ؐ ؑ ؒ ؓ ؔ) | ✅ |
| 8 — Hamza | أ → ا, ؤ → و | ✅ |
| 9 — Kashida | Strip tatweel U+0640 | ❌ |
| 10 — Presentation forms | Map U+FB50–FEFF to base chars | ❌ |
| 11 — Punctuation trim | Strip leading/trailing non-letter chars | ❌ |
| 12 — Char normalize | Arabic look-alikes → correct Urdu codepoints | ❌ |
normalize('عِلمٌ') // 'علم' (layers 1–6: diacritics stripped)
normalize('آ') // 'آ' (layer 3: Alif + Madda → precomposed)
normalize('علمہے') // 'علمہے' (layer 5: ZWNJ stripped)
normalize('نبیؐ') // 'نبی' (layer 7: honorific stripped)
// Full normalization for search indexing
normalize(userInput, {
kashida: true,
presentationForms: true,
punctuationTrim: true,
normalizeCharacters: true, // ي → ی, ك → ک, ه → ہ
})
### The fingerprint function
For client-side word comparison without database round-trips:
fingerprint('عِلمٌ') === fingerprint('عَلم') // true (both normalize to 'علم')
fingerprint('نبیؐ') === fingerprint('نبی') // true (honorific stripped)
fingerprint('علم') === fingerprint('علم') // true (ZWNJ stripped)
We use this in HamaariUrdu to compare user input against stored words in a 110,000+ word dictionary without needing a round-trip to the database for every keystroke.
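The core idea can be sketched in a few lines of plain TypeScript. This hypothetical `miniFingerprint` covers only the diacritic and zero-width layers; the library's `fingerprint` applies the full pipeline:

```typescript
// Minimal fingerprint sketch: NFC, then strip harakat and zero-width characters.
// Illustrative only; the real fingerprint() applies all default layers.
function miniFingerprint(s: string): string {
  return s
    .normalize('NFC')
    .replace(/[\u064B-\u065F\u0670]/g, '') // harakat: zabar, zer, pesh, tanwin, ...
    .replace(/[\u200C\u200D\u00AD]/g, ''); // ZWNJ, ZWJ, soft hyphen
}

console.log(miniFingerprint('عِلمٌ') === miniFingerprint('عَلم')); // true: both reduce to 'علم'
```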
## The Arabic–Urdu confusion problem
This is the single most common source of silent failures in Urdu software, and no existing library addressed it.
Three character pairs are visually identical in Naskh fonts but are different Unicode code points:
| Visual | Arabic codepoint | Urdu codepoint | Common source |
|---|---|---|---|
| ی | ي U+064A | ی U+06CC | Arabic-layout keyboards, Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic-layout keyboards |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |
A user searching for ہے typed with Arabic ه finds zero results in a database that stored it with Urdu ہ. Both look identical on screen. No error. No warning. Zero results.
import { normalizeCharacters } from '@iamahsanmehmood/urdu-tools'
normalizeCharacters('ي') // → 'ی' (U+064A → U+06CC)
normalizeCharacters('ك') // → 'ک' (U+0643 → U+06A9)
normalizeCharacters('ه') // → 'ہ' (U+0647 → U+06C1)
// Apply before storage or search indexing:
normalize(userInput, { normalizeCharacters: true })
## Progressive search matching
The search module tries 9 progressively aggressive normalization layers until it finds a match — or returns false with full diagnostic info.
import { match, fuzzyMatch, getAllNormalizations } from '@iamahsanmehmood/urdu-tools'
match('عِلمٌ', 'علم')
// → { matched: true, layer: 'strip-diacritics', normalizedQuery: 'علم', normalizedTarget: 'علم' }
match('نبیؐ', 'نبی')
// → { matched: true, layer: 'strip-honorifics', ... }
match('أحمد', 'احمد')
// → { matched: true, layer: 'normalize-hamza', ... }
match('کتاب', 'علم')
// → { matched: false, layer: null, ... }
For database lookups, getAllNormalizations() returns every normalized form to try:
const forms = getAllNormalizations('عِلمٌ')
// → ['عِلمٌ', 'عِلم', 'علم', ...] (from most specific to most aggressive)
for (const form of forms) {
const result = await db.get(form)
if (result) return result
}
Fuzzy matching uses Levenshtein + LCS hybrid (threshold 0.5):
fuzzyMatch('کتاب', ['کتابیں', 'کتب', 'علم'])
// → { candidate: 'کتابیں', score: ~0.7 }
## Numbers — South Asian scale with bigint
The South Asian number system has named units with no counterpart in the Western thousands-based system:
| Urdu | Value |
|---|---|
| ہزار | 1,000 |
| لاکھ | 100,000 |
| کروڑ | 10,000,000 |
| ارب | 1,000,000,000 |
| کھرب | 1,000,000,000,000 |
| نیل | 1,000,000,000,000,000 |
The entire module uses bigint throughout, because arithmetic at South Asian scales quickly exceeds Number.MAX_SAFE_INTEGER.
import { numberToWords, formatCurrency, toUrduNumerals, wordsToNumber } from '@iamahsanmehmood/urdu-tools'
numberToWords(0n) // 'صفر'
numberToWords(100n) // 'ایک سو'
numberToWords(100_000n) // 'ایک لاکھ'
numberToWords(10_000_000n) // 'ایک کروڑ'
numberToWords(1_000_000_000_000_000n) // 'ایک نیل'
// Ordinals with gender agreement
numberToWords(1n, { ordinal: true, gender: 'masculine' }) // 'پہلا'
numberToWords(1n, { ordinal: true, gender: 'feminine' }) // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' }) // 'گیارہواں'
numberToWords(11n, { ordinal: true, gender: 'feminine' }) // 'گیارہویں'
// Currency
formatCurrency(505.50, 'PKR') // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000, 'INR') // 'ایک ہزار روپے'
// Numeral conversion
toUrduNumerals(2024) // '۲۰۲۴'
// Inverse — parse words back to number
wordsToNumber('ایک کروڑ') // 10_000_000n
wordsToNumber('پانچ سو پانچ') // 505n
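The unit decomposition at the heart of such a conversion can be sketched with bigint division and remainder. This is an illustrative sketch using the unit values from the table above, not the library's `numberToWords` (which also handles hundreds, teens, ordinals, and gender):

```typescript
// Sketch: decompose a bigint into South Asian scale units, largest first.
const UNITS: Array<[string, bigint]> = [
  ['نیل', 1_000_000_000_000_000n],
  ['کھرب', 1_000_000_000_000n],
  ['ارب', 1_000_000_000n],
  ['کروڑ', 10_000_000n],
  ['لاکھ', 100_000n],
  ['ہزار', 1_000n],
];

function decompose(n: bigint): Array<[string, bigint]> {
  const parts: Array<[string, bigint]> = [];
  for (const [name, value] of UNITS) {
    if (n >= value) {
      parts.push([name, n / value]); // how many of this unit
      n %= value;                    // carry the remainder down
    }
  }
  if (n > 0n) parts.push(['', n]); // remainder below one thousand
  return parts;
}

console.log(decompose(10_000_000n)); // one کروڑ
console.log(decompose(150_000n));    // one لاکھ and fifty ہزار
```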
## Canonical Urdu sorting
No database and no JavaScript runtime has native Urdu collation. The 39-letter Urdu alphabet order:
ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے
import { sort, compare, sortKey } from '@iamahsanmehmood/urdu-tools'
sort(['ے', 'ا', 'ک', 'ب']) // → ['ا', 'ب', 'ک', 'ے']
sort(['زبان', 'اردو', 'بہترین']) // → ['اردو', 'بہترین', 'زبان']
// Use compare() as a comparator for any sorting context
['ے', 'ا', 'ک'].sort(compare) // → ['ا', 'ک', 'ے']
// sortKey() for indexing — diacritics stripped before key generation
sortKey('پاکستان') // '030003091102280814'
// عِلم and عَلم sort to the same position
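A comparator like this can be built directly from the alphabet order. The following is a sketch of the idea, not the library's `sortKey` implementation; characters without a rank (diacritics, punctuation) are simply skipped, which is why عِلم and عَلم land in the same position:

```typescript
// The 39-letter canonical order, concatenated into a rank table.
const ORDER = 'ءابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھیے';
const RANK = new Map<string, number>([...ORDER].map((ch, i) => [ch, i] as [string, number]));

function urduCompare(a: string, b: string): number {
  // Map each string to its sequence of letter ranks, dropping unranked chars.
  const ranks = (s: string) =>
    [...s].map(c => RANK.get(c)).filter((r): r is number => r !== undefined);
  const ka = ranks(a);
  const kb = ranks(b);
  for (let i = 0; i < Math.min(ka.length, kb.length); i++) {
    if (ka[i] !== kb[i]) return ka[i] - kb[i];
  }
  return ka.length - kb.length; // shorter word first on shared prefix
}

console.log(['ے', 'ا', 'ک', 'ب'].sort(urduCompare)); // alphabet order: ا ب ک ے
```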
In C#, UrduComparer implements IComparer<string> for native LINQ integration:
using UrduTools.Core.Sorting;
var words = new[] { "ے", "ا", "ک", "ب" };
var sorted = words.OrderBy(w => w, new UrduComparer()).ToList();
// ["ا", "ب", "ک", "ے"]
## Unicode-aware tokenization
The tokenizer handles the edge cases that matter in real Urdu text:
import { tokenize, sentences, ngrams } from '@iamahsanmehmood/urdu-tools'
tokenize('پاکستان ایک خوبصورت ملک ہے')
// → [
// { text: 'پاکستان', type: 'urdu-word' },
// { text: 'ایک', type: 'urdu-word' },
// { text: 'خوبصورت', type: 'urdu-word' },
// { text: 'ملک', type: 'urdu-word' },
// { text: 'ہے', type: 'urdu-word' },
// ]
// Sentence splitting — on ۔ (U+06D4) ؟ ! but NOT on ، or ؛
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا جملہ!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا جملہ']
// The tokenizer preserves ZWNJ within words —
// so joinCompounds() output is one token per compound
Key edge cases handled:
- Izafat Kasra (U+0650) at word boundaries is not treated as a split point
- ZWNJ-bound compounds (output of joinCompounds()) are kept as single tokens
- Mixed Urdu/Latin text is classified per-token
## Transliteration — 18 aspirated digraphs
import { toRoman, fromRoman } from '@iamahsanmehmood/urdu-tools'
toRoman('پاکستان') // 'pakistan'
toRoman('بھارت') // 'bharat'
toRoman('چھوٹا') // 'chhota'
Digraph rules (left-to-right FSM, digraph priority):
| Urdu | Roman | Urdu | Roman |
|---|---|---|---|
| بھ | bh | پھ | ph |
| تھ | th | ٹھ | Th |
| جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh |
| کھ | kh | گھ | gh |
fromRoman('pakistan') // → 'پاکستان' (trie-based longest-prefix match)
fromRoman('bharat') // → 'بھارت'
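The left-to-right digraph-priority scan can be sketched like so. The map below is a tiny illustrative subset and `toRomanSketch` is a hypothetical helper, not the library's table; note that the sketch does no vowel handling, while the published `toRoman` also inserts the short vowels (hence 'bharat' above):

```typescript
// Sketch: scan left to right, trying a two-character digraph before one character.
// Tiny illustrative map, just enough for this example.
const MAP: Record<string, string> = {
  'بھ': 'bh', 'ب': 'b', 'ا': 'a', 'ر': 'r', 'ت': 't',
};

function toRomanSketch(word: string): string {
  const chars = [...word];
  let out = '';
  for (let i = 0; i < chars.length; ) {
    const two = chars[i] + (chars[i + 1] ?? '');
    if (MAP[two] !== undefined) { out += MAP[two]; i += 2; } // digraph priority
    else { out += MAP[chars[i]] ?? chars[i]; i += 1; }
  }
  return out;
}

console.log(toRomanSketch('بھارت')); // 'bhart': digraph matched, no vowel insertion here
```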
## InPage encoding — decoding 30 years of Urdu archives
InPage was the dominant Urdu desktop publishing tool for decades. Millions of documents — newspapers, books, government archives — exist only in InPage format. The library decodes all three versions:
import { decodeInpage, detectEncoding } from '@iamahsanmehmood/urdu-tools'
// Auto-detect InPage version and decode
const result = decodeInpage(buffer, 'auto')
// result.paragraphs → string[] (Unicode Urdu text)
// result.version → 'v1' | 'v2' | 'v3'
// Explicit version
decodeInpage(buffer, 'v1') // 0x04-prefix byte-pair encoding (old InPage)
decodeInpage(buffer, 'v3') // UTF-16LE with paragraph markers
// Detect encoding from buffer alone
detectEncoding(buffer)
// → 'utf-8' | 'utf-16le' | 'windows-1256' | 'inpage-v1v2' | 'inpage-v3' | 'unknown'
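The post doesn't show detectEncoding's internals, but the first, unambiguous step of any such detector is sniffing the leading bytes. A sketch of that step only (illustrative; real detection also has to inspect byte statistics for the BOM-less cases):

```typescript
// Sketch: detect encoding from a byte-order mark / signature at the buffer start.
function sniff(buf: Uint8Array): string {
  if (buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return 'utf-8';
  if (buf[0] === 0xff && buf[1] === 0xfe) return 'utf-16le';
  return 'unknown'; // no BOM: fall through to statistical detection
}

console.log(sniff(new Uint8Array([0xff, 0xfe, 0x27, 0x06]))); // 'utf-16le'
```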
## String utilities
import { reverse, truncate, wordCount, charCount,
extractUrdu, decodeHtmlEntities } from '@iamahsanmehmood/urdu-tools'
// Reverse word order (not characters — preserves Arabic shaping)
reverse('پاکستان ہندوستان') // → 'ہندوستان پاکستان'
// Truncate at word boundary
truncate('یہ ایک بہت لمبا جملہ ہے', 10) // → 'یہ ایک...'
// Count grapheme clusters (correct for combining diacritics)
charCount('عِلم') // → 3 (ع+ِ = 1 cluster, ل, م)
// Extract Urdu/Arabic segments from mixed text
extractUrdu('The word علم means knowledge') // → ['علم']
// Decode HTML entities BEFORE normalize() — critical for TinyMCE/Quill content
decodeHtmlEntities('کتاب&rsquo;خانہ') // → 'کتاب’خانہ'
decodeHtmlEntities('علم&nbsp;ہے') // → 'علم ہے'
That last one (decodeHtmlEntities) is the fix for the TinyMCE bug mentioned at the top. Always call it before normalizing text that came from a rich text editor.
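A minimal decoder for just the two entities involved here shows why the order matters: the literal `&rsquo;` is seven ASCII characters that normalization leaves untouched. This `decodeEntities` is a hypothetical illustration; the library's `decodeHtmlEntities` covers the full entity set:

```typescript
// Illustrative: decode the two entities rich text editors most often emit.
function decodeEntities(s: string): string {
  return s
    .replace(/&rsquo;|&#8217;/g, '\u2019') // right single quotation mark
    .replace(/&nbsp;|&#160;/g, '\u00A0');  // non-breaking space
}

const fromEditor = 'کتاب&rsquo;خانہ';
console.log(decodeEntities(fromEditor)); // 'کتاب’خانہ', now safe to normalize()
```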
## Script and character analysis
import { isUrduChar, getScript, classifyChar, isRTL, getUrduDensity } from '@iamahsanmehmood/urdu-tools'
isUrduChar('پ') // true — U+067E is Urdu-specific
isUrduChar('ب') // false — U+0628 is shared with Arabic
isUrduChar('۱') // true — U+06F1 Urdu numeral
getScript('پاکستان') // 'urdu'
getScript('مرحبا') // 'arabic'
getScript('Hello پاکستان') // 'mixed'
classifyChar('پ') // 'urdu-letter'
classifyChar('َ') // 'diacritic'
classifyChar('۱') // 'numeral'
isRTL('پاکستان') // true
getUrduDensity('پاکستان زندہ') // 0.28
## C#/.NET — identical API, zero dependencies
Every function is available in UrduTools.Core with the same behavior. The C# package mirrors the TypeScript structure exactly.
using UrduTools.Core.Normalization;
using UrduTools.Core.Compound;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;
// Normalize
UrduNormalizer.Normalize("عِلمٌ"); // "علم"
UrduNormalizer.Normalize("علم"); // "علم"
// Compound detection
var spans = CompoundDetector.DetectCompounds("کتاب خانہ میں");
// spans[0].Text == "کتاب خانہ"
// spans[0].Type == CompoundType.Affix
// Numbers
NumberToWords.Convert(10_000_000); // "ایک کروڑ"
NumberToWords.Convert(1, new NumberOptions { Ordinal = true, Gender = Gender.Feminine }); // "پہلی"
// Sort
var sorted = new[] { "ے", "ا", "ک", "ب" }
.OrderBy(w => w, new UrduComparer())
.ToList(); // ["ا", "ب", "ک", "ے"]
// Progressive normalization for DB lookup
foreach (var form in UrduMatcher.GetAllNormalizations(userInput))
{
var result = await db.LookupAsync(form);
if (result is not null) return result;
}
// Match
UrduMatcher.Match("عِلمٌ", "علم").Matched; // true, layer: StripDiacritics
## Academic foundation
The compound word detection module was built on peer-reviewed Urdu linguistics research. These three works directly informed the architecture:
Jabbar, A. (2016). "Urdu Compound Words Manufacturing a State of Art."
Provides the Urdu Affix Word List (UAWL) — the definitive catalog of Urdu derivational morphemes. The 100+ affix morphemes in Layer 1 (AFFIX_SET, PREFIX_SET, SUFFIX_SET) are drawn from this work.
Rahman, M. "A Linguistic Classification of Urdu Compound Words."
Informed the typological distinctions between compound categories — specifically the Perso-Arabic vs. native Urdu origin split and vav-e-atf chain patterns. Shaped the CompoundType taxonomy and izafat heuristics.
"High Performance Stemming Algorithm to Handle Multi-Word Expressions."
Motivated the joinCompounds() + tokenize() pipeline design — the paper demonstrates that semantic integrity is best preserved by preventing erroneous splits at the input boundary, not by post-processing token sequences. Also reinforced N-gram scanning over bigram-only approaches.
## Used in production
This library is not a side project. It runs in three production systems:
| System | Type |
|---|---|
| HamaariUrdu | Urdu language learning platform — normalization, search, compound detection, numbers |
| Pakistan Academy of Letters | Government literary institution — normalization, search, sorting |
| Digital Library of PAL | Government digital Urdu archive — normalization, search, encoding |
HamaariUrdu was the origin — the library was extracted from production code where these problems were first encountered and solved. PAL and DLP integrated later for their Urdu text search and archiving systems.
## Live Playground
Every function is interactive at iamahsanmehmood.github.io/urdu-tools.
The playground includes compound reporting built-in: if you find a compound the detector misses, or a pair it wrongly detects, you can report it directly from the UI — a pre-filled GitHub issue opens in one click.
## Contributing
The compound lexicon (3,262 roots, expandable) is the highest-impact area for non-developer contributions. If you know Urdu, you can contribute without writing code:
// packages/urdu-js/src/compound/lexicon-data.ts
// Format: ['rootWord', new Set(['tail1', 'tail2'])]
['محنت', new Set(['مشقت'])],
['علم', new Set(['و ہنر', 'و عمل', 'کیمیا'])],
['انسائیکلوپیڈیا', new Set(['آف اسلام'])],
Full guide in CONTRIBUTING.md.
GitHub: github.com/iamahsanmehmood/urdu-tools
اردو سافٹ ویئر کو بہتر بنانے میں ہمارا ساتھ دیں۔
Help us make Urdu software better.
Tags: #urdu #nlp #typescript #dotnet #opensource