DEV Community

Ahsan Mehmood


I Built the First Deterministic Urdu Compound Word Detector — Here's Why It Took a Full Library to Get There

Urdu is spoken by over 230 million people. It is the national language of Pakistan, one of the 22 scheduled languages of India, and the lingua franca of a diaspora spanning three continents. And yet, if you try to build Urdu software today — real software, not a toy — you will hit the same wall every other developer hit before you: the tools do not exist.

I hit that wall building HamaariUrdu, an Urdu language learning platform. This post is about what I built to fix it.


The bugs that no library could fix

I was not looking to build a library. I was looking to ship features. But the bugs kept piling up, and none of the available Urdu NLP libraries (UrduHack, URDUNLP, or anything else) could fix them.

Bug 1: Search returning zero results for words that are obviously in the database.

The database stored ہے using the correct Urdu ہ (U+06C1, Heh Goal). The user's keyboard typed Arabic ه (U+0647, Heh). Both look completely identical on screen in Naskh fonts. But U+06C1 !== U+0647. Zero results. No error. No warning. Just silence.

Bug 2: String equality silently failing.

"قلم" === "قلم"  // false — why?!

One of those strings was copied from Microsoft Word and contains an invisible ZWNJ (Zero Width Non-Joiner, U+200C) that Word inserts automatically. You cannot see it. Your editor does not show it. But the comparison fails.

Bug 3: TinyMCE destroying Izafat.

In Urdu grammar, Izafat (اضافت) is a grammatical construction that links two words — like the English "of" but expressed as a marker on the first word. The marker is often an apostrophe-like character (U+2019, Right Single Quotation Mark).

TinyMCE — a very popular rich text editor — silently converts U+2019 to the HTML entity &rsquo; before saving. So a word like کتابِ (with Kasra) or a phrase using the Izafat apostrophe gets stored in entity-encoded form. Every compound word lookup in the database then fails because the stored form doesn't match the queried form.

Bug 4: Numbers overflowing.

Urdu text frequently references South Asian scale: لاکھ (100,000), کروڑ (10,000,000), ارب (1,000,000,000). These are real everyday numbers in Pakistan — newspaper headlines, financial documents, government statistics.

Number.MAX_SAFE_INTEGER is 9,007,199,254,740,991 — roughly nine نیل. A plain number holds a single کھرب exactly, but real arithmetic — currency in paisa, running totals across records — pushes past that ceiling fast, and JavaScript silently gives you the wrong answer.

Bug 5: Sorting broken for every Urdu word list.

No database and no JavaScript runtime has native Urdu collation. The Urdu alphabet has 39 letters in a specific order that does not match either Unicode codepoint order or any Latin-derived collation. Every sorted word list was wrong.

Bug 6 — the worst one: Compound words destroying every downstream NLP task.

This one deserves its own section.


The compound word problem

Urdu مرکب الفاظ (compound words) are multi-word expressions that function as a single semantic unit but are written with spaces between their parts.

کتاب خانہ  →  library  (کتاب = book, خانہ = place)
بے عزت     →  disrespectful  (بے = without, عزت = honor)
خوش قسمت  →  fortunate  (خوش = well, قسمت = fate)
علم و عمل  →  knowledge and practice  (fixed expression)
محنت مشقت →  hard work  (synonym compound)

A naive tokenizer sees spaces and splits them. The result:

Input:   "اس نے کتاب خانہ بنایا"
                ↑ ↑
         space between compound components

Wrong:   ['اس', 'نے', 'کتاب', 'خانہ', 'بنایا']
         (5 tokens — "library" is split into "book" + "place")

Right:   ['اس', 'نے', 'کتاب‌خانہ', 'بنایا']
         (4 tokens — "library" is one semantic unit)

The consequences ripple into every downstream NLP task:

| Task | What breaks |
|---|---|
| Search | کتاب خانہ doesn't match کتاب‌خانہ — zero results |
| NER | امورِ خانہ داری (household affairs) split into 3 unrelated tokens |
| Sentiment | بے عزت (disrespectful) vs بے + عزت — polarity lost |
| Translation | رنگ برنگے (colorful) translated as "color" + unknown |
| Word count | Every compound inflates the count with phantom tokens |

Why this is genuinely hard

Urdu compound words span four different morphological strategies simultaneously:

Strategy 1 — Affix-based: One word contains a known derivational morpheme (prefix or suffix):

کتاب + خانہ   →  library   (خانہ = "place of" suffix)
بے + عزت      →  disrespectful  (بے = "without" prefix)  
خوش + قسمت   →  fortunate  (خوش = "well" prefix)
کتاب + داری   →  librarianship  (داری = "keeping" suffix)

Strategy 2 — Izafat: A grammatical linking marker appears in the text, written or implied:

کتابِ حسنہ    (the good book)  — Zer mark (◌ِ) on first word
روحِ رواں     (driving spirit) — Hamza-above (◌ٔ) marker
علم و عمل     (knowledge and practice) — Vav-e-atf (و) connector

Strategy 3 — Lexical: Neither word is morphologically special. You simply have to know these pairs:

محنت مشقت     (hard work — synonym compound)
رنگ برنگے     (colorful — echo compound)
صبر شکر       (patient gratitude — near-synonym pair)
انسائیکلوپیڈیا آف اسلام  (3-word fixed title)

Strategy 4 — Chains: Three or more words where each link is independently valid:

امورِ خانہ داری  (household affairs — 3 words)
↑       ↑   ↑
izafat  affix  suffix

Decomposition:
امورِ + خانہ  →  izafat compound
خانہ + داری  →  affix compound
Merged:  امورِ خانہ داری  →  one 3-word compound

No statistical model trained on general text reliably covers all four strategies. They operate at different linguistic levels and require different detection mechanisms.


The approach: three deterministic layers

Where other Urdu compound detection tools exist at all, they treat this as a machine learning problem: feed training data into statistical models and hope the probabilities align.

That means:

  • Results change unpredictably between corpus versions
  • You cannot explain why a pair was or wasn't detected
  • Edge cases (literary izafat, 3-word expressions, echo words) fail silently
  • No deterministic guarantee across identical inputs

urdu-tools takes the opposite approach. Every detection is grounded in one of three verifiable, explainable rules:

Raw text
   │
   ├─► Layer 1 — Affix (UAWL)
   │       100+ known Urdu prefix/suffix morphemes
   │       خانہ  گاہ  پرست  بے  نا  خوش  شب  غم  …
   │
   ├─► Layer 2 — Izafat
   │       zer mark (◌ِ) · hamza-above (◌ٔ) · vav-e-atf (و)
   │       کتابِ حسنہ · روحِ رواں · علم و عمل
   │
   └─► Layer 3 — Lexicon
           3,262 root entries · N-word tails · greedy longest-match
           محنت مشقت · رنگ برنگے · انسائیکلوپیڈیا آف اسلام
               │
               └─► Span chaining
                       امورِ خانہ  +  خانہ داری  →  امورِ خانہ داری

The same input always produces the same output, always with a reason.

This is the first open-source implementation of deterministic, multi-layer, N-gram Urdu compound detection in any language.


Introducing urdu-tools

github.com/iamahsanmehmood/urdu-tools

A production-quality, zero-dependency Urdu text processing library. Available for TypeScript/JavaScript and C#/.NET, with identical APIs in both.

npm install @iamahsanmehmood/urdu-tools
dotnet add package UrduTools.Core

392 TypeScript tests and 85 C# tests passing, with 90%+ coverage enforced in CI.


The compound detection API

import {
  detectCompounds,
  joinCompounds,
  splitCompounds,
  isCompound
} from '@iamahsanmehmood/urdu-tools/compound'

Detecting compounds

// Layer 1: Affix — خانہ is a known place-suffix
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{
//     text: 'کتاب خانہ',
//     type: 'affix',
//     components: ['کتاب', 'خانہ'],
//     start: 0,
//     end: 1
//   }]

// Layer 1: Affix — بے is a known privative prefix
detectCompounds('بے عزت آدمی نہیں چاہیے')
// → [{ text: 'بے عزت', type: 'affix', components: ['بے', 'عزت'], ... }]

// Layer 2: Izafat — standalone و (vav-e-atf) between content words
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat', components: ['علم', 'و', 'عمل'], ... }]

// Layer 3: Lexicon — echo compound, neither word is an affix
detectCompounds('رنگ برنگے پھول کھلے ہیں')
// → [{ text: 'رنگ برنگے', type: 'lexicon', components: ['رنگ', 'برنگے'], ... }]

// Lexicon: synonym compound
detectCompounds('محنت مشقت کے بغیر کامیابی نہیں')
// → [{ text: 'محنت مشقت', type: 'lexicon', ... }]

// 3-word chain: izafat (zer on امورِ) + affix (داری suffix on خانہ)
detectCompounds('امورِ خانہ داری چلانا مشکل ہے')
// → [{ text: 'امورِ خانہ داری', type: 'affix', components: ['امورِ', 'خانہ', 'داری'], ... }]

// 3-word lexicon entry: greedy longest-match wins over any 2-word overlap
detectCompounds('انسائیکلوپیڈیا آف اسلام کا حوالہ')
// → [{ text: 'انسائیکلوپیڈیا آف اسلام', type: 'lexicon', ... }]

The pipeline: join before tokenize

The critical downstream use case — bind compounds before tokenizing:

import { joinCompounds } from '@iamahsanmehmood/urdu-tools/compound'
import { tokenize } from '@iamahsanmehmood/urdu-tools'

const text = 'کتاب خانہ میں علم و عمل کی کتابیں ہیں'

// Without compound joining — naive tokenizer splits everything
tokenize(text)
// → ['کتاب', 'خانہ', 'میں', 'علم', 'و', 'عمل', 'کی', 'کتابیں', 'ہیں']
//    ↑ split!                 ↑ split!

// With compound joining — semantic integrity preserved
const joined = joinCompounds(text)
// → 'کتاب‌خانہ میں علم‌و‌عمل کی کتابیں ہیں'
//          ↑ ZWNJ (invisible, prevents tokenizer split)

tokenize(joined)
// → ['کتاب‌خانہ', 'میں', 'علم‌و‌عمل', 'کی', 'کتابیں', 'ہیں']
//    ↑ one token            ↑ one token  ✓

The ZWNJ (Zero Width Non-Joiner, U+200C) is invisible but meaningful — the tokenizer sees it and keeps the word intact.
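Even a naive whitespace split respects the binder — a quick check:

```javascript
// ZWNJ (U+200C) is not whitespace, so split(' ') keeps the compound whole
console.log('کتاب\u200Cخانہ میں'.split(' ').length); // 2 tokens
console.log('کتاب خانہ میں'.split(' ').length);      // 3 tokens
```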

Pair-level check

isCompound('کتاب', 'خانہ')    // → { matched: true,  type: 'affix'   }
isCompound('محنت', 'مشقت')    // → { matched: true,  type: 'lexicon' }
isCompound('اخلاقِ', 'حسنہ')  // → { matched: true,  type: 'izafat' }
isCompound('اچھا', 'آدمی')    // → { matched: false, type: null      }

Fine-grained control

// Use only specific layers
detectCompounds(text, { affix: true, izafat: false, lexicon: false })
detectCompounds(text, { affix: false, izafat: true, lexicon: true })

// Choose the binder character for joinCompounds
joinCompounds(text)                      // ZWNJ U+200C (default, invisible)
joinCompounds(text, { binder: 'nbsp' })  // Non-breaking space (visible)
joinCompounds(text, { binder: 'wj' })   // Word Joiner U+2060 (never line-breaks)

// Inverse — split back to spaces
splitCompounds('کتاب‌خانہ')  // → 'کتاب خانہ'

The normalization pipeline

A 12-layer normalization pipeline — the foundation that every other module builds on.

import { normalize, fingerprint } from '@iamahsanmehmood/urdu-tools'
| Layer | What it does |
|---|---|
| 1 — NFC | Unicode canonical composition |
| 2 — NBSP | Non-breaking space → regular space |
| 3 — Alif Madda | ا + ◌ٓ (decomposed) → آ (precomposed) |
| 4 — Numerals | ٠–٩ and ۰–۹ → ASCII 0–9 |
| 5 — Zero-width | Strip ZWNJ, ZWJ, soft hyphen |
| 6 — Diacritics | Strip zabar, zer, pesh, shadda, sukun, tanwin |
| 7 — Honorifics | Strip Islamic honorific signs (ؐ ؑ ؒ ؓ ؔ) |
| 8 — Hamza | أ → ا, ؤ → و |
| 9 — Kashida | Strip tatweel U+0640 |
| 10 — Presentation forms | Map U+FB50–U+FEFF to base characters |
| 11 — Punctuation trim | Strip leading/trailing non-letter characters |
| 12 — Char normalize | Arabic look-alikes → correct Urdu code points |
normalize('عِلمٌ')                    // 'علم'  (layers 1–6: diacritics stripped)
normalize('آ')             // 'آ'    (layer 3: Alif + Madda → precomposed)
normalize('علم‌ہے')             // 'علمہے' (layer 5: ZWNJ stripped)
normalize('نبیؐ')                    // 'نبی'  (layer 7: honorific stripped)

// Full normalization for search indexing
normalize(userInput, {
  kashida: true,
  presentationForms: true,
  punctuationTrim: true,
  normalizeCharacters: true,   // ي → ی, ك → ک, ه → ہ
})

The fingerprint function

For client-side word comparison without database round-trips:

fingerprint('عِلمٌ') === fingerprint('عَلم')   // true (both normalize to 'علم')
fingerprint('نبیؐ') === fingerprint('نبی')     // true (honorific stripped)
fingerprint('علم‌') === fingerprint('علم') // true (ZWNJ stripped)

We use this in HamaariUrdu to compare user input against stored words in a 110,000+ word dictionary without needing a round-trip to the database for every keystroke.


The Arabic–Urdu confusion problem

This is the single most common source of silent failures in Urdu software, and no existing library addressed it.

Three character pairs are visually identical in Naskh fonts but are different Unicode code points:

| Visual | Arabic code point | Urdu code point | Common source |
|---|---|---|---|
| ی | ي U+064A | ی U+06CC | Arabic-layout keyboards, Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic-layout keyboards |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |

A user searching for ہے typed with Arabic ه finds zero results in a database that stored it with Urdu ہ. Both look identical on screen. No error. No warning. Zero results.

import { normalizeCharacters } from '@iamahsanmehmood/urdu-tools'

normalizeCharacters('ي')  // → 'ی'  (U+064A → U+06CC)
normalizeCharacters('ك')  // → 'ک'  (U+0643 → U+06A9)
normalizeCharacters('ه')  // → 'ہ'  (U+0647 → U+06C1)

// Apply before storage or search indexing:
normalize(userInput, { normalizeCharacters: true })

Progressive search matching

The search module tries 9 progressively aggressive normalization layers until it finds a match — or returns false with full diagnostic info.

import { match, fuzzyMatch, getAllNormalizations } from '@iamahsanmehmood/urdu-tools'

match('عِلمٌ', 'علم')
// → { matched: true, layer: 'strip-diacritics', normalizedQuery: 'علم', normalizedTarget: 'علم' }

match('نبیؐ', 'نبی')
// → { matched: true, layer: 'strip-honorifics', ... }

match('أحمد', 'احمد')
// → { matched: true, layer: 'normalize-hamza', ... }

match('کتاب', 'علم')
// → { matched: false, layer: null, ... }

For database lookups, getAllNormalizations() returns every normalized form to try:

const forms = getAllNormalizations('عِلمٌ')
// → ['عِلمٌ', 'عِلم', 'علم', ...]  (from most specific to most aggressive)

for (const form of forms) {
  const result = await db.get(form)
  if (result) return result
}

Fuzzy matching uses Levenshtein + LCS hybrid (threshold 0.5):

fuzzyMatch('کتاب', ['کتابیں', 'کتب', 'علم'])
// → { candidate: 'کتابیں', score: ~0.7 }

Numbers — South Asian scale with bigint

The South Asian number system has named units with no Western counterparts:

| Urdu | Value |
|---|---|
| ہزار | 1,000 |
| لاکھ | 100,000 |
| کروڑ | 10,000,000 |
| ارب | 1,000,000,000 |
| کھرب | 1,000,000,000,000 |
| نیل | 1,000,000,000,000,000 |

The module uses bigint throughout — South Asian magnitudes can exceed Number.MAX_SAFE_INTEGER.

import { numberToWords, formatCurrency, toUrduNumerals, wordsToNumber } from '@iamahsanmehmood/urdu-tools'

numberToWords(0n)                      // 'صفر'
numberToWords(100n)                    // 'ایک سو'
numberToWords(100_000n)                // 'ایک لاکھ'
numberToWords(10_000_000n)             // 'ایک کروڑ'
numberToWords(1_000_000_000_000_000n)  // 'ایک نیل'

// Ordinals with gender agreement
numberToWords(1n, { ordinal: true, gender: 'masculine' })  // 'پہلا'
numberToWords(1n, { ordinal: true, gender: 'feminine' })   // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' }) // 'گیارہواں'
numberToWords(11n, { ordinal: true, gender: 'feminine' })  // 'گیارہویں'

// Currency
formatCurrency(505.50, 'PKR')  // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000, 'INR')    // 'ایک ہزار روپے'

// Numeral conversion
toUrduNumerals(2024)    // '۲۰۲۴'

// Inverse — parse words back to number
wordsToNumber('ایک کروڑ')     // 10_000_000n
wordsToNumber('پانچ سو پانچ') // 505n

Canonical Urdu sorting

No database and no JavaScript runtime has native Urdu collation. The 39-letter Urdu alphabet order:

ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے
import { sort, compare, sortKey } from '@iamahsanmehmood/urdu-tools'

sort(['ے', 'ا', 'ک', 'ب'])           // → ['ا', 'ب', 'ک', 'ے']
sort(['زبان', 'اردو', 'بہترین'])      // → ['اردو', 'بہترین', 'زبان']

// Use compare() as a comparator for any sorting context
['ے', 'ا', 'ک'].sort(compare)        // → ['ا', 'ک', 'ے']

// sortKey() for indexing — diacritics stripped before key generation
sortKey('پاکستان')   // '030003091102280814'
// عِلم and عَلم sort to the same position

In C#, UrduComparer implements IComparer&lt;string&gt; for native LINQ integration:

using UrduTools.Core.Sorting;

var words = new[] { "ے", "ا", "ک", "ب" };
var sorted = words.OrderBy(w => w, new UrduComparer()).ToList();
// ["ا", "ب", "ک", "ے"]

Unicode-aware tokenization

The tokenizer handles the edge cases that matter in real Urdu text:

import { tokenize, sentences, ngrams } from '@iamahsanmehmood/urdu-tools'

tokenize('پاکستان ایک خوبصورت ملک ہے')
// → [
//   { text: 'پاکستان', type: 'urdu-word' },
//   { text: 'ایک',     type: 'urdu-word' },
//   { text: 'خوبصورت', type: 'urdu-word' },
//   { text: 'ملک',     type: 'urdu-word' },
//   { text: 'ہے',      type: 'urdu-word' },
// ]

// Sentence splitting — on ۔ (U+06D4) ؟ ! but NOT on ، or ؛
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا جملہ!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا جملہ']

// The tokenizer preserves ZWNJ within words —
// so joinCompounds() output is one token per compound

Key edge cases handled:

  • Izafat Kasra (U+0650) at word boundaries is not treated as a split point
  • ZWNJ-bound compounds (output of joinCompounds()) are kept as single tokens
  • Mixed Urdu/Latin text is classified per-token

Transliteration — 18 aspirated digraphs

import { toRoman, fromRoman } from '@iamahsanmehmood/urdu-tools'

toRoman('پاکستان')   // 'pakistan'
toRoman('بھارت')     // 'bharat'
toRoman('چھوٹا')     // 'chhota'

Digraph rules (left-to-right FSM, digraph priority):

| Urdu | Roman | Urdu | Roman |
|---|---|---|---|
| بھ | bh | پھ | ph |
| تھ | th | ٹھ | Th |
| جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh |
| کھ | kh | گھ | gh |
fromRoman('pakistan')  // → 'پاکستان' (trie-based longest-prefix match)
fromRoman('bharat')    // → 'بھارت'

InPage encoding — decoding 30 years of Urdu archives

InPage was the dominant Urdu desktop publishing tool for decades. Millions of documents — newspapers, books, government archives — exist only in InPage format. The library decodes all three versions:

import { decodeInpage, detectEncoding } from '@iamahsanmehmood/urdu-tools'

// Auto-detect InPage version and decode
const result = decodeInpage(buffer, 'auto')
// result.paragraphs → string[]  (Unicode Urdu text)
// result.version   → 'v1' | 'v2' | 'v3'

// Explicit version
decodeInpage(buffer, 'v1')  // 0x04-prefix byte-pair encoding (old InPage)
decodeInpage(buffer, 'v3')  // UTF-16LE with paragraph markers

// Detect encoding from buffer alone
detectEncoding(buffer)
// → 'utf-8' | 'utf-16le' | 'windows-1256' | 'inpage-v1v2' | 'inpage-v3' | 'unknown'

String utilities

import { reverse, truncate, wordCount, charCount,
         extractUrdu, decodeHtmlEntities } from '@iamahsanmehmood/urdu-tools'

// Reverse word order (not characters — preserves Arabic shaping)
reverse('پاکستان ہندوستان')      // → 'ہندوستان پاکستان'

// Truncate at word boundary
truncate('یہ ایک بہت لمبا جملہ ہے', 10)  // → 'یہ ایک...'

// Count grapheme clusters (correct for combining diacritics)
charCount('عِلم')   // → 3  (ع+ِ = 1 cluster, ل, م)

// Extract Urdu/Arabic segments from mixed text
extractUrdu('The word علم means knowledge')  // → ['علم']

// Decode HTML entities BEFORE normalize() — critical for TinyMCE/Quill content
decodeHtmlEntities('کتاب&rsquo;خانہ')  // → 'کتاب’خانہ'
decodeHtmlEntities('علم&nbsp;ہے')      // → 'علم ہے'

That last one (decodeHtmlEntities) is the fix for the TinyMCE bug mentioned at the top. Always call it before normalizing text that came from a rich text editor.


Script and character analysis

import { isUrduChar, getScript, classifyChar, isRTL, getUrduDensity } from '@iamahsanmehmood/urdu-tools'

isUrduChar('پ')  // true  — U+067E is Urdu-specific
isUrduChar('ب')  // false — U+0628 is shared with Arabic
isUrduChar('۱')  // true  — U+06F1 Urdu numeral

getScript('پاکستان')          // 'urdu'
getScript('مرحبا')             // 'arabic'
getScript('Hello پاکستان')    // 'mixed'

classifyChar('پ')   // 'urdu-letter'
classifyChar('َ')   // 'diacritic'
classifyChar('۱')   // 'numeral'

isRTL('پاکستان')               // true
getUrduDensity('پاکستان زندہ') // 0.28

C#/.NET — identical API, zero dependencies

Every function is available in UrduTools.Core with the same behavior. The C# package mirrors the TypeScript structure exactly.

using UrduTools.Core.Normalization;
using UrduTools.Core.Compound;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;

// Normalize
UrduNormalizer.Normalize("عِلمٌ");                          // "علم"
UrduNormalizer.Normalize("علم‌");                      // "علم"

// Compound detection
var spans = CompoundDetector.DetectCompounds("کتاب خانہ میں");
// spans[0].Text == "کتاب خانہ"
// spans[0].Type == CompoundType.Affix

// Numbers
NumberToWords.Convert(10_000_000);  // "ایک کروڑ"
NumberToWords.Convert(1, new NumberOptions { Ordinal = true, Gender = Gender.Feminine });  // "پہلی"

// Sort
var sorted = new[] { "ے", "ا", "ک", "ب" }
    .OrderBy(w => w, new UrduComparer())
    .ToList();  // ["ا", "ب", "ک", "ے"]

// Progressive normalization for DB lookup
foreach (var form in UrduMatcher.GetAllNormalizations(userInput))
{
    var result = await db.LookupAsync(form);
    if (result is not null) return result;
}

// Match
UrduMatcher.Match("عِلمٌ", "علم").Matched;  // true, layer: StripDiacritics

Academic foundation

The compound word detection module was built on peer-reviewed Urdu linguistics research. These three works directly informed the architecture:

Jabbar, A. (2016). "Urdu Compound Words Manufacturing a State of Art."
Provides the Urdu Affix Word List (UAWL) — the definitive catalog of Urdu derivational morphemes. The 100+ affix morphemes in Layer 1 (AFFIX_SET, PREFIX_SET, SUFFIX_SET) are drawn from this work.

Rahman, M. "A Linguistic Classification of Urdu Compound Words."
Informed the typological distinctions between compound categories — specifically the Perso-Arabic vs. native Urdu origin split and vav-e-atf chain patterns. Shaped the CompoundType taxonomy and izafat heuristics.

"High Performance Stemming Algorithm to Handle Multi-Word Expressions."
Motivated the joinCompounds() + tokenize() pipeline design — the paper demonstrates that semantic integrity is best preserved by preventing erroneous splits at the input boundary, not by post-processing token sequences. Also reinforced N-gram scanning over bigram-only approaches.


Used in production

This library is not a side project. It runs in three production systems:

| System | Description |
|---|---|
| HamaariUrdu | Urdu language learning platform — normalization, search, compound detection, numbers |
| Pakistan Academy of Letters | Government literary institution — normalization, search, sorting |
| Digital Library of PAL | Government digital Urdu archive — normalization, search, encoding |

HamaariUrdu was the origin — the library was extracted from production code where these problems were first encountered and solved. PAL and DLP integrated later for their Urdu text search and archiving systems.


Live Playground

Every function is interactive at iamahsanmehmood.github.io/urdu-tools.

The playground includes compound reporting built-in: if you find a compound the detector misses, or a pair it wrongly detects, you can report it directly from the UI — a pre-filled GitHub issue opens in one click.


Contributing

The compound lexicon (3,262 roots, expandable) is the highest-impact area for non-developer contributions. If you know Urdu, you can contribute without writing code:

// packages/urdu-js/src/compound/lexicon-data.ts
// Format: ['rootWord', new Set(['tail1', 'tail2'])]

['محنت', new Set(['مشقت'])],
['علم', new Set(['و ہنر', 'و عمل', 'کیمیا'])],
['انسائیکلوپیڈیا', new Set(['آف اسلام'])],

Full guide in CONTRIBUTING.md.

GitHub: github.com/iamahsanmehmood/urdu-tools


اردو سافٹ ویئر کو بہتر بنانے میں ہمارا ساتھ دیں۔
Help us make Urdu software better.


Tags: #urdu #nlp #typescript #dotnet #opensource
