DEV Community

Ahsan Mehmood


I Built the First Deterministic Urdu Compound Word Detector — Here's Why It Took a Full Library to Get There

Urdu is spoken by over 230 million people. It is the national language of Pakistan, one of the 22 scheduled languages of India, and the lingua franca of a diaspora spanning three continents. And yet, if you try to build Urdu software today — real software, not a toy — you will hit the same wall every other developer hit before you: the tools do not exist.

I hit that wall building HamaariUrdu, an Urdu language learning platform. This post is about what I built to fix it.


The bugs that no library could fix

I was not looking to build a library. I was looking to ship features. But the bugs kept piling up, and none of the available Urdu NLP libraries (UrduHack, URDUNLP, or anything else) could fix them.

Bug 1: Search returning zero results for words that are obviously in the database.

The database stored ہے using the correct Urdu ہ (U+06C1, Heh Goal). The user's keyboard typed Arabic ه (U+0647, Heh). Both look completely identical on screen in Naskh fonts. But U+06C1 !== U+0647. Zero results. No error. No warning. Just silence.

Bug 2: String equality silently failing.

"قلم" === "قلم"  // false — why?!

One of those strings was copied from Microsoft Word and contains an invisible ZWNJ (Zero Width Non-Joiner, U+200C) that Word inserts automatically. You cannot see it. Your editor does not show it. But the comparison fails.

Bug 3: TinyMCE destroying Izafat.

In Urdu grammar, Izafat (اضافت) is a grammatical construction that links two words — like the English "of" but expressed as a marker on the first word. The marker is often an apostrophe-like character (U+2019, Right Single Quotation Mark).

TinyMCE — a very popular rich text editor — silently converts U+2019 to the HTML entity &rsquo; before saving. So a word like کتابِ (with Kasra) or a phrase using the Izafat apostrophe gets stored in entity-encoded form. Every compound word lookup in the database then fails because the stored form doesn't match the queried form.

Bug 4: Numbers overflowing.

Urdu text frequently references South Asian scale: لاکھ (100,000), کروڑ (10,000,000), ارب (1,000,000,000). These are real everyday numbers in Pakistan — newspaper headlines, financial documents, government statistics.

Number.MAX_SAFE_INTEGER is 9,007,199,254,740,991 — roughly nine نیل. A plain number holds a single کھرب exactly, but real arithmetic — currency in paisa, running totals across records — pushes past that ceiling fast, and JavaScript silently gives you the wrong answer.

Bug 5: Sorting broken for every Urdu word list.

No database and no JavaScript runtime has native Urdu collation. The Urdu alphabet has 39 letters in a specific order that does not match either Unicode codepoint order or any Latin-derived collation. Every sorted word list was wrong.

Bug 6 — the worst one: Compound words destroying every downstream NLP task.

This one deserves its own section.


The compound word problem

Urdu مرکب الفاظ (compound words) are multi-word expressions that function as a single semantic unit but are written with spaces between their parts.

کتاب خانہ  →  library  (کتاب = book, خانہ = place)
بے عزت     →  disrespectful  (بے = without, عزت = honor)
خوش قسمت  →  fortunate  (خوش = well, قسمت = fate)
علم و عمل  →  knowledge and practice  (fixed expression)
محنت مشقت →  hard work  (synonym compound)

A naive tokenizer sees spaces and splits them. The result:

Input:   "اس نے کتاب خانہ بنایا"
                ↑ ↑
         space between compound components

Wrong:   ['اس', 'نے', 'کتاب', 'خانہ', 'بنایا']
         (5 tokens — "library" is split into "book" + "place")

Right:   ['اس', 'نے', 'کتاب‌خانہ', 'بنایا']
         (4 tokens — "library" is one semantic unit)

The consequences ripple into every downstream NLP task:

| Task | What breaks |
|---|---|
| Search | کتاب خانہ doesn't match کتاب‌خانہ — zero results |
| NER | امورِ خانہ داری (household affairs) split into 3 unrelated tokens |
| Sentiment | بے عزت (disrespectful) vs بے + عزت — polarity lost |
| Translation | رنگ برنگے (colorful) translated as "color" + unknown |
| Word count | Every compound inflates the count with phantom tokens |

Why this is genuinely hard

Urdu compound words span four different morphological strategies simultaneously:

Strategy 1 — Affix-based: One word contains a known derivational morpheme (prefix or suffix):

کتاب + خانہ   →  library   (خانہ = "place of" suffix)
بے + عزت      →  disrespectful  (بے = "without" prefix)  
خوش + قسمت   →  fortunate  (خوش = "well" prefix)
کتاب + داری   →  librarianship  (داری = "keeping" suffix)

Strategy 2 — Izafat: A grammatical linking marker appears in the text, written or implied:

کتابِ حسنہ    (the good book)  — Zer mark (◌ِ) on first word
روحِ رواں     (driving spirit) — Hamza-above (◌ٔ) marker
علم و عمل     (knowledge and practice) — Vav-e-atf (و) connector

Strategy 3 — Lexical: Neither word is morphologically special. You simply have to know these pairs:

محنت مشقت     (hard work — synonym compound)
رنگ برنگے     (colorful — echo compound)
صبر شکر       (patient gratitude — near-synonym pair)
انسائیکلوپیڈیا آف اسلام  (3-word fixed title)

Strategy 4 — Chains: Three or more words where each link is independently valid:

امورِ خانہ داری  (household affairs — 3 words)
↑       ↑   ↑
izafat  affix  suffix

Decomposition:
امورِ + خانہ  →  izafat compound
خانہ + داری  →  affix compound
Merged:  امورِ خانہ داری  →  one 3-word compound

No statistical model trained on general text reliably covers all four strategies. They operate at different linguistic levels and require different detection mechanisms.


The approach: three deterministic layers

Where other Urdu compound detection tools exist at all, they treat this as a machine learning problem: feed training data into statistical models and hope the probabilities align.

That means:

  • Results change unpredictably between corpus versions
  • You cannot explain why a pair was or wasn't detected
  • Edge cases (literary izafat, 3-word expressions, echo words) fail silently
  • No deterministic guarantee across identical inputs

urdu-tools takes the opposite approach. Every detection is grounded in one of three verifiable, explainable rules:

Raw text
   │
   ├─► Layer 1 — Affix (UAWL)
   │       100+ known Urdu prefix/suffix morphemes
   │       خانہ  گاہ  پرست  بے  نا  خوش  شب  غم  …
   │
   ├─► Layer 2 — Izafat
   │       zer mark (◌ِ) · hamza-above (◌ٔ) · vav-e-atf (و)
   │       کتابِ حسنہ · روحِ رواں · علم و عمل
   │
   └─► Layer 3 — Lexicon
           3,262 root entries · N-word tails · greedy longest-match
           محنت مشقت · رنگ برنگے · انسائیکلوپیڈیا آف اسلام
               │
               └─► Span chaining
                       امورِ خانہ  +  خانہ داری  →  امورِ خانہ داری

The same input always produces the same output, always with a reason.

This is the first open-source implementation of deterministic, multi-layer, N-gram Urdu compound detection in any language.


Introducing urdu-tools

github.com/iamahsanmehmood/urdu-tools

A production-quality, zero-dependency Urdu text processing library. Available for TypeScript/JavaScript and C#/.NET, with identical APIs in both.

npm install @iamahsanmehmood/urdu-tools
dotnet add package UrduTools.Core

392 TypeScript tests and 85 C# tests passing, with 90%+ coverage enforced in CI.


The compound detection API

import {
  detectCompounds,
  joinCompounds,
  splitCompounds,
  isCompound
} from '@iamahsanmehmood/urdu-tools/compound'

Detecting compounds

// Layer 1: Affix — خانہ is a known place-suffix
detectCompounds('کتاب خانہ بہت اچھا ہے')
// → [{
//     text: 'کتاب خانہ',
//     type: 'affix',
//     components: ['کتاب', 'خانہ'],
//     start: 0,
//     end: 1
//   }]

// Layer 1: Affix — بے is a known privative prefix
detectCompounds('بے عزت آدمی نہیں چاہیے')
// → [{ text: 'بے عزت', type: 'affix', components: ['بے', 'عزت'], ... }]

// Layer 2: Izafat — standalone و (vav-e-atf) between content words
detectCompounds('علم و عمل ضروری ہے')
// → [{ text: 'علم و عمل', type: 'izafat', components: ['علم', 'و', 'عمل'], ... }]

// Layer 3: Lexicon — echo compound, neither word is an affix
detectCompounds('رنگ برنگے پھول کھلے ہیں')
// → [{ text: 'رنگ برنگے', type: 'lexicon', components: ['رنگ', 'برنگے'], ... }]

// Lexicon: synonym compound
detectCompounds('محنت مشقت کے بغیر کامیابی نہیں')
// → [{ text: 'محنت مشقت', type: 'lexicon', ... }]

// 3-word chain: izafat (zer on امورِ) + affix (داری suffix on خانہ)
detectCompounds('امورِ خانہ داری چلانا مشکل ہے')
// → [{ text: 'امورِ خانہ داری', type: 'affix', components: ['امورِ', 'خانہ', 'داری'], ... }]

// 3-word lexicon entry: greedy longest-match wins over any 2-word overlap
detectCompounds('انسائیکلوپیڈیا آف اسلام کا حوالہ')
// → [{ text: 'انسائیکلوپیڈیا آف اسلام', type: 'lexicon', ... }]

The pipeline: join before tokenize

The critical downstream use case — bind compounds before tokenizing:

import { joinCompounds } from '@iamahsanmehmood/urdu-tools/compound'
import { tokenize } from '@iamahsanmehmood/urdu-tools'

const text = 'کتاب خانہ میں علم و عمل کی کتابیں ہیں'

// Without compound joining — naive tokenizer splits everything
tokenize(text)
// → ['کتاب', 'خانہ', 'میں', 'علم', 'و', 'عمل', 'کی', 'کتابیں', 'ہیں']
//    ↑ split!                 ↑ split!

// With compound joining — semantic integrity preserved
const joined = joinCompounds(text)
// → 'کتاب‌خانہ میں علم‌و‌عمل کی کتابیں ہیں'
//          ↑ ZWNJ (invisible, prevents tokenizer split)

tokenize(joined)
// → ['کتاب‌خانہ', 'میں', 'علم‌و‌عمل', 'کی', 'کتابیں', 'ہیں']
//    ↑ one token            ↑ one token  ✓

The ZWNJ (Zero Width Non-Joiner, U+200C) is invisible but meaningful — the tokenizer sees it and keeps the word intact.
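Even a naive whitespace split respects the binder — a quick check:

```javascript
// ZWNJ (U+200C) is not whitespace, so split(' ') keeps the compound whole
console.log('کتاب\u200Cخانہ میں'.split(' ').length); // 2 tokens
console.log('کتاب خانہ میں'.split(' ').length);      // 3 tokens
```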

Pair-level check

isCompound('کتاب', 'خانہ')    // → { matched: true,  type: 'affix'   }
isCompound('محنت', 'مشقت')    // → { matched: true,  type: 'lexicon' }
isCompound('اخلاقِ', 'حسنہ')  // → { matched: true,  type: 'izafat' }
isCompound('اچھا', 'آدمی')    // → { matched: false, type: null      }

Fine-grained control

// Use only specific layers
detectCompounds(text, { affix: true, izafat: false, lexicon: false })
detectCompounds(text, { affix: false, izafat: true, lexicon: true })

// Choose the binder character for joinCompounds
joinCompounds(text)                      // ZWNJ U+200C (default, invisible)
joinCompounds(text, { binder: 'nbsp' })  // Non-breaking space (visible)
joinCompounds(text, { binder: 'wj' })   // Word Joiner U+2060 (never line-breaks)

// Inverse — split back to spaces
splitCompounds('کتاب‌خانہ')  // → 'کتاب خانہ'

The normalization pipeline

A 12-layer normalization pipeline — the foundation that every other module builds on.

import { normalize, fingerprint } from '@iamahsanmehmood/urdu-tools'
| Layer | What it does |
|---|---|
| 1 — NFC | Unicode canonical composition |
| 2 — NBSP | Non-breaking space → regular space |
| 3 — Alif Madda | ا + ◌ٓ (decomposed) → آ (precomposed) |
| 4 — Numerals | ٠–٩ and ۰–۹ → ASCII 0–9 |
| 5 — Zero-width | Strip ZWNJ, ZWJ, soft hyphen |
| 6 — Diacritics | Strip zabar, zer, pesh, shadda, sukun, tanwin |
| 7 — Honorifics | Strip Islamic honorific signs (ؐ ؑ ؒ ؓ ؔ) |
| 8 — Hamza | أ → ا, ؤ → و |
| 9 — Kashida | Strip tatweel U+0640 |
| 10 — Presentation forms | Map U+FB50–U+FEFF to base characters |
| 11 — Punctuation trim | Strip leading/trailing non-letter characters |
| 12 — Char normalize | Arabic look-alikes → correct Urdu code points |
normalize('عِلمٌ')                    // 'علم'  (layers 1–6: diacritics stripped)
normalize('آ')             // 'آ'    (layer 3: Alif + Madda → precomposed)
normalize('علم‌ہے')             // 'علمہے' (layer 5: ZWNJ stripped)
normalize('نبیؐ')                    // 'نبی'  (layer 7: honorific stripped)

// Full normalization for search indexing
normalize(userInput, {
  kashida: true,
  presentationForms: true,
  punctuationTrim: true,
  normalizeCharacters: true,   // ي → ی, ك → ک, ه → ہ
})

The fingerprint function

For client-side word comparison without database round-trips:

fingerprint('عِلمٌ') === fingerprint('عَلم')   // true (both normalize to 'علم')
fingerprint('نبیؐ') === fingerprint('نبی')     // true (honorific stripped)
fingerprint('علم‌') === fingerprint('علم') // true (ZWNJ stripped)

We use this in HamaariUrdu to compare user input against stored words in a 110,000+ word dictionary without needing a round-trip to the database for every keystroke.


The Arabic–Urdu confusion problem

This is the single most common source of silent failures in Urdu software, and no existing library addressed it.

Three character pairs are visually identical in Naskh fonts but are different Unicode code points:

| Visual | Arabic code point | Urdu code point | Common source |
|---|---|---|---|
| ی | ي U+064A | ی U+06CC | Arabic-layout keyboards, Arabic websites |
| ک | ك U+0643 | ک U+06A9 | Arabic-layout keyboards |
| ہ | ه U+0647 | ہ U+06C1 | Arabic text pasted into Urdu context |

A user searching for ہے typed with Arabic ه finds zero results in a database that stored it with Urdu ہ. Both look identical on screen. No error. No warning. Zero results.

import { normalizeCharacters } from '@iamahsanmehmood/urdu-tools'

normalizeCharacters('ي')  // → 'ی'  (U+064A → U+06CC)
normalizeCharacters('ك')  // → 'ک'  (U+0643 → U+06A9)
normalizeCharacters('ه')  // → 'ہ'  (U+0647 → U+06C1)

// Apply before storage or search indexing:
normalize(userInput, { normalizeCharacters: true })

Progressive search matching

The search module tries 9 progressively aggressive normalization layers until it finds a match — or returns false with full diagnostic info.

import { match, fuzzyMatch, getAllNormalizations } from '@iamahsanmehmood/urdu-tools'

match('عِلمٌ', 'علم')
// → { matched: true, layer: 'strip-diacritics', normalizedQuery: 'علم', normalizedTarget: 'علم' }

match('نبیؐ', 'نبی')
// → { matched: true, layer: 'strip-honorifics', ... }

match('أحمد', 'احمد')
// → { matched: true, layer: 'normalize-hamza', ... }

match('کتاب', 'علم')
// → { matched: false, layer: null, ... }

For database lookups, getAllNormalizations() returns every normalized form to try:

const forms = getAllNormalizations('عِلمٌ')
// → ['عِلمٌ', 'عِلم', 'علم', ...]  (from most specific to most aggressive)

for (const form of forms) {
  const result = await db.get(form)
  if (result) return result
}

Fuzzy matching uses Levenshtein + LCS hybrid (threshold 0.5):

fuzzyMatch('کتاب', ['کتابیں', 'کتب', 'علم'])
// → { candidate: 'کتابیں', score: ~0.7 }

Numbers — South Asian scale with bigint

The South Asian number system has named units with no Western counterparts:

| Urdu | Value |
|---|---|
| ہزار | 1,000 |
| لاکھ | 100,000 |
| کروڑ | 10,000,000 |
| ارب | 1,000,000,000 |
| کھرب | 1,000,000,000,000 |
| نیل | 1,000,000,000,000,000 |

The module uses bigint throughout — South Asian magnitudes can exceed Number.MAX_SAFE_INTEGER.

import { numberToWords, formatCurrency, toUrduNumerals, wordsToNumber } from '@iamahsanmehmood/urdu-tools'

numberToWords(0n)                      // 'صفر'
numberToWords(100n)                    // 'ایک سو'
numberToWords(100_000n)                // 'ایک لاکھ'
numberToWords(10_000_000n)             // 'ایک کروڑ'
numberToWords(1_000_000_000_000_000n)  // 'ایک نیل'

// Ordinals with gender agreement
numberToWords(1n, { ordinal: true, gender: 'masculine' })  // 'پہلا'
numberToWords(1n, { ordinal: true, gender: 'feminine' })   // 'پہلی'
numberToWords(11n, { ordinal: true, gender: 'masculine' }) // 'گیارہواں'
numberToWords(11n, { ordinal: true, gender: 'feminine' })  // 'گیارہویں'

// Currency
formatCurrency(505.50, 'PKR')  // 'پانچ سو پانچ روپے پچاس پیسے'
formatCurrency(1000, 'INR')    // 'ایک ہزار روپے'

// Numeral conversion
toUrduNumerals(2024)    // '۲۰۲۴'

// Inverse — parse words back to number
wordsToNumber('ایک کروڑ')     // 10_000_000n
wordsToNumber('پانچ سو پانچ') // 505n

Canonical Urdu sorting

No database and no JavaScript runtime has native Urdu collation. The 39-letter Urdu alphabet order:

ء ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ں و ہ ھ ی ے
import { sort, compare, sortKey } from '@iamahsanmehmood/urdu-tools'

sort(['ے', 'ا', 'ک', 'ب'])           // → ['ا', 'ب', 'ک', 'ے']
sort(['زبان', 'اردو', 'بہترین'])      // → ['اردو', 'بہترین', 'زبان']

// Use compare() as a comparator for any sorting context
['ے', 'ا', 'ک'].sort(compare)        // → ['ا', 'ک', 'ے']

// sortKey() for indexing — diacritics stripped before key generation
sortKey('پاکستان')   // '030003091102280814'
// عِلم and عَلم sort to the same position

In C#, UrduComparer implements IComparer&lt;string&gt; for native LINQ integration:

using UrduTools.Core.Sorting;

var words = new[] { "ے", "ا", "ک", "ب" };
var sorted = words.OrderBy(w => w, new UrduComparer()).ToList();
// ["ا", "ب", "ک", "ے"]

Unicode-aware tokenization

The tokenizer handles the edge cases that matter in real Urdu text:

import { tokenize, sentences, ngrams } from '@iamahsanmehmood/urdu-tools'

tokenize('پاکستان ایک خوبصورت ملک ہے')
// → [
//   { text: 'پاکستان', type: 'urdu-word' },
//   { text: 'ایک',     type: 'urdu-word' },
//   { text: 'خوبصورت', type: 'urdu-word' },
//   { text: 'ملک',     type: 'urdu-word' },
//   { text: 'ہے',      type: 'urdu-word' },
// ]

// Sentence splitting — on ۔ (U+06D4) ؟ ! but NOT on ، or ؛
sentences('پہلا جملہ۔ دوسرا جملہ؟ تیسرا جملہ!')
// → ['پہلا جملہ', 'دوسرا جملہ', 'تیسرا جملہ']

// The tokenizer preserves ZWNJ within words —
// so joinCompounds() output is one token per compound

Key edge cases handled:

  • Izafat Kasra (U+0650) at word boundaries is not treated as a split point
  • ZWNJ-bound compounds (output of joinCompounds()) are kept as single tokens
  • Mixed Urdu/Latin text is classified per-token

Transliteration — 18 aspirated digraphs

import { toRoman, fromRoman } from '@iamahsanmehmood/urdu-tools'

toRoman('پاکستان')   // 'pakistan'
toRoman('بھارت')     // 'bharat'
toRoman('چھوٹا')     // 'chhota'

Digraph rules (left-to-right FSM, digraph priority):

| Urdu | Roman | Urdu | Roman |
|---|---|---|---|
| بھ | bh | پھ | ph |
| تھ | th | ٹھ | Th |
| جھ | jh | چھ | chh |
| دھ | dh | ڈھ | Dh |
| کھ | kh | گھ | gh |
fromRoman('pakistan')  // → 'پاکستان' (trie-based longest-prefix match)
fromRoman('bharat')    // → 'بھارت'

InPage encoding — decoding 30 years of Urdu archives

InPage was the dominant Urdu desktop publishing tool for decades. Millions of documents — newspapers, books, government archives — exist only in InPage format. The library decodes all three versions:

import { decodeInpage, detectEncoding } from '@iamahsanmehmood/urdu-tools'

// Auto-detect InPage version and decode
const result = decodeInpage(buffer, 'auto')
// result.paragraphs → string[]  (Unicode Urdu text)
// result.version   → 'v1' | 'v2' | 'v3'

// Explicit version
decodeInpage(buffer, 'v1')  // 0x04-prefix byte-pair encoding (old InPage)
decodeInpage(buffer, 'v3')  // UTF-16LE with paragraph markers

// Detect encoding from buffer alone
detectEncoding(buffer)
// → 'utf-8' | 'utf-16le' | 'windows-1256' | 'inpage-v1v2' | 'inpage-v3' | 'unknown'

String utilities

import { reverse, truncate, wordCount, charCount,
         extractUrdu, decodeHtmlEntities } from '@iamahsanmehmood/urdu-tools'

// Reverse word order (not characters — preserves Arabic shaping)
reverse('پاکستان ہندوستان')      // → 'ہندوستان پاکستان'

// Truncate at word boundary
truncate('یہ ایک بہت لمبا جملہ ہے', 10)  // → 'یہ ایک...'

// Count grapheme clusters (correct for combining diacritics)
charCount('عِلم')   // → 3  (ع+ِ = 1 cluster, ل, م)

// Extract Urdu/Arabic segments from mixed text
extractUrdu('The word علم means knowledge')  // → ['علم']

// Decode HTML entities BEFORE normalize() — critical for TinyMCE/Quill content
decodeHtmlEntities('کتاب&rsquo;خانہ')  // → 'کتاب’خانہ'
decodeHtmlEntities('علم&nbsp;ہے')      // → 'علم ہے'

That last one (decodeHtmlEntities) is the fix for the TinyMCE bug mentioned at the top. Always call it before normalizing text that came from a rich text editor.


Script and character analysis

import { isUrduChar, getScript, classifyChar, isRTL, getUrduDensity } from '@iamahsanmehmood/urdu-tools'

isUrduChar('پ')  // true  — U+067E is Urdu-specific
isUrduChar('ب')  // false — U+0628 is shared with Arabic
isUrduChar('۱')  // true  — U+06F1 Urdu numeral

getScript('پاکستان')          // 'urdu'
getScript('مرحبا')             // 'arabic'
getScript('Hello پاکستان')    // 'mixed'

classifyChar('پ')   // 'urdu-letter'
classifyChar('َ')   // 'diacritic'
classifyChar('۱')   // 'numeral'

isRTL('پاکستان')               // true
getUrduDensity('پاکستان زندہ') // 0.28

C#/.NET — identical API, zero dependencies

Every function is available in UrduTools.Core with the same behavior. The C# package mirrors the TypeScript structure exactly.

using UrduTools.Core.Normalization;
using UrduTools.Core.Compound;
using UrduTools.Core.Numbers;
using UrduTools.Core.Sorting;
using UrduTools.Core.Search;

// Normalize
UrduNormalizer.Normalize("عِلمٌ");                          // "علم"
UrduNormalizer.Normalize("علم‌");                      // "علم"

// Compound detection
var spans = CompoundDetector.DetectCompounds("کتاب خانہ میں");
// spans[0].Text == "کتاب خانہ"
// spans[0].Type == CompoundType.Affix

// Numbers
NumberToWords.Convert(10_000_000);  // "ایک کروڑ"
NumberToWords.Convert(1, new NumberOptions { Ordinal = true, Gender = Gender.Feminine });  // "پہلی"

// Sort
var sorted = new[] { "ے", "ا", "ک", "ب" }
    .OrderBy(w => w, new UrduComparer())
    .ToList();  // ["ا", "ب", "ک", "ے"]

// Progressive normalization for DB lookup
foreach (var form in UrduMatcher.GetAllNormalizations(userInput))
{
    var result = await db.LookupAsync(form);
    if (result is not null) return result;
}

// Match
UrduMatcher.Match("عِلمٌ", "علم").Matched;  // true, layer: StripDiacritics

Academic foundation

The compound word detection module was built on peer-reviewed Urdu linguistics research. These three works directly informed the architecture:

Jabbar, A. (2016). "Urdu Compound Words Manufacturing a State of Art."
Provides the Urdu Affix Word List (UAWL) — the definitive catalog of Urdu derivational morphemes. The 100+ affix morphemes in Layer 1 (AFFIX_SET, PREFIX_SET, SUFFIX_SET) are drawn from this work.

Rahman, M. "A Linguistic Classification of Urdu Compound Words."
Informed the typological distinctions between compound categories — specifically the Perso-Arabic vs. native Urdu origin split and vav-e-atf chain patterns. Shaped the CompoundType taxonomy and izafat heuristics.

"High Performance Stemming Algorithm to Handle Multi-Word Expressions."
Motivated the joinCompounds() + tokenize() pipeline design — the paper demonstrates that semantic integrity is best preserved by preventing erroneous splits at the input boundary, not by post-processing token sequences. Also reinforced N-gram scanning over bigram-only approaches.


Used in production

This library is not a side project. It runs in three production systems:

| System | Description |
|---|---|
| HamaariUrdu | Urdu language learning platform — normalization, search, compound detection, numbers |
| Pakistan Academy of Letters | Government literary institution — normalization, search, sorting |
| Digital Library of PAL | Government digital Urdu archive — normalization, search, encoding |

HamaariUrdu was the origin — the library was extracted from production code where these problems were first encountered and solved. PAL and DLP integrated later for their Urdu text search and archiving systems.


Live Playground

Every function is interactive at iamahsanmehmood.github.io/urdu-tools.

The playground includes compound reporting built-in: if you find a compound the detector misses, or a pair it wrongly detects, you can report it directly from the UI — a pre-filled GitHub issue opens in one click.


Contributing

The compound lexicon (3,262 roots, expandable) is the highest-impact area for non-developer contributions. If you know Urdu, you can contribute without writing code:

// packages/urdu-js/src/compound/lexicon-data.ts
// Format: ['rootWord', new Set(['tail1', 'tail2'])]

['محنت', new Set(['مشقت'])],
['علم', new Set(['و ہنر', 'و عمل', 'کیمیا'])],
['انسائیکلوپیڈیا', new Set(['آف اسلام'])],

Full guide in CONTRIBUTING.md.

GitHub: github.com/iamahsanmehmood/urdu-tools


اردو سافٹ ویئر کو بہتر بنانے میں ہمارا ساتھ دیں۔
Help us make Urdu software better.


Tags: #urdu #nlp #typescript #dotnet #opensource
