When people talk about AI and language models, they rarely mean languages like Khasi or Garo. But for those of us working in Northeast India, that’s exactly where the challenge—and the opportunity—lies.
Over the past few months, I’ve been diving deep into how modern LLMs handle tokenization for low-resource languages, especially those with unique orthographic features. Khasi (Austroasiatic) and Garo (Tibeto-Burman) aren’t just linguistically rich—they’re structurally distinct from the Indo-Aryan mainstream. That makes them a fascinating testbed for evaluating how well current models preserve linguistic authenticity.
🔍 What I Found
Most open-source LLMs tokenize these languages poorly. Diacritics get corrupted, middle dots turn into hex gibberish, and meaningful units are fractured. Even models with massive vocabularies struggle unless they’ve been trained with orthographic sensitivity.
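The "hex gibberish" failure mode is easy to see at the byte level. Here's a minimal sketch (my own illustration, not part of the original evaluation) of why a byte-level BPE that never learned merges for Khasi characters falls back to raw byte tokens:

```python
# Sketch: why byte-level tokenizers mangle the Khasi middle dot (U+00B7).
# Illustrative only; a real tokenizer would apply learned merges first.

word = "ka·la·ï"  # example word with middle dots and a diaeresis

# UTF-8 encodes "·" and "ï" as two bytes each. A byte-level BPE with no
# merges for those pairs emits each byte as its own token, which renders
# as fragments like <0xC2><0xB7> instead of a single visible character.
raw_bytes = word.encode("utf-8")
byte_tokens = [f"<0x{b:02X}>" for b in raw_bytes]

print(len(word), "characters ->", len(raw_bytes), "bytes")
print(byte_tokens)
```

Seven visible characters become ten bytes, so an unprepared tokenizer can spend ten tokens on one short word while also making the dots unreadable.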
I ran a systematic evaluation across five models—including Gemma, Falcon, LLaMA, and Nemotron—using both efficiency and authenticity metrics. The results were surprising: one model nailed it, most didn’t.
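For the efficiency side, a standard metric is fertility: the average number of tokens a model spends per word. A rough sketch of how that can be computed (the `tokenize` callable and the character-split stand-in here are hypothetical, not the actual framework):

```python
# Sketch of a tokenizer efficiency metric: fertility, i.e. average tokens
# produced per whitespace-separated word. Lower is better.
# The stand-in tokenizer below (per-character split) models the worst case.

def fertility(tokenize, corpus):
    """Average number of tokens produced per whitespace word in `corpus`."""
    words = [w for line in corpus for w in line.split()]
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

# Worst-case stand-in: every word shatters into single characters.
char_tokenizer = list

sample = ["ka·la·ï ka ktien"]
print(round(fertility(char_tokenizer, sample), 2))  # prints 4.67
```

Swapping in each model's real tokenizer for `char_tokenizer` makes the cross-model comparison mechanical; authenticity metrics (e.g. whether diacritics survive) are checked separately.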
🧪 Why Tokenization Matters
If your tokenizer breaks a word like ka·la·ï into meaningless fragments, downstream tasks like translation, speech synthesis, or search will fail. For civic tech, that’s not just a bug—it’s a barrier to access.
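One cheap sanity check before deploying any of those downstream tasks is a lossless round trip: encode the text, decode it back, and confirm nothing was corrupted. A minimal sketch, assuming the tokenizer exposes encode/decode functions (the UTF-8 stand-in below is hypothetical):

```python
# Minimal authenticity check: does text survive encode -> decode intact?
# Any tokenizer that corrupts middle dots or diacritics fails this test.

def lossless_roundtrip(encode, decode, text):
    """True iff decode(encode(text)) reproduces the text exactly."""
    return decode(encode(text)) == text

# Stand-in tokenizer: raw UTF-8 bytes as token ids (always lossless).
encode = lambda s: list(s.encode("utf-8"))
decode = lambda ids: bytes(ids).decode("utf-8")

print(lossless_roundtrip(encode, decode, "ka·la·ï"))  # prints True
```

It won't catch bad segmentation on its own, but it catches the outright corruption cases quickly, which matters when the output feeds search or speech synthesis.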
🌱 What Comes Next
This isn’t just about benchmarking. It’s about building a reproducible, region-first ecosystem for language tech in Meghalaya. I’ve released the evaluation framework as a public artifact, and I’m working toward open-source models that respect the linguistic integrity of Khasi and Garo.
If you’re building LLMs, working on STT/TTS, or deploying civic tech in Northeast India, tokenization isn’t a footnote—it’s foundational.
🙌 Final Thought
Language tech isn’t just about scale—it’s about respect. And sometimes, the smallest tokens carry the biggest meaning.