DEV Community: B Nyalang

NE-Agent: Building an AI agent that actually speaks Khasi, Garo, Mizo

B Nyalang — Fri, 03 Jul 2026 21:49:07 +0000

So this has been eating my nights for months and it's finally out. Northeast India has 220+ indigenous languages and basically zero AI built for them. Not because there's nothing to build with. Just nobody's done it. So I did.

I run MWire Labs out of Shillong. We've been quietly building the NE-Stack, language models for this region, LID, embeddings, ASR, translation. But models sitting on HuggingFace don't help anyone unless something ties them together. That's NE-Agent.

What it does

You type or speak in Khasi, Garo, Mizo, whatever. It figures out what language you're using, decides what you actually want (are you asking a question, translating something, giving it audio), and routes to the right tool.

pip install ne-agent
ollama pull qwen2.5:1.5b
ne-agent

That's it. No API key, no cloud dependency. It runs a small local LLM through Ollama and everything else is purpose built for these languages.

The pieces

NE-LID does language detection across 11 languages, 99% accuracy
NE-Embed is a fine tuned LaBSE model for retrieval, since off the shelf embedding models basically don't understand Garo or Khasi at all
A fine tuned NLLB model handles Khasi to English translation
NE-ASR is a Whisper fine tune for transcription across 8 languages
A small local model (qwen2.5 1.5B) does the actual routing and generation

The router is the part I like most. It's just the same small LLM deciding, per query, whether you want search, translation, or transcription. No hardcoded rules, no separate classifier. One model doing double duty.

What actually broke

Building the embeddings was the real fight. LaBSE zero shot on Garo gets you 13% recall at 1. After fine tuning on our parallel data it jumps to 90%. Same story for Khasi and Nyishi. These languages just aren't represented in general purpose training data, so you can't shortcut this with a bigger foundation model. You have to build the data.

Also learned the hard way that whisper's generate API changed and broke my transcription pipeline mid build, and that mixing an older transformers version needed for one model with a newer one needed for another means you basically need separate environments per tool. Not glamorous, just real.

Where it's rough

The 1.5B model hallucinates sometimes even when retrieval gets the right document. Translation only covers Khasi for now. NE-LID still mixes up Mizo and Nyishi occasionally, they're phonologically close. None of this is hidden, it's all called out plainly in the limitations section of the paper.

Why bother

Because if nobody builds this stuff for low resource languages, it just doesn't exist. There's no market pressure making a frontier lab prioritize Kokborok. Someone has to build the boring infrastructure first. That's what this is.

Code and models are open, CC BY 4.0.

GitHub: github.com/MWirelabs/ne-agent
PyPI: pypi.org/project/ne-agent
Models: huggingface.co/MWirelabs

If you work on low resource language tech or you're from the region and want to poke at this, I'd love to hear from you.

We're running an AI-authored research workshop for Northeast India's 200+ languages - and publishing everything openly

B Nyalang — Wed, 01 Apr 2026 00:19:45 +0000

At MWire Labs, we build language technology for Northeast India's indigenous languages - ASR, MT, OCR, LLMs. The region has 200+ languages. Almost none of them exist in mainstream AI datasets.
So we're doing something a bit unusual.

NortheastGenAI 2026 is a virtual workshop on May 29 where every submission must be AI-generated or AI-assisted - with full disclosure of how. All reviews are AI-assisted too, followed by a human editorial check. Everything is public on OpenReview. Inspired by Agents4Science 2025 (Stanford).

We're not claiming AI research is ready. We're asking the question openly and publishing whatever comes out.

*Three tracks:
*
Language, Culture & Heritage
Society, History & Anthropology
AI and Technology for NE India

Stack we're using: OpenReview for submissions.

Keynote: Bonaventure F. P. Dossou (McGill/Mila, Masakhane) — "Doing More with Less: Efficient Methods for Low-Resource Languages"

**Key dates:
**Submissions open: April 8
Deadline: May 15
Workshop: May 29

Non-archival - submit elsewhere after.
northeastgenai.github.io

If you're working on low-resource NLP, indigenous language tech, or just curious - come submit or attend.

How We Built Northeast India’s First Foundational AI Model from Shillong, on Our Own Terms

B Nyalang — Wed, 19 Nov 2025 10:51:13 +0000

We just released Kren-M™, a production-ready bilingual foundation model for Khasi and English.

No outside funding rounds.

No imported talent.

No compromise on local understanding.

We did it internally at MWire Labs (the AI research division of MWire, a Shillong-based firm that has delivered IT systems and solutions serving 8+ million citizens since 2017).

Because when it comes to Northeast languages, the deepest expertise isn’t in Bangalore or California — it’s right here in the hills.

Why Local Roots Beat Everything Else

Big labs throw hundreds at Indic models.

We threw eight years of on-the-ground experience.

We know Khasi isn’t just tokens, it’s morphology, dialect variation, cultural nuance that only someone who grew up hearing it can capture.

That’s why our tokenizer cuts Khasi token count by 36 %.

That’s why the model never auto-translates unless asked.

That’s why it sounds like home.

What We Shipped

Kren-M™ (Gemma-2-2B base, 2.6B params):

Custom tokenizer with 2,135 Khasi/Garo tokens
5.43 M hand-cleaned Khasi sentences (proprietary — our moat)
Fully task-aware SFT — natural bilingual behaviour
Runs offline on 6 GB VRAM

Live: https://huggingface.co/MWirelabs/Kren-M

White paper: https://mwirelabs.com/models/kren-m

Preprint (DOI): https://www.researchsquare.com/article/rs-8144118/v1

We also open-sourced the one of the largest public Assamese & Mizo corpora + the first Garo corpus ever.

This Is Just the Beginning

Early 2026: Expect Kren-NE, Gemma-2-9B multilingual covering Khasi, Garo, Mizo, Assamese, Meitei, Nagamese, Kokborok and more.

All built the same way: local team, local data, local control.

The future of Northeast AI won’t be built in glass towers far away.

It will be built here, by us, for us.

NEindicLLM #KhasiLLM #MeghalayaAI #NortheastAI

Building Language Tech for Meghalaya: Lessons from Tokenizing Khasi and Garo with Modern LLMs

B Nyalang — Sat, 20 Sep 2025 18:45:49 +0000

When people talk about AI and language models, they rarely mean languages like Khasi or Garo. But for those of us working in Northeast India, that’s exactly where the challenge—and the opportunity—lies.

Over the past few months, I’ve been diving deep into how modern LLMs handle tokenization for low-resource languages, especially those with unique orthographic features. Khasi (Austroasiatic) and Garo (Tibeto-Burman) aren’t just linguistically rich—they’re structurally distinct from the Indo-Aryan mainstream. That makes them a fascinating testbed for evaluating how well current models preserve linguistic authenticity.

🔍 What I Found

Most open-source LLMs tokenize these languages poorly. Diacritics get corrupted, middle dots turn into hex gibberish, and meaningful units are fractured. Even models with massive vocabularies struggle unless they’ve been trained with orthographic sensitivity.

I ran a systematic evaluation across five models—including Gemma, Falcon, LLaMA, and Nemotron—using both efficiency and authenticity metrics. The results were surprising: one model nailed it, most didn’t.

🧪 Why Tokenization Matters

If your tokenizer breaks a word like ka·la·ï into meaningless fragments, downstream tasks like translation, speech synthesis, or search will fail. For civic tech, that’s not just a bug—it’s a barrier to access.

🌱 What Comes Next

This isn’t just about benchmarking. It’s about building a reproducible, region-first ecosystem for language tech in Meghalaya. I’ve released the evaluation framework as a public artifact, and I’m working toward open-source models that respect the linguistic integrity of Khasi and Garo.

If you’re building LLMs, working on STT/TTS, or deploying civic tech in Northeast India, tokenization isn’t a footnote—it’s foundational.

🙌 Final Thought

Language tech isn’t just about scale—it’s about respect. And sometimes, the smallest tokens carry the biggest meaning.

Kren v1: Turning an Encoder into a Khasi-Speaking AI

B Nyalang — Wed, 17 Sep 2025 13:48:14 +0000

Most generative AI models don’t speak Khasi. Or several Northeast Indian language, really. So, I built Kren v1—a compact, GPT-2-style model that can generate Khasi text, trained from scratch by converting an encoder into a decoder.

This wasn’t just a fine-tuning job. It was a full architectural pivot.

🔄 From KhasiBERT to Kren

Kren started life as KhasiBERT, a RoBERTa-style encoder trained on Khasi. But encoders don’t generate—they classify. So I reworked it into a decoder, transferring weights and adapting it to GPT-2’s causal format.

Why bother? Because there’s no generative model for Khasi. And building one from scratch with limited data is tough.

📊 Training Breakdown

I tested different data sizes to find the sweet spot for generation quality—not just loss scores. Here’s how it played out:

Version	Lines of Khasi Text	Loss	Notes
v0.1	300K	3.149	Basic generation, short replies
v0.2	800K	2.995	Dialogue improves
v1.0	1M	2.960	Abstract reasoning kicks in
v0.4	2M	2.903	Lower loss, but degraded output

More data didn’t mean better results. At 2M lines, the model started to lose coherence—so I stuck with 1M for the final release.

🧵 What Kren Can Do

Kren v1 can generate Khasi text about:

Places
Cultural topics
Abstract reasoning and multi-sentence replies

It’s not perfect—there’s a 514-token limit, and it can hallucinate or reflect biases. But it’s a start.

🚀 Try It Yourself

You can test it on Hugging Face or load it locally with:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/kren-v1")
model = AutoModelForCausalLM.from_pretrained("MWirelabs/kren-v1")

inputs = tokenizer("Ka Khasi ka", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🌱 Why This Matters

Kren v1 shows that it’s possible to build generative models for low-resource languages—even by converting encoders. It’s compact, reproducible, and open for anyone to build on.

If you’re working on regional NLP or want to explore encoder-to-decoder conversions, check out MWire Labs. We’re building tools that reflect the linguistic diversity of Northeast India—quietly, but with purpose.

Khasibert: A Regional Language Model for Khasi NLP

B Nyalang — Fri, 12 Sep 2025 10:51:40 +0000

Most language models overlook low-resource languages. Khasibert is built to change that—it's an open-source Khasi language model designed for translation, summarization, and civic NLP tasks in Northeast India.

What Is Khasibert?

A compact transformer-based language model, architected for region-first NLP in Khasi
Optimized for low-resource deployment and real-world usability
Built by MWire Labs to support inclusive, culturally aware AI.

Why It Matters

Khasi is spoken by over a million people, yet underrepresented in mainstream NLP
Khasibert enables language technology research, civic applications, and education tools
It’s part of a broader mission to democratize AI for Northeast India.

What’s Under the Hood

Pretrained on cleaned, deduplicated Khasi text
Fine-tuned for translation, summarization, and semantic understanding
Benchmarked for responsiveness in resource-constrained environments