NE-Agent: Building an AI agent that actually speaks Khasi, Garo, Mizo

#ai #nlp #agents #northeastindia

So this has been eating my nights for months and it's finally out. Northeast India has 220+ indigenous languages and basically zero AI built for them. Not because there's nothing to build with. Just nobody's done it. So I did.

I run MWire Labs out of Shillong. We've been quietly building the NE-Stack: language models for this region, covering language identification (LID), embeddings, ASR, and translation. But models sitting on Hugging Face don't help anyone unless something ties them together. That's NE-Agent.

What it does

You type or speak in Khasi, Garo, Mizo, whatever. It figures out what language you're using, decides what you actually want (are you asking a question, translating something, giving it audio), and routes to the right tool.

pip install ne-agent
ollama pull qwen2.5:1.5b
ne-agent

That's it. No API key, no cloud dependency. It runs a small local LLM through Ollama, and everything else is purpose-built for these languages.

The pieces

NE-LID does language detection across 11 languages at 99% accuracy
NE-Embed is a fine-tuned LaBSE model for retrieval, since off-the-shelf embedding models basically don't understand Garo or Khasi at all
A fine-tuned NLLB model handles Khasi-to-English translation
NE-ASR is a Whisper fine-tune for transcription across 8 languages
A small local model (qwen2.5 1.5B) does the actual routing and generation
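
A rough sketch of how pieces like these can be wired together. This is not NE-Agent's actual code: the `Agent` class, its field names, and the handler signature are all made up for illustration, and each component is passed in as a plain callable so the real models can be swapped in behind it. It also only handles text; audio would go through the ASR model first.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Agent:
    """Hypothetical glue object: detect the language, pick a tool, run it."""

    detect_language: Callable[[str], str]        # stands in for NE-LID
    route: Callable[[str], str]                  # stands in for the LLM router
    tools: Dict[str, Callable[[str, str], str]]  # tool name -> handler(text, lang)

    def handle(self, text: str) -> str:
        lang = self.detect_language(text)   # which language is this?
        tool_name = self.route(text)        # what does the user want?
        handler = self.tools.get(tool_name)
        if handler is None:
            raise KeyError(f"no tool registered for {tool_name!r}")
        return handler(text, lang)
```

Keeping every component behind a callable means the glue code never imports a model library directly, which matters once the tools need conflicting dependencies.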

The router is the part I like most. It's just the same small LLM deciding, per query, whether you want search, translation, or transcription. No hardcoded rules, no separate classifier. One model doing double duty.
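
To make the routing idea concrete, here's a minimal sketch under some assumptions. The prompt wording, the three tool labels, and the fall-back-to-search behavior are guesses for illustration, not what NE-Agent ships. The only real API used is Ollama's local `/api/generate` endpoint on its default port.

```python
import json
import urllib.request
from typing import Callable

# Hypothetical tool labels; the real agent may name these differently.
TOOLS = ("search", "translate", "transcribe")

# Hypothetical prompt; real prompts usually need tuning per model.
PROMPT = (
    "You are a router. Reply with exactly one word from: "
    "search, translate, transcribe.\n"
    "Query: {query}\nAnswer:"
)


def ollama_generate(prompt: str, model: str = "qwen2.5:1.5b") -> str:
    """Send one non-streaming request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


def route(query: str, llm: Callable[[str], str] = ollama_generate) -> str:
    """Ask the LLM for a tool label; fall back to search if the reply is unusable."""
    reply = llm(PROMPT.format(query=query)).strip().lower()
    for tool in TOOLS:
        # Substring match tolerates replies like "Translate." or "-> translate"
        if tool in reply:
            return tool
    return "search"
```

Passing the LLM in as a function keeps the parsing logic testable without a running Ollama server, and small models do drift from "exactly one word", so the lenient match plus a default is worth having.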

What actually broke

Building the embeddings was the real fight. LaBSE zero-shot on Garo gets you 13% recall at 1. After fine-tuning on our parallel data it jumps to 90%. Same story for Khasi and Nyishi. These languages just aren't represented in general-purpose training data, so you can't shortcut this with a bigger foundation model. You have to build the data.
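
For anyone unfamiliar with the metric: recall at 1 on parallel data asks, for each source sentence, whether its nearest neighbor among all the target embeddings is its own translation. Here's a small pure-Python sketch of that calculation, assuming cosine similarity and one correct target per query. The exact evaluation protocol behind the numbers above may differ.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_1(queries: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Fraction of queries whose top-ranked target is the aligned one.

    Assumes queries[i] should retrieve targets[i] (parallel sentence pairs).
    """
    if not queries:
        return 0.0
    hits = 0
    for i, q in enumerate(queries):
        # Index of the most similar target; ties go to the lowest index
        best = max(range(len(targets)), key=lambda j: cosine(q, targets[j]))
        hits += best == i
    return hits / len(queries)
```

In practice the vectors would come from the embedding model, one per sentence, and you'd use a vectorized library for anything beyond a few thousand pairs.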

Also learned the hard way that Whisper's generate API changed and broke my transcription pipeline mid-build. And when one model needs an older transformers version and another needs a newer one, you basically need a separate environment per tool. Not glamorous, just real.
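
One common way to live with conflicting dependencies is to run each tool as a subprocess under its own interpreter and pass JSON over stdin and stdout. The sketch below shows that pattern only; it is not how NE-Agent is necessarily structured, and the virtualenv paths and worker script names are hypothetical.

```python
import json
import subprocess
from typing import Dict, List, Sequence

# Hypothetical layout: one virtualenv per tool, each with its own transformers pin.
TOOL_COMMANDS: Dict[str, List[str]] = {
    "transcribe": ["envs/asr/bin/python", "asr_worker.py"],
    "translate": ["envs/mt/bin/python", "mt_worker.py"],
}


def run_in_env(command: Sequence[str], payload: dict, timeout: float = 120.0) -> dict:
    """Send payload as JSON on stdin to a worker process and parse JSON from its stdout.

    Raises subprocess.CalledProcessError if the worker exits non-zero.
    """
    proc = subprocess.run(
        list(command),
        input=json.dumps(payload),  # ASCII-escaped by default, so encoding-safe
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)
```

A call would look like `run_in_env(TOOL_COMMANDS["translate"], {"text": "..."})`, with the worker script reading JSON from stdin and printing a JSON result. The cost is process startup and model load on every call, so a long-running worker is the usual next step once this gets slow.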

Where it's rough

The 1.5B model hallucinates sometimes even when retrieval gets the right document. Translation only covers Khasi for now. NE-LID still mixes up Mizo and Nyishi occasionally; they're phonologically close. None of this is hidden. It's all called out plainly in the limitations section of the paper.

Why bother

Because if nobody builds this stuff for low-resource languages, it just doesn't exist. There's no market pressure making a frontier lab prioritize Kokborok. Someone has to build the boring infrastructure first. That's what this is.

Code and models are open, CC BY 4.0.