Most generative AI models don't speak Khasi. Or most other Northeast Indian languages, really. So I built Kren v1: a compact, GPT-2-style model that generates Khasi text, built by converting an encoder into a decoder and training it for generation.
This wasn’t just a fine-tuning job. It was a full architectural pivot.
## 🔄 From KhasiBERT to Kren
Kren started life as KhasiBERT, a RoBERTa-style encoder trained on Khasi. But encoders don’t generate—they classify. So I reworked it into a decoder, transferring weights and adapting it to GPT-2’s causal format.
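The conversion is mostly a matter of mapping weights between the two layouts. Here's a minimal sketch of what such a transfer can look like; the checkpoint name is hypothetical and this isn't necessarily the exact mapping Kren used. Note that RoBERTa and GPT-2 place their layer norms differently (post-norm vs. pre-norm), so the norms and the LM head are left at fresh initialization and learned during causal training:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, RobertaModel

# "MWirelabs/khasibert" is a hypothetical checkpoint name for illustration
encoder = RobertaModel.from_pretrained("MWirelabs/khasibert")
cfg = encoder.config

# Fresh GPT-2 skeleton sized to match the encoder
decoder = GPT2LMHeadModel(GPT2Config(
    vocab_size=cfg.vocab_size,
    n_positions=cfg.max_position_embeddings,
    n_embd=cfg.hidden_size,
    n_layer=cfg.num_hidden_layers,
    n_head=cfg.num_attention_heads,
))

with torch.no_grad():
    # Embeddings map over directly (RoBERTa offsets positions by 2 for
    # its padding convention; ignored here for simplicity)
    decoder.transformer.wte.weight.copy_(encoder.embeddings.word_embeddings.weight)
    decoder.transformer.wpe.weight.copy_(encoder.embeddings.position_embeddings.weight)

    for src, dst in zip(encoder.encoder.layer, decoder.transformer.h):
        attn = src.attention.self
        # GPT-2 fuses Q, K, V into one Conv1D whose weight is stored
        # transposed relative to nn.Linear, hence the .T and concat
        dst.attn.c_attn.weight.copy_(torch.cat(
            [attn.query.weight.T, attn.key.weight.T, attn.value.weight.T], dim=1))
        dst.attn.c_attn.bias.copy_(torch.cat(
            [attn.query.bias, attn.key.bias, attn.value.bias]))
        dst.attn.c_proj.weight.copy_(src.attention.output.dense.weight.T)
        dst.attn.c_proj.bias.copy_(src.attention.output.dense.bias)
        dst.mlp.c_fc.weight.copy_(src.intermediate.dense.weight.T)
        dst.mlp.c_fc.bias.copy_(src.intermediate.dense.bias)
        dst.mlp.c_proj.weight.copy_(src.output.dense.weight.T)
        dst.mlp.c_proj.bias.copy_(src.output.dense.bias)
```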
Why bother? Because there’s no generative model for Khasi. And building one from scratch with limited data is tough.
## 📊 Training Breakdown
I tested different data sizes to find the sweet spot for generation quality—not just loss scores. Here’s how it played out:
| Version | Lines of Khasi Text | Loss | Notes |
|---|---|---|---|
| v0.1 | 300K | 3.149 | Basic generation, short replies |
| v0.2 | 800K | 2.995 | Dialogue improves |
| v1.0 | 1M | 2.960 | Abstract reasoning kicks in |
| v0.4 | 2M | 2.903 | Lower loss, but degraded output |
More data didn’t mean better results. At 2M lines, the model started to lose coherence—so I stuck with 1M for the final release.
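For reference, the causal-LM training itself can be set up with the standard Hugging Face `Trainer`. This is a minimal sketch under assumed names, not Kren's exact recipe: the data file, checkpoint, and hyperparameters are placeholders, and `decoder` is the converted model from the sketch above.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Hypothetical tokenizer checkpoint; "khasi_1m.txt" is a placeholder
# file with one line of Khasi text per line
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/khasibert")
dataset = load_dataset("text", data_files={"train": "khasi_1m.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False produces next-token (causal) labels instead of masked ones
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=decoder,  # the converted decoder from the earlier sketch
    args=TrainingArguments(output_dir="kren-ckpt", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```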
## 🧵 What Kren Can Do
Kren v1 can generate Khasi text about:
- Places
- Cultural topics
- Abstract reasoning and multi-sentence replies
It's not perfect: there's a 514-token context limit (a likely carry-over from the encoder's RoBERTa-style position embeddings), and it can hallucinate or reflect biases in its training data. But it's a start.
## 🚀 Try It Yourself
You can test it on Hugging Face or load it locally with:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/kren-v1")
model = AutoModelForCausalLM.from_pretrained("MWirelabs/kren-v1")

# Prompt the model with the start of a Khasi sentence
inputs = tokenizer("Ka Khasi ka", return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🌱 Why This Matters
Kren v1 shows that it's possible to build generative models for low-resource languages, even by converting encoders. It's compact, reproducible, and open for anyone to build on.
If you’re working on regional NLP or want to explore encoder-to-decoder conversions, check out MWire Labs. We’re building tools that reflect the linguistic diversity of Northeast India—quietly, but with purpose.