B Nyalang

Kren v1: Turning an Encoder into a Khasi-Speaking AI

Most generative AI models don't speak Khasi. Or most other Northeast Indian languages, really. So I built Kren v1, a compact, GPT-2-style model that can generate Khasi text, trained from scratch by converting an encoder into a decoder.

This wasn’t just a fine-tuning job. It was a full architectural pivot.

🔄 From KhasiBERT to Kren

Kren started life as KhasiBERT, a RoBERTa-style encoder trained on Khasi text. But encoders are built to understand text, not generate it. So I reworked it into a decoder, transferring its weights into GPT-2's causal format.
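
The post doesn't spell out the exact weight-transfer recipe, so here's a minimal sketch of the general idea, assuming a RoBERTa-style checkpoint (the checkpoint path and layer mapping here are illustrative, not Kren's actual code). The main wrinkle is that GPT-2 fuses the Q/K/V projections into a single matrix and stores weights in Conv1D (in, out) layout, the transpose of nn.Linear:

import torch
from transformers import GPT2Config, GPT2LMHeadModel, RobertaModel

encoder = RobertaModel.from_pretrained("path/to/khasibert")  # placeholder path
cfg = encoder.config

# Build a GPT-2 skeleton with matching dimensions.
decoder = GPT2LMHeadModel(GPT2Config(
    vocab_size=cfg.vocab_size,
    n_positions=cfg.max_position_embeddings,
    n_embd=cfg.hidden_size,
    n_layer=cfg.num_hidden_layers,
    n_head=cfg.num_attention_heads,
))

with torch.no_grad():
    # Token and position embeddings carry over directly.
    decoder.transformer.wte.weight.copy_(encoder.embeddings.word_embeddings.weight)
    decoder.transformer.wpe.weight.copy_(encoder.embeddings.position_embeddings.weight)

    for enc_layer, dec_block in zip(encoder.encoder.layer, decoder.transformer.h):
        self_attn = enc_layer.attention.self
        # Fuse RoBERTa's separate Q/K/V projections into GPT-2's single c_attn;
        # Conv1D stores weights as (in, out), so transpose the Linear weights.
        qkv_weight = torch.cat(
            [self_attn.query.weight, self_attn.key.weight, self_attn.value.weight], dim=0
        )
        dec_block.attn.c_attn.weight.copy_(qkv_weight.T)
        dec_block.attn.c_attn.bias.copy_(torch.cat(
            [self_attn.query.bias, self_attn.key.bias, self_attn.value.bias]
        ))
        dec_block.attn.c_proj.weight.copy_(enc_layer.attention.output.dense.weight.T)
        dec_block.attn.c_proj.bias.copy_(enc_layer.attention.output.dense.bias)
        # Feed-forward layers transfer the same way, transposed.
        dec_block.mlp.c_fc.weight.copy_(enc_layer.intermediate.dense.weight.T)
        dec_block.mlp.c_fc.bias.copy_(enc_layer.intermediate.dense.bias)
        dec_block.mlp.c_proj.weight.copy_(enc_layer.output.dense.weight.T)
        dec_block.mlp.c_proj.bias.copy_(enc_layer.output.dense.bias)

# RoBERTa is post-LayerNorm while GPT-2 is pre-LayerNorm, so the transferred
# weights are only a warm start: the model still needs causal-LM training.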

Why bother? Because there’s no generative model for Khasi. And building one from scratch with limited data is tough.

📊 Training Breakdown

I tested different data sizes to find the sweet spot for generation quality—not just loss scores. Here’s how it played out:

| Version | Lines of Khasi text | Loss  | Notes                            |
| ------- | ------------------- | ----- | -------------------------------- |
| v0.1    | 300K                | 3.149 | Basic generation, short replies  |
| v0.2    | 800K                | 2.995 | Dialogue improves                |
| v1.0    | 1M                  | 2.960 | Abstract reasoning kicks in      |
| v0.4    | 2M                  | 2.903 | Lower loss, but degraded output  |

More data didn’t mean better results. At 2M lines, the model started to lose coherence—so I stuck with 1M for the final release.
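
Since cross-entropy loss maps to perplexity via exp(loss), the 2M-line "win" is smaller than it looks. A quick sanity check on the table above (a standard conversion, not a calculation from the original post):

import math

# Perplexity = exp(cross-entropy loss), using the loss values reported above.
for version, loss in [("v1.0", 2.960), ("v0.4", 2.903)]:
    print(f"{version}: perplexity ~ {math.exp(loss):.1f}")
# v1.0: perplexity ~ 19.3
# v0.4: perplexity ~ 18.2

A roughly one-point perplexity gap is easy to trade away for the coherence the 1M-line version kept.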

🧵 What Kren Can Do

Kren v1 can generate Khasi text about:

  • Places
  • Cultural topics
  • Abstract reasoning and multi-sentence replies

It's not perfect: there's a 514-token context limit (inherited from the RoBERTa-style encoder's position embeddings), and it can hallucinate or reflect biases in the training data. But it's a start.

🚀 Try It Yourself

You can test it on Hugging Face or load it locally with:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/kren-v1")
model = AutoModelForCausalLM.from_pretrained("MWirelabs/kren-v1")

# Tokenize a Khasi prompt and sample a continuation.
inputs = tokenizer("Ka Khasi ka", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,  # sampling must be on for temperature to take effect
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🌱 Why This Matters

Kren v1 shows that it’s possible to build generative models for low-resource languages—even by converting encoders. It’s compact, reproducible, and open for anyone to build on.

If you’re working on regional NLP or want to explore encoder-to-decoder conversions, check out MWire Labs. We’re building tools that reflect the linguistic diversity of Northeast India—quietly, but with purpose.
