Most generative AI models don't speak Khasi. Or most other Northeast Indian languages, really. So I built Kren v1: a compact, GPT-2-style model that generates Khasi text, built by converting an encoder into a decoder and training it for generation.
This wasn’t just a fine-tuning job. It was a full architectural pivot.
## 🔄 From KhasiBERT to Kren
Kren started life as KhasiBERT, a RoBERTa-style encoder trained on Khasi. But encoders don’t generate—they classify. So I reworked it into a decoder, transferring weights and adapting it to GPT-2’s causal format.
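The conversion is mostly a matter of mapping weights between the two layouts. Here's a minimal sketch of what such a transfer can look like; the checkpoint name is hypothetical and this isn't necessarily the exact mapping Kren used. Note that RoBERTa and GPT-2 place their layer norms differently (post-norm vs. pre-norm), so the norms and the LM head are left at fresh initialization and learned during causal training:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, RobertaModel

# "MWirelabs/khasibert" is a hypothetical checkpoint name for illustration
encoder = RobertaModel.from_pretrained("MWirelabs/khasibert")
cfg = encoder.config

# Fresh GPT-2 skeleton sized to match the encoder
decoder = GPT2LMHeadModel(GPT2Config(
    vocab_size=cfg.vocab_size,
    n_positions=cfg.max_position_embeddings,
    n_embd=cfg.hidden_size,
    n_layer=cfg.num_hidden_layers,
    n_head=cfg.num_attention_heads,
))

with torch.no_grad():
    # Embeddings map over directly (RoBERTa offsets positions by 2 for
    # its padding convention; ignored here for simplicity)
    decoder.transformer.wte.weight.copy_(encoder.embeddings.word_embeddings.weight)
    decoder.transformer.wpe.weight.copy_(encoder.embeddings.position_embeddings.weight)

    for src, dst in zip(encoder.encoder.layer, decoder.transformer.h):
        attn = src.attention.self
        # GPT-2 fuses Q, K, V into one Conv1D whose weight is stored
        # transposed relative to nn.Linear, hence the .T and concat
        dst.attn.c_attn.weight.copy_(torch.cat(
            [attn.query.weight.T, attn.key.weight.T, attn.value.weight.T], dim=1))
        dst.attn.c_attn.bias.copy_(torch.cat(
            [attn.query.bias, attn.key.bias, attn.value.bias]))
        dst.attn.c_proj.weight.copy_(src.attention.output.dense.weight.T)
        dst.attn.c_proj.bias.copy_(src.attention.output.dense.bias)
        dst.mlp.c_fc.weight.copy_(src.intermediate.dense.weight.T)
        dst.mlp.c_fc.bias.copy_(src.intermediate.dense.bias)
        dst.mlp.c_proj.weight.copy_(src.output.dense.weight.T)
        dst.mlp.c_proj.bias.copy_(src.output.dense.bias)
```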
Why bother? Because there’s no generative model for Khasi. And building one from scratch with limited data is tough.
## 📊 Training Breakdown
I tested different data sizes to find the sweet spot for generation quality—not just loss scores. Here’s how it played out:
| Version | Lines of Khasi Text | Loss | Notes |
|---|---|---|---|
| v0.1 | 300K | 3.149 | Basic generation, short replies |
| v0.2 | 800K | 2.995 | Dialogue improves |
| v1.0 | 1M | 2.960 | Abstract reasoning kicks in |
| v0.4 | 2M | 2.903 | Lower loss, but degraded output |
More data didn’t mean better results. At 2M lines, the model started to lose coherence—so I stuck with 1M for the final release.
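For reference, the causal-LM training itself can be set up with the standard Hugging Face `Trainer`. This is a minimal sketch under assumed names, not Kren's exact recipe: the data file, checkpoint, and hyperparameters are placeholders, and `decoder` is the converted model from the sketch above.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Hypothetical tokenizer checkpoint; "khasi_1m.txt" is a placeholder
# file with one line of Khasi text per line
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/khasibert")
dataset = load_dataset("text", data_files={"train": "khasi_1m.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False produces next-token (causal) labels instead of masked ones
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=decoder,  # the converted decoder from the earlier sketch
    args=TrainingArguments(output_dir="kren-ckpt", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```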
## 🧵 What Kren Can Do
Kren v1 can generate Khasi text about:
- Places
- Cultural topics
- Abstract reasoning and multi-sentence replies
It's not perfect: there's a 514-token context limit (a likely carry-over from the encoder's RoBERTa-style position embeddings), and it can hallucinate or reflect biases in its training data. But it's a start.
## 🚀 Try It Yourself
You can test it on Hugging Face or load it locally with:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/kren-v1")
model = AutoModelForCausalLM.from_pretrained("MWirelabs/kren-v1")

# Prompt the model with the start of a Khasi sentence
inputs = tokenizer("Ka Khasi ka", return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🌱 Why This Matters
Kren v1 shows that it's possible to build generative models for low-resource languages, even by converting encoders. It's compact, reproducible, and open for anyone to build on.
If you’re working on regional NLP or want to explore encoder-to-decoder conversions, check out MWire Labs. We’re building tools that reflect the linguistic diversity of Northeast India—quietly, but with purpose.