<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: B Nyalang</title>
    <description>The latest articles on DEV Community by B Nyalang (@b_nyalang).</description>
    <link>https://dev.to/b_nyalang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3497193%2Ff24d3632-7fd7-4b41-9739-b167a8273499.jpg</url>
      <title>DEV Community: B Nyalang</title>
      <link>https://dev.to/b_nyalang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/b_nyalang"/>
    <language>en</language>
    <item>
      <title>We're running an AI-authored research workshop for Northeast India's 200+ languages - and publishing everything openly</title>
      <dc:creator>B Nyalang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 00:19:45 +0000</pubDate>
      <link>https://dev.to/b_nyalang/were-running-an-ai-authored-research-workshop-for-northeast-indias-200-languages-and-58nc</link>
      <guid>https://dev.to/b_nyalang/were-running-an-ai-authored-research-workshop-for-northeast-indias-200-languages-and-58nc</guid>
      <description>&lt;p&gt;At MWire Labs, we build language technology for Northeast India's indigenous languages - ASR, MT, OCR, LLMs. The region has 200+ languages. Almost none of them exist in mainstream AI datasets.&lt;br&gt;
So we're doing something a bit unusual.&lt;/p&gt;

&lt;p&gt;NortheastGenAI 2026 is a virtual workshop on May 29 where every submission must be AI-generated or AI-assisted - with full disclosure of how. All reviews are AI-assisted too, followed by a human editorial check. Everything is public on OpenReview. Inspired by Agents4Science 2025 (Stanford).&lt;/p&gt;

&lt;p&gt;We're not claiming AI research is ready. We're asking the question openly and publishing whatever comes out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three tracks:&lt;/strong&gt;&lt;br&gt;
Language, Culture &amp;amp; Heritage&lt;br&gt;
Society, History &amp;amp; Anthropology&lt;br&gt;
AI and Technology for NE India&lt;/p&gt;

&lt;p&gt;Stack we're using: OpenReview for submissions.&lt;/p&gt;

&lt;p&gt;Keynote: Bonaventure F. P. Dossou (McGill/Mila, Masakhane) — "Doing More with Less: Efficient Methods for Low-Resource Languages"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key dates:&lt;/strong&gt;&lt;br&gt;
Submissions open: April 8&lt;br&gt;
Deadline: May 15&lt;br&gt;
Workshop: May 29&lt;/p&gt;

&lt;p&gt;Non-archival - submit elsewhere after.&lt;br&gt;
&lt;a href="https://northeastgenai.github.io/" rel="noopener noreferrer"&gt;northeastgenai.github.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working on low-resource NLP, indigenous language tech, or just curious - come submit or attend.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>ai</category>
      <category>opensource</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>How We Built Northeast India’s First Foundational AI Model from Shillong, on Our Own Terms</title>
      <dc:creator>B Nyalang</dc:creator>
      <pubDate>Wed, 19 Nov 2025 10:51:13 +0000</pubDate>
      <link>https://dev.to/b_nyalang/how-we-built-northeast-indias-first-foundational-ai-model-from-shillong-on-our-own-terms-5g8h</link>
      <guid>https://dev.to/b_nyalang/how-we-built-northeast-indias-first-foundational-ai-model-from-shillong-on-our-own-terms-5g8h</guid>
      <description>&lt;p&gt;We just released &lt;a href="https://mwirelabs.com/models/kren-m/" rel="noopener noreferrer"&gt;&lt;strong&gt;Kren-M™&lt;/strong&gt;&lt;/a&gt;, a production-ready bilingual foundation model for Khasi and English.&lt;/p&gt;

&lt;p&gt;No outside funding rounds.&lt;br&gt;&lt;br&gt;
No imported talent.&lt;br&gt;&lt;br&gt;
No compromise on local understanding.&lt;/p&gt;

&lt;p&gt;We did it internally at MWire Labs (the AI research division of &lt;a href="https://mwireconsulting.com/" rel="noopener noreferrer"&gt;MWire&lt;/a&gt;, a Shillong-based firm that has delivered IT systems and solutions serving 8+ million citizens since 2017).&lt;/p&gt;

&lt;p&gt;Because when it comes to Northeast languages, the deepest expertise isn’t in Bangalore or California — it’s right here in the hills.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Local Roots Beat Everything Else
&lt;/h3&gt;

&lt;p&gt;Big labs throw hundreds at Indic models.&lt;/p&gt;

&lt;p&gt;We threw eight years of on-the-ground experience.&lt;/p&gt;

&lt;p&gt;We know Khasi isn’t just tokens, it’s morphology, dialect variation, cultural nuance that only someone who grew up hearing it can capture.&lt;/p&gt;

&lt;p&gt;That’s why our tokenizer cuts Khasi token count by 36%.&lt;br&gt;&lt;br&gt;
That’s why the model never auto-translates unless asked.&lt;br&gt;&lt;br&gt;
That’s why it sounds like home.&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Shipped
&lt;/h3&gt;

&lt;p&gt;Kren-M™ (Gemma-2-2B base, 2.6B params):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom tokenizer with 2,135 Khasi/Garo tokens&lt;/li&gt;
&lt;li&gt;5.43 M hand-cleaned Khasi sentences (proprietary — our moat)&lt;/li&gt;
&lt;li&gt;Fully task-aware SFT — natural bilingual behaviour&lt;/li&gt;
&lt;li&gt;Runs offline on 6 GB VRAM&lt;/li&gt;
&lt;/ul&gt;
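&lt;p&gt;A quick sanity check on the 6 GB figure (my own back-of-envelope arithmetic, not from the white paper): half-precision weights cost about 2 bytes per parameter, so 2.6B parameters fit with room to spare.&lt;/p&gt;

```python
# Back-of-envelope VRAM estimate for fp16/bf16 weights (assumption:
# 2 bytes per parameter; excludes activations and the KV cache).
params = 2.6e9
weights_gib = params * 2 / 2**30
print(f"{weights_gib:.2f} GiB")  # ~4.84 GiB, leaving headroom on a 6 GB card
```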

&lt;p&gt;Live: &lt;a href="https://huggingface.co/MWirelabs/Kren-M" rel="noopener noreferrer"&gt;https://huggingface.co/MWirelabs/Kren-M&lt;/a&gt;&lt;br&gt;&lt;br&gt;
White paper: &lt;a href="https://mwirelabs.com/models/kren-m" rel="noopener noreferrer"&gt;https://mwirelabs.com/models/kren-m&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Preprint (DOI): &lt;a href="https://www.researchsquare.com/article/rs-8144118/v1" rel="noopener noreferrer"&gt;https://www.researchsquare.com/article/rs-8144118/v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also open-sourced one of the largest public Assamese &amp;amp; Mizo corpora, plus the first Garo corpus ever released.&lt;/p&gt;

&lt;h3&gt;
  
  
  This Is Just the Beginning
&lt;/h3&gt;

&lt;p&gt;Early 2026: Expect Kren-NE, a Gemma-2-9B-based multilingual model covering Khasi, Garo, Mizo, Assamese, Meitei, Nagamese, Kokborok and more.&lt;/p&gt;

&lt;p&gt;All built the same way: local team, local data, local control.&lt;/p&gt;

&lt;p&gt;The future of Northeast AI won’t be built in glass towers far away.  &lt;/p&gt;

&lt;p&gt;It will be built here, by us, for us.&lt;/p&gt;

&lt;p&gt;#NEindicLLM #KhasiLLM #MeghalayaAI #NortheastAI&lt;/p&gt;

</description>
      <category>ai</category>
      <category>northeastai</category>
      <category>northeastindicllm</category>
      <category>meghalayaai</category>
    </item>
    <item>
      <title>Building Language Tech for Meghalaya: Lessons from Tokenizing Khasi and Garo with Modern LLMs</title>
      <dc:creator>B Nyalang</dc:creator>
      <pubDate>Sat, 20 Sep 2025 18:45:49 +0000</pubDate>
      <link>https://dev.to/b_nyalang/building-language-tech-for-meghalaya-lessons-from-tokenizing-khasi-and-garo-with-modern-llms-599p</link>
      <guid>https://dev.to/b_nyalang/building-language-tech-for-meghalaya-lessons-from-tokenizing-khasi-and-garo-with-modern-llms-599p</guid>
      <description>&lt;p&gt;When people talk about AI and language models, they rarely mean languages like Khasi or Garo. But for those of us working in Northeast India, that’s exactly where the challenge—and the opportunity—lies.&lt;/p&gt;

&lt;p&gt;Over the past few months, I’ve been diving deep into how modern LLMs handle tokenization for low-resource languages, especially those with unique orthographic features. Khasi (Austroasiatic) and Garo (Tibeto-Burman) aren’t just linguistically rich—they’re structurally distinct from the Indo-Aryan mainstream. That makes them a fascinating testbed for evaluating how well current models preserve linguistic authenticity.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 What I Found
&lt;/h3&gt;

&lt;p&gt;Most open-source LLMs tokenize these languages poorly. Diacritics get corrupted, middle dots turn into hex gibberish, and meaningful units are fractured. Even models with massive vocabularies struggle unless they’ve been trained with orthographic sensitivity.&lt;/p&gt;

&lt;p&gt;I ran a systematic evaluation across five models—including Gemma, Falcon, LLaMA, and Nemotron—using both efficiency and authenticity metrics. The results were surprising: one model nailed it, most didn’t.&lt;/p&gt;
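&lt;p&gt;The two metric families can be sketched in a few lines (my own minimal framing of "efficiency" and "authenticity", not the actual evaluation framework; the sample sentences are placeholders): fertility measures subword tokens per word, and a round-trip check catches lossy encode-decode.&lt;/p&gt;

```python
def fertility(tokenize, sentences):
    # Efficiency: average subword tokens per whitespace-separated word
    # (lower means the vocabulary covers the language more compactly).
    tokens = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

def round_trip_rate(tokenize, detokenize, sentences):
    # Authenticity: share of sentences that survive encode-then-decode
    # unchanged (corrupted diacritics or middle dots fail this check).
    ok = sum(detokenize(tokenize(s)) == s for s in sentences)
    return ok / len(sentences)

sents = ["ka·la·ï", "Ka Khasi"]  # placeholder samples, not an evaluation set
print(fertility(str.split, sents))                  # whitespace baseline: 1.0
print(fertility(list, sents))                       # naive character fallback: 5.0
print(round_trip_rate(str.split, " ".join, sents))  # lossless here: 1.0
```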

&lt;h3&gt;
  
  
  🧪 Why Tokenization Matters
&lt;/h3&gt;

&lt;p&gt;If your tokenizer breaks a word like &lt;em&gt;ka·la·ï&lt;/em&gt; into meaningless fragments, downstream tasks like translation, speech synthesis, or search will fail. For civic tech, that’s not just a bug—it’s a barrier to access.&lt;/p&gt;
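&lt;p&gt;A concrete look at why that word is fragile (a minimal illustration, assuming a byte-level fallback with no learned merges for these characters): both the middle dot (U+00B7) and "ï" are multi-byte in UTF-8, so byte-level splitting fractures the characters themselves.&lt;/p&gt;

```python
word = "ka·la·ï"
print(len(word))                   # 7 characters
raw = word.encode("utf-8")
print(len(raw))                    # 10 bytes: each dot and the diaeresis take 2 bytes
print([f"{b:#04x}" for b in raw])  # "·" is the 0xc2 0xb7 pair a lossy decoder renders as hex gibberish
```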

&lt;h3&gt;
  
  
  🌱 What Comes Next
&lt;/h3&gt;

&lt;p&gt;This isn’t just about benchmarking. It’s about building a reproducible, region-first ecosystem for language tech in Meghalaya. I’ve released the evaluation framework as a public artifact, and I’m working toward open-source models that respect the linguistic integrity of Khasi and Garo.&lt;/p&gt;

&lt;p&gt;If you’re building LLMs, working on STT/TTS, or deploying civic tech in Northeast India, tokenization isn’t a footnote—it’s foundational.&lt;/p&gt;




&lt;h3&gt;
  
  
  🙌 Final Thought
&lt;/h3&gt;

&lt;p&gt;Language tech isn’t just about scale—it’s about respect. And sometimes, the smallest tokens carry the biggest meaning.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Kren v1: Turning an Encoder into a Khasi-Speaking AI</title>
      <dc:creator>B Nyalang</dc:creator>
      <pubDate>Wed, 17 Sep 2025 13:48:14 +0000</pubDate>
      <link>https://dev.to/b_nyalang/kren-v1-turning-an-encoder-into-a-khasi-speaking-ai-1bd3</link>
      <guid>https://dev.to/b_nyalang/kren-v1-turning-an-encoder-into-a-khasi-speaking-ai-1bd3</guid>
      <description>&lt;p&gt;Most generative AI models don’t speak Khasi. Or several Northeast Indian language, really. So, I built &lt;a href="https://huggingface.co/MWirelabs/kren-v1" rel="noopener noreferrer"&gt;Kren v1&lt;/a&gt;—a compact, GPT-2-style model that can generate Khasi text, trained from scratch by converting an encoder into a decoder.&lt;/p&gt;

&lt;p&gt;This wasn’t just a fine-tuning job. It was a full architectural pivot.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔄 From KhasiBERT to Kren
&lt;/h3&gt;

&lt;p&gt;Kren started life as &lt;a href="https://huggingface.co/MWirelabs/khasibert" rel="noopener noreferrer"&gt;KhasiBERT&lt;/a&gt;, a RoBERTa-style encoder trained on Khasi. But encoders don’t generate—they classify. So I reworked it into a decoder, transferring weights and adapting it to GPT-2’s causal format.&lt;/p&gt;

&lt;p&gt;Why bother? Because there’s no generative model for Khasi. And building one from scratch with limited data is tough.&lt;/p&gt;
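&lt;p&gt;The core of the pivot can be illustrated with a toy attention head (a sketch of the general encoder-to-decoder technique, not Kren's actual conversion code): the Q/K/V projections carry over from the encoder, and the decisive change is a causal mask that hides future positions.&lt;/p&gt;

```python
import numpy as np

def causal_attention(q, k, v):
    # Toy single-head attention. An encoder (KhasiBERT-style) skips the
    # mask and attends bidirectionally; a causal decoder masks out every
    # position to the right so generation only conditions on the past.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (T, T) logits
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf                              # block attention to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return w @ v, w

x = np.random.default_rng(0).standard_normal((4, 8))
out, w = causal_attention(x, x, x)
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no token attends to its future
```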

&lt;h3&gt;
  
  
  📊 Training Breakdown
&lt;/h3&gt;

&lt;p&gt;I tested different data sizes to find the sweet spot for generation quality—not just loss scores. Here’s how it played out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Lines of Khasi Text&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v0.1&lt;/td&gt;
&lt;td&gt;300K&lt;/td&gt;
&lt;td&gt;3.149&lt;/td&gt;
&lt;td&gt;Basic generation, short replies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.2&lt;/td&gt;
&lt;td&gt;800K&lt;/td&gt;
&lt;td&gt;2.995&lt;/td&gt;
&lt;td&gt;Dialogue improves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.0&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;2.960&lt;/td&gt;
&lt;td&gt;Abstract reasoning kicks in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.4&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;2.903&lt;/td&gt;
&lt;td&gt;Lower loss, but degraded output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More data didn’t mean better results. At 2M lines, the model started to lose coherence—so I stuck with 1M for the final release.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧵 What Kren Can Do
&lt;/h3&gt;

&lt;p&gt;Kren v1 can generate Khasi text about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Places&lt;/li&gt;
&lt;li&gt;Cultural topics&lt;/li&gt;
&lt;li&gt;Abstract reasoning and multi-sentence replies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not perfect—there’s a 514-token limit, and it can hallucinate or reflect biases. But it’s a start.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚀 Try It Yourself
&lt;/h3&gt;

&lt;p&gt;You can test it on &lt;a href="https://huggingface.co/MWirelabs/kren-v1" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; or load it locally with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MWirelabs/kren-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MWirelabs/kren-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ka Khasi ka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🌱 Why This Matters
&lt;/h3&gt;

&lt;p&gt;Kren v1 shows that it’s possible to build generative models for low-resource languages—even by converting encoders. It’s compact, reproducible, and open for anyone to build on.&lt;/p&gt;

&lt;p&gt;If you’re working on regional NLP or want to explore encoder-to-decoder conversions, check out &lt;a href="https://mwirelabs.com/" rel="noopener noreferrer"&gt;MWire Labs&lt;/a&gt;. We’re building tools that reflect the linguistic diversity of Northeast India—quietly, but with purpose.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>khasi</category>
      <category>meghalaya</category>
    </item>
    <item>
      <title>KhasiBERT: A Regional Language Model for Khasi NLP</title>
      <dc:creator>B Nyalang</dc:creator>
      <pubDate>Fri, 12 Sep 2025 10:51:40 +0000</pubDate>
      <link>https://dev.to/b_nyalang/khasibert-a-region-first-language-model-for-khasi-nlp-1i7k</link>
      <guid>https://dev.to/b_nyalang/khasibert-a-region-first-language-model-for-khasi-nlp-1i7k</guid>
      <description>&lt;p&gt;Most language models overlook low-resource languages. Khasibert is built to change that—it's an open-source Khasi language model designed for translation, summarization, and civic NLP tasks in Northeast India.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is KhasiBERT?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A compact transformer-based language model, architected for region-first NLP in Khasi&lt;/li&gt;
&lt;li&gt;Optimized for low-resource deployment and real-world usability&lt;/li&gt;
&lt;li&gt;Built by &lt;a href="https://www.mwirelabs.com" rel="noopener noreferrer"&gt;MWire Labs&lt;/a&gt; to support inclusive, culturally aware AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Khasi is spoken by over a million people, yet underrepresented in mainstream NLP&lt;/li&gt;
&lt;li&gt;KhasiBERT enables language technology research, civic applications, and education tools&lt;/li&gt;
&lt;li&gt;It’s part of a broader mission to democratize AI for Northeast India.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s Under the Hood
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pretrained on cleaned, deduplicated Khasi text&lt;/li&gt;
&lt;li&gt;Fine-tuned for translation, summarization, and semantic understanding&lt;/li&gt;
&lt;li&gt;Benchmarked for responsiveness in resource-constrained environments&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>nlp</category>
      <category>ai</category>
      <category>huggingface</category>
    </item>
  </channel>
</rss>
