<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ForceGaming4K</title>
    <description>The latest articles on DEV Community by ForceGaming4K (@auraiis).</description>
    <link>https://dev.to/auraiis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3968357%2F6034a22b-4067-44ca-9164-6440dce9dc5b.png</url>
      <title>DEV Community: ForceGaming4K</title>
      <link>https://dev.to/auraiis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/auraiis"/>
    <language>en</language>
    <item>
      <title>I designed a 0.9B Mamba-2 / GLA hybrid LLM — the AI agents wrote the code. An honest build log.</title>
      <dc:creator>ForceGaming4K</dc:creator>
      <pubDate>Thu, 04 Jun 2026 13:16:24 +0000</pubDate>
      <link>https://dev.to/auraiis/i-designed-a-09b-mamba-2-gla-hybrid-llm-the-ai-agents-wrote-the-code-an-honest-build-log-dnj</link>
      <guid>https://dev.to/auraiis/i-designed-a-09b-mamba-2-gla-hybrid-llm-the-ai-agents-wrote-the-code-an-honest-build-log-dnj</guid>
      <description>&lt;p&gt;Let me be clear about my role up front, because it matters: &lt;strong&gt;I didn't hand-write the code for this.&lt;/strong&gt; I designed the system and directed it — the architecture, the decisions, the &lt;em&gt;why&lt;/em&gt;, and the discipline of debugging it. The actual implementation was written by AI coding agents (Claude and Codex). I was the architect and the lead; they were the hands.&lt;/p&gt;

&lt;p&gt;That collaboration is half the reason I'm writing this. The other half is that it's still a work in progress, and I'd rather show the honest version.&lt;/p&gt;

&lt;h2&gt;What it is&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Helix v2 / Auralis&lt;/strong&gt; — a ~0.9B-parameter &lt;strong&gt;hybrid&lt;/strong&gt; language model, built from the tokenizer up (not a fine-tune, not an API wrapper):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;28 layers&lt;/strong&gt;, heterogeneous: &lt;strong&gt;6× Mamba-2&lt;/strong&gt; (state-space) → &lt;strong&gt;16× GLA&lt;/strong&gt; (Gated Linear Attention) → &lt;strong&gt;6× Sparse-Attention&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Pre-Norm (RMSNorm), RoPE, SwiGLU FFN&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tied 200k SentencePiece&lt;/strong&gt; vocabulary, bilingual &lt;strong&gt;German/English&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;d_model 1280, bf16&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sa3ekfr8x1glezzemmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sa3ekfr8x1glezzemmh.png" alt="Helix v2 architecture — 28-layer hybrid: 6× Mamba-2, 16× GLA, 6× Sparse-Attention, tied 200k vocabulary" width="799" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reasoning behind the mix: cheap &lt;strong&gt;Mamba-2 (O(n))&lt;/strong&gt; at the bottom to move information, &lt;strong&gt;GLA&lt;/strong&gt; in the middle, and a few &lt;strong&gt;precise Sparse-Attention&lt;/strong&gt; layers on top where exact token-mixing actually matters — so most layers never pay the quadratic-attention cost.&lt;/p&gt;

&lt;p&gt;Here's a cross-section of a single Mamba-2 mixer block. The recurrent state &lt;code&gt;h&lt;/code&gt; replaces the quadratic attention matrix, so compute grows linearly with sequence length and the memory carried from step to step at inference stays constant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlnr1t4dj3td5vo9qcue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlnr1t4dj3td5vo9qcue.png" alt="Mamba-2 selective state-space mixer — cross-section of one block, linear-time O(n)" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Honest status (work in progress)&lt;/h2&gt;

&lt;p&gt;It's at &lt;strong&gt;~33k / 50k&lt;/strong&gt; training steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Trains stably; learns German and English fluently and keeps them separate&lt;/li&gt;
&lt;li&gt;✅ Facts are anchored reasonably well in history &amp;amp; geography (measured, below)&lt;/li&gt;
&lt;li&gt;⚠️ Science &amp;amp; translation are weaker; &lt;strong&gt;0% code&lt;/strong&gt; in the current data mix&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;No instruction-following yet&lt;/strong&gt; (no SFT) — ask a question, you get raw continuation, not an answer&lt;/li&gt;
&lt;li&gt;⚠️ Greedy decoding is still rough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're looking for a chatbot to download, this isn't one yet. If you're here for the engineering, read on.&lt;/p&gt;

&lt;h2&gt;The part worth sharing: it was rarely the data&lt;/h2&gt;

&lt;p&gt;The most useful lesson wasn't architectural — it was how often my first explanation for a bad result was &lt;strong&gt;wrong&lt;/strong&gt;, and how a careful process caught it.&lt;/p&gt;

&lt;p&gt;At one point the model looked like it had regressed. My instinct (and that of two people I asked) was "the data must be bad." It wasn't. It was one training-setup mistake on top of a stack of &lt;strong&gt;measurement problems&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate too high&lt;/strong&gt; for warm-start continued pretraining (carried over from a fresh-start schedule).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalid baseline&lt;/strong&gt; — comparing val-loss measured on two &lt;em&gt;different&lt;/em&gt; validation sets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong tokens/byte constant&lt;/strong&gt; → bits-per-byte inflated by ~33%. The model looked worse on paper than it was.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stochastic eval&lt;/strong&gt; — nothing was re-seeded, so each evaluation drew &lt;em&gt;different&lt;/em&gt; tokens. The "trend" was half real change, half sampling noise.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;wiki-only validation tail&lt;/strong&gt; produced a fake cross-language gap of ~3.2 bits-per-byte; the real gap was ~1.04.&lt;/li&gt;
&lt;/ol&gt;
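
&lt;p&gt;Item 3 is easy to reproduce on paper. Bits-per-byte is the per-token loss converted to bits and scaled by tokens per byte, so any error in that ratio passes straight into the metric. The sketch below is an assumption about how such a check could look, not the project's evaluation code; the function name, the token count, and the loss value are hypothetical.&lt;/p&gt;

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    """Convert mean per-token NLL (in nats) to bits per UTF-8 byte."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

text = "Die Hauptstadt von Deutschland ist Berlin."
n_bytes = len(text.encode("utf-8"))   # measured on the actual eval text
n_tokens = 9                          # hypothetical token count for this string
mean_nll = 2.1                        # hypothetical validation loss, nats/token

measured = bits_per_byte(mean_nll, n_tokens, n_bytes)

# A hard-coded tokens/byte constant that is 33% too high
# inflates bits-per-byte by the same 33%.
stale_tokens_per_byte = 1.33 * n_tokens / n_bytes
inflated = mean_nll / math.log(2) * stale_tokens_per_byte

print(round(inflated / measured, 2))
```

&lt;p&gt;Counting tokens and bytes on the same text that produced the loss avoids relying on a constant that may have been computed for a different tokenizer or corpus.&lt;/p&gt;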

&lt;p&gt;And the one that almost sent me chasing ghosts: &lt;strong&gt;"the model has no knowledge."&lt;/strong&gt; Greedy decoding kept flip-flopping on simple facts. The conclusion "the facts aren't there" turned out to be wrong. I measured it properly with a contrastive margin (&lt;code&gt;NLL(wrong) − NLL(correct)&lt;/code&gt; per token, where a positive value means the model prefers the correct answer), and the facts &lt;em&gt;were&lt;/em&gt; anchored. The flip-flop was a &lt;strong&gt;decoding artifact&lt;/strong&gt;, not missing knowledge.&lt;/p&gt;
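
&lt;p&gt;A contrastive margin of this kind can be sketched in a few lines. This is an illustration under assumptions, not the repo's evaluation script: the helper names and the log-probability values are made up, and in practice the per-token log-probs would come from scoring both completions with the model.&lt;/p&gt;

```python
def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token, from per-token log-probs."""
    return -sum(token_logprobs) / len(token_logprobs)

def contrastive_margin(logprobs_correct, logprobs_wrong):
    """NLL(wrong) - NLL(correct) per token; positive favors the correct answer."""
    return mean_nll(logprobs_wrong) - mean_nll(logprobs_correct)

# Hypothetical log-probs for two completions of the same prompt,
# e.g. "The capital of France is" + " Paris" vs. " Lyon".
correct = [-0.9, -0.4]
wrong = [-2.3, -1.8]

margin = contrastive_margin(correct, wrong)
print(f"margin = {margin:.2f} nats/token")
```

&lt;p&gt;Because the margin compares likelihoods directly, it does not depend on which token greedy decoding happens to pick first, which is why it can separate "knows the fact" from "decodes it reliably".&lt;/p&gt;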

&lt;p&gt;Here's the current evidence sheet — training curve, the metrics I actually trust, and an honest maturity grid of what works vs. what doesn't:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F366oejg0bkcjcfmc2z1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F366oejg0bkcjcfmc2z1r.png" alt="Helix v2 measurements and maturity — training curve, key metrics, component maturity grid" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The takeaway I keep coming back to: &lt;strong&gt;before a bad number becomes "the data," check whether the number even measures what you think it does.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What helped&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic eval&lt;/strong&gt; (re-seed before every evaluation) — turned a noisy curve into a readable one&lt;/li&gt;
&lt;li&gt;A custom &lt;strong&gt;200k tokenizer&lt;/strong&gt; (the GPT-2 tokenizer was roughly 2× less efficient on German text)&lt;/li&gt;
&lt;li&gt;A two-stage &lt;strong&gt;data-cleaning pipeline&lt;/strong&gt;, collecting data by &lt;strong&gt;knowledge profile&lt;/strong&gt; rather than chasing total val-loss&lt;/li&gt;
&lt;li&gt;Treating &lt;strong&gt;knowledge, recall, and decoding behavior as separate things&lt;/strong&gt; — conflating them cost me weeks&lt;/li&gt;
&lt;/ul&gt;
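
&lt;p&gt;The deterministic-eval point comes down to drawing the same evaluation subset every time. A minimal sketch, assuming evaluation samples indices from a larger validation pool; the function name and the seed value are hypothetical:&lt;/p&gt;

```python
import random

def sample_eval_indices(n_examples, n_eval, seed=1234):
    """Return the same evaluation subset on every call."""
    rng = random.Random(seed)   # fresh local RNG, independent of training randomness
    return sorted(rng.sample(range(n_examples), n_eval))

first = sample_eval_indices(10_000, 5)
second = sample_eval_indices(10_000, 5)
print(first == second)
```

&lt;p&gt;Using a local generator instead of the global one keeps evaluation reproducible even when the training loop consumes random numbers in between.&lt;/p&gt;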

&lt;h2&gt;Licensing (precise on purpose)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code: Apache-2.0&lt;/strong&gt; — fully open&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weights: OpenRAIL-M&lt;/strong&gt; (responsible-use restrictions) — which means the weights are &lt;strong&gt;not&lt;/strong&gt; OSI "open source" in the strict sense. I'd rather say that plainly than misuse the term.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;The longer-term plan isn't just "make this one model bigger." It's a frozen universal base plus swappable DoRA/LoRA adapters — which is also why the large 200k vocabulary exists, and why its parameter cost gets cheaper as the base grows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pi9ybcgfwueqwbqyace.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pi9ybcgfwueqwbqyace.png" alt="Auralis system vision — frozen universal base plus DoRA/LoRA adapters, and the scaling roadmap" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finish to 50k → &lt;strong&gt;SFT&lt;/strong&gt; so it can follow instructions&lt;/li&gt;
&lt;li&gt;A small reproducible demo&lt;/li&gt;
&lt;li&gt;Then &lt;strong&gt;scaling&lt;/strong&gt;: the ~0.9B model is the foundation, not the goal. The targets are 3B / 7B+, where the large 200k vocab finally earns its keep as its parameter share shrinks&lt;/li&gt;
&lt;/ul&gt;
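
&lt;p&gt;The shrinking parameter share can be checked with back-of-the-envelope arithmetic. A tied embedding matrix costs roughly vocab size × d_model parameters, counted once. The sketch assumes a vocabulary of exactly 200,000 entries; the d_model values for the larger models are illustrative guesses, not planned configurations.&lt;/p&gt;

```python
def embedding_share(vocab_size, d_model, total_params):
    """Fraction of all parameters held by a tied embedding / output matrix."""
    return vocab_size * d_model / total_params

# From the post: ~200k vocab, d_model 1280, ~0.9B parameters.
print(f"0.9B: {embedding_share(200_000, 1280, 0.9e9):.1%}")

# Hypothetical larger configs; these d_model values are assumptions.
for label, d_model, total in [("3B", 2560, 3e9), ("7B", 4096, 7e9)]:
    print(f"{label}: {embedding_share(200_000, d_model, total):.1%}")
```

&lt;p&gt;Under these assumptions the embedding takes about 28% of the parameters at 0.9B and falls to roughly 17% and 12% at the two larger sizes, because the embedding grows linearly with d_model while the layer stack grows faster.&lt;/p&gt;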

&lt;p&gt;Repo (critique very welcome):&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/AuraIis/Helix" rel="noopener noreferrer"&gt;https://github.com/AuraIis/Helix&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most valuable part of this whole thing was having AI agents do the implementation while I stayed responsible for the decisions — and getting corrected, often, on my own assumptions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
