I designed a 0.9B Mamba-2 / GLA hybrid LLM — the AI agents wrote the code. An honest build log.

#ai #python #machinelearning #buildinpublic

Let me be clear about my role up front, because it matters: I didn't hand-write the code for this. I designed the system and directed it — the architecture, the decisions, the why, and the discipline of debugging it. The actual implementation was written by AI coding agents (Claude and Codex). I was the architect and the lead; they were the hands.

That collaboration is half the reason I'm writing this. The other half is that it's still a work in progress, and I'd rather show the honest version.

What it is

Helix v2 / Auralis — a ~0.9B-parameter hybrid language model, built from the tokenizer up (not a fine-tune, not an API wrapper):

28 layers, heterogeneous: 6× Mamba-2 (state-space) → 16× GLA (Gated Linear Attention) → 6× Sparse-Attention
Pre-Norm (RMSNorm), RoPE, SwiGLU FFN
Tied 200k SentencePiece vocabulary, bilingual German/English
d_model 1280, bf16

The reasoning behind the mix: cheap Mamba-2 (O(n)) at the bottom to move information, GLA in the middle, and a few precise Sparse-Attention layers on top where exact token-mixing actually matters — so most layers never pay the quadratic-attention cost.
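To make the stacking order concrete, here is a minimal sketch of how such a layout could be described in a config object. The layer counts and d_model come from the spec list above; the class name, field names, and type strings are hypothetical and not taken from the Auralis repo.

```python
from dataclasses import dataclass


# Layer counts and d_model follow the post; all names here are hypothetical.
@dataclass(frozen=True)
class HybridConfig:
    d_model: int = 1280
    n_mamba2: int = 6
    n_gla: int = 16
    n_sparse_attn: int = 6

    def layer_types(self) -> list[str]:
        """Return the mixer type of every layer, bottom to top."""
        return (["mamba2"] * self.n_mamba2
                + ["gla"] * self.n_gla
                + ["sparse_attn"] * self.n_sparse_attn)


config = HybridConfig()
layers = config.layer_types()

# Mamba-2 and GLA layers are the linear-time mixers in this stack.
linear_time_layers = sum(t != "sparse_attn" for t in layers)
print(f"{len(layers)} layers, {linear_time_layers} with linear-time mixers")
```

Under these assumptions, 22 of the 28 layers use a linear-time mixer and only the top 6 use attention.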

Figure: cross-section of a single Mamba-2 mixer block. The recurrent state h replaces the quadratic attention matrix, so compute grows linearly with sequence length and decoding only needs a fixed-size state.
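The idea that a state replaces the attention matrix can be shown with a deliberately simplified sketch. It assumes a single scalar input channel and a per-step scalar decay, and it leaves out everything else a real Mamba-2 block has (gating, convolution, discretization, multiple heads). The function names are mine, not from the repo. The recurrent form and the explicit T×T form compute the same outputs, but only the second one ever builds a quadratic number of terms.

```python
import math
import random


def ssm_recurrent(x, a, b, c):
    """Linear-time scan: h_t = a_t * h_{t-1} + b_t * x_t, y_t = <c_t, h_t>."""
    h = [0.0] * len(b[0])  # fixed-size state, independent of sequence length
    ys = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, b_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
    return ys


def ssm_quadratic(x, a, b, c):
    """Same map written as an explicit causal T x T mixing matrix."""
    ys = []
    for t in range(len(x)):
        y_t = 0.0
        for s in range(t + 1):
            decay = math.prod(a[s + 1:t + 1])  # decays applied between s and t
            score = sum(c_i * b_i for c_i, b_i in zip(c[t], b[s]))
            y_t += score * decay * x[s]
        ys.append(y_t)
    return ys


# Toy inputs: sequence length 16, state size 4 (arbitrary illustrative sizes).
rng = random.Random(0)
T, N = 16, 4
x = [rng.uniform(-1, 1) for _ in range(T)]
a = [rng.uniform(0.5, 1.0) for _ in range(T)]
b = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(T)]
c = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(T)]

y_rec = ssm_recurrent(x, a, b, c)
y_quad = ssm_quadratic(x, a, b, c)
max_diff = max(abs(p - q) for p, q in zip(y_rec, y_quad))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The recurrent version touches each token once and carries only `h` forward, which is the property the figure is meant to illustrate.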

Honest status (work in progress)

It's at ~33k / 50k training steps:

✅ Trains stably; learns German and English fluently and keeps them separate
✅ Facts are anchored reasonably well in history & geography (measured, below)
⚠️ Science & translation are weaker; 0% code in the current data mix
❌ No instruction-following yet (no SFT) — ask a question, you get raw continuation, not an answer
⚠️ Greedy decoding is still rough

If you're looking for a chatbot to download, this isn't one yet. If you're here for the engineering, read on.

The part worth sharing: it was rarely the data

The most useful lesson wasn't architectural — it was how often my first explanation for a bad result was wrong, and how a careful process caught it.

At one point the model looked like it had regressed. My instinct (and that of two people I asked) was "the data must be bad." It wasn't. It was a stack of setup and measurement problems:

Learning rate too high for warm-start continued pretraining (carried over from a fresh-start schedule).
Invalid baseline — comparing val-loss measured on two different validation sets.
Wrong tokens/byte constant → ~33% inflated bits-per-byte. The model looked worse on paper than it was.
Stochastic eval — nothing was re-seeded, so each evaluation drew different tokens. The "trend" was half real change, half sampling noise.
A wiki-only validation tail produced a fake cross-language gap of ~3.2 bits-per-byte; the real gap was ~1.04.
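The tokens/byte item is easy to reproduce in miniature. The sketch below assumes bits-per-byte is derived from a per-token loss in nats. The function names and all numbers are hypothetical, chosen only to show how a stale constant scales the metric while the model itself is unchanged.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed NLL (in nats) over a text into bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def bits_per_byte_from_constant(loss_nats_per_token: float,
                                tokens_per_byte: float) -> float:
    """Shortcut via a tokens/byte constant; a wrong constant skews the result."""
    return loss_nats_per_token / math.log(2) * tokens_per_byte


# Hypothetical numbers, for illustration only.
text = "Die Hauptstadt von Österreich ist Wien. " * 100
n_bytes = len(text.encode("utf-8"))
n_tokens = 900   # pretend tokenizer output length
loss = 2.0       # pretend mean NLL in nats per token

exact = bits_per_byte(loss * n_tokens, n_bytes)
true_ratio = n_tokens / n_bytes
stale_ratio = true_ratio * 1.33  # constant carried over from another setup
inflated = bits_per_byte_from_constant(loss, stale_ratio)

print(f"exact={exact:.3f}  inflated={inflated:.3f}  "
      f"ratio={inflated / exact:.2f}")
```

Counting bytes on the actual evaluation text, as in `bits_per_byte`, avoids depending on a constant at all.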

And the one that almost sent me chasing ghosts: "the model has no knowledge." Greedy decoding kept flip-flopping on simple facts. The conclusion "the facts aren't there" turned out to be wrong — I measured it properly with a contrastive margin (NLL(wrong) − NLL(correct) per token), and the facts were anchored. The flip-flop was a decoding artifact, not missing knowledge.
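As a sketch of that measurement: the margin compares how surprised the model is by a wrong continuation versus the correct one for the same prompt. I'm assuming here that "per token" means each NLL is length-normalized before subtracting. The log-prob values are made up, and in practice they would come from scoring both continuations with the model.

```python
def mean_nll(token_logprobs):
    """Length-normalized negative log-likelihood of one continuation."""
    return -sum(token_logprobs) / len(token_logprobs)


def contrastive_margin(logprobs_correct, logprobs_wrong):
    """NLL(wrong) - NLL(correct) per token; positive means the fact is preferred."""
    return mean_nll(logprobs_wrong) - mean_nll(logprobs_correct)


# Hypothetical per-token log-probs for two continuations of one prompt,
# e.g. "The capital of France is" + " Paris" vs. + " Lyon".
correct = [-0.4, -0.2]
wrong = [-2.5, -1.9]

margin = contrastive_margin(correct, wrong)
print(f"margin = {margin:+.2f} nats/token")
```

A positive margin across many fact pairs says the knowledge is there even when greedy decoding happens to pick a different token first.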

Figure: the current evidence sheet, showing the training curve, the metrics I actually trust, and an honest maturity grid of what works vs. what doesn't.

The takeaway I keep coming back to: before a bad number becomes "the data," check whether the number even measures what you think it does.

What helped

Deterministic eval (re-seed before every evaluation) — turned a noisy curve into a readable one
A custom 200k tokenizer (the GPT-2 tokenizer was roughly 2× less efficient on German text)
A two-stage data-cleaning pipeline, collecting data by knowledge profile rather than chasing total val-loss
Treating knowledge, recall, and decoding behavior as separate things — conflating them cost me weeks
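The deterministic-eval point can be sketched with the standard library alone. The idea is that the evaluation subset is drawn from its own freshly seeded generator, so it does not depend on how much randomness training has consumed in between. The function names, seed, and fake losses are hypothetical. In a PyTorch loop the analogous step would presumably be a `torch.manual_seed` call or a dedicated `torch.Generator` before each eval.

```python
import random


def sample_eval_indices(dataset_size: int, n_samples: int,
                        seed: int = 1234) -> list[int]:
    """Draw the same evaluation subset on every call via a private RNG."""
    rng = random.Random(seed)  # fresh generator, unaffected by global RNG state
    return rng.sample(range(dataset_size), n_samples)


def evaluate(losses: list[float], n_samples: int, seed: int = 1234) -> float:
    """Mean loss over a fixed subset; `losses` stands in for model outputs."""
    idx = sample_eval_indices(len(losses), n_samples, seed)
    return sum(losses[i] for i in idx) / n_samples


# Hypothetical per-example losses standing in for a validation set.
fake_losses = [2.0 + (i % 7) * 0.1 for i in range(1000)]

random.random()  # simulate training code consuming global randomness
first = evaluate(fake_losses, 100)
random.random()
second = evaluate(fake_losses, 100)

print(first == second)
```

With the subset pinned, a difference between two evaluations can only come from the model, not from which examples happened to be drawn.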

Licensing (precise on purpose)

Code: Apache-2.0 — fully open
Weights: OpenRAIL-M (responsible-use restrictions) — which means the weights are not OSI "open source" in the strict sense. I'd rather say that plainly than misuse the term.

What's next

The longer-term plan isn't just "make this one model bigger." It's a frozen universal base plus swappable DoRA/LoRA adapters. That is also why the large 200k vocabulary exists: its share of the total parameter count shrinks as the base grows. The roadmap:

Finish to 50k → SFT so it can follow instructions
A small reproducible demo
Then scaling: the current ~0.9B model is the foundation, not the goal (3B / 7B+), where the large 200k vocab finally earns its keep as its parameter share shrinks
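The claim about the vocabulary's parameter share is simple arithmetic. The sketch below assumes the vocabulary is exactly 200,000 entries and that the tied embedding is counted once. The first row uses the numbers from this post. The 3B and 7B rows use d_model values I picked purely for illustration, not planned configurations.

```python
def tied_embedding_share(vocab_size: int, d_model: int,
                         total_params: float) -> float:
    """Fraction of all parameters spent on a tied input/output embedding."""
    return vocab_size * d_model / total_params


VOCAB = 200_000  # "200k" from the post; the exact size is an assumption

# (name, d_model, total parameters); the last two rows are hypothetical.
scenarios = [
    ("0.9B", 1280, 0.9e9),
    ("3B", 2560, 3.0e9),
    ("7B", 4096, 7.0e9),
]

shares = []
for name, d_model, total in scenarios:
    share = tied_embedding_share(VOCAB, d_model, total)
    shares.append(share)
    print(f"{name}: {share:.1%} of parameters in the embedding table")
```

Under these assumptions the embedding table drops from roughly 28% of the parameters at 0.9B to about 17% at 3B and 12% at 7B.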

Repo (critique very welcome):
👉 https://github.com/AuraIis/Auralis

The most valuable part of this whole thing was having AI agents do the implementation while I stayed responsible for the decisions — and getting corrected, often, on my own assumptions.