Let me be clear about my role up front, because it matters: I didn't hand-write the code for this. I designed the system and directed it — the architecture, the decisions, the why, and the discipline of debugging it. The actual implementation was written by AI coding agents (Claude and Codex). I was the architect and the lead; they were the hands.
That collaboration is half the reason I'm writing this. The other half is that it's still a work in progress, and I'd rather show the honest version.
What it is
Helix v2 / Auralis — a ~0.9B-parameter hybrid language model, built from the tokenizer up (not a fine-tune, not an API wrapper):
- 28 layers, heterogeneous: 6× Mamba-2 (state-space) → 16× GLA (Gated Linear Attention) → 6× Sparse-Attention
- Pre-Norm (RMSNorm), RoPE, SwiGLU FFN
- Tied 200k SentencePiece vocabulary, bilingual German/English
- d_model 1280, bf16
The reasoning behind the mix: cheap Mamba-2 (O(n)) at the bottom to move information, GLA in the middle, and a few precise Sparse-Attention layers on top where exact token-mixing actually matters — so most layers never pay the quadratic-attention cost.
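As a rough sketch of that layout: the layer counts and d_model below come from the spec above, while the string labels and the helper function are illustrative names, not identifiers from the actual repo.

```python
from collections import Counter

# Counts and d_model are taken from the spec above.
# The labels are hypothetical names chosen for this sketch.
LAYER_PLAN = [("mamba2", 6), ("gla", 16), ("sparse_attention", 6)]
D_MODEL = 1280


def expand_plan(plan):
    """Flatten (kind, count) pairs into one label per layer, bottom to top."""
    return [kind for kind, count in plan for _ in range(count)]


layers = expand_plan(LAYER_PLAN)
counts = Counter(layers)

# Mamba-2 and GLA both scale linearly with sequence length.
linear_time = counts["mamba2"] + counts["gla"]

print(f"depth: {len(layers)}")
print(f"linear-time layers: {linear_time} ({linear_time / len(layers):.0%})")
```

Only the top six layers in this plan use attention at all, which is the point of the ordering.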
Here's a cross-section of a single Mamba-2 mixer block — the state h replaces the quadratic attention matrix, giving linear-time inference and constant memory:
Honest status (work in progress)
Training is at ~33k of a planned 50k steps:
- ✅ Trains stably; learns German and English fluently and keeps them separate
- ✅ Facts are anchored reasonably well in history & geography (measured, below)
- ⚠️ Science & translation are weaker; 0% code in the current data mix
- ❌ No instruction-following yet (no SFT) — ask a question, you get raw continuation, not an answer
- ⚠️ Greedy decoding is still rough
If you're looking for a chatbot to download, this isn't one yet. If you're here for the engineering, read on.
The part worth sharing: it was rarely the data
The most useful lesson wasn't architectural — it was how often my first explanation for a bad result was wrong, and how a careful process caught it.
At one point the model looked like it had regressed. My instinct (and that of two people I asked) was "the data must be bad." It wasn't. It was a stack of training-setup and measurement problems:
- Learning rate too high for warm-start continued pretraining (carried over from a fresh-start schedule).
- Invalid baseline — comparing val-loss measured on two different validation sets.
- Wrong tokens/byte constant → ~33% inflated bits-per-byte. The model looked worse on paper than it was.
- Stochastic eval — nothing was re-seeded, so each evaluation drew different tokens. The "trend" was half real change, half sampling noise.
- A wiki-only validation tail produced a fake cross-language gap of ~3.2 bits-per-byte; the real gap was ~1.04.
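To make the tokens/byte item concrete, here is a minimal sketch of the usual bits-per-byte conversion. It assumes the loss is a mean per-token NLL in nats; the numbers are made up for illustration and are not measurements from this project.

```python
import math


def bits_per_byte(mean_nll_nats, tokens_per_byte):
    """Convert a mean per-token NLL (in nats) into bits per UTF-8 byte."""
    bits_per_token = mean_nll_nats / math.log(2)
    return bits_per_token * tokens_per_byte


# Hypothetical validation text and token count, for illustration only.
text = "Der Rhein fließt durch Köln. " * 100
n_bytes = len(text.encode("utf-8"))
n_tokens = 700          # assumed tokenizer output length
mean_nll = 2.0          # assumed validation loss in nats per token

# Ratio measured on the actual validation text.
measured_ratio = n_tokens / n_bytes
# A stale hard-coded constant that is 33% too high.
stale_ratio = 1.33 * measured_ratio

correct_bpb = bits_per_byte(mean_nll, measured_ratio)
inflated_bpb = bits_per_byte(mean_nll, stale_ratio)

print(f"correct:  {correct_bpb:.3f} bits/byte")
print(f"inflated: {inflated_bpb:.3f} bits/byte")
```

Because the conversion is linear in tokens/byte, any error in that constant carries straight through to the reported number. Measuring the ratio on the validation set itself avoids the problem.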
And the one that almost sent me chasing ghosts: "the model has no knowledge." Greedy decoding kept flip-flopping on simple facts. The conclusion "the facts aren't there" turned out to be wrong — I measured it properly with a contrastive margin (NLL(wrong) − NLL(correct) per token), and the facts were anchored. The flip-flop was a decoding artifact, not missing knowledge.
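A minimal sketch of that margin, assuming you already have the model's natural-log probabilities for each token of the correct and the wrong continuation. The function names and example values are hypothetical, and averaging over the continuation's tokens is my reading of "per token".

```python
def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token from natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def contrastive_margin(correct_logprobs, wrong_logprobs):
    """NLL(wrong) - NLL(correct), per token.

    Positive: the model assigns higher likelihood to the correct answer.
    Negative: it prefers the wrong one.
    """
    return mean_nll(wrong_logprobs) - mean_nll(correct_logprobs)


# Hypothetical per-token log-probs for two candidate continuations
# of the same prompt (the lengths may differ).
correct = [-0.4, -0.3]
wrong = [-2.1, -1.9, -2.6]

margin = contrastive_margin(correct, wrong)
print(f"margin: {margin:.2f} nats/token")
```

The margin never samples from the model, so it is unaffected by whatever makes greedy decoding unstable. That is what lets it separate "the fact isn't stored" from "the decoder didn't surface it".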
Here's the current evidence sheet — training curve, the metrics I actually trust, and an honest maturity grid of what works vs. what doesn't:
The takeaway I keep coming back to: before a bad number becomes "the data," check whether the number even measures what you think it does.
What helped
- Deterministic eval (re-seed before every evaluation) — turned a noisy curve into a readable one
- A custom 200k tokenizer (the GPT-2 one was roughly 2× less efficient on German)
- A two-stage data-cleaning pipeline, collecting data by knowledge profile rather than chasing total val-loss
- Treating knowledge, recall, and decoding behavior as separate things — conflating them cost me weeks
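The deterministic-eval point can be sketched as follows. This is a simplified stand-in, not the project's evaluation loop: it uses a freshly seeded private RNG to pick the evaluation subset, and the function names, seed, and toy dataset are assumptions. In a PyTorch loop the analogous step would be calling torch.manual_seed before each evaluation.

```python
import random


def sample_eval_indices(dataset_size, n_samples, seed=1234):
    """Pick the same evaluation subset on every call.

    A new RNG is created from the seed each time, so the draw does not
    depend on how much randomness training consumed in between.
    """
    rng = random.Random(seed)
    return rng.sample(range(dataset_size), n_samples)


def evaluate(loss_fn, dataset, n_samples, seed=1234):
    """Mean loss over a fixed, seed-determined subset of the dataset."""
    indices = sample_eval_indices(len(dataset), n_samples, seed)
    return sum(loss_fn(dataset[i]) for i in indices) / n_samples


# Toy data and a toy "loss" so the sketch runs on its own.
dataset = [float(i % 7) for i in range(1000)]

first = evaluate(lambda x: x, dataset, n_samples=64)
random.random()  # simulate unrelated randomness between evaluations
second = evaluate(lambda x: x, dataset, n_samples=64)

print(first == second)
```

With the subset fixed, any change between two evaluations comes from the model rather than from which tokens happened to be drawn.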
Licensing (precise on purpose)
- Code: Apache-2.0 — fully open
- Weights: OpenRAIL-M (responsible-use restrictions) — which means the weights are not OSI "open source" in the strict sense. I'd rather say that plainly than misuse the term.
What's next
The longer-term plan isn't just "make this one model bigger." It's a frozen universal base plus swappable DoRA/LoRA adapters, which is also why the large 200k vocabulary exists: its share of the parameter budget shrinks as the base grows. The steps from here:
- Finish to 50k → SFT so it can follow instructions
- A small reproducible demo
- Then scaling to 3B / 7B+: the ~1B model is the foundation, not the goal, and the large 200k vocab finally earns its keep as its parameter share shrinks
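A back-of-the-envelope sketch of that shrinking share. The 0.9B / d_model 1280 row uses the figures from this post, with "200k" taken as exactly 200,000. The d_model values for 3B and 7B are placeholders I picked for illustration, not planned configurations.

```python
VOCAB_SIZE = 200_000  # "200k" from the spec, assumed to be exactly 200,000


def embedding_share(total_params, d_model, vocab_size=VOCAB_SIZE):
    """Fraction of all parameters spent on the embedding matrix.

    With tied embeddings the input and output projections share one
    vocab_size x d_model matrix, so it is counted once.
    """
    return (vocab_size * d_model) / total_params


# (total parameters, d_model); only the first row comes from the post.
configs = {
    "0.9B (current)": (0.9e9, 1280),
    "3B (hypothetical)": (3.0e9, 2560),
    "7B (hypothetical)": (7.0e9, 4096),
}

shares = {name: embedding_share(p, d) for name, (p, d) in configs.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of parameters in the embedding")
```

Embedding cost grows linearly with d_model while the rest of the network grows faster, so the vocabulary takes a smaller slice of each larger model.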
Repo (critique very welcome):
👉 https://github.com/AuraIis/Helix
The most valuable part of this whole thing was having AI agents do the implementation while I stayed responsible for the decisions — and getting corrected, often, on my own assumptions.



