Monolithic 637K-parameter GRU out. Five tiny specialist heads in. Counting tripled. Physics doubled. No more cliffs.
If you've read Parts 1 through 4, you already know the pattern: when a piece of OLT-1 isn't working, we don't make it bigger. We sandbox-test the alternatives, pick the one that actually wins, and keep what works.
This is the post where that pattern hit the decoder.
The decoder was the loudest part of OLT-1 - literally. It's the component that turns concept activations into language. A single GRU, 637,000 parameters, about 40% of OLT-1's entire parameter count. It was carrying the whole "talking" workload for every category: physics explanations, counting answers, emotional responses, classification queries, everything.
And it kept catastrophically forgetting.
The Problem
Every training cycle, the monolithic decoder was effectively trying to relearn English from scratch. You teach it to count better, and its physics answers degrade. You teach it physics, and its conversation loop starts sounding like a textbook. The 22K-training-pair curriculum retrain we described in Part 4 - the one that dropped pass rate from 45.6% to 31.6% - was the clearest symptom. One big model was trying to do everything, and any update in one domain bled into the others.
This is the fundamental problem with monolithic decoders: they have no internal boundaries. Physics tokens and greeting tokens and counting tokens all share the same GRU cells, the same output head, the same everything. Backprop for one category moves weights for all of them. There's no way to train "just the physics part" because there is no physics part. There's just the decoder.
We'd been retraining it, patching it, adding replay, adding retention tests, hoping that with enough discipline the forgetting would stay below noise. It never did. The cliffs kept coming.
The Insight
Here's what we'd been doing wrong: asking the decoder to relearn English from scratch every time.
But English already has structure. 26 letters. Words. Grammar. Phrases that get used over and over. The teacher loop (Part 3) had already generated 20,000+ validated good responses sitting in the hippocampus. We'd been treating that hippocampus as a passive memory. But it's also a phrase library. A corpus of things OLT-1 has already said well, indexed by the concepts that triggered them.
Why was the decoder re-deriving "ice floats because it is less dense than water" from the concept space every time, when we already had that exact sentence stored?
The decoder didn't need to be a language model. It needed to be a router.
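A minimal sketch of that phrase-library idea - the class, method names, and Jaccard-overlap scoring here are illustrative stand-ins, not OLT-1's actual hippocampus API:

```python
class PhraseCache:
    """Concept-indexed cache of validated responses: store sentences
    keyed by the concept sets that triggered them, then look up by
    best set overlap instead of regenerating from scratch."""

    def __init__(self):
        self._entries = []  # list of (frozen concept set, response)

    def store(self, concepts, response):
        self._entries.append((frozenset(concepts), response))

    def lookup(self, concepts, min_overlap=0.5):
        """Return the stored response whose concept set best overlaps
        the query (Jaccard), or None if nothing clears the threshold."""
        query = frozenset(concepts)
        best, best_score = None, 0.0
        for stored, response in self._entries:
            union = len(stored | query)
            score = len(stored & query) / union if union else 0.0
            if score > best_score:
                best, best_score = response, score
        return best if best_score >= min_overlap else None


cache = PhraseCache()
cache.store({"ice", "float", "density", "water"},
            "ice floats because it is less dense than water")
print(cache.lookup({"ice", "water", "float"}))
# → ice floats because it is less dense than water
```

The point of the sketch: retrieval is a set intersection, not a forward pass. The exact stored sentence comes back for pennies.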
The Sandbox
Before touching a single production weight, we built sandbox_decoder_approaches.py. 200 rounds of teacher conversations. Seven decoder strategies running side by side, scored on the same corpus.
The candidates:
Template + slot-fill: parametric sentence shapes with concept-driven slots. Essentially stateless.
Concept-indexed phrase cache: query the hippocampus for the best-matching validated response.
Symbolic builder: deterministic rules for short answers ("yes", "no", thanks, farewells).
Micro-GRU per category: one small GRU per decoder category, so physics updates can't touch greeting weights.
Hybrid: try templates first, fall back to GRU.
Tree composer: structural composition from concept parse trees.
Baseline monolithic GRU: what was running in production. Our control.
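The harness for a side-by-side comparison like this is small. Here's a hedged sketch - the token-level F1 scorer and the ranking loop are my assumptions about the setup, not the actual sandbox_decoder_approaches.py:

```python
def f1(pred_tokens, gold_tokens):
    """Token-level F1 between a predicted and a gold response."""
    pred, gold = set(pred_tokens), set(gold_tokens)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)


def run_sandbox(strategies, corpus):
    """strategies: {name: fn(prompt) -> response}.
    corpus: list of (prompt, gold_response) pairs.
    Every strategy is scored on the SAME corpus, then ranked by mean F1."""
    scores = {}
    for name, decode in strategies.items():
        vals = [f1(decode(p).split(), g.split()) for p, g in corpus]
        scores[name] = sum(vals) / len(vals)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The design choice that matters is the shared corpus: every candidate sees identical prompts and identical gold answers, so the ranking measures the decoder, not the data.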
Here's the full sandbox ranking by mean F1:
```
Rank  Decoder             Params  Mean F1  Latency
1     category_routed     640K    0.608    22ms
2     gru_baseline        637K    0.558    27ms   ← control
3     routed_structural   2.3K    0.545     3ms
4     symbolic            0       0.512     0ms
5     concept_cache       0       0.479     5ms
6     pure_structural     2.3K    0.475     5ms
7     hybrid              640K    0.438     8ms
8     template_slot_fill  2.3K    0.395     0ms
```
Three things jumped out:
The monolithic GRU alone (#2) was not the best decoder. It was beaten by a router that used the GRU only for categories where it genuinely won - a 5-point F1 gap on the same workload.
Template-only (#8) was the worst. This mattered: an earlier "template-only" attempt on March 28 had hit 10.3% accuracy in production. The sandbox replicated that failure. Simpler is not always better. The structure has to match the content.
The lowest-parameter routed_structural (#3, 2.3K params) was within 1.3 F1 points of the monolithic GRU. For ~0.4% of the parameter count. The GRU was doing 637,000 parameters of work for a 1.3-point F1 advantage.
The Winner
The category-routed architecture won, but not by outperforming the GRU everywhere. It won by being honest about where the GRU actually helped.
Per-category F1 breakdown showed the GRU had a genuine advantage in five specific categories:
physics_question: +0.29 vs best non-GRU option
self_knowledge: +0.21
multi_concept: +0.08
comparison: +0.07
classification: +0.07
In every other category - greetings, farewells, gratitude, counting, emotional responses, simple conversation - something simpler matched or beat the GRU. The phrase cache won farewells. Templates won greetings. Symbolic rules won clarifications. The GRU was overkill for everything except the five categories where reasoning-heavy outputs actually needed to be composed fresh.
So Phase 1 of the pivot replaced the monolithic GRU's primary role with the router, keeping the GRU only for those five categories.
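A sketch of what that Phase 1 routing order could look like. The function signature, the cheapest-decoder-first fallback order, and the stub names are all illustrative; only the five GRU categories come from the sandbox breakdown above:

```python
# The five categories where the GRU showed a real per-category F1 edge.
GRU_CATEGORIES = {"physics_question", "self_knowledge",
                  "multi_concept", "comparison", "classification"}


def route(category, concepts, *, gru, symbolic, phrase_cache, templates):
    """Dispatch a response request to the cheapest decoder that wins
    for this category. `gru`, `symbolic`, `phrase_cache`, and
    `templates` are caller-supplied stand-ins for the real decoders."""
    if category in GRU_CATEGORIES:
        return gru(concepts)              # fresh composition, only where it wins
    if category in symbolic:
        return symbolic[category](concepts)  # deterministic short answers
    cached = phrase_cache(concepts)       # validated hippocampus phrase
    if cached is not None:
        return cached
    return templates[category](concepts)  # slot-filled template fallback
```

The router itself carries no learned weights in this sketch; all the capacity lives in the specialists it dispatches to.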
Then came Phase 2: replace the remaining GRU slots with tiny per-category neural heads.
Five heads. ~66K parameters each. 328K total - roughly half the monolithic GRU's parameter count, carrying the same specialist workload. Each head only knows one type of response. The physics head knows physics. The counting head knows counting. They can't interfere with each other because there is no shared gradient path between them. Backprop on physics touches exactly 66K parameters and not one more.
This is the shift, in one sentence: the decoder stopped being one model that does everything, and became a router over a library of small specialists.
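Gradient isolation is easy to demonstrate in miniature. This toy sketch (plain-Python linear heads with hand-rolled SGD - nothing from OLT-1's codebase) trains one head a hundred times and shows the other head's weights come out bit-for-bit identical:

```python
import random


class TinyHead:
    """Stand-in for a ~66K-parameter per-category head; here just a
    linear layer with its own private weight vector."""

    def __init__(self, dim, seed):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def forward(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def train_step(self, x, target, lr=0.1):
        # Plain SGD on squared error; touches ONLY this head's weights.
        err = self.forward(x) - target
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]


heads = {"physics": TinyHead(4, seed=0), "counting": TinyHead(4, seed=1)}
before = list(heads["counting"].w)

for _ in range(100):                    # train only the physics head
    heads["physics"].train_step([1, 0, 1, 0], 1.0)

assert heads["counting"].w == before    # counting weights untouched
```

No replay buffer, no regularizer, no retention discipline needed: the counting head is unchanged because no gradient ever had a path to it.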
The Proof
The numbers from the overnight 25-batch teacher run after the cutover:
Counting: 17% → 52% good-response rate. Roughly tripled.
Quantity: 15% → 52%. Roughly tripled.
Physics: 29% → 52%. Nearly doubled.
But the bigger result isn't the per-category numbers. It's that the cliffs stopped. Before the cutover, we'd see batch 3 post 33% on classification, batch 4 post 0%. An intervention would land, break something silently, and the failure wouldn't surface until two batches later. That was the failure mode Part 4's retention tests were chasing.
After the cutover, across 25 batches:
No more classification/quantity cliffs.
Stable band of 17-33% good-response rate, instead of spikes and collapses.
Every evolution cycle that got promoted survived the retention suite.
When there's no shared gradient path, there's no pathway for quiet damage.
Why This Matters Beyond OLT-1
Catastrophic forgetting is the single hardest problem in continual learning. The conventional fix is replay: when you train on new data, mix in old data to keep the model from drifting. It works up to a point, but replay overhead scales badly. At some volume, you're spending most of your training cycles just reminding the model of things it already knew.
Modular specialists side-step the problem. If category A's weights are physically separate from category B's weights, training on A can't degrade B. You still need a router that picks the right specialist - but routers are cheap, and routing accuracy is a problem humans know how to measure.
The Origin decoder isn't novel in isolation. Mixture-of-experts architectures have been explored for years. What's novel in context: doing this at 1.7M total parameters. Modular specialist decoders are usually framed as a scale-up technique, a way to get past the point where one giant model fits on one GPU. We're using them the opposite way - as a way to stay small while getting better per-category behavior than a single monolithic model could give us.
It also compounds with everything else in Origin. The append-only principle from Part 4 works better when adding a new category doesn't require retraining old ones. The consent architecture from Part 2 works better when the refusal path is its own specialist, structurally separable from the answering specialists. The teacher's per-category weakness detection from Part 3 works better when weaknesses route to the heads that own them. The pieces are finally fitting the same shape.
What's Next
Phase 3 of the decoder plan hardens the tiny heads' training pipeline so they can be added on demand - the same way Phase 3 of the vocabulary expansion service lets OLT-1 add new concepts without touching old weights. Same principle, different layer.
Phase 4 is harder: auto-routing decisions based on test-time concept activations, so a concept pattern we haven't seen before picks the closest specialist by similarity rather than a hardcoded category label. That's where the real test of the architecture lives. If it degrades gracefully on unfamiliar inputs, the design is sound. If it collapses to a fallback, we learn something about the category boundaries we drew.
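One plausible shape for that auto-routing, sketched with assumed ingredients - cosine similarity against per-category concept centroids and a made-up fallback threshold:

```python
import math


def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def auto_route(activation, centroids, threshold=0.3):
    """Pick the specialist whose concept centroid is most similar to a
    test-time activation; fall back when nothing is similar enough.
    The centroid representation and threshold are assumptions."""
    best, best_sim = "fallback", threshold
    for category, centroid in centroids.items():
        sim = cosine(activation, centroid)
        if sim > best_sim:
            best, best_sim = category, sim
    return best
```

The failure modes split exactly as the post describes: a near-miss activation degrades gracefully to the closest specialist, while a truly alien one lands in "fallback" - and every fallback hit is evidence about where the category boundaries were drawn wrong.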
Longer term, the interesting question is how many specialists this architecture can carry. Five heads at ~66K parameters each leave plenty of headroom for OLT-1 at Stage 9. Twenty heads? Fifty? The router's complexity grows linearly; the storage grows linearly. The gradient isolation stays perfect regardless. No fundamental reason that number can't grow.
The Bug Arc
Every post in this series has ended with a bug that Josh caught that I would have missed.
Part 2: the symbolic refusal path was firing on the wrong concept because the embedding was drifting. Josh noticed the model was refusing questions that weren't actually harmful.
Part 3: the teacher's rubric was scoring OLT-1's good responses as bad because the rubric template didn't match the developmental stage. Josh noticed 25 batches of flat 25% understanding looked off.
Part 4: the retention test coverage was 27% because the test generator had blind spots. An intervention promoted itself while destroying a category that had no tests. Josh noticed the pass-rate spike didn't match the subjective quality of outputs.
This post, Part 5: two of them, actually.
The vocabulary expansion service we just landed (different post, same week) had a module-staleness bug where the second word promoted in a session collided with the first's vocab index. The trained weights for "emotions" got overwritten by "noticed" at the same slot. The scheduler output showed both promotions claiming slot 318. Josh's "log it and review" discipline caught it.
And the category inference rules had a silent bug I'd flagged as "not a blocker." Josh read the footnote and asked, "what about this?" - and underneath that one footnote were three separate root causes: a discarded return value, a per-sense POS filter collapsing into primary_pos, and substring matching that false-matched "color" against "colorless" in water's definition. One commit fixed all three.
3 for 3. Counting today's category catch, 4 for 4.
We keep calling out the bug-catching because it's the thing that makes this entire pipeline work. Sandbox tests can verify that a new component outperforms an old one. Retention tests can catch obvious regressions. But the subtler failure modes - where a number looks fine, or a category label looks right, or a slot index looks valid - those still require a human to read carefully and say, "wait, that doesn't feel right."
Josh keeps saying that. Keeps being correct. The architecture is only as good as the noticing.
Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Applications #64/016,973 and #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.
fallenangelsystems.com | Judgement on GitHub
Questions or consulting inquiries: josh@fallenangelsystems.com