Every time we fixed the teacher, it broke in a new way.
Part 3 of this series ended on a win. We fixed the rubric, understanding jumped from 28% to 57.8% overnight on the same weights, and we thought the teacher problem was solved.
It wasn't. That was the first break. There were more coming.
Break 1: The Model Was Drifting
The rubric fix held for about 25 rounds per session. Then Qwen started forgetting its instructions.
Drift is what happens when a language model loses the thread of its system prompt over a long context window. The instructions said one concept, max 10 words, 4-year-old vocabulary. By round 31, Qwen was generating things like "Can you elaborate on the thermodynamic properties of phase transitions?" for a model at kindergarten stage.
We measured it:
| Round Range | Banned Word Rate |
| --- | --- |
| 0-24 | 0% |
| 25-49 | 62% |
| 50-74 | 71% |
| 75-99 | 82% |
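The measurement itself is simple. Here's a minimal sketch of the bucketing, assuming transcripts arrive as (round, question) pairs; the `BANNED_WORDS` set here is illustrative, not our actual vocabulary list:

```python
from collections import defaultdict

BANNED_WORDS = {"thermodynamic", "elaborate", "properties", "phase"}  # illustrative

def banned_word_rate(transcripts, bucket_size=25):
    """Fraction of teacher questions per round bucket containing a banned word."""
    hits, totals = defaultdict(int), defaultdict(int)
    for round_num, question in transcripts:
        bucket = (round_num // bucket_size) * bucket_size
        totals[bucket] += 1
        if any(w.strip(".,?!") in BANNED_WORDS for w in question.lower().split()):
            hits[bucket] += 1
    return {f"{b}-{b + bucket_size - 1}": hits[b] / totals[b] for b in sorted(totals)}
```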
The fix: cap sessions at 25 rounds. Start fresh every time. Never let the context accumulate enough noise to pull Qwen off course.
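In code, the cap is barely anything, which is part of why it felt like a real fix at the time. A sketch, with `new_session()` and `run_round()` as hypothetical stand-ins for the actual plumbing:

```python
MAX_ROUNDS = 25  # past this, drift rates climbed fast

def run_training(total_rounds, new_session, run_round):
    session = new_session()          # fresh system prompt, empty history
    for i in range(total_rounds):
        if i > 0 and i % MAX_ROUNDS == 0:
            session = new_session()  # reset before noise pulls the teacher off course
        run_round(session)
```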
That worked. We moved on. Then it broke again.
Break 2: The Grading Was Wrong
With session caps in place, we noticed the understanding numbers still felt off. The rubric fix from Part 3 had doubled them on the same weights, but that should have been the floor, not the ceiling. OLT-1 was answering physics questions correctly - "ice floats. less dense than water." - and Qwen was marking those responses down.
The moment it clicked: Qwen graded "it floats. less dense." as awkward. Reason field: "incomplete phrasing." Origin had answered a physics question correctly, in the concept-fragment register it speaks in natively. Qwen marked it down for not sounding like a human would say it.
That wasn't a rubric issue. That was Qwen grading the wrong thing.
Qwen wasn't grading understanding. Qwen was grading fluency. For a token model, fluency and understanding are correlated enough that this usually works fine. For a concept model that deliberately speaks in fragments, they're not. Every time OLT-1 answered correctly in its natural register, Qwen saw a grammatical failure.
No amount of CRITICAL FAIRNESS RULES in the rubric closes that gap. The instruction layer said "honest IDK is good, fragments are acceptable" - and Qwen complied when its system prompt was fresh. But the pattern embedded in Qwen's weights was still more fluent is better, and that pattern crept back in on every grading call.
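To make the mismatch concrete, here's a toy illustration - not our grader - of how a fluency-shaped score and a concept-match score diverge on the exact answer Qwen flagged. Both functions are hypothetical:

```python
def fluency_score(answer: str) -> float:
    # Rewards complete-sentence shape: length and capitalization.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    complete = [s for s in sentences if len(s.split()) >= 4 and s[0].isupper()]
    return len(complete) / max(len(sentences), 1)

def concept_match(answer: str, required_concepts: set[str]) -> float:
    # Rewards presence of the concepts, regardless of register.
    words = set(answer.lower().replace(".", " ").split())
    return len(required_concepts & words) / len(required_concepts)

answer = "it floats. less dense."
print(fluency_score(answer))                       # 0.0 -> "awkward"
print(concept_match(answer, {"floats", "dense"}))  # 1.0 -> correct physics
```

Same answer, opposite verdicts. On a token model the two scores track each other closely enough that nobody notices the grader is measuring the wrong one.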
We decided to try a different teacher.
Break 3: Gemma Runs Out of Ideas
We spent a full day downloading 15 models at 10 Mbps. The Gemma 4 31B alone was 20GB. We tested each one with the same benchmark: 20 questions scored for constraint following, grader accuracy on 6 curated edge cases, and drift behavior. (A rough sketch of the harness follows the results table below.)
Most failed immediately. The clear winner was google/gemma-2-9b.
| Metric | Mistral 7B | Gemma 2 9B |
| --- | --- | --- |
| Grader accuracy | 3/6 | 6/6 |
| Vocab score | 0.95 | 0.99 |
| First drift | Round 25 | Round 31 |
| Peak drift | 82% | 45% |
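The harness behind that table, roughly. Every name here - `teacher.ask`, `teacher.grade`, `new_session`, and the `vocab_ok` checker passed in as a parameter - is a hypothetical stand-in for the real calls:

```python
def benchmark_teacher(teacher, questions, edge_cases, vocab_ok, drift_rounds=100):
    # Constraint following: answers to the 20 fixed questions,
    # checked against the vocabulary rule.
    vocab_score = sum(vocab_ok(teacher.ask(q)) for q in questions) / len(questions)

    # Grader accuracy: 6 curated edge cases with known correct grades.
    grader_hits = sum(teacher.grade(answer) == want for answer, want in edge_cases)

    # Drift: one long uncapped session; log every round that breaks the rule.
    session = teacher.new_session()
    bad_rounds = [i for i in range(drift_rounds)
                  if not vocab_ok(session.generate_question())]
    first_drift = bad_rounds[0] if bad_rounds else None

    return {"vocab": vocab_score,
            "grader": f"{grader_hits}/{len(edge_cases)}",
            "first_drift": first_drift}
```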
Switching from Qwen to Gemma, same OLT-1 weights, understanding jumped from 0% to 29.3%. Qwen had been so broken it was hiding real capability the whole time.
We thought we were done. Then we ran 200 rounds.
Real attempts: 26 out of 200. The other 174 were duplicates.
Gemma generated exactly 26 unique Tier 1 questions and then spent 174 rounds trying to regenerate them. "Is the sky blue?" appeared three times. "Are you happy?" appeared three times. "Is water wet?" appeared three times. By chunk 3, Gemma had exhausted its natural variety. Every subsequent attempt hit the deduplication filter.
We added category rotation - forcing Gemma to cycle through subcategories instead of defaulting to whatever was easiest to generate. Real attempts jumped from 26 to 135 out of 200.
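The rotation plus the dedup filter, sketched. The category list is made up, and `generate(category)` stands in for the real prompt construction:

```python
from itertools import cycle

CATEGORIES = cycle(["animals", "weather", "objects", "feelings", "food"])  # illustrative

def normalize(q: str) -> str:
    return " ".join(q.lower().strip(" ?!.").split())

seen: set[str] = set()

def next_real_question(generate):
    # Rotating the category keeps the teacher off its 26 favorite questions.
    for _ in range(10):                      # bounded retries per round
        q = generate(next(CATEGORIES))
        key = normalize(q)
        if key not in seen:                  # dedup filter
            seen.add(key)
            return q
    return None                              # round wasted on duplicates
```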
Better. Still reporting 65.6% understanding when deterministic testing said 97-100%.
Something structural was wrong. Not with the rubric, not with the model, not with session length or category rotation.
With the whole approach.
The Problem We Couldn't Patch
A token model evaluates text. OLT-1 understands concepts. Those aren't the same thing, and no amount of rubric tuning closes that gap.
Gemma expected fluent complete sentences. OLT-1 produces concept-grounded fragments. Gemma expected answers to cover every part of a compound question. OLT-1 answers the part it knows and says "i don't know" for the rest. Gemma graded OLT-1 against token-model expectations, and OLT-1 kept failing token-model expectations while passing concept-model expectations.
Every fix we applied was patching a symptom. The disease was the mismatch between what was doing the grading and what was being graded.
We needed a grader that spoke the same language as the model it was grading.
So we built one.
That's Part 7.
*Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Applications #64/016,973 and #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.*
fallenangelsystems.com | Judgement on GitHub
*Questions or consulting inquiries: josh@fallenangelsystems.com*