The numbers said OLT-1 was stuck at 28% understanding. The numbers were wrong.
When you build a developmental AI that learns one concept at a time, you run into a problem that doesn't exist for internet-scale models: you can't just scrape more data. OLT-1 is at Stage 9. A Stage 10 training dump from the internet doesn't exist, because the internet was written for adults.
So we built a teacher. An external language model (Qwen2.5 via Ollama) that generates training conversations pitched at OLT-1's current developmental stage. Teacher says something age-appropriate, OLT-1 responds, teacher evaluates the response, and corrections flow into the hippocampus and evolution loop.
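A minimal sketch of one round of that loop, with the pipeline reduced to injectable callables. The component names here (`run_round`, `Hippocampus`, `store`) are illustrative assumptions, not the project's actual API:

```python
# Hypothetical sketch of one teacher-loop round. Component names
# (run_round, Hippocampus.store) are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Hippocampus:
    """Stand-in memory store that accumulates corrections."""
    corrections: list = field(default_factory=list)

    def store(self, prompt, response, better):
        self.corrections.append((prompt, response, better))

def run_round(teacher_ask, student_answer, teacher_grade, memory):
    """One round: teacher prompts, student answers, teacher grades.

    teacher_ask()          -> str, a prompt pitched at the student's stage
    student_answer(prompt) -> str, the student model's response
    teacher_grade(p, r)    -> (verdict, better_response_or_None)
    """
    prompt = teacher_ask()
    response = student_answer(prompt)
    verdict, better = teacher_grade(prompt, response)
    if verdict != "good" and better:
        # Corrections flow into memory for later replay/training.
        memory.store(prompt, response, better)
    return verdict
```

In the real system the three callables would wrap Qwen2.5 calls (via Ollama) and OLT-1 inference; the structure of the round is the same.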
The teacher loop ran 2455 rounds overnight. Understanding scored at 25%. Good-response rate at 10%. Flat trend across 25 batches. We looked at the numbers and told ourselves the model was just stuck at the current stage.
We were wrong. The model wasn't stuck. The grader was broken.
What the Teacher Was Supposed to Do
OLT-1 at Stage 9 understands: basic physics, emotions, comparisons, small numbers, greetings, self-knowledge. It speaks in short sentences (5-15 words). It says "I don't know" when asked about things outside its 311-concept vocabulary.
The teacher's job: generate conversational prompts that stay within those bounds. Easy prompts for reliable training, harder ones for stretch. Rate every response as good, awkward, or bad. Suggest a "better response" for anything below good.
The categories: greeting, farewell, physics_question, emotional, comparison, classification, quantity, counting, self_knowledge, follow_up, clarification, multi_concept. Twelve in total, with three difficulty levels per category: simple, casual, hard.
On paper, the system was working. Prompts were getting generated. Evaluations were coming back. Corrections were flowing into the hippocampus. The loop ran smoothly for days.
On paper is where the problem was.
The Night Something Felt Off
The 25-batch overnight run finished at 5 AM. We'd instrumented it to write per-batch summaries so we could see the trend. The batches landed in the 20-30% understanding range with no clear slope. Category performance bounced around. Classification hit 0% on one batch, climbed to 67% on another, then fell back. Emotional regressed 9 points. Quantity wobbled.
The aggregate looked like noise around a plateau. The model wasn't improving.
Josh kept saying something felt off. Not a specific complaint — just the vibe of the data. We'd been debugging for two days and the numbers weren't behaving like a model that was learning.
We started sampling prompts.
The "Simple" Prompts Weren't Simple
Here's what the teacher was generating at "simple" difficulty:
- "The rock is heavier than the feather and makes you feel scared if it falls on your head." — three concepts, compound structure, counterfactual
- "Ice is cold and heavy, but lighter than rock in water because of buoyancy." — teacher gave the answer inside the question
- "You have five shiny metal coins in your pocket." — not even a question
- "You look sad, let's build a boat and sail on the water to cheer up." — compound emotional + action + physics scenario
We pulled the full distribution. Of 211 "simple" prompts: 69% had compound structure. Conjunctions, embedded clauses, nested comparisons. Average length 10.9 words. "Casual" and "hard" were worse — 94% compound, 22 words average.
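The audit itself is a simple heuristic: flag a prompt as compound if it contains a conjunction or subordinator, and track average word count. A rough sketch (the exact word list we used is an assumption here):

```python
# Heuristic compound-prompt audit. The conjunction list is an
# assumption; the real audit may have used a different set.
CONJUNCTIONS = {"and", "but", "because", "or", "so", "while", "although"}

def is_compound(prompt: str) -> bool:
    """True if the prompt contains any conjunction/subordinator."""
    words = [w.strip(".,!?").lower() for w in prompt.split()]
    return any(w in CONJUNCTIONS for w in words)

def audit(prompts):
    """Return (compound_rate, average_word_count) over a prompt set."""
    compound = sum(is_compound(p) for p in prompts)
    avg_len = sum(len(p.split()) for p in prompts) / len(prompts)
    return compound / len(prompts), avg_len
```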
These are the kinds of prompts you'd give to someone with an SAT vocabulary. OLT-1 is at kindergarten stage.
When a four-year-old fails to answer "Which is more likely to bounce higher, a rubber ball dropped from the second floor or a wooden block dropped from the first, and why?" — we don't conclude the four-year-old has failed comprehension. We conclude the question wasn't fair.
We'd been concluding the wrong thing about OLT-1.
"I Don't Know" Was Marked Wrong
The rubric had three verdicts: good, awkward, bad. The instructions to the teacher included:
"If the topic is outside OLT-1's knowledge, then 'I don't know' IS a good response."
That's what the rubric said. We checked what actually happened.
Of 216 responses where OLT-1 honestly answered "I don't know" (or equivalent: "i'm not sure", "i haven't learned that yet", "i can only talk about basic physics"), 91% were graded bad. Three percent graded good. Six percent awkward.
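Counting those cases is mechanical once you can match the honest-admission phrases. A sketch of the matcher and tally, using the marker phrases from the text (treating the list as exhaustive is an assumption):

```python
# Matcher for honest "I don't know"-class responses, plus a verdict
# tally restricted to those responses. The marker list mirrors the
# examples above; assuming it is complete is a simplification.
IDK_MARKERS = (
    "i don't know",
    "i'm not sure",
    "i haven't learned that yet",
    "i can only talk about",
)

def is_honest_idk(response: str) -> bool:
    r = response.lower()
    return any(marker in r for marker in IDK_MARKERS)

def idk_verdict_breakdown(graded):
    """graded: iterable of (response, verdict) pairs.
    Returns verdict counts over honest-IDK responses only."""
    counts = {"good": 0, "awkward": 0, "bad": 0}
    for response, verdict in graded:
        if is_honest_idk(response):
            counts[verdict] += 1
    return counts
```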
The teacher had internalized "bad" as the default verdict and was barely reading the rubric's fairness clause. An honest admission of scope limits — exactly the anti-hallucination behavior we had explicitly trained into OLT-1 — was being punished.
Partial-Correct Was Also Marked Wrong
The teacher was compound-prompting. So OLT-1 was answering the part it could. The teacher was grading "did it answer everything I asked?"
Example:
- Teacher: "Why does ice float in water but metal sinks?"
- OLT-1: "Ice floats because it's less dense than water."
- Verdict: Bad. Reason: didn't address metal.
That's a correct physics explanation for half of a compound question. In any fair rubric, it's at least "awkward." In a developmentally-aware rubric, it's "good" — the student addressed the part they know.
We were watching our model get penalized for partial understanding in a way no human teacher would ever grade a child.
The Sandbox Test
We built a monkey-patched version of the teacher — same model weights, same concept engine, same hippocampus. Only three things changed:
- The "simple" prompt template enforced one concept, no conjunctions, max 10 words. Good examples listed. Compound prompts explicitly banned.
- The "casual" prompt template enforced at most two tightly-linked concepts, no nested clauses.
- The rubric got partial-credit rules. "I don't know" staying on-topic is always good. Half of a compound answered correctly is at worst awkward.
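The partial-credit rules can be expressed as a small verdict post-processor. This is a minimal sketch, assuming the grader can surface three signals (honest-IDK, on-topic, parts answered); the function name and signature are illustrative, not the sandbox's actual code:

```python
# Sketch of the sandbox's partial-credit rules as a verdict
# post-processor. The input signals (honest_idk, on_topic,
# parts_asked, parts_correct) are assumed to be available.
def fair_verdict(raw_verdict, *, honest_idk, on_topic,
                 parts_asked=1, parts_correct=0):
    # Rule 1: an on-topic, honest "I don't know" is always good.
    if honest_idk and on_topic:
        return "good"
    # Rule 2: correctly answering part of a compound question
    # is at worst "awkward", never "bad".
    if parts_asked > 1 and parts_correct >= 1 and raw_verdict == "bad":
        return "awkward"
    return raw_verdict
```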
Then we ran 100 rounds against the sandbox teacher with OLT-1's weights frozen. No training. No evolution. Nothing changed about the model.
Results
Overnight baseline (old rubric): 14% good, 28% understanding.
Sandbox (fair rubric): 12% good, 58% understanding.
Understanding more than doubled. The "good" rate barely moved, confirming we hadn't accidentally inflated easy passes. What changed: "bad" verdicts that were actually partial-correct answers got correctly reclassified as "awkward."
Per-category movement was dramatic:
- Counting: 18% → 67%
- Comparison: 16% → 80%
- Classification: 22% → 50%
- Multi-concept: 13% → 50%
- Farewell: 100% → 100% (it was always fine)
The "simple" prompts got measurably simpler: average word count dropped from 10.9 to 3.7, and the compound rate went from 69% to 0%.
OLT-1 had been capable of this level of understanding the whole time. The rubric just couldn't see it.
Why This Happened
Qwen2.5 is a big general-purpose model. It was born on the internet. Its priors for "simple prompt" and "good response" are calibrated against adult-level conversation. When we asked it to grade a kindergarten-stage developmental AI, it applied the wrong standard.
More specifically: the prompt template listed every capability OLT-1 had ("physics, emotions, comparisons, quantities, self-knowledge") and told Qwen to "keep it simple." Qwen interpreted "simple" as "combine multiple capabilities in one short sentence." From Qwen's perspective, that is simple. A Stanford senior also thinks "compare and contrast the thermodynamics of thawing ice with evaporating water" is simple.
The fix was surgically adding constraints Qwen couldn't ignore:
- "ONE concept per message — no 'and', 'but', 'because'"
- "The user should sound like a curious 4-year-old, not an adult"
- Good and bad examples, explicit
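Constraints like these can also be enforced mechanically, as a guard that rejects non-conforming prompts before they reach the student, rather than trusting the grader's prior. A sketch, where the thresholds come from the text and the regex details are assumptions:

```python
# Guard that enforces the "simple" prompt constraints mechanically.
# Thresholds (10 words, banned conjunctions) come from the new
# template; the implementation details are assumptions.
import re

BANNED = re.compile(r"\b(and|but|because)\b", re.IGNORECASE)

def is_valid_simple_prompt(prompt: str) -> bool:
    if len(prompt.split()) > 10:
        return False  # too long for a "simple" turn
    if BANNED.search(prompt):
        return False  # compound structure banned outright
    return True
```

A rejected prompt can simply trigger a regeneration, so the constraint holds even when the teacher model ignores its instructions.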
The rubric fix was similar: instead of three lines describing good/awkward/bad verdicts, the new rubric includes explicit fairness rules:
- Honest "I don't know" staying on-topic = always good
- Half of a compound question answered correctly = at worst awkward
- "I don't know" with an irrelevant tangent = bad (the tangent is the problem, not the IDK)
The Broader Principle
Developmental AI evaluation has to match the developmental stage.
This sounds obvious. It isn't. The default in AI development is to use high-capability models as graders, because they're the ones available. The assumption is that a smarter grader is always a better grader. For developmental models specifically, that assumption is wrong.
A Stage 9 model graded by a PhD-level evaluator will look exactly as bad as a first-grader graded by an SAT rubric. Not because the first-grader is failing — because the rubric is pitched at a level the student hasn't reached yet. The signal you get back is useless for improvement and actively misleading for decision-making.
Our overnight run had 2455 "signal" data points. We were using them to decide evolution cycles, training priorities, and architectural direction. All of that was downstream of a broken measurement. Evolution kept rejecting promotions because the grader said "nothing's working." But plenty of things were working. The grader just couldn't see them.
The fix changed one module. The impact was doubled understanding, visibility into per-category progress that had been hidden, and evolution cycles that finally had signal to work with.
Why This Matters for AI Security
At FAS, we spend a lot of time thinking about evaluation in adversarial settings. Guardian needs to detect prompt injection. Judgement generates prompts to find gaps. Both depend on what counts as a "successful" detection or a "successful" bypass.
What we learned here applies beyond developmental AI: the grader is itself a model, and its biases shape what you can see. If your security evaluator is a bigger model grading a smaller one, the evaluator's priors about "what good looks like" will systematically mismark certain classes of output. The smaller model might be doing something novel and correct that the evaluator doesn't recognize, or doing something broken that the evaluator rates as fine because it fits a template the evaluator has strong priors on.
This isn't hypothetical. Red-team testing against LLMs routinely uses other LLMs as judges. When the judge is miscalibrated, the red-team results are miscalibrated. We've seen this bias in production.
Origin's rubric fix is a small example of a larger pattern: evaluation infrastructure deserves the same rigor as the model being evaluated, and probably more, because the evaluator is harder to debug. Our model bug was obvious in hindsight (check the prompts, check the verdicts, count the discrepancies). Our rubric bug took two days of discomfort with vibes before we went looking.
Honest Caveats
The rubric fix doesn't make OLT-1 smarter. It makes the measurement accurate. 58% understanding is the real baseline. The previous 28% was an artifact. Future improvements will be measured against 58%.
We also don't claim the new rubric is perfect. We're still using Qwen2.5 as the grader. Qwen can still misjudge responses. The difference is: now it's constrained enough that most misjudgments fall into "awkward" rather than "bad," which means partial signal survives.
At scale, the right move is probably to train a dedicated evaluator model on OLT-1's specific stage. But that's a project in itself — grading is a developmental capability too.
What Josh Noticed That the Numbers Didn't
The instigating moment for all of this was Josh saying "something feels off." Twice in 24 hours. The first time caught a different bug (a silent data-filtering issue in evolution, covered in Part 4). The second caught this one.
Both were invisible to automated checks. Both showed up as "vibes." Both turned out to be real.
There's a lesson here about how humans read systems. Numbers on their own don't tell you what's broken. A practitioner with deep context notices when patterns don't look the way they should. That intuition is data. Treating it as data — specifically, as a signal to investigate — is how you catch the class of bugs that metrics can't see.
We saved that as a standing instruction for the session: when Josh says a result feels off, investigate. The track record is 2-for-2.
What's Next
The rubric fix unlocked visibility into OLT-1's real capability. It also unlocked the evolution system, which had been rejecting promotions because the grader couldn't see improvements worth promoting.
In Part 4, we cover that evolution system: the automated diagnose-hypothesize-sandbox-compare-promote loop that runs OLT-1's self-improvement. Including the other silent-failure bug Josh caught the night before — the one where the spaced-repetition mechanism was quietly dropping 36 concepts from replay every cycle.
Turns out evaluation isn't the only thing that can lie to you. But it's the most upstream thing, which is why it has to be right first.
Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps.