Origin Part 10: The Plan Didn't Work

#security #ai #machinelearning

We executed the plan exactly as written. The encoder still couldn't tell concepts apart.

Part 9 ended with 94,000 natural-context pairs wired into the trainer and every phase gate run cleanly. We had three times the data, and the hypothesis was about to be tested.

Phase 4: The Retrain
The full joint retrain ran cleanly, and the loss curve descended monotonically. The encoder's healthy concept count went from 84 to 107, as measured by an internal probe of about 30 hand-crafted queries that exercise common concepts.

+23 healthy concepts, +27% relative. We were cautiously optimistic, with the emphasis on cautiously: the trainer's audit is a small set of probes, and "healthy" only counts the concepts those probes happen to test. The real validation was Phase 5.

Phase 5: The Probe Battery
The plan's success metric was specific. Random-sample 50 V2C concepts. Ask Gemma to generate three short held-out sentences mentioning each one (verified not to appear in the training corpus). Run them through the encoder. Measure four things:

top1 accuracy: does the encoder rank the target concept first?
top3 accuracy: is the target in the top three?
target activation: how strongly does the target itself fire?
cross_fire: how many other concepts fire above threshold?
The predefined success gates were top1 at or above 70%, target_act at or above 0.7, and cross_fire under 2.0.
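
The four metrics and the gates are simple enough to pin down in code. Below is a minimal Python sketch, assuming the encoder emits one activation score per concept for each probe sentence. `probe_metrics`, `passes_gates`, and the 0.5 firing threshold are hypothetical: the post doesn't say how the battery was implemented or what value "above threshold" used.

```python
from statistics import mean

# Hypothetical firing threshold: the post does not say what value Phase 5 used.
FIRE_THRESHOLD = 0.5


def probe_metrics(activations, targets, threshold=FIRE_THRESHOLD):
    """Compute the four Phase 5 metrics.

    activations: one list of per-concept scores per probe sentence
    targets: index of the concept each probe was written to mention
    """
    top1_hits = top3_hits = 0
    target_acts, cross_fires = [], []
    for scores, target in zip(activations, targets):
        # Concept indices ordered from strongest to weakest activation.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top1_hits += int(ranked[0] == target)
        top3_hits += int(target in ranked[:3])
        target_acts.append(scores[target])
        # Wrong concepts that still fire above the threshold.
        cross_fires.append(
            sum(1 for i, s in enumerate(scores) if i != target and s > threshold)
        )
    n = len(targets)
    return {
        "top1": top1_hits / n,
        "top3": top3_hits / n,
        "target_act": mean(target_acts),
        "cross_fire": mean(cross_fires),
    }


def passes_gates(m):
    """The plan's predefined success gates."""
    return m["top1"] >= 0.70 and m["target_act"] >= 0.7 and m["cross_fire"] < 2.0
```

Fed the reported Phase 5 numbers (top1 0.013, target_act 0.086, cross_fire 11.92), `passes_gates` fails on all three conditions, not just one.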

The result on the freshly-retrained encoder:

top1: 1.3%
top3: 4.0%
target_act: 0.086
cross_fire: 11.92
We ran it twice.

About one probe in seventy-five ranked its target first. On average, the encoder fired on nearly twelve wrong concepts per probe. Mean target activation was 0.086, against a gate of 0.7. When we handed the encoder exactly the kind of sentence it was designed to recognize, it barely registered the right answer.

The plan had run exactly as written, and it had not moved the encoder.

What That Meant
This is the place in the post where it would be easy to say something exculpatory: "the data work wasn't wasted" or "we learned something." Both are true. But the cleaner reading is that we were wrong about the bottleneck. We had thought the encoder was data-starved. The earlier sandbox at 10-concept scale had shown that more data could lift top1 from 33% to 80%. We assumed that signal would transfer to 3,687 concepts.

It didn't.

We had built the plan with an explicit abort condition for exactly this case: if Phase 5 returns top1 below 50% on held-out probes, the architecture is the bottleneck, not the data. Design a contrastive approach next.

1.3% triggered it.
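
The abort rule reduces to one comparison, fixed before the run. A minimal sketch; `abort_triggered` is a hypothetical name, not code from the project:

```python
# Threshold from the plan: held-out top1 below 50% means the architecture,
# not the data, is the bottleneck.
ABORT_TOP1 = 0.50


def abort_triggered(heldout_top1: float) -> bool:
    """True when Phase 5 says to stop adding data and redesign the architecture."""
    return heldout_top1 < ABORT_TOP1


print(abort_triggered(0.013))  # True: the retrained encoder's Phase 5 top1
```

Writing the threshold down before the run is what made the result a decision instead of a debate.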

The data work wasn't wasted. We needed the data anyway, and the elaboration corpus is now properly structured for whatever the next model wants to do with it. But it wasn't the lever. Something else was.

What Comes Next
The abort condition pointed at architecture. The encoder's concept_head, the part that maps general features to per-concept activations, was a flat MLP trained with multi-label binary cross-entropy. Every concept slot had to learn its own discriminator independently against the other 3,686. At 327 concepts (the v1 vocabulary) this had worked. At 3,687 it had been quietly failing the whole time.
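
Why "independently" matters shows up in the gradients. A minimal sketch, not the encoder's code: for one probe with a single positive concept, summed binary cross-entropy gives each concept's logit the gradient sigmoid(z_j) - y_j, which depends on nothing but that slot's own logit. Softmax cross-entropy, the form an InfoNCE-style contrastive loss takes over similarity scores, gives softmax(z)_j - y_j, so every slot's gradient sees all the others. The function names and the three-concept logits are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_grads(logits, target):
    """d(loss)/d(z_j) for summed multi-label BCE with one positive label.

    Each entry is sigmoid(z_j) - y_j: slot j only ever sees its own logit.
    """
    return [sigmoid(z) - (1.0 if j == target else 0.0) for j, z in enumerate(logits)]


def softmax_ce_grads(logits, target):
    """d(loss)/d(z_j) for softmax cross-entropy: softmax(z)_j - y_j.

    The shared normalizer couples every slot to every other slot.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if j == target else 0.0) for j, e in enumerate(exps)]


# Target is concept 0. In the second case a wrong concept (1) fires hard.
calm = [2.0, 0.0, 0.0]
noisy = [2.0, 3.0, 0.0]

# True: under BCE the target's update ignores the rival entirely.
print(bce_grads(calm, 0)[0] == bce_grads(noisy, 0)[0])
# True: under softmax CE the rival pushes harder on the target.
print(softmax_ce_grads(noisy, 0)[0] < softmax_ce_grads(calm, 0)[0])
```

With BCE the target slot never learns that a rival is outranking it; nothing in its loss ties it to the rank order that top1 measures.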

The next move: build a sandbox, test multiple head architectures against the same data, let the numbers pick the winner. No production changes until something actually beats the baseline on Phase 5.
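
The promotion rule for the sandbox can be sketched the same way. This assumes "beats the baseline" means a higher held-out top1 while also clearing the plan's Phase 5 gates; `Phase5Result` and `pick_winner` are hypothetical names, and the post doesn't say exactly how candidates will be compared.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phase5Result:
    head: str  # hypothetical label for a candidate head architecture
    top1: float
    target_act: float
    cross_fire: float


def passes_gates(r: Phase5Result) -> bool:
    # The plan's Phase 5 success gates.
    return r.top1 >= 0.70 and r.target_act >= 0.7 and r.cross_fire < 2.0


def pick_winner(
    baseline: Phase5Result, candidates: List[Phase5Result]
) -> Optional[Phase5Result]:
    """Best candidate that beats the baseline's top1 and clears every gate.

    Returns None when nothing qualifies, meaning: no production change.
    """
    eligible = [c for c in candidates if c.top1 > baseline.top1 and passes_gates(c)]
    return max(eligible, key=lambda c: c.top1, default=None)
```

Because every candidate trains on the same data and is scored on the same held-out battery, the head is the only variable left in the comparison.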

A failed hypothesis test is often more useful than a confirmation. We'd just gotten one of the more useful failures.

Origin is developed at Fallen Angel Systems, an NVIDIA Inception member, with the Genesis framework (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

DEV Community
