
Josh T

Posted on • Originally published at fallenangelsystems.com

Origin Part 14: The Reframe

Part 12 ended with a hypothesis. Two days later, the hypothesis met data.

The closing line of Part 12 was a guess. Maybe the next bottleneck wasn't more concepts, but the relationships between them. A model can know "dog" and "animal" and "four legs" and still not understand what a dog is. Understanding might live in the connections, not the nodes.

We had a way to test that. Build a sandbox that predicts the next concept that will fire given the current one. Run it on books. If the model can predict that "rock falls" tends to be followed by "ground hits, sound happens," then it's learned something about how the world strings together. If it can't, it hasn't.

I built it that evening. Five books from Project Gutenberg. Twenty-five thousand sentence-to-sentence transitions. Four prediction strategies running side by side: random (the floor), frequency (always guess the most common concepts), cooccurrence (learn which concepts tend to follow which), and retrieval (find similar past sentences and look at what came after them).
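For readers who want something concrete, here is a minimal sketch of what four baselines like these could look like. It is not Origin's code. It assumes each sentence has already been reduced to a set of concept labels, and every name in it (`predict_frequency`, `fit_cooccurrence`, `hit_rate`, the Jaccard similarity used for retrieval, the toy data) is an illustrative choice rather than something taken from the actual sandbox.

```python
import random
from collections import Counter, defaultdict

# Each training item is a (current_concepts, next_concepts) pair of sets.
# All names and the similarity measure are illustrative assumptions.

def jaccard(a, b):
    """Overlap between two concept sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_random(vocab, k, rng):
    """The floor: k concepts drawn uniformly from the vocabulary."""
    return set(rng.sample(sorted(vocab), min(k, len(vocab))))

def predict_frequency(train, k):
    """Naive prior: always guess the k most common 'next' concepts."""
    counts = Counter(c for _, nxt in train for c in nxt)
    return {c for c, _ in counts.most_common(k)}

def fit_cooccurrence(train):
    """Count how often each next-concept follows each current concept."""
    table = defaultdict(Counter)
    for cur, nxt in train:
        for c in cur:
            table[c].update(nxt)
    return table

def predict_cooccurrence(table, current, k):
    """Sum the follow-counts of every current concept, keep the top k."""
    votes = Counter()
    for c in current:
        votes.update(table.get(c, {}))
    return {c for c, _ in votes.most_common(k)}

def predict_retrieval(train, current, k, n_neighbors=3):
    """Find the most similar past sentences and pool what followed them."""
    ranked = sorted(train, key=lambda pair: jaccard(pair[0], current), reverse=True)
    votes = Counter()
    for _, nxt in ranked[:n_neighbors]:
        votes.update(nxt)
    return {c for c, _ in votes.most_common(k)}

def hit_rate(predicted, actual):
    """Fraction of the actual next concepts that were predicted."""
    return len(predicted & actual) / len(actual) if actual else 0.0

# Toy transitions, invented for illustration only.
train = [
    ({"rock", "falls"}, {"ground", "sound"}),
    ({"glass", "falls"}, {"ground", "breaks"}),
    ({"rain", "falls"}, {"ground", "wet"}),
]
table = fit_cooccurrence(train)
guess = predict_cooccurrence(table, {"rock"}, k=2)
print(guess, hit_rate(guess, {"ground", "sound"}))
```

Even on three toy rows the frequency prior has an obvious edge: "ground" follows everything, so guessing it every time is hard to beat unless the transitions carry real structure.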

The results were not what I wanted.

Cooccurrence beat random fifty times over. Good. Then it lost to frequency. Bad.

The naive prior - "just predict the eight most common concepts every time" - outperformed the model that actually tried to learn transitions. That's the experimental equivalent of a flat line on the consequential question. The hypothesis I'd written into Part 12 had landed exactly the wrong way.

I sat with it for a few hours. The temptation when a result lands badly is to argue with it. The prediction shape was wrong. The K value was wrong. The loss was wrong. The more disciplined version is to ask what the data is actually saying.

What it was actually saying: book narrative is the wrong substrate for cause-and-effect learning. Books drift. Scene to scene, character to character, description to description. "What happens next" in a novel is usually a new place, not a consequence of the last sentence. The signal we were trying to mine wasn't there to mine.

Which raised the obvious question. Was the failure about the algorithm or about the substrate? If we ran the same algorithm on clean cause-and-effect pairs - hand-curated, the kind you'd put in a physics textbook - would it work?

The next morning, I queued six experiments back to back. Call it sandbox-test day.

The first was a probe-diversity audit. Take two hundred concepts already in the vocabulary. Probe each one with five different phrasings of the same idea. Does the encoder fire the same concept on all five, or only when the surface words match? The answer: 93% of probed concepts were robust across phrasings. The architecture wasn't pattern matching. The concepts were real.
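The shape of an audit like that is simple enough to sketch. The version below assumes a concept counts as robust only if it fires on every one of its phrasings, which is one reading of the test described above. The `encode` callable stands in for the real encoder, and `toy_encode` is a deliberately crude keyword matcher used only so the example runs; both are hypothetical.

```python
def audit_probe_diversity(encode, probes):
    """Return (share of robust concepts, per-concept verdicts).

    encode: callable mapping a sentence to the set of concepts it fires.
    probes: dict mapping a concept to a list of paraphrased probe sentences.
    A concept is robust here only if it fires on all of its probes.
    """
    verdicts = {}
    for concept, phrasings in probes.items():
        hits = sum(1 for p in phrasings if concept in encode(p))
        verdicts[concept] = hits == len(phrasings)
    share = sum(verdicts.values()) / len(verdicts) if verdicts else 0.0
    return share, verdicts

def toy_encode(text):
    """Stand-in encoder: fires a concept when a known keyword appears."""
    keywords = {"dog": {"dog", "puppy", "hound"}, "fall": {"fall", "falls"}}
    words = set(text.lower().split())
    return {c for c, kws in keywords.items() if words & kws}

probes = {
    "dog": ["the dog barked", "a puppy played", "the hound slept"],
    "fall": ["rocks fall down", "the stone drops"],
}
share, verdicts = audit_probe_diversity(toy_encode, probes)
print(share, verdicts)
```

The toy encoder fails "the stone drops" precisely because it is matching surface words. An encoder that passes that kind of probe is doing something more than keyword lookup, which is what the audit is meant to show.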

The second was the substrate test. I wrote 150 hand-curated cause-effect pairs across physics, biology, social dynamics, and everyday objects. Pure clean signal. Then ran the same four prediction strategies on them.

Retrieval scored 30%. Frequency scored 20%. Cooccurrence scored 0%.

Zero. On clean curated data, the prediction algorithm that had been the centerpiece of the previous night's experiment couldn't beat random selection.

That was the moment the framing shifted. The night before, I'd been telling myself the substrate was the problem. The morning's clean substrate said no. The prediction shape itself was wrong. Whatever was working in this stack, it wasn't prediction. It was retrieval. Look up similar past examples, return what they did. That worked. Generate from a learned transition model - that didn't.

This sounds small. It isn't.

The implicit plan after Part 12 was to build a relations head. A part of the model that could propose new triples (X causes Y, X is part of Y) and let the system reason over them. The whole Discovery 2.0 design I'd been sketching was about teaching Origin to generate its own relational knowledge.

The morning's experiment said: don't. Generation is the wrong shape, the same way prediction is. Anything that proposes new facts is one step away from making them up. What we want isn't a model that can produce new triples. It's a model that can retrieve real ones, stored from real sources, and use them to ground its answers.

By the end of the day, four more experiments had pointed the same direction. Among the results: spaced-repetition retraining lifted six of seven borderline concepts. Multi-hop inheritance from real is_a chains worked, but broke wherever a concept had two senses and the chain crossed between them. A domain-density profile showed math and emotion thin, biology and physics rich - the substrate gap was domain-specific, not uniform.
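The multi-hop failure is easy to reproduce in miniature. The sketch below walks is_a edges upward and collects properties from every ancestor. The graph, the property names, and the "word/sense" tagging convention are all invented for illustration; they are not Origin's data or its actual fix.

```python
from collections import deque

def ancestors(is_a, start):
    """Breadth-first walk up is_a edges; every concept reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for parent in is_a.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def inherited(is_a, props, concept):
    """Properties of the concept plus everything its ancestors carry."""
    out = set(props.get(concept, ()))
    for a in ancestors(is_a, concept):
        out |= set(props.get(a, ()))
    return out

props = {"animal": {"alive"}, "device": {"manufactured"}}

# One node for both senses of "mouse": the chain crosses between them.
merged = {
    "field_mouse": ["mouse"],
    "mouse": ["rodent", "device"],
    "rodent": ["animal"],
}

# Same facts with the two senses kept apart (hypothetical tagging scheme).
split = {
    "field_mouse": ["mouse/animal"],
    "mouse/animal": ["rodent"],
    "mouse/device": ["device"],
    "rodent": ["animal"],
}

print(sorted(inherited(merged, props, "field_mouse")))  # ['alive', 'manufactured']
print(sorted(inherited(split, props, "field_mouse")))   # ['alive']
```

Every individual edge in the merged graph is true. The wrong conclusion only appears when the walk passes through the shared node, which is why the problem shows up in multi-hop chains and not in single lookups.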

Discovery 2.0 came out the other side of that day as a completely different design. Not a triple proposer. A triple ingester. Pull real (subject, relation, object) triples from external sources - ConceptNet, Wikidata, hand-curated where the sources are thin - gate them for polysemy, write them to a reasoning bank, retrieve at composer time. Data engineering, not generation.

That last word matters. Generation invents. Retrieval grounds. The whole arc of Origin from the beginning has been an argument that grounded systems are the path forward, and the day's experiments made it structural rather than aspirational. The model doesn't write its own truths. It looks up the ones we admitted, applies the ones it can, and says "I don't know" when neither path finds a hit.
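As a rough picture of that ingest-gate-retrieve shape, here is a minimal sketch. It assumes the gate is nothing more than a list of known-polysemous words that must carry an explicit sense tag before a triple is admitted. The real gate, the bank's storage format, and the composer interface are not described here, so `Triple`, `ReasoningBank`, `ingest`, `lookup`, and `answer` are all hypothetical names, and the triples are made up.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # where the fact came from, kept for auditing

class ReasoningBank:
    """Stores admitted triples and answers only from what was stored."""

    def __init__(self, polysemous):
        # Words known to have several senses. An untagged use is ambiguous;
        # a tagged form such as "bat/animal" is treated as a distinct term.
        self.polysemous = set(polysemous)
        self._by_key = {}
        self.rejected = []

    def ingest(self, triples):
        """Admit a triple only if neither end is an untagged polysemous word."""
        for t in triples:
            if t.subject in self.polysemous or t.obj in self.polysemous:
                self.rejected.append(t)
                continue
            self._by_key.setdefault((t.subject, t.relation), []).append(t)

    def lookup(self, subject, relation):
        """Objects stored for (subject, relation); empty list if none."""
        return [t.obj for t in self._by_key.get((subject, relation), [])]

    def answer(self, subject, relation):
        """Ground the answer in stored triples, or abstain."""
        hits = self.lookup(subject, relation)
        return ", ".join(hits) if hits else "I don't know"

bank = ReasoningBank(polysemous={"bat"})
bank.ingest([
    Triple("dog", "is_a", "animal", "hand-curated"),
    Triple("bat", "is_a", "mammal", "example"),         # ambiguous: rejected
    Triple("bat/animal", "is_a", "mammal", "example"),  # sense-tagged: admitted
])
print(bank.answer("dog", "is_a"))    # animal
print(bank.answer("bat", "is_a"))    # I don't know
print(bank.answer("cat", "is_a"))    # I don't know
```

Nothing in this path can produce a fact that was not ingested. The cost is coverage: anything the sources lack, or the gate refuses, comes back as "I don't know".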

The last sentence of Part 12 was right that relations were the next bottleneck. It was wrong about the shape of the fix. The fix isn't a relations head. The fix is a curated relational substrate and a retrieval path through it.

Polysemy gating moved from a parked idea to required infrastructure that same day. Without it, retrieval over the bank produces things like "tree has potato" and "host is a bread." Multi-hop reasoning over an ungated polysemous bank hallucinates by construction. Build the gate first. Then the substrate. Then the composer that uses both.

The next several posts in this series are about building those three things, in that order, and what each one cost.

One guy. One GPU. One $1,800 computer in Arizona. Still building.


Origin is developed at Fallen Angel Systems with the Genesis framework - NVIDIA Inception member (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com
