
Josh T

Posted on • Originally published at fallenangelsystems.com

Origin Part 16: Forty Percent Wrong

Fourteen thousand triples. Forty percent of them wrong. Twelve hours of compute on a substrate that turned out to be unusable.

The plan after Part 14 was simple to describe: build a relational substrate so the composer would have real (X, R, Y) facts to retrieve at inference time. Sandbox-test day had said retrieval grounds and generation invents. The whole next stretch of work was about giving the retrieval path something real to retrieve.

ConceptNet was the obvious starting point. It's free, it's structured, it has millions of triples. I pulled it in and looked at coverage. Biology, rich. Physics, rich. Math, almost nothing. Every math concept in our vocabulary had at most one or two triples, all of them shallow is_a chains. The substrate gap was real, and it was domain-specific.

Wikidata helped, but not as much as I'd hoped. Most of its dense content is biographical and geographical - useful for is_a facts about people and places, less useful for the cause-and-effect relations that ground reasoning about physical and biological processes.

Hand-curating was always going to be the gold standard, but slow. I had a hundred and fifty hand-written cause-effect pairs from sandbox-test day. To reach the coverage we needed, I'd be writing for months.

So I tried the obvious thing. Get an LLM to extract triples from textbooks.

The model was Gemma-2-9B running through LM Studio on my computer. The input was the core tier of our book collection - a hundred and nine textbooks spanning biology, physics, social studies, and math. The extraction took eighteen hours spread across two days, one of which Windows interrupted mid-run with an unscheduled reboot for an update. The output was fourteen thousand seven hundred and seventy-five triples.

The extraction pipeline was reasonable. For each passage, ask Gemma to identify (subject, relation, object) triples where the relation was one of is_a, has, does, or causes. Constrain output to JSON. Filter out any triple whose subject and object weren't both in our vocabulary. Write everything incrementally so a Windows reboot couldn't cost more than a few minutes of work. That last detail saved the run when Windows did, in fact, reboot.
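The steps in that paragraph can be sketched roughly as follows. This is a minimal illustration, not the actual extractor: the JSON field names, the JSONL output layout, the `passage_id` key, and the `extract` callable (standing in for the Gemma call) are all assumptions made for the example.

```python
import json
from pathlib import Path

ALLOWED_RELATIONS = {"is_a", "has", "does", "causes"}  # the four relations named in the post

def parse_triples(raw_json, vocabulary):
    """Keep well-formed triples with an allowed relation and both ends in the vocabulary."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # the model broke the JSON constraint; skip this passage's output
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        s, r, o = item.get("subject"), item.get("relation"), item.get("object")
        if not all(isinstance(v, str) for v in (s, r, o)):
            continue
        s, o = s.strip().lower(), o.strip().lower()
        if r in ALLOWED_RELATIONS and s in vocabulary and o in vocabulary:
            kept.append({"subject": s, "relation": r, "object": o})
    return kept

def load_done(out_path):
    """Passage ids already on disk, so a resumed run can skip them."""
    done = set()
    if out_path.exists():
        with out_path.open(encoding="utf-8") as f:
            for line in f:
                try:
                    done.add(json.loads(line)["passage_id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # e.g. a half-written final line left by a crash
    return done

def ends_mid_line(path):
    """True if the file exists and its last byte is not a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return False
    with path.open("rb") as f:
        f.seek(-1, 2)
        return f.read(1) != b"\n"

def run(passages, extract, vocabulary, out_path):
    """Process (passage_id, text) pairs, appending one JSON line per passage."""
    done = load_done(out_path)
    needs_newline = ends_mid_line(out_path)
    with out_path.open("a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")  # keep a partial line from merging with the next record
        for pid, text in passages:
            if pid in done:
                continue
            triples = parse_triples(extract(text), vocabulary)
            f.write(json.dumps({"passage_id": pid, "triples": triples}) + "\n")
            f.flush()  # a reboot loses at most the passage in flight
```

Appending one line per passage and flushing after each write is what makes a resume cheap: on restart, the set of finished passage ids is rebuilt from the file itself rather than from separate state.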

The polysemy gate ran on the output as a dry run first. Six hundred and twenty-one triples out of fourteen thousand made it through. Four point two percent. That sounded like an aggressive filter doing its job. I committed them to the reasoning bank as a live ingest, tagged with a source label so I could roll them back if needed.

Then I pulled fifty triples at random and read them.

Seventeen were good. Thirteen were weak - technically defensible but not useful for grounding. Twenty were wrong in ways that would actively damage the system.
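For anyone who wants the arithmetic spelled out, here is a small sketch using only the counts reported above. The extrapolation to the full ingest assumes the fifty-triple sample was representative, which a sample that small can only suggest, not prove.

```python
# Counts taken from the post; variable names are just for this example.
EXTRACTED = 14_775       # triples produced by the extractor
GATE_SURVIVORS = 621     # triples that passed the polysemy gate
SAMPLE = {"good": 17, "weak": 13, "wrong": 20}  # hand-read random sample

def rates(counts):
    """Convert label counts into fractions of the sample."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

gate_pass_rate = GATE_SURVIVORS / EXTRACTED
sample_rates = rates(SAMPLE)

# Assumes the sample is representative of everything that passed the gate.
implied_wrong = round(sample_rates["wrong"] * GATE_SURVIVORS)

print(f"gate pass rate: {gate_pass_rate:.1%}")
for label, rate in sample_rates.items():
    print(f"{label}: {rate:.0%}")
print(f"implied wrong triples in the ingest: about {implied_wrong}")
```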

The wrong ones came in two flavors.

The first was direction errors on is_a. "Animal is a vertebrate." "Cartoon is a parody." "Diagram is a tree." "Sine is a cosine." In each case, Gemma had seen two concepts in close proximity in the source text and emitted an is_a triple, but in the wrong direction. A diagram is sometimes drawn as a tree; a diagram is not a kind of tree. Sine and cosine share a category; one is not a kind of the other. The model picked a direction and committed to it, and the direction it picked was wrong about half the time.

The second was object errors. "Compare can sonnet." "Divide can put." "Wartime is part of a vancouver." "Tax is a whiskey." "Aircraft requires winnipeg." Gemma had reached into the surrounding sentence, grabbed a word that happened to be there, and emitted it as the object of the relation. The result is grammatical English that means nothing. "Aircraft requires Winnipeg" is what it sounds like when a language model is pattern-matching the shape of a triple without checking whether the assertion is true.

I ran the rollback. DELETE WHERE source = 'llm_relation_extraction_v1'. Verified that the reasoning bank dropped from 51,872 rows back to 51,251. The bank was clean again.
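The source label is what made that rollback a one-liner. A minimal sketch of the pattern, assuming a SQLite store and a table called `reasoning_bank` with a `source` column (both are assumptions for illustration; only the label string comes from the post):

```python
import sqlite3

LLM_SOURCE = "llm_relation_extraction_v1"  # the label used in the post

# In-memory database and schema are hypothetical stand-ins for the real bank.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reasoning_bank (subject TEXT, relation TEXT, object TEXT, source TEXT)"
)

def ingest(conn, triples, source):
    """Insert triples, tagging every row with where it came from."""
    with conn:  # commit on success, roll back the transaction on error
        conn.executemany(
            "INSERT INTO reasoning_bank VALUES (?, ?, ?, ?)",
            [(s, r, o, source) for s, r, o in triples],
        )

def rollback_source(conn, source):
    """Delete every row from one source and return how many were removed."""
    with conn:
        cur = conn.execute("DELETE FROM reasoning_bank WHERE source = ?", (source,))
    return cur.rowcount

def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM reasoning_bank").fetchone()[0]

ingest(conn, [("heat", "causes", "expansion")], "hand_curated")
before = row_count(conn)
ingest(conn, [("tax", "is_a", "whiskey"), ("sine", "is_a", "cosine")], LLM_SOURCE)
removed = rollback_source(conn, LLM_SOURCE)
after = row_count(conn)
print(before, removed, after)
```

Comparing the row count before the ingest with the count after the rollback is the same check as the 51,872 to 51,251 verification: the difference should equal the number of rows the ingest added.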

I sat with the math for a few minutes. Forty percent of the sampled triples, every one of which had passed our polysemy gate, were wrong. The gate was filtering for a different axis entirely - checking whether subject or object had multiple senses that would conflate under retrieval. It wasn't checking whether the assertion itself was true. There was no check for that. We hadn't built one because we hadn't expected the extractor to be wrong that often.

The failure modes are semantic, not syntactic. That sentence is the one I kept coming back to. Templated post-filtering - does this look like a triple, are both ends in the vocabulary, is the relation one we recognize - can catch all the obvious junk. What it can't catch is a grammatical, well-formed assertion that happens to be wrong about the world. "Sine is a cosine" passes every cheap check. So does "Tax is a whiskey." The error is in the meaning, and the validator has no way to see meaning.
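To make that concrete, here is a sketch of the kind of templated post-filter described above. The vocabulary set and the function name are invented for the example; the point is only that every check looks at form, so the two bad triples quoted above pass.

```python
ALLOWED_RELATIONS = {"is_a", "has", "does", "causes"}

# Hypothetical slice of a vocabulary, just large enough for the examples.
VOCAB = {"sine", "cosine", "tax", "whiskey", "dog", "animal"}

def passes_cheap_checks(triple, vocabulary=VOCAB):
    """Structural checks only: shape, known relation, both ends in vocabulary."""
    if not (isinstance(triple, tuple) and len(triple) == 3):
        return False
    subject, relation, obj = triple
    if not all(isinstance(part, str) for part in triple):
        return False
    return (
        relation in ALLOWED_RELATIONS
        and subject in vocabulary
        and obj in vocabulary
        and subject != obj
    )

candidates = [
    ("dog", "is_a", "animal"),      # true, and passes
    ("sine", "is_a", "cosine"),     # false, and passes
    ("tax", "is_a", "whiskey"),     # false, and passes
    ("dog", "likes", "animal"),     # unknown relation, rejected
    ("dog", "is_a", "zeppelin"),    # object outside the vocabulary, rejected
]
results = [passes_cheap_checks(t) for t in candidates]
print(results)  # [True, True, True, False, False]
```

Nothing in the function can tell the first triple from the second or third, because the difference between them is not in their shape.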

The natural next thought is a verifier pass. Run a second LLM call: "is this assertion true?" Untested, but it has its own problem. Wrong-direction is_a errors look plausible to a verifier model the same way they looked plausible to the extractor. The verifier would have to be aware of asymmetric is_a in a way the extractor wasn't. That's not an obviously cheaper problem than building a clean substrate by hand.
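One way a verifier could be made aware of the asymmetry is to ask about both directions and accept an is_a triple only when the forward answer is yes and the reverse answer is no. The sketch below is untested against a real model and is not something the project has built: `ask` is a hypothetical callable standing in for the second LLM call, and the stub used here just hard-codes answers to show the control flow.

```python
from typing import Callable

# ask(subject, obj) should answer: "is every <subject> a kind of <obj>?"
AskFn = Callable[[str, str], bool]

def verify_is_a(subject: str, obj: str, ask: AskFn) -> bool:
    """Accept only if the forward direction holds and the reverse does not."""
    forward = ask(subject, obj)
    reverse = ask(obj, subject)
    return forward and not reverse

# Stub model for illustration: right about dog/animal, but finds
# sine/cosine plausible in both directions, as a confused model might.
STUB_YES = {
    ("dog", "animal"),
    ("sine", "cosine"),
    ("cosine", "sine"),
}

def stub_ask(subject: str, obj: str) -> bool:
    return (subject, obj) in STUB_YES

accepted_dog = verify_is_a("dog", "animal", stub_ask)      # forward yes, reverse no
accepted_reversed = verify_is_a("animal", "dog", stub_ask)  # forward no
accepted_sine = verify_is_a("sine", "cosine", stub_ask)     # yes both ways
print(accepted_dog, accepted_reversed, accepted_sine)
```

The check only catches the case where the model finds both directions plausible. A model that is confidently wrong in one direction passes straight through, and the approach doubles the number of model calls, so it does not change the conclusion that verification is not an obviously cheaper problem.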

What I did salvage was the pipeline itself. The extractor's --resume flag and incremental-write logic are keepers. They survived Windows's unscheduled reboot without losing a sentence of work. When we eventually run a different extractor - a more constrained one, or a hand-verified one, or a hybrid - the plumbing is there. The model is what failed.

There's something underneath the failure that's worth saying out loud. The case for hand-curated relational substrate just got stronger. The case for trusting an off-the-shelf LLM to produce structured truth from free text just got weaker. The same machine-learning move that makes these models good at producing grammatical English also makes them comfortable producing confident assertions about things they haven't reasoned about. The pattern-matching is the whole story.

The substrate problem is harder than I thought going in. ConceptNet has gaps. Wikidata has limited cause-effect coverage. Hand-curating is slow. LLM extraction has a quality floor. There isn't a free option here. There's a slow option with a clean output, and a fast option with a contaminated output, and a few hybrid paths somewhere in between.

I picked the slow option, because the contamination is worse than the slowness. The retrieval bank has to be trustworthy for the composer to ground on. A bank with forty percent garbage in it isn't a bank. It's a noise source the composer would faithfully serve back to users as facts.

The same evening, while the rollback was running, I kicked off a hundred-turn conversation audit against the live system to see how badly the missing substrate was hurting it. The audit answered that question, and then it answered another one I hadn't asked. The dispatcher had developed a problem of its own.

That's Part 17.

One guy. One GPU. One $1,800 computer in Arizona. Still building.


Origin is developed at Fallen Angel Systems with the Genesis framework - NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com
