85.7% of concepts were data-starved. That was the problem. What happened next taught us something about the problem itself.
OLT-1 is a concept-based AI that understands language without tokenization. Characters go in. Concepts come out. The encoder is what makes that mapping. If it can't reliably tell concepts apart, nothing downstream works.
Part 8 left the encoder firing on too many slots per query. The concept space was crowded and noisy. A sandbox experiment had already shown that the same architecture could lift top1 from 33% to 80% just by feeding it richer data. Same model. Different food.
That made the next move obvious: stop tuning the encoder and feed it properly. We wrote a plan and built a scope fence around it.
The Plan
One sentence: every V2C concept has at least 30 natural-context positives before any further retrain. Not WordNet glosses. Not template sentences. Real text from books or Wikipedia where the concept is used naturally.
The scope fence was strict: no hard-negative tuning, no decoder dispatch guards, no architecture changes, no tier-test-specific quick fixes. Four things we'd been tempted to try in past sessions and would not be trying this session.
Three data phases with gates: coverage audit, source expansion if needed, then per-concept generation. Then retrain. Then probe.
Phase 1: Coverage Audit
We walked the existing data: book ingestion proposals, elaboration candidates, the grounding cache. For each of the 3,687 concepts, we counted how many natural-context sentences it had.
The number: 3,158 concepts (85.7%) below the 30-positive threshold. Most were stuck in the 10-29 range. Some data, but not enough.
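A minimal sketch of that audit, assuming each source is a JSONL file of records with concept and sentence fields. The loaders and file names here are illustrative, not the actual pipeline:

```python
import json
from collections import Counter

THRESHOLD = 30  # minimum natural-context positives per concept

def count_positives(sources):
    """Tally natural-context sentences per concept across all sources."""
    counts = Counter()
    for path in sources:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts[json.loads(line)["concept"]] += 1
    return counts

def audit(sources, vocabulary):
    """Report every concept below the positives threshold."""
    counts = count_positives(sources)
    starved = {c: counts[c] for c in vocabulary if counts[c] < THRESHOLD}
    pct = 100 * len(starved) / len(vocabulary)
    print(f"{len(starved)}/{len(vocabulary)} concepts ({pct:.1f}%) below {THRESHOLD}")
    return starved

# Illustrative source names:
# audit(["book_proposals.jsonl", "elaboration.jsonl", "grounding_cache.jsonl"],
#       vocabulary)
```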
Phase 2: Source Expansion
We tagged every concept with one of 17 domain labels using gemma-2-9b and built a Wikipedia full-article adapter. Encyclopedic concepts (biology, science, physics, history) were routed to Wikipedia; conversational concepts (emotion, self_state, language) went to Gutenberg fiction.
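In sketch form, the routing is a small lookup. The encyclopedic/conversational split comes from the text; the adapter names and the fall-through behavior are assumptions:

```python
ENCYCLOPEDIC = {"biology", "science", "physics", "history"}
CONVERSATIONAL = {"emotion", "self_state", "language"}

def route(domain: str) -> str:
    """Pick a corpus adapter for a concept, given its gemma-assigned domain label."""
    if domain in ENCYCLOPEDIC:
        return "wikipedia_full_article"  # Wikipedia full-article adapter
    if domain in CONVERSATIONAL:
        return "gutenberg_fiction"       # Gutenberg fiction corpus
    return "existing_sources"            # other domains keep their current sources (assumption)
```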
The first run came back thin. Wikipedia and Gutenberg both produced fewer candidates than expected, and the per-domain medians barely moved. Most of the new positives went to common quantifier words: some, many, all. The ones most likely to be the only known concept in any given sentence.
That last detail was the clue.
The Rule That Was Right and Wrong
The book ingestion pipeline has a strict rule: if a sentence mentions more than one known concept, drop it. The rule was correct for the original use case. You never want to assign the wrong concept to a sentence. But it was actively working against us here. The sentences we needed most, ones like "the cell membrane regulates what enters the cell," got dropped because they mention two concepts.
We almost missed it. The quantifier words sailed through the filter because they're the words most likely to appear alone in a sentence. The interesting concepts, the semantically rich ones, were still getting filtered out at every pass.
We sandboxed a relaxed variant: assign multi-concept sentences to the least-common concept. The argument is information-theoretic. Rare concepts gain more from each new positive. A 50-sample spot-check came back 88% good, 12% defensible-either-way, 0% wrong. We shipped it as a separate file. The original strict rule still serves Discovery unchanged.
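Both rules fit in a few lines. A sketch under two assumptions: concept matching happens upstream and yields a `mentioned` list, and `global_counts` stands in for however the pipeline tracks per-concept positives:

```python
def strict_assign(mentioned):
    """Original Discovery rule: keep a sentence only if exactly one
    known concept appears; multi-concept sentences are dropped."""
    return mentioned[0] if len(mentioned) == 1 else None

def relaxed_assign(mentioned, global_counts):
    """Path B variant: assign a multi-concept sentence to whichever
    mentioned concept currently has the fewest positives."""
    if not mentioned:
        return None
    return min(mentioned, key=lambda c: global_counts.get(c, 0))

# "the cell membrane regulates what enters the cell" (counts are illustrative)
mentioned = ["cell membrane", "cell"]
strict_assign(mentioned)                                      # None: dropped
relaxed_assign(mentioned, {"cell": 412, "cell membrane": 9})  # "cell membrane"
```

The strict rule drops the cell-membrane sentence entirely; the relaxed rule hands it to the concept that needs it most.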
Another 181 concepts crossed the threshold. Per-domain medians moved up by three to five positives across the board.
Phase 3: Generation
The last data step. We pulled everything together: book ingestion proposals, elaboration candidates, and the new Path B output. One training file: 94,000 natural-context pairs covering 96.7% of the vocabulary. Wired into the encoder trainer's Phase A data list.
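The merge step itself is mundane: concatenate the three sources, dedupe exact (concept, sentence) pairs, write one file. A sketch, with illustrative file names:

```python
import json

def merge(sources, out_path):
    """Concatenate JSONL sources into one training file, dropping
    exact duplicate (concept, sentence) pairs."""
    seen, kept = set(), 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sources:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    key = (rec["concept"], rec["sentence"])
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(json.dumps(rec) + "\n")
                    kept += 1
    print(f"wrote {kept} pairs to {out_path}")

merge(["book_proposals.jsonl", "elaboration.jsonl", "path_b.jsonl"],
      "phase_a_natural_context.jsonl")
```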
The trainer now had three times its previous data. Every phase gate had passed. We hit run on the retrain and set a timer. Sixty-five minutes later, we'd know whether the data had been the problem all along.
Origin is developed at Fallen Angel Systems with the Genesis framework — NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.
fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub
Questions or consulting inquiries: josh@fallenangelsystems.com