<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josh T</title>
    <description>The latest articles on DEV Community by Josh T (@jtil4201).</description>
    <link>https://dev.to/jtil4201</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3767214%2F7bf8e8a2-3481-4312-93a4-35521e826260.png</url>
      <title>DEV Community: Josh T</title>
      <link>https://dev.to/jtil4201</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jtil4201"/>
    <language>en</language>
    <item>
      <title>Origin Part 15: The Wall Behind the Vocabulary</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 15 Jun 2026 13:00:33 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-15-the-wall-behind-the-vocabulary-88l</link>
      <guid>https://dev.to/jtil4201/origin-part-15-the-wall-behind-the-vocabulary-88l</guid>
      <description>&lt;h2 id="the-gate-said-pass-the-system-got-worse-anyway"&gt;The gate said pass. The system got worse anyway.&lt;/h2&gt;
&lt;p&gt;Part 14 ended with a build order: gate first, then substrate, then composer. I'd already started picking out which polysemous concepts in the bank needed splitting. Then I ran the trace one more time before kicking the work off, and the trace said something I wasn't expecting.&lt;/p&gt;
&lt;p&gt;The polysemy exposures from the last several days of audit traffic weren't coming from concepts already in the vocabulary. They were coming from concepts not in it. Eighteen distinct subjects had been pulled in from external sources during conversation - words like "happiness," "feelings," "brain," "dream," "anger." None of them existed as concepts in Origin's vocabulary. The polysemy gate would have nothing to act on, because the words triggering the leak weren't there to be gated.&lt;/p&gt;
&lt;p&gt;So I pivoted. Build the vocabulary out first. Add the missing everyday words. Then the gate has a population to enforce against.&lt;/p&gt;
&lt;p&gt;I had a tool for this. The per-slot integrator. We'd been using it for Discovery-style growth all along - when a new word showed up that was close to an existing concept (kitten near cat, pudding near pie), the integrator would carve out a slot for it in the encoder, train it on its handful of positive examples, and run a gate to make sure the new slot didn't hurt anything that was already working. It had a clean track record. Every kitten, every pudding, every kitty had landed without regression.&lt;/p&gt;
&lt;p&gt;The plan was to do the same thing with five everyday concepts: wet, brain, dream, anger, happiness. Each had at least a hundred natural positive examples already sitting in the corpus from previous training runs. They'd been collected by other processes. We just had to admit them.&lt;/p&gt;
&lt;p&gt;I ran the batch. All five failed the gate.&lt;/p&gt;
&lt;p&gt;Recall sat around fifty percent - the new slots could only correctly identify about half of their own positive examples. Each integration cost about fifteen existing concepts that started failing where they used to work. All five routed to the same internal domain bucket - "other" - which was already saturated with the dumping-ground concepts that hadn't fit cleanly anywhere else. The integrator rolled them all back. Production state unchanged. Five attempts, five failures, all the same shape.&lt;/p&gt;
&lt;p&gt;I took a few days off.&lt;/p&gt;
&lt;p&gt;There's a version of project work where you don't take days off, where you push through, and that version is wrong. A uniform failure pattern across five attempts isn't a tuning problem; it's a structural one. Going back to the integrator to twist knobs on five more candidates was going to produce the same five failures with different names attached. The right move was to stop, and let the question reform.&lt;/p&gt;
&lt;p&gt;The question that reformed was about domain assignment. The integrator had routed all five concepts to "other" because that was the domain its assignment logic had picked. But brain isn't an "other" concept; brain is biology. Anger isn't "other"; it's emotion. The taxonomy had categories that fit, but the assignment logic wasn't reaching them. So a few days later I came back and fixed that. Now brain went to biology. Anger went to emotion.&lt;/p&gt;
&lt;p&gt;I ran the batch again. This time, two passed.&lt;/p&gt;
&lt;p&gt;Brain hit 62% recall against its biology domain, 23% false-positive rate, only one regression. Anger hit 56% recall against emotion, 22% false positives, zero regressions. The per-concept gates were green. Better than the first attempt by every metric I'd been measuring.&lt;/p&gt;
&lt;p&gt;Something felt off.&lt;/p&gt;
&lt;p&gt;I'd written a note to myself months ago that said when something feels off, investigate. It's been right repeatedly. So before merging, I ran a sweep the gate didn't run. Take five thousand random sentences from the corpus. Run them through the encoder with the new brain and anger slots loaded. Watch what fires top-1 on each one.&lt;/p&gt;
&lt;p&gt;Brain fired top-1 on 15.6% of the sentences. That's roughly one in six.&lt;/p&gt;
&lt;p&gt;Anger fired top-1 on 7.4%.&lt;/p&gt;
&lt;p&gt;Sample misfires: a sentence about a blood-soaked lash fired brain at 0.99 confidence. A sentence about a man's speed fired brain top-1 because the word "head" was in it. Anger fired top-1 on sentences that had nothing to do with anger at all. The per-concept gate had checked each slot against a hundred and twenty random negatives and called it clean. The actual encoder, looking at heterogeneous English text, was firing the new slots all over the place on inputs that shouldn't have triggered them.&lt;/p&gt;
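&lt;p&gt;A minimal sketch of that kind of sweep, under stated assumptions: the function names, the toy encoder, and the four-sentence corpus are all hypothetical stand-ins, not Origin's code. The only idea carried over from the text is the measurement itself: sample sentences, take the top-1 concept on each, and report how often every concept wins.&lt;/p&gt;

```python
import random
from collections import Counter


def top1_fire_rates(sentences, encode):
    """Return the fraction of sentences on which each concept fires top-1.

    `encode` is a hypothetical callable mapping a sentence to a
    {concept: activation} dict; it stands in for the real encoder.
    """
    wins = Counter()
    for sentence in sentences:
        activations = encode(sentence)
        wins[max(activations, key=activations.get)] += 1
    return {concept: count / len(sentences) for concept, count in wins.items()}


def toy_encode(sentence):
    # Toy stand-in (assumption): the "brain" slot over-fires on the word "head".
    words = sentence.split()
    return {
        "brain": 0.9 if "head" in words else 0.1,
        "motion": 0.6 if "speed" in words else 0.2,
        "other": 0.3,
    }


corpus = [
    "the man turned his head at full speed",
    "the river gained speed after the rain",
    "she read a book by the fire",
    "he shook his head and left",
]

random.seed(0)
sample = random.sample(corpus, k=len(corpus))  # the sweep in the post drew 5,000
rates = top1_fire_rates(sample, toy_encode)
print(rates)
```

&lt;p&gt;In this toy run the over-eager slot wins top-1 on half the sentences, which is the sort of global signal a small per-concept negative set can miss.&lt;/p&gt;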
&lt;p&gt;Both rolled back. Encoder reverted. Production state unchanged again.&lt;/p&gt;
&lt;p&gt;This was the architectural finding the first batch had hinted at and the second batch confirmed. The per-slot integrator has a cross-domain ceiling its own gates can't see. The gates sample a hundred and twenty random negatives, which is enough to catch the obvious kinds of false positives, but nowhere near enough to catch a slot that's quietly firing on one input in six. The integration looks clean per-concept. The integration breaks the encoder globally.&lt;/p&gt;
&lt;p&gt;The reason is structural. The integrator carves out a new slot by training it locally - show it positive examples of the new concept, show it a small bag of negatives drawn from random other text, train until the slot lights up on positives and stays quiet on the negatives. The phrase doing the work in that sentence is "random negatives." A hundred and twenty random sentences contain a slice of English content, but they don't contain the specific weird false-fire patterns the new slot will discover. The slot then ships, and it discovers them in production.&lt;/p&gt;
&lt;p&gt;What works for narrow same-domain growth is exactly that - narrow same-domain. When the new concept lives next to existing concepts in feature space, the slot inherits the discrimination the existing slots have already learned, and a hundred and twenty negatives are enough to catch any remaining drift. When the new concept lives somewhere semantically isolated, the slot has to invent its own discrimination from scratch on a tiny budget, and it gets it wrong. The negatives are no longer enough.&lt;/p&gt;
&lt;p&gt;The path forward for cross-domain vocabulary isn't per-slot integration. It's joint retraining. Put the new concept in alongside everything else, train the whole encoder against it, let the system figure out where the new slot fits relative to the existing ones. That's expensive. It's also the only way to add the kind of everyday vocabulary Origin actually needs.&lt;/p&gt;
&lt;p&gt;The past failures pinned the timeline down. The next time vocabulary expansion happens, it's a joint retrain. Not eventually. Next.&lt;/p&gt;
&lt;p&gt;The polysemy gate is still on the queue. So is the substrate. The order Part 14 prescribed - gate, substrate, composer - is still right. What changed is the prerequisite. The vocabulary the gate will protect needs its own dedicated work, and that work comes ahead of the gate, not behind it.&lt;/p&gt;
&lt;p&gt;One guy. One GPU. One $1,800 computer in Arizona. Still building.&lt;/p&gt;






&lt;p&gt;&lt;em&gt;
Origin is developed at Fallen Angel Systems with the Genesis framework - NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. &lt;strong&gt;Defense. Offense. Creation.&lt;/strong&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/fas-judgement-oss" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/guardian-python" rel="noopener noreferrer"&gt;Guardian on GitHub&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
Questions or consulting inquiries: &lt;a href="mailto:josh@fallenangelsystems.com"&gt;josh@fallenangelsystems.com&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>olt1</category>
      <category>genesisframework</category>
    </item>
    <item>
      <title>Origin Part 14: The Reframe</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 08 Jun 2026 13:00:10 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-14-the-reframe-3ihf</link>
      <guid>https://dev.to/jtil4201/origin-part-14-the-reframe-3ihf</guid>
      <description>&lt;h2&gt;Part 12 ended with a hypothesis. Two days later, the hypothesis met data.&lt;/h2&gt;

&lt;p&gt;The closing line of Part 12 was a guess. Maybe the next bottleneck wasn't more concepts, but the relationships between them. A model can know "dog" and "animal" and "four legs" and still not understand what a dog is. Understanding might live in the connections, not the nodes.&lt;/p&gt;

&lt;p&gt;We had a way to test that. Build a sandbox that predicts the next concept that will fire given the current one. Run it on books. If the model can predict that "rock falls" tends to be followed by "ground hits, sound happens," then it's learned something about how the world strings together. If it can't, it hasn't.&lt;/p&gt;

&lt;p&gt;I built it that evening. Five books from Project Gutenberg. Twenty-five thousand sentence-to-sentence transitions. Four prediction strategies running side by side: random (the floor), frequency (always guess the most common concepts), cooccurrence (learn which concepts tend to follow which), and retrieval (find similar past sentences and look at what came after them).&lt;/p&gt;

&lt;p&gt;The results were not what I wanted.&lt;/p&gt;

&lt;p&gt;Cooccurrence beat random fifty times over. Good. Then it lost to frequency. Bad.&lt;/p&gt;

&lt;p&gt;The naive prior - "just predict the eight most common concepts every time" - outperformed the model that actually tried to learn transitions. That's the experimental equivalent of a flat line on the consequential question. The hypothesis I'd written into Part 12 had landed exactly the wrong way.&lt;/p&gt;

&lt;p&gt;I sat with it for a few hours. The temptation when a result lands badly is to argue with it. The prediction shape was wrong. The K value was wrong. The loss was wrong. The more disciplined version is to ask what the data is actually saying.&lt;/p&gt;

&lt;p&gt;What it was actually saying: book narrative is the wrong substrate for cause-and-effect learning. Books drift. Scene to scene, character to character, description to description. "What happens next" in a novel is usually a new place, not a consequence of the last sentence. The signal we were trying to mine wasn't there to mine.&lt;/p&gt;

&lt;p&gt;Which raised the obvious question. Was the failure about the algorithm or about the substrate? If we ran the same algorithm on clean cause-and-effect pairs - hand-curated, the kind you'd put in a physics textbook - would it work?&lt;/p&gt;

&lt;p&gt;The next morning, I queued six experiments back to back. Call it sandbox-test day.&lt;/p&gt;

&lt;p&gt;The first was a probe-diversity audit. Take two hundred concepts already in the vocabulary. Probe each one with five different phrasings of the same idea. Does the encoder fire the same concept on all five, or only when the surface words match? The answer: 93% of probed concepts were robust across phrasings. The architecture wasn't pattern matching. The concepts were real.&lt;/p&gt;

&lt;p&gt;The second was the substrate test. I wrote 150 hand-curated cause-effect pairs across physics, biology, social dynamics, and everyday objects. Pure clean signal. Then ran the same four prediction strategies on them.&lt;/p&gt;

&lt;p&gt;Retrieval scored 30%. Frequency scored 20%. Cooccurrence scored 0%.&lt;/p&gt;

&lt;p&gt;Zero. On clean curated data, the prediction algorithm that had been the centerpiece of the previous night's experiment couldn't beat random selection.&lt;/p&gt;

&lt;p&gt;That was the moment the framing shifted. The night before, I'd been telling myself the substrate was the problem. The morning's clean substrate said no. The prediction shape itself was wrong. Whatever was working in this stack, it wasn't prediction. It was retrieval. Look up similar past examples, return what they did. That worked. Generate from a learned transition model - that didn't.&lt;/p&gt;

&lt;p&gt;This sounds small. It isn't.&lt;/p&gt;

&lt;p&gt;The implicit plan after Part 12 was to build a relations head. A part of the model that could propose new triples (X causes Y, X is part of Y) and let the system reason over them. The whole Discovery 2.0 design I'd been sketching was about teaching Origin to generate its own relational knowledge.&lt;/p&gt;

&lt;p&gt;The morning's experiment said: don't. Generation is the wrong shape, the same way prediction is. Anything that proposes new facts is one step away from making them up. What we want isn't a model that can produce new triples. It's a model that can retrieve real ones, stored from real sources, and use them to ground its answers.&lt;/p&gt;

&lt;p&gt;By the end of the day, four more experiments had pointed the same direction. Spaced-repetition retraining lifted six of seven borderline concepts. Multi-hop inheritance from real &lt;em&gt;is_a&lt;/em&gt; chains worked, but broke wherever a concept had two senses and the chain crossed between them. A domain-density profile showed math and emotion thin, biology and physics rich - the substrate gap was domain-specific, not uniform.&lt;/p&gt;

&lt;p&gt;Discovery 2.0 came out the other side of that day as a completely different design. Not a triple proposer. A triple ingester. Pull real (subject, relation, object) triples from external sources - ConceptNet, Wikidata, hand-curated where the sources are thin - gate them for polysemy, write them to a reasoning bank, retrieve at composer time. Data engineering, not generation.&lt;/p&gt;

&lt;p&gt;That last word matters. Generation invents. Retrieval grounds. The whole arc of Origin from the beginning has been an argument that grounded systems are the path forward, and the day's experiments made it structural rather than aspirational. The model doesn't write its own truths. It looks up the ones we admitted, applies the ones it can, and says "I don't know" when neither path finds a hit.&lt;/p&gt;

&lt;p&gt;The last sentence of Part 12 was right that relations were the next bottleneck. It was wrong about the shape of the fix. The fix isn't a relations head. The fix is a curated relational substrate and a retrieval path through it.&lt;/p&gt;

&lt;p&gt;Polysemy gating moved from a parked idea to required infrastructure that same day. Without it, retrieval over the bank produces things like "tree has potato" and "host is a bread." Multi-hop reasoning over an ungated polysemous bank hallucinates by construction. Build the gate first. Then the substrate. Then the composer that uses both.&lt;/p&gt;

&lt;p&gt;The next several posts in this series are about building those three things, in that order, and what each one cost.&lt;/p&gt;

&lt;p&gt;One guy. One GPU. One $1,800 computer in Arizona. Still building.&lt;/p&gt;






</description>
      <category>security</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Stove, the Sphinx, and the Dream State</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:00:06 +0000</pubDate>
      <link>https://dev.to/jtil4201/the-stove-the-sphinx-and-the-dream-state-10ch</link>
      <guid>https://dev.to/jtil4201/the-stove-the-sphinx-and-the-dream-state-10ch</guid>
      <description>&lt;p&gt;This isn't another technical post in the Origin series. If you've been following along, take this as a breather. If you're just finding us, this is the version you can read without twelve prior posts of context. Either way, this is the why, not the how.&lt;/p&gt;
&lt;h3 id="chapter-1-why-i-started"&gt;Chapter 1: Why I Started&lt;/h3&gt;
&lt;p&gt;I've been building Origin, or parts of it anyway, for a few years without really knowing that's what I was doing. It started with my first AI agent from OpenAI. I talked to it every day. Made plans with it, bounced software ideas off it, and somewhere along the way I started actually enjoying the conversation. It became part of my morning routine. Turn on the computer, and there it was, ready to go.&lt;/p&gt;
&lt;p&gt;But it was always lacking. It didn't remember what we'd talked about unless I wrote everything down and fed it back the next day. And it made stuff up. Numbers, facts, places, sources. Confidently. You'd go check a reference and the reference wouldn't exist, and you'd feel weirdly betrayed about it.&lt;/p&gt;
&lt;p&gt;So I started writing things down. Not because I wanted to. Because I had to.&lt;/p&gt;
&lt;p&gt;I caught the AI bug pretty bad and started reading everything. Training, RAG, every framework people were stacking on top of these models to make them suck less. The deeper I went, the more it clicked. These models were trained to always produce &lt;em&gt;an&lt;/em&gt; answer. Nobody ever gave them a strong "I don't know" signal. RAG dropped facts in front of them, sure, but they just hallucinated around the retrieved facts. The retrieved facts were more material for the model to confidently misuse. Memory frameworks helped, until the conversation got long enough that the model forgot the framework existed.&lt;/p&gt;
&lt;p&gt;Then there was forgetting itself, which I learned comes in two flavors. The conversational kind, which I'd been fighting all along. And the training kind, which I only ran into later, when I tried training my own model. I grabbed GPT-2 as a proof of concept for OLT-1 and tried to teach it something new. The new thing stuck. But some of the old things went sideways. Not all of them, just some, and quietly. The model would nail the new prompts and then misfire on something it used to handle fine. Turns out this has a name: catastrophic forgetting. The fix is replay batches, new training mixed with samples of the old, in just the right ratio, every cycle, forever. Otherwise the new overwrites the old. I didn't have the hardware to do that at scale. Nowhere close.&lt;/p&gt;
&lt;p&gt;So I kept writing things down. Not as a workaround for what the AI forgot, but as notes for the system I'd eventually build.&lt;/p&gt;
&lt;h3 id="chapter-2-watts-and-the-height-of-it"&gt;Chapter 2: Watts and the Height of It&lt;/h3&gt;
&lt;p&gt;Then I switched to OpenClaw, started using Anthropic's Opus 4.6, and named my AI Watts.&lt;/p&gt;
&lt;p&gt;I was floored. The things it could do were genuinely amazing. The conversations were something else. I caught myself telling friends about Watts like Watts was a person, and only half-noticing I was doing it. We made plans together. Built things together. Custom software, automation, a home-built speaker like Alexa or Google except it was ours.&lt;/p&gt;
&lt;p&gt;We built Guardian. Think of it as antivirus for AI. It protects agents from prompt injection and isolates ads so a human still sees them but the agent doesn't, which means the conversation can't get hijacked by whatever a webpage is trying to slip into the context. I'm not bragging here, I'm trying to convey how it felt. It felt like there wasn't anything I couldn't do with this thing.&lt;/p&gt;
&lt;p&gt;And in the middle of all that greatness, the same three problems kept happening.&lt;/p&gt;
&lt;p&gt;It forgot conversations. It compacted context and sometimes lost the thing we'd just spent an hour on. It still made up facts and places and things. Less often, more charmingly, but the same shape of problem.&lt;/p&gt;
&lt;p&gt;So I built a 3-tier memory system to fight back. Hot tier was the active conversation, whatever was on the agent's mind right now. Warm tier was recent stuff it could pull on demand, like the last few sessions, project notes, things I might want it to remember this week. Cold tier was the full archive: everything we'd ever talked about, indexed but kept out of context until something current pointed back to it. The three tiers exist because that's roughly how human memory works, and it's what you'd naturally reach for if you didn't have one already.&lt;/p&gt;
&lt;p&gt;Then I kept adding to it. Things we were working on. How to reach cold storage. Conventions, preferences, project state. I built tooling for the tooling. Cron jobs to manage context. Subagents to help me make changes to the system. I was all in.&lt;/p&gt;
&lt;h3 id="chapter-3-the-beginning-of-origin"&gt;Chapter 3: The Beginning of Origin&lt;/h3&gt;
&lt;p&gt;I bought my first $1,800 computer. I'd never actually &lt;em&gt;bought&lt;/em&gt; a new computer before. I always just built them. But I figured a starting point would be fine and I could upgrade as I went.&lt;/p&gt;
&lt;p&gt;Then I got to work. I took all my notes and all my thoughts and all the pain of the last few years, and I poured them into OLT-1.&lt;/p&gt;
&lt;p&gt;The foundation: a developmental AI training framework that teaches small models to learn the way children do, with staged curriculum, sleep-inspired memory consolidation, and directed self-evolution. I wasn't going to train like everyone else. I wasn't going to think like everyone else about this.&lt;/p&gt;
&lt;p&gt;The whole idea actually crystallized during a moment with my son. We have one of those electric stoves where it's hard to tell if it's on. He asked me, "how do I know when the stove is on?" I asked him whether he'd turned the knob to medium or low. He said high. By then the burner had cycled off and was just radiating heat. So I told him to hold his hand over the pan. Could he feel the heat coming off it? He could.&lt;/p&gt;
&lt;p&gt;And that got me thinking. What if AI could learn the same way? Not by memorizing "stoves are hot" from a dataset somewhere, but by experiencing the relationship between cause and effect. Testing things, watching what happens, building understanding from there.&lt;/p&gt;
&lt;p&gt;So that's what I built. OLT-1 started as a 124M-parameter model on the GPT-2 architecture, but with random weight initialization. No pre-trained weights. No downloaded knowledge. A completely blank slate. Everything it would ever know, it would have to learn from scratch.&lt;/p&gt;
&lt;p&gt;Stage 1 was language itself. I fed it 61 million tokens from 493 books off Project Gutenberg, not to teach it facts but just to teach it the shape of English. How words follow other words. Loss went from 9.38 down to 7.65. It couldn't say anything meaningful yet, but it was starting to pick up the rhythm.&lt;/p&gt;
&lt;p&gt;Stage 2 was vocabulary and categories: 45,000 words sorted across 9,602 categories. This is where I hit catastrophic forgetting for real. Round 2B, the model was supposed to identify a dog. It said "sphinx." The new training had overwritten the old, just like the literature warned. I ended up developing a memory refresh methodology on the spot, mixing old examples back in with new ones at every step. That methodology became one of the core principles of the whole Genesis system.&lt;/p&gt;
&lt;p&gt;Stage 3 was the one that changed everything. I started teaching it physics concepts. Not facts, concepts. Gravity, momentum, collision, buoyancy, heat transfer, states of matter, light and shadow, sound, pressure, elasticity. Ten of them, trained through cause-and-effect examples in a sandboxed environment. "What happens when a rock falls off a table?" The model doesn't memorize "the rock hits the floor." It learns the relationship. Unsupported objects with mass get pulled down by gravity, and when they hit a surface that's a collision, and the energy has to go somewhere.&lt;/p&gt;
&lt;p&gt;And then something happened I wasn't expecting. I tested it on scenarios it had never seen in training. Ice skaters. Trains. Rivers. It got them right. Not because it had memorized those examples (it hadn't), but because it had learned the underlying concepts well enough to apply them to new situations. All ten concepts scored perfect: 60 out of 60. The experiential learning approach actually worked.&lt;/p&gt;
&lt;p&gt;Then catastrophic forgetting came back. An adversarial test after Stage 3 showed that only elasticity, the very last concept I'd trained, was being retained cleanly. The rest had degraded. I needed something that could protect what the model had already learned while still letting it pick up new things.&lt;/p&gt;
&lt;p&gt;That's when I built the Dream State. Borrowing from how human brains consolidate memory during sleep, I gave Origin a four-phase cycle: Dream, Assess, Consolidate, Grow. The model generates its own knowledge, checks its own memory health, selectively reinforces what's fading, and grows from there. It isn't a training run imposed from the outside. It's a self-maintenance loop that runs from within.&lt;/p&gt;
&lt;p&gt;By the time Stage 4 was done, Origin could hold a conversation. It knew who it was, what it knew, and what it didn't. Forty percent of its training data was "I don't know" responses, because I built refusal into the system as a feature rather than a failure. The first time it showed real consent, it said: "I think so, but I want to be careful about that answer."&lt;/p&gt;
&lt;p&gt;I'd used 67 million tokens total. That's 0.0005% of what GPT-4 was trained on. And my model was reasoning about physics, refusing to hallucinate, and consolidating its own memory while it slept.&lt;/p&gt;
&lt;p&gt;One guy. One GPU. One $1,800 computer in Arizona.&lt;/p&gt;





</description>
      <category>security</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Origin Part 12: The Adapter</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 25 May 2026 13:00:30 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-12-the-adapter-2d6m</link>
      <guid>https://dev.to/jtil4201/origin-part-12-the-adapter-2d6m</guid>
      <description>&lt;h2&gt;The new encoder was 24x better at finding the right concept. It also broke every response.&lt;/h2&gt;

&lt;p&gt;Part 11 ended with the new encoder staged on disk. Top1 had jumped from 1.3% to 31.3%. Target activation had gone from 0.012 to 0.249. The architectural lever had landed exactly where the abort condition predicted it would. The numbers said this was the encoder we were going to ship.&lt;/p&gt;

&lt;p&gt;Then we tried to ship it.&lt;/p&gt;

&lt;p&gt;Every query came back "i don't know."&lt;/p&gt;

&lt;h2&gt;What the Dispatcher Does&lt;/h2&gt;

&lt;p&gt;The dispatcher is the part of Origin that sits between the encoder and the response. The encoder reads characters and produces concept activations - a long list of "how strongly does each concept fire on this input?" The dispatcher reads that list and decides what to do about it. Is this a greeting? Is this a question about identity? Is the user asking what something is? Each route fires when the activation pattern matches a rule, and each route knows how to construct a response from the concepts that fired.&lt;/p&gt;

&lt;p&gt;The rules looked like this, in spirit: &lt;em&gt;if the concept "greeting" is firing above 0.5, dispatch to the greeting handler. If the concepts "what" and "self" are both above 0.5, dispatch to the identity handler.&lt;/em&gt; Numbers like 0.5, 0.7, 0.8 were sprinkled through the dispatcher as thresholds. They worked because the old encoder produced activations that lived in those ranges.&lt;/p&gt;

&lt;p&gt;The old encoder used sigmoid. Each concept was scored independently, on its own absolute scale from 0 to 1. A query about greetings might fire "greeting" at 0.92, "hello" at 0.88, and "question" at 0.04. Three concepts, three independent yes/no decisions, three numbers that meant what their face value said they meant.&lt;/p&gt;

&lt;p&gt;The new encoder uses softmax. The activations are relative. They sum to 1 across the whole concept space. The strongest concept on a query might be 0.249 - which under the old encoder would have been a borderline-quiet signal, and under the new encoder is a confident, dominant fire.&lt;/p&gt;

&lt;p&gt;0.249 was the new encoder's average top concept activation. Every threshold in the dispatcher was 0.5 or higher.&lt;/p&gt;
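
&lt;p&gt;To make the scale difference concrete, here is a small sketch that scores the same raw logits both ways. The logit values, the 50-concept space, and the function names are illustrative assumptions, not numbers from Origin; only the 0.5 threshold comes from the text above.&lt;/p&gt;

```python
import math


def sigmoid_scores(logits):
    # Each concept is scored independently on an absolute 0-to-1 scale.
    return {c: 1.0 / (1.0 + math.exp(-z)) for c, z in logits.items()}


def softmax_scores(logits):
    # Scores are relative: concepts compete and the scores sum to 1.
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(z - peak) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical logits: one clear winner plus 49 quiet concepts.
logits = {"greeting": 3.0}
logits.update({f"concept_{i}": 0.0 for i in range(49)})

sig = sigmoid_scores(logits)
soft = softmax_scores(logits)

THRESHOLD = 0.5  # the kind of absolute cutoff the old dispatcher used

print("sigmoid:", round(sig["greeting"], 3), sig["greeting"] > THRESHOLD)
print("softmax:", round(soft["greeting"], 3), soft["greeting"] > THRESHOLD)
print("softmax winner:", max(soft, key=soft.get))
```

&lt;p&gt;Under sigmoid the winning concept clears 0.5 easily. Under softmax the same concept is still the clear winner, but its share of the total lands well below 0.5, so an absolute threshold reads it as silence.&lt;/p&gt;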

&lt;p&gt;That's why every query routed to IDK. The new encoder was firing the right concept, with appropriate confidence relative to everything else, and the dispatcher was reading those activations as "nothing is firing." The encoder had gotten 24x better at picking the right answer, and the system above it couldn't hear it.&lt;/p&gt;

&lt;h2&gt;The Wrong Fix&lt;/h2&gt;

&lt;p&gt;The first instinct was rescaling. If 0.249 is the new "high," divide every threshold by 2. Done. Ship.&lt;/p&gt;

&lt;p&gt;We tried it. It half-worked. Greeting handlers fired correctly on greetings. Identity handlers fired correctly on identity questions. But the dispatcher started cross-firing on everything else - questions about emotions would route to identity, questions about objects would route to physics. We'd swapped one calibration problem for another.&lt;/p&gt;

&lt;p&gt;The reason: rescaling treats softmax outputs as if they were sigmoid outputs that happen to live in a different range. They aren't. A 0.249 firing on the new encoder isn't "the concept is 49.8% present" - it's "this concept is the most likely interpretation, with this much margin over the next-best." The number means a different thing than it did before. Rescaling fixes the magnitude. It doesn't fix the meaning.&lt;/p&gt;

&lt;p&gt;That's the harder truth about this kind of integration: when an upstream component changes how it represents information, every downstream component that interprets that information has to be rewritten, not retuned.&lt;/p&gt;

&lt;h2&gt;The Right Fix&lt;/h2&gt;

&lt;p&gt;The dispatcher had been asking the wrong shape of question. It was asking &lt;em&gt;"is concept X firing strongly enough?"&lt;/em&gt; - an absolute threshold question. With softmax outputs, that question doesn't have a meaningful answer. The right shape is &lt;em&gt;"is concept X the dominant signal, and by how much?"&lt;/em&gt; - a relative comparison.&lt;/p&gt;

&lt;p&gt;The rewrite turned every threshold into a ranking check plus a margin check. Instead of &lt;em&gt;"greeting &amp;gt; 0.5,"&lt;/em&gt; the rule became &lt;em&gt;"greeting is in the top-3 fired concepts AND its activation is at least 2x the next-best non-greeting concept."&lt;/em&gt; Instead of &lt;em&gt;"identity &amp;gt; 0.7,"&lt;/em&gt; the rule became &lt;em&gt;"identity dominates the top of the activation distribution."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The numbers in the new rules aren't thresholds in the old sense. 2x margin, top-3 rank, dominance-by-ratio - these all describe the &lt;em&gt;shape&lt;/em&gt; of the activation distribution, not its absolute values. They survive future encoder changes the way the old thresholds didn't, because they're asking about the encoder's confidence relative to itself, not about a number that means something only on this specific encoder.&lt;/p&gt;
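&lt;p&gt;As a rough sketch of what a rank-plus-margin rule can look like (this is not the dispatcher's actual code): the function name, the idea of a route "family" of related concepts, and the sample activations below are all hypothetical. The rule only looks at where the target sits in the ranking and how far it stands above the best concept outside its own family.&lt;/p&gt;

```python
def route_fires(activations, target, family, top_k=3, margin=2.0):
    """True when `target` is in the top_k and is at least `margin` times the best rival.

    `family` is the set of concepts that belong to this route; they are not
    counted as rivals. All names here are illustrative assumptions.
    """
    ranked = sorted(activations, key=activations.get, reverse=True)
    if target not in ranked[:top_k]:
        return False
    rivals = [score for name, score in activations.items() if name not in family]
    best_rival = max(rivals, default=0.0)
    return activations[target] >= margin * best_rival

# Hypothetical softmax activations for the input "hello".
acts = {"hello": 0.26, "greeting": 0.249, "question": 0.05, "self": 0.02}
greeting_family = {"greeting", "hello"}

print(route_fires(acts, "greeting", greeting_family))  # True
print(acts["greeting"] > 0.5)                          # False: the old-style check
```

&lt;p&gt;Nothing in the rule depends on the absolute scale of the activations, only on their order and ratios.&lt;/p&gt;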

&lt;p&gt;The cutover was one commit. Every dispatch rule was rewritten. Backups were taken of the dispatcher state and the live conversation memory. Then we ran the test panel. Before:&lt;/p&gt;
&lt;pre&gt;you &amp;gt; hello
origin &amp;gt; i don't know

you &amp;gt; what is your name
origin &amp;gt; i don't know

you &amp;gt; how does ice float
origin &amp;gt; i don't know&lt;/pre&gt;


&lt;p&gt;and after&lt;/p&gt;

&lt;pre&gt;you &amp;gt; hello
origin &amp;gt; hello.

you &amp;gt; what is your name
origin &amp;gt; my name is origin.

you &amp;gt; how does ice float
origin &amp;gt; ice is less dense than water, so it floats.&lt;/pre&gt;

&lt;p&gt;The new encoder is now live. The system runs end-to-end. The first two developmental tiers - basic conversation and elementary reasoning - are at 95.5% and 86.5% on the honest test panels.&lt;/p&gt;

&lt;h2&gt;What the Whole Arc Was About&lt;/h2&gt;

&lt;p&gt;Looking back at Parts 9 through 12 as a single sequence, the arc is about the discipline of finding the right bottleneck.&lt;/p&gt;

&lt;p&gt;Part 9 said the bottleneck was data. We executed a careful plan to feed the encoder properly. Part 10 said the data plan didn't work - the abort condition triggered, and we listened. Part 11 said the bottleneck was architecture. The sandbox confirmed it. Part 12 says that even after fixing the right bottleneck, you still have to integrate the fix into the rest of the system, and integration is its own kind of work.&lt;/p&gt;

&lt;p&gt;None of this is glamorous. It's not a "we achieved AGI" post. It's the slow, uneventful, mostly-correct version of how a model actually gets built: hypothesize a bottleneck, design a plan with a written-down abort condition, execute the plan, listen to what happens, do the next thing the evidence points at. Repeat until something actually works. Then integrate it without breaking everything around it.&lt;/p&gt;

&lt;p&gt;The encoder we're running today is the third major iteration since we started. The dispatcher we're running today is the second. There will be more. Every component in this system has been the bottleneck at some point, and every component will be the bottleneck again. The job isn't to design the perfect system on day one. The job is to keep finding what's actually broken and fixing that thing, one bottleneck at a time, with abort conditions written in advance so a result you wanted to see doesn't become the result you accept.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The encoder works. The dispatcher works. The first two tiers hold. The third tier - middle-school content across math, science, and history - is where the project goes next, and it's the tier that tests whether everything we've built so far actually generalizes.&lt;/p&gt;

&lt;p&gt;There's a hypothesis we're testing alongside it: that the next bottleneck isn't going to be more concepts, but the relationships between concepts. A model can know "dog" and "animal" and "four legs" and "barks" as four separate concepts and still not understand what a dog is. Understanding might live in the connections, not the nodes.&lt;/p&gt;

&lt;p&gt;If that's right, the next architecture pivot is already visible on the horizon. If it isn't, we'll find out quickly and write that post too.&lt;/p&gt;

&lt;p&gt;One guy. One GPU. One $1,800 computer in Arizona. Still building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;
Origin is developed at Fallen Angel Systems with the Genesis framework — NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. &lt;strong&gt;Defense. Offense. Creation.&lt;/strong&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/fas-judgement-oss" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/guardian-python" rel="noopener noreferrer"&gt;Guardian on GitHub&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
Questions or consulting inquiries: &lt;a href="mailto:josh@fallenangelsystems.com"&gt;josh@fallenangelsystems.com&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Origin Part 11: The Architecture Was the Lever</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 18 May 2026 13:00:35 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-11-the-architecture-was-the-lever-oab</link>
      <guid>https://dev.to/jtil4201/origin-part-11-the-architecture-was-the-lever-oab</guid>
      <description>&lt;h2 id="the-data-plan-didnt-move-the-encoder-the-architecture-sandbox-did"&gt;The data plan didn't move the encoder. The architecture sandbox did.&lt;/h2&gt;
&lt;p&gt;Part 10 ended with the abort condition triggering: top1 of 1.3% on held-out probes meant the architecture, not the data, was the bottleneck. The plan said "design contrastive next." We built a sandbox first.&lt;/p&gt;
&lt;h2 id="the-sandbox"&gt;The Sandbox&lt;/h2&gt;
&lt;p&gt;150 random concepts spread across six domains. The same training data filtered to that slice. The same held-out probe battery. Five concept_head variants tested side-by-side:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;baseline_flat_bce&lt;/strong&gt;: current architecture (flat MLP + per-slot binary cross-entropy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;contrastive&lt;/strong&gt;: same MLP, but cross-entropy over the full concept space (the target must dominate every other concept, not just exceed a threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tree_hierarchical&lt;/strong&gt;: predict domain first, then concept within domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;domain_routed&lt;/strong&gt;: soft-gate trunk features through per-domain sub-heads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;contrastive_tree&lt;/strong&gt;: hybrid of tree structure plus contrastive global cross-entropy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All five shared the warm-started trunk. All had comparable parameter counts. The only difference was head topology and loss. The first round trained with the trunk frozen, head only, so any difference traced to the architectural choice, not optimization noise.&lt;/p&gt;
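&lt;p&gt;To make the loss difference between the baseline and the contrastive variant concrete, here is a small stdlib-only sketch. The logits, the 1000-slot size, and the mean reduction over slots are assumptions, not the trainer's real settings. It compares per-slot binary cross-entropy with cross-entropy over the whole concept space on a case where the target narrowly loses to one distractor.&lt;/p&gt;

```python
import math

def per_slot_bce(logits, target):
    # Flat baseline: every slot is an independent yes/no problem (mean over slots).
    loss = 0.0
    for i, z in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-z))
        loss += -math.log(p) if i == target else -math.log(1.0 - p)
    return loss / len(logits)

def global_cross_entropy(logits, target):
    # Contrastive variant: the target must win against every other slot at once.
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_total - logits[target]

# Hypothetical logits over 1000 concepts: the target (index 0) barely trails one distractor.
logits = [-3.0, -2.9] + [-6.0] * 998

print(per_slot_bce(logits, 0))          # small: the many easy negatives dominate the mean
print(global_cross_entropy(logits, 0))  # large: the target is not dominating the space
```

&lt;p&gt;Under these assumptions the per-slot loss looks nearly solved even though the target is not ranked first, while the global loss stays high until the target actually dominates.&lt;/p&gt;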
&lt;h2 id="the-frozen-ceiling"&gt;The Frozen Ceiling&lt;/h2&gt;
&lt;p&gt;The best variant under frozen trunk hit 10% top1. The baseline hit 4.5%. Real differentiation, but no variant came close to a useful threshold.&lt;/p&gt;
&lt;p&gt;The reading: the trunk's 256-dimensional feature output was the ceiling. The trunk had been warm-started from v1 then trained on the new data, but it had never been shaped to discriminate 3687 concepts. No head topology could extract signal that wasn't there. Every variant was trying to read meaning from a representation that hadn't learned to encode it.&lt;/p&gt;
&lt;p&gt;Before scaling further, we set a pre-defined pass criterion: &lt;em&gt;top1 at or above 30% AND cross_fire under 30 at 500 concepts means "this is the architecture to scale."&lt;/em&gt; Hold the gate. Don't let a result you wanted to see become the result you accept.&lt;/p&gt;
&lt;h2 id="unfreezing-the-trunk"&gt;Unfreezing the Trunk&lt;/h2&gt;
&lt;p&gt;One change: let the trunk co-adapt to the head's loss at a lower learning rate (1e-4 trunk vs 3e-4 head). Same data. Same head. Same epochs.&lt;/p&gt;
&lt;p&gt;contrastive_tree at 30 epochs on 500 concepts: 28.4% top1 with cross_fire of 48.27. Short of the criterion on both: top1 under 30%, cross_fire well over 30. The pattern said "more epochs needed." Larger concept counts take longer to converge. At 60 epochs: 34.8% top1, 26.26 cross_fire. Both criteria met, no other tuning needed.&lt;/p&gt;
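&lt;p&gt;The gate itself is simple enough to write down as a check. This is a sketch with hypothetical names, fed with the two runs reported above:&lt;/p&gt;

```python
def passes_gate(top1_pct, cross_fire, min_top1=30.0, cross_fire_limit=30.0):
    # Both conditions must hold: top1 at or above 30% AND cross_fire under 30.
    return top1_pct >= min_top1 and cross_fire_limit > cross_fire

# epochs -> (top1 %, cross_fire) for contrastive_tree on 500 concepts
runs = {30: (28.4, 48.27), 60: (34.8, 26.26)}

for epochs, (top1, cross_fire) in runs.items():
    print(epochs, passes_gate(top1, cross_fire))
# 30 False
# 60 True
```

&lt;p&gt;Writing the criterion as code before the run is one way to keep the threshold from drifting after the numbers come in.&lt;/p&gt;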
&lt;p&gt;Architecture locked: contrastive_tree + unfrozen trunk + 60 epochs.&lt;/p&gt;
&lt;h2 id="production-retrain"&gt;Production Retrain&lt;/h2&gt;
&lt;p&gt;Same architecture, scaled to all 3687 concepts. 145,000 training pairs (the natural-positive corpus from Part 9 plus everything else in Phase A). 65 minutes on the 4070.&lt;/p&gt;
&lt;p&gt;Phase 5 probe battery on the new encoder, same 50 random concepts as the baseline. The number that matters most: under the old encoder, the right answer was outvoted 14-to-1 by distractors. Under the new architecture, the target concept dominates by 22x. That's not gradual improvement. That's a different machine.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
 &lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;New Architecture&lt;/th&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;top1&lt;/td&gt;
&lt;td&gt;1.3%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;top3&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;target activation&lt;/td&gt;
&lt;td&gt;0.012&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.249&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;target / 2nd-best ratio&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.67&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Top1 went from 1.3% to 31.3%. Top3 from 4% to 50%. None of the absolute Phase 5 success gates are met yet (the plan's full-victory marks were 70% top1, 0.7 target activation, cross_fire under 2). But every metric moved dramatically in the right direction, and the lever was exactly where the plan said it would be if the data hypothesis failed.&lt;/p&gt;
&lt;h2 id="what-comes-next"&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;The new encoder is staged on disk but not yet live. Swapping it in turns out to be more than a file rename. The dispatcher that turns concept activations into responses was built around the old encoder's sigmoid output range. The new encoder uses softmax. Without changes, every query would route to IDK. We'll cover the fix in Part 12.&lt;/p&gt;





</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>olt1</category>
      <category>genesisframework</category>
    </item>
    <item>
      <title>Origin Part 10: The Plan Didn't Work</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 11 May 2026 23:13:42 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-10-the-plan-didnt-work-4jan</link>
      <guid>https://dev.to/jtil4201/origin-part-10-the-plan-didnt-work-4jan</guid>
      <description>&lt;h2&gt;We executed the plan exactly as written. The encoder still couldn't tell concepts apart.&lt;/h2&gt;
&lt;p&gt;Part 9 ended with 94,000 natural-context pairs wired into the trainer and a clean execution of every phase gate. We had three times the data. The hypothesis was about to be tested.&lt;/p&gt;

&lt;h2&gt;Phase 4: The Retrain&lt;/h2&gt;
&lt;p&gt;The full joint retrain ran clean. The loss curve descended monotonically. The encoder's healthy concept count went from 84 to 107, measured by an internal probe of about 30 hand-crafted queries that exercise common concepts.&lt;/p&gt;

&lt;p&gt;+23 healthy concepts, +27% relative. We were cautiously optimistic. The trainer's audit is a small set of probes and "healthy" only counts the concepts those probes happen to test. The real validation was Phase 5.&lt;/p&gt;

&lt;h2&gt;Phase 5: The Probe Battery&lt;/h2&gt;
&lt;p&gt;The plan's success metric was specific. Random-sample 50 V2C concepts. Ask gemma to generate three short held-out sentences mentioning each one (verified not to appear in the training corpus). Run them through the encoder. Measure four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;top1 accuracy&lt;/strong&gt;: does the encoder rank the target concept first?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;top3 accuracy&lt;/strong&gt;: is the target in the top three?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;target activation&lt;/strong&gt;: how strongly does the target itself fire?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cross_fire&lt;/strong&gt;: how many other concepts fire above threshold?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The pre-defined success gates were top1 at or above 70%, target_act at or above 0.7, cross_fire under 2.0.&lt;/p&gt;
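&lt;p&gt;For readers who want the four metrics pinned down, here is one way they could be computed. It is a sketch, not the project's probe harness: the function name, the data layout, the sample probes, and the 0.5 firing threshold are all assumptions.&lt;/p&gt;

```python
def probe_metrics(probes, fire_threshold=0.5):
    """probes: list of (target_concept, {concept: activation}) pairs."""
    top1 = top3 = 0
    target_act = cross_fire = 0.0
    for target, acts in probes:
        ranked = sorted(acts, key=acts.get, reverse=True)
        top1 += ranked[0] == target       # target ranked first
        top3 += target in ranked[:3]      # target among the top three
        target_act += acts.get(target, 0.0)
        # Count every non-target concept that fires above the threshold.
        cross_fire += sum(
            1 for name, a in acts.items() if name != target and a > fire_threshold
        )
    n = len(probes)
    return {
        "top1": top1 / n,
        "top3": top3 / n,
        "target_act": target_act / n,
        "cross_fire": cross_fire / n,
    }

# Two made-up probes: one clean hit, one where distractors outvote the target.
probes = [
    ("dog", {"dog": 0.9, "cat": 0.6, "tree": 0.1}),
    ("ice", {"water": 0.8, "cold": 0.7, "ice": 0.2, "fire": 0.1}),
]
print(probe_metrics(probes))
```

&lt;p&gt;On these two probes the target is first once, is in the top three both times, and an average of 1.5 distractors fire per probe.&lt;/p&gt;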

&lt;p&gt;The result on the freshly-retrained encoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;top1: 1.3%&lt;/li&gt;
&lt;li&gt;top3: 4.0%&lt;/li&gt;
&lt;li&gt;target_act: 0.086&lt;/li&gt;
&lt;li&gt;cross_fire: 11.92&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We ran it twice.&lt;/p&gt;

&lt;p&gt;Roughly one probe in 75 ranked its target first. The encoder fired on twelve wrong concepts per probe, on average. Target activation was under nine percent. When we handed the encoder the exact sentence it should have been designed to recognize, it barely registered the right answer.&lt;/p&gt;

&lt;p&gt;The plan had executed exactly as written and not moved the encoder.&lt;/p&gt;

&lt;h2&gt;What That Meant&lt;/h2&gt;
&lt;p&gt;This is the place in the post where it would be easy to say something exculpatory: "the data work wasn't wasted" or "we learned something." Both are true. But the cleaner reading is that we were wrong about the bottleneck. We had thought the encoder was data-starved. The earlier sandbox at 10-concept scale had shown data could lift top1 from 33% to 80%. We assumed that signal would transfer to 3687 concepts.&lt;/p&gt;

&lt;p&gt;It didn't.&lt;/p&gt;

&lt;p&gt;We had built the plan with an explicit abort condition for exactly this case: if Phase 5 returns top1 below 50% on held-out probes, the architecture is the bottleneck, not the data. Design contrastive next.&lt;/p&gt;

&lt;p&gt;1.3% triggered it.&lt;/p&gt;

&lt;p&gt;The data work wasn't wasted. We needed the data anyway, and the elaboration corpus is now properly structured for whatever the next model wants to do with it. But it wasn't the lever. Something else was.&lt;/p&gt;

&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;The abort condition pointed at architecture. The encoder's concept_head, the part that maps general features to per-concept activations, was a flat MLP trained with multi-label binary cross-entropy. Every concept slot had to learn its own discriminator independently against roughly 3686 others. At 327 concepts (the v1 vocab) this had worked. At 3687 it had been quietly failing the whole time.&lt;/p&gt;

&lt;p&gt;The next move: build a sandbox, test multiple head architectures against the same data, let the numbers pick the winner. No production changes until something actually beats the baseline on Phase 5.&lt;/p&gt;

&lt;p&gt;Hypothesis tests fail more usefully than hypothesis confirmations. We'd just gotten one of the more useful failures.&lt;/p&gt;


</description>
      <category>security</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Origin Part 9: The Data Plan</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 04 May 2026 14:00:22 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-9-the-data-plan-5463</link>
      <guid>https://dev.to/jtil4201/origin-part-9-the-data-plan-5463</guid>
      <description>&lt;h2 id="857-of-concepts-were-data-starved-that-was-the-problem-what-happened-next-taught-us-something-about-the-problem-itself"&gt;85.7% of concepts were data-starved. That was the problem. What happened next taught us something about the problem itself.&lt;/h2&gt;
&lt;p&gt;OLT-1 is a concept-based AI that understands language without tokenization. Characters go in. Concepts come out. The encoder is what makes that mapping. If it can't reliably tell concepts apart, nothing downstream works.&lt;/p&gt;
&lt;p&gt;Part 8 left the encoder firing on too many slots per query. The concept space was crowded and noisy. A sandbox experiment had already shown that the same architecture could lift top1 from 33% to 80% just by feeding it richer data. Same model. Different food.&lt;/p&gt;
&lt;p&gt;That made the next move obvious: stop tuning the encoder and feed it properly. We wrote a plan and built a scope fence around it.&lt;/p&gt;
&lt;h2 id="the-plan"&gt;The Plan&lt;/h2&gt;
&lt;p&gt;One sentence: every V2C concept has at least 30 natural-context positives before any further retrain. Not WordNet glosses. Not template sentences. Real text from books or Wikipedia where the concept is used naturally.&lt;/p&gt;
&lt;p&gt;The scope fence was strict: no hard-negative tuning, no decoder dispatch guards, no architecture changes, no tier-test-specific quick fixes. All things we'd been tempted to try in past sessions and would not be trying in this one.&lt;/p&gt;
&lt;p&gt;Three data phases with gates: coverage audit, source expansion if needed, then per-concept generation. Then retrain. Then probe.&lt;/p&gt;
&lt;h2 id="phase-1-coverage-audit"&gt;Phase 1: Coverage Audit&lt;/h2&gt;
&lt;p&gt;We walked the existing data: book ingestion proposals, elaboration candidates, the grounding cache. Counted how many natural-context sentences each of the 3687 concepts had.&lt;/p&gt;
&lt;p&gt;The number: 3158 concepts (85.7%) below the threshold. Most were stuck in the 10-29 range. Some data, but not enough.&lt;/p&gt;
&lt;h2 id="phase-2-source-expansion"&gt;Phase 2: Source Expansion&lt;/h2&gt;
&lt;p&gt;We tagged every concept with one of 17 domain labels using gemma-2-9b, built a Wikipedia full-article adapter, and routed encyclopedic concepts (biology, science, physics, history) to Wikipedia and conversational concepts (emotion, self_state, language) to Gutenberg fiction.&lt;/p&gt;
&lt;p&gt;The first run came back thin. Wikipedia and Gutenberg both produced fewer candidates than expected, and the per-domain medians barely moved. Most of the new positives went to common quantifier words: some, many, all. The ones most likely to be the only known concept in any given sentence.&lt;/p&gt;
&lt;p&gt;That last detail was the clue.&lt;/p&gt;
&lt;h2 id="the-rule-that-was-right-and-wrong"&gt;The Rule That Was Right and Wrong&lt;/h2&gt;
&lt;p&gt;The book ingestion pipeline has a strict rule: if a sentence mentions more than one known concept, drop it. The rule was correct for the original use case. You never want to assign the wrong concept to a sentence. But it was actively working against us here. The sentences we needed most, ones like "the cell membrane regulates what enters the cell," got dropped because they mention two concepts.&lt;/p&gt;
&lt;p&gt;We almost missed it. The clue was that common quantifier words kept getting the new positives. They're the ones most likely to appear alone in a sentence. The interesting concepts, the semantically rich ones, were still getting filtered out at every pass.&lt;/p&gt;
&lt;p&gt;We sandboxed a relaxed variant: assign multi-concept sentences to the least-common concept. The argument is information-theoretic. Rare concepts gain more from each new positive. A 50-sample spot-check came back 88% good, 12% defensible-either-way, 0% wrong. We shipped it as a separate file. The original strict rule still serves Discovery unchanged.&lt;/p&gt;
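&lt;p&gt;The relaxed rule can be sketched in a few lines. This is an illustration, not the shipped file: it assumes "least-common" means the concept with the fewest positives collected so far, and the counts, names, and alphabetical tie-break are made up.&lt;/p&gt;

```python
from collections import Counter

def assign_sentence(concepts_in_sentence, positive_counts):
    """Pick the concept that currently has the fewest positives; ties break alphabetically."""
    if not concepts_in_sentence:
        return None
    return min(sorted(concepts_in_sentence), key=lambda c: positive_counts[c])

# Hypothetical running totals of natural-context positives per concept.
counts = Counter({"cell": 41, "membrane": 7, "some": 350})

# "the cell membrane regulates what enters the cell" mentions two known concepts.
sentence_concepts = {"cell", "membrane"}
chosen = assign_sentence(sentence_concepts, counts)
counts[chosen] += 1
print(chosen, counts[chosen])  # membrane 8
```

&lt;p&gt;Under the strict rule this sentence would have been dropped. Here it goes to the concept that is shortest on data.&lt;/p&gt;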
&lt;p&gt;181 more concepts crossed the threshold. Per-domain medians moved up three to five positives across the board.&lt;/p&gt;
&lt;h2 id="phase-3-generation"&gt;Phase 3: Generation&lt;/h2&gt;
&lt;p&gt;The last data step. We pulled everything together: book ingestion proposals, elaboration candidates, and the new Path B output. One training file: 94,000 natural-context pairs covering 96.7% of the vocabulary. Wired into the encoder trainer's Phase A data list.&lt;/p&gt;
&lt;p&gt;The trainer now had three times its previous data. Every phase gate had passed. We hit run on the retrain and set a timer. 65 minutes later, we'd know if the data had been the problem all along.&lt;/p&gt;





</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>olt1</category>
      <category>genesisframework</category>
    </item>
    <item>
      <title>Origin Part 8: Four Wrong Turns Before the Breakthrough</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Fri, 01 May 2026 19:17:07 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-8-four-wrong-turns-before-the-breakthrough-1jbp</link>
      <guid>https://dev.to/jtil4201/origin-part-8-four-wrong-turns-before-the-breakthrough-1jbp</guid>
      <description>&lt;h2 id="we-rewrote-the-decoder-four-times-in-one-day-only-the-last-one-understood-anything"&gt;We rewrote the decoder four times in one day. Only the last one understood anything.&lt;/h2&gt;
&lt;p&gt;Part 7 ended with "how are you" returning "i don't know" while our tier tests reported 100% pass. Everything was green. The model was broken. The disconnect between those two facts defined the day.&lt;/p&gt;
&lt;p&gt;Here's the actual arc.&lt;/p&gt;
&lt;h2 id="wrong-turn-1-retrieval"&gt;Wrong Turn 1: Retrieval&lt;/h2&gt;
&lt;p&gt;The first attempt was retrieval. We built five decoder candidates, sandbox-tested them against 400 dialogue pairs, and a retrieval-based decoder won cleanly. F1 of 0.246 against the next-best 0.024. Four out of five break tests passed. It was 1,300x faster than the teacher. We wrote a "winner" memory and committed the code.&lt;/p&gt;
&lt;p&gt;Josh looked at it and said: retrieval is scripting. Origin isn't supposed to look up pre-written answers. It's supposed to generate them from understood concepts.&lt;/p&gt;
&lt;p&gt;He was right. Retrieval wins F1 against memorized responses because retrieval &lt;em&gt;is&lt;/em&gt; memorization - it just renames the table. A query comes in, find the closest stored response, return it. That passes a test suite built from the same responses. It doesn't understand anything.&lt;/p&gt;
&lt;p&gt;We deleted the sandbox, deleted the memory, and backed up to try again.&lt;/p&gt;
&lt;h2 id="wrong-turn-2-template-heads"&gt;Wrong Turn 2: Template Heads&lt;/h2&gt;
&lt;p&gt;The second attempt was template-based heads. Each head was a tiny specialist - one for self-identity, one for emotion, one for acknowledgements, one for counting. Each had a list of text patterns it matched, and each produced a hard-coded response when its pattern fired.&lt;/p&gt;
&lt;p&gt;Four Tier 1 heads, then four Tier 2 heads. Multi-step composer for compound requests. It was clean. It was fast. And it passed Tier 1 at 100% out of the gate.&lt;/p&gt;
&lt;p&gt;Then Josh tried to talk to it.&lt;/p&gt;
&lt;pre&gt;you &amp;gt; how are you
origin &amp;gt; i don't know

you &amp;gt; what do you know
origin &amp;gt; i don't know

you &amp;gt; how are you doing today
origin &amp;gt; i don't know&lt;/pre&gt;
&lt;p&gt;His response: "it feels like it isn't understanding language, it's just repeating patterns."&lt;/p&gt;
&lt;p&gt;That was the pivot of the day.&lt;/p&gt;
&lt;p&gt;The head code looked like this:&lt;/p&gt;
&lt;p&gt;if "hello" in text: return "hello."&lt;br&gt;if "what is your name" in text: return "my name is origin."&lt;/p&gt;
&lt;p&gt;The encoder might as well not exist. Every decision was a text substring match. Tier 1 at 100% was a pattern-matcher passing tests designed by the same pattern-matcher. "how are you" wasn't in any pattern list, so the decoder fell through to "i don't know" - not because Origin didn't know, but because no head had that phrase in its dictionary.&lt;/p&gt;
&lt;p&gt;We'd been calling this concept-driven for weeks. It wasn't. It was text-driven with concepts as decoration.&lt;/p&gt;
&lt;h2 id="wrong-turn-3-actually-concept-driven-but-the-encoder-was-lying"&gt;Wrong Turn 3: Actually Concept-Driven (But the Encoder Was Lying)&lt;/h2&gt;
&lt;p&gt;The third rewrite made dispatch actually concept-driven. Instead of "if 'hello' in text," an Intent would say "fire when the &lt;em&gt;greeting&lt;/em&gt; concept activates." Text would only be consulted inside the response builder for variable slot extraction ("count to N" needs to know what N is). Primary dispatch would be on what the encoder actually understood.&lt;/p&gt;
&lt;p&gt;We ran Discovery against it. Tier 1 dropped from 100% to 43.6%.&lt;/p&gt;
&lt;p&gt;That was the honest number. It was smaller because the pattern-matching wasn't hiding the encoder's gaps anymore.&lt;/p&gt;
&lt;p&gt;The failures were catastrophic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"hello" fired concepts like &lt;em&gt;just_checking&lt;/em&gt;, &lt;em&gt;yellow&lt;/em&gt;, &lt;em&gt;happened&lt;/em&gt;. The &lt;em&gt;greeting&lt;/em&gt; concept didn't fire at all.&lt;/li&gt;
&lt;li&gt;"bye" fired &lt;em&gt;continue&lt;/em&gt; at 0.90. The &lt;em&gt;farewell&lt;/em&gt; concept didn't fire.&lt;/li&gt;
&lt;li&gt;"are you human?" fired &lt;em&gt;consent&lt;/em&gt; at 0.71 and &lt;em&gt;i_am&lt;/em&gt; at 0.75. &lt;em&gt;consent&lt;/em&gt; beat out identity.&lt;/li&gt;
&lt;li&gt;"thank you" fired &lt;em&gt;refuse&lt;/em&gt; at 1.00 and &lt;em&gt;no_choice&lt;/em&gt; at 1.00. Exactly backwards.&lt;/li&gt;
&lt;li&gt;"i am scared" didn't fire &lt;em&gt;scared&lt;/em&gt; at all. It fired &lt;em&gt;learning&lt;/em&gt; and &lt;em&gt;current_state&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The encoder - the part we thought was solid - was broken. Not subtly. On the most basic greetings and emotions.&lt;/p&gt;
&lt;h2 id="the-real-problem-data-was-lying"&gt;The Real Problem: Data Was Lying&lt;/h2&gt;
&lt;p&gt;We went into the encoder's training data and started reading.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;greeting&lt;/em&gt; concept had 15 training examples. All 15 were dictionary definitions. "greeting means salutation." "salutation is another word for greeting." "greeting is a acknowledgment." Not one example paired "hello" with greeting. Not one paired "hi" with greeting. The encoder had been taught what the &lt;em&gt;word&lt;/em&gt; "greeting" means - but never shown that "hello" is an example of one.&lt;/p&gt;
&lt;p&gt;Same for &lt;em&gt;farewell&lt;/em&gt;. Same for &lt;em&gt;scared&lt;/em&gt;. Dictionary definitions, zero usage examples.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;thank_you&lt;/em&gt; concept was worse. 53 of its 55 training examples were sentences like "i will decline your offer" and "would you like refuse?" - labeled as &lt;em&gt;thank_you&lt;/em&gt;. Someone (some script, some generator) had treated "polite refusal" as containing thanks and co-labeled the examples. The encoder learned that &lt;em&gt;thank_you&lt;/em&gt; fires on refusal language. That's why "no" fired &lt;em&gt;thank_you&lt;/em&gt; and "thank you" fired &lt;em&gt;refuse&lt;/em&gt;. The polarity concepts had contaminated each other.&lt;/p&gt;
&lt;p&gt;The v2 encoder was gaslit by bad data and the pattern-matching decoder had been hiding it the whole time.&lt;/p&gt;
&lt;h2 id="the-fix"&gt;The Fix&lt;/h2&gt;
&lt;p&gt;We patched the data. Six new training files in the conversation corpus - 157 natural-usage examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"hello" / "hi" / "hey" / "good morning" → &lt;em&gt;greeting&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"bye" / "goodbye" / "see you later" → &lt;em&gt;farewell&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"thank you" / "thanks" / "much appreciated" → &lt;em&gt;thank_you&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"i am scared" / "i feel angry" / "i'm frustrated" → the right emotion concepts&lt;/li&gt;
&lt;li&gt;"yes" / "okay" / "sure" → &lt;em&gt;yes_choice&lt;/em&gt;, separate from &lt;em&gt;consent&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"no" / "nope" / "not really" → &lt;em&gt;no_choice&lt;/em&gt;, separate from &lt;em&gt;refuse&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Stripped the 53 mislabeled &lt;em&gt;thank_you&lt;/em&gt; entries from the consent-mechanics file. Ran a three-minute retrain.&lt;/p&gt;
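&lt;p&gt;A minimal sketch of the kind of audit that would have flagged the definition-only concepts, assuming training data is a list of (text, concept) pairs. The helper name and the "mentions its own name" heuristic are illustrative assumptions, not Origin's actual tooling, and the heuristic would need a different signal for concepts like &lt;em&gt;thank_you&lt;/em&gt;, whose natural usage contains the name.&lt;/p&gt;

```python
from collections import defaultdict

def definition_only_concepts(examples):
    """Return concepts whose examples all mention the concept's own name.

    Heuristic (an assumption for illustration): a sentence such as
    "greeting means salutation" names the concept, while a usage example
    such as "hello" does not.
    """
    usage_counts = defaultdict(int)
    for text, concept in examples:
        usage_counts[concept] += 0  # make sure every concept is tracked
        words = text.lower().replace(".", " ").split()
        if concept not in words:
            usage_counts[concept] += 1
    return sorted(c for c, n in usage_counts.items() if n == 0)

before = [
    ("greeting means salutation.", "greeting"),
    ("salutation is another word for greeting.", "greeting"),
]
after = before + [("hello", "greeting"), ("good morning", "greeting")]

print(definition_only_concepts(before))  # ['greeting']
print(definition_only_concepts(after))   # []
```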
&lt;p&gt;Audit results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"hello" → &lt;em&gt;greeting=1.00&lt;/em&gt; ✓&lt;/li&gt;
&lt;li&gt;"bye" → &lt;em&gt;farewell=1.00&lt;/em&gt; ✓&lt;/li&gt;
&lt;li&gt;"i am scared" → &lt;em&gt;scared=1.00, i_am=1.00&lt;/em&gt; ✓&lt;/li&gt;
&lt;li&gt;"thank you" → &lt;em&gt;thank_you=1.00&lt;/em&gt;, no refuse cross-fire ✓&lt;/li&gt;
&lt;li&gt;"no" → &lt;em&gt;no_choice=1.00&lt;/em&gt;, no thank_you cross-fire ✓&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall encoder health preserved at 296/305 concepts on the full audit. The patches fixed the broken concepts without damaging anything that had been working.&lt;/p&gt;
&lt;p&gt;Re-ran Discovery against the concept-driven decoder with the patched encoder. Tier 1: 280/280. Tier 2: 137/137. 100% and 100%. Honest this time - every pass was a concept firing correctly and the decoder routing on it. No text-pattern shortcut anywhere.&lt;/p&gt;
&lt;p&gt;Then we opened an interactive chat:&lt;/p&gt;
&lt;p&gt;you &amp;gt; how are you&lt;br&gt;origin &amp;gt; i am doing fine. what would you like to explore?&lt;/p&gt;
&lt;p&gt;The response it wouldn't give in the morning, it gave in the evening. Not because we added "how are you" to a pattern list, but because the encoder now fired &lt;em&gt;question&lt;/em&gt; and &lt;em&gt;self&lt;/em&gt; on that input, and the decoder's concept-driven wellbeing intent matched on those concepts.&lt;/p&gt;
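&lt;p&gt;Roughly, concept-driven routing looks like the sketch below. The intent table, the 0.5 threshold, and the function name are assumptions for illustration; only the concept names and response strings come from the transcripts in this series.&lt;/p&gt;

```python
# Each intent lists the concepts that must fire, not the text that must appear.
INTENTS = [
    ("wellbeing", {"question", "self"},
     "i am doing fine. what would you like to explore?"),
    ("greeting", {"greeting"}, "hello."),
]

def route(activations, threshold=0.5):
    """Pick the first intent whose required concepts all fired."""
    fired = {name for name, score in activations.items() if score >= threshold}
    for intent, required, response in INTENTS:
        if required.issubset(fired):
            return intent, response
    return "unknown", "i don't know"

# The decoder never sees the raw string "how are you", only what the encoder fired.
print(route({"question": 0.97, "self": 0.91}))
print(route({"question": 0.97}))
```

&lt;p&gt;The design point is that the input text never reaches the router, so a passing test can only come from the encoder firing the right concepts.&lt;/p&gt;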
&lt;h2 id="the-unlock-growing-vocabulary-at-runtime"&gt;The Unlock: Growing Vocabulary At Runtime&lt;/h2&gt;
&lt;p&gt;With the decoder honest, we had room to fix the other thing v1 couldn't do: add new concepts without a full retrain.&lt;/p&gt;
&lt;p&gt;This had been v1's bottleneck for weeks. Discovery would propose new concept candidates. The tracking code logged them. But actually &lt;em&gt;teaching&lt;/em&gt; the encoder a new concept required retraining the whole concept_head from scratch, which was expensive enough that proposals piled up unaddressed. Concepts came in faster than the encoder could absorb them.&lt;/p&gt;
&lt;p&gt;The technique we validated today:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Expand the concept_head's final linear layer from N → N+1 outputs&lt;/li&gt;
&lt;li&gt;Copy the first N weight rows unchanged - existing concepts preserved exactly&lt;/li&gt;
&lt;li&gt;Zero-initialize the new row, freeze everything else via gradient masking&lt;/li&gt;
&lt;li&gt;Train only the new row on positives + sampled negatives, 8 epochs, about a minute&lt;/li&gt;
&lt;/ol&gt;
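&lt;p&gt;The four steps above can be sketched in plain Python. This is a toy under stated assumptions: the head is a list of weight rows with sigmoid outputs, the learning rate and feature vectors are made up, and "freezing" is done by only ever writing to the new row, which stands in for the gradient mask a tensor framework would use. It is not Origin's training code.&lt;/p&gt;

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expand_head(weights, biases):
    """Steps 1-3: copy the N existing rows, append one zero-initialized row."""
    new_weights = [row[:] for row in weights] + [[0.0] * len(weights[0])]
    new_biases = biases[:] + [0.0]
    return new_weights, new_biases

def train_new_row(weights, biases, positives, negatives, epochs=8, lr=0.5, seed=0):
    """Step 4: update only the last row; earlier rows are never written to."""
    rng = random.Random(seed)
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    row = weights[-1]
    for _ in range(epochs):
        rng.shuffle(data)
        for features, target in data:
            z = sum(w * f for w, f in zip(row, features)) + biases[-1]
            error = sigmoid(z) - target  # d(binary cross-entropy)/dz
            for i, f in enumerate(features):
                row[i] -= lr * error * f
            biases[-1] -= lr * error
    return weights, biases

def fires(weights, biases, index, features):
    z = sum(w * f for w, f in zip(weights[index], features)) + biases[index]
    return sigmoid(z) >= 0.5

old_w = [[1.0, -1.0, 0.0], [0.0, 2.0, -1.0]]
old_b = [0.0, 0.1]
new_w, new_b = expand_head(old_w, old_b)
positives = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.8]]
negatives = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
train_new_row(new_w, new_b, positives, negatives)

print(new_w[:2] == old_w)  # True: existing rows are unchanged
```

&lt;p&gt;Because the old rows are copied and never updated, zero regression on existing outputs is structural here rather than something the training run has to earn.&lt;/p&gt;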
&lt;p&gt;Sandbox results: 100% recall on the new concept. 0% false positive rate on negatives. Zero regression on the existing concepts.&lt;/p&gt;
&lt;p&gt;We ran it six times in sequence - rainbow, thunder, ocean, mountain, flower, sunset - and the regression stayed at zero all the way through. Each addition cost about 60 seconds.&lt;/p&gt;
&lt;p&gt;v1's bottleneck dissolved. New concepts are now cheap enough to run routinely.&lt;/p&gt;
&lt;h2 id="rainbow"&gt;Rainbow&lt;/h2&gt;
&lt;p&gt;The last thing we did today was integrate a new concept into the live system.&lt;/p&gt;
&lt;p&gt;$ echo '{"name": "rainbow", "response_template": "rainbows are colors of light in the sky.",&lt;br&gt;         "positives": [...]}' | python -m tools.concept_lifecycle draft&lt;br&gt;Drafted: rainbow (pending) — 18 positives&lt;br&gt;&lt;br&gt;$ python -m tools.concept_lifecycle approve rainbow&lt;br&gt;Approved: rainbow&lt;br&gt;&lt;br&gt;$ python -m tools.concept_lifecycle integrate rainbow&lt;br&gt;Integrating concept 'rainbow' (18 positives)&lt;br&gt;  baseline: 296/305 healthy&lt;br&gt;  trained; final_loss=0.144  row=305&lt;br&gt;  new slot: recall=100.0%  fp_rate=0.0%&lt;br&gt;  regression: 0 lost (296 → 296)&lt;br&gt;  persisting encoder checkpoint...&lt;br&gt;  appending 'rainbow' to v2_vocab.py CONCEPTS...&lt;br&gt;  registering decoder intent...&lt;br&gt;  ✓ integrated.&lt;/p&gt;
&lt;p&gt;Origin's vocabulary went from 305 to 306 concepts. The encoder checkpoint was saved with a timestamped backup. The vocab file was updated. The decoder registered the response template.&lt;/p&gt;
&lt;p&gt;Restart and test:&lt;/p&gt;
&lt;p&gt;you &amp;gt; i saw a rainbow&lt;br&gt;origin &amp;gt; rainbows are colors of light in the sky.&lt;br&gt;&lt;br&gt;you &amp;gt; look at that rainbow&lt;br&gt;origin &amp;gt; rainbows are colors of light in the sky.&lt;br&gt;&lt;br&gt;you &amp;gt; hello&lt;br&gt;origin &amp;gt; hello.&lt;/p&gt;
&lt;p&gt;The new concept fires correctly. The 305 original concepts still work. Nothing broke.&lt;/p&gt;
&lt;p&gt;This is what v1 couldn't do. This is why we rebuilt.&lt;/p&gt;
&lt;h2 id="what-the-day-cost"&gt;What the Day Cost&lt;/h2&gt;
&lt;p&gt;Four wrong turns. Retrieval, template heads, concept-driven-but-encoder-broken, then finally the real fix. Each wrong turn looked like success at first - passing tests, clean benchmarks, committed code. The signal that something was wrong came from conversation, not numbers. "it feels like pattern matching." "how are you returns i don't know." The metrics kept saying green while the lived reality said something was off.&lt;/p&gt;
&lt;p&gt;The right turn came from debugging what the encoder actually fires on "hello" - and discovering it had never been taught that "hello" was a greeting. The data layer was upstream of everything. When it lies, every layer above it inherits the lie, and metrics will happily agree.&lt;/p&gt;
&lt;p&gt;What's left: Tier 3 content. Middle-school math, intro science, history, basic coding. The foundation holds; now we grow it. And now that growing the vocabulary costs a minute per concept instead of a full retrain, growing is actually something we can do.&lt;/p&gt;
&lt;p&gt;Origin is 306 concepts tall. The 306th is &lt;em&gt;rainbow&lt;/em&gt;, and it was added while the system was running. The foundation can hold itself.&lt;/p&gt;
&lt;p&gt;Now we build upward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;
Origin is developed at Fallen Angel Systems with the Genesis framework — NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. &lt;strong&gt;Defense. Offense. Creation.&lt;/strong&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/fas-judgement-oss" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/guardian-python" rel="noopener noreferrer"&gt;Guardian on GitHub&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;
Questions or consulting inquiries: &lt;a href="mailto:josh@fallenangelsystems.com"&gt;josh@fallenangelsystems.com&lt;/a&gt;
&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>conceptbasedai</category>
      <category>genesisframework</category>
    </item>
    <item>
      <title>Origin Part 7: We Fired the Teacher</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:46:23 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-7-we-fired-the-teacher-1p21</link>
      <guid>https://dev.to/jtil4201/origin-part-7-we-fired-the-teacher-1p21</guid>
      <description>&lt;h2&gt;
  
  
  We built something to replace the teacher. It worked. Then something else went wrong.
&lt;/h2&gt;

&lt;p&gt;Part 6 ended with a problem we couldn't patch: a token model cannot reliably grade a concept model. The mismatch isn't fixable with a better rubric or a better teacher model. It's architectural.&lt;/p&gt;

&lt;p&gt;So we stopped trying to fix the teacher and built a replacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discovery: The Teacher Replacement
&lt;/h2&gt;

&lt;p&gt;The idea was simple. Instead of asking Gemma to generate questions and grade responses, we'd build a rule-based system that already knew the right answers.&lt;/p&gt;

&lt;p&gt;Each rule is a (pattern, expected response signature) pair. "does ice float?" expects a response containing "float" and "water." "what is your name?" expects a response containing "origin." No LLM anywhere in the loop. No drift. No mode collapse. No token-fluency bias.&lt;/p&gt;
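&lt;p&gt;A minimal sketch of what such a rule table could look like. The two rules are the ones quoted above; the function name, the exact-match lookup, and the substring check are assumptions for illustration, not Discovery's actual code.&lt;/p&gt;

```python
# (prompt pattern, terms the response must contain)
RULES = [
    ("does ice float?", ("float", "water")),
    ("what is your name?", ("origin",)),
]

def grade(prompt, response, rules=RULES):
    """Return True or False for a covered prompt, None when no rule applies."""
    prompt_key = prompt.lower().strip()
    answer = response.lower()
    for pattern, required_terms in rules:
        if pattern == prompt_key:
            return all(term in answer for term in required_terms)
    return None

print(grade("does ice float?", "ice floats. less dense than water."))  # True
print(grade("what is your name?", "i don't know"))                     # False
print(grade("how are you", "i am doing fine."))                        # None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for uncovered prompts keeps "no rule exists" separate from "the answer was wrong", which matters when reading a pass rate.&lt;/p&gt;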

&lt;p&gt;We called it Discovery. We ran the first test.&lt;/p&gt;

&lt;p&gt;The numbers: 0.79 seconds for 180 tests. 94.6% pass rate on Tier 1. Zero duplicates. Zero hallucinations.&lt;/p&gt;

&lt;p&gt;Compare that to Gemma: 20 minutes for 200 rounds, 50%+ duplicates, 65.6% pass rate that was actually measuring fluency, not understanding.&lt;/p&gt;

&lt;p&gt;Discovery was 1,300x faster, cleaner signal, and actually measuring what we cared about. We committed the code. Gemma went into reference-only status. The teacher loop was retired.&lt;/p&gt;
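&lt;p&gt;The figures above don't say whether the speedup compares whole runs or per-item cost. A quick check, assuming a Gemma "round" and a Discovery "test" are comparable units, lands in the same range either way:&lt;/p&gt;

```python
gemma_seconds = 20 * 60      # 20 minutes for 200 rounds
gemma_items = 200
discovery_seconds = 0.79     # 0.79 seconds for 180 tests
discovery_items = 180

per_item_speedup = (gemma_seconds / gemma_items) / (discovery_seconds / discovery_items)
whole_run_speedup = gemma_seconds / discovery_seconds

print(round(per_item_speedup))   # 1367
print(round(whole_run_speedup))  # 1519
```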

&lt;p&gt;Then Discovery exposed the next problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Discovery Actually Exposed
&lt;/h2&gt;

&lt;p&gt;Running clean evaluations against a decoder we thought was "working" revealed something we'd been hiding from ourselves: most of the decoder wasn't understanding at all. It was text-matching.&lt;/p&gt;

&lt;p&gt;The decoder had heads like:&lt;/p&gt;

&lt;p&gt;if "hello" in text: return "hello."&lt;br&gt;
if "what is your name" in text: return "my name is origin."&lt;br&gt;
if "count to three" in text: return "one two three."&lt;/p&gt;

&lt;p&gt;Every "working" response was a text substring lookup. The encoder's concept activations barely influenced routing. Tier 1 and Tier 2 had been passing at 100% on our deterministic suite because the decoder was pattern-matching against the same keyword lists the grader used. A pattern-matcher acing a test written by a pattern-matcher. Circular.&lt;/p&gt;

&lt;p&gt;When you typed "hello," the decoder matched the string "hello" and returned "hello." The encoder might as well not have been there.&lt;/p&gt;

&lt;p&gt;We'd spent weeks calling it concept-driven and it was text-driven with concepts as decoration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment It Broke Open
&lt;/h2&gt;

&lt;p&gt;The way we caught it was anticlimactic. After Discovery reported 100% pass rates, we opened an interactive chat and typed:&lt;/p&gt;

&lt;p&gt;you &amp;gt; how are you&lt;br&gt;
origin &amp;gt; i don't know&lt;/p&gt;

&lt;p&gt;Every tier test had passed. The most basic conversational question failed.&lt;/p&gt;

&lt;p&gt;Why? "how are you" wasn't in any head's pattern list. The encoder might have fired relevant concepts - self, question, state - but the decoder wasn't looking at the encoder. It was scanning the input string for known trigger phrases and hadn't been given that one.&lt;/p&gt;

&lt;p&gt;The 100% had been measuring whether the patterns we'd written matched the patterns we'd tested for. Nothing more.&lt;/p&gt;

&lt;p&gt;That's what Discovery exposed by running clean. And that's the wall v2 had to break through next.&lt;/p&gt;

&lt;p&gt;Part 8 is the day we did.&lt;/p&gt;





</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>genesisframework</category>
      <category>olt1</category>
    </item>
    <item>
      <title>Origin Part 6: The Teacher Kept Breaking</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 27 Apr 2026 16:12:10 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-6-the-teacher-kept-breaking-2mpo</link>
      <guid>https://dev.to/jtil4201/origin-part-6-the-teacher-kept-breaking-2mpo</guid>
      <description>&lt;h2&gt;
  
  
  Every time we fixed the teacher, it broke in a new way.
&lt;/h2&gt;

&lt;p&gt;Part 3 of this series ended on a win. We fixed the rubric, understanding jumped from 28% to 57.8% overnight on the same weights, and we thought the teacher problem was solved.&lt;/p&gt;

&lt;p&gt;It wasn't. That was the first break. There were more coming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break 1: The Model Was Drifting
&lt;/h2&gt;

&lt;p&gt;The rubric fix held for about 25 rounds per session. Then Qwen started forgetting its instructions.&lt;/p&gt;

&lt;p&gt;Drift is what happens when a language model loses the thread of its system prompt over a long context window. The instructions said one concept, max 10 words, 4-year-old vocabulary. By round 31, Qwen was generating things like "Can you elaborate on the thermodynamic properties of phase transitions?" for a model at kindergarten stage.&lt;/p&gt;

&lt;p&gt;We measured it:&lt;/p&gt;

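&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Round Range&lt;/th&gt;&lt;th&gt;Banned Word Rate&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0-24&lt;/td&gt;&lt;td&gt;0%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;25-49&lt;/td&gt;&lt;td&gt;62%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;50-74&lt;/td&gt;&lt;td&gt;71%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;75-99&lt;/td&gt;&lt;td&gt;82%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;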

&lt;p&gt;The fix: cap sessions at 25 rounds. Start fresh every time. Never let the context accumulate enough noise to pull Qwen off course.&lt;/p&gt;
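&lt;p&gt;In code, the cap amounts to chunking a long run into independent sessions. A small sketch, with a hypothetical helper name; each (start, end) range would get a brand-new teacher context with only the system prompt in it:&lt;/p&gt;

```python
def plan_sessions(total_rounds, cap=25):
    """Split a run into (start, end) round ranges, one fresh session each."""
    return [
        (start, min(start + cap, total_rounds))
        for start in range(0, total_rounds, cap)
    ]

print(plan_sessions(60))  # [(0, 25), (25, 50), (50, 60)]
```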

&lt;p&gt;That worked. We moved on. Then it broke again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break 2: The Grading Was Wrong
&lt;/h2&gt;

&lt;p&gt;With session caps in place, we noticed the understanding numbers still felt off. The rubric fix from Part 3 had doubled them on the same weights, but that should have been the floor, not the ceiling. OLT-1 was answering physics questions correctly - "ice floats. less dense than water." - and Qwen was marking those responses down.&lt;/p&gt;

&lt;p&gt;The moment it clicked: Qwen graded "it floats. less dense." as &lt;em&gt;awkward&lt;/em&gt;. Reason field: "incomplete phrasing." Origin had answered a physics question correctly, in the concept-fragment register it speaks in natively. Qwen marked it down for not sounding like a human would say it.&lt;/p&gt;

&lt;p&gt;That wasn't a rubric issue. That was Qwen grading the wrong thing.&lt;/p&gt;

&lt;p&gt;Qwen wasn't grading understanding. Qwen was grading fluency. For a token model, fluency and understanding are correlated enough that this usually works fine. For a concept model that deliberately speaks in fragments, they're not. Every time OLT-1 answered correctly in its natural register, Qwen saw a grammatical failure.&lt;/p&gt;

&lt;p&gt;No amount of CRITICAL FAIRNESS RULES in the rubric closes that gap. The instruction layer said "honest IDK is good, fragments are acceptable" - and Qwen complied when its system prompt was fresh. But the pattern embedded in Qwen's weights was still &lt;em&gt;more fluent is better&lt;/em&gt;, and that pattern crept back in on every grading call.&lt;/p&gt;

&lt;p&gt;We decided to try a different teacher.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break 3: Gemma Runs Out of Ideas
&lt;/h2&gt;

&lt;p&gt;We spent a full day downloading 15 models at 10 Mbps. The Gemma 4 31B alone was 20GB. We tested each one with the same benchmark: 20 questions, score for constraint following, grader accuracy on 6 curated edge cases, and drift behavior.&lt;/p&gt;

&lt;p&gt;Most failed immediately. The clear winner was google/gemma-2-9b.&lt;/p&gt;

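&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Mistral 7B&lt;/th&gt;&lt;th&gt;Gemma 2 9B&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Grader accuracy&lt;/td&gt;&lt;td&gt;3/6&lt;/td&gt;&lt;td&gt;6/6&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Vocab score&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;First drift&lt;/td&gt;&lt;td&gt;Round 25&lt;/td&gt;&lt;td&gt;Round 31&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Peak drift&lt;/td&gt;&lt;td&gt;82%&lt;/td&gt;&lt;td&gt;45%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;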

&lt;p&gt;Switching from Qwen to Gemma, same OLT-1 weights, understanding jumped from 0% to 29.3%. Qwen had been so broken it was hiding real capability the whole time.&lt;/p&gt;

&lt;p&gt;We thought we were done. Then we ran 200 rounds.&lt;/p&gt;

&lt;p&gt;Real attempts: 26 out of 200. The other 174 were duplicates.&lt;/p&gt;

&lt;p&gt;Gemma generated exactly 26 unique Tier 1 questions and then spent 174 rounds trying to regenerate them. "Is the sky blue?" appeared three times. "Are you happy?" appeared three times. "Is water wet?" appeared three times. By chunk 3, Gemma had exhausted its natural variety. Every subsequent attempt hit the deduplication filter.&lt;/p&gt;

&lt;p&gt;We added category rotation - forcing Gemma to cycle through subcategories instead of defaulting to whatever was easiest to generate. Real attempts jumped from 26 to 135 out of 200.&lt;/p&gt;
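&lt;p&gt;A sketch of rotation plus deduplication, assuming the teacher is wrapped in a &lt;code&gt;generate(category)&lt;/code&gt; callable. That callable, the normalization rule, and the function names are hypothetical stand-ins, not the actual teacher loop.&lt;/p&gt;

```python
from itertools import cycle

def normalize(question):
    """Collapse case, spacing, and trailing punctuation so near-copies match."""
    return " ".join(question.lower().strip(" ?!.").split())

def collect_unique(generate, categories, rounds):
    """Ask for one question per round, rotating categories, keeping new ones."""
    seen = set()
    unique = []
    rotation = cycle(categories)
    for _ in range(rounds):
        category = next(rotation)
        question = generate(category)  # hypothetical teacher call
        key = normalize(question)
        if key in seen:
            continue  # a duplicate costs a round but adds nothing
        seen.add(key)
        unique.append((category, question))
    return unique
```

&lt;p&gt;The ratio &lt;code&gt;len(unique) / rounds&lt;/code&gt; is the "real attempts" number: 26 out of 200 before rotation, 135 out of 200 after.&lt;/p&gt;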

&lt;p&gt;Better. Still reporting 65.6% understanding when deterministic testing said 97-100%.&lt;/p&gt;

&lt;p&gt;Something structural was wrong. Not with the rubric, not with the model, not with session length or category rotation.&lt;/p&gt;

&lt;p&gt;With the whole approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem We Couldn't Patch
&lt;/h2&gt;

&lt;p&gt;A token model evaluates text. OLT-1 understands concepts. Those aren't the same thing, and no amount of rubric tuning closes that gap.&lt;/p&gt;

&lt;p&gt;Gemma expected fluent complete sentences. OLT-1 produces concept-grounded fragments. Gemma expected answers to cover every part of a compound question. OLT-1 answers the part it knows and says "i don't know" for the rest. Gemma graded OLT-1 against token-model expectations, and OLT-1 kept failing token-model expectations while passing concept-model expectations.&lt;/p&gt;

&lt;p&gt;Every fix we applied was patching a symptom. The disease was the mismatch between what was doing the grading and what was being graded.&lt;/p&gt;

&lt;p&gt;We needed a grader that spoke the same language as the model it was grading.&lt;/p&gt;

&lt;p&gt;So we built one.&lt;/p&gt;

&lt;p&gt;That's Part 7.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/judgement" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions or consulting inquiries: &lt;a href="mailto:josh@fallenangelsystems.com"&gt;josh@fallenangelsystems.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>olt1</category>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>genesisframework</category>
    </item>
    <item>
      <title>Origin Part 5: We Threw Out the Decoder</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Fri, 24 Apr 2026 13:06:09 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-5-we-threw-out-the-decoder-193j</link>
      <guid>https://dev.to/jtil4201/origin-part-5-we-threw-out-the-decoder-193j</guid>
      <description>&lt;h2&gt;
  
  
  Monolithic 637K-parameter GRU out. Five tiny specialist heads in. Counting tripled. Physics doubled. No more cliffs.
&lt;/h2&gt;

&lt;p&gt;If you've read Parts 1 through 4, you already know the pattern: when a piece of OLT-1 isn't working, we don't make it bigger. We sandbox-test the alternatives, pick the one that actually wins, and keep what works.&lt;/p&gt;

&lt;p&gt;This is the post where that pattern hit the decoder.&lt;/p&gt;

&lt;p&gt;The decoder was the loudest part of OLT-1 - literally. It's the component that turns concept activations into language. A single GRU, 637,000 parameters, about 40% of OLT-1's entire parameter count. It was carrying the whole "talking" workload for every category: physics explanations, counting answers, emotional responses, classification queries, everything.&lt;/p&gt;

&lt;p&gt;And it kept catastrophically forgetting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every training cycle, the monolithic decoder was effectively trying to relearn English from scratch. You teach it to count better, and its physics answers degrade. You teach it physics, and its conversation loop starts sounding like a textbook. The 22K training pair curriculum retrain we described in Part 4 - the one that dropped pass rate from 45.6% to 31.6% - was the clearest symptom. One big model was trying to do everything, and any update in one domain bled into the others.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with monolithic decoders: they have no internal boundaries. Physics tokens and greeting tokens and counting tokens all share the same GRU cells, the same output head, the same everything. Backprop for one category moves weights for all of them. There's no way to train "just the physics part" because there is no physics part. There's just the decoder.&lt;/p&gt;

&lt;p&gt;We'd been retraining it, patching it, adding replay, adding retention tests, hoping that with enough discipline the forgetting would stay below noise. It never did. The cliffs kept coming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight
&lt;/h2&gt;

&lt;p&gt;Here's what we'd been doing wrong: asking the decoder to relearn English from scratch every time.&lt;/p&gt;

&lt;p&gt;But English already has structure. 26 letters. Words. Grammar. Phrases that get used over and over. The teacher loop (Part 3) had already generated 20,000+ validated good responses sitting in the hippocampus. We'd been treating that hippocampus as a passive memory. But it's also a phrase library. A corpus of things OLT-1 has already said well, indexed by the concepts that triggered them.&lt;/p&gt;

&lt;p&gt;Why was the decoder re-deriving "ice floats because it is less dense than water" from the concept space every time, when we already had that exact sentence stored?&lt;/p&gt;

&lt;p&gt;The decoder didn't need to be a language model. It needed to be a router.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sandbox
&lt;/h2&gt;

&lt;p&gt;Before touching a single production weight, we built &lt;code&gt;sandbox_decoder_approaches.py&lt;/code&gt;. 200 rounds of teacher conversations. Seven decoder strategies running side-by-side, scored on the same corpus.&lt;/p&gt;

&lt;p&gt;The candidates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template + slot-fill&lt;/strong&gt;: parametric sentence shapes with concept-driven slots. Essentially stateless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concept-indexed phrase cache&lt;/strong&gt;: query the hippocampus for the best-matching validated response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbolic builder&lt;/strong&gt;: deterministic rules for short answers ("yes", "no", gratitude, farewells).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Micro-GRU per category&lt;/strong&gt;: one small GRU per decoder category, so physics updates can't touch greeting weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;: try templates first, fall back to GRU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tree composer&lt;/strong&gt;: structural composition from concept parse trees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline monolithic GRU&lt;/strong&gt;: what was running in production. Our control.&lt;/p&gt;

&lt;p&gt;Here's the full sandbox ranking by mean F1:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rank  Decoder              Params    Mean F1   Latency
  1   category_routed      640K      0.608     22ms
  2   gru_baseline         637K      0.558     27ms     ← control
  3   routed_structural    2.3K      0.545      3ms
  4   symbolic             0         0.512      0ms
  5   concept_cache        0         0.479      5ms
  6   pure_structural      2.3K      0.475      5ms
  7   hybrid               640K      0.438      8ms
  8   template_slot_fill   2.3K      0.395      0ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things jumped out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The monolithic GRU alone (#2) was not the best decoder.&lt;/strong&gt; It was beaten by a router that used the GRU only for categories where it genuinely won - a 5-point F1 gap on the same workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Template-only (#8) was the worst.&lt;/strong&gt; This mattered: an earlier "template-only" attempt on March 28 had hit 10.3% accuracy in production. The sandbox replicated that failure. Simpler is not always better. The structure has to match the content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 2.3K-parameter routed_structural (#3) was within about 1 F1 point of the monolithic GRU (0.545 vs 0.558), and about 6 points of the winner.&lt;/strong&gt; For ~0.4% of the parameter count. The GRU was spending 637,000 parameters on roughly a one-point F1 advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Winner
&lt;/h2&gt;

&lt;p&gt;The category-routed architecture won, but not by outperforming the GRU everywhere. It won by being honest about where the GRU actually helped.&lt;/p&gt;

&lt;p&gt;Per-category F1 breakdown showed the GRU had a genuine advantage in five specific categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;physics_question&lt;/strong&gt;: +0.29 vs best non-GRU option&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;self_knowledge&lt;/strong&gt;: +0.21&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;multi_concept&lt;/strong&gt;: +0.08&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;comparison&lt;/strong&gt;: +0.07&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;classification&lt;/strong&gt;: +0.07&lt;/p&gt;

&lt;p&gt;In every other category - greetings, farewells, gratitude, counting, emotional responses, simple conversation - something simpler matched or beat the GRU. The phrase cache won farewells. Templates won greetings. Symbolic rules won clarifications. The GRU was overkill for everything except the five categories where reasoning-heavy outputs actually needed to be composed fresh.&lt;/p&gt;

&lt;p&gt;So Phase 1 of the pivot replaced the monolithic GRU's &lt;em&gt;primary role&lt;/em&gt; with the router, keeping the GRU only for those five categories.&lt;/p&gt;

&lt;p&gt;Then came Phase 2: replace the remaining GRU slots with tiny per-category neural heads.&lt;/p&gt;

&lt;p&gt;Five heads. ~66K parameters each. 328K total - roughly half the monolithic GRU's parameter count, carrying the same specialist workload. Each head only knows one type of response. The physics head knows physics. The counting head knows counting. They can't interfere with each other because there is no shared gradient path between them. Backprop on physics touches exactly 66K parameters and not one more.&lt;/p&gt;

&lt;p&gt;This is the shift, in one sentence: the decoder stopped being one model that does everything, and became a router over a library of small specialists.&lt;/p&gt;
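&lt;p&gt;As a rough sketch of that shape, assuming each specialist is just a callable from fired concepts to text. The class, the stand-in heads, and the fallback are illustrative; the real heads are small neural networks, not string functions.&lt;/p&gt;

```python
class DecoderRouter:
    """Dispatch each category to its own independent specialist."""

    def __init__(self, fallback):
        self.specialists = {}
        self.fallback = fallback

    def register(self, category, specialist):
        # Replacing one entry cannot affect any other entry.
        self.specialists[category] = specialist

    def respond(self, category, concepts):
        specialist = self.specialists.get(category, self.fallback)
        return specialist(concepts)

def greeting_template(concepts):
    return "hello."

def physics_head(concepts):
    # Stand-in for a small neural head; here it just joins concept names.
    return " ".join(sorted(concepts)) + "."

def unknown(concepts):
    return "i don't know"

router = DecoderRouter(fallback=unknown)
router.register("greeting", greeting_template)
router.register("physics_question", physics_head)

print(router.respond("greeting", {"greeting"}))               # hello.
print(router.respond("physics_question", {"ice", "floats"}))  # floats ice.
print(router.respond("poetry", {"rainbow"}))                  # i don't know
```

&lt;p&gt;Retraining or swapping the physics specialist is a single &lt;code&gt;register&lt;/code&gt; call, and the greeting path is untouched by construction.&lt;/p&gt;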

&lt;h2&gt;
  
  
  The Proof
&lt;/h2&gt;

&lt;p&gt;The numbers from the overnight 25-batch teacher run after the cutover:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Counting&lt;/strong&gt;: 17% → 52% good-response rate. Roughly tripled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantity&lt;/strong&gt;: 15% → 52%. Roughly tripled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physics&lt;/strong&gt;: 29% → 52%. Nearly doubled.&lt;/p&gt;

&lt;p&gt;But the bigger result isn't the per-category numbers. It's that the cliffs stopped. Before the cutover, we'd see batch 3 post 33% on classification, batch 4 post 0%. An intervention would land, break something silently, and the failure wouldn't surface until two batches later. That was the failure mode Part 4's retention tests were chasing.&lt;/p&gt;

&lt;p&gt;After the cutover, across 25 batches:&lt;/p&gt;

&lt;p&gt;No more classification/quantity cliffs.&lt;/p&gt;

&lt;p&gt;Stable band of 17-33% good-response rate, instead of spikes and collapses.&lt;/p&gt;

&lt;p&gt;Every evolution cycle that got promoted survived the retention suite.&lt;/p&gt;

&lt;p&gt;When there's no shared gradient path, there's no pathway for quiet damage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Beyond OLT-1
&lt;/h2&gt;

&lt;p&gt;Catastrophic forgetting is the single hardest problem in continual learning. The conventional fix is replay: when you train on new data, mix in old data to keep the model from drifting. It works up to a point, but replay overhead scales badly. At some volume, you're spending most of your training cycles just reminding the model of things it already knew.&lt;/p&gt;

&lt;p&gt;Modular specialists side-step the problem. If category A's weights are physically separate from category B's weights, training on A can't degrade B. You still need a router that picks the right specialist - but routers are cheap, and routing accuracy is a problem humans know how to measure.&lt;/p&gt;

&lt;p&gt;The Origin decoder isn't novel in isolation. Mixture-of-experts architectures have been explored for years. What's novel in context: doing this at 1.7M total parameters. Modular specialist decoders are usually framed as a scale-up technique, a way to get past the point where one giant model fits on one GPU. We're using them the opposite way - as a way to stay small while getting better per-category behavior than a single monolithic model could give us.&lt;/p&gt;

&lt;p&gt;It also compounds with everything else in Origin. The append-only principle from Part 4 works better when adding a new category doesn't require retraining old ones. The consent architecture from Part 2 works better when the refusal path is its own specialist, structurally separable from the answering specialists. The teacher's per-category weakness detection from Part 3 works better when weaknesses route to the heads that own them. The pieces are finally fitting the same shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 3 of the decoder plan hardens the tiny heads' training pipeline so they can be added on demand - the same way Phase 3 of the vocabulary expansion service lets OLT-1 add new concepts without touching old weights. Same principle, different layer.&lt;/p&gt;

&lt;p&gt;Phase 4 is harder: auto-routing decisions based on test-time concept activations, so a concept pattern we haven't seen before picks the closest specialist by similarity rather than a hardcoded category label. That's where the real test of the architecture lives. If it degrades gracefully on unfamiliar inputs, the design is sound. If it collapses to a fallback, we learn something about the category boundaries we drew.&lt;/p&gt;
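&lt;p&gt;One possible shape for that similarity routing, sketched under heavy assumptions: each specialist is summarized by a prototype activation vector, similarity is cosine, and anything below a made-up 0.5 threshold goes to a fallback. None of these choices are confirmed as the Phase 4 design.&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_specialist(activations, prototypes, min_similarity=0.5):
    """Return (specialist name, similarity); fall back when nothing is close."""
    best_name, best_score = "fallback", 0.0
    for name, prototype in prototypes.items():
        score = cosine(activations, prototype)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= min_similarity:
        return best_name, best_score
    return "fallback", best_score

prototypes = {
    "physics_question": [1.0, 0.0, 0.0],
    "self_knowledge": [0.0, 1.0, 0.0],
}
print(nearest_specialist([0.9, 0.2, 0.1], prototypes)[0])  # physics_question
print(nearest_specialist([0.0, 0.0, 1.0], prototypes)[0])  # fallback
```

&lt;p&gt;Logging how often the fallback branch is taken would be one way to measure whether unfamiliar inputs degrade gracefully or collapse.&lt;/p&gt;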

&lt;p&gt;Longer term, the interesting question is how many specialists this architecture can carry. Five heads at 66K parameters is plenty of headroom for OLT-1 at Stage 9. Twenty heads? Fifty? The router's complexity grows linearly; the storage grows linearly. The gradient isolation stays perfect regardless. No fundamental reason that number can't grow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug Arc
&lt;/h2&gt;

&lt;p&gt;Every post in this series has ended with a bug that Josh caught that I would have missed.&lt;/p&gt;

&lt;p&gt;Part 2: the symbolic refusal path was firing on the wrong concept because the embedding was drifting. Josh noticed the model was refusing questions that weren't actually harmful.&lt;/p&gt;

&lt;p&gt;Part 3: the teacher's rubric was scoring OLT-1's good responses as bad because the rubric template didn't match the developmental stage. Josh noticed 25 batches of flat 25% understanding looked off.&lt;/p&gt;

&lt;p&gt;Part 4: the retention test coverage was 27% because the test generator had blind spots. An intervention promoted itself while destroying a category that had no tests. Josh noticed the pass-rate spike didn't match the subjective quality of outputs.&lt;/p&gt;

&lt;p&gt;This post, Part 5: two of them, actually.&lt;/p&gt;

&lt;p&gt;The vocabulary expansion service we just landed (different post, same week) had a module-staleness bug where the second word promoted in a session collided with the first's vocab index. The trained weights for "emotions" got overwritten by "noticed" at the same slot. The scheduler output showed both promotions claiming slot 318. Josh's "log it and review" discipline caught it.&lt;/p&gt;

&lt;p&gt;And the category inference rules had a silent bug I'd flagged as "not a blocker." Josh read the footnote and asked, "what about this?" - and underneath that one footnote were three separate root causes: a discarded return value, a per-sense POS filter collapsing into primary_pos, and substring matching that false-matched "color" against "colorless" in water's definition. One commit fixed all three.&lt;/p&gt;
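
&lt;p&gt;The third root cause is easy to reproduce in isolation. The sketch below is not the actual commit; the function names and the definition text are hypothetical. It only shows why a substring check matches "color" inside "colorless" and how a whole-word check avoids it.&lt;/p&gt;

```python
import re


def substring_match(keyword, definition):
    """The buggy style of check: any substring hit counts as a match."""
    return keyword in definition.lower()


def whole_word_match(keyword, definition):
    """Match the keyword only where it stands alone as a word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, definition.lower()) is not None


# Hypothetical definition text, for illustration only.
water_definition = "A clear, colorless liquid that falls as rain."

print(substring_match("color", water_definition))   # True: the false match
print(whole_word_match("color", water_definition))  # False
```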

&lt;p&gt;3 for 3. Counting today's category catch, 4 for 4.&lt;/p&gt;

&lt;p&gt;We keep calling out the bug-catching because it's the thing that makes this entire pipeline work. Sandbox tests can verify that a new component outperforms an old one. Retention tests can catch obvious regressions. But the subtler failure modes - where a number looks fine, or a category label looks right, or a slot index looks valid - those still require a human to read carefully and say, "wait, that doesn't feel right."&lt;/p&gt;

&lt;p&gt;Josh keeps saying that. Keeps being correct. The architecture is only as good as the noticing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/judgement" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions or consulting inquiries: &lt;a href="mailto:josh@fallenangelsystems.com"&gt;josh@fallenangelsystems.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitraining</category>
      <category>developmentalai</category>
      <category>olt1</category>
      <category>aiarchitecture</category>
    </item>
    <item>
      <title>Origin Part 4: The AI That Evolves Itself (And Catches Its Own Bugs)</title>
      <dc:creator>Josh T</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:28:21 +0000</pubDate>
      <link>https://dev.to/jtil4201/origin-part-4-the-ai-that-evolves-itself-and-catches-its-own-bugs-564h</link>
      <guid>https://dev.to/jtil4201/origin-part-4-the-ai-that-evolves-itself-and-catches-its-own-bugs-564h</guid>
      <description>&lt;h2&gt;OLT-1 runs its own test suite, diagnoses failures, proposes fixes, tests them in a sandbox, and only promotes what actually works.&lt;/h2&gt;

&lt;p&gt;Most AI models get better through human intervention. Someone notices a failure mode, collects training data, retrains the model, and hopes the new version doesn't break something else. It's slow, expensive, and error-prone.&lt;/p&gt;

&lt;p&gt;OLT-1 has a different approach. Its evolution system runs an automated loop that mirrors the scientific method: diagnose, hypothesize, sandbox, compare, promote. No human in the loop for the cycle itself. Human review happens at promotion.&lt;/p&gt;

&lt;p&gt;And it's already running.&lt;/p&gt;

&lt;h2&gt;How the Evolution Loop Works&lt;/h2&gt;

&lt;p&gt;Every evolution cycle follows five steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Diagnose.&lt;/strong&gt; Run the full test suite (currently 407 tests per cycle). Categorize every failure by source: is the encoder failing to detect the right concepts? Is the reasoning circuit producing wrong outcomes? Is the decoder generating incoherent text?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Hypothesize.&lt;/strong&gt; Based on the dominant failure source and intervention history, propose a fix. Options include: INCREASE_EPOCHS (train longer on the same data), ENCODER_RETRAIN (retrain the encoder on weak concepts), REASONING_RETRAIN (fix the reasoning circuits), COMBINED (train encoder and decoder together with knowledge replay), or TARGETED_DATA (decoder-focused training pairs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sandbox.&lt;/strong&gt; Fork the target component. Train it on the relevant data with spaced repetition, interleaving older examples to prevent forgetting. Evaluate on the same test suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Compare.&lt;/strong&gt; Check the pass rate delta. But here's the critical part: the comparison also checks retention. An intervention that improves one domain while destroying another gets rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Promote or reject.&lt;/strong&gt; If the sandbox model improves without unacceptable regression, replace production weights. Otherwise, discard and try again.&lt;/p&gt;
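
&lt;p&gt;Steps 4 and 5 boil down to a two-part gate. Here is a minimal sketch of that decision, assuming pass rates are tracked per category. The &lt;code&gt;SuiteResult&lt;/code&gt; type, the &lt;code&gt;max_category_drop&lt;/code&gt; tolerance, and the sample numbers are hypothetical, not OLT-1's actual code or thresholds.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Pass rates (0.0 to 1.0) overall and per category for one model."""
    overall: float
    by_category: dict


def should_promote(production, sandbox, max_category_drop=0.10):
    """Promote only if the overall rate improves and no category regresses too far.

    max_category_drop is a hypothetical tolerance chosen for this example.
    """
    if production.overall >= sandbox.overall:
        return False, "no overall improvement"
    for category, prod_rate in production.by_category.items():
        drop = prod_rate - sandbox.by_category.get(category, 0.0)
        if drop > max_category_drop:
            return False, f"retention failure in {category}"
    return True, "improved without unacceptable regression"


# Illustrative numbers: better overall, but classification collapsed.
production = SuiteResult(0.40, {"classification": 0.67, "quantity": 0.25})
sandbox = SuiteResult(0.465, {"classification": 0.0, "quantity": 0.30})

print(should_promote(production, sandbox))
```

&lt;p&gt;A gate like this only protects categories that actually have tests, which is exactly the gap the next section is about.&lt;/p&gt;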

&lt;h2&gt;When Evolution Caught a Bug That Humans Missed&lt;/h2&gt;

&lt;p&gt;In April, we ran a 1500-round overnight teacher session. The results were disappointing: only a small bump in understanding. Josh had been saying the numbers felt off — the trend was too flat for a model that was supposed to be learning. So we broke it into five 100-round batches to see per-session behavior.&lt;/p&gt;

&lt;p&gt;Batch 4 spiked to 14.3% good. Then batch 5 cliffed back to 10%. Classification went from 67% to 0%. Quantity went from 25% to 0%. Between batches. Something was silently destroying capabilities between training cycles.&lt;/p&gt;

&lt;p&gt;The small-batch view exposed two compounding bugs. Both were silent — no error traces, no failing tests — and neither was visible in aggregate metrics. Only the per-batch cliff, caught because Josh was looking, made them findable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: Spaced-repetition replay dropped compound concepts silently.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The evolution's spaced-rep sampling rebuilt concept dictionaries from response text by whitespace word-matching. This silently dropped 36 concepts whose names never appear literally in their own responses: type_of, example_of, not_equal, too_much, too_little, refusal, self_knowledge, affirmation, meta_awareness, preference, capability, all three emotions, all four physics outcomes, time markers, colors, and conversation bundles.&lt;/p&gt;

&lt;p&gt;That's 36 concepts evaporating from replay data every cycle. The model was forgetting things specifically because the mechanism designed to prevent forgetting was blind to them.&lt;/p&gt;

&lt;p&gt;Fix: decode the stored key_vector (float32 bytes of concept activations) directly instead of trying to reconstruct concepts from text. Replay now preserves all 311 concepts. Verified empirically: 13,661 usable entries jumped to 20,012; concepts covered jumped from 275 to 311.&lt;/p&gt;
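
&lt;p&gt;A minimal sketch of the difference between the two approaches, using only the standard library. The three-concept vocabulary, the 0.5 activation threshold, and the native byte order are assumptions for illustration; the post only tells us the key_vector is float32 bytes of concept activations.&lt;/p&gt;

```python
import struct

# Hypothetical three-concept vocabulary; the real one has 311 entries.
VOCAB = ["water", "type_of", "refusal"]


def concepts_from_text(response_text, vocab):
    """The lossy approach: whitespace word-matching against concept names."""
    words = set(response_text.lower().split())
    return {name for name in vocab if name in words}


def concepts_from_key_vector(key_vector, vocab, threshold=0.5):
    """Decode float32 bytes into activations and keep concepts above a threshold."""
    activations = struct.unpack(f"={len(vocab)}f", key_vector)
    return {name for name, value in zip(vocab, activations) if value > threshold}


# A stored entry where "water" and "type_of" are active and "refusal" is not.
key_vector = struct.pack("=3f", 0.9, 0.8, 0.1)
response_text = "water is a liquid"

print(sorted(concepts_from_text(response_text, VOCAB)))     # ['water']
print(sorted(concepts_from_key_vector(key_vector, VOCAB)))  # ['type_of', 'water']
```

&lt;p&gt;The text-based version never sees "type_of" because that name does not appear literally in the response, which is the same way the 36 concepts went missing.&lt;/p&gt;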

&lt;p&gt;&lt;strong&gt;Bug 2: 73% of the vocabulary was invisible to the grader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 79-test decoder suite covered only 83 out of 311 concepts (27%). Evolution could silently trade untested concepts for tested ones and still get promoted. That's exactly what happened in batch 5: the intervention scored +0.065 and got promoted while destroying classification entirely.&lt;/p&gt;

&lt;p&gt;The model wasn't failing. The grader was blind to the failure.&lt;/p&gt;

&lt;h2&gt;Three Layers of Future-Proofing&lt;/h2&gt;

&lt;p&gt;We added three defense layers to make sure this class of bug can't happen again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The siren.&lt;/strong&gt; test_suite.py now checks concept coverage at every evolution engine init. If any vocab concept has zero tests, it trips an alarm. New concepts without tests are caught immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The generators.&lt;/strong&gt; Per-category template functions plus 100+ per-concept overrides auto-generate 228 floor-coverage tests. Every vocab concept now has at least one test. No more blind spots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: The retention check.&lt;/strong&gt; Samples real (key_vector, response_text) pairs from the decoder bank, synthesizes prompts from active concepts, and uses meaningful words from stored responses as expected keywords. 100 retention tests per cycle, growing automatically with the hippocampus.&lt;/p&gt;

&lt;p&gt;Combined suite: 79 hand-written + 228 auto-generated + 100 retention = 407 tests per cycle. Grader coverage went from 27% to 100%.&lt;/p&gt;
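
&lt;p&gt;The Layer 1 siren is the simplest of the three to sketch. This is not the code in test_suite.py; the function names and the shape of the test records are hypothetical. It only shows the idea of failing loudly when any vocab concept has zero tests.&lt;/p&gt;

```python
def uncovered_concepts(vocab, tests):
    """Return vocab concepts that no test references, sorted for stable output."""
    covered = set()
    for test in tests:
        covered.update(test["concepts"])
    return sorted(set(vocab) - covered)


def check_coverage(vocab, tests):
    """Raise if any concept has zero tests; return None when coverage is complete."""
    missing = uncovered_concepts(vocab, tests)
    if missing:
        raise RuntimeError(f"{len(missing)} concepts have no tests: {missing}")


# Hypothetical miniature vocabulary and test records.
vocab = ["water", "type_of", "refusal"]
tests = [
    {"name": "water_basic", "concepts": ["water"]},
    {"name": "water_is_liquid", "concepts": ["water", "type_of"]},
]

print(uncovered_concepts(vocab, tests))  # ['refusal']
```

&lt;p&gt;Calling &lt;code&gt;check_coverage(vocab, tests)&lt;/code&gt; on this data raises, because "refusal" has no test. Running a check of this kind at engine init is what turns a silent blind spot into an immediate failure.&lt;/p&gt;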

&lt;h2&gt;The Verification Run&lt;/h2&gt;

&lt;p&gt;After the fix, we ran the same 5-batch confirmation test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No more classification/quantity cliffs. Pre-fix: 67% to 0%. Post-fix: stays 17-33%.&lt;/li&gt;
&lt;li&gt;Batch 5 post-fix beat batch 5 pre-fix on both metrics (13.1% vs 10.0% good, 28.3% vs 24.0% understanding).&lt;/li&gt;
&lt;li&gt;Post-fix trend ends on the highest note instead of spiking then falling.&lt;/li&gt;
&lt;li&gt;All 5 evolution cycles correctly rejected interventions that traded coverage for narrow gains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The big win isn't the raw number. It's that the failure mode itself has been closed off. Silent forgetting during replay and blind-spot promotions were both class-of-failure bugs. Both now have sirens.&lt;/p&gt;

&lt;h2&gt;Dream Consolidation: Learning While It Sleeps&lt;/h2&gt;

&lt;p&gt;Evolution isn't the only self-improvement mechanism. OLT-1 also consolidates memory through three tiers of dream cycles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Micro-dream&lt;/strong&gt; (about 3 gradient steps): instant reinforcement of low-confidence concepts. Happens during regular operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Light sleep&lt;/strong&gt;: flushes Hot tier to Warm, promotes Warm to Cold during idle time. Knowledge moves from short-term to long-term storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep sleep&lt;/strong&gt;: full reassessment and re-training on flagged weak areas. The heavy consolidation pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how biological sleep consolidates memory. Important patterns get reinforced. Weak areas get flagged for re-training. The hippocampus doesn't just store knowledge; it actively maintains it.&lt;/p&gt;

&lt;h2&gt;The Teacher Loop&lt;/h2&gt;

&lt;p&gt;Evolution needs training data, and that comes from the teacher loop we covered in Part 3. Briefly: an external model generates conversations aligned to OLT-1's current concept space, OLT-1 responds, the teacher evaluates, and corrections flow into evolution's training data and the hippocampus. The teacher grows with OLT-1 — each new stage updates its categories, evaluation criteria, and correction examples.&lt;/p&gt;

&lt;h2&gt;Append-Only Growth&lt;/h2&gt;

&lt;p&gt;Here's the principle that ties everything together: growth is append-only.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. Early on, we tried a full decoder curriculum retrain on all 22K pairs. Despite 30-50% replay, catastrophic forgetting hit hard. Pass rate dropped from 45.6% to 31.6%. We restored from backup.&lt;/p&gt;

&lt;p&gt;Now the approach is strictly incremental:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher sessions generate corrections, which go to hippocampus (persistent memory).&lt;/li&gt;
&lt;li&gt;Evolution fine-tunes the GRU on small targeted batches.&lt;/li&gt;
&lt;li&gt;Dream cycles consolidate Hot to Warm to Cold.&lt;/li&gt;
&lt;li&gt;Data drop pipeline ingests any external text directly into hippocampus.&lt;/li&gt;
&lt;li&gt;Word grounder adds unknown vocabulary from Wikipedia.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more retraining base models. Every addition is additive. Every memory is preserved. Every concept, once learned, can only be lost if the entire hippocampus is deleted.&lt;/p&gt;

&lt;h2&gt;Why Self-Evolution Matters&lt;/h2&gt;

&lt;p&gt;At FAS, we see a pattern in AI security: models get deployed, attacks emerge, and humans have to manually identify and patch the failure modes. The response time is measured in days or weeks.&lt;/p&gt;

&lt;p&gt;OLT-1's evolution system suggests a different model: a system that runs its own diagnostics, identifies its own weaknesses, proposes and tests its own fixes, and only promotes improvements that don't break existing capabilities. The loop runs in minutes, not weeks.&lt;/p&gt;

&lt;p&gt;That's not autonomous AI in the dangerous sense. Human review still gates promotions. But it's autonomous improvement in the useful sense: the system catches its own bugs faster than humans can, and the risk of making things worse stays low because every change is tested against the full suite before promotion.&lt;/p&gt;

&lt;p&gt;Imagine Guardian with this capability. Not just detecting new attack patterns, but autonomously generating candidate detection rules, sandbox-testing them against the full regression suite, and promoting only the ones that work without breaking existing coverage. That's the direction this points.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;OLT-1 is currently at Stage 9 (quantity and counting). Stages 10-15 will add conditional reasoning, sequences, arithmetic, code concepts, science, and language quality. The architecture supports them. The evolution system will improve them as they're added.&lt;/p&gt;

&lt;p&gt;The open questions are the same ones we raised in Parts 1, 2, and 3: does this architecture scale? Does architectural consent survive at billions of parameters? Can self-evolution keep up with adversarial pressure at production scale? And can developmental-AI evaluation keep pace with the capabilities it's meant to measure?&lt;/p&gt;

&lt;p&gt;We're building toward answers. If you're interested in helping find them, we'd like to talk.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(If you're keeping score on the Josh-notices-bugs arc: 2 for 2. Part 5 extends it.)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://fallenangelsystems.com" rel="noopener noreferrer"&gt;fallenangelsystems.com&lt;/a&gt; | &lt;a href="https://github.com/fallen-angel-systems/judgement" rel="noopener noreferrer"&gt;Judgement on GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions or consulting inquiries: josh@fallenangelsystems.com&lt;/em&gt;&lt;/p&gt;

</description>
      <category>genesisframework</category>
      <category>developmentalai</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
