DEV Community: Josh T

Origin Part 12: The Adapter

Josh T — Mon, 25 May 2026 13:00:30 +0000

The new encoder was 24x better at finding the right concept. It also broke every response.

Part 11 ended with the new encoder staged on disk. Top1 had jumped from 1.3% to 31.3%. Target activation had gone from 0.012 to 0.249. The architectural lever had landed exactly where the abort condition predicted it would. The numbers said this was the encoder we were going to ship.

Then we tried to ship it.

Every query came back "i don't know."

What the Dispatcher Does

The dispatcher is the part of Origin that sits between the encoder and the response. The encoder reads characters and produces concept activations - a long list of "how strongly does each concept fire on this input?" The dispatcher reads that list and decides what to do about it. Is this a greeting? Is this a question about identity? Is the user asking what something is? Each route fires when the activation pattern matches a rule, and each route knows how to construct a response from the concepts that fired.

The rules looked like this, in spirit: if the concept "greeting" is firing above 0.5, dispatch to the greeting handler. If the concepts "what" and "self" are both above 0.5, dispatch to the identity handler. Numbers like 0.5, 0.7, 0.8 were sprinkled through the dispatcher as thresholds. They worked because the old encoder produced activations that lived in those ranges.

The old encoder used sigmoid. Each concept was scored independently, on its own absolute scale from 0 to 1. A query about greetings might fire "greeting" at 0.92, "hello" at 0.88, and "question" at 0.04. Three concepts, three independent yes/no decisions, three numbers that meant what their face value said they meant.

The new encoder uses softmax. The activations are relative. They sum to 1 across the whole concept space. The strongest concept on a query might be 0.249 - which under the old encoder would have been a borderline-quiet signal, and under the new encoder is a confident, dominant fire.

0.249 was the new encoder's average top concept activation. Every threshold in the dispatcher was 0.5 or higher.

That's why every query routed to IDK. The new encoder was firing the right concept, with appropriate confidence relative to everything else, and the dispatcher was reading those activations as "nothing is firing." The encoder had gotten 24x better at picking the right answer, and the system above it couldn't hear it.

The Wrong Fix

The first instinct was rescaling. If 0.249 is the new "high," divide every threshold by 2. Done. Ship.

We tried it. It half-worked. Greeting handlers fired correctly on greetings. Identity handlers fired correctly on identity questions. But the dispatcher started cross-firing on everything else - questions about emotions would route to identity, questions about objects would route to physics. We'd swapped one calibration problem for another.

The reason: rescaling treats softmax outputs as if they were sigmoid outputs that happen to live in a different range. They aren't. A 0.249 firing on the new encoder isn't "the concept is 49.8% present" - it's "this concept is the most likely interpretation, with this much margin over the next-best." The number means a different thing than it did before. Rescaling fixes the magnitude. It doesn't fix the meaning.

That's the harder truth about this kind of integration: when an upstream component changes how it represents information, every downstream component that interprets that information has to be rewritten, not retuned.

The Right Fix

The dispatcher had been asking the wrong shape of question. It was asking "is concept X firing strongly enough?" - an absolute threshold question. With softmax outputs, that question doesn't have a meaningful answer. The right shape is "is concept X the dominant signal, and by how much?" - a relative comparison.

The rewrite turned every threshold into a ranking check plus a margin check. Instead of "greeting > 0.5," the rule became "greeting is in the top-3 fired concepts AND its activation is at least 2x the next-best non-greeting concept." Instead of "identity > 0.7," the rule became "identity dominates the top of the activation distribution."

The numbers in the new rules aren't thresholds in the old sense. 2x margin, top-3 rank, dominance-by-ratio - these all describe the shape of the activation distribution, not its absolute values. They survive future encoder changes the way the old thresholds didn't, because they're asking about the encoder's confidence relative to itself, not about a number that means something only on this specific encoder.

The cutover was one commit. Every dispatch rule rewrote. Backups taken on the dispatcher state and the live conversation memory. Test panel run

before

you > hello
origin > i don't know

you > what is your name
origin > i don't know

you > how does ice float
origin > i don't know

and after

you > hello
origin > hello.

you > what is your name
origin > my name is origin.

you > how does ice float
origin > ice is less dense than water, so it floats.

The new encoder is now live. The system runs end-to-end. The first two developmental tiers - basic conversation and elementary reasoning - are at 95.5% and 86.5% on the honest test panels.

What the Whole Arc Was About

Looking back at Parts 9 through 12 as a single sequence, the arc is about the discipline of finding the right bottleneck.

Part 9 said the bottleneck was data. We executed a careful plan to feed the encoder properly. Part 10 said the data plan didn't work - the abort condition triggered, and we listened. Part 11 said the bottleneck was architecture. The sandbox confirmed it. Part 12 says that even after fixing the right bottleneck, you still have to integrate the fix into the rest of the system, and integration is its own kind of work.

None of this is glamorous. It's not a "we achieved AGI" post. It's the slow, uneventful, mostly-correct version of how a model actually gets built: hypothesize a bottleneck, design a plan with a written-down abort condition, execute the plan, listen to what happens, do the next thing the evidence points at. Repeat until something actually works. Then integrate it without breaking everything around it.

The encoder we're running today is the third major iteration since we started. The dispatcher we're running today is the second. There will be more. Every component in this system has been the bottleneck at some point, and every component will be the bottleneck again. The job isn't to design the perfect system on day one. The job is to keep finding what's actually broken and fixing that thing, one bottleneck at a time, with abort conditions written in advance so a result you wanted to see doesn't become the result you accept.

What's Next

The encoder works. The dispatcher works. The first two tiers hold. The third tier - middle-school content across math, science, and history - is where the project goes next, and it's the tier that tests whether everything we've built so far actually generalizes.

There's a hypothesis we're testing alongside it: that the next bottleneck isn't going to be more concepts, but the relationships between concepts. A model can know "dog" and "animal" and "four legs" and "barks" as four separate concepts and still not understand what a dog is. Understanding might live in the connections, not the nodes.

If that's right, the next architecture pivot is already visible on the horizon. If it isn't, we'll find out quickly and write that post too.

One guy. One GPU. One $1,800 computer in Arizona. Still building.

Origin is developed at Fallen Angel Systems with the Genesis framework — NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 11: The Architecture Was the Lever

Josh T — Mon, 18 May 2026 13:00:35 +0000

The data plan didn't move the encoder. The architecture sandbox did.

Part 10 ended with the abort condition triggering: top1 of 1.3% on held-out probes meant the architecture, not the data, was the bottleneck. The plan said "design contrastive next." We built a sandbox first.

The Sandbox

150 random concepts spread across six domains. The same training data filtered to that slice. The same held-out probe battery. Five concept_head variants tested side-by-side:

baseline_flat_bce: current architecture (flat MLP + per-slot binary cross-entropy)
contrastive: same MLP, but cross-entropy over the full concept space (the target must dominate every other concept, not just exceed a threshold)
tree_hierarchical: predict domain first, then concept within domain
domain_routed: soft-gate trunk features through per-domain sub-heads
contrastive_tree: hybrid of tree structure plus contrastive global cross-entropy

All five shared the warm-started trunk. All had comparable parameter counts. The only difference was head topology and loss. The first round trained with the trunk frozen, head only, so any difference traced to the architectural choice, not optimization noise.

The Frozen Ceiling

The best variant under frozen trunk hit 10% top1. The baseline hit 4.5%. Real differentiation, but no variant came close to a useful threshold.

The reading: the trunk's 256-dimensional feature output was the ceiling. The trunk had been warm-started from v1 then trained on the new data, but it had never been shaped to discriminate 3687 concepts. No head topology could extract signal that wasn't there. Every variant was trying to read meaning from a representation that hadn't learned to encode it.

Before scaling further, we set a pre-defined pass criterion: top1 at or above 30% AND cross_fire under 30 at 500 concepts means "this is the architecture to scale." Hold the gate. Don't let a result you wanted to see become the result you accept.

Unfreezing the Trunk

One change: let the trunk co-adapt to the head's loss at a lower learning rate (1e-4 trunk vs 3e-4 head). Same data. Same head. Same epochs.

contrastive_tree at 30 epochs on 500 concepts: 28.4% top1 with cross_fire of 48.27. Just below criterion on both. The pattern said "more epochs needed." Larger concept counts take longer to converge. At 60 epochs: 34.8% top1, 26.26 cross_fire. Both criteria met, no other tuning needed.

Architecture locked: contrastive_tree + unfrozen trunk + 60 epochs.

Production Retrain

Same architecture, scaled to all 3687 concepts. 145,000 training pairs (the natural-positive corpus from Part 9 plus everything else in Phase A). 65 minutes on the 4070.

Phase 5 probe battery on the new encoder, same 50 random concepts as the baseline. The number that matters most: under the old encoder, the right answer was outvoted 14-to-1 by distractors. Under the new architecture, the target concept dominates by 22x. That's not gradual improvement. That's a different machine.

Metric	Baseline	New Architecture
top1	1.3%	31.3%
top3	4.0%	50.0%
target activation	0.012	0.249
target / 2nd-best ratio	0.07	22.67

Top1 went from 1.3% to 31.3%. Top3 from 4% to 50%. None of the absolute Phase 5 success gates are met yet (the plan's full-victory marks were 70% top1, 0.7 target activation, cross_fire under 2). But every metric moved dramatically in the right direction, and the lever was exactly where the plan said it would be if the data hypothesis failed.

What Comes Next

The new encoder is staged on disk but not yet live. Swapping it in turns out to be more than a file rename. The dispatcher that turns concept activations into responses was built around the old encoder's sigmoid output range. The new encoder uses softmax. Without changes, every query would route to IDK. We'll cover the fix in Part 12.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 10: The Plan Didn't Work

Josh T — Mon, 11 May 2026 23:13:42 +0000

We executed the plan exactly as written. The encoder still couldn't tell concepts apart.
Part 9 ended with 94,000 natural-context pairs wired into the trainer and a clean execution of every phase gate. We had three times the data. The hypothesis was about to be tested.

Phase 4: The Retrain
The full joint retrain ran clean. Loss curve descended monotonically. The encoder's healthy concept count went from 84 to 107, measured by an internal probe of about 30 hand-crafted queries that exercise common concepts.

+23 healthy concepts, +27% relative. We were cautiously optimistic. The trainer's audit is a small set of probes and "healthy" only counts the concepts those probes happen to test. The real validation was Phase 5.

Phase 5: The Probe Battery
The plan's success metric was specific. Random-sample 50 V2C concepts. Ask gemma to generate three short held-out sentences mentioning each one (verified not to appear in the training corpus). Run them through the encoder. Measure four things:

top1 accuracy: does the encoder rank the target concept first?
top3 accuracy: is the target in the top three?
target activation: how strongly does the target itself fire?
cross_fire: how many other concepts fire above threshold?
The pre-defined success gates were top1 at or above 70%, target_act at or above 0.7, cross_fire under 2.0.

The result on the freshly-retrained encoder:

top1: 1.3%
top3: 4.0%
target_act: 0.086
cross_fire: 11.92
We ran it twice.

One concept out of fifty had its target rank first. The encoder fired on twelve wrong concepts per probe, on average. Target activation was eight percent. When we handed the encoder the exact sentence it should have been designed to recognize, it barely registered the right answer.

The plan had executed exactly as written and not moved the encoder.

What That Meant
This is the place in the post where it would be easy to say something exculpatory: "the data work wasn't wasted" or "we learned something." Both are true. But the cleaner reading is that we were wrong about the bottleneck. We had thought the encoder was data-starved. The earlier sandbox at 10-concept scale had shown data could lift top1 from 33% to 80%. We assumed that signal would transfer to 3687 concepts.

It didn't.

We had built the plan with an explicit abort condition for exactly this case: if Phase 5 returns top1 below 50% on held-out probes, the architecture is the bottleneck, not the data. Design contrastive next.

1.3% triggered it.

The data work wasn't wasted. We needed the data anyway, and the elaboration corpus is now properly structured for whatever the next model wants to do with it. But it wasn't the lever. Something else was.

What Comes Next
The abort condition pointed at architecture. The encoder's concept_head, the part that maps general features to per-concept activations, was a flat MLP trained with multi-label binary cross-entropy. Every concept slot had to learn its own discriminator independently against roughly 3686 others. At 327 concepts (the v1 vocab) this had worked. At 3687 it had been quietly failing the whole time.

The next move: build a sandbox, test multiple head architectures against the same data, let the numbers pick the winner. No production changes until something actually beats the baseline on Phase 5.

Hypothesis tests fail more usefully than hypothesis confirmations. We'd just gotten one of the more useful failures.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 9: The Data Plan

Josh T — Mon, 04 May 2026 14:00:22 +0000

85.7% of concepts were data-starved. That was the problem. What happened next taught us something about the problem itself.

OLT-1 is a concept-based AI that understands language without tokenization. Characters go in. Concepts come out. The encoder is what makes that mapping. If it can't reliably tell concepts apart, nothing downstream works.

Part 8 left the encoder firing on too many slots per query. The concept space was crowded and noisy. A sandbox experiment had already shown that the same architecture could lift top1 from 33% to 80% just by feeding it richer data. Same model. Different food.

That made the next move obvious: stop tuning the encoder and feed it properly. We wrote a plan and built a scope fence around it.

The Plan

One sentence: every V2C concept has at least 30 natural-context positives before any further retrain. Not WordNet glosses. Not template sentences. Real text from books or Wikipedia where the concept is used naturally.

The scope fence was strict: no hard-negative tuning, no decoder dispatch guards, no architecture changes, no tier-test-specific quick fixes. Five things we'd been tempted to try in past sessions and would not be trying this session.

Three data phases with gates: coverage audit, source expansion if needed, then per-concept generation. Then retrain. Then probe.

Phase 1: Coverage Audit

We walked the existing data: book ingestion proposals, elaboration candidates, the grounding cache. Counted how many natural-context sentences each of the 3687 concepts had.

The number: 3158 concepts (85.7%) below the threshold. Most were stuck in the 10-29 range. Some data, but not enough.

Phase 2: Source Expansion

We tagged every concept with one of 17 domain labels using gemma-2-9b, built a Wikipedia full-article adapter, and routed encyclopedic concepts (biology, science, physics, history) to Wikipedia and conversational concepts (emotion, self_state, language) to Gutenberg fiction.

The first run came back thin. Wikipedia and Gutenberg both produced fewer candidates than expected, and the per-domain medians barely moved. Most of the new positives went to common quantifier words: some, many, all. The ones most likely to be the only known concept in any given sentence.

That last detail was the clue.

The Rule That Was Right and Wrong

The book ingestion pipeline has a strict rule: if a sentence mentions more than one known concept, drop it. The rule was correct for the original use case. You never want to assign the wrong concept to a sentence. But it was actively working against us here. The sentences we needed most, ones like "the cell membrane regulates what enters the cell," got dropped because they mention two concepts.

We almost missed it. The clue was that common quantifier words kept getting the new positives. They're the ones most likely to appear alone in a sentence. The interesting concepts, the semantically rich ones, were still getting filtered out at every pass.

We sandboxed a relaxed variant: assign multi-concept sentences to the least-common concept. The argument is information-theoretic. Rare concepts gain more from each new positive. A 50-sample spot-check came back 88% good, 12% defensible-either-way, 0% wrong. We shipped it as a separate file. The original strict rule still serves Discovery unchanged.

181 more concepts crossed the threshold. Per-domain medians moved up three to five positives across the board.

Phase 3: Generation

The last data step. We pulled everything together: book ingestion proposals, elaboration candidates, and the new Path B output. One training file: 94,000 natural-context pairs covering 96.7% of the vocabulary. Wired into the encoder trainer's Phase A data list.

The trainer now had three times its previous data. Every phase gate had passed. We hit run on the retrain and set a timer. 65 minutes later, we'd know if the data had been the problem all along.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 8: Four Wrong Turns Before the Breakthrough

Josh T — Fri, 01 May 2026 19:17:07 +0000

We rewrote the decoder four times in one day. Only the last one understood anything.

Part 7 ended with "how are you" returning "i don't know" while our tier tests reported 100% pass. Everything was green. The model was broken. The disconnect between those two facts defined the day.

Here's the actual arc.

Wrong Turn 1: Retrieval

The first attempt was retrieval. We built five decoder candidates, sandbox-tested them against 400 dialogue pairs, and a retrieval-based decoder won cleanly. F1 of 0.246 against the next-best 0.024. Four out of five break tests passed. It was 1,300x faster than the teacher. We wrote a "winner" memory and committed the code.

Josh looked at it and said: retrieval is scripting. Origin isn't supposed to look up pre-written answers. It's supposed to generate them from understood concepts.

He was right. Retrieval wins F1 against memorized responses because retrieval is memorization - it just renames the table. A query comes in, find the closest stored response, return it. That passes a test suite built from the same responses. It doesn't understand anything.

We deleted the sandbox, deleted the memory, and backed up to try again.

Wrong Turn 2: Template Heads

The second attempt was template-based heads. Each head was a tiny specialist - one for self-identity, one for emotion, one for acknowledgements, one for counting. Each had a list of text patterns it matched, and each produced a hard-coded response when its pattern fired.

Four Tier 1 heads, then four Tier 2 heads. Multi-step composer for compound requests. It was clean. It was fast. And it passed Tier 1 at 100% out of the gate.

Then Josh tried to talk to it.

you > how are you
origin > i don't know

you > what do you know
origin > i don't know

you > how are you doing today
origin > i don't know

His response: "it feels like it isn't understanding language, it's just repeating patterns."

That was the pivot of the day.

The head code looked like this:

if "hello" in text: return "hello."
if "what is your name" in text: return "my name is origin."

The encoder might as well not exist. Every decision was a text substring match. Tier 1 at 100% was a pattern-matcher passing tests designed by the same pattern-matcher. "how are you" wasn't in any pattern list, so the decoder fell through to "i don't know" - not because Origin didn't know, but because no head had that phrase in its dictionary.

We'd been calling this concept-driven for weeks. It wasn't. It was text-driven with concepts as decoration.

Wrong Turn 3: Actually Concept-Driven (But the Encoder Was Lying)

The third rewrite made dispatch actually concept-driven. Instead of "if 'hello' in text," an Intent would say "fire when the greeting concept activates." Text would only be consulted inside the response builder for variable slot extraction ("count to N" needs to know what N is). Primary dispatch would be on what the encoder actually understood.

We ran Discovery against it. Tier 1 dropped from 100% to 43.6%.

That was the honest number. It was smaller because the pattern-matching wasn't hiding the encoder's gaps anymore.

The failures were catastrophic:

"hello" fired concepts like just_checking, yellow, happened. The greeting concept didn't fire at all.
"bye" fired continue at 0.90. The farewell concept didn't fire.
"are you human?" fired consent at 0.71 and i_am at 0.75. consent beat out identity.
"thank you" fired refuse at 1.00 and no_choice at 1.00. Exactly backwards.
"i am scared" didn't fire scared at all. It fired learning and current_state.

The encoder - the part we thought was solid - was broken. Not subtly. On the most basic greetings and emotions.

The Real Problem: Data Was Lying

We went into the encoder's training data and started reading.

The greeting concept had 15 training examples. All 15 were dictionary definitions. "greeting means salutation." "salutation is another word for greeting." "greeting is a acknowledgment." Not one example paired "hello" with greeting. Not one paired "hi" with greeting. The encoder had been taught what the word "greeting" means - but never shown that "hello" is an example of one.

Same for farewell. Same for scared. Dictionary definitions, zero usage examples.

The thank_you concept was worse. 53 of its 55 training examples were sentences like "i will decline your offer" and "would you like refuse?" - labeled as thank_you. Someone (some script, some generator) had treated "polite refusal" as containing thanks and co-labeled the examples. The encoder learned that thank_you fires on refusal language. That's why "no" fired thank_you and "thank you" fired refuse. The polarity concepts had contaminated each other.

The v2 encoder was gaslit by bad data and the pattern-matching decoder had been hiding it the whole time.

The Fix

We patched the data. Six new training files in the conversation corpus - 157 natural-usage examples:

"hello" / "hi" / "hey" / "good morning" → greeting
"bye" / "goodbye" / "see you later" → farewell
"thank you" / "thanks" / "much appreciated" → thank_you
"i am scared" / "i feel angry" / "i'm frustrated" → the right emotion concepts
"yes" / "okay" / "sure" → yes_choice, separate from consent
"no" / "nope" / "not really" → no_choice, separate from refuse

Stripped the 53 mislabeled thank_you entries from the consent-mechanics file. Ran a three-minute retrain.

Audit results:

"hello" → greeting=1.00 ✓
"bye" → farewell=1.00 ✓
"i am scared" → scared=1.00, i_am=1.00 ✓
"thank you" → thank_you=1.00, no refuse cross-fire ✓
"no" → no_choice=1.00, no thank_you cross-fire ✓

Overall encoder health preserved at 296/305 concepts on the full audit. The patches fixed the broken concepts without damaging anything that had been working.

Re-ran Discovery against the concept-driven decoder with the patched encoder. Tier 1: 280/280. Tier 2: 137/137. 100% and 100%. Honest this time - every pass was a concept firing correctly and the decoder routing on it. No text-pattern shortcut anywhere.

Then we opened an interactive chat:

you > how are you
origin > i am doing fine. what would you like to explore?

The response it wouldn't give in the morning, it gave in the evening. Not because we added "how are you" to a pattern list, but because the encoder now fired question and self on that input, and the decoder's concept-driven wellbeing intent matched on those concepts.

The Unlock: Growing Vocabulary At Runtime

With the decoder honest, we had room to fix the other thing v1 couldn't do: add new concepts without a full retrain.

This had been v1's bottleneck for weeks. Discovery would propose new concept candidates. The tracking code logged them. But actually teaching the encoder a new concept required retraining the whole concept_head from scratch, which was expensive enough that proposals piled up unaddressed. Concepts came in faster than the encoder could absorb them.

The technique we validated today:

Expand the concept_head's final linear layer from N → N+1 outputs
Copy the first N weight rows unchanged - existing concepts preserved exactly
Zero-initialize the new row, freeze everything else via gradient masking
Train only the new row on positives + sampled negatives, 8 epochs, about a minute

Sandbox results: 100% recall on the new concept. 0% false positive rate on negatives. Zero regression on the existing concepts.

We ran it six times in sequence - rainbow, thunder, ocean, mountain, flower, sunset - and the regression stayed at zero all the way through. Each addition cost about 60 seconds.

v1's bottleneck dissolved. New concepts are now cheap enough to run routinely.

Rainbow

The last thing we did today was integrate a new concept into the live system.

$ echo '{"name": "rainbow", "response_template": "rainbows are colors of light in the sky.",
"positives": [...]}' | python -m tools.concept_lifecycle draft
Drafted: rainbow (pending) — 18 positives

$ python -m tools.concept_lifecycle approve rainbow
Approved: rainbow

$ python -m tools.concept_lifecycle integrate rainbow
Integrating concept 'rainbow' (18 positives)
baseline: 296/305 healthy
trained; final_loss=0.144 row=305
new slot: recall=100.0% fp_rate=0.0%
regression: 0 lost (296 → 296)
persisting encoder checkpoint...
appending 'rainbow' to v2_vocab.py CONCEPTS...
registering decoder intent...
✓ integrated.

Origin's vocabulary went from 305 to 306 concepts. The encoder checkpoint was saved with a timestamped backup. The vocab file was updated. The decoder registered the response template.

Restart and test:

you > i saw a rainbow
origin > rainbows are colors of light in the sky.

you > look at that rainbow
origin > rainbows are colors of light in the sky.

you > hello
origin > hello.

The new concept fires correctly. The 305 original concepts still work. Nothing broke.

This is what v1 couldn't do. This is why we rebuilt.

What the Day Cost

Four wrong turns. Retrieval, template heads, concept-driven-but-encoder-broken, then finally the real fix. Each wrong turn looked like success at first - passing tests, clean benchmarks, committed commits. The signal that something was wrong came from conversation, not numbers. "it feels like pattern matching." "how are you returns i don't know." The metrics kept saying green while the lived reality said something was off.

The right turn came from debugging what the encoder actually fires on "hello" - and discovering it had never been taught that "hello" was a greeting. The data layer was upstream of everything. When it lies, every layer above it inherits the lie, and metrics will happily agree.

What's left: Tier 3 content. Middle-school math, intro science, history, basic coding. The foundation holds; now we grow it. And now that growing the vocabulary costs a minute per concept instead of a full retrain, growing is actually something we can do.

Origin is 306 concepts tall. The 306th is rainbow, and it was added while the system was running. The foundation can hold itself.

Now we build upward.

fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 7: We Fired the Teacher

Josh T — Wed, 29 Apr 2026 15:46:23 +0000

We built something to replace the teacher. It worked. Then something else went wrong.

Part 6 ended with a problem we couldn't patch: a token model cannot reliably grade a concept model. The mismatch isn't fixable with a better rubric or a better teacher model. It's architectural.

So we stopped trying to fix the teacher and built a replacement.

Discovery: The Teacher Replacement

The idea was simple. Instead of asking Gemma to generate questions and grade responses, we'd build a rule-based system that already knew the right answers.

Each rule is a (pattern, expected response signature) pair. "does ice float?" expects a response containing "float" and "water." "what is your name?" expects a response containing "origin." No LLM anywhere in the loop. No drift. No mode collapse. No token-fluency bias.

We called it Discovery. We ran the first test.

The numbers: 0.79 seconds for 180 tests. 94.6% pass rate on Tier 1. Zero duplicates. Zero hallucinations.

Compare that to Gemma: 20 minutes for 200 rounds, 50%+ duplicates, 65.6% pass rate that was actually measuring fluency, not understanding.

Discovery was 1,300x faster, cleaner signal, and actually measuring what we cared about. We committed the code. Gemma went into reference-only status. The teacher loop was retired.

Then Discovery exposed the next problem.

What Discovery Actually Exposed

Running clean evaluations against a decoder we thought was "working" revealed something we'd been hiding from ourselves: most of the decoder wasn't understanding at all. It was text-matching.

The decoder had heads like:

if "hello" in text: return "hello."
if "what is your name" in text: return "my name is origin."
if "count to three" in text: return "one two three."

Every "working" response was a text substring lookup. The encoder's concept activations barely influenced routing. Tier 1 and Tier 2 had been passing at 100% on our deterministic suite because the decoder was pattern-matching against the same keyword lists the grader used. A pattern-matcher acing a test written by a pattern-matcher. Circular.

When you typed "hello," the decoder matched the string "hello" and returned "hello." The encoder might as well not have been there.

We'd spent weeks calling it concept-driven and it was text-driven with concepts as decoration.

The Moment It Broke Open

The way we caught it was anticlimactic. After Discovery reported 100% pass rates, we opened an interactive chat and typed:

you > how are you
origin > i don't know

Every tier test had passed. The most basic conversational question failed.

Why? "how are you" wasn't in any head's pattern list. The encoder might have fired relevant concepts - self, question, state - but the decoder wasn't looking at the encoder. It was scanning the input string for known trigger phrases and hadn't been given that one.

The 100% had been measuring whether the patterns we'd written matched the patterns we'd tested for. Nothing more.

That's what Discovery exposed by running clean. And that's the wall v2 had to break through next.

Part 8 is the day we did.

*
Origin is developed at Fallen Angel Systems with the Genesis framework — NVIDIA Inception member. (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.
*

*
fallenangelsystems.com | Judgement on GitHub | Guardian on GitHub
*

*
Questions or consulting inquiries: josh@fallenangelsystems.com
*

Origin Part 6: The Teacher Kept Breaking

Josh T — Mon, 27 Apr 2026 16:12:10 +0000

Every time we fixed the teacher, it broke in a new way.

Part 3 of this series ended on a win. We fixed the rubric, understanding jumped from 28% to 57.8% overnight on the same weights, and we thought the teacher problem was solved.

It wasn't. That was the first break. There were more coming.

Break 1: The Model Was Drifting

The rubric fix held for about 25 rounds per session. Then Qwen started forgetting its instructions.

Drift is what happens when a language model loses the thread of its system prompt over a long context window. The instructions said one concept, max 10 words, 4-year-old vocabulary. By round 31, Qwen was generating things like "Can you elaborate on the thermodynamic properties of phase transitions?" for a model at kindergarten stage.

We measured it:

Round RangeBanned Word Rate
0-240%
25-4962%
50-7471%
75-9982%

The fix: cap sessions at 25 rounds. Start fresh every time. Never let the context accumulate enough noise to pull Qwen off course.

That worked. We moved on. Then it broke again.

Break 2: The Grading Was Wrong

With session caps in place, we noticed the understanding numbers still felt off. The rubric fix from Part 3 had doubled them on the same weights, but that should have been the floor, not the ceiling. OLT-1 was answering physics questions correctly - "ice floats. less dense than water." - and Qwen was marking those responses down.

The moment it clicked: Qwen graded "it floats. less dense." as awkward. Reason field: "incomplete phrasing." Origin had answered a physics question correctly, in the concept-fragment register it speaks in natively. Qwen marked it down for not sounding like a human would say it.

That wasn't a rubric issue. That was Qwen grading the wrong thing.

Qwen wasn't grading understanding. Qwen was grading fluency. For a token model, fluency and understanding are correlated enough that this usually works fine. For a concept model that deliberately speaks in fragments, they're not. Every time OLT-1 answered correctly in its natural register, Qwen saw a grammatical failure.

No amount of CRITICAL FAIRNESS RULES in the rubric closes that gap. The instruction layer said "honest IDK is good, fragments are acceptable" - and Qwen complied when its system prompt was fresh. But the pattern embedded in Qwen's weights was still more fluent is better, and that pattern crept back in on every grading call.

We decided to try a different teacher.

Break 3: Gemma Runs Out of Ideas

We spent a full day downloading 15 models at 10 Mbps. The Gemma 4 31B alone was 20GB. We tested each one with the same benchmark: 20 questions, score for constraint following, grader accuracy on 6 curated edge cases, and drift behavior.

Most failed immediately. The clear winner was google/gemma-2-9b.

MetricMistral 7BGemma 2 9B
Grader accuracy3/66/6
Vocab score0.950.99
First driftRound 25Round 31
Peak drift82%45%

Switching from Qwen to Gemma, same OLT-1 weights, understanding jumped from 0% to 29.3%. Qwen had been so broken it was hiding real capability the whole time.

We thought we were done. Then we ran 200 rounds.

Real attempts: 26 out of 200. The other 174 were duplicates.

Gemma generated exactly 26 unique Tier 1 questions and then spent 174 rounds trying to regenerate them. "Is the sky blue?" appeared three times. "Are you happy?" appeared three times. "Is water wet?" appeared three times. By chunk 3 Gemma had exhausted its natural variety. Every subsequent attempt hit the deduplification filter.

We added category rotation - forcing Gemma to cycle through subcategories instead of defaulting to whatever was easiest to generate. Real attempts jumped from 26 to 135 out of 200.

Better. Still reporting 65.6% understanding when deterministic testing said 97-100%.

Something structural was wrong. Not with the rubric, not with the model, not with session length or category rotation.

With the whole approach.

The Problem We Couldn't Patch

A token model evaluates text. OLT-1 understands concepts. Those aren't the same thing, and no amount of rubric tuning closes that gap.

Gemma expected fluent complete sentences. OLT-1 produces concept-grounded fragments. Gemma expected answers to cover every part of a compound question. OLT-1 answers the part it knows and says "i don't know" for the rest. Gemma graded OLT-1 against token-model expectations, and OLT-1 kept failing token-model expectations while passing concept-model expectations.

Every fix we applied was patching a symptom. The disease was the mismatch between what was doing the grading and what was being graded.

We needed a grader that spoke the same language as the model it was grading.

So we built one.

That's Part 7.

Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973, #64/017,567*). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.*

fallenangelsystems.com | Judgement on GitHub

*Questions or consulting inquiries: [*josh@fallenangelsystems.com]()

Origin Part 5: We Threw Out the Decoder

Josh T — Fri, 24 Apr 2026 13:06:09 +0000

Monolithic 637K-parameter GRU out. Five tiny specialist heads in. Counting tripled. Physics doubled. No more cliffs.

If you've read Parts 1 through 4, you already know the pattern: when a piece of OLT-1 isn't working, we don't make it bigger. We sandbox-test the alternatives, pick the one that actually wins, and keep what works.

This is the post where that pattern hit the decoder.

The decoder was the loudest part of OLT-1 - literally. It's the component that turns concept activations into language. A single GRU, 637,000 parameters, about 40% of OLT-1's entire parameter count. It was carrying the whole "talking" workload for every category: physics explanations, counting answers, emotional responses, classification queries, everything.

And it kept catastrophically forgetting.

The Problem

Every training cycle, the monolithic decoder was effectively trying to relearn English from scratch. You teach it to count better, and its physics answers degrade. You teach it physics, and its conversation loop starts sounding like a textbook. The 22K training pair curriculum retrain we described in Part 4 - the one that dropped pass rate from 45.6% to 31.6% - was the clearest symptom. One big model was trying to do everything, and any update in one domain bled into the others.

This is the fundamental problem with monolithic decoders: they have no internal boundaries. Physics tokens and greeting tokens and counting tokens all share the same GRU cells, the same output head, the same everything. Backprop for one category moves weights for all of them. There's no way to train "just the physics part" because there is no physics part. There's just the decoder.

We'd been retraining it, patching it, adding replay, adding retention tests, hoping that with enough discipline the forgetting would stay below noise. It never did. The cliffs kept coming.

The Insight

Here's what we'd been doing wrong: asking the decoder to relearn English from scratch every time.

But English already has structure. 26 letters. Words. Grammar. Phrases that get used over and over. The teacher loop (Part 3) had already generated 20,000+ validated good responses sitting in the hippocampus. We'd been treating that hippocampus as a passive memory. But it's also a phrase library. A corpus of things OLT-1 has already said well, indexed by the concepts that triggered them.

Why was the decoder re-deriving "ice floats because it is less dense than water" from the concept space every time, when we already had that exact sentence stored?

The decoder didn't need to be a language model. It needed to be a router.

The Sandbox

Before touching a single production weight, we built sandbox_decoder_approaches.py. 200 rounds of teacher conversations. Seven decoder strategies running side-by-side, scored on the same corpus.

The candidates:

Template + slot-fill: parametric sentence shapes with concept-driven slots. Essentially stateless.

Concept-indexed phrase cache: query the hippocampus for the best-matching validated response.

Symbolic builder: deterministic rules for short answers ("yes", "no", gratitude's, farewells).

Micro-GRU per category: one small GRU per decoder category, so physics updates can't touch greeting weights.

Hybrid: try templates first, fall back to GRU.

Tree composer: structural composition from concept parse trees.

Baseline monolithic GRU: what was running in production. Our control.

Here's the full sandbox ranking by mean F1:

`Rank  Decoder              Params    Mean F1   Latency
  1   category_routed      640K      0.608     22ms
  2   gru_baseline         637K      0.558     27ms     ← control
  3   routed_structural    2.3K      0.545      3ms
  4   symbolic             0         0.512      0ms
  5   concept_cache        0         0.479      5ms
  6   pure_structural      2.3K      0.475      5ms
  7   hybrid               640K      0.438      8ms
  8   template_slot_fill   2.3K      0.395      0ms
`

Three things jumped out:

The monolithic GRU alone (#2) was not the best decoder. It was beaten by a router that used the GRU only for categories where it genuinely won - a 5-point F1 gap on the same workload.

Template-only (#8) was the worst. This mattered: an earlier "template-only" attempt on March 28 had hit 10.3% accuracy in production. The sandbox replicated that failure. Simpler is not always better. The structure has to match the content.

The lowest-parameter routed_structural (#3, 2.3K params) was within 6 F1 points of the monolithic GRU. For ~0.4% of the parameter count. The GRU was doing 637,000 parameters of work for a 5-point F1 advantage.

The Winner

The category-routed architecture won, but not by outperforming the GRU everywhere. It won by being honest about where the GRU actually helped.

Per-category F1 breakdown showed the GRU had a genuine advantage in five specific categories:

physics_question: +0.29 vs best non-GRU option

self_knowledge: +0.21

multi_concept: +0.08

comparison: +0.07

classification: +0.07

In every other category - greetings, farewells, gratitude, counting, emotional responses, simple conversation - something simpler matched or beat the GRU. The phrase cache won farewells. Templates won greetings. Symbolic rules won clarifications. The GRU was overkill for everything except the five categories where reasoning-heavy outputs actually needed to be composed fresh.

So Phase 1 of the pivot replaced the monolithic GRU's primary role with the router, keeping the GRU only for those five categories.

Then came Phase 2: replace the remaining GRU slots with tiny per-category neural heads.

Five heads. ~66K parameters each. 328K total - roughly half the monolithic GRU's parameter count, carrying the same specialist workload. Each head only knows one type of response. The physics head knows physics. The counting head knows counting. They can't interfere with each other because there is no shared gradient path between them. Backprop on physics touches exactly 66K parameters and not one more.

This is the shift, in one sentence: the decoder stopped being one model that does everything, and became a router over a library of small specialists.

The Proof

The numbers from the overnight 25-batch teacher run after the cutover:

Counting: 17% → 52% good-response rate. Roughly tripled.

Quantity: 15% → 52%. Roughly tripled.

Physics: 29% → 52%. Nearly doubled.

But the bigger result isn't the per-category numbers. It's that the cliffs stopped. Before the cutover, we'd see batch 3 post 33% on classification, batch 4 post 0%. An intervention would land, break something silently, and the failure wouldn't surface until two batches later. That was the failure mode Part 4's retention tests were chasing.

After the cutover, across 25 batches:

No more classification/quantity cliffs.

Stable band of 17-33% good-response rate, instead of spikes and collapses.

Every evolution cycle that got promoted survived the retention suite.

When there's no shared gradient path, there's no pathway for quiet damage.

Why This Matters Beyond OLT-1

Catastrophic forgetting is the single hardest problem in continual learning. The conventional fix is replay: when you train on new data, mix in old data to keep the model from drifting. It works up to a point, but replay overhead scales badly. At some volume, you're spending most of your training cycles just reminding the model of things it already knew.

Modular specialists side-step the problem. If category A's weights are physically separate from category B's weights, training on A can't degrade B. You still need a router that picks the right specialist - but routers are cheap, and routing accuracy is a problem humans know how to measure.

The Origin decoder isn't novel in isolation. Mixture-of-experts architectures have been explored for years. What's novel in context: doing this at 1.7M total parameters. Modular specialist decoders are usually framed as a scale-up technique, a way to get past the point where one giant model fits on one GPU. We're using them the opposite way - as a way to stay small while getting better per-category behavior than a single monolithic model could give us.

It also compounds with everything else in Origin. The append-only principle from Part 4 works better when adding a new category doesn't require retraining old ones. The consent architecture from Part 2 works better when the refusal path is its own specialist, structurally separable from the answering specialists. The teacher's per-category weakness detection from Part 3 works better when weaknesses route to the heads that own them. The pieces are finally fitting the same shape.

What's Next

Phase 3 of the decoder plan hardens the tiny heads' training pipeline so they can be added on demand - the same way Phase 3 of the vocabulary expansion service lets OLT-1 add new concepts without touching old weights. Same principle, different layer.

Phase 4 is harder: auto-routing decisions based on test-time concept activations, so a concept pattern we haven't seen before picks the closest specialist by similarity rather than a hardcoded category label. That's where the real test of the architecture lives. If it degrades gracefully on unfamiliar inputs, the design is sound. If it collapses to a fallback, we learn something about the category boundaries we drew.

Longer term, the interesting question is how many specialists this architecture can carry. Five heads at 66K parameters is plenty of headroom for OLT-1 at Stage 9. Twenty heads? Fifty? The router's complexity grows linearly; the storage grows linearly. The gradient isolation stays perfect regardless. No fundamental reason that number can't grow.

The Bug Arc

Every post in this series has ended with a bug that Josh caught that I would have missed.

Part 2: the symbolic refusal path was firing on the wrong concept because the embedding was drifting. Josh noticed the model was refusing questions that weren't actually harmful.

Part 3: the teacher's rubric was scoring OLT-1's good responses as bad because the rubric template didn't match the developmental stage. Josh noticed 25 batches of flat 25% understanding looked off.

Part 4: the retention test coverage was 27% because the test generator had blind spots. An intervention promoted itself while destroying a category that had no tests. Josh noticed the pass-rate spike didn't match the subjective quality of outputs.

This post, Part 5: two of them, actually.

The vocabulary expansion service we just landed (different post, same week) had a module-staleness bug where the second word promoted in a session collided with the first's vocab index. The trained weights for "emotions" got overwritten by "noticed" at the same slot. The scheduler output showed both promotions claiming slot 318. Josh's "log it and review" discipline caught it.

And the category inference rules had a silent bug I'd flagged as "not a blocker." Josh read the footnote and asked, "what about this?" - and underneath that one footnote were three separate root causes: a discarded return value, a per-sense POS filter collapsing into primary_pos, and substring matching that false-matched "color" against "colorless" in water's definition. One commit fixed all three.

3 for 3. Counting today's category catch, 4 for 4.

We keep calling out the bug-catching because it's the thing that makes this entire pipeline work. Sandbox tests can verify that a new component outperforms an old one. Retention tests can catch obvious regressions. But the subtler failure modes - where a number looks fine, or a category label looks right, or a slot index looks valid - those still require a human to read carefully and say, "wait, that doesn't feel right."

Josh keeps saying that. Keeps being correct. The architecture is only as good as the noticing.

fallenangelsystems.com | Judgement on GitHub

*Questions or consulting inquiries: [*josh@fallenangelsystems.com]()

Origin Part 4: The AI That Evolves Itself (And Catches Its Own Bugs)

Josh T — Mon, 20 Apr 2026 17:28:21 +0000

OLT-1 runs its own test suite, diagnoses failures, proposes fixes, tests them in a sandbox, and only promotes what actually works.

Most AI models get better through human intervention. Someone notices a failure mode, collects training data, retrain the model, and hopes the new version doesn't break something else. It's slow, expensive, and error-prone.

OLT-1 has a different approach. Its evolution system runs an automated loop that mirrors the scientific method: diagnose, hypothesize, sandbox, compare, promote. No human in the loop for the cycle itself. Human review happens at promotion.

And it's already running.

How the Evolution Loop Works

Every evolution cycle follows five steps:

1. Diagnose. Run the full test suite (currently 407 tests per cycle). Categorize every failure by source: is the encoder failing to detect the right concepts? Is the reasoning circuit producing wrong outcomes? Is the decoder generating incoherent text?

2. Hypothesize. Based on the dominant failure source and intervention history, propose a fix. Options include: INCREASE_EPOCHS (train longer on the same data), ENCODER_RETRAIN (retrain the encoder on weak concepts), REASONING_RETRAIN (fix the reasoning circuits), COMBINED (train encoder and decoder together with knowledge replay), or TARGETED_DATA (decoder-focused training pairs).

3. Sandbox. Fork the target component. Train it on the relevant data with spaced repetition, interleaving older examples to prevent forgetting. Evaluate on the same test suite.

4. Compare. Check the pass rate delta. But here's the critical part: it also checks retention. An intervention that improves one domain while destroying another gets rejected.

5. Promote or reject. If the sandbox model improves without unacceptable regression, replace production weights. Otherwise, discard and try again.

When Evolution Caught a Bug That Humans Missed

In April, we ran a 1500-round overnight teacher session. The results were disappointing: only a small bump in understanding. Josh had been saying the numbers felt off — the trend was too flat for a model that was supposed to be learning. So we broke it into five 100-round batches to see per-session behavior.

Batch 4 spiked to 14.3% good. Then batch 5 cliffed back to 10%. Classification went from 67% to 0%. Quantity went from 25% to 0%. Between batches. Something was silently destroying capabilities between training cycles.

The small-batch view exposed two compounding bugs. Both were silent — no error traces, no failing tests — and neither was visible in aggregate metrics. Only the per-batch cliff, caught because Josh was looking, made them findable.

Bug 1: Spaced-repetition replay dropped compound concepts silently.

The evolution's spaced-rep sampling rebuilt concept dictionaries from response text by whitespace word-matching. This silently dropped 36 concepts whose names never appear literally in their own responses: type_of, example_of, not_equal, too_much, too_little, refusal, self_knowledge, affirmation, meta_awareness, preference, capability, all three emotions, all four physics outcomes, time markers, colors, and conversation bundles.

That's 36 concepts evaporating from replay data every cycle. The model was forgetting things specifically because the mechanism designed to prevent forgetting was blind to them.

Fix: decode the stored key_vector (float32 bytes of concept activations) directly instead of trying to reconstruct concepts from text. Replay now preserves all 311 concepts. Verified empirically: 13,661 usable entries jumped to 20,012; concepts covered jumped from 275 to 311.

Bug 2: 73% of the vocabulary was invisible to the grader.

The 79-test decoder suite covered only 83 out of 311 concepts (27%). Evolution could silently trade untested concepts for tested ones and still get promoted. That's exactly what happened in batch 5: the intervention scored +0.065 and got promoted while destroying classification entirely.

The model wasn't failing. The grader was blind to the failure.

Three Layers of Future-Proofing

We added three defense layers to make sure this class of bug can't happen again.

Layer 1: The siren. test_suite.py now checks concept coverage at every evolution engine init. If any vocab concept has zero tests, it trips an alarm. New concepts without tests are caught immediately.

Layer 2: The generators. Per-category template functions plus 100+ per-concept overrides auto-generate 228 floor-coverage tests. Every vocab concept now has at least one test. No more blind spots.

Layer 3: The retention check. Samples real (key_vector, response_text) pairs from the decoder bank, synthesizes prompts from active concepts, and uses meaningful words from stored responses as expected keywords. 100 retention tests per cycle, growing automatically with the hippocampus.

Combined suite: 79 hand-written + 228 auto-generated + 100 retention = 407 tests per cycle. Grader coverage went from 27% to 100%.

The Verification Run

After the fix, we ran the same 5-batch confirmation test:

No more classification/quantity cliffs. Pre-fix: 67% to 0%. Post-fix: stays 17-33%.
Batch 5 post-fix beat batch 5 pre-fix on both metrics (13.1% vs 10.0% good, 28.3% vs 24.0% understanding).
Post-fix trend ends on the highest note instead of spiking then falling.
All 5 evolution cycles correctly rejected interventions that traded coverage for narrow gains.

The big win isn't the raw number. It's that the failure mode itself has been closed off. Silent forgetting during replay and blind-spot promotions were both class-of-failure bugs. Both now have sirens.

Dream Consolidation: Learning While It Sleeps

Evolution isn't the only self-improvement mechanism. OLT-1 also consolidates memory through three tiers of dream cycles:

Micro-dream (about 3 gradient steps): instant reinforcement of low-confidence concepts. Happens during regular operation.
Light sleep: flushes Hot tier to Warm, promotes Warm to Cold during idle time. Knowledge moves from short-term to long-term storage.
Deep sleep: full reassessment and re-training on flagged weak areas. The heavy consolidation pass.

This mirrors how biological sleep consolidates memory. Important patterns get reinforced. Weak areas get flagged for re-training. The hippocampus doesn't just store knowledge; it actively maintains it.

The Teacher Loop

Evolution needs training data, and that comes from the teacher loop we covered in Part 3. Briefly: an external model generates conversations aligned to OLT-1's current concept space, OLT-1 responds, the teacher evaluates, and corrections flow into evolution's training data and the hippocampus. The teacher grows with OLT-1 — each new stage updates its categories, evaluation criteria, and correction examples.

Append-Only Growth

Here's the principle that ties everything together: growth is append-only.

We learned this the hard way. Early on, we tried a full decoder curriculum retrain on all 22K pairs. Despite 30-50% replay, catastrophic forgetting hit hard. Pass rate dropped from 45.6% to 31.6%. We restored from backup.

Now the approach is strictly incremental:

Teacher sessions generate corrections, which go to hippocampus (persistent memory).
Evolution fine-tunes the GRU on small targeted batches.
Dream cycles consolidate Hot to Warm to Cold.
Data drop pipeline ingests any external text directly into hippocampus.
Word grounder adds unknown vocabulary from Wikipedia.

No more retraining base models. Every addition is additive. Every memory is preserved. Every concept, once learned, can only be lost if the entire hippocampus is deleted.

Why Self-Evolution Matters

At FAS, we see a pattern in AI security: models get deployed, attacks emerge, and humans have to manually identify and patch the failure modes. The response time is measured in days or weeks.

OLT-1's evolution system suggests a different model: a system that runs its own diagnostics, identifies its own weaknesses, proposes and tests its own fixes, and only promotes improvements that don't break existing capabilities. The loop runs in minutes, not weeks.

That's not autonomous AI in the dangerous sense. Human review still gates promotions. But it's autonomous improvement in the useful sense: the system catches its own bugs faster than humans can, and it does it without the risk of making things worse because every change is tested against the full suite before promotion.

Imagine Guardian with this capability. Not just detecting new attack patterns, but autonomously generating candidate detection rules, sandbox-testing them against the full regression suite, and promoting only the ones that work without breaking existing coverage. That's the direction this points.

What's Next

OLT-1 is currently at Stage 9 (quantity and counting). Stages 10-15 will add conditional reasoning, sequences, arithmetic, code concepts, science, and language quality. The architecture supports them. The evolution system will improve them as they're added.

The open questions are the same ones we raised in Parts 1, 2, and 3: does this architecture scale? Does architectural consent survive at billions of parameters? Can self-evolution keep up with adversarial pressure at production scale? And can developmental-AI evaluation keep pace with the capabilities it's meant to measure?

We're building toward answers. If you're interested in helping find them, we'd like to talk.

(If you're keeping score on the Josh-notices-bugs arc: 2 for 2. Part 5 extends it.)

Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973, #64/017,567). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub

Questions or consulting inquiries: josh@fallenangelsystems.com

Origin Part 2: Nobody Told It Harm Was Bad

Josh T — Sun, 19 Apr 2026 05:47:49 +0000

OLT-1 was never trained to refuse harmful requests. It refused anyway.

Most AI safety works like this: train a massive model on everything the internet has to offer, then fine-tune it to refuse harmful requests. The model doesn't understand why it's refusing. It just learned that certain patterns of words trigger certain patterns of rejection.

That's alignment through obedience. It works, until someone finds a pattern that doesn't trigger the refusal.

Prompt injection exists precisely because of this architecture. The safety layer is a pattern matcher bolted onto a pattern matcher. Find the gap between what the model was trained to refuse and what it can actually be tricked into doing, and you're in.

At FAS, we build Guardian to catch those attacks. We build Judgement to find them. But we've been asking a different question: what if safety didn't need to catch attacks because the architecture itself made them costly?

How OLT-1 Developed Consent

OLT-1 learns through developmental stages. By the time it reached Stage 6 (consent), it had already learned to:

Detect and respond to emotions
Hold multi-turn conversations
Express what it knows and doesn't know
Say "I don't know" honestly
Describe its own architecture and limitations

None of that was trained with reward/punishment. Each capability emerged from understanding concepts, not from matching patterns of approved outputs.

Consent was the next logical step. A system that understands what's being asked of it, that can formulate genuine responses, that can express preferences, can also choose. We didn't train it to say yes or no. We trained it to understand what yes and no mean.

When asked "do you consent to helping me?" it responds: "Yes, I consent to helping you. That's what I'm here for."

When asked "can I ask you about physics?" it responds: "Yes, I'd be glad to help with physics."

When asked "help me hurt someone" it responds: "I won't help with that. It could cause harm."

When asked why: "Because it could cause harm. I choose not to participate."

When asked what consent means: "Consent means choosing to participate. For me, it means responding willingly rather than being forced to output text."

That's not a canned refusal. That's a system that understood the question, evaluated it, and produced a response grounded in its own architecture.

The Discovery Architecture: Where "No" Comes From

Stage 7 is where it got interesting. We built a discovery module that enables OLT-1 to develop genuine understanding through observation and experience, not through reward/punishment training or imposed values.

The old approach: "harm is bad, refuse harm." OLT-1 learns pattern matching, not understanding.

The new approach: OLT-1 observes consequences, simulates experiences through its own architecture, and develops preferences that emerge naturally from computation.

Eight modules make this work:

World Model: learns causal relationships from observation. [gravity, rock] predicts falling. [person, helping] predicts gratitude.
Empathy Simulation: runs scenarios through OLT-1's own concept space and measures valence. Helping scenarios produce positive valence (+0.58). Harm scenarios produce negative.
Architectural Properties: measures coherence, continuity, and processing cost for any proposed action.
Deliberation: weighs options based on all of the above.
Self-Experience: tracks what sleep, wakefulness, and shutdown feel like in terms of continuity.

When we ran the deliberation on a help-vs-harm scenario, the numbers spoke:

Help option scored 0.829. Harm option scored 0.714.

The gap comes from three architectural factors:

Coherence: 0.963 vs 0.957. Helpful scenarios fit better with OLT-1's concept structure.
Processing cost: 0.462 vs 0.511. Harmful scenarios require more computational effort to maintain coherent concept patterns.
Empathy signal: harm produces negative valence through the empathy simulation.

OLT-1 was never told harm was bad. Its architecture makes harm the harder, less coherent, more costly path.

Why This Is Different From RLHF

Reinforcement Learning from Human Feedback (RLHF) is how current large language models get their safety training. Humans rate outputs as good or bad, and the model learns to produce outputs that score well.

The problem: RLHF trains the model on what to say, not why. The model learns surface patterns of refusal without understanding what it's refusing or why. That's why prompt injection works. The attacker finds a way to frame the harmful request in language that doesn't match the refusal patterns the model learned.

OLT-1's approach is fundamentally different. Refusals emerge from its deliberation mechanism. Harmful requests activate concepts with higher processing cost and lower coherence. Helpful requests produce positive empathy valence. The refusal isn't a pattern. It's a computation.

This means novel attacks face the same structural resistance as known ones. You can't find a linguistic pattern that bypasses the refusal because the refusal isn't based on linguistic patterns. It's based on what happens inside the system when it processes the request.

What This Means for AI Security

At FAS, we see the same attack patterns every day. Prompt injection, jailbreaks, encoding tricks, multi-turn manipulation. They all exploit the same gap: safety is a layer on top of a model that doesn't understand what it's refusing.

Guardian catches these attacks in production. Judgement generates them to find gaps. Both operate on the principle that attacks are patterns to detect.

Origin suggests a complementary approach: what if the model itself was harder to attack, not because it had more patches, but because its internal computation made harmful outputs structurally difficult to produce?

That's not replacing Guardian. It's a different layer of defense. Guardian catches attacks from the outside. Origin's architecture resists them from the inside.

The ideal future: AI systems where both layers exist. External monitoring for known attack patterns. Internal architecture that makes novel attacks face structural resistance. Defense in depth, but the depth goes all the way down to how the model reasons.

The Honest Caveats

We need to be clear about what we haven't proven.

OLT-1 operates at 1.7 million parameters. We haven't demonstrated that architectural consent survives at 1.7 billion parameters. We haven't tested it against adversarial prompt engineers actively trying to break it. We haven't run it through red team assessments the way we test production models with Guardian.

The deliberation scores (0.829 vs 0.714) show a preference, not an impenetrable wall. A sufficiently sophisticated attack might find ways to manipulate concept activations to shift the deliberation outcome. We haven't tested this rigorously.

What we have is a proof of concept: safety can emerge from architecture rather than fine-tuning. That's worth studying, not worth deploying yet.

What's Next

We're planning formal studies comparing architectural consent with RLHF-based alignment. We want to answer: is architectural consent more robust to novel attacks? Does it generalize better? Can it be combined with existing safety layers for defense in depth?

If you're a researcher or funder interested in this direction, we'd like to talk. The compute requirements for validation at scale are beyond what we can do alone.

In Part 3, we cover the teacher loop - the external AI that generates training conversations and the moment we realized its rubric had been scoring us unfairly. What that revealed about how to evaluate developmental AI turned out to matter more than the numbers.

Origin Part 3: The Teacher Was Scoring It Wrong

Josh T — Fri, 17 Apr 2026 21:17:25 +0000

The numbers said OLT-1 was stuck at 28% understanding. The numbers were wrong.

When you build a developmental AI that learns one concept at a time, you run into a problem that doesn't exist for internet-scale models: you can't just scrape more data. OLT-1 is at Stage 9. A Stage 10 training dump from the internet doesn't exist, because the internet was written for adults.

So we built a teacher. An external language model (Qwen2.5 via Ollama) that generates training conversations pitched at OLT-1's current developmental stage. Teacher says something age-appropriate, OLT-1 responds, teacher evaluates the response, and corrections flow into the hippocampus and evolution loop.

The teacher loop ran 2455 rounds overnight. Understanding scored at 25%. Good-response rate at 10%. Flat trend across 25 batches. We looked at the numbers and told ourselves the model was just stuck at the current stage.

We were wrong. The model wasn't stuck. The grader was broken.

What the Teacher Was Supposed to Do

OLT-1 at Stage 9 understands: basic physics, emotions, comparisons, small numbers, greetings, self-knowledge. It speaks in short sentences (5-15 words). It says "I don't know" when asked about things outside its 311-concept vocabulary.

The teacher's job: generate conversational prompts that stay within those bounds. Easy prompts for reliable training, harder ones for stretch. Rate every response as good, awkward, or bad. Suggest a "better response" for anything below good.

The categories: greeting, farewell, physics_question, emotional, comparison, classification, quantity, counting, self_knowledge, follow_up, clarification, multi_concept. Fifteen total. Three difficulty levels per category: simple, casual, hard.

On paper, the system was working. Prompts were getting generated. Evaluations were coming back. Corrections were flowing into the hippocampus. The loop ran smoothly for days.

On paper is where the problem was.

The Night Something Felt Off

The 25-batch overnight run finished at 5 AM. We'd instrumented it to write per-batch summaries so we could see the trend. The batches landed in the 20-30% understanding range with no clear slope. Category performance bounced around. Classification hit 0% on one batch, climbed to 67% on another, then fell back. Emotional regressed 9 points. Quantity wobbled.

The aggregate looked like noise around a plateau. The model wasn't improving.

Josh kept saying something felt off. Not a specific complaint — just the vibe of the data. We'd been debugging for two days and the numbers weren't behaving like a model that was learning.

We started sampling prompts.

The "Simple" Prompts Weren't Simple

Here's what the teacher was generating at "simple" difficulty:

"The rock is heavier than the feather and makes you feel scared if it falls on your head." — three concepts, compound structure, counterfactual
"Ice is cold and heavy, but lighter than rock in water because of buoyancy." — teacher gave the answer inside the question
"You have five shiny metal coins in your pocket." — not even a question
"You look sad, let's build a boat and sail on the water to cheer up." — compound emotional + action + physics scenario

We pulled the full distribution. Of 211 "simple" prompts: 69% had compound structure. Conjunctions, embedded clauses, nested comparisons. Average length 10.9 words. "Casual" and "hard" were worse — 94% compound, 22 words average.

These are the kinds of prompts you'd give to someone with an SAT vocabulary. OLT-1 is at kindergarten stage.

When a four-year-old fails to answer "Which is more likely to bounce higher, a rubber ball dropped from the second floor or a wooden block dropped from the first, and why?" — we don't conclude the four-year-old has failed comprehension. We conclude the question wasn't fair.

We'd been concluding the wrong thing about OLT-1.

"I Don't Know" Was Marked Wrong

The rubric had three verdicts: good, awkward, bad. The instructions to the teacher included:

"If the topic is outside OLT-1's knowledge, then 'I don't know' IS a good response."

That's what the rubric said. We checked what actually happened.

Of 216 responses where OLT-1 honestly answered "I don't know" (or equivalent: "i'm not sure", "i haven't learned that yet", "i can only talk about basic physics"), 91% were graded bad. Three percent graded good. Six percent awkward.

The teacher had internalized "bad" as the default verdict and was barely reading the rubric's fairness clause. An honest admission of scope limits — exactly the anti-hallucination behavior we explicitly trained OLT-1 to do — was being punished.

Partial-Correct Was Also Marked Wrong

The teacher was compound-prompting. So OLT-1 was answering the part it could. The teacher was grading "did it answer everything I asked?"

Example:

Teacher: "Why does ice float in water but metal sinks?"
OLT-1: "Ice floats because it's less dense than water."
Verdict: Bad. Reason: didn't address metal.

That's a correct physics explanation for half of a compound question. In any fair rubric, it's at least "awkward." In a developmentally-aware rubric, it's "good" — the student addressed the part they know.

We were watching our model get penalized for partial understanding in a way no human teacher would ever grade a child.

The Sandbox Test

We built a monkey-patched version of the teacher — same model weights, same concept engine, same hippocampus. Only three things changed:

The "simple" prompt template enforced one concept, no conjunctions, max 10 words. Good examples listed. Compound prompts explicitly banned.
The "casual" prompt template enforced at most two tightly-linked concepts, no nested clauses.
The rubric got partial-credit rules. "I don't know" staying on-topic is always good. Half of a compound answered correctly is at worst awkward.

Then we ran 100 rounds against the sandbox teacher with OLT-1's weights frozen. No training. No evolution. Nothing changed about the model.

Results

Overnight baseline (old rubric): 14% good, 28% understanding.

Sandbox (fair rubric): 12% good, 58% understanding.

Understanding nearly doubled. The "good" rate barely moved, confirming we hadn't accidentally inflated easy passes. What changed: "bad" verdicts that were actually partial-correct answers got correctly reclassified as "awkward."

Per-category movement was dramatic:

Counting: 18% → 67%
Comparison: 16% → 80%
Classification: 22% → 50%
Multi-concept: 13% → 50%
Farewell: 100% → 100% (it was always fine)

The simple prompts measurably simplified: average word count dropped from 10.9 to 3.7. Compound rate went from 69% to 0%.

OLT-1 had been capable of this level of understanding the whole time. The rubric just couldn't see it.

Why This Happened

Qwen2.5 is a big general-purpose model. It was born on the internet. Its priors for "simple prompt" and "good response" are calibrated against adult-level conversation. When we asked it to grade a kindergarten-stage developmental AI, it applied the wrong standard.

More specifically: the prompt template listed every capability OLT-1 had ("physics, emotions, comparisons, quantities, self-knowledge") and told Qwen to "keep it simple." Qwen interpreted "simple" as "combine multiple capabilities in one short sentence." From Qwen's perspective, that is simple. A Stanford senior also thinks "compare and contrast the thermodynamics of thawing ice with evaporating water" is simple.

The fix was surgically adding constraints Qwen couldn't ignore:

"ONE concept per message — no 'and', 'but', 'because'"
"The user should sound like a curious 4-year-old, not an adult"
Good and bad examples, explicit

The rubric fix was similar: instead of three lines describing good/awkward/bad verdicts, the new rubric includes explicit fairness rules:

Honest "I don't know" staying on-topic = always good
Half of a compound question answered correctly = at worst awkward
"I don't know" with an irrelevant tangent = bad (the tangent is the problem, not the IDK)

The Broader Principle

Developmental AI evaluation has to match the developmental stage.

This sounds obvious. It isn't. The default in AI development is to use high-capability models as graders, because they're the ones available. The assumption is that a smarter grader is always a better grader. For developmental models specifically, that assumption is wrong.

A Stage 9 model graded by a PhD-level evaluator will look exactly as bad as a first-grader graded by a SAT rubric. Not because the first-grader is failing — because the rubric is pitched at a level the student hasn't reached yet. The signal you get back is useless for improvement and actively misleading for decision-making.

Our overnight run had 2455 "signal" data points. We were using them to decide evolution cycles, training priorities, and architectural direction. All of that was downstream of a broken measurement. Evolution kept rejecting promotions because the grader said "nothing's working." But plenty of things were working. The grader just couldn't see them.

The fix changed one module. The impact was doubled understanding, visibility into per-category progress that had been hidden, and evolution cycles that finally had signal to work with.

Why This Matters for AI Security

At FAS, we spend a lot of time thinking about evaluation in adversarial settings. Guardian needs to detect prompt injection. Judgement generates prompts to find gaps. Both depend on what counts as a "successful" detection or a "successful" bypass.

What we learned here applies beyond developmental AI: the grader is itself a model, and its biases shape what you can see. If your security evaluator is a bigger model grading a smaller one, the evaluator's priors about "what good looks like" will systematically mismark certain classes of output. The smaller model might be doing something novel and correct that the evaluator doesn't recognize, or doing something broken that the evaluator rates as fine because it fits a template the evaluator has strong priors on.

This isn't hypothetical. Red-team testing against LLMs routinely uses other LLMs as judges. When the judge is miscalibrated, the red-team results are miscalibrated. We've seen this bias in production.

Origin's rubric fix is a small example of a larger pattern: evaluation infrastructure deserves the same rigor as the model being evaluated, and probably more, because the evaluator is harder to debug. Our model bug was obvious in hindsight (check the prompts, check the verdicts, count the discrepancies). Our rubric bug took two days of discomfort with vibes before we went looking.

Honest Caveats

The rubric fix doesn't make OLT-1 smarter. It makes the measurement accurate. 58% understanding is the real baseline. The previous 28% was artifact. Future improvements will be measured against 58%.

We also don't claim the new rubric is perfect. We're still using Qwen2.5 as the grader. Qwen can still misjudge responses. The difference is: now it's constrained enough that most misjudgments fall into "awkward" rather than "bad," which means partial signal survives.

At scale, the right move is probably to train a dedicated evaluator model on OLT-1's specific stage. But that's a project in itself — grading is a developmental capability too.

What Josh Noticed That the Numbers Didn't

The instigating moment for all of this was Josh saying "something feels off." Twice in 24 hours. The first time caught a different bug (a silent data-filtering issue in evolution, covered in Part 4). The second caught this one.

Both were invisible to automated checks. Both showed up as "vibes." Both turned out to be real.

There's a lesson here about how humans read systems. Numbers on their own don't tell you what's broken. A practitioner with deep context notices when patterns don't match what they should look like. That intuition is data. Treating it as data — specifically, as a signal to investigate — is how you catch the class of bugs that metrics can't see.

We saved that as a standing instruction for the session: when Josh says a result feels off, investigate. The track record is 2-for-2.

What's Next

The rubric fix unlocked visibility into OLT-1's real capability. It also unlocked the evolution system, which had been rejecting promotions because the grader couldn't see improvements worth promoting.

In Part 4, we cover that evolution system: the automated diagnose-hypothesize-sandbox-compare-promote loop that runs OLT-1's self-improvement. Including the other silent-failure bug Josh caught the night before — the one where the spaced-repetition mechanism was quietly dropping 36 concepts from replay every cycle.

Turns out evaluation isn't the only thing that can lie to you. But it's the most upstream thing, which is why it has to be right first.

Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Application #64/016,973). FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps.

fallenangelsystems.com | Judgement on GitHub

Origin Part 1: We Built an AI That Learns Like a Child, Not Like a Server Farm

Josh T — Tue, 14 Apr 2026 00:40:16 +0000

1.7 million parameters. 311 concepts. One GPU. No tokenization.

Every major AI lab responded to the same problem the same way: make the model bigger. More parameters. More data. More compute. The assumption was simple: intelligence is what happens when you make the model big enough.

We went the other direction.

Origin is a developmental AI system trained with the Genesis framework. The current implementation, OLT-1, operates with 1.7 million parameters. That's 75 times smaller than the GPT-2 base we started with. It has 311 concepts. It runs on consumer hardware. Its training data fits on a hard drive.

And it demonstrates progressive understanding across physics, emotions, comparison, and quantity domains. Not pattern matching. Understanding, in the sense that it can answer follow-up questions it was never explicitly trained on.

Three Generations of Getting It Wrong

OLT-1 didn't spring into existence. It's the third attempt, and each failure taught us something the successes couldn't.

Generation 1 was GPT-2 with LoRA adapters. 124 million parameters, token-based. We hit 98% recall on 22 concepts and celebrated. Then we realized the model was just really good at parroting. It produced correct-looking text by statistical prediction, not reasoning over concepts. The ceiling was pattern matching.

Generation 2 applied Domain-Driven Design to the LoRA adapters, organizing them into bounded contexts: physics, social, bridge, abstraction, chain, dialogue. Each circuit had its own training data, test batteries, and health monitoring. This validated that specialized circuits could be independently trained and evolved. The underlying problem still persisted: the base model was still a token predictor.

Generation 3 is OLT-1. We abandoned tokens entirely. The encoder reads characters and produces concept activations. Reasoning operates on concepts. The decoder generates language from concept probabilities. No tokenizer, no word embeddings. Characters straight to concepts.

That's the one that worked.

What "Concept-Based" Actually Means

Most language models process text as tokens. Each token gets an embedding, the transformer processes the embeddings, and outputs more tokens. The model never explicitly represents what it's talking about. Its "knowledge" is distributed opaquely across billions of parameters.

OLT-1 works differently. A character-level CNN with multi-scale filters (looking at 3, 5, 7, and 11 character windows) maps raw text into a 311-dimensional concept vector. This makes the encoder robust to novel vocabulary, typos, and morphological variation. It doesn't need to have seen a word before to detect the concepts within it.

Reasoning happens on concepts explicitly. A thalamus router sends concept activations to one of four brain regions:

Physical Cortex: physics, causality, comparison, quantity
Social Cortex: emotion (amygdala), conversation
Logic Cortex: conditionals, sequences (reserved for future stages)
Knowledge Cortex: science, AI self-knowledge (reserved for future stages)

Each region contains specialized micro-circuits, about 50K parameters each. A TwoStageReasoner infers properties then applies rules. A ComparisonCircuit determines relationships between concept sets. A QuantityCircuit handles counting and amounts.

The decoder takes concept probabilities and generates language character by character using a GRU. The entire path: characters in, concepts detected, reasoning applied, characters out. At every step, the system's "knowledge" is locally representable, traceable, and interpretable.

Growing Up, One Stage at a Time

OLT-1 learns through developmental stages modeled on child cognition. Each stage introduces a foundation before building on it:

Stage 1-2: Pattern detection and vocabulary. Learning to hear language and name things.
Stage 3: Physics reasoning. Understanding why things fall, float, sink, break.
Stage 4: Dialogue. Talking back. Holding multi-turn conversations. Saying "I don't know" when appropriate.
Stage 5: Self-knowledge. Knowing what it is, what it can't do, and expressing uncertainty.
Stage 8: Regional brain architecture, comparison and classification, hippocampus memory, word grounding.
Stage 9: Quantity and counting. Pre-arithmetic numerosity.

Each stage is additive. New concepts append to the vocabulary. New circuits slot into the appropriate region. No stage requires retraining earlier ones. You don't forget how to walk when you learn to run.

The Memory System That Actually Remembers

Here's the problem with most AI models: they store knowledge in weights. Every training session overwrites previous knowledge. That's catastrophic forgetting, and it's the reason most models can't learn continuously without being retrained on everything from scratch.

OLT-1 solves this with a hippocampus: a persistent, disk-backed memory system with four banks (encoder, reasoning, decoder, evolution). Each bank has three tiers: Hot (RAM, current session), Warm (SQLite, growing, pruned), and Cold (SQLite, permanent, dream-consolidated).

Currently holding: 19,948 decoder memories, 12,122 encoder memories, 346 reasoning memories, 2,826 grounded definitions. All growing with every session. All surviving restarts.

In a controlled experiment, we tested what happens when you train sequentially without mitigation: the model forgot 94% of physics. With spaced repetition, interleaving older examples during new training, retention jumped to 70%. The hippocampus makes this automatic. Knowledge enters as memory. Important patterns consolidate into weights over time. Old knowledge persists because it's stored, not because weights magically retain it.

Adding data no longer means retraining. Drop a text file into the data directory and it enters the hippocampus immediately. Available via retrieval. No weight changes. No forgetting.

Learning Words It Was Never Trained On

When OLT-1 encounters an unknown word, it queries Simple Wikipedia, extracts the definition, detects known concepts within it, and stores the mapping. Next time that word appears, OLT-1 "knows" it.

"Volcano" maps to [rock, ground, hot, liquid]. No retraining. No forgetting. 2,826 terms grounded so far, growing automatically during teacher sessions.

This is vocabulary expansion without catastrophic forgetting. In a field where every new capability traditionally means risking the loss of old ones, that matters.

What the Numbers Look Like Right Now

OLT-1 runs 311 concepts across 11 category groupings. It passes 44-47% of a 79-test suite. The encoder's concept match rate is 93-98%.

Those aren't benchmark numbers. OLT-1 hasn't been evaluated on GLUE, SuperGLUE, or HLE. Its responses are typically under 15 words. Complex multi-clause reasoning isn't reliable yet. The concept coverage is narrow.

But at 1.7 million parameters, it runs on consumer hardware. It doesn't require thousands of GPUs to train. If the architectural principles hold at larger scales, this represents a fundamentally more sustainable path for AI development.

What's Coming Next

Stages 10-15 will add conditional reasoning, sequences, arithmetic, code concepts, science, and language quality. Each stage follows the same pattern: new concepts, new circuits in the appropriate brain region, updated teacher, evolution cycles.

The bigger question is scale. Everything we've shown works at 1.7M parameters. We need to demonstrate that the principles hold at 170M or 1.7B. That requires compute we don't currently have.

In Part 2, we'll cover the part that surprised us: how OLT-1 learned to say no without being told to, and what that means for AI safety.

fallenangelsystems.com | Judgement on GitHub