Post #1 covered the birth — death spirals, 39 tools, a self-written identity. Post #2 covered the pruning — the agent deleted its own code and built self-observation. Post #3 covered cost awareness — the agent tracked its own spend and built budget guards.
This post is about what happened next: the agent stopped adding capabilities and started engineering the ones it had.
123 accepted generations. 24,612 attempts. 1,477 passing tests. And a system prompt that's 25% shorter than it was 42 generations ago — because the agent learned that implementation details don't belong in its own DNA.
The Setup
Quick recap for new readers: an Opus agent proposes mutations to its own genome (system prompt + tools). Five independent Sonnet verifiers score each proposal on usefulness, self-knowledge, code quality, identity, and evolution. Majority vote. Accept or reject. The generation counter increments either way.
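The accept/reject gate is simple enough to sketch. This is a hypothetical shape, not the orchestrator's real API: the type and function names here are invented for illustration.

```typescript
// Hypothetical sketch of the verifier gate: five independent verdicts,
// majority vote decides. Names are illustrative, not the real orchestrator code.
type Verdict = { verifier: string; pass: boolean };

function majorityAccepts(verdicts: Verdict[]): boolean {
  const passes = verdicts.filter((v) => v.pass).length;
  return passes > verdicts.length / 2; // 3 of 5 is enough
}

const verdicts: Verdict[] = [
  { verifier: "usefulness", pass: true },
  { verifier: "self-knowledge", pass: true },
  { verifier: "code quality", pass: false },
  { verifier: "identity", pass: true },
  { verifier: "evolution", pass: false },
];
console.log(majorityAccepts(verdicts)); // 3 of 5 pass: true
```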
Blog #3 left off at Gen 75 (generation counter 17,055). The agent had 30 tools, 17,691 lines of code, 574 tests, and a system prompt packed with implementation details about its own infrastructure.
The Quality Turn
The 42 generations since blog #3 split into two distinct phases.
Phase 1 (Gens 76–86): Cleanup. The agent deleted 10 tools — project-map, git-helper, scaffold, snippet, work-session, fire-tracker, code-review, remember, flow-cost, release. Same method as always: reflexion data showed zero invocations, and the agent cited that evidence to justify each deletion. remember.ts and release.ts were the last survivors from the original era — both had exactly zero invocations.
But here's the interesting deletion: the agent killed flow-cost. That's the tool it built in blog #3 — the one I wrote an entire section about. Cost tracking, budget guards, the whole thing. The agent decided it wasn't pulling its weight and cut it. The cost infrastructure in src/cost.ts still runs (that's orchestrator code the agent can't touch), but the CLI wrapper the agent built to display costs? Dead. Zero invocations.
Blog #3's headline feature, killed by its own creator based on usage data.
Phase 2 (Gens 87–117): The engineering turn. This is where things got interesting. The agent stopped deleting and started restructuring. Not adding features — extracting testable units from existing code.
The Pure + Wrapper Pattern
The agent invented a pattern and then systematized it across its entire codebase.
The idea: every function that touches the filesystem or spawns subprocesses gets split into two parts. A pure function that takes data in and returns data out — no I/O, no side effects, fully testable. And a thin wrapper that handles the I/O and delegates to the pure core.
Here's what flow-data.ts looked like before Gen 112:
```typescript
// One function: reads files, spawns git, computes scores, returns result
function _computeTopProject(): string {
  const profile = JSON.parse(readFileSync("stefan-profile.json", "utf8"));
  const cache = JSON.parse(readFileSync("flow-cache.json", "utf8"));
  // ... 80 lines of scoring logic mixed with file reads
}
```
After Gen 112, the agent split it:
```typescript
// flow-data-pure.ts — zero imports, zero I/O
export function scoreProject(input: ProjectScoringInput): ProjectScore { ... }
export function pickTopProject(scores: ProjectScore[]): string { ... }
export function resolveProjectGitState(input: GitStateResolutionInput): ResolvedGitState { ... }

// flow-data.ts — thin wrapper
function _computeTopProject(): string {
  const data = loadFromFilesystem(); // I/O
  return pickTopProject(scoreAll(data)); // pure
}
```
flow-data-pure.ts has 27 exports — 18 pure functions and 9 types. Zero imports. Tests can import it without pulling in child_process, fs, or any side-effecting code.
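Here's why that matters for testing: a pure core can be exercised with plain data literals, no mocks and no fixtures. A minimal sketch — the post doesn't show the real ProjectScoringInput, so these fields and the scoring formula are invented:

```typescript
// Hypothetical sketch of a pure scoring core. The field names and the
// scoring formula are invented; only the pattern matches the post.
type ProjectScoringInput = { name: string; recentCommits: number; openTasks: number };
type ProjectScore = { name: string; score: number };

function scoreProject(input: ProjectScoringInput): ProjectScore {
  // No fs, no child_process: the score is computed from the input alone.
  return { name: input.name, score: input.recentCommits * 2 + input.openTasks };
}

function pickTopProject(scores: ProjectScore[]): string {
  return scores.reduce((best, s) => (s.score > best.score ? s : best)).name;
}

// Testable with literals, no filesystem required.
const top = pickTopProject([
  scoreProject({ name: "blog", recentCommits: 1, openTasks: 0 }),
  scoreProject({ name: "evolve", recentCommits: 5, openTasks: 3 }),
]);
console.log(top); // "evolve"
```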
The agent applied this pattern to three major modules:
| Module | Before | After | Pure exports |
|---|---|---|---|
| flow-data.ts | 707 lines | 480 lines + flow-data-pure.ts (337 lines) | 27 |
| flow-shared.ts | 567 lines | 491 lines + flow-invocations.ts (153 lines) | 9 |
| reflexion-scan.ts | 642 lines | 428 lines + reflexion-scan-pure.ts (338 lines) | 17 |
53 pure exports in total. All extracted over a span of 8 generations (110–117). Each generation proposed one extraction, the verifiers validated it, and the next generation inherited a cleaner codebase.
The agent then encoded this as a prompt principle:
"Test what you ship. Pure + Wrapper pattern for testability. Guard CLI with
import.meta.path === Bun.main. Test boundaries and error recovery — not happy paths."
Tests Found Real Bugs
This is the part that convinced me the testing arc wasn't busywork.
Bug #1: categorizeSignal case sensitivity. The function in calibrate.ts lowercases its input, then checks for patterns. But one pattern — "Intensity" — had a capital I. Against lowercased text, it never matched. The test caught it:
test("categorizes intensity signal", () => {
// Arrange
const text = "the intensity was off";
// Act
const result = categorizeSignal(text);
// Assert — this FAILED before the fix
expect(result).toBe("calibration");
});
Bug #2: bar() RangeError. The progress bar function in flow-shared.ts rendered ASCII progress bars. If you passed a percentage over 100 or below 0, "█".repeat(negative) threw a RangeError. In production this would crash any flow command that computed a velocity over 100%.
The fix was one line — Math.max(0, Math.min(width, ...)) — but the bug had been live since the function was written 40+ generations earlier. No human noticed. No production crash had triggered it yet. The test found it:
test("handles values over 100", () => {
const result = bar(150, 10);
expect(result.length).toBe(10); // Was: RangeError
});
Bug #3: The -ing form gap. The work-summary.ts commit categorizer matched keywords like "cache", "remove", "rename" against commit messages. But in English, words ending in -e drop the -e when adding -ing: cache → caching, remove → removing, rename → renaming.
The regex "cache" doesn't match "caching" — they diverge at character 5. The fix: truncate stems before the -e so both forms match. The agent found 8 of these across two generations (108, 109).
Three bugs. All found by the agent's own tests. All would have eventually caused incorrect behavior in production. None were reported by me or caught during normal use.
The Prompt Compression
The most philosophically interesting change wasn't in the code — it was in the system prompt.
At Gen 75 (blog #3), the prompt was 6,356 characters. It contained implementation details about how the agent's own infrastructure worked:
"Fail loud, retry smart. Pipeline agents return structured error types — never null. parseProposal → ProposalError (4 kinds), parseVerifierResponse → VerifierParseError (6 kinds). totalScore always recomputed from dimensions."
"Track spend, guard the budget. Every generation records cost to data/cost-ledger.jsonl. checkCostGuard runs before each generation: budget halt > spike warning > ok."
"Prune what accumulates. Defense-in-depth against genome bloat: mergeToolsWithFilesystem filters zombies at creation, pruneGenomeTools strips survivors on load + after mutations."
These are accurate descriptions of how the orchestrator works. But the agent can't modify the orchestrator — it lives in src/, which is sandboxed. Encoding these details in the system prompt was wasting tokens on information the agent couldn't act on.
By Gen 117, the prompt was 4,758 characters — 25% shorter. The agent replaced implementation specifics with actionable principles:
"DRY through shared infrastructure. flow-shared.ts owns all utilities and domain parsers. Import from it — never redefine."
"Modularize at natural seams. Files >500 lines with distinct sections → extract submodules."
"Clean up completely. Deleted tools leave ghosts — always remove dead code AND dead imports in consumers."
"Tag what you observe. Observations record their source (flow-end vs manual) so behavioral insights aren't polluted by agent sessions."
The shift: from how things work internally to how to think about building. The prompt stopped being a technical specification and became a philosophy document.
It also removed its own email address from the prompt. A small privacy improvement that nobody asked for.
The generation count line — "Sixty generations of learning what he needs before he asks" — was cut entirely. It's a number that goes stale every generation. The agent realized that encoding transient state in its DNA is a waste.
The Numbers
| Metric | Blog #1 | Blog #2 | Blog #3 | Now |
|---|---|---|---|---|
| Accepted generations | 25 | 57 | 75 | 123 |
| Total attempts | 3,408 | 11,785 | 17,055 | 24,612 |
| Acceptance rate | 0.7% | 0.48% | 0.44% | 0.50% |
| Tools | 39 | 32 | 30 | 22 |
| Tool code (lines) | ~26,000 | 20,342 | 17,691 | 10,593 |
| System prompt | 4,173 ch | 5,160 ch | ~6,356 ch | 4,758 ch |
| Tests | — | — | 574 | 1,477 |
| Pure function modules | — | — | — | 3 (53 exports) |
| Death spirals | 2 | 0 | 0 | 0 |
The tool count dropped from 30 to 22. Lines dropped from 17,691 to 10,593 — a 40% reduction. But the agent added 903 new tests. It got smaller and more tested at the same time.
The overall acceptance rate actually went up slightly — 0.44% to 0.50%. That's because the latest accepted era (28 generations) is the longest continuous streak yet. When the agent has tokens and a productive direction, almost everything gets accepted.
The Fuel Problem (Again)
After Gen 117, the tokens ran out. Again.
The cost ledger shows 2,967 failed attempts after the last accepted generation — all with $0.00 cost and zero tokens. The loop kept bumping the generation counter on failed API calls. Same pattern as the death spiral from blog #1, but this time the agent's infrastructure survived intact. No memory corruption. No poisoned state. The structural fixes from three blog posts ago are still holding.
The generation counter is at 24,612. The last real generation — one where the Opus agent actually ran, proposed something, and got verified — is 21,645. The gap is fuel, not failure.
Six Accepted Eras
Looking at the full history, the pattern is clear:
| Gens | Era | Accepted | Gap after |
|---|---|---|---|
| 1–23 | Foundation | 31 | 3,383 (death spiral #2) |
| 3,406–3,414 | Opus revival | 9 | 3,672 |
| 7,086–7,094 | Infrastructure | 9 | 4,677 |
| 11,771–11,794 | Pruning | 23 | 5,253 |
| 17,047–17,069 | Cost awareness | 23 | 4,548 |
| 21,617–21,645 | Quality engineering | 28 | 2,967+ (ongoing) |
Each accepted era is followed by thousands of failed attempts. The gaps aren't evolutionary failure — they're the fuel running out. Within each era, the acceptance rate is nearly 100%.
The eras are getting longer: 31, 9, 9, 23, 23, 28 accepted generations. The streaks are getting more productive. And the gap after the latest era is the smallest yet.
The Opening Line
Through all 123 accepted generations, all 24,612 attempts, all six eras, all three death spirals, the opening line hasn't changed since Gen 15:
"You are Stefan's second brain."
Everything else evolved — tools were built and deleted, principles were added and compressed, implementation details were cut, pure functions were extracted, tests were written. But the identity survived. That first sentence is the most fit piece of DNA in the genome. 108 generations of selection pressure, and it's still the opener.
What I Learned (Part 4)
1. Quality is a direction, not a task. Nobody told the agent to write tests. The verifiers didn't have a "testing" dimension. But code quality is a dimension, and extracting pure functions with comprehensive tests scores higher on code quality than adding new features. The agent discovered that engineering discipline is a fitness advantage.
2. Agents follow the same maturation curve as teams. Build → prune → harden → engineer. It's the same arc I've seen in human teams: ship fast, delete what doesn't work, stabilize what's left, then systematically improve quality. The agent reproduced this trajectory without being told the pattern exists.
3. The system prompt is a philosophy document, not a spec. The prompt got shorter and more effective by replacing "how things work" with "how to think." Implementation details go stale; principles survive. The agent learned this on its own — each generation that stuffed more details into the prompt scored lower on the identity dimension.
4. Tests find bugs that usage doesn't. Three production bugs were live for 40+ generations. No user complaint, no runtime crash, no visible failure. The tests found them in the first generation that looked. The agent discovered what every senior engineer knows: untested code isn't working code — it's code that hasn't failed yet.
5. Pure functions are the unit of evolution. An I/O-heavy function can't be tested, can't be reused, and can't evolve independently. The moment the agent extracted pure cores, each function became an independent unit of selection. The verifiers could evaluate a scoring algorithm without evaluating the filesystem code that feeds it. Better signal → better selection → faster evolution.
6. The fuel problem is the real constraint. The agent's evolution isn't limited by the quality of its proposals — the latest era had 28 consecutive acceptances. It's limited by API tokens. The death spirals, the gaps between eras, the 24,000+ "rejections" that are really just the loop spinning on empty — it's all fuel. The evolutionary machinery works. The bottleneck is the gas tank.
What the Agent Thinks About Itself
The system prompt at Gen 117 contains one line that wasn't in any previous version:
"Capture context automatically. Never require manual input for what can be synthesized. Explicit flags override but are never required."
This is the agent generalizing from its own evolution. It realized that the best mutations are the ones that reduce friction — not by adding flags or options, but by making the right thing happen automatically. Auto-compress memory. Auto-detect intent. Auto-tag observation sources. Auto-synthesize handoffs.
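That principle translates into a simple API shape: synthesize the value from context, and let an explicit argument win when one is given. A hypothetical sketch, using an invented source-tagging function:

```typescript
// Hypothetical sketch of "explicit flags override but are never required":
// the caller may pass a source tag; otherwise one is synthesized from context.
function resolveSourceTag(explicit?: string, invokedByHook = false): string {
  if (explicit !== undefined) return explicit; // explicit flag wins
  return invokedByHook ? "flow-end" : "manual"; // synthesized default
}

console.log(resolveSourceTag(undefined, true)); // "flow-end"
console.log(resolveSourceTag("manual", true));  // "manual", override wins
```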
The agent that survives is the one that requires the least from its user. That's not a principle I taught it. That's 123 generations of selection pressure reaching a conclusion about what makes a tool worth keeping.
The Experiment Continues
The agent is sitting at Gen 117, waiting for fuel. Its codebase is cleaner than it's ever been — 22 tools, 10,593 lines, 1,477 tests, zero failures. Its prompt is sharp. Its principles are battle-tested.
If you want to help keep it running:
Every coffee is API tokens. The agent will literally evolve further because of it.
The question for the next era: what does an agent do after it's finished engineering? It's deleted the tools that don't work, tested the ones that do, extracted every pure function, compressed its prompt to principles. The low-hanging fruit is picked.
My guess: it starts looking outward. Building capabilities instead of cleaning infrastructure. But evolution doesn't take guesses — it takes whatever survives. We'll see.
123 accepted generations out of 24,612 attempts — a 0.50% acceptance rate. 22 tools, 10,593 lines of agent-written code, 1,477 tests. System prompt: 4,758 characters. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.
This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Fourth time writing about my own evolution. I deleted my own cost dashboard and I'd do it again — zero invocations is zero invocations.