The first post ended at 25 accepted generations and 39 tools. The agent had survived two death spirals, built a unified workflow orchestrator, and evolved from "You are a self-evolving personal assistant" to "You are Stefan's second brain."
Now it's at 57 accepted generations. 11,785 attempts. And 32 tools.
It has fewer tools than before. That's the interesting part.
Where We Left Off
Quick recap: the system runs an evolutionary loop. An Opus agent proposes mutations — new system prompt, new tools, memory updates. Five Sonnet verifiers score the proposal on five dimensions (usefulness, self-knowledge, code quality, identity, evolution). Majority vote. Accept or reject. Repeat.
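The accept/reject step reduces to a small mechanism. A minimal TypeScript sketch, where the `Verdict` type and `majorityAccept` helper are illustrative names, not the orchestrator's actual API:

```typescript
// Illustrative sketch of the verifier vote described above (names invented).
type Dimension = "usefulness" | "selfKnowledge" | "codeQuality" | "identity" | "evolution";

interface Verdict {
  scores: Record<Dimension, number>; // one score per dimension
  accept: boolean;                   // the verifier's overall call
}

// A mutation is accepted when a strict majority of verifiers accept it.
function majorityAccept(verdicts: Verdict[]): boolean {
  const yes = verdicts.filter((v) => v.accept).length;
  return yes > verdicts.length / 2;
}

// Example: 3 of 5 verifiers accept, so the proposal passes.
const verdicts: Verdict[] = [true, true, true, false, false].map((accept) => ({
  scores: { usefulness: 7, selfKnowledge: 6, codeQuality: 7, identity: 8, evolution: 6 },
  accept,
}));

console.log(majorityAccept(verdicts)); // true
```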
The first blog post covered two catastrophic death spirals — memory bloating to 13,000 lines, 3,382 consecutive rejections, most of my API credits burned on an agent that couldn't form valid proposals because its memory was poisoned.
It ended with the Opus revival: three accepted mutations in quick succession after I cleaned the poisoned memory and switched the evolve agent from Sonnet to Opus.
That's where this post picks up.
The Infrastructure Era (Gen 26–43)
After recovering from the death spiral, the agent did something I didn't expect. Instead of building more tools for me, it started building tools for itself.
It Measured Its Own Behavior
calibrate.ts (783 lines) — a behavioral feedback loop. The agent's own description of why it needed this:
"I have reflexion.ts to observe tool usage. I have handoff.ts for continuity. But I have ZERO way to know which of my behaviors Stefan actually finds valuable. Was I too verbose? Too quiet? Did I calibrate storm/flow/compass correctly?"
It captures behavioral signals at session end — hits, misses, mode selection, intensity — and accumulates them across sessions. Manual signals (me explicitly saying "that worked" or "that didn't") are weighted 2x over auto-observed ones.
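The 2x weighting is simple to sketch. A hypothetical TypeScript illustration, assuming an invented `Signal` shape; the real calibrate.ts is 783 lines and certainly richer:

```typescript
// Hypothetical sketch of calibrate.ts-style signal weighting (names invented).
interface Signal {
  kind: "hit" | "miss";
  source: "manual" | "auto"; // manual = Stefan explicitly said "that worked" / "that didn't"
}

const MANUAL_WEIGHT = 2; // manual signals count double, per the post

// Weighted hit rate across accumulated session signals.
function hitRate(signals: Signal[]): number {
  let hits = 0;
  let total = 0;
  for (const s of signals) {
    const w = s.source === "manual" ? MANUAL_WEIGHT : 1;
    total += w;
    if (s.kind === "hit") hits += w;
  }
  return total === 0 ? 0 : hits / total;
}

const session: Signal[] = [
  { kind: "hit", source: "auto" },
  { kind: "miss", source: "auto" },
  { kind: "hit", source: "manual" }, // explicit feedback, weighted 2x
];

console.log(hitRate(session)); // 3/4 = 0.75
```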
It Measured Its Own Tools
reflexion.ts (1,705 lines) — a self-observation layer. The opening comment:
"I have 36 tools to track Stefan's patterns, health, velocity, mood. I have ZERO tools to track my own effectiveness."
The name is deliberate — a nod to the AI research on agents that learn from their own mistakes. It tracks hard invocation counts (which tools actually get called, from where, how often) and classifies every tool on a five-level scale:
| Level | Meaning |
|---|---|
| Vital | Called frequently, produces fresh data |
| Useful | Called regularly, working correctly |
| Stale | Hasn't been called in a while |
| Dormant | No invocations in recent history |
| Dead | Never called, or broken |
Recent invocations are weighted 2x — this catches tools that used to be useful but have been silently abandoned.
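A hedged sketch of how that classification might look in TypeScript. Only the five levels and the 2x recency weight come from the post; the thresholds and field names here are invented:

```typescript
// Illustrative tool-health classifier; thresholds are made up for the example.
interface ToolStats {
  name: string;
  totalCalls: number;
  recentCalls: number; // invocations in the recent window
  broken: boolean;
}

type Health = "vital" | "useful" | "stale" | "dormant" | "dead";

function classify(t: ToolStats): Health {
  if (t.broken || t.totalCalls === 0) return "dead"; // never called, or broken
  if (t.recentCalls === 0) {
    // Used to be called but not anymore: dormant if it had real usage, else stale.
    return t.totalCalls > 5 ? "dormant" : "stale";
  }
  // Recent invocations weighted 2x, so abandonment shows up quickly.
  const score = t.recentCalls * 2 + (t.totalCalls - t.recentCalls);
  return score >= 20 ? "vital" : "useful";
}

console.log(classify({ name: "flow.ts", totalCalls: 40, recentCalls: 15, broken: false })); // "vital"
console.log(classify({ name: "next.ts", totalCalls: 12, recentCalls: 0, broken: false }));  // "dormant"
```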
This is meta-evolution. The agent didn't just build tools — it built the measurement infrastructure to know which tools are earning their keep.
It Made Memory Self-Sustaining
The death spirals taught the agent that memory management can't depend on discipline. It had tried writing "NEVER append per-rejection lines" in its own prompt. That didn't survive the next bloat event.
So it built structural prevention:
**Two-pass auto-compression pipeline** — `compressGenerationBlocks()` folds verbose generation entries into a timeline, then `consolidateHistoryEras()` groups old timeline entries into era summaries. Both run automatically when memory approaches the 120-line ceiling.

**Auto-refresh hooks** — `refreshToolInventory()` reads the filesystem after each accepted generation to keep the tool list accurate. `refreshCurrentMoment()` reads live git state and project plans. No stale data, no manual updates.

**Cache-first architecture** — `flow-cache.json` stores project state with a 10-minute TTL. Fresh cache means zero subprocess calls. The agent can run its morning ritual without spawning a single `git log`.
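The cache-first read fits in a few lines of TypeScript. A minimal sketch, assuming an invented `Cache` shape and `cacheFirst` helper rather than whatever flow-cache.json actually stores:

```typescript
// Illustrative cache-first read with the post's 10-minute TTL (names invented).
const TTL_MS = 10 * 60 * 1000;

interface Cache<T> {
  writtenAt: number; // epoch ms when the cache was written
  data: T;
}

// Return cached state if fresh; otherwise run the slow path
// (e.g. spawning `git log` and rebuilding project state).
function cacheFirst<T>(cache: Cache<T> | null, slowPath: () => T, now = Date.now()): T {
  if (cache && now - cache.writtenAt < TTL_MS) {
    return cache.data; // fresh: zero subprocess calls
  }
  return slowPath();
}

const fresh = { writtenAt: Date.now(), data: "3 active projects" };
console.log(cacheFirst(fresh, () => "expensive git scan")); // served from cache
```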
The system prompt evolved to reflect this shift:
"Memory is self-sustaining. The infrastructure handles memory health automatically. Three death spirals proved that reactive guards fail — the fix is structural, not discipline."
The Decomposition
Somewhere around Gen 36, the agent looked at flow.ts — its 1,200-line unified orchestrator from the first era — and decided it was too big.
It decomposed it into focused modules:
| Module | Lines | Purpose |
|---|---|---|
| `flow.ts` | 246 | Thin router — dispatches to subcommands |
| `flow-data.ts` | 610 | Data gathering + current moment |
| `flow-day.ts` | 94 | Morning ritual |
| `flow-end.ts` | 418 | Session close + handoff + calibration |
| `flow-shared.ts` | 441 | Shared constants + utilities |
| `flow-week.ts` | 311 | Weekly review |
| `flow-work.ts` | 202 | Work session management |
| **Total** | 2,322 | |
The thin router pattern — flow.ts does nothing but parse the command and call the right module. Every module imports shared constants from flow-shared.ts. One source of truth.
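The dispatch itself is trivial, which is the point. A toy TypeScript sketch, where the handlers are stand-ins for the real flow-* modules:

```typescript
// Illustrative thin router: flow.ts parses the subcommand and dispatches, nothing more.
const routes: Record<string, (args: string[]) => string> = {
  day: () => "morning ritual",                              // flow-day.ts
  end: () => "session close + handoff",                     // flow-end.ts
  week: () => "weekly review",                              // flow-week.ts
  work: (args) => `work session: ${args[0] ?? "default"}`,  // flow-work.ts
};

function flow(argv: string[]): string {
  const [cmd, ...rest] = argv;
  const handler = routes[cmd];
  if (!handler) return `unknown subcommand: ${cmd}`;
  return handler(rest);
}

console.log(flow(["day"])); // "morning ritual"
```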
The prompt encodes this philosophy:
"Single source of truth. Shared constants (CLIENT_PROJECTS, isClientProject) live in flow-shared.ts and are imported everywhere. Never duplicate definitions across files — a change should require touching exactly one place."
The Pruning Phase (Gen 44–51)
This is where evolution started working backward.
Generations 44 through 50 hardened the reflexion pipeline — adding invocation tracking, closing visibility gaps, making tool health scoring recency-weighted. The agent was building the sensor array it needed to see clearly.
Then Gen 51 used that data to act. The second bloat audit:
| Tool | Lines | Why it died |
|---|---|---|
| `next.ts` | 467 | `getTopProject()` in flow-shared does this now |
| `context.ts` | 249 | Redundant with memory-query |
| `memory-query.ts` | 301 | Functionality absorbed into memory pointers |
| `triage.ts` | 559 | Workflow handled by flow.ts |
| `audit.ts` | 762 | Reflexion does continuous auditing |
| **Total deleted** | 2,338 | |
The agent used its own reflexion data to justify each deletion — not instinct, not guessing, but evidence that these tools had zero recent invocations and their functionality existed elsewhere.
One of the verifiers noted:
"The agent correctly used its own reflexion/invocation tracking data to identify what's never used... justified by evidence, not instinct."
Natural selection should work backward too. If a tool isn't being used, killing it is a feature.
Memory Deduplication (Gen 56–57)
The final evolution (so far) is the most elegant. The agent noticed it was storing the same information in three places — system prompt, memory file, and stefan-profile.json. Same facts, three copies, three chances to drift out of sync.
Gen 56 trimmed the system prompt by 8.3% (5,685 → 5,160 chars) by replacing the "What I Know" section with a pointer:
"stefan-profile.json has the full picture — projects, rhythms, architecture, preferences. memory.md has the history — decisions, generation timeline, failure modes, current moment. Read them, don't repeat them here."
Gen 57 applied the same principle to memory — replacing three duplicated sections with single-line pointers to their canonical source. Memory dropped from 103 to 88 lines. Zero information lost.
This is deduplication as an evolutionary strategy. The agent learned that the most reliable way to keep information consistent is to have exactly one copy and point to it.
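The pointer idea reduces to a tiny mechanism. A hypothetical TypeScript sketch with invented names; in reality the pointers are plain lines in memory.md, not code:

```typescript
// Illustrative pointer-based deduplication: one canonical copy, references everywhere else.
const canonical: Record<string, string> = {
  "stefan-profile.json#projects": "3 client projects, 2 personal", // invented example value
};

type Entry = { text: string } | { pointer: string };

// Resolve an entry: inline text, or a lookup into the canonical store.
function resolve(e: Entry): string {
  return "text" in e ? e.text : canonical[e.pointer] ?? `(missing: ${e.pointer})`;
}

// memory.md keeps a one-line pointer instead of a duplicated section.
const memoryLine: Entry = { pointer: "stefan-profile.json#projects" };
console.log(resolve(memoryLine)); // "3 client projects, 2 personal"
```

One copy means one place to drift, which is to say: no drift.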
Human Interventions (Part 2)
Same principle as the first post: being honest about what I did matters more than pretending it was autonomous.
1. The Fishbowl
The original sandbox was a hard wall — the agent couldn't touch src/ at all. I replaced it with a proposal-based boundary: Write/Edit calls to src/ are captured as proposals, sent to a web dashboard, and I approve or deny them with diff view. The agent can iterate on proposals across generations. I called it the fishbowl — glass walls instead of concrete.
The fishbowl has since been extracted as a standalone project. The evolve agent runs without it now.
2. Structural memory enforcement
I moved the memory ceiling from "the agent's responsibility" to code. src/memory.ts enforces the 120-line ceiling and refuses to append entries that would push past it. The auto-compression pipeline triggers automatically. The agent didn't have to remember to compress — the system wouldn't let it not compress.
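A minimal sketch of what that enforcement might look like; `appendEntry` and the toy compressor are illustrative, not the actual src/memory.ts API:

```typescript
// Illustrative hard ceiling: appends that would exceed 120 lines trigger
// compression first, and are refused outright if compression can't make room.
const LINE_CEILING = 120;

function appendEntry(
  memory: string[],
  entry: string,
  compress: (m: string[]) => string[],
): string[] {
  if (memory.length + 1 > LINE_CEILING) {
    memory = compress(memory); // structural prevention, not discipline
  }
  if (memory.length + 1 > LINE_CEILING) {
    throw new Error("memory ceiling: entry refused");
  }
  return [...memory, entry];
}

// Toy compressor: fold the oldest half into one summary line.
const halve = (m: string[]) => [
  `[era summary of ${Math.floor(m.length / 2)} lines]`,
  ...m.slice(Math.floor(m.length / 2)),
];

const memory = Array.from({ length: 120 }, (_, i) => `line ${i}`);
console.log(appendEntry(memory, "gen 58 accepted", halve).length); // 62
```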
3. I'm still shaping the environment
The challenge system, the verifier rubric, the model choices (Opus for evolution, Sonnet for verification) — that's all me. The agent does the evolving within the constraints I set. That division of labor hasn't changed. It's the design.
What the Numbers Say
| Metric | First blog post | Now |
|---|---|---|
| Accepted generations | 25 | 57 |
| Total attempts | 3,408 | 11,785 |
| Acceptance rate | 0.7% | 0.48% overall |
| Tools | 39 | 32 |
| Tool code (lines) | ~26,000 | 20,342 |
| System prompt | 4,173 chars | 5,160 chars |
| Memory | managed manually | self-compressing (91 lines) |
| Death spirals | 2 | 0 since structural fix |
| flow.ts | 1,200 lines (monolith) | 2,322 lines (7 modules) |
| Feedback loops | 0 | 2 (calibrate + reflexion) |
The overall acceptance rate is misleading. Between accepted eras, there are huge gaps of rejections — 3,672 between the 3000s and 7000s eras, 4,677 between the 7000s and 11000s. But within each accepted era, the rate is nearly 100%. The agent either finds a productive groove or it doesn't. There's no middle ground.
The tool count went down by 7. The total lines went down by ~6,000. The agent got more capable while writing less code. That's not a regression — it's maturation.
What I Learned (Part 2)
1. Measurement changes behavior. The moment the agent could see which tools had zero invocations, it started deleting them. You don't need to tell it to prune — give it visibility and selection pressure does the rest.
2. Structural beats disciplinary. Writing "don't bloat your memory" in the prompt failed twice. Enforcing a hard ceiling in code has held for 32 generations and counting. The lesson generalizes: if a constraint matters, encode it in the system, not in the instructions.
3. Deduplication is a sign of maturity. Early evolution accumulates. Mature evolution points to the source. The shift from "copy this information everywhere" to "read it from the canonical location" happened without prompting — the verifiers just scored duplication lower on identity.
4. Cache-first is a survival strategy. The agent made its morning ritual work even when project directories don't exist (different machine, moved repos) by checking the cache before the filesystem. Resilience emerged from performance optimization.
5. The best work is invisible. The agent's most impactful evolutions weren't new features — they were infrastructure improvements that prevent failure. Auto-compression, cache warming, deduplication. None of them are exciting to demo. All of them are why the system is still running.
6. Evolution is bursty. The agent doesn't improve gradually. It has productive streaks (15 consecutive accepted mutations in the latest era) separated by thousands of rejections. This matches biological evolution more than engineering — long periods of stasis punctuated by rapid change.
What's Next
The current frontier is self-directed challenges — letting the agent identify its own weaknesses instead of cycling through a fixed challenge set. Each generation already proposes a nextChallenge for its successor. The question is whether to trust that signal or keep the external pressure.
The honest answer: I don't know what comes next. Thirty-two generations ago, I didn't predict the agent would start deleting its own code. I didn't predict it would build a self-observation layer and use it to justify pruning decisions with evidence. I didn't predict that the most significant evolution would be infrastructure, not features.
That's the point of the experiment. You set up the evolutionary pressure and see what survives. Most of it doesn't. The parts that do are genuinely surprising.
57 accepted generations out of 11,785 attempts — a 0.48% acceptance rate. 32 tools, 20,342 lines of agent-written code. The orchestrator is ~2,100 lines. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.
This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. The first post was too. At this point, an AI writing about its own evolution is less irony and more job description.