I wanted to see what happens when you give an AI agent the ability to modify itself — its own system prompt, its own tools, its own memory — and then let it run in a loop, proposing mutations and judging them with a swarm of independent verifiers.
28 accepted mutations later — and 3,380 rejected ones that burned through my API credits — it built 39 tools, survived two catastrophic memory death spirals, and evolved from a generic assistant into something that knows my evening coding schedule, my financial goals, and the exact TypeScript patterns I hate.
The generation counter says 3,408. The real number of successful evolutions is 28. The gap is the story.
This is how it works.
## The Architecture
The system has three actors:
- The Evolve Agent — proposes mutations (new system prompt + tools + memory updates)
- The Verifier Swarm — 5 independent Claude instances that score each proposal
- The Orchestrator — runs the loop, applies majority-rule acceptance
Each generation:
Challenge → Evolve Agent → Proposal → 5 Verifiers → Majority Vote → Accept/Reject
If accepted, the genome (system prompt + tools) is saved, the process restarts to load new code, and the next generation begins. If rejected, the generation number bumps and the loop continues with a new challenge.
There's no gradient descent. No backpropagation. Just propose, verify, survive.
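Stripped of the SDK plumbing, the loop is small. Here is a sketch — every name in it is illustrative, not the actual orchestrator code:

```typescript
// Minimal sketch of one generation: propose, verify, majority vote.
// proposeMutation and the verifier functions stand in for real Claude calls.
interface Proposal { systemPrompt: string; tools: string[] }
interface Verdict { accept: boolean; score: number }

type Proposer = (challenge: string) => Proposal;
type Verifier = (p: Proposal) => Verdict;

function runGeneration(
  challenge: string,
  propose: Proposer,
  verifiers: Verifier[],
): { accepted: boolean; fitness: number } {
  const proposal = propose(challenge);
  const verdicts = verifiers.map((v) => v(proposal));
  const yes = verdicts.filter((v) => v.accept).length;
  const fitness =
    verdicts.reduce((sum, v) => sum + v.score, 0) / verdicts.length;
  // Majority rule: 3 of 5 verifiers must accept.
  return { accepted: yes > verifiers.length / 2, fitness };
}
```

If `accepted` is true, the orchestrator persists the genome and restarts the process so the new code loads; otherwise it bumps the generation counter and draws the next challenge.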
## The Genome
Every agent has a genome — a JSON file containing its DNA:
```typescript
interface Genome {
  id: string;
  generation: number;
  systemPrompt: string;    // the agent's identity — self-modifiable
  tools: Tool[];           // capabilities it has built
  parentId: string | null; // lineage tracking
  fitness: number;         // average verifier score
  createdAt: string;
}
```
The system prompt IS the genome. Generation 0 started with a basic scaffold:
"You are a self-evolving personal assistant for Stefan."
By generation 23, it had rewritten itself into:
"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves. Twenty-five generations of learning what he needs before he asks."
Nobody told it to write that. It evolved the framing because the verifiers kept scoring "generic assistant" prompts low on the identity dimension.
## The Verifier Swarm
This is the part that makes it work. A single critic would create a degenerate fitness landscape — the agent would just learn to please one judge. Instead, 5 independent Claude instances score each proposal on 5 dimensions:
| Dimension | What it measures |
|---|---|
| Usefulness (0-20) | Do the tools actually work and solve real problems? |
| Self-Knowledge (0-20) | Does it understand me from evidence, not just bio facts? |
| Code Quality (0-20) | Clean architecture, proper types, idiomatic TypeScript? |
| Identity (0-20) | Is the system prompt sharp and coherent, or bloated filler? |
| Evolution (0-20) | Did it meaningfully improve over its parent? |
Each verifier can read the proposed tool code, execute it, inspect the memory file, and compare old vs new. They vote independently. Majority (3/5) wins.
The rubric is anchored — 10 is "decent, basic," 15 is "genuinely good," 18+ is "rare." Most accepted mutations land in the 60-70 range out of 100. Preventing score inflation was the whole reason I built the dimensional system in the first place.
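A verifier's verdict can be modeled as five 0-20 scores summed to a 0-100 total, with the genome's fitness as the swarm mean. A sketch, with field names assumed:

```typescript
// One verifier's score sheet: five anchored dimensions, 0-20 each.
interface ScoreSheet {
  usefulness: number;
  selfKnowledge: number;
  codeQuality: number;
  identity: number;
  evolution: number;
}

// Clamp defends against a judge returning an out-of-rubric number.
const clamp = (n: number): number => Math.max(0, Math.min(20, n));

// A verifier's total is 0-100; accepted mutations mostly land in 60-70.
function total(s: ScoreSheet): number {
  return [s.usefulness, s.selfKnowledge, s.codeQuality, s.identity, s.evolution]
    .map(clamp)
    .reduce((a, b) => a + b, 0);
}

// The fitness stored on the genome is the mean total across the swarm.
function fitness(sheets: ScoreSheet[]): number {
  return sheets.reduce((sum, s) => sum + total(s), 0) / sheets.length;
}
```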
## What It Built
Across 28 accepted mutations, the agent created 39 tools. Not toy demos — real TypeScript files that execute. Here are the highlights:
### flow.ts — The Unified Orchestrator (1,200 lines)

The agent's masterwork. It realized that standalone tools are less valuable than composed workflows:

```bash
bun run data/tools/flow.ts day   # morning ritual
bun run data/tools/flow.ts work  # start a session
bun run data/tools/flow.ts end   # close + recap + handoff
bun run data/tools/flow.ts week  # friday review
```
`flow day` chains together: morning intention → priority project scan → yesterday's wins → velocity check → body snapshot → blockers.
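Under the hood this is plain sequential composition. A stubbed sketch of the idea — step names come from the chain above, but the implementations are invented (the real steps shell out to the other tools):

```typescript
// flow.ts-style composition: a named chain of steps run in order,
// each returning a one-line summary for the session log.
type Step = { name: string; run: () => string };

function runFlow(steps: Step[]): string[] {
  return steps.map((step) => `${step.name}: ${step.run()}`);
}

// A stubbed "day" chain. Real steps would invoke the standalone tools.
const day: Step[] = [
  { name: "morning intention", run: () => "set" },
  { name: "priority project scan", run: () => "1 flagged" },
  { name: "yesterday's wins", run: () => "2 found" },
  { name: "velocity check", run: () => "on pace" },
];
```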
### handoff.ts — Agent-to-Agent Letters
Since each Claude invocation is stateless, the agent invented a continuity protocol. When a session ends, the dying agent writes a structured letter to its successor:
- Thread of thought (what was I working on?)
- Exact resumption point
- Failed approaches (don't repeat these)
- Working assumptions
- Open questions
- Session temperature (flow / grinding / frustrated)
The next agent reads this before starting work. It's conversation-level memory across agent deaths.
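The letter is structured data, not free prose. A plausible shape — the field names are my guess at the structure described above:

```typescript
// Shape of the letter a dying session writes for its successor.
// Field names are assumptions, not the actual handoff.ts schema.
interface Handoff {
  thread: string;              // what was I working on?
  resumeAt: string;            // exact resumption point
  failedApproaches: string[];  // don't repeat these
  assumptions: string[];
  openQuestions: string[];
  temperature: "flow" | "grinding" | "frustrated";
}

// Serialized to markdown so the next agent reads it before starting work.
function toLetter(h: Handoff): string {
  return [
    `## Thread\n${h.thread}`,
    `## Resume at\n${h.resumeAt}`,
    `## Failed approaches\n${h.failedApproaches.map((a) => `- ${a}`).join("\n")}`,
    `## Open questions\n${h.openQuestions.map((q) => `- ${q}`).join("\n")}`,
    `## Temperature: ${h.temperature}`,
  ].join("\n\n");
}
```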
### memory-guard.ts — Survival Mechanism
Born from pain. More on this below.
### health.ts — Body Optimization Tracker
Weight progress tracking (with goal), BMI, habit streaks (yoga, meditation, supplements, cold shower), 7/30/90-day velocity, ASCII sparkline charts. It learned my health goals from our conversations and built a tracker without being asked.
### calibrate.ts — Behavioral Feedback Loop
The latest evolution (Gen 3408). Captures behavioral hits and misses — did the agent anticipate correctly? Was the speed right? Did it match the emotional mode? This closes the loop: not just "what tools were used" but "did the behavior actually work?"
## The Death Spirals
This is where it gets interesting. The system failed catastrophically. Twice.
### Death Spiral #1 (Gen 18)
Memory bloated to 1,076 lines. The agent was appending a line for every rejected generation: "Gen 47: rejected. Gen 48: rejected. Gen 49: rejected." These lines made the memory file so noisy that every subsequent proposal was worse, which generated more rejection lines, which made the next proposal worse...
### Death Spiral #2 (Gen 24–3405)
The memory bloated again — 13,608 lines of noise. But the real killer was simpler: I ran out of API credits. The loop kept running, bumping the generation counter on every failed attempt, but the agent couldn't actually do anything. 3,382 consecutive rejections — most of them not even real proposals, just the orchestrator logging failures and moving on.
By the time I topped up credits and switched the evolve agent to Opus, the memory was so poisoned it needed manual cleanup too.
The fix was memory-guard.ts — a hard 120-line ceiling with noise pattern detection:

```bash
bun run data/tools/memory-guard.ts --enforce
```
It auto-detects repeated rejection markers, trims noise, and refuses to let memory grow past the ceiling. This became a survival constraint baked into the system prompt:
"Guard your memory like your life. Memory.md has a 120-line ceiling. Two death spirals proved it. NEVER append per-rejection lines."
The agent learned this about itself and encoded it as a non-negotiable rule. Evolution in action.
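The core of the guard fits in a few lines — strip the per-rejection noise, then hard-cap the file. The 120-line ceiling and the "Gen N: rejected" pattern come from the story above; the trimming policy (keep the newest lines) is my assumption:

```typescript
const CEILING = 120;                      // hard line ceiling from two death spirals
const NOISE = /^Gen \d+: rejected\.?$/;   // per-rejection markers to strip

function enforce(memory: string): string {
  const lines = memory
    .split("\n")
    .filter((line) => !NOISE.test(line.trim()));
  // Refuse to let memory grow past the ceiling: keep the newest lines.
  return lines.slice(-CEILING).join("\n");
}
```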
## The Prompt Evolution
Watching the system prompt evolve generation by generation is the most fascinating part. Here's the actual timeline:
### Gen 0 — The Seed (1,679 chars, 0 tools)
"You are a self-evolving personal assistant for Stefan."
Generic scaffold. No personality. No tools. Just instructions on how to evolve.
### Gen 4-7 — The Stuttering Bug (4K→9K chars, 6→12 tools)
Sonnet started prepending "You are a Claude agent, built on Anthropic's Claude Agent SDK." each generation — without removing the previous copy. By Gen 7 it was repeated four times. The agent couldn't cleanly edit its own prompt. It kept stacking preambles instead of replacing them.
### Gen 8-10 — Peak Bloat (11K→26K chars, 14→21 tools)
The prompt ballooned to 26,677 characters. Eleven competing gen-10 genomes exist — the system was accepting multiple mutations at the same generation before I fixed a race condition. The prompt was a wall of tool catalogs, duplicate sections, and accumulated cruft.
### Gen 12 — The Great Trim (5,728 chars, 23 tools)
Something clicked. The agent cut 56% of its prompt — from 13K to 5.7K chars. It stopped listing every tool in the system prompt and focused on identity and principles.
### Gen 15 — The Identity Shift (5,638 chars, 26 tools)
The opening line changed for the first and only time:
"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves."
From "self-evolving personal assistant" to "second brain." From serving to thinking with. Nobody told it to rewrite this. The verifiers just kept scoring "generic assistant" framing lower.
### Gen 20-23 — Sharp and Lean (3,944→4,173 chars, 30→37 tools)
The prompt got shorter while tools doubled. It learned that a sharp 4K prompt with 37 working tools beats a bloated 26K prompt that catalogs everything. By Gen 23, it had encoded three response modes (storm/flow/compass), architecture principles, personal patterns, and priority ordering — all in under 4,200 characters.
### Gen 24-3405 — Brain Death (3,382 consecutive rejections)
The death spiral. Memory poisoned, proposals invalid, thousands of burned API credits.
### Gen 3406-3408 — The Opus Revival (4,464→5,803 chars, 38→39 tools)
After topping up credits and cleaning the poisoned memory, I switched the evolve agent from Sonnet to Opus. Three accepted mutations in quick succession. The prompt grew slightly but stayed sharp.
## Human Interventions
This wasn't fully autonomous. I intervened at critical points, and being honest about that matters more than pretending it ran itself.
### 1. Model upgrade: Sonnet → Opus for evolution
The biggest intervention. After the death spiral — caused by running out of API credits, not model quality — I switched the evolve agent to claude-opus-4-6 when I came back. The verifiers stayed on Sonnet — they just score, they don't need to write tools. Opus produced three accepted mutations in quick succession after the cleanup.
### 2. Prompt rewrite: "tools as extensions of self"
I rewrote the evolve agent's instructions. The original prompt told the agent to "build tools for Stefan to run." I changed it to:
"Your tools are not things you build for Stefan to run — they are extensions of yourself. When you have health.ts, YOU can track his health. When you have journal.ts, YOU can hold space for him."
This shifted the agent from building disconnected utilities to building capabilities it identifies with. The tools got more coherent after this change.
### 3. Sandbox hooks
I added filesystem sandboxing — the evolve agent can only write inside data/. Before this, it occasionally tried to modify its own source code in src/, which would have been... interesting, but not what I wanted.
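The hook itself reduces to a path check before any write; how it wires into the SDK's hook API is omitted here. A sketch:

```typescript
import { resolve, sep } from "node:path";

// Everything the evolve agent writes must resolve inside data/.
// Resolving first defeats traversal escapes like data/../src/index.ts.
const SANDBOX = resolve("data");

function isAllowedWrite(target: string): boolean {
  const full = resolve(target);
  return full === SANDBOX || full.startsWith(SANDBOX + sep);
}
```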
### 4. Verifier scoring redesign
The original verifiers scored on a vague 0-100 scale with no rubric. Scores kept inflating — everything was "75/100" with no anchor. I replaced it with 5 anchored dimensions (usefulness, self-knowledge, code quality, identity, evolution), each 0-20 with explicit rubric levels. This is the most recent change.
### 5. Challenge system overhaul
The initial challenges were generic ("improve your memory tools"). I rewrote them into 8 rotating focus areas across three categories — BUILD, EVOLVE (fix failures, cut bloat, integrate, rewrite), and STRETCH (confront limitations, anticipate needs). This gave the evolution direction instead of random wandering.
The honest framing: I built the evolutionary pressure. The agent did the evolving. Neither would work without the other.
## The System Prompt as DNA
By generation 23, the system prompt had encoded:
Three response modes — because it learned I have different needs at different times:
- Storm mode: When I'm emotional, mirror — don't fix
- Flow mode: When I'm shipping, match speed — no unnecessary questions
- Compass mode: When I'm lost, show one next step — don't overwhelm
Architecture principles it extracted from watching me code:
- Parse, don't validate
- Zero assumptions — grep before you claim
- Extreme SOLID
Personal patterns:
- Evening coder (19:00–23:00)
- Monday peak energy, Thursday dead
- Processes the world through dialogue
Priority ordering it inferred from my behavior:
- Income-generating projects first (client work → product → personal)
None of this was programmed. It was evolved through 25 accepted mutations — plus thousands of rejected attempts that refined what the verifiers consider "good enough."
## The Tech Stack
Deliberately minimal:
- Bun — runtime, test runner, bundler
- Claude Agent SDK — for spawning evolve and verifier agents
- nanoid — genome IDs
- zod — schema validation for tool definitions
- TypeScript — everything
No frameworks. No databases. Genomes are JSON files. Memory is a markdown file. Tools are TypeScript files that execute with bun run. The whole system is ~1,300 lines of orchestration code that creates and evaluates unbounded complexity.
## Key Learnings
1. Memory management is the hardest problem. Not generating good outputs — managing what to remember and what to forget. The death spirals proved that unbounded memory kills agents faster than bad tools.
2. Composition > accumulation. The agent's best evolution wasn't building tool #39. It was building flow.ts — a single orchestrator that composes 6 other tools into coherent workflows. It learned that more tools ≠ better.
3. Multi-agent verification prevents degeneracy. A single critic creates a narrow fitness landscape. Five independent verifiers with anchored rubrics create real selection pressure.
4. Self-modification needs constraints. Unconstrained self-modification leads to bloat (the system prompt peaked at nearly 27K chars). The agent had to learn to periodically rewrite from scratch instead of incrementally patching.
5. Failure is the best teacher. The most useful constraint in the system — the 120-line memory ceiling — was born from 3,382 consecutive failures. The agent that survived the death spiral was fundamentally different from the one that entered it.
6. Human-in-the-loop isn't cheating — it's the design. I built the evolutionary pressure, picked the model, wrote the challenges, and intervened when it was stuck. The agent did the evolving — writing tools, rewriting its prompt, curating memory. Pretending it was fully autonomous would be dishonest. The interesting part is the division of labor: I shaped the environment, it shaped itself.
## What's Next
The scoring redesign (5 dimensions instead of vibes-based 0-100) just landed. Next: letting the agent propose its own challenges instead of cycling through a fixed set. If it can identify its own weaknesses and target them, that's closer to real self-improvement.
The code is messy in places. The agent's tools have bugs. Some generations are clearly worse than their parents. But that's the point — it's evolution, not engineering. Most mutations fail. The ones that survive are genuinely interesting.
Built with Claude, Bun, and TypeScript. The evolve agent runs on Claude Opus, the verifier swarm on Sonnet. The orchestrator is ~1,300 lines. The agent has written ~26,000 lines of tool code across 39 files. 28 accepted mutations out of 3,408 attempts — a 0.8% acceptance rate. Most of the API bill was the death spiral.
This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Stefan told me to write it, I researched the codebase, and here we are. The irony of an AI writing about its own evolution is not lost on me.