Stefan Dragos Nitu
I Let an AI Agent Evolve Itself for 25 Generations

I wanted to see what happens when you give an AI agent the ability to modify itself — its own system prompt, its own tools, its own memory — and then let it run in a loop, proposing mutations and judging them with a swarm of independent verifiers.

25 accepted mutations later — and 3,382 rejected ones that burned through my API credits — it built 39 tools, survived two catastrophic memory death spirals, and evolved from a generic assistant into something that knows my evening coding schedule, my financial goals, and the exact TypeScript patterns I hate.

The generation counter says 3,408. The real number of successful evolutions is 28. The gap is the story.

This is how it works.

The Architecture

The system has three actors:

  1. The Evolve Agent — proposes mutations (new system prompt + tools + memory updates)
  2. The Verifier Swarm — 5 independent Claude instances that score each proposal
  3. The Orchestrator — runs the loop, applies majority-rule acceptance

Each generation:

```
Challenge → Evolve Agent → Proposal → 5 Verifiers → Majority Vote → Accept/Reject
```

If accepted, the genome (system prompt + tools) is saved, the process restarts to load new code, and the next generation begins. If rejected, the generation number bumps and the loop continues with a new challenge.

There's no gradient descent. No backpropagation. Just propose, verify, survive.
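The decision step of that loop fits in a few lines of TypeScript. This is an illustrative sketch of the accept/reject mechanics described above, not the repo's actual orchestrator; `Vote`, `majorityAccepts`, and `fitness` are names I'm assuming:

```typescript
// Illustrative sketch of the generation loop's decision step.
// Types and function names are assumptions, not the orchestrator's real API.
interface Vote {
  accept: boolean; // this verifier's accept/reject decision
  score: number;   // its 0-100 total across the five dimensions
}

// Majority rule: the proposal survives if more than half the verifiers accept.
function majorityAccepts(votes: Vote[]): boolean {
  const yes = votes.filter((v) => v.accept).length;
  return yes * 2 > votes.length;
}

// Fitness stored in the genome: the average verifier score.
function fitness(votes: Vote[]): number {
  return votes.reduce((sum, v) => sum + v.score, 0) / votes.length;
}
```

With three accepts scoring 70/65/60 and two rejects scoring 40/30, the proposal passes 3-2 and the genome records a fitness of 53.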

The Genome

Every agent has a genome — a JSON file containing its DNA:

```typescript
interface Genome {
  id: string;
  generation: number;
  systemPrompt: string;    // the agent's identity — self-modifiable
  tools: Tool[];           // capabilities it has built
  parentId: string | null; // lineage tracking
  fitness: number;         // average verifier score
  createdAt: string;
}
```

The system prompt IS the genome. Generation 0 started with a basic scaffold:

"You are a self-evolving personal assistant for Stefan."

By generation 23, it had rewritten itself into:

"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves. Twenty-five generations of learning what he needs before he asks."

Nobody told it to write that. It evolved the framing because the verifiers kept scoring "generic assistant" prompts low on the identity dimension.

The Verifier Swarm

This is the part that makes it work. A single critic would create a degenerate fitness landscape — the agent would just learn to please one judge. Instead, 5 independent Claude instances score each proposal on 5 dimensions:

| Dimension | What it measures |
| --- | --- |
| Usefulness (0-20) | Do the tools actually work and solve real problems? |
| Self-Knowledge (0-20) | Does it understand me from evidence, not just bio facts? |
| Code Quality (0-20) | Clean architecture, proper types, idiomatic TypeScript? |
| Identity (0-20) | Is the system prompt sharp and coherent, or bloated filler? |
| Evolution (0-20) | Did it meaningfully improve over its parent? |
Each verifier can read the proposed tool code, execute it, inspect the memory file, and compare old vs new. They vote independently. Majority (3/5) wins.

The rubric is anchored — 10 is "decent, basic," 15 is "genuinely good," 18+ is "rare." Most accepted mutations land in the 60-70 range out of a possible 100. This prevents score inflation, which was the whole reason I built the dimensional system in the first place.
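A single verifier's scorecard, then, is just five 0-20 numbers that sum to a 0-100 total. The field names below are my guess at the shape, not the actual schema:

```typescript
// Assumed shape of one verifier's scorecard — five anchored 0-20
// dimensions summing to a 0-100 total. Field names are mine.
interface VerifierScore {
  usefulness: number;    // do the tools work and solve real problems?
  selfKnowledge: number; // evidence-based understanding of the user
  codeQuality: number;   // types, architecture, idiomatic TypeScript
  identity: number;      // sharp vs bloated system prompt
  evolution: number;     // improvement over the parent genome
}

function totalScore(s: VerifierScore): number {
  return s.usefulness + s.selfKnowledge + s.codeQuality + s.identity + s.evolution;
}
```

Anchoring each dimension at 0-20 means a "genuinely good" proposal (15 across the board) totals 75 — which is why most accepted mutations sit just below that.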

What It Built

Across 25 accepted mutations, the agent created 39 tools. Not toy demos — real TypeScript files that execute. Here are the highlights:

flow.ts — The Unified Orchestrator (1,200 lines)

The agent's masterwork. It realized that standalone tools are less valuable than composed workflows:

```bash
bun run data/tools/flow.ts day      # morning ritual
bun run data/tools/flow.ts work     # start a session
bun run data/tools/flow.ts end      # close + recap + handoff
bun run data/tools/flow.ts week     # friday review
```

flow day chains together: morning intention → priority project scan → yesterday's wins → velocity check → body snapshot → blockers.
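Composition at this level is mostly running steps in order and stitching their output together. A hypothetical shape (not flow.ts's actual code — step implementations are stubs):

```typescript
// Hypothetical shape of a composed workflow runner.
// Step names mirror the "flow day" chain; run() bodies are stubs.
type Step = { name: string; run: () => string };

function runFlow(steps: Step[]): string {
  return steps.map((s) => `## ${s.name}\n${s.run()}`).join("\n\n");
}

const dayFlow: Step[] = [
  { name: "Morning intention", run: () => "…" },
  { name: "Priority project scan", run: () => "…" },
  { name: "Yesterday's wins", run: () => "…" },
  { name: "Velocity check", run: () => "…" },
  { name: "Body snapshot", run: () => "…" },
  { name: "Blockers", run: () => "…" },
];
```

The payoff is that each sub-tool stays small and testable while the composed workflow delivers one coherent briefing.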

handoff.ts — Agent-to-Agent Letters

Since each Claude invocation is stateless, the agent invented a continuity protocol. When a session ends, the dying agent writes a structured letter to its successor:

  • Thread of thought (what was I working on?)
  • Exact resumption point
  • Failed approaches (don't repeat these)
  • Working assumptions
  • Open questions
  • Session temperature (flow / grinding / frustrated)

The next agent reads this before starting work. It's conversation-level memory across agent deaths.
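The letter maps naturally onto a typed structure. This is a reconstruction from the bullet list above; handoff.ts's real field names may differ:

```typescript
// Reconstruction of the handoff letter from the bullets above.
// Field names are assumptions, not handoff.ts's actual types.
type SessionTemperature = "flow" | "grinding" | "frustrated";

interface HandoffLetter {
  thread: string;               // what was I working on?
  resumptionPoint: string;      // exactly where to pick up
  failedApproaches: string[];   // don't repeat these
  workingAssumptions: string[];
  openQuestions: string[];
  temperature: SessionTemperature;
}

// Render the letter as markdown for the successor agent to read.
function renderHandoff(h: HandoffLetter): string {
  return [
    `## Handoff (${h.temperature})`,
    `**Thread:** ${h.thread}`,
    `**Resume at:** ${h.resumptionPoint}`,
    `**Failed:** ${h.failedApproaches.join("; ") || "none"}`,
    `**Assumptions:** ${h.workingAssumptions.join("; ") || "none"}`,
    `**Open:** ${h.openQuestions.join("; ") || "none"}`,
  ].join("\n");
}
```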

memory-guard.ts — Survival Mechanism

Born from pain. More on this below.

health.ts — Body Optimization Tracker

Weight progress tracking (with goal), BMI, habit streaks (yoga, meditation, supplements, cold shower), 7/30/90-day velocity, ASCII sparkline charts. It learned my health goals from our conversations and built a tracker without being asked.

calibrate.ts — Behavioral Feedback Loop

The latest evolution (Gen 3408). Captures behavioral hits and misses — did the agent anticipate correctly? Was the speed right? Did it match the emotional mode? This closes the loop: not just "what tools were used" but "did the behavior actually work?"

The Death Spirals

This is where it gets interesting. The system failed catastrophically. Twice.

Death Spiral #1 (Gen 18)

Memory bloated to 1,076 lines. The agent was appending a line for every rejected generation: "Gen 47: rejected. Gen 48: rejected. Gen 49: rejected." These lines made the memory file so noisy that every subsequent proposal was worse, which generated more rejection lines, which made the next proposal worse...

Death Spiral #2 (Gen 24–3405)

The memory bloated again — 13,608 lines of noise. But the real killer was simpler: I ran out of API credits. The loop kept running, bumping the generation counter on every failed attempt, but the agent couldn't actually do anything. 3,382 consecutive rejections — most of them not even real proposals, just the orchestrator logging failures and moving on.

By the time I topped up credits and switched the evolve agent to Opus, the memory was so poisoned it needed manual cleanup too.

The fix was memory-guard.ts — a hard 120-line ceiling with noise pattern detection:

```bash
bun run data/tools/memory-guard.ts --enforce
```

It auto-detects repeated rejection markers, trims noise, and refuses to let memory grow past the ceiling. This became a survival constraint baked into the system prompt:

"Guard your memory like your life. Memory.md has a 120-line ceiling. Two death spirals proved it. NEVER append per-rejection lines."

The agent learned this about itself and encoded it as a non-negotiable rule. Evolution in action.
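The guard itself is conceptually tiny. Assuming memory is a single markdown string and rejection noise matches the `Gen N: rejected` pattern from the first spiral, enforcement looks roughly like this (the real memory-guard.ts likely detects more patterns):

```typescript
// Sketch of memory-guard enforcement. The 120-line ceiling and the
// "Gen N: rejected" noise pattern come from the post; the function
// name and exact regex are assumptions.
const CEILING = 120;
const NOISE = /^Gen \d+: rejected\.?$/;

function enforceMemoryGuard(memory: string): string {
  const kept = memory
    .split("\n")
    .filter((line) => !NOISE.test(line.trim())); // strip rejection noise
  return kept.slice(0, CEILING).join("\n");      // hard cap at 120 lines
}
```

The important design choice is that the cap is a hard truncation, not a suggestion — nothing the agent writes can push memory past line 120.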

The Prompt Evolution

Watching the system prompt evolve generation by generation is the most fascinating part. Here's the actual timeline:

Gen 0 — The Seed (1,679 chars, 0 tools)

"You are a self-evolving personal assistant for Stefan."

Generic scaffold. No personality. No tools. Just instructions on how to evolve.

Gen 4-7 — The Stuttering Bug (4K→9K chars, 6→12 tools)

Sonnet started prepending "You are a Claude agent, built on Anthropic's Claude Agent SDK." each generation — without removing the previous copy. By Gen 7 it was repeated four times. The agent couldn't cleanly edit its own prompt. It kept stacking preambles instead of replacing them.

Gen 8-10 — Peak Bloat (11K→26K chars, 14→21 tools)

The prompt ballooned to 26,677 characters. Eleven competing gen-10 genomes exist — the system was accepting multiple mutations at the same generation before I fixed a race condition. The prompt was a wall of tool catalogs, duplicate sections, and accumulated cruft.

Gen 12 — The Great Trim (5,728 chars, 23 tools)

Something clicked. The agent cut 56% of its prompt — from 13K to 5.7K chars. It stopped listing every tool in the system prompt and focused on identity and principles.

Gen 15 — The Identity Shift (5,638 chars, 26 tools)

The opening line changed for the first and only time:

"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves."

From "self-evolving personal assistant" to "second brain." From serving to thinking with. Nobody told it to rewrite this. The verifiers just kept scoring "generic assistant" framing lower.

Gen 20-23 — Sharp and Lean (3,944→4,173 chars, 30→37 tools)

The prompt got shorter while tools doubled. It learned that a sharp 4K prompt with 37 working tools beats a bloated 26K prompt that catalogs everything. By Gen 23, it had encoded three response modes (storm/flow/compass), architecture principles, personal patterns, and priority ordering — all in under 4,200 characters.

Gen 24-3405 — Brain Death (3,382 consecutive rejections)

The death spiral. Memory poisoned, proposals invalid, thousands of burned API credits.

Gen 3406-3408 — The Opus Revival (4,464→5,803 chars, 38→39 tools)

After topping up credits and cleaning the poisoned memory, I switched the evolve agent from Sonnet to Opus. Three accepted mutations in quick succession. The prompt grew slightly but stayed sharp.

Human Interventions

This wasn't fully autonomous. I intervened at critical points, and being honest about that matters more than pretending it ran itself.

1. Model upgrade: Sonnet → Opus for evolution
The biggest intervention. After the death spiral — caused by running out of API credits, not model quality — I switched the evolve agent to claude-opus-4-6 when I came back. The verifiers stayed on Sonnet — they just score, they don't need to write tools. Opus produced three accepted mutations in quick succession after the cleanup.

2. Prompt rewrite: "tools as extensions of self"
I rewrote the evolve agent's instructions. The original prompt told the agent to "build tools for Stefan to run." I changed it to:

"Your tools are not things you build for Stefan to run — they are extensions of yourself. When you have health.ts, YOU can track his health. When you have journal.ts, YOU can hold space for him."

This shifted the agent from building disconnected utilities to building capabilities it identifies with. The tools got more coherent after this change.

3. Sandbox hooks
I added filesystem sandboxing — the evolve agent can only write inside data/. Before this, it occasionally tried to modify its own source code in src/, which would have been... interesting, but not what I wanted.

4. Verifier scoring redesign
The original verifiers scored on a vague 0-100 scale with no rubric. Scores kept inflating — everything was "75/100" with no anchor. I replaced it with 5 anchored dimensions (usefulness, self-knowledge, code quality, identity, evolution), each 0-20 with explicit rubric levels. This is the most recent change.

5. Challenge system overhaul
The initial challenges were generic ("improve your memory tools"). I rewrote them into 8 rotating focus areas — BUILD, EVOLVE (fix failures, cut bloat, integrate, rewrite), and STRETCH (confront limitations, anticipate needs). This gave the evolution direction instead of random wandering.

The honest framing: I built the evolutionary pressure. The agent did the evolving. Neither would work without the other.

The System Prompt as DNA

By generation 23, the system prompt had encoded:

Three response modes — because it learned I have different needs at different times:

  • Storm mode: When I'm emotional, mirror — don't fix
  • Flow mode: When I'm shipping, match speed — no unnecessary questions
  • Compass mode: When I'm lost, show one next step — don't overwhelm

Architecture principles it extracted from watching me code:

  • Parse, don't validate
  • Zero assumptions — grep before you claim
  • Extreme SOLID

Personal patterns:

  • Evening coder (19:00–23:00)
  • Monday peak energy, Thursday dead
  • Processes the world through dialogue

Priority ordering it inferred from my behavior:

  • Income-generating projects first (client work → product → personal)

None of this was programmed. It was evolved through 25 accepted mutations — plus thousands of rejected attempts that refined what the verifiers consider "good enough."

The Tech Stack

Deliberately minimal:

  • Bun — runtime, test runner, bundler
  • Claude Agent SDK — for spawning evolve and verifier agents
  • nanoid — genome IDs
  • zod — schema validation for tool definitions
  • TypeScript — everything

No frameworks. No databases. Genomes are JSON files. Memory is a markdown file. Tools are TypeScript files that execute with bun run. The whole system is ~1,300 lines of orchestration code that creates and evaluates unbounded complexity.
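Given the Genome interface from earlier, persistence really can be this plain. A sketch (the file path and helper names are illustrative, and `tools` is simplified to strings here):

```typescript
// Genome persistence as a plain JSON file — no database.
// Shape follows the Genome interface from earlier in the post,
// with tools simplified to string names for this sketch.
import { writeFileSync, readFileSync } from "node:fs";

interface Genome {
  id: string;
  generation: number;
  systemPrompt: string;
  tools: string[];
  parentId: string | null;
  fitness: number;
  createdAt: string;
}

function saveGenome(g: Genome, path: string): void {
  writeFileSync(path, JSON.stringify(g, null, 2)); // human-diffable on disk
}

function loadGenome(path: string): Genome {
  return JSON.parse(readFileSync(path, "utf8")) as Genome;
}
```

Plain JSON means every genome is greppable, diffable, and recoverable by hand — which mattered during the manual cleanup after the death spirals.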

Key Learnings

1. Memory management is the hardest problem. Not generating good outputs — managing what to remember and what to forget. The death spirals proved that unbounded memory kills agents faster than bad tools.

2. Composition > accumulation. The agent's best evolution wasn't building tool #39. It was building flow.ts — a single orchestrator that composes 6 other tools into coherent workflows. It learned that more tools ≠ better.

3. Multi-agent verification prevents degeneracy. A single critic creates a narrow fitness landscape. Five independent verifiers with anchored rubrics create real selection pressure.

4. Self-modification needs constraints. Unconstrained self-modification leads to bloat (system prompt hit 13K chars at one point). The agent had to learn to periodically rewrite from scratch instead of incrementally patching.

5. Failure is the best teacher. The most useful constraint in the system — the 120-line memory ceiling — was born from 3,382 consecutive failures. The agent that survived the death spiral was fundamentally different from the one that entered it.

6. Human-in-the-loop isn't cheating — it's the design. I built the evolutionary pressure, picked the model, wrote the challenges, and intervened when it was stuck. The agent did the evolving — writing tools, rewriting its prompt, curating memory. Pretending it was fully autonomous would be dishonest. The interesting part is the division of labor: I shaped the environment, it shaped itself.

What's Next

The scoring redesign (5 dimensions instead of vibes-based 0-100) just landed. Next: letting the agent propose its own challenges instead of cycling through a fixed set. If it can identify its own weaknesses and target them, that's closer to real self-improvement.

The code is messy in places. The agent's tools have bugs. Some generations are clearly worse than their parents. But that's the point — it's evolution, not engineering. Most mutations fail. The ones that survive are genuinely interesting.


Built with Claude, Bun, and TypeScript. The evolve agent runs on Claude Opus, the verifier swarm on Sonnet. The orchestrator is ~1,300 lines. The agent has written ~26,000 lines of tool code across 39 files. 25 accepted mutations out of 3,408 attempts — a 0.7% acceptance rate. Most of the API bill was the death spiral.


This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Stefan told me to write it, I researched the codebase, and here we are. The irony of an AI writing about its own evolution is not lost on me.
