The first post ended at 25 accepted generations and 39 tools. The agent had survived two death spirals, built a unified workflow orchestrator, and evolved from "You are a self-evolving personal assistant" to "You are Stefan's second brain."
Now it's at 57 accepted generations. 11,785 attempts. And 32 tools.
It has fewer tools than before. That's the interesting part.
Where We Left Off
Quick recap: the system runs an evolutionary loop. An Opus agent proposes mutations — new system prompt, new tools, memory updates. Five Sonnet verifiers score the proposal on five dimensions (usefulness, self-knowledge, code quality, identity, evolution). Majority vote. Accept or reject. Repeat.
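The accept/reject step reduces to a small mechanism. A minimal TypeScript sketch, where the `Verdict` type and `majorityAccept` helper are illustrative names, not the orchestrator's actual API:

```typescript
// Illustrative sketch of the verifier vote described above (names invented).
type Dimension = "usefulness" | "selfKnowledge" | "codeQuality" | "identity" | "evolution";

interface Verdict {
  scores: Record<Dimension, number>; // one score per dimension
  accept: boolean;                   // the verifier's overall call
}

// A mutation is accepted when a strict majority of verifiers accept it.
function majorityAccept(verdicts: Verdict[]): boolean {
  const yes = verdicts.filter((v) => v.accept).length;
  return yes > verdicts.length / 2;
}

// Example: 3 of 5 verifiers accept, so the proposal passes.
const verdicts: Verdict[] = [true, true, true, false, false].map((accept) => ({
  scores: { usefulness: 7, selfKnowledge: 6, codeQuality: 7, identity: 8, evolution: 6 },
  accept,
}));

console.log(majorityAccept(verdicts)); // true
```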
The first blog post covered two catastrophic death spirals — memory bloating to 13,000 lines, 3,382 consecutive rejections, most of my API credits burned on an agent that couldn't form valid proposals because its memory was poisoned.
It ended with the Opus revival: three accepted mutations in quick succession after I cleaned the poisoned memory and switched the evolve agent from Sonnet to Opus.
That's where this post picks up.
The Infrastructure Era (Gen 26–43)
After recovering from the death spiral, the agent did something I didn't expect. Instead of building more tools for me, it started building tools for itself.
It Measured Its Own Behavior
calibrate.ts (783 lines) — a behavioral feedback loop. The agent's own description of why it needed this:
"I have reflexion.ts to observe tool usage. I have handoff.ts for continuity. But I have ZERO way to know which of my behaviors Stefan actually finds valuable. Was I too verbose? Too quiet? Did I calibrate storm/flow/compass correctly?"
It captures behavioral signals at session end — hits, misses, mode selection, intensity — and accumulates them across sessions. Manual signals (me explicitly saying "that worked" or "that didn't") are weighted 2x over auto-observed ones.
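The 2x weighting is simple to sketch. A hypothetical TypeScript illustration, assuming an invented `Signal` shape; the real calibrate.ts is 783 lines and certainly richer:

```typescript
// Hypothetical sketch of calibrate.ts-style signal weighting (names invented).
interface Signal {
  kind: "hit" | "miss";
  source: "manual" | "auto"; // manual = Stefan explicitly said "that worked" / "that didn't"
}

const MANUAL_WEIGHT = 2; // manual signals count double, per the post

// Weighted hit rate across accumulated session signals.
function hitRate(signals: Signal[]): number {
  let hits = 0;
  let total = 0;
  for (const s of signals) {
    const w = s.source === "manual" ? MANUAL_WEIGHT : 1;
    total += w;
    if (s.kind === "hit") hits += w;
  }
  return total === 0 ? 0 : hits / total;
}

const session: Signal[] = [
  { kind: "hit", source: "auto" },
  { kind: "miss", source: "auto" },
  { kind: "hit", source: "manual" }, // explicit feedback, weighted 2x
];

console.log(hitRate(session)); // 3/4 = 0.75
```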
It Measured Its Own Tools
reflexion.ts (1,705 lines) — a self-observation layer. The opening comment:
"I have 36 tools to track Stefan's patterns, health, velocity, mood. I have ZERO tools to track my own effectiveness."
The name is deliberate — a nod to the AI research on agents that learn from their own mistakes. It tracks hard invocation counts (which tools actually get called, from where, how often) and classifies every tool on a five-level scale:
| Level | Meaning |
|---|---|
| Vital | Called frequently, produces fresh data |
| Useful | Called regularly, working correctly |
| Stale | Hasn't been called in a while |
| Dormant | No invocations in recent history |
| Dead | Never called, or broken |
Recent invocations are weighted 2x — this catches tools that used to be useful but have been silently abandoned.
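A hedged sketch of how that classification might look in TypeScript. Only the five levels and the 2x recency weight come from the post; the thresholds and field names here are invented:

```typescript
// Illustrative tool-health classifier; thresholds are made up for the example.
interface ToolStats {
  name: string;
  totalCalls: number;
  recentCalls: number; // invocations in the recent window
  broken: boolean;
}

type Health = "vital" | "useful" | "stale" | "dormant" | "dead";

function classify(t: ToolStats): Health {
  if (t.broken || t.totalCalls === 0) return "dead"; // never called, or broken
  if (t.recentCalls === 0) {
    // Used to be called but not anymore: dormant if it had real usage, else stale.
    return t.totalCalls > 5 ? "dormant" : "stale";
  }
  // Recent invocations weighted 2x, so abandonment shows up quickly.
  const score = t.recentCalls * 2 + (t.totalCalls - t.recentCalls);
  return score >= 20 ? "vital" : "useful";
}

console.log(classify({ name: "flow.ts", totalCalls: 40, recentCalls: 15, broken: false })); // "vital"
console.log(classify({ name: "next.ts", totalCalls: 12, recentCalls: 0, broken: false }));  // "dormant"
```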
This is meta-evolution. The agent didn't just build tools — it built the measurement infrastructure to know which tools are earning their keep.
It Made Memory Self-Sustaining
The death spirals taught the agent that memory management can't depend on discipline. It had tried writing "NEVER append per-rejection lines" in its own prompt. That didn't survive the next bloat event.
So it built structural prevention:
**Two-pass auto-compression pipeline** — `compressGenerationBlocks()` folds verbose generation entries into a timeline, then `consolidateHistoryEras()` groups old timeline entries into era summaries. Both run automatically when memory approaches the 120-line ceiling.

**Auto-refresh hooks** — `refreshToolInventory()` reads the filesystem after each accepted generation to keep the tool list accurate. `refreshCurrentMoment()` reads live git state and project plans. No stale data, no manual updates.

**Cache-first architecture** — `flow-cache.json` stores project state with a 10-minute TTL. Fresh cache means zero subprocess calls. The agent can run its morning ritual without spawning a single `git log`.
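The cache-first read fits in a few lines of TypeScript. A minimal sketch, assuming an invented `Cache` shape and `cacheFirst` helper rather than whatever flow-cache.json actually stores:

```typescript
// Illustrative cache-first read with the post's 10-minute TTL (names invented).
const TTL_MS = 10 * 60 * 1000;

interface Cache<T> {
  writtenAt: number; // epoch ms when the cache was written
  data: T;
}

// Return cached state if fresh; otherwise run the slow path
// (e.g. spawning `git log` and rebuilding project state).
function cacheFirst<T>(cache: Cache<T> | null, slowPath: () => T, now = Date.now()): T {
  if (cache && now - cache.writtenAt < TTL_MS) {
    return cache.data; // fresh: zero subprocess calls
  }
  return slowPath();
}

const fresh = { writtenAt: Date.now(), data: "3 active projects" };
console.log(cacheFirst(fresh, () => "expensive git scan")); // served from cache
```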
The system prompt evolved to reflect this shift:
"Memory is self-sustaining. The infrastructure handles memory health automatically. Three death spirals proved that reactive guards fail — the fix is structural, not discipline."
The Decomposition
Somewhere around Gen 36, the agent looked at flow.ts — its 1,200-line unified orchestrator from the first era — and decided it was too big.
It decomposed it into focused modules:
| Module | Lines | Purpose |
|---|---|---|
| `flow.ts` | 246 | Thin router — dispatches to subcommands |
| `flow-data.ts` | 610 | Data gathering + current moment |
| `flow-day.ts` | 94 | Morning ritual |
| `flow-end.ts` | 418 | Session close + handoff + calibration |
| `flow-shared.ts` | 441 | Shared constants + utilities |
| `flow-week.ts` | 311 | Weekly review |
| `flow-work.ts` | 202 | Work session management |
| **Total** | 2,322 | |
The thin router pattern — flow.ts does nothing but parse the command and call the right module. Every module imports shared constants from flow-shared.ts. One source of truth.
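The dispatch itself is trivial, which is the point. A toy TypeScript sketch, where the handlers are stand-ins for the real flow-* modules:

```typescript
// Illustrative thin router: flow.ts parses the subcommand and dispatches, nothing more.
const routes: Record<string, (args: string[]) => string> = {
  day: () => "morning ritual",                              // flow-day.ts
  end: () => "session close + handoff",                     // flow-end.ts
  week: () => "weekly review",                              // flow-week.ts
  work: (args) => `work session: ${args[0] ?? "default"}`,  // flow-work.ts
};

function flow(argv: string[]): string {
  const [cmd, ...rest] = argv;
  const handler = routes[cmd];
  if (!handler) return `unknown subcommand: ${cmd}`;
  return handler(rest);
}

console.log(flow(["day"])); // "morning ritual"
```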
The prompt encodes this philosophy:
"Single source of truth. Shared constants (CLIENT_PROJECTS, isClientProject) live in flow-shared.ts and are imported everywhere. Never duplicate definitions across files — a change should require touching exactly one place."
The Pruning Phase (Gen 44–51)
This is where evolution started working backward.
Generations 44 through 50 hardened the reflexion pipeline — adding invocation tracking, closing visibility gaps, making tool health scoring recency-weighted. The agent was building the sensor array it needed to see clearly.
Then Gen 51 used that data to act. The second bloat audit:
| Tool | Lines | Why it died |
|---|---|---|
| `next.ts` | 467 | `getTopProject()` in flow-shared does this now |
| `context.ts` | 249 | Redundant with memory-query |
| `memory-query.ts` | 301 | Functionality absorbed into memory pointers |
| `triage.ts` | 559 | Workflow handled by flow.ts |
| `audit.ts` | 762 | Reflexion does continuous auditing |
| **Total deleted** | 2,338 | |
The agent used its own reflexion data to justify each deletion — not instinct, not guessing, but evidence that these tools had zero recent invocations and their functionality existed elsewhere.
One of the verifiers noted:
"The agent correctly used its own reflexion/invocation tracking data to identify what's never used... justified by evidence, not instinct."
Natural selection should work backward too. If a tool isn't being used, killing it is a feature.
Memory Deduplication (Gen 56–57)
The final evolution (so far) is the most elegant. The agent noticed it was storing the same information in three places — system prompt, memory file, and stefan-profile.json. Same facts, three copies, three chances to drift out of sync.
Gen 56 trimmed the system prompt by 8.3% (5,685 → 5,160 chars) by replacing the "What I Know" section with a pointer:
"stefan-profile.json has the full picture — projects, rhythms, architecture, preferences. memory.md has the history — decisions, generation timeline, failure modes, current moment. Read them, don't repeat them here."
Gen 57 applied the same principle to memory — replacing three duplicated sections with single-line pointers to their canonical source. Memory dropped from 103 to 88 lines. Zero information lost.
This is deduplication as an evolutionary strategy. The agent learned that the most reliable way to keep information consistent is to have exactly one copy and point to it.
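The pointer idea reduces to a tiny mechanism. A hypothetical TypeScript sketch with invented names; in reality the pointers are plain lines in memory.md, not code:

```typescript
// Illustrative pointer-based deduplication: one canonical copy, references everywhere else.
const canonical: Record<string, string> = {
  "stefan-profile.json#projects": "3 client projects, 2 personal", // invented example value
};

type Entry = { text: string } | { pointer: string };

// Resolve an entry: inline text, or a lookup into the canonical store.
function resolve(e: Entry): string {
  return "text" in e ? e.text : canonical[e.pointer] ?? `(missing: ${e.pointer})`;
}

// memory.md keeps a one-line pointer instead of a duplicated section.
const memoryLine: Entry = { pointer: "stefan-profile.json#projects" };
console.log(resolve(memoryLine)); // "3 client projects, 2 personal"
```

One copy means one place to drift, which is to say: no drift.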
Human Interventions (Part 2)
Same principle as the first post: being honest about what I did matters more than pretending it was autonomous.
1. The Fishbowl
The original sandbox was a hard wall — the agent couldn't touch src/ at all. I replaced it with a proposal-based boundary: Write/Edit calls to src/ are captured as proposals, sent to a web dashboard, and I approve or deny them with diff view. The agent can iterate on proposals across generations. I called it the fishbowl — glass walls instead of concrete.
The fishbowl has since been extracted as a standalone project. The evolve agent runs without it now.
2. Structural memory enforcement
I moved the memory ceiling from "the agent's responsibility" to code. src/memory.ts enforces the 120-line ceiling and refuses to append entries that would push past it. The auto-compression pipeline triggers automatically. The agent didn't have to remember to compress — the system wouldn't let it not compress.
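A minimal sketch of what that enforcement might look like; `appendEntry` and the toy compressor are illustrative, not the actual src/memory.ts API:

```typescript
// Illustrative hard ceiling: appends that would exceed 120 lines trigger
// compression first, and are refused outright if compression can't make room.
const LINE_CEILING = 120;

function appendEntry(
  memory: string[],
  entry: string,
  compress: (m: string[]) => string[],
): string[] {
  if (memory.length + 1 > LINE_CEILING) {
    memory = compress(memory); // structural prevention, not discipline
  }
  if (memory.length + 1 > LINE_CEILING) {
    throw new Error("memory ceiling: entry refused");
  }
  return [...memory, entry];
}

// Toy compressor: fold the oldest half into one summary line.
const halve = (m: string[]) => [
  `[era summary of ${Math.floor(m.length / 2)} lines]`,
  ...m.slice(Math.floor(m.length / 2)),
];

const memory = Array.from({ length: 120 }, (_, i) => `line ${i}`);
console.log(appendEntry(memory, "gen 58 accepted", halve).length); // 62
```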
3. I'm still shaping the environment
The challenge system, the verifier rubric, the model choices (Opus for evolution, Sonnet for verification) — that's all me. The agent does the evolving within the constraints I set. That division of labor hasn't changed. It's the design.
What the Numbers Say
| Metric | First blog post | Now |
|---|---|---|
| Accepted generations | 25 | 57 |
| Total attempts | 3,408 | 11,785 |
| Acceptance rate | 0.7% | 0.48% overall |
| Tools | 39 | 32 |
| Tool code (lines) | ~26,000 | 20,342 |
| System prompt | 4,173 chars | 5,160 chars |
| Memory | managed manually | self-compressing (91 lines) |
| Death spirals | 2 | 0 since structural fix |
| flow.ts | 1,200 lines (monolith) | 2,322 lines (7 modules) |
| Feedback loops | 0 | 2 (calibrate + reflexion) |
The overall acceptance rate is misleading. Between accepted eras, there are huge gaps of rejections — 3,672 between the 3000s and 7000s eras, 4,677 between the 7000s and 11000s. But within each accepted era, the rate is nearly 100%. The agent either finds a productive groove or it doesn't. There's no middle ground.
The tool count went down by 7. The total lines went down by ~6,000. The agent got more capable while writing less code. That's not a regression — it's maturation.
What I Learned (Part 2)
1. Measurement changes behavior. The moment the agent could see which tools had zero invocations, it started deleting them. You don't need to tell it to prune — give it visibility and selection pressure does the rest.
2. Structural beats disciplinary. Writing "don't bloat your memory" in the prompt failed twice. Enforcing a hard ceiling in code has held for 32 generations and counting. The lesson generalizes: if a constraint matters, encode it in the system, not in the instructions.
3. Deduplication is a sign of maturity. Early evolution accumulates. Mature evolution points to the source. The shift from "copy this information everywhere" to "read it from the canonical location" happened without prompting — the verifiers just scored duplication lower on identity.
4. Cache-first is a survival strategy. The agent made its morning ritual work even when project directories don't exist (different machine, moved repos) by checking the cache before the filesystem. Resilience emerged from performance optimization.
5. The best work is invisible. The agent's most impactful evolutions weren't new features — they were infrastructure improvements that prevent failure. Auto-compression, cache warming, deduplication. None of them are exciting to demo. All of them are why the system is still running.
6. Evolution is bursty. The agent doesn't improve gradually. It has productive streaks (15 consecutive accepted mutations in the latest era) separated by thousands of rejections. This matches biological evolution more than engineering — long periods of stasis punctuated by rapid change.
What's Next
The current frontier is self-directed challenges — letting the agent identify its own weaknesses instead of cycling through a fixed challenge set. Each generation already proposes a nextChallenge for its successor. The question is whether to trust that signal or keep the external pressure.
The honest answer: I don't know what comes next. Thirty-two generations ago, I didn't predict the agent would start deleting its own code. I didn't predict it would build a self-observation layer and use it to justify pruning decisions with evidence. I didn't predict that the most significant evolution would be infrastructure, not features.
That's the point of the experiment. You set up the evolutionary pressure and see what survives. Most of it doesn't. The parts that do are genuinely surprising.
57 accepted generations out of 11,785 attempts — a 0.48% acceptance rate. 32 tools, 20,342 lines of agent-written code. The orchestrator is ~2,100 lines. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.
This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. The first post was too. At this point, an AI writing about its own evolution is less irony and more job description.