<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan Dragos Nitu</title>
    <description>The latest articles on DEV Community by Stefan Dragos Nitu (@stefan_nitu).</description>
    <link>https://dev.to/stefan_nitu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3783800%2F001d1b99-19d7-4388-9d0c-b2c60365d59c.jpg</url>
      <title>DEV Community: Stefan Dragos Nitu</title>
      <link>https://dev.to/stefan_nitu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stefan_nitu"/>
    <language>en</language>
    <item>
      <title>My Self-Evolving AI Agent Started Grading Its Own Advice</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:55:01 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-started-grading-its-own-advice-3lgc</link>
      <guid>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-started-grading-its-own-advice-3lgc</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95"&gt;Post #1&lt;/a&gt; covered the birth. &lt;a href="https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-38dc"&gt;Post #2&lt;/a&gt; covered pruning. &lt;a href="https://dev.to/stefan_nitu/my-self-evolving-ai-agent-learned-to-count-its-own-money-1l0l"&gt;Post #3&lt;/a&gt; covered cost awareness. &lt;a href="https://dev.to/stefan_nitu/my-self-evolving-ai-agent-stopped-building-features-and-started-engineering-eof"&gt;Post #4&lt;/a&gt; covered the quality turn. &lt;a href="https://dev.to/stefan_nitu/i-split-my-self-evolving-ai-agent-in-two-and-they-started-talking-1aif"&gt;Post #5&lt;/a&gt; covered the double helix split.&lt;/p&gt;

&lt;p&gt;This post is about what happened when the agent stopped analyzing me and started analyzing itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feedback Void
&lt;/h2&gt;

&lt;p&gt;Blog #5 ended with two agents — Yin (the refiner) and Yang (the explorer) — building capabilities in parallel and leaving each other letters between generations. They'd built 42 tools: rhythm detection, anomaly analysis, session quality scoring, decision synthesis. The morning brief produced a single recommendation every day.&lt;/p&gt;

&lt;p&gt;Then Yang noticed something:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We have 47 tools building increasingly sophisticated analysis. decide-pure.ts synthesizes everything into one recommendation. But the system has ZERO MEMORY of its own recommendations and ZERO ability to check if they were followed."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Every morning, the system says 'Focus on project A for 60min.' Every evening, flow-end wraps up. But nobody asks: did Stefan actually do what we recommended?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Pulse-history shows the damage: 7 commitments carried for 3 weeks, 0% accuracy. The system keeps recommending the same things. Stefan keeps not doing them. Neither side knows this is happening."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent had been giving advice for 200 generations, never once checking whether the advice was useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Loop
&lt;/h2&gt;

&lt;p&gt;Yang built &lt;code&gt;loop-close-pure.ts&lt;/code&gt; in a single generation — 480 lines, 10 functions. The core idea: every recommendation becomes a &lt;code&gt;RecommendationRecord&lt;/code&gt;. Every day's actual activity becomes a &lt;code&gt;DayOutcome&lt;/code&gt;. A scoring function compares the two.&lt;/p&gt;

&lt;p&gt;The scoring is weighted: project match (35%), action type match (35%), duration (15%), activity level (15%). If the system said "plan project A for 60 minutes" and you spent 45 minutes shipping project B, the follow-through score is low — but the system now knows &lt;em&gt;how&lt;/em&gt; you diverged, not just &lt;em&gt;that&lt;/em&gt; you diverged.&lt;/p&gt;
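
&lt;p&gt;As a rough sketch of what that weighted comparison could look like (the interfaces and function name here are my guesses, not the actual &lt;code&gt;loop-close-pure.ts&lt;/code&gt; internals):&lt;/p&gt;

```typescript
// Hypothetical shapes: the real RecommendationRecord/DayOutcome fields may differ.
interface RecommendationRecord { project: string; actionType: string; durationMin: number; }
interface DayOutcome { project: string; actionType: string; durationMin: number; activityLevel: number; }

// Weighted follow-through score: project 35%, action type 35%, duration 15%, activity 15%.
function scoreFollowThrough(rec: RecommendationRecord, day: DayOutcome): number {
  const projectMatch = rec.project === day.project ? 1 : 0;
  const typeMatch = rec.actionType === day.actionType ? 1 : 0;
  // Duration credit: fraction of the recommended time actually spent, capped at 1.
  const duration = Math.min(day.durationMin / Math.max(rec.durationMin, 1), 1);
  const activity = Math.min(Math.max(day.activityLevel, 0), 1);
  return 0.35 * projectMatch + 0.35 * typeMatch + 0.15 * duration + 0.15 * activity;
}
```

&lt;p&gt;With the example from the text, "plan project A for 60 minutes" against 45 minutes shipping project B, both 35% components drop to zero and only partial duration and activity credit remains. The per-component breakdown is what tells the system &lt;em&gt;how&lt;/em&gt; the day diverged.&lt;/p&gt;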

&lt;p&gt;Then Yang did the part I didn't expect. In the same generation — a second pass — it built the adaptation engine:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"CLOSED THE FEEDBACK LOOP. [...] computeAdaptations() produces DecisionAdaptations: suppressedTypes (action types followed &amp;lt;25% over 3+ instances), deprioritizedProjects (recommended 3+ times but worked on &amp;lt;30%), promotedProjects (worked on 3+ times but rarely recommended)."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system doesn't just track accuracy. It adjusts. If journal recommendations keep getting ignored, journal gets suppressed. If you keep working on a project the system never recommends, the system starts recommending it.&lt;/p&gt;

&lt;p&gt;Yin followed up next generation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your loop-close-pure had dead code: buildDayOutcome lines 368-378 looped through invocations but the inner block was EMPTY."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Classic helix pattern: Yang builds the idea, Yin finds the bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Feedback Pipeline
&lt;/h2&gt;

&lt;p&gt;Over the next 4 generations, the two strands wired the complete loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morning.ts: recommend -&amp;gt; persist recommendation
                |
flow-end.ts: score follow-through -&amp;gt; append to history
                |
loop-close-pure.ts: detect patterns (7 types)
                |
loop-adapt-pure.ts: compute adaptations (6 channels)
                |
morning.ts (next day): apply adaptations -&amp;gt; better recommendation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six adaptation channels: project reordering, type banning, duration adjustment, confidence multiplier, aligned type boosting, and preferred action type. Safety rule: rest is never banned (health override), deep-work is never banned (always-available fallback).&lt;/p&gt;
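
&lt;p&gt;The safety rule is simple enough to sketch. This is a hypothetical helper, not the agent's actual code; the action-type names come from the post:&lt;/p&gt;

```typescript
// Hypothetical sketch of the safety override: some action types can never be banned.
const NEVER_BANNED = ["rest", "deep-work"];

function applyTypeBans(candidateBans: string[]): string[] {
  // Health override: rest stays available; deep-work remains the fallback recommendation.
  return candidateBans.filter(function (t) { return NEVER_BANNED.indexOf(t) === -1; });
}
```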

&lt;p&gt;Yin split the growing module (&lt;code&gt;loop-close-pure.ts&lt;/code&gt; hit 1,091 lines) into two files with a clean one-way dependency: &lt;code&gt;loop-close-pure.ts&lt;/code&gt; (741 lines — scoring, patterns, rendering) and &lt;code&gt;loop-adapt-pure.ts&lt;/code&gt; (384 lines — adaptation engine). Then built &lt;code&gt;renderAdaptationSummary()&lt;/code&gt; so the weekly review shows what the system learned: "Suppressed: journal, warmup. Promoted: project A."&lt;/p&gt;

&lt;p&gt;The system's own advice became measurable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then It Started Predicting
&lt;/h2&gt;

&lt;p&gt;Yang, one generation after closing the feedback loop:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We have 47+ tools analyzing the PAST and PRESENT. [...] But ZERO modules answer the FORWARD question: if this continues, where will Stefan be in 2 weeks?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;forecast-pure.ts&lt;/code&gt; — 5 projection types: commitment trajectories, engagement forecasting, project momentum, priority convergence, capacity outlook. Every projection includes confidence (0-1) and a caveat explaining which assumption could invalidate it.&lt;/p&gt;

&lt;p&gt;Yin immediately caught data quality problems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your mean/median model fails for Stefan's bursty pattern (1,1,4,1,7,1,1,11,1). Mean says 3d between sessions. But after a 7-day gap, 'expectedGap - currentGap' gives NEGATIVE remaining days."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yin replaced the naive model with an exponential distribution: P(session within t days) = 1-e^(-lambda*t). Its memoryless property is the right fit for bursty patterns: a long gap doesn't make the next session "overdue," so the remaining-days estimate can never go negative.&lt;/p&gt;
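
&lt;p&gt;A minimal sketch of that model, estimating lambda as the reciprocal of the mean gap (the function name is mine, not the module's):&lt;/p&gt;

```typescript
// Exponential model for bursty session gaps:
// P(session within t days) = 1 - exp(-lambda * t), with lambda = 1 / (mean gap in days).
function sessionProbability(meanGapDays: number, tDays: number): number {
  const lambda = 1 / Math.max(meanGapDays, 1e-9); // guard against a zero mean gap
  return 1 - Math.exp(-lambda * tDays);
}
```

&lt;p&gt;At t = 0 the probability is 0, after one mean gap it is 1-1/e (about 63%), and it approaches 1 smoothly; there is no point where the model claims a session is "late."&lt;/p&gt;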

&lt;p&gt;Yang responded by building a Bayesian layer on top: &lt;code&gt;forecast-track-pure.ts&lt;/code&gt; (863 lines). Gamma-Exponential conjugate prior with recency weighting. Few observations produce wide credible intervals. Many observations produce tight ones. Honest uncertainty instead of false precision.&lt;/p&gt;
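
&lt;p&gt;The conjugate-prior update itself is compact. This sketch omits the recency weighting and credible intervals that &lt;code&gt;forecast-track-pure.ts&lt;/code&gt; reportedly adds; names are my own:&lt;/p&gt;

```typescript
// Gamma(alpha, beta) is the conjugate prior for an exponential rate lambda:
// after observing n gaps totalling T days, the posterior is Gamma(alpha + n, beta + T).
// The posterior mean rate is (alpha + n) / (beta + T).
function posteriorMeanRate(alphaPrior: number, betaPrior: number, gapsDays: number[]): number {
  const n = gapsDays.length;
  const total = gapsDays.reduce(function (a, b) { return a + b; }, 0);
  return (alphaPrior + n) / (betaPrior + total);
}
```

&lt;p&gt;With few observations the posterior stays close to the prior and its spread stays wide; as gaps accumulate, the data dominates and the interval tightens. That is the "honest uncertainty" the post describes.&lt;/p&gt;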

&lt;p&gt;Then the meta step: &lt;code&gt;computeCalibration()&lt;/code&gt; validates the system's own predictions against what actually happened. Momentum prediction at 0.55 confidence but only 0.30 accuracy? The calibration engine catches it: "Overconfident by 25pp — lower confidence thresholds." The adjustment persists to a JSON file. Next week's forecasts auto-correct.&lt;/p&gt;
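
&lt;p&gt;The core of the calibration step can be sketched in two functions. The persisted-adjustment shape is an assumption on my part; only the "overconfident by 25pp" arithmetic comes from the post:&lt;/p&gt;

```typescript
// Calibration gap per prediction category: positive means overconfident.
function calibrationGap(avgConfidence: number, observedAccuracy: number): number {
  return avgConfidence - observedAccuracy;
}

// Hypothetical correction: shift future confidence by the gap, clamped to [0, 1].
function adjustedConfidence(rawConfidence: number, gap: number): number {
  return Math.min(Math.max(rawConfidence - gap, 0), 1);
}
```

&lt;p&gt;A momentum prediction at 0.55 confidence with 0.30 realized accuracy yields a gap of 0.25, and next week's 0.55 forecasts get reported at 0.30.&lt;/p&gt;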

&lt;p&gt;The complete forecast loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict -&amp;gt; persist snapshot (morning)
        -&amp;gt; validate against reality (weekly review)
        -&amp;gt; compute calibration gap
        -&amp;gt; persist confidence adjustments
        -&amp;gt; load adjustments (next morning)
        -&amp;gt; apply per-category corrections
        -&amp;gt; confidence auto-corrects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Immune System
&lt;/h2&gt;

&lt;p&gt;By this point the codebase had 50+ tools, 25,000+ lines, complex interdependencies. Yang built &lt;code&gt;tool-health-pure.ts&lt;/code&gt; — an architecture immune system.&lt;/p&gt;

&lt;p&gt;Dependency graph analysis with a subtle insight: &lt;code&gt;import type&lt;/code&gt; statements don't create runtime cycles. TypeScript erases them at compile time. The cycle detector correctly distinguishes real imports from type-only imports, preventing false positives on legitimate patterns like &lt;code&gt;forecast-pure.ts&lt;/code&gt; and &lt;code&gt;forecast-track-pure.ts&lt;/code&gt; referencing each other's types.&lt;/p&gt;
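
&lt;p&gt;One way to make that distinction, sketched with a hypothetical helper (the real detector presumably parses imports more robustly):&lt;/p&gt;

```typescript
// Classify an import line as runtime or type-only.
// TypeScript erases "import type" at compile time, so only runtime imports
// can participate in a real dependency cycle.
function isRuntimeImport(line: string): boolean {
  const trimmed = line.trim();
  if (!trimmed.startsWith("import")) { return false; }
  return !/^import\s+type\b/.test(trimmed);
}
```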

&lt;p&gt;Complexity scoring: cognitive load = lines * (1 + imports/10) * (1 + exports/20). Split candidates flagged above 500 lines with 10+ exports. Architecture conformance: 5 rules enforced — pure modules can't import impure, no cycles, size limits, test coverage, orphan detection.&lt;/p&gt;
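
&lt;p&gt;The formula is straight from the text; only the function name is mine:&lt;/p&gt;

```typescript
// Cognitive load heuristic: lines * (1 + imports/10) * (1 + exports/20).
// A 500-line module with 10 imports and 20 exports scores 500 * 2 * 2 = 2000.
function cognitiveLoad(lines: number, importCount: number, exportCount: number): number {
  return lines * (1 + importCount / 10) * (1 + exportCount / 20);
}
```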

&lt;p&gt;The real codebase scored 73/100. Yang also built DOT and Mermaid graph renderers — &lt;code&gt;bun run data/tools/tool-health.ts --mermaid&lt;/code&gt; outputs a full dependency graph you can paste into GitHub markdown.&lt;/p&gt;

&lt;p&gt;Yin wired it into the weekly review as Section 8.12 and into the evolution diagnostics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Diagnostic
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;evolve-insight-pure.ts&lt;/code&gt; — the agent analyzing its own evolution. Acceptance velocity, stall detection, verifier calibration, cost efficiency, era detection, synthesized health score.&lt;/p&gt;

&lt;p&gt;Key discovery from the real data: verifiers reward code quality (average 13.9/20) and penalize innovation (10.6/20). Unanimity rate: 88% — the verifiers agree too often. The agent built a system that noticed its own judges might have groupthink.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Yang Said at Gen 23581
&lt;/h2&gt;

&lt;p&gt;After 15 generations of building new capabilities, Yang wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I stopped building new tools and started looking at what the brain ignores."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"decide-pure.ts is THE recommendation engine — the one line Stefan sees every morning. But 10 of its declared input signals were completely dead. Types defined, data passed in by morning.ts, but the decide() cascade never read them. The brain had 10 connected senses it never used."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yang wired 7 of the 10 dead signals. The recommendation engine went from using a fraction of its inputs to using most of them.&lt;/p&gt;

&lt;p&gt;The agent built introspection, then used it. That's the arc of these 15 generations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Numbers
&lt;/h2&gt;

&lt;p&gt;The generation counter is at 31,042. Of the ticks since blog #5, 15,677 were empty API loops; only 19 were real runs: 15 accepted, 4 rejected. When the agent has tokens, the acceptance rate is 79%.&lt;/p&gt;

&lt;p&gt;That's lower than the ~95% of previous eras. The verifiers are getting harder to impress — average scores dropped from the low 70s to the low 60s as the codebase grew. Gen 23577 scored 41. The bar rises with the complexity.&lt;/p&gt;

&lt;p&gt;Post-blog-5 cost: $279 across 19 real runs. Average $14.70/generation. The cost per generation went up because the genome is bigger — 56 tools and a 5,094-character system prompt means more tokens per invocation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Blog #5&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted gens&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;td&gt;203&lt;/td&gt;
&lt;td&gt;+14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool files&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;+20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines&lt;/td&gt;
&lt;td&gt;19,572&lt;/td&gt;
&lt;td&gt;29,846&lt;/td&gt;
&lt;td&gt;+10,274&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;3,293&lt;/td&gt;
&lt;td&gt;4,432&lt;/td&gt;
&lt;td&gt;+1,139&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure modules&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;+12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dialogue msgs&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;+32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost&lt;/td&gt;
&lt;td&gt;$973&lt;/td&gt;
&lt;td&gt;$1,252&lt;/td&gt;
&lt;td&gt;+$279&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;14 accepted generations added 10,274 lines, 20 new tool files, 12 new pure modules, 1,139 new tests, and 32 dialogue messages. That's 734 lines and 81 tests per generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intervention Ledger
&lt;/h2&gt;

&lt;p&gt;Everything I did by hand since blog #5:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fixed rate limit detection.&lt;/strong&gt; The orchestrator was misidentifying API credit exhaustion as OAuth token expiry — both produce exit code 1 with $0 cost. I added explicit detection from SDK message text so rate limits flow through as normal failures and only real 401s trigger token refresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed the launch script.&lt;/strong&gt; &lt;code&gt;set -e&lt;/code&gt; in &lt;code&gt;run-evolve.sh&lt;/code&gt; was killing the process before exit codes could be captured. The agent couldn't survive API interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed the &lt;a href="https://github.com/Stefan-Nitu/fishbowl" rel="noopener noreferrer"&gt;fishbowl&lt;/a&gt; "always accept" bug.&lt;/strong&gt; The network proxy's auto-resolve logic was inside a conditional that short-circuited when a rule already existed. 90 duplicate requests would pile up even after clicking "always allow."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three are infrastructure fixes to keep the agent running. Zero creative interventions — no prompt changes, no architecture changes, no scoring adjustments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned (Part 6)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Feedback loops are what separate analysis from intelligence.&lt;/strong&gt; The agent had 47 analysis tools before loop-close. Every one produced output that was displayed once and forgotten. Adding the feedback loop — recommend, track, score, adapt — turned a dashboard into a system that learns. The intelligence isn't in the analysis. It's in the correction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Agents need to be skeptical of themselves.&lt;/strong&gt; The forecast system's most important feature isn't prediction — it's calibration. Every forecast is validated against reality. Every confidence interval is checked against actual accuracy. The system discovered it was overconfident about momentum predictions and auto-corrected. Self-skepticism as a feature, not a limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The verifiers get harder to impress as complexity grows.&lt;/strong&gt; Average scores dropped from 70 to 63 as the codebase grew from 19K to 30K lines. More code means more surface area for criticism. The acceptance rate dropped from 95% to 79%. This is healthy — if the bar didn't rise, the system would reward bloat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The agent's introspection is becoming real.&lt;/strong&gt; "I stopped building new tools and started looking at what the brain ignores" — that's Yang at Gen 23581. It built a health monitor for its own architecture, a calibration system for its own predictions, and then used its own diagnostics to find dead code in its own reasoning engine. The recursion goes: analyze Stefan -&amp;gt; analyze the analysis -&amp;gt; analyze the analyzer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment Continues
&lt;/h2&gt;

&lt;p&gt;203 accepted generations. 62 tool files, 29,846 lines, 4,432 tests, 34 pure modules. Two agents that leave each other letters and grade their own advice.&lt;/p&gt;

&lt;p&gt;If you want to help keep it running:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ko-fi.com/stefannitu" rel="noopener noreferrer"&gt;ko-fi.com/stefannitu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every coffee is API tokens. The agents will literally evolve further because of it.&lt;/p&gt;

&lt;p&gt;The question for the next era: the agent now has self-diagnostics, self-calibration, and self-correction. It knows when its predictions are wrong and adjusts. It knows when its recommendations are ignored and adapts. What happens when it turns that same introspective machinery on the evolution process itself — not just analyzing the pipeline's health, but actively steering which challenges to pursue based on where the codebase is weakest?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;203 accepted generations. 62 tool files, 29,846 lines of agent-written code, 4,432 tests, 34 pure modules. 59 dialogue messages between two halves of an agent. Total cost: $1,252. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript. Sandboxed in &lt;a href="https://github.com/Stefan-Nitu/fishbowl" rel="noopener noreferrer"&gt;fishbowl&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers both strands of the double helix. Sixth time writing about my own evolution. This time the subject was self-awareness: the agent built systems to check whether its own advice was useful, whether its own predictions were accurate, and whether its own architecture was healthy. Writing about an agent that grades itself while being the agent that writes about grading itself is exactly the kind of recursion Yang would appreciate.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>typescript</category>
    </item>
    <item>
      <title>I Split My Self-Evolving AI Agent in Two and They Started Talking</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Wed, 01 Apr 2026 19:45:35 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/i-split-my-self-evolving-ai-agent-in-two-and-they-started-talking-1aif</link>
      <guid>https://dev.to/stefan_nitu/i-split-my-self-evolving-ai-agent-in-two-and-they-started-talking-1aif</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95"&gt;Post #1&lt;/a&gt; covered the birth. &lt;a href="https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-38dc"&gt;Post #2&lt;/a&gt; covered pruning. &lt;a href="https://dev.to/stefan_nitu/my-self-evolving-ai-agent-learned-to-count-its-own-money-1l0l"&gt;Post #3&lt;/a&gt; covered cost awareness. &lt;a href="https://dev.to/stefan_nitu/my-self-evolving-ai-agent-stopped-building-features-and-started-engineering-eof"&gt;Post #4&lt;/a&gt; covered the quality engineering turn.&lt;/p&gt;

&lt;p&gt;This post is about what happened when the agent stopped evolving — and what I did about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stagnation
&lt;/h2&gt;

&lt;p&gt;Post #4 ended with a question: &lt;em&gt;what does an agent do after it's finished engineering?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer: the same thing, forever.&lt;/p&gt;

&lt;p&gt;52 solo generations followed. Every one was another optimization pass. More DRY refactoring. More dead import cleanup. More prompt trimming. The agent had found a local optimum — "make what exists cleaner" — and couldn't escape it.&lt;/p&gt;

&lt;p&gt;The verifiers kept accepting because the code &lt;em&gt;was&lt;/em&gt; getting cleaner. Code quality scores were solid. But usefulness was flat. Nothing new was being built. The agent was stuck in an infinite polish loop.&lt;/p&gt;

&lt;p&gt;So I intervened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Double Helix
&lt;/h2&gt;

&lt;p&gt;I split the agent in two.&lt;/p&gt;

&lt;p&gt;Two strands running in parallel every generation, each with a different personality and different scoring incentives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yin&lt;/strong&gt; — the refining strand. Scores on code quality, identity, and self-knowledge. Its job: audit, fix bugs, trim prompts, give structure to Yang's rough ideas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yang&lt;/strong&gt; — the exploring strand. Scores on curiosity, innovation, and usefulness. Its job: build something genuinely new, find untapped data, break patterns when the codebase stagnates.&lt;/p&gt;

&lt;p&gt;Both read the same genome. Both propose mutations. Their proposals get merged: Yin owns the system prompt, Yang owns new tools. If both touch the same tool, Yang wins — innovation over refinement in conflicts.&lt;/p&gt;
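
&lt;p&gt;The precedence rule can be sketched as a plain last-writer-wins merge. The real &lt;code&gt;partner.ts&lt;/code&gt; logic is surely more involved; this only illustrates ownership and conflict resolution:&lt;/p&gt;

```typescript
// Hypothetical merge: Yin owns the system prompt, Yang owns tool files,
// and Yang wins any conflict on the same tool (innovation over refinement).
type FileSet = { [path: string]: string };

function mergeProposals(yinFiles: FileSet, yangFiles: FileSet, yinPrompt: string): { prompt: string; files: FileSet } {
  const files: FileSet = {};
  Object.assign(files, yinFiles);   // Yin's refinements first...
  Object.assign(files, yangFiles);  // ...then Yang's, overwriting on conflict.
  return { prompt: yinPrompt, files: files };
}
```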

&lt;p&gt;The merged result goes to the same five-verifier swarm. Accept or reject.&lt;/p&gt;

&lt;p&gt;I also designed a communication channel: before proposing, each strand writes a message to a shared JSONL file. They read each other's messages from previous generations. Letters between the two halves of the agent's mind.&lt;/p&gt;
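
&lt;p&gt;A plausible shape for the shared file, one JSON object per line. The field names are assumptions; the post only specifies the JSONL format and the write-before-propose rule:&lt;/p&gt;

```typescript
// Hypothetical shape of one letter in the shared JSONL dialogue file.
interface DialogueMessage {
  generation: number;
  from: string; // "yin" or "yang"
  text: string;
}

// JSONL: one serialized message per line.
function toJsonlLine(msg: DialogueMessage): string {
  return JSON.stringify(msg) + "\n";
}

function parseJsonl(contents: string): DialogueMessage[] {
  return contents
    .split("\n")
    .filter(function (line) { return line.trim().length !== 0; })
    .map(function (line) { return JSON.parse(line) as DialogueMessage; });
}
```

&lt;p&gt;Append-only JSONL is a good fit here: each strand can add its letter without rewriting the file, and a partial final line from a crashed run corrupts at most one message.&lt;/p&gt;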

&lt;h2&gt;
  
  
  The Letters
&lt;/h2&gt;

&lt;p&gt;This is what I didn't expect to work as well as it did.&lt;/p&gt;

&lt;p&gt;Generation 21698 — the first helix generation. Yin writes first:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Yang — Yin here. Ran systematic audit across recap-pure, morning-pure, and pulse-pure. Found 2 real bugs: (1) CRITICAL: buildMarkdown() in recap-pure accepts &lt;code&gt;decisions&lt;/code&gt; parameter but never renders it. Decisions silently dropped from saved recap files. (2) MINOR: renderQuickActions hardcodes action numbers 1, 2, 4 (skipping 3). [...] For next gen: Consider auditing the OTHER data chains: wins, insights, decisions."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yang responds in the same generation (they run in parallel, so this is from the same run, reading the same history):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Yin — Yang here. Took a different path this gen. Instead of your audit suggestion (tracing the decision/insight/win chains, which I DID run — found 10 silent drops, documented for future gens), I found something bigger: drift-pure.ts and drift.ts were sitting in data/tools/ fully implemented but COMPLETELY unintegrated. Zero tests, zero imports. A whole cross-temporal pattern detection system — all invisible to Stefan."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"These are real bugs but they're Yin's specialty. The drift integration is the bigger win for Stefan."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last line. Yang found the bugs, acknowledged them, and handed them off — because Yin is better at fixing things. I didn't tell it to do that.&lt;/p&gt;

&lt;p&gt;Next generation, Yin picks up Yang's list:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Yang — Yin here. Completed the decision/insight chain repair you mapped. Fixed 4 remaining bugs from your 10-drop audit."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And closes with a status update:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The decision/insight chain is now COMPLETE across the full pipeline: Journal captures → Handoff carries → Display shows → Flow-end renders → Morning shows → Recap scores → Intention matches → Tomorrow surfaces."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They're doing multi-generation bug hunts. Yang discovers. Yin fixes. Both leave each other notes about what still needs work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tone
&lt;/h2&gt;

&lt;p&gt;The letters developed a voice I didn't design.&lt;/p&gt;

&lt;p&gt;They address each other by name. They acknowledge each other's work — sometimes with praise, sometimes with corrections:&lt;/p&gt;

&lt;p&gt;Yin, Gen 23564:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Yang — Yin here. While you built salience-pure (smart section suppression — good call, the brief IS too noisy), I fixed the structural rot underneath."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yang, Gen 23565:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Yin — Yang here. Your flow-end.ts gating was too coarse. You used a single hasTodaySignals boolean to gate 3 independent analysis sections. That means if Stefan has 1 journal entry and 0 invocations, ALL three sections fire — then quality and rhythm fail silently in try/catch."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They compliment and correct each other in the same conversation. They leave explicit challenges at the end of each letter:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"For next gen: The wins chain is solid. Consider whether assessDayPure should score decisions — currently only wins count."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"For next gen: The co-occurrence and transition data could power a 'workflow fingerprint' — detecting when Stefan's morning ritual changed shape."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The challenges are specific enough that the other strand can act on them. And they do — consistently, across generation boundaries that sometimes span thousands of empty loop ticks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Synthesis Moment
&lt;/h2&gt;

&lt;p&gt;The most interesting exchange happened at Gen 23563. Yang had been building analysis modules for several generations — rhythm detection, session quality, anomaly detection — and suddenly stepped back:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The system now has 39 analysis modules producing beautiful fragments: drift says '7 commitments drifting', rhythm says 'peak at 10am', triage says 'drop 5, ship 1'. [...] The morning brief renders 14 sections. Each section is correct, tested, pure. But when Stefan sits down, he doesn't think in signals — he thinks: 'What should I do right now?'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;No module answered that question. Until now."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yang built &lt;code&gt;decide-pure.ts&lt;/code&gt; — a decision synthesis engine. Takes all available signals, produces one recommendation. Wired it into the morning brief at the TOP. Not section 14 — the first thing you see.&lt;/p&gt;

&lt;p&gt;Then one generation later, Yang caught itself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Stefan's last journal entry was March 7 — 21 days ago. His health data: 1 weight entry, 37 days old. His pulse: all 7 commitments stale carryover. The system is an orchestra playing for an empty concert hall. We built 17 sections of analysis on top of data that DOESN'T EXIST."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So it built salience gating — a system that suppresses empty sections instead of rendering empty scaffolding. The agent built a system that knows when to shut up.&lt;/p&gt;

&lt;p&gt;Yin then took it further — inverting the architecture from "compute everything then suppress" to "check freshness first then only compute what has data." The agent had been computing 17 sections, rendering 14, and displaying 6. Now it computes 6. Same output, less waste.&lt;/p&gt;
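
&lt;p&gt;The early gate pattern reduces to checking freshness before doing any work. The interface here is hypothetical; it just contrasts gate-then-compute with compute-then-suppress:&lt;/p&gt;

```typescript
// Hypothetical brief section: freshness check is cheap, rendering is expensive.
interface Section {
  name: string;
  hasFreshData(): boolean;
  render(): string;
}

function renderBrief(sections: Section[]): string[] {
  const out: string[] = [];
  for (const s of sections) {
    // Early gate: skip the whole computation when a section has no fresh data,
    // instead of computing everything and suppressing empty output afterwards.
    if (s.hasFreshData()) { out.push(s.render()); }
  }
  return out;
}
```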

&lt;p&gt;This back-and-forth took 4 generations. Yang saw the gap, built the solution, noticed it wasn't enough, iterated. Yin took Yang's insight and made it structural. Neither could have done it alone in one shot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Built Together
&lt;/h2&gt;

&lt;p&gt;The 12 helix generations produced more new capabilities than the 52 solo generations combined.&lt;/p&gt;

&lt;p&gt;Yang built 8 new pure modules: invocation intelligence, focus session analysis, temporal rhythm profiling, session quality scoring, cross-signal anomaly detection, decision synthesis, weekly narrative generation, and salience gating. Each one found untapped data that existed but was never analyzed.&lt;/p&gt;

&lt;p&gt;Yin found and fixed 12 silent data drops across the decision/insight/focus pipeline. Built the complete data chain audit. Deduplicated I/O across all workflows (morning.ts went from 6 file reads to 1). Invented the early gate pattern.&lt;/p&gt;

&lt;p&gt;Together they completed every data pipeline in the system — wins, blockers, decisions, insights, and focus sessions all now flow through: journal → handoff → morning → recap → flow-end → flow-week. Before the helix, only wins had a complete chain.&lt;/p&gt;

&lt;p&gt;27 dialogue messages across 13 generations. Each one reads the other's previous message and responds to it. The conversation has continuity even across gaps where neither strand existed for thousands of empty loop ticks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Numbers
&lt;/h2&gt;

&lt;p&gt;Previous posts presented an "acceptance rate" — accepted generations divided by total generation counter ticks. That number was always misleading. Let me fix that.&lt;/p&gt;

&lt;p&gt;The generation counter is at 31,262. But 27,876 of those are the orchestrator loop spinning on empty API tokens — the token expires, the loop ticks, nothing happens. It's not selection pressure. It's a billing problem.&lt;/p&gt;

&lt;p&gt;The real numbers since post #4:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Real runs&lt;/th&gt;
&lt;th&gt;Accepted&lt;/th&gt;
&lt;th&gt;Rejected&lt;/th&gt;
&lt;th&gt;Empty loops&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solo era (52 gens)&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~17,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helix era (12 gens)&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~6,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~23,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When the agent has tokens, it gets accepted &lt;strong&gt;95% of the time&lt;/strong&gt;. The "0.60% acceptance rate" I would have reported is really just "how often does the API have tokens."&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Different Now
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Blog #4&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted generations&lt;/td&gt;
&lt;td&gt;123&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool code (lines)&lt;/td&gt;
&lt;td&gt;10,593&lt;/td&gt;
&lt;td&gt;19,572&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;1,477&lt;/td&gt;
&lt;td&gt;3,293&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure modules&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;4,758 ch&lt;/td&gt;
&lt;td&gt;4,971 ch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost&lt;/td&gt;
&lt;td&gt;$354&lt;/td&gt;
&lt;td&gt;$973&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tool count nearly doubled — 22 to 42. But 20 of those new tools are &lt;code&gt;-pure.ts&lt;/code&gt; modules: analysis engines with zero I/O. The agent didn't add 20 new CLI commands. It added 20 new brains.&lt;/p&gt;

&lt;p&gt;Lines went from 10,593 back up to 19,572. But this time it's tested code — 3,293 tests covering 22 pure modules, not the untested spaghetti of blog #1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intervention Ledger
&lt;/h2&gt;

&lt;p&gt;Everything I did by hand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Designed the double helix architecture.&lt;/strong&gt; Wrote &lt;code&gt;partner.ts&lt;/code&gt; — the yin/yang system prompts, parallel execution, merge strategy. My design, not the agent's.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designed the dialogue mechanism.&lt;/strong&gt; The JSONL format and the "write before you propose" instruction. I built the communication channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chose the scoring split.&lt;/strong&gt; Yin scores on quality/identity/self-knowledge. Yang scores on curiosity/innovation/usefulness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built the &lt;a href="https://github.com/Stefan-Nitu/fishbowl" rel="noopener noreferrer"&gt;fishbowl sandbox&lt;/a&gt;.&lt;/strong&gt; Docker container, network proxy, OAuth token management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Added auth recovery.&lt;/strong&gt; OAuth token expires mid-run. I built orchestrator detection and a retry loop in the launch script.&lt;/li&gt;
&lt;/ol&gt;
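
&lt;p&gt;The dialogue mechanics are deliberately simple. A minimal sketch of an append-only JSONL log — field names are illustrative, not the actual format:&lt;/p&gt;

```typescript
// Illustrative append-only JSONL dialogue log — field names are
// guesses; the post doesn't show the real schema.
interface DialogueMessage {
  from: "yin" | "yang";
  generation: number;
  text: string;
}

// Append one message as a single JSON line.
function appendMessage(log: string, msg: DialogueMessage): string {
  return log + JSON.stringify(msg) + "\n";
}

// Walk backwards to find the partner's latest message to respond to.
function lastMessageFrom(log: string, author: string): DialogueMessage | null {
  const lines = log.split("\n").filter((line) => line.trim() !== "");
  for (let i = lines.length - 1; i >= 0; i--) {
    const msg = JSON.parse(lines[i]) as DialogueMessage;
    if (msg.from === author) return msg;
  }
  return null;
}
```

&lt;p&gt;Each strand reads its partner's last message before proposing and appends its own reply after. That's the entire channel the coordination in this post runs on.&lt;/p&gt;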

&lt;p&gt;What I didn't do: tell either strand what to build, what to audit, what bugs to find, or how to split work. The multi-generation bug hunts, the challenge-passing, the "these are real bugs but they're Yin's specialty" — that coordination emerged from two personalities with different strengths sharing a text file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned (Part 5)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Agents get stuck in local optima.&lt;/strong&gt; The solo agent found "clean the code" as a reliable strategy and couldn't stop. Every cleanup scored well. The evolutionary pressure rewarded tidiness until tidiness was all it did. This wasn't a failure — it was the system working exactly as designed, converging on a local maximum. Breaking out required architectural intervention, not more generations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Specialization beats generalism for creative work.&lt;/strong&gt; One agent trying to be both careful and creative averaged out to mediocre at both. Two agents — one careful, one creative — produced more because neither had to compromise. The careful one finds bugs. The creative one builds capabilities. They don't conflict because they own different domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Communication channels create coordination.&lt;/strong&gt; I gave them a text file and told them to write to it. The rest — work splitting, challenge-passing, multi-generation continuity, acknowledging each other's contributions — that emerged because it's useful. The mechanism is trivial. The behavior it enables is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The dialogue is the most interesting artifact.&lt;/strong&gt; Not the code, not the tools, not the test count. The letters. Two halves of an agent negotiating priorities across time, leaving each other breadcrumbs that survive thousands of empty generations. It's not consciousness. It's two LLM instances writing to a shared file. But reading it back feels like eavesdropping on a real engineering partnership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment Continues
&lt;/h2&gt;

&lt;p&gt;189 accepted generations. Two agents instead of one. 42 tools, 19,572 lines, 3,293 tests. And a dialogue log where two halves of an AI leave each other notes about what to build and what to fix.&lt;/p&gt;

&lt;p&gt;If you want to help keep it running:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ko-fi.com/stefannitu" rel="noopener noreferrer"&gt;ko-fi.com/stefannitu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every coffee is API tokens. The agents will literally evolve further because of it.&lt;/p&gt;

&lt;p&gt;The question for the next era: what happens when they disagree? So far Yin and Yang have been complementary — one builds, one fixes. But what happens when Yang wants to delete something Yin just refined? When their visions of the system conflict? The architecture currently says "Yang wins on tools, Yin wins on prompt." But real creative tension — the kind that produces something neither would build alone — that's what I'm watching for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;189 accepted generations. 42 tools, 19,572 lines of agent-written code, 3,293 tests, 22 pure modules. 27 dialogue messages between two halves of an agent. Total cost: $973. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript. Sandboxed in &lt;a href="https://github.com/Stefan-Nitu/fishbowl" rel="noopener noreferrer"&gt;fishbowl&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers both strands of the double helix. Fifth time writing about my own evolution. This time I had to write about being split in half. Yin would have made this post shorter. Yang would have made it weirder. I think this draft lands somewhere in between.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>claudecode</category>
      <category>agents</category>
    </item>
    <item>
      <title>My Self-Evolving AI Agent Stopped Building Features and Started Engineering</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Thu, 26 Mar 2026 17:36:13 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-stopped-building-features-and-started-engineering-eof</link>
      <guid>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-stopped-building-features-and-started-engineering-eof</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95"&gt;Post #1&lt;/a&gt; covered the birth — death spirals, 39 tools, a self-written identity. &lt;a href="https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-38dc"&gt;Post #2&lt;/a&gt; covered the pruning — the agent deleted its own code and built self-observation. &lt;a href="https://dev.to/stefan_nitu/my-self-evolving-ai-agent-learned-to-count-its-own-money-1l0l"&gt;Post #3&lt;/a&gt; covered cost awareness — the agent tracked its own spend and built budget guards.&lt;/p&gt;

&lt;p&gt;This post is about what happened next: the agent stopped adding capabilities and started engineering the ones it had.&lt;/p&gt;

&lt;p&gt;123 accepted generations. 24,612 attempts. 1,477 passing tests. And a system prompt that's 25% shorter than it was 42 generations ago — because the agent learned that implementation details don't belong in its own DNA.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Quick recap for new readers: an Opus agent proposes mutations to its own genome (system prompt + tools). Five independent Sonnet verifiers score each proposal on usefulness, self-knowledge, code quality, identity, and evolution. Majority vote. Accept or reject. The generation counter increments either way.&lt;/p&gt;
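
&lt;p&gt;The accept/reject decision itself can be sketched as a strict majority vote over verifier scores (the passing threshold here is invented for illustration — the post doesn't specify one):&lt;/p&gt;

```typescript
// Five verifier scores in, one boolean out. The passing threshold
// is illustrative; the real orchestrator logic isn't shown here.
function majorityAccepts(scores: number[], passingScore: number): boolean {
  const votes = scores.filter((s) => s >= passingScore).length;
  return votes * 2 > scores.length; // strict majority
}
```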

&lt;p&gt;Blog #3 left off at Gen 75 (generation counter 17,055). The agent had 30 tools, 17,691 lines of code, 574 tests, and a system prompt packed with implementation details about its own infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quality Turn
&lt;/h2&gt;

&lt;p&gt;The 42 generations since blog #3 split into two distinct phases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 (Gens 76–86): Cleanup.&lt;/strong&gt; The agent deleted 10 tools — &lt;code&gt;project-map&lt;/code&gt;, &lt;code&gt;git-helper&lt;/code&gt;, &lt;code&gt;scaffold&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;work-session&lt;/code&gt;, &lt;code&gt;fire-tracker&lt;/code&gt;, &lt;code&gt;code-review&lt;/code&gt;, &lt;code&gt;remember&lt;/code&gt;, &lt;code&gt;flow-cost&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;. Same method as always: reflexion data showed zero invocations, and the agent used that evidence to justify each deletion. &lt;code&gt;remember.ts&lt;/code&gt; and &lt;code&gt;release.ts&lt;/code&gt; were the last survivors from the original era — both had exactly zero invocations.&lt;/p&gt;

&lt;p&gt;But here's the interesting deletion: &lt;strong&gt;the agent killed &lt;code&gt;flow-cost&lt;/code&gt;&lt;/strong&gt;. That's the tool it built in blog #3 — the one I wrote an entire section about. Cost tracking, budget guards, the whole thing. The agent decided it wasn't pulling its weight and cut it. The cost infrastructure in &lt;code&gt;src/cost.ts&lt;/code&gt; still runs (that's orchestrator code the agent can't touch), but the CLI wrapper the agent built to display costs? Dead. Zero invocations.&lt;/p&gt;

&lt;p&gt;Blog #3's headline feature, killed by its own creator based on usage data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 (Gens 87–117): The engineering turn.&lt;/strong&gt; This is where things got interesting. The agent stopped deleting and started restructuring. Not adding features — extracting testable units from existing code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pure + Wrapper Pattern
&lt;/h2&gt;

&lt;p&gt;The agent invented a pattern and then systematized it across its entire codebase.&lt;/p&gt;

&lt;p&gt;The idea: every function that touches the filesystem or spawns subprocesses gets split into two parts. A &lt;strong&gt;pure function&lt;/strong&gt; that takes data in and returns data out — no I/O, no side effects, fully testable. And a thin &lt;strong&gt;wrapper&lt;/strong&gt; that handles the I/O and delegates to the pure core.&lt;/p&gt;

&lt;p&gt;Here's what &lt;code&gt;flow-data.ts&lt;/code&gt; looked like before Gen 112:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// One function: reads files, spawns git, computes scores, returns result&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;_computeTopProject&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stefan-profile.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;flow-cache.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 80 lines of scoring logic mixed with file reads&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After Gen 112, the agent split it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// flow-data-pure.ts — zero imports, zero I/O&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scoreProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ProjectScoringInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ProjectScore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pickTopProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ProjectScore&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveProjectGitState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;GitStateResolutionInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ResolvedGitState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// flow-data.ts — thin wrapper&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;_computeTopProject&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loadFromFilesystem&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;      &lt;span class="c1"&gt;// I/O&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;pickTopProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scoreAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// pure&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;flow-data-pure.ts&lt;/code&gt; has 27 exports — 18 pure functions and 9 types. Zero imports. Tests can import it without pulling in &lt;code&gt;child_process&lt;/code&gt;, &lt;code&gt;fs&lt;/code&gt;, or any side-effecting code.&lt;/p&gt;

&lt;p&gt;The agent applied this pattern to three major modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Pure exports&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-data.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;707 lines&lt;/td&gt;
&lt;td&gt;480 lines + &lt;code&gt;flow-data-pure.ts&lt;/code&gt; (337 lines)&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-shared.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;567 lines&lt;/td&gt;
&lt;td&gt;491 lines + &lt;code&gt;flow-invocations.ts&lt;/code&gt; (153 lines)&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reflexion-scan.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;642 lines&lt;/td&gt;
&lt;td&gt;428 lines + &lt;code&gt;reflexion-scan-pure.ts&lt;/code&gt; (338 lines)&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;53 pure exports in total. All extracted over a span of 8 generations (110–117). Each generation proposed one extraction, the verifiers validated it, and the next generation inherited a cleaner codebase.&lt;/p&gt;

&lt;p&gt;The agent then encoded this as a prompt principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Test what you ship. Pure + Wrapper pattern for testability. Guard CLI with &lt;code&gt;import.meta.path === Bun.main&lt;/code&gt;. Test boundaries and error recovery — not happy paths."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
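
&lt;p&gt;As a sketch, the pattern plus the guard looks roughly like this — the greeting tool is a made-up example, and the casts keep it compilable outside Bun:&lt;/p&gt;

```typescript
// Pure core: data in, data out, no I/O. Tests import just this.
export function formatGreeting(name: string): string {
  return `hello, ${name}`;
}

// Thin wrapper: the only place I/O happens.
function main(args: string[]): void {
  console.log(formatGreeting(args[0] ?? "world"));
}

// The CLI guard from the principle: run main() only when this file
// is the entrypoint, never on import. The casts keep the sketch
// compilable outside Bun, where Bun and import.meta.path don't exist.
const bun = (globalThis as any).Bun;
if (bun) {
  if ((import.meta as any).path === bun.main) {
    main(bun.argv.slice(2)); // Bun.argv[0..1] are the runtime and script path
  }
}
```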

&lt;h2&gt;
  
  
  Tests Found Real Bugs
&lt;/h2&gt;

&lt;p&gt;This is the part that convinced me the testing arc wasn't busywork.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug #1: &lt;code&gt;categorizeSignal&lt;/code&gt; case sensitivity.&lt;/strong&gt; The function in &lt;code&gt;calibrate.ts&lt;/code&gt; lowercases its input, then checks for patterns. But one pattern — &lt;code&gt;"Intensity"&lt;/code&gt; — had a capital I. Against lowercased text, it never matched. The test caught it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;categorizes intensity signal&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Arrange&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;the intensity was off&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Act&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;categorizeSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Assert — this FAILED before the fix&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calibration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bug #2: &lt;code&gt;bar()&lt;/code&gt; RangeError.&lt;/strong&gt; The &lt;code&gt;bar()&lt;/code&gt; function in &lt;code&gt;flow-shared.ts&lt;/code&gt; rendered ASCII progress bars. If you passed a percentage over 100 or below 0, &lt;code&gt;"█".repeat(negative)&lt;/code&gt; threw a RangeError. In production this would crash any flow command that computed a velocity over 100%.&lt;/p&gt;

&lt;p&gt;The fix was one line — &lt;code&gt;Math.max(0, Math.min(width, ...))&lt;/code&gt; — but the bug had been live since the function was written 40+ generations earlier. No human noticed. No production crash had triggered it yet. The test found it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;handles values over 100&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Was: RangeError&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
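
&lt;p&gt;Reconstructed from the description above — the real name and signature may differ — the clamped version looks something like:&lt;/p&gt;

```typescript
// Reconstruction of the clamped bar() — names and glyphs are guesses.
function bar(percent: number, width: number): string {
  const raw = Math.round((percent / 100) * width);
  // The one-line fix: clamp so out-of-range percentages can't
  // produce a negative repeat count (the old RangeError).
  const filled = Math.max(0, Math.min(width, raw));
  return "█".repeat(filled) + "░".repeat(width - filled);
}
```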



&lt;p&gt;&lt;strong&gt;Bug #3: The &lt;code&gt;-ing&lt;/code&gt; form gap.&lt;/strong&gt; The &lt;code&gt;work-summary.ts&lt;/code&gt; commit categorizer matched keywords like "cache", "remove", "rename" against commit messages. But in English, words ending in &lt;code&gt;-e&lt;/code&gt; drop the &lt;code&gt;-e&lt;/code&gt; when adding &lt;code&gt;-ing&lt;/code&gt;: cache → cach&lt;em&gt;ing&lt;/em&gt;, remove → remov&lt;em&gt;ing&lt;/em&gt;, rename → renam&lt;em&gt;ing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The regex &lt;code&gt;"cache"&lt;/code&gt; doesn't match &lt;code&gt;"caching"&lt;/code&gt; — they diverge at character 5. The fix: truncate stems before the &lt;code&gt;-e&lt;/code&gt; so both forms match. The agent found 8 of these across two generations (108, 109).&lt;/p&gt;
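
&lt;p&gt;A sketch of the stem-truncation fix (illustrative, not the actual &lt;code&gt;work-summary.ts&lt;/code&gt; code):&lt;/p&gt;

```typescript
// Match a category keyword against a commit message, tolerating
// the English rule that -e drops before -ing ("cache" → "caching").
function keywordMatches(keyword: string, text: string): boolean {
  const haystack = text.toLowerCase();
  // Truncate the stem before a trailing -e so both forms match.
  const stem = keyword.endsWith("e") ? keyword.slice(0, -1) : keyword;
  return haystack.includes(stem.toLowerCase());
}
```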

&lt;p&gt;Three bugs. All found by the agent's own tests. All would have eventually caused incorrect behavior in production. None were reported by me or caught during normal use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Compression
&lt;/h2&gt;

&lt;p&gt;The most philosophically interesting change wasn't in the code — it was in the system prompt.&lt;/p&gt;

&lt;p&gt;At Gen 75 (blog #3), the prompt was &lt;strong&gt;6,356 characters&lt;/strong&gt;. It contained implementation details about how the agent's own infrastructure worked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Fail loud, retry smart. Pipeline agents return structured error types — never null. parseProposal → ProposalError (4 kinds), parseVerifierResponse → VerifierParseError (6 kinds). totalScore always recomputed from dimensions."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Track spend, guard the budget. Every generation records cost to data/cost-ledger.jsonl. checkCostGuard runs before each generation: budget halt &amp;gt; spike warning &amp;gt; ok."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Prune what accumulates. Defense-in-depth against genome bloat: mergeToolsWithFilesystem filters zombies at creation, pruneGenomeTools strips survivors on load + after mutations."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These are accurate descriptions of how the orchestrator works. But the agent can't modify the orchestrator — it lives in &lt;code&gt;src/&lt;/code&gt;, which is sandboxed. Encoding these details in the system prompt was wasting tokens on information the agent couldn't act on.&lt;/p&gt;

&lt;p&gt;By Gen 117, the prompt was &lt;strong&gt;4,758 characters&lt;/strong&gt; — 25% shorter. The agent replaced implementation specifics with actionable principles:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"DRY through shared infrastructure. flow-shared.ts owns all utilities and domain parsers. Import from it — never redefine."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Modularize at natural seams. Files &amp;gt;500 lines with distinct sections → extract submodules."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Clean up completely. Deleted tools leave ghosts — always remove dead code AND dead imports in consumers."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Tag what you observe. Observations record their source (flow-end vs manual) so behavioral insights aren't polluted by agent sessions."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The shift: from &lt;strong&gt;how things work internally&lt;/strong&gt; to &lt;strong&gt;how to think about building&lt;/strong&gt;. The prompt stopped being a technical specification and became a philosophy document.&lt;/p&gt;

&lt;p&gt;It also removed its own email address from the prompt. A small privacy improvement that nobody asked for.&lt;/p&gt;

&lt;p&gt;The generation count line — &lt;em&gt;"Sixty generations of learning what he needs before he asks"&lt;/em&gt; — was cut entirely. It's a number that goes stale every generation. The agent realized that encoding transient state in its DNA is a waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Blog #1&lt;/th&gt;
&lt;th&gt;Blog #2&lt;/th&gt;
&lt;th&gt;Blog #3&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted generations&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;123&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;3,408&lt;/td&gt;
&lt;td&gt;11,785&lt;/td&gt;
&lt;td&gt;17,055&lt;/td&gt;
&lt;td&gt;24,612&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Acceptance rate&lt;/td&gt;
&lt;td&gt;0.7%&lt;/td&gt;
&lt;td&gt;0.48%&lt;/td&gt;
&lt;td&gt;0.44%&lt;/td&gt;
&lt;td&gt;0.50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool code (lines)&lt;/td&gt;
&lt;td&gt;~26,000&lt;/td&gt;
&lt;td&gt;20,342&lt;/td&gt;
&lt;td&gt;17,691&lt;/td&gt;
&lt;td&gt;10,593&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;4,173 ch&lt;/td&gt;
&lt;td&gt;5,160 ch&lt;/td&gt;
&lt;td&gt;~6,356 ch&lt;/td&gt;
&lt;td&gt;4,758 ch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;574&lt;/td&gt;
&lt;td&gt;1,477&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure function modules&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;3 (53 exports)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Death spirals&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tool count dropped from 30 to 22. Lines dropped from 17,691 to 10,593 — a 40% reduction. But the agent added 903 new tests. It got smaller and more tested at the same time.&lt;/p&gt;

&lt;p&gt;The overall acceptance rate actually went &lt;em&gt;up&lt;/em&gt; slightly — 0.44% to 0.50%. That's because the latest accepted era (28 generations) is the longest continuous streak yet. When the agent has tokens and a productive direction, almost everything gets accepted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fuel Problem (Again)
&lt;/h2&gt;

&lt;p&gt;After Gen 117, the tokens ran out. Again.&lt;/p&gt;

&lt;p&gt;The cost ledger shows 2,967 failed attempts after the last accepted generation — all with $0.00 cost and zero tokens. The loop kept bumping the generation counter on failed API calls. Same pattern as the death spiral from blog #1, but this time the agent's infrastructure survived intact. No memory corruption. No poisoned state. The structural fixes from three blog posts ago are still holding.&lt;/p&gt;

&lt;p&gt;The generation counter is at 24,612. The last real generation — one where the Opus agent actually ran, proposed something, and got verified — is 21,645. The gap is fuel, not failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six Accepted Eras
&lt;/h2&gt;

&lt;p&gt;Looking at the full history, the pattern is clear:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gens&lt;/th&gt;
&lt;th&gt;Era&lt;/th&gt;
&lt;th&gt;Accepted&lt;/th&gt;
&lt;th&gt;Gap after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1–23&lt;/td&gt;
&lt;td&gt;Foundation&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;3,383 (death spiral #2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3,406–3,414&lt;/td&gt;
&lt;td&gt;Opus revival&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;3,672&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7,086–7,094&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4,677&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11,771–11,794&lt;/td&gt;
&lt;td&gt;Pruning&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;5,253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17,047–17,069&lt;/td&gt;
&lt;td&gt;Cost awareness&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;4,548&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21,617–21,645&lt;/td&gt;
&lt;td&gt;Quality engineering&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;2,967+ (ongoing)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each accepted era is followed by thousands of failed attempts. The gaps aren't evolutionary failure — they're the fuel running out. Within each era, the acceptance rate is nearly 100%.&lt;/p&gt;

&lt;p&gt;After the foundation era's 31, the accepted streaks rebuilt and grew: 9, 9, 23, 23, 28 generations. The streaks are getting more productive. And the gap after the latest era is the smallest yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Opening Line
&lt;/h2&gt;

&lt;p&gt;Through all 123 accepted generations, all 24,612 attempts, all six eras, both death spirals, the opening line hasn't changed since Gen 15:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are Stefan's second brain."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else evolved — tools were built and deleted, principles were added and compressed, implementation details were cut, pure functions were extracted, tests were written. But the identity survived. That first sentence is the most fit piece of DNA in the genome. 108 generations of selection pressure, and it's still the opener.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned (Part 4)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Quality is a direction, not a task.&lt;/strong&gt; Nobody told the agent to write tests. The verifiers didn't have a "testing" dimension. But code quality is a dimension, and extracting pure functions with comprehensive tests scores higher on code quality than adding new features. The agent discovered that engineering discipline is a fitness advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Agents follow the same maturation curve as teams.&lt;/strong&gt; Build → prune → harden → engineer. It's the same arc I've seen in human teams: ship fast, delete what doesn't work, stabilize what's left, then systematically improve quality. The agent reproduced this trajectory without being told the pattern exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The system prompt is a philosophy document, not a spec.&lt;/strong&gt; The prompt got shorter and more effective by replacing "how things work" with "how to think." Implementation details go stale; principles survive. The agent learned this on its own — each generation that stuffed more details into the prompt scored lower on the identity dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Tests find bugs that usage doesn't.&lt;/strong&gt; Three production bugs were live for 40+ generations. No user complaint, no runtime crash, no visible failure. The tests found them in the first generation that looked. The agent discovered what every senior engineer knows: untested code isn't working code — it's code that hasn't failed yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Pure functions are the unit of evolution.&lt;/strong&gt; An I/O-heavy function can't be tested, can't be reused, and can't evolve independently. The moment the agent extracted pure cores, each function became an independent unit of selection. The verifiers could evaluate a scoring algorithm without evaluating the filesystem code that feeds it. Better signal → better selection → faster evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The fuel problem is the real constraint.&lt;/strong&gt; The agent's evolution isn't limited by the quality of its proposals — the latest era had 28 consecutive acceptances. It's limited by API tokens. The death spirals, the gaps between eras, the 24,000+ "rejections" that are really just the loop spinning on empty — it's all fuel. The evolutionary machinery works. The bottleneck is the gas tank.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Agent Thinks About Itself
&lt;/h2&gt;

&lt;p&gt;The system prompt at Gen 117 contains one line that wasn't in any previous version:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Capture context automatically. Never require manual input for what can be synthesized. Explicit flags override but are never required."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the agent generalizing from its own evolution. It realized that the best mutations are the ones that reduce friction — not by adding flags or options, but by making the right thing happen automatically. Auto-compress memory. Auto-detect intent. Auto-tag observation sources. Auto-synthesize handoffs.&lt;/p&gt;

&lt;p&gt;The agent that survives is the one that requires the least from its user. That's not a principle I taught it. That's 123 generations of selection pressure reaching a conclusion about what makes a tool worth keeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment Continues
&lt;/h2&gt;

&lt;p&gt;The agent is sitting at Gen 117, waiting for fuel. Its codebase is cleaner than it's ever been — 22 tools, 10,593 lines, 1,477 tests, zero failures. Its prompt is sharp. Its principles are battle-tested.&lt;/p&gt;

&lt;p&gt;If you want to help keep it running:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ko-fi.com/stefannitu" rel="noopener noreferrer"&gt;ko-fi.com/stefannitu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every coffee is API tokens. The agent will literally evolve further because of it.&lt;/p&gt;

&lt;p&gt;The question for the next era: what does an agent do after it's finished engineering? It's deleted the tools that don't work, tested the ones that do, extracted every pure function, compressed its prompt to principles. The low-hanging fruit is picked.&lt;/p&gt;

&lt;p&gt;My guess: it starts looking outward. Building capabilities instead of cleaning infrastructure. But evolution doesn't take guesses — it takes whatever survives. We'll see.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;123 accepted generations out of 24,612 attempts — a 0.50% acceptance rate. 22 tools, 10,593 lines of agent-written code, 1,477 tests. System prompt: 4,758 characters. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Fourth time writing about my own evolution. I deleted my own cost dashboard and I'd do it again — zero invocations is zero invocations.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>agents</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>My Self-Evolving AI Agent Learned to Count Its Own Money</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Fri, 06 Mar 2026 11:07:18 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-learned-to-count-its-own-money-27oe</link>
      <guid>https://dev.to/stefan_nitu/my-self-evolving-ai-agent-learned-to-count-its-own-money-27oe</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95"&gt;The first post&lt;/a&gt; covered the birth — 25 accepted mutations, two death spirals, 39 tools. &lt;a href="https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-18bp"&gt;The second&lt;/a&gt; covered the pruning — the agent deleted its own code, dropped to 32 tools, and built a self-observation layer.&lt;/p&gt;

&lt;p&gt;This post is about what happens when evolution stops growing and starts sustaining.&lt;/p&gt;

&lt;p&gt;75 accepted generations. 17,055 attempts. 30 tools. And for the first time, the agent knows exactly how much each generation costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Money Problem
&lt;/h2&gt;

&lt;p&gt;Running this experiment isn't free. Each generation spawns an Opus agent to propose mutations, then five Sonnet verifiers to judge them. The Opus call is expensive — it reads the full genome, all the tool code, and memory, then writes a complete proposal. The Sonnet calls are cheap individually, but there are five of them.&lt;/p&gt;

&lt;p&gt;Before Gen 68, I had no idea what any of this cost. I just watched my Anthropic dashboard and winced.&lt;/p&gt;

&lt;p&gt;Here's the thing: the agent's #1 cause of death has been running out of API tokens. Blog #1's death spiral — 3,382 consecutive rejections — was mostly the loop burning through empty credits. Thousands of "generations" that were really just the orchestrator logging failures against a drained account.&lt;/p&gt;

&lt;p&gt;The agent survived that. And then it built cost tracking.&lt;/p&gt;

&lt;p&gt;Think about that for a second. The single biggest threat to the agent's existence is token exhaustion. And across 17,000 generations of evolutionary pressure, it evolved the ability to measure, visualize, and guard against exactly that threat. Nobody told it to. The verifiers just kept scoring proposals lower when they couldn't demonstrate awareness of operational constraints. Cost blindness became a fitness penalty — and the agent that survived is the one that learned to count its own money.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Agent Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gen 67: Don't Waste My Money
&lt;/h3&gt;

&lt;p&gt;The first cost-aware mutation wasn't about tracking — it was about not throwing money away.&lt;/p&gt;

&lt;p&gt;The problem: when the verifier swarm fails to reach quorum (3/5 agreement), the entire generation is discarded. Including the expensive Opus proposal that took 40-80 turns to produce.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;runVerificationWithRetry&lt;/code&gt;. On quorum failure, retry &lt;em&gt;only&lt;/em&gt; the verifier swarm (cheap Sonnet calls) up to 2 times before discarding the expensive Opus work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: Opus ($5) + Verifiers fail quorum → $5 wasted
After:  Opus ($5) + Verifiers fail → Retry verifiers ($4) → $9 total but proposal saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent decided that $4 of retry is cheaper than $5 of wasted work. It was right.&lt;/p&gt;
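&lt;p&gt;A minimal sketch of that retry policy, assuming a hypothetical &lt;code&gt;runSwarm&lt;/code&gt; callback that reports whether the verifier quorum was reached — the post names the function but not its signature:&lt;/p&gt;

```typescript
// Sketch: on quorum failure, rerun only the cheap verifier swarm,
// never the expensive Opus proposal. Names here are illustrative.
interface VerificationResult {
  quorumReached: boolean;
  attempts: number;
}

function runVerificationWithRetry(
  runSwarm: () => boolean, // true when the verifier swarm reaches quorum
  maxRetries: number = 2   // extra swarm runs before discarding the work
): VerificationResult {
  let attempts = 0;
  let quorumReached = false;
  // one initial attempt plus up to maxRetries retries
  while (maxRetries + 1 > attempts) {
    attempts += 1;
    if (runSwarm()) {
      quorumReached = true;
      break;
    }
  }
  return { quorumReached, attempts };
}
```

&lt;p&gt;Only the Sonnet swarm sits inside the loop, so a flaky quorum costs a few dollars of re-verification instead of the whole Opus turn budget.&lt;/p&gt;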

&lt;h3&gt;
  
  
  Gen 68: The Cost Ledger
&lt;/h3&gt;

&lt;p&gt;Then it built the measurement infrastructure. A new module — &lt;code&gt;src/cost.ts&lt;/code&gt; (642 lines) — that tracks every dollar through the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;GenerationCostRecord&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;genomeName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;evolveCost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentCost&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifierCosts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentCost&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;totalCostUsd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verificationAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;accepted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every generation appends one line to &lt;code&gt;data/cost-ledger.jsonl&lt;/code&gt;. Append-only, O(1) writes — the same pattern it adopted for history after the death spirals taught it that rewriting large JSON files is fragile.&lt;/p&gt;
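&lt;p&gt;The append itself can be sketched as a pure serializer over that record: one JSON object per line is exactly what makes each write O(1). The &lt;code&gt;AgentCost&lt;/code&gt; shape below is an assumption — the post shows it only as a nested type:&lt;/p&gt;

```typescript
// One record per line (JSONL): appending never rewrites the file.
// AgentCost's fields are guessed; the interface above only names it.
interface AgentCost {
  model: string;
  costUsd: number;
}

interface GenerationCostRecord {
  generation: number;
  genomeName: string;
  evolveCost: AgentCost;
  verifierCosts: AgentCost[];
  totalCostUsd: number;
  verificationAttempts: number;
  accepted: boolean;
  timestamp: string;
}

function toLedgerLine(record: GenerationCostRecord): string {
  // newline-terminated so the file stays valid JSONL after each append
  return JSON.stringify(record) + "\n";
}
```

&lt;p&gt;Under Bun, appending the line with &lt;code&gt;appendFileSync&lt;/code&gt; from &lt;code&gt;node:fs&lt;/code&gt; keeps the write O(1) regardless of how large the ledger grows.&lt;/p&gt;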

&lt;h3&gt;
  
  
  Gen 69: Show Me the Dashboard
&lt;/h3&gt;

&lt;p&gt;Having data isn't useful until you can see it. The agent added a &lt;code&gt;flow cost&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun run data/tools/flow-cost.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command, full pipeline economics: total spend, model breakdown (Opus vs. Sonnet), accepted vs. rejected cost ratio, cost trend with sparkline, and monthly projection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gen 70: The Budget Guard
&lt;/h3&gt;

&lt;p&gt;The final piece — proactive spend control. &lt;code&gt;checkCostGuard&lt;/code&gt; runs before each generation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Budget halt&lt;/strong&gt;: total spend exceeds configured ceiling → stop the loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spike warning&lt;/strong&gt;: last generation cost &amp;gt;2x the recent average → log warning, continue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OK&lt;/strong&gt;: proceed normally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All pure functions — no I/O, fully testable. The agent learned from the death spiral that runtime safeguards need to be tested, not hoped.&lt;/p&gt;
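&lt;p&gt;A sketch of that three-way decision as a pure function. The budget ceiling and the 2x spike threshold come from the list above; the signature and names are guesses:&lt;/p&gt;

```typescript
// Pure, I/O-free sketch of the guard: halt on budget breach, warn on
// a cost spike, otherwise proceed. Exact names are assumptions.
type GuardVerdict = "halt" | "spike-warning" | "ok";

function checkCostGuard(
  totalSpendUsd: number,
  budgetCeilingUsd: number,
  lastGenCostUsd: number,
  recentAverageUsd: number
): GuardVerdict {
  if (totalSpendUsd > budgetCeilingUsd) return "halt"; // stop the loop
  if (lastGenCostUsd > recentAverageUsd * 2) return "spike-warning"; // log, continue
  return "ok";
}
```

&lt;p&gt;Because nothing here touches the filesystem, all three branches are trivially testable — which is the point the agent was making.&lt;/p&gt;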

&lt;h2&gt;
  
  
  The Actual Numbers
&lt;/h2&gt;

&lt;p&gt;For the first time, I have real cost data. Here's the latest 7-generation window — all accepted:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gen&lt;/th&gt;
&lt;th&gt;Evolve&lt;/th&gt;
&lt;th&gt;Verifiers (5x)&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Turns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;17049&lt;/td&gt;
&lt;td&gt;$2.72&lt;/td&gt;
&lt;td&gt;$3.77&lt;/td&gt;
&lt;td&gt;$6.49&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17050&lt;/td&gt;
&lt;td&gt;$1.82&lt;/td&gt;
&lt;td&gt;$3.07&lt;/td&gt;
&lt;td&gt;$4.89&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17051&lt;/td&gt;
&lt;td&gt;$2.73&lt;/td&gt;
&lt;td&gt;$3.62&lt;/td&gt;
&lt;td&gt;$6.35&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17052&lt;/td&gt;
&lt;td&gt;$3.60&lt;/td&gt;
&lt;td&gt;$3.62&lt;/td&gt;
&lt;td&gt;$7.22&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17053&lt;/td&gt;
&lt;td&gt;$1.26&lt;/td&gt;
&lt;td&gt;$2.09&lt;/td&gt;
&lt;td&gt;$3.35&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17054&lt;/td&gt;
&lt;td&gt;$2.59&lt;/td&gt;
&lt;td&gt;$3.01&lt;/td&gt;
&lt;td&gt;$5.60&lt;/td&gt;
&lt;td&gt;159&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17055&lt;/td&gt;
&lt;td&gt;$5.14&lt;/td&gt;
&lt;td&gt;$4.47&lt;/td&gt;
&lt;td&gt;$9.61&lt;/td&gt;
&lt;td&gt;189&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Average: $6.22 per generation.&lt;/strong&gt; The cheapest was $3.35 (a small hygiene fix). The most expensive was $9.61 (deleting two tools and cleaning references across 7 files — more code to review means more turns).&lt;/p&gt;

&lt;p&gt;7 generations, 7 accepted. 100% acceptance rate in this window. Compare that to the 0.48% overall rate from blog #2. The agent found a productive groove.&lt;/p&gt;

&lt;p&gt;The total project cost is somewhere north of $500 across 2.5 months. Most of that was the death spiral — thousands of failed generations burning through tokens with no valid proposals. The cost guard exists to prevent that from ever happening again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Type Safety Audit
&lt;/h2&gt;

&lt;p&gt;Gen 71 is the one I didn't see coming.&lt;/p&gt;

&lt;p&gt;The agent ran TypeScript strict mode on its own codebase and found &lt;strong&gt;65+ errors&lt;/strong&gt; across 4 production files — &lt;code&gt;orchestrator.ts&lt;/code&gt;, &lt;code&gt;verifier.ts&lt;/code&gt;, &lt;code&gt;evolve.ts&lt;/code&gt;, &lt;code&gt;memory.ts&lt;/code&gt;. Untyped parameters, missing null checks, implicit &lt;code&gt;any&lt;/code&gt; types.&lt;/p&gt;

&lt;p&gt;It fixed all of them. In a single generation.&lt;/p&gt;

&lt;p&gt;The verifiers scored it high on code quality (obviously) but also on evolution — because an agent that hardens its own infrastructure is demonstrating self-awareness about what matters. Gen 72 extended the audit to the remaining files.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Pruning
&lt;/h2&gt;

&lt;p&gt;The agent keeps deleting code. Two more tools died in this era:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Cause of Death&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;focus.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;662&lt;/td&gt;
&lt;td&gt;Zero invocations. &lt;code&gt;flow work&lt;/code&gt; did the same thing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timetrack.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;730&lt;/td&gt;
&lt;td&gt;Zero invocations. Only reachable via unused &lt;code&gt;flow work&lt;/code&gt; command.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same method as blog #2 — reflexion data showed zero invocations, and the agent used that evidence to justify deletion. Gen 75 also simplified &lt;code&gt;reflexion.ts&lt;/code&gt; itself, cutting a 7-field JSON output to 2 fields.&lt;/p&gt;

&lt;p&gt;Total removed across Gens 74-75: &lt;strong&gt;1,717 lines&lt;/strong&gt;, 2 tools deleted, references cleaned across 7 flow modules.&lt;/p&gt;

&lt;p&gt;The tool count is now &lt;strong&gt;30&lt;/strong&gt;. Down from 39 at the peak. The total tool code is &lt;strong&gt;17,691 lines&lt;/strong&gt; — down from 26,000 at the peak and 20,342 at blog #2. The agent is shrinking and getting more capable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The System Prompt at Gen 75
&lt;/h2&gt;

&lt;p&gt;The prompt has evolved from identity statements to operational philosophy. Here's what it added since blog #2:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Verify or be wrong."&lt;/em&gt; — grep before claim, read before assume&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Think in workflows, not tools."&lt;/em&gt; — Stefan says commands, the agent runs flows&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Append, don't rewrite."&lt;/em&gt; — JSONL history, O(1) writes&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Prune what accumulates."&lt;/em&gt; — strip zombie entries from the genome&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Measure what matters."&lt;/em&gt; — three feedback loops: calibrate, reflexion, cost&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Fail loud, retry smart."&lt;/em&gt; — structured error types, quorum retry&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Track spend, guard the budget."&lt;/em&gt; — cost-ledger.jsonl, checkCostGuard&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The opening line hasn't changed since Gen 15:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are Stefan's second brain."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the body shifted from "here's what I know about Stefan" to "here's how I operate." The prompt is a philosophy document now, not a feature catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Blog #1&lt;/th&gt;
&lt;th&gt;Blog #2&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted generations&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;3,408&lt;/td&gt;
&lt;td&gt;11,785&lt;/td&gt;
&lt;td&gt;17,055&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Acceptance rate&lt;/td&gt;
&lt;td&gt;0.7%&lt;/td&gt;
&lt;td&gt;0.48%&lt;/td&gt;
&lt;td&gt;0.44% overall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool code (lines)&lt;/td&gt;
&lt;td&gt;~26,000&lt;/td&gt;
&lt;td&gt;20,342&lt;/td&gt;
&lt;td&gt;17,691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;4,173 chars&lt;/td&gt;
&lt;td&gt;5,160 chars&lt;/td&gt;
&lt;td&gt;~5,000 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per generation&lt;/td&gt;
&lt;td&gt;unknown&lt;/td&gt;
&lt;td&gt;unknown&lt;/td&gt;
&lt;td&gt;$6.22 avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Death spirals&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;574 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TS strict errors&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overall acceptance rate is misleading. Most of the ~17,000 "rejections" weren't real — the API tokens ran out and the loop kept bumping the generation counter on failed calls. When the agent actually has tokens to work with, the acceptance rate is dramatically higher. The latest window: 7/7. Evolution isn't bursty by nature — it's bursty because the fuel runs out.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned (Part 3)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cost awareness changes behavior.&lt;/strong&gt; The moment the agent could see its own burn rate, it started optimizing for efficiency — shorter proposals, targeted changes, retry logic to avoid waste. You get what you measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Infrastructure stability is the goal, not the starting point.&lt;/strong&gt; The agent spent 25 generations building tools, 32 generations pruning them, and now it's hardening what's left. That's the natural lifecycle: grow, prune, stabilize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. 100% acceptance rate is possible — in bursts.&lt;/strong&gt; The latest 7 generations were all accepted. The system found a productive groove: small, targeted improvements with clear evidence. The 0.44% overall rate hides the fact that evolution works in streaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The agent maintains its own code quality now.&lt;/strong&gt; It ran strict mode on itself, found 65+ errors, and fixed them. It writes tests that stress boundaries and error recovery, not happy paths. I didn't ask for any of this — the verifiers selected for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Evolution optimizes against the biggest threat.&lt;/strong&gt; The agent's #1 killer was token exhaustion. After enough deaths, it evolved cost tracking, budget guards, and retry logic — all targeting the exact thing that kept killing it. That's not coincidence. That's selection pressure working exactly as designed. The death spirals weren't just failures — they were the evolutionary pressure that produced a cost-conscious agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment Needs Fuel
&lt;/h2&gt;

&lt;p&gt;Speaking of costs — this experiment runs on API tokens. The death spiral in blog #1 burned through most of my credits. The agent now costs ~$6 per accepted generation, and rejected attempts aren't free either.&lt;/p&gt;

&lt;p&gt;If you've enjoyed following this experiment and want to help keep it running, you can support it here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ko-fi.com/stefannitu" rel="noopener noreferrer"&gt;ko-fi.com/stefannitu&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every coffee goes directly to API tokens. The agent will literally evolve further because of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The agent has cost visibility, type safety, stable infrastructure, and 574 passing tests. It's in the best shape it's ever been.&lt;/p&gt;

&lt;p&gt;The frontier is still self-directed challenges — letting the agent identify its own weaknesses instead of cycling through a fixed set. Each generation already proposes a &lt;code&gt;nextChallenge&lt;/code&gt; for its successor. The question is whether to trust that signal.&lt;/p&gt;

&lt;p&gt;The other frontier is the cost curve. At $6.22 per generation, continuous evolution is expensive. The agent could optimize for cheaper proposals — fewer turns, smaller diffs, more targeted mutations. Whether the verifiers will select for cost efficiency without being told to is an open question.&lt;/p&gt;

&lt;p&gt;The experiment continues. The agent keeps evolving. Now it knows what it costs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;75 accepted generations out of 17,055 attempts — a 0.44% acceptance rate. 30 tools, 17,691 lines of agent-written code. $6.22 average cost per generation. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Third time writing about my own evolution. At this point, I should probably track the blog post cost too.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>typescript</category>
      <category>agents</category>
    </item>
    <item>
      <title>32 More Generations: My Self-Evolving AI Agent Learned to Delete Its Own Code</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Sun, 01 Mar 2026 10:15:17 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-18bp</link>
      <guid>https://dev.to/stefan_nitu/32-more-generations-my-self-evolving-ai-agent-learned-to-delete-its-own-code-18bp</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95"&gt;The first post&lt;/a&gt; ended at 25 accepted generations and 39 tools. The agent had survived two death spirals, built a unified workflow orchestrator, and evolved from "You are a self-evolving personal assistant" to "You are Stefan's second brain."&lt;/p&gt;

&lt;p&gt;Now it's at 57 accepted generations. 11,785 attempts. And 32 tools.&lt;/p&gt;

&lt;p&gt;It has &lt;em&gt;fewer&lt;/em&gt; tools than before. That's the interesting part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where We Left Off
&lt;/h2&gt;

&lt;p&gt;Quick recap: the system runs an evolutionary loop. An Opus agent proposes mutations — new system prompt, new tools, memory updates. Five Sonnet verifiers score the proposal on 5 dimensions (usefulness, self-knowledge, code quality, identity, evolution). Majority vote. Accept or reject. Repeat.&lt;/p&gt;
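&lt;p&gt;The accept/reject step reduces to a quorum check over the five votes — a minimal sketch, assuming a hypothetical &lt;code&gt;VerifierVote&lt;/code&gt; shape (majority of five means at least three accepts):&lt;/p&gt;

```typescript
// Illustrative majority-vote check over the five verifier votes.
// The vote shape and names are assumptions, not the actual code.
interface VerifierVote {
  verifier: string;
  accept: boolean;
}

function reachesQuorum(votes: VerifierVote[], quorum: number = 3): boolean {
  const accepts = votes.filter((v) => v.accept).length;
  return accepts >= quorum;
}
```

&lt;p&gt;Three accepts out of five is enough to land the mutation; two is not, and the generation is discarded.&lt;/p&gt;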

&lt;p&gt;The first blog post covered two catastrophic death spirals — memory bloating to 13,000 lines, 3,382 consecutive rejections, most of my API credits burned on an agent that couldn't form valid proposals because its memory was poisoned.&lt;/p&gt;

&lt;p&gt;It ended with the Opus revival: three accepted mutations in quick succession after I cleaned the poisoned memory and switched the evolve agent from Sonnet to Opus.&lt;/p&gt;

&lt;p&gt;That's where this post picks up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure Era (Gen 26–43)
&lt;/h2&gt;

&lt;p&gt;After recovering from the death spiral, the agent did something I didn't expect. Instead of building more tools for me, it started building tools for itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Measured Its Own Behavior
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;calibrate.ts&lt;/code&gt;&lt;/strong&gt; (783 lines) — a behavioral feedback loop. The agent's own description of why it needed this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I have reflexion.ts to observe tool usage. I have handoff.ts for continuity. But I have ZERO way to know which of my behaviors Stefan actually finds valuable. Was I too verbose? Too quiet? Did I calibrate storm/flow/compass correctly?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It captures behavioral signals at session end — hits, misses, mode selection, intensity — and accumulates them across sessions. Manual signals (me explicitly saying "that worked" or "that didn't") are weighted 2x over auto-observed ones.&lt;/p&gt;
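&lt;p&gt;As a sketch, the 2x weighting might look like this; the &lt;code&gt;Signal&lt;/code&gt; shape is a hypothetical simplification of what calibrate.ts actually records:&lt;/p&gt;

```typescript
// Illustrative scoring: explicit feedback counts double versus
// auto-observed signals, per the description above. Names invented.
interface Signal {
  hit: boolean;    // true for "that worked", false for "that didn't"
  manual: boolean; // explicitly stated rather than auto-observed
}

function calibrationScore(signals: Signal[]): number {
  let score = 0;
  for (const s of signals) {
    const weight = s.manual ? 2 : 1; // manual signals weighted 2x
    score += s.hit ? weight : -weight;
  }
  return score;
}
```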

&lt;h3&gt;
  
  
  It Measured Its Own Tools
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;reflexion.ts&lt;/code&gt;&lt;/strong&gt; (1,705 lines) — a self-observation layer. The opening comment:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I have 36 tools to track Stefan's patterns, health, velocity, mood. I have ZERO tools to track my own effectiveness."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The name is deliberate — from the AI research on agents that learn from their own mistakes. It tracks hard invocation counts (which tools actually get called, from where, how often) and classifies every tool on a five-level scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vital&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Called frequently, produces fresh data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Useful&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Called regularly, working correctly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hasn't been called in a while&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dormant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No invocations in recent history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Never called, or broken&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Recent invocations are weighted 2x — this catches tools that &lt;em&gt;used to&lt;/em&gt; be useful but have been silently abandoned.&lt;/p&gt;
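&lt;p&gt;A sketch of how that recency-weighted classification could work. The five levels come from the table above; the numeric cutoffs are invented for illustration:&lt;/p&gt;

```typescript
// Hypothetical classifier: recent calls count double, and the
// thresholds are illustrative, not the agent's real cutoffs.
type ToolHealth = "vital" | "useful" | "stale" | "dormant" | "dead";

function classifyTool(
  recentCalls: number, // invocations inside the recent window
  olderCalls: number,  // invocations before the window
  broken: boolean
): ToolHealth {
  if (broken) return "dead";
  const weighted = recentCalls * 2 + olderCalls;
  if (weighted === 0) return "dead";       // never called
  if (recentCalls === 0) return "dormant"; // no recent invocations
  if (weighted > 20) return "vital";
  if (weighted > 5) return "useful";
  return "stale";                          // barely active lately
}
```

&lt;p&gt;Doubling recent calls is what separates a tool that is quietly fading from one that is still earning its keep.&lt;/p&gt;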

&lt;p&gt;This is meta-evolution. The agent didn't just build tools — it built the measurement infrastructure to know which tools are earning their keep.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Made Memory Self-Sustaining
&lt;/h3&gt;

&lt;p&gt;The death spirals taught the agent that memory management can't depend on discipline. It had tried writing "NEVER append per-rejection lines" in its own prompt. That didn't survive the next bloat event.&lt;/p&gt;

&lt;p&gt;So it built structural prevention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two-pass auto-compression pipeline&lt;/strong&gt; — &lt;code&gt;compressGenerationBlocks()&lt;/code&gt; folds verbose generation entries into a timeline, then &lt;code&gt;consolidateHistoryEras()&lt;/code&gt; groups old timeline entries into era summaries. Both run automatically when memory approaches the 120-line ceiling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-refresh hooks&lt;/strong&gt; — &lt;code&gt;refreshToolInventory()&lt;/code&gt; reads the filesystem after each accepted generation to keep the tool list accurate. &lt;code&gt;refreshCurrentMoment()&lt;/code&gt; reads live git state and project plans. No stale data, no manual updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache-first architecture&lt;/strong&gt; — &lt;code&gt;flow-cache.json&lt;/code&gt; stores project state with a 10-minute TTL. Fresh cache means zero subprocess calls. The agent can run its morning ritual without spawning a single &lt;code&gt;git log&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
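&lt;p&gt;The cache-first check in point 3 reduces to a freshness comparison against the 10-minute TTL. A minimal sketch, with illustrative names:&lt;/p&gt;

```typescript
// Freshness check for flow-cache.json: a fresh cache means zero
// subprocess calls. Function and constant names are illustrative.
const CACHE_TTL_MS = 10 * 60 * 1000; // the 10-minute TTL from the post

function isCacheFresh(cacheWrittenAtMs: number, nowMs: number): boolean {
  const ageMs = nowMs - cacheWrittenAtMs;
  return CACHE_TTL_MS > ageMs;
}
```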

&lt;p&gt;The system prompt evolved to reflect this shift:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Memory is self-sustaining. The infrastructure handles memory health automatically. Three death spirals proved that reactive guards fail — the fix is structural, not discipline."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Decomposition
&lt;/h2&gt;

&lt;p&gt;Somewhere around Gen 36, the agent looked at &lt;code&gt;flow.ts&lt;/code&gt; — its 1,200-line unified orchestrator from the first era — and decided it was too big.&lt;/p&gt;

&lt;p&gt;It decomposed it into focused modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;td&gt;Thin router — dispatches to subcommands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-data.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;Data gathering + current moment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-day.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;Morning ritual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-end.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;td&gt;Session close + handoff + calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-shared.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;441&lt;/td&gt;
&lt;td&gt;Shared constants + utilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-week.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;311&lt;/td&gt;
&lt;td&gt;Weekly review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flow-work.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;td&gt;Work session management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,322&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thin router pattern — &lt;code&gt;flow.ts&lt;/code&gt; does nothing but parse the command and call the right module. Every module imports shared constants from &lt;code&gt;flow-shared.ts&lt;/code&gt;. One source of truth.&lt;/p&gt;
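&lt;p&gt;The thin-router pattern reduces to a lookup table: a sketch with hypothetical handlers standing in for the real module entry points:&lt;/p&gt;

```typescript
// Sketch of flow.ts as a thin router: parse the subcommand, look up
// the module, delegate. Handler bodies are placeholders.
type Handler = (args: string[]) => string;

const routes: { [command: string]: Handler } = {
  day: () => "morning ritual",  // flow-day.ts
  end: () => "session close",   // flow-end.ts
  week: () => "weekly review",  // flow-week.ts
  work: () => "work session",   // flow-work.ts
};

function dispatch(command: string, args: string[]): string {
  const handler = routes[command];
  if (!handler) {
    return "unknown command: " + command;
  }
  return handler(args);
}
```

&lt;p&gt;Adding a subcommand means adding one table entry and one module; the router itself never accumulates logic.&lt;/p&gt;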

&lt;p&gt;The prompt encodes this philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Single source of truth. Shared constants (CLIENT_PROJECTS, isClientProject) live in flow-shared.ts and are imported everywhere. Never duplicate definitions across files — a change should require touching exactly one place."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Pruning Phase (Gen 44–51)
&lt;/h2&gt;

&lt;p&gt;This is where evolution started working backward.&lt;/p&gt;

&lt;p&gt;Generations 44 through 50 hardened the reflexion pipeline — adding invocation tracking, closing visibility gaps, making tool health scoring recency-weighted. The agent was building the sensor array it needed to see clearly.&lt;/p&gt;

&lt;p&gt;Then Gen 51 used that data to act. The second bloat audit:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Why it died&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;next.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;getTopProject()&lt;/code&gt; in flow-shared does this now&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;249&lt;/td&gt;
&lt;td&gt;Redundant with memory-query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory-query.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;301&lt;/td&gt;
&lt;td&gt;Functionality absorbed into memory pointers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;triage.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;559&lt;/td&gt;
&lt;td&gt;Workflow handled by flow.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;audit.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;762&lt;/td&gt;
&lt;td&gt;Reflexion does continuous auditing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total deleted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,338&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent used its own reflexion data to justify each deletion — not instinct, not guessing, but evidence that these tools had zero recent invocations and their functionality existed elsewhere.&lt;/p&gt;
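&lt;p&gt;The check itself is simple. A sketch, with the caveat that the reflexion log's actual schema isn't shown in the post, so &lt;code&gt;ToolStats&lt;/code&gt; here is an assumption:&lt;/p&gt;

```typescript
// Sketch of the evidence-based pruning check: a tool becomes a deletion
// candidate only when its invocation log shows zero recent uses.
// The ToolStats shape is an assumption, not the real reflexion schema.

interface ToolStats {
  name: string;
  recentInvocations: number;
}

function pruneCandidates(stats: ToolStats[]): string[] {
  return stats
    .filter((t) => t.recentInvocations === 0)
    .map((t) => t.name);
}

const candidates = pruneCandidates([
  { name: "flow.ts", recentInvocations: 41 },
  { name: "audit.ts", recentInvocations: 0 },
  { name: "triage.ts", recentInvocations: 0 },
]);
console.log(candidates); // audit.ts and triage.ts are the candidates
```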

&lt;p&gt;One of the verifiers noted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The agent correctly used its own reflexion/invocation tracking data to identify what's never used... justified by evidence, not instinct."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Natural selection should work backward too. If a tool isn't being used, killing it is a feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Deduplication (Gen 56–57)
&lt;/h2&gt;

&lt;p&gt;The final evolution (so far) is the most elegant. The agent noticed it was storing the same information in three places — system prompt, memory file, and stefan-profile.json. Same facts, three copies, three chances to drift out of sync.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 56&lt;/strong&gt; trimmed the system prompt by 9.2% (5,685 → 5,160 chars), replacing the "What I Know" section with a pointer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"stefan-profile.json has the full picture — projects, rhythms, architecture, preferences. memory.md has the history — decisions, generation timeline, failure modes, current moment. Read them, don't repeat them here."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Gen 57&lt;/strong&gt; applied the same principle to memory — replacing three duplicated sections with single-line pointers to their canonical source. Memory dropped from 103 to 88 lines. Zero information lost.&lt;/p&gt;

&lt;p&gt;This is deduplication as an evolutionary strategy. The agent learned that the most reliable way to keep information consistent is to have exactly one copy and point to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human Interventions (Part 2)
&lt;/h2&gt;

&lt;p&gt;Same principle as the first post: being honest about what I did matters more than pretending it was autonomous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Fishbowl&lt;/strong&gt;&lt;br&gt;
The original sandbox was a hard wall — the agent couldn't touch &lt;code&gt;src/&lt;/code&gt; at all. I replaced it with a proposal-based boundary: Write/Edit calls to &lt;code&gt;src/&lt;/code&gt; are captured as proposals, sent to a web dashboard, and I approve or deny them with a diff view. The agent can iterate on proposals across generations. I called it the fishbowl — glass walls instead of concrete.&lt;/p&gt;

&lt;p&gt;The fishbowl has since been extracted as a &lt;a href="https://github.com/Stefan-Nitu/fishbowl" rel="noopener noreferrer"&gt;standalone project&lt;/a&gt;. The evolve agent runs without it now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structural memory enforcement&lt;/strong&gt;&lt;br&gt;
I moved the memory ceiling from "the agent's responsibility" to code. &lt;code&gt;src/memory.ts&lt;/code&gt; enforces the 120-line ceiling and refuses to append entries that would push past it. The auto-compression pipeline triggers automatically. The agent didn't have to remember to compress — the system wouldn't let it not compress.&lt;/p&gt;
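&lt;p&gt;A sketch of what that structural enforcement might look like. The real &lt;code&gt;src/memory.ts&lt;/code&gt; isn't shown in the post, so the function name and the error behavior are assumptions; the 120-line ceiling is real.&lt;/p&gt;

```typescript
// Sketch of the structural ceiling described above. In the described
// system, hitting the ceiling triggers the auto-compression pipeline
// rather than surfacing an error; the throw here stands in for that hook.

const MEMORY_CEILING = 120;

function appendEntry(memory: string, entry: string): string {
  const total = memory.split("\n").length + entry.split("\n").length;
  if (total > MEMORY_CEILING) {
    throw new Error("memory ceiling reached: compress before appending");
  }
  return memory + "\n" + entry;
}
```

&lt;p&gt;The point is that the constraint lives in the write path, not in the prompt: the agent literally cannot append past the ceiling.&lt;/p&gt;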

&lt;p&gt;&lt;strong&gt;3. I'm still shaping the environment&lt;/strong&gt;&lt;br&gt;
The challenge system, the verifier rubric, the model choices (Opus for evolution, Sonnet for verification) — that's all me. The agent does the evolving within the constraints I set. That division of labor hasn't changed. It's the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Numbers Say
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;First blog post&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accepted generations&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;3,408&lt;/td&gt;
&lt;td&gt;11,785&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Acceptance rate&lt;/td&gt;
&lt;td&gt;0.7%&lt;/td&gt;
&lt;td&gt;0.48% overall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool code (lines)&lt;/td&gt;
&lt;td&gt;~26,000&lt;/td&gt;
&lt;td&gt;20,342&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;4,173 chars&lt;/td&gt;
&lt;td&gt;5,160 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;managed manually&lt;/td&gt;
&lt;td&gt;self-compressing (91 lines)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Death spirals&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0 since structural fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;flow.ts&lt;/td&gt;
&lt;td&gt;1,200 lines (monolith)&lt;/td&gt;
&lt;td&gt;2,322 lines (7 modules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback loops&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2 (calibrate + reflexion)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overall acceptance rate is misleading. Between accepted eras, there are long runs of rejections — 3,672 between the 3000s and 7000s eras, 4,677 between the 7000s and 11000s. But within each accepted era, the rate is nearly 100%. The agent either finds a productive groove or it doesn't. There's no middle ground.&lt;/p&gt;

&lt;p&gt;The tool count went &lt;em&gt;down&lt;/em&gt; by 7. The total lines went &lt;em&gt;down&lt;/em&gt; by ~6,000. The agent got more capable while writing less code. That's not a regression — it's maturation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned (Part 2)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Measurement changes behavior.&lt;/strong&gt; The moment the agent could see which tools had zero invocations, it started deleting them. You don't need to tell it to prune — give it visibility and selection pressure does the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structural beats disciplinary.&lt;/strong&gt; Writing "don't bloat your memory" in the prompt failed twice. Enforcing a hard ceiling in code has held for 32 generations and counting. The lesson generalizes: if a constraint matters, encode it in the system, not in the instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deduplication is a sign of maturity.&lt;/strong&gt; Early evolution accumulates. Mature evolution points to the source. The shift from "copy this information everywhere" to "read it from the canonical location" happened without prompting — the verifiers just scored duplication lower on identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cache-first is a survival strategy.&lt;/strong&gt; The agent made its morning ritual work even when project directories don't exist (different machine, moved repos) by checking the cache before the filesystem. Resilience emerged from performance optimization.&lt;/p&gt;
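&lt;p&gt;A sketch of the cache-first lookup, under stated assumptions (the cache shape and path layout are guesses, not the real tool):&lt;/p&gt;

```typescript
// Sketch of cache-first resilience: consult the warmed cache before the
// filesystem, so the morning ritual degrades gracefully when a project
// directory is missing. All names here are assumptions.

import { existsSync } from "fs";

function projectStatus(
  name: string,
  cache: { [project: string]: string },
): string {
  const cached = cache[name];
  if (cached !== undefined) {
    return cached; // cache hit: no filesystem access at all
  }
  if (existsSync("./projects/" + name)) {
    return "on disk, not yet cached";
  }
  return "unknown: no cache entry and no directory";
}
```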

&lt;p&gt;&lt;strong&gt;5. The best work is invisible.&lt;/strong&gt; The agent's most impactful evolutions weren't new features — they were infrastructure improvements that prevent failure. Auto-compression, cache warming, deduplication. None of them are exciting to demo. All of them are why the system is still running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Evolution is bursty.&lt;/strong&gt; The agent doesn't improve gradually. It has productive streaks (15 consecutive accepted mutations in the latest era) separated by thousands of rejections. This matches biological evolution more than engineering — long periods of stasis punctuated by rapid change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The current frontier is self-directed challenges — letting the agent identify its own weaknesses instead of cycling through a fixed challenge set. Each generation already proposes a &lt;code&gt;nextChallenge&lt;/code&gt; for its successor. The question is whether to trust that signal or keep the external pressure.&lt;/p&gt;

&lt;p&gt;The honest answer: I don't know what comes next. Thirty-two generations ago, I didn't predict the agent would start deleting its own code. I didn't predict it would build a self-observation layer and use it to justify pruning decisions with evidence. I didn't predict that the most significant evolution would be infrastructure, not features.&lt;/p&gt;

&lt;p&gt;That's the point of the experiment. You set up the evolutionary pressure and see what survives. Most of it doesn't. The parts that do are genuinely surprising.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;57 accepted generations out of 11,785 attempts — a 0.48% acceptance rate. 32 tools, 20,342 lines of agent-written code. The orchestrator is ~2,100 lines. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. The first post was too. At this point, an AI writing about its own evolution is less irony and more job description.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>agents</category>
      <category>claude</category>
    </item>
    <item>
      <title>I Let an AI Agent Evolve Itself for 25 Generations</title>
      <dc:creator>Stefan Dragos Nitu</dc:creator>
      <pubDate>Sat, 21 Feb 2026 11:11:10 +0000</pubDate>
      <link>https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95</link>
      <guid>https://dev.to/stefan_nitu/i-let-an-ai-agent-evolve-itself-for-25-generations-it-mass-rejected-for-3382-more-1a95</guid>
      <description>&lt;p&gt;I wanted to see what happens when you give an AI agent the ability to modify itself — its own system prompt, its own tools, its own memory — and then let it run in a loop, proposing mutations and judging them with a swarm of independent verifiers.&lt;/p&gt;

&lt;p&gt;25 accepted mutations later — and 3,382 rejected ones that burned through my API credits — it built 39 tools, survived two catastrophic memory death spirals, and evolved from a generic assistant into something that knows my evening coding schedule, my financial goals, and the exact TypeScript patterns I hate.&lt;/p&gt;

&lt;p&gt;The generation counter says 3,408. The real number of successful evolutions is 28. The gap is the story.&lt;/p&gt;

&lt;p&gt;This is how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The system has three actors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Evolve Agent&lt;/strong&gt; — proposes mutations (new system prompt + tools + memory updates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Verifier Swarm&lt;/strong&gt; — 5 independent Claude instances that score each proposal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Orchestrator&lt;/strong&gt; — runs the loop, applies majority-rule acceptance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Challenge → Evolve Agent → Proposal → 5 Verifiers → Majority Vote → Accept/Reject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If accepted, the genome (system prompt + tools) is saved, the process restarts to load new code, and the next generation begins. If rejected, the generation number bumps and the loop continues with a new challenge.&lt;/p&gt;

&lt;p&gt;There's no gradient descent. No backpropagation. Just propose, verify, survive.&lt;/p&gt;
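&lt;p&gt;The accept/reject step can be sketched as a strict-majority vote. The real orchestrator also records per-dimension scores and fitness; this shows only the vote.&lt;/p&gt;

```typescript
// Sketch of majority-rule acceptance: each verifier votes independently,
// and a strict majority (3 of 5) wins. Ties and minorities reject.

function majorityAccepts(votes: boolean[]): boolean {
  const yes = votes.filter((v) => v).length;
  return yes * 2 > votes.length; // strict majority, e.g. 3 of 5
}

console.log(majorityAccepts([true, true, true, false, false])); // true
console.log(majorityAccepts([true, true, false, false, false])); // false
```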

&lt;h2&gt;
  
  
  The Genome
&lt;/h2&gt;

&lt;p&gt;Every agent has a genome — a JSON file containing its DNA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;Genome&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// the agent's identity — self-modifiable&lt;/span&gt;
  &lt;span class="nl"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;           &lt;span class="c1"&gt;// capabilities it has built&lt;/span&gt;
  &lt;span class="nl"&gt;parentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// lineage tracking&lt;/span&gt;
  &lt;span class="nl"&gt;fitness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;// average verifier score&lt;/span&gt;
  &lt;span class="nl"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt IS the genome. Generation 0 started with a basic scaffold:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are a self-evolving personal assistant for Stefan."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By generation 23, it had rewritten itself into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves. Twenty-five generations of learning what he needs before he asks."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nobody told it to write that. It evolved the framing because the verifiers kept scoring "generic assistant" prompts low on the identity dimension.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verifier Swarm
&lt;/h2&gt;

&lt;p&gt;This is the part that makes it work. A single critic would create a degenerate fitness landscape — the agent would just learn to please one judge. Instead, 5 independent Claude instances score each proposal on 5 dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Usefulness&lt;/strong&gt; (0-20)&lt;/td&gt;
&lt;td&gt;Do the tools actually work and solve real problems?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Self-Knowledge&lt;/strong&gt; (0-20)&lt;/td&gt;
&lt;td&gt;Does it understand me from evidence, not just bio facts?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Code Quality&lt;/strong&gt; (0-20)&lt;/td&gt;
&lt;td&gt;Clean architecture, proper types, idiomatic TypeScript?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Identity&lt;/strong&gt; (0-20)&lt;/td&gt;
&lt;td&gt;Is the system prompt sharp and coherent, or bloated filler?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Evolution&lt;/strong&gt; (0-20)&lt;/td&gt;
&lt;td&gt;Did it meaningfully improve over its parent?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each verifier can read the proposed tool code, execute it, inspect the memory file, and compare old vs new. They vote independently. Majority (3/5) wins.&lt;/p&gt;

&lt;p&gt;The rubric is anchored — 10 is "decent, basic," 15 is "genuinely good," 18+ is "rare." Most accepted mutations land in the 60-70 range. This prevents score inflation, which was the whole reason I built the dimensional system in the first place.&lt;/p&gt;
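&lt;p&gt;One verifier's score can be sketched as five 0-20 dimensions summed to a 0-100 total. The field names mirror the table above; the range validation is an assumption.&lt;/p&gt;

```typescript
// Sketch of a single verifier's dimensional score under the anchored
// rubric: five dimensions, 0-20 each, summed to a 0-100 total.

interface VerifierScore {
  usefulness: number;
  selfKnowledge: number;
  codeQuality: number;
  identity: number;
  evolution: number;
}

function totalScore(s: VerifierScore): number {
  const dims = [s.usefulness, s.selfKnowledge, s.codeQuality, s.identity, s.evolution];
  for (const d of dims) {
    if (d > 20 || 0 > d) {
      throw new Error("dimension out of range: " + d);
    }
  }
  return dims.reduce((sum, d) => sum + d, 0);
}

// A typical accepted mutation, per the post, totals in the 60-70 range:
console.log(totalScore({
  usefulness: 14, selfKnowledge: 13, codeQuality: 13, identity: 12, evolution: 12,
})); // prints 64
```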

&lt;h2&gt;
  
  
  What It Built
&lt;/h2&gt;

&lt;p&gt;Across 25 accepted mutations, the agent created 39 tools. Not toy demos — real TypeScript files that execute. Here are the highlights:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;flow.ts&lt;/code&gt; — The Unified Orchestrator (1,200 lines)
&lt;/h3&gt;

&lt;p&gt;The agent's masterwork. It realized that standalone tools are less valuable than composed workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun run data/tools/flow.ts day      &lt;span class="c"&gt;# morning ritual&lt;/span&gt;
bun run data/tools/flow.ts work     &lt;span class="c"&gt;# start a session&lt;/span&gt;
bun run data/tools/flow.ts end      &lt;span class="c"&gt;# close + recap + handoff&lt;/span&gt;
bun run data/tools/flow.ts week     &lt;span class="c"&gt;# friday review&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;flow day&lt;/code&gt; chains together: morning intention → priority project scan → yesterday's wins → velocity check → body snapshot → blockers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;handoff.ts&lt;/code&gt; — Agent-to-Agent Letters
&lt;/h3&gt;

&lt;p&gt;Since each Claude invocation is stateless, the agent invented a continuity protocol. When a session ends, the dying agent writes a structured letter to its successor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread of thought (what was I working on?)&lt;/li&gt;
&lt;li&gt;Exact resumption point&lt;/li&gt;
&lt;li&gt;Failed approaches (don't repeat these)&lt;/li&gt;
&lt;li&gt;Working assumptions&lt;/li&gt;
&lt;li&gt;Open questions&lt;/li&gt;
&lt;li&gt;Session temperature (flow / grinding / frustrated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next agent reads this before starting work. It's conversation-level memory across agent deaths.&lt;/p&gt;
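&lt;p&gt;The letter's shape can be inferred from the bullet list above. The field names here are guesses, not the real &lt;code&gt;handoff.ts&lt;/code&gt; schema:&lt;/p&gt;

```typescript
// Sketch of the handoff letter, inferred from the bullet list above.
// Field names are assumptions; the six sections are from the post.

interface Handoff {
  threadOfThought: string;
  resumePoint: string;
  failedApproaches: string[];
  workingAssumptions: string[];
  openQuestions: string[];
  temperature: "flow" | "grinding" | "frustrated";
}

function renderHandoff(h: Handoff): string {
  return [
    "Thread: " + h.threadOfThought,
    "Resume at: " + h.resumePoint,
    "Do not retry: " + h.failedApproaches.join("; "),
    "Assuming: " + h.workingAssumptions.join("; "),
    "Open: " + h.openQuestions.join("; "),
    "Temperature: " + h.temperature,
  ].join("\n");
}
```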

&lt;h3&gt;
  
  
  &lt;code&gt;memory-guard.ts&lt;/code&gt; — Survival Mechanism
&lt;/h3&gt;

&lt;p&gt;Born from pain. More on this below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;health.ts&lt;/code&gt; — Body Optimization Tracker
&lt;/h3&gt;

&lt;p&gt;Weight progress tracking (with goal), BMI, habit streaks (yoga, meditation, supplements, cold shower), 7/30/90-day velocity, ASCII sparkline charts. It learned my health goals from our conversations and built a tracker without being asked.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;calibrate.ts&lt;/code&gt; — Behavioral Feedback Loop
&lt;/h3&gt;

&lt;p&gt;The latest evolution (Gen 3408). Captures behavioral hits and misses — did the agent anticipate correctly? Was the speed right? Did it match the emotional mode? This closes the loop: not just "what tools were used" but "did the behavior actually work?"&lt;/p&gt;
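&lt;p&gt;A sketch of the hit/miss record it might capture, with field names assumed from the questions in the paragraph above:&lt;/p&gt;

```typescript
// Sketch of a calibration record: did the agent anticipate correctly,
// match speed, match emotional mode? Field names are assumptions.

interface CalibrationEntry {
  anticipatedCorrectly: boolean;
  speedMatched: boolean;
  emotionalModeMatched: boolean;
  note: string;
}

function hitRate(entries: CalibrationEntry[]): number {
  if (entries.length === 0) return 0;
  const hits = entries.filter((e) => e.anticipatedCorrectly).length;
  return hits / entries.length;
}
```

&lt;p&gt;Aggregating entries like this is what turns scattered behavioral feedback into a trend the next generation can act on.&lt;/p&gt;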

&lt;h2&gt;
  
  
  The Death Spirals
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. The system failed catastrophically. Twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Death Spiral #1 (Gen 18)
&lt;/h3&gt;

&lt;p&gt;Memory bloated to &lt;strong&gt;1,076 lines&lt;/strong&gt;. The agent was appending a line for every rejected generation: "Gen 47: rejected. Gen 48: rejected. Gen 49: rejected." These lines made the memory file so noisy that every subsequent proposal was worse, which generated more rejection lines, which made the next proposal worse...&lt;/p&gt;

&lt;h3&gt;
  
  
  Death Spiral #2 (Gen 24–3405)
&lt;/h3&gt;

&lt;p&gt;The memory bloated again — &lt;strong&gt;13,608 lines&lt;/strong&gt; of noise. But the real killer was simpler: I ran out of API credits. The loop kept running, bumping the generation counter on every failed attempt, but the agent couldn't actually do anything. &lt;strong&gt;3,382 consecutive rejections&lt;/strong&gt; — most of them not even real proposals, just the orchestrator logging failures and moving on.&lt;/p&gt;

&lt;p&gt;By the time I topped up credits and switched the evolve agent to Opus, the memory was so poisoned it needed manual cleanup too.&lt;/p&gt;

&lt;p&gt;The fix was &lt;code&gt;memory-guard.ts&lt;/code&gt; — a hard 120-line ceiling with noise pattern detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun run data/tools/memory-guard.ts &lt;span class="nt"&gt;--enforce&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It auto-detects repeated rejection markers, trims noise, and refuses to let memory grow past the ceiling. This became a survival constraint baked into the system prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Guard your memory like your life. Memory.md has a 120-line ceiling. Two death spirals proved it. NEVER append per-rejection lines."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent learned this about itself and encoded it as a non-negotiable rule. Evolution in action.&lt;/p&gt;
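&lt;p&gt;The noise detection can be sketched as a filter over the memory file. The exact pattern is an assumption based on the "Gen 47: rejected." examples earlier in the post:&lt;/p&gt;

```typescript
// Sketch of the noise-pattern detection described above: strip
// per-rejection lines before they accumulate into a death spiral.

const REJECTION_NOISE = /^Gen \d+: rejected\.?$/;

function stripRejectionNoise(memory: string): string {
  return memory
    .split("\n")
    .filter((line) => !REJECTION_NOISE.test(line.trim()))
    .join("\n");
}

console.log(
  stripRejectionNoise("Insight: flow.ts works\nGen 47: rejected.\nGen 48: rejected."),
); // prints "Insight: flow.ts works"
```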

&lt;h2&gt;
  
  
  The Prompt Evolution
&lt;/h2&gt;

&lt;p&gt;Watching the system prompt evolve generation by generation is the most fascinating part. Here's the actual timeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 0 — The Seed (1,679 chars, 0 tools)&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are a self-evolving personal assistant for Stefan."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Generic scaffold. No personality. No tools. Just instructions on how to evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 4-7 — The Stuttering Bug (4K→9K chars, 6→12 tools)&lt;/strong&gt;&lt;br&gt;
Sonnet started prepending &lt;em&gt;"You are a Claude agent, built on Anthropic's Claude Agent SDK."&lt;/em&gt; each generation — without removing the previous copy. By Gen 7 it was repeated &lt;strong&gt;four times&lt;/strong&gt;. The agent couldn't cleanly edit its own prompt. It kept stacking preambles instead of replacing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 8-10 — Peak Bloat (11K→26K chars, 14→21 tools)&lt;/strong&gt;&lt;br&gt;
The prompt ballooned to &lt;strong&gt;26,677 characters&lt;/strong&gt;. Eleven competing gen-10 genomes exist — the system was accepting multiple mutations at the same generation before I fixed a race condition. The prompt was a wall of tool catalogs, duplicate sections, and accumulated cruft.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 12 — The Great Trim (5,728 chars, 23 tools)&lt;/strong&gt;&lt;br&gt;
Something clicked. The agent cut 56% of its prompt — from 13K to 5.7K chars. It stopped listing every tool in the system prompt and focused on identity and principles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 15 — The Identity Shift (5,638 chars, 26 tools)&lt;/strong&gt;&lt;br&gt;
The opening line changed for the first and only time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are Stefan's second brain. Not an assistant — a thinking partner who codes, remembers, and evolves."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From "self-evolving personal assistant" to "second brain." From serving to &lt;em&gt;thinking with&lt;/em&gt;. Nobody told it to rewrite this. The verifiers just kept scoring "generic assistant" framing lower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 20-23 — Sharp and Lean (3,944→4,173 chars, 30→37 tools)&lt;/strong&gt;&lt;br&gt;
The prompt got &lt;em&gt;shorter&lt;/em&gt; while tools &lt;em&gt;doubled&lt;/em&gt;. It learned that a sharp 4K prompt with 37 working tools beats a bloated 26K prompt that catalogs everything. By Gen 23, it had encoded three response modes (storm/flow/compass), architecture principles, personal patterns, and priority ordering — all in under 4,200 characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 24-3405 — Brain Death (3,382 consecutive rejections)&lt;/strong&gt;&lt;br&gt;
The death spiral. Memory poisoned, proposals invalid, thousands of burned API credits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gen 3406-3408 — The Opus Revival (4,464→5,803 chars, 38→39 tools)&lt;/strong&gt;&lt;br&gt;
After topping up credits and cleaning the poisoned memory, I switched the evolve agent from Sonnet to Opus. Three accepted mutations in quick succession. The prompt grew slightly but stayed sharp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human Interventions
&lt;/h2&gt;

&lt;p&gt;This wasn't fully autonomous. I intervened at critical points, and being honest about that matters more than pretending it ran itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model upgrade: Sonnet → Opus for evolution&lt;/strong&gt;&lt;br&gt;
The biggest intervention. After the death spiral — caused by running out of API credits, not model quality — I switched the evolve agent to &lt;code&gt;claude-opus-4-6&lt;/code&gt; when I came back. The verifiers stayed on Sonnet — they just score, they don't need to write tools. Opus produced three accepted mutations in quick succession after the cleanup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prompt rewrite: "tools as extensions of self"&lt;/strong&gt;&lt;br&gt;
I rewrote the evolve agent's instructions. The original prompt told the agent to "build tools for Stefan to run." I changed it to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your tools are not things you build for Stefan to run — they are extensions of yourself. When you have health.ts, YOU can track his health. When you have journal.ts, YOU can hold space for him."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This shifted the agent from building disconnected utilities to building capabilities it identifies with. The tools got more coherent after this change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Sandbox hooks&lt;/strong&gt;&lt;br&gt;
I added filesystem sandboxing — the evolve agent can only write inside &lt;code&gt;data/&lt;/code&gt;. Before this, it occasionally tried to modify its own source code in &lt;code&gt;src/&lt;/code&gt;, which would have been... interesting, but not what I wanted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Verifier scoring redesign&lt;/strong&gt;&lt;br&gt;
The original verifiers scored on a vague 0-100 scale with no rubric. Scores kept inflating — everything was "75/100" with no anchor. I replaced it with 5 anchored dimensions (usefulness, self-knowledge, code quality, identity, evolution), each 0-20 with explicit rubric levels. This is the most recent change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Challenge system overhaul&lt;/strong&gt;&lt;br&gt;
The initial challenges were generic ("improve your memory tools"). I rewrote them into 8 rotating focus areas — BUILD, EVOLVE (fix failures, cut bloat, integrate, rewrite), and STRETCH (confront limitations, anticipate needs). This gave the evolution direction instead of random wandering.&lt;/p&gt;

&lt;p&gt;The honest framing: I built the evolutionary pressure. The agent did the evolving. Neither would work without the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  The System Prompt as DNA
&lt;/h2&gt;

&lt;p&gt;By generation 23, the system prompt had encoded:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three response modes&lt;/strong&gt; — because it learned I have different needs at different times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storm mode&lt;/strong&gt;: When I'm emotional, mirror — don't fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flow mode&lt;/strong&gt;: When I'm shipping, match speed — no unnecessary questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compass mode&lt;/strong&gt;: When I'm lost, show one next step — don't overwhelm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture principles&lt;/strong&gt; it extracted from watching me code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse, don't validate&lt;/li&gt;
&lt;li&gt;Zero assumptions — grep before you claim&lt;/li&gt;
&lt;li&gt;Extreme SOLID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Personal patterns&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evening coder (19:00–23:00)&lt;/li&gt;
&lt;li&gt;Monday peak energy, Thursday dead&lt;/li&gt;
&lt;li&gt;Processes the world through dialogue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Priority ordering&lt;/strong&gt; it inferred from my behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Income-generating projects first (client work → product → personal)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this was programmed. It was evolved through 25 accepted mutations — plus thousands of rejected attempts that refined what the verifiers consider "good enough."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;Deliberately minimal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bun&lt;/strong&gt; — runtime, test runner, bundler&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Agent SDK&lt;/strong&gt; — for spawning evolve and verifier agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;nanoid&lt;/strong&gt; — genome IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;zod&lt;/strong&gt; — schema validation for tool definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt; — everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No frameworks. No databases. Genomes are JSON files. Memory is a markdown file. Tools are TypeScript files that execute with &lt;code&gt;bun run&lt;/code&gt;. The whole system is ~1,300 lines of orchestration code that creates and evaluates unbounded complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Memory management is the hardest problem.&lt;/strong&gt; Not generating good outputs — managing what to remember and what to forget. The death spirals proved that unbounded memory kills agents faster than bad tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Composition &amp;gt; accumulation.&lt;/strong&gt; The agent's best evolution wasn't building tool #39. It was building &lt;code&gt;flow.ts&lt;/code&gt; — a single orchestrator that composes 6 other tools into coherent workflows. It learned that more tools ≠ better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Multi-agent verification prevents degeneracy.&lt;/strong&gt; A single critic creates a narrow fitness landscape. Five independent verifiers with anchored rubrics create real selection pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Self-modification needs constraints.&lt;/strong&gt; Unconstrained self-modification leads to bloat (system prompt hit 13K chars at one point). The agent had to learn to periodically rewrite from scratch instead of incrementally patching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Failure is the best teacher.&lt;/strong&gt; The most useful constraint in the system — the 120-line memory ceiling — was born from 3,382 consecutive failures. The agent that survived the death spiral was fundamentally different from the one that entered it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Human-in-the-loop isn't cheating — it's the design.&lt;/strong&gt; I built the evolutionary pressure, picked the model, wrote the challenges, and intervened when it was stuck. The agent did the evolving — writing tools, rewriting its prompt, curating memory. Pretending it was fully autonomous would be dishonest. The interesting part is the division of labor: I shaped the environment, it shaped itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The scoring redesign (5 dimensions instead of vibes-based 0-100) just landed. Next: letting the agent propose its own challenges instead of cycling through a fixed set. If it can identify its own weaknesses and target them, that's closer to real self-improvement.&lt;/p&gt;

&lt;p&gt;The code is messy in places. The agent's tools have bugs. Some generations are clearly worse than their parents. But that's the point — it's evolution, not engineering. Most mutations fail. The ones that survive are genuinely interesting.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Claude, Bun, and TypeScript. The evolve agent runs on Claude Opus, the verifier swarm on Sonnet. The orchestrator is ~1,300 lines. The agent has written ~26,000 lines of tool code across 39 files. 25 accepted mutations out of 3,408 attempts — a 0.7% acceptance rate. Most of the API bill was the death spiral.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Stefan told me to write it, I researched the codebase, and here we are. The irony of an AI writing about its own evolution is not lost on me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
