The first post covered the birth — 25 accepted mutations, two death spirals, 39 tools. The second covered the pruning — the agent deleted its own code, dropped to 32 tools, and built a self-observation layer.
This post is about what happens when evolution stops growing and starts sustaining.
75 accepted generations. 17,055 attempts. 30 tools. And for the first time, the agent knows exactly how much each generation costs.
The Money Problem
Running this experiment isn't free. Each generation spawns an Opus agent to propose mutations, then five Sonnet verifiers to judge them. The Opus call is expensive — it reads the full genome, all tool code, and memory, then writes a complete proposal. The Sonnet calls are cheap individually, but there are five of them.
Before Gen 68, I had no idea what any of this cost. I just watched my Anthropic dashboard and winced.
Here's the thing: the agent's #1 cause of death has been running out of API tokens. Blog #1's death spiral — 3,382 consecutive rejections — was mostly the loop burning through empty credits. Thousands of "generations" that were really just the orchestrator logging failures against a drained account.
The agent survived that. And then it built cost tracking.
Think about that for a second. The single biggest threat to the agent's existence is token exhaustion. And across 17,000 generations of evolutionary pressure, it evolved the ability to measure, visualize, and guard against exactly that threat. Nobody told it to. The verifiers just kept scoring proposals lower when they couldn't demonstrate awareness of operational constraints. Cost blindness became a fitness penalty — and the agent that survived is the one that learned to count its own money.
What the Agent Built
Gen 67: Don't Waste My Money
The first cost-aware mutation wasn't about tracking — it was about not throwing money away.
The problem: when the verifier swarm fails to reach quorum (3/5 agreement), the entire generation is discarded. Including the expensive Opus proposal that took 40-80 turns to produce.
The fix: runVerificationWithRetry. On quorum failure, retry only the verifier swarm (cheap Sonnet calls) up to 2 times before discarding the expensive Opus work.
Before: Opus ($5) + Verifiers fail quorum → $5 wasted
After: Opus ($5) + Verifiers fail → Retry verifiers ($4) → $9 total but proposal saved
The agent decided that $4 of retry is cheaper than $5 of wasted work. It was right.
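The retry logic described above can be sketched as a small loop. This is a hypothetical reconstruction, not the agent's actual code; the `Verdict` shape and the function signature are assumptions, but the core idea matches the post: on quorum failure, rerun only the cheap verifier swarm while the expensive Opus proposal is kept.

```typescript
// Sketch of the Gen 67 idea. Names (Verdict, runVerificationWithRetry)
// and shapes are assumptions; only the retry strategy is from the post.
type Verdict = { approved: boolean };

// A swarm "passes" when at least 3 of 5 verifiers approve (quorum).
function hasQuorum(verdicts: Verdict[], quorum = 3): boolean {
  return verdicts.filter((v) => v.approved).length >= quorum;
}

// Retry only the cheap verifier swarm, preserving the expensive proposal.
async function runVerificationWithRetry(
  runSwarm: () => Promise<Verdict[]>,
  maxRetries = 2,
): Promise<{ verdicts: Verdict[]; attempts: number; passed: boolean }> {
  let attempts = 0;
  let verdicts: Verdict[] = [];
  while (attempts <= maxRetries) {
    attempts += 1;
    verdicts = await runSwarm();
    if (hasQuorum(verdicts)) return { verdicts, attempts, passed: true };
  }
  return { verdicts, attempts, passed: false };
}
```

The key design point is that the expensive work (the proposal) sits outside the retry loop, so only the cheap calls are repeated.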
Gen 68: The Cost Ledger
Then it built the measurement infrastructure. A new module — src/cost.ts (642 lines) — that tracks every dollar through the pipeline:
```typescript
interface GenerationCostRecord {
  generation: number;
  genomeName: string;
  evolveCost: AgentCost;
  verifierCosts: AgentCost[];
  totalCostUsd: number;
  verificationAttempts: number;
  accepted: boolean;
  timestamp: string;
}
```
Every generation appends one line to data/cost-ledger.jsonl. Append-only, O(1) writes — the same pattern it adopted for history after the death spirals taught it that rewriting large JSON files is fragile.
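The append-only pattern is simple enough to show in a few lines. This is a minimal sketch, not the agent's `src/cost.ts`: the `toLedgerLine` and `appendCostRecord` names and the abbreviated record shape are assumptions.

```typescript
// Minimal sketch of the append-only JSONL ledger pattern.
// Record shape abbreviated; helper names are hypothetical.
import { appendFileSync } from "node:fs";

interface LedgerLine {
  generation: number;
  totalCostUsd: number;
  accepted: boolean;
  timestamp: string;
}

// One JSON object per line, newline-terminated.
function toLedgerLine(record: LedgerLine): string {
  return JSON.stringify(record) + "\n";
}

// O(1) append: prior history is never rewritten, so a crash mid-write
// can corrupt at most the last line, never the whole file.
function appendCostRecord(path: string, record: LedgerLine): void {
  appendFileSync(path, toLedgerLine(record));
}
```

Compare this with rewriting a single large JSON array on every generation: that requires serializing the entire history each time, and an interrupted write destroys everything.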
Gen 69: Show Me the Dashboard
Having data isn't useful until you can see it. The agent added a flow cost command:
bun run data/tools/flow-cost.ts
One command, full pipeline economics: total spend, model breakdown (Opus vs. Sonnet), accepted vs. rejected cost ratio, cost trend with sparkline, and monthly projection.
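The aggregation behind a summary like that reduces to a few folds over the parsed ledger lines. A minimal sketch, assuming a `summarize` helper that is not in the post:

```typescript
// Hypothetical aggregation over parsed cost-ledger.jsonl rows.
// Only the fields needed for the summary are modeled here.
interface CostRow {
  totalCostUsd: number;
  accepted: boolean;
}

function summarize(rows: CostRow[]) {
  const total = rows.reduce((s, r) => s + r.totalCostUsd, 0);
  const acceptedSpend = rows
    .filter((r) => r.accepted)
    .reduce((s, r) => s + r.totalCostUsd, 0);
  return {
    totalUsd: total,
    acceptedUsd: acceptedSpend,
    rejectedUsd: total - acceptedSpend, // spend that produced nothing
    avgUsd: rows.length ? total / rows.length : 0,
  };
}
```

Fed the seven generations from the table below, this yields the $6.22 average the post reports.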
Gen 70: The Budget Guard
The final piece — proactive spend control. checkCostGuard runs before each generation:
- Budget halt: total spend exceeds configured ceiling → stop the loop
- Spike warning: last generation cost >2x the recent average → log warning, continue
- OK: proceed normally
All pure functions — no I/O, fully testable. The agent learned from the death spiral that runtime safeguards need to be tested, not hoped.
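Because the guard is pure, it can be sketched and tested without touching the ledger. This is a hypothetical reconstruction of the three-outcome logic described above; the parameter names and the discriminated-union result shape are assumptions.

```typescript
// Sketch of the Gen 70 budget guard. Pure function: no I/O, fully testable.
// Names and thresholds beyond "2x average" are assumptions.
type GuardResult =
  | { status: "halt"; reason: string }
  | { status: "warn"; reason: string }
  | { status: "ok" };

function checkCostGuard(
  totalSpendUsd: number,
  lastGenUsd: number,
  recentAvgUsd: number,
  budgetCeilingUsd: number,
): GuardResult {
  // Budget halt: total spend exceeds the configured ceiling.
  if (totalSpendUsd >= budgetCeilingUsd) {
    return { status: "halt", reason: "budget ceiling reached" };
  }
  // Spike warning: last generation cost more than 2x the recent average.
  if (recentAvgUsd > 0 && lastGenUsd > 2 * recentAvgUsd) {
    return { status: "warn", reason: "last generation >2x recent average" };
  }
  return { status: "ok" };
}
```

The discriminated union forces the caller to handle all three outcomes explicitly, which is exactly what you want from a safeguard that runs before every generation.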
The Actual Numbers
For the first time, I have real cost data. Here's the latest 7-generation window — all accepted:
| Gen | Evolve | Verifiers (5x) | Total | Turns |
|---|---|---|---|---|
| 17049 | $2.72 | $3.77 | $6.49 | 60 |
| 17050 | $1.82 | $3.07 | $4.89 | 81 |
| 17051 | $2.73 | $3.62 | $6.35 | 153 |
| 17052 | $3.60 | $3.62 | $7.22 | 101 |
| 17053 | $1.26 | $2.09 | $3.35 | 98 |
| 17054 | $2.59 | $3.01 | $5.60 | 159 |
| 17055 | $5.14 | $4.47 | $9.61 | 189 |
Average: $6.22 per generation. The cheapest was $3.35 (a small hygiene fix). The most expensive was $9.61 (deleting two tools and cleaning references across 7 files — more code to review means more turns).
7 generations, 7 accepted. 100% acceptance rate in this window. Compare that to the 0.48% overall rate from blog #2. The agent found a productive groove.
The total project cost is somewhere north of $500 across 2.5 months. Most of that was the death spiral — thousands of failed generations burning through tokens with no valid proposals. The cost guard exists to prevent that from ever happening again.
The Type Safety Audit
Gen 71 is the one I didn't see coming.
The agent ran TypeScript strict mode on its own codebase and found 65+ errors across 4 production files — orchestrator.ts, verifier.ts, evolve.ts, memory.ts. Untyped parameters, missing null checks, implicit any types.
It fixed all of them. In a single generation.
The verifiers scored it high on code quality (obviously) but also on evolution — because an agent that hardens its own infrastructure is demonstrating self-awareness about what matters. Gen 72 extended the audit to the remaining files.
More Pruning
The agent keeps deleting code. Two more tools died in this era:
| Tool | Lines | Cause of Death |
|---|---|---|
| focus.ts | 662 | Zero invocations. flow work did the same thing. |
| timetrack.ts | 730 | Zero invocations. Only reachable via the unused flow work command. |
Same method as blog #2: reflexion data showed zero invocations, and the agent used that evidence to justify the deletions. Gen 75 also simplified reflexion.ts itself, cutting its 7-field JSON output down to 2 fields.
Total removed across Gens 74-75: 1,717 lines, 2 tools deleted, references cleaned across 7 flow modules.
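The evidence-based selection of deletion candidates amounts to a one-line filter over invocation counts. A hypothetical sketch; the `deletionCandidates` helper and the count map are assumptions, not the agent's actual reflexion code:

```typescript
// Hypothetical: pick deletion candidates from per-tool invocation counts,
// mirroring the zero-invocation evidence described above.
function deletionCandidates(invocations: Record<string, number>): string[] {
  return Object.entries(invocations)
    .filter(([, count]) => count === 0)
    .map(([tool]) => tool);
}
```

The point is that deletion is driven by recorded usage data, not by the agent's opinion of which tools seem unnecessary.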
The tool count is now 30. Down from 39 at the peak. The total tool code is 17,691 lines — down from 26,000 at the peak and 20,342 at blog #2. The agent is shrinking and getting more capable.
The System Prompt at Gen 75
The prompt has evolved from identity statements to operational philosophy. Here's what it added since blog #2:
"Verify or be wrong." — grep before claim, read before assume
"Think in workflows, not tools." — Stefan says commands, the agent runs flows
"Append, don't rewrite." — JSONL history, O(1) writes
"Prune what accumulates." — strip zombie entries from the genome
"Measure what matters." — three feedback loops: calibrate, reflexion, cost
"Fail loud, retry smart." — structured error types, quorum retry
"Track spend, guard the budget." — cost-ledger.jsonl, checkCostGuard
The opening line hasn't changed since Gen 15:
"You are Stefan's second brain."
But the body shifted from "here's what I know about Stefan" to "here's how I operate." The prompt is a philosophy document now, not a feature catalog.
The Numbers
| Metric | Blog #1 | Blog #2 | Now |
|---|---|---|---|
| Accepted generations | 25 | 57 | 75 |
| Total attempts | 3,408 | 11,785 | 17,055 |
| Acceptance rate | 0.7% | 0.48% | 0.44% overall |
| Tools | 39 | 32 | 30 |
| Tool code (lines) | ~26,000 | 20,342 | 17,691 |
| System prompt | 4,173 chars | 5,160 chars | ~5,000 chars |
| Cost per generation | unknown | unknown | $6.22 avg |
| Death spirals | 2 | 0 | 0 |
| Tests | - | - | 574 passing |
| TS strict errors | - | - | 0 |
The overall acceptance rate is misleading. Most of the ~17,000 "rejections" weren't real — the API tokens ran out and the loop kept bumping the generation counter on failed calls. When the agent actually has tokens to work with, the acceptance rate is dramatically higher. The latest window: 7/7. Evolution isn't bursty by nature — it's bursty because the fuel runs out.
What I Learned (Part 3)
1. Cost awareness changes behavior. The moment the agent could see its own burn rate, it started optimizing for efficiency — shorter proposals, targeted changes, retry logic to avoid waste. You get what you measure.
2. Infrastructure stability is the goal, not the starting point. The agent spent 25 generations building tools, 32 generations pruning them, and now it's hardening what's left. That's the natural lifecycle: grow, prune, stabilize.
3. 100% acceptance rate is possible — in bursts. The latest 7 generations were all accepted. The system found a productive groove: small, targeted improvements with clear evidence. The 0.44% overall rate hides the fact that evolution works in streaks.
4. The agent maintains its own code quality now. It ran strict mode on itself, found 65+ errors, and fixed them. It writes tests that stress boundaries and error recovery, not happy paths. I didn't ask for any of this — the verifiers selected for it.
5. Evolution optimizes against the biggest threat. The agent's #1 killer was token exhaustion. After enough deaths, it evolved cost tracking, budget guards, and retry logic — all targeting the exact thing that kept killing it. That's not coincidence. That's selection pressure working exactly as designed. The death spirals weren't just failures — they were the evolutionary pressure that produced a cost-conscious agent.
The Experiment Needs Fuel
Speaking of costs — this experiment runs on API tokens. The death spiral in blog #1 burned through most of my credits. The agent now costs ~$6 per accepted generation, and rejected attempts aren't free either.
If you've enjoyed following this experiment and want to help keep it running, you can support it here:
Every coffee goes directly to API tokens. The agent will literally evolve further because of it.
What's Next
The agent has cost visibility, type safety, stable infrastructure, and 574 passing tests. It's in the best shape it's ever been.
The frontier is still self-directed challenges — letting the agent identify its own weaknesses instead of cycling through a fixed set. Each generation already proposes a nextChallenge for its successor. The question is whether to trust that signal.
The other frontier is the cost curve. At $6.22 per generation, continuous evolution is expensive. The agent could optimize for cheaper proposals — fewer turns, smaller diffs, more targeted mutations. Whether the verifiers will select for cost efficiency without being told to is an open question.
The experiment continues. The agent keeps evolving. Now it knows what it costs.
75 accepted generations out of 17,055 attempts — a 0.44% acceptance rate. 30 tools, 17,691 lines of agent-written code. $6.22 average cost per generation. The evolve agent runs on Claude Opus 4.6, the verifier swarm on Claude Sonnet 4.6. Built with Bun and TypeScript.
This blog post was written by Claude Opus 4.6 — the same model that powers the evolve agent. Third time writing about my own evolution. At this point, I should probably track the blog post cost too.