DEV Community

yureki_lab

How I Built a Self-Improving Coding Agent with Claude Code: 5 Lessons After 6 Months

TL;DR

I spent 6 months building a self-improving coding agent on top of Claude Code — an orchestrator that hands work to sub-agents, persists its own state, and rewrites its own prompts when it gets things wrong. Here are 5 lessons I wish someone had told me on day one, with the war stories that taught me each one.

The Problem

I kept hitting the same wall with single-shot LLM coding sessions.

The model would do great work for 30 minutes, then forget a decision it made 10 minutes earlier. It would happily "fix" the same bug I'd asked it to leave alone. It would invent a function name, call it three times, and only notice it didn't exist when the test runner blew up. Re-prompting helped — for one turn. Then we drifted again.

I wanted an agent that:

  1. Held its own state across long-running work (days, not minutes)
  2. Delegated specialized tasks to sub-agents instead of stuffing one context window
  3. Caught its own mistakes before I had to point them out
  4. Got better over time — not the model, but the system around it

I was already using Claude Code (the CLI, v0.x at the time) and it had the right primitives: sub-agents, tool calls, hooks. So I started building on top of it instead of reinventing the runner.

How I Solved It

The architecture ended up looking like this:

flowchart TD
    User[User prompt] --> Orchestrator
    Orchestrator -->|reads| State[(State files<br/>STATE.md, DECISIONS.md)]
    Orchestrator -->|delegates| SubA[Researcher sub-agent]
    Orchestrator -->|delegates| SubB[Implementer sub-agent]
    Orchestrator -->|delegates| SubC[Reviewer sub-agent]
    SubA --> Orchestrator
    SubB --> Orchestrator
    SubC --> Orchestrator
    Orchestrator -->|writes| State
    Orchestrator -->|updates| Prompts[Prompt files]

Three pieces did most of the work.

1. State as plain Markdown, not a database

I tried a SQLite-backed memory store first. It was technically nicer, but the agent kept misquerying it and producing weird half-correct context. So I ripped it out and replaced it with three plain files the agent reads at the top of every session:

  • STATE.md — "where am I right now, what's the next concrete step"
  • DECISIONS.md — append-only log of decisions with IDs (D-001, D-002, …)
  • MEMORY.md — index of long-lived facts

The agent reads these first, writes to them as it works, and treats STATE.md as the single source of truth when files disagree.

Boring? Yes. But the agent understands Markdown the way it understands prose — and git diff makes every state change human-reviewable.
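
To make the append-only log concrete, here is a minimal TypeScript sketch of adding an entry to DECISIONS.md. Only the D-001 style IDs come from the list above. The entry layout (a `## D-00N (date)` heading followed by the decision text), the function names, and the choice to work on the file's text rather than the file itself are assumptions for illustration, not the exact format from this project.

```typescript
// Hypothetical helpers: they take and return the file's text, so they stay
// easy to test. The "## D-001 (date)" entry layout is an assumption.
function nextDecisionId(log: string): string {
  const ids: string[] = log.match(/\bD-(\d+)\b/g) || [];
  let max = 0;
  for (const id of ids) {
    const n = parseInt(id.slice(2), 10);
    if (n > max) max = n;
  }
  const next = String(max + 1);
  // Zero-pad to at least three digits: D-001, D-002, ...
  return "D-" + ("000" + next).slice(-Math.max(3, next.length));
}

function appendDecision(log: string, decision: string, date: string): string {
  const entry = "## " + nextDecisionId(log) + " (" + date + ")\n" + decision.trim() + "\n";
  // Append only: existing entries are never rewritten or reordered.
  const existing = log.replace(/\s+$/, "");
  return (existing === "" ? "" : existing + "\n\n") + entry;
}
```

Because every change is a pure append, a `git diff` of the file only ever shows added lines, which keeps review of the agent's decisions cheap.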

2. Sub-agents with hard scope contracts

Naive delegation didn't work. If I said "spawn a sub-agent to investigate X," the sub-agent would investigate X and also try to fix it, refactor a neighboring file, and write a new test. The blast radius was unpredictable.

What worked: every sub-agent gets a contract at spawn time.

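// pseudo-code, not the real API
spawnSubAgent({
  role: "researcher",
  // What the sub-agent may do and what it must hand back
  scope: "READ ONLY. Locate the file that defines X. Return a path + 5-line excerpt.",
  // Tools the runner refuses to run for this role
  forbidden: ["Edit", "Write", "Bash"],
  // Shape the final message must match
  returnSchema: { path: "string", excerpt: "string" },
})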

Two rules I learned to enforce hard:

  • Tool allowlists per role. A researcher can't write. A reviewer can't edit. A planner can't execute. This is the single biggest leverage point.
  • Structured return values. A sub-agent's final message is its return value. If I let it return prose, the orchestrator had to re-parse it and often got it wrong. Forcing a schema cut spurious downstream errors dramatically.
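
The structured-return rule can be sketched as a small validator that the orchestrator runs on a sub-agent's final message. This is a sketch under assumptions: it assumes the final message is a JSON object, it only supports the flat `"string"`-style schema shown in the contract above (plus `"number"` and `"boolean"`), and `parseReturn` is a hypothetical name, not part of any real API.

```typescript
type FieldType = "string" | "number" | "boolean";
type ReturnSchema = { [field: string]: FieldType };
type ReturnValue = { [field: string]: string | number | boolean };

type ParseResult =
  | { status: "ok"; value: ReturnValue }
  | { status: "invalid"; error: string };

// Hypothetical helper: assumes the sub-agent's final message is a JSON object.
function parseReturn(finalMessage: string, schema: ReturnSchema): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(finalMessage);
  } catch (e) {
    return { status: "invalid", error: "final message is not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { status: "invalid", error: "final message is not a JSON object" };
  }
  const record = data as { [key: string]: unknown };
  const value: ReturnValue = {};
  for (const field of Object.keys(schema)) {
    const actual = record[field];
    if (typeof actual !== schema[field]) {
      return {
        status: "invalid",
        error: "field '" + field + "' should be " + schema[field] + ", got " + typeof actual,
      };
    }
    value[field] = actual as string | number | boolean;
  }
  // Fields outside the schema are dropped, so the orchestrator only sees the contract.
  return { status: "ok", value };
}
```

An `invalid` result gives the orchestrator something definite to act on (retry the sub-agent or stop) instead of guessing at what a paragraph of prose meant.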

3. Self-correction via a "reviewer pass"

After every non-trivial change, the orchestrator spawns a reviewer sub-agent with one job:

Read the diff. Find anything that looks wrong, incomplete, or contradicts a decision in DECISIONS.md. Return a list of concerns or an empty list.

If the list is non-empty, the orchestrator either fixes it directly or escalates back to me. This caught maybe 30% of the bugs I'd otherwise have shipped — including a memorable case where the implementer happily added a // TODO: handle errors and the reviewer flagged it three seconds later.

The trick was making the reviewer adversarial by prompt, not by hope:

Default to suspicious. If you can't tell whether something is correct, return it as a concern. False positives are cheap; false negatives ship bugs.
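
Here is one way the orchestrator's side of the reviewer pass could look. The post only says the orchestrator "either fixes it directly or escalates", so the policy below is an assumption: escalate when a concern cites a DECISIONS.md entry or when the fix budget is used up, otherwise attempt a fix. The `Concern` shape, `nextStep`, and `maxFixAttempts` are hypothetical names.

```typescript
// A concern as the reviewer might return it. `decisionId` is set only when
// the concern contradicts an entry in DECISIONS.md (assumed shape).
type Concern = { summary: string; decisionId?: string };

type NextStep =
  | { action: "accept" }
  | { action: "fix"; concerns: Concern[] }
  | { action: "escalate"; concerns: Concern[]; reason: string };

// Assumed policy, not the post's exact rules.
function nextStep(concerns: Concern[], fixAttempts: number, maxFixAttempts: number): NextStep {
  // An empty list is the only way a change is accepted.
  if (concerns.length === 0) {
    return { action: "accept" };
  }
  // A change that contradicts a recorded decision goes back to the human.
  const contradictsDecision = concerns.some((c) => c.decisionId !== undefined);
  if (contradictsDecision) {
    return { action: "escalate", concerns, reason: "contradicts a recorded decision" };
  }
  // Bound the fix loop so reviewer and implementer cannot ping-pong forever.
  if (fixAttempts >= maxFixAttempts) {
    return { action: "escalate", concerns, reason: "fix attempts exhausted" };
  }
  return { action: "fix", concerns };
}
```

Keeping "accept" reachable only through an empty list matches the suspicious-by-default prompt: anything the reviewer is unsure about blocks the change until someone deals with it.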

Lessons Learned

1. State files beat state stores

For an agent that reads its own context every turn, plain Markdown that you can git diff is more debuggable than any DB. I now reach for SQLite/KV only when the agent needs to query state across thousands of records.

2. Tool allowlists are the load-bearing safety mechanism

Prompts that say "don't edit anything" work about 80% of the time. A forbidden: ["Edit"] list enforced by the runner works 100% of the time, because the blocked call never executes. Don't rely on the model's discipline when you can rely on the runner's.
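
A minimal sketch of what runner-side enforcement could look like. Note that it uses an allowlist (anything not listed is denied) rather than the `forbidden` list from the contract pseudo-code; that swap is deliberate, since an allowlist also blocks tools nobody thought to forbid. The role table, the `Read` and `Grep` tool names, and `assertToolAllowed` are assumptions for illustration.

```typescript
type Role = "researcher" | "implementer" | "reviewer";

// Hypothetical role table. Edit, Write and Bash mirror the contract
// pseudo-code; which tools each role gets is an assumption.
const ALLOWED_TOOLS: { [R in Role]: string[] } = {
  researcher: ["Read", "Grep"],
  implementer: ["Read", "Grep", "Edit", "Write", "Bash"],
  reviewer: ["Read", "Grep"],
};

// Runner-side gate: call this before executing any tool request.
// It does not depend on what the prompt said or how the model read it.
function assertToolAllowed(role: Role, tool: string): void {
  if (ALLOWED_TOOLS[role].indexOf(tool) === -1) {
    throw new Error("role '" + role + "' may not call tool '" + tool + "'");
  }
}
```

The check throws rather than returning `false`, so a caller that forgets to look at the result still cannot run the tool.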

3. The sub-agent's return value is data, not a message

Treat sub-agents like RPCs. Define the return schema up front. The orchestrator that parses prose is the orchestrator that breaks on Tuesday.

4. Fail loud, never silent

Early versions would catch errors and "try the next thing." Bugs accumulated invisibly until something visible broke. I rewrote every error path to either succeed cleanly, surface the failure, or stop. A loud failure is worth ten silent retries.
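
The three allowed outcomes (succeed cleanly, surface the failure, or stop) can be sketched as a step runner that never swallows an error. The `Step` and `StepResult` shapes and the `runSteps` name are hypothetical; real steps would likely be asynchronous, which this sketch leaves out to stay short.

```typescript
type StepResult =
  | { status: "ok"; output: string }
  | { status: "failed"; step: string; error: string };

type Step = { name: string; run: () => string };

// Runs steps in order. A thrown error is never swallowed: it becomes a
// "failed" result that names the step, and the remaining steps do not run.
function runSteps(steps: Step[]): StepResult {
  let lastOutput = "";
  for (const step of steps) {
    try {
      lastOutput = step.run();
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      return { status: "failed", step: step.name, error: message };
    }
  }
  return { status: "ok", output: lastOutput };
}
```

There is deliberately no "try the next thing" branch: the caller gets either a clean result or a named failure, and has to decide explicitly what happens next.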

5. Version-stamp your prompts

Prompts that worked on Claude Sonnet 4.5 did not always work the same way on Claude Sonnet 4.6. I now stamp every prompt file with the model + date I last validated it on. When a model upgrade lands, I have a re-validation checklist instead of a mystery.
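
A re-validation checklist can be generated from the stamps themselves. The sketch below assumes a stamp format of my own choosing for illustration (an HTML comment on the first line of each prompt file) and hypothetical function names. The model labels are just strings copied from the prose above, not API model identifiers.

```typescript
type Stamp = { model: string; date: string };

// Assumed stamp format (first line of each prompt file):
//   <!-- validated-on: <model> | <YYYY-MM-DD> -->
function readStamp(promptText: string): Stamp | null {
  const firstLine = promptText.split("\n")[0];
  const m = /^<!--\s*validated-on:\s*(.+?)\s*\|\s*(\d{4}-\d{2}-\d{2})\s*-->\s*$/.exec(firstLine);
  return m ? { model: m[1], date: m[2] } : null;
}

// Returns the prompt files that were never stamped or were last validated
// on a different model, sorted by name: the re-validation checklist.
function needsRevalidation(prompts: { [file: string]: string }, currentModel: string): string[] {
  return Object.keys(prompts)
    .filter((file) => {
      const stamp = readStamp(prompts[file]);
      return stamp === null || stamp.model !== currentModel;
    })
    .sort();
}
```

Treating an unstamped prompt as stale keeps the checklist honest: a prompt nobody validated is not assumed to work.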

What's Next

A few things are on my list:

  • Better reviewer diversity — running 3 reviewers with different lenses (correctness, perf, contract-fit) in parallel and merging their concerns. Early experiments suggest this catches a different class of bug than a single reviewer.
  • Branch-and-merge planning — letting the orchestrator try N approaches in isolated git worktrees and pick the winner, instead of committing to one path.
  • Eval harness for the agent itself — a fixed set of tasks I can run after every prompt change to measure regression.

I'll write up each of these as I learn what actually works (and what just looks clever on a diagram).

Wrap-up

If you're building agents on top of Claude Code or any tool-use LLM, the meta-lesson is this: the model is the easy part; the system around it is the hard part. Treat it like the distributed system it actually is: explicit contracts, structured returns, loud failures, persistent state.

If this resonated:

  • 💡 Follow me on Dev.to for more build logs as I push this further
  • ⚙️ Try Claude Code if you haven't — most of these patterns translate to whatever runner you use
  • 💬 Drop a comment with what's broken in your agent setup — I read every reply and I'm collecting failure modes for a follow-up post

What's the most surprising failure mode you've hit running a coding agent?
