<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yureki_lab</title>
    <description>The latest articles on DEV Community by yureki_lab (@yureki_lab).</description>
    <link>https://dev.to/yureki_lab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960924%2F46fad6c4-8f78-40a1-a230-6bd3e913f37b.png</url>
      <title>DEV Community: yureki_lab</title>
      <link>https://dev.to/yureki_lab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yureki_lab"/>
    <language>en</language>
    <item>
      <title>How I Built a Self-Improving Coding Agent with Claude Code: 5 Lessons After 6 Months</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Fri, 12 Jun 2026 14:47:42 +0000</pubDate>
      <link>https://dev.to/yureki_lab/how-i-built-a-self-improving-coding-agent-with-claude-code-5-lessons-after-6-months-2a8p</link>
      <guid>https://dev.to/yureki_lab/how-i-built-a-self-improving-coding-agent-with-claude-code-5-lessons-after-6-months-2a8p</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I spent 6 months building a self-improving coding agent on top of Claude Code — an orchestrator that hands work to sub-agents, persists its own state, and rewrites its own prompts when it gets things wrong. Here are 5 lessons I wish someone had told me on day one, with the war stories that taught me each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I kept hitting the same wall with single-shot LLM coding sessions.&lt;/p&gt;

&lt;p&gt;The model would do great work for 30 minutes, then forget a decision it made 10 minutes earlier. It would happily "fix" the same bug I'd asked it to leave alone. It would invent a function name, call it three times, and only notice it didn't exist when the test runner blew up. Re-prompting helped — for one turn. Then we drifted again.&lt;/p&gt;

&lt;p&gt;I wanted an agent that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Held its own state&lt;/strong&gt; across long-running work (days, not minutes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delegated&lt;/strong&gt; specialized tasks to sub-agents instead of stuffing one context window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caught its own mistakes&lt;/strong&gt; before I had to point them out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Got better over time&lt;/strong&gt; — not the model, but the system around it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I was already using &lt;code&gt;Claude Code&lt;/code&gt; (the CLI, v0.x at the time) and it had the right primitives: sub-agents, tool calls, hooks. So I started building on top of it instead of reinventing the runner.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;The architecture ended up looking like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    User[User prompt] --&amp;gt; Orchestrator
    Orchestrator --&amp;gt;|reads| State[(State files&amp;lt;br/&amp;gt;STATE.md, DECISIONS.md)]
    Orchestrator --&amp;gt;|delegates| SubA[Researcher sub-agent]
    Orchestrator --&amp;gt;|delegates| SubB[Implementer sub-agent]
    Orchestrator --&amp;gt;|delegates| SubC[Reviewer sub-agent]
    SubA --&amp;gt; Orchestrator
    SubB --&amp;gt; Orchestrator
    SubC --&amp;gt; Orchestrator
    Orchestrator --&amp;gt;|writes| State
    Orchestrator --&amp;gt;|updates| Prompts[Prompt files]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three pieces did most of the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. State as plain Markdown, not a database
&lt;/h3&gt;

&lt;p&gt;I tried a SQLite-backed memory store first. It was technically nicer, but the agent kept misquerying it and producing weird half-correct context. So I ripped it out and replaced it with three plain files the agent reads at the top of every session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;STATE.md&lt;/code&gt; — "where am I right now, what's the next concrete step"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DECISIONS.md&lt;/code&gt; — append-only log of decisions with IDs (D-001, D-002, …)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MEMORY.md&lt;/code&gt; — index of long-lived facts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent reads these first, writes to them as it works, and treats &lt;code&gt;STATE.md&lt;/code&gt; as the single source of truth when files disagree.&lt;/p&gt;

&lt;p&gt;Boring? Yes. But the agent &lt;em&gt;understands&lt;/em&gt; Markdown the way it understands prose — and &lt;code&gt;git diff&lt;/code&gt; makes every state change human-reviewable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sub-agents with hard scope contracts
&lt;/h3&gt;

&lt;p&gt;Naive delegation didn't work. If I said "spawn a sub-agent to investigate X," the sub-agent would investigate X &lt;em&gt;and also&lt;/em&gt; try to fix it, refactor a neighboring file, and write a new test. The blast radius was unpredictable.&lt;/p&gt;

&lt;p&gt;What worked: every sub-agent gets a contract at spawn time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// pseudo-code, not the real API&lt;/span&gt;
&lt;span class="nf"&gt;spawnSubAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;researcher&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;READ ONLY. Locate the file that defines X. Return a path + 5-line excerpt.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;forbidden&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Edit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;returnSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;excerpt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two rules I learned to enforce hard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists per role.&lt;/strong&gt; A researcher can't write. A reviewer can't edit. A planner can't execute. This is the single biggest leverage point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured return values.&lt;/strong&gt; A sub-agent's final message &lt;em&gt;is&lt;/em&gt; its return value. If I let it return prose, the orchestrator had to re-parse it and often got it wrong. Forcing a schema cut spurious downstream errors dramatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Self-correction via a "reviewer pass"
&lt;/h3&gt;

&lt;p&gt;After every non-trivial change, the orchestrator spawns a reviewer sub-agent with one job:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the diff. Find anything that looks wrong, incomplete, or contradicts a decision in &lt;code&gt;DECISIONS.md&lt;/code&gt;. Return a list of concerns or an empty list.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the list is non-empty, the orchestrator either fixes it directly or escalates back to me. This caught maybe 30% of the bugs I'd otherwise have shipped — including a memorable case where the implementer happily added a &lt;code&gt;// TODO: handle errors&lt;/code&gt; and the reviewer flagged it three seconds later.&lt;/p&gt;

&lt;p&gt;The trick was making the reviewer &lt;strong&gt;adversarial by prompt&lt;/strong&gt;, not by hope:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Default to suspicious. If you can't tell whether something is correct, return it as a concern. False positives are cheap; false negatives ship bugs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. State files beat state stores
&lt;/h3&gt;

&lt;p&gt;For an agent that reads its own context every turn, plain Markdown that you can &lt;code&gt;git diff&lt;/code&gt; is more debuggable than any DB. I now reach for SQLite/KV only when the agent needs to query state across thousands of records.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tool allowlists are the load-bearing safety mechanism
&lt;/h3&gt;

&lt;p&gt;Prompts that say "don't edit anything" work about 80% of the time. A &lt;code&gt;forbidden: ["Edit"]&lt;/code&gt; list works 100% of the time. Don't rely on the model's discipline when you can rely on the runner's.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The sub-agent's return value is data, not a message
&lt;/h3&gt;

&lt;p&gt;Treat sub-agents like RPCs. Define the return schema up front. The orchestrator that parses prose is the orchestrator that breaks on Tuesday.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fail loud, never silent
&lt;/h3&gt;

&lt;p&gt;Early versions would catch errors and "try the next thing." Bugs accumulated invisibly until something visible broke. I rewrote every error path to either succeed cleanly, surface the failure, or stop. &lt;strong&gt;A loud failure is worth ten silent retries.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Version-stamp your prompts
&lt;/h3&gt;

&lt;p&gt;Prompts that worked on &lt;code&gt;Claude Sonnet 4.5&lt;/code&gt; did not always work the same way on &lt;code&gt;Claude Sonnet 4.6&lt;/code&gt;. I now stamp every prompt file with the model + date I last validated it on. When a model upgrade lands, I have a re-validation checklist instead of a mystery.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few things are on my list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better reviewer diversity&lt;/strong&gt; — running 3 reviewers with different lenses (correctness, perf, contract-fit) in parallel and merging their concerns. Early experiments suggest this catches a different class of bug than a single reviewer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch-and-merge planning&lt;/strong&gt; — letting the orchestrator try N approaches in isolated git worktrees and pick the winner, instead of committing to one path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eval harness for the agent itself&lt;/strong&gt; — a fixed set of tasks I can run after every prompt change to measure regression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll write up each of these as I learn what actually works (and what just looks clever on a diagram).&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up / CTA
&lt;/h2&gt;

&lt;p&gt;If you're building agents on top of &lt;code&gt;Claude Code&lt;/code&gt; or any tool-use LLM, the meta-lesson is: &lt;strong&gt;the model is the easy part, the system around it is the hard part.&lt;/strong&gt; Treat it like the distributed system it actually is — explicit contracts, structured returns, loud failures, persistent state.&lt;/p&gt;

&lt;p&gt;If this resonated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💡 &lt;strong&gt;Follow me on Dev.to&lt;/strong&gt; for more build logs as I push this further&lt;/li&gt;
&lt;li&gt;⚙️ Try &lt;code&gt;Claude Code&lt;/code&gt; if you haven't — most of these patterns translate to whatever runner you use&lt;/li&gt;
&lt;li&gt;💬 Drop a comment with what's broken in your agent setup — I read every reply and I'm collecting failure modes for a follow-up post&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's the most surprising failure mode &lt;em&gt;you've&lt;/em&gt; hit running a coding agent?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>softwareengineering</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
