<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yureki_lab</title>
    <description>The latest articles on DEV Community by yureki_lab (@yureki_lab).</description>
    <link>https://dev.to/yureki_lab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960924%2F46fad6c4-8f78-40a1-a230-6bd3e913f37b.png</url>
      <title>DEV Community: yureki_lab</title>
      <link>https://dev.to/yureki_lab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yureki_lab"/>
    <language>en</language>
    <item>
      <title>How I Gave Claude Code a Long-Term Memory: 5 Lessons from State Persistence</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Sun, 05 Jul 2026 14:33:06 +0000</pubDate>
      <link>https://dev.to/yureki_lab/how-i-gave-claude-code-a-long-term-memory-5-lessons-from-state-persistence-1k1m</link>
      <guid>https://dev.to/yureki_lab/how-i-gave-claude-code-a-long-term-memory-5-lessons-from-state-persistence-1k1m</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I run a fully autonomous implementation system built on Claude Code that works on my projects around the clock. The single biggest upgrade I ever made to it wasn't a smarter prompt or a better model — it was giving it a &lt;strong&gt;file-based long-term memory&lt;/strong&gt;. This post covers the state persistence design that turned my agent from a goldfish into a coworker, and the 5 lessons I learned getting there. 💡&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every Claude Code session starts with total amnesia.&lt;/p&gt;

&lt;p&gt;That's fine when &lt;em&gt;you&lt;/em&gt; are the memory. You sit there, you remember that yesterday you decided to use SQLite instead of Postgres, you remember that the flaky test is a known issue, and you steer accordingly.&lt;/p&gt;

&lt;p&gt;It falls apart the moment you take the human out of the loop. My system launches Claude Code sessions on a schedule, with nobody watching. Early on, the transcript of any given night looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Night 1:&lt;/strong&gt; Agent investigates a failing build, figures out the root cause is a version mismatch, patches it. Great.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Night 2:&lt;/strong&gt; New session, zero context. Agent sees a &lt;em&gt;related&lt;/em&gt; warning, re-investigates the same dependency tree from scratch, and "fixes" it a second time — differently. Now I have two conflicting fixes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Night 3:&lt;/strong&gt; Agent reverts night 2's change because it looks wrong. Which it is. But so was doing the work three times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wasn't dumb. It was &lt;strong&gt;stateless&lt;/strong&gt;. Every session re-derived the world from the repo, and anything that wasn't in the repo — decisions, dead ends, "we tried this and it broke" — evaporated at session end.&lt;/p&gt;

&lt;p&gt;The constraint that made this interesting: I wanted persistence &lt;strong&gt;without&lt;/strong&gt; adding infrastructure. No vector database, no memory service, no external API. The whole system runs on a single Mac mini (macOS 15, Claude Code, launchd for scheduling), and I wanted memory to survive anything short of disk failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;The answer turned out to be embarrassingly boring: &lt;strong&gt;Markdown files, read at session start, written at decision time.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The three-file core
&lt;/h3&gt;

&lt;p&gt;My persistence layer is built around three files per project, each with exactly one job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
├── CLAUDE.md        # HOW to work: rules, conventions, scope (rarely changes)
├── state/
│   ├── memo.md      # WHERE we are: current status, next action (changes every session)
│   └── tasks.md     # WHAT remains: task list with a status per item
└── decisions.md     # WHY we chose things: append-only decision log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation matters more than the format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The project spec file (&lt;code&gt;CLAUDE.md&lt;/code&gt;)&lt;/strong&gt; holds durable rules — coding conventions, what the agent must never touch, how to report results. Claude Code loads it automatically at session start, which makes it the natural home for anything that should apply to &lt;em&gt;every&lt;/em&gt; session forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The state memo&lt;/strong&gt; is the handoff note. It answers one question: &lt;em&gt;if a new session starts right now with zero context, what does it need to know to continue?&lt;/em&gt; Current phase, last completed step, next step, any active weirdness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The task file&lt;/strong&gt; is a checklist with statuses (&lt;code&gt;todo / doing / done / blocked&lt;/code&gt;). Sessions pick up the first non-done item instead of inventing work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The decision log&lt;/strong&gt; is append-only. Every entry gets an ID (&lt;code&gt;D-001&lt;/code&gt;, &lt;code&gt;D-002&lt;/code&gt;, …), a date, the decision, and — critically — the &lt;em&gt;reason&lt;/em&gt;. You never edit old entries; you supersede them with new ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The write rules
&lt;/h3&gt;

&lt;p&gt;Files alone do nothing. The behavior comes from two rules baked into the agent's instructions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: Read before work.&lt;/strong&gt; Every session starts by reading the spec file, the state memo, and the task file, in that order, &lt;em&gt;before&lt;/em&gt; touching any code. Non-negotiable. A session that skips this is a session that repeats last week's mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Write at decision time, not session end.&lt;/strong&gt; My first version said "update the state memo before you finish." Terrible idea. Sessions die — context fills up, the process gets killed, the machine reboots. Anything buffered for "the end" is exactly what gets lost. Now the rule is: the moment something is decided or discovered, it gets written. The state memo is updated mid-session, every time reality changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Session start] --&amp;gt; B[Read spec + memo + tasks]
    B --&amp;gt; C[Do work]
    C --&amp;gt; D{Decision or discovery?}
    D -- yes --&amp;gt; E[Write memo / decision log immediately]
    E --&amp;gt; C
    D -- no --&amp;gt; C
    C --&amp;gt; F[Session ends - even abruptly]
    F --&amp;gt; G[Next session starts with full context]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The conflict rule
&lt;/h3&gt;

&lt;p&gt;Once you have multiple files, they &lt;em&gt;will&lt;/em&gt; contradict each other. An old strategy note says "we're doing X", the state memo says "we pivoted to Y". An agent without a tie-breaker will pick whichever it read last — or worse, try to satisfy both.&lt;/p&gt;

&lt;p&gt;So there's an explicit priority order written into the spec file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;state memo  &amp;gt;  strategy doc  &amp;gt;  project spec  &amp;gt;  decision log  &amp;gt;  everything else
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The state memo — the most recently updated, most operational file — always wins. The decision log is deliberately near the bottom: it's &lt;em&gt;history&lt;/em&gt;, and history explains the present but doesn't override it. When the agent changes course, it overwrites the current-state file and appends a new decision entry, so the old reasoning survives without ever being mistaken for the current plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it looks like in practice
&lt;/h3&gt;

&lt;p&gt;A real (lightly genericized) state memo from my system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# State memo — updated 2026-07-04 23:10&lt;/span&gt;

&lt;span class="gu"&gt;## Current phase&lt;/span&gt;
Trial run of the nightly build-repair loop (day 12 of 14).

&lt;span class="gu"&gt;## Last completed&lt;/span&gt;
Migrated scheduled jobs to the new machine; all paths rewritten.
Verified one full unattended cycle end-to-end.

&lt;span class="gu"&gt;## Next action&lt;/span&gt;
Watch tonight's run. If the headless timeout recurs, apply D-029
(switch to the local integration instead of the cloud connector).

&lt;span class="gu"&gt;## Active issues&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Cloud connector times out under headless runs (~30% of nights).
  Hypothesis and fix candidates recorded in D-028 / D-029.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A fresh session reading this needs zero archaeology. It knows the phase, the last verified state, the exact next action, and the known failure mode &lt;em&gt;with a pointer to the reasoning&lt;/em&gt;. That last line is the goldfish-to-coworker moment: the agent doesn't re-diagnose the timeout from scratch — it recognizes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Session memory is the bottleneck, not model intelligence.&lt;/strong&gt; I spent weeks tuning prompts to make sessions "smarter" when the actual failure was that each session was brilliant &lt;em&gt;and blind&lt;/em&gt;. Claude Code (I'm on the 2026 releases) is easily capable of multi-week projects — if and only if context survives between sessions. Fix memory before you touch the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Write-on-decision beats write-on-exit, every time.&lt;/strong&gt; Sessions end abruptly more often than you think: context limits, crashes, machine restarts. Across six months of 24/7 operation, mid-session persistence has saved me from replaying lost work dozens of times. If your agent buffers its learnings for a final summary step, you have a memory system that fails exactly when you need it most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Separate "how", "where", and "why" into different files.&lt;/strong&gt; My first attempt was one giant &lt;code&gt;NOTES.md&lt;/code&gt;. It became a swamp — 700 lines where stale plans sat next to current status, and the agent couldn't tell which was which. Splitting by &lt;em&gt;rate of change&lt;/em&gt; (rules change rarely, status changes hourly, decisions only append) fixed it. Each file gets one job and one update pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. An explicit priority order is not optional.&lt;/strong&gt; Contradictions between files aren't an edge case; they're the steady state of any project that pivots. The single line "state memo always wins" has resolved more agent confusion than anything else in my setup. Without it, your agent averages your old plan and your new plan into something that is neither. ⚠️&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Persist decisions and dead ends — not transcripts.&lt;/strong&gt; My instinct was to keep everything. Wrong. Full session logs are write-only garbage: no future session will ever read 40k tokens of history. What's worth its weight in gold is tiny: &lt;em&gt;what we decided, why, and what we tried that didn't work.&lt;/em&gt; "The obvious refactor breaks the release script — see D-014" is one line and saves an entire wasted night. Compression is the feature, not a compromise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The current design is deliberately low-tech, and I'm keeping it that way — but two upgrades are on my list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory review passes&lt;/strong&gt;: a periodic session whose only job is pruning the state files — merging duplicates, deleting stale entries, flagging contradictions before they bite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-project memory&lt;/strong&gt;: my agents work across several codebases, and lessons learned in one ("this library's v3 API silently changed behavior") currently don't transfer. A shared, read-mostly lessons file is the next experiment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll write both up once they've survived a few weeks of unattended runs — I only publish patterns after they've been through real nights, not demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;If your Claude Code sessions feel like hiring a genius with amnesia every morning, the fix probably isn't a better prompt. It's three Markdown files and two write rules.&lt;/p&gt;

&lt;p&gt;If this was useful: &lt;strong&gt;follow me here on Dev.to&lt;/strong&gt; — I'm documenting the whole journey of running a fully autonomous implementation system, war stories included. And if you've built agent memory a different way (vector stores? SQLite? something weirder?), I genuinely want to hear about it in the comments. 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How I Built My First MCP Server for Claude Code (4 Lessons)</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Wed, 17 Jun 2026 14:33:31 +0000</pubDate>
      <link>https://dev.to/yureki_lab/how-i-built-my-first-mcp-server-for-claude-code-4-lessons-293i</link>
      <guid>https://dev.to/yureki_lab/how-i-built-my-first-mcp-server-for-claude-code-4-lessons-293i</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built my first &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; server to give Claude Code read access to a local project knowledge base — and the first version was bad in ways I didn't expect. Here's the minimal TypeScript skeleton that actually works, plus 4 lessons about tool design that I wish someone had told me on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I use Claude Code (v2.x) as my daily driver, and most of the time the built-in file and shell tools are enough. But I had one recurring annoyance: a pile of internal docs — design notes, decision logs, runbooks — scattered across a directory that the agent kept &lt;em&gt;re-reading from scratch&lt;/em&gt; every session. It would &lt;code&gt;grep&lt;/code&gt;, open three files, lose the thread, and &lt;code&gt;grep&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;I wanted the agent to ask one focused question — "what did we decide about retries?" — and get back the relevant note, not a directory listing.&lt;/p&gt;

&lt;p&gt;That's exactly what MCP is for. MCP is an open protocol that lets you expose &lt;strong&gt;tools&lt;/strong&gt; (callable functions), &lt;strong&gt;resources&lt;/strong&gt;, and &lt;strong&gt;prompts&lt;/strong&gt; to an LLM client over a standard interface. Claude Code speaks it natively. So instead of teaching the model to navigate my files, I could hand it a &lt;code&gt;search_notes&lt;/code&gt; tool and a &lt;code&gt;get_note&lt;/code&gt; tool and let it do the obvious thing.&lt;/p&gt;

&lt;p&gt;The interesting constraint: the model only uses a tool well if the &lt;em&gt;schema&lt;/em&gt; tells it exactly when and how. A tool the model calls with garbage arguments is worse than no tool at all. That's where the real work was.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;The whole thing is a small Node.js server (Node 22.x) using the official &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt;. Here's the architecture — it's deliberately boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Claude Code] -- stdio / JSON-RPC --&amp;gt; B[MCP Server]
    B --&amp;gt; C[search_notes]
    B --&amp;gt; D[get_note]
    C --&amp;gt; E[(Local notes dir)]
    D --&amp;gt; E
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server talks to Claude Code over &lt;strong&gt;stdio&lt;/strong&gt; (standard in/out), which is the simplest transport — no ports, no auth, no network. The client launches your server as a subprocess and pipes JSON-RPC over the pipe. For a local, single-user tool this is exactly what you want.&lt;/p&gt;

&lt;h3&gt;
  
  
  The minimal skeleton
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Server&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioServerTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;CallToolRequestSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;ListToolsRequestSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/types.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;notes-server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.3.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ListToolsRequestSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_notes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search the project knowledge base by keyword. Returns up to 5 &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;matching note titles with a one-line snippet. Use this FIRST to &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;find a note before fetching its full text.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Keyword or phrase to search for&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get_note&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Fetch the full Markdown of one note by its exact id, as returned &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;by search_notes. Do not guess ids.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The note id from search_notes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the descriptions read like instructions to a teammate, not API docs. That's the single highest-leverage thing in this file. More on that below.&lt;/p&gt;

&lt;h3&gt;
  
  
  The call handler
&lt;/h3&gt;

&lt;p&gt;The handler is where I learned to be careful about failure. The first version threw exceptions on a missing file. Bad idea — a thrown error reads to the agent like the &lt;em&gt;tool&lt;/em&gt; is broken, so it gives up instead of correcting course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CallToolRequestSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_notes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;searchNotes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No notes matched. Try a broader single keyword.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;formatHits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get_note&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadNote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`No note with id "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;". Call search_notes to get a valid id.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;}],&lt;/span&gt;
          &lt;span class="na"&gt;isError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Unknown tool: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="na"&gt;isError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Tool failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="na"&gt;isError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioServerTransport&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key move: errors come back as &lt;strong&gt;content with &lt;code&gt;isError: true&lt;/code&gt;&lt;/strong&gt;, with a sentence telling the model what to do next ("call search_notes to get a valid id"). The agent reads that and self-corrects, usually on the very next turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wiring it into Claude Code
&lt;/h3&gt;

&lt;p&gt;One line in the MCP config registers it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/abs/path/to/dist/index.js"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Claude Code, and &lt;code&gt;search_notes&lt;/code&gt; / &lt;code&gt;get_note&lt;/code&gt; show up as callable tools. That's the whole loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The tool description IS the prompt — write it for the model, not for humans
&lt;/h3&gt;

&lt;p&gt;My first &lt;code&gt;search_notes&lt;/code&gt; description was literally &lt;code&gt;"Searches notes."&lt;/code&gt; The model called it with full sentences, paragraph queries, sometimes the entire user message as the query. Garbage in, garbage out.&lt;/p&gt;

&lt;p&gt;When I rewrote the description to say &lt;em&gt;what it returns&lt;/em&gt;, &lt;em&gt;when to use it&lt;/em&gt;, and &lt;em&gt;what to do next&lt;/em&gt; ("Use this FIRST… then get_note"), the call quality jumped immediately. Treat every &lt;code&gt;description&lt;/code&gt; field as a mini system prompt. The model has no other signal about your intent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you find yourself debugging "why won't the agent use my tool right?", the answer is almost always in the description, not the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Return structured errors, never stack traces
&lt;/h3&gt;

&lt;p&gt;A raw exception bubbling up makes the agent think the tool is dead. A short, actionable error message makes it a recoverable hiccup. &lt;code&gt;isError: true&lt;/code&gt; + "here's the valid next step" turned my flakiest tool into the most reliable one. Think of error messages as &lt;strong&gt;instructions for recovery&lt;/strong&gt;, because to the model, that's exactly what they are.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scope tools narrowly — two small tools beat one clever one
&lt;/h3&gt;

&lt;p&gt;I was tempted to build a single &lt;code&gt;notes(action, ...)&lt;/code&gt; mega-tool with an &lt;code&gt;action&lt;/code&gt; enum. Don't. The model reasons about &lt;em&gt;distinct&lt;/em&gt; tools far better than about a polymorphic one with conditional arguments. Two tools with clear names and required fields gave me dramatically more predictable behavior than one tool with five optional params. Narrow tools are also easier to grant or withhold — a real win when you care about what the agent can touch.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Version-stamp the server and the schema
&lt;/h3&gt;

&lt;p&gt;I bumped &lt;code&gt;version: "0.3.0"&lt;/code&gt; and added a stable &lt;code&gt;id&lt;/code&gt; contract to &lt;code&gt;get_note&lt;/code&gt; ("ids come from search_notes, don't guess"). When I later changed the search output format, the explicit contract meant I knew exactly what the model depended on. Schemas rot the same way APIs do — the model has memorized your old shape from earlier in the session. A version field and an explicit contract make breaking changes visible instead of mysterious.&lt;/p&gt;

&lt;p&gt;A few smaller things that bit me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio means stdout is sacred.&lt;/strong&gt; Anything you &lt;code&gt;console.log&lt;/code&gt; to stdout corrupts the JSON-RPC stream. Log to &lt;strong&gt;stderr&lt;/strong&gt; (&lt;code&gt;console.error&lt;/code&gt;) or a file. This cost me an hour of "why is the connection dropping?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep tool count low.&lt;/strong&gt; Every tool you expose is tokens in the model's context on every turn. I capped myself at the two tools I actually needed, and the agent's reasoning stayed tight because of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make &lt;code&gt;required&lt;/code&gt; fields actually required.&lt;/strong&gt; Marking &lt;code&gt;query&lt;/code&gt; and &lt;code&gt;id&lt;/code&gt; as &lt;code&gt;required&lt;/code&gt; in the JSON Schema stopped the model from calling tools with empty arguments "just to see what happens." The schema is a contract the client enforces before your code ever runs — lean on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test the server by hand before wiring it in.&lt;/strong&gt; You can pipe a JSON-RPC &lt;code&gt;tools/list&lt;/code&gt; request straight into the process over stdin and eyeball the response. Doing that once saved me from chasing a config problem that was actually a serialization bug in my own &lt;code&gt;formatHits&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Three directions I'm exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;list_recent_notes&lt;/code&gt; resource so the agent can browse without searching first.&lt;/li&gt;
&lt;li&gt;Swapping the keyword search for a small embedding index — same two-tool interface, smarter matching underneath.&lt;/li&gt;
&lt;li&gt;An HTTP transport version so the same server can back a shared team setup instead of a local subprocess.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two-tool interface stays stable through all of that, which is the whole point of designing the schema carefully up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up / CTA
&lt;/h2&gt;

&lt;p&gt;Building an MCP server turned out to be 20% protocol and 80% &lt;strong&gt;interface design for a reader who happens to be a language model&lt;/strong&gt;. The SDK gets you a working server in 40 lines; the lessons are all about making tools the model uses &lt;em&gt;correctly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're already using Claude Code, try writing one tiny MCP server this week — expose one thing you keep re-explaining to the agent and watch how much friction disappears.&lt;/p&gt;

&lt;p&gt;💡 If this was useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Follow me here on Dev.to&lt;/strong&gt; for more build logs on agent tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try Claude Code&lt;/strong&gt; if you haven't — pairing it with a custom MCP server is where it gets genuinely fun.&lt;/li&gt;
&lt;li&gt;Drop a comment with the first tool &lt;em&gt;you'd&lt;/em&gt; expose. I'm collecting ideas. 🚀&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>typescript</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Stopped Babysitting Claude Code: 5 Patterns for 24/7 AI Workers</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:33:34 +0000</pubDate>
      <link>https://dev.to/yureki_lab/how-i-stopped-babysitting-claude-code-5-patterns-for-247-ai-workers-270n</link>
      <guid>https://dev.to/yureki_lab/how-i-stopped-babysitting-claude-code-5-patterns-for-247-ai-workers-270n</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I spent a month babysitting Claude Code runs — watching every prompt, every tool call, every "are you sure?" If I stepped away for an hour, things either silently stalled or did something I didn't want. Here are 5 patterns that finally got me to a place where my agent runs unattended for a full day and I'm not nervous about it. The big one is treating the agent like a flaky background job, not a chat partner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When I first started using Claude Code (this was on &lt;code&gt;claude-sonnet-4-6&lt;/code&gt;, then &lt;code&gt;claude-opus-4-7&lt;/code&gt;), I treated it like an ultra-fast pair programmer. I'd kick off a task, watch it tear through files, and step in whenever it got weird. That's fine for 20-minute sessions.&lt;/p&gt;

&lt;p&gt;It falls apart fast when you want the agent to run for hours.&lt;/p&gt;

&lt;p&gt;The specific failure mode I kept hitting: I'd start a long task, walk away for 30 minutes, and come back to one of three states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Silent stall.&lt;/strong&gt; A tool call was waiting on a permission prompt I never approved. Nothing was happening. The clock was burning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong direction.&lt;/strong&gt; The agent had decided — totally reasonably from its point of view — to refactor a directory I didn't want touched. The diff was huge. Reverting was annoying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lost state.&lt;/strong&gt; The session had been compressed or context-trimmed, and the agent had forgotten which step we were on. It cheerfully started over from a stale assumption.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are bugs in Claude Code. They're consequences of treating an autonomous agent like an interactive REPL. I was the bottleneck because I was the supervisor. So I tried to stop being the supervisor.&lt;/p&gt;

&lt;p&gt;This is the build log of how I did that. The goal: a Claude Code worker I can leave alone for a full workday, that either does the right thing or fails loudly enough that I don't waste a day of compute on garbage.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;Five patterns. Each one came out of a specific thing breaking. They compose.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Idempotent restart from a state file
&lt;/h3&gt;

&lt;p&gt;The single biggest unlock. The agent should be able to re-read its own state file and pick up where it left off, without remembering anything from the previous session.&lt;/p&gt;

&lt;p&gt;I keep a &lt;code&gt;STATE.md&lt;/code&gt; in the repo root. It's plain markdown, three sections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Current goal&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;one&lt;/span&gt; &lt;span class="na"&gt;sentence&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Next concrete action&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;one&lt;/span&gt; &lt;span class="na"&gt;specific&lt;/span&gt; &lt;span class="na"&gt;verb-led&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;anything&lt;/span&gt; &lt;span class="na"&gt;load-bearing&lt;/span&gt; &lt;span class="na"&gt;that&lt;/span&gt; &lt;span class="na"&gt;isn&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="na"&gt;t&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt; &lt;span class="na"&gt;the&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt; &lt;span class="na"&gt;yet&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent reads this on every startup. Every time it finishes a meaningful step, it rewrites it. If the run dies mid-task, the next invocation reads the same file and resumes.&lt;/p&gt;

&lt;p&gt;The key word is &lt;strong&gt;idempotent&lt;/strong&gt;. The agent shouldn't care whether this is the first run, the fifth, or the recovery after a crash. The state file is the source of truth; the chat history is just working memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pseudo-runner&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; task_done&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;claude &lt;span class="nt"&gt;--read&lt;/span&gt; STATE.md &lt;span class="nt"&gt;--do-next-action&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What surprised me: once the state file existed, I stopped needing to scroll the chat to remember what we were doing. I just opened &lt;code&gt;STATE.md&lt;/code&gt;. The discipline forced clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Journal every external action
&lt;/h3&gt;

&lt;p&gt;The second pattern: write an append-only log of anything the agent does that touches the outside world.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-06-10T09:14Z run-id=a3f committed=feat: add retry to fetcher
2026-06-10T09:18Z run-id=a3f pushed=origin/main
2026-06-10T09:22Z run-id=a3f opened-pr=#412
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pure text, one event per line, timestamp + run-id + what happened. No JSON, no nested structure, no schema migrations later. I keep this in a &lt;code&gt;journal.log&lt;/code&gt; at the repo root and grep it when I'm trying to figure out what the agent did overnight.&lt;/p&gt;

&lt;p&gt;The boring format is the feature. When something goes wrong at 3am and I want to know which commit was the agent's last sane state, I'm not parsing JSON in my head — I'm just reading lines.&lt;/p&gt;

&lt;p&gt;This is also how I caught the "wrong direction" failures earlier. Reading two days of journal entries is way faster than diffing the repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Narrow tool grants by default, escalate by exception
&lt;/h3&gt;

&lt;p&gt;I used to give my agent every tool. Read, Write, Bash, Web — all on, all the time. I figured "smart agent, smart enough not to do anything dumb."&lt;/p&gt;

&lt;p&gt;It was dumb. Not because of the agent — because the &lt;em&gt;blast radius&lt;/em&gt; of any single tool call was the entire filesystem and the entire internet.&lt;/p&gt;

&lt;p&gt;Now I scope tools per task. A "review the diff" task gets Read + Grep. No Write. No Bash. A "run the test suite" task gets Bash but only with a whitelist of commands. A "post a comment to a PR" task gets the GitHub MCP server and that's it.&lt;/p&gt;

&lt;p&gt;When the agent needs something not in scope, it stops and asks. I'd rather have it pause and request a tool than improvise around the missing one. The pause is recoverable. Improvising on a file you didn't want it to touch is not.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Bad: agent has Bash with no constraints&lt;/span&gt;
&lt;span class="gh"&gt;# Good: agent has Bash restricted to: pytest, ruff, mypy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of those things that sounds fussy until the first time it saves you from a &lt;code&gt;rm -rf node_modules&lt;/code&gt; you didn't want.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. An explicit error budget per run
&lt;/h3&gt;

&lt;p&gt;Long-running agents fail. Sometimes a tool call times out. Sometimes a flaky test fails. Sometimes a fetch returns 502. The question isn't whether — it's how the agent reacts.&lt;/p&gt;

&lt;p&gt;The old default: retry forever, gradually descend into nonsense. By turn 30, the agent is tilting at the windmill, convinced the problem is somehow with the test runner.&lt;/p&gt;

&lt;p&gt;I now give every run a hard budget:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 retries per tool call&lt;/strong&gt;, then escalate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 unrelated errors in a session&lt;/strong&gt;, then halt and write a state-dump&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No silent backoff loops&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the budget is blown, the agent writes the partial state to &lt;code&gt;STATE.md&lt;/code&gt;, leaves a note in &lt;code&gt;journal.log&lt;/code&gt;, and exits non-zero. The next scheduled run reads the state and either retries the same step (if I haven't intervened) or skips it (if I marked it dead).&lt;/p&gt;

&lt;p&gt;This pattern is borrowed wholesale from how SREs run flaky services. AI agents are flaky services. Treat them like flaky services.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Dry-run wrappers for anything destructive
&lt;/h3&gt;

&lt;p&gt;The last pattern is the cheapest and the one I should have started with. Anything that mutates external state — git push, posting to an API, sending an email — goes through a wrapper that supports a &lt;code&gt;--dry-run&lt;/code&gt; flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;publish_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dry_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dry_run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DRY] would POST &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dry-run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/dry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;real_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent runs with &lt;code&gt;DRY_RUN=1&lt;/code&gt; by default during development. I flip it off when I trust the run. This means the entire pipeline is exercised end-to-end on every change, but nothing actually ships until I unlock it.&lt;/p&gt;

&lt;p&gt;You'd think this is obvious. It was not obvious to me until I had the agent send three identical messages to a Slack channel because of a retry loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Five takeaways, written so I'd quote them at past-me.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent isn't the worker — the state file is.&lt;/strong&gt; The chat session is ephemeral. Anything load-bearing has to live in a file the next invocation can read. Treat the agent as a stateless function over &lt;code&gt;(STATE.md, repo)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Loud failures beat quiet ones, every single time.&lt;/strong&gt; I used to tune my agent toward "don't crash." Wrong goal. The right goal is "crash visibly the moment the situation gets unfamiliar." A crash leaves a tombstone. A drift leaves a mess.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool surface = risk surface.&lt;/strong&gt; Every tool you give the agent is a thing it can do without asking. Default to narrow. Expand by exception, with a reason. This is the same logic as Unix permissions and the same logic as IAM. It applies to agents too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget the failure modes, not the work.&lt;/strong&gt; Don't tell the agent "do this task." Tell it "do this task, but stop if you've retried 3 times or hit 2 unrelated errors." The work limit doesn't matter — the failure limit does.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dry-run mode pays for itself in one incident.&lt;/strong&gt; If you're shipping anything destructive, build the dry-run path first. Then make &lt;code&gt;DRY_RUN=1&lt;/code&gt; the default in your shell profile and forget about it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The meta-lesson behind all five: &lt;strong&gt;stop thinking of the agent as a smarter you, and start thinking of it as an unsupervised intern with root.&lt;/strong&gt; The interventions that work are the same ones you'd use on a junior engineer running their first on-call shift: a clear handoff doc, an append-only log of what they did, scoped permissions, a stop rule, and a sandbox for the scary stuff.&lt;/p&gt;

&lt;p&gt;The agent is good at the work. It is not good at noticing it's lost. Your job is the noticing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Two things I'm playing with now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-run learning.&lt;/strong&gt; Right now each run reads &lt;code&gt;STATE.md&lt;/code&gt; and starts fresh. I want a longer-lived &lt;code&gt;LESSONS.md&lt;/code&gt; the agent writes to when it discovers something durable about the codebase (a flaky test, an undocumented constraint). The next run reads it as context. Early experiments suggest this is a big lever — if the agent doesn't have to rediscover every quirk every session, runs get a lot cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A "second opinion" reviewer agent.&lt;/strong&gt; Before any destructive step, a separate Claude instance with read-only access reviews the planned action and either approves or vetoes. This is overkill for small changes, but for the dangerous 5% of operations it's worth the extra tokens. I'm specifically interested in whether two independent reviewers catch more than one — the multi-perspective verify pattern in disguise.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If either of those goes somewhere interesting, I'll write it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up / CTA
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Code and the run-time is starting to creep past "one focused session" — start with the state file pattern. It's the cheapest one on this list and it unblocks all the others. Everything else is a refinement.&lt;/p&gt;

&lt;p&gt;If you found this useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Follow me on Dev.to&lt;/strong&gt; — I'm writing up more build-in-public posts on running Claude Code in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try Claude Code&lt;/strong&gt; if you haven't — it's free to get started and the agent-loop UX is genuinely different from a chat window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop a comment&lt;/strong&gt; with the pattern you're using to keep your own agent runs honest. I'd love to steal it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Versions used while writing this: Claude Code v0.x (current as of mid-2026), &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; and &lt;code&gt;claude-opus-4-7&lt;/code&gt;. The patterns aren't model-specific — they're about how you wrap the agent, not which model you point it at.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>5 Mistakes I Made Designing My First Claude Code Sub-Agent Pipeline</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:33:46 +0000</pubDate>
      <link>https://dev.to/yureki_lab/5-mistakes-i-made-designing-my-first-claude-code-sub-agent-pipeline-1f3h</link>
      <guid>https://dev.to/yureki_lab/5-mistakes-i-made-designing-my-first-claude-code-sub-agent-pipeline-1f3h</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I spent a weekend wiring up my first multi-agent pipeline with Claude Code, and almost every design choice I made was wrong. Here are the 5 mistakes — one monolithic prompt, free-text returns, eager barriers, ignored concurrency caps, and no dedup — and how I fixed each one. If you're about to fan out sub-agents for the first time, read this before you ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I wanted to run a "find bugs in this repo" sweep across a medium-sized codebase. The naive version was easy: one Claude Code session, one big prompt, walk the tree. But it was slow, it ran out of context partway through, and the output was a wall of unstructured prose I had to re-parse by hand.&lt;/p&gt;

&lt;p&gt;So I rewrote it as a fan-out: spawn N sub-agents, each focused on a slice, collect their findings, verify, dedupe, report. Classic map-reduce on top of Claude Code sub-agents.&lt;/p&gt;

&lt;p&gt;The first version "worked" — it produced output. But the wall-clock was almost as bad as the monolith, half the findings were duplicates, and roughly 1 in 5 sub-agent results couldn't be parsed at all. I rebuilt it three times in a week. These are the mistakes I keep seeing in my own code and in other people's.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Versions: Claude Code v2.x, Node.js 22.x. The patterns are general but the API surface I reference is from late 2025/early 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;I'll walk through the 5 mistakes one by one. Each one has the "before" sketch I actually wrote, then the version I landed on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 1 — One monolithic prompt for every sub-agent
&lt;/h3&gt;

&lt;p&gt;My first fan-out gave every sub-agent the same prompt and just varied the input slice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;slices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slice&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Find bugs in this code. Look for correctness issues,
              security holes, performance problems, dead code,
              and missing error handling. Be thorough.\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks tidy. It's also why my outputs were noisy. Every agent tried to be a generalist, every agent re-discovered the same shallow issues (unused variables, missing null checks), and nobody went deep on anything.&lt;/p&gt;

&lt;p&gt;The fix was to give each agent &lt;strong&gt;a single lens&lt;/strong&gt;. Same input, different prompts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;LENSES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;correctness&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Find logic bugs. Ignore style.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;security&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Find injection, auth, secret-handling bugs.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;concurrency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Find race conditions and ordering bugs.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;LENSES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lens&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;slices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slice&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;lens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same agent count, vastly better signal. Diversity beats redundancy when you're searching for something you can't fully specify upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2 — Free-text returns
&lt;/h3&gt;

&lt;p&gt;I let agents return prose, then tried to regex out the findings. Roughly 20% of returns had a header I didn't anticipate, or a numbering scheme that broke my parser, or a "By the way…" tail that polluted the next stage.&lt;/p&gt;

&lt;p&gt;The fix: enforce a schema at the tool layer. Most agent frameworks now support forcing the agent to call a structured-output tool. In Claude Code's workflow primitives this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;FINDING_SCHEMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;findings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;line&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;severity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;description&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;integer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;high&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;FINDING_SCHEMA&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// result is already a typed object — no parsing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent retries internally on schema mismatch, so by the time I get the object back, it's valid. This single change cut my downstream code in half and eliminated the parse-error tax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 3 — Eager barriers between stages
&lt;/h3&gt;

&lt;p&gt;My original pipeline looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reviews&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reviewAgent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;// BARRIER&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;verifyAgent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;// BARRIER&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks clean. It also means the verify stage cannot start until &lt;strong&gt;every&lt;/strong&gt; review finishes. If one slow reviewer takes 3x the median, the verifier sits idle for that whole stretch.&lt;/p&gt;

&lt;p&gt;The fix is to pipeline: each item flows through all stages independently. Item A can be in verify while item B is still in review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cur&lt;/span&gt;
  &lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reviewAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;verifyAgent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wall-clock dropped from "sum of slowest per stage" to "slowest single-item chain." On a 12-item run that's the difference between ~90s and ~35s for me.&lt;/p&gt;

&lt;p&gt;A barrier is only correct when stage N actually needs all of stage N-1 (dedup across the full set, early-exit if zero findings, cross-item comparison). Otherwise: pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 4 — Ignoring the concurrency cap
&lt;/h3&gt;

&lt;p&gt;I gleefully shoved 80 items into &lt;code&gt;Promise.all&lt;/code&gt;. The runner happily accepted them, then quietly queued 70 of them while running 10 at a time. My logs showed "80 agents started" — but only 10 were actually doing work, and I had no idea why my wall-clock was so bad.&lt;/p&gt;

&lt;p&gt;Two fixes, depending on the situation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Know your cap.&lt;/strong&gt; Most agent runners have a concurrency cap (often &lt;code&gt;min(16, CPU - 2)&lt;/code&gt;). Anything above that queues. If you want to reason about wall-clock, treat the cap as your effective batch size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-size the fan-out.&lt;/strong&gt; I now scale fan-out to the work budget, not "as wide as possible":
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BATCH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;MAX_CONCURRENCY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Running &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;BATCH&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; concurrent; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;BATCH&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; queued.`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logging the queued count was the single most useful debug change I made all month. It turned an invisible bottleneck into a number.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 5 — No dedup before verification
&lt;/h3&gt;

&lt;p&gt;Verification is the expensive stage. Each verifier read files, ran tools, asked Claude to refute the claim. So when my finders surfaced "this function lacks input validation" from three different lenses, I was paying 3x for the same finding.&lt;/p&gt;

&lt;p&gt;The fix is dumb-simple — dedup in plain code between fan-out stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allFindings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;verifyAgent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The temptation is to make the dedup itself an agent ("ask Claude to merge similar findings"). Don't. A &lt;code&gt;Set&lt;/code&gt; and a stable key are faster, deterministic, and free. Reach for an agent only when the comparison genuinely needs judgment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The shape I landed on
&lt;/h3&gt;

&lt;p&gt;After all five fixes, the pipeline looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Slices] --&amp;gt; B[Lens 1: correctness]
    A --&amp;gt; C[Lens 2: security]
    A --&amp;gt; D[Lens 3: concurrency]
    B --&amp;gt; E[Dedup]
    C --&amp;gt; E
    D --&amp;gt; E
    E --&amp;gt; F[Verify - pipelined]
    F --&amp;gt; G[Report]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiple lenses (mistake 1), schema-enforced returns (2), pipelined verify with no barrier (3), batch-aware concurrency (4), dedup before the expensive stage (5).&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Diversity beats redundancy.&lt;/strong&gt; If you're spawning N agents on the same problem, give them N different angles. N copies of the same prompt is wasted spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schemas are not bureaucracy — they're a parser.&lt;/strong&gt; Forcing structured output at the tool layer is the single highest-leverage change you can make to a multi-agent system. Stop regexing prose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline by default, barrier only when you must.&lt;/strong&gt; Most "I'll await everything then start the next stage" code is a wall-clock tax for no reason. The barrier is correct only when stage N needs cross-item context from all of stage N-1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The concurrency cap is real and silent.&lt;/strong&gt; If your runner queues, you need to know. Log the queued count. Right-size fan-out to the cap, not your ambition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup with code, not agents.&lt;/strong&gt; Cheap deterministic operations (filter, group, dedup, sort) belong in your script, not in an LLM call. Reserve the agents for the judgment calls.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The version I have running now still has a verification stage that's overly trusting — if a finder is confidently wrong, a single verifier can rubber-stamp it. I'm experimenting with an adversarial panel: three skeptics per finding, each prompted to refute, kill if a majority refute. Early results look promising but the cost goes up linearly, so I want to measure precision/recall properly before I write that one up.&lt;/p&gt;

&lt;p&gt;I'm also tracking how much of the pipeline's wall-clock is the slowest single agent in each stage. If it's consistently one outlier, the right move is probably a timeout-and-retry rather than waiting it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up / CTA
&lt;/h2&gt;

&lt;p&gt;If you're building anything with Claude Code sub-agents, try the schema fix first — it'll pay for itself within a day. The pipelining change is bigger but invasive; do it once your output is reliable enough to trust.&lt;/p&gt;

&lt;p&gt;If this was useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow me on Dev.to — I'm writing up more agent-design war stories as I hit them.&lt;/li&gt;
&lt;li&gt;If you haven't tried &lt;a href="https://www.anthropic.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; yet, the sub-agent + workflow primitives are what made all of this even possible.&lt;/li&gt;
&lt;li&gt;Hit me up in the comments with your own multi-agent mistakes — I want to collect a "things that bit us all" list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build in public, break in private. 🛠️&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>programming</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How I Built a Self-Improving Coding Agent with Claude Code: 5 Lessons After 6 Months</title>
      <dc:creator>yureki_lab</dc:creator>
      <pubDate>Fri, 12 Jun 2026 14:47:42 +0000</pubDate>
      <link>https://dev.to/yureki_lab/how-i-built-a-self-improving-coding-agent-with-claude-code-5-lessons-after-6-months-2a8p</link>
      <guid>https://dev.to/yureki_lab/how-i-built-a-self-improving-coding-agent-with-claude-code-5-lessons-after-6-months-2a8p</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I spent 6 months building a self-improving coding agent on top of Claude Code — an orchestrator that hands work to sub-agents, persists its own state, and rewrites its own prompts when it gets things wrong. Here are 5 lessons I wish someone had told me on day one, with the war stories that taught me each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I kept hitting the same wall with single-shot LLM coding sessions.&lt;/p&gt;

&lt;p&gt;The model would do great work for 30 minutes, then forget a decision it made 10 minutes earlier. It would happily "fix" the same bug I'd asked it to leave alone. It would invent a function name, call it three times, and only notice it didn't exist when the test runner blew up. Re-prompting helped — for one turn. Then we drifted again.&lt;/p&gt;

&lt;p&gt;I wanted an agent that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Held its own state&lt;/strong&gt; across long-running work (days, not minutes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delegated&lt;/strong&gt; specialized tasks to sub-agents instead of stuffing one context window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caught its own mistakes&lt;/strong&gt; before I had to point them out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Got better over time&lt;/strong&gt; — not the model, but the system around it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I was already using &lt;code&gt;Claude Code&lt;/code&gt; (the CLI, v0.x at the time) and it had the right primitives: sub-agents, tool calls, hooks. So I started building on top of it instead of reinventing the runner.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Solved It
&lt;/h2&gt;

&lt;p&gt;The architecture ended up looking like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    User[User prompt] --&amp;gt; Orchestrator
    Orchestrator --&amp;gt;|reads| State[(State files&amp;lt;br/&amp;gt;STATE.md, DECISIONS.md)]
    Orchestrator --&amp;gt;|delegates| SubA[Researcher sub-agent]
    Orchestrator --&amp;gt;|delegates| SubB[Implementer sub-agent]
    Orchestrator --&amp;gt;|delegates| SubC[Reviewer sub-agent]
    SubA --&amp;gt; Orchestrator
    SubB --&amp;gt; Orchestrator
    SubC --&amp;gt; Orchestrator
    Orchestrator --&amp;gt;|writes| State
    Orchestrator --&amp;gt;|updates| Prompts[Prompt files]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three pieces did most of the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. State as plain Markdown, not a database
&lt;/h3&gt;

&lt;p&gt;I tried a SQLite-backed memory store first. It was technically nicer, but the agent kept misquerying it and producing weird half-correct context. So I ripped it out and replaced it with three plain files the agent reads at the top of every session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;STATE.md&lt;/code&gt; — "where am I right now, what's the next concrete step"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DECISIONS.md&lt;/code&gt; — append-only log of decisions with IDs (D-001, D-002, …)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MEMORY.md&lt;/code&gt; — index of long-lived facts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent reads these first, writes to them as it works, and treats &lt;code&gt;STATE.md&lt;/code&gt; as the single source of truth when files disagree.&lt;/p&gt;

&lt;p&gt;Boring? Yes. But the agent &lt;em&gt;understands&lt;/em&gt; Markdown the way it understands prose — and &lt;code&gt;git diff&lt;/code&gt; makes every state change human-reviewable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sub-agents with hard scope contracts
&lt;/h3&gt;

&lt;p&gt;Naive delegation didn't work. If I said "spawn a sub-agent to investigate X," the sub-agent would investigate X &lt;em&gt;and also&lt;/em&gt; try to fix it, refactor a neighboring file, and write a new test. The blast radius was unpredictable.&lt;/p&gt;

&lt;p&gt;What worked: every sub-agent gets a contract at spawn time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// pseudo-code, not the real API&lt;/span&gt;
&lt;span class="nf"&gt;spawnSubAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;researcher&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;READ ONLY. Locate the file that defines X. Return a path + 5-line excerpt.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;forbidden&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Edit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;returnSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;excerpt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two rules I learned to enforce hard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists per role.&lt;/strong&gt; A researcher can't write. A reviewer can't edit. A planner can't execute. This is the single biggest leverage point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured return values.&lt;/strong&gt; A sub-agent's final message &lt;em&gt;is&lt;/em&gt; its return value. If I let it return prose, the orchestrator had to re-parse it and often got it wrong. Forcing a schema cut spurious downstream errors dramatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Self-correction via a "reviewer pass"
&lt;/h3&gt;

&lt;p&gt;After every non-trivial change, the orchestrator spawns a reviewer sub-agent with one job:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the diff. Find anything that looks wrong, incomplete, or contradicts a decision in &lt;code&gt;DECISIONS.md&lt;/code&gt;. Return a list of concerns or an empty list.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the list is non-empty, the orchestrator either fixes it directly or escalates back to me. This caught maybe 30% of the bugs I'd otherwise have shipped — including a memorable case where the implementer happily added a &lt;code&gt;// TODO: handle errors&lt;/code&gt; and the reviewer flagged it three seconds later.&lt;/p&gt;

&lt;p&gt;The trick was making the reviewer &lt;strong&gt;adversarial by prompt&lt;/strong&gt;, not by hope:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Default to suspicious. If you can't tell whether something is correct, return it as a concern. False positives are cheap; false negatives ship bugs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. State files beat state stores
&lt;/h3&gt;

&lt;p&gt;For an agent that reads its own context every turn, plain Markdown that you can &lt;code&gt;git diff&lt;/code&gt; is more debuggable than any DB. I now reach for SQLite/KV only when the agent needs to query state across thousands of records.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tool allowlists are the load-bearing safety mechanism
&lt;/h3&gt;

&lt;p&gt;Prompts that say "don't edit anything" work about 80% of the time. A &lt;code&gt;forbidden: ["Edit"]&lt;/code&gt; list works 100% of the time. Don't rely on the model's discipline when you can rely on the runner's.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The sub-agent's return value is data, not a message
&lt;/h3&gt;

&lt;p&gt;Treat sub-agents like RPCs. Define the return schema up front. The orchestrator that parses prose is the orchestrator that breaks on Tuesday.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fail loud, never silent
&lt;/h3&gt;

&lt;p&gt;Early versions would catch errors and "try the next thing." Bugs accumulated invisibly until something visible broke. I rewrote every error path to either succeed cleanly, surface the failure, or stop. &lt;strong&gt;A loud failure is worth ten silent retries.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Version-stamp your prompts
&lt;/h3&gt;

&lt;p&gt;Prompts that worked on &lt;code&gt;Claude Sonnet 4.5&lt;/code&gt; did not always work the same way on &lt;code&gt;Claude Sonnet 4.6&lt;/code&gt;. I now stamp every prompt file with the model + date I last validated it on. When a model upgrade lands, I have a re-validation checklist instead of a mystery.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;A few things are on my list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better reviewer diversity&lt;/strong&gt; — running 3 reviewers with different lenses (correctness, perf, contract-fit) in parallel and merging their concerns. Early experiments suggest this catches a different class of bug than a single reviewer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch-and-merge planning&lt;/strong&gt; — letting the orchestrator try N approaches in isolated git worktrees and pick the winner, instead of committing to one path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eval harness for the agent itself&lt;/strong&gt; — a fixed set of tasks I can run after every prompt change to measure regression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll write up each of these as I learn what actually works (and what just looks clever on a diagram).&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up / CTA
&lt;/h2&gt;

&lt;p&gt;If you're building agents on top of &lt;code&gt;Claude Code&lt;/code&gt; or any tool-use LLM, the meta-lesson is: &lt;strong&gt;the model is the easy part, the system around it is the hard part.&lt;/strong&gt; Treat it like the distributed system it actually is — explicit contracts, structured returns, loud failures, persistent state.&lt;/p&gt;

&lt;p&gt;If this resonated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💡 &lt;strong&gt;Follow me on Dev.to&lt;/strong&gt; for more build logs as I push this further&lt;/li&gt;
&lt;li&gt;⚙️ Try &lt;code&gt;Claude Code&lt;/code&gt; if you haven't — most of these patterns translate to whatever runner you use&lt;/li&gt;
&lt;li&gt;💬 Drop a comment with what's broken in your agent setup — I read every reply and I'm collecting failure modes for a follow-up post&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's the most surprising failure mode &lt;em&gt;you've&lt;/em&gt; hit running a coding agent?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>softwareengineering</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
