<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shinsuke KAGAWA</title>
    <description>The latest articles on DEV Community by Shinsuke KAGAWA (@shinpr).</description>
    <link>https://dev.to/shinpr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3448941%2F612feab1-a03c-49be-b329-ae74d583329c.jpg</url>
      <title>DEV Community: Shinsuke KAGAWA</title>
      <link>https://dev.to/shinpr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shinpr"/>
    <language>en</language>
    <item>
      <title>A second pair of eyes for Claude Code: building Galley, a local runner that checks the work before the PR opens</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 11 May 2026 12:16:11 +0000</pubDate>
      <link>https://dev.to/shinpr/a-second-pair-of-eyes-for-claude-code-building-galley-a-local-runner-that-checks-the-work-before-9ie</link>
      <guid>https://dev.to/shinpr/a-second-pair-of-eyes-for-claude-code-building-galley-a-local-runner-that-checks-the-work-before-9ie</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I run three client projects plus an OSS repo. Agentic coding got good enough that for a lot of tasks I just hand over a goal and a list of acceptance criteria and let it run. The catch: Opus 4.7's reliability made me nervous enough that I started ending almost every task with a manual round of "now have Codex look at this."&lt;/li&gt;
&lt;li&gt;That ritual happened often enough that I automated it. &lt;strong&gt;Galley&lt;/strong&gt; is the result: a local runtime where Claude Code executes a task inside a git worktree, then a supervisor (Claude, or Codex when I want a different reviewer) checks the run evidence against the acceptance criteria and either bounces it back for another attempt or approves it and opens a PR.&lt;/li&gt;
&lt;li&gt;The parts I'm happiest with aren't the loop itself. They're the boring scaffolding around it: the tool is installed and configured by a Skill, acceptance-criteria test skeletons get written into the worktree &lt;em&gt;before&lt;/em&gt; the first attempt, there's a per-repo &lt;code&gt;quality.yaml&lt;/code&gt; I keep growing, and a second model on review duty catches things the model that wrote the code does not.&lt;/li&gt;
&lt;li&gt;Galley now does a noticeable chunk of its own development. It's an early preview, MIT-licensed, on GitHub.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup that made me build this
&lt;/h2&gt;

&lt;p&gt;For the last while my day job has been three concurrent product codebases plus maintaining an open-source workflow framework. None of them are huge, but the context-switching tax is real, and I'd been leaning harder and harder on Claude Code to carry whole tasks rather than babysitting them line by line.&lt;/p&gt;

&lt;p&gt;At some point a threshold got crossed. Not "AI writes all my code now," more like: for a task of a certain size and shape, I no longer needed to plan the implementation. I needed to write down what done looks like (the goal, the acceptance criteria, the paths it's allowed to touch), and that was genuinely enough. The model would go figure out the how. That's a nice feeling the first few times it works.&lt;/p&gt;

&lt;p&gt;Then Opus 4.7 happened, and the feeling got complicated. I won't relitigate it; plenty of people have. The short version for me was: the &lt;em&gt;ambition&lt;/em&gt; was still there, the output still looked plausible, but I stopped trusting "looks plausible." So I started doing something I'd done occasionally before, but now every single time: after Claude Code finished, I'd open Codex, hand it the diff and the requirements, and ask it to find what was wrong. It usually found something. Different model, different blind spots: Codex would flag an edge case Claude glossed over, and Claude would have written cleaner structure than Codex would have. The combination of the two was reliably better than either alone.&lt;/p&gt;

&lt;p&gt;So now every task had a manual final step that I did by hand, with copy-paste, in a separate terminal, every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Third time, automate it"
&lt;/h2&gt;

&lt;p&gt;I have a rule I mostly stick to: the first time you do a thing manually, fine. The second time, grumble. The third time, you build the tool. The cross-model review had blown well past three.&lt;/p&gt;

&lt;p&gt;But the more I sketched it, the more it stopped being "a script that pipes a diff into Codex" and turned into something with a shape: if a supervisor model is going to &lt;em&gt;approve&lt;/em&gt; work, it needs the work in a reviewable form, not just a diff but the command plan, the executor's own report, git status, the structured result. If it's going to &lt;em&gt;reject&lt;/em&gt; work, the rejection needs to come back to the executor as a new attempt with the feedback attached, not as a Slack message to me. And if it can do all that, it can open the PR itself, and I can do final tweaks from PR comments instead of from my editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Galley actually does
&lt;/h2&gt;

&lt;p&gt;It's a local Go binary plus a daemon. You point it at a repo, hand it a task, and it runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your repo  ──task YAML──▶  galley daemon (local)
                               │
                               ├─▶ executor: Claude Code, inside a git worktree
                               │      writes the code, returns a structured result
                               │
                               └─▶ supervisor: Claude (default) or Codex
                                      reads the run evidence, issues a verdict,
                                      opens the PR on accept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task itself moves through a file-backed queue, and the supervisor's verdict decides where it lands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;draft task YAML
        |  galley task queue
        v
   tasks/queued/  →  daemon claims it  →  tasks/running/
        |
        |  Claude Code executes in a git worktree
        v
   supervisor review (Claude or Codex)
        |
        +-- accepted  -------------------→ tasks/done/  (+ open PR if enabled)
        +-- needs_revision  -------------→ retry, while loop budget remains
        +-- needs_supervisor_review  ----→ tasks/failed/  (escalate to me)
        +-- hard_stop  ------------------→ tasks/failed/  (no retry)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything runs locally, every change stays as git-visible diffs, and every attempt writes its evidence to disk: &lt;code&gt;command_plan.json&lt;/code&gt;, &lt;code&gt;run_result.json&lt;/code&gt;, the supervisor's verdict, &lt;code&gt;git_status.json&lt;/code&gt;, &lt;code&gt;diff.patch&lt;/code&gt;. When the loop escalates to me, I'm not guessing; I'm reading the file the supervisor read.&lt;/p&gt;

&lt;p&gt;The task file is the trusted input. It's where the goal, the acceptance criteria (each with an ID the executor has to report back against), the allowed and forbidden paths, the loop budget, and the PR behavior live. The model never gets to redefine its own success criteria mid-run. It gets to satisfy them or fail them.&lt;/p&gt;
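&lt;p&gt;To make that concrete, here's a hedged sketch of what such a task file could look like. The field names below are illustrative guesses assembled from the concepts in this post (goal, per-AC IDs, allowed and forbidden paths, loop budget, PR behavior, &lt;code&gt;prompt_mode: replace&lt;/code&gt;), not Galley's documented schema; check the repo for the real format.&lt;/p&gt;

```yaml
# Illustrative task YAML. Field names are guesses based on the ideas
# described in this post, not Galley's actual schema.
goal: "run_agent callers can override the execution timeout per call"
acceptance_criteria:
  - id: AC1
    description: "A per-call timeout argument takes precedence over the default"
  - id: AC2
    description: "Existing callers without the argument keep today's behavior"
allowed_paths:
  - internal/agent/
forbidden_paths:
  - .github/
loop_budget: 3          # max executor attempts before escalating to a human
prompt_mode: replace    # pin the executor's system prompt rather than append to it
pr:
  open_on_accept: true
```

The point of keeping all of this in one file is that the executor reports back against the AC IDs and the supervisor judges against the same IDs; neither side gets to renegotiate them mid-run.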

&lt;h2&gt;
  
  
  The decisions I'd actually defend
&lt;/h2&gt;

&lt;p&gt;The loop is the obvious part. Here's the stuff that took longer to get right and that I think matters more.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Skill installs the tool
&lt;/h3&gt;

&lt;p&gt;This is the bit that still feels a little science-fictional to me. Galley ships an Agent Skill, packaged both as a Claude Code plugin and as a Codex marketplace entry. You install the &lt;em&gt;skill&lt;/em&gt; first. Then you ask it, in plain language, to set up the repo. It installs the &lt;code&gt;galley&lt;/code&gt; binary, inspects the repository, drafts a &lt;code&gt;quality.yaml&lt;/code&gt; and an &lt;code&gt;environment.yaml&lt;/code&gt;, explains the execution settings to you, writes a valid task YAML, validates it, and queues it. It only queues after you say yes.&lt;/p&gt;
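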

&lt;p&gt;I've shipped CLIs before. The onboarding was always a README and a prayer. Here the onboarding &lt;em&gt;is an agent&lt;/em&gt;, and "explain what this config field means and pick a sensible default for my repo" is just a thing it does. That splits your docs in two: the README is for the human who wants to understand the system, and the skill's reference files are for the agent that has to operate it correctly without you in the loop. They overlap less than you'd think, and I'm still figuring out where each line belongs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acceptance-criteria test skeletons go in &lt;em&gt;first&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;This one's a direct response to the 4.7 trust problem. There's an optional preflight step. Before the first executor attempt, Galley runs a built-in test-creator pass that writes test skeletons into the worktree, one per acceptance criterion, and records the mapping back onto the running task: for each AC, the skeleton's path, the behavior it's meant to pin down, and where it plugs into the codebase. In a Go repo a skeleton is about what you'd expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAC1_RunAgentOverridesTimeoutPerCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AC1: run_agent callers can override execution timeout per call"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// executor fills this in&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those skeleton paths are validated against the task's allowed paths, so the test-creator can't scatter files wherever it likes. And the executor can't get an "accepted" verdict while those tests are still skipped and the required checks haven't run green; the supervisor downgrades that to &lt;code&gt;needs_supervisor_review&lt;/code&gt;. The effect is small but real: the implementation has to converge on something the AC-shaped tests accept. A model that's drifting toward a clever-but-wrong solution runs into the skeleton and has to reckon with it. It's harder to wander off when there's already a fence where the spec said the fence should be. (It's off by default, since some tasks genuinely shouldn't have it, but for "implement feature X with these three behaviors," I turn it on.)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;quality.yaml&lt;/code&gt; is a thing I grow
&lt;/h3&gt;

&lt;p&gt;Each repo gets a quality profile: which checks are required, which review dimensions must pass, what evidence the supervisor should expect, what severity of finding blocks acceptance. It starts small. Then every time a run produces something technically-passing-but-wrong-for-this-codebase, I add a line. Over time the profile becomes the codebase's accumulated opinion about what "good" means here, and both the executor and the supervisor get handed that opinion at the start of every task. Implementations stop drifting because the definition of "done well" stopped being implicit.&lt;/p&gt;
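&lt;p&gt;A sketch of what a grown-over-time profile might hold, using the four ingredients named above. The structure and key names here are my own illustration, not Galley's real &lt;code&gt;quality.yaml&lt;/code&gt; format:&lt;/p&gt;

```yaml
# Illustrative quality profile. Keys are invented to mirror the four
# ingredients described in the post; the real schema may differ.
required_checks:
  - go test ./...
  - go vet ./...
review_dimensions:       # dimensions the supervisor must explicitly pass
  - correctness
  - error_handling
  - test_coverage
evidence_expected:       # artifacts the supervisor should find on disk
  - run_result.json
  - diff.patch
blocking_severity: high  # findings at or above this severity block acceptance
house_rules:             # the lines that accumulate run by run
  - "New exported functions get table-driven tests"
  - "Errors are wrapped with context, never swallowed"
```

The `house_rules`-style list is where "technically passing but wrong for this codebase" findings turn into standing policy that both the executor and the supervisor see on every task.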

&lt;h3&gt;
  
  
  Claude writes, a second model signs off
&lt;/h3&gt;

&lt;p&gt;Supervisor review defaults to Claude. But I can flip it to Codex per task, and for anything I'd have manually double-checked before, I do. Same-model review (Claude checking Claude) is fine and catches plenty. A different model catches a different category of mistake, and because a rejection comes back as another attempt with the feedback attached, the executor gets to fix what the reviewer flagged instead of just failing the task. So a long unattended run doesn't drift the way an unreviewed one does: every accepted step has had a second model poke at it, and the diff you end up with has those corrections baked in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deterministic / non-deterministic seam
&lt;/h3&gt;

&lt;p&gt;Building this kind of tool, the genuinely fiddly part isn't the AI calls — it's the boundary between the parts that must be exact and the parts that get to improvise. Two places I had to draw a hard line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The executor has to return a structured JSON result, every time, or the retry-and-review loop has nothing to stand on; the supervisor can't evaluate a free-form essay. So Galley installs a small guard plugin into the executor's Claude Code that enforces the output format. The creative work is non-deterministic; the envelope it arrives in is not.&lt;/li&gt;
&lt;li&gt;Opus 4.7's defaults made me uneasy enough that I replaced the executor's system prompt outright with one derived from Codex-style prompting and from my own &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;&lt;code&gt;claude-code-workflows&lt;/code&gt;&lt;/a&gt; OSS, and did the same kind of swap for the supervisor prompts. The task YAML literally has &lt;code&gt;prompt_mode: replace&lt;/code&gt; for this. I'd rather pin the behavior than hope for it.&lt;/li&gt;
&lt;/ul&gt;
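&lt;p&gt;For the first point, here's the kind of structured envelope the guard could be enforcing. The post only names &lt;code&gt;run_result.json&lt;/code&gt; and per-AC reporting; the exact shape below is an assumption for illustration:&lt;/p&gt;

```json
{
  "task_id": "timeout-override",
  "attempt": 2,
  "status": "completed",
  "acceptance_criteria": [
    {
      "id": "AC1",
      "status": "satisfied",
      "evidence": "TestAC1_RunAgentOverridesTimeoutPerCall passes"
    },
    {
      "id": "AC2",
      "status": "satisfied",
      "evidence": "existing suite still green"
    }
  ],
  "checks_run": ["go test ./...", "go vet ./..."],
  "notes": "Timeout override plumbed through the call options."
}
```

Whatever the real field names are, the design point stands: the supervisor evaluates this envelope plus the on-disk evidence, not a free-form essay, which is what makes mechanical retry-and-review possible.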

&lt;p&gt;Neither of these is glamorous. Both are the difference between a demo and something I leave running while I'm in a meeting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Galley building Galley
&lt;/h2&gt;

&lt;p&gt;The thing I didn't plan for: once it worked, the obvious next move was to have Galley develop Galley.&lt;/p&gt;

&lt;p&gt;A few recent fixes, all queued through the skill and executed by Galley itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PR body was rendering every acceptance criterion as &lt;code&gt;not_satisfied&lt;/code&gt; even when the supervisor had accepted them with evidence, which is confusing for anyone reading the PR.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;galley task show&lt;/code&gt;, the command I run constantly, was loudly reporting &lt;code&gt;latest_claude_status: failed&lt;/code&gt; on tasks that had actually been accepted with a PR open; true as raw history, wrong as the headline.&lt;/li&gt;
&lt;li&gt;The PR-comment trigger only recognized &lt;code&gt;/galley rerun ...&lt;/code&gt; and &lt;code&gt;/galley requeue ...&lt;/code&gt;, when what I actually wanted was to type &lt;code&gt;/galley fix the failing test&lt;/code&gt; and have it pick that up as the request.&lt;/li&gt;
&lt;li&gt;New worktrees were being branched off whatever the source repo's HEAD happened to be instead of the configured base branch, so one PR's commits could leak into the next.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of those I caught by using the thing. Some — like the test-skeleton preflight — came from sitting down with a pile of &lt;code&gt;runs/&lt;/code&gt; evidence and asking what would have stopped the bad run earlier. Either way, the loop closes: I notice a gap, I write it up as a task with acceptance criteria, Galley implements it, a supervisor checks it, a PR shows up, I tweak it from a comment.&lt;/p&gt;

&lt;p&gt;It's not fully autonomous and I'm not pretending it is. I review the task drafts. I read the escalations. I still tweak PRs. But the ratio of "me describing what I want" to "me typing the code" has tilted further than I expected, and the safety rails (evidence on disk, AC-shaped tests, a growing quality profile, a second model's sign-off) are what let me actually trust the tilt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it is
&lt;/h2&gt;

&lt;p&gt;Galley is an early preview. It's MIT-licensed and on GitHub at &lt;a href="https://github.com/shinpr/galley" rel="noopener noreferrer"&gt;&lt;code&gt;shinpr/galley&lt;/code&gt;&lt;/a&gt;. It's Claude-first today: the executor path targets Claude Code, supervisor review defaults to Claude, and Codex slots in as the alternate supervisor. It's built for trusted local repositories: task YAML is trusted input, quality checks run locally, PR comments can request a requeue but can't rewrite your gates, and only the PR author (who is also a repo owner or collaborator) can drive it from comments. It also leans on &lt;code&gt;git&lt;/code&gt;, a git worktree per task, and &lt;code&gt;gh&lt;/code&gt; for the PR path.&lt;/p&gt;

&lt;p&gt;To try it, add the plugin and let the Galley skill do the setup: it installs the &lt;code&gt;galley&lt;/code&gt; CLI if it isn't already on your &lt;code&gt;PATH&lt;/code&gt;, inspects the repo, drafts the &lt;code&gt;quality.yaml&lt;/code&gt; and &lt;code&gt;environment.yaml&lt;/code&gt; profiles and the task YAML, and queues only after you approve. That "install a skill, have it install the tool" loop is the part that still feels new to me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add shinpr/galley
/plugin install galley@galley-tools
/reload-plugins
/galley:galley Set up Galley for this repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Codex
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex plugin marketplace add shinpr/galley
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then invoke the skill with &lt;code&gt;$galley&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$galley Set up Galley for this repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you'd rather install the CLI yourself first, it's a one-liner: &lt;code&gt;curl -fsSL https://raw.githubusercontent.com/shinpr/galley/main/scripts/install.sh | sh&lt;/code&gt;. Either way you end up describing tasks to the skill in plain language from there.&lt;/p&gt;

&lt;p&gt;I'm curious how other people have dealt with the same trust gap. Have you put a second model in the review loop? Did a cross-model pairing actually buy you something, or was Claude-reviewing-Claude enough? And if you've found a better answer than "evidence on disk plus tests that go in before the code," I'd genuinely like to hear it. If you build something around this, or break it in an interesting way, even better.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Skill Reviewer. Then I Ran It on Itself.</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:44:56 +0000</pubDate>
      <link>https://dev.to/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</link>
      <guid>https://dev.to/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</guid>
      <description>&lt;p&gt;I built a tool that reviews Claude Code skills for quality issues.&lt;/p&gt;

&lt;p&gt;Then I pointed it at its own source files. It found real problems.&lt;/p&gt;

&lt;p&gt;The irony wasn't lost on me. But the more interesting question is: why did this happen, and what does it tell us about how LLM-based quality tools actually work?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I maintain &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;, a Claude Code plugin for prompt and skill optimization. It includes a skill reviewer agent that evaluates skill files against 8 research-backed patterns (BP-001 through BP-008) and 9 editing principles.&lt;/p&gt;

&lt;p&gt;One of those patterns—BP-001—says: &lt;strong&gt;don't write instructions in negative form.&lt;/strong&gt; Research shows LLMs often fail to follow "don't do X" instructions—negated prompts actually cause &lt;a href="https://arxiv.org/abs/2209.12711" rel="noopener noreferrer"&gt;inverse scaling&lt;/a&gt;, where larger models perform &lt;em&gt;worse&lt;/em&gt;. The fix is to rewrite them positively: instead of "don't skip P1 issues," write "evaluate all P1 issues in every review mode."&lt;/p&gt;

&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;Except both my agent definition files had a section called &lt;code&gt;## Prohibited Actions&lt;/code&gt; full of "don't" instructions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Discovery
&lt;/h2&gt;

&lt;p&gt;I noticed this by reading my own code. But I wanted to see what happens when the tools catch it—or don't.&lt;/p&gt;

&lt;p&gt;First, I ran the &lt;strong&gt;prompt-analyzer&lt;/strong&gt; agent against both files. It analyzed them, found some issues, but gave the &lt;code&gt;Prohibited Actions&lt;/code&gt; sections a pass. Its reasoning: these qualify as "safety-critical" exceptions to BP-001, since they constrain "destructive" behaviors.&lt;/p&gt;

&lt;p&gt;That felt off. "Don't invent issues not supported by BP patterns" isn't a safety-critical instruction. It's a quality policy. The caller can override or discard the output.&lt;/p&gt;

&lt;p&gt;So I ran the &lt;strong&gt;skill-reviewer&lt;/strong&gt; agent against the same two files. The results were more interesting.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-reviewer.md&lt;/code&gt; (reviewing itself), it flagged all four items in Prohibited Actions as BP-001 violations—P2 severity. Correct call.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-creator.md&lt;/code&gt; (reviewing the other agent), it gave Prohibited Actions a pass. Same structure, same pattern, opposite judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The same reviewer, applying the same criteria, reached opposite conclusions on the same construct.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Digging Into Logs
&lt;/h2&gt;

&lt;p&gt;I could have speculated about why. Instead, I checked the subagent conversation logs.&lt;/p&gt;

&lt;p&gt;The skill-creator review log showed this in the Step 1 pattern scan:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BP-001 (Negative Instructions)&lt;/strong&gt;: Lines 197-202 "Prohibited Actions" section uses negative form. However, per the BP-001 exception in skills.md, these are procedural/irreversible consequences (inventing knowledge, removing examples, overwriting files). &lt;strong&gt;The exception applies.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It did scan for BP-001. It found the section. But it classified the items as "irreversible consequences" and applied the exception.&lt;/p&gt;

&lt;p&gt;The problem was clear: the exception rule said negative form is okay for "safety-critical operations, destructive actions, or order-dependent procedures." That's vague enough to stretch. "Inventing domain knowledge" sounds serious. "Removing user-provided examples" sounds destructive. If you squint, anything can be "destructive."&lt;/p&gt;

&lt;p&gt;Nothing was wrong with the reviewer. It was doing exactly what I told it to do. That was the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing the Criteria, Not the Reviewer
&lt;/h2&gt;

&lt;p&gt;The instinct is to blame the LLM: "it self-justified," "it was biased toward leniency." But the actual cause was simpler: &lt;strong&gt;the exception rule was written in a way that allowed two reasonable readings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fix wasn't to make the reviewer "smarter." It was to make the criteria harder to misread.&lt;/p&gt;

&lt;p&gt;I replaced the broad exception language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: safety-critical operations, exact command sequences,
destructive actions, or order-dependent procedures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a 4-condition checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: Negative form is permitted only when ALL are true:
(1) Violation destroys state in a single step
(2) Caller or subsequent steps cannot normally recover
(3) The constraint is operational/procedural, not a quality policy
(4) Positive rewording would expand or blur the target scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added concrete boundary examples—what qualifies, what doesn't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Permitted (exception applies)&lt;/th&gt;
&lt;th&gt;Not permitted (rewrite positively)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Do not modify the command"&lt;/td&gt;
&lt;td&gt;"Do not invent issues" -&amp;gt; "Base every issue on BP patterns"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not add flags"&lt;/td&gt;
&lt;td&gt;"Do not skip P1 issues" -&amp;gt; "Evaluate all P1 in every mode"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not execute destructive operations"&lt;/td&gt;
&lt;td&gt;"Do not create overlapping skills" -&amp;gt; "Verify no overlap before generating"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key addition: &lt;strong&gt;"Outputs that the caller validates, overwrites, or discards are never irreversible."&lt;/strong&gt; This one sentence eliminates most of the ambiguity. A subagent's output goes to a caller. The caller decides what to do with it. That's not irreversible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retest
&lt;/h2&gt;

&lt;p&gt;After updating the criteria, I ran the skill-reviewer again on both files.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-reviewer.md&lt;/code&gt;: Prohibited Actions flagged as BP-001 P2. All four items caught.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-creator.md&lt;/code&gt;: Two items flagged as quality policies that should be positive form. The remaining items—which are genuinely about operational constraints—were accepted.&lt;/p&gt;

&lt;p&gt;Consistent. Explainable. And the reviewer could now articulate &lt;em&gt;why&lt;/em&gt; each item was or wasn't an exception, because the criteria forced it to check specific conditions rather than make a gestalt judgment.&lt;/p&gt;

&lt;p&gt;But I wasn't fully satisfied. In a further round of testing, the reviewer still occasionally applied exceptions loosely—recording "irreversible" in the justification field without explaining &lt;em&gt;how&lt;/em&gt; it's irreversible.&lt;/p&gt;

&lt;p&gt;So I added structured evidence to the output schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"patternExceptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BP-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"section heading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"quoted text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"singleStepDestruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"callerCannotRecover"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operationalNotPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"positiveFormBlursScope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can't just write "irreversible" anymore. You have to answer four yes/no questions with evidence. If any answer is no, it's not an exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Comes Down To
&lt;/h2&gt;

&lt;p&gt;The criteria had a loophole wide enough to drive a truck through. Better criteria produced better reviews without changing the reviewer at all. The LLM wasn't "inconsistent"—the instructions were ambiguous. Two reasonable people could have read the old exception rule and reached different conclusions too.&lt;/p&gt;

&lt;p&gt;Structured output helped more than I expected. The 4-condition checklist wasn't just about auditability—it changed how the reviewer thinks. When you have to fill in four fields with evidence, you can't hand-wave. The output structure becomes a thinking scaffold.&lt;/p&gt;

&lt;p&gt;And running the tool on its own source files was uncomfortable in a useful way. The temptation is to say "well, I know what I meant." But the tool doesn't know what I meant. It reads what I wrote.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Problem: Skill Quality Is Hard
&lt;/h2&gt;

&lt;p&gt;If you're building Claude Code skills, custom agents, or any kind of structured LLM instruction set—you've probably experienced this: the instructions work fine in your head, but the LLM does something unexpected. You add more instructions. It gets worse. You simplify. Something else breaks.&lt;/p&gt;

&lt;p&gt;The issue is that &lt;strong&gt;you can't see your own blind spots.&lt;/strong&gt; You know what you meant. The LLM reads what you wrote. The gap between intent and text is where bugs live.&lt;/p&gt;

&lt;p&gt;This is why I built &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skill review&lt;/strong&gt;: Evaluate skill files against the BP-001 through BP-008 patterns and 9 editing principles, with structured quality grades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden scenario evaluation&lt;/strong&gt;: Test whether a skill actually &lt;em&gt;works&lt;/em&gt; by comparing execution results with and without the skill, or before and after changes—not just whether it was loaded, but whether it made a measurable difference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The golden scenario part matters. "The skill was loaded" doesn't mean "the skill helped." You need to see the actual output difference to know if your skill is doing anything useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;Rashomon&lt;/a&gt; is a Claude Code plugin. Install it and point the skill reviewer at your own skills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Claude Code&lt;/span&gt;
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session to activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will find problems. I know because it found problems in itself—and it's better for it now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with skill quality? Have you found ways to validate that your instructions actually do what you think they do?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Same Framework, Different Engine: Porting AI Coding Workflows from Claude Code to Codex CLI</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:50:48 +0000</pubDate>
      <link>https://dev.to/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</link>
      <guid>https://dev.to/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I built a &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;sub-agent workflow framework for Claude Code&lt;/a&gt; that solved context exhaustion through specialized agents and structured workflows&lt;/li&gt;
&lt;li&gt;For 8 months, Codex CLI had no sub-agents — the framework was Claude Code-only&lt;/li&gt;
&lt;li&gt;Codex finally shipped sub-agent support — I expected days of migration, it took an afternoon&lt;/li&gt;
&lt;li&gt;What surprised me most: &lt;strong&gt;if you design workflows around agent roles and context separation rather than tool-specific features, your investment survives platform shifts&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 8-Month Wait
&lt;/h2&gt;

&lt;p&gt;Back in July 2025, I released the &lt;a href="https://github.com/shinpr/ai-coding-project-boilerplate/commit/1a9191dd37e90c7d463f9a26b3a6edf01236d4f2" rel="noopener noreferrer"&gt;first version of this workflow&lt;/a&gt; as a Claude Code boilerplate. By October 2025, it had evolved into a &lt;a href="https://github.com/shinpr/claude-code-workflows/commit/8869e32eeff9d45568a7ca3017688fffdac7e254" rel="noopener noreferrer"&gt;full sub-agent framework&lt;/a&gt; — specialized agents for every phase of development, from requirements analysis through TDD implementation through quality gates. The idea was pretty simple: break complex coding tasks into specialized roles (requirement analyzer, technical designer, task executor, quality fixer...), give each agent a fresh context, and orchestrate them through structured handoffs. No single agent ever hits the context ceiling because no single agent tries to do everything.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Codex CLI had no sub-agent capability.&lt;/strong&gt; Codex had been around since &lt;a href="https://openai.com/index/introducing-codex/" rel="noopener noreferrer"&gt;mid-2025&lt;/a&gt;, and I wanted the same workflow there too. So I kept trying to bridge the gap.&lt;/p&gt;

&lt;p&gt;First, I built an &lt;a href="https://github.com/shinpr/sub-agents-mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; in August 2025 that let any MCP-compatible tool — Codex, Cursor, whatever — define and spawn sub-agents through a standard protocol. It worked, but MCP added a layer of indirection that wasn't there in Claude Code's native sub-agents.&lt;/p&gt;

&lt;p&gt;Then in December 2025, Codex &lt;a href="https://community.openai.com/t/skills-for-codex-experimental-support-starting-today/1369367" rel="noopener noreferrer"&gt;shipped experimental Agent Skills support&lt;/a&gt;. I saw an opening and built &lt;a href="https://github.com/shinpr/sub-agents-skills" rel="noopener noreferrer"&gt;sub-agents-skills&lt;/a&gt; — cross-LLM sub-agent orchestration packaged as Agent Skills, routing tasks to Codex, Claude Code, Cursor, or Gemini. Closer, but still not native sub-agents.&lt;/p&gt;

&lt;p&gt;Through all of this, my main development stayed on Claude Code. Its native context separation, combined with the small context windows of the time, made it the clear choice for serious work. Codex filled a supporting role — I used it for skills refinement and as an objective reviewer on complex implementations, a fresh set of eyes from a different LLM.&lt;/p&gt;

&lt;p&gt;I don't use hooks extensively — I prefer keeping tasks small and baking quality gates into the completion criteria themselves. So what I was really waiting for was native sub-agent support in Codex, which would let the full orchestration workflow run without workarounds.&lt;/p&gt;

&lt;p&gt;On March 16, 2026, Codex CLI &lt;a href="https://developers.openai.com/codex/subagents" rel="noopener noreferrer"&gt;shipped sub-agent support&lt;/a&gt;. During pre-release validation, I noticed something encouraging: Codex followed the workflow stopping points more strictly than expected. If the behavior stabilizes, it could be a viable primary development tool, not just a supporting one.&lt;/p&gt;

&lt;p&gt;The port took almost no effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Near-Zero Migration" Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;When I say "the same framework," I mean it. The core architecture didn't change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    ↓
requirement-analyzer → scale determination [STOP for confirmation]
    ↓
technical-designer → Design Doc
    ↓
document-reviewer [STOP for approval]
    ↓
work-planner → phased task breakdown [STOP]
    ↓
task-decomposer → atomic task files
    ↓
Per-task 4-step cycle:
  task-executor → escalation check → quality-fixer → git commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;22 sub-agents. 26 skills. The same stopping points, the same quality gates, the same TDD enforcement.&lt;/p&gt;

&lt;p&gt;What changed was the &lt;strong&gt;container format&lt;/strong&gt;, not the content:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent definitions&lt;/td&gt;
&lt;td&gt;Markdown with YAML frontmatter (&lt;code&gt;agents/*.md&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TOML files (&lt;code&gt;.codex/agents/*.toml&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/skills/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool declarations&lt;/td&gt;
&lt;td&gt;Explicit in frontmatter (&lt;code&gt;tools: Read, Grep, Glob...&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Not needed (inferred from sandbox mode)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill references&lt;/td&gt;
&lt;td&gt;Comma-separated names&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[[skills.config]]&lt;/code&gt; arrays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config directory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.codex/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's it. The agent instructions — the actual substance of what each agent knows and does — are the same. The workflow logic is the same. The quality criteria are the same.&lt;/p&gt;
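&lt;p&gt;For a rough sense of the container change, here is what an agent definition might look like on the Codex side. The field names below are illustrative guesses, not Codex's documented schema; only the file location and the &lt;code&gt;[[skills.config]]&lt;/code&gt; array come from the table above.&lt;/p&gt;

```toml
# .codex/agents/requirement-analyzer.toml, illustrative field names only;
# the Claude Code equivalent would be agents/requirement-analyzer.md
# with YAML frontmatter carrying the same content.
name = "requirement-analyzer"
description = "Extract task type, determine scale, identify ADR necessity"

[[skills.config]]
name = "requirement-analysis"
```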

&lt;h2&gt;
  
  
  Why This Worked: Design Decisions That Paid Off
&lt;/h2&gt;

&lt;p&gt;It worked because of three design choices I made early on:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Natural Language as the Interface Layer
&lt;/h3&gt;

&lt;p&gt;Every sub-agent's behavior is defined in natural language instructions, not in platform-specific tool calls. The requirement-analyzer isn't wired to Claude Code's &lt;code&gt;Agent&lt;/code&gt; tool or Codex's &lt;code&gt;spawn_agent&lt;/code&gt; — it follows a written protocol: "Extract task type, determine scale (1-2 files = Small, 3-5 = Medium, 6+ = Large), identify ADR necessity, output structured JSON."&lt;/p&gt;

&lt;p&gt;This means the instructions work on any LLM-powered agent system that can read text and follow procedures. In practice, that turned out to be enough. The framework is fundamentally &lt;strong&gt;a set of well-written job descriptions&lt;/strong&gt;, not a set of API integrations.&lt;/p&gt;
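&lt;p&gt;To see how mechanical that written protocol is, here is the scale rule transcribed into a throwaway Python sketch (illustrative only; nothing in the framework is implemented this way):&lt;/p&gt;

```python
import json

# Illustrative only: the requirement-analyzer is a written job description,
# not code. This transcription just shows the protocol is concrete enough
# to be mechanical, which is why it ports across platforms unchanged.

def determine_scale(file_count: int) -> str:
    """Scale rule as written: 1-2 files Small, 3-5 Medium, 6+ Large."""
    if file_count > 5:
        return "large"
    if file_count > 2:
        return "medium"
    return "small"

def analyze(task_type: str, file_count: int, needs_adr: bool) -> str:
    """Emit the structured JSON handoff the next agent consumes."""
    return json.dumps({
        "taskType": task_type,
        "scale": determine_scale(file_count),
        "adrRequired": needs_adr,
    })

print(analyze("feature", file_count=4, needs_adr=False))
# {"taskType": "feature", "scale": "medium", "adrRequired": false}
```

&lt;p&gt;The agents follow the same rules as written prose, which is exactly why the port was cheap.&lt;/p&gt;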

&lt;h3&gt;
  
  
  2. Context Separation as Architecture
&lt;/h3&gt;

&lt;p&gt;The core insight from the &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;original article&lt;/a&gt; still applies: each agent runs in a fresh context without inheriting bias from previous steps. The document-reviewer doesn't know what the technical-designer was "thinking" — it just reviews the output. The investigator explores without confirmation bias from whoever reported the bug.&lt;/p&gt;

&lt;p&gt;This isn't a Claude Code feature or a Codex feature. It's an &lt;strong&gt;architectural pattern&lt;/strong&gt; that happens to be implementable on both platforms once they support sub-agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Structured Handoffs Over Shared State
&lt;/h3&gt;

&lt;p&gt;Agents communicate through artifacts (documents, JSON outputs, task files), not through shared memory or conversation threading. The technical-designer writes a Design Doc. The work-planner reads that Design Doc. Neither needs to know which platform spawned the other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;docs/
├── prd/          # PRD artifacts
├── adr/          # Architecture decision records
├── design/       # Design documents
├── plans/        # Work plans
│   └── tasks/    # Atomic task files (1 commit each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file-based protocol turned out to be surprisingly platform-agnostic.&lt;/p&gt;
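&lt;p&gt;The handoff pattern can be sketched in a few lines (file names and task strings are made up for illustration):&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Sketch of the artifact handoff (paths and contents are made up):
# the designer writes a document, the planner later reads it from disk,
# with no shared memory or conversation state between the two steps.

def designer_step(docs: Path) -> Path:
    """Write a Design Doc artifact; the file is the entire handoff."""
    doc = docs / "design" / "login-design.md"
    doc.parent.mkdir(parents=True, exist_ok=True)
    doc.write_text("# Design: login\n\nContract: { ok: bool }\n")
    return doc

def planner_step(doc: Path) -> list[str]:
    """Start fresh: derive tasks purely from the artifact's contents."""
    text = doc.read_text()
    assert text.startswith("# Design:"), "planner only trusts the artifact"
    return ["task-01-implement-contract", "task-02-add-tests"]

docs_root = Path(tempfile.mkdtemp())
tasks = planner_step(designer_step(docs_root))
print(tasks)  # ['task-01-implement-contract', 'task-02-add-tests']
```

&lt;p&gt;Swap either side out for a different platform's agent and the handoff still works, because the contract is the file, not the runtime.&lt;/p&gt;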

&lt;h2&gt;
  
  
  The Framework in Action
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt; ships as an npm package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drops the full agent and skill infrastructure into your project. From there, you invoke recipe skills that orchestrate the sub-agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Recipe&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full lifecycle: requirements → design → implementation → quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requirements → design documents only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execute pre-planned tasks autonomously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-diagnose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Problem investigation → root cause → solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-front-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;React/TypeScript implementation with frontend-specific agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-fullstack-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-layer features with layer-aware task routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each recipe automatically loads the right foundational skills (coding standards, testing principles, documentation criteria) and spawns the right sequence of agents with the right stopping points.&lt;/p&gt;

&lt;p&gt;Here's what a real session looks like. I ran &lt;code&gt;$recipe-implement Add a login screen with email/password and a simple auth API&lt;/code&gt; in Codex CLI against a fresh project with placeholder files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Requirement analysis — the first mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Plato [requirement-analyzer] (gpt-5.4-mini medium)

• Updated requirement analysis is now confirmed:
  - Scope: email/password login form, local mock auth abstraction,
    hardcoded success credentials demo@example.com / password123,
    inline success message only
  - Route: fullstack
  - Scale: medium
  - ADR: not required

  Please confirm this analysis and answer these three questions:
  1. Should the auth API be a real HTTP endpoint, or is a local/mock
     API abstraction enough?
  2. What credentials should succeed initially?
  3. On successful login, should it only show success, navigate
     somewhere, or persist auth state too?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow stops and waits. No implementation starts until I confirm scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel design — backend and frontend at the same time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Volta [technical-designer] (gpt-5.4-mini medium)
  └ Create a backend Design Doc...

• Spawned Pasteur [technical-designer-frontend] (gpt-5.4-mini medium)
  └ Create a frontend Design Doc...

• Spawned Epicurus [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-backend-design.md...

• Spawned Bohr [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-frontend-design.md...

• Spawned Euclid [design-sync] (gpt-5.4-mini medium)
  └ Verify consistency between Design Docs...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five agents in parallel: two designers, two reviewers, one cross-layer sync checker. Each running in its own fresh context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design gate — another mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• The design gate is now clean enough to stop for approval.

  - Backend design: approved
  - Frontend design: approved
  - Cross-layer sync: NO_CONFLICTS

  Aligned contract:
  - Success: { ok: true }
  - Failure: { ok: false, error }
  - Inline success copy is UI-owned, not backend-owned

  Please approve the design docs so I can move to
  acceptance-test generation and the work plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Autonomous execution after batch approval:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;› batch approval

• Spawned Ohm [task-decomposer] (gpt-5.4-mini medium)

• Verification passed:
  - npm test
  - npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After batch approval, the framework decomposed the work plan into tasks and executed them autonomously — no more stopping points until the quality gates pass.&lt;/p&gt;

&lt;p&gt;The whole flow from &lt;code&gt;$recipe-implement&lt;/code&gt; to green tests took one session. The same flow, the same stopping points, the same agent roles that I've been running on Claude Code for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The framework is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex CLI version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;claude-code-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're already using the Claude Code version, the Codex version follows the same patterns. If you're new to both, pick whichever CLI you're already using — the workflow knowledge transfers either way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The whole port changed config file formats and directory conventions. The agent instructions — the part that actually matters — didn't need a single edit. That's the thing I'd want to know if I were deciding whether to invest time in workflow design for AI coding tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've been running sub-agent workflows with either Claude Code or Codex CLI, I'd be curious how your setup compares. What worked? What broke?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Letting LLMs Jump — and Then Verifying Ruthlessly</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 12 Feb 2026 13:35:03 +0000</pubDate>
      <link>https://dev.to/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</link>
      <guid>https://dev.to/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</guid>
      <description>&lt;h2&gt;
  
  
  The "First Plausible Answer" Problem
&lt;/h2&gt;

&lt;p&gt;You've probably seen this: you ask an LLM to investigate a bug, and it latches onto the first plausible explanation. It confidently proposes a fix before thoroughly exploring alternatives. Sometimes it works. Often it doesn't—and you're left debugging the debugger.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly in my personal projects. The LLM would find &lt;em&gt;something&lt;/em&gt; that looked like the cause, stop investigating, and immediately suggest a solution. When the codebase was small, this worked fine. As it grew, I started getting fixes that didn't actually fix anything.&lt;/p&gt;

&lt;p&gt;This approach is not for small scripts or simple bugs. I only started needing it once my codebase grew large enough that "just try a fix" stopped working.&lt;/p&gt;

&lt;p&gt;The root issue? &lt;strong&gt;How I was defining the task's purpose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Planning works well when the problem is understood. But when the problem itself is unclear, planning alone is not enough. This article focuses on those cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Factor That Made the Difference: Purpose
&lt;/h2&gt;

&lt;p&gt;When delegating tasks to LLMs, two factors affect execution accuracy: &lt;strong&gt;Context&lt;/strong&gt; (staying within ~70% of the context window) and &lt;strong&gt;Purpose&lt;/strong&gt; (how you define the task's goal).&lt;/p&gt;

&lt;p&gt;Context management matters, but this article focuses on the second factor—because that's where I was getting it wrong.&lt;/p&gt;

&lt;p&gt;Where you set the task's goal matters more than you might think. The purpose you define determines the task granularity, and the right granularity depends on your codebase complexity.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Real Example: Bug Investigation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Old Approach
&lt;/h3&gt;

&lt;p&gt;A single session handling "Investigation → Solution Proposal → Verification," followed by a separate review session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What I Changed
&lt;/h3&gt;

&lt;p&gt;My original goal was simple: "propose a solution" and "review it objectively."&lt;/p&gt;

&lt;p&gt;Originally, I'd just have the LLM investigate, propose a fix, and implement it directly. But as the codebase grew, I started getting solutions that didn't actually work. So I added a review step—opening a fresh session to check the proposal with clean context.&lt;/p&gt;

&lt;p&gt;This worked for about 60-70% of problems, but occasionally even this approach couldn't reach the root cause, no matter how many iterations I ran.&lt;/p&gt;

&lt;p&gt;Here's what I changed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Problem Structuring&lt;/strong&gt;: Structure my instructions upfront to make them easier for LLMs to parse in later steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation&lt;/strong&gt;: Conduct comprehensive investigation and report results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt;: If there's uncertainty in the report, perform additional verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution Derivation&lt;/strong&gt;: Receive investigation and verification results, then derive solutions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting &lt;strong&gt;"investigation" as the purpose&lt;/strong&gt;, the model stopped jumping to the first candidate and instead collected information from multiple angles.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Example
&lt;/h2&gt;

&lt;p&gt;This setup is probably overkill for small scripts. I only started doing this after my codebase crossed a certain complexity threshold.&lt;/p&gt;

&lt;p&gt;Here's how I structured the diagnosis workflow using Claude Code's slash commands and sub-agents. Full implementation is available at &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;github.com/shinpr/claude-code-workflows&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Main Command (diagnose.md)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;problem,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;verify&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;findings,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;derive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;solutions"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gs"&gt;**Command Context**&lt;/span&gt;: Diagnosis flow to identify root cause and present solutions

Target problem: $ARGUMENTS

&lt;span class="gu"&gt;## Step 0: Problem Structuring (Before investigator invocation)&lt;/span&gt;

&lt;span class="gu"&gt;### 0.1 Problem Type Determination&lt;/span&gt;

| Type | Criteria |
|------|----------|
| Change Failure | Indicates some change occurred before the problem appeared |
| New Discovery | No relation to changes is indicated |

&lt;span class="gu"&gt;### 0.2 Information Supplementation for Change Failures&lt;/span&gt;

If the following are unclear, &lt;span class="gs"&gt;**ask with AskUserQuestion**&lt;/span&gt; before proceeding:
&lt;span class="p"&gt;-&lt;/span&gt; What was changed (cause change)
&lt;span class="p"&gt;-&lt;/span&gt; What broke (affected area)
&lt;span class="p"&gt;-&lt;/span&gt; Relationship between both (shared components, etc.)

&lt;span class="gu"&gt;## Diagnosis Flow Overview&lt;/span&gt;

The goal of investigation is not to propose solutions.
It is to eliminate wrong explanations.

&lt;span class="gs"&gt;**Context Separation**&lt;/span&gt;: Pass only structured JSON output to each step.
Each step starts fresh with the JSON data only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Sub-agent: Investigator
&lt;/h3&gt;

&lt;p&gt;Think of the Investigator as a junior engineer whose only job is to gather facts, not to be clever. Its purpose is explicitly limited to &lt;strong&gt;evidence collection only&lt;/strong&gt;—no solutions. This is one concrete implementation; what matters is the separation of purpose, not the specific tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**evidence matrix and factual observations only**&lt;/span&gt;.
Solution derivation is out of scope for this agent.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Don't rely on a single source
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Search external info (WebSearch)**&lt;/span&gt; - Official docs, Stack Overflow, GitHub Issues
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**List hypotheses and trace causes**&lt;/span&gt; - Multiple candidates, not just the first one
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Identify impact scope**&lt;/span&gt; - Where else might this pattern exist?
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="gs"&gt;**Disclose blind spots**&lt;/span&gt; - Honestly report areas that could not be investigated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key output structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hypotheses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"H1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hypothesis description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causeCategory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"typo|logic_error|missing_constraint|design_gap|external_factor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causalChain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Phenomenon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Direct cause"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Root cause"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"supportingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"contradictingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unexploredAspects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Unverified aspects"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comparisonAnalysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"normalImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to working implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failingImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to problematic implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keyDifferences"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Differences"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-agent: Verifier
&lt;/h3&gt;

&lt;p&gt;The Verifier plays the annoying senior reviewer who assumes everything is wrong. It actively seeks refutation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Explore information sources not covered
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Generate alternative hypotheses**&lt;/span&gt; - What else could explain this?
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Play devil's advocate**&lt;/span&gt; - Assume "the investigation results are wrong"
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Pick the hypothesis with fewest holes**&lt;/span&gt; - Not "most evidence," but "least refuted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-agent: Solver
&lt;/h3&gt;

&lt;p&gt;The Solver is the engineer who actually has to ship something. Only after verification does it derive solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**solution derivation and recommendation presentation**&lt;/span&gt;.
Trust the given conclusion and proceed directly to solution derivation.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Multiple solution generation**&lt;/span&gt; - At least 3 different approaches
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Tradeoff analysis**&lt;/span&gt; - Cost, risk, impact scope, maintainability
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Recommendation selection**&lt;/span&gt; - Optimal solution with selection rationale
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Implementation steps presentation**&lt;/span&gt; - Concrete, actionable steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Guidelines
&lt;/h2&gt;

&lt;p&gt;When designing LLM tasks, I now check two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Purpose Clarity&lt;/strong&gt; - "Don't create tasks with unclear purposes"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Efficiency&lt;/strong&gt; - Can it be completed in one session with sufficient information? (Ideally using 60-70% of the context window)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I don't blindly split tasks into smaller pieces. Instead, I consider ROI and break larger tasks down only when necessary.&lt;/p&gt;

&lt;p&gt;By explicitly separating "investigation" from "solution," you prevent the model from rushing to conclusions before it has gathered sufficient evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lesson I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;Early on, I made the Verifier run every single time. The problem? Even when the investigation was clearly off track, the Verifier would dutifully try to verify nonsense.&lt;/p&gt;

&lt;p&gt;That's when I realized: &lt;strong&gt;you need a quality gate between steps&lt;/strong&gt;, not just separation.&lt;/p&gt;

&lt;p&gt;Now I have a checkpoint between Investigation and Verification. If the investigation output doesn't meet basic quality criteria (missing comparison analysis, shallow causal chains, etc.), it loops back instead of wasting cycles on verification.&lt;/p&gt;
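
&lt;p&gt;As a sketch, the gate can be a plain function that inspects the investigation output before the Verifier runs. The report shape and field names below are hypothetical, not an actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Hypothetical shape of an investigation report; field names are illustrative.
interface InvestigationReport {
  comparisonAnalysis?: { keyDifferences: string[] };
  causalChain: string[];
}

// Quality gate between Investigation and Verification.
// Returns the list of problems; an empty list means "proceed to the Verifier".
function gateProblems(report: InvestigationReport): string[] {
  const problems: string[] = [];
  if (!report.comparisonAnalysis) {
    problems.push("missing comparison analysis");
  }
  if (report.causalChain.length === 0) {
    problems.push("no causal chain at all");
  } else if (report.causalChain.length === 1) {
    problems.push("shallow causal chain (single step)");
  }
  return problems;
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the list is non-empty, the loop back to Investigation carries those problems as feedback for the next attempt.&lt;/p&gt;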

&lt;p&gt;I also added Step 0 (Problem Structuring) to help the LLM understand my intent better before diving in. These two changes—quality gates and upfront structuring—made the whole pipeline actually usable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Design Integration Checkpoints Before Letting LLMs Code</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:23:55 +0000</pubDate>
      <link>https://dev.to/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</link>
      <guid>https://dev.to/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</guid>
      <description>&lt;p&gt;Once you stop trying to control AI generation and start designing verification, you immediately hit the next problem: integration.&lt;br&gt;
And this is where most AI-generated systems actually break.&lt;/p&gt;

&lt;p&gt;Everything works.&lt;br&gt;
Until it doesn't.&lt;/p&gt;

&lt;p&gt;Each layer looks correct in isolation.&lt;br&gt;
Tests pass.&lt;br&gt;
Types line up.&lt;/p&gt;

&lt;p&gt;And then the system breaks where those layers meet.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why "Everything Works" Until It Doesn't
&lt;/h2&gt;

&lt;p&gt;This is a verification problem, not an implementation problem.&lt;br&gt;
When you build systems layer by layer, integration happens very late.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer-by-layer development
Phase 1: Data layer ────────────────✓
Phase 2: Service layer ─────────────✓
Phase 3: API layer ─────────────────✓
Phase 4: UI layer ──────────────────✓
Phase 5: Integration ── 💥 breaks here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is implemented in isolation.&lt;br&gt;
So you don't actually know if everything connects correctly until the end.&lt;/p&gt;

&lt;p&gt;This problem becomes much worse with AI-generated code.&lt;/p&gt;

&lt;p&gt;LLMs don't hold the entire system in mind at once.&lt;br&gt;
They optimize locally, based on the current context — and they often miss hidden contracts between layers.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Painful Integration Bug That "Worked"
&lt;/h2&gt;

&lt;p&gt;One of the most painful bugs I faced didn't involve crashes or errors.&lt;/p&gt;

&lt;p&gt;The AI chatbot worked.&lt;/p&gt;

&lt;p&gt;It returned responses.&lt;br&gt;
Logs looked normal.&lt;br&gt;
Nothing failed.&lt;/p&gt;

&lt;p&gt;But when we tested it in the real environment, the answers were subtly — but consistently — wrong.&lt;/p&gt;
&lt;h3&gt;
  
  
  What actually went wrong
&lt;/h3&gt;

&lt;p&gt;The root cause wasn't a single mistake, but a combination of issues across layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock implementations silently left in place&lt;/li&gt;
&lt;li&gt;LLM fallbacks that prioritized "returning something" instead of failing fast&lt;/li&gt;
&lt;li&gt;Duplicate logic across layers, created while implementing each layer separately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread? I wasn't tracking what else might break.&lt;/p&gt;

&lt;p&gt;Each layer looked correct in isolation.&lt;br&gt;
Tests passed.&lt;br&gt;
No alerts fired.&lt;/p&gt;

&lt;p&gt;Because the system always returned some response, it created a false sense of confidence.&lt;br&gt;
We didn't notice the problem immediately — and by the time we did, identifying the real cause across layers was extremely difficult.&lt;/p&gt;

&lt;p&gt;Bugs that silently "work" are far more dangerous than bugs that crash.&lt;/p&gt;
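
&lt;p&gt;The fallback half of that failure mode is easy to sketch. Both functions below are hypothetical stand-ins for the real services; the point is the contrast between them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Hypothetical retrieval step; imagine it silently broken (a mock left in place).
function retrieveContext(query: string): string[] {
  return []; // the bug: nothing is actually retrieved
}

// Anti-pattern: always return something, so the breakage stays invisible.
function answerWithFallback(query: string): string {
  const context = retrieveContext(query);
  if (context.length === 0) {
    return "Here is a general answer..."; // plausible, subtly wrong
  }
  return "Answer grounded in: " + context.join(", ");
}

// Fail fast: empty context is an error, not a degraded answer.
function answerFailFast(query: string): string {
  const context = retrieveContext(query);
  if (context.length === 0) {
    throw new Error("retrieval returned no context for: " + query);
  }
  return "Answer grounded in: " + context.join(", ");
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first version is what we had: every layer "worked" because something always came back. The second version would have failed loudly on the first test.&lt;/p&gt;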
&lt;h2&gt;
  
  
  Make Integration Explicit
&lt;/h2&gt;

&lt;p&gt;I now spend about five minutes defining integration checkpoints.&lt;br&gt;
Not documentation. Just verification.&lt;/p&gt;

&lt;p&gt;The goal is simple: define where things must connect, and how I'll know they actually do.&lt;/p&gt;

&lt;p&gt;Now, before implementation, I write a very small design note.&lt;/p&gt;

&lt;p&gt;Not a formal design document.&lt;br&gt;
No template, no ceremony.&lt;/p&gt;

&lt;p&gt;Just a checklist that answers two questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What parts of the system are affected?&lt;/li&gt;
&lt;li&gt;Where do things need to integrate — and how do I verify it?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Step 1: List What's Affected
&lt;/h3&gt;

&lt;p&gt;First, I write down what is directly or indirectly impacted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add image generation feature&lt;/span&gt;

&lt;span class="na"&gt;Direct impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure/image/functions.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/queryClassificationService.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/imageGenerationService.ts&lt;/span&gt;

&lt;span class="na"&gt;Indirect impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;conversationService.ts (function calling flow)&lt;/span&gt;

&lt;span class="na"&gt;No impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;existing text generation services&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;other function handlers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This immediately clarifies the blast radius.&lt;/p&gt;

&lt;p&gt;I don't aim for perfection —&lt;br&gt;
I just want to avoid being surprised later.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Define Integration Checkpoints
&lt;/h3&gt;

&lt;p&gt;Next, I decide where integration must be verified and how.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Function selection&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConversationService.generateContentWithFunctionCalling&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Send a request asking for an image&lt;/span&gt;
  &lt;span class="s"&gt;2. Confirm query classification returns `image_generation`&lt;/span&gt;
  &lt;span class="s"&gt;3. Confirm the correct function is selected in logs&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Log shows: Executing function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generateImage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And another one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Image generation and posting&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImageGenerationService → MessagingClient.uploadFile&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Image data is returned from the image client&lt;/span&gt;
  &lt;span class="s"&gt;2. The file is posted to the chat thread&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Image appears in the chat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I know exactly what "working" means.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works (Especially with AI)
&lt;/h2&gt;

&lt;p&gt;When I give this to an LLM, it changes how implementation happens.&lt;/p&gt;

&lt;p&gt;Instead of "build this feature," it's more like:&lt;br&gt;
"Connect A to B. Here's how we'll know it works."&lt;/p&gt;

&lt;p&gt;This also pairs well with building features end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature-based development
Feature A: Data → Service → API → UI → Verify
Feature B: Data → Service → API → UI → Verify
Feature C: Data → Service → API → UI → Verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each feature is fully integrated before moving on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Before this habit, integration bugs often cost me hours.&lt;/p&gt;

&lt;p&gt;After introducing these small design notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-generated code still has small issues&lt;/li&gt;
&lt;li&gt;But features no longer completely break at integration&lt;/li&gt;
&lt;li&gt;Unexpected behavior is caught much earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five minutes of thinking up front easily saves hours of debugging later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;This approach works well if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI coding tools&lt;/li&gt;
&lt;li&gt;Build layered architectures&lt;/li&gt;
&lt;li&gt;Want fast feedback instead of perfect design docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about writing more documentation.&lt;br&gt;
It's just about making integration explicit before code is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI tools are incredibly powerful — but they optimize locally.&lt;/p&gt;

&lt;p&gt;If we don't define integration points explicitly, we end up debugging systems that look correct but behave incorrectly.&lt;/p&gt;

&lt;p&gt;A small design checklist has made a huge difference for me.&lt;/p&gt;

&lt;p&gt;Hope this saves you some painful debugging.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Planning Is the Real Superpower of Agentic Coding</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 26 Jan 2026 12:24:31 +0000</pubDate>
      <link>https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</link>
      <guid>https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</guid>
      <description>&lt;p&gt;I see this pattern constantly: someone gives an LLM a task, it starts executing immediately, and halfway through you realize it's building the wrong thing. Or it gets stuck in a loop. Or it produces something that technically works but doesn't fit the existing codebase at all.&lt;/p&gt;

&lt;p&gt;The instinct is to write better prompts. More detail. More constraints. More examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The actual fix is simpler: make it plan before it executes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research shows that separating planning from execution dramatically improves task success rates—by as much as 33% in complex scenarios.&lt;/p&gt;

&lt;p&gt;In earlier articles, I wrote about why &lt;a href="https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf"&gt;LLMs struggle with first attempts&lt;/a&gt; and why &lt;a href="https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl"&gt;overloading AGENTS.md&lt;/a&gt; is often a symptom of that misunderstanding. This article focuses on what actually fixes that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Just Execute" Fails
&lt;/h2&gt;

&lt;p&gt;This took me longer to figure out than I'd like to admit. When you ask an LLM to directly implement something, you're asking it to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the requirements&lt;/li&gt;
&lt;li&gt;Analyze the existing codebase&lt;/li&gt;
&lt;li&gt;Design an approach&lt;/li&gt;
&lt;li&gt;Evaluate trade-offs&lt;/li&gt;
&lt;li&gt;Decompose into steps&lt;/li&gt;
&lt;li&gt;Execute each step&lt;/li&gt;
&lt;li&gt;Verify results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All in one shot. With one context. Using the same cognitive load throughout.&lt;/p&gt;

&lt;p&gt;Even powerful LLMs struggle with this. Not because they lack capability, but because &lt;strong&gt;long-horizon planning is fundamentally hard when acting step by step.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Plan-Execute Architecture
&lt;/h2&gt;

&lt;p&gt;Research on LLM agents has consistently shown that separating planning and execution yields better results.&lt;/p&gt;

&lt;p&gt;The reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explicit long-term planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even strong LLMs struggle with multi-step reasoning when taking actions one at a time. Explicit planning forces consideration of the full path.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You can use a powerful model for planning and a lighter model for execution—or even different specialized models per phase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each execution step doesn't need to reason through the entire conversation history. It just needs to execute against the plan.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What matters here: &lt;strong&gt;the plan becomes an artifact&lt;/strong&gt;, and the execution becomes &lt;em&gt;verification against that artifact&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you've read about why LLMs are better at verification than first-shot generation, this should sound familiar. Creating a plan first converts the execution task from "generate good code" to "implement according to this plan"—a much clearer, more verifiable objective.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Workflow
&lt;/h2&gt;

&lt;p&gt;The complete picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Preparation
    │
    ▼
Step 2: Design (Agree on Direction)
    │
    ▼
Step 3: Work Planning  ← The Most Important Step
    │
    ▼
Step 4: Execution
    │
    ▼
Step 5: Verification &amp;amp; Feedback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll walk through each step, but Step 3 is where the magic happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Preparation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Clarify &lt;em&gt;what&lt;/em&gt; you want to achieve, not &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a ticket, issue, or todo document stating the goal in plain language&lt;/li&gt;
&lt;li&gt;Point the LLM to AGENTS.md (or CLAUDE.md, depending on your tool) and relevant context files&lt;/li&gt;
&lt;li&gt;Don't jump into implementation details yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is about setting the stage, not solving the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Design (Agree on Direction)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Align on the approach before any code gets written.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't Let It Start Coding Immediately
&lt;/h3&gt;

&lt;p&gt;Instead of "implement this feature," say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Before implementing, present a step-by-step plan for how you would approach this."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Review the Plan
&lt;/h3&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contradictions with existing architecture&lt;/li&gt;
&lt;li&gt;Simpler alternatives the LLM missed&lt;/li&gt;
&lt;li&gt;Misunderstandings of the requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, you're agreeing on &lt;strong&gt;what to build&lt;/strong&gt; and &lt;strong&gt;why this approach&lt;/strong&gt;. The &lt;strong&gt;how&lt;/strong&gt; and &lt;strong&gt;in what order&lt;/strong&gt; come in Step 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Work Planning (The Most Important Step)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This section is dense. But the payoff is proportional—the more carefully you plan, the smoother execution becomes.&lt;/p&gt;

&lt;p&gt;For small tasks, you don't need all of this. See "Scaling to Task Size" at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Convert the design into executable work units with clear completion criteria.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Step Matters Most
&lt;/h3&gt;

&lt;p&gt;Research shows that decomposing complex tasks into subtasks significantly improves LLM success rates. Step-by-step decomposition produces more accurate results than direct generation.&lt;/p&gt;

&lt;p&gt;But there's another reason: &lt;strong&gt;the work plan is an artifact&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the plan exists, the execution task transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: "Build this feature" (generation)&lt;/li&gt;
&lt;li&gt;After: "Implement according to this plan" (verification)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same principle from Article 1. Creating a plan first means execution becomes verification—and LLMs are better at verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Work Planning Includes
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task decomposition&lt;/strong&gt;: Break the design into executable units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency mapping&lt;/strong&gt;: Define order and dependencies between tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion criteria&lt;/strong&gt;: What does "done" mean for each task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint design&lt;/strong&gt;: When do we get external feedback?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Perspectives to Consider
&lt;/h3&gt;

&lt;p&gt;I'll be honest: I learned most of these the hard way. Plans would fall apart mid-implementation, and only later did I realize I'd skipped something obvious in hindsight.&lt;/p&gt;

&lt;p&gt;These aren't meant to be followed rigidly for every task. Think of them as a mental checklist. You don't need to get all of these right—if even one of these perspectives changes your plan, it's doing its job.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 1: Current State Analysis
&lt;/h4&gt;

&lt;p&gt;Understand what exists before planning changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this code's actual responsibility?&lt;/li&gt;
&lt;li&gt;Which parts are essential business logic vs. technical constraints?&lt;/li&gt;
&lt;li&gt;What benefits and limitations does the current design provide?&lt;/li&gt;
&lt;li&gt;What implicit dependencies or assumptions aren't obvious from the code?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skipping this leads to plans that don't fit the existing codebase.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 2: Strategy Selection
&lt;/h4&gt;

&lt;p&gt;Consider how to approach the transition from current to desired state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for similar patterns in your tech stack&lt;/li&gt;
&lt;li&gt;Check how comparable projects solved this&lt;/li&gt;
&lt;li&gt;Review OSS implementations, articles, documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common strategy patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strangler Pattern&lt;/strong&gt;: Gradual replacement, incremental migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facade Pattern&lt;/strong&gt;: Hide complexity behind unified interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature-Driven&lt;/strong&gt;: Vertical slices, user-value first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foundation-Driven&lt;/strong&gt;: Build stable base first, then features on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key isn't applying patterns dogmatically—it's consciously choosing an approach instead of stumbling into one.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 3: Risk Assessment
&lt;/h4&gt;

&lt;p&gt;Evaluate what could go wrong with your chosen strategy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Type&lt;/th&gt;
&lt;th&gt;Considerations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impact on existing systems, data integrity, performance degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service availability, deployment downtime, rollback procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule delays, learning curve, team coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Skipping risk assessment leads to expensive surprises mid-implementation.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 4: Constraints
&lt;/h4&gt;

&lt;p&gt;Identify hard limits before committing to a strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt;: Library compatibility, resource capacity, performance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt;: Deadlines, milestones, external dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Team availability, skill gaps, budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt;: Time-to-market, customer impact, regulations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A strategy that ignores constraints isn't executable.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 5: Completion Levels
&lt;/h4&gt;

&lt;p&gt;Define what "done" means for each task—this is critical.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1: Functional verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works as user-facing feature&lt;/td&gt;
&lt;td&gt;Search actually returns results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2: Test verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New tests added and passing&lt;/td&gt;
&lt;td&gt;Type definition tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3: Build verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No compilation errors&lt;/td&gt;
&lt;td&gt;Interface definition complete&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Priority: L1 &amp;gt; L2 &amp;gt; L3&lt;/strong&gt;. Whenever possible, verify at L1 (actually works in practice).&lt;/p&gt;

&lt;p&gt;This directly maps to "external feedback" from the previous articles. Defining completion levels upfront ensures you get external verification at each checkpoint.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 6: Integration Points
&lt;/h4&gt;

&lt;p&gt;Define when to verify things work together.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Integration Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Feature-driven&lt;/td&gt;
&lt;td&gt;When users can actually use the feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Foundation-driven&lt;/td&gt;
&lt;td&gt;When all layers are complete and E2E tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strangler pattern&lt;/td&gt;
&lt;td&gt;At each old-to-new system cutover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without defined integration points, you end up with "it all works individually but doesn't work together."&lt;/p&gt;




&lt;h3&gt;
  
  
  Task Decomposition Principles
&lt;/h3&gt;

&lt;p&gt;After considering the perspectives, break down into concrete tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executable granularity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each task = one meaningful commit&lt;/li&gt;
&lt;li&gt;Clear completion criteria&lt;/li&gt;
&lt;li&gt;Explicit dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minimize dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum 2 levels deep (A→B→C is okay, A→B→C→D needs redesign)&lt;/li&gt;
&lt;li&gt;Tasks with 3+ chained dependencies should be split&lt;/li&gt;
&lt;li&gt;Each task should ideally provide independent value&lt;/li&gt;
&lt;/ul&gt;
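
&lt;p&gt;The depth rule is mechanical enough to check. A toy example, with placeholder task names (A→B→C→D mirrors the "needs redesign" case above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// deps maps each task to the tasks it depends on (names are placeholders).
const deps: { [task: string]: string[] } = {
  A: [],
  B: ["A"],
  C: ["B"], // A→B→C: chain of 3 tasks, still fine
  D: ["C"], // A→B→C→D: chain of 4 tasks, should be split
};

// Number of tasks on the longest dependency chain ending at the given task.
function chainDepth(task: string): number {
  const parents = deps[task] ?? [];
  if (parents.length === 0) return 1;
  return 1 + Math.max(...parents.map(chainDepth));
}

for (const task of Object.keys(deps)) {
  if (chainDepth(task) > 3) {
    console.log(task + ": 3+ chained dependencies, split or redesign");
  }
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I don't literally run this on every plan, but it captures the rule: flag any task sitting at the end of a chain longer than three tasks.&lt;/p&gt;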

&lt;p&gt;&lt;strong&gt;Build quality in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't make "write tests" a separate task—include testing in the implementation task&lt;/li&gt;
&lt;li&gt;Tag each task with its completion level (L1/L2/L3, though in practice L1 is almost always what you want)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Work Planning Anti-Patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skip current-state analysis&lt;/td&gt;
&lt;td&gt;Plan doesn't fit codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore risks&lt;/td&gt;
&lt;td&gt;Expensive surprises mid-implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore constraints&lt;/td&gt;
&lt;td&gt;Plan isn't executable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Over-detail&lt;/td&gt;
&lt;td&gt;Lose flexibility, waste planning time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Undefined completion criteria&lt;/td&gt;
&lt;td&gt;"Done" is ambiguous, verification impossible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Scaling to Task Size
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs full work planning.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Planning Depth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small (1-2 hours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verbal/mental notes or simple TODO list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium (1 day to 1 week)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Written work plan, but abbreviated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Large (1+ weeks)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full work plan covering all perspectives&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a typo fix, you don't need a work plan. For a multi-week refactor, you absolutely do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Execution
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Implement according to the work plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work in Small Steps
&lt;/h3&gt;

&lt;p&gt;Follow the plan. One task at a time. One file, one function at a time where appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types-First
&lt;/h3&gt;

&lt;p&gt;When adding new functionality, define interfaces and types before implementing logic. Type definitions become guardrails that help both you and the LLM stay on track.&lt;/p&gt;
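
&lt;p&gt;A minimal illustration (the service and type names here are invented for the example, not from a real codebase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Define the contract first; the implementation comes after.
interface ImageRequest {
  prompt: string;
  size: "512x512" | "1024x1024";
}

interface ImageResult {
  url: string;
  model: string;
}

// The signature is now a guardrail: whatever the LLM writes here
// must turn an ImageRequest into an ImageResult, or it won't compile.
function generateImage(req: ImageRequest): ImageResult {
  // ...real logic filled in later, checked against the types above
  return { url: "https://example.com/" + req.prompt + ".png", model: "stub" };
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the interfaces agreed on first, a wrong implementation tends to fail the type check instead of failing silently at runtime.&lt;/p&gt;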

&lt;h3&gt;
  
  
  Why This Changes Everything
&lt;/h3&gt;

&lt;p&gt;With a work plan in place, execution becomes &lt;em&gt;verification&lt;/em&gt;. The LLM isn't guessing what to build—it's checking whether the implementation matches the plan.&lt;/p&gt;

&lt;p&gt;If you need to deviate from the plan, &lt;strong&gt;update the plan first&lt;/strong&gt;, then continue implementation. Don't let plan and implementation drift apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Verification &amp;amp; Feedback
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Verify results and externalize learnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feedback Format
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, don't just paste an error. Include the intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[error log]

✅ Intent + error
Goal: Redirect to dashboard after authentication
Issue: Following error occurs
[error log]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without intent, the LLM optimizes for "remove the error." With intent, it optimizes for "achieve the goal."&lt;/p&gt;

&lt;h3&gt;
  
  
  Externalize Learnings
&lt;/h3&gt;

&lt;p&gt;If you find yourself explaining the same thing twice, it's time to write it down.&lt;/p&gt;

&lt;p&gt;I covered this in detail in the previous article—where to put rules, what to write, and how to verify they work. The short version: write root causes, not specific incidents, and put them where they'll actually be read.&lt;/p&gt;




&lt;h2&gt;
  
  
  Referencing Skills and Rules
&lt;/h2&gt;

&lt;p&gt;One common failure mode: you reference a skill or rule file, but the LLM just reads it and moves on without actually applying it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write "see AGENTS.md"&lt;/td&gt;
&lt;td&gt;It's already loaded—redundant reference adds noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;@file.md&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;LLM reads it, then continues. Reading ≠ applying&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Please reference X"&lt;/td&gt;
&lt;td&gt;References it minimally, doesn't apply the content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Solution: Blocking References
&lt;/h3&gt;

&lt;p&gt;Make the reference a task with verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Required Rules [MANDATORY - MUST BE ACTIVE]&lt;/span&gt;

&lt;span class="gs"&gt;**LOADING PROTOCOL:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; STEP 1: CHECK if &lt;span class="sb"&gt;`.agents/skills/coding-rules/SKILL.md`&lt;/span&gt; is active
&lt;span class="p"&gt;-&lt;/span&gt; STEP 2: If NOT active → Execute BLOCKING READ
&lt;span class="p"&gt;-&lt;/span&gt; STEP 3: CONFIRM skill active before proceeding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action verbs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"CHECK", "READ", "CONFIRM"—not just "reference"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STEP numbers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forces sequence, can't skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Before proceeding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocking—must complete before continuing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;If NOT active&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conditional—skips if already loaded (efficiency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This maps to the task clarity principle: "check if loaded → load if needed → confirm → proceed" is far clearer than "please reference this file."&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Connects to the Theory
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Connection to LLM Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Step 1: Preparation&lt;/td&gt;
&lt;td&gt;Task clarification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 2: Design&lt;/td&gt;
&lt;td&gt;Artifact-first (design doc is an artifact)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 3: Work Planning&lt;/td&gt;
&lt;td&gt;Artifact-first (plan is an artifact) + external feedback design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 4: Execution&lt;/td&gt;
&lt;td&gt;Transform "generation" into "verification against plan"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 5: Verification&lt;/td&gt;
&lt;td&gt;Obtain external feedback + externalize learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The work plan created in Step 3 converts Step 4 from "generate from scratch" to "verify against specification." This is the key mechanism for improving accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article aren't just workflow opinions—they're backed by research on how LLM agents perform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADaPT (Prasad et al., NAACL 2024)&lt;/strong&gt;: Separating planning and execution, with dynamic subtask decomposition when needed, achieved up to 33% higher success rates than baselines (28.3% on ALFWorld, 27% on WebShop, 33% on TextCraft).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan-and-Execute (LangChain)&lt;/strong&gt;: Explicit long-term planning enables handling complex tasks that even powerful LLMs struggle with in step-by-step mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Task Decomposition (PMC, 2024)&lt;/strong&gt;: Step-by-step models generate more accurate results than direct generation—task decomposition directly improves output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Decomposition (Amazon Science, 2025)&lt;/strong&gt;: With proper task decomposition, smaller specialized models can match the performance of larger general models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't let it execute immediately.&lt;/strong&gt; Ask for a plan first. Even just "present your approach step-by-step before implementing" makes a significant difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work Planning is the superpower.&lt;/strong&gt; A plan is an artifact. Having it converts execution from generation to verification—and LLMs are better at verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define completion criteria.&lt;/strong&gt; L1 (works as feature) &amp;gt; L2 (tests pass) &amp;gt; L3 (builds). Know what "done" means before starting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to task size.&lt;/strong&gt; Small task = mental note. Large task = full work plan. Don't over-plan trivial work, don't under-plan complex work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update plan before deviating.&lt;/strong&gt; If implementation needs to differ from the plan, update the plan first. Drift kills the verification benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include intent with errors.&lt;/strong&gt; "Goal + error" beats "just error." The LLM should know what you're trying to achieve, not just what went wrong.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prasad, A., et al. (2024). "ADaPT: As-Needed Decomposition and Planning with Language Models." NAACL 2024 Findings. arXiv:2311.05772&lt;/li&gt;
&lt;li&gt;Wang, L., et al. (2023). "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." ACL 2023.&lt;/li&gt;
&lt;li&gt;LangChain. "Plan-and-Execute Agents." &lt;a href="https://blog.langchain.com/planning-agents/" rel="noopener noreferrer"&gt;https://blog.langchain.com/planning-agents/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Stop Guessing If Your Prompt Is Better</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 22 Jan 2026 13:52:50 +0000</pubDate>
      <link>https://dev.to/shinpr/stop-guessing-if-your-prompt-is-better-5amb</link>
      <guid>https://dev.to/shinpr/stop-guessing-if-your-prompt-is-better-5amb</guid>
      <description>&lt;p&gt;You rewrote your prompt. The output looks different. But is it actually &lt;em&gt;better&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Most of us have been there — reading prompt engineering best practices, tweaking instructions, and hoping the changes help. But without comparison, you're just guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When you improve a prompt, you typically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the new version&lt;/li&gt;
&lt;li&gt;Look at the output&lt;/li&gt;
&lt;li&gt;Think "yeah, this seems better"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But you're comparing against your memory of the old output. Different runs produce different results anyway. How do you know the improvement came from your changes and not just LLM variance?&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt; is a Claude Code plugin that focuses on one practical question: &lt;strong&gt;"Did my instruction change actually affect the result?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It analyzes your prompt, generates an optimized version, runs both in isolated environments, and compares the actual results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;

&lt;p&gt;I ran this prompt through rashomon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add logging to track function calling usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A reasonable instruction. But vague.&lt;/p&gt;
&lt;h3&gt;
  
  
  What rashomon detected
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vague instructions&lt;/td&gt;
&lt;td&gt;What, where, and why to log are unclear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No output format&lt;/td&gt;
&lt;td&gt;Log structure not specified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing context&lt;/td&gt;
&lt;td&gt;No project architecture information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The optimized prompt
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Context

This is a Slack bot using Google Gemini API with function calling. 
The project uses a shared `logger` utility with structured logging.
Function calling flows through:
1. `GeminiService.executeWithRetry()` - detects function calls
2. `FunctionHandler.handleFunctionCall()` - executes them

## Task

Add logging to track function calling usage for analytics and debugging.

## Requirements

At Function Call Detection (GeminiService):
- Function name(s) detected
- Number of function calls in response

At Function Execution (FunctionHandler):
- Parameters passed (sanitized - exclude sensitive data)
- Execution duration
- Result status (success/failure)

## Output Format

logger.info('Function call detected', {
  functionName: 'executeWithRetry',
  detectedFunctions: ['searchNotionPages'],
  functionCallCount: 1
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Original&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logging Scope&lt;/td&gt;
&lt;td&gt;1 stage (execution only)&lt;/td&gt;
&lt;td&gt;2 stages (detection + execution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parameter Sanitization&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Passwords, tokens, secrets redacted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Files Modified&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original prompt &lt;em&gt;looked&lt;/em&gt; reasonable, but led the agent to log at only one point. The optimized version covered both detection and execution — with security considerations the original didn't address.&lt;/p&gt;
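&lt;p&gt;For the sanitization requirement, the kind of helper the optimized prompt implies might look like this (the key list and function name are illustrative, not part of rashomon's output):&lt;/p&gt;

```typescript
// Redact common secret-bearing keys before logging function-call parameters.
// The key list is illustrative; a real project would tune it.
const SENSITIVE_KEYS = ["password", "token", "secret", "apiKey"];

function sanitizeParams(params: { [key: string]: unknown }): { [key: string]: unknown } {
  const out: { [key: string]: unknown } = {};
  for (const key of Object.keys(params)) {
    // Replace sensitive values, pass everything else through unchanged.
    out[key] = SENSITIVE_KEYS.includes(key) ? "[REDACTED]" : params[key];
  }
  return out;
}
```

&lt;p&gt;The point is that the requirement "sanitized - exclude sensitive data" is concrete enough for the agent to produce something like this; the original prompt never surfaced the concern at all.&lt;/p&gt;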

&lt;p&gt;&lt;strong&gt;Classification: Structural Improvement&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  About Variance
&lt;/h2&gt;

&lt;p&gt;Not every difference is an improvement. rashomon distinguishes between structural gains and mere variance.&lt;/p&gt;

&lt;p&gt;I tried to create a Variance example — a prompt so clear that optimization wouldn't matter. I couldn't. In practice, the same vague prompt sometimes works beautifully, sometimes completely misses the point.&lt;/p&gt;

&lt;p&gt;rashomon just makes that inconsistency visible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Requires &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session&lt;/span&gt;
/rashomon Your prompt here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;
        rashomon
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Compare, improve, and verify prompt changes with evidence — not vibes.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/shinpr/rashomon/assets/rashomon-banner.jpg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fshinpr%2Frashomon%2Fassets%2Frashomon-banner.jpg" width="600" alt="Rashomon"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
  &lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/shinpr/rashomon/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d6bc2b26794002c24d023acaab01b6dbb953c57ab9cb80ba5b8aa2f2bd5de99a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c7565" alt="License"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;See what actually changes when you improve your prompts — not just different wording.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why rashomon?&lt;/h2&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Inspired by the &lt;em&gt;Rashomon effect&lt;/em&gt; — the idea that the same event can produce different outcomes depending on perspective.
rashomon makes those differences explicit and comparable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Spending too much time on trial-and-error with prompts?&lt;/li&gt;
&lt;li&gt;Read best practices but not sure how they apply to your case?&lt;/li&gt;
&lt;li&gt;Want proof that your changes actually made things better?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;rashomon&lt;/strong&gt; analyzes, improves, and compares prompts—so you can see what &lt;em&gt;actually&lt;/em&gt; changed, and whether it matters.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Who Is This For?&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;rashomon is designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers using Claude Code daily&lt;/li&gt;
&lt;li&gt;Teams iterating on complex prompts (coding, analysis, writing)&lt;/li&gt;
&lt;li&gt;Anyone who wants &lt;strong&gt;evidence&lt;/strong&gt;, not vibes, when improving prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not ideal if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't use git&lt;/li&gt;
&lt;li&gt;You want one-shot prompt rewriting without comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Example&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;/rashomon Write a function to sort an array
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;What You Get&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1. Detected Issues&lt;/strong&gt;&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;- BP-002&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Putting Everything in AGENTS.md</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 19 Jan 2026 14:13:49 +0000</pubDate>
      <link>https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl</link>
      <guid>https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl</guid>
      <description>&lt;p&gt;If you're using Agentic Coding and find yourself explaining the same thing to the LLM over and over, you have a learning externalization problem.&lt;/p&gt;

&lt;p&gt;The fix seems obvious: write it down in AGENTS.md (or CLAUDE.md, depending on your tool) and never explain it again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This article uses "AGENTS.md" as the generic term for root instruction files. Claude Code uses CLAUDE.md, Codex uses AGENTS.md, and other tools have their own conventions. The principles apply regardless of the specific filename.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But here's what actually happens—you keep adding rules, AGENTS.md grows to 200+ lines, and somehow the LLM still ignores half of what you wrote.&lt;/p&gt;

&lt;p&gt;This article is about how to actually make your rules stick: &lt;strong&gt;where&lt;/strong&gt; to write them, &lt;strong&gt;what&lt;/strong&gt; to write, and how to verify they work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;LLMs don't learn across sessions. Every conversation starts fresh. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You explain something once&lt;/li&gt;
&lt;li&gt;It works&lt;/li&gt;
&lt;li&gt;Next session, you explain it again&lt;/li&gt;
&lt;li&gt;And again&lt;/li&gt;
&lt;li&gt;Eventually you get frustrated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to externalize your learnings into rules. But most people do this wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Mistakes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mistake&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;It bloats, becomes noise, important rules get buried&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM doesn't load them into context unless you explicitly reference the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Don't write it down at all&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You repeat yourself forever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thing is, &lt;strong&gt;where you write a rule determines whether the LLM actually follows it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not all rules belong in the same place. A simple decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When is this rule needed?
│
├─ Always, on every task → AGENTS.md
│
├─ When working on a specific feature → Design Doc
│
├─ When using a specific technology → Rule file (skill)
│
└─ When performing a specific task type → Task guidelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: "Skills" are modular rule files used in tools like Codex and Claude Code. They allow you to inject context-specific rules only when relevant. If your tool doesn't have this concept, think of them as separate rule files you reference when needed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Task guidelines" refers to rules that apply only during specific operations—like code review, migration, or content generation. Some call these "task rules" or "task-specific constraints."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full Picture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All tasks&lt;/td&gt;
&lt;td&gt;Always&lt;/td&gt;
&lt;td&gt;Approval flows, stop conditions, project principles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule files (skills)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific technology area&lt;/td&gt;
&lt;td&gt;When using that tech&lt;/td&gt;
&lt;td&gt;Type conventions, error handling patterns, function size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task guidelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific task type&lt;/td&gt;
&lt;td&gt;When doing that task&lt;/td&gt;
&lt;td&gt;Subagent usage rules, review procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design docs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific feature&lt;/td&gt;
&lt;td&gt;When developing that feature&lt;/td&gt;
&lt;td&gt;Feature requirements, API specs, security constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific code location&lt;/td&gt;
&lt;td&gt;When modifying that code&lt;/td&gt;
&lt;td&gt;Implementation rationale, gotchas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Key Question
&lt;/h3&gt;

&lt;p&gt;Ask yourself: &lt;strong&gt;"Is this needed on &lt;em&gt;every&lt;/em&gt; task in this project?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; → AGENTS.md&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No&lt;/strong&gt; → Put it closer to where it's needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps AGENTS.md lean (around 100 lines) and ensures task-specific rules don't create noise for unrelated work.&lt;/p&gt;

&lt;p&gt;You don't need to get this perfect from day one. Start with one thing: keep AGENTS.md small. That alone changes a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Write
&lt;/h2&gt;

&lt;p&gt;This is the hard part. Most people write the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Write Root Causes, Not Incidents
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, the instinct is to document the specific incident. But this creates bias—the LLM over-fits to that one case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Bad (specific incident)
"The getUser() function in UserService was missing null check"

✅ Good (root cause / system fix)
"Always null-check return values from external APIs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one only helps if the LLM encounters that exact function again. The second one prevents the entire &lt;em&gt;class&lt;/em&gt; of errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specific Incident vs. Root Cause
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Specific Incident&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Applies to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;That one location&lt;/td&gt;
&lt;td&gt;All similar cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prevents recurrence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weakly (same bug elsewhere)&lt;/td&gt;
&lt;td&gt;Strongly (operates as principle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bias risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (overfitting)&lt;/td&gt;
&lt;td&gt;Low (generalizable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Finding the Root Cause
&lt;/h3&gt;

&lt;p&gt;When you encounter an issue, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why did this mistake happen?&lt;/strong&gt; (direct cause)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why wasn't it prevented?&lt;/strong&gt; (system gap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where else could this same mistake occur?&lt;/strong&gt; (scope)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct cause: &lt;code&gt;getUser()&lt;/code&gt; was missing null check&lt;/li&gt;
&lt;li&gt;System gap: We trusted external API return values without validation&lt;/li&gt;
&lt;li&gt;Scope: All external API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;Rule to write&lt;/strong&gt;: "Always null-check return values from external APIs"&lt;/p&gt;
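&lt;p&gt;A minimal sketch of that rule applied in TypeScript (the API client here is a hypothetical stand-in, not a real library):&lt;/p&gt;

```typescript
// Hypothetical external API client that may return null for a missing user.
type User = { id: string; name: string };

function fetchUserFromApi(id: string): User | null {
  // Stand-in for a real HTTP call; returns null when the user is absent.
  return id === "known" ? { id, name: "Ada" } : null;
}

// Rule applied: never trust an external return value without a null check.
function getUserName(id: string): string {
  const user = fetchUserFromApi(id);
  if (user === null) {
    return "unknown user"; // explicit fallback instead of a runtime crash
  }
  return user.name;
}
```

&lt;p&gt;The rule generalizes because it targets the system gap (unvalidated external data), not the one function where the bug happened to surface.&lt;/p&gt;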




&lt;h2&gt;
  
  
  How to Verify Rules Work
&lt;/h2&gt;

&lt;p&gt;This is the step most people skip—and it's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Fix the System, Then Discard and Retry
&lt;/h3&gt;

&lt;p&gt;When you add or modify a rule in AGENTS.md or a skill file, you need to verify it actually works. The only way to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add/modify the rule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discard&lt;/strong&gt; the current artifact (or stash it in a branch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start a new session&lt;/strong&gt; with the updated rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-run the same task&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; the issue doesn't recur
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continue with existing artifact after rule change → ❌
Discard and restart with new rules → ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;If you keep the existing artifact and just continue, you're still operating in a context polluted by the old system. The new rule might not get properly applied because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The existing artifact carries biases from before the rule existed&lt;/li&gt;
&lt;li&gt;The LLM might try to "reconcile" the new rule with existing work rather than applying it cleanly&lt;/li&gt;
&lt;li&gt;You can't tell if the rule actually works or if you just manually fixed the symptom&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Verification Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Modified the rule (AGENTS.md / skill file / task guideline)&lt;/li&gt;
&lt;li&gt;[ ] Discarded current artifact (or moved to a branch)&lt;/li&gt;
&lt;li&gt;[ ] Started new session with updated rules&lt;/li&gt;
&lt;li&gt;[ ] Re-ran the same task&lt;/li&gt;
&lt;li&gt;[ ] Confirmed the issue doesn't recur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small changes, you can stash instead of discard. The key is: &lt;strong&gt;test the system in isolation&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not every issue deserves a rule. Some guidance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Write a Rule?&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You explained the same thing twice&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent the third time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encountered unexpected behavior&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Find root cause first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completed successfully&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrospective—any generalizable insights?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Found a serious bug&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent recurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Warning Signs You're Over-Documenting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AGENTS.md exceeds &lt;strong&gt;100 lines&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single rule file exceeds &lt;strong&gt;300 lines&lt;/strong&gt; (~1,500 tokens)&lt;/li&gt;
&lt;li&gt;Rules take more than 1 minute to read through&lt;/li&gt;
&lt;li&gt;You find yourself thinking "is this really needed every time?"&lt;/li&gt;
&lt;li&gt;Rules contradict each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these signs, it's time to prune. &lt;strong&gt;Rule maintenance includes deletion.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Write Rules (Cheat Sheet)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This section is a reference.&lt;/strong&gt; You don't need to read it all now—come back when you're actually writing a rule. The rest of the article stands on its own.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Minimum Viable Length
&lt;/h3&gt;

&lt;p&gt;Context is precious. Same meaning, shorter expression. But don't sacrifice clarity for brevity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Verbose (38 chars)
If an error occurs, you must always log it

✅ Concise (20 chars)
All errors must be logged

❌ Too short (unclear)
Log errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. No Duplication
&lt;/h3&gt;

&lt;p&gt;Same content in multiple places wastes context and creates update drift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Duplicated
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }

&lt;span class="gh"&gt;# api.md&lt;/span&gt;
Errors use { success: false, error: string } format

✅ Single source
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Measurable Criteria
&lt;/h3&gt;

&lt;p&gt;Vague instructions create interpretation variance. Use numbers and specific conditions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Measurable
&lt;span class="p"&gt;-&lt;/span&gt; Functions: max 30 lines
&lt;span class="p"&gt;-&lt;/span&gt; Cyclomatic complexity: max 10
&lt;span class="p"&gt;-&lt;/span&gt; Test coverage: min 80%

❌ Vague
&lt;span class="p"&gt;-&lt;/span&gt; Readable code
&lt;span class="p"&gt;-&lt;/span&gt; Sufficient testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Recommendations Over Prohibitions
&lt;/h3&gt;

&lt;p&gt;Banning things without alternatives leaves the LLM guessing. Show the right way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Recommendation + rationale
【State Management】
Recommended: Zustand or Context API
Reason: Global variables make testing difficult, state tracking complex
Avoid: window.globalState = { ... }

❌ Prohibition list
&lt;span class="p"&gt;-&lt;/span&gt; Don't use global variables
&lt;span class="p"&gt;-&lt;/span&gt; Don't store values on window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Priority Order
&lt;/h3&gt;

&lt;p&gt;LLMs pay more attention to what comes first. Lead with the most important rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Critical (Must Follow)&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; All APIs require JWT authentication
&lt;span class="p"&gt;2.&lt;/span&gt; Rate limit: 100 requests/minute

&lt;span class="gu"&gt;## Standard Specs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Methods: Follow REST principles
&lt;span class="p"&gt;-&lt;/span&gt; Body: JSON format

&lt;span class="gu"&gt;## Edge Cases (Only When Applicable)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; File uploads may use multipart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Clear Scope Boundaries
&lt;/h3&gt;

&lt;p&gt;State what the rule covers—and what it doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Scope&lt;/span&gt;

&lt;span class="gu"&gt;### Applies To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; REST API endpoints
&lt;span class="p"&gt;-&lt;/span&gt; GraphQL endpoints

&lt;span class="gu"&gt;### Does Not Apply To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Static file serving
&lt;span class="p"&gt;-&lt;/span&gt; Health checks (/health)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Feedback Loop
&lt;/h2&gt;

&lt;p&gt;This is how it all fits together in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Working with LLM]
       │
       ├─ Issue occurs
       │      │
       │      ▼
       │  Find root cause (not just symptom)
       │      │
       │      ▼
       │  Decide where to write (AGENTS.md? Skill? Task guideline?)
       │      │
       │      ▼
       │  Write the rule
       │      │
       │      ▼
       │  Discard current work
       │      │
       │      ▼
       │  New session with updated rules
       │      │
       │      ▼
       │  Verify issue doesn't recur
       │
       ▼
[Continue working]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
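&lt;p&gt;The "verify issue doesn't recur" step can be automated as a small regression harness. A minimal sketch in TypeScript, under the assumption of a &lt;code&gt;runTask&lt;/code&gt; function (hypothetical, not from any specific tool) that executes a task in a fresh session with the updated rules and returns its raw output:&lt;/p&gt;

```typescript
// Sketch: after updating a rule, re-run the task that originally failed
// and check whether the original problem resurfaces.
// `runTask` is an assumption: any function that executes a task in a
// fresh session (new context, updated rules) and returns raw output.
// A real version would likely be async.
type RegressionCase = {
  rule: string                            // rule that was added or changed
  reproTask: string                       // task that originally triggered the issue
  recurred: (output: string) => boolean   // detects the original problem
}

function checkRules(
  cases: RegressionCase[],
  runTask: (task: string) => string,
): string[] {
  const failing: string[] = []
  for (const c of cases) {
    if (c.recurred(runTask(c.reproTask))) failing.push(c.rule)
  }
  return failing // rules whose original issue came back
}
```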



&lt;p&gt;The goal is to reach a state where &lt;strong&gt;you never explain the same thing twice&lt;/strong&gt;. Every explanation either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets externalized into a rule, or&lt;/li&gt;
&lt;li&gt;Was truly a one-off that doesn't need capturing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Passing Feedback Correctly
&lt;/h2&gt;

&lt;p&gt;One more thing: when you give feedback to the LLM, don't just paste error logs. Include your &lt;em&gt;intent&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[Stack trace]

✅ Intent + error
Goal: Redirect to dashboard after user authentication
Issue: Following error occurred
[Stack trace]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the intent, the LLM optimizes for "make the error go away." With the intent, it optimizes for "achieve the goal while resolving this error."&lt;/p&gt;

&lt;p&gt;These are very different things.&lt;/p&gt;
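&lt;p&gt;This is easy to enforce mechanically. A minimal sketch of a helper that refuses to forward a bare stack trace (the function name and message shape are my own, not a required format):&lt;/p&gt;

```typescript
// Sketch: a feedback message that always pairs the error with the goal.
// The message shape is illustrative, not a required format.
function buildFeedback(goal: string, stackTrace: string): string {
  if (goal.trim().length === 0) {
    throw new Error("Refusing to send a bare stack trace: state the goal first")
  }
  return [`Goal: ${goal}`, "Issue: Following error occurred", stackTrace].join("\n")
}
```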




&lt;h2&gt;
  
  
  Anti-Pattern Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Quick reference if you want to check your current practices:&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Reference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Put everything in AGENTS.md&lt;/td&gt;
&lt;td&gt;→ "Where to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write specific incidents instead of root causes&lt;/td&gt;
&lt;td&gt;→ "What to Write"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue with old artifacts after changing rules&lt;/td&gt;
&lt;td&gt;→ "How to Verify Rules Work"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List only prohibitions without recommendations&lt;/td&gt;
&lt;td&gt;→ "How to Write Rules" #4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keep explaining instead of writing it down&lt;/td&gt;
&lt;td&gt;→ "When to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AGENTS.md is not a dumping ground.&lt;/strong&gt; Only rules needed on &lt;em&gt;every&lt;/em&gt; task belong there. Everything else goes closer to where it's used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write root causes, not incidents.&lt;/strong&gt; "Null-check external API returns" beats "UserService.getUser() was missing null check."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test your rules.&lt;/strong&gt; After adding a rule, discard current work and re-run. If the issue recurs, the rule isn't working.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance includes deletion.&lt;/strong&gt; If AGENTS.md is over 100 lines, you've probably over-documented. Prune ruthlessly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explain twice, document once.&lt;/strong&gt; If you're explaining the same thing for a second time, stop and externalize it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you stop expecting rules alone to do the work, the real question becomes how to design the workflow around them. In practice, that starts with &lt;a href="https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm"&gt;planning—before execution ever begins&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article are grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SALAM (Wang et al., 2023)&lt;/strong&gt;: LLM self-feedback is often inaccurate. Structured feedback from external agents (or externalized rules) is more effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LEMA (An et al., 2023)&lt;/strong&gt;: Learning from mistakes (error → explanation → correction) improves LLM reasoning ability—but this requires explicit externalization of what was learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loop for IaC (Palavalli et al., 2024)&lt;/strong&gt;: The gains from each feedback iteration shrink rapidly and soon plateau. This supports the "discard and restart" approach over endless iteration in the same context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflexion (Shinn et al., 2023)&lt;/strong&gt;: Combining short-term memory (recent trajectory) with long-term memory (past experience) enables effective self-improvement. Externalized rules function as that long-term memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Wang, D., et al. (2023). "Learning from Mistakes via Cooperative Study Assistant for Large Language Models." arXiv:2305.13829&lt;/li&gt;
&lt;li&gt;An, S., et al. (2023). "Learning From Mistakes Makes LLM Better Reasoner." arXiv:2310.20689&lt;/li&gt;
&lt;li&gt;Palavalli, M. A., et al. (2024). "Using a Feedback Loop for LLM-based Infrastructure as Code Generation." arXiv:2411.19043&lt;/li&gt;
&lt;li&gt;Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why LLMs Are Bad at "First Try" and Great at Verification</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:46:19 +0000</pubDate>
      <link>https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</link>
      <guid>https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</guid>
      <description>&lt;p&gt;I used to spend hours crafting the perfect prompt.&lt;br&gt;
Detailed instructions, examples, constraints—the works.&lt;/p&gt;

&lt;p&gt;And the AI would still add random features I never asked for.&lt;br&gt;
Or refactor code that was perfectly fine.&lt;br&gt;
Or skip steps it decided were "unnecessary."&lt;/p&gt;

&lt;p&gt;Eventually it clicked: I was fighting a losing battle.&lt;br&gt;
So I stopped trying to control generation and started focusing on verification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Patterns You've Probably Seen
&lt;/h2&gt;

&lt;p&gt;Before diving into the why, here are the common anti-patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Giant Prompt Syndrome&lt;/strong&gt;: Cramming requirements, design, implementation, and improvement into a single prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence in Abstract Instructions&lt;/strong&gt;: Expecting "think carefully" or "be thorough" to actually improve quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Invisible Loop&lt;/strong&gt;: Thinking you're iterating when you're actually spinning in circles within the same biased context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Bloat&lt;/strong&gt;: Adding "just in case" information until the actually important instructions get buried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these sound familiar, you're in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;The claim is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs perform better at "verify and improve existing artifacts" than at "controlled first-time generation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of trying to get the perfect output on the first attempt, you get better results by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Having the LLM produce &lt;em&gt;something&lt;/em&gt; first&lt;/li&gt;
&lt;li&gt;Then having it verify and improve that output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is grounded in how LLMs actually process information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Verification Works Better
&lt;/h2&gt;

&lt;p&gt;At first, I assumed better prompts would lead to better first-shot output. But after enough failures, the pattern became clear: there are three interconnected reasons why LLMs become "smarter" when they have something to work with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. External Feedback Changes the Task
&lt;/h3&gt;

&lt;p&gt;When an artifact exists, the task fundamentally transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without artifact&lt;/strong&gt;: "Generate something good" (vague, open-ended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With artifact&lt;/strong&gt;: "Identify what's wrong with this and fix it" (specific, bounded)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second task has clearer success criteria. The LLM isn't guessing what "good" means—it can evaluate concrete issues against concrete output.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Position Bias (Lost in the Middle)
&lt;/h3&gt;

&lt;p&gt;Research has shown that LLMs exhibit a U-shaped attention pattern: they prioritize information at the &lt;strong&gt;beginning&lt;/strong&gt; and &lt;strong&gt;end&lt;/strong&gt; of their context window, while information in the &lt;strong&gt;middle&lt;/strong&gt; tends to get overlooked.&lt;/p&gt;

&lt;p&gt;When you feed an artifact as input to a new session, it naturally occupies a prominent position in the context. The LLM is literally forced to pay attention to it.&lt;/p&gt;

&lt;p&gt;This also explains why that really important instruction you buried in paragraph 5 of your mega-prompt keeps getting ignored.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Task Clarity Drives Performance
&lt;/h3&gt;

&lt;p&gt;"Improve this code" is a more concrete task than "write good code."&lt;/p&gt;

&lt;p&gt;The presence of an artifact provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A specific target for evaluation&lt;/li&gt;
&lt;li&gt;Clear boundaries for the scope of work&lt;/li&gt;
&lt;li&gt;Implicit success criteria (this one matters more than you'd think—"better than before" is much easier to verify than "good enough")&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Externality Spectrum
&lt;/h2&gt;

&lt;p&gt;What made the biggest difference for me was reviewing in a completely separate context.&lt;br&gt;
Once I stopped letting the generator review its own work, the blind spots became obvious.&lt;/p&gt;

&lt;p&gt;Not all feedback loops are created equal; approaches vary widely in effectiveness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" alt="Verification methods effectiveness spectrum from external signals to self-introspection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The thing is, looping within the same session is fundamentally &lt;em&gt;internal&lt;/em&gt; feedback. The LLM is still operating within its original generation biases. Only by separating context do you get true "external" perspective.&lt;/p&gt;

&lt;p&gt;In short: if the context doesn't change, neither does the model's perspective.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Implications
&lt;/h2&gt;

&lt;p&gt;So what do you actually do with this?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Artifact-First Workflow
&lt;/h3&gt;

&lt;p&gt;Stop trying to get everything right in one shot. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate Phase&lt;/strong&gt;: Get &lt;em&gt;something&lt;/em&gt; out, even if imperfect. Don't over-specify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Feedback&lt;/strong&gt;: Run the code, execute tests, use linters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Phase&lt;/strong&gt; (new session): Feed the artifact + feedback to a fresh context
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Generation Session]
    │
    ├── Input: Requirements, constraints
    ├── Output: Artifact (code, design, etc.) + brief intent summary (1-3 lines)
    │
    ▼
[External Feedback]
    │
    ├── Code execution
    ├── Test execution
    ├── Linter/static analysis
    │
    ▼
[Verification Session]  ← Fresh context
    │
    ├── Input: Artifact + intent summary + feedback results
    ├── Output: Improved artifact
    │
    ▼
[Repeat or Complete]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
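&lt;p&gt;The "External Feedback" stage in the diagram above is plain automation. A sketch using Node's &lt;code&gt;child_process&lt;/code&gt;; the check list is a placeholder for whatever your project actually runs (tests, linters, type checks):&lt;/p&gt;

```typescript
import { spawnSync } from "node:child_process"

// Sketch: run external checks and collect machine-readable results for the
// verification session. The command list is project-specific.
type CheckResult = { name: string; ok: boolean; output: string }

function collectFeedback(
  checks: { name: string; cmd: string; args: string[] }[],
): CheckResult[] {
  return checks.map(({ name, cmd, args }) => {
    const r = spawnSync(cmd, args, { encoding: "utf8" })
    return {
      name,
      ok: r.status === 0,
      // keep stderr too: test runners and linters often report there
      output: `${r.stdout ?? ""}${r.stderr ?? ""}`.trim(),
    }
  })
}
```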



&lt;h3&gt;
  
  
  2. Know When to Separate Sessions
&lt;/h3&gt;

&lt;p&gt;Session separation isn't always necessary. Use your judgment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small, localized fixes (typos, formatting)&lt;/td&gt;
&lt;td&gt;Same session is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clear error fixes (with stack trace)&lt;/td&gt;
&lt;td&gt;Same session works—external feedback (error log) exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design changes, architecture revisions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality improvements, refactoring&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction changes, requirement pivots&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: If you're feeling "something isn't working," that's often a sign to start a fresh session. Your intuition about context pollution is usually right.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What to Pass Between Sessions
&lt;/h3&gt;

&lt;p&gt;Not everything from the generation phase should go to the verification phase.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Should Pass?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full Chain-of-Thought log&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Verbose, becomes noise. Important info gets lost (position bias)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent summary (1-3 lines)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Preserves the "why" compactly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final decision + rationale&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Useful for debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rejected alternatives&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Only when specifically relevant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The principle: &lt;strong&gt;Pass the "why," not the "how I thought about it."&lt;/strong&gt;&lt;/p&gt;
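&lt;p&gt;That principle can be enforced in code. A sketch of a handoff builder; the 1-3 line cap follows the table above, while the field names are my own:&lt;/p&gt;

```typescript
// Sketch: the payload that crosses the session boundary. Passes the "why"
// (intent, decision) and structurally has no room for the thought log.
type Handoff = {
  artifact: string   // the code/design being verified
  intent: string     // 1-3 lines: what this was trying to achieve
  decision: string   // final decision + rationale
}

function buildHandoff(artifact: string, intent: string, decision: string): Handoff {
  const lines = intent.split("\n").filter((l) => l.trim().length > 0)
  if (lines.length === 0 || lines.length > 3) {
    throw new Error("Intent summary must be 1-3 lines: summarize, don't narrate")
  }
  return { artifact, intent: lines.join("\n"), decision }
}
```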




&lt;h2&gt;
  
  
  Designing Your AGENTS.md
&lt;/h2&gt;

&lt;p&gt;You don't need to redesign your AGENTS.md all at once, but position bias has direct implications for how you structure it (or whatever root instruction file you use—CLAUDE.md, cursorrules, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Position Bias Problem
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context Window Position → Attention Weight

[AGENTS.md]            ← Start: HIGH attention
       ↓
[Middle instructions]  ← Middle: LOW attention (Lost in the Middle)
       ↓
[User prompt]          ← End: HIGH attention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your AGENTS.md is bloated, the truly important principles get diluted. Adding more "just in case" content actually makes everything weaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core principles only. ~100 lines. What must be followed on &lt;em&gt;every&lt;/em&gt; task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task-specific info&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inject via skills, command arguments, or reference files &lt;em&gt;when needed&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Why separate?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context separation lets you compose optimal information for each task&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What Belongs in AGENTS.md
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project purpose and domain&lt;/li&gt;
&lt;li&gt;Non-negotiable constraints (security, naming conventions)&lt;/li&gt;
&lt;li&gt;Tech stack overview&lt;/li&gt;
&lt;li&gt;Communication style&lt;/li&gt;
&lt;li&gt;Error handling behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Doesn't Belong
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Individual feature specs&lt;/li&gt;
&lt;li&gt;API details&lt;/li&gt;
&lt;li&gt;Task-specific workflows&lt;/li&gt;
&lt;li&gt;Long code examples&lt;/li&gt;
&lt;li&gt;"Nice to have" information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt;: Ask "Is this needed for &lt;em&gt;every&lt;/em&gt; task?" If no, it belongs elsewhere.&lt;/p&gt;
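&lt;p&gt;The size side of this is easy to guard in CI. A sketch; the 100-line budget mirrors the guideline in the table above and is obviously tunable:&lt;/p&gt;

```typescript
// Sketch: warn when the root instruction file outgrows its attention budget.
// The 100-line default mirrors the guideline above; adjust to taste.
function checkInstructionBudget(
  content: string,
  maxLines = 100,
): { lines: number; withinBudget: boolean } {
  // count only non-empty lines; blank separators shouldn't eat the budget
  const lines = content.split("\n").filter((l) => l.trim().length > 0).length
  return { lines, withinBudget: lines <= maxLines }
}
```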




&lt;h2&gt;
  
  
  The Human Role
&lt;/h2&gt;

&lt;p&gt;In Agentic Coding, you're not "using an LLM"—you're &lt;strong&gt;designing a system&lt;/strong&gt; where an LLM operates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Responsibilities
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Concrete Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design external feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decide which tests to run, which linters to use, what "success" means&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Determine session boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judge when to cut context, what carries over&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Define quality gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate automated checks from human review needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintain AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep core principles tight, prevent bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Articulate intent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create or validate the "intent summary" that passes between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Automation vs. Human Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Good for automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code execution, test execution&lt;/li&gt;
&lt;li&gt;Linters, formatters&lt;/li&gt;
&lt;li&gt;Type checking&lt;/li&gt;
&lt;li&gt;Security scans&lt;/li&gt;
&lt;li&gt;Applying formulaic fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requires human review:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design decision validity&lt;/li&gt;
&lt;li&gt;Requirement alignment&lt;/li&gt;
&lt;li&gt;Session boundary judgment&lt;/li&gt;
&lt;li&gt;Trade-off decisions&lt;/li&gt;
&lt;li&gt;Validating the "why"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Framework: Context Separation at Every Level
&lt;/h2&gt;

&lt;p&gt;It took me a while to realize this wasn't about writing better prompts—it was about where I drew the boundaries.&lt;/p&gt;

&lt;p&gt;You don't need to apply all of this rigidly. But when something feels off, one of these levels is usually the culprit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" alt="Four levels of Context Separation Principle" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research Behind This
&lt;/h2&gt;

&lt;p&gt;These aren't just opinions—they're grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Refine (Madaan et al., 2023)&lt;/strong&gt;&lt;br&gt;
The generate → feedback → refine loop shows approximately 20% improvement over single-shot generation. Key insight: the improvement comes from the structured iteration, not from the model "trying harder."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lost in the Middle (Liu et al., 2023)&lt;/strong&gt;&lt;br&gt;
LLMs show U-shaped attention bias, heavily weighting the beginning and end of context while underweighting the middle. This explains why your carefully crafted instructions in paragraph 5 keep getting ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs Cannot Self-Correct Reasoning Yet (Huang et al., 2023)&lt;/strong&gt;&lt;br&gt;
Without external feedback, self-correction doesn't work—and can actually make things worse. "Review your work" as an instruction has minimal effect; external signals (test failures, linter errors) are what drive actual improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't optimize for first-shot perfection&lt;/strong&gt;. Get something out, then improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session separation is real&lt;/strong&gt;. The same context that generated the artifact will struggle to objectively improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External feedback is non-negotiable&lt;/strong&gt;. Tests, linters, execution results—these are what drive quality, not "think harder" prompts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep AGENTS.md lean&lt;/strong&gt;. Position bias means bloat actively hurts. If it's not needed for every task, move it out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pass intent, not process&lt;/strong&gt;. Between sessions, transfer the "why" in 1-3 lines, not the full thought log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're a system designer&lt;/strong&gt;. Your job isn't to use the LLM—it's to design the workflow, feedback loops, and context boundaries that let it perform.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This article focused on &lt;em&gt;why&lt;/em&gt; verification-oriented workflows outperform first-shot generation. In future articles, I'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to structure work plans&lt;/strong&gt; that turn execution into verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where to put rules&lt;/strong&gt; so they actually get followed (hint: not all in AGENTS.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been struggling with inconsistent LLM output or finding that your detailed prompts underperform simpler ones, try restructuring around verification. The difference is often dramatic.&lt;/p&gt;

&lt;p&gt;What's your experience been? Did switching to a verification-first approach change anything for you?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651&lt;/li&gt;
&lt;li&gt;Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172&lt;/li&gt;
&lt;li&gt;Huang, J., et al. (2023). "Large Language Models Cannot Self-Correct Reasoning Yet." ICLR 2024. arXiv:2310.01798&lt;/li&gt;
&lt;li&gt;Hsieh, C.-Y., et al. (2024). "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization." ACL 2024 Findings. arXiv:2406.16008&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Building a Local RAG for Agentic Coding: From Fixed Chunks to Semantic Search with Keyword Boost</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:12:57 +0000</pubDate>
      <link>https://dev.to/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</link>
      <guid>https://dev.to/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</guid>
      <description>&lt;p&gt;Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Context: RAG for Agentic Coding&lt;/li&gt;
&lt;li&gt;The Invisible Problem: What Does the LLM Actually Receive?&lt;/li&gt;
&lt;li&gt;Semantic Chunking: Why Fixed Chunks Break Down&lt;/li&gt;
&lt;li&gt;When Semantic Chunks Broke Hybrid Search&lt;/li&gt;
&lt;li&gt;Results: What Actually Changed&lt;/li&gt;
&lt;li&gt;Architecture Summary&lt;/li&gt;
&lt;li&gt;The Other Side: Query Quality&lt;/li&gt;
&lt;li&gt;Tradeoffs and Limitations&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Context: RAG for Agentic Coding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem statement
&lt;/h3&gt;

&lt;p&gt;The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.&lt;/p&gt;

&lt;p&gt;The constraints made it interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal use&lt;/strong&gt; → No external APIs, privacy matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP ecosystem&lt;/strong&gt; → Integration with Cursor, Claude Code, Codex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Agentic Coding support"&lt;/strong&gt; as the use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Initial implementation
&lt;/h3&gt;

&lt;p&gt;The first version was textbook RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via &lt;a href="https://huggingface.co/docs/transformers.js" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt;. &lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; for vector storage—file-based, no server process required.&lt;/p&gt;

&lt;p&gt;It worked... sort of.&lt;/p&gt;
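&lt;p&gt;For reference, that textbook chunking fits in a few lines. A sketch; the &lt;code&gt;overlap&lt;/code&gt; parameter is a common variant, not something the original setup is stated to have used:&lt;/p&gt;

```typescript
// Sketch of textbook fixed-size chunking: slice every `size` characters.
// `overlap` (an assumption, common in practice) makes text cut at a
// boundary appear in both neighboring chunks.
function chunkFixed(text: string, size = 500, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size")
  }
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size))
  }
  return chunks
}
```

&lt;p&gt;The simplicity is the appeal and the problem: the slicing knows nothing about sentences or sections, which is exactly what breaks down later.&lt;/p&gt;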

&lt;h2&gt;
  
  
  2. The Invisible Problem: What Does the LLM Actually Receive?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discovery
&lt;/h3&gt;

&lt;p&gt;Here's the thing about MCP: search results go directly to the LLM. The user never sees them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
               ↑
         Results hidden from user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.&lt;/p&gt;

&lt;p&gt;To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."&lt;/p&gt;

&lt;p&gt;What I found: &lt;strong&gt;lots of irrelevant chunks polluting the context.&lt;/strong&gt; Page markers, decoration lines, fragments cut mid-sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why top-K fails
&lt;/h3&gt;

&lt;p&gt;The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increasing K just adds more noise&lt;/li&gt;
&lt;li&gt;No quality signal—just "top 10 closest vectors"&lt;/li&gt;
&lt;li&gt;A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First fix: Quality filtering
&lt;/h3&gt;

&lt;p&gt;Three mechanisms, each addressing a different problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Distance-based threshold (&lt;code&gt;RAG_MAX_DISTANCE&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.&lt;/p&gt;
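&lt;p&gt;A minimal sketch of what the threshold does (illustrative TypeScript, not the actual LanceDB query; names like &lt;code&gt;filterByMaxDistance&lt;/code&gt; are made up for the example):&lt;/p&gt;

```typescript
// Illustrative sketch of RAG_MAX_DISTANCE filtering.
interface Hit { text: string; distance: number }

function filterByMaxDistance(hits: Hit[], maxDistance?: number): Hit[] {
  if (maxDistance === undefined) return hits // filter disabled
  return hits.filter((h) => h.distance <= maxDistance)
}

const hits: Hit[] = [
  { text: 'relevant', distance: 0.2 },
  { text: 'marginal', distance: 0.45 },
  { text: 'garbage', distance: 0.9 },
]

// With RAG_MAX_DISTANCE=0.5, the 0.9 chunk never reaches the LLM.
console.log(filterByMaxDistance(hits, 0.5).length) // 2
// If nothing is close enough, we return nothing instead of noise.
console.log(filterByMaxDistance(hits, 0.1).length) // 0
```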

&lt;p&gt;&lt;strong&gt;2. Relevance gap grouping (&lt;code&gt;RAG_GROUPING&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of arbitrary K, detect natural "quality groups" in the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Calculate statistical threshold: mean + 1.5 * std&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GROUPING_BOUNDARY_STD_MULTIPLIER&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;std&lt;/span&gt;

&lt;span class="c1"&gt;// Find significant gaps (group boundaries)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boundaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;gaps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// 'similar' mode: first group only&lt;/span&gt;
&lt;span class="c1"&gt;// 'related' mode: top 2 groups&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.&lt;/p&gt;
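&lt;p&gt;The gap detection can be sketched end to end like this (illustrative only; the function name is made up, and the real implementation lives in &lt;code&gt;src/vectordb/index.ts&lt;/code&gt;):&lt;/p&gt;

```typescript
// Sketch of relevance-gap grouping: find a statistically large gap between
// consecutive distances and keep only the first group ('similar' mode).
// The multiplier mirrors GROUPING_BOUNDARY_STD_MULTIPLIER from the post.
const STD_MULTIPLIER = 1.5

function firstGroup(distances: number[]): number[] {
  const sorted = [...distances].sort((a, b) => a - b)
  const gaps = sorted.slice(1).map((d, i) => d - sorted[i])
  if (gaps.length === 0) return sorted
  const mean = gaps.reduce((s, g) => s + g, 0) / gaps.length
  const std = Math.sqrt(gaps.reduce((s, g) => s + (g - mean) ** 2, 0) / gaps.length)
  const threshold = mean + STD_MULTIPLIER * std
  const boundary = gaps.findIndex((g) => g > threshold)
  // No significant gap: everything is one group
  return boundary === -1 ? sorted : sorted.slice(0, boundary + 1)
}

// A clear jump after 0.22 separates "highly relevant" from "somewhat related".
console.log(firstGroup([0.1, 0.15, 0.22, 0.6, 0.65])) // [0.1, 0.15, 0.22]
```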

&lt;p&gt;&lt;strong&gt;3. Garbage chunk removal&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isGarbageChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Decoration line patterns (----, ====, ****, etc.)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;[\-&lt;/span&gt;&lt;span class="sr"&gt;=_.*#|~`@!%^&amp;amp;*()&lt;/span&gt;&lt;span class="se"&gt;\[\]&lt;/span&gt;&lt;span class="sr"&gt;{}&lt;/span&gt;&lt;span class="se"&gt;\\/&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&amp;gt;:+&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;+$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;// Excessive repetition of single character (&amp;gt;80%)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;charCounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Page markers, separator lines, repeated characters—filter them before they ever reach the index.&lt;/p&gt;

&lt;h3&gt;
  
  
  New problem emerged
&lt;/h3&gt;

&lt;p&gt;Technical terms like &lt;code&gt;useEffect&lt;/code&gt; or &lt;code&gt;ERR_CONNECTION_REFUSED&lt;/code&gt; were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.&lt;/p&gt;

&lt;p&gt;The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Semantic Chunking: Why Fixed Chunks Break Down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trigger
&lt;/h3&gt;

&lt;p&gt;I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.&lt;/p&gt;

&lt;p&gt;Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The waste
&lt;/h3&gt;

&lt;p&gt;If a chunk doesn't contain enough meaning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM makes additional tool calls to compensate&lt;/li&gt;
&lt;li&gt;Context gets polluted with redundant searches&lt;/li&gt;
&lt;li&gt;Latency increases&lt;/li&gt;
&lt;li&gt;Tokens get wasted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM was doing work that good chunking should prevent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Max-Min Algorithm
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://link.springer.com/article/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;Max-Min semantic chunking paper&lt;/a&gt; (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max–Min idea, not a faithful reproduction of the paper's algorithm.&lt;/p&gt;

&lt;p&gt;The core idea: group consecutive sentences based on semantic similarity, not character count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;

&lt;span class="c1"&gt;// Should we add this sentence to the current chunk?&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;shouldAddToChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;maxSim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Dynamic threshold based on chunk coherence&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;calculateThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;minSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// threshold = max(c * minSim * sigmoid(|C|), hardThreshold)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;minSim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hardThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split text into sentences&lt;/li&gt;
&lt;li&gt;Generate embeddings for all sentences&lt;/li&gt;
&lt;li&gt;For each sentence, decide: add to current chunk or start new?&lt;/li&gt;
&lt;li&gt;Base the decision on the new sentence's maximum similarity to the chunk versus the minimum similarity among sentences already in the chunk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the new sentence's similarity drops below the threshold, it signals a topic boundary.&lt;/p&gt;
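&lt;p&gt;Put together, the loop looks roughly like this toy sketch over precomputed embeddings (cosine similarity on 2-D vectors; the hyperparameters &lt;code&gt;c&lt;/code&gt; and &lt;code&gt;hard&lt;/code&gt; are illustrative, not the real defaults):&lt;/p&gt;

```typescript
// Toy Max-Min grouping loop: extend the current chunk while the new
// sentence is similar enough, relative to the chunk's own coherence.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0)
  return dot / (Math.hypot(...a) * Math.hypot(...b))
}

function chunkBySimilarity(embeddings: number[][], c = 0.9, hard = 0.3): number[][] {
  const chunks: number[][] = [] // each chunk = sentence indices
  let current = [0]
  for (let i = 1; i < embeddings.length; i++) {
    // Max similarity between the candidate sentence and the current chunk
    const maxSim = Math.max(...current.map((j) => cosine(embeddings[j], embeddings[i])))
    // Min similarity *within* the chunk gauges its internal coherence
    const within = current.flatMap((a, x) =>
      current.slice(x + 1).map((b) => cosine(embeddings[a], embeddings[b])))
    const minSim = within.length ? Math.min(...within) : 1
    const sigmoid = 1 / (1 + Math.exp(-current.length))
    const threshold = Math.max(c * minSim * sigmoid, hard)
    if (maxSim > threshold) current.push(i)      // same topic: extend chunk
    else { chunks.push(current); current = [i] } // topic boundary: new chunk
  }
  chunks.push(current)
  return chunks
}

// Two sentences on one topic, then an orthogonal one -> 2 chunks.
console.log(chunkBySimilarity([[1, 0], [0.95, 0.1], [0, 1]])) // [[0, 1], [2]]
```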

&lt;h3&gt;
  
  
  Implementation details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sentence detection: &lt;code&gt;Intl.Segmenter&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Segmenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;und&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;granularity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sentence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No external dependencies. Multilingual support via Unicode standard (UAX #29). The &lt;code&gt;'und'&lt;/code&gt; (undetermined) locale provides general Unicode support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code block preservation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CODE_BLOCK_PLACEHOLDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000CODE_BLOCK&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// Extract before sentence splitting&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;codeBlockRegex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/g
// ... replace with placeholders ...

// Restore after chunking
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WINDOW_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;// Compare only recent 5 sentences: O(k²) → O(25)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_SENTENCES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;   &lt;span class="c1"&gt;// Force split at 15 sentences (3x paper's median)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PDF parsing: pdfjs-dist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Switched from &lt;code&gt;pdf-parse&lt;/code&gt; to &lt;code&gt;pdfjs-dist&lt;/code&gt; for access to position information (x, y coordinates, font size). This enables header/footer detection even for variable content like "Page 7 of 75", which pdf-parse would pass through as regular body text.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. When Semantic Chunks Broke Hybrid Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.&lt;/p&gt;

&lt;p&gt;The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempted: RRF (Reciprocal Rank Fusion)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;RRF&lt;/a&gt; is the standard approach for merging BM25 and vector results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score = Σ 1/(k + rank_i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine rankings by position, not by score. Elegant, widely used, no tuning required.&lt;/p&gt;

&lt;p&gt;But there's a fundamental problem: &lt;strong&gt;distance information is lost.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original distances: 0.1, 0.2, 0.9  →  Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18  →  Ranks: 1, 2, 3
# Same ranks, completely different quality gaps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.&lt;/p&gt;

&lt;p&gt;As noted in &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;Microsoft's hybrid search documentation&lt;/a&gt;: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."&lt;/p&gt;
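&lt;p&gt;The loss is easy to see in a minimal fusion sketch (illustrative; &lt;code&gt;rrfFuse&lt;/code&gt; is a made-up name, and &lt;code&gt;k = 60&lt;/code&gt; is the conventional constant):&lt;/p&gt;

```typescript
// Reciprocal Rank Fusion over two ranked lists. Note the inputs are
// ranks only: the original vector distances never enter the formula.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      // rank is 1-based: i + 1
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1))
    })
  }
  return scores
}

const vectorRanking = ['A', 'B', 'C'] // distances could be 0.1, 0.2, 0.9...
const bm25Ranking = ['B', 'C', 'A']
const fused = rrfFuse([vectorRanking, bm25Ranking])
// ...or 0.1, 0.15, 0.18 — the fused scores come out identical either way.
console.log([...fused.entries()].sort((a, b) => b[1] - a[1]))
```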

&lt;h3&gt;
  
  
  Solution: Semantic-first with keyword boost
&lt;/h3&gt;

&lt;p&gt;Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Multiplicative boost: distance / (1 + keyword_score * weight)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boostedDistance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;keywordScore&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No keyword match&lt;/strong&gt; (score=0): &lt;code&gt;distance / 1 = distance&lt;/code&gt; (unchanged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=0.6: &lt;code&gt;distance / 1.6&lt;/code&gt; (reduced by 37.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=1.0: &lt;code&gt;distance / 2&lt;/code&gt; (halved)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This preserves the distance for quality filtering while boosting exact matches.&lt;/p&gt;
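&lt;p&gt;The three cases above, as a runnable sketch of the boost formula:&lt;/p&gt;

```typescript
// Multiplicative boost: keyword evidence shrinks the distance but never
// erases it, so the distance filter and grouping still have a real signal.
function boostDistance(distance: number, keywordScore: number, weight: number): number {
  return distance / (1 + keywordScore * weight)
}

console.log(boostDistance(0.4, 0, 0.6)) // 0.4  (no match: unchanged)
console.log(boostDistance(0.4, 1, 0.6)) // 0.25 (perfect match, weight 0.6: 0.4 / 1.6)
console.log(boostDistance(0.4, 1, 1.0)) // 0.2  (perfect match, weight 1.0: halved)
```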

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Vector search with 2x candidate pool&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;candidateLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;HYBRID_SEARCH_CANDIDATE_MULTIPLIER&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Apply distance filter&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Apply grouping&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyGrouping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Keyword boost via FTS&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyKeywordBoost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hybridWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual challenge
&lt;/h3&gt;

&lt;p&gt;Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.&lt;/p&gt;

&lt;p&gt;Solution: LanceDB FTS with n-gram indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;baseTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ngram&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ngramMinLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Capture Japanese bi-grams (東京, 設計)&lt;/span&gt;
    &lt;span class="na"&gt;ngramMaxLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Balance precision vs index size&lt;/span&gt;
    &lt;span class="na"&gt;prefixOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// All positions for proper CJK support&lt;/span&gt;
    &lt;span class="na"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Preserve exact terms&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.&lt;/p&gt;
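&lt;p&gt;To see why this works for Japanese, here's a toy n-gram generator (illustrative only; LanceDB builds the actual index natively):&lt;/p&gt;

```typescript
// Japanese has no word delimiters, so the FTS index needs character
// substrings rather than whitespace-split "words".
function ngrams(text: string, min = 2, max = 3): string[] {
  const chars = [...text] // iterate code points so CJK is handled correctly
  const out: string[] = []
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      out.push(chars.slice(i, i + n).join(''))
    }
  }
  return out
}

// "東京設計" yields both compounds even though there is no delimiter.
console.log(ngrams('東京設計')) // ['東京', '京設', '設計', '東京設', '京設計']
```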

&lt;h2&gt;
  
  
  5. Results: What Actually Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Observed behavior (real usage)
&lt;/h3&gt;

&lt;p&gt;My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (fixed chunks + top-K):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent couldn't find relevant information on first search&lt;/li&gt;
&lt;li&gt;Multiple search attempts with different query formulations&lt;/li&gt;
&lt;li&gt;Eventually gave up and read rule files directly&lt;/li&gt;
&lt;li&gt;PDFs were too large to read, so that context was effectively lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (semantic chunks + boost + filtering):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single search usually provides sufficient context&lt;/li&gt;
&lt;li&gt;Additional searches happen for depth, not compensation&lt;/li&gt;
&lt;li&gt;Agent stopped reading files directly—RAG results were trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM evaluation (before/after comparison)
&lt;/h3&gt;

&lt;p&gt;I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but a structured comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries&lt;/li&gt;
&lt;li&gt;Results required additional verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Updated version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No garbage chunks&lt;/li&gt;
&lt;li&gt;8/10 results directly relevant to the query&lt;/li&gt;
&lt;li&gt;2/10 results tangentially related (still useful context)&lt;/li&gt;
&lt;li&gt;Evaluator noted: "Search results alone provide necessary and sufficient information"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.&lt;/p&gt;

&lt;h3&gt;
  
  
  No benchmarks
&lt;/h3&gt;

&lt;p&gt;This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: &lt;strong&gt;the LLM stopped compensating for bad RAG results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB

Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic chunking over fixed&lt;/td&gt;
&lt;td&gt;Meaning-preserving units reduce LLM compensation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword boost over RRF&lt;/td&gt;
&lt;td&gt;Preserves distance for quality filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distance-based grouping&lt;/td&gt;
&lt;td&gt;Quality signal, not arbitrary K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N-gram FTS&lt;/td&gt;
&lt;td&gt;Multilingual support without tokenizer complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local-only&lt;/td&gt;
&lt;td&gt;Privacy, cost, offline capability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Environment variables&lt;/span&gt;
&lt;span class="nv"&gt;RAG_HYBRID_WEIGHT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.6    &lt;span class="c"&gt;# Keyword boost factor (0=semantic, 1=BM25-dominant)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_GROUPING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;related     &lt;span class="c"&gt;# 'similar' (top group) or 'related' (top 2 groups)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_MAX_DISTANCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.5     &lt;span class="c"&gt;# Filter low-relevance results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. The Other Side: Query Quality
&lt;/h2&gt;

&lt;p&gt;RAG accuracy depends on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search quality (what we've discussed)&lt;/li&gt;
&lt;li&gt;Query quality (what the LLM sends)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  MCP's dual invisibility
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
         ↑         ↑
     Query hidden  Results hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even perfect RAG fails with bad queries. And users can't see either side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Agent Skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.&lt;/p&gt;

&lt;p&gt;For this RAG, skills teach the LLM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query formulation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Query patterns by intent&lt;/span&gt;
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Score interpretation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Score thresholds&lt;/span&gt;
&amp;lt; 0.3  : Use directly (high confidence)
&lt;span class="p"&gt;0.&lt;/span&gt;3-0.5: Include if mentions same concept/entity
&lt;span class="gt"&gt;&amp;gt; 0.5  : Skip unless no better results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
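&lt;p&gt;Those thresholds amount to a three-band filter. A minimal sketch with hypothetical names (the real skill expresses this as instructions to the LLM, not code; lower distance means more relevant):&lt;/p&gt;

```python
# Hypothetical three-band filter mirroring the thresholds above.
# Each hit is a dict with a "distance" key and an optional
# "same_concept" flag (both names are illustrative).
def filter_hits(hits, strict=0.3, loose=0.5):
    keep = []
    for h in hits:
        d = h["distance"]
        if d >= loose:
            continue  # above 0.5: skip unless nothing better survives
        if d >= strict and not h.get("same_concept"):
            continue  # 0.3-0.5: keep only if same concept/entity
        keep.append(h)  # below 0.3: use directly, high confidence
    return keep
```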



&lt;p&gt;Skills can be installed via the &lt;a href="https://github.com/shinpr/mcp-local-rag#agent-skills" rel="noopener noreferrer"&gt;mcp-local-rag-skills CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This completes the optimization loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG side:&lt;/strong&gt; semantic chunks + distance filters + keyword boost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM side:&lt;/strong&gt; query formulation + result interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both sides matter. Optimizing only one leaves performance on the table.&lt;/p&gt;
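&lt;p&gt;To make the RAG-side "keyword boost" half concrete, here is a minimal sketch of boosting semantic hits that also contain query terms. This is my own illustration with hypothetical names; mcp-local-rag's actual scoring may differ:&lt;/p&gt;

```python
# Hypothetical keyword boost over semantic results: subtract a small
# bonus from the distance for each query term found in the chunk text.
# A chunk that only BM25 would find never enters semantic_hits, so it
# can never be boosted (the tradeoff noted in the next section).
def keyword_boost(semantic_hits, query_terms, bonus=0.05):
    boosted = []
    for h in semantic_hits:
        text = h["text"].lower()
        matches = sum(1 for t in query_terms if t.lower() in text)
        d = max(0.0, h["distance"] - bonus * matches)
        boosted.append({**h, "distance": d})
    boosted.sort(key=lambda x: x["distance"])
    return boosted
```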

&lt;h2&gt;
  
  
  8. Tradeoffs and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What this approach gives up
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25-only hits don't surface&lt;/strong&gt;: Must appear in semantic results first to get boosted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reranker&lt;/strong&gt;: Would improve accuracy but adds complexity/latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No formal benchmarks&lt;/strong&gt;: Qualitative evaluation only&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where heavier approaches win
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RRF + Reranker&lt;/strong&gt;: Broader candidate pool, reranker compensates for RRF's rank-only output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-reranker&lt;/strong&gt;: Best accuracy, but slow and expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Position on the spectrum
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Light &amp;amp; Fast ←————————————————————→ Heavy &amp;amp; Accurate
    semantic-only
        └─ semantic + boost (here)
               └─ RRF + Cross-Encoder
                      └─ RRF + LLM Rerank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal was: &lt;strong&gt;maximum quality within zero-setup, local-only constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases&lt;/li&gt;
&lt;li&gt;Semantic chunking + quality filtering + keyword boost is a viable middle ground&lt;/li&gt;
&lt;li&gt;RRF looks elegant but loses distance information critical for filtering&lt;/li&gt;
&lt;li&gt;Query quality matters as much as search quality—Agent Skills address this&lt;/li&gt;
&lt;li&gt;The real test: does the LLM stop making compensatory tool calls?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/mcp-local-rag" rel="noopener noreferrer"&gt;github.com/shinpr/mcp-local-rag&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kiss, C., Nagy, M. &amp;amp; Szilágyi, P. (2025). Max–Min semantic chunking of documents for RAG application. &lt;em&gt;Discover Computing&lt;/em&gt; 28, 117. &lt;a href="https://doi.org/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;https://doi.org/10.1007/s10791-025-09638-7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LanceDB Full-Text Search: &lt;a href="https://lancedb.github.io/lancedb/fts/" rel="noopener noreferrer"&gt;https://lancedb.github.io/lancedb/fts/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MCP Specification: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Agent Skills: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;https://agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Reciprocal Rank Fusion (OpenSearch): &lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hybrid Search Scoring (Microsoft): &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>architecture</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How I Made Legacy Code AI-Friendly with Auto-Generated Docs</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:30:09 +0000</pubDate>
      <link>https://dev.to/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</link>
      <guid>https://dev.to/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</guid>
      <description>&lt;p&gt;AI coding assistants are amazing—until you point them at a legacy codebase.&lt;/p&gt;

&lt;p&gt;"What does this module do?"&lt;br&gt;
"I don't have enough context."&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code (and similar tools) hit context limits fast on existing projects. No documentation means no context, which means the AI can't help effectively.&lt;/p&gt;

&lt;p&gt;You could spend weeks writing docs manually. Or you could automate it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Fix: Generate Docs First
&lt;/h2&gt;

&lt;p&gt;Instead of fighting the AI, I ended up building a workflow that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scans your codebase for features&lt;/li&gt;
&lt;li&gt;Generates PRD + Design Docs automatically&lt;/li&gt;
&lt;li&gt;Verifies docs against actual code&lt;/li&gt;
&lt;li&gt;Leaves the AI with the context it needs to work with&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Claude Code&lt;/span&gt;
claude

&lt;span class="c"&gt;# Add the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="c"&gt;# Install the plugin&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;dev-workflows@claude-code-workflows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then point it at your legacy code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/reverse-engineer &lt;span class="s2"&gt;"src/auth"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happens
&lt;/h2&gt;

&lt;p&gt;The workflow runs through multiple specialized agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;scope-discoverer&lt;/strong&gt; finds what features exist in your code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;prd-creator&lt;/strong&gt; generates product docs for each feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code-verifier&lt;/strong&gt; checks if the docs match reality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;document-reviewer&lt;/strong&gt; catches inconsistencies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step verifies against the actual code—so you get docs that reflect what the system &lt;em&gt;actually does&lt;/em&gt;, not what someone thought it did years ago.&lt;/p&gt;
&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PRD for each feature&lt;/strong&gt; (what it does, why it exists)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design docs&lt;/strong&gt; (how it's built, what depends on what)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now when you ask the AI to modify something, it has context.&lt;/p&gt;
&lt;h2&gt;
  
  
  Before/After
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: "Explain the auth module" → Context limit, vague answers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;: AI reads generated docs → Specific, actionable suggestions&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;Works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You've inherited a codebase with missing docs&lt;/li&gt;
&lt;li&gt;Institutional knowledge has left with previous developers&lt;/li&gt;
&lt;li&gt;You want to onboard AI assistants to existing projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not magic—complex legacy systems still need human review. But it gets you 80% there automatically.&lt;/p&gt;

&lt;p&gt;I built this while trying to make Claude Code usable on projects where no one knows how things work anymore.&lt;/p&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;
        claude-code-workflows
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Production-ready development workflows for Claude Code, powered by specialized AI agents.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Code Workflows 🚀&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2961c6708a56bf2bca4bb7dcc53a5e30d0a22e67b3bca0725a8d74a2360432cb/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7368696e70722f636c617564652d636f64652d776f726b666c6f77733f7374796c653d736f6369616c" alt="GitHub Stars"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows/pulls" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/dd0b24c1e6776719edb2c273548a510d6490d8d25269a043dfabbd38419905da/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5052732d77656c636f6d652d627269676874677265656e2e737667" alt="PRs Welcome"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;End-to-end development workflows for Claude Code&lt;/strong&gt; - Specialized agents handle requirements, design, implementation, and quality checks so you get reviewable code, not just generated code.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;⚡ Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;This marketplace includes the following plugins:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core plugins:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows&lt;/strong&gt; - Backend and general-purpose development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows-frontend&lt;/strong&gt; - React/TypeScript specialized workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Optional add-ons&lt;/strong&gt; (enhance core plugins):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/claude-code-discover" rel="noopener noreferrer"&gt;claude-code-discover&lt;/a&gt;&lt;/strong&gt; - Turns feature ideas into evidence-backed PRDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/metronome" rel="noopener noreferrer"&gt;metronome&lt;/a&gt;&lt;/strong&gt; - Detects shortcut-taking behavior and nudges Claude to proceed step by step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/francismiles1/dev-workflows-governance" rel="noopener noreferrer"&gt;dev-workflows-governance&lt;/a&gt;&lt;/strong&gt; - Enforces TIDY stage and human signoff checkpoint before deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Skills only&lt;/strong&gt; (for users with existing workflows):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-skills&lt;/strong&gt; - Coding best practices, testing principles, and design guidelines — no workflow recipes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These plugins provide end-to-end workflows for AI-assisted development. Choose what fits your project:&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Backend or General Development&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 1. Start Claude Code&lt;/span&gt;
claude
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 2. Install the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 3. Install backend plugin&lt;/span&gt;
/plugin install dev-workflows@claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>ai</category>
      <category>productivity</category>
      <category>automation</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Taming Opus 4.5's Efficiency: Using TodoWrite to Keep Claude Code on Track</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 11 Dec 2025 12:54:19 +0000</pubDate>
      <link>https://dev.to/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</link>
      <guid>https://dev.to/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</guid>
      <description>&lt;p&gt;I've been using Claude Code with Opus 4.5 for a while now, and there's one thing that kept driving me crazy: it skips steps. Steps I actually needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happens
&lt;/h2&gt;

&lt;p&gt;According to Anthropic's docs, Opus 4.5 is designed to "skip summaries for efficiency and maintain workflow momentum." Sounds great in theory.&lt;/p&gt;

&lt;p&gt;In practice? You ask for a 5-step process, and it delivers the final result—skipping steps 2, 3, and 4. Efficient? Sure. But not what I needed.&lt;/p&gt;

&lt;p&gt;I ran into this when I was working on a test review task. I wanted Claude to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all test items from the spec&lt;/li&gt;
&lt;li&gt;Evaluate each item against criteria&lt;/li&gt;
&lt;li&gt;Filter down to the essential ones&lt;/li&gt;
&lt;li&gt;Generate the final test plan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, it jumped straight to step 4. "Here's your optimized test plan!" Thanks, but I needed to see steps 2 and 3 to understand &lt;em&gt;why&lt;/em&gt; those tests were selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: Make steps explicit with TodoWrite
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📢 Update (March 2026):&lt;/strong&gt; As of Claude Code v2.1.16 (released January 22, 2026), &lt;code&gt;TodoWrite&lt;/code&gt; has been superseded by the new &lt;strong&gt;Tasks API&lt;/strong&gt; — &lt;code&gt;TaskCreate&lt;/code&gt;, &lt;code&gt;TaskUpdate&lt;/code&gt;, &lt;code&gt;TaskList&lt;/code&gt;, and &lt;code&gt;TaskGet&lt;/code&gt;. The concept in this article still applies, but you'll now use &lt;code&gt;TaskCreate&lt;/code&gt; to register steps instead of &lt;code&gt;TodoWrite&lt;/code&gt;. You can revert to the old behavior with the env var &lt;code&gt;CLAUDE_CODE_ENABLE_TASKS=false&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Code has a built-in TODO management feature called &lt;code&gt;TodoWrite&lt;/code&gt;. When you register tasks explicitly, Opus 4.5 treats them as checkpoints it must complete.&lt;/p&gt;

&lt;p&gt;At the start of your task, tell Claude Code to register the steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting, register these steps using TodoWrite:
1. List all test items from the spec
2. Evaluate each against the criteria
3. Filter to essential items with reasoning
4. Generate the final test plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or just add this to your prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use TodoWrite to track each step. Do not skip any steps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, once you've registered steps as TODOs, Opus treats them as real checkpoints—not optional stops it can skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick limitation I learned the hard way
&lt;/h2&gt;

&lt;p&gt;If you register too many steps (7+), Opus 4.5 may batch them together for "efficiency," defeating the purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read file A
2. Read file B
3. Read file C
4. Analyze A
5. Analyze B
6. Analyze C
7. Compare results
8. Generate report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Do this instead:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read and analyze all relevant files
2. Compare the implementations
3. Generate the report with findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaningful steps, not micro-tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this saved me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step refactoring where I needed to see intermediate states&lt;/li&gt;
&lt;li&gt;Debugging sessions where I wanted the reasoning at each stage&lt;/li&gt;
&lt;li&gt;Any task where Opus 4.5 kept "helpfully" jumping to the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opus 4.5's efficiency is a feature, not a bug—but sometimes you need the journey, not just the destination. TodoWrite gives you that control back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
