<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shinsuke KAGAWA</title>
    <description>The latest articles on DEV Community by Shinsuke KAGAWA (@shinpr).</description>
    <link>https://dev.to/shinpr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3448941%2F612feab1-a03c-49be-b329-ae74d583329c.jpg</url>
      <title>DEV Community: Shinsuke KAGAWA</title>
      <link>https://dev.to/shinpr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shinpr"/>
    <language>en</language>
    <item>
      <title>A second pair of eyes for Claude Code: building Galley, a local runner that checks the work before the PR opens</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 11 May 2026 12:16:11 +0000</pubDate>
      <link>https://dev.to/shinpr/a-second-pair-of-eyes-for-claude-code-building-galley-a-local-runner-that-checks-the-work-before-9ie</link>
      <guid>https://dev.to/shinpr/a-second-pair-of-eyes-for-claude-code-building-galley-a-local-runner-that-checks-the-work-before-9ie</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I run three client projects plus an OSS repo. Agentic coding got good enough that for a lot of tasks I just hand over a goal and a list of acceptance criteria and let it run. The catch: Opus 4.7's reliability made me nervous enough that I started ending almost every task with a manual round of "now have Codex look at this."&lt;/li&gt;
&lt;li&gt;That ritual happened often enough that I automated it. &lt;strong&gt;Galley&lt;/strong&gt; is the result: a local runtime where Claude Code executes a task inside a git worktree, then a supervisor (Claude, or Codex when I want a different reviewer) checks the run evidence against the acceptance criteria and either bounces it back for another attempt or approves it and opens a PR.&lt;/li&gt;
&lt;li&gt;The parts I'm happiest with aren't the loop itself. They're the boring scaffolding around it: the tool is installed and configured by a Skill, acceptance-criteria test skeletons get written into the worktree &lt;em&gt;before&lt;/em&gt; the first attempt, there's a per-repo &lt;code&gt;quality.yaml&lt;/code&gt; I keep growing, and a second model on review duty catches things the model that wrote the code does not.&lt;/li&gt;
&lt;li&gt;Galley now does a noticeable chunk of its own development. It's an early preview, MIT-licensed, on GitHub.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup that made me build this
&lt;/h2&gt;

&lt;p&gt;For the last while my day job has been three concurrent product codebases plus maintaining an open-source workflow framework. None of them are huge, but the context-switching tax is real, and I'd been leaning harder and harder on Claude Code to carry whole tasks rather than babysitting them line by line.&lt;/p&gt;

&lt;p&gt;At some point a threshold got crossed. Not "AI writes all my code now," more like: for a task of a certain size and shape, I no longer needed to plan the implementation. I needed to write down what done looks like (the goal, the acceptance criteria, the paths it's allowed to touch), and that was genuinely enough. The model would go figure out the how. That's a nice feeling the first few times it works.&lt;/p&gt;

&lt;p&gt;Then Opus 4.7 happened, and the feeling got complicated. I won't relitigate it; plenty of people have. The short version for me was: the &lt;em&gt;ambition&lt;/em&gt; was still there, the output still looked plausible, but I stopped trusting "looks plausible." So I started doing something I'd done occasionally before, but now every single time: after Claude Code finished, I'd open Codex, hand it the diff and the requirements, and ask it to find what was wrong. It usually found something. Different model, different blind spots: Codex would flag an edge case Claude glossed over, and Claude would have written cleaner structure than Codex would have. The combination of the two was reliably better than either alone.&lt;/p&gt;

&lt;p&gt;So now every task had a manual final step that I did by hand, with copy-paste, in a separate terminal, every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Third time, automate it"
&lt;/h2&gt;

&lt;p&gt;I have a rule I mostly stick to: the first time you do a thing manually, fine. The second time, grumble. The third time, you build the tool. The cross-model review had blown well past three.&lt;/p&gt;

&lt;p&gt;But the more I sketched it, the more it stopped being "a script that pipes a diff into Codex" and turned into something with a shape: if a supervisor model is going to &lt;em&gt;approve&lt;/em&gt; work, it needs the work in a reviewable form, not just a diff but the command plan, the executor's own report, git status, the structured result. If it's going to &lt;em&gt;reject&lt;/em&gt; work, the rejection needs to come back to the executor as a new attempt with the feedback attached, not as a Slack message to me. And if it can do all that, it can open the PR itself, and I can do final tweaks from PR comments instead of from my editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Galley actually does
&lt;/h2&gt;

&lt;p&gt;It's a local Go binary plus a daemon. You point it at a repo, hand it a task, and it runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your repo  ──task YAML──▶  galley daemon (local)
                               │
                               ├─▶ executor: Claude Code, inside a git worktree
                               │      writes the code, returns a structured result
                               │
                               └─▶ supervisor: Claude (default) or Codex
                                      reads the run evidence, issues a verdict,
                                      opens the PR on accept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The task itself moves through a file-backed queue, and the supervisor's verdict decides where it lands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;draft task YAML
        |  galley task queue
        v
   tasks/queued/  →  daemon claims it  →  tasks/running/
        |
        |  Claude Code executes in a git worktree
        v
   supervisor review (Claude or Codex)
        |
        +-- accepted  -------------------→ tasks/done/  (+ open PR if enabled)
        +-- needs_revision  -------------→ retry, while loop budget remains
        +-- needs_supervisor_review  ----→ tasks/failed/  (escalate to me)
        +-- hard_stop  ------------------→ tasks/failed/  (no retry)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything runs locally, every change stays as git-visible diffs, and every attempt writes its evidence to disk: &lt;code&gt;command_plan.json&lt;/code&gt;, &lt;code&gt;run_result.json&lt;/code&gt;, the supervisor's verdict, &lt;code&gt;git_status.json&lt;/code&gt;, &lt;code&gt;diff.patch&lt;/code&gt;. When the loop escalates to me, I'm not guessing; I'm reading the file the supervisor read.&lt;/p&gt;

&lt;p&gt;The task file is the trusted input. It's where the goal, the acceptance criteria (each with an ID the executor has to report back against), the allowed and forbidden paths, the loop budget, and the PR behavior live. The model never gets to redefine its own success criteria mid-run. It gets to satisfy them or fail them.&lt;/p&gt;
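&lt;p&gt;To make that concrete, here's a hedged sketch of what such a task file could look like. The field names below are illustrative guesses assembled from the concepts in this post (goal, per-AC IDs, allowed and forbidden paths, loop budget, PR behavior, &lt;code&gt;prompt_mode: replace&lt;/code&gt;), not Galley's documented schema; check the repo for the real format.&lt;/p&gt;

```yaml
# Illustrative task YAML. Field names are guesses based on the ideas
# described in this post, not Galley's actual schema.
goal: "run_agent callers can override the execution timeout per call"
acceptance_criteria:
  - id: AC1
    description: "A per-call timeout argument takes precedence over the default"
  - id: AC2
    description: "Existing callers without the argument keep today's behavior"
allowed_paths:
  - internal/agent/
forbidden_paths:
  - .github/
loop_budget: 3          # max executor attempts before escalating to a human
prompt_mode: replace    # pin the executor's system prompt rather than append to it
pr:
  open_on_accept: true
```

The point of keeping all of this in one file is that the executor reports back against the AC IDs and the supervisor judges against the same IDs; neither side gets to renegotiate them mid-run.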

&lt;h2&gt;
  
  
  The decisions I'd actually defend
&lt;/h2&gt;

&lt;p&gt;The loop is the obvious part. Here's the stuff that took longer to get right and that I think matters more.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Skill installs the tool
&lt;/h3&gt;

&lt;p&gt;This is the bit that still feels a little science-fictional to me. Galley ships an Agent Skill, packaged both as a Claude Code plugin and as a Codex marketplace entry. You install the &lt;em&gt;skill&lt;/em&gt; first. Then you ask it, in plain language, to set up the repo. It installs the &lt;code&gt;galley&lt;/code&gt; binary, inspects the repository, drafts a &lt;code&gt;quality.yaml&lt;/code&gt; and an &lt;code&gt;environment.yaml&lt;/code&gt;, explains the execution settings to you, writes a valid task YAML, validates it, and queues it. It only queues after you say yes.&lt;/p&gt;
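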

&lt;p&gt;I've shipped CLIs before. The onboarding was always a README and a prayer. Here the onboarding &lt;em&gt;is an agent&lt;/em&gt;, and "explain what this config field means and pick a sensible default for my repo" is just a thing it does. That splits your docs in two: the README is for the human who wants to understand the system, and the skill's reference files are for the agent that has to operate it correctly without you in the loop. They overlap less than you'd think, and I'm still figuring out where each line belongs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acceptance-criteria test skeletons go in &lt;em&gt;first&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;This one's a direct response to the 4.7 trust problem. There's an optional preflight step. Before the first executor attempt, Galley runs a built-in test-creator pass that writes test skeletons into the worktree, one per acceptance criterion, and records the mapping back onto the running task: for each AC, the skeleton's path, the behavior it's meant to pin down, and where it plugs into the codebase. In a Go repo a skeleton is about what you'd expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAC1_RunAgentOverridesTimeoutPerCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AC1: run_agent callers can override execution timeout per call"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// executor fills this in&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those skeleton paths are validated against the task's allowed paths, so the test-creator can't scatter files wherever it likes. And the executor can't get an "accepted" verdict while those tests are still skipped and the required checks haven't run green; the supervisor downgrades that to &lt;code&gt;needs_supervisor_review&lt;/code&gt;. The effect is small but real: the implementation has to converge on something the AC-shaped tests accept. A model that's drifting toward a clever-but-wrong solution runs into the skeleton and has to reckon with it. It's harder to wander off when there's already a fence where the spec said the fence should be. (It's off by default, since some tasks genuinely shouldn't have it, but for "implement feature X with these three behaviors," I turn it on.)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;quality.yaml&lt;/code&gt; is a thing I grow
&lt;/h3&gt;

&lt;p&gt;Each repo gets a quality profile: which checks are required, which review dimensions must pass, what evidence the supervisor should expect, what severity of finding blocks acceptance. It starts small. Then every time a run produces something technically-passing-but-wrong-for-this-codebase, I add a line. Over time the profile becomes the codebase's accumulated opinion about what "good" means here, and both the executor and the supervisor get handed that opinion at the start of every task. Implementations stop drifting because the definition of "done well" stopped being implicit.&lt;/p&gt;
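&lt;p&gt;A sketch of what a grown-over-time profile might hold, using the four ingredients named above. The structure and key names here are my own illustration, not Galley's real &lt;code&gt;quality.yaml&lt;/code&gt; format:&lt;/p&gt;

```yaml
# Illustrative quality profile. Keys are invented to mirror the four
# ingredients described in the post; the real schema may differ.
required_checks:
  - go test ./...
  - go vet ./...
review_dimensions:       # dimensions the supervisor must explicitly pass
  - correctness
  - error_handling
  - test_coverage
evidence_expected:       # artifacts the supervisor should find on disk
  - run_result.json
  - diff.patch
blocking_severity: high  # findings at or above this severity block acceptance
house_rules:             # the lines that accumulate run by run
  - "New exported functions get table-driven tests"
  - "Errors are wrapped with context, never swallowed"
```

The `house_rules`-style list is where "technically passing but wrong for this codebase" findings turn into standing policy that both the executor and the supervisor see on every task.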

&lt;h3&gt;
  
  
  Claude writes, a second model signs off
&lt;/h3&gt;

&lt;p&gt;Supervisor review defaults to Claude. But I can flip it to Codex per task, and for anything I'd have manually double-checked before, I do. Same-model review (Claude checking Claude) is fine and catches plenty. A different model catches a different category of mistake, and because a rejection comes back as another attempt with the feedback attached, the executor gets to fix what the reviewer flagged instead of just failing the task. So a long unattended run doesn't drift the way an unreviewed one does: every accepted step has had a second model poke at it, and the diff you end up with has those corrections baked in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deterministic / non-deterministic seam
&lt;/h3&gt;

&lt;p&gt;Building this kind of tool, the genuinely fiddly part isn't the AI calls — it's the boundary between the parts that must be exact and the parts that get to improvise. Two places I had to draw a hard line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The executor has to return a structured JSON result, every time, or the retry-and-review loop has nothing to stand on; the supervisor can't evaluate a free-form essay. So Galley installs a small guard plugin into the executor's Claude Code that enforces the output format. The creative work is non-deterministic; the envelope it arrives in is not.&lt;/li&gt;
&lt;li&gt;Opus 4.7's defaults made me uneasy enough that I replaced the executor's system prompt outright with one derived from Codex-style prompting and from my own &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;&lt;code&gt;claude-code-workflows&lt;/code&gt;&lt;/a&gt; OSS, and did the same kind of swap for the supervisor prompts. The task YAML literally has &lt;code&gt;prompt_mode: replace&lt;/code&gt; for this. I'd rather pin the behavior than hope for it.&lt;/li&gt;
&lt;/ul&gt;
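&lt;p&gt;For the first point, here's the kind of structured envelope the guard could be enforcing. The post only names &lt;code&gt;run_result.json&lt;/code&gt; and per-AC reporting; the exact shape below is an assumption for illustration:&lt;/p&gt;

```json
{
  "task_id": "timeout-override",
  "attempt": 2,
  "status": "completed",
  "acceptance_criteria": [
    {
      "id": "AC1",
      "status": "satisfied",
      "evidence": "TestAC1_RunAgentOverridesTimeoutPerCall passes"
    },
    {
      "id": "AC2",
      "status": "satisfied",
      "evidence": "existing suite still green"
    }
  ],
  "checks_run": ["go test ./...", "go vet ./..."],
  "notes": "Timeout override plumbed through the call options."
}
```

Whatever the real field names are, the design point stands: the supervisor evaluates this envelope plus the on-disk evidence, not a free-form essay, which is what makes mechanical retry-and-review possible.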

&lt;p&gt;Neither of these is glamorous. Both are the difference between a demo and something I leave running while I'm in a meeting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Galley building Galley
&lt;/h2&gt;

&lt;p&gt;The thing I didn't plan for: once it worked, the obvious next move was to have Galley develop Galley.&lt;/p&gt;

&lt;p&gt;A few recent fixes, all queued through the skill and executed by Galley itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PR body was rendering every acceptance criterion as &lt;code&gt;not_satisfied&lt;/code&gt; even when the supervisor had accepted them with evidence, which is confusing for anyone reading the PR.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;galley task show&lt;/code&gt;, the command I run constantly, was loudly reporting &lt;code&gt;latest_claude_status: failed&lt;/code&gt; on tasks that had actually been accepted with a PR open; true as raw history, wrong as the headline.&lt;/li&gt;
&lt;li&gt;The PR-comment trigger only recognized &lt;code&gt;/galley rerun ...&lt;/code&gt; and &lt;code&gt;/galley requeue ...&lt;/code&gt;, when what I actually wanted was to type &lt;code&gt;/galley fix the failing test&lt;/code&gt; and have it pick that up as the request.&lt;/li&gt;
&lt;li&gt;New worktrees were being branched off whatever the source repo's HEAD happened to be instead of the configured base branch, so one PR's commits could leak into the next.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of those I caught by using the thing. Some — like the test-skeleton preflight — came from sitting down with a pile of &lt;code&gt;runs/&lt;/code&gt; evidence and asking what would have stopped the bad run earlier. Either way, the loop closes: I notice a gap, I write it up as a task with acceptance criteria, Galley implements it, a supervisor checks it, a PR shows up, I tweak it from a comment.&lt;/p&gt;

&lt;p&gt;It's not fully autonomous and I'm not pretending it is. I review the task drafts. I read the escalations. I still tweak PRs. But the ratio of "me describing what I want" to "me typing the code" has tilted further than I expected, and the safety rails (evidence on disk, AC-shaped tests, a growing quality profile, a second model's sign-off) are what let me actually trust the tilt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it is
&lt;/h2&gt;

&lt;p&gt;Galley is an early preview. It's MIT-licensed and on GitHub at &lt;a href="https://github.com/shinpr/galley" rel="noopener noreferrer"&gt;&lt;code&gt;shinpr/galley&lt;/code&gt;&lt;/a&gt;. It's Claude-first today: the executor path targets Claude Code, supervisor review defaults to Claude, and Codex slots in as the alternate supervisor. It's built for trusted local repositories: task YAML is trusted input, quality checks run locally, PR comments can request a requeue but can't rewrite your gates, and only the PR author (who is also a repo owner or collaborator) can drive it from comments. It also leans on &lt;code&gt;git&lt;/code&gt;, a git worktree per task, and &lt;code&gt;gh&lt;/code&gt; for the PR path.&lt;/p&gt;

&lt;p&gt;To try it, add the plugin and let the Galley skill do the setup: it installs the &lt;code&gt;galley&lt;/code&gt; CLI if it isn't already on your &lt;code&gt;PATH&lt;/code&gt;, inspects the repo, drafts the &lt;code&gt;quality.yaml&lt;/code&gt; and &lt;code&gt;environment.yaml&lt;/code&gt; profiles and the task YAML, and queues only after you approve. That "install a skill, have it install the tool" loop is the part that still feels new to me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add shinpr/galley
/plugin install galley@galley-tools
/reload-plugins
/galley:galley Set up Galley for this repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Codex
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex plugin marketplace add shinpr/galley
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then invoke the skill with &lt;code&gt;$galley&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$galley Set up Galley for this repository.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you'd rather install the CLI yourself first, it's a one-liner: &lt;code&gt;curl -fsSL https://raw.githubusercontent.com/shinpr/galley/main/scripts/install.sh | sh&lt;/code&gt;. Either way you end up describing tasks to the skill in plain language from there.&lt;/p&gt;

&lt;p&gt;I'm curious how other people have dealt with the same trust gap. Have you put a second model in the review loop? Did a cross-model pairing actually buy you something, or was Claude-reviewing-Claude enough? And if you've found a better answer than "evidence on disk plus tests that go in before the code," I'd genuinely like to hear it. If you build something around this, or break it in an interesting way, even better.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Skill Reviewer. Then I Ran It on Itself.</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:44:56 +0000</pubDate>
      <link>https://dev.to/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</link>
      <guid>https://dev.to/shinpr/i-built-a-skill-reviewer-then-i-ran-it-on-itself-4m4j</guid>
      <description>&lt;p&gt;I built a tool that reviews Claude Code skills for quality issues.&lt;/p&gt;

&lt;p&gt;Then I pointed it at its own source files. It found real problems.&lt;/p&gt;

&lt;p&gt;The irony wasn't lost on me. But the more interesting question is: why did this happen, and what does it tell us about how LLM-based quality tools actually work?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I maintain &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;, a Claude Code plugin for prompt and skill optimization. It includes a skill reviewer agent that evaluates skill files against 8 research-backed patterns (BP-001 through BP-008) and 9 editing principles.&lt;/p&gt;

&lt;p&gt;One of those patterns—BP-001—says: &lt;strong&gt;don't write instructions in negative form.&lt;/strong&gt; Research shows LLMs often fail to follow "don't do X" instructions—negated prompts actually cause &lt;a href="https://arxiv.org/abs/2209.12711" rel="noopener noreferrer"&gt;inverse scaling&lt;/a&gt;, where larger models perform &lt;em&gt;worse&lt;/em&gt;. The fix is to rewrite them positively: instead of "don't skip P1 issues," write "evaluate all P1 issues in every review mode."&lt;/p&gt;

&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;Except both my agent definition files had a section called &lt;code&gt;## Prohibited Actions&lt;/code&gt; full of "don't" instructions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Discovery
&lt;/h2&gt;

&lt;p&gt;I noticed this by reading my own code. But I wanted to see what happens when the tools catch it—or don't.&lt;/p&gt;

&lt;p&gt;First, I ran the &lt;strong&gt;prompt-analyzer&lt;/strong&gt; agent against both files. It analyzed them, found some issues, but gave the &lt;code&gt;Prohibited Actions&lt;/code&gt; sections a pass. Its reasoning: these qualify as "safety-critical" exceptions to BP-001, since they constrain "destructive" behaviors.&lt;/p&gt;

&lt;p&gt;That felt off. "Don't invent issues not supported by BP patterns" isn't a safety-critical instruction. It's a quality policy. The caller can override or discard the output.&lt;/p&gt;

&lt;p&gt;So I ran the &lt;strong&gt;skill-reviewer&lt;/strong&gt; agent against the same two files. The results were more interesting.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-reviewer.md&lt;/code&gt; (reviewing itself), it flagged all four items in Prohibited Actions as BP-001 violations—P2 severity. Correct call.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;skill-creator.md&lt;/code&gt; (reviewing the other agent), it gave Prohibited Actions a pass. Same structure, same pattern, opposite judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The same reviewer, applying the same criteria, reached opposite conclusions on the same construct.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Digging Into Logs
&lt;/h2&gt;

&lt;p&gt;I could have speculated about why. Instead, I checked the subagent conversation logs.&lt;/p&gt;

&lt;p&gt;The skill-creator review log showed this in the Step 1 pattern scan:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BP-001 (Negative Instructions)&lt;/strong&gt;: Lines 197-202 "Prohibited Actions" section uses negative form. However, per the BP-001 exception in skills.md, these are procedural/irreversible consequences (inventing knowledge, removing examples, overwriting files). &lt;strong&gt;The exception applies.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It did scan for BP-001. It found the section. But it classified the items as "irreversible consequences" and applied the exception.&lt;/p&gt;

&lt;p&gt;The problem was clear: the exception rule said negative form is okay for "safety-critical operations, destructive actions, or order-dependent procedures." That's vague enough to stretch. "Inventing domain knowledge" sounds serious. "Removing user-provided examples" sounds destructive. If you squint, anything can be "destructive."&lt;/p&gt;

&lt;p&gt;Nothing was wrong with the reviewer. It was doing exactly what I told it to do. That was the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing the Criteria, Not the Reviewer
&lt;/h2&gt;

&lt;p&gt;The instinct is to blame the LLM: "it self-justified," "it was biased toward leniency." But the actual cause was simpler: &lt;strong&gt;the exception rule was written in a way that allowed two reasonable readings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fix wasn't to make the reviewer "smarter." It was to make the criteria harder to misread.&lt;/p&gt;

&lt;p&gt;I replaced the broad exception language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: safety-critical operations, exact command sequences,
destructive actions, or order-dependent procedures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a 4-condition checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Exception: Negative form is permitted only when ALL are true:
(1) Violation destroys state in a single step
(2) Caller or subsequent steps cannot normally recover
(3) The constraint is operational/procedural, not a quality policy
(4) Positive rewording would expand or blur the target scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added concrete boundary examples—what qualifies, what doesn't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Permitted (exception applies)&lt;/th&gt;
&lt;th&gt;Not permitted (rewrite positively)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Do not modify the command"&lt;/td&gt;
&lt;td&gt;"Do not invent issues" -&amp;gt; "Base every issue on BP patterns"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not add flags"&lt;/td&gt;
&lt;td&gt;"Do not skip P1 issues" -&amp;gt; "Evaluate all P1 in every mode"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Do not execute destructive operations"&lt;/td&gt;
&lt;td&gt;"Do not create overlapping skills" -&amp;gt; "Verify no overlap before generating"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key addition: &lt;strong&gt;"Outputs that the caller validates, overwrites, or discards are never irreversible."&lt;/strong&gt; This one sentence eliminates most of the ambiguity. A subagent's output goes to a caller. The caller decides what to do with it. That's not irreversible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retest
&lt;/h2&gt;

&lt;p&gt;After updating the criteria, I ran the skill-reviewer again on both files.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-reviewer.md&lt;/code&gt;: Prohibited Actions flagged as BP-001 P2. All four items caught.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;skill-creator.md&lt;/code&gt;: Two items flagged as quality policies that should be positive form. The remaining items—which are genuinely about operational constraints—were accepted.&lt;/p&gt;

&lt;p&gt;Consistent. Explainable. And the reviewer could now articulate &lt;em&gt;why&lt;/em&gt; each item was or wasn't an exception, because the criteria forced it to check specific conditions rather than make a gestalt judgment.&lt;/p&gt;

&lt;p&gt;But I wasn't fully satisfied. In a further round of testing, the reviewer still occasionally applied exceptions loosely—recording "irreversible" in the justification field without explaining &lt;em&gt;how&lt;/em&gt; it's irreversible.&lt;/p&gt;

&lt;p&gt;So I added structured evidence to the output schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"patternExceptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BP-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"section heading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"quoted text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"singleStepDestruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"callerCannotRecover"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"operationalNotPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"positiveFormBlursScope"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true|false + evidence"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can't just write "irreversible" anymore. You have to answer four yes/no questions with evidence. If any answer is no, it's not an exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Comes Down To
&lt;/h2&gt;

&lt;p&gt;The criteria had a loophole wide enough to drive a truck through. Better criteria produced better reviews without changing the reviewer at all. The LLM wasn't "inconsistent"—the instructions were ambiguous. Two reasonable people could have read the old exception rule and reached different conclusions too.&lt;/p&gt;

&lt;p&gt;Structured output helped more than I expected. The 4-condition checklist wasn't just about auditability—it changed how the reviewer thinks. When you have to fill in four fields with evidence, you can't hand-wave. The output structure becomes a thinking scaffold.&lt;/p&gt;

&lt;p&gt;And running the tool on its own source files was uncomfortable in a useful way. The temptation is to say "well, I know what I meant." But the tool doesn't know what I meant. It reads what I wrote.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Problem: Skill Quality Is Hard
&lt;/h2&gt;

&lt;p&gt;If you're building Claude Code skills, custom agents, or any kind of structured LLM instruction set—you've probably experienced this: the instructions work fine in your head, but the LLM does something unexpected. You add more instructions. It gets worse. You simplify. Something else breaks.&lt;/p&gt;

&lt;p&gt;The issue is that &lt;strong&gt;you can't see your own blind spots.&lt;/strong&gt; You know what you meant. The LLM reads what you wrote. The gap between intent and text is where bugs live.&lt;/p&gt;

&lt;p&gt;This is why I built &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt;. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skill review&lt;/strong&gt;: Evaluate skill files against the BP-001 through BP-008 patterns and 9 editing principles, with structured quality grades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden scenario evaluation&lt;/strong&gt;: Test whether a skill actually &lt;em&gt;works&lt;/em&gt; by comparing execution results with and without the skill, or before and after changes—not just whether it was loaded, but whether it made a measurable difference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The golden scenario part matters. "The skill was loaded" doesn't mean "the skill helped." You need to see the actual output difference to know if your skill is doing anything useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;Rashomon&lt;/a&gt; is a Claude Code plugin. Install it and point the skill reviewer at your own skills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Claude Code&lt;/span&gt;
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session to activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will find problems. I know because it found problems in itself—and it's better for it now.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with skill quality? Have you found ways to validate that your instructions actually do what you think they do?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Same Framework, Different Engine: Porting AI Coding Workflows from Claude Code to Codex CLI</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 18 Mar 2026 11:50:48 +0000</pubDate>
      <link>https://dev.to/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</link>
      <guid>https://dev.to/shinpr/same-framework-different-engine-porting-ai-coding-workflows-from-claude-code-to-codex-cli-n3p</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I built a &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;sub-agent workflow framework for Claude Code&lt;/a&gt; that solved context exhaustion through specialized agents and structured workflows&lt;/li&gt;
&lt;li&gt;For 8 months, Codex CLI had no sub-agents — the framework was Claude Code-only&lt;/li&gt;
&lt;li&gt;Codex finally shipped sub-agent support — I expected days of migration, it took an afternoon&lt;/li&gt;
&lt;li&gt;What surprised me most: &lt;strong&gt;if you design workflows around agent roles and context separation rather than tool-specific features, your investment survives platform shifts&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 8-Month Wait
&lt;/h2&gt;

&lt;p&gt;Back in July 2025, I released the &lt;a href="https://github.com/shinpr/ai-coding-project-boilerplate/commit/1a9191dd37e90c7d463f9a26b3a6edf01236d4f2" rel="noopener noreferrer"&gt;first version of this workflow&lt;/a&gt; as a Claude Code boilerplate. By October 2025, it had evolved into a &lt;a href="https://github.com/shinpr/claude-code-workflows/commit/8869e32eeff9d45568a7ca3017688fffdac7e254" rel="noopener noreferrer"&gt;full sub-agent framework&lt;/a&gt; — specialized agents for every phase of development, from requirements analysis through TDD implementation through quality gates. The idea was pretty simple: break complex coding tasks into specialized roles (requirement analyzer, technical designer, task executor, quality fixer...), give each agent a fresh context, and orchestrate them through structured handoffs. No single agent ever hits the context ceiling because no single agent tries to do everything.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Codex CLI had no sub-agent capability.&lt;/strong&gt; Codex had been around since &lt;a href="https://openai.com/index/introducing-codex/" rel="noopener noreferrer"&gt;mid-2025&lt;/a&gt;, and I wanted the same workflow there too. So I kept trying to bridge the gap.&lt;/p&gt;

&lt;p&gt;First, I built an &lt;a href="https://github.com/shinpr/sub-agents-mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; in August 2025 that let any MCP-compatible tool — Codex, Cursor, whatever — define and spawn sub-agents through a standard protocol. It worked, but MCP added a layer of indirection that wasn't there in Claude Code's native sub-agents.&lt;/p&gt;

&lt;p&gt;Then in December 2025, Codex &lt;a href="https://community.openai.com/t/skills-for-codex-experimental-support-starting-today/1369367" rel="noopener noreferrer"&gt;shipped experimental Agent Skills support&lt;/a&gt;. I saw an opening and built &lt;a href="https://github.com/shinpr/sub-agents-skills" rel="noopener noreferrer"&gt;sub-agents-skills&lt;/a&gt; — cross-LLM sub-agent orchestration packaged as Agent Skills, routing tasks to Codex, Claude Code, Cursor, or Gemini. Closer, but still not native sub-agents.&lt;/p&gt;

&lt;p&gt;Through all of this, my main development stayed on Claude Code. Its native context separation, combined with the small context windows of the time, made it the clear choice for serious work. Codex filled a supporting role — I used it for skills refinement and as an objective reviewer on complex implementations, a fresh set of eyes from a different LLM.&lt;/p&gt;

&lt;p&gt;I don't use hooks extensively — I prefer keeping tasks small and baking quality gates into the completion criteria themselves. So what I was really waiting for was native sub-agent support in Codex, which would let the full orchestration workflow run without workarounds.&lt;/p&gt;

&lt;p&gt;On March 16, 2026, Codex CLI &lt;a href="https://developers.openai.com/codex/subagents" rel="noopener noreferrer"&gt;shipped sub-agent support&lt;/a&gt;. During pre-release validation, I noticed something encouraging: Codex followed the workflow stopping points more strictly than expected. If the behavior stabilizes, it could be a viable primary development tool, not just a supporting one.&lt;/p&gt;

&lt;p&gt;The port took almost no effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Near-Zero Migration" Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;When I say "the same framework," I mean it. The core architecture didn't change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    ↓
requirement-analyzer → scale determination [STOP for confirmation]
    ↓
technical-designer → Design Doc
    ↓
document-reviewer [STOP for approval]
    ↓
work-planner → phased task breakdown [STOP]
    ↓
task-decomposer → atomic task files
    ↓
Per-task 4-step cycle:
  task-executor → escalation check → quality-fixer → git commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;22 sub-agents. 26 skills. The same stopping points, the same quality gates, the same TDD enforcement.&lt;/p&gt;

&lt;p&gt;What changed was the &lt;strong&gt;container format&lt;/strong&gt;, not the content:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Codex CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent definitions&lt;/td&gt;
&lt;td&gt;Markdown with YAML frontmatter (&lt;code&gt;agents/*.md&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;TOML files (&lt;code&gt;.codex/agents/*.toml&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/skills/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool declarations&lt;/td&gt;
&lt;td&gt;Explicit in frontmatter (&lt;code&gt;tools: Read, Grep, Glob...&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Not needed (inferred from sandbox mode)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill references&lt;/td&gt;
&lt;td&gt;Comma-separated names&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[[skills.config]]&lt;/code&gt; arrays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config directory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.codex/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's it. The agent instructions — the actual substance of what each agent knows and does — are the same. The workflow logic is the same. The quality criteria are the same.&lt;/p&gt;
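&lt;p&gt;For a rough sense of the container change, here is what an agent definition might look like on the Codex side. The field names below are illustrative guesses, not Codex's documented schema; only the file location and the &lt;code&gt;[[skills.config]]&lt;/code&gt; array come from the table above.&lt;/p&gt;

```toml
# .codex/agents/requirement-analyzer.toml, illustrative field names only;
# the Claude Code equivalent would be agents/requirement-analyzer.md
# with YAML frontmatter carrying the same content.
name = "requirement-analyzer"
description = "Extract task type, determine scale, identify ADR necessity"

[[skills.config]]
name = "requirement-analysis"
```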

&lt;h2&gt;
  
  
  Why This Worked: Design Decisions That Paid Off
&lt;/h2&gt;

&lt;p&gt;It worked because of three design choices I made early on:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Natural Language as the Interface Layer
&lt;/h3&gt;

&lt;p&gt;Every sub-agent's behavior is defined in natural language instructions, not in platform-specific tool calls. The requirement-analyzer isn't wired to Claude Code's &lt;code&gt;Agent&lt;/code&gt; tool or Codex's &lt;code&gt;spawn_agent&lt;/code&gt; — it follows a written protocol: "Extract task type, determine scale (1-2 files = Small, 3-5 = Medium, 6+ = Large), identify ADR necessity, output structured JSON."&lt;/p&gt;

&lt;p&gt;This means the instructions work on any LLM-powered agent system that can read text and follow procedures. In practice, that turned out to be enough. The framework is fundamentally &lt;strong&gt;a set of well-written job descriptions&lt;/strong&gt;, not a set of API integrations.&lt;/p&gt;
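&lt;p&gt;To see how mechanical that written protocol is, here is the scale rule transcribed into a throwaway Python sketch (illustrative only; nothing in the framework is implemented this way):&lt;/p&gt;

```python
import json

# Illustrative only: the requirement-analyzer is a written job description,
# not code. This transcription just shows the protocol is concrete enough
# to be mechanical, which is why it ports across platforms unchanged.

def determine_scale(file_count: int) -> str:
    """Scale rule as written: 1-2 files Small, 3-5 Medium, 6+ Large."""
    if file_count > 5:
        return "large"
    if file_count > 2:
        return "medium"
    return "small"

def analyze(task_type: str, file_count: int, needs_adr: bool) -> str:
    """Emit the structured JSON handoff the next agent consumes."""
    return json.dumps({
        "taskType": task_type,
        "scale": determine_scale(file_count),
        "adrRequired": needs_adr,
    })

print(analyze("feature", file_count=4, needs_adr=False))
# {"taskType": "feature", "scale": "medium", "adrRequired": false}
```

&lt;p&gt;The agents follow the same rules as written prose, which is exactly why the port was cheap.&lt;/p&gt;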

&lt;h3&gt;
  
  
  2. Context Separation as Architecture
&lt;/h3&gt;

&lt;p&gt;The core insight from the &lt;a href="https://dev.to/shinpr/zero-context-exhaustion-building-production-ready-ai-coding-teams-with-claude-code-sub-agents-31b"&gt;original article&lt;/a&gt; still applies: each agent runs in a fresh context without inheriting bias from previous steps. The document-reviewer doesn't know what the technical-designer was "thinking" — it just reviews the output. The investigator explores without confirmation bias from whoever reported the bug.&lt;/p&gt;

&lt;p&gt;This isn't a Claude Code feature or a Codex feature. It's an &lt;strong&gt;architectural pattern&lt;/strong&gt; that happens to be implementable on both platforms once they support sub-agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Structured Handoffs Over Shared State
&lt;/h3&gt;

&lt;p&gt;Agents communicate through artifacts (documents, JSON outputs, task files), not through shared memory or conversation threading. The technical-designer writes a Design Doc. The work-planner reads that Design Doc. Neither needs to know which platform spawned the other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;docs/
├── prd/          # PRD artifacts
├── adr/          # Architecture decision records
├── design/       # Design documents
├── plans/        # Work plans
│   └── tasks/    # Atomic task files (1 commit each)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file-based protocol turned out to be surprisingly platform-agnostic.&lt;/p&gt;
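&lt;p&gt;The handoff pattern can be sketched in a few lines (file names and task strings are made up for illustration):&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Sketch of the artifact handoff (paths and contents are made up):
# the designer writes a document, the planner later reads it from disk,
# with no shared memory or conversation state between the two steps.

def designer_step(docs: Path) -> Path:
    """Write a Design Doc artifact; the file is the entire handoff."""
    doc = docs / "design" / "login-design.md"
    doc.parent.mkdir(parents=True, exist_ok=True)
    doc.write_text("# Design: login\n\nContract: { ok: bool }\n")
    return doc

def planner_step(doc: Path) -> list[str]:
    """Start fresh: derive tasks purely from the artifact's contents."""
    text = doc.read_text()
    assert text.startswith("# Design:"), "planner only trusts the artifact"
    return ["task-01-implement-contract", "task-02-add-tests"]

docs_root = Path(tempfile.mkdtemp())
tasks = planner_step(designer_step(docs_root))
print(tasks)  # ['task-01-implement-contract', 'task-02-add-tests']
```

&lt;p&gt;Swap either side out for a different platform's agent and the handoff still works, because the contract is the file, not the runtime.&lt;/p&gt;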

&lt;h2&gt;
  
  
  The Framework in Action
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt; ships as an npm package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drops the full agent and skill infrastructure into your project. From there, you invoke recipe skills that orchestrate the sub-agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Recipe&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full lifecycle: requirements → design → implementation → quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requirements → design documents only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execute pre-planned tasks autonomously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-diagnose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Problem investigation → root cause → solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-front-build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;React/TypeScript implementation with frontend-specific agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;$recipe-fullstack-implement&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-layer features with layer-aware task routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each recipe automatically loads the right foundational skills (coding standards, testing principles, documentation criteria) and spawns the right sequence of agents with the right stopping points.&lt;/p&gt;

&lt;p&gt;Here's what a real session looks like. I ran &lt;code&gt;$recipe-implement Add a login screen with email/password and a simple auth API&lt;/code&gt; in Codex CLI against a fresh project with placeholder files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Requirement analysis — the first mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Plato [requirement-analyzer] (gpt-5.4-mini medium)

• Updated requirement analysis is now confirmed:
  - Scope: email/password login form, local mock auth abstraction,
    hardcoded success credentials demo@example.com / password123,
    inline success message only
  - Route: fullstack
  - Scale: medium
  - ADR: not required

  Please confirm this analysis and answer these three questions:
  1. Should the auth API be a real HTTP endpoint, or is a local/mock
     API abstraction enough?
  2. What credentials should succeed initially?
  3. On successful login, should it only show success, navigate
     somewhere, or persist auth state too?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow stops and waits. No implementation starts until I confirm scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Parallel design — backend and frontend at the same time:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Spawned Volta [technical-designer] (gpt-5.4-mini medium)
  └ Create a backend Design Doc...

• Spawned Pasteur [technical-designer-frontend] (gpt-5.4-mini medium)
  └ Create a frontend Design Doc...

• Spawned Epicurus [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-backend-design.md...

• Spawned Bohr [document-reviewer] (gpt-5.4-mini medium)
  └ Review login-frontend-design.md...

• Spawned Euclid [design-sync] (gpt-5.4-mini medium)
  └ Verify consistency between Design Docs...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five agents in parallel: two designers, two reviewers, one cross-layer sync checker. Each running in its own fresh context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Design gate — another mandatory stop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• The design gate is now clean enough to stop for approval.

  - Backend design: approved
  - Frontend design: approved
  - Cross-layer sync: NO_CONFLICTS

  Aligned contract:
  - Success: { ok: true }
  - Failure: { ok: false, error }
  - Inline success copy is UI-owned, not backend-owned

  Please approve the design docs so I can move to
  acceptance-test generation and the work plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Autonomous execution after batch approval:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;› batch approval

• Spawned Ohm [task-decomposer] (gpt-5.4-mini medium)

• Verification passed:
  - npm test
  - npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After batch approval, the framework decomposed the work plan into tasks and executed them autonomously — no more stopping points until the quality gates pass.&lt;/p&gt;

&lt;p&gt;The whole flow from &lt;code&gt;$recipe-implement&lt;/code&gt; to green tests took one session. The same flow, the same stopping points, the same agent roles that I've been running on Claude Code for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The framework is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex CLI version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/codex-workflows" rel="noopener noreferrer"&gt;codex-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code version&lt;/strong&gt;: &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;claude-code-workflows&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're already using the Claude Code version, the Codex version follows the same patterns. If you're new to both, pick whichever CLI you're already using — the workflow knowledge transfers either way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx codex-workflows &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The whole port changed config file formats and directory conventions. The agent instructions — the part that actually matters — didn't need a single edit. That's the thing I'd want to know if I were deciding whether to invest time in workflow design for AI coding tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've been running sub-agent workflows with either Claude Code or Codex CLI, I'd be curious how your setup compares. What worked? What broke?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Letting LLMs Jump — and Then Verifying Ruthlessly</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 12 Feb 2026 13:35:03 +0000</pubDate>
      <link>https://dev.to/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</link>
      <guid>https://dev.to/shinpr/letting-llms-jump-and-then-verifying-ruthlessly-1mj0</guid>
      <description>&lt;h2&gt;
  
  
  The "First Plausible Answer" Problem
&lt;/h2&gt;

&lt;p&gt;You've probably seen this: you ask an LLM to investigate a bug, and it latches onto the first plausible explanation. It confidently proposes a fix before thoroughly exploring alternatives. Sometimes it works. Often it doesn't—and you're left debugging the debugger.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly in my personal projects. The LLM would find &lt;em&gt;something&lt;/em&gt; that looked like the cause, stop investigating, and immediately suggest a solution. When the codebase was small, this worked fine. As it grew, I started getting fixes that didn't actually fix anything.&lt;/p&gt;

&lt;p&gt;This approach is not for small scripts or simple bugs. I only started needing it once my codebase grew large enough that "just try a fix" stopped working.&lt;/p&gt;

&lt;p&gt;The root issue? &lt;strong&gt;How I was defining the task's purpose.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Planning works well when the problem is understood. But when the problem itself is unclear, planning alone is not enough. This article focuses on those cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Factor That Made the Difference: Purpose
&lt;/h2&gt;

&lt;p&gt;When delegating tasks to LLMs, two factors affect execution accuracy: &lt;strong&gt;Context&lt;/strong&gt; (staying within ~70% of the context window) and &lt;strong&gt;Purpose&lt;/strong&gt; (how you define the task's goal).&lt;/p&gt;

&lt;p&gt;Context management matters, but this article focuses on the second factor—because that's where I was getting it wrong.&lt;/p&gt;

&lt;p&gt;Where you set the task's goal matters more than you might think. The purpose you define determines the task granularity, and the right granularity depends on your codebase complexity.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Real Example: Bug Investigation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Old Approach
&lt;/h3&gt;

&lt;p&gt;A single session handling "Investigation → Solution Proposal → Verification," followed by a separate review session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagme1n6o1ik5lu0ug2o.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What I Changed
&lt;/h3&gt;

&lt;p&gt;My original goal was simple: "propose a solution" and "review it objectively."&lt;/p&gt;

&lt;p&gt;Originally, I'd just have the LLM investigate, propose a fix, and implement it directly. But as the codebase grew, I started getting solutions that didn't actually work. So I added a review step—opening a fresh session to check the proposal with clean context.&lt;/p&gt;

&lt;p&gt;This worked for about 60-70% of problems, but occasionally even this approach couldn't reach the root cause, no matter how many iterations I ran.&lt;/p&gt;

&lt;p&gt;Here's what I changed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Problem Structuring&lt;/strong&gt;: Structure my instructions upfront to make them easier for LLMs to parse in later steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigation&lt;/strong&gt;: Conduct comprehensive investigation and report results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt;: If there's uncertainty in the report, perform additional verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution Derivation&lt;/strong&gt;: Receive investigation and verification results, then derive solutions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln0ycoa4s415257xzuvl.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting &lt;strong&gt;"investigation" as the purpose&lt;/strong&gt;, the model stopped jumping to the first candidate and instead collected information from multiple angles.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Example
&lt;/h2&gt;

&lt;p&gt;This setup is probably overkill for small scripts. I only started doing this after my codebase crossed a certain complexity threshold.&lt;/p&gt;

&lt;p&gt;Here's how I structured the diagnosis workflow using Claude Code's slash commands and sub-agents. Full implementation is available at &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;github.com/shinpr/claude-code-workflows&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Main Command (diagnose.md)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;problem,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;verify&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;findings,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;derive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;solutions"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gs"&gt;**Command Context**&lt;/span&gt;: Diagnosis flow to identify root cause and present solutions

Target problem: $ARGUMENTS

&lt;span class="gu"&gt;## Step 0: Problem Structuring (Before investigator invocation)&lt;/span&gt;

&lt;span class="gu"&gt;### 0.1 Problem Type Determination&lt;/span&gt;

| Type | Criteria |
|------|----------|
| Change Failure | Indicates some change occurred before the problem appeared |
| New Discovery | No relation to changes is indicated |

&lt;span class="gu"&gt;### 0.2 Information Supplementation for Change Failures&lt;/span&gt;

If the following are unclear, &lt;span class="gs"&gt;**ask with AskUserQuestion**&lt;/span&gt; before proceeding:
&lt;span class="p"&gt;-&lt;/span&gt; What was changed (cause change)
&lt;span class="p"&gt;-&lt;/span&gt; What broke (affected area)
&lt;span class="p"&gt;-&lt;/span&gt; Relationship between both (shared components, etc.)

&lt;span class="gu"&gt;## Diagnosis Flow Overview&lt;/span&gt;

The goal of investigation is not to propose solutions.
It is to eliminate wrong explanations.

&lt;span class="gs"&gt;**Context Separation**&lt;/span&gt;: Pass only structured JSON output to each step.
Each step starts fresh with the JSON data only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Sub-agent: Investigator
&lt;/h3&gt;

&lt;p&gt;Think of the Investigator as a junior engineer whose only job is to gather facts, not to be clever. Its purpose is explicitly limited to &lt;strong&gt;evidence collection only&lt;/strong&gt;—no solutions. This is one concrete implementation; what matters is the separation of purpose, not the specific tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**evidence matrix and factual observations only**&lt;/span&gt;.
Solution derivation is out of scope for this agent.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Don't rely on a single source
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Search external info (WebSearch)**&lt;/span&gt; - Official docs, Stack Overflow, GitHub Issues
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**List hypotheses and trace causes**&lt;/span&gt; - Multiple candidates, not just the first one
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Identify impact scope**&lt;/span&gt; - Where else might this pattern exist?
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="gs"&gt;**Disclose blind spots**&lt;/span&gt; - Honestly report areas that could not be investigated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key output structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hypotheses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"H1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hypothesis description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causeCategory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"typo|logic_error|missing_constraint|design_gap|external_factor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"causalChain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Phenomenon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Direct cause"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"→ Root cause"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"supportingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"contradictingEvidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unexploredAspects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Unverified aspects"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comparisonAnalysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"normalImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to working implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failingImplementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Path to problematic implementation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keyDifferences"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Differences"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-agent: Verifier
&lt;/h3&gt;

&lt;p&gt;The Verifier plays the annoying senior reviewer who assumes everything is wrong. It actively seeks refutation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Cross-check multiple sources**&lt;/span&gt; - Explore information sources not covered
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Generate alternative hypotheses**&lt;/span&gt; - What else could explain this?
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Play devil's advocate**&lt;/span&gt; - Assume "the investigation results are wrong"
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Pick the hypothesis with fewest holes**&lt;/span&gt; - Not "most evidence," but "least refuted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-agent: Solver
&lt;/h3&gt;

&lt;p&gt;The Solver is the engineer who actually has to ship something. Only after verification does it derive solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Output Scope&lt;/span&gt;

This agent outputs &lt;span class="gs"&gt;**solution derivation and recommendation presentation**&lt;/span&gt;.
Trust the given conclusion and proceed directly to solution derivation.

&lt;span class="gu"&gt;## Core Responsibilities&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Multiple solution generation**&lt;/span&gt; - At least 3 different approaches
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Tradeoff analysis**&lt;/span&gt; - Cost, risk, impact scope, maintainability
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Recommendation selection**&lt;/span&gt; - Optimal solution with selection rationale
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Implementation steps presentation**&lt;/span&gt; - Concrete, actionable steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Guidelines
&lt;/h2&gt;

&lt;p&gt;When designing LLM tasks, I now check two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Purpose Clarity&lt;/strong&gt; - "Don't create tasks with unclear purposes"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Efficiency&lt;/strong&gt; - Can it be completed in one session with sufficient information? (Ideally using 60-70% of the context window)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I don't blindly split tasks into smaller pieces. Instead, I consider ROI and break larger tasks down only when necessary.&lt;/p&gt;

&lt;p&gt;By explicitly separating "investigation" from "solution," you prevent the model from rushing to conclusions before it has gathered sufficient evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lesson I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;Early on, I made the Verifier run every single time. The problem? Even when the investigation was clearly off track, the Verifier would dutifully try to verify nonsense.&lt;/p&gt;

&lt;p&gt;That's when I realized: &lt;strong&gt;you need a quality gate between steps&lt;/strong&gt;, not just separation.&lt;/p&gt;

&lt;p&gt;Now I have a checkpoint between Investigation and Verification. If the investigation output doesn't meet basic quality criteria (missing comparison analysis, shallow causal chains, etc.), it loops back instead of wasting cycles on verification.&lt;/p&gt;
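
&lt;p&gt;As a sketch, the gate can be a plain function that inspects the investigation output before the Verifier runs. The report shape and field names below are hypothetical, not an actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Hypothetical shape of an investigation report; field names are illustrative.
interface InvestigationReport {
  comparisonAnalysis?: { keyDifferences: string[] };
  causalChain: string[];
}

// Quality gate between Investigation and Verification.
// Returns the list of problems; an empty list means "proceed to the Verifier".
function gateProblems(report: InvestigationReport): string[] {
  const problems: string[] = [];
  if (!report.comparisonAnalysis) {
    problems.push("missing comparison analysis");
  }
  if (report.causalChain.length === 0) {
    problems.push("no causal chain at all");
  } else if (report.causalChain.length === 1) {
    problems.push("shallow causal chain (single step)");
  }
  return problems;
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the list is non-empty, the loop back to Investigation carries those problems as feedback for the next attempt.&lt;/p&gt;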

&lt;p&gt;I also added Step 0 (Problem Structuring) to help the LLM understand my intent better before diving in. These two changes—quality gates and upfront structuring—made the whole pipeline actually usable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Design Integration Checkpoints Before Letting LLMs Code</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:23:55 +0000</pubDate>
      <link>https://dev.to/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</link>
      <guid>https://dev.to/shinpr/design-integration-checkpoints-before-letting-llms-code-edo</guid>
      <description>&lt;p&gt;Once you stop trying to control AI generation and start designing verification, you immediately hit the next problem: integration.&lt;br&gt;
And this is where most AI-generated systems actually break.&lt;/p&gt;

&lt;p&gt;Everything works.&lt;br&gt;
Until it doesn't.&lt;/p&gt;

&lt;p&gt;Each layer looks correct in isolation.&lt;br&gt;
Tests pass.&lt;br&gt;
Types line up.&lt;/p&gt;

&lt;p&gt;And then the system breaks where those layers meet.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why "Everything Works" Until It Doesn't
&lt;/h2&gt;

&lt;p&gt;This is a verification problem, not an implementation problem.&lt;br&gt;
When you build systems layer by layer, integration happens very late.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer-by-layer development
Phase 1: Data layer ────────────────✓
Phase 2: Service layer ─────────────✓
Phase 3: API layer ─────────────────✓
Phase 4: UI layer ──────────────────✓
Phase 5: Integration ── 💥 breaks here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is implemented in isolation.&lt;br&gt;
So you don't actually know if everything connects correctly until the end.&lt;/p&gt;

&lt;p&gt;This problem becomes much worse with AI-generated code.&lt;/p&gt;

&lt;p&gt;LLMs don't hold the entire system in mind at once.&lt;br&gt;
They optimize locally, based on the current context — and they often miss hidden contracts between layers.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Painful Integration Bug That "Worked"
&lt;/h2&gt;

&lt;p&gt;One of the most painful bugs I faced didn't involve crashes or errors.&lt;/p&gt;

&lt;p&gt;The AI chatbot worked.&lt;/p&gt;

&lt;p&gt;It returned responses.&lt;br&gt;
Logs looked normal.&lt;br&gt;
Nothing failed.&lt;/p&gt;

&lt;p&gt;But when we tested it in the real environment, the answers were subtly — but consistently — wrong.&lt;/p&gt;
&lt;h3&gt;
  
  
  What actually went wrong
&lt;/h3&gt;

&lt;p&gt;The root cause wasn't a single mistake, but a combination of issues across layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock implementations silently left in place&lt;/li&gt;
&lt;li&gt;LLM fallbacks that prioritized "returning something" instead of failing fast&lt;/li&gt;
&lt;li&gt;Duplicate logic across layers, created while implementing each layer separately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread? I wasn't tracking what else might break.&lt;/p&gt;

&lt;p&gt;Each layer looked correct in isolation.&lt;br&gt;
Tests passed.&lt;br&gt;
No alerts fired.&lt;/p&gt;

&lt;p&gt;Because the system always returned some response, it created a false sense of confidence.&lt;br&gt;
We didn't notice the problem immediately — and by the time we did, identifying the real cause across layers was extremely difficult.&lt;/p&gt;

&lt;p&gt;Bugs that silently "work" are far more dangerous than bugs that crash.&lt;/p&gt;
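
&lt;p&gt;The fallback half of that failure mode is easy to sketch. Both functions below are hypothetical stand-ins for the real services; the point is the contrast between them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Hypothetical retrieval step; imagine it silently broken (a mock left in place).
function retrieveContext(query: string): string[] {
  return []; // the bug: nothing is actually retrieved
}

// Anti-pattern: always return something, so the breakage stays invisible.
function answerWithFallback(query: string): string {
  const context = retrieveContext(query);
  if (context.length === 0) {
    return "Here is a general answer..."; // plausible, subtly wrong
  }
  return "Answer grounded in: " + context.join(", ");
}

// Fail fast: empty context is an error, not a degraded answer.
function answerFailFast(query: string): string {
  const context = retrieveContext(query);
  if (context.length === 0) {
    throw new Error("retrieval returned no context for: " + query);
  }
  return "Answer grounded in: " + context.join(", ");
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first version is what we had: every layer "worked" because something always came back. The second version would have failed loudly on the first test.&lt;/p&gt;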
&lt;h2&gt;
  
  
  Make Integration Explicit
&lt;/h2&gt;

&lt;p&gt;I now spend about five minutes defining integration checkpoints.&lt;br&gt;
Not documentation. Just verification.&lt;/p&gt;

&lt;p&gt;The goal is simple: define where things must connect, and how I'll know they actually do.&lt;/p&gt;

&lt;p&gt;Now, before implementation, I write a very small design note.&lt;/p&gt;

&lt;p&gt;Not a formal design document.&lt;br&gt;
No template, no ceremony.&lt;/p&gt;

&lt;p&gt;Just a checklist that answers two questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What parts of the system are affected?&lt;/li&gt;
&lt;li&gt;Where do things need to integrate — and how do I verify it?&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Step 1: List What's Affected
&lt;/h3&gt;

&lt;p&gt;First, I write down what is directly or indirectly impacted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add image generation feature&lt;/span&gt;

&lt;span class="na"&gt;Direct impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure/image/functions.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/queryClassificationService.ts&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application/services/imageGenerationService.ts&lt;/span&gt;

&lt;span class="na"&gt;Indirect impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;conversationService.ts (function calling flow)&lt;/span&gt;

&lt;span class="na"&gt;No impact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;existing text generation services&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;other function handlers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This immediately clarifies the blast radius.&lt;/p&gt;

&lt;p&gt;I don't aim for perfection —&lt;br&gt;
I just want to avoid being surprised later.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Define Integration Checkpoints
&lt;/h3&gt;

&lt;p&gt;Next, I decide where integration must be verified and how.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Function selection&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConversationService.generateContentWithFunctionCalling&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Send a request asking for an image&lt;/span&gt;
  &lt;span class="s"&gt;2. Confirm query classification returns `image_generation`&lt;/span&gt;
  &lt;span class="s"&gt;3. Confirm the correct function is selected in logs&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Log shows: Executing function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generateImage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And another one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Integration point 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Image generation and posting&lt;/span&gt;
&lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ImageGenerationService → MessagingClient.uploadFile&lt;/span&gt;

&lt;span class="na"&gt;How to verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;1. Image data is returned from the image client&lt;/span&gt;
  &lt;span class="s"&gt;2. The file is posted to the chat thread&lt;/span&gt;

&lt;span class="na"&gt;Expected result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Image appears in the chat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I know exactly what "working" means.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works (Especially with AI)
&lt;/h2&gt;

&lt;p&gt;When I give this to an LLM, it changes how implementation happens.&lt;/p&gt;

&lt;p&gt;Instead of "build this feature," it's more like:&lt;br&gt;
"Connect A to B. Here's how we'll know it works."&lt;/p&gt;

&lt;p&gt;This also pairs well with building features end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature-based development
Feature A: Data → Service → API → UI → Verify
Feature B: Data → Service → API → UI → Verify
Feature C: Data → Service → API → UI → Verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each feature is fully integrated before moving on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Before this habit, integration bugs often cost me hours.&lt;/p&gt;

&lt;p&gt;After introducing these small design notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-generated code still has small issues&lt;/li&gt;
&lt;li&gt;But features no longer completely break at integration&lt;/li&gt;
&lt;li&gt;Unexpected behavior is caught much earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five minutes of thinking up front easily saves hours of debugging later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;This approach works well if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI coding tools&lt;/li&gt;
&lt;li&gt;Build layered architectures&lt;/li&gt;
&lt;li&gt;Want fast feedback instead of perfect design docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about writing more documentation.&lt;br&gt;
It's just about making integration explicit before code is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI tools are incredibly powerful — but they optimize locally.&lt;/p&gt;

&lt;p&gt;If we don't define integration points explicitly, we end up debugging systems that look correct but behave incorrectly.&lt;/p&gt;

&lt;p&gt;A small design checklist has made a huge difference for me.&lt;/p&gt;

&lt;p&gt;Hope this saves you some painful debugging.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Planning Is the Real Superpower of Agentic Coding</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 26 Jan 2026 12:24:31 +0000</pubDate>
      <link>https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</link>
      <guid>https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm</guid>
      <description>&lt;p&gt;I see this pattern constantly: someone gives an LLM a task, it starts executing immediately, and halfway through you realize it's building the wrong thing. Or it gets stuck in a loop. Or it produces something that technically works but doesn't fit the existing codebase at all.&lt;/p&gt;

&lt;p&gt;The instinct is to write better prompts. More detail. More constraints. More examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The actual fix is simpler: make it plan before it executes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research shows that separating planning from execution dramatically improves task success rates—by as much as 33% in complex scenarios.&lt;/p&gt;

&lt;p&gt;In earlier articles, I wrote about why &lt;a href="https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf"&gt;LLMs struggle with first attempts&lt;/a&gt; and why &lt;a href="https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl"&gt;overloading AGENTS.md&lt;/a&gt; is often a symptom of that misunderstanding. This article focuses on what actually fixes that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "Just Execute" Fails
&lt;/h2&gt;

&lt;p&gt;This took me longer to figure out than I'd like to admit. When you ask an LLM to directly implement something, you're asking it to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the requirements&lt;/li&gt;
&lt;li&gt;Analyze the existing codebase&lt;/li&gt;
&lt;li&gt;Design an approach&lt;/li&gt;
&lt;li&gt;Evaluate trade-offs&lt;/li&gt;
&lt;li&gt;Decompose into steps&lt;/li&gt;
&lt;li&gt;Execute each step&lt;/li&gt;
&lt;li&gt;Verify results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All in one shot. With one context. Using the same cognitive load throughout.&lt;/p&gt;

&lt;p&gt;Even powerful LLMs struggle with this. Not because they lack capability, but because &lt;strong&gt;long-horizon planning is fundamentally hard when acting step by step.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Plan-Execute Architecture
&lt;/h2&gt;

&lt;p&gt;Research on LLM agents has consistently shown that separating planning and execution yields better results.&lt;/p&gt;

&lt;p&gt;The reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explicit long-term planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even strong LLMs struggle with multi-step reasoning when taking actions one at a time. Explicit planning forces consideration of the full path.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You can use a powerful model for planning and a lighter model for execution—or even different specialized models per phase.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each execution step doesn't need to reason through the entire conversation history. It just needs to execute against the plan.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What matters here: &lt;strong&gt;the plan becomes an artifact&lt;/strong&gt;, and the execution becomes &lt;em&gt;verification against that artifact&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you've read about why LLMs are better at verification than first-shot generation, this should sound familiar. Creating a plan first converts the execution task from "generate good code" to "implement according to this plan"—a much clearer, more verifiable objective.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Workflow
&lt;/h2&gt;

&lt;p&gt;The complete picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Preparation
    │
    ▼
Step 2: Design (Agree on Direction)
    │
    ▼
Step 3: Work Planning  ← The Most Important Step
    │
    ▼
Step 4: Execution
    │
    ▼
Step 5: Verification &amp;amp; Feedback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'll walk through each step, but Step 3 is where the magic happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Preparation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Clarify &lt;em&gt;what&lt;/em&gt; you want to achieve, not &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a ticket, issue, or todo document stating the goal in plain language&lt;/li&gt;
&lt;li&gt;Point the LLM to AGENTS.md (or CLAUDE.md, depending on your tool) and relevant context files&lt;/li&gt;
&lt;li&gt;Don't jump into implementation details yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is about setting the stage, not solving the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Design (Agree on Direction)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Align on the approach before any code gets written.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't Let It Start Coding Immediately
&lt;/h3&gt;

&lt;p&gt;Instead of "implement this feature," say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Before implementing, present a step-by-step plan for how you would approach this."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Review the Plan
&lt;/h3&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contradictions with existing architecture&lt;/li&gt;
&lt;li&gt;Simpler alternatives the LLM missed&lt;/li&gt;
&lt;li&gt;Misunderstandings of the requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, you're agreeing on &lt;strong&gt;what to build&lt;/strong&gt; and &lt;strong&gt;why this approach&lt;/strong&gt;. The &lt;strong&gt;how&lt;/strong&gt; and &lt;strong&gt;in what order&lt;/strong&gt; come in Step 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Work Planning (The Most Important Step)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This section is dense. But the payoff is proportional—the more carefully you plan, the smoother execution becomes.&lt;/p&gt;

&lt;p&gt;For small tasks, you don't need all of this. See "Scaling to Task Size" at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Convert the design into executable work units with clear completion criteria.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Step Matters Most
&lt;/h3&gt;

&lt;p&gt;Research shows that decomposing complex tasks into subtasks significantly improves LLM success rates. Step-by-step decomposition produces more accurate results than direct generation.&lt;/p&gt;

&lt;p&gt;But there's another reason: &lt;strong&gt;the work plan is an artifact&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the plan exists, the execution task transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: "Build this feature" (generation)&lt;/li&gt;
&lt;li&gt;After: "Implement according to this plan" (verification)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same principle from Article 1. Creating a plan first means execution becomes verification—and LLMs are better at verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Work Planning Includes
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task decomposition&lt;/strong&gt;: Break the design into executable units&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency mapping&lt;/strong&gt;: Define order and dependencies between tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completion criteria&lt;/strong&gt;: What does "done" mean for each task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint design&lt;/strong&gt;: When do we get external feedback?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Perspectives to Consider
&lt;/h3&gt;

&lt;p&gt;I'll be honest: I learned most of these the hard way. Plans would fall apart mid-implementation, and only later did I realize I'd skipped something obvious in hindsight.&lt;/p&gt;

&lt;p&gt;These aren't meant to be followed rigidly for every task. Think of them as a mental checklist. You don't need to get all of these right—if even one of these perspectives changes your plan, it's doing its job.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 1: Current State Analysis
&lt;/h4&gt;

&lt;p&gt;Understand what exists before planning changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this code's actual responsibility?&lt;/li&gt;
&lt;li&gt;Which parts are essential business logic vs. technical constraints?&lt;/li&gt;
&lt;li&gt;What benefits and limitations does the current design provide?&lt;/li&gt;
&lt;li&gt;What implicit dependencies or assumptions aren't obvious from the code?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skipping this leads to plans that don't fit the existing codebase.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 2: Strategy Selection
&lt;/h4&gt;

&lt;p&gt;Consider how to approach the transition from current to desired state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for similar patterns in your tech stack&lt;/li&gt;
&lt;li&gt;Check how comparable projects solved this&lt;/li&gt;
&lt;li&gt;Review OSS implementations, articles, documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common strategy patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strangler Pattern&lt;/strong&gt;: Gradual replacement, incremental migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facade Pattern&lt;/strong&gt;: Hide complexity behind unified interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature-Driven&lt;/strong&gt;: Vertical slices, user-value first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foundation-Driven&lt;/strong&gt;: Build stable base first, then features on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key isn't applying patterns dogmatically—it's consciously choosing an approach instead of stumbling into one.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 3: Risk Assessment
&lt;/h4&gt;

&lt;p&gt;Evaluate what could go wrong with your chosen strategy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Type&lt;/th&gt;
&lt;th&gt;Considerations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impact on existing systems, data integrity, performance degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service availability, deployment downtime, rollback procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule delays, learning curve, team coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Skipping risk assessment leads to expensive surprises mid-implementation.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 4: Constraints
&lt;/h4&gt;

&lt;p&gt;Identify hard limits before committing to a strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt;: Library compatibility, resource capacity, performance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline&lt;/strong&gt;: Deadlines, milestones, external dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Team availability, skill gaps, budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt;: Time-to-market, customer impact, regulations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A strategy that ignores constraints isn't executable.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 5: Completion Levels
&lt;/h4&gt;

&lt;p&gt;Define what "done" means for each task—this is critical.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1: Functional verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works as user-facing feature&lt;/td&gt;
&lt;td&gt;Search actually returns results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2: Test verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New tests added and passing&lt;/td&gt;
&lt;td&gt;Type definition tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3: Build verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No compilation errors&lt;/td&gt;
&lt;td&gt;Interface definition complete&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Priority: L1 &amp;gt; L2 &amp;gt; L3&lt;/strong&gt;. Whenever possible, verify at L1 (actually works in practice).&lt;/p&gt;

&lt;p&gt;This directly maps to "external feedback" from the previous articles. Defining completion levels upfront ensures you get external verification at each checkpoint.&lt;/p&gt;




&lt;h4&gt;
  
  
  Perspective 6: Integration Points
&lt;/h4&gt;

&lt;p&gt;Define when to verify things work together.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Integration Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Feature-driven&lt;/td&gt;
&lt;td&gt;When users can actually use the feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Foundation-driven&lt;/td&gt;
&lt;td&gt;When all layers are complete and E2E tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strangler pattern&lt;/td&gt;
&lt;td&gt;At each old-to-new system cutover&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without defined integration points, you end up with "it all works individually but doesn't work together."&lt;/p&gt;




&lt;h3&gt;
  
  
  Task Decomposition Principles
&lt;/h3&gt;

&lt;p&gt;After considering the perspectives, break down into concrete tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Executable granularity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each task = one meaningful commit&lt;/li&gt;
&lt;li&gt;Clear completion criteria&lt;/li&gt;
&lt;li&gt;Explicit dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minimize dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum 2 levels deep (A→B→C is okay, A→B→C→D needs redesign)&lt;/li&gt;
&lt;li&gt;Tasks with 3+ chained dependencies should be split&lt;/li&gt;
&lt;li&gt;Each task should ideally provide independent value&lt;/li&gt;
&lt;/ul&gt;
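
&lt;p&gt;The depth rule is mechanical enough to check. A toy example, with placeholder task names (A→B→C→D mirrors the "needs redesign" case above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// deps maps each task to the tasks it depends on (names are placeholders).
const deps: { [task: string]: string[] } = {
  A: [],
  B: ["A"],
  C: ["B"], // A→B→C: chain of 3 tasks, still fine
  D: ["C"], // A→B→C→D: chain of 4 tasks, should be split
};

// Number of tasks on the longest dependency chain ending at the given task.
function chainDepth(task: string): number {
  const parents = deps[task] ?? [];
  if (parents.length === 0) return 1;
  return 1 + Math.max(...parents.map(chainDepth));
}

for (const task of Object.keys(deps)) {
  if (chainDepth(task) > 3) {
    console.log(task + ": 3+ chained dependencies, split or redesign");
  }
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I don't literally run this on every plan, but it captures the rule: flag any task sitting at the end of a chain longer than three tasks.&lt;/p&gt;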

&lt;p&gt;&lt;strong&gt;Build quality in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't make "write tests" a separate task—include testing in the implementation task&lt;/li&gt;
&lt;li&gt;Tag each task with its completion level (L1/L2/L3, though in practice L1 is almost always what you want)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Work Planning Anti-Patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skip current-state analysis&lt;/td&gt;
&lt;td&gt;Plan doesn't fit codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore risks&lt;/td&gt;
&lt;td&gt;Expensive surprises mid-implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignore constraints&lt;/td&gt;
&lt;td&gt;Plan isn't executable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Over-detail&lt;/td&gt;
&lt;td&gt;Lose flexibility, waste planning time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Undefined completion criteria&lt;/td&gt;
&lt;td&gt;"Done" is ambiguous, verification impossible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Scaling to Task Size
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs full work planning.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Planning Depth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small (1-2 hours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verbal/mental notes or simple TODO list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium (1 day to 1 week)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Written work plan, but abbreviated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Large (1+ weeks)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full work plan covering all perspectives&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a typo fix, you don't need a work plan. For a multi-week refactor, you absolutely do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Execution
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Implement according to the work plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work in Small Steps
&lt;/h3&gt;

&lt;p&gt;Follow the plan. One task at a time. One file, one function at a time where appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types-First
&lt;/h3&gt;

&lt;p&gt;When adding new functionality, define interfaces and types before implementing logic. Type definitions become guardrails that help both you and the LLM stay on track.&lt;/p&gt;
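
&lt;p&gt;A minimal illustration (the service and type names here are invented for the example, not from a real codebase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
```typescript
// Define the contract first; the implementation comes after.
interface ImageRequest {
  prompt: string;
  size: "512x512" | "1024x1024";
}

interface ImageResult {
  url: string;
  model: string;
}

// The signature is now a guardrail: whatever the LLM writes here
// must turn an ImageRequest into an ImageResult, or it won't compile.
function generateImage(req: ImageRequest): ImageResult {
  // ...real logic filled in later, checked against the types above
  return { url: "https://example.com/" + req.prompt + ".png", model: "stub" };
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the interfaces agreed on first, a wrong implementation tends to fail the type check instead of failing silently at runtime.&lt;/p&gt;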

&lt;h3&gt;
  
  
  Why This Changes Everything
&lt;/h3&gt;

&lt;p&gt;With a work plan in place, execution becomes &lt;em&gt;verification&lt;/em&gt;. The LLM isn't guessing what to build—it's checking whether the implementation matches the plan.&lt;/p&gt;

&lt;p&gt;If you need to deviate from the plan, &lt;strong&gt;update the plan first&lt;/strong&gt;, then continue implementation. Don't let plan and implementation drift apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Verification &amp;amp; Feedback
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Verify results and externalize learnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feedback Format
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, don't just paste an error. Include the intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[error log]

✅ Intent + error
Goal: Redirect to dashboard after authentication
Issue: Following error occurs
[error log]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without intent, the LLM optimizes for "remove the error." With intent, it optimizes for "achieve the goal."&lt;/p&gt;

&lt;h3&gt;
  
  
  Externalize Learnings
&lt;/h3&gt;

&lt;p&gt;If you find yourself explaining the same thing twice, it's time to write it down.&lt;/p&gt;

&lt;p&gt;I covered this in detail in the previous article—where to put rules, what to write, and how to verify they work. The short version: write root causes, not specific incidents, and put them where they'll actually be read.&lt;/p&gt;




&lt;h2&gt;
  
  
  Referencing Skills and Rules
&lt;/h2&gt;

&lt;p&gt;One common failure mode: you reference a skill or rule file, but the LLM just reads it and moves on without actually applying it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write "see AGENTS.md"&lt;/td&gt;
&lt;td&gt;It's already loaded—redundant reference adds noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;@file.md&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;LLM reads it, then continues. Reading ≠ applying&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Please reference X"&lt;/td&gt;
&lt;td&gt;References it minimally, doesn't apply the content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Solution: Blocking References
&lt;/h3&gt;

&lt;p&gt;Make the reference a task with verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Required Rules [MANDATORY - MUST BE ACTIVE]&lt;/span&gt;

&lt;span class="gs"&gt;**LOADING PROTOCOL:**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; STEP 1: CHECK if &lt;span class="sb"&gt;`.agents/skills/coding-rules/SKILL.md`&lt;/span&gt; is active
&lt;span class="p"&gt;-&lt;/span&gt; STEP 2: If NOT active → Execute BLOCKING READ
&lt;span class="p"&gt;-&lt;/span&gt; STEP 3: CONFIRM skill active before proceeding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action verbs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"CHECK", "READ", "CONFIRM"—not just "reference"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STEP numbers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forces sequence, can't skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Before proceeding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocking—must complete before continuing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;If NOT active&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conditional—skips if already loaded (efficiency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This maps to the task clarity principle: "check if loaded → load if needed → confirm → proceed" is far clearer than "please reference this file."&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Connects to the Theory
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Connection to LLM Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Step 1: Preparation&lt;/td&gt;
&lt;td&gt;Task clarification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 2: Design&lt;/td&gt;
&lt;td&gt;Artifact-first (design doc is an artifact)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 3: Work Planning&lt;/td&gt;
&lt;td&gt;Artifact-first (plan is an artifact) + external feedback design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 4: Execution&lt;/td&gt;
&lt;td&gt;Transform "generation" into "verification against plan"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 5: Verification&lt;/td&gt;
&lt;td&gt;Obtain external feedback + externalize learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The work plan created in Step 3 converts Step 4 from "generate from scratch" to "verify against specification." This is the key mechanism for improving accuracy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article aren't just workflow opinions—they're backed by research on how LLM agents perform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADaPT (Prasad et al., NAACL 2024)&lt;/strong&gt;: Separating planning and execution, with dynamic subtask decomposition when needed, achieved up to 33% higher success rates than baselines (28.3% on ALFWorld, 27% on WebShop, 33% on TextCraft).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan-and-Execute (LangChain)&lt;/strong&gt;: Explicit long-term planning enables handling complex tasks that even powerful LLMs struggle with in step-by-step mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Layer Task Decomposition (PMC, 2024)&lt;/strong&gt;: Step-by-step models generate more accurate results than direct generation—task decomposition directly improves output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Decomposition (Amazon Science, 2025)&lt;/strong&gt;: With proper task decomposition, smaller specialized models can match the performance of larger general models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't let it execute immediately.&lt;/strong&gt; Ask for a plan first. Even just "present your approach step-by-step before implementing" makes a significant difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work Planning is the superpower.&lt;/strong&gt; A plan is an artifact. Having it converts execution from generation to verification—and LLMs are better at verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define completion criteria.&lt;/strong&gt; L1 (works as feature) &amp;gt; L2 (tests pass) &amp;gt; L3 (builds). Know what "done" means before starting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale to task size.&lt;/strong&gt; Small task = mental note. Large task = full work plan. Don't over-plan trivial work, don't under-plan complex work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update plan before deviating.&lt;/strong&gt; If implementation needs to differ from the plan, update the plan first. Drift kills the verification benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include intent with errors.&lt;/strong&gt; "Goal + error" beats "just error." The LLM should know what you're trying to achieve, not just what went wrong.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prasad, A., et al. (2024). "ADaPT: As-Needed Decomposition and Planning with Language Models." NAACL 2024 Findings. arXiv:2311.05772&lt;/li&gt;
&lt;li&gt;Wang, L., et al. (2023). "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." ACL 2023.&lt;/li&gt;
&lt;li&gt;LangChain. "Plan-and-Execute Agents." &lt;a href="https://blog.langchain.com/planning-agents/" rel="noopener noreferrer"&gt;https://blog.langchain.com/planning-agents/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Stop Guessing If Your Prompt Is Better</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 22 Jan 2026 13:52:50 +0000</pubDate>
      <link>https://dev.to/shinpr/stop-guessing-if-your-prompt-is-better-5amb</link>
      <guid>https://dev.to/shinpr/stop-guessing-if-your-prompt-is-better-5amb</guid>
      <description>&lt;p&gt;You rewrote your prompt. The output looks different. But is it actually &lt;em&gt;better&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Most of us have been there — reading prompt engineering best practices, tweaking instructions, and hoping the changes help. But without comparison, you're just guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When you improve a prompt, you typically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the new version&lt;/li&gt;
&lt;li&gt;Look at the output&lt;/li&gt;
&lt;li&gt;Think "yeah, this seems better"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But you're comparing against your memory of the old output. Different runs produce different results anyway. How do you know the improvement came from your changes and not just LLM variance?&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;rashomon&lt;/a&gt; is a Claude Code plugin that focuses on one practical question: &lt;strong&gt;"Did my instruction change actually affect the result?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It analyzes your prompt, generates an optimized version, runs both in isolated environments, and compares the actual results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;

&lt;p&gt;I ran this prompt through rashomon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add logging to track function calling usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A reasonable instruction. But vague.&lt;/p&gt;
&lt;h3&gt;
  
  
  What rashomon detected
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vague instructions&lt;/td&gt;
&lt;td&gt;What, where, and why to log are unclear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No output format&lt;/td&gt;
&lt;td&gt;Log structure not specified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing context&lt;/td&gt;
&lt;td&gt;No project architecture information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The optimized prompt
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Context

This is a Slack bot using Google Gemini API with function calling. 
The project uses a shared `logger` utility with structured logging.
Function calling flows through:
1. `GeminiService.executeWithRetry()` - detects function calls
2. `FunctionHandler.handleFunctionCall()` - executes them

## Task

Add logging to track function calling usage for analytics and debugging.

## Requirements

At Function Call Detection (GeminiService):
- Function name(s) detected
- Number of function calls in response

At Function Execution (FunctionHandler):
- Parameters passed (sanitized - exclude sensitive data)
- Execution duration
- Result status (success/failure)

## Output Format

logger.info('Function call detected', {
  functionName: 'executeWithRetry',
  detectedFunctions: ['searchNotionPages'],
  functionCallCount: 1
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Original&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logging Scope&lt;/td&gt;
&lt;td&gt;1 stage (execution only)&lt;/td&gt;
&lt;td&gt;2 stages (detection + execution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parameter Sanitization&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Passwords, tokens, secrets redacted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Files Modified&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original prompt &lt;em&gt;looked&lt;/em&gt; reasonable, but led the agent to log at only one point. The optimized version covered both detection and execution — with security considerations the original didn't address.&lt;/p&gt;
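&lt;p&gt;For the sanitization requirement, the kind of helper the optimized prompt implies might look like this (the key list and function name are illustrative, not part of rashomon's output):&lt;/p&gt;

```typescript
// Redact common secret-bearing keys before logging function-call parameters.
// The key list is illustrative; a real project would tune it.
const SENSITIVE_KEYS = ["password", "token", "secret", "apiKey"];

function sanitizeParams(params: { [key: string]: unknown }): { [key: string]: unknown } {
  const out: { [key: string]: unknown } = {};
  for (const key of Object.keys(params)) {
    // Replace sensitive values, pass everything else through unchanged.
    out[key] = SENSITIVE_KEYS.includes(key) ? "[REDACTED]" : params[key];
  }
  return out;
}
```

&lt;p&gt;The point is that the requirement "sanitized - exclude sensitive data" is concrete enough for the agent to produce something like this; the original prompt never surfaced the concern at all.&lt;/p&gt;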

&lt;p&gt;&lt;strong&gt;Classification: Structural Improvement&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  About Variance
&lt;/h2&gt;

&lt;p&gt;Not every difference is an improvement. rashomon distinguishes between structural gains and mere variance.&lt;/p&gt;

&lt;p&gt;I tried to create a Variance example — a prompt so clear that optimization wouldn't matter. I couldn't. In practice, the same vague prompt sometimes works beautifully, sometimes completely misses the point.&lt;/p&gt;

&lt;p&gt;rashomon just makes that inconsistency visible.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Requires &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
/plugin marketplace add shinpr/rashomon
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;rashomon@rashomon
&lt;span class="c"&gt;# Restart session&lt;/span&gt;
/rashomon Your prompt here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;
        rashomon
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Compare, improve, and verify prompt changes with evidence — not vibes.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/shinpr/rashomon/assets/rashomon-banner.jpg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fshinpr%2Frashomon%2Fassets%2Frashomon-banner.jpg" width="600" alt="Rashomon"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
  &lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/shinpr/rashomon/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d6bc2b26794002c24d023acaab01b6dbb953c57ab9cb80ba5b8aa2f2bd5de99a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c7565" alt="License"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;See what actually changes when you improve your prompts — not just different wording.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why rashomon?&lt;/h2&gt;
&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Inspired by the &lt;em&gt;Rashomon effect&lt;/em&gt; — the idea that the same event can produce different outcomes depending on perspective.
rashomon makes those differences explicit and comparable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Spending too much time on trial-and-error with prompts?&lt;/li&gt;
&lt;li&gt;Read best practices but not sure how they apply to your case?&lt;/li&gt;
&lt;li&gt;Want proof that your changes actually made things better?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;rashomon&lt;/strong&gt; analyzes, improves, and compares prompts—so you can see what &lt;em&gt;actually&lt;/em&gt; changed, and whether it matters.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Who Is This For?&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;rashomon is designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers using Claude Code daily&lt;/li&gt;
&lt;li&gt;Teams iterating on complex prompts (coding, analysis, writing)&lt;/li&gt;
&lt;li&gt;Anyone who wants &lt;strong&gt;evidence&lt;/strong&gt;, not vibes, when improving prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not ideal if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't use git&lt;/li&gt;
&lt;li&gt;You want one-shot prompt rewriting without comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Example&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;/rashomon Write a function to sort an array
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;What You Get&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1. Detected Issues&lt;/strong&gt;&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;- BP-002&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/rashomon" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Putting Everything in AGENTS.md</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 19 Jan 2026 14:13:49 +0000</pubDate>
      <link>https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl</link>
      <guid>https://dev.to/shinpr/stop-putting-everything-in-agentsmd-22bl</guid>
      <description>&lt;p&gt;If you're using Agentic Coding and find yourself explaining the same thing to the LLM over and over, you have a learning externalization problem.&lt;/p&gt;

&lt;p&gt;The fix seems obvious: write it down in AGENTS.md (or CLAUDE.md, depending on your tool) and never explain it again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This article uses "AGENTS.md" as the generic term for root instruction files. Claude Code uses CLAUDE.md, Codex uses AGENTS.md, and other tools have their own conventions. The principles apply regardless of the specific filename.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But here's what actually happens—you keep adding rules, AGENTS.md grows to 200+ lines, and somehow the LLM still ignores half of what you wrote.&lt;/p&gt;

&lt;p&gt;This article is about how to actually make your rules stick: &lt;strong&gt;where&lt;/strong&gt; to write them, &lt;strong&gt;what&lt;/strong&gt; to write, and how to verify they work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;LLMs don't learn across sessions. Every conversation starts fresh. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You explain something once&lt;/li&gt;
&lt;li&gt;It works&lt;/li&gt;
&lt;li&gt;Next session, you explain it again&lt;/li&gt;
&lt;li&gt;And again&lt;/li&gt;
&lt;li&gt;Eventually you get frustrated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to externalize your learnings into rules. But most people do this wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Mistakes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mistake&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;It bloats, becomes noise, important rules get buried&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Put everything in code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM doesn't load them into context unless you explicitly reference the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Don't write it down at all&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You repeat yourself forever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thing is, &lt;strong&gt;where you write a rule determines whether the LLM actually follows it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not all rules belong in the same place. A simple decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When is this rule needed?
│
├─ Always, on every task → AGENTS.md
│
├─ When working on a specific feature → Design Doc
│
├─ When using a specific technology → Rule file (skill)
│
└─ When performing a specific task type → Task guidelines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: "Skills" are modular rule files used in tools like Codex and Claude Code. They allow you to inject context-specific rules only when relevant. If your tool doesn't have this concept, think of them as separate rule files you reference when needed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Task guidelines" refers to rules that apply only during specific operations—like code review, migration, or content generation. Some call these "task rules" or "task-specific constraints."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Full Picture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All tasks&lt;/td&gt;
&lt;td&gt;Always&lt;/td&gt;
&lt;td&gt;Approval flows, stop conditions, project principles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule files (skills)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific technology area&lt;/td&gt;
&lt;td&gt;When using that tech&lt;/td&gt;
&lt;td&gt;Type conventions, error handling patterns, function size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task guidelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific task type&lt;/td&gt;
&lt;td&gt;When doing that task&lt;/td&gt;
&lt;td&gt;Subagent usage rules, review procedures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design docs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific feature&lt;/td&gt;
&lt;td&gt;When developing that feature&lt;/td&gt;
&lt;td&gt;Feature requirements, API specs, security constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific code location&lt;/td&gt;
&lt;td&gt;When modifying that code&lt;/td&gt;
&lt;td&gt;Implementation rationale, gotchas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Key Question
&lt;/h3&gt;

&lt;p&gt;Ask yourself: &lt;strong&gt;"Is this needed on &lt;em&gt;every&lt;/em&gt; task in this project?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; → AGENTS.md&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No&lt;/strong&gt; → Put it closer to where it's needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps AGENTS.md lean (around 100 lines) and ensures task-specific rules don't create noise for unrelated work.&lt;/p&gt;

&lt;p&gt;You don't need to get this perfect from day one. Start with one thing: keep AGENTS.md small. That alone changes a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Write
&lt;/h2&gt;

&lt;p&gt;This is the hard part. Most people write the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Write Root Causes, Not Incidents
&lt;/h3&gt;

&lt;p&gt;When something goes wrong, the instinct is to document the specific incident. But this creates bias—the LLM over-fits to that one case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Bad (specific incident)
"The getUser() function in UserService was missing null check"

✅ Good (root cause / system fix)
"Always null-check return values from external APIs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one only helps if the LLM encounters that exact function again. The second one prevents the entire &lt;em&gt;class&lt;/em&gt; of errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specific Incident vs. Root Cause
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Specific Incident&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Applies to&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;That one location&lt;/td&gt;
&lt;td&gt;All similar cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prevents recurrence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weakly (same bug elsewhere)&lt;/td&gt;
&lt;td&gt;Strongly (operates as principle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bias risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (overfitting)&lt;/td&gt;
&lt;td&gt;Low (generalizable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Finding the Root Cause
&lt;/h3&gt;

&lt;p&gt;When you encounter an issue, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why did this mistake happen?&lt;/strong&gt; (direct cause)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why wasn't it prevented?&lt;/strong&gt; (system gap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where else could this same mistake occur?&lt;/strong&gt; (scope)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct cause: &lt;code&gt;getUser()&lt;/code&gt; was missing null check&lt;/li&gt;
&lt;li&gt;System gap: We trusted external API return values without validation&lt;/li&gt;
&lt;li&gt;Scope: All external API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;Rule to write&lt;/strong&gt;: "Always null-check return values from external APIs"&lt;/p&gt;
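&lt;p&gt;A minimal sketch of that rule applied in TypeScript (the API client here is a hypothetical stand-in, not a real library):&lt;/p&gt;

```typescript
// Hypothetical external API client that may return null for a missing user.
type User = { id: string; name: string };

function fetchUserFromApi(id: string): User | null {
  // Stand-in for a real HTTP call; returns null when the user is absent.
  return id === "known" ? { id, name: "Ada" } : null;
}

// Rule applied: never trust an external return value without a null check.
function getUserName(id: string): string {
  const user = fetchUserFromApi(id);
  if (user === null) {
    return "unknown user"; // explicit fallback instead of a runtime crash
  }
  return user.name;
}
```

&lt;p&gt;The rule generalizes because it targets the system gap (unvalidated external data), not the one function where the bug happened to surface.&lt;/p&gt;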




&lt;h2&gt;
  
  
  How to Verify Rules Work
&lt;/h2&gt;

&lt;p&gt;This is the step most people skip—and it's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle: Fix the System, Then Discard and Retry
&lt;/h3&gt;

&lt;p&gt;When you add or modify a rule in AGENTS.md or a skill file, you need to verify it actually works. The only way to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add/modify the rule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discard&lt;/strong&gt; the current artifact (or stash it in a branch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start a new session&lt;/strong&gt; with the updated rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-run the same task&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; the issue doesn't recur
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Continue with existing artifact after rule change → ❌
Discard and restart with new rules → ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;If you keep the existing artifact and just continue, you're still operating in a context polluted by the old system. The new rule might not get properly applied because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The existing artifact carries biases from before the rule existed&lt;/li&gt;
&lt;li&gt;The LLM might try to "reconcile" the new rule with existing work rather than applying it cleanly&lt;/li&gt;
&lt;li&gt;You can't tell if the rule actually works or if you just manually fixed the symptom&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Verification Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Modified the rule (AGENTS.md / skill file / task guideline)&lt;/li&gt;
&lt;li&gt;[ ] Discarded current artifact (or moved to a branch)&lt;/li&gt;
&lt;li&gt;[ ] Started new session with updated rules&lt;/li&gt;
&lt;li&gt;[ ] Re-ran the same task&lt;/li&gt;
&lt;li&gt;[ ] Confirmed the issue doesn't recur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small changes, you can stash instead of discard. The key is: &lt;strong&gt;test the system in isolation&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Write Rules
&lt;/h2&gt;

&lt;p&gt;Not every issue deserves a rule. Some guidance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Write a Rule?&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You explained the same thing twice&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent the third time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encountered unexpected behavior&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Find root cause first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completed successfully&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Maybe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrospective—any generalizable insights?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Found a serious bug&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prevent recurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Warning Signs You're Over-Documenting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AGENTS.md exceeds &lt;strong&gt;100 lines&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single rule file exceeds &lt;strong&gt;300 lines&lt;/strong&gt; (~1,500 tokens)&lt;/li&gt;
&lt;li&gt;Rules take more than 1 minute to read through&lt;/li&gt;
&lt;li&gt;You find yourself thinking "is this really needed every time?"&lt;/li&gt;
&lt;li&gt;Rules contradict each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see these signs, it's time to prune. &lt;strong&gt;Rule maintenance includes deletion.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Write Rules (Cheat Sheet)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This section is a reference.&lt;/strong&gt; You don't need to read it all now—come back when you're actually writing a rule. The rest of the article stands on its own.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Minimum Viable Length
&lt;/h3&gt;

&lt;p&gt;Context is precious. Same meaning, shorter expression. But don't sacrifice clarity for brevity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Verbose (38 chars)
If an error occurs, you must always log it

✅ Concise (20 chars)
All errors must be logged

❌ Too short (unclear)
Log errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. No Duplication
&lt;/h3&gt;

&lt;p&gt;Same content in multiple places wastes context and creates update drift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;❌ Duplicated
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }

&lt;span class="gh"&gt;# api.md&lt;/span&gt;
Errors use { success: false, error: string } format

✅ Single source
&lt;span class="gh"&gt;# base.md&lt;/span&gt;
Standard error format: { success: false, error: string }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Measurable Criteria
&lt;/h3&gt;

&lt;p&gt;Vague instructions create interpretation variance. Use numbers and specific conditions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Measurable
&lt;span class="p"&gt;-&lt;/span&gt; Functions: max 30 lines
&lt;span class="p"&gt;-&lt;/span&gt; Cyclomatic complexity: max 10
&lt;span class="p"&gt;-&lt;/span&gt; Test coverage: min 80%

❌ Vague
&lt;span class="p"&gt;-&lt;/span&gt; Readable code
&lt;span class="p"&gt;-&lt;/span&gt; Sufficient testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Recommendations Over Prohibitions
&lt;/h3&gt;

&lt;p&gt;Banning things without alternatives leaves the LLM guessing. Show the right way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✅ Recommendation + rationale
【State Management】
Recommended: Zustand or Context API
Reason: Global variables make testing difficult, state tracking complex
Avoid: window.globalState = { ... }

❌ Prohibition list
&lt;span class="p"&gt;-&lt;/span&gt; Don't use global variables
&lt;span class="p"&gt;-&lt;/span&gt; Don't store values on window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Priority Order
&lt;/h3&gt;

&lt;p&gt;LLMs pay more attention to what comes first. Lead with the most important rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Critical (Must Follow)&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; All APIs require JWT authentication
&lt;span class="p"&gt;2.&lt;/span&gt; Rate limit: 100 requests/minute

&lt;span class="gu"&gt;## Standard Specs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Methods: Follow REST principles
&lt;span class="p"&gt;-&lt;/span&gt; Body: JSON format

&lt;span class="gu"&gt;## Edge Cases (Only When Applicable)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; File uploads may use multipart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Clear Scope Boundaries
&lt;/h3&gt;

&lt;p&gt;State what the rule covers—and what it doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Scope&lt;/span&gt;

&lt;span class="gu"&gt;### Applies To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; REST API endpoints
&lt;span class="p"&gt;-&lt;/span&gt; GraphQL endpoints

&lt;span class="gu"&gt;### Does Not Apply To&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Static file serving
&lt;span class="p"&gt;-&lt;/span&gt; Health checks (/health)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Feedback Loop
&lt;/h2&gt;

&lt;p&gt;This is how it all fits together in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Working with LLM]
       │
       ├─ Issue occurs
       │      │
       │      ▼
       │  Find root cause (not just symptom)
       │      │
       │      ▼
       │  Decide where to write (AGENTS.md? Skill? Task guideline?)
       │      │
       │      ▼
       │  Write the rule
       │      │
       │      ▼
       │  Discard current work
       │      │
       │      ▼
       │  New session with updated rules
       │      │
       │      ▼
       │  Verify issue doesn't recur
       │
       ▼
[Continue working]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
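&lt;p&gt;The "verify issue doesn't recur" step can be automated as a small regression harness. A minimal sketch in TypeScript, under the assumption of a &lt;code&gt;runTask&lt;/code&gt; function (hypothetical, not from any specific tool) that executes a task in a fresh session with the updated rules and returns its raw output:&lt;/p&gt;

```typescript
// Sketch: after updating a rule, re-run the task that originally failed
// and check whether the original problem resurfaces.
// `runTask` is an assumption: any function that executes a task in a
// fresh session (new context, updated rules) and returns raw output.
// A real version would likely be async.
type RegressionCase = {
  rule: string                            // rule that was added or changed
  reproTask: string                       // task that originally triggered the issue
  recurred: (output: string) => boolean   // detects the original problem
}

function checkRules(
  cases: RegressionCase[],
  runTask: (task: string) => string,
): string[] {
  const failing: string[] = []
  for (const c of cases) {
    if (c.recurred(runTask(c.reproTask))) failing.push(c.rule)
  }
  return failing // rules whose original issue came back
}
```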



&lt;p&gt;The goal is to reach a state where &lt;strong&gt;you never explain the same thing twice&lt;/strong&gt;. Every explanation either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets externalized into a rule, or&lt;/li&gt;
&lt;li&gt;Was truly a one-off that doesn't need capturing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Passing Feedback Correctly
&lt;/h2&gt;

&lt;p&gt;One more thing: when you give feedback to the LLM, don't just paste error logs. Include your &lt;em&gt;intent&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Just the error
[Stack trace]

✅ Intent + error
Goal: Redirect to dashboard after user authentication
Issue: Following error occurred
[Stack trace]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the intent, the LLM optimizes for "make the error go away." With the intent, it optimizes for "achieve the goal while resolving this error."&lt;/p&gt;

&lt;p&gt;These are very different things.&lt;/p&gt;
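&lt;p&gt;This is easy to enforce mechanically. A minimal sketch of a helper that refuses to forward a bare stack trace (the function name and message shape are my own, not a required format):&lt;/p&gt;

```typescript
// Sketch: a feedback message that always pairs the error with the goal.
// The message shape is illustrative, not a required format.
function buildFeedback(goal: string, stackTrace: string): string {
  if (goal.trim().length === 0) {
    throw new Error("Refusing to send a bare stack trace: state the goal first")
  }
  return [`Goal: ${goal}`, "Issue: Following error occurred", stackTrace].join("\n")
}
```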




&lt;h2&gt;
  
  
  Anti-Pattern Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Quick reference if you want to check your current practices:&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anti-Pattern&lt;/th&gt;
&lt;th&gt;Reference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Put everything in AGENTS.md&lt;/td&gt;
&lt;td&gt;→ "Where to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write specific incidents instead of root causes&lt;/td&gt;
&lt;td&gt;→ "What to Write"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue with old artifacts after changing rules&lt;/td&gt;
&lt;td&gt;→ "How to Verify Rules Work"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List only prohibitions without recommendations&lt;/td&gt;
&lt;td&gt;→ "How to Write Rules" #4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keep explaining instead of writing it down&lt;/td&gt;
&lt;td&gt;→ "When to Write Rules"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AGENTS.md is not a dumping ground.&lt;/strong&gt; Only rules needed on &lt;em&gt;every&lt;/em&gt; task belong there. Everything else goes closer to where it's used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write root causes, not incidents.&lt;/strong&gt; "Null-check external API returns" beats "UserService.getUser() was missing null check."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test your rules.&lt;/strong&gt; After adding a rule, discard current work and re-run. If the issue recurs, the rule isn't working.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance includes deletion.&lt;/strong&gt; If AGENTS.md is over 100 lines, you've probably over-documented. Prune ruthlessly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explain twice, document once.&lt;/strong&gt; If you're explaining the same thing for a second time, stop and externalize it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you stop expecting rules alone to do the work, the real question becomes how to design the workflow around them. In practice, that starts with &lt;a href="https://dev.to/shinpr/planning-is-the-real-superpower-of-agentic-coding-1imm"&gt;planning—before execution ever begins&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research
&lt;/h2&gt;

&lt;p&gt;The practices in this article are grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SALAM (Wang et al., 2023)&lt;/strong&gt;: LLM self-feedback is often inaccurate. Structured feedback from external agents (or externalized rules) is more effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LEMA (An et al., 2023)&lt;/strong&gt;: Learning from mistakes (error → explanation → correction) improves LLM reasoning ability—but this requires explicit externalization of what was learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loop for IaC (Palavalli et al., 2024)&lt;/strong&gt;: The gains from each feedback iteration shrink rapidly and soon plateau. This supports the "discard and restart" approach over endless iteration in the same context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflexion (Shinn et al., 2023)&lt;/strong&gt;: Combining short-term memory (recent trajectory) with long-term memory (past experience) enables effective self-improvement. Externalized rules function as that long-term memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Wang, D., et al. (2023). "Learning from Mistakes via Cooperative Study Assistant for Large Language Models." arXiv:2305.13829&lt;/li&gt;
&lt;li&gt;An, S., et al. (2023). "Learning From Mistakes Makes LLM Better Reasoner." arXiv:2310.20689&lt;/li&gt;
&lt;li&gt;Palavalli, M. A., et al. (2024). "Using a Feedback Loop for LLM-based Infrastructure as Code Generation." arXiv:2411.19043&lt;/li&gt;
&lt;li&gt;Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why LLMs Are Bad at "First Try" and Great at Verification</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:46:19 +0000</pubDate>
      <link>https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</link>
      <guid>https://dev.to/shinpr/why-llms-are-bad-at-first-try-and-great-at-verification-4kcf</guid>
      <description>&lt;p&gt;I used to spend hours crafting the perfect prompt.&lt;br&gt;
Detailed instructions, examples, constraints—the works.&lt;/p&gt;

&lt;p&gt;And the AI would still add random features I never asked for.&lt;br&gt;
Or refactor code that was perfectly fine.&lt;br&gt;
Or skip steps it decided were "unnecessary."&lt;/p&gt;

&lt;p&gt;Eventually it clicked: I was fighting a losing battle.&lt;br&gt;
So I stopped trying to control generation and started focusing on verification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Patterns You've Probably Seen
&lt;/h2&gt;

&lt;p&gt;Before diving into the why, here are the common anti-patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Giant Prompt Syndrome&lt;/strong&gt;: Cramming requirements, design, implementation, and improvement into a single prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence in Abstract Instructions&lt;/strong&gt;: Expecting "think carefully" or "be thorough" to actually improve quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Invisible Loop&lt;/strong&gt;: Thinking you're iterating when you're actually spinning in circles within the same biased context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Bloat&lt;/strong&gt;: Adding "just in case" information until the actually important instructions get buried&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these sound familiar, you're in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;The claim is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs perform better at "verify and improve existing artifacts" than at "controlled first-time generation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of trying to get the perfect output on the first attempt, you get better results by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Having the LLM produce &lt;em&gt;something&lt;/em&gt; first&lt;/li&gt;
&lt;li&gt;Then having it verify and improve that output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is grounded in how LLMs actually process information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Verification Works Better
&lt;/h2&gt;

&lt;p&gt;At first, I assumed better prompts would lead to better first-shot output. But after enough failures, the pattern became clear: there are three interconnected reasons why LLMs become "smarter" when they have something to work with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. External Feedback Changes the Task
&lt;/h3&gt;

&lt;p&gt;When an artifact exists, the task fundamentally transforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without artifact&lt;/strong&gt;: "Generate something good" (vague, open-ended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With artifact&lt;/strong&gt;: "Identify what's wrong with this and fix it" (specific, bounded)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second task has clearer success criteria. The LLM isn't guessing what "good" means—it can evaluate concrete issues against concrete output.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Position Bias (Lost in the Middle)
&lt;/h3&gt;

&lt;p&gt;Research has shown that LLMs exhibit a U-shaped attention pattern: they prioritize information at the &lt;strong&gt;beginning&lt;/strong&gt; and &lt;strong&gt;end&lt;/strong&gt; of their context window, while information in the &lt;strong&gt;middle&lt;/strong&gt; tends to get overlooked.&lt;/p&gt;

&lt;p&gt;When you feed an artifact as input to a new session, it naturally occupies a prominent position in the context. The LLM is literally forced to pay attention to it.&lt;/p&gt;

&lt;p&gt;This also explains why that really important instruction you buried in paragraph 5 of your mega-prompt keeps getting ignored.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Task Clarity Drives Performance
&lt;/h3&gt;

&lt;p&gt;"Improve this code" is a more concrete task than "write good code."&lt;/p&gt;

&lt;p&gt;The presence of an artifact provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A specific target for evaluation&lt;/li&gt;
&lt;li&gt;Clear boundaries for the scope of work&lt;/li&gt;
&lt;li&gt;Implicit success criteria (this one matters more than you'd think—"better than before" is much easier to verify than "good enough")&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Externality Spectrum
&lt;/h2&gt;

&lt;p&gt;What made the biggest difference for me was reviewing in a completely separate context.&lt;br&gt;
Once I stopped letting the generator review its own work, the blind spots became obvious.&lt;/p&gt;

&lt;p&gt;Not all feedback loops are created equal; approaches vary widely in effectiveness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5rubl6dcai7hiugd9h9.png" alt="Verification methods effectiveness spectrum from external signals to self-introspection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The thing is, looping within the same session is fundamentally &lt;em&gt;internal&lt;/em&gt; feedback. The LLM is still operating within its original generation biases. Only by separating context do you get true "external" perspective.&lt;/p&gt;

&lt;p&gt;In short: if the context doesn't change, neither does the model's perspective.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Implications
&lt;/h2&gt;

&lt;p&gt;So what do you actually do with this?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Artifact-First Workflow
&lt;/h3&gt;

&lt;p&gt;Stop trying to get everything right in one shot. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate Phase&lt;/strong&gt;: Get &lt;em&gt;something&lt;/em&gt; out, even if imperfect. Don't over-specify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Feedback&lt;/strong&gt;: Run the code, execute tests, use linters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Phase&lt;/strong&gt; (new session): Feed the artifact + feedback to a fresh context
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Generation Session]
    │
    ├── Input: Requirements, constraints
    ├── Output: Artifact (code, design, etc.) + brief intent summary (1-3 lines)
    │
    ▼
[External Feedback]
    │
    ├── Code execution
    ├── Test execution
    ├── Linter/static analysis
    │
    ▼
[Verification Session]  ← Fresh context
    │
    ├── Input: Artifact + intent summary + feedback results
    ├── Output: Improved artifact
    │
    ▼
[Repeat or Complete]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
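&lt;p&gt;The "External Feedback" stage in the diagram above is plain automation. A sketch using Node's &lt;code&gt;child_process&lt;/code&gt;; the check list is a placeholder for whatever your project actually runs (tests, linters, type checks):&lt;/p&gt;

```typescript
import { spawnSync } from "node:child_process"

// Sketch: run external checks and collect machine-readable results for the
// verification session. The command list is project-specific.
type CheckResult = { name: string; ok: boolean; output: string }

function collectFeedback(
  checks: { name: string; cmd: string; args: string[] }[],
): CheckResult[] {
  return checks.map(({ name, cmd, args }) => {
    const r = spawnSync(cmd, args, { encoding: "utf8" })
    return {
      name,
      ok: r.status === 0,
      // keep stderr too: test runners and linters often report there
      output: `${r.stdout ?? ""}${r.stderr ?? ""}`.trim(),
    }
  })
}
```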



&lt;h3&gt;
  
  
  2. Know When to Separate Sessions
&lt;/h3&gt;

&lt;p&gt;Session separation isn't always necessary. Use your judgment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small, localized fixes (typos, formatting)&lt;/td&gt;
&lt;td&gt;Same session is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clear error fixes (with stack trace)&lt;/td&gt;
&lt;td&gt;Same session works—external feedback (error log) exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design changes, architecture revisions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality improvements, refactoring&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction changes, requirement pivots&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate sessions&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb&lt;/strong&gt;: If you're feeling "something isn't working," that's often a sign to start a fresh session. Your intuition about context pollution is usually right.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What to Pass Between Sessions
&lt;/h3&gt;

&lt;p&gt;Not everything from the generation phase should go to the verification phase.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Should Pass?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full Chain-of-Thought log&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Verbose, becomes noise. Important info gets lost (position bias)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent summary (1-3 lines)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Preserves the "why" compactly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final decision + rationale&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Useful for debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rejected alternatives&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;Only when specifically relevant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The principle: &lt;strong&gt;Pass the "why," not the "how I thought about it."&lt;/strong&gt;&lt;/p&gt;
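&lt;p&gt;That principle can be enforced in code. A sketch of a handoff builder; the 1-3 line cap follows the table above, while the field names are my own:&lt;/p&gt;

```typescript
// Sketch: the payload that crosses the session boundary. Passes the "why"
// (intent, decision) and structurally has no room for the thought log.
type Handoff = {
  artifact: string   // the code/design being verified
  intent: string     // 1-3 lines: what this was trying to achieve
  decision: string   // final decision + rationale
}

function buildHandoff(artifact: string, intent: string, decision: string): Handoff {
  const lines = intent.split("\n").filter((l) => l.trim().length > 0)
  if (lines.length === 0 || lines.length > 3) {
    throw new Error("Intent summary must be 1-3 lines: summarize, don't narrate")
  }
  return { artifact, intent: lines.join("\n"), decision }
}
```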




&lt;h2&gt;
  
  
  Designing Your AGENTS.md
&lt;/h2&gt;

&lt;p&gt;You don't need to redesign your AGENTS.md all at once, but position bias has direct implications for how you structure it (or whatever root instruction file you use—CLAUDE.md, cursorrules, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Position Bias Problem
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context Window Position → Attention Weight

[AGENTS.md]            ← Start: HIGH attention
       ↓
[Middle instructions]  ← Middle: LOW attention (Lost in the Middle)
       ↓
[User prompt]          ← End: HIGH attention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your AGENTS.md is bloated, the truly important principles get diluted. Adding more "just in case" content actually makes everything weaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core principles only. ~100 lines. What must be followed on &lt;em&gt;every&lt;/em&gt; task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task-specific info&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inject via skills, command arguments, or reference files &lt;em&gt;when needed&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Why separate?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context separation lets you compose optimal information for each task&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What Belongs in AGENTS.md
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project purpose and domain&lt;/li&gt;
&lt;li&gt;Non-negotiable constraints (security, naming conventions)&lt;/li&gt;
&lt;li&gt;Tech stack overview&lt;/li&gt;
&lt;li&gt;Communication style&lt;/li&gt;
&lt;li&gt;Error handling behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Doesn't Belong
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Individual feature specs&lt;/li&gt;
&lt;li&gt;API details&lt;/li&gt;
&lt;li&gt;Task-specific workflows&lt;/li&gt;
&lt;li&gt;Long code examples&lt;/li&gt;
&lt;li&gt;"Nice to have" information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt;: Ask "Is this needed for &lt;em&gt;every&lt;/em&gt; task?" If no, it belongs elsewhere.&lt;/p&gt;
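&lt;p&gt;The size side of this is easy to guard in CI. A sketch; the 100-line budget mirrors the guideline in the table above and is obviously tunable:&lt;/p&gt;

```typescript
// Sketch: warn when the root instruction file outgrows its attention budget.
// The 100-line default mirrors the guideline above; adjust to taste.
function checkInstructionBudget(
  content: string,
  maxLines = 100,
): { lines: number; withinBudget: boolean } {
  // count only non-empty lines; blank separators shouldn't eat the budget
  const lines = content.split("\n").filter((l) => l.trim().length > 0).length
  return { lines, withinBudget: lines <= maxLines }
}
```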




&lt;h2&gt;
  
  
  The Human Role
&lt;/h2&gt;

&lt;p&gt;In Agentic Coding, you're not "using an LLM"—you're &lt;strong&gt;designing a system&lt;/strong&gt; where an LLM operates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Responsibilities
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Concrete Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design external feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decide which tests to run, which linters to use, what "success" means&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Determine session boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judge when to cut context, what carries over&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Define quality gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate automated checks from human review needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintain AGENTS.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep core principles tight, prevent bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Articulate intent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create or validate the "intent summary" that passes between sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Automation vs. Human Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Good for automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code execution, test execution&lt;/li&gt;
&lt;li&gt;Linters, formatters&lt;/li&gt;
&lt;li&gt;Type checking&lt;/li&gt;
&lt;li&gt;Security scans&lt;/li&gt;
&lt;li&gt;Applying formulaic fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requires human review:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design decision validity&lt;/li&gt;
&lt;li&gt;Requirement alignment&lt;/li&gt;
&lt;li&gt;Session boundary judgment&lt;/li&gt;
&lt;li&gt;Trade-off decisions&lt;/li&gt;
&lt;li&gt;Validating the "why"&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Framework: Context Separation at Every Level
&lt;/h2&gt;

&lt;p&gt;It took me a while to realize this wasn't about writing better prompts—it was about where I drew the boundaries.&lt;/p&gt;

&lt;p&gt;You don't need to apply all of this rigidly. But when something feels off, one of these levels is usually the culprit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0ou7eep55r3h8b5dko.png" alt="Four levels of Context Separation Principle" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Research Behind This
&lt;/h2&gt;

&lt;p&gt;These aren't just opinions—they're grounded in LLM research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Refine (Madaan et al., 2023)&lt;/strong&gt;&lt;br&gt;
The generate → feedback → refine loop shows approximately 20% improvement over single-shot generation. Key insight: the improvement comes from the structured iteration, not from the model "trying harder."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lost in the Middle (Liu et al., 2023)&lt;/strong&gt;&lt;br&gt;
LLMs show U-shaped attention bias, heavily weighting the beginning and end of context while underweighting the middle. This explains why your carefully crafted instructions in paragraph 5 keep getting ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs Cannot Self-Correct Reasoning Yet (Huang et al., 2023)&lt;/strong&gt;&lt;br&gt;
Without external feedback, self-correction doesn't work—and can actually make things worse. "Review your work" as an instruction has minimal effect; external signals (test failures, linter errors) are what drive actual improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't optimize for first-shot perfection&lt;/strong&gt;. Get something out, then improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session separation is real&lt;/strong&gt;. The same context that generated the artifact will struggle to objectively improve it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;External feedback is non-negotiable&lt;/strong&gt;. Tests, linters, execution results—these are what drive quality, not "think harder" prompts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep AGENTS.md lean&lt;/strong&gt;. Position bias means bloat actively hurts. If it's not needed for every task, move it out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pass intent, not process&lt;/strong&gt;. Between sessions, transfer the "why" in 1-3 lines, not the full thought log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're a system designer&lt;/strong&gt;. Your job isn't to use the LLM—it's to design the workflow, feedback loops, and context boundaries that let it perform.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This article focused on &lt;em&gt;why&lt;/em&gt; verification-oriented workflows outperform first-shot generation. In future articles, I'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to structure work plans&lt;/strong&gt; that turn execution into verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where to put rules&lt;/strong&gt; so they actually get followed (hint: not all in AGENTS.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been struggling with inconsistent LLM output or finding that your detailed prompts underperform simpler ones, try restructuring around verification. The difference is often dramatic.&lt;/p&gt;

&lt;p&gt;What's your experience been? Did switching to a verification-first approach change anything for you?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651&lt;/li&gt;
&lt;li&gt;Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172&lt;/li&gt;
&lt;li&gt;Huang, J., et al. (2023). "Large Language Models Cannot Self-Correct Reasoning Yet." ICLR 2024. arXiv:2310.01798&lt;/li&gt;
&lt;li&gt;Hsieh, C.-Y., et al. (2024). "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization." ACL 2024 Findings. arXiv:2406.16008&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Building a Local RAG for Agentic Coding: From Fixed Chunks to Semantic Search with Keyword Boost</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:12:57 +0000</pubDate>
      <link>https://dev.to/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</link>
      <guid>https://dev.to/shinpr/building-a-local-rag-for-agentic-coding-from-fixed-chunks-to-semantic-search-with-keyword-boost-15m8</guid>
      <description>&lt;p&gt;Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Context: RAG for Agentic Coding&lt;/li&gt;
&lt;li&gt;The Invisible Problem: What Does the LLM Actually Receive?&lt;/li&gt;
&lt;li&gt;Semantic Chunking: Why Fixed Chunks Break Down&lt;/li&gt;
&lt;li&gt;When Semantic Chunks Broke Hybrid Search&lt;/li&gt;
&lt;li&gt;Results: What Actually Changed&lt;/li&gt;
&lt;li&gt;Architecture Summary&lt;/li&gt;
&lt;li&gt;The Other Side: Query Quality&lt;/li&gt;
&lt;li&gt;Tradeoffs and Limitations&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Context: RAG for Agentic Coding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem statement
&lt;/h3&gt;

&lt;p&gt;The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.&lt;/p&gt;

&lt;p&gt;The constraints made it interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal use&lt;/strong&gt; → No external APIs, privacy matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP ecosystem&lt;/strong&gt; → Integration with Cursor, Claude Code, Codex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Agentic Coding support"&lt;/strong&gt; as the use case&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Initial implementation
&lt;/h3&gt;

&lt;p&gt;The first version was textbook RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via &lt;a href="https://huggingface.co/docs/transformers.js" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt;. &lt;a href="https://lancedb.com/" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; for vector storage—file-based, no server process required.&lt;/p&gt;

&lt;p&gt;It worked... sort of.&lt;/p&gt;
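&lt;p&gt;For reference, that textbook chunking fits in a few lines. A sketch; the &lt;code&gt;overlap&lt;/code&gt; parameter is a common variant, not something the original setup is stated to have used:&lt;/p&gt;

```typescript
// Sketch of textbook fixed-size chunking: slice every `size` characters.
// `overlap` (an assumption, common in practice) makes text cut at a
// boundary appear in both neighboring chunks.
function chunkFixed(text: string, size = 500, overlap = 0): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size")
  }
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size))
  }
  return chunks
}
```

&lt;p&gt;The simplicity is the appeal and the problem: the slicing knows nothing about sentences or sections, which is exactly what breaks down later.&lt;/p&gt;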

&lt;h2&gt;
  
  
  2. The Invisible Problem: What Does the LLM Actually Receive?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discovery
&lt;/h3&gt;

&lt;p&gt;Here's the thing about MCP: search results go directly to the LLM. The user never sees them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
               ↑
         Results hidden from user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.&lt;/p&gt;

&lt;p&gt;To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."&lt;/p&gt;

&lt;p&gt;What I found: &lt;strong&gt;lots of irrelevant chunks polluting the context.&lt;/strong&gt; Page markers, decoration lines, fragments cut mid-sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why top-K fails
&lt;/h3&gt;

&lt;p&gt;The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increasing K just adds more noise&lt;/li&gt;
&lt;li&gt;No quality signal—just "top 10 closest vectors"&lt;/li&gt;
&lt;li&gt;A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First fix: Quality filtering
&lt;/h3&gt;

&lt;p&gt;Three mechanisms, each addressing a different problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Distance-based threshold (&lt;code&gt;RAG_MAX_DISTANCE&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.&lt;/p&gt;
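&lt;p&gt;A minimal sketch of what the threshold does (illustrative TypeScript, not the actual LanceDB query; names like &lt;code&gt;filterByMaxDistance&lt;/code&gt; are made up for the example):&lt;/p&gt;

```typescript
// Illustrative sketch of RAG_MAX_DISTANCE filtering.
interface Hit { text: string; distance: number }

function filterByMaxDistance(hits: Hit[], maxDistance?: number): Hit[] {
  if (maxDistance === undefined) return hits // filter disabled
  return hits.filter((h) => h.distance <= maxDistance)
}

const hits: Hit[] = [
  { text: 'relevant', distance: 0.2 },
  { text: 'marginal', distance: 0.45 },
  { text: 'garbage', distance: 0.9 },
]

// With RAG_MAX_DISTANCE=0.5, the 0.9 chunk never reaches the LLM.
console.log(filterByMaxDistance(hits, 0.5).length) // 2
// If nothing is close enough, we return nothing instead of noise.
console.log(filterByMaxDistance(hits, 0.1).length) // 0
```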

&lt;p&gt;&lt;strong&gt;2. Relevance gap grouping (&lt;code&gt;RAG_GROUPING&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of arbitrary K, detect natural "quality groups" in the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Calculate statistical threshold: mean + 1.5 * std&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;GROUPING_BOUNDARY_STD_MULTIPLIER&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;std&lt;/span&gt;

&lt;span class="c1"&gt;// Find significant gaps (group boundaries)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boundaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;gaps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// 'similar' mode: first group only&lt;/span&gt;
&lt;span class="c1"&gt;// 'related' mode: top 2 groups&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.&lt;/p&gt;
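&lt;p&gt;The gap detection can be sketched end to end like this (illustrative only; the function name is made up, and the real implementation lives in &lt;code&gt;src/vectordb/index.ts&lt;/code&gt;):&lt;/p&gt;

```typescript
// Sketch of relevance-gap grouping: find a statistically large gap between
// consecutive distances and keep only the first group ('similar' mode).
// The multiplier mirrors GROUPING_BOUNDARY_STD_MULTIPLIER from the post.
const STD_MULTIPLIER = 1.5

function firstGroup(distances: number[]): number[] {
  const sorted = [...distances].sort((a, b) => a - b)
  const gaps = sorted.slice(1).map((d, i) => d - sorted[i])
  if (gaps.length === 0) return sorted
  const mean = gaps.reduce((s, g) => s + g, 0) / gaps.length
  const std = Math.sqrt(gaps.reduce((s, g) => s + (g - mean) ** 2, 0) / gaps.length)
  const threshold = mean + STD_MULTIPLIER * std
  const boundary = gaps.findIndex((g) => g > threshold)
  // No significant gap: everything is one group
  return boundary === -1 ? sorted : sorted.slice(0, boundary + 1)
}

// A clear jump after 0.22 separates "highly relevant" from "somewhat related".
console.log(firstGroup([0.1, 0.15, 0.22, 0.6, 0.65])) // [0.1, 0.15, 0.22]
```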

&lt;p&gt;&lt;strong&gt;3. Garbage chunk removal&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isGarbageChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Decoration line patterns (----, ====, ****, etc.)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;[\-&lt;/span&gt;&lt;span class="sr"&gt;=_.*#|~`@!%^&amp;amp;*()&lt;/span&gt;&lt;span class="se"&gt;\[\]&lt;/span&gt;&lt;span class="sr"&gt;{}&lt;/span&gt;&lt;span class="se"&gt;\\/&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&amp;gt;:+&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;+$/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;// Excessive repetition of single character (&amp;gt;80%)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;charCounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxCount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Page markers, separator lines, repeated characters—filter them before they ever reach the index.&lt;/p&gt;

&lt;h3&gt;
  
  
  New problem emerged
&lt;/h3&gt;

&lt;p&gt;Technical terms like &lt;code&gt;useEffect&lt;/code&gt; or &lt;code&gt;ERR_CONNECTION_REFUSED&lt;/code&gt; were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.&lt;/p&gt;

&lt;p&gt;The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Semantic Chunking: Why Fixed Chunks Break Down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trigger
&lt;/h3&gt;

&lt;p&gt;I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.&lt;/p&gt;

&lt;p&gt;Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The waste
&lt;/h3&gt;

&lt;p&gt;If a chunk doesn't contain enough meaning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM makes additional tool calls to compensate&lt;/li&gt;
&lt;li&gt;Context gets polluted with redundant searches&lt;/li&gt;
&lt;li&gt;Latency increases&lt;/li&gt;
&lt;li&gt;Tokens get wasted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM was doing work that good chunking should prevent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Max-Min Algorithm
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://link.springer.com/article/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;Max-Min semantic chunking paper&lt;/a&gt; (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max–Min idea, not a faithful reproduction of the paper's algorithm.&lt;/p&gt;

&lt;p&gt;The core idea: group consecutive sentences based on semantic similarity, not character count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;

&lt;span class="c1"&gt;// Should we add this sentence to the current chunk?&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;shouldAddToChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;maxSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;maxSim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Dynamic threshold based on chunk coherence&lt;/span&gt;
&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;calculateThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;minSim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// threshold = max(c * minSim * sigmoid(|C|), hardThreshold)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;minSim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hardThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split text into sentences&lt;/li&gt;
&lt;li&gt;Generate embeddings for all sentences&lt;/li&gt;
&lt;li&gt;For each sentence, decide: add to current chunk or start new?&lt;/li&gt;
&lt;li&gt;Base the decision on the new sentence's maximum similarity to the chunk versus the minimum similarity among sentences already in the chunk&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the new sentence's similarity drops below the threshold, it signals a topic boundary.&lt;/p&gt;
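&lt;p&gt;Put together, the loop looks roughly like this toy sketch over precomputed embeddings (cosine similarity on 2-D vectors; the hyperparameters &lt;code&gt;c&lt;/code&gt; and &lt;code&gt;hard&lt;/code&gt; are illustrative, not the real defaults):&lt;/p&gt;

```typescript
// Toy Max-Min grouping loop: extend the current chunk while the new
// sentence is similar enough, relative to the chunk's own coherence.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0)
  return dot / (Math.hypot(...a) * Math.hypot(...b))
}

function chunkBySimilarity(embeddings: number[][], c = 0.9, hard = 0.3): number[][] {
  const chunks: number[][] = [] // each chunk = sentence indices
  let current = [0]
  for (let i = 1; i < embeddings.length; i++) {
    // Max similarity between the candidate sentence and the current chunk
    const maxSim = Math.max(...current.map((j) => cosine(embeddings[j], embeddings[i])))
    // Min similarity *within* the chunk gauges its internal coherence
    const within = current.flatMap((a, x) =>
      current.slice(x + 1).map((b) => cosine(embeddings[a], embeddings[b])))
    const minSim = within.length ? Math.min(...within) : 1
    const sigmoid = 1 / (1 + Math.exp(-current.length))
    const threshold = Math.max(c * minSim * sigmoid, hard)
    if (maxSim > threshold) current.push(i)      // same topic: extend chunk
    else { chunks.push(current); current = [i] } // topic boundary: new chunk
  }
  chunks.push(current)
  return chunks
}

// Two sentences on one topic, then an orthogonal one -> 2 chunks.
console.log(chunkBySimilarity([[1, 0], [0.95, 0.1], [0, 1]])) // [[0, 1], [2]]
```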

&lt;h3&gt;
  
  
  Implementation details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sentence detection: &lt;code&gt;Intl.Segmenter&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Segmenter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;und&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;granularity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sentence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No external dependencies. Multilingual support via Unicode standard (UAX #29). The &lt;code&gt;'und'&lt;/code&gt; (undetermined) locale provides general Unicode support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code block preservation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/sentence-splitter.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CODE_BLOCK_PLACEHOLDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000CODE_BLOCK&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;u0000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// Extract before sentence splitting&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;codeBlockRegex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/g
// ... replace with placeholders ...

// Restore after chunking
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/chunker/semantic-chunker.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WINDOW_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;// Compare only recent 5 sentences: O(k²) → O(25)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_SENTENCES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;   &lt;span class="c1"&gt;// Force split at 15 sentences (3x paper's median)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PDF parsing: pdfjs-dist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Switched from &lt;code&gt;pdf-parse&lt;/code&gt; to &lt;code&gt;pdfjs-dist&lt;/code&gt; for access to position information (x, y coordinates, font size). This enables header/footer detection even for variable content like "Page 7 of 75", which pdf-parse would pass through as regular body text.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. When Semantic Chunks Broke Hybrid Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.&lt;/p&gt;

&lt;p&gt;The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempted: RRF (Reciprocal Rank Fusion)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;RRF&lt;/a&gt; is the standard approach for merging BM25 and vector results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score = Σ 1/(k + rank_i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine rankings by position, not by score. Elegant, widely used, no tuning required.&lt;/p&gt;

&lt;p&gt;But there's a fundamental problem: &lt;strong&gt;distance information is lost.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original distances: 0.1, 0.2, 0.9  →  Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18  →  Ranks: 1, 2, 3
# Same ranks, completely different quality gaps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.&lt;/p&gt;

&lt;p&gt;As noted in &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;Microsoft's hybrid search documentation&lt;/a&gt;: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."&lt;/p&gt;
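&lt;p&gt;The loss is easy to see in a minimal fusion sketch (illustrative; &lt;code&gt;rrfFuse&lt;/code&gt; is a made-up name, and &lt;code&gt;k = 60&lt;/code&gt; is the conventional constant):&lt;/p&gt;

```typescript
// Reciprocal Rank Fusion over two ranked lists. Note the inputs are
// ranks only: the original vector distances never enter the formula.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      // rank is 1-based: i + 1
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1))
    })
  }
  return scores
}

const vectorRanking = ['A', 'B', 'C'] // distances could be 0.1, 0.2, 0.9...
const bm25Ranking = ['B', 'C', 'A']
const fused = rrfFuse([vectorRanking, bm25Ranking])
// ...or 0.1, 0.15, 0.18 — the fused scores come out identical either way.
console.log([...fused.entries()].sort((a, b) => b[1] - a[1]))
```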

&lt;h3&gt;
  
  
  Solution: Semantic-first with keyword boost
&lt;/h3&gt;

&lt;p&gt;Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Multiplicative boost: distance / (1 + keyword_score * weight)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;boostedDistance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;keywordScore&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No keyword match&lt;/strong&gt; (score=0): &lt;code&gt;distance / 1 = distance&lt;/code&gt; (unchanged)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=0.6: &lt;code&gt;distance / 1.6&lt;/code&gt; (reduced by 37.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perfect match&lt;/strong&gt; with weight=1.0: &lt;code&gt;distance / 2&lt;/code&gt; (halved)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This preserves the distance for quality filtering while boosting exact matches.&lt;/p&gt;
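&lt;p&gt;The three cases above, as a runnable sketch of the boost formula:&lt;/p&gt;

```typescript
// Multiplicative boost: keyword evidence shrinks the distance but never
// erases it, so the distance filter and grouping still have a real signal.
function boostDistance(distance: number, keywordScore: number, weight: number): number {
  return distance / (1 + keywordScore * weight)
}

console.log(boostDistance(0.4, 0, 0.6)) // 0.4  (no match: unchanged)
console.log(boostDistance(0.4, 1, 0.6)) // 0.25 (perfect match, weight 0.6: 0.4 / 1.6)
console.log(boostDistance(0.4, 1, 1.0)) // 0.2  (perfect match, weight 1.0: halved)
```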

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Vector search with 2x candidate pool&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;candidateLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;HYBRID_SEARCH_CANDIDATE_MULTIPLIER&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Apply distance filter&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distanceRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxDistance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Apply grouping&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyGrouping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grouping&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Keyword boost via FTS&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;applyKeywordBoost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ftsResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hybridWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual challenge
&lt;/h3&gt;

&lt;p&gt;Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.&lt;/p&gt;

&lt;p&gt;Solution: LanceDB FTS with n-gram indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/vectordb/index.ts&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fts&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;baseTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ngram&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ngramMinLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Capture Japanese bi-grams (東京, 設計)&lt;/span&gt;
    &lt;span class="na"&gt;ngramMaxLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Balance precision vs index size&lt;/span&gt;
    &lt;span class="na"&gt;prefixOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// All positions for proper CJK support&lt;/span&gt;
    &lt;span class="na"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Preserve exact terms&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.&lt;/p&gt;
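&lt;p&gt;To see why this works for Japanese, here's a toy n-gram generator (illustrative only; LanceDB builds the actual index natively):&lt;/p&gt;

```typescript
// Japanese has no word delimiters, so the FTS index needs character
// substrings rather than whitespace-split "words".
function ngrams(text: string, min = 2, max = 3): string[] {
  const chars = [...text] // iterate code points so CJK is handled correctly
  const out: string[] = []
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      out.push(chars.slice(i, i + n).join(''))
    }
  }
  return out
}

// "東京設計" yields both compounds even though there is no delimiter.
console.log(ngrams('東京設計')) // ['東京', '京設', '設計', '東京設', '京設計']
```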

&lt;h2&gt;
  
  
  5. Results: What Actually Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Observed behavior (real usage)
&lt;/h3&gt;

&lt;p&gt;My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (fixed chunks + top-K):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent couldn't find relevant information on first search&lt;/li&gt;
&lt;li&gt;Multiple search attempts with different query formulations&lt;/li&gt;
&lt;li&gt;Eventually gave up and read rule files directly&lt;/li&gt;
&lt;li&gt;PDFs were too large to read, so that context was effectively lost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (semantic chunks + boost + filtering):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single search usually provides sufficient context&lt;/li&gt;
&lt;li&gt;Additional searches happen for depth, not compensation&lt;/li&gt;
&lt;li&gt;Agent stopped reading files directly—RAG results were trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM evaluation (before/after comparison)
&lt;/h3&gt;

&lt;p&gt;I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but a structured comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries&lt;/li&gt;
&lt;li&gt;Results required additional verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Updated version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No garbage chunks&lt;/li&gt;
&lt;li&gt;8/10 results directly relevant to the query&lt;/li&gt;
&lt;li&gt;2/10 results tangentially related (still useful context)&lt;/li&gt;
&lt;li&gt;Evaluator noted: "Search results alone provide necessary and sufficient information"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.&lt;/p&gt;

&lt;h3&gt;
  
  
  No benchmarks
&lt;/h3&gt;

&lt;p&gt;This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: &lt;strong&gt;the LLM stopped compensating for bad RAG results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB

Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic chunking over fixed&lt;/td&gt;
&lt;td&gt;Meaning-preserving units reduce LLM compensation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword boost over RRF&lt;/td&gt;
&lt;td&gt;Preserves distance for quality filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distance-based grouping&lt;/td&gt;
&lt;td&gt;Quality signal, not arbitrary K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N-gram FTS&lt;/td&gt;
&lt;td&gt;Multilingual support without tokenizer complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local-only&lt;/td&gt;
&lt;td&gt;Privacy, cost, offline capability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Environment variables&lt;/span&gt;
&lt;span class="nv"&gt;RAG_HYBRID_WEIGHT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.6    &lt;span class="c"&gt;# Keyword boost factor (0=semantic, 1=BM25-dominant)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_GROUPING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;related     &lt;span class="c"&gt;# 'similar' (top group) or 'related' (top 2 groups)&lt;/span&gt;
&lt;span class="nv"&gt;RAG_MAX_DISTANCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.5     &lt;span class="c"&gt;# Filter low-relevance results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. The Other Side: Query Quality
&lt;/h2&gt;

&lt;p&gt;RAG accuracy depends on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search quality (what we've discussed)&lt;/li&gt;
&lt;li&gt;Query quality (what the LLM sends)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  MCP's dual invisibility
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → LLM → MCP(RAG) → LLM → Response
         ↑         ↑
     Query hidden  Results hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even perfect RAG fails with bad queries. And users can't see either side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Agent Skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.&lt;/p&gt;

&lt;p&gt;For this RAG, skills teach the LLM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query formulation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Query patterns by intent&lt;/span&gt;
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Score interpretation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Score thresholds&lt;/span&gt;
&amp;lt; 0.3  : Use directly (high confidence)
&lt;span class="p"&gt;0.&lt;/span&gt;3-0.5: Include if mentions same concept/entity
&lt;span class="gt"&gt;&amp;gt; 0.5  : Skip unless no better results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
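&lt;p&gt;Those thresholds amount to a three-band filter. A minimal sketch with hypothetical names (the real skill expresses this as instructions to the LLM, not code; lower distance means more relevant):&lt;/p&gt;

```python
# Hypothetical three-band filter mirroring the thresholds above.
# Each hit is a dict with a "distance" key and an optional
# "same_concept" flag (both names are illustrative).
def filter_hits(hits, strict=0.3, loose=0.5):
    keep = []
    for h in hits:
        d = h["distance"]
        if d >= loose:
            continue  # above 0.5: skip unless nothing better survives
        if d >= strict and not h.get("same_concept"):
            continue  # 0.3-0.5: keep only if same concept/entity
        keep.append(h)  # below 0.3: use directly, high confidence
    return keep
```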



&lt;p&gt;Skills can be installed via the &lt;a href="https://github.com/shinpr/mcp-local-rag#agent-skills" rel="noopener noreferrer"&gt;mcp-local-rag-skills CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This completes the optimization loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG side:&lt;/strong&gt; semantic chunks + distance filters + keyword boost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM side:&lt;/strong&gt; query formulation + result interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both sides matter. Optimizing only one leaves performance on the table.&lt;/p&gt;
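&lt;p&gt;To make the RAG-side "keyword boost" half concrete, here is a minimal sketch of boosting semantic hits that also contain query terms. This is my own illustration with hypothetical names; mcp-local-rag's actual scoring may differ:&lt;/p&gt;

```python
# Hypothetical keyword boost over semantic results: subtract a small
# bonus from the distance for each query term found in the chunk text.
# A chunk that only BM25 would find never enters semantic_hits, so it
# can never be boosted (the tradeoff noted in the next section).
def keyword_boost(semantic_hits, query_terms, bonus=0.05):
    boosted = []
    for h in semantic_hits:
        text = h["text"].lower()
        matches = sum(1 for t in query_terms if t.lower() in text)
        d = max(0.0, h["distance"] - bonus * matches)
        boosted.append({**h, "distance": d})
    boosted.sort(key=lambda x: x["distance"])
    return boosted
```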

&lt;h2&gt;
  
  
  8. Tradeoffs and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What this approach gives up
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25-only hits don't surface&lt;/strong&gt;: Must appear in semantic results first to get boosted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reranker&lt;/strong&gt;: Would improve accuracy but adds complexity/latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No formal benchmarks&lt;/strong&gt;: Qualitative evaluation only&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where heavier approaches win
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RRF + Reranker&lt;/strong&gt;: Broader candidate pool, reranker compensates for RRF's rank-only output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-reranker&lt;/strong&gt;: Best accuracy, but slow and expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Position on the spectrum
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Light &amp;amp; Fast ←————————————————————→ Heavy &amp;amp; Accurate
    semantic-only
        └─ semantic + boost (here)
               └─ RRF + Cross-Encoder
                      └─ RRF + LLM Rerank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal was: &lt;strong&gt;maximum quality within zero-setup, local-only constraints.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases&lt;/li&gt;
&lt;li&gt;Semantic chunking + quality filtering + keyword boost is a viable middle ground&lt;/li&gt;
&lt;li&gt;RRF looks elegant but loses distance information critical for filtering&lt;/li&gt;
&lt;li&gt;Query quality matters as much as search quality—Agent Skills address this&lt;/li&gt;
&lt;li&gt;The real test: does the LLM stop making compensatory tool calls?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/shinpr/mcp-local-rag" rel="noopener noreferrer"&gt;github.com/shinpr/mcp-local-rag&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kiss, C., Nagy, M. &amp;amp; Szilágyi, P. (2025). Max–Min semantic chunking of documents for RAG application. &lt;em&gt;Discover Computing&lt;/em&gt; 28, 117. &lt;a href="https://doi.org/10.1007/s10791-025-09638-7" rel="noopener noreferrer"&gt;https://doi.org/10.1007/s10791-025-09638-7&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LanceDB Full-Text Search: &lt;a href="https://lancedb.github.io/lancedb/fts/" rel="noopener noreferrer"&gt;https://lancedb.github.io/lancedb/fts/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MCP Specification: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Agent Skills: &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;https://agentskills.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Reciprocal Rank Fusion (OpenSearch): &lt;a href="https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/" rel="noopener noreferrer"&gt;https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hybrid Search Scoring (Microsoft): &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>architecture</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How I Made Legacy Code AI-Friendly with Auto-Generated Docs</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:30:09 +0000</pubDate>
      <link>https://dev.to/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</link>
      <guid>https://dev.to/shinpr/how-i-made-legacy-code-ai-friendly-with-auto-generated-docs-4353</guid>
      <description>&lt;p&gt;AI coding assistants are amazing—until you point them at a legacy codebase.&lt;/p&gt;

&lt;p&gt;"What does this module do?"&lt;br&gt;
"I don't have enough context."&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code (and similar tools) hit context limits fast on existing projects. No documentation means no context, which means the AI can't help effectively.&lt;/p&gt;

&lt;p&gt;You could spend weeks writing docs manually. Or you could automate it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Fix: Generate Docs First
&lt;/h2&gt;

&lt;p&gt;Instead of fighting the AI, I ended up building a workflow that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scans your codebase for features&lt;/li&gt;
&lt;li&gt;Generates PRD + Design Docs automatically&lt;/li&gt;
&lt;li&gt;Verifies docs against actual code&lt;/li&gt;
&lt;li&gt;Leaves the AI with the context it needs to work with&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Claude Code&lt;/span&gt;
claude

&lt;span class="c"&gt;# Add the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="c"&gt;# Install the plugin&lt;/span&gt;
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;dev-workflows@claude-code-workflows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then point it at your legacy code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/reverse-engineer &lt;span class="s2"&gt;"src/auth"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happens
&lt;/h2&gt;

&lt;p&gt;The workflow runs through multiple specialized agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;scope-discoverer&lt;/strong&gt; finds what features exist in your code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;prd-creator&lt;/strong&gt; generates product docs for each feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code-verifier&lt;/strong&gt; checks if the docs match reality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;document-reviewer&lt;/strong&gt; catches inconsistencies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step verifies against the actual code—so you get docs that reflect what the system &lt;em&gt;actually does&lt;/em&gt;, not what someone thought it did years ago.&lt;/p&gt;
&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PRD for each feature&lt;/strong&gt; (what it does, why it exists)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design docs&lt;/strong&gt; (how it's built, what depends on what)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now when you ask the AI to modify something, it has context.&lt;/p&gt;
&lt;h2&gt;
  
  
  Before/After
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: "Explain the auth module" → Context limit, vague answers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;: AI reads generated docs → Specific, actionable suggestions&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;Works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You've inherited a codebase with missing docs&lt;/li&gt;
&lt;li&gt;Institutional knowledge has left with previous developers&lt;/li&gt;
&lt;li&gt;You want to onboard AI assistants to existing projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not magic—complex legacy systems still need human review. But it gets you 80% there automatically.&lt;/p&gt;

&lt;p&gt;I built this while trying to make Claude Code usable on projects where no one knows how things work anymore.&lt;/p&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/shinpr" rel="noopener noreferrer"&gt;
        shinpr
      &lt;/a&gt; / &lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;
        claude-code-workflows
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Production-ready development workflows for Claude Code, powered by specialized AI agents.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Claude Code Workflows 🚀&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://claude.ai/code" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/77c3fac949481ce7960e41b57da074d377eb159a42c6cf4694cf225ddcada391/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c61756465253230436f64652d506c7567696e2d707572706c65" alt="Claude Code"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2961c6708a56bf2bca4bb7dcc53a5e30d0a22e67b3bca0725a8d74a2360432cb/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7368696e70722f636c617564652d636f64652d776f726b666c6f77733f7374796c653d736f6369616c" alt="GitHub Stars"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/shinpr/claude-code-workflows/pulls" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/dd0b24c1e6776719edb2c273548a510d6490d8d25269a043dfabbd38419905da/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5052732d77656c636f6d652d627269676874677265656e2e737667" alt="PRs Welcome"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;End-to-end development workflows for Claude Code&lt;/strong&gt; - Specialized agents handle requirements, design, implementation, and quality checks so you get reviewable code, not just generated code.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;⚡ Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;This marketplace includes the following plugins:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core plugins:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows&lt;/strong&gt; - Backend and general-purpose development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dev-workflows-frontend&lt;/strong&gt; - React/TypeScript specialized workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Optional add-ons&lt;/strong&gt; (enhance core plugins):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/claude-code-discover" rel="noopener noreferrer"&gt;claude-code-discover&lt;/a&gt;&lt;/strong&gt; - Turns feature ideas into evidence-backed PRDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/shinpr/metronome" rel="noopener noreferrer"&gt;metronome&lt;/a&gt;&lt;/strong&gt; - Detects shortcut-taking behavior and nudges Claude to proceed step by step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/francismiles1/dev-workflows-governance" rel="noopener noreferrer"&gt;dev-workflows-governance&lt;/a&gt;&lt;/strong&gt; - Enforces TIDY stage and human signoff checkpoint before deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Skills only&lt;/strong&gt; (for users with existing workflows):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev-skills&lt;/strong&gt; - Coding best practices, testing principles, and design guidelines — no workflow recipes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These plugins provide end-to-end workflows for AI-assisted development. Choose what fits your project:&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Backend or General Development&lt;/h3&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 1. Start Claude Code&lt;/span&gt;
claude
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 2. Install the marketplace&lt;/span&gt;
/plugin marketplace add shinpr/claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; 3. Install backend plugin&lt;/span&gt;
/plugin install dev-workflows@claude-code-workflows

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/shinpr/claude-code-workflows" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




</description>
      <category>ai</category>
      <category>productivity</category>
      <category>automation</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Taming Opus 4.5's Efficiency: Using TodoWrite to Keep Claude Code on Track</title>
      <dc:creator>Shinsuke KAGAWA</dc:creator>
      <pubDate>Thu, 11 Dec 2025 12:54:19 +0000</pubDate>
      <link>https://dev.to/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</link>
      <guid>https://dev.to/shinpr/taming-opus-45s-efficiency-using-todowrite-to-keep-claude-code-on-track-1ee5</guid>
      <description>&lt;p&gt;I've been using Claude Code with Opus 4.5 for a while now, and there's one thing that kept driving me crazy: it skips steps. Steps I actually needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happens
&lt;/h2&gt;

&lt;p&gt;According to Anthropic's docs, Opus 4.5 is designed to "skip summaries for efficiency and maintain workflow momentum." Sounds great in theory.&lt;/p&gt;

&lt;p&gt;In practice? You ask for a 5-step process, and it delivers the final result—skipping steps 2, 3, and 4. Efficient? Sure. But not what I needed.&lt;/p&gt;

&lt;p&gt;I ran into this when I was working on a test review task. I wanted Claude to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all test items from the spec&lt;/li&gt;
&lt;li&gt;Evaluate each item against criteria&lt;/li&gt;
&lt;li&gt;Filter down to the essential ones&lt;/li&gt;
&lt;li&gt;Generate the final test plan&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead, it jumped straight to step 4. "Here's your optimized test plan!" Thanks, but I needed to see steps 2 and 3 to understand &lt;em&gt;why&lt;/em&gt; those tests were selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: Make steps explicit with TodoWrite
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📢 Update (March 2026):&lt;/strong&gt; As of Claude Code v2.1.16 (released January 22, 2026), &lt;code&gt;TodoWrite&lt;/code&gt; has been superseded by the new &lt;strong&gt;Tasks API&lt;/strong&gt; — &lt;code&gt;TaskCreate&lt;/code&gt;, &lt;code&gt;TaskUpdate&lt;/code&gt;, &lt;code&gt;TaskList&lt;/code&gt;, and &lt;code&gt;TaskGet&lt;/code&gt;. The concept in this article still applies, but you'll now use &lt;code&gt;TaskCreate&lt;/code&gt; to register steps instead of &lt;code&gt;TodoWrite&lt;/code&gt;. You can revert to the old behavior with the env var &lt;code&gt;CLAUDE_CODE_ENABLE_TASKS=false&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Code has a built-in TODO management feature called &lt;code&gt;TodoWrite&lt;/code&gt;. When you register tasks explicitly, Opus 4.5 treats them as checkpoints it must complete.&lt;/p&gt;

&lt;p&gt;At the start of your task, tell Claude Code to register the steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting, register these steps using TodoWrite:
1. List all test items from the spec
2. Evaluate each against the criteria
3. Filter to essential items with reasoning
4. Generate the final test plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or just add this to your prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use TodoWrite to track each step. Do not skip any steps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, once you've registered steps as TODOs, Opus treats them as real checkpoints—not optional stops it can skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick limitation I learned the hard way
&lt;/h2&gt;

&lt;p&gt;If you register too many steps (7+), Opus 4.5 may batch them together for "efficiency," defeating the purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read file A
2. Read file B
3. Read file C
4. Analyze A
5. Analyze B
6. Analyze C
7. Compare results
8. Generate report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Do this instead:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Read and analyze all relevant files
2. Compare the implementations
3. Generate the report with findings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaningful steps, not micro-tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this saved me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step refactoring where I needed to see intermediate states&lt;/li&gt;
&lt;li&gt;Debugging sessions where I wanted the reasoning at each stage&lt;/li&gt;
&lt;li&gt;Any task where Opus 4.5 kept "helpfully" jumping to the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opus 4.5's efficiency is a feature, not a bug—but sometimes you need the journey, not just the destination. TodoWrite gives you that control back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
