Karun Japhet

Posted on May 19 • Originally published at karun.me

Multi-Agent Development Workflows with Claude Code

#ai #programming #productivity #tutorial

Single-agent Claude Code is pair programming. One developer, one task, full attention.

I've been running three or four agents against a project backlog simultaneously. Not because single-agent broke, but because groomed cards were sitting idle.

Here's what that looks like in practice.

The shift: from writing code to shaping work

When you use Claude Code as a single agent, you're pair programming. That's powerful when you're exploring a problem or designing an approach. But if you have independent cards groomed and ready, you're leaving throughput on the table.

Your role shifts. Instead of writing code alongside one agent, you shape the work before it starts and judge it when it's done. You groom cards, make design decisions, dispatch work, and review output. The agents write the code. Addy Osmani calls this the factory model: you're no longer building software, you're building the factory that builds your software. The spec becomes the primary deliverable, and the harness (task tracking, isolation, quality gates, review) is the factory floor.

Steve Yegge's Gas Town post maps this journey in eight stages, from IDE copilot to building your own orchestrator. I started multi-agent work at stage 6: three or four terminal windows, each running Claude Code on a different card. You realise quickly that you're the bottleneck. The agents can move faster than you can review, approve, and redirect. The answer isn't more attention from you. It's giving the agents more autonomy with safety nets: quality gates that catch problems automatically, structured dispatch so agents find their own work, and a review workflow for when they're done.

This post is my version of stage 8. The tooling is still maturing, and this harness will look different in six months. This is the April 2026 version.

Anthropic's 2026 Agentic Coding Trends Report says multi-agent "doesn't make sense for 95% of agent-assisted development tasks." That's probably true for ad-hoc coding. But if you have a groomed backlog of independent cards, running them in parallel is the logical way to move through that backlog faster.

Two modes, not a progression

These aren't stages you graduate through. They're modes you switch between based on what you're doing right now.

Thinking mode (single agent)

When you're exploring, designing, or working through a single problem. Grooming cards, writing acceptance criteria, debugging something complex. The value is in the conversation, not the throughput.

This is pair programming. Full attention on one thing.

Throughput mode (parallel workers)

When you have multiple cards ready to go. Each worker gets a card, a worktree, and runs independently. You review their output when they're done.

Choose based on card complexity and dependencies:

Sub-agents for small, independent cards (roughly 15-minute tasks):

Quick fixes, config changes, bounded features
Research running in background
Automated code review of completed work
Short-lived: no auto-compaction, so longer tasks can exhaust the context window
Cheaper: minimal context startup, returns summaries only

Agent teams for substantial cards or cards with cross-card dependencies:

Multi-file features that need to read large parts of the codebase
Cards where the agent needs sustained autonomy and may hit context limits
Each teammate is a full Claude Code session with auto-compaction, so they can sustain longer work
More expensive: each teammate loads full project context independently

Agent teams also handle coordination. When cards genuinely depend on each other, teammates can communicate directly via peer-to-peer messaging (SendMessage), shared task lists with dependency tracking, and auto-unblocking. Vertically-sliced stories following the INVEST principle produce fewer cross-card dependencies than horizontal slicing, but they don't eliminate them. Real dependencies exist even in well-groomed backlogs.

Real coordination cases:

One card updates a shared schema; other in-flight cards need to know before they merge
Large refactors that can't be one card, where agents need to agree on new interfaces
Adversarial debugging: competing hypotheses where agents share findings

Agent Teams require CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 (v2.1.32+) and are still experimental. No session resumption, task status can lag, one team per session.

How to choose

Situation	Mode	Why
Exploring, grooming, designing	Thinking	You need the dialogue
One thing that needs full attention	Thinking	Conversation > throughput
Multiple small, bounded cards ready	Throughput (sub-agents)	Fast, cheap, parallel
Multiple substantial cards	Throughput (agent teams)	Full context, sustained autonomy
Cards with cross-card dependencies	Throughput (agent teams)	Agents need to communicate
Research while you work	Throughput (sub-agents)	Background tasks
Review of completed work	Throughput (sub-agents)	Fresh context, separate reviewer

Rate limits are the real parallelism ceiling. They're pooled across all sessions on your account. Opus has the strictest limits. Plan for this when dispatching multiple workers.

The harness: making throughput mode reliable

Dispatching multiple agents is easy. Getting reliable output is hard. The harness (task tracking, isolation, quality gates, review) is what makes multi-agent development repeatable.

Layer 0: the upstream gate

The most important quality gate happens before any code is written.

Careful grooming is what makes the whole pipeline work. Clear description, specific acceptance criteria, explicit non-goals. As Ankit Jain puts it, "the most valuable human judgment is exercised before the first line of code is generated, not after."

I spend more time grooming cards than I do reviewing agent output. That ratio feels right. Groom in your main Claude Code session, use the conversation to think through edge cases, and write precise acceptance criteria. The card is the spec.

Layer 1: task tracking

Agents need to discover available work, claim it atomically, and track what's been tried. A TODO list isn't enough.

I'm using Beads for this. It stores data locally via Dolt, gives agents programmatic access, and handles dependencies between tasks. The key commands:

bd ready lists tasks with no open blockers
bd update <id> --claim atomically claims a task
bd show <id> gets full card details including previous notes and rejection feedback

A /dispatch skill wraps this into a workflow: find available cards via bd ready, present them for selection, claim each one, and spawn a worker per card with worktree isolation.
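The discovery and claim steps can be sketched in a few lines. This is a minimal illustration, not how the /dispatch skill is actually implemented: it assumes the plain output has four columns separated by two or more spaces (ID, priority, title, label), which is inferred from the sample output later in this post and may not match the real format. The Card type and both function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    card_id: str
    priority: str
    title: str
    label: str


def parse_ready(output: str) -> list[Card]:
    """Parse `bd ready --plain`-style rows into cards.

    Assumes columns are separated by two or more spaces; the real
    output format may differ.
    """
    cards = []
    for line in output.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 4 or not fields[0].startswith("bd-"):
            continue  # skip blank lines, headers, or unexpected rows
        cards.append(Card(*fields))
    return cards


def claim_command(card: Card) -> list[str]:
    """Build the claim invocation for one card (not executed here)."""
    return ["bd", "update", card.card_id, "--claim"]


sample = """\
bd-c4a1  P1  Add payment grace period calculation     customer
bd-e2f7  P1  Fix SSN validation accepting 000 prefix  customer
bd-b8d3  P2  Add CSV export to transaction history     reporting
"""

ready = parse_ready(sample)
for card in ready:
    print(card.card_id, card.priority, claim_command(card))
```

In a real dispatcher each command would go through subprocess.run(..., check=True), and a failed claim would mean another worker got there first, so the card is skipped rather than retried.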

For multi-developer setups, a centralized tool (GitHub Issues, Linear) may be more practical. Beads' strength is agent-native programmatic access. See also The Claude Protocol and Metaswarm for existing harness implementations.

Layer 2: isolation

Without worktree isolation, parallel agents write to the same working directory and overwrite each other's changes. With it, each agent gets its own branch and working directory.

A worker agent definition (.claude/agents/worker.md):

---
model: sonnet
isolation: worktree
background: true
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
permissionMode: acceptEdits
---

isolation: worktree gives each worker its own git worktree. background: true means the dispatch doesn't block waiting for workers to finish. model: sonnet keeps costs down for development work (swap to opus for complex cards).

Supporting config:

.worktreeinclude copies gitignored files (like .env) into new worktrees
WorktreeCreate hooks handle dependency installation
Scope each agent via CLAUDE.md to prevent merge conflicts across worktrees

Anthropic's C compiler case study used this pattern with 16 parallel agents. They hit duplicate work and merge conflicts. Tighter scoping and atomic task claiming address both.

Layer 3: quality gates

Two categories: automated (hooks that block the agent) and manual (human judgment during review). I underestimated how large agent-generated diffs get when the card isn't tightly scoped. The diff size guard was an afterthought; it's now one of the more useful gates.

Automated gates (fail-fast pyramid)

Run fastest and cheapest first, most expensive last:

Formatting (PostToolUse on Write/Edit, instant). Auto-fix, not a gate.
Linting / static analysis (seconds). Fast, deterministic.
Type checking (seconds). Catches interface mismatches.
Secret detection (PreToolUse on Edit/Write). Blocks before secrets hit disk.
Unit tests (minutes). The foundation.
Diff size guard (instant). Reject if change exceeds threshold. Prevents comprehension debt.
Automated code review (subagent, 30-90s). Separate agent reviews the diff.
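The diff size guard is the simplest of these to build. Here is a rough sketch, assuming the gate reads git diff --numstat output and blocks by returning exit code 2 with a message on stderr. The 400-line threshold and the function names are placeholders of mine, not values from my harness.

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # hypothetical threshold; tune per project


def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" and carry no line counts.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def diff_size_gate(numstat: str, limit: int = MAX_CHANGED_LINES) -> tuple[int, str]:
    """Return (exit_code, message): 0 to proceed, 2 to block with feedback."""
    total = changed_lines(numstat)
    if total > limit:
        return 2, f"Diff touches {total} lines (limit {limit}). Split the change or trim scope."
    return 0, ""


def main() -> int:
    """Measure the branch against main and report the gate result."""
    numstat = subprocess.run(
        ["git", "diff", "--numstat", "main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    code, message = diff_size_gate(numstat)
    if message:
        print(message, file=sys.stderr)
    return code
```

To wire it into a gate script, end the file with sys.exit(main()). With this threshold, a diff of 847 insertions and 23 deletions would be blocked before a human ever opened it.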

The code review subagent must be a separate agent with its own context window. As Nick Tune writes, "asking the main agent to mark its own homework is obviously not a good approach." Hamilton Greene's 9-agent approach achieves roughly 75% useful suggestions versus less than 50% from single-agent review.

Hook implementation:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "./scripts/detect-secrets.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "npx prettier --write \"$FILE_PATH\"" }]
      }
    ],
    "TaskCompleted": [
      {
        "matcher": "",
        "hooks": [{ "type": "command", "command": "./scripts/quality-gate.sh" }]
      }
    ]
  }
}

Exit code 0 proceeds. Exit code 2 blocks with feedback (the agent gets the stderr message and iterates). Lint, tests, and code review fire on TaskCompleted (runs once when the agent says "done"). Secret detection fires on PreToolUse (blocks before the write). See hooks reference.
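As a rough idea of what a script like detect-secrets.sh could do, here is a Python sketch of the same contract. It assumes the hook receives a JSON event on stdin with a tool_input object whose content (Write) or new_string (Edit) field holds the text about to be written; treat those field names as assumptions and check them against the hooks reference. The two patterns are illustrative only, and a real scanner needs a much broader rule set.

```python
import json
import re
import sys

# Illustrative patterns only; not a complete secret scanner.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def check_event(event: dict) -> tuple[int, str]:
    """Return (exit_code, message): 0 to proceed, 2 to block with feedback."""
    tool_input = event.get("tool_input", {})
    # Assumed field names: Write carries "content", Edit carries "new_string".
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = find_secrets(text)
    if hits:
        return 2, "Blocked: possible secret detected (" + ", ".join(hits) + ")."
    return 0, ""


def main() -> int:
    """Read one hook event from stdin and report the result on stderr."""
    code, message = check_event(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

End the script with sys.exit(main()) to use it as a hook command. The stderr message is what the agent sees, so it is worth saying what was found without echoing the secret itself.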

Manual gates

What automated checks can't catch:

Scope adherence. Did the agent build what the card asked for, or add unrequested features?
Architectural coherence. Does the implementation fit the architecture of the rest of the system, or did the agent invent its own patterns?
Business logic correctness. Models infer patterns statistically, not semantically.
Comprehension check. If you can't understand the diff, it's too large or too novel.

Layer 4: review gates

For trunk-based development without PRs, the worktree branch is the review surface. git diff main...HEAD from the worktree shows what the branch would change on merge, even if main has moved since the branch was cut.

A /review-worktree skill handles this:

Cross-references bd list --label review:pending with git worktree list
Shows commit history and diff summary for the selected worktree
Options: view full diff, view specific file, run tests, run review agent, approve, reject
Approve: merge to main, close card, clean up worktree
Reject: reopen card with feedback comment visible to the next worker via bd show
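The cross-referencing step is mostly string matching. The sketch below assumes the pending card IDs have already been extracted from the bd list output (that parsing is not shown), that worktree branch names embed the card ID, and that IDs look like bd-c4a1. It reads the git worktree list --porcelain format; the function names and the paths in the sample are made up for illustration.

```python
import re

CARD_ID = re.compile(r"bd-[0-9a-z]+")  # ID shape inferred from the examples in this post


def worktree_branches(porcelain: str) -> dict[str, str]:
    """Map branch name -> worktree path from `git worktree list --porcelain`."""
    branches = {}
    path = None
    for line in porcelain.splitlines():
        if line.startswith("worktree "):
            path = line[len("worktree "):]
        elif line.startswith("branch ") and path is not None:
            branch = line[len("branch "):].removeprefix("refs/heads/")
            branches[branch] = path
    return branches


def reviewable(pending_ids: set[str], porcelain: str) -> dict[str, str]:
    """Return card ID -> worktree path for pending cards that have a worktree."""
    matches = {}
    for branch, path in worktree_branches(porcelain).items():
        found = CARD_ID.search(branch)
        if found and found.group() in pending_ids:
            matches[found.group()] = path
    return matches


# Hypothetical porcelain output: the main checkout plus one worker worktree.
porcelain = """\
worktree /repo
HEAD 1111111111111111111111111111111111111111
branch refs/heads/main

worktree /repo/.worktrees/ssn
HEAD 2222222222222222222222222222222222222222
branch refs/heads/work/bd-e2f7-fix-ssn-validation
"""

print(reviewable({"bd-e2f7", "bd-b8d3"}, porcelain))
```

A pending card with no matching worktree (bd-b8d3 here) is worth surfacing too: it usually means the worktree was cleaned up before the card was closed.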

For PR-based teams, the agent creates a PR from the worktree branch. Standard review process.

Layer 5: the feedback loop

When work is rejected, the card reopens with a comment explaining why. On the next dispatch, the worker agent sees the rejection feedback via bd show. It has context on what was tried and why it failed.

When an agent is stuck, the card goes to blocked with a needs-help label and a note explaining what was tried. The human reviews and either re-grooms the card or splits it.

What it looks like in practice

Here's a walkthrough on credit-card-lending using Agent Teams. Three cards groomed and ready. The session starts with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 enabled.

Check the backlog and create the team

> bd ready --plain

bd-c4a1  P1  Add payment grace period calculation     customer
bd-e2f7  P1  Fix SSN validation accepting 000 prefix  customer
bd-b8d3  P2  Add CSV export to transaction history     reporting

> Create a team with three teammates, one for each card. Each teammate should use the worker agent definition and get its own worktree.

Creating team "sprint-work"...

Spawning teammate "grace-period" for bd-c4a1...
  Worktree: work/bd-c4a1-payment-grace-period
  Card claimed.

Spawning teammate "ssn-fix" for bd-e2f7...
  Worktree: work/bd-e2f7-fix-ssn-validation
  Card claimed.

Spawning teammate "csv-export" for bd-b8d3...
  Worktree: work/bd-b8d3-csv-export-transactions
  Card claimed.

Team "sprint-work" running. Ctrl+T to toggle task list. Shift+Down to
cycle between teammates.

Monitor progress

The shared task list shows what each teammate is working on. Ctrl+T toggles it:

Tasks:
  [in_progress] bd-c4a1: Add payment grace period calculation (grace-period)
  [completed]   bd-e2f7: Fix SSN validation accepting 000 prefix (ssn-fix)
  [in_progress] bd-b8d3: Add CSV export to transaction history (csv-export)

While teammates work, I stay in the lead session. Groom next sprint's cards, explore a design problem, whatever needs thinking. Teammates message the lead if they're stuck or need clarification.

Review completed work

ssn-fix and csv-export have finished. I review each worktree diff from the lead session.

> Show me the diff for ssn-fix's worktree

Commits (main..HEAD):
  a3f8c21 Fix SSN validation to reject 000 and 999 prefixes
  e7b2d14 Add test cases for invalid SSN prefixes

Changed files:
  src/.../customer/validation/SsnValidator.java      | 12 ++++++--
  src/.../customer/validation/SsnValidatorTest.java  | 28 ++++++++++++++++

Small, focused fix. Two files, clear test coverage. Merge it.

> Merge ssn-fix's worktree to main

Merging work/bd-e2f7-fix-ssn-validation into main... done
Closing bd-e2f7... done
Removing worktree... done

Now the CSV export:

> Show me the diff for csv-export's worktree

Commits (main..HEAD):
  b1c4e89 Add CSV export endpoint for transaction history
  d5a7f23 Add PDF export endpoint for transaction history
  f9e1b34 Add export format selection dropdown to UI

Changed files:
  12 files changed, 847 insertions(+), 23 deletions(-)

Scope creep. The card said CSV export. The teammate added PDF export and a UI component.

> Reject. Send csv-export a message: "Card asked for CSV export only.
  PDF export and UI dropdown are out of scope. Revert those changes
  and keep only the CSV export."

Message sent to csv-export. Reopening bd-b8d3...

With Agent Teams, the rejection goes directly to the teammate via SendMessage. The teammate receives the feedback, reverts the out-of-scope work, and resubmits. No re-dispatch needed.

This is a common failure mode: agents are eager to build adjacent features. The tighter the acceptance criteria in the card, the less often this happens.

Where it falls apart

Compound reliability

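Say each agent succeeds 95% of the time. Chain five of them so each depends on the previous one's output, and the chain succeeds roughly 77% of the time (0.95^5), assuming failures are independent. Multi-agent trades reliability for parallelism. The benefit must justify the overhead.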
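The arithmetic is easy to check, and it gets worse quickly as chains grow. A small sketch, assuming failures are independent (correlated failures would change the numbers); chain_success is just a name for the calculation:

```python
def chain_success(per_agent: float, agents: int) -> float:
    """Probability that every agent in a sequential chain succeeds,
    assuming independent failures."""
    return per_agent ** agents


for n in (1, 3, 5, 10):
    print(f"{n:>2} agents at 95%: {chain_success(0.95, n):.1%}")
```

Five agents come out at 77.4%, and ten drop to just under 60%. This applies to chained work. Independent cards running in parallel don't multiply this way, which is one more reason to groom for independence.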

Context loss between agents

Every handoff is lossy compression. Google Research found 39-70% degradation in sequential multi-agent tasks. Subagents summarize results back to the caller; teammates don't get the lead's conversation history. Isolation prevents context pollution but loses nuance.

Token cost

Multi-agent consumes 2-5x more tokens for equivalent work. No published harness has budget limits per task. /usage monitoring is the best we have. This is an unsolved problem.

Time blindness

From the C compiler case study: Claude can't tell time and will spend hours running tests instead of making progress. The harness needs to print progress infrequently and offer fast-test options.

Duplicate work

Without task claiming, multiple agents fix the same bug independently and overwrite each other. I've seen this even with bd's --claim, when two cards touch overlapping files. The C compiler case study hit it at scale with 16 agents targeting the same bug.

The 18-month wall

Without quality gates, the pattern is: early velocity (months 1-3), plateau (4-9), decline (10-15), stall (16-18) as comprehension debt accumulates. CodeRabbit's research found AI-generated code produces 1.7x more issues and performance inefficiencies 8x more often than human code. This is why quality gates matter. Without them, the velocity gains are temporary.

The honest tradeoffs

Model lock-in

Claude Code is locked to Claude models. The orchestration layer (sub-agents, agent teams, skills, hooks, worktrees) doesn't exist in other tools. Your model choice is portable (use Claude API keys with aider, opencode, etc.) but the harness is not. No open-source tool today gives you model flexibility and Claude Code's agent stack. If you're invested in this workflow, you're invested in Claude Code.

When to stay in thinking mode

You're exploring, designing, or grooming. The value is in the conversation.
One task that needs your full attention and steering.
Cost constraint. Throughput mode is 2-5x more expensive per equivalent output.
The work isn't decomposed into independent cards yet. Dispatch without grooming is waste.

The real cost

Anthropic's C compiler project: $20K in API costs for 16 agents producing 100K lines of code. That excludes significant human effort for workflow design, task decomposition, agent management, output review, and integration. Budget for both.

What's next

Today's harness is human-triggered. You run /dispatch when you're ready. The next step is agents that continuously pull from the backlog as cards become ready, with the human as reviewer rather than dispatcher.

The pieces exist: bd ready for discovery, worktrees for isolation, hooks for quality, agent teams for coordination. The missing piece is the continuous loop, and the trust to let it run.

Companies with agentic coding infrastructure report 30-50% acceleration in development cycles. But a February 2026 NBER study of nearly 6,000 executives found 89% of firms report zero productivity change from AI. The gap between those groups isn't model quality. It's the infrastructure around the model.

That's been the consistent lesson: harness design matters as much as prompt design.

Top comments (5)

Kyle Carriedo • May 26

Really solid writeup — the compound reliability degradation point (95% agent success → 77% at 5 chained agents) is the number I keep trying to explain to people who think multi-agent means "just add more agents."

The "human as bottleneck" frame is accurate and also the most interesting design problem: if agents can move faster than you can review and redirect, the question isn't "how do I slow the agents down" — it's "how do I move the review+redirect overhead out of my critical path."

What's worked for us on the orchestration side: an out-of-process coordinator that holds the cycle loop, enforces one-agent-per-project-namespace at a time via a file lock, and writes a structured handoff that the next agent reads cold rather than inheriting from a long conversation. It doesn't solve all seven of the failure modes you list, but it addresses the "agents building on each other's bad output" and "duplicate work" problems directly — each agent in the chain starts fresh rather than inheriting accumulated drift.

The model lock-in point at the end is real. The harness assumptions today are Opus-flavored in ways that don't port cleanly to other models. Worth documenting that constraint explicitly if you're recommending this to teams.

Karun Japhet • Jun 5

Thanks Kyle, this is the most useful kind of comment to get.

The cold-handoff detail is the part I want to sit with. What I'm running today leans on Beads for the structured handoff, but the agent still inherits a conversation, so accumulated drift is only half-solved. Your "next agent reads the handoff cold" framing is the sharper version of what I was reaching for. The handoff artifact becomes the contract, and the conversation stops being load-bearing. I think that's the right direction and I hadn't drawn the line that cleanly.

The file lock for one-agent-per-namespace is interesting too. I get isolation from worktrees instead, so two agents never touch the same tree, but that's isolation by separation, not by coordination. Your lock handles the case where the work genuinely overlaps and separation isn't an option. Different tool for a different shape of problem.

On model lock-in: agreed, and I should have been explicit. The harness assumptions are Opus-flavored and don't port cleanly. I'll call that out directly rather than leaving it as a footnote.

Curious how your coordinator decides ordering when two queued items both want the same namespace. Priority, FIFO, something smarter?

Kyle Carriedo • Jun 8

The compound reliability figure (77% for 5 agents at 95% each) is the most useful number in this piece for anyone reasoning about production multi-agent systems. It's intuitive but doesn't feel real until you write it out.

One thing I'd add: the 77% assumes failures are visible. In practice, a significant class of multi-agent failures are silent — a subagent completes its turn without error but with a subtly wrong output, and the next agent treats that as authoritative input. By the time a human reviews the final result, the compounding has happened invisibly.

The mitigation we've found useful: an external coordinator that treats each agent's output as a candidate rather than a fact. The coordinator runs a lightweight schema validation pass before handing off to the next agent, and if the candidate doesn't pass, the coordinator retries or escalates rather than continuing the chain. This doesn't solve the compound reliability problem, but it moves failures from "silent corruption" to "explicit retry" — which is where you want them.

The token cost observation (2-5x) is real and worth front-loading in any multi-agent proposal. Stakeholders usually think of it as "running N models in parallel," not "running N models plus the coordination overhead."

(Disclosure: I build Claudeverse — an orchestration layer for Claude Code. The pattern above is what we run in production. Happy to share more specifics if useful.)

Kyle Carriedo • Jun 17

The compound reliability number (77% for five 95%-reliable agents) is a useful framing, but in practice it understates the problem because failures aren't independent. When agent 3 misunderstands the task spec, agents 4 and 5 often inherit the same misunderstanding if they're reading the same shared context — so you don't get independent failure probabilities, you get correlated failure modes.

The context-loss-between-handoffs finding (39-70% degradation) is the more load-bearing number for anyone designing these systems. The practical implication is that handoff artifacts need to be much more structured than a prose summary — you want schema-validated outputs with explicit field contracts so the receiving agent has no ambiguity to fill in.

On the duplicate-work problem: the file-based task-claiming pattern (agent writes a lock file before starting a task) is simple and works reliably, but it has a failure mode when agents crash mid-task without releasing the lock. Worth building a staleness check (lock file age > N minutes = abandoned) into whatever orchestration layer you use.

The seven failure modes the author lists track closely with what I've seen building multi-agent pipelines. The ones that are hardest to instrument are silent failures — where an agent produces structurally valid output that is semantically wrong. Schema validation catches shape errors but not meaning errors, and meaning errors are the ones that propagate furthest before being caught.
