Karun Japhet

Posted on • Originally published at karun.me

Multi-Agent Development Workflows with Claude Code

Single-agent Claude Code is pair programming. One developer, one task, full attention.

I've been running three or four agents against a project backlog simultaneously. Not because single-agent broke, but because groomed cards were sitting idle.

Here's what that looks like in practice.

The shift: from writing code to shaping work

When you use Claude Code as a single agent, you're pair programming. That's powerful when you're exploring a problem or designing an approach. But if you have independent cards groomed and ready, you're leaving throughput on the table.

Your role shifts. Instead of writing code alongside one agent, you shape the work before it starts and judge it when it's done. You groom cards, make design decisions, dispatch work, and review output. The agents write the code. Addy Osmani calls this the factory model: you're no longer building software, you're building the factory that builds your software. The spec becomes the primary deliverable, and the harness (task tracking, isolation, quality gates, review) is the factory floor.

Steve Yegge's Gas Town post maps this journey in eight stages, from IDE copilot to building your own orchestrator. I started multi-agent work at stage 6: three or four terminal windows, each running Claude Code on a different card. You realise quickly that you're the bottleneck. The agents can move faster than you can review, approve, and redirect. The answer isn't more attention from you. It's giving the agents more autonomy with safety nets: quality gates that catch problems automatically, structured dispatch so agents find their own work, and a review workflow for when they're done.

This post is my version of stage 8. The tooling is still maturing, and this harness will look different in six months. This is the April 2026 version.

Anthropic's 2026 Agentic Coding Trends Report says multi-agent "doesn't make sense for 95% of agent-assisted development tasks." That's probably true for ad-hoc coding. But if you have a groomed backlog of independent cards, running them in parallel is the logical next step for moving through that backlog faster.

Two modes, not a progression

These aren't stages you graduate through. They're modes you switch between based on what you're doing right now.

Thinking mode (single agent)

When you're exploring, designing, or working through a single problem. Grooming cards, writing acceptance criteria, debugging something complex. The value is in the conversation, not the throughput.

This is pair programming. Full attention on one thing.

Throughput mode (parallel workers)

When you have multiple cards ready to go. Each worker gets a card, a worktree, and runs independently. You review their output when they're done.

Choose based on card complexity and dependencies:

Sub-agents for small, independent cards (roughly 15-minute tasks):

  • Quick fixes, config changes, bounded features
  • Research running in background
  • Automated code review of completed work
  • Short-lived: no auto-compaction, so longer tasks can exhaust the context window
  • Cheaper: minimal context startup, returns summaries only

Agent teams for substantial cards or cards with cross-card dependencies:

  • Multi-file features that need to read large parts of the codebase
  • Cards where the agent needs sustained autonomy and may hit context limits
  • Each teammate is a full Claude Code session with auto-compaction, so they can sustain longer work
  • More expensive: each teammate loads full project context independently

Agent teams also handle coordination. When cards genuinely depend on each other, teammates can communicate directly via peer-to-peer messaging (SendMessage), shared task lists with dependency tracking, and auto-unblocking. Vertically-sliced stories following the INVEST principle produce fewer cross-card dependencies than horizontal slicing, but they don't eliminate them. Real dependencies exist even in well-groomed backlogs.

Real coordination cases:

  • One card updates a shared schema; other in-flight cards need to know before they merge
  • Large refactors that can't be one card, where agents need to agree on new interfaces
  • Adversarial debugging: competing hypotheses where agents share findings

Agent Teams require CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 (v2.1.32+) and are still experimental. No session resumption, task status can lag, one team per session.

How to choose

| Situation | Mode | Why |
| --- | --- | --- |
| Exploring, grooming, designing | Thinking | You need the dialogue |
| One thing that needs full attention | Thinking | Conversation > throughput |
| Multiple small, bounded cards ready | Throughput (sub-agents) | Fast, cheap, parallel |
| Multiple substantial cards | Throughput (agent teams) | Full context, sustained autonomy |
| Cards with cross-card dependencies | Throughput (agent teams) | Agents need to communicate |
| Research while you work | Throughput (sub-agents) | Background tasks |
| Review of completed work | Throughput (sub-agents) | Fresh context, separate reviewer |
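
The table can also be read as a small decision function. The sketch below is only an illustration: the `Work` fields and the `choose_mode` name are hypothetical, and it deliberately ignores the background research and review rows, which are always sub-agent jobs.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Hypothetical summary of what is on your plate right now."""
    cards_ready: int                # groomed, independent cards waiting
    exploring: bool = False         # still grooming, designing, or debugging
    substantial: bool = False       # multi-file work likely to hit context limits
    cross_card_deps: bool = False   # cards that must coordinate with each other


def choose_mode(work: Work) -> str:
    """Return the mode the table suggests for this situation."""
    # Exploration, or a single task, is a conversation: stay single-agent.
    if work.exploring or work.cards_ready <= 1:
        return "thinking"
    # Long-running or interdependent cards need full sessions that can talk.
    if work.substantial or work.cross_card_deps:
        return "throughput (agent teams)"
    # Everything else is small and bounded.
    return "throughput (sub-agents)"


print(choose_mode(Work(cards_ready=3)))
```

The ordering of the checks is the point: exploration wins over everything, and dependencies win over card size.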

Rate limits are the real parallelism ceiling. They're pooled across all sessions on your account. Opus has the strictest limits. Plan for this when dispatching multiple workers.

The harness: making throughput mode reliable

Dispatching multiple agents is easy. Getting reliable output is hard. The harness (task tracking, isolation, quality gates, review) is what makes multi-agent development repeatable.

Layer 0: the upstream gate

The most important quality gate happens before any code is written.

Careful grooming is what makes the whole pipeline work. Clear description, specific acceptance criteria, explicit non-goals. As Ankit Jain puts it, "the most valuable human judgment is exercised before the first line of code is generated, not after."

I spend more time grooming cards than I do reviewing agent output. That ratio feels right. Groom in your main Claude Code session, use the conversation to think through edge cases, and write precise acceptance criteria. The card is the spec.

Layer 1: task tracking

Agents need to discover available work, claim it atomically, and track what's been tried. A TODO list isn't enough.

I'm using Beads for this. It stores data locally via Dolt, gives agents programmatic access, and handles dependencies between tasks. The key commands:

  • bd ready lists tasks with no open blockers
  • bd update <id> --claim atomically claims a task
  • bd show <id> gets full card details including previous notes and rejection feedback

A /dispatch skill wraps this into a workflow: find available cards via bd ready, present them for selection, claim each one, and spawn a worker per card with worktree isolation.
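
As a rough idea of the discovery and claiming steps, here is a minimal Python sketch. It assumes `bd ready --plain` prints one card per line in the column layout shown in the walkthrough later in this post (ID, priority, title, label, separated by two or more spaces), and that `bd update <id> --claim` exits non-zero when the claim fails. Both are assumptions about the CLI's behaviour, and `Card`, `parse_ready`, and `claim` are hypothetical names.

```python
import re
import subprocess
from typing import NamedTuple


class Card(NamedTuple):
    card_id: str
    priority: str
    title: str
    label: str


def parse_ready(output: str) -> list[Card]:
    """Parse `bd ready --plain` output (assumed: columns split by 2+ spaces)."""
    cards = []
    for line in output.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 4 and fields[0].startswith("bd-"):
            cards.append(Card(*fields))
    return cards


def claim(card_id: str) -> bool:
    """Try to claim a card. Assumes bd exits non-zero if the claim is refused."""
    result = subprocess.run(
        ["bd", "update", card_id, "--claim"], capture_output=True, text=True
    )
    return result.returncode == 0


SAMPLE = """\
bd-c4a1  P1  Add payment grace period calculation     customer
bd-e2f7  P1  Fix SSN validation accepting 000 prefix  customer
bd-b8d3  P2  Add CSV export to transaction history     reporting
"""

for card in parse_ready(SAMPLE):
    print(card.card_id, card.priority, card.title)
```

A real dispatcher would call `claim` for each selected card and only spawn a worker when it returns `True`, so two dispatchers never start on the same card.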

For multi-developer setups, a centralized tool (GitHub Issues, Linear) may be more practical. Beads' strength is agent-native programmatic access. See also The Claude Protocol and Metaswarm for existing harness implementations.

Layer 2: isolation

Without worktree isolation, parallel agents write to the same working directory and overwrite each other's changes. With it, each agent gets its own branch and working directory.

A worker agent definition (.claude/agents/worker.md):

---
model: sonnet
isolation: worktree
background: true
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
permissionMode: acceptEdits
---

isolation: worktree gives each worker its own git worktree. background: true means the dispatch doesn't block waiting for workers to finish. model: sonnet keeps costs down for development work (swap to opus for complex cards).

Supporting config:

  • .worktreeinclude copies gitignored files (like .env) into new worktrees
  • WorktreeCreate hooks handle dependency installation
  • Scope each agent via CLAUDE.md to prevent merge conflicts across worktrees

Anthropic's C compiler case study used this pattern with 16 parallel agents. They hit duplicate work and merge conflicts. Tighter scoping and atomic task claiming address both.
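
Claude Code creates the worktree itself when `isolation: worktree` is set, so the following is not its implementation. It is a sketch of the equivalent manual step, useful for seeing what "own branch and working directory" means in git terms. The `work/<card-id>-<slug>` branch naming and the `../worktrees/` location are assumptions made for illustration.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def worktree_command(card_id: str, title: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` invocation for one card.

    -b creates a new branch, the next argument is the new working
    directory, and the last argument is the commit to branch from.
    """
    branch = f"work/{card_id}-{slugify(title)}"   # hypothetical naming scheme
    path = f"../worktrees/{card_id}"              # hypothetical location
    return ["git", "worktree", "add", "-b", branch, path, base]


print(" ".join(worktree_command("bd-e2f7", "Fix SSN validation")))
```

One branch and one directory per card means a rejected card can be discarded by removing its worktree, without touching anything another agent is doing.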

Layer 3: quality gates

Two categories: automated (hooks that block the agent) and manual (human judgment during review). I underestimated how large agent-generated diffs get when the card isn't tightly scoped. The diff size guard was an afterthought; it's now one of the more useful gates.
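
A diff size guard can be very small. The sketch below assumes the gate script has already captured the output of `git diff --shortstat main` as a string. The 400-line threshold and the function names are hypothetical, so tune the limit to what you can realistically review.

```python
import re


def changed_lines(shortstat: str) -> int:
    """Sum insertions and deletions from `git diff --shortstat` output."""
    counts = re.findall(r"(\d+) (?:insertion|deletion)", shortstat)
    return sum(int(n) for n in counts)


def diff_too_large(shortstat: str, max_lines: int = 400) -> bool:
    """True when the change exceeds the (hypothetical) review threshold."""
    return changed_lines(shortstat) > max_lines


stat = " 12 files changed, 847 insertions(+), 23 deletions(-)"
print(changed_lines(stat), diff_too_large(stat))
```

Counting insertions plus deletions rather than files changed keeps the guard honest for both wide, shallow edits and a single enormous file.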

Automated gates (fail-fast pyramid)

Run fastest and cheapest first, most expensive last:

  1. Formatting (PostToolUse on Write/Edit, instant). Auto-fix, not a gate.
  2. Linting / static analysis (seconds). Fast, deterministic.
  3. Type checking (seconds). Catches interface mismatches.
  4. Secret detection (PreToolUse on Edit/Write). Blocks before secrets hit disk.
  5. Unit tests (minutes). The foundation.
  6. Diff size guard (instant). Reject if change exceeds threshold. Prevents comprehension debt.
  7. Automated code review (subagent, 30-90s). Separate agent reviews the diff.
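
The fail-fast idea can be sketched as a runner that executes gates in order and stops at the first failure. This is not the author's `quality-gate.sh`, just an illustration of the ordering. The commands in `GATES` are placeholders for whatever linter, type checker, and test runner your project uses, and returning 2 follows Claude Code's convention that a hook exiting with code 2 blocks the agent and feeds it the error message.

```python
import subprocess
import sys

# Hypothetical gate commands, cheapest first. Replace with your own tools.
GATES = [
    ("lint", ["npx", "eslint", "."]),
    ("typecheck", ["npx", "tsc", "--noEmit"]),
    ("unit tests", ["npm", "test"]),
]


def run_gates(gates: list[tuple[str, list[str]]]) -> tuple[int, str]:
    """Run gates in order and stop at the first failure.

    Returns (0, "") when everything passes, otherwise (2, message).
    """
    for name, command in gates:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            detail = (result.stderr or result.stdout).strip()
            return 2, f"{name} failed:\n{detail}"
    return 0, ""


# Demo with a trivial always-passing gate so the sketch runs anywhere.
print(run_gates([("noop", [sys.executable, "-c", "pass"])]))
```

A hook script would write the message to stderr and exit with the returned code. Because expensive gates never run after a cheap one fails, the agent gets feedback in seconds on most iterations.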

The code review subagent must be a separate agent with its own context window. As Nick Tune writes, "asking the main agent to mark its own homework is obviously not a good approach." Hamilton Greene's 9-agent approach achieves roughly 75% useful suggestions versus less than 50% from single-agent review.

Hook implementation:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "./scripts/detect-secrets.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "npx prettier --write \"$FILE_PATH\"" }]
      }
    ],
    "TaskCompleted": [
      {
        "matcher": "",
        "hooks": [{ "type": "command", "command": "./scripts/quality-gate.sh" }]
      }
    ]
  }
}

Exit code 0 proceeds. Exit code 2 blocks with feedback (the agent gets the stderr message and iterates). Lint, tests, and code review fire on TaskCompleted (runs once when the agent says "done"). Secret detection fires on PreToolUse (blocks before the write). See hooks reference.
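
To make the PreToolUse gate concrete, here is a hedged Python sketch of what a secret detector might check. It assumes the hook receives a JSON payload on stdin with a `tool_input` object carrying `content` (for Write) or `new_string` (for Edit); verify those field names against the hooks reference. The two patterns are illustrative only, and a real setup would more likely call a dedicated scanner.

```python
import json
import re

# Illustrative patterns only; real scanners ship far more comprehensive rules.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def check_hook_input(raw_json: str) -> tuple[int, str]:
    """Map a hook payload to (exit_code, stderr_message)."""
    payload = json.loads(raw_json)
    tool_input = payload.get("tool_input", {})
    # Assumed field names: Write sends "content", Edit sends "new_string".
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = find_secrets(text)
    if hits:
        return 2, "Possible secret detected: " + ", ".join(hits)
    return 0, ""


event = json.dumps({"tool_name": "Write", "tool_input": {"content": "retries = 3"}})
print(check_hook_input(event))
```

A wrapper script would pass `sys.stdin.read()` to `check_hook_input`, print the message to stderr, and exit with the returned code, so the agent sees why the write was blocked.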

Manual gates

What automated checks can't catch:

Layer 4: review gates

For trunk-based development without PRs, the worktree branch is the review surface. git diff main from the worktree shows exactly what would change on merge.

A /review-worktree skill handles this:

  1. Cross-references bd list --label review:pending with git worktree list
  2. Shows commit history and diff summary for the selected worktree
  3. Options: view full diff, view specific file, run tests, run review agent, approve, reject
  4. Approve: merge to main, close card, clean up worktree
  5. Reject: reopen card with feedback comment visible to the next worker via bd show
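
The cross-referencing in step 1 could look roughly like this. The sketch parses the output of `git worktree list --porcelain` and matches it against card IDs that are pending review. How you obtain those IDs from `bd list --label review:pending` is left out, and the `work/<card-id>-...` branch naming is an assumption based on the walkthrough below.

```python
def parse_worktrees(porcelain: str) -> dict[str, str]:
    """Map branch name to worktree path from `git worktree list --porcelain`."""
    result: dict[str, str] = {}
    path = None
    for line in porcelain.splitlines():
        if line.startswith("worktree "):
            path = line[len("worktree "):]
        elif line.startswith("branch ") and path is not None:
            branch = line[len("branch "):].removeprefix("refs/heads/")
            result[branch] = path
    return result


def reviewable(pending_ids: list[str], worktrees: dict[str, str]) -> dict[str, str]:
    """Map each pending card ID to the worktree whose branch carries that ID."""
    matches: dict[str, str] = {}
    for card_id in pending_ids:
        for branch, path in worktrees.items():
            if branch.startswith(f"work/{card_id}-"):  # assumed naming scheme
                matches[card_id] = path
    return matches


SAMPLE = """\
worktree /repo
HEAD 1111111111111111111111111111111111111111
branch refs/heads/main

worktree /repo-worktrees/bd-e2f7
HEAD a3f8c21000000000000000000000000000000000
branch refs/heads/work/bd-e2f7-fix-ssn-validation
"""

print(reviewable(["bd-e2f7", "bd-b8d3"], parse_worktrees(SAMPLE)))
```

Cards that are pending review but have no matching worktree simply drop out of the result, which is a useful signal that something was cleaned up too early.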

For PR-based teams, the agent creates a PR from the worktree branch. Standard review process.

Layer 5: the feedback loop

When work is rejected, the card reopens with a comment explaining why. On the next dispatch, the worker agent sees the rejection feedback via bd show. It has context on what was tried and why it failed.

When an agent is stuck, the card goes to blocked with a needs-help label and a note explaining what was tried. The human reviews and either re-grooms the card or splits it.

What it looks like in practice

Here's a walkthrough on credit-card-lending using Agent Teams. Three cards groomed and ready. The session starts with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 enabled.

Check the backlog and create the team

> bd ready --plain

bd-c4a1  P1  Add payment grace period calculation     customer
bd-e2f7  P1  Fix SSN validation accepting 000 prefix  customer
bd-b8d3  P2  Add CSV export to transaction history     reporting

> Create a team with three teammates, one for each card. Each teammate should use the worker agent definition and get its own worktree.

Creating team "sprint-work"...

Spawning teammate "grace-period" for bd-c4a1...
  Worktree: work/bd-c4a1-payment-grace-period
  Card claimed.

Spawning teammate "ssn-fix" for bd-e2f7...
  Worktree: work/bd-e2f7-fix-ssn-validation
  Card claimed.

Spawning teammate "csv-export" for bd-b8d3...
  Worktree: work/bd-b8d3-csv-export-transactions
  Card claimed.

Team "sprint-work" running. Ctrl+T to toggle task list. Shift+Down to
cycle between teammates.

Monitor progress

The shared task list shows what each teammate is working on. Ctrl+T toggles it:

Tasks:
  [in_progress] bd-c4a1: Add payment grace period calculation (grace-period)
  [completed]   bd-e2f7: Fix SSN validation accepting 000 prefix (ssn-fix)
  [in_progress] bd-b8d3: Add CSV export to transaction history (csv-export)

While teammates work, I stay in the lead session. Groom next sprint's cards, explore a design problem, whatever needs thinking. Teammates message the lead if they're stuck or need clarification.

Review completed work

ssn-fix and csv-export have finished. I review each worktree diff from the lead session.

> Show me the diff for ssn-fix's worktree

Commits (main..HEAD):
  a3f8c21 Fix SSN validation to reject 000 and 999 prefixes
  e7b2d14 Add test cases for invalid SSN prefixes

Changed files:
  src/.../customer/validation/SsnValidator.java      | 12 ++++++--
  src/.../customer/validation/SsnValidatorTest.java  | 28 ++++++++++++++++

Small, focused fix. Two files, clear test coverage. Merge it.

> Merge ssn-fix's worktree to main

Merging work/bd-e2f7-fix-ssn-validation into main... done
Closing bd-e2f7... done
Removing worktree... done

Now the CSV export:

> Show me the diff for csv-export's worktree

Commits (main..HEAD):
  b1c4e89 Add CSV export endpoint for transaction history
  d5a7f23 Add PDF export endpoint for transaction history
  f9e1b34 Add export format selection dropdown to UI

Changed files:
  12 files changed, 847 insertions(+), 23 deletions(-)

Scope creep. The card said CSV export. The teammate added PDF export and a UI component.

> Reject. Send csv-export a message: "Card asked for CSV export only.
  PDF export and UI dropdown are out of scope. Revert those changes
  and keep only the CSV export."

Message sent to csv-export. Reopening bd-b8d3...

With Agent Teams, the rejection goes directly to the teammate via SendMessage. The teammate receives the feedback, reverts the out-of-scope work, and resubmits. No re-dispatch needed.

This is a common failure mode: agents are eager to build adjacent features. The tighter the acceptance criteria in the card, the less often this happens.

Where it falls apart

Compound reliability

If each agent succeeds 95% of the time, five agents chained in sequence succeed roughly 77% of the time (0.95^5 ≈ 0.77). Multi-agent trades reliability for parallelism. The benefit must justify the overhead.
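
The arithmetic is easy to check. This sketch assumes every step succeeds independently with the same probability and that the chain fails if any single step fails, which is a simplification of real pipelines.

```python
def chained_success(per_agent: float, agents: int) -> float:
    """Probability that every agent in a sequential chain succeeds."""
    return per_agent ** agents


print(f"{chained_success(0.95, 5):.1%}")  # 77.4%
```

Under the same assumptions the curve keeps falling: at ten chained agents the figure is just under 60%, which is why independent cards parallelise better than long handoff chains.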

Context loss between agents

Every handoff is lossy compression. Google Research found 39-70% degradation in sequential multi-agent tasks. Subagents summarize results back to the caller; teammates don't get the lead's conversation history. Isolation prevents context pollution but loses nuance.

Token cost

Multi-agent consumes 2-5x more tokens for equivalent work. No published harness has budget limits per task. /usage monitoring is the best we have. This is an unsolved problem.

Time blindness

From the C compiler case study: Claude can't tell time and will spend hours running tests instead of making progress. The harness needs to print progress infrequently and offer fast-test options.

Duplicate work

Without task claiming, multiple agents fix the same bug independently and overwrite each other. I've seen this even with bd's --claim, when two cards touch overlapping files. The C compiler case study hit it at scale with 16 agents targeting the same bug.

The 18-month wall

Without quality gates, the pattern is: early velocity (months 1-3), plateau (4-9), decline (10-15), stall (16-18) as comprehension debt accumulates. CodeRabbit's research found AI-generated code produces 1.7x more issues and performance inefficiencies 8x more often than human code. This is why quality gates matter. Without them, the velocity gains are temporary.

The honest tradeoffs

Model lock-in

Claude Code is locked to Claude models. The orchestration layer (sub-agents, agent teams, skills, hooks, worktrees) doesn't exist in other tools. Your model choice is portable (use Claude API keys with aider, opencode, etc.) but the harness is not. No open-source tool today gives you model flexibility and Claude Code's agent stack. If you're invested in this workflow, you're invested in Claude Code.

When to stay in thinking mode

  • You're exploring, designing, or grooming. The value is in the conversation.
  • One task that needs your full attention and steering.
  • Cost constraint. Throughput mode is 2-5x more expensive per equivalent output.
  • The work isn't decomposed into independent cards yet. Dispatch without grooming is waste.

The real cost

Anthropic's C compiler project: $20K in API costs for 16 agents producing 100K lines of code. That excludes significant human effort for workflow design, task decomposition, agent management, output review, and integration. Budget for both.

What's next

Today's harness is human-triggered. You run /dispatch when you're ready. The next step is agents that continuously pull from the backlog as cards become ready, with the human as reviewer rather than dispatcher.

The pieces exist: bd ready for discovery, worktrees for isolation, hooks for quality, agent teams for coordination. The missing piece is the continuous loop, and the trust to let it run.

Companies with agentic coding infrastructure report 30-50% acceleration in development cycles. But a February 2026 NBER study of nearly 6,000 executives found 89% of firms report zero productivity change from AI. The gap between those groups isn't model quality. It's the infrastructure around the model.

That's been the consistent lesson: harness design matters as much as prompt design.
