Single-agent Claude Code is pair programming. One developer, one task, full attention.
I've been running three or four agents against a project backlog simultaneously. Not because single-agent broke, but because groomed cards were sitting idle.
Here's what that looks like in practice.
The shift: from writing code to shaping work
When you use Claude Code as a single agent, you're pair programming. That's powerful when you're exploring a problem or designing an approach. But if you have independent cards groomed and ready, you're leaving throughput on the table.
Your role shifts. Instead of writing code alongside one agent, you shape the work before it starts and judge it when it's done. You groom cards, make design decisions, dispatch work, and review output. The agents write the code. Addy Osmani calls this the factory model: you're no longer building software, you're building the factory that builds your software. The spec becomes the primary deliverable, and the harness (task tracking, isolation, quality gates, review) is the factory floor.
Steve Yegge's Gas Town post maps this journey in eight stages, from IDE copilot to building your own orchestrator. I started multi-agent work at stage 6: three or four terminal windows, each running Claude Code on a different card. You realise quickly that you're the bottleneck. The agents can move faster than you can review, approve, and redirect. The answer isn't more attention from you. It's giving the agents more autonomy with safety nets: quality gates that catch problems automatically, structured dispatch so agents find their own work, and a review workflow for when they're done.
This post is my version of stage 8. The tooling is still maturing, and this harness will look different in six months. This is the April 2026 version.
Anthropic's 2026 Agentic Coding Trends Report says multi-agent "doesn't make sense for 95% of agent-assisted development tasks." That's probably true for ad-hoc coding. But if you have a groomed backlog of independent cards, running them in parallel is the logical way to move through the backlog faster.
Two modes, not a progression
These aren't stages you graduate through. They're modes you switch between based on what you're doing right now.
Thinking mode (single agent)
When you're exploring, designing, or working through a single problem. Grooming cards, writing acceptance criteria, debugging something complex. The value is in the conversation, not the throughput.
This is pair programming. Full attention on one thing.
Throughput mode (parallel workers)
When you have multiple cards ready to go. Each worker gets a card and a worktree, and runs independently. You review their output when they're done.
Choose based on card complexity and dependencies:
Sub-agents for small, independent cards (roughly 15-minute tasks):
- Quick fixes, config changes, bounded features
- Research running in background
- Automated code review of completed work
- Short-lived: no auto-compaction, so longer tasks can exhaust the context window
- Cheaper: minimal context startup, returns summaries only
Agent teams for substantial cards or cards with cross-card dependencies:
- Multi-file features that need to read large parts of the codebase
- Cards where the agent needs sustained autonomy and may hit context limits
- Each teammate is a full Claude Code session with auto-compaction, so they can sustain longer work
- More expensive: each teammate loads full project context independently
Agent teams also handle coordination. When cards genuinely depend on each other, teammates can communicate directly via peer-to-peer messaging (SendMessage), shared task lists with dependency tracking, and auto-unblocking. Vertically-sliced stories following the INVEST principle produce fewer cross-card dependencies than horizontal slicing, but they don't eliminate them. Real dependencies exist even in well-groomed backlogs.
Real coordination cases:
- One card updates a shared schema; other in-flight cards need to know before they merge
- Large refactors that can't be one card, where agents need to agree on new interfaces
- Adversarial debugging: competing hypotheses where agents share findings
Agent Teams require CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 (v2.1.32+) and are still experimental. No session resumption, task status can lag, one team per session.
How to choose
| Situation | Mode | Why |
|---|---|---|
| Exploring, grooming, designing | Thinking | You need the dialogue |
| One thing that needs full attention | Thinking | Conversation > throughput |
| Multiple small, bounded cards ready | Throughput (sub-agents) | Fast, cheap, parallel |
| Multiple substantial cards | Throughput (agent teams) | Full context, sustained autonomy |
| Cards with cross-card dependencies | Throughput (agent teams) | Agents need to communicate |
| Research while you work | Throughput (sub-agents) | Background tasks |
| Review of completed work | Throughput (sub-agents) | Fresh context, separate reviewer |
Rate limits are the real parallelism ceiling. They're pooled across all sessions on your account. Opus has the strictest limits. Plan for this when dispatching multiple workers.
The harness: making throughput mode reliable
Dispatching multiple agents is easy. Getting reliable output is hard. The harness (task tracking, isolation, quality gates, review) is what makes multi-agent development repeatable.
Layer 0: the upstream gate
The most important quality gate happens before any code is written.
Careful grooming is what makes the whole pipeline work. Clear description, specific acceptance criteria, explicit non-goals. As Ankit Jain puts it, "the most valuable human judgment is exercised before the first line of code is generated, not after."
I spend more time grooming cards than I do reviewing agent output. That ratio feels right. Groom in your main Claude Code session, use the conversation to think through edge cases, and write precise acceptance criteria. The card is the spec.
Layer 1: task tracking
Agents need to discover available work, claim it atomically, and track what's been tried. A TODO list isn't enough.
I'm using Beads for this. It stores data locally via Dolt, gives agents programmatic access, and handles dependencies between tasks. The key commands:
- bd ready lists tasks with no open blockers
- bd update <id> --claim atomically claims a task
- bd show <id> gets full card details including previous notes and rejection feedback
A /dispatch skill wraps this into a workflow: find available cards via bd ready, present them for selection, claim each one, and spawn a worker per card with worktree isolation.
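To make "ready" and "claim atomically" concrete, here is a minimal in-memory sketch in Python. It is not how Beads is implemented; the Task and Backlog names and the status strings are hypothetical, chosen only to illustrate the two rules: a task is ready when all of its blockers are closed, and a claim is a check-and-set done under one lock so two workers can't both win.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Task:
    id: str
    status: str = "open"  # hypothetical statuses: "open", "in_progress", "closed"
    blockers: Set[str] = field(default_factory=set)
    assignee: Optional[str] = None


class Backlog:
    def __init__(self, tasks: List[Task]) -> None:
        self._tasks: Dict[str, Task] = {t.id: t for t in tasks}
        self._lock = threading.Lock()

    def ready(self) -> List[str]:
        """IDs of open tasks whose blockers are all closed."""
        with self._lock:
            return [
                t.id
                for t in self._tasks.values()
                if t.status == "open"
                and all(self._tasks[b].status == "closed" for b in t.blockers)
            ]

    def claim(self, task_id: str, worker: str) -> bool:
        """Check and set under a single lock, so only one worker can win."""
        with self._lock:
            task = self._tasks[task_id]
            if task.status != "open":
                return False
            task.status, task.assignee = "in_progress", worker
            return True

    def close(self, task_id: str) -> None:
        with self._lock:
            self._tasks[task_id].status = "closed"


backlog = Backlog([Task("a"), Task("b", blockers={"a"})])
first = backlog.claim("a", "worker-1")   # True: the task was open
second = backlog.claim("a", "worker-2")  # False: already claimed
```

The second claim fails instead of silently overwriting the first, which is the property that stops two workers from picking up the same card.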
For multi-developer setups, a centralized tool (GitHub Issues, Linear) may be more practical. Beads' strength is agent-native programmatic access. See also The Claude Protocol and Metaswarm for existing harness implementations.
Layer 2: isolation
Without worktree isolation, parallel agents share one working directory and overwrite each other's changes. With it, each agent gets its own branch and working directory.
A worker agent definition (.claude/agents/worker.md):
---
model: sonnet
isolation: worktree
background: true
tools:
- Read
- Write
- Edit
- Bash
- Glob
- Grep
permissionMode: acceptEdits
---
isolation: worktree gives each worker its own git worktree. background: true means the dispatch doesn't block waiting for workers to finish. model: sonnet keeps costs down for development work (swap to opus for complex cards).
Supporting config:
- .worktreeinclude copies gitignored files (like .env) into new worktrees
- WorktreeCreate hooks handle dependency installation
- Scope each agent via CLAUDE.md to prevent merge conflicts across worktrees
Anthropic's C compiler case study used this pattern with 16 parallel agents. They hit duplicate work and merge conflicts. Tighter scoping and atomic task claiming address both.
Layer 3: quality gates
Two categories: automated (hooks that block the agent) and manual (human judgment during review). I underestimated how large agent-generated diffs get when the card isn't tightly scoped. The diff size guard was an afterthought; it's now one of the more useful gates.
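A diff size guard can be very small. The sketch below is one possible shape, not the script from my harness: it sums added and deleted lines from git diff --numstat output and returns a blocking exit code when the total passes a threshold. The 400-line limit, the function names, and the main base branch are all assumptions to adjust.

```python
import subprocess
import sys
from typing import Tuple


def diff_size(numstat: str) -> int:
    """Total added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        # Binary files report "-" for both counts; they are skipped here.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def check_diff_size(numstat: str, max_lines: int = 400) -> Tuple[int, str]:
    """Return (exit_code, message); the 400-line default is an arbitrary assumption."""
    size = diff_size(numstat)
    if size > max_lines:
        return 2, f"Diff is {size} lines (limit {max_lines}). Trim scope or split the card."
    return 0, ""


def run_guard(base: str = "main") -> int:
    """Measure the current branch against its merge base with `base`."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    code, message = check_diff_size(out)
    if message:
        print(message, file=sys.stderr)
    return code
```

A hook script would end with sys.exit(run_guard()), so an oversized diff reaches the agent as feedback rather than reaching you as a review burden.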
Automated gates (fail-fast pyramid)
Run fastest and cheapest first, most expensive last:
- Formatting (PostToolUse on Write/Edit, instant). Auto-fix, not a gate.
- Linting / static analysis (seconds). Fast, deterministic.
- Type checking (seconds). Catches interface mismatches.
- Secret detection (PreToolUse on Edit/Write). Blocks before secrets hit disk.
- Unit tests (minutes). The foundation.
- Diff size guard (instant). Reject if change exceeds threshold. Prevents comprehension debt.
- Automated code review (subagent, 30-90s). Separate agent reviews the diff.
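The fail-fast ordering above can be sketched as a loop that stops at the first failing gate, so expensive checks never run when a cheap one has already failed. This is an illustrative Python sketch; run_gates and the gate tuples are hypothetical names, and in a real harness each check would shell out to the linter, type checker, or test runner.

```python
from typing import Callable, List, Tuple

# A gate is a name plus a check that returns (passed, detail).
Gate = Tuple[str, Callable[[], Tuple[bool, str]]]


def run_gates(gates: List[Gate]) -> Tuple[int, str]:
    """Run gates in order and stop at the first failure.

    Returns (exit_code, message): 0 if every gate passed, 2 with a
    message naming the failing gate otherwise.
    """
    for name, check in gates:
        passed, detail = check()
        if not passed:
            return 2, f"{name} failed: {detail}"
    return 0, ""


# Cheapest gates first; the unit tests are never started in this example.
gates = [
    ("lint", lambda: (True, "")),
    ("type check", lambda: (False, "2 errors")),
    ("unit tests", lambda: (True, "")),
]
code, message = run_gates(gates)
```

Ordering is the whole design: the list position decides how much time a bad change can waste before it is rejected.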
The code review subagent must be a separate agent with its own context window. As Nick Tune writes, "asking the main agent to mark its own homework is obviously not a good approach." Hamilton Greene's 9-agent approach achieves roughly 75% useful suggestions versus less than 50% from single-agent review.
Hook implementation:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "./scripts/detect-secrets.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "jq -r '.tool_input.file_path' | xargs npx prettier --write" }]
      }
    ],
    "TaskCompleted": [
      {
        "matcher": "",
        "hooks": [{ "type": "command", "command": "./scripts/quality-gate.sh" }]
      }
    ]
  }
}
Exit code 0 proceeds. Exit code 2 blocks with feedback (the agent gets the stderr message and iterates). Lint, tests, and code review fire on TaskCompleted (runs once when the agent says "done"). Secret detection fires on PreToolUse (blocks before the write). See hooks reference.
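Here is a rough Python stand-in for a secret detection hook, to show the exit-code contract in action. It assumes the hook receives the tool call as JSON on stdin, with the text to be written under tool_input.content (Write) or tool_input.new_string (Edit); check those field names against the hooks reference before relying on them. The two patterns are illustrative only and nowhere near a complete scanner.

```python
import json
import re
import sys
from typing import Optional, Tuple

# Illustrative patterns only; a real scanner needs a far broader rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def find_secret(text: str) -> Optional[str]:
    """Return the first matching pattern, or None if the text looks clean."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(text):
            return pattern.pattern
    return None


def evaluate(payload: dict) -> Tuple[int, str]:
    """Map a hook payload to (exit_code, stderr_message)."""
    tool_input = payload.get("tool_input", {})
    # Assumption: Write carries `content`, Edit carries `new_string`.
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hit = find_secret(text)
    if hit:
        return 2, f"Blocked: content matches secret pattern {hit!r}. Read it from the environment instead."
    return 0, ""


def main() -> None:
    code, message = evaluate(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # stderr is what the agent sees on exit code 2
    sys.exit(code)
```

Calling main() at the bottom of the script turns this into a usable hook command. Keeping evaluate() free of I/O makes the rule set easy to test without running an agent.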
Manual gates
What automated checks can't catch:
- Scope adherence. Did the agent build what the card asked for, or add unrequested features?
- Architectural coherence. Does the implementation fit the architecture of the rest of the system, or did the agent invent its own patterns?
- Business logic correctness. Models infer patterns statistically, not semantically.
- Comprehension check. If you can't understand the diff, it's too large or too novel.
Layer 4: review gates
For trunk-based development without PRs, the worktree branch is the review surface. git diff main from the worktree shows exactly what would change on merge.
A /review-worktree skill handles this:
- Cross-references bd list --label review:pending with git worktree list
- Shows commit history and diff summary for the selected worktree
- Options: view full diff, view specific file, run tests, run review agent, approve, reject
- Approve: merge to main, close card, clean up worktree
- Reject: reopen card with feedback comment visible to the next worker via bd show
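The cross-referencing step in that list can be sketched as follows. The code parses the output of git worktree list --porcelain and pairs each pending card ID with the worktree whose branch name contains it. It assumes branches are named work/<card-id>-<slug>, as in the walkthrough later in this post; the function names are hypothetical and the pending IDs would come from bd list.

```python
import re
from typing import Dict, Set


def parse_worktrees(porcelain: str) -> Dict[str, str]:
    """Map branch name -> worktree path from `git worktree list --porcelain`."""
    result: Dict[str, str] = {}
    path = None
    for line in porcelain.splitlines():
        if line.startswith("worktree "):
            path = line[len("worktree "):]
        elif line.startswith("branch ") and path:
            branch = line[len("branch "):]
            prefix = "refs/heads/"
            if branch.startswith(prefix):
                branch = branch[len(prefix):]
            result[branch] = path
    return result


def reviewable(pending_ids: Set[str], porcelain: str) -> Dict[str, str]:
    """Pair each pending card ID with the worktree whose branch names it."""
    pairs: Dict[str, str] = {}
    for branch, path in parse_worktrees(porcelain).items():
        # Assumed naming convention: work/<card-id>-<slug>
        match = re.match(r"work/(bd-[0-9a-z]+)", branch)
        if match and match.group(1) in pending_ids:
            pairs[match.group(1)] = path
    return pairs
```

Cards that are pending review but have no matching worktree simply drop out of the result, which is a useful signal that something was cleaned up too early.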
For PR-based teams, the agent creates a PR from the worktree branch. Standard review process.
Layer 5: the feedback loop
When work is rejected, the card reopens with a comment explaining why. On the next dispatch, the worker agent sees the rejection feedback via bd show. It has context on what was tried and why it failed.
When an agent is stuck, the card goes to blocked with a needs-help label and a note explaining what was tried. The human reviews and either re-grooms the card or splits it.
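As a sketch of those two transitions, the snippet below models a card with a status, labels, and a running list of notes. The Card class, function names, and note prefixes are hypothetical and not Beads' data model; the point is that feedback accumulates on the card, so the next worker starts with the history instead of a blank slate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Card:
    id: str
    status: str = "in_review"
    labels: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def reject(card: Card, feedback: str) -> None:
    """Reopen the card and record why, so the next worker sees it."""
    card.status = "open"
    card.notes.append(f"REJECTED: {feedback}")


def mark_stuck(card: Card, tried: str) -> None:
    """Park the card for a human, with a record of what was attempted."""
    card.status = "blocked"
    card.labels.append("needs-help")
    card.notes.append(f"STUCK: {tried}")


def worker_briefing(card: Card) -> str:
    """What a newly dispatched worker reads before starting."""
    return "\n".join(card.notes)


card = Card("bd-b8d3")
reject(card, "Card asked for CSV export only.")
briefing = worker_briefing(card)
```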
What it looks like in practice
Here's a walkthrough on credit-card-lending using Agent Teams. Three cards groomed and ready. The session starts with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 enabled.
Check the backlog and create the team
> bd ready --plain
bd-c4a1 P1 Add payment grace period calculation customer
bd-e2f7 P1 Fix SSN validation accepting 000 prefix customer
bd-b8d3 P2 Add CSV export to transaction history reporting
> Create a team with three teammates, one for each card. Each teammate should use the worker agent definition and get its own worktree.
Creating team "sprint-work"...
Spawning teammate "grace-period" for bd-c4a1...
Worktree: work/bd-c4a1-payment-grace-period
Card claimed.
Spawning teammate "ssn-fix" for bd-e2f7...
Worktree: work/bd-e2f7-fix-ssn-validation
Card claimed.
Spawning teammate "csv-export" for bd-b8d3...
Worktree: work/bd-b8d3-csv-export-transactions
Card claimed.
Team "sprint-work" running. Ctrl+T to toggle task list. Shift+Down to
cycle between teammates.
Monitor progress
The shared task list shows what each teammate is working on. Ctrl+T toggles it:
Tasks:
[in_progress] bd-c4a1: Add payment grace period calculation (grace-period)
[completed] bd-e2f7: Fix SSN validation accepting 000 prefix (ssn-fix)
[in_progress] bd-b8d3: Add CSV export to transaction history (csv-export)
While teammates work, I stay in the lead session. Groom next sprint's cards, explore a design problem, whatever needs thinking. Teammates message the lead if they're stuck or need clarification.
Review completed work
ssn-fix and csv-export have finished. I review each worktree diff from the lead session.
> Show me the diff for ssn-fix's worktree
Commits (main..HEAD):
a3f8c21 Fix SSN validation to reject 000 and 999 prefixes
e7b2d14 Add test cases for invalid SSN prefixes
Changed files:
src/.../customer/validation/SsnValidator.java | 12 ++++++--
src/.../customer/validation/SsnValidatorTest.java | 28 ++++++++++++++++
Small, focused fix. Two files, clear test coverage. Merge it.
> Merge ssn-fix's worktree to main
Merging work/bd-e2f7-fix-ssn-validation into main... done
Closing bd-e2f7... done
Removing worktree... done
Now the CSV export:
> Show me the diff for csv-export's worktree
Commits (main..HEAD):
b1c4e89 Add CSV export endpoint for transaction history
d5a7f23 Add PDF export endpoint for transaction history
f9e1b34 Add export format selection dropdown to UI
Changed files:
12 files changed, 847 insertions(+), 23 deletions(-)
Scope creep. The card said CSV export. The teammate added PDF export and a UI component.
> Reject. Send csv-export a message: "Card asked for CSV export only.
PDF export and UI dropdown are out of scope. Revert those changes
and keep only the CSV export."
Message sent to csv-export. Reopening bd-b8d3...
With Agent Teams, the rejection goes directly to the teammate via SendMessage. The teammate receives the feedback, reverts the out-of-scope work, and resubmits. No re-dispatch needed.
This is a common failure mode: agents are eager to build adjacent features. The tighter the acceptance criteria in the card, the less often this happens.
Where it falls apart
Compound reliability
Each agent at 95% success. Five agents chained: roughly 77%. Multi-agent trades reliability for parallelism. The benefit must justify the overhead.
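The arithmetic behind that figure is a product of probabilities. A quick sketch, assuming each agent succeeds independently with the same probability (real failures are often correlated, so treat this as a rough model):

```python
def chain_success(per_agent: float, agents: int) -> float:
    """Probability that every agent in a chain succeeds, assuming independence."""
    return per_agent ** agents


five_agents = round(chain_success(0.95, 5), 3)  # 0.774
```

The same function shows why the chain length matters more than it feels like it should: every extra handoff multiplies in another chance to fail.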
Context loss between agents
Every handoff is lossy compression. Google Research found 39-70% degradation in sequential multi-agent tasks. Subagents summarize results back to the caller; teammates don't get the lead's conversation history. Isolation prevents context pollution but loses nuance.
Token cost
Multi-agent consumes 2-5x more tokens for equivalent work. No published harness has budget limits per task. /usage monitoring is the best we have. This is an unsolved problem.
Time blindness
From the C compiler case study: Claude can't tell time and will spend hours running tests instead of making progress. The harness needs to print progress infrequently (so test output doesn't flood the context window) and offer fast-test options.
Duplicate work
Without task claiming, multiple agents fix the same bug independently and overwrite each other. I've seen this even with bd's --claim, when two cards touch overlapping files. The C compiler case study hit it at scale with 16 agents targeting the same bug.
The 18-month wall
Without quality gates, the pattern is: early velocity (months 1-3), plateau (4-9), decline (10-15), stall (16-18) as comprehension debt accumulates. CodeRabbit's research found that AI-generated code produces 1.7x more issues than human code, and performance inefficiencies 8x more often. This is why quality gates matter. Without them, the velocity gains are temporary.
The honest tradeoffs
Model lock-in
Claude Code is locked to Claude models. The orchestration layer (sub-agents, agent teams, skills, hooks, worktrees) doesn't exist in other tools. Your model choice is portable (use Claude API keys with aider, opencode, etc.) but the harness is not. No open-source tool today gives you model flexibility and Claude Code's agent stack. If you're invested in this workflow, you're invested in Claude Code.
When to stay in thinking mode
- You're exploring, designing, or grooming. The value is in the conversation.
- One task that needs your full attention and steering.
- Cost constraint. Throughput mode is 2-5x more expensive per equivalent output.
- The work isn't decomposed into independent cards yet. Dispatch without grooming is waste.
The real cost
Anthropic's C compiler project: $20K in API costs for 16 agents producing 100K lines of code. That excludes significant human effort for workflow design, task decomposition, agent management, output review, and integration. Budget for both.
What's next
Today's harness is human-triggered. You run /dispatch when you're ready. The next step is agents that continuously pull from the backlog as cards become ready, with the human as reviewer rather than dispatcher.
The pieces exist: bd ready for discovery, worktrees for isolation, hooks for quality, agent teams for coordination. The missing piece is the continuous loop, and the trust to let it run.
Companies with agentic coding infrastructure report 30-50% acceleration in development cycles. But a February 2026 NBER study of nearly 6,000 executives found 89% of firms report zero productivity change from AI. The gap between those groups isn't model quality. It's the infrastructure around the model.
That's been the consistent lesson: harness design matters as much as prompt design.