I ran 5 AI coding agents in parallel on a real Rust project for a week. Not a demo. Not a toy. A 51K-line codebase with real users.
Here's what happened — with actual numbers.
## The Setup
**Project:** A Rust CLI tool with a daemon, tmux integration, message routing, and a kanban board parser.

**Team configuration:**

- 1 architect (Claude Opus) — plans and decomposes work
- 1 manager (Claude Opus) — dispatches tasks, handles escalations
- 3 engineers (Codex) — parallel execution in isolated worktrees

**Duration:** 5 working days, ~6 hours per day supervised.

**Tasks:** A backlog of features, refactors, and bug fixes that had been accumulating for weeks.
## The Numbers
| Metric | Result |
|---|---|
| Tasks completed | 47 |
| Tasks failed and reassigned | 8 |
| Test gate catches (merge blocked) | 12 |
| Context exhaustions | 3 |
| Merge conflicts | 4 |
| Lines changed | ~8,200 |
| Total time supervised | ~30 hours |
| Estimated sequential time | ~120 hours |
47 tasks in 30 hours of supervision. The same work would have taken me roughly 120 hours doing it sequentially — 4x compression.
## What Worked
### Task decomposition was the multiplier
The architect agent spent the first 30 minutes of each day reading the backlog and decomposing features into independent, testable tasks. This planning phase was the single most valuable step.
Bad decomposition: "Refactor the message routing system." Three engineers attempted overlapping changes and every merge conflicted.
Good decomposition: "Extract delivery retry logic into its own module." "Add timeout configuration to message delivery." "Write tests for Maildir atomic rename." Three independent tasks, zero conflicts.
The quality of the architect's output determined whether the day went smoothly or devolved into conflict resolution.
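One way to make "independent, testable tasks" concrete is to check that no two tasks touch the same files. Here is a minimal Rust sketch of that idea; the `Task` shape and file lists are hypothetical, not the architect's actual output format:

```rust
use std::collections::HashSet;

/// A decomposed task, as the architect might emit it (hypothetical shape).
struct Task {
    name: &'static str,
    /// Files the task is expected to touch.
    files: Vec<&'static str>,
}

/// Two tasks are independent if their expected file sets don't overlap,
/// so they can run in parallel worktrees without a merge conflict.
fn independent(a: &Task, b: &Task) -> bool {
    let fa: HashSet<_> = a.files.iter().collect();
    !b.files.iter().any(|f| fa.contains(f))
}

fn main() {
    let t1 = Task { name: "extract retry logic", files: vec!["src/delivery/retry.rs"] };
    let t2 = Task { name: "add timeout config", files: vec!["src/delivery/config.rs"] };
    let t3 = Task { name: "refactor routing", files: vec!["src/delivery/retry.rs", "src/routing.rs"] };

    // Disjoint file sets: safe to dispatch in parallel.
    println!("{} vs {}: {}", t1.name, t2.name, independent(&t1, &t2));
    // Overlapping file sets: these will collide at merge time.
    println!("{} vs {}: {}", t1.name, t3.name, independent(&t1, &t3));
}
```

A pre-dispatch check like this is cheap and catches the "three engineers, one routing module" failure mode before any agent starts work.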
### Test gating prevented 12 bad merges
12 times, an engineer declared a task complete but the test suite failed. Without test gating, those 12 broken branches would have merged to main, creating cascading failures.
The pattern: engineer produces code that compiles, looks correct, and handles the happy path. But it misses an edge case, breaks an existing test, or introduces a subtle regression. The test gate catches it, sends the failure output back, and the engineer fixes it — usually on the first retry.
Three of those 12 catches were serious: a race condition in merge locking, a missing null check in config parsing, and a test that passed locally but failed because of a hardcoded path. Without the gate, any of these would have cost hours to debug in main.
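The gate itself can be very simple: run the test suite as a subprocess, allow the merge only on a clean exit, and feed the failure output back to the engineer otherwise. A minimal sketch, using `sh -c` as a stand-in for running `cargo test` in the engineer's worktree:

```rust
use std::process::Command;

/// Run the gate command. Ok(()) means the merge may proceed; Err carries
/// the failure output to send back to the engineer. In the real setup
/// the command would be `cargo test` run inside the engineer's worktree.
fn test_gate(cmd: &str) -> Result<(), String> {
    let out = Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .output()
        .expect("failed to spawn test command");
    if out.status.success() {
        Ok(())
    } else {
        // Combine stdout and stderr so the engineer sees the full picture.
        Err(format!(
            "{}{}",
            String::from_utf8_lossy(&out.stdout),
            String::from_utf8_lossy(&out.stderr)
        ))
    }
}

fn main() {
    // Passing suite: merge allowed.
    assert!(test_gate("exit 0").is_ok());
    // Failing suite: merge blocked, output captured for the retry prompt.
    match test_gate("echo 'test delivery::retry failed' >&2; exit 1") {
        Err(output) => println!("merge blocked, feedback:\n{output}"),
        Ok(()) => unreachable!(),
    }
}
```

The key design point is that the gate's verdict comes from exit status, not from the agent's own claim that the task is done.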
### Worktree isolation eliminated file conflicts (mostly)
Each engineer worked in its own git worktree on its own branch. During active work, there were zero file conflicts. Engineers could edit the same files simultaneously without knowing about each other.
Conflicts only appeared at merge time — 4 total across 47 tasks. All were straightforward to resolve because only one branch was being merged at a time (serialized with a file lock).
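The serialization step can be done with a coarse lock file: `create_new` fails atomically if the file already exists, so only one merge proceeds at a time. A sketch of that mechanism (the lock path is illustrative; a production setup might prefer `flock(2)` via a crate):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

/// Try to take the merge lock. `create_new` is atomic: it fails with
/// AlreadyExists if another merge currently holds the lock.
fn try_acquire_merge_lock(path: &Path) -> std::io::Result<bool> {
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(_) => Ok(true),
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}

fn release_merge_lock(path: &Path) -> std::io::Result<()> {
    std::fs::remove_file(path)
}

fn main() -> std::io::Result<()> {
    let lock = std::env::temp_dir().join("merge.lock");
    let _ = std::fs::remove_file(&lock); // start clean for the demo

    assert!(try_acquire_merge_lock(&lock)?);  // first merge proceeds
    assert!(!try_acquire_merge_lock(&lock)?); // concurrent merge must wait
    release_merge_lock(&lock)?;               // next branch may merge now
    Ok(())
}
```

With merges serialized, each conflict is resolved against an up-to-date main, which is why all four were straightforward.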
## What Broke
### Context exhaustion on complex tasks
Three times, an engineer hit the context window limit mid-task. Each time, the pattern was the same: a task that seemed simple but required reading many files to understand the full picture.
The worst case: "Update the error handling to use typed errors throughout." The engineer started reading error types, then the modules that used them, then the modules that called those modules. By the time it understood the scope, the context window was nearly full and the actual changes were shallow and incomplete.
Fix: Break broad refactors into per-module tasks. "Add typed errors to the delivery module" fits in one context window. "Add typed errors everywhere" does not.
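The per-module split is mechanical enough to automate. A tiny sketch of how an architect might fan a broad goal out into per-module tasks (the module names are illustrative):

```rust
/// Split one broad refactor into per-module tasks so each fits in a
/// single context window.
fn per_module_tasks(goal: &str, modules: &[&str]) -> Vec<String> {
    modules
        .iter()
        .map(|m| format!("{goal} in the {m} module"))
        .collect()
}

fn main() {
    let tasks = per_module_tasks("Add typed errors", &["delivery", "routing", "kanban"]);
    for t in &tasks {
        println!("- {t}");
    }
}
```

Each resulting task only requires reading one module's worth of files, which is what keeps the agent's context budget for actual changes rather than exploration.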
### The architect occasionally over-decomposed
On day 3, the architect broke a single feature into 11 tasks. Three of the tasks were trivial one-liners that took more effort to dispatch, execute, test, and merge than they would have taken to do manually.
Fix: Set a minimum complexity threshold. If a task takes less than 5 minutes for a human, it's not worth the orchestration overhead. Batch trivial changes into a single "cleanup" task.
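That threshold rule is easy to express in code. A sketch, assuming each task carries a rough human-effort estimate in minutes (the 5-minute cutoff matches the rule above; the data shape is an assumption):

```rust
/// Partition tasks by estimated human effort: anything under the
/// threshold is folded into a single batched cleanup task instead of
/// being dispatched through the full orchestration pipeline.
fn partition(tasks: Vec<(String, u32)>, threshold_min: u32) -> (Vec<String>, Option<String>) {
    let (trivial, real): (Vec<_>, Vec<_>) =
        tasks.into_iter().partition(|(_, est)| *est < threshold_min);
    let batch = if trivial.is_empty() {
        None
    } else {
        Some(format!(
            "cleanup: {}",
            trivial.iter().map(|(n, _)| n.as_str()).collect::<Vec<_>>().join("; ")
        ))
    };
    (real.into_iter().map(|(n, _)| n).collect(), batch)
}

fn main() {
    let tasks = vec![
        ("fix typo in help text".into(), 2),
        ("rename config field".into(), 3),
        ("extract retry module".into(), 45),
    ];
    let (dispatch, batch) = partition(tasks, 5);
    println!("dispatch individually: {dispatch:?}");
    println!("batched: {batch:?}");
}
```

The one-liners still get done; they just ride along in one task instead of eating three rounds of dispatch-test-merge overhead.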
### One engineer got stuck in a retry loop
An engineer hit a failing test, attempted to fix it, introduced a new failure, attempted to fix that, and looped for 40 minutes. The test gate correctly blocked the merge each time, but the agent didn't know how to step back and reconsider its approach.
Fix: After 2 failed retries, escalate to the manager instead of letting the engineer continue. The manager can provide fresh perspective or reassign the task. Batty now enforces this automatically.
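The escalation policy is a small state machine. A sketch that mirrors the rule described above; the names are illustrative, not Batty's actual API:

```rust
/// What the orchestrator does after a test-gate failure.
enum Outcome {
    /// Send the failure output back and let the engineer try again.
    Retry(u32),
    /// Hand the task to the manager for a fresh look or reassignment.
    Escalate,
}

/// After MAX_RETRIES failed attempts, stop retrying and escalate,
/// rather than letting the engineer loop on its own broken approach.
fn on_gate_failure(retries_so_far: u32) -> Outcome {
    const MAX_RETRIES: u32 = 2;
    if retries_so_far < MAX_RETRIES {
        Outcome::Retry(retries_so_far + 1)
    } else {
        Outcome::Escalate
    }
}

fn main() {
    assert!(matches!(on_gate_failure(0), Outcome::Retry(1)));
    assert!(matches!(on_gate_failure(1), Outcome::Retry(2)));
    assert!(matches!(on_gate_failure(2), Outcome::Escalate));
    println!("policy: 2 retries, then escalate to the manager");
}
```

Capping at two retries would have turned that 40-minute loop into roughly two failed attempts plus a manager handoff.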
## What Surprised Me
### The 3-5 engineer sweet spot is real
With 3 engineers, merge conflicts were rare and supervision was comfortable. With 5, conflicts increased and I spent more time watching for stuck agents. The codebase — not the tooling — was the bottleneck: too many concurrent changes in a tightly coupled codebase created interference even with worktree isolation.
### Supervision isn't passive
I expected to kick off tasks and check back later. In reality, I checked agent status every 10-15 minutes during the first two hours, then relaxed to every 30 minutes once the pattern was established. The supervision was lightweight but continuous — closer to managing a team than running a batch job.
### The architect agent was the best investment
If I had to choose between 1 architect + 2 engineers or 0 architects + 5 engineers, I'd take the architect every time. Well-decomposed tasks with clear acceptance criteria produced better results than throwing more engineers at vague objectives.
### Token costs were reasonable
Total token cost for the week: approximately $45. Sequential work on the same tasks would have cost roughly $30 (fewer context loads). The 50% cost increase bought a 4x time compression. At any reasonable hourly rate, this is an obvious trade.
## Would I Do It Again?
Yes, with two changes:
- Minimum task complexity threshold. Don't orchestrate tasks that take less than 5 minutes manually.
- Stricter retry limits from day 1. Two retries, then escalate. No exceptions.
The 4x compression was real, the test gating prevented real damage, and the supervision overhead was manageable. Multi-agent development isn't fire-and-forget, but it's a genuine productivity multiplier for anyone willing to supervise.