Six months ago, I had one Claude Code session open in a terminal tab. It was already a productivity leap — like pair programming with someone who never gets tired.
Three months ago, I had five terminal tabs with five agents and a spreadsheet tracking who was doing what.
Today, I have an architect, a manager, and five engineers coordinated by a kanban board in my terminal. Each agent works in its own git worktree. Nothing merges until tests pass. I watch them all work simultaneously in tmux panes.
Here's every step of that evolution — and the specific pain point that triggered each one.
Stage 1: One Agent, One Terminal
This is where everyone starts. One Claude Code session (or Codex, or Aider). You give it a task, it works, you review, you move on.
It's genuinely great. For a single focused task — "refactor this module," "add this endpoint," "write tests for this function" — one agent is all you need. I was more productive than I'd been in years.
What worked: Deep focus on one task at a time. The agent maintains full context of the conversation. You review everything before it touches main.
What didn't: I'd finish reviewing one task, start the next, and realize the agent was blocked on something I could've resolved an hour ago. Meanwhile, three other tasks sat in my backlog, untouched.
The pain that triggered Stage 2: Waiting. Watching an agent spend 10 minutes on a refactoring task while four independent tasks — different modules, different files, zero dependencies — sat idle. I was the bottleneck, not the AI.
Stage 2: Multiple Tabs, Manual Dispatch
The obvious next step: open more terminals, run more agents.
Terminal 1: Claude Code → working on auth module
Terminal 2: Claude Code → working on API endpoints
Terminal 3: Codex → writing tests for user model
Terminal 4: Aider → updating README
Terminal 5: Claude Code → refactoring database layer
Instant 5x parallelism, right? Not exactly.
What worked: Tasks did run in parallel. On a good day, I'd close five tasks in the time it used to take to close two.
What didn't: Everything else.
- I became the dispatcher. Which agent is working on what? Is anyone idle? Did I already assign that task? I started a spreadsheet.
- They stomped on each other's files. Agent 1 edits `src/auth.rs` while Agent 3 is also editing `src/auth.rs`. Merge conflict. Two hours of resolution.
- Nobody checked the tests. An agent says "Done!" and I'd merge it, only to find out three tasks later that the test suite was broken. I didn't know which agent broke it.
- Context switching killed me. Tab 1 needs a review. Tab 3 is asking a question. Tab 5 is stuck. I'm spending more time coordinating than thinking.
The pain that triggered Stage 3: A particularly bad Tuesday where two agents edited the same file, their changes conflicted, I manually resolved the conflict, and the result broke the test suite. I spent more time cleaning up than I would've spent just doing the work myself.
Stage 3: Worktree Isolation
The file conflict problem has a clean solution: git worktrees. Each agent gets its own complete working copy on its own branch. They literally can't see each other's changes.
# Set up isolated directories for each agent
git worktree add ./agent-1 -b agent-1/auth-module
git worktree add ./agent-2 -b agent-2/api-endpoints
git worktree add ./agent-3 -b agent-3/user-tests
Now Agent 1 works in ./agent-1/, Agent 2 works in ./agent-2/, and so on. Separate filesystems. Separate branches. No conflicts during active work.
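The other half of the lifecycle is folding a finished branch back into main and retiring its worktree. A minimal sketch; `retire_worktree` is just my name for the sequence, not a git subcommand:

```shell
#!/bin/sh
# Merge a finished agent branch into main, then retire its worktree.
# Run from the main repository checkout, one agent at a time.
retire_worktree() {
  branch=$1   # e.g. agent-1/auth-module
  dir=$2      # e.g. ./agent-1
  git checkout main &&
    git merge --no-ff -m "Merge $branch" "$branch" &&
    git worktree remove "$dir" &&
    git branch -d "$branch"
}
```

Because merges happen one branch at a time, `git branch -d` is safe here: the branch is fully contained in main by the time it runs.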
What worked: File conflicts vanished overnight. Each agent works in complete isolation. I could run tests in each worktree independently — Agent 2's bugs don't break Agent 1's tests.
What didn't: I was still the dispatcher, still the test runner, still the merge coordinator. Worktrees solved the isolation problem but not the orchestration problem. And merge sequencing — when three agents finish at the same time, who goes first? — was manual and error-prone.
The pain that triggered Stage 4: Managing worktrees, checking test results, sequencing merges, and dispatching new tasks manually. The spreadsheet had grown to three tabs. I was spending half my time on coordination, not code review.
Stage 4: Kanban Board + Test Gating
Two changes made the biggest difference:
First: a Markdown kanban board for dispatch.
## TODO
- [ ] Add rate limiting to API endpoints
- [ ] Implement email verification flow
## IN PROGRESS
- [x] Refactor auth module (agent-1)
- [x] Write user model tests (agent-3)
## DONE
- [x] Update README (agent-4)
- [x] Add JWT endpoint (agent-2)
Instead of me dispatching tasks to agents, agents read the board. They claim a task, update the status, work, and mark it done. No messages between agents. No coordination tokens. Just file I/O on a Markdown file.
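The claim step itself can be plain text manipulation. Here's a sketch of one way an agent could claim a task, assuming the board format above; `claim_task` is illustrative, not part of any tool, and should run under a lock so two agents can't grab the same line:

```shell
#!/bin/sh
# Claim the first unclaimed task: move it from ## TODO to ## IN PROGRESS,
# tagged with the claiming agent's name. Pure file I/O on the board.
claim_task() {
  board=$1
  agent=$2
  # First "- [ ]" line inside the TODO section
  task=$(awk '/^## TODO/{t=1;next} /^## /{t=0} t && /^- \[ \]/{print;exit}' "$board")
  [ -n "$task" ] || return 1          # nothing left to claim
  desc=${task#"- [ ] "}
  awk -v task="$task" -v entry="- [x] $desc ($agent)" '
    $0 == task && !moved { moved=1; next }   # drop the line from TODO
    { print }
    /^## IN PROGRESS/ { print entry }        # re-add it, claimed, under IN PROGRESS
  ' "$board" > "$board.tmp" && mv "$board.tmp" "$board"
  printf '%s\n' "$desc"
}
```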
Second: test gating.
Nothing merges until tests pass. Not "the agent says tests pass." Actually pass. I run the full test suite in the agent's worktree, and only accept exit code 0.
When tests fail, I send the last 50 lines of output back to the agent. It fixes its own mistakes most of the time. Two retries, then escalate to me.
# The rule that changed everything
cd ./agent-1
cargo test
# Exit 0? Merge. Exit 1? Agent fixes it.
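By this stage I'd scripted the retry loop by hand. A minimal sketch of that gate, with two assumptions: `TEST_CMD` stands in for the suite command (`cargo test` here), and `send_to_agent` is a hypothetical hook for however you pipe output back to an agent:

```shell
#!/bin/sh
# Test gate with two retries: run the suite in the agent's worktree,
# feed the last 50 lines of failing output back, escalate on the third failure.
TEST_CMD="${TEST_CMD:-cargo test}"

gate() {
  worktree=$1
  attempt=1
  while [ "$attempt" -le 3 ]; do
    if (cd "$worktree" && $TEST_CMD > test.log 2>&1); then
      echo "PASS: $worktree is ready to merge"
      return 0
    fi
    if [ "$attempt" -lt 3 ]; then
      # Retry: send the failure tail back to the agent
      tail -n 50 "$worktree/test.log" | send_to_agent "$worktree"
    fi
    attempt=$((attempt + 1))
  done
  echo "ESCALATE: $worktree failed 3 runs, needs a human"
  return 1
}
```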
What worked: The kanban board eliminated the spreadsheet. Test gating caught ~80% of agent-introduced bugs before I looked at the diff. I was finally reviewing tested code instead of debugging raw output.
What didn't: I was still manually running these scripts. `cd agent-1 && cargo test && cd .. && cd agent-2 && cargo test` gets old fast. The workflow was right, but the automation wasn't there.
The pain that triggered Stage 5: Knowing exactly what the workflow should be — kanban dispatch, worktree isolation, test gating, sequential merge — and doing it all manually. This was begging to be automated.
Stage 5: Hierarchical Teams
This is where I landed. An architect plans the work. A manager dispatches tasks. Engineers execute in isolation. Tests gate everything.
# .batty/team_config/team.yaml — the "simple" template
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    prompt: architect.md
    talks_to: [manager]
  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    prompt: manager.md
    talks_to: [architect, engineer]
  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    prompt: engineer.md
    talks_to: [manager]
    use_worktrees: true
One command launches the entire team in tmux:
cargo install batty-cli
batty init --template simple
batty start --attach
Each agent gets its own tmux pane. The architect is on the left, the manager in the center, three engineers on the right. I can watch all of them work simultaneously, scroll through any agent's history, or detach the session entirely and come back later.
# Send a task to the architect
batty send architect "Build a REST API with JWT auth and user registration"
The architect breaks it into subtasks. Tasks land on the kanban board. The manager dispatches to available engineers. Each engineer picks up a task, creates a branch in its worktree, and starts coding. When an engineer finishes, tests run automatically. Pass? Ready to merge. Fail? Output goes back to the engineer for retry.
What works:
- Architect quality is the multiplier. A good task breakdown means engineers work independently without stepping on each other. I spend more time on the `architect.md` prompt than any other part of the setup.
- Test gating is automatic. I don't manually run tests. Batty runs them in the worktree when the engineer reports done. Two retries, then escalate.
- Merge serialization is handled. File lock, 60-second timeout, sequential merge. No race conditions.
- Mixed agents work. The architect can be Claude Code (best at planning), engineers can be Codex (fast at execution). Each tool's strength covers the others' gaps.
- Everything is files. YAML config, Markdown kanban, Maildir-style inboxes, JSONL event logs. I can `git diff` my team's entire state.
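The merge-lock pattern is easy to reproduce outside Batty with `flock(1)` from util-linux (so Linux-specific). A sketch, with `merge_with_lock` as my own illustration rather than Batty's internals:

```shell
#!/bin/sh
# Serialize merges to main: whoever holds .merge.lock merges; everyone
# else blocks for up to 60 seconds, then gives up and retries later.
merge_with_lock() {
  branch=$1
  (
    flock -w 60 9 || { echo "lock timeout on $branch, retry later"; exit 1; }
    git checkout main &&
      git merge --no-ff -m "Merge $branch" "$branch"
  ) 9> .merge.lock
}
```

The file descriptor redirect (`9> .merge.lock`) scopes the lock to the subshell, so it releases automatically when the merge finishes or fails.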
What I'd Do Differently
If I were starting over:
Start with pair, not simple
Batty has eight built-in templates, from solo (one agent, no hierarchy) to large (19 agents with three management layers). I jumped straight to simple (architect + manager + 3 engineers) because it sounded right.
Wrong. Start with pair — one architect and one engineer:
# pair template — start here
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    prompt: architect.md
    talks_to: [engineer]
  - name: engineer
    role_type: engineer
    agent: codex
    instances: 1
    prompt: engineer.md
    talks_to: [architect]
    use_worktrees: true
Learn the architect-engineer dynamic with one pair before scaling to a team. Get the prompts right. Understand the merge flow. Then add more engineers.
Invest in the architect prompt early
The architect is the single highest-leverage component. A bad architect produces vague tasks like "implement the backend." A good architect produces: "Create POST /api/users endpoint accepting {email, password}, validate email format, hash password with bcrypt, return 201 with user ID, write tests for success case and duplicate email case."
The difference in engineer output quality is staggering. I should have spent the first week refining `architect.md` instead of the first month.
Don't skip stages
Each stage taught me something I needed for the next one:
- Stage 1 taught me what agents are good at (focused tasks) and bad at (self-coordination)
- Stage 2 taught me the specific failure modes of naive parallelism
- Stage 3 taught me that isolation is more important than communication
- Stage 4 taught me that test gating is non-negotiable
- Stage 5 automated what I'd learned by hand
If I'd jumped straight to Stage 5, I wouldn't understand why the architecture works. And when something breaks — it still breaks sometimes — I wouldn't know how to debug it.
The Honest Truth
I want to be clear about what multi-agent orchestration is and isn't.
It is: A way to supervise five workstreams instead of doing one. You're a tech lead, not a typist. You review architecture decisions, redirect when agents go off-track, and unblock when someone gets stuck.
It is not: Fire and forget. Autonomous coding. "AI does the work while I sleep." The agents need supervision. Good supervision. The kind that comes from understanding the codebase, the requirements, and what "good" looks like.
The productivity gain is real — 3-4x throughput on parallelizable tasks — but it requires a new skill. The skill of managing agents is closer to managing junior developers than it is to writing code. You need to:
- Decompose work into independent, well-scoped tasks
- Write prompts that are specific enough to be actionable
- Review output for architecture, not syntax
- Know when to intervene and when to let the agent figure it out
If you're already a good tech lead or senior engineer, this skill translates directly. If you're earlier in your career, start with one agent and level up naturally. The stages aren't something to skip — they're something to learn.
Where I Am Now
My current setup for most projects:
Architect (Claude Code) → plans work, breaks into tasks
Manager (Claude Code) → dispatches, monitors, unblocks
Engineer x3 (Codex) → executes in isolated worktrees
Test gate → cargo test before merge
Merge lock → sequential merge to main
For larger projects, the software template adds a second manager and splits engineers by domain (backend/frontend). For quick fixes, the pair template with one architect and one engineer is plenty.
The kanban board, the worktree isolation, the test gates, the merge sequencing — these aren't Batty features. They're workflow patterns. You can build them yourself with bash scripts and git commands (I did, for months, at Stage 4). Batty just wraps them into a single `batty start` and handles the plumbing.
Getting Started
If you see yourself at one of the stages above and want to jump ahead:
# Install
cargo install batty-cli
# Pick a template that matches your current stage:
batty init --template solo # Stage 1: one agent, no hierarchy
batty init --template pair # Stage 3-4: architect + engineer
batty init --template simple # Stage 5: architect + manager + 3 engineers
# Launch
batty start --attach
# Send a task
batty send architect "Your task description here"
What stage are you at? And what's the pain point that's pushing you to the next one? I'm genuinely curious — the evolution looks different for everyone, and I learn something from every setup I see. Drop a comment.
Batty is open source, built in Rust, and published on crates.io. GitHub: github.com/battysh/batty. Demo: 2-minute walkthrough.