Six months ago, I had one Claude Code session open in a terminal tab. It was already a productivity leap — like pair programming with someone who never gets tired.
Three months ago, I had five terminal tabs with five agents and a spreadsheet tracking who was doing what.
Today, I have an architect, a manager, and five engineers coordinated by a kanban board in my terminal. Each agent works in its own git worktree. Nothing merges until tests pass. I watch them all work simultaneously in tmux panes.
Here's every step of that evolution — and the specific pain point that triggered each one.
Stage 1: One Agent, One Terminal
This is where everyone starts. One Claude Code session (or Codex, or Aider). You give it a task, it works, you review, you move on.
It's genuinely great. For a single focused task — "refactor this module," "add this endpoint," "write tests for this function" — one agent is all you need. I was more productive than I'd been in years.
What worked: Deep focus on one task at a time. The agent maintains full context of the conversation. You review everything before it touches main.
What didn't: I'd finish reviewing one task, start the next, and realize the agent was blocked on something I could've resolved an hour ago. Meanwhile, three other tasks sat in my backlog, untouched.
The pain that triggered Stage 2: Waiting. Watching an agent spend 10 minutes on a refactoring task while four independent tasks — different modules, different files, zero dependencies — sat idle. I was the bottleneck, not the AI.
Stage 2: Multiple Tabs, Manual Dispatch
The obvious next step: open more terminals, run more agents.
Terminal 1: Claude Code → working on auth module
Terminal 2: Claude Code → working on API endpoints
Terminal 3: Codex → writing tests for user model
Terminal 4: Aider → updating README
Terminal 5: Claude Code → refactoring database layer
Instant 5x parallelism, right? Not exactly.
What worked: Tasks did run in parallel. On a good day, I'd close five tasks in the time it used to take to close two.
What didn't: Everything else.
- I became the dispatcher. Which agent is working on what? Is anyone idle? Did I already assign that task? I started a spreadsheet.
- They stomped on each other's files. Agent 1 edits `src/auth.rs` while Agent 3 is also editing `src/auth.rs`. Merge conflict. Two hours of resolution.
- Nobody checked the tests. An agent says "Done!" and I'd merge it, only to find out three tasks later that the test suite was broken. I didn't know which agent broke it.
- Context switching killed me. Tab 1 needs a review. Tab 3 is asking a question. Tab 5 is stuck. I'm spending more time coordinating than thinking.
The pain that triggered Stage 3: A particularly bad Tuesday where two agents edited the same file, their changes conflicted, I manually resolved the conflict, and the result broke the test suite. I spent more time cleaning up than I would've spent just doing the work myself.
Stage 3: Worktree Isolation
The file conflict problem has a clean solution: git worktrees. Each agent gets its own complete working copy on its own branch. They literally can't see each other's changes.
# Set up isolated directories for each agent
git worktree add ./agent-1 -b agent-1/auth-module
git worktree add ./agent-2 -b agent-2/api-endpoints
git worktree add ./agent-3 -b agent-3/user-tests
Now Agent 1 works in ./agent-1/, Agent 2 works in ./agent-2/, and so on. Separate filesystems. Separate branches. No conflicts during active work.
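The other half of the lifecycle is folding a finished branch back into main and retiring its worktree. A minimal sketch; `retire_worktree` is just my name for the sequence, not a git subcommand:

```shell
#!/bin/sh
# Merge a finished agent branch into main, then retire its worktree.
# Run from the main repository checkout, one agent at a time.
retire_worktree() {
  branch=$1   # e.g. agent-1/auth-module
  dir=$2      # e.g. ./agent-1
  git checkout main &&
    git merge --no-ff -m "Merge $branch" "$branch" &&
    git worktree remove "$dir" &&
    git branch -d "$branch"
}
```

Because merges happen one branch at a time, `git branch -d` is safe here: the branch is fully contained in main by the time it runs.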
What worked: File conflicts vanished overnight. Each agent works in complete isolation. I could run tests in each worktree independently — Agent 2's bugs don't break Agent 1's tests.
What didn't: I was still the dispatcher, still the test runner, still the merge coordinator. Worktrees solved the isolation problem but not the orchestration problem. And merge sequencing — when three agents finish at the same time, who goes first? — was manual and error-prone.
The pain that triggered Stage 4: Managing worktrees, checking test results, sequencing merges, and dispatching new tasks manually. The spreadsheet had grown to three tabs. I was spending half my time on coordination, not code review.
Stage 4: Kanban Board + Test Gating
Two changes made the biggest difference:
First: a Markdown kanban board for dispatch.
## TODO
- [ ] Add rate limiting to API endpoints
- [ ] Implement email verification flow
## IN PROGRESS
- [x] Refactor auth module (agent-1)
- [x] Write user model tests (agent-3)
## DONE
- [x] Update README (agent-4)
- [x] Add JWT endpoint (agent-2)
Instead of me dispatching tasks to agents, agents read the board. They claim a task, update the status, work, and mark it done. No messages between agents. No coordination tokens. Just file I/O on a Markdown file.
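The claim step itself can be plain text manipulation. Here's a sketch of one way an agent could claim a task, assuming the board format above; `claim_task` is illustrative, not part of any tool, and should run under a lock so two agents can't grab the same line:

```shell
#!/bin/sh
# Claim the first unclaimed task: move it from ## TODO to ## IN PROGRESS,
# tagged with the claiming agent's name. Pure file I/O on the board.
claim_task() {
  board=$1
  agent=$2
  # First "- [ ]" line inside the TODO section
  task=$(awk '/^## TODO/{t=1;next} /^## /{t=0} t && /^- \[ \]/{print;exit}' "$board")
  [ -n "$task" ] || return 1          # nothing left to claim
  desc=${task#"- [ ] "}
  awk -v task="$task" -v entry="- [x] $desc ($agent)" '
    $0 == task && !moved { moved=1; next }   # drop the line from TODO
    { print }
    /^## IN PROGRESS/ { print entry }        # re-add it, claimed, under IN PROGRESS
  ' "$board" > "$board.tmp" && mv "$board.tmp" "$board"
  printf '%s\n' "$desc"
}
```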
Second: test gating.
Nothing merges until tests pass. Not "the agent says tests pass." Actually pass. I run the full test suite in the agent's worktree, and only accept exit code 0.
When tests fail, I send the last 50 lines of output back to the agent. It fixes its own mistakes most of the time. Two retries, then escalate to me.
# The rule that changed everything
cd ./agent-1
cargo test
# Exit 0? Merge. Exit 1? Agent fixes it.
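By this stage I'd scripted the retry loop by hand. A minimal sketch of that gate, with two assumptions: `TEST_CMD` stands in for the suite command (`cargo test` here), and `send_to_agent` is a hypothetical hook for however you pipe output back to an agent:

```shell
#!/bin/sh
# Test gate with two retries: run the suite in the agent's worktree,
# feed the last 50 lines of failing output back, escalate on the third failure.
TEST_CMD="${TEST_CMD:-cargo test}"

gate() {
  worktree=$1
  attempt=1
  while [ "$attempt" -le 3 ]; do
    if (cd "$worktree" && $TEST_CMD > test.log 2>&1); then
      echo "PASS: $worktree is ready to merge"
      return 0
    fi
    if [ "$attempt" -lt 3 ]; then
      # Retry: send the failure tail back to the agent
      tail -n 50 "$worktree/test.log" | send_to_agent "$worktree"
    fi
    attempt=$((attempt + 1))
  done
  echo "ESCALATE: $worktree failed 3 runs, needs a human"
  return 1
}
```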
What worked: The kanban board eliminated the spreadsheet. Test gating caught ~80% of agent-introduced bugs before I looked at the diff. I was finally reviewing tested code instead of debugging raw output.
What didn't: I was still manually running these scripts. `cd agent-1 && cargo test && cd .. && cd agent-2 && cargo test` gets old fast. The workflow was right, but the automation wasn't there.
The pain that triggered Stage 5: Knowing exactly what the workflow should be — kanban dispatch, worktree isolation, test gating, sequential merge — and doing it all manually. This was begging to be automated.
Stage 5: Hierarchical Teams
This is where I landed. An architect plans the work. A manager dispatches tasks. Engineers execute in isolation. Tests gate everything.
# .batty/team_config/team.yaml — the "simple" template
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    prompt: architect.md
    talks_to: [manager]
  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    prompt: manager.md
    talks_to: [architect, engineer]
  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    prompt: engineer.md
    talks_to: [manager]
    use_worktrees: true
One command launches the entire team in tmux:
cargo install batty-cli
batty init --template simple
batty start --attach
Each agent gets its own tmux pane. The architect is on the left, the manager in the center, three engineers on the right. I can watch all of them work simultaneously, scroll through any agent's history, or detach the session entirely and come back later.
# Send a task to the architect
batty send architect "Build a REST API with JWT auth and user registration"
The architect breaks it into subtasks. Tasks land on the kanban board. The manager dispatches to available engineers. Each engineer picks up a task, creates a branch in its worktree, and starts coding. When an engineer finishes, tests run automatically. Pass? Ready to merge. Fail? Output goes back to the engineer for retry.
What works:
- Architect quality is the multiplier. A good task breakdown means engineers work independently without stepping on each other. I spend more time on the `architect.md` prompt than any other part of the setup.
- Test gating is automatic. I don't manually run tests. Batty runs them in the worktree when the engineer reports done. Two retries, then escalate.
- Merge serialization is handled. File lock, 60-second timeout, sequential merge. No race conditions.
- Mixed agents work. The architect can be Claude Code (best at planning), engineers can be Codex (fast at execution). Each tool's strength covers the others' gaps.
- Everything is files. YAML config, Markdown kanban, Maildir-style inboxes, JSONL event logs. I can `git diff` my team's entire state.
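The merge-lock pattern is easy to reproduce outside Batty with `flock(1)` from util-linux (so Linux-specific). A sketch, with `merge_with_lock` as my own illustration rather than Batty's internals:

```shell
#!/bin/sh
# Serialize merges to main: whoever holds .merge.lock merges; everyone
# else blocks for up to 60 seconds, then gives up and retries later.
merge_with_lock() {
  branch=$1
  (
    flock -w 60 9 || { echo "lock timeout on $branch, retry later"; exit 1; }
    git checkout main &&
      git merge --no-ff -m "Merge $branch" "$branch"
  ) 9> .merge.lock
}
```

The file descriptor redirect (`9> .merge.lock`) scopes the lock to the subshell, so it releases automatically when the merge finishes or fails.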
What I'd Do Differently
If I were starting over:
Start with pair, not simple
Batty has eight built-in templates, from solo (one agent, no hierarchy) to large (19 agents with three management layers). I jumped straight to simple (architect + manager + 3 engineers) because it sounded right.
Wrong. Start with pair — one architect and one engineer:
# pair template — start here
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    prompt: architect.md
    talks_to: [engineer]
  - name: engineer
    role_type: engineer
    agent: codex
    instances: 1
    prompt: engineer.md
    talks_to: [architect]
    use_worktrees: true
Learn the architect-engineer dynamic with one pair before scaling to a team. Get the prompts right. Understand the merge flow. Then add more engineers.
Invest in the architect prompt early
The architect is the single highest-leverage component. A bad architect produces vague tasks like "implement the backend." A good architect produces: "Create POST /api/users endpoint accepting {email, password}, validate email format, hash password with bcrypt, return 201 with user ID, write tests for success case and duplicate email case."
The difference in engineer output quality is staggering. I should have spent the first week refining `architect.md` instead of the first month.
Don't skip stages
Each stage taught me something I needed for the next one:
- Stage 1 taught me what agents are good at (focused tasks) and bad at (self-coordination)
- Stage 2 taught me the specific failure modes of naive parallelism
- Stage 3 taught me that isolation is more important than communication
- Stage 4 taught me that test gating is non-negotiable
- Stage 5 automated what I'd learned by hand
If I'd jumped straight to Stage 5, I wouldn't understand why the architecture works. And when something breaks — it still breaks sometimes — I wouldn't know how to debug it.
The Honest Truth
I want to be clear about what multi-agent orchestration is and isn't.
It is: A way to supervise five workstreams instead of doing one. You're a tech lead, not a typist. You review architecture decisions, redirect when agents go off-track, and unblock when someone gets stuck.
It is not: Fire and forget. Autonomous coding. "AI does the work while I sleep." The agents need supervision. Good supervision. The kind that comes from understanding the codebase, the requirements, and what "good" looks like.
The productivity gain is real — 3-4x throughput on parallelizable tasks — but it requires a new skill. The skill of managing agents is closer to managing junior developers than it is to writing code. You need to:
- Decompose work into independent, well-scoped tasks
- Write prompts that are specific enough to be actionable
- Review output for architecture, not syntax
- Know when to intervene and when to let the agent figure it out
If you're already a good tech lead or senior engineer, this skill translates directly. If you're earlier in your career, start with one agent and level up naturally. The stages aren't something to skip — they're something to learn.
Where I Am Now
My current setup for most projects:
Architect (Claude Code) → plans work, breaks into tasks
Manager (Claude Code) → dispatches, monitors, unblocks
Engineer x3 (Codex) → executes in isolated worktrees
Test gate → cargo test before merge
Merge lock → sequential merge to main
For larger projects, the software template adds a second manager and splits engineers by domain (backend/frontend). For quick fixes, the pair template with one architect and one engineer is plenty.
The kanban board, the worktree isolation, the test gates, the merge sequencing — these aren't Batty features. They're workflow patterns. You can build them yourself with bash scripts and git commands (I did, for months, at Stage 4). Batty just wraps them into a single `batty start` and handles the plumbing.
Getting Started
If you see yourself at one of the stages above and want to jump ahead:
# Install
cargo install batty-cli
# Pick a template that matches your current stage:
batty init --template solo # Stage 1: one agent, no hierarchy
batty init --template pair # Stage 3-4: architect + engineer
batty init --template simple # Stage 5: architect + manager + 3 engineers
# Launch
batty start --attach
# Send a task
batty send architect "Your task description here"
What stage are you at? And what's the pain point that's pushing you to the next one? I'm genuinely curious — the evolution looks different for everyone, and I learn something from every setup I see. Drop a comment.
Batty is open source, built in Rust, and published on crates.io. GitHub: github.com/battysh/batty. Demo: 2-minute walkthrough.