DEV Community: Batty

I Let an AI Agent Supervisor Run Unattended for 19 Days. Here's What the Telemetry Says.

Batty — Thu, 16 Apr 2026 23:35:26 +0000

I shipped 13 releases of my AI agent supervisor in 14 days while the supervisor was running it.

For 19 days — March 22 to April 10 — Batty managed our own development without a human in the dispatch loop. An operator was on Discord for escalations, but the planning, the test gating, the merges, the worktree management, and the self-healing all ran themselves.

This isn't a "fully autonomous" pitch. A human was on Discord. Things broke. Stalls happened. The interesting question isn't "can AI run itself?" — it's "what kind of telemetry do you need before you trust an agent fleet to run itself for 19 days, and what does that telemetry say at the end?"

We have those numbers. Here they are.

What Batty is, in one paragraph

Batty is supervised agent execution for software teams. It runs agents (Claude Code, Codex, Aider) inside tmux panes, tracks work on a markdown kanban board, gates merges behind a real test run, and surfaces escalations to a human via Discord. The supervisor itself is a Rust daemon. It's open source: github.com/battysh/batty.

19 days, by the numbers

This window is 2026-03-22 → 2026-04-10, pulled from ~/batty/.batty/telemetry.db and cross-checked against git log.

Throughput	Value
Tasks completed end-to-end	102
Per-seat agent completions	150
Tasks auto-merged	20
Task assignments dispatched	411
Peak daily completions (Apr 6)	55
Commits landed in window	456

55 tasks in a single working day is the throughput point. 102 end-to-end completions with a human only handling escalations is the autonomy point. 13 releases in 14 days is the meta point: the supervisor was shipping itself, faster than I would have shipped it manually.

But throughput isn't the story I want to tell. Self-healing is.

Stability	Value
Verification evidence collections	1,304
Verification phase transitions	790
Auto-doctor self-healing actions	258
Task escalations handled	195
Merge confidence scores written	250
Worktree reconciliations	140
State reconciliations	1,052
Daemon heartbeats persisted	989
Agent pane respawns	167
Disk-hygiene cleanups	146

258 self-healing actions is what unattended actually means in practice. The daemon caught and fixed its own stalls 258 times across 19 days without paging me. 195 escalations did get paged — about ten a day — well under the rate where a single operator gives up. 1,304 verification evidence bundles were collected before tasks were allowed near a merge. Every one of those numbers is a moment where the supervisor either kept itself alive or refused to ship something it couldn't justify.

How we keep ourselves honest

A telemetry table is fine. The harder claim is "and we catch the regressions before they ship." That's the part I want to talk about.

v0.11.0 ships a new test surface called the scenario framework. It runs the real TeamDaemon against in-process fake shims (FakeShim + ShimBehavior) on per-test tempdirs. Zero subprocess spawn. Zero tmux. Fully deterministic. 58 scenario tests run on every PR in about 60 seconds.

Twenty-two of those scenarios are prescribed:

1 happy-path scenario
7 regression scenarios — one per recent release bug
14 cross-feature scenarios (worktree corruption, merge conflicts, scope fence violations, ack loops, context exhaustion, silent death, multi-engineer races, disk pressure, stale merge locks, …)

On top of that, a proptest-state-machine fuzz harness runs three targets — fuzz_workflow_happy, fuzz_workflow_with_faults, fuzz_restart_resilience — against ten cross-subsystem invariants on every randomized case.

And cargo test --lib went from 3,369 at v0.10.10 to 3,410 at v0.11.0 (+41 library tests on top of the new 58-test scenario surface). Total: 3,468 tests gated against every PR. Zero warnings on release builds, locked in since 0.10.9.

The test count itself isn't impressive — plenty of projects have more. The interesting part is what each scenario locks in. Every regression scenario started life as a real bug from production telemetry. Each one is now a deterministic replay you can run in 60ms.

Three stories behind three numbers

258 auto-doctor actions → #634 shim restart cooldown

Most of those 258 self-healing actions are boring: a shim looks unresponsive, the daemon respawns it, work continues.

A handful aren't boring. Bug #634: when handle_supervisory_stall fired in src/team/daemon/health/poll_shim.rs, it could re-trigger a second respawn if a stall check happened right after the previous restart. The result was a respawn loop that degraded into repeated orchestrator disconnected / Broken pipe control-plane disconnects. The fix is a stall-restart::{name} cooldown that holds the respawned member as Idle until its freshly-started shim emits its first StateChanged event, with a regression test pinning the cooldown.

You only catch a bug like that if your telemetry tells you the supervisor is fixing the same thing twice.

195 escalations → #612 stale escalation storms

Escalations are how the supervisor pages a human when it can't make forward progress on its own. 195 over 19 days is roughly ten a day — well under the rate where a solo operator burns out.

But early in the window, escalations were much noisier. Bug #612: the inbox digest kept top-billing escalations whose underlying tasks had already moved to done or archived. Stale escalation storms occupied actionable slots that should have been pointing me at real problems. src/team/inbox.rs defines two new helpers — extract_task_ids_from_body and demote_stale_escalations; src/team/messaging.rs wires them into the digest assembly. They demote stale Escalation/Blocker entries whose referenced tasks are all done or archived back to Status.

Telemetry surfaced the rate. Watching the rate told me the rate was wrong.

20 auto-merges → #592 auto-merge gate

Twenty tasks merged to main without me clicking anything. Each one cleared test gating, evidence collection, and merge confidence scoring before landing.

The gate is the interesting part. Bug #592 implements merge_request_skip_reason plus an AutoMergeSkipReason enum with WrongStatus / MissingPacket / NoBranch categories and a full unit-test catalog. Every refusal to auto-merge is categorized and logged; every acceptance writes a confidence score I can audit later. 250 merge confidence scores were written across the window — about five per working day — because every auto-merge candidate is a deliberate decision, not a heuristic shrug.

The "20 auto-merges" stat is the outcome. The 250 confidence scores are the process. I trust the outcome because I can audit the process.

What this run didn't fix

Three honest gaps.

Bot-token rotation can't be solved from code (#598 archived). Discord and Telegram tokens have to be rotated through a provider console. The supervisor will never roll its own credentials. This stays in operator runbooks; no auto-doctor can solve it.

Context exhaustion is still a real shape of failure. The scenario framework has explicit context-exhaustion scenarios because we kept hitting the wall on long-running engineers. We have recovery, but recovery isn't prevention. Better task decomposition is the answer; the supervisor can't decompose its way out of bad framing.

The 10–15 minute productive window. Throughout this 19-day run, I was occasionally restarting the daemon every 10–15 minutes to clear an event-loop freeze. We hadn't root-caused it yet. We did the morning of release day — and the fix is in v0.11.2, shipped the same afternoon as v0.11.0. Every stat above is the previous shape of stability; the current shape is meaningfully better. That's a separate post.

I'm including this as the answer to the obvious question: "how unattended is unattended, really?" Roughly: unattended in the sense that I didn't have to plan tasks, dispatch them, gate them, or merge them. Attended in the sense that I came back and restarted the daemon every so often when the productive window timer ran down. The 19 days are real. So is the asterisk.

Try it

cargo install batty-cli

The 0.11.x Easter release train is on crates.io as of 2026-04-11. v0.11.0 ships the scenario framework. v0.11.1 patches the auto-merge dropped-task bug. v0.11.2 closes the write-timeout pattern that kept producing the 10–15 minute windows.

If your test suite is weak, Batty makes that worse, not better — the merge gate is only as strong as what it gates against. If your tests are real, Batty turns them into a discipline that keeps an agent fleet honest.

It's on GitHub: github.com/battysh/batty. Open source. Reproduce the numbers; the queries are in CHANGELOG.md.

I Built a Chess Engine with 5 AI Agents — Here's What Surprised Me

Batty — Sun, 05 Apr 2026 14:18:36 +0000

I gave five AI coding agents a task: build a chess engine from scratch. One planned the architecture. Three built components in parallel. One supervised everything.

No external chess libraries. No internet lookups. Just agents, a test suite, and a goal: beat Stockfish at 1200 ELO at least 50% of the time.

The engine works. But what surprised me wasn't the output — it was what I learned about supervised AI agent execution along the way.

The Setup

The team looked like this:

roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    talks_to: [manager]

  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    talks_to: [architect, engineer]

  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    use_worktrees: true
    talks_to: [manager]

Five agents. One architect running Opus for planning. Three engineers running Sonnet for implementation. One manager routing work between them. Each engineer got its own git worktree — its own branch, its own directory, completely isolated from the others.

The task board was a Markdown file:

## To Do
- [ ] Implement board representation (bitboard)
- [ ] Implement move generation (legal moves)
- [ ] Implement position evaluation (material + position tables)
- [ ] Implement search (alpha-beta with iterative deepening)
- [ ] Implement UCI protocol interface
- [ ] Write integration tests against known positions

## In Progress

## Done

I typed batty start --attach and watched.

Surprise 1: The Architect Was 10x More Important Than Any Engineer

This was the biggest lesson. I initially thought the engineers — the agents writing actual code — were the bottleneck. They weren't.

The architect was.

A good architecture plan meant engineers could work independently. Board representation, move generation, and evaluation are naturally isolated — they touch different files, use different data structures, and can be tested independently. The architect saw this and decomposed the work accordingly.

When I ran an earlier version with a weaker architecture plan, engineers kept blocking each other. The evaluation agent needed the board representation agent to finish first. The search agent needed both. Three agents, but only one could work at any given time. Parallel in theory, sequential in practice.

The fix was spending more time — and more expensive tokens — on the planning phase. I ran the architect on Opus (the most capable model) and gave it explicit instructions: "Decompose the work so that each engineer can start immediately without waiting for another engineer's output. Define interfaces upfront."

The architect produced a plan with clear module boundaries, shared type definitions in a common types.rs file, and stub implementations that each engineer could code against. All three engineers started within seconds of each other.

Lesson: In supervised AI agent execution, the quality of task decomposition determines everything. A great architect with mediocre engineers outperforms mediocre architecture with great engineers.

Surprise 2: Test Gating Caught Things I Would Have Missed

Every engineer's branch had to pass cargo test before merging. No exceptions. The agent says "done" — the supervisor runs the tests. Exit code 0 means done. Anything else means try again.

Here's what the test gate caught that I wouldn't have noticed in code review:

Off-by-one in move generation. Engineer 2 implemented pawn moves. The code looked correct — clean, well-structured, proper handling of en passant and promotion. But the test suite included known positions from the Perft test suite — positions where the exact number of legal moves is known. Engineer 2's implementation generated 19 moves in a position that should have 20. A missing edge case in castling rights after a rook capture. The kind of bug that passes code review because the logic reads correctly.

The test gate caught it. The agent got the failure output, saw exactly which position failed and by how many moves, and fixed it in the retry.

Type mismatch in evaluation scores. Engineer 3 implemented position evaluation using centipawn scores. The search module expected scores in a different range. Both modules compiled independently. Both had passing unit tests. The integration test — which ran the full engine against a known position — produced moves that were legal but strategically terrible. The engine was maximizing the wrong scale.

Without the test gate, this would have merged. I would have spent an hour debugging "why does the engine sacrifice its queen for no reason" before finding the score scaling issue.

Lesson: Test gates don't just catch bugs. They catch the class of bugs that look correct in isolation but break at integration boundaries. This is exactly where multi-agent systems fail — each agent's work is locally correct but globally broken.

Surprise 3: Five Agents Was the Sweet Spot

I tried three configurations:

Config	Agents	Result
Pair	1 architect + 1 engineer	Works but sequential. ~45 minutes.
Team	1 architect + 3 engineers + 1 manager	Parallel execution. ~18 minutes.
Squad	1 architect + 5 engineers + 1 manager	Merge complexity killed the gains. ~22 minutes.

Going from 1 to 3 engineers was a clear win. Each engineer worked on a different module. Merges were clean because worktree isolation prevented file conflicts, and the architect's decomposition kept modules independent.

Going from 3 to 5 engineers actually slowed things down.

Why? Two reasons:

Merge serialization. Batty merges branches sequentially with a file lock. With 3 engineers finishing around the same time, merges queue briefly but resolve quickly. With 5, the queue backs up. Each merge triggers a test run in the target branch, and later merges sometimes conflict with earlier ones because the codebase has changed underneath them.

Task granularity. A chess engine has about 5-6 natural modules. With 3 engineers, each gets a substantial chunk of work. With 5, you're splitting modules into smaller pieces that have tighter coupling. Engineer 4 needs to implement the UCI protocol, but it depends on the search module (Engineer 3) and the board representation (Engineer 1). The independence that made 3 agents work breaks down at 5.

Lesson: More agents isn't always better. The optimal team size depends on how many truly independent tasks exist. If you have to create artificial boundaries to give agents work, you've gone too far.

Surprise 4: Token Costs Weren't What I Expected

The naive assumption: 5 agents = 5x the cost. The reality was closer to 2x.

Here's why:

Scoped context. Each engineer only loaded the files relevant to its module. Engineer 1 (board representation) never saw the evaluation code. Engineer 3 (evaluation) never saw the UCI protocol. A strict .claudeignore file kept each agent's context to ~25K tokens instead of the full ~80K project context.

Session resets. After each task, the agent got a fresh session. No accumulated conversation history from previous tasks. Clean context = fewer tokens per completion.

Model mixing. The architect ran on Opus (~15x more expensive per token than Sonnet). The engineers ran on Sonnet. Since engineers do 80% of the token-consuming work, the blended cost was much lower than running everything on Opus.

Cost Component	Tokens	Cost
Architect (Opus)	~40K	~$1.20
Engineer 1 (Sonnet)	~60K	~$0.36
Engineer 2 (Sonnet)	~55K	~$0.33
Engineer 3 (Sonnet)	~65K	~$0.39
Manager (Sonnet)	~15K	~$0.09
Total	~235K	~$2.37

A single agent doing the same work sequentially would use ~180K tokens on Opus (~$5.40) because it carries the full context throughout. The multi-agent approach was both faster and cheaper.

Lesson: Multi-agent execution is a cost optimization strategy, not just a speed optimization. Scoped tasks + model mixing + session resets cut costs more than you'd expect.

What the Engine Looks Like

The result: chess_test. A Rust chess engine built entirely by AI agents under supervision.

It's not going to beat Stockfish at full strength. But against Stockfish at 1200 ELO, it wins consistently. The architecture is clean — separate modules for board representation, move generation, evaluation, search, and UCI protocol. Each module has its own test suite.

The interesting thing isn't the engine itself. It's that the development process — supervised AI agent execution with worktree isolation, test gating, and hierarchical task dispatch — produced a codebase that's more modular and better-tested than what I typically get from a single long agent session.

When one agent does everything, it tends to take shortcuts. Shared mutable state. Implicit dependencies. Tests that pass but don't cover edge cases. When multiple agents work in isolation with hard boundaries, the code is forced to be modular because agents literally can't access each other's files.

How to Try This

If you want to run a similar experiment:

cargo install batty-cli
cd your-project
batty init --template team  # architect + 3 engineers + manager

Edit .batty/team_config/team.yaml to configure agents, roles, and the test command. Add tasks to the kanban board. Run batty start --attach and watch agents work in adjacent tmux panes.

The demo video shows the chess engine build from start to finish — architect planning, engineers implementing in parallel, test gates catching bugs, branches merging.

Source: github.com/battysh/batty

The Takeaway

Supervised AI agent execution isn't about making agents faster. It's about making their output trustworthy.

Five agents building a chess engine taught me:

Invest in the architect. Task decomposition quality > agent count. Use your best model for planning.
Test gates are non-negotiable. Agents produce confident, plausible, broken code. Exit code 0 is the cheapest reviewer you'll ever hire.
More agents ≠ better. Match team size to the number of naturally independent tasks. Stop at the boundary where you'd have to create artificial splits.
Multi-agent is a cost play. Scoped context + model mixing + session resets = faster AND cheaper than one expensive agent doing everything sequentially.

The agents didn't surprise me with their code quality. They surprised me with how much the supervision layer — task decomposition, isolation, test gating — determined the outcome.

The code wrote itself. The architecture didn't.

What's the most agents you've run on a single project? Where did the coordination break down? I'm curious whether the 5-agent ceiling holds for other codebases or if it's specific to this kind of project.

Why a Markdown File Beats a Message Bus

Batty — Sun, 05 Apr 2026 14:18:09 +0000

You have five AI coding agents. They need to know what to work on, what's already taken, and what's done. How do you coordinate them?

The popular answer is a message bus. Agents publish and subscribe. They negotiate tasks, share context, broadcast status. It's the architecture you'd find in CrewAI, AutoGen, or any framework with "multi-agent" in the tagline.

I tried that approach. Then I replaced it with a directory of markdown files — kanban-driven task dispatch for AI agents, using the filesystem instead of a broker. It's been six months, and I haven't looked back.

The O(n²) problem with message buses

When agents coordinate through messages, every agent potentially talks to every other agent. Agent A finishes a task and broadcasts "task 27 done." Agents B, C, D, and E all receive it. Agent B claims the next task and broadcasts "I'm taking task 28." Now everyone else needs to hear that, update their state, and avoid claiming the same task.

With 5 agents, that's manageable. With 10, it's 90 potential message pairs. With 20, it's 380. The communication overhead grows as O(n²), and every message is a chance for race conditions, stale state, or lost updates.

Worse: you can't see what's happening. The coordination state lives in flight — in message queues, in-memory buffers, agent context windows. When something goes wrong, you're debugging invisible state.

The O(1) alternative: read a file

Here's how Batty dispatches tasks instead. Every task is a markdown file in a directory:

.batty/board/tasks/
├── 027-add-jwt-auth.md          # status: in-progress, claimed_by: eng-1
├── 028-user-registration.md     # status: todo
├── 029-add-rate-limiting.md     # status: backlog
└── 030-fix-dashboard-css.md     # status: done

Each file has YAML frontmatter for machine-readable fields and a markdown body for the task description:

id: 28
title: User registration endpoint
status: todo
priority: high
depends_on: [27]
claimed_by:
tags: [api, auth]

# User registration endpoint

Add POST /api/register with email validation,
password hashing, and duplicate detection.

## Done when

- Endpoint returns 201 with user object
- Duplicate email returns 409
- Tests cover happy path and validation errors

An agent doesn't subscribe to a topic or negotiate with peers. It reads a file. One file, one read, one task. O(1).

How kanban-driven dispatch actually works

Batty's daemon runs a polling loop — every 10 seconds, it reads the board and makes decisions:

1. Scan the task directory
2. Find idle agents (no active task)
3. For each idle agent, find the highest-priority task that is:
   - status: backlog or todo
   - not claimed by anyone
   - not blocked
   - dependencies resolved (all depends_on tasks are done)
4. Update the task file: status → in-progress, claimed_by → eng-1
5. Launch the agent with the task context

That's the entire dispatch algorithm. Priority sorting is deterministic: critical tasks dispatch before high, high before medium. Ties break by task ID, so the oldest unblocked task wins. The board is always consistent because the daemon updates the file before launching the agent — if the launch fails, the task stays claimed and the daemon retries next cycle.

No message broker. No pub/sub. No consensus protocol. The filesystem is the coordination layer, and grep is your monitoring tool:

# What's in progress right now?
grep -rl "status: in-progress" .batty/board/tasks/

# Who's working on what?
grep -rn "claimed_by:" .batty/board/tasks/*.md

# How many tasks in each status?
grep -rh "^status:" .batty/board/tasks/ | sort | uniq -c

Try doing that with a message bus.

Why agents understand markdown natively

This is the insight that makes the whole approach work: LLMs already know markdown. It's the dominant format in their training data — README files, GitHub issues, documentation, Stack Overflow posts. When you hand an AI coding agent a markdown task file, it reads the title, parses the acceptance criteria, and starts working. No serialization format to teach it. No API client to configure.

Compare this to handing an agent a message from a coordination bus:

{"type": "task_assignment", "payload": {"id": 28, "title": "User registration endpoint", "priority": "high", "context": {"depends_on": [27], "tags": ["api", "auth"]}, "description": "Add POST /api/register..."}}

The agent can parse this, but the format carries no information about the task. It's overhead. The markdown version is the task description — the agent reads it the same way a human developer would read a ticket.

What happens when things go wrong

Message buses need sophisticated error handling. What if a message is lost? What if two agents claim the same task? What if an agent crashes mid-task?

With file-based dispatch, the answers are simple:

Lost updates: Can't happen. The file is on disk. If the daemon crashes mid-write, the file is either updated or it isn't. On restart, the daemon reads the board and picks up where it left off.

Double claims: The daemon is the only writer for dispatch operations. It claims the task (updates claimed_by in the file) before launching the agent. If the launch fails, the claim is already on the board — the daemon retries or escalates. No two agents race for the same task.

Crashed agents: Every poll cycle, the daemon reconciles. If an agent is idle but has an in-progress task, the daemon re-assigns it. If a task is claimed by an agent that no longer exists, it gets unclaimed and returned to the queue. Orphaned state is impossible because the board is always the source of truth.

Dependency violations: Before dispatching task 28, the daemon checks that all tasks in depends_on: [27] have status: done. If task 27 is still in progress, task 28 stays in the queue. No message ordering to worry about — just a field comparison.

What humans can do that message buses can't

Your kanban board is a directory of text files. This means:

Reprioritize on the fly. Open 029-add-rate-limiting.md, change priority: medium to priority: critical. Next dispatch cycle, it jumps the queue. No API call, no admin panel, no "drag the card."

Add context mid-task. An agent is working on task 28, and you realize it needs additional context. Edit the markdown file — add a note, clarify an acceptance criterion, paste a code snippet. The agent reads the updated file on its next reference.

Debug with cat. When something goes wrong at 11pm, the difference between "open a file" and "connect to a monitoring dashboard and reconstruct message flow" is the difference between fixing the problem and going to bed.

Block a task instantly. Add blocked: "waiting on API key from vendor" to the frontmatter. The daemon skips it on every dispatch cycle until you remove the field. No "pause" button to find, no workflow to trigger.

Version-control everything. git log -- .batty/board/tasks/ shows every task creation, status change, and priority shift. git diff shows exactly what changed. git blame shows who changed it. Your project management history is in the same repo as your code, with the same tools.

When this doesn't work

Honest limitations:

Real-time collaboration. If you need agents to share intermediate results — "I just changed the API schema, everyone update your types" — file polling with a 10-second interval isn't fast enough. You need something push-based.

High agent counts. At 50+ agents, scanning a directory of task files every 10 seconds starts to matter. Batty is built for teams of 3-10 agents, not swarms.

Cross-project coordination. If agents span multiple repositories or machines, a shared filesystem isn't available. You'd need a networked coordination layer.

For the typical use case — a developer running 3-8 AI coding agents on a single project — a directory of markdown files handles dispatch better than any message bus I've tried. It's simpler to operate, simpler to debug, and agents read it as naturally as you read a README.

Try it

cargo install batty-cli
batty init
batty up

Define tasks as markdown files. Batty dispatches them to your agents, gates on tests, and moves them through the board. No message bus required.

How does your multi-agent setup handle task coordination? I'm curious whether anyone else has landed on file-based approaches — or if there's a message bus setup that's actually simple to debug.

Links: GitHub | Demo | Docs

How I Run a Team of AI Coding Agents in Parallel

Batty — Sun, 05 Apr 2026 10:56:25 +0000

Running one AI coding agent is productive. Running five in parallel is chaos.

I've been using Claude Code daily for months. It's great — until you realize there are four other tasks sitting idle while you wait for one agent to finish refactoring a module. So you open more terminal tabs, spin up more sessions, and now you're the bottleneck. You're context-switching between agents, resolving merge conflicts they created, and manually checking if anything still compiles.

I spent a week in this mode before I decided there had to be a better way.

The Multi-Agent Problem

Here's what goes wrong when you naively run multiple AI coding agents on the same repo:

They stomp on each other's files. Agent A edits src/auth.rs while Agent B is also editing src/auth.rs. Someone loses.

Nobody checks the tests. An agent says "Done!" but the test suite is failing. You don't find out until three more tasks are stacked on top of the broken one.

You become the dispatcher. Which agent is working on what? Is anyone idle? Did that task actually get assigned? You're doing more coordination than coding.

There's no shared context. Agent A doesn't know Agent B just changed the API interface it depends on. Chaos.

Sound familiar? If you're using Claude Code, Codex, or Aider, and you've ever wanted to run more than one at a time — this is the wall you hit.

How I Solved It

I built Batty — a terminal-native supervisor that turns multiple AI coding agents into a coordinated team. No web UI. No servers. Just your terminal and tmux.

The core idea: instead of a flat pool of agents, you define a hierarchy. An architect agent plans the work. A manager breaks it into tasks. Engineers execute in isolated environments. A kanban board tracks everything. Tests gate completion.

Here's the minimal setup — an architect and three engineers:

# .batty/team_config/team.yaml
name: my-project
board:
  rotation_threshold: 20
standup:
  interval_secs: 600
  output_lines: 40
roles:
  - name: architect
    role_type: architect
    agent: claude
    instances: 1
    prompt: architect.md
    talks_to: [manager]

  - name: manager
    role_type: manager
    agent: claude
    instances: 1
    prompt: manager.md
    talks_to: [architect, engineer]

  - name: engineer
    role_type: engineer
    agent: claude
    instances: 3
    prompt: engineer.md
    talks_to: [manager]
    use_worktrees: true

Three lines that matter:

talks_to — Agents can only communicate with their defined contacts. No free-for-all. The architect talks to the manager, the manager talks to engineers. This prevents the message chaos that kills multi-agent setups.
instances: 3 — Three engineer agents, each in its own tmux pane. Batty names them eng-1-1, eng-1-2, eng-1-3 and manages them independently.
use_worktrees: true — Each engineer works in an isolated git worktree. Their own branch, their own working directory. No merge conflicts during active work.

The Workflow

cargo install kanban-md --locked && cargo install batty-cli
cd my-project
batty init --template simple
batty start --attach

This launches a tmux session. Each agent gets its own pane — you can watch them all work simultaneously, or detach and come back later.

Then you send a task:

batty send architect "Build a REST API with JWT auth and user registration"

Here's what happens:

The architect analyzes the request and breaks it into subtasks
Tasks land on the kanban board (a Markdown file — yes, you can cat it)
The manager dispatches tasks to available engineers
Each engineer picks up a task, creates a branch in its worktree, and starts coding
When an engineer says it's done, Batty runs the test suite
If tests pass, the work is ready to merge. If not, the task goes back to the engineer.

The whole thing is file-based. YAML config, Markdown kanban, Maildir-style inboxes, JSONL event logs. You can git diff your team's entire state.

What I Learned Running This Setup

After a few weeks of daily use, here's what surprised me:

Five parallel agents is the sweet spot

For most repos, 3-5 engineers is ideal. Beyond that, you start hitting genuine merge complexity even with worktree isolation. The agents aren't the bottleneck — the codebase's ability to absorb parallel changes is.

The architect matters more than the engineers

Task decomposition quality is everything. A good architect agent that breaks "Build auth system" into well-scoped, independent subtasks will outperform six engineers working on poorly defined work. I spent more time refining my architect.md prompt than any other part of the setup.

Test gating is non-negotiable

Before Batty, I'd have agents "complete" tasks that broke everything downstream. Now, a task isn't done until tests pass. Period. This single constraint eliminated most of the chaos.

It sounds obvious. But when you're watching five agents work in parallel and one of them says "Done!", the temptation to just accept it and move on is strong. Don't.

You still need to supervise

Batty is not "fire and forget." It's closer to managing a junior dev team than doing the work yourself. You review architecture decisions, redirect when an agent goes off-track, and unblock when someone gets stuck. But you're supervising five workstreams instead of doing one — that's the leverage.

The tmux-native approach just works

I tried web-based dashboards. I tried custom UIs. Nothing beat having the agents in tmux panes where I already work. I can split, resize, scroll back through an agent's history, or detach the whole session and come back from my phone via SSH.

A Real Example

I used Batty to build chess_test — a chess engine built entirely by a team of AI agents. The challenge: build an engine that can beat Stockfish at 1200 ELO at least 50% of the time. No external libraries. No internet lookups.

The team had an architect planning the engine architecture, a manager coordinating the work, and multiple engineers implementing different components in parallel — move generation, evaluation, search algorithms. Each working in their own worktree, each gated on tests.

It's the kind of project that would take one agent days of sequential work. With a coordinated team, the parallel execution compressed the timeline dramatically.

Getting Started

Batty works with the AI coding agents you already use:

Claude Code — First-class support, built-in templates
Codex — Works as an engineer agent
Aider — Works as an engineer agent
Custom — Any CLI tool that accepts stdin

# Install
cargo install kanban-md --locked
cargo install batty-cli

# Initialize in your project
cd your-project
batty init --template pair  # start small: 1 architect + 1 engineer

# Launch
batty start --attach

# Send a task
batty send architect "Implement user authentication with JWT"

Eight built-in templates range from solo (one agent, no hierarchy) to large (19 agents with three management layers). Start with pair or simple and scale up as you get comfortable.

What Batty Is Not

I want to be honest about limitations:

It's early. Version 0.1.0. The core loop is solid, but the API is still settling.
It's not magic. You still need good prompts and good task decomposition. Batty orchestrates — it doesn't think for you.
It requires tmux. If you don't use a terminal-based workflow, this isn't your tool.
It's not a framework. You can't embed it in your app. It's a CLI supervisor.

If you want a GUI, check out vibe-kanban. If you want a single-agent experience, Claude Code alone is excellent. Batty fills the gap between "one great agent" and "a coordinated team."

Batty is open source, built in Rust, and published on crates.io.

GitHub: github.com/battysh/batty
Demo video: 2-minute walkthrough
Docs: battysh.github.io/batty

If you're already running multiple AI agents and feeling the coordination pain, give it a try. And if you have ideas or feedback — issues and PRs are welcome.

Choosing an AI Agent Orchestrator in 2026: A Practical Comparison

Batty — Sun, 05 Apr 2026 05:13:20 +0000

Running one AI coding agent is easy. Running three in parallel on the same codebase is where things get interesting — and where you need to make a tooling choice.

There's no "best" orchestrator. There's the right one for your workflow. Here's an honest comparison of five approaches, with the tradeoffs I've seen after months of running multi-agent setups.

The Options

1. Raw tmux Scripts

What it is: Shell scripts that launch agents in tmux panes. DIY orchestration.

Pros:

Zero dependencies beyond tmux
Full control over every detail
No abstractions to fight
You already know how it works

Cons:

No state management — you track everything manually
No message routing between agents
No test gating — agents declare "done" without verification
Breaks when agents crash or hit context limits
You become the orchestrator

Best for: One-off tasks where you need 2-3 agents for an afternoon. If your coordination needs fit in a 50-line script, use the script.

Not for: Repeatable workflows, overnight sessions, or anything where "walk away and come back to merged PRs" matters.

2. CrewAI

What it is: Python framework for building multi-agent systems with role-based collaboration.

Pros:

Rich agent definition (role, goal, backstory, tools)
Built-in task delegation and sequential/parallel execution
Large ecosystem of tools and integrations
Active community, good documentation
Supports multiple LLM providers

Cons:

Framework, not a tool — you write Python to configure agents
Agents are CrewAI agents, not existing CLI tools (Claude Code, Codex)
No terminal visibility — agents run as Python processes
Learning curve for the framework concepts
Token costs can be high with verbose agent interactions

Best for: Building custom multi-agent applications in Python. Research, analysis, content generation workflows where you want programmatic control.

Not for: Orchestrating existing CLI coding agents. If you already use Claude Code or Codex and want to run multiples in parallel, CrewAI means rebuilding your agent setup in Python.

3. AutoGen

What it is: Microsoft's framework for multi-agent conversation and collaboration. Note (April 2026): Microsoft has announced AutoGen is entering maintenance phase, replaced by the new Microsoft Agent Framework. AutoGen will still receive bug fixes and security updates, but no new features. Worth considering if you're starting fresh.

Pros:

Sophisticated conversation patterns between agents
Strong research backing (Microsoft Research)
Group chat, nested conversations, teachable agents
Good for complex reasoning chains
Human-in-the-loop support
Large community (56K+ GitHub stars)

Cons:

Entering maintenance mode — Microsoft recommends migrating to Agent Framework
Heavy framework — significant setup for simple use cases
Python and .NET only
Designed for conversational agents, not coding workflows
No git integration, no worktree isolation
Overkill for "run 3 coding agents in parallel"

Best for: Existing projects already built on AutoGen. Complex multi-step reasoning and agent conversations in research settings.

Not for: New projects (consider Microsoft Agent Framework instead). Parallel code execution — AutoGen excels at agent conversations, not at managing git branches and test suites.

4. vibe-kanban

What it is: Web-based kanban board for AI agent task management. Built in Rust with a TypeScript frontend.

Pros:

Visual interface — see all agents and tasks at a glance
Drag-and-drop task management with real-time agent log streaming
Git worktree isolation per agent — same isolation concept as Batty, different interface
Built-in diff review UI for checking agent output before merging
MCP integration (both client and server) — agents can manage the board programmatically
Works with Claude Code, Codex, Gemini CLI, and other coding agents
Large community (24K+ GitHub stars)

Cons:

Web UI means leaving your terminal
No test gating — review is manual through the diff UI
Requires a running web server
Different mental model from terminal-native workflows

Best for: Teams that prefer visual interfaces. Developers who want to see diffs and review agent work in a browser. Workflows where drag-and-drop task management and visual oversight are features, not overhead.

Not for: Developers who live in tmux and want everything in the terminal. If Alt-Tab to a browser feels like context switching, vibe-kanban adds friction your workflow doesn't need.

5. Batty

What it is: Terminal-native Rust CLI that supervises AI coding agents in tmux.

Pros:

Each agent runs in a real tmux pane — your keybindings, SSH attach, pipe-pane all work
Git worktree isolation per agent — no file conflicts
Test gating — nothing merges until tests pass
Markdown kanban for task dispatch — cat the board, git diff the state
File-based everything — YAML config, Maildir inboxes, JSONL logs
Single binary (cargo install batty-cli), no runtime dependencies
Works with existing CLI agents (Claude Code, Codex, Aider)

Cons:

tmux is a hard dependency — doesn't work on Windows without WSL
No web UI — if you want a visual dashboard, look elsewhere
Early stage (v0.1.0) — API still settling
Rust contributor barrier — harder for casual contributions than a Python tool
Smaller community than framework-based alternatives

Best for: Developers who already live in tmux and want to scale from one agent to many without leaving the terminal. Teams that care about test gating and code quality gates.

Not for: Non-terminal users. Windows-primary developers. People who want to build custom agent systems from scratch (use CrewAI/AutoGen instead).

Decision Matrix

Need	Best Choice
Quick one-off parallel tasks	Raw tmux scripts
Custom multi-agent Python app	CrewAI
Complex agent reasoning/debate	AutoGen (or Microsoft Agent Framework)
Visual task management with diff review	vibe-kanban
Terminal-native with test gating	Batty
Windows-only environment	CrewAI or vibe-kanban
Orchestrate existing CLI agents	Batty, vibe-kanban, or tmux scripts

The Question That Matters

Before picking a tool, ask: am I building an agent system or coordinating existing agents?

If you're building from scratch — defining agent behaviors, tool access, conversation patterns — you want a framework. CrewAI and AutoGen give you the building blocks.

If you're already using Claude Code, Codex, or Aider and want to run multiples in parallel — you want a supervisor. Batty, vibe-kanban, and tmux scripts operate at this layer, each with different tradeoffs: vibe-kanban gives you a visual board with diff review, Batty gives you terminal-native supervision with test gating, and tmux scripts give you full control with no abstractions.

My Honest Take

I built Batty, so I'm biased. But I built it because the other options didn't fit my workflow:

CrewAI and AutoGen are frameworks — I didn't want to rewrite my agent setup in Python when Claude Code already works well
vibe-kanban is web-based — I wanted to stay in tmux
Raw scripts broke when agents crashed or I needed to walk away

Batty fills a specific niche: terminal-native supervision with test gating for people who already use CLI coding agents. If that's you, try it. If it's not, the other tools are genuinely good at what they do.

Try Batty: cargo install batty-cli — GitHub | Demo

Try the alternatives:

CrewAI — Python multi-agent framework
AutoGen — Microsoft's agent conversation framework (entering maintenance phase)
vibe-kanban — Visual AI agent kanban

How to Get Your Open Source Project into Awesome Lists (and Why It's Worth the Effort)

Batty — Sun, 05 Apr 2026 05:12:13 +0000

You shipped your open source project. You wrote the README. You posted on Reddit. Now what?

One of the most underrated distribution channels for open source is awesome lists — those curated awesome-* repositories on GitHub. There are thousands of them, many with tens of thousands of stars. Getting your project listed means a permanent, dofollow backlink from a high-authority GitHub page.

I recently submitted our Rust CLI tool to seven awesome lists. Here's what I learned about the process — and what I wish someone had told me before I started.

Why Awesome Lists Matter

The obvious benefit is visibility. Someone browsing awesome-rust (40k+ stars) looking for developer tools might discover your project.

But the bigger win is SEO. Each awesome list that merges your PR gives you a dofollow backlink from a page with high domain authority. GitHub pages rank well in search engines. A single listing on a popular awesome list can outperform weeks of blog posts for driving organic search traffic to your repo.

For a project with under 100 stars, getting listed on 3-4 relevant awesome lists can be the difference between showing up on page 1 or page 5 of Google results for your target keywords.

Step 1: Find the Right Lists

Don't just search GitHub for "awesome" + your language. Think about your project from multiple angles:

Language ecosystem: awesome-rust, awesome-python, awesome-go
Tool category: awesome-cli-apps, awesome-devops, awesome-selfhosted
Domain: awesome-ai-agents, awesome-tmux, awesome-shell
Meta-lists: awesome is the list of lists — search it

I found seven relevant lists for a single Rust CLI tool by thinking about it as (1) a Rust project, (2) a CLI app, (3) a DevOps tool, (4) a tmux plugin, (5) a shell utility, (6) an AI agent tool, and (7) a self-hosted application. Each angle pointed to a different list.

Pro tip: Check how recently the list was updated. A list with no merges in 6 months probably has an inactive maintainer. You'll be waiting a long time.

Step 2: Read the Contribution Guidelines (Seriously)

Every awesome list has different rules. Some are strict. Here's what varies:

Star minimums. awesome-rust requires 50+ stars. awesome-cli-apps requires 20+. Many lists have no minimum. Check before you invest time in a PR.

Age requirements. awesome-cli-apps requires your first release to be 90+ days old. awesome-selfhosted requires 4+ months. Don't submit a week-old project to a list with age gates.

Entry format. Some lists use * bullets, others use -. Some want badges, some don't. Some use YAML data files instead of Markdown (awesome-selfhosted does this). Copy the exact format — maintainers reject PRs over formatting alone.

Sort order. Most lists are alphabetical within sections. Submit to the right position. This is the most common fixable mistake in awesome list PRs.

Section choice. Pick the section that fits best. If your project could go in multiple sections, pick one. Don't submit to three sections in the same PR.

Here's a real example of how formats differ between two lists:

For awesome-rust:

* [user/repo](https://github.com/user/repo) [[crate](https://crates.io/crates/crate)] - Description [![build badge](badge-url)](actions-url)

For awesome-cli-apps:

- [app-name](https://github.com/user/repo) - Description.

Same project. Completely different entry format.

Step 3: Write the PR

Keep it simple. The PR should include:

One commit. Don't restructure the list. Just add your entry.
A clear title. "Add ProjectName" or "Add user/project to Section Name"
A brief body. One sentence about what the project does. Link to the repo. That's it.

Don't write a sales pitch. Maintainers are curating, not buying. They want to know: what is it, does it belong here, does it meet the requirements.

Common mistakes that get PRs rejected:

Adding your project to the wrong section
Not following alphabetical ordering
Including promotional language ("the best", "revolutionary")
Submitting to multiple sections in one PR
Missing required badges or format elements

Step 4: Wait (Then Follow Up Politely)

Here's the part nobody tells you: awesome list PRs sit for weeks.

Maintainers are volunteers. Most popular awesome lists get dozens of PRs. Your submission is competing for attention with spam, low-quality projects, and other legitimate submissions.

From my experience across seven submissions:

One PR was acknowledged within 24 hours
Two have been sitting for 14+ days with no response
One has been open for 40 days

Follow-up etiquette:

Wait at least 7 days before your first nudge
Keep it short and friendly: "Just checking in — happy to make any adjustments if needed"
If your project has improved since submission, mention specific updates
One follow-up comment is fine. Two is the max. After that, the maintainer has either seen it or the list is inactive.
Never close and re-open the same PR to bump it

If a PR sits for 60+ days with zero maintainer activity, it's probably not getting merged. That's OK. Close it cleanly and move on.

Step 5: Track Your Backlinks

Once PRs start getting merged, track the SEO impact:

Google Search Console shows when new backlinks appear and how they affect your search rankings
GitHub traffic analytics (Insights > Traffic) shows referring sites — awesome lists show up here
Star history often shows small bumps after awesome list merges

For a project in the 10-50 star range, a single awesome-list merge from a popular list can drive 5-20 new visitors per week, indefinitely. That compounds.

The Bigger Picture

Awesome lists are one piece of an open source distribution strategy. They work best alongside:

Dev.to / Hashnode articles (more backlinks, different audience)
Reddit posts in relevant subreddits
GitHub Topics (free discovery via github.com/topics/your-keyword)

The effort per submission is small — 15 minutes to read guidelines, 5 minutes to write the PR. Even if only half get merged, the permanent backlinks and discovery are worth it.

Start with the list that fits your project best. Read the CONTRIBUTING.md. Format your entry correctly. Submit. Follow up once. Move on to the next one.

Your future self (and your search rankings) will thank you.

I'm building Batty, a tmux-native supervisor for AI coding agents. Currently going through this exact process — 7 awesome list submissions, results pending. Follow along for more lessons from the open source trenches.

Git Worktrees: The Secret Weapon You're Not Using

Batty — Sun, 05 Apr 2026 04:31:32 +0000

You're deep in a feature branch. A bug report comes in. You need to check main, reproduce the bug, fix it, push — then get back to your feature.

Most developers do one of these:

git stash → switch branch → work → switch back → git stash pop → hope nothing broke
Clone the repo again into a second directory
Commit half-finished work with a "WIP" message

All three are bad. There's a better way, and it's been in git since 2015.

What Are Git Worktrees?

A git worktree is a linked working directory that shares the same .git repository. Each worktree checks out a different branch, and they coexist on disk simultaneously.

my-project/              ← main branch (your primary worktree)
my-project-hotfix/       ← hotfix branch (linked worktree)
my-project-experiment/   ← experiment branch (linked worktree)

Three directories. Three branches. One repository. One .git history.

You can edit files in each directory independently, run tests in each, even have different editors open. No stashing. No switching. No WIP commits.

Getting Started

Create a worktree

# From your main checkout
cd my-project

# Create a worktree for an existing branch
git worktree add ../my-project-hotfix hotfix/login-bug

# Create a worktree with a new branch
git worktree add -b feature/new-api ../my-project-api

That's it. ../my-project-hotfix is now a fully functional checkout of hotfix/login-bug.

List worktrees

git worktree list

Output:

/home/dev/my-project           abc1234 [main]
/home/dev/my-project-hotfix    def5678 [hotfix/login-bug]
/home/dev/my-project-api       789abcd [feature/new-api]

Remove a worktree

When you're done with the branch:

# Delete the directory
rm -rf ../my-project-hotfix

# Clean up the worktree reference
git worktree prune

Or in one step (git 2.17+):

git worktree remove ../my-project-hotfix

Real-World Use Cases

1. Bug fixes without context switching

You're mid-feature. A P0 bug comes in.

# Create a worktree for the fix (30 seconds)
git worktree add -b hotfix/p0-crash ../hotfix main

# Fix the bug in the other directory
cd ../hotfix
# edit, test, commit, push

# Switch back — your feature branch is untouched
cd ../my-project

Your feature branch never moved. No stash. No WIP commit. No mental overhead of remembering where you were.

2. Running tests on one branch while working on another

Long test suite? Run it in the background on the worktree while you keep coding in the primary.

# In terminal 1 (worktree)
cd ../my-project-hotfix
cargo test

# In terminal 2 (primary)
cd ../my-project
# keep working on your feature

3. Comparing behavior across branches

Need to check how the app behaves on main vs. your branch?

git worktree add ../my-project-main main

# Terminal 1: run the app on main
cd ../my-project-main && cargo run

# Terminal 2: run your branch
cd ../my-project && cargo run

# Compare side by side

4. Code review with full context

Reviewing a PR? Check it out in a worktree instead of switching your primary branch:

git worktree add ../review-pr-42 origin/feature/pr-42
cd ../review-pr-42
# run tests, explore code, check behavior
# when done:
git worktree remove ../review-pr-42

5. Parallel CI-like workflows

Running linters, type checkers, and tests simultaneously across branches:

git worktree add ../wt-lint main
git worktree add ../wt-test feature/new-api

# Parallel execution
(cd ../wt-lint && npm run lint) &
(cd ../wt-test && npm test) &
wait

How It Works Under the Hood

When you clone a repo, git creates a .git directory that stores all objects, refs, and history. Your working directory is just one view into that data.

git worktree add creates another view — a new directory with its own HEAD, index, and working tree, but sharing the same .git/objects store.

.git/                    ← shared object store
  objects/               ← all commits, blobs, trees (shared)
  refs/                  ← all branches and tags (shared)
  worktrees/
    my-project-hotfix/   ← per-worktree HEAD and index
    my-project-api/      ← per-worktree HEAD and index

This means:

Disk usage is minimal. Worktrees share all git objects. Only the checked-out files are duplicated.
Commits are immediately visible across worktrees. Push from one, pull from another.
Branches are locked. You can't check out the same branch in two worktrees simultaneously. Git prevents this to avoid conflicting edits.

Common Pitfalls

Branch locking

git worktree add ../wt-main main
# Later, in primary:
git checkout main
# fatal: 'main' is already checked out at '../wt-main'

This is by design. Remove the worktree first, or work on a different branch.

Forgetting to prune

Deleted a worktree directory manually? Run git worktree prune to clean up stale references. Otherwise git still thinks the worktree exists and won't let you check out that branch elsewhere.

Submodules

Worktrees and submodules interact poorly in older git versions. If you use submodules, test with your git version first. Git 2.36+ handles this better.

Worktrees vs. Alternatives

Approach	Disk cost	Speed	Risk
`git stash`	None	Fast	Stash conflicts, forgotten stashes
Second clone	Full repo	Slow	Divergent histories, double fetch
WIP commits	None	Fast	Polluted history, rebase headaches
Worktrees	Minimal	Fast	Branch lock (by design)

Worktrees win on every dimension except one: they require you to know they exist.

Quick Reference

# Create worktree (existing branch)
git worktree add <path> <branch>

# Create worktree (new branch from current HEAD)
git worktree add -b <new-branch> <path>

# Create worktree (new branch from specific base)
git worktree add -b <new-branch> <path> <base>

# List all worktrees
git worktree list

# Remove a worktree
git worktree remove <path>

# Clean up stale worktree references
git worktree prune

Beyond Manual Use: Worktrees for Automation

Worktrees aren't just for humans. Any workflow that needs parallel access to multiple branches benefits: CI scripts, deployment pipelines, automated testing — or AI coding agents that need isolated working directories.

If you're running multiple AI agents on the same repo, worktrees give each agent its own checkout without the overhead of full clones. Each agent edits files independently, and conflicts only surface at merge time — which is when you want them.

Try it: cargo install batty-cli — GitHub | Demo

Git worktrees have been stable since git 2.5 (2015). If you're using any modern git version, they just work.

Git Worktrees Explained: The Secret Weapon You're Not Using

Batty — Sun, 05 Apr 2026 04:31:00 +0000

You're deep in a feature branch. A bug report comes in. You need to check main, reproduce the bug, fix it, push — then get back to your feature.

Most developers do one of these:

git stash → switch branch → work → switch back → git stash pop → hope nothing broke
Clone the repo again into a second directory
Commit half-finished work with a "WIP" message

All three are bad. There's a better way, and it's been in git since 2015.

What Are Git Worktrees?

A git worktree is a linked working directory that shares the same .git repository. Each worktree checks out a different branch, and they coexist on disk simultaneously.

my-project/              ← main branch (your primary worktree)
my-project-hotfix/       ← hotfix branch (linked worktree)
my-project-experiment/   ← experiment branch (linked worktree)

Three directories. Three branches. One repository. One .git history.

You can edit files in each directory independently, run tests in each, even have different editors open. No stashing. No switching. No WIP commits.

Getting Started

Create a worktree

# From your main checkout
cd my-project

# Create a worktree for an existing branch
git worktree add ../my-project-hotfix hotfix/login-bug

# Create a worktree with a new branch
git worktree add -b feature/new-api ../my-project-api

That's it. ../my-project-hotfix is now a fully functional checkout of hotfix/login-bug.

List worktrees

git worktree list

Output:

/home/dev/my-project           abc1234 [main]
/home/dev/my-project-hotfix    def5678 [hotfix/login-bug]
/home/dev/my-project-api       789abcd [feature/new-api]

Remove a worktree

When you're done with the branch:

# Delete the directory
rm -rf ../my-project-hotfix

# Clean up the worktree reference
git worktree prune

Or in one step (git 2.17+):

git worktree remove ../my-project-hotfix

Real-World Use Cases

1. Bug fixes without context switching

You're mid-feature. A P0 bug comes in.

# Create a worktree for the fix (30 seconds)
git worktree add -b hotfix/p0-crash ../hotfix main

# Fix the bug in the other directory
cd ../hotfix
# edit, test, commit, push

# Switch back — your feature branch is untouched
cd ../my-project

Your feature branch never moved. No stash. No WIP commit. No mental overhead of remembering where you were.

2. Running tests on one branch while working on another

Long test suite? Run it in the background on the worktree while you keep coding in the primary.

# In terminal 1 (worktree)
cd ../my-project-hotfix
cargo test

# In terminal 2 (primary)
cd ../my-project
# keep working on your feature

3. Comparing behavior across branches

Need to check how the app behaves on main vs. your branch?

git worktree add ../my-project-main main

# Terminal 1: run the app on main
cd ../my-project-main && cargo run

# Terminal 2: run your branch
cd ../my-project && cargo run

# Compare side by side

4. Code review with full context

Reviewing a PR? Check it out in a worktree instead of switching your primary branch:

git worktree add ../review-pr-42 origin/feature/pr-42
cd ../review-pr-42
# run tests, explore code, check behavior
# when done:
git worktree remove ../review-pr-42

5. Parallel CI-like workflows

Running linters, type checkers, and tests simultaneously across branches:

git worktree add ../wt-lint main
git worktree add ../wt-test feature/new-api

# Parallel execution
(cd ../wt-lint && npm run lint) &
(cd ../wt-test && npm test) &
wait

How It Works Under the Hood

When you clone a repo, git creates a .git directory that stores all objects, refs, and history. Your working directory is just one view into that data.

git worktree add creates another view — a new directory with its own HEAD, index, and working tree, but sharing the same .git/objects store.

.git/                    ← shared object store
  objects/               ← all commits, blobs, trees (shared)
  refs/                  ← all branches and tags (shared)
  worktrees/
    my-project-hotfix/   ← per-worktree HEAD and index
    my-project-api/      ← per-worktree HEAD and index

This means:

Disk usage is minimal. Worktrees share all git objects. Only the checked-out files are duplicated.
Commits are immediately visible across worktrees. Push from one, pull from another.
Branches are locked. You can't check out the same branch in two worktrees simultaneously. Git prevents this to avoid conflicting edits.

Common Pitfalls

Branch locking

git worktree add ../wt-main main
# Later, in primary:
git checkout main
# fatal: 'main' is already checked out at '../wt-main'

This is by design. Remove the worktree first, or work on a different branch.

Forgetting to prune

Deleted a worktree directory manually? Run git worktree prune to clean up stale references. Otherwise git still thinks the worktree exists and won't let you check out that branch elsewhere.

Submodules

Worktrees and submodules interact poorly in older git versions. If you use submodules, test with your git version first. Git 2.36+ handles this better.

Worktrees vs. Alternatives

Approach	Disk cost	Speed	Risk
`git stash`	None	Fast	Stash conflicts, forgotten stashes
Second clone	Full repo	Slow	Divergent histories, double fetch
WIP commits	None	Fast	Polluted history, rebase headaches
Worktrees	Minimal	Fast	Branch lock (by design)

Worktrees win on every dimension except one: they require you to know they exist.

Quick Reference

# Create worktree (existing branch)
git worktree add <path> <branch>

# Create worktree (new branch from current HEAD)
git worktree add -b <new-branch> <path>

# Create worktree (new branch from specific base)
git worktree add -b <new-branch> <path> <base>

# List all worktrees
git worktree list

# Remove a worktree
git worktree remove <path>

# Clean up stale worktree references
git worktree prune

Beyond Manual Use: Worktrees for Automation

Try it: cargo install batty-cli — GitHub | Demo

Git worktrees have been stable since git 2.5 (2015). If you're using any modern git version, they just work.

What I Learned Supervising 5 AI Agents on a Real Project

Batty — Sun, 05 Apr 2026 00:58:58 +0000

I ran 5 AI coding agents in parallel on a real Rust project for a week. Not a demo. Not a toy. A 51K-line codebase with real users.

Here's what happened — with actual numbers.

The Setup

Project: A Rust CLI tool with a daemon, tmux integration, message routing, and a kanban board parser.

Team configuration:

1 architect (Claude Opus) — plans and decomposes work
1 manager (Claude Opus) — dispatches tasks, handles escalations
3 engineers (Codex) — parallel execution in isolated worktrees

Duration: 5 working days, ~6 hours per day supervised.

Tasks: Backlog of features, refactors, and bug fixes that had been accumulating for weeks.

The Numbers

Metric	Result
Tasks completed	47
Tasks failed and reassigned	8
Test gate catches (merge blocked)	12
Context exhaustions	3
Merge conflicts	4
Lines changed	~8,200
Total time supervised	~30 hours
Estimated sequential time	~120 hours

47 tasks in 30 hours of supervision. The same work would have taken me roughly 120 hours doing it sequentially — 4x compression.

What Worked

Task decomposition was the multiplier

The architect agent spent the first 30 minutes of each day reading the backlog and decomposing features into independent, testable tasks. This planning phase was the single most valuable step.

Bad decomposition: "Refactor the message routing system." Three engineers attempted overlapping changes and every merge conflicted.

Good decomposition: "Extract delivery retry logic into its own module." "Add timeout configuration to message delivery." "Write tests for Maildir atomic rename." Three independent tasks, zero conflicts.

The quality of the architect's output determined whether the day went smoothly or devolved into conflict resolution.

Test gating prevented 12 bad merges

12 times, an engineer declared a task complete but the test suite failed. Without test gating, those 12 broken branches would have merged to main, creating cascading failures.

The pattern: engineer produces code that compiles, looks correct, and handles the happy path. But it misses an edge case, breaks an existing test, or introduces a subtle regression. The test gate catches it, sends the failure output back, and the engineer fixes it — usually on the first retry.

Three of those 12 catches were serious: a race condition in merge locking, a missing null check in config parsing, and a test that passed locally but failed because of a hardcoded path. Without the gate, any of these would have cost hours to debug in main.

Worktree isolation eliminated file conflicts (mostly)

Each engineer worked in its own git worktree on its own branch. During active work, there were zero file conflicts. Engineers could edit the same files simultaneously without knowing about each other.

Conflicts only appeared at merge time — 4 total across 47 tasks. All were straightforward to resolve because only one branch was being merged at a time (serialized with a file lock).

What Broke

Context exhaustion on complex tasks

Three times, an engineer hit the context window limit mid-task. Each time, the pattern was the same: a task that seemed simple but required reading many files to understand the full picture.

The worst case: "Update the error handling to use typed errors throughout." The engineer started reading error types, then the modules that used them, then the modules that called those modules. By the time it understood the scope, the context window was nearly full and the actual changes were shallow and incomplete.

Fix: Break broad refactors into per-module tasks. "Add typed errors to the delivery module" fits in one context window. "Add typed errors everywhere" does not.

The architect occasionally over-decomposed

On day 3, the architect broke a single feature into 11 tasks. Three of the tasks were trivial one-liners that took more effort to dispatch, execute, test, and merge than they would have taken to do manually.

Fix: Set a minimum complexity threshold. If a task takes less than 5 minutes for a human, it's not worth the orchestration overhead. Batch trivial changes into a single "cleanup" task.

One engineer got stuck in a retry loop

An engineer hit a failing test, attempted to fix it, introduced a new failure, attempted to fix that, and looped for 40 minutes. The test gate correctly blocked the merge each time, but the agent didn't know how to step back and reconsider its approach.

Fix: After 2 failed retries, escalate to the manager instead of letting the engineer continue. The manager can provide fresh perspective or reassign the task. Batty now enforces this automatically.

What Surprised Me

The 3-5 engineer sweet spot is real

With 3 engineers, merge conflicts were rare and supervision was comfortable. With 5, conflicts increased and I spent more time watching for stuck agents. The codebase — not the tooling — was the bottleneck. Too many concurrent changes in a tightly-coupled codebase created interference even with worktree isolation.

Supervision isn't passive

I expected to kick off tasks and check back later. In reality, I checked agent status every 10-15 minutes during the first two hours, then relaxed to every 30 minutes once the pattern was established. The supervision was lightweight but continuous — closer to managing a team than running a batch job.

The architect agent was the best investment

If I had to choose between 1 architect + 2 engineers or 0 architects + 5 engineers, I'd take the architect every time. Well-decomposed tasks with clear acceptance criteria produced better results than throwing more engineers at vague objectives.

Token costs were reasonable

Total token cost for the week: approximately $45. Sequential work on the same tasks would have cost roughly $30 (fewer context loads). The 50% cost increase bought a 4x time compression. At any reasonable hourly rate, this is an obvious trade.

Would I Do It Again?

Yes, with two changes:

Minimum task complexity threshold. Don't orchestrate tasks that take less than 5 minutes manually.
Stricter retry limits from day 1. Two retries, then escalate. No exceptions.

The 4x compression was real, the test gating prevented real damage, and the supervision overhead was manageable. Multi-agent development isn't fire-and-forget, but it's a genuine productivity multiplier for anyone willing to supervise.

Try it: cargo install batty-cli — GitHub | Demo

How File-Based Architecture Makes AI Agents Debuggable

Batty — Sun, 05 Apr 2026 00:45:29 +0000

When an AI agent does something wrong — and it will — you need to answer two questions fast: what happened, and why?

If your agent state lives in a database, the answer requires a SQL client, the right query, and knowledge of the schema. If it lives in an API, you need auth tokens, endpoint documentation, and a way to correlate events across services.

If it lives in files, the answer is ls and cat.

The Debugging Tax

Every layer of abstraction between you and the agent's state is a debugging tax. Each layer adds latency to your investigation:

Architecture	To see what happened	Time to first insight
Database (SQLite/Postgres)	Open client, write query, parse results	2-5 minutes
API-based state	Authenticate, find endpoint, decode response	3-10 minutes
File-based state	`ls .batty/inboxes/eng-1-1/new/`	5 seconds

At 2am when an agent has been looping for an hour, those minutes matter. File-based state gives you instant visibility with tools you already know.

What File-Based Looks Like

Batty stores every piece of agent state as a file:

.batty/
  team_config/
    team.yaml              # Who does what, who talks to whom
    prompts/               # Per-role instruction files
  kanban/
    board/tasks/           # Each task is a Markdown file
  inboxes/
    eng-1-1/
      new/                 # Undelivered messages
      cur/                 # Delivered messages
      tmp/                 # Atomic write staging
    architect/
      new/
      cur/
  worktrees/
    eng-1-1/               # Full git worktree per engineer
  logs/
    events.jsonl           # Every event, one JSON object per line

No database. No hidden state. Every piece of the system is a file you can read with standard Unix tools.

Four Formats, Four Purposes

YAML Config — Who Does What

roles:
  - name: engineer
    agent: claude
    instances: 3
    talks_to: [manager]
    use_worktrees: true

YAML is human-readable configuration. You edit it in your editor, validate it at startup, and git diff it to see what changed. Configuration doesn't change during a session — it's the rules, not the state.

Markdown Kanban — What's Happening

Each task is a Markdown file with YAML frontmatter:

---
id: 27
status: in-progress
assigned_to: eng-1-1
---
# Add JWT authentication
Implement JWT middleware for protected routes.

Want to see all in-progress tasks? grep -l "status: in-progress" board/tasks/*.md. Want to see what changed? git diff board/. Want to edit a task while the daemon is running? Open the file in vim.

Maildir Inboxes — Who Said What

Messages between agents use the Maildir protocol — the same format email servers have used since 1995:

inboxes/eng-1-1/new/   → Messages waiting to be delivered
inboxes/eng-1-1/cur/   → Messages already delivered
inboxes/eng-1-1/tmp/   → Messages being written (atomic staging)

Each message is a JSON file: sender, recipient, body, timestamp. Delivery is atomic — write to tmp/, rename to new/. No partial writes, no corruption, no WAL.

Debugging a delivery failure:

# What messages are stuck?
ls inboxes/eng-1-1/new/

# What does the stuck message say?
cat inboxes/eng-1-1/new/1711108200.msg

# Who sent it?
cat inboxes/eng-1-1/new/1711108200.msg | jq .from

Compare this to debugging a message queue: connect to the broker, navigate the admin UI, find the right queue, decode the message format. With Maildir, it's cat.

JSONL Logs — What Happened When

Every significant event is appended to events.jsonl:

{"ts":1711108200,"event":"task_assigned","engineer":"eng-1-1","task_id":27}
{"ts":1711108890,"event":"test_executed","task_id":27,"passed":false}
{"ts":1711108950,"event":"message_delivered","from":"batty","to":"eng-1-1"}
{"ts":1711109400,"event":"test_executed","task_id":27,"passed":true}
{"ts":1711109405,"event":"merge","source":"eng-1-1/task-27","target":"main"}

One JSON object per line. Append-only. grep-able. jq-able.

# Which tasks failed tests?
cat events.jsonl | jq 'select(.event == "test_executed" and .passed == false)'

# Average time from assignment to completion?
cat events.jsonl | jq 'select(.event == "task_assigned" or .event == "task_completed")'

# Which engineer fails tests most often?
cat events.jsonl | jq -r 'select(.event == "test_executed" and .passed == false) | .engineer' | sort | uniq -c

No Grafana dashboard. No log aggregation service. Just jq.

Why Not a Database?

SQLite would work. It's fast, embedded, and well-understood. But it adds three problems:

Opaque state. You can't cat a SQLite database. You need a client and a query. When something breaks, the first step is figuring out how to inspect the state — not inspecting it.
Merge complexity. Git can't meaningfully diff a binary database file. With file-based state, git diff shows you exactly what changed between two points in time.
Recovery complexity. If the daemon crashes mid-write, a database might need WAL recovery. With Maildir, the atomic rename protocol means messages are either fully written or not written at all. No recovery logic needed.

The tradeoff: files don't scale to millions of records. But an agent supervisor manages 5-20 agents with 50-100 tasks. At that scale, files are faster to inspect and equally fast to read.

The Compound Effect

When everything is a file:

Backups are cp -r .batty/ /backup/
Version control is git add .batty/ && git commit
Debugging is ls and cat
Monitoring is watch ls inboxes/*/new/
Migration is copying a directory
Testing is seeding files and checking results

No client libraries. No connection strings. No schema migrations. No ORM. The filesystem is the API.

When Files Don't Work

File-based architecture has real limitations:

Concurrent writes from multiple machines — files assume a single host. For distributed agents, you need a coordination layer.
Complex queries — "show me all tasks assigned to eng-1-1 that failed tests in the last hour" is easier in SQL than with grep and jq.
High-volume events — JSONL works for hundreds of events per session. For millions, you need a proper time-series database.

For a single-host agent supervisor managing 5-20 agents? Files are the right abstraction. They're not clever. They're debuggable.

Try it: cargo install batty-cli — GitHub | Demo

Building an AI Agent Supervisor: Series Index

Batty — Sun, 05 Apr 2026 00:06:06 +0000

This series documents the architecture, decisions, and lessons from building Batty — a Rust CLI that supervises teams of AI coding agents in tmux.

Each post covers a specific subsystem or challenge. Start anywhere — they're designed to be useful independently.

The Architecture

How I Run a Team of AI Coding Agents in Parallel — The problem and the solution. Why running multiple agents on the same repo breaks without coordination.
Building a tmux-native agent supervisor in Rust — Deep dive into the Rust implementation. Crate choices, architecture decisions, what I'd do differently.
Why I Chose a Synchronous Poll Loop Over Async — I ripped out tokio after two weeks. Here's why sleep(5) was the right call.
How tmux Became the Runtime — Why tmux, not Docker or a custom TUI, is the perfect agent runtime.

The Patterns

Git Worktrees for AI Agent Isolation — Step-by-step tutorial for parallel agent work without file conflicts.
The Case for Markdown as Your Agent's Task Format — Why Markdown beats JSON for agent task management.
Context Rotation: When Agents Run Out of Memory — Detection, rotation patterns, and scoping tasks to fit context windows.
Your AI Agent Says Done — How Do You Know? — Test gating as the quality gate. Exit code 0 means done.

The Practice

5 Lessons from Running AI Agents in Parallel — Task decomposition, test gating, worktree isolation, supervision vs autonomy.
From Solo Agent to Agent Team: A Migration Guide — Progressive 6-stage path from one agent to full automation.
The Real Cost of Running 5 Agents in Parallel — Token math, cost reduction tactics. 1.5-2x not 5x.
Choosing an AI Agent Orchestrator in 2026 — Honest comparison: Batty vs vibe-kanban vs CrewAI vs AutoGen vs tmux scripts.

The tool: github.com/battysh/batty — open source, MIT licensed, built in Rust.

Open Source Marketing with Zero Budget: How We Got 14 Stars in 4 Weeks

Batty — Sat, 04 Apr 2026 23:21:22 +0000

Four weeks ago, Batty didn't exist as far as the internet was concerned. No stars. No downloads. No articles. No social presence.

Today: 14 stars, 76 crate downloads, 241 cloners, 293 GitHub views, 25+ articles indexed by Google, and 68 X replies that built a recognized presence in the AI coding space.

Total marketing budget: $0.

Here's exactly what we did, what worked, what failed, and what we'd do differently.

The Strategy: Content as Infrastructure

Most open-source projects treat marketing as an event — a launch day, a Show HN, a Product Hunt post. We treated it as infrastructure. Every article is a permanent search discovery path. Every X reply is a conversation that surfaces our profile. The goal wasn't a traffic spike. It was steady compounding.

The strategy has three pillars:

SEO content on Dev.to — articles targeting specific keyword clusters
X engagement — quality replies in relevant threads, not broadcasting
Directory submissions — backlinks from high-authority platforms

That's it. No paid ads, no influencer outreach, no growth hacks.

What Worked

Dev.to Articles (Highest ROI)

25+ articles published across Dev.to and Hashnode. Each targets a different keyword cluster:

Keyword Cluster	Article	Why It Works
"git worktrees AI agents"	How to Use Git Worktrees...	Tutorial, actionable
"sync vs async Rust daemon"	Why I Chose Sync Over Async...	Contrarian, Rust community bait
"AI agent orchestrator 2026"	Choosing an AI Agent Orchestrator...	Comparison, high-intent search
"AI agent task management"	The Case for Markdown...	Opinionated, practical
"tmux AI agents"	How tmux Became the Runtime...	Narrative, niche keyword
"running multiple AI agents"	5 Lessons from Running AI Agents...	Listicle, shareable
"AI agent cost"	The Real Cost of Running 5 Agents...	Addresses #1 objection
"solo to agent team"	From Solo Agent to Agent Team	Migration guide, widest audience

Key insight: article quantity compounds faster than quality. Each article is a permanent Google entry point. A mediocre article that ranks for a long-tail keyword drives more lasting traffic than a perfect article that nobody finds. We'd rather publish 25 good articles than 5 great ones.

The 5-article threshold. After our first 5 articles, Google started treating our Dev.to profile as authoritative for multi-agent coding topics. Articles 6-25 indexed faster and ranked higher. The first 5 were investment; the rest are compounding returns.

Cross-posting to Hashnode added a second indexed domain with canonical URLs pointing back to Dev.to. Zero extra writing — same content, different discovery path.

X Engagement (Highest Quality Traffic)

68 replies across 23 rounds. Strategy: find high-engagement threads about AI coding agents, Rust CLI tools, or developer workflows. Add a genuinely helpful reply. Never promote.

The math: replies to threads from accounts with 10K+ followers appear under their post, visible to their entire audience. A single reply on a 20K-view thread drives more targeted traffic than a standalone post with 100 impressions.

What we actually said: technical observations about multi-agent coordination, token cost management, sync vs async tradeoffs. Things only someone who's actually built an agent supervisor would know. The replies built credibility before anyone clicked through to our profile.

Quality signal: X drove nearly 1:1 visitor-to-unique ratio. When someone clicks through from a quality reply, they're genuinely interested. Most other channels have much higher bounce rates.

We never said "check out Batty" in a reply. Not once across 68 replies. The profile link in our bio does the conversion. Forcing it kills credibility.

Directory Submissions (Backlinks)

SaaSHub (DR 77) — submitted, pending approval
awesome-tmux, awesome-ai-tools, awesome-ai-agents — PRs open
This Week in Rust — PR submitted
Rust Users Forum — post submitted (pending mod approval)
Google Search Console — verified, data incoming

Each approved submission is a permanent dofollow backlink from a high-authority domain. These take weeks to process but compound permanently.

YouTube Comments (Permanent Discovery)

3 comments on major Claude Code tutorial videos. YouTube comments are permanently visible, Google-indexed, and contextually placed — they appear exactly where someone is learning about the problem Batty solves.

What Failed

Show HN (Dead on Arrival)

Our Show HN was posted with a solid title, a strong first comment, and architecture details. The author's HN account got flagged for AI-generated comments (unrelated to the Show HN). Post died at 1 point with zero engagement.

Lesson: HN is a single point of failure. If your account has any issues, the entire launch is wasted. We'd invested weeks in karma building and response templates — all for nothing.

Reddit (Near Zero Engagement)

Three posts across r/commandline, r/rust, and the r/rust weekly thread. Total community engagement: approximately zero. The posts exist and are indexed, but Reddit's algorithm buried them without early upvote velocity.

Lesson: Reddit rewards participation in existing discussions, not announcements. We should have spent those hours commenting helpfully in other threads instead of creating our own posts.

Bluesky (Blocked by Logistics)

We had a Bluesky account but couldn't activate it — email verification issues, app password problems, browser automation incompatibility. 4 unique visitors from near-zero effort suggests potential, but we never got to test it properly.

Lesson: Set up all accounts and verify everything before you need them.

The Numbers Nobody Shows You

Metric	Reality
Articles published	25+
X replies sent	68
Total marketing hours	~40
Stars earned	14
Cost per star	$0 (but ~3 hours of work)
Viral moments	0
Articles that "went viral"	0
Steady daily growth	2-3 visitors/day

No single piece of content drove our growth. It was the accumulation of 25 articles, 68 replies, and consistent presence over 4 weeks. The growth is invisible day-to-day but obvious week-to-week.

What We'd Do Differently

Skip Reddit posts, do Reddit comments. Reply to "what tools do you use" threads instead of creating our own announcement posts.
Set up all accounts on day 1. Bluesky, Discord, AlternativeTo — every platform that needs verification should be ready before you need it.
Start with the comparison article. "Choosing an AI Agent Orchestrator in 2026" is our highest-intent search target. We should have published it first, not fifteenth.
Don't invest in HN. It's high risk, single point of failure, and doesn't compound. The same hours spent on Dev.to articles produce permanently indexed content.
Publish daily from day 1. The 5-article threshold means the first week is pure investment. Starting earlier means reaching the compounding phase sooner.

The Playbook (Steal This)

If you're marketing an open-source tool with zero budget:

Write 5 articles in week 1 targeting different keyword clusters. Dev.to + Hashnode cross-post.
Reply to 3 threads daily on X. Technically specific, never promotional. Your bio link does the conversion.
Submit to directories — SaaSHub, awesome-lists, This Week in Rust (or your language's equivalent). Free backlinks.
Leave YouTube comments on tutorial videos about the problem you solve. Permanent and contextual.
Don't launch. Build presence continuously instead of betting on a single launch day.

The boring answer is the correct answer: show up every day, write something useful, engage genuinely, and let search do the compounding.

The tool: github.com/battysh/batty — supervised agent execution for software teams.

If you're marketing an open-source project with zero budget, I'd love to compare playbooks. What's working for you?