AI coding agents are good at isolated tasks. They're terrible at working together.
I've been running Claude Code, Codex, and Aider on real projects for months. One agent works great. But the moment you try to run three or four in parallel on the same repository, you hit a wall: they edit the same files, nobody checks if tests pass, and you become a full-time dispatcher instead of a developer.
I built Batty to fix this — a terminal-native daemon that supervises teams of AI coding agents. It runs in tmux, isolates work in git worktrees, routes messages between agents, and gates everything on tests. This post is about the Rust implementation: what I built, why I made the choices I did, and what I'd do differently.
## Architecture at a glance
Batty is a single Rust binary (~51k lines, Rust 2024 edition) that manages a hierarchy of agents:
```
You → Architect → Manager → Engineers (3-5 in parallel)
         ↓           ↓             ↓
       plans     dispatches   code in isolated
                 tasks via    git worktrees
                 kanban
```
Each agent runs in its own tmux pane. The daemon polls every 5 seconds, detects agent states, delivers messages, dispatches tasks, runs tests, and merges results. Everything is file-based: YAML config, Markdown kanban, Maildir inboxes, JSONL event logs.
The key Rust dependencies:
| Crate | What it does |
|---|---|
| `clap` 4 | CLI with derive macros |
| `serde` + `serde_yaml` | Config and message serialization |
| `maildir` 0.6 | Inbox implementation |
| `ureq` 2 | Blocking HTTP (Telegram integration) |
| `sha2` | Output hashing for idle detection |
| `anyhow` + `thiserror` | Error handling |
| `tracing` | Structured logging |
| `ctrlc` | Signal handling |
Notably absent from that list: any async runtime in the hot path. More on that below.
## Decision 1: Synchronous daemon, not async
Batty includes tokio in its dependencies but doesn't use it for the main daemon loop. The core is a straightforward loop with `std::thread::sleep()`:
```rust
loop {
    if shutdown_flag.load(Ordering::SeqCst) {
        break;
    }
    poll_watchers();            // detect agent states
    restart_dead_members();     // respawn crashed agents
    deliver_inbox_messages();   // maildir → tmux injection
    retry_failed_deliveries();  // transient error retry
    maybe_auto_dispatch();      // assign next task
    maybe_fire_nudges();        // idle timeout nudges
    maybe_generate_standup();   // periodic status
    // ... ~10 more intervention checks
    thread::sleep(Duration::from_secs(5));
}
```
Why synchronous? Because the daemon's job is orchestration, not I/O throughput. It polls tmux panes, reads files, and runs shell commands. None of these benefit meaningfully from async. A 5-second poll interval means we're spending 99.9% of the time sleeping anyway.
The synchronous approach bought me three things:
1. **Debuggability.** Stack traces are readable. No `tokio::spawn` making backtraces useless. When something goes wrong at 2am, I can read the logs and trace exactly what happened.
2. **Simplicity.** The entire daemon state lives in a single `TeamDaemon` struct. No channels, no mutexes, no message passing between tasks. State transitions are just method calls.
3. **Predictability.** Each poll iteration runs the same checks in the same order. Side effects are explicit. If dispatch happens before nudging, it's because the code says so, not because of task scheduling.
I kept tokio in Cargo.toml because some dependencies pull it in transitively and because I may need it later for WebSocket channels. But the daemon itself stays synchronous until there's a real reason to change that.
## Decision 2: tmux via CLI wrapping, not a library
There are Rust crates for tmux interaction. I tried them. They were either incomplete, unmaintained, or abstracted away the things I needed most. So Batty wraps the tmux CLI directly:
```rust
pub fn capture_pane(target: &str) -> Result<String> {
    let output = Command::new("tmux")
        .args(["capture-pane", "-p", "-t", target, "-S", "-2000"])
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).to_string())
}

pub fn send_keys(target: &str, keys: &str, enter: bool) -> Result<()> {
    let mut args = vec!["send-keys", "-t", target, keys];
    if enter { args.push("Enter"); }
    Command::new("tmux").args(&args).status()?;
    Ok(())
}
```
This is 1,600 lines of tmux wrapping. It handles session creation, pane splitting, `pipe-pane` for output capture, dead pane detection (`#{pane_dead}`), version probing (3.1+ for `pipe-pane`, 3.2+ for the `-o` flag), and status bar label updates.
The version probing is more important than it sounds. tmux's feature set varies significantly between versions:
```rust
pub fn probe_capabilities() -> TmuxCapabilities {
    let version = parse_tmux_version(&version_string());
    TmuxCapabilities {
        pipe_pane: version >= (3, 1),
        pipe_pane_only_flag: version >= (3, 2),
        status_style: version >= (2, 9),
        split_percent: version >= (3, 1),
    }
}
```
This lets Batty degrade gracefully on older tmux installations instead of crashing with an unhelpful error.
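The `parse_tmux_version` helper isn't shown above; a dependency-free sketch of what it might look like follows. The function name and `(major, minor)` tuple shape come from the snippet, but the parsing details here are my assumptions, not Batty's actual implementation:

```rust
/// Parse output like "tmux 3.2a" or "tmux next-3.4" into (major, minor).
/// Sketch only: the real parser may handle more edge cases.
fn parse_tmux_version(s: &str) -> (u32, u32) {
    // Take the last whitespace-separated token, e.g. "3.2a" or "next-3.4".
    let token = s.split_whitespace().last().unwrap_or("");
    // Strip the "next-" prefix that tmux development builds use.
    let token = token.strip_prefix("next-").unwrap_or(token);
    let mut parts = token.split('.');
    let major = parts
        .next()
        .and_then(|p| p.parse().ok())
        .unwrap_or(0);
    let minor = parts
        .next()
        .map(|p| {
            // Keep only leading digits: "2a" -> 2.
            let digits: String = p.chars().take_while(|c| c.is_ascii_digit()).collect();
            digits.parse().unwrap_or(0)
        })
        .unwrap_or(0);
    (major, minor)
}
```

Tuple comparison gives lexicographic ordering for free, which is why `version >= (3, 1)` in the capability probe works without a dedicated version type.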
The tradeoff: wrapping CLI commands means subprocess overhead on every operation. In practice, this doesn't matter — tmux commands complete in 1-5ms, and we're running them on a 5-second interval. The alternative (maintaining a persistent tmux control channel) would add complexity with no measurable benefit at this polling frequency.
## Decision 3: Maildir for agent communication
Agents need to send messages to each other. The architect sends plans to the manager. The manager assigns tasks to engineers. Engineers report back when done.
I considered several approaches: a channel-based system, a simple message queue in a SQLite database, files in a shared directory. I went with Maildir — the same protocol email servers have used since 1995.
```
.batty/inboxes/
  architect/
    new/   # Undelivered messages
    cur/   # Delivered messages
    tmp/   # Atomic write staging
  manager/
  eng-1-1/
  eng-1-2/
```
Each message is a JSON file with sender, recipient, body, type (send or assign), and timestamp. The maildir crate handles atomic writes — messages go to tmp/ first, then get renamed to new/ in a single filesystem operation. No partial writes. No corruption.
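The tmp-then-rename trick is easy to replicate with the standard library alone. Here's a sketch of how a queue step could work; the `queue_to_inbox` helper and its filename scheme are illustrative, not Batty's or the `maildir` crate's actual API:

```rust
use std::fs;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Write `body` into a Maildir-style inbox atomically:
/// stage the file in tmp/, then rename into new/ in one syscall.
fn queue_to_inbox(inbox: &Path, body: &str) -> std::io::Result<()> {
    fs::create_dir_all(inbox.join("tmp"))?;
    fs::create_dir_all(inbox.join("new"))?;
    fs::create_dir_all(inbox.join("cur"))?;
    // Unique-ish name; real Maildir names also embed hostname and a counter.
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos();
    let name = format!("{nanos}.msg");
    let staged = inbox.join("tmp").join(&name);
    fs::write(&staged, body)?; // a partial write can only ever land in tmp/
    // rename() is atomic within a filesystem, so readers of new/ never
    // observe a half-written message.
    fs::rename(&staged, inbox.join("new").join(&name))?;
    Ok(())
}
```

A reader that scans `new/` sees either the whole message or nothing, which is the entire point of the protocol.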
Delivery works in two phases:
1. **Queue:** Write the message to the recipient's `new/` directory.
2. **Deliver:** On the next daemon poll, inject the message into the recipient's tmux pane via `send-keys`, then move it from `new/` to `cur/`.
If delivery fails (pane dead, tmux error), the message stays in new/ and gets retried next iteration. Transient errors (rate limits, timeouts) are categorized separately from permanent failures:
```rust
impl DeliveryError {
    pub fn is_transient(&self) -> bool {
        matches!(self,
            DeliveryError::ChannelSend { detail, .. }
                if detail.contains("429") || detail.contains("timeout"))
    }
}
```
Why Maildir over a custom solution? Because it's battle-tested, inspectable (ls .batty/inboxes/eng-1-1/new/), and survives daemon restarts. If Batty crashes mid-delivery, undelivered messages are still sitting in new/ when it comes back. No recovery logic needed — the protocol handles it.
## Decision 4: Git worktrees for isolation
This is the decision that made multi-agent development actually work. Each engineer gets a persistent worktree:
```
.batty/worktrees/
  eng-1-1/   # Persistent directory, fresh branch per task
  eng-1-2/
  eng-1-3/
```
When a task is assigned, Batty doesn't create a new worktree. It reuses the engineer's existing directory, resets to current main, and creates a fresh task branch:
```rust
pub struct PhaseWorktree {
    pub repo_root: PathBuf,
    pub base_branch: String,
    pub start_commit: String,
    pub branch: String,   // e.g., "eng-1-2/task-27"
    pub path: PathBuf,
}
```
This was a deliberate choice over creating and destroying worktrees per task. Creation is expensive (git has to check out the full working tree), and destruction risks losing debugging context when something goes wrong. Reusing a stable directory is faster and keeps the filesystem predictable.
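The reset-and-branch step boils down to a short sequence of git commands run inside the engineer's directory. Here's a sketch that only builds the command plan; the helper name is mine, and Batty's real implementation also fetches the base branch and reports errors:

```rust
/// Build the git invocations that recycle an engineer's worktree for a
/// new task: wipe leftovers, then start a fresh task branch at the base tip.
fn worktree_reset_plan(base_branch: &str, task_branch: &str) -> Vec<Vec<String>> {
    let cmd = |args: &[&str]| args.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    vec![
        // Discard any uncommitted state from the previous task.
        cmd(&["git", "reset", "--hard"]),
        cmd(&["git", "clean", "-fd"]),
        // Create or reset the task branch at the base branch tip, e.g.
        // "git checkout -B eng-1-2/task-27 main". Using -B with an explicit
        // start point avoids checking out `main` itself, which git forbids
        // when `main` is already checked out in another worktree.
        cmd(&["git", "checkout", "-B", task_branch, base_branch]),
    ]
}
```

Each `Vec<String>` would then be fed to `std::process::Command` with the worktree path as the current directory.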
The tricky part is merging. When multiple engineers finish tasks simultaneously, you need to serialize merges or risk conflicts:
```rust
let lock = MergeLock::acquire(project_root)?; // Blocks up to 60s
match merge_engineer_branch(project_root, engineer)? {
    MergeOutcome::Success => {
        drop(lock);
        board_cmd::move_task(task_id, "done", engineer)?;
    }
    MergeOutcome::RebaseConflict(info) => {
        drop(lock);
        let attempt = daemon.increment_retry(engineer);
        if attempt <= 2 {
            // Send conflict details back to the engineer
            queue_message(engineer, &format!("Merge conflict: {info}"));
        } else {
            // Escalate to manager after 2 failed attempts
            queue_message(engineer, "Conflicts persist. Escalating.");
        }
    }
}
```
The merge lock is a simple file lock with a 60-second timeout. It serializes concurrent merges so only one engineer's branch is being rebased at a time. If an engineer hits a conflict, it gets two retry attempts before escalating to the manager. This mimics what a human team lead would do — let the developer try to resolve it, then step in if they can't.
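A file lock like this needs nothing beyond `O_CREAT|O_EXCL` semantics, which `OpenOptions::create_new` provides. Here's a simplified sketch under that assumption; Batty's actual `MergeLock` presumably also handles stale locks from crashed daemons, which this omits:

```rust
use std::fs::OpenOptions;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

struct MergeLock { path: PathBuf }

impl MergeLock {
    /// Spin until we can create the lock file exclusively, or time out.
    fn acquire(dir: &Path, timeout: Duration) -> std::io::Result<MergeLock> {
        let path = dir.join("merge.lock");
        let deadline = Instant::now() + timeout;
        loop {
            // create_new fails if the file already exists: that's the lock.
            match OpenOptions::new().write(true).create_new(true).open(&path) {
                Ok(_) => return Ok(MergeLock { path }),
                Err(e) if e.kind() == std::io::ErrorKind::AlreadyExists => {
                    if Instant::now() >= deadline {
                        return Err(std::io::Error::new(
                            std::io::ErrorKind::TimedOut,
                            "merge lock held too long",
                        ));
                    }
                    std::thread::sleep(Duration::from_millis(100));
                }
                Err(e) => return Err(e),
            }
        }
    }
}

impl Drop for MergeLock {
    fn drop(&mut self) {
        let _ = std::fs::remove_file(&self.path); // release on drop
    }
}
```

Tying release to `Drop` means the `drop(lock)` calls in the merge code above are just explicit early releases; the lock can't leak on an early return.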
## Decision 5: Deep idle detection
The hardest problem in agent supervision isn't launching agents or routing messages. It's knowing when an agent is done.
You can't just check if the tmux pane's output changed — an agent might be "thinking" with no visible output for 30 seconds. You can't just check for a specific string like "Done!" — agents express completion in dozens of different ways.
Batty uses a layered approach:
Layer 1: Pane output hashing. Hash the captured pane content. If the hash hasn't changed across multiple poll cycles, the agent is probably idle.
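Layer 1 fits in a few lines. Batty uses `sha2` for the hash; to keep this sketch dependency-free I substitute std's `DefaultHasher`, and the threshold of three quiet polls is my illustration, not Batty's tuned value:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tracks consecutive polls where the pane content hash was unchanged.
struct IdleTracker { last_hash: Option<u64>, quiet_polls: u32 }

impl IdleTracker {
    fn new() -> Self { IdleTracker { last_hash: None, quiet_polls: 0 } }

    /// Feed one pane capture; returns true once output has been static
    /// for three polls in a row (~15s at a 5-second interval).
    fn observe(&mut self, pane_content: &str) -> bool {
        let mut h = DefaultHasher::new();
        pane_content.hash(&mut h);
        let hash = h.finish();
        if self.last_hash == Some(hash) {
            self.quiet_polls += 1;
        } else {
            self.quiet_polls = 0;
            self.last_hash = Some(hash);
        }
        self.quiet_polls >= 3
    }
}
```

Hashing instead of storing the full capture keeps per-agent state to a few bytes, which matters once you're tracking a dozen panes.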
Layer 2: Session file monitoring. Claude Code writes conversation state to ~/.claude/projects/<project>/<session>/. Codex writes to ~/.codex/sessions/. Batty reads these files directly to detect activity that isn't visible in the pane:
```rust
let tracker_state = self.poll_tracker()?;
if tracker_state == TrackerState::Completed {
    self.state = WatcherState::Active;
}
```
This dramatically reduced false-positive idle detections. An agent that's generating a large file might have a static pane for 20 seconds, but its session file is actively growing.
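Detecting that a session file is still growing needs only file metadata. A sketch of what such a check could look like (the session paths above are real; this helper and its growth-only heuristic are my illustration):

```rust
use std::path::Path;

/// Returns true if the session file grew since the last observed size,
/// i.e. the agent is still writing even though the pane looks static.
fn session_file_active(path: &Path, last_len: &mut u64) -> bool {
    match std::fs::metadata(path) {
        Ok(meta) => {
            let grew = meta.len() > *last_len;
            *last_len = meta.len();
            grew
        }
        // No session file yet: no evidence of activity.
        Err(_) => false,
    }
}
```

A production version would probably also compare modification times, since an agent can rewrite state in place without changing the file's length.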
Layer 3: Context exhaustion detection. AI agents hit context limits. When they do, they typically print a specific message and become unresponsive. Batty detects this and flags the agent for restart:
```rust
pub enum WatcherState {
    Active,
    Idle,
    PaneDead,
    ContextExhausted,
}
```
Layer 4: Completion packets. Engineers emit structured completion data (task ID, branch name, commit hash, test results). Batty validates these before accepting a task as done:
```rust
pub fn validate_completion(packet: &CompletionPacket) -> CompletionValidation {
    let mut missing = Vec::new();
    if packet.task_id == 0 { missing.push("task_id"); }
    if packet.branch.is_none() { missing.push("branch"); }
    if !packet.tests_run { missing.push("tests_run"); }
    // ...
}
```
No completion packet with passing tests = the task isn't done. Period.
## Decision 6: Test gating as a first-class concept
Test gating sounds simple: run tests, check if they pass. In practice, it required more thought than I expected.
When an engineer reports completion, Batty:
- Checks the worktree has actual commits ahead of main
- Runs the test command in the worktree directory
- Records timing for cost analysis
- If tests pass → acquire merge lock → merge
- If tests fail → send truncated output back to the engineer
```rust
let (tests_passed, output) = run_tests_in_worktree(&worktree_dir)?;
if !tests_passed {
    queue_message(engineer, &format!("Tests failed:\n{}", output));
    return Ok(());
}
```
The test output is truncated to the last 50 lines before sending it back. A full test suite dump would overwhelm the agent's context — it just needs to see what failed.
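The truncation is worth a sketch because the direction matters: keep the tail, not the head, since test runners print failure summaries last. The helper name and omission marker here are mine:

```rust
/// Keep only the last `max_lines` lines of test output; the failure
/// summary is at the end, and the agent doesn't need the rest.
fn truncate_tail(output: &str, max_lines: usize) -> String {
    let lines: Vec<&str> = output.lines().collect();
    if lines.len() <= max_lines {
        return output.to_string();
    }
    let skipped = lines.len() - max_lines;
    // Tell the agent something was cut, so it doesn't assume the
    // visible output is the whole run.
    let mut result = format!("[... {skipped} lines omitted ...]\n");
    result.push_str(&lines[skipped..].join("\n"));
    result
}
```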
The test command itself is configurable per project. For Rust projects it's cargo test. For Node projects it might be npm test. Batty doesn't care — it just runs the command and checks the exit code.
## What surprised me
Failure tracking matters. I added a 20-entry circular buffer that tracks recent failures. When the same pattern repeats (same engineer, same error), Batty flags it for intervention instead of retrying indefinitely. This was an afterthought that became essential.
Hot-reloading the binary works. Batty can detect when its own binary has been updated and replace itself via exec() after persisting state. During development, I'd cargo install a new version and the running daemon would pick it up within 5 seconds. This saved me from the stop/start cycle during development and turned out to be useful in production too.
The talks_to constraint is the real feature. Each agent has an explicit list of roles it can communicate with. The architect talks to the manager. The manager talks to engineers. Engineers don't talk to each other. This simple constraint — validated at config load by the type system — prevents the message chaos that kills most multi-agent setups. Without it, you get an O(n^2) communication explosion.
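At its core, enforcing `talks_to` is an allowlist lookup at send time plus a validation pass at config load. A minimal sketch of the idea; the struct and method names are my guesses based on the description, not Batty's real config types:

```rust
use std::collections::{HashMap, HashSet};

/// Per-role allowlist of communication partners, loaded from config.
struct CommsPolicy { talks_to: HashMap<String, HashSet<String>> }

impl CommsPolicy {
    /// Build the policy from explicit (sender, recipient) edges.
    fn new(edges: &[(&str, &str)]) -> Self {
        let mut talks_to: HashMap<String, HashSet<String>> = HashMap::new();
        for (from, to) in edges {
            talks_to.entry(from.to_string()).or_default().insert(to.to_string());
        }
        CommsPolicy { talks_to }
    }

    /// A send is rejected unless the edge is explicitly configured —
    /// the default is silence, not broadcast.
    fn may_send(&self, from: &str, to: &str) -> bool {
        self.talks_to.get(from).map_or(false, |set| set.contains(to))
    }
}
```

Making the default deny is what keeps the communication graph a tree instead of a clique: a new agent role can't talk to anyone until someone writes the edge down.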
## What I'd do differently
Start with fewer intervention types. The daemon has ~10 different intervention checks (triage, owned-task recovery, dispatch-gap recovery, utilization recovery, standups, nudges, retrospectives). I should have started with 3 and added the rest based on observed failures. Some interventions fire too aggressively and had to be tuned down.
Abstract the agent CLI interface earlier. Currently, agent-specific knowledge (how Claude Code shows completion vs. how Codex does it) is scattered across watcher code. A clean trait boundary would make adding new agents easier.
Consider event-driven over polling. The 5-second poll loop works fine at the current scale (up to ~20 agents). If someone wanted to run 50+ agents, the synchronous polling would become a bottleneck. An inotify/kqueue-based approach for file changes would scale better. But — YAGNI. The poll loop is correct and simple. I'll optimize when someone actually hits the limit.
## Try it
```shell
cargo install kanban-md --locked && cargo install batty-cli
cd your-project
batty init --template pair   # start small: 1 architect + 1 engineer
batty start --attach
batty send architect "Build user authentication with JWT"
```
Eight built-in templates go from solo (1 agent) to large (19 agents). Start small and scale up.
Batty is open source (MIT), published on crates.io, and I use it daily.
- **GitHub:** github.com/battysh/batty
- **Demo:** 2-minute walkthrough
- **Docs:** battysh.github.io/batty
If you've tried running multiple AI agents and hit the coordination wall, I'd love to hear how you approached it. Issues and PRs welcome.