Everything That Broke on Day Two

#ai #agents #typescript #devops

AI agents don't tell you when they're broken. They exit clean, report success, produce nothing.

The first real task went fine. The second one hung for 30 minutes and produced nothing.

Day one was the build — ports and adapters, Telegram bot, task queue, CLI worker, all wired up and running on PM2 by midnight. ([Post 1] covers the full 16-hour sprint from empty directory to working product.) Day two was when real tasks hit real repos. "Working on the happy path" turned out to be a generous definition of "working."

Six bugs shipped with v0.1. In the order we found them.

Bug 1: Zero Stdout

First production task ran seven minutes, then got killed by the timeout. Task log showed nothing. stdoutTail: "". The CLI was running, doing work inside its sandbox, writing zero bytes to stdout.

Checked if --output-format json was writing to stderr. Checked if FORCE_COLOR: '0' was suppressing output. Checked TTY buffering. None of it.

Root cause: HOME. We spawn the CLI as a sandboxed user, but the environment inherited HOME=/root from the parent process. The Claude CLI reads its config from ~/.claude/ — wrong HOME meant no config, no session data, no output.

const child = spawn('sudo', ['-u', 'mcbot', this.cliPath, ...args], {
  cwd: params.projectPath,
  env: { ...process.env, HOME: '/home/mcbot', FORCE_COLOR: '0' },
  //                      ^^^^^^^^^^^^^^^^^^^
  //                      This was the entire fix.
});

Ten hours of debugging. One line. Classic.

Bug 2: Git Permission Hell

With stdout fixed, the CLI could do work. But it couldn't commit. Repos owned by root. Sandboxed worker couldn't write to .git/ directories.

Solution: a sudo git wrapper running every command with safe.directory=*, plus auto-chown on project registration so the worker can write to .git/ from the start.

async function git(args: string[], cwd: string): Promise<ExecResult> {
  return exec('sudo', [
    '-u', 'mcbot', '-H', 'git',
    '-c', 'safe.directory=*',
    ...args
  ], { cwd, timeout: 30000 });
}

Straightforward once you see it. Invisible until you do.

Bug 3: Stuck Tasks

PM2 restarts the process on crash. Good. But when PM2 restarts, any task in running state stays there forever. The runner sees running >= MAX_CONCURRENT_TASKS and refuses new work. Queue frozen.

Fix: orphan cleanup on startup. Any task still marked running gets reset to queued for retry. Obvious in hindsight — the kind of thing you discover when your agent crashes at 2 AM and you wake up to a queue that hasn't moved.

Shipped as part of a commit titled "Fix 8 reliability bugs: stuck tasks, dirty repos, silent failures." Eight bugs, one commit. Day two was that kind of day.

Bug 4: Dirty Repo Trap

CLI crashes mid-work — timeout, OOM, process kill — repo has uncommitted changes. Next task tries git checkout -b feature/new-branch and git refuses. Dirty working tree. One crashed task poisons every task after it.

Added a force-checkout fallback in the finally block:

} finally {
  try {
    await checkoutBranch(project.path, originalBranch!);
  } catch (err) {
    log.error('Failed checkout, attempting force checkout', err);
    try {
      await forceCheckout(project.path, originalBranch!);
    } catch (forceErr) {
      log.error('Force checkout failed — repo may need manual fix', forceErr);
    }
  }
}

Normal checkout first. Dirty tree — force checkout. That fails too — log and move on. The repo might need manual intervention, but the system doesn't deadlock.

Bug 5: The Haiku Incident

Not a MissionControl bug. An operational lesson about multi-agent trust.

Ran a parallel agent sprint on another project. Three Haiku agents, each assigned to a specific feature. Fast, cheap, scoped. One of them deleted an entire application directory. Not a file — the directory. Every route, every component, every layout. Gone.

Recovery: git checkout HEAD -- src/app/. But new files the agent created in that directory — untracked by git — were lost permanently.

New rule, enforced from that day forward: Haiku agents get verified by the team lead before any commit. After all agents report done, run git status, review diffs, run the type checker and build yourself. Only then stage. bypassPermissions + fast model + directory access = deletion risk. Scope fast agents to specific files, not directories.

Bug 6: Tool Args

Silent one. Passed --allowedTools Bash,Read,Edit as a single string argument. The CLI received one tool called "Bash,Read,Edit" instead of three separate tools. Every tool call failed — no tool matched the comma-separated name.

From the outside, the agent ran, appeared to think, and timed out. Internally — an agent with no hands.

// Before: one string, wrong
args.push('--allowedTools', 'Bash,Read,Edit');

// After: comma-separated, parsed correctly by the CLI
args.push('--allowedTools', params.allowedTools.join(','));

Trivial fix. The CLI doesn't warn when an --allowedTools value matches nothing. Found it by tracing raw JSON logs until the pattern clicked.

The Pattern

Six bugs. Each one a few lines to fix. Each one invisible until it wasn't — no error messages, no crash dumps, just silent failure. The CLI exits 0. The JSON says is_error: false. The runner marks success. Nothing was actually done.

AI agents need the same operational hardening as any production service: timeouts, health checks, output verification, orphan cleanup, permission audits. The agent doesn't know it's broken. It will report success while producing nothing.

Exit code 0 means nothing without verification. That lesson cost a day. The next one — [Post 3] — cost $5.84.