Your AI coding agent just finished a task. It says "Done!" in the terminal. The code looks reasonable at a glance. Do you merge it?
If you're like 52% of developers, you do — without running a single test. Here's why that's a problem, and a zero-cost fix that takes five minutes to set up.
## The "Almost Right" Problem
AI coding agents are optimized to produce plausible-looking code. They'll generate something that reads correctly, passes a quick glance, and introduces a subtle bug you won't catch until production.
The numbers are sobering:
- CodeRabbit's analysis: AI-generated code contains 1.7x more bugs than human-written code
- Sonar's 2026 State of Code survey: 96% of developers distrust AI-generated code
- The gap: Only 48% verify before committing — a 48-point "verification debt zone"
The worst bugs aren't the obvious ones. Compilation errors are easy. The "almost right" bugs — the ones that pass a quick read but fail on edge cases, skip null checks, or miss error handling — those waste hours.
## Why Manual Review Doesn't Scale
Here's the math problem: AI agents produce 98% more PRs than human-only workflows, but the number of human reviewers stays the same. You can't manually review code at the rate AI generates it.
Amazon figured this out the hard way. After "a trend of incidents," they added mandatory senior engineer sign-off for all AI-generated code. That's the right instinct — verification before merge — but the manual version doesn't scale.
What you need is automated verification at the point of completion. Not after merge. Not during code review. At the moment the agent says "done."
## The Simplest Test Gate
The fix is embarrassingly simple:
- Agent works in its own directory
- When it says "done," run your test suite
- Exit code 0 → accept the work
- Exit code 1 → send the failure output back to the agent
That's it. Zero token cost. Zero API calls. Just an exit code.
Here's the bash version you can set up in five minutes:
```bash
#!/bin/bash
# test-gate.sh — run before accepting any agent output
# Usage: ./test-gate.sh /path/to/agent/worktree

WORKTREE_DIR="$1"
MAX_RETRIES=2

for attempt in $(seq 1 $MAX_RETRIES); do
    echo "Running tests (attempt $attempt/$MAX_RETRIES)..."

    # Run tests in the agent's working directory
    output=$(cd "$WORKTREE_DIR" && cargo test 2>&1)
    exit_code=$?

    if [ $exit_code -eq 0 ]; then
        echo "Tests passed. Ready to merge."
        exit 0
    fi

    # Truncate to last 50 lines — enough context without overwhelming
    truncated=$(echo "$output" | tail -50)
    echo "Tests failed. Output (last 50 lines):"
    echo "$truncated"

    if [ $attempt -lt $MAX_RETRIES ]; then
        echo "Sending failure output back to agent for retry..."
        # Replace with your agent's message mechanism
        echo "$truncated" > "$WORKTREE_DIR/.test-failure"
    fi
done

echo "Tests failed after $MAX_RETRIES attempts. Escalating."
exit 1
```
Replace `cargo test` with `npm test`, `pytest`, `go test ./...`, or whatever your project uses. The pattern is language-agnostic.
## Why Truncate to 50 Lines?
When tests fail, you don't send the full output to the agent. You send the last 50 lines.
Why? Two reasons:
Context window efficiency. A full test run can produce thousands of lines. The useful information — which tests failed and why — is almost always at the end. Fifty lines gives the agent the error messages, stack traces, and assertion failures it needs without wasting tokens on passing tests.
Agents fix their own mistakes. This surprised me. When you send truncated failure output back to the agent, it fixes the issue on its own most of the time. You're not debugging — you're delegating the debugging too.
Here's what this looks like in Rust (from Batty's actual implementation):
```rust
fn run_tests_in_worktree(worktree_dir: &Path) -> Result<(bool, String)> {
    let output = std::process::Command::new("cargo")
        .arg("test")
        .current_dir(worktree_dir)
        .output()?;

    let stdout = String::from_utf8_lossy(&output.stdout);
    let stderr = String::from_utf8_lossy(&output.stderr);
    let mut combined = String::new();
    combined.push_str(&stdout);
    combined.push_str(&stderr);

    // Truncate to last 50 lines for agent feedback
    let lines: Vec<&str> = combined.lines().collect();
    let trimmed = if lines.len() > 50 {
        lines[lines.len() - 50..].join("\n")
    } else {
        combined
    };

    Ok((output.status.success(), trimmed))
}
```
The function returns a tuple: did the tests pass, and what's the (truncated) output? That's the entire decision surface. A boolean and a string.
## What This Catches
In practice, test gating filters out roughly 80% of agent-introduced issues before you even look at the diff:
- Compilation errors the agent didn't notice (common with hallucinated imports)
- Broken imports from APIs that don't exist
- Regressions in existing functionality the agent didn't know about
- Type errors in languages with strict type checking
- Missing edge cases that existing tests already cover
The remaining 20% — architecture mistakes, security issues, subtle logic bugs — you catch in code review. But you're reviewing tested code, not debugging raw output. That's a fundamentally different (and better) use of your time.
## Adding Worktree Isolation
Test gating works best when each agent has its own isolated environment. If three agents share a working directory, running tests becomes meaningless — Agent B's changes might break Agent A's tests.
Git worktrees solve this:
```bash
# Create an isolated working copy for each agent
git worktree add .batty/worktrees/eng-1 -b eng-1/task-42
git worktree add .batty/worktrees/eng-2 -b eng-2/task-43
git worktree add .batty/worktrees/eng-3 -b eng-3/task-44
```
Each agent gets a complete copy of the repo on its own branch. They can't see each other's changes. When you run tests in eng-1's worktree, you're testing only eng-1's work against a clean baseline.
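If you're orchestrating agents programmatically rather than by hand, the same commands can be issued from code. Here's a minimal sketch — illustrative only, not Batty's actual API — that builds the `git worktree add` invocation following the naming convention used above:

```rust
/// Build the argument list for `git worktree add` for one agent/task pair.
/// Follows the convention above: worktree at .batty/worktrees/<agent>,
/// branch named <agent>/task-<id>. Hypothetical helper, for illustration.
fn worktree_add_args(agent: &str, task_id: u32) -> Vec<String> {
    vec![
        "worktree".to_string(),
        "add".to_string(),
        format!(".batty/worktrees/{agent}"),
        "-b".to_string(),
        format!("{agent}/task-{task_id}"),
    ]
}

fn main() {
    let args = worktree_add_args("eng-1", 42);
    // In a real orchestrator you'd run this via:
    //   std::process::Command::new("git").args(&args).status()
    println!("git {}", args.join(" "));
}
```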
Performance tip: Don't create and destroy worktrees per task. Keep persistent worktrees per agent and rotate branches:
```bash
cd .batty/worktrees/eng-1
git checkout main && git pull && git checkout -b eng-1/task-45
```
Much faster. The filesystem stays predictable.
## Leveling Up: Auto-Merge Policies
Once you trust the test gate, you can automate further. Not everything needs manual review. A well-scoped change — small diff, few files, no sensitive paths — can go straight to main if tests pass.
Here's the decision framework:
```yaml
# Example auto-merge policy
auto_merge:
  enabled: true
  require_tests_pass: true   # non-negotiable
  max_diff_lines: 200        # small changes only
  max_files_changed: 5       # limited blast radius
  sensitive_paths:           # always require human review
    - Cargo.toml
    - .env
    - team.yaml
```
The confidence scoring works by subtraction:
- Start at 1.0 (full confidence)
- Subtract 0.1 per file over 3
- Subtract 0.2 per module touched over 1
- Subtract 0.3 if a sensitive path is touched
- Subtract 0.4 if unsafe blocks are detected
Below your threshold (say, 0.8)? Route to manual review. Above it? Auto-merge. Tests must pass regardless.
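The subtraction rules above can be sketched as a small scoring function. This is a hypothetical implementation, not Batty's actual code — the `ChangeSummary` struct and its field names are assumptions made for illustration:

```rust
/// Summary of an agent's diff. Hypothetical struct for illustration;
/// field names are not Batty's actual API.
struct ChangeSummary {
    files_changed: usize,
    modules_touched: usize,
    touches_sensitive_path: bool,
    has_unsafe_blocks: bool,
}

/// Score by subtraction: start at 1.0, deduct per risk signal,
/// clamp at zero.
fn confidence(change: &ChangeSummary) -> f64 {
    let mut score: f64 = 1.0;
    // -0.1 per file over 3
    score -= 0.1 * change.files_changed.saturating_sub(3) as f64;
    // -0.2 per module touched over 1
    score -= 0.2 * change.modules_touched.saturating_sub(1) as f64;
    // -0.3 if a sensitive path is touched
    if change.touches_sensitive_path {
        score -= 0.3;
    }
    // -0.4 if unsafe blocks are detected
    if change.has_unsafe_blocks {
        score -= 0.4;
    }
    score.max(0.0)
}

fn main() {
    let risky = ChangeSummary {
        files_changed: 5,       // 2 over the limit -> -0.2
        modules_touched: 2,     // 1 over the limit -> -0.2
        touches_sensitive_path: true, // -0.3
        has_unsafe_blocks: false,
    };
    // 1.0 - 0.2 - 0.2 - 0.3 = 0.3: below a 0.8 threshold,
    // so this change routes to manual review.
    println!("{:.2}", confidence(&risky));
}
```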
This isn't "autonomous coding." It's automated verification with human oversight for anything non-trivial. The agent proposes, tests verify the basics, and you review what matters.
## What This Doesn't Catch
I want to be honest about limitations. Test gating is a first-line defense, not a complete solution.
It doesn't catch:
- Architecture mistakes — the code works but the approach is wrong
- Security vulnerabilities — tests don't typically cover injection, XSS, or auth bypass
- Subtle logic bugs — edge cases your test suite doesn't cover yet
- Performance regressions — functionally correct but 10x slower
- Design drift — each agent's code works in isolation but the combined result is incoherent
For these, you still need human review. But the ratio changes dramatically. Instead of reviewing every line of every agent's output for basic correctness, you're reviewing tested code for design and architecture. That's a much better use of a senior engineer's time.
## The Retry Loop
When tests fail, don't give up on the agent immediately. Send the failure output back and let it try again.
Here's the pattern:
```
Agent completes task
 → Run tests
   → PASS → Accept (merge or route to review)
   → FAIL → Send last 50 lines back to agent (attempt 1/2)
     → Agent fixes, reports done again
     → Run tests again
       → PASS → Accept
       → FAIL → Send output back (attempt 2/2)
         → Agent fixes again
           → PASS → Accept
           → FAIL → Escalate to human
```
Two retries is the sweet spot. Most fixable issues get resolved on the first retry. If the agent can't fix it in two attempts, it's usually a deeper problem that needs human judgment.
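The pattern can be sketched as a small generic loop. This is an illustrative sketch, not Batty's actual code — `run_tests` stands in for whatever invokes your test suite (returning pass/fail plus truncated output), and `send_to_agent` stands in for your agent's message mechanism:

```rust
/// Outcome of driving an agent task through the test gate.
#[derive(Debug, PartialEq)]
enum GateResult {
    Accepted,
    Escalated(String), // last failure output, handed to a human
}

/// Run the gate: one initial attempt plus up to `max_retries` retries.
/// `run_tests` returns (passed, truncated output); `send_to_agent`
/// forwards failure output so the agent can fix its own mistake.
fn retry_gate<F, G>(max_retries: u32, mut run_tests: F, mut send_to_agent: G) -> GateResult
where
    F: FnMut() -> (bool, String),
    G: FnMut(&str),
{
    let mut last_output = String::new();
    for attempt in 0..=max_retries {
        let (passed, output) = run_tests();
        if passed {
            return GateResult::Accepted;
        }
        last_output = output;
        if attempt < max_retries {
            // Give the agent the truncated failure output and retry.
            send_to_agent(&last_output);
        }
    }
    GateResult::Escalated(last_output)
}

fn main() {
    // Simulate an agent that fixes the failure on its first retry.
    let mut runs = 0;
    let result = retry_gate(
        2,
        || {
            runs += 1;
            if runs == 1 {
                (false, "test auth::token_roundtrip ... FAILED".to_string())
            } else {
                (true, String::new())
            }
        },
        |failure| println!("sending back to agent: {failure}"),
    );
    assert_eq!(result, GateResult::Accepted);
}
```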
After two failed retries, escalate with full context:
```
Tests failed after 2 retries.
Task: Add JWT auth endpoint
Branch: eng-1/task-42
Last failure output: [truncated]
```
Now you're debugging with context, not starting from scratch.
## Getting Started in Five Minutes
If you take one thing from this article, it's this: add a test gate between "agent says done" and "code hits main."
The minimal version:
```bash
# After your agent finishes a task:
cd /path/to/agent/worktree
cargo test   # or: npm test | pytest | go test ./...

# Exit code 0? Review and merge.
# Exit code 1? Show the agent the failure output.
```
That's it. No framework. No new dependencies. No token cost. Just an exit code that separates "looks right" from "actually works."
If you want the automated version — worktree isolation, test gates, retry loops, auto-merge policies, all managed in your terminal — Batty wraps this entire workflow into a single `batty start` command. But the pattern works with any setup. A bash script and `cargo test` gets you 80% of the way.
## Try It
```bash
cargo install batty-cli
cd your-project
batty init --template pair
batty start --attach
```
The `pair` template gives you one architect and one engineer with test gating enabled by default. Scale up from there.
What's your verification step? I'm genuinely curious — if you're running AI coding agents, how do you know the output actually works before you merge it? Drop a comment.
Batty is open source, built in Rust, and published on crates.io. GitHub: github.com/battysh/batty