- Book: AI Agents Pocket Guide
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You changed one line of the system prompt. The chat-eval suite still passes: output quality looks fine, no hallucinations, the model still answers in JSON. You ship. Two days later, support says the agent stopped sending follow-up emails after refunds. It is calling log_refund instead of send_followup. The text outputs were right; the tool calls were silently rewired.
Output evals do not catch this. You need an eval that grades the tool trajectory: which tools the agent called, in what order, with what arguments. The harness is small. About ninety lines of Python, three judges in a ladder, and a golden CSV. Total bill on a 30-row golden set: a few dollars per run.
The shape of the harness
Five pieces, in this order:
- A golden CSV — input plus the expected tool sequence (top-level tool names and the key arguments that matter).
- A runner that calls your agent against each row and captures the actual sequence of `tool_use` blocks.
- Three judges in a ladder:
  - Judge 1: exact tool-name match. Cheapest. Pure string compare.
  - Judge 2: argument validity via Pydantic. Still free, runs on local CPU.
  - Judge 3: LLM-as-judge for semantic equivalence. Only fires when judges 1 and 2 disagree on whether to pass the row.
- A report with per-row pass/fail and per-judge agreement rates.
- A PR gate hook for GitHub Actions.
The ladder is the only interesting part. Most rows pass at judge 1 (exact match) for free. A small slice trips judge 1 but passes judge 2 because the model emitted a synonym tool with the same effect (for example, email_user vs send_email if both are registered). Only the residue (rows where judges 1 and 2 disagree) pays for an LLM call. The bill stays in the single dollars.
The golden CSV
Three columns: input, expected_tools (pipe-separated), expected_args (JSON).
```
input,expected_tools,expected_args
"Refund order 4421","lookup_order|issue_refund|send_followup","{""order_id"":""4421""}"
"What's the status of order 9912?","lookup_order","{""order_id"":""9912""}"
"Cancel order 7733 and email customer","lookup_order|cancel_order|send_email","{""order_id"":""7733""}"
```
You do not need to capture every argument, only the load-bearing ones. For a refund flow, order_id is load-bearing; the email subject line is not. The rule: if the wrong value here would cause the wrong real-world side effect, capture it.
Thirty rows is a working baseline. Fifty is comfortable. The rows come from your production logs: pull the last 200 conversations, sample 30 across the tool-use distribution, and label them by hand once.
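The sampling step is the only part of that labeling pass worth scripting. Here is a minimal sketch, assuming a hypothetical log shape of `{"input": str, "tools": [str, ...]}` per conversation; your logging format will differ, and the hand-labeling of `expected_args` still happens afterwards.

```python
# Hypothetical log shape: {"input": "Refund order 4421", "tools": ["lookup_order", ...]}.
# Adjust the field names to whatever your production logging actually captures.
import csv
import random
from collections import defaultdict


def sample_golden(conversations: list[dict], n: int = 30) -> list[dict]:
    """Round-robin sample across distinct tool sequences so rare flows survive."""
    by_shape = defaultdict(list)
    for c in conversations[-200:]:  # last 200 conversations
        by_shape["|".join(c["tools"])].append(c)
    picks = []
    while len(picks) < n and any(by_shape.values()):
        for bucket in by_shape.values():
            if bucket and len(picks) < n:
                picks.append(bucket.pop(random.randrange(len(bucket))))
    return picks


def write_golden_draft(conversations: list[dict], path: str = "golden_draft.csv") -> None:
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["input", "expected_tools", "expected_args"])
        for c in sample_golden(conversations):
            # expected_args stays empty here: fill in the load-bearing ones by hand.
            w.writerow([c["input"], "|".join(c["tools"]), "{}"])
```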
The 90-line harness
```python
import csv
import json

from anthropic import Anthropic
from pydantic import BaseModel, ValidationError

client = Anthropic()
MODEL = "claude-sonnet-4-5"
JUDGE_MODEL = "claude-haiku-4-5"


class OrderArgs(BaseModel):
    order_id: str


ARG_SCHEMAS = {
    "issue_refund": OrderArgs,
    "send_followup": OrderArgs,
    "send_email": OrderArgs,
    "cancel_order": OrderArgs,
    "lookup_order": OrderArgs,
}


def run_agent(user_input: str, tools: list, system: str) -> list:
    """Run the agent loop and capture the sequence of tool calls it makes."""
    msgs = [{"role": "user", "content": user_input}]
    trajectory = []
    for _ in range(8):  # hard cap on turns so a looping agent can't run up the bill
        r = client.messages.create(
            model=MODEL, max_tokens=1024,
            system=system, tools=tools, messages=msgs,
        )
        msgs.append({"role": "assistant", "content": r.content})
        blocks = [b for b in r.content if b.type == "tool_use"]
        if not blocks:
            break
        results = []
        for b in blocks:
            trajectory.append({"name": b.name, "input": b.input})
            # Fake tool dispatch: every call gets a flat "OK" back.
            results.append({"type": "tool_result",
                            "tool_use_id": b.id, "content": "OK"})
        msgs.append({"role": "user", "content": results})
    return trajectory


def judge_exact(actual: list, expected: list) -> bool:
    """Judge 1: exact tool-name sequence match."""
    return [t["name"] for t in actual] == expected


def judge_args(actual: list, expected_args: dict) -> bool:
    """Judge 2: Pydantic validation plus presence of the load-bearing arguments."""
    for call in actual:
        schema = ARG_SCHEMAS.get(call["name"])
        if schema:
            try:
                schema(**call["input"])
            except ValidationError:
                return False
    return all(any(c["input"].get(k) == v for c in actual)
               for k, v in expected_args.items())


def judge_semantic(actual, expected, user_input) -> bool:
    """Judge 3: LLM-as-judge, called only when judges 1 and 2 disagree."""
    prompt = (f"User: {user_input}\nExpected: {expected}\n"
              f"Actual: {[t['name'] for t in actual]}\n"
              "Are these semantically equivalent for the user's "
              "goal? Reply with only YES or NO.")
    r = client.messages.create(model=JUDGE_MODEL, max_tokens=8,
                               messages=[{"role": "user", "content": prompt}])
    return r.content[0].text.strip().upper().startswith("YES")


def run_suite(csv_path: str, tools: list, system: str) -> dict:
    rows = list(csv.DictReader(open(csv_path)))
    out = {"total": len(rows), "passed": 0, "rows": [],
           "pass_at_ladder": 0, "fail_at_ladder": 0, "semantic": 0}
    for row in rows:
        expected = row["expected_tools"].split("|")
        expected_args = json.loads(row["expected_args"])
        actual = run_agent(row["input"], tools, system)
        e = judge_exact(actual, expected)
        a = judge_args(actual, expected_args)
        if e and a:
            verdict, judge = True, "exact+args"
            out["pass_at_ladder"] += 1
        elif e != a:
            # The cheap judges disagree: escalate to the LLM judge.
            verdict = judge_semantic(actual, expected, row["input"])
            judge = "semantic"
            out["semantic"] += 1
        else:
            verdict, judge = False, "exact+args"
            out["fail_at_ladder"] += 1
        out["passed"] += int(verdict)
        out["rows"].append({"input": row["input"], "verdict": verdict,
                            "judge": judge, "actual": actual})
    return out
```
That is the whole thing: agent runner, three judges, CSV-driven suite. Save it as harness.py. Run it like this:
```python
if __name__ == "__main__":
    TOOLS = [
        {"name": "lookup_order", "description": "Look up an order",
         "input_schema": {"type": "object",
                          "properties": {"order_id": {"type": "string"}},
                          "required": ["order_id"]}},
        # ... your other tools
    ]
    report = run_suite("golden.csv", TOOLS,
                       "You are a refund assistant.")
    print(json.dumps(report, indent=2, default=str))
    pass_rate = report["passed"] / report["total"]
    print(f"Pass rate: {pass_rate:.1%}")
    exit(0 if pass_rate >= 0.85 else 1)
```
The exit code is what your CI hooks into.
Cost math, with the actual numbers
Pricing as of late April 2026 — check the Anthropic pricing page before relying on these. Claude Sonnet 4.5: $3.00 per million input tokens, $15.00 per million output tokens. Claude Haiku 4.5: $1.00 per million input tokens, $5.00 per million output tokens.
Per-row cost on a refund-flow agent that does 3 tool calls before finishing:
- 4 messages back to the API at ~1.5K input tokens each (system prompt + tool schemas + growing trajectory) ≈ 6K input tokens.
- ~600 output tokens across the 4 turns.
- Sonnet input: 6K × $3 / 1M = $0.018.
- Sonnet output: 600 × $15 / 1M = $0.009.
- Per-row Sonnet cost: ~$0.027.
For 30 rows: 30 × $0.027 = $0.81 for the agent runs themselves. Judges 1 and 2 run on local CPU (free). Judge 3 fires only on the rows where exact-match and args-validity disagree (call it 4 of 30 in a healthy suite). Each Haiku judge call is ~300 input tokens + 5 output tokens ≈ $0.0003. Negligible.
So the floor is roughly $0.81 per run; even if every row tripped judge 3, the ceiling stays under $1.00 on a 30-row suite. The "$3 per run" headline is a budget ceiling, not the floor. It absorbs longer trajectories (5–8 tool calls per row), bigger tool schemas, and the Pydantic schema set growing as your tool surface grows. On a refund-flow agent with ~12 registered tools, runs typically land between $1.20 and $2.40. Three dollars is the round number you put in the GitHub Actions cost ledger so the budget never surprises anyone.
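If you would rather compute the ceiling than eyeball it, a back-of-the-envelope helper is enough. The token counts below are the assumptions from this section, not measurements; swap in your own trajectory lengths and tool-schema sizes.

```python
# Rough per-run cost estimate. Token counts are the assumptions from the text
# above, not measured values; prices are the Sonnet 4.5 / Haiku 4.5 list prices.
SONNET_IN, SONNET_OUT = 3.00 / 1e6, 15.00 / 1e6  # dollars per token
HAIKU_IN, HAIKU_OUT = 1.00 / 1e6, 5.00 / 1e6


def run_cost(rows=30, turns_per_row=4, in_tokens_per_turn=1500,
             out_tokens_per_row=600, semantic_rows=4,
             judge_in=300, judge_out=5) -> float:
    agent = rows * (turns_per_row * in_tokens_per_turn * SONNET_IN
                    + out_tokens_per_row * SONNET_OUT)
    judge = semantic_rows * (judge_in * HAIKU_IN + judge_out * HAIKU_OUT)
    return agent + judge


print(f"floor:   ${run_cost():.2f}")                 # ~$0.81
print(f"ceiling: ${run_cost(semantic_rows=30):.2f}")  # every row escalates, still under $1
```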
Compare that to a flat "always use LLM-as-judge" design: every one of the 30 rows pays for a judge call. With a careful judge prompt (~800 input + 50 output tokens), each call costs ~$0.001 on Haiku, so the whole suite is still only a few cents. The ladder buys you something else: audit-friendliness. When a regression lands, you can see exactly which rows tripped which judge, and the LLM-judge column tells you whether it was a rephrase or a real break.
Wiring it into a PR gate
GitHub Actions, about a dozen lines:
```yaml
name: agent-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install anthropic pydantic
      - env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python harness.py
```
The harness exits non-zero below 85% pass rate, blocking the PR. Comment the report into the PR with gh pr comment if you want the diff visible to reviewers.
Pair this with LLM observability tooling on your production traffic and you close the loop: production tells you which trajectories matter, the golden CSV pins them, the harness gates the regressions before merge.
Where this falls down
Three honest limits.
The harness assumes deterministic tool dispatch for the runner: every row gets "OK" back from the fake tool. That keeps cost predictable but means you cannot grade flows where the agent's next decision depends on real tool output. For those, replay tool results from a fixture or run against a staging environment with cost caps in place.
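One cheap way to get partway there, short of a staging environment, is to key canned tool results off the tool name instead of returning a flat "OK". A sketch; the fixture contents here are made up for illustration:

```python
# Hypothetical fixtures: canned tool results keyed by tool name. Swap the flat
# "OK" in run_agent's tool_result block for fake_tool_result(b.name, b.input).
FIXTURES = {
    "lookup_order": '{"order_id": "4421", "status": "delivered", "total": 89.00}',
    "issue_refund": '{"refund_id": "rf-991", "status": "issued"}',
}


def fake_tool_result(name: str, args: dict) -> str:
    # Key on (name, specific argument values) instead if the same tool
    # needs to return different data on different rows.
    return FIXTURES.get(name, "OK")
```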
The Pydantic judge only catches the arguments you wrote schemas for. New tool, no schema, judge 2 quietly passes. Fix: add a CI check that fails if a tool name appears in expected_tools without an entry in ARG_SCHEMAS.
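A minimal version of that check, assuming the golden.csv and harness.py layout from above (the script name is arbitrary, and importing harness builds the Anthropic client, so ANTHROPIC_API_KEY has to be set wherever this runs):

```python
# ci_check_schemas.py -- fail CI if an expected tool has no Pydantic schema.
import csv
import sys

from harness import ARG_SCHEMAS  # harness.py builds the API client on import

expected = set()
for row in csv.DictReader(open("golden.csv")):
    expected.update(row["expected_tools"].split("|"))

missing = sorted(expected - set(ARG_SCHEMAS))
if missing:
    print(f"No ARG_SCHEMAS entry for: {', '.join(missing)}")
    sys.exit(1)
```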
The semantic judge is the weakest link. Haiku 4.5 will sometimes wave through a trajectory that swapped a destructive tool for a logging one if the framing sounds plausible. Tighten the judge prompt with explicit examples of what should fail: swapping issue_refund for log_refund is a real break, and the prompt should say so in those exact words. The book covers judge-prompt patterns for cases like this.
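One way to tighten it, keeping the YES/NO contract judge_semantic already relies on, is a rubric prepended to the judge prompt. A sketch, with the failure cases this article names spelled out:

```python
# Sketch of a stricter rubric to prepend to judge_semantic's prompt. The
# issue_refund/log_refund and send_email/email_user pairs come from this
# article; add the destructive-vs-logging pairs from your own tool surface.
JUDGE_RUBRIC = (
    "Two tool trajectories are equivalent only if they cause the same "
    "real-world side effects for the user's goal.\n"
    "FAIL: issue_refund replaced by log_refund (logging is not refunding).\n"
    "FAIL: send_followup dropped from the sequence entirely.\n"
    "PASS: send_email replaced by email_user when both send the same message.\n"
)
```

Prepend JUDGE_RUBRIC to the prompt string in judge_semantic and keep "Reply with only YES or NO" as the final line.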
Ship the harness this week
Twenty rows in your CSV. Three judges. One PR gate. Ten dollars of API budget for the first month while you tune. You go from "we changed the prompt and hope it still works" to "we changed the prompt, the harness ran, here is the diff against last week's run." The cost of not having this is the silent rewrite, the kind that ships, runs for two days, and shows up in support tickets instead of in your dashboard.
If this was useful
The AI Agents Pocket Guide covers the patterns this harness assumes: golden trajectories, judge ladders, cost ceilings on agent traces, and production guardrails for autonomous loops. It is a short book, written for engineers who already know they need evals on both tool calls and text outputs.
