Gabriel Anhaia

A Cheap Eval Harness for Production LLM Calls in 150 Lines


A prompt change shipped on Friday at 4:42pm. The diff was 11 words: a clarification about how the model should handle ambiguous order IDs. By Monday morning, support had 43 tickets queued and the on-call engineer was reading them in #cs-escalations trying to figure out what changed. Pulling the trace, you find 14% of customer queries now hit a fallback path because the model started returning a slightly different JSON shape. Nobody ran an eval. There was no eval to run.

That weekend is what an eval harness exists to prevent. It's not a benchmark or a leaderboard. It's a small, dumb script that runs your last 80 representative production prompts against the model you ship, scores them, and fails the build when the pass rate drops. Under 150 lines. Under five dollars per run. No SaaS contract, no UI, no platform team meeting. The newsletter post by Gergely Orosz on a pragmatic guide to LLM evals for devs makes the same case at greater length; this post is the runnable version.

The golden set is a CSV

A golden set is just a fancy name for a flat file your team agrees on. Inputs, expected outputs, a judge type per row. CSV is fine. JSONL is fine. Pick one and check it into the repo next to the prompts.

id,prompt,expected,judge
1,Extract order ID from "ORD-7782 shipped",ORD-7782,exact
2,Summarize in <=20 words: "...",A faithful summary of 20 words or fewer,llm
3,JSON status for "refund pending",^"status":\s*"pending",regex

Three rows is a stub. Sixty rows is the floor for catching real regressions. The questions you put in here matter more than any harness code you write. Pull them from real production traffic. Sample 200, dedupe, label, and keep the set that covers your top intents plus the ten requests that have ever broken in incident reports. The Arize golden-dataset writeup has a longer treatment of how to build one without poisoning it with your own assumptions.
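A minimal sampling sketch for that step, assuming your traffic lands in a JSONL trace log with one prompt field per line; the path and field name are placeholders for whatever your own logging actually exports:

import json
import random

def sample_candidates(trace_path: str, n: int = 200) -> list[str]:
    # Assumes one JSON object per line with a "prompt" field; adjust to
    # whatever your observability stack writes out.
    with open(trace_path) as fh:
        prompts = [json.loads(line)["prompt"] for line in fh if line.strip()]
    # Dedupe on a normalized form so near-identical requests collapse.
    seen, unique = set(), []
    for p in prompts:
        key = " ".join(p.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)
    random.seed(0)  # reproducible sample for the labeling session
    return random.sample(unique, min(n, len(unique)))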

The fourth column is the judge. Three values: exact, regex, llm. That decision per row is where teams overthink and stall.

  • Use exact when the output is a fixed token: an ID, a class label, a yes/no.
  • Use regex when the output has a stable shape but a free-form payload: JSON keys, status values, citation formats.
  • Use llm only when the answer is open-ended and you genuinely cannot pin a shape: summaries, rewrites, tone checks.

If half your set is llm, you wrote a vibes test. Push back on it.

The runner hits prod

The runner is a function. Read the CSV, call the model, score each row, return a per-test result.

import csv
import re
from dataclasses import dataclass
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-5"

@dataclass
class Result:
    id: str
    prompt: str
    output: str
    expected: str
    judge: str
    passed: bool
    reason: str

One dataclass for the whole pipeline. It serializes cleanly to JSON for CI artifacts, and it gives you a single place to add fields when you want latency or cost later.
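Serializing it is one call to dataclasses.asdict. A quick sketch with a made-up row, just to show the shape that lands in the CI artifact:

import json
from dataclasses import asdict

# Hypothetical row, purely to illustrate the serialized shape.
r = Result(
    id="1",
    prompt='Extract order ID from "ORD-7782 shipped"',
    output="ORD-7782",
    expected="ORD-7782",
    judge="exact",
    passed=True,
    reason="exact match",
)
print(json.dumps(asdict(r), indent=2))
# Adding a latency_ms or cost_usd field to Result later changes nothing here.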

def call_model(prompt: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text.strip()

Seven lines, one network call. The point is that the runner hits the same model your product hits, with the same name, the same parameters, and the same system prompt if you have one (omitted from call_model above for brevity). An eval against a different model is an eval of a different system.
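If your product does send a system prompt, load the exact same string the product uses. A minimal sketch, assuming the prompt is checked in at prompts/system.txt; that path is a placeholder, not a convention from this post:

from pathlib import Path

# Assumption: the production system prompt lives in the repo next to the
# other prompts. Point this at wherever yours actually is.
SYSTEM_PROMPT = Path("prompts/system.txt").read_text().strip()

def call_model(prompt: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=SYSTEM_PROMPT,  # the same system prompt the product sends
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text.strip()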

def load_golden(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def run(path: str) -> list[Result]:
    rows = load_golden(path)
    results = []
    for row in rows:
        out = call_model(row["prompt"])
        passed, reason = judge(row, out)
        results.append(Result(
            id=row["id"],
            prompt=row["prompt"],
            output=out,
            expected=row["expected"],
            judge=row["judge"],
            passed=passed,
            reason=reason,
        ))
    return results

That is the runner. The dispatch lives in judge.

Three judges, one dispatch

Exact match is one line. Regex is one line. LLM-as-judge is twelve. Keep them in one file so the harness fits on a screen.

def judge_exact(expected: str, out: str) -> tuple[bool, str]:
    ok = out.strip() == expected.strip()
    return ok, "exact match" if ok else "string differs"

def judge_regex(pattern: str, out: str) -> tuple[bool, str]:
    ok = re.search(pattern, out) is not None
    return ok, "regex hit" if ok else f"no match: {pattern}"

Both return (passed, reason). The reason is what shows up in the CI log when a test fails. Make it specific enough that the engineer reading it on Tuesday doesn't have to re-run anything to understand what went wrong.
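If "string differs" turns out to be too terse, here is a variant that puts both strings into the reason so the log carries the comparison itself; a sketch, not one of the three judges above:

def judge_exact_verbose(expected: str, out: str) -> tuple[bool, str]:
    ok = out.strip() == expected.strip()
    if ok:
        return True, "exact match"
    # Truncate so a runaway output doesn't flood the CI log.
    return False, f"expected {expected.strip()!r}, got {out.strip()[:120]!r}"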

The LLM-as-judge call is a separate model invocation with a tight rubric. The trick is keeping the judge prompt short, deterministic, and graded on a binary scale. A 1-to-5 score sounds rigorous and is unstable across runs.

JUDGE_PROMPT = """You are grading an LLM output against a rubric.
Rubric: {expected}
Candidate output: {output}
Reply with PASS or FAIL on the first line, then a one-line reason.
"""

def judge_llm(expected: str, out: str) -> tuple[bool, str]:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=80,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                expected=expected, output=out,
            ),
        }],
    )
    text = resp.content[0].text.strip()
    first = text.split("\n", 1)[0].strip().upper()
    return first.startswith("PASS"), text

A different model family for the judge would be better. Same-family judges have a self-preference bias, as documented by Zheng et al. 2023 (MT-Bench) and a now-substantial follow-up literature. For a small team's first harness, the cost win of one provider beats the bias risk. Swap the judge model the day the golden set grows past 200 rows.
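One low-friction way to set up that swap now is to read the judge model from the environment instead of hard-coding it. A sketch; EVAL_JUDGE_MODEL is a made-up variable name, and leaving the provider entirely would also mean a second client, which is out of scope here:

import os

# Falls back to the product model, so today's single-provider setup keeps
# working until you point this at a different family.
JUDGE_MODEL = os.environ.get("EVAL_JUDGE_MODEL", MODEL)

# Inside judge_llm, the only edit is the model argument:
#     resp = client.messages.create(model=JUDGE_MODEL, max_tokens=80, ...)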

def judge(row: dict, out: str) -> tuple[bool, str]:
    kind = row["judge"]
    if kind == "exact":
        return judge_exact(row["expected"], out)
    if kind == "regex":
        return judge_regex(row["expected"], out)
    if kind == "llm":
        return judge_llm(row["expected"], out)
    raise ValueError(f"unknown judge: {kind}")

That is the harness. The runner, three judges, one dispatch. Sixty lines of Python. The other ninety are reporting and the CI hook.

Per-test pass rate over a single average

The output of the harness is a pass-rate per test ID, not a single number for the run. A 92% overall pass rate hides the fact that test 47 has been failing for three weeks. The CI report has to show every row's status.

def report(results: list[Result]) -> dict:
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    return {
        "passed": passed,
        "total": total,
        "rate": passed / total if total else 0.0,
        "failures": [
            {
                "id": r.id,
                "prompt": r.prompt[:80],
                "output": r.output[:120],
                "expected": r.expected,
                "reason": r.reason,
            }
            for r in results if not r.passed
        ],
    }

Persist that JSON to eval-report.json on every run. Two lines of json.dump and you have a CI artifact engineers can diff between PR and main. Keep the last 30 reports in S3 and you have a regression-on-PR system without buying one.

import json
import sys

def main(path: str = "golden.csv") -> int:
    results = run(path)
    rep = report(results)
    with open("eval-report.json", "w") as f:
        json.dump(rep, f, indent=2)
    print(f"{rep['passed']}/{rep['total']} "
          f"({rep['rate']:.1%})")
    for fail in rep["failures"]:
        print(f"  FAIL {fail['id']}: {fail['reason']}")
    return 0 if rep["rate"] >= 0.90 else 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "golden.csv"))

The threshold (0.90 here) is per team. Start at the pass rate of your current main branch minus 5%. Tighten quarterly. A threshold of 1.0 sounds principled and produces a flaky build that everyone learns to ignore.
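One way to avoid hand-tuning that number is to derive it from main's latest report and to surface only the tests that are newly failing. A sketch, assuming the most recent report from main has been pulled down as main-report.json; the filename and the download step are assumptions, not something the workflow below provides:

import json

def load_report(path: str) -> dict:
    with open(path) as fh:
        return json.load(fh)

def threshold_from_main(main_path: str = "main-report.json",
                        margin: float = 0.05,
                        default: float = 0.90) -> float:
    # Main's current pass rate minus a margin; fall back to the fixed
    # threshold when no baseline report is available.
    try:
        return max(0.0, load_report(main_path)["rate"] - margin)
    except FileNotFoundError:
        return default

def new_failures(pr_path: str = "eval-report.json",
                 main_path: str = "main-report.json") -> list[str]:
    # Test IDs failing on the PR that were not already failing on main.
    already = {f["id"] for f in load_report(main_path)["failures"]}
    return [f["id"] for f in load_report(pr_path)["failures"]
            if f["id"] not in already]

main() would then compare rep["rate"] against threshold_from_main() instead of the literal 0.90, and print the new_failures list first so the regression is the first thing in the log.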

Wiring it into CI

GitHub Actions, one workflow with two triggers. The pull-request trigger is the regression gate; the nightly schedule on main tracks drift over time even when nobody is shipping.

# .github/workflows/eval.yml
name: eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "golden.csv"
      - "harness/**"
  schedule:
    - cron: "0 6 * * *"   # nightly run on main to track drift
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic
      - env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python harness/eval.py golden.csv
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json

The paths: filter is what keeps the PR side cheap: the eval only runs on pull requests that touch the prompts, the golden set, or the harness itself, so a typical PR to a non-prompt file pays zero cents. The eval cost lands where the eval matters.

The cost math for a 60-row golden set, three rows graded by LLM-as-judge (rough estimates based on Anthropic's published Sonnet pricing as of April 2026):

  • 60 model calls, ~400 input and ~150 output tokens each. On Sonnet-class pricing, that lands near $0.40.
  • 3 judge calls, ~600 input and ~80 output tokens each. About $0.04.
  • Total per run: under $0.50.

Even if your set grows to 300 rows with 30 LLM-judged ones, you're at an estimated $2.20 per run. A team merging 20 prompt-touching PRs a week pays around $44. The first eval that catches a Friday afternoon regression has paid for the year.

For larger suites and a comparison to platform tools, the promptfoo eval-harness writeup and the Pragmatic Engineer evals piece cover the next steps after this one.

What this harness does not do

It does not measure latency. Cost per request, tail-latency regressions, prompt-injection vulnerabilities, hallucination on inputs outside your set — none of that is in scope. It also won't replace a real observability stack with traces, spans, and eval results joined to user feedback. What it does is catch the 14% of customer queries that would otherwise have rolled out unnoticed until support found them. That is the bar. The Friday-to-Monday gap is closed. Add the next 50 lines when you have a second incident and not before.


If this was useful

This harness is the smallest unit of the loop the LLM Observability Pocket Guide covers in full: golden sets joined to traces, judge selection that survives audit, CI gates that don't go flaky, and the eval-result schema you'll wish you had picked on day one. If your team is shipping prompt changes faster than you can verify them, it's the book for that gap.

LLM Observability Pocket Guide
