I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer.
The tests were green. But they were green for the wrong reason: every assertion was a single LLM call against a single golden answer, on a model whose sampling happened to land in our favor that day. We had built a coin flip and called it a test.
This article is the harness I wish I'd had on day one. Not another wrapper around DeepEval or RAGAS — a thin layer on top of pytest that solves the five things every production LLM evaluation harness needs and most tutorials skip:
- Flake-aware tests. LLMs are stochastic. Single-shot assertions are noise.
- Cost-bounded tests. A single misbehaving prompt should not burn $40 on one CI run.
- Golden set with versioning. When a result changes, you need to know if the answer drifted or the model did.
- Regression-only CI gating. Block PRs on degradation vs. baseline, not on absolute floors that bit-rot.
- Multi-metric scoring. Semantic similarity AND structured assertion AND token cost. Any one of these alone lies.
All code below is complete and runnable, with no "# rest of code" placeholders. Drop it in tests/llm/ and you have something that actually works.
Why Single-Shot Assertions Lie
Here's the test that failed me. Looks fine, right?
def test_classifier_returns_billing():
result = call_llm("My credit card was charged twice")
assert result["category"] == "billing"
I shipped this. It passed locally. It passed in CI for two weeks. Then production traffic exposed that the model returns "billing" 73% of the time on this exact input — the other 27% it returns "payment", "transaction", or once memorably "financial dispute".
A test that passes 73% of the time will eventually go green on the one CI run that matters, then fail in front of customers every hour after. The fix is not "add temperature=0": that hides the variance, it doesn't measure it. The fix is to assert on the distribution.
@flake_aware(runs=10, pass_threshold=0.8)
def test_classifier_returns_billing():
result = call_llm("My credit card was charged twice")
return result["category"] == "billing"
Run it 10 times, fail the test if fewer than 80% return the right category. Now your green test means something.
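If you want to sanity-check those numbers, the binomial math fits in a few lines. A quick sketch, pure stdlib, nothing from the harness:

from math import comb

def p_pass(p: float, runs: int = 10, min_successes: int = 8) -> float:
    """Probability that a behavior with per-call success rate p clears the bar."""
    return sum(comb(runs, k) * p**k * (1 - p) ** (runs - k)
               for k in range(min_successes, runs + 1))

print(f"{p_pass(0.73):.2f}")  # ~0.47: the 73%-reliable behavior now fails more often than it passes
print(f"{p_pass(0.95):.2f}")  # ~0.99: a genuinely solid behavior still passes reliably

A single-shot assertion lets the flaky behavior through 73% of the time; the 8-of-10 bar drops that below 50%, so a couple of CI runs will surface it.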
The Harness — One File, Drop In
Here's the full plugin. Save as tests/llm/conftest.py:
# tests/llm/conftest.py
import functools, json, os, statistics, time
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import Callable
import pytest
RESULTS_PATH = Path("tests/llm/_results.json")
COST_CAP_USD = float(os.getenv("LLM_TEST_COST_CAP_USD", "5.0"))
@dataclass
class EvalResult:
name: str
pass_rate: float
runs: int
total_cost_usd: float
p50_latency_ms: float
p95_latency_ms: float
notes: dict = field(default_factory=dict)
@dataclass
class _RunStats:
successes: int = 0
runs: int = 0
cost_usd: float = 0.0
latencies_ms: list = field(default_factory=list)
notes: dict = field(default_factory=dict)
_run_total_cost = {"value": 0.0}
_collected: dict[str, EvalResult] = {}
def flake_aware(runs: int = 5, pass_threshold: float = 0.8, max_cost_usd: float = 0.50):
"""
Decorator: run the wrapped fn `runs` times. Test passes if pass-rate >= threshold
and total cost is under max_cost_usd. The wrapped fn must return a bool or a
(bool, cost_usd, latency_ms) tuple.
"""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)  # keep the original test name for pytest collection and reporting
        def wrapper(*args, **kwargs):
stats = _RunStats()
for _ in range(runs):
start = time.monotonic()
outcome = fn(*args, **kwargs)
latency_ms = (time.monotonic() - start) * 1000
if isinstance(outcome, tuple):
ok, cost, reported_latency = outcome
latency_ms = reported_latency or latency_ms
else:
ok, cost = bool(outcome), 0.0
stats.runs += 1
stats.successes += 1 if ok else 0
stats.cost_usd += cost
stats.latencies_ms.append(latency_ms)
_run_total_cost["value"] += cost
                if _run_total_cost["value"] > COST_CAP_USD:
                    # pytest.exit stops the whole session; pytest.fail would only fail this test
                    pytest.exit(
                        f"Global cost cap of ${COST_CAP_USD} exceeded. "
                        f"Aborting suite. Spent: ${_run_total_cost['value']:.2f}",
                        returncode=1,
                    )
pass_rate = stats.successes / stats.runs
result = EvalResult(
name=fn.__name__,
pass_rate=pass_rate,
runs=stats.runs,
total_cost_usd=round(stats.cost_usd, 4),
p50_latency_ms=round(statistics.median(stats.latencies_ms), 1),
p95_latency_ms=round(_p95(stats.latencies_ms), 1),
notes=stats.notes,
)
_collected[fn.__name__] = result
if stats.cost_usd > max_cost_usd:
pytest.fail(
f"{fn.__name__} cost ${stats.cost_usd:.3f} > "
f"per-test cap ${max_cost_usd}"
)
assert pass_rate >= pass_threshold, (
f"{fn.__name__}: pass rate {pass_rate:.0%} below threshold {pass_threshold:.0%} "
f"({stats.successes}/{stats.runs}) — cost ${stats.cost_usd:.3f}"
)
        return wrapper
return decorator
def _p95(values: list[float]) -> float:
if not values:
return 0.0
s = sorted(values)
k = max(0, int(round(0.95 * (len(s) - 1))))
return s[k]
@pytest.fixture(scope="session", autouse=True)
def _persist_results():
yield
RESULTS_PATH.parent.mkdir(parents=True, exist_ok=True)
payload = {name: asdict(r) for name, r in _collected.items()}
RESULTS_PATH.write_text(json.dumps(payload, indent=2, sort_keys=True))
def pytest_terminal_summary(terminalreporter, exitstatus, config):
if not _collected:
return
tr = terminalreporter
tr.write_sep("=", "LLM EVAL SUMMARY")
for r in sorted(_collected.values(), key=lambda x: x.name):
tr.write_line(
f"{r.name:40s} pass={r.pass_rate:>5.0%} runs={r.runs:>3} "
f"cost=${r.total_cost_usd:>5.3f} p95={r.p95_latency_ms:>6.0f}ms"
)
tr.write_line(f"TOTAL COST: ${_run_total_cost['value']:.3f} (cap ${COST_CAP_USD})")
Three things to notice. The decorator hands pytest a plain test function, so no plugin machinery is required. Cost is tracked both per-test (so a single bad prompt can't blow the budget) and globally (so the suite can't either). And every suite run dumps a structured _results.json at session teardown; that file is what the regression check feeds on.
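For reference, here's the shape of _results.json after a run. The field names come straight from the EvalResult dataclass; the numbers are illustrative:

{
  "test_classifier_billing_intent": {
    "name": "test_classifier_billing_intent",
    "notes": {},
    "p50_latency_ms": 412.3,
    "p95_latency_ms": 1103.7,
    "pass_rate": 0.9,
    "runs": 10,
    "total_cost_usd": 0.0312
  }
}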
Cost Bounding — Why It Has to Be Two Layers
The first time I ran an LLM eval suite without a cap, a misbehaving prompt template hit the model with a 12K-token context on every retry. 200 tests × 5 runs × $0.04 per call = $40 in one CI run. The PR was a one-line typo fix.
You need both caps:
- Per-test cap (the max_cost_usd=0.50 on the decorator): catches a single regression that explodes context.
- Global cap (the COST_CAP_USD env var, default $5): catches an accidental loop, a misconfigured runner, or someone leaving runs=200 on by mistake.
The global cap aborts the suite mid-run. The per-test cap fails just that test. That separation matters in practice — you want one runaway test to fail loudly without killing 40 other tests that would have caught real regressions.
Multi-Metric Scoring — The Single Number Trap
Most evaluation tutorials show you cosine similarity to a golden answer. Cosine similarity alone is a trap. A model can return "the customer should be issued a refund per policy 4.2" while the golden says "issue refund" — semantically aligned, but the structured field your downstream code parses now has a policy reference that breaks the regex.
Score on three axes, every time:
# tests/llm/test_classifier.py
from conftest import flake_aware
from your_app.llm import call_llm # your wrapper that returns (response_dict, cost_usd, latency_ms)
GOLDEN_BILLING = {
"input": "My credit card was charged twice for the same order",
"expected_category": "billing",
"expected_keywords": ["refund", "duplicate", "charge"],
}
@flake_aware(runs=10, pass_threshold=0.8, max_cost_usd=0.20)
def test_classifier_billing_intent():
result, cost, latency_ms = call_llm(GOLDEN_BILLING["input"])
# Metric 1: structured field equality
structural = result["category"] == GOLDEN_BILLING["expected_category"]
# Metric 2: keyword recall in the rationale (catches semantic drift)
rationale = result.get("rationale", "").lower()
keyword_hits = sum(1 for k in GOLDEN_BILLING["expected_keywords"] if k in rationale)
semantic = keyword_hits >= 2 # at least 2 of 3 keywords
    # Metric 3: cost guardrail
    affordable = cost < 0.005  # half a cent per call ceiling
    ok = structural and semantic and affordable
    if not ok:
        # pytest captures stdout and shows it on failure, so this names the broken axis
        print(f"axes: structural={structural} semantic={semantic} affordable={affordable}")
    return ok, cost, latency_ms
When this test fails, the captured output tells you which axis broke: structural, semantic, or cost. That distinction is the difference between "the model is broken" and "the prompt template now ships 4x more tokens." Both are bugs. They have different fixes.
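Everything above leans on a call_llm wrapper that returns (response_dict, cost_usd, latency_ms). Yours will look different; here's a minimal sketch assuming the OpenAI chat completions API, a prompt that asks for JSON output, and placeholder per-token prices you'd swap for your provider's real rates:

# your_app/llm.py (sketch)
import json, time
from openai import OpenAI

_client = OpenAI()
# Placeholder prices in USD per token: substitute your provider's actual rates
_PRICE_IN, _PRICE_OUT = 0.15 / 1_000_000, 0.60 / 1_000_000

def call_llm(user_input: str) -> tuple[dict, float, float]:
    start = time.monotonic()
    response = _client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": 'Classify the support ticket. Reply as JSON: {"category": ..., "rationale": ...}'},
            {"role": "user", "content": user_input},
        ],
        response_format={"type": "json_object"},
    )
    latency_ms = (time.monotonic() - start) * 1000
    usage = response.usage
    cost = usage.prompt_tokens * _PRICE_IN + usage.completion_tokens * _PRICE_OUT
    return json.loads(response.choices[0].message.content), cost, latency_ms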
Golden Set Drift — The Hidden Failure Mode
Here's the failure mode most teams don't see coming. You have a golden set of 50 test cases. You write them in January. By June, your product has shifted — the support team now classifies "shipping damage" as a logistics issue, not customer service. Your tests still pass. The model still returns "customer_service". Reality has moved.
The fix is to version the golden set and audit drift on a schedule:
# tests/llm/golden_set.py
import json, hashlib
from pathlib import Path
GOLDEN_PATH = Path("tests/llm/golden_set.json")
def load_golden() -> list[dict]:
data = json.loads(GOLDEN_PATH.read_text())
return data["cases"]
def golden_fingerprint() -> str:
"""Stable hash of the golden set. Bump in commit messages when intentional."""
raw = GOLDEN_PATH.read_text()
return hashlib.sha256(raw.encode()).hexdigest()[:12]
def assert_drift_log_current():
"""Run as a separate test. Fails if golden set hash changed without an entry
in the drift log — forces a human to acknowledge the change."""
    log_path = Path("tests/llm/golden_drift.log")
    # Tolerate a missing log on first run; the assertion below then fails with instructions
    log = log_path.read_text().splitlines() if log_path.exists() else []
    current = golden_fingerprint()
assert any(line.startswith(current) for line in log), (
f"Golden set hash {current} not in drift log. "
f"Add an entry: '{current} YYYY-MM-DD reason'"
)
This is annoying. It's supposed to be annoying. The whole point is that changing the golden set should require a deliberate, logged human action — because every time the golden moves, your "regression" baseline moves with it, and you've lost the ability to spot real regressions.
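Wiring the check into the suite is two lines. A sketch of the test file, with tests/llm/golden_drift.log as a plain text file holding one acknowledged hash per line (the hash and reason below are hypothetical):

# tests/llm/test_golden_drift.py
from golden_set import assert_drift_log_current

def test_golden_set_change_is_acknowledged():
    assert_drift_log_current()

And a drift log entry looks like:

3f9c2a1b7e4d 2025-06-12 support team reclassified shipping damage as logistics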
Regression-Only CI Gating
The last piece is the one that turns this from "tests" into "guardrails." Most LLM eval suites fail on absolute thresholds — pass_rate >= 0.85. That works on day one. By day 90 your prompt has improved, every test passes at 0.97, and a regression to 0.88 is still "above the floor" so CI stays green.
What you actually want is: fail the PR if any test drops more than X% vs. the last green main build.
# tests/llm/test_regression_gate.py
import json, os
from pathlib import Path
import pytest
BASELINE = Path("tests/llm/_baseline.json")
CURRENT = Path("tests/llm/_results.json")
REGRESSION_TOLERANCE = float(os.getenv("LLM_REGRESSION_TOLERANCE", "0.05"))
def test_no_regression_vs_baseline():
    if not BASELINE.exists():
        pytest.skip("No baseline yet; first run will create one")
    if not CURRENT.exists():
        pytest.fail("_results.json missing; run the eval suite before the gate")
    baseline = json.loads(BASELINE.read_text())
    current = json.loads(CURRENT.read_text())
regressions = []
for name, baseline_result in baseline.items():
if name not in current:
continue # test deleted, ignore
delta = current[name]["pass_rate"] - baseline_result["pass_rate"]
if delta < -REGRESSION_TOLERANCE:
regressions.append(
f"{name}: {baseline_result['pass_rate']:.0%} -> "
f"{current[name]['pass_rate']:.0%} (Δ {delta:+.0%})"
)
assert not regressions, "Regressions detected:\n " + "\n ".join(regressions)
One wiring detail matters here: the session fixture writes _results.json at teardown, after every test has finished, so the gate has to run as a second pytest invocation rather than inside the same pass (the workflow below does exactly that). Then, on a successful merge to main, copy _results.json to _baseline.json and commit it. Now every PR is gated on not making things worse, which is the only gate that scales.
CI Wiring — The One-Command Setup
# .github/workflows/llm-eval.yml
name: LLM Evals
on: [pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.11' }
- run: pip install pytest openai
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LLM_TEST_COST_CAP_USD: "3.00"
        run: pytest tests/llm/ -v --ignore=tests/llm/test_regression_gate.py
      - name: Regression gate (runs after _results.json is written)
        env:
          LLM_REGRESSION_TOLERANCE: "0.05"
        run: pytest tests/llm/test_regression_gate.py -v
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: llm-results, path: tests/llm/_results.json }
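The other half of the wiring is promoting the baseline on main. A sketch, assuming your default branch is main and the workflow is allowed to push; the bot identity and commit message are placeholders:

# .github/workflows/llm-baseline.yml
name: Promote LLM Baseline
on:
  push:
    branches: [main]
jobs:
  promote:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install pytest openai
      - name: Run eval suite on main
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LLM_TEST_COST_CAP_USD: "3.00"
        run: pytest tests/llm/ -v --ignore=tests/llm/test_regression_gate.py
      - name: Promote results to baseline
        run: |
          cp tests/llm/_results.json tests/llm/_baseline.json
          git config user.name "llm-eval-bot"
          git config user.email "llm-eval-bot@users.noreply.github.com"
          git add tests/llm/_baseline.json
          git diff --cached --quiet || git commit -m "chore: promote LLM eval baseline"
          git push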
That's it. Total cost per PR is capped. Regressions block. Golden drift forces a human ack. And the artifact gives you a paper trail for which prompt change moved which metric.
What I'd Tell Past Me
If I could go back to my first LLM project, I would skip most of the things I optimized for. I would not pick a fancy eval library. I would not build a custom dashboard. I would do exactly this — five primitives, in pytest, with a CI workflow that costs less than a cup of coffee per PR.
The teams I see succeed with LLM features in production are not the ones with the deepest evaluation theory. They're the ones whose tests catch a regression in the time it takes to read a PR description. That's the bar. Everything else is decoration.
(The honest truth: I've watched two teams burn six weeks each on building DSL-based eval frameworks before shipping anything. Both eventually replaced them with pytest plus 200 lines of helper code. So just start there.)
If you're building agents and want the related production pieces (LLM tool calling patterns, the semantic caching layer, and the production MCP server walkthrough), I wrote those up too. They share the same harness for regression testing. You can also dig into how our team at Velocity Software Solutions wires this kind of harness into LLM integration projects and custom AI agents if you want the application-level view.
For deeper reading: Anthropic's evaluation best practices, the pytest official docs on parametrize and fixtures, and the DeepEval source for prior-art ideas you can borrow without taking on the dependency.
Build the harness this week. Pick five tests that matter. Run them on every PR. The first time it catches a silent regression, you'll wonder how you ever shipped without it.