Gabriel Anhaia
Stop Writing Unit Tests for Your AI Code. Write These 4 Evals Instead.


Your CI pipeline runs assertEqual(output, "expected") on an LLM call. It fails once every 20 runs. Nobody knows which 1-in-20. Somebody opens a PR that wraps the call in a retry loop. It merges. Two weeks later a customer pastes a hallucinated invoice number and you find out your "passing" test suite has been green for a month on a feature that returns garbage 5% of the time.

The fix is not a retry. The fix is that you are testing the wrong thing.

Unit tests were designed around a contract: same input, same output. LLMs break that contract on purpose. Temperature 0 narrows the distribution but does not collapse it. Provider drift, model aliases, tokenizer updates, and stochastic sampling all push the same call toward different completions. A deterministic assertion over a non-deterministic function is a flaky test pretending to be a correctness test.
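
To make the flakiness concrete, here is a minimal stand-in (not a real model call) that samples from a small output distribution, which is effectively what a "temperature 0" endpoint does under provider drift:

```python
import random

def fake_llm(prompt: str, rng: random.Random) -> str:
    # Stand-in for an LLM call: same input, a distribution over outputs.
    return rng.choice(["INV-1001", "INV-1001", "INV-1001", "inv 1001"])

rng = random.Random(7)
results = [fake_llm("extract the invoice number", rng) for _ in range(100)]

# assertEqual(results[i], "INV-1001") passes on most draws and fails on
# the rest -- a flaky test pretending to be a correctness test.
flake_rate = sum(r != "INV-1001" for r in results) / len(results)
```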

The right layer for AI correctness is evals. Four of them, each doing a job a unit test cannot.

The rule, up front

Parse-and-validate logic gets a unit test. Semantic output gets an eval.

Your prompt builder that concatenates a system message and a user message: unit test. Your tokenizer helper that counts tokens before you hit the context window: unit test. Your JSON post-processor that strips code fences and parses the result: unit test. Anything that touches the model's actual meaning goes to an eval.
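
For instance, the JSON post-processor in that list is ordinary deterministic code. A sketch of its unit-testable surface (a hypothetical helper, not from any particular codebase):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip optional markdown code fences, then parse the rest as JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)

# Same input, same output -- a plain unit test is the right tool here.
assert parse_model_json('```json\n{"items": []}\n```') == {"items": []}
assert parse_model_json('{"items": []}') == {"items": []}
```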

Keep that rule in mind as you read the rest.

1. Schema-validation evals (run in CI on every PR)

This is the cheapest eval you will ever write, and it catches the dumbest failures, which are the ones that take production down. You ask the model for JSON. You validate it against a Pydantic schema. You fail the build if the shape is wrong.

No judge model. No subjective scoring. The call either produces a parseable object or it does not.

# tests/evals/test_extraction_schema.py
import json
import pytest
from pydantic import BaseModel, ValidationError
from typing import List
from app.llm import extract_action_items

class ActionItem(BaseModel):
    assignee: str
    task: str
    due: str | None = None

class Extraction(BaseModel):
    items: List[ActionItem]

CANARY_INPUTS = [
    "Sarah will migrate the DB by Friday. John to review the PR.",
    "Nobody took an action here, it was a status update.",
    "Ship the launch. All hands. No deadline given.",
]

@pytest.mark.parametrize("notes", CANARY_INPUTS)
def test_extraction_returns_valid_schema(notes):
    raw = extract_action_items(notes)
    try:
        Extraction.model_validate_json(raw)
    except (ValidationError, json.JSONDecodeError) as e:
        pytest.fail(f"schema failed on: {notes!r}\n{e}")

Run this on every PR. Three inputs, not three hundred. This eval is about the structural contract, not coverage. If the model returns {"actions": [...]} instead of {"items": [...]} because someone tweaked the prompt, you catch it in 4 seconds.

What this does not catch: hallucinated assignees, invented tasks, wrong dates. That is what the next eval is for.

2. Faithfulness judges (sampled from production)

Schema evals tell you the shape is right. Faithfulness evals tell you the content is grounded. They run on a rolling sample of real production traffic, not in CI, because they are slow, cost money per call, and only make sense against live data.

The technique: send the model's output plus the input context to a second model, ask it to score whether the output is supported by the input, and log the score as a metric. A score below threshold raises an alert the same way latency over threshold would.

# app/evals/faithfulness.py
import json
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Literal braces in the JSON example are doubled so str.format()
# leaves them intact instead of treating them as replacement fields.
JUDGE_PROMPT = """You are a strict grader. You will receive SOURCE text
and OUTPUT text. Return JSON: {{"faithful": true|false, "reason": "..."}}.

OUTPUT is faithful only if every factual claim in it is supported by
SOURCE. Invented names, numbers, or dates make it unfaithful.

SOURCE:
{source}

OUTPUT:
{output}
"""

def score_faithfulness(source: str, output: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, output=output),
        }],
    )
    return json.loads(resp.content[0].text)

You wire this into your production pipeline as a sampler: 1% of requests get a judge score attached to their trace. The trace ID, input, output, judge verdict, and judge reason all end up in your observability stack as structured span attributes. The rolling fraction of traces judged faithful is a time series. A drop from 0.92 to 0.78 is an alert.
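
A minimal sketch of that sampler, with the judge and the trace writer injected as callables (the attribute names here are illustrative, not a specific tracing SDK):

```python
import random

SAMPLE_RATE = 0.01  # judge roughly 1% of production requests

def maybe_judge(source, output, judge, record, rng=random.random):
    """Attach a faithfulness score to a sampled fraction of traces.

    `judge` would be score_faithfulness; `record` writes span attributes
    via whatever tracing client you run in production.
    """
    if rng() >= SAMPLE_RATE:
        return None  # the other 99%: no judge call, no extra cost
    score = judge(source, output)
    record({
        "eval.faithful": score["faithful"],
        "eval.reason": score["reason"],
    })
    return score
```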

Two things matter here. Pin the judge model with its full snapshot ID (claude-sonnet-4-5-20251001, not the floating alias claude-sonnet-4-5) so that the judge you calibrated last quarter is the judge grading today. And run a meta-eval every couple of months where a human labels 100 traces and you measure agreement between your judge and the human. If agreement drops below 0.8, the judge drifted and you owe the rubric a rewrite.
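
The agreement check itself is a few lines. A sketch, assuming both the judge and the human produced one boolean label per trace:

```python
def judge_human_agreement(judge_labels, human_labels):
    """Fraction of traces where the judge's verdict matches the human's."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Below 0.8 agreement on ~100 labeled traces: rewrite the rubric.
```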

Shankar et al.'s 2024 paper "Who Validates the Validators?" is the honest read on this. Judges are cheap, judges are useful, judges are biased. Treat their scores as a signal, not a verdict.

3. Regression datasets (canary prompts before merge)

This is the eval that replaces "I ran three examples and it looked fine." You curate 30 to 200 input-output pairs that represent the behavior you care about. You run the current prompt against the set. You run the candidate prompt against the same set. You diff.

The key is that the dataset grows by one every time production surprises you. Customer complains about an invoice extraction missing a line item? That example goes in the regression set with the expected output, and from now on every prompt change runs against it.
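
One way to make "that example goes in the regression set" a one-liner for whoever is on call. This is a hypothetical helper, not part of any published codebase:

```python
import json
from pathlib import Path

def add_regression_case(path, case_id, input_text, must_contain=()):
    """Append a production surprise to regression.json, rejecting dupes."""
    path = Path(path)
    cases = json.loads(path.read_text()) if path.exists() else []
    if any(c["id"] == case_id for c in cases):
        raise ValueError(f"duplicate regression id: {case_id}")
    cases.append({
        "id": case_id,
        "input": input_text,
        "must_contain": list(must_contain),
    })
    path.write_text(json.dumps(cases, indent=2))
```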

# tests/evals/test_regression.py
import json
import pytest
from pathlib import Path
from app.llm import extract_action_items
from app.evals.faithfulness import score_faithfulness

REGRESSION = json.loads(
    Path("tests/evals/regression.json").read_text()
)

@pytest.mark.regression
@pytest.mark.parametrize("case", REGRESSION, ids=lambda c: c["id"])
def test_regression_case(case):
    output = extract_action_items(case["input"])
    score = score_faithfulness(
        source=case["input"], output=output
    )
    assert score["faithful"], (
        f"{case['id']} unfaithful: {score['reason']}\n"
        f"output: {output}"
    )

    # Additional case-specific checks, when the dataset
    # author committed a ground-truth assertion.
    for required in case.get("must_contain", []):
        assert required.lower() in output.lower(), (
            f"{case['id']} missing expected entity {required!r}"
        )

The regression.json file is the actual test surface. It starts small:

[
  {
    "id": "missing-assignee",
    "input": "The migration needs to happen before Friday.",
    "must_contain": []
  },
  {
    "id": "multi-assignee",
    "input": "Sarah and John will pair on the API docs.",
    "must_contain": ["Sarah", "John"]
  }
]

You mark this suite @pytest.mark.regression and run it as a pre-merge gate, not on every commit. It costs money per run (faithfulness judge + the model under test) and takes minutes, not seconds. A reasonable CI shape: schema evals on every push, regression evals on every PR-to-main.
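
Under that marker scheme, the CI split is two invocations (a config fragment, assuming the `regression` marker is registered in your pytest config):

```shell
# On every push: unit tests + schema evals; skip the paid suite.
pytest -m "not regression" tests/

# On PR to main: the regression gate (judge calls cost real money).
pytest -m regression tests/evals/
```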

Two anti-patterns to avoid. Do not assert exact string equality on model output in this file, ever. must_contain is the right tool, not ==. And when a regression case starts failing for a legitimate reason (the behavior genuinely changed, the old output was wrong), update the case in the same PR that changed the prompt. Never skip it.

4. Property-based evals (invariants, not examples)

The last category is the one most teams skip and later regret. You do not assert a specific output. You assert a property that any reasonable output must hold. Output length. Cost per call. Token count. Absence of banned strings. Presence of a required citation format.

These are fast, they are deterministic, and they catch the failures that have a business cost attached: the prompt edit that quietly doubled average output length, the model rollover that pushed p95 cost from $0.03 to $0.18, the change that started leaking system-prompt text into user responses.

# tests/evals/test_properties.py
import pytest
import tiktoken
from app.llm import extract_action_items, last_call_cost_usd

enc = tiktoken.encoding_for_model("gpt-4o-mini")

SAMPLE_INPUTS = [
    "Ship the launch by Q2. Sarah owns API docs.",
    "No actions, just a retrospective note.",
    "Ten action items across four people, see below: ...",
]

BANNED_SUBSTRINGS = [
    "As an AI language model",
    "I cannot",
    "system prompt",
]

@pytest.mark.parametrize("notes", SAMPLE_INPUTS)
def test_output_is_bounded(notes):
    output = extract_action_items(notes)

    tokens = len(enc.encode(output))
    assert tokens < 1000, (
        f"output too long: {tokens} tokens on input {notes!r}"
    )

    cost = last_call_cost_usd()
    assert cost < 0.05, (
        f"cost regression: ${cost:.4f} exceeds $0.05 budget"
    )

    lowered = output.lower()
    for banned in BANNED_SUBSTRINGS:
        assert banned.lower() not in lowered, (
            f"banned substring leaked: {banned!r}"
        )

The invariants that pay off most:

  • Length bounds. An upper token count catches runaway verbosity. A lower one catches the model returning an empty string because you silently hit a content filter.
  • Cost ceiling. Per-call dollar budget. A change that pushes you over is a financial regression, not a behavioral one, and it deserves its own red test.
  • Banned substrings. Leaked system-prompt fragments, refusal boilerplate in a feature that should never refuse, PII patterns that should not appear in customer-facing copy.
  • Required format. If your output must cite sources as [source: ID], the property is "regex matches at least once." Catches the drift where the model quietly stops citing.

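
The "required format" invariant from that list is a one-line regex check. A sketch, assuming citations look like [source: ID]:

```python
import re

CITATION = re.compile(r"\[source:\s*[\w-]+\]")

def has_citation(output: str) -> bool:
    """Property: a grounded answer cites at least one source ID."""
    return bool(CITATION.search(output))

assert has_citation("Revenue grew 12% [source: q3-report].")
assert not has_citation("Revenue grew 12%.")
```
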
None of this requires a judge model. All of it runs in CI in seconds. This is the layer that would have caught "wrap it in retry to hide the flakiness" before it shipped.

What stays in unit tests

Evals do not replace your test suite. They replace the tests you should never have written on model output. The deterministic code around the model still gets deterministic tests.

  • The function that reads a user message, applies your template, and returns a prompt string: unit test. Run it. Assert the string.
  • The function that counts tokens to decide whether to chunk: unit test. Feed it a string, assert the count.
  • The JSON post-processor that strips triple-backticks, parses, and normalizes keys: unit test. Fuzz it with malformed inputs if you like.
  • The router that picks between three models based on input length: unit test. Table-driven, one line per branch.
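
The router bullet deserves a concrete shape, since "table-driven, one line per branch" is the whole trick. A hypothetical router with illustrative thresholds:

```python
# Hypothetical router: model choice is a pure function of input length.
def pick_model(text: str) -> str:
    n = len(text)
    if n < 500:
        return "small"
    if n < 4000:
        return "medium"
    return "large"

# Table-driven unit test: one row per branch, fully deterministic.
CASES = [("hi", "small"), ("x" * 1000, "medium"), ("x" * 5000, "large")]
for text, expected in CASES:
    assert pick_model(text) == expected
```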

The split is simple. If the function's output is determined by its input and your code, write a unit test. If the function's output is determined by a distribution over a neural network's parameters, write an eval.

Where this goes in the pipeline

A working setup looks like this. On every push: unit tests on the deterministic code paths, schema evals on three canary inputs. On every PR to main: the regression dataset, the property-based evals. In production: the faithfulness judge sampling 1% of traffic, its scores landing on a dashboard and on alerts.

Four layers, four jobs, one rule running through them. assertEqual on model output is the one you remove from the build on your way out the door.

If this was useful

The book (Observability for LLM Applications) works through the production side of this in depth: Chapter 10 on offline and online evals, Chapter 11 on LLM-as-judge bias and the meta-eval protocol, Chapter 18 on wiring eval scores into alerts that on-call actually trusts. The four-layer shape above is the scaffold; the book is the how.

Observability for LLM Applications — the book

Thinking in Go — 2-book series on Go programming and hexagonal architecture
