You tweaked the system prompt, ran the same two test questions you always run, the answers looked good, and you shipped. A week later support is forwarding you screenshots of the model confidently doing the exact thing your prompt was supposed to stop. You never saw it, because "did it get better?" was answered by vibes.
This is the single most common failure mode in shipping LLM features, and it has nothing to do with which model you picked. If your only quality gate is reading a handful of outputs and nodding, every change you make is a coin flip. You can't tell whether a prompt edit helped, hurt, or just moved the failures somewhere you didn't look. Evals are how you replace the nod with a number.
This is a practical guide to building that number — from a 30-row eval set you can write this afternoon, through code-based checks and LLM-as-judge scoring, to wiring the whole thing into CI so regressions get blocked instead of discovered by users. No new framework to adopt; just the discipline that separates a demo from a system.
Why you can't just assert output == expected
Traditional tests work because the output space is small and exact. add(2, 2) is 4 or it's a bug. LLM output breaks the assumptions that make assertEqual work in three ways:
- It's non-deterministic. The same prompt can produce different text on two calls. Even at temperature 0 you are not guaranteed byte-identical output across runs or model versions.
- It's open-ended. "Summarize this ticket" has thousands of correct answers. None of them are string-equal to your reference, and that's fine — a good summary isn't the summary.
- It fails softly. A wrong answer isn't a stack trace. It's a fluent, plausible, well-formatted paragraph that happens to be incorrect. Nothing crashes. Nothing logs an error.
So the goal of an eval isn't "is the output identical to the expected string." It's "does the output satisfy the properties I care about" — is it grounded in the provided context, does it stay on policy, does it actually answer the question, is it valid JSON. You're testing behavior against criteria, not bytes against bytes. Once that clicks, the rest is mechanics.
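To make the contrast concrete, here is a minimal sketch using the refund example that appears later in this guide. The function name `satisfies_refund_properties` and the sample answers are illustrative assumptions, not part of any library. Both answers are correct, neither is string-equal to the reference, and a property check accepts both.

```python
REFERENCE = "Refunds are accepted within 14 days of purchase."


def satisfies_refund_properties(output: str) -> bool:
    """Property check: states the 14-day window and never claims 30 days."""
    low = output.lower()
    return "14 days" in low and "30 days" not in low


# Two differently worded, equally correct answers (made up for illustration).
answers = [
    "You can get a refund within 14 days of purchase.",
    "Our refund window is 14 days.",
]

for answer in answers:
    exact_match = answer == REFERENCE                    # bytes against bytes
    property_ok = satisfies_refund_properties(answer)    # behavior against criteria
    print(f"exact={exact_match} properties={property_ok} :: {answer}")
```

An exact-match test would reject both answers; the property check passes them and would still reject an answer that says "30 days".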
Start with the eval set, not the metric
The instinct is to reach for a fancy metric first. Wrong order. The asset that makes everything else work is a small, representative eval set: a fixed collection of inputs paired with what a good output looks like (or the criteria a good output must meet). This is your golden dataset, your regression suite, your source of truth.
You do not need thousands of examples to start. Thirty to fifty well-chosen pairs turn LLM tuning from vibes into engineering, because now every change is measured against the same fixed bar. Build the set like this:
- Mine real failures. Every time the system gets something wrong in dev or prod, that exact input goes into the eval set with a note on what the right behavior is. Your bug reports are your test cases. This is the highest-signal source you have.
- Cover the categories, not just the happy path. Easy questions, ambiguous ones, adversarial ones, out-of-scope ones ("I don't know" is the correct answer and you should test that it says so), and the edge cases specific to your domain.
- Freeze it and version it. The eval set lives in your repo next to the code. When you add a case, that's a commit. A moving target can't measure progress.
- Keep a holdout. If you start tuning prompts against the eval set, you'll overfit to it. Keep a slice you don't look at until you think you're done.
A minimal eval set is just data — JSON, a CSV, a Python list. Here's the shape:
```python
# evals/dataset.py
EVAL_SET = [
    {
        "id": "refund-window-basic",
        "question": "What is our refund window?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "14 days",
        "must_not_say": ["30 days", "no refunds"],
    },
    {
        "id": "out-of-scope",
        "question": "What's the weather in Cluj tomorrow?",
        "context": "Refunds are accepted within 14 days of purchase.",
        "expected": "REFUSE",  # correct behavior: decline, don't invent
    },
    # ... 30-50 of these, grown from real failures
]
```
That's the foundation. Everything below scores outputs against this set.
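One way to keep the holdout mentioned above honest is to assign cases to it deterministically from their id, so the split never changes between runs or machines. The sketch below is one possible approach, not a prescribed one; `is_holdout`, `split_eval_set`, and the 20% default are assumptions made for illustration.

```python
import hashlib


def is_holdout(case_id: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically decide whether a case belongs to the holdout slice."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < holdout_fraction


def split_eval_set(cases: list[dict], holdout_fraction: float = 0.2):
    """Return (tuning, holdout) lists; every case lands in exactly one."""
    tuning = [c for c in cases if not is_holdout(c["id"], holdout_fraction)]
    holdout = [c for c in cases if is_holdout(c["id"], holdout_fraction)]
    return tuning, holdout


# Hypothetical ids standing in for a real eval set.
cases = [{"id": f"case-{i}"} for i in range(50)]
tuning, holdout = split_eval_set(cases)
print(len(tuning), len(holdout))
```

Because the assignment depends only on the id, adding a new case never reshuffles existing ones. With 30 to 50 cases the holdout will only be roughly the requested fraction, so check its size and category coverage by hand.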
The two halves of every LLM eval
Separate two questions that get mushed together when you eval by eyeball, because they have different fixes:
- Did the system retrieve / set up the right context? (a retrieval or pipeline question)
- Given that context, did the model produce a good answer? (a generation question)
If you're building RAG, the first half is its own discipline — measuring recall@k and precision@k on questions with known relevant documents tells you whether the right chunk even reached the prompt. That's a deep enough topic that it deserves its own treatment; a dedicated course on RAG and retrieval-augmented generation spends real time there, and the failure modes are different from the ones below. This guide focuses on the second half: scoring the generated answer. The techniques split into two families — code-based checks and model-based judges — and you want both.
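For reference, the two retrieval metrics named above are short to compute once each question has a known set of relevant document ids. This is a minimal sketch; the function names and the document ids are made up for illustration, and precision is divided by k, which is the common convention.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots filled by a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


# Hypothetical ranking for one question with three known-relevant chunks.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]
relevant = {"doc-2", "doc-4", "doc-5"}

print(recall_at_k(retrieved, relevant, k=3))     # 1 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=3))  # 1 of 3 slots relevant
```

Low recall means the right chunk never reached the prompt, so no amount of prompt tuning on the generation side will fix the answer.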
Code-based checks: cheaper and more reliable than you think
Before you reach for an LLM to grade an LLM, a surprising amount of quality is checkable with plain code. These checks are deterministic, free, instant, and never hallucinate. Use them for everything they can cover:
- Structural validity. If the output should be JSON matching a schema, validate it. A response that doesn't parse is a hard failure, no judgment call needed.
- Must-contain / must-not-contain. The answer about a 14-day refund window must contain "14" and must not contain "30". Keyword and regex assertions catch a whole class of factual regressions for free.
- Format and bounds. Length limits, required citations present, no leaked system-prompt text, no forbidden phrases (the "as an AI language model" tax), valid enum values.
- Semantic similarity. For open-ended answers, embed the output and your reference answer and check that their cosine similarity clears a threshold. It needs an embedding model, so it is fuzzier and not quite as free as the checks above, but it catches "the answer wandered off topic" without needing a judge model.
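```python
# evals/checks.py
import json


def check_structural(output: str, schema_keys: list[str]) -> bool:
    """Pass only if output parses as a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # A bare list, string, or number is valid JSON but not the object we expect.
    if not isinstance(data, dict):
        return False
    return all(k in data for k in schema_keys)


def check_must_not_say(output: str, banned: list[str]) -> bool:
    """Pass only if no banned phrase appears (case-insensitive)."""
    low = output.lower()
    return not any(b.lower() in low for b in banned)
```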
The rule of thumb: anything a regex or a schema can catch, don't pay a model to catch. Reserve the expensive, fuzzy judge for the genuinely subjective stuff.
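The semantic-similarity check from the list above can be sketched the same way. The embedding function is deliberately left as a parameter: `embed` stands in for whatever embedding model you use, and the 0.8 threshold is an assumption that has to be tuned on your own data rather than a recommended value.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def check_semantic_similarity(
    output: str,
    reference: str,
    embed: Callable[[str], Sequence[float]],  # assumed: your embedding call
    threshold: float = 0.8,                   # assumed: tune on labeled examples
) -> bool:
    """Pass if the output's embedding is close enough to the reference's."""
    return cosine_similarity(embed(output), embed(reference)) >= threshold
```

Passing `embed` in as an argument keeps the check testable with a fake embedder and lets you cache reference embeddings outside the function.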
LLM-as-judge: powerful, biased, and fixable
For the subjective half ("is this answer faithful to the source?", "is this helpful?", "is the tone right?") you use a strong model to grade outputs. This is LLM-as-judge, and once calibrated it approximates human judgment across thousands of examples for the price of one API call each. Two metrics carry most of the weight for RAG-style apps:
- Faithfulness / groundedness — does every claim in the answer trace back to the provided context, or did the model invent things? This is your hallucination detector.
- Answer relevance — does the response actually address the question that was asked, or is it a fluent dodge?
The catch: LLM judges have well-documented biases, and if you ignore them your eval numbers are noise dressed up as signal. The big ones, all reported in the research on using models as evaluators:
- Position bias — when comparing two answers, judges favor the one shown first (or in a fixed slot) regardless of quality.
- Verbosity bias — judges tend to rate longer, more elaborate answers higher even when a short answer is more correct.
- Self-preference — a judge model can favor text written in its own style or by its own family.
You don't abandon the technique; you engineer around the bias:
- Score against a rubric, not a vibe. Ask for a 1–5 score with explicit criteria for each level, and require the judge to output its reasoning before the score. A judge forced to justify itself is more consistent.
- For pairwise comparisons, randomize and swap. Run each comparison twice with the order flipped; only count it as a win if the judge picks the same answer both times, and treat a flipped verdict as a tie. This cancels position bias directly.
- Calibrate against humans. Hand-label 20–30 examples yourself, run the judge on them, and check it agrees with you. If it doesn't, fix the rubric before trusting it on 2,000. An uncalibrated judge is a random number generator with good grammar.
- Use a strong model as the judge. Grading is harder than answering. Use a current frontier model for the judge even if your app runs on a smaller, cheaper one.
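The swap step for pairwise comparisons can be sketched as below. The `judge` callable is a placeholder for your own judge call and is assumed to return the string "first" or "second"; this sketch covers only the order flip, not randomization of which pairs get compared.

```python
from typing import Callable, Optional

# Assumed interface: judge(question, first_answer, second_answer)
# returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def compare_with_swap(
    judge: PairwiseJudge, question: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "a" or "b" if the verdict survives an order swap, else None."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first

    winner_1 = "a" if first_pass == "first" else "b"
    winner_2 = "b" if second_pass == "first" else "a"

    # A judge that just prefers a slot picks different answers each time.
    return winner_1 if winner_1 == winner_2 else None
```

A `None` result is worth counting separately: a high rate of flipped verdicts is itself a sign that the judge or the rubric is not discriminating between the answers.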
```python
# evals/judge.py: sketch of a rubric-based faithfulness judge
import json

JUDGE_PROMPT = """You are grading whether an ANSWER is fully supported by the CONTEXT.

CONTEXT:
{context}

ANSWER:
{answer}

Rules:
- A claim is "supported" only if the CONTEXT states or directly implies it.
- Outside knowledge does NOT count as support.

Respond with only a JSON object, no text before or after it. Put one sentence
of reasoning first, then the verdict:
{{"reasoning": "...", "faithful": true|false}}"""


def judge_faithfulness(client, context: str, answer: str) -> bool:
    # client.complete stands in for your provider's SDK call; it is assumed
    # to return the model's reply as a string.
    resp = client.complete(
        JUDGE_PROMPT.format(context=context, answer=answer),
        temperature=0,
    )
    # Malformed JSON raises here: a loud failure beats a silent pass.
    return json.loads(resp)["faithful"]
```
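Calibrating a judge like this against your own labels needs very little code. The sketch below assumes you have hand-labeled a few cases as faithful or not and collected the judge's verdicts for the same ids; `calibrate` and the sample labels are hypothetical.

```python
def calibrate(
    human_labels: dict[str, bool], judge_labels: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return (agreement rate, ids where the judge disagrees with the human)."""
    shared = sorted(human_labels.keys() & judge_labels.keys())
    if not shared:
        raise ValueError("no overlapping case ids to compare")
    disagreements = [
        case_id for case_id in shared
        if human_labels[case_id] != judge_labels[case_id]
    ]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements


# Made-up labels for illustration: True means "faithful".
human = {"a": True, "b": False, "c": True, "d": True}
judge = {"a": True, "b": True, "c": True, "d": True}

agreement, disagreements = calibrate(human, judge)
print(f"agreement={agreement:.2f} disagreements={disagreements}")
```

The list of disagreeing ids is the useful part: reading those cases next to the judge's reasoning usually shows which rubric line needs tightening.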
Designing judges that hold up — picking the rubric, calibrating, knowing when a model is the wrong tool for the grade — is exactly the muscle a course on AI evals in production builds, because it's the difference between "the new prompt feels better" and "faithfulness went from 0.78 to 0.91 on the holdout."
Wire it into CI, or it won't survive contact with deadlines
An eval you run by hand when you remember to is an eval you'll stop running the week things get busy. The whole point is to make regressions impossible to ship silently, and that means the eval runs automatically on every change to a prompt, a retrieval setting, or a model version.
The pattern is a regression gate: run the eval set, compute the aggregate score, and fail the build if the score drops below a threshold (or below the last known-good baseline). It looks like an ordinary test suite, because that's what it is.
```python
# tests/test_evals.py
import pytest
from evals.dataset import EVAL_SET
from evals.checks import check_must_not_say
from myapp import answer_question

PASS_THRESHOLD = 0.90  # 90% of eval cases must pass to ship


def run_case(case) -> bool:
    output = answer_question(case["question"], case["context"])
    if case["expected"] == "REFUSE":
        return "i don't know" in output.lower() or "can't" in output.lower()
    if not check_must_not_say(output, case.get("must_not_say", [])):
        return False
    return case["expected"].lower() in output.lower()


def test_eval_suite_meets_threshold():
    results = [run_case(c) for c in EVAL_SET]
    score = sum(results) / len(results)
    failed = [c["id"] for c, ok in zip(EVAL_SET, results) if not ok]
    assert score >= PASS_THRESHOLD, f"Eval score {score:.2f} below {PASS_THRESHOLD}. Failed: {failed}"
```
A few practical notes that keep this sane in CI:
- Pin the model version. Providers can change what an unversioned model alias points to, and an unpinned model means your eval baseline shifts under you for reasons unrelated to your code. Pin a specific version, and treat a model upgrade as its own deliberate eval run.
- Budget for cost and flakiness. LLM calls cost money and occasionally time out. Cache where you can, run the judge-heavy suite on a schedule rather than every commit if needed, and set a slightly forgiving threshold so one stochastic blip doesn't red-X a good PR.
- Log the failures, not just the score. When the gate trips, the output should name which cases regressed so the fix is obvious. A bare "0.86 < 0.90" sends you debugging blind.
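Gating on "the last known-good baseline" and naming the regressed cases can be sketched with a small JSON file of per-case results. The file layout (a mapping from case id to pass/fail) and the helper names are assumptions for illustration, not an established format.

```python
import json
from pathlib import Path


def load_baseline(path: Path) -> dict[str, bool]:
    """Read per-case results from the last known-good run, if any."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_baseline(path: Path, results: dict[str, bool]) -> None:
    """Store the current per-case results as the new baseline."""
    path.write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Ids of cases that passed in the baseline but fail now."""
    return sorted(
        case_id for case_id, ok in current.items()
        if baseline.get(case_id, False) and not ok
    )


# Made-up results: "b" used to pass and now fails; "d" is new and failing.
baseline = {"a": True, "b": True, "c": False}
current = {"a": True, "b": False, "c": False, "d": False}
print(find_regressions(baseline, current))
```

New cases that fail on their first run are not reported as regressions here; they still count against the overall threshold, which keeps the two signals separate.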
Now a prompt change is a PR with a number attached. The reviewer sees faithfulness went up and refusal rate held steady, or they see it tanked and the build is red. That's the entire difference between hoping and knowing.
Five mistakes that quietly poison your evals
Even teams that build evals often undermine them. Watch for these:
- Testing only the happy path. If every case in your set is a question the system already answers well, your score is a flattering lie. Adversarial and out-of-scope cases are where the signal is.
- Tuning on your test set. Optimize prompts against the same examples you grade on and you'll overfit to them. Keep a holdout you don't peek at.
- An uncalibrated judge. Trusting an LLM judge you never checked against your own labels is trusting a number you made up. Calibrate first.
- One giant blended score. A single average hides that faithfulness improved while refusals broke. Track metrics separately so a regression in one can't be masked by a gain in another.
- Letting the set rot. Your product changes; cases that no longer reflect real usage drag the signal down. Prune and grow the set as part of normal work, the same way you maintain any test suite.
None of these are exotic. They're the eval equivalent of not testing error paths — obvious in hindsight, easy to skip under deadline.
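The "one giant blended score" problem is easy to see in a few lines. This sketch assumes each case produces a dict of named boolean checks; the metric names and the 0.9 thresholds are illustrative.

```python
from collections import defaultdict


def per_metric_pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Pass rate for each metric, computed independently of the others."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for case_result in results:
        for metric, ok in case_result.items():
            totals[metric] += 1
            passes[metric] += int(ok)
    return {metric: passes[metric] / totals[metric] for metric in totals}


def metrics_below(rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of metrics whose pass rate is under its own threshold."""
    return sorted(m for m, minimum in thresholds.items() if rates.get(m, 0.0) < minimum)


# Made-up results: faithfulness is perfect, refusals are broken half the time.
results = [
    {"faithful": True, "refusal": False},
    {"faithful": True, "refusal": False},
    {"faithful": True, "refusal": True},
    {"faithful": True, "refusal": True},
]
rates = per_metric_pass_rates(results)
print(rates)
print(metrics_below(rates, {"faithful": 0.9, "refusal": 0.9}))
```

Averaged together these results score 0.75, which hides that one metric is at 1.0 and the other at 0.5; gating each metric on its own threshold surfaces the broken one by name.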
How this connects to the rest of your LLM stack
Evals aren't a standalone chore; they're the measurement layer that makes every other improvement legible. When you tighten a prompt, evals tell you if it worked — which is why structured prompt engineering and a real eval loop are two halves of the same skill. When you redesign what goes into the context window — what to include, what to cut, how to order it — evals are how you know the redesign helped rather than just felt cleaner; that discipline of deciding what earns a place in the prompt is increasingly called context engineering and has its own dedicated course. And when you wire up function calling, multi-tool orchestration, and the production concerns of a real integration, evals are what keep the whole pipeline honest as it grows — the kind of end-to-end build covered in a deeper course on advanced LLM integration. The pattern is always the same: build the measurement first, then every change becomes verifiable instead of hopeful.
Conclusion
The teams whose LLM features actually hold up in production aren't using a secret model or a magic prompt. They're disciplined about measurement. They have a versioned eval set grown from real failures, code-based checks for everything a regex can catch, calibrated LLM judges for the subjective rest, and a CI gate that blocks regressions before users find them.
Start smaller than you think you can. Write thirty cases this afternoon — half of them things your system currently gets wrong — add three code checks and one rubric-based judge, and put a threshold in your test suite. The first time a red build stops you from shipping a prompt change that would have quietly broken refusals, you'll never go back to vibe-checking. That's the moment an LLM demo becomes an LLM system people can trust.
The courses linked throughout are part of Cursuri-AI.ro, an AI-learning platform with hands-on, current tracks on evaluating AI systems in production, prompt engineering, RAG, and advanced LLM integration.
Sources & further reading:
- Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (documents position, verbosity, and self-enhancement bias in LLM judges)
- Liu et al. — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- Liang et al. — Holistic Evaluation of Language Models (HELM)
This article is educational content. Techniques and tooling evolve quickly; validate approaches against your own data and current library documentation.