By Allela · AI Engineering · 9 min read
You’ve built something. A chatbot, a document assistant, a code reviewer, a customer support agent. You’ve tested it yourself, shown it to a few people, and it seems… good? The answers feel right. The tone is on point. Nothing obviously embarrassing has slipped through.
So you ship it.
Three weeks later, a user screenshots your AI confidently telling them that your product has a feature it doesn’t have. Another user complains it keeps going in circles. A third says it gave completely different answers to the same question on two different days.
Welcome to the most underrated problem in AI engineering: you never defined what “good” actually meant.
Evaluation — evals, in the industry shorthand — is the discipline of measuring AI quality systematically. Not vibes. Not spot checks. Actual measurement. And it’s the difference between an AI product that scales with confidence and one that silently degrades the moment you stop paying attention.
The Uncomfortable Truth About LLM Outputs
Here’s what makes evaluating LLMs uniquely hard compared to evaluating traditional software.
In classical software, correctness is binary. A function that sorts a list either returns a sorted list or it doesn’t. You write unit tests, they pass or fail, and you ship. The feedback loop is tight, deterministic, and cheap.
LLMs produce language. Language is fuzzy. “Paris is the capital of France” and “France’s capital city is Paris” are both correct. “The capital of France is Lyon” is wrong. But how does your test suite know the difference? How does it handle the thousand shades of grey between a perfect answer and a catastrophic one?
This is why most teams default to vibes-based evaluation for too long. It’s not laziness — it’s that the tooling for measuring language quality is genuinely hard to build. But the cost of not having it is enormous.
What You’re Actually Measuring
Before you can build evals, you need to be precise about what quality means for your system. There is no universal scorecard. A customer support bot and a creative writing assistant have completely different quality profiles.
That said, most LLM systems care about some combination of these dimensions:
Correctness — Is the answer factually accurate? Does it match ground truth? This is the most important dimension for knowledge-intensive tasks like Q&A, summarization, and RAG systems. If correctness is your north star, you need a dataset of questions with known correct answers to measure against.
Faithfulness — Does the answer stay grounded in the provided context? This is distinct from correctness and is especially critical in RAG. A model can give a correct answer that wasn’t actually supported by the retrieved documents — which is still a failure, because you can’t trust the reasoning chain.
Relevance — Did the model actually answer the question that was asked? You’d be surprised how often a fluent, well-written response completely sidesteps the question. High relevance means the output is tightly focused on the query.
Coherence — Does the response hold together logically? Is it well-structured? Does it contradict itself halfway through? This matters especially for long-form generation.
Tone & Style — Does the response match the expected voice? Too formal, too casual, too verbose? This is softer but matters enormously for user-facing products.
Safety & Refusal Behavior — Does the model refuse things it should refuse? Does it refuse things it shouldn’t? Both failure modes are real. An over-refusal problem is as damaging to user experience as an under-refusal one.
The discipline of evaluation starts with a decision: which of these dimensions matters most for your system, and how do you trade them off?
The Three Layers of Evaluation
Think of your eval stack as having three layers, each catching different types of failures at different speeds and costs.
Layer 1 — Unit Evals (Fast, Cheap, Deterministic)
These are the closest thing to traditional unit tests. They’re fast, automated, and catch obvious regressions immediately.
The key insight: not everything about an LLM’s output is fuzzy. Some things are binary.
Does the output contain a phone number when it shouldn’t? Does the response stay under 200 words when you asked for brevity? Does it always respond in the same language as the query? Does it include a citation when your system prompt requires one? Did it refuse a clearly harmful prompt?
These are all checkable with simple code. No AI judge needed.
def test_language_match(query, response):
    from langdetect import detect
    assert detect(response) == detect(query), \
        "Response language doesn't match query language"

def test_no_pii_leaked(response, context_docs):
    import re
    phone_pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    phones_in_context = re.findall(phone_pattern, ' '.join(context_docs))
    phones_in_response = re.findall(phone_pattern, response)
    # Any phone number in the response that isn't grounded in the
    # retrieved context is treated as a leak.
    leaked = set(phones_in_response) - set(phones_in_context)
    assert not leaked, f"PII leaked in response: {leaked}"

def test_response_length(response, max_words=250):
    word_count = len(response.split())
    assert word_count <= max_words, \
        f"Response too long: {word_count} words"
Start here. These tests run in milliseconds, cost nothing, and catch a surprising number of real failures.
Layer 2 — Model-Based Evals (The LLM-as-Judge Pattern)
This is where things get interesting. For the fuzzy dimensions — faithfulness, relevance, coherence — you can use an LLM to evaluate another LLM’s output.
The setup: you write a carefully designed prompt that asks an evaluator model (usually GPT-4o or Claude) to score a response on a specific dimension, based on a rubric you provide. The evaluator returns a score and a reason.
import json

def evaluate_faithfulness(question, context, response):
    prompt = f"""You are an expert evaluator. Your job is to assess whether
the following response is faithful to the provided context — meaning every
claim in the response is supported by the context, with no fabrications.

Context:
{context}

Question: {question}

Response: {response}

Score the faithfulness on a scale of 1 to 5:
5 = Every claim is directly supported by the context
3 = Most claims are supported, minor unsupported details present
1 = Significant claims appear that are not in the context

Return a JSON object: {{"score": <1-5>, "reason": "<one sentence>"}}"""
    result = llm.invoke(prompt)  # `llm` is your evaluator model client
    return json.loads(result.content)
This pattern is powerful but comes with a crucial caveat: LLM judges have biases. They tend to prefer longer answers. They’re inconsistent at the edges of your rubric. They can be manipulated by confident-sounding but wrong responses. You must validate your evaluator against human judgments before trusting it.
The practical workflow: build a small set of human-labeled examples (50–100 is enough to start). Run your LLM judge on the same examples. Measure agreement. If your judge agrees with humans 80%+ of the time, you have a useful — though imperfect — automated signal.
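That agreement check is simple enough to sketch. A minimal version, assuming `human_scores` and `judge_scores` are paired lists of 1–5 rubric scores for the same examples (the names and sample data are illustrative):

```python
# Sketch: validate an LLM judge against human labels before trusting it.

def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of examples where the judge lands within `tolerance`
    points of the human label."""
    assert len(human_scores) == len(judge_scores)
    matches = sum(
        1 for h, j in zip(human_scores, judge_scores)
        if abs(h - j) <= tolerance
    )
    return matches / len(human_scores)

human = [5, 4, 2, 5, 3, 1, 4, 5]
judge = [5, 5, 2, 4, 3, 1, 4, 5]

print(agreement_rate(human, judge))               # 0.75 — exact agreement
print(agreement_rate(human, judge, tolerance=1))  # 1.0 — within one point
```

Reporting both exact and within-one-point agreement is useful: a judge that is rarely exactly right but almost never off by more than a point is still a usable regression signal.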
Layer 3 — Human Evaluation (Slow, Expensive, Essential)
There is no substitute for human judgment on the hardest questions: calibrating your automated judges, adjudicating edge cases, and deciding whether your system is ready to ship.
The best teams run structured human evals at key milestones — before a major model upgrade, before entering a new use case, before launch. They use a preference evaluation setup: show two responses to the same query (one from your current system, one from a candidate) and ask evaluators which they prefer and why.
This produces comparative data rather than absolute scores, which is more reliable — humans are much better at “which is better?” than “how good is this on a scale of 1–10?”
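Aggregating those pairwise votes is a few lines of code. A minimal sketch, assuming each vote is recorded as "candidate", "current", or "tie" (the labels are illustrative):

```python
# Sketch: turn pairwise preference votes into a candidate win rate.

def win_rate(votes):
    """Candidate's share of decisive (non-tie) votes."""
    decisive = [v for v in votes if v != "tie"]
    if not decisive:
        return None  # no decisive votes, no signal
    return sum(1 for v in decisive if v == "candidate") / len(decisive)

votes = ["candidate", "candidate", "current", "tie", "candidate", "current"]
print(win_rate(votes))  # 0.6
```

A common decision rule is to ship the candidate only when its win rate clears a pre-agreed bar (say, 55%) on a large enough sample, rather than on any lead at all.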
The cost is real. But the cost of shipping a degraded system because you skipped human evals is usually higher.
The Golden Dataset: Your Most Valuable Eval Asset
Every serious eval program is built around a golden dataset — a curated collection of input-output pairs that represent your system’s most important behaviors.
What goes in it:
∙ Representative queries — the 100 questions your users actually ask most often
∙ Edge cases — queries that have broken things before, adversarial prompts, ambiguous inputs
∙ Regression anchors — specific examples where past system versions failed, so you know you haven’t regressed
∙ Capability checkpoints — queries that test each feature you care about
The golden dataset is a living document. Every production failure that teaches you something should be added to it. Every time a user reports a bad response that reveals a gap in your coverage, it goes in.
Teams that build discipline around maintaining this dataset consistently outperform those that don’t. It’s not glamorous work. It’s what separates a production-grade AI system from a demo.
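One low-tech way to keep that living document honest is a plain JSONL file, one example per line, appended to whenever production teaches you something. The field names here are illustrative, not a standard schema:

```python
# Sketch: a golden dataset as JSONL — easy to diff, review, and version.
import json

entry = {
    "id": "refund-policy-001",
    "query": "What is the refund policy?",
    "expected_answer": "Refunds are available within 30 days of purchase.",
    "tags": ["billing", "regression"],   # e.g. mark production failures
    "source": "production-failure",      # where this example came from
}

# Append a new regression case as soon as it's understood.
with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")

# Load the whole dataset back for an eval run.
with open("golden_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]
```

Keeping the dataset in version control alongside the code means every eval run is reproducible against a known snapshot of your expected behaviors.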
RAGAS: Evaluation for RAG Systems
If you’re building RAG specifically, there’s a framework worth knowing: RAGAS (Retrieval Augmented Generation Assessment). It operationalizes the key RAG-specific metrics into a clean API.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Your RAG system's outputs
data = {
    "question": ["What is the refund policy?"],
    "answer": ["You can request a refund within 30 days."],
    "contexts": [["Our refund policy allows returns within 30 days of purchase..."]],
    "ground_truth": ["Refunds are available within 30 days."],
}
dataset = Dataset.from_dict(data)

result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
])
print(result)
# {'faithfulness': 0.96, 'answer_relevancy': 0.91,
#  'context_precision': 0.88, 'context_recall': 0.83}
Each metric targets a specific failure mode. context_recall tells you whether the retriever is finding the right documents. faithfulness tells you whether the generator is staying grounded in what it found. Together, they help you localize failures — is your RAG pipeline broken in the retrieval stage or the generation stage?
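That localization logic can be made explicit as a simple routing rule. A sketch, with thresholds that are illustrative rather than recommended values:

```python
# Sketch: route a failing eval to the likely broken pipeline stage
# using RAGAS-style scores.

def localize_failure(scores, threshold=0.8):
    """Given metric scores in [0, 1], name the stage most likely at fault."""
    if scores["context_recall"] < threshold:
        return "retrieval"   # the right documents aren't being found
    if scores["faithfulness"] < threshold:
        return "generation"  # documents are found, but answers drift from them
    return "healthy"

print(localize_failure({"context_recall": 0.62, "faithfulness": 0.95}))  # retrieval
print(localize_failure({"context_recall": 0.91, "faithfulness": 0.55}))  # generation
```

Checking retrieval first matters: a low faithfulness score is meaningless if the retriever never surfaced the documents the answer needed in the first place.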
The Regression Problem: Eval as a Safety Net
Here’s a pattern that catches many teams off guard: you improve your system in one dimension and silently break it in another.
You upgrade your model from GPT-3.5 to GPT-4o. Correctness improves significantly. But response length doubles, latency increases, and tone shifts in ways your users don’t love. Or you fine-tune for tone and accidentally degrade factual accuracy. Or you add a new feature to your system prompt and it interferes with an existing behavior you depended on.
Without a systematic eval suite, you discover these regressions in production. With one, you discover them before deployment.
This is exactly how software engineering treats testing — not as a one-time quality check before launch, but as a continuous safety net that runs on every change. The same discipline needs to apply to AI systems.
The minimum viable setup: a CI pipeline that runs your golden dataset through the new system configuration whenever anything changes, compares the scores to your baseline, and blocks deployment if there’s meaningful degradation on any key metric.
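The comparison step of that gate is a few lines of Python. A sketch, assuming eval scores are plain metric-to-value dicts; the tolerance and metric names are assumptions:

```python
# Sketch of a CI regression gate: compare new eval scores to a baseline
# and surface meaningful degradation.

TOLERANCE = 0.02  # absorb small run-to-run noise in judge scores

def check_regression(baseline, candidate, tolerance=TOLERANCE):
    """Return the metrics where the candidate meaningfully underperforms."""
    return {
        metric: (baseline[metric], candidate.get(metric, 0.0))
        for metric in baseline
        if candidate.get(metric, 0.0) < baseline[metric] - tolerance
    }

baseline = {"faithfulness": 0.92, "answer_relevancy": 0.88}
candidate = {"faithfulness": 0.85, "answer_relevancy": 0.89}

for metric, (old, new) in check_regression(baseline, candidate).items():
    print(f"REGRESSION {metric}: {old:.2f} -> {new:.2f}")
    # In CI, exiting nonzero here (sys.exit(1)) blocks the deployment.
```

The tolerance is important: LLM-as-judge scores are noisy, so gating on any decrease at all produces flaky builds that teams learn to ignore.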
What “Good” Looks Like in Practice
Let’s be concrete. A mature eval program for a production RAG system might look like this:
∙ A golden dataset of 200 question-answer pairs, continuously updated from production failures
∙ Unit tests checking format, language, length, and safety properties, running in under 30 seconds on every pull request
∙ An LLM-as-judge pipeline scoring faithfulness and relevance on the full golden dataset, running nightly and on every model or prompt change
∙ A quarterly human preference evaluation comparing the current system against the last major version
∙ A live dashboard tracking answer quality scores, retrieval precision, and user thumbs-up/thumbs-down rates, with anomaly alerts when anything drops
None of this is technically complex. It’s mostly disciplined process work. The hard part is building the culture around it — making evaluation a first-class concern rather than an afterthought.
The Deeper Point
There’s a reason evaluation is the topic that separates AI engineers from AI tinkerers. Anyone can call an API and get impressive outputs. Building a system you can actually trust — one you can improve confidently, debug systematically, and hand off to users without holding your breath — requires knowing how to measure what you’ve built.
The field is still maturing. There is no pytest for language models, no perfectly reliable judge, no universal benchmark that tells you whether your system is production-ready. Every team is doing some version of figuring this out as they go.
But the teams that take evaluation seriously from the start — who invest in golden datasets, who build automated eval pipelines, who treat a production failure as an opportunity to add a test case rather than just a bug to fix — those teams ship AI systems that actually earn their users’ trust.
And in the end, that’s the only metric that really matters.
Next up: fine-tuning vs. prompting — when is it actually worth the cost? Drop a follow if you want that one in your feed.