Siddharth Bhalsod

Posted on Jun 16

Building a Production-Grade LLM Eval System From Scratch

#ainative #aienhanced #aiagents #aievals

Your LLM Eval Is Broken Before You Write the First Test

Most teams discover their eval system is broken the same way. They ship a prompt change that improves tone but silently tanks accuracy on edge cases. They upgrade their model version and something subtly changes — response length, citation patterns, how it handles ambiguity. Nobody catches it because the test suite was checking the wrong things. Or it wasn't running in CI. Or it existed on someone's laptop and that person has since left.

This is not a metrics problem. It is a sequencing problem.

The teams with working eval infrastructure — the ones where a prompt change doesn't become a post-mortem — built their system in a specific order. They defined what good looks like before they wrote a single test. They instrumented the system before they had enough data to justify it. They treated evaluation as architecture, not as a final validation step bolted on before launch.

In the AI Native series, Article 3 established that most teams build the wrong stack because they start with the model and work backward. The same mistake compounds inside the eval layer: most teams start with a framework and work backward. They install DeepEval or Braintrust, run a quick hallucination check, ship it, and call the eval layer done. The framework is not the system. The framework is one component inside a system that has to be deliberately designed.

This article is the design guide for that system. Not a framework tutorial — a sequencing blueprint.

The Wrong Starting Point

When a team decides to "add evals," the first thing they typically reach for is a library. pip install deepeval. Add AnswerRelevancyMetric. Run it against a few test cases. Green outputs feel like progress.

They are not progress. They are the illusion of instrumentation.

The problem is that answer relevancy is a generic metric. It tells you whether the model's response is topically related to the query, which is almost always true for any reasonably sized model and any reasonably coherent prompt. Treating a pass on this metric as validation is like testing whether your e-commerce site can render a product page and calling the checkout flow validated.

The real question is not "does this output look relevant?" The real question is: what does quality actually mean for this specific system, in this specific product context, for this specific user?

That question is not a technical question. It is a product question. And it has to be answered before any eval framework is touched.

Layer One: Define Quality Before You Measure It

Consider two products that both use retrieval-augmented generation. The first is a legal research tool — lawyers use it to find case precedents before drafting filings. The second is a customer support assistant — customers use it to resolve billing disputes without calling in.

Both systems retrieve documents. Both generate responses. Both could fail on hallucination and answer relevancy. But the quality definitions are completely different.

For the legal tool, the most dangerous failure is a confident answer that cites a real case incorrectly — a paraphrase that changes the meaning of a ruling. For the support tool, the most dangerous failure is a refusal to resolve something the system should be able to handle — a hedge that sends the customer to a human unnecessarily.

Run the same generic metric set on both and you will get a score. That score will mean nothing to either product team.

This is why quality definition is Layer 1. Not Layer 4. Not "something we add later when we have real data."

The way to do it: write three to five failure statements before you write any test. Not metric names — failure statements. Things like: "The system confidently states a legal precedent that does not exist," or "The system routes a resolvable billing dispute to a human agent." These statements describe what broken looks like in terms your product team and your eval framework can both understand.

Then map each failure statement to a metric type. Some will map to built-in DeepEval metrics. Some will require a custom GEval criterion. Some will require a deterministic code-based check. The mapping is the architecture decision.
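One way to make that mapping concrete is to write it down as data before any framework is installed. The sketch below is a minimal illustration, not a prescribed format: the `FailureStatement` and `MetricKind` names, the thresholds, and the second and third failure statements are hypothetical, chosen only to show the shape of the decision.

```python
from dataclasses import dataclass
from enum import Enum


class MetricKind(Enum):
    BUILT_IN = "built-in metric"
    CUSTOM_CRITERION = "custom GEval criterion"
    DETERMINISTIC = "deterministic code check"


@dataclass(frozen=True)
class FailureStatement:
    name: str              # short identifier, reusable later as a test class name
    statement: str         # what "broken" looks like, in product language
    metric_kind: MetricKind
    threshold: float       # minimum passing score between 0.0 and 1.0 (assumed scale)


# Hypothetical quality definition for the legal research tool.
# Only the first statement comes from the article; the others are illustrative.
LEGAL_TOOL_FAILURES = [
    FailureStatement(
        name="confident_hallucination",
        statement="The system confidently states a legal precedent that does not exist.",
        metric_kind=MetricKind.CUSTOM_CRITERION,
        threshold=0.8,
    ),
    FailureStatement(
        name="unsupported_claim",
        statement="The answer makes a claim the retrieved documents do not support.",
        metric_kind=MetricKind.BUILT_IN,
        threshold=0.7,
    ),
    FailureStatement(
        name="malformed_citation",
        statement="A cited case appears without a citation string.",
        metric_kind=MetricKind.DETERMINISTIC,
        threshold=1.0,
    ),
]


def validate_quality_definition(failures):
    """Check the three-to-five rule and unique names; return name -> metric kind."""
    if not 3 <= len(failures) <= 5:
        raise ValueError(f"expected 3 to 5 failure statements, got {len(failures)}")
    names = [f.name for f in failures]
    if len(set(names)) != len(names):
        raise ValueError("failure statement names must be unique")
    return {f.name: f.metric_kind for f in failures}


if __name__ == "__main__":
    for name, kind in validate_quality_definition(LEGAL_TOOL_FAILURES).items():
        print(f"{name}: {kind.value}")
```

Keeping the statements in a reviewable structure like this lets the product team sign off on the wording while the engineering team owns the metric choice.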

Layer Two: Instrument Without Waiting for Data

The second failure mode: teams wait until they have enough production data to build a "real" test suite. This feels responsible. It is actually how you end up with no eval coverage during the months when the system is most likely to change.

The practical answer is synthetic goldens.

DeepEval's Synthesizer can generate test cases from your knowledge base before a single real user has touched the system. If you are building a RAG pipeline, you feed it your document corpus and it generates realistic input/output pairs: questions a real user might ask, grounded in the content the system will retrieve. These are not perfect proxies for real traffic. They are good enough to establish a baseline and to catch obvious breakage.

GitHub's Copilot team runs comprehensive offline evaluations against every model before it reaches production — testing across metrics like latency, accuracy, and response consistency before any user interaction. They do not wait for user feedback to tell them the model regressed. The eval system surfaces regressions in the same pipeline that builds the release.

The minimum viable starting point is not fifty production examples. It is twenty-five synthetic goldens, two to three metrics that map to your failure statements, and a passing threshold. That is a real eval system. Run it before every prompt change, every model swap, every retrieval parameter update.
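The gate itself can be very small. The sketch below is framework-agnostic and assumes the per-golden scores have already been produced by whatever metrics you chose. The function name `check_release_gate`, the metric names, and the `required_pass_rate` parameter are hypothetical; in particular, the article only says "a passing threshold", so expressing the gate as a pass rate per metric is an assumption.

```python
def check_release_gate(scores, thresholds, min_goldens=25, required_pass_rate=0.9):
    """Return a list of human-readable gate failures (empty list means the gate passes).

    scores:      metric name -> list of per-golden scores in [0.0, 1.0]
    thresholds:  metric name -> minimum score for a single golden to count as passing
    """
    failures = []
    for metric, threshold in thresholds.items():
        metric_scores = scores.get(metric, [])
        # Refuse to report a pass on a dataset that is too small to mean anything.
        if len(metric_scores) < min_goldens:
            failures.append(
                f"{metric}: only {len(metric_scores)} goldens scored, need {min_goldens}"
            )
            continue
        passed = sum(1 for s in metric_scores if s >= threshold)
        pass_rate = passed / len(metric_scores)
        if pass_rate < required_pass_rate:
            failures.append(
                f"{metric}: pass rate {pass_rate:.0%} is below {required_pass_rate:.0%}"
            )
    return failures


if __name__ == "__main__":
    # Hypothetical scores for 25 synthetic goldens on two metrics.
    example_scores = {
        "faithfulness": [0.9] * 25,
        "confident_hallucination": [0.9] * 20 + [0.4] * 5,
    }
    example_thresholds = {"faithfulness": 0.7, "confident_hallucination": 0.8}
    problems = check_release_gate(example_scores, example_thresholds)
    for problem in problems:
        print(problem)
    # A CI job would exit non-zero when `problems` is not empty.
```

Run as a CI step, a non-empty result blocks the prompt change, model swap, or retrieval update until someone looks at it.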

Layer Three: Structure the Test Suite Around Failure Modes, Not Features

This is the architectural distinction most teams miss.

The natural instinct is to organize test cases around features: here are the tests for the summarization flow, here are the tests for the question-answering flow, here are the tests for the refusal behavior. This organization feels logical. It mirrors how the product is structured.

The problem is that eval systems organized by feature tell you what broke but not why. When the summarization score drops three points, you know summaries got worse. You do not know whether the retrieval layer is returning worse context, whether the prompt changed something in formatting behavior, or whether a model update shifted the generation style.

Structure the test suite around failure modes instead. Each failure statement from Layer 1 becomes a test class. Each test class runs its specific metric. When a test class fails, the failure message is already diagnostic — it points to the component and the behavior, not just the feature.

In DeepEval, this looks like:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Custom criterion for one failure statement from Layer 1.
confident_hallucination = GEval(
    name="ConfidentHallucination",
    criteria="The output should never state a legal precedent with high confidence unless the retrieved context directly supports it.",
    # The judge sees only the generated answer and the retrieved documents.
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,
)
```

This is not the generic HallucinationMetric. It is a custom GEval criterion written in plain English, tied to the specific failure mode the legal research team identified. When it fires, it fires on a specific category of error — not on a score that requires interpretation.

DeepEval's recommendation, grounded in production experience, is to limit yourself to five metrics maximum: two to three generic metrics chosen for your system's architecture (contextual precision for a RAG pipeline, tool correctness for an agent) and one to two custom metrics specific to your use case. The constraint is intentional. More metrics mean noisier signals and harder-to-diagnose failures.
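To see why organizing by failure mode makes a failure diagnostic on its own, here is a minimal framework-free sketch. It assumes each failure mode has been scored already and simply attaches the component to inspect and the broken behavior to every failing result. The `FailureMode` type, the `irrelevant_retrieval` entry, and the component labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    component: str    # part of the pipeline to inspect first when this mode fails
    behavior: str     # the broken behavior, taken from the failure statement
    threshold: float  # minimum acceptable score in [0.0, 1.0]


# Hypothetical registry for the legal research tool.
FAILURE_MODES = {
    "confident_hallucination": FailureMode(
        name="confident_hallucination",
        component="generation prompt and model",
        behavior="states a precedent the retrieved context does not support",
        threshold=0.8,
    ),
    "irrelevant_retrieval": FailureMode(
        name="irrelevant_retrieval",
        component="retriever",
        behavior="ranks irrelevant documents above relevant ones",
        threshold=0.7,
    ),
}

# Mirrors the five-metric cap discussed above.
assert len(FAILURE_MODES) <= 5


def diagnose(scores):
    """Turn failure-mode scores into messages naming the component and the behavior."""
    messages = []
    for name, score in scores.items():
        mode = FAILURE_MODES[name]
        if score < mode.threshold:
            messages.append(
                f"{mode.name} failed ({score:.2f} < {mode.threshold:.2f}): "
                f"check {mode.component}; system {mode.behavior}"
            )
    return messages


if __name__ == "__main__":
    for line in diagnose({"confident_hallucination": 0.62, "irrelevant_retrieval": 0.91}):
        print(line)
```

A feature-organized suite would report only that a score dropped; here the message already says where to look.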

Layer Four: The Drift Problem No Test Suite Catches

There is a class of failure that well-designed test suites miss almost entirely. Call it version drift.

A model provider pushes a silent update. Not a new model version — an update to the weights behind the same model string. Your evals pass. Your prompts are unchanged. And quietly, over the following two weeks, something shifts. Users start submitting more corrections. Satisfaction scores drift down by a few points. The output that used to be crisp and structured gets slightly looser. Nobody changed anything. But the system got worse.

This is the failure mode that unit testing, however well structured, cannot catch. Offline evals run against a snapshot. They tell you whether the system performs acceptably on the dataset you created. They cannot tell you whether the live system is drifting from that baseline in production.

The answer is production monitoring — which is Layer 4 of the eval system and the layer most teams skip entirely.

Production monitoring means continuously scoring a sample of real user interactions with referenceless metrics. Referenceless because you will not have ground truth labels for live traffic. DeepEval supports this: metrics like AnswerRelevancyMetric and FaithfulnessMetric run without a known correct answer, and a custom GEval criterion can cover qualities such as conciseness.

The setup is straightforward: sample ten to twenty percent of live traffic into your eval pipeline, aggregate scores over a rolling window, and alert when the aggregate drops below a threshold. Confident AI, the cloud platform built on top of DeepEval, handles the dashboard and monitoring infrastructure if you do not want to build it yourself. The point is not the tool. The point is that offline evals and production monitoring are two different systems solving two different problems, and you need both.
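For teams building this themselves, the core of the loop fits in a few lines. The sketch below is one possible shape, under stated assumptions: sampling is done by hashing an interaction ID so the decision is deterministic, the aggregate is a simple rolling mean, and the names `should_sample` and `RollingMonitor` plus all default values are hypothetical. Scoring itself is left out; any referenceless metric can feed `record`.

```python
import hashlib
from collections import deque


def should_sample(interaction_id, sample_rate=0.1):
    """Deterministically select roughly `sample_rate` of interactions by hashing the ID."""
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


class RollingMonitor:
    """Keep the most recent scores and flag when their mean falls below a threshold."""

    def __init__(self, window_size=200, alert_threshold=0.7, min_samples=50):
        self.scores = deque(maxlen=window_size)  # old scores drop off automatically
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def mean(self):
        return sum(self.scores) / len(self.scores)

    def record(self, score):
        """Add one score; return True when the rolling mean is below the threshold."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # not enough data yet to alert on
        return self.mean() < self.alert_threshold


if __name__ == "__main__":
    monitor = RollingMonitor(window_size=100, alert_threshold=0.7, min_samples=20)
    # Hypothetical stream: a healthy period followed by a quiet regression.
    stream = [0.85] * 60 + [0.55] * 60
    for i, score in enumerate(stream):
        if monitor.record(score):
            print(f"alert after {i + 1} scored interactions, mean={monitor.mean():.2f}")
            break
```

Comparing the rolling mean against the offline baseline from the test suite, rather than a fixed number, is a natural refinement once both systems exist.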

Teams that run only offline evals are flying blind during the longest part of a product's life: after launch.

The Build Order

The failure modes are not random. They follow directly from building the eval system in the wrong order.

Teams that instrument too late — after the system is in production — start with generic metrics and work backward to product meaning. They are always trying to retrofit quality definitions onto scores they do not fully trust.

Teams that organize by feature instead of failure mode always have a two-step debugging process: find the failing test, then figure out what the failing test actually means.

Teams that skip production monitoring ship a system that degrades invisibly until users tell them it has.

The right order is four layers, built in sequence:

1. Define quality as failure statements, before touching any framework.
2. Generate synthetic goldens and establish baselines, before waiting for real data.
3. Structure test classes around failure modes, not features.
4. Add production monitoring for drift, as a separate system from the offline test suite.

This is not how most teams build their eval layer. Most teams start in the middle, with the framework, the test cases, and the CI run (Layers 2 and 3), and never get to Layers 1 and 4 at all.

The eval system that degrades invisibly is not a testing failure. It is a sequencing failure.
