Lamhot Siagian
Evals Aren’t a One-Time Report: Build a Living Test Suite That Ships With Every Release

Continuous evaluation in production (monitoring, regressions, evals in CI/CD)

You finally shipped that generative AI feature, and the initial manual testing looked spectacular. A few weeks later, users start complaining that the system is hallucinating, dropping context, or responding with a completely different tone. You haven’t changed a line of code, but the upstream API provider updated their model weights, your retrieval corpus grew, and user prompts evolved.

Traditional software engineering relies on deterministic unit tests to catch regressions before they hit production. AI engineering, however, often relies on static, one-off evaluation spreadsheets that age out the moment a model is deployed. This gap between traditional Continuous Integration/Continuous Deployment (CI/CD) and AI evaluation is the root cause of silent degradation in production systems.

In this article, you will learn how to shift from manual vibe checks to a continuous evaluation paradigm. We will explore how to integrate automated evaluations directly into your CI/CD pipelines, monitor production regressions, and build a living test suite that scales with your AI applications.

Why This Topic Matters Now

The transition from traditional machine learning to large language models (LLMs) has fundamentally changed how we define a "regression." In classical ML, you monitor for data drift or accuracy drops on a fixed classification task. With generative systems like Retrieval-Augmented Generation (RAG) or AI agents, the output is open-ended, non-deterministic, and highly sensitive to minor prompt tweaks.

When a prompt engineer tweaks a system instruction to fix a specific edge case, they risk unintentionally breaking ten other supported use cases. Without automated regression testing, these breakages are pushed directly to users.

Furthermore, foundation models are moving targets. Even if you pin a specific model version, upstream providers frequently push subtle updates that alter generation behavior. Continuous evaluation acts as your early warning system, ensuring that external dependencies and internal code changes meet a baseline of quality before they reach production.

Core Concepts in Plain Language

To build a robust testing architecture, we need to separate our evaluation strategies into three distinct phases of the software development lifecycle.

Offline Evaluations

These are the heavy, comprehensive tests run during the experimental phase. When you are comparing entirely new architectures, foundation models, or embedding strategies, you run offline evals. They are slow, expensive, and designed to establish a baseline.

CI/CD Evals (Pre-Deployment)

This is the automated gatekeeper. When an engineer opens a pull request that modifies prompt templates, application logic, or RAG retrieval parameters, a subset of evaluations runs automatically. These tests must be fast, cost-effective, and focused on preventing known regressions.

Online Evaluations (Production Monitoring)

Once the system is live, you cannot run expensive LLM-as-a-judge evaluations on every single user interaction. Online evals rely on lightweight proxy metrics, user feedback loops, and asynchronous sampling to detect anomalies in real time.
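One way to implement that asynchronous sampling is a deterministic hash-based gate: a fixed fraction of interactions is flagged for later judging, off the request's critical path. The sketch below is illustrative (the `should_sample` name and the 1% rate are assumptions, not from any particular framework):

```python
import hashlib

def should_sample(interaction_id: str, rate: float = 0.01) -> bool:
    """Deterministically select a fraction of interactions for async judging.

    Hashing the interaction ID (instead of calling random.random()) makes
    the decision reproducible: the same conversation is always in or out
    of the evaluation sample, which simplifies debugging.
    """
    digest = int(hashlib.sha256(interaction_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < rate * 10_000

# Interactions that pass the gate would be pushed to a queue and scored
# later by a judge model; here we just collect the sampled IDs.
sampled = [i for i in range(1000) if should_sample(f"interaction-{i}")]
```

Because the gate is pure and deterministic, it can itself be unit-tested, unlike a random sampler.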

How It Works Under the Hood

The foundation of continuous AI evaluation is the concept of "Evaluation as Code." Just as you version your application logic, you must version your test datasets, your evaluation prompts, and your scoring thresholds.

The industry-standard approach leverages the LLM-as-a-Judge paradigm (Zheng et al., 2023, arXiv:2306.05685). Instead of relying on brittle string-matching or exact-match assertions, we use a strong secondary LLM to score the outputs of our primary application against a set of rubrics.

For a RAG system, this typically involves isolating the evaluation into specific metrics (Es et al., 2023, arXiv:2309.15217). We evaluate Context Precision to ensure our vector search is returning relevant documents. We evaluate Faithfulness to ensure the generated answer is strictly grounded in the retrieved context. Finally, we evaluate Answer Relevance to confirm the response actually addresses the user's query.

By treating these metric scores as standard test outputs, we can wrap them in assertion logic. If a pull request drops the Faithfulness score below an agreed-upon threshold of 0.85, the CI pipeline fails, blocking the merge.
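A minimal sketch of that assertion logic, assuming the metric scores have already been computed by an upstream evaluation run (the score values and threshold numbers here are hypothetical literals):

```python
# Hypothetical suite-level scores; in a real pipeline these would come
# from a judge model, not hard-coded literals.
suite_scores = {
    "faithfulness": 0.83,
    "answer_relevance": 0.91,
    "context_precision": 0.88,
}

# Agreed-upon quality floors, versioned alongside the code.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevance": 0.80}

def check_gates(scores: dict, thresholds: dict) -> list:
    """Return the metrics that fell below their agreed threshold."""
    return [m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor]

failures = check_gates(suite_scores, THRESHOLDS)
if failures:
    print(f"Evaluation gate failed: {failures}")
    # In CI this is where you would call sys.exit(1) to block the merge.
```

With the example scores above, `faithfulness` (0.83) falls below its 0.85 floor, so the gate reports a failure.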

Practical Applications and Examples

Let’s look at a concrete mini-walkthrough of how a Test Architect might implement this for a RAG pipeline using GitHub Actions or GitLab CI.

Step 1: Curate the Golden Dataset
You cannot evaluate continuously without a stable baseline. Start by curating a "Golden Dataset" of 50 to 100 highly representative user queries, along with their ideal retrieved contexts and expected answers. This dataset should live in your repository or a data registry, versioned alongside your code.
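A Golden Dataset entry can be as simple as a JSON record with the query, the contexts you expect retrieval to surface, and a reference answer. The schema below is one possible shape, not a standard (the field names and file contents are illustrative):

```python
import json

# One hypothetical entry; a real Golden Dataset holds 50-100 such cases,
# versioned in the repository next to the code it tests.
golden_dataset = [
    {
        "query": "How do I rotate an expired API key?",
        "expected_contexts": ["docs/security/api-keys.md"],
        "expected_answer": "Generate a new key under Settings > API, then revoke the old one.",
    }
]

# Round-trip through JSON exactly as a CI job would load it from the repo.
serialized = json.dumps(golden_dataset, indent=2)
loaded = json.loads(serialized)
```

Keeping the dataset in plain JSON means every change to a test case shows up in code review, just like a change to application logic.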

Step 2: Automate the CI/CD Pipeline
Configure your CI runner to trigger an evaluation script on every pull request targeting the main branch. The script spins up your RAG application in a containerized environment, ingests the Golden Dataset, and captures the generated responses and retrieved contexts.
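The harness that the CI runner executes can be sketched as a simple replay loop. `ask_rag` below is a stand-in for your application's real entry point; it is stubbed here so the harness itself stays self-contained:

```python
def ask_rag(query: str) -> dict:
    # Placeholder: the real function would retrieve documents and generate
    # an answer using the production prompt templates and retriever.
    return {"answer": f"stub answer for: {query}", "contexts": ["doc-1"]}

def run_suite(dataset: list) -> list:
    """Replay every golden query through the app, capturing outputs for scoring."""
    results = []
    for case in dataset:
        out = ask_rag(case["query"])
        results.append({
            "query": case["query"],
            "answer": out["answer"],
            "contexts": out["contexts"],
            "expected_answer": case["expected_answer"],
        })
    return results

results = run_suite([{"query": "q1", "expected_answer": "a1"}])
```

The key design point is that the harness only captures raw outputs; scoring happens in a separate step, so the same captured run can be re-scored with different judges or thresholds.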

Step 3: Score and Assert
The CI runner then passes these outputs to your evaluation framework. The framework calls your Judge LLM to compute Faithfulness and Answer Relevance.
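The judge call itself is provider-specific, but the rubric prompt and the defensive parsing of the judge's reply can be sketched generically. The rubric wording below is illustrative, not taken from any particular evaluation framework:

```python
JUDGE_RUBRIC = """You are grading a RAG system's answer.
Context: {contexts}
Question: {query}
Answer: {answer}

Score Faithfulness from 0.0 to 1.0, where 1.0 means every claim in the
answer is supported by the context. Reply with only the number."""

def build_judge_prompt(case: dict) -> str:
    """Fill the rubric template with one captured test case."""
    return JUDGE_RUBRIC.format(**case)

def parse_score(raw: str) -> float:
    """Judge models occasionally add prose around the number, so parse
    defensively and clamp the result into the valid [0.0, 1.0] range."""
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0
```

Treating an unparseable judge reply as 0.0 is a deliberately conservative choice: a broken judge response fails the gate rather than silently passing it.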

Step 4: Report and Block
Instead of a pass/fail binary, the script outputs a markdown table directly into the pull request comments. It highlights which specific queries degraded. If the overall suite average falls below your defined threshold, the script returns a non-zero exit code, failing the build.
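The reporting step can be sketched as a function that renders a markdown table (posted to the pull request via your CI platform's API) plus a gate that exits non-zero when the suite average degrades. The column layout and 0.85 threshold are illustrative:

```python
import sys

def render_report(rows: list, threshold: float = 0.85) -> str:
    """Render per-query scores as a markdown table for the PR comment."""
    lines = ["| Query | Faithfulness | Status |", "|---|---|---|"]
    for r in rows:
        status = "pass" if r["faithfulness"] >= threshold else "**regressed**"
        lines.append(f"| {r['query']} | {r['faithfulness']:.2f} | {status} |")
    return "\n".join(lines)

def gate(rows: list, threshold: float = 0.85) -> None:
    """Print the report, then fail the build if the suite average is too low."""
    print(render_report(rows, threshold))
    avg = sum(r["faithfulness"] for r in rows) / len(rows)
    if avg < threshold:
        sys.exit(1)  # non-zero exit code fails the CI build
```

Calling `gate(scored_rows)` as the script's last step is what turns soft metric scores into a hard merge blocker.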

Common Pitfalls and Limitations

The most significant limitation of continuous AI evaluation is the introduction of "flaky tests." Because LLMs are non-deterministic, an evaluation might pass on one run and fail on the next, even if the application code hasn't changed.

This causes alert fatigue. If developers learn that they can simply re-run the CI pipeline to get a passing grade, trust in the evaluation architecture collapses. This non-determinism is a heavily researched open challenge. Recent preprints suggest that carefully calibrating judge models and utilizing multi-agent debate for scoring can significantly reduce variance and improve alignment with human judgments (Li et al., 2024, arXiv:2401.10020).

Another major pitfall is cost and latency. Running a GPT-4-class model as a judge for hundreds of regression tests on every commit is prohibitively expensive and slows down development velocity.

To mitigate this, sophisticated testing architectures employ a tiered approach. They use fast, deterministic metrics (like semantic similarity) or smaller, fine-tuned judge models for CI/CD pipelines, reserving the expensive LLM-as-a-judge solely for nightly regression sweeps or major release candidates.
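The fast tier can be as cheap as a deterministic lexical similarity check that escalates only borderline cases to the judge. The sketch below uses `difflib` purely to stay dependency-free; a real setup would more likely use embedding cosine similarity, and the 0.6 floor is an arbitrary illustrative value:

```python
from difflib import SequenceMatcher

def cheap_similarity(candidate: str, reference: str) -> float:
    """Deterministic lexical similarity in [0, 1]. A stand-in for the
    embedding-based semantic similarity a production tier would use."""
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio()

def tiered_check(candidate: str, reference: str, fast_floor: float = 0.6) -> str:
    """Tier 1 runs on every commit: fast, free, and fully deterministic.
    Only answers below the floor are escalated to the expensive judge."""
    if cheap_similarity(candidate, reference) >= fast_floor:
        return "pass"
    return "escalate_to_judge"
```

Because tier 1 is deterministic, it never flakes; the non-deterministic judge only runs on the small fraction of cases the fast tier cannot clear.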

Where Research Is Heading Next

The field of AI evaluation is moving rapidly from static benchmarks to dynamic, adversarial testing. We are seeing a shift toward automated red-teaming directly within CI/CD pipelines.

Instead of evaluating against a static Golden Dataset, future CI pipelines will spin up adversarial "Attacker Agents." These agents will actively probe the new pull request for vulnerabilities, attempting to jailbreak the system or induce hallucinations, generating synthetic test cases on the fly (Perez et al., 2022, arXiv:2202.03286).

Furthermore, research is heavily focused on creating specialized, open-weights evaluation models. Rather than relying on closed-API generalists to judge outputs, teams will soon deploy localized, ultra-fast models whose sole architectural purpose is computing evaluation metrics with high determinism.

Conclusion

Continuous evaluation is no longer an optional luxury for AI engineering teams; it is the fundamental mechanism for shipping reliable generative features. By treating your prompts, retrieval logic, and evaluation datasets as interconnected code artifacts, you can build an automated safety net that catches regressions before your users do.

The transition from a one-time evaluation report to a living, breathing CI/CD test suite requires a shift in engineering culture as much as a shift in tooling. Start small, establish a baseline, and iteratively expand your coverage.

Concrete Next Steps:

  • Curate your first Golden Dataset: Select 20 representative user queries and their ideal responses. Hardcode these into a simple JSON file in your repository.
  • Implement a basic CI gate: Write a script that runs those 20 queries through your application and uses a lightweight semantic similarity metric to compare the output against the expected answer.
  • Explore evaluation frameworks: Look into open-source libraries designed for continuous evaluation to understand how they abstract the LLM-as-a-judge architecture.

Further Reading

  • Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685. This foundational paper validates the use of strong LLMs to evaluate the outputs of other models, establishing the core mechanism for automated CI/CD scoring.
  • Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. Provides a rigorous, reference-free framework for independently evaluating the retrieval and generation components of a RAG pipeline.
  • Li, X., et al. (2024). Calibrating LLM-Based Evaluators. arXiv preprint arXiv:2401.10020. Explores methods to reduce the variance and bias in automated judges, addressing the critical problem of flaky tests in continuous integration.
  • Perez, E., et al. (2022). Red Teaming Language Models with Language Models. arXiv preprint arXiv:2202.03286. Discusses how to use models to automatically generate test cases to find failures, pointing toward the future of dynamic CI/CD testing.
