Your eval suite passed. Your agent is degrading in production. These two facts are not contradictory - they're the expected outcome when you treat offline evaluation as a sufficient signal for production reliability.
Offline evals and production outcome tracking solve different problems. Conflating them is how you end up with green CI checks and a support queue full of AI-generated nonsense.
What Evals Are Actually Measuring
A typical eval setup looks like this: you have a dataset of input/expected-output pairs, a harness that runs your agent against them, and a set of metrics (accuracy, BLEU score, LLM-as-judge ratings). You run this before deploying. If it passes, you ship.
This is useful. It catches regressions when you change your prompt, swap models, or restructure your agent logic. It gives you a baseline for comparison across configurations.
But the eval suite is measuring a fixed distribution. Your labeled dataset reflects the traffic patterns, model behaviors, and user intent distributions at the time it was created. Production traffic is a live distribution that shifts continuously.
Three failure modes that evals reliably miss:
Model drift: The model provider updates the underlying model weights. Your eval dataset was labeled against the previous behavior. The new behavior is subtly different in ways that don't trigger your existing test cases but do degrade real user outcomes. This happened after several GPT-4 updates in 2023-2024 - evals passed, production quality dropped.
Distribution shift: Your users change how they phrase requests, or new user segments start using the feature, or an upstream system change alters input format. Your eval dataset doesn't cover the new distribution. Success rates drop on inputs you've never tested.
Unknown failure modes: Evals catch what you know to test for. They don't catch failure modes you haven't encountered yet. A new adversarial pattern, an edge case in a niche use case, a prompt injection in user-supplied content - these are invisible in labeled datasets until after they've already caused problems in production.
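Distribution shift in particular can often be seen in the inputs before it shows up in outcomes. A minimal, library-free sketch: bin live inputs by length and compare the result against the eval dataset's profile using a Population Stability Index. The bin edges, toy data, and 0.25 threshold below are illustrative assumptions, not a standard:

```python
import math

def length_histogram(inputs: list[str], bins: list[int]) -> list[float]:
    """Bucket inputs by character length and return normalized frequencies."""
    counts = [0] * (len(bins) + 1)
    for text in inputs:
        n = len(text)
        idx = sum(1 for b in bins if n > b)  # how many bin edges this length exceeds
        counts[idx] += 1
    total = max(sum(counts), 1)
    return [c / total for c in counts]

def psi(expected: list[float], actual: list[float], eps: float = 1e-4) -> float:
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # clamp empty bins to avoid log(0)
        score += (a - e) * math.log(a / e)
    return score

# Toy data: eval set is mostly short queries, live traffic turned mostly long
eval_inputs = ["what is my order status"] * 8 + ["x" * 500] * 2
recent_inputs = ["x" * 500] * 8 + ["short"] * 2
bins = [50, 150, 400, 1000]

baseline = length_histogram(eval_inputs, bins)
live = length_histogram(recent_inputs, bins)
if psi(baseline, live) > 0.25:
    print("input distribution has shifted; eval coverage may no longer apply")
```

Length is a crude proxy; the same comparison works over any cheap feature of the input (language, token count, detected intent).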
The Point-in-Time Problem
An eval suite is a point-in-time measurement. You run it, get a score, and that score reflects the state of your system against a fixed dataset at a specific moment. The score doesn't update when the model provider changes something. It doesn't update when your user behavior shifts. It doesn't update when a downstream system introduces data quality issues.
Production is continuous. Your agent is making decisions right now, against real inputs, with real consequences. The question that matters is not "what was my accuracy score on the eval dataset last Tuesday?" - it's "what is my outcome rate on real traffic right now, and how has it changed in the last 24 hours?"
The gap between these two questions is where production failures live.
| Eval suite | Production reality |
| --- | --- |
| Fixed dataset | Live traffic stream |
| Labeled ground truth | Implicit outcome signals |
| Run before deploy | Continuous measurement |
| Catches known regressions | Catches unexpected degradation |
| Measures capability | Measures actual outcomes |
A Minimal Eval Harness
Before we get to production monitoring, the eval harness still matters. Run it before every deploy. It's your regression net.
```python
import json
from dataclasses import dataclass
from typing import Callable

import openai


@dataclass
class EvalCase:
    input: str
    expected_output: str
    metadata: dict | None = None


@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    passed: bool
    score: float
    failure_reason: str | None = None


def run_eval_suite(
    agent_fn: Callable[[str], str],
    dataset: list[EvalCase],
    judge_fn: Callable[[str, str], float] | None = None,
) -> dict:
    """Run the offline eval suite. Call this in CI before deploying."""
    results = []
    for case in dataset:
        try:
            actual = agent_fn(case.input)
            if judge_fn:
                score = judge_fn(actual, case.expected_output)
            else:
                score = _exact_match_score(actual, case.expected_output)
            results.append(EvalResult(
                case=case,
                actual_output=actual,
                passed=score >= 0.7,
                score=score,
            ))
        except Exception as e:
            results.append(EvalResult(
                case=case,
                actual_output="",
                passed=False,
                score=0.0,
                failure_reason=str(e),
            ))

    pass_rate = sum(1 for r in results if r.passed) / len(results)
    avg_score = sum(r.score for r in results) / len(results)
    failures = [r for r in results if not r.passed]

    return {
        "pass_rate": pass_rate,
        "avg_score": avg_score,
        "total_cases": len(results),
        "failed_cases": len(failures),
        "failure_details": [
            {"input": r.case.input[:100], "reason": r.failure_reason or f"score:{r.score:.2f}"}
            for r in failures[:10]
        ],
    }


def llm_judge(actual: str, expected: str) -> float:
    """Use GPT-4o as a judge for subjective quality evaluation."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Score how well the actual response matches the expected response. "
                    "Return a JSON object with a single key 'score' between 0.0 and 1.0. "
                    "1.0 = semantically equivalent, 0.0 = completely wrong or unrelated."
                ),
            },
            {"role": "user", "content": f"Expected:\n{expected}\n\nActual:\n{actual}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content).get("score", 0.0)


def _exact_match_score(actual: str, expected: str) -> float:
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0
```
This runs before deploy. It catches regressions. It is not sufficient for production reliability.
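One way to wire that into CI is a small gate over the report `run_eval_suite` returns. A sketch, where the 0.9 threshold and the `my_agent`/`DATASET` names are placeholders for your own:

```python
import sys

PASS_RATE_THRESHOLD = 0.9  # illustrative; tune to your suite's baseline

def gate(report: dict, threshold: float = PASS_RATE_THRESHOLD) -> int:
    """Turn an eval report into a process exit code: 0 = deploy, 1 = block."""
    if report["pass_rate"] < threshold:
        print(f"BLOCKED: pass rate {report['pass_rate']:.1%} below {threshold:.0%}")
        for detail in report["failure_details"]:
            print(f"  {detail['input']!r}: {detail['reason']}")
        return 1
    print(f"OK: {report['total_cases']} cases, pass rate {report['pass_rate']:.1%}")
    return 0

# In CI:
# report = run_eval_suite(my_agent, DATASET, judge_fn=llm_judge)
# sys.exit(gate(report))
```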
Production Outcome Tracking
In production, you don't have labeled ground truth. You have signals instead: did the user accept the output? Did the downstream system consume it successfully? Did a human reviewer approve it? Did the action the output triggered succeed?
These signals are noisier than eval scores. They're also real.
```python
from kalibr import Router
import openai
import uuid

client = openai.OpenAI()


def run_agent_with_outcome_tracking(user_input: str, session_id: str) -> dict:
    goal_id = f"goal_{uuid.uuid4().hex[:12]}"
    router = Router(
        goal_id=goal_id,
        task_type="user_query_response",
        session_id=session_id,
    )
    policy = router.get_policy()

    response = client.chat.completions.create(
        model=policy.model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_input},
        ],
    )
    output = response.choices[0].message.content

    # Return goal_id so the caller can record the outcome when it's known
    return {
        "output": output,
        "goal_id": goal_id,
        "model": policy.model,
    }


def record_user_feedback(goal_id: str, was_helpful: bool, feedback_text: str | None = None):
    """Called when the user gives a thumbs up/down or a downstream system reports a result."""
    router = Router(goal_id=goal_id)
    router.record_outcome(
        success=was_helpful,
        quality_score=1.0 if was_helpful else 0.0,
        metadata={"feedback": feedback_text} if feedback_text else {},
    )


def record_downstream_result(goal_id: str, system_accepted: bool, error: str | None = None):
    """Called when a downstream system reports whether it could use the agent's output."""
    router = Router(goal_id=goal_id)
    router.record_outcome(
        success=system_accepted,
        error=error,
    )
```
Now you have two streams of quality signal running in parallel:
- Offline evals against your labeled dataset (run in CI, before every deploy)
- Production outcomes from real user signals (continuous, updates the routing model)
The production stream feeds Kalibr's Thompson Sampling. Models that perform well on real traffic get higher selection probability. Models that degrade get deprioritized automatically, before you've written a new eval case for the failure mode.
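Kalibr's internals aren't shown here, but the mechanism is worth understanding. A toy Thompson Sampling router over two model "arms" with Beta posteriors, an illustration of the technique rather than Kalibr's implementation:

```python
import random

class ThompsonRouter:
    """Toy Thompson Sampling over candidate models.
    Each model keeps success/failure counts; selection samples from Beta(s+1, f+1)."""

    def __init__(self, models: list[str]):
        self.stats = {m: {"success": 0, "failure": 0} for m in models}

    def choose(self) -> str:
        # Sample a plausible success rate per model, pick the highest draw
        draws = {
            m: random.betavariate(s["success"] + 1, s["failure"] + 1)
            for m, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, model: str, success: bool) -> None:
        self.stats[model]["success" if success else "failure"] += 1

random.seed(0)  # deterministic for the demo
router = ThompsonRouter(["model-a", "model-b"])
# Simulate feedback: model-a succeeds 90% of the time, model-b 50%
for _ in range(500):
    m = router.choose()
    p = 0.9 if m == "model-a" else 0.5
    router.record(m, random.random() < p)
# After enough outcomes, most selections go to model-a
```

The key property: a model that starts degrading accumulates failures, its sampled success rate drops, and it loses traffic automatically, without anyone writing a new eval case first.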
Detecting Model Drift in Production
Model drift is the failure mode that evals are worst at catching because it happens after you deploy. Your eval suite passed against the model behavior at time T. The provider updates the model at time T+30 days. Eval still passes on re-run because your dataset was labeled against the old behavior. Production outcome rate drops.
With continuous outcome tracking, the degradation shows up as a change in the success rate time series:
```python
import kalibr

# Check for a recent performance change
drift_report = kalibr.get_insights(
    task_type="user_query_response",
    lookback_hours=48,
    compare_to_baseline_hours=168,  # compare the last 48h to the 7-day baseline
)

if drift_report.performance_delta < -0.05:  # 5% relative degradation
    print("Performance degradation detected:")
    print(f"  Baseline success rate: {drift_report.baseline_success_rate:.1%}")
    print(f"  Current success rate: {drift_report.current_success_rate:.1%}")
    print(f"  Delta: {drift_report.performance_delta:+.1%}")
    print(f"  Affected models: {drift_report.degraded_models}")
    print(f"  Recommended: {drift_report.routing_recommendation}")
```
This surfaces drift without requiring you to have anticipated the specific failure mode. You don't need to add eval cases for behavior you didn't know would change. The outcome signal is model-agnostic - it measures whether the output was useful, not whether it matched your labeled expectations.
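If you want the same check without a managed layer, it reduces to comparing success rates between a recent window and a baseline window over your own outcome log. A minimal sketch; the window sizes and -0.05 threshold mirror the example above but are otherwise arbitrary:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    timestamp: float  # unix seconds
    success: bool

def success_rate(outcomes: list[Outcome], start: float, end: float) -> float:
    """Success rate of outcomes with start <= timestamp < end; NaN if the window is empty."""
    window = [o for o in outcomes if start <= o.timestamp < end]
    return sum(o.success for o in window) / len(window) if window else float("nan")

def drift_alert(outcomes: list[Outcome], now: float,
                recent_h: float = 48, baseline_h: float = 168,
                threshold: float = -0.05) -> bool:
    """Flag if the last `recent_h` hours degraded vs the preceding baseline window."""
    recent = success_rate(outcomes, now - recent_h * 3600, now)
    baseline = success_rate(outcomes, now - baseline_h * 3600, now - recent_h * 3600)
    return (recent - baseline) < threshold
```

This catches the same class of problem: you don't need to know why outcomes dropped to know that they did.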
Complementary, Not Competing
The framing of "evals vs production monitoring" can be misleading. They're complementary tools with different jobs.
Evals: Run before deploy. Catch prompt regressions, validate model swaps against known test cases, measure capability on your labeled distribution. If eval fails, don't deploy.
Production outcome tracking: Run after deploy, continuously. Catch distribution shift, model drift, novel failure modes. If production outcomes degrade, route around the failing configuration automatically.
The workflow:
1. Write eval cases as you discover failure modes
2. Run eval suite in CI against every commit
3. Block deploy if eval pass rate drops below threshold
4. Deploy with production outcome tracking active
5. Monitor outcome rate time series for degradation
6. When degradation appears, check which models/configs are affected
7. Add the new failure mode to your eval suite so it's caught at deploy time next time
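Step 7 can be mechanical: once a production failure is triaged and a reviewer writes the correct output, append it to the eval dataset. A sketch assuming a JSONL-backed dataset; the path and field names are illustrative:

```python
import json
from pathlib import Path

def promote_failure_to_eval_case(path: str, production_input: str,
                                 corrected_output: str, failure_tag: str) -> None:
    """Append a production failure, plus its human-corrected answer, to the eval set."""
    case = {
        "input": production_input,
        "expected_output": corrected_output,
        "metadata": {"source": "production_failure", "tag": failure_tag},
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```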
Evals are your regression net. Production tracking is your early warning system. The new failure mode you catch in production today becomes an eval case for tomorrow.
The failure modes your evals don't cover yet are not a gap in your process - they're inevitable. No labeled dataset covers the full distribution of what real users will send. The question is whether you have a production monitoring layer that catches the failures you didn't anticipate, or whether you find out about them from your users.
For how production routing decisions work when failures are detected, see Stop Hardcoding Your AI Model Selection. For how these signals work in multi-agent pipelines where failures compound across hops, see Multi-Agent Systems Break Differently Than Single Agents.