Moazzam Qureshi

Posted on May 20

The complete process for evaluating production AI agents (datasets, evaluators, offline + online)

#ai #llm #machinelearning #agents

Most teams ship an AI agent, watch it work in a demo, and push it to production. Then it breaks on real traffic and nobody can say why. The gap between "worked in the demo" and "works in production" is almost always an evaluation gap — there was never a systematic way to measure what the agent actually does once real users hit it.

This is the complete evaluation process I run on every production agent I audit. It is vendor-neutral: the concepts apply whether you use LangSmith, Braintrust, Langfuse, Arize, or a homegrown harness. Treat it as the reference you wish someone had handed you before you shipped.

The mental model: two modes, one loop

Every serious evaluation practice has exactly two modes, and they form a single continuous loop:

Offline evaluation — "test before you ship." You evaluate against a curated dataset during development, so you can compare versions and catch regressions before they reach users.
Online evaluation — "monitor in production." You evaluate real user interactions on live traffic, in real time, so you detect issues on the inputs your users actually send.

The loop closes when failing production traces flow back into your offline dataset. A real failure your monitoring caught becomes a new test case, so the next version is evaluated against the exact thing that hurt you. This feedback loop is the difference between an agent that gets more reliable over time and one that decays.

   ┌──────────────── offline evaluation ────────────────┐
   │  datasets → evaluators → experiments → analysis    │
   └───────────────────────┬────────────────────────────┘
                           │ ship the version that passed
                           ▼
   ┌──────────────── online evaluation ─────────────────┐
   │  production runs → evaluators → monitoring         │
   └───────────────────────┬────────────────────────────┘
                           │ failing traces become test cases
                           └──────────► back to datasets
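
Closing the loop can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `Trace` and `Example` shapes, the `promote_failures` name, and the 0.5 score threshold are all assumptions made for the example. Promoted cases get no reference output until a human writes one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    """One production run. All field names here are hypothetical."""
    trace_id: str
    user_input: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class Example:
    """One offline test case."""
    input: str
    reference_output: Optional[str]  # None until a reviewer writes the reference
    source: str


def promote_failures(traces, dataset, threshold=0.5):
    """Append traces with any evaluator score below `threshold` to `dataset`.

    Inputs already in the dataset (ignoring case and outer whitespace) are
    skipped. Returns the list of newly added examples.
    """
    seen = {ex.input.strip().lower() for ex in dataset}
    added = []
    for trace in traces:
        failed = any(score < threshold for score in trace.scores.values())
        key = trace.user_input.strip().lower()
        if failed and key not in seen:
            example = Example(trace.user_input, None, f"production:{trace.trace_id}")
            dataset.append(example)
            added.append(example)
            seen.add(key)
    return added


dataset = [Example("Reset my password", "Send the reset link.", "manual")]
traces = [
    Trace("t1", "cancel order AND refund??", "Done.", {"groundedness": 0.2}),
    Trace("t2", "reset my password", "Sent.", {"groundedness": 0.1}),  # already covered
    Trace("t3", "what are your hours", "9 to 5.", {"groundedness": 0.9}),  # passed
]
added = promote_failures(traces, dataset)
print([ex.source for ex in added])  # ['production:t1']
```

Only the new failure (`t1`) is promoted; the duplicate and the passing run are left out, so the dataset grows by exactly the cases that hurt you.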

The five components

1. Datasets — what you test against

A dataset is a collection of test cases (examples), each with an input and, for offline evals, a reference output. The single highest-leverage decision in your whole eval practice is where the dataset comes from.

Three sources, not equally valuable:

Production traces (best): real inputs, including the malformed, multi-part, and off-topic edge cases you'd never invent. An eval built from production traces is the one most likely to predict production behavior; one built only from imagined inputs usually does not.
Manual curation (good for known risks): hand-written cases covering compliance scenarios, adversarial inputs, and "this must never happen" cases.
Synthetic generation (use to scale, not to seed): LLM-generated variations. Useful once you have a real seed set; dangerous as your primary source because it reflects what a model thinks users do, not what they actually do.

The mistake I see most: the team built the dataset by imagining how users behave. The eval passes. Production fails. The fix is always to rebuild from real traces. If you have no eval set at all (the common case), this is also how you build your first one.
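
One way to build that first set from real traces is to deduplicate the inputs and then sample a few per category, so no single kind of request dominates. The sketch below assumes each trace is a dict with hypothetical `input` and `category` keys; `build_seed_dataset` and the per-category cap are illustrative choices, not a prescribed method.

```python
import random
from collections import defaultdict


def build_seed_dataset(traces, per_category=2, seed=0):
    """Pick up to `per_category` distinct inputs from each category.

    `traces` is a list of dicts with hypothetical keys "input" and "category".
    """
    by_category = defaultdict(list)
    seen = set()
    for trace in traces:
        # Normalize case and whitespace so near-identical inputs count once.
        key = " ".join(trace["input"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        by_category[trace["category"]].append(trace)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    dataset = []
    for category in sorted(by_category):
        bucket = by_category[category]
        for trace in rng.sample(bucket, min(per_category, len(bucket))):
            dataset.append({
                "input": trace["input"],
                "category": category,
                "reference_output": None,  # to be filled in by a reviewer
                "source": "production",
            })
    return dataset


traces = [
    {"input": "Where is my order?", "category": "orders"},
    {"input": "where is  my order?", "category": "orders"},  # duplicate after normalizing
    {"input": "refund + cancel, order 1 and 2", "category": "orders"},
    {"input": "Change address", "category": "orders"},
    {"input": "asdf", "category": "off_topic"},
]
seed_set = build_seed_dataset(traces, per_category=2)
print(len(seed_set))  # 3
```

The reference outputs stay empty on purpose: sampling picks what to test, and a human still has to say what the right answer is.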

2. Evaluators — how you score outputs

Four types, and choosing the right one per criterion is what separates real evaluation from theater:

Code/heuristic — deterministic checks (valid JSON? right tool called? cost under threshold?). Always your first line. If a regex can verify it, never spend an LLM call on it.
LLM-as-judge — a model scores against a rubric. Powerful for subjective criteria (correctness, groundedness, tone), dangerous if uncalibrated.
Human review — ground truth. Slow, but the only thing you can calibrate an LLM-judge against.
Pairwise comparison — "is A better than B?" Far more reliable than absolute scores for subjective judgments.
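
As a rough sketch of the first and third types: deterministic checks need nothing beyond the standard library, and calibrating a judge starts with measuring how often it agrees with human labels. The run shape (`output`, `tool_calls`, `cost_usd`) and every function name below are assumptions for illustration.

```python
import json


def valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def called_tool(tool_calls, expected_tool) -> bool:
    """True if any recorded call used the expected tool."""
    return any(call["name"] == expected_tool for call in tool_calls)


def run_heuristics(run, expected_tool, cost_limit_usd):
    """Run every deterministic check on one agent run (hypothetical dict shape)."""
    return {
        "valid_json": valid_json(run["output"]),
        "right_tool": called_tool(run["tool_calls"], expected_tool),
        "cost_ok": run["cost_usd"] <= cost_limit_usd,
    }


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


run = {
    "output": '{"status": "refunded"}',
    "tool_calls": [{"name": "lookup_order"}, {"name": "issue_refund"}],
    "cost_usd": 0.04,
}
checks = run_heuristics(run, expected_tool="issue_refund", cost_limit_usd=0.05)
print(checks)  # {'valid_json': True, 'right_tool': True, 'cost_ok': True}

# Judge verdicts vs. human verdicts on the same four examples.
print(agreement_rate([True, True, False, True], [True, False, False, True]))  # 0.75
```

Raw agreement is the simplest calibration signal; if the labels are heavily imbalanced, a chance-corrected statistic would be a reasonable next step.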

3. Offline evaluation — testing before you ship

Run your evaluators against your dataset to produce an experiment: a measurement of one agent version on one dataset. Four use cases:

Benchmarking — compare versions (change one variable at a time)
Unit tests — validate one discrete behavior
Regression tests — fail the build when a score drops (the highest-ROI infra most teams skip)
Backtesting — run a new version against historical inputs to prove a fix works
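
A regression gate can be as small as a comparison between two experiments. The sketch below assumes every metric is "higher is better" and that scores arrive as plain dicts; the metric names, the 0.02 tolerance, and the helper names are illustrative.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: (baseline_score, candidate_score)} for every metric that
    dropped by more than `tolerance`. A metric missing from the candidate
    counts as a regression. Assumes higher scores are better.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or base_score - new_score > tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


def exit_code(regressions):
    """Non-zero exit code fails the build."""
    return 1 if regressions else 0


baseline = {"correctness": 0.91, "groundedness": 0.88, "tool_call_accuracy": 0.95}
candidate = {"correctness": 0.93, "groundedness": 0.80, "tool_call_accuracy": 0.94}

regressions = find_regressions(baseline, candidate)
for metric, (old, new) in regressions.items():
    print(f"REGRESSION {metric}: {old} -> {new}")
print("exit code:", exit_code(regressions))
# In CI you would end the script with: sys.exit(exit_code(regressions))
```

Here correctness improved and tool-call accuracy moved within tolerance, but groundedness dropped by 0.08, so the build would fail on that one metric.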

4. Online monitoring — evaluating live traffic

The key difference: no reference output exists. A real user sends a real input; nobody knows the "right" answer. So you use reference-free evaluators: groundedness, format validity, safety checks, refusal correctness, tool-call validity, trajectory sanity.

This is where you catch agent decay — the agent shipped working and is silently worse two months later. It shows up in eval metrics (hallucination rate, tool-call accuracy, cost per task) long before it shows up in your product dashboards. Wire anomaly alerts into Slack/PagerDuty, not a dashboard nobody opens.
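
A simple way to turn a reference-free evaluator into an alert is a rolling failure rate compared against a known baseline. This is a sketch under assumptions: the `RateMonitor` class, the window size, and the "twice the baseline" rule are all illustrative, and the actual call to Slack or PagerDuty is left out.

```python
from collections import deque


class RateMonitor:
    """Tracks a failure rate over the last `window` runs and flags a spike."""

    def __init__(self, baseline_rate, window=100, max_ratio=2.0, min_samples=20):
        self.baseline_rate = baseline_rate
        self.max_ratio = max_ratio
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # old runs fall off automatically

    def record(self, failed: bool) -> bool:
        """Record one evaluated run; return True if an alert should fire."""
        self.results.append(failed)
        if len(self.results) < self.min_samples:
            return False  # not enough data to judge yet
        rate = sum(self.results) / len(self.results)
        return rate > self.baseline_rate * self.max_ratio


# Baseline: 5% of runs fail the groundedness check. Alert above 10%.
monitor = RateMonitor(baseline_rate=0.05, window=50, min_samples=20)

# 19 passing runs, then a burst of 4 failures.
stream = [False] * 19 + [True] * 4
alerts = [monitor.record(failed) for failed in stream]
print(alerts.index(True))  # 21
```

The alert fires on the 22nd run (index 21), once three failures in 22 runs push the rate past 10%. In practice the `True` return value is where you would page someone.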

5. Criteria and metrics — what "good" means

The fundamental split most teams miss: output metrics vs trajectory metrics.

Output metrics — was the final answer correct, grounded, well-formed?
Trajectory metrics — what path did the agent take? Which tools, how many steps, did it loop, what did it cost?

Most teams measure only outputs. An agent can produce a correct answer while calling the same tool 14 times and burning $3 on a $0.02 task. Output-only evaluation scores that as a pass. It is not a pass — it is a production incident waiting for scale.
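
Trajectory metrics fall out of the trace almost for free. The sketch below assumes each step is a dict with hypothetical `tool`, `args`, and `cost_usd` keys; the repeat limit and cost budget are example thresholds, not recommendations.

```python
import json
from collections import Counter


def trajectory_metrics(steps, max_repeats=3, cost_budget_usd=0.10):
    """Summarize the path an agent took.

    `steps` is a list of dicts with hypothetical keys "tool", "args", "cost_usd".
    A call counts as identical when both the tool and its arguments match.
    """
    calls = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True)) for step in steps
    )
    most_repeated = max(calls.values(), default=0)
    total_cost = sum(step["cost_usd"] for step in steps)
    return {
        "num_steps": len(steps),
        "max_identical_calls": most_repeated,
        "looping": most_repeated > max_repeats,
        "total_cost_usd": round(total_cost, 4),
        "over_budget": total_cost > cost_budget_usd,
    }


# The same search issued 14 times: the answer may be right, the path is not.
steps = [{"tool": "search", "args": {"q": "order 1182"}, "cost_usd": 0.21}] * 14
metrics = trajectory_metrics(steps, max_repeats=3, cost_budget_usd=0.02)
print(metrics["num_steps"], metrics["looping"], metrics["over_budget"])  # 14 True True
```

An output-only evaluator would never see any of these fields, which is why this run can pass on correctness and still fail on trajectory.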

And kill the single aggregate "87% pass rate." It hides which 13% failed (the high-stakes cases?), whether failures cluster in one category, and whether you regressed. Decompose by category, track over time, surface the specific failing examples.
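
Decomposing the aggregate takes only a group-by. The result shape below (`id`, `category`, `passed`) and the `pass_rate_by_category` name are assumptions for illustration.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Break one aggregate pass rate into per-category rates plus failing ids.

    `results` is a list of dicts with hypothetical keys "id", "category", "passed".
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    failures = defaultdict(list)          # category -> ids of failing examples
    for result in results:
        totals[result["category"]][1] += 1
        if result["passed"]:
            totals[result["category"]][0] += 1
        else:
            failures[result["category"]].append(result["id"])
    return {
        category: {
            "pass_rate": passed / total,
            "n": total,
            "failing_ids": failures[category],
        }
        for category, (passed, total) in totals.items()
    }


results = (
    [{"id": f"f{i}", "category": "faq", "passed": True} for i in range(1, 6)]
    + [
        {"id": "r1", "category": "refunds", "passed": True},
        {"id": "r2", "category": "refunds", "passed": True},
        {"id": "r3", "category": "refunds", "passed": False},
    ]
)
overall = sum(r["passed"] for r in results) / len(results)
report = pass_rate_by_category(results)
print(overall)                           # 0.875
print(report["refunds"]["failing_ids"])  # ['r3']
```

The overall number reads as 87.5%, but the breakdown shows FAQ at 100% and refunds at two out of three, with the one failure sitting in the category where mistakes cost money.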

Why most evaluation fails

Four patterns, from real audits:

No dataset, or an imagined one. Production fails in ways the team had no way to see.
Output-only metrics. The trajectory failures that cost the most go unmeasured.
No online evaluation at all. Evaluated once before launch, never again. The agent has been decaying for weeks.
The loop is open. Production failures never become test cases, so each failure teaches nothing.

None of these are model problems. Switching GPT-4 → Claude → Gemini fixes none of them. They are engineering problems with known solutions.

I wrote this up in full, chapter by chapter (datasets, evaluators, offline, online, metrics), as an open guide here: The complete AI agent evaluation process. It is the exact process my firm runs on every production agent we audit.

If your agent is in production and breaking in ways you can't measure, that's the gap. Happy to talk through it — fixmyagent.agency.

DEV Community