How to Evaluate AI Agents: Trajectory Evals That Work

#ai #agents #testing #mlops

You cannot evaluate an agent by checking only its final answer. A multi-step agent can reach the right output through a broken path (calling the wrong tool, recovering by luck, taking eight steps where two would do), and a final-answer check waves it through. Then the same broken path fails on the next input and you have no idea why. Agent evaluation has to grade the trajectory, not just the destination.

We build and ship AI agents, and the eval harness is the part that separates the agents that survive a model upgrade from the ones that silently regress the day the provider ships a new version.

Score the path, not just the answer

A useful agent eval covers the whole trajectory with several dimensions, not one number:

Tool correctness: did it call the right tools? A deterministic check: the tool names it called, compared exactly against the expected ones.
Argument correctness: were the parameters right? Also deterministic wherever you can specify the required fields.
Step efficiency: did it take a reasonable number of steps, or wander?
Plan adherence and plan quality: did it follow a sensible plan, and was the plan good to begin with?
Task completion and reasoning quality: did it actually finish the job, and was the reasoning sound?

The important split: use deterministic checks for anything with a crisp right answer (tool names, required parameters, expected outputs) and save LLM-as-judge for the subjective stuff. Don't pay a judge model to check something a string comparison can verify.
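As a minimal sketch of the deterministic half, assuming a trajectory is already available as an ordered list of tool calls: the `ToolCall` and `Expected` types, the `score_trajectory` function, and the step budget are all hypothetical names made up for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step of the agent's trajectory (hypothetical shape)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Expected:
    """What a correct trajectory looks like for one test case (hypothetical shape)."""
    tools: list[str]                     # expected tool names, in order
    required_args: dict[str, set[str]]   # tool name -> argument keys that must be present
    max_steps: int                       # step budget for this task


def score_trajectory(calls: list[ToolCall], expected: Expected) -> dict[str, bool]:
    """Deterministic checks only: no judge model is involved."""
    # Tool correctness: exact sequence of tool names against the expected sequence.
    tools_ok = [c.name for c in calls] == expected.tools
    # Argument correctness: every call carries the required keys for its tool.
    args_ok = all(
        expected.required_args.get(c.name, set()) <= set(c.args)
        for c in calls
    )
    # Step efficiency: the agent stayed within the step budget.
    steps_ok = len(calls) <= expected.max_steps
    return {
        "tool_correctness": tools_ok,
        "argument_correctness": args_ok,
        "step_efficiency": steps_ok,
    }
```

Each dimension is reported separately rather than collapsed into one number, so a failure tells you which part of the path broke. Plan quality and reasoning quality are deliberately absent here; those are the ones left to a judge model.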

Multi-agent regressions hide in the sub-agents

If you've got an orchestrator with sub-agents, a top-level score will lie to you. The orchestrator can look fine while a sub-agent quietly degrades, because the system recovered or the bad output got averaged away. You need span-level evaluation: grade each sub-agent's span on its own. Production regressions in multi-agent systems tend to live in exactly the sub-agent nobody's eval was watching.
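One way to sketch span-level grading, assuming each sub-agent span has already been checked on its own and tagged with the sub-agent that produced it. The `Span` type, the function names, and the 0.05 tolerance are illustrative assumptions, not a tracing library's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One graded sub-agent span (hypothetical shape)."""
    agent: str     # which sub-agent produced this span
    passed: bool   # result of that span's own check


def pass_rate_by_agent(spans: list[Span]) -> dict[str, float]:
    """Aggregate per sub-agent instead of into one top-level score."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for span in spans:
        totals[span.agent] += 1
        passes[span.agent] += int(span.passed)
    return {agent: passes[agent] / totals[agent] for agent in totals}


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,   # assumed threshold; tune it to your noise level
) -> list[str]:
    """Names of sub-agents whose pass rate dropped by more than `tolerance`."""
    return sorted(
        agent
        for agent, rate in current.items()
        if agent in baseline and baseline[agent] - rate > tolerance
    )
```

Comparing per-agent rates against a stored baseline is what surfaces the quiet case: the orchestrator's end-to-end score holds steady while one sub-agent's own pass rate falls.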

LLM-as-judge is useful and quietly biased

LLM-as-judge is the right tool for subjective criteria, and it's riddled with biases you have to actively counter:

Position bias. Judges favor whichever answer came first, sometimes heavily. Flipping the order can flip the verdict. Fix: evaluate both orderings and average, or randomize position.
Self-preference. A judge tends to prefer outputs from its own model family. Fix: use a judge that's maximally different from the model you're grading, or require cross-family consensus.
Verbosity bias. Longer answers get rated higher regardless of substance. Fix: control for length, or instruct the judge to ignore it and spot-check that it does.
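The position-bias fix can be sketched as a wrapper that asks the judge twice, once per ordering, and only declares a winner when the two runs agree on balance. The `Judge` callable and its "first"/"second"/"tie" return values are an assumed interface for illustration; a real judge would wrap a model call.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge interface: takes (prompt, first_answer, second_answer)
# and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, prompt: str, a: str, b: str) -> Verdict:
    """Evaluate both orderings and combine them, so position alone cannot win."""
    forward = judge(prompt, a, b)    # answer A shown first
    backward = judge(prompt, b, a)   # answer B shown first

    # Translate position-based verdicts back into votes for A or B.
    a_votes = int(forward == "first") + int(backward == "second")
    b_votes = int(forward == "second") + int(backward == "first")

    if a_votes > b_votes:
        return "A"
    if b_votes > a_votes:
        return "B"
    return "tie"   # the verdict flipped with the order, or the judge called a tie
```

A judge that always picks whichever answer came first gets one vote for each side and lands on "tie", which is the behavior you want: a verdict that depends only on ordering carries no signal. It costs two judge calls per comparison.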

Properly calibrated, with its biases controlled and its verdicts validated against human labels, LLM-as-judge can reach strong agreement with human preferences, roughly the level at which humans agree with each other. The judge is reliable once you've done the work to calibrate it. It is not reliable out of the box.

Calibrate against humans, then trust the automation

The step teams skip is calibration. Before you trust a rubric, hand-label a set of examples and check that your judge agrees with your humans. If it doesn't, the rubric is ambiguous or the judge is biased, and either way your green dashboard is fiction. Humans calibrate the grader; the grader scales the humans. And watch for eval-set contamination: if benchmark examples leaked into training data, you're measuring memorization, not capability. Keep a held-out set you generated yourself.
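The agreement check itself can be a few lines. The sketch below uses Cohen's kappa, which is one common choice because it corrects for agreement that would happen by chance; the post doesn't prescribe a specific statistic, so treat the metric and the function name as assumptions.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Observed agreement: fraction of examples where the labels match.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if both raters labeled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        return 1.0   # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw percent agreement looks flattering when most examples pass; kappa near 0 means the judge agrees with your humans no better than chance would, however green the dashboard is. What counts as "good enough" depends on the task and on how well your human labelers agree with each other.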

Offline evals miss drift, so run online too

A test suite you run before deploy catches known failures. It does not catch the new ways real traffic breaks your agent. Run streaming evals on a sample of production traffic with drift detection and alerting. Offline evals are your regression net; online evals are how you find the failures you didn't know to write a test for. This is the runtime version of the same investment we argued for in our piece on AI-written code, "AI writes 4x the code, here's the QA layer that stops 4x the bugs."
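A rough sketch of the two online pieces, sampling and drift detection, using only the standard library. The sampling rate, window size, allowed drop, and the `in_sample` and `DriftMonitor` names are all assumptions chosen for illustration; a production setup would more likely sit on top of whatever tracing and alerting stack you already run.

```python
import hashlib
from collections import deque


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select a fixed fraction of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    return bucket < rate


class DriftMonitor:
    """Flags when the rolling mean eval score falls too far below a baseline."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.05):
        self.baseline = baseline   # mean score from the offline suite (assumed known)
        self.window = window
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one eval score; return True when an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False   # not enough data to judge yet
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.max_drop
```

Hashing the trace ID rather than calling a random number generator means the same trace is always in or out of the sample, which makes a flagged failure reproducible. The mean-versus-baseline rule is the simplest drift test available; it is a stand-in here, not a recommendation over a proper statistical test.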

Key takeaways

Grade the trajectory: tool correctness, argument correctness, step efficiency, plan quality, completion. Not just the final answer.
Deterministic checks for crisp things (tool names, params); LLM-as-judge for subjective things.
Evaluate sub-agents at the span level. Top-level scores hide sub-agent regressions.
LLM judges have position, self-preference, and verbosity biases. Counter the biases first, then trust the judge.
Calibrate judges against human labels, keep a held-out set, and run online evals to catch drift.

FAQ

Why isn't final-answer accuracy enough?
Because an agent can get the right answer through a broken path that fails next time. Trajectory evals catch the broken path before it costs you.

Can I trust LLM-as-judge?
After calibration, yes, for subjective criteria. Control for position and verbosity bias, use a judge from a different model family than the one you're grading, and validate against human labels.

Do I need online evals if I have a good offline suite?
Yes. Offline catches known regressions; online catches drift and novel real-world failures your tests never anticipated.

If you're standing up an eval harness for agents and wrestling with judge calibration, that's a problem we like. We at Shanti Infosoft are happy to swap rubrics and harness designs with anyone building agents.