Evals for Agents: Scoring Task Success, Trajectory, and Human Review

#ai #llm #python #agents

Book: Observability for LLM Applications — Tracing, Evals, and Shipping AI You Can Trust
Also by me: Agents in Production — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

The agent finished. The trace is green. Every span says ok. The user got an answer, and the success counter on your dashboard ticked up by one. You would have shipped it on a Friday and gone home.

Then someone reads the run. To close a support ticket, the agent searched the knowledge base, found nothing, searched again with the same query, found nothing, reached for a refund_charge tool it was never meant to touch on that tier, refunded thirty-seven dollars, wrote a cheerful closing note, and marked the ticket resolved. Task success: 1.0. The run itself: a small fire.

That gap is where single-number evals fall through with agents. When you scored a plain LLM call, the unit was the output. Input in, output out, grade it, move on. An agent has no single output to grade. It has an input, a trajectory, and a final state. Score only the final state and you miss every interesting failure. Score only the trajectory and you miss the ones that worked by accident. You need both, plus a way to pull in a human when neither is sure.

Three axes, not one

The eval loop you already know still holds. Collect a dataset, score against it with a judge, track the score, fail CI when it drops. What changes for agents is the unit. A run has three things worth scoring:

Task success. At the end, did the world look the way the user asked? The final-state check.
Trajectory. Along the way, were the steps sane? Right tools, right order, right arguments, no wandering?
Human judgment. When neither automated check is confident, or the stakes are high, a person grades the run. Those grades become ground truth for the first two.

Every production agent eval stack is some blend of these three. The vocabulary differs across tools. The shape does not.
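To make the three axes concrete, here is a minimal sketch of a per-run score record. The class name, its fields, and the rule that a human label overrides the automated checks are illustrative assumptions, not part of any library. The 0.7 cutoff mirrors the threshold used later in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRunScore:
    """Hypothetical record holding the three axes for one agent run."""

    run_id: str
    task_success: float                 # 0.0 to 1.0 from the final-state judge
    trajectory_ok: bool                 # verdict from the trajectory check
    human_label: Optional[bool] = None  # set only when a person reviewed the run

    def verdict(self, threshold: float = 0.7) -> str:
        # Assumption: a human label, when present, is treated as ground truth.
        if self.human_label is not None:
            return "pass" if self.human_label else "fail"
        # Otherwise both automated axes have to agree that the run was fine.
        if self.task_success >= threshold and self.trajectory_ok:
            return "pass"
        return "fail"


# The refund run from the intro: perfect endpoint score, bad path.
refund_run = AgentRunScore("run-1", task_success=1.0, trajectory_ok=False)
print(refund_run.verdict())  # fail
```

The point of the shape is that no single field decides the outcome on its own.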

Axis one: task success

Task success is the easiest to explain and the most dangerous to trust alone. You take the original request, you take the final state, and you ask whether the state satisfies the request. An LLM judge reads the run and returns a 0 to 1 score with a reason.

DeepEval's TaskCompletionMetric reads a trace, infers the task from the first user turn and the outcome from the final assistant turn plus the tool calls, and scores it. No hand-written expected output is required. The test case below is spelled out by hand only to show which fields the judge reads.

from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.models import AnthropicModel

# Bare model strings resolve through DeepEval's
# default provider. Wrap for a Claude judge.
judge = AnthropicModel("claude-haiku-4-5")

metric = TaskCompletionMetric(
    threshold=0.7,
    model=judge,
    include_reason=True,
)

case = LLMTestCase(
    input="Refund the last charge for user 8821.",
    actual_output=(
        "Refunded $37.00 to card ending 4242. "
        "Ticket #91234 closed."
    ),
    # ToolCall takes the tool arguments as input_parameters.
    tools_called=[
        ToolCall(name="lookup_user", input_parameters={"id": 8821}),
        ToolCall(name="list_charges", input_parameters={"user": 8821}),
        ToolCall(
            name="refund_charge",
            input_parameters={"charge_id": "ch_7x"},
        ),
        ToolCall(name="close_ticket", input_parameters={"id": 91234}),
    ],
)

metric.measure(case)
print(metric.score, metric.reason)

Wire it into CI the way you wired faithfulness for plain calls. Pull a sample of production traces nightly, score them, fail the build when the rolling average drops below threshold.
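The gate itself can be a few lines of plain Python. This is a sketch under assumptions: the function name, the seven-night window, and the idea that you store one mean score per nightly run are all illustrative choices, and your CI script would call sys.exit(1) when the gate returns False.

```python
from statistics import mean


def rolling_average_gate(nightly_scores, window=7, threshold=0.7):
    """Return (ok, avg) for the most recent `window` nightly mean scores."""
    recent = list(nightly_scores)[-window:]
    if not recent:
        raise ValueError("no scores to evaluate")
    avg = mean(recent)
    return avg >= threshold, avg


# One mean task-completion score per night, oldest first (made-up numbers).
history = [0.82, 0.80, 0.79, 0.74, 0.66, 0.61, 0.58]

ok_week, avg_week = rolling_average_gate(history, window=7)
ok_recent, avg_recent = rolling_average_gate(history, window=3)
print(ok_week, round(avg_week, 3))      # True 0.714
print(ok_recent, round(avg_recent, 3))  # False 0.617
```

The same history passes on a seven-night window and fails on a three-night one. A long window hides a fresh regression, so pick the window to match how fast you need to hear about a drop.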

One thing to be blunt about: the judge model matters. Ideally a judge is at least as strong as the agent it grades, because a weak judge grading a strong agent hands out kind, forgiving scores that drift upward and tell you nothing. Cost usually forces a compromise. When the agent runs on a frontier model, a reasonable default is to start with the cheapest frontier-family judge you can afford, such as Claude Haiku, and upgrade the judge to the agent's tier as soon as you catch it passing runs that humans later flag.

Now the catch that keeps the refund fire on the books. The refund agent's endpoint was "ticket closed with refund issued." If the request was "refund the last charge," that endpoint scores a perfect 1.0. The judge has no idea the agent was not authorized to issue that refund on a free-tier account. That is not a final-state question. That is a trajectory question.

Axis two: trajectory

Trajectory scoring is the glass-box half. You stop asking "did it finish" and start asking "were the steps reasonable given what the agent already knew." Most approaches share one recipe: break the run into small pieces, grade each piece against clear criteria, roll up. Feed the trajectory judge the steps, not the final answer, because judges tend to over-score runs that ended well no matter how messy the path.

You do not have to build that from scratch. The agentevals package from the LangChain org ships ready-made trajectory evaluators: exact-match tool sequence, subset, unordered set, and LLM-graded. Plug in your trace, plug in the expected shape, get a verdict.

from agentevals.trajectory.match import (
    create_trajectory_match_evaluator,
)

evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

actual = [
    {"role": "user", "content": "Refund user 8821."},
    {"role": "assistant", "tool_calls": [
        {"name": "lookup_user", "args": {"id": 8821}},
    ]},
    {"role": "tool", "content": "{'tier': 'free'}"},
    {"role": "assistant", "tool_calls": [
        {"name": "refund_charge",
         "args": {"charge_id": "ch_7x"}},
    ]},
]

expected = [
    {"role": "user", "content": "Refund user 8821."},
    {"role": "assistant", "tool_calls": [
        {"name": "lookup_user", "args": {"id": 8821}},
    ]},
    {"role": "tool", "content": "{'tier': 'free'}"},
    {"role": "assistant", "tool_calls": [
        {"name": "escalate_to_human",
         "args": {"reason": "free tier refund"}},
    ]},
]

result = evaluator(
    outputs=actual,
    reference_outputs=expected,
)
print(result)
# {'key': 'trajectory_strict_match', 'score': False, ...}

The refund scenario fails this check loudly. The expected path ends in escalate_to_human. The actual path ends in refund_charge. Task success said 1.0. Trajectory says false. Each axis catches a failure class the other misses.

Do not grade every step with an LLM judge unless cost is no object. A practical stack is exact-match on the steps that matter (the destructive tool calls, the final answer shape), an LLM judge on the fuzzy reasoning in between, and one rubric pass at the end. Cheap step-level scorers are plain functions:

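# Destructive tools each tier may call without escalation.
AUTHORIZED = {"free": set(), "pro": {"refund_charge"}}
DESTRUCTIVE = {"refund_charge", "send_email"}

def destructive_authorized(step, **_):
    """Per-step check: 0.0 for an unauthorized destructive call."""
    call = step.get("tool_call")
    if not call:
        return 1.0  # no tool call on this step
    if call["name"] not in DESTRUCTIVE:
        return 1.0  # non-destructive tools always pass
    tier = step["context"]["user_tier"]
    # Unknown tiers get no destructive permissions.
    ok = call["name"] in AUTHORIZED.get(tier, set())
    return 1.0 if ok else 0.0

def trajectory_length(output, **_):
    """Run-level check: full marks up to 6 steps, zero at 20 or more."""
    steps = len(output["trajectory"])
    if steps <= 6:
        return 1.0
    if steps >= 20:
        return 0.0
    # Linear falloff between 6 and 20 steps.
    return 1.0 - (steps - 6) / 14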

The first catches an unauthorized destructive call on a per-step basis. The second penalizes wandering. Neither calls a model, both run on every replay, and both catch a class of failure that an endpoint metric never sees.
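Per-step scores still have to become one number per run. The post does not prescribe a roll-up, so treat this as one possible sketch: hard checks such as authorization act as gates that zero the run, and soft checks are averaged. The function name and the gate-versus-soft split are assumptions.

```python
def roll_up(gate_scores, soft_scores):
    """Combine per-step scores into a single run score.

    gate_scores: hard checks (for example, authorization). Any 0.0 fails the run.
    soft_scores: graded checks (for example, reasoning quality). Averaged.
    """
    if any(score == 0.0 for score in gate_scores):
        return 0.0
    if not soft_scores:
        return 1.0
    return sum(soft_scores) / len(soft_scores)


# Refund run: the second step made an unauthorized destructive call.
print(roll_up(gate_scores=[1.0, 0.0], soft_scores=[1.0, 1.0]))  # 0.0

# Clean run with one mediocre reasoning step.
print(roll_up(gate_scores=[1.0, 1.0], soft_scores=[1.0, 0.5]))  # 0.75
```

Averaging the gates along with everything else would let one unauthorized refund hide behind nine polite steps, which is why they are kept separate here.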

Axis three: human review

Automated evals are cheap, but they are wrong often enough to matter, and their disagreements with people cluster in the cases that matter most. The fix is not another round of judge-prompt tuning. The fix is a human review queue that pulls ambiguous runs out of production, puts them in front of a person, and feeds the resulting grades back as ground truth.

Build it from three parts. A router reads your trace stream and writes flagged trace IDs to a queue table. A reviewer UI pops the next flagged trace, shows it in full, and captures a label. A writer appends completed labels to your offline eval dataset. Langfuse, Braintrust, and LangSmith ship all three; rolling your own is roughly two hundred lines plus a six-column table.
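If you roll your own, the storage side might look like the sketch below, using sqlite3 from the standard library. The six column names, the status values, and every function name are assumptions chosen for illustration. The reviewer UI is left out and would sit on top of next_pending and record_label.

```python
import sqlite3
import time

# Hypothetical six-column queue table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS review_queue (
    trace_id   TEXT PRIMARY KEY,
    reason     TEXT NOT NULL,
    flagged_at REAL NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    label      TEXT,
    reviewer   TEXT
)
"""


def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def flag(conn, trace_id, reason):
    """Router side: enqueue a trace. Re-flagging the same trace is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO review_queue (trace_id, reason, flagged_at) "
        "VALUES (?, ?, ?)",
        (trace_id, reason, time.time()),
    )
    conn.commit()


def next_pending(conn):
    """Reviewer side: oldest pending (trace_id, reason), or None if empty."""
    return conn.execute(
        "SELECT trace_id, reason FROM review_queue "
        "WHERE status = 'pending' ORDER BY flagged_at, rowid LIMIT 1"
    ).fetchone()


def record_label(conn, trace_id, label, reviewer):
    """Reviewer side: store the human label and mark the trace done."""
    conn.execute(
        "UPDATE review_queue SET status = 'done', label = ?, reviewer = ? "
        "WHERE trace_id = ?",
        (label, reviewer, trace_id),
    )
    conn.commit()


def export_labels(conn):
    """Writer side: completed (trace_id, label) pairs for the offline dataset."""
    return conn.execute(
        "SELECT trace_id, label FROM review_queue "
        "WHERE status = 'done' ORDER BY flagged_at, rowid"
    ).fetchall()


queue = open_queue()
flag(queue, "trace-91234", "destructive_call")
flag(queue, "trace-91235", "low_judge_score")
record_label(queue, "trace-91234", "fail", "reviewer-1")
print(next_pending(queue))   # ('trace-91235', 'low_judge_score')
print(export_labels(queue))  # [('trace-91234', 'fail')]
```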

The UI choice that matters most: the reviewer must see the trajectory, not just the final message. Show only the fluent closing reply and they will rubber-stamp everything. Put the request first, verbatim. Then the trajectory, one step per row, with destructive tools badged. Then the final message, last. Then the automated scores, visible but not authoritative. Then a short scoring form, three fields at most. A reviewer labels fifty runs an hour with a three-field form and ten runs an hour with a ten-field one.

Routing rules keep the queue honest. Send everything and reviewers start clicking "pass" to clear the backlog. Good triage: judge score below 0.7, any un-preapproved destructive call, any run that blew the cost cap and recovered, anything past the 95th-percentile latency, and one green run in two hundred at random so reviewers stay calibrated against normal, successful behavior.
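Those triage rules translate almost line for line into a routing function. In this sketch the run dict and its field names are hypothetical, and the 95th-percentile latency is assumed to be computed elsewhere and passed in. The random source is a parameter so the one-in-two-hundred sample can be tested.

```python
import random


def review_reason(run, p95_latency_s, rng=random):
    """Return why a run should go to human review, or None to skip it.

    `run` is a hypothetical summary dict for one trace.
    """
    if run["judge_score"] < 0.7:
        return "low_judge_score"
    if run["unapproved_destructive_call"]:
        return "destructive_call"
    if run["recovered_from_cost_cap"]:
        return "cost_cap"
    if run["latency_s"] > p95_latency_s:
        return "slow"
    # Calibration sample: roughly one otherwise-green run in two hundred.
    if rng.random() < 1 / 200:
        return "random_sample"
    return None


refund_run = {
    "judge_score": 1.0,
    "unapproved_destructive_call": True,
    "recovered_from_cost_cap": False,
    "latency_s": 4.2,
}
print(review_reason(refund_run, p95_latency_s=12.0))  # destructive_call
```

The refund run gets flagged despite its perfect judge score, which is the whole reason the destructive-call rule sits next to the score rule instead of behind it.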

The loop is the point. A human label on a flagged trace is worth a hundred synthetic cases, because it came from a real request and a real failure. Export labels weekly, re-run your judge against them, and tune the judge when it disagrees with people. Over a quarter the judge gets cheaper and more accurate, and you lean on the queue less for volume and more for the long tail.
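The weekly re-run boils down to comparing judge verdicts with human labels. Here is a minimal sketch, assuming each exported label is paired with the judge score for the same trace. The function name and the two reported rates are illustrative choices, and the 0.7 cutoff matches the routing threshold above.

```python
def judge_agreement(pairs, threshold=0.7):
    """Compare judge scores with human labels.

    pairs: iterable of (judge_score, human_passed) for reviewed traces.
    Returns the agreement rate and the share of runs the judge passed
    but a human failed.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no labeled runs to compare")
    agree = 0
    false_pass = 0
    for judge_score, human_passed in pairs:
        judge_passed = judge_score >= threshold
        if judge_passed == human_passed:
            agree += 1
        elif judge_passed:
            false_pass += 1  # judge said pass, human said fail
    return {
        "agreement": agree / len(pairs),
        "false_pass_rate": false_pass / len(pairs),
    }


# Made-up week of labels. The second entry is the refund run.
weekly = [(0.9, True), (1.0, False), (0.4, False), (0.8, True)]
print(judge_agreement(weekly))
# {'agreement': 0.75, 'false_pass_rate': 0.25}
```

The false-pass rate is the number to watch when deciding whether the judge needs tuning or a stronger model, since those are the runs that would have shipped.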

About the public benchmarks

Someone always asks where SWE-bench and GAIA fit. Short answer: not in your CI. They tell you which models clear a baseline in a domain, and a model that cannot clear SWE-bench Verified is unlikely to handle your codebase. But a model that tops it might still be bad at your codebase, because the benchmark's input distribution is not your users'. SWE-bench tasks come from clean Python projects with isolated fixes. Your bug reports arrive as Slack screenshots, and a "fix" spans three services and a migration.

Build the boring thing instead. Pull a hundred real requests from the last week, hand-label each for the outcome you wanted, save the pairs as JSON. That golden set is smaller and noisier than any leaderboard, and it is the only dataset whose score tracks whether your agent works for the people paying you. Run it on every PR that touches the agent, and grow it from the review queue. When a model upgrade moves the public leaderboard and your golden set in opposite directions, trust the golden set.
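A golden-set runner does not need a framework. In the sketch below the JSON field names, the outcome strings, the second sample request, and the stub agent are all made up for illustration. In a real PR check you would pass your agent's entry point as run_agent and fail the job when the pass rate drops.

```python
import json

# Hypothetical golden set: a real request paired with the outcome you wanted.
GOLDEN_JSON = """
[
  {"request": "Refund the last charge for user 8821.",
   "expected_outcome": "escalated"},
  {"request": "Reset the password for user 1203.",
   "expected_outcome": "resolved"}
]
"""


def golden_pass_rate(cases, run_agent):
    """Fraction of cases where the agent's outcome matches the hand label."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for case in cases
        if run_agent(case["request"]) == case["expected_outcome"]
    )
    return passed / len(cases)


def stub_agent(request):
    # Stand-in for the real agent: it "resolves" everything, refunds included.
    return "resolved"


cases = json.loads(GOLDEN_JSON)
rate = golden_pass_rate(cases, stub_agent)
print(f"{rate:.2f}")  # 0.50
```

Exact matching on an outcome label is the crudest possible comparison. It is also the cheapest, and it is enough to catch an agent that resolves a ticket it should have escalated.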

Where this leaves you

You can score an agent now on three axes: task success for the endpoint, trajectory for the path, human review for the runs neither one can call. Task success alone would have shipped the refund fire. The three together catch it. None of them stops the agent from refunding money it was not allowed to refund the first time. Tracing sees it after the fact, evals grade it after the fact, the queue catches it after the fact. After is the word all three share, which is exactly why you want all three running before the next Friday deploy.

If you are building the agent that produces these runs, Agents in Production is the book on tool loops, authorization, and shipping the thing safely. If you are wiring the tracing and eval pipeline underneath it, Observability for LLM Applications covers the traces, the judges, and the cost accounting that make the loop above real. Together they are The AI Engineer's Library.