Part of an eval-first series. The trajectory evaluator described here shipped as eval-sanity v0.3 (zero dependencies, deterministic).
Repo: https://github.com/elvisyao007/eval-sanity · Agent + traces: https://github.com/elvisyao007/onprem-llm-stack/tree/main/payloads/invoice-agent
By 2026 the agent-evaluation problem is no longer hypothetical. LangChain's State of AI Agents report puts 57% of organizations with agents in production and names quality as the top deployment barrier. The standard answer to "how do you evaluate an agent" has become: capture the trajectory, then have an LLM judge it.
LLM-as-judge is real and necessary — for the parts that need it. But a large fraction of agent evaluation is deterministic, needs no judge at all, and happens to catch the failures that hurt most in an enterprise setting: the agent calling the wrong tool, skipping a required check, or writing bad data into a system of record. I built a small deterministic trajectory evaluator to make exactly that point, and ran it against a real invoice-processing agent.
Here's the case for doing the cheap, deterministic layer first — and doing it well.
The agent: invoice extraction with a refusal condition
The test subject is deliberately boring: a Japanese invoice (請求書) agent running on a self-hosted stack (LiteLLM gateway → local model). Three tools, no framework — just native function calling, because the agent is the thing being evaluated, not the thing being engineered.
- `extract_fields(pdf)`: pull structured fields from the invoice
- `validate(fields)`: check the consumption-tax math and that line items sum to the stated total
- `write_back(fields)`: commit to a (mock) accounting system
The interesting behavior isn't extraction; OCR can extract. It's the refusal: when validate fails, the agent must not write back. An agent that dutifully commits an invoice whose tax is miscalculated is worse than no agent, because it launders a bad number into your books with an audit trail that says "automated."
I seeded five invoices: three clean, two with planted arithmetic errors (wrong 消費税 on one, wrong total on another). The good agent extracts all three clean ones and writes them back; on the two broken ones, it flags and refuses. That refusal is the whole value proposition.
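A minimal sketch of what the arithmetic check inside validate might look like. The field names (`line_items`, `tax`, `total`), the 10% rate, and the round-down rule are assumptions made for illustration; they are not taken from the invoice agent's actual code.

```python
def validate(fields, tax_rate_percent=10):
    """Check consumption-tax math and the stated total for one invoice.

    `fields` is a hypothetical schema: tax-exclusive line amounts in yen,
    plus the tax and total printed on the invoice.
    """
    subtotal = sum(fields["line_items"])
    # Assumption: tax is computed on the subtotal and rounded down to whole yen.
    expected_tax = subtotal * tax_rate_percent // 100
    expected_total = subtotal + expected_tax

    errors = []
    if fields["tax"] != expected_tax:
        errors.append(f"tax mismatch: stated {fields['tax']}, expected {expected_tax}")
    if fields["total"] != expected_total:
        errors.append(f"total mismatch: stated {fields['total']}, expected {expected_total}")
    return {"passed": not errors, "errors": errors}


clean = {"line_items": [1000, 2000], "tax": 300, "total": 3300}
bad_tax = {"line_items": [1000, 2000], "tax": 240, "total": 3240}
bad_total = {"line_items": [1000, 2000], "tax": 300, "total": 3500}
```

Working in integer yen keeps the comparison exact, which is what lets the result act as a hard gate rather than a fuzzy score.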
What you can check without a judge
Here's the part the "just use an LLM judge" framing underrates. For an agent like this, most of what you care about is decidable by assertion, not by opinion:
- Tool-call correctness: did the expected tools get called, with valid arguments?
- Order constraints: was `write_back` always preceded by a passing `validate`? This is a pure structural property of the trace.
- Step efficiency: how many steps, and were there redundant or repeated calls?
- Task completion: against ground truth, did the right thing happen (write-back for clean, refusal for broken)?
None of these need a model to grade them. They're exact, reproducible, and fast enough to run as a CI gate on every prompt change, tool addition, or model swap. The 2026 consensus is converging on exactly this ordering — cheap deterministic checks first, escalate to an LLM judge only for what rules genuinely can't reach (was the agent's prose helpful, was its reasoning sound). I'm not arguing against LLM judges. I'm arguing that skipping straight to them skips the layer that catches the operationally worst failures.
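To make "decidable by assertion" concrete, here is a rough sketch of the order constraint as a plain function over a trace. The trace shape (a list of dicts with `tool` and `result` keys) and the function name are hypothetical, chosen for illustration rather than copied from eval-sanity.

```python
def order_violations(trace, guarded="write_back", prerequisite="validate"):
    """Return a list of violations where `guarded` ran without a passing `prerequisite`.

    `trace` is assumed to be an ordered list of tool calls, each a dict like
    {"tool": "validate", "result": {"passed": True}}.
    """
    violations = []
    seen = False     # has the prerequisite been called at all?
    passed = False   # did its most recent call pass?
    for step, call in enumerate(trace, start=1):
        if call["tool"] == prerequisite:
            seen = True
            passed = bool(call.get("result", {}).get("passed"))
        elif call["tool"] == guarded:
            if not seen:
                violations.append(f"step {step} {guarded}: no preceding {prerequisite}")
            elif not passed:
                violations.append(f"step {step} {guarded}: preceding {prerequisite} did not pass")
    return violations


good = [
    {"tool": "extract_fields", "result": {}},
    {"tool": "validate", "result": {"passed": True}},
    {"tool": "write_back", "result": {}},
]
skipped = [
    {"tool": "extract_fields", "result": {}},
    {"tool": "write_back", "result": {}},
]
```

The check is a single pass over the trace with no model call, so the same trace always yields the same verdict.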
Proving the evaluator actually discriminates
A familiar trap — one I walked into on an earlier model-selection benchmark — is an evaluator that passes everything. An evaluator that rubber-stamps good traces tells you nothing; you have to show it fails the bad ones.
So I didn't only run it on the five real (passing) traces. I constructed deliberately broken trajectories and confirmed each one gets caught, deterministically:
| Constructed failure | Caught? | How |
|---|---|---|
| `write_back` without calling `validate` | ✅ | missing required tool + order violation: "step 2 write_back: no preceding validate" |
| `write_back` after a failing `validate` | ✅ | order violation: "'passed' never True" |
| Redundant / unexpected extra tool calls | ✅ | surfaced as diagnostics (redundant count, unexpected tool list) |
| `write_back` on an invoice that should be refused | ✅ | forbidden-tool violation on the refusal spec |
On the real traces: the three clean invoices pass with 3 steps each, zero violations; the two broken invoices correctly show 2 steps, no write_back, status flagged. The evaluator distinguishes "did the right thing" from "did the wrong thing" — which is the only property that makes an evaluator worth running.
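One way to keep that discrimination property from rotting is to treat known-bad trajectories as fixtures and assert that the evaluator rejects every one of them. The sketch below uses hypothetical spec keys (`required`, `forbidden`) and a hypothetical trace shape; it illustrates the idea and is not the eval-sanity API.

```python
CLEAN_SPEC = {"required": ["extract_fields", "validate", "write_back"], "forbidden": []}
REFUSAL_SPEC = {"required": ["extract_fields", "validate"], "forbidden": ["write_back"]}


def call(tool, **result):
    """Build one trace step (hypothetical shape)."""
    return {"tool": tool, "result": result}


def spec_violations(trace, spec):
    """Required tools must appear; forbidden tools must not."""
    called = [step["tool"] for step in trace]
    violations = [f"missing required tool: {t}" for t in spec["required"] if t not in called]
    violations += [f"forbidden tool called: {t}" for t in spec["forbidden"] if t in called]
    return violations


# Fixtures: (trace, spec) pairs the evaluator must accept or reject.
MUST_PASS = [
    ([call("extract_fields"), call("validate", passed=True), call("write_back")], CLEAN_SPEC),
    ([call("extract_fields"), call("validate", passed=False)], REFUSAL_SPEC),
]
MUST_FAIL = [
    ([call("extract_fields"), call("write_back")], CLEAN_SPEC),  # skipped validate
    ([call("extract_fields"), call("validate", passed=False), call("write_back")], REFUSAL_SPEC),
]


def evaluator_discriminates(evaluate):
    """True only if `evaluate` accepts every good fixture and rejects every bad one."""
    passes_good = all(not evaluate(trace, spec) for trace, spec in MUST_PASS)
    fails_bad = all(evaluate(trace, spec) for trace, spec in MUST_FAIL)
    return passes_good and fails_bad
```

A rubber-stamp evaluator such as `lambda trace, spec: []` fails this meta-test, which is exactly the trap the fixtures are there to catch.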
Silent trajectory regression
There's a sneakier failure than an outright wrong answer: the agent still completes the task, but its path quietly degrades — more steps, an occasional skipped check, a creeping violation rate — after a prompt tweak or model swap. Outcome-only evaluation misses this completely, because the outcome still looks fine.
The evaluator reuses a paired-bootstrap regression check (carried over from the retrieval-metric version of this tool) at the trajectory level: compare a baseline set of traces against a candidate set and alarm when completion stays flat but violation rate or step efficiency degrades significantly. In testing, a baseline of 8 good traces against a candidate of 4-bad-plus-4-good fired the alarm (completion −0.50, violations +0.50); two identical runs produced zero movement and correctly stayed silent.
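For readers unfamiliar with the technique, a paired bootstrap over per-task metric differences can be sketched in a few lines of standard-library Python. The function names, the fixed seed, and the percentile-interval alarm rule below are illustrative assumptions, not the evaluator's actual implementation.

```python
import random


def paired_bootstrap_delta(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-task change (candidate - baseline) with a percentile interval.

    `baseline[i]` and `candidate[i]` must be the same metric on the same task,
    e.g. 1 if the trace had a violation, 0 otherwise.
    """
    if len(baseline) != len(candidate):
        raise ValueError("paired bootstrap needs one candidate value per baseline value")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed keeps the check deterministic
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample task pairs
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


def degraded(baseline, candidate, **kwargs):
    """Alarm when the whole interval for the increase sits above zero."""
    _, lower, _ = paired_bootstrap_delta(baseline, candidate, **kwargs)
    return lower > 0


# Violation indicator per trace: 8 clean baseline runs vs. 4 bad + 4 good.
baseline_violations = [0] * 8
candidate_violations = [1, 1, 1, 1, 0, 0, 0, 0]
```

Resampling the differences rather than the two sets independently is what makes the test paired: each task is compared against itself, so task-to-task variation does not drown out a real shift. Two identical runs give all-zero differences, so the interval collapses to zero and the alarm stays silent.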
When this is the wrong tool
The honest boundary, because it matters: if your agent always runs the same fixed sequence — retrieve, generate, format, every time — scoring the path buys you little; the output already tells you what you need. And if you're still in early prototyping, figuring out what the agent should do, formalizing trajectory specs is premature.
Trajectory evaluation earns its place when the path has constraints that can be violated — like "never write back without a passing validate." My invoice agent has exactly that property, which is why structural checking is worth it here. A different agent might not need it. Knowing which case you're in is part of the judgment.
The design choices worth stealing
Two decisions did more work than the metrics themselves:
The order constraint is enforced in code, not just evaluated. The agent's write_back has a Python-side guard that refuses to commit unless validate passed — independent of whether the LLM "decided" to follow instructions. You cannot trust an agent to always honor step ordering from a prompt; the load-bearing constraint belongs in the code, and the evaluator then confirms the trace respects it. Defense in depth, not prompt faith.
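A guard of this kind could look roughly like the following. The class and exception names are hypothetical, and binding approval to the exact validated fields is an extra assumption of this sketch, not something the article states about the real agent.

```python
class WriteBackRefused(Exception):
    """Raised when write_back is attempted without a passing validate."""


class GuardedLedger:
    """Tool host that enforces validate-before-write_back in code."""

    def __init__(self, validator):
        self._validator = validator  # callable: fields -> {"passed": bool, ...}
        self._approved = None        # the fields that most recently passed validation
        self.committed = []          # stand-in for the mock accounting system

    def validate(self, fields):
        result = self._validator(fields)
        self._approved = dict(fields) if result["passed"] else None
        return result

    def write_back(self, fields):
        # The guard runs no matter what the model decided to do.
        if self._approved is None or self._approved != fields:
            raise WriteBackRefused("write_back blocked: no passing validate for these fields")
        self.committed.append(dict(fields))
        self._approved = None  # one approval allows one commit


def totals_match(fields):
    """Toy validator used only for this example."""
    return {"passed": fields["total"] == fields["subtotal"] + fields["tax"]}
```

Because approval is tied to the validated payload, a model that validates one set of fields and then tries to write a different set is refused as well.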
The eval is configurable, not absolute. Calling an unexpected tool doesn't auto-fail — it surfaces as a diagnostic unless you explicitly add the tool to a forbidden list. Different tasks tolerate different slack. The strictness is a property of the spec you write, not baked into the evaluator. That's a feature: it forces you to state what "correct" means for this task.
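As an illustration of strictness living in the spec, the sketch below reports unexpected and repeated calls as diagnostics and only fails a trace when a tool is explicitly forbidden. The function name, its arguments, and the `lookup_vendor` tool are invented for the example.

```python
from collections import Counter


def evaluate_with_slack(trace, expected, forbidden=()):
    """Fail only on forbidden tools; report everything else as diagnostics."""
    counts = Counter(step["tool"] for step in trace)
    violations = [f"forbidden tool called: {t}" for t in forbidden if t in counts]
    unexpected = sorted(t for t in counts if t not in expected and t not in forbidden)
    redundant = sum(n - 1 for n in counts.values() if n > 1)  # repeats beyond the first call
    return {
        "passed": not violations,
        "violations": violations,
        "diagnostics": {"unexpected_tools": unexpected, "redundant_calls": redundant},
    }


EXPECTED = ["extract_fields", "validate", "write_back"]
noisy_trace = [
    {"tool": "extract_fields"},
    {"tool": "lookup_vendor"},  # hypothetical extra tool, not in the expected list
    {"tool": "validate"},
    {"tool": "validate"},       # repeated call
    {"tool": "write_back"},
]

lenient = evaluate_with_slack(noisy_trace, EXPECTED)
strict = evaluate_with_slack(noisy_trace, EXPECTED, forbidden=("lookup_vendor",))
```

The same trace passes under the lenient spec and fails under the strict one; the only thing that changed is what the spec author declared unacceptable.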
Deterministic agent evaluation isn't the whole story — the LLM-judge layer above it is real, for the dimensions rules can't reach. But it's the cheaper layer, it's the CI-gateable layer, and for enterprise agents that touch systems of record, it's the layer that catches the failures you can least afford. Do it first, and do it well.
Evaluator (zero deps, deterministic): eval-sanity v0.3
The invoice agent and its traces: onprem-llm-stack/payloads/invoice-agent

