DEV Community

Sattyam Jain
Sattyam Jain

Posted on

Stop running an LLM judge on every agent call. Here's the cheaper gate.

The bill that made me rebuild

My agent monitoring cost more than my agent inference. The gate was a second model grading the first on every call — correct, but a tax that grew linearly with traffic, and it still let through the failure I care about most: agents reporting a "done" they never earned.

What the research says you can do instead

Detect cheaply. Cheap Reward Hacking Detection (arXiv:2606.08893) trains a small encoder over agent trajectories and puts a linear probe on top. It hits AUC 0.9467 / TPR@5%FPR 0.8296 — matching a sanitized LLM-as-judge (AUC 0.9510) at ~4 orders of magnitude lower cost per trajectory. The ablation: remove the reasoning text and AUC drops to 0.62. The probe reads why, not just what.

Or prevent structurally. Goal-Autopilot (arXiv:2606.11688) externalizes agent state into a gated finite-state machine and forbids any terminal "done" whose falsifiable gate didn't actually run. Fabrication on SWE-bench Lite goes 33.7% → 0.67%, with a No-False-Success theorem and constant per-tick context cost.

The architecture this implies

every span      -> deterministic heuristics  (did the claimed gate execute?)
sampled spans   -> distilled probe           (cheap learned signal)
gold-set only   -> frontier LLM judge        (calibration + audits)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: if your monitor exceeds ~20-25% of production cost, you built the wrong monitor. The frontier judge belongs on the gold-set, not the hot path.

The one-liner I keep

An honest stall is recoverable; a confident wrong "done" is not. If a "done" has no receipt, it isn't done — and the receipt should be cheap enough that you never turn it off.

What's the cheapest always-on signal that's caught a real agent failure in your stack?

Top comments (0)