Stop running an LLM judge on every agent call. Here's the cheaper gate.

#ai #machinelearning #llmops #python

The bill that made me rebuild

My agent monitoring cost more than my agent inference. The gate was a second model grading the first on every call — correct, but a tax that grew linearly with traffic, and it still let through the failure I care about most: agents reporting a "done" they never earned.

What the research says you can do instead

Detect cheaply. Cheap Reward Hacking Detection (arXiv:2606.08893) trains a small encoder over agent trajectories and puts a linear probe on top. It hits AUC 0.9467 / TPR@5%FPR 0.8296 — matching a sanitized LLM-as-judge (AUC 0.9510) at ~4 orders of magnitude lower cost per trajectory. The ablation: remove the reasoning text and AUC drops to 0.62. The probe reads why, not just what.

Or prevent structurally. Goal-Autopilot (arXiv:2606.11688) externalizes agent state into a gated finite-state machine and forbids any terminal "done" whose falsifiable gate didn't actually run. Fabrication on SWE-bench Lite goes 33.7% → 0.67%, with a No-False-Success theorem and constant per-tick context cost.

The architecture this implies

every span      -> deterministic heuristics  (did the claimed gate execute?)
sampled spans   -> distilled probe           (cheap learned signal)
gold-set only   -> frontier LLM judge        (calibration + audits)

Rule of thumb: if your monitor exceeds ~20-25% of production cost, you built the wrong monitor. The frontier judge belongs on the gold-set, not the hot path.

The one-liner I keep

An honest stall is recoverable; a confident wrong "done" is not. If a "done" has no receipt, it isn't done — and the receipt should be cheap enough that you never turn it off.

What's the cheapest always-on signal that's caught a real agent failure in your stack?