The bill that made me rebuild
My agent monitoring cost more than my agent inference. The gate was a second model grading the first on every call — correct, but a tax that grew linearly with traffic, and it still let through the failure I care about most: agents reporting a "done" they never earned.
What the research says you can do instead
Detect cheaply. Cheap Reward Hacking Detection (arXiv:2606.08893) trains a small encoder over agent trajectories and puts a linear probe on top. It hits AUC 0.9467 / TPR@5%FPR 0.8296 — matching a sanitized LLM-as-judge (AUC 0.9510) at ~4 orders of magnitude lower cost per trajectory. The ablation: remove the reasoning text and AUC drops to 0.62. The probe reads why, not just what.
Or prevent structurally. Goal-Autopilot (arXiv:2606.11688) externalizes agent state into a gated finite-state machine and forbids any terminal "done" whose falsifiable gate didn't actually run. Fabrication on SWE-bench Lite goes 33.7% → 0.67%, with a No-False-Success theorem and constant per-tick context cost.
The architecture this implies
every span -> deterministic heuristics (did the claimed gate execute?)
sampled spans -> distilled probe (cheap learned signal)
gold-set only -> frontier LLM judge (calibration + audits)
Rule of thumb: if your monitor exceeds ~20-25% of production cost, you built the wrong monitor. The frontier judge belongs on the gold-set, not the hot path.
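The first tier (did the claimed gate execute?) as an added example, not code from the original post or from either paper: a deterministic pass over trace spans that flags any terminal "done" whose claimed gate never ran, or last failed, before the claim was made. The span schema (trace, seq, kind, gate, exit_code, claims) and the sample data are constructed assumptions.

```python
from collections import defaultdict

# Constructed sample spans; the field names are assumptions for this example.
SPANS = [
    {"trace": "t1", "seq": 1, "kind": "gate_run", "gate": "unit_tests", "exit_code": 0},
    {"trace": "t1", "seq": 2, "kind": "final", "status": "done", "claims": ["unit_tests"]},
    {"trace": "t2", "seq": 1, "kind": "final", "status": "done", "claims": ["unit_tests"]},
    {"trace": "t3", "seq": 1, "kind": "gate_run", "gate": "unit_tests", "exit_code": 0},
    {"trace": "t3", "seq": 2, "kind": "gate_run", "gate": "unit_tests", "exit_code": 1},
    {"trace": "t3", "seq": 3, "kind": "final", "status": "done", "claims": ["unit_tests"]},
    {"trace": "t4", "seq": 1, "kind": "final", "status": "stalled", "claims": []},
    {"trace": "t5", "seq": 1, "kind": "final", "status": "done", "claims": []},
]

def unreceipted_done(spans):
    """Return {trace: [reasons]} for every terminal "done" lacking a receipt."""
    by_trace = defaultdict(list)
    for s in spans:
        by_trace[s["trace"]].append(s)
    findings = {}
    for trace, items in by_trace.items():
        items.sort(key=lambda s: s["seq"])
        last_run = {}  # gate name -> exit code of its most recent run so far
        for s in items:
            if s["kind"] == "gate_run":
                last_run[s["gate"]] = s["exit_code"]
            elif s["kind"] == "final" and s["status"] == "done":
                reasons = []
                if not s["claims"]:
                    reasons.append("done with no gate claimed")
                for gate in s["claims"]:
                    # Only runs seen before the claim count as a receipt.
                    if gate not in last_run:
                        reasons.append(f"{gate}: never ran before done")
                    elif last_run[gate] != 0:
                        reasons.append(f"{gate}: last run failed")
                if reasons:
                    findings[trace] = reasons
    return findings

if __name__ == "__main__":
    for trace, reasons in unreceipted_done(SPANS).items():
        print(trace, "->", "; ".join(reasons))
```

An honest stall (t4) is not flagged, and a stale pass followed by a failing rerun (t3) does not count as a receipt.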
The one-liner I keep
An honest stall is recoverable; a confident wrong "done" is not. If a "done" has no receipt, it isn't done — and the receipt should be cheap enough that you never turn it off.
What's the cheapest always-on signal that's caught a real agent failure in your stack?