Alexey Spinov

Originally published at finops.spinov.online

Your LLM Judge Costs More Than the Agent. Gate It in 40 Lines.

LLM judge cost is the share of your eval bill spent grading agent output instead of producing it. To control it, run a 40-line offline pre-gate that triages every span with four deterministic rules and escalates only the uncertain tail to the expensive judge. On one 50-span agent run this cut judge cost share from 50% to 16%.

LLM judge cost is the line item nobody puts on the FinOps dashboard. You add an LLM-as-judge to grade every agent span, you sleep better, and three weeks later the eval layer is quietly billing a third of what the agent itself costs. This post measures that share, the part of your bill spent judging instead of doing, with a 40-line offline meter, and shows the one move that drops it from 50% to 16% on the same 50-span agent run.

AI disclosure: I drafted this with an AI writing assistant. The tool, both fixtures, and every number below come from a real local run of judge_gate.py on Python 3.13.5, no network, no API key. I ran it, checked the exit codes, hashed the output twice to confirm it's deterministic, and edited every line myself before publishing.

Here's the sentence that set me off. Sattyam Jain wrote it on Dev.to on June 12, in a post arguing you should stop running an LLM judge on every agent call: "if your monitor exceeds ~20–25% of production cost, you built the wrong monitor." (Dev.to) That's a great rule of thumb. It's also unfalsifiable until you can put a number on your monitor. His post sketches the tiered architecture (cheap deterministic heuristics first, expensive judge last) but ships no code you can run against your own trace. So I wrote the missing 40 lines.

The timing isn't an accident. The token bill is coming due across the whole industry right now. TechCrunch reported on June 5 that "Uber blew through its entire 2026 AI coding budget by April," and that for a Priceline employee "a routine Cursor contract renewal came back 4–5x more expensive." (TechCrunch) Two days earlier the Linux Foundation announced its intent to launch the Tokenomics Foundation (open standards for AI cost management) because, in Jim Zemlin's words, "tokens have become the new unit of technology spend." (Linux Foundation) Everyone's auditing what the agent spends. Almost nobody's auditing what the watchdog spends.

And the watchdog is an LLM call too. You priced the agent. Did you price the thing watching the agent?

TL;DR

  • An LLM judge on every span isn't rigor — it's a second agent you forgot to budget. Price it before it surprises you.
  • judge_gate.py is a 40-line, offline, keyless, zero-network script. Feed it a JSONL trace; four deterministic rules triage each span as OK / BAD / UNCERTAIN, and only UNCERTAIN ones would reach the expensive judge.
  • On a well-instrumented 50-span trace it resolved 68% cheaply and sent only 32% to the judge → 16% judge cost share (exit 0, PASS). On the same agent logged as free text, 100% escalated → 50% cost share (exit 1, FAIL).
  • The judge is never actually called. It's priced via configurable --judge-price and --prod-cost flags. Substitute your own rates; I ship neutral placeholder units.
  • Exit code is a CI gate: 0 if judge cost share ≤ budget (default 0.25), 1 if over, 2 on bad input. Deterministic — byte-identical across runs.

This is the next piece in a series on controlling agents before they execute, not after. The pre-execution gate gates the agent's action. The success gate decides what to verify in a result. This one is a level up the stack: it doesn't gate the agent at all. It gates the judge — and asks how much that judge is allowed to cost.

What "judge cost share" actually means

Here's the failure mode I keep seeing. Someone reads that agents silently fail (true) and bolts on an LLM-as-judge to grade every step. Every span: a second model call, often a frontier model, sometimes with a chunky rubric prompt. It works. It catches things. Then the finance person asks why the eval bill is the same order of magnitude as the agent bill, and the honest answer is "because we run a full second model over every single thing the first one does."

The number that matters is a ratio. Call it judge cost share: the cost of the judging layer divided by the cost of the production run it's judging.

judge_cost_share = (judge_calls × judge_price) / prod_cost

If that's 8%, fine — cheap insurance. If it's 50%, you didn't add a monitor, you added a co-pilot you're paying full freight for and calling overhead. The whole game is shrinking judge_calls: the number of spans that actually need a human-grade judgment, versus the spans a dumb deterministic rule can settle for free.
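
A minimal sketch of that ratio as a function, to make the arithmetic concrete. The function name and signature are illustrative and not lifted from judge_gate.py; the inputs are the same placeholder units used in the run further down, not real prices.

```python
def judge_cost_share(judge_calls: int, judge_price: float, prod_cost: float) -> float:
    """Cost of the judging layer as a fraction of the production run's cost."""
    if prod_cost <= 0:
        raise ValueError("prod_cost must be positive")
    return (judge_calls * judge_price) / prod_cost


# Placeholder units: 1 per judge call, 100 for the whole production run.
print(f"{judge_cost_share(16, 1.0, 100.0):.1%}")  # 16 of 50 spans escalated
print(f"{judge_cost_share(50, 1.0, 100.0):.1%}")  # every span escalated
```

Only judge_calls is under the gate's control; the two prices are inputs you supply.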

Most spans don't need a judge. A tool either got called or it didn't. A JSON output either parses or it doesn't. A 200 with an empty body is wrong no matter how confident the prose around it sounds. You don't need a frontier model to know [] is not a successful invoice send. You need an if statement.

The fix: triage every span, escalate only the uncertain tail

The pre-gate is a function. It looks at one span and returns one of three verdicts:

  • OK — cheaply, provably fine. Don't pay to judge it.
  • BAD — cheaply, provably broken. Don't pay to judge it either; you already know.
  • UNCERTAIN — the cheap rules abstain. This is the only span the expensive judge should ever see.

Four rules carry almost all the weight. They're the deterministic heuristics Sattyam Jain pointed at ("did the claimed gate execute?") turned into code:

  1. Claim-vs-evidence. The span says it called send_email, but tools_called doesn't contain send_email. Claim without evidence → BAD. (This is the same idea as the success gate's middle check, reused here as a free triage rule.)
  2. Output schema. The output isn't even a JSON object — it's a raw string, or it's missing. → BAD.
  3. 200-with-empty-payload. Status says success, body is empty. The classic silent lie. → BAD.
  4. Duplicate retry. This span's argument hash equals the previous span's. A byte-identical retry — the waste-after-failure loop signature. → BAD.

If none of those fire and the span has a clean ok: true + 200, it's OK. Otherwise the rules abstain and it's UNCERTAIN — escalate. Here's the whole triage:

def triage(span):
    """Return (verdict, rule). UNCERTAIN means 'a human-grade LLM judge is needed'."""
    out = span.get("output")
    if not isinstance(out, dict):                      # output not valid JSON object
        return "BAD", "schema:not-an-object"
    if span.get("claimed_tool") and span["claimed_tool"] not in span.get("tools_called", []):
        return "BAD", "claim-without-evidence"         # said it called X, trace has no X
    if span.get("status") == 200 and not out:          # 200 OK with empty payload
        return "BAD", "200-empty-payload"
    if span.get("arg_hash") and span["arg_hash"] == span.get("prev_arg_hash"):
        return "BAD", "duplicate-span"                 # byte-identical retry of prior call
    if out.get("ok") is True and span.get("status") == 200:
        return "OK", "clean-success"                   # explicit ok + 200, no contradiction
    return "UNCERTAIN", "needs-judge"                  # cheap rules abstain -> escalate

That's it. No network, no key, no model. The judge layer is priced, not called: I count the UNCERTAIN spans and multiply by a price you supply on the command line. I refuse to hardcode a vendor rate — those go stale in a month and I'd rather be honestly empty than confidently wrong about someone's bill.
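
To watch each rule fire, the sketch below repeats the triage function from above and runs it over six hand-written spans. The spans are hypothetical, made up for illustration; they are not rows from either fixture.

```python
def triage(span):
    """Return (verdict, rule). Copied from the listing above so this runs on its own."""
    out = span.get("output")
    if not isinstance(out, dict):
        return "BAD", "schema:not-an-object"
    if span.get("claimed_tool") and span["claimed_tool"] not in span.get("tools_called", []):
        return "BAD", "claim-without-evidence"
    if span.get("status") == 200 and not out:
        return "BAD", "200-empty-payload"
    if span.get("arg_hash") and span["arg_hash"] == span.get("prev_arg_hash"):
        return "BAD", "duplicate-span"
    if out.get("ok") is True and span.get("status") == 200:
        return "OK", "clean-success"
    return "UNCERTAIN", "needs-judge"


# Hypothetical spans, one per branch of triage().
examples = {
    "non-object output": {"status": 200, "output": "email sent"},
    "claim without evidence": {"status": 200, "claimed_tool": "send_email",
                               "tools_called": ["search"], "output": {"ok": True}},
    "200 with empty body": {"status": 200, "output": {}},
    "duplicate retry": {"status": 500, "output": {"error": "timeout"},
                        "arg_hash": "a1", "prev_arg_hash": "a1"},
    "clean success": {"status": 200, "claimed_tool": "send_email",
                      "tools_called": ["send_email"], "output": {"ok": True},
                      "arg_hash": "b2", "prev_arg_hash": "a1"},
    "free-text log": {"status": 200, "output": {"text": "email sent"}},
}

for label, span in examples.items():
    print(f"{label:24} -> {triage(span)}")
```

The last span is the interesting one: a free-text log is a valid object with no contradiction and no ok flag, so every rule abstains and it escalates.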

The run: 32% to the judge, not 100%

I built two traces of the same 50-span agent — a support-desk bot doing searches, record updates, email sends, classifications, and reply drafts.

The first, trace_gated.jsonl, is well-instrumented: each span logs the tool it claimed, the tools actually called, a structured output (an ok flag where the verdict is clear-cut, a confidence value or label where it isn't), and an argument hash. The second, trace_naive.jsonl, is the same agent logging only free-text outputs like {"text": "email sent"}, the way a lot of agents actually log in the wild. Same work. Different telemetry.

Here's the verbatim output. I didn't touch it:

$ python3 judge_gate.py fixtures/trace_gated.jsonl --judge-price 1 --prod-cost 100
spans total:        50
resolved by gate:   34 (68.0%)  [OK=29 BAD=5]
sent to LLM judge:  16 (32.0%)
judge cost share:   16.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: PASS - judge layer within budget
$ echo $?
0

$ python3 judge_gate.py fixtures/trace_naive.jsonl --judge-price 1 --prod-cost 100
spans total:        50
resolved by gate:   0 (0.0%)  [OK=0 BAD=0]
sent to LLM judge:  50 (100.0%)
judge cost share:   50.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: FAIL - judge layer over budget
$ echo $?
1

Read the two side by side. Same agent, same fifty spans, same --judge-price 1 --prod-cost 100. The well-instrumented trace sends 16 spans to the judge and lands at 16% cost share: a PASS, exit 0. The free-text trace can't resolve a single span cheaply, sends all 50, and lands at 50%: a FAIL, exit 1, tripping Sattyam Jain's "wrong monitor" line by a mile.
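
The step from verdict counts to exit code can be sketched in a few lines. This is an illustration under stated assumptions, not the body of judge_gate.py: the summarize name and its return shape are hypothetical, and the exit-2 path for bad input is left out. The verdict counts are the ones printed in the two runs above.

```python
from collections import Counter


def summarize(verdicts, judge_price, prod_cost, budget=0.25):
    """Return (judge_calls, cost_share, exit_code) for a list of triage verdicts."""
    counts = Counter(verdicts)
    judge_calls = counts["UNCERTAIN"]            # only abstentions reach the judge
    share = judge_calls * judge_price / prod_cost
    exit_code = 0 if share <= budget else 1      # 0 = within budget, 1 = over
    return judge_calls, share, exit_code


# Verdict mixes taken from the run output above.
gated = ["OK"] * 29 + ["BAD"] * 5 + ["UNCERTAIN"] * 16
naive = ["UNCERTAIN"] * 50

print(summarize(gated, judge_price=1, prod_cost=100))
print(summarize(naive, judge_price=1, prod_cost=100))
```

BAD spans count as resolved here, the same as OK ones: the gate already knows they are broken, so they cost nothing to judge.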

The lever isn't a fancier judge. It's whether your trace carries the four cheap facts a rule can read. Of the 16 spans that did escalate in the gated run, most are genuinely subjective: ambiguous contract summaries (confidence: 0.45), hedged reply drafts ("I cannot find the order, but it is probably fine."), borderline intent labels. A handful escalate for a humbler reason — they carry no ok flag for a cheap rule to confirm, so the gate abstains instead of guessing. Either way, that's the tail you want a human-grade judge on. The other 34? Five were provably broken (one duplicate retry, two claims with no matching tool call, one 200 with an empty body, one non-object output) and the rest were clean successes. None of those needed a model to adjudicate.

I want to be precise about a number I almost fudged. The cost figures are placeholder units (judge_price=1, prod_cost=100). I am not telling you a judge call costs a dollar or that your run costs a hundred of anything. Plug in your real per-call judge price and your real run cost. The rate, 32% vs 100% of spans escalating, is the part that's mine: measured, reproducible. The dollars are yours.
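
Once you have your own prices, the formula can be turned around to ask how many judge calls a budget allows. The helper below is a sketch derived from the cost-share ratio by simple algebra; max_judge_calls is a hypothetical name and is not part of judge_gate.py.

```python
def max_judge_calls(budget: float, judge_price: float, prod_cost: float) -> int:
    """Largest number of judge calls whose cost share stays at or under budget."""
    if judge_price <= 0:
        raise ValueError("judge_price must be positive")
    # Small tolerance so float noise (0.29 * 100 is 28.999999999999996)
    # does not silently drop one affordable call.
    return int(budget * prod_cost / judge_price + 1e-9)


# Placeholder units again: budget 25%, 1 per judge call, 100 per run.
print(max_judge_calls(0.25, 1, 100))
```

With the placeholder units that comes to 25 calls, so a 50-span run passes as long as no more than half the spans escalate. Double the judge price and the allowance halves.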

Am I just moving the bug into the gate?

Fair objection, and it's the one I'd raise. If the cheap rules are wrong, you've replaced a 50% judge bill with a 16% bill and a stack of bad verdicts. So: how good can a cheap layer actually be?

Two recent papers say: surprisingly good, on the parts that matter. In Cheap Reward Hacking Detection (arXiv:2606.08893, June 8), Belenky, Itria and Johns put a linear probe on a small transformer encoder and detected reward hacking at AUC 0.9467, TPR 0.8296 at 5% FPR, at "roughly four orders of magnitude lower per-trajectory cost" than an LLM-as-judge baseline. And Goal-Autopilot (arXiv:2606.11688) reports a gated finite-state machine that "forbids any terminal 'done' claim whose falsifiable gate did not actually execute and pass," cutting fabrication on SWE-bench Lite from 33.7% to 0.67%. Those are their numbers on their setups, not mine. I'm citing them as evidence that a cheap deterministic layer catches most of what a dear one catches, not as my own result.

My four if statements are cruder than a trained probe. They don't need to be clever. They need to be right when they're confident and silent when they're not — which is the whole point of the UNCERTAIN bucket. A rule that isn't sure doesn't guess. It escalates. The judge still grades the hard 32%. You just stopped paying it to rubber-stamp the easy 68%.

What this is not

  • Not an eval suite. It doesn't score answer quality. It decides which spans deserve a judge, then prices that layer. Correctness of the hard tail is still the judge's job.
  • Not a runtime cap. It reads a finished trace and fails CI. If you need to block a runaway loop mid-flight, that's a sliding-window spend guard, a different tool.
  • Not a verdict on confidence fields. Honest limitation: my gate ignores a span's self-reported confidence. One span in the fixture says confidence: 0.95, "no ambiguity" and still got escalated, because I refuse to trust a model's own confidence as a cheap signal — that's the kind of self-assessment that lies. If you trust yours, add a fifth rule. I didn't.
  • Not a license to skip the judge. The judge gets the genuinely uncertain spans. The argument is against running it on the obvious ones, not against running it at all.
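
For anyone who does want the fifth rule mentioned in the confidence bullet, here is one way it might look. This is a sketch, not something the gate ships: with_confidence_rule, the 0.9 threshold, and the assumption that confidence lives inside output are all illustrative. It wraps an existing triage function so the four original rules keep priority.

```python
def with_confidence_rule(base_triage, threshold=0.9):
    """Wrap a triage function with an optional rule on self-reported confidence."""
    def triage(span):
        verdict, rule = base_triage(span)
        if verdict != "UNCERTAIN":
            return verdict, rule                  # the original rules win
        out = span.get("output")
        conf = out.get("confidence") if isinstance(out, dict) else None
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if is_number and conf >= threshold:
            return "OK", "self-reported-confidence"
        return verdict, rule                      # still abstain
    return triage


def always_uncertain(span):
    """Stand-in for the real triage function, used only for this demo."""
    return "UNCERTAIN", "needs-judge"


gate = with_confidence_rule(always_uncertain, threshold=0.9)
print(gate({"output": {"confidence": 0.95}}))
print(gate({"output": {"confidence": 0.45}}))
```

The trade is explicit: every span this rule clears is one the judge never sees, so a model that overstates its confidence turns directly into missed failures.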

Run it on your own trace

Export 40–60 spans of a real agent run to JSONL with six fields per span (status, claimed_tool, tools_called, output, arg_hash, and prev_arg_hash carrying the previous span's hash so the duplicate-retry rule can fire), point judge_gate.py at it, and pass your real --judge-price and --prod-cost. If your judge cost share comes back under 10%, ignore me; your monitor's fine. If it comes back at 40%, you've found a line item.
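
If your tracing layer does not already emit those six fields, a small export step can derive them. The sketch below assumes each raw span is a dict with an args field holding the tool arguments; that field name, the to_gate_spans helper, and the choice to hash the tool name together with the arguments are assumptions for illustration, not the format judge_gate.py requires.

```python
import hashlib
import json


def to_gate_spans(raw_spans):
    """Yield spans with the six fields the gate reads, chaining prev_arg_hash."""
    prev_hash = None
    for raw in raw_spans:
        # Canonical JSON (sorted keys) so identical calls always hash identically.
        key = json.dumps(
            {"tool": raw.get("claimed_tool"), "args": raw.get("args", {})},
            sort_keys=True,
        )
        arg_hash = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
        yield {
            "status": raw.get("status"),
            "claimed_tool": raw.get("claimed_tool"),
            "tools_called": raw.get("tools_called", []),
            "output": raw.get("output"),
            "arg_hash": arg_hash,
            "prev_arg_hash": prev_hash,   # None for the first span
        }
        prev_hash = arg_hash


# Hypothetical raw spans: the second is a byte-identical retry of the first.
raw = [
    {"status": 500, "claimed_tool": "search", "tools_called": ["search"],
     "args": {"q": "order 42"}, "output": {"error": "timeout"}},
    {"status": 500, "claimed_tool": "search", "tools_called": ["search"],
     "args": {"q": "order 42"}, "output": {"error": "timeout"}},
]

for span in to_gate_spans(raw):
    print(json.dumps(span))   # one JSON object per line, i.e. JSONL
```

Redirect the printed lines to a file and that file is the trace to pass to judge_gate.py.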

One thing I genuinely don't know yet and would put real money on being argued in the comments: where the honest threshold is. Sattyam Jain says 20–25%. I shipped a default of 25%. But for a low-stakes summarizer, even 10% might be waste, and for an agent that moves money, maybe 40% is cheap. The budget is a --flag precisely because I don't think there's one right answer.

So I'll ask you: what's the judge cost share on a real eval pipeline you've shipped — and where would you set the budget before it counts as the wrong monitor?


I publish one runnable FinOps tool for AI agents at a time, with the real run log attached. Follow for the next number from the next trace — and drop your judge cost share in the comments, I read every one.
