Benchmarking a kill switch for runaway AI agents -- and why the real number is a ceiling, not a %

#ai #llm #opensource #devops

Claims about AI cost control are cheap. "Cut your agent spend by 60%!" is on every landing page. So instead of a claim, here's a benchmark you can run yourself in one command -- and an honest reading of what its number actually means, because the headline percentage is the least interesting part.

The short version: I ran the same looping agent twice -- once unguarded, once behind a hard dollar budget -- against a deterministic provider, and measured the spend. After the numbers, I'll show why the "% saved" framing undersells the result, and why a flat ceiling is the number that matters.

The setup

I wrote about why a runaway agent slips past logging, monitoring, and max_tokens: it's not one anomalous call, it's a thousand individually-valid ones, and the only thing that stops it is a deterministic, pre-call, per-run limit. This benchmark measures exactly that limit doing its job.

The harness (benchmark/ in the repo) is built so the only variable between the two runs is whether a dollar budget is enforced:

Deterministic provider. A mock of the Chat Completions API returns a fixed token usage (1000 in / 1000 out) on every call. No network variance, no real money, exactly reproducible.
Real prices, pinned. gpt-4o at its list price ($2.50 / $10.00 per 1M input/output tokens). That makes one call cost 1000·2.50/1e6 + 1000·10.00/1e6 = $0.0125.
Measured, not modeled. The governed run's spend is read straight from the runtime's own cost ledger (GET /v1/runs/{id} -> usage.dollars), not computed by the benchmark. The runtime meters each call and halts the run before the call that would cross the ceiling.
Same per-call price on both sides, so the two numbers are directly comparable.
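The per-call figure is easy to check by hand. Here is a minimal sketch of that arithmetic; the function and constant names are mine, not the harness's, and it uses Decimal so the result is exact rather than a float approximation:

```python
from decimal import Decimal

# Pinned list prices from the setup above, in dollars per 1M tokens.
INPUT_PRICE_PER_M = Decimal("2.50")
OUTPUT_PRICE_PER_M = Decimal("10.00")


def call_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one call at the pinned prices (illustrative helper)."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / Decimal(1_000_000)


# The mock returns 1000 tokens in and 1000 tokens out on every call.
per_call = call_cost(1000, 1000)
print(per_call)  # 0.0125
```

Because the mock returns the same usage every time, this one number is the price of every call in both runs.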

A 50-iteration runaway, with a $0.25 ceiling on the governed run:

  RiskKernel cost benchmark -- runaway loop
  ------------------------------------------------------
  loop length (N)            50
  dollar budget              $0.25
  per-call cost              $0.0125   (gpt-4o, from the ledger)
  ------------------------------------------------------
                            calls        spend
  baseline (no governance)     50      $0.6250
  governed (RiskKernel)        20      $0.2500
  ------------------------------------------------------
  dollars saved              $0.3750   (60%)
  stopped by                 dollar_budget_exceeded

20 calls × $0.0125 = exactly $0.25. The 21st call was refused before it left the process. The baseline ran all 50.
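To make the stopping rule concrete, here is a small simulation of a pre-call budget check. It is a sketch of the idea, not RiskKernel's implementation: run_loop and the "loop_finished" label are hypothetical, and it assumes the cost of the next call is known up front, which holds here only because the mock returns fixed usage. Amounts are integer micro-dollars so twenty additions of $0.0125 land on exactly $0.25.

```python
from typing import Optional, Tuple

MICRO = 1_000_000  # micro-dollars per dollar; integers avoid float drift


def run_loop(n_calls: int, cost_micro: int,
             budget_micro: Optional[int] = None) -> Tuple[int, int, str]:
    """Simulate a looping agent; return (calls_made, spend_micro, stop_reason)."""
    spent = 0
    for made in range(n_calls):
        # Pre-call check: refuse the call that would push spend past the ceiling.
        if budget_micro is not None and spent + cost_micro > budget_micro:
            return made, spent, "dollar_budget_exceeded"
        spent += cost_micro  # the call goes through and is metered
    return n_calls, spent, "loop_finished"  # hypothetical label for a normal exit


COST = 12_500     # $0.0125 per call
BUDGET = 250_000  # $0.25 ceiling

baseline = run_loop(50, COST)
governed = run_loop(50, COST, BUDGET)
print(baseline[0], baseline[1] / MICRO)
print(governed[0], governed[1] / MICRO, governed[2])
```

The check runs before the spend is added, which is why the governed run stops at 20 calls with the ledger at the ceiling rather than one call over it.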

Why "60%" is the wrong number

Sixty percent looks like the headline. It isn't -- it's an artifact of where I set N. I chose a 50-call loop; the budget caught it at 20. Make the loop longer and the percentage climbs, because the governed spend doesn't move:

If the runaway loops…	Baseline spend	Governed spend	Saved
50×	$0.625	$0.25	$0.375 (60%)
1,000×	$12.50	$0.25	$12.25 (98%)
10,000×	$125.00	$0.25	$124.75 (99.8%)
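The table rows follow from a closed form: the governed run makes at most floor(budget / cost) calls, so its spend is capped however large N gets. A short sketch that recomputes the rows under that assumption (names are illustrative; amounts are integer micro-dollars to keep the arithmetic exact):

```python
COST_MICRO = 12_500      # $0.0125 per call
BUDGET_MICRO = 250_000   # $0.25 ceiling


def savings_row(n_calls: int):
    """Return (baseline, governed, saved, percent_saved) for a loop of n_calls."""
    baseline = n_calls * COST_MICRO
    # The ceiling admits at most budget // cost calls, regardless of n_calls.
    governed = min(n_calls, BUDGET_MICRO // COST_MICRO) * COST_MICRO
    saved = baseline - governed
    return baseline, governed, saved, 100 * saved / baseline


for n in (50, 1_000, 10_000):
    baseline, governed, saved, pct = savings_row(n)
    print(f"{n:>6}x  ${baseline / 1e6:>8.3f}  ${governed / 1e6:.2f}  "
          f"${saved / 1e6:>8.3f}  ({pct:.1f}%)")
```

Only the baseline and the percentage depend on N; the governed value is the same in every row.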

The governed column is flat. That's the whole point. A runaway loop has no natural stopping condition -- that's what makes it a runaway -- so the baseline grows until a human notices, which in the canonical $47K incident took eleven days. The thing you're buying isn't a percentage discount. It's a number that cannot exceed what you set, no matter how badly the agent misbehaves or how long before anyone looks.

So I distrust "X% cheaper" claims in this space, including ones I could make. The percentage depends entirely on the failure you benchmark against. The honest guarantee is the ceiling: spend is bounded by the budget, full stop.

Why this benchmark is honest (and where it isn't the whole story)

I'd rather you trust the harness than the author, so:

It's one command, key-free, no real spend: python3 benchmark/benchmark.py. The mock and the pricing file are right there -- inspect them, change them, break them.
It deliberately removes provider latency and variance to isolate the governance effect. This is a benchmark about dollars, not milliseconds. The enforcement overhead the runtime itself adds is small and belongs in a separate latency benchmark -- I won't smuggle it into this one.
It measures one dimension: the cost ceiling. The other half of "safe to leave running" is crash-recovery -- kill -9 a long run and resume it without re-spending -- which is demonstrated end-to-end in examples/kill-9-resume, not in this harness. A timed recovery benchmark is next.

The takeaway

If you're evaluating anything that claims to control agent cost, ask it for two things: the harness (so you can reproduce the number) and the ceiling (so you know the worst case, not the average case). A percentage without a reproducible loop length is marketing. A flat, enforced ceiling -- refused pre-call, in compiled code, read back from a ledger -- is a guarantee you can reason about.

The runtime is RiskKernel: open-source (Apache-2.0) and self-hosted. Install it with pip install riskkernel or docker run, then put it in front of an agent you already have with one env var. Run the benchmark, then tell me where you'd push on it -- a benchmark only earns trust if people try to break it.