Claims about AI cost control are cheap. "Cut your agent spend by 60%!" is on every landing page. So instead of a claim, here's a benchmark you can run yourself in one command -- and an honest reading of what its number actually means, because the headline percentage is the least interesting part.
The short version: I ran the same looping agent twice -- once unguarded, once behind a hard dollar budget -- against a deterministic provider, and measured the spend. Then I'll show you why the "% saved" framing undersells it, and why a flat ceiling is the number that matters.
The setup
I've written before about why a runaway agent slips past logging, monitoring, and max_tokens: it's not one anomalous call, it's a thousand individually-valid ones, and the only thing that stops it is a deterministic, pre-call, per-run limit. This benchmark measures exactly that limit doing its job.
The harness (benchmark/ in the repo) is built so the only variable between the two runs is whether the budget fired:
- Deterministic provider. A mock of the Chat Completions API returns a fixed token usage (1000 in / 1000 out) on every call. No network variance, no real money, exactly reproducible.
- Real prices, pinned. gpt-4o at its list price ($2.50 / $10.00 per 1M input/output tokens). That makes one call cost 1000·2.50/1e6 + 1000·10.00/1e6 = $0.0125.
- Measured, not modeled. The governed run's spend is read straight from the runtime's own cost ledger (GET /v1/runs/{id} -> usage.dollars), not computed by the benchmark. The runtime meters each call and halts the run before the call that would cross the ceiling.
- Same per-call price on both sides, so the two numbers are directly comparable.
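The per-call figure can be reproduced in a few lines. This is a minimal sketch, not the harness's own code: the names PRICE_PER_MILLION, mock_usage, and call_cost are illustrative assumptions, and only the token counts and list prices come from the setup described above.

```python
from decimal import Decimal

# List prices from the setup above, in dollars per 1M tokens.
# The dict layout is an assumption for illustration, not the repo's pricing file.
PRICE_PER_MILLION = {
    "gpt-4o": {"input": Decimal("2.50"), "output": Decimal("10.00")},
}


def mock_usage():
    """Stand-in for the deterministic provider: identical usage on every call."""
    return {"input_tokens": 1000, "output_tokens": 1000}


def call_cost(model, usage):
    """Dollar cost of one call at the pinned per-token prices."""
    price = PRICE_PER_MILLION[model]
    total = (usage["input_tokens"] * price["input"]
             + usage["output_tokens"] * price["output"])
    return total / Decimal(1_000_000)


print(call_cost("gpt-4o", mock_usage()))  # 0.0125
```

Decimal is used instead of float so that repeated additions of $0.0125 land exactly on the budget boundary rather than a hair above or below it.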
A 50-iteration runaway, with a $0.25 ceiling on the governed run:
RiskKernel cost benchmark -- runaway loop
------------------------------------------------------
  loop length (N)            50
  dollar budget              $0.25
  per-call cost              $0.0125 (gpt-4o, from the ledger)
------------------------------------------------------
                             calls    spend
  baseline (no governance)   50       $0.6250
  governed (RiskKernel)      20       $0.2500
------------------------------------------------------
  dollars saved              $0.3750 (60%)
  stopped by                 dollar_budget_exceeded
20 calls × $0.0125 = exactly $0.25. The 21st call was refused before it left the process. The baseline ran all 50.
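The refusal of the 21st call comes down to a check that runs before each call rather than after it. The sketch below simulates that check under a simplifying assumption: the cost of the next call is known in advance, which holds for the fixed-usage mock. The function run_loop is hypothetical and is not RiskKernel's implementation; only the stop reason string and the numbers are taken from the output above.

```python
from decimal import Decimal


def run_loop(iterations, per_call, budget=None):
    """Simulate a looping agent; return (calls made, dollars spent, stop reason)."""
    spent = Decimal("0")
    calls = 0
    for _ in range(iterations):
        # Pre-call check: refuse the call if it would push spend past the ceiling.
        if budget is not None and spent + per_call > budget:
            return calls, spent, "dollar_budget_exceeded"
        spent += per_call  # the call "happens" and is metered
        calls += 1
    return calls, spent, None  # loop ended on its own


PER_CALL = Decimal("0.0125")

baseline = run_loop(50, PER_CALL)                   # no ceiling: all 50 calls run
governed = run_loop(50, PER_CALL, Decimal("0.25"))  # ceiling: halts after 20 calls

print("baseline:", baseline[0], "calls, $", baseline[1])
print("governed:", governed[0], "calls, $", governed[1], "stopped by", governed[2])
```

Because the comparison is strict (`>`), a call that lands exactly on the ceiling is allowed, which is why the governed run spends the full $0.25 and not $0.2375.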
Why "60%" is the wrong number
Sixty percent looks like the headline. It isn't -- it's an artifact of where I set N. I chose a 50-call loop; the budget caught it at 20. Make the loop longer and the percentage climbs, because the governed spend doesn't move:
| If the runaway loops… | Baseline spend | Governed spend | Saved |
|---|---|---|---|
| 50× | $0.63 | $0.25 | $0.38 (60%) |
| 1,000× | $12.50 | $0.25 | $12.25 (98%) |
| 10,000× | $125.00 | $0.25 | $124.75 (99.8%) |
The governed column is flat. That's the whole point. A runaway loop has no natural stopping condition -- that's what makes it a runaway -- so the baseline grows until a human notices, which in the canonical $47K incident took eleven days. The thing you're buying isn't a percentage discount. It's a number that cannot exceed what you set, no matter how badly the agent misbehaves or how long before anyone looks.
So I distrust "X% cheaper" claims in this space, including ones I could make. The percentage depends entirely on the failure you benchmark against. The honest guarantee is the ceiling: spend is bounded by the budget, full stop.
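The rows of the table follow from the same arithmetic, which makes it easy to see that the percentage is a function of N while the governed spend is not. A small sketch, using illustrative names (savings_row, PER_CALL, BUDGET) and the benchmark's numbers:

```python
from decimal import Decimal

PER_CALL = Decimal("0.0125")  # per-call cost from the benchmark
BUDGET = Decimal("0.25")      # ceiling on the governed run


def savings_row(n):
    """Return (baseline, governed, saved, percent saved) for an n-call runaway."""
    baseline = n * PER_CALL
    # The governed run makes at most as many calls as fit under the ceiling.
    max_calls = int(BUDGET // PER_CALL)
    governed = min(n, max_calls) * PER_CALL
    saved = baseline - governed
    percent = saved / baseline * 100
    return baseline, governed, saved, percent


for n in (50, 1_000, 10_000):
    baseline, governed, saved, percent = savings_row(n)
    print(f"{n:>6,}x  baseline ${baseline:.4f}  governed ${governed:.4f}  "
          f"saved ${saved:.4f} ({percent:.1f}%)")
```

Whatever N you pass in, the second value never exceeds BUDGET; only the first and last values move.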
Why this benchmark is honest (and where it isn't the whole story)
I'd rather you trust the harness than the author, so:
- It's one command, key-free, no real spend: python3 benchmark/benchmark.py. The mock and the pricing file are right there -- inspect them, change them, break them.
- It deliberately removes provider latency and variance to isolate the governance effect. This is a benchmark about dollars, not milliseconds. The enforcement overhead the runtime itself adds is small and belongs in a separate latency benchmark -- I won't smuggle it into this one.
- It measures one dimension: the cost ceiling. The other half of "safe to leave running" is crash-recovery -- kill -9 a long run and resume it without re-spending -- which is demonstrated end-to-end in examples/kill-9-resume, not in this harness. A timed recovery benchmark is next.
The takeaway
If you're evaluating anything that claims to control agent cost, ask it for two things: the harness (so you can reproduce the number) and the ceiling (so you know the worst case, not the average case). A percentage without a reproducible loop length is marketing. A flat, enforced ceiling -- refused pre-call, in compiled code, read back from a ledger -- is an SLA you can reason about.
The runtime is RiskKernel: open-source (Apache-2.0), self-hosted, pip install riskkernel or docker run, one env var in front of an agent you already have. Run the benchmark, then tell me where you'd push on it -- a benchmark only earns trust if people try to break it.