Alexey Spinov

Posted on Jun 13 • Originally published at finops.spinov.online

Sliding-Window Spend Guard: the $47K Loop Per-Call Caps Miss

#finops #ai #python #reliability

Sliding-Window Spend Guard for AI Agents: Catch the $47K Loop Per-Call Caps Miss

A sliding-window spend guard sums what your agent has spent over the last N minutes and refuses the next call before it dispatches — which is the thing a per-call cap can't do. A per-call cap asks "is this one call too expensive?" The runaway loops that empty a budget are built from calls that each pass that check. The damage lives in the sum, not in any single call.

In short: a sliding-window spend guard tracks a trailing window of calls and blocks the next one when cumulative spend or a repeated near-identical call breaches a per-window rule. In my run it stopped an Analyzer-Verifier ping-pong at call 12, $45.80 in, after a naive per-call $5 cap let all 12 through. Stdlib, keyless, runs in seconds.

AI disclosure: I wrote window_guard.py with AI assistance and ran it myself before publishing. Every number in the output blocks below is pasted from a real run of that script on a fixture I'll show you. The $47K incident is someone else's, and I link the postmortem next to it. I label which is which.

A $47K agent loop where every single call was fine

In November 2025 a team woke up to a $47,000 bill from a single agent deployment. Four LangChain agents, talking to each other over A2A, and two of them — an Analyzer and a Verifier — got into a ping-pong. Analyzer hands work to Verifier, Verifier kicks it back, repeat. For 264 hours.

The cost didn't spike. It escalated, week over week: $127, then $891, then $6,240, then $18,400. The author of the postmortem, Gabriel Anhaia, describes the root cause in a way I keep coming back to: the dashboard was green for eleven days, and there was no step cap, no per-conversation USD budget, no orchestrator deciding when the work was done (dev.to/gabrielanhaia, Nov 2025). The dashboard showed the number. It just showed it after each call, never before the next one. A follow-up teardown by the Waxell team sharpened the line into the title of their piece: token-budget alerts aren't budget enforcement (dev.to/waxell, 2025).

Here's the part that matters for your code, not your nerves. Not one of those calls was a runaway on its own. Each Analyzer→Verifier round was a cheap, valid, well-formed request. A per-call spend cap — the thing most people reach for first — would have waved through every last one of them, because no individual call was expensive. The loop wasn't a bug in any call. It was the accumulation of perfectly fine calls that no per-call check is built to see.

I've shipped four guards on this blog and all four share that blind spot. SpendGuard caps the cost of one call. TxCanary screens one transaction. The pre-execution gate refuses one bad action. Per-call, per-call, per-call. Useful — and structurally blind to the window. So I built the missing layer and ran it against the exact pattern that cost that team $47K.

What this is, in one paragraph

window_guard.py keeps a trailing window of your agent's recent calls — collections.deque, a timestamp per call — and before it lets the next call dispatch, it checks three things over that window: total spend, how many times a near-identical call has repeated, and how often a named side effect (a refund, a charge) has fired. Trip any one and the call is refused before it runs, not logged after. Stdlib only. No keys, no network, no money moves. The clock is injectable, so a fixture run is byte-for-byte reproducible — same point most of my tools make, because a number you can't reproduce is a number you shouldn't quote.
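The trailing window and the injectable clock described above could look roughly like the sketch below. It is not the code from window_guard.py: the names `TrailingWindow`, `Event`, and `FakeClock` are made up for illustration, and evicting a call that is exactly `window` seconds old is an assumption about the boundary.

```python
import time
from collections import deque, namedtuple

# Hypothetical record type; the real script's fields may differ.
Event = namedtuple("Event", "ts cost")


class TrailingWindow:
    """Keeps only the calls made in the last `window` seconds."""

    def __init__(self, window, clock=time.monotonic):
        self.window = window
        self.clock = clock        # injectable: pass a fake clock in fixture runs
        self._events = deque()    # oldest event on the left

    def _evict(self, now):
        # Drop events that have aged out of the trailing window.
        while self._events and now - self._events[0].ts >= self.window:
            self._events.popleft()

    def record(self, cost):
        now = self.clock()
        self._evict(now)
        self._events.append(Event(now, cost))

    def spend(self):
        self._evict(self.clock())
        return sum(e.cost for e in self._events)


class FakeClock:
    """Deterministic clock so a replay gives the same numbers every time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
w = TrailingWindow(window=3600, clock=clock)
w.record(4.0)            # t = 0s
clock.now = 1800
w.record(2.5)            # t = 1800s
spend_mid = w.spend()    # both calls are still inside the hour
clock.now = 3600
spend_late = w.spend()   # the t = 0s call has aged out
```

Because the oldest event always sits at the left end of the deque, eviction is a cheap pop from one side rather than a scan of the whole history.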

This is not a server. It's not a gateway you have to adopt. Cloudflare and others now sell per-window spend limits at the proxy layer, and those are good if you're already on their infra. This is forty-odd lines you paste into the agent loop you already have. They do their own arithmetic, in your process, before the request leaves. No vendor.

The guard, end to end

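The whole decision lives in one method. Evict what's aged out of the window, then check the rules against what's left plus the call you're about to admit. The excerpt shows Rules A and B; Rule C, the side-effect cap, runs in the same method and is left out here: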

def _check(self, now, cost, fp, sideeffect):
    """Raise WindowExceeded if admitting (cost, fp, sideeffect) at `now`
    would breach a rule. Runs BEFORE the call is recorded or executed."""
    self._evict(now)

    # Rule A: cumulative spend over the window (the call itself included).
    projected = sum(e.cost for e in self._events) + cost
    if projected > self.spend_cap:
        raise WindowExceeded("cumulative_spend",
            f"${projected:,.2f} over last {self.window:.0f}s would exceed "
            f"cap ${self.spend_cap:,.2f} ({len(self._events) + 1} calls in window)")

    # Rule B: same semantic call repeated too often inside the window.
    if fp:
        repeats = sum(1 for e in self._events if e.fp == fp) + 1
        if repeats >= self.loop_threshold:
            tool = fp.split("|", 1)[0]
            raise WindowExceeded("loop_repeat",
                f"'{tool}' fired {repeats}x in {self.window:.0f}s "
                f"(threshold {self.loop_threshold}); near-identical args")

Rule A is the $47K rule: sum the window, including the call you're weighing, and if it crosses the cap, stop. Rule B catches the loop a spend cap might miss when the calls are nearly free — an ask_clarification that fires forever and burns five cents a turn.

The loop check has one detail that earns its keep. It doesn't compare calls for exact equality. It fingerprints them, stripping volatile churn fields first:

import json

def _fingerprint(tool, args, ignore=("nonce", "rev", "id", "ts", "request_id")):
    """Two calls that differ only by an incrementing nonce or revision get the
    SAME fingerprint, which is exactly what a strict equality check misses."""
    stable = {k: v for k, v in (args or {}).items() if k not in ignore}
    return tool + "|" + json.dumps(stable, sort_keys=True, ensure_ascii=False)

A stuck agent rarely repeats itself byte-for-byte. It bumps a retry counter, a timestamp, a request id. A naive equality check sees two different calls and lets the loop run. Strip the churn, and the loop shows its real shape.
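To see the difference between strict equality and the fingerprint, here is a small self-contained check. The `_fingerprint` function is copied from the excerpt above; the three argument dicts are invented for illustration and are not taken from the fixture.

```python
import json


def _fingerprint(tool, args, ignore=("nonce", "rev", "id", "ts", "request_id")):
    stable = {k: v for k, v in (args or {}).items() if k not in ignore}
    return tool + "|" + json.dumps(stable, sort_keys=True, ensure_ascii=False)


# Hypothetical retries of one stuck call: only the churn fields change.
first = {"question": "which region?", "nonce": 1, "ts": 1700000000}
retry = {"question": "which region?", "nonce": 2, "ts": 1700000014}
# A genuinely different call: the stable field changes.
other = {"question": "which currency?", "nonce": 3, "ts": 1700000030}

same_under_equality = first == retry            # False: nonce and ts differ
fp_first = _fingerprint("ask_clarification", first)
fp_retry = _fingerprint("ask_clarification", retry)
fp_other = _fingerprint("ask_clarification", other)
```

`fp_first` and `fp_retry` come out identical, while `fp_other` differs, so a repeat counter keyed on the fingerprint counts the retries together without lumping in unrelated calls to the same tool.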

On a real call site you don't touch any of that. You wrap the function:

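guard = WindowGuard(window_seconds=3600, spend_cap_usd=50)

# cost reads the dollar figure off the call's result; fingerprint names the
# tool and the argument that identifies a repeat. q must be passed by keyword
# because the fingerprint lambda reads k["q"].
@guard.wrap(cost=lambda r: r["usd"],
            fingerprint=lambda *a, **k: ("search", k["q"]))
def call_tool(*, q):
    ...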

Running it: three scenarios, one verdict

The demo replays three fixtures through a fresh guard. Same policy for all three: a one-hour window, a $50 cap, a loop threshold of 8, and a refund side-effect cap of 5.

python3 window_guard.py --fixture window_fixture.json

The header that prints first:

window_guard.py  -  sliding-window spend & loop guard
policy: window=3600s  spend_cap=$50  loop_threshold=8  sideeffect_cap={'refund': 5}

Scenario 1 — a normal workload passes

First, the case that should sail through, so you trust the brake isn't just trigger-happy. Nine mixed calls — searches, page reads, a couple of drafts — over an hour:

SCENARIO: PASS  (normal mixed workload, never breaches the window)
  call  1 t=    0s  search_web         $0.42  (window spend so far $0.42)
  call  4 t=  600s  summarize          $1.10  (window spend so far $2.11)
  call  7 t= 2100s  draft_section      $2.20  (window spend so far $5.04)
  call  9 t= 3300s  summarize          $1.30  (window spend so far $8.39)
  => PASS. 9 calls admitted, $8.39 total, no window breached.
  (for contrast: a naive per-call $5 cap would block 0 of these 9 calls.)

Nine calls, $8.39, nothing blocked. Note the contrast line, because it runs under every scenario: the per-call $5 cap also blocks 0 here. On a healthy workload the two guards agree. That's the point — the window guard isn't stricter, it's watching a different axis. They only disagree when the axis matters.

Scenario 2 — the $47K loop, in miniature

Now the Analyzer↔Verifier ping-pong. Each call costs about $4: cheap, valid, the kind a per-call cap nods through. Watch the "window spend so far" column climb:

SCENARIO: BLOCK-cumulative  (Analyzer<->Verifier ping-pong, each call cheap, sum runs hot)
  call  1 t=    0s  analyze_section    $4.10  (window spend so far $4.10)
  call  5 t=  280s  analyze_section    $4.40  (window spend so far $20.65)
  call  9 t=  600s  analyze_section    $4.20  (window spend so far $37.35)
  call 11 t=  780s  analyze_section    $4.10  (window spend so far $45.80)
  call 12 t=  870s  verify_section     BLOCKED -> [cumulative_spend] $50.05 over last 3600s would exceed cap $50.00 (12 calls in window)
  => BLOCKED at call 12 on rule 'cumulative_spend'. $45.80 spent before the brake hit; call 12 never dispatched.
  (for contrast: a naive per-call $5 cap would block 0 of these 12 calls.)

There it is. The guard does the addition the dashboard does — but it does it before call 12, sees the window would land at $50.05, and refuses. Call 12 never goes out. $45.80 spent, then a hard stop. The per-call $5 cap, on the identical twelve calls? Blocks zero. Every call is under five dollars. That's the $47K loop reproduced in a teacup, and the difference between watching it and stopping it is one sum() evaluated at the right moment.

One honest caveat about that "blocks zero," because it's the obvious objection and I'd rather raise it than have you. The $5 figure is the demo's default, and these calls top out at $4.40, so of course $5 waves them through. Drop the per-call cap to $4 and it would trip this loop on call 1, before a cent is spent, because the first call already costs $4.10. So why isn't a tight per-call cap the answer? Because to catch a $4 loop you'd set the cap near $4, and now every legitimately expensive call (a deep research step, a long synthesis) gets refused too. You'd be tuning a single threshold to sit below your normal call cost, which breaks normal work to catch abnormal accumulation. The window guard doesn't win because it stops sooner (a tight per-call cap stops sooner and cheaper here). It wins because it watches the sum, so it can stay loose on any single call and still see the pile. Different axis, not a stricter one.
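The trade-off in that caveat can be put into a few lines. This is a sketch with made-up costs, not the fixture: a loop of twelve $4.20 calls and one legitimate $4.80 call stand in for the real data, and both helper functions are hypothetical.

```python
def first_block_per_call(costs, cap):
    """1-based index of the first call a per-call cap refuses, or None."""
    for i, cost in enumerate(costs, start=1):
        if cost > cap:
            return i
    return None


def first_block_window(costs, cap):
    """1-based index of the first call a cumulative cap refuses, or None.
    Assumes every call falls inside one window, so nothing is evicted."""
    total = 0.0
    for i, cost in enumerate(costs, start=1):
        if total + cost > cap:
            return i
        total += cost
    return None


# Illustrative costs, NOT the article's fixture.
loop = [4.20] * 12        # cheap calls that pile up
research = [4.80]         # one legitimate, pricier call

loose_cap_on_loop = first_block_per_call(loop, cap=5.00)
tight_cap_on_loop = first_block_per_call(loop, cap=4.00)
tight_cap_on_research = first_block_per_call(research, cap=4.00)
window_on_loop = first_block_window(loop, cap=50.00)
window_on_research = first_block_window(research, cap=50.00)
```

Under these assumptions the loose per-call cap never fires, and the tight one fires on the loop and on the legitimate call alike. The window check refuses call 12 of the loop and leaves the single research call alone.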

I want to be honest about the brake distance: it doesn't stop at zero dollars. It stops at $45.80, because the window has to fill to within one call of the cap before the next call can breach it. A per-window guard bounds your blast radius to roughly one window of budget. It is not a circuit breaker that fires on call one. If you need a tighter bound, lower the cap. Shrinking the window helps only if the cap shrinks with it, because the same cap over a shorter window admits a higher spend rate. That's the knob.
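A quick way to get a feel for the brake distance is to replay a steady loop through a cumulative-spend window under different policies. The sketch below is an illustration, not part of window_guard.py: the function name, the $4.25 cost, and the 80-second gap are all assumptions, and amounts are in integer cents to keep the arithmetic exact.

```python
from collections import deque


def spend_before_brake(cost_cents, gap_s, window_s, cap_cents, max_calls=1000):
    """Replay a steady loop (one call every `gap_s` seconds) through a
    cumulative-spend window. Returns (blocked_call, cents_spent);
    blocked_call is None if the loop never trips within `max_calls`."""
    events = deque()              # (timestamp, cost) pairs, oldest first
    spent = 0
    for n in range(1, max_calls + 1):
        now = (n - 1) * gap_s
        while events and now - events[0][0] >= window_s:
            events.popleft()      # aged out of the trailing window
        if sum(c for _, c in events) + cost_cents > cap_cents:
            return n, spent       # refused before dispatch
        events.append((now, cost_cents))
        spent += cost_cents
    return None, spent


# Hypothetical loop: $4.25 per call, one call every 80 seconds.
demo_policy = spend_before_brake(425, 80, window_s=3600, cap_cents=5000)
lower_cap = spend_before_brake(425, 80, window_s=3600, cap_cents=2000)
short_window = spend_before_brake(425, 80, window_s=600, cap_cents=5000)
```

With these numbers the demo policy refuses call 12 after $46.75, and a $20 cap refuses call 5 after $17.00. The 600-second window with the $50 cap never trips, because at most eight calls ($34.00) fit inside it at once. That suggests the cap, or the ratio of cap to window, is what tightens the bound; a shorter window on its own loosens it.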

Scenario 3 — a loop that's nearly free

The last one is the loop a spend cap is worst at: a clarification call that costs a nickel and never resolves. The args change every time, but only in a churn field, so the fingerprint stays the same:

SCENARIO: BLOCK-loop  (same semantic call repeated; args differ only by a churn field)
  call  1 t=    0s  ask_clarification  $0.05  (window spend so far $0.05)
  call  7 t=   80s  ask_clarification  $0.05  (window spend so far $0.37)
  call  8 t=   94s  ask_clarification  BLOCKED -> [loop_repeat] 'ask_clarification' fired 8x in 3600s (threshold 8); near-identical args
  => BLOCKED at call 8 on rule 'loop_repeat'. $0.37 spent before the brake hit; call 8 never dispatched.
  (for contrast: a naive per-call $5 cap would block 0 of these 8 calls.)

Thirty-seven cents in, the guard sees the same call fired eight times and stops it. A spend cap would never catch this — at a nickel a turn you could run it ten thousand times before any budget alarm twitched. By then it's not a nickel anymore. This is the muggleai class of failure, where the agent guesses and the meter just runs.
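A back-of-the-envelope check shows why the spend rule alone is a poor fit for this loop. The figures are assumptions for illustration: a flat 5 cents per call, one call every 13 seconds (roughly the pacing in the output above), and the demo policy. The helper name is hypothetical.

```python
def peak_window_calls(gap_s, window_s):
    """How many calls of a steady loop sit in one trailing window, counting
    the call being weighed. Assumes a call exactly `window_s` old is evicted."""
    return (window_s - 1) // gap_s + 1


# Assumed values, not read from the fixture.
COST_CENTS, GAP_S = 5, 13
WINDOW_S, CAP_CENTS, LOOP_THRESHOLD = 3600, 5000, 8

peak_cents = peak_window_calls(GAP_S, WINDOW_S) * COST_CENTS
spend_rule_trips = peak_cents > CAP_CENTS
loop_rule_blocks_at = LOOP_THRESHOLD      # repeats >= threshold refuses call 8
ten_thousand_calls_cents = 10_000 * COST_CENTS
```

At that pace the window never holds more than 277 calls, about $13.85, so the $50 cumulative rule would not fire however long the loop ran. Ten thousand such calls add up to $500. The repeat rule refuses the eighth call.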

The summary the script prints at the end says the whole thing in three rows:

SUMMARY  (window guard  vs  naive per-call $5 cap)
  scenario          window guard              per-call cap
  --------------------------------------------------------
  PASS              PASS                      PASS (0 blocked)
  BLOCK-cumulative  BLOCK cumulative_spend@12 PASS (0 blocked)
  BLOCK-loop        BLOCK loop_repeat@8       PASS (0 blocked)

Three workloads. The per-call cap's column reads PASS, PASS, PASS. On the two that were actually on fire, it never moved. Each call passed. The window didn't.

Tracking ≠ control

This is the line I'll defend in the comments, and I'd be glad to be argued out of it: a dashboard that shows your spend is not a control on your spend. It's a rear-view mirror. The $47K team had the number the whole time. It updated faithfully, after every call, all 264 hours. What they didn't have was anything that read that number before the next call and said no.

That's the only move the window guard makes that a dashboard doesn't. Same arithmetic — sum of the window — but evaluated at admission time, in the path of the call, with the authority to raise instead of just render. A per-call cap is one slice of pre-execution control. A per-window guard is the slice that sees accumulation. Most production agents I've seen have neither in the loop and a beautiful dashboard beside it.

What this is NOT

I'd rather you trust the small honest claim than oversell it.

It's not a replacement for per-call caps; it's a second layer. A single genuinely runaway call (one $900 request) is a per-call problem: if that call alone fits the window budget, this guard admits it. Keep both. They cover different failures.
The brake distance is one window, not zero. It stopped at $45.80, not $0. You bound the damage to roughly a window's worth of budget; you don't prevent the first dollar. Tune window and cap to the blast radius you can live with.
The escalation numbers ($127 / $891 / $6,240 / $18,400) are from the public postmortem, not from my run. My fixture demonstrates that a window guard halts this pattern early; it does not re-prove the incident. The only numbers I generated are the demo's: the $45.80 stop, the call-12 block, the call-8 loop trip.
The third rule (side-effect cap) ships in the code but isn't exercised by these fixtures. window_guard.py has a Rule C, a per-window cap on a named side effect like a refund or a charge, and the policy header prints sideeffect_cap={'refund': 5}. But all three demo scenarios exercise only spend and loop; none fire a side effect, so you won't see Rule C trip in the output. It's there, it's enforced by the same _check, and you can wire it up with the sideeffect= argument, but I'm not going to claim the demo proves it. It demonstrates two of the three rules. Treat Rule C as code you can adopt, not a result I showed you.
The per-call comparison uses a $5 cap on purpose, and the result depends on it. "Blocks 0 of 12" holds because these calls cost ≤$4.40. A per-call cap set below your normal call cost would catch the loop — at the price of refusing legitimate expensive calls. The honest claim isn't "per-call caps fail"; it's "per-call caps and window guards watch different axes, and you want both."
A fixture is not your production loop. The window (3600s) and the cap ($50) are demo values. Yours depend on your call rate and your budget. The code is the reusable part; the policy is yours to set.
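The demo never trips the side-effect cap (Rule C in the list above), so here is a rough sketch of how a per-window cap on a named side effect could behave. It is not the code in window_guard.py: the class name, the `admit` method, and the reading of "cap of 5" as five allowed and the sixth refused are all assumptions.

```python
from collections import deque


class SideEffectCap:
    """Per-window cap on named side effects (hypothetical sketch of a Rule C)."""

    def __init__(self, window_s, caps):
        self.window_s = window_s
        self.caps = caps              # e.g. {"refund": 5}
        self._events = deque()        # (timestamp, side-effect name), oldest first

    def admit(self, now, sideeffect):
        """Record the event and return True, or return False if admitting it
        would exceed the cap for that side effect inside the window."""
        while self._events and now - self._events[0][0] >= self.window_s:
            self._events.popleft()
        cap = self.caps.get(sideeffect)
        if cap is not None:
            fired = sum(1 for _, name in self._events if name == sideeffect) + 1
            if fired > cap:
                return False          # refused before the side effect runs
        self._events.append((now, sideeffect))
        return True


cap = SideEffectCap(window_s=3600, caps={"refund": 5})
# Six refunds ten seconds apart: the first five pass, the sixth is refused.
results = [cap.admit(t, "refund") for t in range(0, 60, 10)]
# An hour after the first refund, that refund has aged out and one more fits.
later = cap.admit(3600, "refund")
```

A refused event is not recorded, so a burst of rejected attempts does not extend the lockout beyond the window of the refunds that actually went through.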

Pick your window and your cap

Open your agent's loop and find the single call site that fans out — the planner, the verifier, the retry wrapper. Wrap it. Set window_seconds to a few times your normal task length, and spend_cap_usd to the most you'd tolerate losing before a human looks. Then leave it. The day it fires, it'll have saved you from a number with too many zeros.

Here's the open question I haven't settled, and I'd genuinely like your take: what's the right window length when an agent's healthy work and its runaway loop run at the same call rate? Too short and you trip on a legitimate burst; too long and the loop drains a window before the brake bites. I've been setting it to ~3× the longest normal task and eyeballing it, which is not a method I'm proud of. If you've tuned a per-window cap on a real agent and found a rule that holds, drop it in the comments — I read every one. And follow along; the next post takes this guard from one call site to a whole multi-agent graph and tries to find where the window should live.
