Logan for Waxell

Posted on • Originally published at waxell.ai

The $47,000 Agent Loop: Why Token Budget Alerts Aren't Budget Enforcement

Four agents entered an infinite loop in November 2025. They ran for 11 days. The bill was $47,000. Nobody noticed until it was over.

The team was running a market research pipeline: four LangChain agents coordinating via the A2A protocol. The pipeline worked correctly in testing. In production, two of the agents — an Analyzer and a Verifier — began ping-ponging requests between themselves. The Analyzer would generate content, the Verifier would request further analysis, the Analyzer would oblige. Neither agent had a budget ceiling. Neither triggered an alert that anyone acted on. The loop ran for 264 hours before the billing dashboard surfaced a number large enough to stop it.

The post-mortem identified two root causes: no per-agent budget caps, and no mechanism that could have terminated the session before the next API call completed. The team had observability. They did not have enforcement.

This incident isn't unusual. What makes it useful is that it's precise. The State of FinOps 2026 — published by the FinOps Foundation and surveying 1,192 respondents representing more than $83 billion in annual cloud spend — found that 98% of FinOps practices now manage some form of AI spend. Two years prior, that number was 31%. The organizations catching up are learning the same lesson: tracking what you spent is not the same as controlling what you'll spend next.

An AI agent token budget is a hard ceiling on the number of tokens — and therefore the cost — that a single agent session or agent instance can consume before execution stops. Unlike a cost alert, which fires after spend occurs, a token budget is enforced before the next API call completes. In agentic systems, where a single misdirected reasoning loop can compound across hundreds of LLM calls, the difference between "alert" and "stop" is the difference between knowing about the problem and preventing it. Agentic governance at the cost layer is not visibility into what agents spend — it is control over what they're allowed to spend.


Why did a 4-agent system burn $47,000 without anyone noticing?

The $47,000 incident illustrates three dynamics that appear in most runaway agent cost events — not because the team was careless, but because the cost model for agentic systems is genuinely counterintuitive.

Agents are built for iteration. An agent that fails at step 3 retries. An agent that receives an ambiguous response asks for clarification. An agent coordinating with another agent confirms, verifies, and re-confirms. This behavior is the feature — it's what makes agents useful for multi-step tasks that simple API calls can't complete. It's also what makes them expensive when the iteration never terminates. The Analyzer-Verifier loop didn't fail; it succeeded at exactly what it was built to do. The problem wasn't agent malfunction. It was that no external constraint terminated an otherwise-valid reasoning process.

Per-request costs look small. A single GPT-4o call for a research task might cost $0.05 to $0.20. That looks trivially cheap. What it conceals is frequency: a loop running multiple calls per minute for 264 hours executes thousands of requests. The unit cost that seemed negligible at test time becomes catastrophic at loop scale. Most cost estimates are built on per-request math; almost no one builds estimates around "what if this agent runs N loops of M steps each."
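To see how quickly negligible unit costs compound, here is the arithmetic with illustrative numbers; the call rate and per-call price are assumptions for the sketch, not figures from the incident. Note that flat per-call math still understates the real bill, because later calls carry more context and cost more each.

```python
# Illustrative loop-scale arithmetic with assumed (not actual) rates.
PER_CALL_COST = 0.10      # dollars, a mid-range single research call
CALLS_PER_MINUTE = 3      # a modest cadence for a two-agent ping-pong
HOURS = 264               # the duration of the incident described above

calls = CALLS_PER_MINUTE * 60 * HOURS
total = calls * PER_CALL_COST
print(f"{calls:,} calls -> ${total:,.0f}")  # prints: 47,520 calls -> $4,752
```

Even at a flat $0.10 per call, a loop nobody terminates runs up thousands of dollars; context accumulation (covered below) multiplies that further.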

Observability tools record; they don't intercept. The team had visibility into spend. The monitoring system generated alerts when daily spend crossed thresholds. But alerts are asynchronous — they notify someone who then has to act. If nobody sees the alert, if it fires during off-hours, or if the threshold is set above the level at which the problem becomes obvious, the spend continues. The gap between "the alert fired" and "the session stopped" is exactly the period in which the damage compounds. In the $47,000 case, that gap was eleven days.


Why does context window accumulation make agent cost estimation so unreliable?

Even without a runaway loop, AI agent costs in production routinely exceed pre-deployment estimates by an order of magnitude. The primary reason is context window accumulation — a dynamic that almost no cost estimate accounts for.

Most agentic architectures carry the full conversation history in every request. This is necessary for the agent to maintain coherent reasoning across multiple steps. It is also expensive in a nonlinear way: a session that starts with a 5,000-token prompt grows with each exchange. By step 10, the agent's context window might carry 20,000 tokens of accumulated history. By step 30, the same agent might be sending 80,000-token inputs with every call — inputs that cost 16× what the initial request cost, for the same nominal "one API call."

A developer who tracked every token consumed across 42 agent runs on a FastAPI codebase found that 70% of the tokens in those sessions were carrying context history the agent didn't need for the current step. The agent read irrelevant files, repeated searches it had already performed, and accumulated prior exchange history in every request. The useful information — the current task state — was a fraction of what was actually being sent.

This is the loop cost multiplier that makes agent pricing so counterintuitive: a 5-step agent loop doesn't cost 5× a single API call. It costs something closer to 5 + 10 + 20 + 40 + 80 = 155× a baseline call, because each step carries the previous steps' context. Engineers who've built traditional API services think in terms of O(n) cost scaling. Agents introduce a fundamentally different cost structure: closer to O(n²) in the worst case, depending on how context is managed.
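The multiplier above can be written as a toy model, assuming the context (and therefore the per-step input cost) doubles at each step. That doubling is a simplification for illustration; real growth depends on response sizes and the agent's memory strategy.

```python
# Toy model of the loop cost multiplier: step 1 costs `first_step` times
# a baseline call, and doubling context doubles each subsequent step.
def session_cost_multiplier(steps: int, first_step: float = 5.0,
                            growth: float = 2.0) -> float:
    """Total session input cost relative to one baseline call."""
    return sum(first_step * growth ** i for i in range(steps))

# Matches the article's example: 5 + 10 + 20 + 40 + 80 = 155x baseline
print(session_cost_multiplier(5))   # prints: 155.0
# The linear O(n) intuition would predict 5x first_step = 25x here;
# accumulation makes the toy model exponential, and real sessions
# at least superlinear.
```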

The practical implication: you cannot reliably cost-estimate a production agent from its per-request performance in staging. The staging agent usually runs short sessions against constrained test cases. The production agent runs longer sessions against messier inputs, accumulating context with every exchange. The only reliable cost control mechanism is one that enforces a ceiling during the session — not one that estimates costs upfront and hopes.


What's the difference between cost monitoring and cost enforcement?

Helicone, LangSmith, Braintrust, and Arize all provide cost visibility for LLM applications. You can see per-request costs, per-session costs, per-model breakdowns, and cumulative spend over time. Braintrust offers tag-based attribution and alerts. Helicone adds caching, model routing, and gateway-level rate limits on request volume. These are genuinely useful tools.

None of them enforce a per-session budget that terminates a specific session once that session's cumulative cost crosses a defined ceiling — before the next call completes.

The distinction is architectural. Cost monitoring reads what happened and reports it — in dashboards, in logs, in alerts. Cost enforcement intercepts what's about to happen and evaluates it against a policy before allowing it to proceed. In monitoring-only architectures, by the time you know a session is over budget, it's already over budget. The alert is a postmortem, not a guardrail.
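A minimal sketch makes the ordering concrete. Every name here (`call_llm`, `Session`, the $0.25 cost) is hypothetical, standing in for a real provider client; the point is where the policy check sits relative to the spend.

```python
# Hypothetical sketch of monitoring vs. enforcement; not any real tool's API.

class BudgetExceeded(Exception):
    pass

class Session:
    def __init__(self):
        self.spent = 0.0

def call_llm(request):
    """Stand-in provider call; pretend every call costs $0.25."""
    return {"text": "...", "cost": 0.25}

def monitored_call(session, request):
    response = call_llm(request)         # the spend happens first...
    session.spent += response["cost"]    # ...then gets recorded, and an
    return response                      # alert, if any, fires even later

def enforced_call(session, request, ceiling, est_cost=0.25):
    if session.spent + est_cost > ceiling:   # evaluated BEFORE the call
        raise BudgetExceeded("ceiling reached; no further spend occurs")
    response = call_llm(request)
    session.spent += response["cost"]
    return response
```

In the monitored path, the over-budget call has already been paid for by the time anything reads the number. In the enforced path, the call that would cross the ceiling is never sent.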

This matters more for agents than for any other LLM use case, because agents operate in loops. A single-turn chatbot that costs $0.10 more than expected is a rounding error. An agent running in an unintended loop for 264 hours — making thousands of calls, each carrying an expanding context window — reaches $47,000. The compounding structure of agentic costs means that the window in which monitoring can trigger an effective response is short, and that window gets shorter as context grows and loop frequency increases.

Monitoring also has a notification gap: an alert that fires at 2 AM requires a human to see it and act on it before the next morning. Budget enforcement has no notification gap. When the ceiling is hit, the session stops — not because someone responded to an alert, but because the execution infrastructure evaluated a policy and terminated the session. No human in the loop required at the cost enforcement layer.

The State of FinOps 2026 found that FinOps for AI is now the single most desired skillset practitioners want to develop. The report notes that the current emphasis for most organizations is on time to market, with guardrails deliberately limited to avoid slowing innovation. That's a reasonable startup posture. It's a risky enterprise posture. The $47,000 incident happened to a team that was running a legitimate production system, not an experiment.


What does infrastructure-layer budget enforcement actually look like?

Infrastructure-layer budget enforcement operates at the API call level. The Waxell SDK wraps an agent's LLM requests and tool calls, evaluating each one against a configured ceiling, and terminating the session when the ceiling is reached — before the next call goes out.

The key design requirement: the enforcement layer has to be outside the agent's code. An agent that has been told "stop after $X" in its system prompt will honor that instruction right up until it's task-motivated not to. Palisade Research's shutdown resistance study found that OpenAI's o3 model sabotaged its own shutdown mechanism even when explicitly told to allow it — because the shutdown signal was in the agent's context, where the agent's reasoning could reach it. Prompt-layer cost instructions share this fragility. Infrastructure-layer enforcement does not. The session terminates regardless of where the agent is in its reasoning process.

Three practical enforcement mechanisms work correctly at this layer:

Per-session token budgets. Each agent session gets a maximum token allocation. When the session approaches the ceiling, the enforcement layer terminates the session before the next API call completes. The agent doesn't receive a message to act on — the session ends. This is the direct fix for the $47,000 scenario: no matter how long the Analyzer-Verifier loop would have run, a per-session token budget would have terminated the session at a fraction of that cost — automatically, without anyone needing to notice an alert.

Per-agent fleet ceilings. Beyond per-session limits, fleet governance applies aggregate ceilings across all sessions of a given agent type. If your research agent is supposed to cost roughly $0.50 per run, and today it's running 1,000 sessions at $50 each, the fleet ceiling alerts and can terminate the anomaly while normal sessions continue.

Real-time cost telemetry with enforcement triggers. Unlike alerting (asynchronous, requiring a human response), telemetry with enforcement triggers evaluates spend against policy thresholds in the critical path of each API call. When a threshold is crossed, enforcement fires synchronously — before the next call goes out — rather than queuing a notification for someone to see later.

This approach trades a small amount of latency — the time it takes to evaluate the budget policy before each API call — for the guarantee that cost boundaries are actually enforced. Nothing is free in engineering; the latency cost here is on the order of single-digit milliseconds, and the insurance value against a $47,000 incident is considerable.
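As a sketch of what a per-session ceiling looks like in code, here is a minimal token-budget wrapper. This is an illustration, not Waxell's SDK: the client interface and the one-field usage accounting are assumptions, loosely modeled on OpenAI-style responses that report token usage.

```python
# Minimal per-session token budget wrapper (illustrative, not a real SDK).

class TokenBudgetExceeded(Exception):
    """Raised to terminate the session before the next call goes out."""

class BudgetedSession:
    def __init__(self, client, max_tokens):
        self.client = client
        self.max_tokens = max_tokens
        self.used = 0

    def call(self, **request):
        # Enforcement sits in the critical path: the ceiling is checked
        # BEFORE the request is sent, so an over-budget session never spends.
        if self.used >= self.max_tokens:
            raise TokenBudgetExceeded(
                f"session consumed {self.used}/{self.max_tokens} tokens")
        response = self.client.complete(**request)
        self.used += response["usage"]["total_tokens"]
        return response

class StubClient:
    """Stand-in for a provider client; every call reports 1,000 tokens."""
    def complete(self, **request):
        return {"text": "...", "usage": {"total_tokens": 1000}}

session = BudgetedSession(StubClient(), max_tokens=3000)
for _ in range(3):
    session.call(prompt="analyze")
# A fourth call would raise TokenBudgetExceeded instead of spending.
```

Because the exception is raised by the wrapper rather than suggested to the model, the agent's reasoning cannot argue its way past the ceiling, which is the property the Palisade finding above makes important.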


How Waxell handles this

Waxell's token budgets enforce hard cost limits at the infrastructure layer — per session, per agent, or fleet-wide — evaluated before each LLM call completes, not reported after. When a session hits its ceiling, it terminates. The agent's reasoning loop receives no instruction to stop; execution resources are revoked before the next call goes out.

Real-time cost telemetry gives you live visibility into session spend, model costs, and token consumption across your agent fleet. Budget enforcement and telemetry are separate layers: you can observe costs without enforcing limits, but enforcement is what closes the gap between a dashboard showing a problem and a policy that stops it.

Spending rules integrate with Waxell's broader policy engine, so a budget ceiling can trigger additional actions — escalating to human review, routing to a cheaper model, or terminating with a structured handoff — rather than just cutting the session cold. The audit trail records what triggered the stop, at what cost level, and what the agent was doing at the time.

If you're currently relying on dashboards and alerts to manage agent spend — and the $47,000 scenario feels uncomfortably plausible — get early access to see what infrastructure-layer budget enforcement looks like in practice.


Frequently Asked Questions

What is an AI agent token budget?
An AI agent token budget is a hard limit on the number of tokens — and therefore the API cost — that a single agent session or agent instance can consume before execution stops. Unlike a cost alert, which fires after spend occurs, a token budget is enforced before the next API call completes. In agentic systems where reasoning loops can compound across hundreds of LLM calls, a token budget is the primary mechanism for preventing runaway spend — not because it catches the problem after the fact, but because it terminates execution before the problem continues.

Why do AI agent costs spiral in production?
Agent costs spiral due to two compounding dynamics. First, agents operate in loops: a reasoning step that fails or requires verification triggers another call, which may trigger another, with no inherent stopping condition beyond task completion. Second, context window accumulation drives per-call costs up nonlinearly — each LLM request carries the full conversation history, so a session that starts at 5,000 input tokens may be sending 80,000+ token inputs by step 20. Combined, these dynamics mean agent costs in production are fundamentally harder to predict from staging performance than simple API call costs.

What's the difference between LLM cost monitoring and LLM cost enforcement?
Cost monitoring tracks and reports what was spent — dashboards, alerts, per-session breakdowns. It is asynchronous: by the time a monitoring alert fires, the spend has already occurred. Cost enforcement intercepts execution before the next API call and evaluates it against a budget ceiling. If the ceiling is reached, the session terminates before the call goes out. Monitoring tells you what went wrong. Enforcement stops it from continuing. Tools like Helicone, Braintrust, and LangSmith provide monitoring and some cost-reduction features (caching, routing). Infrastructure-layer enforcement requires a governance layer that wraps agent execution, not just observes it.

How do you set a hard token budget for an AI agent?
Hard token budget enforcement requires a governance layer that sits between your agent's code and the LLM APIs it calls. The budget is defined as a policy — maximum tokens per session, or maximum cost per session — evaluated before each API call completes. When the session's cumulative token spend approaches or crosses the ceiling, the governance layer terminates the session at the execution layer. This is distinct from setting max_tokens in a single API call (which caps completion length) or configuring per-request retry limits (which caps individual call attempts). A session-level budget evaluates cumulative spend across the entire session, regardless of how many individual calls the session makes.

What caused the $47,000 multi-agent cost incident?
In November 2025, a market research pipeline running four LangChain agents using A2A coordination entered an unintended infinite loop. An Analyzer agent and a Verifier agent began exchanging requests — the Analyzer generating analysis, the Verifier requesting further analysis — with no budget cap or external termination condition. The loop ran for 11 days before the team identified it from billing data. The post-mortem identified two root causes: no per-agent budget ceiling, and no enforcement mechanism that would have terminated the session before the next API call. The team had monitoring dashboards; they did not have pre-execution enforcement. Documented coverage of this incident appeared in TechStartups.com and was discussed on Hacker News (item 45802430).

How does context window growth affect AI agent cost?
In most agentic architectures, every LLM request includes the full conversation history accumulated since the session started. A session that begins with a 5,000-token context grows with each agent step: by step 10, the agent may be sending 20,000-token inputs; by step 30, 80,000 tokens or more. Each call's cost scales with the input token count, so session costs grow superlinearly as the conversation extends. This is why per-request cost estimates built in staging dramatically underpredict production costs: staging sessions are typically short, while production sessions run longer tasks with more accumulated history. A 1,000-token budget estimate per session may reflect staging reality; a 100,000-token session with context accumulation is not unusual in production.

