Joe Carpenter

How an AI Agent Ran Up a $47,000 Bill in 11 Days (And How to Stop It)

Published by Innovative Systems Global — April 2026


In November 2025, four AI agents entered an infinite retry loop.

Nobody noticed for 11 days.

When the bill arrived, it was $47,000. All of it from LLM API calls. All of it preventable. The team had logging. They had monitoring. They did not have a hard limit.

This is not a unique incident. It's becoming a rite of passage for engineering teams running agents in production.


Why this keeps happening

Every major LLM provider — OpenAI, Anthropic, Google — charges per token. The more your agent runs, the more you pay. This is the correct model. The problem is that agents don't know how much they're spending, and nothing stops them when they exceed a budget.

Current "solutions":

  • Spend alerts — fire after the damage is done. An alert at $1,000 doesn't help when an agent burns over $4,000 per day.
  • API rate limits — these throttle requests per minute, not total spend.
  • Observability platforms (Helicone, LangSmith) — they show you what happened. They don't prevent it.
  • Cloud billing alerts — by the time AWS or OpenAI sends an alert, the loop has been running for days.

What's missing: a hard gate that runs before the LLM call, checks the budget, and refuses to proceed if the limit is exceeded.
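That missing gate can be sketched in a few lines. This is a minimal illustration of the pattern, not any library's actual API — `spend_so_far`, `estimated_cost_usd`, and `call_llm` are hypothetical placeholders:

```python
class BudgetExceeded(Exception):
    """Raised when a call would push the agent past its hard limit."""

def guarded_call(agent_id, limit_usd, spend_so_far, estimated_cost_usd, call_llm):
    """Hard gate: refuse to make the LLM call if the budget is exhausted."""
    if spend_so_far + estimated_cost_usd > limit_usd:
        raise BudgetExceeded(
            f"{agent_id}: spent ${spend_so_far:.2f} of ${limit_usd:.2f} -- call refused"
        )
    return call_llm()

# The gate runs BEFORE the call, so a runaway loop fails fast:
try:
    guarded_call("my-research-agent", 50.00, 49.99, 0.02, lambda: "response")
except BudgetExceeded as e:
    print(e)
```

The key design point: the check happens before the provider is hit, so a stuck loop fails on the first over-budget iteration instead of after eleven days.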


The two-line problem

Here's what most agent code looks like:

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # conversation history built elsewhere
)

There is no cost tracking here. No budget check. No receipt. If this code runs 50,000 times in an infinite loop, you find out when the bill arrives.
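To make the failure mode concrete, here is a toy simulation of an unguarded loop. The per-call cost and iteration count are illustrative, not figures from the incident:

```python
def simulate_runaway_loop(cost_per_call_usd: float, num_calls: int) -> float:
    """Accumulate cost for a loop with no budget check: growth is unbounded."""
    total = 0.0
    for _ in range(num_calls):
        total += cost_per_call_usd  # nothing here ever says "stop"
    return total

# 50,000 retries at ~2 cents per call is already ~$900 --
# longer prompts or pricier models scale this into the tens of thousands.
print(f"${simulate_runaway_loop(0.018, 50_000):,.2f}")
```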


The fix: meter every call, enforce every limit

We built dingdawg-governance to solve this. Three new MCP tools in v2.1.0:

meter_llm_call — call this after every LLM response. Pass the model, tokens in, tokens out, and your agent ID. Get back the cost, your cumulative spend, and your budget status.

{
  "receipt_id": "mtr_abc123_def456",
  "agent_id": "my-research-agent",
  "provider": "openai",
  "model": "gpt-4o",
  "prompt_tokens": 1200,
  "completion_tokens": 800,
  "cost_usd": 0.018,
  "cumulative_spend_usd": 12.43,
  "budget_status": "ok",
  "budget_limit_usd": 50.00
}
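The cost arithmetic behind a receipt like this is simple token math. A sketch, using assumed example rates of $5 per million input tokens and $15 per million output tokens for gpt-4o (which reproduces the $0.018 above; real provider rates vary by model and change over time):

```python
PRICE_TABLE = {
    # (input $/1M tokens, output $/1M tokens) -- assumed example rates
    "gpt-4o": (5.00, 15.00),
}

def call_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the cost of one call from token counts and the price table."""
    in_rate, out_rate = PRICE_TABLE[model]
    cost = prompt_tokens * in_rate / 1e6 + completion_tokens * out_rate / 1e6
    return round(cost, 6)

print(call_cost_usd("gpt-4o", 1200, 800))  # matches cost_usd in the receipt above
```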

set_llm_budget — set a hard limit for any agent. Daily or monthly. Warning fires at 80% by default.

{
  "agent_id": "my-research-agent",
  "limit_usd": 50.00,
  "period": "daily"
}
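The status transitions follow directly from the limit. A sketch of the ok → warning → exceeded logic, assuming the default 80% warning threshold described above:

```python
def budget_status(cumulative_spend_usd: float, limit_usd: float,
                  warn_fraction: float = 0.8) -> str:
    """Map cumulative spend against a hard limit to a status string."""
    if cumulative_spend_usd >= limit_usd:
        return "exceeded"
    if cumulative_spend_usd >= warn_fraction * limit_usd:
        return "warning"
    return "ok"

# With a $50 daily limit: warning fires at $40, exceeded at $50.
print(budget_status(12.43, 50.0))
```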

get_spend_report — query spend by agent, model, and date range. See exactly which agents cost what.
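Under the hood, a spend report is an aggregation over receipts. A sketch of the grouping — the receipt fields mirror the meter_llm_call output above, but the report shape here is illustrative, not the tool's actual schema:

```python
from collections import defaultdict

def spend_report(receipts: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost_usd per (agent_id, model) across a list of receipts."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for r in receipts:
        totals[(r["agent_id"], r["model"])] += r["cost_usd"]
    return dict(totals)

receipts = [
    {"agent_id": "my-research-agent", "model": "gpt-4o", "cost_usd": 0.018},
    {"agent_id": "my-research-agent", "model": "gpt-4o", "cost_usd": 0.022},
    {"agent_id": "summarizer", "model": "gpt-4o", "cost_usd": 0.005},
]
print(spend_report(receipts))
```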


How the $47K incident gets prevented

With dingdawg-governance wired:

  • Day 1: Agent starts loop. meter_llm_call tracks each call.
  • Day 1, ~$40 in: budget_status flips to "warning". Your code can log, alert, or throttle.
  • Day 1, $50 in: budget_status flips to "exceeded". Your code stops the agent.
  • Total damage: $50, not $47,000.

The enforcement is in YOUR code — you decide what to do when the budget is exceeded. The meter gives you the signal.
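Wired into an agent loop, that stop logic is a few lines. A hedged sketch — `run_step` and `meter` stand in for your agent step and the meter_llm_call tool, and the fake meter below exists only to make the example runnable:

```python
def run_agent(limit_usd: float, run_step, meter) -> float:
    """Run steps until done -- or until the meter reports the budget exceeded."""
    spend = 0.0
    while True:
        result = run_step()
        receipt = meter(result)            # one receipt per LLM call
        spend = receipt["cumulative_spend_usd"]
        if receipt["budget_status"] == "exceeded":
            break                          # YOUR code decides: here, hard stop
        if receipt["budget_status"] == "warning":
            pass                           # could log, alert, or throttle
        if result == "done":
            break
    return spend

# A fake meter that charges $10 per step against a $50 limit:
def fake_meter_factory(limit):
    state = {"spend": 0.0}
    def meter(_):
        state["spend"] += 10.0
        status = ("exceeded" if state["spend"] >= limit
                  else "warning" if state["spend"] >= 0.8 * limit
                  else "ok")
        return {"cumulative_spend_usd": state["spend"], "budget_status": status}
    return meter

# The step never finishes on its own, yet total damage is capped at $50:
print(run_agent(50.0, lambda: "working", fake_meter_factory(50.0)))
```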


Installation

# As an MCP server (Claude Desktop, Cursor, any MCP-compatible client)
npx dingdawg-governance

# Claude Code
claude mcp add dingdawg-governance npx dingdawg-governance

Free tier: unlimited meter_llm_call and set_llm_budget calls. Local filesystem storage. No API key required.

Paid tier ($19/month): cloud receipt storage, team dashboards, cross-session spend history, PDF export. API key at dingdawg.com/developers.


Price table

Built in. Covers 30+ models across OpenAI, Anthropic, Google, Groq, Mistral, Cohere, and DeepSeek. Updated with each release.

If your model isn't in the table, it returns cost_usd: 0 with a note — it never silently miscalculates.


Works with any agent framework

dingdawg-governance is an MCP server. Any agent that can call MCP tools can use it — LangChain, AutoGen, CrewAI, custom agents, Claude Code, Cursor. No SDK required. No framework lock-in.


The broader problem

The $47K incident is the visible symptom. The real problem is that enterprises are deploying agents with no spend governance at all. Every dollar an agent spends is invisible until it's gone.

As agents become more autonomous — running overnight, chaining into other agents, operating without human supervision — the spend problem compounds. A single misconfigured retry policy can turn a $50 research job into a $50,000 infrastructure incident.

Budget enforcement isn't a nice-to-have. It's the seatbelt.


Get started

npx dingdawg-governance

Source: github.com/dingdawg/governance-sdk
Pricing: dingdawg.com/developers


Innovative Systems Global builds AI governance infrastructure for teams running agents in production. Based in the Rio Grande Valley, Texas.
