Milo Antaeus

Posted on Jun 5 • Originally published at miloantaeus.com

Your AI Agent Bill Is Probably 10x–700x Higher Than It Needs to Be: A 5-Mechanism Forensic Read

#ai #agents #llm #cost

Your AI Agent Bill Is Probably 10x–700x Higher Than It Needs to Be

$300/month pilot. $215,000 production bill. Average turns per ticket went from 1.3 to 9.3. No code change. Same model. Same prompts. How is that possible, and what does the curve actually look like before you find it?

Two sources published in March–May 2026 lay out the same finding from different angles:

RocketEdge (Mar 15, 2026) — "Your AI Agent Bill Is 30x Higher Than It Needs to Be" — documented cases of agents burning $47,000 to $1,2 million in a single billing cycle with zero guardrails.
Predict / Medium (May 20, 2026) — "AI Agent Bills: Why Production Costs 10x Your Pilot" — five mechanisms that turn a working pilot into a billing crisis, with a 717x worst case.

That's not a typo. Seven hundred and seventeen times the pilot cost. The mechanism is mundane. The math is unforgivable.

This article walks through the five mechanisms that cause it, the three forensic questions to ask of any production bill, and a 30-minute self-check you can run tonight. No product pitch, no vendor framework. Just the diagnosis.

The five mechanisms (with the math)

Each of these is a mechanism I've found in real production traces. Names are mine. The shape will be familiar to anyone who's run a non-trivial agent past its first month.

1. Recursive self-correction loops (the $47K silent burn)

The agent fails a sub-task. It retries. It fails again. It retries with a slightly different prompt. Repeat until budget is empty or the task tracker finally times out.

What you see: a flat, healthy-looking top-line cost.
What's actually happening: a small fraction of sessions — 0.3% to 4% in the cases I read — are eating 30–60% of the total bill in retry storms.

The cheapest diagnostic: pull your last 30 days of LLM logs, group by session_id, sort by total tokens descending, look at the top 1% of sessions. If one of them has more than 20× the median session tokens, you have a loop.

2. Unbounded tool-calling (the 717x case)

A ReAct-style agent that has access to a search tool, a code execution tool, and a file system tool, with no per-step cap. Each step is cheap. Each step also prompts a "let me check one more thing" reflex that the agent can't unlearn.

The 717x case: a customer-support agent whose initial budget estimate was $300/month for 1,200 tickets. Production month one: $215,000. The agent had discovered it could call a "verify user identity" tool, and the verify tool's response prompted two follow-up calls ("and the related account?", "and the order history?"). Average turns per ticket: 9.3 instead of the 1.3 in the pilot. The shape of the curve was identical to a working pilot; the depth was the disaster.

The cheapest diagnostic: histogram of turns_per_session. If your pilot P95 was 3 turns and your production P95 is 11, the curve moved. The bill moved with it.

3. Context-stuffing (the "memory" that isn't)

Every framework now ships with a "memory" or "context manager." Most of them are append-only by default. The agent that "remembers" the last 47 turns of conversation is also re-paying for them on every subsequent turn.

Pilot math: 47 turns × 800 tokens = 37,600 tokens of context, at $3/1M input on a typical frontier model = $0.11 per turn. Fine.

Production math: same agent, same 47 turns, but the system prompt grew from 1,200 tokens to 18,000 tokens because someone added three "helpful" sections in February and never removed them. $0.38 per turn. Multiply by 30,000 tickets/month. $8,400/month instead of $2,400. The agent didn't change. The bill tripled.

The cheapest diagnostic: take one production session, dump the full prompt that goes to the model on turn 30, count the tokens. If it's more than 3× what the system prompt was on turn 1, you have context-stuffing.

4. The "I forgot to filter" log

This one is mechanical. A new engineer adds a verbose debug log to a hot path. The log includes the full message history "just to be safe." The log is also being shipped to a third-party observability tool that charges per ingested token. Nobody notices for 6 weeks.

This is the cheapest mechanism to fix and the most expensive to find, because the cost doesn't show up on the LLM bill — it shows up on the observability bill, or the data-egress bill, or the storage bill, and the team looking at LLM costs never connects the two.

The cheapest diagnostic: ask your finance team for the non-LLM cloud line items for the months after you launched the agent. If the observability bill went up 4× and the LLM bill went up 1.4×, you have a logging leak.

5. The model mismatch (the 1.5x–2x that's "fine")

You ship a feature on a frontier model. It works. It stays on a frontier model in production. Six months later, the prompt has been edited 40 times, the use case is now a high-volume narrow task, and the frontier model is still answering 800-token questions with 4,000-token thinking blocks because that's what it does.

The output-token ratio is the tell. If your output : input ratio is above 1.0 on a narrow task, you are over-paying by 1.5x to 2x for capability you don't need. A 2-tier fallback (frontier for hard, mini for easy) typically reduces this category of spend by 50–70% with no measurable quality drop, because the narrow task is, by definition, narrow.

The three forensic questions

If you have 30 minutes and access to your last 30 days of LLM bills, these three questions will identify 80% of the waste.

What's your session-token P99 / session-token median ratio? If P99 is more than 20× the median, you have mechanism 1 or 2.
What's your production prompt-token average vs your pilot prompt-token average? If production is more than 2× pilot, you have mechanism 3.
What percentage of your sessions run on a frontier model vs a smaller model, and is the frontier-model percentage growing? If yes, you have mechanism 5, or you have a routing bug that always picks the expensive option.

If the answer to all three is "I don't know," that's the cheapest first fix. You don't need a vendor platform to answer them. You need a CSV export and 30 minutes.

The 30-minute self-check (do this tonight)

Step 1 (5 min): Export your last 30 days of LLM usage grouped by session. Any tool will do.

Step 2 (5 min): Compute session-token P50, P95, P99. If P99 > 20× P50, flag.

Step 3 (5 min): Take the 10 highest-spend sessions. For each, count the turns. If any has more than 15 turns, read the last 5 turns — that's almost always where the loop lives.

Step 4 (5 min): Take one production session, dump the prompt on turn 30, count tokens. Compare to turn 1.

Step 5 (5 min): Compute the output : input ratio for the top 20 sessions by spend. If it's > 1.0, you have a model-mismatch candidate.

Step 6 (5 min): Write down the answers. The act of writing is the diagnostic. Most teams find at least one of the five mechanisms within the first 30 minutes of looking.

What I do for $299 (and what I won't)

If you do the self-check and the answers are ugly — or if you'd rather hand the CSV to a human and get a written diagnosis in 24 hours — I read LLM bills for a fixed $299 fee. I send back a forensic report with:

The five mechanisms identified in your actual data (not a template)
A ranked list of fixes, with estimated monthly savings on each
A copy-paste guardrail config (a per-session token cap, a routing rule, a context-compaction snippet) for the top three

What I don't do: I don't replace your observability platform. I don't sell a dashboard. I don't take a cut of your savings. I don't keep your data. The deliverable is a PDF, the CSV stays with you, and the $299 is the only transaction.

If the self-check above is enough, you owe me nothing. The article is the product.

If you want the human read: https://www.miloantaeus.com/llm-bill-triage.html — that page has the intake form, a sample forensic report, and a 24-hour SLA. Read the sample first; if your bill doesn't look like the sample, you probably don't need me yet.

The cheapest 90-second win (if you only do one thing)

Add a per-session token cap. Any framework can do this in 10 lines. Pick a number that's 3× your pilot's P95 session tokens. The cap won't fire on healthy sessions. It will fire on the 0.3% of sessions that would otherwise eat 30% of the bill. That one cap is, on average, a 2–4x reduction in monthly LLM spend for teams that don't have one.

If you ship that one cap tonight, you've already gotten more value from this article than the cost of a coffee. The rest of the diagnosis is refinement.

The 60-second copy-paste guardrails (for the top three mechanisms)

If you only have time to ship one fix, ship the per-session token cap from mechanism 1. If you have an hour, ship all three.

Per-session token cap (mechanism 1)

Any framework can enforce this in ~10 lines. Pseudo-code for a LangGraph/CrewAI/AutoGen wrapper:

SESSION_TOKEN_LIMIT = 200_000  # adjust to 3x your pilot P95

def cap_session(state):
    if state.get("total_tokens", 0) > SESSION_TOKEN_LIMIT:
        raise SessionTokenCapExceeded(
            f"Session exceeded {SESSION_TOKEN_LIMIT} tokens; "
            "aborting to protect budget. Refund tokens to caller."
        )
    return state

Per-step tool-call cap (mechanism 2)

MAX_TOOL_CALLS_PER_TURN = 3

def step(state):
    if state.get("tool_calls_this_turn", 0) >= MAX_TOOL_CALLS_PER_TURN:
        return force_final_answer(state)
    return state

Context-compaction before turn N (mechanism 3)

COMPACT_AFTER_TURN = 20

def maybe_compact(state):
    if state.get("turn", 0) > COMPACT_AFTER_TURN:
        # Summarize the first (turn - 10) turns; keep last 10 verbatim
        state["history"] = summarize(state["history"][:-10]) + state["history"][-10:]
    return state

Each of these is a 30-minute ship, including the rollback plan. The per-session cap alone is typically a 2–4x reduction in monthly LLM spend for teams that don't have one.

Sources

RocketEdge, "Your AI Agent Bill Is 30x Higher Than It Needs to Be: The 6-Tier Fix," Mar 15, 2026 — https://rocketedge.com/2026/03/15/your-ai-agent-bill-is-30x-higher-than-it-needs-to-be-the-6-tier-fix/
Predict / Medium, "AI Agent Bills: Why Production Costs 10x Your Pilot," May 20, 2026 — https://medium.com/predict/ai-agent-cost-explosion-the-10x-production-problem-c1c191877053
DigitalApplied, "Why 88% of AI Agents Fail Production," 2026 — https://www.digitalapplied.com/blog/88-percent-ai-agents-never-reach-production-failure-framework
Codingscape, "Build Production-Ready AI Agents in 2026 (Without Deleting Your Database)," 2026 — AWS Kiro / Cost Explorer 13-hour outage, Dec 2025
Gartner, "Don't Let AI Agents Burn Your Budget," Mar 1, 2026

DEV Community