- Book: AI Agents Pocket Guide
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The dashboard was green for eleven days. The agents looked busy. Logs scrolled. The on-call rotation was quiet. The team only learned anything was wrong when the monthly invoice landed: $47,000 of LLM API spend on a system the product manager would have described as "an internal research tool."
That number is real. The story was first written up on Medium and picked up by Tech Startups in November 2025. Four LangChain-style agents wired together with agent-to-agent messaging. The Tech Startups writeup breaks the roles down as research, analysis, verification, and summary; the original Medium post just says "four LangChain agents." No step cap. No per-conversation USD budget. No orchestrator deciding when the work was done. Two of the four agents got into a clarification ping-pong and ran for 11 straight days (≈264 hours, my arithmetic).
The cost did not arrive all at once. It compounded. Week 1 was $127. Week 2 was $891. Week 3 was $6,240. Week 4 was $18,400. By the time anyone read the bill, the loop had been running long enough for "we noticed at week 4" to be the punchline.
Here is the postmortem the team did not get to write before they paid for it. The anatomy of the loop, three signals that would have caught it on day one, and defense code you can paste into your agent runtime tonight.
Anatomy of the loop
Strip the framework names off and the failure mode is generic. An agent calls the model. The model emits a tool call. The tool happens to be "ask another agent for clarification." The other agent calls the model. That model emits a tool call back to the first agent. Both turns count as legitimate work. Neither agent is allowed to declare the conversation finished, because there is no shared state that says "the user's question is answered."
In the documented incident, the analyzer would send a clarification request. The verifier would respond with more instructions. The analyzer would expand its output and ask for confirmation. The verifier would request changes again. Repeat for 11 days.
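Reduced to a toy, the shape is two functions that each treat the other's output as a fresh request, with no shared notion of done. A caricature, not the incident's actual code:

def analyzer(msg: str) -> str:
    return f"please clarify: {msg}"

def verifier(msg: str) -> str:
    return f"please revise: {msg}"

msg = "draft report"
while True:  # no step cap, no budget, no loop detector: nothing can end this
    msg = analyzer(verifier(msg))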
What was missing, in plain terms:
- A hard cap on how many model calls one logical conversation can make.
- A USD budget per conversation that aborts before the cap matters.
- A self-loop detector that notices when the same tool is being called with the same input twice in a row.
- A trace that lets a human see the loop shape inside ten seconds.
You only need one of those four to catch a loop on iteration three. The team had zero.
Signal 1 — duplicated tool inputs
The cheapest tell is also the most reliable. Hash every tool input as it goes out. If the same hash repeats inside one conversation, you are watching a loop start.
import hashlib
import json

def input_hash(name: str, args: dict) -> str:
    payload = {"tool": name, "args": args}
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
Sixteen hex chars is enough entropy to avoid collisions inside one conversation and short enough to read in a trace UI. The detector keeps a set per conversation and trips on the second match.
class LoopDetector:
    def __init__(self, threshold: int = 2):
        self.seen: dict[str, int] = {}
        self.threshold = threshold

    def record(self, name: str, args: dict) -> bool:
        h = input_hash(name, args)
        n = self.seen.get(h, 0) + 1
        self.seen[h] = n
        return n >= self.threshold
record returns True the moment the same tool input has been seen threshold times. Two is the smallest number that distinguishes a one-shot tool call from a loop. Raise it to three for idempotent tools you intentionally call repeatedly (a paginated search, a polling status check).
In the $47K incident, the verifier was asking the analyzer to "please clarify section 3.2" with effectively the same payload every cycle. A 16-character hash would have caught it on iteration two.
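You can watch the detector trip on that payload. The tool name here is my stand-in, not the incident's:

det = LoopDetector(threshold=2)
req = {"request": "please clarify section 3.2"}
det.record("ask_analyzer", req)         # first call: False, work proceeds
assert det.record("ask_analyzer", req)  # second identical call: True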
Signal 2 — per-conversation USD budget
Token counts are abstract. Dollars are not. Convert tokens to dollars at the call site and check a hard ceiling before you make the next request.
# (input USD, output USD) per million tokens
COST_PER_M = {
    "claude-sonnet-4-5": (3.0, 15.0),
    "claude-haiku-4-5": (0.8, 4.0),
}

def usd_for_call(model, in_tokens, out_tokens):
    in_rate, out_rate = COST_PER_M[model]
    return (in_tokens * in_rate +
            out_tokens * out_rate) / 1_000_000
class BudgetGate:
    def __init__(self, ceiling_usd: float):
        self.ceiling = ceiling_usd
        self.spent = 0.0

    def charge(self, model, in_tokens, out_tokens):
        cost = usd_for_call(model, in_tokens, out_tokens)
        self.spent += cost
        if self.spent >= self.ceiling:
            # BudgetExceeded is defined in the full listing below.
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of "
                f"${self.ceiling:.2f}"
            )
The ceiling is a product decision, not an engineering one. A support conversation might be capped at $0.50. A research run that the user explicitly approved might be $5. A background batch job might be $20 per item. What matters is that the gate exists.
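To see how a ceiling plays out against the rates sketched above (the token counts here are illustrative):

gate = BudgetGate(ceiling_usd=0.50)  # the support-conversation cap

# 1,200 input + 400 output tokens at the Sonnet rates above:
# (1_200 * 3.0 + 400 * 15.0) / 1_000_000 = $0.0096 per call,
# so roughly 52 calls of this size fit under the cap before it trips.
gate.charge("claude-sonnet-4-5", 1_200, 400)
print(f"spent so far: ${gate.spent:.4f}")  # spent so far: $0.0096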
Pricing rates above are the public numbers; check the Anthropic pricing page before you ship.
In the documented incident, even a $50-per-conversation cap would have stopped the bleeding at week 1.
Signal 3 — OTel span depth and call count
Self-loops have a shape you can see in a trace. One conversation parent, dozens of child spans, the same tool name appearing in a flat fan. If your tracing already follows the OpenTelemetry GenAI conventions, the alert is a saved query.
Two conditions are enough:
- agent.iterations > N on a closed agent.turn span.
- The same tool.input_hash appearing more than twice under one agent.turn.
Either one fires and the on-call gets paged before the bill does. Pick a sensible N for your workload: 10 covers the common research-then-summarize chain. Tune up if your agent does multi-hop retrieval, down if it is single-shot Q&A.
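If you emit the spans yourself, the wiring is a few lines. A minimal sketch with the OpenTelemetry Python API; the attribute names mirror the two conditions above, not any official GenAI semconv keys:

from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def traced_tool_call(step: int, name: str, args: dict):
    # One span per turn; the saved query alerts on these attributes.
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("agent.iterations", step)
        span.set_attribute("tool.name", name)
        span.set_attribute("tool.input_hash", input_hash(name, args))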
The full agent loop with all three guards
Here is the agent loop with every guard wired in. Anthropic SDK, hard step cap, USD budget, duplicate-input detector. About sixty lines. The same pattern works for any provider. The gates do not care about the model.
import anthropic

class AgentError(Exception): pass
class BudgetExceeded(AgentError): pass
class StepLimitExceeded(AgentError): pass
class SelfLoopDetected(AgentError): pass

MAX_STEPS = 10
PER_CONVO_USD = 1.00

client = anthropic.Anthropic()

def _final_text(resp):
    # Concatenate the text blocks of the final response.
    return "".join(
        b.text for b in resp.content if b.type == "text"
    )

def run_agent(user_prompt, tools, dispatch):
    budget = BudgetGate(PER_CONVO_USD)
    detector = LoopDetector(threshold=2)
    messages = [{"role": "user", "content": user_prompt}]

    for step in range(1, MAX_STEPS + 1):
        resp = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        # Guard 1: USD budget, charged from the usage the API reports.
        budget.charge(
            "claude-sonnet-4-5",
            resp.usage.input_tokens,
            resp.usage.output_tokens,
        )
        messages.append(
            {"role": "assistant", "content": resp.content}
        )
        if resp.stop_reason == "end_turn":
            return _final_text(resp)
        if resp.stop_reason != "tool_use":
            raise AgentError(
                f"unexpected stop: {resp.stop_reason}"
            )
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            # Guard 2: duplicate-input detector, checked before dispatch.
            if detector.record(block.name, block.input):
                raise SelfLoopDetected(
                    f"duplicate tool: {block.name}"
                )
            out = dispatch[block.name](**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(out),
            })
        messages.append(
            {"role": "user", "content": results}
        )

    # Guard 3: hard step cap.
    raise StepLimitExceeded(f"hit MAX_STEPS={MAX_STEPS}")
Each exception maps to a different failure mode. BudgetExceeded means the cap needs tuning or the prompt is wasteful. StepLimitExceeded means the model cannot finish the task with this toolset. SelfLoopDetected means two tools are talking past each other and the model has not noticed.
Every client.messages.create response carries usage.input_tokens and usage.output_tokens. Those are the values you charge against the budget, with no separate metering layer required. The Anthropic SDK documents both fields.
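Wiring it up looks like this. The lookup_doc tool and its schema are invented for illustration; the schema shape follows the Anthropic tools format:

TOOLS = [{
    "name": "lookup_doc",
    "description": "Fetch an internal document by id.",
    "input_schema": {
        "type": "object",
        "properties": {"doc_id": {"type": "string"}},
        "required": ["doc_id"],
    },
}]
DISPATCH = {"lookup_doc": lambda doc_id: f"(contents of {doc_id})"}

try:
    print(run_agent("Summarize doc 42.", TOOLS, DISPATCH))
except AgentError as e:
    # All three guards surface here as distinct exception types.
    print(f"aborted: {type(e).__name__}: {e}")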
Retries that look like loops
The duplicate detector will fire on a legitimate retry. The right move is to make retries explicit: raise the threshold to 3 for an idempotent tool, or wrap the retry in a different tool name (search_orders_retry instead of search_orders). What you do not want is to switch the detector off "for now." That is how you end up paying $47,000 to learn a lesson the team in the source article already paid for.
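One way to make that explicit is a per-tool threshold map on top of the detector from Signal 1. The tool names here are hypothetical:

# Idempotent tools that are legitimately called repeatedly get a looser limit.
RETRY_THRESHOLDS = {"search_orders": 3, "poll_job_status": 3}  # hypothetical names

class PerToolLoopDetector(LoopDetector):
    def record(self, name: str, args: dict) -> bool:
        h = input_hash(name, args)
        n = self.seen.get(h, 0) + 1
        self.seen[h] = n
        return n >= RETRY_THRESHOLDS.get(name, self.threshold)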
The smallest version of all of this
If you take nothing else from this postmortem, take three lines:
assert step <= MAX_STEPS
assert spent <= BUDGET_USD
assert input_hash(name, args) not in seen
Three asserts. Drop them anywhere in your agent runtime. Step cap, budget, loop detector: one stops runaway iteration, one protects the credit card, and one catches the exact failure mode where two agents kept talking because no code told them to stop.
A green dashboard for eleven days, with no caps and no budget gate and no loop detector, means nothing was wired to fail.
If this was useful
The AI Agents Pocket Guide covers iteration caps, budget gates, self-loop detection, and the multi-agent failure modes that show up the moment two agents start sending each other tool calls. If you are running anything more complex than a single-agent loop in production, the chapters on orchestration boundaries and termination conditions map directly to what went wrong in the source incident.
