Tool-Calling Loops: The Bug That Burns $4,000 Overnight (and the 7-Line Fix)

Gabriel Anhaia


The trace was 63 hours long. The bill was $4,200. The shape, when Sattyam Jain finally read the logs, was almost embarrassingly simple: plan, call a tool, hit a 429, re-plan, call the same tool, hit a 429, re-plan. About 4,800 cycles per hour, all weekend, on an agent that had been told to "keep trying until it works." The full postmortem is on Medium.

This is not a rare bug. The Operator Collective documented a multi-agent research tool that ran for eleven days before anyone noticed, posting a $47,000 OpenAI invoice. The LangChain and LangGraph repos have open issues and bug reports about agents that loop until the framework's recursion limit terminates them. The shape is the same every time.

What the loop looks like on the wire

Three messages on repeat. The model emits a tool call. The tool returns an error. The next model turn looks at the error, decides to try again, emits the same tool call.

turn 12: assistant -> tool_call: search_db(query="orders 2026-04-01")
turn 13: tool      -> error: 429 rate_limit
turn 14: assistant -> tool_call: search_db(query="orders 2026-04-01")
turn 15: tool      -> error: 429 rate_limit
turn 16: assistant -> tool_call: search_db(query="orders 2026-04-01")
...

The arguments are byte-identical. The tool name is identical. The error is identical. The model never adapts because nothing in its context tells it the previous attempt was the same call it is about to make again. Every turn, it sees a new error message, decides retrying is reasonable, and retries.

Why the LLM keeps trying

The model is not being stupid. The context is.

One: the context window flattens history. By turn 50, the model is looking at fifty tool_call -> error pairs. It is structurally hard to notice you are repeating when every turn looks exactly like the one before it, and the only signal that the loop exists is the count of how many times it has happened. Counts are exactly what LLMs are bad at.

Two: the system prompt told it to retry. Production agents almost always have something in the prompt that says be persistent, do not give up, try alternative approaches. That instruction is correct in the median case (a transient 503 should be retried). It is catastrophic in the tail (a permanent 429 should not).

Three: the tool error message is too thin. 429: rate_limited does not tell the model you have already hit this rate limit twelve times in the last two minutes. From the model's perspective, this 429 is the first 429 it has seen, because the previous twelve are buried in turns the attention is not weighting heavily.
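
One complementary, dispatcher-side mitigation is to fatten the error before it goes back into context, so the repeat count rides along with every 429. A minimal sketch; the enrich_error helper and its wording are illustrative, not from the original postmortem:

from collections import Counter

error_counts: Counter = Counter()

def enrich_error(tool_name: str, raw_error: str) -> str:
    # count (tool, error) pairs so the repeat count survives context flattening
    error_counts[(tool_name, raw_error)] += 1
    n = error_counts[(tool_name, raw_error)]
    if n == 1:
        return raw_error
    return f"{raw_error} (seen {n} times for {tool_name}; retrying the same call is unlikely to help)"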

You fix this in the tool dispatcher, not the prompt. You can write do not retry the same call in seventeen different ways and the model will still retry, because the context flattening problem dominates the instruction. The dispatcher is the code that runs the calls before the model ever sees results.

The seven-line detector

A sliding window of the last N tool-call fingerprints. If the same fingerprint appears more than once in the window, abort the trace.

from collections import deque
import json

WINDOW = 6                   # how many recent calls count as "recent"
seen = deque(maxlen=WINDOW)  # sliding window of recent call fingerprints

def detect(name: str, args: dict) -> bool:
    # fingerprint: tool name + canonical JSON of its arguments
    fp = f"{name}:{json.dumps(args, sort_keys=True)}"
    if fp in seen:
        return True  # identical call inside the window: we are looping
    seen.append(fp)
    return False

Seven lines of logic. Fingerprint the call by name plus sorted args. If the fingerprint is in the recent window, you are looping. Return True, abort.
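
Replaying the trace from the top of the post through it (same tool, byte-identical args):

# turn 12: first sighting, the call goes through
assert detect("search_db", {"query": "orders 2026-04-01"}) is False

# turn 14: identical fingerprint, still inside the window -> loop
assert detect("search_db", {"query": "orders 2026-04-01"}) is True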

Drop it into the dispatcher:

for call in m.tool_calls:  # m is the model's response message
    name = call.function.name
    args = json.loads(call.function.arguments)
    if detect(name, args):
        return f"LOOP_KILL: {name} repeated"
    out = dispatch(call)   # only runs if we are not looping

The agent terminates and the trace ID lands in your error log. Cost stops accruing within a second of the first repeat.

Why a sliding window beats a global set

The first thing every engineer reaches for is seen: set[str]. Track every fingerprint forever, abort on any repeat. It is the right instinct and the wrong implementation.

A global set false-positives on legitimate retries. Real workflows do call the same tool with the same args twice. A get_user(id=42) at step 3 and again at step 47, after the agent did unrelated work in between, is fine. A search_db(query="x") at step 12 and again at step 13 is the bug.

The sliding window encodes "recent" without encoding "ever." Six is a defensible default for single-task agents with 3–5 tools. For long-running orchestrations with dozens of tools, twelve. For tight short-task agents, four. Tune the window size from your trace data: pull the last hundred successful traces, find the minimum gap between legitimate same-fingerprint calls, and set the window below that number (a window larger than that gap would flag the legitimate repeat as a loop).
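
A sketch of that tuning pass, assuming you can export each trace as an ordered list of (tool_name, args) calls; the trace format here is hypothetical:

import json

def min_legit_gap(traces: list[list[tuple[str, dict]]]) -> int:
    # smallest distance, in calls, between two same-fingerprint calls
    # across a set of known-good traces
    smallest = float("inf")
    for trace in traces:
        last_pos: dict[str, int] = {}
        for i, (name, args) in enumerate(trace):
            fp = f"{name}:{json.dumps(args, sort_keys=True)}"
            if fp in last_pos:
                smallest = min(smallest, i - last_pos[fp])
            last_pos[fp] = i
    return 0 if smallest == float("inf") else int(smallest)

If min_legit_gap comes back as, say, 9, a WINDOW of 6 never flags a legitimate repeat.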

The richer version: backoff and kill-switch

The seven-line version aborts on the second repeat. Sometimes you want a graduated response. First repeat warns, second backs off, third kills.

import json
import time
from collections import defaultdict

counts: dict[str, int] = defaultdict(int)
last_seen: dict[str, float] = {}  # timestamps, handy for debugging and metrics

def graduated(name: str, args: dict, kill_at: int = 4) -> str:
    fp = f"{name}:{json.dumps(args, sort_keys=True)}"
    counts[fp] += 1
    n = counts[fp]
    if n >= kill_at:
        return "kill"            # fourth identical call: abort the trace
    if n >= 2:
        delay = 2 ** (n - 2)     # 1s on the second call, 2s on the third
        time.sleep(delay)
        last_seen[fp] = time.time()
        return "backoff"
    last_seen[fp] = time.time()
    return "ok"

Twelve lines. First call returns ok. Second call sleeps one second and returns backoff, which gives a transient error time to clear. Third call sleeps two seconds. Fourth call returns kill and you abort the trace.

This works against the failure mode the seven-line detector is too aggressive for: real transient 429s on a tool you legitimately need to call multiple times. Backoff lets the rate limit window pass. Kill stops the loop that backoff cannot fix.

Wire it into the dispatcher with the same shape:

for call in m.tool_calls:
    name = call.function.name
    args = json.loads(call.function.arguments)
    decision = graduated(name, args)
    if decision == "kill":
        fp = f"{name}:{json.dumps(args, sort_keys=True)}"
        return f"LOOP_KILL: {name} after {counts[fp]} identical calls"
    out = dispatch(call)

What about partial repeats

The fingerprint above hashes the entire arguments object. An agent that searches for "orders 2026-04-01" then "orders 2026-04-02" produces two different fingerprints and the detector lets both through. That is the right behavior most of the time. It is the wrong behavior when the agent is iterating through a list and each iteration is itself failing. You want to detect the pattern of failure, not the literal repeat.

The fix is a second detector keyed on (name, error_class) instead of (name, args).

err_counts: dict[tuple[str, str], int] = defaultdict(int)

def error_pattern(name: str, error: str, kill_at: int = 5) -> tuple[bool, str]:
    cls = error.split(":")[0].strip()  # rough error classifier
    err_counts[(name, cls)] += 1
    return err_counts[(name, cls)] >= kill_at, cls

Wire after the tool returns an error:

out = dispatch(call)
if out.startswith("ERROR"):
    kill, cls = error_pattern(name, out)
    if kill:
        return f"LOOP_KILL: {name} {cls} pattern"

Five lines. Catches the case the fingerprint detector misses: same tool, different args, same error five times in a row. Most often this is the agent iterating through a list of inputs against a misconfigured tool, expecting at least one call to succeed.
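
To see it trigger, replay that partial-repeat case (illustrative; note that error_pattern never sees the args, so varying them does not reset the count):

for _ in range(5):
    # different search args each call, same error class every time
    kill, cls = error_pattern("search_db", "429: rate_limited")

assert kill and cls == "429"  # fires on the fifth same-class error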

What this catches and what it does not

Loop detection catches the shape of failure that produced the $4,200 burn and the $47,000 invoice. It does not catch:

  • A trace that spends $400 in a single legitimate tool call (the cost ceiling guardrail catches this).
  • A trace that takes 200 sensible steps and produces wrong output (no detector catches this; you need offline evals).
  • A trace where the agent calls a destructive tool legitimately once with a wrong argument (the action-class guardrail catches this).

It catches what it is designed to catch (the call-fingerprint loop and the error-pattern loop), and those two shapes account for most of the public agent postmortems from the last twelve months.

Where to put it in your stack

Inside the tool dispatcher, before the actual tool runs. Not in the LangChain agent executor, not in the model call, not in a wrapper around the whole trace. The dispatcher is the only place that sees every tool call before it costs anything.

If you are running LangGraph, the dispatcher is your ToolNode or your custom tool-handling function. If you are on the OpenAI Assistants API, it is the function-call handler in your run loop. If you are on the Anthropic SDK, it is the tool_use block handler. The detector is the same in all three places: seven lines, a deque, a JSON hash.
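
For the Anthropic case, a minimal sketch of that tool_use handler with the detector wired in; the model id, tools, messages, and run_tool dispatch function are illustrative stand-ins for your own:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    tools=tools,          # your tool definitions
    messages=messages,    # the running conversation
)

for block in response.content:
    if block.type == "tool_use":
        # block.input is already a dict here, no json.loads needed
        if detect(block.name, block.input):
            raise RuntimeError(f"LOOP_KILL: {block.name} repeated")
        result = run_tool(block.name, block.input)  # your dispatcher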

The cost of the detector is one membership check over a six-item window per tool call. The cost of not having it is on Sattyam Jain's invoice and on Replit's public apology and on every team that has ever woken up to a bill they did not budget for.


If this was useful

The AI Agents Pocket Guide walks through the loop shapes, the detectors, and the dispatcher patterns these snippets come from. The LLM Observability Pocket Guide covers the trace data you need to tune the window size: what to put on the span so the next loop is one query away from being noticed before the bill clears.

