Gabriel Anhaia

Your Agent Just Called the Same Tool 47 Times. Here's the 20-Line Detector.


The $47K loop

A LangChain user burned roughly $47,000 in a single weekend because their agent looped on one tool call. The story made the rounds on Twitter and HN in 2023, and the shape of the failure has not changed. The agent called the same retrieval tool, with the same arguments, over and over, while the framework happily fed every result back into the next prompt and billed each round.

Ten seconds with the trace and you'd see it. Forty-seven spans in a row, same tool_name, same args payload, different timestamps. No human writes that. No model wants to write that. But put a tool-using agent in front of a fuzzy question with a slightly-broken tool and it'll grind on the same call until something kills it.

The thing that should have killed it is twenty lines of Python. It doesn't live in the agent. It lives in the trace pipeline, so it survives framework swaps, model upgrades, and the next refactor your team does at 4pm on a Friday.

Why max_iterations is the wrong knob

The advice you get on the first page of Google is "set max_iterations=10". This is wrong for the same reason a speed limit on a residential street is wrong for a highway. It punishes legitimate work.

Consider two agents running in the same product.

Agent A is a deep research assistant. It pulls a PDF, runs a search, summarizes, follows three citations, runs three more searches, dedupes the findings, and writes a memo. Eighty tool calls, all different, all useful. The user paid for that depth.

Agent B is a question-answerer with a flaky vector index. On query #1 it calls search_docs(query="refund policy"). The result is empty because of a stale embedding. The agent reasons "I should try again" and calls search_docs(query="refund policy") a second time. Then a third. By step 7 it has called the exact same tool with the exact same arguments seven times in a row.

A depth limit of 10 cuts off Agent A long before its eighty calls are done, and it still lets Agent B run all ten iterations, six more than it needs. You want the opposite: Agent A running as long as it's making progress, Agent B dying at iteration 4. Repetition is the signal, not depth.

The detector in 20 lines

Here it is. A sliding-window counter keyed on (tool_name, args_hash). Push every tool invocation. If any key shows up threshold times in the last window calls, raise.

from collections import deque, Counter
from dataclasses import dataclass, field
import hashlib
import json


class LoopDetected(Exception):
    pass


@dataclass
class LoopDetector:
    window: int = 10
    threshold: int = 4
    _calls: deque = field(default_factory=deque)

    def observe(self, tool_name: str, args: dict) -> None:
        key = (tool_name, _args_hash(args))
        self._calls.append(key)
        if len(self._calls) > self.window:
            self._calls.popleft()
        counts = Counter(self._calls)
        most_common_key, hits = counts.most_common(1)[0]
        if hits >= self.threshold:
            raise LoopDetected(
                f"{most_common_key[0]} called {hits}x "
                f"in last {len(self._calls)} steps"
            )


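def _args_hash(args: dict) -> str:
    # default=str keeps hashing from raising on values that are
    # not JSON-serializable (datetimes, Paths, custom objects)
    canonical = json.dumps(
        _canonicalize(args), sort_keys=True, default=str
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]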


_VOLATILE_KEYS = {
    "timestamp", "request_id", "trace_id", "span_id",
    "nonce", "now", "_ts", "correlation_id",
}


def _canonicalize(value):
    # strip keys that change every call but don't change intent
    if isinstance(value, dict):
        return {
            k: _canonicalize(v)
            for k, v in value.items()
            if k not in _VOLATILE_KEYS
        }
    if isinstance(value, list):
        return [_canonicalize(v) for v in value]
    return value

That's the whole detector. Twenty-ish lines depending on how you count the imports. Drop it in, call observe() after every tool invocation, catch LoopDetected, do something useful.
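As a rough illustration of the Agent A / Agent B contrast, the sketch below pushes both call patterns through a detector. To run on its own it uses a condensed stand-in (`MiniLoopDetector`, a hypothetical name) that mirrors the window and threshold logic above but skips the volatile-key stripping. The tool names and queries are made up for the example.

```python
from collections import Counter, deque
import hashlib
import json


class LoopDetected(Exception):
    pass


class MiniLoopDetector:
    """Condensed stand-in for the detector above (no volatile-key stripping)."""

    def __init__(self, window: int = 10, threshold: int = 4):
        self.threshold = threshold
        # deque(maxlen=...) drops the oldest call automatically
        self._calls = deque(maxlen=window)

    def observe(self, tool_name: str, args: dict) -> None:
        digest = hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()
        ).hexdigest()[:16]
        self._calls.append((tool_name, digest))
        (name, _), hits = Counter(self._calls).most_common(1)[0]
        if hits >= self.threshold:
            raise LoopDetected(
                f"{name} called {hits}x in last {len(self._calls)} steps"
            )


def run(calls, detector):
    """Feed calls to the detector; return the 1-based step that tripped, or None."""
    for step, (tool_name, args) in enumerate(calls, start=1):
        try:
            detector.observe(tool_name, args)
        except LoopDetected:
            return step
    return None


# Agent A: 80 calls, every query different
agent_a = [("search_web", {"query": f"citation {i}"}) for i in range(80)]
# Agent B: the same call with the same arguments, repeated
agent_b = [("search_docs", {"query": "refund policy"})] * 7

tripped_a = run(agent_a, MiniLoopDetector())
tripped_b = run(agent_b, MiniLoopDetector())
print(tripped_a, tripped_b)  # None 4
```

Agent A runs all eighty calls untouched. Agent B is stopped on its fourth identical call.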

The hash is truncated to 16 hex chars (64 bits). Collisions don't matter here. A collision can only produce a false positive (two distinct calls hashing the same), and to trip the detector it would have to happen threshold times inside one window. At 64 bits over a 10-call window that is statistically irrelevant. A collision cannot produce a false negative: identical arguments always hash to the same value, so a real loop never slips through because of the hash.
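For a sense of scale, here is a back-of-the-envelope bound on the truncated hash. It assumes the first 64 bits of SHA-256 behave like uniform random bits, and it only bounds the chance that any two distinct calls in one window share a hash. Tripping the detector would need threshold matching calls, which is rarer still.

```python
from math import comb

HEX_CHARS = 16
WINDOW = 10

bits = HEX_CHARS * 4      # each hex char carries 4 bits -> 64 bits
pairs = comb(WINDOW, 2)   # 45 distinct pairs of calls in one window
# union bound: P(any pair of distinct calls collides) <= pairs / 2**bits
p_any_collision = pairs / 2**bits
print(bits, pairs, f"{p_any_collision:.1e}")  # 64 45 2.4e-18
```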

Where to put it

You have three options, ranked from worst to best.

Inside the agent loop. You import LoopDetector into your agent runner and call observe() after each tool call. Easy. Also brittle. The day you swap LangChain for LangGraph, or move from one framework to another mid-quarter, the detector goes with the old code. You also have to remember to instrument every new agent. The third agent your team ships in a hurry won't have it.

Framework callback. LangChain has BaseCallbackHandler, LangGraph has node hooks, OpenAI's Agents SDK has lifecycle events. You write one callback that calls observe(). Better than inline. Still framework-specific. Still dies when you swap.

OTel span processor. This is where it belongs. Your traces already flow through a span processor on their way to the exporter. Add a SpanProcessor that watches for tool-call spans and runs the detector on them. Framework-agnostic. Cannot be forgotten. Catches every agent in your fleet that runs with your shared tracer setup, whether it was shipped today or last quarter.

The placement looks like this:

import json
import logging
from collections import defaultdict

from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
)

# LoopDetector and LoopDetected come from the detector listing above
logger = logging.getLogger(__name__)


class LoopDetectingProcessor(SpanProcessor):
    def __init__(self, inner: SpanProcessor):
        self.inner = inner
        # one detector per trace_id; nothing is evicted here, so a
        # long-lived process should cap or expire this map
        self._detectors = defaultdict(LoopDetector)

    def on_start(self, span, parent_context=None):
        self.inner.on_start(span, parent_context)

    def on_end(self, span) -> None:
        attrs = span.attributes or {}
        # GenAI semconv marks a tool invocation with
        # gen_ai.operation.name = "execute_tool"; the span name is
        # usually "execute_tool <tool name>", so an exact match on
        # "execute_tool" alone would miss most spans
        is_tool_call = (
            attrs.get("gen_ai.operation.name") == "execute_tool"
            or span.name.startswith("execute_tool")
        )
        if is_tool_call:
            tool_name = attrs.get(
                "gen_ai.tool.name", "unknown"
            )
            # tool args often live under gen_ai.tool.call.arguments
            raw_args = attrs.get(
                "gen_ai.tool.call.arguments", "{}"
            )
            try:
                args = json.loads(raw_args)
            except (TypeError, ValueError):
                args = {"_raw": str(raw_args)}

            trace_id = format(span.context.trace_id, "032x")
            detector = self._detectors[trace_id]
            try:
                detector.observe(tool_name, args)
            except LoopDetected as exc:
                # the span has already ended and is read-only in
                # on_end, so it can't be annotated here; log instead
                logger.warning(
                    "loop detected in trace %s: %s", trace_id, exc
                )
                # signal the agent runtime via your own channel:
                # Redis pub/sub, a kill flag in DB, etc.
        self.inner.on_end(span)

    def shutdown(self):
        self.inner.shutdown()

    def force_flush(self, timeout_millis=30000):
        return self.inner.force_flush(timeout_millis)

You wrap your existing span processor (typically the BatchSpanProcessor that sits around your exporter) and register the wrapper on the tracer provider. The detector now sees every tool span from every agent that runs through that provider. The attribute names follow the OpenTelemetry GenAI semantic conventions (gen_ai.tool.name, gen_ai.tool.call.arguments), so this code works with anything that emits those spans.
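One thing the per-trace map leaves open is cleanup: a detector is created for every trace id and never released. A minimal sketch of one way to bound it is below, assuming a least-recently-used policy is acceptable. `DetectorRegistry` and `max_traces` are hypothetical names, not part of OpenTelemetry, and the factory would be `LoopDetector` in the processor.

```python
from collections import OrderedDict
from threading import Lock
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class DetectorRegistry(Generic[T]):
    """LRU-bounded map from trace id to a per-trace detector."""

    def __init__(self, factory: Callable[[], T], max_traces: int = 10_000):
        self._factory = factory
        self._max_traces = max_traces
        self._items: "OrderedDict[str, T]" = OrderedDict()
        # span callbacks may arrive from several threads
        self._lock = Lock()

    def get(self, trace_id: str) -> T:
        with self._lock:
            if trace_id in self._items:
                # mark as most recently used
                self._items.move_to_end(trace_id)
                return self._items[trace_id]
            item = self._factory()
            self._items[trace_id] = item
            if len(self._items) > self._max_traces:
                # evict the least recently used trace
                self._items.popitem(last=False)
            return item

    def __len__(self) -> int:
        return len(self._items)


# usage with a stand-in factory; in the processor it would be LoopDetector
registry = DetectorRegistry(factory=list, max_traces=2)
a = registry.get("trace-a")
registry.get("trace-b")
registry.get("trace-a")  # touch trace-a, so trace-b is now the oldest
registry.get("trace-c")  # evicts trace-b
print(len(registry))     # 2
```

An evicted trace that later resumes starts with an empty window, so `max_traces` should sit comfortably above the number of agent runs in flight at once.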

Tuning window and threshold

Defaults that hold up in practice: window=10, threshold=4.

The reasoning. A well-behaved ReAct agent revisiting a tool because the first result was unclear will hit it twice, maybe three times with slightly different arguments. Four identical calls in ten steps means it's not exploring. It's stuck. Pushing threshold to 3 catches loops one step earlier but flags some legitimate retries. Pushing it to 5 lets one extra wasted call through per loop, which at GPT-4-class token rates is real money.

If your agents have exponential backoff baked in (call, wait, call again, wait longer), widen the window to 15-20 and keep threshold at 4. The backoff stretches the repetition over more steps, so a wider window catches it without being trigger-happy on legitimate retries.
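To see how window and threshold interact before changing production settings, you can replay a recorded call sequence against a grid of values. The sketch below is one way to do that. `first_trip` is a hypothetical helper that reimplements the same sliding-window count, and the trace is synthetic: one call that comes back every fourth step with unrelated calls in between, standing in for a backoff pattern.

```python
from collections import Counter, deque
from typing import Hashable, Iterable, Optional


def first_trip(
    calls: Iterable[Hashable], window: int, threshold: int
) -> Optional[int]:
    """Return the 0-based index of the call that would trip, or None."""
    recent = deque(maxlen=window)
    for index, key in enumerate(calls):
        recent.append(key)
        if Counter(recent).most_common(1)[0][1] >= threshold:
            return index
    return None


# synthetic trace: the same call returns at steps 0, 4, 8, 12, 16
trace = [
    "search_docs:refund" if step % 4 == 0 else f"other:{step}"
    for step in range(20)
]

results = {
    (window, threshold): first_trip(trace, window, threshold)
    for window in (10, 15, 20)
    for threshold in (3, 4, 5)
}
print(results[(10, 4)], results[(15, 4)], results[(20, 5)])  # None 12 16
```

With window=10 the spaced-out repetition never reaches four hits inside one window. Widening to 15 catches it on the fourth occurrence.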

If your tool catalog is small (3-5 tools) and the agent legitimately revisits one tool a lot, like read_file in a coding agent or search_web in a research agent, keep the key as (tool_name, args_hash) and resist simplifying it to tool_name alone. The args hash is what separates "called search_web 8 times with 8 different queries" (fine) from "called search_web 8 times with the same query" (broken).

What to do on detection

Three options, in increasing order of how much you trust your agent.

Killswitch. Default. Raise an exception, log the loop, return a structured error to the caller. Cheap and safe. The user retries.

Downgrade with a prompt. Inject a system message: "You have called search_docs four times with the same arguments. The tool is returning the same result. Try a different approach or stop and report what you've found." The model usually breaks out. Sometimes it doesn't, and then the killswitch fires on the next observation.

Page on-call. For agents where loops mean a real outage (say, an internal autonomous tool with no user retry) wire LoopDetected to PagerDuty. Rare, but for the agents that should never loop, the page is the right shape.

Start with the killswitch. Move to downgrade-with-prompt only after you have data on which loops are recoverable.
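If you do move past the plain killswitch, the escalation can be a small piece of state per run. The sketch below is one possible shape, not a prescribed design. `LoopPolicy`, `decide`, and `run_id` are hypothetical names, and the nudge text is the system message quoted above with the count filled in.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LoopPolicy:
    """Nudge the model on the first detection in a run, kill on the next."""

    max_nudges: int = 1
    _nudges: Dict[str, int] = field(default_factory=dict)

    def decide(self, run_id: str, tool_name: str, hits: int) -> Tuple[str, str]:
        used = self._nudges.get(run_id, 0)
        if used < self.max_nudges:
            self._nudges[run_id] = used + 1
            message = (
                f"You have called {tool_name} {hits} times with the same "
                "arguments. The tool is returning the same result. Try a "
                "different approach or stop and report what you've found."
            )
            return ("nudge", message)
        return ("kill", f"loop on {tool_name} persisted after {used} nudge(s)")


policy = LoopPolicy()  # max_nudges=0 behaves like the plain killswitch
first = policy.decide("run-1", "search_docs", hits=4)
second = policy.decide("run-1", "search_docs", hits=4)
print(first[0], second[0])  # nudge kill
```

One caveat: the detector raises on the most common key in the window, so after a nudge the repeated calls are still in there. The caller would probably want to clear that run's detector state when it nudges. Otherwise the next call trips again even if the model did change course.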

Two edge cases that bite

Non-deterministic args. The hash will diverge on every call if your tool args include a timestamp, a request ID, or a nonce. The canonicalizer above strips a known set of volatile keys (timestamp, request_id, trace_id, span_id, nonce, now, _ts, correlation_id) before hashing. Add to that set when you hit a new volatile field in your own tool schemas. The agent that smuggles created_at: <now> into its args is the agent whose loop you'll never catch otherwise.

Streaming tool calls. Some frameworks emit partial spans while a tool call is still running. Filter to spans with a gen_ai.tool.call.id and ignore any where the call is still streaming. Otherwise you'll count one slow tool call as multiple observations and false-positive yourself.
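Both edge cases can be sketched in a few lines. The first half extends the volatile set with created_at (a hypothetical field from your own schema) and shows that two calls differing only in that field hash the same. The second half counts each tool call once by its gen_ai.tool.call.id. How a framework marks a still-streaming span varies, so this sketch only dedupes on the call id. `should_observe` is a hypothetical helper name.

```python
import hashlib
import json

# baseline keys from the post, plus created_at as a local addition
VOLATILE_KEYS = {
    "timestamp", "request_id", "trace_id", "span_id",
    "nonce", "now", "_ts", "correlation_id",
    "created_at",
}


def canonicalize(value):
    """Recursively drop volatile keys so only intent-bearing args remain."""
    if isinstance(value, dict):
        return {
            k: canonicalize(v)
            for k, v in value.items()
            if k not in VOLATILE_KEYS
        }
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value


def args_hash(args: dict) -> str:
    canonical = json.dumps(canonicalize(args), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def should_observe(attrs: dict, seen_call_ids: set) -> bool:
    """Count each tool call once, keyed on gen_ai.tool.call.id."""
    call_id = attrs.get("gen_ai.tool.call.id")
    if call_id is None or call_id in seen_call_ids:
        return False
    seen_call_ids.add(call_id)
    return True


first = {"query": "refund policy", "created_at": "2024-01-01T00:00:00Z"}
second = {"query": "refund policy", "created_at": "2024-01-01T00:00:05Z"}
same_intent = args_hash(first) == args_hash(second)

seen = set()
span_attrs = {"gen_ai.tool.call.id": "call_1"}
decisions = [should_observe(span_attrs, seen), should_observe(span_attrs, seen)]
print(same_intent, decisions)  # True [True, False]
```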

Where this fits in your stack

The detector is one of three runtime guards every production agent should have.

Token budget. A hard cap on cumulative input + output tokens per agent invocation. Catches the "the prompt grew to 200K tokens" failure mode that loop detection misses.

Loop detector. The thing in this post. Catches stuck repetition.

Goal-completion verifier. A separate small LLM call at the end that checks "did this agent actually do what the user asked, or did it produce confident-sounding output that misses the point?" Catches the "ran for 30 steps, produced garbage" failure that the first two miss.

Run all three in the trace pipeline, not inside the agent. The agent is the unreliable part. The pipeline is where the guards go.
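Of the three guards, the token budget is the simplest to sketch. The version below is a minimal illustration, assuming you can read input and output token counts for each model call from your provider's usage data. `TokenBudget`, `charge`, and the 200K limit are hypothetical choices for the example.

```python
from dataclasses import dataclass


class TokenBudgetExceeded(Exception):
    pass


@dataclass
class TokenBudget:
    """Hard cap on cumulative input + output tokens for one agent invocation."""

    limit: int
    used: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> int:
        """Record one model call; return the remaining tokens or raise."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.limit} tokens"
            )
        return self.limit - self.used


budget = TokenBudget(limit=200_000)
remaining = budget.charge(input_tokens=50_000, output_tokens=2_000)
try:
    budget.charge(input_tokens=150_000, output_tokens=1_000)
    tripped = False
except TokenBudgetExceeded:
    tripped = True
print(remaining, tripped)  # 148000 True
```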

What's the worst agent loop you've shipped to production? Drop the trace in the comments. I want to know if anyone has beaten 47.


If this was useful

The runtime guard triad (token budget, loop detector, goal verifier) is one of the patterns in AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs. The book covers the rest of the production checklist: tool catalog discipline, sub-agent boundaries, replay and drift detection, and the trace-layer instrumentation that makes all of it observable. If you're shipping agents and want the patterns laid out in one place, that's the book.

