Mukunda Rao Katta

Posted on May 25

Your Agent Is Calling That Tool Again: tool-loop-guard

#hermeschallenge #ai #python #agents

A research agent called search("quantum computing") 34 times in a row.

Not 34 searches with different queries. The same query, 34 times. Each time the model got back a results page, it read it, decided it needed more information, and called search("quantum computing") again. The response always said to look for more info. The model always agreed. The loop had no exit.

The bill was $8 when someone noticed. The run had been going for about six minutes. Nothing had blown up. No error. No timeout. The agent was just... working. Doing exactly what it was told.

That is the failure mode tool-loop-guard addresses.

The Shape of the Fix

tool-loop-guard is a small Python library. You give it a window size and a threshold. It watches your tool calls. When the same tool with the same arguments appears at least the threshold number of times inside the last N calls, where N is the window size, it raises LoopDetected.

Install:

pip install tool-loop-guard

Basic usage:

from tool_loop_guard import LoopGuard, LoopDetected

guard = LoopGuard(window=10, threshold=3)

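# call_llm and run_tool are placeholders for your own LLM client and tool dispatcher.
while True:
    response = call_llm(messages)
    tool_calls = response.tool_calls

    if not tool_calls:
        break

    for tc in tool_calls:
        try:
            # Record before executing, so a repeated call is stopped before it runs.
            guard.record(tc.name, tc.arguments)
        except LoopDetected as e:
            print(f"Loop detected: {e.tool_name} called {e.count} times")
            raise  # re-raise so the run stops here

        result = run_tool(tc.name, tc.arguments)
        # Simplified: most chat APIs also expect the assistant message and a tool call id.
        messages.append({"role": "tool", "content": result})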

window=10 means the guard looks at the last 10 tool calls. threshold=3 means if any (tool_name, args_hash) pair appears 3 or more times in that window, it raises. You tune both numbers for your use case.
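To make the window and threshold semantics concrete, here is a minimal sketch of a sliding-window repeat counter. It is not the library's source. MiniLoopGuard and LoopDetectedSketch are hypothetical names invented for this illustration, and the real implementation may differ in how it keys and counts calls.

```python
from collections import deque
import json


class LoopDetectedSketch(Exception):
    """Stand-in exception for this sketch (not the library's LoopDetected)."""

    def __init__(self, tool_name, count):
        super().__init__(f"{tool_name} repeated {count} times")
        self.tool_name = tool_name
        self.count = count


class MiniLoopGuard:
    """Illustrative sliding-window repeat counter (hypothetical)."""

    def __init__(self, window=10, threshold=3):
        self.threshold = threshold
        # deque(maxlen=...) drops the oldest call once the window is full.
        self.calls = deque(maxlen=window)

    def record(self, tool_name, arguments):
        # Sorted-key JSON makes the key independent of dict ordering.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.calls.append(key)
        count = self.calls.count(key)
        if count >= self.threshold:
            raise LoopDetectedSketch(tool_name, count)
        return count
```

With window=3 and threshold=3, calls that cycle through three different tools never trip this sketch, because each call has aged out of the window before it can repeat three times. That is the tuning tradeoff: a small window forgives slow repetition, a large one catches it.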

The LoopDetected exception carries useful context:

except LoopDetected as e:
    print(e.tool_name)     # which tool
    print(e.args_hash)     # canonical hash of the args
    print(e.count)         # how many times in the window
    print(e.window)        # the full window at time of detection

You can reset the guard between agent turns if you want each turn to start fresh:

guard.reset()

Or you can let it span the full run if you want loop detection across turns. Both patterns make sense depending on your architecture.

What It Does NOT Do

A few things this library deliberately leaves out:

It does not detect semantically similar calls. If your agent searches for "python" and then "python programming language", those are two different arg hashes. Not a loop. Only structurally identical calls, same tool and same serialized arguments, trigger it.
It does not integrate directly into an LLM client. You call guard.record() yourself. There is no monkey-patching, no client wrapper.
It does not recover or retry. It raises and gets out of the way. What happens next is your call.
It does not track cost, tokens, or time. The boundary is calls. Combine with token-budget-py or agent-deadline if you want multiple constraints.
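Since recovery is left to you, one option is to tell the model what happened instead of ending the run. The helper below is a hypothetical sketch, not part of the library. It only builds a message in the same {"role": "tool", ...} shape used in the basic usage example, and your provider's message format may need more fields.

```python
def loop_notice(tool_name, count):
    """Build a tool-role message telling the model its call was blocked."""
    return {
        "role": "tool",
        "content": (
            f"Blocked: {tool_name} was called {count} times with identical "
            "arguments. Use the results you already have or try a different call."
        ),
    }
```

In the except LoopDetected branch you could append loop_notice(e.tool_name, e.count) to messages and skip run_tool for that call, rather than re-raising. If the model repeats the call anyway the guard will raise again, so it is worth capping how many notices you send before stopping the run.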

Inside the Lib: Structural-Only Matching

The most important design choice in this library is the decision to match structurally, not semantically.

This needs explaining because the intuition goes the other way. You might think: "my agent is spinning if it keeps doing the same kind of thing." But "same kind of thing" is a fuzzy concept. Implementing it requires embeddings, similarity thresholds, or heuristics that will be wrong in ways that are hard to predict.

Structural matching is simpler and more predictable. The library serializes the arguments dict into canonical JSON, sorted by key, then hashes the result. Two calls match if and only if they produce the same hash. search("python") and search("Python") have different hashes. search("python") called five times has the same hash every time.
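The canonicalize-then-hash step can be sketched in a few lines. This is an illustration under assumptions: canonical_args_hash is a hypothetical name, and the choice of SHA-256 and compact separators is mine, since the post does not say which hash or JSON settings the library uses.

```python
import hashlib
import json


def canonical_args_hash(arguments):
    """Hash a JSON-serializable arguments dict independent of key order."""
    # sort_keys gives one canonical ordering; compact separators remove
    # whitespace differences from the serialized form.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same keys in a different order produce the same hash.
same = canonical_args_hash({"query": "python", "limit": 5}) == canonical_args_hash(
    {"limit": 5, "query": "python"}
)

# A one-character case difference produces a different hash.
different = canonical_args_hash({"query": "python"}) != canonical_args_hash(
    {"query": "Python"}
)
```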

This means you will miss some loops. An agent that searches for "python", "Python", "python language", "python programming" in a tight cycle is probably stuck, but this library will not flag it. That is a tradeoff the library accepts.

What you gain: zero false positives on legitimate variation. An agent that calls a data API with slightly different parameters each time will never trigger the guard, even if it calls the API frequently. The guard fires only when the agent is provably doing the same thing over and over.

For the research agent story, structural matching is exactly what you want. The agent was calling search("quantum computing") with byte-for-byte identical arguments every time. That is the case the guard catches.

For semantic loop detection, you need a different tool. This is not that tool.
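Short of semantic detection, a cheap middle ground is to normalize arguments yourself before recording them, so that trivial variants collapse into one structural key. The sketch below is a hypothetical helper, not a library feature, and it assumes the arguments arrive as a dict.

```python
def normalize_args(arguments):
    """Return a copy with top-level string values trimmed, space-collapsed, lowercased."""
    normalized = {}
    for key, value in arguments.items():
        if isinstance(value, str):
            # Collapse runs of whitespace, strip the ends, ignore case.
            value = " ".join(value.split()).lower()
        normalized[key] = value
    return normalized
```

You would pass normalize_args(tc.arguments) to guard.record() while still running the tool with the original arguments. This makes search("python") and search("Python") count as the same call, but it is the wrong choice for case-sensitive arguments such as file paths, so apply it only to tools where you know what "same" means.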

When This Is Useful

Research agents that call search APIs. These are the most common spinners. The agent reads results, gets encouraged to look for more information, and re-issues the same query.

Agents that change external state, such as writing to a database or posting to an API. A loop that posts the same message 20 times is worse than one that repeats an expensive query: the query only costs money, while the duplicate writes have to be cleaned up afterwards.

Any agent where you want a safety net that fires before the cost cap. tool-call-budgets can cap total calls per tool, but it will not tell you that the same call is repeating. tool-loop-guard is specific: it detects repetition, not volume.

Debugging runs where you want to understand agent behavior. LoopDetected gives you the full window at the point of detection, which shows what the model was calling in the run-up to getting stuck.

When NOT to Use This

If your agent legitimately calls the same tool with the same arguments on purpose, for example a polling loop that checks a status endpoint every 30 seconds, you either need a higher threshold or a different approach entirely. The guard cannot distinguish "stuck" from "intentionally polling."
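One "different approach" for polling is to keep known pollers out of the guard entirely. The wrapper below is a hypothetical sketch. It relies only on the guard having the record(name, arguments) method shown earlier, and exempt_tools is a set of tool names you maintain yourself.

```python
def record_unless_exempt(guard, tool_name, arguments, exempt_tools):
    """Record the call on the guard unless the tool is a known poller.

    Returns True if the call was recorded, False if it was skipped.
    """
    if tool_name in exempt_tools:
        return False
    guard.record(tool_name, arguments)
    return True
```

Exempt calls never enter the window, so they neither trigger the guard nor push other calls out of it. The cost is that a poller which really is stuck goes unnoticed, so an exempt tool probably deserves a time or call cap of its own.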

If your concern is semantically similar calls rather than structurally identical ones, this will not help. Use a similarity-based approach, or NoProgress from llm-stop-conditions, which tracks whether the model is producing useful output.

If you want to limit total calls per tool regardless of repetition, tool-call-budgets is the right tool. The two have different jobs.

Install

pip install tool-loop-guard

Zero dependencies. Python 3.9+.

Source: MukundaKatta/tool-loop-guard

23 tests covering single-tool loops, multi-tool interleaving, window boundary behavior, threshold edge cases, and reset.

Siblings

These libraries cover adjacent boundaries in the same agent control plane:

Lib	Boundary	Repo
llm-stop-conditions	NoProgress stop condition catches stalled runs at the output level (no tool calls, very short output)	MukundaKatta/llm-stop-conditions
tool-call-budgets	Per-tool call cap, coarser control over total volume	MukundaKatta/tool-call-budgets
llm-circuit-breaker-py	Error-rate breaker, different trigger, same idea of stopping a runaway loop	MukundaKatta/llm-circuit-breaker-py
agent-deadline	Time-based deadline that catches infinite loops by running out the clock	MukundaKatta/agent-deadline

One thing worth calling out on the llm-stop-conditions relationship. That library's NoProgress condition fires when the model produces N consecutive turns with no tool calls and very short output. It is watching the output side. tool-loop-guard watches the input side, the calls you are about to make. A model stuck in a search loop will keep making tool calls, so NoProgress will not catch it. tool-loop-guard will.

What Is Next

A few things that would improve this library:

Configurable match functions. Right now the match is always on canonical args hash. A hook that lets you supply a custom similarity function would let you opt in to fuzzy matching for specific tools where you know what "same" means.

Async support. The current API is synchronous. An async-safe version with proper locking would fit naturally into asyncio agent loops.

Per-tool thresholds. Right now one threshold applies to all tools. Being able to say "allow up to 5 repeated calls on read_file but only 2 on write_file" would give you finer control without raising the global threshold.

Window inspection. An API to query the current window state without recording a new call would help if you want to log what the guard is tracking between iterations.
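The per-tool thresholds idea above can be sketched outside the library to show what the behavior might look like. Everything here is hypothetical: PerToolRepeatCounter, its parameters, and the exceeded() method are names made up for this illustration, not a planned API.

```python
from collections import deque
import json


class PerToolRepeatCounter:
    """Illustrative repeat counter with a threshold per tool (hypothetical)."""

    def __init__(self, window=10, default_threshold=3, thresholds=None):
        self.default_threshold = default_threshold
        self.thresholds = dict(thresholds or {})
        # One shared window across all tools; the oldest call is dropped when full.
        self.calls = deque(maxlen=window)

    def exceeded(self, tool_name, arguments):
        """Record the call; return True if its tool-specific threshold is reached."""
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.calls.append(key)
        limit = self.thresholds.get(tool_name, self.default_threshold)
        return self.calls.count(key) >= limit
```

Constructed with thresholds={"read_file": 5, "write_file": 2}, this sketch flags the second identical write_file call but lets read_file repeat four times before flagging the fifth. Tools without an entry fall back to default_threshold.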

The research agent that burned $8 is not an exotic edge case. Any agent that depends on external information will sometimes get into a state where the model believes the right answer is to look again. Without a guard, that state runs until something external stops it. The guard makes the loop self-terminating instead.

Part of the @mukundakatta agent tooling stack, built for the Hermes Agent Challenge.
