Why your AI agent loops forever (and how to break the cycle)

#ai #agents #python #debugging

The 3 AM tool-call loop from hell

Last month I deployed a ReAct-style agent to handle customer support triage. By 3 AM I had an alert: one user session had burned through 47,000 tokens in a single conversation. The agent had been calling the same search_knowledge_base tool 73 times in a row, with slightly different queries each time, never deciding to stop.

If you've built any kind of tool-using agent, you've probably seen this pattern. The agent gets stuck in a loop, either repeating the same action or oscillating between two actions. Tokens evaporate. Costs spike. Users wait forever for a response that never comes.

This isn't a model problem. It's an architectural problem. And once you understand what's actually happening inside the loop, the fix is straightforward.

What's actually happening inside the loop

A typical agent loop looks roughly like this:

def naive_agent_loop(user_query):
    messages = [{"role": "user", "content": user_query}]

    while True:
        response = llm.chat(messages, tools=AVAILABLE_TOOLS)

        # model decided to finalize
        if response.finish_reason == "stop":
            return response.content

        # otherwise, execute the tool call and feed the result back
        tool_call = response.tool_calls[0]
        result = execute_tool(tool_call.name, tool_call.args)

        messages.append(response.message)
        messages.append({"role": "tool", "content": str(result)})

The model generates an action, you execute it, you append the result to the context, and you ask the model what to do next. Repeat until the model says "I'm done."

The failure mode lives in that "until done" condition. Three things commonly go wrong:

The model has no concept of "I've already tried this." Each iteration looks at the conversation history, but if the history shows ten failed searches, the model often interprets that as "I should search harder" rather than "this approach isn't working."
Tool errors are silent or ambiguous. When a search returns an empty list, is that "no results found" or "the tool is broken"? The model can't tell, so it tries again with a different phrasing.
The stop condition is implicit. Many implementations only stop when the model produces a final-answer message. There's nothing forcing the model to ever produce one.

The fix: explicit state, hard limits, structured feedback

Here's a stripped-down version of the pattern I use now. It runs in plain Python against any chat-completion API:

import hashlib
import json

MAX_STEPS = 10

def hash_action(name: str, args: dict) -> str:
    # canonical JSON so {a:1, b:2} and {b:2, a:1} hash the same
    payload = json.dumps({"name": name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def safer_agent_loop(user_query):
    messages = [{"role": "user", "content": user_query}]
    seen_actions = set()

    for step in range(MAX_STEPS):
        response = llm.chat(messages, tools=AVAILABLE_TOOLS)

        if response.finish_reason == "stop":
            return response.content

        tool_call = response.tool_calls[0]
        action_id = hash_action(tool_call.name, tool_call.args)

        if action_id in seen_actions:
            # tell the model it's repeating instead of running the tool again
            observation = {
                "status": "duplicate",
                "message": "You already called this tool with the same args. Try something different or finalize your answer."
            }
        else:
            seen_actions.add(action_id)
            observation = execute_tool(tool_call.name, tool_call.args)

        messages.append(response.message)
        messages.append({"role": "tool", "content": json.dumps(observation)})

    # hard stop: don't raise, return whatever partial answer we can
    return synthesize_partial_answer(messages)

Three things changed:

Hard step limit. No matter what the model decides, the loop terminates after MAX_STEPS. Pick a number based on the actual task — for triage I use 8, for research workflows I sometimes go up to 20.
Action deduplication. Before executing a tool call, hash the (tool, args) pair and check whether we've already done it. If yes, return a synthetic observation telling the model so.
Structured error envelopes. Tools return a typed result, not raw strings. The model can see status: "no_results" vs status: "error" vs status: "ok" and make a better decision.

Detecting oscillation, not just repetition

Exact-duplicate detection catches the obvious case. But agents are clever enough to find creative ways to loop. The next pattern I had to handle: the agent calling search("authentication errors"), then search("auth errors"), then search("login failures") — semantically the same query, syntactically different.

A simple defense is to track the last N tool calls and check whether the agent is making progress:

from collections import deque

class ProgressTracker:
    def __init__(self, window: int = 4):
        self.window = window
        self.recent_tools = deque(maxlen=window)

    def record(self, tool_name: str) -> None:
        self.recent_tools.append(tool_name)

    def is_stuck(self) -> bool:
        # if the last N calls all hit the same tool, we're probably looping
        if len(self.recent_tools) < self.window:
            return False
        return len(set(self.recent_tools)) == 1

This isn't perfect — semantic similarity via embeddings would be more robust — but it catches roughly 80% of the oscillation cases I've seen in production without the complexity of a separate similarity model.

Why frameworks don't solve this for you

I've worked with several popular agent frameworks. Most of them give you a max_iterations parameter and call it a day. That's the floor of what you need, not the ceiling.

If you're building anything beyond a demo, you need:

Per-tool quotas, not just global step limits
Logging that captures the full action/observation trail so you can debug after the fact
A mechanism to inject "you've already tried this" context back into the model
A graceful exit path when the limit hits — return a partial answer, not an exception

There's a community-maintained list of agent learning resources on GitHub called Agent-Learning-Hub that covers a lot of these patterns at a deeper level, including pointers to academic papers on planning and reflection that helped me understand why the naive ReAct loop has these failure modes in the first place.

Prevention tips that have actually saved me

A few habits I've adopted after enough 3 AM alerts:

Log every action and observation, with timestamps. When something goes wrong in production, you want the full trace, not just the final state.
Set token budgets per conversation, enforced server-side. Don't trust the agent to police itself.
Write tools that return semantically useful errors. "No results for query X. Try a more general term." beats [].
Test with adversarial prompts. Specifically try inputs designed to confuse the agent and verify it bails out cleanly.
Track tool-call entropy. If the variance in your tool-call distribution drops over the course of a conversation, that's a leading indicator of stuck behavior.

Wrapping up

Agent loops failing in production almost always come down to missing state, missing feedback, or missing limits. The model isn't broken — it's doing exactly what the prompt and the architecture told it to do. Fix the architecture and the loops go away.

The hardest part is accepting that "let the model decide when to stop" isn't a strategy. You're the one writing the loop. Own the termination logic.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.