Your agent hit step 3 and has been calling search_docs for the last four minutes. The chat turns look slightly different each time. The tool responses are almost identical. Nothing is erroring. The token counter is ticking up. Your terminal sits there printing Entering new AgentExecutor chain... like it has all day.
You are not debugging a bug. You are watching a well-formed agent do exactly what you told it to do, which is call the same tool until something tells it to stop.
This is the same failure shape that produced the $47,000 LangChain agent loop in November 2025. Four agents running for 11 days, every call returning 200, every span green, nobody noticing until the cloud invoice arrived. The public incident had a monthly budget alert that fired two days too late. Yours has max_iterations=15 and a prayer.
The loop has three usual root causes. This post walks through each one, with the fix.
Cause 1: The tool description is ambiguous
The single most common reason an agent loops on a tool is that the tool's description does not tell the model when to stop calling it. The model reads the description, decides the tool is relevant, calls it, gets back a result that does not obviously satisfy the question, and decides the tool is still relevant. That is not a reasoning failure. That is you writing a bad docstring.
Here is the shape that loops:
```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the documentation."""
    return vector_store.similarity_search(query, k=3)
```
"Search the documentation" tells the model what the tool does. It says nothing about:
- what a good input looks like
- what the output represents
- when the tool has given you enough to answer
- when to give up and try a different tool
Claude or GPT will cheerfully call search_docs("react hooks"), get back three chunks, decide those chunks are not sufficient, call search_docs("react hooks useEffect"), get three more chunks, decide those are also not sufficient, and keep going.
The fix is a tool description that specifies the contract. Inputs, outputs, and termination:
```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the product documentation for a topic.

    Use this tool ONCE per user question to retrieve
    relevant documentation chunks. The tool returns the
    top 3 matching passages with source URLs.

    If the returned passages do not answer the user's
    question, do NOT call this tool again with a
    rephrased query. Instead, tell the user the
    documentation does not cover their question and
    suggest they contact support.
    """
    results = vector_store.similarity_search(query, k=3)
    return format_with_sources(results)
```
Two things changed. One, the description tells the model the tool is single-shot ("ONCE per user question"). Two, it tells the model what to do when the tool fails ("suggest they contact support"), which is the piece most descriptions leave out and the reason the loop starts.
Read your tool descriptions the way the model reads them. If there is no termination clause, there is no termination.
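You can even turn that reading into a cheap lint. The sketch below flags tools whose description never mentions stopping; the keyword list and the function name `missing_termination_clause` are heuristics invented here for illustration, not part of LangChain:

```python
# Heuristic lint: flag tool descriptions with no termination language.
# TERMINATION_HINTS is an illustrative guess, not a LangChain API.
TERMINATION_HINTS = ("once", "do not call", "don't call", "stop", "give up")

def missing_termination_clause(tools) -> list[str]:
    """Return names of tools whose description never says when to stop.

    Works on any object with `.name` and `.description` attributes,
    which LangChain tools expose.
    """
    flagged = []
    for t in tools:
        desc = (t.description or "").lower()
        if not any(hint in desc for hint in TERMINATION_HINTS):
            flagged.append(t.name)
    return flagged
```

Run it in CI over your tool registry and a new tool with a one-line docstring fails the build before it ever loops in production.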
Cause 2: The agent has no memory of what it already did
The second cause is structural. A LangChain AgentExecutor keeps the intermediate steps inside the current invoke call. But if your tool results are large, or if the framework summarizes scratchpad state to stay under a context window, the model ends up at step 12 looking at a chat history that reads: "the user asked a question, you called search_docs with query X, here is a summary of what happened." The summary is lossy. The model re-derives the plan and calls search_docs again because, as far as it can tell, it has not yet.
You can see this in the trace: the agent scratchpad contains fewer entries than the run's actual step count. The model is reasoning from a compressed picture, and the compression hides the repetition.
The fix is to give the agent an explicit, visible record of which tool was called with which argument and what it returned, and to feed that record back into the prompt on every step.
```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a documentation assistant.\n\n"
     "You have access to these tools: {tool_names}.\n\n"
     "IMPORTANT: Before calling any tool, check the\n"
     "TOOL_CALL_HISTORY below. If the same tool was\n"
     "already called with a similar input, do not call\n"
     "it again. Answer from the prior result or tell\n"
     "the user you cannot help.\n\n"
     "TOOL_CALL_HISTORY:\n{tool_call_history}"),
    MessagesPlaceholder("messages"),
])
```
Then maintain the history outside the agent, as a plain Python list, and render it in as text on every invoke:
```python
tool_call_history: list[dict] = []

def render_history() -> str:
    if not tool_call_history:
        return "(no tools called yet)"
    lines = []
    for i, call in enumerate(tool_call_history, 1):
        lines.append(
            f"{i}. {call['tool']}({call['input']!r}) "
            f"-> {call['output'][:120]}..."
        )
    return "\n".join(lines)

# Inside your step loop:
response = llm.invoke(PROMPT.format_messages(
    tool_names=", ".join(t.name for t in tools),
    tool_call_history=render_history(),
    messages=messages,
))
```
Two properties matter. First, the history is materialized in the prompt, not buried inside the scratchpad. The model reads it the same way it reads the user's question. Second, the history survives compression. If you truncate chat history to fit a context window, truncate messages, not tool_call_history.
This pattern is also the cheapest way to add an agent-level circuit breaker later. Once the history is visible in the prompt, counting repeated entries for a loop-detection check is three lines.
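With the history list materialized, the loop-detection check really is about three lines of logic. A sketch, assuming the `tool_call_history` dicts built above (the helper name `detect_repeats` is invented here):

```python
from collections import Counter

def detect_repeats(tool_call_history: list[dict], limit: int = 3) -> list[str]:
    """Return tool+input keys that appear more than `limit` times."""
    counts = Counter(
        f"{call['tool']}:{call['input']}" for call in tool_call_history
    )
    return [key for key, n in counts.items() if n > limit]
```

Call it at the top of each step and bail out of the run when it returns anything.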
Cause 3: The termination condition does not know what "done" looks like
The third cause is termination. LangChain's AgentExecutor stops on three things: the model emits a final_answer, you hit max_iterations, or you hit max_execution_time. That is a floor, not a ceiling. max_iterations=15 on a gpt-4o-mini agent with 2K-token tool responses is about $0.02 per failed run. On gpt-4o with larger responses, closer to $0.40. Per failed run. If the loop fires on 1% of your traffic and you serve a thousand requests a day, that is ten failed runs every day, every day, until you notice.
The fix is a termination condition that knows your domain. Two layers.
Layer one, a loop detector that trips on same-tool-same-input repetition:
```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RunState:
    tool_calls: Counter = field(default_factory=Counter)
    total_tokens: int = 0

class LoopDetector:
    def __init__(
        self,
        max_same_tool: int = 3,
        max_run_tokens: int = 50_000,
    ):
        self.max_same_tool = max_same_tool
        self.max_run_tokens = max_run_tokens

    def check(self, run: RunState, tool: str, arg: str) -> None:
        key = f"{tool}:{arg}"
        run.tool_calls[key] += 1
        if run.tool_calls[key] > self.max_same_tool:
            raise StopIteration(
                f"loop: {tool}({arg!r}) called "
                f"{run.tool_calls[key]} times"
            )
        if run.total_tokens > self.max_run_tokens:
            raise StopIteration(
                f"budget exceeded: {run.total_tokens} tokens"
            )
```
Wire it into the agent loop. The key is that LoopDetector.check runs before the tool executes, not after:
```python
detector = LoopDetector(max_same_tool=3, max_run_tokens=50_000)
run = RunState()

for step in range(MAX_STEPS):
    action = agent.plan(messages, tool_call_history)
    if action.kind == "final_answer":
        return action.output
    detector.check(run, action.tool, action.input)
    result = execute_tool(action)
    run.total_tokens += count_tokens(result)
    tool_call_history.append({
        "tool": action.tool,
        "input": action.input,
        "output": result,
    })
    messages.append(tool_result_message(action, result))
```
The detector allows three same-tool-same-input calls; the fourth attempt trips the break before the tool executes. That threshold is a product decision, not a constant you inherit from a library. Pick the number that matches your tools. For search_docs, 1 is probably right because the tool is single-shot. For get_order, 1 is right per order ID, but your agent may legitimately look up several. Key the counter on tool + input, not tool alone.
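One way to encode that per-tool decision is a plain mapping with a fallback. The names `TOOL_LIMITS` and `limit_for` are illustrative, and the numbers are the examples from the text, not defaults from any library:

```python
# Per-tool repeat limits; anything not listed gets the default.
TOOL_LIMITS = {
    "search_docs": 1,  # single-shot by contract
    "get_order": 1,    # per order ID, so key on tool + input
}
DEFAULT_LIMIT = 3

def limit_for(tool: str) -> int:
    """Repeat budget for a given tool name."""
    return TOOL_LIMITS.get(tool, DEFAULT_LIMIT)
```

A detector keyed on tool + input can then consult `limit_for(action.tool)` per check instead of one global `max_same_tool`.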
Layer two, a budget cap. max_run_tokens=50_000 on gpt-4o-mini is about $0.15 per run. Put a number there. Any number. A bad number is better than no number, because no number is what produced the $47K incident.
Shipping this
The three fixes compound. Tool descriptions that specify termination reduce the number of loops that start. A visible tool-call history lets the model see its own repetition. The loop detector catches what slips past both.
What you want in production is all three, plus a trace you can inspect. Under the OpenTelemetry GenAI semantic conventions, a LangChain agent run is an invoke_agent span with execute_tool children. Alert on the count of identical gen_ai.tool.name children under a single parent. More than three in one run? Page someone. That is a one-line Prometheus rule and the signal the $47K team was missing.
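As a sketch of what that rule could look like, assuming you also export a counter such as `agent_tool_calls_total` labeled by run and tool (the metric and label names here are assumptions, not part of the OTel semconv):

```promql
# Hypothetical: fires when any single run repeats the same tool
# more than three times within ten minutes.
max by (run_id, tool) (increase(agent_tool_calls_total[10m])) > 3
```

One caveat: a per-run label is high cardinality, so in practice you may prefer to run this count in your trace backend and export only the resulting loop-detected counter to Prometheus.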
If this was useful
The book this pattern is pulled from is Observability for LLM Applications. Chapter 6 walks through agent tracing under the OTel GenAI semconv. Chapter 16 covers cost tracking and the circuit-breaker pattern above. Chapter 18 is the production-readiness checklist the $47K team did not have.
- Book: Observability for LLM Applications — paperback and hardcover now · Ebook from Apr 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
