- Book: AI Agents Pocket Guide
- Also by me: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The agent burned $41.20 in tokens overnight. The Slack alert fired at 06:47 UTC: monthly LLM spend up 320% on a Sunday. You open the dashboard. One conversation. One user. The agent had called search_orders(customer_id="C-7841") 187 times in a row, each call returning the same empty result, each retry feeding the empty result back into the next prompt as "context I have so far."
You had logs. You had a token-spend metric. What you did not have was a way to see, at 04:12 UTC when the loop started, that the same tool was being called with the same input over and over. So the loop ran for two and a half hours.
The instrumentation that catches it in iteration one is small: one parent span per agent turn, one child span per tool call, attributes that make duplicates findable, and a single alert rule that fires before the bill does.
The shape of the trace
The OpenTelemetry GenAI semantic conventions define an agent span and tool execution spans as separate operations. Keep that shape. One span for the whole turn. One child span for each LLM call. One child span for each tool invocation. The parent carries the user-visible work; the children carry the loop signal.
agent.turn (parent)
├── llm.chat iteration=1
├── tool.execute name=search_orders, iteration=1
├── llm.chat iteration=2
├── tool.execute name=search_orders, iteration=2
├── llm.chat iteration=3
└── tool.execute name=search_orders, iteration=3
The thing you are looking for is a flat fan of identical tool.execute children under one parent. If you cannot see that shape in your tracing UI, no amount of extra logging will help you, because the loop will hide in the volume.
Standard OTel Python setup gets you that shape. Nothing exotic.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.runtime")
Point OTEL_EXPORTER_OTLP_ENDPOINT at whatever backend you use. The code below is backend-agnostic; the spans land wherever you send them.
The parent span: one per turn
A "turn" is one user message in, one final assistant message out, however many tool calls in between. That bracket is the parent span. Open it when the user message arrives, close it when you return the final answer or hit the iteration cap.
def run_turn(user_msg: str, conversation_id: str) -> str:
    with tracer.start_as_current_span(
        "agent.turn",
        attributes={
            "gen_ai.conversation.id": conversation_id,
            "gen_ai.agent.name": "support-bot",
        },
    ) as turn_span:
        result = agent_loop(user_msg, turn_span)
        turn_span.set_attribute("agent.iterations", result.iterations)
        turn_span.set_attribute("agent.stop_reason", result.stop_reason)
        return result.final_message
Two attributes carry most of the debugging weight. agent.iterations is the loop counter at the moment the turn ended. agent.stop_reason is one of final_answer, max_iterations, error, user_cancelled. When you go looking for stuck loops, you filter on stop_reason = max_iterations and the worst offenders are at the top of the list.
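Those strings are load-bearing: a filter on stop_reason = max_iterations only works if the runtime never emits max_iters or maxIterations. A minimal sketch that pins the values in one place (the enum name is mine, not from any SDK):

from enum import Enum

class StopReason(str, Enum):
    # The four terminal states a turn can end in. The str mixin
    # lets members be passed anywhere a plain string is expected.
    FINAL_ANSWER = "final_answer"
    MAX_ITERATIONS = "max_iterations"
    ERROR = "error"
    USER_CANCELLED = "user_cancelled"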
The tool span: one per call
Every tool invocation gets its own child span. This is where the loop becomes visible.
import hashlib
import json
import time

def call_tool(name, args, iteration):
    # Canonical JSON: identical args hash identically,
    # regardless of key order.
    input_hash = hashlib.sha256(
        json.dumps(args, sort_keys=True).encode()
    ).hexdigest()[:16]
    with tracer.start_as_current_span(
        "tool.execute",
        attributes={
            "tool.name": name,
            "tool.input_hash": input_hash,
            "iteration": iteration,
        },
    ) as span:
        start = time.perf_counter()
        try:
            result = TOOLS[name](**args)
            span.set_attribute("tool.result_size", len(str(result)))
            return result
        finally:
            # Runs on success and on exception, so latency is never lost.
            latency_ms = (time.perf_counter() - start) * 1000
            span.set_attribute("latency_ms", latency_ms)
Four attributes do the work. tool.name lets you group. tool.input_hash is the duplicate detector: same hash across iterations means same input means probable loop. iteration is the position in the turn. latency_ms catches the other failure mode: the tool is slow, the agent times out and retries, the retries pile up.
The hash is a 16-char prefix of SHA-256 over the canonicalized JSON args. Sixteen hex characters is 64 bits: far more entropy than any realistic number of tool calls per turn can collide against, and short enough to read in a trace UI without horizontal scroll.
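A quick sanity check of that canonicalization, reusing the same hashing as call_tool above: key order does not change the hash, so the detector is not fooled by the LLM emitting the same arguments in a different order.

import hashlib
import json

def input_hash(args: dict) -> str:
    return hashlib.sha256(
        json.dumps(args, sort_keys=True).encode()
    ).hexdigest()[:16]

# Same logical input, different key order: identical hash.
a = input_hash({"customer_id": "C-7841", "limit": 10})
b = input_hash({"limit": 10, "customer_id": "C-7841"})
assert a == b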
The agent loop, instrumented
Here is the loop with both spans wired in. The iteration counter is the same value passed to each child and stamped on the parent at the end.
MAX_ITERATIONS = 10

def agent_loop(user_msg, turn_span):
    history = [{"role": "user", "content": user_msg}]
    for i in range(1, MAX_ITERATIONS + 1):
        response = llm_call(history, iteration=i)
        if response.tool_calls:
            # In a real loop, also append the assistant's tool-call
            # message to history before the tool results.
            for tc in response.tool_calls:
                output = call_tool(tc.name, tc.args, iteration=i)
                history.append({"role": "tool", "content": output})
            continue
        return Result(
            final_message=response.content,
            iterations=i,
            stop_reason="final_answer",
        )
    return Result(
        final_message="(iteration cap reached)",
        iterations=MAX_ITERATIONS,
        stop_reason="max_iterations",
    )
The cap matters. Without MAX_ITERATIONS, the only thing standing between you and the $40 bill is the LLM eventually deciding it is done. That is not a control loop you want in your runtime. Pick a number — 10 is a reasonable default for support and retrieval agents, lower if your tools are expensive.
Per the OTel GenAI agent span convention, set the span kind to INTERNAL when the agent runs in-process and CLIENT when it calls out to a hosted agent runtime. Most Python agents are the first case.
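In the Python API that is the kind argument on start_as_current_span; a sketch of the in-process case:

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("agent.runtime")

# INTERNAL: the agent loop runs inside this process. Use
# SpanKind.CLIENT when the turn is a call to a hosted runtime.
with tracer.start_as_current_span(
    "agent.turn", kind=SpanKind.INTERNAL
) as turn_span:
    ...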
The alert that catches the loop
You have the spans. Now you need the rule that pages someone before iteration 187. Two conditions, OR'd together:
- agent.iterations > 8 on a closed agent.turn span.
- Same tool.input_hash appears more than 2 times under one agent.turn.
Condition 1 catches turns that are running long for any reason. Condition 2 catches the specific failure mode where the agent is making the same call repeatedly because the LLM is not adapting to the result.
In a backend with span-search APIs (Tempo, Honeycomb, Datadog), the second condition is a query you can put on a monitor. The pseudo-query:
SELECT trace_id, tool.input_hash, COUNT(*) AS dupes
FROM spans
WHERE name = 'tool.execute'
AND parent.name = 'agent.turn'
AND timestamp > now() - 5m
GROUP BY trace_id, tool.input_hash
HAVING dupes > 2
Three duplicate calls in five minutes is a threshold that tends to survive in production without false positives from legitimate retries. Tune to your tools: if you have an idempotent search you call twice on purpose, raise the bar.
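One way to encode that tuning, for the monitor query or the in-process check below, is a per-tool threshold table instead of a hardcoded 2. A sketch with hypothetical tool names:

# Hypothetical thresholds. Idempotent reads may repeat on purpose;
# tools with side effects should never repeat with identical input.
DUPLICATE_THRESHOLDS = {
    "search_kb": 3,      # idempotent search, sometimes called twice
    "create_refund": 1,  # any identical repeat is a bug
}
DEFAULT_DUPLICATE_THRESHOLD = 2

def threshold_for(tool_name: str) -> int:
    return DUPLICATE_THRESHOLDS.get(tool_name, DEFAULT_DUPLICATE_THRESHOLD)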
For backends without span-search, do the detection in-process. Keep a per-turn count for each (tool_name, input_hash) pair and emit a metric agent.duplicate_tool_calls{tool=...} when the count passes its threshold (2 by default). Alert on the metric.
class TurnState:
    def __init__(self):
        self.tool_call_counts = {}

    def record(self, name, input_hash):
        key = (name, input_hash)
        n = self.tool_call_counts.get(key, 0) + 1
        self.tool_call_counts[key] = n
        if n > 2:
            # DUPLICATE_TOOL_CALLS: a counter metric labeled by tool
            # name; one concrete wiring is sketched below.
            DUPLICATE_TOOL_CALLS.labels(tool=name).inc()
        return n
Wire that into the call_tool helper above and you have a metric that fires regardless of where your spans end up.
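The wiring is one extra parameter on call_tool and one record call before the tool executes. A sketch, assuming prometheus_client for the metric; any counter-style metrics client slots in the same way:

import hashlib
import json

from prometheus_client import Counter

DUPLICATE_TOOL_CALLS = Counter(
    "agent_duplicate_tool_calls_total",
    "Tool calls repeated with identical input within one turn",
    ["tool"],
)

def call_tool(name, args, iteration, turn_state: TurnState):
    input_hash = hashlib.sha256(
        json.dumps(args, sort_keys=True).encode()
    ).hexdigest()[:16]
    # Record before executing, so the metric fires even if the
    # duplicate call hangs or times out.
    turn_state.record(name, input_hash)
    ...  # same span and execution logic as the call_tool above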
A healthy turn looks like a parent span four seconds long with three children: one llm.chat, one tool.execute carrying tool.name=search_orders, iteration=1, latency_ms=89, one final llm.chat. Stop reason final_answer. agent.iterations=1. Move on, nothing to triage.
The bad turn from the opener, with this instrumentation in place, has the same parent shape, but by iteration 3 the tool.input_hash is identical across three tool.execute children. The duplicate counter trips. The alert fires inside the 5-minute window. The on-call kills the conversation before the spend crosses $2.

Same agent code, same model, same tool. The only difference is that the loop was visible the first time it tried to repeat itself. The most expensive bug an agent can have is the one your traces do not show; instrument the turn, instrument the tool, hash the input, and the loop stops being invisible.
If this was useful
The AI Agents Pocket Guide covers tool-call control loops, iteration caps, and the failure modes that make agents expensive in production. The LLM Observability Pocket Guide goes deeper on span design, attribute taxonomies, and the alert rules that actually catch agent regressions before the bill does. If you are wiring tracing into a working agent right now, both have chapters that map directly to the code above.

