- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
In November 2025, four LangChain agents on a legitimate production system entered an infinite conversation cycle. An Analyzer and a Verifier kept handing work back and forth: a misclassified error was treated as "retry with different parameters," so each agent dutifully re-invoked the other. The loop ran for 11 days before anyone noticed. The bill was $47,000.
Not a toy. Not a hackathon. Production.
The system had latency dashboards. It had error-rate dashboards. It had a monthly budget alert that fired on day 9 (two days too late). Every LLM call returned 200. Every tool call succeeded. Every span in every trace was green.
This post is about what a properly instrumented trace would have shown on day one — and the circuit-breaker pattern that turns that signal into a stop.
Why traditional alerts missed it
The alerts a team usually wires up for an LLM feature, in the order they bolt them on:
- Latency. A single LLM call that takes 1.4s is fine. An agent that calls itself 400,000 times at 1.4s each is a series of individually fine calls. Latency p99 on each span stays flat. The cumulative wall clock is invisible because it is not a property of any one span.
- Error rate. HTTP 200 on every call. No 5xx. No exceptions thrown. The retry loop is not a failure mode the transport layer has a vocabulary for.
- Token budget per request. Each request stays under its per-request cap. The cap was designed against single-turn calls. It does not know what a multi-turn agent is.
- Monthly cost alert. Fires once the damage is done. A monthly alert is a receipt, not a brake.
The 2026 FinOps survey found 98% of practices now manage some form of AI spend. Most of them are tracking what they spent, not controlling what they will spend next. Those are different instruments.
What the trace should have looked like
Under the OpenTelemetry GenAI semantic conventions, a tool-calling agent is expressed as a single invoke_agent parent span with alternating chat and execute_tool children. The standard attributes are gen_ai.agent.id, gen_ai.agent.name, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.tool.name.
For a healthy agent run, the span tree looks like this:
```
invoke_agent [agent=analyzer, total_tokens=4212]
├── chat gpt-4o-mini [input=310, output=64]
├── execute_tool [tool=search_orders]
├── chat gpt-4o-mini [input=380, output=82]
├── execute_tool [tool=get_order]
└── chat gpt-4o-mini [input=460, output=120]
```
Three to six children. Clean termination. Tokens add up to single-digit thousands.
For the looping agent, the span tree would have been:
```
invoke_agent [agent=analyzer, total_tokens=?]
├── chat gpt-4o-mini [input=310, output=64]
├── execute_tool [tool=verify_result]
├── chat gpt-4o-mini [input=340, output=71]
├── execute_tool [tool=verify_result]
├── chat gpt-4o-mini [input=360, output=74]
├── execute_tool [tool=verify_result]
├── ... (repeats 400,000 times) ...
```
The signal is not any single span. It is the count of execute_tool children under a single invoke_agent parent, grouped by gen_ai.tool.name. For a healthy agent that count sits in a tight, predictable band. Ten calls of the same tool under one parent? Weird. A hundred? Stop the world.
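As a sketch of that computation, assuming spans have already been exported as plain dicts (the span_id/parent_id/attributes schema here is simplified for illustration, not a real exporter format), the per-parent, per-tool count is a few lines:

```python
# Offline loop detection: count execute_tool children per
# (parent invoke_agent span, tool name) pair.
from collections import Counter

spans = [
    {"name": "invoke_agent", "span_id": "a1", "parent_id": None},
    {"name": "execute_tool", "parent_id": "a1",
     "attributes": {"gen_ai.tool.name": "verify_result"}},
    {"name": "execute_tool", "parent_id": "a1",
     "attributes": {"gen_ai.tool.name": "verify_result"}},
    # ...imagine 399,998 more of these in the incident trace...
]

def tool_call_counts(spans):
    counts = Counter()
    for s in spans:
        if s["name"] == "execute_tool":
            tool = s["attributes"]["gen_ai.tool.name"]
            counts[(s["parent_id"], tool)] += 1
    return counts

counts = tool_call_counts(spans)
# counts[("a1", "verify_result")] == 2 for this toy trace; in the
# incident it would have been 400,000 under one parent.
```

The same grouping works whether you run it as a batch job over exported traces or as a streaming check in a span processor.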
The circuit breaker that would have stopped it
Here is a minimum-viable quality-aware circuit breaker, adapted from Chapter 18 of the book. It trips on the repeated-tool-call pattern and on cumulative-token overshoot, not on HTTP status.
```python
# agent_breaker.py — trips on loops, not on HTTP errors.
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunState:
    same_tool_calls: Counter = field(default_factory=Counter)
    total_input_tokens: int = 0
    total_output_tokens: int = 0


class AgentCircuitBreaker:
    def __init__(
        self,
        max_same_tool: int = 8,
        max_run_tokens: int = 200_000,
    ):
        self.max_same_tool = max_same_tool
        self.max_run_tokens = max_run_tokens

    def check(self, run: RunState, next_tool: str) -> None:
        # Trip on the same tool repeating too often within one run.
        run.same_tool_calls[next_tool] += 1
        if run.same_tool_calls[next_tool] > self.max_same_tool:
            raise RuntimeError(
                f"loop detected: {next_tool} called "
                f"{run.same_tool_calls[next_tool]} times"
            )
        # Trip on cumulative token overshoot across the whole run.
        total = run.total_input_tokens + run.total_output_tokens
        if total > self.max_run_tokens:
            raise RuntimeError(
                f"token budget exceeded: {total} > {self.max_run_tokens}"
            )
```
Two numbers. max_same_tool=8 kills any loop that calls the same tool more than eight times in one agent run. max_run_tokens=200_000 caps cumulative spend per run at well under a dollar on gpt-4o-mini at current list prices. Both numbers are product decisions; neither is optional.
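To sanity-check what a 200_000-token cap actually costs, a back-of-envelope helper is enough. The per-million rates below are illustrative assumptions, not quoted prices; substitute your provider's current price sheet.

```python
# Back-of-envelope cost of a capped agent run.
# ASSUMED rates ($ per 1M tokens) for illustration only.
INPUT_PER_M = 0.15
OUTPUT_PER_M = 0.60

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent run at the assumed rates."""
    return (input_tokens / 1e6) * INPUT_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PER_M

# A 200k-token run split 150k input / 50k output:
cost = run_cost(150_000, 50_000)   # 0.0525 at the assumed rates
```

The point is not the exact figure; it is that the worst-case cost of a single run becomes a number you chose, instead of a number you discover on the invoice.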
Wire it into the agent loop:
```python
# agent.py — the only change that matters.
breaker = AgentCircuitBreaker()
run = RunState()

while not agent.finished:
    step = agent.next_step()
    if step.kind == "tool":
        breaker.check(run, step.tool_name)  # raises before the loop runs away
        result = execute_tool(step)
    else:
        response = call_llm(step.messages)
        run.total_input_tokens += response.usage.input_tokens
        run.total_output_tokens += response.usage.output_tokens
```
A tool call that goes past the limit raises. The agent run dies. The trace captures the RuntimeError on the span. Your on-call sees a single clean signal (agent.loop_detected) instead of a $47K invoice three weeks later.
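To see the trip happen end to end, here is a stand-alone simulation of the runaway verify_result loop against a stripped-down copy of the breaker (token accounting omitted for brevity; this re-declares the class so the snippet runs on its own):

```python
# Simulating the Analyzer↔Verifier loop against the breaker.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RunState:
    same_tool_calls: Counter = field(default_factory=Counter)

class AgentCircuitBreaker:
    def __init__(self, max_same_tool: int = 8):
        self.max_same_tool = max_same_tool

    def check(self, run: RunState, next_tool: str) -> None:
        run.same_tool_calls[next_tool] += 1
        if run.same_tool_calls[next_tool] > self.max_same_tool:
            raise RuntimeError(f"loop detected: {next_tool}")

run = RunState()
breaker = AgentCircuitBreaker()
tripped_at = None
for step in range(400_000):        # the loop that ran for 11 days
    try:
        breaker.check(run, "verify_result")
    except RuntimeError:
        tripped_at = step + 1      # 1-indexed call number
        break

# The breaker trips on the 9th identical tool call,
# not the 400,000th.
```

Nine calls instead of 400,000: the difference between a page at lunch and an invoice in three weeks.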
The alert that should have fired
An alert on invoke_agent spans with this PromQL shape, computed per tool, would have paged on day one:
```promql
# Tool-repetition alert: more than 10 calls to the same
# tool per agent over the last 5 minutes. Assumes a counter
# metric incremented once per tool call, labeled by agent and tool.
sum by (gen_ai_agent_id, gen_ai_tool_name) (
  increase(gen_ai_tool_call_total[5m])
) > 10
```
A handful of lines. It groups by agent ID and tool name, surfaces the exact loop as soon as it starts, and costs almost nothing to evaluate.
The team that took the $47K hit already had OpenTelemetry set up. They shipped spans. They just never alerted on this shape.
The three things an agent observability stack owes you
From Chapter 6 of the book, which walks through agent tracing in detail:
- A single invoke_agent parent span per user turn. Every chat and every tool call is a child of that parent. If your tracing is flat (every LLM call a separate root trace), you cannot compute "how many times did this agent call this tool in this turn." That is the metric you need.
- Cumulative usage roll-up. gen_ai.usage.input_tokens on the parent span, summed across all child calls. A single number per agent turn that can be alerted on.
- A circuit breaker the router knows about. The breaker is not in your application code. It is in the layer that dispatches the next step. LangGraph has one. LangChain has one behind a flag. LlamaIndex's workflow runner has one. If your agent framework does not, you are the circuit breaker.
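The usage roll-up is mechanical once the parent/child structure exists. A minimal sketch, assuming child spans are available as dicts with GenAI usage attributes (a simplified stand-in for whatever your span processor sees):

```python
# Roll up token usage from child chat spans onto the parent
# invoke_agent span's totals.
def rollup_usage(children):
    totals = {"input": 0, "output": 0}
    for c in children:
        if c["name"].startswith("chat"):
            totals["input"] += c["attributes"]["gen_ai.usage.input_tokens"]
            totals["output"] += c["attributes"]["gen_ai.usage.output_tokens"]
    return totals

children = [
    {"name": "chat gpt-4o-mini",
     "attributes": {"gen_ai.usage.input_tokens": 310,
                    "gen_ai.usage.output_tokens": 64}},
    {"name": "execute_tool",
     "attributes": {"gen_ai.tool.name": "get_order"}},
    {"name": "chat gpt-4o-mini",
     "attributes": {"gen_ai.usage.input_tokens": 380,
                    "gen_ai.usage.output_tokens": 82}},
]

totals = rollup_usage(children)   # {"input": 690, "output": 146}
```

One number per agent turn, written onto the parent span, is what makes the cumulative-cost alert a single comparison instead of a join.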
If this was useful
The pattern is cheap. The alert is a few lines of PromQL. The circuit breaker is 30 lines of Python. The book (Observability for LLM Applications) walks through the full OTel GenAI instrumentation, the eval layer that catches quality regressions on top of cost regressions, and the incident playbook for when a loop still slips through.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
