A team at a fintech startup handed their research agent a $100 budget for a weekend experiment. By Monday morning it had burned $97 before anyone noticed. The agent was doing what it was told. It kept calling the same vector-search tool with slightly different queries, convinced each one would finally return the answer. The fix was 40 lines of guardrail code.
You want an agent that quits. You want it to quit early, quit loudly, and quit on a signal that is not your credit-card bill. An agent that self-terminates after three minutes of going nowhere is working as designed. An agent that runs for 11 days on a retry loop is not.
The $47K LangChain incident in November 2025 is the cautionary anchor here. Four agents, a misclassified error treated as retry-with-different-parameters, eleven days of HTTP 200 responses. The monthly budget alert fired on day nine. That alert was a receipt, not a brake.
What firing yourself looks like
Three guardrails cover most of the damage surface:
- Budget cap — tokens and wallclock. Two numbers, both hard.
- Same-tool-loop detection — any single tool called more than N times under one agent run trips the breaker.
- Self-reported termination — the model declares done or need_help via structured output. You trust it enough to let it quit, not enough to let it run forever.
None of this replaces observability. It sits underneath it. Observability tells you why the agent died. Guardrails tell you that it died before the bill arrives.
The minimum-viable wrapper
Here is the core of it. A small state object, a guardrail that inspects every step, and a structured output schema that lets the model tap out when it is stuck. Python, standard library plus pydantic.
```python
# guardrails.py — the agent's seatbelt.
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunState:
    started_at: float = field(default_factory=time.monotonic)
    tool_calls: Counter = field(default_factory=Counter)
    input_tokens: int = 0
    output_tokens: int = 0


class GuardrailTripped(Exception):
    """Raised when the agent should stop. Not a bug."""


class Guardrail:
    def __init__(
        self,
        max_tokens: int = 100_000,
        max_seconds: float = 180.0,
        max_same_tool: int = 6,
    ):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_same_tool = max_same_tool

    def check(self, run: RunState, next_tool: str | None = None):
        elapsed = time.monotonic() - run.started_at
        if elapsed > self.max_seconds:
            raise GuardrailTripped(
                f"wallclock: {elapsed:.1f}s > {self.max_seconds}s"
            )
        total = run.input_tokens + run.output_tokens
        if total > self.max_tokens:
            raise GuardrailTripped(
                f"tokens: {total} > {self.max_tokens}"
            )
        if next_tool:
            run.tool_calls[next_tool] += 1
            count = run.tool_calls[next_tool]
            if count > self.max_same_tool:
                raise GuardrailTripped(
                    f"loop: {next_tool} called {count} times"
                )
```
Two numbers and a counter. max_same_tool=6 means the agent gets six swings at the same tool before the breaker assumes it is stuck. max_seconds=180 is the three-minute floor. max_tokens=100_000 caps the spend per run on gpt-4o-mini at well under a dollar.
None of these are defaults you copy blindly. They are product decisions. A research agent with long context windows wants different numbers than a support bot answering FAQs. The point is that the numbers exist and the code enforces them.
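One way to make those product decisions explicit is a per-workflow table of caps. A minimal sketch — the workflow names and numbers here are invented, not recommendations:

```python
# Per-workflow guardrail budgets. All numbers are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_tokens: int
    max_seconds: float
    max_same_tool: int


BUDGETS = {
    # A long-context research agent gets more rope.
    "research": Budget(max_tokens=200_000, max_seconds=300.0, max_same_tool=8),
    # A support bot answering FAQs should answer fast or escalate.
    "support_faq": Budget(max_tokens=20_000, max_seconds=30.0, max_same_tool=3),
}


def budget_for(workflow: str) -> Budget:
    # Unknown workflows fall back to the tightest budget, not the loosest.
    return BUDGETS.get(workflow, BUDGETS["support_faq"])
```

The fallback direction matters: a workflow nobody registered should get the stingiest caps until someone makes a deliberate call.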
Letting the model quit
The structured-output trick is underused. Most agent frameworks assume the model wants to keep going and the harness decides when to stop. Flip it. Give the model a way to declare it is done or stuck, and trust that signal.
```python
# schema.py — the model tells you what happens next.
from typing import Literal

from pydantic import BaseModel


class AgentDecision(BaseModel):
    status: Literal["continue", "done", "need_help"]
    reason: str  # one sentence, always required
    next_tool: str | None = None
    final_answer: str | None = None
```
The agent returns one of these on every turn. continue means take another step. done means it has a final answer. need_help means it has hit a wall and would rather escalate than keep burning tokens on a dead end. All three are valid terminal states for the turn.
Wire it into the loop:
```python
# agent.py — the loop that knows how to quit.
def run_agent(task: str, guardrail: Guardrail) -> str:
    run = RunState()
    messages = [{"role": "user", "content": task}]
    while True:
        guardrail.check(run)
        # call_llm_structured returns the parsed AgentDecision plus the
        # provider-reported token usage for that call.
        decision, usage = call_llm_structured(messages, AgentDecision)
        run.input_tokens += usage.input
        run.output_tokens += usage.output
        if decision.status == "done":
            return decision.final_answer or ""
        if decision.status == "need_help":
            raise GuardrailTripped(
                f"model escalated: {decision.reason}"
            )
        if decision.next_tool:
            guardrail.check(run, decision.next_tool)
            result = execute_tool(decision.next_tool)
            messages.append({"role": "tool", "content": result})
```
GuardrailTripped is not an error in the HTTP sense. It is the agent finishing on a signal that is not the happy path. You catch it at the edge, log it as a first-class outcome, and move on. The team you hand this to should see three terminal states: done, escalated, tripped. All three are fine. All three are telemetry.
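Catching it at the edge can be a thin wrapper that maps each run to one of those three outcomes. A sketch — handle_run and the log format are stand-ins, and the agent callable is injected so anything that raises GuardrailTripped works:

```python
import logging

log = logging.getLogger("agent")


class GuardrailTripped(Exception):
    """Terminal signal, not a failure."""


def handle_run(task, run_agent):
    # Returns (outcome, payload). All three outcomes are first-class telemetry.
    try:
        answer = run_agent(task)
    except GuardrailTripped as trip:
        reason = str(trip)
        # The loop raises "model escalated: ..." for need_help; everything
        # else is a breaker trip (wallclock, tokens, same-tool loop).
        outcome = "escalated" if reason.startswith("model escalated") else "tripped"
        log.info("agent_run outcome=%s reason=%s", outcome, reason)
        return outcome, reason
    log.info("agent_run outcome=done")
    return "done", answer
```

The point of the wrapper is that nothing downstream ever sees an unhandled exception: every run ends as a labeled outcome you can count on a dashboard.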
Why "same-tool-loop" catches what tokens miss
Token budgets are a blunt instrument. By the time you notice a token overshoot on a single run, you have paid for it. The same-tool-loop counter catches the shape of a stuck agent long before the bill grows.
The November 2025 LangChain incident is instructive: an Analyzer agent called a verify_result tool thousands of times under one logical run. Each call was cheap. Each call succeeded. The aggregate was the disaster. A max_same_tool=6 check would have tripped on call seven. The whole run would have died in the first few seconds of what became an eleven-day bleed.
The rule of thumb a team I know uses: if any single tool is called more than five times under one agent turn, you have a bug. Either the tool is broken, or the planner is broken, or the prompt is teaching the model to retry when it should escalate. All three want human eyes.
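To see the breaker's timing concretely, here is the same-tool check in isolation — a minimal re-statement of the counter logic from the Guardrail class above, run against a hypothetical verify_result tool:

```python
from collections import Counter


class GuardrailTripped(Exception):
    """Terminal signal, not a failure."""


def check_tool(calls: Counter, tool: str, max_same_tool: int = 6) -> int:
    # Count the call first, then trip once the cap is exceeded.
    calls[tool] += 1
    if calls[tool] > max_same_tool:
        raise GuardrailTripped(f"loop: {tool} called {calls[tool]} times")
    return calls[tool]


calls: Counter = Counter()
for _ in range(6):
    check_tool(calls, "verify_result")  # calls 1 through 6 pass

tripped = None
try:
    check_tool(calls, "verify_result")  # call 7 trips the breaker
except GuardrailTripped as exc:
    tripped = str(exc)
```

Six cheap, individually successful calls go through; the seventh kills the run. That is the whole difference between a few seconds of waste and an eleven-day bleed.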
What the fintech team actually shipped
Back to the opening story. The fintech agent was researching companies from a public filings dataset. The vector-search tool was returning near-misses the model could not distinguish from hits. Every turn the model decided the next query would finally land, and every turn the search returned another plausible-looking miss. The agent was not looping in the classic sense. It was grinding.
What killed the bleed was the wallclock cap. The team set max_seconds=120 on the research workflow. The agent would run for two minutes, trip the guardrail, log a tripped:wallclock outcome, and hand the half-finished state to a human reviewer. Most of the time the reviewer saw that three candidate matches were close enough and closed the ticket. The agent did 60% of the work, flagged its own uncertainty, and stopped.
Seven dollars a ticket instead of ninety-seven dollars a weekend. The agent was shipping value the whole time. It just needed a way to hand off before the cost got weird.
Where guardrails end and observability begins
The guardrail is the brake. You still need the dashboard.
- Per-run token histogram. The 99th percentile should sit well below your max_tokens. If p99 is near the cap, your cap is too tight or your agent is flailing on a subset of inputs.
- Trip-reason distribution. Three outcomes, plotted as a stacked bar: done, escalated, tripped. If the tripped slice grows, your agent is hitting a shape of input it cannot handle. That is a prompt or tooling problem, not a budget problem.
- Same-tool-call top-N. Group execute_tool spans by gen_ai.tool.name under each invoke_agent parent, take the max per run, sort descending. The top of that list is where your loops live.
None of these are exotic metrics. They fall out of OpenTelemetry's GenAI semantic conventions if your tracing is wired up at the agent level instead of the LLM-call level. Chapter 6 of the book walks through the span layout; Chapter 18 covers the alert shapes.
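The third metric can be sketched in a few lines, assuming your spans are flattened to dicts carrying a run id and the tool name. The gen_ai.tool.name key follows the OTel GenAI conventions; the flattened export shape is an assumption:

```python
from collections import Counter, defaultdict


def same_tool_top_n(spans, n=3):
    # Call count per (run, tool) pair, then the worst run for each tool.
    per_run = Counter((s["run_id"], s["gen_ai.tool.name"]) for s in spans)
    worst: dict[str, int] = defaultdict(int)
    for (run_id, tool), count in per_run.items():
        worst[tool] = max(worst[tool], count)
    return sorted(worst.items(), key=lambda kv: -kv[1])[:n]


# Hypothetical export: one run hammering verify_result, plus normal traffic.
spans = (
    [{"run_id": "r1", "gen_ai.tool.name": "verify_result"}] * 9
    + [{"run_id": "r1", "gen_ai.tool.name": "fetch_filing"}] * 2
    + [{"run_id": "r2", "gen_ai.tool.name": "verify_result"}] * 3
)
```

Taking the max per run rather than the sum across runs is deliberate: fifty runs each calling a tool twice is healthy traffic, one run calling it nine times is a loop.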
The part nobody wants to hear
A guardrail that fires is a guardrail that is working. When your on-call channel lights up with GuardrailTripped messages, the instinct is to widen the caps. Resist that for a week. Look at the trips. Most of them are telling you something the tests did not catch. A planner that calls the same tool eight times is a planner that needs a better prompt, not a planner that needs permission to call it eighty.
The agents that quit early save you from the agents that never quit at all.
If this was useful
Agent guardrails are an observability problem in disguise. The budget, the wallclock, the tool-call counter — they all live in the same trace that an APM stack can already ship, if the semantic conventions are right. The book, Observability for LLM Applications, covers the OTel GenAI instrumentation, the alert shapes that catch loops before the bill, and the incident playbook for when one slips through anyway.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook from April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI coding tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
