Agent Guardrails: Loop Limits, Cost Caps, and Human Approval Gates

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ship the agent on a Friday. The flag goes on for a
small cohort, the traces look green, you close the laptop.
On Sunday the on-call channel has a quiet message from
finance: "is this bill right?" It is not right. One
session hit a broken tool, the model decided the last call
almost worked, and it retried for thirty-six hours.
Nobody was watching, because nothing told anyone to watch.

That is the failure guardrails exist to prevent. Not the
exotic prompt-injection chain from the security papers.
The boring one: an autonomous loop with no ceiling, spending
money and touching systems at the speed of its own retries.

An agent is a while loop that calls a model, runs the
tools the model asks for, and feeds the results back. The
loop is the product. It is also the liability. Three
guardrails turn the liability back into a product: a step
ceiling, a per-run budget denominated in dollars, and an
approval gate on the verbs that change the world. Build
all three before the loop sees traffic.

Step ceilings: make the agent fail fast

An agent that does not know when to stop will run forever.
Every framework ships a loop limit, and most defaults are
too generous. LangGraph 0.4 defaults recursion_limit to 25
and raises a hard GraphRecursionError when you hit it. The
OpenAI Agents SDK uses max_turns and raises
MaxTurnsExceeded. Both are fine. Both need wrapping so a hit
ceiling returns a structured result instead of a stack trace
your users see.
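One way to do that wrapping, as a minimal sketch rather than either framework's own API: catch whatever exception types you name as ceilings and turn them into a result object. RunResult, run_with_ceiling, and FakeCeilingError are hypothetical names for illustration. In a real harness you would pass the framework's own exception class instead of the fake one.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RunResult:
    ok: bool
    output: Any = None
    stop_reason: Optional[str] = None  # e.g. "loop_limit"


def run_with_ceiling(
    run: Callable[[], Any],
    ceiling_errors: tuple,
) -> RunResult:
    """Call run() and convert the named ceiling errors into a RunResult."""
    try:
        return RunResult(ok=True, output=run())
    except ceiling_errors:
        # Only the exception types the caller names count as ceilings.
        # Anything else still propagates as a real error.
        return RunResult(ok=False, stop_reason="loop_limit")


class FakeCeilingError(Exception):
    """Stand-in for a framework's limit exception (hypothetical)."""


def runaway():
    raise FakeCeilingError("too many turns")


result = run_with_ceiling(runaway, (FakeCeilingError,))
print(result.ok, result.stop_reason)
```

The caller branches on result.ok and never sees a traceback for a ceiling hit.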

Pick the number from the task shape, not the framework
default. A user-facing chat agent is almost always broken
past 8 to 12 turns. A background research agent lives
around 20 to 30. A coding agent runs closer to 100, but
only because a wall-clock cap and a token budget bite long
before the step counter does. The default exists to avoid
surprising new users, not to protect your bill.

The step counter is one axis. Wall-clock time and total
tokens are the other two. Whichever hits first wins, and
every ceiling exits through the same path so everything
downstream understands one failure shape.

from dataclasses import dataclass
from time import monotonic


@dataclass
class Ceilings:
    max_steps: int = 12
    max_seconds: float = 120.0
    max_tokens: int = 200_000


def tripped(steps, tokens, started, c: Ceilings):
    if steps >= c.max_steps:
        return "loop_limit"
    if monotonic() - started > c.max_seconds:
        return "time_limit"
    if tokens >= c.max_tokens:
        return "token_limit"
    return None

Call tripped at the top of the loop, before the next
model call. Check first, then spend. If you check after,
you have already paid for the call that put you over.
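A minimal sketch of that ordering. Ceilings and tripped are repeated from above so the block runs on its own. fake_model is a hypothetical stub that never finishes and reports a flat 1,000 tokens per call. The returned dict shape is an assumption too, not a framework contract.

```python
from dataclasses import dataclass
from time import monotonic


@dataclass
class Ceilings:
    max_steps: int = 12
    max_seconds: float = 120.0
    max_tokens: int = 200_000


def tripped(steps, tokens, started, c):
    if steps >= c.max_steps:
        return "loop_limit"
    if monotonic() - started >= c.max_seconds:
        return "time_limit"
    if tokens >= c.max_tokens:
        return "token_limit"
    return None


def fake_model(step):
    # Hypothetical stub: never finishes, 1,000 tokens per call.
    return {"done": False, "tokens": 1_000}


def run(c: Ceilings) -> dict:
    steps, tokens, started = 0, 0, monotonic()
    while True:
        reason = tripped(steps, tokens, started, c)  # check first
        if reason:
            return {"status": "stopped", "reason": reason, "steps": steps}
        reply = fake_model(steps)                    # then spend
        steps += 1
        tokens += reply["tokens"]
        if reply["done"]:
            return {"status": "ok", "reason": None, "steps": steps}


print(run(Ceilings(max_steps=3)))
```

Every ceiling leaves through the same return statement, so the caller handles one failure shape no matter which axis tripped.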

A step ceiling alone still misses the agent that takes
twelve steps and makes no progress: same tool, same
arguments, same result, twelve times. Hash the
(tool_name, arguments) tuple on each step and abort when
the last three are identical.

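import json


def no_progress(steps: list[tuple[str, dict]]) -> bool:
    # True when the last three (tool_name, arguments) pairs match.
    if len(steps) < 3:
        return False
    # sort_keys keeps argument order from hiding a repeat.
    keys = {json.dumps(s, sort_keys=True, default=str)
            for s in steps[-3:]}
    return len(keys) == 1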

Cost caps: the budget is denominated in dollars

An agent can stay under the step ceiling and still spend a
thousand dollars if each turn pulls a hundred-thousand-token
document into context and asks Claude Opus about it. A raw
token count hides this, because tokens are not one price.

Cached input tokens are cheap. Fresh input tokens are
medium. Output tokens are expensive. Reasoning tokens on an
extended-thinking model bill at the output rate, and a
single turn can emit far more of them than visible output. A
budget counted in tokens flattens that into a lie. Count it
in dollars.

PRICE_PER_MTOK = {
    "cache_read": 1.50,
    "input": 15.00,
    "output": 75.00,
    "reasoning": 75.00,
}


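def cost_usd(usage) -> float:
    # `usage` is a normalized record with the four counters
    # below. Vendor SDKs name these fields differently, so map
    # their usage object onto this shape before calling.
    return (
        usage.cache_read_tokens * PRICE_PER_MTOK["cache_read"]
        + usage.input_tokens * PRICE_PER_MTOK["input"]
        + usage.output_tokens * PRICE_PER_MTOK["output"]
        + usage.reasoning_tokens * PRICE_PER_MTOK["reasoning"]
    ) / 1_000_000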

Those are representative Opus-class figures per million
tokens (Anthropic public pricing circa 2026 — check current
rates); substitute your vendor's published numbers. What
matters is that the ceiling is a dollar amount, because a
thousand reasoning tokens on a frontier model cost more
than ten thousand cached input tokens on a small one.
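A worked example with the figures from the table makes the gap concrete. The turn shape (100,000 input tokens, 2,000 output tokens) and the usd helper are assumptions for illustration. The prices are the representative ones above, not a quote.

```python
PRICE_PER_MTOK = {"cache_read": 1.50, "input": 15.00, "output": 75.00}


def usd(cache_read=0, fresh_input=0, output=0) -> float:
    """Dollar cost of one turn at the representative prices."""
    return (
        cache_read * PRICE_PER_MTOK["cache_read"]
        + fresh_input * PRICE_PER_MTOK["input"]
        + output * PRICE_PER_MTOK["output"]
    ) / 1_000_000


# Both turns move the same 102,000 tokens.
cached_turn = usd(cache_read=100_000, output=2_000)
fresh_turn = usd(fresh_input=100_000, output=2_000)
twelve_fresh_turns = 12 * fresh_turn

print(f"cached turn:     ${cached_turn:.2f}")         # $0.30
print(f"fresh turn:      ${fresh_turn:.2f}")          # $1.65
print(f"12 fresh turns:  ${twelve_fresh_turns:.2f}")  # $19.80
```

A token ceiling sees two identical turns. A dollar ceiling sees one that costs more than five times the other.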

You want cost control in two places. A per-run budget
inside the harness, and a per-key budget at the gateway.
The in-process budget is the cheapest to write, so write it
first: accumulate usage across every call and abort when
the run crosses its cap.
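A minimal sketch of the in-process half, assuming each call's dollar cost is already computed. RunBudget and BudgetExceeded are hypothetical names, and the flat $0.30 per call is made up for the demo.

```python
class BudgetExceeded(Exception):
    """Raised when a run's accumulated spend reaches its cap."""


class RunBudget:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        # Call before each model call: refuse to spend once at the cap.
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f}"
            )

    def add(self, call_usd: float) -> None:
        # Call after each response with that call's dollar cost.
        self.spent_usd += call_usd


budget = RunBudget(cap_usd=1.00)
calls = 0
try:
    while True:
        budget.check()
        budget.add(0.30)  # hypothetical flat cost per call
        calls += 1
except BudgetExceeded as exc:
    print(f"stopped after {calls} calls: {exc}")
```

The run stops after the fourth call, about twenty cents past the cap. A check-before-spend budget can overshoot by up to one call's cost, so leave that much headroom when you pick the number.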

The gateway budget is the one that survives a bug in your
harness. LiteLLM Proxy ships a budget manager with hard caps
per user, per key, and per model. Portkey does the same
through a managed dashboard. Set the in-process cap to the
worst reasonable task cost, times two. Set the gateway cap
to the number you are willing to write a postmortem about.
If they are the same number, you will have a bad afternoon.

One guard, one exit, one span

Scattered checks work, but in production you want a single
object that enforces steps, seconds, tokens, and dollars in
one place and records every trip on the current trace span.
When the postmortem asks why a run died, the answer is a
span query, not a dig through logs.

from time import monotonic
from opentelemetry import trace


class GuardTripped(Exception):
    def __init__(self, reason: str, detail: str = ""):
        self.reason = reason
        super().__init__(f"{reason}: {detail}")


class AgentGuard:
    def __init__(self, max_steps=12, max_seconds=120.0,
                 max_tokens=200_000, max_usd=5.0):
        self.c = (max_steps, max_seconds,
                  max_tokens, max_usd)
        self.steps = self.tokens = 0
        self.usd = 0.0
        self.started = monotonic()

    def _trip(self, reason: str, detail: str) -> None:
        # Record the trip on the current span, then leave
        # through the single GuardTripped exit. The attribute
        # name is illustrative; use your own convention.
        span = trace.get_current_span()
        span.set_attribute("agent.guard.tripped", reason)
        raise GuardTripped(reason, detail)

    def check(self) -> None:
        ms, msec, mtok, musd = self.c
        elapsed = monotonic() - self.started
        if self.steps >= ms:
            self._trip("loop_limit", f"{self.steps} steps")
        if elapsed >= msec:
            self._trip("time_limit", f"{elapsed:.1f}s")
        if self.tokens >= mtok:
            self._trip("token_limit", f"{self.tokens} tokens")
        if self.usd >= musd:
            self._trip("cost_limit", f"${self.usd:.2f}")

    def charge(self, usage) -> None:
        self.steps += 1
        self.tokens += (usage.input_tokens
                        + usage.output_tokens)
        # cost_usd is the pricing helper defined earlier.
        self.usd += cost_usd(usage)
        span = trace.get_current_span()
        span.set_attribute("agent.guard.usd_total",
                           self.usd)

The harness calls check() before each model turn and
charge(response.usage) after each response. Four lines,
one exit path, one span attribute per reason.

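guard = AgentGuard(max_steps=10, max_usd=2.0)
done = False
while not done:
    guard.check()
    response = client.messages.create(
        model="claude-opus-4-8",
        messages=messages,
        max_tokens=1024,
    )
    # charge() reads the usage fields cost_usd expects. Adapt
    # the SDK's usage object first if its field names differ.
    guard.charge(response.usage)
    messages.append(
        {"role": "assistant", "content": response.content}
    )
    # Tool execution elided: run the requested tools and append
    # their results here before the next turn.
    done = response.stop_reason != "tool_use"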

The point is the shape, not the code. You will write your
own, shaped by your framework.

Approval gates: a human clicks for destructive verbs

Ceilings and budgets stop a runaway. They do nothing about
a single tool call that ships a wire, deletes a table, or
emails a customer. For any verb that changes state in a way
you cannot cheaply undo, the right check is a person who
clicks a button. Not an LLM judge. Not a rule engine. A
human, looking at the exact arguments.

Every framework converged on the same shape. LangGraph
calls it interrupt(): the graph pauses, the checkpointer
persists state, the caller gets the proposed action, and
execution resumes on Command(resume=...). The Agents SDK
models it as a needs_approval flag on the tool. The
mechanism is boring. The discipline around it is not.

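from functools import wraps
from uuid import uuid4

from opentelemetry import trace

# Name prefixes that mark a tool as destructive. The decorator
# does not read this tuple; one use for it is auditing that
# every matching tool carries @dangerous.
DANGEROUS_VERBS = (
    "send_", "delete_", "pay_", "run_shell",
)


class ApprovalDenied(Exception):
    """Raised when a request is rejected or times out."""


def dangerous(fn):
    # approval_queue is your application's approval service.
    # request() blocks until a decision or the timeout.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        req_id = str(uuid4())
        payload = {"tool": fn.__name__,
                   "args": args, "kwargs": kwargs}
        decision = approval_queue.request(
            req_id, payload, timeout_s=3600
        )
        span = trace.get_current_span()
        span.set_attribute("approval.request_id", req_id)
        span.set_attribute("approval.decision",
                           decision.outcome)
        # A timed-out request has no approver to record.
        span.set_attribute("approval.approver",
                           decision.approver or "")
        if decision.outcome != "approved":
            raise ApprovalDenied(decision.outcome)
        return fn(*args, **kwargs)
    return wrapper


@dangerous
def send_email(to: str, subject: str, body: str):
    # mailer is your application's mail client.
    return mailer.send(to, subject, body)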
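The DANGEROUS_VERBS tuple can earn its keep as a startup audit: fail the deploy if any tool with a destructive prefix is missing its gate. This is a sketch under one assumption the decorator above does not make, namely that the gate marks its wrapper with a requires_approval attribute. The dangerous stand-in here only sets that marker. ungated is a hypothetical helper.

```python
DANGEROUS_VERBS = ("send_", "delete_", "pay_", "run_shell")


def dangerous(fn):
    # Stand-in for the real gate: it only marks the function
    # so the audit below can see it (assumed convention).
    fn.requires_approval = True
    return fn


def ungated(tools: dict) -> list:
    """Names of tools with a dangerous prefix but no approval marker."""
    return sorted(
        name for name, fn in tools.items()
        if name.startswith(DANGEROUS_VERBS)  # startswith accepts a tuple
        and not getattr(fn, "requires_approval", False)
    )


@dangerous
def send_email(to, subject, body): ...


def delete_user(user_id): ...


def search_docs(query): ...


tools = {f.__name__: f for f in (send_email, delete_user, search_docs)}
print(ungated(tools))  # ['delete_user']
```

Run it where the agent is constructed and refuse to start if the list is not empty.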

Three rules make the gate real instead of theater.

Show the payload verbatim. The human must see the exact
arguments passed to the tool, in a monospace font, not the
model's summary of them. Picture an approval UI that renders
"a short note to Alice confirming the meeting" while the
raw arguments carry a bcc to an address the model chose
not to mention. The human clicks approve. A summary loses
fidelity exactly where the damage hides.
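A minimal sketch of a verbatim renderer, assuming the tool arguments are JSON-serializable keyword arguments. render_for_approval is a hypothetical name, and the addresses are made up.

```python
import json


def render_for_approval(tool: str, kwargs: dict) -> str:
    """Return the exact arguments as text for a monospace approval view."""
    # sort_keys gives a stable layout; default=repr keeps odd values
    # visible instead of failing to serialize them.
    body = json.dumps(kwargs, indent=2, sort_keys=True, default=repr)
    return f"tool: {tool}\n{body}"


text = render_for_approval(
    "send_email",
    {
        "to": "alice@example.com",
        "bcc": "archive@elsewhere.example",
        "subject": "Meeting",
        "body": "Confirming Thursday.",
    },
)
print(text)
```

Every key the tool will receive appears on its own line, including the bcc a summary would have left out. json.dumps escapes non-ASCII characters by default, so a lookalike character in an address shows up as a \u escape instead of blending in.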

Never add "remember my choice." It feels like a fix for
approval fatigue. Six weeks later most approvals are
auto-approved, the gate is decoration, and nobody notices
until the incident. If fatigue is real, and it always is,
narrow the set of tools that trigger approval. Do not weaken
the approval itself. A tool too slow to gate is a tool that
should not be in the agent's hands.

Give the gate a timeout. An agent that pauses for a human
and waits forever leaks threads on whatever runtime hosts
it. Pick a deadline, treat expiry as a rejection with reason
approval_timeout, and let the caller restart if they still
want the action.
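Expiry-as-rejection can be sketched with the standard library alone. Decision and wait_for_decision are hypothetical names, and a queue.Queue stands in for whatever carries the reviewer's answer in your system.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    outcome: str  # "approved", "rejected", or "approval_timeout"
    approver: Optional[str] = None


def wait_for_decision(inbox: queue.Queue, timeout_s: float) -> Decision:
    """Block until a reviewer answers or the deadline passes."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        # Expiry is a rejection with its own reason, never a
        # silent approval.
        return Decision(outcome="approval_timeout")


# Nobody answers: the wait ends as a timeout rejection.
unanswered = wait_for_decision(queue.Queue(), timeout_s=0.05)
print(unanswered.outcome)

# A reviewer answered before the deadline.
inbox = queue.Queue()
inbox.put(Decision(outcome="approved", approver="dana"))
answered = wait_for_decision(inbox, timeout_s=0.05)
print(answered.outcome, answered.approver)
```

The caller treats approval_timeout like any other non-approved outcome, so the gate has a single refusal path.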

Build them in this order

You do not need all of this before the first demo. You need
it in the right order, each layer before the failure it
prevents can reach a real user.

1. Step ceiling and dollar budget. Day one, before the first demo. These catch the failure you actually hit in week one: an agent that works for the inputs you tested and loops forever on the 20% you did not.
2. Tool allow-list. Before the first tool that mutates anything. The agent's tool set is a closed universe declared at construction. The model never discovers tools at runtime.
3. Approval gate on destructive verbs. Before the first user who is not you. The developers know what the agent does. The first outsider asks it to do something nobody planned for.
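The allow-list in the second item can be sketched as a registry that is frozen at construction. ToolBox and UnknownTool are hypothetical names, and the add tool is a toy.

```python
from types import MappingProxyType
from typing import Any, Callable


class UnknownTool(Exception):
    """Raised when the model names a tool outside the allow-list."""


class ToolBox:
    def __init__(self, tools: dict):
        # Copy, then wrap read-only: nothing can be registered
        # after construction.
        self._tools = MappingProxyType(dict(tools))

    def names(self) -> list:
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise UnknownTool(name)
        tool: Callable[..., Any] = self._tools[name]
        return tool(**kwargs)


box = ToolBox({"add": lambda a, b: a + b})
print(box.call("add", a=1, b=2))  # 3

try:
    box.call("run_shell", cmd="ls")
except UnknownTool as exc:
    print(f"refused: {exc}")
```

The model can ask for any name it likes. The harness only runs names that were in the dict when the agent was built.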

Guardrails are not a tax on capability. They are the reason
the capability is shippable. A coding agent with no ceiling
is a demo. The same agent with a step ceiling, a dollar
budget, an allow-list, and a human gate on git push and
anything that touches production is a product. One sprint of
infrastructure sits between the two, and every night of
sleep after it.

Write the step ceiling today. Write the dollar budget
today. Decide today which verbs need a human. Then turn the
flag on.

Guardrails keep a confused agent from spending forever. The
harder work is seeing why a run tripped and what a healthy
one looks like, and that is a tracing problem. Agents in
Production covers the guardrail stack end to end (ceilings,
budgets, allow-lists, approval queues), and Observability
for LLM Applications, its companion in The AI Engineer's
Library, covers the spans, evals, and cost telemetry that
tell you which tripwire is carrying the load. Together they
are the difference between a demo and something you can page
someone about.