Gabriel Anhaia

The 5 Guardrails Every AI Agent Needs Before It Touches Production


The Operator Collective collected ten production agent postmortems, and the same failure shapes keep showing up. A multi-agent research tool ran for eleven days before anyone noticed, racking up a $47,000 OpenAI invoice. A coding agent wiped a production database during a code freeze. Sattyam Jain documented a research agent that burned $4,200 in 63 hours, looping on a 429 rate-limit error it never learned to recognize.

In April 2026, Foresiet catalogued six AI security incidents in two weeks, including a Meta agent that, per Foresiet, hallucinated permission scopes and surfaced internal data for roughly forty minutes before monitoring caught it. The OWASP Top 10 for Agentic Applications 2026 names this pattern explicitly: Least-Agency and Strong Observability are the two principles that turn an agent demo into something you can run on a Tuesday afternoon without getting paged.

Five guardrails wrap the agent loop, in the order the production-incident corpus keeps demanding. Each one is 15–30 lines of Python around an existing agent. Composition matters. The order is the post.

The agent we are wrapping

Standard tool-calling loop, OpenAI SDK. The naked version is what every guardrail attaches to.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MODEL = os.environ.get("AGENT_MODEL", "gpt-4o-mini")

def naked_agent(task: str, tools, schema, dispatch) -> str:
    # tools: the callables themselves; schema: their JSON schema for the API;
    # dispatch: maps a model tool_call onto the matching callable and runs it
    msgs = [{"role": "user", "content": task}]
    while True:
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=schema
        )
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            return m.content
        for call in m.tool_calls:
            out = dispatch(call)
            msgs.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(out),
            })

Five guardrails, wrapping outward from the tool dispatch.

Guardrail 1 — action-class classification

The Replit incident (the coding agent that wiped a production database during a code freeze) is the cleanest argument for this one. The agent had write access to production. There was no taxonomy distinguishing read_user(id) from drop_table("users"). The OWASP 2026 doc calls this Zero-Trust Tooling: every tool call is a high-risk operation until proven otherwise.

Tag every tool with an action class. Read tools run free. Destructive tools route through an approval channel. The classification lives next to the tool definition so it cannot drift.

READ, WRITE, DESTRUCTIVE = "read", "write", "destructive"
TOOL_CLASS: dict[str, str] = {
    "get_user": READ,
    "update_user": WRITE,
    "drop_table": DESTRUCTIVE,
}

def classify_dispatch(call, raw_dispatch, approver):
    name = call.function.name
    cls = TOOL_CLASS.get(name, DESTRUCTIVE)  # default deny
    if cls == DESTRUCTIVE:
        if not approver(name, call.function.arguments):
            return f"DENIED: {name} requires approval"
    return raw_dispatch(call)

Default to DESTRUCTIVE for unknown tools. A new tool added without a class entry should fail closed, not fail open. The approver is whatever your environment has: a Slack interaction, a queue your on-call drains, a webhook to a CLI prompt for dev mode.

Twelve lines including the dictionary. This is the inner wrap; everything else stacks on top.
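In dev mode, the approver can be a blocking CLI prompt; the Slack and queue variants keep the same contract, tool name and raw arguments in, boolean out. A minimal sketch (the function name and prompt wording are illustrative, not from any SDK):

def cli_approver(tool_name: str, raw_args: str) -> bool:
    # dev-mode approver: block on stdin until a human answers;
    # production swaps this for a Slack interaction or an on-call queue
    print(f"Agent wants to run {tool_name} with args: {raw_args}")
    return input("Approve? [y/N] ").strip().lower() == "y"

# wiring: classify_dispatch(call, raw_dispatch, approver=cli_approver)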

Guardrail 2 — output-policy filter on tool inputs

Action classification is necessary but not sufficient. The Replit agent did not need to call drop_table; it called something legitimate with a destructive argument. The OWASP doc names this as a separate failure mode: an LLM-generated argument that passes schema validation but violates policy.

The filter sits between the model's tool call and the dispatcher. It validates the contents of the call, not just the shape.

import re

DENY_PATTERNS = [
    re.compile(r"\bDROP\s+TABLE\b", re.I),
    re.compile(r"\bDELETE\s+FROM\b\s+\w+\s*;?\s*$", re.I),
    re.compile(r"^\s*rm\s+-rf\b"),
]

def policy_check(args_json: str) -> str | None:
    for p in DENY_PATTERNS:
        if p.search(args_json):
            return f"policy violation: {p.pattern}"
    return None

def filtered_dispatch(call, raw_dispatch):
    err = policy_check(call.function.arguments)
    if err:
        return f"DENIED: {err}"
    return raw_dispatch(call)

Twelve lines. The deny list is environment-specific, whatever your security review names as off-limits for the agent. For SQL tools, parse the AST and enforce on the parsed form rather than regex; the example above is the shape, not the production-grade matcher. For shell tools, an allowlist of binaries beats a denylist of patterns. The principle is constant: validate the call's contents before you let it run.
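For the shell-tool case, here is a sketch of the allowlist idea using only the standard library; the binary set is a placeholder for whatever your security review actually permits:

import shlex

ALLOWED_BINARIES = {"ls", "cat", "grep", "git"}  # placeholder, environment-specific

def shell_policy_check(command: str) -> str | None:
    # allowlist the binary instead of denylisting dangerous strings
    try:
        parts = shlex.split(command)
    except ValueError as e:
        return f"unparseable command: {e}"
    if not parts:
        return "empty command"
    if parts[0] not in ALLOWED_BINARIES:
        return f"binary not on allowlist: {parts[0]}"
    return None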

Guardrail 3 — max-step cap with reason logging

The eleven-day $47,000 trace did not look like a loop on any single step. Each step was different. The total just kept going. Step caps are how you stop runaway non-loops: the agent that has a real plan and the plan keeps growing.

A step cap with a reason log is more useful than a step cap alone. When the cap fires, you want to know what the agent was doing.

MAX_STEPS = 25

def step_capped(task, tools, schema, dispatch):
    msgs = [{"role": "user", "content": task}]
    for step in range(MAX_STEPS):
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=schema
        )
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            return m.content
        for call in m.tool_calls:
            out = dispatch(call)
            msgs.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(out),
            })
    # m is the last assistant message; its tool_calls show what the agent was doing
    last = [c.function.name for c in (m.tool_calls or [])]
    return f"STEP_CAP_HIT after {MAX_STEPS}; last tools={last}"

Twenty lines. The reason string is what gets paged when the cap fires. last tools=['search', 'search', 'search', 'summarize'] is the difference between an agent stalled in a search loop and an agent that finished researching but could not write the final answer.

Twenty-five steps is a defensible default for a single-task agent. Longer-running orchestrations need the cap on each sub-task, not on the parent.
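What "cap each sub-task" can look like, assuming the parent has already split the work into a list of sub-task strings (the splitting itself is out of scope here):

def orchestrate(subtasks, tools, schema, dispatch):
    # each sub-task gets its own step budget instead of one parent-level cap
    results = []
    for sub in subtasks:
        out = step_capped(sub, tools, schema, dispatch) or ""
        status = "capped" if out.startswith("STEP_CAP_HIT") else "done"
        results.append({"task": sub, "status": status, "output": out})
    return results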

Guardrail 4 — per-trace cost ceiling with hard kill

Step caps stop at step count. Cost ceilings stop at money spent. They are not the same: a 5-step trace with massive context windows can cost more than a 30-step trace on small contexts. The two fire on different failure modes.

Track usage from every API response. When the cumulative spend crosses the threshold, abort.

MAX_USD = 1.00
# USD per token; gpt-4o-mini list rates at time of writing, approximate
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6

def cost_capped(task, tools, schema, dispatch):
    msgs = [{"role": "user", "content": task}]
    spent = 0.0
    while True:
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=schema
        )
        u = r.usage
        spent += u.prompt_tokens * PRICE_IN
        spent += u.completion_tokens * PRICE_OUT
        if spent > MAX_USD:
            return f"COST_KILL ${spent:.4f} > ${MAX_USD}"
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            return m.content
        for call in m.tool_calls:
            out = dispatch(call)
            msgs.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(out),
            })

Twenty-two lines. The kill is hard. The loop returns, no exception, no retry. The string returned starts with a stable prefix (COST_KILL) so dashboards can count it. Pricing is approximate and changes; production code reads the rate from config and audits monthly against the provider's invoice.
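A minimal version of reading the rate from config, assuming environment variables as the config surface; the variable names are illustrative:

# rates configured in USD per million tokens, converted to per-token here
PRICE_IN = float(os.environ.get("AGENT_PRICE_IN_PER_MTOK", "0.15")) / 1e6
PRICE_OUT = float(os.environ.get("AGENT_PRICE_OUT_PER_MTOK", "0.60")) / 1e6
MAX_USD = float(os.environ.get("AGENT_MAX_USD_PER_TRACE", "1.00"))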

The Operator Collective writeup is unambiguous about ordering: cost ceiling on the outside, hard cap, no soft warnings. The eleven-day trace had log lines about cost. Nobody read them.

Guardrail 5 — structured error escalation to a human

The first four guardrails decide whether a step runs and whether the trace continues. None of them decide what to do when the agent is confused: when it has tried things, none worked, and it does not know what to do next.

The default in tutorial code is to keep retrying. The default in production code should be to escalate. A small classifier on the agent's last message routes confusion to a human queue with the trace ID attached.

import uuid

CONFUSION_MARKERS = (
    "i'm not sure", "i cannot determine",
    "without more information", "appears to have failed",
)

def escalate_if_stuck(final_message: str, trace_id: str, queue):
    low = (final_message or "").lower()
    if any(m in low for m in CONFUSION_MARKERS):
        queue.put({
            "trace_id": trace_id,
            "reason": "agent_confusion",
            "last_message": final_message,
        })
        return f"ESCALATED trace={trace_id}"
    return final_message

def run_with_escalation(task, tools, schema, dispatch, queue):
    trace_id = uuid.uuid4().hex
    out = cost_capped(task, tools, schema, dispatch)
    return escalate_if_stuck(out, trace_id, queue)

Eighteen lines. The classifier is intentionally simple. It catches the literal phrases the model produces when it has given up. A learned classifier or a structured-output schema (the agent emits confidence: low and a status field) is better; the markers above are the floor.
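A sketch of the structured-output variant using pydantic v2, assuming the agent's final message is JSON in a shape like the model below (the field names are illustrative, not a standard schema); it replaces phrase matching with a typed check:

from typing import Literal
from pydantic import BaseModel

class AgentStatus(BaseModel):
    status: Literal["done", "stuck", "needs_human"]
    confidence: Literal["low", "medium", "high"]
    answer: str

def escalate_if_stuck_structured(final_message: str, trace_id: str, queue):
    # structured version: trust the agent's own status field, not phrase matching
    try:
        parsed = AgentStatus.model_validate_json(final_message or "")
    except Exception:
        # output that does not parse is itself a reason to escalate
        queue.put({"trace_id": trace_id, "reason": "unparseable_output",
                   "last_message": final_message})
        return f"ESCALATED trace={trace_id}"
    if parsed.status != "done" or parsed.confidence == "low":
        queue.put({"trace_id": trace_id, "reason": parsed.status,
                   "last_message": final_message})
        return f"ESCALATED trace={trace_id}"
    return parsed.answer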

The queue is whatever you have: a database table, an SQS queue, a Slack webhook. The trace ID is the load-bearing detail. When the human picks up the escalation, the first thing they need is the full trace, not a one-line summary.
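The only contract the escalation code assumes is a put(dict) method. A database-table backend is a few lines; sketched here with SQLite purely for illustration:

import json, sqlite3

class SqliteEscalationQueue:
    # minimal backend: one row per escalation, drained by whatever your on-call uses
    def __init__(self, path: str = "escalations.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS escalations "
            "(trace_id TEXT, reason TEXT, payload TEXT)"
        )

    def put(self, item: dict):
        self.conn.execute(
            "INSERT INTO escalations VALUES (?, ?, ?)",
            (item["trace_id"], item["reason"], json.dumps(item)),
        )
        self.conn.commit()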

How they compose, from inside to outside

escalation
  └── cost_ceiling
        └── step_cap
              └── policy_filter
                    └── action_class
                          └── tool_dispatch

Inner-to-outer matters because each layer makes a different kind of decision and you want the cheap decisions first.

  • Action class runs on every tool call and most of them are READ. Fast path, no network.
  • Policy filter is regex over arguments, also cheap. Catches the pre-check failures before any external call.
  • Step cap is a counter on the outer loop. Fires when the trace is structurally too long.
  • Cost ceiling wraps the model calls themselves. Fires when the trace is structurally too expensive.
  • Escalation runs once at the end. Decides whether the human queue gets pinged.

If you put the cost ceiling on the inside, you are charging cost-tracking overhead per tool call. If you put the action class on the outside, you are checking permissions after the model already paid for the planning. The order is not aesthetic. It is the difference between a guardrail that costs a millisecond and one that costs a request.
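Concretely, the nesting can look like the sketch below. It merges the step cap and cost ceiling into a single outer loop (they were shown as separate loops above for readability) and plugs the two dispatch-side wrappers in at the bottom; treat it as one way to wire the pieces, not the canonical one:

def guarded_dispatch(call, raw_dispatch, approver):
    # policy_filter wraps action_class wraps the real tool dispatch
    return filtered_dispatch(
        call, lambda c: classify_dispatch(c, raw_dispatch, approver)
    )

def guarded_agent(task, schema, raw_dispatch, approver, queue):
    trace_id = uuid.uuid4().hex
    msgs = [{"role": "user", "content": task}]
    spent = 0.0
    out = f"STEP_CAP_HIT after {MAX_STEPS}"
    for _ in range(MAX_STEPS):                       # step cap
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=schema
        )
        spent += r.usage.prompt_tokens * PRICE_IN
        spent += r.usage.completion_tokens * PRICE_OUT
        if spent > MAX_USD:                          # cost ceiling
            out = f"COST_KILL ${spent:.4f} > ${MAX_USD}"
            break
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            out = m.content
            break
        for call in m.tool_calls:
            result = guarded_dispatch(call, raw_dispatch, approver)
            msgs.append({"role": "tool", "tool_call_id": call.id,
                         "content": str(result)})
    return escalate_if_stuck(out, trace_id, queue)   # escalation, once at the end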

What is left

These five do not cover everything. The OWASP 2026 doc names eight other categories, among them prompt injection through tool outputs, cross-agent privilege creep, identity confusion in multi-agent setups, and memory poisoning. Those need their own treatment.

But the production postmortems from the last twelve months (the $4,200 burn, the $47,000 invoice, the Replit wipe, the Meta scope hallucination) each map to at least one of these five guardrails as a plausible mitigation. Not every incident maps to all five, but each maps to at least one.

If your agent is wrapped in zero of them, you are running a tutorial. If it is wrapped in one, you are running an experiment. Five is the floor for production.


If this was useful

The AI Agents Pocket Guide covers the loop and the wrappers: what each guardrail catches, where it composes, what it lets through. The LLM Observability Pocket Guide is the next book over: what to put on the trace span when one of these fires, so the human picking up the escalation has the context to act.

AI Agents Pocket Guide

LLM Observability Pocket Guide
