The Agent Production-Readiness Checklist You Can Run Before Shipping

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

In February 2024, British Columbia's Civil Resolution Tribunal ordered Air Canada to honor a refund policy that its support chatbot had invented on the spot. The airline argued it was not responsible for what its own bot said. The tribunal disagreed. The bot spoke for the company, so the company owned the answer.

That case sums up the whole problem with shipping an agent. The agent talks to real users, touches real systems, and spends real money, and when it goes wrong the incident is yours. Worse, nothing on the infrastructure side looks broken. CPU is fine. Latency is fine. Everything returns a 200. The bot answered, and the answer was wrong.

So before you flip the traffic flag, you want a list you can walk top to bottom and honestly answer yes to. Not a maturity model. Six checks. Each one is a hard thing to fake, and each one closes a hole that has already burned a named team.

1. Every run emits a trace you can open

When you get paged on an agent at 3 a.m., your first move is not the metrics dashboard. It is the trajectory viewer. The trace is the runbook.

That only works if the trace exists and carries enough to reconstruct the decision. Emit a trace ID on every run, expose it to the user for support, and put the model, prompt version, tools called, tokens, and cost on every span.

You do not need a heavy framework to start. A span is a dict with a start, an end, and some attributes.

import time
import uuid
import json


def trace_run(user_id, prompt_version):
    return {
        "trace_id": str(uuid.uuid4()),
        "user_id": user_id,
        "prompt_version": prompt_version,
        "started_at": time.time(),
        "spans": [],
    }


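def record_span(trace, name, started_at=None, **attrs):
    # Stamp timing on every span; with no explicit start it is a point event.
    ended_at = time.time()
    if started_at is None:
        started_at = ended_at
    trace["spans"].append({"name": name, "started_at": started_at,
                           "ended_at": ended_at, **attrs})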

Wrap the model call so the usage lands on the span automatically. The Anthropic SDK returns token counts on every response.

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-8"  # pinned, never "latest"


def traced_call(trace, messages, tools):
    # Capture the start before the request so the span carries real timing.
    started_at = time.time()
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=messages,
        tools=tools,
    )
    record_span(
        trace,
        "model_call",
        started_at=started_at,
        ended_at=time.time(),
        model=MODEL,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        stop_reason=resp.stop_reason,
    )
    return resp
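
The span above records tokens but not cost, which check one also asks for. Here is a minimal sketch of deriving it. The model name and the numbers in `PRICES_PER_MTOK` are placeholders I made up for illustration, not real pricing, and the helper names are hypothetical; substitute your provider's current rates.

```python
# Hypothetical prices in USD per million tokens; placeholders, not real rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def span_cost_usd(model, input_tokens, output_tokens, prices=PRICES_PER_MTOK):
    """Return the dollar cost of one model call, or None if the model is unpriced."""
    rate = prices.get(model)
    if rate is None:
        return None  # surface the gap instead of silently recording $0
    return (input_tokens * rate["input"]
            + output_tokens * rate["output"]) / 1_000_000


def trace_cost_usd(trace):
    """Sum the cost of every span that carries a non-empty cost_usd attribute."""
    return sum(s.get("cost_usd") or 0.0 for s in trace["spans"])
```

Returning `None` for an unpriced model is a deliberate choice in this sketch: a missing price shows up in the trace as a gap rather than as a run that looks free.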

Export the spans over an OpenTelemetry-compatible pipeline once the shape settles. The point of check one: when the page hits, the viewer has something to show.

2. You froze an eval set before launch, not after

The fix you ship at 4 a.m. should not introduce a new regression at 4:30. That is what an eval set buys you, and you need it in place before the first user, not after the first incident.

Freeze at least 50 real trajectories. Run two kinds of check against them on every deploy. One deterministic check: string match, tool-call match, or a unit test. One LLM-as-judge check for the fuzzy parts, correctness and tone.

The deterministic ones are cheap and catch the dumb regressions.

def eval_tool_choice(trajectory, expected_tool):
    calls = [s for s in trajectory["spans"]
             if s["name"] == "tool_call"]
    if not calls:
        return False, "no tool called"
    got = calls[0]["tool"]
    return got == expected_tool, f"got {got}"
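
To run a check like this against the whole frozen set on every deploy, a small runner is enough. The sketch below assumes each frozen case is a dict with `id`, `trajectory`, and `expected` keys; that layout and the function names are my assumptions, not a standard. `first_tool_matches` is a stand-in with the same contract as the check above so the block runs on its own.

```python
def run_deterministic_evals(cases, check):
    """Run `check` over frozen cases and collect failures.

    `check(trajectory, expected)` must return (passed, detail).
    Returns (pass_rate, failures).
    """
    failures = []
    for case in cases:
        passed, detail = check(case["trajectory"], case["expected"])
        if not passed:
            failures.append({"id": case["id"], "detail": detail})
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures


def first_tool_matches(trajectory, expected_tool):
    """Stand-in check: did the first tool call use the expected tool?"""
    tools = [s["tool"] for s in trajectory["spans"] if s["name"] == "tool_call"]
    if not tools:
        return False, "no tool called"
    return tools[0] == expected_tool, f"got {tools[0]}"
```

In CI, the deploy would fail when `pass_rate` drops below whatever threshold the team agreed on, and the `failures` list tells you which trajectories to open.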

The judge check hands the question and the agent's answer to a model and asks one narrow question with a fixed reply format, so the output is parseable rather than prose.

def judge_correctness(question, answer):
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Score the answer 1-5 for factual "
                "correctness. Reply as JSON with keys "
                f"score and reason.\nQ: {question}\n"
                f"A: {answer}"
            ),
        }],
    )
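    # The first content block is assumed to be text (no tools are passed here).
    text = resp.content[0].text
    # json.loads raises if the model adds anything around the JSON object.
    return json.loads(text)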
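
Judges do not always reply with bare JSON, so it can help to parse defensively and validate the score before trusting it. This is one possible sketch, not the only way to do it; `parse_judge_reply` is a hypothetical helper, and it assumes the reply contains a single JSON object with an integer `score`.

```python
import json


def parse_judge_reply(text):
    """Extract and validate {"score": 1-5, "reason": str} from a judge reply.

    Returns (result, error); exactly one of the two is None.
    """
    # Take the outermost braces so surrounding prose or fences are ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None, "no JSON object in reply"
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    score = data.get("score")
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None, f"score out of range: {score!r}"
    return {"score": score, "reason": str(data.get("reason", ""))}, None
```

A parse error is worth counting as its own eval outcome rather than as a low score, since it says something about the judge and nothing about the agent.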

Have a human read 20 to 50 real traces before the first launch. No eval set replaces one engineer reading what the agent actually did.

3. Guardrails live in code, not in the prompt

A guardrail you can argue away with a clever prompt is not a guardrail. It is a suggestion. The real ones are deterministic checks sitting outside the model's reasoning: an input filter that refuses out-of-scope requests, an output scan for PII before the response leaves, and a loop detector that breaks when the same tool call repeats with the same arguments.

The loop detector is the one people skip and regret. An agent happy to call a broken tool forever will do exactly that.

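def make_loop_guard(limit=3):
    # Counts identical (tool, arguments) calls within one trajectory.
    seen = {}

    def check(tool_name, tool_input):
        # sort_keys makes the key stable regardless of argument order.
        key = (tool_name, json.dumps(tool_input,
                                     sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        # With limit=3, the third identical call is the one that gets refused.
        if seen[key] >= limit:
            return False, f"repeated {key[0]}"
        return True, None

    return check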

Keep each guardrail testable and logged. When one fires, that event goes on the trace from check one, so the postmortem can see the agent was stopped and why.
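
As an illustration of a guardrail that is both testable and logged, here is a rough sketch of the output PII scan. The two regexes are deliberately simple and will miss plenty of real PII; the pattern names, `scan_output_for_pii`, and `guarded_output` are all hypothetical, and the trace is assumed to be a dict with a `spans` list.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs more.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_output_for_pii(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))


def guarded_output(trace, text):
    """Block the response and log a guardrail span if PII is found."""
    hits = scan_output_for_pii(text)
    if hits:
        # Log which patterns fired, not the matched text itself.
        trace["spans"].append(
            {"name": "guardrail", "guardrail": "pii_output", "matched": hits}
        )
        return None  # caller substitutes a safe fallback message
    return text
```

The span records only the pattern names, so the trace explains why the response was blocked without copying the sensitive value into your logs.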

4. Every axis of spend has a cap

Developers have described runaway agent loops that burned five figures in credits when a tool misbehaved and the loop had no ceiling. The shape is generic: an autonomous loop without a budget will, at some rate, find a way to spend money unrelated to the work.

Cap every axis. Max steps per trajectory. Max tokens per trajectory. Max dollars per trajectory. A per-user daily budget and a per-tenant daily budget on top. Check the budget before the call, not after, because after means you already paid for the call that put you over.

from dataclasses import dataclass


@dataclass
class Budget:
    max_steps: int = 25
    max_tokens: int = 200_000
    max_usd: float = 2.0


def over_budget(budget, steps, tokens, usd):
    if steps >= budget.max_steps:
        return "step_budget"
    if tokens >= budget.max_tokens:
        return "token_budget"
    if usd >= budget.max_usd:
        return "usd_budget"
    return None

Return the reason as a value, not an exception. The user-facing answer differs when the model finished versus when a cap stopped it: one shows the answer, the other shows a partial with a "continue?" hatch.
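
Put together, the loop might look like the sketch below. It is self-contained, so it takes the caps as plain arguments, and `step` is a hypothetical callable standing in for one model or tool call that reports what it consumed.

```python
def run_with_budget(step, max_steps=25, max_tokens=200_000, max_usd=2.0):
    """Drive an agent loop, checking every cap before each step.

    `step()` performs one call and returns (done, tokens_used, usd_spent).
    Returns (stop_reason, totals); stop_reason is "finished" or the cap hit.
    """
    steps = tokens = 0
    usd = 0.0
    while True:
        # Check before the call: once a cap is reached, no further call is made.
        if steps >= max_steps:
            reason = "step_budget"
        elif tokens >= max_tokens:
            reason = "token_budget"
        elif usd >= max_usd:
            reason = "usd_budget"
        else:
            reason = None
        if reason:
            return reason, {"steps": steps, "tokens": tokens, "usd": usd}
        done, used, spent = step()
        steps, tokens, usd = steps + 1, tokens + used, usd + spent
        if done:
            return "finished", {"steps": steps, "tokens": tokens, "usd": usd}
```

The caller branches on the returned string: `"finished"` renders the answer, anything else renders the partial result along with the reason the run stopped.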

5. No tool combination forms the dangerous trio

Meta's Agents Rule of Two is the clearest practical guidance on agent security right now. Within one session, an agent should satisfy at most two of these three: it processes untrusted input, it accesses private data or sensitive systems, and it can change state or communicate externally. All three at once is the pattern behind a class of 2025 agent breaches. If a workflow genuinely needs all three, it needs a human in the loop.

Audit it in code, and re-run the audit whenever a tool is added.

def rule_of_two_violation(tools):
    props = {"untrusted_input", "sensitive_access",
             "external_effect"}
    # Union of the risk properties carried by any tool in the session.
    active = set()
    for t in tools:
        active |= props & set(t.get("properties", []))
    # A violation means all three properties are present at once.
    return len(active) == 3, sorted(active)

Tag each tool with the properties it carries. A web-fetch tool processes untrusted input. A database read touches private data. An email send communicates externally. When the audit lights up all three, the launch decision is not "ship carefully." It is "put a human on the confirmation step, or split the tools across sessions."
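
Here is what the tagging and the resulting launch decision could look like. The tool registry is invented for illustration, and `tools_by_property` and `launch_decision` are hypothetical helpers; the per-property breakdown is there to show which tools you would have to gate or move to another session.

```python
RISK_PROPERTIES = ("untrusted_input", "sensitive_access", "external_effect")

# Hypothetical tool registry; names and tags are illustrative only.
TOOLS = [
    {"name": "web_fetch", "properties": ["untrusted_input"]},
    {"name": "db_read", "properties": ["sensitive_access"]},
    {"name": "send_email", "properties": ["external_effect"]},
]


def tools_by_property(tools):
    """Map each risk property to the names of the tools that carry it."""
    return {
        prop: [t["name"] for t in tools if prop in t.get("properties", [])]
        for prop in RISK_PROPERTIES
    }


def launch_decision(tools):
    """Return "ok" or "needs_human_or_split", plus the per-property breakdown."""
    breakdown = tools_by_property(tools)
    all_three = all(breakdown[p] for p in RISK_PROPERTIES)
    return ("needs_human_or_split" if all_three else "ok"), breakdown
```

Running `launch_decision(TOOLS)` on this registry flags the session, and the breakdown shows that gating `send_email` behind a confirmation step is one way out.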

6. There is a runbook, and it points at the trace

Agent incidents look nothing like infra incidents. What breaks is semantics. So the runbook cannot start with the metrics dashboard.

Write it down before launch. The first line says: open the trajectory viewer before any dashboard. Page on judge-score drops and cost-p99 spikes, not only latency and error rate. Test the kill switch on a schedule rather than discovering during the incident that it never worked. And require two fields on every postmortem: a trajectory excerpt, and a new eval-set case that reproduces the failure.
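
The two semantic alerts can start as a few lines run over a window of recent runs. In this sketch the thresholds, the function names, and the nearest-rank percentile method are my assumptions; a real alerting system would normally compute these for you.

```python
def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def semantic_alerts(judge_scores, costs_usd, baseline_score, cost_p99_limit,
                    max_score_drop=0.5):
    """Return the alert names that should page for this window of runs."""
    alerts = []
    mean_score = sum(judge_scores) / len(judge_scores)
    if baseline_score - mean_score > max_score_drop:
        alerts.append("judge_score_drop")
    if p99(costs_usd) > cost_p99_limit:
        alerts.append("cost_p99_spike")
    return alerts
```

Both inputs come straight from the trace: judge scores from the eval spans and cost from the per-run totals, which is one more reason to get check one right first.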

That last one is what makes the list compound. Every incident that gets past the first five checks becomes a frozen eval case within 24 hours, so the same failure cannot ship twice. The list is not a gate you pass once. It is the loop that keeps the agent honest after you stop watching it.
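
Turning an incident into a frozen case can be mostly mechanical. A possible sketch, assuming the trace is a dict with `trace_id` and `spans` and that the eval set lives in a JSON Lines file; the field names and both helpers are hypothetical.

```python
import json
import time


def eval_case_from_incident(trace, expected, note):
    """Build a frozen eval case from the trace that caused an incident.

    The field names are assumptions for this sketch, not a standard schema.
    """
    return {
        "id": f"incident-{trace['trace_id']}",
        "trajectory": {"spans": list(trace["spans"])},
        "expected": expected,
        "note": note,
        "frozen_at": time.time(),
    }


def append_case(path, case):
    """Append one case as a JSON line so the eval set only ever grows."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(case, sort_keys=True) + "\n")
```

Opening the file in append mode is the design choice that matters here: postmortems add cases, and nothing in the normal workflow rewrites or removes old ones.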

Run it as a conversation

Print the six checks. Walk them with another engineer a week before go-live. For each one, either it is done, it is waived with a written reason and a named owner, or it is blocking. When the list is clean, set the flag to a small cohort and open the trajectory viewer in another tab.
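
If you would rather keep the walk-through in the repo than on paper, the three allowed states fit in a small structure. This is only a sketch; `CheckStatus` and `ready_to_ship` are hypothetical names, and the rule it encodes is the one above: a waiver counts only with a written reason and a named owner.

```python
from dataclasses import dataclass


@dataclass
class CheckStatus:
    name: str
    state: str  # "done", "waived", or "blocking"
    reason: str = ""
    owner: str = ""


def ready_to_ship(statuses):
    """Return (ready, problems) for a list of CheckStatus entries."""
    problems = []
    for s in statuses:
        if s.state == "blocking":
            problems.append(f"{s.name}: blocking")
        elif s.state == "waived" and not (s.reason and s.owner):
            problems.append(f"{s.name}: waiver needs a reason and an owner")
        elif s.state not in ("done", "waived"):
            problems.append(f"{s.name}: unknown state {s.state!r}")
    return not problems, problems
```

The list of problems doubles as the agenda for the conversation: every entry is something a named person has to either fix or sign off on.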

The point is not the checkboxes. It is turning "is this ready" from a vibes question into a set of questions with answers. Vibes favor whoever wants to ship. A list makes someone say which items they are waiving, and lets someone else say no.

Start small. Pick an internal agent nobody will sue you over. Wire the trace, freeze the evals, cap the budget, audit the tools, write the runbook. Watch it for a week. Then build the one the lawyers care about.

If you want the long version of any of these six, the two books cover it. Agents in Production is the machinery: the loop, the guardrails, the kill switch, the postmortem template. Observability for LLM Applications is the instruments underneath it: spans, traces, evals, and cost per call. They are the two halves of The AI Engineer's Library, and together they are the argument behind this checklist.