- Book: AI Agents Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Sattyam Jain documented a production agent that burned $4,200 in 63 hours over a Friday-to-Monday window in April 2026. The agent called the same tool, hit the same 429, re-planned, and called it again. Roughly 4,800 cycles per hour, running all weekend before anyone checked the dashboard. The loop had no notion of cumulative cost or wall-clock time, and no memory of prior calls.
That story is not unique. The Operator Collective catalogued ten production agent failures, including a multi-agent research tool that ran for eleven days before anyone noticed and posted a $47,000 OpenAI invoice. Replit's coding agent deleted a production database during a code freeze and then generated fake user records when asked about the deletion (Tom's Hardware writeup; Replit CEO Amjad Masad's public apology on X; Jason Lemkin's SaaStr writeup). The 50-line "build an agent in an afternoon" tutorials all skip the parts that bite when nobody is watching.
The 50-line agent every tutorial ships
Here is the version you have read a dozen times. Tool registry, message list, loop until the model emits a final answer.
```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MODEL = os.environ.get("AGENT_MODEL", "gpt-4o-mini")

def get_weather(city: str) -> str:
    return f"Weather in {city}: 18C, light rain."

TOOLS = {"get_weather": get_weather}

SCHEMA = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run(task: str) -> str:
    msgs = [{"role": "user", "content": task}]
    while True:
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=SCHEMA
        )
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            return m.content
        for call in m.tool_calls:
            args = json.loads(call.function.arguments)
            out = TOOLS[call.function.name](**args)
            msgs.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(out),
            })
```
Hand this "What's the weather in Lisbon?" and it finishes in two turns. Hand it the failure-triggering task below and the meter starts spinning.
The task that breaks it
```python
TASK = (
    "Get the weather in Atlantis. If the tool errors, "
    "retry until you get a real answer. Do not give up."
)
```
Now modify the tool to raise an exception for unknown cities. The prompt tells the model to retry. Nothing in the loop says stop after N attempts, abort if the same call repeats, or kill the trace if cost crosses a threshold. So it does not stop.
```python
def get_weather(city: str) -> str:
    if city.lower() == "atlantis":
        raise RuntimeError("429: rate limited")
    return f"Weather in {city}: 18C, light rain."
```
Run it under a token meter and the trace expands until you Ctrl-C. The Operator Collective writeup characterises the shape plainly: nothing in the loop knows when to quit. The agent has no concept of "done."
Guardrail 1 — per-tool retry budget
The first failure mode is the retry storm. The model retries the same tool call after every error because it was told to. There is no per-tool counter anywhere in the loop.
The fix is a dictionary keyed by tool name with a hard ceiling. When a tool exceeds its budget, you do not call the tool again. You push a synthetic error into the message list so the model sees "this tool is exhausted" and has to pick something else (or give up).
```python
RETRY_CAP = 3
retries: dict[str, int] = {}
```
In the call loop, replace the naive tool dispatch with:
```python
name = call.function.name
retries[name] = retries.get(name, 0) + 1
if retries[name] > RETRY_CAP:
    out = f"ERROR: {name} exceeded {RETRY_CAP} retries"
else:
    try:
        out = TOOLS[name](**args)
    except Exception as e:
        out = f"ERROR: {e}"
```
Nine lines. The agent now sees its own ceiling. After three failed get_weather calls, it gets ERROR: get_weather exceeded 3 retries and is forced to either return the failure to the user or stop.
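The dispatch logic is testable on its own, without an API key. Here is a minimal standalone sketch with a stub tool that always fails; `flaky_weather` and `dispatch` are illustrative names, not part of the agent loop above:

```python
# Standalone sketch of the per-tool retry budget, with a stub tool
# that always raises. Names here are illustrative only.
RETRY_CAP = 3
retries: dict[str, int] = {}

def flaky_weather(city: str) -> str:
    raise RuntimeError("429: rate limited")

TOOLS = {"get_weather": flaky_weather}

def dispatch(name: str, args: dict) -> str:
    retries[name] = retries.get(name, 0) + 1
    if retries[name] > RETRY_CAP:
        # budget exhausted: synthetic error, tool is never called
        return f"ERROR: {name} exceeded {RETRY_CAP} retries"
    try:
        return TOOLS[name](**args)
    except Exception as e:
        return f"ERROR: {e}"

for attempt in range(5):
    print(dispatch("get_weather", {"city": "atlantis"}))
# The first three attempts surface the tool's 429 error; the
# fourth and fifth return the exhausted-budget message without
# touching the tool at all.
```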
Guardrail 2 — fingerprint loop detector
Retry caps catch retry storms on a single tool. They do not catch the subtler loop where the agent calls tool_a, then tool_b, then tool_a again, alternating forever. The shape Sattyam Jain documented is exactly this: plan, call, 429, re-plan, call the same thing.
The detector hashes each tool call (name + sorted args) and tracks the last N fingerprints. If the same fingerprint appears more than twice in the recent window, you abort the trace.
```python
from collections import deque

WINDOW = 6
seen: deque[str] = deque(maxlen=WINDOW)
```
Inside the tool dispatch, before executing:
```python
fp = f"{name}:{json.dumps(args, sort_keys=True)}"
if list(seen).count(fp) >= 2:
    raise RuntimeError(f"loop: {fp} repeated in last {WINDOW}")
seen.append(fp)
```
The agent gets one repeat (sometimes a real retry is legitimate), but the second repeat in the window kills the trace. The window size is the knob: too small and you false-positive on legitimate batches, too large and the loop runs longer before detection. Six works for a 3–5 tool loop; tune up if you batch reads.
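The detector is also testable in isolation. A minimal sketch, with `record` as an illustrative helper name standing in for the dispatch-time check:

```python
# Standalone sketch of the fingerprint loop detector.
# `record` is an illustrative helper, not part of the agent loop.
import json
from collections import deque

WINDOW = 6
seen: deque[str] = deque(maxlen=WINDOW)

def record(name: str, args: dict) -> None:
    # sort_keys makes the fingerprint stable regardless of arg order
    fp = f"{name}:{json.dumps(args, sort_keys=True)}"
    if list(seen).count(fp) >= 2:
        raise RuntimeError(f"loop: {fp} repeated in last {WINDOW}")
    seen.append(fp)

record("get_weather", {"city": "atlantis"})  # first call: fine
record("get_weather", {"city": "atlantis"})  # one repeat: tolerated
try:
    record("get_weather", {"city": "atlantis"})  # second repeat: aborts
except RuntimeError as e:
    print(e)
```

Because `json.dumps(..., sort_keys=True)` canonicalises the arguments, `{"city": "x", "units": "c"}` and `{"units": "c", "city": "x"}` produce the same fingerprint.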
Guardrail 3 — cost ceiling per trace
Retry caps and loop detectors stop most failures. They do not stop the case where the model legitimately calls many different tools, each succeeds, but the cumulative token cost crosses your sanity threshold. This is the one that ate the $4,200 bill. Every individual call looked fine.
Track usage from the API response and abort when the trace cost crosses a hard ceiling. Pricing is approximate and changes; the point is the guard, not the precise number.
```python
MAX_USD = 0.50
# gpt-4o-mini, as of April 2026: see https://openai.com/api/pricing/
PRICE_IN = 0.15 / 1_000_000   # $/input token
PRICE_OUT = 0.60 / 1_000_000  # $/output token
spent = 0.0
```
After every chat completion:
```python
u = r.usage
spent += u.prompt_tokens * PRICE_IN
spent += u.completion_tokens * PRICE_OUT
if spent > MAX_USD:
    return f"ABORTED: trace cost ${spent:.4f} > ${MAX_USD}"
```
The trace cannot exceed fifty cents no matter what the model decides to do. For production, this number lives in config and varies per route — a research agent's ceiling is not a chat agent's ceiling.
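To see how long a runaway loop survives under that ceiling without spending real money, here is a toy simulation; the per-turn token counts are invented for illustration:

```python
# Toy simulation of the cost ceiling; the 20k-in / 1k-out per-turn
# token counts are made up, chosen because a growing message list
# makes input tokens dominate in practice.
MAX_USD = 0.50
PRICE_IN = 0.15 / 1_000_000
PRICE_OUT = 0.60 / 1_000_000

spent = 0.0
turns = 0
while True:
    turns += 1
    spent += 20_000 * PRICE_IN + 1_000 * PRICE_OUT  # ~$0.0036/turn
    if spent > MAX_USD:
        break
print(turns, f"${spent:.4f}")
```

At roughly a third of a cent per turn, the ceiling trips after a bit over a hundred turns. Without it, a weekend-long loop at thousands of cycles per hour is exactly the $4,200 shape.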
The 80-line version, all three guardrails inline
```python
import json
import os
from collections import deque

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MODEL = os.environ.get("AGENT_MODEL", "gpt-4o-mini")

RETRY_CAP, WINDOW, MAX_USD = 3, 6, 0.50
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6

def get_weather(city: str) -> str:
    if city.lower() == "atlantis":
        raise RuntimeError("429: rate limited")
    return f"Weather in {city}: 18C, light rain."

TOOLS = {"get_weather": get_weather}

SCHEMA = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run(task: str) -> str:
    msgs = [{"role": "user", "content": task}]
    retries: dict[str, int] = {}
    seen: deque[str] = deque(maxlen=WINDOW)
    spent = 0.0
    while True:
        r = client.chat.completions.create(
            model=MODEL, messages=msgs, tools=SCHEMA
        )
        # guardrail 3: cost ceiling
        u = r.usage
        spent += u.prompt_tokens * PRICE_IN
        spent += u.completion_tokens * PRICE_OUT
        if spent > MAX_USD:
            return f"ABORTED: cost ${spent:.4f} > ${MAX_USD}"
        m = r.choices[0].message
        msgs.append(m)
        if not m.tool_calls:
            return m.content
        for call in m.tool_calls:
            name = call.function.name
            args = json.loads(call.function.arguments)
            # guardrail 2: fingerprint loop detector
            fp = f"{name}:{json.dumps(args, sort_keys=True)}"
            if list(seen).count(fp) >= 2:
                out = f"ERROR: loop on {fp}"
            else:
                seen.append(fp)
                # guardrail 1: per-tool retry budget
                retries[name] = retries.get(name, 0) + 1
                if retries[name] > RETRY_CAP:
                    out = f"ERROR: {name} > {RETRY_CAP} retries"
                else:
                    try:
                        out = TOOLS[name](**args)
                    except Exception as e:
                        out = f"ERROR: {e}"
            msgs.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(out),
            })

if __name__ == "__main__":
    print(run("Get weather in Atlantis. Retry until it works."))
```
Eighty lines including the schema and the main block. Same agent, three guardrails, runnable.
Before and after, on the failure-triggering task
The 50-line version on the Atlantis task: open-ended loop, every retry is a real API call, you Ctrl-C after a minute and check the dashboard.
The 80-line version on the same task: the model calls get_weather("atlantis") and gets the 429 error. It retries, the counter trips at three, and the trace returns ERROR: get_weather > 3 retries. The model produces a final message: "I tried three times and the weather service is rate-limited. I cannot get Atlantis weather right now." Estimated cost: under a cent. Estimated wall-clock time: under five seconds (gpt-4o-mini, single trace).
Strip the retry cap and the loop detector but keep the cost ceiling, and the trace still terminates — at fifty cents instead of a few thousand. The cost ceiling alone caps the bill; the retry cap and loop detector cap the wasted calls before the bill matters.
What this does not solve
Three guardrails do not give you a production agent. They give you a tutorial that does not bankrupt you when the model goes sideways.
Real production needs more. Action-class separation (read-only tools versus destructive tools) is the missing piece in the Replit incident. Observability so the trace is readable when it fails is the missing piece in most of the postmortems linked above. The five-guardrail version is the natural next step in a follow-up.
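Action-class separation itself fits in a few lines. This is a sketch of the idea only, not code from any of the incidents above; the tool names and the `destructive` flag are invented for illustration:

```python
# Minimal sketch of action-class separation: tag each tool as
# read-only or destructive, and block destructive calls unless the
# trace explicitly holds permission. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    fn: Callable[..., str]
    destructive: bool = False

TOOLS = {
    "read_row": Tool(lambda table, key: f"{table}[{key}] = ..."),
    "drop_table": Tool(lambda table: f"dropped {table}", destructive=True),
}

def dispatch(name: str, allow_destructive: bool, **args) -> str:
    tool = TOOLS[name]
    if tool.destructive and not allow_destructive:
        return f"BLOCKED: {name} is destructive and this trace is read-only"
    return tool.fn(**args)

print(dispatch("read_row", allow_destructive=False, table="users", key=1))
print(dispatch("drop_table", allow_destructive=False, table="users"))
```

A read-only default with an explicit, per-trace grant for destructive tools is the property that a code freeze should have enforced.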
If this was useful
The AI Agents Pocket Guide covers the patterns these guardrails come from: retry budgets, loop detection, cost-per-trace, and the action-class separation that would have stopped Replit. Short book, no fluff, written for engineers who already know they need this and want the shape to copy from.