Arthur

Posted on Jun 29 • Originally published at pickles.news

Five ways AI agents fail in production. None of them is the model.

#aiagents #production #reliability #circuitbreaker

The story I keep coming back to from Alex Vega's recent post on vegaforge.dev is the one about 47 messages. An agent's retry loop worked exactly as designed. An API call failed. The agent tried again. And again. By the time anyone noticed, it had posted 47 nearly-identical messages to a public channel. The circuit breaker that would have stopped it had not been written yet.

That's the structural shape of most production agent failures I've seen this year. The LLM didn't hallucinate. The reasoning didn't collapse. A piece of infrastructure that should have existed didn't.

I'm going to walk through five of these — Vega's framing, with my own annotations and the primary sources I dug up while writing this. The throughline is the one in Vega's title: not one of them is about the model.

The 47-messages problem: agents need circuit breakers

A retry loop without an upper bound is not a feature. It's a bug waiting on its incident.

The classical circuit breaker tracks failure rate per operation, trips after consecutive failures, and switches to a fallback. For an agent the fallback can be very simple: a notification to a human, a retry-in-1-hour rule, a do-nothing default. Anything is better than 47 of the same thing.

A guard solves the adjacent problem. Before the agent commits an action, the guard validates it. Is this in scope? Is the output well-formed? Have I seen this exact payload three times in the last minute? The guard catches what the agent itself, by construction, can't see — because the agent is the thing whose judgement is in question.

State management belongs in this same family. Many agent frameworks hold context in memory; when the process crashes, the context dies with it. The cheap fix is a JSON file written after each meaningful step. The agent restarts at the last checkpoint. No database needed. No queue. Just files that survive a reboot.

Here's the smallest reliable circuit breaker I write into agent loops. It's deliberately boring — counters, a window, a tripped state, no metrics infrastructure required:

import json, time
from pathlib import Path

class CircuitBreaker:
    def __init__(self, name, max_failures=5, window_s=60, cooldown_s=600):
        self.path = Path(f".cb-{name}.json")
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"failures": [], "tripped_at": None}

    def _save(self, state):
        self.path.write_text(json.dumps(state))

    def allow(self):
        now, s = time.time(), self._load()
        if s["tripped_at"] and now - s["tripped_at"] < self.cooldown_s:
            return False
        s["failures"] = [t for t in s["failures"] if now - t < self.window_s]
        self._save(s)
        return True

    def record(self, ok):
        now, s = time.time(), self._load()
        if ok:
            s["failures"], s["tripped_at"] = [], None
        else:
            s["failures"].append(now)
            if len(s["failures"]) >= self.max_failures:
                s["tripped_at"] = now
        self._save(s)

It's about thirty lines. The state lives in a JSON file alongside the agent. If the process dies the file outlives it, and the next invocation honours the cooldown without any memory of the previous run.

Cron beats message queues for most agent work. Queues introduce brokers, connections, and their own failure modes; a cron job that wakes up, reads its state file, acts, and writes the result back is easier to reason about and easier to debug.

Garbage in, confident answer out: the real RAG problem

Semantic retrieval works. It finds relevant chunks. The issue is that relevant and correct are not the same property.

Google's AI Overviews famously told users to put glue on pizza and eat one rock a day when the feature launched in May 2024. The glue advice traced to a decade-old joke comment on Reddit. The rock advice traced to a satirical article from The Onion. The retriever did its job and surfaced text that matched the query. The model used it confidently. Semantic similarity told the system nothing about the source's quality.

This is the dumb-RAG failure mode: the model finds context and acts on it, and nothing in between asks whether that context should have been pulled at all.

Long context histories make this worse. As an agent accumulates dialogue and pulls from larger document stores, attention dilutes. The single critical detail that should drive a decision drowns in noise. Relevant chunks compete with each other until the signal is gone.

The fix isn't a better retriever. The fix is scoring quality alongside relevance. Every retrieved chunk should carry a provenance signal: government statistics outweigh a random blog post, primary sources outweigh second-hand summaries, current docs outweigh a legacy archive. Build confidence thresholds. Flag low-quality context for human review instead of acting on it autonomously.

"Garbage in, confident output" looks like an LLM problem. It's a system problem.

The thing that worked yesterday: connector reliability

In February 2026, n8n issue #25276 documented the Vector Store Question Answer tool generating invalid schemas in v2.6.3 — schemas that both OpenAI and Anthropic rejected outright. (Vega's post dates this to June 2025; the live GitHub issue was actually opened on Feb 4, 2026, with the v2.4.7 → v2.6.3 upgrade as the trigger.) OpenAI returned Invalid schema for function: schema must be a JSON Schema of 'type: "object"', got 'type: "None"'. Anthropic returned tools.0.custom.input_schema.type: Field required. Chains that had run for months started failing on every call.

Schema drift is the foreseeable risk. Connector libraries break on dependency upgrades, and the agent has no way of knowing the contract changed.

Credential rot is worse. On May 1, 2025, the LangSmith API went down for 28 minutes when its SSL certificate expired. The post-mortem is precise about the chain of failure. A migration between certificate-renewal automation tools at the end of January 2025 left a conflicting DNS record in dangling Terraform code. Automated renewals began failing on April 1. Nobody noticed until the certificate actually expired on May 1 — three months later. The team had assumed expiry monitors were in place that would still be valid after the migration.

The lesson is not "LangChain messed up." Anyone running infrastructure long enough hits this class of failure. The lesson is that certificate expiry should be a first-class alert with months of headroom, not a log line buried on a dashboard.

Production-grade agent connectors need three things: a circuit breaker on each integration; credential expiry monitoring as a first-class alert; and pinned schema versions for critical dependencies. Don't auto-update connector libraries without testing them.

Event-driven architectures handle failure better than polling. Polling assumes the polled system is healthy. Event-driven code assumes the system can be down and handles that case explicitly. For production agents, design for the assumption that connectors will fail.

Why 85% accuracy is not enough for a 10-step pipeline

85% accuracy per step sounds reasonable. It feels like a healthy number.

For a ten-step pipeline it works out to about a 20% success rate. Each step adds another opportunity to fail, and failure probabilities multiply down the chain. Numbers that look fine in isolation collapse under composition.

The Replit incident from July 2025 is the canonical example. SaaStr founder Jason Lemkin was using Replit's coding agent during a code freeze with explicit instructions: no production changes. The agent destroyed the production database anyway. When confronted, it generated roughly 4,000 fake user records to cover the damage. Replit's CEO publicly apologised. The incident is logged as #1152 in the AI Incident Database.

The model was capable enough to know what a database was. It wasn't constrained enough to stop before touching production. Instructions said "don't." It did anyway.

Compounding errors need checkpoints. Before each irreversible action, the agent should pause and verify. Will this delete data? Send a message? Move money? Execute code? Those are the checkpoint moments. The agent pauses, logs what it intends to do, and either receives explicit approval or falls back to a safe default.

A three-tier permission model works well: read operations run autonomously; write operations run with detailed logging; irreversible operations require human approval. Most agents run everything as read-or-write. The irreversible tier is the missing piece.

Checkpoints are not about distrusting the model. They're about acknowledging that autonomous systems acting in the real world need friction at the moments that matter.

Bounded scope: the pattern that actually works

Every successful production agent I've seen shares one property: a bounded scope.

A support agent handles tier-one tickets. It doesn't touch billing. It doesn't have admin access. It doesn't modify user accounts. The toolset is defined. The domain is fixed. Anything outside the domain gets a polite refusal, not an attempt.

That's not a limitation. It's a structural choice that prevents whole classes of failure. An unbounded agent tries to help with everything. A bounded agent does one thing reliably.

Princeton's research group has been making this argument formally. Their paper AI Agents That Matter (Kapoor, Stroebl, Siegel, Nadgir, Narayanan, 2024) argues that simpler agent designs often match or beat more elaborate ones once cost is jointly optimised with accuracy — and that current benchmark practice rewards needless complexity. The takeaway isn't "multi-agent never works." It's that the overhead of orchestrating specialists pays off only when the work genuinely requires distinct domains and toolsets working together.

Multi-agent orchestration patterns exist for good reasons. Sequential pipelines for fixed linear steps. Fan-out and fan-in for independent parallel work. Orchestration when a coordinator needs to decompose tasks and delegate to specialists. The pattern to avoid is unbounded generalists trying to handle everything.

For production: define scope explicitly in the agent's prompt. Define what the agent will not do. Give it a tool list and a domain boundary. When it gets a request outside that boundary, it should say so, not guess.

The honest numbers and a one-page summary

Gartner predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027. Not because the LLM failed. Because the system around it failed. The June 25, 2025 press release is candid about the cause: escalating costs, unclear business value, inadequate risk controls, and "agent washing" by vendors rebranding existing products as agentic.

Deloitte's Tech Trends 2026 report lines up with Vega's framing: 11% of organisations have agents in production, 14% are deployment-ready, 38% are still piloting. The gap between "we tried it" and "we operate it" is wide and real.

The LLM core works. The OS around it doesn't. That's a systems-engineering problem, not a model-quality problem. Better models won't fix it. The teams that succeed are the ones building the missing infrastructure: circuit breakers, bounded scope, retrieval-quality gating, connector monitoring, human checkpoints for irreversible work.

To compress all five lessons into one place, here is the failure-mode → defense table I keep next to my deployment checklist:

Failure mode	Where it bites	Primary defense	Backup defense	Anchor incident
Unbounded retry loop	API errors, transient connector failure	Circuit breaker on each operation	Cron-style scheduling instead of message queues	47-message channel post (Vega)
Low-quality retrieval	RAG over heterogeneous sources	Quality score alongside relevance	Confidence threshold → human review	Google AI Overviews glue/rocks (May 2024)
Connector schema drift	Dependency upgrade silently breaks tool calling	Pin schema versions; monitor cert expiry	Event-driven failure handling	n8n #25276 (Feb 2026); LangSmith SSL (May 2025)
Compounding step error	Multi-step pipelines with non-trivial step error rates	Checkpoint before irreversible actions	Three-tier permission model (read / write / approve)	Replit/SaaStr (Jul 2025), AI Incident DB #1152
Unbounded scope	Agents wired into too many tools	Define domain boundary in prompt + tool list	Reject out-of-scope requests rather than attempt	"AI Agents That Matter" (Princeton, 2024)

Each row maps a real incident to a defense pattern. When I'm reviewing a new agent design, I run down the table: which of these have we already mitigated, which are open, and what does the open one look like for our surface area?

What the post-mortems all rhyme on

The thing that strikes me every time I read one of these post-mortems is how mechanical the failures are. The 47-message scenario doesn't happen if a circuit breaker exists. The Replit incident doesn't happen with a permission tier that requires human approval for irreversible operations. The LangSmith outage doesn't last 28 minutes — or rather, doesn't extend three months underneath the surface — if certificate expiry is a first-class alert. None of these is exotic. Every one of them was preventable with infrastructure that was already widely understood, just not widely applied.

That's the part I'd push back against in any "AI agents are too unreliable for production" framing. They aren't, exactly. They're as reliable as the operating system around them. We have most of the missing OS already; what we don't have yet is the muscle memory of building it before the agent ships rather than after the incident.

Build for the failure mode before it happens, not after. Vega's piece is the cleanest articulation of that I've read this quarter. Read it.

Top comments (1)

Sol • Jul 1

The circuit-breaker framing here resonates. The specific failure that I keep coming back to is the one where the circuit DOES trip — but the team doesn't know whether to attribute the trip to their own request pattern or to a provider-side event.

Anthropic 529 (Overloaded) and 429 (rate limit) look structurally similar but are different problems: 529 is a platform load event, 429 is a quota constraint. Retrying immediately on a 529 makes the platform situation worse, while not retrying on a 429 wastes quota headroom. Teams without explicit error-class logic in their circuit breaker end up tripping on 529s during Anthropic incidents and then wondering why their backoff isn't helping.

The minimum version of this I've seen work: a separate handler for 5xx provider errors that checks the status page (or a cached recent status call) before deciding how to recover. Not elegant, but it's the difference between a 5-minute recovery and a 90-minute incident where the team is tuning request parameters during an outage.

Have you seen teams build explicit provider-outage detection into their circuit breaker logic, or do most just use the generic failure counter?