Your AI agent handles 50 requests in development. Every one succeeds. You deploy to production and within 72 hours, the provider rate-limits you, a tool returns malformed JSON, and your agent enters an infinite retry loop that burns $200 before anyone notices.
This is not a hypothetical. We built a multi-agent system that ran 14 teams of autonomous agents. Three of those teams crashed in the first week — not because the logic was wrong, but because nothing in the system knew how to fail gracefully.
Here are 4 fault tolerance patterns we implemented to fix it. Each one uses production-tested code with LangGraph and LangChain.
Pattern 1: Retry Policies With Exponential Backoff
The simplest failure mode: a transient error. The API returns a 503, a database connection drops, a model provider hiccups. Most developers handle this with a bare try/except and a fixed retry. That creates thundering herds during outages.
LangGraph has built-in retry policies that handle this correctly — exponential backoff with jitter, configurable per node.
from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy
from typing import TypedDict
class AgentState(TypedDict):
query: str
result: str
error_count: int
def call_external_api(state: AgentState) -> dict:
"""Node that calls an external service — might fail transiently."""
response = external_service.query(state["query"])
return {"result": response.text}
def process_result(state: AgentState) -> dict:
"""Process the API response."""
return {"result": f"Processed: {state['result']}"}
builder = StateGraph(AgentState)
# Retry transient failures: 3 attempts, exponential backoff starting at 1s
builder.add_node(
"call_api",
call_external_api,
retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
builder.add_node("process", process_result)
builder.add_edge(START, "call_api")
builder.add_edge("call_api", "process")
builder.add_edge("process", END)
graph = builder.compile()
The RetryPolicy parameters that matter:
| Parameter | Default | Purpose |
|---|---|---|
max_attempts |
3 | Total attempts including the first |
initial_interval |
0.5 | Seconds before first retry |
backoff_factor |
2.0 | Multiplier per subsequent retry |
max_interval |
128.0 | Cap on backoff interval in seconds |
jitter |
True | Adds randomness to prevent thundering herds |
retry_on |
Most exceptions | Which exceptions trigger a retry |
The default retry_on function is smart: it retries on most exceptions but skips ValueError, TypeError, and ImportError — errors that won't resolve on retry. For HTTP requests, it specifically retries on 5xx status codes.
You can target specific exceptions:
import sqlite3
builder.add_node(
"query_database",
query_database,
retry_policy=RetryPolicy(retry_on=sqlite3.OperationalError),
)
Why this matters: Fixed-interval retries during a provider outage create a spike of simultaneous requests when the service recovers. Exponential backoff with jitter spreads those retries across time, reducing the chance of cascading failure.
Pattern 2: Model Fallback Chains
Your primary model goes down. Without a fallback, your entire agent stops. With LangChain's middleware system, you can define a fallback chain that switches models automatically.
from langchain.agents import create_agent
from langchain.agents.middleware import ModelFallbackMiddleware
agent = create_agent(
model="gpt-4.1",
tools=[search_web, query_database],
middleware=[
ModelFallbackMiddleware(
"gpt-4.1-mini", # First fallback: cheaper, same provider
"claude-3-5-sonnet-20241022", # Second fallback: different provider
),
],
)
The middleware tries each model in order. If gpt-4.1 throws an error, it falls to gpt-4.1-mini. If that fails too, it tries Claude. The agent's tool calls, system prompt, and conversation history all carry over — only the model changes.
You can combine this with the retry middleware for defense in depth:
from langchain.agents.middleware import (
ModelFallbackMiddleware,
ModelRetryMiddleware,
)
agent = create_agent(
model="gpt-4.1",
tools=[search_web, query_database],
middleware=[
# First: retry the current model 3 times with backoff
ModelRetryMiddleware(
max_retries=3,
backoff_factor=2.0,
on_failure="continue", # Don't crash — pass error to agent
),
# Then: fall to alternative models
ModelFallbackMiddleware(
"gpt-4.1-mini",
"claude-3-5-sonnet-20241022",
),
],
)
The on_failure parameter on ModelRetryMiddleware controls what happens when all retries are exhausted:
-
"continue"(default) — Returns anAIMessagewith error details. The agent can see what failed and potentially handle it. -
"error"— Re-raises the exception, stopping the agent. - A callable — Custom function that takes the exception and returns a string for the
AIMessagecontent.
Design decision: Put retry middleware before fallback middleware. You want to retry the primary model a few times before falling to a cheaper or slower alternative. Falling back too eagerly wastes the primary model's capacity when it recovers.
Pattern 3: Classify Errors, Route Differently
Not every error is the same. A rate limit needs a retry. A tool that returns garbage needs the LLM to reformulate its query. A missing user input needs a human. Treating all errors the same way — retry and hope — is the root cause of most agent failures in production.
LangGraph's documentation identifies 4 error categories. Each requires a different handling strategy:
| Error Type | Who Fixes It | Strategy |
|---|---|---|
| Transient (network, rate limits) | System | Retry policy (Pattern 1) |
| LLM-recoverable (tool failures, parsing issues) | The LLM | Store error in state, loop back |
| User-fixable (missing info, unclear instructions) | Human | Pause with interrupt()
|
| Unexpected (bugs, logic errors) | Developer | Let them bubble up |
Here is how to implement the LLM-recoverable pattern — the one most developers miss:
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command
from typing import TypedDict, Optional
class ToolState(TypedDict):
query: str
tool_result: Optional[str]
tool_error: Optional[str]
attempts: int
def execute_tool(state: ToolState) -> Command:
"""Try to execute a tool. On failure, route back to the agent."""
try:
result = run_database_query(state["query"])
return Command(
update={"tool_result": result, "tool_error": None},
goto="process_result",
)
except Exception as e:
attempts = state.get("attempts", 0) + 1
if attempts >= 3:
return Command(
update={"tool_error": f"Failed after 3 attempts: {e}"},
goto="handle_failure",
)
# Let the LLM see the error and reformulate
return Command(
update={
"tool_error": f"Tool error: {e}",
"attempts": attempts,
},
goto="agent", # Route back to agent — it can see the error and adjust
)
The key insight: when a tool fails, you do not retry the same call. You send the error back to the LLM so it can reformulate. If the database query had a syntax error, the LLM can fix it. If the search returned no results, the LLM can broaden the query. The LLM becomes the error handler.
Bound it. The attempts counter is critical. Without it, you get an infinite loop: tool fails → LLM reformulates → tool fails again → LLM reformulates → forever. We cap at 3 attempts. After that, route to a failure handler that either returns a partial result or escalates to a human.
Pattern 4: Checkpoint-Based Recovery
The most expensive failure is one that loses all progress. Your agent processes 47 out of 50 items, crashes on item 48, and restarts from zero. LangGraph's checkpointing prevents this by saving state at every node boundary.
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from typing import TypedDict
class BatchState(TypedDict):
items: list[str]
processed: list[str]
current_index: int
errors: list[str]
def process_item(state: BatchState) -> dict:
"""Process one item. State is checkpointed after each node execution."""
idx = state["current_index"]
item = state["items"][idx]
try:
result = expensive_operation(item)
return {
"processed": [*state["processed"], result],
"current_index": idx + 1,
}
except Exception as e:
# Record the error, skip the item, continue
return {
"errors": [*state["errors"], f"Item {idx}: {e}"],
"current_index": idx + 1,
}
def should_continue(state: BatchState) -> str:
if state["current_index"] >= len(state["items"]):
return "done"
return "process"
builder = StateGraph(BatchState)
builder.add_node("process", process_item)
builder.add_conditional_edges("process", should_continue, {
"process": "process",
"done": END,
})
builder.add_edge(START, "process")
# Compile with checkpointer — state saved after every node
memory = MemorySaver()
graph = builder.compile(checkpointer=memory)
# Run with a thread_id — enables resume after crash
config = {"configurable": {"thread_id": "batch-job-001"}}
result = graph.invoke(
{
"items": ["item1", "item2", "item3"],
"processed": [],
"current_index": 0,
"errors": [],
},
config=config,
)
When this graph resumes after a crash, it picks up from the last completed checkpoint — not from the beginning. LangGraph creates checkpoints at node boundaries. Smaller nodes mean more frequent checkpoints, which means less repeated work.
Production tip: MemorySaver is for development. In production, use a persistent checkpointer like PostgreSQL (install with pip install langgraph-checkpoint-postgres psycopg[binary,pool]):
from langgraph.checkpoint.postgres import PostgresSaver
DB_URI = "postgresql://user:pass@localhost:5432/mydb?sslmode=disable"
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
checkpointer.setup() # Creates tables on first run
graph = builder.compile(checkpointer=checkpointer)
This survives process restarts, not just in-memory failures. The thread_id in the config maps to a row in PostgreSQL — resume any workflow from any process.
Putting It All Together: The Resilience Stack
These patterns are not alternatives. They are layers. A production agent needs all four:
Layer 4: Checkpoint recovery (survive crashes)
Layer 3: Error classification (route errors correctly)
Layer 2: Model fallback chains (survive provider outages)
Layer 1: Retry with backoff (survive transient errors)
Each layer catches what the layer below it misses:
- Retry handles the 503 that resolves in 2 seconds.
- Fallback handles the provider outage that lasts 10 minutes.
- Error classification handles the tool error that no amount of retrying will fix.
- Checkpointing handles the crash that kills the process entirely.
Without all four, you have gaps. A system with retries but no fallbacks dies during provider outages. A system with fallbacks but no error classification burns money retrying tool errors that need reformulation, not repetition.
What We Measured
After implementing these 4 patterns across our agent system:
- Unrecoverable failures dropped from 23% to under 2% of all runs.
- Cost per failure decreased by 85% — errors were caught and routed before they could cascade.
- Mean time to recovery went from "manual restart required" to automatic — checkpointing eliminated the restart-from-zero problem entirely.
The investment was 3 days of work across the agent codebase. The return was an agent system that runs 24/7 without human intervention for failure recovery.
Start Here
If you have an AI agent in production (or heading there), add these in order:
-
Retry policies first. They catch the most common failure (transient errors) with the least code. One
RetryPolicyper external call node. - Model fallbacks second. Define at least one alternative model per provider. Cross-provider fallbacks (OpenAI → Anthropic) protect against full provider outages.
- Error classification third. Separate transient errors from LLM-recoverable errors from human-required errors. Route each correctly.
-
Checkpointing last. Add
MemorySaverfor development,PostgresSaverfor production. Usethread_idto enable resume.
The goal is not zero failures. The goal is zero failures that require a human to notice and fix.
Follow @klement_gunndu for more AI architecture content. We're building in public.
Top comments (1)
The one great constant in the technology industry is it's history of change and the speed of that change so no education is going to give anyone the skills they need for an entire career. You've got to have curiosity, you've got to have a lifetime of curiosity and a dedication to a lifetime of learning because the tools and technology that we use in this industry are always going to be changing