Richard Dillon

LangGraph Fault Tolerance: Building Resilient Agents with Retries, Timeouts, and Error Handlers

Your agent completed 90% of a complex research task, made fourteen successful API calls, and then hit a transient rate limit on the fifteenth. Now it's dead. Checkpoints won't save you here—they tell you where the agent stopped, not how to recover gracefully. This gap between state persistence and active recovery has been the single largest source of operational burden for teams running production agents, and LangGraph's new fault tolerance primitives finally close it.

The timing matters. As organizations move from proof-of-concept agents to production deployments handling thousands of daily invocations, the economics of manual intervention become untenable. A support agent that requires human restarts 15% of the time isn't a productivity gain—it's a liability. The new @retry decorator, TimeoutPolicy class, and ErrorHandler nodes represent LangGraph's first comprehensive answer to this challenge, building on the framework's existing resilient agent architecture while addressing the operational realities of 2026's agentic workloads.

The Problem: Why Checkpointing Alone Isn't Enough

LangGraph's checkpointing system—whether you're using PostgresSaver, MemorySaver, or the newer distributed options—excels at one job: capturing the complete state of an agent at defined points in execution. When an agent crashes, you can inspect exactly what happened and resume from that state. This is table stakes for any serious agentic system, and LangGraph has done it well.

But checkpointing is fundamentally passive. It answers "where did we stop?" without answering "should we try again?" or "how long should we wait?" or "what's our fallback if this keeps failing?"

Consider the failure modes that dominate production agent deployments. Rate limits from tool APIs are the most common—OpenAI, Anthropic, and every third-party data provider impose them, and they're designed to be transient. A 429 response at 2:15 PM will likely succeed at 2:16 PM. Transient 5xx errors from external services follow similar patterns. LLM provider timeouts spike during high-traffic periods; if your agent runs during peak hours, you'll see these regularly. Network partitions between your agent and external services happen more often than anyone wants to admit.

In multi-agent workflows and the newer Deep Agents architecture, you face an additional challenge: sub-agent hangs. A planning agent delegates to a research sub-agent, which gets stuck waiting for a response that will never come. Without timeouts, your entire workflow freezes.

The real cost isn't technical—it's operational. Every manual restart requires human attention, context switching, and decision-making. Teams running customer-facing agents report that before adopting fault tolerance patterns, they spent significant portions of their on-call rotations simply restarting agents that hit transient failures. The agent development lifecycle extends well beyond deployment, and monitoring becomes firefighting without proper recovery mechanisms.

The conceptual gap is clear: checkpointing defines where to resume, while fault tolerance defines whether and how to retry before giving up. You need both.

Core API: The @retry Decorator

The @retry decorator brings production-grade retry logic to node functions without the boilerplate that previously cluttered every external API call. The basic signature is straightforward:

@retry(max_attempts=3, backoff="exponential", retryable_exceptions=[RateLimitError, TimeoutError])
def call_external_api(state: AgentState) -> AgentState:
    ...

The configuration options address the full spectrum of retry scenarios. max_attempts is an integer that includes the initial attempt—so max_attempts=3 means one initial try plus two retries. The backoff parameter accepts "constant", "linear", or "exponential" strategies, each with configurable base_delay (default 1.0 seconds) and max_delay (default 60 seconds) parameters. Exponential backoff with jitter is the recommended default for API rate limits.
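
To make the three backoff strategies concrete, here is a rough sketch of how the delay before each retry could be computed. The backoff_delay helper and its jitter scheme are assumptions for illustration, not LangGraph's implementation; only the base_delay and max_delay names come from the description above.

```python
import random

def backoff_delay(strategy: str, retry_number: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter: bool = False) -> float:
    """Seconds to wait before retry `retry_number` (1 = first retry)."""
    if strategy == "constant":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * retry_number
    elif strategy == "exponential":
        delay = base_delay * 2 ** (retry_number - 1)
    else:
        raise ValueError(f"unknown backoff strategy: {strategy!r}")
    delay = min(delay, max_delay)  # never wait longer than the cap
    if jitter:
        # "Full jitter": pick uniformly in [0, delay] so clients desynchronize
        delay = random.uniform(0.0, delay)
    return delay

exponential = [
    backoff_delay("exponential", n, base_delay=2.0, max_delay=30.0)
    for n in range(1, 7)
]
print(exponential)  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Note how max_delay flattens the curve from the fifth retry onward; without the cap, exponential backoff quickly produces waits longer than most node timeouts.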

The retryable_exceptions parameter is crucial for correct behavior. Only exceptions in this list trigger retries; all others propagate immediately. This prevents retrying on errors that won't resolve with time—a malformed request will fail identically on every attempt. For more complex scenarios, retry_condition accepts a callable (exception, attempt) -> bool that enables custom logic: "retry rate limits for the first 5 attempts, but only retry timeouts twice."
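
The quoted rule can be written as a small predicate with the (exception, attempt) -> bool shape. This is a sketch under assumptions: attempt is taken to be the 1-based number of the attempt that just failed, and RateLimitError is a stand-in class rather than any provider's real exception.

```python
class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception (assumption)."""

def should_retry(exception: Exception, attempt: int) -> bool:
    """Return True if the attempt that just failed should be retried.

    Assumes `attempt` is the 1-based number of the failed attempt.
    """
    if isinstance(exception, RateLimitError):
        return attempt <= 5   # keep retrying rate limits through attempt 5
    if isinstance(exception, TimeoutError):
        return attempt <= 2   # retry timeouts at most twice
    return False              # anything else propagates immediately

print(should_retry(RateLimitError("429"), 4))   # True
print(should_retry(TimeoutError("slow"), 3))    # False
```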

Integration with LangGraph's state management is seamless and, importantly, safe. Retries operate on the same state snapshot that the original attempt received. There's no risk of partial state corruption from a failed attempt leaking into a retry. The node either succeeds and its state updates are committed, or it exhausts retries and the original state remains unchanged.
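
The snapshot guarantee is easier to see in a toy version. The retry_sketch decorator below is a hypothetical stdlib-only stand-in, not LangGraph's code: it counts the initial try toward max_attempts, retries only listed exceptions, and hands each attempt a fresh deep copy of the input state so a failed attempt cannot leak partial writes.

```python
import copy
import functools
import time

def retry_sketch(max_attempts=3, base_delay=0.0, retryable_exceptions=(Exception,)):
    """Hypothetical stand-in for a node-level retry decorator."""
    def decorator(node):
        @functools.wraps(node)
        def wrapper(state):
            for attempt in range(1, max_attempts + 1):
                snapshot = copy.deepcopy(state)  # each attempt sees pristine input
                try:
                    return node(snapshot)
                except tuple(retryable_exceptions):
                    if attempt == max_attempts:
                        raise  # retries exhausted; the caller's state is untouched
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

calls = []

@retry_sketch(max_attempts=3, retryable_exceptions=[ConnectionError])
def flaky_node(state):
    state["scratch"] = "partial write"   # mutation that must not leak out
    calls.append(dict(state))
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return {"result": "ok"}

original = {"query": "transformers"}
update = flaky_node(original)
print(len(calls), update, original)
# 3 {'result': 'ok'} {'query': 'transformers'}
```

The node ran three times, yet the caller's dictionary never gained the scratch key, which is the behavior the paragraph above describes.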

Observability comes built-in. Each retry emits a RetryAttempt event visible in LangSmith traces, containing the attempt number, delay duration, exception type, and exception message. This means you can track retry rates per node, identify which external services cause the most retries, and tune your max_attempts settings based on real data rather than guesswork.

One implementation detail matters for teams using NVIDIA's parallel execution enhancements: when combining @retry with @independent (the decorator for parallelizable nodes), @retry must be the innermost decorator. This ensures the retry logic wraps the actual node execution rather than the parallelization wrapper.
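
Decorator order is easy to get backwards, so here is a small demonstration using two stand-in decorators (retry_twice and independent_stub are invented for this sketch and only log what they do). The decorator closest to the function wraps it first, so the outer wrapper dispatches once and the retries happen inside that single dispatch.

```python
import functools

log = []

def retry_twice(fn):
    """Stand-in for a retry decorator: up to two attempts."""
    @functools.wraps(fn)
    def wrapper(*args):
        for attempt in (1, 2):
            try:
                return fn(*args)
            except RuntimeError:
                log.append(f"attempt {attempt} failed")
                if attempt == 2:
                    raise
    return wrapper

def independent_stub(fn):
    """Stand-in for a parallelization wrapper: records each dispatch."""
    @functools.wraps(fn)
    def wrapper(*args):
        log.append("dispatch")
        return fn(*args)
    return wrapper

attempts = {"count": 0}

@independent_stub   # outermost: applied last, runs first
@retry_twice        # innermost: wraps the node body directly
def node(state):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient")
    return state

node({})
print(log)  # ['dispatch', 'attempt 1 failed']
```

With the order reversed, the log would show a second dispatch for the retry, meaning the parallelization wrapper itself is being re-run.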

Timeout Policies: Bounding Unbounded Operations

While retries handle failures that announce themselves with exceptions, timeouts protect against operations that simply never return. The TimeoutPolicy class provides granular control at three levels: individual nodes, subgraphs, and entire graph invocations.

The configuration hierarchy reflects how agents actually fail. node_timeout sets the maximum duration for any single node execution—useful when you know that a particular API call should never take more than 30 seconds. tool_timeout applies uniformly to all tool calls within a node, separate from the node's own computation time. graph_timeout sets a wall-clock limit for the entire invocation, preventing runaway agents that loop indefinitely or get stuck in recursive planning cycles.

The configuration pattern attaches to graph compilation:

from langgraph.timeout import TimeoutPolicy

policy = TimeoutPolicy(
    node_timeout=30,      # 30 seconds per node
    tool_timeout=15,      # 15 seconds per tool call
    graph_timeout=300     # 5 minutes total
)

compiled_graph = graph.compile(
    checkpointer=checkpointer,
    timeout_policy=policy
)

Timeout behavior is configurable via the on_timeout parameter. The default "raise" behavior throws a TimeoutError that can be caught by an ErrorHandler (discussed next) or handled in downstream nodes. "interrupt" triggers LangGraph's human-in-the-loop interrupt mechanism, pausing execution for manual review and decision-making. "fallback" routes to a specified fallback node, enabling graceful degradation without human intervention.

The implementation uses asyncio.timeout() (available in Python 3.11 and later) internally for async nodes. Synchronous nodes are wrapped automatically with equivalent behavior, but the async implementation is more efficient, which is another reason to prefer async node functions in production.
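
For intuition, the "raise" and "fallback" behaviors can be approximated in plain asyncio. The run_with_timeout helper below is a hypothetical sketch, not LangGraph's API; it uses asyncio.wait_for so it also runs on Python versions older than 3.11.

```python
import asyncio

async def run_with_timeout(node, state, seconds, on_timeout="raise", fallback=None):
    """Bound an async node's run time (hypothetical helper for illustration)."""
    try:
        return await asyncio.wait_for(node(state), timeout=seconds)
    except asyncio.TimeoutError:
        if on_timeout == "fallback" and fallback is not None:
            return await fallback(state)   # degrade instead of failing
        raise                              # default: surface the timeout

async def hung_tool(state):
    await asyncio.sleep(10)   # simulates a call that does not return in time
    return {"answer": "late"}

async def cached_answer(state):
    return {"answer": "from cache", "degraded": True}

result = asyncio.run(
    run_with_timeout(hung_tool, {}, seconds=0.05,
                     on_timeout="fallback", fallback=cached_answer)
)
print(result)  # {'answer': 'from cache', 'degraded': True}
```

One caveat worth keeping in mind: cancellation works here because the node awaits. A synchronous function blocked in a C call or a socket read cannot be interrupted this way, which is part of why async nodes are the safer choice.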

For teams using LangGraph's multi-agent capabilities, timeout policies integrate with the agent development stack at the orchestration level. Sub-agent timeouts can be configured independently, preventing a misbehaving sub-agent from consuming the entire parent agent's timeout budget.
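
One way to reason about timeout budgets is as a shared deadline from which each sub-agent receives a capped slice. The Deadline class below is an illustrative assumption about how such budgeting could work, not a description of LangGraph internals; the injectable clock exists only to make the example deterministic.

```python
import time

class Deadline:
    """Tracks a wall-clock budget shared by a parent and its sub-agents."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def child_timeout(self, requested: float, reserve: float = 0.0) -> float:
        """Cap a sub-agent's timeout so the parent keeps `reserve` seconds."""
        return max(0.0, min(requested, self.remaining() - reserve))

# Deterministic demo with a fake clock
now = [100.0]
deadline = Deadline(300.0, clock=lambda: now[0])
print(deadline.child_timeout(120.0, reserve=30.0))  # 120.0 (plenty of budget)
now[0] = 350.0                                      # 250 seconds later
print(deadline.child_timeout(120.0, reserve=30.0))  # 20.0 (only 50s left, 30s reserved)
```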

LangSmith surfaces timeout metrics alongside other observability data: timeout_rate per node shows what percentage of invocations hit the timeout, while p99_duration displays your latency distribution with timeout thresholds overlaid. This makes it straightforward to tune timeouts based on actual production behavior rather than guesses.

Error Handler Nodes: Centralized Recovery Logic

Retries and timeouts handle specific failure types, but production agents need a unified place to make recovery decisions. ErrorHandler nodes provide this centralization, replacing scattered try-except blocks with a coherent error recovery architecture.

Registration uses scope-based configuration:

graph.add_error_handler(
    handler_node, 
    scope="global"  # or "subgraph" or ["node_a", "node_b"]
)

Global handlers catch any unhandled exception from any node. Subgraph handlers scope to a specific subgraph, useful when different parts of your agent require different recovery strategies. Node-list scoping targets specific nodes, ideal for handling errors from a cluster of related API calls.
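
When several scopes could match the same failure, something has to decide which handler wins. The sketch below assumes the most specific scope takes precedence (node list, then subgraph, then global); that ordering and the HandlerRegistry class are assumptions made for illustration, so check the behavior of your installed version.

```python
from typing import Callable, Dict, List, Optional

class HandlerRegistry:
    """Hypothetical registry: node scope beats subgraph scope beats global."""

    def __init__(self):
        self._by_node: Dict[str, Callable] = {}
        self._by_subgraph: Dict[str, Callable] = {}
        self._global: Optional[Callable] = None

    def register(self, handler: Callable, *, nodes: Optional[List[str]] = None,
                 subgraph: Optional[str] = None) -> None:
        if nodes:
            for name in nodes:
                self._by_node[name] = handler
        elif subgraph:
            self._by_subgraph[subgraph] = handler
        else:
            self._global = handler

    def resolve(self, node: str, subgraph: Optional[str] = None) -> Optional[Callable]:
        """Return the most specific handler registered for a failed node."""
        if node in self._by_node:
            return self._by_node[node]
        if subgraph in self._by_subgraph:
            return self._by_subgraph[subgraph]
        return self._global

def global_handler(ctx): return "alert on-call"
def research_handler(ctx): return "degrade research"
def news_handler(ctx): return "skip news"

registry = HandlerRegistry()
registry.register(global_handler)
registry.register(research_handler, subgraph="research")
registry.register(news_handler, nodes=["query_news"])

print(registry.resolve("query_news", "research").__name__)   # news_handler
print(registry.resolve("query_arxiv", "research").__name__)  # research_handler
print(registry.resolve("write_report").__name__)             # global_handler
```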

The handler node receives an ErrorContext object containing everything needed for intelligent recovery decisions:

class ErrorContext:
    exception: Exception          # The caught exception
    failed_node: str              # Name of node that raised
    state: AgentState             # Current state snapshot
    attempt_history: list         # Retry attempts if @retry was used
    trace_id: str                 # Correlation ID for LangSmith

The attempt_history field is particularly valuable—it tells you not just that a node failed, but how many times it failed and what exceptions occurred on each attempt. A node that fails once with a timeout is different from a node that exhausted five retries with rate limit errors.

Handler return values control execution flow via the Command pattern:

def error_handler(context: ErrorContext) -> Command:
    if isinstance(context.exception, RateLimitError):
        # Route to degraded-mode node
        return Command(goto="degraded_synthesis")
    elif isinstance(context.exception, TimeoutError):
        # Interrupt for human review
        return Command(interrupt="Timeout on critical operation")
    else:
        # Abort with diagnostic payload
        return Command(
            abort=True, 
            result={"error": str(context.exception), "trace_id": context.trace_id}
        )

The Command(resume=True) option is particularly powerful—it retries the failed node with a reset retry counter. This enables "escalate and retry" patterns where the handler might first try rate limit backoff, then switch API keys, then finally give up.

State modification before routing is supported via Command(update={...}). This enables patterns like marking a data source as unavailable in state before routing to a synthesis node that should work with partial data.

Two patterns emerge as particularly useful in production. The "circuit breaker" pattern tracks failure rates over time (using state or external storage) and switches to degraded mode after a threshold—useful for agents that should continue operating even when primary data sources are unavailable. The "escalation" pattern creates human-in-the-loop interrupts for specific error types while handling routine failures automatically, respecting the principle that agentic systems should augment human decision-making rather than eliminate it entirely.
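
A circuit breaker needs very little machinery. The class below is a minimal stdlib sketch under stated assumptions: it counts consecutive failures, opens at a threshold, and allows a trial call once a cooldown has passed. The names and the consecutive-failure policy are illustrative choices; in an agent, the counters would live in graph state or external storage rather than in memory.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        """True if the protected source may be called right now."""
        if self._opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial call through
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()   # (re)open and restart the cooldown

# Deterministic demo with a fake clock
now = [0.0]
breaker = CircuitBreaker(threshold=2, cooldown=30.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())   # False: two consecutive failures opened the circuit
now[0] = 30.0
print(breaker.allow())   # True: cooldown elapsed, a trial call is permitted
```

A node would call allow() before hitting the source and route to the degraded path when it returns False, recording the outcome of each real call afterward.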

Hands-On: Code Walkthrough

Let's build a research agent that demonstrates all three fault tolerance primitives. The agent queries three external APIs (arXiv, Wikipedia, and a news service), synthesizes results, and generates a report. This is a common pattern in production agents, and it exposes exactly the failure modes fault tolerance addresses.

from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, START, END
from langgraph.retry import retry
from langgraph.timeout import TimeoutPolicy
from langgraph.errors import ErrorContext, Command
from langsmith import traceable
import httpx
import asyncio

# State definition captures both data and operational metadata
class ResearchState(TypedDict):
    query: str
    arxiv_results: Optional[List[dict]]
    wikipedia_results: Optional[List[dict]]
    news_results: Optional[List[dict]]
    unavailable_sources: List[str]  # Track which sources failed
    synthesis: Optional[str]
    final_report: Optional[str]

# Custom exceptions for clear retry targeting
class RateLimitError(Exception):
    pass

class SourceUnavailableError(Exception):
    pass

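# Helper: extract title and link from arXiv's Atom XML response
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_arxiv_response(xml_text: str) -> List[dict]:
    root = ET.fromstring(xml_text)
    return [
        {
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip(),
            "url": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
        }
        for entry in root.findall("atom:entry", ATOM_NS)
    ]

# Node 1: arXiv API with retry for rate limits and transient errors
@retry(
    max_attempts=3,
    backoff="exponential",
    base_delay=2.0,
    max_delay=30.0,
    # Retry only failures that can clear up on their own; a 400 or 404 will not
    retryable_exceptions=[RateLimitError, SourceUnavailableError, httpx.TimeoutException]
)
@traceable(name="query_arxiv")
async def query_arxiv(state: ResearchState) -> dict:
    """Query arXiv API for academic papers matching the research query."""
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.get(
            "https://export.arxiv.org/api/query",
            params={"search_query": state["query"], "max_results": 5}
        )
        # Handle rate limits explicitly to trigger retry
        if response.status_code == 429:
            raise RateLimitError(f"arXiv rate limit hit: {response.headers.get('Retry-After', 'unknown')}")
        # Treat 5xx responses as transient so they are retried
        if response.status_code >= 500:
            raise SourceUnavailableError(f"arXiv returned {response.status_code}")
        response.raise_for_status()  # remaining 4xx errors are not retried

        results = parse_arxiv_response(response.text)
        # Return only this node's key: parallel nodes must not all write the full state
        return {"arxiv_results": results}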

# Node 2: Wikipedia API with similar retry pattern
@retry(
    max_attempts=3,
    backoff="exponential",
    retryable_exceptions=[RateLimitError, httpx.TimeoutException]
)
@traceable(name="query_wikipedia")
async def query_wikipedia(state: ResearchState) -> ResearchState:
    """Query Wikipedia API for relevant encyclopedia entries."""
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "list": "search",
                "srsearch": state["query"],
                "format": "json"
            }
        )
        if response.status_code == 429:
            raise RateLimitError("Wikipedia rate limit")
        response.raise_for_status()

        data = response.json()
        results = data.get("query", {}).get("search", [])
        # Return only this node's key: parallel nodes must not all write the full state
        return {"wikipedia_results": results}

# Node 3: News API (third-party, less reliable)
@retry(
    max_attempts=2,  # Fewer retries for less critical source
    backoff="constant",
    base_delay=5.0,
    retryable_exceptions=[RateLimitError, httpx.TimeoutException]
)
@traceable(name="query_news")
async def query_news(state: ResearchState) -> ResearchState:
    """Query news API for recent coverage. Optional source—failure is acceptable."""
    async with httpx.AsyncClient(timeout=8.0) as client:
        response = await client.get(
            "https://newsapi.example.com/search",
            params={"q": state["query"]},
            headers={"Authorization": "Bearer NEWS_API_KEY"}  # placeholder: substitute a real key loaded from configuration
        )
        if response.status_code == 429:
            raise RateLimitError("News API rate limit")
        response.raise_for_status()

        results = response.json().get("articles", [])
        # Return only this node's key: parallel nodes must not all write the full state
        return {"news_results": results}

# Synthesis node - no retry needed, operates on local data
@traceable(name="synthesize_results")
async def synthesize_results(state: ResearchState) -> ResearchState:
    """Combine results from available sources into unified synthesis."""
    available_results = []

    if state.get("arxiv_results"):
        available_results.append(f"Academic sources: {len(state['arxiv_results'])} papers found")
    if state.get("wikipedia_results"):
        available_results.append(f"Encyclopedia: {len(state['wikipedia_results'])} entries found")
    if state.get("news_results"):
        available_results.append(f"News: {len(state['news_results'])} articles found")

    # Note which sources were unavailable for transparency
    unavailable = state.get("unavailable_sources", [])

    synthesis = f"Research synthesis for: {state['query']}\n"
    synthesis += f"Available sources: {', '.join(available_results) or 'None'}\n"
    if unavailable:
        synthesis += f"Unavailable sources: {', '.join(unavailable)}\n"

    # In production, this would call an LLM to generate actual synthesis
    return {**state, "synthesis": synthesis}

# Error handler with scoped recovery logic
@traceable(name="research_error_handler")
def research_error_handler(context: ErrorContext) -> Command:
    """
    Central error handling for research API nodes.
    Strategy:
    - Rate limits after retry exhaustion: mark source unavailable, continue
    - Timeouts: mark source unavailable, continue (research can proceed with partial data)
    - Unexpected errors: abort with diagnostic info for debugging
    """
    failed_node = context.failed_node
    exception = context.exception
    state = context.state

    # Initialize unavailable_sources if not present
    unavailable = list(state.get("unavailable_sources", []))

    if isinstance(exception, (RateLimitError, SourceUnavailableError, httpx.TimeoutException)):
        # Transient failure after retries exhausted - degrade gracefully
        source_name = failed_node.replace("query_", "")
        unavailable.append(source_name)

        # Log the degradation (swap print for your logger in production)
        print(f"Source {source_name} unavailable after {len(context.attempt_history)} attempts")

        # Update state and continue to synthesis
        return Command(
            update={"unavailable_sources": unavailable},
            goto="synthesize_results"
        )

    elif isinstance(exception, TimeoutError):
        # Graph-level or node-level timeout - more serious
        # For research agents, we still try to synthesize what we have
        return Command(
            update={
                "unavailable_sources": unavailable + [f"{failed_node}_timeout"],
            },
            goto="synthesize_results"
        )

    else:
        # Unexpected error - abort with full diagnostic payload
        return Command(
            abort=True,
            result={
                "error_type": type(exception).__name__,
                "error_message": str(exception),
                "failed_node": failed_node,
                "trace_id": context.trace_id,
                "state_snapshot": {k: v is not None for k, v in state.items()}
            }
        )

# Build the graph with fault tolerance
def build_research_agent():
    graph = StateGraph(ResearchState)

    # Add nodes
    graph.add_node("query_arxiv", query_arxiv)
    graph.add_node("query_wikipedia", query_wikipedia)
    graph.add_node("query_news", query_news)
    graph.add_node("synthesize_results", synthesize_results)

    # Parallel API queries, then synthesis
    graph.add_edge(START, "query_arxiv")
    graph.add_edge(START, "query_wikipedia")
    graph.add_edge(START, "query_news")
    graph.add_edge("query_arxiv", "synthesize_results")
    graph.add_edge("query_wikipedia", "synthesize_results")
    graph.add_edge("query_news", "synthesize_results")
    graph.add_edge("synthesize_results", END)

    # Register error handler scoped to API query nodes only
    graph.add_error_handler(
        research_error_handler,
        scope=["query_arxiv", "query_wikipedia", "query_news"]
    )

    # Configure timeout policy
    timeout_policy = TimeoutPolicy(
        node_timeout=60,    # 60 seconds per node (includes retries)
        graph_timeout=300   # 5 minutes total
    )

    # Compile with the timeout policy (also pass checkpointer=... to persist state between runs)
    compiled = graph.compile(
        timeout_policy=timeout_policy
    )

    return compiled

# Usage example
async def main():
    agent = build_research_agent()

    result = await agent.ainvoke({
        "query": "transformer architecture neural networks",
        "unavailable_sources": []
    })

    print(result["synthesis"])
    if result.get("unavailable_sources"):
        print(f"Note: Some sources were unavailable: {result['unavailable_sources']}")

if __name__ == "__main__":
    asyncio.run(main())

When you run this agent and one API fails, you'll see the fault tolerance in action. The @retry decorator handles transient failures with the backoff configured for each node (exponential for arXiv and Wikipedia, constant for the news API). If retries are exhausted, the error handler catches the exception, marks the source as unavailable in state, and routes to synthesis. The agent completes with partial data rather than crashing.

In LangSmith traces, you'll see RetryAttempt events for each retry, the error handler invocation, and the modified routing decision—complete visibility into exactly how the agent recovered.

What This Means for Your Stack

Immediate adoption path: Start by adding @retry to any node that makes external calls. This is the lowest-friction change with the highest impact. Most teams see immediate reduction in failed runs simply by handling transient rate limits and timeouts gracefully.

Migrating from custom retry logic: If you've built manual try/except/sleep patterns around external calls, the @retry decorator replaces 20-50 lines of boilerplate per node. Beyond code reduction, the decorator handles backoff calculation, metric emission, and LangSmith integration automatically. Your custom logic probably doesn't.

Timeout strategy: Begin with generous timeouts—2-3x your observed p99 latency for each node type. Overly aggressive timeouts cause false failures; you can tighten them based on LangSmith metrics once you have production data. The p99_duration metric with timeout threshold overlay makes this tuning straightforward.
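
The 2-3x rule is simple to apply to exported latency samples. This sketch computes p99 with the nearest-rank method and multiplies it; the function name, the 2.5 multiplier, and the sample data are all made up for illustration.

```python
def suggest_timeout(latencies_s, multiplier=2.5, floor_s=1.0):
    """Suggest a timeout as a multiple of the observed p99 (nearest-rank method)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) using integer math
    p99 = ordered[rank - 1]                  # rank is 1-based
    return max(floor_s, multiplier * p99)

# 100 made-up samples: mostly fast, with a slow tail and one outlier
samples = [0.8] * 95 + [2.0, 2.5, 3.0, 4.0, 12.0]
print(suggest_timeout(samples))  # 10.0
```

Here p99 is 4.0 seconds, so the single 12-second outlier does not drive the recommendation, while the resulting 10-second timeout still leaves headroom above normal slow calls.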

ErrorHandler placement: Start with a single global handler that logs errors and emits alerts. This gives you immediate observability into all failures. Add scoped handlers as specific recovery patterns emerge from production data—don't try to anticipate every failure mode upfront.

Multi-agent considerations: For teams using LangGraph's multi-agent workflows, fault tolerance automatically benefits sub-agents. Configure policies at the orchestration level, and sub-agents inherit appropriate timeouts. This prevents the common failure mode of a misbehaving sub-agent consuming resources indefinitely.

Cost awareness: Retries multiply LLM API costs. A node with max_attempts=5 calling Claude 3.5 Sonnet can cost up to 5x what you budgeted per invocation. Set max_attempts conservatively for expensive model calls (often 2 is sufficient for LLM calls), while API calls to external services can tolerate higher retry counts.
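
A quick back-of-the-envelope calculation separates the worst case from the typical case. The sketch below assumes failures are independent and that every attempt is billed in full; neither holds exactly in practice (a rejected rate-limited call may cost nothing), so treat the result as an upper-leaning estimate.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Expected attempts per invocation: 1 + p + p^2 + ... + p^(n-1)."""
    return sum(failure_rate ** k for k in range(max_attempts))

def cost_per_invocation(cost_per_call: float, failure_rate: float, max_attempts: int):
    """Return (expected cost, worst-case cost) in the same unit as cost_per_call."""
    expected = cost_per_call * expected_attempts(failure_rate, max_attempts)
    worst = cost_per_call * max_attempts
    return expected, worst

# A 2-cent call that fails half the time, with max_attempts=3
print(cost_per_invocation(2.0, 0.5, 3))  # (3.5, 6.0)
```

The gap between the two numbers is the useful part: budget for the expected cost, but alert on invocations that approach the worst case, since those indicate a source that is failing persistently rather than transiently.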

Testing fault tolerance: LangSmith Sandboxes support fault injection, enabling chaos testing without mocking your entire infrastructure. Inject rate limits, timeouts, and specific exceptions into production-like runs to validate that your error handlers behave correctly before real failures occur.
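
The same idea can be tried locally before reaching for hosted tooling. The FaultInjector wrapper below is a hypothetical stdlib sketch, unrelated to any LangSmith feature: it raises a scripted sequence of exceptions and then lets calls through, which is enough to unit-test retry and handler logic.

```python
class FaultInjector:
    """Wraps a callable and raises scripted exceptions before letting calls through."""

    def __init__(self, target, faults):
        self._target = target
        self._faults = list(faults)   # exceptions to raise, in order
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self._faults:
            raise self._faults.pop(0)
        return self._target(*args, **kwargs)

def call_with_retries(fn, arg, max_attempts=3):
    """Tiny retry loop standing in for the code under test."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(arg)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise

def fetch(query):
    return {"hits": [query]}

flaky_fetch = FaultInjector(fetch, [TimeoutError("slow"), ConnectionError("reset")])
outcome = call_with_retries(flaky_fetch, "agents")
print(outcome, flaky_fetch.calls)  # {'hits': ['agents']} 3
```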

Observability checklist: Enable retry_rate, timeout_rate, and error_handler_invocations metrics in your LangSmith dashboard. These three metrics tell you whether fault tolerance is working as intended or masking underlying issues that need architectural fixes.

Anti-pattern to avoid: Don't wrap entire graphs in a single retry at the invocation level. This loses the granularity that makes fault tolerance valuable. A graph-level retry doesn't know which node failed, can't route to fallbacks, and may re-execute expensive operations unnecessarily. Use node-level retries with error handlers for precise control.

The broader shift here is from reactive debugging to proactive resilience. The agent development lifecycle no longer ends at deployment—it extends into production operations, and fault tolerance is the bridge between "my agent works" and "my agent works reliably at scale."

What to Build This Week

Project: Fault-Tolerant Data Pipeline Agent

Build an agent that extracts data from three different sources (a public API, a web scraper, and a local database), transforms the combined data, and loads it into a target system. This is a practical ETL pattern where fault tolerance directly impacts whether the pipeline runs unattended.

Implementation requirements:

  1. Each extraction node gets @retry with source-appropriate settings (aggressive retries for your own database, conservative for rate-limited public APIs)
  2. Configure TimeoutPolicy with different tolerances for each phase—extraction can be slow, transformation should be fast
  3. Build an error handler that implements "best effort" semantics: continue with available data if any source fails, but abort if all sources fail
  4. Add a "validation" node after transformation that checks data quality and routes to an error handler if thresholds aren't met
  5. Include LangSmith tracing with custom metadata tags for data quality metrics
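
As a starting point for requirement 3, the "best effort" rule reduces to one decision over the extraction results. The function name and the convention that a failed source is recorded as None are assumptions for this sketch.

```python
def best_effort_decision(results: dict) -> str:
    """Return 'continue' if any source produced data, else 'abort'.

    `results` maps source name -> extracted rows, or None if the source failed.
    An empty list counts as a successful extraction that found nothing.
    """
    available = [name for name, rows in results.items() if rows is not None]
    return "continue" if available else "abort"

print(best_effort_decision({"api": None, "scraper": [{"id": 1}], "db": None}))  # continue
print(best_effort_decision({"api": None, "scraper": None, "db": None}))         # abort
```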

Stretch goal: Add a "circuit breaker" pattern where repeated failures from one source cause the agent to skip that source entirely for subsequent runs (persisted via checkpointing), with automatic re-enablement after a cooldown period.

This project exercises all three fault tolerance primitives in a realistic scenario while producing something genuinely useful for data engineering workflows. The patterns transfer directly to any agent that coordinates multiple unreliable external systems—which is to say, most production agents.



This is part of the Agentic Engineering Weekly series, a deep-dive every Monday into the frameworks, patterns, and techniques shaping the next generation of AI systems.

Follow the Agentic Engineering Weekly series on Dev.to to catch every edition.

Building something agentic? Drop a comment — I'd love to feature reader projects.
