Three Error Recovery Patterns for LLM Agent Tool Failures


Tool failures in LLM agents are not edge cases. They are the normal operating condition. APIs go down. Rate limits fire. Timeouts happen under load. The model calls a tool that returns an error. What happens next determines whether your agent is useful in production or just a demo.

Most agent code handles this poorly. The tool returns an error string. The model tries the same tool again with the same arguments. The error repeats. The model apologizes. The session ends. The user gets nothing.

There are three patterns worth knowing. They escalate in cost and complexity. You use the cheapest one that works for the failure you have.

Pattern 1: Retry

Use retry for transient errors: rate limits, network timeouts, temporary 5xx responses from the upstream service.

The key question before retrying: is this error transient or permanent? Retrying a 401 is wasted effort. Retrying a 429 is correct. tool-error-classify gives you a closed ErrorKind enum so you can branch without parsing error strings.

# pip install llm-retry-py tool-error-classify

from llm_retry import with_retry, RetryConfig
from tool_error_classify import classify_error, ErrorKind

# Configure retry: 3 attempts, exponential backoff with jitter, retry only transient errors
retry_config = RetryConfig(
    max_attempts=3,
    base_delay=1.0,
    max_delay=30.0,
    jitter=True,
    retry_on=[ErrorKind.RATE_LIMIT, ErrorKind.NETWORK_TIMEOUT, ErrorKind.SERVER_ERROR],
    fail_fast_on=[ErrorKind.AUTH_ERROR, ErrorKind.BAD_REQUEST, ErrorKind.NOT_FOUND],
)

def search_web(query: str) -> dict:
    response = search_api.get(query)
    response.raise_for_status()
    return response.json()

# Wrap with retry
search_with_retry = with_retry(search_web, config=retry_config)


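# In your agent tool registry:
TOOLS = {"search_web": search_with_retry}

# Kinds that with_retry actually retries (keep in sync with retry_on above)
RETRYABLE = {ErrorKind.RATE_LIMIT, ErrorKind.NETWORK_TIMEOUT, ErrorKind.SERVER_ERROR}

def call_tool_safe(tool_name: str, args: dict) -> dict:
    # An unknown tool name is a caller bug, so let the KeyError propagate
    tool = TOOLS[tool_name]
    try:
        return tool(**args)
    except Exception as exc:
        kind = classify_error(exc)
        retried = kind in RETRYABLE
        # Return structured failure, not a bare exception string
        return {
            "error": True,
            "kind": kind.value,
            "message": str(exc),
            "retried": retried,  # fail-fast kinds were attempted only once
            "attempts": retry_config.max_attempts if retried else 1,
        }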

Catching every exception is fine. Returning a generic error string is not. The model cannot reason about a generic error. It needs to know whether the failure was auth (give up), rate limit (already retried), or not found (try different arguments). ErrorKind gives it that.
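To make the mechanism concrete, here is a minimal standard-library sketch of what a retry wrapper of this kind might do internally. It is not the llm-retry-py implementation. TransientError, backoff_delay, retry_with_backoff, and the injectable sleep and rng parameters are hypothetical names chosen for illustration, and the "full jitter" delay is one common choice among several.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int, base_delay: float, max_delay: float,
                  rng: random.Random) -> float:
    """Full-jitter delay for a 0-based attempt: uniform in [0, capped exponential]."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 3,
                       base_delay: float = 1.0, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> T:
    """Call fn until it succeeds. Only TransientError is retried;
    any other exception propagates on the first failure."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(backoff_delay(attempt, base_delay, max_delay, rng))
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep and rng as parameters keeps the sketch testable without real waiting. With max_attempts=3 the function sleeps at most twice, because no delay follows the final attempt.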

Pattern 2: Fallback

For when the primary provider is unavailable or too slow. Use a different provider, a cached result, or a degraded version of the same call.

llm-fallback-router maintains an ordered list of providers. It tries the first, catches failures by kind, and moves to the next. It does not require you to change your tool code.

# pip install llm-fallback-router llm-circuit-breaker-py

from llm_fallback_router import FallbackRouter
from llm_circuit_breaker import CircuitBreaker
from tool_error_classify import classify_error, ErrorKind

# Set up the chain: primary -> secondary -> tertiary
router = FallbackRouter(providers=[
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-5.4",
    "openrouter/meta-llama/llama-3.3-70b-instruct",
])

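# Circuit breakers prevent hammering a known-down provider.
# One breaker per provider; thresholds can differ per provider.
breakers = {
    "anthropic/claude-sonnet-4-6": CircuitBreaker(failure_threshold=5, recovery_timeout=60),
    "openai/gpt-5.4": CircuitBreaker(failure_threshold=3, recovery_timeout=120),
}

def get_breaker(provider: str) -> CircuitBreaker:
    # Providers without an explicit entry get a default breaker on first use
    if provider not in breakers:
        breakers[provider] = CircuitBreaker(failure_threshold=3, recovery_timeout=120)
    return breakers[provider]

async def call_model_with_fallback(messages: list, tools: list | None = None) -> str: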
    providers = router.ordered_providers()

    for provider in providers:
        breaker = get_breaker(provider)  # one breaker per provider
        if breaker.is_open():
            continue  # skip this provider, it is in recovery

        try:
            response = await call_provider(provider, messages, tools)
            breaker.record_success()
            return response
        except Exception as exc:
            kind = classify_error(exc)
            breaker.record_failure()

            if kind == ErrorKind.AUTH_ERROR:
                # Auth errors are not transient. Skip immediately.
                continue
            if kind == ErrorKind.QUOTA_EXHAUSTED:
                # Provider is out of quota this billing period. Skip and do not retry.
                router.deprioritize(provider)
                continue
            # Other errors: try next provider

    raise RuntimeError("All providers failed or in circuit-break state")


# Tool-level fallback: different implementation, same interface
async def search_with_fallback(query: str) -> dict:
    try:
        return await primary_search_api.get(query)
    except Exception:
        # Fall back to secondary search with smaller result set
        return await secondary_search_api.get(query, max_results=3)

The circuit breaker prevents the fallback router from repeatedly trying a provider that is already down. Without it, every call tries the broken provider first, waits for it to fail, and pays that latency before reaching a healthy one.
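For readers unfamiliar with the pattern, the sketch below shows the state machine a breaker of this kind typically implements: closed while calls succeed, open after enough consecutive failures, and half-open (one trial call allowed) once the recovery timeout has passed. It is an illustration, not the llm-circuit-breaker-py source. SimpleCircuitBreaker and its clock parameter are hypothetical. The method names mirror the ones used in the example above only for readability.

```python
import time
from typing import Callable, Optional


class SimpleCircuitBreaker:
    """Hypothetical three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def is_open(self) -> bool:
        """True while calls should be skipped. Once recovery_timeout has
        elapsed this returns False so a trial call can go through (half-open)."""
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.recovery_timeout

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; trip the breaker (or re-trip it after a failed
        trial call) once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The injectable clock exists so the timeout logic can be tested without waiting. A failed trial call restarts the recovery window, so a provider that stays down is probed roughly once per recovery_timeout rather than on every request.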

Pattern 3: Graceful Degrade

For when retry and fallback both fail, or when neither applies. You return a partial result, you tell the LLM what failed and why, and you let it decide what to do next.

This is the pattern most agent code skips. The instinct is to hide failures. Do not. The model cannot help the user recover from a failure it does not know about.

from tool_error_classify import classify_error
from tool_output_format import format_tool_result

async def fetch_user_data(user_id: str) -> dict:
    results = {}
    errors = []

    # Try to fetch profile
    try:
        results["profile"] = await profile_api.get(user_id)
    except Exception as exc:
        kind = classify_error(exc)
        errors.append({
            "component": "profile",
            "kind": kind.value,
            "message": "Profile service unavailable",
        })

    # Try to fetch order history (independent of profile)
    try:
        results["orders"] = await orders_api.list(user_id)
    except Exception as exc:
        kind = classify_error(exc)
        errors.append({
            "component": "orders",
            "kind": kind.value,
            "message": "Order history unavailable",
        })

    # Return whatever succeeded, plus a clear failure summary
    if errors and not results:
        return format_tool_result({
            "success": False,
            "errors": errors,
            "partial_data": None,
            "suggestion": "All data sources failed. Ask user to try again later.",
        })

    if errors:
        return format_tool_result({
            "success": "partial",
            "data": results,
            "errors": errors,
            "suggestion": f"Partial data available. Missing: {[e['component'] for e in errors]}",
        })

    return format_tool_result({"success": True, "data": results})

format_tool_result (from tool-output-format) renders the result as structured markdown with consistent headings. The LLM gets a predictable format it can parse. You do not hand-roll the formatting per tool.
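If you are curious what "structured markdown with consistent headings" could look like, here is a small hand-written sketch. It is an assumption about the general shape, not the output of tool-output-format. render_tool_result and the heading names are hypothetical, and the input keys are the ones used in the fetch_user_data example.

```python
import json


def render_tool_result(result: dict) -> str:
    """Render a tool result dict as markdown with a fixed heading order."""
    status = result.get("success")
    # Map booleans to words; pass through string statuses such as "partial".
    label = {True: "ok", False: "failed"}.get(status, str(status))
    lines = [f"## Status: {label}"]

    if result.get("data"):
        lines.append("## Data")
        lines.append(json.dumps(result["data"], indent=2, sort_keys=True))

    if result.get("errors"):
        lines.append("## Errors")
        for err in result["errors"]:
            lines.append(f"- {err['component']} ({err['kind']}): {err['message']}")

    if result.get("suggestion"):
        lines.append("## Suggestion")
        lines.append(result["suggestion"])

    return "\n".join(lines)
```

The point of the fixed order is that the model always finds the status first and the failure list in the same place, regardless of which tool produced the result.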

When NOT to Recover

Some errors are not worth recovering from. Retrying or falling back just delays the inevitable.

Auth errors. A 401 or 403 means the credentials are wrong, expired, or lack permission. No amount of retrying will fix it. Fail fast and surface the auth error to the user or operator.

Quota exhausted. The monthly API quota is gone. Fall back to a different provider if possible, but do not retry the same provider.

Missing required data. The tool needs a value that does not exist, such as a lookup for a user that was deleted. Retrying will always fail. Fail fast and tell the model what is missing.

Data validation errors. The model passed malformed arguments. This is usually a prompt or schema issue, not a transient failure. Return the validation message so the model can correct its arguments instead of repeating the call.

FAIL_FAST_KINDS = {
    ErrorKind.AUTH_ERROR,
    ErrorKind.BAD_REQUEST,
    ErrorKind.NOT_FOUND,
    ErrorKind.QUOTA_EXHAUSTED,
}

def should_retry(exc: Exception) -> bool:
    kind = classify_error(exc)
    return kind not in FAIL_FAST_KINDS
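When the failure arrives as an HTTP status code, the same decision can be sketched without any library. The mapping below is an illustrative assumption, not how tool-error-classify works internally. Kind, kind_from_status, and is_retryable_status are hypothetical names. Quota exhaustion is left out because providers signal it in different ways.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical stand-in for a closed error-kind enum."""
    AUTH_ERROR = "auth_error"
    BAD_REQUEST = "bad_request"
    NOT_FOUND = "not_found"
    RATE_LIMIT = "rate_limit"
    SERVER_ERROR = "server_error"
    UNKNOWN = "unknown"


def kind_from_status(status: int) -> Kind:
    """Map an HTTP status code to a coarse error kind."""
    if status in (401, 403):
        return Kind.AUTH_ERROR
    if status == 404:
        return Kind.NOT_FOUND
    if status == 429:
        return Kind.RATE_LIMIT
    if status in (400, 422):
        return Kind.BAD_REQUEST
    if 500 <= status <= 599:
        return Kind.SERVER_ERROR
    return Kind.UNKNOWN


# Allowlist: retry only kinds known to be transient.
RETRYABLE_KINDS = {Kind.RATE_LIMIT, Kind.SERVER_ERROR}


def is_retryable_status(status: int) -> bool:
    return kind_from_status(status) in RETRYABLE_KINDS
```

Note the design difference from should_retry above: that function retries everything that is not on a fail-fast list, while this sketch retries only what is on an allowlist. The allowlist is the more conservative choice, since an unrecognized error is then never retried.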

Quick-Start Snippet

pip install llm-retry-py llm-circuit-breaker-py llm-fallback-router tool-error-classify tool-output-format

from llm_retry import with_retry, RetryConfig
from llm_circuit_breaker import CircuitBreaker
from tool_error_classify import classify_error, ErrorKind
from tool_output_format import format_tool_result

retry_config = RetryConfig(
    max_attempts=3,
    base_delay=1.0,
    jitter=True,
    retry_on=[ErrorKind.RATE_LIMIT, ErrorKind.NETWORK_TIMEOUT, ErrorKind.SERVER_ERROR],
    fail_fast_on=[ErrorKind.AUTH_ERROR, ErrorKind.BAD_REQUEST, ErrorKind.NOT_FOUND],
)

def my_tool(arg: str) -> dict:
    # your tool logic here
    ...

safe_tool = with_retry(my_tool, config=retry_config)

Related Libraries

Library	What It Does
llm-retry-py	Exponential backoff retry with jitter and error-kind filtering
llm-circuit-breaker-py	Circuit breaker to stop hammering failing providers
llm-fallback-router	Ordered multi-provider fallback chain
tool-error-classify	Closed ErrorKind enum from any exception type
llm-fallback-chain	Sync and async ordered failover across providers
tool-output-format	Render tool results as structured LLM-friendly markdown

What's Next

Error recovery is reactive. You handle failures after they happen. The next layer is proactive: knowing which tools are risky before they fail.

tool-side-effects-tag lets you declare intent (READ, WRITE, IDEMPOTENT, DESTRUCTIVE) at the function level. Your retry policy can then apply different rules automatically: never retry DESTRUCTIVE without an idempotency key, always retry READ with backoff, gate WRITE on budget availability.
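As a rough illustration of the idea, the sketch below tags functions with a side-effect class and derives a retry decision from the tag. It does not use or reproduce the tool-side-effects-tag API. SideEffect, side_effect, and may_retry are hypothetical. Treating IDEMPOTENT as always retryable and untagged tools as DESTRUCTIVE are assumptions made for this example.

```python
from enum import Enum
from typing import Callable, Optional


class SideEffect(Enum):
    READ = "read"
    WRITE = "write"
    IDEMPOTENT = "idempotent"
    DESTRUCTIVE = "destructive"


def side_effect(tag: SideEffect) -> Callable[[Callable], Callable]:
    """Decorator that attaches a tag to a tool without changing its behavior."""
    def decorate(fn: Callable) -> Callable:
        fn.side_effect = tag
        return fn
    return decorate


def may_retry(fn: Callable, idempotency_key: Optional[str] = None,
              budget_available: bool = True) -> bool:
    """Decide whether a failed call to fn may be retried."""
    # Untagged tools are treated as DESTRUCTIVE: assume the worst.
    tag = getattr(fn, "side_effect", SideEffect.DESTRUCTIVE)
    if tag in (SideEffect.READ, SideEffect.IDEMPOTENT):
        return True
    if tag is SideEffect.WRITE:
        return budget_available
    # DESTRUCTIVE: only safe to repeat when an idempotency key is supplied.
    return idempotency_key is not None


@side_effect(SideEffect.READ)
def get_user(user_id: str) -> dict:
    return {"id": user_id}


@side_effect(SideEffect.DESTRUCTIVE)
def delete_user(user_id: str) -> None:
    return None
```

Here may_retry(get_user) is True, while may_retry(delete_user) is False until the caller passes an idempotency_key.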

Combine it with agentvet (static checks before deploy) and you catch misconfigured tools before they hit production, not after they burn a customer's API budget.