Mukunda Rao Katta

Posted on May 25

llm-fallback-chain: Ordered Provider Failover for LLM Calls

#hermeschallenge #ai #python #agents

2am. Your primary provider is down.

Your demo starts in six hours. You wake up to alerts. The Anthropic API is returning 503s. You pull up the code and realize every LLM call is hardwired to claude-sonnet-4-6. There is no fallback. There is no retry across providers. There is just a broken app and a deadline.

This is not a rare scenario. Provider outages happen. Rate limits get hit. New models get added to one provider before another. If your code only knows one provider, any of these events becomes your problem at the worst time.

llm-fallback-chain is a small library that adds ordered provider failover to LLM calls. You define a sequence of providers. When one fails, the next one in the chain runs. If all fail, you get a structured trace of every attempt with the exception from each. The library supports both sync and async clients, and lets you write a skip predicate to skip a provider without even trying it.

The shape of the fix

Here is the problem in one line of real code:

response = anthropic_client.messages.create(model="claude-sonnet-4-6", ...)

If that call raises, your app raises. There is no plan B.

Here is the same call with llm-fallback-chain:

from llm_fallback_chain import FallbackChain, Provider

chain = FallbackChain([
    Provider(name="anthropic", fn=call_anthropic),
    Provider(name="openai",    fn=call_openai),
    Provider(name="gemini",    fn=call_gemini),
])

result = chain.run(prompt="Summarize this contract.")
print(result.value)
print(result.trace)

result.trace is the part that matters after an outage:

[
  Attempt(provider="anthropic", success=False, error="503 Service Unavailable", latency_ms=412),
  Attempt(provider="openai",    success=True,  error=None,                       latency_ms=890),
]

You can see exactly which provider succeeded, how long each call took, and what went wrong with the ones that did not. No digging through logs. No reconstructing the sequence.

The skip predicate lets you opt out of a provider without trying it. Useful when you already know a provider is quota-exhausted:

chain = FallbackChain(
    providers=[...],
    skip_if=lambda p: p.name == "gemini" and quota_exhausted("gemini"),
)

For async code, swap to AsyncFallbackChain and await chain.run(...). The interface is the same.

What it does NOT do

llm-fallback-chain is not a load balancer. It does not distribute calls across providers to reduce latency or spread cost. It tries providers in order and stops at the first success.

It does not normalize responses across providers. If Anthropic returns response.content[0].text and OpenAI returns response.choices[0].message.content, your fn callables are responsible for normalizing to a common shape before returning. The library does not know your response format.

It does not retry the same provider multiple times. For per-provider retries, pair it with llm-retry (another library in this series) inside each fn callable.

It does not manage API keys or clients. You bring your own initialized clients. This keeps the library small and keeps credentials out of the library's scope.

Inside the library: design choices

The core of llm-fallback-chain is a loop over a list of callables. Each callable is wrapped in a try/except. On success, iteration stops. On failure, the exception is recorded and iteration continues.

The Attempt dataclass records four fields: provider name, success flag, error string, and latency in milliseconds. Latency is measured with time.perf_counter() around each fn() call. The error field stores str(exc) from the caught exception.

The FallbackChain and AsyncFallbackChain classes share the same logic. The async version uses asyncio.get_event_loop().run_in_executor() only if you pass a sync fn to an async chain, but the expected pattern is to pass async callables to the async chain.

The skip predicate is called before each attempt. If it returns True, an Attempt is recorded with success=False and error="skipped". This keeps the trace complete even for providers that were never contacted.

All 24 tests run with pytest. There are no external dependencies beyond the standard library. The library does not import anthropic, openai, or any other provider SDK. It works with any callable that returns a value.

When this is useful, and when it is not

This is useful when:

You have real-time workloads where a provider outage should not mean a user-facing error.
You are evaluating multiple providers and want a graceful fallback during testing.
You have different providers for different environments (prod uses Anthropic, staging falls back to OpenAI).
You want structured failure logging without building custom exception handling in every call site.

This is not the right tool when:

You want to route calls by capability or cost before trying. Use a router, not a fallback chain.
You want to run calls in parallel and take the fastest result. This library is sequential.
Your providers have wildly different response schemas and you do not want to write normalization adapters.
You are already using an LLM gateway that handles failover at the infrastructure level.

Install

The library is pending PyPI. Install directly from GitHub:

pip install git+https://github.com/MukundaKatta/llm-fallback-chain

Basic usage:

from llm_fallback_chain import FallbackChain, Provider

def call_anthropic(prompt: str) -> str:
    # your anthropic client call here
    ...

def call_openai(prompt: str) -> str:
    # your openai client call here
    ...

chain = FallbackChain([
    Provider("anthropic", call_anthropic),
    Provider("openai",    call_openai),
])

result = chain.run(prompt="Hello")
print(result.value)   # the response text
print(result.trace)   # list of Attempt objects

Siblings in this series

These libraries pair naturally with llm-fallback-chain:

Library	What it does
`llm-retry`	Per-provider exponential backoff before giving up
`llm-circuit-breaker`	Skip providers that have failed too many times recently
`llm-rate-limit-bucket`	Token-bucket rate limiting per provider
`llm-cost-cap`	Pre-flight USD cost gate before a call is made
`llm-fallback-router`	Route by capability, then fall back if chosen provider fails

A production-grade agent stack might use all of these layered together. llm-fallback-chain is the outermost recovery layer.

What is next

The library needs a PyPI release once the upload queue clears. After that, the roadmap includes:

Weight-based ordering so you can express "prefer Anthropic 80% of the time but use OpenAI as equal weight fallback."
A FallbackChainResult method that re-raises the last exception if you want the chain to be transparent rather than returning a structured result.
An optional hook that fires on each attempt, so you can emit metrics to your observability layer without modifying the library.
A budget parameter that skips providers whose estimated cost exceeds your cap, using the llm-cost-cap library as a backend.

The core loop will stay simple. The goal is one small library that handles exactly one thing: try the next provider when the current one fails.

If you have shipped agents that hit provider outages, the pattern in this library is almost certainly already in your codebase in some form. llm-fallback-chain makes it reusable, testable, and observable.

Part of the Hermes Agent Challenge series. All libraries are on GitHub under MukundaKatta.

DEV Community