Mukunda Rao Katta

Posted on May 25

When Your LLM Provider Goes Down Mid-Demo: llm-fallback-chain

#hermeschallenge #ai #python #agents

Anthropic went down during a live demo

It was a product demo. The kind where you have a room full of stakeholders and a polished script. The app called Anthropic to generate a structured summary. It had worked in staging all week.

Fifteen minutes into the demo, the LLM call started hanging. Then it timed out. Then it errored. Anthropic had an outage. It lasted about 40 minutes.

The app had no fallback. No retry to OpenAI. No fallback to a smaller local model. No graceful degradation. The demo was dead and we were waiting on a status page.

That situation is common. Providers go down. Rate limits get hit. A cold region has elevated latency. And most apps handle it the same way: they fail.

The fix is not complicated. You need a list of providers. You try them in order. When one fails, you move to the next. When all fail, you report what happened.

That is what llm-fallback-chain does.
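The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's actual implementation: `run_with_fallback`, `AllFailed`, and the two fake providers are hypothetical names made up for this example.

```python
from typing import Callable, List, Tuple


class AllFailed(Exception):
    """Raised when every provider in the list has failed (hypothetical name)."""

    def __init__(self, errors: List[Tuple[int, Exception]]):
        super().__init__(f"all {len(errors)} providers failed")
        self.errors = errors


def run_with_fallback(providers: List[Callable[[str], str]], prompt: str) -> str:
    errors: List[Tuple[int, Exception]] = []
    for index, provider in enumerate(providers):
        try:
            return provider(prompt)  # first success wins
        except Exception as exc:
            errors.append((index, exc))  # remember the failure, try the next one
    raise AllFailed(errors)  # nothing worked: report everything that happened


# Fake providers standing in for real API clients.
def flaky_primary(prompt: str) -> str:
    raise ConnectionError("primary is down")


def healthy_backup(prompt: str) -> str:
    return f"backup answered: {prompt}"


answer = run_with_fallback([flaky_primary, healthy_backup], "hello")
print(answer)
```

The sketch moves on after any exception at all. The skip predicate covered later in this post exists because that is not always the right call.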

The shape of the fix

Install the library:

pip install llm-fallback-chain

Define your providers as plain callables and hand them to the chain:

from llm_fallback_chain import FallbackChain
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def call_anthropic(prompt: str) -> str:
    msg = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def call_openai(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

chain = FallbackChain(providers=[call_anthropic, call_openai])
result = chain.run("Summarize this incident in one sentence.")
print(result)

If Anthropic is down, the chain moves to OpenAI. If both fail, you get an AllProvidersFailedError with the full attempt trace.

The async path works the same way. Replace the sync callables with async ones and call await chain.run_async(...):

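import asyncio
import anthropic
import openai
from llm_fallback_chain import FallbackChain

# Async clients from the official SDKs; both read API keys from the environment.
anthropic_async = anthropic.AsyncAnthropic()
openai_async = openai.AsyncOpenAI()

async def call_anthropic_async(prompt: str) -> str:
    msg = await anthropic_async.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

async def call_openai_async(prompt: str) -> str:
    resp = await openai_async.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

chain = FallbackChain(providers=[call_anthropic_async, call_openai_async])

async def main():
    result = await chain.run_async("What is the capital of France?")
    print(result)

asyncio.run(main())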

What it does NOT do

Before going further, here is what the library does not handle:

It does not manage retries within a single provider. If you want exponential backoff for Anthropic before giving up on it, use llm-retry-py on top of your callable. The chain treats each callable as a black box. If it raises, the chain moves on.
It does not know about providers. There is no built-in Anthropic or OpenAI client. You pass any callable that takes a prompt and returns a string. That keeps the library dependency-free and model-agnostic.
It does not handle response validation. If a provider returns a malformed response or an empty string, the chain does not know that is an error. That is your callable's job to detect and raise.
It does not rate-limit or throttle. If you want to avoid hammering every provider in sequence during a sustained outage, pair this with llm-circuit-breaker-py to open the circuit after repeated failures.
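Because the chain treats each callable as a black box, per-provider retries and response validation both live in wrappers around the callable. The sketch below is hand-rolled and is not the API of llm-retry-py or of this library: `with_backoff` and `require_non_empty` are hypothetical helper names, and the `sleep` parameter exists only so the example runs without actually waiting.

```python
import time
from typing import Callable, List

Provider = Callable[[str], str]


def with_backoff(provider: Provider, attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Provider:
    """Retry one provider with exponential backoff before letting the error escape."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapped(prompt: str) -> str:
        for attempt in range(attempts):
            try:
                return provider(prompt)
            except (ConnectionError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # out of retries: the error reaches the chain
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        raise AssertionError("unreachable")

    return wrapped


def require_non_empty(provider: Provider) -> Provider:
    """Turn an empty or whitespace-only response into an exception."""
    def wrapped(prompt: str) -> str:
        text = provider(prompt)
        if not text or not text.strip():
            raise ValueError("provider returned an empty response")
        return text

    return wrapped


# Fake provider that fails twice, then succeeds.
calls = {"n": 0}


def flaky(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"


delays: List[float] = []  # records the requested sleeps instead of sleeping
robust = require_non_empty(with_backoff(flaky, attempts=3, sleep=delays.append))
result = robust("ping")
print(result, delays)
```

A wrapped callable like `robust` can be passed to the chain in place of the raw one. Whether the chain moves on after the `ValueError` depends on the skip predicate in use.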

Inside the lib: the skip predicate

The most interesting design decision in this library is the skip predicate.

The naive fallback loop looks like this: provider fails, try the next one. But that breaks down quickly.

If you send a prompt that exceeds the model's context window, the provider rejects it with a 400 error, and the fallback providers will most likely reject it too. The prompt is the problem, not the provider. Trying five fallback providers wastes a round trip on every one of them. The right behavior is to stop immediately and surface the error.

If you hit a 429 rate limit or a 503 service unavailable, the provider is temporarily unable to serve you. The prompt is fine. The next provider is worth trying.

The skip predicate is a function you pass to the chain. It receives the exception from each failed attempt. It returns True if the chain should skip to the next provider, or False if it should stop and re-raise immediately.

from llm_fallback_chain import FallbackChain

def my_skip_predicate(exc: Exception) -> bool:
    # Skip to next provider only on rate limit or server errors
    if hasattr(exc, "status_code"):
        return exc.status_code in (429, 500, 502, 503, 504)
    # Skip on connection errors and timeouts
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # For anything else (400, 401, 422), stop immediately
    return False

chain = FallbackChain(
    providers=[call_anthropic, call_openai],
    skip_predicate=my_skip_predicate,
)

The library ships a retryable_only predicate as the default. It skips on 429, 500, 502, 503, 504, connection errors, and timeouts. It stops on 400, 401, 403, 404, and 422. That covers the common cases without any config.

If you do not pass a skip predicate, the chain uses retryable_only automatically. If you want to always try every provider regardless of the error, pass lambda exc: True.
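A predicate is a plain function, so it can be checked in isolation without calling any provider. The sketch below repeats `my_skip_predicate` from the example above and runs it against a hypothetical `FakeAPIError`, which only mimics the `status_code` attribute that SDK exceptions usually carry.

```python
class FakeAPIError(Exception):
    """Hypothetical stand-in for an SDK error that exposes an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def my_skip_predicate(exc: Exception) -> bool:
    # Skip to the next provider only on rate limit or server errors
    if hasattr(exc, "status_code"):
        return exc.status_code in (429, 500, 502, 503, 504)
    # Skip on connection errors and timeouts
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # For anything else, stop immediately
    return False


cases = {
    "429 rate limit": FakeAPIError(429),
    "503 unavailable": FakeAPIError(503),
    "400 bad request": FakeAPIError(400),
    "401 unauthorized": FakeAPIError(401),
    "connection dropped": ConnectionError("reset by peer"),
    "timeout": TimeoutError("read timed out"),
    "bug in our own code": KeyError("missing field"),
}

decisions = {name: my_skip_predicate(exc) for name, exc in cases.items()}
for name, skip in decisions.items():
    print(f"{name:<20} -> {'try next provider' if skip else 'stop and re-raise'}")
```

A table of cases like this makes it obvious when a predicate is too eager, for example when it would fail over on a bug in your own code.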

Reading the attempt trace

When all providers fail, AllProvidersFailedError carries a list of Attempt objects. Each attempt records the provider index, the exception, and how long the call took.

from llm_fallback_chain import FallbackChain, AllProvidersFailedError

chain = FallbackChain(providers=[call_anthropic, call_openai])

try:
    result = chain.run("Summarize this.")
except AllProvidersFailedError as e:
    for attempt in e.attempts:
        print(f"Provider {attempt.provider_index}: {attempt.exception} ({attempt.duration_seconds:.2f}s)")

That trace is useful for alerting. If provider 0 fails fast with a 401, your credentials are wrong. If provider 0 fails after 30 seconds with a timeout, it is a latency issue. If all providers fail in sequence within 5 seconds, suspect a shared cause, such as rate limits that are systemic across your accounts.
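Those rules of thumb can be turned into a small triage function for alerts. This is a sketch under assumptions: `Attempt` here is a stand-in dataclass with the three fields described above, not the library's own class, and `FakeAPIError`, `diagnose`, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """Stand-in carrying the three fields the post describes."""
    provider_index: int
    exception: Exception
    duration_seconds: float


class FakeAPIError(Exception):
    """Hypothetical SDK-style error exposing an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def diagnose(attempts: List[Attempt], slow_after: float = 10.0,
             fast_total: float = 5.0) -> List[str]:
    hints: List[str] = []
    for a in attempts:
        status = getattr(a.exception, "status_code", None)
        if status == 401:
            hints.append(f"provider {a.provider_index}: check credentials (401)")
        elif isinstance(a.exception, TimeoutError) and a.duration_seconds >= slow_after:
            hints.append(
                f"provider {a.provider_index}: latency problem "
                f"(timed out after {a.duration_seconds:.0f}s)"
            )
    # Several providers rate-limited in quick succession points to a shared cause.
    total = sum(a.duration_seconds for a in attempts)
    all_rate_limited = all(
        getattr(a.exception, "status_code", None) == 429 for a in attempts
    )
    if len(attempts) > 1 and total <= fast_total and all_rate_limited:
        hints.append("all providers rate-limited quickly: likely a shared cause")
    return hints


mixed_trace = [
    Attempt(0, FakeAPIError(401), 0.2),
    Attempt(1, TimeoutError("read timed out"), 30.0),
]
rate_limited_trace = [
    Attempt(0, FakeAPIError(429), 0.3),
    Attempt(1, FakeAPIError(429), 0.4),
]

for hint in diagnose(mixed_trace) + diagnose(rate_limited_trace):
    print(hint)
```

In a real handler, the list passed to `diagnose` would be `e.attempts` from the `except AllProvidersFailedError as e` block.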

When this is useful

Production apps where provider uptime is in your SLA but not in any single provider's control.
Demos or workshops where you need a working app regardless of which API is having a bad day.
Cost optimization: route to a cheap provider first, fall back to a capable one only when needed.
Geographic failover: primary provider in us-east-1, fallback in eu-west-1.
Model capability tiers: try a fast small model first, fall back to a larger one if the small model fails or returns a low-confidence result (with the right callable logic).
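The "right callable logic" for capability tiers can be as small as a wrapper that raises when the cheap model's answer is not good enough. The sketch below is one way to do it, with hypothetical names throughout: `LowConfidenceError`, `escalate_if_weak`, `skip_on_weak_or_transient`, and the toy `looks_confident` check are all made up for illustration.

```python
from typing import Callable

Provider = Callable[[str], str]


class LowConfidenceError(Exception):
    """Hypothetical error: an answer came back, but it is not good enough."""


def escalate_if_weak(provider: Provider,
                     is_good_enough: Callable[[str], bool]) -> Provider:
    """Wrap a provider so that a weak answer is raised as an error."""
    def wrapped(prompt: str) -> str:
        text = provider(prompt)
        if not is_good_enough(text):
            raise LowConfidenceError(f"rejected answer: {text!r}")
        return text

    return wrapped


def skip_on_weak_or_transient(exc: Exception) -> bool:
    """Skip predicate that treats a weak answer like a transient failure."""
    return isinstance(exc, (LowConfidenceError, ConnectionError, TimeoutError))


# Fake small model and a deliberately crude quality check.
def small_model(prompt: str) -> str:
    return "I'm not sure."


def looks_confident(text: str) -> bool:
    return bool(text.strip()) and "not sure" not in text.lower()


cheap_first = escalate_if_weak(small_model, looks_confident)

try:
    cheap_first("What is 2 + 2?")
    escalated = False
except LowConfidenceError as exc:
    escalated = skip_on_weak_or_transient(exc)

print(escalated)
```

Putting `cheap_first` ahead of a larger model's callable and passing `skip_predicate=skip_on_weak_or_transient` should give the tiered behavior, assuming the chain consults the predicate for every exception type.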

When NOT to use it

If your provider is down for more than a few minutes and you have a queue, use a job queue with retries instead of in-request failover.
If the providers have different output formats, the consuming code needs to know which provider answered. The chain returns a plain string. If you need that metadata, read chain.last_attempt after a successful call.
If your app logic depends on provider-specific features (Anthropic tool use vs OpenAI tool use), the callables need to normalize that at the callable layer. The chain does not do format bridging.
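Normalizing at the callable layer might look like the sketch below, which maps two tool-call response shapes onto one JSON string. Treat the shapes as assumptions: the dicts are modeled on how Anthropic and OpenAI structure tool calls as I understand them, the real SDKs return objects rather than dicts, and `normalize_anthropic` and `normalize_openai` are hypothetical helpers.

```python
import json
from typing import Any, Dict


def normalize_anthropic(response: Dict[str, Any]) -> str:
    """Pick the first tool_use content block and emit a common JSON shape."""
    for block in response["content"]:
        if block.get("type") == "tool_use":
            return json.dumps({"tool": block["name"], "args": block["input"]},
                              sort_keys=True)
    raise ValueError("no tool_use block in response")


def normalize_openai(response: Dict[str, Any]) -> str:
    """Pick the first tool call; its arguments arrive as a JSON-encoded string."""
    tool_calls = response["choices"][0]["message"].get("tool_calls") or []
    if not tool_calls:
        raise ValueError("no tool_calls in response")
    function = tool_calls[0]["function"]
    return json.dumps({"tool": function["name"],
                       "args": json.loads(function["arguments"])},
                      sort_keys=True)


# Sample payloads, written by hand for this example.
anthropic_style = {
    "content": [
        {"type": "text", "text": "Let me check the weather."},
        {"type": "tool_use", "id": "toolu_1", "name": "get_weather",
         "input": {"city": "Paris"}},
    ]
}
openai_style = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {"name": "get_weather",
                             "arguments": "{\"city\": \"Paris\"}"},
            }]
        }
    }]
}

from_anthropic = normalize_anthropic(anthropic_style)
from_openai = normalize_openai(openai_style)
print(from_anthropic)
print(from_anthropic == from_openai)
```

With each callable returning the same string shape, the code that consumes the chain's result no longer needs to care which provider answered.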

Install

pip install llm-fallback-chain

GitHub: MukundaKatta/llm-fallback-chain
Python 3.9 and above
Zero dependencies
24 tests, sync and async paths covered

Siblings

Lib	Boundary	Repo
llm-retry-py	Retry ONE provider with backoff. Different from multi-provider failover.	MukundaKatta/llm-retry-py
llm-circuit-breaker-py	Open/half-open circuit breaker. Composable with failover to stop hammering dead providers.	MukundaKatta/llm-circuit-breaker-py
bedrock-kit	AWS Bedrock wrapper. Natural fallback target when the primary provider is down.	MukundaKatta/bedrock-kit
llm-fallback-router	Similar idea, different shape. Ordered Provider list with a structured interface vs chain of plain callables.	MukundaKatta/llm-fallback-router

The difference between this library and llm-fallback-router is the interface. This library takes plain callables that accept a prompt and return a string. The router takes structured Provider objects with a defined interface. Both do ordered failover. Use whichever fits your codebase structure better.

What is next

A few things that would make this more useful in production:

Provider weights. Right now it is strict ordered priority. Weighted random selection across healthy providers would let you do load splitting, not just failover.
Warmup probes. A background check that pings each provider periodically so the chain knows which ones are healthy before a real request arrives. Combine with llm-circuit-breaker-py for the open/close logic.
Per-attempt timeout. Right now the timeout is whatever your callable enforces. A built-in per-attempt timeout with configurable budget would let the chain give up on a slow provider and move on without waiting for a TCP timeout.
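Until a built-in per-attempt timeout exists, one can be approximated at the callable layer. The sketch below is a workaround under assumptions, not a library feature: `with_timeout` is a hypothetical helper that runs the provider in a worker thread and stops waiting after a deadline. The abandoned call keeps running in the background, so this bounds how long the caller waits, not how long the request lives.

```python
import concurrent.futures
import time
from typing import Callable

Provider = Callable[[str], str]


def with_timeout(provider: Provider, seconds: float) -> Provider:
    """Wrap a sync provider so the caller waits at most `seconds` for it."""
    def wrapped(prompt: str) -> str:
        # No `with` block here: leaving it would wait for the worker to finish.
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(provider, prompt)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            # Before Python 3.11 this is not the builtin TimeoutError, so convert it.
            raise TimeoutError(f"provider exceeded {seconds}s") from None
        finally:
            executor.shutdown(wait=False)  # do not block on the abandoned call

    return wrapped


# Fake providers: one too slow for the budget, one fast.
def slow_provider(prompt: str) -> str:
    time.sleep(0.3)
    return "late answer"


def fast_provider(prompt: str) -> str:
    return "quick answer"


try:
    with_timeout(slow_provider, 0.05)("hi")
    slow_outcome = "returned"
except TimeoutError as exc:
    slow_outcome = f"timed out: {exc}"

fast_outcome = with_timeout(fast_provider, 0.05)("hi")
print(slow_outcome)
print(fast_outcome)
```

Because the wrapper raises the builtin `TimeoutError`, a predicate that skips on timeouts would treat a slow provider the same way as a provider whose own client timed out.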

For now, the library does one thing well: ordered failover with a smart skip predicate and a full attempt trace when everything fails.

If your app has ever died because one API was having a bad day, this is the fix.
