Mukunda Rao Katta

Posted on May 25

llm-fallback-router: Multi-Provider Failover for Python LLM Calls

#hermeschallenge #ai #python #agents

The Production Wake-Up Call

It was 2 AM on a Tuesday. An agent had been processing a batch of 800 documents overnight. Halfway through, Anthropic returned a 529 overloaded error. The agent crashed. No fallback. No retry with a different model. Just a hard stop at document 412.

The fix took 15 minutes to write. But the awareness that this could happen again, at any time, for any provider, for any reason, that took longer to settle. Rate limits, model outages, transient 500s, quota exhaustion on one account while another is fine. Provider failures are not rare. They are scheduled.

Most teams handle this by adding a try/except around the model call, logging the error, and falling back to a hardcoded secondary. That works for one provider pair. It does not scale to three providers. It does not give you a clean audit log. It does not let you plug in observability hooks without touching the core call logic.

llm-fallback-router is a small Python library that makes multi-provider failover explicit and auditable. You give it an ordered list of providers. Each provider is a name and an async callable. When one fails, it logs the attempt and tries the next. When all fail, it raises AllProvidersFailedError with the full attempt history attached.

Shape of the Fix

from llm_fallback_router import FallbackRouter, Provider

async def call_claude(request):
    # your anthropic SDK call here
    ...

async def call_openai(request):
    # your openai SDK call here
    ...

async def call_gemini(request):
    # your google genai call here
    ...

router = FallbackRouter([
    Provider("claude", call_claude),
    Provider("gpt4o", call_openai),
    Provider("gemini", call_gemini),
])

try:
    result = await router.call(request)
    print(result)
except AllProvidersFailedError as e:
    for attempt in e.attempts:
        print(f"{attempt.provider}: {attempt.error}")

Each Provider wraps any async callable. The library does not know what request is. It passes it through unchanged. You own the model-specific shaping.

The on_attempt hook fires after every attempt, whether it succeeded or failed. This is where you plug in your metrics or logs.

def log_attempt(attempt):
    print(f"provider={attempt.provider} ok={attempt.ok} latency_ms={attempt.latency_ms}")

router = FallbackRouter(providers, on_attempt=log_attempt)

The skip predicate lets you short-circuit based on the error type. If Claude returns a 400 bad request, that is your input, not Claude's problem. No point trying the same request on OpenAI.

def skip_if_bad_input(error):
    return isinstance(error, BadRequestError)

router = FallbackRouter(providers, skip=skip_if_bad_input)

What It Does NOT Do

No automatic model selection based on task type. No cost routing. No load balancing across providers of equal priority. No streaming support in v0.1.0. No provider health checks between calls.

The ordering is fixed. Provider 1 always gets the first shot. If you want round-robin or cost-weighted routing, this is not the right library. Those patterns are valid, but they require a routing layer with state. This library is stateless by design.

It also does not own the retry logic within a single provider. If you want exponential backoff before giving up on Claude, use llm-retry-py inside your call_claude function before handing off to the router. The two libraries compose cleanly.

Inside the Lib

The core loop is short. Under 80 lines for the router itself.

async def call(self, request):
    attempts = []
    for provider in self.providers:
        start = time.monotonic()
        try:
            result = await provider.fn(request)
            attempt = Attempt(provider=provider.name, ok=True,
                              latency_ms=elapsed(start))
            attempts.append(attempt)
            if self.on_attempt:
                self.on_attempt(attempt)
            return result
        except Exception as e:
            attempt = Attempt(provider=provider.name, ok=False,
                              error=e, latency_ms=elapsed(start))
            attempts.append(attempt)
            if self.on_attempt:
                self.on_attempt(attempt)
            if self.skip and self.skip(e):
                break
    raise AllProvidersFailedError(attempts)

The Attempt dataclass carries the provider name, a boolean, the raw exception, and the latency in milliseconds. AllProvidersFailedError stores the full list, so you can analyze what happened after the fact.

No global state. No background threads. No connection pool. The router is a plain object. You can create multiple routers with different provider sets for different call types.

Type annotations throughout. The request and result types are generic. The library does not import any LLM SDK. It stays decoupled so you can wire it to Anthropic, OpenAI, Cohere, local Ollama, anything.

When Useful / When Not

Useful when you have budget on multiple providers and uptime matters. Useful when you are running overnight batch jobs that cannot be manually restarted at 2 AM. Useful when you want a simple audit trail of which provider actually served each call.

Not useful for single-provider setups where you just want retry. Use llm-retry-py for that. Not useful when provider selection depends on the query content. Not useful when you need streaming responses, which v0.1.0 does not support.

The sweet spot is unattended batch processing with a fallback budget and a logging requirement.

A good mental model: this library sits between your agent loop and your LLM client. The agent loop calls router.call(request) instead of calling a specific provider directly. The router handles the rest. Your loop does not need to know which provider actually answered.

Install

pip install llm-fallback-router

PyPI publish is in the queue. During the 429 cooldown window, clone and install from GitHub:

git clone https://github.com/MukundaKatta/llm-fallback-router
cd llm-fallback-router
pip install -e .

The library has no runtime dependencies beyond the Python standard library. All LLM SDK imports stay in your provider callables, not in the router.

Siblings

Library	What it does	Language
`llm-fallback-router-rs`	Same pattern, Rust async	Rust
`llm-fallback-chain`	Sequential with skip predicate, no attempt log	Python
`llm-retry-py`	Exponential backoff, single provider	Python
`llm-circuit-breaker-py`	Stop trying a provider after N failures	Python
`agent-deadline`	Per-task cooperative deadline	Python
`llm-rate-limit-bucket`	Token bucket rate limiter	Python

Note: llm-fallback-chain and llm-fallback-router are related but distinct. Chain focuses on sequential skip logic. Router focuses on the attempt log and the on_attempt hook for observability. If you only need "try next provider on failure," chain is lighter. If you need to audit every attempt and hook into metrics, use router.

What Is Next

v0.2.0 targets:

Streaming support. The async generator case needs some design work to avoid buffering the full response before detecting a failure. The current approach waits for the full response before deciding if the attempt succeeded.
Provider weight config. An optional float per provider for weighted first-selection, while keeping the existing ordered fallback for failures. This helps spread load without losing the fallback guarantee.
Async on_attempt hook. The current hook is sync. Async hooks would allow writing attempt records to a database or queue without blocking the next provider attempt.
Provider timeout. An optional per-provider max wait time. If a provider takes longer than the timeout, it is treated as a failure and the next provider is tried.
Named provider groups. A way to tag providers by region or capability, then select a group at call time. Useful for agents that process different request types with different provider preferences.

Pull requests welcome at MukundaKatta/llm-fallback-router. The project is part of the Hermes Agent Challenge sprint, focused on small composable Python libraries for LLM agent infrastructure.

The 2 AM document batch now finishes every time.

DEV Community