What Actually Breaks When You Add LLM Failover?
A lot of teams say they want “LLM failover” as if it were a single feature.
In production, it is usually not one feature.
It is a bundle of decisions about retries, fallback targets, route health, timeout behavior, and what kind of degradation you are willing to accept before the whole application looks broken.
That is why adding a second model or second endpoint often creates a strange result:
you technically have more redundancy, but the system becomes harder to reason about under failure.
We ran into this while building XiDao API, an OpenAI-compatible gateway, and while putting together a small failover/routing demo. The most useful lesson was that failover usually breaks around the edges first — not in the happy-path request.
The first mistake: treating retry and fallback as the same thing
A retry says:
“try the same route again.”
A fallback says:
“try a different route.”
Those are not interchangeable.
If the primary backend is unhealthy, a retry loop can make things worse by stacking more traffic onto the same broken path.
That is why the first production question is not:
do we have a backup model?
It is:
what conditions should move this request off the primary route at all?
For example:
- timeout or connection failure may justify fast fallback
- rate-limit pressure may justify bounded retry before fallback
- malformed request errors should not fail over at all
- tool-calling incompatibility should route only to known-compatible models
This sounds obvious when written down, but a lot of “multi-model” demos collapse these cases into one catch-all exception block.
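To make the distinction concrete, here is a minimal sketch of that split using the openai Python SDK's exception types. The route URLs, retry bound, and backoff below are placeholders, not values from any particular gateway:

```python
import time

from openai import OpenAI, APIConnectionError, APITimeoutError, BadRequestError, RateLimitError

# Two OpenAI-compatible backends; the URLs and keys here are placeholders.
PRIMARY = OpenAI(base_url="https://primary.example/v1", api_key="primary-key")
FALLBACK = OpenAI(base_url="https://fallback.example/v1", api_key="fallback-key")

def call_with_policy(request: dict, max_retries: int = 2):
    """Retry stays on the same route; fallback switches routes; caller errors do neither."""
    for attempt in range(max_retries + 1):
        try:
            return PRIMARY.chat.completions.create(**request)
        except BadRequestError:
            # Caller-side problem: another provider will not fix a malformed request.
            raise
        except (APITimeoutError, APIConnectionError):
            # The route itself looks unhealthy: stop retrying and switch routes.
            break
        except RateLimitError:
            if attempt == max_retries:
                break  # bounded retry exhausted: move off the primary route
            time.sleep(2 ** attempt)  # back off, then retry the same route
    return FALLBACK.chat.completions.create(**request)
```

The shape is the point: the same-route loop is bounded, caller errors escape immediately, and only specific failure types ever reach the fallback route.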
The second mistake: no failure classification
The easiest failover implementation is usually something like:
```python
try:
    return primary()
except Exception:
    return fallback()
```
That is also how you end up hiding real bugs.
If the request is malformed, if the schema assumptions changed, or if the caller sent an unsupported parameter, falling back to another provider does not solve the root problem. It just makes the failure harder to diagnose.
A more useful split is:
- caller-side problems
- temporary upstream problems
- route-specific incompatibilities
- budget / policy-driven reroutes
Each one should map to a different routing decision.
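As a sketch of how that split can be made explicit (the class names and the decision table are illustrative, not taken from the repo):

```python
from enum import Enum, auto

class FailureClass(Enum):
    CALLER_ERROR = auto()        # malformed request, unsupported parameter, bad schema
    TRANSIENT_UPSTREAM = auto()  # timeouts, connection resets, 429s and 5xx under load
    ROUTE_INCOMPATIBLE = auto()  # this route cannot serve this workload (e.g. no tool calling)
    POLICY_REROUTE = auto()      # budget or policy decision, not really a failure

# Each class maps to its own routing decision instead of one shared catch-all.
DECISIONS = {
    FailureClass.CALLER_ERROR: "return the error to the caller, do not fail over",
    FailureClass.TRANSIENT_UPSTREAM: "bounded retry, then fall back",
    FailureClass.ROUTE_INCOMPATIBLE: "reroute only to known-compatible models",
    FailureClass.POLICY_REROUTE: "reroute and record the reason",
}
```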
The third mistake: routing without observability
Once you add fallback, the answer to “did the request work?” is no longer enough.
You need to know:
- which route actually served the response
- how often fallback happened
- which workloads trigger retries most often
- whether latency got better or worse after rerouting
- which routes create cost spikes under pressure
Without that visibility, teams often misread their own system.
A request may look healthy from the outside while the platform is quietly failing over far more often than expected. That can turn into a cost problem, a latency problem, or both.
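A small, structured record per request is usually enough to answer those questions. The field names below are just an illustration of what is worth capturing:

```python
import json
import logging
from dataclasses import asdict, dataclass

log = logging.getLogger("router")

@dataclass
class RouteOutcome:
    route: str            # which backend actually served the response
    fallback_used: bool   # did the request leave the primary route?
    retries: int          # same-route retries before the final attempt
    latency_ms: float     # end-to-end latency, including failed attempts
    est_cost_usd: float   # rough per-request cost, to catch spikes under fallback

def record(outcome: RouteOutcome) -> None:
    # One structured line per request makes fallback rate and latency drift queryable later.
    log.info(json.dumps(asdict(outcome)))
```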
The fourth mistake: no health-aware routing
Failover is better when it is not purely reactive.
A small health probe can tell you whether a route is still safe to send traffic to before you pile more requests onto it.
That does not need to be a giant benchmark run.
A cheap, short-budget probe is often enough to answer the operational question that matters most:
should this route keep receiving traffic right now?
That simple shift changes failover from a panic behavior into a routing policy.
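A sketch of what such a probe could look like against an OpenAI-compatible endpoint; the timeout, token budget, and probe prompt are arbitrary placeholders:

```python
from openai import OpenAI

def probe(base_url: str, api_key: str, model: str) -> bool:
    """Cheap liveness check: one tiny completion with a short timeout and a one-token budget."""
    client = OpenAI(base_url=base_url, api_key=api_key, timeout=3.0)
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        return True
    except Exception:
        # Any failure means: stop sending new traffic here until a later probe passes.
        return False
```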
The fifth mistake: treating all workloads as equal
Using one model strategy for every workload usually breaks down fast.
A better pattern is tiering:
- fast/cheap tier for summarization, tagging, extraction, background jobs
- stronger tier for higher-risk, user-facing reasoning flows
- fallback path for temporary degradation or route failure
This matters for reliability as much as for cost.
If your strongest tier is degraded, you can preserve a lot of useful application behavior by keeping lower-risk traffic alive instead of failing everything together.
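A tiering policy can be as plain as a config table. The model and workload names below are placeholders; the structure is what matters:

```python
# Placeholder model names; the shape of the policy is the point, not these specific models.
TIERS = {
    "fast": {
        "models": ["small-cheap-model"],
        "workloads": ["summarization", "tagging", "extraction", "background_jobs"],
    },
    "strong": {
        "models": ["frontier-model"],
        "workloads": ["user_facing_reasoning"],
    },
}

# If the strong tier degrades, only its workloads reroute; fast-tier traffic keeps flowing.
FALLBACKS = {
    "frontier-model": ["frontier-model-alt", "small-cheap-model"],
    "small-cheap-model": ["small-cheap-model-alt"],
}
```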
What actually helped
The most practical patterns were not complicated:
- keep fallback targets explicit
- classify failures before rerouting
- probe route health cheaply
- log the final route used
- roll out routing changes in stages instead of flipping all traffic at once
That is also why “just switch the base_url” is only part of the story. OpenAI-compatible APIs reduce integration friction, but they do not remove the need to verify production behavior around timeouts, streaming, and route choice.
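For the staged-rollout point in particular, even a crude weighted split is enough to start: send a small slice of traffic through the new routing behavior, watch the fallback and latency numbers, and only then raise the weight. A rough sketch, with arbitrary weights and route names:

```python
import random

# Arbitrary weights: start small, then raise the share going through the new routing policy
# as fallback rate, latency, and cost stay within the bounds you expect.
ROLLOUT = [("legacy_path", 0.9), ("new_routing_policy", 0.1)]

def pick_route() -> str:
    r = random.random()
    cumulative = 0.0
    for name, weight in ROLLOUT:
        cumulative += weight
        if r < cumulative:
            return name
    return ROLLOUT[-1][0]
```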
Why this matters more now
A lot of teams are moving toward multi-model access because they want lower cost, better resilience, or less provider lock-in.
But the moment you add route choice, you are no longer only choosing a model.
You are choosing failure semantics.
That is the part I think many gateway demos skip.
If you want the code-first version, I turned these ideas into a small repo:
And I also added a companion guide on routing patterns in the cookbook:
I’m curious what teams ran into first when they added failover:
- retry loops
- hidden schema differences
- timeout drift
- route-level observability gaps
- cost surprises under fallback