What Actually Breaks When You Add LLM Failover?
A lot of teams say they want “LLM failover” as if it were a single feature.
In production, it is usually not one feature.
It is a bundle of decisions about retries, fallback targets, route health, timeout behavior, and what kind of degradation you are willing to accept before the whole application looks broken.
That is why adding a second model or second endpoint often creates a strange result:
you technically have more redundancy, but the system becomes harder to reason about under failure.
We ran into this while building XiDao API, an OpenAI-compatible gateway, and while putting together a small failover/routing demo. The most useful lesson was that failover usually breaks around the edges first — not in the happy-path request.
The first mistake: treating retry and fallback as the same thing
A retry says:
“try the same route again.”
A fallback says:
“try a different route.”
Those are not interchangeable.
If the primary backend is unhealthy, a retry loop can make things worse by stacking more traffic onto the same broken path.
That is why the first production question is not:
do we have a backup model?
It is:
what conditions should move this request off the primary route at all?
For example:
- timeout or connection failure may justify fast fallback
- rate-limit pressure may justify bounded retry before fallback
- malformed request errors should not fail over at all
- tool-calling incompatibility should route only to known-compatible models
This sounds obvious when written down, but a lot of “multi-model” demos collapse these cases into one catch-all exception block.
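To make the distinction concrete, here is a minimal sketch of that split using the openai Python SDK's exception types. The route URLs, retry bound, and backoff below are placeholders, not values from any particular gateway:

```python
import time

from openai import OpenAI, APIConnectionError, APITimeoutError, BadRequestError, RateLimitError

# Two OpenAI-compatible backends; the URLs and keys here are placeholders.
PRIMARY = OpenAI(base_url="https://primary.example/v1", api_key="primary-key")
FALLBACK = OpenAI(base_url="https://fallback.example/v1", api_key="fallback-key")

def call_with_policy(request: dict, max_retries: int = 2):
    """Retry stays on the same route; fallback switches routes; caller errors do neither."""
    for attempt in range(max_retries + 1):
        try:
            return PRIMARY.chat.completions.create(**request)
        except BadRequestError:
            # Caller-side problem: another provider will not fix a malformed request.
            raise
        except (APITimeoutError, APIConnectionError):
            # The route itself looks unhealthy: stop retrying and switch routes.
            break
        except RateLimitError:
            if attempt == max_retries:
                break  # bounded retry exhausted: move off the primary route
            time.sleep(2 ** attempt)  # back off, then retry the same route
    return FALLBACK.chat.completions.create(**request)
```

The shape is the point: the same-route loop is bounded, caller errors escape immediately, and only specific failure types ever reach the fallback route.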
The second mistake: no failure classification
The easiest failover implementation is usually something like:
```python
try:
    return primary()
except Exception:
    return fallback()
```
That is also how you end up hiding real bugs.
If the request is malformed, if the schema assumptions changed, or if the caller sent an unsupported parameter, falling back to another provider does not solve the root problem. It just makes the failure harder to diagnose.
A more useful split is:
- caller-side problems
- temporary upstream problems
- route-specific incompatibilities
- budget / policy-driven reroutes
Each one should map to a different routing decision.
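As a sketch of how that split can be made explicit (the class names and the decision table are illustrative, not taken from the repo):

```python
from enum import Enum, auto

class FailureClass(Enum):
    CALLER_ERROR = auto()        # malformed request, unsupported parameter, bad schema
    TRANSIENT_UPSTREAM = auto()  # timeouts, connection resets, 429s and 5xx under load
    ROUTE_INCOMPATIBLE = auto()  # this route cannot serve this workload (e.g. no tool calling)
    POLICY_REROUTE = auto()      # budget or policy decision, not really a failure

# Each class maps to its own routing decision instead of one shared catch-all.
DECISIONS = {
    FailureClass.CALLER_ERROR: "return the error to the caller, do not fail over",
    FailureClass.TRANSIENT_UPSTREAM: "bounded retry, then fall back",
    FailureClass.ROUTE_INCOMPATIBLE: "reroute only to known-compatible models",
    FailureClass.POLICY_REROUTE: "reroute and record the reason",
}
```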
The third mistake: routing without observability
Once you add fallback, the answer to “did the request work?” is no longer enough.
You need to know:
- which route actually served the response
- how often fallback happened
- which workloads trigger retries most often
- whether latency got better or worse after rerouting
- which routes create cost spikes under pressure
Without that visibility, teams often misread their own system.
A request may look healthy from the outside while the platform is quietly failing over far more often than expected. That can turn into a cost problem, a latency problem, or both.
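A small, structured record per request is usually enough to answer those questions. The field names below are just an illustration of what is worth capturing:

```python
import json
import logging
from dataclasses import asdict, dataclass

log = logging.getLogger("router")

@dataclass
class RouteOutcome:
    route: str            # which backend actually served the response
    fallback_used: bool   # did the request leave the primary route?
    retries: int          # same-route retries before the final attempt
    latency_ms: float     # end-to-end latency, including failed attempts
    est_cost_usd: float   # rough per-request cost, to catch spikes under fallback

def record(outcome: RouteOutcome) -> None:
    # One structured line per request makes fallback rate and latency drift queryable later.
    log.info(json.dumps(asdict(outcome)))
```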
The fourth mistake: no health-aware routing
Failover is better when it is not purely reactive.
A small health probe can tell you whether a route is still safe to send traffic to before you pile more requests onto it.
That does not need to be a giant benchmark run.
A cheap, short-budget probe is often enough to answer the operational question that matters most:
should this route keep receiving traffic right now?
That simple shift changes failover from a panic behavior into a routing policy.
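A sketch of what such a probe could look like against an OpenAI-compatible endpoint; the timeout, token budget, and probe prompt are arbitrary placeholders:

```python
from openai import OpenAI

def probe(base_url: str, api_key: str, model: str) -> bool:
    """Cheap liveness check: one tiny completion with a short timeout and a one-token budget."""
    client = OpenAI(base_url=base_url, api_key=api_key, timeout=3.0)
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        return True
    except Exception:
        # Any failure means: stop sending new traffic here until a later probe passes.
        return False
```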
The fifth mistake: treating all workloads as equal
Using one model strategy for every workload usually breaks down fast.
A better pattern is tiering:
- fast/cheap tier for summarization, tagging, extraction, background jobs
- stronger tier for higher-risk, user-facing reasoning flows
- fallback path for temporary degradation or route failure
This matters for reliability as much as for cost.
If your strongest tier is degraded, you can preserve a lot of useful application behavior by keeping lower-risk traffic alive instead of failing everything together.
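A tiering policy can be as plain as a config table. The model and workload names below are placeholders; the structure is what matters:

```python
# Placeholder model names; the shape of the policy is the point, not these specific models.
TIERS = {
    "fast": {
        "models": ["small-cheap-model"],
        "workloads": ["summarization", "tagging", "extraction", "background_jobs"],
    },
    "strong": {
        "models": ["frontier-model"],
        "workloads": ["user_facing_reasoning"],
    },
}

# If the strong tier degrades, only its workloads reroute; fast-tier traffic keeps flowing.
FALLBACKS = {
    "frontier-model": ["frontier-model-alt", "small-cheap-model"],
    "small-cheap-model": ["small-cheap-model-alt"],
}
```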
What actually helped
The most practical patterns were not complicated:
- keep fallback targets explicit
- classify failures before rerouting
- probe route health cheaply
- log the final route used
- roll out routing changes in stages instead of flipping all traffic at once
That is also why “just switch the base_url” is only part of the story. OpenAI-compatible APIs reduce integration friction, but they do not remove the need to verify production behavior around timeouts, streaming, and route choice.
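For the staged-rollout point in particular, even a crude weighted split is enough to start: send a small slice of traffic through the new routing behavior, watch the fallback and latency numbers, and only then raise the weight. A rough sketch, with arbitrary weights and route names:

```python
import random

# Arbitrary weights: start small, then raise the share going through the new routing policy
# as fallback rate, latency, and cost stay within the bounds you expect.
ROLLOUT = [("legacy_path", 0.9), ("new_routing_policy", 0.1)]

def pick_route() -> str:
    r = random.random()
    cumulative = 0.0
    for name, weight in ROLLOUT:
        cumulative += weight
        if r < cumulative:
            return name
    return ROLLOUT[-1][0]
```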
Why this matters more now
A lot of teams are moving toward multi-model access because they want lower cost, better resilience, or less provider lock-in.
But the moment you add route choice, you are no longer only choosing a model.
You are choosing failure semantics.
That is the part I think many gateway demos skip.
If you want the code-first version, I turned these ideas into a small repo:
And I also added a companion guide on routing patterns in the cookbook:
I’m curious what teams ran into first when they added failover:
- retry loops
- hidden schema differences
- timeout drift
- route-level observability gaps
- cost surprises under fallback