What Breaks When You Route LLM Traffic Across Multiple Providers (And How to Fix It)

You've decided to multi-home your LLM traffic. Maybe you're migrating from one provider to another. Maybe you want a backup for when your primary goes down. Maybe you're cost-optimizing by routing cheap requests to cheaper models.

Whatever the reason, you change base_url, swap the API key, and ship it.

Then production breaks in ways you didn't expect.

This post walks through the failure modes we've seen in real multi-provider LLM routing setups, and the patterns that actually hold up under load. The code examples come from a failover router demo we built to make these patterns reproducible.

The Naive Approach and Why It Fails

Most teams start here:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_BASE_URL"],
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Hello"}],
)

This works until it doesn't. The problems start when you add a second provider as a fallback:

try:
    response = primary_client.chat.completions.create(...)
except Exception:
    response = fallback_client.chat.completions.create(...)

Here's what goes wrong.

Failure Mode 1: Catching Everything Means Hiding Everything

The except Exception block catches two very different kinds of failures:

  • Provider is down (503, timeout, connection refused) -- fallback is correct
  • Your prompt is bad (400, 422, content policy violation) -- fallback will hit the same error, wasting money and time

The fix is to classify errors before routing:

import logging

import openai

log = logging.getLogger(__name__)

PROVIDER_ERRORS = (
    openai.APIConnectionError,
    openai.APITimeoutError,
    openai.RateLimitError,
    openai.InternalServerError,
)

try:
    response = primary_client.chat.completions.create(...)
except PROVIDER_ERRORS as e:
    # Provider issue -- safe to fail over
    log.warning(f"Primary failed ({type(e).__name__}), trying fallback")
    response = fallback_client.chat.completions.create(...)
except openai.APIStatusError as e:
    if e.status_code >= 500:
        # Server error -- fail over
        response = fallback_client.chat.completions.create(...)
    else:
        # 4xx -- your fault, don't retry
        raise

This distinction seems obvious, but we've seen production systems where a request with a malformed JSON schema was retried against three providers before someone noticed the cost spike.

Failure Mode 2: Retry Storms Amplify Outages

When your primary provider has a partial outage (slow responses, intermittent 503s), a naive retry strategy makes things worse:

  1. Request times out after 30 seconds
  2. Retry fires immediately
  3. Second request also times out
  4. Retry fires again
  5. Your connection pool is now saturated
  6. All requests (including ones that would have succeeded) start failing

The pattern is familiar to anyone who's operated microservices, but LLM APIs have a twist: latency variance is enormous. A request that takes 200ms normally might take 45 seconds during a provider's degraded state. Your timeout has to account for this without letting requests hang forever.

A better approach uses explicit retry boundaries:

import time

MAX_RETRIES = 2
BASE_TIMEOUT = 30
BACKOFF_BASE = 2  # seconds

def call_with_retry(client, **kwargs):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return client.chat.completions.create(
                timeout=BASE_TIMEOUT,
                **kwargs,
            )
        except PROVIDER_ERRORS:
            if attempt == MAX_RETRIES:
                raise
            wait = BACKOFF_BASE * (2 ** attempt)
            log.warning(f"Attempt {attempt+1} failed, waiting {wait}s")
            time.sleep(wait)

Key points:

  • Retries are bounded (not infinite)
  • Backoff is exponential (not immediate)
  • Timeout applies to each attempt, not as one shared budget across all retries
  • The caller decides when to escalate to a different provider (a sketch follows below)
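
Putting the pieces together, escalation to a different provider happens outside call_with_retry. A minimal sketch, reusing the PROVIDER_ERRORS tuple and the hypothetical primary/fallback clients from the earlier snippets:

messages = [{"role": "user", "content": "Hello"}]

try:
    response = call_with_retry(primary_client, model="gpt-4.1-mini", messages=messages)
except PROVIDER_ERRORS:
    # Primary exhausted its retry budget -- escalate to the fallback provider
    log.warning("Primary exhausted, escalating to fallback")
    response = call_with_retry(fallback_client, model="gpt-4o-mini", messages=messages)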

Failure Mode 3: Silent Model Name Mismatches

You test with gpt-4.1-mini on Provider A. You configure gpt-4.1-mini as your fallback on Provider B. But Provider B calls it gpt-4.1-mini-2024-07-18 or maps it to a different model entirely.

The response comes back. It looks fine. But the quality is different, the token counting is different, and your cost tracking is wrong.

This is especially dangerous when:

  • Model names overlap but versions differ
  • Your fallback provider silently substitutes a different model
  • Tokenization differs between providers (same text, different token count, different cost)

The mitigation is a model mapping layer:

MODEL_MAP = {
    "primary": {
        "fast": "gpt-4.1-mini",
        "quality": "gpt-4.1",
    },
    "fallback": {
        "fast": "gpt-4o-mini",
        "quality": "gpt-4o",
    },
}

def resolve_model(provider: str, tier: str) -> str:
    return MODEL_MAP[provider][tier]
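
Application code then refers to tiers rather than raw model names. A quick usage sketch, assuming the same hypothetical fallback client as before:

# The tier stays stable across providers; the concrete model name is resolved per provider
response = fallback_client.chat.completions.create(
    model=resolve_model("fallback", "fast"),  # -> "gpt-4o-mini"
    messages=[{"role": "user", "content": "Hello"}],
)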

Failure Mode 4: No Visibility Into Which Provider Served the Request

This is the silent killer. Your app works, but you have no idea:

  • How often fallback is actually triggered
  • Which provider is serving which percentage of traffic
  • Whether latency improved or degraded after the switch
  • What the per-provider cost actually is

Without observability, you're flying blind. A minimal logging approach:

import time
import json

def call_with_routing(clients, model_tiers, **kwargs):
    for tier in model_tiers:
        for provider_name, client in clients.items():
            model = resolve_model(provider_name, tier)
            start = time.monotonic()
            try:
                response = client.chat.completions.create(
                    model=model, **kwargs
                )
                elapsed = time.monotonic() - start
                log.info(json.dumps({
                    "provider": provider_name,
                    "model": model,
                    "tier": tier,
                    "latency_ms": round(elapsed * 1000),
                    "tokens": response.usage.total_tokens,
                    "status": "ok",
                }))
                return response
            except PROVIDER_ERRORS as e:
                elapsed = time.monotonic() - start
                log.warning(json.dumps({
                    "provider": provider_name,
                    "model": model,
                    "tier": tier,
                    "latency_ms": round(elapsed * 1000),
                    "error": type(e).__name__,
                    "status": "failed",
                }))
                continue
    raise RuntimeError("All providers exhausted")
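
Once the logs are structured, questions like "how often does fallback actually fire?" become a few lines of analysis. A rough sketch, assuming the JSON events above are written one per line to a file named routing.log (both assumptions, not part of the demo):

import json
from collections import Counter

served = Counter()
failed = Counter()

# Tally per-provider traffic share and failure counts from the structured log
with open("routing.log") as f:
    for line in f:
        event = json.loads(line)
        if event["status"] == "ok":
            served[event["provider"]] += 1
        else:
            failed[event["provider"]] += 1

total = sum(served.values()) or 1
for provider, count in served.items():
    print(f"{provider}: {count / total:.1%} of successful traffic, {failed[provider]} failed attempts")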

Failure Mode 5: Streaming Makes Everything Harder

All of the above gets more complex with streaming responses. When a provider fails mid-stream, you can't just retry -- the user has already seen partial output.

Options:

  1. Buffer before streaming -- defeats the purpose for long responses
  2. Accept partial delivery -- user sees truncated output, you log the failure
  3. Stream-to-fallback -- try to continue from where you left off (very provider-dependent)

The honest answer: streaming failover is hard, and most teams should start with non-streaming reliability before attempting it.
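
If you do need option 2 today, a minimal sketch of accepting partial delivery looks something like this (assuming the same PROVIDER_ERRORS tuple and logger as above):

def stream_with_partial_delivery(client, **kwargs):
    """Yield chunks as they arrive; if the provider dies mid-stream, log it and stop."""
    received = 0
    try:
        for chunk in client.chat.completions.create(stream=True, **kwargs):
            received += 1
            yield chunk
    except PROVIDER_ERRORS as e:
        # The user has already seen partial output -- record the truncation, don't retry blindly
        log.warning(f"Stream truncated after {received} chunks: {type(e).__name__}")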

The Health-Aware Router Pattern

The most robust approach we've found is health-aware routing. Instead of reacting to failures, you proactively probe providers and route around unhealthy ones:

import time

class HealthAwareRouter:
    def __init__(self, clients, probe_model, probe_interval=60):
        self.clients = clients
        self.probe_model = probe_model
        self.probe_interval = probe_interval
        self.health = {name: True for name in clients}
        self.last_probe = {name: 0 for name in clients}

    def probe(self, provider_name):
        """Cheap health check -- short prompt, short timeout."""
        client = self.clients[provider_name]
        try:
            client.chat.completions.create(
                model=self.probe_model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
                timeout=5,
            )
            self.health[provider_name] = True
        except Exception:
            self.health[provider_name] = False
        self.last_probe[provider_name] = time.monotonic()

    def get_healthy_client(self):
        """Return first healthy client, probing if needed."""
        now = time.monotonic()
        for name, client in self.clients.items():
            if now - self.last_probe[name] > self.probe_interval:
                self.probe(name)
            if self.health[name]:
                return name, client
        # All unhealthy -- try anyway as last resort
        return list(self.clients.items())[0]
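
Wiring it up takes a few lines. A usage sketch, assuming the same hypothetical primary/fallback clients and the resolve_model helper from earlier:

router = HealthAwareRouter(
    clients={"primary": primary_client, "fallback": fallback_client},
    probe_model="gpt-4.1-mini",
    probe_interval=60,
)

name, client = router.get_healthy_client()
response = client.chat.completions.create(
    model=resolve_model(name, "fast"),
    messages=[{"role": "user", "content": "Hello"}],
)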

This pattern is the core of the llm-failover-router-demo repo. It includes:

  • Basic fallback -- primary to secondary with error classification
  • Health-aware routing -- probe before you route
  • Latency-tier routing -- cheap models for low-risk requests, escalate when needed

The Latency-Tier Pattern

Not all requests need the same model. A latency-tier router splits traffic by risk:

  • Tier 1 (fast/cheap): Simple classification, formatting, short completions
  • Tier 2 (quality): Complex reasoning, code generation, multi-step tasks

# Concrete models behind each tier, resolved per provider via MODEL_MAP
TIER_1_MODELS = ["gpt-4.1-mini", "gpt-4o-mini"]
TIER_2_MODELS = ["gpt-4.1", "gpt-4o", "claude-sonnet-4-20250514"]

def route_by_tier(clients, prompt_complexity: str, **kwargs):
    tier = "fast" if prompt_complexity == "simple" else "quality"
    return call_with_routing(clients, [tier], **kwargs)

This is where a gateway that supports multiple models under one API key becomes useful. Instead of managing separate API keys, base URLs, and model maps for each provider, you route through a single OpenAI-compatible endpoint that handles the upstream mapping.
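
With a gateway in front, the per-provider client plumbing collapses into a single client. A sketch of what that configuration might look like (the environment variable names here are illustrative, not part of any specific product):

import os

from openai import OpenAI

# One endpoint, one key; the gateway handles the upstream provider mapping
gateway = OpenAI(
    api_key=os.environ["GATEWAY_API_KEY"],
    base_url=os.environ["GATEWAY_BASE_URL"],
)

response = gateway.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Hello"}],
)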

What We Built

The llm-failover-router-demo is a minimal Python reference for these patterns. It's designed to be:

  • Copy-pasteable -- take the pattern you need, leave the rest
  • OpenAI SDK compatible -- works with any OpenAI-compatible endpoint
  • Observable by default -- logs which provider served each request
  • Provider-agnostic -- swap providers by changing environment variables

If you're looking at this from the perspective of reducing the blast radius of provider outages, or you're evaluating a migration to a new provider, the LLM Provider Migration Checklist covers the regression testing matrix and rollout sequencing that complements these routing patterns.

The Takeaway

Multi-provider LLM routing isn't hard because the code is complex. It's hard because the failure modes are subtle:

  1. Error classification -- don't retry bad prompts
  2. Retry boundaries -- don't amplify outages
  3. Model mapping -- don't assume names are universal
  4. Observability -- don't route blind
  5. Streaming -- don't pretend failover is free

Start with non-streaming, add error classification, then layer on health checks and latency tiers. The boring approach is the one that works at 3 AM.


The code examples in this post come from the llm-failover-router-demo repo. If you're evaluating multi-provider setups or planning a migration, the LLM Provider Migration Checklist has a regression test matrix and rollout guide.

What failure modes have you hit in production with LLM routing? Drop a comment -- I'm collecting war stories for a follow-up post.
