What Breaks When You Route LLM Traffic Across Multiple Providers (And How to Fix It)

You've decided to multi-home your LLM traffic. Maybe you're migrating from one provider to another. Maybe you want a backup for when your primary goes down. Maybe you're cost-optimizing by routing cheap requests to cheaper models.

Whatever the reason, you change base_url, swap the API key, and ship it.

Then production breaks in ways you didn't expect.

This post walks through the failure modes we've seen in real multi-provider LLM routing setups, and the patterns that actually hold up under load. The code examples come from a failover router demo we built to make these patterns reproducible.

The Naive Approach and Why It Fails

Most teams start here:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_BASE_URL"],
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Hello"}],
)

This works until it doesn't. The problems start when you add a second provider as a fallback:

try:
    response = primary_client.chat.completions.create(...)
except Exception:
    response = fallback_client.chat.completions.create(...)

Here's what goes wrong.

Failure Mode 1: Catching Everything Means Hiding Everything

The except Exception block catches two very different kinds of failures:

  • Provider is down (503, timeout, connection refused) -- fallback is correct
  • Your prompt is bad (400, 422, content policy violation) -- fallback will hit the same error, wasting money and time

The fix is to classify errors before routing:

import logging

import openai

log = logging.getLogger(__name__)

PROVIDER_ERRORS = (
    openai.APIConnectionError,
    openai.APITimeoutError,
    openai.RateLimitError,
    openai.InternalServerError,
)

try:
    response = primary_client.chat.completions.create(...)
except PROVIDER_ERRORS as e:
    # Provider issue -- safe to fail over
    log.warning(f"Primary failed ({type(e).__name__}), trying fallback")
    response = fallback_client.chat.completions.create(...)
except openai.APIStatusError as e:
    if e.status_code >= 500:
        # Server error -- fail over
        response = fallback_client.chat.completions.create(...)
    else:
        # 4xx -- your fault, don't retry
        raise

This distinction seems obvious, but we've seen production systems where a request with a malformed JSON schema was retried against three providers before someone noticed the cost spike.

Failure Mode 2: Retry Storms Amplify Outages

When your primary provider has a partial outage (slow responses, intermittent 503s), a naive retry strategy makes things worse:

  1. Request times out after 30 seconds
  2. Retry fires immediately
  3. Second request also times out
  4. Retry fires again
  5. Your connection pool is now saturated
  6. All requests (including ones that would have succeeded) start failing

The pattern is familiar to anyone who's operated microservices, but LLM APIs have a twist: latency variance is enormous. A request that takes 200ms normally might take 45 seconds during a provider's degraded state. Your timeout has to account for this without letting requests hang forever.

A better approach uses explicit retry boundaries:

import time

MAX_RETRIES = 2
BASE_TIMEOUT = 30
BACKOFF_BASE = 2  # seconds

def call_with_retry(client, **kwargs):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return client.chat.completions.create(
                timeout=BASE_TIMEOUT,
                **kwargs,
            )
        except PROVIDER_ERRORS:
            if attempt == MAX_RETRIES:
                raise
            wait = BACKOFF_BASE * (2 ** attempt)
            log.warning(f"Attempt {attempt+1} failed, waiting {wait}s")
            time.sleep(wait)

Key points:

  • Retries are bounded (not infinite)
  • Backoff is exponential (not immediate)
  • Timeout applies to each attempt, not as one shared budget across all retries
  • The caller decides when to escalate to a different provider (a sketch follows below)
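
Putting the pieces together, escalation to a different provider happens outside call_with_retry. A minimal sketch, reusing the PROVIDER_ERRORS tuple and the hypothetical primary/fallback clients from the earlier snippets:

messages = [{"role": "user", "content": "Hello"}]

try:
    response = call_with_retry(primary_client, model="gpt-4.1-mini", messages=messages)
except PROVIDER_ERRORS:
    # Primary exhausted its retry budget -- escalate to the fallback provider
    log.warning("Primary exhausted, escalating to fallback")
    response = call_with_retry(fallback_client, model="gpt-4o-mini", messages=messages)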

Failure Mode 3: Silent Model Name Mismatches

You test with gpt-4.1-mini on Provider A. You configure gpt-4.1-mini as your fallback on Provider B. But Provider B calls it gpt-4.1-mini-2024-07-18 or maps it to a different model entirely.

The response comes back. It looks fine. But the quality is different, the token counting is different, and your cost tracking is wrong.

This is especially dangerous when:

  • Model names overlap but versions differ
  • Your fallback provider silently substitutes a different model
  • Tokenization differs between providers (same text, different token count, different cost)

The mitigation is a model mapping layer:

MODEL_MAP = {
    "primary": {
        "fast": "gpt-4.1-mini",
        "quality": "gpt-4.1",
    },
    "fallback": {
        "fast": "gpt-4o-mini",
        "quality": "gpt-4o",
    },
}

def resolve_model(provider: str, tier: str) -> str:
    return MODEL_MAP[provider][tier]
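
Application code then refers to tiers rather than raw model names. A quick usage sketch, assuming the same hypothetical fallback client as before:

# The tier stays stable across providers; the concrete model name is resolved per provider
response = fallback_client.chat.completions.create(
    model=resolve_model("fallback", "fast"),  # -> "gpt-4o-mini"
    messages=[{"role": "user", "content": "Hello"}],
)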

Failure Mode 4: No Visibility Into Which Provider Served the Request

This is the silent killer. Your app works, but you have no idea:

  • How often fallback is actually triggered
  • Which provider is serving which percentage of traffic
  • Whether latency improved or degraded after the switch
  • What the per-provider cost actually is

Without observability, you're flying blind. A minimal logging approach:

import time
import json

def call_with_routing(clients, model_tiers, **kwargs):
    for tier in model_tiers:
        for provider_name, client in clients.items():
            model = resolve_model(provider_name, tier)
            start = time.monotonic()
            try:
                response = client.chat.completions.create(
                    model=model, **kwargs
                )
                elapsed = time.monotonic() - start
                log.info(json.dumps({
                    "provider": provider_name,
                    "model": model,
                    "tier": tier,
                    "latency_ms": round(elapsed * 1000),
                    "tokens": response.usage.total_tokens,
                    "status": "ok",
                }))
                return response
            except PROVIDER_ERRORS as e:
                elapsed = time.monotonic() - start
                log.warning(json.dumps({
                    "provider": provider_name,
                    "model": model,
                    "tier": tier,
                    "latency_ms": round(elapsed * 1000),
                    "error": type(e).__name__,
                    "status": "failed",
                }))
                continue
    raise RuntimeError("All providers exhausted")
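
Once the logs are structured, questions like "how often does fallback actually fire?" become a few lines of analysis. A rough sketch, assuming the JSON events above are written one per line to a file named routing.log (both assumptions, not part of the demo):

import json
from collections import Counter

served = Counter()
failed = Counter()

# Tally per-provider traffic share and failure counts from the structured log
with open("routing.log") as f:
    for line in f:
        event = json.loads(line)
        if event["status"] == "ok":
            served[event["provider"]] += 1
        else:
            failed[event["provider"]] += 1

total = sum(served.values()) or 1
for provider, count in served.items():
    print(f"{provider}: {count / total:.1%} of successful traffic, {failed[provider]} failed attempts")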

Failure Mode 5: Streaming Makes Everything Harder

All of the above gets more complex with streaming responses. When a provider fails mid-stream, you can't just retry -- the user has already seen partial output.

Options:

  1. Buffer before streaming -- defeats the purpose for long responses
  2. Accept partial delivery -- user sees truncated output, you log the failure
  3. Stream-to-fallback -- try to continue from where you left off (very provider-dependent)

The honest answer: streaming failover is hard, and most teams should start with non-streaming reliability before attempting it.
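
If you do need option 2 today, a minimal sketch of accepting partial delivery looks something like this (assuming the same PROVIDER_ERRORS tuple and logger as above):

def stream_with_partial_delivery(client, **kwargs):
    """Yield chunks as they arrive; if the provider dies mid-stream, log it and stop."""
    received = 0
    try:
        for chunk in client.chat.completions.create(stream=True, **kwargs):
            received += 1
            yield chunk
    except PROVIDER_ERRORS as e:
        # The user has already seen partial output -- record the truncation, don't retry blindly
        log.warning(f"Stream truncated after {received} chunks: {type(e).__name__}")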

The Health-Aware Router Pattern

The most robust approach we've found is health-aware routing. Instead of reacting to failures, you proactively probe providers and route around unhealthy ones:

import time

class HealthAwareRouter:
    def __init__(self, clients, probe_model, probe_interval=60):
        self.clients = clients
        self.probe_model = probe_model
        self.probe_interval = probe_interval
        self.health = {name: True for name in clients}
        self.last_probe = {name: 0 for name in clients}

    def probe(self, provider_name):
        """Cheap health check -- short prompt, short timeout."""
        client = self.clients[provider_name]
        try:
            client.chat.completions.create(
                model=self.probe_model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
                timeout=5,
            )
            self.health[provider_name] = True
        except Exception:
            self.health[provider_name] = False
        self.last_probe[provider_name] = time.monotonic()

    def get_healthy_client(self):
        """Return first healthy client, probing if needed."""
        now = time.monotonic()
        for name, client in self.clients.items():
            if now - self.last_probe[name] > self.probe_interval:
                self.probe(name)
            if self.health[name]:
                return name, client
        # All unhealthy -- try anyway as last resort
        return list(self.clients.items())[0]
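
Wiring it up takes a few lines. A usage sketch, assuming the same hypothetical primary/fallback clients and the resolve_model helper from earlier:

router = HealthAwareRouter(
    clients={"primary": primary_client, "fallback": fallback_client},
    probe_model="gpt-4.1-mini",
    probe_interval=60,
)

name, client = router.get_healthy_client()
response = client.chat.completions.create(
    model=resolve_model(name, "fast"),
    messages=[{"role": "user", "content": "Hello"}],
)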

This pattern is the core of the llm-failover-router-demo repo. It includes:

  • Basic fallback -- primary to secondary with error classification
  • Health-aware routing -- probe before you route
  • Latency-tier routing -- cheap models for low-risk requests, escalate when needed

The Latency-Tier Pattern

Not all requests need the same model. A latency-tier router splits traffic by risk:

  • Tier 1 (fast/cheap): Simple classification, formatting, short completions
  • Tier 2 (quality): Complex reasoning, code generation, multi-step tasks

# Concrete models behind each tier, resolved per provider via MODEL_MAP
TIER_1_MODELS = ["gpt-4.1-mini", "gpt-4o-mini"]
TIER_2_MODELS = ["gpt-4.1", "gpt-4o", "claude-sonnet-4-20250514"]

def route_by_tier(clients, prompt_complexity: str, **kwargs):
    tier = "fast" if prompt_complexity == "simple" else "quality"
    return call_with_routing(clients, [tier], **kwargs)

This is where a gateway that supports multiple models under one API key becomes useful. Instead of managing separate API keys, base URLs, and model maps for each provider, you route through a single OpenAI-compatible endpoint that handles the upstream mapping.
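
With a gateway in front, the per-provider client plumbing collapses into a single client. A sketch of what that configuration might look like (the environment variable names here are illustrative, not part of any specific product):

import os

from openai import OpenAI

# One endpoint, one key; the gateway handles the upstream provider mapping
gateway = OpenAI(
    api_key=os.environ["GATEWAY_API_KEY"],
    base_url=os.environ["GATEWAY_BASE_URL"],
)

response = gateway.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Hello"}],
)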

What We Built

The llm-failover-router-demo is a minimal Python reference for these patterns. It's designed to be:

  • Copy-pasteable -- take the pattern you need, leave the rest
  • OpenAI SDK compatible -- works with any OpenAI-compatible endpoint
  • Observable by default -- logs which provider served each request
  • Provider-agnostic -- swap providers by changing environment variables

If you're looking at this from the perspective of reducing the blast radius of provider outages, or you're evaluating a migration to a new provider, the LLM Provider Migration Checklist covers the regression testing matrix and rollout sequencing that complements these routing patterns.

The Takeaway

Multi-provider LLM routing isn't hard because the code is complex. It's hard because the failure modes are subtle:

  1. Error classification -- don't retry bad prompts
  2. Retry boundaries -- don't amplify outages
  3. Model mapping -- don't assume names are universal
  4. Observability -- don't route blind
  5. Streaming -- don't pretend failover is free

Start with non-streaming, add error classification, then layer on health checks and latency tiers. The boring approach is the one that works at 3 AM.


The code examples in this post come from the llm-failover-router-demo repo. If you're evaluating multi-provider setups or planning a migration, the LLM Provider Migration Checklist has a regression test matrix and rollout guide.

What failure modes have you hit in production with LLM routing? Drop a comment -- I'm collecting war stories for a follow-up post.
