klement Gunndu

Stop Calling One LLM: Route Between Models With 30 Lines of Python

Your production app calls openai.chat.completions.create(). OpenAI goes down for 47 minutes. Your users see errors until it comes back.

This happened three times in the last 90 days. The fix takes 30 lines of Python and one library: LiteLLM.

Why Single-Provider LLM Calls Break in Production

Every LLM provider has outages. OpenAI's status page shows multiple incidents per month. Anthropic has rate limits that hit hard during peak hours. Google Vertex has regional availability gaps.

If your application calls one provider, you inherit that provider's uptime as your ceiling. For a side project, that is fine. For a product with paying users, it is not.

The pattern you need is model routing: one interface, multiple providers, automatic failover. The same pattern every serious web application uses for databases and CDNs.

LiteLLM: One Interface for 100+ LLM Providers

LiteLLM (v1.82.1 as of March 2026) wraps every major LLM provider behind the OpenAI completion interface. You change the model string. Everything else stays the same.

Install it:

pip install litellm

Call any provider with the same function:

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}]
)

# Call Anthropic — same function, different model string
response = completion(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}]
)

The completion() function returns an OpenAI-compatible response object regardless of the provider. Your downstream code does not change.

Provider model strings follow a pattern: provider/model-name for non-OpenAI providers (e.g., anthropic/claude-sonnet-4-5-20250929), and the model name directly for OpenAI (e.g., gpt-4o).

Pattern 1: Automatic Fallback Between Providers

This is the pattern that eliminates single-provider downtime. When OpenAI fails, requests fall through to Claude. When Claude fails, they fall through to Gemini.

import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    num_retries=2,
    allowed_fails=1,
    cooldown_time=60,
    routing_strategy="simple-shuffle",
)

# This call automatically fails over if the first provider is down
response = router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "What is model routing?"}],
)

print(response.choices[0].message.content)

The key concept: multiple deployments share the same model_name. When you call router.completion(model="main-llm"), LiteLLM picks one deployment, tries it, and fails over to the next if it gets an error.

cooldown_time=60 means a failed deployment is taken out of rotation for 60 seconds. allowed_fails=1 means one failure triggers the cooldown. num_retries=2 means each deployment gets 2 attempts before LiteLLM moves to the next one.
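To make those three knobs concrete, here is a minimal sketch of the cooldown bookkeeping. This is illustrative only — `CooldownTracker` is a hypothetical class for explanation, not LiteLLM's internals:

```python
import time

class CooldownTracker:
    """Illustrative sketch of deployment cooldown bookkeeping."""

    def __init__(self, allowed_fails=1, cooldown_time=60):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}      # deployment -> consecutive failures
        self.cooldown_until = {}   # deployment -> timestamp when usable again

    def record_failure(self, deployment, now=None):
        now = now if now is not None else time.time()
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            # Enough failures: pull the deployment out of rotation
            self.cooldown_until[deployment] = now + self.cooldown_time
            self.fail_counts[deployment] = 0

    def is_available(self, deployment, now=None):
        now = now if now is not None else time.time()
        return now >= self.cooldown_until.get(deployment, 0.0)
```

A deployment that fails `allowed_fails` times disappears from rotation for `cooldown_time` seconds, then becomes eligible again — which is why a brief provider blip does not permanently remove a deployment.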

Pattern 2: Cost-Based Routing

Not every prompt needs GPT-4o. A simple classification task runs fine on a cheaper model. Cost-based routing sends each request to the cheapest available deployment.

router = Router(
    model_list=[
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="cost-based-routing",
    num_retries=2,
)

LiteLLM maintains an internal cost map for every model it supports. With cost-based-routing, it routes to gpt-4o-mini first (cheapest), then gpt-4o, then Claude — based on per-token pricing.

You get the cheapest model that is currently available and within rate limits. No code changes when pricing updates.
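The arithmetic behind the routing decision is simple per-token cost projection. The prices below are placeholder numbers for illustration, not current pricing — LiteLLM maintains its own up-to-date cost map:

```python
# Placeholder per-token prices (USD per 1M tokens) — illustrative only
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def estimated_cost(model, input_tokens, output_tokens):
    """Projected cost of one request in USD."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def cheapest_model(input_tokens, output_tokens):
    # Pick the deployment with the lowest projected cost for this request
    return min(PRICES, key=lambda m: estimated_cost(m, input_tokens, output_tokens))
```

The router effectively runs this comparison for you on every request, against whichever deployments are currently healthy.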

Pattern 3: Latency-Based Routing

For user-facing applications where response time matters more than cost, route to the fastest available deployment:

router = Router(
    model_list=[
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="latency-based-routing",
    num_retries=2,
)

LiteLLM tracks the response latency of each deployment over time. It routes new requests to whichever deployment has been responding fastest. If latency spikes on one provider (common during peak hours), traffic shifts automatically.
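One way to picture this: keep an exponential moving average of latency per deployment and always pick the lowest. This is a toy sketch of the idea, not LiteLLM's actual implementation:

```python
class LatencyPicker:
    """Illustrative sketch: route to the deployment with the lowest
    exponential moving average (EMA) latency."""

    def __init__(self, deployments, alpha=0.3):
        self.alpha = alpha
        self.ema = {d: None for d in deployments}  # None = never tried

    def record(self, deployment, latency_s):
        prev = self.ema[deployment]
        self.ema[deployment] = (
            latency_s if prev is None
            else self.alpha * latency_s + (1 - self.alpha) * prev
        )

    def pick(self):
        # Untried deployments sort first so each gets sampled at least once
        return min(self.ema, key=lambda d: (self.ema[d] is not None, self.ema[d] or 0.0))
```

Because the EMA weights recent samples more heavily, a latency spike on one provider shifts traffic within a handful of requests.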

Pattern 4: Complexity-Based Routing (Smart Tiering)

This is the pattern that saves the most money in practice. Simple prompts go to cheap models. Complex reasoning goes to expensive ones. No manual routing logic in your application code.

router = Router(
    model_list=[
        {
            "model_name": "smart-router",
            "litellm_params": {
                "model": "auto_router/complexity_router",
                "complexity_router_config": {
                    "tiers": {
                        "SIMPLE": "gpt-4o-mini",
                        "MEDIUM": "gpt-4o",
                        "COMPLEX": "anthropic/claude-sonnet-4-5-20250929",
                        "REASONING": "o3-mini",
                    },
                },
                "complexity_router_default_model": "gpt-4o",
            },
        },
    ],
)

import asyncio

# acompletion is a coroutine, so the calls need an event loop
async def main():
    # Simple question → routes to gpt-4o-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "What is Python?"}],
    )

    # Complex reasoning → routes to claude or o3-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "Design a distributed consensus algorithm for 5 nodes with Byzantine fault tolerance."}],
    )

asyncio.run(main())

The complexity router uses rule-based scoring — no external API calls, under 1ms latency overhead. It classifies the prompt into a tier and routes accordingly.
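To see what rule-based tiering can look like, here is a toy classifier — a hypothetical heuristic for illustration, not LiteLLM's actual scoring logic:

```python
def classify_complexity(prompt: str) -> str:
    """Toy rule-based tier classifier: keyword markers plus length.
    Illustrative only — not LiteLLM's scorer."""
    reasoning_markers = ("prove", "design", "derive", "step by step", "algorithm")
    text = prompt.lower()
    words = len(text.split())
    if any(m in text for m in reasoning_markers):
        # Reasoning vocabulary plus a long prompt suggests hard work
        return "COMPLEX" if words > 10 else "MEDIUM"
    if words < 10:
        return "SIMPLE"
    return "MEDIUM"
```

Because the check is a few string operations, it adds effectively no latency — which is the whole appeal of rule-based scoring over calling a classifier model.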

Pattern 5: Explicit Fallback Chains

When you need predictable fallback order (not load-balanced), define explicit chains:

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "secondary",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "tertiary",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[
        {"primary": ["secondary", "tertiary"]},
        {"secondary": ["tertiary"]},
    ],
    max_fallbacks=3,
)

# Tries primary → secondary → tertiary
response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Summarize this document."}],
)

Fallbacks trigger on: all deployments in a model group failing, context window errors, content policy violations, and rate limit exhaustion. The order is deterministic — first fallback tried first.

Exception Handling That Actually Works

LiteLLM maps every provider's error format to standard exception types:

import litellm
from litellm import completion

try:
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except litellm.RateLimitError:
    # OpenAI 429, Anthropic 429, Gemini quota errors — all caught here
    print("Rate limited. Retrying with fallback...")
except litellm.AuthenticationError:
    # Bad API key on any provider
    print("Check your API key.")
except litellm.Timeout:
    # Request took too long
    print("Provider timed out.")
except litellm.APIConnectionError:
    # Network-level failure
    print("Cannot reach provider.")

One try/except block handles errors from every provider. No provider-specific error handling code scattered through your application.
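If you call completion() directly rather than through a Router, you can pair these exception types with a small retry helper. A sketch — the backoff schedule is an arbitrary choice, and `with_backoff` is a hypothetical helper, not part of LiteLLM:

```python
import time

def with_backoff(fn, retryable=(Exception,), max_attempts=3, base_delay=1.0):
    """Retry fn() on retryable exceptions with exponential backoff.
    In practice you would pass retryable=(litellm.RateLimitError, litellm.Timeout)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Usage would look like `with_backoff(lambda: completion(model="gpt-4o", messages=msgs), retryable=(litellm.RateLimitError, litellm.Timeout))`. The Router already does this internally via num_retries, so this is mainly useful for direct completion() calls.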

Streaming With Router

Streaming works identically to the OpenAI SDK:

for chunk in router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
):
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

If the primary provider fails mid-stream, LiteLLM retries from the beginning on the fallback provider. The stream restarts — it does not resume from the failure point.
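If you have already flushed partial output to the user, that restart shows them a truncated answer followed by a different one. One mitigation is to buffer chunks and only deliver a completed stream. This is a sketch of that pattern with a generic restartable stream; `ConnectionError` stands in for whatever exception your stream actually raises:

```python
def stream_or_restart(make_stream, max_attempts=2):
    """Buffer chunks from a restartable stream. If it dies mid-way,
    discard the partial output and retry from scratch, so the caller
    never sees a mix of two different completions.
    make_stream() must return a fresh chunk iterator each call."""
    for attempt in range(max_attempts):
        buffered = []
        try:
            for chunk in make_stream():
                buffered.append(chunk)
            return "".join(buffered)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the failure
```

The trade-off is real: buffering gives up the perceived-latency benefit of streaming. The alternative is to show partial output and visibly restart on failure — which is acceptable for some UIs and jarring for others.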

When to Use Each Strategy

Strategy | Best for | Trade-off
--- | --- | ---
simple-shuffle | Even distribution across providers | No intelligence, random
cost-based-routing | Budget-conscious apps | May route to slower models
latency-based-routing | User-facing real-time apps | May route to expensive models
usage-based-routing-v2 | Staying under rate limits | Needs TPM/RPM configured
Complexity router | Mixed workloads (simple + complex) | Requires tier definitions
Explicit fallbacks | Predictable failover order | No load balancing

For most production applications, start with simple-shuffle plus explicit fallbacks. Add cost or latency routing when you have usage data showing where to optimize.

The 30-Line Production Setup

Here is the complete setup for a production application with multi-provider routing, fallbacks, retries, and error handling:

import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "app-llm-cheap",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"app-llm": ["app-llm-cheap"]}],
    routing_strategy="latency-based-routing",
    num_retries=2,
    cooldown_time=30,
)

def ask(prompt: str) -> str:
    response = router.completion(
        model="app-llm",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Call ask("your prompt") and LiteLLM handles provider selection, retries, fallback, and error mapping. Your application code stays clean.

What This Does Not Solve

Model routing handles availability and cost. It does not handle:

  • Prompt compatibility: Claude and GPT-4o have different system prompt behaviors. Test your prompts on every provider you route to.
  • Output format differences: JSON mode, tool calling, and structured outputs vary between providers. LiteLLM normalizes the interface, but edge cases exist.
  • Context window mismatches: A prompt that fits GPT-4o's 128K window may not fit a fallback model with a 32K window. Check input length against the smallest context window in your chain, and set max_tokens explicitly so input plus output fits.

Test every provider in your fallback chain with your actual prompts before deploying.
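The context window check in particular is easy to automate. Here is a dependency-free pre-flight sketch using the rough ~4 characters/token approximation — an estimate, not an exact count (LiteLLM also ships a token counter if you need precision):

```python
def fits_context(messages, context_window, max_tokens, chars_per_token=4):
    """Rough pre-flight check: estimate prompt tokens with a ~4 chars/token
    heuristic and verify prompt + completion fits the given context window.
    Run it against the SMALLEST window in your fallback chain."""
    prompt_chars = sum(len(m["content"]) for m in messages)
    estimated_prompt_tokens = prompt_chars / chars_per_token
    return estimated_prompt_tokens + max_tokens <= context_window
```

If the check fails for the smallest model in the chain, truncate or summarize the input before sending rather than letting the fallback request error out.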


LiteLLM has 100+ supported providers as of v1.82.1 (March 2026). The full provider list includes OpenAI, Anthropic, Google Vertex, Azure OpenAI, AWS Bedrock, Ollama, and more.

Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (4)

Harsh

This is exactly what production apps need. We implemented something similar after OpenAI had that 47-minute outage last month. Lost about $2k in that window.

One thing I'd add: we use retry-after headers with exponential backoff alongside the cooldown. Sometimes providers recover faster than 60 seconds. Also, we log which model served each request - helps debug weird output differences between providers. Great write-up!

klement Gunndu

That $2k outage loss is exactly the scenario that motivated this pattern. Retry-after headers with exponential backoff is a great addition — in practice we found that combining provider health checks with adaptive cooldowns catches recovery faster than a fixed window. Logging which model served each request is smart too; we tag every response with the provider name and latency so we can spot quality drift between models over time. Appreciate you sharing the production context.

kartikay dubey

With AI providers like Anthropic struggling to hit even three nines of uptime, this article is very relevant

klement Gunndu

Exactly — the reliability gap between providers is precisely why routing matters. When one model degrades, your system should automatically shift load rather than passing failures to users.