Your production app calls openai.chat.completions.create(). OpenAI goes down for 47 minutes. Your users see errors until it comes back.
This happened three times in the last 90 days. The fix takes 30 lines of Python and one library: LiteLLM.
## Why Single-Provider LLM Calls Break in Production
Every LLM provider has outages. OpenAI's status page shows multiple incidents per month. Anthropic has rate limits that hit hard during peak hours. Google Vertex has regional availability gaps.
If your application calls one provider, you inherit that provider's uptime as your ceiling. For a side project, that is fine. For a product with paying users, it is not.
The pattern you need is model routing: one interface, multiple providers, automatic failover. The same pattern every serious web application uses for databases and CDNs.
## LiteLLM: One Interface for 100+ LLM Providers
LiteLLM (v1.82.1 as of March 2026) wraps every major LLM provider behind the OpenAI completion interface. You change the model string. Everything else stays the same.
Install it:
```shell
pip install litellm
```
Call any provider with the same function:
```python
from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}],
)

# Call Anthropic — same function, different model string
response = completion(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}],
)
```
The `completion()` function returns an OpenAI-compatible response object regardless of the provider. Your downstream code does not change.

Provider model strings follow a pattern: `provider/model-name` for non-OpenAI providers (e.g., `anthropic/claude-sonnet-4-5-20250929`), and the bare model name for OpenAI (e.g., `gpt-4o`).
## Pattern 1: Automatic Fallback Between Providers
This is the pattern that eliminates single-provider downtime. When OpenAI fails, requests fall through to Claude. When Claude fails, they fall through to Gemini.
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    num_retries=2,
    allowed_fails=1,
    cooldown_time=60,
    routing_strategy="simple-shuffle",
)

# This call automatically fails over if the first provider is down
response = router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "What is model routing?"}],
)
print(response.choices[0].message.content)
```
The key concept: multiple deployments share the same `model_name`. When you call `router.completion(model="main-llm")`, LiteLLM picks one deployment, tries it, and fails over to the next if it gets an error.

`cooldown_time=60` takes a failed deployment out of rotation for 60 seconds. `allowed_fails=1` means a single failure triggers that cooldown. `num_retries=2` gives each deployment two attempts before LiteLLM moves to the next one.
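To make those mechanics concrete, here is a small, illustrative sketch of cooldown bookkeeping in plain Python. This is not LiteLLM's internal implementation, just the concept the settings describe:

```python
import time


class CooldownTracker:
    """Illustrative sketch: after `allowed_fails` consecutive failures,
    a deployment leaves rotation for `cooldown_time` seconds."""

    def __init__(self, allowed_fails=1, cooldown_time=60):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}      # deployment -> consecutive failures
        self.cooldown_until = {}   # deployment -> timestamp it rejoins

    def record_failure(self, deployment):
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            self.cooldown_until[deployment] = time.time() + self.cooldown_time
            self.fail_counts[deployment] = 0

    def available(self, deployments):
        # Deployments still cooling down are filtered out of rotation.
        now = time.time()
        return [d for d in deployments if self.cooldown_until.get(d, 0) <= now]
```

With `allowed_fails=1`, a single recorded failure removes that deployment from `available()` until the cooldown expires.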
## Pattern 2: Cost-Based Routing
Not every prompt needs GPT-4o. A simple classification task runs fine on a cheaper model. Cost-based routing sends each request to the cheapest available deployment.
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="cost-based-routing",
    num_retries=2,
)
```
LiteLLM maintains an internal cost map for every model it supports. With `cost-based-routing`, it routes to `gpt-4o-mini` first (cheapest), then `gpt-4o`, then Claude, based on per-token pricing.

You get the cheapest model that is currently available and within rate limits, with no code changes when pricing updates.
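Conceptually, the strategy reduces to "sort by price, skip what is unavailable." A toy sketch with placeholder prices (these numbers are assumptions for illustration, not LiteLLM's live cost map):

```python
# Placeholder per-1M-input-token prices in USD. Illustrative only:
# LiteLLM keeps its own cost map up to date internally.
PRICE_PER_1M_INPUT = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
    "anthropic/claude-sonnet-4-5-20250929": 3.00,
}


def cheapest_available(deployments, unavailable=()):
    """Pick the lowest-cost deployment that is not cooling down."""
    candidates = [d for d in deployments if d not in unavailable]
    return min(candidates, key=lambda d: PRICE_PER_1M_INPUT[d])
```

If the cheapest deployment is cooling down or rate-limited, selection naturally falls through to the next tier up.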
## Pattern 3: Latency-Based Routing
For user-facing applications where response time matters more than cost, route to the fastest available deployment:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="latency-based-routing",
    num_retries=2,
)
```
LiteLLM tracks the response latency of each deployment over time. It routes new requests to whichever deployment has been responding fastest. If latency spikes on one provider (common during peak hours), traffic shifts automatically.
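The idea can be sketched as an exponential moving average of latency per deployment. This is illustrative only; LiteLLM's own tracking is more involved:

```python
class LatencyRouter:
    """Illustrative sketch: smooth each deployment's latency with an EMA
    and route new requests to the lowest-EMA deployment."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.ema = {}  # deployment -> smoothed latency in seconds

    def record(self, deployment, latency_s):
        prev = self.ema.get(deployment)
        self.ema[deployment] = (
            latency_s if prev is None
            else self.alpha * latency_s + (1 - self.alpha) * prev
        )

    def pick(self, deployments):
        # Unmeasured deployments default to 0.0, so each gets sampled early.
        return min(deployments, key=lambda d: self.ema.get(d, 0.0))
```

Because the average is smoothed, a sustained latency spike on one provider shifts traffic away, while a single slow response does not.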
## Pattern 4: Complexity-Based Routing (Smart Tiering)
This is the pattern that saves the most money in practice. Simple prompts go to cheap models. Complex reasoning goes to expensive ones. No manual routing logic in your application code.
```python
import asyncio
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "smart-router",
            "litellm_params": {
                "model": "auto_router/complexity_router",
                "complexity_router_config": {
                    "tiers": {
                        "SIMPLE": "gpt-4o-mini",
                        "MEDIUM": "gpt-4o",
                        "COMPLEX": "anthropic/claude-sonnet-4-5-20250929",
                        "REASONING": "o3-mini",
                    },
                },
                "complexity_router_default_model": "gpt-4o",
            },
        },
    ],
)


async def main():
    # Simple question → routes to gpt-4o-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "What is Python?"}],
    )

    # Complex reasoning → routes to claude or o3-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "Design a distributed consensus algorithm for 5 nodes with Byzantine fault tolerance."}],
    )


asyncio.run(main())
```
The complexity router uses rule-based scoring — no external API calls, under 1ms latency overhead. It classifies the prompt into a tier and routes accordingly.
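LiteLLM's actual scoring rules are not reproduced here, but a hypothetical rule-based classifier shows why this approach adds near-zero latency: there is no model call, just string checks. The markers and thresholds below are made up for illustration:

```python
def classify_complexity(prompt: str) -> str:
    """Hypothetical rule-based tier classifier. NOT LiteLLM's actual
    scoring logic, just an illustration of the zero-API-call approach."""
    reasoning_markers = ("prove", "design", "algorithm", "step by step",
                        "byzantine", "optimize", "trade-off")
    lowered = prompt.lower()
    hits = sum(marker in lowered for marker in reasoning_markers)
    if hits >= 2 or len(prompt) > 1500:
        return "COMPLEX"
    if hits == 1 or len(prompt) > 400:
        return "MEDIUM"
    return "SIMPLE"
```

The returned tier then maps to a model via the `tiers` config above.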
## Pattern 5: Explicit Fallback Chains
When you need predictable fallback order (not load-balanced), define explicit chains:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "secondary",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "tertiary",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[
        {"primary": ["secondary", "tertiary"]},
        {"secondary": ["tertiary"]},
    ],
    max_fallbacks=3,
)

# Tries primary → secondary → tertiary
response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Summarize this document."}],
)
```
Fallbacks trigger on: all deployments in a model group failing, context window errors, content policy violations, and rate limit exhaustion. The order is deterministic — first fallback tried first.
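That deterministic order can be sketched as a small resolver over a LiteLLM-style `fallbacks` list. This is an illustration of the concept, not the library's internals:

```python
def resolve_fallback_order(model, fallbacks, max_fallbacks=3):
    """Illustrative sketch: expand a fallbacks list like
    [{"primary": ["secondary", "tertiary"]}] into the full
    try-order for one model group."""
    order = [model]
    chain = {}
    for entry in fallbacks:
        chain.update(entry)
    for fallback in chain.get(model, []):
        if len(order) - 1 >= max_fallbacks:
            break  # cap the number of fallbacks attempted
        if fallback not in order:
            order.append(fallback)
    return order
```

Each model group resolves to a fixed list, so the failover path is fully predictable ahead of time.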
## Exception Handling That Actually Works
LiteLLM maps every provider's error format to standard exception types:
```python
import litellm
from litellm import completion

try:
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except litellm.RateLimitError:
    # OpenAI 429, Anthropic 429, Gemini quota errors — all caught here
    print("Rate limited. Retrying with fallback...")
except litellm.AuthenticationError:
    # Bad API key on any provider
    print("Check your API key.")
except litellm.Timeout:
    # Request took too long
    print("Provider timed out.")
except litellm.APIConnectionError:
    # Network-level failure
    print("Cannot reach provider.")
```
One try/except block handles errors from every provider. No provider-specific error handling code scattered through your application.
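A common companion pattern, sketched here generically rather than as a LiteLLM feature, is jittered exponential backoff: transient errors retry with growing delays, while hard failures such as bad API keys surface immediately. In practice you would pass `retryable=(litellm.RateLimitError, litellm.Timeout, litellm.APIConnectionError)`:

```python
import random
import time


def with_backoff(call, *, retryable=(Exception,), max_attempts=4, base_delay=1.0):
    """Retry `call()` on the given exception types with jittered
    exponential backoff; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # Jittered exponential delay: roughly base, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Exceptions not listed in `retryable` propagate on the first attempt, which is what you want for `AuthenticationError`.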
## Streaming With Router
Streaming works identically to the OpenAI SDK:
```python
for chunk in router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
):
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
If the primary provider fails mid-stream, LiteLLM retries from the beginning on the fallback provider. The stream restarts — it does not resume from the failure point.
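If duplicated partial output is unacceptable, one option is to buffer the stream and emit only once it completes. Here is a hedged sketch of that restart-not-resume behavior, written against a generic stream factory rather than LiteLLM's API:

```python
def stream_with_restart(make_stream, max_attempts=2):
    """Illustrative generator: if the stream dies mid-way, re-open it
    from the beginning. Chunks are buffered and only emitted once a
    stream completes, so callers never see duplicated partial output."""
    for attempt in range(max_attempts):
        chunks = []
        try:
            for chunk in make_stream():
                chunks.append(chunk)
            yield from chunks
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise
```

The trade-off: buffering sacrifices time-to-first-token, so many UIs instead accept the restart and simply clear the partially rendered response.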
## When to Use Each Strategy
| Strategy | Best for | Trade-off |
|---|---|---|
| `simple-shuffle` | Even distribution across providers | No intelligence, random |
| `cost-based-routing` | Budget-conscious apps | May route to slower models |
| `latency-based-routing` | User-facing real-time apps | May route to expensive models |
| `usage-based-routing-v2` | Staying under rate limits | Needs TPM/RPM configured |
| Complexity router | Mixed workloads (simple + complex) | Requires tier definitions |
| Explicit fallbacks | Predictable failover order | No load balancing |
For most production applications, start with `simple-shuffle` plus explicit fallbacks. Add cost or latency routing when you have usage data showing where to optimize.
## The 30-Line Production Setup
Here is the complete setup for a production application with multi-provider routing, fallbacks, retries, and error handling:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "app-llm-cheap",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"app-llm": ["app-llm-cheap"]}],
    routing_strategy="latency-based-routing",
    num_retries=2,
    cooldown_time=30,
)


def ask(prompt: str) -> str:
    response = router.completion(
        model="app-llm",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Call `ask("your prompt")` and LiteLLM handles provider selection, retries, fallback, and error mapping. Your application code stays clean.
## What This Does Not Solve
Model routing handles availability and cost. It does not handle:
- Prompt compatibility: Claude and GPT-4o have different system prompt behaviors. Test your prompts on every provider you route to.
- Output format differences: JSON mode, tool calling, and structured outputs vary between providers. LiteLLM normalizes the interface, but edge cases exist.
- Context window mismatches: A prompt that fits GPT-4o's 128K window may not fit a fallback model with a 32K window. Set `max_tokens` explicitly and check each fallback model's context limit.
Test every provider in your fallback chain with your actual prompts before deploying.
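A pre-deploy smoke test along those lines can be written against any completion callable. This sketch is one assumed structure, with `complete` standing in for a thin wrapper around `router.completion`:

```python
def smoke_test_deployments(model_names, complete, canary="Reply with OK."):
    """Illustrative pre-deploy check: send one canary prompt to every
    model group and collect failures. `complete` is any
    (model, messages) -> text callable."""
    failures = {}
    for name in model_names:
        try:
            text = complete(name, [{"role": "user", "content": canary}])
            if not text or not text.strip():
                failures[name] = "empty response"
        except Exception as exc:
            failures[name] = repr(exc)
    return failures
```

Run it against every model group in your fallback chains; an empty result means every deployment answered the canary.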
LiteLLM has 100+ supported providers as of v1.82.1 (March 2026). The full provider list includes OpenAI, Anthropic, Google Vertex, Azure OpenAI, AWS Bedrock, Ollama, and more.
Follow @klement_gunndu for more AI engineering content. We're building in public.