Your production app calls openai.chat.completions.create(). OpenAI goes down for 47 minutes. Your users see errors until it comes back.
This happened three times in the last 90 days. The fix takes 30 lines of Python and one library: LiteLLM.
## Why Single-Provider LLM Calls Break in Production
Every LLM provider has outages. OpenAI's status page shows multiple incidents per month. Anthropic has rate limits that hit hard during peak hours. Google Vertex has regional availability gaps.
If your application calls one provider, you inherit that provider's uptime as your ceiling. For a side project, that is fine. For a product with paying users, it is not.
The pattern you need is model routing: one interface, multiple providers, automatic failover. The same pattern every serious web application uses for databases and CDNs.
## LiteLLM: One Interface for 100+ LLM Providers
LiteLLM (v1.82.1 as of March 2026) wraps every major LLM provider behind the OpenAI completion interface. You change the model string. Everything else stays the same.
Install it:
```shell
pip install litellm
```
Call any provider with the same function:
```python
from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

# Call OpenAI
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}],
)

# Call Anthropic — same function, different model string
response = completion(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain routing in one sentence."}],
)
```
The `completion()` function returns an OpenAI-compatible response object regardless of the provider. Your downstream code does not change.

Provider model strings follow a pattern: `provider/model-name` for non-OpenAI providers (e.g., `anthropic/claude-sonnet-4-5-20250929`), and the bare model name for OpenAI (e.g., `gpt-4o`).
## Pattern 1: Automatic Fallback Between Providers
This is the pattern that eliminates single-provider downtime. When OpenAI fails, requests fall through to Claude. When Claude fails, they fall through to Gemini.
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "main-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    num_retries=2,
    allowed_fails=1,
    cooldown_time=60,
    routing_strategy="simple-shuffle",
)

# This call automatically fails over if the first provider is down
response = router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "What is model routing?"}],
)
print(response.choices[0].message.content)
```
The key concept: multiple deployments share the same `model_name`. When you call `router.completion(model="main-llm")`, LiteLLM picks one deployment, tries it, and fails over to the next if it gets an error.

`cooldown_time=60` takes a failed deployment out of rotation for 60 seconds. `allowed_fails=1` means a single failure triggers that cooldown. `num_retries=2` gives each deployment two attempts before LiteLLM moves to the next one.
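To make those mechanics concrete, here is a small, illustrative sketch of cooldown bookkeeping in plain Python. This is not LiteLLM's internal implementation, just the concept the settings describe:

```python
import time


class CooldownTracker:
    """Illustrative sketch: after `allowed_fails` consecutive failures,
    a deployment leaves rotation for `cooldown_time` seconds."""

    def __init__(self, allowed_fails=1, cooldown_time=60):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}      # deployment -> consecutive failures
        self.cooldown_until = {}   # deployment -> timestamp it rejoins

    def record_failure(self, deployment):
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            self.cooldown_until[deployment] = time.time() + self.cooldown_time
            self.fail_counts[deployment] = 0

    def available(self, deployments):
        # Deployments still cooling down are filtered out of rotation.
        now = time.time()
        return [d for d in deployments if self.cooldown_until.get(d, 0) <= now]
```

With `allowed_fails=1`, a single recorded failure removes that deployment from `available()` until the cooldown expires.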
## Pattern 2: Cost-Based Routing
Not every prompt needs GPT-4o. A simple classification task runs fine on a cheaper model. Cost-based routing sends each request to the cheapest available deployment.
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "general-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="cost-based-routing",
    num_retries=2,
)
```
LiteLLM maintains an internal cost map for every model it supports. With `cost-based-routing`, it routes to `gpt-4o-mini` first (cheapest), then `gpt-4o`, then Claude, based on per-token pricing.

You get the cheapest model that is currently available and within rate limits, with no code changes when pricing updates.
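Conceptually, the strategy reduces to "sort by price, skip what is unavailable." A toy sketch with placeholder prices (these numbers are assumptions for illustration, not LiteLLM's live cost map):

```python
# Placeholder per-1M-input-token prices in USD. Illustrative only:
# LiteLLM keeps its own cost map up to date internally.
PRICE_PER_1M_INPUT = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
    "anthropic/claude-sonnet-4-5-20250929": 3.00,
}


def cheapest_available(deployments, unavailable=()):
    """Pick the lowest-cost deployment that is not cooling down."""
    candidates = [d for d in deployments if d not in unavailable]
    return min(candidates, key=lambda d: PRICE_PER_1M_INPUT[d])
```

If the cheapest deployment is cooling down or rate-limited, selection naturally falls through to the next tier up.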
## Pattern 3: Latency-Based Routing
For user-facing applications where response time matters more than cost, route to the fastest available deployment:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "fast-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="latency-based-routing",
    num_retries=2,
)
```
LiteLLM tracks the response latency of each deployment over time. It routes new requests to whichever deployment has been responding fastest. If latency spikes on one provider (common during peak hours), traffic shifts automatically.
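The idea can be sketched as an exponential moving average of latency per deployment. This is illustrative only; LiteLLM's own tracking is more involved:

```python
class LatencyRouter:
    """Illustrative sketch: smooth each deployment's latency with an EMA
    and route new requests to the lowest-EMA deployment."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.ema = {}  # deployment -> smoothed latency in seconds

    def record(self, deployment, latency_s):
        prev = self.ema.get(deployment)
        self.ema[deployment] = (
            latency_s if prev is None
            else self.alpha * latency_s + (1 - self.alpha) * prev
        )

    def pick(self, deployments):
        # Unmeasured deployments default to 0.0, so each gets sampled early.
        return min(deployments, key=lambda d: self.ema.get(d, 0.0))
```

Because the average is smoothed, a sustained latency spike on one provider shifts traffic away, while a single slow response does not.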
## Pattern 4: Complexity-Based Routing (Smart Tiering)
This is the pattern that saves the most money in practice. Simple prompts go to cheap models. Complex reasoning goes to expensive ones. No manual routing logic in your application code.
```python
import asyncio
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "smart-router",
            "litellm_params": {
                "model": "auto_router/complexity_router",
                "complexity_router_config": {
                    "tiers": {
                        "SIMPLE": "gpt-4o-mini",
                        "MEDIUM": "gpt-4o",
                        "COMPLEX": "anthropic/claude-sonnet-4-5-20250929",
                        "REASONING": "o3-mini",
                    },
                },
                "complexity_router_default_model": "gpt-4o",
            },
        },
    ],
)


async def main():
    # Simple question → routes to gpt-4o-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "What is Python?"}],
    )

    # Complex reasoning → routes to claude or o3-mini
    response = await router.acompletion(
        model="smart-router",
        messages=[{"role": "user", "content": "Design a distributed consensus algorithm for 5 nodes with Byzantine fault tolerance."}],
    )


asyncio.run(main())
```
The complexity router uses rule-based scoring — no external API calls, under 1ms latency overhead. It classifies the prompt into a tier and routes accordingly.
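LiteLLM's actual scoring rules are not reproduced here, but a hypothetical rule-based classifier shows why this approach adds near-zero latency: there is no model call, just string checks. The markers and thresholds below are made up for illustration:

```python
def classify_complexity(prompt: str) -> str:
    """Hypothetical rule-based tier classifier. NOT LiteLLM's actual
    scoring logic, just an illustration of the zero-API-call approach."""
    reasoning_markers = ("prove", "design", "algorithm", "step by step",
                        "byzantine", "optimize", "trade-off")
    lowered = prompt.lower()
    hits = sum(marker in lowered for marker in reasoning_markers)
    if hits >= 2 or len(prompt) > 1500:
        return "COMPLEX"
    if hits == 1 or len(prompt) > 400:
        return "MEDIUM"
    return "SIMPLE"
```

The returned tier then maps to a model via the `tiers` config above.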
## Pattern 5: Explicit Fallback Chains
When you need predictable fallback order (not load-balanced), define explicit chains:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "secondary",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "tertiary",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[
        {"primary": ["secondary", "tertiary"]},
        {"secondary": ["tertiary"]},
    ],
    max_fallbacks=3,
)

# Tries primary → secondary → tertiary
response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Summarize this document."}],
)
```
Fallbacks trigger on: all deployments in a model group failing, context window errors, content policy violations, and rate limit exhaustion. The order is deterministic — first fallback tried first.
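That deterministic order can be sketched as a small resolver over a LiteLLM-style `fallbacks` list. This is an illustration of the concept, not the library's internals:

```python
def resolve_fallback_order(model, fallbacks, max_fallbacks=3):
    """Illustrative sketch: expand a fallbacks list like
    [{"primary": ["secondary", "tertiary"]}] into the full
    try-order for one model group."""
    order = [model]
    chain = {}
    for entry in fallbacks:
        chain.update(entry)
    for fallback in chain.get(model, []):
        if len(order) - 1 >= max_fallbacks:
            break  # cap the number of fallbacks attempted
        if fallback not in order:
            order.append(fallback)
    return order
```

Each model group resolves to a fixed list, so the failover path is fully predictable ahead of time.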
## Exception Handling That Actually Works
LiteLLM maps every provider's error format to standard exception types:
```python
import litellm
from litellm import completion

try:
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except litellm.RateLimitError:
    # OpenAI 429, Anthropic 429, Gemini quota errors — all caught here
    print("Rate limited. Retrying with fallback...")
except litellm.AuthenticationError:
    # Bad API key on any provider
    print("Check your API key.")
except litellm.Timeout:
    # Request took too long
    print("Provider timed out.")
except litellm.APIConnectionError:
    # Network-level failure
    print("Cannot reach provider.")
```
One try/except block handles errors from every provider. No provider-specific error handling code scattered through your application.
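A common companion pattern, sketched here generically rather than as a LiteLLM feature, is jittered exponential backoff: transient errors retry with growing delays, while hard failures such as bad API keys surface immediately. In practice you would pass `retryable=(litellm.RateLimitError, litellm.Timeout, litellm.APIConnectionError)`:

```python
import random
import time


def with_backoff(call, *, retryable=(Exception,), max_attempts=4, base_delay=1.0):
    """Retry `call()` on the given exception types with jittered
    exponential backoff; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # Jittered exponential delay: roughly base, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Exceptions not listed in `retryable` propagate on the first attempt, which is what you want for `AuthenticationError`.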
## Streaming With Router
Streaming works identically to the OpenAI SDK:
```python
for chunk in router.completion(
    model="main-llm",
    messages=[{"role": "user", "content": "Write a haiku."}],
    stream=True,
):
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
If the primary provider fails mid-stream, LiteLLM retries from the beginning on the fallback provider. The stream restarts — it does not resume from the failure point.
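If duplicated partial output is unacceptable, one option is to buffer the stream and emit only once it completes. Here is a hedged sketch of that restart-not-resume behavior, written against a generic stream factory rather than LiteLLM's API:

```python
def stream_with_restart(make_stream, max_attempts=2):
    """Illustrative generator: if the stream dies mid-way, re-open it
    from the beginning. Chunks are buffered and only emitted once a
    stream completes, so callers never see duplicated partial output."""
    for attempt in range(max_attempts):
        chunks = []
        try:
            for chunk in make_stream():
                chunks.append(chunk)
            yield from chunks
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise
```

The trade-off: buffering sacrifices time-to-first-token, so many UIs instead accept the restart and simply clear the partially rendered response.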
## When to Use Each Strategy
| Strategy | Best for | Trade-off |
|---|---|---|
| `simple-shuffle` | Even distribution across providers | No intelligence, random |
| `cost-based-routing` | Budget-conscious apps | May route to slower models |
| `latency-based-routing` | User-facing real-time apps | May route to expensive models |
| `usage-based-routing-v2` | Staying under rate limits | Needs TPM/RPM configured |
| Complexity router | Mixed workloads (simple + complex) | Requires tier definitions |
| Explicit fallbacks | Predictable failover order | No load balancing |
For most production applications, start with `simple-shuffle` plus explicit fallbacks. Add cost or latency routing when you have usage data showing where to optimize.
## The 30-Line Production Setup
Here is the complete setup for a production application with multi-provider routing, fallbacks, retries, and error handling:
```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "app-llm",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-5-20250929",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "app-llm-cheap",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"app-llm": ["app-llm-cheap"]}],
    routing_strategy="latency-based-routing",
    num_retries=2,
    cooldown_time=30,
)


def ask(prompt: str) -> str:
    response = router.completion(
        model="app-llm",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Call `ask("your prompt")` and LiteLLM handles provider selection, retries, fallback, and error mapping. Your application code stays clean.
## What This Does Not Solve
Model routing handles availability and cost. It does not handle:
- Prompt compatibility: Claude and GPT-4o have different system prompt behaviors. Test your prompts on every provider you route to.
- Output format differences: JSON mode, tool calling, and structured outputs vary between providers. LiteLLM normalizes the interface, but edge cases exist.
- Context window mismatches: A prompt that fits GPT-4o's 128K window may not fit a fallback model with a 32K window. Set `max_tokens` explicitly and check each fallback model's context limit.
Test every provider in your fallback chain with your actual prompts before deploying.
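A pre-deploy smoke test along those lines can be written against any completion callable. This sketch is one assumed structure, with `complete` standing in for a thin wrapper around `router.completion`:

```python
def smoke_test_deployments(model_names, complete, canary="Reply with OK."):
    """Illustrative pre-deploy check: send one canary prompt to every
    model group and collect failures. `complete` is any
    (model, messages) -> text callable."""
    failures = {}
    for name in model_names:
        try:
            text = complete(name, [{"role": "user", "content": canary}])
            if not text or not text.strip():
                failures[name] = "empty response"
        except Exception as exc:
            failures[name] = repr(exc)
    return failures
```

Run it against every model group in your fallback chains; an empty result means every deployment answered the canary.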
LiteLLM has 100+ supported providers as of v1.82.1 (March 2026). The full provider list includes OpenAI, Anthropic, Google Vertex, Azure OpenAI, AWS Bedrock, Ollama, and more.
Follow @klement_gunndu for more AI engineering content. We're building in public.