Introduction: What If You Could Use ANY LLM Provider?
In my previous article, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and APIM. It works brilliantly - but it's also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem.
What if you need:
- Multi-provider failover (Azure OpenAI -> OpenAI -> Anthropic -> Gemini)
- A simpler deployment without managing APIM policies
- Provider-agnostic architecture that works anywhere
- Open-source flexibility with no vendor lock-in
Enter LiteLLM Proxy - an open-source unified gateway that gives you all of this out of the box.
What is LiteLLM Proxy?
LiteLLM is an open-source Python library and proxy server that provides:
- Unified API: One OpenAI-compatible endpoint for 100+ LLM providers
- Built-in Load Balancing: Distribute requests across multiple deployments
- Automatic Failover: Seamlessly retry on different models/providers when one fails
- Rate Limit Handling: Intelligent retry with exponential backoff for 429 errors
- Cost Tracking: Monitor spend across all providers in one place
- Streaming Support: Full SSE (Server-Sent Events) support with proper failover
The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.
Architecture: LiteLLM Proxy vs Azure APIM
Here's how LiteLLM Proxy compares to the Azure-native approach:
Azure APIM Architecture (Previous Article)
```
Client -> Azure Front Door -> Regional APIM -> Azure OpenAI (Primary)
                                            -> Azure OpenAI (Secondary)
```
Pros: Native Azure integration, enterprise compliance, WAF protection
Cons: Azure-only, complex policies, expensive at scale
LiteLLM Proxy Architecture
```
Client -> Load Balancer -> LiteLLM Proxy -> Azure OpenAI
                                         -> OpenAI Direct
                                         -> Anthropic Claude
                                         -> Google Gemini
                                         -> AWS Bedrock
                                         -> Any LLM Provider
```
Supported Providers: Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ more
Pros: Provider-agnostic, simple configuration, open-source, runs anywhere
Cons: Self-managed infrastructure, requires containerization
Getting Started: 5-Minute Setup
Option 1: Docker (Recommended for Production)
```shell
# Pull the official image
docker pull ghcr.io/berriai/litellm:main-latest

# Run with your config
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e AZURE_API_KEY="your-azure-key" \
  -e OPENAI_API_KEY="your-openai-key" \
  -e ANTHROPIC_API_KEY="your-anthropic-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```
Option 2: Python (Quick Testing)
```shell
pip install 'litellm[proxy]'
litellm --config litellm_config.yaml
```
The Configuration File
Create litellm_config.yaml:
```yaml
model_list:
  # Primary: Azure OpenAI GPT-4o (West US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://westus-primary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-westus-gpt4o

  # Failover 1: Azure OpenAI GPT-4o (East US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus-secondary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_SECONDARY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-eastus-gpt4o

  # Failover 2: OpenAI Direct
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: openai-direct-gpt4o

  # Failover 3: Anthropic Claude (ultimate backup)
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: anthropic-claude-sonnet

litellm_settings:
  # Enable automatic failover
  num_retries: 3
  retry_after: 5
  # Fallback configuration
  fallbacks:
    - gpt-4o: [gpt-4o]  # Retry across all gpt-4o deployments
  # Request timeout (seconds)
  request_timeout: 120
  # Enable streaming
  stream: true

router_settings:
  # Load balancing strategy
  routing_strategy: least-busy
  # Enable rate limit awareness
  enable_pre_call_checks: true
  # Cooldown failed deployments (seconds)
  cooldown_time: 60
  # Number of retries per deployment
  num_retries: 2
  # Seconds to wait before retrying a deployment
  retry_after: 5
  # Failures allowed before a deployment is cooled down
  allowed_fails: 3

general_settings:
  # Master key for proxy authentication
  master_key: os.environ/LITELLM_MASTER_KEY
  # Database for tracking (optional)
  database_url: os.environ/DATABASE_URL
```
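One detail worth noting: values like `os.environ/AZURE_API_KEY` are references that LiteLLM resolves from environment variables, so secrets never live in the YAML file itself. Conceptually, the resolution works like this toy sketch (`resolve_secret` is an illustrative helper, not a LiteLLM API):

```python
import os

def resolve_secret(value):
    """Resolve a LiteLLM-style 'os.environ/VAR' reference to its env value."""
    prefix = "os.environ/"
    if isinstance(value, str) and value.startswith(prefix):
        return os.environ[value[len(prefix):]]
    return value  # plain values pass through unchanged

# Demo: the YAML value 'os.environ/AZURE_API_KEY' resolves at load time
os.environ["AZURE_API_KEY"] = "sk-demo"
print(resolve_secret("os.environ/AZURE_API_KEY"))  # -> sk-demo
print(resolve_secret("sk-literal"))                # -> sk-literal
```

This is why the Docker examples pass keys with `-e`: the proxy reads them from its environment when it loads the config.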
The Magic: How Failover Actually Works
Automatic 429 Handling
When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically:
- Reads the `Retry-After` header
- Marks that deployment as "cooling down"
- Routes the request to the next available deployment
- Continues until a successful response or all deployments are exhausted
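The exact routing internals live inside LiteLLM's router, but the cooldown-and-retry loop described above can be sketched in a few lines (`RateLimitError` and `call_with_failover` here are illustrative stand-ins, not LiteLLM's actual implementation):

```python
import time

class RateLimitError(Exception):
    """Stand-in for a 429 response carrying a Retry-After value (seconds)."""
    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def call_with_failover(deployments, send, cooldowns):
    """Try deployments in order, skipping any still cooling down after a 429."""
    last_err = None
    for dep in deployments:
        if cooldowns.get(dep, 0) > time.monotonic():
            continue  # still cooling down from an earlier 429
        try:
            return send(dep)
        except RateLimitError as err:
            # Honor Retry-After: bench this deployment and try the next one
            cooldowns[dep] = time.monotonic() + err.retry_after
            last_err = err
    raise last_err or RuntimeError("all deployments exhausted")

# Demo with a stubbed sender: the first deployment is rate-limited
cooldowns = {}
def fake_send(dep):
    if dep == "azure-westus-gpt4o":
        raise RateLimitError(retry_after=60)
    return "ok from " + dep

print(call_with_failover(["azure-westus-gpt4o", "azure-eastus-gpt4o"],
                         fake_send, cooldowns))  # -> ok from azure-eastus-gpt4o
```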
```python
# Your code stays simple - LiteLLM handles everything
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"  # Point to LiteLLM Proxy
)

# This request automatically fails over if needed
response = client.chat.completions.create(
    model="gpt-4o",  # LiteLLM routes to best available
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Load Balancing Strategies
LiteLLM supports multiple routing strategies:
| Strategy | Description | Best For |
|---|---|---|
| `simple-shuffle` | Random selection | Even distribution |
| `least-busy` | Route to deployment with fewest active requests | High throughput |
| `latency-based-routing` | Route to fastest responding deployment | Latency-sensitive apps |
| `cost-based-routing` | Route to cheapest available option | Cost optimization |
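To make the strategies concrete, here's a toy sketch of how `least-busy` and `latency-based-routing` might pick a deployment from tracked stats (field names and helpers are assumptions for illustration, not LiteLLM's internals):

```python
def pick_least_busy(deployments):
    """least-busy: the deployment with the fewest in-flight requests wins."""
    return min(deployments, key=lambda d: d["active_requests"])

def pick_lowest_latency(deployments):
    """latency-based-routing: the lowest observed average latency wins."""
    return min(deployments, key=lambda d: d["avg_latency_ms"])

# Hypothetical per-deployment stats the router might track
pool = [
    {"id": "azure-westus-gpt4o",  "active_requests": 12, "avg_latency_ms": 1800},
    {"id": "azure-eastus-gpt4o",  "active_requests": 3,  "avg_latency_ms": 2400},
    {"id": "openai-direct-gpt4o", "active_requests": 7,  "avg_latency_ms": 1500},
]

print(pick_least_busy(pool)["id"])      # -> azure-eastus-gpt4o
print(pick_lowest_latency(pool)["id"])  # -> openai-direct-gpt4o
```

Note that the two strategies can disagree: the least-busy deployment is not necessarily the fastest one.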
Configure in your YAML:
```yaml
router_settings:
  routing_strategy: latency-based-routing
  # Optional: weighted traffic split across deployments
  model_group_alias:
    gpt-4o:
      - model: azure/gpt-4o
        weight: 0.7  # 70% of traffic
      - model: openai/gpt-4o
        weight: 0.3  # 30% of traffic
```
Streaming Support: It Just Works
Unlike the Azure APIM approach where streaming requires special handling, LiteLLM Proxy handles SSE (Server-Sent Events) natively:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

# Streaming works exactly like direct OpenAI
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about resilience"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
If the primary provider fails mid-stream, LiteLLM will:
- Detect the connection failure
- Automatically retry on the next provider
- Return an error only if all providers fail
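That retry behavior can be approximated with a generator wrapper. This is a deliberately simplified sketch: here, a mid-stream failure makes the next provider restart the stream from the beginning, which is cruder than what the proxy actually does:

```python
def stream_with_failover(providers):
    """Try each provider in order; if a stream breaks, fall through to the next.

    Each provider is a callable returning an iterator of text chunks and
    raising ConnectionError if its stream drops.
    """
    last_err = None
    for provider in providers:
        try:
            yield from provider()
            return  # stream completed successfully
        except ConnectionError as err:
            last_err = err  # provider dropped mid-stream; try the next one
    raise last_err or RuntimeError("all providers failed")

# Demo: a flaky provider that dies mid-stream, then a healthy backup
def flaky():
    yield "partial "
    raise ConnectionError("connection reset mid-stream")

def healthy():
    yield from ["resilience ", "is ", "built ", "in"]

print("".join(stream_with_failover([flaky, healthy])))
# -> partial resilience is built in
```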
Production Configuration: Enterprise-Ready Setup
High Availability Deployment
For production, deploy multiple LiteLLM instances behind a load balancer:
```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4001:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  litellm-2:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4002:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "4000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - litellm-1
      - litellm-2
    restart: always
```
Nginx Load Balancer Configuration
```nginx
# nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream litellm {
        least_conn;
        server litellm-1:4000 weight=1;
        server litellm-2:4000 weight=1;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://litellm;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
            proxy_buffering off;  # Important for streaming
        }

        location /health {
            proxy_pass http://litellm;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}
```
Advanced Features
1. Budget & Rate Limiting
Control spending and prevent runaway costs:
```yaml
general_settings:
  master_key: sk-your-master-key

# User-level budgets
litellm_settings:
  max_budget: 100.00  # $100 max per user
  budget_duration: monthly
```
Create users with specific limits:
```shell
curl -X POST 'http://localhost:4000/user/new' \
  -H 'Authorization: Bearer sk-your-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'
```
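Under the hood, budget enforcement boils down to a pre-call check against accumulated spend. Conceptually (an illustrative helper, not LiteLLM's API):

```python
def within_budget(spend_so_far, estimated_cost, max_budget):
    """Pre-call check: reject a request that would exceed the user's budget."""
    return spend_so_far + estimated_cost <= max_budget

# user-123 has max_budget 50.00; near the limit, small requests still pass
print(within_budget(49.50, 0.25, 50.00))  # -> True
# ...until the next request would tip the total over the cap
print(within_budget(49.90, 0.25, 50.00))  # -> False
```

In the real proxy, the spend side of this check is what the PostgreSQL database tracks per user and per key.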
2. Request Caching
Reduce costs and latency with response caching backed by Redis:
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour cache
```
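The idea behind exact-match caching is that an identical model + messages payload maps to the same Redis key, so a repeated request is served without touching any provider. A sketch of how such a key could be derived (`cache_key` is a hypothetical helper, not LiteLLM's actual scheme):

```python
import hashlib
import json

def cache_key(model, messages):
    """Derive a deterministic cache key from the request payload (exact match)."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "litellm-cache:" + hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
print(k1 == k2)  # -> True: identical requests share one cache entry
```

Any change to the model or the messages produces a different key, which is why the `ttl` matters: stale answers expire rather than lingering forever.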
3. Custom Callbacks & Logging
Track every request for observability:
```yaml
litellm_settings:
  success_callback: ["langfuse", "prometheus"]  # Langfuse & Prometheus integrations
  failure_callback: ["langfuse", "slack"]
  # Langfuse integration
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY
```
4. Guardrails & Content Moderation
Add safety layers:
```yaml
litellm_settings:
  guardrails:
    - guardrail_name: "content-filter"
      litellm_params:
        guardrail: openai_moderation
        mode: pre_call  # Check before sending to LLM
```
Comparing Results: LiteLLM vs Azure APIM
I ran the same load test from my Azure article against both architectures:
| Metric | Azure APIM | LiteLLM Proxy |
|---|---|---|
| Success Rate | 99.4% | 99.6% |
| Avg Latency | 2,184ms | 1,892ms |
| P95 Latency | 4,128ms | 3,456ms |
| Setup Time | ~4 hours | ~30 minutes |
| Monthly Cost | ~$500+ | ~$50 (compute only) |
| Provider Lock-in | Azure only | Any provider |
Key observations:
- LiteLLM showed slightly better latency due to simpler request pipeline
- Both achieved similar reliability with proper configuration
- LiteLLM's multi-provider fallback provided an extra safety net
- Cost difference is significant for smaller teams
When to Use Which?
Choose Azure APIM + Front Door When:
- You're all-in on Azure and need native integration
- Enterprise compliance requirements mandate Azure services
- You need WAF/DDoS protection at the edge
- Your organization has existing APIM expertise
- Audit logging must stay within Azure ecosystem
Choose LiteLLM Proxy When:
- You need multi-provider failover (not just multi-region)
- Cost optimization is a priority
- You want provider flexibility to switch easily
- Your team prefers simple YAML configuration over XML policies
- You're running on Kubernetes, AWS, GCP, or on-prem
- You need rapid prototyping and iteration
Production Checklist
If you're deploying LiteLLM Proxy to production:
- [ ] Deploy Multiple Instances: At least 2 behind a load balancer
- [ ] Enable Health Checks: Configure `/health` endpoint monitoring
- [ ] Set Up Database: PostgreSQL for persistence and analytics
- [ ] Configure Caching: Redis for semantic caching
- [ ] Add Monitoring: Prometheus + Grafana or Langfuse
- [ ] Set Budget Limits: Prevent runaway costs
- [ ] Secure the Proxy: Use master key authentication
- [ ] Enable TLS: HTTPS in production (via nginx or cloud LB)
- [ ] Configure Alerts: Slack/PagerDuty for failures
- [ ] Test Failover: Deliberately fail providers to verify behavior
Conclusion: The Right Tool for the Job
Both Azure APIM and LiteLLM Proxy solve the same fundamental problem - making LLM services reliable at scale. The choice depends on your constraints:
Azure APIM is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features.
LiteLLM Proxy is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.
The best part? These aren't mutually exclusive. You can run LiteLLM Proxy behind Azure Front Door to get the best of both worlds - enterprise edge security with flexible provider routing.
π¦ LiteLLM GitHub: github.com/BerriAI/litellm
π LiteLLM Docs: docs.litellm.ai
The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.