DEV Community

Deneesh Narayanasamy

LiteLLM Proxy: The Open-Source Alternative for Multi-Provider LLM Failover and Load Balancing

Introduction: What If You Could Use ANY LLM Provider?

In my previous article, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and API Management (APIM). It works brilliantly - but it's also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem.

What if you need:

  • Multi-provider failover (Azure OpenAI -> OpenAI -> Anthropic -> Gemini)
  • A simpler deployment without managing APIM policies
  • Provider-agnostic architecture that works anywhere
  • Open-source flexibility with no vendor lock-in

Enter LiteLLM Proxy - an open-source unified gateway that gives you all of this out of the box.


What is LiteLLM Proxy?

LiteLLM is an open-source Python library and proxy server that provides:

  • Unified API: One OpenAI-compatible endpoint for 100+ LLM providers
  • Built-in Load Balancing: Distribute requests across multiple deployments
  • Automatic Failover: Seamlessly retry on different models/providers when one fails
  • Rate Limit Handling: Intelligent retry with exponential backoff for 429 errors
  • Cost Tracking: Monitor spend across all providers in one place
  • Streaming Support: Full SSE (Server-Sent Events) support with proper failover

The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.


Architecture: LiteLLM Proxy vs Azure APIM

Here's how LiteLLM Proxy compares to the Azure-native approach:

Azure APIM Architecture (Previous Article)

Client -> Azure Front Door -> Regional APIM -> Azure OpenAI (Primary)
                                            -> Azure OpenAI (Secondary)

Pros: Native Azure integration, enterprise compliance, WAF protection
Cons: Azure-only, complex policies, expensive at scale

LiteLLM Proxy Architecture

Client -> Load Balancer -> LiteLLM Proxy -> Azure OpenAI
                                         -> OpenAI Direct
                                         -> Anthropic Claude
                                         -> Google Gemini
                                         -> AWS Bedrock
                                         -> Any LLM Provider

Supported Providers: Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ more

Pros: Provider-agnostic, simple configuration, open-source, runs anywhere
Cons: Self-managed infrastructure, requires containerization


Getting Started: 5-Minute Setup

Option 1: Docker (Recommended for Production)

# Pull the official image
docker pull ghcr.io/berriai/litellm:main-latest

# Run with your config
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e AZURE_API_KEY="your-azure-key" \
  -e OPENAI_API_KEY="your-openai-key" \
  -e ANTHROPIC_API_KEY="your-anthropic-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Option 2: Python (Quick Testing)

pip install 'litellm[proxy]'
litellm --config litellm_config.yaml

The Configuration File

Create litellm_config.yaml:

model_list:
  # Primary: Azure OpenAI GPT-4o (West US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://westus-primary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-westus-gpt4o

  # Failover 1: Azure OpenAI GPT-4o (East US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus-secondary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_SECONDARY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-eastus-gpt4o

  # Failover 2: OpenAI Direct
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: openai-direct-gpt4o

  # Failover 3: Anthropic Claude (ultimate backup)
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: anthropic-claude-sonnet

litellm_settings:
  # Enable automatic failover
  num_retries: 3
  retry_after: 5

  # Fallback configuration
  fallbacks:
    - gpt-4o: [gpt-4o]  # Retry across all gpt-4o deployments

  # Request timeout
  request_timeout: 120

  # Enable streaming
  stream: true

router_settings:
  # Load balancing strategy
  routing_strategy: least-busy

  # Enable rate limit awareness
  enable_pre_call_checks: true

  # Cooldown failed deployments
  cooldown_time: 60

  # Number of retries per deployment
  num_retries: 2

  # Wait before retrying, and tolerate this many failures before cooldown
  retry_after: 5
  allowed_fails: 3

general_settings:
  # Master key for proxy authentication
  master_key: os.environ/LITELLM_MASTER_KEY

  # Database for tracking (optional)
  database_url: os.environ/DATABASE_URL

The Magic: How Failover Actually Works

Automatic 429 Handling

When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically:

  1. Reads the Retry-After header
  2. Marks that deployment as "cooling down"
  3. Routes the request to the next available deployment
  4. Continues until a successful response or all deployments are exhausted

# Your code stays simple - LiteLLM handles everything
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"  # Point to LiteLLM Proxy
)

# This request automatically fails over if needed
response = client.chat.completions.create(
    model="gpt-4o",  # LiteLLM routes to best available
    messages=[{"role": "user", "content": "Hello!"}]
)
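Conceptually, steps 2 and 3 above amount to a small piece of bookkeeping: remember when each deployment was rate-limited and skip it until its cooldown expires. The sketch below is purely illustrative - the `CooldownTracker` class is hypothetical, not LiteLLM's actual implementation:

```python
import time


class CooldownTracker:
    """Tracks which deployments are cooling down after a 429."""

    def __init__(self, cooldown_seconds=60.0):
        self.cooldown_seconds = cooldown_seconds
        self._cooling = {}  # deployment id -> cooldown expiry (monotonic time)

    def mark_rate_limited(self, deployment_id, retry_after=None):
        # Honor the Retry-After header when present, else use the default cooldown
        wait = retry_after if retry_after is not None else self.cooldown_seconds
        self._cooling[deployment_id] = time.monotonic() + wait

    def available(self, deployment_ids):
        # Return only deployments whose cooldown has expired
        now = time.monotonic()
        return [d for d in deployment_ids if self._cooling.get(d, 0.0) <= now]


tracker = CooldownTracker(cooldown_seconds=60)
tracker.mark_rate_limited("azure-westus-gpt4o", retry_after=30)
print(tracker.available(["azure-westus-gpt4o", "azure-eastus-gpt4o"]))
# ['azure-eastus-gpt4o']
```

With that structure in place, routing is just "pick from `available(...)`, and mark deployments on every 429" - which is the behavior the proxy gives you for free.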

Load Balancing Strategies

LiteLLM supports multiple routing strategies:

| Strategy | Description | Best For |
| --- | --- | --- |
| simple-shuffle | Random selection | Even distribution |
| least-busy | Route to the deployment with the fewest active requests | High throughput |
| latency-based-routing | Route to the fastest-responding deployment | Latency-sensitive apps |
| cost-based-routing | Route to the cheapest available option | Cost optimization |

Configure in your YAML:

router_settings:
  routing_strategy: latency-based-routing

  # Optionally split traffic across deployments by weight
  model_group_alias:
    gpt-4o:
      - model: azure/gpt-4o
        weight: 0.7  # 70% of traffic
      - model: openai/gpt-4o
        weight: 0.3  # 30% of traffic
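Weighted splits like the 70/30 example above boil down to weighted random selection. Here is a minimal sketch of the idea; the `pick_deployment` helper is hypothetical, not a LiteLLM API:

```python
import random


def pick_deployment(deployments, weights, rng=random):
    """Pick one deployment, with probability proportional to its weight."""
    return rng.choices(deployments, weights=weights, k=1)[0]


deployments = ["azure/gpt-4o", "openai/gpt-4o"]
weights = [0.7, 0.3]

# Over many picks, roughly 70% should land on the Azure deployment
picks = [pick_deployment(deployments, weights) for _ in range(10_000)]
azure_share = picks.count("azure/gpt-4o") / len(picks)
print(f"azure share: {azure_share:.2f}")  # ~0.70
```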

Streaming Support: It Just Works

Unlike the Azure APIM approach where streaming requires special handling, LiteLLM Proxy handles SSE (Server-Sent Events) natively:

from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

# Streaming works exactly like direct OpenAI
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about resilience"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

If the primary provider fails mid-stream, LiteLLM will:

  1. Detect the connection failure
  2. Automatically retry on the next provider
  3. Return an error only if all providers fail
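Even so, a defensive client can wrap the whole stream in a retry loop for the rare case where every provider fails. This is a generic sketch - the `retry_stream` helper is hypothetical, not part of LiteLLM or the OpenAI SDK:

```python
import time


def retry_stream(start_stream, consume, max_attempts=3, base_delay=1.0):
    """Start a stream and consume it, restarting from scratch on failure.

    start_stream: callable that opens a new stream
                  (e.g. a chat.completions.create(..., stream=True) call)
    consume: callable that iterates the stream and returns the collected text
    """
    for attempt in range(max_attempts):
        try:
            return consume(start_stream())
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all attempts exhausted, surface the error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff


# Usage against the proxy might look like:
# text = retry_stream(
#     lambda: client.chat.completions.create(model="gpt-4o", messages=msgs, stream=True),
#     lambda s: "".join(c.choices[0].delta.content or "" for c in s),
# )
```

Note that restarting a stream re-generates the response from the beginning; for user-facing UIs you would buffer chunks and only render once the stream completes, or accept a visible restart.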

Production Configuration: Enterprise-Ready Setup

High Availability Deployment

For production, deploy multiple LiteLLM instances behind a load balancer:

# docker-compose.yml
version: '3.8'

services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4001:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  litellm-2:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4002:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "4000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - litellm-1
      - litellm-2
    restart: always

Nginx Load Balancer Configuration

# nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream litellm {
        least_conn;
        server litellm-1:4000 weight=1;
        server litellm-2:4000 weight=1;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://litellm;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
            proxy_buffering off;  # Important for streaming
        }

        location /health {
            proxy_pass http://litellm;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}

Advanced Features

1. Budget & Rate Limiting

Control spending and prevent runaway costs:

general_settings:
  master_key: sk-your-master-key

# User-level budgets
litellm_settings:
  max_budget: 100.00  # $100 max per user
  budget_duration: monthly

Create users with specific limits:

curl -X POST 'http://localhost:4000/user/new' \
  -H 'Authorization: Bearer sk-your-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'

2. Request Caching

Reduce costs and latency with response caching backed by Redis (LiteLLM also supports semantic caching):

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour cache
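Conceptually, exact-match response caching keys each cached response on the request's contents, so only byte-identical requests hit the cache (semantic caching relaxes this by comparing embeddings). An illustrative sketch of such a key - not LiteLLM's actual key format:

```python
import hashlib
import json


def cache_key(model, messages, **params):
    """Derive a deterministic cache key from the request contents."""
    payload = json.dumps(
        {"model": model, "messages": messages, **params},
        sort_keys=True,  # stable ordering so equivalent requests hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()


k1 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "Hi!"}])
print(k1 == k2, k1 == k3)  # True False
```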

3. Custom Callbacks & Logging

Track every request for observability:

litellm_settings:
  success_callback: ["langfuse", "prometheus"]  # Langfuse & Prometheus integrations
  failure_callback: ["langfuse", "slack"]

  # Langfuse integration
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

4. Guardrails & Content Moderation

Add safety layers:

litellm_settings:
  guardrails:
    - guardrail_name: "content-filter"
      litellm_params:
        guardrail: openai_moderation
        mode: pre_call  # Check before sending to LLM
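In pre_call mode the guardrail screens the input before it ever reaches a model, so blocked requests cost nothing. Here is a toy illustration of that flow using a simple blocklist rather than a real moderation API - the `pre_call_check` function is hypothetical:

```python
def pre_call_check(messages, blocked_terms):
    """Reject a request if any message contains a blocked term."""
    for msg in messages:
        content = (msg.get("content") or "").lower()
        for term in blocked_terms:
            if term in content:
                raise ValueError(f"Blocked by content filter: {term!r}")
    return messages  # safe to forward to the LLM


# A clean request passes through unchanged
pre_call_check([{"role": "user", "content": "Hello!"}], ["credit card number"])
```

A real guardrail (like the openai_moderation one configured above) replaces the blocklist with a moderation model, but the placement in the request pipeline is the same.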

Comparing Results: LiteLLM vs Azure APIM

I ran the same load test from my Azure article against both architectures:

| Metric | Azure APIM | LiteLLM Proxy |
| --- | --- | --- |
| Success rate | 99.4% | 99.6% |
| Avg latency | 2,184 ms | 1,892 ms |
| P95 latency | 4,128 ms | 3,456 ms |
| Setup time | ~4 hours | ~30 minutes |
| Monthly cost | ~$500+ | ~$50 (compute only) |
| Provider lock-in | Azure only | Any provider |

Key observations:

  • LiteLLM showed slightly better latency thanks to its simpler request pipeline
  • Both achieved similar reliability with proper configuration
  • LiteLLM's multi-provider fallback provided an extra safety net
  • Cost difference is significant for smaller teams

When to Use Which?

Choose Azure APIM + Front Door When:

  • You're all-in on Azure and need native integration
  • Enterprise compliance requirements mandate Azure services
  • You need WAF/DDoS protection at the edge
  • Your organization has existing APIM expertise
  • Audit logging must stay within Azure ecosystem

Choose LiteLLM Proxy When:

  • You need multi-provider failover (not just multi-region)
  • Cost optimization is a priority
  • You want provider flexibility to switch easily
  • Your team prefers simple YAML configuration over XML policies
  • You're running on Kubernetes, AWS, GCP, or on-prem
  • You need rapid prototyping and iteration

Production Checklist

If you're deploying LiteLLM Proxy to production:

  • [ ] Deploy Multiple Instances: At least 2 behind a load balancer
  • [ ] Enable Health Checks: Configure /health endpoint monitoring
  • [ ] Set Up Database: PostgreSQL for persistence and analytics
  • [ ] Configure Caching: Redis for semantic caching
  • [ ] Add Monitoring: Prometheus + Grafana or Langfuse
  • [ ] Set Budget Limits: Prevent runaway costs
  • [ ] Secure the Proxy: Use master key authentication
  • [ ] Enable TLS: HTTPS in production (via nginx or cloud LB)
  • [ ] Configure Alerts: Slack/PagerDuty for failures
  • [ ] Test Failover: Deliberately fail providers to verify behavior
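For that last item, a failover drill can be as simple as disabling the primary deployment, sending a batch of requests, and tallying which deployment served each one (LiteLLM can expose the serving deployment's id in the response, e.g. via a response header; the exact field may vary by version). A sketch of the tallying side, with hypothetical drill data:

```python
from collections import Counter


def summarize_failover(served_by):
    """Tally which deployment handled each request during a drill."""
    counts = Counter(served_by)
    total = sum(counts.values())
    return {dep: n / total for dep, n in counts.items()}


# Hypothetical drill result: West US disabled, traffic shifted to fallbacks
served = ["azure-eastus-gpt4o"] * 70 + ["openai-direct-gpt4o"] * 30
shares = summarize_failover(served)
print(shares)  # {'azure-eastus-gpt4o': 0.7, 'openai-direct-gpt4o': 0.3}
assert "azure-westus-gpt4o" not in shares  # the disabled primary took no traffic
```

Run the same drill for each provider in the chain so you know the fallback order actually holds before an outage forces the question.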

Conclusion: The Right Tool for the Job

Both Azure APIM and LiteLLM Proxy solve the same fundamental problem - making LLM services reliable at scale. The choice depends on your constraints:

Azure APIM is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features.

LiteLLM Proxy is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.

The best part? These aren't mutually exclusive. You can run LiteLLM Proxy behind Azure Front Door to get the best of both worlds - enterprise edge security with flexible provider routing.

πŸ“¦ LiteLLM GitHub: github.com/BerriAI/litellm

πŸ“„ LiteLLM Docs: docs.litellm.ai

The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.

Top comments (1)

Ali Muwwakkil

A fascinating aspect of implementing multi-provider LLM setups is how often teams overlook agents' roles in managing load and failover strategies. In practice, we found leveraging custom agents for task-specific routing can dramatically enhance the efficiency of your LiteLLM Proxy setup. These agents aren't just about distributing load; they're about dynamically adapting to each provider's strengths, optimizing performance in real-time. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)