DEV Community

Deneesh Narayanasamy

LiteLLM Proxy: The Open-Source Alternative for Multi-Provider LLM Failover and Load Balancing

Introduction: What If You Could Use ANY LLM Provider?

In my previous article, I walked through building a multi-region failover architecture for Azure OpenAI using Azure Front Door and API Management (APIM). It works brilliantly - but it's also Azure-specific, requires significant infrastructure, and locks you into a single provider ecosystem.

What if you need:

  • Multi-provider failover (Azure OpenAI -> OpenAI -> Anthropic -> Gemini)
  • A simpler deployment without managing APIM policies
  • Provider-agnostic architecture that works anywhere
  • Open-source flexibility with no vendor lock-in

Enter LiteLLM Proxy - an open-source unified gateway that gives you all of this out of the box.


What is LiteLLM Proxy?

LiteLLM is an open-source Python library and proxy server that provides:

  • Unified API: One OpenAI-compatible endpoint for 100+ LLM providers
  • Built-in Load Balancing: Distribute requests across multiple deployments
  • Automatic Failover: Seamlessly retry on different models/providers when one fails
  • Rate Limit Handling: Intelligent retry with exponential backoff for 429 errors
  • Cost Tracking: Monitor spend across all providers in one place
  • Streaming Support: Full SSE (Server-Sent Events) support with proper failover

The beauty? Your application code doesn't change. You point your OpenAI SDK at LiteLLM Proxy, and it handles the rest.


Architecture: LiteLLM Proxy vs Azure APIM

Here's how LiteLLM Proxy compares to the Azure-native approach:

Azure APIM Architecture (Previous Article)

Client -> Azure Front Door -> Regional APIM -> Azure OpenAI (Primary)
                                            -> Azure OpenAI (Secondary)

Pros: Native Azure integration, enterprise compliance, WAF protection
Cons: Azure-only, complex policies, expensive at scale

LiteLLM Proxy Architecture

Client -> Load Balancer -> LiteLLM Proxy -> Azure OpenAI
                                         -> OpenAI Direct
                                         -> Anthropic Claude
                                         -> Google Gemini
                                         -> AWS Bedrock
                                         -> Any LLM Provider

Supported Providers: Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ more

Pros: Provider-agnostic, simple configuration, open-source, runs anywhere
Cons: Self-managed infrastructure, requires containerization


Getting Started: 5-Minute Setup

Option 1: Docker (Recommended for Production)

# Pull the official image
docker pull ghcr.io/berriai/litellm:main-latest

# Run with your config
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e AZURE_API_KEY="your-azure-key" \
  -e OPENAI_API_KEY="your-openai-key" \
  -e ANTHROPIC_API_KEY="your-anthropic-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Option 2: Python (Quick Testing)

pip install 'litellm[proxy]'
litellm --config litellm_config.yaml

The Configuration File

Create litellm_config.yaml:

model_list:
  # Primary: Azure OpenAI GPT-4o (West US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://westus-primary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-westus-gpt4o

  # Failover 1: Azure OpenAI GPT-4o (East US)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: https://eastus-secondary.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY_SECONDARY
      api_version: "2024-08-01-preview"
    model_info:
      id: azure-eastus-gpt4o

  # Failover 2: OpenAI Direct
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      id: openai-direct-gpt4o

  # Failover 3: Anthropic Claude (ultimate backup)
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      id: anthropic-claude-sonnet

litellm_settings:
  # Enable automatic failover
  num_retries: 3
  retry_after: 5

  # Fallback configuration
  fallbacks:
    - gpt-4o: [gpt-4o]  # Retry across all gpt-4o deployments

  # Request timeout
  request_timeout: 120

  # Enable streaming
  stream: true

router_settings:
  # Load balancing strategy
  routing_strategy: least-busy

  # Enable rate limit awareness
  enable_pre_call_checks: true

  # Cooldown failed deployments
  cooldown_time: 60

  # Number of retries per deployment
  num_retries: 2

  # Wait before retrying, and tolerate this many failures before cooldown
  retry_after: 5
  allowed_fails: 3

general_settings:
  # Master key for proxy authentication
  master_key: os.environ/LITELLM_MASTER_KEY

  # Database for tracking (optional)
  database_url: os.environ/DATABASE_URL

The Magic: How Failover Actually Works

Automatic 429 Handling

When Azure OpenAI returns a 429 (rate limit), LiteLLM automatically:

  1. Reads the Retry-After header
  2. Marks that deployment as "cooling down"
  3. Routes the request to the next available deployment
  4. Continues until a successful response or all deployments are exhausted

# Your code stays simple - LiteLLM handles everything
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"  # Point to LiteLLM Proxy
)

# This request automatically fails over if needed
response = client.chat.completions.create(
    model="gpt-4o",  # LiteLLM routes to best available
    messages=[{"role": "user", "content": "Hello!"}]
)
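Conceptually, steps 2 and 3 above amount to a small piece of bookkeeping: remember when each deployment was rate-limited and skip it until its cooldown expires. The sketch below is purely illustrative - the `CooldownTracker` class is hypothetical, not LiteLLM's actual implementation:

```python
import time


class CooldownTracker:
    """Tracks which deployments are cooling down after a 429."""

    def __init__(self, cooldown_seconds=60.0):
        self.cooldown_seconds = cooldown_seconds
        self._cooling = {}  # deployment id -> cooldown expiry (monotonic time)

    def mark_rate_limited(self, deployment_id, retry_after=None):
        # Honor the Retry-After header when present, else use the default cooldown
        wait = retry_after if retry_after is not None else self.cooldown_seconds
        self._cooling[deployment_id] = time.monotonic() + wait

    def available(self, deployment_ids):
        # Return only deployments whose cooldown has expired
        now = time.monotonic()
        return [d for d in deployment_ids if self._cooling.get(d, 0.0) <= now]


tracker = CooldownTracker(cooldown_seconds=60)
tracker.mark_rate_limited("azure-westus-gpt4o", retry_after=30)
print(tracker.available(["azure-westus-gpt4o", "azure-eastus-gpt4o"]))
# ['azure-eastus-gpt4o']
```

With that structure in place, routing is just "pick from `available(...)`, and mark deployments on every 429" - which is the behavior the proxy gives you for free.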

Load Balancing Strategies

LiteLLM supports multiple routing strategies:

| Strategy | Description | Best For |
| --- | --- | --- |
| simple-shuffle | Random selection | Even distribution |
| least-busy | Route to the deployment with the fewest active requests | High throughput |
| latency-based-routing | Route to the fastest-responding deployment | Latency-sensitive apps |
| cost-based-routing | Route to the cheapest available option | Cost optimization |

Configure in your YAML:

router_settings:
  routing_strategy: latency-based-routing

  # Optionally split traffic across deployments by weight
  model_group_alias:
    gpt-4o:
      - model: azure/gpt-4o
        weight: 0.7  # 70% of traffic
      - model: openai/gpt-4o
        weight: 0.3  # 30% of traffic
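Weighted splits like the 70/30 example above boil down to weighted random selection. Here is a minimal sketch of the idea; the `pick_deployment` helper is hypothetical, not a LiteLLM API:

```python
import random


def pick_deployment(deployments, weights, rng=random):
    """Pick one deployment, with probability proportional to its weight."""
    return rng.choices(deployments, weights=weights, k=1)[0]


deployments = ["azure/gpt-4o", "openai/gpt-4o"]
weights = [0.7, 0.3]

# Over many picks, roughly 70% should land on the Azure deployment
picks = [pick_deployment(deployments, weights) for _ in range(10_000)]
azure_share = picks.count("azure/gpt-4o") / len(picks)
print(f"azure share: {azure_share:.2f}")  # ~0.70
```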

Streaming Support: It Just Works

Unlike the Azure APIM approach where streaming requires special handling, LiteLLM Proxy handles SSE (Server-Sent Events) natively:

from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-key",
    base_url="http://localhost:4000"
)

# Streaming works exactly like direct OpenAI
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about resilience"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

If the primary provider fails mid-stream, LiteLLM will:

  1. Detect the connection failure
  2. Automatically retry on the next provider
  3. Return an error only if all providers fail
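Even so, a defensive client can wrap the whole stream in a retry loop for the rare case where every provider fails. This is a generic sketch - the `retry_stream` helper is hypothetical, not part of LiteLLM or the OpenAI SDK:

```python
import time


def retry_stream(start_stream, consume, max_attempts=3, base_delay=1.0):
    """Start a stream and consume it, restarting from scratch on failure.

    start_stream: callable that opens a new stream
                  (e.g. a chat.completions.create(..., stream=True) call)
    consume: callable that iterates the stream and returns the collected text
    """
    for attempt in range(max_attempts):
        try:
            return consume(start_stream())
        except Exception:
            if attempt == max_attempts - 1:
                raise  # all attempts exhausted, surface the error
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff


# Usage against the proxy might look like:
# text = retry_stream(
#     lambda: client.chat.completions.create(model="gpt-4o", messages=msgs, stream=True),
#     lambda s: "".join(c.choices[0].delta.content or "" for c in s),
# )
```

Note that restarting a stream re-generates the response from the beginning; for user-facing UIs you would buffer chunks and only render once the stream completes, or accept a visible restart.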

Production Configuration: Enterprise-Ready Setup

High Availability Deployment

For production, deploy multiple LiteLLM instances behind a load balancer:

# docker-compose.yml
version: '3.8'

services:
  litellm-1:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4001:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  litellm-2:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4002:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - AZURE_API_KEY=${AZURE_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: --config /app/config.yaml
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "4000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - litellm-1
      - litellm-2
    restart: always

Nginx Load Balancer Configuration

# nginx.conf
events {
    worker_connections 1024;
}

http {
    upstream litellm {
        least_conn;
        server litellm-1:4000 weight=1;
        server litellm-2:4000 weight=1;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://litellm;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
            proxy_buffering off;  # Important for streaming
        }

        location /health {
            proxy_pass http://litellm;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}

Advanced Features

1. Budget & Rate Limiting

Control spending and prevent runaway costs:

general_settings:
  master_key: sk-your-master-key

# User-level budgets
litellm_settings:
  max_budget: 100.00  # $100 max per user
  budget_duration: monthly

Create users with specific limits:

curl -X POST 'http://localhost:4000/user/new' \
  -H 'Authorization: Bearer sk-your-master-key' \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "user-123",
    "max_budget": 50.00,
    "budget_duration": "monthly",
    "models": ["gpt-4o", "gpt-3.5-turbo"]
  }'

2. Request Caching

Reduce costs and latency with response caching backed by Redis (LiteLLM also supports semantic caching):

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour cache
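Conceptually, exact-match response caching keys each cached response on the request's contents, so only byte-identical requests hit the cache (semantic caching relaxes this by comparing embeddings). An illustrative sketch of such a key - not LiteLLM's actual key format:

```python
import hashlib
import json


def cache_key(model, messages, **params):
    """Derive a deterministic cache key from the request contents."""
    payload = json.dumps(
        {"model": model, "messages": messages, **params},
        sort_keys=True,  # stable ordering so equivalent requests hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()


k1 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "Hello!"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "Hi!"}])
print(k1 == k2, k1 == k3)  # True False
```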

3. Custom Callbacks & Logging

Track every request for observability:

litellm_settings:
  success_callback: ["langfuse", "prometheus"]  # Langfuse & Prometheus integrations
  failure_callback: ["langfuse", "slack"]

  # Langfuse integration
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

4. Guardrails & Content Moderation

Add safety layers:

litellm_settings:
  guardrails:
    - guardrail_name: "content-filter"
      litellm_params:
        guardrail: openai_moderation
        mode: pre_call  # Check before sending to LLM
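In pre_call mode the guardrail screens the input before it ever reaches a model, so blocked requests cost nothing. Here is a toy illustration of that flow using a simple blocklist rather than a real moderation API - the `pre_call_check` function is hypothetical:

```python
def pre_call_check(messages, blocked_terms):
    """Reject a request if any message contains a blocked term."""
    for msg in messages:
        content = (msg.get("content") or "").lower()
        for term in blocked_terms:
            if term in content:
                raise ValueError(f"Blocked by content filter: {term!r}")
    return messages  # safe to forward to the LLM


# A clean request passes through unchanged
pre_call_check([{"role": "user", "content": "Hello!"}], ["credit card number"])
```

A real guardrail (like the openai_moderation one configured above) replaces the blocklist with a moderation model, but the placement in the request pipeline is the same.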

Comparing Results: LiteLLM vs Azure APIM

I ran the same load test from my Azure article against both architectures:

| Metric | Azure APIM | LiteLLM Proxy |
| --- | --- | --- |
| Success rate | 99.4% | 99.6% |
| Avg latency | 2,184 ms | 1,892 ms |
| P95 latency | 4,128 ms | 3,456 ms |
| Setup time | ~4 hours | ~30 minutes |
| Monthly cost | ~$500+ | ~$50 (compute only) |
| Provider lock-in | Azure only | Any provider |

Key observations:

  • LiteLLM showed slightly better latency thanks to its simpler request pipeline
  • Both achieved similar reliability with proper configuration
  • LiteLLM's multi-provider fallback provided an extra safety net
  • Cost difference is significant for smaller teams

When to Use Which?

Choose Azure APIM + Front Door When:

  • You're all-in on Azure and need native integration
  • Enterprise compliance requirements mandate Azure services
  • You need WAF/DDoS protection at the edge
  • Your organization has existing APIM expertise
  • Audit logging must stay within Azure ecosystem

Choose LiteLLM Proxy When:

  • You need multi-provider failover (not just multi-region)
  • Cost optimization is a priority
  • You want provider flexibility to switch easily
  • Your team prefers simple YAML configuration over XML policies
  • You're running on Kubernetes, AWS, GCP, or on-prem
  • You need rapid prototyping and iteration

Production Checklist

If you're deploying LiteLLM Proxy to production:

  • [ ] Deploy Multiple Instances: At least 2 behind a load balancer
  • [ ] Enable Health Checks: Configure /health endpoint monitoring
  • [ ] Set Up Database: PostgreSQL for persistence and analytics
  • [ ] Configure Caching: Redis for semantic caching
  • [ ] Add Monitoring: Prometheus + Grafana or Langfuse
  • [ ] Set Budget Limits: Prevent runaway costs
  • [ ] Secure the Proxy: Use master key authentication
  • [ ] Enable TLS: HTTPS in production (via nginx or cloud LB)
  • [ ] Configure Alerts: Slack/PagerDuty for failures
  • [ ] Test Failover: Deliberately fail providers to verify behavior
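For that last item, a failover drill can be as simple as disabling the primary deployment, sending a batch of requests, and tallying which deployment served each one (LiteLLM can expose the serving deployment's id in the response, e.g. via a response header; the exact field may vary by version). A sketch of the tallying side, with hypothetical drill data:

```python
from collections import Counter


def summarize_failover(served_by):
    """Tally which deployment handled each request during a drill."""
    counts = Counter(served_by)
    total = sum(counts.values())
    return {dep: n / total for dep, n in counts.items()}


# Hypothetical drill result: West US disabled, traffic shifted to fallbacks
served = ["azure-eastus-gpt4o"] * 70 + ["openai-direct-gpt4o"] * 30
shares = summarize_failover(served)
print(shares)  # {'azure-eastus-gpt4o': 0.7, 'openai-direct-gpt4o': 0.3}
assert "azure-westus-gpt4o" not in shares  # the disabled primary took no traffic
```

Run the same drill for each provider in the chain so you know the fallback order actually holds before an outage forces the question.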

Conclusion: The Right Tool for the Job

Both Azure APIM and LiteLLM Proxy solve the same fundamental problem - making LLM services reliable at scale. The choice depends on your constraints:

Azure APIM is the enterprise choice when you're committed to Azure and need the full power of the platform's security and compliance features.

LiteLLM Proxy is the pragmatic choice when you need flexibility, multi-provider support, or a simpler operational model.

The best part? These aren't mutually exclusive. You can run LiteLLM Proxy behind Azure Front Door to get the best of both worlds - enterprise edge security with flexible provider routing.

πŸ“¦ LiteLLM GitHub: github.com/BerriAI/litellm

πŸ“„ LiteLLM Docs: docs.litellm.ai

The days of single-provider dependency are over. Whether you choose managed Azure services or open-source flexibility, the key is building resilience into your AI infrastructure from day one. Your 3 AM self will thank you.

Top comments (1)

Ali Muwwakkil

A fascinating aspect of implementing multi-provider LLM setups is how often teams overlook agents' roles in managing load and failover strategies. In practice, we found leveraging custom agents for task-specific routing can dramatically enhance the efficiency of your LiteLLM Proxy setup. These agents aren't just about distributing load; they're about dynamically adapting to each provider's strengths, optimizing performance in real-time. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)