Debby McKinney
Your LiteLLM Failover Might Be Adding 30+ Seconds of Latency (Here's Why)

If you're using LiteLLM for failover, you probably expect instant provider switching when OpenAI goes down. Configure fallback to Anthropic, requests route automatically, users never notice.

That's the theory. Here's the reality: your requests might be hanging for 30+ seconds before fallback even triggers.

This post breaks down how LiteLLM's retry logic actually works, why it causes timeout accumulation, and how gateways like Bifrost solve this with bounded failover latency (4 seconds vs 60 seconds for two-provider chains).

TL;DR: LiteLLM accumulates timeouts across retries. 3 retries × 10-second timeout = 30 seconds per provider before fallback. A gateway like Bifrost handles failover with 2-second max per provider, no accumulation.

The Config You're Probably Using

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet

litellm_settings:
  num_retries: 3
  request_timeout: 10
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

You think: OpenAI fails → instant fallback to Anthropic.

What actually happens: OpenAI fails → retry OpenAI → retry OpenAI → retry OpenAI → then try Anthropic.

Each retry waits for the full timeout. 3 retries × 10 seconds = 30 seconds before fallback triggers.

The Timeout Accumulation Problem

Your request_timeout: 10 setting doesn't mean "maximum 10 seconds total." It means "10 seconds per attempt."

With 3 retries configured:

  • Attempt 1: 10 seconds (times out)
  • Attempt 2: 10 seconds (times out)
  • Attempt 3: 10 seconds (times out)
  • Then fallback to Anthropic

Your users wait 30 seconds before Anthropic even gets tried.

If Anthropic also times out with 3 retries, total request time hits 60 seconds.
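The accumulation is simple arithmetic, but it's worth making explicit. A quick sketch (counting retries the way the config above does, and ignoring network and processing time):

```python
def worst_case_failover_delay(num_retries: int, request_timeout: float, providers: int = 1) -> float:
    """Worst-case seconds spent before a request finally errors out,
    when every retry on every provider waits the full timeout."""
    return providers * num_retries * request_timeout

# 3 retries x 10 s = 30 s before the fallback provider is even tried
print(worst_case_failover_delay(3, 10))               # 30
# Both providers in the chain timing out: 60 s total
print(worst_case_failover_delay(3, 10, providers=2))  # 60
```

Every provider you add to the fallback chain multiplies the worst case, because each one gets its full retry budget before the next is tried.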


Quick fix if you want to skip the technical details: Bifrost solves this with bounded timeout per provider (2s max), not timeout per retry. Two-provider failover completes in 4 seconds total, not 60. Zero-config setup in under 60 seconds.


The Retry Delay Bug

LiteLLM is supposed to wait between retries. Exponential backoff, respect rate limit headers, intelligent delays.

Reality (from GitHub issue #6011):

"Despite configuring a 30-second delay between retries, the system does not seem to pause at all before retrying."

The retry_after configuration gets ignored in certain versions. Retries happen immediately, exhausting your retry budget in seconds instead of waiting for provider recovery.

This causes:

  • Retry storms hitting the same failing endpoint repeatedly
  • Rate limit violations from too many immediate retries
  • No time for transient issues to resolve

Streaming Requests Get No Retry Protection

Non-streaming requests get retry logic. Streaming requests? Nope.

From GitHub issue #8648:

"When a streaming request encounters a 429 error before any data has been sent to the client, the retry mechanism does not seem to be triggered."

If you're using streaming for real-time chat or code generation, provider failures surface directly to your users. No retry. No fallback. Just errors.
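One workaround is to retry manually before the first chunk reaches the client: if the stream fails before anything has been sent, nothing downstream has seen partial output yet, so a fresh attempt is safe. A hedged sketch, where `create_stream` is a stand-in for whatever callable returns your chunk iterator (not a LiteLLM API):

```python
import time

def stream_with_retry(create_stream, max_attempts=3, backoff=1.0):
    """Retry a streaming call manually. Failures before the first chunk
    can bypass gateway retry logic, so we pull the first chunk eagerly:
    if that fails, no data has reached the client and we can retry."""
    for attempt in range(max_attempts):
        try:
            stream = create_stream()
            first = next(stream)  # a 429/timeout here means nothing was sent yet

            def chunks():
                yield first
                yield from stream

            return chunks()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
```

Once the first chunk has been forwarded, retrying would duplicate output, so this sketch only protects the pre-stream window — which is exactly where the issue above says LiteLLM's retry never fires.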

Rate Limit Handling Is Broken

When OpenAI returns HTTP 429 with a Retry-After header saying "wait 60 seconds," LiteLLM should wait 60 seconds.

What actually happens (GitHub issue #7669):

"All retries happen immediately and fail because it's a RateLimitError."

It ignores the rate limit signal and retries immediately. Three times. Wasting your retry budget on requests that are guaranteed to fail.
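If you're stuck with this behavior, you can wrap calls and honor the hint yourself. A sketch under an assumed error shape — the `response.headers` attribute on the exception is hypothetical, so adapt the header lookup to whatever your client library actually raises:

```python
import time

def call_honoring_retry_after(make_request, max_attempts=3, default_wait=1.0):
    """Sketch: respect the provider's Retry-After hint manually,
    instead of letting retries fire back-to-back against a rate limit."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            # Hypothetical error shape: an HTTP response with headers hung off
            # the exception. Fall back to exponential backoff if absent.
            headers = getattr(getattr(exc, "response", None), "headers", None) or {}
            wait = float(headers.get("Retry-After", default_wait * (2 ** attempt)))
            time.sleep(wait)
```

The point is just that the wait comes from the provider's own signal when available, so retries land after the rate-limit window instead of burning the budget inside it.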

What This Looks Like in Production

Scenario 1: User-facing chat

  • User sends message
  • OpenAI times out after 10 seconds
  • Retry 1 times out after 10 seconds
  • Retry 2 times out after 10 seconds
  • Retry 3 times out after 10 seconds
  • Finally tries Anthropic
  • User waited 40+ seconds for a response

Scenario 2: Streaming code generation

  • User requests code completion
  • Provider returns 429 rate limit
  • No retry triggered (streaming request)
  • User sees error immediately
  • No failover happened at all

Scenario 3: Background processing

  • Job queue processing LLM requests
  • Provider failure triggers retry storm
  • 3 retries × 100 queued requests = 300 API calls in seconds
  • Rate limits trigger from retry volume
  • Cascade failure across all requests
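For queue workers, putting a small circuit breaker in front of the provider call is one way to stop a retry storm before it cascades. A minimal sketch (class name and thresholds are illustrative, not from any library):

```python
import time

class SimpleCircuitBreaker:
    """Sketch: stop hammering a failing provider from a job queue.
    After `threshold` consecutive failures, reject calls for `cooldown`
    seconds instead of letting every queued request retry a dead endpoint."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None = circuit closed (calls allowed)

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: cooldown elapsed, let one attempt probe the provider
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
```

With 100 queued requests behind a tripped breaker, you get a handful of probe calls during recovery instead of 300 doomed retries in seconds.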

How Bifrost Solves This

maximhq / bifrost on GitHub

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Bifrost implements failover without timeout accumulation. Here's the difference:

LiteLLM approach:

  • 3 retries × 10-second timeout = 30 seconds per provider
  • Retry delays often ignored (GitHub issues document this)
  • Streaming requests get no retry protection
  • Total failover time unpredictable

Bifrost approach:

  • Timeout per attempt, not per provider
  • First timeout triggers immediate fallback
  • Streaming and non-streaming get same retry protection
  • Bounded maximum latency

Configuration:

# Bifrost - zero configuration needed
npx -y @maximhq/bifrost

# Then just add fallbacks to your request
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-4o",
    "fallbacks": ["anthropic/claude-3-5-sonnet"],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Execution flow:

  1. Try OpenAI (max 2 seconds)
  2. If fails, try Anthropic immediately (max 2 seconds)
  3. Return response or error

Total maximum latency: 4 seconds for a two-provider chain, not 60 seconds.
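The bounded pattern is easy to sketch in Python with a hard per-attempt deadline. This illustrates the idea only — it is not Bifrost's actual implementation (which is written in Go):

```python
import concurrent.futures

def bounded_failover(providers, call, per_attempt_timeout=2.0):
    """Sketch of bounded failover: one attempt per provider with a hard
    per-attempt deadline, then move on immediately -- no retry accumulation.
    Worst case = len(providers) * per_attempt_timeout.
    (Note: a timed-out thread keeps running in the background; a real
    gateway would also cancel the underlying connection.)"""
    last_error = None
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for provider in providers:
            future = pool.submit(call, provider)
            try:
                return future.result(timeout=per_attempt_timeout)
            except Exception as exc:  # timeout or provider error: try the next
                last_error = exc
    raise last_error
```

The key design choice: the timeout bounds each *attempt*, and a failure advances immediately to the next provider rather than re-trying the one that just failed.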

Real example from production:

  • OpenAI experiencing outage
  • LiteLLM users: 30-60 second delays before fallback completes
  • Bifrost users: 2-4 second failover, seamless for end users

The difference matters when you're serving real users who expect sub-5-second responses.

Setup Bifrost in under 60 seconds:

# NPX (instant)
npx -y @maximhq/bifrost

# Docker
docker run -p 8080:8080 maximhq/bifrost

Add your API keys through the web UI at http://localhost:8080. No YAML configuration. No retry tuning. No debugging why retries aren't working.

Point your existing OpenAI SDK at Bifrost:

from openai import OpenAI

client = OpenAI(
    api_key="your-bifrost-key",
    base_url="http://localhost:8080/v1"  # Only line that changes
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Bifrost handles failover, load balancing, semantic caching, and observability at the infrastructure level. Your application code stays clean.

Why this matters for production:

  • 11µs overhead at 5,000 RPS (50x faster than Python-based gateways)
  • Built-in dashboard with real-time logs and cost analytics
  • Prometheus metrics at /metrics for existing monitoring stacks
  • Zero-config deployment — running in under a minute
  • Automatic failover with bounded latency guarantees

If you're using LiteLLM and hitting these retry issues, Bifrost is a drop-in replacement that solves the timeout accumulation problem.


Other Options If You're Staying with LiteLLM

Option 1: Aggressive timeouts

Set very low timeout values to limit accumulation:

litellm_settings:
  num_retries: 2
  request_timeout: 2  # 2 seconds instead of 10

Max delay per provider: 4 seconds (2 retries × 2 seconds). Still adds up with multiple providers, but better than 30+ seconds.

Option 2: Disable retries, use only fallbacks

litellm_settings:
  num_retries: 0  # No retries
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

Provider fails once → immediate fallback. No retry accumulation. But you lose retry protection for transient failures.

Option 3: Manual retry logic

Skip LiteLLM's built-in retry, handle it yourself with proper backoff:

import time
from litellm import completion

def request_with_intelligent_retry(model, messages, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return completion(
                model=model,
                messages=messages,
                num_retries=0,  # Disable LiteLLM retry
                request_timeout=5
            )
        except Exception as e:
            if attempt < max_attempts - 1:
                wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
                time.sleep(wait)
            else:
                raise

You control timing. You know exactly how long requests can take. But you're building infrastructure instead of shipping features.



The Reality Check

If you're spending time debugging retry delays, configuring backoff strategies, and testing timeout accumulation — you're solving infrastructure problems that modern gateways already solved.

LiteLLM works for:

  • Low-stakes async workloads where 30-60 second delays are acceptable
  • Teams with time to debug retry behavior across versions
  • Applications without strict latency requirements

It doesn't work for:

  • User-facing chat interfaces (<5s latency requirements)
  • Real-time streaming applications
  • Production systems serving thousands of concurrent users
  • Teams that need infrastructure to just work

Bifrost was built specifically to solve these problems.

Zero configuration. Bounded failover latency. Production-grade performance. Built-in observability.

If you're hitting LiteLLM's limitations, try Bifrost. It takes 60 seconds to set up, and it just works.


Get started with Bifrost:

Docs: https://docs.getbifrost.ai

GitHub: https://github.com/maximhq/bifrost

Setup guide: https://docs.getbifrost.ai/quickstart/gateway/setting-up

LiteLLM resources (if staying):

Retry docs: https://docs.litellm.ai/docs/completion/reliable_completions

GitHub issues to watch: #6011, #6623, #7669, #8648
