Debby McKinney
Your LiteLLM Failover Might Be Adding 30+ Seconds of Latency (Here's Why)

If you're using LiteLLM for failover, you probably expect instant provider switching when OpenAI goes down. Configure fallback to Anthropic, requests route automatically, users never notice.

That's the theory. Here's the reality: your requests might be hanging for 30+ seconds before fallback even triggers.

This post breaks down how LiteLLM's retry logic actually works, why it causes timeout accumulation, and how gateways like Bifrost solve this with bounded failover latency (4 seconds vs 60 seconds for two-provider chains).

TL;DR: LiteLLM accumulates timeouts across retries. 3 retries × 10-second timeout = 30 seconds per provider before fallback. A gateway like Bifrost handles failover with 2-second max per provider, no accumulation.

The Config You're Probably Using

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
  - model_name: gpt-4o
    litellm_params:
      model: anthropic/claude-3-5-sonnet

litellm_settings:
  num_retries: 3
  request_timeout: 10
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

You think: OpenAI fails → instant fallback to Anthropic.

What actually happens: OpenAI fails → retry OpenAI → retry OpenAI → retry OpenAI → then try Anthropic.

Each retry waits for the full timeout. 3 retries × 10 seconds = 30 seconds before fallback triggers.

The Timeout Accumulation Problem

Your request_timeout: 10 setting doesn't mean "maximum 10 seconds total." It means "10 seconds per attempt."

With 3 retries configured:

  • Attempt 1: 10 seconds (times out)
  • Attempt 2: 10 seconds (times out)
  • Attempt 3: 10 seconds (times out)
  • Then fallback to Anthropic

Your users wait 30 seconds before Anthropic even gets tried.

If Anthropic also times out with 3 retries, total request time hits 60 seconds.
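The accumulation is simple arithmetic, but it's worth making explicit. A quick sketch (counting retries the way the config above does, and ignoring network and processing time):

```python
def worst_case_failover_delay(num_retries: int, request_timeout: float, providers: int = 1) -> float:
    """Worst-case seconds spent before a request finally errors out,
    when every retry on every provider waits the full timeout."""
    return providers * num_retries * request_timeout

# 3 retries x 10 s = 30 s before the fallback provider is even tried
print(worst_case_failover_delay(3, 10))               # 30
# Both providers in the chain timing out: 60 s total
print(worst_case_failover_delay(3, 10, providers=2))  # 60
```

Every provider you add to the fallback chain multiplies the worst case, because each one gets its full retry budget before the next is tried.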


Quick fix if you want to skip the technical details: Bifrost solves this with bounded timeout per provider (2s max), not timeout per retry. Two-provider failover completes in 4 seconds total, not 60. Zero-config setup in under 60 seconds.


The Retry Delay Bug

LiteLLM is supposed to wait between retries. Exponential backoff, respect rate limit headers, intelligent delays.

Reality (from GitHub issue #6011):

"Despite configuring a 30-second delay between retries, the system does not seem to pause at all before retrying."

The retry_after configuration gets ignored in certain versions. Retries happen immediately, exhausting your retry budget in seconds instead of waiting for provider recovery.

This causes:

  • Retry storms hitting the same failing endpoint repeatedly
  • Rate limit violations from too many immediate retries
  • No time for transient issues to resolve

Streaming Requests Get No Retry Protection

Non-streaming requests get retry logic. Streaming requests? Nope.

From GitHub issue #8648:

"When a streaming request encounters a 429 error before any data has been sent to the client, the retry mechanism does not seem to be triggered."

If you're using streaming for real-time chat or code generation, provider failures surface directly to your users. No retry. No fallback. Just errors.
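One workaround is to retry manually before the first chunk reaches the client: if the stream fails before anything has been sent, nothing downstream has seen partial output yet, so a fresh attempt is safe. A hedged sketch, where `create_stream` is a stand-in for whatever callable returns your chunk iterator (not a LiteLLM API):

```python
import time

def stream_with_retry(create_stream, max_attempts=3, backoff=1.0):
    """Retry a streaming call manually. Failures before the first chunk
    can bypass gateway retry logic, so we pull the first chunk eagerly:
    if that fails, no data has reached the client and we can retry."""
    for attempt in range(max_attempts):
        try:
            stream = create_stream()
            first = next(stream)  # a 429/timeout here means nothing was sent yet

            def chunks():
                yield first
                yield from stream

            return chunks()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
```

Once the first chunk has been forwarded, retrying would duplicate output, so this sketch only protects the pre-stream window — which is exactly where the issue above says LiteLLM's retry never fires.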

Rate Limit Handling Is Broken

When OpenAI returns HTTP 429 with a Retry-After header saying "wait 60 seconds," LiteLLM should wait 60 seconds.

What actually happens (GitHub issue #7669):

"All retries happen immediately and fail because it's a RateLimitError."

It ignores the rate limit signal and retries immediately. Three times. Wasting your retry budget on requests that are guaranteed to fail.
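If you're stuck with this behavior, you can wrap calls and honor the hint yourself. A sketch under an assumed error shape — the `response.headers` attribute on the exception is hypothetical, so adapt the header lookup to whatever your client library actually raises:

```python
import time

def call_honoring_retry_after(make_request, max_attempts=3, default_wait=1.0):
    """Sketch: respect the provider's Retry-After hint manually,
    instead of letting retries fire back-to-back against a rate limit."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            # Hypothetical error shape: an HTTP response with headers hung off
            # the exception. Fall back to exponential backoff if absent.
            headers = getattr(getattr(exc, "response", None), "headers", None) or {}
            wait = float(headers.get("Retry-After", default_wait * (2 ** attempt)))
            time.sleep(wait)
```

The point is just that the wait comes from the provider's own signal when available, so retries land after the rate-limit window instead of burning the budget inside it.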

What This Looks Like in Production

Scenario 1: User-facing chat

  • User sends message
  • OpenAI times out after 10 seconds
  • Retry 1 times out after 10 seconds
  • Retry 2 times out after 10 seconds
  • Retry 3 times out after 10 seconds
  • Finally tries Anthropic
  • User waited 40+ seconds for a response

Scenario 2: Streaming code generation

  • User requests code completion
  • Provider returns 429 rate limit
  • No retry triggered (streaming request)
  • User sees error immediately
  • No failover happened at all

Scenario 3: Background processing

  • Job queue processing LLM requests
  • Provider failure triggers retry storm
  • 3 retries × 100 queued requests = 300 API calls in seconds
  • Rate limits trigger from retry volume
  • Cascade failure across all requests
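For queue workers, putting a small circuit breaker in front of the provider call is one way to stop a retry storm before it cascades. A minimal sketch (class name and thresholds are illustrative, not from any library):

```python
import time

class SimpleCircuitBreaker:
    """Sketch: stop hammering a failing provider from a job queue.
    After `threshold` consecutive failures, reject calls for `cooldown`
    seconds instead of letting every queued request retry a dead endpoint."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None = circuit closed (calls allowed)

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: cooldown elapsed, let one attempt probe the provider
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
```

With 100 queued requests behind a tripped breaker, you get a handful of probe calls during recovery instead of 300 doomed retries in seconds.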

How Bifrost Solves This

maximhq / bifrost on GitHub

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Bifrost implements failover without timeout accumulation. Here's the difference:

LiteLLM approach:

  • 3 retries × 10-second timeout = 30 seconds per provider
  • Retry delays often ignored (GitHub issues document this)
  • Streaming requests get no retry protection
  • Total failover time unpredictable

Bifrost approach:

  • Timeout per attempt, not per provider
  • First timeout triggers immediate fallback
  • Streaming and non-streaming get same retry protection
  • Bounded maximum latency

Configuration:

# Bifrost - zero configuration needed
npx -y @maximhq/bifrost

# Then just add fallbacks to your request
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{
    "model": "openai/gpt-4o",
    "fallbacks": ["anthropic/claude-3-5-sonnet"],
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Execution flow:

  1. Try OpenAI (max 2 seconds)
  2. If fails, try Anthropic immediately (max 2 seconds)
  3. Return response or error

Total maximum latency: 4 seconds for a two-provider chain, not 60 seconds.
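The bounded pattern is easy to sketch in Python with a hard per-attempt deadline. This illustrates the idea only — it is not Bifrost's actual implementation (which is written in Go):

```python
import concurrent.futures

def bounded_failover(providers, call, per_attempt_timeout=2.0):
    """Sketch of bounded failover: one attempt per provider with a hard
    per-attempt deadline, then move on immediately -- no retry accumulation.
    Worst case = len(providers) * per_attempt_timeout.
    (Note: a timed-out thread keeps running in the background; a real
    gateway would also cancel the underlying connection.)"""
    last_error = None
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for provider in providers:
            future = pool.submit(call, provider)
            try:
                return future.result(timeout=per_attempt_timeout)
            except Exception as exc:  # timeout or provider error: try the next
                last_error = exc
    raise last_error
```

The key design choice: the timeout bounds each *attempt*, and a failure advances immediately to the next provider rather than re-trying the one that just failed.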

Real example from production:

  • OpenAI experiencing outage
  • LiteLLM users: 30-60 second delays before fallback completes
  • Bifrost users: 2-4 second failover, seamless for end users

The difference matters when you're serving real users who expect sub-5-second responses.

Setup Bifrost in under 60 seconds:

# NPX (instant)
npx -y @maximhq/bifrost

# Docker
docker run -p 8080:8080 maximhq/bifrost

Add your API keys through the web UI at http://localhost:8080. No YAML configuration. No retry tuning. No debugging why retries aren't working.

Point your existing OpenAI SDK at Bifrost:

from openai import OpenAI

client = OpenAI(
    api_key="your-bifrost-key",
    base_url="http://localhost:8080/v1"  # Only line that changes
)

# Your existing code works unchanged
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Bifrost handles failover, load balancing, semantic caching, and observability at the infrastructure level. Your application code stays clean.

Why this matters for production:

  • 11µs overhead at 5,000 RPS (50x faster than Python-based gateways)
  • Built-in dashboard with real-time logs and cost analytics
  • Prometheus metrics at /metrics for existing monitoring stacks
  • Zero-config deployment — running in under a minute
  • Automatic failover with bounded latency guarantees

If you're using LiteLLM and hitting these retry issues, Bifrost is a drop-in replacement that solves the timeout accumulation problem.


Other Options If You're Staying with LiteLLM

Option 1: Aggressive timeouts

Set very low timeout values to limit accumulation:

litellm_settings:
  num_retries: 2
  request_timeout: 2  # 2 seconds instead of 10

Max delay per provider: 4 seconds (2 retries × 2 seconds). Still adds up with multiple providers, but better than 30+ seconds.

Option 2: Disable retries, use only fallbacks

litellm_settings:
  num_retries: 0  # No retries
  fallbacks: [{"gpt-4o": ["claude-3-5-sonnet"]}]

Provider fails once → immediate fallback. No retry accumulation. But you lose retry protection for transient failures.

Option 3: Manual retry logic

Skip LiteLLM's built-in retry, handle it yourself with proper backoff:

import time
from litellm import completion

def request_with_intelligent_retry(model, messages, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return completion(
                model=model,
                messages=messages,
                num_retries=0,  # Disable LiteLLM retry
                request_timeout=5
            )
        except Exception as e:
            if attempt < max_attempts - 1:
                wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
                time.sleep(wait)
            else:
                raise

You control timing. You know exactly how long requests can take. But you're building infrastructure instead of shipping features.



The Reality Check

If you're spending time debugging retry delays, configuring backoff strategies, and testing timeout accumulation — you're solving infrastructure problems that modern gateways already solved.

LiteLLM works for:

  • Low-stakes async workloads where 30-60 second delays are acceptable
  • Teams with time to debug retry behavior across versions
  • Applications without strict latency requirements

It doesn't work for:

  • User-facing chat interfaces (<5s latency requirements)
  • Real-time streaming applications
  • Production systems serving thousands of concurrent users
  • Teams that need infrastructure to just work

Bifrost was built specifically to solve these problems.

Zero configuration. Bounded failover latency. Production-grade performance. Built-in observability.

If you're hitting LiteLLM's limitations, try Bifrost. It takes 60 seconds to set up, and it just works.


Get started with Bifrost:

Docs: https://docs.getbifrost.ai

GitHub: https://github.com/maximhq/bifrost

Setup guide: https://docs.getbifrost.ai/quickstart/gateway/setting-up

LiteLLM resources (if staying):

Retry docs: https://docs.litellm.ai/docs/completion/reliable_completions

GitHub issues to watch: #6011, #6623, #7669, #8648
