Debby McKinney

Your LLM Provider Just Went Down. Here's How to Stay Online.

If you're running LLM applications in production, provider failures will happen. Network timeouts, rate limits, model maintenance: all of these will break your application unless you have failover in place.

Most teams handle failures by showing error messages to users or retrying manually. Both approaches mean downtime.

Here's how to set up automatic failover in Bifrost so your application stays online when providers go down.

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Common Failure Scenarios

LLM API calls fail for several reasons:

Rate limiting (HTTP 429): You've hit your quota limit. Common during traffic spikes or when you're approaching your plan's ceiling.

Server errors (HTTP 500/502/503/504): The provider's backend is having issues. Could be an outage, maintenance, or infrastructure problems.

Network failures: Connection timeouts, DNS resolution failures, connection refused. Something between your gateway and the provider broke.

Model unavailability: The specific model you requested is offline for maintenance or has been deprecated.

Authentication failures: Invalid API key or expired token. Usually a configuration issue.

Without failover, all of these mean your users see errors. With failover, requests automatically route to backup providers.
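
To make the distinction concrete, here's a rough Python sketch (illustrative only, not Bifrost's internal code) of how you might bucket these failures into transient errors worth retrying and configuration errors that no amount of retrying will fix:

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # rate limits and server-side errors

def classify_failure(status_code=None, network_error=False):
    """Bucket a failed LLM call: transient (retry or fall back) vs. configuration (fix it)."""
    if network_error or status_code in RETRYABLE_STATUS:
        return "transient"       # worth retrying or routing to a fallback provider
    return "configuration"       # invalid key, deprecated model, etc.; retrying won't help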

Setting Up Basic Failover

Fallbacks - Bifrost (docs.getbifrost.ai)

Automatic failover between AI providers and models. When your primary provider fails, Bifrost seamlessly switches to backup providers without interrupting your application.

Add a fallbacks array to your request:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "anthropic/claude-3-5-sonnet-20241022",
      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
    ]
  }'

Bifrost tries providers in this order:

  1. OpenAI GPT-4o-mini (your primary choice)
  2. Anthropic Claude 3.5 Sonnet (first backup)
  3. AWS Bedrock Claude 3 Sonnet (second backup)

The first provider that succeeds returns the response. If all providers fail, you get the original error from the primary provider.
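
If you're calling the gateway from application code instead of curl, the same fallbacks field can be passed through the OpenAI SDK's extra_body parameter. A minimal sketch, assuming the official openai Python package and a gateway running on localhost (the API key value here is a placeholder; whether it's checked depends on your auth setup):

from openai import OpenAI

# Point the standard OpenAI client at the Bifrost gateway.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="placeholder")

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    # Bifrost-specific field, forwarded as-is in the request body.
    extra_body={
        "fallbacks": [
            "anthropic/claude-3-5-sonnet-20241022",
            "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        ]
    },
)
print(response.choices[0].message.content)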

How Failover Actually Works

When a request fails, here's the sequence:

  1. Primary attempt: Try your main provider
  2. Failure detection: Gateway catches the error
  3. Fallback trigger: Move to next provider in the list
  4. Complete re-execution: Treat the fallback as a brand new request
  5. Response or error: Return success or original error if everything failed

The key detail: each fallback is a completely new request. This means:

  • Semantic cache checks run again
  • Governance rules (budgets, rate limits) apply again
  • Logging happens again
  • All your configured plugins execute again

This ensures consistent behavior no matter which provider handles your request.

What You Get Back

Responses stay in OpenAI format regardless of which provider succeeded:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Response from whichever provider worked"
    }
  }],
  "extra_fields": {
    "provider": "anthropic"
  }
}

The extra_fields.provider tells you which provider actually handled the request. This is crucial for monitoring. You can track whether your primary provider is healthy or if you're consistently falling back to backups.

Retries vs Fallbacks — Understanding the Difference

Bifrost uses two separate mechanisms:

Retries: Same provider, multiple attempts. When a request fails with a retryable status code (500, 502, 503, 504, 429), Bifrost retries the same provider a few times before giving up.

Fallbacks: Different providers, sequential attempts. After all retries on one provider are exhausted, Bifrost moves to the next provider in your fallback list.

Example flow:

  • Try OpenAI (3 retry attempts)
  • OpenAI failed → Try Anthropic (3 retry attempts)
  • Anthropic failed → Try Bedrock (3 retry attempts)
  • All failed → Return error

This means a single user request might trigger 9 total API calls across 3 providers. Monitoring retry counts helps you spot systemic issues early.
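
In pseudocode, the combined loop looks roughly like this (a simplified sketch of the behavior described above; provider.send and RetryableError are stand-ins, not Bifrost APIs):

def call_with_failover(request, providers, max_attempts=3):
    """Simplified sketch: retry each provider, then fall back to the next one."""
    primary_error = None
    for provider in providers:                  # e.g. [openai, anthropic, bedrock]
        for _ in range(max_attempts):           # retries against the same provider
            try:
                return provider.send(request)   # first success wins
            except RetryableError as err:
                primary_error = primary_error or err
        # retries exhausted; move on to the next provider in the fallback list
    raise primary_error                         # everything failed: surface the primary's error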

Automatic Fallbacks with Virtual Keys

If you're using virtual keys with multiple providers configured, you don't even need to specify fallbacks in every request. Bifrost creates them automatically.

Set up your virtual key:

{
  "provider_configs": [
    {
      "provider": "openai",
      "weight": 0.8
    },
    {
      "provider": "anthropic",
      "weight": 0.2
    }
  ]
}

Now requests using this virtual key automatically get fallbacks based on weight order:

  1. OpenAI (weight 0.8, primary)
  2. Anthropic (weight 0.2, backup)

You don't need to specify fallbacks in the request body. The gateway handles it.

If you do specify explicit fallbacks in the request, those take precedence over the automatic ones.
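
Conceptually, the automatic chain is just your configured providers sorted by weight, highest first. An illustrative snippet (not gateway code) of that ordering:

provider_configs = [
    {"provider": "openai", "weight": 0.8},
    {"provider": "anthropic", "weight": 0.2},
]

# Highest weight becomes the primary; the rest become fallbacks in descending order.
fallback_order = [
    cfg["provider"]
    for cfg in sorted(provider_configs, key=lambda cfg: cfg["weight"], reverse=True)
]
print(fallback_order)  # ['openai', 'anthropic']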

Cost-Optimized Failover

You can use failover to control costs. Route to cheaper models first, more expensive ones as backup:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "openai/gpt-3.5-turbo",
      "anthropic/claude-3-5-sonnet-20241022"
    ]
  }'

Execution order:

  1. GPT-4o-mini (moderate cost)
  2. GPT-3.5-turbo (cheap fallback)
  3. Claude 3.5 Sonnet (expensive fallback)

This keeps costs low under normal conditions. When cheaper options fail, you fall back to more expensive providers rather than showing errors to users.

The tradeoff: you're paying more during failures. Monitor your extra_fields.provider to see if you're consistently hitting expensive fallbacks.

Multi-Region Failover

If you're using Azure or other providers with regional deployments, route across regions for resilience:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "azure-us/gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "azure-eu/gpt-4o"
    ]
  }'

Regional outages won't take down your application. Latency might vary by region, but you stay online.

Monitoring Fallback Usage

Track these metrics to understand your failover patterns:

Fallback trigger rate: What percentage of requests need fallbacks? Should be low (<5%) under normal conditions. Spikes indicate provider issues.

Success rate by position: Are requests succeeding on primary, first fallback, or second fallback? If you're consistently using fallbacks, your primary provider has problems.

Latency per provider: Which providers are fast? Which are slow? This helps you order your fallback chain.

Cost per provider: How much are you spending on each provider? Fallbacks to expensive providers increase costs.

The gateway provides all this data through the extra_fields.provider field. You don't need custom instrumentation.
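
A minimal sketch of how you might aggregate these metrics on the client side, assuming you record the extra_fields.provider value from each response (the function names here are illustrative):

from collections import Counter

provider_counts = Counter()

def record(response_json):
    """Tally which provider actually served each request."""
    provider = response_json.get("extra_fields", {}).get("provider", "unknown")
    provider_counts[provider] += 1

def fallback_trigger_rate(primary="openai"):
    """Fraction of requests not served by the primary provider."""
    total = sum(provider_counts.values())
    return 0.0 if total == 0 else 1 - provider_counts[primary] / total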

Common Mistakes to Avoid

Mistake 1: Not monitoring which provider is being used. You think you're using OpenAI but you've been on Anthropic for three days because OpenAI is rate limiting you. Check extra_fields.provider.

Mistake 2: Too many fallbacks. Each fallback adds latency. Three fallbacks is plenty for most use cases. More than that and you're waiting 10+ seconds for all the retries to exhaust.

Mistake 3: Identical models in fallback chain. If you're using openai/gpt-4o as primary and azure/gpt-4o as fallback, you're just duplicating the same model. Use different models or providers for actual diversity.

Mistake 4: Not testing fallbacks. Set up your fallback chain, then manually trigger a failure to verify it works. Don't wait for a production outage to find out your fallbacks are misconfigured.
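
One low-effort drill: deliberately break the primary (for example, by requesting a model name that doesn't exist) and confirm the response still comes back through a fallback. A sketch using the Python requests library; the bogus model name and the exact failure behavior are assumptions, so adjust for your own configuration:

import requests

# Force a failure on the primary by requesting a model that doesn't exist,
# then check that the fallback actually served the response.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "openai/gpt-4o-mini-nonexistent",   # deliberate failure
        "messages": [{"role": "user", "content": "failover drill"}],
        "fallbacks": ["anthropic/claude-3-5-sonnet-20241022"],
    },
    timeout=60,
)
resp.raise_for_status()
served_by = resp.json().get("extra_fields", {}).get("provider")
assert served_by == "anthropic", f"expected the fallback to serve this, got {served_by}"
print("Fallback chain works; request served by:", served_by)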

Plugin Behavior During Fallbacks

If you're using plugins (semantic caching, governance, logging), they re-execute for every fallback attempt.

Semantic caching: A cache miss on OpenAI might hit on Anthropic if that provider has cached similar requests.

Governance rules: Budget and rate limit checks run again. A request blocked by OpenAI's budget might succeed through Anthropic if that budget has headroom.

Logging: Each fallback attempt generates a separate log entry. One user request that tries three providers creates three log records.

This ensures consistent behavior but increases operational overhead. Each fallback has the full cost of a complete request.

When NOT to Use Fallbacks

Streaming responses: Fallbacks add latency. If you're streaming tokens back to users, failures are immediately visible, and fallbacks might time out before completing.

Latency-critical applications: If your SLA requires sub-500ms responses, multiple fallback attempts will violate that. Use health checks and circuit breakers instead.

Debugging specific provider issues: If you're trying to reproduce a bug with OpenAI specifically, fallbacks will hide the problem by succeeding through Anthropic.

Advanced: Plugin Fallback Control

Plugins can prevent fallback execution using the AllowFallbacks field. This is useful when a failure applies to all providers.

Example: An authentication plugin detects an invalid virtual key. Setting AllowFallbacks=False returns the error immediately instead of wasting time trying fallbacks that will all fail for the same reason.

Documentation: https://docs.getbifrost.ai/features/fallbacks

What This Means for Your Application

LLM providers will go down. Rate limits will trigger. Models will go offline for maintenance.

With automatic failover, your application stays online. Users see responses instead of errors. You handle degraded provider performance gracefully instead of catastrophically.

And it all happens at the infrastructure level. Your application code doesn't change. No retry logic. No provider switching. The gateway manages it.

