Pranay Batta

Automatic Failover When Your Primary LLM Provider Goes Down

Provider outages happen. Network timeouts. Rate limits. Model unavailability. All of these cause request failures that break production applications.

The standard response is showing error messages to users or failing silently. Both degrade reliability.

Bifrost handles this with automatic fallback chains. When a primary provider fails, requests route to backup providers sequentially until one succeeds. This post explains how the system works.

GitHub: maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

What Triggers Fallbacks

The system attempts fallbacks for these failure types:

HTTP 429 (Rate Limiting): Quota limits exceeded at the provider level. Common during traffic spikes or when usage approaches tier limits.

HTTP 500/502/503/504: Server errors from the provider. Indicates backend issues, maintenance, or outages.

Network failures: Connection timeouts, DNS resolution failures, connection refused. Infrastructure-level problems between gateway and provider.

Model unavailability: Specific model offline for maintenance or deprecated. Provider returns model-specific errors.

Authentication failures: Invalid API keys or expired tokens. Configuration issues or credential rotation problems.

Each failure type gets handled the same way: try the next provider in the configured fallback list.
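
As an illustration of that classification step, here is a minimal Go sketch. It is not Bifrost's actual code, and the exact status codes providers return for model or auth problems vary, so treat the mapping as an assumption.

package main

import (
    "errors"
    "fmt"
    "net"
)

// shouldFallback reports whether a failed attempt should move on to the
// next provider in the fallback chain. Illustrative only.
func shouldFallback(statusCode int, err error) bool {
    // Infrastructure-level problems: timeouts, DNS failures, connection refused.
    var netErr net.Error
    if errors.As(err, &netErr) {
        return true
    }
    switch statusCode {
    case 429: // rate limited at the provider
        return true
    case 500, 502, 503, 504: // provider-side server errors or outages
        return true
    case 401, 403: // invalid API keys or expired tokens
        return true
    case 404: // model offline or deprecated (provider-specific)
        return true
    }
    return false
}

func main() {
    fmt.Println(shouldFallback(503, nil)) // true: try the next provider
}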

The Execution Flow

Request processing follows this sequence:

  1. Primary attempt: Execute against the primary provider
  2. Failure detection: Catch errors, check HTTP status codes
  3. Fallback trigger: Move to next provider in fallback list
  4. Plugin re-execution: Run all configured plugins for the new provider
  5. Response or error: Return successful response or original error if all providers exhausted

The key detail: each fallback is treated as a completely new request. Semantic caching checks run again. Governance rules apply again. Logging happens again. This ensures consistent behavior regardless of which provider handles the request.
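
The loop itself is simple. Below is a hedged Go sketch of the control flow, with stand-in types and a simulated provider call; it is not Bifrost's internal implementation.

package main

import (
    "errors"
    "fmt"
)

type request struct{ prompt string }
type response struct{ provider, text string }

// callProvider simulates a provider call; here the primary is "down".
func callProvider(provider string, req request) (response, error) {
    if provider == "openai" {
        return response{}, errors.New("503 service unavailable")
    }
    return response{provider: provider, text: "Hello back"}, nil
}

// completeWithFallbacks treats every attempt as a brand-new request:
// in the real gateway, plugins (caching, governance, logging) re-run
// for each provider before the call is made.
func completeWithFallbacks(req request, providers []string) (response, error) {
    var firstErr error
    for _, p := range providers {
        resp, err := callProvider(p, req)
        if err == nil {
            return resp, nil // first success wins
        }
        if firstErr == nil {
            firstErr = err // keep the original error for the caller
        }
    }
    return response{}, firstErr // all providers exhausted: surface the error
}

func main() {
    resp, err := completeWithFallbacks(
        request{prompt: "Hello"},
        []string{"openai", "anthropic", "bedrock"},
    )
    fmt.Println(resp, err)
}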

Configuration Format

Fallbacks are specified in the request payload:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "anthropic/claude-3-5-sonnet-20241022",
      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
    ]
  }'

Execution order:

  1. OpenAI GPT-4o-mini (primary)
  2. Anthropic Claude 3.5 Sonnet (first fallback)
  3. AWS Bedrock Claude 3 Sonnet (second fallback)

The response format remains OpenAI-compatible regardless of which provider actually handled the request:

{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Response text"
    }
  }],
  "extra_fields": {
    "provider": "anthropic"
  }
}

The extra_fields.provider field indicates which provider processed the request. This is essential for monitoring which providers are actually being used versus merely configured.
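
A client can read that field directly. The Go sketch below sends a request with fallbacks and prints which provider served it; the URL and field names mirror the examples above, and error handling is kept minimal.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    body := []byte(`{
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}],
        "fallbacks": ["anthropic/claude-3-5-sonnet-20241022"]
    }`)

    resp, err := http.Post("http://localhost:8080/v1/chat/completions",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Decode only the fields we need from the OpenAI-compatible response.
    var out struct {
        ExtraFields struct {
            Provider string `json:"provider"`
        } `json:"extra_fields"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }

    // "openai" means the primary handled it; anything else means a fallback fired.
    fmt.Println("served by:", out.ExtraFields.Provider)
}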

Retries vs Fallbacks

The system uses two separate mechanisms:

Retries (same provider): Configured at the provider level. When a request fails with a retryable status code (500, 502, 503, 504, 429), the system retries the same provider multiple times before attempting fallbacks. Documentation: https://docs.getbifrost.ai/quickstart/go-sdk/provider-configuration#managing-retries

Fallbacks (different providers): Activated after all retries are exhausted. Requests move to the next provider in the fallback chain. Documentation: https://docs.getbifrost.ai/features/fallbacks

The distinction matters for understanding request flow. A single user request might trigger:

  • 3 retry attempts on OpenAI
  • 3 retry attempts on Anthropic
  • 3 retry attempts on Bedrock
  • Final failure if all providers exhausted

This can add significant latency when multiple providers are down: at a 2-second timeout per attempt, nine failed attempts approach 18 seconds before the caller sees an error. Monitoring retry counts per provider helps identify systemic issues.

Plugin Execution on Fallbacks

Each fallback attempt re-executes all plugins. This means:

Semantic caching: Cache lookups run against each provider's cache separately. A cache miss on OpenAI might hit on Anthropic if that provider has seen similar requests.

Governance rules: Rate limits and budget controls apply per provider. A request blocked by OpenAI's budget might succeed through Anthropic if that provider's budget has headroom.

Logging: Each attempt generates separate log entries. A single user request that tries three providers creates three log records.

Monitoring: Metrics track attempts per provider. Fallback rate becomes a key reliability metric.

This design ensures consistent behavior but increases operational overhead. Each fallback attempt has the full cost of a complete request path.

Plugin Fallback Control

Plugins can prevent fallback execution using the AllowFallbacks field on errors. This provides fine-grained control over when fallbacks should be attempted.

Example use case: An authentication plugin detects a fundamental auth issue that would affect all providers (like an invalid virtual key). Setting AllowFallbacks=false returns the error immediately instead of wasting time attempting fallbacks that will all fail for the same reason.

Documentation: https://docs.getbifrost.ai/features/fallbacks
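
To make the idea concrete, here is a hedged Go sketch. The types are hypothetical placeholders, not Bifrost's actual plugin interface; the real error type exposes an AllowFallbacks field with the same intent (see the documentation link above).

package main

import "fmt"

// gatewayError is a placeholder for the error a plugin returns; the real
// Bifrost error carries an AllowFallbacks flag with the same purpose.
type gatewayError struct {
    Message        string
    AllowFallbacks bool
}

func (e *gatewayError) Error() string { return e.Message }

// authPlugin is a hypothetical pre-request hook.
func authPlugin(virtualKey string) *gatewayError {
    if virtualKey == "" {
        // The key is invalid for every provider, so trying fallbacks
        // would only add latency: fail fast instead.
        return &gatewayError{
            Message:        "invalid virtual key",
            AllowFallbacks: false,
        }
    }
    return nil
}

func main() {
    if err := authPlugin(""); err != nil && !err.AllowFallbacks {
        fmt.Println("returning immediately, skipping fallback chain:", err)
    }
}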

Automatic Fallback Creation with Virtual Keys

When multiple providers are configured on a virtual key, Bifrost automatically creates fallback chains without requiring explicit fallback arrays in requests.

The system sorts providers by weight (highest first) and builds the fallback list automatically:

{
  "provider_configs": [
    {
      "provider": "openai",
      "weight": 0.8
    },
    {
      "provider": "anthropic",
      "weight": 0.2
    }
  ]
}

Requests sent without explicit fallbacks get automatic ordering:

  1. OpenAI (weight 0.8, highest)
  2. Anthropic (weight 0.2, fallback)

If you specify explicit fallbacks in the request, automatic fallback creation is skipped. Your specified order takes precedence.

This matters for understanding request routing. A virtual key configuration changes the default behavior for all requests using that VK, while request-level fallbacks override that default.
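
The ordering logic amounts to a sort by weight. Here is a minimal Go sketch of that derivation, illustrative only, using the weights from the example above.

package main

import (
    "fmt"
    "sort"
)

// providerConfig mirrors the provider_configs entries on a virtual key.
type providerConfig struct {
    Provider string
    Weight   float64
}

// fallbackOrder sorts configured providers by weight, highest first,
// which is how the automatic fallback chain is derived.
func fallbackOrder(configs []providerConfig) []string {
    sort.Slice(configs, func(i, j int) bool {
        return configs[i].Weight > configs[j].Weight
    })
    order := make([]string, 0, len(configs))
    for _, c := range configs {
        order = append(order, c.Provider)
    }
    return order
}

func main() {
    configs := []providerConfig{
        {Provider: "openai", Weight: 0.8},
        {Provider: "anthropic", Weight: 0.2},
    }
    fmt.Println(fallbackOrder(configs)) // [openai anthropic]
}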

Monitoring Fallback Behavior

Key metrics to track:

Fallback trigger rate: Percentage of requests requiring fallbacks. Should be low (<5%) under normal conditions. Spikes indicate provider issues.

Success rate by provider position: Track whether primary, first fallback, or second fallback is handling requests. Persistent fallback usage indicates primary provider problems.

Latency by provider: Measure response time for each provider. Helps identify which providers are fast versus slow.

Cost per provider: Track spending across providers. Fallbacks to more expensive providers increase costs.

The extra_fields.provider in responses enables this monitoring. Applications don't need custom instrumentation — the gateway provides the data.

Implementation Patterns

Pattern 1: Cost-Optimized Fallback

Route to cheap providers first, expensive providers as backup:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "openai/gpt-3.5-turbo",
      "anthropic/claude-3-5-sonnet-20241022"
    ]
  }'

Execution: Try GPT-4o-mini → Try GPT-3.5-turbo → Try Claude 3.5 Sonnet

Cost increases with each fallback tier. Monitoring shows whether you're consistently hitting expensive fallbacks.

Pattern 2: Multi-Region Fallback

Route across geographic regions for resilience:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "azure-us/gpt-4o",
      "azure-eu/gpt-4o"
    ]
  }'

Execution: Try OpenAI (primary) → Try Azure US → Try Azure EU

Regional outages don't take down the entire application. Latency varies by region.

Pattern 3: Provider Diversity

Route across different providers for maximum resilience:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "fallbacks": [
      "anthropic/claude-3-5-sonnet-20241022",
      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
      "vertex/gemini-1.5-pro"
    ]
  }'

Execution: OpenAI → Anthropic → AWS Bedrock → Google Vertex

Provider-specific outages don't block requests. Response quality may vary by provider.

What We Got Wrong Initially

Mistake 1: No retry/fallback distinction. Original implementation treated retries and fallbacks the same. This caused excessive retries across all providers even for non-retryable errors. We separated the mechanisms.

Mistake 2: Plugin execution happened once. Early version ran plugins only for the primary request, not fallbacks. This broke governance rules and caching for fallback providers. We changed to re-execute plugins on every fallback.

Mistake 3: Insufficient error context. Initial errors didn't indicate which providers were tried. Debugging failures required checking logs. We added provider tracking to error responses.

Mistake 4: No plugin fallback control. Plugins couldn't prevent fallbacks even when it made no sense to try them. We added AllowFallbacks to give plugins control.

Limitations

Increased latency: Fallback attempts add latency. A request trying three providers with 2-second timeouts each can take 6+ seconds total.

Cost increases: Fallback providers might be more expensive. Budget-optimized routing gets bypassed during failures.

Response variation: Different providers return different response formats and quality. Applications expecting consistent behavior need additional logic.

Debugging complexity: Tracing requests through multiple providers requires correlating logs across fallback attempts.

These tradeoffs are worth it for reliability, but they're real constraints to design around.

Why This Matters

LLM provider reliability is not optional for production applications. Outages happen. Rate limits trigger. Models go offline.

Manual failover means downtime. Automatic fallback means degraded performance instead of complete failure.

The system handles this at the infrastructure level. Application code stays clean. No retry logic. No provider switching. The gateway manages it.


Bifrost fallback documentation: https://docs.getbifrost.ai/features/fallbacks
