You're managing multiple LLM providers: OpenAI for production, Anthropic for experimentation, AWS Bedrock for compliance. Each provider has different API formats, rate limits, and pricing. Your application needs automatic failover when providers go down, intelligent routing to optimize costs, and load balancing across API keys to prevent throttling.
This is LLM orchestration: coordinating requests across multiple providers, models, and API keys with routing logic, failover strategies, and load balancing—all without cluttering your application code.
Bifrost provides comprehensive LLM orchestration through a single gateway layer, delivering sub-3ms latency while handling routing, automatic failover, adaptive load balancing, and semantic caching.
maximhq/bifrost
Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
Bifrost AI Gateway
The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring, and analytics.
…
What is LLM Orchestration?
LLM orchestration manages the complexity of multi-provider AI infrastructure:
Routing: Direct requests to specific providers, models, or API keys based on rules
Load balancing: Distribute traffic across multiple endpoints to prevent rate limiting
Failover: Automatically retry failed requests with alternative providers
Caching: Reduce redundant API calls through intelligent response caching
Governance: Enforce budgets, rate limits, and access controls per team/customer
Without orchestration, each application manages provider connections independently—leading to duplicated logic, inconsistent policies, and operational complexity.
Bifrost's Orchestration Architecture
Performance: 11µs overhead at 5,000 RPS (50x faster than Python alternatives)
Unified Interface: OpenAI-compatible API for 12+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, Azure, Cohere, Mistral, Ollama, Groq, Cerebras)
Zero Configuration: Start in seconds with dynamic provider configuration
Drop-in Replacement: Change one line of code to route through Bifrost
Get Started
Installation:
npx -y @maximhq/bifrost
Documentation: https://getmax.im/bifrostdocs
GitHub: https://git.new/bifrost
Key Resources:
- Routing documentation: https://getmax.im/bifrostdocs (search "routing")
- Virtual keys guide: https://getmax.im/bifrostdocs (search "virtual keys")
- Governance features: https://getmax.im/bifrostdocs (search "governance")
Weighted Load Balancing
Distribute traffic across providers based on configurable weights.
Use Case: Route 80% of traffic to Azure OpenAI (cheaper with enterprise agreement), 20% to OpenAI directly (for availability).
Configuration:
{
  "virtual_key": "vk-prod-main",
  "provider_configs": [
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.8
    },
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"],
      "weight": 0.2
    }
  ]
}
Behavior:
- For gpt-4o: 80% Azure, 20% OpenAI (both providers support it)
- For gpt-4o-mini: 100% OpenAI (only provider that supports it)
- Weights are automatically normalized based on the providers available for each model
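The normalization described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not Bifrost's implementation; the function names `normalized_weights` and `pick_provider` are hypothetical, and the config simply mirrors the virtual key shown earlier.

```python
import random

# Mirrors the provider_configs of the vk-prod-main virtual key.
PROVIDER_CONFIGS = [
    {"provider": "azure", "allowed_models": ["gpt-4o"], "weight": 0.8},
    {"provider": "openai", "allowed_models": ["gpt-4o", "gpt-4o-mini"], "weight": 0.2},
]


def normalized_weights(provider_configs, model):
    """Return {provider: share} for providers that allow `model`; shares sum to 1."""
    eligible = {
        cfg["provider"]: cfg["weight"]
        for cfg in provider_configs
        if model in cfg["allowed_models"]
    }
    total = sum(eligible.values())
    if total == 0:
        raise ValueError(f"no weighted provider available for model {model!r}")
    return {provider: weight / total for provider, weight in eligible.items()}


def pick_provider(provider_configs, model, rng=random):
    """Weighted random choice among the providers eligible for `model`."""
    shares = normalized_weights(provider_configs, model)
    providers = list(shares)
    return rng.choices(providers, weights=[shares[p] for p in providers], k=1)[0]


print(normalized_weights(PROVIDER_CONFIGS, "gpt-4o"))
print(normalized_weights(PROVIDER_CONFIGS, "gpt-4o-mini"))
```

For gpt-4o both providers stay in the pool with their configured shares; for gpt-4o-mini only OpenAI is eligible, so its share becomes 1 regardless of the configured 0.2.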
Request (triggers load balancing):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: vk-prod-main" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Bypass load balancing (target specific provider):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: vk-prod-main" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
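The same two requests can be built from Python with only the standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, and the block only constructs the requests so it runs without a gateway. The URL, the `x-bf-vk` header, and the model names are taken from the curl examples.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(model, prompt, virtual_key):
    """Build (but do not send) a chat completion request for the gateway."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )


# No provider prefix: the gateway is free to apply weighted load balancing.
balanced = build_chat_request("gpt-4o", "Hello!", "vk-prod-main")
# Provider prefix: the request is pinned to OpenAI.
pinned = build_chat_request("openai/gpt-4o", "Hello!", "vk-prod-main")

# To actually send a request (requires a running gateway):
# with urllib.request.urlopen(balanced) as resp:
#     print(json.load(resp))
```

The only difference between the two calls is the model string, which is what decides whether load balancing applies.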
Automatic Failover
When multiple providers are configured, Bifrost automatically creates fallback chains for resilience.
How It Works:
- Activated when your request has no existing fallbacks array
- Providers sorted by weight (highest first) and added as fallbacks
- Respects manually specified fallbacks
Example Request Flow:
- The primary request goes to the provider picked by weighted selection (Azure in about 80% of cases)
- If Azure fails, Bifrost automatically retries with OpenAI
- Retries continue until a request succeeds or all providers are exhausted
Automatic fallbacks (no fallbacks in request):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: vk-prod-main" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Manual fallbacks (preserves your specification):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: vk-prod-main" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "fallbacks": ["anthropic/claude-3-5-sonnet-20241022"]
  }'
Result: Transparent failover without application code changes. If Azure experiences outages, traffic automatically shifts to OpenAI.
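To make the fallback logic concrete, here is a rough Python simulation of the behavior described in this section. It is a sketch under assumptions, not Bifrost's code: `build_fallback_chain`, `call_with_failover`, and `simulated_send` are hypothetical names, and the provider call is replaced by a stub in which Azure is always down.

```python
# Same provider_configs as the vk-prod-main virtual key.
PROVIDER_CONFIGS = [
    {"provider": "azure", "allowed_models": ["gpt-4o"], "weight": 0.8},
    {"provider": "openai", "allowed_models": ["gpt-4o", "gpt-4o-mini"], "weight": 0.2},
]


def build_fallback_chain(provider_configs, model, primary, requested_fallbacks=None):
    """Return the ordered "provider/model" targets to try after `primary`."""
    if requested_fallbacks:
        # Manually specified fallbacks are respected as-is.
        return list(requested_fallbacks)
    eligible = [
        cfg for cfg in provider_configs
        if model in cfg["allowed_models"] and cfg["provider"] != primary
    ]
    # Highest weight first.
    ranked = sorted(eligible, key=lambda cfg: cfg["weight"], reverse=True)
    return [f"{cfg['provider']}/{model}" for cfg in ranked]


def call_with_failover(primary_target, fallbacks, send):
    """Try the primary, then each fallback; return (target, response) on first success."""
    errors = []
    for target in [primary_target, *fallbacks]:
        try:
            return target, send(target)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{target}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def simulated_send(target):
    """Stand-in for a real provider call: Azure is down, everything else works."""
    if target.startswith("azure/"):
        raise ConnectionError("simulated outage")
    return f"response from {target}"


fallbacks = build_fallback_chain(PROVIDER_CONFIGS, "gpt-4o", primary="azure")
served_by, response = call_with_failover("azure/gpt-4o", fallbacks, simulated_send)
print(served_by)  # openai/gpt-4o
```

The caller sees one successful response even though the first attempt failed, which is the "transparent" part of the failover.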
Adaptive Load Balancing
Beyond simple weighted routing, Bifrost implements adaptive load balancing based on real-time metrics:
Metrics Tracked:
- Latency measurements per provider
- Error rates and success patterns
- Throughput limits and current load
- Provider health status
Adaptive Behavior:
- Detect provider throttling or failures
- Route requests to healthy alternatives automatically
- Monitor key health and respect rate limits
- Balance load intelligently to prevent quota exhaustion
Intelligent Key Distribution:
Distribute requests across multiple API keys from the same provider to maximize throughput:
{
  "provider": "openai",
  "api_keys": [
    {"key": "sk-key1", "weight": 0.5},
    {"key": "sk-key2", "weight": 0.5}
  ]
}
Bifrost monitors key usage, rotates requests to balance load, and adapts routing automatically—all without manual intervention.
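The article does not specify the adaptive algorithm, so the following is only one plausible illustration of health-aware key weighting: each key's configured weight is scaled by its recent success rate, and keys that fail too often are taken out of rotation. The function name `effective_weights` and the `max_error_rate` cutoff are assumptions made for this sketch.

```python
def effective_weights(keys, error_rates, max_error_rate=0.5):
    """Scale each key's configured weight by its recent success rate.

    keys: list of {"key": str, "weight": float}, as in the config above.
    error_rates: {key: fraction of recent requests that failed}.
    Keys at or above max_error_rate are dropped from rotation entirely.
    """
    scaled = {}
    for entry in keys:
        rate = error_rates.get(entry["key"], 0.0)
        if rate >= max_error_rate:
            continue  # treat the key as unhealthy for now
        scaled[entry["key"]] = entry["weight"] * (1.0 - rate)
    total = sum(scaled.values())
    if total == 0:
        raise RuntimeError("no healthy API keys available")
    return {key: weight / total for key, weight in scaled.items()}


API_KEYS = [
    {"key": "sk-key1", "weight": 0.5},
    {"key": "sk-key2", "weight": 0.5},
]

print(effective_weights(API_KEYS, {}))                # both keys healthy
print(effective_weights(API_KEYS, {"sk-key1": 0.2}))  # sk-key1 gets a smaller share
print(effective_weights(API_KEYS, {"sk-key1": 0.9}))  # sk-key1 removed from rotation
```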
Task-Based Routing
Route different request types to appropriate models based on complexity.
Strategy: Short queries use economy models (GPT-4o-mini), complex multi-part requests use premium models (GPT-4o).
Implementation via Virtual Keys:
Economy Virtual Key (for free-tier users):
{
  "virtual_key": "vk-free-tier",
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o-mini"],
      "budget": {"max_limit": 10, "reset_duration": "1d"}
    }
  ]
}
Premium Virtual Key (for paid users):
{
  "virtual_key": "vk-premium",
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"]
    },
    {
      "provider": "anthropic",
      "allowed_models": ["claude-3-5-sonnet-20241022"],
      "weight": 0.3
    }
  ]
}
Application Code:
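from openai import OpenAI

# Free-tier user: limited to the models allowed by vk-free-tier
client_free = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-free-tier"
)
# Premium user: can reach every model allowed by vk-premium
client_premium = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-premium"
)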
Different user tiers automatically route to appropriate models without application logic.
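In practice the application still has to decide which virtual key a given user gets. A small lookup is usually enough, as in this sketch. The tier names and the `client_kwargs_for` helper are hypothetical; only the base URL and the virtual key names come from the examples above.

```python
GATEWAY_BASE_URL = "http://localhost:8080/v1"

# Hypothetical tier names mapped to the virtual keys defined above.
VIRTUAL_KEY_BY_TIER = {
    "free": "vk-free-tier",
    "premium": "vk-premium",
}


def client_kwargs_for(tier):
    """Return keyword arguments for an OpenAI-compatible client for this tier."""
    try:
        virtual_key = VIRTUAL_KEY_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
    return {"base_url": GATEWAY_BASE_URL, "api_key": virtual_key}


# Usage with the OpenAI SDK (not imported here to keep the sketch dependency-free):
# client = OpenAI(**client_kwargs_for("free"))
print(client_kwargs_for("premium"))
```

Model access and budgets stay in the gateway configuration; the application only maps a user to a key.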
Cost-Optimized Failover Strategy
Use cheaper providers by default and automatically fail over to premium ones when budgets are exhausted.
Configuration:
{
  "virtual_key": "vk-cost-optimized",
  "provider_configs": [
    {
      "provider": "openai-cheap",
      "weight": 1.0,
      "budget": {"max_limit": 10, "reset_duration": "1d"}
    },
    {
      "provider": "openai-premium",
      "weight": 0.0,
      "budget": {"max_limit": 50, "reset_duration": "1d"},
      "rate_limit": {
        "request_max_limit": 100,
        "request_reset_duration": "1h"
      }
    }
  ]
}
Behavior:
- Primary: Use the cheap provider until its $10 daily budget is exhausted
- Fallback: Automatically switch to the premium provider when the cheap one is unavailable
- Cost containment: Cap premium usage at $50 per day and 100 requests per hour to prevent unexpected overspend
Important: For automatic failover to work, do not prefix the model name with a provider in the request body (send "gpt-4o", not "openai/gpt-4o").
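The budget-driven switch can be pictured as "first provider in priority order that still has budget left". The sketch below models only the daily budgets from the configuration above and ignores the hourly rate limit; `choose_provider` is a hypothetical helper, not a Bifrost API.

```python
BUDGETS = {"openai-cheap": 10.0, "openai-premium": 50.0}  # dollars per day
PRIORITY = ["openai-cheap", "openai-premium"]             # highest weight first


def choose_provider(spend, budgets=BUDGETS, priority=PRIORITY):
    """Return the first provider whose spend today is still under its budget."""
    for provider in priority:
        if spend.get(provider, 0.0) < budgets[provider]:
            return provider
    return None  # every budget is exhausted, so the request would be rejected


print(choose_provider({}))                                              # openai-cheap
print(choose_provider({"openai-cheap": 10.0}))                          # openai-premium
print(choose_provider({"openai-cheap": 10.0, "openai-premium": 50.0}))  # None
```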
Environment Separation
Separate virtual keys for development, testing, and production environments with different provider access.
Development Virtual Key:
{
  "virtual_key": "vk-dev",
  "provider_configs": [
    {
      "provider": "openai-dev-keys",
      "allowed_models": ["gpt-4o-mini"],
      "rate_limit": {"request_max_limit": 100, "request_reset_duration": "1h"}
    }
  ]
}
Production Virtual Key:
{
  "virtual_key": "vk-prod",
  "provider_configs": [
    {
      "provider": "openai-prod-keys",
      "allowed_models": ["gpt-4o"],
      "weight": 0.7
    },
    {
      "provider": "azure-prod-keys",
      "allowed_models": ["gpt-4o"],
      "weight": 0.3
    }
  ]
}
Different API keys, models, and providers per environment—enforced at infrastructure level.
Provider-Level Governance
Set specific spending limits and rate limits per AI provider.
Example:
{
  "virtual_key": "vk-multi-provider",
  "budget": {"max_limit": 100, "reset_duration": "1mo"},
  "provider_configs": [
    {
      "provider": "openai",
      "budget": {"max_limit": 50, "reset_duration": "1mo"},
      "rate_limit": {
        "request_max_limit": 1000,
        "request_reset_duration": "1h",
        "token_max_limit": 1000000,
        "token_reset_duration": "1h"
      }
    },
    {
      "provider": "anthropic",
      "budget": {"max_limit": 30, "reset_duration": "1mo"},
      "rate_limit": {
        "request_max_limit": 500,
        "request_reset_duration": "1h"
      }
    }
  ]
}
Behavior:
- Virtual key limited to $100/month total
- OpenAI: $50/month + 1000 req/hour + 1M tokens/hour
- Anthropic: $30/month + 500 req/hour
- If a provider's budget or rate limit is exhausted, requests to that provider are blocked
Benefits:
- Granular control per provider
- Automatic fallback when budgets exceeded
- Cost tracking by provider
- A/B testing with controlled budgets
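The two-level budget check (virtual key first, then each provider) might look roughly like this. It is a simplified sketch that tracks spend only and leaves out rate limits; the `Budget` class, the `allowed_providers` function, and the idea of checking an estimated cost before the call are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_limit: float   # dollars per reset period
    spent: float = 0.0

    def has_room(self, cost):
        return self.spent + cost <= self.max_limit


def allowed_providers(vk_budget, provider_budgets, estimated_cost):
    """Return the providers that may serve a request of the given estimated cost."""
    if not vk_budget.has_room(estimated_cost):
        return []  # the virtual key's overall budget blocks everything
    return [
        name for name, budget in provider_budgets.items()
        if budget.has_room(estimated_cost)
    ]


vk = Budget(max_limit=100, spent=70)
providers = {
    "openai": Budget(max_limit=50, spent=50),     # provider budget used up
    "anthropic": Budget(max_limit=30, spent=20),  # still has room
}
print(allowed_providers(vk, providers, estimated_cost=1.0))  # ['anthropic']
```

With OpenAI's budget spent, only Anthropic remains eligible, which is how per-provider limits turn into automatic fallback.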
Semantic Caching Integration
Bifrost's orchestration layer integrates semantic caching to reduce redundant API calls.
How It Works:
- Exact hash matching for identical requests
- Semantic similarity search for variations ("What are your hours?" = "When are you open?")
- Configurable threshold (0.8-0.95)
- TTL-based expiration
Configuration:
{
  "semantic_caching": {
    "enabled": true,
    "threshold": 0.85,
    "ttl": "5m",
    "conversation_history_threshold": 3
  }
}
Cost Impact: 40-60% reduction typical through intelligent caching.
Integration with Routing: Cached responses bypass provider routing entirely, delivering sub-millisecond response times.
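The lookup order described above (exact hash, then similarity search, with TTL expiry) can be sketched in plain Python. This is a toy model, not Bifrost's cache: the `SemanticCache` class is hypothetical, the embedding function is supplied by the caller, and the two-dimensional vectors in the usage section are made up purely so the example runs without an embedding model.

```python
import hashlib
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.85, ttl_seconds=300):
        self.embed = embed              # callable: prompt -> vector
        self.threshold = threshold      # minimum similarity for a semantic hit
        self.ttl_seconds = ttl_seconds  # 300 s corresponds to "ttl": "5m"
        self.entries = {}               # sha256(prompt) -> (vector, response, stored_at)

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt, response, now=None):
        now = time.time() if now is None else now
        self.entries[self._key(prompt)] = (self.embed(prompt), response, now)

    def get(self, prompt, now=None):
        now = time.time() if now is None else now
        # TTL-based expiration: drop entries older than ttl_seconds.
        self.entries = {
            k: v for k, v in self.entries.items() if now - v[2] < self.ttl_seconds
        }
        # 1) Exact hash match for identical requests.
        exact = self.entries.get(self._key(prompt))
        if exact is not None:
            return exact[1]
        # 2) Semantic similarity search over the remaining entries.
        query = self.embed(prompt)
        best = max(
            self.entries.values(),
            key=lambda entry: cosine(query, entry[0]),
            default=None,
        )
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None


# Made-up vectors standing in for a real embedding model.
TOY_VECTORS = {
    "What are your hours?": [1.0, 0.0],
    "When are you open?": [0.9, 0.1],
    "Cancel my order": [0.0, 1.0],
}
cache = SemanticCache(embed=lambda prompt: TOY_VECTORS[prompt])
cache.put("What are your hours?", "We are open 9 to 5.", now=0)

print(cache.get("When are you open?", now=1))      # semantic hit
print(cache.get("Cancel my order", now=1))         # None: too dissimilar
print(cache.get("What are your hours?", now=301))  # None: entry expired
```

A production cache would use a vector index instead of a linear scan, but the decision logic (threshold plus TTL) is the same.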
Unified Observability
Track orchestration performance across all providers:
Built-in Dashboard: Real-time logs showing routing decisions, failover events, provider health
Prometheus Metrics: Native metrics at /metrics for:
- Requests per provider
- Latency per provider
- Error rates and failover frequency
- Budget consumption per virtual key
OpenTelemetry Tracing: Distributed tracing shows complete request path:
- Initial provider selection (weighted routing)
- Failover attempts
- Cache hits/misses
- Final successful provider
Example Queries:
# Request distribution by provider
sum by (provider) (rate(bifrost_requests_total[5m]))
# Failover rate
sum by (source_provider, target_provider) (rate(bifrost_failover_total[5m]))
# Average latency by provider
avg by (provider) (bifrost_request_duration_seconds)
Setup: Zero-Config to Production
Install:
npx -y @maximhq/bifrost
# or
docker run -p 8080:8080 maximhq/bifrost
Configure Providers (Web UI at http://localhost:8080):
- Add provider API keys (OpenAI, Anthropic, Azure, etc.)
- Create virtual keys with routing rules
- Set weights, budgets, rate limits
Application Integration:
# Before (direct OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")
# After (through Bifrost orchestration)
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-prod-main"  # Virtual key with routing configured
)
# Same code, now with:
# - Automatic failover
# - Weighted load balancing
# - Semantic caching
# - Budget enforcement
# - Complete observability
Real-World Orchestration Patterns
Pattern 1: Cost Optimization
- 80% traffic to cheap provider
- 20% to premium for availability
- Semantic caching reduces overall volume 40-60%
- Result: Significant cost reduction without reliability loss
Pattern 2: High Availability
- Primary: Azure OpenAI (enterprise SLA)
- Fallback 1: OpenAI direct
- Fallback 2: Anthropic Claude
- Result: 99.99% uptime through multi-provider redundancy
Pattern 3: Multi-Tenant SaaS
- Free tier: GPT-4o-mini, $10/day budget
- Pro tier: GPT-4o, $50/day budget
- Enterprise: Claude + GPT-4o, custom budgets
- Result: Per-customer cost control and model access
Pattern 4: Development to Production
- Dev: GPT-4o-mini, rate limited, separate keys
- Staging: GPT-4o, moderate limits
- Prod: Multi-provider with failover, high limits
- Result: Environment isolation enforced at infrastructure
Performance Impact
Orchestration Overhead: 11µs at 5,000 RPS
Comparison:
- Direct provider call: Provider latency only
- Bifrost orchestration: Provider latency + 11µs
- LiteLLM: Provider latency + ~8ms (727x slower than Bifrost)
At scale (50 requests per interaction):
- Bifrost overhead: 50 × 11µs = 0.55ms
- LiteLLM overhead: 50 × 8ms = 400ms
Bifrost's orchestration is effectively free from a latency perspective.
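The arithmetic behind these figures is easy to reproduce. The snippet below only recomputes the numbers quoted in this section (11µs, roughly 8ms, 50 requests per interaction); it does not measure anything.

```python
BIFROST_OVERHEAD_US = 11       # per request, as quoted above
LITELLM_OVERHEAD_US = 8_000    # roughly 8 ms per request, as quoted above
REQUESTS_PER_INTERACTION = 50

bifrost_total_ms = REQUESTS_PER_INTERACTION * BIFROST_OVERHEAD_US / 1000
litellm_total_ms = REQUESTS_PER_INTERACTION * LITELLM_OVERHEAD_US / 1000
ratio = LITELLM_OVERHEAD_US / BIFROST_OVERHEAD_US

print(f"Bifrost: {bifrost_total_ms:.2f} ms")  # 0.55 ms
print(f"LiteLLM: {litellm_total_ms:.0f} ms")  # 400 ms
print(f"Ratio:   {ratio:.0f}x")               # 727x
```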
Key Takeaway: LLM orchestration consolidates routing, failover, load balancing, caching, and governance into a single infrastructure layer. Bifrost delivers comprehensive orchestration with 11µs overhead—enabling sophisticated multi-provider strategies without application complexity or performance degradation.
