Set up multi-provider routing and custom providers in Bifrost using virtual keys, weighted distributions, CEL-based rules, and request-type scoping.
Production AI workloads run into provider-side disruptions on a regular basis: regional outages, rate-limit rejections during traffic peaks, model deprecations, and tail latency that drifts in hard-to-predict ways. The remedy at the infrastructure layer is multi-provider routing, which sends a gpt-4o request to OpenAI, Azure OpenAI, or any other configured backend based on live conditions. Bifrost is an open-source AI gateway built around this pattern, with multi-provider routing layered across three coordinated stages and native support for custom provider definitions covering self-hosted models, OpenAI-compatible endpoints, and environment-scoped configurations. The source is on GitHub, and the Bifrost documentation walks through a complete setup in under five minutes.
Why Production AI Needs Multi-Provider Routing
Architectures that depend on a single LLM provider fail in well-understood ways. One regional outage knocks the application offline. A traffic spike burns through the API key's rate-limit budget. A competitor releases a stronger model, and migrating to it means SDK rewrites scattered across every service. Industry analyses of multi-provider LLM strategies describe the same recurring pressures: cost variability, reliability gaps, and the engineering tax of model-specific code paths.
The cleaner answer is to push the problem down to the infrastructure layer rather than solving it inside every service. With a gateway in the middle, retry logic, provider SDKs, and fallback rules live in one place. A single configuration change updates how the whole application speaks to its model providers, and engineering teams stop rebuilding failover code in every new microservice.
Bifrost delivers this through three routing mechanisms that can stand alone or combine into a stack: governance routing built on virtual keys, dynamic rules driven by request context, and adaptive load balancing across the configured providers. All three build on a shared model catalog that indexes every model available across the 20+ supported providers.
Inside Bifrost's Multi-Provider Routing Stack
Inside Bifrost, routing decisions follow a fixed evaluation order across three sequenced layers. Routing rules fire first and can override anything downstream. Governance rules come next, applying the weighted distributions configured on a virtual key. Load balancing then refines the final choice of provider and key based on live performance signals.
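The evaluation order can be pictured as a short pipeline. The sketch below is a simplified mental model, not Bifrost's internals. Routing rules are plain Python predicates standing in for CEL expressions. The governance and load-balancing layers are passed in as callables. The names `Rule`, `route`, `governance`, and `balancer` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    # Hypothetical stand-in for a CEL routing rule: a predicate plus a target.
    matches: Callable[[dict], bool]
    target: str  # "provider/model"


def route(
    request: dict,
    rules: List[Rule],
    governance_pick: Callable[[dict], str],
    load_balance: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (target, key) following rules -> governance -> load balancing."""
    target: Optional[str] = None
    # Layer 1: routing rules fire first; the first match overrides everything below.
    for rule in rules:
        if rule.matches(request):
            target = rule.target
            break
    # Layer 2: virtual-key governance applies only if no rule matched.
    if target is None:
        target = governance_pick(request)
    # Layer 3: load balancing refines the choice, here by picking a key for the target.
    return target, load_balance(target)


def governance(request: dict) -> str:
    return "azure/gpt-4o"  # pretend the weighted pick landed on Azure


def balancer(target: str) -> str:
    return target.split("/")[0] + "-key-1"  # pretend key selection


rules = [Rule(lambda r: r["headers"].get("x-tier") == "premium", "openai/gpt-4o")]

print(route({"headers": {"x-tier": "premium"}}, rules, governance, balancer))
# ('openai/gpt-4o', 'openai-key-1')
print(route({"headers": {}}, rules, governance, balancer))
# ('azure/gpt-4o', 'azure-key-1')
```

The point of the structure is that a matching rule short-circuits governance entirely. Load balancing always runs last, whichever layer chose the target.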
Two entities anchor the system:
- Virtual keys: governance objects that authenticate callers and bound which providers, models, and budgets they are allowed to reach.
- Provider configs: per-virtual-key entries that bind each caller to one or more upstream providers, with weights, model allowlists, and optional budget or rate-limit caps.
Multi-provider routing is expressed concretely inside the provider configs. Every entry declares which provider is in play, which models that provider is permitted to serve for this caller, and how much weight it carries during selection. The outcome is fine-grained control over traffic distribution across providers, achieved without touching application code.
Building Governance-Based Routing on Top of Virtual Keys
Of the three methods, governance routing is the most explicit. It operates through provider configs attached to a virtual key and is the right tool when an organization needs deterministic, configuration-defined control over traffic distribution.
A minimal configuration that splits traffic across two providers looks like this:
```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"],
      "weight": 0.3,
      "budget": {
        "max_limit": 100.0,
        "current_usage": 45.0
      }
    },
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.7,
      "rate_limit": {
        "token_max_limit": 100000,
        "token_reset_duration": "1m"
      }
    }
  ]
}
```
For each incoming request carrying this virtual key, Bifrost steps through the following sequence:
- Check that at least one configured provider is allowed to serve the requested model.
- Discard any provider that has already exceeded its budget or rate limit.
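- Run a weighted random pick across the surviving providers. With weights of 0.3 (OpenAI) and 0.7 (Azure), about 30% of eligible gpt-4o traffic goes to OpenAI and 70% to Azure. Requests for gpt-4o-mini always go to OpenAI, since only OpenAI allows that model.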
- Rewrite the model identifier into the provider/model form (for example, azure/gpt-4o).
- Construct a fallback chain by sorting the remaining providers from highest to lowest weight.
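The five steps above can be sketched as a single selection function. This is a simplified model under stated assumptions. The `over_limit` flag is a hypothetical stand-in for Bifrost's budget and rate-limit tracking. The wildcard is treated as "allow everything" without a catalog check. `ProviderConfig` and `select` are illustrative names, not Bifrost APIs.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProviderConfig:
    provider: str
    allowed_models: List[str]
    weight: float
    over_limit: bool = False  # hypothetical: budget or rate limit already exhausted


def select(configs: List[ProviderConfig], model: str, rng=random) -> Tuple[str, List[str]]:
    """Return (target, fallback_chain) for one request (simplified sketch)."""
    # 1. Keep providers allowed to serve the model ("*" treated as allow-all here).
    eligible = [c for c in configs if "*" in c.allowed_models or model in c.allowed_models]
    if not eligible:
        raise LookupError(f"no provider is allowed to serve {model}")
    # 2. Drop providers that are over their budget or rate limit.
    eligible = [c for c in eligible if not c.over_limit]
    if not eligible:
        raise LookupError(f"every provider for {model} is over its limits")
    # 3. Weighted random pick among the survivors.
    chosen = rng.choices(eligible, weights=[c.weight for c in eligible], k=1)[0]
    # 4. Rewrite the model identifier into provider/model form.
    target = f"{chosen.provider}/{model}"
    # 5. Fallback chain: the other survivors, highest weight first.
    rest = sorted((c for c in eligible if c is not chosen), key=lambda c: c.weight, reverse=True)
    return target, [f"{c.provider}/{model}" for c in rest]


configs = [
    ProviderConfig("openai", ["gpt-4o", "gpt-4o-mini"], 0.3),
    ProviderConfig("azure", ["gpt-4o"], 0.7),
]
print(select(configs, "gpt-4o-mini"))  # ('openai/gpt-4o-mini', [])

rng = random.Random(42)
picks = [select(configs, "gpt-4o", rng)[0] for _ in range(10_000)]
print(picks.count("azure/gpt-4o") / len(picks))  # close to 0.7
```

Over many gpt-4o requests, roughly 70% land on Azure, and each chosen target carries the other provider as its single fallback.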
What each provider is allowed to serve is governed by allowed_models. The ["*"] value grants every model the provider supports, validated against the model catalog. An explicit list constrains the provider to exactly those entries, while an empty list ([]) blocks every model for that provider.
That last behavior is deliberate and reflects a deny-by-default stance, which matters for enterprise governance settings where an unconfigured provider must never serve traffic by accident.
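The three allowed_models cases can be expressed as one small check. This sketch assumes a deliberately tiny, invented catalog that does not reflect real provider availability. It treats only the exact value ["*"] as the wildcard. `CATALOG` and `is_allowed` are hypothetical names.

```python
from typing import Dict, List, Set

# Invented catalog for illustration only: which models each provider "supports".
CATALOG: Dict[str, Set[str]] = {
    "openai": {"gpt-4o", "gpt-4o-mini"},
    "azure": {"gpt-4o"},
}


def is_allowed(provider: str, allowed_models: List[str], model: str) -> bool:
    if allowed_models == ["*"]:
        # Wildcard: any model the provider supports, validated against the catalog.
        return model in CATALOG.get(provider, set())
    # Explicit list: exactly these models; an empty list matches nothing (deny by default).
    return model in allowed_models


print(is_allowed("azure", ["*"], "gpt-4o"))       # True
print(is_allowed("azure", ["*"], "gpt-4o-mini"))  # False (absent from this invented catalog)
print(is_allowed("openai", [], "gpt-4o"))         # False
```

Because the empty list falls through to a membership test on nothing, "no models" needs no special case. Deny-by-default is simply what an empty allowlist means.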
Layering Routing Rules for Dynamic Conditions
Static weights handle the bulk of production traffic, but real workloads also need context-aware decisions: the caller's tier, the requesting team, current spend against budget, or a specific HTTP header. Routing rules cover this surface through Common Expression Language (CEL) expressions, evaluated before governance rules engage.
Common rule expressions look like:
- headers["x-tier"] == "premium" sends premium callers to a specific provider and model.
- budget_used > 85 redirects traffic to a cheaper backend once monthly spend approaches the cap.
- team_name == "ml-research" routes research workloads to a different model than the production fleet.
- headers["x-environment"] == "production" && tokens_used < 75 combines multiple conditions for more sophisticated routing.
Rules evaluate by scope precedence: virtual key first, then team, then customer, then global. Inside a scope, lower priority numbers go first. Whichever rule matches first determines the routing decision; the remaining rules are skipped. When no rule fires, control falls through to governance routing.
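Scope precedence, priority ordering, and first-match-wins can be modeled as a sort followed by a linear scan. In this sketch, Python lambdas stand in for compiled CEL expressions. The rule names, targets, and the `RoutingRule`/`first_match` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Lower number = evaluated earlier.
SCOPE_ORDER = {"virtual_key": 0, "team": 1, "customer": 2, "global": 3}


@dataclass
class RoutingRule:
    name: str
    scope: str      # one of SCOPE_ORDER's keys
    priority: int   # lower runs first within a scope
    matches: Callable[[dict], bool]  # stand-in for a CEL expression
    target: str


def first_match(rules: List[RoutingRule], ctx: dict) -> Optional[RoutingRule]:
    ordered = sorted(rules, key=lambda r: (SCOPE_ORDER[r.scope], r.priority))
    for rule in ordered:
        if rule.matches(ctx):
            return rule  # first match wins; remaining rules are skipped
    return None  # no match: fall through to governance routing


rules = [
    RoutingRule("over-budget", "global", 1, lambda c: c["budget_used"] > 85, "openai/gpt-4o-mini"),
    RoutingRule("research", "team", 1, lambda c: c["team_name"] == "ml-research", "azure/gpt-4o"),
    RoutingRule("premium", "virtual_key", 5,
                lambda c: c["headers"].get("x-tier") == "premium", "openai/gpt-4o"),
]

ctx = {"headers": {"x-tier": "premium"}, "budget_used": 90, "team_name": "ml-research"}
print(first_match(rules, ctx).name)  # premium
```

All three rules match this context. The virtual-key rule still wins because scope is compared before priority, even though its priority number is the largest.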
This separation is intentional. Routing rules are not a substitute for governance; they sit above it as an override layer. A single virtual key can carry static provider weights alongside a handful of CEL overrides for edge cases, and the two coexist without conflict.
Building Custom Providers for Bifrost Deployments
Custom providers push Bifrost beyond its 20+ built-in integrations. The reasons to define one fall into three buckets: connecting to an OpenAI-compatible endpoint that ships outside the built-in list, spinning up multiple scoped instances of the same base provider, or routing distinct request types to distinct underlying endpoints.
Configuration happens through the custom_provider_config field. The block declares a base provider type (the API contract Bifrost will speak), an optional list of allowed request types, and optional path overrides for individual endpoints.
Here is an example for an OpenAI-compatible internal endpoint locked down to chat completions:
```json
{
  "providers": {
    "internal-llm": {
      "keys": [
        {
          "name": "internal-llm-key-1",
          "value": "env.INTERNAL_API_KEY",
          "models": ["*"],
          "weight": 1.0
        }
      ],
      "network_config": {
        "base_url": "https://internal-llm.example.com"
      },
      "custom_provider_config": {
        "base_provider_type": "openai",
        "allowed_requests": {
          "chat_completion": true,
          "chat_completion_stream": true
        },
        "request_path_overrides": {
          "chat_completion": "/api/v2/chat",
          "chat_completion_stream": "/api/v2/chat"
        }
      }
    }
  }
}
```
The allowed_requests block is not documentation; it is enforcement. Only request types flagged true are accepted, and any other operation gets back an access-control error. That makes it practical to expose an embeddings-only OpenAI instance to an analytics team while routing a customer-facing app through a separate chat-only instance, both pointing at the same upstream account but bounded by what their scope permits.
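The enforcement and path-override behavior for the internal-llm example can be sketched as a URL resolver. Several parts are assumptions. `AccessDenied` is a hypothetical stand-in for Bifrost's access-control error. `DEFAULT_PATHS` assumes OpenAI-style default paths. The request-type name "embedding" is illustrative.

```python
class AccessDenied(Exception):
    """Hypothetical stand-in for Bifrost's access-control error."""


CUSTOM_PROVIDER = {
    "base_url": "https://internal-llm.example.com",
    "allowed_requests": {"chat_completion": True, "chat_completion_stream": True},
    "request_path_overrides": {
        "chat_completion": "/api/v2/chat",
        "chat_completion_stream": "/api/v2/chat",
    },
}

# Assumed OpenAI-style defaults, used when no override is configured.
DEFAULT_PATHS = {
    "chat_completion": "/v1/chat/completions",
    "chat_completion_stream": "/v1/chat/completions",
    "embedding": "/v1/embeddings",
}


def resolve_url(cfg: dict, request_type: str) -> str:
    # Enforcement: only request types explicitly flagged true get through.
    if not cfg["allowed_requests"].get(request_type, False):
        raise AccessDenied(f"{request_type} is not allowed for this provider")
    # Path override if present, otherwise the base provider's default path.
    path = cfg["request_path_overrides"].get(request_type, DEFAULT_PATHS[request_type])
    return cfg["base_url"].rstrip("/") + path


print(resolve_url(CUSTOM_PROVIDER, "chat_completion"))
# https://internal-llm.example.com/api/v2/chat
```

Calling `resolve_url(CUSTOM_PROVIDER, "embedding")` raises `AccessDenied` before any path is looked up. This mirrors how a chat-only instance would reject an embeddings call.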
Bifrost supports custom providers built on these base types:
- openai (and any endpoint that speaks the OpenAI API)
- anthropic
- bedrock (AWS Bedrock)
- cohere
- gemini
- replicate
Air-gapped and self-hosted deployments also need TLS controls. Custom providers accept either a ca_cert_pem to trust a private CA, or insecure_skip_verify for trusted internal networks. With these options in place, the gateway can communicate with internal endpoints that do not present publicly trusted certificates, a common requirement for in-VPC and on-prem deployments in regulated industries.
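To make the two TLS options concrete, here is what they mean in terms of Python's standard ssl module. This only illustrates the semantics and is not how Bifrost implements them internally. The function name `make_tls_context` is hypothetical; its parameters mirror the config fields ca_cert_pem and insecure_skip_verify.

```python
import ssl
from typing import Optional


def make_tls_context(ca_cert_pem: Optional[str] = None,
                     insecure_skip_verify: bool = False) -> ssl.SSLContext:
    """Build a client TLS context mirroring the two options (illustrative sketch)."""
    ctx = ssl.create_default_context()  # verifies certs and hostnames by default
    if insecure_skip_verify:
        # Trusted internal networks only: disable hostname and certificate checks.
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    elif ca_cert_pem:
        # Trust a private CA (PEM text) in addition to the system roots.
        ctx.load_verify_locations(cadata=ca_cert_pem)
    return ctx


strict = make_tls_context()
print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)  # True True
```

The private-CA path keeps full verification and only widens the set of trusted roots. That is why it is the safer choice whenever the internal CA certificate is available.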
Configuration Patterns That Recur in Production
A handful of patterns show up repeatedly across Bifrost deployments:
- Environment separation: register openai-dev, openai-staging, and openai-prod as distinct custom providers, each scoped through allowed_requests. Hand out dev-only virtual keys that can only reach openai-dev, and the access boundary becomes a structural property of the configuration rather than a convention enforced by review.
- Cost-aware fallback: inside provider_configs, give the primary provider a 0.9 weight and let a cheaper secondary backend carry the remaining 0.1. Pair this with a routing rule that flips 100% of traffic to the cheaper backend once budget_used > 85, and cost behavior shifts automatically as spend climbs.
- Cross-provider failover: set up two providers backing the same model (for instance, both openai and azure exposing gpt-4o). When the primary returns an error, Bifrost's automatic fallback chain retries against the next provider without any application-side code change.
- Regional routing: for data residency, define a custom provider per region and pin TLS to the internal certificates each region issues. A routing rule like headers["x-region"] == "eu" then steers EU traffic to EU-hosted models, which helps meet data-locality requirements (for example, those driven by GDPR) without any application logic.
- Tool-scoped MCP traffic: combine custom providers with Bifrost's MCP gateway so that each consumer is locked to its own model and tool combination.
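The cost-aware fallback pattern combines both routing layers. The rule overrides the weights once spend crosses the threshold. This sketch assumes budget_used is a percentage of the cap, and the provider names are hypothetical placeholders.

```python
import random

PRIMARY, CHEAP = "primary-provider", "cheap-provider"  # hypothetical names


def pick_provider(budget_used: float, rng=random) -> str:
    # Routing-rule layer: once budget_used > 85, all traffic goes to the cheaper backend.
    if budget_used > 85:
        return CHEAP
    # Governance layer: otherwise a 0.9 / 0.1 weighted split.
    return rng.choices([PRIMARY, CHEAP], weights=[0.9, 0.1], k=1)[0]


rng = random.Random(0)
normal = [pick_provider(40, rng) for _ in range(10_000)]
print(normal.count(PRIMARY) / len(normal))           # roughly 0.9
print({pick_provider(92, rng) for _ in range(100)})  # {'cheap-provider'}
```

Below the threshold the weights decide. Above it, the rule short-circuits governance, matching the evaluation order described earlier.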
All of these patterns are configuration-only at heart. No application code edits, no SDK substitutions, no per-service retry plumbing. The gateway becomes the single surface where multi-provider strategy is declared and enforced.
Running Bifrost in Production Environments
Because routing decisions happen at runtime, observability into how those decisions get made is just as important as the configuration itself. Bifrost surfaces which provider was selected, which key was used, and which routing rule (if any) fired, both inside the dashboard and through Prometheus metrics. That same telemetry flows into OpenTelemetry-compliant traces, so teams with distributed tracing already in place can correlate routing decisions with downstream latency and error rates.
Teams in the midst of vendor evaluation can step through the LLM Gateway Buyer's Guide for a structured capability matrix that spans routing, governance, observability, and deployment. The governance resource hub goes deeper on virtual key design and cost-attribution strategy.
Start Using Bifrost for Multi-Provider Routing
With multi-provider routing in place, LLM infrastructure stops being a single point of failure and becomes a resilient, controllable layer. Bifrost combines governance routing, dynamic rules, and custom providers into a single configuration surface that platform teams can drive without touching application code. To explore how Bifrost can streamline your multi-provider AI setup, book a demo with the Bifrost team.