If you are running multiple LLM providers in production, routing logic becomes a critical infrastructure decision. Send everything to one provider and you have a single point of failure. Hardcode routing rules and you lose flexibility when latency spikes or rate limits hit.
I spent the last few weeks evaluating five AI gateways specifically for their dynamic routing capabilities. The criteria: latency overhead, failover behaviour, weighted distribution, and how much config it takes to get routing working in production.
The short version: Bifrost came out on top for raw performance and routing flexibility. 11 microseconds of latency overhead, written in Go, with weighted routing and automatic failover built in. You can run it right now with `npx -y @maximhq/bifrost`. Full docs here.
Why dynamic routing matters
Static routing is fine for prototypes. Pick a model, call the API, ship it.
Production is different. You need:
- Failover: When OpenAI returns 429s or 500s, traffic should automatically shift to Anthropic or another provider. No manual intervention.
- Weighted distribution: Split traffic 70/30 across providers for cost optimization or A/B testing model quality.
- Latency-based routing: Send requests to whichever provider responds fastest at that moment.
- Budget-aware routing: Stop sending traffic to a provider when your spend cap is hit.
The gateway layer is the right place to handle this. Application code should not care which provider serves a request.
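The core idea behind weighted routing with failover fits in a few lines. Here is an illustrative Python sketch, not any gateway's actual implementation; the provider names, weights, and health flags are placeholders:

```python
import random

# Illustrative provider pool; names and weights are placeholders,
# not tied to any real gateway's configuration schema.
PROVIDERS = [
    {"name": "openai", "weight": 70, "healthy": True},
    {"name": "anthropic", "weight": 30, "healthy": True},
]

def pick_provider(providers):
    """Weighted random choice among the currently healthy providers."""
    healthy = [p for p in providers if p["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy providers")
    weights = [p["weight"] for p in healthy]
    return random.choices(healthy, weights=weights, k=1)[0]

def route_with_failover(providers, send, max_retries=2):
    """Try the weighted pick; on failure, mark it unhealthy and retry."""
    for _ in range(max_retries + 1):
        provider = pick_provider(providers)
        try:
            return send(provider["name"])
        except Exception:
            provider["healthy"] = False  # real gateways use health checks / cooldowns
    raise RuntimeError("all providers failed")
```

A real gateway layers health probes, cooldown windows, and per-request budgets on top of this, but the selection loop is the heart of it.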
The five gateways I tested
1. Bifrost
Language: Go | Overhead: 11 microseconds | Throughput: 5,000 RPS sustained
Bifrost is the fastest gateway I have tested. The 11 microsecond overhead is not a typo; it is orders of magnitude below Python-based alternatives like LiteLLM, which adds around 8ms per request.
Routing configuration is declarative and clean. Here is what weighted routing across two providers looks like:
```yaml
# bifrost-config.yaml
providers:
  - name: openai-primary
    provider: openai
    model: gpt-4o
    weight: 70
    api_key: ${OPENAI_API_KEY}
  - name: anthropic-fallback
    provider: anthropic
    model: claude-sonnet-4-20250514
    weight: 30
    api_key: ${ANTHROPIC_API_KEY}
routing:
  strategy: weighted
  fallback:
    enabled: true
    max_retries: 2
```
That splits 70% of traffic to OpenAI and 30% to Anthropic. If OpenAI fails, requests automatically fall back to Anthropic.
What I like: the governance layer ties routing to budgets. You can set a four-tier budget hierarchy (Customer, Team, Virtual Key, Provider Config) and routing decisions respect those limits. When a provider budget is exhausted, traffic shifts automatically.
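Conceptually, budget-aware routing just filters the candidate pool before the weighted pick happens. A simplified Python illustration (the flat spend dict and field names here are hypothetical, not Bifrost's actual data model, which tracks spend per tier):

```python
# Hypothetical sketch: drop providers whose tracked spend has reached their
# cap, then route among whatever remains. A real governance layer tracks
# spend per tier (customer, team, virtual key, provider config).
def within_budget(providers, spend_usd):
    """Return providers whose recorded spend is still under their cap."""
    return [p for p in providers if spend_usd.get(p["name"], 0.0) < p["cap_usd"]]

providers = [
    {"name": "openai-primary", "cap_usd": 100.0},
    {"name": "anthropic-fallback", "cap_usd": 50.0},
]
spend = {"openai-primary": 100.0, "anthropic-fallback": 12.5}

# openai-primary has hit its cap, so only anthropic-fallback stays eligible.
print([p["name"] for p in within_budget(providers, spend)])
```

The point is that budget enforcement and routing share one decision, so traffic shifts the moment a cap is hit rather than after a billing alert fires.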
Setup is genuinely fast. One command to start:
```bash
npx -y @maximhq/bifrost
```
Or Docker:
```bash
docker run -p 8080:8080 maximhq/bifrost
```
The setup guide covers both approaches. Provider configuration takes a few minutes.
Other features worth noting: semantic caching with dual-layer support (exact hash + semantic similarity), observability built in, MCP support with sub-3ms latency and 50%+ token reduction in Code Mode, and a drop-in replacement endpoint for the Anthropic SDK so you can migrate without changing application code. Anthropic SDK integration docs here.
Check the benchmarks if you want to verify the numbers yourself.
2. LiteLLM
Language: Python | Overhead: ~8ms | Providers: 100+
LiteLLM has the widest provider coverage I have seen. Over 100 providers through a unified interface. If you need to call a niche model API, LiteLLM probably supports it.
Routing is available through the proxy server. You can configure fallbacks and load balancing across models. The configuration is YAML-based and straightforward.
```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: sk-xxx
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: sk-yyy
router_settings:
  routing_strategy: least-busy
  num_retries: 3
```
The trade-off is performance. At ~8ms overhead per request, you are adding meaningful latency at high throughput. For applications doing thousands of requests per second, that adds up. The Python runtime is the bottleneck.
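Taking the two overhead figures quoted in this post at face value, the gap is easy to quantify. This is simple arithmetic on the stated numbers, not a new benchmark:

```python
# Added latency per request, using the overhead figures quoted above.
litellm_overhead_s = 8e-3    # ~8 ms per request
bifrost_overhead_s = 11e-6   # 11 microseconds per request

ratio = litellm_overhead_s / bifrost_overhead_s

# At a sustained 5,000 RPS, cumulative added processing time per second
# of traffic (spread across however many workers handle those requests):
litellm_added_per_s = 5000 * litellm_overhead_s
bifrost_added_per_s = 5000 * bifrost_overhead_s

print(f"per-request gap: ~{ratio:.0f}x")
print(f"cumulative overhead at 5,000 RPS: {litellm_added_per_s:.3f}s vs {bifrost_added_per_s:.3f}s")
```

Because the overhead is per request, it shows up as added latency on every call and extra CPU pressure under concurrency, not as one serial delay; whether that matters depends on your throughput and latency budget.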
Credit where it is due: LiteLLM's provider coverage is unmatched and the community is active. For teams that prioritize breadth over speed, it is a solid choice.
3. Kong AI Gateway
Language: Lua/C (OpenResty) | Type: Enterprise, plugin-based
Kong is a well-established API gateway that added AI capabilities through plugins. If your organization already runs Kong for general API management, adding AI routing is incremental.
The AI plugin supports multiple providers and basic routing. Rate limiting, authentication, and logging come from Kong's mature plugin ecosystem.
The limitation: AI-specific routing features require the enterprise tier. The open-source version gives you basic proxying, but weighted routing, advanced failover, and AI-specific analytics are paid features. Configuration is also more complex because you are working within Kong's plugin architecture rather than a purpose-built AI gateway.
Credit: Kong's plugin ecosystem is mature and battle-tested for general API management.
4. Cloudflare AI Gateway
Type: Managed service | Setup: Minutes
Cloudflare AI Gateway is the easiest to set up on this list. If you are already on Cloudflare, you can enable it from the dashboard and start routing requests through their edge network.
It provides caching, rate limiting, and basic analytics out of the box. The managed nature means zero infrastructure to maintain.
The limitation: routing flexibility is constrained compared to self-hosted options. Custom routing strategies, weighted distribution, and provider-level budget controls are limited. You also depend on Cloudflare's edge network for all LLM traffic, which may not work for teams with data residency requirements.
Credit: For teams that want AI gateway functionality without managing infrastructure, Cloudflare delivers the simplest path to production.
5. Azure API Management
Type: Enterprise, Azure-native | Setup: Hours to days
Azure APIM is the default choice for organizations already invested in Azure. It supports routing to Azure OpenAI endpoints with built-in integration, and you can configure policies for retry, circuit breaking, and load balancing.
The routing configuration uses Azure's policy XML, which is verbose but powerful. You get deep integration with Azure Monitor, Key Vault, and other Azure services.
The limitation: it is Azure-native. If you are multi-cloud or use non-Azure LLM providers, the integration story gets complicated. Routing to Anthropic or other providers requires custom policy work. Setup is also significantly more complex than purpose-built AI gateways.
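For a sense of what the policy XML looks like, here is a minimal retry fragment. It is illustrative only; check the APIM policy reference for exact attributes, expressions, and placement in your policy document:

```xml
<policies>
  <backend>
    <!-- Retry the backend call up to 3 times when it returns 429,
         waiting 2 seconds between attempts -->
    <retry condition="@(context.Response.StatusCode == 429)" count="3" interval="2">
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
</policies>
```

Compare that to the YAML snippets above: the same retry intent takes noticeably more ceremony, which is the verbosity trade-off in practice.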
Credit: For Azure-first organizations, the deep integration with the Azure ecosystem and enterprise compliance features are genuinely valuable.
Comparison table
| Feature | Bifrost | LiteLLM | Kong AI | Cloudflare AI | Azure APIM |
|---|---|---|---|---|---|
| Latency overhead | 11 microseconds | ~8ms | Low (Lua/C) | Varies (edge) | Varies |
| Language | Go | Python | Lua/C | Managed | Managed |
| Weighted routing | Yes | Yes | Enterprise only | Limited | Via policy |
| Automatic failover | Yes | Yes | Yes | Basic | Via policy |
| Budget-aware routing | Yes (4-tier) | Basic | No | No | No |
| Semantic caching | Yes (dual-layer) | Basic | No | Yes | No |
| Provider count | Growing | 100+ | Major providers | Major providers | Azure-focused |
| Open source | Yes | Yes | Partial | No | No |
| Self-hosted | Yes | Yes | Yes | No | No |
| Setup time | Minutes | Minutes | Hours | Minutes | Hours to days |
Honest trade-offs
No tool is perfect. Here is what I found lacking in each.
Bifrost: Provider count is still growing. If you need a niche provider that is not yet supported, you will need to check the docs or request it. The project is newer than LiteLLM, so community resources are still building up.
LiteLLM: Performance at scale is the main concern. The ~8ms overhead is fine for low-throughput applications, but at 5,000+ RPS, you are looking at significant cumulative latency. Memory usage also climbs with the Python runtime under load.
Kong AI Gateway: The AI features feel bolted on rather than native. If you are not already a Kong customer, adopting the full Kong stack just for AI routing is overkill. Enterprise pricing for AI-specific features is a barrier.
Cloudflare AI Gateway: Limited control. You cannot implement custom routing strategies or complex failover logic. Data flows through Cloudflare's network, which is a non-starter for some compliance requirements.
Azure APIM: Vendor lock-in is real. Multi-provider routing outside Azure requires significant custom work. Configuration through XML policies is tedious compared to YAML-based alternatives.
Which one should you pick
Pick Bifrost if performance and routing flexibility are your top priorities. The 11 microsecond overhead and built-in governance features (budget-aware routing, weighted distribution, automatic failover) make it the strongest option for high-throughput production workloads. Star it on GitHub or check the docs to get started.
Pick LiteLLM if you need the widest provider coverage and performance is not your bottleneck.
Pick Kong if your organization already runs Kong and wants to add AI routing incrementally.
Pick Cloudflare if you want zero infrastructure overhead and can live with limited routing customization.
Pick Azure APIM if you are fully committed to the Azure ecosystem.
For most teams building production AI infrastructure, routing is a gateway-level concern that should not leak into application code. The right gateway depends on your throughput requirements, provider mix, and how much control you need over routing logic.
I would start with Bifrost. One command to run, 11 microseconds of overhead, and routing that actually works at scale. Docs are here.