Every production LLM system eventually runs into the same wall. You are paying too much, responses are too slow, or a single provider outage takes everything down.
The fix is routing. Instead of hardcoding one model for all requests, you route each request to the best available model based on cost, latency, and reliability.
I evaluated several approaches over the last few weeks: marketplace APIs, framework-level abstractions, self-hosted gateways, and DIY logic. Here is how they compared.
## Why Route at All?
If you are only using one model from one provider, you do not need routing. But the moment you add a second provider, routing decisions start piling up.
Three reasons this matters:
**Cost.** GPT-4o costs roughly 10x more per token than GPT-4o-mini. If 60% of your traffic is simple summarization or classification, you are burning money sending it to a frontier model. Routing lets you match request complexity to model price.

**Latency.** Provider response times vary by region, time of day, and current load. A request that takes 800ms on one provider might take 2.5s on another at that exact moment.

**Reliability.** Every provider has outages. Rate limits hit. 429s and 500s happen. If your entire product is wired to one API endpoint, you inherit their downtime.
Smart routing optimises across all three per request, without changing application code.
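The per-request decision across those three axes can be sketched in a few lines. Everything here is illustrative: the model names, prices, and latencies are made-up placeholders, not real provider figures.

```python
# Minimal sketch of a per-request routing decision across cost, latency,
# and health. All numbers are illustrative, not real pricing or benchmarks.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative
    p50_latency_ms: float      # recently observed latency
    healthy: bool              # from health checks

def pick(candidates, latency_budget_ms=2000):
    # Drop unhealthy or currently-too-slow providers, then take the
    # cheapest of what remains.
    viable = [c for c in candidates
              if c.healthy and c.p50_latency_ms <= latency_budget_ms]
    return min(viable, key=lambda c: c.cost_per_1k_tokens) if viable else None

models = [
    Candidate("frontier", 0.0100, 800, True),
    Candidate("mini",     0.0006, 400, True),
    Candidate("backup",   0.0030, 2500, True),  # too slow right now
]
print(pick(models).name)  # cheapest healthy model within the latency budget
```

A real router also has to refresh the latency and health inputs continuously, which is exactly the machinery the rest of this post is about.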
## The Landscape: Four Approaches
Before picking a tool, I mapped out how the options break down.
### Marketplace routing (OpenRouter)
OpenRouter acts as a unified API across dozens of models from different providers. You send a request to their endpoint, and they handle the provider connection. Good model catalog, single API key. The trade-off is that you add a network hop through their servers, and their routing logic is a black box: less control over failover behaviour, budget enforcement, and routing weights.
### Framework-level routing (Semantic Kernel)
Microsoft's Semantic Kernel lets you define model selection logic inside your application code. You can set up filters that choose models based on request properties, user tier, or function type. The issue: routing becomes tightly coupled to your application. Every service needs the routing logic, and updating routing config means redeploying application code. No built-in budget enforcement or provider health monitoring either.
### DIY routing
You can always write your own. A reverse proxy with some logic to pick providers based on health checks and weights. I tried this first with a simple Python setup:
```python
import random

PROVIDERS = {
    "openai": {"url": "https://api.openai.com/v1/chat/completions", "weight": 0.6},
    "anthropic": {"url": "https://api.anthropic.com/v1/messages", "weight": 0.4},
}

def pick_provider():
    # Weighted random choice across the configured providers
    names = list(PROVIDERS.keys())
    weights = [PROVIDERS[n]["weight"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]
```
This works for two providers with static weights. It falls apart when you need failover, budget tracking, health checks, or dynamic weight adjustment. I abandoned this after two weeks of edge cases.
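To make concrete why the DIY path snowballs, here is a hedged sketch of the very next feature you would need: ordered failover. The provider names and URLs mirror the snippet above but are placeholders, and the `post` callable is injected (anything shaped like `httpx.post`) so the logic can be exercised without real network calls.

```python
# Illustrative only: the next layer a DIY router needs once static weighted
# choice is not enough. Ordered failover with retryable-status handling.
RETRYABLE = {429, 500, 502, 503}

def call_with_failover(providers, payload, post, timeout=30.0):
    """Try providers in priority order, falling through on retryable errors.

    providers: list of (name, {"url": ...}) pairs, highest priority first.
    post: any callable shaped like httpx.post / requests.post.
    """
    last_error = None
    for name, cfg in providers:
        try:
            resp = post(cfg["url"], json=payload, timeout=timeout)
        except Exception as exc:   # network failure: try the next provider
            last_error = exc
            continue
        if resp.status_code in RETRYABLE:
            last_error = RuntimeError(f"{name} returned {resp.status_code}")
            continue               # rate limited or erroring: fall through
        return resp.json()
    raise RuntimeError("all providers failed") from last_error
```

And this still ignores health probes, cooldown windows, budget tracking, and redistributing weights while a provider is down, which is roughly where my two weeks of edge cases went.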
### Gateway-level routing
A gateway sits between your application and LLM providers. You configure routing rules once, and every service behind the gateway gets the same behaviour. Application code does not know or care which provider serves a request.
This is where I spent most of my time. And this is where the data got interesting.
## Why Gateway-Level Routing Won for Me
The decision came down to one principle: routing is infrastructure, not application logic.
When routing lives in the application layer, every team implements it differently. One team does round-robin, another does random selection, a third hardcodes a provider. Failover behaviour is inconsistent. Budget tracking is scattered across services.
A gateway centralises all of that. Configure it once, every downstream service gets consistent routing, failover, and budget enforcement. Change the routing strategy and no application code changes.
After testing several gateways, Bifrost gave me the best combination of routing flexibility and raw performance. Written in Go, it adds 11 microseconds of latency overhead and sustains 5,000 RPS. For context, Python-based alternatives like LiteLLM add around 8ms per request, which is several hundred times more routing overhead.
Here is how I set it up.
## Bifrost Routing: The Deep Dive
### Weighted Distribution
The most common routing strategy. You assign weights to providers and Bifrost distributes traffic proportionally. Weights auto-normalise, so you can use any numbers.
```yaml
accounts:
  - id: "production"
    providers:
      - id: "openai-primary"
        type: "openai"
        api_key: "${OPENAI_API_KEY}"
        model: "gpt-4o"
        weight: 60
      - id: "anthropic-secondary"
        type: "anthropic"
        api_key: "${ANTHROPIC_API_KEY}"
        model: "claude-sonnet-4-20250514"
        weight: 30
      - id: "gemini-tertiary"
        type: "gemini"
        api_key: "${GEMINI_API_KEY}"
        model: "gemini-2.5-pro"
        weight: 10
```
60% of requests go to GPT-4o. 30% to Claude Sonnet. 10% to Gemini. I used this split to compare output quality across providers on real production traffic. Adjusting the weights is a config change, not a code deploy.
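Auto-normalisation just means the traffic share is weight divided by the sum of all weights, so 60/30/10 behaves the same as 6/3/1 or 0.6/0.3/0.1. A quick sketch:

```python
# Sketch of weight auto-normalisation: any positive numbers work, and each
# provider's traffic share is simply weight / sum(weights).
def traffic_shares(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

shares = traffic_shares(
    {"openai-primary": 60, "anthropic-secondary": 30, "gemini-tertiary": 10}
)
print(shares)
# {'openai-primary': 0.6, 'anthropic-secondary': 0.3, 'gemini-tertiary': 0.1}
```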
The full routing configuration docs cover the rest of the options.
### Automatic Failover
Weighted routing handles the happy path. Failover handles everything else. When a provider returns errors, Bifrost automatically retries with the next provider in weight order.
```yaml
accounts:
  - id: "production"
    providers:
      - id: "openai-primary"
        type: "openai"
        api_key: "${OPENAI_API_KEY}"
        model: "gpt-4o"
        weight: 80
      - id: "anthropic-fallback"
        type: "anthropic"
        api_key: "${ANTHROPIC_API_KEY}"
        model: "claude-sonnet-4-20250514"
        weight: 15
      - id: "gemini-fallback"
        type: "gemini"
        api_key: "${GEMINI_API_KEY}"
        model: "gemini-2.5-pro"
        weight: 5
```
OpenAI returns a 429? Bifrost retries with Anthropic. Anthropic is down? Falls back to Gemini. The application never sees the failure. No retry logic in application code, no manual intervention.
I ran a 48-hour test where I intentionally rotated provider API keys to simulate outages. Bifrost handled every failover cleanly. Requests were slower (because retries take time) but none failed from the application's perspective.
### Budget-Aware Routing
This is where Bifrost's approach gets genuinely useful. The governance layer has a four-tier budget hierarchy: Customer, Team, Virtual Key, and Provider Config.
```yaml
budgets:
  - level: "team"
    id: "backend-team"
    limit: 500
    period: "monthly"
  - level: "virtual_key"
    id: "dev-key-pranay"
    limit: 100
    period: "monthly"
```
When a budget tier is exhausted, routing decisions respect that constraint. If the backend team hits their monthly limit, requests from that team stop going through. If a specific virtual key runs out, that key is blocked but other keys on the same team still work.
This level of granularity is something I did not find in the other approaches I tested. Most solutions do global rate limiting at best. The four-tier hierarchy lets you set guardrails at every organisational level without building custom middleware.
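A toy model of that tier check helps show why the granularity matters. This is illustrative only, not Bifrost's internals; the spend figures and the second virtual key (`dev-key-other`) are invented for the example.

```python
# Toy model of hierarchical budget enforcement (not Bifrost internals).
# A request is allowed only if every tier it charges against has headroom.
spent = {"team:backend-team": 480.0, "vk:dev-key-pranay": 100.0}
limits = {"team:backend-team": 500.0, "vk:dev-key-pranay": 100.0,
          "vk:dev-key-other": 100.0}

def allowed(tiers, est_cost):
    """tiers: the budget keys this request maps to, outermost first."""
    return all(spent.get(t, 0.0) + est_cost <= limits[t] for t in tiers)

# The team still has headroom, but dev-key-pranay is exhausted:
print(allowed(["team:backend-team", "vk:dev-key-pranay"], 1.0))  # False
# Another key on the same team still works:
print(allowed(["team:backend-team", "vk:dev-key-other"], 1.0))   # True
```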
### Semantic Caching: Skip Routing Entirely
Bifrost's semantic caching is dual-layer: exact hash matching first, then semantic similarity matching.
When a request hits the cache, it never reaches a provider. No routing decision needed. No API call. No cost. The response comes back from cache directly.
For workloads with repeated or similar queries (customer support, code generation with common patterns, FAQ-type interactions), caching eliminates a significant chunk of provider calls entirely. In my testing, cache hit rates on repetitive workloads were high enough to noticeably reduce total routed requests.
This interacts well with budget-aware routing. Fewer routed requests means budgets last longer.
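To make the dual-layer idea concrete, here is a toy sketch. The bag-of-words "embedding" and the 0.9 threshold stand in for a real embedding model and tuned cutoff; this is not Bifrost's implementation.

```python
# Toy two-layer cache lookup: exact hash first, then similarity match.
# The bag-of-words "embedding" stands in for a real embedding model.
import hashlib
from collections import Counter
from math import sqrt

exact_cache = {}        # sha256(prompt) -> response
semantic_cache = []     # list of (embedding, response) pairs

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lookup(prompt, threshold=0.9):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                       # layer 1: exact match
        return exact_cache[key]
    emb = embed(prompt)
    for cached_emb, response in semantic_cache:  # layer 2: similar enough
        if cosine(emb, cached_emb) >= threshold:
            return response
    return None                                  # miss: route to a provider

def store(prompt, response):
    exact_cache[hashlib.sha256(prompt.encode()).hexdigest()] = response
    semantic_cache.append((embed(prompt), response))
```

The layering is the point: the hash lookup is near-free and catches verbatim repeats, while the similarity pass catches rephrasings that a hash would miss.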
## Getting Started
Setup is fast. One command:
```bash
npx -y @maximhq/bifrost
```
Or Docker:
```bash
docker run -p 8080:8080 maximhq/bifrost
```
Configure your providers in the config file, set your routing weights, and point your application at the gateway endpoint. The setup guide walks through it. Provider configuration covers all supported providers and model formats.
Bifrost exposes drop-in replacement endpoints for the OpenAI and Anthropic SDKs. If your application already uses either SDK, you change the base URL and nothing else. The Anthropic SDK integration docs have the specifics.
## Results After Switching
I ran Bifrost for three weeks across production workloads. Here is what the data showed.
**Latency overhead:** Consistently under 15 microseconds per request. The 11 microsecond claim held up in my benchmarks. At 5,000 RPS, total gateway overhead was negligible compared to actual LLM response times. You can run the benchmarks yourself.

**Failover recovery:** Provider failures were transparent to the application. During two real OpenAI degradation events, traffic shifted to Anthropic within the same request cycle. Zero application-level errors.

**Cost visibility:** The four-tier budget hierarchy gave me per-team and per-key cost tracking without building anything custom. I caught one team burning through their allocation on a retry loop within the first week.

**Cache savings:** Semantic caching reduced routed requests by a meaningful percentage on workloads with repeated query patterns. Those were requests that never hit a provider, never cost anything.
The combination of weighted routing, automatic failover, budget controls, and semantic caching in a single layer that adds 11 microseconds of overhead is something I have not been able to replicate with any other approach I tested.
## Final Thoughts
LLM routing is not optional in production. Static provider configs break under load, cost more than they should, and give you zero flexibility when things go wrong.
The approach matters. Marketplace APIs abstract away too much control. Framework-level routing couples infrastructure decisions to application code. DIY solutions work until the edge cases pile up.
Gateway-level routing keeps the concern where it belongs: in infrastructure. Bifrost's performance numbers, routing flexibility, and budget hierarchy made it the strongest option in my evaluation.
If you are running LLMs in production with multiple providers, set up a gateway and stop hardcoding routing in application code. The data speaks for itself.