Pranay Batta

How to Route Between Claude Opus 4.7, GPT-5 Turbo, and Gemma 4 With Bifrost

TL;DR: I set up multi-model routing between Claude Opus 4.7, GPT-5 Turbo, and Gemma 4 through Bifrost in under 20 minutes. Application code stays on the OpenAI SDK, weighted routing splits traffic, and automatic fallbacks kick in when a provider fails. This post walks through the config, the failover behaviour I tested, and the gotchas.

This post assumes familiarity with OpenAI-compatible APIs, basic YAML configuration, and the difference between request-time model selection and gateway-level routing.

Why You Want a Gateway in Front of Three Models

Running production traffic against a single model has obvious failure modes. The 2024 OpenAI outage, the December 2025 Anthropic capacity throttles, and Google's regional Gemini outages have all happened in the last 18 months. If your app depends on one provider, you go down with them.

The other reason is cost. Claude Opus 4.7 is great at complex reasoning but expensive on simple tasks. Gemma 4 is faster and cheaper for routine completions. GPT-5 Turbo sits in the middle. A gateway lets you route each request to the right model without rewriting application code.

Bifrost handles this with weighted load balancing and automatic fallbacks. I tested it on a workload that previously ran 100% on Claude.

Step 1: Install Bifrost

npx -y @maximhq/bifrost

That starts Bifrost on port 8080 with a default config. For production, Docker works:

docker run -p 8080:8080 maximhq/bifrost:latest

The setup docs cover persistent volumes if you need them.
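
If you want the container's state to survive restarts, a compose file is the usual next step. The sketch below is my own, not from the docs: the /app/data mount path is an assumption, so check the setup docs for where the image actually keeps its config before relying on it.

# docker-compose.yml -- minimal sketch; the /app/data path is assumed,
# verify the real config/state directory in the Bifrost setup docs
services:
  bifrost:
    image: maximhq/bifrost:latest
    ports:
      - "8080:8080"
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - VERTEX_API_KEY=${VERTEX_API_KEY}
    volumes:
      - bifrost-data:/app/data

volumes:
  bifrost-data: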

Step 2: Configure Three Providers

Each provider gets a config block. The weight field controls traffic distribution and the allowed_models list filters which models the provider serves.

providers:
  - name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    allowed_models: ["claude-opus-4-7", "claude-sonnet-4-6"]
    weight: 0.3

  - name: openai
    api_key: ${OPENAI_API_KEY}
    allowed_models: ["gpt-5-turbo", "gpt-4o"]
    weight: 0.4

  - name: vertex
    api_key: ${VERTEX_API_KEY}
    project_id: ${VERTEX_PROJECT_ID}
    allowed_models: ["gemma-4", "gemini-2.5-pro"]
    weight: 0.3

Weights are auto-normalised to sum to 1.0, so you do not have to do the math. With these weights, 30% of traffic goes to Anthropic, 40% to OpenAI, and 30% to Gemma 4 via Vertex. The provider configuration docs cover the full schema.
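
Auto-normalisation just means each weight is divided by the sum, so you can write the weights as rough ratios instead of fractions that add up to 1.0. A few lines illustrating the arithmetic (my own snippet, not Bifrost code):

# Weight normalisation: each weight divided by the total.
weights = {"anthropic": 3, "openai": 4, "vertex": 3}
total = sum(weights.values())
shares = {name: w / total for name, w in weights.items()}
print(shares)  # {'anthropic': 0.3, 'openai': 0.4, 'vertex': 0.3}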

Step 3: Application Code Does Not Change

Bifrost exposes OpenAI-compatible endpoints, so the SDK call stays the same. Only the base URL switches.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/openai/v1",
    api_key="vk-prod-abc123"
)

response = client.chat.completions.create(
    model="gpt-5-turbo",
    messages=[{"role": "user", "content": "Summarize this changelog..."}]
)

When the request hits Bifrost, weighted routing picks a provider based on the configured weights. To pin a specific model, set model to its exact name; Bifrost then routes only to providers that list it in allowed_models.

For Claude requests using the native Anthropic SDK, point at /anthropic. For Gemini, /genai. The drop-in replacement docs cover the full endpoint matrix.
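
For completeness, here is roughly what the native-SDK path looks like with the Anthropic Python client. I am assuming the standard anthropic package here; the /anthropic prefix comes from the drop-in docs, and the key is still the Bifrost virtual key rather than a raw provider key.

from anthropic import Anthropic

# Native Anthropic SDK pointed at Bifrost's /anthropic endpoint
client = Anthropic(
    base_url="http://localhost:8080/anthropic",
    api_key="vk-prod-abc123",  # Bifrost virtual key, same as the OpenAI example
)

message = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this changelog..."}],
)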

Step 4: Configure Failover

Weighted routing handles the steady state. Failover handles the failure case. Bifrost sorts providers by weight (highest first) and retries on failure.

fallbacks:
  - primary: openai
    fallback_chain: ["anthropic", "vertex"]
    retry_on:
      - "rate_limit_exceeded"
      - "service_unavailable"
      - "timeout"
    max_retries: 2

I tested this by killing the OpenAI request mid-flight using a network rule. Bifrost detected the timeout, retried against Anthropic, and returned the response. The application saw a slightly higher latency but no error. The fallbacks docs cover all retry conditions.

One thing to know: cross-provider routing does not happen automatically. You configure the fallback chain explicitly. If you do not configure fallbacks, a provider failure becomes a request failure.
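
Without a fallback block, the error from the failing provider passes straight through to the client, so the application has to handle it. A minimal sketch of what that looks like on the OpenAI SDK side (these are the standard openai exception classes, nothing Bifrost-specific):

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/openai/v1", api_key="vk-prod-abc123")

try:
    response = client.chat.completions.create(
        model="gpt-5-turbo",
        messages=[{"role": "user", "content": "Summarize this changelog..."}],
    )
except openai.APIStatusError as e:
    # With no fallback chain configured, a 429 or 5xx from the primary
    # provider surfaces here instead of being retried elsewhere.
    print(f"provider failure passed through the gateway: {e.status_code}")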

Step 5: Add Rate Limits Per Provider

Different providers have different rate limit tiers. Encoding those at the gateway prevents one provider from getting hammered when another is down.

providers:
  - name: openai
    api_key: ${OPENAI_API_KEY}
    rate_limit:
      request_limit: 5000
      request_limit_duration: "1m"
      token_limit: 2000000
      token_limit_duration: "1m"
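
For the other two providers the block has the same shape, just with different ceilings. The numbers below are placeholders, not recommendations; set them to whatever your actual provider tier allows.

providers:
  - name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    rate_limit:
      request_limit: 2000
      request_limit_duration: "1m"
      token_limit: 800000
      token_limit_duration: "1m"

  - name: vertex
    api_key: ${VERTEX_API_KEY}
    project_id: ${VERTEX_PROJECT_ID}
    rate_limit:
      request_limit: 3000
      request_limit_duration: "1m"
      token_limit: 1500000
      token_limit_duration: "1m"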

If a provider exceeds its configured rate limit, Bifrost excludes it from routing but keeps the others available. Requests do not fail outright when one provider is saturated; they get rerouted.

Comparison

Capability                  Direct SDK   LiteLLM   Bifrost
Multi-provider routing      Manual       Yes       Yes
Weighted distribution       No           Yes       Yes (auto-normalised)
Cross-provider fallback     DIY          Yes       Yes (chain config)
Latency overhead            0            ~8ms      11 microseconds
Per-provider rate limits    DIY          Yes       Yes
OpenAI-compatible           N/A          Yes       Yes

Trade-offs and Limitations

Bifrost is self-hosted only; there is no managed cloud option. If you do not have the ops capacity to run it, that is real overhead.

Cross-provider routing does not happen automatically. You have to explicitly configure fallback chains for every primary provider. Forgetting this is the most common config bug I have seen.

OpenRouter routing through Bifrost is broken because of a tool call streaming issue. If you use OpenRouter today, you cannot keep that path through Bifrost.

Bifrost is newer than LiteLLM. Documentation is solid but the community is still building up.

Quick Recap

  • One config file maps three providers with weighted distribution and per-provider rate limits
  • Application code stays on the OpenAI SDK with only the base URL changing
  • Failover requires explicit fallback chain configuration; it does not happen by default
  • 11-microsecond overhead per request, 50x lower than Python-based gateways
  • Provider exclusion under rate limits keeps the rest of the routing pool healthy
