TL;DR: Running all your LLM traffic through a single provider is a single point of failure. Weighted load balancing lets you split traffic across providers (say 70/30 GPT-4o/Claude), optimize for cost or latency per use case, and fail over automatically when one provider goes down. Bifrost handles this at the gateway layer with roughly 11 microseconds of overhead per request. Here is how to set it up.
Why Single-Provider Is a Bad Idea
You have probably been here. Your entire app routes through OpenAI. One day, OpenAI hits capacity. Your 429 retry logic kicks in, but the retries also get 429'd. Your app is effectively down, and your only option is to wait.
This is not a hypothetical. Every major LLM provider has had multi-hour outages in the last 12 months. If your architecture assumes 100% availability from a single provider, your architecture is wrong.
The fix is not "add retry logic." The fix is routing traffic across multiple providers at the gateway layer, with weights you control.
What Weighted Load Balancing Actually Means
Instead of sending every request to one provider, you define a split:
- 70% of requests go to OpenAI (GPT-4o)
- 30% go to Anthropic (Claude Sonnet)
Or maybe:
- 50% to Gemini (cheaper for simple tasks)
- 50% to Anthropic (better for complex reasoning)
The gateway makes the routing decision per-request based on these weights. Your application code does not change. It sends requests to one endpoint (the gateway), and the gateway distributes them.
The key benefit: you change the weights in a config file, not in your application code. No redeploy. No code review. Just update the config and the traffic split changes immediately.
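Under the hood, a weighted split is just a weighted random choice made once per request. Here is a minimal sketch in Python of how that selection works (an illustration of the idea, not Bifrost's actual Go implementation; the account ids are examples):

```python
import random

# Accounts and weights, mirroring a 70/30 gateway config
ACCOUNTS = [
    {"id": "openai-primary", "weight": 70},
    {"id": "anthropic-secondary", "weight": 30},
]

def pick_account(accounts, rng=random):
    """Pick one account per request, proportionally to its weight."""
    weights = [a["weight"] for a in accounts]
    return rng.choices(accounts, weights=weights, k=1)[0]

# Over many requests, roughly 70% land on openai-primary
rng = random.Random(0)
counts = {"openai-primary": 0, "anthropic-secondary": 0}
for _ in range(10_000):
    counts[pick_account(ACCOUNTS, rng)["id"]] += 1
print(counts)
```

Because the choice is independent per request, the split converges to your configured ratio over time without any coordination or sticky state.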
Setting This Up With Bifrost
Bifrost is an open-source LLM gateway written in Go. It sits between your app and LLM providers as a reverse proxy, adding 11 microseconds of overhead per request.
Here is a weighted routing config that splits traffic between OpenAI and Anthropic:
{
"accounts": [
{
"id": "openai-primary",
"provider": "openai",
"api_key": "${OPENAI_API_KEY}",
"weight": 70
},
{
"id": "anthropic-secondary",
"provider": "anthropic",
"api_key": "${ANTHROPIC_API_KEY}",
"weight": 30
}
],
"fallback_config": {
"enabled": true,
"on_status_codes": [429, 500, 502, 503]
}
}
What this does:
- 70% of incoming requests route to OpenAI, 30% to Anthropic
- If OpenAI returns a 429 or 5xx, the request automatically fails over to Anthropic
- If Anthropic is also down, the request returns an error to the client
- All of this happens at the gateway. Your app sends requests to http://localhost:8080/v1/chat/completions and does not know which provider handled it
Start the gateway with zero config:
npx -y @maximhq/bifrost
Then configure providers through the web dashboard at localhost:8080. No YAML, no environment variable chains. JSON config, web UI, done.
Three Load Balancing Strategies That Actually Work
1. Cost Optimization Split
Route simple tasks to the cheapest provider, complex tasks to the most capable.
{
"accounts": [
{
"id": "gemini-flash-cheap",
"provider": "gemini",
"model": "gemini-2.5-flash",
"weight": 60
},
{
"id": "openai-capable",
"provider": "openai",
"model": "gpt-4o",
"weight": 40
}
]
}
Gemini Flash handles the bulk of requests at lower cost. GPT-4o handles the rest. You can check the per-model pricing on the model library to find the right split for your use case. If you want exact cost numbers before committing, the LLM cost calculator lets you compare across providers for your specific token volumes.
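To reason about what a split saves you, compute the blended rate. The prices below are placeholders purely for illustration (check each provider's current pricing or the cost calculator for real numbers):

```python
# Blended cost per 1M input tokens for a 60/40 split.
# Prices are HYPOTHETICAL placeholders, not current provider pricing.
PRICE_PER_1M_INPUT = {
    "gemini-2.5-flash": 0.15,  # assumed $/1M input tokens
    "gpt-4o": 2.50,            # assumed $/1M input tokens
}
WEIGHTS = {"gemini-2.5-flash": 0.60, "gpt-4o": 0.40}

blended = sum(PRICE_PER_1M_INPUT[m] * w for m, w in WEIGHTS.items())
print(f"Blended: ${blended:.2f} per 1M input tokens")
```

With these assumed prices, the 60/40 split costs $1.09 per million input tokens versus $2.50 for all-GPT-4o, a 56% reduction. Plug in your real prices and volumes to find the ratio that fits your budget.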
2. Latency-Optimized Split
If your app is latency-sensitive (chatbots, real-time agents), route more traffic to the fastest provider.
{
"accounts": [
{
"id": "groq-fast",
"provider": "groq",
"model": "llama-3.3-70b-versatile",
"weight": 50
},
{
"id": "anthropic-quality",
"provider": "anthropic",
"model": "claude-sonnet-4-6",
"weight": 50
}
]
}
Groq is extremely fast for Llama models. Anthropic gives you better reasoning quality. 50/50 split means half your users get near-instant responses, half get higher quality. You tune the ratio based on what your users actually need.
For raw numbers on how much overhead each gateway adds, check the benchmarks page. Bifrost adds 11 microseconds. That is not a typo. Microseconds, not milliseconds.
3. Reliability-First Split
For production apps where uptime matters more than cost.
{
"accounts": [
{
"id": "openai-primary",
"provider": "openai",
"weight": 40
},
{
"id": "anthropic-secondary",
"provider": "anthropic",
"weight": 30
},
{
"id": "gemini-tertiary",
"provider": "gemini",
"weight": 30
}
],
"fallback_config": {
"enabled": true,
"on_status_codes": [429, 500, 502, 503]
}
}
Three providers. If any one goes down, traffic redistributes to the other two. The probability of all three being down simultaneously is near zero. This is the setup we recommend for anything that cannot afford downtime. The automatic failover guide walks through the full failover config in detail.
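You can put a rough number on "near zero." Assuming each provider is independently available 99.9% of the time (an assumed figure, not a published SLA, and real outages are not perfectly independent, so treat this as an intuition rather than a guarantee):

```python
# Probability that all three providers are down at the same moment,
# assuming 99.9% availability each and independent failures.
p_down = 1 - 0.999          # 0.1% chance any one provider is down
p_all_down = p_down ** 3    # all three down simultaneously
print(f"{p_all_down:.1e}")
```

That works out to one in a billion, which is why a three-provider split with automatic failover is the default recommendation for uptime-critical workloads.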
Adding Budget Controls to the Mix
Weighted routing becomes more powerful when combined with budget controls. You do not want one runaway team to blow through your entire OpenAI quota and leave other teams with nothing.
Bifrost has a four-tier budget hierarchy: Organization > Team > Virtual Key > Provider. Each level can have daily, weekly, or monthly caps.
{
"virtual_keys": [
{
"id": "team-backend",
"budget": {
"monthly_limit_usd": 5000,
"daily_limit_usd": 250
},
"rate_limit": {
"request_max_limit": 1000,
"request_reset_duration": "1h"
}
}
]
}
When a team hits their budget, requests can either fail (hard stop) or automatically route to a cheaper model (soft failover). This is configurable per virtual key.
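The decision between the two policies reduces to a simple branch. A sketch of the logic (hypothetical function and model names, not Bifrost source code):

```python
# Sketch of the two budget-exhaustion policies:
# "hard" rejects the request, "soft" downgrades to a cheaper model.
def route_with_budget(spent_usd, monthly_limit_usd, policy, model, cheap_model):
    if spent_usd < monthly_limit_usd:
        return model                        # under budget: normal routing
    if policy == "soft":
        return cheap_model                  # soft failover: downgrade
    raise RuntimeError("budget exceeded")   # hard stop: reject

assert route_with_budget(4000, 5000, "hard", "gpt-4o", "gpt-4o-mini") == "gpt-4o"
assert route_with_budget(5000, 5000, "soft", "gpt-4o", "gpt-4o-mini") == "gpt-4o-mini"
```

Soft failover keeps the team productive at degraded quality; a hard stop is better when overspend is worse than downtime.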
If you are running Claude Code across a dev team, this is especially useful. Each developer gets a virtual key with their own budget. No one developer can burn through the team's allocation.
The Provider-Isolation Architecture
Here is something most gateways get wrong: they use a single request queue for all providers. When OpenAI starts rate limiting you and requests back up, the queue fills. Now your Anthropic and Gemini requests are also stuck behind the OpenAI backlog.
Bifrost uses provider-isolated worker pools. Each provider gets its own queue. OpenAI being slow does not affect Anthropic latency at all. This matters a lot under weighted load balancing, because the whole point is that you have multiple providers and they should operate independently.
Your App → Bifrost Gateway → [OpenAI Pool]    → OpenAI
                           → [Anthropic Pool] → Anthropic
                           → [Gemini Pool]    → Gemini
Backpressure policies are configurable per provider: drop (discard), block (wait), or error (fail fast). So if OpenAI's pool fills up, you can choose to fail fast and let the failover logic route to the next provider, instead of waiting.
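The three backpressure policies are easy to see with per-provider bounded queues. A Python sketch of the idea (Bifrost's actual pools are Go worker pools; this just demonstrates the isolation and the fail-fast behavior):

```python
import queue

# One bounded queue per provider: a backlog on one provider
# never blocks requests headed to another.
pools = {name: queue.Queue(maxsize=2) for name in ("openai", "anthropic")}

def enqueue(provider, request, policy="error"):
    """Apply a backpressure policy when a provider's queue is full."""
    try:
        pools[provider].put_nowait(request)
        return "queued"
    except queue.Full:
        if policy == "drop":
            return "dropped"            # discard silently
        if policy == "error":
            return "error"              # fail fast; caller can fail over
        pools[provider].put(request)    # "block": wait for space
        return "queued"

# Fill the OpenAI pool; the Anthropic pool is unaffected
enqueue("openai", "r1"); enqueue("openai", "r2")
print(enqueue("openai", "r3"))      # error: OpenAI pool is full
print(enqueue("anthropic", "r4"))   # queued: separate pool, has room
```

With the "error" policy, a full OpenAI pool returns immediately instead of queueing, so the fallback logic can send the request to Anthropic while OpenAI recovers.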
Setting Up With Your Stack
Bifrost is provider-agnostic on the client side. It exposes an OpenAI-compatible API, so any client that speaks OpenAI format works out of the box.
Python (OpenAI SDK):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="your-bifrost-virtual-key"
)
Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
# Now Claude Code routes through Bifrost to any configured provider
Any OpenAI-compatible client:
Just change the base URL to point at your Bifrost instance. That is it. The multi-provider setup guide covers the full configuration for all 19 supported providers.
If you are using tools like Zed editor, LibreChat, or Gemini CLI, there are specific integration guides for each.
When Not to Use Weighted Load Balancing
Being honest here. Weighted routing adds complexity. If you are:
- Running a prototype or hobby project: just use one provider directly
- Under 100 requests per day: the reliability benefit is not worth the setup
- Using only one model for a very specific task: routing adds no value
Weighted load balancing makes sense when you are in production, handling real traffic, and need either cost optimization, reliability guarantees, or both.
Getting Started
# Start Bifrost (zero config)
npx -y @maximhq/bifrost
# Open dashboard
open http://localhost:8080
# Configure providers and weights through the UI
# Or edit the JSON config directly
19 providers supported out of the box. You can compare available models and pricing on the model library before deciding your split. If you are evaluating gateways in general, the buyer's guide covers what to look for.
We maintain Bifrost at Maxim AI. It is open-source, MIT licensed, and free to self-host. If you run into issues, open an issue on GitHub or check the docs.