Pranay Batta

Weighted Load Balancing Across LLM Providers Without Code Changes

Traffic distribution across multiple LLM providers requires routing logic. Most teams build it into application code: if-statements checking provider availability, manual failover switches, hardcoded percentages.

This approach breaks down when requirements change. Shifting from an 80/20 split to 50/50 means deploying code. A/B testing different providers means feature flags. Cost optimization means rewriting routing logic.

GitHub: maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Bifrost handles traffic distribution at the gateway level through weighted routing. Configure provider weights once. The gateway distributes traffic automatically. No application code changes required.

Adaptive Load Balancing - Bifrost (docs.getbifrost.ai)

Advanced load balancing algorithms with predictive scaling, health monitoring, and performance optimization for enterprise-grade traffic distribution.

How Weighted Routing Works

Virtual keys support multiple provider configurations. Each provider gets assigned a weight. Traffic distributes proportionally based on these weights.

Configuration example:

{
  "id": "vk-prod",
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"],
      "weight": 0.2
    },
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.8
    }
  ]
}

This configuration routes:

  • 80% of gpt-4o requests to Azure
  • 20% of gpt-4o requests to OpenAI
  • 100% of gpt-4o-mini requests to OpenAI (the only configured provider that allows it)

The system normalizes weights automatically. If you configure weights of 8 and 2, the system treats them as 0.8 and 0.2. If you configure 1 and 1, they normalize to 0.5 and 0.5.

Request Format

Applications send requests without specifying providers:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

The gateway selects the provider based on configured weights. The application doesn't know which provider handled the request unless it checks the response:

{
  "choices": [{
    "message": {"content": "Response"}
  }],
  "extra_fields": {
    "provider": "azure"
  }
}

The extra_fields.provider field indicates which provider processed the request.

Bypassing Weighted Routing

To target a specific provider, include the provider prefix in the model name:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

The openai/ prefix bypasses weighted routing and sends the request directly to OpenAI.

Use cases for bypassing:

  • Debugging provider-specific issues
  • Testing new provider integrations
  • Reproducing errors that only occur on specific providers
  • Benchmarking latency across providers

Weight Normalization Behavior

Weights are normalized to sum to 1.0 based on providers available for the requested model.

Example: three providers are configured with weights 0.5, 0.3, and 0.2, but the requested model is only available on the first two. The effective weights become:

  • Provider 1: 0.5 / (0.5 + 0.3) = 0.625 (62.5%)
  • Provider 2: 0.3 / (0.5 + 0.3) = 0.375 (37.5%)
  • Provider 3: Excluded (doesn't support the model)

The math: divide each weight by the sum of all applicable weights. This ensures percentages always add to 100%.
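
As a quick illustration, here is a minimal Python sketch of that per-model normalization. The data structures are simplified stand-ins, not Bifrost internals:

def effective_weights(provider_configs, model):
    """Rescale weights over the providers that can serve `model`."""
    # Keep only providers whose allowed_models include the requested model
    eligible = [c for c in provider_configs if model in c["allowed_models"]]
    total = sum(c["weight"] for c in eligible)
    return {c["provider"]: c["weight"] / total for c in eligible}

configs = [
    {"provider": "provider-1", "allowed_models": ["gpt-4o"], "weight": 0.5},
    {"provider": "provider-2", "allowed_models": ["gpt-4o"], "weight": 0.3},
    {"provider": "provider-3", "allowed_models": ["gpt-4o-mini"], "weight": 0.2},
]

for provider, w in effective_weights(configs, "gpt-4o").items():
    print(f"{provider}: {w:.1%}")  # provider-1: 62.5%, provider-2: 37.5%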

Use Case: Cost Optimization

Route majority traffic to cheaper providers, minority traffic to premium providers:

{
  "provider_configs": [
    {
      "provider": "openai-gpt35",
      "weight": 0.7
    },
    {
      "provider": "openai-gpt4o",
      "weight": 0.3
    }
  ]
}

70% of requests use GPT-3.5-turbo. 30% use GPT-4o. Adjust weights based on quality requirements versus cost constraints.

This reduces overall costs while maintaining quality for requests where GPT-3.5 is sufficient.
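
To make the trade-off concrete, here is a quick blended-cost calculation. The per-request prices are hypothetical placeholders, not real provider pricing:

# Hypothetical per-request costs -- substitute your providers' actual pricing
cost_cheap = 0.002    # cheaper model, dollars per request
cost_premium = 0.020  # premium model, dollars per request

weights = {"cheap": 0.7, "premium": 0.3}

blended = weights["cheap"] * cost_cheap + weights["premium"] * cost_premium
print(f"Blended cost per request: ${blended:.4f}")  # $0.0074 vs $0.0200 all-premium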

Use Case: Blue/Green Deployments

Test new providers or models with small traffic percentages before full rollout:

{
  "provider_configs": [
    {
      "provider": "azure-gpt4o",
      "weight": 0.95
    },
    {
      "provider": "azure-gpt4o-new",
      "weight": 0.05
    }
  ]
}

Send 5% of traffic to the new deployment. Monitor error rates, latency, and response quality. Gradually increase the weight if metrics look good. Roll back by setting the weight to 0 if issues occur.

No code changes required during rollout. Just update the virtual key configuration.
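
As a sketch of what a gradual rollout can look like, the snippet below ramps the new deployment's weight through the governance REST API (the same PUT endpoint shown under Configuration Methods). The ramp schedule and the metrics check are placeholders to replace with your own monitoring:

import time
import requests

GATEWAY = "http://localhost:8080"
VK = "vk-prod"

def set_new_weight(new_weight: float):
    """Give the new deployment `new_weight` of traffic; the rest stays on the current one."""
    payload = {
        "provider_configs": [
            {"provider": "azure-gpt4o", "weight": 1.0 - new_weight},
            {"provider": "azure-gpt4o-new", "weight": new_weight},
        ]
    }
    requests.put(f"{GATEWAY}/api/governance/virtual-keys/{VK}", json=payload).raise_for_status()

def metrics_look_good() -> bool:
    # Placeholder: check your error-rate and latency dashboards here
    return True

for weight in (0.05, 0.10, 0.25, 0.50, 1.00):  # example ramp schedule
    set_new_weight(weight)
    time.sleep(600)  # let traffic accumulate before judging the step
    if not metrics_look_good():
        set_new_weight(0.0)  # roll back: all traffic returns to the current deployment
        break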

Use Case: Load Distribution Across API Keys

Distribute traffic across multiple API keys from the same provider:

{
  "provider_configs": [
    {
      "provider": "openai",
      "api_key_id": "key-1",
      "weight": 0.33
    },
    {
      "provider": "openai",
      "api_key_id": "key-2",
      "weight": 0.33
    },
    {
      "provider": "openai",
      "api_key_id": "key-3",
      "weight": 0.34
    }
  ]
}

Each key handles roughly 33% of traffic. This prevents hitting rate limits on any single key and increases total throughput by distributing requests across quota pools.
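
For intuition about how this distribution plays out, here is a small simulation of weighted random selection, the usual way proportional distribution is implemented (a sketch, not Bifrost's actual selection code):

import random
from collections import Counter

# Three keys with the weights from the configuration above
keys = ["key-1", "key-2", "key-3"]
weights = [0.33, 0.33, 0.34]

# Simulate 100,000 requests; random.choices draws proportionally to the weights
picks = Counter(random.choices(keys, weights=weights, k=100_000))

for key, count in picks.items():
    print(f"{key}: {count / 100_000:.1%}")  # each key lands near its configured share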

Use Case: Regional Traffic Routing

Route traffic based on geographic proximity or data residency requirements:

{
  "provider_configs": [
    {
      "provider": "azure-us-east",
      "weight": 0.6
    },
    {
      "provider": "azure-eu-west",
      "weight": 0.4
    }
  ]
}

Applications in North America get routed primarily to US regions. European traffic gets mixed routing. Adjust weights based on actual user distribution.

Interaction with Automatic Failover

Weighted routing integrates with automatic failover. Providers are sorted by weight (highest first) and used as a fallback chain.

Configuration:

{
  "provider_configs": [
    {
      "provider": "azure",
      "weight": 0.8
    },
    {
      "provider": "openai",
      "weight": 0.2
    }
  ]
}

Execution flow:

  1. 80% of requests go to Azure (the primary, based on weight)
  2. If Azure fails, the request automatically retries with OpenAI
  3. If OpenAI also fails, the gateway returns an error

This means weighted routing provides both load distribution and automatic failover; a single configuration serves both purposes.
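
A rough Python sketch of that flow, assuming a weighted-random first attempt followed by a weight-ordered fallback chain (illustrative only, not Bifrost's implementation; `send` is a placeholder for the actual provider call):

import random

def call_with_failover(provider_configs, send):
    """Pick a weighted-random primary, then fall back through the remaining providers by weight."""
    providers = [c["provider"] for c in provider_configs]
    weights = [c["weight"] for c in provider_configs]

    primary = random.choices(providers, weights=weights, k=1)[0]
    by_weight = [c["provider"] for c in sorted(provider_configs, key=lambda c: c["weight"], reverse=True)]
    order = [primary] + [p for p in by_weight if p != primary]

    for provider in order:
        try:
            return provider, send(provider)  # `send(provider)` performs the actual API call
        except Exception:
            continue  # provider failed; try the next one in the chain
    raise RuntimeError("all providers failed")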

Documentation: https://docs.getbifrost.ai/features/governance/routing

Configuration Methods

Three ways to configure weighted routing:

Web UI: Visual interface in the Virtual Keys section. Add providers, set weights with sliders or input fields. Changes apply immediately.

REST API: Programmatic configuration for dynamic provisioning:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {"provider": "openai", "weight": 0.2},
      {"provider": "azure", "weight": 0.8}
    ]
  }'

config.json: File-based configuration for GitOps workflows:

{
  "governance": {
    "virtual_keys": [{
      "id": "vk-prod",
      "provider_configs": [
        {"provider": "openai", "weight": 0.2},
        {"provider": "azure", "weight": 0.8}
      ]
    }]
  }
}

All three methods modify the same underlying state. Changes propagate immediately.

Monitoring Traffic Distribution

Track actual traffic distribution using the extra_fields.provider field in responses:

# `responses` is assumed to be a list of parsed JSON response bodies from the gateway
provider_counts = {}

for response in responses:
    provider = response['extra_fields']['provider']
    provider_counts[provider] = provider_counts.get(provider, 0) + 1

# Report each provider's observed share of traffic
total = sum(provider_counts.values())
for provider, count in provider_counts.items():
    percentage = (count / total) * 100
    print(f"{provider}: {percentage:.1f}%")

Compare the actual distribution against the configured weights (a quick deviation check is sketched after this list). Deviations indicate:

  • One provider experiencing failures (automatic failover shifting traffic)
  • Model availability differences (some models only on specific providers)
  • Configuration errors (weights not summing to expected totals)
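
Building on the counting snippet above, a simple deviation check might look like this (the sample counts, provider names, and tolerance are illustrative):

# `provider_counts` comes from the counting loop above; sample values shown here
provider_counts = {"azure": 8800, "openai": 1200}
configured = {"azure": 0.8, "openai": 0.2}  # the weights set on the virtual key
tolerance = 0.05                            # flag anything more than 5 points off

total = sum(provider_counts.values())
for provider, expected in configured.items():
    actual = provider_counts.get(provider, 0) / total
    if abs(actual - expected) > tolerance:
        print(f"{provider}: expected {expected:.0%}, observed {actual:.1%} -- check for failover or config drift")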

What We Got Wrong

Mistake 1: Initially implemented weight normalization globally instead of per-model. This broke when models had different provider availability. A model available on providers A and B would incorrectly factor in the weight from provider C that didn't support it. We changed to per-model normalization.

Mistake 2: No weight validation. Users could set negative weights or weights summing to zero. This caused division-by-zero errors in normalization. We added validation requiring non-negative weights and at least one positive weight per model.

Mistake 3: Weight changes required service restart. This made A/B testing impractical. We changed to hot-reload configuration, allowing weight adjustments without restarting the gateway.

Mistake 4: No visibility into actual traffic distribution. Users set 80/20 weights but had no way to confirm that was what actually happened (failures skew the distribution). We added extra_fields.provider to responses.

Limitations

Non-deterministic routing: Two identical requests might route to different providers. This breaks assumptions about caching or provider-specific behavior. Applications expecting deterministic routing need sticky routing (not currently supported).

Weight precision: Weights use floating-point math. Very small weight differences (0.001 vs 0.002) might not produce exact traffic splits due to rounding. Use meaningful weight differences (0.1+ separation).

No session affinity: Requests from the same user might route to different providers across requests. This breaks workflows requiring session-level consistency. Workaround: include the provider prefix in the model name for session-critical requests.

Cold start unfairness: First request to each provider might have higher latency. Weight-based routing doesn't account for cold starts. Traffic distribution might show latency spikes until all providers are warmed up.

Why This Matters

Traffic distribution is a production requirement, not a development convenience. Provider pricing changes. New models launch. Quota limits get adjusted. Regional performance varies.

Handling this in application code means deployments for infrastructure changes. Weighted routing at the gateway level decouples application logic from infrastructure configuration.

Change the weights. Traffic distribution updates immediately. No code changes. No deployments. The infrastructure layer handles it.

