Kuldeep Paul
Adaptive Load Balancing: Why Your LLM Gateway Needs It

The Problem

You have three OpenAI API keys. You round-robin between them for load balancing. Should work fine, right?

Week 1: Perfect. Traffic distributed evenly, everything fast.

Week 2: One key starts hitting rate limits. Another develops weird latency spikes. Your round-robin keeps sending a third of your traffic to each degraded key anyway.

Users see 5-second response times. You're manually pausing keys and restarting your gateway.

This happened to us constantly until we built adaptive load balancing into Bifrost.


How Adaptive Load Balancing Works

Instead of blind round-robin, the gateway tracks performance metrics for each API key:

  • Response latency
  • Error rate
  • Success rate

Then automatically adjusts traffic weights based on real-time performance.

Example:

```
Initial state (equal weights):
├─ Key 1: weight 1.0 → 33% traffic
├─ Key 2: weight 1.0 → 33% traffic
└─ Key 3: weight 1.0 → 34% traffic

After 5 minutes (Key 2 degraded):
├─ Key 1: weight 1.1 → 42% traffic (performing well)
├─ Key 2: weight 0.6 → 23% traffic (high latency detected)
└─ Key 3: weight 1.0 → 35% traffic (normal)
```

The degraded key gets less traffic automatically. No manual intervention.
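To make the weight-to-traffic mapping concrete, here's a minimal sketch of weighted random selection, the standard way weights become traffic share. The names (`keyState`, `pickKey`) are illustrative, not Bifrost's actual API.

```go
package main

import (
    "fmt"
    "math/rand"
)

type keyState struct {
    name   string
    weight float64
}

// pickKey chooses a key with probability proportional to its weight,
// so weight 0.6 gets roughly half the traffic of weight 1.2.
func pickKey(keys []keyState) keyState {
    total := 0.0
    for _, k := range keys {
        total += k.weight
    }
    r := rand.Float64() * total
    for _, k := range keys {
        r -= k.weight
        if r <= 0 {
            return k
        }
    }
    return keys[len(keys)-1] // guard against float rounding
}

func main() {
    keys := []keyState{{"key1", 1.1}, {"key2", 0.6}, {"key3", 1.0}}
    counts := map[string]int{}
    for i := 0; i < 100000; i++ {
        counts[pickKey(keys).name]++
    }
    fmt.Println(counts) // roughly 41% / 22% / 37%, tracking the weights
}
```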


Real Results

Before adaptive balancing:

  • One degraded key = 33% of users affected
  • Manual detection and response time: 10-30 minutes
  • Required constant monitoring

After adaptive balancing:

  • Degraded keys detected in 30 seconds
  • Traffic automatically shifted to healthy keys
  • 95% of requests unaffected

Actual metrics from our production deployment:

```
Key performance detected:
├─ Key 1: P99 latency 1100ms, error rate 0.01
├─ Key 2: P99 latency 2800ms, error rate 0.08 ⚠️
└─ Key 3: P99 latency 1200ms, error rate 0.02

Automatic weight adjustment:
├─ Key 1: weight increased to 1.2 (48% traffic)
├─ Key 2: weight decreased to 0.4 (16% traffic)
└─ Key 3: weight stays 1.0 (36% traffic)
```

Result:

  • Average latency: down 23%
  • Error rate: down 35%

Beyond Single Provider

This isn't just for multiple keys from one provider. It works across providers too.

Multi-provider setup:

```
├─ OpenAI GPT-4 (2 keys)
├─ Anthropic Claude (2 keys)
└─ Azure OpenAI (1 key)
```

If OpenAI has an outage, traffic automatically shifts to Anthropic. When OpenAI recovers, traffic gradually shifts back based on performance.
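One plausible way to model this (my assumption, not Bifrost's internal layout): keys from every provider live in a single weighted pool, so cross-provider failover is just the same weight math applied to a bigger key set.

```go
// Assumed data shape: provider keys share one weighted pool.
type poolEntry struct {
    provider string
    model    string
    weight   float64
}

var pool = []poolEntry{
    {"openai", "gpt-4", 1.0}, {"openai", "gpt-4", 1.0},
    {"anthropic", "claude", 1.0}, {"anthropic", "claude", 1.0},
    {"azure", "gpt-4", 1.0},
}

// During an OpenAI outage its entries' error rates spike, their weights
// drop toward the floor, and selection concentrates on Anthropic and
// Azure. As OpenAI recovers, rising scores pull traffic back gradually.
```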

Cost optimization:

You can also set cost preferences. If two providers perform similarly, route more traffic to the cheaper one automatically.
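A hedged sketch of what that could look like: discount a key's performance score by its relative price, so cost only tips the balance when performance is comparable. The formula and its 90/10 split are my illustration, not Bifrost's documented behavior.

```go
// Illustrative only: bias the score toward cheaper providers without
// letting cost override a large performance gap.
func costAdjustedScore(perfScore, costPerMTok, cheapestCost float64) float64 {
    relative := cheapestCost / costPerMTok  // 1.0 for the cheapest, <1.0 otherwise
    return perfScore * (0.9 + 0.1*relative) // at most a 10% cost discount
}
```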


How It's Implemented in Bifrost

Bifrost (our open-source gateway) evaluates performance every 30 seconds:

```go
func (lb *LoadBalancer) AdjustWeights() {
    // Re-score each key from its recent metrics, then clamp the new
    // weight so a single adjustment can't swing traffic too far.
    for key, metrics := range lb.keyMetrics {
        score := calculateScore(metrics.latency, metrics.errorRate)
        newWeight := clamp(score, 0.5, 1.5)
        lb.weights[key] = newWeight
    }
}
```
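The 30-second cadence itself is a standard ticker loop. A sketch of how you might drive it (the `Run` method and context wiring are my assumption, not a verbatim Bifrost excerpt):

```go
// Sketch: call AdjustWeights every 30 seconds until the gateway shuts down.
// (imports assumed: "context", "time")
func (lb *LoadBalancer) Run(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            lb.AdjustWeights()
        }
    }
}
```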

Scoring factors:

  • Latency (lower is better)
  • Error rate (lower is better)
  • Recent trend (improving vs degrading)

Weights are clamped between 0.5x and 1.5x so that a single adjustment can't shift traffic too aggressively.
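The snippet above calls `calculateScore` and `clamp` without showing them. Here's one plausible shape for both, treating the 1-second latency baseline and the error-rate penalty as assumed constants; the recent-trend factor would need extra state, so it's omitted here.

```go
// Illustrative helpers for the snippet above. The 1s latency baseline and
// the 5.0 error penalty are assumed constants, not Bifrost's actual tuning.
func calculateScore(latency time.Duration, errorRate float64) float64 {
    latencyScore := float64(time.Second) / float64(latency) // 1.0 at 1s, lower when slower
    return latencyScore * (1.0 - 5.0*errorRate)             // errors cut the score sharply
}

func clamp(v, lo, hi float64) float64 {
    if v < lo {
        return lo
    }
    if v > hi {
        return hi
    }
    return v
}
```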


Why This Matters

Without adaptive balancing: You're blind. Keys degrade silently. Users suffer. You react manually.

With adaptive balancing: Gateway detects issues in seconds. Automatically protects users. You focus on building, not monitoring.


Try It Yourself

Bifrost is fully open source and MIT licensed:

```bash
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
```

Add multiple API keys through the UI at localhost:8080. Watch the dashboard show real-time weight adjustments as keys perform differently.

The full source code includes the adaptive load balancing implementation, performance tracking, and weight calculation logic.


Bottom line: Stop manually managing API keys. Let your gateway handle it automatically based on actual performance.
