Debby McKinney

Tackling Rate Limits in Production LLM Applications

Rate limits are one of the most common causes of production LLM failures.

OpenAI enforces roughly 10,000 RPM at Tier 2 (exact limits vary by model). Anthropic caps the free tier at 50 RPM. Without proper handling, a single traffic spike can trigger cascading 429s, broken user flows, and pager fatigue.


This guide covers 9 battle‑tested strategies to eliminate rate limit failures in production, using Bifrost (open source LLM gateway) as a reference. All of this is config, not app rewrites.

maximhq/bifrost on GitHub

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost AI Gateway


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running, with a web interface for visual configuration.


Strategy 1: Multi-Key Load Balancing

Problem: Single API key → single rate limit.

Solution: Multiple keys → multiplied throughput.

With Bifrost, you can define multiple OpenAI keys and weight them for load balancing:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "keys": [
      {"name": "key-1", "value": "sk-1...", "weight": 0.33},
      {"name": "key-2", "value": "sk-2...", "weight": 0.33},
      {"name": "key-3", "value": "sk-3...", "weight": 0.34}
    ]
  }'

Result: 3x throughput (30,000 RPM vs 10,000 RPM on a single key).

For details, see the key management and load balancing docs.
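Under the hood, weighted key selection is just probability-proportional sampling. A minimal Python sketch of the idea (key names and weights mirror the config above; the function name is illustrative, not Bifrost internals):

```python
import random

# Keys and weights mirroring the gateway config above (illustrative).
KEYS = [
    ("key-1", 0.33),
    ("key-2", 0.33),
    ("key-3", 0.34),
]

def pick_key(keys):
    """Choose an API key with probability proportional to its weight."""
    names = [name for name, _ in keys]
    weights = [weight for _, weight in keys]
    return random.choices(names, weights=weights, k=1)[0]
```

Because each key carries its own provider-side quota, spreading requests this way multiplies effective throughput.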


Strategy 2: Multi-Provider Failover

Don’t pin your entire app to one provider. Wrap multiple providers behind a single virtual key and weight traffic between them:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {"provider": "openai", "weight": 0.8},
      {"provider": "anthropic", "weight": 0.2}
    ]
  }'

Behavior: If OpenAI gets rate limited, traffic automatically fails over to Anthropic.

You can see how automatic provider failover works in the fallbacks documentation.
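Conceptually, weighted routing with failover picks a primary provider by weight, then walks the remaining providers when it hits a 429. A sketch under stated assumptions (`call_provider` is a hypothetical helper that raises `RateLimitError` on a 429; this is not Bifrost's actual implementation):

```python
import random

class RateLimitError(Exception):
    """Stand-in for a provider 429 response (illustrative)."""

def route(request, providers, call_provider):
    """Pick a provider by weight, then fail over down the list on 429s.

    providers: list of (name, weight) pairs.
    call_provider: hypothetical helper that sends the request to the
    named provider and raises RateLimitError when throttled.
    """
    names = [name for name, _ in providers]
    weights = [weight for _, weight in providers]
    primary = random.choices(names, weights=weights, k=1)[0]
    order = [primary] + [n for n in names if n != primary]
    for name in order:
        try:
            return call_provider(name, request)
        except RateLimitError:
            continue  # this provider is throttled; try the next one
    raise RateLimitError("all providers rate limited")
```

The caller never sees the 429 unless every provider in the chain is throttled at once.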


Strategy 3: Gateway-Level Rate Limiting

Never hit the provider’s hard limit directly. Put a safety buffer at the gateway:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "request_max_limit": 8000,
      "request_reset_duration": "1m"
    }
  }'

Here, the gateway blocks at 8,000 RPM so you never slam into OpenAI’s 10,000 RPM cap.

More examples live in the governance and rate limiting docs.
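Conceptually, a gateway-side cap like this is a request counter against a time window. A rough fixed-window sketch (class name and details are illustrative, not Bifrost internals):

```python
import time

class FixedWindowLimiter:
    """Toy fixed-window limiter: allow up to max_requests per window,
    e.g. 8,000 per 60 s to stay under a 10,000 RPM provider cap."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        """Return True if the request fits in the current window."""
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            # Window expired: start a fresh one.
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False
```

Requests rejected at the gateway can be queued or surfaced to the caller immediately, instead of burning a provider-side 429.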


Strategy 4: Token-Based Limiting

Sometimes requests are cheap in count but expensive in tokens. You can cap tokens instead of (or in addition to) request counts:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "token_max_limit": 100000,
      "token_reset_duration": "1h"
    }
  }'

This protects you from a few “fat” prompts blowing through your entire capacity.
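The idea reduces to a simple token budget that rejects (or queues) any request whose estimated cost would exceed the remaining allowance. A sketch (class and method names are illustrative):

```python
class TokenBudget:
    """Toy token-based cap: deduct estimated tokens per request until
    the budget (e.g. 100,000 per hour, as configured above) runs out."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def try_consume(self, estimated_tokens):
        """Reserve tokens for a request; refuse if it would overshoot."""
        if self.used + estimated_tokens > self.max_tokens:
            return False  # would blow the budget; reject or queue
        self.used += estimated_tokens
        return True
```

Note that a single large prompt can be refused even when plenty of *requests* remain, which is exactly the protection request-count limits can't give you.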


Strategy 5: Provider-Level Limits

Different providers, different quotas. You can set per‑provider limits under the same virtual key:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "openai",
        "rate_limit": {
          "request_max_limit": 8000,
          "request_reset_duration": "1m"
        }
      },
      {
        "provider": "anthropic",
        "rate_limit": {
          "request_max_limit": 500,
          "request_reset_duration": "1m"
        }
      }
    ]
  }'

This keeps each provider within its safe envelope. You can see more patterns in the provider‑level rate limiting section.


Strategy 6: Semantic Caching

The easiest way to dodge rate limits is to send fewer requests.

With semantic caching, similar queries get served from cache instead of hitting the model:

# `client` is an OpenAI SDK client pointed at Bifrost
# (see the setup snippet at the end of this post)

# First request - hits the provider
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are business hours?"}]
)

# Semantically similar request - served from cache
response2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "When are you open?"}]
)
# Cache hit - doesn't count toward the provider's rate limit

In practice, this can reduce provider traffic by 40–60%.

Implementation details are in the semantic caching docs.
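Conceptually, a semantic cache embeds each query and returns a stored response when a new query's embedding is close enough to a previous one. A toy sketch using plain cosine similarity (`embed_fn` is a placeholder for a real embedding model; threshold and names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Toy semantic cache: embed_fn maps text to a vector (in practice,
    an embedding model); responses are reused above a similarity threshold."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (vector, response) pairs

    def get(self, query):
        """Return a cached response for a similar-enough query, else None."""
        qv = self.embed_fn(query)
        for vec, response in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return response  # cache hit: no provider call, no rate-limit cost
        return None

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

The threshold is the key tuning knob: too low and users get stale or wrong answers, too high and the cache rarely hits.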


Strategy 7: Exponential Backoff

When you do get 429s, don’t panic retry in a tight loop. Back off:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "network_config": {
      "max_retries": 5,
      "retry_backoff_initial_ms": 1,
      "retry_backoff_max_ms": 10000
    }
  }'

This yields a backoff sequence of roughly 1 ms → 2 ms → 4 ms → 8 ms → 16 ms, doubling up to the 10,000 ms cap, instead of hammering the API and extending your outage.
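The doubling schedule can be sketched in a few lines; in production you would also add random jitter so throttled clients don't all retry at the same instant:

```python
def backoff_delays(max_retries, initial_ms, max_ms):
    """Yield the delay (in ms) before each retry: doubling from
    initial_ms and capped at max_ms (illustrative helper)."""
    delay = initial_ms
    for _ in range(max_retries):
        yield min(delay, max_ms)
        delay *= 2
```

With `max_retries=5`, `initial_ms=1`, `max_ms=10000` this produces 1, 2, 4, 8, 16.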


Strategy 8: Hierarchical Rate Limits

In multi‑tenant or multi‑team setups, you want isolation: one abuse case shouldn’t starve everyone else.

You can stack limits at customer, team, and virtual key levels:

# Customer limit
curl -X POST http://localhost:8080/api/governance/customers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Corp",
    "budget": {"max_limit": 10000, "reset_duration": "1M"}
  }'

# Team limit
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Engineering",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 5000, "reset_duration": "1M"}
  }'

# Virtual key limit
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-engineering",
    "rate_limit": {
      "request_max_limit": 1000,
      "request_reset_duration": "1h"
    }
  }'

This hierarchy is described in the budget and rate limit hierarchy docs.
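The enforcement logic reduces to: a request passes only if every level of the hierarchy (customer → team → virtual key) still has headroom. A sketch (level names and data shapes are illustrative):

```python
def request_allowed(caps, usage):
    """Allow a request only if current usage is below the cap at
    every level of the hierarchy.

    caps:  {"customer": 10000, "team": 5000, "virtual_key": 1000}
    usage: same keys, current counts (both illustrative shapes).
    """
    return all(usage.get(level, 0) < cap for level, cap in caps.items())
```

Because every level is checked, one runaway virtual key exhausts only its own allowance, never the whole customer's.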


Strategy 9: Monitoring and Alerting

You can’t fix what you don’t see. Wire rate limit usage into your observability stack and alert before things blow up:

groups:
  - name: rate_limits
    rules:
      - alert: RateLimitApproaching
        expr: (rate_limit_usage / rate_limit_max) > 0.8
        labels:
          severity: warning

This type of rule is covered in the telemetry and monitoring docs.


Putting It All Together: Complete Setup

Here’s how all of this looks end‑to‑end.

1. Multi-key load balancing + retries for OpenAI:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "keys": [
      {"name": "key-1", "value": "sk-1...", "weight": 0.5},
      {"name": "key-2", "value": "sk-2...", "weight": 0.5}
    ],
    "network_config": {
      "max_retries": 5,
      "retry_backoff_initial_ms": 1
    }
  }'

2. Register Anthropic as a secondary provider:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "anthropic",
    "keys": [
      {"name": "anthropic-key", "value": "sk-ant-...", "weight": 1.0}
    ]
  }'

3. Virtual key with rate limits and weighted providers:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "request_max_limit": 16000,
      "request_reset_duration": "1m",
      "token_max_limit": 3000000,
      "token_reset_duration": "1m"
    },
    "provider_configs": [
      {
        "provider": "openai",
        "weight": 0.8,
        "rate_limit": {
          "request_max_limit": 16000,
          "request_reset_duration": "1m"
        }
      },
      {
        "provider": "anthropic",
        "weight": 0.2,
        "rate_limit": {
          "request_max_limit": 500,
          "request_reset_duration": "1m"
        }
      }
    ]
  }'

Resulting behavior:

  • 2 OpenAI keys = 20,000 RPM effective capacity
  • Gateway capped at 16,000 RPM (80% of provider limit)
  • Automatic Anthropic failover when OpenAI chokes
  • Exponential backoff on transient failures
  • Token limits to keep costs in check

And your application code stays the same:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-prod"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

Get Started

Spin up Bifrost locally in one command:

npx -y @maximhq/bifrost

You can dig into the full configuration and governance surface area in the Bifrost docs, and explore the code on GitHub.


Key Takeaway

Rate limits don’t have to be outages. With:

  • multi‑key load balancing (3x throughput),
  • multi‑provider failover (automatic switching),
  • gateway‑level rate limits (no surprise 429s),
  • semantic caching (40–60% fewer calls),
  • and hierarchical controls (tenant and team isolation),

you can push serious traffic through LLMs without living in fear of hard caps. Bifrost implements all of these strategies through configuration so your app code doesn’t become a tangle of rate limit hacks.
