Rate limits are one of the most common causes of production LLM failures.
OpenAI enforces 10,000 RPM on Tier 2. Anthropic caps you at 50 RPM on the free tier. Without proper handling, a single traffic spike can trigger cascading 429s, broken user flows, and pager fatigue.
This guide covers 9 battle‑tested strategies to eliminate rate limit failures in production, using Bifrost (open source LLM gateway) as a reference. All of this is config, not app rewrites.
maximhq/bifrost: Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancing, cluster mode, guardrails, support for 1,000+ models, and <100 µs overhead at 5k RPS.
Bifrost AI Gateway
The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
That's it! Your AI gateway is running with a web interface for visual configuration…
Strategy 1: Multi-Key Load Balancing
Problem: Single API key → single rate limit.
Solution: Multiple keys → multiplied throughput.
With Bifrost, you can define multiple OpenAI keys and weight them for load balancing:
curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "keys": [
      {"name": "key-1", "value": "sk-1...", "weight": 0.33},
      {"name": "key-2", "value": "sk-2...", "weight": 0.33},
      {"name": "key-3", "value": "sk-3...", "weight": 0.34}
    ]
  }'
Result: 3x throughput (30,000 RPM vs 10,000 RPM on a single key).
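To build intuition for what the gateway is doing under the hood, here is a minimal Python sketch of weighted key selection. The key names and weights mirror the config above; this is an illustration of the general technique, not Bifrost's internal algorithm:

```python
import random

# Hypothetical key pool mirroring the config above (names/weights illustrative).
KEYS = [
    ("key-1", "sk-1...", 0.33),
    ("key-2", "sk-2...", 0.33),
    ("key-3", "sk-3...", 0.34),
]

def pick_key(keys):
    """Pick one API key at random, proportionally to its weight."""
    names = [name for name, _, _ in keys]
    weights = [weight for _, _, weight in keys]
    return random.choices(names, weights=weights, k=1)[0]

# Over many draws, each key serves roughly its weighted share of traffic,
# so each key's individual RPM cap is hit roughly three times slower.
counts = {name: 0 for name, _, _ in KEYS}
for _ in range(10_000):
    counts[pick_key(KEYS)] += 1
```

Because each request lands on one of three keys, each key sees only about a third of the traffic, which is where the 3x effective throughput comes from.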
For details, see the key management and load balancing docs.
Strategy 2: Multi-Provider Failover
Don’t pin your entire app to one provider. Wrap multiple providers behind a single virtual key and weight traffic between them:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {"provider": "openai", "weight": 0.8},
      {"provider": "anthropic", "weight": 0.2}
    ]
  }'
Behavior: If OpenAI gets rate limited, traffic automatically fails over to Anthropic.
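The failover logic itself is simple to picture. Here is a minimal client-side sketch of the idea, with a `RuntimeError` standing in for an HTTP 429 response; Bifrost does this at the gateway, so application code normally never needs it:

```python
def call_with_failover(providers, send):
    """Try providers in order; fall back when one is rate limited."""
    last_error = None
    for provider in providers:
        try:
            return send(provider)
        except RuntimeError as exc:  # stand-in for an HTTP 429 error
            last_error = exc
    raise last_error

# Simulate OpenAI being rate limited while Anthropic succeeds.
def fake_send(provider):
    if provider == "openai":
        raise RuntimeError("429 Too Many Requests")
    return f"response from {provider}"

result = call_with_failover(["openai", "anthropic"], fake_send)
# result is served by the fallback provider
```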
You can see how automatic provider failover works in the fallbacks documentation.
Strategy 3: Gateway-Level Rate Limiting
Never hit the provider’s hard limit directly. Put a safety buffer at the gateway:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "request_max_limit": 8000,
      "request_reset_duration": "1m"
    }
  }'
Here, the gateway blocks at 8,000 RPM so you never slam into OpenAI’s 10,000 RPM cap.
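Conceptually, this is a counter against a window. The sketch below shows one simple way to enforce "at most N requests per window" locally (a fixed-window counter); Bifrost's actual limiting algorithm may differ:

```python
import time

class FixedWindowLimiter:
    """Allow at most max_requests per window_seconds.

    A sketch of the gateway-side safety buffer, not Bifrost's implementation.
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            # New window: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        # Rejected locally, before the provider ever sees the request.
        return False

limiter = FixedWindowLimiter(max_requests=8000, window_seconds=60)
results = [limiter.allow() for _ in range(8001)]
# exactly 8000 admitted; request 8001 is blocked at the gateway
```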
More examples live in the governance and rate limiting docs.
Strategy 4: Token-Based Limiting
Sometimes requests are cheap in count but expensive in tokens. You can cap tokens instead of (or in addition to) request counts:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "token_max_limit": 100000,
      "token_reset_duration": "1h"
    }
  }'
This protects you from a few “fat” prompts blowing through your entire capacity.
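A token budget is just an accumulator against a cap. This sketch illustrates the idea with made-up token counts; a real gateway charges the budget with provider-reported usage rather than client-side estimates:

```python
class TokenBudget:
    """Track token spend against a cap (illustrative sketch)."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def try_spend(self, tokens):
        if self.used + tokens > self.max_tokens:
            # Rejected: this request would exceed the hourly token cap.
            return False
        self.used += tokens
        return True

budget = TokenBudget(max_tokens=100_000)
accepted_small = budget.try_spend(2_000)   # typical prompt: admitted
accepted_fat = budget.try_spend(99_000)    # one "fat" prompt: blocked
```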
Strategy 5: Provider-Level Limits
Different providers, different quotas. You can set per‑provider limits under the same virtual key:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "openai",
        "rate_limit": {
          "request_max_limit": 8000,
          "request_reset_duration": "1m"
        }
      },
      {
        "provider": "anthropic",
        "rate_limit": {
          "request_max_limit": 500,
          "request_reset_duration": "1m"
        }
      }
    ]
  }'
This keeps each provider within its safe envelope. You can see more patterns in the provider‑level rate limiting section.
Strategy 6: Semantic Caching
The easiest way to dodge rate limits is to send fewer requests.
With semantic caching, similar queries get served from cache instead of hitting the model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="vk-prod")

# First request - hits the provider
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are business hours?"}]
)

# Semantically similar request - served from cache
response2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "When are you open?"}]
)
# Cache hit - doesn't count toward the provider rate limit
In practice, this can reduce provider traffic by 40–60%.
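To make the mechanism concrete, here is a toy semantic cache. Real semantic caches compare embedding vectors with cosine similarity; this sketch substitutes word-overlap (Jaccard) similarity so it runs standalone, and the threshold value is arbitrary:

```python
def jaccard(a, b):
    """Toy similarity: word overlap, a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class SemanticCache:
    """Serve a cached answer when a new query is similar enough to an old one."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []  # list of (query, response) pairs

    def get(self, query):
        for cached_query, response in self.entries:
            if jaccard(query, cached_query) >= self.threshold:
                return response  # cache hit: no provider call
        return None  # cache miss: caller forwards to the provider

    def put(self, query, response):
        self.entries.append((query, response))

cache = SemanticCache(threshold=0.5)
cache.put("what are your business hours", "We are open 9-5, Mon-Fri.")
hit = cache.get("what are the business hours")   # similar wording: served from cache
miss = cache.get("how do I reset my password")   # unrelated: goes to the provider
```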
Implementation details are in the semantic caching docs.
Strategy 7: Exponential Backoff
When you do get 429s, don’t panic retry in a tight loop. Back off:
curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "network_config": {
      "max_retries": 5,
      "retry_backoff_initial_ms": 1,
      "retry_backoff_max_ms": 10000
    }
  }'
This yields a doubling backoff sequence (1 ms, 2 ms, 4 ms, 8 ms, 16 ms, and so on, capped at 10 s by retry_backoff_max_ms) instead of hammering the API in a tight loop and extending your outage.
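The delay schedule is easy to compute. This sketch mirrors the field names from the config above (the math is the standard doubling scheme, not Bifrost's exact retry code); many production systems also add random jitter so retries from different clients don't synchronize:

```python
import random

def backoff_delays(max_retries, initial_ms, max_ms, jitter=False):
    """Doubling retry delays in milliseconds, capped at max_ms."""
    delays = []
    delay = initial_ms
    for _ in range(max_retries):
        # With jitter, sleep a random fraction of the current delay instead.
        d = random.uniform(0, delay) if jitter else delay
        delays.append(min(d, max_ms))
        delay = min(delay * 2, max_ms)
    return delays

delays = backoff_delays(max_retries=5, initial_ms=1, max_ms=10_000)
# → [1, 2, 4, 8, 16]
```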
Strategy 8: Hierarchical Rate Limits
In multi‑tenant or multi‑team setups, you want isolation: one abuse case shouldn’t starve everyone else.
You can stack limits at customer, team, and virtual key levels:
# Customer limit
curl -X POST http://localhost:8080/api/governance/customers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Corp",
    "budget": {"max_limit": 10000, "reset_duration": "1M"}
  }'

# Team limit
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Engineering",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 5000, "reset_duration": "1M"}
  }'

# Virtual key limit
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-engineering",
    "rate_limit": {
      "request_max_limit": 1000,
      "request_reset_duration": "1h"
    }
  }'
This hierarchy is described in the budget and rate limit hierarchy docs.
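The key property of a hierarchy like this is that a request must clear every level, so the tightest level binds first. Here is a minimal sketch of that cascade (limit values mirror the example above; this is an illustration, not Bifrost's implementation):

```python
class Limit:
    """One level in the hierarchy: a usage counter against a maximum."""

    def __init__(self, max_limit):
        self.max_limit = max_limit
        self.used = 0

def admit(levels):
    """Admit a request only if every level (customer -> team -> key) has
    headroom, then charge all levels together."""
    if any(level.used >= level.max_limit for level in levels):
        return False
    for level in levels:
        level.used += 1
    return True

customer = Limit(10_000)
team = Limit(5_000)
virtual_key = Limit(1_000)

# The virtual key is the tightest level, so it binds first: of 1,500
# attempts, only 1,000 are admitted, and the team/customer pools are
# protected from one key consuming everything.
admitted = sum(admit([customer, team, virtual_key]) for _ in range(1_500))
```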
Strategy 9: Monitoring and Alerting
You can’t fix what you don’t see. Wire rate limit usage into your observability stack and alert before things blow up:
groups:
  - name: rate_limits
    rules:
      - alert: RateLimitApproaching
        expr: (rate_limit_usage / rate_limit_max) > 0.8
        labels:
          severity: warning
This type of rule is covered in the telemetry and monitoring docs.
Putting It All Together: Complete Setup
Here’s how all of this looks end‑to‑end.
1. Multi-key load balancing + retries for OpenAI:
curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "keys": [
      {"name": "key-1", "value": "sk-1...", "weight": 0.5},
      {"name": "key-2", "value": "sk-2...", "weight": 0.5}
    ],
    "network_config": {
      "max_retries": 5,
      "retry_backoff_initial_ms": 1
    }
  }'
2. Register Anthropic as a secondary provider:
curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "anthropic",
    "keys": [
      {"name": "anthropic-key", "value": "sk-ant-...", "weight": 1.0}
    ]
  }'
3. Virtual key with rate limits and weighted providers:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "request_max_limit": 16000,
      "request_reset_duration": "1m",
      "token_max_limit": 3000000,
      "token_reset_duration": "1m"
    },
    "provider_configs": [
      {
        "provider": "openai",
        "weight": 0.8,
        "rate_limit": {
          "request_max_limit": 16000,
          "request_reset_duration": "1m"
        }
      },
      {
        "provider": "anthropic",
        "weight": 0.2,
        "rate_limit": {
          "request_max_limit": 500,
          "request_reset_duration": "1m"
        }
      }
    ]
  }'
Resulting behavior:
- 2 OpenAI keys = 20,000 RPM effective capacity
- Gateway capped at 16,000 RPM (80% of provider limit)
- Automatic Anthropic failover when OpenAI chokes
- Exponential backoff on transient failures
- Token limits to keep costs in check
And your application code stays the same:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-prod"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
Get Started
Spin up Bifrost locally in one command:
npx -y @maximhq/bifrost
You can dig into the full configuration and governance surface area in the Bifrost docs, and explore the code on GitHub.
Key Takeaway
Rate limits don’t have to be outages. With:
- multi‑key load balancing (3x throughput),
- multi‑provider failover (automatic switching),
- gateway‑level rate limits (no surprise 429s),
- semantic caching (40–60% fewer calls),
- and hierarchical controls (tenant and team isolation),
you can push serious traffic through LLMs without living in fear of hard caps. Bifrost implements all of these strategies through configuration so your app code doesn’t become a tangle of rate limit hacks.