Every major LLM provider enforces token-per-minute limits, request-per-minute caps, and concurrent request ceilings. When your app hits a 429 response from OpenAI at 2 AM during a traffic spike, your users see errors. Your on-call engineer gets paged. And your retry logic, if it even exists, starts hammering the same provider that just told you to slow down.
The fix is not more retry logic in your application code. The fix is pushing rate limit handling to the gateway layer. Let the gateway absorb 429s, queue requests, failover to backup providers, and enforce budgets, so your application code stays clean.
Here are five AI gateways that handle rate limiting at the infrastructure level, ranked by how well they actually solve the problem.
Bifrost is an open-source LLM gateway we built in Go specifically because Python-based gateways were too slow to sit in the request path. With 11 microseconds of latency overhead and 5,000 RPS of sustained throughput, the gateway itself never becomes a bottleneck during rate limit storms.
TL;DR
If rate limiting is your primary pain point, you need a gateway that handles provider-level rate limits automatically, lets you enforce your own rate limits per key/team, and fails over to backup providers without your app knowing. Bifrost does all of this at 50x the speed of Python-based alternatives. The other gateways on this list each have strengths worth considering depending on your stack.
1. Bifrost
Best for: Teams that need zero-overhead rate limit handling with automatic failover
We built Bifrost because we kept running into the same problem: Python gateways like LiteLLM add roughly 8ms of overhead per request. That is fine for a prototype. It is not fine when you are processing thousands of concurrent requests and every millisecond of gateway overhead compounds into real latency for your users.
Bifrost is written in Go. The measured overhead is 11 microseconds per request. That matters when you are dealing with rate limits, because the gateway needs to make fast decisions about queuing, retrying, and failing over.
Rate limiting architecture:
Bifrost handles rate limits at multiple levels:
- Virtual Key rate limits: Each virtual key gets independent `token_max_limit` and `request_max_limit` settings with configurable reset durations. Example config:
```json
{
  "rate_limit": {
    "token_max_limit": 10000,
    "token_reset_duration": "1h",
    "request_max_limit": 100,
    "request_reset_duration": "1m"
  }
}
```
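To make the semantics of those four fields concrete, here is a minimal sketch of a per-key limiter that enforces a token budget and a request budget over independent reset windows. This is illustrative only, not Bifrost's actual implementation; the class name and reset-on-window-expiry behavior are assumptions.

```python
import time

class VirtualKeyLimiter:
    """Illustrative fixed-window limiter mirroring the rate_limit config
    above. Not Bifrost's actual code; names are hypothetical."""

    def __init__(self, token_max, token_reset_s, request_max, request_reset_s):
        self.token_max = token_max
        self.token_reset_s = token_reset_s
        self.request_max = request_max
        self.request_reset_s = request_reset_s
        self.tokens_used = 0
        self.requests_used = 0
        self.token_window_start = time.monotonic()
        self.request_window_start = time.monotonic()

    def allow(self, estimated_tokens):
        now = time.monotonic()
        # Reset each counter independently when its window elapses.
        if now - self.token_window_start >= self.token_reset_s:
            self.tokens_used, self.token_window_start = 0, now
        if now - self.request_window_start >= self.request_reset_s:
            self.requests_used, self.request_window_start = 0, now
        # A request must fit under BOTH limits to be admitted.
        if self.requests_used + 1 > self.request_max:
            return False
        if self.tokens_used + estimated_tokens > self.token_max:
            return False
        self.requests_used += 1
        self.tokens_used += estimated_tokens
        return True

# 10k tokens per hour, 100 requests per minute, as in the JSON above
limiter = VirtualKeyLimiter(10000, 3600, 100, 60)
print(limiter.allow(500))  # True: within both limits
```

The key point: token and request limits are separate counters with separate clocks, so a key can be blocked on tokens while still under its request cap, or vice versa.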
- Four-tier budget hierarchy: Customer > Team > Virtual Key > Provider Config. When a budget is exhausted at any level, Bifrost can automatically fail over to a cheaper provider instead of returning an error.
- Provider-isolated worker pools: Each provider gets its own worker pool. If OpenAI starts rate limiting you, the backpressure stays contained to the OpenAI pool. Your Anthropic and Gemini traffic keeps flowing normally.
- Backpressure policies: When a provider's queue fills up, you configure the behavior: `drop` (discard the request), `block` (wait for queue space), or `error` (return immediately with an error). This is configurable per provider.
- Automatic failover: On 429s, 5xx errors, network errors, or timeouts, Bifrost automatically routes to the next provider in your fallback chain.
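The interaction of per-provider queues and backpressure policies can be sketched in a few lines. This is a simplified model, not Bifrost's code; the `ProviderQueue` class and its return values are hypothetical.

```python
from collections import deque

class ProviderQueue:
    """Sketch of a bounded per-provider queue with the three backpressure
    policies described above: drop, block, error. Hypothetical model,
    not Bifrost's actual implementation."""

    def __init__(self, capacity, policy="error"):
        self.capacity = capacity
        self.policy = policy
        self.queue = deque()

    def enqueue(self, request):
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return "queued"
        # Queue is full: the configured policy decides what happens.
        if self.policy == "drop":
            return "dropped"  # silently discard the request
        if self.policy == "error":
            raise RuntimeError("provider queue full")  # fail fast
        # "block" would wait for space; a real gateway uses a blocking
        # put with a timeout rather than returning immediately.
        return "blocked"

# Each provider gets its own isolated queue, so OpenAI backpressure
# never stalls Anthropic traffic.
pools = {
    "openai": ProviderQueue(2, policy="drop"),
    "anthropic": ProviderQueue(2, policy="error"),
}
pools["openai"].enqueue("req-1")
pools["openai"].enqueue("req-2")
print(pools["openai"].enqueue("req-3"))  # dropped
```

Because the pools dictionary keys are independent objects, a full `openai` queue has no effect on whether `anthropic` requests are admitted, which is the isolation property described above.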
- Retry with exponential backoff: Up to 5 retries with exponential backoff, starting at a 1ms initial delay and capping at a 10-second max delay. This is configured in `network_config`:
```json
{
  "network_config": {
    "max_retries": 5,
    "retry_backoff_initial_ms": 1,
    "retry_backoff_max_ms": 10000
  }
}
```
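The schedule this config implies can be computed directly. The doubling growth factor below is an assumption (Bifrost's exact multiplier is not stated above); the point is how the initial delay and the cap interact.

```python
def backoff_delays(max_retries=5, initial_ms=1, max_ms=10000):
    """Compute an exponential backoff schedule from the network_config
    fields above. Doubling per attempt is an assumed growth factor."""
    delays = []
    delay = initial_ms
    for _ in range(max_retries):
        delays.append(min(delay, max_ms))  # never exceed the cap
        delay *= 2
    return delays

print(backoff_delays())  # [1, 2, 4, 8, 16]
```

With the default 5 retries the cap is never reached; it only matters for long retry chains, where `min(delay, max_ms)` pins every later attempt at 10 seconds.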
Why this matters for rate limiting specifically: Most gateways treat rate limiting as a single feature. In Bifrost, rate limiting is layered across virtual keys, budgets, provider pools, and backpressure, so you have fine-grained control over how your system behaves when any provider starts pushing back.
Zero-config deployment via npx or Docker. 19 providers supported out of the box.
2. Portkey
Best for: Teams that want a managed gateway with a visual dashboard
Portkey is a managed AI gateway that handles rate limiting through its virtual keys and provider fallback system. It supports automatic retries, load balancing across providers, and budget controls.
What it does well:
- Clean dashboard for monitoring rate limit events across providers
- Virtual keys with spend limits
- Automatic retries with configurable strategies
- Fallback chains across providers
- Caching to reduce the total number of requests hitting provider rate limits
Trade-offs:
- Managed service, so your requests route through Portkey's infrastructure
- Pricing scales with usage
- Less granular control over backpressure behavior compared to self-hosted options
If your team prefers a managed solution and does not mind the additional network hop, Portkey is solid.
3. LiteLLM
Best for: Python teams that want quick setup with broad provider support
LiteLLM is the most popular open-source Python-based LLM gateway. It supports 100+ LLM providers through a unified interface and includes rate limiting, retry logic, and budget management.
What it does well:
- Broadest provider support in the ecosystem
- Built-in rate limiting with Redis-backed tracking
- Budget management per API key
- Active open-source community
- Good documentation
Trade-offs:
- Python runtime adds measurable latency overhead (roughly 8ms per request based on benchmarks, compared to 11 microseconds for Go-based alternatives)
- At high concurrency, the Python GIL becomes a real constraint
- The gateway itself can become a bottleneck under rate limit storms when it needs to make many fast routing decisions
LiteLLM is the right choice if your team is Python-native and your traffic volume is moderate. For high-throughput production workloads where the gateway needs to absorb rate limit spikes without adding latency, the Python overhead is worth considering.
4. Kong AI Gateway
Best for: Teams already using Kong for API management
Kong AI Gateway extends the existing Kong API gateway platform with AI-specific plugins for rate limiting, authentication, and provider routing.
What it does well:
- Enterprise-grade API gateway with years of production hardening
- Plugin ecosystem for rate limiting, authentication, logging
- Fine-grained rate limiting policies (per consumer, per route, per service)
- Existing infrastructure teams likely already know Kong
- Strong enterprise support
Trade-offs:
- AI-specific features are newer and less mature than core Kong
- Configuration complexity is higher than purpose-built AI gateways
- The rate limiting plugins were designed for REST APIs, not specifically for LLM token-based rate limiting
- Token-aware rate limiting (tracking tokens consumed, not just requests) requires additional configuration
Kong is a good choice if your organization already runs Kong and wants to add AI routing without deploying a separate gateway. The rate limiting is robust but API-oriented; you may need custom plugins for token-based limits.
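Why token-aware limiting matters for LLM traffic, in one piece of arithmetic: request-count limits treat a 300-token prompt and a 6,000-token prompt identically, while the provider's actual ceiling is on tokens. The function name below is hypothetical, just to make the comparison concrete.

```python
def requests_allowed_under(token_budget, avg_tokens_per_request):
    """Token-aware limiting caps tokens, not requests: the same token
    budget admits very different request counts depending on payload
    size. Illustrative arithmetic only."""
    return token_budget // avg_tokens_per_request

# The same 90k tokens-per-minute provider budget...
print(requests_allowed_under(90000, 300))   # 300 short requests/min
print(requests_allowed_under(90000, 6000))  # 15 long requests/min
```

A pure request-per-minute plugin tuned for the short-prompt case would let the long-prompt workload blow through the provider's token ceiling by a factor of 20, which is why token tracking needs first-class support rather than a fixed request cap.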
5. Cloudflare AI Gateway
Best for: Teams already on Cloudflare's edge network
Cloudflare AI Gateway runs at the edge, sitting between your application and LLM providers. It provides caching, rate limiting, and observability.
What it does well:
- Edge deployment means low latency to most users globally
- Built-in caching reduces total requests to providers (fewer rate limit hits)
- Simple setup if you are already on Cloudflare
- Request logging and analytics
- Cost tracking
Trade-offs:
- Rate limiting is more basic compared to purpose-built AI gateways
- No provider-isolated worker pools or backpressure handling
- Fallback/failover capabilities are limited
- Less granular control over retry behavior
- Tied to Cloudflare's ecosystem
Cloudflare AI Gateway works well as a caching and observability layer. For advanced rate limit handling with automatic failover and backpressure management, you will likely need to pair it with another solution.
How to Choose
The right gateway depends on your constraints:
| Criteria | Bifrost | Portkey | LiteLLM | Kong | Cloudflare |
|---|---|---|---|---|---|
| Self-hosted | Yes | No | Yes | Yes | No |
| Latency overhead | 11us | Network hop | ~8ms | Low | Edge |
| Provider-isolated pools | Yes | No | No | No | No |
| Backpressure policies | Yes | No | No | Plugin-based | No |
| Token-aware rate limits | Yes | Yes | Yes | Plugin | Basic |
| Budget hierarchy | 4-tier | Basic | Per-key | Per-consumer | No |
| Auto failover on 429 | Yes | Yes | Yes | Plugin | Limited |
If rate limiting is a primary production concern, meaning you are hitting provider limits regularly and need fine-grained control over how your system degrades, Bifrost gives you the most control at the lowest overhead. The provider-isolated worker pools and configurable backpressure policies are specifically designed for this problem.
Check out the docs to get started. Deployment takes under two minutes with npx or Docker.
Built by the team at Maxim AI. Bifrost is open-source and free to use.