API Rate Limiting: Patterns That Scale

#sre #devops #api #ratelimit

Rate limiting is one of those topics where everyone knows the basics and almost nobody gets it right at scale. Let me walk through the patterns that actually work.

The algorithms, briefly

Token bucket. Each client has a bucket that refills at a fixed rate, up to a maximum capacity. Each request consumes a token. Empty bucket = rejected. Good for allowing short bursts while capping the sustained rate.
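
Here is a minimal sketch of a token bucket for one client. The class name, field names, and the choice to pass the current time in as `now` (to keep it testable) are my assumptions, not a reference implementation.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float         # maximum burst size, in tokens
    refill_rate: float      # tokens added per second
    tokens: float = 0.0
    updated_at: float = 0.0

    def __post_init__(self) -> None:
        # Start full so a new client can use its whole burst allowance.
        self.tokens = self.capacity

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `capacity=3` and `refill_rate=1.0`, three back-to-back requests pass, the fourth is rejected, and one more passes a second later.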

Leaky bucket. Requests enter a queue that drains at a fixed rate. Overflowing queue = rejected. Good for enforcing smooth output.
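
One way to sketch a leaky bucket without running a real queue is to hand each accepted request the time at which it would leave the queue. The names below (`LeakyBucket`, `offer`, `next_free`) are hypothetical, and this is only one of several ways to model it.

```python
from typing import Optional


class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float) -> None:
        self.capacity = capacity           # max requests waiting at once
        self.interval = 1.0 / drain_rate   # seconds between released requests
        self.next_free = 0.0               # earliest time the next request may leave

    def offer(self, now: float) -> Optional[float]:
        """Return the release time for the request, or None if the queue is full."""
        start = max(now, self.next_free)
        # Requests already scheduled between `now` and `start` are still waiting.
        queued = (start - now) / self.interval
        if queued >= self.capacity:
            return None
        self.next_free = start + self.interval
        return start
```

With `capacity=2` and `drain_rate=1.0`, two requests arriving together are released one second apart and a third arriving at the same instant is rejected, which is the smooth output the pattern is meant to enforce.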

Fixed window. Count requests per fixed window (say, 60 seconds) and reset at each boundary. Simple, but a client can spend a full quota just before a boundary and another just after it, briefly hitting twice the limit.
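
A rough sketch of a fixed-window counter makes the boundary problem easy to see. The class and parameter names are assumptions for illustration, and the in-memory dict stands in for whatever store you use.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (client, window index) -> request count. Old windows are never
        # evicted here; a real store would expire them.
        self.counts = defaultdict(int)

    def allow(self, client: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

With a limit of 5, a client can make 5 requests at t=59 and 5 more at t=60: ten requests in about a second, all accepted.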

Sliding window. Count requests over a rolling window that ends at the current request. More accurate, slightly more expensive in memory and CPU.
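
The simplest sliding-window variant I know of is a timestamp log per client. This sketch assumes that variant (a weighted two-counter approximation is another option), and the names are mine.

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the rolling window.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

The extra cost is visible here: memory grows with the limit, because every accepted request keeps a timestamp until it ages out.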

For most APIs, start with token bucket. It handles both rate limits and burst allowances cleanly.

The scaling problem

At low scale, store counters in Redis or memory. At high scale, Redis becomes your bottleneck: every request hits it, and if you have 50k RPS, that's 50k Redis ops per second just for rate limiting.

The fix: local counters with eventual consistency. Each edge node maintains its own counter and syncs with the shared store periodically. You lose perfect accuracy (a client can temporarily exceed limits by ~5-10%, depending on the sync interval and the number of nodes) but gain orders of magnitude of throughput.
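
To make the trade-off concrete, here is a sketch of two edge nodes sharing one limit. `SharedStore` is a hypothetical in-memory stand-in for Redis or similar, `EdgeLimiter` is an illustrative name, and window resets are left out to keep it short.

```python
from collections import defaultdict


class SharedStore:
    """In-memory stand-in for a central store (hypothetical)."""

    def __init__(self) -> None:
        self.totals = defaultdict(int)

    def add_and_get(self, key: str, delta: int) -> int:
        self.totals[key] += delta
        return self.totals[key]


class EdgeLimiter:
    def __init__(self, store: SharedStore, limit: int) -> None:
        self.store = store
        self.limit = limit
        self.pending = defaultdict(int)       # counted locally, not yet synced
        self.known_global = defaultdict(int)  # global total as of the last sync

    def allow(self, key: str) -> bool:
        # Decide from the last known global total plus local unsynced traffic.
        if self.known_global[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def sync(self) -> None:
        # One store round trip per key per sync, instead of one per request.
        for key in set(self.pending) | set(self.known_global):
            self.known_global[key] = self.store.add_and_get(key, self.pending[key])
            self.pending[key] = 0
```

Between syncs each node only sees its own traffic, so two nodes with a shared limit of 10 can admit 6 requests each before either one learns the global total is 12. A shorter sync interval shrinks that overshoot at the price of more store traffic.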

What to rate limit on

By API key. Standard. Easy. Works.

By IP. Useful for anonymous endpoints. Watch out for NAT/corporate IPs that hide thousands of users.

By endpoint cost. Not all endpoints are equal. Your /search endpoint may be 100x more expensive than /health. Rate limit on 'cost units,' not request count, so that a complex query consumes more units than a simple one.
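
A cost-unit budget could look roughly like this. The cost table, the default cost, and the `complexity` multiplier are made-up values for illustration; real costs should come from measurement.

```python
# Hypothetical cost table: units charged per request to each path.
ENDPOINT_COST = {"/health": 1, "/search": 100}
DEFAULT_COST = 10  # assumed cost for paths not listed above


class CostBudget:
    def __init__(self, units: int) -> None:
        self.remaining = units  # units left in the current window

    def charge(self, path: str, complexity: int = 1) -> bool:
        # A more complex query multiplies the base cost of its endpoint.
        cost = ENDPOINT_COST.get(path, DEFAULT_COST) * complexity
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True
```

With a budget of 250 units, a client gets two /search calls, is rejected on the third, and can still afford cheap /health calls with what is left.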

By user behavior patterns. Advanced. Ban clients that look like scrapers based on access patterns, not just rate.

The error response

Return 429 Too Many Requests. Include a Retry-After header. Include X-RateLimit-Remaining. Make your limits discoverable so well-behaved clients can adapt.
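
As a framework-neutral sketch, the response can be built as a status code plus a header dict. The function name is hypothetical, and `X-RateLimit-Limit` is an extra header I am assuming here because it is a common convention alongside `X-RateLimit-Remaining`.

```python
import math
from typing import Dict, Tuple


def rate_limited_response(
    limit: int, remaining: int, retry_after_seconds: float
) -> Tuple[int, Dict[str, str]]:
    headers = {
        # Retry-After takes whole seconds, so round up rather than down.
        "Retry-After": str(math.ceil(retry_after_seconds)),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
    }
    return 429, headers
```

Rounding Retry-After up means a client that obeys it never retries a fraction of a second too early.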

The debugging nightmare

Most rate limit debugging pain comes from clients being unable to tell why they were rate limited. Was it the global limit? Per-key? Per-endpoint? Return enough detail in the error response to say which limit was hit.
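
One possible shape for that detail is a JSON body naming the scope that tripped. The field names and the `explain_rejection` helper are assumptions, not an established format.

```python
import json
from typing import List, Optional, Tuple


def explain_rejection(checks: List[Tuple[str, int, int]]) -> Optional[str]:
    """checks holds (scope, used, limit) tuples, evaluated in order.

    Returns a JSON error body for the first exhausted limit, or None
    if every limit still has room.
    """
    for scope, used, limit in checks:
        if used >= limit:
            return json.dumps(
                {"error": "rate_limited", "scope": scope, "limit": limit, "used": used}
            )
    return None
```

A client that receives `"scope": "per_key"` knows to slow down overall, while `"scope": "per_endpoint"` tells it that only one call is the problem.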

The test

Load test your rate limiter as its own component. I once found a bug where we were rate limiting our own healthchecks, and I found it during a load test, not in prod. That's the right order.
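
A toy version of that kind of check replays mixed traffic and asserts that healthchecks are never rejected. The exempt-path set and the `run_load` helper are hypothetical, and a real load test would drive the actual limiter rather than this stand-in counter.

```python
from typing import Iterable, List

EXEMPT_PATHS = frozenset({"/health"})  # hypothetical exemption list


def run_load(paths: Iterable[str], limit: int) -> List[str]:
    """Replay requests against a simple counter; return the rejected paths."""
    used = 0
    rejected = []
    for path in paths:
        if path in EXEMPT_PATHS:
            continue  # healthchecks bypass the limiter entirely
        if used >= limit:
            rejected.append(path)
        else:
            used += 1
    return rejected
```

The useful assertion is the negative one: after pushing well past the limit, no exempt path appears in the rejected list.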

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com