What Is Rate Limiting?
Rate limiting is a rule: no more than N requests in T time. 100 API calls per minute. 5 login attempts per 15 minutes. 1,000 database writes per second.
At its core, it's a counter with a clock. A request comes in, you check the count, and you either let it through or reject it. That's the whole idea.
Why You Need It
Every system has a capacity — the load it can handle before performance falls off a cliff. Not a gentle slope. A cliff.
A database tuned for 500 writes/sec doesn't get 1% slower at 501. It stays fine at 550, maybe a bit sluggish at 580, and then at 600 the query queue backs up, the connection pool exhausts, and latency goes from 50ms to 8 seconds in under a minute. Recovery takes even longer because the backed-up requests are still draining.
Without rate limiting, three things go wrong:
1. One bad actor takes down everyone. A single misconfigured client retrying in a tight loop can saturate your service. The other 10,000 well-behaved clients suffer equally.
2. Cascading failures amplify the damage. When Service A slows down, Service B (which calls A) starts timing out. B's callers retry. A 20% overload on one service becomes a 300% overload on three.
3. Recovery becomes harder than survival. Even after the spike passes, the queue of backed-up requests keeps the system pinned. Without a way to shed load, you can stay degraded for minutes after the cause is gone.
Rate limiting is the difference between "gracefully reject 5% of traffic during a spike" and "return errors to 100% of traffic for 10 minutes."
Five Algorithms, Five Trade-offs
But "rate limiting" isn't one algorithm. It's five, each with a different trade-off between accuracy, memory, burst tolerance, and implementation complexity. Each one is introduced below with how it works, what you give up, and when it's the right pick.
1. Fixed Window Counter
How it works: Divide time into fixed intervals (e.g., 1-minute windows). Keep a counter per window. Increment on each request. If the counter exceeds the limit, reject. When the window ends, reset to zero.
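As a minimal in-memory sketch (class and parameter names here are illustrative, not a production implementation):

```python
import time

# A fixed window counter: one (window_id, count) pair per key.
# window_size is in seconds; counters reset when the window rolls over.
class FixedWindowLimiter:
    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.counters = {}  # key -> (window_id, count)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window_size)  # which fixed window we're in
        wid, count = self.counters.get(key, (window_id, 0))
        if wid != window_id:           # window rolled over: start fresh
            wid, count = window_id, 0
        if count >= self.limit:
            return False               # over the limit for this window
        self.counters[key] = (wid, count + 1)
        return True
```

The entire state per key is one window ID and one integer, which is where the near-zero memory claim comes from.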
The trade-off: You get simplicity and near-zero memory (~16 bytes per key). You give up accuracy at window boundaries.
Minute 1:                    | Minute 2:
..............90 reqs at :59 | 90 reqs at :00..............
Both windows say "under 100, allowed." But in a 2-second real window spanning the boundary? 180 requests — nearly 2x your limit. In the worst case, a client can get double the allowed rate by timing requests around the boundary.
Where to use it: When the limit is a rough safety net, not a precise guarantee. Login throttling ("5 attempts per 15 minutes") is the classic case — even if someone exploits the boundary to squeeze out 10 attempts, that's still worthless for brute-force. API key quotas where "close enough" is fine. Anywhere you need something working in an hour, not a week.
2. Sliding Window Log
How it works: Instead of counting per window, store the exact timestamp of every request. When a new request arrives, evict all timestamps older than the window duration, then count what's left.
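A minimal in-memory sketch of the same idea (production versions typically keep the log in a Redis sorted set rather than a local deque; names here are illustrative):

```python
import time
from collections import defaultdict, deque

# A sliding window log: store every request's timestamp, evict the stale
# ones on each check, and count what remains.
class SlidingWindowLog:
    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        log = self.logs[key]
        while log and log[0] <= now - self.window_size:
            log.popleft()               # evict timestamps outside the window
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

Note that the memory cost is visible right in the data structure: one stored timestamp per allowed request, per key.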
Perfectly accurate. Zero boundary tricks. The window slides smoothly with every request.
The trade-off: You get perfect precision. You give up memory. You're storing up to limit timestamps per key — typically in a Redis sorted set. At 10,000 req/min across a million users, that's up to 10 billion timestamps. At 8 bytes each, that's ~80 GB of Redis just for rate limiting state.
If you have 2,000 users making 50 requests/day each? The sorted set holds 50 entries per key. Total memory: negligible. But scale that to millions of keys and the math stops working.
Where to use it: When precision matters more than scale. Financial transaction limits where regulatory compliance demands exact counting — the auditor doesn't care about "99.7% accurate." Database write protection where exceeding the threshold causes corruption, not just slowness. Low-volume, high-stakes APIs where "off by one" has real consequences. The key constraint: either the limit is small, or the user count is small. Ideally both.
3. Sliding Window Counter
How it works: A compromise between the first two. Keep counters for the current window and the previous window. Estimate the sliding total using weighted math based on how far into the current window you are.
Limit: 5/min    Current time: 1:18 (18 sec into the current window)

Previous window [0:00-1:00]: 5 requests
Current window  [1:00-2:00]: 3 requests

Weighted total = 5 × (42/60) + 3
               = 3.5 + 3
               = 6.5 > 5 → reject
Two counters per key. ~32 bytes of memory. Not sorted sets of timestamps — two integers.
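The weighted estimate above translates directly into code. A minimal sketch (illustrative names, not a production implementation):

```python
import time

# A sliding window counter: two integers per key, with the previous
# window's count weighted by how much of it still overlaps the
# trailing window.
class SlidingWindowCounter:
    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.counters = {}  # key -> (window_id, current_count, previous_count)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window_size)
        wid, cur, prev = self.counters.get(key, (window_id, 0, 0))
        if window_id == wid + 1:        # rolled into the next window
            wid, cur, prev = window_id, 0, cur
        elif window_id > wid + 1:       # idle long enough that both expired
            wid, cur, prev = window_id, 0, 0
        elapsed = (now - wid * self.window_size) / self.window_size
        weighted = prev * (1 - elapsed) + cur
        if weighted >= self.limit:
            return False
        self.counters[key] = (wid, cur + 1, prev)
        return True
```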
The trade-off: You get near-perfect accuracy at minimal memory cost. You give up a small margin of error. Cloudflare measured 99.7% accuracy against a perfect sliding window. The 0.3% error comes from assuming requests were uniformly distributed in the previous window. In practice, your measurement noise is already larger than that.
Where to use it: When you need accuracy at scale. Millions of keys, limited memory, and you can tolerate a rounding error smaller than your measurement noise. This is the default choice for most production rate limiters. If you're not sure which algorithm to pick, start here.
Cloudflare uses this. AWS API Gateway uses this. Most "rate limit by API key" implementations in production use some variant of this.
4. Token Bucket
How it works: A different mental model. Instead of counting requests in a time window, imagine a bucket that fills with tokens at a steady rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, so tokens don't accumulate forever.
Two parameters: the refill rate (long-term average) and the bucket capacity (maximum burst size).
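Those two parameters are all the state a sketch needs (illustrative names, assuming one token per request):

```python
import time

# A token bucket: refill_rate tokens/sec, capacity caps the burst.
class TokenBucket:
    def __init__(self, refill_rate, capacity, now=None):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity          # start full: a fresh key can burst
        self.last = time.time() if now is None else now

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Refill lazily from the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One design point worth noticing: no background timer fills the bucket. Tokens are computed lazily from the elapsed time at each check, which is how most real implementations avoid per-key timers.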
The trade-off: You get burst tolerance — short spikes are absorbed as long as the long-term average stays within limits. You give up strict per-window guarantees. A user can consume their entire bucket in a single burst, which means the instantaneous rate can be much higher than the average rate.
The key insight: it rewards idle time. A user who hasn't called your API in 5 seconds has accumulated tokens and can burst. This matches how real humans use APIs — idle, idle, idle, click click click click, idle. A window-based counter punishes that pattern. Token bucket embraces it.
Where to use it: When your users are bursty and that's expected. A developer building a dashboard fires 12 parallel API calls on page load, then nothing for 45 seconds. A mobile app syncs on wake. A CLI tool batches requests. In all these cases, the average rate is fine — it's the shape that doesn't fit a fixed window. Token bucket lets legitimate bursts through while still enforcing a long-term ceiling.
Stripe, GitHub, and AWS EC2 all use token bucket for their public APIs. Google's Guava RateLimiter library implements it.
5. Leaky Bucket
How it works: The mirror image of token bucket. Requests pour into a fixed-size queue. The queue drains at a constant rate. If the queue is full when a new request arrives, it's rejected.
With a drain rate of 1 request/sec and a queue of 10: if 50 requests arrive in one second, 1 goes through immediately, 10 wait in the queue, and 39 are rejected. The output is perfectly smooth — always exactly the drain rate, no matter how spiky the input.
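A minimal sketch of the "meter" variant, where the queue is modeled as a level that drains continuously (real queueing implementations, like NGINX's limit_req without nodelay, additionally delay the queued requests; names here are illustrative):

```python
# A leaky bucket as a meter: a level that drains at a constant rate.
# drain_rate is in requests/sec; queue_size caps how much can be pending.
class LeakyBucket:
    def __init__(self, drain_rate, queue_size, now=0.0):
        self.drain_rate = drain_rate
        self.queue_size = queue_size
        self.level = 0.0       # how many requests are currently queued
        self.last = now

    def allow(self, now):
        # Drain at the constant rate for the time since the last check.
        self.level = max(0.0, self.level - (now - self.last) * self.drain_rate)
        self.last = now
        if self.level >= self.queue_size:
            return False       # queue full: reject
        self.level += 1        # admit this request into the queue
        return True
```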
The trade-off: You get perfectly shaped output — no burst ever reaches the downstream system. You give up two things: (1) burst tolerance — even legitimate spikes get queued or rejected, and (2) latency — requests sit in the queue waiting their turn instead of being processed immediately.
The critical difference from token bucket: token bucket allows bursts and limits the average. Leaky bucket eliminates bursts and smooths the output. They look similar on paper but behave very differently under spiky load.
Where to use it: When your downstream truly cannot handle bursts — not even brief ones. An SMS provider that charges 5x for burst traffic. A legacy database that crashes above 100 writes/sec rather than degrading. A hardware device with a fixed processing rate. Network traffic shaping where you need constant bandwidth. Anywhere the shape of the output matters as much as the volume.
NGINX's limit_req module uses leaky bucket by default. Twilio and SendGrid submission queues work this way. Network QoS traffic shapers are leaky buckets over bytes instead of requests.
Choosing the Right One
A rough guide from the trade-offs above: need something simple in an hour? Fixed window. Exact counts for compliance? Sliding window log. Accuracy across millions of keys? Sliding window counter. Bursty but legitimate traffic? Token bucket. A downstream that can't tolerate any burst? Leaky bucket. But there's a sixth option that doesn't fit neatly into that list.
Bonus: Adaptive Throttling — The Self-Tuning Client
What if you don't want to pick a number at all?
Google's SRE Book describes a pattern where the client tracks its own success rate and starts probabilistically dropping requests when the server is struggling:
If 10% of your calls fail, you preemptively drop ~10% client-side. The server gets breathing room. As it recovers, your success rate climbs and you ramp back up automatically.
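The SRE Book's version of this uses a rejection probability of max(0, (requests − K × accepts) / (requests + 1)), where K (typically 2) controls how aggressively the client backs off. A minimal sketch under that formula (class name illustrative; a real implementation would track requests and accepts over a rolling window, not forever):

```python
import random

# Client-side adaptive throttling: the client drops its own requests
# with a probability derived from its recent accept rate.
class AdaptiveThrottle:
    def __init__(self, k=2.0):
        self.k = k
        self.requests = 0   # calls the client attempted
        self.accepts = 0    # calls the backend actually accepted

    def should_send(self):
        p_reject = max(0.0, (self.requests - self.k * self.accepts)
                            / (self.requests + 1))
        return random.random() >= p_reject

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1
```

With K = 2, a healthy backend (accepts ≈ requests) yields a rejection probability of zero, so the throttle costs nothing until the backend actually starts refusing work.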
Use it for: Internal service-to-service calls where you control both sides. No manual tuning. No threshold guessing. The system finds its own equilibrium. Implement it as a client interceptor (gRPC, HTTP middleware) that backs off when the backend returns overload errors.
The Uncomfortable Truth
Rate limiting is admission control — it decides who gets rejected. And rejection has a cost. A rejected API call might mean a failed checkout or a user staring at a spinner.
The algorithms above are tools. The harder question is policy: Who gets limited? Per-IP is easy to circumvent. Per-API-key punishes shared keys. Per-user requires authentication before rate checking. And in a multi-tenant SaaS, your free-tier user and your enterprise customer probably shouldn't share a bucket.
The algorithm is the mechanism. The policy is the product decision. Get both right.