Design a Rate Limiter: 4 Algorithms, 4 Trade-Offs

#redis #systemdesign #backend #architecture

Book: System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You shipped a rate limiter. 100 requests per minute per user. The reviewer approved it. Then a client batches a retry storm, 100 requests land in the same 200ms window at the very end of one minute, another 100 land in the first 200ms of the next, and your "100 per minute" backend just ate 200 requests in under half a second.

The limiter did exactly what you told it. The bug was the algorithm, not the code. "Rate limiter" is four different data structures wearing the same name, and they handle bursts in four different ways. Pick the wrong one and you either let a flood through or you throttle a well-behaved client who happened to hit a window boundary.

What the limiter is actually deciding

Every rate limiter answers one question on every request: allow or reject. The difference between the four algorithms is what state they keep to answer it, and how that state behaves at the edges of a time window.

Two properties separate them:

Burst tolerance. Can a client spend its whole budget at once, or is it forced to spread requests out evenly?
Memory cost per key. A counter is 8 bytes. A log of timestamps is 8 bytes times every request in the window.

The four below trade those two against accuracy. Walk them in order.

Fixed window

The simplest one. Keep a counter per key per time bucket. Increment on each request. Reject when the counter passes the limit. The bucket resets when the clock rolls to the next window.

import redis
import time

r = redis.Redis(decode_responses=True)
LIMIT = 100
WINDOW = 60  # seconds

def allow(user_id: str) -> bool:
    bucket = int(time.time()) // WINDOW
    key = f"rl:{user_id}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW)
    return count <= LIMIT

Cheap. One integer per key per window, and it self-expires. Reads are a single INCR. This is what most teams ship first, and for a lot of internal APIs it's fine.

The flaw is the boundary, which is the bug from the opening. The window is a hard wall at fixed clock times. A client that sends 100 requests at 00:00:59 and 100 more at 00:01:01 sends 200 requests in two seconds and every one is allowed, because they sit in two different buckets. Your stated limit is 100/min. Your real worst case is 2x the limit across any window boundary.

Use it when the limit is a soft guardrail and the 2x edge case won't hurt you. Don't use it for anything where the burst itself is the threat.

Sliding window log

Fix the boundary problem by storing the actual timestamps. Keep a sorted set per key. On each request, drop entries older than the window, count what's left, and decide.

import redis
import time
import uuid

r = redis.Redis(decode_responses=True)
LIMIT = 100
WINDOW = 60

def allow(user_id: str) -> bool:
    key = f"rl:{user_id}"
    now = time.time()
    cutoff = now - WINDOW
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, cutoff)
    pipe.zadd(key, {str(uuid.uuid4()): now})
    pipe.zcard(key)
    pipe.expire(key, WINDOW)
    _, _, count, _ = pipe.execute()
    return count <= LIMIT

This is exact. The window slides continuously with each request, so there is no boundary to game. 100 requests in any rolling 60 seconds, no more.

The cost is memory and write load. You store one sorted-set member per request in the window. At 100/min that's small. At 100,000/min per key it's a sorted set with 100,000 members that you rewrite on every call. The ZREMRANGEBYSCORE plus ZADD plus ZCARD is three operations against a structure that grows with traffic, not a constant. Accurate, but it scales with the thing you're trying to limit.

Reach for it when accuracy matters more than memory and the per-key request rate stays modest. Audit-sensitive limits, paid-tier quotas where over-counting means a refund.

Sliding window counter

A middle ground that most large APIs actually run. Keep two fixed-window counters, the current bucket and the previous one, and weight the previous bucket by how far you are into the current window. You get most of the sliding-window accuracy at fixed-window memory cost.

import redis
import time

r = redis.Redis(decode_responses=True)
LIMIT = 100
WINDOW = 60

def allow(user_id: str) -> bool:
    now = time.time()
    cur_bucket = int(now) // WINDOW
    prev_bucket = cur_bucket - 1
    elapsed = (now % WINDOW) / WINDOW  # 0.0 -> 1.0

    cur_key = f"rl:{user_id}:{cur_bucket}"
    prev_key = f"rl:{user_id}:{prev_bucket}"

    pipe = r.pipeline()
    pipe.incr(cur_key)
    pipe.expire(cur_key, WINDOW * 2)
    pipe.get(prev_key)
    cur, _, prev = pipe.execute()

    prev = int(prev) if prev else 0
    weighted = prev * (1 - elapsed) + cur
    return weighted <= LIMIT

Two integers per key instead of a growing log. The weighting smooths the boundary: a third of the way into the current window, the previous window still counts for two-thirds. The hard 2x cliff of the fixed window flattens into a gentle slope.

It's an approximation. It assumes requests in the previous window were evenly spread, which they weren't. The error is small (Cloudflare published numbers in the low single-digit percent on real traffic) and it's almost always an over-count rather than an under-count, so it errs toward protecting you. For public API rate limits at scale, this is the default for a reason.

Token bucket

The other three count requests against a window. Token bucket models a budget that refills over time, which makes it the only one of the four that lets you tune burst tolerance directly.

A bucket holds up to capacity tokens and refills at refill_rate tokens per second. Each request takes one token. No token, no request. The bucket size is your allowed burst; the refill rate is your sustained throughput. Those are two separate knobs, which is the whole point.

-- token_bucket.lua, run via EVAL for atomicity
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])  -- tokens/sec
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local state  = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now

local refill = (now - ts) * rate
tokens = math.min(capacity, tokens + refill)

local allowed = tokens >= cost
if allowed then
  tokens = tokens - cost
end

redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, math.ceil(capacity / rate) * 2)
return allowed and 1 or 0

Why Lua and not Python here: the read of current tokens, the refill math, and the write back have to be one atomic step. Two clients hitting the same key in parallel from app code would both read the same token count and both decrement it, and you'd hand out the same token twice. Running it as a server-side script makes Redis execute it without interleaving. The same race exists in the sliding-window log version, which is why that one is wrapped in a pipeline, though a pipeline alone isn't truly atomic under contention. For anything where double-spend matters, move the read-modify-write into a Lua script.

The tuning is the feature. Capacity 100, refill 10/sec means a client can fire 100 requests instantly when the bucket is full, then settles to 10/sec as it drains. A mobile app that wakes up and syncs a batch gets to burst; a runaway loop gets clamped to the sustained rate. AWS API Gateway documents token-bucket throttling, and most cloud APIs run a variant of this because it matches how real clients behave: quiet, then bursty, then quiet.

Leaky bucket is the close cousin. Same bucket metaphor, but requests leak out at a fixed rate, like a queue draining at constant speed. It smooths bursts into a steady output stream, which is what you want when the thing downstream can't handle spikes at all (a payment processor with a hard QPS ceiling). Token bucket allows bursts; leaky bucket erases them. Pick by whether your downstream tolerates a spike.

The distributed part nobody mentions until it breaks

All four snippets keep state in Redis for a reason. A rate limiter that holds its counter in process memory is per-instance, not per-user. Run three API replicas with in-memory counters and a "100/min" limit becomes 300/min, because each replica counts independently. The limit silently multiplies by your replica count, and it changes every time you autoscale.

Centralizing the counter in Redis fixes correctness and creates two new problems.

Redis is now in the hot path of every request. A round trip per request adds latency and a hard dependency. Mitigate by deciding what happens when Redis is unreachable. Fail open (allow everything, you're unprotected for the outage) or fail closed (reject everything, you're down). Most public APIs fail open on the limiter so a Redis blip doesn't take the whole service offline; anything protecting a fragile or paid resource fails closed.

def allow_safe(user_id: str) -> bool:
    try:
        return allow(user_id)
    except redis.RedisError:
        return True  # fail open: availability over enforcement

A hot key concentrates on one node. Limits keyed by a shared value (an API-tier key every customer shares, a single popular endpoint) send all their writes to whichever shard owns that key, and that shard's CPU saturates while the rest sit idle. The fix is the same sharding trick used for any hot counter: split the logical limit into N sub-counters across nodes, give each a budget of LIMIT/N, and route a client to one shard consistently. You trade a little accuracy at the edges for spreading the load. For per-user keys this rarely bites; for global or per-tier limits it bites hard.

The reduced accuracy from sharding is usually worth it. A global limit of 1,000,000/min split into 10 shards of 100,000 each can let through slightly more or less than 1,000,000 depending on how clients distribute, but no single node melts. That's the trade you make above a few hundred thousand writes per second on one key.

Pick by use case

Algorithm	Burst handling	Memory per key	Accuracy	Reach for it when
Fixed window	Allows 2x at boundary	1 int	low at edges	internal APIs, soft guardrails
Sliding window log	None, exact	grows with traffic	exact	paid quotas, audit-sensitive
Sliding window counter	Smoothed boundary	2 ints	~1-2% over	public APIs at scale
Token bucket	Tunable burst + sustained	small hash	exact	client-facing APIs, SDKs
Leaky bucket	Erases bursts	small hash	exact	fragile downstream, fixed QPS

The order to think about it: start from the client. If clients are bursty by nature (mobile sync, batch jobs, SDK retries) and the burst is fine, token bucket gives you two knobs to size it. If the downstream can't take a spike at all, leaky bucket. If you just need a public per-user cap that's accurate enough and cheap, sliding window counter. If every request must be counted exactly and traffic per key is low, sliding window log. Fixed window only when the 2x boundary genuinely doesn't matter.

The mistake isn't picking the "wrong" algorithm. It's not noticing they answer the burst question differently, shipping the cheapest one, and discovering the boundary behavior in production when a retry storm walks straight through a limit you thought you had.

If this was useful

Rate limiting sits next to a handful of other building blocks (counters, queues, caches, idempotency keys) that all share the same distributed-state problems: hot keys, fail-open versus fail-closed, the gap between in-process and centralized state. The System Design Pocket Guide: Fundamentals walks through those primitives with the same "what does it cost and where does it break" lens this post used for the four algorithms.

Which limiter is running in front of your API right now, and do you actually know what it does at the window boundary? Worth checking before the next retry storm finds out for you.