- Book: System Design Pocket Guide: Interviews
- Also by me: System Design Pocket Guide: Fundamentals
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ship a rate limiter on a Friday. By Monday you're paged: legitimate batch jobs hit the wall every hour on the dot, and a few aggressive clients drain their quota in under a second and go quiet. Same algorithm, two opposite failure modes, both predicted by the choice you made the week before.
Rate limiting is not a single problem. It's four problems that share a name. Token bucket, leaky bucket, fixed window counter, sliding window — each one optimizes for a different property, and each one breaks in a way the others don't. Pick the wrong one for your traffic shape and either your users complain or your downstream services do.
What you're actually limiting
Before the algorithms, the dimensions. "100 requests per minute" hides three orthogonal questions you have to answer.
Per what? Per user, per IP, per API key, per endpoint, per tenant. The key shape decides the cardinality you store in Redis, which decides the memory budget. 1M users at 100 bytes each is 100 MB. 1M users × 50 endpoints is 5 GB. Decide before you write the Lua script.
Hard or soft? Hard limits return 429 and force the client to back off. Soft limits queue the request and shape it. Public APIs almost always do hard. Internal mesh traffic between services almost always does soft. The algorithm changes with the answer.
Burst or smooth? Most APIs want bursts. A normal user pulls 10 records fast and goes idle. A misconfigured client hammers steady. Token bucket allows bursts; leaky bucket doesn't. If you don't think about this dimension, you'll pick the wrong default.
The whole conversation that follows is about which algorithm fits which set of answers.
Fixed window counter
The simplest one. You have a counter per key, scoped to a time window. Each request increments. If the counter is over the limit, reject. When the window rolls, the counter resets.
import time

def fixed_window_check(redis_client, key: str, limit: int) -> bool:
    # Counter key scoped to the current one-minute window.
    bucket = f"rl:fixed:{key}:{int(time.time()) // 60}"
    count = redis_client.incr(bucket)
    if count == 1:
        # First request in this window: set a TTL so idle keys expire.
        redis_client.expire(bucket, 60)
    return count <= limit
Memory. One integer per key per active window. Cheapest of all four.
Burst behavior. Allows a single huge burst at the boundary. If the limit is 100/min and a client fires 100 requests at 12:00:59 and another 100 at 12:01:00, the algorithm reports both windows are within limits. The downstream service sees 200 requests in two seconds. This is the classic boundary problem and it's why no serious public API ships fixed window alone.
Fairness. Bad. The client that hits at 12:00:00 has a fresh budget; the client that hits at 12:00:59 has zero seconds before reset.
When it wins. Internal counters where you don't care about the boundary effect. Daily quotas where the boundary is so far apart it doesn't matter. Anything where simplicity beats correctness.
Leaky bucket
Picture a literal bucket with a hole. Requests pour in. They drain at a fixed rate. If the bucket is full when a request arrives, the request is dropped. The output rate is always constant.
The implementation usually wraps a queue with a worker that pops at the leak rate. In a single-process service it's trivial; distributed, it's awkward — you need a coordinated drain or a per-shard worker.
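A minimal single-process sketch of that queue-plus-worker shape (names like LocalLeakyBucket, drain_rate, and handle are illustrative, not from any particular library):

import queue
import threading
import time

class LocalLeakyBucket:
    def __init__(self, capacity: int, drain_rate: float, handle):
        self.q = queue.Queue(maxsize=capacity)  # the bucket
        self.interval = 1.0 / drain_rate        # seconds between drains
        self.handle = handle                    # called once per drained request
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request) -> bool:
        try:
            self.q.put_nowait(request)  # bucket has room: queue it
            return True
        except queue.Full:
            return False                # bucket overflows: drop the request

    def _drain(self):
        while True:
            self.handle(self.q.get())   # pop one request...
            time.sleep(self.interval)   # ...at the fixed leak rate

The sleep in the drain loop is what pins the output rate; it's also exactly the piece that needs coordination once you shard.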
Memory. Queue size per key. Configurable, but if you size queues for tail-of-distribution clients you'll waste a lot.
Burst behavior. Zero. The output rate is fixed by design, no matter how the input bursts. This is the strongest of all four for protecting downstream throughput; it's also the most painful for users.
Fairness. Strict FIFO. The 51st request waits for the first 50 to drain.
When it wins. Traffic shaping toward a fragile downstream: payment gateways that fall over above some RPS ceiling, ML inference services that crash under burst. When you need the output rate guaranteed regardless of input shape, leaky bucket is the only one that gives you that. Foojay's guide is right that this is shaping, not user-facing fairness.
Sliding window log
Keep a sorted set of request timestamps per key. On each request, drop timestamps older than the window, count what's left, accept if under the limit. Redis ZADD plus ZREMRANGEBYSCORE plus ZCARD does it in one Lua call.
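A minimal sketch of that one-call shape, assuming an illustrative rl:log: key prefix and a caller-supplied unique member (Redis Lua should stay deterministic, so the uniqueness comes from Python):

import time
import uuid
import redis

LUA_SLIDING_LOG = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
-- Drop timestamps that have slid out of the window.
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) >= limit then
    return 0
end
-- Record this request; ARGV[4] is a unique member for the sorted set.
redis.call('ZADD', key, now, ARGV[4])
redis.call('EXPIRE', key, window)
return 1
"""

def sliding_log_check(r: redis.Redis, key: str, limit: int, window: int = 60) -> bool:
    # register_script once in real code; inline here for brevity.
    script = r.register_script(LUA_SLIDING_LOG)
    args = [time.time(), window, limit, str(uuid.uuid4())]
    return bool(script(keys=[f"rl:log:{key}"], args=args))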
Memory. One entry per request per window per key. At 100/min/user and 1M active users, that's 100M timestamp entries in Redis. Real money.
Burst behavior. Exact. The window genuinely slides; no boundary effect at all.
Fairness. The most fair of all four. Every client gets exactly N requests per any rolling window of duration W.
When it wins. Premium APIs where burst correctness is a contractual obligation. Auth endpoints where a fairness gap gets you a CVE. Anywhere users pay a lot and notice the boundary.
When it loses. Memory. The naive implementation explodes at scale, which is why the next algorithm exists.
Sliding window counter
A hybrid. Keep two fixed-window counters (current minute and previous minute) and weight the previous one by how far you are into the current window. If you're 30 seconds into 12:01 with the limit at 100/min, count = current + 0.5 × previous. Reject if count > 100.
def sliding_counter_check(
    redis_client, key: str, limit: int, window: int = 60
) -> bool:
    now = time.time()
    cur_window = int(now // window)
    prev_window = cur_window - 1
    pos_in_window = (now % window) / window
    cur = int(redis_client.get(f"rl:sc:{key}:{cur_window}") or 0)
    prev = int(redis_client.get(f"rl:sc:{key}:{prev_window}") or 0)
    # Weight the previous window by how much of it still overlaps the
    # rolling window: 30s into a 60s window, half of it counts.
    estimated = cur + prev * (1 - pos_in_window)
    if estimated >= limit:
        return False
    # Note: this read-then-increment isn't atomic; two racing requests can
    # both pass the check. Acceptable for an approximate limiter; the Lua
    # approach used for the token bucket below closes that gap.
    pipe = redis_client.pipeline()
    pipe.incr(f"rl:sc:{key}:{cur_window}")
    pipe.expire(f"rl:sc:{key}:{cur_window}", window * 2)
    pipe.execute()
    return True
Memory. Two integers per key. As cheap as fixed window.
Burst behavior. Approximate. Smooths the boundary effect to within a few percent error in most traffic shapes.
Fairness. Good enough for almost everything. The estimate is wrong when traffic is wildly bimodal across the boundary, but for typical web traffic it's indistinguishable from the exact log.
When it wins. General-purpose API rate limiting where you want sliding-window semantics without the memory bill. This is what most production services should ship by default. Arcjet's analysis makes the same call.
Token bucket
A bucket holds tokens. Tokens refill at a fixed rate. Each request consumes a token. If the bucket is empty, reject.
The bucket has two parameters that fixed window doesn't: capacity (max tokens, controls burst tolerance) and refill rate (tokens per second, controls steady-state). You can have a bucket of 100 tokens that refills at 10/sec — a user can burst 100 requests in a second, then sustain 10 requests/sec until they stop and let the bucket recover.
This decoupling is what makes token bucket the strongest general-purpose algorithm for user-facing APIs (API7 guide). Most real users want exactly this shape: a small burst, then a sustainable rate. Token bucket gives it to them with two tunable knobs instead of one.
Memory. Two values per key (token count and last-refill timestamp).
Burst behavior. Capacity-bounded. Configurable.
Fairness. Per-key. Every user gets the same bucket shape.
When it wins. Pretty much every public API. Most large public API gateways describe their limits with token-bucket-shaped semantics — AWS API Gateway throttling is the most explicitly documented example. The defaults work for almost every traffic shape that matters.
A 60-line distributed token bucket on Redis
Distributed rate limiting on Redis breaks if you read-then-write — two clients race the read and both think there's a token. The fix is a Lua script: Redis runs it atomically, so the read-modify-write is single-shot.
import time
import redis

LUA_TOKEN_BUCKET = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1])
local last_refill = tonumber(data[2])

-- First sight of this key: start with a full bucket.
if tokens == nil then
    tokens = capacity
    last_refill = now
end

-- Lazy refill computed from elapsed time, capped at capacity.
local elapsed = math.max(0, now - last_refill)
local refill = elapsed * refill_rate
tokens = math.min(capacity, tokens + refill)

local allowed = 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
-- Idle keys expire shortly after a full refill would have completed.
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 10)
return {allowed, tokens}
"""
The script computes the lazy refill from elapsed time, decrements if there are enough tokens, and returns whether the request was allowed alongside the remaining balance. The Python wrapper just registers the script and threads the parameters through.
class TokenBucketLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        capacity: int = 100,
        refill_rate: float = 10.0,
    ):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.script = self.redis.register_script(LUA_TOKEN_BUCKET)

    def allow(self, key: str, cost: int = 1) -> tuple[bool, int]:
        now = time.time()
        result = self.script(
            keys=[f"rl:tb:{key}"],
            args=[self.capacity, self.refill_rate, now, cost],
        )
        return bool(result[0]), int(result[1])
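Hypothetical wiring into a request handler; the 429 and Retry-After handling here is illustrative, not part of the class:

import math

limiter = TokenBucketLimiter(redis.Redis(), capacity=100, refill_rate=10.0)

def handle_request(user_id: str):
    allowed, remaining = limiter.allow(key=user_id)
    if not allowed:
        # Out of tokens: tell the client roughly when the next one lands.
        retry_after = math.ceil(1 / limiter.refill_rate)
        return 429, {"Retry-After": str(retry_after)}
    return 200, {"X-RateLimit-Remaining": str(remaining)}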
A few things this script gets right that naive versions don't.
Atomic refill and consume. The whole logic runs inside one EVAL call. No race between checking the bucket and decrementing it.
Lazy refill. You don't need a background worker filling buckets at the refill rate. The refill happens on the next request, computed from the elapsed time. This is what lets you scale to 10M keys on a single Redis cluster without a separate filler service.
Variable cost. A request might cost 1 token (cheap call) or 10 (expensive call). The cost arg covers both. AWS API Gateway exposes per-method throttling that behaves like a token bucket with variable cost, which is how endpoint-tier billing falls out cleanly.
Bounded TTL. Idle keys expire after capacity / refill_rate + 10 seconds. Without it, your rl:tb:* keyspace grows unbounded as users come and go.
What's still missing for production: per-key configuration (different limits for different tiers), graceful degradation if Redis is down (most teams fail-open on the limiter, fail-closed on auth), and metrics on rejection rate (Prometheus counter on each return). All trivial to add. The script is the load-bearing part.
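For the fail-open piece, one hedged shape is a thin wrapper (the name and the choice to catch only connection errors are illustrative):

def allow_or_fail_open(limiter: TokenBucketLimiter, key: str) -> bool:
    try:
        allowed, _ = limiter.allow(key)
        return allowed
    except redis.exceptions.ConnectionError:
        # A limiter outage shouldn't take the API down with it. Let the
        # request through and alert on the error rate instead.
        return True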
Picking the algorithm
The honest decision tree is short.
You're protecting a fragile downstream from any burst. Leaky bucket. Nothing else hard-caps the output rate.
You need exact sliding-window semantics for compliance or premium tiering. Sliding window log, eat the memory cost.
You want most of the sliding-window correctness without the memory bill. Sliding window counter. Default for general API limiting.
You want bursts allowed up to a configurable cap and a sustainable refill rate. Token bucket. Default for user-facing public APIs.
You're rate-limiting an internal counter where the boundary effect doesn't matter. Fixed window. Cheapest, simplest, ship it.
If you're building a new service today and the answer feels ambiguous, ship token bucket. The two knobs cover more traffic shapes than any other algorithm, the Lua script is the ~30 lines you saw above, and you can always swap the algorithm out later — only the limiter interface needs to be stable.
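One possible shape for that stable interface, sketched as a Python Protocol (the RateLimiter name is illustrative):

from typing import Protocol

class RateLimiter(Protocol):
    def allow(self, key: str, cost: int = 1) -> tuple[bool, int]:
        """Return (allowed, remaining balance) for this request."""
        ...

Every limiter in this post can sit behind it, which turns the algorithm swap into a deployment decision rather than a refactor.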
If this was useful
The rate limiter is one of the designs walked end-to-end in System Design Pocket Guide: Interviews, and the supporting machinery (Redis as a primitive, Lua atomicity, distributed counters) is in System Design Pocket Guide: Fundamentals. If you'd rather not learn this on the day a cron job hammers production, both books are written for the developer reaching for a calculator.

