Gabriel Anhaia

Rate Limiting in a 30-Minute System Design Interview


The panel says: design rate limiting for our public API. You have 30 minutes. Three engineers, one product manager, a whiteboard with a 4:13pm clock visible behind the lead's left shoulder. You have done this before on toy problems. You have not done it under a clock with the staff engineer's pen tapping the table.

Here is the walk that lands the offer. Not the perfect design. The one that fits in 30 minutes and survives the follow-up questions.

Minute 0–5: clarify before you draw

Do not start sketching. Ask. Each answer collapses a branch of the design tree.

  • "Public API: authenticated, anonymous, or both?"
  • "Per-user, per-IP, per-API-key, per-endpoint, or layered?"
  • "What limits, like 100 per minute or 10k per hour?"
  • "How many requests per second across the fleet at peak?"
  • "Synchronous reject with 429, or shape with delay?"
  • "Cross-region, or is each region independent?"
  • "Is approximate counting acceptable, or strictly exact?"

The interviewer will say something like: "Authenticated by API key. 1,000 RPS at peak. Per-key limit of 600 requests per minute, with a burst of 100. Reject with 429. Approximate is fine within a few percent."

Now you can size it. 1,000 RPS times 86,400 seconds is about 86 million counter writes per day. A single Redis primary handles that with headroom. You write "~86M writes/day, hot keys possible" on the board. The interviewer nods. You have just told them you can do back-of-the-envelope math without a calculator.

Minute 5–12: pick the algorithm and defend it

Four candidates. You name them, then pick one. Do not lecture.

Fixed window. Count requests per key per minute. Reset on the minute boundary. One integer per key. Cheapest possible. Fails on the boundary: a client can fire 600 requests at 12:00:59 and another 600 at 12:01:00. Each window sees a compliant 600, but the client pushed 1,200 through in about two seconds, double the intended rate.

Sliding log. Store every request timestamp in a sorted set. Trim entries older than the window. Exact and fair, but memory scales with traffic times window length: at 1,000 RPS across the fleet with a 60-second window, that is roughly 60k live timestamps at any moment, plus per-key sorted-set overhead. Raise the limits or the window and memory blows up.

Sliding window counter. Two fixed-window counters; weight the previous one by how far into the current window you are. Approximate, but the error is bounded under a few percent for reasonable traffic. Cheap. Smooths the boundary problem.
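
Written out (this is the standard formulation of the estimator, not something the interviewer dictated): if you are a fraction f of the way into the current window,

estimated = prev_count * (1 - f) + curr_count
allow if estimated < limit

With prev = 400, curr = 200, and f = 0.25, the estimate is 400 * 0.75 + 200 = 500, still under the 600 limit.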

Token bucket. A bucket of N tokens, refilled at R per second, capped at burst size B. Each request consumes a token. Allows controlled bursts, which is what most public APIs actually want.

You pick token bucket. Reason out loud: "The customer asked for a burst of 100, which is exactly the parameter token bucket exposes. It is also the model described in public rate-limit docs from AWS API Gateway, Stripe, and GitHub, which means our integrators already understand the headers."

The interviewer asks why not leaky bucket. You say: leaky bucket smooths output to a constant rate, which is great for downstream protection but does not let a well-behaved client spike to 100 in a second. Token bucket gives you the burst budget the spec asked for. Same memory cost. Different shape.

You write on the board:

tokens(now) = min(B, tokens(last) + R * (now - last_refill))
allow if tokens >= 1, then tokens -= 1

Minute 12–20: the distributed Redis sketch

Single-instance is easy. Distributed is the real interview.

You draw three API servers behind a load balancer, all talking to one Redis primary with one read replica. The replica is for failover, not reads. Rate limit reads must be consistent.

[client] -> [LB] -> [api-1, api-2, api-3] -> [redis primary]
                                                |
                                            [redis replica]

Each request runs an atomic check-and-decrement against Redis. The naive version is a race:

  1. GET tokens:<key> returns 47.
  2. Two API servers both see 47.
  3. Both decide to allow, both SET to 46.
  4. You leaked a token.

You fix this with a Lua script. Redis evaluates the whole script atomically, so there is no read-modify-write race, and you still pay only one round trip.

-- KEYS[1] = bucket key, e.g. "rl:apikey:abc123"
-- ARGV[1] = capacity (burst), ARGV[2] = refill_per_sec
-- ARGV[3] = now (unix ms),    ARGV[4] = cost (usually 1)

local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local data = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(data[1])
local ts     = tonumber(data[2])

if tokens == nil then
  tokens = capacity
  ts = now
end

local delta = math.max(0, now - ts) / 1000.0
tokens = math.min(capacity, tokens + delta * refill)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call("HMSET", KEYS[1], "tokens", tokens, "ts", now)
redis.call("PEXPIRE", KEYS[1], 120000)
return { allowed, tokens }

A few details to call out as you write it. The PEXPIRE keeps idle keys from filling memory. The hash stores tokens as a float so partial refills accumulate correctly. now comes from the caller: Redis historically restricted non-deterministic commands like redis.call("TIME") in scripts that write (scripts were replicated verbatim), and the API server's clock is good enough here. Accept a few milliseconds of skew.

Then the Python caller. Boring on purpose:

import redis
import time

LUA = open("token_bucket.lua").read()
r = redis.Redis(host="redis-primary", decode_responses=True)
script = r.register_script(LUA)

def allow(api_key: str, burst=100, refill=10.0) -> bool:
    now_ms = int(time.time() * 1000)
    allowed, _ = script(
        keys=[f"rl:{api_key}"],
        args=[burst, refill, now_ms, 1],
    )
    return allowed == 1

refill=10.0 gives you 600 per minute. burst=100 matches the spec. One Redis round trip per request. At 1,000 RPS and a sub-millisecond Redis hop, you spend about 0.5ms of latency per call. The interviewer can do that math too.

Minute 20–25: hot keys and the scaling questions

The staff engineer leans forward. "What happens when one key does 50,000 RPS?"

Single Redis primary tops out at maybe 100k ops/sec for a Lua script of this size. One enterprise customer running a backfill can saturate the node. You need an answer.

Three layers, in order of how much you should reach for them:

1. Local pre-check with periodic sync. Each API server keeps an in-memory token bucket per hot key, configured to a fraction of the global limit. Reject locally without touching Redis when the local bucket is empty. Sync the leftover budget to Redis every 100ms. You trade a small amount of fairness across servers for a roughly 10x cut in load on the central counter.

2. Sharding by key. Run a Redis cluster. The bucket key hashes to a shard, so one hot tenant lands on a single node while the cold tenants spread across the rest. The hot key can still saturate its own shard, but the blast radius is capped at one shard's capacity.

3. Two-tier counters. Split the global limit into N sub-buckets, one per API server. Server-1 owns 1/N of the budget locally. When it runs out, it requests another slice from Redis. Inspired by approaches like the one Cloudflare has described. Complex; only reach for this if the simpler tiers fail your math.

You name layer 1 as the answer for this question and mention 2 and 3 exist. Do not implement them on the board unless asked.
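
Off the whiteboard, layer 1 is small enough to sketch. This reuses the allow() caller from earlier; the static 1/n_servers split and the names here are illustrative, and it skips the 100ms budget sync, which you would add back to recover fairness when traffic is uneven across servers:

import threading
import time

class LocalPreFilter:
    """Per-process token bucket that sits in front of the Redis check.

    Each server holds a 1/n_servers slice of the global rate. Requests
    that exhaust the local slice are rejected without a Redis round
    trip, so one hot key cannot saturate the primary.
    """

    def __init__(self, burst=100, refill=10.0, n_servers=3):
        self.capacity = burst / n_servers
        self.refill = refill / n_servers
        self.buckets = {}  # api_key -> (tokens, last_refill_ts)
        self.lock = threading.Lock()

    def allow_local(self, api_key: str) -> bool:
        now = time.monotonic()
        with self.lock:
            tokens, ts = self.buckets.get(api_key, (self.capacity, now))
            tokens = min(self.capacity, tokens + (now - ts) * self.refill)
            allowed = tokens >= 1
            # Eviction of idle keys is omitted for brevity.
            self.buckets[api_key] = (tokens - 1 if allowed else tokens, now)
            return allowed

pre_filter = LocalPreFilter()

def allow_with_prefilter(api_key: str) -> bool:
    if not pre_filter.allow_local(api_key):
        return False        # rejected locally, zero Redis traffic
    return allow(api_key)   # the global Redis check from earlier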

Minute 25–30: gotchas, headers, and the things that bite

The product manager asks "what could go wrong?" You give them the four that actually matter:

Clock skew across API servers. Token-bucket refill math is now - last_refill. If server A's clock is 800ms ahead of server B and both touch the same bucket, the refill delta is wrong and the bucket can over-refill. Run NTP, accept a few milliseconds of error, and keep the math.max(0, now - ts) clamp in the script so a lagging clock can never produce a negative refill.

Returning the right headers. A client that gets 429 with no metadata retries blindly and makes the spike worse. Return RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset (the IETF draft, not the older X- forms). Honest clients back off; your traffic shape gets better for free.
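
A sketch of the header plumbing. The helper name is mine, and treating Reset as time-until-the-bucket-is-full is one reasonable interpretation for a token bucket, not something the draft mandates:

import math

def rate_limit_headers(limit: int, remaining: float, refill_per_sec: float) -> dict:
    # Seconds until the bucket is completely refilled, rounded up.
    reset_s = math.ceil(max(0.0, limit - remaining) / refill_per_sec)
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, int(remaining))),
        # On a 429, mirror this value into Retry-After as well, so
        # clients that only know the standard header also back off.
        "RateLimit-Reset": str(reset_s),
    }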

The thundering herd at reset. If the limit resets on the minute, every blocked client retries at :00. Token bucket avoids this by design: refill is continuous. If you ever drop back to fixed window, jitter the reset by a per-key hash, as in the one-liner below.
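
The jitter itself is tiny; crc32 here is just a cheap, stable hash, and any deterministic hash works:

import zlib

def reset_jitter_seconds(api_key: str, spread_s: int = 10) -> float:
    # Stable per key: the same client always gets the same offset,
    # but the fleet's resets smear across the spread window.
    return (zlib.crc32(api_key.encode()) % (spread_s * 1000)) / 1000.0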

Fail-open vs fail-closed. Redis goes down. You either let traffic through (fail-open, risk overload) or block everything (fail-closed, full outage). Most public APIs fail-open with a circuit breaker and aggressive alerting, because rate limiting is a cost-control feature, not a security control. Auth and abuse detection sit on a different layer that fails closed. Say this distinction out loud, because interviewers care.
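
Wired into the earlier caller, fail-open is a few lines. The bare exception handler stands in for a real circuit breaker, which would also stop calling Redis for a cooldown period after repeated failures:

def allow_fail_open(api_key: str) -> bool:
    try:
        return allow(api_key)
    except redis.RedisError:
        # Fail open: Redis is unreachable, so let the request through
        # and alert. Auth and abuse detection fail closed elsewhere.
        return True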

You leave the room with five minutes of buffer, a board that reads top-to-bottom, and a Lua snippet that would actually run. That is the bar.

If this was useful

Rate limiting is one of fifteen full system designs in the System Design Pocket Guide: Interviews, each broken down at the same minute-by-minute cadence: clarify, sketch, defend, scale, gotchas. If your next loop has a system design round and you want a working pattern instead of a list of buzzwords, it is built for that.

