Building a distributed rate limiter taught me that atomicity, distributed state, and demo fidelity matter more than which algorithm you pick. Here's what I found when I actually read my own code closely.
"Design a rate limiter" is one of those system design interview questions that sounds simple until you actually try to build one correctly. I built a distributed rate limiter with three different algorithms, Redis as the shared state store, a circuit breaker for resilience, and a full observability stack with Prometheus and Jaeger. The README calls it production-grade. Going back through the code carefully to write this post, I found three places where that label needs an asterisk, and one bug in the demo dashboard that was quietly faking its own results. All three are useful lessons, and I think they're more interesting than the algorithms themselves.
The three algorithms, briefly
Token bucket gives each client a bucket that holds up to rate tokens and refills continuously based on elapsed time. Requests cost tokens, and you're allowed as long as you have enough in the bucket. It's the most burst-friendly option since a quiet client can save up tokens and spend them all at once.
Fixed window counts requests inside a clock-aligned time slice, like "this minute," and resets the count when the slice rolls over. It's the cheapest of the three in Redis, a single INCR plus an EXPIRE, but it has the well-known boundary spike problem: a client can send a full window's worth of requests in the last second of one window and another full window's worth in the first second of the next, doubling their effective rate right at the seam.
Sliding window fixes the boundary problem by tracking the actual timestamp of every request in a Redis sorted set, trimming anything older than the window on every check, and counting what's left. It's the most accurate and the most expensive, both in Redis round trips and in memory, since it has to store one entry per request rather than one integer per window.
Where "atomic" stops being true
The README states that Redis operations are atomic with no race conditions. For fixed window, that's accurate, INCR is a single atomic command, full stop. For token bucket, it isn't quite true, and the gap is instructive.
The token bucket check reads the current token count and the last refill timestamp with two separate, unpipelined GET calls, computes the new token count in Python, and only then writes the result back with a pipelined SET. Atomicity in Redis describes what happens to a single command on the server. It says nothing about what happens between two separate round trips made from application code. If two requests from the same client arrive close enough together, both can read the same pre-update token count, both compute "yes, I have enough," and both get allowed, when only one of them should have been. The data store is atomic. The read-then-write sequence built on top of it is not, and that distinction is exactly the kind of thing that's easy to miss because the word "Redis" is doing a lot of reassuring work in your head that it hasn't actually earned here.
The fix is to push the whole read-compute-write sequence into a single Lua script executed with EVAL, so Redis runs it as one atomic unit server-side rather than as three separate network round trips from the client. That's the standard pattern for correct token bucket implementations in Redis, and its absence here is a fair thing to flag in an interview if someone hands you this code and asks what you'd change.
A subtler bug: punishing yourself for retrying
The sliding window implementation has a quieter issue. It adds the current request's timestamp into the sorted set in the same pipeline that counts existing entries, before it has decided whether to allow or block the request. That means a blocked request still gets written into the window. If a client gets rate limited and retries aggressively, every retry adds a fresh timestamp to its own sliding window, which keeps the window full of recent entries and can keep the client blocked well past when it would have naturally recovered if it had just waited. The algorithm doesn't distinguish between "this request consumed quota" and "this request was rejected." Whether a rejected request should count against the window at all is a real design decision, and here it was decided implicitly by where a line of code happened to sit in the function, not on purpose.
A distributed rate limiter with a circuit breaker that isn't distributed
The architecture has two FastAPI replicas behind an nginx load balancer, which is the whole point of calling this a distributed rate limiter. But the circuit breaker that protects against Redis failures keeps its failure count, success count, and state as plain Python instance attributes inside each process. Each replica has its own circuit breaker with no shared knowledge of the other one.
That means if Redis starts failing, one replica can independently accumulate five failures and open its circuit while the other replica, having happened to receive a slightly different mix of requests, is still on its third failure and sending full traffic at a Redis instance that's already struggling. Depending on which replica nginx routes you to, you could see completely different failure behavior for the exact same backend outage. For a project explicitly named "distributed," having a circuit breaker whose state doesn't travel past the boundary of one process is the kind of gap that's worth being able to name out loud, even if the honest fix, storing circuit state in Redis itself, adds a dependency on the very system the circuit breaker exists to protect against.
What's wired up, and what's just the schema
There's a full SQLAlchemy schema for persisting rate limit rules and historical metrics in PostgreSQL, with tables and columns ready to go. The actual rule read and write path in the API, though, operates on a plain in-memory Python dictionary that gets reset to hardcoded defaults every time the service restarts. The code comment next to those defaults says exactly that: hardcoded for MVP. Nothing in the live request path queries or writes that PostgreSQL schema. The README's own roadmap lists "database-backed hot-reload rules" as a future enhancement, so this isn't a hidden gap, it's an honestly labeled one. But it's a good reminder that a schema existing in the codebase and a feature being live are two different claims, and a README full of checkmarks can make the first one read like the second if you don't go look.
The demo that was generating its own results from a coin flip
This is the one that made me laugh a little when I found it. The React dashboard has an interactive "Demo" page meant to visually show the three algorithms behaving differently as you crank up request volume. Until a recent commit, the page wasn't running any of the three algorithms at all. It was deciding whether to "allow" each simulated request with Math.random() > (blockedSoFar / totalSoFar), a formula that produces a plausible-looking, self-correcting block rate with zero actual rate limiting logic behind it. Switch the dropdown to sliding window and the chart would shift slightly because the displayed latency number changed, not because a different algorithm was running.
The fix replaces that with real client-side implementations of all three algorithms, a refilling token count, a clock-aligned window counter, and an actual sliding array of timestamps, so the visual you watch is the algorithm you selected. The lesson isn't really about the bug itself. It's that a demo can look completely correct, animate smoothly, produce sensible-looking numbers, and still be approximating instead of implementing, and the only way I caught it was reading my own frontend code with the same suspicion I'd apply to someone else's pull request.
Why this is the part worth talking about in an interview
If someone hands you a "design a rate limiter" question, picking an algorithm is the easy fifteen percent. The interesting part, and the part these four issues all point at, is whether the thing you build is actually correct under concurrency, whether the parts you call distributed actually share their state, whether the features in your README are wired into the live path or just scaffolded, and whether the demo you'd show someone is honestly running the system you built or just performing a convincing impression of it. Those are the questions I'd rather be asked, and now, having gone looking, they're the ones I can actually answer about my own code.
Top comments (0)