Most rate-limiter tutorials show you a tidy little token bucket that works perfectly — on one machine. Then you deploy to production, where you're running three copies of your app behind a load balancer, and the limiter quietly stops doing its job. Nobody gets an error. Nothing crashes. Your "100 requests per minute" just silently becomes 300, and you don't find out until something downstream falls over.
This post explains why that happens, shows a small demo of it, and walks through the one change that fixes it.
The limiter that works on your laptop
Here's a textbook in-memory token bucket. The maths is correct: tokens refill at a fixed rate, a request spends one, and you reject when the bucket is empty.
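import time


class TokenBucketLimiter:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last = time.monotonic()      # time of the last refill

    def allow_request(self):
        # Refill for the time elapsed since the last call, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        # Spend one token if available; otherwise reject.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False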
On a single process, this is fine. The problem is the word single.
Problem one: every server counts in private
The state — self.tokens — lives in the memory of one process. Run two copies of your app and each has its own bucket. The limit you think you set gets multiplied by however many instances, workers, or containers you're running:
intended limit: 100/min
3 servers: 300/min (each counts on its own)
This isn't a bug in the code. It's the code doing exactly what in-memory state does: not sharing. To enforce one limit across many servers, the count has to live somewhere all of them can see — like Redis.
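To make the multiplication concrete, here's a minimal sketch that puts three copies of the bucket behind a hypothetical round-robin load balancer. The names `servers` and `granted` are illustrative, and refill is set to zero so the count depends only on the burst, not on timing.

```python
import time


class TokenBucketLimiter:
    """Same in-memory bucket as above; all state lives in this one object."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow_request(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Three "servers", each with its own private bucket.
# Refill is disabled (0.0) so the result is deterministic.
servers = [TokenBucketLimiter(capacity=100, refill_rate=0.0) for _ in range(3)]

# A hypothetical round-robin balancer spreads 600 requests across them.
granted = sum(servers[i % 3].allow_request() for i in range(600))

print(f"intended limit: 100, actually granted: {granted}")
# prints "intended limit: 100, actually granted: 300"
```

Each bucket correctly stops at 100. The total is wrong only because nothing ties the three counts together.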
Problem two: even with shared state, the naive fix still leaks
So you reach for Redis and write the obvious thing: read the token count, do the maths in Python, write it back.
tokens = int(r.get(key) or capacity) # read
# ... refill + check in Python ...
r.set(key, tokens - 1) # write
This looks shared, and it is — but it's still wrong, because the read and the write are two separate trips to Redis with your Python logic in between. Under concurrency, two requests can both read the same balance before either writes back, and both decide they're allowed. That's a classic read-modify-write race, and it gets worse the more traffic you have — exactly when you need the limiter most.
How bad is it? Here's a tiny experiment: fire 50 concurrent requests at a bucket with a capacity of 10.
capacity 10 · 50 concurrent requests
naive read-modify-write -> granted 42 (over by 32)
Forty-two grants from a bucket that should allow ten. The limiter isn't limiting.
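You can reproduce the race without Redis. The sketch below is an in-process stand-in, not the experiment above: a plain dict plays the role of the shared key, and a `threading.Barrier` forces the worst-case interleaving, where every request reads the balance before any request writes it back. All names are illustrative.

```python
import threading

CAPACITY = 10
REQUESTS = 50

store = {"tokens": CAPACITY}            # stand-in for the shared Redis key
barrier = threading.Barrier(REQUESTS)   # forces all reads before any write
results = []                            # one entry per granted request


def naive_request():
    tokens = store["tokens"]            # read (first trip to the store)
    barrier.wait()                      # every other request has now read too
    if tokens >= 1:                     # check against a stale balance
        store["tokens"] = tokens - 1    # write (second trip to the store)
        results.append(True)


threads = [threading.Thread(target=naive_request) for _ in range(REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = len(results)
print(f"capacity {CAPACITY}, granted {granted}, balance left {store['tokens']}")
# prints "capacity 10, granted 50, balance left 9"
```

All fifty requests see a balance of 10, all fifty are granted, and the stored balance ends at 9. A real run interleaves less neatly, so it usually over-grants by somewhat less than this forced worst case, as with the 42 above.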
The fix: make the whole decision atomic
The reason it leaks is that the decision is spread across multiple Redis calls. The fix is to make the entire read-check-spend happen as one indivisible operation on the Redis server — using a Lua script, which Redis executes atomically. No other request can interleave between the read and the write, because to Redis it's a single command.
-- runs atomically on the Redis server
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens'))
-- (refill from elapsed time, clamp to capacity) ...
if tokens >= 1 then
    redis.call('HSET', KEYS[1], 'tokens', tokens - 1)
    return 1 -- allowed
end
return 0 -- rejected
Same experiment, same burst, with the decision moved into one atomic script:
capacity 10 · 50 concurrent requests
atomic Lua script -> granted 10 (holds)
Exactly ten. The line holds, no matter how many requests arrive at once or how many servers they come from.
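The same guarantee can be seen in-process. In the sketch below a `threading.Lock` stands in for the atomicity Redis gives a Lua script: the read, the check, and the spend happen as one step that no other request can interleave with. The dict and function names are illustrative, and refill is left out to keep the count deterministic.

```python
import threading

CAPACITY = 10
REQUESTS = 50

store = {"tokens": CAPACITY}            # stand-in for the shared Redis key
atomic = threading.Lock()               # stand-in for Lua script atomicity
start = threading.Barrier(REQUESTS)     # release all requests at once
results = []                            # one entry per granted request


def atomic_request():
    start.wait()                        # all 50 requests arrive together
    with atomic:                        # read-check-spend as a single step
        tokens = store["tokens"]
        if tokens >= 1:
            store["tokens"] = tokens - 1
            results.append(True)


threads = [threading.Thread(target=atomic_request) for _ in range(REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = len(results)
print(f"capacity {CAPACITY}, granted {granted}, balance left {store['tokens']}")
# prints "capacity 10, granted 10, balance left 0"
```

A process-local lock only helps inside one process, of course. Across servers, the single step has to happen where the shared state lives, which is what the Lua script does.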
Two details worth getting right while you're in there: read the current time from Redis (its TIME command) rather than each app server's clock, so independent servers don't disagree about elapsed time; and decide explicitly what happens if Redis is unreachable — fail closed to protect the backend, or fail open to keep serving. That's a real decision, not a default to stumble into.
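Here's one way those two details might look together. Treat it as a sketch under assumptions, not the packaged implementation: the `last` field, the `ARGV` order, the `RedisTokenBucket` class, and the `fail_open` and `error_types` parameters are all names I've made up for illustration. It assumes a redis-py client (whose `register_script` returns a callable script object) and Redis 5 or later, where scripts replicate by effect by default and so may call `TIME` before writing.

```python
# Sketch only: field names, ARGV order, and class name are illustrative.
TOKEN_BUCKET_LUA = """
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])           -- tokens per second

-- Use the Redis server's clock so app servers cannot disagree.
local t   = redis.call('TIME')                  -- {seconds, microseconds}
local now = tonumber(t[1]) + tonumber(t[2]) / 1000000

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'last')
local tokens = tonumber(state[1]) or capacity   -- a new bucket starts full
local last   = tonumber(state[2]) or now

-- Refill from elapsed time, clamped to capacity.
tokens = math.min(capacity, tokens + (now - last) * refill_rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last', now)
return allowed
"""


class RedisTokenBucket:
    """Hypothetical wrapper; `client` is expected to be a redis-py client."""

    def __init__(self, client, capacity, refill_rate,
                 fail_open=False, error_types=(Exception,)):
        self.capacity = capacity
        self.refill_rate = refill_rate
        # The explicit policy: what to answer when Redis cannot be reached.
        self.fail_open = fail_open
        # Broad default for the sketch; narrow it in real code.
        self.error_types = error_types
        self._script = client.register_script(TOKEN_BUCKET_LUA)

    def allow_request(self, key):
        try:
            result = self._script(keys=[key], args=[self.capacity, self.refill_rate])
        except self.error_types:
            # fail_open=True keeps serving; fail_open=False protects the backend.
            return self.fail_open
        return result == 1


# Usage, assuming a reachable Redis and the redis-py package:
#   import redis
#   limiter = RedisTokenBucket(redis.Redis(), capacity=100, refill_rate=100 / 60,
#                              error_types=(redis.exceptions.RedisError,))
#   limiter.allow_request("rate:user:42")
```

Making `fail_open` a constructor argument forces whoever wires up the limiter to pick a side, rather than inheriting whatever an unhandled exception happens to do.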
Why this matters more than it looks
A rate limiter that over-grants under load is worse than no limiter, because it gives you false confidence. It passes every test you write on your laptop and then fails silently in the one environment it exists for: production, under concurrency, across servers. The only way to trust one is to test it the way it'll actually be hit — hundreds of simultaneous requests at a single bucket — and assert it never exceeds capacity.
If you'd rather not build and test this yourself, I package exactly this as a single-file, fully-tested drop-in (the atomic Lua limiter plus the concurrency test suite that proves the guarantee). It's here. But the technique above is the important part — whether you buy it, copy it, or roll your own, move the decision into one atomic operation and your limiter will tell the truth.