Saumya Karnwal

Distributed Rate Limiting — Five Problems That Break Your Counters

Why Local Rate Limiting Breaks

A rate limiter on a single server works exactly as advertised. But most production systems aren't a single server — they're 10, 50, or 200 instances behind a load balancer. And that changes the math.

If your limit is 100 requests per minute and you have 50 instances, the load balancer sprays traffic round-robin. Each instance sees ~2 requests per minute from any given user. Every instance says "well under the limit." Nobody rejects anything. The user sends 3,000 requests. All pass.

Your per-instance rate limiter silently became a limit × num_instances rate limiter. You didn't change the code. You changed the deployment.

The fix is shared state — usually Redis. All instances read and write to the same counter, so the global count is enforced globally. But the moment you introduce shared state over a network, five new problems appear.

Problem 1: The Race You Can't See

Two requests from the same user arrive at the same millisecond, hitting two different instances. Both call Redis.

Instance A                         Instance B
──────────                         ──────────
GET counter → 99                   GET counter → 99
99 < 100? YES                      99 < 100? YES
SET counter → 100                  SET counter → 100

Both allowed. Real count: 101. Limit breached.

This is TOCTOU — time-of-check to time-of-use. You read, decided, then wrote. But someone else read the same value in the gap between your read and your write.

The fix sounds simple: use INCR instead of GET + SET. Redis INCR atomically increments and returns the new value. No gap.

Instance A                         Instance B
──────────                         ──────────
INCR counter → 100                 INCR counter → 101
100 ≤ 100? ALLOW                   101 > 100? REJECT

But what about more complex algorithms — sliding window, token bucket — where you need to read a value, do math, then conditionally write? You can't do that with one INCR. You need Redis Lua scripts. A Lua script runs atomically on Redis's single thread — no other command can interleave.

One network round-trip. One atomic operation. No race.
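The INCR-then-compare pattern is easy to get concrete. Here is a minimal sketch with a plain dict standing in for Redis — in production, Redis's single-threaded execution is what makes the increment atomic, which is the whole point:

```python
# Sketch of the INCR-then-compare fixed-window check. A dict stands in
# for Redis; the key property is that the decision uses the value the
# increment RETURNED, so there is no read-decide-write gap.
def incr(store, key):
    """Stand-in for Redis INCR: increment and return the new value."""
    store[key] = store.get(key, 0) + 1
    return store[key]

def allow(store, user, window_id, limit=100):
    """Fixed-window check: count first, then compare."""
    count = incr(store, f"rate_limit:{user}:window_{window_id}")
    return count <= limit

store = {}
results = [allow(store, "user123", 600) for _ in range(101)]
# The first 100 calls pass; call 101 is rejected.
```

In real Redis you would also set a TTL on the key (e.g. EXPIRE on first increment) so old windows clean themselves up.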

The alternative is WATCH/MULTI/EXEC — Redis's optimistic locking. You WATCH a key, read it, start a MULTI transaction, write your changes, and EXEC. If anyone modified the watched key between your WATCH and EXEC, the transaction aborts and you retry. It's compare-and-swap over the network. More flexible than Lua, but slower under contention because of retries.
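The retry loop is the important part of that flow. Here is a sketch using a version-stamped in-memory store as a stand-in for WATCH/MULTI/EXEC — `compare_and_set` fails when anyone else has written the key since our read, which mirrors what EXEC does for a WATCHed key:

```python
# Optimistic-locking sketch. A versioned dict stands in for Redis;
# compare_and_set aborts on a version mismatch the way EXEC aborts
# when a WATCHed key was modified.
class VersionedStore:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def get(self, key):
        return self.data.get(key, (0, 0))

    def compare_and_set(self, key, expected_version, new_value):
        _, version = self.data.get(key, (0, 0))
        if version != expected_version:
            return False  # someone wrote in between: abort
        self.data[key] = (new_value, version + 1)
        return True

def try_consume(store, key, limit, max_retries=5):
    for _ in range(max_retries):
        count, version = store.get(key)
        if count >= limit:
            return False  # over limit: reject
        if store.compare_and_set(key, version, count + 1):
            return True   # our write landed: allow
        # Lost the race; re-read and retry.
    return False  # contention exhausted retries: fail closed
```

The `max_retries` cap is the "slower under contention" cost made explicit: every lost race is another round-trip.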

Problem 2: Redis Dies. Now What?

Redis is down. Or the network between your service and Redis is partitioned. Every rate limit check fails. You have three options, and none of them are good.

Fail open: Allow all requests. Your system stays up, but you have no rate limiting. If Redis went down because of load, you just removed the only thing protecting you from more load.

Fail closed: Reject all requests. Congratulations, your rate limiter just became a denial-of-service attack on your own users.

Fall back to local: Switch to per-instance in-memory counters with global_limit / num_instances as the local limit. Inaccurate, but bounded.

Option three is what most production systems do. But the transition is tricky. When Redis comes back, do you trust the Redis counter (which is stale) or the local counter (which is approximate)? Most teams reset the Redis counter on recovery and accept a brief window of inaccuracy.

The deeper issue: how fast do you detect the failure? If your Redis timeout is 500ms, every rate-limited request adds 500ms of latency while you wait to find out Redis is dead. You need a circuit breaker — after N consecutive timeouts, stop asking Redis for a cooldown period. Go straight to local. Check again in 10 seconds.
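A minimal circuit-breaker sketch of that fallback logic, assuming a Redis check that raises on timeout and a local per-instance check to fall back to (the threshold and cooldown values here are illustrative, not prescriptive):

```python
import time

# Circuit-breaker sketch: after N consecutive Redis timeouts, stop
# calling Redis for a cooldown period and use the local fallback.
class RateLimitBreaker:
    def __init__(self, check_redis, check_local,
                 failure_threshold=3, cooldown_seconds=10.0,
                 clock=time.monotonic):
        self.check_redis = check_redis  # callable; raises TimeoutError on timeout
        self.check_local = check_local  # per-instance approximate fallback
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0  # breaker is open while now < open_until

    def allow(self, user):
        now = self.clock()
        if now < self.open_until:
            return self.check_local(user)  # breaker open: skip Redis entirely
        try:
            decision = self.check_redis(user)
            self.failures = 0              # success resets the streak
            return decision
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = now + self.cooldown  # trip the breaker
            return self.check_local(user)  # this request falls back too
```

While the breaker is open, requests skip the Redis timeout entirely — that is what removes the 500ms-per-request penalty during an outage.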

Problem 3: Your Servers Disagree About What Time It Is

Window-based algorithms need to answer "which window does this request belong to?" That requires knowing what time it is. Across 50 servers, even with NTP, clocks drift by 10-50ms.

Server A:  10:00:00.000  (on time)
Server B:  10:00:00.150  (150ms ahead)
Server C:  09:59:59.900  (100ms behind)

At 10:00:00 — the window boundary — Server C thinks it's still in the old window. Server B thinks the new window started 150ms ago. They compute different bucket IDs and increment different Redis keys.

Server C:  INCR rate_limit:user123:window_599   ← old window
Server A:  INCR rate_limit:user123:window_600   ← new window
Server B:  INCR rate_limit:user123:window_600   ← new window

Requests at the boundary split across two keys.
Neither hits the limit. Both pass.

For a 60-second window, 150ms of skew is 0.25% — noise. For a 1-second window, it's 15% — a real problem.

Fixes:

  1. Use Redis's clock. Let the Lua script call redis.call('TIME') to determine the current window. One clock, one truth. Adds no extra round-trip since you're already in a Lua script.

  2. Use large windows. If your window is 60 seconds, clock skew doesn't matter. If you need sub-second precision, you need to solve clock sync first.

  3. Use token bucket. No windows, no boundaries, no clock skew problem. The refill math is based on elapsed time (now - last_refill), and small drift in "now" produces proportionally small drift in tokens. A 50ms clock difference on a 1-token-per-second refill rate means 0.05 tokens of error.
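The refill math behind fix 3 can be sketched as follows. The current time is passed in explicitly rather than read from the wall clock — the same idea as using `redis.call('TIME')` in a Lua script: whoever owns the state owns the clock.

```python
# Token-bucket sketch: refill is a function of ELAPSED time, so a
# small error in "now" shifts the token count proportionally instead
# of flipping a window ID at a boundary.
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max tokens the bucket holds
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full
        self.last_refill = 0.0

    def allow(self, now):
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A 50ms disagreement about `now` changes `elapsed` by 0.05s, which at 1 token/second is 0.05 tokens — the proportional-error claim above, made concrete.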

Problem 4: One User Melts Your Redis

Per-user rate limiting means one Redis key per user. Most users generate 10 requests per minute. Then one user — or one bot — sends 50,000. Every request hits the same Redis key.

Redis is single-threaded. A hot key means one user's traffic is serialized through one CPU core, and if that core is saturated, ALL other Redis operations on that shard slow down. One abusive user degrades rate limiting for everyone.

Normal:   user:12345  →  10 INCR/min      (invisible)
Abusive:  user:99999  →  50,000 INCR/min  (hot key, one CPU core)

The layered fix:

  1. Reject early. If a user is 10x over the limit, you know the answer without asking Redis. Keep a local approximate counter. If local count >> limit, reject immediately. Only call Redis when the count is near the threshold — the boundary where you actually need distributed accuracy.

  2. Batch concurrent checks. If 200 requests from the same user arrive in the same millisecond, don't make 200 Redis calls. Batch them: one Redis call for the batch, then distribute the result to all 200 waiters locally. This is what production throttlers do — one network round-trip per batch, not per request.

  3. Shard the key. Split user:99999 into user:99999:0, user:99999:1, ..., user:99999:7. Each instance writes to a random shard. To check the total, sum all shards. You trade perfect accuracy for throughput — the sum might be slightly stale.
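The sharding idea in step 3 can be sketched like this, with a dict standing in for Redis (the shard count of 8 is arbitrary, matching the example above):

```python
import random

# Key-sharding sketch: writes go to a random shard so no single key
# absorbs all of one user's traffic; reads sum across shards. A dict
# stands in for Redis. NUM_SHARDS = 8 is arbitrary.
NUM_SHARDS = 8

def record_request(store, user, rng=random):
    shard = rng.randrange(NUM_SHARDS)
    key = f"user:{user}:{shard}"
    store[key] = store.get(key, 0) + 1  # stand-in for INCR on the shard key

def total_count(store, user):
    # Under concurrent writes the sum can be slightly stale — the
    # accuracy-for-throughput trade described above.
    return sum(store.get(f"user:{user}:{s}", 0) for s in range(NUM_SHARDS))
```

With real Redis you would pipeline or MGET the shard reads so the sum still costs one round-trip.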

Problem 5: Three Regions, Three Redis Instances, One Limit

You deploy in US-East, US-West, and EU-West. Each region has its own Redis. A user with a global limit of 1,000/min sends 400 requests to each region.

US-East Redis:  user:123 → 400  (under 1000, allow)
US-West Redis:  user:123 → 400  (under 1000, allow)
EU-West Redis:  user:123 → 400  (under 1000, allow)

Total: 1,200 allowed. Limit is 1,000.

Nobody noticed because no single Redis saw more than 400.

Your options, ranked by pragmatism:

Split the quota. Give each region 1000 / 3 = 333. Simple, but a user who only uses US-East gets 333 instead of 1,000. You're penalizing them for your architecture.

Over-provision the limit. Set it to 800 and accept that the real effective limit is somewhere between 800 and 1,200 depending on distribution. For most use cases, "roughly 1,000" is good enough.

Single global Redis. All regions talk to one Redis in US-East. Accurate, but US-West adds 60-80ms and EU adds 100-150ms per request. For a rate limit check that should take 1ms, that's a 100x latency penalty.

Cross-region sync. Each region publishes its local count every second. Each region subscribes to the others. You get eventual consistency with a 1-2 second window of inaccuracy. Complex to build, complex to debug, and you still have the "what if the sync is down" problem.

Most teams pick option one or two. The engineering cost of options three and four is almost never justified by the accuracy gain. Rate limiting is about protection, not precision — being off by 20% is fine if it still prevents abuse.

The Meta-Problem

These five problems share a root cause: rate limiting is global state enforced locally. Every instance needs to know the global count, but global knowledge has a cost — latency (network hops), availability (what if the store is down), and consistency (what if two instances disagree).

You can't have all three. Pick two:

  • Accurate + Available: Central store with local fallback (most common)
  • Accurate + Fast: Single instance, no distribution (doesn't scale)
  • Fast + Available: Local-only with periodic sync (inaccurate)

Every production rate limiter is a choice on this triangle. Understanding which trade-off your system made — and which failure mode it accepted — is the difference between "we have rate limiting" and "our rate limiting actually works."
