DEV Community

Rajkiran
Rajkiran

Posted on

System Design - 21. Rate Limiting: The 5 Algorithms That Protect Every API on the Internet

Rate Limiting: The 5 Algorithms That Protect Every API on the Internet

Series: System Design Mastery — Day 7 of 15
Reading time: 12 min
Covers: Token Bucket, Leaky Bucket, Fixed/Sliding Window, Distributed Rate Limiting with Redis, Multi-DC


The API That Got Hugged to Death

In 2013, a small startup's API went viral — a popular blog post linked directly to their public endpoint, and within minutes their servers were receiving 50x normal traffic. Not from an attack — from genuine, enthusiastic users, all hitting "refresh" on a slow-loading page, triggering retries, triggering more load.

The servers fell over within 20 minutes. The viral moment — which should have been their best day — became their worst, as the service was completely unusable exactly when the most people wanted to try it.

Rate limiting exists to prevent exactly this — whether the traffic surge comes from genuine enthusiasm, a buggy client retrying too aggressively, or a malicious attacker. The mechanism is the same: bound how much traffic any single source can send, so the system stays healthy for everyone.


Why Rate Limiting Matters Beyond "Stopping Attacks"

Most people think rate limiting = anti-DDoS. That's one use case, but not the primary one for most systems:

1. Fair resource allocation: If one customer's batch job sends 10,000 requests/second, it shouldn't degrade service for every other customer sharing the infrastructure.

2. Cost control: Each API call to a downstream paid service (a third-party API, a database query) costs money. Rate limiting bounds your maximum cost exposure.

3. Protecting downstream systems: Your API might handle 10,000 req/s fine, but your database can only handle 1,000 writes/s. Rate limiting at the API layer protects the database from being overwhelmed by API traffic.

4. Preventing retry storms: A buggy client that retries failed requests in a tight loop can accidentally generate enormous load — rate limiting caps the damage.

5. Business model enforcement: Free tier gets 100 requests/day, paid tier gets 100,000. Rate limiting is the product tier enforcement mechanism.


The 5 Rate Limiting Algorithms

Algorithm 1: Token Bucket

The mental model: A bucket holds tokens. Tokens are added at a fixed rate, up to a maximum capacity. Each request consumes one token. If the bucket is empty, the request is rejected (or queued).

Bucket capacity: 10 tokens
Refill rate: 1 token per second

Time 0s: bucket = [●●●●●●●●●●] (10 tokens, full)
Request arrives → consume 1 token → bucket = [●●●●●●●●●_] (9 tokens)

5 rapid requests arrive → consume 5 tokens → bucket = [●●●●______] (4 tokens)
(BURST allowed! All 5 requests succeeded immediately)

Time passes, tokens refill at 1/sec...
If bucket is empty and a request arrives → REJECTED (429 Too Many Requests)
Enter fullscreen mode Exit fullscreen mode

Key property: allows bursts. If the bucket is full, you can make 10 requests instantly — then you're limited to the refill rate (1/sec) until the bucket replenishes.

Implementation:

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        elapsed = now - self.last_refill
        # Refill tokens based on elapsed time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True  # Allowed
        return False  # Rejected
Enter fullscreen mode Exit fullscreen mode

Best for: APIs where occasional bursts are normal and desirable — a user opening an app and making several requests at once (load dashboard widgets) shouldn't be immediately rate-limited just because they happened in the same second.

Who uses it: This is the most common algorithm for public APIs — Stripe, GitHub, Twitter all use token-bucket variants.


Algorithm 2: Leaky Bucket

The mental model: Requests enter a queue (the "bucket"). The queue is processed ("leaks") at a constant, fixed rate, regardless of how fast requests arrive. If the queue is full, new requests are dropped.

Queue capacity: 5
Processing rate: 1 request per second (constant, regardless of input rate)

10 requests arrive instantly:
  → First 5 enter the queue
  → Remaining 5 are REJECTED (queue full)

Queue processes at exactly 1/second:
  Second 1: process request 1
  Second 2: process request 2
  Second 3: process request 3
  ...
Enter fullscreen mode Exit fullscreen mode

Key property: smooths output to a constant rate. Unlike Token Bucket (which allows bursts through), Leaky Bucket guarantees the downstream system never sees more than the configured rate — even if 1000 requests arrive in the same millisecond, the downstream sees a steady drip.

Token Bucket vs Leaky Bucket — the core difference:

Token Bucket: "How many requests can I ALLOW THROUGH right now?"
  → Output rate can spike (bursts pass through immediately)

Leaky Bucket: "At what CONSTANT rate do I process requests?"
  → Output rate is always smooth, never spikes
Enter fullscreen mode Exit fullscreen mode

Best for: Protecting downstream systems that genuinely cannot handle bursts — e.g., a legacy database that chokes on concurrent connections, or a third-party API with a strict "exactly N requests per second" contract.


Algorithm 3: Fixed Window Counter

The mental model: Divide time into fixed windows (e.g., 1-minute blocks). Count requests in the current window. Reject if the count exceeds the limit. Reset the counter when the window ends.

Limit: 100 requests per minute

Window [12:00:00 - 12:01:00]: counter = 0
  Request arrives → counter = 1
  ... 99 more requests → counter = 100
  101st request → REJECTED (limit reached)

Window [12:01:00 - 12:01:00]: counter resets to 0
  New requests allowed again
Enter fullscreen mode Exit fullscreen mode

The edge case problem (this is the famous interview gotcha):

Limit: 100 requests per minute

Window 1 [12:00:00 - 12:01:00]: 
  100 requests arrive at 12:00:59 (last second of window) → all allowed

Window 2 [12:01:00 - 12:02:00]:
  100 requests arrive at 12:01:00 (first second of new window) → all allowed

Result: 200 requests in a 2-second span (12:00:59 to 12:01:01)
But the configured limit was "100 per minute"!
Enter fullscreen mode Exit fullscreen mode

The fixed window resets abruptly, allowing a burst of 2x limit right at the boundary. This is a real vulnerability — attackers can exploit window boundaries to send 2x traffic.

Best for: Simple cases where this edge case doesn't matter much (internal tools, low-stakes limits). Simple to implement, very low memory overhead.


Algorithm 4: Sliding Window Log

The mental model: Store a timestamp for every request. To check if a new request is allowed, count how many timestamps fall within the last N seconds (a true sliding window, not fixed boundaries).

Limit: 100 requests per minute

Request arrives at 12:01:30
→ Count all stored timestamps between 12:00:30 and 12:01:30
→ If count < 100, allow and store this timestamp
→ If count >= 100, reject

Old timestamps (before 12:00:30) are discarded
Enter fullscreen mode Exit fullscreen mode

Key property: perfectly accurate. No boundary exploits — the window always represents exactly "the last N seconds," continuously sliding.

Implementation with Redis:

import time

def is_allowed_sliding_log(redis_client, user_id, limit=100, window=60):
    now = time.time()
    key = f"rate_limit:{user_id}"

    # Remove timestamps older than the window
    redis_client.zremrangebyscore(key, 0, now - window)

    # Count remaining timestamps (requests in the current window)
    count = redis_client.zcard(key)

    if count < limit:
        redis_client.zadd(key, {str(now): now})  # Record this request
        redis_client.expire(key, window)
        return True
    return False
Enter fullscreen mode Exit fullscreen mode

Disadvantage: memory heavy. Storing a timestamp per request — for a user making 100 requests/minute, that's 100 entries per user, continuously. At millions of users, this becomes a significant memory cost in Redis.

Best for: When precision matters more than memory cost — security-critical rate limits (login attempts, password reset requests) where the 2x boundary exploit of Fixed Window is unacceptable.


Algorithm 5: Sliding Window Counter (The Best Balance)

The mental model: A hybrid — combine the previous window's count and the current window's count, weighted by how far into the current window we are. Approximates the Sliding Window Log without storing every timestamp.

Limit: 100 requests per minute

Previous window [12:00:00-12:01:00]: 80 requests
Current window  [12:01:00-12:02:00]: 30 requests so far
Current time: 12:01:15 (25% into the current window)

Weighted count = (previous_window_count × (1 - elapsed_fraction)) 
                  + current_window_count
                = (80 × (1 - 0.25)) + 30
                = (80 × 0.75) + 30
                = 60 + 30
                = 90

90 < 100 → request ALLOWED
Enter fullscreen mode Exit fullscreen mode

This formula approximates "how many requests occurred in the trailing 60 seconds" using just two counters (previous window, current window) instead of storing every timestamp.

Why this is the industry standard:

  • Memory: O(1) per user (just 2 numbers), vs O(N) for Sliding Window Log
  • Accuracy: very close to true sliding window — eliminates the 2x boundary exploit of Fixed Window
  • This is what Cloudflare and most major API providers use in production

Distributed Rate Limiting with Redis

In a system with multiple API servers behind a load balancer, rate limiting must be shared across all of them — otherwise, a user could hit server A's limit, then send requests to server B which has its own independent counter, effectively multiplying their limit by the number of servers.

The solution: centralized counting in Redis, accessed by all API servers.

# Token Bucket using Redis + Lua script for atomicity

LUA_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 60)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0  -- rejected
end
"""

def is_allowed(redis_client, user_id, capacity=10, refill_rate=1):
    now = time.time()
    result = redis_client.eval(
        LUA_SCRIPT, 1, f"rate_limit:{user_id}", capacity, refill_rate, now
    )
    return result == 1
Enter fullscreen mode Exit fullscreen mode

Why Lua scripting matters here: Without it, "check tokens, then update tokens" is two separate Redis calls — a race condition. Two simultaneous requests could both read "5 tokens available," both proceed, and both decrement — but the actual remaining count should have only allowed one of them under the limit.

Lua scripts execute atomically in Redis — the entire check-and-update happens as one indivisible operation, eliminating the race condition. This is the standard pattern for distributed rate limiting.


Response Headers: Communicating Limits to Clients

A well-designed rate-limited API tells clients where they stand:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 37
X-RateLimit-Reset: 1718888940

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1718888940
Retry-After: 45
Enter fullscreen mode Exit fullscreen mode

Retry-After tells the client exactly how long to wait before retrying — directly enabling well-behaved exponential backoff (Topic 18) on the client side. A well-designed rate limiter doesn't just reject requests — it tells clients how to behave.


Tiered Rate Limits

Real APIs don't have one global limit — different user tiers get different limits:

TIER_LIMITS = {
    "free":       {"capacity": 100,    "refill_rate": 100/86400},     # 100/day
    "pro":        {"capacity": 10000,  "refill_rate": 10000/3600},    # 10K/hour
    "enterprise": {"capacity": 100000, "refill_rate": 100000/60},     # 100K/min
}

def get_rate_limit_config(user_id):
    user_tier = user_config_service.get_tier(user_id)  # cached lookup
    return TIER_LIMITS[user_tier]
Enter fullscreen mode Exit fullscreen mode

This is how API products like Stripe, Twilio, and GitHub enforce their pricing tiers — rate limiting is the enforcement mechanism for "upgrade to get higher limits."


Global vs Per-Service Rate Limiting

Where should rate limiting live in your architecture?

Option 1: At the API Gateway (most common)
  Single enforcement point, before requests reach any backend service
  ✓ Consistent across all services
  ✓ Protects all downstream services uniformly
  ✗ Gateway must be fast — adds latency to every request

Option 2: Per-service
  Each service enforces its own limits
  ✓ Services can have different limits based on their specific load capacity
  ✗ Inconsistent enforcement, duplicated logic
  ✗ A user could exceed the "global" intent by spreading requests across services

Option 3: Both (defense in depth)
  Gateway enforces overall user/API-key limits
  Individual services enforce limits specific to expensive operations
  (e.g., "image processing" endpoint has a stricter limit than "get profile")
Enter fullscreen mode Exit fullscreen mode

Most production systems use Option 3 — coarse-grained limiting at the gateway (overall fairness, DDoS protection) plus fine-grained limiting at specific expensive endpoints.


Multi-Datacenter Rate Limiting: The Hard Problem

If your Redis cluster is per-region, and a user's requests are routed to different regions (geo-routing, failover), each region's Redis has an independent count — a user could get limit × number_of_regions total throughput.

Approach 1: Global Redis (cross-region)
A single Redis cluster, accessed by all regions. Simple, but adds cross-region latency to every request (Day 1: 150ms cross-continent) — often unacceptable.

Approach 2: Per-region limits with a shared global budget
Each region gets limit / number_of_regions as its local limit. Simple, but if traffic is unevenly distributed (all traffic happens to hit one region), that region's limit may be too restrictive even though the global budget isn't exhausted.

Approach 3: Async global synchronization
Each region rate-limits locally with a slightly generous local limit, and periodically syncs counts to a global store. There's a small window of "overshoot" (a user could exceed the true global limit briefly), but most systems accept this trade-off for the latency win.

The honest answer for interviews: "Perfect global rate limiting across multiple datacenters with zero added latency and zero overshoot is fundamentally a trade-off — you can have strong consistency (global Redis, adds latency) or low latency (per-region, allows some overshoot), but not both. Most systems choose per-region with generous local limits and accept brief overshoot as the lesser evil — similar to the AP choice in CAP theorem from Day 2."


Key Takeaways

  • Token Bucket: allows bursts, refills at a steady rate — most common for public APIs.
  • Leaky Bucket: smooths output to a constant rate — best for protecting rate-sensitive downstream systems.
  • Fixed Window Counter: simple but has a boundary exploit (2x burst at window edges).
  • Sliding Window Log: perfectly accurate, but memory-heavy (stores every timestamp).
  • Sliding Window Counter: the industry-standard balance — O(1) memory, near-perfect accuracy, used by Cloudflare and most major APIs.
  • Distributed rate limiting requires centralized state (Redis) with atomic operations (Lua scripts) to avoid race conditions across multiple API servers.
  • Response headers (X-RateLimit-*, Retry-After) let clients behave well — combine with Topic 18's exponential backoff.
  • Multi-DC rate limiting is a genuine CAP-theorem-style trade-off between strict accuracy and low latency — most systems accept some overshoot for speed.

You've now covered Security (authentication, authorization, Zero Trust), Observability (the 3 pillars, golden signals, alerting), and Rate Limiting (the 5 algorithms and distributed implementation). These three topics form the protective and diagnostic layer that wraps around every system you'll design.

Next — the final day of Phase 2 — covers the advanced data structures every senior engineer should know: Consistent Hashing, Bloom Filters, Geospatial Indexing, and a grab-bag of structures (Skip Lists, HyperLogLog, Tries, LSM Trees, B+ Trees) that power the internals of the databases and caches you've been learning about all week.

Tags: system-design rate-limiting redis api-design backend distributed-systems interview-prep

Top comments (0)