DEV Community

CodeWithDhanian

Rate Limiting & Throttling in System Design

In large-scale distributed systems and microservices architectures, uncontrolled incoming traffic can quickly lead to resource exhaustion, degraded performance, or complete service outages. Rate limiting and throttling serve as critical defensive mechanisms that protect backend services, ensure fair usage among clients, prevent abuse, and maintain overall system stability under varying load conditions. These techniques control the flow of requests to APIs, databases, or other resources, allowing systems to operate reliably even during traffic spikes or malicious attacks.

Understanding Rate Limiting

Rate limiting is a technique that enforces a strict upper bound on the number of requests a client, user, IP address, or API key can make within a defined time window. The primary goals include protecting against DDoS attacks, ensuring fair resource allocation, enforcing business quotas, and preventing any single client from monopolizing shared resources.

When a request exceeds the allowed limit, the system typically rejects it immediately and returns an HTTP 429 Too Many Requests status code, often accompanied by headers such as Retry-After to inform the client when it may retry.

Rate limiting operates at multiple layers: at the API gateway, within individual microservices, at the load balancer, or even at the edge using content delivery networks. In distributed environments, the rate limiter must maintain consistent state across multiple nodes, typically using a centralized store such as Redis.

Understanding Throttling

Throttling differs from rate limiting by focusing on controlling the processing speed or flow of requests rather than imposing a hard rejection limit. Instead of outright denying excess requests, throttling slows down, queues, or paces the handling of requests to maintain a steady load on the system.

While rate limiting answers the question “Is this request allowed?”, throttling addresses “How fast should this request be processed?”. Throttling is particularly useful for smoothing bursty traffic, protecting downstream services with their own rate limits, or gracefully handling temporary overload without dropping legitimate requests.

Common throttling strategies include introducing artificial delays, queuing requests in message queues, or dynamically reducing the processing rate based on current system metrics such as CPU usage or queue length.
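The pacing strategy can be sketched in a few lines of single-process Python. The `Throttler` class and its parameters are illustrative, not from any particular library; a production throttler would also need to be thread-safe and distributed-aware.

```python
import time

class Throttler:
    """Paces calls to at most `rate` per second by sleeping before each one.

    A minimal single-process sketch; a real system would typically queue
    requests instead of blocking the caller.
    """
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate   # minimum seconds between calls
        self.last_call = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay applied."""
        now = time.monotonic()
        delay = max(0.0, self.last_call + self.min_interval - now)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()
        return delay
```

Calling `wait()` before each downstream request smooths a burst of incoming calls into a steady stream at the configured rate, at the cost of added latency for the later requests in the burst.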

Key Differences Between Rate Limiting and Throttling

Rate limiting provides a hard cap and immediate rejection for excess requests, making it ideal for quota enforcement and abuse prevention. Throttling prioritizes smoothing traffic and improving user experience by avoiding abrupt denials, often at the cost of increased latency for some requests. Many production systems combine both: rate limiting at the entry point for protection and throttling internally for traffic shaping.

Common Rate Limiting Algorithms

Several well-established algorithms exist for implementing rate limiting, each offering different trade-offs in terms of burst tolerance, accuracy, memory usage, and implementation complexity.

Token Bucket Algorithm

The token bucket algorithm is one of the most widely adopted approaches due to its flexibility and ability to handle controlled bursts. It models capacity as a bucket that accumulates tokens at a constant refill rate up to a maximum capacity. Each incoming request consumes one token. If tokens are available, the request is allowed; otherwise, it is rejected.

Key parameters:

  • Refill rate (r): Tokens added per unit time (e.g., 10 tokens per second).
  • Bucket capacity (b): Maximum number of tokens the bucket can hold, determining burst size.

This algorithm allows short bursts up to the bucket capacity while enforcing the long-term average rate. It is particularly suitable for public APIs where users may send occasional bursts of requests after periods of inactivity.
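Before looking at a distributed version, the core bookkeeping can be sketched in single-process Python. The class and parameter names below are illustrative; `refill_rate` and `capacity` correspond to the r and b parameters above.

```python
import time

class TokenBucket:
    """Single-process token bucket sketch (illustrative, not production code).

    refill_rate (r): tokens added per second.
    capacity (b):    maximum tokens the bucket holds, i.e. the burst size.
    """
    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate = refill_rate
        self.capacity = capacity
        self.tokens = capacity               # start full: allows an initial burst
        self.last_refill = time.monotonic()

    def allow(self, tokens_requested: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill continuously based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens_requested:
            self.tokens -= tokens_requested
            return True
        return False
```

With `refill_rate=10` and `capacity=5`, a client can fire 5 requests back-to-back, after which requests are admitted at the long-term rate of 10 per second.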

Complete Token Bucket Implementation Example Using Redis (Lua Script for Atomicity)

-- Token Bucket Lua Script for Redis
local key = KEYS[1]                  -- e.g., "rate:limit:user:123"
local now = tonumber(ARGV[1])        -- current timestamp in seconds
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local capacity = tonumber(ARGV[3])   -- max bucket size
local tokens_requested = tonumber(ARGV[4]) or 1

-- Get current tokens and last refill time
local last_refill = tonumber(redis.call("HGET", key, "last_refill") or now)
local tokens = tonumber(redis.call("HGET", key, "tokens") or capacity)

-- Calculate new tokens to add
local elapsed = now - last_refill
local new_tokens = math.floor(elapsed * refill_rate)
tokens = math.min(tokens + new_tokens, capacity)

-- Check if enough tokens available
if tokens >= tokens_requested then
    tokens = tokens - tokens_requested
    redis.call("HSET", key, "tokens", tokens)
    redis.call("HSET", key, "last_refill", now)
    redis.call("EXPIRE", key, 3600)  -- expire after 1 hour for cleanup
    return {1, tokens}               -- allowed, remaining tokens
else
    return {0, tokens}               -- rejected, remaining tokens
end

This Lua script ensures atomic execution, preventing race conditions in distributed systems. The client calls this script via EVAL or EVALSHA commands in Redis.

Leaky Bucket Algorithm

The leaky bucket algorithm treats requests as water pouring into a bucket with a small hole at the bottom. Requests enter the bucket and are processed (leaked) at a constant fixed rate. If the bucket overflows, incoming requests are rejected or queued.

Leaky bucket excels at smoothing traffic to a steady output rate, making it ideal for scenarios requiring predictable load, such as payment processing or integration with external services that have strict rate limits. Unlike the token bucket, it does not permit large bursts; excess requests are either delayed or dropped.
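A single-process sketch of the leaky bucket as a meter: each request adds one unit of "water", the level drains at a constant rate, and overflow means rejection. The names are illustrative, and a real deployment might queue overflowing requests rather than reject them.

```python
import time

class LeakyBucket:
    """Leaky-bucket-as-meter sketch (illustrative, single-process).

    leak_rate: units drained per second, i.e. the steady processing rate.
    capacity:  maximum level before incoming requests overflow.
    """
    def __init__(self, leak_rate: float, capacity: float):
        self.leak_rate = leak_rate
        self.capacity = capacity
        self.level = 0.0
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket for the elapsed interval, never below empty.
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True
        return False
```

Note the contrast with the token bucket: here a full bucket means "reject", and there is no way to bank capacity for a burst.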

Fixed Window Counter Algorithm

The fixed window algorithm divides time into fixed intervals (e.g., one minute or one hour) and counts the number of requests within each window. A counter is incremented for every allowed request. When the counter exceeds the limit for the current window, further requests are rejected until the next window begins.

This approach is simple and memory-efficient but suffers from the boundary burst problem: clients can send twice the allowed rate at window edges (e.g., 100 requests at the end of one minute and another 100 immediately at the start of the next).

Simple Fixed Window Pseudocode

function isAllowed(clientId, limit, windowSeconds):
    currentWindow = floor(currentTime / windowSeconds)
    counterKey = "rate:" + clientId + ":" + currentWindow
    count = redis.INCR(counterKey)
    if count == 1:
        # First request of the window: set the TTL. INCR and EXPIRE are two
        # separate commands, so a crash between them leaves a key that never
        # expires; wrap the pair in a Lua script for production use.
        redis.EXPIRE(counterKey, windowSeconds)
    return count <= limit
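The boundary burst problem described above can be reproduced with an in-memory version of the same counter. The class name and the injectable `clock` parameter are test conveniences, not part of the algorithm.

```python
import math
import time

class FixedWindowCounter:
    """In-memory fixed window counter sketch mirroring the pseudocode above."""
    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = {}   # (client_id, window index) -> request count

    def is_allowed(self, client_id: str) -> bool:
        window = math.floor(self.clock() / self.window_seconds)
        key = (client_id, window)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit

# Demonstrate the boundary burst with a fake clock: 100 requests at the end
# of one window and 100 more at the start of the next are all accepted,
# even though they arrive within one second of each other.
now = [59.5]
rl = FixedWindowCounter(limit=100, window_seconds=60, clock=lambda: now[0])
end_of_window = sum(rl.is_allowed("u1") for _ in range(100))
now[0] = 60.5
start_of_next = sum(rl.is_allowed("u1") for _ in range(100))
```

Both sums come out to 100, confirming that a fixed window can admit up to twice the configured rate across a window boundary.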

Sliding Window Algorithms

Sliding window approaches provide higher accuracy by using a continuously moving time frame instead of rigid boundaries.

Sliding Window Log: Maintains a sorted list or set of timestamps for every request made by a client within the window. On each request, remove old timestamps outside the window and check if the remaining count is below the limit. This offers precise control but consumes significant memory for high-traffic clients.

Sliding Window Counter: A hybrid that combines fixed windows with mathematical adjustment. It tracks counts in the current and previous windows and calculates a weighted count for the sliding period. This balances accuracy and memory usage effectively.
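The sliding window counter can be sketched as follows for a single client. The class name, the injectable `clock`, and the strict `<` comparison are illustrative choices; a distributed version would keep the two counts in a shared store such as Redis.

```python
import math
import time

class SlidingWindowCounter:
    """Sliding window counter sketch: weights the previous fixed window's
    count by how much of it still overlaps the sliding window."""
    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_window = None
        self.current_count = 0
        self.previous_count = 0

    def is_allowed(self) -> bool:
        now = self.clock()
        window = math.floor(now / self.window)
        if self.current_window is None:
            self.current_window = window
        elif window > self.current_window:
            # Shift: the old current window becomes "previous"; anything older
            # than one full window no longer overlaps the sliding window.
            self.previous_count = (self.current_count
                                   if window == self.current_window + 1 else 0)
            self.current_count = 0
            self.current_window = window
        # Fraction of the current fixed window already elapsed (0.0 to 1.0).
        elapsed = (now % self.window) / self.window
        weighted = self.current_count + self.previous_count * (1.0 - elapsed)
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```

For example, with a limit of 10 per 60 seconds, a client who used its full quota in the previous window is only allowed 5 more requests 30 seconds into the next window, because half of the old window still counts against it.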

Distributed Rate Limiting Considerations

In microservices or multi-node deployments, a single in-memory rate limiter is insufficient. Designers must ensure consistency across instances using a shared distributed cache such as Redis, Memcached, or a dedicated rate-limiting service.

Consistent hashing can route requests for the same client to the same shard, while Lua scripts or atomic operations guarantee correctness under concurrency. At extremely high scale, consider Redis Cluster, or combine the shared store with short-lived local caching for hot clients.

Idempotency and proper error handling are essential: clients should receive clear rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) to adjust their behavior gracefully.

Best Practices for Implementing Rate Limiting & Throttling

Apply rate limiting at multiple levels: edge (CDN or API gateway), service level, and database level. Choose the algorithm based on requirements — token bucket for burst-tolerant APIs, leaky bucket for traffic shaping, and sliding window counter for strict fairness with good performance.

Use Redis with Lua scripts for atomicity in distributed setups. Always return informative headers and consider adaptive rate limiting that dynamically adjusts limits based on system load. Combine with circuit breakers, bulkheads, and monitoring (Prometheus, Grafana) to detect and respond to abuse patterns.

For throttling, integrate with message queues (Kafka, RabbitMQ) to queue excess requests or apply exponential backoff and jitter on retries.
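A common retry schedule is "full jitter" exponential backoff, where each delay is drawn uniformly from zero up to an exponentially growing cap. The function name and the base/cap defaults below are illustrative choices, not a standard API.

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Full-jitter exponential backoff sketch: delay i is drawn uniformly
    from [0, min(cap, base * 2**i)]. Jitter spreads retries out so that
    many clients rejected at once do not all retry at the same instant."""
    return [rng.uniform(0, min(cap, base * (2 ** i))) for i in range(attempts)]
```

A client receiving HTTP 429 would sleep for `backoff_delays(n)[i]` seconds before retry `i` (or honor the server's Retry-After header when present, which takes precedence).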

Rate limiting and throttling form foundational resilience patterns in system design. Proper implementation protects services, improves user experience, and enables sustainable scaling of distributed systems.

(Image: rate limiting vs throttling comparison)

System Design Handbook

For more in-depth insights and comprehensive coverage of system design topics, consider purchasing the System Design Handbook at https://codewithdhanian.gumroad.com/l/ntmcf. It will equip you with the knowledge to master complex distributed systems.

Buy me a coffee to support my content at: https://ko-fi.com/codewithdhanian
