
Design a Rate Limiter

Rate Limiting — System Design Deep Dive

A rate limiter is a piece of software that regulates how much traffic a client can send to a server within a given period of time.

At scale, rate limiting becomes a fundamental building block of reliable and cost-efficient systems.


Why Do We Need Rate Limiting?

Rate limiters provide several important benefits:

  • ✅ Prevent denial-of-service (DoS) attacks
  • ✅ Promote fair usage of shared resources
  • ✅ Protect backend services from overload
  • ✅ Reduce infrastructure and operational costs

Real-World Examples

Rate limiting appears everywhere in modern applications:

  • Users can share up to 150 posts per day
  • Users can post 300 tweets within 2 hours
  • Users can make two withdrawal transactions within 15 seconds

In reality, rate limits depend entirely on your application's access patterns and business rules.


Where Can Rate Limiters Live?

Rate limiters can be deployed in different parts of the system:

1️⃣ Client-Side Rate Limiting

Client-side rate limiting happens within the client application itself, before a request ever leaves the device.

Pros

  • Reduces unnecessary requests early
  • Improves perceived responsiveness

Cons

  • Less secure
  • Clients can tamper with requests and bypass restrictions

2️⃣ Server-Side Rate Limiting

Server-side rate limiting enforces rules centrally.

Pros

  • Strong enforcement
  • Cannot be bypassed easily
  • Reliable tracking of usage



3️⃣ API Gateway / Middleware Layer

A very common approach is placing the rate limiter at the API gateway.

This allows all incoming traffic to be evaluated before reaching backend services.



Core Rate Limiting Algorithms

Most production rate limiters are built on a few well-known algorithms:

  • Fixed Window Counter
  • Sliding Window Counter
  • Token Bucket
  • Leaky Bucket

Let’s walk through each one.


Fixed Window Counter

In a fixed window counter, clients can make a specific number of requests within a fixed time interval.

Example:

  • 100 requests per minute

Problem: Burstiness

A client could send:

  • 100 requests at 00:59
  • Another 100 requests at 01:00

Result: 200 requests within seconds, even though the limit is 100 per minute.
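
A minimal in-memory sketch of the algorithm (illustrative only; a production limiter would keep these counters in a shared store such as Redis):

```python
import time
from collections import defaultdict

LIMIT = 100            # max requests per window
WINDOW_SECONDS = 60    # fixed window length

# client_id -> [window_id, count]; in-memory for illustration only
counters = defaultdict(lambda: [0, 0])

def allow(client_id: str) -> bool:
    window_id = int(time.time() // WINDOW_SECONDS)  # which fixed window we are in
    state = counters[client_id]
    if state[0] != window_id:        # a new window has started: reset the count
        state[0], state[1] = window_id, 0
    state[1] += 1
    return state[1] <= LIMIT
```

The reset at each window boundary is exactly what permits the burst described above.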



Sliding Window Counter

Instead of resetting counters at fixed intervals, the sliding window evaluates requests relative to the current time.

When a request arrives:

  • The system counts how many requests occurred in the rolling window ending at the current moment (for example, the last 60 seconds).
  • If that count has already reached the limit, the request is rejected.

This significantly reduces burst traffic compared to fixed windows.
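
A minimal sketch of the sliding window log variant, which records a timestamp for every accepted request (the counter variant approximates the same behavior with far less memory):

```python
import time
from collections import defaultdict, deque

LIMIT = 100
WINDOW_SECONDS = 60

# client_id -> timestamps of recently accepted requests
request_log = defaultdict(deque)

def allow(client_id: str) -> bool:
    now = time.time()
    log = request_log[client_id]
    while log and log[0] <= now - WINDOW_SECONDS:  # evict entries outside the rolling window
        log.popleft()
    if len(log) >= LIMIT:
        return False
    log.append(now)
    return True
```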



Token Bucket

The token bucket is one of the most widely used rate limiting algorithms.

How it works

  • A bucket contains tokens.
  • Each token allows one request.
  • Tokens refill at a constant rate.
  • Requests consume tokens.
  • If no tokens remain → request is rejected.

Burst traffic is allowed as long as tokens are available.

More expensive operations can consume multiple tokens.

This makes token buckets ideal for high-traffic APIs.
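
A minimal single-threaded sketch (illustrative; a production implementation would add locking and shared state):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.time()

    def allow(self, cost: int = 1) -> bool:
        now = time.time()
        # Refill lazily based on elapsed time, never exceeding capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost         # costlier operations can consume more tokens
            return True
        return False

bucket = TokenBucket(capacity=100, refill_rate=10)  # bursts up to 100, ~10 req/s sustained
```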



Leaky Bucket

The leaky bucket processes requests at a constant rate.

Think of it as a FIFO queue:

  • Requests enter the queue
  • Requests are processed steadily
  • When the queue is full, new requests are dropped

This smooths traffic spikes and ensures consistent processing.
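
A minimal sketch with a bounded FIFO queue drained by a background thread (illustrative only; it is not hardened for concurrent use):

```python
import threading
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float, process):
        self.queue = deque()
        self.capacity = capacity          # maximum queue depth
        self.interval = 1.0 / drain_rate  # seconds between processed requests
        self.process = process            # callback that handles one request
        threading.Thread(target=self._drain, daemon=True).start()

    def offer(self, request) -> bool:
        if len(self.queue) >= self.capacity:
            return False                  # queue full: drop the request
        self.queue.append(request)
        return True

    def _drain(self):
        while True:                       # process at a constant rate, one at a time
            if self.queue:
                self.process(self.queue.popleft())
            time.sleep(self.interval)

bucket = LeakyBucket(capacity=50, drain_rate=10, process=print)  # drains ~10 requests/second
```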



High-Level Architecture

A rate limiter typically acts as middleware between clients and servers.

Every incoming request is evaluated before reaching the API.


If a request exceeds limits, the server responds with:
HTTP 429 — Too Many Requests



Helpful Rate Limit Headers

Servers often return headers to help clients behave correctly:

Header                     Meaning
X-RateLimit-Remaining      Number of requests still allowed in the current window
X-RateLimit-Limit          Maximum number of requests allowed per window
X-RateLimit-Retry-After    Seconds to wait before sending another request
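
Put together, a throttled response might look like the following (the header names and values here are illustrative; exact conventions vary between providers):

```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Retry-After: 30

{"error": "rate limit exceeded, retry after 30 seconds"}
```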



Rule Configuration

Rate limiting rules define what is allowed.

Example:

  • Maximum 5 marketing messages per day
  • Maximum 5 login attempts per minute
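
As a concrete illustration, such rules are often expressed declaratively. The snippet below is modeled loosely on the configuration format of Lyft's open-source ratelimit service; the domains, keys, and values are hypothetical:

```yaml
domain: messaging
descriptors:
  - key: message_type
    value: marketing
    rate_limit:
      unit: day
      requests_per_unit: 5
---
domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 5
```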

Rules are typically:

  • Stored on disk or in a configuration service
  • Loaded into cache by background workers
  • Evaluated in middleware on each request

Request Flow

  1. Client sends request
  2. Request reaches rate limiter middleware
  3. Rules are loaded from cache
  4. Counters and timestamps are checked
  5. Request is either forwarded or throttled
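
A hypothetical sketch of this flow in Python (every name here is illustrative, and allow is a stub standing in for any of the algorithm sketches above):

```python
RULES_CACHE = {"/login": {"limit": 5, "window_seconds": 60}}  # step 3: rules held in cache

def allow(client_id: str, limit: int, window_seconds: int) -> bool:
    return True  # stub for step 4: plug in any of the algorithms above

def middleware(client_id: str, path: str, forward):
    rule = RULES_CACHE.get(path)
    if rule is None or allow(client_id, rule["limit"], rule["window_seconds"]):
        return forward()  # step 5: forward the request to the API
    # step 5: throttle with 429 and a hint about when to retry
    return 429, {"X-RateLimit-Retry-After": str(rule["window_seconds"])}
```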

Rate Limiting in Distributed Systems

Scaling introduces new challenges.

Race Conditions

Multiple concurrent requests may update counters simultaneously.

Example:

  • Limit = 3 requests/sec
  • Two threads read counter value = 2
  • Both allow requests → limit exceeded

Solutions:

  • Atomic operations (see the sketch after this list)
  • Redis sorted sets
  • Distributed locks (with performance tradeoffs)
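
For example, Redis's INCR executes atomically on the server, so two concurrent requests can never both read the same stale count. A minimal fixed-window sketch, assuming a reachable Redis instance and the redis-py client:

```python
import redis

r = redis.Redis()  # assumes Redis on localhost:6379

def allow(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"rate:{client_id}"
    pipe = r.pipeline()
    pipe.incr(key)                    # atomic server-side increment
    pipe.expire(key, window_seconds)  # arm/refresh the window's TTL
    count, _ = pipe.execute()
    return count <= limit
```

One caveat: refreshing the TTL on every request stretches the window slightly; a small Lua script can set the counter and TTL atomically to keep the window exact.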

Synchronization Problems

In distributed systems:

  • Requests may hit different servers
  • Replication lag causes stale counters
  • Limits become inconsistent

Sticky sessions can help but are usually avoided due to operational complexity.



Centralized Rate Limiting (Global Cache)

A common solution is using a centralized datastore like Redis.

All nodes read and update shared counters.


Tradeoffs:

  • Potential single point of failure
  • Increased latency for global users

Performance Optimization

A better large-scale solution is a multi-data-center architecture.

  • Deploy rate limiter nodes close to users
  • Maintain regional counters
  • Synchronize data using eventual consistency

Benefits:

  • Reduced latency
  • Improved user experience
  • Better global scalability

Monitoring and Observability

After deployment, monitoring is critical.

Track:

  • Rate limit hit frequency
  • False positives
  • Traffic patterns
  • Algorithm effectiveness
  • User impact

Rate limiting is not a “set and forget” system — it requires continuous tuning.


Final Thoughts

Rate limiting is more than just protecting APIs from abuse. It is a core reliability mechanism that:

  • stabilizes systems under load,
  • ensures fairness,
  • and controls operational costs.

Choosing the right algorithm and architecture depends heavily on your traffic patterns, scale, and consistency requirements.

Design it carefully — because at scale, rate limiting becomes part of your system’s resilience strategy.
