Imagine you are building a social website or any large-scale system. Suddenly, a million requests flood your system from a single IP address. Your servers slow down or even crash.
How do we prevent this?
The answer is: Rate Limiting.
In simple terms, a rate limiter restricts the number of requests a user (or IP address) can send within a given time window.
Let’s design one.
Functional Requirements
- Limit the number of requests per user ID or IP address
- Return an error (e.g., HTTP 429) when the limit is exceeded
Non-Functional Requirements
- Low latency while checking the limit (e.g., <10ms)
- High availability (Availability > Consistency)
- Scalable for millions of users
System Function / Endpoint
boolean isAvailable(userId, request)
- If true → forward request to backend
- If false → return 429 (Too Many Requests)
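The contract above can be sketched as a thin gate in front of the backend. The `AllowFirstN` limiter and the `handle` function here are illustrative stand-ins, not a real API:

```python
from http import HTTPStatus

class AllowFirstN:
    """Toy limiter: allows the first n requests, rejects the rest."""
    def __init__(self, n: int):
        self.n = n

    def is_available(self, user_id: str) -> bool:
        if self.n > 0:
            self.n -= 1
            return True
        return False

def handle(user_id: str, limiter) -> int:
    """Gate a request through the rate limiter before the backend sees it."""
    if limiter.is_available(user_id):
        # ... forward request to the backend ...
        return HTTPStatus.OK
    return HTTPStatus.TOO_MANY_REQUESTS

limiter = AllowFirstN(2)
assert handle("u1", limiter) == 200  # first two requests pass
assert handle("u1", limiter) == 200
assert handle("u1", limiter) == 429  # third is rejected
```

Any of the algorithms below can slot in behind `is_available` without changing this outer contract.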
Choosing the Right Algorithm
When we think about limiting requests over time, a natural idea is:
Limit the number of requests in a fixed time window.
But there’s a problem.
Suppose we allow 100 requests per second.
If a user sends:
- 100 requests at the end of second 1
- 100 requests at the beginning of second 2
that's 200 requests within roughly one second, straddling the window boundary.
This is known as the Fixed Window problem.
We don’t want that.
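The boundary problem is easy to reproduce with a minimal fixed-window counter (a simple in-memory sketch, not a production implementation):

```python
import math
from collections import defaultdict

LIMIT = 100  # max requests per 1-second window

counts = defaultdict(int)  # window index -> requests counted in it

def allow(ts: float) -> bool:
    """Fixed window: bucket requests by whole-second window index."""
    window = math.floor(ts)
    if counts[window] < LIMIT:
        counts[window] += 1
        return True
    return False

# 100 requests at t=0.99 land in window 0; 100 more at t=1.01 land in
# window 1. All 200 are allowed, even though they arrive ~0.02s apart.
allowed = sum(allow(0.99) for _ in range(100)) + sum(allow(1.01) for _ in range(100))
print(allowed)  # 200
```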
Sliding Window
To solve this, we can use a sliding window approach.
The idea:
Within any rolling window of the chosen length, the number of requests must not exceed the limit.
This is more accurate but requires storing timestamps of requests.
Implementation might use:
- Sorted sets
- Heaps / priority queues
However, memory usage grows with traffic, since a timestamp must be kept for every request inside the window.
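A sliding window log can be sketched with a plain sorted list per user (in Redis this would typically be a sorted set, pruned with ZREMRANGEBYSCORE):

```python
import bisect
from collections import defaultdict

LIMIT = 100
WINDOW = 1.0  # seconds

logs = defaultdict(list)  # user -> sorted request timestamps

def allow(user: str, now: float) -> bool:
    ts = logs[user]
    # Drop timestamps that fell out of the rolling window.
    cutoff = bisect.bisect_left(ts, now - WINDOW)
    del ts[:cutoff]
    if len(ts) < LIMIT:
        bisect.insort(ts, now)
        return True
    return False

# The burst that fooled the fixed window is now caught:
first = sum(allow("u1", 0.99) for _ in range(100))   # all allowed
second = sum(allow("u1", 1.01) for _ in range(100))  # all rejected
print(first, second)  # 100 0
```

Note the memory cost: each of the 100 allowed requests keeps a timestamp alive for a full window, which is exactly the growth problem mentioned above.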
Token Bucket (Preferred Approach)
- Think of tokens as balls in a bucket.
- Each request consumes one token.
- Tokens are refilled at a fixed rate.
- If no tokens are available → reject the request.
Example:
- Bucket size = 100
- Refill rate = 100 per minute
- If a user sends 100 requests instantly, they must wait until tokens are refilled.
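The example above maps directly onto a small single-process sketch using lazy refill (tokens are topped up on each check rather than by a background timer); a distributed version would keep this state in Redis:

```python
import time

class TokenBucket:
    """Minimal single-process token bucket with lazy refill."""
    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def is_available(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Bucket size 100, refilled at 100 tokens per minute:
bucket = TokenBucket(capacity=100, refill_per_sec=100 / 60, now=0.0)
burst = sum(bucket.is_available(now=0.0) for _ in range(150))
# Only the first 100 of the 150 instant requests pass; the rest
# must wait for tokens to trickle back in.
print(burst)  # 100
```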
Benefits:
- Allows burst traffic (up to bucket capacity)
- Smooths traffic over time
- Flexible and production-friendly

Token Bucket is widely used in real systems.
High-Level Architecture
- The Rate Limiter logic is placed before the backend API.
- Load balancer distributes traffic across multiple app servers.
- A shared Redis store keeps token bucket state.
This ensures:
- Distributed rate limiting
- No single point of failure
- Low latency checks
Bottlenecks
1. Redis Bottleneck
If millions of users hit the system simultaneously, Redis may become the bottleneck.
To scale:
- Use Redis clustering
- Shard keys across multiple Redis nodes
- Use consistent hashing for distribution

If each Redis instance stores the state for 100k users and we need to support 1 million users, we need around 10 Redis nodes.
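A minimal sketch of routing user keys to shards with a stable hash (plain modulo hashing for brevity; real consistent hashing, or Redis Cluster's hash slots, additionally minimizes key movement when nodes are added or removed — node names here are made up):

```python
import hashlib

NODES = [f"redis-{i}" for i in range(10)]  # ~10 nodes for 1M users at 100k each

def shard_for(user_id: str) -> str:
    """A stable hash of the key picks the node holding that user's bucket."""
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

print(shard_for("user:42"))  # same user always lands on the same node
```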
2. Concurrency Problem
What if:
- A user has only 1 token left
- Two requests hit Redis at the same time from different servers?
Redis solves this using atomic operations.
Using:
- Lua scripts
- Atomic commands like INCR
- Or transactions
This prevents race conditions.
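The race disappears as long as the check and the decrement happen as one indivisible step. Here is a pure-Python simulation of that guarantee, using a lock to stand in for what a single Redis Lua EVAL provides (Redis itself needs no lock, since it executes a Lua script atomically):

```python
import threading

class AtomicBucket:
    """Simulates the atomicity Redis gives via a Lua script:
    check-and-decrement is a single indivisible step."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.lock = threading.Lock()

    def try_consume(self) -> bool:
        with self.lock:  # in Redis: one Lua EVAL call
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Two "servers" race for the last remaining token:
bucket = AtomicBucket(tokens=1)
results = []
threads = [threading.Thread(target=lambda: results.append(bucket.try_consume()))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [False, True] -- exactly one request wins the token
```

Without the atomic step, both requests could read "1 token left" and both succeed; with it, exactly one gets through.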
3. Latency Considerations
To reduce latency:
- Keep Redis close to application servers (same region)
- Use cluster topology
- Avoid cross-region calls for rate limit checks
Geographical distance directly impacts response time.
Final Thought
- Functional requirements define what the system does.
- Non-functional requirements define how well it performs at scale.
Rate limiting may look simple, but designing it correctly in distributed systems requires careful thought.
See you in the next post 🚀
