
Design a Scalable Request Rate Limiting Algorithm for an API - System Design

mayank joshi ・ Originally published at nlogn.in ・ 6 min read

System Architecture (4 Part Series)

1) Designing a URL shortening service from scratch to scale million of users.
2) Hacking into Pastebin scalable architecture - System Design
3) Design a Scalable request Rate Limiting Algorithm for API- System Design
4) How to use cache and Redux to make your App blazing fast

Ever wondered how tech giants that provide access to their services through an API control the number of requests a user can make per hour? For example, GitHub allows 5,000 requests per hour per user, while Facebook allows only 200 calls per user per hour.

Let's learn how these tech giants implement rate-limiting.

Introduction to Rate Limiting

Rate limiting is a defensive measure for an API or service. It provides a mechanism to limit the number of requests to our API or service in a given time period. This protects an API or service from malicious attacks, overuse, occasional request spikes, DDoS attacks, brute-force password attempts, and other types of abusive behavior targeting the application layer.

Without rate limiting, each user may send as many requests as they like, which can lead to “spikes” of requests that starve other consumers or even crash our server. Once rate limiting is enabled, each user is limited to a fixed number of requests per second.


Requests without rate limiting vs. requests with rate limiting

From the above diagram, we can see that after a rate limit of 30 requests/minute is applied, if a user makes more than 30 requests in a minute, only the first 30 are accepted and the rest are dropped.

Rate Limiting Algorithms

There are several different algorithms for implementing rate limiting. Let's talk about each of them.

1) Leaky Bucket

It is the simplest algorithm for implementing a rate limiter. It uses a bucket (a queue) to hold the incoming requests. Whenever a new request arrives, it is appended to the rear of the queue, as long as the queue is not full.

The requests are processed at fixed time intervals in first-come, first-served (FCFS) order, i.e. the oldest requests are executed first. If the queue is full, incoming requests are dropped, or "leaked", with a proper message or notification to the client.

Leaky bucket

Rate limiting using Leaky bucket

The advantages of this algorithm:

  1. It smooths bursts of requests by processing them at a constant rate.
  2. Easy to implement.
  3. The size of the queue (buffer) is constant, so it is memory efficient.

The disadvantages of the leaky bucket algorithm:

  1. A burst of traffic can fill up the queue with old requests in one time slot, starving newer requests.
  2. It provides no guarantee that requests will be processed in a fixed amount of time.
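The queue-and-drain behavior described above can be sketched in Python. This is a minimal illustration, not any particular library's implementation; the class and parameter names (`LeakyBucket`, `capacity`, `leak_rate_per_sec`) are my own.

```python
import time
from collections import deque


class LeakyBucket:
    """Leaky bucket rate limiter: requests queue up and are drained
    ("leaked") at a fixed rate; arrivals to a full queue are dropped."""

    def __init__(self, capacity, leak_rate_per_sec):
        self.capacity = capacity            # max requests the queue can hold
        self.leak_rate = leak_rate_per_sec  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Process (remove) requests at the fixed leak rate, oldest first.
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()        # FCFS: oldest request first
            self.last_leak = now

    def allow(self, request_id):
        """Return True if the request was queued, False if dropped."""
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request_id)
            return True
        return False                        # bucket full: drop the request
```

A real deployment would drain the queue from a background worker rather than lazily on arrival, but the accept/drop decision is the same.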
2) Fixed Window

In the fixed window rate limiting algorithm, the timeline is divided into fixed windows (say, 1 minute or 1 hour), and each window has a counter that tracks the number of requests in that window. If the value of the counter exceeds the limit, the remaining requests are dropped.
The counter resets at the start of every window.

Suppose we have a rate limit of 10 requests/hour and a data model like the one below.

fixed window

With this method, if a new request arrives at 12:40 in the above example, we read the count for the bucket (12:00 - 1:00), which is 7. Since 7 is less than our request limit, the request is processed and the count of the current window becomes 8.

Now consider the window (1:00 - 2:00): a request arrives at 1:40 and the value of the counter in this window is 9, which is less than the permissible limit (10), so the request is accepted and the counter becomes 10. No further requests in the window (1:00 - 2:00) will be accepted.

The advantages of this method:

  1. It is easy to implement.
  2. Lower memory requirement, since we store only a single count per time window.
  3. It ensures more recent requests get processed without being starved by old requests (as the counter resets after every window).

The cons of this algorithm:

  1. A single burst of traffic near the boundary of a window can result in twice the allowed rate of requests being processed. Suppose the counter is empty and a spike of 10 requests arrives at 12:59; they are all accepted. Another spike of 10 requests arrives at 1:00; since this is a new window, its counter starts at 0, so these requests are also accepted. The server is now handling 20 requests in about a minute, far exceeding the 10 requests/hour limit.
  2. Many consumers waiting for the window to reset (e.g., during a peak event like a Black Friday sale) can stampede our server at the same time.
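The counter-per-window idea can be sketched as follows. This is an illustrative in-memory version (in production the counters usually live in a shared store such as Redis); the names and the injectable `clock` parameter are my own additions to make the behavior easy to demonstrate.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: one counter per window, reset each window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable clock (for testing)
        self.counters = defaultdict(int)    # window start time -> count

    def allow(self):
        """Return True if the request fits in the current window."""
        # Floor the timestamp to the start of the current window.
        window_start = int(self.clock() // self.window) * self.window
        if self.counters[window_start] < self.limit:
            self.counters[window_start] += 1
            return True
        return False                        # window's quota exhausted
```

Because the counter key changes at every window boundary, a burst just before and just after the boundary is counted against two different windows, which is exactly the boundary problem described above.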
3) Sliding Window

It is based on the fixed window algorithm. Here, instead of completely resetting the counter after every window, we use the previous window's counter to estimate the request rate for the current window.

In the sliding window, instead of a fixed window, we have a rolling window of time to smooth out bursts.

The windows are typically defined by the floor of the current timestamp, so 12:03:15 with a 60-second window length would be in the 12:03:00 window.

Let's say we want to limit an API to 100 requests per hour, and assume there were 84 requests in the window [12:00 - 1:00) and there have been 36 requests in the current window [1:00 - 2:00), which started 15 minutes ago.

sliding window

Now imagine a new request arrives at 1:15. Whether we accept or reject it is decided by an approximation.

The approximate rate is calculated like this:

limit = 100 requests/hour

rate = 84 * ((60-15)/60) + 36
     = 84 * 0.75 + 36
     = 99

rate < 100
    hence, we will accept this request.

Since the estimated number of requests in the rolling window [12:15 - 1:15) is 99, which is less than our limit of 100 requests/hour, this request is accepted. But the very next request at this rate would push the estimate to 100 and be rejected.

This algorithm assumes a constant request rate within the previous window, which is not always true: there can be a spike of requests in one minute and none in another. Hence the result is only an approximation.

Pros of the sliding window:

  1. It smooths out the traffic-spike problem we had in the fixed window method, and it is easy to implement.
  2. It yields an approximate value, but one very close to the accurate value (an analysis of 400 million requests from 270,000 distinct sources showed that only 0.003% of requests were wrongly allowed).
  3. It has very little memory usage: we need to store only 2 numbers per counter.

Note: The sliding window method is used in practice by many companies, e.g. Cloudflare.
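The weighted-count estimate from the worked example above can be sketched like this. Again a minimal in-memory illustration with my own names and an injectable `clock`, assuming only the two counts per window that the algorithm needs.

```python
import time


class SlidingWindowLimiter:
    """Sliding window counter: weight the previous window's count by the
    fraction of it that still overlaps the rolling window, and add the
    current window's count to estimate the request rate."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.prev_count = 0   # requests in the previous window
        self.curr_count = 0   # requests in the current window
        self.curr_start = int(self.clock() // self.window) * self.window

    def allow(self):
        now = self.clock()
        window_start = int(now // self.window) * self.window
        if window_start != self.curr_start:
            # Roll the windows forward. If more than one full window has
            # passed, the old count is stale and carries no weight.
            gap = window_start - self.curr_start
            self.prev_count = self.curr_count if gap == self.window else 0
            self.curr_count = 0
            self.curr_start = window_start
        # Fraction of the rolling window still covered by the previous one.
        weight = (self.window - (now - self.curr_start)) / self.window
        estimated = self.prev_count * weight + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

With the article's numbers (limit 100/hour, 84 requests in the previous window, 36 in the current one, 15 minutes in), the estimate is 84 × 0.75 + 36 = 99, so the request is accepted; the next one would be rejected.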

Did we miss something, or do you want to add some other key points? 🤔
Please comment.

Did you know that system design is a very important topic in product-based company interviews, and almost every tech giant asks about it? At Nlogn we have a dedicated section on system design to help you prepare. Read here.

This post was originally published at nlogn.in


Posted on May 1 by mayank joshi


I love system design, and most of the time I find myself learning or designing systems. I'm highly active on Twitter, so ping me there.


nlogn is a computer science portal specially designed to help you prepare for product-based company interviews by practicing a wide variety of problems related to system design, data structures, and much more.

