


Rate limiting and Throttling - System Design (Explained)

If you're here then you probably came from this link.

How do Applications Rate Limit?

Applications can use a variety of techniques to rate limit their clients.

The basic outcome from the client side is the same though: if you exceed a certain number of requests per time window, your requests will be rejected and the API will return a throttling error such as a ThrottlingException. A throttling exception means what you would expect: either you’re calling too often, or your rate limits are set too low. Either way, you should slow down your rate of calling.

The most popular rate limiting or throttling technique that I’ve encountered in the real world is the Token Bucket Algorithm.

In fact, it’s the most popular method used in Amazon Web Services APIs, so it’s important to be familiar with it if you’re using AWS. Let’s explore it below.

The Token Bucket Algorithm

The Token Bucket Algorithm has two major components, burst and refill (sometimes called sustain). We define them below.

#1 Burst

  • corresponds to the number of Tokens that are available for a client. The Tokens are consumed every time a request comes in.

In this example, imagine that a token is a 1:1 mapping to a drop of water in a bucket.

Burst refers to the size of the bucket, i.e. the maximum number of drops it can hold for the client to consume.


The more burst capacity you have, the more receptive the server is to high-volume but low-frequency traffic.

Keep in mind here that each client has their own bucket.

#2 Refill/Sustain

  • corresponds to the rate at which the backend service refills water into your bucket.

It is essentially the replenishment rate of how fast the backend will give you more opportunities to call the API.


An important concept of the token bucket algorithm is the time unit used to define the burst/refill.

Most APIs define this in seconds. This means that your burst capacity is evaluated on a per-second basis: sending more requests in a single second than the burst size allows will cause throttling.


Similarly, refill/sustain is usually expressed per second: you receive a fixed number of new tokens (drops of water) each second.

The time unit here could be anything though, from milliseconds to seconds to minutes to hours; it’s really up to you.


Another thing to keep in mind is that the burst rate (size) is usually greater than or equal to the refill/sustain rate.

Practically speaking, this makes sense – the bucket capacity we are pouring our water into is usually greater than the rate we are pouring water in.

For example, if we have a bucket with a 1 litre capacity, it wouldn’t make much sense to have a sustain (pouring) rate of 2 litres per second: the bucket can never hold more than 1 litre, so the extra water would simply overflow and be wasted.
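To make the burst and refill ideas concrete, here is a minimal Python sketch of a token bucket. The class name, the parameter names (capacity for burst, refill_rate for refill/sustain) and the injectable clock are my own illustrative choices, not how any particular API implements it.

```python
import time


class TokenBucket:
    """Minimal token bucket: `capacity` is the burst size,
    `refill_rate` is the number of tokens added per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock            # injectable so the bucket can be tested without sleeping
        self.tokens = capacity        # assumption: the bucket starts full
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add tokens for the elapsed time, but never beyond the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost=1):
        """Consume `cost` tokens and return True if available, otherwise return False."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A server would call allow() once per incoming request and reject the request (for example with a throttling error) whenever it returns False.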


So let’s think about some different combinations of high/low burst and sustain limits and the implications they have for the client of a rate limited API.

High Burst, Low Refill/Sustain


This combination means that the client will be allowed to make infrequent, bursty calls to an API. However, if the client makes too many bursts before the bucket has refilled to maximum capacity, calls in the next burst will start to fail once the remaining tokens run out.
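As a rough illustration of this case, the sketch below simulates a bucket in one-second ticks. The numbers (a burst of 10, a refill of 1 token per second) and the simplification that tokens are added once at the start of each tick are assumptions made for the example.

```python
def simulate(capacity, refill_per_second, requests_per_second):
    """Return how many requests are accepted in each one-second tick.

    Assumes the bucket starts full and that tokens are refilled
    at the start of every tick after the first.
    """
    tokens = capacity
    accepted = []
    for tick, requested in enumerate(requests_per_second):
        if tick > 0:
            tokens = min(capacity, tokens + refill_per_second)
        served = min(tokens, requested)   # requests beyond the available tokens are rejected
        tokens -= served
        accepted.append(served)
    return accepted


# Bursts of 10 requests arrive at seconds 0, 3 and 20.
traffic = [10, 0, 0, 10] + [0] * 16 + [10]
accepted = simulate(capacity=10, refill_per_second=1, requests_per_second=traffic)
print(accepted[0], accepted[3], accepted[20])  # prints: 10 3 10
```

The first burst is fully served, the second arrives after only 3 tokens have been refilled so 7 of its calls are rejected, and the third arrives after the bucket is full again.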

Equal Burst and Refill/Sustain


If your burst is equivalent to your refill/sustain, you essentially have a static limit per time unit, e.g. 10 requests allowed per second. This configuration behaves much like the Time Window rate limiting algorithm.
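For comparison, here is a minimal sketch of what a fixed time window limiter could look like. The class and parameter names are hypothetical, and real implementations differ in how they handle window boundaries.

```python
import time


class FixedWindowLimiter:
    """Allows at most `limit` requests in each window of `window_seconds`."""

    def __init__(self, limit, window_seconds=1.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.current_window = None   # index of the window we are currently counting
        self.count = 0

    def allow(self):
        window = int(self.clock() // self.window_seconds)
        if window != self.current_window:
            # A new window has started: reset the counter.
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Unlike the token bucket, there is no stored-up capacity here: unused requests from one window do not carry over to the next.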

The main strength of the Token Bucket Algorithm is that it allows you to build an API that is accommodating of low frequency, bursty workloads. At the same time, you can prevent too many bursts from occurring from a single client by controlling their refill rate.

Implications of a Rate Limited API

Server Perspective

From the server perspective, Rate Limiting means that you need to either use existing rate limiting features in web servers, or build your own to control traffic.

For example, here’s a snippet from NGINX that rate limits an API by client IP address at a rate of 1 request / second:

http {
    # Shared 10 MB zone "one", keyed by client IP, limited to 1 request per second.
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
}

Using IP address is a bit ill-advised here since it’s very easy for customers to change their IP addresses at will.


Ideally, you would want to use some client identity name or access token to identify a client in order to control their rates.

This is how it works with the Twitter API where with each request you attach your developer access token. This lets Twitter identify you consistently and apply rate limits to just you and nobody else.
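If you build your own limiter, keying it by client identity might look something like the sketch below. It keeps one small token bucket per key in a dictionary; the class name, the client_key argument and the in-memory storage are all assumptions for illustration (a real service would more likely keep this state in a shared store).

```python
import time


class PerClientLimiter:
    """Keeps one token bucket per client key (e.g. an access token or API key)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.buckets = {}  # client key -> (tokens, timestamp of last update)

    def allow(self, client_key):
        now = self.clock()
        # A client we have not seen before starts with a full bucket.
        tokens, last = self.buckets.get(client_key, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[client_key] = (tokens, now)
        return allowed
```

Because each key has its own bucket, one noisy client running out of tokens has no effect on anyone else.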

Client Perspective

If you’re a Client, or a user of a rate limited API, there are some important things to be aware of. Most importantly, you as the client need to be aware of your rate limits.

This helps you design your system in such a way that you won’t exceed the rates provisioned by your resource server.

Secondly, it’s important to implement a robust retry policy with exponential backoff when faced with a ThrottlingException from a rate limited API.


In the real world, I usually see a retry policy that consists of 3 attempts, with an exponentially increasing sleep duration between attempts.

This makes it such that your request takes a progressively longer sleep between each attempt, giving the resource server the opportunity to ‘catch up’ and assign you more tokens.
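Here is a minimal sketch of such a retry policy in Python. The ThrottlingException class is a stand-in for whatever error your API client actually raises, and the helper name, the 0.5 second base delay and the doubling factor are assumptions you would tune for your own API.

```python
import time


class ThrottlingException(Exception):
    """Hypothetical stand-in for the throttling error raised by an API client."""


def call_with_backoff(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `func`, retrying on ThrottlingException with exponentially growing sleeps."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottlingException:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Sleep 0.5s after the first failure, 1s after the second, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

You would wrap each API call, e.g. call_with_backoff(lambda: client.get_item(key)), where client.get_item is a placeholder for your own call. Many clients also add a small random jitter to each sleep so that throttled callers don’t all retry at the same moment.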


It’s important to know what Rate Limiting / Throttling is from both the client and the server perspective.

If you’re a developer using an open source API, I guarantee you that you will at some point be facing the dreaded ThrottlingException or RateLimitedExceedException from these APIs. It’s important to know how to handle them in any case.

If you’re a resource owner / service builder, Rate Limiting / Throttling is an important concept that helps regulate the resources of your service per client so that you can ensure a consistent experience for ALL users.

It’s important to know how it works, and some of the algorithms that are available to you. The Token Bucket is one of the most popular and widely used.
