Jairo Junior

Rate Limiting: How to Stop Your API From Drowning in Requests

Hello! I'm Jairo.

Your favorite dev.to writer.

Just kidding — I know I'm not. Just breaking the ice 😄

Last week I was reading an excellent book called System Design Interview by Alex Xu. If you work with backend systems and haven't read it yet, you probably should.

One concept from the book reminded me of something interesting about software engineers: we all know rate limiting exists, but few of us really understand when to use it, how it works internally, and which strategy to choose.

So today let's talk about one of the most important protections your API can have: rate limiting.


What Is Rate Limiting?

Let’s imagine something simple.

Your application is a person walking in the rain, and every raindrop represents an HTTP request hitting your server.

At first, everything is fine. Then the rain gets heavier. More drops. More requests. Eventually your application becomes completely soaked — CPU usage increases, your database starts struggling, and suddenly your server turns into soup.

Not ideal.

What do we do when it's raining?

We grab an umbrella.

(No, not the evil corporation from Resident Evil — a real umbrella.)

That umbrella represents a rate limiter. It doesn't stop the rain entirely; it simply controls how much rain reaches you. In the same way, a rate limiter allows some requests to pass while blocking the excess ones, protecting your system from overload.


What Happens Without Rate Limiting?

Without rate limiting, a client could send:

10 requests per second
100 requests per second
1000 requests per second

Your application will try to process every request it receives. Eventually this leads to CPU overload, database contention, cascading failures, and sometimes a full system crash.

With rate limiting in place, the server can simply respond with:

HTTP 429 - Too Many Requests

Which is basically your server saying:

"Slow down, my friend."


Common Rate Limiting Algorithms

There are several strategies used to implement rate limiting. Each one has different trade-offs depending on the system requirements.

Let's go through the most common ones.


Token Bucket

The Token Bucket algorithm is one of the most widely used rate limiting strategies.

Imagine a bucket filled with tokens. Every incoming request must take one token to be processed. If the bucket still has tokens, the request is allowed. If the bucket is empty, the request is rejected.

Tokens are added back into the bucket at a fixed rate.

Example configuration:

Bucket capacity: 10 tokens
Refill rate: 1 token per second

This allows the system to support short bursts of traffic, but once all tokens are consumed, incoming requests must wait until new tokens are added.

This algorithm is popular because it supports bursts while still controlling overall traffic.
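To make the mechanics concrete, here is a minimal sketch of the lazy-refill variant: instead of a background task, tokens accumulate based on how much time has passed since the last request. All class, field, and parameter names here are my own, chosen for illustration.

```java
// Token bucket that refills lazily based on elapsed time.
public class TimedTokenBucket {

    private final long capacity;
    private final double refillPerMillis;
    private double tokens;
    private long lastRefill;

    public TimedTokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerMillis = tokensPerSecond / 1000.0;
        this.tokens = capacity;                 // start with a full bucket
        this.lastRefill = System.currentTimeMillis();
    }

    public synchronized boolean allowRequest() {
        long now = System.currentTimeMillis();
        // Add tokens for the elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMillis);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;                        // consume one token
            return true;
        }
        return false;
    }
}
```

With the configuration above (capacity 10, 1 token per second), a burst of 10 requests is served immediately, and after that the client is throttled to roughly one request per second.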


Leaky Bucket

The Leaky Bucket algorithm works a bit differently.

Imagine a bucket with a small hole at the bottom. Requests enter the bucket like water, but water leaks out at a constant rate.

If too many requests arrive too quickly, the bucket fills up and eventually overflows, causing new requests to be rejected.

This approach forces requests to be processed at a constant and predictable rate, which helps smooth traffic spikes.

The downside is that it doesn't handle bursts as well as the token bucket strategy.
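The same idea can be sketched with a "water level" counter that drains over time. This is an illustrative sketch (the names are mine), modeling the counter-based form of the leaky bucket rather than the queue-based form.

```java
// Leaky bucket modeled as a counter: each request adds one unit of
// water, and water drains out at a constant rate over time.
public class LeakyBucket {

    private final long capacity;
    private final double leakPerMillis;
    private double water;
    private long lastLeak;

    public LeakyBucket(long capacity, double leaksPerSecond) {
        this.capacity = capacity;
        this.leakPerMillis = leaksPerSecond / 1000.0;
        this.lastLeak = System.currentTimeMillis();
    }

    public synchronized boolean allowRequest() {
        long now = System.currentTimeMillis();
        // Drain water for the elapsed time, never below empty.
        water = Math.max(0.0, water - (now - lastLeak) * leakPerMillis);
        lastLeak = now;
        if (water + 1 <= capacity) {
            water += 1;      // the request fits in the bucket
            return true;
        }
        return false;        // bucket would overflow: reject
    }
}
```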


Sliding Window Log

The Sliding Window Log strategy tracks the timestamp of every request.

For example, if your limit is 5 requests per minute, the system checks all requests that happened within the last 60 seconds.

If five requests already exist in that window, the new request is rejected. If not, it is allowed.

This method is very accurate because it always considers the real time window instead of fixed intervals.

However, it requires storing many timestamps, which can become expensive at large scale.
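A sketch of the log-based approach in Java (illustrative names) keeps a deque of timestamps and evicts the ones that have fallen out of the window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding window log: store a timestamp per accepted request and
// count only those inside the last windowMillis milliseconds.
public class SlidingWindowLog {

    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean allowRequest() {
        long now = System.currentTimeMillis();
        // Evict timestamps that fell out of the window.
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(now);
            return true;
        }
        return false;
    }
}
```

The memory cost is visible here: one stored timestamp per allowed request per client, which is exactly what makes this strategy expensive at scale.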


Sliding Window Counter

The Sliding Window Counter is an optimization of the sliding window idea.

Instead of storing every timestamp, it keeps track of only two counters: requests in the current time window and requests in the previous window.

The algorithm then estimates the real request rate by adding the current window's count to the previous window's count weighted by how much of the previous window still overlaps the sliding window.

This method dramatically reduces memory usage while still providing good accuracy, which is why it is commonly used in large distributed systems.
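Here is a sketch of the weighted estimate (the names are mine, and the timestamp is passed in as a parameter to keep the example easy to follow; a real limiter would read the clock itself):

```java
// Sliding window counter: two counters plus a weighted estimate.
public class SlidingWindowCounter {

    private final int limit;
    private final long windowMillis;
    private long currentWindowStart;
    private int currentCount;
    private int previousCount;

    public SlidingWindowCounter(int limit, long windowMillis, long startMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.currentWindowStart = startMillis;
    }

    public synchronized boolean allowRequest(long now) {
        // Roll the windows forward if we crossed a boundary.
        while (now - currentWindowStart >= windowMillis) {
            previousCount = (now - currentWindowStart >= 2 * windowMillis) ? 0 : currentCount;
            currentCount = 0;
            currentWindowStart += windowMillis;
        }
        // Weight the previous window by how much of it still overlaps
        // the sliding window ending at `now`.
        double overlap = 1.0 - (double) (now - currentWindowStart) / windowMillis;
        double estimated = currentCount + previousCount * overlap;
        if (estimated < limit) {
            currentCount++;
            return true;
        }
        return false;
    }
}
```

For example, with a limit of 5 per minute: if the previous window had 5 requests and we are 30 seconds into the current one, the estimate is 0 + 5 × 0.5 = 2.5, so new requests are still allowed.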


Implementing Rate Limiting in Java

Now let's look at a few simple ways to implement rate limiting in Java.

These examples are simplified but illustrate the idea.


Simple In-Memory Rate Limit

import java.util.concurrent.ConcurrentHashMap;

public class SimpleRateLimiter {

    // Timestamp (ms) of the last allowed request per client.
    private final ConcurrentHashMap<String, Long> lastRequest = new ConcurrentHashMap<>();

    public boolean allowRequest(String clientId) {
        long now = System.currentTimeMillis();
        boolean[] allowed = {false};

        // compute() runs atomically per key, so two concurrent callers
        // for the same client cannot both slip through the check.
        lastRequest.compute(clientId, (id, last) -> {
            if (last == null || now - last > 1000) {
                allowed[0] = true;
                return now;
            }
            return last;
        });

        return allowed[0];
    }
}

This example allows one request per second per client.


Token Bucket Implementation

import java.util.concurrent.atomic.AtomicInteger;

public class TokenBucket {

    private final int capacity = 10;
    private final AtomicInteger tokens = new AtomicInteger(capacity);

    public boolean allowRequest() {
        // Atomically take a token only if one is available; a plain
        // get() followed by decrementAndGet() would race under load.
        int before = tokens.getAndUpdate(t -> t > 0 ? t - 1 : t);
        return before > 0;
    }

    public void refill() {
        // Add one token back, never exceeding the bucket's capacity.
        tokens.updateAndGet(t -> Math.min(capacity, t + 1));
    }
}

A scheduled task can periodically refill tokens.
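One way to wire that up (a sketch; the class and method names are mine) is a ScheduledExecutorService that adds a token back every second, matching the one-token-per-second refill rate from earlier:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Background refill for a token bucket: one token per second,
// capped at the bucket's capacity.
public class ScheduledRefill {

    static final int CAPACITY = 10;
    static final AtomicInteger tokens = new AtomicInteger(CAPACITY);

    // One refill tick: add a token, never exceeding capacity.
    static void refillOnce() {
        tokens.updateAndGet(t -> Math.min(CAPACITY, t + 1));
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Fire refillOnce() every second, starting one second from now.
        scheduler.scheduleAtFixedRate(ScheduledRefill::refillOnce, 1, 1, TimeUnit.SECONDS);
    }
}
```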


Using Resilience4j (Production Approach)

In production systems, engineers often use libraries instead of building rate limiters from scratch.

One popular option is Resilience4j.

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import io.vavr.control.Try;

// Allow at most 5 calls per second; don't wait for a free permit.
RateLimiterConfig config = RateLimiterConfig.custom()
        .limitForPeriod(5)
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .timeoutDuration(Duration.ofMillis(0))
        .build();

RateLimiter rateLimiter = RateLimiter.of("apiLimiter", config);

// Wrap the call so the limiter is consulted before it runs.
Supplier<String> decoratedSupplier =
        RateLimiter.decorateSupplier(rateLimiter, () -> "Hello API");

// Try comes from the Vavr library, which Resilience4j integrates with.
Try.ofSupplier(decoratedSupplier)
        .onFailure(e -> System.out.println("Rate limit exceeded"));

This approach integrates well with Spring Boot and production APIs.


Where Rate Limiting Is Usually Applied

Rate limiting can be implemented in several layers of your architecture.

At the API Gateway, tools like NGINX, Kong, Cloudflare, or AWS API Gateway commonly enforce limits before traffic even reaches your application.

Inside the application layer, libraries like Resilience4j or Bucket4j allow developers to control request flow directly within the service.

For distributed systems, Redis is often used to share rate limit counters across multiple instances.
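The usual pattern there is one counter per (client, time window) key, atomically incremented; in Redis that maps to an INCR plus an EXPIRE on first use. Below is a minimal in-process stand-in (names are mine) that shows the key scheme; a real distributed limiter would replace the map with Redis calls so all instances share the same counters.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Fixed-window counter keyed by (client, window number). In a
// distributed setup the map would be Redis: INCR the key, and
// EXPIRE it when the window ends.
public class WindowCounterLimiter {

    private final int limit;
    private final long windowMillis;
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    public WindowCounterLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public boolean allowRequest(String clientId, long nowMillis) {
        // Every instance derives the same key, so in Redis every
        // instance would increment the same shared counter.
        String key = clientId + ":" + (nowMillis / windowMillis);
        long count = counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
        return count <= limit;
    }
}
```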


Final Thoughts

Rate limiting looks simple on the surface. But once your system begins handling real traffic, it quickly becomes clear how important it is.

A well-designed rate limiter protects your:

  • API
  • infrastructure
  • databases
  • users

And sometimes, the difference between a stable system and an outage is surprisingly simple.

Sometimes your API just needs…

a good umbrella ☔
