Rate Limiting: The 4 Algorithms Behind Every 429

https://www.youtube.com/watch?v=H0SWt7MB0lI

Two terminals. Same curl. Same second. One of them returns a hundred green 200 OK responses. The other slams red at request six. Both are valid APIs. Both send the same status code when they refuse. Behind the refusal — four completely different machines.

This is what every engineer runs into and almost nobody looks at straight on. 429 Too Many Requests isn't a protocol. It's a signal. The machinery that decides when to fire it is a design choice — and the choice is why your one-liner integration breaks at Cloudflare but sails through at Stripe.

A rate limiter is really just a question: how many requests has this client sent in the last N seconds? Four algorithms, four different answers.

Fixed window — cheap and broken

The simplest thing that could possibly work: keep a counter per client, keyed by the current minute. Every request increments it. Past the limit, return 429. At the next minute boundary, reset to zero.

INCR  rate:alice:2026-04-21T14:05
EXPIRE rate:alice:2026-04-21T14:05 60

One number per client. One Redis INCR. Ships in ten lines. It has exactly one bug.
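
Fleshed out, a minimal Python sketch of that ten-liner, assuming the redis-py client (the limit and window values are illustrative):

import time
import redis  # assumes redis-py and a local Redis server

r = redis.Redis()
LIMIT = 100   # requests allowed per window
WINDOW = 60   # window length in seconds

def allow(client_id: str) -> bool:
    # Key the counter by client and by the current window number.
    key = f"rate:{client_id}:{int(time.time() // WINDOW)}"
    count = r.incr(key)        # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, WINDOW)  # first request in the window sets the TTL
    return count <= LIMIT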

Imagine a client fires a hundred requests at 11:59:59.9. The counter climbs from zero to 100, every one allowed. Clock ticks to 12:00:00. Counter slams to zero. A hundred more requests fire at 12:00:00.1, and every one is allowed again. Two hundred requests in two-tenths of a second under a limit of "100 per minute."

Fixed window is still the cheapest thing you can run. It just leaves a door open at every minute boundary. Close that door, and you get the next algorithm.

Sliding window — exact or cheap, pick one

Stop thinking in calendar windows. Keep a list. Every request drops a timestamp on a timeline. Draw a 60-second window. Count only the timestamps inside. As time moves forward, the window slides. Old timestamps fall off the left edge.

No boundary seam. Exactly right.

Exactly expensive. A client at 10,000 requests per hour carries 10,000 timestamps in memory under a one-hour window. Now multiply by every client.
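
For illustration, the exact version fits in a few lines of Python. A minimal in-process sketch, with per-client deques standing in for whatever store you'd really use:

import time
from collections import deque

LIMIT = 100     # requests allowed per window
WINDOW = 60.0   # window length in seconds
logs: dict[str, deque] = {}   # per-client timestamp logs

def allow(client_id: str) -> bool:
    now = time.monotonic()
    log = logs.setdefault(client_id, deque())
    # Timestamps older than the window fall off the left edge.
    while log and now - log[0] > WINDOW:
        log.popleft()
    if len(log) >= LIMIT:
        return False
    log.append(now)
    return True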

Cloudflare faced this at scale and picked an approximation instead. Two counters per client — last minute's count and this minute's — weighted by how far you've slid into the new minute.

rate = prev_count * (window_remaining / window_size) + curr_count

Forty-two requests last minute. Eighteen so far this minute, a quarter of the way through. 42 × 0.75 + 18 = 49.5.

It isn't exact. Cloudflare measured it across 400 million of their requests anyway — wrong answer on three of every hundred thousand. Two numbers per client, close enough to right, runs on GET/SET/INCR.
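
A sketch of the two-counter version, with an in-process dict standing in where Cloudflare uses its edge store (names and limits are illustrative):

import time

LIMIT = 100   # requests allowed per window
WINDOW = 60   # window length in seconds
counters: dict[tuple[str, int], int] = {}   # (client, window index) -> count

def allow(client_id: str) -> bool:
    now = time.time()
    window = int(now // WINDOW)
    elapsed = (now % WINDOW) / WINDOW   # fraction of current window elapsed
    prev = counters.get((client_id, window - 1), 0)
    curr = counters.get((client_id, window), 0)
    # Weight last window's count by the slice of it still inside the
    # sliding 60 seconds: window_remaining / window_size.
    estimate = prev * (1 - elapsed) + curr
    if estimate >= LIMIT:
        return False
    counters[(client_id, window)] = curr + 1
    return True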

Token bucket — stop counting requests, count capacity

Flip the whole mental model. Don't count what came in. Count what's left.

A bucket holds tokens, capped at some capacity C. Tokens drip in at rate R per second. Every request reaches in and grabs one. Empty bucket, rejected. That's the algorithm.

The interesting behavior shows up when a client sits idle. At 10 tokens per second, if they wait 10 seconds, the bucket fills to 100. Now they can fire 100 requests in a single second — every one gets a token. Then the bucket drains, and they're back to steady 10/sec.

Sprint, then jog. That's the feature, not a bug.
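
A minimal sketch of the usual refill-on-read implementation (nobody runs a timer to drip tokens; you top up lazily whenever a request arrives):

import time

class TokenBucket:
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity   # C: most tokens the bucket can hold
        self.rate = rate           # R: tokens dripped in per second
        self.tokens = capacity     # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drip in tokens for the time since the last request, capped at C.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 10 tokens/sec, capacity 100: idle ten seconds, then sprint 100 requests.
bucket = TokenBucket(capacity=100, rate=10)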

Stripe wants a user to be able to load a dashboard in a burst. Then idle. Then another burst. Humans and dashboards and mobile apps do not send at constant rates. Token bucket doesn't make them pretend to.

The algorithm was described in 1986 by Jonathan Turner for ATM networks — 53-byte cells moving over phone lines. Now Stripe, AWS API Gateway, and countless modern APIs all use variants of it. The problem didn't change. Just the packets.

Leaky bucket — the inverse twin

Flip the bucket upside down and you get the other classical algorithm. Requests now pour in at the top. The bucket leaks out the bottom at a fixed rate. Fill it past capacity, and the next request spills over the edge: dropped.

Token bucket polices. Leaky bucket shapes.

Nginx's limit_req directive is a leaky bucket. Configure it rate=1r/s burst=5 and five requests arriving in the same instant don't get rejected — they line up. Nginx drains them one per second to the upstream. Requests six and seven, arriving while the queue is still full, get dropped.
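
The configuration behind that behavior, roughly (the zone name and memory size here are illustrative):

http {
    # One bucket per client IP, draining at 1 request per second.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

    server {
        location / {
            # Queue up to 5 excess requests instead of rejecting them.
            limit_req zone=perip burst=5;
        }
    }
}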

Same mathematical family as token bucket. Different posture. Leaky bucket is what you want between your edge and a downstream that breaks under bursts.
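
nginx holds a real queue. The drop-on-overflow variant, sometimes called the leaky bucket as a meter, is easier to sketch and shows the drain mechanics in isolation (a minimal illustration, not nginx's implementation):

import time

class LeakyBucket:
    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity     # how full the bucket can get
        self.leak_rate = leak_rate   # units drained per second
        self.level = 0.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain whatever has leaked out since the last request.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False   # bucket would overflow: drop
        self.level += 1
        return True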

Four algorithms. One question underneath each — when does the server forget what it's counted?

The takeaway

  • Fixed window forgets at the minute mark. Cheap, simple, broken at the seam.
  • Sliding window forgets as timestamps age out. Exact, or Cloudflare's 99.997%-accurate approximation with two counters.
  • Token bucket doesn't count requests at all — it counts unused capacity. Sit idle, bank tokens, sprint.
  • Leaky bucket is the inverse — requests fill, time drains the tally, overflow drops.

Same 429. Four completely different forgetting strategies.

When you hit 429, the right question isn't "am I sending too much?" It's "which bucket just rejected me?" The answer tells you what to do next:

  • If it was Stripe (token bucket), you probably bursted past capacity — wait a second and the bucket refills.
  • If it was Cloudflare (sliding window), your last 60 seconds of traffic is the counted metric — actually slow down.
  • If it was a legacy fixed-window limiter, you might just be bad-luck-timing the reset boundary — wait for the next minute.
  • If nginx is shaping your traffic (leaky bucket queue), the requests aren't lost — they're queued. Expect latency, not failure.

Same three digits. Four different machines. The choice of machine is the design.
