The Token Bucket Algorithm: Server-Side API Rate Limiting in ~40 Lines
Plenty of tutorials teach you how to survive someone else's rate limit with retries and backoff. Far fewer show you how to build one. If you run an API, you need rate limiting on your side too — to protect your database from a runaway client, keep one noisy tenant from starving everyone else, and give abusive traffic a polite 429 instead of a melted server.
The cleanest algorithm for the job is the token bucket. Let's implement it from scratch, then make it production-ready.
How token bucket works
Picture a bucket that holds up to capacity tokens. Every request removes one token. The bucket refills at a steady refillRate (tokens per second), up to its cap. If a request arrives and the bucket is empty, it's rejected.
This gives you two useful properties at once:
- A sustained rate: the long-run average, set by refillRate.
- A burst allowance: clients can spend the whole bucket at once, set by capacity.
That burst tolerance is why token bucket feels fair. A user who's been quiet for a minute can fire off a batch of requests without being punished for it.
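To see both properties in numbers, here is a small simulation on a virtual clock, which keeps the output deterministic. It is a hypothetical sketch: the simulate function and the request schedule are made up for illustration, with a capacity of 20 and a refillRate of 10 tokens per second.

```javascript
// Hypothetical simulation: a token bucket driven by a virtual clock (ms),
// so results do not depend on how fast the code runs.
function simulate({ capacity, refillRate, requests }) {
  let tokens = capacity; // start full
  let last = 0;
  const results = [];
  for (const atMs of requests) {
    // Lazy refill: credit tokens for the time since the previous request.
    tokens = Math.min(capacity, tokens + ((atMs - last) / 1000) * refillRate);
    last = atMs;
    if (tokens >= 1) {
      tokens -= 1;
      results.push(true); // accepted
    } else {
      results.push(false); // rejected (would be a 429)
    }
  }
  return results;
}

// 25 requests at t=0 (a burst), then one request every 100 ms for 2 s.
const burst = Array(25).fill(0);
const steady = Array.from({ length: 20 }, (_, i) => 100 * (i + 1));
const outcome = simulate({ capacity: 20, refillRate: 10, requests: [...burst, ...steady] });

console.log(outcome.slice(0, 25).filter(Boolean).length); // 20
console.log(outcome.slice(25).every(Boolean)); // true
```

The burst drains the bucket after 20 requests, and the next five are rejected. After that, a client pacing itself at exactly 10 requests per second gets every request through, because each 100 ms gap earns back exactly one token.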
A minimal implementation
Here's a self-contained bucket in JavaScript. No dependencies, no timers — we compute refill lazily based on elapsed time, which is both simpler and more accurate than a background interval.
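class TokenBucket {
  constructor(capacity, refillRatePerSec) {
    this.capacity = capacity;
    this.refillRate = refillRatePerSec;
    this.tokens = capacity; // start full, so a new client can burst
    this.lastRefill = Date.now();
  }

  // Lazily credit tokens for the time since the last call, capped at capacity.
  _refill() {
    const now = Date.now();
    // Date.now() can step backwards (e.g. NTP corrections); never credit negative time.
    const elapsedSec = Math.max(0, now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillRate
    );
    this.lastRefill = now;
  }

  take(cost = 1) {
    if (cost > this.capacity) {
      // Such a request could never succeed, however long the client waited.
      throw new RangeError(`cost ${cost} exceeds capacity ${this.capacity}`);
    }
    this._refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return { ok: true, remaining: Math.floor(this.tokens) };
    }
    // Seconds until enough tokens accrue, rounded up for the Retry-After header.
    const deficit = cost - this.tokens;
    const retryAfter = Math.ceil(deficit / this.refillRate);
    return { ok: false, remaining: Math.floor(this.tokens), retryAfter };
  }
}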
Notice take() returns a retryAfter in seconds when it rejects — that's the value your clients need, and we'll hand it straight to them.
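The rounding in take() matters more than it looks. The HTTP Retry-After header carries a whole number of seconds (delay-seconds in RFC 9110), so the wait is rounded up: rounding down would turn a 0.06 s wait into Retry-After: 0 and invite an immediate retry that fails again. The helper below is hypothetical; it only isolates that arithmetic with a few worked numbers.

```javascript
// Hypothetical helper mirroring the arithmetic in take(): seconds until `cost`
// tokens are available, rounded up because Retry-After takes whole seconds.
function retryAfterSeconds(tokens, cost, refillRatePerSec) {
  const deficit = cost - tokens; // tokens still missing
  return Math.ceil(deficit / refillRatePerSec);
}

// 10 tokens/s, 0.4 left, one-token request: 0.6 missing -> 0.06 s -> 1 s
console.log(retryAfterSeconds(0.4, 1, 10)); // 1
// 0.5 tokens/s (30 requests per minute), empty bucket: 1 / 0.5 = 2 s
console.log(retryAfterSeconds(0, 1, 0.5)); // 2
// A cost-5 request against an empty bucket refilling at 2/s: 2.5 s -> 3 s
console.log(retryAfterSeconds(0, 5, 2)); // 3
```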
Wiring it into Express
Give each API key its own bucket and enforce it as middleware. Ten requests/second sustained, with a burst of up to 20:
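const express = require("express");

const app = express();
const CAPACITY = 20; // burst size
const REFILL_PER_SEC = 10; // sustained rate
const buckets = new Map(); // one bucket per client key (unbounded for now)

function rateLimit(req, res, next) {
  // Authenticate the API key before using it as a bucket id: an unchecked
  // header lets a client mint fresh buckets at will. Behind a load balancer,
  // configure app.set("trust proxy", ...) or req.ip is the proxy's address.
  const key = req.header("x-api-key") || req.ip;
  if (!buckets.has(key)) {
    buckets.set(key, new TokenBucket(CAPACITY, REFILL_PER_SEC));
  }
  const result = buckets.get(key).take();
  // Report capacity as the limit, so Limit and Remaining share one scale.
  res.set("X-RateLimit-Limit", String(CAPACITY));
  res.set("X-RateLimit-Remaining", String(result.remaining));
  if (!result.ok) {
    res.set("Retry-After", String(result.retryAfter));
    return res
      .status(429)
      .type("application/problem+json") // RFC 9457 media type; res.json() keeps it
      .json({
        type: "https://example.com/errors/rate-limit",
        title: "Too Many Requests",
        status: 429,
        detail: `Rate limit exceeded. Retry in ${result.retryAfter}s.`,
      });
  }
  next();
}

app.use("/api", rateLimit);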
Two details that matter: always emit the Retry-After header on a 429 so well-behaved clients know exactly when to come back, and return a structured error body (this one follows RFC 9457 Problem Details) instead of a bare status code.
Making it survive production
The in-memory Map above works great for a single process, but it breaks the moment you scale horizontally: each instance keeps its own bucket, so a client whose requests a load balancer spreads across N instances effectively gets N times the limit. Move the state to Redis and every instance shares one bucket:
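// Atomic refill-and-take via a Lua script (runs server-side in Redis)
// KEYS = [tokensKey, timestampKey]; ARGV = [capacity, refillRate, cost, nowMs]
const LUA = `
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
-- A missing key means a full bucket last touched "now".
local tokens = tonumber(redis.call('get', KEYS[1])) or capacity
local last = tonumber(redis.call('get', KEYS[2])) or now
-- App servers' clocks can disagree; never credit negative elapsed time.
local stamp = math.max(now, last)
tokens = math.min(capacity, tokens + (stamp - last) / 1000 * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end
-- Persist the refilled count even on rejection, or the refill earned since
-- 'last' is thrown away. An idle bucket is full again after capacity / rate
-- seconds, so expiring both keys after that loses nothing.
local ttl = math.ceil(capacity / rate) + 1
redis.call('set', KEYS[1], tokens, 'EX', ttl)
redis.call('set', KEYS[2], stamp, 'EX', ttl)
return allowed
`;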
Running the refill-and-take logic as a single Lua script keeps it atomic — no race between reading the token count and decrementing it, even under thousands of concurrent requests.
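The script still needs a caller. Below is a hypothetical wrapper, assuming an ioredis-style client whose eval(script, numKeys, ...keysThenArgs) resolves to the script's return value; makeRedisLimiter and the key names are illustrative. The {apiKey} part of each key is a Redis Cluster hash tag: a script may only touch keys in a single slot, and the tag pins both keys to the same one.

```javascript
// Hypothetical wrapper: runs the token-bucket script through any client that
// exposes an ioredis-style eval(script, numKeys, ...keysThenArgs) -> Promise.
function makeRedisLimiter(client, script, { capacity, refillRate }) {
  return async function take(apiKey, cost = 1) {
    // Illustrative key names; {apiKey} is a hash tag so both keys share a slot.
    const tokensKey = `ratelimit:{${apiKey}}:tokens`;
    const tsKey = `ratelimit:{${apiKey}}:ts`;
    const allowed = await client.eval(
      script,
      2, // number of KEYS; everything after them becomes ARGV
      tokensKey,
      tsKey,
      capacity,
      refillRate,
      cost,
      Date.now() // ARGV[4]: caller's clock in milliseconds
    );
    return allowed === 1; // the script returns 1 (allowed) or 0 (rejected)
  };
}

// Usage with a stand-in client that records the call instead of hitting Redis.
const calls = [];
const fakeClient = {
  eval: async (...args) => {
    calls.push(args);
    return 1;
  },
};
const take = makeRedisLimiter(fakeClient, "-- script source --", { capacity: 20, refillRate: 10 });
take("key-123").then((ok) => console.log(ok, calls[0].slice(1, 4)));
```

With a real ioredis client you would likely register the script once via defineCommand, so it runs through EVALSHA instead of resending the source on every request.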
A few more things worth doing before you ship: bound your key map (or set a Redis TTL) so idle clients get evicted, decide whether the limit is per-key, per-IP, or per-endpoint, and consider a higher cost for expensive routes so a single heavy query counts as several cheap ones.
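For the in-memory version, one way to bound the map is a least-recently-used cap, sketched here alongside a per-route cost table. BucketRegistry, ROUTE_COST, and costFor are hypothetical names, and the routes and costs are made up for illustration; with Redis, a TTL on the keys plays the same role as the cap.

```javascript
// Hypothetical sketch: cap the number of in-memory buckets and evict the
// least recently used key when the cap is hit. A Map iterates in insertion
// order, so deleting and re-inserting a key on access moves it to the end.
class BucketRegistry {
  constructor(maxKeys, createBucket) {
    this.maxKeys = maxKeys;
    this.createBucket = createBucket; // e.g. () => new TokenBucket(20, 10)
    this.buckets = new Map();
  }

  get(key) {
    let bucket = this.buckets.get(key);
    if (bucket) {
      this.buckets.delete(key); // refresh recency
    } else {
      bucket = this.createBucket();
      if (this.buckets.size >= this.maxKeys) {
        // The first key in iteration order is the least recently used.
        const oldest = this.buckets.keys().next().value;
        this.buckets.delete(oldest);
      }
    }
    this.buckets.set(key, bucket);
    return bucket;
  }
}

// Hypothetical per-route costs: an export counts as five cheap reads.
const ROUTE_COST = { "GET /api/items": 1, "POST /api/reports/export": 5 };
const costFor = (method, path) => ROUTE_COST[`${method} ${path}`] ?? 1;
```

Evicting a drained bucket hands that client a fresh, full one, so size maxKeys well above the number of keys you expect to be active at once. In the middleware, the lookup would become registry.get(key).take(costFor(req.method, req.baseUrl + req.path)); req.baseUrl is needed because req.path is relative to the "/api" mount point.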
Test it before your users do
The tricky part of rate limiting isn't the happy path — it's the boundary. Does the 21st burst request actually get a 429? Is Retry-After accurate? Does the bucket refill on schedule? You want to fire controlled bursts and inspect the exact headers that come back, which is exactly the kind of thing APIKumo makes easy: send repeated requests against your endpoint, watch the X-RateLimit-Remaining and Retry-After headers tick down in real time, and save the whole scenario so you can re-run it every time you touch your limiter. Build the bucket, then prove it behaves.
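Those boundary questions can also be pinned down in a unit test that runs in milliseconds. The class below is a hypothetical variant of TokenBucket that accepts a clock function, so a test can advance time explicitly instead of sleeping; TestableTokenBucket and the fake clock are illustrative names, not part of any library.

```javascript
// Hypothetical variant of TokenBucket with an injectable clock (returns ms).
// The refill-and-take logic otherwise mirrors take() above.
class TestableTokenBucket {
  constructor(capacity, refillRatePerSec, now = Date.now) {
    this.capacity = capacity;
    this.refillRate = refillRatePerSec;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  take(cost = 1) {
    const t = this.now();
    const elapsedSec = Math.max(0, t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = t;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return { ok: true, remaining: Math.floor(this.tokens) };
    }
    const retryAfter = Math.ceil((cost - this.tokens) / this.refillRate);
    return { ok: false, remaining: Math.floor(this.tokens), retryAfter };
  }
}

// Boundary checks for capacity 20 and 10 tokens/s, using a fake clock.
let fakeMs = 0;
const bucket = new TestableTokenBucket(20, 10, () => fakeMs);

const burst = Array.from({ length: 21 }, () => bucket.take());
console.log(burst.filter((r) => r.ok).length); // 20
console.log(burst[20].retryAfter); // 1

fakeMs += 500; // half a second later
const refilled = Array.from({ length: 6 }, () => bucket.take());
console.log(refilled.filter((r) => r.ok).length); // 5
```

The 21st request is rejected with a Retry-After of 1 (0.1 s rounded up), and half a second later exactly five tokens have come back.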