Gabriel Anhaia
A Token Bucket Rate Limiter: a 50-Line In-Memory + 95-Line Redis Variant in TypeScript


Picture the shape of an outage that repeats every quarter somewhere: a SaaS team launches a partner integration. The partner's system retries every non-2xx response with no jitter. A deploy on the partner's side ships a bug that turns one logical event into a multiplier of webhook deliveries. Your service jumps to several times its steady-state RPS in a couple of minutes. The database connection pool saturates, health checks fail, the pager goes off.

Somebody writes a counter: if a single API key sends more than 100 requests per second, return 429. Ship it. The pager goes off again on Tuesday because the counter resets on the wall clock, and a client that fires 95 requests at 12:00:00.999 followed by another 95 at 12:00:01.001 sees both windows accept the burst: 190 requests inside two milliseconds, yet the counter never crossed 100 inside any single window. The counter was wrong.

The naive answer to rate limiting is almost right. Almost-right is what produces the outage. The token bucket algorithm is the answer most production systems converge on, and a working in-memory version fits in about 50 lines of TypeScript. A distributed version on Redis with a Lua script for atomicity fits in roughly 95 lines, including the script.

Why "deny when over X per second" is wrong

Three failure modes in the simple counter approach:

Window boundary bursts. A fixed window resets at second boundaries. A client that fires the limit's worth of traffic in the last 10 ms of one window and the same volume in the first 10 ms of the next has done 2x the intended rate, and no individual window saw a violation.

Bursts that should be allowed are rejected. Real API traffic is bursty. A user silent for 30 seconds who then makes 5 quick calls is not abusing the system, yet a counter that compares "calls in the last second" to a flat per-second threshold below 5 rejects the tail of the burst even though the average rate is well under the limit.

No notion of capacity. The rate limit is a single number. There is no way to express "you can burst to 100 but the steady-state is 10" without bolting on a second counter, which has its own boundary problem.
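
To make the first failure mode concrete, a minimal fixed-window counter, written out only to show the boundary hole (the class and numbers are illustrative, not from the implementations below):

class FixedWindowCounter {
  private windowStart = 0;
  private count = 0;

  constructor(private readonly limit: number) {}

  allow(nowMs: number): boolean {
    const window = Math.floor(nowMs / 1000);
    if (window !== this.windowStart) {
      this.windowStart = window;
      this.count = 0; // the reset that creates the boundary hole
    }
    return ++this.count <= this.limit;
  }
}

// 95 calls at t=999ms and 95 more at t=1001ms: every call passes,
// 190 requests in 2 ms against a nominal limit of 100/s.
const counter = new FixedWindowCounter(100);
for (let i = 0; i < 95; i++) counter.allow(999);  // all true
for (let i = 0; i < 95; i++) counter.allow(1001); // all true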

The algorithms that fix these:

Algorithm               Boundary-burst safe   Allows bursts   State per key
Fixed window            No                    No              1 counter
Sliding window log      Yes                   Yes             N timestamps
Sliding window counter  Approx                Approx          2 counters
Leaky bucket            Yes                   No (smoothed)   1 queue or counter
Token bucket            Yes                   Yes             2 floats

Sliding window log is correct but expensive: storing every request timestamp per key blows up memory under load. Sliding window counter is the cheap approximation many CDNs use. Leaky bucket smooths traffic to a constant output rate, which is what you want for a queue, not for a public API where bursts are normal.

Token bucket maps cleanly to "you have a budget, it refills at rate R, you can spend it as fast as you like up to capacity C". Two floats per key. O(1) update. Allows bursts. No boundary problem.

Token bucket: the model

Each key has a bucket. The bucket has:

  • A capacity C — the maximum number of tokens it can hold.
  • A refill rate R — tokens added per second.
  • A current tokens count and a lastRefill timestamp.

On every request:

  1. Compute how much time has passed since lastRefill.
  2. Add R * elapsed tokens, capped at C.
  3. If tokens >= cost, subtract cost, allow the request.
  4. Otherwise, deny. Tell the client when to retry: (cost - tokens) / R seconds.

Tokens are floats, not integers. That matters because at 10 requests per second a single request "costs" 0.1 seconds of refill time, and you do not want to round it.
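
A quick worked check of that, plain arithmetic rather than anything from the implementation below:

// At R = 10 tokens/s, a request arriving 50 ms after the last
// refill has earned 0.5 tokens. Integer math floors that to 0,
// so a client pacing requests at exactly the allowed rate would
// never accumulate budget.
const refillPerSecond = 10;
const elapsedMs = 50;
const earned = (elapsedMs / 1000) * refillPerSecond; // 0.5
const floored = Math.floor(earned);                  // 0, budget lost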

In-memory implementation

For a single-instance Node service, an in-memory Map<string, BucketState> is the right answer. No network hop, no serialization, no Redis bill.

type BucketState = { tokens: number; lastRefill: number };

export type TokenBucketResult = {
  allowed: boolean;
  remaining: number;
  retryAfterMs: number;
};

export class TokenBucket {
  private buckets = new Map<string, BucketState>();

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
  ) {}

  consume(key: string, cost = 1, now = Date.now()): TokenBucketResult {
    const state = this.buckets.get(key) ?? {
      tokens: this.capacity,
      lastRefill: now,
    };

    // Clamp negative elapsed time: wall clocks can step backwards
    // under NTP, and `now` is caller-supplied. Mirrors the
    // math.max(0, ...) in the Lua variant below.
    const elapsedSec = Math.max(0, now - state.lastRefill) / 1000;
    const refilled = Math.min(
      this.capacity,
      state.tokens + elapsedSec * this.refillPerSecond,
    );

    if (refilled >= cost) {
      const next: BucketState = {
        tokens: refilled - cost,
        lastRefill: now,
      };
      this.buckets.set(key, next);
      return { allowed: true, remaining: next.tokens, retryAfterMs: 0 };
    }

    const deficit = cost - refilled;
    const retryAfterMs = Math.ceil(
      (deficit / this.refillPerSecond) * 1000,
    );
    this.buckets.set(key, { tokens: refilled, lastRefill: now });
    return { allowed: false, remaining: refilled, retryAfterMs };
  }

  // Optional: prune idle buckets to bound memory.
  prune(now = Date.now(), idleMs = 60 * 60 * 1000): number {
    let removed = 0;
    for (const [k, s] of this.buckets) {
      if (now - s.lastRefill > idleMs) {
        this.buckets.delete(k);
        removed++;
      }
    }
    return removed;
  }
}

Around 50 lines with the types and the optional prune method (38 if you cut prune and the TokenBucketResult type). It is "atomic" because Node's event loop runs consume to completion before scheduling anything else: no two callers see a stale read of the same key inside a tick. That property does not hold under worker threads or cluster mode (PM2 in cluster mode counts, even though "single Node service" makes it sound like it should not). For those, use the Redis variant.

prune is the only operational concern. Without it the Map grows without bound across keys you never see again. Run it on an interval:

const limiter = new TokenBucket(100, 10); // 100 capacity, 10/s refill

setInterval(() => limiter.prune(), 10 * 60 * 1000).unref();

unref() so the interval does not keep the process alive on its own. The prune cadence and the idleMs threshold are independent: the interval can be tighter or looser than idleMs; it just bounds how long an idle bucket lingers before it is freed.

Distributed: Redis with a Lua script

Once you have two Node processes behind a load balancer, the in-memory Map is wrong. Each process has its own view of the bucket. A malicious client can deliberately spread traffic across instances to multiply the effective limit by the instance count.

Redis is the standard answer. It executes a Lua script as one atomic command on its single-threaded core, which is what makes the read-modify-write safe under concurrent access.

The naive approach has a race: GET, compute, SET from the Node side. Two processes read tokens=5, both decide they have budget, both write tokens=4. The bucket lost a token. Under burst load this race fires constantly.
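
For contrast, a sketch of that racy shape (refill math omitted, the ioredis client from below assumed; this is the version not to ship):

// Read-modify-write from the Node side. Two processes can both
// run the HMGET before either runs the HSET, and a token is lost.
async function racyConsume(redis: Redis, key: string): Promise<boolean> {
  const [raw] = await redis.hmget(key, "tokens");
  const tokens = raw === null ? 100 : parseFloat(raw);
  if (tokens < 1) return false;
  // Another process can read the same `tokens` value right here.
  await redis.hset(key, "tokens", String(tokens - 1));
  return true;
}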

The script below reads, computes, and writes in one indivisible step.

-- token_bucket.lua
local key       = KEYS[1]
local capacity  = tonumber(ARGV[1])
local refill    = tonumber(ARGV[2]) -- tokens per second
local now_ms    = tonumber(ARGV[3])
local cost      = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1])
local ts     = tonumber(data[2])

if tokens == nil then
  tokens = capacity
  ts = now_ms
end

local elapsed = math.max(0, now_ms - ts) / 1000
tokens = math.min(capacity, tokens + elapsed * refill)

local allowed = 0
local retry_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  local deficit = cost - tokens
  retry_ms = math.ceil((deficit / refill) * 1000)
end

redis.call("HSET", key, "tokens", tokens, "ts", now_ms)
-- Expire idle buckets so Redis does not grow without bound.
redis.call("PEXPIRE", key, math.ceil((capacity / refill) * 1000) * 2)

return { allowed, tostring(tokens), retry_ms }

The TypeScript side, with ioredis:

import Redis from "ioredis";
import { createHash } from "node:crypto";
// Shared result type from the in-memory module above; adjust the
// path to wherever that file lives in your repo.
import type { TokenBucketResult } from "./token-bucket";

// Inline the script as a template literal so the module
// is not cwd-fragile. A `readFileSync` from a relative
// path breaks the moment a tool runs from a different cwd.
const SCRIPT = `
local key       = KEYS[1]
local capacity  = tonumber(ARGV[1])
local refill    = tonumber(ARGV[2])
local now_ms    = tonumber(ARGV[3])
local cost      = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1])
local ts     = tonumber(data[2])
if tokens == nil then tokens = capacity; ts = now_ms end

local elapsed = math.max(0, now_ms - ts) / 1000
tokens = math.min(capacity, tokens + elapsed * refill)

local allowed, retry_ms = 0, 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  retry_ms = math.ceil(((cost - tokens) / refill) * 1000)
end

redis.call("HSET", key, "tokens", tokens, "ts", now_ms)
redis.call("PEXPIRE", key, math.ceil((capacity / refill) * 1000) * 2)
return { allowed, tostring(tokens), retry_ms }
`;
const SHA = createHash("sha1").update(SCRIPT).digest("hex");

export class RedisTokenBucket {
  constructor(
    private readonly redis: Redis,
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    private readonly prefix = "rl:",
  ) {}

  async consume(
    key: string,
    cost = 1,
  ): Promise<TokenBucketResult> {
    const fullKey = this.prefix + key;
    const args = [
      String(this.capacity),
      String(this.refillPerSecond),
      String(Date.now()),
      String(cost),
    ];

    const raw = (await this.evalCached(fullKey, args)) as [
      number,
      string,
      number,
    ];

    return {
      allowed: raw[0] === 1,
      remaining: parseFloat(raw[1]),
      retryAfterMs: raw[2],
    };
  }

  private async evalCached(key: string, args: string[]) {
    try {
      return await this.redis.evalsha(SHA, 1, key, ...args);
    } catch (err) {
      // NOSCRIPT: cold start, SCRIPT FLUSH, failover, restart.
      // Fall back to EVAL, which executes and loads in one round-trip.
      if (
        err instanceof Error &&
        err.message.includes("NOSCRIPT")
      ) {
        return this.redis.eval(SCRIPT, 1, key, ...args);
      }
      throw err;
    }
  }
}

EVALSHA saves the cost of shipping the script body on every call. EVAL on the cold-start path keeps it to a single round-trip: EVAL runs the script and caches it server-side, so subsequent calls hit EVALSHA. The same fallback handles SCRIPT FLUSH, failover, and restart. node-redis v4 has the same shape with defineScript; ioredis also exposes defineCommand if you want to skip the manual dance.
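
A sketch of that defineCommand route, which registers the script once and lets ioredis handle the EVALSHA/EVAL dance internally (defineCommand is real ioredis API; the cast stands in for the declaration merging a real codebase would add):

import Redis from "ioredis";

const redis = new Redis();
redis.defineCommand("tokenBucket", {
  numberOfKeys: 1,
  lua: SCRIPT, // the same Lua body as above
});

// defineCommand attaches the method at runtime, so the type
// system does not know about it without declaration merging.
const raw = (await (redis as any).tokenBucket(
  "rl:some-api-key",
  "100",              // capacity
  "10",               // refill per second
  String(Date.now()), // now_ms
  "1",                // cost
)) as [number, string, number];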

Two processes hitting the same key at the same wall-clock millisecond will not race on the tokens value. Redis serializes them.

The script trusts ARGV[3] (the client's Date.now()), so two Node processes with skewed clocks can submit a "now" that is older than the previous one stored under the key. The math.max(0, now_ms - ts) line in the Lua is what saves you: a negative elapsed is clamped to zero, the bucket simply does not refill on that call, and the next caller with a forward-going clock catches up. Using redis.call("TIME") instead would remove the skew, but nondeterministic commands inside Lua carry replication caveats on older Redis versions (whole-script replication rejected writes after TIME until effects replication became the default), and passing the clock in keeps the script deterministic and easy to test. Pass the client clock; keep the clamp.
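
The in-memory class carries the same clamp; a quick check of the behavior it buys, using the consume signature from earlier:

const bucket = new TokenBucket(10, 1);
bucket.consume("k", 1, 1_000);         // 9 tokens left, lastRefill = 1000
const r = bucket.consume("k", 1, 500); // caller's clock went backwards
// elapsed clamps to 0: no refill on this call, but no phantom
// drain either. r.remaining is 8, and the next forward-going
// clock catches the bucket up.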

When Redis is unavailable

Fail open. A Redis blip should not become a full outage just because the limiter cannot read state. You lose the protection during the blip and keep the service up; emit a metric and a log line so the fallback rate is visible and someone notices when "blip" turns into "ten minutes". The limiter is a defense layer. Treating it as a critical path inverts the whole point.

async function checkLimit(
  limiter: RedisTokenBucket,
  key: string,
): Promise<TokenBucketResult> {
  try {
    return await limiter.consume(key);
  } catch (err) {
    // `rateLimiterErrors` and `classify` are assumed helpers: a
    // metrics counter (for example a prom-client Counter) and an
    // error-to-label mapper, both defined elsewhere in the service.
    rateLimiterErrors.inc({ kind: classify(err) });
    // Fail open: allow the request, report zero known budget.
    return { allowed: true, remaining: 0, retryAfterMs: 0 };
  }
}

A more conservative variant keeps a small in-memory bucket as a backstop. The math is approximate during a Redis outage but caps the worst case at "per-instance limit" rather than "no limit".
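
A sketch of that shape, assuming both classes from earlier (the composition is illustrative, not a drop-in):

// When Redis errors, fall back to a per-instance in-memory bucket
// with the same parameters. Worst case during an outage becomes
// "limit times instance count", not "no limit at all".
export class BackstopLimiter {
  private readonly local: TokenBucket;

  constructor(
    private readonly remote: RedisTokenBucket,
    capacity: number,
    refillPerSecond: number,
  ) {
    this.local = new TokenBucket(capacity, refillPerSecond);
  }

  async consume(key: string, cost = 1): Promise<TokenBucketResult> {
    try {
      return await this.remote.consume(key, cost);
    } catch {
      return this.local.consume(key, cost);
    }
  }
}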

Choosing capacity and refill rate

The numbers most teams write down are wrong on the first try because they come from gut feel rather than real traffic. Pull the p99 request rate per key from the last 7 days. If your busiest API key does 8 requests per second at p99, set the refill rate at roughly 2x that, so refill = 16. The 2x gives headroom for legitimate spikes.

Capacity is the burst budget. Five to ten seconds of refill is the range most production tuning lands in: with refill = 16, that puts capacity somewhere between 80 and 160. A user can absorb a normal-shaped burst (cron jobs, page loads with parallel calls, batch sync at the start of a session) and then settle into the steady-state.

Two more shapes worth keeping separate from the API-key bucket (a configuration sketch follows the list):

  • Unauthenticated traffic. A tighter IP-keyed bucket (refill = 5, capacity = 50) catches credential-stuffing and scraping without punishing logged-in users.
  • Per-endpoint buckets. A POST /payments bucket should be much tighter than a GET /products bucket. Same algorithm, different tuple per route.
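
What that policy table can look like, using the in-memory class from earlier (the numbers and route names are illustrative):

// One bucket instance per policy; the (capacity, refill) tuple is
// the policy, and the key function decides who shares a bucket.
const perApiKey = new TokenBucket(160, 16); // authenticated steady-state
const perIp     = new TokenBucket(50, 5);   // unauthenticated traffic
const payments  = new TokenBucket(10, 1);   // tight bucket for POST /payments

// Wired through the rateLimit middleware shown in the next section:
//   app.post("/payments", rateLimit(payments, (req) => req.ip), handler)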

Public reference points:

  • Cloudflare's Rate Limiting Rules plan availability caps the free tier at a small number of rules per zone with coarse counting windows; the paid tiers raise the rule count and unlock per-second windows. Numbers and tier mapping shift; check Cloudflare's plans page for the current quota (verified 2026-04).
  • AWS API Gateway documents account-level throttling at a default of 10,000 requests per second steady-state with a 5,000-request burst per region (token bucket under the hood). Defaults vary by region and AWS occasionally updates them; verified 2026-04.
  • Stripe's public rate limits are 100 read and 100 write operations per second in live mode (lower in test mode, with separate quotas for endpoints like Files and Search). Idempotency keys (Stripe's idempotency docs) ensure retries inside the limit do not double-charge.

The Stripe pattern is the one to copy: rate limit on top of an idempotency key. A retry of the same logical request consumes one token (the first attempt) and reuses the cached response on the second call, so the limit punishes accidental retries less than real burst traffic.
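
A sketch of that composition, with a hypothetical response cache keyed by the client's Idempotency-Key header (the cache interface and handler shape are assumptions, not Stripe's implementation):

// A replay of a previously counted request returns the cached
// response and costs no token; only first attempts spend budget.
type ResponseCache = {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
};

async function handleIdempotent(
  limiter: RedisTokenBucket,
  cache: ResponseCache,
  apiKey: string,
  idempotencyKey: string | undefined,
  run: () => Promise<string>,
): Promise<{ status: number; body: string }> {
  if (idempotencyKey) {
    const cached = await cache.get(idempotencyKey);
    if (cached !== null) return { status: 200, body: cached };
  }
  const res = await limiter.consume(apiKey);
  if (!res.allowed) return { status: 429, body: "rate_limited" };
  const body = await run();
  if (idempotencyKey) await cache.set(idempotencyKey, body);
  return { status: 200, body };
}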

Surfacing the limit to clients

Returning HTTP 429 with no metadata leaves the caller guessing. Three conventions exist:

  • Retry-After: <seconds> — the standard (RFC 9110). Integer seconds or an HTTP-date.
  • X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset — GitHub-style headers most clients expect. Not standardised, but conventional.
  • RateLimit and RateLimit-Policy — the IETF draft-ietf-httpapi-ratelimit-headers. Worth emitting alongside the legacy headers for forward compatibility.

A small Express middleware that wraps the limiter (the same shape ports to Fastify). The thing worth flagging up front: a union of TokenBucket | RedisTokenBucket cannot be narrowed with "consume" in limiter, because both classes have a consume method. Awaiting a sync value is a no-op, so the safe shape is to type the input as a thing that returns either TokenBucketResult or Promise<TokenBucketResult> and await it unconditionally:

import type { Request, Response, NextFunction } from "express";
// Shared result type from the limiter modules above; adjust the path.
import type { TokenBucketResult } from "./token-bucket";

type Limiter = {
  consume(key: string, cost?: number):
    | TokenBucketResult
    | Promise<TokenBucketResult>;
};

export function rateLimit(
  limiter: Limiter,
  keyFor: (req: Request) => string,
) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = keyFor(req);
    // `await` of a sync value is a no-op, so this works for both
    // TokenBucket (sync) and RedisTokenBucket (async) without
    // a discriminator. Using `"consume" in limiter` is a real bug:
    // both classes have `consume`, the union is not narrowed, and
    // the sync branch silently returns a Promise that 200s requests
    // that should have been 429-d.
    const result = await limiter.consume(key);

    res.setHeader(
      "X-RateLimit-Remaining",
      String(Math.floor(result.remaining)),
    );

    if (!result.allowed) {
      const retryAfterSec = Math.ceil(result.retryAfterMs / 1000);
      res.setHeader("Retry-After", String(retryAfterSec));
      res.status(429).json({
        error: "rate_limited",
        retryAfterMs: result.retryAfterMs,
      });
      return;
    }

    next();
  };
}

keyFor is where the policy lives: API-key for authenticated traffic, IP for unauthenticated, route-prefixed for per-endpoint buckets. The middleware itself is dumb; the key function is what makes it useful.

A polite client respects Retry-After. An impolite one retries immediately and burns through its remaining tokens trying. The limiter handles both; the headers help the polite ones be polite.
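
What the polite side looks like, a hedged sketch against the global fetch in Node 18+ (the retry cap and the fallback delay are arbitrary choices, not a prescription):

// Honor Retry-After (integer seconds per RFC 9110) before
// retrying a 429, up to a small attempt cap.
async function politeFetch(
  url: string,
  init?: RequestInit,
  maxRetries = 3,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const seconds = Number(res.headers.get("retry-after"));
    // Fall back to 1s when the header is absent or an HTTP-date.
    const delayMs = Number.isFinite(seconds) && seconds > 0
      ? seconds * 1000
      : 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}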

Where you do not need this

A rate limiter inside your Node service is not a substitute for the layers above it. If you sit behind Cloudflare, AWS WAF, or any CDN with rate-limiting features, the first line of defense is there. The edge handles raw flood traffic before it costs you a request. AWS API Gateway's account-level throttle sits in front of your Lambda. NGINX limit_req handles abusive single-IP traffic without waking up Node.

The application-layer token bucket exists for the policies the edge cannot express. That covers per-API-key limits when the key sits in a header the edge does not parse, and per-endpoint limits that vary by business value (POST /payments vs GET /products). It also covers two trickier cases:

  • Per-tenant limits where the tenant id is a JWT claim only your service knows how to validate.
  • Idempotency-aware limits (the Stripe pattern above) where the limiter has to know whether the body is a retry of a previously-counted request.

If your traffic shape is "one IP, one route, one limit", the edge handles it. The bucket in code earns its keep when the policy is "fifty endpoints, ten tenants per endpoint, three plan tiers per tenant".

Once the bucket is in your repo, you stop writing one-off counters and start writing policy. The policy is the part that changes; the bucket stays put.


If this was useful

A working rate limiter sits at the boundary between your service and the rest of the world. TypeScript in Production is the book for that boundary — tsconfig, build, library shape, the bits that decide whether your code survives contact with real traffic.

The full TypeScript Library, five books:

  • TypeScript Essentials — daily-driver TS across Node, Bun, Deno, and the browser: Amazon
  • The TypeScript Type System — generics, mapped/conditional types, template literals, branded types: Amazon
  • Kotlin and Java to TypeScript — bridge for JVM developers: Amazon
  • PHP to TypeScript — bridge for modern PHP 8+ developers: Amazon
  • TypeScript in Production — tsconfig, build, monorepos, library authoring: Amazon

Books 1 and 2 are the core path. Books 3 and 4 substitute for readers coming from JVM or PHP. Book 5 is for shipping TS at work.

All five books ship in ebook, paperback, and hardcover.

The TypeScript Library — the 5-book collection
