Young Gao

Advanced API Rate Limiting: Sliding Windows, Token Buckets, and Distributed Counters (2026)

Every production API hits the same inflection point: traffic grows, abuse appears, and suddenly you need to answer the question "how many requests should I allow, and for whom?" Rate limiting sounds simple until you run multiple servers, need sub-second accuracy, and have endpoints with wildly different costs.

This is the third installment in the Production Backend Patterns series. We will walk through five rate limiting algorithms, implement the production-relevant ones in TypeScript with Redis, and then tackle the hard parts: distributed coordination, burst handling, cost-based limits, and the headers your clients actually need.

The Five Algorithms, Visualized

Before writing any code, let's build intuition for how each algorithm behaves. Imagine a limit of 10 requests per minute.

Fixed Window

Minute 1          Minute 2          Minute 3
[||||||||  ]      [||||||||||]      [|||       ]
 8 allowed         10 (full)         3 so far

               ^ boundary: counter resets

The window is aligned to clock boundaries (e.g., 12:00:00 - 12:00:59). A counter increments per request and resets at the boundary. The flaw is obvious: a user can send 10 requests at 12:00:59 and another 10 at 12:01:00 -- 20 requests in two seconds while the limit is "10 per minute."

Sliding Window Log

Timeline:  --[--r---r--r-----r--r---r--]-->
                                         ^
                      window slides with each request
              only requests within the trailing 60s count

Every request timestamp is stored. On each new request, you count how many timestamps fall within the last 60 seconds. Accurate, but storing every timestamp is memory-expensive at scale.
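To make the log approach concrete, here is a minimal in-memory sketch (the class name and shape are illustrative, not part of the limiters below). A production version would typically keep the timestamps in a Redis sorted set: ZADD to record, ZREMRANGEBYSCORE to prune, ZCARD to count.

```typescript
// Minimal in-memory sliding window log. Production versions usually store
// timestamps in a Redis sorted set instead of a Map of arrays.
class SlidingWindowLog {
  private log = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  consume(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only timestamps inside the trailing window
    const recent = (this.log.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.log.set(key, recent);
      return false; // over limit: reject without recording
    }
    recent.push(now);
    this.log.set(key, recent);
    return true;
  }
}
```

Note that pruning happens lazily on each call, which is exactly why the memory cost grows with request volume: every allowed request leaves a timestamp behind until it slides out of the window.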

Sliding Window Counter (Hybrid)

Previous window weight: 40%    Current window weight: 60%
[    7 reqs    ] [  4 reqs so far  ]
                      ^-- 36s into 60s window

Estimated count = 7 * 0.40 + 4 = 6.8 --> under limit of 10

This blends the previous and current fixed window counts using a weighted average based on how far into the current window you are. Near-zero memory overhead and surprisingly accurate.
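The weighting fits in a few lines. `slidingEstimate` here is a hypothetical helper to isolate the arithmetic, not part of the limiter classes later in the post:

```typescript
// Weighted-average estimate: prevCount requests in the previous fixed window,
// currCount so far in the current one, elapsedSec into the current window.
function slidingEstimate(
  prevCount: number,
  currCount: number,
  elapsedSec: number,
  windowSec: number
): number {
  const prevWeight = (windowSec - elapsedSec) / windowSec;
  return prevCount * prevWeight + currCount;
}

// Matches the diagram above: 36s into a 60s window gives the previous
// window a weight of 0.4, so the estimate is 7 * 0.4 + 4 = 6.8.
```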

Token Bucket

Bucket capacity: 10 tokens
Refill rate: 10 tokens / minute

[@@@@@@@@@@]  --> full bucket (10 tokens)
[@@@@@@    ]  --> 4 requests consumed 4 tokens
[@@@@@@@@@ ]  --> tokens refilled over time
[          ]  --> burst of 10 exhausts bucket
              --> must wait for refill

Tokens accumulate at a steady rate up to a maximum capacity. Each request consumes one (or more) tokens. This naturally allows bursts up to the bucket size while enforcing a long-term average rate.

Leaky Bucket

Requests pour in at variable rate:
    |||  |    ||||||    |   ||

    v v v v v v v v v v v v v
   [  queue / bucket  ]
    |   |   |   |   |   |
    v   v   v   v   v   v
   Processed at fixed rate

Requests enter a queue that drains at a constant rate. If the queue is full, new requests are rejected. This produces the smoothest output rate but adds latency because requests wait in the queue.

Implementing the Algorithms in TypeScript with Redis

All implementations share a common interface:

interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  retryAfter?: number; // seconds until next request is allowed
  resetAt?: number;    // unix timestamp when the window resets
}

interface RateLimiter {
  consume(key: string, cost?: number): Promise<RateLimitResult>;
}

We use ioredis throughout:

import Redis from "ioredis";
const redis = new Redis({ host: "127.0.0.1", port: 6379 });

Fixed Window

class FixedWindowLimiter implements RateLimiter {
  constructor(
    private redis: Redis,
    private limit: number,
    private windowSec: number
  ) {}

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const now = Math.floor(Date.now() / 1000);
    const window = Math.floor(now / this.windowSec) * this.windowSec;
    const redisKey = `rl:fw:${key}:${window}`;

    const results = await this.redis
      .multi()
      .incrby(redisKey, cost)
      .expire(redisKey, this.windowSec)
      .exec();

    // exec() returns [error, result] pairs; take the INCRBY result
    const current = results![0][1] as number;
    const resetAt = window + this.windowSec;

    return {
      allowed: current <= this.limit,
      limit: this.limit,
      remaining: Math.max(0, this.limit - current),
      resetAt,
      retryAfter: current > this.limit ? resetAt - now : undefined,
    };
  }
}

Sliding Window Counter

This is the production workhorse. It approximates a sliding window using two fixed windows and a weighted average, requiring only two Redis keys and no stored timestamps.

class SlidingWindowLimiter implements RateLimiter {
  constructor(
    private redis: Redis,
    private limit: number,
    private windowSec: number
  ) {}

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const now = Math.floor(Date.now() / 1000);
    const currentWindow = Math.floor(now / this.windowSec) * this.windowSec;
    const previousWindow = currentWindow - this.windowSec;
    const elapsed = now - currentWindow;
    const weight = (this.windowSec - elapsed) / this.windowSec;

    const prevKey = `rl:sw:${key}:${previousWindow}`;
    const currKey = `rl:sw:${key}:${currentWindow}`;

    const [prevCount, currCount] = await this.redis
      .mget(prevKey, currKey)
      .then((r) => r.map((v) => parseInt(v ?? "0", 10)));

    const estimated = Math.floor(prevCount * weight) + currCount;

    if (estimated + cost > this.limit) {
      return {
        allowed: false,
        limit: this.limit,
        remaining: 0,
        retryAfter: this.windowSec - elapsed,
        resetAt: currentWindow + this.windowSec,
      };
    }

    await this.redis
      .multi()
      .incrby(currKey, cost)
      .expire(currKey, this.windowSec * 2)
      .exec();

    return {
      allowed: true,
      limit: this.limit,
      remaining: Math.max(0, this.limit - estimated - cost),
      resetAt: currentWindow + this.windowSec,
    };
  }
}

Token Bucket

The token bucket is ideal when you want to allow bursts. We store two values in a Redis hash: the token count and the last refill timestamp. A Lua script makes the check-and-update atomic.

class TokenBucketLimiter implements RateLimiter {
  private script: string;

  constructor(
    private redis: Redis,
    private capacity: number,
    private refillRate: number, // tokens per second
    private windowSec: number
  ) {
    this.script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local cost = tonumber(ARGV[4])
      local ttl = tonumber(ARGV[5])

      local data = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(data[1])
      local last_refill = tonumber(data[2])

      if tokens == nil then
        tokens = capacity
        last_refill = now
      end

      local elapsed = math.max(0, now - last_refill)
      tokens = math.min(capacity, tokens + elapsed * refill_rate)
      last_refill = now

      local allowed = 0
      local retry_after = 0

      if tokens >= cost then
        tokens = tokens - cost
        allowed = 1
      else
        retry_after = math.ceil((cost - tokens) / refill_rate)
      end

      redis.call('HSET', key, 'tokens', tokens, 'last_refill', last_refill)
      redis.call('EXPIRE', key, ttl)

      return {allowed, math.floor(tokens), retry_after}
    `;
  }

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const now = Date.now() / 1000;
    const redisKey = `rl:tb:${key}`;

    const result = (await this.redis.eval(
      this.script, 1, redisKey,
      this.capacity, this.refillRate, now, cost, this.windowSec * 2
    )) as number[];

    return {
      allowed: result[0] === 1,
      limit: this.capacity,
      remaining: result[1],
      retryAfter: result[2] > 0 ? result[2] : undefined,
    };
  }
}

Leaky Bucket

The leaky bucket can be modeled as a counter that drains at a fixed rate. We use the same Lua-script-in-Redis approach.

class LeakyBucketLimiter implements RateLimiter {
  private script: string;

  constructor(
    private redis: Redis,
    private capacity: number,
    private drainRate: number, // requests drained per second
    private windowSec: number
  ) {
    this.script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local drain_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])
      local cost = tonumber(ARGV[4])
      local ttl = tonumber(ARGV[5])

      local data = redis.call('HMGET', key, 'level', 'last_drain')
      local level = tonumber(data[1]) or 0
      local last_drain = tonumber(data[2]) or now

      local elapsed = math.max(0, now - last_drain)
      level = math.max(0, level - elapsed * drain_rate)
      last_drain = now

      local allowed = 0
      local retry_after = 0

      if level + cost <= capacity then
        level = level + cost
        allowed = 1
      else
        retry_after = math.ceil((level + cost - capacity) / drain_rate)
      end

      redis.call('HSET', key, 'level', level, 'last_drain', last_drain)
      redis.call('EXPIRE', key, ttl)

      return {allowed, math.floor(capacity - level), retry_after}
    `;
  }

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const now = Date.now() / 1000;
    const redisKey = `rl:lb:${key}`;

    const result = (await this.redis.eval(
      this.script, 1, redisKey,
      this.capacity, this.drainRate, now, cost, this.windowSec * 2
    )) as number[];

    return {
      allowed: result[0] === 1,
      limit: this.capacity,
      remaining: Math.max(0, result[1]),
      retryAfter: result[2] > 0 ? result[2] : undefined,
    };
  }
}

Choosing Your Rate Limit Key

The "key" in rate limiting determines who is being limited. Different strategies serve different purposes:

type KeyExtractor = (req: Request) => string;

const keyStrategies: Record<string, KeyExtractor> = {
  // For public endpoints: limit by IP
  ip: (req) => {
    return req.headers.get("x-forwarded-for")?.split(",")[0].trim()
      ?? req.headers.get("cf-connecting-ip")
      ?? "unknown";
  },

  // For authenticated endpoints: limit by user ID
  user: (req) => {
    const userId = (req as any).auth?.userId;
    if (!userId) throw new Error("No user ID — use IP strategy instead");
    return `user:${userId}`;
  },

  // For third-party integrations: limit by API key
  apiKey: (req) => {
    const key = req.headers.get("x-api-key")
      ?? req.headers.get("authorization")?.replace("Bearer ", "");
    if (!key) throw new Error("No API key provided");
    return `key:${key.slice(-12)}`; // use suffix to avoid storing full key
  },

  // Composite: user + endpoint for fine-grained control
  userEndpoint: (req) => {
    const userId = (req as any).auth?.userId ?? "anon";
    const path = new URL(req.url).pathname;
    return `${userId}:${path}`;
  },
};

In practice, you often layer multiple strategies. Public endpoints get IP-based limits. Authenticated endpoints get per-user limits that are more generous. Expensive endpoints (search, export, AI inference) get their own tighter limits stacked on top.

Rate Limit Headers

Communicating limits to clients is not optional -- it is an API contract. Retry-After is standardized in RFC 9110; the RateLimit-* fields come from the IETF draft "RateLimit header fields for HTTP":

function setRateLimitHeaders(
  res: Response,
  result: RateLimitResult,
  windowSec: number = 60
): void {
  const headers = res.headers;

  // Standard headers (draft-ietf-httpapi-ratelimit-headers)
  headers.set("RateLimit-Limit", String(result.limit));
  headers.set("RateLimit-Remaining", String(result.remaining));
  if (result.resetAt) {
    // The draft specifies delta seconds for Reset; many APIs ship a unix
    // timestamp instead, as this implementation does
    headers.set("RateLimit-Reset", String(result.resetAt));
  }
  // Policy format is "<limit>;w=<window in seconds>"
  headers.set("RateLimit-Policy", `${result.limit};w=${windowSec}`);

  // Legacy X-prefixed headers (still widely expected)
  headers.set("X-RateLimit-Limit", String(result.limit));
  headers.set("X-RateLimit-Remaining", String(result.remaining));
  if (result.resetAt) {
    headers.set("X-RateLimit-Reset", String(result.resetAt));
  }

  // Retry-After is standard (RFC 9110)
  if (!result.allowed && result.retryAfter) {
    headers.set("Retry-After", String(result.retryAfter));
  }
}

A well-behaved 429 response looks like:

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1711036800
Retry-After: 23
Content-Type: application/json

{"error": "rate_limit_exceeded", "retryAfter": 23}
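On the client side, a well-behaved consumer reads these headers instead of hammering the endpoint. A sketch of such a client, with the fetch function injectable so the helper can be tested; the name and retry policy are illustrative:

```typescript
// Retry on 429, sleeping for the server-advertised Retry-After duration.
// doFetch defaults to the global fetch (Node 18+); injecting it makes the
// helper testable without a network.
async function fetchWithRetry(
  url: string,
  doFetch: (url: string) => Promise<Response> = fetch,
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await doFetch(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    // Retry-After is in seconds; fall back to 1s if the server omitted it
    const retryAfter = Number(res.headers.get("Retry-After") ?? "1");
    await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
  }
}
```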

Handling Bursts

Token bucket inherently handles bursts (the bucket capacity is the burst size). But what if you want to allow occasional bursts on top of a sliding window limiter? Add a separate burst allowance:

class BurstAwareLimiter implements RateLimiter {
  private sustained: SlidingWindowLimiter;
  private burst: TokenBucketLimiter;

  constructor(redis: Redis, config: {
    sustainedLimit: number;   // e.g., 100 per minute
    sustainedWindow: number;  // e.g., 60
    burstCapacity: number;    // e.g., 20 extra
    burstRefillRate: number;  // e.g., 0.33 tokens/sec (20 per minute)
  }) {
    this.sustained = new SlidingWindowLimiter(
      redis, config.sustainedLimit, config.sustainedWindow
    );
    this.burst = new TokenBucketLimiter(
      redis, config.burstCapacity, config.burstRefillRate, config.sustainedWindow
    );
  }

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const sustainedResult = await this.sustained.consume(key, cost);
    if (sustainedResult.allowed) return sustainedResult;

    // Sustained limit hit — try burst allowance
    const burstResult = await this.burst.consume(`burst:${key}`, cost);
    if (burstResult.allowed) return burstResult;

    return sustainedResult; // both exhausted
  }
}

Distributed Rate Limiting Across Multiple Servers

Using Redis as the backing store already gives you distributed rate limiting for free -- all servers share the same state. But there are edge cases to handle.

Redis Unavailability

If Redis is down, you have two choices: fail open (allow all requests) or fail closed (reject all). Most production systems fail open with a local fallback:

class ResilientLimiter implements RateLimiter {
  private localCounts = new Map<string, { count: number; resetAt: number }>();

  constructor(
    private primary: RateLimiter,
    private limit: number,
    private windowSec: number
  ) {}

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    try {
      return await this.primary.consume(key, cost);
    } catch (err) {
      // Redis is down — fall back to local in-memory counter
      return this.localConsume(key, cost);
    }
  }

  private localConsume(key: string, cost: number): RateLimitResult {
    const now = Math.floor(Date.now() / 1000);
    let entry = this.localCounts.get(key);

    if (!entry || now >= entry.resetAt) {
      entry = {
        count: 0,
        resetAt: now + this.windowSec,
      };
      this.localCounts.set(key, entry);
    }

    entry.count += cost;

    // Use a more conservative limit per-node
    const localLimit = Math.max(1, Math.floor(this.limit / 4));

    return {
      allowed: entry.count <= localLimit,
      limit: localLimit,
      remaining: Math.max(0, localLimit - entry.count),
      resetAt: entry.resetAt,
    };
  }
}

The local fallback uses limit / 4 (assuming roughly 4 servers) so that the fleet as a whole stays near the global limit even though each node is counting independently while Redis is down.

Near-Simultaneous Requests

Lua scripts in Redis execute atomically, so two requests arriving at the same millisecond on different servers are serialized by Redis. This is why the Lua script approach matters: a MULTI/EXEC pipeline is atomic, but a read in application code followed by a separate write is not. The sliding window counter above does exactly that (a read in TypeScript, then a write), so in production wrap the entire operation in a Lua script:

const SLIDING_WINDOW_LUA = `
  local curr_key = KEYS[1]
  local prev_key = KEYS[2]
  local limit = tonumber(ARGV[1])
  local window = tonumber(ARGV[2])
  local weight = tonumber(ARGV[3])
  local cost = tonumber(ARGV[4])

  local prev = tonumber(redis.call('GET', prev_key) or "0")
  local curr = tonumber(redis.call('GET', curr_key) or "0")
  local estimated = math.floor(prev * weight) + curr

  if estimated + cost > limit then
    return {0, 0, math.floor(estimated)}
  end

  local new_count = redis.call('INCRBY', curr_key, cost)
  redis.call('EXPIRE', curr_key, window * 2)
  return {1, math.max(0, limit - math.floor(prev * weight) - new_count), math.floor(prev * weight) + new_count}
`;

Rate Limiting in API Gateways

In a gateway architecture, rate limiting should happen at the edge before requests reach your services. Here is a middleware pattern that composes multiple limiters:

type LimiterRule = {
  name: string;
  limiter: RateLimiter;
  keyExtract: KeyExtractor;
  matchPath?: RegExp;
};

function rateLimitMiddleware(rules: LimiterRule[]) {
  return async (req: Request): Promise<Response | null> => {
    const path = new URL(req.url).pathname;

    for (const rule of rules) {
      if (rule.matchPath && !rule.matchPath.test(path)) continue;

      const key = rule.keyExtract(req);
      const result = await rule.limiter.consume(key);

      if (!result.allowed) {
        const res = new Response(
          JSON.stringify({
            error: "rate_limit_exceeded",
            limiter: rule.name,
            retryAfter: result.retryAfter,
          }),
          { status: 429, headers: { "Content-Type": "application/json" } }
        );
        setRateLimitHeaders(res, result);
        return res;
      }
    }

    return null; // all checks passed, proceed to handler
  };
}

// Usage: compose multiple layers
const limiter = rateLimitMiddleware([
  {
    name: "global-ip",
    limiter: new SlidingWindowLimiter(redis, 1000, 60),
    keyExtract: keyStrategies.ip,
  },
  {
    name: "auth-user",
    limiter: new TokenBucketLimiter(redis, 200, 3.33, 60),
    keyExtract: keyStrategies.user,
  },
  {
    name: "expensive-endpoints",
    limiter: new TokenBucketLimiter(redis, 10, 0.167, 60),
    keyExtract: keyStrategies.userEndpoint,
    matchPath: /^\/(search|export|ai)\//,
  },
]);

Cost-Based Rate Limiting

Not all requests are equal. A GET /users/me is cheap. A POST /ai/generate that runs a large language model inference is expensive. Cost-based limiting assigns a weight to each request:

const endpointCosts: Record<string, number> = {
  "GET:/api/users":       1,
  "GET:/api/search":      5,
  "POST:/api/export":     20,
  "POST:/api/ai/generate": 50,
  "POST:/api/bulk-import": 100,
};

function getRequestCost(req: Request): number {
  const method = req.method;
  const path = new URL(req.url).pathname;

  // Check exact match first, then prefix match
  const exactKey = `${method}:${path}`;
  if (endpointCosts[exactKey]) return endpointCosts[exactKey];

  for (const [pattern, cost] of Object.entries(endpointCosts)) {
    const [m, p] = pattern.split(":");
    if (method === m && path.startsWith(p)) return cost;
  }

  return 1; // default cost
}

// Apply in middleware
async function costAwareRateLimit(req: Request): Promise<RateLimitResult> {
  const key = keyStrategies.user(req);
  const cost = getRequestCost(req);

  // User has 1000 "credits" per minute
  const limiter = new TokenBucketLimiter(redis, 1000, 16.67, 60);
  return limiter.consume(key, cost);
}

This means a user with 1000 credits per minute can make 1000 cheap reads, or 20 AI generation calls, or a mix. The token bucket is the natural choice here because its cost parameter maps directly to variable request weights.
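A nice side effect is that the cost table doubles as a budget calculator. Here is a hypothetical helper (not part of the limiter) that converts remaining credits into per-endpoint call budgets:

```typescript
// Translate a user's remaining credits into how many more calls of each
// endpoint they can afford, given a cost table like endpointCosts above.
function remainingCalls(
  remainingCredits: number,
  costs: Record<string, number>
): Record<string, number> {
  return Object.fromEntries(
    Object.entries(costs).map(([endpoint, cost]) => [
      endpoint,
      Math.floor(remainingCredits / cost),
    ])
  );
}

// remainingCalls(1000, { "GET:/api/users": 1, "POST:/api/ai/generate": 50 })
// → { "GET:/api/users": 1000, "POST:/api/ai/generate": 20 }
```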

You can take this further by returning the cost in the response headers so clients can plan their usage:

headers.set("X-RateLimit-Cost", String(cost));
headers.set("X-RateLimit-Remaining-Credits", String(result.remaining));

Putting It All Together

Here is a complete Express-style integration showing how the pieces compose in a real application:

import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis();

// Define tiered limits
const tiers: Record<string, { limit: number; burstCapacity: number }> = {
  free:       { limit: 100,  burstCapacity: 10 },
  pro:        { limit: 1000, burstCapacity: 50 },
  enterprise: { limit: 10000, burstCapacity: 200 },
};

app.use(async (req, res, next) => {
  const tier = (req as any).auth?.tier ?? "free";
  const config = tiers[tier];

  const limiter = new TokenBucketLimiter(
    redis,
    config.limit,
    config.limit / 60,  // spread evenly across 60 seconds
    120                   // 2-minute TTL
  );

  const key = (req as any).auth?.userId
    ? `user:${(req as any).auth.userId}`
    : `ip:${req.ip}`;

  const cost = getRequestCost(req);
  const result = await limiter.consume(key, cost);

  // Always set headers, even when allowed
  res.set("RateLimit-Limit", String(config.limit));
  res.set("RateLimit-Remaining", String(result.remaining));
  res.set("X-RateLimit-Cost", String(cost));

  if (!result.allowed) {
    res.set("Retry-After", String(result.retryAfter));
    return res.status(429).json({
      error: "rate_limit_exceeded",
      tier,
      retryAfter: result.retryAfter,
      upgradeUrl: tier === "free" ? "/pricing" : undefined,
    });
  }

  next();
});

Quick Reference: Which Algorithm to Use

| Scenario | Algorithm | Why |
| --- | --- | --- |
| Simple API with low traffic | Fixed window | Easy to implement, good enough |
| General-purpose API rate limiting | Sliding window counter | Best accuracy-to-memory ratio |
| APIs that need burst tolerance | Token bucket | Burst capacity is a first-class parameter |
| Smoothing traffic to downstream services | Leaky bucket | Guarantees constant output rate |
| Cost-based / variable-weight limits | Token bucket | Natural cost parameter support |
| Strict compliance requirements | Sliding window log | Exact counts, no approximation |

Operational Checklist

Before shipping rate limiting to production, verify:

  1. Fail-open behavior -- your API still works when Redis is unreachable.
  2. Headers on every response -- not just 429s. Clients need to see remaining quota on successful requests.
  3. Monitoring -- track rate_limit_hit as a metric, broken down by key strategy and tier. A spike in 429s may indicate an attack or a misconfigured limit.
  4. Differentiated limits -- at minimum, separate authenticated from unauthenticated traffic. Paying customers should never share a pool with anonymous scrapers.
  5. Documentation -- publish your limits. Undocumented rate limits cause frustration and support tickets.
  6. Gradual rollout -- start in logging-only mode (emit metrics but allow all requests), then enable enforcement after you understand your traffic patterns.
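Step 6 can be implemented as a thin wrapper around any limiter. A sketch, with the interfaces repeated so the snippet stands alone and `recordMetric` as a stand-in for your metrics client:

```typescript
// Repeated from earlier so this sketch is self-contained.
interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  retryAfter?: number;
}

interface RateLimiter {
  consume(key: string, cost?: number): Promise<RateLimitResult>;
}

// Shadow mode: record would-be rejections as metrics, but allow everything.
// Swap ShadowLimiter for the inner limiter once you trust your limits.
class ShadowLimiter implements RateLimiter {
  constructor(
    private inner: RateLimiter,
    private recordMetric: (name: string, tags: Record<string, string>) => void
  ) {}

  async consume(key: string, cost = 1): Promise<RateLimitResult> {
    const result = await this.inner.consume(key, cost);
    if (!result.allowed) {
      // The metric you would have enforced on
      this.recordMetric("rate_limit_would_block", { key });
    }
    return { ...result, allowed: true }; // enforcement disabled
  }
}
```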

Rate limiting is one of those systems that appears trivial but touches authentication, infrastructure resilience, billing, and developer experience. Get the algorithm right, put it behind Redis with proper Lua atomicity, set the headers, and you have a system that protects your services and respects your users.

Next in the series: we will cover circuit breakers, bulkheads, and graceful degradation patterns for building backend services that stay up when their dependencies go down.


If this was useful, consider:


You Might Also Like

Follow me for more production-ready backend content!


If this helped you, buy me a coffee on Ko-fi!

Top comments (0)