DEV Community

Bill Tu

Four Bugs We Found in Our Node.js Rate Limiter (And How We Fixed Them)

We recently shipped node-rate-limiter-pro, a high-performance rate limiter for Node.js with Token Bucket, Sliding Window, and Redis support. It benchmarks at ~2M ops/sec in memory and runs roughly 10x faster than express-rate-limit.

Then we did a proper code review.

We found four bugs — none of them obvious at first glance, all of them the kind that only show up in production under real load. This post walks through each one: what went wrong, why it matters, and the fix.


Bug #1: The Middleware That Swallowed Errors

The problem: Our Express middleware called await this.consume(key) without a try/catch.

// Before (broken)
return async (req, res, next) => {
  const key = keyFn(req);
  const result = await this.consume(key); // šŸ’„ if Redis is down, this throws
  res.setHeader('X-RateLimit-Limit', result.limit);
  // ...
  next();
};

If you're using the in-memory store, this will never bite you — the memory store can't fail. But the moment you plug in Redis and the connection drops, consume() throws, and that error has nowhere to go. No catch, no next(err). Just an unhandled promise rejection that crashes your process.

This is particularly insidious because it works perfectly in development and testing (where Redis is always up), and only fails in production (where Redis occasionally hiccups).

The fix:

// After (fixed)
return async (req, res, next) => {
  try {
    const key = keyFn(req);
    const result = await this.consume(key);
    res.setHeader('X-RateLimit-Limit', result.limit);
    // ...
    next();
  } catch (err) {
    next(err);
  }
};

Now errors flow through Express's error handling pipeline, and users can implement their own strategy:

app.use(limiter.middleware());

app.use((err, req, res, next) => {
  if (err.message.includes('Redis')) {
    // Fail open: let the request through
    return next();
  }
  res.status(500).json({ error: 'Internal Server Error' });
});

Lesson: Every await in middleware is a potential throw. If you're writing Express middleware with async operations, wrap the entire body in try/catch and forward errors to next(). This is Express Middleware 101, but it's easy to forget when the happy path works so cleanly.
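If you'd rather not repeat that try/catch in every middleware, the pattern generalizes into a tiny wrapper. This is a sketch of a hypothetical asyncHandler helper (similar in spirit to the express-async-handler package, not part of node-rate-limiter-pro):

```typescript
// Minimal shape of an Express-style async middleware; the real Express types
// are richer, but this is enough to show the wrapping pattern.
type NextFn = (err?: unknown) => void;
type AsyncMiddleware = (req: unknown, res: unknown, next: NextFn) => Promise<void>;

// Wraps an async middleware so any rejection is forwarded to next(err)
// instead of becoming an unhandled promise rejection.
function asyncHandler(fn: AsyncMiddleware): AsyncMiddleware {
  return async (req, res, next) => {
    try {
      await fn(req, res, next);
    } catch (err) {
      next(err);
    }
  };
}
```

Any middleware wrapped this way gets the try/catch for free, so the happy-path body stays clean.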


Bug #2: The Birthday Problem in Redis

The problem: Our Redis sliding window implementation uses a sorted set (ZSET) to track request timestamps. Each request adds a member with the current timestamp as the score. But sorted set members must be unique — if two ZADD calls use the same member string, the second one just updates the first entry's score instead of adding a new entry.

To make members unique, we appended a random number:

-- Before (broken)
redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))

See the problem? math.random(1000000) generates a number between 1 and 1,000,000. At high throughput — say 1,000 requests per second for a single key — many requests share the same timestamp, so uniqueness rests entirely on the random suffix. The birthday problem tells us that with ~1,000 values drawn from a pool of 1,000,000, the probability of at least one collision is roughly:

P ā‰ˆ 1 - e^(-n² / 2m) = 1 - e^(-1000² / 2000000) ā‰ˆ 39%

A 39% chance of collision per second. When a collision happens, ZADD overwrites instead of adding, the count is off by one, and the rate limiter allows one extra request. Under sustained load, this adds up.
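You can sanity-check the approximation numerically. A one-liner implementing the formula above (not part of the library, just back-of-envelope math):

```typescript
// Birthday-problem approximation: probability of at least one collision
// when drawing n values uniformly from a pool of size m.
// P ā‰ˆ 1 - e^(-n^2 / 2m)
function collisionProbability(n: number, m: number): number {
  return 1 - Math.exp(-(n * n) / (2 * m));
}

collisionProbability(1_000, 1_000_000); // ā‰ˆ 0.393
```

Plugging in smaller pools makes the danger obvious: with math.random(10000) instead, the same 1,000 requests would collide with near certainty.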

The fix: Replace math.random() with an atomic Redis counter:

-- After (fixed)
local seq = redis.call('INCR', key .. ':seq')
redis.call('ZADD', key, now, now .. '-' .. seq)
redis.call('PEXPIRE', key, window)
redis.call('PEXPIRE', key .. ':seq', window)

INCR is atomic and monotonically increasing — no collisions, ever. The sequence counter key gets the same TTL as the main key, so it cleans itself up.

Lesson: When you need uniqueness in a distributed system, don't use random numbers unless the space is astronomically large. Monotonic counters are simpler, faster, and provably collision-free. And always do the birthday problem math — your intuition about "how likely is a collision" is almost certainly wrong.


Bug #3: The Slow Memory Leak

The problem: Both the Token Bucket and Sliding Window algorithms store per-key state in a JavaScript Map:

private windows = new Map<string, WindowState>();  // sliding window
private buckets = new Map<string, BucketState>();   // token bucket

Keys are added on first consume() call. Keys are never removed (except via explicit reset()).

If you're rate limiting by IP address and your API serves 100,000 unique IPs per day, after 30 days you have 3 million entries in the Map. Each entry is small (~24-40 bytes of data, plus the Map overhead and the key string), but it adds up. More importantly, it never stops growing.

This is the kind of bug that doesn't show up in benchmarks, doesn't show up in tests, and doesn't show up in the first week of production. It shows up when your ops team notices the Node.js process is using 2GB of RAM and climbing.

The fix: Lazy eviction. Every 1,000 consume() calls, we scan up to 500 keys and remove the ones that have expired:

const EVICTION_INTERVAL = 1000;
const EVICTION_BATCH = 500;

export class SlidingWindow {
  private callCount = 0;

  consume(key: string): RateLimitResult {
    const now = Date.now();

    if (++this.callCount >= EVICTION_INTERVAL) {
      this.callCount = 0;
      this.evict(now);
    }
    // ... rest of consume logic
  }

  private evict(now: number): void {
    const expiry = this.windowMs * 2;
    let scanned = 0;
    for (const [key, state] of this.windows) {
      if (scanned++ >= EVICTION_BATCH) break;
      if (now - state.currStart >= expiry) {
        this.windows.delete(key);
      }
    }
  }
}

The same pattern applies to Token Bucket — evict keys that have been idle long enough to have fully refilled (meaning they're back to their initial state and can be safely recreated on next access).
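A sketch of what that Token Bucket eviction could look like — the field names (tokens, lastRefill) and the refill-rate representation are illustrative assumptions, not the library's actual internals:

```typescript
// Illustrative per-key bucket state: current token count and the timestamp
// (ms) of the last refill calculation.
interface BucketState {
  tokens: number;
  lastRefill: number;
}

class TokenBucketEviction {
  constructor(
    private buckets: Map<string, BucketState>,
    private capacity: number,
    private refillPerMs: number, // tokens added per millisecond
  ) {}

  // Drop buckets idle long enough to have fully refilled: recreating them
  // on next access yields the identical full-bucket state, so deletion is safe.
  evict(now: number, batch = 500): void {
    const fullRefillMs = this.capacity / this.refillPerMs;
    let scanned = 0;
    for (const [key, state] of this.buckets) {
      if (scanned++ >= batch) break;
      if (now - state.lastRefill >= fullRefillMs) {
        this.buckets.delete(key);
      }
    }
  }
}
```

The key insight is the eviction threshold: an empty bucket takes capacity / refillPerMs milliseconds to refill completely, so anything untouched for at least that long is indistinguishable from a fresh bucket.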

Why lazy eviction instead of a setInterval timer?

  1. No timer overhead — no background work when the limiter is idle
  2. Amortized cost — the eviction work is spread across consume calls
  3. Bounded per-call cost — scanning at most 500 keys per pass keeps the hot path fast
  4. No concurrency issues — everything runs synchronously in the consume path

The tradeoff is that memory isn't freed immediately when keys expire. But with a 1:1000 ratio (one eviction pass per 1,000 calls), stale keys don't accumulate meaningfully.

Lesson: If you're storing per-key state in memory, you need an eviction strategy. This applies to caches, rate limiters, session stores — anything with unbounded key cardinality. The simplest approach is often the best: piggyback on existing operations and do a little cleanup each time.


Bug #4: The any That Defeated TypeScript

The problem: The Redis client was typed as any:

export interface RedisStoreOptions extends RateLimiterOptions {
  client: any;  // 🤷
}

This means you could pass literally anything — a string, a number, null, a random object — and TypeScript would happily compile it. The error would only surface at runtime when the code tries to call client.eval().

We originally used any because ioredis is an optional peer dependency. We didn't want to import its types and create a hard dependency. But any is never the answer.

The fix: Define a minimal structural interface that describes what we actually need from the client:

export interface RedisLike {
  eval(...args: any[]): Promise<any>;
  del(...keys: string[]): Promise<number>;
}

export interface RedisStoreOptions extends RateLimiterOptions {
  client: RedisLike;
}

This is TypeScript's structural typing at its best. Any object with eval() and del() methods will satisfy the interface — ioredis, the redis package, a mock object in tests, or a custom wrapper. No import needed, no hard dependency, but you get compile-time checking that catches the obvious mistakes.
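For instance, a hand-rolled mock satisfies the interface with no Redis installed — the return values below are made up for illustration, not what the library's scripts actually return:

```typescript
// The minimal structural contract: anything with eval() and del() qualifies.
interface RedisLike {
  eval(...args: any[]): Promise<any>;
  del(...keys: string[]): Promise<number>;
}

// A test double that type-checks as a client. The eval() payload here is
// a hypothetical [allowed, remaining] tuple, purely for demonstration.
const mockClient: RedisLike = {
  async eval(..._args: any[]) {
    return [1, 9];
  },
  async del(...keys: string[]) {
    return keys.length; // pretend every key existed and was deleted
  },
};
```

Passing a plain string or null in the client slot now fails at compile time instead of at the first eval() call.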

Lesson: any is a code smell. When you're tempted to use it because you don't want a hard dependency on a specific library, define a structural interface instead. Describe the shape you need, not the specific implementation. This is one of TypeScript's superpowers — use it.


Wrapping Up

Four bugs, four different categories:

| # | Bug | Category | When it bites |
|---|-----|----------|---------------|
| 1 | Missing try/catch in middleware | Error handling | Redis goes down in production |
| 2 | Random collision in ZADD | Distributed systems math | High throughput (>1K req/sec/key) |
| 3 | No key eviction | Memory management | After days/weeks of uptime |
| 4 | any-typed Redis client | Type safety | Passing the wrong object at integration time |

None of these showed up in our test suite. The tests all passed. The benchmarks looked great. The code was clean and well-structured.

That's the thing about these kinds of bugs — they live in the gaps between what you test and what production actually does. They require thinking about failure modes (what if Redis dies?), mathematical edge cases (what's the collision probability?), long-running behavior (what happens after a month?), and developer experience (what happens when someone passes the wrong thing?).

If you're building a library, especially one that sits in the critical path of every HTTP request, these are the questions worth asking before your users ask them for you.

All four fixes shipped in v1.1.0. The project is MIT licensed and available on GitHub and npm.
