DEV Community

I Audited My Own Open Source Library and Found 9 Security Bugs. Here's Every One.

Hey dev.to 👋

If you've read my previous post about layercache, you know it's a multi-layer caching library for Node.js: Memory → Redis → Disk behind a single get() call, with stampede prevention, tag invalidation, circuit breaking, and all the production-grade stuff you eventually need.

Today I'm releasing v1.3.3, and it's different from all the previous releases.

No new features. No benchmark numbers. No shiny API additions.

Just nine bugs I found in my own library. I want to walk through all of them โ€” what they were, why they happened, and what I did to fix them.

Some are embarrassing. All of them are real.


Why I did a full security audit

When you're building in the open and people start actually using the thing, you feel differently about the code. I went back through the internals with fresh eyes and a specific question: what could go wrong in production under real load?

Turns out: a lot.

Here's everything I found, roughly in severity order.


VULN-1 (HIGH): Unbounded memory growth in keyEpochs

The bug: CacheStackMaintenance uses a Map<string, number> called keyEpochs to track write invalidation: every time a key is deleted or updated, its epoch is bumped so stale write-behind operations know to skip it. The map grew forever. No cap, no pruning. In a long-running service writing lots of unique keys, this is a slow memory leak that only gets worse over time.

The fix: Added MAX_KEY_EPOCHS = 50_000 and a pruning step after every bumpKeyEpochs() call. When the map exceeds the limit, the oldest 10% (lowest epoch values) get evicted.

+ const MAX_KEY_EPOCHS = 50_000

  bumpKeyEpochs(keys: string[]): void {
    for (const key of keys) {
      this.keyEpochs.set(key, this.currentKeyEpoch(key) + 1)
    }
+   this.pruneKeyEpochsIfNeeded()
  }

+ private pruneKeyEpochsIfNeeded(): void {
+   if (this.keyEpochs.size <= MAX_KEY_EPOCHS) return
+   const sorted = [...this.keyEpochs.entries()].sort((a, b) => a[1] - b[1])
+   const toDelete = Math.ceil(sorted.length * 0.1)
+   for (let i = 0; i < toDelete; i++) {
+     this.keyEpochs.delete(sorted[i][0])
+   }
+ }

This one stings because it's exactly the kind of bug that's invisible in tests: you only see it after the process has been running for days and memory graphs start climbing.
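To make the epoch mechanism concrete, here's a minimal self-contained sketch (illustrative names, not layercache's internals) of how an epoch map lets a delayed write-behind operation detect that its key was invalidated after the write was queued, with the same prune-on-bump bounding described above (a tiny cap and a 25% eviction batch, just for demonstration):

```typescript
class EpochTracker {
  private readonly epochs = new Map<string, number>()
  private static readonly MAX_ENTRIES = 4 // tiny cap for illustration

  current(key: string): number {
    return this.epochs.get(key) ?? 0
  }

  bump(key: string): void {
    this.epochs.set(key, this.current(key) + 1)
    this.pruneIfNeeded()
  }

  // Evict the lowest-epoch 25% once the cap is exceeded, mirroring the
  // "evict the oldest slice" strategy described above.
  private pruneIfNeeded(): void {
    if (this.epochs.size <= EpochTracker.MAX_ENTRIES) return
    const sorted = [...this.epochs.entries()].sort((a, b) => a[1] - b[1])
    const toDelete = Math.ceil(sorted.length * 0.25)
    for (let i = 0; i < toDelete; i++) this.epochs.delete(sorted[i][0])
  }

  get size(): number {
    return this.epochs.size
  }
}

const tracker = new EpochTracker()
const epochAtQueueTime = tracker.current('user:42') // snapshot before queuing

tracker.bump('user:42') // key invalidated while the write was in flight

// The write-behind worker compares epochs and skips the stale write.
const isStale = tracker.current('user:42') !== epochAtQueueTime
console.log(isStale) // true: the queued write must be skipped

// The cap keeps the map bounded no matter how many unique keys appear.
for (let i = 0; i < 100; i++) tracker.bump(`key:${i}`)
console.log(tracker.size <= 4) // true
```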


VULN-2 (MED-HIGH): Unbounded queue in FetchRateLimiter

The bug: FetchRateLimiter queues fetcher requests per-bucket when rate limits are hit. The queue itself had no bound. Under sustained high contention on a single cache key, that queue would grow without limit, eventually consuming unbounded memory and causing backpressure to pile up indefinitely.

The fix: Added MAX_QUEUE_PER_BUCKET = 10_000. When a bucket's queue is full, new requests bypass the rate limiter entirely rather than blocking (availability > strict throttling in this failure mode).

+ const MAX_QUEUE_PER_BUCKET = 10_000

  return new Promise<T>((resolve, reject) => {
    const bucketKey = this.resolveBucketKey(normalized, context)
    const queue = this.queuesByBucket.get(bucketKey) ?? []
+   if (queue.length >= MAX_QUEUE_PER_BUCKET) {
+     this.rateLimitBypasses += 1
+     task().then(resolve, reject)
+     return
+   }
    queue.push({ bucketKey, options: normalized, task, resolve, reject })
    ...
  })

The bypass counter is exposed via metrics so you can see when it's happening in production.
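The "bypass when full" policy can be sketched on its own. This is an illustrative standalone version (not layercache's API) with a tiny cap so the behavior is visible: once the per-bucket queue is at capacity, the task runs immediately instead of queueing, and a counter records the bypass.

```typescript
const MAX_QUEUE = 2 // tiny cap for illustration

type Task = () => Promise<string>

const queues = new Map<string, Task[]>()
let bypasses = 0

function schedule(bucket: string, task: Task): Promise<string> {
  const queue = queues.get(bucket) ?? []
  queues.set(bucket, queue)
  if (queue.length >= MAX_QUEUE) {
    bypasses += 1 // surfaced via metrics so operators can spot it
    return task() // run immediately: availability over strict throttling
  }
  // The real limiter drains the queue according to the rate limit; for
  // this sketch we just record the task and run it.
  queue.push(task)
  return task()
}

const task: Task = async () => 'ok'
for (let i = 0; i < 5; i++) void schedule('hot-key', task)

console.log(queues.get('hot-key')!.length) // 2: the queue stays capped
console.log(bypasses) // 3: the overflow calls went around the limiter
```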


VULN-3 (MEDIUM): CLI accepted unvalidated input before hitting Redis

The bug: The admin CLI (npx layercache keys --pattern "...", invalidate --tag "...", etc.) didn't validate keys, patterns, or tags before passing them to Redis operations. The runtime CacheStack enforces strict validation on all inputs; the CLI was just... not doing that.

The fix: The same validateCacheKey(), validatePattern(), and validateTag() functions used by the runtime are now called in the CLI before any Redis operation runs.

// cli.ts โ€” now applied before every Redis op
if (args.pattern && !validateCliInput(args.pattern, validatePattern)) return
if (args.tag && !validateCliInput(args.tag, validateTag)) return
if (args.key && !validateCliInput(args.key, validateCacheKey)) return

The runtime had this hardened back in v1.2.x. The CLI just... never got the memo.
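The library's actual validation rules aren't reproduced here, but a hedged sketch of what a validateCacheKey()-style check might look like shows why sharing one definition between runtime and CLI matters: there's a single place where the rules live, so the two interfaces can't drift apart.

```typescript
// Illustrative only: the real validator's rules may differ.
function validateCacheKeySketch(key: string): boolean {
  if (key.length === 0 || key.length > 512) return false
  // Reject whitespace and control characters that break Redis tooling.
  if (/[\s\x00-\x1f]/.test(key)) return false
  return true
}

console.log(validateCacheKeySketch('user:42:profile')) // true
console.log(validateCacheKeySketch('bad key\n'))       // false
console.log(validateCacheKeySketch(''))                // false
```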


VULN-4 (MEDIUM): invalidate could wipe the entire cache with no confirmation

The bug: Running npx layercache invalidate with no --pattern or --tag defaults to *, which matches every key in the cache. There was no confirmation step. One mistyped command in a terminal and your entire production cache is gone.

The fix: If you run invalidate with no targeting flags and there are keys to delete, the CLI now refuses and asks you to pass --force explicitly.

$ npx layercache invalidate
Warning: this operation will invalidate 14,823 keys. Use --force to confirm.

$ npx layercache invalidate --force
Invalidated 14,823 keys.

This one is embarrassing because I added the CLI for convenience in production, and then left a footgun that could nuke the entire cache by accident. Glad I caught it before anyone else did.
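The gate itself is a small decision function. A sketch of the logic (flag names follow the CLI output above; the surrounding argument parsing is hypothetical): refuse only when the invalidation is untargeted, there are keys at stake, and --force wasn't passed.

```typescript
interface InvalidateArgs {
  pattern?: string
  tag?: string
  force?: boolean
}

function shouldProceed(args: InvalidateArgs, matchingKeys: number): boolean {
  const isTargeted = Boolean(args.pattern || args.tag)
  // Untargeted invalidation ("*") with keys at stake requires --force.
  if (!isTargeted && matchingKeys > 0 && !args.force) return false
  return true
}

console.log(shouldProceed({}, 14_823))                    // false: refuse
console.log(shouldProceed({ force: true }, 14_823))       // true
console.log(shouldProceed({ pattern: 'user:*' }, 14_823)) // true: targeted
```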


VULN-5 (MEDIUM): TagIndex pruning was silently broken

The bug: TagIndex uses a knownKeys collection to track which keys exist, so prefix and wildcard invalidation can find them. Since v1.2.0, it had a maxKnownKeys limit to prevent unbounded growth, but it was a Set<string>, which has no access-recency ordering. The pruning code sorted and evicted by... nothing meaningful. It was effectively random deletion, not LRU eviction. Hot keys were just as likely to get pruned as cold ones.

The fix: Changed knownKeys from Set<string> to Map<string, number> where the value is a timestamp updated on every touch() or track() call. Now pruning correctly evicts least-recently-used entries.

- private readonly knownKeys = new Set<string>()
+ private readonly knownKeys = new Map<string, number>()  // key → last-touched timestamp

  async touch(key: string): Promise<void> {
-   this.knownKeys.add(key)
+   this.knownKeys.set(key, Date.now())  // updates on every access
    this.pruneKnownKeysIfNeeded()
  }

  private pruneKnownKeysIfNeeded(): void {
    if (!this.maxKnownKeys || this.knownKeys.size <= this.maxKnownKeys) return
-   // old: iterated a Set with no ordering guarantee
+   const sorted = [...this.knownKeys.entries()].sort((a, b) => a[1] - b[1])
+   const toDelete = Math.ceil(sorted.length * 0.1)
+   for (let i = 0; i < toDelete; i++) this.knownKeys.delete(sorted[i][0])
  }

The limit was there since v1.2.0 and looked like it was working. It wasn't.
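A side note on the data structure: because a JS Map iterates in insertion order, deleting and re-inserting a key on every touch keeps the map ordered oldest-first, so eviction can pop the first entry without the O(n log n) sort. A sketch of that alternative (not layercache's implementation):

```typescript
class LruKeySet {
  private readonly keys = new Map<string, true>()

  constructor(private readonly maxSize: number) {}

  touch(key: string): void {
    this.keys.delete(key) // re-insert moves the key to the "newest" end
    this.keys.set(key, true)
    while (this.keys.size > this.maxSize) {
      // Maps iterate in insertion order, so the first key is the LRU.
      const oldest = this.keys.keys().next().value as string
      this.keys.delete(oldest)
    }
  }

  has(key: string): boolean {
    return this.keys.has(key)
  }
}

const lru = new LruKeySet(2)
lru.touch('a')
lru.touch('b')
lru.touch('a') // 'a' is now the most recently used
lru.touch('c') // evicts 'b', the least recently used
console.log(lru.has('a')) // true
console.log(lru.has('b')) // false
```

The timestamp Map in the actual fix trades that constant-time trick for timestamps that are also useful elsewhere; either shape gives real LRU semantics.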


VULN-6 (MEDIUM): TOCTOU race in snapshot file writes

The bug: The snapshot persistence code (persistToFile()) wrote directly to the target path. If the process crashed mid-write, you'd get a partial or corrupt snapshot file with no recovery path. Worse, if two processes tried to write a snapshot concurrently, they'd clobber each other.

The fix: Centralized all snapshot writes through two new utilities: atomicWriteTempPath() generates a randomized temp filename, and commitAtomicWrite() renames the temp file to the target, an operation that's atomic on POSIX filesystems as long as the temp file and target live on the same filesystem (which a same-directory temp path guarantees).

// src/internal/CacheSnapshotFile.ts
import { rename, unlink } from 'node:fs/promises'
import { randomBytes } from 'node:crypto'

export function atomicWriteTempPath(targetPath: string): string {
  return `${targetPath}.tmp-${randomBytes(8).toString('hex')}`
}

export async function commitAtomicWrite(tempPath: string, targetPath: string): Promise<void> {
  try {
    await rename(tempPath, targetPath)
  } catch (error) {
    await unlink(tempPath).catch(() => undefined)
    throw error
  }
}

Write to the temp path, then fs.rename(). If anything goes wrong before the rename, the original snapshot is untouched. If the rename succeeds, readers see either the old file or the new one, never a partial state.
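Putting the pieces together, a hypothetical caller looks like this (the two utilities are repeated so the sketch is self-contained; persistSnapshot and its plain-string payload are illustrative, since the library's actual persistToFile handles serialization):

```typescript
import { mkdtemp, readFile, rename, unlink, writeFile } from 'node:fs/promises'
import { randomBytes } from 'node:crypto'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

function atomicWriteTempPath(targetPath: string): string {
  return `${targetPath}.tmp-${randomBytes(8).toString('hex')}`
}

async function commitAtomicWrite(tempPath: string, targetPath: string): Promise<void> {
  try {
    await rename(tempPath, targetPath)
  } catch (error) {
    await unlink(tempPath).catch(() => undefined) // best-effort cleanup
    throw error
  }
}

// Write the payload to the temp path first, then atomically swap it in.
async function persistSnapshot(targetPath: string, data: string): Promise<void> {
  const tempPath = atomicWriteTempPath(targetPath)
  await writeFile(tempPath, data, 'utf8') // a crash here leaves the target intact
  await commitAtomicWrite(tempPath, targetPath)
}

async function main(): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), 'snapshot-'))
  const target = join(dir, 'cache.snapshot')
  await persistSnapshot(target, '{"keys":[]}')
  console.log(await readFile(target, 'utf8')) // {"keys":[]}
}
main().catch(console.error)
```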


VULN-7 (LOW): Memory leak in layerDegradedUntil

The bug: When a cache layer fails and enters degraded mode, CacheStack stores layerDegradedUntil.set(layer.name, expiryTimestamp). When the degradation period expired, the entry was never removed. In a service where Redis occasionally has brief hiccups, this map accumulates an entry per layer per incident, forever.

The fix: On every read that checks degradation status, if the entry has expired, delete it before returning.

  const degradedUntil = this.layerDegradedUntil.get(layer.name)
  const skip = shouldSkipDegradedLayer(degradedUntil)
+ if (!skip && degradedUntil !== undefined) {
+   this.layerDegradedUntil.delete(layer.name)  // clean up expired entry
+ }

One-liner fix, but this would quietly accumulate in any service that ever experiences Redis downtime.


VULN-8 (LOW): Math.random() for TTL jitter

The bug: TtlResolver.applyJitter() used Math.random() to spread cache expiration times. Math.random() is not cryptographically secure; it's seeded from a deterministic internal state. For TTL jitter this is mostly harmless, but using a predictable PRNG to compute expiration windows is bad practice. In theory, an observer who can measure cache miss patterns could infer when keys are about to expire.

The fix: Replaced Math.random() with a crypto.randomBytes-based equivalent.

+ import { randomBytes } from 'node:crypto'

+ export const secureRandom = {
+   value(): number {
+     return randomBytes(4).readUInt32BE(0) / 0x100000000
+   }
+ }

  applyJitter(ttl: number | undefined, jitter: number | undefined): number | undefined {
    if (!ttl || ttl <= 0 || !jitter || jitter <= 0) return ttl
-   const delta = (Math.random() * 2 - 1) * jitter
+   const delta = (secureRandom.value() * 2 - 1) * jitter
    return Math.max(1, Math.round(ttl + delta))
  }

randomBytes(4) is fast. No measurable performance impact.
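A quick sanity check on the replacement, written as a standalone sketch mirroring the snippet above: secureRandom.value() stays in [0, 1), and the jittered TTL stays within the base ± jitter window.

```typescript
import { randomBytes } from 'node:crypto'

const secureRandom = {
  value(): number {
    // 4 random bytes → integer in [0, 2^32) → float in [0, 1)
    return randomBytes(4).readUInt32BE(0) / 0x100000000
  },
}

function applyJitter(ttl: number, jitter: number): number {
  const delta = (secureRandom.value() * 2 - 1) * jitter
  return Math.max(1, Math.round(ttl + delta))
}

let inRange = true
for (let i = 0; i < 1000; i++) {
  const v = secureRandom.value()
  if (v < 0 || v >= 1) inRange = false
  const ttl = applyJitter(300, 30) // base 300s, ±30s jitter
  if (ttl < 270 || ttl > 330) inRange = false
}
console.log(inRange) // true
```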


VULN-9 (LOW): Background refresh failures logged at debug level

The bug: When a stale-while-revalidate background refresh fails (upstream is down, fetcher throws, timeout), the error was logged at debug level. In almost every production setup, debug logs are disabled. So these failures were silently swallowed. You'd see keys serving stale values with no log entry explaining why.

The fix: One-line change.

- this.logger.debug?.('background-refresh-failed', { key, error })
+ this.logger.warn?.('background-refresh-failed', { key, error })

I genuinely don't know how long this was invisible. If you've been running layercache with staleWhileRevalidate and wondering why some keys feel permanently stale โ€” this might be why.


What I learned from this

A few patterns that caused most of these bugs:

Unbounded Maps are silent killers. VULN-1, VULN-5, and VULN-7 are all variations of the same mistake: I allocated a Map or Set, put the bounds/pruning logic on my TODO list, and shipped without it. In tests, these are invisible. In production they show up in memory graphs after days of uptime.

Internal tools don't inherit production hardening automatically. VULN-3 and VULN-4 happened because the CLI was an afterthought. The core library had strict input validation. The CLI that wraps it did not. Every interface (HTTP endpoints, CLIs, admin tools) needs its own hardening pass.

"Debug-level logging" is often "no logging" in production. VULN-9 was a legitimate design decision that turned out to be wrong in practice. Background refresh failures are operational signals, not debugging details.

TOCTOU bugs hide behind success. VULN-6 was only a problem during crashes or concurrent writes, situations that don't happen in unit tests. The atomic write pattern is just the right default, regardless.


Upgrade

v1.3.3 is a drop-in upgrade. No API changes, no migration needed.

npm install layercache@latest

Full changelog: CHANGELOG.md
Security PR: #19


If you're already using layercache, please upgrade. If you're not, this might be a decent time to take a look.

If this has been useful, a ⭐ on GitHub helps a lot; it's the main signal that helps other developers find the library. Thanks for reading. 🙏
