Part 3 of the LayerCache series.

- Part 1 — why I built it.
- Part 2 — how to use it.
I want to talk about the incidents.
Not the elegant architecture diagrams. Not the benchmark numbers. The 2am Slack alerts. The postmortems. The moments where I stared at a graph thinking "the cache is supposed to prevent this."
Every incident in this post is real. And each one taught me something that I eventually baked into LayerCache.
## Incident #1: The Database Died at 9:02 AM on a Monday
The alert: DB CPU at 100%. Query queue backing up. Response times climbing past 30 seconds.
The timeline:
- 9:00 AM — regular Monday morning traffic ramp-up begins
- 9:01 AM — a popular product listing key expires (it was set with a fixed TTL over the weekend)
- 9:02 AM — 400+ users hit the product listing at the same moment
- 9:02 AM — all 400 requests miss the cache simultaneously and slam the DB
Classic cache stampede. The fix at the time was embarrassingly manual: restart the service, pre-warm the cache by hand, pray it doesn't happen again.
What was actually happening:
```
400 requests hit the expired key at the same time
        ↓
400 separate DB queries fire in parallel
        ↓
DB connection pool exhausted
        ↓
Everything else queues up behind it
        ↓
Cascading timeout across the service
```
The frustrating part is that the data didn't need to be fetched 400 times. It was the same key. One fetch would have been enough for all of them.
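That one-fetch-for-everyone idea is usually called single-flight (or request coalescing). A minimal sketch of the pattern, independent of any library — the names here are illustrative, not LayerCache's internals:

```typescript
// Single-flight sketch: concurrent callers for the same key share one
// in-flight promise instead of each firing their own fetch.
const inflight = new Map<string, Promise<unknown>>()

async function singleFlight<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const pending = inflight.get(key)
  if (pending) return pending as Promise<T> // join the fetch already in flight

  const p = fetcher().finally(() => inflight.delete(key))
  inflight.set(key, p)
  return p
}

// Demo: 400 concurrent callers, one underlying fetch
let fetches = 0
const fetchListing = async () => {
  fetches++
  return 'listing-data'
}

const results = await Promise.all(
  Array.from({ length: 400 }, () => singleFlight('products:listing', fetchListing))
)
console.log(fetches) // 1
console.log(results.every(r => r === 'listing-data')) // true
```

The `finally` cleanup matters: once the fetch settles, the key leaves the map, so a later miss triggers a fresh fetch rather than a stale cached rejection.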
How LayerCache handles it:
When multiple requests hit a missing or expired key simultaneously, they all share a single in-flight promise instead of each firing their own fetcher:
```typescript
const cache = new CacheStack([
  new MemoryLayer({ ttl: 60 }),
  new RedisLayer({ client: redis, ttl: 3600 }),
])

// 400 concurrent callers → fetcher runs exactly once
const product = await cache.get('products:listing', () => db.getProductListing())
```
In a test with 75 concurrent requests hitting an expired key, origin fetches dropped from 375 to 5 (one per cache layer). All 75 callers got the same result, returned at the same time, from a single DB round-trip.
For a multi-instance setup where the stampede spans multiple servers, RedisSingleFlightCoordinator extends this across processes with distributed locks:
```typescript
const cache = new CacheStack(layers, {
  singleFlightCoordinator: new RedisSingleFlightCoordinator({ client: redis }),
})
```
60 concurrent requests across multiple instances → 1 origin fetch total.
## Incident #2: Redis Went Down and Took Everything With It
The alert: Redis connection errors. But also... the entire API is returning 500s?
The expectation: Redis is a cache. It's optional infrastructure. If it goes down, requests should fall back to the database and keep working.
The reality: Our code looked like this:
```typescript
async function getUser(id: number) {
  const cached = await redis.get(`user:${id}`) // throws on connection error
  if (cached) return JSON.parse(cached)

  const user = await db.findUser(id)
  await redis.set(`user:${id}`, JSON.stringify(user), 'EX', 3600)
  return user
}
```
The Redis client threw on connection failure. The error propagated up. The entire request failed. What was supposed to be a cache layer had become a hard dependency.
We fixed it that day by wrapping every Redis call in a try/catch. Every. Single. One. And then we had to make sure every future Redis call also got that treatment. It was a good fix. It was also exhausting.
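For context, the hand-rolled fix looked roughly like this — a wrapper that turns any cache error into a cache miss. A sketch with illustrative names, not our actual code:

```typescript
// The hotfix pattern, sketched: a broken cache read becomes a miss,
// never a request failure.
type RedisLike = { get(key: string): Promise<string | null> }

async function safeCacheGet(client: RedisLike, key: string): Promise<string | null> {
  try {
    return await client.get(key)
  } catch {
    return null // cache down? treat it as a miss and let the caller hit the DB
  }
}

// Demo: a client whose connection is down
const downClient: RedisLike = {
  get: async () => {
    throw new Error('ECONNREFUSED')
  },
}
console.log(await safeCacheGet(downClient, 'user:1')) // null
```

Simple enough for one call site. The problem is remembering to route every current and future Redis call through it.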
How LayerCache handles it:
Graceful degradation is a first-class option. When a layer fails, it's temporarily skipped and the request continues to the next layer (or falls through to the fetcher directly):
```typescript
const cache = new CacheStack(
  [
    new MemoryLayer({ ttl: 60 }),
    new RedisLayer({ client: redis, ttl: 3600 }),
  ],
  {
    gracefulDegradation: { retryAfterMs: 10_000 },
  }
)
```
With 500ms of injected Redis latency (well above the default timeout):
| Scenario | Without graceful degradation | With graceful degradation |
|---|---|---|
| L1 warm hit | ✅ 0.065 ms | ✅ 0.065 ms |
| L2 hit (Redis slow) | ❌ timeout → 500 | ✅ 201 ms (fell back to fetcher) |
| Cold miss (Redis slow) | ❌ timeout → 500 | ✅ 200 ms (fell back to fetcher) |
L1 hot hits aren't affected at all since they never touch Redis. And once Redis recovers, the layer re-enables itself automatically after `retryAfterMs`.
There's also a circuit breaker that stops hammering a broken upstream after repeated failures — so if Redis is flapping, you're not generating a flood of failed connection attempts:
```typescript
const cache = new CacheStack(layers, {
  circuitBreaker: {
    threshold: 5,        // open after 5 consecutive failures
    resetAfterMs: 30_000 // try again after 30 seconds
  },
})
```
## Incident #3: Server A Showed Different Data Than Server B
The alert: A user filed a support ticket saying their profile picture updated successfully, but "sometimes" shows the old one.
It took an embarrassingly long time to figure out what was happening. The update was going through. Redis was getting invalidated correctly. But we had 4 app instances, each with an in-process memory cache — and only the instance that processed the update was clearing its L1. The other three kept serving stale data until their TTLs expired naturally.
The user was hitting different instances on different requests. Depending on which one load-balanced to them, they'd see new data or old data.
We shipped a hotfix that dropped the L1 memory cache entirely and went Redis-only. It worked. It also quietly made every single request ~30% slower because now everything paid the Redis round-trip cost.
How LayerCache handles it:
The Redis invalidation bus broadcasts L1 invalidations across all instances via pub/sub. When any instance deletes or updates a key, every other instance's memory layer clears it too:
```typescript
const cache = new CacheStack(
  [
    new MemoryLayer({ ttl: 60, maxSize: 10_000 }),
    new RedisLayer({ client: redis, ttl: 3600 }),
  ],
  {
    invalidationBus: new RedisInvalidationBus({
      publisher: redis,
      subscriber: new Redis(), // needs a separate connection for subscriptions
    }),
  }
)

// This invalidation propagates to ALL instances automatically
await cache.delete('user:profile:42')
```
You keep the L1 speed benefit. You don't lose consistency. You don't have to drop the memory layer and eat the latency cost.
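The mechanism is easy to picture with an in-memory bus standing in for Redis pub/sub. A sketch, not the real `RedisInvalidationBus`:

```typescript
// Pub/sub invalidation sketch: every app instance subscribes to a shared
// bus; deleting a key on one instance broadcasts so peers drop their L1 copy.
type Handler = (key: string) => void

class InMemoryBus {
  private subscribers: Handler[] = []
  subscribe(h: Handler) { this.subscribers.push(h) }
  publish(key: string) { for (const h of this.subscribers) h(key) }
}

class AppInstance {
  readonly l1 = new Map<string, string>()

  constructor(private bus: InMemoryBus) {
    // Every instance clears its L1 on any broadcast (a real bus would
    // also skip messages originating from itself).
    bus.subscribe(key => this.l1.delete(key))
  }

  delete(key: string) {
    this.l1.delete(key)        // clear locally
    this.bus.publish(key)      // broadcast so peers clear too
  }
}

const bus = new InMemoryBus()
const serverA = new AppInstance(bus)
const serverB = new AppInstance(bus)
serverA.l1.set('user:profile:42', 'old-avatar')
serverB.l1.set('user:profile:42', 'old-avatar')

serverA.delete('user:profile:42')
console.log(serverB.l1.has('user:profile:42')) // false — peer L1 cleared too
```

Swap the in-memory bus for a Redis pub/sub channel and you get the cross-process version: the stale-L1 window shrinks from "until the TTL expires" to "one pub/sub hop."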
Tag-based invalidation works across instances too:
```typescript
// Any instance calling this clears 'user:42' on all servers
await cache.invalidateByTag('user:42')
```
## Incident #4: A Bug Fix Made the Stale Data Worse
The situation: We had a content management system where editors would publish changes. After a publish, they'd refresh the page and still see the old version. Sometimes for 60 seconds. Sometimes for 5 minutes.
The TTL was set to 5 minutes. The fix seemed obvious: reduce it to 30 seconds. Editors would see their changes faster. Done.
What actually happened: Reducing the TTL from 5 minutes to 30 seconds increased our DB query rate by roughly 10x. Not because traffic went up — because cache entries were expiring 10x more frequently across thousands of keys. We'd fixed the editor experience and broken DB performance in the same commit.
The real problem wasn't the TTL. It was that we had no way to explicitly invalidate by content — we were relying entirely on TTL expiry to eventually serve fresh data.
How LayerCache handles it:
Tag invalidation lets you associate keys with logical groups and invalidate the entire group at once — regardless of TTL:
```typescript
// When caching, attach tags
await cache.set('article:42', article, { tags: ['articles', 'author:7'] })
await cache.set('article:43', article, { tags: ['articles', 'author:7'] })
await cache.set('article:44', article, { tags: ['articles', 'author:12'] })

// When author 7 publishes a change, invalidate everything tagged with them
await cache.invalidateByTag('author:7')
// → clears article:42 and article:43 immediately
// → article:44 is untouched
```
With explicit invalidation on publish, you can safely set long TTLs again — because you're no longer relying on expiry to deliver fresh data. The cache stays warm, DB load stays low, and editors see their changes immediately.
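One way to picture what tag invalidation does under the hood is a reverse index from tag to keys. A sketch under that assumption — not LayerCache's actual storage format:

```typescript
// Tag-index sketch: each tag maps to the set of keys carrying it,
// so one tag lookup clears the whole group regardless of TTL.
class TaggedCache {
  private store = new Map<string, unknown>()
  private tagIndex = new Map<string, Set<string>>()

  set(key: string, value: unknown, tags: string[] = []) {
    this.store.set(key, value)
    for (const tag of tags) {
      if (!this.tagIndex.has(tag)) this.tagIndex.set(tag, new Set())
      this.tagIndex.get(tag)!.add(key)
    }
  }

  get(key: string) { return this.store.get(key) }

  invalidateByTag(tag: string) {
    // A production version would also scrub these keys from other tags'
    // sets; here a dangling entry just means a harmless no-op delete later.
    for (const key of this.tagIndex.get(tag) ?? []) this.store.delete(key)
    this.tagIndex.delete(tag)
  }
}

const tagged = new TaggedCache()
tagged.set('article:42', 'v1', ['articles', 'author:7'])
tagged.set('article:43', 'v1', ['articles', 'author:7'])
tagged.set('article:44', 'v1', ['articles', 'author:12'])

tagged.invalidateByTag('author:7')
console.log(tagged.get('article:42')) // undefined — cleared
console.log(tagged.get('article:44')) // 'v1' — untouched
```

This is the piece that decouples TTL from freshness: TTL becomes a safety net for forgotten invalidations, not the primary mechanism.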
## The Pattern Behind All Four Incidents
Looking back at these, the failures weren't random. They fit a pattern:
- Stampede — no coordination between concurrent callers hitting the same key
- Redis down → full outage — cache was treated as required infrastructure, not optional
- Stale L1 across instances — in-memory caches had no way to talk to each other
- TTL as the only invalidation mechanism — no explicit control over when data becomes stale
Each of these is a known problem. Solutions exist. But every time I joined a new project, I'd find the same gaps, fix them one by one, and wonder why I was doing it again.
That's still the reason LayerCache exists. Not because any individual piece is complicated — but because having all of it working together correctly, from day one, is the thing that saves you the 2am incident.
## Try It
```bash
npm install layercache
```
If any of this resonated — if you've been in one of these incidents — I'd genuinely love to hear about it in the comments.
And if LayerCache looks useful, a ⭐ on GitHub helps other developers find it:
👉 github.com/flyingsquirrel0419/layercache