A few weeks ago, we had one of those production incidents that quietly start in the background and explode right when traffic peaks.
This one involved Aurora MySQL, a Lambda with a 30-second timeout, and a poorly-designed cache invalidation strategy that ended up flooding our database.
Here’s the story, what went wrong, and the changes we made so it never happens again.
🎬 The Setup
The night before the incident, we upgraded our Aurora MySQL engine version.
Everything looked good. No alarms. No red flags.
The next morning around 8 AM, our daily job kicked in — the one responsible for:
- deleting the stale “master data” cache
- refetching fresh master data from the DB
- storing it back in cache
The application needs this master dataset to work correctly, so if the cache isn’t warm, the DB gets hammered.
💥 The Explosion
Right after the engine upgrade, a specific query in the Lambda suddenly started taking 30+ seconds.
But our Lambda had a 30-second timeout.
So what happened?
- The cacheInvalidate → cacheRebuild flow failed.
- The cache remained empty.
- Every user request resulted in a cache miss.
- All those requests hit the DB directly.
- Aurora CPU spiked to 99%.
- Application responses stalled across the board.
Classic cache stampede.
We eventually triggered a failover, and luckily the same query ran in ~28.7 seconds on the new writer, just under the Lambda timeout. That bought us a few minutes to stabilize.
Later that night, we found the real culprit:
➡️ The upgrade had changed the query’s execution plan, and the new plan needed an index that didn’t exist yet.
We created the index via a hotfix, and the DB stabilized.
But the deeper problem was our cache invalidation approach.
🧹 Our Original Cache Invalidation: Delete First, Hope Later
Our initial flow was:
- Delete the existing cache key
- Fetch fresh data from DB
- Save it back to cache
If step 2 fails, everything collapses.
It’s simple… until it isn’t.
In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.
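In code terms, the fragile version looked roughly like this (the helper names are illustrative, not our actual code):
// The original "delete first, hope later" flow: fine until step 2 fails
async function refreshMasterDataOld() {
  await deleteFromCache("Master-Data"); // step 1: cache is now empty
  const data = await fetchFromDB();     // step 2: if this throws or times out...
  await saveToCache(data);              // step 3: ...we never get here, and every read misses
}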
🔧 What We Changed (and Recommend)
1. Never delete the cache before you have fresh data
We inverted the flow:
- Fetch → Validate → Update cache
- Only delete if we already have fresh data ready
This eliminates the “empty cache” window.
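A minimal sketch of the inverted flow, reusing the same fetchFromDB / saveToCache helpers used later in this post (the validation check is illustrative):
// Fetch → validate → update: the cache is only touched once we have good data
async function safeRefreshMasterData() {
  const freshData = await fetchFromDB(); // may throw or time out

  // Illustrative validation: adjust to whatever "valid" means for your dataset.
  if (!freshData || freshData.length === 0) {
    throw new Error("Refusing to overwrite cache with empty master data");
  }

  // Overwrite in place; the old value stays readable until this write succeeds.
  await saveToCache(freshData);
}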
2. Use “stale rollover” instead of blunt deletion
If the refresh job fails, we now:
- rename the key: "Master-Data" → "Master-Data-Stale"
- keep the old value available
- add an internal notification so the team can investigate
This ensures that even if the DB is slow or down, the system still has something to serve.
It’s not ideal, but it prevents a meltdown.
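Here is a rough sketch of the rollover with node-redis; notifyTeam is a placeholder for whatever alerting you use (Slack, SNS, PagerDuty, etc.):
// stale-rollover.js: keep the old value around instead of deleting it
const redis = require("./redis");

async function rolloverToStale(error) {
  // RENAME is atomic, so readers never see a half-deleted key.
  // Guard with EXISTS because RENAME throws if the source key is missing.
  if (await redis.exists("Master-Data")) {
    await redis.rename("Master-Data", "Master-Data-Stale");
  }

  // notifyTeam is a placeholder for your internal notification channel.
  await notifyTeam(`Master data refresh failed: ${error.message}`);
}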
3. API layer now returns stale data when fresh data is unavailable
The API logic became:
- Try to read "Master-Data"
- If not found:
  - Attempt to rebuild (only if allowed)
  - If rebuild fails → return stale data
This avoids cascading failures.
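Roughly, the read path looks like this; getMasterData, the allowRebuild flag, and the JSON encoding are illustrative, and refreshMasterData is the lock-protected rebuild shown in the next section:
// read-path.js: serve fresh if possible, stale if not, error only as a last resort
const redis = require("./redis");

async function getMasterData({ allowRebuild = true } = {}) {
  const fresh = await redis.get("Master-Data");
  if (fresh) return JSON.parse(fresh); // assumes values are stored as JSON strings

  if (allowRebuild) {
    try {
      return await refreshMasterData(); // lock-protected rebuild (see next section)
    } catch (err) {
      console.error("Rebuild failed, falling back to stale data", err);
    }
  }

  const stale = await redis.get("Master-Data-Stale");
  if (stale) return JSON.parse(stale);

  throw new Error("No master data available, fresh or stale");
}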
4. Add a Redis distributed lock to prevent cache stampede
Without this, even if stale data existed, multiple API nodes or Lambdas could all try to rebuild simultaneously — hammering the DB again.
With a Redis lock:
- Only one request gets the lock and rebuilds
- The others:
  - do not hit the DB
  - simply return stale data
  - wait for the winner to repopulate the cache
This one change alone removes most of the stampede risk.
Node.js — Acquire Distributed Lock (Redis)
Below is a simple Redis-based lock using SET NX PX (no external library).
The example uses node-redis; you can swap in ioredis (or another client) depending on your stack.
// redis.js
const { createClient } = require("redis");

const redis = createClient({
  url: process.env.REDIS_URL
});

// Connect on module load; log (rather than swallow) connection errors.
redis.connect().catch(console.error);

module.exports = redis;
Acquiring and Releasing the Lock
// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

const LOCK_KEY = "lock:master-data-refresh";
const LOCK_TTL = 10000; // 10 seconds

async function acquireLock() {
  const lockId = randomUUID();

  // SET NX PX: only succeeds if the key doesn't exist yet, and auto-expires
  // so a crashed holder can't block everyone forever.
  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }
  return null; // lock not acquired
}

async function releaseLock(lockId) {
  // Only delete the lock if we still own it. Note: GET + DEL is not atomic;
  // for strict correctness a Lua script (compare-and-delete) is safer.
  const current = await redis.get(LOCK_KEY);
  if (current === lockId) {
    await redis.del(LOCK_KEY);
  }
}

module.exports = { acquireLock, releaseLock };
Usage
const { acquireLock, releaseLock } = require("./lock");

async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId);
  }
}
5. Add observability around refresh times
We now record:
- query execution time
- cache refresh duration
- lock acquisition metrics
- alerts when a refresh exceeds a threshold
The goal is to catch slowdowns before they turn into timeouts.
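At its simplest, this can be a timing wrapper that emits a structured log line; the metric name and threshold below are illustrative, and in practice you’d ship the numbers to CloudWatch, Datadog, or whatever backend you use:
// metrics.js: time the refresh and flag anything creeping toward the Lambda timeout
const REFRESH_ALERT_THRESHOLD_MS = 20000; // well below the 30s Lambda timeout

async function timedRefresh(refreshFn) {
  const start = Date.now();
  try {
    return await refreshFn();
  } finally {
    const durationMs = Date.now() - start;
    // Structured log line; scrape it into your metrics backend of choice.
    console.log(JSON.stringify({ metric: "master_data_refresh_ms", durationMs }));
    if (durationMs > REFRESH_ALERT_THRESHOLD_MS) {
      console.warn(`Master data refresh took ${durationMs} ms, investigate before it hits the timeout`);
    }
  }
}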
📝 Key Takeaways
- Engine upgrades can change execution plans, sometimes dramatically.
- Always benchmark critical queries after major DB changes.
- Cache invalidation strategies must assume that refresh can fail.
- Serving stale-but-valid data is often better than serving errors.
- Distributed locks are essential for preventing cache stampedes.
🚀 Final Thoughts
The incident was stressful, but the learnings were worth it.
Caching problems rarely show up during normal traffic — they appear right when your system is busiest.
If you have a similar “delete-then-refresh” pattern somewhere in your application… you may want to review it before it reviews you.