A few weeks ago, we had one of those production incidents that quietly start in the background and explode right when traffic peaks.
This one involved Aurora MySQL, a Lambda with a 30-second timeout, and a poorly-designed cache invalidation strategy that ended up flooding our database.
Here’s the story, what went wrong, and the changes we made so it never happens again.
🎬 The Setup
The night before the incident, we upgraded our Aurora MySQL engine version.
Everything looked good. No alarms. No red flags.
The next morning around 8 AM, our daily job kicked in — the one responsible for:
- deleting the stale “master data” cache
- refetching fresh master data from the DB
- storing it back in cache
The application needs this master dataset to work correctly, so if the cache isn’t warm, the DB gets hammered.
💥 The Explosion
Right after the engine upgrade, a specific query in the Lambda suddenly started taking 30+ seconds.
But our Lambda had a 30-second timeout.
So what happened?
- The cacheInvalidate → cacheRebuild flow failed.
- The cache remained empty.
- Every user request resulted in a cache miss.
- All those requests hit the DB directly.
- Aurora CPU spiked to 99%.
- Application responses stalled across the board.
Classic cache stampede.
We eventually triggered a failover, and luckily the same query ran in ~28.7 seconds on the new writer, just under the Lambda timeout. That bought us a few minutes to stabilize.
Later that night, we found the real culprit:
➡️ The upgrade had changed the query’s execution plan, and the new plan needed an index that didn’t exist yet.
We created the index via a hotfix, and the DB stabilized.
But the deeper problem was our cache invalidation approach.
🧹 Our Original Cache Invalidation: Delete First, Hope Later
Our initial flow was:
- Delete the existing cache key
- Fetch fresh data from DB
- Save it back to cache
If step 2 fails, everything collapses.
It’s simple… until it isn’t.
In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.
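In code terms, the fragile version looked roughly like this (the helper names are illustrative, not our actual code):
// The original "delete first, hope later" flow: fine until step 2 fails
async function refreshMasterDataOld() {
  await deleteFromCache("Master-Data"); // step 1: cache is now empty
  const data = await fetchFromDB();     // step 2: if this throws or times out...
  await saveToCache(data);              // step 3: ...we never get here, and every read misses
}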
🔧 What We Changed (and Recommend)
1. Never delete the cache before you have fresh data
We inverted the flow:
- Fetch → Validate → Update cache
- Only delete if we already have fresh data ready
This eliminates the “empty cache” window.
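A minimal sketch of the inverted flow, reusing the same fetchFromDB / saveToCache helpers used later in this post (the validation check is illustrative):
// Fetch → validate → update: the cache is only touched once we have good data
async function safeRefreshMasterData() {
  const freshData = await fetchFromDB(); // may throw or time out

  // Illustrative validation: adjust to whatever "valid" means for your dataset.
  if (!freshData || freshData.length === 0) {
    throw new Error("Refusing to overwrite cache with empty master data");
  }

  // Overwrite in place; the old value stays readable until this write succeeds.
  await saveToCache(freshData);
}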
2. Use “stale rollover” instead of blunt deletion
If the refresh job fails, we now:
- rename the key: "Master-Data" → "Master-Data-Stale"
- keep the old value available
- add an internal notification so the team can investigate
This ensures that even if the DB is slow or down, the system still has something to serve.
It’s not ideal, but it prevents a meltdown.
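Here is a rough sketch of the rollover with node-redis; notifyTeam is a placeholder for whatever alerting you use (Slack, SNS, PagerDuty, etc.):
// stale-rollover.js: keep the old value around instead of deleting it
const redis = require("./redis");

async function rolloverToStale(error) {
  // RENAME is atomic, so readers never see a half-deleted key.
  // Guard with EXISTS because RENAME throws if the source key is missing.
  if (await redis.exists("Master-Data")) {
    await redis.rename("Master-Data", "Master-Data-Stale");
  }

  // notifyTeam is a placeholder for your internal notification channel.
  await notifyTeam(`Master data refresh failed: ${error.message}`);
}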
3. API layer now returns stale data when fresh data is unavailable
The API logic became:
- Try to read "Master-Data"
- If not found:
  - Attempt to rebuild (only if allowed)
  - If rebuild fails → return stale data
This avoids cascading failures.
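Roughly, the read path looks like this; getMasterData, the allowRebuild flag, and the JSON encoding are illustrative, and refreshMasterData is the lock-protected rebuild shown in the next section:
// read-path.js: serve fresh if possible, stale if not, error only as a last resort
const redis = require("./redis");

async function getMasterData({ allowRebuild = true } = {}) {
  const fresh = await redis.get("Master-Data");
  if (fresh) return JSON.parse(fresh); // assumes values are stored as JSON strings

  if (allowRebuild) {
    try {
      return await refreshMasterData(); // lock-protected rebuild (see next section)
    } catch (err) {
      console.error("Rebuild failed, falling back to stale data", err);
    }
  }

  const stale = await redis.get("Master-Data-Stale");
  if (stale) return JSON.parse(stale);

  throw new Error("No master data available, fresh or stale");
}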
4. Add a Redis distributed lock to prevent cache stampede
Without this, even if stale data existed, multiple API nodes or Lambdas could all try to rebuild simultaneously — hammering the DB again.
With a Redis lock:
- Only one request gets the lock and rebuilds
- The others:
  - do not hit the DB
  - simply return stale data
  - wait for the winner to repopulate the cache
This one change alone removes most of the stampede risk.
Node.js — Acquire Distributed Lock (Redis)
Below is a simple Redis-based lock using SET NX PX (no external library).
The example uses node-redis; you can swap in ioredis (or another client) depending on your stack.
// redis.js
const { createClient } = require("redis");

const redis = createClient({
  url: process.env.REDIS_URL
});

// Connect on module load; log (rather than swallow) connection errors.
redis.connect().catch(console.error);

module.exports = redis;
Acquiring and Releasing the Lock
// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

const LOCK_KEY = "lock:master-data-refresh";
const LOCK_TTL = 10000; // 10 seconds

async function acquireLock() {
  const lockId = randomUUID();

  // SET NX PX: only succeeds if the key doesn't exist yet, and auto-expires
  // so a crashed holder can't block everyone forever.
  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }
  return null; // lock not acquired
}

async function releaseLock(lockId) {
  // Only delete the lock if we still own it. Note: GET + DEL is not atomic;
  // for strict correctness a Lua script (compare-and-delete) is safer.
  const current = await redis.get(LOCK_KEY);
  if (current === lockId) {
    await redis.del(LOCK_KEY);
  }
}

module.exports = { acquireLock, releaseLock };
Usage
const { acquireLock, releaseLock } = require("./lock");

async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId);
  }
}
5. Add observability around refresh times
We now record:
- query execution time
- cache refresh duration
- lock acquisition metrics
- alerts when a refresh exceeds a threshold
The goal is to catch slowdowns before they turn into timeouts.
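At its simplest, this can be a timing wrapper that emits a structured log line; the metric name and threshold below are illustrative, and in practice you’d ship the numbers to CloudWatch, Datadog, or whatever backend you use:
// metrics.js: time the refresh and flag anything creeping toward the Lambda timeout
const REFRESH_ALERT_THRESHOLD_MS = 20000; // well below the 30s Lambda timeout

async function timedRefresh(refreshFn) {
  const start = Date.now();
  try {
    return await refreshFn();
  } finally {
    const durationMs = Date.now() - start;
    // Structured log line; scrape it into your metrics backend of choice.
    console.log(JSON.stringify({ metric: "master_data_refresh_ms", durationMs }));
    if (durationMs > REFRESH_ALERT_THRESHOLD_MS) {
      console.warn(`Master data refresh took ${durationMs} ms, investigate before it hits the timeout`);
    }
  }
}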
📝 Key Takeaways
- Engine upgrades can change execution plans, sometimes dramatically.
- Always benchmark critical queries after major DB changes.
- Cache invalidation strategies must assume that refresh can fail.
- Serving stale-but-valid data is often better than serving errors.
- Distributed locks are essential for preventing cache stampedes.
🚀 Final Thoughts
The incident was stressful, but the learnings were worth it.
Caching problems rarely show up during normal traffic — they appear right when your system is busiest.
If you have a similar “delete-then-refresh” pattern somewhere in your application… you may want to review it before it reviews you.