Gabriel Anhaia
Why Your Caching Strategy Probably Has the Same 4 Holes


A team I talked to last quarter had a Tuesday outage that lasted eight minutes. The storefront looked fine until checkout, which started returning 503s. Database CPU sat at 98%. The cache was healthy. The cache was, in fact, the problem. A single product page (the one being pushed in an email blast that morning) held a cached entry that expired at 09:00:00 sharp, and 14,000 concurrent requests proceeded to hammer Postgres to rebuild it. This is the canonical shape of a cache stampede.

You have probably seen at least one of these. Most teams have all four somewhere in their stack and do not know it until the page goes red.

Hole 1: Stampede on hot-key expiration

The shape. A cache entry expires. N concurrent requests miss simultaneously. All N attempt to rebuild it from the underlying store. The store is sized for the cached steady-state load, not for N times the steady-state load, and falls over.

The classic trigger is synchronized expiration on a viral key. Email blast at 09:00, every product cache entry got TTL=3600 during that morning's deploy at 08:00, and the most popular product's entry expires everywhere at the same second. There are well-documented variants of this in the wild.

Two defenses, used together.

TTL jitter. Add a random offset to every TTL. If your base TTL is 600s, set it to 600 + random(0, 60). Cheap, no coordination, and it dramatically flattens the miss curve. The single most underrated change you can make.

Single-flight on miss. When a miss happens, only one request rebuilds. The rest wait for that rebuild and read its result. The pattern is a short-lived distributed lock with polling waiters, not a hard mutex.

import random, time
import redis

r = redis.Redis()

def cached_get(key, fetch_fn, ttl=600, jitter=60):
    cached = r.get(key)
    if cached is not None:
        return cached

    lock_key = f"lock:{key}"
    if r.set(lock_key, "1", nx=True, ex=10):
        try:
            value = fetch_fn()
            ttl_actual = ttl + random.randint(0, jitter)
            r.set(key, value, ex=ttl_actual)
            return value
        finally:
            r.delete(lock_key)

    # Lost the race. Wait briefly, then read the rebuilt value.
    deadline = time.monotonic() + 5.0
    while time.monotonic() < deadline:
        cached = r.get(key)
        if cached is not None:
            return cached
        time.sleep(0.05)

    # Fallback: take the hit rather than block forever.
    return fetch_fn()

The 50ms poll is intentional. It is short enough that the waiter sees the rebuilt value within one round trip, long enough that you are not spinning on Redis. The 5-second deadline keeps a stuck rebuild from blocking the whole pool.

Hole 2: Stale read on write-through misuse

How it breaks. Write-through caching writes to the cache and the database in the same call. The intuition is that the cache is always fresh because every write goes through it. The intuition is wrong on two paths: failure halfway, and reads that bypass the writer.

Consider the failure halfway. The cache write succeeds, the database write fails. You now have a cache that returns a value the database does not have. On the next cache eviction, the value vanishes, and the database never hears about it. Or the inverse: database write succeeds, cache write fails, and you serve the old cache for the rest of the TTL.

The other path is subtler. Service A uses write-through. Service B writes directly to the same database. Service A's cache is now stale by however long until the next invalidation, and there is no invalidation, because Service A has no idea Service B wrote.

The defense is to stop pretending write-through gives you a "single source of truth." Two patterns work.

Write-around with explicit invalidation. Writes go to the database. After the database commit succeeds, the writer deletes the key from the cache. The next read repopulates from the fresh database state. You lose the latency benefit on writes; you gain consistency you can reason about.

Write-behind only with idempotent writes and an outbox. If you really need write-side latency, write to the cache, write to a durable outbox, and replay the outbox to the database asynchronously. Reads are fast and consistent (within the cache); the database is eventually consistent, with a bounded lag. Use this only when you can tolerate that lag and your writes are idempotent. A sketch follows the write-around example below.

def write_with_invalidation(key, value, db_write, cache):
    db_write(key, value)        # primary store first
    try:
        cache.delete(key)       # invalidate, do not update
    except Exception as e:
        log_invalidation_failure(key, e)
        # Fall back to the TTL.

Note the delete rather than a set. Setting the cache on write looks faster but races with concurrent reads: a read that fetched the old value milliseconds earlier can write it into the cache just after your set, resurrecting the stale value.
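
For the write-behind variant, here is a minimal sketch of the outbox flow. The outbox_append, outbox_poll, and outbox_ack helpers are hypothetical stand-ins for whatever durable queue or table you use (a table in the same database works); the hard requirements are that the outbox survives a crash and that db_write is idempotent.

import time

def write_behind(key, value, cache, outbox_append):
    # Durable intent first: if this raises, the caller knows the
    # write failed and nothing is cached.
    outbox_append(key, value)
    # Cache second: if this fails, the replay loop still lands the
    # write in the database and a later read repopulates the cache.
    cache.set(key, value)

def replay_outbox(db_write, outbox_poll, outbox_ack):
    # Background worker: drain pending writes into the database.
    while True:
        entry = outbox_poll()            # oldest unacked entry, or None
        if entry is None:
            time.sleep(0.1)
            continue
        entry_id, key, value = entry
        db_write(key, value)             # idempotent: safe to retry
        outbox_ack(entry_id)             # ack only after the DB commit

One deliberate choice in the sketch: the outbox append comes before the cache set, so a failed cache write degrades to a delayed read repopulation instead of a write the database never sees.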

Hole 3: Negative cache poisoning

What goes wrong. Your cache stores None results to avoid hammering the database for nonexistent records. A user lookup for id=999999999 returns nothing, and you cache that miss for ten minutes to prevent a denial-of-service via random IDs. Reasonable.

Now the record is created. The cache still says None. The user signs up, the cache says they do not exist, the login flow rejects them. Or worse, an attacker observes which IDs your system marks as "does not exist" and uses that to map out which IDs are actually assigned.

Two failure modes, both common.

The first is duration. A negative-cache TTL of ten minutes is too long if entities can be created at any time. Keep it measured in seconds: the pain of a few extra DB hits is worth the reduced staleness.

The second is invalidation on creation. The negative cache must be invalidated on the create path, not just the update path. Most teams remember to invalidate on update. They forget that a create is a transition from "not exists" to "exists" and needs the same treatment.

def get_user(user_id, cache, db):
    cached = cache.get(f"user:{user_id}")
    if cached == b"__NEGATIVE__":
        return None
    if cached is not None:
        return deserialize(cached)

    user = db.fetch_user(user_id)
    if user is None:
        cache.set(f"user:{user_id}", "__NEGATIVE__", ex=30)
        return None
    cache.set(f"user:{user_id}", serialize(user), ex=600)
    return user

def create_user(user_id, payload, cache, db):
    db.insert_user(user_id, payload)
    cache.delete(f"user:{user_id}")  # critical: clear negative entry

The 30-second negative TTL is a deliberate tradeoff. Long enough to absorb scan attacks, short enough that legitimate creates are visible quickly.

Hole 4: Cache-aside race conditions

The mechanism. Cache-aside is the pattern every tutorial teaches: read returns from cache or falls back to DB and populates the cache; write goes to the DB and invalidates the cache. The race is between the read's populate and the write's invalidate.

Sequence of events. Read R1 misses cache, fetches value V1 from DB, is briefly paused (GC, scheduler, network blip). Meanwhile, write W1 updates the DB to V2 and invalidates the cache (which is empty anyway). Now R1 resumes and writes V1 into the cache. The cache holds V1; the database holds V2. Until next TTL, every read is wrong.

This is genuinely hard to fix without coordination. The defenses are all "make the window smaller" rather than "close the window."

Stale-while-revalidate. Treat the cache value as "good enough" until refresh, but always serve from cache. On a write, mark the entry as "needs revalidation" instead of deleting it. The next read returns the stale value and triggers a background refresh. The race window collapses to "how stale will you accept" rather than "did the populate land before or after the invalidate."

Per-key versioning. Store the DB write version in the cache value. On the read populate, only set the cache if the incoming version is newer than the cached one. This needs an atomic check-and-set on the cache side (a Lua script or a WATCH/MULTI/EXEC transaction on Redis), but the race is genuinely closed. A sketch follows.
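
A minimal sketch of that check as a Redis Lua script (the script runs atomically server-side), assuming the database hands back a monotonically increasing version with each write; the hash layout here is illustrative.

import redis

r = redis.Redis()

# Set the cached value only if the incoming version is newer than
# whatever is already cached. KEYS[1] = cache key;
# ARGV = [version, value, ttl_seconds].
SET_IF_NEWER = r.register_script("""
local current = redis.call('HGET', KEYS[1], 'version')
if current and tonumber(current) >= tonumber(ARGV[1]) then
    return 0
end
redis.call('HSET', KEYS[1], 'version', ARGV[1], 'value', ARGV[2])
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 1
""")

def populate_if_newer(key, value, version, ttl=600):
    # Returns 1 if this version won, 0 if the cache already held a
    # newer (or equal) version and was left untouched.
    return SET_IF_NEWER(keys=[key], args=[version, value, ttl])

With this in place, the paused reader from the sequence above loses cleanly: its populate carries the older version and becomes a no-op.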

A combined cached_get with single-flight, jittered TTL, and stale-while-revalidate.

import random, time, threading
import redis

r = redis.Redis()

def cached_get_swr(key, fetch_fn, fresh=300, stale=600, jitter=60):
    payload = r.hgetall(key)
    if payload:
        value = payload[b"value"]
        expires_at = int(payload[b"expires"])
        now = int(time.time())
        if now < expires_at:
            return value          # fresh
        if now < expires_at + stale:
            _trigger_refresh(key, fetch_fn, fresh, stale, jitter)
            return value          # stale, refreshing in background
    return _refresh_now(key, fetch_fn, fresh, stale, jitter)

def _trigger_refresh(key, fetch_fn, fresh, stale, jitter):
    if r.set(f"refreshing:{key}", "1", nx=True, ex=10):
        threading.Thread(
            target=_refresh_now,
            args=(key, fetch_fn, fresh, stale, jitter),
            daemon=True,
        ).start()

_refresh_now is also called from the foreground path on a cold miss; the single-flight key in _trigger_refresh keeps the background path from stampeding.

def _refresh_now(key, fetch_fn, fresh, stale, jitter):
    try:
        value = fetch_fn()
        expires = int(time.time()) + fresh + random.randint(0, jitter)
        r.hset(key, mapping={"value": value, "expires": expires})
        r.expire(key, fresh + jitter + stale)  # keep through the stale window
    finally:
        # Always release the single-flight key, even if fetch_fn raised.
        r.delete(f"refreshing:{key}")
    return value

Three things at once. The fresh window is the "definitely good" period. The stale window is the "still serve, but refresh in background" period. The single-flight on refreshing:{key} keeps the rebuild from stampeding. Jitter on the expiry is sprinkled on top.
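
The write side is the piece the helper above leaves out. Rather than deleting the entry on a write, a minimal sketch that reuses the same hash layout pulls the freshness horizon back to now, so the next read serves the old value once and kicks off the background refresh:

def mark_stale(key):
    # Do not delete; just expire the freshness window. The entry
    # drops into the stale band, where reads still return it and
    # _trigger_refresh rebuilds it in the background.
    r.hset(key, "expires", int(time.time()) - 1)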

What ties them together

The four holes share one root cause: cache and database are two systems that must agree, and most cache code treats them as one. All four are versions of the same disagreement: a populate, an invalidate, or a write that the other side did not see.

The fixes are mechanical: jitter, single-flight, explicit invalidation on the right transitions, stale-while-revalidate. None of them are clever. All of them are easy to skip when you are shipping a feature and the cache is a quick r.set() and r.get() away.

The teams who avoid these holes are not smarter. They have read enough postmortems to know which corner the next outage will come from, and they have written the boring defenses into their cache helper rather than into a thousand call sites.

If this was useful

The four holes above are the same ones I cover with worked diagrams in the caching chapter of System Design Pocket Guide: Fundamentals. If you want the next layer down (choosing the cache topology, local, near, distributed, multi-tier, and how it interacts with your primary store), that is the Database Playbook.

System Design Pocket Guide: Fundamentals

Database Playbook
