Harsh Raj Dubey

Posted on Jun 15

The cache bug that only appears when your app goes viral

#go #redis #distributedsystems #opensource

So this is not a story about a bug I found in someone else's code.

This is a story about a bug that is sitting in your code right now. Probably. And it will not show up in your local testing, it will not show up in staging, it will not show up under normal traffic. It shows up exactly when you don't want it to: when your app is trending on Product Hunt, or some influencer tweets about you, or you just hit the front page of Hacker News.

I found this bug in my own backend. Then I built a library to fix it properly. The library is called HerdLock. But before I talk about that, let me explain what the actual problem is.

You're caching things. Great. That's not enough.

Most Go backends I've seen, and most backends in general honestly, do something like this:

func GetUserProfile(ctx context.Context, userID string) (*User, error) {
    // Check cache first
    cached, err := redis.Get(ctx, "user:"+userID)
    if err == nil {
        return deserialize(cached), nil
    }

    // Cache miss, go to database
    user, err := db.QueryUser(ctx, userID)
    if err != nil {
        return nil, err
    }

    // Store in cache with 5 minute TTL
    redis.Set(ctx, "user:"+userID, serialize(user), 5*time.Minute)
    return user, nil
}

This is fine. This works. This is what everyone does.

Until the key expires.

When user:123 has a 5 minute TTL and exactly 5 minutes pass, what happens if at that exact moment you have 200 concurrent requests all asking for that user?

All 200 of them check the cache. All 200 see a miss. All 200 go to the database. Simultaneously.

[ user:123 expired ]
        │
        ├──► Request 1 ──► Cache Miss ──► DB Query
        ├──► Request 2 ──► Cache Miss ──► DB Query
        ├──► Request 3 ──► Cache Miss ──► DB Query
        ├──► ...
        └──► Request 200 ──► Cache Miss ──► DB Query
                                    │
                              DB goes 💥

This is called a cache stampede or thundering herd problem. The cache was supposed to protect your database. But the moment it expires under load, it does the opposite. It coordinates an attack on your database.
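If you want to see the effect without touching real infrastructure, here is a minimal sketch. It assumes an in-memory map standing in for Redis and a sleep standing in for the DB query; fakeCache and runStampede are hypothetical names made up for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fakeCache is a hypothetical in-memory stand-in for Redis.
type fakeCache struct {
	mu   sync.Mutex
	data map[string]string
}

func (c *fakeCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *fakeCache) Set(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

// runStampede fires n concurrent requests at a key that has just expired
// and returns how many of them reached the simulated database.
func runStampede(n int) int64 {
	cache := &fakeCache{data: map[string]string{}} // empty: the key "expired"
	var dbHits int64
	start := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // release all requests at the same moment
			if _, ok := cache.Get("user:123"); ok {
				return // cache hit, no DB work
			}
			atomic.AddInt64(&dbHits, 1)       // cache miss: query the "database"
			time.Sleep(50 * time.Millisecond) // simulated query latency
			cache.Set("user:123", "profile")
		}()
	}
	close(start)
	wg.Wait()
	return atomic.LoadInt64(&dbHits)
}

func main() {
	fmt.Println("DB hits for 200 concurrent requests:", runStampede(200))
}
```

The exact count depends on scheduling, but because every request checks the cache before the first query has finished, you should expect the number to land at or very near 200.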

"Okay but 200 concurrent requests on one key, that's rare no?"

In normal traffic, yes.

In viral traffic? Your hot keys are hot. That trending product page, that leaderboard endpoint, that "current user" API call that every frontend makes on page load. Under 10x traffic these can easily get hundreds of concurrent hits.

And the worst part: the more popular your app gets, the worse the stampede. A traffic spike means more concurrent requests, which means more goroutines hitting the expired key at the same time, which means a bigger DB explosion. Your success literally causes your failure.

The naive fixes that don't actually work

"Just set a longer TTL"

You're delaying the problem, not solving it. Eventually it expires. Stampede happens.

"Use singleflight"

This is actually a good idea, and golang.org/x/sync/singleflight is a solid package. It deduplicates concurrent requests within a single process. So if 50 goroutines on the same pod all want the same key, only 1 actually fetches it.

But here's the thing. You're probably running multiple pods. You have 10 pods in production, each with their own singleflight group. Each pod sends 1 request to the DB. That's still 10 simultaneous DB queries. With 50 pods it's 50 queries. singleflight alone doesn't cross process boundaries.
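To make the per-pod limit concrete, here is a rough simulation. It does not import golang.org/x/sync/singleflight; instead it uses a tiny hand-written group type (hypothetical, written only for this sketch) that coalesces calls the same way, and it models each "pod" as a separate group inside one process.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight fetch that later callers can wait on.
type call struct {
	wg  sync.WaitGroup
	val string
}

// group is a minimal, hypothetical stand-in for singleflight.Group.
type group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; concurrent callers share the result.
func (g *group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // a fetch is already running: wait for its result
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val = fn() // only the first caller in this group gets here
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val
}

// simulate gives each pod its own group and returns the total DB hits.
func simulate(pods, perPod int) int64 {
	var dbHits int64
	start := make(chan struct{})
	var wg sync.WaitGroup
	for p := 0; p < pods; p++ {
		g := &group{} // one group per pod: nothing is shared across pods
		for i := 0; i < perPod; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				<-start
				g.Do("user:123", func() string {
					atomic.AddInt64(&dbHits, 1)
					time.Sleep(50 * time.Millisecond) // simulated query latency
					return "profile"
				})
			}()
		}
	}
	close(start)
	wg.Wait()
	return atomic.LoadInt64(&dbHits)
}

func main() {
	fmt.Println("DB hits with 10 pods x 50 goroutines:", simulate(10, 50))
}
```

Every pod has to run the fetch at least once, so the total can never drop below the number of pods no matter how well each pod deduplicates internally.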

"Add a mutex / distributed lock manually"

Now we're getting somewhere. But this is actually non-trivial to implement correctly. The lock needs to:

Be atomic (you can't use GET then SET, there's a race condition between them)
Release only if you own it (another process shouldn't release your lock)
Handle the case where the lock holder crashes mid-fetch
Do a double-check GET after acquiring (another pod may have already filled the cache while you waited for the lock)

Most hand-rolled implementations I've seen miss at least 2 of these. Mine did too, the first time.

What actually needs to happen

The correct flow looks like this:

Request comes in for key "user:123"
    │
    ├──► Check local in-memory cache ──► HIT: return immediately (sub-microsecond)
    │
    ├──► Check Redis ──► HIT (fresh): return value
    │                        │
    │                   HIT (stale but within SWR window):
    │                        └──► return stale value immediately
    │                             + trigger background refresh (user sees no delay)
    │
    └──► MISS: enter protection layer
              │
              ├──► In-process singleflight (deduplicate within this pod)
              │
              ├──► Acquire distributed Redis lock
              │         │
              │    Lock taken? ──► wait, retry
              │
              ├──► Double-check Redis (someone else may have filled it)
              │         └──► HIT: release lock, return (no DB query needed)
              │
              └──► Fetch from DB
                        └──► Store in Redis ──► Release lock ──► Return

Every step here has a reason. Skip one and you either have a stampede, a race condition, or unnecessary DB queries.
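Here is a simplified sketch of just the MISS branch: lock, double-check, fetch, store, release. It leaves out the local cache, singleflight, stale serving, and lock TTLs, and it uses a hypothetical in-memory sharedStore instead of Redis, so treat it as an illustration of the ordering rather than as HerdLock's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sharedStore is a hypothetical in-memory stand-in for Redis: one map for
// cached values and one for lock owners, guarded by a single mutex.
type sharedStore struct {
	mu    sync.Mutex
	cache map[string]string
	locks map[string]string
}

func (s *sharedStore) get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.cache[key]
	return v, ok
}

func (s *sharedStore) set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cache[key] = val
}

// tryLock is an atomic "set if absent", like SET NX (TTL omitted for brevity).
func (s *sharedStore) tryLock(key, owner string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.locks[key]; held {
		return false
	}
	s.locks[key] = owner
	return true
}

// unlock releases the lock only if owner still holds it.
func (s *sharedStore) unlock(key, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks[key] == owner {
		delete(s.locks, key)
	}
}

// protectedFetch walks the MISS branch: lock, double-check, fetch, store, release.
func protectedFetch(s *sharedStore, key, owner string, fetch func() string) string {
	if v, ok := s.get(key); ok {
		return v // cache hit
	}
	for !s.tryLock("lock:"+key, owner) {
		time.Sleep(time.Millisecond) // lock taken: wait, retry
	}
	defer s.unlock("lock:"+key, owner)
	if v, ok := s.get(key); ok {
		return v // double-check: someone filled it while we waited
	}
	v := fetch()
	s.set(key, v) // store before the deferred unlock runs
	return v
}

// run fires n concurrent requests at an empty cache and returns the DB hit count.
func run(n int) int64 {
	s := &sharedStore{cache: map[string]string{}, locks: map[string]string{}}
	var dbHits int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			protectedFetch(s, "user:123", fmt.Sprintf("req-%d", id), func() string {
				atomic.AddInt64(&dbHits, 1)
				time.Sleep(20 * time.Millisecond) // simulated query latency
				return "profile"
			})
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&dbHits)
}

func main() {
	fmt.Println("DB hits:", run(200)) // prints "DB hits: 1"
}
```

The count is exactly 1 because the value is stored before the lock is released, so every later lock holder finds it on the double-check. Remove the double-check and each waiter would query the DB in turn once it got the lock.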

I got tired of writing this every time

I've worked on a few different backends now and I found myself implementing some version of this pattern in each one. Copy pasting from previous projects, tweaking slightly, introducing new subtle bugs each time.

So I packaged it properly as an open source Go library: HerdLock

The simplest usage looks like this:

// One time setup
herdlock.RegisterType(&User{})
hl := herdlock.New(redisClient)

// Replace your existing cache logic with this
val, err := hl.Fetch(ctx, "user:"+userID, 5*time.Minute, func(ctx context.Context) (any, error) {
    return db.QueryUser(ctx, userID)  // your existing DB call, unchanged
})

if err != nil {
    return nil, err // check the error before touching val
}
user := val.(*User)

That's it. Your existing fetch function goes in as-is. HerdLock handles everything around it. The in-process deduplication, the distributed lock, the double-check, the stale serving, all of it.

The benchmark that made this real for me

I wanted to actually prove this works under load, not just claim it does. So I wrote a benchmark that simulates a database with a connection pool of maximum 5 concurrent queries, then fires 100 goroutines at the same expired key simultaneously.

Benchmark Case                   | Time per Op      | DB Hits
--------------------------------------------------------------
Coalesced Fetch (HerdLock)       | ~2.3ms  total    |       1
Direct Fetch (No Protection)     | ~31.6ms total    |     100
--------------------------------------------------------------
                                   14x faster        99 DB calls saved

The DB hits column is what matters here. Without protection, your database gets 100 simultaneous queries. With HerdLock, it gets 1. Under real connection pool constraints, those 99 extra queries queue up and cause exactly the latency spike you see in production during traffic spikes.

The 14x latency number (31.6ms / 2.3ms ≈ 13.7) comes from the queuing. With a pool of 5 connections, 100 uncoordinated queries have to run as 20 serial batches. HerdLock collapses all of that down to a single query and 99 waiters sharing the result.

Some things I added that I haven't seen in other libraries

Stale-While-Revalidate

Serve the old value immediately while refreshing in the background. Users see zero extra latency. The refresh happens invisibly. This is the same idea as the stale-while-revalidate Cache-Control directive in HTTP (RFC 5861), which is also a common service worker caching strategy, and it works beautifully for API responses too.

hl := herdlock.New(rdb,
    herdlock.WithStaleWhileRevalidate(30 * time.Second),
)

XFetch — probabilistic early expiry

This one is based on an actual research paper (Vattani, Chierichetti, Lowenstein 2015). Instead of waiting for the TTL cliff at t=60s, XFetch probabilistically starts refreshing keys before they expire. The math:

refresh early if:  now + (delta x beta x -ln(random)) >= expiresAt

Where delta is how long your fetch function actually takes and random is uniform in (0, 1], so -ln(random) is never negative. Slower fetches mean even earlier refreshes. The result is no more expiry cliff. Keys usually get quietly refreshed before they expire, so users rarely see a miss. Higher beta means more aggressive early refresh.

Jitter strategies

If you cache 10,000 keys at startup all with TTL=60s, they all expire at t=60s. Mega stampede. Adding random jitter to TTLs spreads them out:

herdlock.WithJitter(herdlock.JitterEqual),
herdlock.WithJitterMax(10 * time.Second),
// TTLs now vary ±5s around your set value

Circuit breaker

If Redis itself starts failing, you don't want HerdLock to make things worse by retrying locks in a tight loop. The circuit breaker detects consecutive failures and automatically bypasses cache entirely, serving requests directly from DB until Redis recovers. Degraded mode instead of full outage.

What I chose to NOT include in v1

I made a deliberate call to keep HerdLock as a library, not a daemon or sidecar. Some distributed lock libraries want you to run a separate process. HerdLock just needs your existing Redis client, whatever you're already using. No extra infrastructure.

Also kept the dependency count low. The only non-standard dependencies are go-redis/v9 (which you likely already have) and hashicorp/golang-lru/v2 for the local cache. That's it.

When you should NOT use HerdLock

Being honest here:

Single instance apps: singleflight alone is sufficient, HerdLock is overkill
Non-idempotent fetch functions: HerdLock cannot guarantee exactly-once execution. If your fetch function charges a card or sends an email, that's a different problem entirely
Multi-key atomic fetches: not supported in v1

The part where I ask for feedback

I'm genuinely curious: how are you all handling this in your current projects? I've talked to a few people and the answers vary wildly:

Some folks have this fully solved with custom middleware
Some have a partial solution that handles the single-process case but not multi-pod
Some are just not handling it and hoping for the best (no judgment, I was here too)

And the bigger question I keep thinking about: at what point does it make sense to use a library for this vs. rolling your own singleflight + Redis lock? There's a real argument for owning the implementation. You understand exactly what it does, no external dependency to audit. Where's your line?

Drop a comment, would love to know.

If HerdLock solves something you've been manually patching, a star on GitHub helps more than you'd think for a new OSS project: github.com/harshrajdubey/herdlock-go
