DEV Community: Harsh Raj Dubey

The cache bug that only appears when your app goes viral

Harsh Raj Dubey — Mon, 15 Jun 2026 14:09:41 +0000

So this is not a story about a bug I found in someone else's code.

This is a story about a bug that is sitting in your code right now. Probably. And it will not show up in your local testing, it will not show up in staging, it will not show up under normal traffic. It shows up exactly when you don't want it to: when your app is trending on Product Hunt, or some influencer tweets about you, or you just hit the front page of Hacker News.

I found this bug in my own backend. Then I built a library to fix it properly. The library is called HerdLock. But before I talk about that, let me explain what the actual problem is.

You're caching things. Great. That's not enough.

Most Go backends I've seen, and most backends in general honestly, do something like this:

func GetUserProfile(ctx context.Context, userID string) (*User, error) {
    // Check cache first
    cached, err := redis.Get(ctx, "user:"+userID)
    if err == nil {
        return deserialize(cached), nil
    }

    // Cache miss, go to database
    user, err := db.QueryUser(ctx, userID)
    if err != nil {
        return nil, err
    }

    // Store in cache with 5 minute TTL
    redis.Set(ctx, "user:"+userID, serialize(user), 5*time.Minute)
    return user, nil
}

This is fine. This works. This is what everyone does.

Until the key expires.

When user:123 has a 5 minute TTL and exactly 5 minutes pass, what happens if at that exact moment you have 200 concurrent requests all asking for that user?

All 200 of them check the cache. All 200 see a miss. All 200 go to the database. Simultaneously.

[ user:123 expired ]
        │
        ├──► Request 1 ──► Cache Miss ──► DB Query
        ├──► Request 2 ──► Cache Miss ──► DB Query
        ├──► Request 3 ──► Cache Miss ──► DB Query
        ├──► ...
        └──► Request 200 ──► Cache Miss ──► DB Query
                                    │
                              DB goes 💥

This is called a cache stampede or thundering herd problem. The cache was supposed to protect your database. But the moment it expires under load, it does the opposite. It coordinates an attack on your database.

"Okay but 200 concurrent requests on one key, that's rare no?"

In normal traffic, yes.

In viral traffic? Your hot keys are hot. That trending product page, that leaderboard endpoint, that "current user" API call that every frontend makes on page load. Under 10x traffic these can easily get hundreds of concurrent hits.

And the worst part: the more popular your app gets, the worse the stampede. Traffic spike means more concurrent requests means more goroutines all hitting the expired key at the same time means bigger DB explosion. Your success literally causes your failure.

The naive fixes that don't actually work

"Just set a longer TTL"

You're delaying the problem, not solving it. Eventually it expires. Stampede happens.

"Use singleflight"

This is actually a good idea, and golang.org/x/sync/singleflight is a solid package. It deduplicates concurrent requests within a single process. So if 50 goroutines on the same pod all want the same key, only 1 actually fetches it.

But here's the thing. You're probably running multiple pods. You have 10 pods in production, each with their own singleflight group. Each pod sends 1 request to the DB. That's still 10 simultaneous DB queries. With 50 pods it's 50 queries. singleflight alone doesn't cross process boundaries.
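To make the per-pod limit concrete, here is a stdlib-only sketch of in-process coalescing. It is not the real golang.org/x/sync/singleflight code; `Group`, `call`, `Waiters`, and `simulate` are hypothetical names, and `Waiters` exists only so the demo can hold the fake query open until every goroutine on the pod has joined.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight fetch that other goroutines can wait on.
type call struct {
	done    chan struct{}
	waiters int
	val     string
	err     error
}

// Group coalesces concurrent fetches for the same key inside one process.
type Group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn once per key at a time; concurrent callers share its result.
func (g *Group) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		c.waiters++
		g.mu.Unlock()
		<-c.done // someone else is already fetching; wait for their result
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()

	g.mu.Lock()
	delete(g.calls, key) // later callers start a fresh fetch
	g.mu.Unlock()
	close(c.done)
	return c.val, c.err
}

// Waiters reports how many goroutines are waiting on the in-flight call.
func (g *Group) Waiters(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.waiters
	}
	return 0
}

// simulate returns the number of DB queries when every pod has its own Group.
func simulate(pods, perPod int) int64 {
	var dbHits int64
	var wg sync.WaitGroup
	for p := 0; p < pods; p++ {
		g := &Group{} // one group per pod: nothing is shared across pods
		for i := 0; i < perPod; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				g.Do("user:123", func() (string, error) {
					// Hold the fake query open until the rest of the pod has joined,
					// so the hit count does not depend on scheduling.
					for g.Waiters("user:123") < perPod-1 {
						time.Sleep(time.Millisecond)
					}
					atomic.AddInt64(&dbHits, 1)
					return "profile", nil
				})
			}()
		}
	}
	wg.Wait()
	return dbHits
}

func main() {
	fmt.Println("DB queries with 10 pods x 50 goroutines:", simulate(10, 50)) // prints 10
}
```

500 goroutines collapse to 10 queries, one per pod. That is a big win, but the number still grows with the pod count.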

"Add a mutex / distributed lock manually"

Now we're getting somewhere. But this is actually non-trivial to implement correctly. The lock needs to:

Be atomic (you can't use GET then SET, there's a race condition between them)
Release only if you own it (another process shouldn't release your lock)
Handle the case where the lock holder crashes mid-fetch
Do a double-check GET after acquiring (another pod may have already filled the cache while you waited for the lock)

Most hand-rolled implementations I've seen miss at least 2 of these. Mine did too, the first time.

What actually needs to happen

The correct flow looks like this:

Request comes in for key "user:123"
    │
    ├──► Check local in-memory cache ──► HIT: return immediately (sub-microsecond)
    │
    ├──► Check Redis ──► HIT (fresh): return value
    │                        │
    │                   HIT (stale but within SWR window):
    │                        └──► return stale value immediately
    │                             + trigger background refresh (user sees no delay)
    │
    └──► MISS: enter protection layer
              │
              ├──► In-process singleflight (deduplicate within this pod)
              │
              ├──► Acquire distributed Redis lock
              │         │
              │    Lock taken? ──► wait, retry
              │
              ├──► Double-check Redis (someone else may have filled it)
              │         └──► HIT: release lock, return (no DB query needed)
              │
              └──► Fetch from DB
                        └──► Store in Redis ──► Release lock ──► Return

Every step here has a reason. Skip one and you either have a stampede, a race condition, or unnecessary DB queries.
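The miss path is the part people skip, so here is a minimal sketch of just the lock plus double-check step. Assumptions: an in-memory map stands in for Redis, a `sync.Mutex` stands in for the distributed lock (one global lock, not one per key), and `loader`, `Fetch`, and `stampede` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cache is an in-memory stand-in for Redis.
type cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *cache) Set(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

type loader struct {
	cache  *cache
	lock   sync.Mutex // stand-in for the distributed lock
	dbHits int64
}

// Fetch checks the cache, takes the lock on a miss, then checks again
// before touching the database.
func (l *loader) Fetch(key string) string {
	if v, ok := l.cache.Get(key); ok {
		return v
	}
	l.lock.Lock()
	defer l.lock.Unlock()
	// Double-check: whoever held the lock before us may have filled the cache.
	if v, ok := l.cache.Get(key); ok {
		return v
	}
	atomic.AddInt64(&l.dbHits, 1) // the only place a "DB query" happens
	v := "row for " + key
	l.cache.Set(key, v)
	return v
}

// stampede fires n concurrent fetches at one cold key and returns the DB hits.
func stampede(n int) int64 {
	l := &loader{cache: &cache{data: map[string]string{}}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Fetch("user:123")
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&l.dbHits)
}

func main() {
	fmt.Println("DB queries for 200 concurrent misses:", stampede(200)) // prints 1
}
```

Delete the second `Get` and every goroutine that missed on the first check will query the database once it gets the lock, just one at a time instead of all at once.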

I got tired of writing this every time

I've worked on a few different backends now and I found myself implementing some version of this pattern in each one. Copy pasting from previous projects, tweaking slightly, introducing new subtle bugs each time.

So I packaged it properly as an open source Go library: HerdLock

The simplest usage looks like this:

// One time setup
herdlock.RegisterType(&User{})
hl := herdlock.New(redisClient)

// Replace your existing cache logic with this
val, err := hl.Fetch(ctx, "user:"+userID, 5*time.Minute, func(ctx context.Context) (any, error) {
    return db.QueryUser(ctx, userID)  // your existing DB call, unchanged
})

user := val.(*User)

That's it. Your existing fetch function goes in as-is. HerdLock handles everything around it. The in-process deduplication, the distributed lock, the double-check, the stale serving, all of it.

The benchmark that made this real for me

I wanted to actually prove this works under load, not just claim it does. So I wrote a benchmark that simulates a database with a connection pool of maximum 5 concurrent queries, then fires 100 goroutines at the same expired key simultaneously.

Benchmark Case                   | Time per Op      | DB Hits
--------------------------------------------------------------
Coalesced Fetch (HerdLock)       | ~2.3ms  total    |       1
Direct Fetch (No Protection)     | ~31.6ms total    |     100
--------------------------------------------------------------
                                   14x faster        99 DB calls saved

The DB hits column is what matters here. Without protection, your database gets 100 simultaneous queries. With HerdLock, it gets 1. Under real connection pool constraints, those 99 extra queries queue up and cause exactly the latency spike you see in production during traffic spikes.

The 14x latency number comes from the queuing. 100 requests divided by 5 connections equals 20 serial batches of queries. HerdLock collapses all of that down to a single query and 99 waiters sharing the result.
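If you want to see the queuing for yourself, this is a rough sketch of the unprotected case only. It is not the actual HerdLock benchmark: the `db` type, the 1ms query time, and the function names are assumptions for illustration. A buffered channel plays the connection pool.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// db simulates a database behind a fixed-size connection pool.
type db struct {
	slots    chan struct{}
	hits     int64
	inFlight int64
	peak     int64
}

func newDB(poolSize int) *db {
	return &db{slots: make(chan struct{}, poolSize)}
}

// Query blocks until a pool slot is free, then "runs" for 1ms.
func (d *db) Query() {
	d.slots <- struct{}{} // acquire a connection
	cur := atomic.AddInt64(&d.inFlight, 1)
	for { // record the highest concurrency seen so far
		old := atomic.LoadInt64(&d.peak)
		if cur <= old || atomic.CompareAndSwapInt64(&d.peak, old, cur) {
			break
		}
	}
	atomic.AddInt64(&d.hits, 1)
	time.Sleep(time.Millisecond) // arbitrary stand-in for query time
	atomic.AddInt64(&d.inFlight, -1)
	<-d.slots // release the connection
}

// batches is the number of serial rounds needed to drain the requests.
func batches(requests, poolSize int) int {
	return (requests + poolSize - 1) / poolSize
}

// directFetch sends every request straight to the database.
func directFetch(requests, poolSize int) (hits, peak int64) {
	d := newDB(poolSize)
	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Query()
		}()
	}
	wg.Wait()
	return d.hits, d.peak
}

func main() {
	hits, peak := directFetch(100, 5)
	fmt.Println("DB hits:", hits, "peak concurrent queries:", peak)
	fmt.Println("serial batches:", batches(100, 5))
	fmt.Printf("speedup from the table: %.1fx\n", 31.6/2.3)
}
```

The pool never runs more than 5 queries at once, so the other 95 goroutines just wait their turn. That waiting is the latency spike.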

Some things I added that I haven't seen in other libraries

Stale-While-Revalidate

Serve the old value immediately while refreshing in background. Users see zero extra latency. The refresh happens invisibly. This is the same pattern browsers use for service worker caching and it works beautifully for API responses too.

hl := herdlock.New(rdb,
    herdlock.WithStaleWhileRevalidate(30 * time.Second),
)

XFetch — probabilistic early expiry

This one is based on an actual research paper (Vattani, Chierichetti, Lowenstein 2015). Instead of waiting for the TTL cliff at t=60s, XFetch probabilistically starts refreshing keys before they expire. The math:

refresh early if:  now + (delta x beta x -ln(random)) >= expiresAt

Where delta is how long your fetch function actually takes, and random is drawn uniformly from (0, 1], so -ln(random) is never negative and the check runs a little ahead of the clock. Slower fetches mean earlier refreshes. The result is no more expiry cliff. Keys get quietly refreshed before they expire and users rarely see a miss. Higher beta means more aggressive early refresh.

Jitter strategies

If you cache 10,000 keys at startup all with TTL=60s, they all expire at t=60s. Mega stampede. Adding random jitter to TTLs spreads them out:

herdlock.WithJitter(herdlock.JitterEqual),
herdlock.WithJitterMax(10 * time.Second),
// TTLs now vary ±5s around your set value

Circuit breaker

If Redis itself starts failing, you don't want HerdLock to make things worse by retrying locks in a tight loop. The circuit breaker detects consecutive failures and automatically bypasses cache entirely, serving requests directly from DB until Redis recovers. Degraded mode instead of full outage.

What I chose to NOT include in v1

I made a deliberate call to keep HerdLock as a library, not a daemon or sidecar. Some distributed lock libraries want you to run a separate process. HerdLock just needs your existing Redis client, whatever you're already using. No extra infrastructure.

Also kept the dependency count low. The only non-standard dependencies are go-redis/v9 (which you likely already have) and hashicorp/golang-lru/v2 for the local cache. That's it.

When you should NOT use HerdLock

Being honest here:

Single instance apps: singleflight alone is sufficient, HerdLock is overkill
Non-idempotent fetch functions: HerdLock cannot guarantee exactly-once execution. If your fetch function charges a card or sends an email, that's a different problem entirely
Multi-key atomic fetches: not supported in v1

The part where I ask for feedback

I'm genuinely curious: how are you all handling this in your current projects? I've talked to a few people and the answers vary wildly:

Some folks have this fully solved with custom middleware
Some have a partial solution that handles the single-process case but not multi-pod
Some are just not handling it and hoping for the best (no judgment, I was here too)

And the bigger question I keep thinking about: at what point does it make sense to use a library for this vs. rolling your own singleflight + Redis lock? There's a real argument for owning the implementation. You understand exactly what it does, no external dependency to audit. Where's your line?

Drop a comment, would love to know.

If HerdLock solves something you've been manually patching, a star on GitHub helps more than you'd think for a new OSS project: github.com/harshrajdubey/herdlock-go

Android TV Is Not Just Big-Screen Android

Harsh Raj Dubey — Fri, 22 May 2026 19:58:49 +0000

What I learned building a browser for Android TV and why everything I assumed was wrong.

When I started working on a browser for Android TV, I thought: How different can it be? It runs Android. We have WebView. We know web tech.

That assumption aged poorly.

Android TV development is not just Android development on a bigger screen. It's a fundamentally different platform with different input models, different hardware realities, and a fragmentation problem that makes the regular Android ecosystem look tame. Here's what I ran into and what I wish someone had warned me about.

D-pad Focus Is Harder Than Touch UX

Normal Android apps assume touch. Gestures. Scrolling. Users tap what they want, swipe to explore, and pinch to zoom. The system knows exactly where the user's finger is pointing.

TV UX works on none of those assumptions.

With a remote control, you navigate with four directional buttons. That's it. There is no cursor (usually). There is no hover. There is no "tap anywhere." Every interaction is routed through focus states and a movement graph that the developer (not the user) defines.

That means:

Manually managing focus order, because Android's default focus traversal makes sense for form fields, not arbitrary UI layouts
Preventing focus traps, where the user presses right infinitely and nothing happens
Handling invisible focus states, where the focused element has no visible ring and the user has no idea where they are
Making every interactive element reachable via remote only, with no fallback to "just tap it"

The movement graph has to be predictable. Users on TV develop a mental model of where focus will go when they press a direction. Break that model once, and the experience feels broken forever.

Websites Are Not Designed for 10-Foot Viewing

This sounds obvious until you try it.

A site that works fine on desktop (readable, usable, functional) can become genuinely painful on TV because:

Text that looks normal at arm's length becomes tiny at 10 feet
Hover interactions (dropdown menus, tooltips, navigation reveals) simply don't exist on TV
Menus designed for mouse precision require pixel-perfect targeting that a D-pad can't provide
Dialogs that are carefully sized for 1080p monitors overflow on TV viewports with different scaling
Spacing that feels generous on a monitor feels cramped when you're looking at it across a room

To make arbitrary web content usable, we had to:

Increase clickable areas well beyond what the original site intended
Force zoom and scaling to make text legible at distance
Override viewport behavior to prevent sites from making layout decisions we didn't want
Inject CSS and JS fixes as a layer between the user and the original content

You're basically shimming bad assumptions at runtime. It's messy, but it's necessary.

WebView Fragmentation Is Brutal on TVs

On phones, Android System WebView is updated regularly through the Play Store. Not so on TVs.

TV vendors rarely push:

Android System WebView updates
Chromium engine updates
Security patches

What that means in practice: Android version became a meaningless signal.

Two TVs, both reporting Android 12. One has Chromium 66. Another has Chromium 102. That's a four-year gap in browser engine capability. Both TVs will pass any OS version check you write. Neither will behave the same.

The consequence is a class of bugs that are genuinely hard to reproduce and reason about:

Missing JavaScript APIs that the spec added years ago
CSS features that silently fail or render incorrectly
Video playback inconsistencies in how codecs are handled
Modern frameworks that partially work, just enough to be confusing

We learned to detect capabilities, not versions. Don't ask "is this Android 11?" Ask "does this device support this specific API?" It's more work upfront, but it's the only thing that gives you accurate information.

Fake Android Versions from OEMs

Related, but worse.

Cheap and regional OEM TVs often:

Spoof Android version strings entirely
Heavily customize firmware in ways that break standard behaviors
Remove Google components (no Play Services, no certified WebView)
Ship uncertified builds that passed no compatibility testing

So you'd see a device claiming Android 13 that behaved like a heavily stripped Android 9. Or a "certified Android TV" that was actually a modified AOSP box with a launcher slapped on.

This is where capability detection stopped being a best practice and became a survival strategy. You simply cannot trust what the device tells you about itself.

Pointer Simulation Is Deceptive

Some TVs support a simulated cursor, a virtual pointer you can move around with the remote, mimicking mouse behavior. This sounds like it solves the D-pad problem. It doesn't.

The issue is that TVs don't have real pointer semantics. The cursor is a visual overlay, not an actual input device the OS understands as a pointer. That creates a cascade of problems:

Focus-based and touch-based systems conflict with each other when both exist simultaneously
Coordinate mapping becomes inconsistent: what does "cursor position" mean if there's no real pointer device?
Hitbox mismatch: the visual cursor appears over a button, but the actual registered click coordinate is offset, often by however much the DPI or scaling is miscalculated

The most concrete example: we had a bug where we treated the center of the cursor image as the click point. Seemed reasonable. It was wrong. The actual registered input coordinate was different, and elements that appeared to be under the cursor weren't being activated. Tracking that down was not fun.

DPI inconsistencies and density miscalculations made it worse. A UI that looked correct on one TV would have systematically shifted hitboxes on another.

Hardware Acceleration Inconsistencies

TVs have GPUs, but the drivers for those GPUs on cheap hardware are often poor.

We saw:

Animations that dropped frames dramatically even when the content was simple
WebView rendering that lagged visibly on transitions
Video overlays that conflicted with composited UI layers
Hardware acceleration that caused more problems than it solved on specific chipsets

The workaround was often the opposite of what you'd do on a performance-focused mobile app: disable effects, reduce transparency, simplify rendering, lower repaint frequency. You're optimizing for correctness over aesthetics.

Overscan Still Exists

Overscan is a legacy TV behavior where the display crops the edges of the image slightly, a leftover from CRT broadcasting. You'd think it's gone by now.

It's not.

A UI that fits perfectly in the Android emulator or on your test monitor can, on a real TV:

Clip the edges of buttons so they're partially off-screen
Hide navigation elements
Cut subtitles or action labels

TV-safe margins (keeping all meaningful content away from the outer 5-10% of the screen) aren't just a guideline. They're a requirement if you want to ship something that works everywhere.

Remote Latency Matters More Than Expected

On mobile, a 100ms UI response feels fine. Acceptable. Maybe even snappy.

On TV, with a D-pad, a 100ms delay on focus movement feels terrible. Navigation feels sluggish and unresponsive, even though the same delay would go unnoticed on other platforms.

This is partly perceptual. TV UX is used from a couch, passively, and the threshold for "this feels broken" is lower. But it's also because D-pad navigation is sequential and modal. You press right, wait for focus to move, then press again. If each press introduces latency, it compounds.

Focus movement needs to feel instant. Not fast. Instant.

APK Behavior Varies Wildly by Manufacturer

Beyond WebView, the underlying APK behavior differed meaningfully across:

Keycodes: what keycode does "back" send? Depends on the manufacturer.
Launcher behavior: how does the system handle app lifecycle when the user goes home?
Background restrictions: some TVs killed background processes aggressively; others didn't
Fullscreen APIs: WindowInsets, systemUiVisibility, the behavior of immersive mode all differed
Permissions: some TVs prompted for permissions differently or blocked them silently
Autoplay support: video autoplay policies varied dramatically

This was especially pronounced on Mi TV, Realme TV, generic AOSP TV boxes, and uncertified Android TVs from regional manufacturers. Each had its own quirks, and none of them were documented.

Debugging Is Painful

Many TVs:

Lack proper developer tools
Have ADB support that's broken, disabled, or intermittent
Disconnect randomly during sessions
Hide or truncate logs

This makes iteration significantly slower than standard Android development. The feedback loop is longer, crashes are harder to inspect, and reproducing bugs in a controlled way often requires having the exact physical device in hand.

We developed a habit of logging aggressively to in-app overlays: visible debug panels that showed state, errors, and event sequences without relying on ADB. Inelegant, but effective.

Memory Limitations on Cheap Hardware

Premium TVs have reasonable RAM. Cheap TVs, which are often the majority of units in certain markets, have:

1-2GB RAM, sometimes less
Slow eMMC storage
Weak CPUs with thermal throttling

Heavy web apps that run fine on mid-range phones become noticeably sluggish on these devices. Memory pressure causes WebView to drop cached resources. Page loads take longer. Complex layouts trigger more GC pauses.

This pushed us toward lighter rendering, aggressive caching, and simplified page structures for TV-specific views.

The Core Insight

After all of this, the biggest architectural realization was this:

The browser wasn't the problem. The assumptions browsers make about input methods and responsive behavior were.

Browsers are built for a world of mouse pointers, touch surfaces, and responsive viewports. That world doesn't exist on TV. The moment you try to present web content on a TV, you're in a gap between two systems that weren't designed to meet.

Building a good TV browser isn't about shipping Chromium on a bigger screen. It's about mediating between the web's assumptions and the TV's reality, at every layer of the stack, from input handling to rendering to hardware capability detection.

It's more work than it looks. But it's genuinely interesting work.

If you've shipped something for Android TV and ran into similar (or completely different) problems, I'd love to hear about it in the comments.