<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harsh Raj Dubey</title>
    <description>The latest articles on DEV Community by Harsh Raj Dubey (@harshrajdubey).</description>
    <link>https://dev.to/harshrajdubey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1783245%2F2b4ba993-2484-4883-bf18-e45bc6d53c3a.png</url>
      <title>DEV Community: Harsh Raj Dubey</title>
      <link>https://dev.to/harshrajdubey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harshrajdubey"/>
    <language>en</language>
    <item>
      <title>The cache bug that only appears when your app goes viral</title>
      <dc:creator>Harsh Raj Dubey</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:09:41 +0000</pubDate>
      <link>https://dev.to/harshrajdubey/the-cache-bug-that-only-appears-when-your-app-goes-viral-5d5j</link>
      <guid>https://dev.to/harshrajdubey/the-cache-bug-that-only-appears-when-your-app-goes-viral-5d5j</guid>
      <description>&lt;p&gt;So this is not a story about a bug I found in someone else's code.&lt;/p&gt;

&lt;p&gt;This is a story about a bug that is sitting in &lt;em&gt;your&lt;/em&gt; code right now. Probably. And it will not show up in your local testing, it will not show up in staging, it will not show up at normal traffic. It shows up exactly when you don't want it to. When your app is trending on Product Hunt, or some influencer tweets about you, or you just hit the front page of Hacker News.&lt;/p&gt;

&lt;p&gt;I found this bug in my own backend. Then I built a library to fix it properly. The library is called &lt;a href="https://github.com/harshrajdubey/herdlock-go" rel="noopener noreferrer"&gt;HerdLock&lt;/a&gt;. But before I talk about that, let me explain what the actual problem is.&lt;/p&gt;




&lt;h2&gt;
  
  
  You're caching things. Great. That's not enough.
&lt;/h2&gt;

&lt;p&gt;Most Go backends I've seen, and most backends in general honestly, do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetUserProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Check cache first&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"user:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deserialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Cache miss, go to database&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Store in cache with 5 minute TTL&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"user:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fine. This works. This is what everyone does.&lt;/p&gt;

&lt;p&gt;Until the key expires.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;user:123&lt;/code&gt; has a 5 minute TTL and exactly 5 minutes pass, what happens if at that exact moment you have 200 concurrent requests all asking for that user?&lt;/p&gt;

&lt;p&gt;All 200 of them check the cache. All 200 see a miss. All 200 go to the database. Simultaneously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ user:123 expired ]
        │
        ├──► Request 1 ──► Cache Miss ──► DB Query
        ├──► Request 2 ──► Cache Miss ──► DB Query
        ├──► Request 3 ──► Cache Miss ──► DB Query
        ├──► ...
        └──► Request 200 ──► Cache Miss ──► DB Query
                                    │
                              DB goes 💥
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called a &lt;strong&gt;cache stampede&lt;/strong&gt; or &lt;strong&gt;thundering herd problem&lt;/strong&gt;. The cache was supposed to protect your database. But the moment it expires under load, it does the opposite. It coordinates an attack on your database.&lt;/p&gt;
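
&lt;p&gt;To make the failure mode concrete, here is a minimal sketch that reproduces it with an in-memory map standing in for Redis and a counter standing in for the database. The names &lt;code&gt;naiveCache&lt;/code&gt; and &lt;code&gt;simulateStampede&lt;/code&gt; are made up for this illustration, and the barrier deliberately forces the worst case where every request checks the cache before anyone has refilled it.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// naiveCache is a hypothetical in-memory stand-in for Redis.
type naiveCache struct {
	mu   sync.Mutex
	data map[string]string
}

func (c *naiveCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *naiveCache) set(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

// simulateStampede runs n concurrent cache-aside reads against an empty cache
// and returns how many of them reached the fake database.
func simulateStampede(n int) int64 {
	cache := new(naiveCache)
	cache.data = map[string]string{}
	var dbHits atomic.Int64
	var sawMiss, done sync.WaitGroup
	sawMiss.Add(n)
	done.Add(n)
	for i := 0; i != n; i++ {
		go func() {
			defer done.Done()
			_, hit := cache.get("user:123")
			sawMiss.Done()
			sawMiss.Wait() // barrier: everyone has checked the cache before anyone refills it
			if hit {
				return
			}
			dbHits.Add(1) // stands in for db.QueryUser
			cache.set("user:123", "profile")
		}()
	}
	done.Wait()
	return dbHits.Load()
}

func main() {
	fmt.Println("DB hits:", simulateStampede(200))
}
```

&lt;p&gt;Every request that saw the miss goes on to query the database, so the hit count equals the number of concurrent requests.&lt;/p&gt;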




&lt;h2&gt;
  
  
  "Okay but 200 concurrent requests on one key, that's rare no?"
&lt;/h2&gt;

&lt;p&gt;In normal traffic, yes.&lt;/p&gt;

&lt;p&gt;In viral traffic? Your hot keys are &lt;em&gt;hot&lt;/em&gt;. That trending product page, that leaderboard endpoint, that "current user" API call that every frontend makes on page load. Under 10x traffic these can easily get hundreds of concurrent hits.&lt;/p&gt;

&lt;p&gt;And the worst part: the more popular your app gets, the worse the stampede. Traffic spike means more concurrent requests means more goroutines all hitting the expired key at the same time means bigger DB explosion. Your success literally causes your failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The naive fixes that don't actually work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Just set a longer TTL"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You're delaying the problem, not solving it. Eventually it expires. Stampede happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Use &lt;code&gt;singleflight&lt;/code&gt;"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is actually a good idea, and &lt;code&gt;golang.org/x/sync/singleflight&lt;/code&gt; is a solid package. It deduplicates concurrent requests &lt;em&gt;within a single process&lt;/em&gt;. So if 50 goroutines on the same pod all want the same key, only 1 actually fetches it.&lt;/p&gt;

&lt;p&gt;But here's the thing. You're probably running multiple pods. You have 10 pods in production, each with their own &lt;code&gt;singleflight&lt;/code&gt; group. Each pod sends 1 request to the DB. That's still 10 simultaneous DB queries. With 50 pods it's 50 queries. &lt;code&gt;singleflight&lt;/code&gt; alone doesn't cross process boundaries.&lt;/p&gt;
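
&lt;p&gt;Here is a rough sketch of that limitation. It does not use the real &lt;code&gt;singleflight&lt;/code&gt; package; the &lt;code&gt;group&lt;/code&gt; type below is a stripped-down, hypothetical version of the same idea built on the standard library, and each simulated "pod" is just a separate &lt;code&gt;group&lt;/code&gt; value. Treat it as an illustration under those assumptions, not as how HerdLock or &lt;code&gt;singleflight&lt;/code&gt; is implemented.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch that later callers can wait on.
type call struct {
	wg  sync.WaitGroup
	val string
}

// group is a stripped-down, hypothetical take on the singleflight idea.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// do runs fn once per key at a time; concurrent callers share the result.
func (g *group) do(key string, fn func() string) string {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = map[string]*call{}
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // someone else is already fetching; wait for their result
		return c.val
	}
	c := new(call)
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val
}

// simulatePods gives every pod its own group and returns the total DB hits.
func simulatePods(pods, perPod int) int64 {
	var dbHits atomic.Int64
	var wg sync.WaitGroup
	for p := 0; p != pods; p++ {
		g := new(group) // one group per process, nothing shared across pods
		for i := 0; i != perPod; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				g.do("user:123", func() string {
					dbHits.Add(1)
					time.Sleep(20 * time.Millisecond) // pretend the query is slow
					return "profile"
				})
			}()
		}
	}
	wg.Wait()
	return dbHits.Load()
}

func main() {
	fmt.Println("DB hits across 10 pods:", simulatePods(10, 50))
}
```

&lt;p&gt;Each pod runs the fetch at least once, so the database can never see fewer queries than there are pods, no matter how well the in-process deduplication works.&lt;/p&gt;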

&lt;p&gt;&lt;strong&gt;"Add a mutex / distributed lock manually"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we're getting somewhere. But this is actually non-trivial to implement correctly. The lock needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be atomic (you can't use GET then SET, there's a race condition between them)&lt;/li&gt;
&lt;li&gt;Release only if &lt;em&gt;you&lt;/em&gt; own it (another process shouldn't release your lock)&lt;/li&gt;
&lt;li&gt;Handle the case where the lock holder crashes mid-fetch&lt;/li&gt;
&lt;li&gt;Do a double-check GET after acquiring (another pod may have already filled the cache while you waited for the lock)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most hand-rolled implementations I've seen miss at least 2 of these. Mine did too, the first time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually needs to happen
&lt;/h2&gt;

&lt;p&gt;The correct flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request comes in for key "user:123"
    │
    ├──► Check local in-memory cache ──► HIT: return immediately (sub-microsecond)
    │
    ├──► Check Redis ──► HIT (fresh): return value
    │                        │
    │                   HIT (stale but within SWR window):
    │                        └──► return stale value immediately
    │                             + trigger background refresh (user sees no delay)
    │
    └──► MISS: enter protection layer
              │
              ├──► In-process singleflight (deduplicate within this pod)
              │
              ├──► Acquire distributed Redis lock
              │         │
              │    Lock taken? ──► wait, retry
              │
              ├──► Double-check Redis (someone else may have filled it)
              │         └──► HIT: release lock, return (no DB query needed)
              │
              └──► Fetch from DB
                        └──► Store in Redis ──► Release lock ──► Return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step here has a reason. Skip one and you either have a stampede, a race condition, or unnecessary DB queries.&lt;/p&gt;




&lt;h2&gt;
  
  
  I got tired of writing this every time
&lt;/h2&gt;

&lt;p&gt;I've worked on a few different backends now and I found myself implementing some version of this pattern in each one. Copy pasting from previous projects, tweaking slightly, introducing new subtle bugs each time.&lt;/p&gt;

&lt;p&gt;So I packaged it properly as an open source Go library: &lt;strong&gt;&lt;a href="https://github.com/harshrajdubey/herdlock-go" rel="noopener noreferrer"&gt;HerdLock&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest usage looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// One time setup&lt;/span&gt;
&lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
&lt;span class="n"&gt;hl&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redisClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Replace your existing cache logic with this&lt;/span&gt;
&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"user:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// your existing DB call, unchanged&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your existing fetch function goes in as-is. HerdLock handles everything around it. The in-process deduplication, the distributed lock, the double-check, the stale serving, all of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The benchmark that made this real for me
&lt;/h2&gt;

&lt;p&gt;I wanted to actually prove this works under load, not just claim it does. So I wrote a benchmark that simulates a database with a connection pool capped at 5 concurrent queries, then fires 100 goroutines at the same expired key simultaneously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Benchmark Case                   | Time per Op      | DB Hits
--------------------------------------------------------------
Coalesced Fetch (HerdLock)       | ~2.3ms  total    |       1
Direct Fetch (No Protection)     | ~31.6ms total    |     100
--------------------------------------------------------------
                                   14x faster        99 DB calls saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DB hits column is what matters here. Without protection, your database gets 100 simultaneous queries. With HerdLock, it gets 1. Under real connection pool constraints, those 99 extra queries queue up and cause exactly the latency spike you see in production during traffic spikes.&lt;/p&gt;

&lt;p&gt;The 14x latency number comes from the queuing. 100 requests divided by 5 connections equals 20 serial batches of queries. HerdLock collapses all of that down to a single query and 99 waiters sharing the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Some things I added that I haven't seen in other libraries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stale-While-Revalidate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serve the old value &lt;em&gt;immediately&lt;/em&gt; while refreshing in the background. Users see zero extra latency. The refresh happens invisibly. This is the same idea as the HTTP &lt;code&gt;stale-while-revalidate&lt;/code&gt; cache directive and the service worker caching strategy of the same name, and it works beautifully for API responses too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;hl&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithStaleWhileRevalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;XFetch — probabilistic early expiry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one is based on an actual research paper (Vattani, Chierichetti, Lowenstein 2015). Instead of waiting for the TTL cliff (say, at t=60s), XFetch probabilistically starts refreshing keys &lt;em&gt;before&lt;/em&gt; they expire. The math:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;refresh early if:  now + (delta x beta x -ln(random)) &amp;gt;= expiresAt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;delta&lt;/code&gt; is how long your fetch function actually takes and &lt;code&gt;random&lt;/code&gt; is a uniform draw between 0 and 1, so &lt;code&gt;-ln(random)&lt;/code&gt; is never negative. Slower fetches mean earlier refreshes. The result is no more expiry cliff. Keys get quietly refreshed before they expire and users never see a miss. Higher &lt;code&gt;beta&lt;/code&gt; means more aggressive early refresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jitter strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you cache 10,000 keys at startup all with TTL=60s, they all expire at t=60s. Mega stampede. Adding random jitter to TTLs spreads them out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithJitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JitterEqual&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;herdlock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithJitterMax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="c"&gt;// TTLs now vary ±5s around your set value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Circuit breaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If Redis itself starts failing, you don't want HerdLock to make things worse by retrying locks in a tight loop. The circuit breaker detects consecutive failures and automatically bypasses cache entirely, serving requests directly from DB until Redis recovers. Degraded mode instead of full outage.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I chose to NOT include in v1
&lt;/h2&gt;

&lt;p&gt;I made a deliberate call to keep HerdLock as a library, not a daemon or sidecar. Some distributed lock libraries want you to run a separate process. HerdLock just needs your existing Redis client, whatever you're already using. No extra infrastructure.&lt;/p&gt;

&lt;p&gt;Also kept the dependency count low. The only non-standard dependencies are &lt;code&gt;go-redis/v9&lt;/code&gt; (which you likely already have) and &lt;code&gt;hashicorp/golang-lru/v2&lt;/code&gt; for the local cache. That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  When you should NOT use HerdLock
&lt;/h2&gt;

&lt;p&gt;Being honest here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single instance apps&lt;/strong&gt;: &lt;code&gt;singleflight&lt;/code&gt; alone is sufficient, HerdLock is overkill&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-idempotent fetch functions&lt;/strong&gt;: HerdLock cannot guarantee exactly-once execution. If your fetch function charges a card or sends an email, that's a different problem entirely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-key atomic fetches&lt;/strong&gt;: not supported in v1&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The part where I ask for feedback
&lt;/h2&gt;

&lt;p&gt;I'm genuinely curious: how are you all handling this in your current projects? I've talked to a few people and the answers vary wildly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some folks have this fully solved with custom middleware&lt;/li&gt;
&lt;li&gt;Some have a partial solution that handles the single-process case but not multi-pod&lt;/li&gt;
&lt;li&gt;Some are just not handling it and hoping for the best (no judgment, I was there too)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the bigger question I keep thinking about: &lt;strong&gt;at what point does it make sense to use a library for this vs. rolling your own singleflight + Redis lock?&lt;/strong&gt; There's a real argument for owning the implementation. You understand exactly what it does, no external dependency to audit. Where's your line?&lt;/p&gt;

&lt;p&gt;Drop a comment, would love to know.&lt;/p&gt;

&lt;p&gt;If HerdLock solves something you've been manually patching, a star on GitHub helps more than you'd think for a new OSS project: &lt;a href="https://github.com/harshrajdubey/herdlock-go" rel="noopener noreferrer"&gt;github.com/harshrajdubey/herdlock-go&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>redis</category>
      <category>distributedsystems</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Android TV Is Not Just Big-Screen Android</title>
      <dc:creator>Harsh Raj Dubey</dc:creator>
      <pubDate>Fri, 22 May 2026 19:58:49 +0000</pubDate>
      <link>https://dev.to/harshrajdubey/android-tv-is-not-just-big-screen-android-a9a</link>
      <guid>https://dev.to/harshrajdubey/android-tv-is-not-just-big-screen-android-a9a</guid>
      <description>&lt;p&gt;&lt;em&gt;What I learned building a browser for Android TV and why everything I assumed was wrong.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I started working on a browser for Android TV, I thought: &lt;em&gt;How different can it be? It runs Android. We have WebView. We know web tech.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That assumption aged poorly.&lt;/p&gt;

&lt;p&gt;Android TV development is not just Android development on a bigger screen. It's a fundamentally different platform with different input models, different hardware realities, and a fragmentation problem that makes the regular Android ecosystem look tame. Here's what I ran into and what I wish someone had warned me about.&lt;/p&gt;




&lt;h2&gt;
  
  
  D-pad Focus Is Harder Than Touch UX
&lt;/h2&gt;

&lt;p&gt;Normal Android apps assume touch. Gestures. Scrolling. Users tap what they want, swipe to explore, and pinch to zoom. The system knows exactly where the user's finger is pointing.&lt;/p&gt;

&lt;p&gt;TV UX works on none of those assumptions.&lt;/p&gt;

&lt;p&gt;With a remote control, you navigate with four directional buttons. That's it. There is no cursor (usually). There is no hover. There is no "tap anywhere." Every interaction is routed through focus states and a movement graph that the developer (not the user) defines.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manually managing focus order&lt;/strong&gt;, because Android's default focus traversal makes sense for form fields, not arbitrary UI layouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preventing focus traps&lt;/strong&gt;, where the user presses right infinitely and nothing happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling invisible focus states&lt;/strong&gt;, where the focused element has no visible ring and the user has no idea where they are&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Making every interactive element reachable via remote only&lt;/strong&gt;, with no fallback to "just tap it"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The movement graph has to be predictable. Users on TV develop a mental model of where focus will go when they press a direction. Break that model once, and the experience feels broken forever.&lt;/p&gt;




&lt;h2&gt;
  
  
  Websites Are Not Designed for 10-Foot Viewing
&lt;/h2&gt;

&lt;p&gt;This sounds obvious until you try it.&lt;/p&gt;

&lt;p&gt;A site that works fine on desktop (readable, usable, functional) can become genuinely painful on TV because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text that looks normal at arm's length becomes tiny at 10 feet&lt;/li&gt;
&lt;li&gt;Hover interactions (dropdown menus, tooltips, navigation reveals) simply don't exist on TV&lt;/li&gt;
&lt;li&gt;Menus designed for mouse precision require pixel-perfect targeting that a D-pad can't provide&lt;/li&gt;
&lt;li&gt;Dialogs that are carefully sized for 1080p monitors overflow on TV viewports with different scaling&lt;/li&gt;
&lt;li&gt;Spacing that feels generous on a monitor feels cramped when you're looking at it across a room&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make arbitrary web content usable, we had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Increase clickable areas&lt;/strong&gt; well beyond what the original site intended&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Force zoom and scaling&lt;/strong&gt; to make text legible at distance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Override viewport behavior&lt;/strong&gt; to prevent sites from making layout decisions we didn't want&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inject CSS and JS fixes&lt;/strong&gt; as a layer between the user and the original content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're basically shimming bad assumptions at runtime. It's messy, but it's necessary.&lt;/p&gt;




&lt;h2&gt;
  
  
  WebView Fragmentation Is Brutal on TVs
&lt;/h2&gt;

&lt;p&gt;On phones, Android System WebView is updated regularly through the Play Store. Not so on TVs.&lt;/p&gt;

&lt;p&gt;TV vendors rarely push:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Android System WebView updates&lt;/li&gt;
&lt;li&gt;Chromium engine updates&lt;/li&gt;
&lt;li&gt;Security patches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What that means in practice: &lt;strong&gt;Android version became a meaningless signal.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two TVs, both reporting Android 12. One has Chromium 66. The other has Chromium 102. That's a &lt;em&gt;four-year&lt;/em&gt; gap in browser engine capability. Both TVs will pass any OS version check you write. They will not behave the same.&lt;/p&gt;

&lt;p&gt;The consequence is a class of bugs that are genuinely hard to reproduce and reason about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing JavaScript APIs that the spec added years ago&lt;/li&gt;
&lt;li&gt;CSS features that silently fail or render incorrectly&lt;/li&gt;
&lt;li&gt;Video playback inconsistencies in how codecs are handled&lt;/li&gt;
&lt;li&gt;Modern frameworks that partially work, just enough to be confusing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We learned to &lt;strong&gt;detect capabilities, not versions&lt;/strong&gt;. Don't ask "is this Android 11?" Ask "does this device support this specific API?" It's more work upfront, but it's the only thing that gives you accurate information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fake Android Versions from OEMs
&lt;/h2&gt;

&lt;p&gt;Related, but worse.&lt;/p&gt;

&lt;p&gt;Cheap and regional OEM TVs often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spoof Android version strings entirely&lt;/li&gt;
&lt;li&gt;Heavily customize firmware in ways that break standard behaviors&lt;/li&gt;
&lt;li&gt;Remove Google components (no Play Services, no certified WebView)&lt;/li&gt;
&lt;li&gt;Ship uncertified builds that passed no compatibility testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you'd see a device claiming Android 13 that behaved like a heavily stripped Android 9. Or a "certified Android TV" that was actually a modified AOSP box with a launcher slapped on.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;capability detection&lt;/strong&gt; stopped being a best practice and became a survival strategy. You simply cannot trust what the device tells you about itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pointer Simulation Is Deceptive
&lt;/h2&gt;

&lt;p&gt;Some TVs support a simulated cursor, a virtual pointer you can move around with the remote, mimicking mouse behavior. This sounds like it solves the D-pad problem. It doesn't.&lt;/p&gt;

&lt;p&gt;The issue is that TVs don't have real pointer semantics. The cursor is a visual overlay, not an actual input device the OS understands as a pointer. That creates a cascade of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Focus-based and touch-based systems conflict&lt;/strong&gt; with each other when both exist simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinate mapping becomes inconsistent&lt;/strong&gt;: what does "cursor position" mean if there's no real pointer device?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hitbox mismatch&lt;/strong&gt;: the visual cursor appears over a button, but the actual registered click coordinate is offset, often by however much the DPI or scaling is miscalculated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most concrete example: we had a bug where we treated the center of the cursor image as the click point. Seemed reasonable. It was wrong. The actual registered input coordinate was different, and elements that appeared to be under the cursor weren't being activated. Tracking that down was not fun.&lt;/p&gt;

&lt;p&gt;DPI inconsistencies and density miscalculations made it worse. A UI that looked correct on one TV would have systematically shifted hitboxes on another.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Acceleration Inconsistencies
&lt;/h2&gt;

&lt;p&gt;TVs have GPUs, but the drivers for those GPUs on cheap hardware are often poor.&lt;/p&gt;

&lt;p&gt;We saw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Animations that dropped frames dramatically even when the content was simple&lt;/li&gt;
&lt;li&gt;WebView rendering that lagged visibly on transitions&lt;/li&gt;
&lt;li&gt;Video overlays that conflicted with composited UI layers&lt;/li&gt;
&lt;li&gt;Hardware acceleration that caused more problems than it solved on specific chipsets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workaround was often the opposite of what you'd do on a performance-focused mobile app: &lt;strong&gt;disable effects, reduce transparency, simplify rendering, lower repaint frequency.&lt;/strong&gt; You're optimizing for correctness over aesthetics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overscan Still Exists
&lt;/h2&gt;

&lt;p&gt;Overscan is a legacy TV behavior where the display crops the edges of the image slightly, a leftover from CRT broadcasting. You'd think it's gone by now.&lt;/p&gt;

&lt;p&gt;It's not.&lt;/p&gt;

&lt;p&gt;A UI that fits perfectly in the Android emulator or on your test monitor can, on a real TV:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clip the edges of buttons so they're partially off-screen&lt;/li&gt;
&lt;li&gt;Hide navigation elements&lt;/li&gt;
&lt;li&gt;Cut subtitles or action labels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TV-safe margins&lt;/strong&gt; (keeping all meaningful content away from the outer 5-10% of the screen) aren't just a guideline. They're a requirement if you want to ship something that works everywhere.&lt;/p&gt;
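
&lt;p&gt;A TV-safe rectangle is easy to compute once you pick a margin. The sketch below is plain Java and assumes the margin is applied per edge as a fraction of the screen dimension (0.05 for 5%). The names &lt;code&gt;TvSafeArea&lt;/code&gt;, &lt;code&gt;safeRect&lt;/code&gt;, and &lt;code&gt;fitsInside&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```java
/** Sketch: computes an overscan-safe rectangle and checks elements against it. */
final class TvSafeArea {
    private TvSafeArea() {}

    /**
     * marginFraction is applied to every edge (assumption: 0.05 means 5% per edge).
     * Returns {left, top, right, bottom} of the safe rectangle, in pixels.
     */
    static int[] safeRect(int screenWidthPx, int screenHeightPx, double marginFraction) {
        int dx = (int) Math.round(screenWidthPx * marginFraction);
        int dy = (int) Math.round(screenHeightPx * marginFraction);
        return new int[] {dx, dy, screenWidthPx - dx, screenHeightPx - dy};
    }

    /** True when the element's bounds lie entirely inside the safe rectangle. */
    static boolean fitsInside(int[] safe, int left, int top, int right, int bottom) {
        boolean outside = safe[0] > left || safe[1] > top
                || right > safe[2] || bottom > safe[3];
        return !outside;
    }
}
```

&lt;p&gt;On a 1920x1080 screen with a 5% margin this gives a safe rectangle from (96, 54) to (1824, 1026). A check like &lt;code&gt;fitsInside&lt;/code&gt; can run in a debug build to flag buttons or labels that stray into the overscan zone.&lt;/p&gt;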




&lt;h2&gt;
  
  
  Remote Latency Matters More Than Expected
&lt;/h2&gt;

&lt;p&gt;On mobile, a 100ms UI response feels fine. Acceptable. Maybe even snappy.&lt;/p&gt;

&lt;p&gt;On TV, with a D-pad, a 100ms delay on focus movement feels terrible. The navigation feels sluggish and unresponsive, even though the same delay would go unnoticed on other platforms.&lt;/p&gt;

&lt;p&gt;This is partly perceptual. A TV is used from a couch, passively, and the threshold for "this feels broken" is lower. But it's also because D-pad navigation is sequential and modal. You press right, wait for focus to move, then press again. If each press introduces latency, the delays compound.&lt;/p&gt;

&lt;p&gt;Focus movement needs to feel &lt;strong&gt;instant&lt;/strong&gt;. Not fast. Instant.&lt;/p&gt;




&lt;h2&gt;
  
  
  APK Behavior Varies Wildly by Manufacturer
&lt;/h2&gt;

&lt;p&gt;Beyond WebView, the underlying APK behavior differed meaningfully across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keycodes&lt;/strong&gt;: what keycode does "back" send? Depends on the manufacturer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launcher behavior&lt;/strong&gt;: how does the system handle app lifecycle when the user goes home?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background restrictions&lt;/strong&gt;: some TVs killed background processes aggressively; others didn't&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fullscreen APIs&lt;/strong&gt;: &lt;code&gt;WindowInsets&lt;/code&gt;, &lt;code&gt;systemUiVisibility&lt;/code&gt;, and the behavior of immersive mode all differed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions&lt;/strong&gt;: some TVs prompted for permissions differently or blocked them silently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoplay support&lt;/strong&gt;: video autoplay policies varied dramatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was especially pronounced on Mi TV, Realme TV, generic AOSP TV boxes, and uncertified Android TVs from regional manufacturers. Each had its own quirks, and none of them were documented.&lt;/p&gt;
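
&lt;p&gt;One way to cope with the keycode differences is to normalize them in a single place before the rest of the app sees them. The sketch below is plain Java. The three constants mirror the values of the matching &lt;code&gt;android.view.KeyEvent&lt;/code&gt; fields, but the alias list is purely hypothetical: which codes a given remote actually sends for "back" has to come from testing on the real device.&lt;/p&gt;

```java
/** Sketch: folds manufacturer-specific "back" keycodes into one canonical code. */
final class BackKeyNormalizer {
    private BackKeyNormalizer() {}

    // Same numeric values as the android.view.KeyEvent constants of the same name.
    static final int KEYCODE_BACK = 4;
    static final int KEYCODE_BUTTON_B = 97;
    static final int KEYCODE_ESCAPE = 111;

    // Hypothetical alias list; replace with codes observed on real devices.
    private static final int[] BACK_ALIASES = {KEYCODE_ESCAPE, KEYCODE_BUTTON_B};

    /** Returns KEYCODE_BACK for any known alias, otherwise the code unchanged. */
    static int normalize(int keyCode) {
        for (int alias : BACK_ALIASES) {
            if (alias == keyCode) {
                return KEYCODE_BACK;
            }
        }
        return keyCode;
    }
}
```

&lt;p&gt;In an Android app, a function like this could be called at the top of the key dispatch path so that everything downstream only has to handle one "back" code.&lt;/p&gt;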




&lt;h2&gt;
  
  
  Debugging Is Painful
&lt;/h2&gt;

&lt;p&gt;Many TVs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack proper developer tools&lt;/li&gt;
&lt;li&gt;Have ADB support that's broken, disabled, or intermittent&lt;/li&gt;
&lt;li&gt;Disconnect randomly during sessions&lt;/li&gt;
&lt;li&gt;Hide or truncate logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes iteration significantly slower than standard Android development. The feedback loop is longer, crashes are harder to inspect, and reproducing bugs in a controlled way often requires having the exact physical device in hand.&lt;/p&gt;

&lt;p&gt;We developed a habit of &lt;strong&gt;logging aggressively to in-app overlays&lt;/strong&gt;: visible debug panels that showed state, errors, and event sequences without relying on ADB. Inelegant, but effective.&lt;/p&gt;
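
&lt;p&gt;The core of such an overlay can be as small as a fixed-size buffer that keeps the most recent lines. This is a minimal sketch in plain Java, not our actual implementation. &lt;code&gt;OverlayLog&lt;/code&gt; and its methods are hypothetical names, and the sketch is not thread-safe.&lt;/p&gt;

```java
/** Sketch: keeps the newest log lines in a ring buffer for an on-screen debug panel. */
final class OverlayLog {
    private final String[] lines;
    private int next = 0;   // slot that the next line will overwrite
    private int count = 0;  // number of stored lines, capped at capacity

    OverlayLog(int capacity) {
        if (1 > capacity) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        lines = new String[capacity];
    }

    /** Stores a line, dropping the oldest one once the buffer is full. */
    void add(String line) {
        lines[next] = line;
        next = (next + 1) % lines.length;
        count = Math.min(count + 1, lines.length);
    }

    /** Returns the stored lines oldest-first, separated by newlines. */
    String render() {
        StringBuilder sb = new StringBuilder();
        int start = (next - count + lines.length) % lines.length;
        for (int i = 0; i != count; i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(lines[(start + i) % lines.length]);
        }
        return sb.toString();
    }
}
```

&lt;p&gt;The string from &lt;code&gt;render()&lt;/code&gt; can be pushed into any text view drawn on top of the app. Capping the buffer keeps the overlay from adding memory pressure on the same cheap devices you are trying to debug.&lt;/p&gt;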




&lt;h2&gt;
  
  
  Memory Limitations on Cheap Hardware
&lt;/h2&gt;

&lt;p&gt;Premium TVs have reasonable RAM. Cheap TVs, which are often the majority of units in certain markets, have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-2GB RAM, sometimes less&lt;/li&gt;
&lt;li&gt;Slow eMMC storage&lt;/li&gt;
&lt;li&gt;Weak CPUs with thermal throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Heavy web apps that run fine on mid-range phones become noticeably sluggish on these devices. Memory pressure causes WebView to drop cached resources. Page loads take longer. Complex layouts trigger more GC pauses.&lt;/p&gt;

&lt;p&gt;This pushed us toward &lt;strong&gt;lighter rendering, aggressive caching, and simplified page structures&lt;/strong&gt; for TV-specific views.&lt;/p&gt;
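
&lt;p&gt;One way to act on this is to pick a rendering profile once at startup based on what the device reports. The sketch below is plain Java with made-up names and thresholds. The 2048MB cutoff simply mirrors the 1-2GB range mentioned above, and the repaint limits are illustrative rather than measured. On Android, the total RAM figure could come from &lt;code&gt;ActivityManager.MemoryInfo.totalMem&lt;/code&gt;.&lt;/p&gt;

```java
/** Sketch: hypothetical rendering settings chosen from device capability. */
final class RenderProfile {
    // Assumed cutoff: devices at or below this much RAM get the lite profile.
    static final long LOW_RAM_LIMIT_MB = 2048;

    static final RenderProfile FULL = new RenderProfile(true, true, 60);
    static final RenderProfile LITE = new RenderProfile(false, false, 30);

    final boolean animations;
    final boolean transparency;
    final int maxRepaintsPerSecond;

    private RenderProfile(boolean animations, boolean transparency, int maxRepaintsPerSecond) {
        this.animations = animations;
        this.transparency = transparency;
        this.maxRepaintsPerSecond = maxRepaintsPerSecond;
    }

    /**
     * totalRamMb: total device memory in megabytes.
     * gpuKnownProblematic: true when the chipset is on your own list of bad GPU drivers.
     */
    static RenderProfile forDevice(long totalRamMb, boolean gpuKnownProblematic) {
        if (gpuKnownProblematic || LOW_RAM_LIMIT_MB >= totalRamMb) {
            return LITE;
        }
        return FULL;
    }
}
```

&lt;p&gt;Deciding once and passing the profile down keeps the "is this a weak device?" question out of individual views.&lt;/p&gt;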




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;After all of this, the biggest architectural realization was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The browser wasn't the problem. The assumptions browsers make about input methods and responsive behavior were.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Browsers are built for a world of mouse pointers, touch surfaces, and responsive viewports. That world doesn't exist on TV. The moment you try to present web content on a TV, you're in a gap between two systems that weren't designed to meet.&lt;/p&gt;

&lt;p&gt;Building a good TV browser isn't about shipping Chromium on a bigger screen. It's about &lt;strong&gt;mediating between the web's assumptions and the TV's reality&lt;/strong&gt;, at every layer of the stack, from input handling to rendering to hardware capability detection.&lt;/p&gt;

&lt;p&gt;It's more work than it looks. But it's genuinely interesting work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've shipped something for Android TV and run into similar (or completely different) problems, I'd love to hear about it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>android</category>
      <category>softwaredevelopment</category>
      <category>ui</category>
      <category>ux</category>
    </item>
  </channel>
</rss>
