The Art of Distributed Locking: Implementing Redlock in Production

#redis #caching #systemdesign #programming

So here's a fun story. I deployed a feature on a Friday evening (I know, I know—rookie mistake), grabbed dinner, and crashed. Saturday morning, I'm scrolling through my phone with coffee, and Sentry is absolutely screaming at me. Red everywhere. Database alerts going nuclear.

I'm thinking, "Wait, we have caching. What the hell?"

Turns out? Cache stampede. And boy, did it hit hard.

Here's the thing about cache stampede—it's sneaky. Your cache expires, and suddenly every request that was waiting decides to regenerate that cache entry at the exact same time. It's like a thousand threads all going "I'll fix it!" and then immediately DDoS'ing your own database. Not fun.

Picking Your Poison: Caching Strategies

Before we dive into fixes, let's talk strategy. There are three main approaches people use, and each has its own pros and cons. Which one fits depends on your use case and your architecture.

I like to think about it with the librarian metaphor:

Cache-Aside — The "I'll deal with it later" approach. Someone asks for a book? Fine, I'll walk down to the dusty basement (database), grab it, and keep it on my desk for next time.

Good stuff: Dead simple. Doesn't waste space on books nobody wants.
The catch: That first person? Yeah, they're waiting while you trek to the basement.
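
If you'd rather see the librarian in code, here's a minimal sketch of cache-aside in Go. It assumes an in-memory map standing in for Redis and a `load` callback standing in for the database; `CacheAside`, `NewCacheAside`, and `Loads` are names I made up for illustration.

```go
package main

import "sync"

// CacheAside is a hypothetical in-memory stand-in: the map plays the
// librarian's desk (cache) and load plays the trip to the basement (database).
type CacheAside struct {
	mu    sync.Mutex
	cache map[string]string
	load  func(key string) (string, error)
	Loads int // how many times we actually went to the "database"
}

func NewCacheAside(load func(key string) (string, error)) *CacheAside {
	return &CacheAside{cache: make(map[string]string), load: load}
}

// Get serves from the cache when it can and fills the cache on a miss.
func (c *CacheAside) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if v, ok := c.cache[key]; ok {
		return v, nil // cache hit
	}
	v, err := c.load(key) // cache miss: the slow path
	if err != nil {
		return "", err
	}
	c.Loads++
	c.cache[key] = v
	return v, nil
}
```

Calling `Get("dune")` twice only triggers one load. Note that this toy holds a mutex while loading, which quietly serializes callers. A real Redis-backed cache has no such guard, and that gap is where stampedes come from.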

Write-Through — The obsessive organizer. Every new book gets logged twice—one on the desk, one in the basement. No exceptions.

Good stuff: Your desk is always current. No surprises.
The catch: Every write is slower, because it has to land in both the cache and the database before it's done.
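
A rough sketch of write-through, again with plain maps as stand-ins. `WriteThrough`, `NewWriteThrough`, and the `FailDB` switch are hypothetical names for illustration, not anything from a real library.

```go
package main

import "errors"

// WriteThrough is a hypothetical sketch: two maps stand in for the cache
// (desk) and the database (basement). FailDB simulates a database outage.
type WriteThrough struct {
	Cache  map[string]string
	DB     map[string]string
	FailDB bool
}

func NewWriteThrough() *WriteThrough {
	return &WriteThrough{Cache: map[string]string{}, DB: map[string]string{}}
}

// Put writes to the database first and only then updates the cache, so the
// cache never holds a value the database rejected.
func (w *WriteThrough) Put(key, value string) error {
	if w.FailDB {
		return errors.New("database unavailable")
	}
	w.DB[key] = value
	w.Cache[key] = value
	return nil
}
```

Both writes happen inside the same call, which is where the extra latency comes from. This sketch is not safe for concurrent use; it only shows the ordering.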

Write-Behind — My personal favorite when I'm feeling risky. Stack books on the desk all day, then at 5 PM, haul everything to the basement in one trip.

Good stuff: Stupid fast writes. Users love it.
The catch: If your desk catches fire before 5 PM (server crashes), you're toast. Hope you like data loss.
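
And a sketch of write-behind, under the same assumptions (maps instead of Redis and a database, single-threaded use). `WriteBehind`, `Pending`, and `Flush` are illustrative names I picked.

```go
package main

// WriteBehind is a hypothetical sketch: writes land in the cache right away
// and are only copied to the database when Flush runs (the "5 PM trip").
type WriteBehind struct {
	Cache map[string]string
	DB    map[string]string
	dirty map[string]bool // keys written since the last flush
}

func NewWriteBehind() *WriteBehind {
	return &WriteBehind{
		Cache: map[string]string{},
		DB:    map[string]string{},
		dirty: map[string]bool{},
	}
}

// Put only touches the cache, which is why it is fast.
func (w *WriteBehind) Put(key, value string) {
	w.Cache[key] = value
	w.dirty[key] = true
}

// Pending reports how many writes would be lost if the process died now.
func (w *WriteBehind) Pending() int { return len(w.dirty) }

// Flush copies every dirty key to the database in one batch.
func (w *WriteBehind) Flush() {
	for key := range w.dirty {
		w.DB[key] = w.Cache[key]
		delete(w.dirty, key)
	}
}
```

Anything counted by `Pending()` is exactly the data you lose if the desk catches fire before `Flush()` runs.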

What Actually Happens During a Cache Stampede

Okay, back to my Saturday morning disaster.

You know those sneaker drops? Picture the Travis Scott Jordan collab. Nike's got 1,000 pairs. There's 10,000 people outside losing their minds. The store opens, and everyone rushes the counter simultaneously. Security? Overwhelmed. Counter staff? Crying. Total chaos.

That's cache stampede.

Your cache key is the locked store doors. Your database is the overwhelmed counter staff. The second that cache expires, every single waiting request bulldozes through at once. No line, no rate limiting, just pure pandemonium hitting your poor database.
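
To make the stampede concrete, here's a small simulation in Go. It's a sketch, not my production code: a `sync.Map` plays the cache, a counter plays the database, and `simulateStampede` is a made-up helper that forces every request to see the miss before anyone refills the cache (which is roughly what happens when a hot key expires under load).

```go
package main

import (
	"sync"
	"sync/atomic"
)

// simulateStampede starts n concurrent requests against an empty cache and
// returns how many of them ended up querying the "database".
func simulateStampede(n int) int64 {
	var (
		cache   sync.Map
		dbHits  int64
		start   = make(chan struct{})
		checked sync.WaitGroup // released once every request has seen the miss
		done    sync.WaitGroup
	)
	checked.Add(n)
	done.Add(n)

	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			<-start // the store doors open for everyone at once

			_, hit := cache.Load("user:profile:42")
			checked.Done()
			checked.Wait() // everyone has checked before anyone refills

			if !hit {
				atomic.AddInt64(&dbHits, 1) // stand-in for an expensive query
				cache.Store("user:profile:42", "profile")
			}
		}()
	}

	close(start)
	done.Wait()
	return atomic.LoadInt64(&dbHits)
}
```

With no coordination, `simulateStampede(50)` reports 50 database hits for a value that only needed to be computed once.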

And that's exactly what I woke up to that Saturday. Fun times.

Now comes the hero of our story: Distributed Locking with Redlock

So What's Redlock Actually Doing?

Think of it like a ticket system at a busy store.

Going back to our sneaker drop chaos—imagine if Nike smartened up. Instead of letting everyone mob the counter, they hand out numbered tickets at the door. "We've got 1,000 pairs, here's your ticket, we'll call you when it's your turn."

Now people can grab coffee, browse around, whatever. The store staff aren't getting trampled. When your number's called, you walk up calmly and complete your purchase. Once all 1,000 tickets are gone, anyone else showing up just gets told "Sorry, sold out"—no point in hanging around.

That's essentially what distributed locking does for your cache regeneration. Only one request gets the "ticket" (lock) to rebuild the cache. Everyone else? They either wait for that first request to finish, or they get served stale data while the rebuild happens in the background. No stampede, no database meltdown.

Our system works the same way. When the cache expires, the first request announces, "I'm going to the DB! Here is my ID," and grabs a lock (a token) from Redis. Every other request sees that someone already holds the token, so instead of rushing the DB it waits a few milliseconds and checks the cache again.
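
Here's that "one ticket" idea as an in-process sketch. It's a model, not the real thing: an atomic flag stands in for the Redis lock key, a map stands in for the cache, and `guardedCache`, `fetch`, and `rebuildsFor` are hypothetical names. A real lock also needs an expiry and a retry limit, which this toy skips.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// guardedCache is a hypothetical in-process model: data plays the cache,
// locked plays the lock key, and dbHits counts expensive rebuilds.
type guardedCache struct {
	dbHits int64
	locked int32
	mu     sync.RWMutex
	data   map[string]string
}

func (g *guardedCache) lookup(key string) (string, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	v, ok := g.data[key]
	return v, ok
}

// fetch returns the cached value, rebuilding it under the lock if needed.
func (g *guardedCache) fetch(key string) string {
	for {
		if v, ok := g.lookup(key); ok {
			return v
		}
		// Try to take the ticket; only one caller can flip 0 -> 1.
		if !atomic.CompareAndSwapInt32(&g.locked, 0, 1) {
			time.Sleep(time.Millisecond) // someone else is rebuilding
			continue
		}
		v, ok := g.lookup(key) // double-check after winning the lock
		if !ok {
			atomic.AddInt64(&g.dbHits, 1)
			time.Sleep(5 * time.Millisecond) // pretend this is a slow query
			v = "profile for " + key
			g.mu.Lock()
			g.data[key] = v
			g.mu.Unlock()
		}
		atomic.StoreInt32(&g.locked, 0) // hand the ticket back
		return v
	}
}

// rebuildsFor fires n concurrent requests at an empty cache and returns how
// many of them reached the "database".
func rebuildsFor(n int) int64 {
	g := &guardedCache{data: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.fetch("user:profile:42")
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&g.dbHits)
}
```

However many requests pile in, only the ticket holder touches the database; `rebuildsFor(50)` comes back with a single rebuild.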

Wait, But What Makes Redlock Special?

Here's where I almost screwed up again. My first implementation just used a single Redis instance for locking. Worked great for about three weeks.

Then our Redis instance hiccuped. Not even a full crash—just a brief restart during a deployment. And guess what happened? All my fancy locks disappeared. Cache stampede part two, electric boogaloo.

So here's the thing about regular Redis locking: it's a single point of failure. If that one Redis node goes down, your entire locking mechanism vanishes. Every request suddenly thinks "oh, there's no lock, I'll go hit the database!" And we're back to Saturday morning hell.

Enter Redlock. Salvatore Sanfilippo (the guy who created Redis) came up with this algorithm specifically to solve that problem. Instead of trusting just one Redis instance with your locks, Redlock spreads the responsibility across multiple independent Redis nodes—usually 5.

Here's how it works: When you want to acquire a lock, you don't just ask one Redis node. You ask all 5 nodes, "Hey, can I get a lock on this key?" and every lock you're granted comes with an expiry. If you get a "yes" from the majority (at least 3 out of 5) before that expiry runs out, you win the lock. If one Redis node crashes or gets disconnected? No big deal. You still have 4 others, and you only needed 3 to agree anyway.

It's like needing 3 out of 5 signatures to authorize a bank transaction. One person's on vacation? Doesn't matter. Two people? Okay, now we have a problem, but at that point, you've got bigger issues than cache stampedes.
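
The majority rule is easier to trust once you've seen it run, so here's a simplified simulation. Everything in it is an assumption for illustration: `lockNode` is an in-memory stand-in for one Redis node, `trySet` imitates a set-if-absent with a TTL, and `defaultTTL` is an arbitrary value. The real algorithm also uses a random value per lock and accounts for clock drift, which I've left out.

```go
package main

import "time"

const defaultTTL = 8 * time.Second // arbitrary lock lifetime for the sketch

// lockNode is a hypothetical in-memory stand-in for one independent Redis node.
type lockNode struct {
	up    bool
	locks map[string]lockEntry
}

type lockEntry struct {
	owner   string
	expires time.Time
}

func newNodes(n int) []*lockNode {
	nodes := make([]*lockNode, n)
	for i := range nodes {
		nodes[i] = &lockNode{up: true, locks: make(map[string]lockEntry)}
	}
	return nodes
}

// trySet succeeds only when the node is reachable and no unexpired lock
// exists for the key (set-if-absent with a TTL).
func (n *lockNode) trySet(key, owner string, ttl time.Duration) bool {
	if !n.up {
		return false
	}
	if e, ok := n.locks[key]; ok && time.Now().Before(e.expires) {
		return false
	}
	n.locks[key] = lockEntry{owner: owner, expires: time.Now().Add(ttl)}
	return true
}

// release deletes the lock only if this owner still holds it.
func (n *lockNode) release(key, owner string) {
	if e, ok := n.locks[key]; n.up && ok && e.owner == owner {
		delete(n.locks, key)
	}
}

// acquire asks every node and wins only with a majority obtained before the
// TTL has run out. On failure it releases whatever it did get.
func acquire(nodes []*lockNode, key, owner string, ttl time.Duration) bool {
	start := time.Now()
	granted := 0
	for _, n := range nodes {
		if n.trySet(key, owner, ttl) {
			granted++
		}
	}
	quorum := len(nodes)/2 + 1
	if granted >= quorum && time.Since(start) < ttl {
		return true
	}
	for _, n := range nodes {
		n.release(key, owner)
	}
	return false
}
```

With one of five nodes down, the first caller still wins with 4 votes and a second caller is refused. With three nodes down, nobody can reach 3 votes, and the failed attempt cleans up after itself.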

The catch? You need to run 5 independent Redis instances. More infrastructure, more complexity. But honestly? After that second incident, I stopped complaining about the extra nodes.

Show Me the Code

Alright, enough theory. Here's how I actually implemented this using Go's redsync library:

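package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/go-redsync/redsync/v4"
    redsyncredis "github.com/go-redsync/redsync/v4/redis"
    "github.com/go-redsync/redsync/v4/redis/goredis/v9"
    goredislib "github.com/redis/go-redis/v9"
)

// UserProfile and ProfileStore are placeholders for your own types.
type UserProfile struct {
    ID   string `json:"id"`
    Name string `json:"name"`
}

type ProfileStore interface {
    GetUserProfile(userID string) (*UserProfile, error)
}

var (
    redisClient *goredislib.Client // cache
    redSync     *redsync.Redsync   // Redlock across all lock nodes
    db          ProfileStore       // set this to your database layer
)

// initRedis connects the cache client and builds one pool per lock node.
func initRedis(cacheAddr string, lockAddrs []string) {
    redisClient = goredislib.NewClient(&goredislib.Options{Addr: cacheAddr})
    pools := make([]redsyncredis.Pool, 0, len(lockAddrs))
    for _, addr := range lockAddrs {
        client := goredislib.NewClient(&goredislib.Options{Addr: addr})
        pools = append(pools, goredis.NewPool(client))
    }
    redSync = redsync.New(pools...)
}

func deserializeProfile(raw string) (*UserProfile, error) {
    var p UserProfile
    if err := json.Unmarshal([]byte(raw), &p); err != nil {
        return nil, err
    }
    return &p, nil
}

func GetUserProfile(ctx context.Context, userID string) (*UserProfile, error) {
    cacheKey := fmt.Sprintf("user:profile:%s", userID)

    // Try getting from cache first
    if cached, err := redisClient.Get(ctx, cacheKey).Result(); err == nil {
        // Cache hit! We're good
        return deserializeProfile(cached)
    }

    // Cache miss - we need to hit the DB
    // But first, let's get a lock so we don't all do it at once
    lockKey := fmt.Sprintf("lock:%s", cacheKey)
    mutex := redSync.NewMutex(lockKey,
        redsync.WithExpiry(8*time.Second), // must outlast the DB query below
        redsync.WithTries(1),              // Don't retry, fail fast
    )

    // Try to acquire the lock
    if err := mutex.LockContext(ctx); err != nil {
        // Most likely someone else got the lock, let them do the work
        // We'll just wait a bit and check cache again
        time.Sleep(100 * time.Millisecond)
        cached, err := redisClient.Get(ctx, cacheKey).Result()
        if err == nil {
            return deserializeProfile(cached)
        }
        // Still not there? Fail gracefully and let the caller retry
        return nil, fmt.Errorf("profile %s is still being rebuilt: %w", userID, err)
    }
    defer mutex.Unlock()

    // Double-check cache (someone might've filled it just before we got the lock)
    if cached, err := redisClient.Get(ctx, cacheKey).Result(); err == nil {
        return deserializeProfile(cached)
    }

    // Okay, we actually need to hit the DB
    profile, err := db.GetUserProfile(userID)
    if err != nil {
        return nil, err
    }

    // Cache it for next time; a failed cache write shouldn't fail the request
    if data, err := json.Marshal(profile); err != nil {
        log.Printf("serialize %s: %v", cacheKey, err)
    } else if err := redisClient.Set(ctx, cacheKey, data, 5*time.Minute).Err(); err != nil {
        log.Printf("cache set %s: %v", cacheKey, err)
    }

    return profile, nil
}

func main() {
    // Placeholder addresses: one cache node plus five independent lock nodes.
    initRedis("localhost:6379", []string{
        "localhost:7001", "localhost:7002", "localhost:7003", "localhost:7004", "localhost:7005",
    })
}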

The magic happens in those few lines where we try to grab the lock. If we get it, we're the chosen one who hits the database. If we don't? We back off, wait a tiny bit, and check if whoever got the lock already populated the cache. Simple, but it saved my Saturday morning.