Gabriel Anhaia

Posted on Jun 13

Design a Distributed Lock Service: Fencing Tokens and the Failure Modes

#systemdesign #interview #distributedsystems #backend

You acquired the lock. You did the work. You released the lock. Two clients still wrote to the same file, and one of them clobbered the other. Nobody crashed. No error logged. The data is just wrong.

That's the interview question hiding inside "design a distributed lock." It isn't SETNX. It's: what happens when the holder thinks it still holds the lock but the lock service has already given it to someone else? Get that wrong and you've shipped a system that corrupts data silently and passes every happy-path test.

What a lock actually has to guarantee

State it before you draw a box. A distributed lock has one job: at most one client holds it at any moment. That's mutual exclusion, and the word that matters is exclusion, not acquisition.

Acquisition is the easy 5%. Every key-value store does it:

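# acquire: set the key only if it doesn't exist,
# with a TTL so a dead holder can't lock forever.
# token is any value unique to this holder, so only
# the holder can later renew or release the lock
ok = redis.set("lock:invoice:42", token,
               nx=True, ex=30)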

The hard 95% is everything after acquisition. Networks pause. Processes freeze. Clocks drift. The holder can lose the lock without knowing it lost the lock, and that gap is where the corruption lives. So the question you ask the interviewer first: "Is this a lock for efficiency or for correctness?"

That single question splits the whole design.

Efficiency locks vs correctness locks

An efficiency lock stops duplicate work. Two workers both regenerate the same cache entry, you waste some CPU, no harm done. If the lock occasionally fails and two workers run, the worst case is wasted effort. For these, a single Redis key with a TTL is fine. Don't overbuild it.

A correctness lock protects shared state from concurrent writes. Two workers write to the same row, the same file, the same ledger, and a missed exclusion means corrupted data. For these, a TTL key is not enough, and most candidates stop exactly one step short of why.

Say which one you're solving out loud. If the interviewer says "correctness," the rest of this post is the answer. If they say "efficiency," tell them you'd use a Redis key with a TTL and move on, because anything heavier is wasted complexity.

Why a lease alone doesn't save you

The standard fix for a dead holder is a lease: the lock has a TTL, so if the holder dies, the lock expires and someone else can take it. Sounds complete. It isn't.

Walk the timeline the interviewer is waiting for:

Client A acquires the lock with a 30-second lease.
Client A starts a stop-the-world GC pause. Or its host gets descheduled. Or the network partitions it.
30 seconds pass. The lease expires. The lock service hands the lock to Client B.
Client B does its work and writes to storage.
Client A wakes up from its pause. It still believes it holds the lock. It writes to storage too.

Now both A and B wrote. The lease did exactly what it promised, and you still got two writers. The lease handled the crash case. It did nothing for the pause case, because a paused process can't tell it was paused.
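
The timeline above can be replayed in a few lines. This is a minimal sketch, not a real lock service: LeaseLock, its manually advanced now field, and the writes list are all hypothetical stand-ins, chosen so the pause can be simulated without real sleeping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """In-memory lease lock driven by a manually advanced clock (hypothetical)."""
    ttl: float = 30.0
    now: float = 0.0
    holder: Optional[str] = None
    expires_at: float = 0.0

    def acquire(self, client: str) -> bool:
        # Grant if the lock is free or the previous lease has run out.
        if self.holder is None or self.now >= self.expires_at:
            self.holder = client
            self.expires_at = self.now + self.ttl
            return True
        return False


lock = LeaseLock(ttl=30.0)
writes = []                      # stands in for storage with no fencing

a_thinks_it_holds = lock.acquire("A")   # t=0: A gets a 30-second lease

lock.now = 31.0                  # A is frozen for 31 seconds; the lease expires
b_holds = lock.acquire("B")      # the service legitimately grants the lock to B
if b_holds:
    writes.append("B")           # B does its work

if a_thinks_it_holds:            # A wakes up; its local belief was never updated
    writes.append("A")

print(writes)                    # ['B', 'A']: two writers, nothing rejected
```

Nothing in the sketch misbehaves. The service expired the lease on time and granted it once; the second write comes purely from A acting on a belief it had no chance to refresh.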

This is the GC-pause split-brain, and it's the heart of Martin Kleppmann's critique of relying on lease-based locks for correctness. A JVM stop-the-world pause long enough to outlast a lease is rare, but it is not exotic. Neither is a hypervisor freezing a VM to migrate it. The holder has no way to know time jumped out from under it.

Fencing tokens: the part candidates miss

The fix doesn't live in the lock service. It lives at the resource you're protecting. The lock service hands out a monotonically increasing number with every grant, the fencing token. The protected resource rejects any write carrying a token older than the highest it has already seen.

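# the lock service: the held-check, the counter bump, and
# the SET run in one atomic script, so a slow client can
# never take the lock with a token older than a later grant
ACQUIRE_LUA = """
if redis.call('exists', KEYS[1]) == 1 then return 0 end
local token = redis.call('incr', KEYS[2])
redis.call('set', KEYS[1], token, 'EX', ARGV[1])
return token
"""

def acquire(resource: str) -> tuple[bool, int]:
    token = redis.eval(ACQUIRE_LUA, 2, f"lock:{resource}",
                       f"fence:{resource}", 30)
    if not token:
        return (False, 0)
    return (True, token)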

The token rides along on every write:

def write(resource, payload, token):
    # storage layer enforces the fence
    storage.put(resource, payload,
                fencing_token=token)

And the storage layer enforces it:

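# pseudo-storage: reject stale tokens
# (the check and the write must happen atomically)
def put(resource, payload, fencing_token):
    last = max_token.get(resource, 0)
    # strictly older tokens are stale; the current holder
    # may write again with the same token
    if fencing_token < last:
        raise StaleTokenError(
            f"token {fencing_token} < {last}")
    max_token[resource] = fencing_token
    do_write(resource, payload)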

Replay the split-brain with fencing. Client A got token 33. It pauses. Its lease expires. Client B acquires the lock and gets token 34. B writes with token 34; storage records 34. A wakes up and writes with token 33. Storage sees 33 < 34 and rejects it. The corruption is gone, and the lock service never had to be perfect; it only had to be monotonic.
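
Here is that replay as a runnable sketch. FencedStore is a hypothetical in-memory stand-in for the protected resource, and the tokens 33 and 34 are hard-coded to match the walkthrough rather than issued by a real lock service. It rejects only strictly older tokens, so the current holder can write more than once under the same grant.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """In-memory stand-in for a resource that checks fencing tokens."""

    def __init__(self):
        self.data = {}
        self.max_token = {}

    def put(self, resource, payload, fencing_token):
        last = self.max_token.get(resource, 0)
        if fencing_token < last:
            raise StaleTokenError(f"token {fencing_token} < {last}")
        self.max_token[resource] = fencing_token
        self.data[resource] = payload


store = FencedStore()
token_a, token_b = 33, 34   # A was granted first; B after A's lease expired

store.put("invoice:42", "written by B", token_b)      # B writes while A is paused

try:
    store.put("invoice:42", "written by A", token_a)  # A wakes up and tries
    a_rejected = False
except StaleTokenError:
    a_rejected = True

print(a_rejected, store.data["invoice:42"])
```

A real resource would need the compare and the write to be one atomic step; the single-threaded sketch gets that for free.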

This is the answer that separates "knows Redis" from "understands distributed locking." Mention fencing tokens before the interviewer has to ask.

The catch: your resource has to cooperate

Fencing only works if the protected resource can check the token. That's the real constraint, and it's worth naming.

A database can enforce it with a conditional update: UPDATE ... SET token = :new WHERE token <= :new. Zero rows updated means the write was fenced off.
An object store with compare-and-set on a version field can enforce it.
A dumb file on a plain filesystem cannot. It has no idea what a token is.

So when the resource can't check a fence, you don't actually have a correctness lock, no matter how fancy the lock service is. You have an efficiency lock wearing a correctness costume. Say this. It's the senior-level observation: the link that decides correctness is the resource, not the coordinator.
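
For the database case, a conditional update could look like the sketch below. It uses SQLite from the standard library only so it runs anywhere; the invoices table, the fence_token column, and the fenced_update helper are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " fence_token INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO invoices (id, body) VALUES (42, 'draft')")
conn.commit()


def fenced_update(conn, invoice_id, body, token):
    """Apply the write only if the token is not older than the stored one."""
    cur = conn.execute(
        "UPDATE invoices SET body = ?, fence_token = ? "
        "WHERE id = ? AND fence_token <= ?",
        (body, token, invoice_id, token),
    )
    conn.commit()
    return cur.rowcount == 1    # zero rows touched means the token was stale


b_ok = fenced_update(conn, 42, "written by B", 34)   # current holder
a_ok = fenced_update(conn, 42, "written by A", 33)   # paused holder, stale token
row = conn.execute(
    "SELECT body, fence_token FROM invoices WHERE id = 42"
).fetchone()
print(b_ok, a_ok, row)
```

The check and the write are a single statement, so the database does the atomic compare for you. The caller only has to treat "zero rows updated" as "I no longer hold the lock."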

Lease plus heartbeat (and why it's not a fix for the pause)

Heartbeats keep a long-running holder from losing the lock while it's still alive and working. The holder renews the lease periodically, well inside the TTL:

import threading

class Heartbeat:
    # only renew if WE still hold it
    RENEW_LUA = """
    if redis.call('get', KEYS[1]) == ARGV[1]
    then return redis.call('expire',
         KEYS[1], ARGV[2])
    else return 0 end
    """

    def __init__(self, redis, key, token,
                 ttl=30, interval=10):
        self.redis = redis
        self.key = key
        self.token = token
        self.ttl = ttl
        self.interval = interval
        self.lost = threading.Event()
        self._stop = threading.Event()

    def _renew(self):
        while not self._stop.wait(self.interval):
            ok = self.redis.eval(self.RENEW_LUA, 1,
                                 self.key, self.token,
                                 self.ttl)
            if not ok:
                # lease expired or was taken over:
                # signal the worker and stop renewing
                self.lost.set()
                return

    def start(self):
        t = threading.Thread(target=self._renew,
                             daemon=True)
        t.start()

    def stop(self):
        # call on release so the thread exits
        self._stop.set()

That Lua check matters: renew only if the value still equals your token. Without it, a holder that already lost the lock would happily extend someone else's grant. The release path needs the same guard, check-then-delete in one atomic script, never a plain GET then DEL.
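
A guarded release might look like this. It is a sketch that assumes client is a redis-py style object whose eval(script, numkeys, *keys_and_args) runs the script atomically on the server; the release helper and the RELEASE_LUA name are made up for the example.

```python
# Compare-and-delete in one server-side step: delete the key only if it
# still stores our token, otherwise leave it alone.
RELEASE_LUA = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""


def release(client, key, token):
    """Delete the lock only if it still holds our token.

    Returns True if we released it, False if it had already expired
    or now belongs to another holder.
    """
    return client.eval(RELEASE_LUA, 1, key, token) == 1
```

With a plain GET followed by DEL, the lease can expire between the two commands, another client can acquire the key, and your DEL then removes their lock instead of yours.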

But heartbeats don't fix the pause either. If the holder is frozen, it isn't heartbeating, the lease expires, and you're back to split-brain. Heartbeats reduce how often a healthy-but-slow holder loses its lock. They do nothing for a holder that stopped existing for a while. Fencing is still the thing that makes the write safe.

Why "Redlock" doesn't change this

Redlock is the multi-node Redis locking algorithm: acquire the lock on a majority of N independent Redis nodes, each with a TTL, and you "hold" it if you got the majority before the TTL elapsed.

It buys availability. If one Redis node dies, you can still acquire on the other nodes. That's worth something. But it does not buy you correctness, because it leans on every node's clock advancing at roughly the same rate. A node whose clock jumps forward expires your lease early. A long pause on the client still ends with a stale writer. Redlock makes acquisition more available; it does not make exclusion safe under pauses and clock skew.

The honest framing for the interview: Redlock is an efficiency lock with better availability than a single node. If you need correctness, you still need fencing tokens at the resource, and once you have fencing tokens, the strength of the lock service matters a lot less. Don't let "I'd use Redlock" be your whole answer. It's a component, not the design.
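
To make the clock dependence concrete, here is a toy simulation of majority acquisition. It is not the Redlock algorithm itself: the Node class and acquire_majority helper are hypothetical, each node keeps its own manually set clock, and the elapsed-time validity check and release-on-failure steps are left out.

```python
class Node:
    """One independent in-memory lock node with its own clock (hypothetical)."""

    def __init__(self):
        self.now = 0.0
        self.value = None
        self.expires_at = 0.0

    def set_nx(self, value, ttl):
        # Grant if the key is absent or this node believes the TTL has passed.
        if self.value is None or self.now >= self.expires_at:
            self.value = value
            self.expires_at = self.now + ttl
            return True
        return False


def acquire_majority(reachable, value, ttl, quorum):
    """Try every reachable node; the lock is held if a quorum granted it."""
    granted = sum(node.set_nx(value, ttl) for node in reachable)
    return granted >= quorum


nodes = [Node() for _ in range(5)]
quorum = len(nodes) // 2 + 1            # 3 of 5

# A can only reach nodes 0, 1, 2, and gets all three: a valid majority.
a_holds = acquire_majority(nodes[:3], "A", ttl=30.0, quorum=quorum)

# Almost no real time passes, but node 2's clock jumps past A's expiry.
nodes[2].now = 40.0

# B can only reach nodes 2, 3, 4. Node 2 thinks A's lease is over.
b_holds = acquire_majority(nodes[2:], "B", ttl=30.0, quorum=quorum)

print(a_holds, b_holds)
```

One misbehaving clock is enough for both clients to hold a majority at the same real-world moment, which is why the write path still needs a fence.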

Picking the lock service

When the interviewer asks what backs the lock, the choice tracks the consistency you need:

Backing store	Acquisition	Fencing token	Best for
Single Redis key + TTL	fast, available	manual via `INCR`	efficiency locks
Redlock (N Redis nodes)	available under node loss	manual via `INCR`	efficiency, higher availability
ZooKeeper (ephemeral seq node)	strong, linearizable	built in (zxid / node seq)	correctness coordination
etcd (lease + revision)	strong, linearizable	built in (mod_revision)	correctness, k8s-native

The pattern worth saying: ZooKeeper and etcd give you a monotonic number for free (the sequence number or revision), so the fencing token falls out of the design instead of being bolted on. That's why correctness-critical coordination tends to live there, not in Redis. Redis is the right call when the lock is an optimization and you want speed and availability over linearizability.

The 90-second answer (rehearse this one)

When they say "design a distributed lock," start the clock:

"First question: efficiency or correctness? If it's efficiency, stopping duplicate work, a single Redis key with SET NX and a TTL is enough, and I'd stop there.

If it's correctness, protecting shared state, a TTL key alone is not safe. The failure mode is a holder that pauses (GC, VM migration, partition) past its lease. The lease expires, someone else acquires, then the paused holder wakes up and writes, thinking it still holds the lock. Two writers, silent corruption.

The fix is fencing tokens. Every grant hands out a monotonically increasing number. Every write to the protected resource carries the token, and the resource rejects any token older than the highest it has seen. The paused holder's stale write gets rejected at the storage layer. That works even when the lock service itself is imperfect; it only has to be monotonic.

The catch is that the resource has to be able to check the token: a conditional update in a database, or a compare-and-set on an object version. A plain file can't, so against a dumb resource you don't really have a correctness lock.

Backing store: Redis for efficiency locks. ZooKeeper or etcd for correctness, because their sequence number or revision gives you the fencing token for free. Heartbeats keep a healthy long-running holder from losing the lease early, but they don't fix the pause; fencing does.

Redlock improves availability across Redis nodes but doesn't make exclusion safe under pauses or clock skew. It's a component, not the whole answer."

That hits efficiency vs correctness, the lease, the pause, fencing tokens, the resource constraint, the backing-store trade-off, heartbeats, and Redlock. The interviewer can push on any of those and you have something coherent to defend.

The follow-ups that burn candidates

"What if two clients get the same fencing token?" They can't, if the counter is atomic and monotonic. INCR in Redis is atomic. A SELECT ... FOR UPDATE then increment in SQL is atomic. A read-then-write in app code is not. If you generate tokens app-side without a single source of truth, you've reintroduced the race. Name the atomic source.
"Your lock service itself goes down. Now what?" Acquisition stops, which is the safe failure: nobody can take the lock, so nobody corrupts anything. Compare that to the unsafe failure (lock service hands the same lock twice), which fencing protects against. Locks should fail closed, not open.
"Can you do this without a lock at all?" Often, yes, and the strong candidates raise it themselves. If the resource already does compare-and-set on a version, you can skip the coordinator and let the optimistic write win or retry. The fencing token is just a version number. Sometimes the cleanest distributed lock is no lock, only a conditional write.

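The "no lock, only a conditional write" answer can be sketched as an optimistic retry loop. VersionedStore below is a hypothetical in-memory resource; its internal mutex only models the atomicity a real store would provide for compare-and-set, and the add helper and worker function are made up for the example.

```python
import threading


class VersionedStore:
    """In-memory stand-in for a resource with compare-and-set on a version."""

    def __init__(self, value):
        self._mutex = threading.Lock()   # models the store's own atomicity
        self.value = value
        self.version = 0

    def read(self):
        with self._mutex:
            return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        with self._mutex:
            if self.version != expected_version:
                return False             # someone else wrote first
            self.value = new_value
            self.version += 1
            return True


def add(store, amount):
    # Optimistic loop: read, compute, write only if nothing changed, else retry.
    while True:
        value, version = store.read()
        if store.compare_and_set(version, value + amount):
            return


def worker(store):
    for _ in range(1000):
        add(store, 1)


store = VersionedStore(0)
threads = [threading.Thread(target=worker, args=(store,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store.value, store.version)   # 8000 8000
```

No coordinator appears anywhere. The version plays the role the fencing token plays in the locked design, and a lost race costs a retry instead of a corrupted value.
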
What's the worst lock bug you've watched in production, the cron job that double-fired because two boxes both "held" the lock, the migration that ran twice? Drop it in the comments. The failure folklore is half of why these threads are worth reading.

If this was useful

This layered way of reasoning (state the guarantee, find the pause, fence at the resource, pick the backing store that hands you the token for free) is what turns "design a lock" from a trick question into a routine one. The System Design Pocket Guide: Fundamentals walks through coordination primitives, leases, and consensus with the same lens, plus the failure modes that show up when you trust a lock to do a fence's job.