Distributed Locks Are a Code Smell


The Lock That Lied

A single angry support ticket is usually an anomaly. Three identically angry support tickets arriving within 60 seconds about the exact same missing money? That is a pattern. Last quarter, our supposedly bulletproof payment pipeline successfully charged a single customer three times for one order. The investigation dragged on for hours, but the root cause took only four minutes to explain.

Here's what actually happened. Service A acquired a Redis lock with a 10-second TTL to process a payment. Right in the middle of executing, the JVM triggered a stop-the-world garbage collection. The entire process froze for 12 seconds. It didn't crash. It didn't log anything. It just... stopped.

While Service A was completely frozen, the lock expired in Redis. Service B picked up the very same lock, processed the exact same payment, and committed the charge. Seconds later, Service A woke up from its GC pause. It had absolutely no idea the lock was gone. It happily finished processing and committed the charge a second time.

Two microservices. Both believed they held the exclusive lock. Both were right — just at different points in time. The customer paid three times because a third container hit the same fate milliseconds later during a sudden traffic spike.

This isn't some theoretical academic edge case. This is exactly what distributed locks do in production. They wrap a warm, fuzzy blanket of safety around your architecture while hiding a massive trapdoor right underneath you.

Why Distributed Locks Are Nothing Like Local Locks

When you type synchronized in a Java application or call Lock() on a sync.Mutex in Go, you receive a hard, physical guarantee. The operating system and the CPU strictly enforce mutual exclusion. Two threads literally cannot hold the same mutex simultaneously. The hardware backs you up: there is exactly one shared piece of memory, and an atomic compare-and-swap instruction arbitrates access to it.

A distributed lock gives you absolutely none of this.

There is no shared memory between your services. You don't have reliable clocks. Your servers' clocks drift constantly, NTP daemons can jump time forward or backward randomly, and cloud VMs can stall for seconds without any warning to the guest OS. You don't even have guaranteed message delivery. The GitHub infrastructure team famously documented an incident where network layer packets were delayed for 90 seconds.

A local mutex provides a guarantee. A distributed lock provides an opinion. It represents the lock service's best guess that you probably still hold the lock right now. But "probably" and "right now" are doing a tremendous amount of heavy lifting.

The moment you accept that a distributed lock is fundamentally an approximation, you naturally start asking the right question: what actually happens when two processes both think they hold the lock at the same time?

The Kleppmann vs Antirez Debate (The 5-Minute Version)

Back in 2016, Martin Kleppmann (who wrote Designing Data-Intensive Applications) published a deep analysis of the Redlock algorithm. Salvatore Sanfilippo (antirez, the creator of Redis) wrote a rebuttal. The exchange between them remains one of the greatest, most important debates in distributed systems engineering. Here is the short version of what you need to know.

What Redlock claims to provide. The algorithm relies on 5 independent Redis nodes. A client attempts to acquire the lock on a majority (at least 3), using clock-based expiry to ensure the lock eventually releases. Antirez designed it specifically to survive individual node failures.

Kleppmann's critique. He pointed out two massive holes:

  1. No fencing tokens. Redlock does not generate a monotonically increasing number every time a client acquires a lock. Without this token, a storage system has no possible way to reject stale writes from a process that thinks it still owns the lock but actually doesn't.

  2. Timing assumptions. Redlock assumes bounded network delay, bounded process pauses, and bounded clock error. Real production systems violently violate all three. A garbage collection pause of 30 seconds, a sudden NTP clock jump, or a 90-second network delay will easily cause two clients to hold the "lock" simultaneously.

Antirez's response. He pushed back, arguing that Redlock explicitly checks the elapsed time before and after acquiring the majority. This makes it immune to delays during the acquisition itself. He also proposed that random unique tokens could substitute for monotonic counters if you use check-and-set operations. Finally, he conceded that Redis really should switch to monotonic time APIs.

The verdict. Here's the thing: both sides are absolutely right, depending on what you're trying to do. Antirez is perfectly correct that for many practical use cases — like preventing duplicate cron jobs or stopping cache stampedes — Redlock works just fine. Kleppmann is equally correct that if you care about strict data safety, Redlock's guarantees fall short. The question you should ask isn't "is Redlock safe?" but rather "safe enough for what?"

If you just want to prevent wasted CPU cycles, Redlock works perfectly well. If you want to prevent corrupted databases or duplicate customer charges, it fails completely. The problem I see is that most engineers reaching for distributed locks don't know which outcome they actually need.

Two Types of Locks (This Is the Key Insight)

Martin Kleppmann's framing here is the single best mental model for distributed locking I've ever found. Every single time you consider reaching for a lock, stop and ask yourself: is this for efficiency or correctness?

Efficiency Locks: "Don't Do Expensive Work Twice"

The whole goal here is preventing duplicate computation, not preventing data corruption. If your lock mysteriously fails and two processes run the job, you just waste some CPU cycles. Nobody loses real money. Nobody overwrites critical data.

Examples:

  • Cache stampede prevention. A hundred concurrent requests hit a newly expired cache key. You just want one worker to recompute the payload, not all hundred.
  • Job deduplication. A daily cron job triggers across three nodes. You want it to execute exactly once, not three times.
  • Rate limiting. You want roughly one API call per second, not a mathematically perfect single execution.

For these cases, a simple Redis SETNX with a TTL does exactly what you need:

SET lock:rebuild-cache "worker-7a3f" NX EX 30

That's it. One Redis node. No Redlock complexity. No intense consensus algorithm required. If it randomly fails, you rebuild the cache twice. The world keeps spinning just fine.
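To make the semantics concrete, here is a minimal in-memory sketch of the SET NX EX pattern. The FakeRedis class and its method are illustrative stand-ins for the real command, not the redis-py API; the point is just the "first caller wins until the TTL expires" behavior:

```python
import time

class FakeRedis:
    """In-memory stand-in, just enough to show SET NX EX semantics."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_seconds):
        entry = self._data.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return False  # lock already held and not yet expired
        self._data[key] = (value, now + ttl_seconds)
        return True

# Only the first worker wins; the rest skip the rebuild.
r = FakeRedis()
assert r.set_nx_ex("lock:rebuild-cache", "worker-7a3f", 30) is True
assert r.set_nx_ex("lock:rebuild-cache", "worker-9b1c", 30) is False
```

If the TTL expires while the first worker is still running, a second worker gets in and the cache is rebuilt twice. For an efficiency lock, that outcome is acceptable by definition.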

Correctness Locks: "Don't Corrupt My Data"

This time, the goal is strict mutual exclusion to ensure data safety. If the lock fails and two processes operate simultaneously, bad things happen. You see double charges, corrupted financial states, lost writes, or oversold inventory.

I learned this the hard way, so I'll give you the uncomfortable truth: you don't actually need a lock for this. You need a fencing token.

Why? Because a lock will eventually be "held" by two processes simultaneously in production. The garbage collection pause scenario isn't some exotic theoretical event. It's just a normal Tuesday. Any of the following triggers it:

  • JVM garbage collection (stop-the-world pauses can last seconds, or even minutes on very large heaps)
  • Container CPU throttling when Kubernetes gets overloaded
  • VM stalls in multi-tenant cloud environments
  • Network partitions where the locking service communicates fine with both clients, but the clients can't reach each other
  • NTP clock jumps forcing a lock to expire prematurely on one specific node

When your system's correctness depends on perfect mutual exclusion, and that mutual exclusion relies on perfect clocks and flawless networks, your correctness essentially depends on perfect clocks and flawless networks. You do not want your career depending on that.

Fencing Tokens: The Right Abstraction for Correctness

A fencing token is simply a monotonically increasing number generated every single time a lock is granted. The client holding the lock passes this token down to the storage layer with every write request. The storage layer keeps track of the highest token it has ever seen and aggressively rejects any write carrying a lower or equal token.

This represents a critical shift in your architecture: the central storage system becomes an active, enforcing participant in safety, rather than just a passive victim accepting writes from whoever shows up last.

Let's walk through how this works in a real crash scenario:

  1. Process A asks ZooKeeper for a lock. ZooKeeper grants it and hands back fencing token 33.
  2. Process A initiates a slow write to the database, actively including token 33 in the payload.
  3. Process A gets hit with a massive GC pause. It freezes completely.
  4. The lock lease times out. Process B comes along and acquires the lock, receiving fencing token 34.
  5. Process B writes to the database with token 34. The database accepts it and records 34 as the new high-water mark.
  6. Process A finally wakes up. It attempts to finish its write using token 33.
  7. The database sees that 33 < 34. It outright rejects Process A's write.

No data corruption. No double charging the customer. Even though the lock effectively "lied" — even though both processes genuinely believed they owned the lock at the same time — the fencing token caught the violation at the absolute lowest layer.
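The walkthrough above can be simulated in a few lines. This is an illustrative in-memory sketch (FencedStore is a hypothetical class, not a real library): the storage layer tracks the highest token it has seen per key and rejects anything stale, which is exactly what catches Process A's late write:

```python
class FencedStore:
    """Storage layer that tracks the highest fencing token seen per key
    and rejects writes carrying a lower or equal (stale) token."""
    def __init__(self):
        self._data = {}
        self._high_water = {}

    def write(self, key, value, token):
        if token <= self._high_water.get(key, 0):
            return False  # stale token: reject the write
        self._high_water[key] = token
        self._data[key] = value
        return True

store = FencedStore()
# Process A holds the lock with token 33, then freezes in a GC pause.
# Process B acquires the lock with token 34 and writes first.
assert store.write("order-42", "charged-by-B", 34) is True
# Process A wakes up and tries to finish with its stale token 33.
assert store.write("order-42", "charged-by-A", 33) is False
```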

The implementation in a relational database like PostgreSQL is remarkably straightforward:

-- Add a fencing column to your table
ALTER TABLE orders ADD COLUMN lock_token BIGINT DEFAULT 0;

-- Write only if our token is the highest
UPDATE orders
SET status = 'processed', lock_token = 34
WHERE id = 42 AND lock_token < 34;
-- Rows affected: 1 (success) or 0 (stale token, rejected)

If you use ZooKeeper, the znode's czxid (creation transaction ID) naturally acts as a perfect fencing token because it increases monotonically with every single state change. If you use etcd, the revision number returned when the lock key is created serves the exact same purpose.

The Decision Tree: What Do You Actually Need?

Before you install a shiny new distributed lock library, walk yourself through this tree:

  1. Is the lock for efficiency or correctness? Efficiency → a single Redis SETNX with a TTL. You're done.
  2. Correctness → can you route all writes for each entity through a single process? Yes → single-writer architecture. No lock needed.
  3. No → can the database arbitrate conflicts with conditional writes? Yes → optimistic concurrency (CAS). No lock needed.
  4. No → are the operations naturally sequential? Yes → serialize them through an ordered queue. No lock needed.
  5. Only if every answer above was no → a lease-based lock with fencing tokens, validated at the storage layer.

I watch most engineering teams jump straight to step 5. Start at the top instead. You'll be genuinely surprised how frequently you can exit the tree much earlier.

Five Alternatives That Are Usually Better

1. Single-Writer Architecture

The absolute simplest way to avoid the headache of distributed locks is to stop distributing your writes in the first place.

Route every single write for a specific entity (or partition) through just one process. Kafka consumer groups handle this natively — each partition gets tied to exactly one active consumer in the group. If all updates for customer 42 always route to partition 42 % N, you guarantee serial processing without a drop of external coordination.

This isn't some hacky workaround. It's the exact architectural foundation behind heavy-duty systems like Kafka Streams, Akka Cluster Sharding, and Orleans virtual actors. The "lock" effectively becomes the partition assignment itself.
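The routing rule itself is one line. A minimal sketch (the function and constant names are illustrative; in Kafka this mapping is what the partitioner does for you):

```python
NUM_PARTITIONS = 16

def partition_for(customer_id: int) -> int:
    # Every write for the same customer lands on the same partition,
    # and each partition is consumed by exactly one active consumer,
    # so updates for that customer are applied serially with no lock.
    return customer_id % NUM_PARTITIONS

# Customer 42 always routes to the same partition.
assert partition_for(42) == partition_for(42)
```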

2. Optimistic Concurrency Control (CAS)

Let every process try to write at the same time. Reject the stale writes at the database layer. This works incredibly well when conflicts are rare. And in the vast majority of normal CRUD applications, they are extremely rare.

-- Read the current version
SELECT id, balance, version FROM accounts WHERE id = 42;
-- Returns: id=42, balance=100, version=7

-- Write only if version hasn't changed
UPDATE accounts
SET balance = 80, version = 8
WHERE id = 42 AND version = 7;
-- Rows affected: 1 (success) or 0 (conflict, retry)

DynamoDB has this built right into its core API using conditional expressions:

{
  "TableName": "Orders",
  "Key": { "orderId": { "S": "order-42" } },
  "UpdateExpression": "SET #s = :new_status, #v = :new_version",
  "ConditionExpression": "#v = :expected_version",
  "ExpressionAttributeNames": { "#s": "status", "#v": "version" },
  "ExpressionAttributeValues": {
    ":new_status": { "S": "processed" },
    ":new_version": { "N": "8" },
    ":expected_version": { "N": "7" }
  }
}

No external lock. No expiring TTLs. No terrifying vulnerability to GC pauses. If two processes race, one succeeds and the other immediately retries. The database acts as the strict arbiter, completely eliminating the need for a separate lock service.
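The read-modify-retry loop around a conditional write looks like this. An in-memory sketch (VersionedAccounts and withdraw are hypothetical names) that mirrors the SQL conditional UPDATE above:

```python
class VersionedAccounts:
    """In-memory stand-in for a table row with a version column."""
    def __init__(self, balance):
        self.balance, self.version = balance, 7

    def read(self):
        return self.balance, self.version

    def compare_and_set(self, new_balance, expected_version):
        # Mirrors: UPDATE accounts SET ... WHERE id = ? AND version = ?
        if self.version != expected_version:
            return False  # conflict: someone else wrote first
        self.balance, self.version = new_balance, expected_version + 1
        return True

def withdraw(account, amount, max_retries=3):
    for _ in range(max_retries):
        balance, version = account.read()
        if account.compare_and_set(balance - amount, version):
            return True  # write accepted
    return False  # repeated conflicts: surface the failure to the caller

acct = VersionedAccounts(100)
assert withdraw(acct, 20) is True
assert acct.read() == (80, 8)
```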

3. Queue-Based Serialization

Dump your operations into an ordered queue. Process them strictly sequentially. The queue itself guarantees the ordering, not an external lock.

You desperately want this pattern when the operations are naturally sequential anyway. Think payment processing, inventory decrements, or state machine transitions. Instead of running a complex loop of "acquire lock, read state, modify, write, release lock," you simply shift to "enqueue operation, let the single processor read from queue, apply sequentially."

AWS SQS FIFO queues, Kafka topics configured with a single partition per entity, or honestly just a simple Redis list using LPUSH and BRPOP serve this purpose brilliantly. You shift the complex serialization point completely away from a fragile lock and into a durable queue instance.
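A toy sketch of the idea, using a plain in-process deque to stand in for the durable queue; the queue's ordering, not a lock, is what serializes the operations:

```python
from collections import deque

# Producers on any node enqueue operations; exactly one processor
# drains the queue and applies them strictly in order.
operations = deque()
operations.append(("order-42", "reserve_inventory"))
operations.append(("order-42", "charge_payment"))
operations.append(("order-42", "ship"))

applied = []
while operations:
    applied.append(operations.popleft())  # FIFO: the queue is the serializer

assert [op for _, op in applied] == ["reserve_inventory", "charge_payment", "ship"]
```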

4. Database-Level Advisory Locks

If all your writers share a single PostgreSQL database, you already own a phenomenal lock service. It's called PostgreSQL.

-- Acquire an advisory lock (blocks until available)
SELECT pg_advisory_lock(hashtext('order-42'));

-- Do your critical work
UPDATE orders SET status = 'shipped' WHERE id = 42;

-- Release the lock
SELECT pg_advisory_unlock(hashtext('order-42'));

Advisory locks release automatically the moment the session drops. A violently crashed process physically cannot leave behind a zombie lock (unlike Redis without a careful TTL strategy). They also tie directly into PostgreSQL's standard deadlock detection engine. Most importantly, they add absolutely zero new infrastructure to your stack. No Redis clusters to maintain, no ZooKeeper ensembles to monitor.

The main limitation is obvious: they only work when all your writers talk exclusively to that same database instance. If you run heavily decoupled microservices with isolated databases, this won't help you at all. But if you do share a database — and let's be honest, many teams do — this is simply the correct, easiest answer.

5. Lease + Fencing Token

When your architecture genuinely demands pure distributed mutual exclusion for raw correctness, and none of the alternative patterns fit your constraints, you use a lease-based lock paired with a fencing token. I call this the "last resort" option. Not because the pattern is flawed, but strictly because it brings the highest operational complexity.

Here is ZooKeeper's standard recipe:

// Create an ephemeral sequential znode
String lockPath = zk.create(
    "/locks/order-42/lock-",
    data,
    ZooDefs.Ids.OPEN_ACL_UNSAFE,
    CreateMode.EPHEMERAL_SEQUENTIAL
);

// The zxid is your fencing token
long fencingToken = zk.exists(lockPath, false).getCzxid();

// Pass the token to your storage layer
orderService.process(orderId, fencingToken);

And with etcd:

# Create a lease (TTL = 30 seconds)
etcdctl lease grant 30
# lease 694d7c3b6cc3c01a granted with TTL(30s)

# Acquire the lock with the lease
etcdctl lock order-42 --lease=694d7c3b6cc3c01a
# order-42/694d7c3b6cc3c01b  <-- the revision is your fencing token

# Keep the lease alive while processing
etcdctl lease keep-alive 694d7c3b6cc3c01a

The absolute key here: you must always pass the fencing token (the zxid or the revision) downstream and explicitly validate it at the final storage layer.

The Implementation Guide: Fencing Tokens with ZooKeeper/etcd

If you walked through the decision tree and confirmed you genuinely need a distributed lock with fencing, here is the exact implementation pattern you must follow.

Step 1: Acquire a lease with a monotonic identifier.

ZooKeeper's ephemeral sequential znodes automatically give you a czxid that ticks upward with every transaction. etcd's lock command explicitly returns a revision number. Both systems provide monotonically increasing, globally ordered tokens.

Step 2: Pass the token to every downstream write.

Your process holding the lock must never write to underlying storage without physically including the fencing token. Treat it exactly like a request header — it travels alongside every single operation inside your critical section.

Step 3: Validate at the storage layer.

The storage layer (your database, object store, or downstream API) absolutely must reject writes carrying a token lower than the highest one it has previously seen:

def write_with_fencing(storage, key, value, token):
    # Note: in a real store this get-check-put must execute atomically
    # (inside a transaction or as a conditional write); this sketch
    # shows the logic, not the required isolation.
    current_token = storage.get_token(key)
    if token <= current_token:
        raise StaleTokenError(
            f"Token {token} is stale (current: {current_token})"
        )
    storage.put(key, value, token)

Step 4: Handle lease expiry gracefully.

When your background lease expires, you must cleanly stop all ongoing writes immediately. Do not simply assume the database work you started can safely complete. Actively check the lease status right before each write step, and strictly design your critical section to be as short as humanly possible.

The most common disaster I see is developers acquiring the lock, doing five heavy minutes of computation, and then eagerly writing the result. By the time you trigger the write, the lock is completely gone. Do this instead: acquire the lock quickly, write your state immediately, and release the lock. Move the massive computation blocks entirely outside of the critical section.
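Sketched side by side, with hypothetical helper functions and a plain threading.Lock standing in for the distributed lease:

```python
import threading

def bad_pattern(lock, compute, store):
    # Anti-pattern: heavy computation inside the critical section.
    # By the time store() runs, a distributed lease may long since
    # have expired on the other side of the network.
    with lock:
        result = compute()   # minutes of work while "holding" the lock
        store(result)

def good_pattern(lock, compute, store):
    # Heavy computation happens outside the critical section;
    # the lock is held only for the short, final write.
    result = compute()
    with lock:
        store(result)

results = []
good_pattern(threading.Lock(), lambda: 42, results.append)
assert results == [42]
```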

When You Actually Need a Distributed Lock

I realize I've spent an entire article aggressively telling you not to use distributed locks. Let me be clear: there are real scenarios where you genuinely need one.

Leader election for singleton processes. You specifically need exactly one background scheduler, one cluster rebalancer, or one job coordinator running at any time. This represents a perfectly legitimate use of distributed mutual exclusion. ZooKeeper and etcd were literally built for this exact task.

Distributed resource coordination. You manage a tight pool of expensive external resources (like costly GPU instances or strict licensed API connections) that you absolutely cannot over-allocate. A lease-based lock with strict fencing handles this beautifully.

Cross-service state machine transitions. When a complex operation spans multiple distinct microservices and must never be duplicated (not just made idempotent — you physically cannot safely duplicate it), a lock combined with a fencing token correctly protects the state transition.

But here is the thing. Even in these specific cases, strongly prefer lease-based approaches paired with fencing tokens over simple TTL-based Redis locks. The lease natively gives you automatic safety releases on failure. The fencing token guarantees your data stays safe even when the lease inevitably lies to you.

The Checklist Before Reaching for a Lock

Print this list out. Tape it directly next to your monitor. Force your team to consult it every single time someone proposes tossing a new distributed lock into your architecture.

  • Is this for efficiency or correctness? If it's just for efficiency, a single Redis SETNX with a TTL does the job. Stop right here.
  • What happens if two processes hold the "lock" simultaneously? If the honest answer is "we waste some minor compute cycles" — congratulations, you want an efficiency lock. If the answer is "we corrupt user data" — you need fencing tokens, not just a bare lock.
  • Can I use a single-writer architecture? Partition your data physically. Route all writes through exactly one process per partition. You eliminate the lock entirely.
  • Can I use optimistic concurrency (CAS)? Push version numbers, ETags, or conditional writes. Let the database safely arbitrate conflicts. You eliminate the lock entirely.
  • Can I use a queue? Serialize your operations through an explicitly ordered queue. You eliminate the lock entirely.
  • If I absolutely must lock: am I using fencing tokens? A lock without fencing tokens represents a lock stripped of safety. Use ZooKeeper's czxid or etcd's revision numbers.
  • Have I explicitly tested the failure mode where the lock holder pauses for 30 seconds? If you haven't, you haven't truly tested your lock. GC pauses, aggressive container throttling, and VM stalls aren't edge cases. They happen constantly.

The next time an engineer on your team casually says, "we just need a distributed lock," treat it just like a code smell in a pull request. It isn't necessarily wrong by default — but it demands deep investigation. The lock might somehow be the right answer for your specific pain. But far more often than not, it merely masks a symptom of a system design that hasn't yet discovered the correct underlying abstraction.


Sources & Further Reading

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann -- Chapter 8 (The Trouble with Distributed Systems) and Chapter 9 (Consistency and Consensus) cover the foundations of everything in this article
