Distributed Lock Alternatives: Which One to Use in Which Scenario?


Overview of Distributed Locking Mechanisms

Concurrency control is critical in distributed systems: it keeps multiple operations from conflicting over the same database record or shared resource at the same time, which is essential for preserving data integrity and avoiding race conditions. On a single server, in-memory locks or database-level locks are usually sufficient, but once a system is spread across multiple servers, things get more complicated. This is exactly where distributed locking mechanisms come in.

The primary goal of these mechanisms is to ensure that only one process or thread can access a resource at any given time. In a distributed environment, this seemingly simple goal runs into network latency, server failures, and network partitions. Each distributed locking strategy therefore makes different trade-offs and suits different use cases. Below, we examine these alternatives and when to prefer each one.

Database-Based Distributed Locks

One of the most common approaches, and usually the first that comes to mind, is to use the existing database as the distributed lock mechanism. This is attractive because it avoids setting up an additional service when you already run a database. Two main strategies stand out: unique constraints and atomic operations.

To create a lock using unique constraints, we create a table with a unique key that holds the name or ID of the resource to be locked. When a client wants to acquire the lock, it tries to insert a new record into this table. If the insert succeeds, the lock is acquired. If it fails with a unique key violation, someone else already holds the lock. The biggest advantage of this approach is its simplicity. However, the lock is released only when the holder deletes the record, so a client that crashes before doing so leaves a stale lock behind that blocks everyone else indefinitely.
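The following is a minimal sketch of the unique-constraint approach. It uses Python's built-in `sqlite3` module purely as a stand-in for a shared database server, and the table name `resource_locks` and the function names are illustrative assumptions, not part of any library.

```python
import sqlite3


def create_lock_table(conn: sqlite3.Connection) -> None:
    # The PRIMARY KEY on "resource" is the unique constraint that makes
    # a second insert for the same resource fail.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resource_locks ("
        "resource TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )


def try_acquire(conn: sqlite3.Connection, resource: str, owner: str) -> bool:
    """Return True if the lock row was inserted, False if it already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO resource_locks (resource, owner) VALUES (?, ?)",
                (resource, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # Unique key violation: another owner holds the lock.
        return False


def release(conn: sqlite3.Connection, resource: str, owner: str) -> bool:
    """Delete the lock row, but only if this owner holds it."""
    with conn:
        cur = conn.execute(
            "DELETE FROM resource_locks WHERE resource = ? AND owner = ?",
            (resource, owner),
        )
    return cur.rowcount == 1
```

Matching on `owner` in the `DELETE` is a design choice of this sketch: it keeps one client from accidentally releasing a lock that another client holds.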

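Atomic operations usually mean acquiring the lock with INSERT ... ON CONFLICT DO NOTHING or a similar single statement against the same kind of unique key, then checking how many rows were written. One row means the lock is acquired; zero rows means someone else holds it. To release the lock, a DELETE operation is used. Both variants are equally atomic, since the unique index decides the winner either way; the practical benefit here is that a failed attempt does not raise an error, which in PostgreSQL would otherwise abort the surrounding transaction. Faulty clients may still fail to release locks, so an expiry column or a cleanup job is usually needed as well.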
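As a rough sketch of this variant, the code below acquires the lock with `INSERT ... ON CONFLICT DO NOTHING` and checks the affected row count. The `expires_at` column, the table name `lease_locks`, and the explicit `now` parameter are assumptions added for illustration, so that a lock left behind by a crashed client can be reclaimed. SQLite (3.24 or later) again stands in for the real database; in production the timestamp would more likely come from the database server's clock.

```python
import sqlite3


def open_lock_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lease_locks ("
        "resource TEXT PRIMARY KEY, "
        "owner TEXT NOT NULL, "
        "expires_at REAL NOT NULL)"
    )
    return conn


def try_acquire_with_ttl(
    conn: sqlite3.Connection, resource: str, owner: str, ttl: float, now: float
) -> bool:
    """Try to take the lock; returns True only if our row was inserted."""
    with conn:  # both statements run in one transaction
        # Clear a lock on this resource whose holder let it expire.
        conn.execute(
            "DELETE FROM lease_locks WHERE resource = ? AND expires_at <= ?",
            (resource, now),
        )
        # No error on conflict: the row count tells us who won.
        cur = conn.execute(
            "INSERT INTO lease_locks (resource, owner, expires_at) "
            "VALUES (?, ?, ?) ON CONFLICT (resource) DO NOTHING",
            (resource, owner, now + ttl),
        )
        return cur.rowcount == 1
```

With a TTL of 10, a lock taken at `now=100.0` blocks other owners until `now` passes 110, after which the next caller's insert succeeds.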

ℹ️ Lock Acquisition Example with PostgreSQL

For example, in PostgreSQL, the pg_try_advisory_lock function acquires a session-level lock. If the lock is already held, it returns false immediately instead of waiting. You release the lock with pg_advisory_unlock. The advantage is that PostgreSQL releases the lock automatically when the session ends. However, if the client hangs or the network drops without closing the connection, the lock stays held until the server notices that the session is gone.
-- Attempting to acquire the lock
SELECT pg_try_advisory_lock(12345); -- 12345 is our lock ID

-- Releasing the lock
SELECT pg_advisory_unlock(12345);
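Advisory locks are keyed by an integer (a single 64-bit `bigint` in the form used above), while application resources usually have names. One possible way to bridge the two, shown here as a sketch with a hypothetical helper name, is to hash the resource name into a signed 64-bit value and pass that as the lock ID.

```python
import hashlib


def advisory_lock_id(resource: str) -> int:
    """Map a resource name to a signed 64-bit integer usable as a lock ID."""
    digest = hashlib.sha256(resource.encode("utf-8")).digest()
    # Take the first 8 bytes and interpret them as a signed integer so the
    # result fits PostgreSQL's bigint range.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

Two different names can in principle hash to the same ID, which would make unrelated resources share a lock; with 64 bits this is unlikely, but it is a trade-off of this sketch rather than a guarantee.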

Distributed Locks with Redis (SETNX and Redlock Algorithm)

Redis is a popular choice for distributed locks as a high-performance, in-memory data store. The most basic Redis locking mechanism is to use the SETNX (SET if Not eXists) command. This command assigns a value to the specified key only if the key does not exist, returning 1 if successful and 0 otherwise.

To acquire the lock, the process runs the SETNX lock_key 'some_value' command. If the return value is 1, the lock is acquired. However, SETNX alone never expires the key, so a separate mechanism is needed to make sure the lock eventually goes away. This was traditionally done with a follow-up EXPIRE lock_key timeout command. Because these two operations are not atomic, this is risky: if SETNX succeeds and the client crashes or loses its connection before it sends EXPIRE, the lock remains forever. To solve this problem, the SET lock_key 'some_value' NX EX timeout command is used. This single command sets the key only if it does not exist and attaches the expiration in the same step, which provides atomicity.

⚠️ Issues Can Occur If SETNX and EXPIRE Are Not Atomic

If SETNX succeeds and the client crashes before it sends the EXPIRE command, the key has no expiration and the lock remains forever. This is a serious problem, especially in systems that require high availability.
# Example: Acquire the lock and automatically release it after 5 seconds
SET my_lock_key some_random_value NX EX 5
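The random value in the command above matters at release time: deleting the key unconditionally could remove a lock that has since expired and been taken by another client. A common pattern is to delete the key only if it still holds the caller's value, using a small Lua script so that the check and the delete run atomically. The sketch below assumes a client object shaped like redis-py's `Redis` class (`set(..., nx=True, ex=...)` and `eval(script, numkeys, ...)`); the function names are hypothetical.

```python
import secrets

# Delete the key only if it still holds our token (compare-and-delete).
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def acquire_lock(client, key: str, ttl_seconds: int):
    """Return a random token if the lock was acquired, otherwise None."""
    token = secrets.token_hex(16)
    # Equivalent to: SET key token NX EX ttl_seconds
    if client.set(key, token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client, key: str, token: str) -> bool:
    """Release the lock only if this token still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

The caller keeps the returned token for the duration of the critical section and passes it back to `release_lock`; a `False` result means the lock had already expired or changed hands.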

A more involved Redis-based solution is the Redlock algorithm, which aims to make the lock survive the loss of individual Redis nodes by using N independent instances (for example, five). The client tries to acquire the lock with the same key, random value, and TTL on every instance, and considers it held only if it succeeded on a majority (⌊N/2⌋ + 1) and the time spent acquiring is shorter than the lock's validity time. Otherwise, it releases the lock on all instances. The lock therefore keeps working as long as a majority of instances is reachable, even if some of them crash. However, Redlock is complex to implement, and its safety remains debated because it depends on timing assumptions: bounded network delay, bounded process pauses, and clocks that run at roughly the same rate.
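To make the majority rule concrete, here is a simplified sketch of a single Redlock acquisition attempt. It is not a production implementation: retries, per-instance timeouts, and lock extension are left out, the drift allowance (1% of the TTL plus 2 ms) is an assumption borrowed from common implementations, and each instance is assumed to expose redis-py-style `set` and `eval` methods.

```python
import time

# Compare-and-delete, so we only remove keys that carry our own token.
UNLOCK_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def quorum(n: int) -> int:
    """Smallest majority of n instances."""
    return n // 2 + 1


def redlock_release(instances, key: str, token: str) -> None:
    """Best-effort release on every instance, reachable or not."""
    for inst in instances:
        try:
            inst.eval(UNLOCK_SCRIPT, 1, key, token)
        except Exception:
            pass  # an unreachable instance will drop the key when the TTL ends


def redlock_acquire(instances, key: str, token: str, ttl_ms: int,
                    clock=time.monotonic):
    """Return the remaining validity in ms if the lock is held, else None."""
    start = clock()
    acquired = 0
    for inst in instances:
        try:
            if inst.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except Exception:
            pass  # a failed or unreachable instance counts as "not acquired"
    elapsed_ms = (clock() - start) * 1000
    drift_ms = ttl_ms * 0.01 + 2  # assumed allowance for clock drift
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if acquired >= quorum(len(instances)) and validity_ms > 0:
        return validity_ms
    # No majority, or too much time spent: undo any partial acquisition.
    redlock_release(instances, key, token)
    return None
```

The returned validity time is the window in which the caller may safely treat the lock as held; work that might outlast it needs a different safeguard.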

Coordination Services like ZooKeeper and etcd

Distributed coordination services like Apache ZooKeeper and etcd (originally developed at CoreOS, now a CNCF project) are specifically designed for tasks like concurrency management and leader election in distributed systems. These services offer strong consistency models and distributed locking mechanisms.

In ZooKeeper, distributed locks are usually implemented using ephemeral nodes. When a client wants to acquire a lock, it creates an ephemeral sequential node under the lock path. ZooKeeper appends a monotonically increasing sequence number to each such node. The client then lists the children of the lock path and checks whether its own node has the lowest sequence number. If so, it holds the lock. If not, it sets a watch on the node immediately preceding its own, rather than on the whole list, which would wake every waiter on each release. When that preceding node is deleted (i.e., the lock is released or its owner's session ends), the client is notified and checks the list again. The biggest advantage of ephemeral nodes is that ZooKeeper deletes them automatically when the owning client's session ends, for example after a crash or a disconnect that outlasts the session timeout, so locks are released without any action from the client.
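The decision each client makes after listing the children can be sketched as a pure function. This example assumes node names ending in ZooKeeper's 10-digit, zero-padded sequence suffix (such as `lock-0000000005`); the function names are hypothetical, and the actual calls to create nodes and set watches are left to whichever client library is in use.

```python
from typing import List, Optional, Tuple


def sequence_of(node_name: str) -> int:
    # Sequential nodes end in a 10-digit, zero-padded counter.
    return int(node_name[-10:])


def next_action(my_node: str, children: List[str]) -> Tuple[str, Optional[str]]:
    """Decide whether we hold the lock or which node we should watch.

    Returns ("acquired", None) if my_node has the lowest sequence number,
    otherwise ("watch", predecessor) naming the node just ahead of ours.
    """
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(my_node)
    if index == 0:
        return ("acquired", None)
    return ("watch", ordered[index - 1])
```

For the children `lock-0000000005`, `lock-0000000006`, and `lock-0000000007`, the owner of `lock-0000000005` holds the lock, and the owner of `lock-0000000007` watches `lock-0000000006` only.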

etcd offers a model similar to ZooKeeper's and is often described as having a simpler API and good performance. In etcd, locks are built on leases: a lease carries a TTL (Time To Live), and any key attached to it lives only as long as the lease does. The client is granted a lease and writes a key (the lock) attached to it. When the lease expires or the client revokes it, the key is automatically deleted. This provides an automatic cleanup mechanism similar to ephemeral nodes in ZooKeeper. etcd's strong consistency model ensures that locks are managed reliably.

💡 Acquiring a Lock with etcd (Pseudo-Code)

Acquiring a lock in etcd means creating a lease and then atomically creating a key (the lock) attached to that lease.

Create Lease: A lease is created with a specific TTL.

Add Lock Key: In a single transaction, the client checks that the key carrying the name of the resource does not exist yet and, if so, writes it bound to the lease. The check and the write must be one atomic step; otherwise two clients could both see the key as missing and both write it.

Check: If the transaction succeeded, the lock is acquired. If the key already existed, another client holds the lock.

Keep Alive Lease: To keep holding the lock, the lease must be renewed regularly, before its TTL runs out.

Release Lock: The client deletes the key or revokes the lease. If the client dies instead, the lease expires and the key is automatically deleted.

This approach is more resilient to client-side failures because lock management is largely handled by etcd itself.
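The steps above can be illustrated with a toy in-memory model of lease-bound keys. To be clear, `LeaseStore` is a hypothetical class written for this sketch, not an etcd client or its API; it only mimics the semantics (grant, create-if-absent, keep-alive, expiry) with an explicit `now` argument so the behavior is easy to follow.

```python
class LeaseStore:
    """Toy model of lease-bound keys; NOT an etcd client."""

    def __init__(self):
        self._next_lease = 1
        self._ttl = {}      # lease id -> TTL
        self._expiry = {}   # lease id -> time at which the lease expires
        self._keys = {}     # key -> (value, lease id)

    def grant(self, ttl, now):
        """Step 1: create a lease with a TTL."""
        lease = self._next_lease
        self._next_lease += 1
        self._ttl[lease] = ttl
        self._expiry[lease] = now + ttl
        return lease

    def revoke(self, lease):
        """Step 5: dropping a lease deletes every key attached to it."""
        self._ttl.pop(lease, None)
        self._expiry.pop(lease, None)
        self._keys = {k: v for k, v in self._keys.items() if v[1] != lease}

    def _expire(self, now):
        for lease in [l for l, t in self._expiry.items() if t <= now]:
            self.revoke(lease)

    def put_if_absent(self, key, value, lease, now):
        """Steps 2-3: write the key only if it does not exist yet."""
        self._expire(now)
        if lease not in self._expiry or key in self._keys:
            return False
        self._keys[key] = (value, lease)
        return True

    def keep_alive(self, lease, now):
        """Step 4: renew the lease so the lock key stays alive."""
        self._expire(now)
        if lease not in self._expiry:
            return False
        self._expiry[lease] = now + self._ttl[lease]
        return True
```

In this model, a holder that stops calling `keep_alive` loses the key once `now` reaches the lease's expiry, at which point another client's `put_if_absent` succeeds.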

Criteria for Selecting a Distributed Lock

Which distributed locking mechanism we choose depends on the project requirements, existing infrastructure, and acceptable risks. Here are the key criteria to consider:

Consistency: How strong a consistency guarantee does your system require? Services like ZooKeeper and etcd usually provide strong consistency, while Redis replicates asynchronously by default, so it leans toward availability and a lock can be lost during a failover.
Durability: How important is the persistence of locks? Do locks need to survive a server failure, or are temporary locks sufficient?
Performance: How fast do lock acquisition and release operations need to be? Database-based solutions are usually slower, Redis is faster, and coordination services also offer high performance.
Simplicity and Ease of Management: How easy is it to integrate into the existing infrastructure? Setting up and managing an additional service brings extra overhead.
Cost: Setting up or managing additional infrastructure can mean extra costs.

Locking Mechanism	Consistency	Durability	Performance	Ease of Management	Prominent Use Cases
Database (Unique Constraint)	Medium	Low	Medium	High	Simple, single-server, or less critical applications
Database (Atomic Ops)	Medium	Medium	Medium	High	Database-based and medium-criticality applications
Redis (SET NX EX)	Medium	Low	High	Medium	High-throughput, short-term locks
Redis (Redlock)	Medium	Medium	High	Low	Redis-based systems requiring high availability
ZooKeeper / etcd	High	High	High	Low	Leader election, distributed coordination, complex sync

Example Scenario: In a production ERP system, we want to prevent multiple users from editing the same order at the same time in the shipping module. Here, the time it takes to acquire the lock is not critical, but once a lock is acquired, no one else should be able to modify that order. If the system crashes, the locks must be released somehow. In this scenario, a coordination service such as etcd or ZooKeeper is the most suitable choice, because it combines strong consistency with automatic lock release. Keying the lock on the order number keeps contention limited to users who are editing the same order.

Conclusion: The Importance of Choosing the Right Tool

Choosing the right locking mechanism in distributed systems has a direct impact on the reliability, performance, and manageability of the system. A simple database lock with unique constraints might be sufficient for a small application, while a robust solution like ZooKeeper or etcd is required for a large-scale financial transaction platform.

Every solution has its own trade-offs. Database-based locks eliminate the need for an additional service but can be limited in performance and durability. Redis-based solutions offer high performance but may require more complex algorithms like Redlock. ZooKeeper and etcd offer the strongest consistency and durability but are harder to manage. In my own experience, whenever a need for a locking mechanism arises, the first step is to understand how critical the problem is and how much consistency it requires. This will save you from both unnecessary complexity and an inadequate solution.