Distributed Locks vs. Leased Locks: The Right Choice in Resource Management

#tutorials #distributedsystems #resourcemanagement #concurrency

Distributed Locks and Leased Locks: Two Sides of Resource Management

In distributed systems, situations where multiple services try to access a shared resource simultaneously are inevitable. This can lead to data inconsistency and unexpected errors. This is precisely where various mechanisms come into play to control access to resources. In this article, I will delve deep into distributed locks and leased lock mechanisms. I will discuss which one to prefer in which scenario, their trade-offs, and practical application examples. By understanding the fundamental differences between these two approaches, we can enhance the reliability and performance of our systems.

These two mechanisms fundamentally solve the same problem: ensuring that a shared resource is used by only one process at a time. However, they differ significantly in how they implement this solution, the guarantees they offer, and their performance characteristics. These differences play a critical role in determining the correct usage scenarios. My goal is to clarify these differences with concrete examples, helping the reader choose the most suitable solution for their own systems.

Fundamentals and Implementation of Distributed Locks

Distributed locks are a mechanism used to coordinate concurrent access to a shared resource by multiple nodes or services. The basic idea is that only one client can hold a lock on a resource at any given time. This is typically achieved through a central lock manager or a distributed coordination service (e.g., ZooKeeper, etcd, or Redis). Before accessing the resource, a client requests the lock from this service. If the lock is available, the client acquires it and uses the resource. Once the operation is complete, it releases the lock.

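One of the most common implementations of this mechanism uses Redis's SET command with the NX (set only if the key does not exist) and EX (expiry in seconds) options, the modern replacement for the older SETNX command. In a single atomic step, the client creates the key representing the lock, sets its value to its own ID or another unique token, and attaches a TTL (Time To Live) so the lock is released automatically after a certain period. Doing this in one command matters: a separate existence check followed by a write, or a SETNX followed by a separate EXPIRE, leaves a window in which another client can slip in or a crash can leave the lock without a TTL. If the key already exists, the resource is being used by another client, and the requesting client must wait or retry. This simple yet effective method works in many scenarios.

# A client tries to acquire the lock (NX: only if the key is absent, EX: expire after 30 seconds)
SET mylock "client_id_123" NX EX 30

# If successful (returns OK), the lock is acquired
# If unsuccessful (returns nil), the lock is held by someone else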
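To make the acquire and release rules concrete, here is a minimal sketch in Python. It does not talk to Redis; `InMemoryLockStore` is a hypothetical, single-process stand-in that only imitates "set if absent, with expiry" and an owner-checked release, so the logic can be run and inspected without a server. The class name, method names, and the injectable clock are all assumptions made for illustration.

```python
import time
import uuid


class InMemoryLockStore:
    """Toy stand-in for a Redis-like server, for illustration only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner_token, expires_at)

    def acquire(self, key, owner_token, ttl_seconds):
        """Imitate 'set if absent, with expiry': succeed only if no live entry exists."""
        now = self._clock()
        entry = self._locks.get(key)
        if entry is not None and entry[1] > now:
            return False  # someone else holds a lock that has not expired
        self._locks[key] = (owner_token, now + ttl_seconds)
        return True

    def release(self, key, owner_token):
        """Delete the key only if it is live and still carries our token."""
        now = self._clock()
        entry = self._locks.get(key)
        if entry is None or entry[1] <= now or entry[0] != owner_token:
            return False
        del self._locks[key]
        return True


store = InMemoryLockStore()
token_a = str(uuid.uuid4())
token_b = str(uuid.uuid4())
print(store.acquire("mylock", token_a, ttl_seconds=30))  # first caller wins
print(store.acquire("mylock", token_b, ttl_seconds=30))  # second caller is refused
print(store.release("mylock", token_b))                  # a non-owner cannot release
print(store.release("mylock", token_a))                  # the owner releases
```

The owner check in `release` is the important design choice: without it, a client whose lock has already expired could delete a lock that now belongs to someone else.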

However, guaranteeing "only one client can hold the lock" in distributed systems is not that simple. Situations like network delays, service crashes, or clients unexpectedly shutting down can prevent locks from being released correctly. If the lock has no TTL, this can lead to "deadlock" situations, where a resource remains locked indefinitely and no client can access it. If the lock does have a TTL, the opposite risk appears: the lock can expire while its holder is still working. More advanced algorithms and protocols are used to overcome such problems.

ℹ️ Advantages of Distributed Locks

Simple Implementation: Can be easily implemented, especially with tools like Redis.

Strong Consistency: When implemented correctly, it offers strong mutual-exclusion guarantees for resource access.

Flexibility: Adaptable to various scenarios.

Another important aspect of distributed locks is the concept of "fairness." In the Redis example above, the lock goes to whichever client's SETNX-style request reaches Redis first after the key disappears, not necessarily to the client that has been waiting longest. This might not be fair in some cases. For example, if a service has been waiting for a long time and a newly arrived service keeps acquiring the lock, the waiting service might starve indefinitely. To solve this problem, tools like ZooKeeper provide lock recipes built on sequentially numbered ephemeral znodes, which offer "sequential access" guarantees. This works on a "first-come, first-served" principle.
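The first-come, first-served idea can be sketched as a ticket queue. The following is a single-process toy model, not ZooKeeper code: `FifoLock` and its methods are hypothetical names, and the sequence numbers merely play the role that sequential znode names play in a real deployment.

```python
import itertools


class FifoLock:
    """Toy model of a sequence-ordered lock queue (single process, no networking)."""

    def __init__(self):
        self._next_seq = itertools.count(1)
        self._waiting = {}  # sequence number -> client id

    def enqueue(self, client_id):
        """Register a request and return its ticket (sequence number)."""
        seq = next(self._next_seq)
        self._waiting[seq] = client_id
        return seq

    def holder(self):
        """The client with the lowest outstanding ticket holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def leave(self, seq):
        """Release the lock or abandon the queue; the next ticket takes over."""
        self._waiting.pop(seq, None)


lock = FifoLock()
t1 = lock.enqueue("service-a")
t2 = lock.enqueue("service-b")
t3 = lock.enqueue("service-c")
lock.leave(t1)        # service-a finishes
print(lock.holder())  # the longest-waiting remaining client now holds the lock
```

Because ownership is decided by ticket order rather than by who retries fastest, a late arrival cannot overtake a client that is already waiting.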

Leased Lock Mechanism: Reliability and Timing

A leased lock is another approach used for resource management in distributed systems. In this mechanism, a lock is "leased" only for a specific duration. When a client wants to acquire a lock, it requests the lock from the lock manager for a defined period (the lease duration). The lock manager assigns the lock to the client for this duration. When the time expires, the lock is automatically released, without the client needing to explicitly signal its release. This is particularly important when the client crashes unexpectedly or network connectivity is lost.

The most significant advantage of this mechanism is that the lock is automatically released when its duration expires. This greatly reduces the "deadlock" problem often encountered with distributed locks. Even if the client crashes, the resource becomes available again once the lease expires. This increases the overall availability of the system. ZooKeeper's ephemeral znodes or Consul's sessions work similarly to the leased lock logic.

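// Example Java code (sketch) - Leased lock with a Consul session
// Assumption: API names follow the orbitz consul-client library (com.orbitz.consul);
// adjust them for the client you actually use
Consul consul = Consul.builder().build();
SessionClient sessionClient = consul.sessionClient();
KeyValueClient kvClient = consul.keyValueClient();

// The session TTL acts as the lease: Consul invalidates the session if it is not renewed
Session sessionConfig = ImmutableSession.builder().name("my_resource_session").ttl("30s").build();
String sessionId = sessionClient.createSession(sessionConfig).getId();
String lockKey = "my_resource_lock";

// Try to acquire the lock
boolean acquired = kvClient.acquireLock(lockKey, sessionId);

if (acquired) {
    // Lock acquired, use the resource
    try {
        // ... use the resource ...
    } finally {
        // Release explicitly when done; if the client crashes before this point,
        // the lock is released when the session expires
        kvClient.releaseLock(lockKey, sessionId);
        sessionClient.destroySession(sessionId);
    }
} else {
    // Lock could not be acquired, wait or perform another action
    sessionClient.destroySession(sessionId);
}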

Another important aspect of the leased lock mechanism is the "lease renewal" mechanism. If the client wants to continue using the resource, it can request to renew the lock from the lock manager before the lease expires. This allows the client to use the resource securely for extended periods. If the client stops sending renewal requests, the lock is automatically released when the lease expires. This prevents resources from being unnecessarily occupied for too long.
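A minimal sketch of the acquire, renew, and expire cycle is shown below. `LeaseManager` is a hypothetical in-process model with an injectable clock, not the API of any real lock service; it exists only to show that a renewal succeeds while the lease is live and fails once it has lapsed.

```python
import time


class LeaseManager:
    """Toy single-process lease manager with an injectable clock."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # resource -> (holder, expires_at)

    def _live_holder(self, resource):
        entry = self._leases.get(resource)
        if entry is None or entry[1] <= self._clock():
            return None  # no lease, or the lease has expired
        return entry[0]

    def acquire(self, resource, holder, lease_seconds):
        """Grant the lease if it is free, expired, or already held by the caller."""
        if self._live_holder(resource) not in (None, holder):
            return False
        self._leases[resource] = (holder, self._clock() + lease_seconds)
        return True

    def renew(self, resource, holder, lease_seconds):
        """Extend the lease, but only while the caller still holds it."""
        if self._live_holder(resource) != holder:
            return False  # lease already expired or owned by someone else
        self._leases[resource] = (holder, self._clock() + lease_seconds)
        return True


now = [0.0]  # fake clock so the example is deterministic
manager = LeaseManager(clock=lambda: now[0])
manager.acquire("report-job", "worker-1", lease_seconds=10)
now[0] = 8.0
print(manager.renew("report-job", "worker-1", lease_seconds=10))    # renewed in time
now[0] = 19.0
print(manager.renew("report-job", "worker-1", lease_seconds=10))    # too late, lease lapsed
print(manager.acquire("report-job", "worker-2", lease_seconds=10))  # another client takes over
```

A client that finds its renewal rejected should stop using the resource immediately, because another client may already hold the lease.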

⚠️ Considerations for Leased Locks

Determining Lease Duration: Setting the correct lease duration is critical. Too short a duration can lead to frequent renewal traffic, while too long a duration can cause resources to be unnecessarily occupied.

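Clock Skew: A lease is only safe if the client and the lock manager agree on when it expires. Clock skew (unsynchronized clocks) or a clock that jumps can make a client believe it still holds a lease that the manager has already handed to someone else. Synchronization mechanisms like NTP help, as does measuring lease time with a monotonic clock and treating the lease as expired slightly early.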

Consul's sessions manage this lease mechanism quite elegantly. When a session is created, it has a specific lifespan. Locks acquired with this session are automatically released when the session ends (e.g., when the client crashes or network connectivity is lost). The client must send regular "heartbeats" to keep the session alive. This both increases reliability and simplifies lock management.

Distributed Locks vs. Leased Locks: Comparison and Trade-offs

The fundamental difference between distributed locks and leased lock mechanisms lies in how the lock is released. In distributed locks, releasing the lock is typically the client's responsibility. The client must explicitly release the lock when its operation is complete. In leased locks, the lock is automatically released when a predetermined duration expires. This fundamental difference determines the advantages, disadvantages, and appropriate usage scenarios for the two mechanisms.

Distributed locks can be preferred, especially in scenarios requiring strong consistency. For example, when performing a write operation to a database, it is important to explicitly release the lock to ensure the operation is complete. This approach is more suitable for situations requiring "pessimistic locking" rather than "optimistic locking." However, the risk of deadlock is higher in case of client crashes or network issues.

💡 Usage Scenarios

Distributed Locks: Situations requiring strong consistency, such as database transactions, financial transactions, and critical data updates.

Leased Locks: Situations with defined durations and potential client crashes, such as scheduled jobs, distributed job queues, temporary resource allocation, and service discovery.

Leased locks, on the other hand, are ideal for scenarios requiring fault tolerance and automatic recovery. For example, if a lock is acquired for processing a task in a distributed job queue, and the processor crashes, the leased lock mechanism ensures the lock is automatically released, allowing another processor to pick up the task. This increases the overall resilience of the system. However, if the lease duration is not set correctly, resources might remain occupied for longer than necessary, or too frequent renewal traffic might occur.
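The job queue scenario can be sketched as follows. `LeasedJobQueue` is a hypothetical in-memory model (the class, its methods, and the fake clock are assumptions for illustration): a claimed job becomes claimable again once the claimant's lease lapses, and a completion report from a worker that lost its lease is rejected.

```python
import time


class LeasedJobQueue:
    """Toy job queue in which every claim is a time-limited lease."""

    def __init__(self, jobs, lease_seconds, clock=time.monotonic):
        self._clock = clock
        self._lease_seconds = lease_seconds
        self._claims = {job: None for job in jobs}  # job -> (worker, expires_at) or None
        self._done = set()

    def claim(self, worker):
        """Hand out the first job that is unclaimed or whose lease has lapsed."""
        now = self._clock()
        for job, claim in self._claims.items():
            if job in self._done:
                continue
            if claim is None or claim[1] <= now:
                self._claims[job] = (worker, now + self._lease_seconds)
                return job
        return None

    def complete(self, job, worker):
        """Accept completion only from the worker whose lease is still live."""
        claim = self._claims.get(job)
        if claim is None or claim[0] != worker or claim[1] <= self._clock():
            return False
        self._done.add(job)
        return True


now = [0.0]  # fake clock so the example is deterministic
queue = LeasedJobQueue(["resize-image-42"], lease_seconds=5, clock=lambda: now[0])
print(queue.claim("worker-1"))  # worker-1 takes the job, then crashes
print(queue.claim("worker-2"))  # nothing available while the lease is live
now[0] = 6.0
print(queue.claim("worker-2"))  # lease lapsed, worker-2 picks up the same job
```

No operator has to notice the crash or clear the lock by hand; the cost is that the job waits up to one lease duration before another worker can take it.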

For example, in a distributed system, I once encountered a WAL rotation error at 03:14, which indicated that a database lock was not released correctly. In such a scenario, a leased lock mechanism could have helped resolve the issue faster by ensuring the lock was automatically released. With distributed locks, such situations might require manual intervention or the design of complex "liveness" checks.

Reliable Distributed Lock Implementations: Redlock and Fencing Tokens

When it comes to distributed locks, the simple SETNX method implemented with tools like Redis has some serious drawbacks. Network delays and Redis instance failures can undermine the reliability of locks. Several algorithms have been developed to overcome these problems, and the Redlock algorithm is one of the best-known examples.

Redlock uses multiple independent Redis instances to create distributed locks. A client must obtain approval from a majority of these instances to acquire the lock. If it receives approval from a sufficient number of instances, it assumes the lock has been successfully acquired. This approach aims to increase lock reliability even if a single Redis instance crashes. However, the Redlock algorithm itself is open to criticism, and its implementation is quite complex.

# Redlock-like logic (simplified sketch, not a complete Redlock implementation)
import time

import redis

# Delete the key only if it still holds our client ID (atomic check-and-delete)
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def acquire_distributed_lock(redises, resource_name, client_id, ttl_seconds):
    success_count = 0
    lock_start_time = time.monotonic()

    for r in redises:
        try:
            if r.set(resource_name, client_id, nx=True, ex=ttl_seconds):
                success_count += 1
        except redis.RedisError:
            # An unreachable instance simply does not count toward the majority
            pass

    elapsed_time = (time.monotonic() - lock_start_time) * 1000  # milliseconds

    # The lock is valid only if a majority granted it and acquiring it
    # took less time than the lock's TTL
    if success_count >= (len(redises) // 2 + 1) and elapsed_time < ttl_seconds * 1000:
        return True

    # If failed, clean up any locks we did acquire
    for r in redises:
        try:
            r.eval(RELEASE_SCRIPT, 1, resource_name, client_id)
        except redis.RedisError:
            pass
    return False

# Usage
redis_instances = [redis.Redis(host=f'redis{i}') for i in range(5)]
if acquire_distributed_lock(redis_instances, "my_shared_resource", "my_client_1", 30):
    print("Lock acquired!")
else:
    print("Failed to acquire lock.")
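The elapsed-time check in the function above only tells us whether the lock had already expired by the time acquisition finished. A slightly more careful approach is to compute how much trustworthy time remains, subtracting both the acquisition time and an allowance for clock drift. The sketch below shows that arithmetic; the function name is hypothetical, and the drift allowance (1% of the TTL plus 2 ms) is an assumed value chosen for illustration rather than a recommendation.

```python
def remaining_validity_ms(ttl_ms, elapsed_ms, drift_factor=0.01, fixed_drift_ms=2):
    """Time for which the lock can still be trusted after acquisition.

    ttl_ms:         TTL requested for the lock on every instance
    elapsed_ms:     time spent contacting the instances
    drift_factor:   assumed fraction of the TTL lost to clock drift
    fixed_drift_ms: assumed constant allowance for timer granularity
    """
    drift_ms = ttl_ms * drift_factor + fixed_drift_ms
    return ttl_ms - elapsed_ms - drift_ms


validity = remaining_validity_ms(ttl_ms=30_000, elapsed_ms=150)
if validity > 0:
    print(f"Lock usable for roughly {validity:.0f} ms")
else:
    print("Lock expired during acquisition; release it and retry")
```

The client should plan to finish its work, or renew the lock, within this remaining window rather than within the full TTL.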

🔥 Criticisms of Redlock

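Clock Skew Issues: Redlock's safety depends on timing assumptions: the clocks of the client and the Redis instances must advance at roughly the same rate and must not jump. If a clock jumps or drifts, a key can expire early on some instances, and two clients can each believe they hold the lock.

Network Delays and Pauses: A long network delay or process pause (for example, a garbage collection pause) after acquisition can outlast the lock's TTL, so a client may keep working under a lock that has already expired.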

Complexity: The algorithm is quite difficult to implement and manage correctly.

Another important safeguard for distributed locks is the use of "fencing tokens." A fencing token is a number that increases monotonically with each lock acquisition. When a client acquires a lock, it also receives this token and sends it along with every request to the resource. The resource remembers the highest token it has seen and rejects any request that carries a lower one, since a lower token means a newer lock has been granted in the meantime. This mechanism prevents clients from accessing the resource using old locks. For example, on April 28th, when a system's disk was 100% full, the client holding the lock could not receive a new request. In this situation, if fencing tokens had been used, the client with the old lock would not have been able to perform new operations.
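The token check can be sketched in a few lines. Both classes below are hypothetical, in-process stand-ins: `LockService` represents whatever issues the lock and its token, and `FencedStorage` represents the protected resource. In a real system the token would typically come from the coordination service itself, and the check would live inside the storage layer.

```python
import itertools


class LockService:
    """Hands out a strictly increasing token with every lock grant."""

    def __init__(self):
        self._counter = itertools.count(1)

    def grant(self):
        return next(self._counter)


class FencedStorage:
    """Resource that remembers the highest token seen and rejects older ones."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self._highest_token:
            return False  # stale lock holder: a newer grant has already written
        self._highest_token = token
        self.data[key] = value
        return True


locks = LockService()
storage = FencedStorage()

old_token = locks.grant()  # client 1 acquires the lock, then stalls past its expiry
new_token = locks.grant()  # client 2 acquires the lock after expiry

print(storage.write(new_token, "stock", 90))  # accepted
print(storage.write(old_token, "stock", 75))  # rejected: token is stale
print(storage.data["stock"])                  # client 2's value is preserved
```

The key property is that correctness no longer depends on the stalled client noticing that its lock expired; the resource itself refuses the outdated write.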

Distributed Locks and Leased Locks: Selection Criteria

Which mechanism we choose largely depends on the nature of the problem we face and the requirements of our system. As a general rule, distributed locks can be preferred in situations where strong consistency is critical and the risk of client crashes is low. On the other hand, leased locks are a more suitable solution in scenarios where fault tolerance and automatic recovery are priorities, and clients might disappear unexpectedly.

If data consistency is of the highest priority in your system and you are performing irreversible operations such as financial transactions, you might consider distributed lock mechanisms built on more reliable coordination services like ZooKeeper or etcd. These services are built on consensus protocols, so they typically offer stronger consistency guarantees, and they support additional safety mechanisms like fencing tokens. For example, such mechanisms are vital for critical operations like stock updates in a production ERP.

ℹ️ Questions to Ask for the Right Choice

How critical is data consistency in my system?

How high is the probability of my clients crashing or experiencing network issues?

How precisely must the lock release be timed?

Is the complexity of the solution I will use manageable?

Which services are available in my existing infrastructure (e.g., Redis cluster, ZooKeeper)?

On the other hand, if your system needs to continue operating even when a processor is lost or network connectivity is interrupted, leased lock mechanisms would be a more logical choice. For example, in situations like a background task running in a mobile application or processing data from an IoT device, the automatic release of the lock when the lease expires offers a significant advantage. In an Android spam blocker application I developed, I sometimes encounter unexpected termination of background services. In such cases, the leased lock logic would prevent the resource from being unnecessarily locked.

The choice between these two mechanisms is not just a technical preference but also a strategic decision that directly affects the operational costs and risks of the system. By carefully analyzing the application's requirements and understanding the trade-offs of both approaches, we can make the most informed decision.

Conclusion and Next Steps

Resource management in distributed systems is a complex and critical topic. Distributed locks and leased locks are two fundamental approaches used to overcome these challenges. Distributed locks offer strong consistency guarantees, while leased locks excel in fault tolerance and automatic recovery capabilities. Which mechanism is more suitable depends on the project's specific requirements, client behavior, and fault tolerance expectations.

In summary, if an operation absolutely must be completed and the resource fully released, distributed locks might be more appropriate. However, if there's a high probability of clients disappearing and overall system availability is a priority, leased locks are a better option. Whichever method you choose, it's important to design carefully, considering edge cases like network delays, clock synchronization, and client crash scenarios.

💡 Next Steps

Identify which resources need protection in your systems.

Decide whether strong consistency or fault tolerance is more important for these resources.

Evaluate the advantages and disadvantages of your chosen mechanism (distributed lock or leased lock) according to your project.

If possible, review my previous article where I shared my experiences on [related: Distributed Lock Management on Kubernetes].

I hope the information in this article helps you choose the right resource management mechanism for your distributed systems. Remember, the best solution is the one that best fits your project's specific needs.