When building distributed systems or breaking down an existing monolithic system, managing concurrently accessed shared resources has always been a headache. Specifically, multiple services attempting to write to the same resource simultaneously can lead to data inconsistencies and unexpected errors. I've encountered such scenarios many times, especially during stock updates or order processing steps in a production ERP system.
To manage these complex situations, distributed lock mechanisms come into play. A simple mutex or semaphore on a single server is insufficient because our application might be running on multiple servers. A distributed lock ensures that only one of these different processes can access a specific resource at any given time. In this post, I will delve into three fundamental distributed lock alternatives that I have used frequently in practice and that every system architect should know.
What is a Distributed Lock and Why Do We Need It?
A distributed lock is a synchronization mechanism used in distributed systems to control concurrent access to a shared resource by multiple processes or applications. Simply put, the first process that wants to perform an operation on a resource (e.g., a database row, a file, or an API call) requests a lock for that resource. Once locked, other processes must wait until the lock is released or take an alternative path.
Let's consider a scenario I frequently encountered in a production ERP system: two different operators might update the stock quantity of the same product simultaneously. If we perform this operation without a distributed lock, both operators read the current stock quantity, make their own updates, and write to the database. As a result, the changes made by the first operator might be overwritten by the second, leading to an incorrect stock quantity. Such data inconsistencies pose unacceptable risks, especially in areas like financial transactions or critical inventory management.
ℹ️ Race Condition Risks
Without distributed locks, "race conditions" in distributed systems can lead to data inconsistency, incorrect calculations, and even unexpected system crashes. Therefore, synchronizing access to shared resources is critically important.
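To see the lost update concretely, here is a plain-Python illustration of the stock scenario above (not real ERP code; the product key and quantities are made up): both operators read the same quantity, then write back their own results, and the first write is silently lost.

```python
stock = {"product_42": 100}

# Both operators read the current quantity before either one writes
read_by_op1 = stock["product_42"]
read_by_op2 = stock["product_42"]

# Operator 1 ships 10 units; operator 2 receives 30 units
stock["product_42"] = read_by_op1 - 10   # writes 90
stock["product_42"] = read_by_op2 + 30   # writes 130, overwriting the -10

print(stock["product_42"])  # 130, but the correct quantity is 120
```

With a distributed lock, the second operator's read would be forced to wait until the first operator's read-modify-write cycle had completed, yielding the correct 120.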
While mechanisms like threading.Lock or sync.Mutex on a single server provide synchronization among threads within the same process, distributed locks ensure synchronization among applications running on different servers or in different processes. Consequently, distributed lock mechanisms have much more complex requirements regarding fault tolerance, performance, and reliability. An incorrect distributed lock implementation can severely degrade system performance or, even worse, lead to "deadlock" situations. Once, in a client project, a faulty lock mechanism caused the entire order processing queue to stall, resulting in hundreds of thousands of liras in daily losses.
Method 1: Database-Based Distributed Locks
Databases, by virtue of already offering transaction and consistency guarantees, are a natural candidate for distributed lock implementations. This method is generally preferred in less complex distributed systems or applications that already perform intensive operations on the database. I used this method for the financial calculators of my side product because we were already working with PostgreSQL and I didn't want additional infrastructure costs or complexity.
Reliable Locks with PostgreSQL Advisory Locks
PostgreSQL offers special "advisory lock" functions like pg_advisory_lock() and pg_advisory_xact_lock(). These locks differ from standard row or table locks; instead of being tied to a specific data row, they operate on an arbitrary numeric key that you define. This allows your application to lock a specific logical resource, even if that resource doesn't physically exist in the database. The pg_advisory_xact_lock() function holds the lock until the current transaction ends and automatically releases it when the transaction is committed or rolled back. This is a significant advantage, especially as it guarantees lock cleanup in error situations.
Advantages:
- Transactional: The biggest advantage is that locks work integrated with database transactions. If an operation fails or is rolled back, the lock is automatically released, which reduces the risk of "deadlock."
- Easy Implementation: Uses existing database infrastructure, no need to set up an additional service.
- ACID Guarantees: Benefits from the consistency guarantees provided by the database.
Disadvantages:
- Database Load: Intensive lock acquisition and release operations can create additional load on the database server. There are limits in terms of scalability.
- Single Point of Failure (SPOF): If the database server crashes or becomes inaccessible, the lock mechanism also stops working. Replication and failover mechanisms can reduce this risk but not eliminate it entirely.
- Network Latency: Since lock acquisition/release operations occur over the network, performance issues can arise in high-latency networks.
Here's a simple Python example demonstrating PostgreSQL advisory lock usage:
```python
import time

import psycopg2

# PostgreSQL connection settings
DB_CONFIG = {
    "host": "localhost",
    "database": "mydatabase",
    "user": "myuser",
    "password": "mypassword",
}

def get_db_connection():
    return psycopg2.connect(**DB_CONFIG)

def acquire_lock(conn, lock_id):
    """Acquires an advisory lock for the given lock_id.

    pg_advisory_xact_lock blocks until the lock is available and then
    holds it until the current transaction ends."""
    with conn.cursor() as cur:
        cur.execute("SELECT pg_advisory_xact_lock(%s);", (lock_id,))
    print(f"[{time.time():.2f}] Lock {lock_id} acquired.")
    return True

# Note: pg_advisory_xact_lock() is released automatically when the transaction
# ends (commit or rollback). If pg_advisory_lock() were used instead, an
# explicit pg_advisory_unlock() call would be required.

def perform_critical_operation(lock_id, duration):
    conn = None
    try:
        conn = get_db_connection()
        conn.autocommit = False  # advisory xact locks live inside a transaction
        acquire_lock(conn, lock_id)
        print(f"[{time.time():.2f}] Starting critical section (lock {lock_id}).")
        time.sleep(duration)  # simulate the critical work
        print(f"[{time.time():.2f}] Critical section finished (lock {lock_id}).")
        conn.commit()  # releases the lock
    except Exception as e:
        print(f"[{time.time():.2f}] Error: {e}")
        if conn:
            conn.rollback()  # also releases the lock
    finally:
        if conn:
            conn.close()

# Example usage: run in two terminals at the same time; the second one waits.
# python -c "from your_module import perform_critical_operation; perform_critical_operation(12345, 5)"
# python -c "from your_module import perform_critical_operation; perform_critical_operation(12345, 5)"
```
In this example, the perform_critical_operation function simulates a critical operation with a lock_id. If another process holds the lock with the same lock_id, the current process will wait. The lock is automatically released when the transaction is committed or rolled back. This simplicity is a major advantage, especially for less complex systems.
Method 2: Redis-Based Distributed Locks
Redis, being an in-memory data structure server and supporting atomic operations, is a popular choice for distributed lock implementations. I've frequently turned to Redis for situations requiring high-performance and low-latency lock mechanisms. I've used Redis locks extensively in the backend of a task management application I built for my own site, or to synchronize real-time data updates on operator screens in a manufacturing company.
Redis Locks with SET NX PX and Redlock Controversies
The foundation of distributed lock implementation in Redis is the SET key value NX PX milliseconds command. This command is atomic and means:
- `NX`: set the value only if the key does **N**ot e**X**ist.
- `PX milliseconds`: make the key expire automatically after the given number of milliseconds (the lease time).
This way, when a process tries to acquire a lock, it gets it if the lock doesn't exist and sets it to be automatically released after a certain period. This "lease" time prevents a lock from being held indefinitely if a process crashes after acquiring it. After acquiring the lock, it's crucial for the process to set a unique value (e.g., a UUID). When releasing the lock, it should only release it if it matches its own value. This prevents another process from accidentally (or maliciously) releasing another process's lock. This check must be performed atomically with a Lua script.
```python
import time
import uuid

import redis

# Redis connection settings
REDIS_HOST = "localhost"
REDIS_PORT = 6379

def get_redis_client():
    return redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT, db=0)

def acquire_lock(client, lock_name, acquire_timeout=10, lock_timeout=10):
    """
    Tries to acquire a distributed lock from Redis.
    :param client: Redis client.
    :param lock_name: name of the resource to lock.
    :param acquire_timeout: maximum time to wait for the lock (seconds).
    :param lock_timeout: how long the lock is held before expiring (seconds).
    :return: the unique lock token on success, otherwise None.
    """
    identifier = str(uuid.uuid4())
    end_time = time.time() + acquire_timeout
    while time.time() < end_time:
        if client.set(lock_name, identifier, nx=True, px=int(lock_timeout * 1000)):
            print(f"[{time.time():.2f}] Lock '{lock_name}' acquired. Token: {identifier}")
            return identifier
        time.sleep(0.01)  # short back-off before retrying
    print(f"[{time.time():.2f}] Could not acquire lock '{lock_name}'.")
    return None

def release_lock(client, lock_name, identifier):
    """
    Releases a distributed lock in Redis.
    Ensures that only the process that acquired the lock can release it
    (checked and deleted atomically).
    """
    # The Lua script makes the GET and DEL commands atomic
    lua_script = """
    if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
    else
        return 0
    end
    """
    script = client.register_script(lua_script)
    if script(keys=[lock_name], args=[identifier]):
        print(f"[{time.time():.2f}] Lock '{lock_name}' released. Token: {identifier}")
        return True
    print(f"[{time.time():.2f}] Could not release lock '{lock_name}' (token mismatch or already expired).")
    return False

def perform_critical_operation_redis(lock_name, duration):
    client = get_redis_client()
    identifier = None
    try:
        identifier = acquire_lock(client, lock_name, acquire_timeout=5, lock_timeout=5)
        if identifier:
            print(f"[{time.time():.2f}] Starting critical section (Redis lock '{lock_name}').")
            time.sleep(duration)  # simulate the critical work
            print(f"[{time.time():.2f}] Critical section finished (Redis lock '{lock_name}').")
        else:
            print(f"[{time.time():.2f}] Skipping critical section, lock not acquired.")
    except Exception as e:
        print(f"[{time.time():.2f}] Error: {e}")
    finally:
        if identifier:
            release_lock(client, lock_name, identifier)

# Example usage: run in two terminals at the same time; the second one waits.
# python -c "from your_module import perform_critical_operation_redis; perform_critical_operation_redis('my_resource', 5)"
# python -c "from your_module import perform_critical_operation_redis; perform_critical_operation_redis('my_resource', 5)"
```
Advantages:
- High Performance: Since Redis operates in-memory, lock acquisition/release operations are very fast.
- Flexibility: automatic expiry via `PX` prevents crashed processes from holding locks indefinitely.
- Atomic Operations: the `SET NX PX` command and Lua scripts provide atomic guarantees.
Disadvantages:
- Redlock Controversies: Running Redis on a single instance carries an SPOF risk. While Redis Cluster mitigates this, the "Redlock" algorithm (acquiring locks on multiple Redis instances) has sparked serious debate among distributed system experts regarding its safety and consistency. In my experience, using Redlock without truly understanding its promised guarantees can lead to headaches.
- Lease Time Management: Correctly setting the lock's duration is crucial. If it's too short, the lock might be released before the operation finishes; if too long, a crashed process will hold the lock unnecessarily.
- Network Partition: In a network partition scenario, if one Redis instance becomes isolated from others, multiple locks might be acquired for the same resource.
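To make the lease-time pitfall concrete, here is a minimal in-process simulation of the `SET NX PX` semantics (no real Redis; the `FakeRedis` class and the manual clock are illustrative assumptions): client A's work takes longer than its lock timeout, so the key expires and client B acquires the "same" lock while A is still inside the critical section.

```python
class FakeRedis:
    """Tiny stand-in for SET NX PX semantics, driven by a manual clock."""
    def __init__(self):
        self.store = {}   # key -> (value, expiry_time)
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds
        # drop expired keys, as Redis would
        self.store = {k: v for k, v in self.store.items() if v[1] > self.now}

    def set_nx_px(self, key, value, px_ms):
        if key in self.store:
            return False
        self.store[key] = (value, self.now + px_ms / 1000)
        return True

r = FakeRedis()

# Client A acquires the lock with a 5-second lease, but its work takes 8 seconds
assert r.set_nx_px("stock:42", "token-A", 5000)
holders = ["A"]

r.advance(6)  # A is still working at t=6, but its lease expired at t=5

# Client B now succeeds even though A never released the lock
if r.set_nx_px("stock:42", "token-B", 5000):
    holders.append("B")

print(holders)  # → ['A', 'B']: both clients believe they hold the lock
```

This is exactly why the unique token check on release matters: when A finally finishes, the token-guarded Lua script above refuses to delete B's lock.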
⚠️ Beware of Redlock!
The Redlock algorithm has been heavily criticized within the distributed systems community. Using it in production without a deep understanding of its consistency and safety guarantees can be risky. While a single Redis instance might suffice for simple use cases, more robust alternatives should be considered for situations requiring high consistency.
Method 3: Coordination Services like Zookeeper / etcd
Distributed coordination services such as Apache Zookeeper and etcd offer highly reliable solutions designed for distributed lock implementations. These services handle complex tasks in distributed systems like consensus, leader election, and configuration management. While managing such an infrastructure at a major Turkish e-commerce site, I witnessed how critical Zookeeper was for distributed locks. They are indispensable, especially in microservice architectures, for inter-service synchronization and access to shared resources.
The Power of Consensus Protocols: Zookeeper/etcd
Zookeeper and etcd use consensus algorithms like Paxos or Raft to guarantee data consistency across all nodes in a cluster. In distributed lock implementations, these services typically use the concepts of "ephemeral nodes" and "sequential nodes."
- Ephemeral Node: Created when a client connects and automatically deleted when the client's connection is lost. This prevents a crashed client from holding the lock indefinitely.
- Sequential Node: When created, an automatically incrementing number is appended to its name. This is used to manage the lock queue.
When a client wants to acquire a lock, it creates an ephemeral and sequential node under a specific lock path. The name of the created node will be something like lock-00000001, lock-00000002. Then, the client checks if its own created node is the smallest sequential node under the same path. If it is the smallest, it has acquired the lock. If not, it starts watching the node immediately preceding it. When the preceding node is deleted (i.e., the client holding the lock finishes its work or crashes), the watching client checks again and acquires the lock if its turn has come. This mechanism ensures "fairness" (an equitable queue) and prevents the "thundering herd" problem (all waiting clients retrying simultaneously).
Advantages:
- High Reliability and Consistency: Thanks to consensus algorithms, data consistency and fault tolerance are very high. The system can continue to operate even if a majority of nodes crash (as long as a quorum is maintained).
- Leader Election: They can also handle complex tasks like leader election in distributed systems, providing a robust foundation upon which lock mechanisms can be built.
- Fair Lock Ordering: Thanks to sequential nodes, clients waiting for a lock are processed in a fair order.
Disadvantages:
- Operational Complexity: Setting up, managing, and monitoring a Zookeeper or etcd cluster is more complex than Redis or database-based solutions. It requires additional infrastructure and operational expertise.
- Performance: Due to the nature of consensus algorithms, lock acquisition/release operations might have higher latency compared to Redis.
- Infrastructure Cost: May require additional servers and resources.
A lock implementation with Zookeeper or etcd is quite detailed, and rather than providing all the code here, I will summarize the basic flow:
- Acquiring a Lock:
  - The client creates an `EPHEMERAL_SEQUENTIAL` node under `/locks/resource_name/`, for example `/locks/resource_name/lock-00000001`.
  - The client lists all child nodes under `/locks/resource_name/`.
  - If its own node has the smallest sequence number, it has acquired the lock.
  - Otherwise, it watches the node immediately preceding its own and re-checks when that node is deleted.
- Releasing a Lock:
  - The client deletes its own node (or the node is removed automatically when the client's session ends), which notifies the next client in the queue.
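The queue logic of this flow can be sketched as an in-process simulation (no real Zookeeper; the `LockPath` class is a stand-in for the parent znode, and the node names are illustrative): the holder is always the client with the smallest sequence number, and deleting a node wakes the next waiter in creation order.

```python
class LockPath:
    """Simulates a lock parent znode handing out sequential child names."""
    def __init__(self):
        self.counter = 0
        self.children = []  # node names currently present, e.g. "lock-00000001"

    def create_sequential(self):
        # Zero-padded counter mimics Zookeeper's sequential node suffix
        self.counter += 1
        name = f"lock-{self.counter:08d}"
        self.children.append(name)
        return name

    def holder(self):
        # The smallest sequence number holds the lock
        return min(self.children) if self.children else None

    def delete(self, name):
        # Deleting a node "releases" the lock and wakes the next waiter
        self.children.remove(name)

path = LockPath()
a = path.create_sequential()  # client A -> "lock-00000001"
b = path.create_sequential()  # client B -> "lock-00000002"
c = path.create_sequential()  # client C -> "lock-00000003"

assert path.holder() == a      # A holds the lock; B watches A, C watches B
path.delete(a)                 # A finishes (or its session dies)
assert path.holder() == b      # B is next, in creation order: a fair queue
path.delete(b)
assert path.holder() == c
```

Because each client watches only its immediate predecessor, a release notifies exactly one waiter, which is what avoids the thundering herd problem mentioned above. In a real application, a maintained client library such as kazoo for Zookeeper implements this recipe for you.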