The Hidden Costs of Distributed Lock Alternatives and Their Impact on

#distributedsystems #learning #productivity

In distributed systems, there are times when we need to ensure that only one process can use a resource at a time. This is one of the most fundamental concurrency problems, and the first solution that comes to mind is a "lock." However, managing these locks in a distributed environment introduces far more costs than we might initially think. Over the years, I've repeatedly seen how these locks impact our lives, our projects, and our wallets. I've faced these costs whether I was performing a critical stock update in a production ERP or trying to guarantee that a background job in my own side project ran uniquely. Every choice has a price, and these prices can sometimes be overlooked.

Simple File Locks and Their Hidden Costs

In its simplest form, using a file lock comes to mind. Solutions based on flock or mkdir seem to work in single-server environments. I also started with a method like mkdir /var/lock/myapp.lock && trap 'rmdir /var/lock/myapp.lock' EXIT to prevent a script from running twice simultaneously. This is not bad for a simple cron job. Even simpler, just writing and checking a PID file:

#!/bin/bash
LOCK_FILE="/var/run/myapp.pid"

if [ -f "$LOCK_FILE" ]; then
    PID=$(cat "$LOCK_FILE")
    if ps -p $PID > /dev/null; then
        echo "Application is already running (PID: $PID)."
        exit 1
    else
        echo "Cleaning up previous lock file (PID: $PID not found)."
        rm -f "$LOCK_FILE"
    fi
fi

echo $$ > "$LOCK_FILE"
trap "rm -f $LOCK_FILE; exit" INT TERM EXIT

echo "Application started..."
# Application logic goes here
sleep 60
echo "Application finished."

However, as the system becomes distributed, the costs of these approaches quickly escalate. To access the same LOCK_FILE on multiple servers, you'd need to use a shared file system like NFS. NFS itself is a source of complexity: network latencies, server crashes, file system consistency issues. Suppose the NFS server holding the lock file crashes. What happens? All your applications might freeze, or conversely, if the lock file isn't detected, they might all start running simultaneously, leading to data corruption.

In one scenario that happened to me, around 3:00 AM one night, a reporting service couldn't create the lock file due to an NFS connection drop. As a result, the reporting scripts on two servers ran simultaneously and tried to process the same data. Reports were duplicated, and some critical business processes (which were related to invoicing) continued with incorrect data. I spent hours debugging this error. A simple file lock, in a distributed system, returns as significant operational overhead, erroneous data, and time loss as a "hidden cost." Moreover, the atomicity and crash resilience of such locks are often weak, making them risky for production environments.

Database-Backed Locks: Convenience and Scalability Hurdles

A step further are database-backed locks. It's possible to create a distributed lock mechanism using SELECT FOR UPDATE or a dedicated lock table in a database like PostgreSQL. This method is much more reliable than file locks in terms of atomicity, as it leverages the database's transaction capabilities. In an ERP project, I used this method to prevent multiple operators from interfering simultaneously while updating the status of a specific order. More sophisticated functions like pg_advisory_lock are also available.

-- Simple lock with a dedicated lock table
CREATE TABLE distributed_locks (
    lock_name VARCHAR(255) PRIMARY KEY,
    locked_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    locked_by VARCHAR(255)
);

-- Attempting to acquire a lock
BEGIN;
SELECT * FROM distributed_locks WHERE lock_name = 'my_critical_resource' FOR UPDATE;
-- If the row doesn't exist, insert it
INSERT INTO distributed_locks (lock_name, locked_by) VALUES ('my_critical_resource', 'app_instance_1') ON CONFLICT (lock_name) DO NOTHING;
-- If the insert was successful, the lock was acquired. Otherwise, someone else took it.
COMMIT;

While this method seems more robust due to transaction guarantees, it comes with its own costs. Firstly, your database becomes a single lock server. Under high contention, all your applications will load onto this single point, and database performance will degrade. Connection pools will be exhausted, queries will be queued, and latencies will increase. In early 2023, in a client project, during a busy period (allocation of a product's serial number), the database CPU usage spiked to 90% due to this type of lock. Yet, there were only a few hundred transactions at that moment. The problem was the constant retries for the lock and the transaction overhead.

Secondly, in case of database crashes or network issues, the lock state can become uncertain. If a transaction remains open, the lock might be held indefinitely, blocking all other operations. You'll need to develop complex application logic or additional monitoring systems to track and automatically clean up such situations, which means additional development and operational costs. Furthermore, even with database replication, the lock is typically held on the primary instance, which can lead to lock mechanism restarts or consistency issues during failover. All these extra burdens show how expensive the initially "easy" solution can be in the long run.

Redis Locks and the Need for "Fencing Tokens"

Redis has become a popular choice for distributed locks. We can create a lock mechanism by setting a specific key only once (NX) and for a certain duration (PX) using the command SET resource_name my_random_value NX PX 10000. The my_random_value serves as a "fencing token" representing the lock owner's identity, ensuring only the client that acquired the lock can release it.

import redis
import uuid
import time

r = redis.Redis(host='localhost', port=6379, db=0)

def acquire_lock(lock_name, acquire_timeout=10, lock_timeout=5):
    identifier = str(uuid.uuid4())
    end_time = time.time() + acquire_timeout
    while time.time() < end_time:
        if r.set(lock_name, identifier, nx=True, px=int(lock_timeout * 1000)):
            return identifier
        time.sleep(0.01)
    return None

def release_lock(lock_name, identifier):
    pipe = r.pipeline(transaction=True)
    pipe.watch(lock_name)
    if pipe.get(lock_name).decode() == identifier:
        pipe.multi()
        pipe.delete(lock_name)
        pipe.execute()
        return True
    pipe.unwatch()
    return False

# Example usage
lock_name = "my_resource_lock"
identifier = acquire_lock(lock_name)

if identifier:
    print(f"Lock acquired: {identifier}")
    try:
        # Critical operation is performed here
        time.sleep(2)
    finally:
        release_lock(lock_name, identifier)
        print(f"Lock released: {identifier}")
else:
    print("Failed to acquire lock.")

While Redis locks are more performant and scalable, they come with their own complexities and costs. One of the biggest issues is that Redis, when run as a single instance, is a SPOF (Single Point of Failure). Even if you achieve high availability using Redis cluster or Sentinel, problems like network partitions and clock drift can arise. Suppose a client acquires a lock, experiences a network interruption while working, and cannot send a "I'm done, release the lock" message to Redis. When the lock expires, another client might acquire it. If the previous client sends a "release lock" message to Redis after finishing its work, it might actually release the new client's lock. This is a situation that the "fencing token" attempts to solve, but it still requires careful handling.

The Redlock algorithm is designed to address these issues, but it also introduces additional complexity and performance costs. Redlock requires acquiring locks across multiple Redis instances, which increases latency and raises management overhead. In my own side project, I used Redis locks to ensure a search engine indexing job ran uniquely. Initially, a simple SET NX PX was sufficient, but I observed the indexing job running twice during a network problem, corrupting the index. The result was a full index rebuild and 4 hours of downtime. Events like these perfectly summarize the "operational cost" hidden beneath the seemingly simple facade of Redis locks.

ZooKeeper/etcd-like Consensus-Based Systems

One of the accepted "gold standards" for distributed locks are solutions like Apache ZooKeeper or etcd, which are consensus-based systems. These systems are designed to provide consistency and durability in a distributed environment. Using algorithms like Paxos or Raft, they ensure that a value (in this case, the lock state) is agreed upon by all nodes through consensus. This offers a much more reliable and accurate locking mechanism than Redis or database-backed locks. In a bank's internal platform, ZooKeeper locks were used during the production of critical financial reports to prevent multiple jobs from generating the same report simultaneously.

# Simple lock acquisition with etcdctl
# etcdctl get /mylock --for-write
# etcdctl put /mylock "locked" --lease=10  # With a 10-second lease
# When the operation is finished:
# etcdctl del /mylock

The biggest cost of these systems is their operational complexity. Setting up, managing, and monitoring a ZooKeeper or etcd cluster requires a separate area of expertise. You need to run a three- or five-node cluster, which means more server resources, network bandwidth, and management overhead. Furthermore, lock acquisition and release operations, passing through a consensus algorithm, have higher latency compared to an in-memory database like Redis. Each lock operation requires consensus among the cluster nodes, increasing network traffic.

For me, one of the biggest costs was the learning curve. Properly configuring, backing up, monitoring, and debugging an etcd cluster is not a daily task for an ordinary application developer or system administrator. A misconfigured cluster can fall into inconsistent states during network partitions or node failures, leading to the collapse of your entire distributed lock mechanism. Last year, in a client project, the lock acquisition times reached 30 seconds because the etcd cluster's disk I/O was insufficient. This situation caused all critical operations in the application to freeze. Finding and fixing such "hidden" performance bottlenecks requires significant engineering effort.

ℹ️ Consensus and Scope

Tools like ZooKeeper and etcd are used not only for locks but also for broader distributed coordination problems such as distributed configuration management, service discovery, and leader election. The lock mechanism is just one part of their extensive capabilities. Therefore, using such a heavy tool solely for locking might not always be the most cost-effective solution.

My Own Lock Implementation and What I Learned

Over the years, I've grappled with distributed locks in many different systems, from simple cron jobs to complex production ERPs. At one point, I thought, "Why not write my own simple lock service?" My goal was to overcome some of the disadvantages of the above solutions and develop a lighter, purpose-built solution. I wrote a service that acquired and released locks via a REST API, using a simple PostgreSQL table in the backend, and had a TTL (Time To Live) mechanism. This service automatically released the lock if it wasn't released within a certain period, thus preventing deadlocks.

# Simple lock service example with FastAPI
# (A real implementation would be much more complex, with error handling, etc.)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import time
import uuid

app = FastAPI()

# In-memory locks (not suitable for production, for example purposes)
# Production should use PostgreSQL or Redis
locks = {}

class LockRequest(BaseModel):
    resource_id: str
    ttl_seconds: int = 300 # Default 5 minutes

@app.post("/lock")
async def acquire_lock_endpoint(request: LockRequest):
    lock_id = str(uuid.uuid4())
    current_time = time.time()

    # Check if resource is already locked and if existing lock is expired
    if request.resource_id in locks:
        existing_lock = locks[request.resource_id]
        if current_time < existing_lock['expires_at']:
            raise HTTPException(status_code=409, detail="Resource already locked")
        else:
            # Expired lock, clean up and acquire new one
            del locks[request.resource_id]

    locks[request.resource_id] = {
        "lock_id": lock_id,
        "expires_at": current_time + request.ttl_seconds
    }
    return {"lock_id": lock_id, "message": "Lock acquired"}

@app.post("/unlock")
async def release_lock_endpoint(request: LockRequest, lock_id: str):
    if request.resource_id not in locks:
        raise HTTPException(status_code=404, detail="Resource not locked")

    existing_lock = locks[request.resource_id]
    if existing_lock['lock_id'] != lock_id:
        raise HTTPException(status_code=403, detail="Invalid lock_id for resource")

    del locks[request.resource_id]
    return {"message": "Lock released"}

This approach also had its own costs. While writing my own lock service initially gave me a sense of control, it actually meant taking on the complexities of a distributed system from scratch. I had to handle issues like time synchronization (clock drift), network latencies, service crashes, and automatic lock renewal myself. For example, if a client crashed before finishing its work after acquiring a lock, the lock would remain until the TTL expired, blocking other jobs. To overcome this, clients had to periodically send "heartbeats" to renew the lock, which added extra complexity on the client side.

Once, due to a latency error (I had written sleep 3600 instead of sleep 360), my lock service experienced an OOM-killed, and all locks were reset. The panic at that moment and the effort I spent finding this error afterward once again showed me how valuable the "reliability" provided by a standard library or service is. While my own solution might be suitable for a specific scenario, as a general solution, it carried much more maintenance and debugging costs. This was one of those moments in my engineering career where I learned that "reinventing the wheel" is rarely a good idea.

Reducing Costs: When is a Lock Necessary?

After seeing the costs associated with all these lock mechanisms, the most important question becomes: Do I really need a distributed lock? Most of the time, my answer is "no." In many cases, it's possible to avoid these costs by abstaining from using locks or by adopting different architectures.

First, the concept of idempotency. If a process produces the same result even if it runs multiple times, why lock it? When processing jobs from a message queue, if it's not an issue for multiple workers to process messages simultaneously, using idempotency keys instead of locks is a simpler and more scalable approach. For example, processing a payment transaction only once based on its transaction_id and ignoring it if it arrives multiple times.

Second, eventual consistency. Is it necessary for everything to be instantly consistent? Usually not. The fact that a user adds an item to their cart doesn't need to be visible to everyone in the world instantly. A short delay is often acceptable. In such cases, with architectures like CQRS (Command Query Responsibility Segregation) or event-sourcing, we can eliminate the need for locks by allowing some time for the data to become consistent. In a production ERP, we significantly reduced the lock load on the system by updating stock levels with a few seconds of delay (but guaranteeing the final consistent state) instead of in real-time. This is an approach I frequently use in designing real-time dashboards.

Third, optimistic locking. Especially in database operations, when updating a record, we can check if someone else has modified the same record using a field like a version column. If it has been modified, we retry our own operation. This is much lighter and more performant than distributed locks in low-contention scenarios.

-- Optimistic Locking Example
UPDATE products
SET stock = stock - 1, version = version + 1
WHERE id = 123 AND version = <current_version_read_earlier>;

-- If the number of affected rows is 0, another process has modified the record, and it can be retried.

These approaches significantly reduce the costs associated with distributed locks, such as performance degradation, complexity, and operational overhead. Before using a distributed lock, questioning whether you truly need hard-guaranteed unique access means much less headache in your engineering career. In most cases, simpler and more flexible solutions are available.

Conclusion: Are Locks a Necessity or a Choice?

While distributed locks might seem like an unavoidable necessity in distributed system architectures, they are often, in reality, a choice. And this choice, as detailed above, carries significant technical, operational, and even financial costs. From simple file locks to database-backed solutions, from the fast but complex world of Redis to consensus giants like ZooKeeper/etcd, each alternative harbors its own set of trade-offs.

My experience suggests that we should pause and think before implementing a distributed lock. Do we really need such strict unique access? Can we solve the same problem more flexibly and at a lower cost using alternative approaches like idempotency, eventual consistency, or optimistic locking? Last month, for a component of my own side project, I initially considered a Redis lock. However, I then realized that the nature of the task meant it wouldn't cause problems if it ran multiple times, and I could control it with a last_run timestamp. This simple realization saved me both development time and potential operational nightmares.

Remember, in engineering, using the most powerful or popular tool is not always the best solution. What matters is understanding the problem correctly and finding the most suitable, least costly, and most sustainable solution for that problem. Distributed locks are powerful tools, but they must be used with great care and awareness. I hope that the next time you design a critical operation, you will make more informed choices by considering these costs.