The On-Call Cost of Distributed Locks

#career #architecture #operations #systemdesign

Distributed locks are one of the most popular tools in the toolbox of teams transitioning to microservice architectures. Preventing multiple application servers from accessing the same resource (e.g., an order balance, inventory item, or invoice number) simultaneously sounds like a perfect solution in theory. In practice, however, the operational complexity that distributed locks introduce to a system can become the number one reason for those dreaded midnight on-call pages.

I've pushed the limits of this architecture countless times, both in the backend infrastructure of a side project I developed and in a large-scale e-commerce platform I consulted for. I personally experienced how distributed locks don't behave as neatly as they appear on paper, and how a 50-millisecond network fluctuation or a minor database lock starvation can turn into an operational nightmare. In this post, I'll discuss the hidden operational cost of distributed locks, the limitations of algorithms like Redlock, and alternative design patterns to reduce on-call burden, all based on my own experiences.

The Allure of Distributed Locks and Their Real-World Cost

In distributed systems, the first method that comes to mind for resolving inconsistencies known as "race conditions" is to establish a global locking mechanism. As your application servers scale horizontally, local memory locks (e.g., sync.Mutex in Go or synchronized blocks in Java) become ineffective. At this point, it seems perfectly logical to go to a shared memory server (typically Redis or Consul) that all servers can see and say, "I've locked this resource X; no one touch it until I'm done."

However, the biggest cost of this approach is that your system becomes extremely sensitive to latency at a single point and to network partitioning. When you use a distributed lock, a critical piece of your application's business logic becomes directly dependent on the lock acquisition time over the network. On a client project, I witnessed an API handling 12,000 requests per second during a peak campaign period collapse in a domino effect due to microsecond delays on the lock server.

+------------------+      1. Lock Request (SET NX)     +-------------------+
|  Application Pod 1 | --------------------------------> |                   |
|                  | <-------------------------------- |   Redis Cluster   |
+------------------+       2. Lock Acquired (OK)       |                   |
                                                       |                   |
+------------------+      3. Request Same Lock         |                   |
|  Application Pod 2 | --------------------------------> |                   |
|                  | <-------------------------------- |                   |
+------------------+         4. Denied (NIL)           +-------------------+

Even if the lock server is up, incorrect management of lock timeout durations by client-side libraries can lead to situations like deadlocks, where locks are "never released." If the unlock command fails to reach its destination due to a packet loss on the network, that resource remains locked until its Time to Live (TTL) expires. This means the on-call engineer will be losing sleep at midnight, searching system logs for lines like these:

[2026-05-30 03:14:22] ERROR [order-service] LockAcquisitionException: 
Could not acquire lock for resource 'order_982341'. 
Lock held by client 'pod-a-7d8f' with remaining TTL 28800ms.

The Redis Redlock Algorithm and the 3:00 AM On-Call Pager

When it comes to setting up distributed locks on Redis, the Redlock algorithm is often the first standard that comes to mind. Redlock aims to acquire a lock by obtaining a majority (quorum) approval from at least 5 different Redis nodes, guarding against the possibility of a single Redis node failing. The algorithm works great on paper: if 3 out of 5 nodes approve, the lock is yours. However, as Martin Kleppmann pointed out, Redlock is overly dependent on the accuracy of physical clocks (wall-clock time) and network latencies.

We used Redlock in a background job running Material Requirements Planning (MRP) in a production ERP. The process took about 45 seconds, and we set the lock's TTL to 60 seconds. However, due to vCPU steal time on the virtual machine (VM) running in the background and a long Garbage Collection (GC) pause triggered at that moment, the application server completely froze for 18 seconds. During this freeze, the application thought time wasn't passing, while the TTL counter on the Redis nodes continued to tick down.

⚠️ Timeout Danger

When the GC pause ended, the application assumed it still held the lock and began writing to the database. However, the lock had already expired, and another worker had acquired the same lock and started its own process. The result: a double write and inconsistent inventory data in the database.

The CPU and network metrics we extracted while debugging this incident showed us how operationally fragile Redlock can be. When clock synchronization (NTP) between nodes deviated even by a second, the quorum calculation went completely awry. If you're using Redlock, you have to monitor not just Redis, but also the NTP status of all your servers. This simply adds another alert item to the on-call roster.

Why Database-Level Locking (Advisory Locks) Is Often Sufficient

If your system already uses a robust relational database like PostgreSQL, introducing an additional layer like Redis or Consul for distributed lock management is often an unnecessary operational burden. PostgreSQL offers "Advisory Locks" that run entirely in memory, independent of application logic, and do not lock physical tables.

As I mentioned in my previous post [related: PostgreSQL index strategies], because the database's internal mechanisms have ACID guarantees, they are much more reliable than Redis for consistency. Acquiring a lock on a key in PostgreSQL only requires a single SQL query:

-- Acquire a transaction-level lock on a 64-bit key
SELECT pg_advisory_xact_lock(8472910472);

The biggest advantage of this method is that the lock operation is directly tied to the database connection (session/connection). If your application crashes, the network connection drops, or the server physically shuts down, PostgreSQL immediately detects this and automatically releases all advisory locks associated with that connection (cleanup). There's no "lock is hanging, let's wait for TTL to expire" worry like with Redis.

However, this method also has its own trade-offs. If your application receives very high traffic and makes tens of thousands of lock requests per second, you might hit PostgreSQL connection pooler limits (e.g., PgBouncer). Transaction-level locks may not always work correctly with PgBouncer in session-pooling mode. It's important to design with this limitation in mind.

Criterion	Redis (Redlock)	PostgreSQL (Advisory Locks)
Connection Loss Behavior	Lock remains locked for TTL duration	Lock released immediately upon connection loss
Operational Complexity	High (managing 5 independent nodes)	Low (uses existing DB)
Performance (Throughput)	Very High (>50k ops/sec)	Moderate (~5k-10k ops/sec)
Reliability (Consistency)	Dependent on clock synchronization (NTP)	ACID guaranteed, independent of clock

Lock Issues Caused by Network Latency and GC Pauses (Garbage Collection)

The most insidious reason distributed locks fail in practice is that the software world often ignores physical realities. The network is asynchronous; you can never know exactly how long it will take for a packet to travel from point A to point B. When virtualization, hypervisor layers, and runtime-level Garbage Collection pauses are introduced, the validity periods of locks become mere estimations.

"Stop-the-World" GC pauses encountered in large applications running on JVM-based languages (Java, Kotlin, Scala) or the .NET runtime are the biggest enemies of distributed locks. When your application server pauses for 5 seconds, TCP sockets at the operating system level might continue to buffer, but your application code freezes.

Application Flow:
[00:00.000] -> Lock Acquired (TTL: 5 seconds)
[00:00.100] -> GC Pause Started (Stop-the-World)
[00:04.800] -> GC Pause Ended (Elapsed time: 4.7s)
[00:04.900] -> Write Request Sent to Database (Lock will actually expire in 0.1 seconds!)
[00:05.100] -> Request reached DB, but lock had already expired and was acquired by another pod.

To solve this, fencing tokens should be placed after the lock check. A fencing token is a monotonically increasing counter that increments each time a lock is acquired. You send this counter when writing to the database. If the current counter in the database is greater than the one you sent, the write operation is rejected. This method, also recommended in RFC 7230 and similar distributed system principles, builds a second firewall behind locks but also significantly complicates the software architecture.

Idempotency and Constrained Designs: Never Needing a Lock at All

So, instead of dealing with all these operational headaches, can we completely eliminate the need for locks? Most of the time, yes. Software architecture is more about organizational and data flow design than just writing code. Making operations naturally idempotent, meaning they can be run multiple times without adverse effects, is the cleanest way to reduce on-call costs to zero.

For example, consider an operation to debit money from a user's balance. Instead of using a distributed lock to check the balance and then update it, we can use the database's own atomic operations and "optimistic locking."

# FastAPI endpoint example with optimistic concurrency control, no locks
from fastapi import FastAPI, HTTPException, status
from pydantic import BaseModel
import psycopg2

app = FastAPI()

class PaymentRequest(BaseModel):
    account_id: int
    amount: float
    version: int  # Version number for fencing and optimistic control

@app.post("/withdraw")
def withdraw_money(req: PaymentRequest):
    conn = psycopg2.connect("dbname=finance user=postgres")
    cursor = conn.cursor()

    # Update only if version matches and balance is sufficient
    cursor.execute(
        """
        UPDATE accounts 
        SET balance = balance - %s, version = version + 1
        WHERE id = %s AND version = %s AND balance >= %s
        """,
        (req.amount, req.account_id, req.version, req.amount)
    )

    if cursor.rowcount == 0:
        conn.rollback()
        raise HTTPException(
            status_code=status.HTTP_409_CONFLICT,
            detail="Insufficient balance or data version is outdated. Please try again."
        )

    conn.commit()
    return {"status": "success"}

This design requires no distributed lock server. If two requests arrive simultaneously, one will update the version number, causing the other to automatically fail (rowcount == 0). At the application layer, we can catch this error and tell the user "Please try again" or implement a safe retry mechanism in the background. This approach eliminates all bottlenecks that hinder system scaling.

As I also mentioned in my post [related: PostgreSQL index strategies], in such designs, the presence of correct indexes directly affects query speed. A unique index on the id and version fields ensures the database engine completes this update in milliseconds.

Practical Runbooks and Monitoring Strategies to Reduce On-Call Costs

If you are forced to use distributed locks in your current architecture and cannot change it, you should at least take operational measures to make life easier for on-call engineers. If a locking system is not monitored, you are flying blind.

First, absolutely metricize lock acquisition times and lock wait times. Visualizing these metrics using Prometheus and Grafana will allow you to detect a potential deadlock situation hours in advance.

# Example Prometheus metric format
distributed_lock_acquisition_duration_seconds_bucket{resource="order_payout", le="0.1"} 1240
distributed_lock_acquisition_duration_seconds_bucket{resource="order_payout", le="0.5"} 145
distributed_lock_acquisition_duration_seconds_bucket{resource="order_payout", le="1.0"} 12
distributed_lock_acquisition_duration_seconds_bucket{resource="order_payout", le="+Inf"} 3

In the metrics above, if lock acquisition times above +Inf (infinity) or 1 second are increasing, this indicates that a thread pool is exhausted in the background or the lock server is bottlenecked. Set your on-call alert thresholds based on the 99th percentile (p99 latency) of these metrics.

Also, add the following emergency steps to your on-call runbook:

Clean up Zombie Locks: Run SCAN and TTL commands on Redis to find out which pod holds which lock.
Terminate Connections: If PostgreSQL advisory locks are used, terminate the locked session with pg_terminate_backend(pid).
Graceful Shutdown: Ensure that cleanup functions (shutdown hooks) that release locks when application pods are shut down (upon receiving a SIGTERM signal) are working.

Conclusion

My clear stance is this: Distributed locks are a last resort used to patch design flaws and unplanned data flows in system architecture; they should not be the first choice. Before adding a distributed lock to a system, ask yourself: "Can I solve this inconsistency at the database level with optimistic locks, unique constraints, or an event-driven architecture?"

If the answer is yes, take the distributed lock option off the table. The way to build a stable system that doesn't wake you up at 3:00 AM with a phone call and is unaffected by network fluctuations is not by managing complexity, but by never creating complexity in the first place. Operational excellence is not about using the most advanced tools, but about achieving the highest resilience with the simplest designs.