Roopam
Beyond SETNX: Implementing a Production-Grade Distributed Lock with Node.js and Redis Lua Scripts

Picture this: your restaurant booking platform is growing. You've scaled your Node.js API to four replicas behind a load balancer. Then Friday evening hits, and two guests—Charlie and Diana—both smash "Reserve Table 12" at the exact same millisecond. Both requests land on different instances. Both read the database: table 12 is free. Both write a booking row. Last writer wins. Charlie gets a confirmation email. Diana gets a confirmation email. Saturday night, both show up. The host is apologetic. The table is awkward. Your on-call engineer is not having a good weekend.

That is a double-booking race condition, and it's the canonical distributed systems bug for any stateful, time-sensitive resource. Seats on a flight. Stock in a flash sale. A hotel room. A consulting slot.

A database SELECT ... FOR UPDATE pessimistic lock would solve this on a single DB node—but only if every booking request hits that same node serially. The moment you have multiple Node.js processes, you need a lock that lives outside your application layer, in a system that all replicas share. Redis is that system.

But a naïve Redis lock is broken in subtle ways most developers don't discover until it bites them in production. Let me show you why, and how to build a distributed lock that's genuinely safe.


The Race Condition Window

Before we get to the solution, let's be precise about the failure mode.

The old-school approach is a two-step:

  1. GET lock:table:12 — check if key exists
  2. SET lock:table:12 "process-A" NX PX 5000 — set it if free

The problem? Steps 1 and 2 are two separate round-trips to Redis. Between them, the TCP connection yields. Another process—on another server, with its own event loop—can complete its own step 1 before either of you completes step 2.

Process A         Redis          Process B
    |                |                |
    |--- GET ------> |                |
    | <-- (nil) ---- |                |
    |                |  <-- GET ------|
    |                | --- (nil) ---> |
    |--- SET NX ---> |                |   ← A wins the SET
    | <-- OK ------- |                |
    |                |  <-- SET NX ---|   ← B also calls SET NX...
    |                | --- (nil) ---> |   ← ...and is rejected by NX

Here's the twist: with SET ... NX, the race above resolves safely. NX folds the existence check and the write into a single atomic command, so B's late SET simply fails; the preliminary GET was redundant. The genuinely exploitable patterns are the ones that split check and write: doing the check in application code and then issuing a plain SET, or using the legacy SETNX command followed by a separate EXPIRE. Either gap is wide enough for another process to slip through.

But there's a worse problem lurking further down: releasing a lock you don't own anymore.


The Stolen Lock Problem

Here's the scenario that breaks nearly every naïve lock implementation:

t=0ms    Process A acquires lock:table:12, token="uuid-A", TTL=5000ms
t=4800ms Process A is still mid-transaction (slow DB query, GC pause, whatever)
t=5000ms Redis auto-expires the key. Lock is gone.
t=5001ms Process B acquires lock:table:12, token="uuid-B"
t=5100ms Process A finally finishes its transaction. Calls DEL lock:table:12.
         Redis deletes the key—which now belongs to B.
t=5101ms Process C sees the lock is free. Acquires it. Now B *and* C both think
         they hold the lock. Double-booking. Again.

This is the Stolen Lock scenario. Process A's DEL is unconditional—it doesn't check whether it still owns the lock. The TTL saved you from an infinite deadlock, but the unguarded release created a new race.

The fix requires two things to happen atomically:

  1. Read the current lock value
  2. Only delete if it matches your token

And "atomically" here means nothing—no other Redis client command—can interleave between those two steps. That's exactly what Lua scripts give you.


Why Lua? The Atomicity Guarantee

Redis is single-threaded for command execution. Every command is processed one at a time, sequentially, by the Redis event loop. But a sequence of multiple commands—like a GET followed by a DEL—is not atomic. Another client's command can slip in between them.

A Lua script executed via EVAL is different. Redis treats the entire script as a single atomic unit. While the script is running, no other client can execute commands. It's not a lock around Redis itself—it's the guarantee that the script runs to completion without interleaving.

From the Redis docs:

"Redis guarantees the script's atomic execution. While executing the script, all server activities are blocked."

This is what makes the check-then-delete pattern safe:

-- RELEASE_LOCK_SCRIPT
local key   = KEYS[1]
local token = ARGV[1]

if redis.call('GET', key) == token then
  return redis.call('DEL', key)
else
  return 0
end

The GET and the DEL happen as one indivisible operation. There is no window. Process A's expired token will never match Process B's live token, so the DEL is a no-op. Process B's lock survives.
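To see the guard in isolation, here's a toy in-memory model (plain TypeScript, not real Redis; the Map and function names are illustrative) that collapses the GET + conditional DEL into one synchronous function, mirroring what the Lua script does atomically on the server:

```typescript
// Toy stand-in for Redis: a Map of lock keys to owner tokens.
const store = new Map<string, string>();

// Mirrors RELEASE_LOCK_SCRIPT: delete only if the caller still owns the key.
// In real Redis the atomicity comes from Lua; here it comes from the
// function being synchronous.
function releaseIfOwner(key: string, token: string): number {
  if (store.get(key) === token) {
    store.delete(key);
    return 1; // released cleanly
  }
  return 0; // expired or stolen: leave the current owner's lock alone
}

store.set("lock:table:12", "uuid-B"); // B holds the lock now
console.log(releaseIfOwner("lock:table:12", "uuid-A")); // → 0, B is untouched
console.log(releaseIfOwner("lock:table:12", "uuid-B")); // → 1, clean release
```

The same mismatch check is why Process A's stale token can never delete Process B's live lock.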


Architecture at a Glance

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │  Node.js     │     │  Node.js     │     │  Node.js     │
  │  Replica 1   │     │  Replica 2   │     │  Replica 3   │
  └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                     ┌────────▼────────┐
                     │     Redis       │
                     │  lock:table:12  │
                     │  → "uuid-A"     │
                     │  TTL: 4812ms    │
                     └─────────────────┘

The lock lives in Redis. Every replica talks to the same instance. The Lua scripts guarantee that acquire and release are both atomic operations. The TTL is the safety net—if a process crashes mid-transaction, Redis cleans up the orphaned lock automatically.


Step-by-Step Code Walkthrough

1. The Redis Client

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  // This shows up in `redis-cli CLIENT LIST` — invaluable when debugging
  // which connections are holding locks in production.
  connectionName: "table-lock-manager",

  // Cap at 3 retries so integration tests don't stall.
  // In production, tune this based on your latency SLA.
  maxRetriesPerRequest: 3,
});

redis.on("error", (err) => {
  // Emit but don't crash — ioredis will attempt to reconnect.
  console.error("[Redis] Connection error:", err.message);
});

Why ioredis and not the official redis package?

Both are fine. I prefer ioredis for lock managers specifically because:

  • It has a stable, well-documented API for eval() with explicit numkeys parameter
  • Automatic exponential-backoff reconnection is on by default
  • The connectionName option is built-in (useful for redis-cli CLIENT LIST during incidents)
  • It handles command queuing during reconnection transparently

The maxRetriesPerRequest: 3 cap is deliberate. You don't want a lock acquisition attempt to silently retry for 30 seconds—you want a fast failure so the caller can surface a 503 to the user and they can retry manually.


2. Key Naming and TTL Configuration

const LOCK_KEY_PREFIX = "lock:table:";
const DEFAULT_TTL_MS = parseInt(process.env.LOCK_TTL_MS ?? "5000", 10);

The namespace prefix (lock:table:) is not decoration. Redis is typically shared across features—caching, sessions, rate limiting, pub/sub. A bare key like 42 will collide with something eventually. The prefix also makes it trivial to inspect all live locks:

redis-cli KEYS "lock:table:*"
# or better, use SCAN in production:
redis-cli --scan --pattern "lock:table:*"

The TTL from an environment variable matters in production. Your booking transaction time will vary—a fast Postgres query on a warm connection pool might take 50ms; a slow one with retries under load could take 800ms. Ops teams need to tune this without a code deploy. Expose it as a config value.


3. The Acquire Script

const ACQUIRE_LOCK_SCRIPT = `
  local key   = KEYS[1]
  local token = ARGV[1]
  local ttl   = tonumber(ARGV[2])

  -- SET key token NX PX ttl
  -- NX  → only set if Not eXists
  -- PX  → expiry in milliseconds (never omit this — it's your safety net)
  return redis.call('SET', key, token, 'NX', 'PX', ttl)
`;

A quick note here: SET key value NX PX ttl is actually a single atomic Redis command. You could skip the Lua wrapper for acquisition and call it directly. I still use Lua here for three reasons:

  1. Consistency: both acquire and release are Lua scripts. The pattern is uniform.
  2. Extensibility: if you want to add retry counts, owner metadata, or conditional logic to the acquire path, you do it inside the script without additional round-trips.
  3. Explicitness: the Lua version is more readable for anyone who hasn't memorized the SET command's option flags.

The calling code:

import { randomUUID } from "node:crypto";

// The handle returned to callers: the token plus a bound release().
interface LockHandle {
  token: string;
  release: () => Promise<boolean>;
}

async function acquireLock(
  tableId: string,
  ttlMs: number = DEFAULT_TTL_MS
): Promise<LockHandle | null> {
  const key = `${LOCK_KEY_PREFIX}${tableId}`;

  // randomUUID() is crypto-random. Negligible collision probability.
  // It's built into Node's crypto module — no extra dependency.
  const token = randomUUID();

  const result = await redis.eval(
    ACQUIRE_LOCK_SCRIPT,
    1,      // number of KEYS arguments
    key,    // KEYS[1]
    token,  // ARGV[1]
    ttlMs   // ARGV[2]
  );

  if (result !== "OK") {
    return null; // Lock is held by someone else
  }

  return {
    token,
    release: async (): Promise<boolean> => {
      // ... see release section
    },
  };
}

Why return null instead of throwing?

Because lock contention is not an exceptional error—it's an expected, normal outcome in a concurrent system. Throwing forces try/catch at every call site and conflates "this is bad, something broke" with "this is expected, retry or backoff." Returning null lets the caller decide: queue the request, return a 409, prompt the user, or implement exponential retry. Keep the control flow clean.
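If the caller chooses the retry route, a thin wrapper works with any acquire function. A minimal sketch (acquireWithRetry and its parameters are my names, not part of the article's repo), using exponential backoff with full jitter:

```typescript
// Retry a lock acquisition a bounded number of times.
// Full jitter (random delay in [0, base * 2^attempt)) spreads competing
// clients out so they don't stampede Redis in lockstep.
async function acquireWithRetry<T>(
  tryAcquire: () => Promise<T | null>,
  maxAttempts = 5,
  baseDelayMs = 50
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const handle = await tryAcquire();
    if (handle !== null) return handle; // got the lock
    if (attempt < maxAttempts - 1) {
      const delayMs = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return null; // still contended: surface a 409/503 to the user
}
```

Usage would look like `const lock = await acquireWithRetry(() => acquireLock("12"));`, with the caller still owning the null case.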


4. The Release Script (The Critical Part)

This is the part most implementations get wrong. Read it carefully.

const RELEASE_LOCK_SCRIPT = `
  local key   = KEYS[1]
  local token = ARGV[1]

  if redis.call('GET', key) == token then
    return redis.call('DEL', key)
  else
    -- We no longer own this lock (expired or stolen). Return 0, not an error.
    return 0
  end
`;

Let's trace through the stolen lock scenario again, but with this script:

t=0ms    Process A acquires lock:table:12, token="uuid-A", TTL=5000ms
t=5000ms Redis auto-expires the key
t=5001ms Process B acquires lock:table:12, token="uuid-B"
t=5100ms Process A finishes, calls release with token="uuid-A"
         Lua script: GET lock:table:12 → "uuid-B"
         "uuid-B" != "uuid-A" → return 0
         B's lock is UNTOUCHED. ✓

Process A's release is a no-op. Process B continues safely. No race condition.

The release function returns a boolean: true if we released it cleanly, false if the lock had already expired. That false case is a warning, not a crash. The table is already unlocked (Redis TTL handled it). What false tells your ops team is: your TTL is too short for the actual transaction time. Tune it.

release: async (): Promise<boolean> => {
  const released = await redis.eval(
    RELEASE_LOCK_SCRIPT,
    1,
    key,
    token
  );

  if (released === 1) {
    console.log(`[Lock] RELEASED — key="${key}" token="${token}"`);
    return true;
  } else {
    // Lock expired before manual release. Warn ops to tune LOCK_TTL_MS.
    console.warn(
      `[Lock] WARN — Lock expired before manual release. ` +
      `key="${key}" token="${token}"`
    );
    return false;
  }
},

5. The finally Block Is Non-Negotiable

async function bookTable(
  tableId: string,
  guestName: string,
  processingDelayMs = 200
): Promise<BookingResult> {
  const lock = await acquireLock(tableId);

  if (!lock) {
    return {
      success: false,
      message: `Table ${tableId} is currently being booked by another request.`,
    };
  }

  try {
    await performBookingTransaction(tableId, guestName, processingDelayMs);
    return { success: true, message: `Table ${tableId} booked for "${guestName}".` };
  } finally {
    // ALWAYS release in a finally block.
    // If performBookingTransaction throws, we still release.
    // The TTL is the last line of defence; manual release is the first.
    await lock.release();
  }
}

If you put lock.release() in the try block, an exception from performBookingTransaction will skip it. The lock stays in Redis until the TTL fires. Every request for that table is rejected for up to 5 seconds. That's bad UX and, at scale, a thundering herd of retries. Use finally. Always.
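You can go one step further and make the acquire → try → finally shape impossible to forget by wrapping it once. A sketch (withLock is my name; it takes the acquire function as a parameter, so it isn't tied to this article's acquireLock specifically):

```typescript
type Releasable = { release: () => Promise<unknown> };

// Generic scoped-lock helper: the critical section runs only if the lock
// was acquired, and release() runs even if the critical section throws.
async function withLock<T>(
  acquire: () => Promise<Releasable | null>,
  criticalSection: () => Promise<T>
): Promise<T | null> {
  const lock = await acquire();
  if (!lock) return null; // contention: caller decides (409, retry, queue)
  try {
    return await criticalSection();
  } finally {
    await lock.release();
  }
}
```

With this in place, bookTable's body shrinks to a single withLock call around performBookingTransaction, and no call site can forget the finally.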


Failure Scenario Simulations

The code includes three concrete scenarios that demonstrate the lock's behavior. Here's what each one proves.

Scenario 1: Sequential Bookings (Baseline)

const r1 = await bookTable("T1", "Alice");   // succeeds
const r2 = await bookTable("T1", "Bob");     // also succeeds (lock released before Bob tries)

No surprise here. Sequential requests don't contend. Both complete. This is the sanity check that the lock doesn't break the happy path.

Scenario 2: Concurrent Race — The Real Test

const [r1, r2] = await Promise.all([
  bookTable("T2", "Charlie", 300), // 300ms artificial DB delay
  bookTable("T2", "Diana", 300),
]);

Promise.all launches both at the exact same event-loop tick. The processingDelayMs: 300 ensures the first lock-holder is mid-transaction when the second attempt fires, making the contention visible in logs.

Expected output:

[Lock] ACQUIRED — key="lock:table:T2"  token="uuid-charlie"
[DB]   Writing booking: table=T2  guest="Charlie"
[Lock] REJECTED — key="lock:table:T2" (lock already held)

Result (Process 1): Table T2 successfully booked for "Charlie".
Result (Process 2): Table T2 is currently being booked by another request.

One winner. One clean rejection. No double-booking. No database write for the loser—the lock prevented it from ever getting to the DB layer.

Scenario 3: TTL Safety Net — The Crash Simulation

const SHORT_TTL = 1500; // 1.5 seconds

// Process A acquires but "crashes" — never releases
const lock = await acquireLock("T3", SHORT_TTL);
// lock.release() is never called

// Wait past TTL
await sleep(SHORT_TTL + 500);

// Process B should now succeed
const result = await bookTable("T3", "Eve"); // succeeds ✓

This validates the TTL as a self-healing mechanism. In a real crash—OOM kill, SIGKILL from a deployment, network partition—the process can't release its lock. Without TTL, the table would be permanently blocked. Redis's key expiration is what makes distributed locks safe to use in the first place.


Production Checklist

TTL Tuning

Your TTL should be: (p99 transaction time) × safety_multiplier

If your booking transaction (including DB write and downstream calls) completes in 200ms at p99, a 2000ms TTL gives you 10× headroom. If your p99 is 800ms under load, push it to 5000ms. Monitor [Lock] WARN — Lock expired before manual release log lines in production—if you see them more than occasionally, your TTL is too short.

Never set TTL lower than your actual processing time. The whole point of the TTL is to be a safety net, not a gate that triggers regularly.
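The rule is simple enough to encode directly. A sketch (the function and constant names are mine, not from the repo):

```typescript
// Derive LOCK_TTL_MS from a measured p99 transaction time.
const SAFETY_MULTIPLIER = 10; // headroom factor; tune per service

function ttlFromP99(p99Ms: number, multiplier = SAFETY_MULTIPLIER): number {
  return Math.ceil(p99Ms * multiplier);
}

ttlFromP99(200); // → 2000ms, the 10× headroom example above
```

Feed it from your metrics pipeline rather than a guess; the p99 under load is the number that matters.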

Logging and Observability

Every lock acquisition and release should log:

  • The key (which resource)
  • The token (which process/request)
  • The TTL on acquire
  • Whether release succeeded or expired

This gives you a complete audit trail during incidents. When a support ticket says "we double-booked table 7 on Friday at 8:14 PM," you want to be able to reconstruct the exact sequence of lock events from your logs.

Tag your lock logs with a correlation ID (request ID, trace ID) so you can join them with your application logs in your log aggregator.
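As a sketch of what a joinable lock log line could look like (LockEvent and formatLockLog are illustrative names, not the repo's API):

```typescript
interface LockEvent {
  event: "ACQUIRED" | "RELEASED" | "REJECTED" | "EXPIRED";
  key: string;       // which resource
  token: string;     // which acquisition
  requestId: string; // correlation ID shared with application logs
  ttlMs?: number;    // set on acquire
}

// One consistent line format makes lock events trivially grep-able
// and joinable on requestId in a log aggregator.
function formatLockLog(e: LockEvent): string {
  const ttl = e.ttlMs !== undefined ? ` ttl=${e.ttlMs}ms` : "";
  return `[Lock] ${e.event} key="${e.key}" token="${e.token}" requestId="${e.requestId}"${ttl}`;
}

formatLockLog({
  event: "ACQUIRED",
  key: "lock:table:12",
  token: "uuid-A",
  requestId: "req-8f3a",
  ttlMs: 5000,
});
// → [Lock] ACQUIRED key="lock:table:12" token="uuid-A" requestId="req-8f3a" ttl=5000ms
```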

What About Redis Cluster / Sentinel?

Single-node Redis is sufficient for most applications. If you're running Redis with replication (Sentinel or Cluster), be aware of replication lag: a lock key written to the primary might not yet exist on a replica. If the primary fails before replication completes, a failover could allow two processes to simultaneously acquire what they both believe is the same lock.

For this specific concern, the Redlock algorithm by Salvatore Sanfilippo addresses it by requiring majority quorum across N independent Redis nodes. However, Redlock comes with significant operational overhead (you need 3-5 Redis instances) and there are well-documented theoretical edge cases around clock drift (see Martin Kleppmann's critique).

My opinion: if you're running a single Redis primary with appendonly yes (AOF persistence) and your durability requirements are "don't lose the lock key on restart," Redlock is overkill. The failure window during a Redis primary failover is typically seconds, and the probability of a lock race coinciding exactly with that window is extremely low. A fencing token in your DB schema (a monotonically incrementing version column, checked on every write) is often a cheaper and more robust route to true correctness guarantees. Use Redlock only if you have a specific, documented requirement for multi-node lock durability.
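For the curious, the fencing idea in miniature. This toy class (my construction, not from the article) emulates a conditional UPDATE guarded by the version column: the storage layer rejects any write carrying a token older than one it has already seen, so a paused process that wakes up holding a stale lock cannot clobber newer work:

```typescript
class FencedResource {
  private highestSeen = 0;

  // Accept a write only if its fencing token is newer than any previous one.
  write(fencingToken: number): boolean {
    if (fencingToken <= this.highestSeen) {
      return false; // stale lock holder: reject
    }
    this.highestSeen = fencingToken;
    return true;
  }
}

const table = new FencedResource();
table.write(2); // → true: B writes with the newer token
table.write(1); // → false: A wakes up late with the older token, rejected
```

Unlike the Redis token check, this guard lives in the system of record itself, so it holds even if the lock service misbehaves.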

Connection Pooling

ioredis creates a single TCP connection per new Redis() instance. For a lock manager that sees high concurrency, you may want to share a single client instance across all requests (as this implementation does with the module-level singleton) rather than creating a new connection per request. The Redis command pipeline handles concurrent EVAL calls correctly—you don't need multiple connections for concurrency.

Health Checks

Include Redis connectivity in your application's /health endpoint. A lock manager that silently fails open (acquiring locks when Redis is unreachable) is arguably worse than failing closed. If redis.ping() fails, your health check should return 503 so the load balancer stops routing traffic to that instance.
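A tiny, framework-agnostic sketch of that decision (healthStatus is my name; wire its return value into whatever serves your /health route, passing in () => redis.ping() from the ioredis client):

```typescript
// Map a Redis ping attempt to an HTTP status code for a /health endpoint.
async function healthStatus(ping: () => Promise<string>): Promise<number> {
  try {
    const reply = await ping();
    return reply === "PONG" ? 200 : 503;
  } catch {
    return 503; // Redis unreachable: fail closed, drain this instance
  }
}
```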


A Note on SETNX vs SET NX

The title says "Beyond SETNX"—a quick note on why. The original SETNX command (Set if Not eXists) dates to Redis 1.0. It doesn't accept an expiry argument, so the old pattern was SETNX key value followed by a separate EXPIRE key seconds. That gap between SETNX and EXPIRE is exploitable—if the process crashes after SETNX but before EXPIRE, the key has no TTL and lives forever.

Redis 2.6.12 added options to the SET command: NX (only set if not exists) and PX (expiry in milliseconds). SET key value NX PX 5000 is atomic. There is no gap. The old SETNX + EXPIRE pattern is deprecated and should never appear in new code. If you see it in a codebase, replace it.
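To make the gap concrete, here's a toy in-memory model (not real Redis, just enough of SETNX/PEXPIRE to show the failure) of a process crashing between the two legacy commands:

```typescript
// Minimal fake of the two legacy commands.
class ToyRedis {
  private store = new Map<string, { value: string; expiresAt?: number }>();

  setnx(key: string, value: string): boolean {
    if (this.store.has(key)) return false;
    this.store.set(key, { value }); // note: no expiry attached yet
    return true;
  }

  pexpire(key: string, ms: number): void {
    const entry = this.store.get(key);
    if (entry) entry.expiresAt = Date.now() + ms;
  }

  hasTtl(key: string): boolean {
    return this.store.get(key)?.expiresAt !== undefined;
  }
}

const r = new ToyRedis();
r.setnx("lock:table:12", "uuid-A");
// imagine the process dies HERE, before the EXPIRE ever runs:
// r.pexpire("lock:table:12", 5000);
r.hasTtl("lock:table:12"); // → false: the key will never expire on its own
```

With SET ... NX PX there is no "HERE": the value and the TTL land in one command or not at all.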


Wrapping Up

Here's what we built and why each piece matters:

| Component                 | What it does                   | Why it matters                             |
|---------------------------|--------------------------------|--------------------------------------------|
| Lua acquire script        | Atomic SET NX PX               | No race between check and set              |
| UUID owner token          | Unique per acquisition         | Enables ownership verification on release  |
| Lua release script        | GET + conditional DEL (atomic) | Prevents deleting a lock you no longer own |
| TTL on every lock         | Auto-expiry                    | Self-healing on process crash              |
| finally block             | Release always runs            | No orphaned locks from exceptions          |
| null return on contention | Structured failure             | Clean call-site control flow               |

The double-booking problem isn't exotic—it's what happens when you scale any stateful operation to multiple processes without coordination. Redis gives you a shared coordination layer that's fast, battle-tested, and trivial to operate. Lua gives you the atomicity that makes it correct. The owner token is the detail that makes it production-safe.

The full source is available on GitHub: TyRoopam9599/distributed-lock-redis-lua. Run npx ts-node src/lockManager.ts against a local Redis instance to see all three scenarios play out in your terminal.


Have you run into the stolen-lock scenario in production? Or do you have a different take on Redlock vs. single-node? Drop a comment below — I'd like to hear how others have approached this.
