Kolade Fajimi

Posted on Jun 15 • Originally published at koladefaj.hashnode.dev

Redis Lua Scripting for Distributed Systems: How Atomicity Prevents Race Conditions.

#redis #distributedsystems #python #backend

There is a failure mode that is almost impossible to reproduce in tests but completely reproducible in production under load. Two workers. One Redis key. Both check whether a task has already been claimed. Both see "no." Both claim it. Now the task runs twice.

This is not a logic bug. The logic is correct. The problem is that correctness requires atomicity, and two separate Redis commands, however fast, are never atomic as a pair.

Every critical operation in Relier is a Lua script. Not because Lua is elegant. Because Redis's single-threaded execution model means a Lua script runs to completion without interleaving, which makes it the most direct way to make a multi-step distributed check-then-act actually correct.

This is a walkthrough of all six of them.

Why GET + SET is never safe

When you need to claim a task, check whether it's been taken, and if not, take it, the naive implementation is two commands:

existing = await redis.get(key)
if existing is None:
    await redis.set(key, worker_id)

This has a race condition that is invisible at low concurrency and guaranteed to trigger at high concurrency.

t=0ms  Worker A: GET "task:123" → nil   # not claimed
t=0ms  Worker B: GET "task:123" → nil   # not claimed (same read, no lock)
t=1ms  Worker A: SET "task:123" "worker-A"
t=1ms  Worker B: SET "task:123" "worker-B"

Both workers claimed the task. Worker A's write is immediately overwritten by Worker B. Now both are executing it.
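The interleaving above can be reproduced without Redis at all. The following is a minimal sketch, assuming an in-memory dict as a stand-in for the Redis keyspace and an `asyncio.sleep(0)` as a stand-in for the network round trip between the two commands; `naive_claim` and `claimed_by` are hypothetical names used only for this illustration.

```python
import asyncio

store = {}        # hypothetical in-memory stand-in for the Redis keyspace
claimed_by = []   # every worker that believes it owns the task


async def naive_claim(key: str, worker_id: str) -> None:
    existing = store.get(key)    # step 1: the check (GET)
    await asyncio.sleep(0)       # round trip: other workers get to run here
    if existing is None:
        store[key] = worker_id   # step 2: the act (SET)
        claimed_by.append(worker_id)


async def main() -> None:
    await asyncio.gather(
        naive_claim("task:123", "worker-A"),
        naive_claim("task:123", "worker-B"),
    )


asyncio.run(main())
print(claimed_by)          # ['worker-A', 'worker-B']
print(store["task:123"])   # worker-B
```

Both workers pass the check because neither write has landed when the other reads. The stored value only records the last writer, not the fact that two workers started the task.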

The fix is not a Python-side lock. Python-side locks are process-local; they can't protect you across multiple workers, containers, or machines. The fix is moving the check-and-set into a single Redis operation that cannot be interleaved with anything else.

That is what Lua gives you. Redis executes Lua scripts atomically. No other command runs between the first line and the last.

ACQUIRE_LUA (idempotency claim)

The first script Relier uses is the idempotency claim. When @rl_task(idempotent=True) is set, every task submission runs this before any work happens:

local existing = redis.call('GET', KEYS[1])
if existing then
    return {1, existing}
end
redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2])
return {0, false}

KEYS[1] is the idempotency key, derived from the task name and arguments, or set explicitly by the caller. ARGV[1] is the in-flight sentinel (a UUID tied to this execution attempt). ARGV[2] is the TTL.

What this closes: the race between two concurrent submissions of the same task. Both arrive at Redis. Both invoke the script. Redis executes them one at a time. The first one sees existing = nil, claims the key, and returns {0, false}: proceed. The second one sees existing = <the sentinel> and returns {1, existing}: already claimed, skip. One execution. Not two.

The NX flag on the SET is belt-and-suspenders. The GET already gates on existence, but NX means the SET itself is also a no-op if the key was written between the GET and SET, which cannot happen inside a Lua script, but matters if you ever run the commands outside one.

The return value carries the existing state so the caller can distinguish between in-flight (sentinel value) and completed (cached result JSON). That distinction is what lets Relier return the cached result to a duplicate caller without re-running the task.
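On the Python side, the two-element reply has to be turned into a decision. The sketch below is one way that could look, not Relier's actual code. It assumes in-flight sentinels carry a hypothetical `inflight:` prefix so they can be told apart from cached result JSON; the prefix, `ClaimOutcome`, and `interpret_acquire_reply` are all invented for this example.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

INFLIGHT_PREFIX = "inflight:"  # hypothetical marker, not confirmed by Relier


@dataclass
class ClaimOutcome:
    status: str                          # "claimed", "in_flight", or "completed"
    cached_result: Optional[Any] = None


def interpret_acquire_reply(reply: list) -> ClaimOutcome:
    """Map the {flag, existing} reply of ACQUIRE_LUA to a caller decision."""
    already_claimed, existing = reply
    if already_claimed == 0:
        return ClaimOutcome("claimed")       # we own the key: run the task
    if isinstance(existing, bytes):          # redis-py returns bytes by default
        existing = existing.decode()
    if existing.startswith(INFLIGHT_PREFIX):
        return ClaimOutcome("in_flight")     # another attempt is still running
    return ClaimOutcome("completed", json.loads(existing))


print(interpret_acquire_reply([0, None]).status)              # claimed
print(interpret_acquire_reply([1, b"inflight:abc"]).status)   # in_flight
print(interpret_acquire_reply([1, b'{"total": 42}']))
# ClaimOutcome(status='completed', cached_result={'total': 42})
```

The Lua `false` in `{0, false}` arrives in Python as `None`, which is why the claimed branch never looks at the second element.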

RELEASE_LUA (compare-and-delete)

When a task fails mid-execution, Relier needs to release the in-flight sentinel so future submissions can retry. The naive approach is a plain DEL:

await redis.delete(key)

This is wrong.

Consider: Worker A claims task X and sets its sentinel. Worker A stalls long enough to look dead. Relier's Phoenix resurrector detects the expired heartbeat and re-queues task X. Worker B claims it and sets its own sentinel. Now Worker A's error handler runs (the process never actually died; it threw an exception late) and issues DEL key. It deletes Worker B's sentinel. The task is now claimable by a third worker while Worker B is still running it. Task runs twice.

The correct operation is conditional delete: delete the key only if the stored value is still your sentinel. That is RELEASE_LUA:

if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0

ARGV[1] is the sentinel value this worker set when it claimed the task. If the stored value no longer matches, because another worker claimed it after resurrection, the delete is skipped. Worker B's execution is protected.

The compare-and-delete pattern is a special case of what distributed systems literature calls compare-and-swap (CAS). The point is always the same: never release a lock without first verifying you still own it. For a lock that can change hands, plain DEL is always wrong.
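The difference between the two release strategies is easy to see in a small model. This is a sketch under stated assumptions: a plain dict stands in for Redis, the key name `idem:task-x` and the sentinel strings are hypothetical, and the Python functions are atomic only because nothing else runs concurrently here (in production that guarantee comes from the Lua script).

```python
def plain_delete(store: dict, key: str) -> int:
    """Unconditional DEL: removes the key no matter who owns it."""
    return 1 if store.pop(key, None) is not None else 0


def compare_and_delete(store: dict, key: str, sentinel: str) -> int:
    """Python model of RELEASE_LUA: delete only if we still own the key."""
    if store.get(key) == sentinel:
        del store[key]
        return 1
    return 0


# Worker A claimed the task, was presumed dead, and Worker B re-claimed it.
store = {"idem:task-x": "sentinel-B"}

# Worker A's late error handler releases with its own, now stale, sentinel.
released = compare_and_delete(store, "idem:task-x", "sentinel-A")
print(released, store)   # 0 {'idem:task-x': 'sentinel-B'}

# The same handler using plain DEL would have removed Worker B's claim.
unsafe = dict(store)
plain_delete(unsafe, "idem:task-x")
print(unsafe)            # {}
```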

RESURRECT_LUA (atomic lease and fence token)

When the Phoenix resurrector detects a dead worker and re-queues a task, two things need to happen atomically:

Acquire a lease, so concurrent resurrectors don't both re-queue the same task
Publish a new fence token, so the zombie worker (if it wakes up) can't commit stale results

local lease_key = KEYS[1]
local fence_key = KEYS[2]

local token = ARGV[1]
local lease_ttl = tonumber(ARGV[2])
local fence_ttl = tonumber(ARGV[3])

if redis.call("EXISTS", lease_key) == 1 then
    return 0
end

redis.call("SET", lease_key, token, "EX", lease_ttl)
redis.call("SET", fence_key, token, "EX", fence_ttl)

return 1

The atomicity here solves a specific problem. Imagine two resurrectors scanning simultaneously, which happens whenever you run more than one worker embedding the scanner. Both detect the same expired heartbeat. If they could interleave:

Resurrector A: EXISTS lease_key → 0  (no lease)
Resurrector B: EXISTS lease_key → 0  (no lease, A hasn't written yet)
Resurrector A: SET lease_key ...
Resurrector B: SET lease_key ...     # both acquired the lease
Both dispatch the task to the queue

The script prevents this. The EXISTS check and the SET are one atomic unit. One of the two resurrectors wins. The other sees EXISTS = 1 and returns 0. One re-queue dispatch. Not two.

The fence token is set in the same script for a different reason: the new fence token must be visible in Redis before any worker picks up the re-queued task. If lease acquisition and fence token write were separate commands, a worker could pick up the task between them, before the fence token was set, and have nothing to validate against. That window is eliminated by collapsing both writes into one atomic script.
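The "one winner" property can be modelled in a few lines. The sketch below is an illustration rather than Relier's implementation: a `threading.Lock` stands in for Redis running one script at a time, TTLs are omitted, and the key names and token format are hypothetical.

```python
import threading

_one_script_at_a_time = threading.Lock()   # stands in for Redis's execution model
store = {}                                 # hypothetical in-memory keyspace
results = []


def resurrect(lease_key: str, fence_key: str, token: str) -> int:
    """Python model of RESURRECT_LUA (TTLs omitted for brevity)."""
    with _one_script_at_a_time:
        if lease_key in store:       # EXISTS lease_key
            return 0                 # another resurrector already won
        store[lease_key] = token     # SET lease_key token
        store[fence_key] = token     # SET fence_key token
        return 1


def resurrector(n: int) -> None:
    results.append(resurrect("lease:task-x", "fence:task-x", f"token-{n}"))


threads = [threading.Thread(target=resurrector, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))                                      # 1
print(store["lease:task-x"] == store["fence:task-x"])    # True
```

Whichever thread wins, the lease and the fence always carry the same token, because no other caller can observe the state between the two writes.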

VALIDATE_LUA (worker self-check)

While a task is running, the worker periodically validates that it still owns the execution slot, that it hasn't been declared dead and resurrected while it was doing real work:

local lease_key = KEYS[1]
local fence_key = KEYS[2]

local expected = ARGV[1]

local lease = redis.call("GET", lease_key)
if lease ~= expected then
    return 0
end

local fence = redis.call("GET", fence_key)
if fence ~= expected then
    return 2
end

return 1

Return values: 1 = still current, 0 = invalid lease, 2 = stale fence.

The worker checks both the lease key and the fence key against its own token. If either has been overwritten by a resurrector — because the worker's heartbeat expired while it was GC-paused or doing slow I/O — the worker gets a non-1 return, cancels its own execution, and exits cleanly.

This is how Relier handles the zombie worker problem: the zombie doesn't get detected by an external process and killed — it self-detects during a validation check and stops. The heartbeat expiry + periodic VALIDATE_LUA check forms a cooperative detection loop: the resurrector re-queues the task, the zombie eventually validates and cancels itself.
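The cooperative loop can be sketched with plain asyncio. This is a minimal illustration, not Relier's worker code: `run_with_validation` and `fake_validate` are hypothetical names, and the `validate` callable stands in for whatever actually invokes VALIDATE_LUA and returns its 0, 1, or 2 code.

```python
import asyncio
from typing import Any, Callable, Coroutine

STILL_CURRENT = 1  # VALIDATE_LUA: 1 = still current, 0 = invalid lease, 2 = stale fence


async def run_with_validation(
    work: Callable[[], Coroutine[Any, Any, str]],
    validate: Callable[[], Coroutine[Any, Any, int]],
    interval: float,
) -> str:
    """Run work(), cancelling it once validate() stops returning 1."""
    task = asyncio.create_task(work())
    while True:
        # Wait for the task to finish, but wake up every `interval` seconds.
        done, _ = await asyncio.wait({task}, timeout=interval)
        if done:
            return task.result()
        if await validate() != STILL_CURRENT:
            task.cancel()              # ownership lost: stop doing work
            try:
                await task
            except asyncio.CancelledError:
                pass
            return "cancelled: ownership lost"


async def slow_task() -> str:
    await asyncio.sleep(1.0)           # stands in for real work
    return "done"


def fake_validate(replies):
    """Hypothetical stand-in for the script call: replays canned return codes."""
    codes = iter(replies)

    async def validate() -> int:
        return next(codes)

    return validate


outcome = asyncio.run(run_with_validation(slow_task, fake_validate([1, 1, 2]), 0.01))
print(outcome)  # cancelled: ownership lost
```

The third validation returns 2 (stale fence), so the worker cancels its own task well before the one-second job would have finished.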

COMMIT_CHECK_LUA (the last gate before writing results)

Even with VALIDATE_LUA running periodically, there is a window: the last validation passes, then the fence token expires or is replaced by a resurrector, then the task tries to write its result. Without a final gate, that write goes through even though the worker is now a zombie from Redis's perspective.

COMMIT_CHECK_LUA runs immediately before any result is committed to storage:

local fence_key = KEYS[1]
local expected = ARGV[1]

local current = redis.call("GET", fence_key)

if current ~= expected then
    return 0
end

return 1

0 = stale worker, result rejected. 1 = still current, write proceeds.

This is the commit protocol that makes "exactly-once execution" a verifiable claim rather than an optimistic assertion. The fence token is the proof of ownership. The Lua script makes reading that proof and deciding on it atomic: the GET and the comparison are the same Redis operation, so a resurrector cannot swap the token between them. The write itself still happens after the script returns, which is why the check runs immediately before the commit and not earlier in the task.

In our chaos tests, this script rejects stale commits from zombie workers on every run. The log line "zombie commit rejected" appears exactly as often as "GC-pause-length resurrection" events, one rejection per zombie wakeup. The math checks out.

ADMISSION_LUA (rate limiting without races)

The admission control script is the simplest of the set and the most illustrative of why Lua is the right tool:

local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
local limit = tonumber(ARGV[1])
if current > limit then
    return {0, current, redis.call('TTL', KEYS[1])}
end
return {1, current, 0}

KEYS[1] = the rate limit window key (e.g., rl:admission:global). ARGV[1] = the request limit. ARGV[2] = the window duration in seconds.

The correctness problem here is the INCR + EXPIRE pair. INCR creates the key if it doesn't exist and increments atomically. But if the EXPIRE ran as a separate command:

Worker A: INCR → 1          # first request in the window, so A owes the EXPIRE
Worker B: INCR → 2
Worker A: EXPIRE key 10     # fine here, but what if A crashed before this?

If the EXPIRE never runs (a crash, an exception, anything between the two commands), the key has no TTL. The counter accumulates forever. Every request after the window should have reset is rejected until someone manually clears the key.

Inside the Lua script, EXPIRE runs on the first increment (current == 1) in the same atomic block as the INCR. Either both happen or neither happens. The window key always has a TTL.

The returned TTL from the final TTL call becomes the Retry-After value in the HTTP 429 response. The client knows exactly when the window resets and when to retry. This is not an approximation; it is the live Redis TTL, accurate to the second.
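Mapping the script's three-element reply to an HTTP response might look like the following. It is a sketch, assuming the caller receives the reply as a list of integers; `admission_to_http` is a hypothetical helper, and the clamp to one second is a defensive choice of this example, not something the post describes.

```python
from typing import Dict, Sequence, Tuple


def admission_to_http(reply: Sequence[int]) -> Tuple[int, Dict[str, str]]:
    """Turn ADMISSION_LUA's {allowed, current, ttl} reply into (status, headers)."""
    allowed, _current, ttl = reply
    if allowed == 1:
        return 200, {}
    # TTL can be 0 or negative in edge cases (for example -1 when the key has
    # no expiry), so never tell a client to retry in less than one second.
    retry_after = max(int(ttl), 1)
    return 429, {"Retry-After": str(retry_after)}


print(admission_to_http([1, 17, 0]))     # (200, {})
print(admission_to_http([0, 5001, 7]))   # (429, {'Retry-After': '7'})
```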

Why EVALSHA matters on hot paths

Loading a Lua script into Redis via SCRIPT LOAD returns a SHA1 hash. Subsequent calls use EVALSHA instead of EVAL, so the script body never travels over the network again.

On a hot admission control path processing 5,000 requests per 10-second window, that is 5,000 round-trips that skip the script serialization and deserialization entirely. The Redis server has the script compiled in its script cache. The only network payload is the EVALSHA command, the key, and the arguments.
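The hash that SCRIPT LOAD returns is the SHA-1 hex digest of the script source, so a client can compute it locally and know in advance which digest to pass to EVALSHA. A minimal sketch, using the ADMISSION_LUA body from this post; `script_sha` is a hypothetical helper name:

```python
import hashlib

# Script body copied from the ADMISSION_LUA listing in this post.
ADMISSION_LUA = """\
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
local limit = tonumber(ARGV[1])
if current > limit then
    return {0, current, redis.call('TTL', KEYS[1])}
end
return {1, current, 0}
"""


def script_sha(script: str) -> str:
    """Return the 40-character digest Redis uses to identify a cached script."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()


sha = script_sha(ADMISSION_LUA)
print(len(sha))                                 # 40
print(sha == script_sha(ADMISSION_LUA + " "))   # False
```

The digest covers every byte of the source, so even a whitespace change produces a different SHA and a separate cache entry on the server.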

If the Redis server is restarted, the script cache is cleared. EVALSHA will return NOSCRIPT. Relier handles this with a fallback: on NoScriptError, reload the script and retry:

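from redis.exceptions import NoScriptError

async def _evalsha_with_fallback(self, redis_client, sha, window_key, *args):
    try:
        # Fast path: the script is already in the server's script cache.
        return await redis_client.evalsha(sha, 1, window_key, *args)
    except NoScriptError:
        # Cache was cleared (for example by a restart): reload once, then retry.
        self._script_sha = await redis_client.script_load(ADMISSION_LUA)
        return await redis_client.evalsha(self._script_sha, 1, window_key, *args)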

One reload on restart. Every call after that is EVALSHA again.

Our benchmark warmup phase (100 iterations discarded before measurement) covers exactly this: Redis Lua scripts load on first call, connection pools establish, the asyncio loop settles. By the time we start measuring, every script is loaded and every subsequent call hits the SHA cache. The p99 0.559ms admission control latency reflects the warm-path cost, not the cold-start cost.

What I learned

The failure modes Lua scripts prevent are all variations of the same shape: check a condition, act on it. In a single-threaded program, that is always safe. In a distributed system, the gap between the check and the action is where concurrent writes slip through.

The non-obvious lesson is that this problem does not get easier as your system gets faster. Faster workers mean more concurrent check-then-act operations per second, which means more collisions per second, which means more corrupted state per second. Atomicity requirements get harder to satisfy as throughput increases, not easier.

The six scripts in Relier each close a specific gap. ACQUIRE_LUA closes the claim race. RELEASE_LUA closes the stale-release race. RESURRECT_LUA closes the double-resurrection race. VALIDATE_LUA closes the zombie-detection gap. COMMIT_CHECK_LUA closes the stale-commit window. ADMISSION_LUA closes the TTL-expiry race.

None of these are novel patterns. Distributed systems literature has described all of them. What took time was mapping each abstract race to a concrete failure in a running test, watching it reproduce, and then verifying that the Lua script eliminated it.

The chaos suite in the Relier repo exists for exactly this: run it against your own cluster, on your own Redis, with your own task code. The correctness claims should survive your environment, not just ours.

GitHub: https://github.com/getrelier/relier
Docs: https://getrelier.github.io/relier
Architecture reference (all scripts): https://getrelier.github.io/relier/architecture/
pip install relier