Distributed Scheduled Job Locking with Redis

#backend #redis

We had a simple scheduled job: every hour, pull new products from a third-party API. It worked fine on one server. Then we scaled to ten.

Suddenly, every hour, all ten servers fired the same job at the same time. Ten servers hitting the same third-party API for the same data, ten times the work for a job that only needed to run once. It was pure waste: extra load, extra API calls, extra cost, for nothing. And in the worst case, if the job wasn't careful about how it wrote to the database, we could end up with duplicate rows or repeated updates piling up underneath all that wasted effort.

Here's how we fixed it.

The fix: only one server should win

The idea is simple. Every server's scheduler still fires on time, same as before. But before any of them actually do the work, they race to grab a lock. Whoever gets it runs the job. Everyone else just steps aside.

If the winner finishes, it lets go of the lock. If it crashes mid-job, the lock lets go on its own after a short timeout, so nobody's stuck waiting, and the next scheduled run picks it up fresh.

That's the whole idea. No permanent "leader" to manage, no failover logic to write by hand. Just a lock, held only for as long as the job runs.

Redis does this in one line

SET lock:import-products <unique-token> NX EX 60

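NX means "only set this if it doesn't already exist." That's the lock itself. EX 60 means it disappears on its own after 60 seconds, so a crash never leaves it stuck. Both happen in one atomic step, so there's no gap for two servers to slip through at once. The 60 is only an example: pick a timeout comfortably longer than the job's worst-case runtime, because if the job outlives its lock, a second server can start while the first is still working.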

That's it. One command, one Redis instance most teams already have running anyway.
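To make the NX and EX behavior concrete, here is a toy in-memory model of that one command. It is only a sketch for illustration and does not talk to Redis: the FakeLockStore class, its setNxEx and advance methods, and the hand-driven clock are all made up for this example.

```javascript
// Toy in-memory model of SET key value NX EX ttl. Not a Redis client:
// the class, its methods, and the manual clock are hypothetical.
class FakeLockStore {
  constructor() {
    this.now = 0;             // current time in seconds, advanced by hand
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  advance(seconds) {
    this.now += seconds;
  }

  // Mirrors SET ... NX EX: 'OK' if the key was set, null if a live key exists.
  setNxEx(key, value, ttlSeconds) {
    const entry = this.entries.get(key);
    const live = entry !== undefined && entry.expiresAt > this.now;
    if (live) return null;
    this.entries.set(key, { value, expiresAt: this.now + ttlSeconds });
    return 'OK';
  }
}

const store = new FakeLockStore();
const first = store.setNxEx('lock:import-products', 'server-a', 60);
const second = store.setNxEx('lock:import-products', 'server-b', 60);
store.advance(61); // server-a crashed and never released; the TTL runs out
const afterExpiry = store.setNxEx('lock:import-products', 'server-c', 60);

console.log(first, second, afterExpiry); // OK null OK
```

The first caller wins, the second is turned away, and once the timeout passes the lock is free again without anyone cleaning it up.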

Putting it into code

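const { createClient } = require('redis');
const cron = require('node-cron'); // assumed scheduler; cron.schedule below matches its API
const crypto = require('crypto');

const client = createClient({ url: 'redis://localhost:6379' });
client.on('error', (err) => console.error('Redis error:', err));

// Top-level await is unavailable in CommonJS modules, so start the connection without it
client.connect().catch((err) => console.error('Redis connection failed:', err));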

async function runWithLock(jobName, jobFn, { ttlSeconds = 60 } = {}) {
  const lockKey = `lock:${jobName}`;
  const token = crypto.randomUUID();

  const acquired = await client.set(lockKey, token, { NX: true, EX: ttlSeconds });
  if (!acquired) {
    console.log(`[${jobName}] Another server has it, skipping`);
    return;
  }

  try {
    await jobFn();
  } finally {
    // Only release the lock if it's still ours, never blindly delete it
    const releaseScript = `
      if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
      else
        return 0
      end
    `;
    await client.eval(releaseScript, { keys: [lockKey], arguments: [token] });
  }
}

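cron.schedule('0 * * * *', () => {
  runWithLock('import-products', async () => {
    await importProductsFromThirdParty();
  }).catch((err) => {
    // Without this, a failed run surfaces as an unhandled promise rejection
    console.error('[import-products] run failed:', err);
  });
});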

Two small things make this safe rather than just simple.

The first is the random token. If you released the lock with a plain DEL, here's the trap: your lock expires because the job ran a little long, another server grabs it and starts working, and then your own cleanup code fires and deletes their lock. Now a third server can jump in too. Tagging each attempt with its own token, and only deleting when the token still matches, closes that door.

The second is the tiny Lua script for the release. Checking the token and deleting need to happen as one atomic step. Redis runs Lua scripts that way, so there's no gap for another server to slide into between the check and the delete.
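The plain DEL trap is easier to see when replayed step by step. The sketch below uses a toy in-memory lock rather than Redis; makeLock, replay, and the token names are hypothetical and exist only to walk through the timeline described above.

```javascript
// Toy in-memory stand-in for a single Redis lock key; all names are hypothetical.
function makeLock() {
  let holder = null; // token of the current holder, or null when the lock is free
  return {
    acquire(token) {
      if (holder !== null) return false;
      holder = token;
      return true;
    },
    expire() { holder = null; },       // simulates the TTL running out
    blindRelease() { holder = null; }, // plain DEL, no ownership check
    safeRelease(token) {               // compare the token, then delete, as one step
      if (holder !== token) return false;
      holder = null;
      return true;
    },
    currentHolder() { return holder; },
  };
}

function replay(release) {
  const lock = makeLock();
  lock.acquire('token-a'); // server A starts the job
  lock.expire();           // A runs long and its lock times out
  lock.acquire('token-b'); // server B legitimately takes over
  release(lock);           // A finally finishes and runs its cleanup
  return lock.currentHolder();
}

const afterBlind = replay((lock) => lock.blindRelease());
const afterSafe = replay((lock) => lock.safeRelease('token-a'));

console.log(afterBlind); // null: B's lock was deleted, so a third server could get in
console.log(afterSafe);  // token-b: A's stale release was ignored
```

In real Redis the compare and the delete are two separate commands, which is why the release in runWithLock wraps them in a Lua script instead of calling GET and DEL from Node.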

What happens when things go wrong

If a server dies before grabbing the lock, nothing happens. It just never started.

If a server dies after grabbing the lock, the lock sits there until its timer runs out, then lets go on its own. Nobody picks up the half-finished work mid-run. The next scheduled tick just starts clean.

If Redis itself restarts without persistence, or before the lock was written to disk, the lock disappears with it, so briefly anyone can grab one. That's rare, and usually fine to live with.

One more piece: make the writes themselves safe too

Locking stops two servers from running at once, as long as the job finishes before the lock expires. It doesn't promise a job finishes cleanly every time. If a run dies halfway through, the next attempt might touch some of the same records again.

The fix is to write in a way that's safe to repeat. Upsert by a stable ID instead of blindly inserting, so running the same import twice doesn't double anything up. Locking handles the "only one at a time" problem. Idempotent writes handle the "what if it runs twice anyway" problem. You want both.
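Here is a small sketch of the difference, using an array and a Map as stand-ins for a database table. The externalId field is an assumed stable ID from the third-party API, and insertBlindly and upsertById are hypothetical helper names, not part of the import job above.

```javascript
// In-memory stand-ins for a products table. `externalId` is an assumed stable ID
// supplied by the third-party API; both function names are hypothetical.
function insertBlindly(rows, products) {
  for (const product of products) rows.push({ ...product });
  return rows;
}

function upsertById(table, products) {
  for (const product of products) {
    const existing = table.get(product.externalId) ?? {};
    // Merge into the existing row if there is one, otherwise create it
    table.set(product.externalId, { ...existing, ...product });
  }
  return table;
}

const batch = [
  { externalId: 'sku-1', name: 'Kettle', price: 25 },
  { externalId: 'sku-2', name: 'Toaster', price: 40 },
];

const rows = [];
insertBlindly(rows, batch);
insertBlindly(rows, batch); // the same import runs a second time

const table = new Map();
upsertById(table, batch);
upsertById(table, batch);

console.log(rows.length); // 4: every product is duplicated
console.log(table.size);  // 2: the repeat run changed nothing
```

Against a real database the same idea is usually expressed in SQL. In PostgreSQL, for example, that is INSERT ... ON CONFLICT (...) DO UPDATE on a column with a unique constraint.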