<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fathma Siddique</title>
    <description>The latest articles on DEV Community by Fathma Siddique (@fathma).</description>
    <link>https://dev.to/fathma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F576254%2F8e104d42-04b3-4083-a7ad-3e9184835e27.JPG</url>
      <title>DEV Community: Fathma Siddique</title>
      <link>https://dev.to/fathma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fathma"/>
    <language>en</language>
    <item>
      <title>Dealing with Long-Running Kafka Consumers and Message Backlogs</title>
      <dc:creator>Fathma Siddique</dc:creator>
      <pubDate>Wed, 25 Feb 2026 18:44:52 +0000</pubDate>
      <link>https://dev.to/fathma/dealing-with-long-running-kafka-consumers-and-message-backlogs-522h</link>
      <guid>https://dev.to/fathma/dealing-with-long-running-kafka-consumers-and-message-backlogs-522h</guid>
      <description>&lt;p&gt;For a while I genuinely could not figure out what was wrong.&lt;br&gt;
Nothing was throwing errors. The service was running. But messages were piling up, some were being processed twice, and the lag just kept climbing. I kept waiting for it to sort itself out. It did not.&lt;br&gt;
Eventually I had to sit down and actually trace through what was happening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
Our consumer was doing too much. Each message triggered external API calls, some heavy business logic, blocking operations. Individually, a message might take a couple of minutes to get through. That feels manageable until you remember that Kafka's default &lt;code&gt;max.poll.records&lt;/code&gt; is 500. Pull a batch of even a handful of slow messages, and the cumulative processing time blows past Kafka's default &lt;code&gt;max.poll.interval.ms&lt;/code&gt; of 5 minutes without much effort.&lt;br&gt;
When that happens, Kafka assumes the consumer has died. It triggers a rebalance, reassigns the partitions, and those same messages get picked up and processed all over again.&lt;br&gt;
That was our loop. Consumer pulls a batch, gets bogged down processing it, Kafka loses patience, rebalance happens, repeat.&lt;/p&gt;
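&lt;p&gt;The arithmetic behind that loop is worth spelling out. A quick sketch in Python (illustrative numbers only, using the defaults mentioned above):&lt;/p&gt;

```python
# Back-of-envelope check: with slow messages, how many fit in one poll
# interval before Kafka assumes the consumer is dead and rebalances?
MAX_POLL_INTERVAL_MS = 5 * 60 * 1000   # Kafka's default: 5 minutes
PER_MESSAGE_MS = 2 * 60 * 1000         # a slow message: roughly 2 minutes

def slow_messages_before_timeout(per_message_ms, max_poll_interval_ms):
    """Number of slow messages a consumer can finish inside one poll interval."""
    return max_poll_interval_ms // per_message_ms

budget = slow_messages_before_timeout(PER_MESSAGE_MS, MAX_POLL_INTERVAL_MS)
print(budget)  # 2: a batch with a third slow message already overruns the interval
```

&lt;p&gt;Against a default batch size of 500, a budget of two slow messages per poll is nothing.&lt;/p&gt;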

&lt;p&gt;&lt;strong&gt;What We Did&lt;/strong&gt;&lt;br&gt;
The first thing was just to stop the bleeding. We bumped &lt;code&gt;max.poll.interval.ms&lt;/code&gt; up to 8 minutes to give the consumer a bit more breathing room. Rebalances stopped almost immediately. That was a relief, but it was a band-aid not a fix.&lt;/p&gt;

&lt;p&gt;Next we set &lt;code&gt;max.poll.records = 1&lt;/code&gt;. One message at a time. With each message taking a couple of minutes, pulling any larger batch was just asking for trouble. Throughput dropped considerably, but at least the system was stable and we could reason about it.&lt;/p&gt;
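&lt;p&gt;Put together, the tuned settings look roughly like this. This is a hedged sketch only: the keys are written in Java-client property style, the broker address and group id are placeholders, and the exact names depend on your client library.&lt;/p&gt;

```python
# Illustrative consumer settings (property names as in the Java client;
# check your own client's spelling before copying anything).
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "slow-worker-group",        # placeholder
    "max.poll.interval.ms": 8 * 60 * 1000,  # raised from the 5-minute default
    "max.poll.records": 1,                  # Java client: one message per poll
    "enable.auto.commit": False,            # commit manually after each success
}
```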

&lt;p&gt;We also dropped auto-commit and switched to manual offset commits. Honestly we should have done this from the start. Auto-commit quietly marks messages as done on a timer whether processing actually succeeded or not. Manual commits meant we knew exactly what had been handled and what had not.&lt;/p&gt;
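&lt;p&gt;What manual commits buy you is easiest to see in a toy in-memory model, no broker or client library involved, just the offset logic:&lt;/p&gt;

```python
# The offset advances only after a message is fully processed, so a
# failure mid-message means redelivery rather than silent loss.
def consume(messages, handler, committed_offset=0):
    """Process messages from committed_offset on; return the new committed offset."""
    for offset in range(committed_offset, len(messages)):
        try:
            handler(messages[offset])
        except Exception:
            break  # offset stays at the last success; this message gets retried
        committed_offset = offset + 1  # manual commit, only after success
    return committed_offset

def flaky(msg):
    if msg == "bad":
        raise ValueError("processing failed")

print(consume(["a", "b", "bad", "c"], flaky))  # 2: "bad" is never marked done
```

&lt;p&gt;With auto-commit, the timer could have marked that failed message as done before it ever succeeded.&lt;/p&gt;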

&lt;p&gt;Kafka consumers are not meant to do heavy, long-running work.&lt;br&gt;
After things stabilised, we redesigned the flow so that the consumer became lightweight. It would validate the message and quickly hand off the heavy work to background workers. Kafka went back to doing what it is good at: moving data fast. And our system stopped fighting it.&lt;/p&gt;
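&lt;p&gt;The hand-off pattern, sketched with Python's standard library (our real pipeline used proper background workers, but the shape is the same: validate cheaply, enqueue, return to polling):&lt;/p&gt;

```python
import queue
import threading

work = queue.Queue()
results = []

def worker():
    # Heavy, slow processing lives here, off the consumer's polling path.
    while True:
        item = work.get()
        if item is None:
            break  # shutdown signal
        results.append(item.upper())  # stand-in for the expensive work
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

def on_message(msg):
    # The consumer callback: cheap validation only, no blocking calls.
    if msg:
        work.put(msg)

for msg in ["a", "b", "c"]:
    on_message(msg)
work.join()  # a real consumer would keep polling instead of joining
for _ in threads:
    work.put(None)
```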

&lt;p&gt;We also added retries and a dead letter queue so one broken message could not drag everything else down with it.&lt;/p&gt;
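&lt;p&gt;A minimal sketch of the retry-then-dead-letter logic (plain Python, all names illustrative):&lt;/p&gt;

```python
# A message gets a few attempts; if it still fails, it is parked in a
# dead letter queue so the rest of the stream keeps flowing.
dead_letters = []

def handle_with_retries(msg, handler, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            handler(msg)
            return True
        except Exception:
            continue  # a real system would back off between attempts
    dead_letters.append(msg)  # park it for inspection and manual replay
    return False

def always_fails(msg):
    raise RuntimeError("downstream service rejected the message")

handle_with_retries("poison-pill", always_fails)
print(dead_letters)  # ['poison-pill']
```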

&lt;p&gt;&lt;strong&gt;What Stuck With Me&lt;/strong&gt;&lt;br&gt;
I think I was treating Kafka like a job queue because that is what felt familiar. But it is not that. It is a streaming system that expects you to keep up with it. The moment you do slow, heavy work inside the consumer, you are borrowing time you do not have.&lt;br&gt;
Once we aligned with how Kafka actually works, everything got simpler. The lag cleared. The rebalances stopped. The system finally felt like it was running the way it was supposed to.&lt;br&gt;
Sometimes the fix is technical. But sometimes you just have to admit the design was wrong.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>dataengineering</category>
      <category>distributedsystems</category>
      <category>performance</category>
    </item>
    <item>
      <title>Optimizing Real-Time Location Tracking: A System-Wide Approach</title>
      <dc:creator>Fathma Siddique</dc:creator>
      <pubDate>Fri, 12 Dec 2025 15:24:56 +0000</pubDate>
      <link>https://dev.to/fathma/optimizing-real-time-location-tracking-a-system-wide-approach-2oa2</link>
      <guid>https://dev.to/fathma/optimizing-real-time-location-tracking-a-system-wide-approach-2oa2</guid>
      <description>&lt;p&gt;I recently worked on a location tracking feature that was causing major problems. The app would drain phone batteries quickly, the server costs were getting expensive, and users were complaining about lag. Here's how I fixed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔴 What Was Wrong
&lt;/h2&gt;

&lt;p&gt;The system had some serious issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phones were losing &lt;strong&gt;20-30% battery every hour&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The app was sending way too many updates through Socket.IO&lt;/li&gt;
&lt;li&gt;The database was handling &lt;strong&gt;thousands of writes every minute&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Server costs kept increasing as more people used the app&lt;/li&gt;
&lt;li&gt;The map would freeze and lag when updating locations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔍 Why These Problems Happened
&lt;/h2&gt;

&lt;p&gt;After checking the logs and monitoring the system, I found the main causes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app was sending location updates &lt;strong&gt;every single second&lt;/strong&gt; via Socket.IO&lt;/li&gt;
&lt;li&gt;Every time someone moved, the server sent &lt;strong&gt;everyone's locations to all users&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Every location update was being saved to the database &lt;strong&gt;immediately&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPS was set to &lt;strong&gt;maximum accuracy all the time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;There was &lt;strong&gt;no caching system&lt;/strong&gt; to handle the load&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  ⚡ How I Fixed It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Sending Only What Changed and Filtering Insignificant Movements
&lt;/h3&gt;

&lt;p&gt;This was the biggest improvement. Instead of the previous approach, I implemented two key changes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side filtering:&lt;/strong&gt; The phone only sends location updates when movement exceeds 10 meters, eliminating unnecessary network calls&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selective broadcasting:&lt;/strong&gt; The server broadcasts only the changed user's location instead of sending everyone's data to all users&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Socket.IO traffic dropped by &lt;strong&gt;60%&lt;/strong&gt;, which made everything much faster.&lt;/p&gt;
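&lt;p&gt;The filter itself is just a distance check against the last fix that was actually sent. A sketch of the logic in Python (the production code runs on the phone, so treat this as pseudocode for the idea rather than the actual client code):&lt;/p&gt;

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def should_send(last_sent, current, threshold_m=10.0):
    """Send only the first fix, or a fix more than threshold_m from the last sent one."""
    if last_sent is None:
        return True
    return haversine_m(*last_sent, *current) > threshold_m

print(should_send((23.8103, 90.4125), (23.8103, 90.4125)))  # False: no movement
```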

&lt;h3&gt;
  
  
  2. Adding a Caching System
&lt;/h3&gt;

&lt;p&gt;I set up Redis to handle the constant location updates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The caching strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The server stores the latest location in Redis with a &lt;strong&gt;2-minute TTL&lt;/strong&gt; to remove stale data&lt;/li&gt;
&lt;li&gt;A background job saves active users' locations to the main database &lt;strong&gt;every 60 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;When a user stops tracking, their &lt;strong&gt;final location is immediately saved&lt;/strong&gt; to the database before marking them inactive. This ensures we never lose the end point of a journey&lt;/li&gt;
&lt;li&gt;Start/stop events instantly update the active-users list in Redis so the map doesn't show inactive devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Database writes dropped by &lt;strong&gt;90%&lt;/strong&gt;, which saved a lot on server costs.&lt;/p&gt;
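&lt;p&gt;The caching behaviour can be modelled without Redis at all. The sketch below (Python for illustration; the real system uses Redis from Node) keeps the two moving parts, a TTL on each fix and a periodic batch flush, with the clock injected so expiry is easy to see:&lt;/p&gt;

```python
class LocationCache:
    TTL_SECONDS = 120  # the 2-minute TTL from the real Redis setup

    def __init__(self, clock):
        self.clock = clock   # callable returning the current time in seconds
        self.store = {}      # user_id -> (lat, lon, written_at); the Redis stand-in
        self.database = {}   # stand-in for the durable database

    def update(self, user_id, lat, lon):
        self.store[user_id] = (lat, lon, self.clock())

    def active_locations(self):
        """Fixes younger than the TTL; everything else counts as stale."""
        now = self.clock()
        return {uid: (lat, lon)
                for uid, (lat, lon, ts) in self.store.items()
                if self.TTL_SECONDS >= now - ts}

    def flush(self):
        """The 60-second background job: persist active fixes in one batch."""
        self.database.update(self.active_locations())

t = [0]
cache = LocationCache(clock=lambda: t[0])
cache.update("u1", 23.8, 90.4)
t[0] = 200  # past the TTL with no fresh fix from u1
print(cache.active_locations())  # {}: the stale entry has expired
```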

&lt;h3&gt;
  
  
  3. Better GPS Settings
&lt;/h3&gt;

&lt;p&gt;I changed how the phone's GPS works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;balanced accuracy&lt;/strong&gt; instead of maximum (saves battery)&lt;/li&gt;
&lt;li&gt;Send updates only when movement exceeds &lt;strong&gt;10 meters&lt;/strong&gt; (reduces network usage and battery drain)&lt;/li&gt;
&lt;li&gt;Background updates use OS-recommended &lt;strong&gt;minimum intervals&lt;/strong&gt; to reduce battery drain further&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Battery life improved by &lt;strong&gt;60-70%&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fixing the Map Display
&lt;/h3&gt;

&lt;p&gt;I optimized map re-rendering by updating only the affected markers and memoizing static elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key optimizations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only update markers for users whose locations actually changed&lt;/li&gt;
&lt;li&gt;Memoize static map elements to prevent unnecessary re-renders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The map stayed smooth with no lag, even with many active users.&lt;/p&gt;
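&lt;p&gt;The marker diffing boils down to comparing two snapshots and re-rendering only the difference. Sketched in Python for brevity (the real code lives in the map component):&lt;/p&gt;

```python
def changed_markers(prev, curr):
    """User ids whose marker moved, appeared, or disappeared between snapshots."""
    changed = set()
    for uid in set(prev) | set(curr):
        if prev.get(uid) != curr.get(uid):
            changed.add(uid)
    return changed

prev = {"a": (1.0, 2.0), "b": (3.0, 4.0)}
curr = {"a": (1.0, 2.0), "b": (3.5, 4.0), "c": (5.0, 6.0)}
print(sorted(changed_markers(prev, curr)))  # ['b', 'c']: 'a' is left untouched
```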

&lt;h2&gt;
  
  
  📊 The Results
&lt;/h2&gt;

&lt;p&gt;After all these changes, the improvements were significant:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Battery life&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60-70% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Socket.IO traffic&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database writes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server costs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50% reduction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Map performance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No lag&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data integrity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero data loss&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  💡 What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-time doesn't mean every second:&lt;/strong&gt; Most apps don't need constant updates. Sending data only when the user has moved a meaningful distance (10+ meters) saves battery and network resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only sending what changed:&lt;/strong&gt; Broadcasting only the updated user's location instead of everyone's data was a game-changer for network efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use caching wisely:&lt;/strong&gt; Redis helped handle the constant flow of updates without overloading the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPS accuracy costs battery:&lt;/strong&gt; High accuracy mode drains battery really fast. Balanced mode works great for most cases and users can't tell the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle stop events properly:&lt;/strong&gt; Immediately saving the final location when users stop tracking prevents data loss and ensures complete trip records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filter on the client when possible:&lt;/strong&gt; Processing location changes on the phone before sending them to the server reduces network traffic and server load.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔄 How It All Works Now
&lt;/h2&gt;

&lt;p&gt;Here's the complete flow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐
│  User starts    │
│   tracking      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Phone GPS with  │
│ balanced mode   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Filter: Only    │
│ send if moved   │
│ 10+ meters      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Server stores   │
│ in Redis        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Broadcast only  │
│ changed user    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Background job  │
│ saves to DB     │
│ every 60s       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ User stops:     │
│ Immediate save  │
│ to database     │
└─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Detailed steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User starts tracking:&lt;/strong&gt; Phone GPS begins collecting location with balanced accuracy mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location filtering:&lt;/strong&gt; Phone only sends updates to server when movement exceeds 10 meters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server processing:&lt;/strong&gt; Server stores the latest location in Redis and immediately broadcasts only the changed user's data via Socket.IO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Periodic saves:&lt;/strong&gt; Background job runs every 60 seconds, saving all active users' locations from Redis to the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User stops tracking:&lt;/strong&gt; Server immediately saves the final location to database with &lt;code&gt;active=false&lt;/code&gt;, removes user from active list in Redis, and broadcasts stop event to other users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New user joins:&lt;/strong&gt; When a user opens the app, the server fetches all current active locations from Redis and sends them once as an initial payload&lt;/li&gt;
&lt;/ol&gt;
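&lt;p&gt;The steps above can be compressed into a toy end-to-end model, with Socket.IO replaced by plain in-memory inboxes and Redis by a dict (Python for illustration; every name here is made up for the sketch). A new client gets one snapshot on join, and an update fans out only the changed user's fix:&lt;/p&gt;

```python
class Tracker:
    def __init__(self):
        self.latest = {}   # user_id -> (lat, lon); the Redis stand-in
        self.clients = {}  # client_id -> list of delivered events

    def join(self, client_id):
        # Step 6: one full snapshot as the initial payload.
        self.clients[client_id] = [("snapshot", dict(self.latest))]

    def update(self, user_id, lat, lon):
        # Step 3: cache the fix, then broadcast only this user's change.
        self.latest[user_id] = (lat, lon)
        for cid, inbox in self.clients.items():
            if cid != user_id:
                inbox.append(("update", user_id, (lat, lon)))

tracker = Tracker()
tracker.join("alice")
tracker.update("bob", 23.8, 90.4)
print(tracker.clients["alice"][-1])  # ('update', 'bob', (23.8, 90.4))
```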

&lt;h2&gt;
  
  
  🎯 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Fixing location tracking isn't about finding one magic solution—it's about making thoughtful decisions at each layer of the system. From GPS collection to network transmission to database storage, small improvements compound into significant gains.&lt;br&gt;
I learned that handling edge cases (like user stops and starts) and filtering unnecessary updates early in the pipeline can prevent much bigger problems downstream.&lt;br&gt;
These optimizations helped turn a problematic feature into something more reliable and cost-effective, though there's always room to learn and improve further. Every system has its unique constraints, and what worked here might need adjustment for different use cases.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>optimization</category>
      <category>architecture</category>
      <category>node</category>
    </item>
  </channel>
</rss>
