Ahmed Rakan

Designing and Implementing a Simple, Yet Powerful, Distributed Job Scheduler

Building a Distributed Background Job Scheduler with Redis and Node.js

The Problem

Complex workflows in distributed systems often require background processing that can take minutes to complete. Traditional vertical scaling approaches are cost-prohibitive due to low average utilization, while horizontal scaling introduces race conditions where multiple workers might process the same job.

Architecture Design

When designing distributed systems, we must balance the CAP theorem constraints: Consistency, Availability, and Partition tolerance.

Leader-Follower vs Leaderless Architecture

Leader-Follower Pattern:

  • Maximum consistency through centralized job scheduling
  • Complex implementation with single points of failure
  • Unnecessary overhead for most use cases

Leaderless Pattern (Recommended):

  • Maximizes availability and partition tolerance
  • Embraces eventual consistency
  • Simpler implementation and better fault tolerance

Redis-Based Solution

Redis Lists provide atomic operations that eliminate race conditions inherently. The key insight is using Redis as both a distributed lock and a queue system.

Key Technical Benefits

  1. Atomic Operations: LPUSH/RPUSH and BLPOP/BRPOP are atomic
  2. Multiple Queue Support: Separate queues for jobs, retries, and priorities
  3. Priority Queues: Enterprise vs standard user job prioritization
  4. Horizontal Scaling: Deploy multiple worker nodes behind a load balancer

Implementation

import Redis from 'ioredis';
import cron from 'node-cron';

const redisClient = new Redis();

// Define priority levels (lower number = higher priority)
const PRIORITY_LISTS = [1, 2, 3].map(p => `jobs:priority:${p}`);

// Enqueue job
export const enqueueJob = async (job: string, priority = 2) => {
  const key = `jobs:priority:${priority}`;
  await redisClient.rpush(key, job); // ioredis commands are lowercase
  console.log(`Enqueued job "${job}" with priority ${priority}`);
};

// Process a single job from the highest available priority
export const processJob = async () => {
  for (const key of PRIORITY_LISTS) {
    const job = await redisClient.lpop(key); // atomic pop
    if (job) {
      console.log(`Processing job "${job}" from ${key}`);
      // Simulate work
      if (job === 'foo') console.log('Doing foo work...');
      if (job === 'boo') console.log('Doing boo work...');
      return; // only one job per tick
    }
  }
  console.log('No jobs found');
};

// Scheduler
export const initializeScheduler = () => {
  // Run every 5 seconds
  cron.schedule('*/5 * * * * *', async () => {
    console.log('CRON: Checking jobs...');
    await processJob();
  });
  console.log('Scheduler initialized');
};

// Demo
(async () => {
  await enqueueJob('foo', 1);  // highest priority
  await enqueueJob('boo', 3);  // lowest priority
  await enqueueJob('foo', 2);

  initializeScheduler();
})();


Scaling Strategy

  1. Deploy Multiple Instances: Each worker node runs the same scheduler code
  2. Load Balancer: Distribute API traffic across instances
  3. Redis Cluster: Scale Redis horizontally for high-throughput scenarios
  4. Monitoring: Track queue lengths, processing times, and failure rates (a minimal sketch follows below)
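
As a minimal sketch of the monitoring point, assuming the same ioredis client and queue names used in the implementation above and an arbitrary 30-second sampling interval, queue depths can be polled with LLEN:

import Redis from 'ioredis';

const monitorClient = new Redis();

// Same queue names as the implementation above.
const PRIORITY_LISTS = [1, 2, 3].map(p => `jobs:priority:${p}`);

// Log the number of pending jobs per queue; in production you would
// push these numbers to your metrics system instead of the console.
export const reportQueueDepths = async () => {
  for (const key of PRIORITY_LISTS) {
    const depth = await monitorClient.llen(key);
    console.log(`queue ${key}: ${depth} pending job(s)`);
  }
};

// Sample every 30 seconds (arbitrary interval for this sketch).
setInterval(reportQueueDepths, 30_000);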

Key Advantages

  • Race Condition Free: Redis atomic operations eliminate duplicate processing
  • Cost Effective: Horizontal scaling with automatic load distribution
  • Fault Tolerant: Failed jobs can be retried with exponential backoff (a retry sketch appears below; it is omitted from the minimal implementation)
  • Priority Support: Critical jobs process before standard ones
  • Simple Deployment: Stateless workers behind a load balancer

Limitation

  • Connection Lock-in: BLPOP / BRPOP hold the Redis connection open until a job arrives or the timeout is hit. Each worker process needs its own connection, which may lead to a connection storm (all workers waking up at once).

We can handle this limitation by splitting the work into multiple queues; jobs are enqueued in the proper queue based on their priority.
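
As a hedged sketch of how the blocking variant could coexist with the multi-queue split (this is not part of the implementation above): give each worker its own dedicated connection and pass all priority queues to a single BLPOP call. Redis pops from the first non-empty key in the order given, which preserves priority.

import Redis from 'ioredis';

// BLPOP blocks the connection, so the worker uses a dedicated client,
// separate from the client used for enqueueing.
const blockingClient = new Redis();

export const blockingWorkerLoop = async () => {
  while (true) {
    // Wait up to 5 seconds; BLPOP checks the keys in order, so the
    // higher-priority queues are always drained first.
    const result = await blockingClient.blpop(
      'jobs:priority:1',
      'jobs:priority:2',
      'jobs:priority:3',
      5
    );
    if (!result) continue; // timeout hit, loop again

    const [queue, job] = result;
    console.log(`Processing job "${job}" from ${queue}`);
    // ... do the actual work here ...
  }
};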

The simple implementation also omits retry logic on purpose; I wanted to provide something you can play with.

There are multiple ways to implement retries, but here is a simple one:

We can implement this by storing an integer counter under the key "retry:jobId" and incrementing it each time the job fails. We must cap the number of retries, otherwise the job loops forever. When a job fails permanently, we simply log it to the database, or a new requirement can be designed for that scenario.
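
A minimal sketch of that counter, assuming each job carries an id; the names handleFailedJob and MAX_RETRIES, and the cap of 3, are illustrative rather than part of the implementation above:

import Redis from 'ioredis';

const redisClient = new Redis();

// Arbitrary cap for this sketch.
const MAX_RETRIES = 3;

export const handleFailedJob = async (jobId: string, job: string, priority = 2) => {
  // INCR is atomic, so concurrent workers cannot double-count a retry.
  const attempts = await redisClient.incr(`retry:${jobId}`);

  if (attempts <= MAX_RETRIES) {
    // Put the job back on its priority queue for another attempt.
    await redisClient.rpush(`jobs:priority:${priority}`, job);
    console.log(`Retry ${attempts}/${MAX_RETRIES} scheduled for job ${jobId}`);
  } else {
    // Give up: log it (or move it to a dead-letter store / database).
    console.error(`Job ${jobId} failed permanently after ${MAX_RETRIES} retries`);
    await redisClient.del(`retry:${jobId}`);
  }
};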

Regarding monitoring and analytics of jobs, we can use Redis bitmaps to keep memory and CPU overhead minimal.
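
A sketch of how that could look, assuming jobs are assigned small sequential numeric ids at enqueue time (the key naming and helper functions below are illustrative):

import Redis from 'ioredis';

const redisClient = new Redis();

// Mark a job as completed in today's bitmap. One bit per job keeps
// the footprint tiny even for millions of jobs.
export const markJobCompleted = async (jobSeq: number) => {
  const day = new Date().toISOString().slice(0, 10); // e.g. "2025-01-31"
  await redisClient.setbit(`jobs:completed:${day}`, jobSeq, 1);
};

// Count how many jobs completed on a given day.
export const countCompleted = async (day: string) => {
  return redisClient.bitcount(`jobs:completed:${day}`);
};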

We also need to handle graceful shutdown and edge cases such as a node failing halfway through a job, or crashing entirely.
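
A minimal sketch of graceful shutdown for the cron-based worker above (the shuttingDown flag, currentJob promise, and import path are illustrative assumptions): stop picking up new jobs on SIGTERM and let the in-flight job finish before exiting.

import { processJob } from './scheduler'; // hypothetical path to the module above

let shuttingDown = false;
let currentJob: Promise<void> | null = null;

// Called from the cron tick instead of calling processJob directly.
export const tick = async () => {
  if (shuttingDown) return; // stop taking new work
  currentJob = processJob();
  await currentJob;
  currentJob = null;
};

process.on('SIGTERM', async () => {
  shuttingDown = true;
  if (currentJob) await currentJob; // let the in-flight job finish
  console.log('Worker drained, exiting');
  process.exit(0);
});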

We could also persist jobs in a database for durability.

Take this as a simple, works-out-of-the-box design and implementation; I spent a few days on it, whereas a more robust architecture takes weeks of research and development.

This architecture provides a robust, scalable solution for distributed background job processing while maintaining simplicity and cost efficiency. There are more robust designs out there, but here we focused heavily on simplicity: the solution is scalable and just works, with minimal code and complexity!

Top comments (3)

Sofia Petrova

Great write-up—practical and clear. Using Redis lists for atomic pops and simple priority queues keeps the design lean while scaling well, and the leaderless approach fits the trade-offs nicely. A follow-up showing the retry/backoff and monitoring setup would be super helpful.

Ahmed Rakan

Hello Sofia, thanks for your kind reply. I had a previous implementation in this blog, but I wanted to keep it simple, something people can plug and play with. The retry and exponential backoff can be implemented via a dedicated queue with some setTimeout magic.

Ahmed Rakan

You might be interested in reading this too: dev.to/araldhafeeri/npc-architectu...