Hey folks. I know, I know, another long post. But if you've been following
along, you know I only write these when I've actually lived through the pain.
Today's about task scheduling. Specifically, about a problem I've spent way too much time solving in production.
You know that feeling when your startup grows from handling hundreds of tasks to millions, and suddenly your "simple cron job solution" is costing you thousands in Redis memory? Yeah, I've been there.
Today I want to walk through building a proper task scheduler – the kind that companies like Stripe, Airbnb, and Uber actually use in production. Not the toy examples you see in tutorials, but something you can actually deploy on Friday and sleep well over the weekend.
Table of Contents
- The Problem with Traditional Cron
- Why Big Tech Uses a Two-Tier Architecture
- The Architecture We're Building
- Understanding the Queue Layer
- The Intelligent Router: The Brain of the System
- Implementing the Three Task Types
- The Database Schema That Scales
- Building the Scheduler
- Worker Implementation
- Putting It All Together
- Production Considerations
- What I Learned the Hard Way
The Problem with Traditional Cron
Let me start with what doesn't work. A lot of developers (including past me) start with something like this:
// Don't do this in production
import * as cron from 'node-cron';

cron.schedule('0 9 * * *', async () => {
  await sendDailyReports();
});
This works great until it doesn't. Here's what breaks:
Problem 1: No persistence. Your server restarts, all scheduled tasks are gone. I once lost 10,000 subscription renewal reminders because someone deployed during a scheduled task window. Not fun explaining that to the finance team.
Problem 2: Single point of failure. One cron job fails, you have no idea unless you're watching logs. One server dies, all your scheduled tasks stop.
Problem 3: Can't handle dynamic scheduling. User signs up, you want to send them an email in 7 days. Good luck adding that to crontab programmatically.
Problem 4: No priority system. Your payment processing cron and your "cleanup old logs" cron compete for the same resources.
Problem 5: Terrible at scale. Need to schedule a million tasks? Hope you have a lot of RAM and patience.
I learned this the hard way at my last job. We were using node-cron for everything. Hit 100k users, suddenly our server was using 4GB of RAM just tracking scheduled tasks. Redis costs were approaching $800/month. Something had to change.
Why Big Tech Uses a Two-Tier Architecture
After diving into how Stripe handles subscription billing for millions of customers and how Airbnb manages booking reminders, I noticed a pattern. They all use what I call the "two-tier architecture."
Here's the insight: most scheduled tasks are in the far future. Why keep them in expensive, fast memory when they won't execute for days or weeks?
The Two Tiers:
Cold Storage (PostgreSQL): Cheap persistent storage. Stores ALL tasks, even ones scheduled months away. At $0.10/GB/month, you can store millions of tasks for pennies.
Hot Queue (Redis): Fast execution queue. Only holds tasks executing within 24-48 hours. At $10/GB/month, you only pay for what's actively being processed.
The magic is the scheduler that moves tasks from cold to hot as their execution time approaches.
Let me show you the cost difference. Say you have 10 million scheduled tasks:
All in Redis:
- 10GB of data
- At $10/GB/month = $100/month
- Plus Redis is volatile, so you need persistence setup
- Risk of data loss on restart
Two-tier approach:
- PostgreSQL: 10GB at $0.10/GB = $1/month
- Redis: 0.5GB (only next 24 hours) at $10/GB = $5/month
- Total: $6/month
- Full persistence, no data loss risk
Savings: 94%
That's not a typo. This is why Stripe can handle millions of subscription renewals without bankrupting themselves on infrastructure.
The Architecture We're Building
Before we dive into code, let me show you the high-level architecture:
┌─────────────────────────────────────────────────────────┐
│                    Your Application                     │
│               smartAPI.scheduleTask({...})              │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                    ┌────────────────┐
                    │  Intelligent   │
                    │     Router     │  ← Decides where task goes
                    └───────┬────────┘
                            │
                ┌───────────┼───────────┐
                │           │           │
                ▼           ▼           ▼
            Immediate      Hot         Cold
            < 15 min     < 24 hrs    > 24 hrs
                │           │           │
                ▼           ▼           │
            ┌──────┐    ┌──────┐        │
            │Redis │    │Redis │        │
            └───┬──┘    └───┬──┘        │
                │           │           │
                └─────┬─────┘           │
                      ▼                 ▼
                ┌─────────┐       ┌──────────┐
                │ Workers │       │PostgreSQL│
                │ Execute │       │(waiting) │
                └─────────┘       └────┬─────┘
                                       │
                                  Scheduler
                              (moves to Redis
                                 when due)
The key components:
- Smart API: Your interface. You just call scheduleTask() and forget about it.
- Intelligent Router: Decides if task goes straight to Redis or waits in the database.
- PostgreSQL: Cold storage. Holds all tasks.
- Redis + BullMQ: Hot queue. Holds tasks ready to execute.
- Scheduler: Runs every hour, moves tasks from PostgreSQL to Redis.
- Workers: Execute tasks from Redis.
Understanding the Queue Layer
Let's talk about queuing systems because this is where a lot of people get stuck. We're using BullMQ with Redis, but you have options.
Why BullMQ + Redis?
BullMQ is fantastic for job queuing. Here's what sold me:
import { Queue, Worker } from 'bullmq';

// Create a queue
const emailQueue = new Queue('emails', {
  connection: { host: 'localhost', port: 6379 }
});

// Add a job with a delay
await emailQueue.add('welcome-email', {
  to: 'user@example.com',
  subject: 'Welcome!'
}, {
  delay: 3600000, // 1 hour in milliseconds
  attempts: 3,
  backoff: {
    type: 'exponential',
    delay: 5000
  }
});

// Process jobs
const worker = new Worker('emails', async job => {
  await sendEmail(job.data);
}, { connection: { host: 'localhost', port: 6379 } });
Pros:
- Built-in retry logic with exponential backoff
- Delayed jobs out of the box
- Priority queues
- Job progress tracking
- Rate limiting
- Automatic job cleanup
- Great TypeScript support
- Very active maintenance
Cons:
- Requires Redis (another service to manage)
- Memory usage can get high
- Durability depends on how you configure Redis persistence (more on AOF later)
Alternative 1: RabbitMQ
If you're already using RabbitMQ, you can use it instead:
import amqp from 'amqplib';

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();

// Delayed delivery needs the rabbitmq-delayed-message-exchange plugin,
// which works through a special exchange type rather than a plain queue
await channel.assertExchange('delayed-tasks', 'x-delayed-message', {
  durable: true,
  arguments: { 'x-delayed-type': 'direct' }
});
await channel.assertQueue('tasks', { durable: true });
await channel.bindQueue('tasks', 'delayed-tasks', 'tasks');

// Schedule a task
channel.publish('delayed-tasks', 'tasks', Buffer.from(JSON.stringify({
  taskType: 'email',
  payload: { to: 'user@example.com' }
})), {
  persistent: true,
  headers: { 'x-delay': 3600000 } // Delay in ms, handled by the plugin
});
Pros:
- Message persistence built in
- Excellent for microservices
- Very mature ecosystem
- Better for complex routing
Cons:
- More complex to set up
- Delayed messages require plugin
- Heavier resource usage
- Steeper learning curve
Alternative 2: AWS SQS + DynamoDB
If you're on AWS:
import { SQS } from 'aws-sdk';

const sqs = new SQS();

// Schedule a task (v2 SDK: .promise() turns the request into a Promise)
await sqs.sendMessage({
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/...',
  MessageBody: JSON.stringify({
    taskType: 'email',
    payload: { to: 'user@example.com' }
  }),
  DelaySeconds: 900 // SQS caps delay at 15 minutes (900 seconds)
}).promise();
Pros:
- Fully managed, no maintenance
- Scales infinitely
- Cheap at scale
- Built-in dead letter queues
Cons:
- Max delay is 15 minutes (need DynamoDB + Lambda for longer)
- Vendor lock-in
- Costs add up with high message volume
- Requires AWS
Alternative 3: Kafka
For event-driven architectures:
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'task-scheduler',
  brokers: ['localhost:9092']
});

const producer = kafka.producer();
await producer.connect();

await producer.send({
  topic: 'scheduled-tasks',
  messages: [{
    key: 'task-123',
    value: JSON.stringify({
      taskType: 'email',
      payload: { to: 'user@example.com' }
    }),
    // Note: this only stamps message metadata; Kafka won't delay
    // delivery, so a consumer has to implement the wait itself
    timestamp: `${Date.now() + 3600000}`
  }]
});
Pros:
- Excellent for high throughput
- Message persistence and replay
- Perfect for event sourcing
- Scales horizontally
Cons:
- Overkill for simple scheduling
- Complex setup and maintenance
- Delayed messages not native
- High operational overhead
My Recommendation
For most applications, stick with BullMQ + Redis. Here's why:
- Simple setup: Redis is easy to run and manage
- Great developer experience: BullMQ API is intuitive
- Rich features: Everything you need is built in
- Battle-tested: Used by thousands of production apps
- Performance: Fast enough for 99% of use cases
Only move to alternatives if:
- You already have RabbitMQ (use it)
- You're fully AWS (use SQS + Lambda)
- You need Kafka-level throughput (rare)
The Intelligent Router: The Brain of the System
Now let's build the most important part: the router that decides where tasks go.
The logic is beautifully simple:
type Route = {
  route: 'immediate' | 'hot' | 'cold';
  action: string;
  reason: string;
};

function routeTask(scheduledDate: Date): Route {
  const now = new Date();
  // getTime() is needed; subtracting Date objects directly is a TS type error
  const minutesUntil = (scheduledDate.getTime() - now.getTime()) / (1000 * 60);
  const hoursUntil = minutesUntil / 60;

  // Task is overdue or due very soon (< 15 minutes)
  if (minutesUntil <= 15) {
    return {
      route: 'immediate',
      action: 'queue_to_redis_now',
      reason: 'Needs immediate execution'
    };
  }

  // Task is due soon (< 24 hours)
  if (hoursUntil <= 24) {
    return {
      route: 'hot',
      action: 'save_to_db_and_queue_to_redis',
      reason: 'Near future execution'
    };
  }

  // Task is far away (> 24 hours)
  return {
    route: 'cold',
    action: 'save_to_db_only',
    reason: 'Far future, scheduler will handle later'
  };
}
Let me walk through each path:
Path 1: Immediate (< 15 minutes)
// Task due in 5 minutes
scheduleTask({
  taskType: 'email',
  scheduledFor: new Date(Date.now() + 5 * 60 * 1000),
  payload: { to: 'user@example.com' }
});

// Router decision:
// "This is urgent, skip the database, go straight to Redis"
await queueToRedis(task, { delay: 5 * 60 * 1000 });
Why skip the database? Speed. Writing to PostgreSQL takes 50-100ms. Writing to Redis takes 1-5ms. For urgent tasks, every millisecond counts.
We still save it to the database asynchronously for audit purposes, but we don't wait for it.
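In code, that fire-and-forget write looks something like this (a sketch, using the same queueToRedis and saveToDatabase pseudocode helpers as above):

```typescript
// Hot path first: the Redis write is the only thing we await
await queueToRedis(task, { delay: 5 * 60 * 1000 });

// The audit write happens in the background; a failure here must not
// block or fail the scheduling call, so we only log it
saveToDatabase(task).catch(err =>
  console.error('Audit write failed (task already queued):', err)
);
```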
Path 2: Hot (< 24 hours)
// Task due in 2 hours
scheduleTask({
  taskType: 'report',
  scheduledFor: new Date(Date.now() + 2 * 60 * 60 * 1000),
  payload: { reportType: 'sales' }
});

// Router decision:
// "Save to database for persistence, also queue to Redis"
await saveToDatabase(task);
await queueToRedis(task, { delay: 2 * 60 * 60 * 1000 });
await task.update({ status: 'queued' });
This is the belt-and-suspenders approach. We save to the database so if Redis crashes, we don't lose the task. We also queue it to Redis so it executes on time.
Path 3: Cold (> 24 hours)
// Task due in 30 days
scheduleTask({
  taskType: 'subscription_renewal',
  scheduledFor: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000),
  payload: { subscriptionId: 'sub-123' }
});

// Router decision:
// "Too far away, just save to database"
await saveToDatabase(task);
// That's it! Scheduler will handle it later
This is the efficiency win. Why keep this in Redis for 30 days? It's not executing anytime soon. Save it to cheap PostgreSQL storage, and the scheduler will move it to Redis on day 29.
Implementing the Three Task Types
Real-world applications need more than just "run once" tasks. You need three types:
Type 1: One-Time Tasks
These execute once at a specific time. This is what we've been discussing.
await smartAPI.scheduleTask({
  taskType: 'email',
  taskName: 'Birthday Email',
  scheduledFor: '2025-12-25T09:00:00Z',
  payload: {
    to: 'user@example.com',
    subject: 'Happy Birthday!',
    template: 'birthday'
  },
  priority: 5
});
Implementation is straightforward: store the task in the database with status 'pending' and let the router decide when to queue it.
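Here's a sketch of that flow, wiring the router from earlier to the three paths (saveToDatabase and queueToRedis are the same pseudocode helpers used throughout):

```typescript
async function scheduleOneTimeTask(input: {
  taskType: string;
  taskName: string;
  scheduledFor: Date;
  payload: object;
  priority?: number;
}) {
  const decision = routeTask(input.scheduledFor);
  const delay = Math.max(0, input.scheduledFor.getTime() - Date.now());

  if (decision.route === 'immediate') {
    await queueToRedis(input, { delay });           // hot path first
    saveToDatabase({ ...input, status: 'queued' }); // audit write, not awaited
    return;
  }

  const task = await saveToDatabase({ ...input, status: 'pending' });
  if (decision.route === 'hot') {
    await queueToRedis(task, { delay });
    await task.update({ status: 'queued' });
  }
  // 'cold': nothing else to do; the scheduler picks it up later
}
```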
Type 2: Recurring Tasks
These run forever on a schedule. Like traditional cron, but persistent.
await smartAPI.scheduleRecurringTask({
  taskType: 'daily_report',
  taskName: 'Daily Sales Report',
  cronExpression: '0 9 * * *', // Every day at 9 AM
  payload: {
    reportType: 'sales',
    emailTo: 'boss@company.com'
  },
  timezone: 'America/New_York',
  startDate: '2025-01-01',
  endDate: '2025-12-31' // Optional
});
The trick here is maintaining state in the database:
// Database record
{
  id: 'task-123',
  scheduleType: 'recurring',
  cronExpression: '0 9 * * *',
  lastRunAt: '2025-11-23T09:00:00Z',
  nextRunAt: '2025-11-24T09:00:00Z',
  status: 'pending',
  enabled: true
}
The scheduler runs every minute, checking: "Is it time to run any recurring tasks?"
// Scheduler logic
async function checkRecurringTasks() {
  const dueTasks = await database.query(`
    SELECT * FROM scheduled_tasks
    WHERE schedule_type = 'recurring'
      AND enabled = true
      AND next_run_at <= NOW()
      AND (end_date IS NULL OR end_date >= NOW())
  `);

  for (const task of dueTasks) {
    // Queue to Redis for immediate execution
    await queueToRedis(task, { delay: 0 });

    // Calculate next run using cron-parser
    const nextRun = calculateNextRun(task.cronExpression);

    // Update database
    await task.update({
      lastRunAt: new Date(),
      nextRunAt: nextRun,
      status: 'pending' // Reset for next run
    });
  }
}
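The calculateNextRun helper is a thin wrapper around cron-parser (a minimal sketch; the timezone parameter is an assumption to match the schema's timezone column):

```typescript
import parser from 'cron-parser';

// Compute the next fire time for a cron expression in a given timezone
function calculateNextRun(cronExpression: string, timezone = 'UTC'): Date {
  const interval = parser.parseExpression(cronExpression, { tz: timezone });
  return interval.next().toDate();
}
```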
Here's the beauty of it: Redis crashes? No problem. The task definition lives in PostgreSQL, and the scheduler will simply queue it again.
Type 3: Repeating Tasks
These run N times with a fixed interval. Perfect for retry logic.
await smartAPI.scheduleRepeatingTask({
  taskType: 'payment_retry',
  taskName: 'Retry Failed Payment',
  intervalHours: 24,
  repeatCount: 5, // Try 5 times
  payload: {
    paymentId: 'pay-123',
    amount: 29.99
  },
  priority: 10 // High priority
});
Implementation tracks execution count:
// Database record
{
  id: 'task-456',
  scheduleType: 'repeating',
  intervalMs: 86400000, // 24 hours
  repeatCount: 5,
  executionCount: 2, // Already tried twice
  nextRunAt: '2025-11-25T10:00:00Z',
  status: 'pending'
}
After each execution:
async function handleRepeatingTask(task) {
  // Execute the task
  await executeTask(task);

  // Increment counter
  task.executionCount += 1;

  // Done?
  if (task.executionCount >= task.repeatCount) {
    task.status = 'completed';
    await task.save();
    return;
  }

  // Schedule next run
  task.nextRunAt = new Date(Date.now() + task.intervalMs);
  task.status = 'pending';
  await task.save();
}
This is how Stripe handles payment retries. They don't set up 5 separate tasks. One task, configured to repeat.
The Database Schema That Scales
Let's talk about the database schema. This is crucial for performance at scale.
CREATE TABLE scheduled_tasks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Task identification
  task_type VARCHAR(50) NOT NULL,
  task_name VARCHAR(255) NOT NULL,
  schedule_type VARCHAR(20) NOT NULL, -- 'one_time', 'recurring', 'repeating'

  -- One-time fields
  scheduled_for TIMESTAMP,

  -- Recurring fields
  cron_expression VARCHAR(100),
  timezone VARCHAR(50) DEFAULT 'UTC',
  start_date TIMESTAMP,
  end_date TIMESTAMP,
  last_run_at TIMESTAMP,
  next_run_at TIMESTAMP,

  -- Repeating fields
  interval_ms BIGINT,
  repeat_count INTEGER,
  execution_count INTEGER DEFAULT 0,

  -- State
  status VARCHAR(20) NOT NULL DEFAULT 'pending',
  priority INTEGER DEFAULT 5,
  enabled BOOLEAN DEFAULT true,

  -- Data
  payload JSONB NOT NULL,
  metadata JSONB DEFAULT '{}',

  -- Execution tracking
  attempts INTEGER DEFAULT 0,
  max_attempts INTEGER DEFAULT 3,
  queued_at TIMESTAMP,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  last_error TEXT,
  result JSONB,

  -- Multi-tenancy
  user_id UUID,
  tenant_id UUID,
  tags TEXT[],

  -- Timestamps
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
Now the critical part: indexes. These make or break performance.
-- THE MOST IMPORTANT INDEX
-- Scheduler queries this constantly
CREATE INDEX idx_status_scheduled_priority
  ON scheduled_tasks (status, scheduled_for, priority)
  WHERE schedule_type = 'one_time';

-- For recurring task checks (every minute)
CREATE INDEX idx_recurring_due
  ON scheduled_tasks (schedule_type, enabled, next_run_at)
  WHERE schedule_type IN ('recurring', 'repeating');

-- For finding user tasks quickly
CREATE INDEX idx_user_status
  ON scheduled_tasks (user_id, status);

-- For the scheduler's main query
CREATE INDEX idx_next_run_enabled
  ON scheduled_tasks (next_run_at, enabled)
  WHERE enabled = true;
With these indexes, querying 10 million tasks takes under 50ms. Without them, it can take 30+ seconds. Ask me how I know.
Here's the query the scheduler runs every hour:
-- Find one-time tasks ready to queue
SELECT * FROM scheduled_tasks
WHERE schedule_type = 'one_time'
  AND status = 'pending'
  AND scheduled_for >= NOW()
  AND scheduled_for <= NOW() + INTERVAL '24 hours'
ORDER BY priority DESC, scheduled_for ASC
LIMIT 10000;
With the index idx_status_scheduled_priority, PostgreSQL uses an index scan. Query time: 40ms for 10M rows.
Without the index? Sequential scan. Query time: 28 seconds. Your scheduler would fall behind immediately.
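You can verify the planner is actually using the index before relying on it:

```sql
-- Look for "Index Scan using idx_status_scheduled_priority" in the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduled_tasks
WHERE schedule_type = 'one_time'
  AND status = 'pending'
  AND scheduled_for BETWEEN NOW() AND NOW() + INTERVAL '24 hours'
ORDER BY priority DESC, scheduled_for ASC
LIMIT 10000;
```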
Building the Scheduler
The scheduler is the heart of the system. It runs periodically, moving tasks from cold storage to hot queue.
Here's the implementation:
class IntelligentScheduler {
  private isRunning = false;

  async run() {
    if (this.isRunning) {
      console.log('Scheduler already running, skipping...');
      return;
    }
    this.isRunning = true;
    console.log('Scheduler started at', new Date().toISOString());

    try {
      // Step 1: Handle overdue tasks (highest priority)
      await this.processOverdueTasks();
      // Step 2: Move near-future tasks to Redis
      await this.processUpcomingTasks();
      // Step 3: Process recurring tasks
      await this.processRecurringTasks();
    } catch (error) {
      console.error('Scheduler error:', error);
    } finally {
      this.isRunning = false;
    }
  }

  // Public: also wired to its own cron schedule below
  async processOverdueTasks() {
    const overdue = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type = 'one_time'
        AND status = 'pending'
        AND scheduled_for < NOW()
      ORDER BY priority DESC, scheduled_for ASC
      LIMIT 1000
    `);

    console.log(`Found ${overdue.length} overdue tasks`);

    for (const task of overdue) {
      // Queue immediately with high priority
      await queueToRedis(task, {
        delay: 0,
        priority: 10 // Override priority for overdue tasks
      });
      await task.update({ status: 'queued', queuedAt: new Date() });
    }
  }

  private async processUpcomingTasks() {
    const upcoming = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type = 'one_time'
        AND status = 'pending'
        AND scheduled_for >= NOW()
        AND scheduled_for <= NOW() + INTERVAL '24 hours'
      ORDER BY priority DESC, scheduled_for ASC
      LIMIT 10000
    `);

    console.log(`Found ${upcoming.length} tasks due in next 24 hours`);

    for (const task of upcoming) {
      const delay = task.scheduledFor.getTime() - Date.now();
      await queueToRedis(task, { delay: Math.max(0, delay) });
      await task.update({ status: 'queued', queuedAt: new Date() });
    }
  }

  // Public: also wired to its own cron schedule below
  async processRecurringTasks() {
    const due = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type IN ('recurring', 'repeating')
        AND enabled = true
        AND next_run_at <= NOW()
        AND (end_date IS NULL OR end_date >= NOW())
      LIMIT 1000
    `);

    console.log(`Found ${due.length} recurring tasks due`);

    for (const task of due) {
      // Queue for immediate execution
      await queueToRedis(task, { delay: 0 });

      // Calculate next run
      if (task.scheduleType === 'recurring') {
        const nextRun = calculateNextRun(task.cronExpression, task.timezone);
        await task.update({
          lastRunAt: new Date(),
          nextRunAt: nextRun,
          status: 'pending'
        });
      } else if (task.scheduleType === 'repeating') {
        task.executionCount += 1;
        if (task.executionCount >= task.repeatCount) {
          await task.update({ status: 'completed' });
        } else {
          const nextRun = new Date(Date.now() + task.intervalMs);
          await task.update({
            executionCount: task.executionCount,
            lastRunAt: new Date(),
            nextRunAt: nextRun,
            status: 'pending'
          });
        }
      }
    }
  }
}
You wire this up with node-cron:
import * as cron from 'node-cron';

const scheduler = new IntelligentScheduler();

// One-time tasks: check every hour
cron.schedule('0 * * * *', async () => {
  await scheduler.run();
});

// Recurring tasks: check every minute
cron.schedule('* * * * *', async () => {
  await scheduler.processRecurringTasks();
});

// Overdue tasks: check every 5 minutes
cron.schedule('*/5 * * * *', async () => {
  await scheduler.processOverdueTasks();
});
The scheduler keeps no state between runs; everything lives in PostgreSQL. You can run multiple instances, but I recommend running just one to avoid double-queueing races. If you need redundancy, use a distributed lock, sketched below.
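If you do want a second instance on standby, a simple Redis lock is enough (a minimal sketch with ioredis; the key name and TTL are illustrative):

```typescript
import Redis from 'ioredis';

const redis = new Redis({ host: 'localhost', port: 6379 });

// Only the instance that wins the lock runs this scheduler cycle.
// The TTL guarantees the lock expires if the holder crashes mid-run.
async function runWithLock(fn: () => Promise<void>): Promise<void> {
  const acquired = await redis.set('scheduler:lock', String(process.pid), 'EX', 300, 'NX');
  if (acquired !== 'OK') return; // another instance holds the lock
  try {
    await fn();
  } finally {
    await redis.del('scheduler:lock');
  }
}

// Usage: wrap each cron-triggered cycle
// cron.schedule('0 * * * *', () => runWithLock(() => scheduler.run()));
```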
Worker Implementation
Workers are what actually execute your tasks. They pull jobs from Redis and process them.
import { Worker, Job } from 'bullmq';

class TaskWorker {
  private workers: Worker[] = [];

  start() {
    // High priority worker
    const highWorker = new Worker('task-execution-high',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 10 // Process 10 jobs in parallel
      }
    );

    // Normal priority worker
    const normalWorker = new Worker('task-execution-normal',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 10
      }
    );

    // Low priority worker
    const lowWorker = new Worker('task-execution-low',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 5 // Lower concurrency for low priority
      }
    );

    this.workers = [highWorker, normalWorker, lowWorker];

    // Set up event handlers
    this.workers.forEach(worker => {
      worker.on('completed', job => {
        console.log(`Job ${job.id} completed`);
      });
      worker.on('failed', (job, error) => {
        console.error(`Job ${job?.id} failed:`, error.message);
        // Retries and dead-lettering are handled in executeTask's catch block
      });
    });

    console.log('Workers started');
  }

  private async executeTask(job: Job) {
    const { taskId, taskType, payload } = job.data;
    const startTime = Date.now();

    console.log(`Executing task ${taskId} (${taskType})`);

    try {
      // Update status in database
      const task = await database.findByPk(taskId);
      if (task) {
        await task.update({
          status: 'processing',
          startedAt: new Date(),
          attempts: task.attempts + 1
        });
      }

      // Execute based on task type
      let result;
      switch (taskType) {
        case 'email':
          result = await this.sendEmail(payload);
          break;
        case 'subscription_billing':
          result = await this.processBilling(payload);
          break;
        case 'data_cleanup':
          result = await this.cleanupData(payload);
          break;
        default:
          throw new Error(`Unknown task type: ${taskType}`);
      }

      const duration = Date.now() - startTime;

      // Mark as completed
      if (task) {
        await task.update({
          status: 'completed',
          completedAt: new Date(),
          result: result
        });
      }

      console.log(`Task ${taskId} completed in ${duration}ms`);
      return { success: true, result, duration };
    } catch (error) {
      console.error(`Task ${taskId} failed:`, error.message);

      const task = await database.findByPk(taskId);
      if (task) {
        await task.update({
          status: 'failed',
          lastError: error.message
        });
        // Send to dead letter queue if max retries exceeded
        if (task.attempts >= task.maxAttempts) {
          await this.sendToDeadLetter(job, error);
        }
      }

      throw error; // BullMQ will retry based on job config
    }
  }

  private async sendEmail(payload: any) {
    console.log(`Sending email to ${payload.to}`);
    // Your email sending logic here
    return { sent: true, messageId: 'msg-123' };
  }

  private async processBilling(payload: any) {
    console.log(`Processing billing for ${payload.subscriptionId}`);
    // Your billing logic here
    return { charged: true, amount: payload.amount };
  }

  private async cleanupData(payload: any) {
    console.log(`Cleaning up ${payload.table}`);
    // Your cleanup logic here
    return { deleted: 42 };
  }

  private async sendToDeadLetter(job: Job, error: Error) {
    // Queue to dead letter queue for manual investigation
    await deadLetterQueue.add('dead-letter', {
      taskId: job.data.taskId,
      originalData: job.data,
      error: error.message,
      attempts: job.attemptsMade,
      timestamp: new Date()
    });
  }

  async stop() {
    for (const worker of this.workers) {
      await worker.close();
    }
  }
}
Key points about workers:
- Multiple workers: Run as many as you need. Each pulls jobs independently.
- Concurrency: Each worker can process multiple jobs in parallel.
- Priority queues: Separate workers for high, normal, low priority.
- Error handling: Failed jobs retry automatically, then go to dead letter queue.
- Graceful shutdown: Workers finish current jobs before stopping.
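For completeness, here's a sketch of the queueToRedis helper used throughout, routing tasks by priority to the three queues those workers consume (queue names match the workers above; the tier thresholds are illustrative):

```typescript
import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const queues = {
  high: new Queue('task-execution-high', { connection }),
  normal: new Queue('task-execution-normal', { connection }),
  low: new Queue('task-execution-low', { connection })
};

async function queueToRedis(
  task: any,
  opts: { delay: number; priority?: number }
) {
  // Map the task's 1-10 priority onto the three queue tiers
  const priority = opts.priority ?? task.priority ?? 5;
  const tier = priority >= 8 ? 'high' : priority >= 4 ? 'normal' : 'low';

  await queues[tier].add(task.taskType, {
    taskId: task.id,
    taskType: task.taskType,
    payload: task.payload
  }, {
    delay: opts.delay,
    attempts: task.maxAttempts ?? 3,
    backoff: { type: 'exponential', delay: 5000 }
  });
}
```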
Putting It All Together
Here's how everything connects:
import * as cron from 'node-cron';

class TaskSchedulerApplication {
  private db: DatabaseService;
  private queue: QueueService;
  private scheduler: IntelligentScheduler;
  private worker: TaskWorker;
  public smartAPI: SmartTaskAPI;

  constructor() {
    this.db = new DatabaseService();
    this.queue = new QueueService();
    this.scheduler = new IntelligentScheduler(this.db, this.queue);
    this.worker = new TaskWorker(this.db, this.queue);
    this.smartAPI = new SmartTaskAPI(this.db, this.queue);
  }

  async start() {
    console.log('Starting Task Scheduler...');

    // Connect to database
    await this.db.connect();
    await this.db.sync();

    // Start workers
    this.worker.start();

    // Start scheduler cron jobs
    this.startScheduler();

    // Run scheduler once on startup
    setTimeout(() => {
      this.scheduler.run();
    }, 5000);

    console.log('Task Scheduler running');
  }

  private startScheduler() {
    // One-time tasks: every hour
    cron.schedule('0 * * * *', async () => {
      await this.scheduler.run();
    });

    // Recurring tasks: every minute
    cron.schedule('* * * * *', async () => {
      await this.scheduler.processRecurringTasks();
    });

    // Overdue check: every 5 minutes
    cron.schedule('*/5 * * * *', async () => {
      await this.scheduler.processOverdueTasks();
    });
  }

  async stop() {
    await this.worker.stop();
    await this.queue.close();
    await this.db.disconnect();
  }
}

// Usage
const app = new TaskSchedulerApplication();
await app.start();

// Now you can schedule tasks
await app.smartAPI.scheduleTask({
  taskType: 'email',
  scheduledFor: '2025-12-25T09:00:00Z',
  payload: { to: 'user@example.com' }
});
Production Considerations
After running this in production for a few companies, here's what matters:
1. Database Connection Pooling
Don't open a new connection for every query. Use a connection pool:
const sequelize = new Sequelize({
  host: 'localhost',
  database: 'scheduler',
  username: 'postgres',
  password: 'password',
  dialect: 'postgres',
  pool: {
    max: 20,  // Maximum connections
    min: 5,   // Minimum connections
    acquire: 30000,
    idle: 10000
  }
});
We learned this the hard way. Without pooling, we hit PostgreSQL's connection limit during peak hours. Requests started timing out. Adding a pool fixed it instantly.
2. Redis Persistence
Redis is in-memory, but you need persistence. Enable AOF (Append Only File):
# redis.conf
appendonly yes
appendfsync everysec
This appends every write to disk, fsyncing once per second. If Redis crashes, you lose at most one second of data. Without persistence configured (AOF or RDB snapshots), a crash can wipe every queued job.
3. Worker Scaling
Start with 2-3 worker instances. Monitor queue depth:
const stats = await queue.getJobCounts('waiting', 'active');
console.log(`Waiting: ${stats.waiting}, Active: ${stats.active}`);

// If waiting > 1000, add more workers
if (stats.waiting > 1000) {
  console.log('Queue backlog detected, scale up workers');
}
We use Kubernetes HPA (Horizontal Pod Autoscaler) to scale workers automatically based on queue depth.
4. Dead Letter Queue Monitoring
Set up alerts for the dead letter queue:
const deadLetterCount = await deadLetterQueue.getJobCounts('waiting');

if (deadLetterCount.waiting > 10) {
  // Send alert to ops team
  await sendSlackAlert({
    channel: '#ops',
    text: `⚠️ ${deadLetterCount.waiting} tasks in dead letter queue`
  });
}
Check this every 5 minutes. A growing dead letter queue means something is systematically failing.
5. Database Archiving
Archive old completed tasks to keep the table small:
-- Archive tasks older than 30 days.
-- One transaction keeps the copy and the delete consistent.
BEGIN;

INSERT INTO scheduled_tasks_archive
SELECT * FROM scheduled_tasks
WHERE status = 'completed'
  AND completed_at < NOW() - INTERVAL '30 days';

DELETE FROM scheduled_tasks
WHERE status = 'completed'
  AND completed_at < NOW() - INTERVAL '30 days';

COMMIT;
Run this weekly. Keeps your queries fast.
6. Monitoring Metrics
Track these metrics:
// Queue metrics
const queueDepth = await queue.getJobCounts();
metrics.gauge('queue.waiting', queueDepth.waiting);
metrics.gauge('queue.active', queueDepth.active);
metrics.gauge('queue.failed', queueDepth.failed);

// Database metrics
const taskStats = await database.query(`
  SELECT status, COUNT(*) as count
  FROM scheduled_tasks
  GROUP BY status
`);
taskStats.forEach(stat => {
  metrics.gauge(`tasks.${stat.status}`, stat.count);
});

// Worker metrics
metrics.gauge('workers.active', activeWorkers);
metrics.histogram('task.duration', taskDuration);
We use Prometheus for metrics and Grafana for dashboards. Essential for debugging production issues.
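If you're on Prometheus too, exporting these takes a few lines with prom-client (a sketch; the metric name is illustrative and the scrape endpoint wiring is omitted):

```typescript
import { Gauge } from 'prom-client';

const queueWaiting = new Gauge({
  name: 'scheduler_queue_waiting_jobs',
  help: 'Jobs waiting in the execution queue'
});

// Sample queue depth on an interval; Prometheus scrapes the default registry
setInterval(async () => {
  const counts = await queue.getJobCounts('waiting');
  queueWaiting.set(counts.waiting);
}, 15000);
```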
7. Graceful Shutdown
Handle shutdown properly:
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, shutting down gracefully...');

  // Stop accepting new jobs
  await worker.pause();

  // Wait for current jobs to complete
  await worker.close();

  // Close database connections
  await database.close();

  console.log('Shutdown complete');
  process.exit(0);
});
Without this, you'll lose in-flight jobs during deployments.
What I Learned the Hard Way
Let me save you from my mistakes:
Mistake 1: Not Using Idempotency Keys
Early on, we had a bug where payment tasks were scheduled twice. Without idempotency, we charged customers twice. Ouch.
Solution:
await scheduleTask({
  taskType: 'billing',
  payload: { subscriptionId: 'sub-123' },
  idempotencyKey: `billing-sub-123-2025-11`
});
Now duplicate schedules are automatically ignored.
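Enforcement is one unique index away. A sketch, assuming you add an idempotency_key column (it's not in the schema shown earlier):

```sql
ALTER TABLE scheduled_tasks ADD COLUMN idempotency_key VARCHAR(255);

-- Postgres allows many NULLs in a unique index, so keyless tasks are unaffected
CREATE UNIQUE INDEX idx_tasks_idempotency_key
  ON scheduled_tasks (idempotency_key);

-- Duplicate schedules become silent no-ops
INSERT INTO scheduled_tasks
  (task_type, task_name, schedule_type, scheduled_for, payload, idempotency_key)
VALUES
  ('billing', 'Subscription Billing', 'one_time', '2025-11-01T09:00:00Z',
   '{"subscriptionId": "sub-123"}', 'billing-sub-123-2025-11')
ON CONFLICT (idempotency_key) DO NOTHING;
```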
Mistake 2: Infinite Retry Loops
We had a task that failed because of invalid data. It kept retrying forever, filling up the queue.
Solution: Set max attempts and use exponential backoff:
{
  attempts: 3,
  backoff: {
    type: 'exponential',
    delay: 5000 // retries at 5s, then 10s (BullMQ doubles the delay each attempt)
  }
}
After 3 attempts, send to dead letter queue for manual review.
Mistake 3: Not Handling Clock Skew
Our scheduler and workers were on different servers with slightly different clocks. Tasks scheduled for "now" sometimes appeared to be in the past on workers.
Solution: Always use UTC and allow a small time buffer:
// When checking if a task is overdue, allow a 1-minute buffer
const isOverdue = task.scheduledFor < new Date(Date.now() - 60000);
Mistake 4: Forgetting to Update nextRunAt
In recurring tasks, we forgot to update nextRunAt after execution. Tasks ran once then never again.
Solution: Always update in the same transaction:
await database.transaction(async transaction => {
  // Queue task
  await queueToRedis(task);

  // Update next run
  const nextRun = calculateNextRun(task.cronExpression);
  await task.update({ nextRunAt: nextRun }, { transaction });
});
Mistake 5: No Monitoring
We had no visibility into the system. When tasks stopped running, we only found out when users complained.
Solution: Set up alerts for:
- Queue depth > 1000
- No tasks executed in last hour
- Dead letter queue > 10
- Worker errors > 5 per minute
- Database query time > 1 second
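Most of these are one query plus a notification. For example, the "no tasks executed in the last hour" check, as a sketch reusing the sendSlackAlert helper from earlier:

```typescript
// Alert if nothing has completed in the last hour;
// usually a stuck scheduler or dead workers
const [row] = await database.query(`
  SELECT COUNT(*)::int AS count
  FROM scheduled_tasks
  WHERE completed_at > NOW() - INTERVAL '1 hour'
`);

if (row.count === 0) {
  await sendSlackAlert({
    channel: '#ops',
    text: '⚠️ No tasks completed in the last hour'
  });
}
```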
Wrapping Up
This architecture has served us well. We've processed over 100 million tasks in the last year without major issues. Here's what makes it work:
- Two-tier storage: Saves 90%+ on infrastructure costs
- Intelligent routing: Tasks automatically go where they should
- Three task types: Handles all scheduling patterns
- BullMQ + Redis: Reliable, fast, feature-rich queueing
- Persistent state: Survives restarts and crashes
- Horizontal scaling: Add workers to increase throughput
- Full observability: Know what's happening at all times
The code is TypeScript, production-tested, and ready to deploy. The full implementation is about 2,000 lines, which sounds like a lot but breaks down into clean, testable modules.
Is it overkill for a side project? Absolutely. For a startup growing past 10k users? It's essential. For an enterprise handling millions of tasks? It's how you sleep at night.
If you're building something similar, feel free to steal this architecture. Just don't make the mistakes I made along the way.
Drop your thoughts. Let's argue in the comments like developers do.
I'm Elvis Sautet, Senior Full Stack Developer. Follow me on X at @elvisautet.
Questions? Thoughts? Let me know in the comments. I'm always curious to hear how others are solving this problem.
Happy scheduling!