Hey folks. I know, I know, another long post. But if you've been following
along, you know I only write these when I've actually lived through the pain.
Today's about task scheduling. Specifically, about a problem I've spent way too much time solving in production.
You know that feeling when your startup grows from handling hundreds of tasks to millions, and suddenly your "simple cron job solution" is costing you thousands in Redis memory? Yeah, I've been there.
Today I want to walk through building a proper task scheduler – the kind that companies like Stripe, Airbnb, and Uber actually use in production. Not the toy examples you see in tutorials, but something you can actually deploy on Friday and sleep well over the weekend.
Table of Contents
- The Problem with Traditional Cron
- Why Big Tech Uses a Two-Tier Architecture
- The Architecture We're Building
- Understanding the Queue Layer
- The Intelligent Router: The Brain of the System
- Implementing the Three Task Types
- The Database Schema That Scales
- Building the Scheduler
- Worker Implementation
- Putting It All Together
- Production Considerations
- What I Learned the Hard Way
The Problem with Traditional Cron
Let me start with what doesn't work. A lot of developers (including past me) start with something like this:
// Don't do this in production
import * as cron from 'node-cron';

cron.schedule('0 9 * * *', async () => {
  await sendDailyReports();
});
This works great until it doesn't. Here's what breaks:
Problem 1: No persistence. Your server restarts, all scheduled tasks are gone. I once lost 10,000 subscription renewal reminders because someone deployed during a scheduled task window. Not fun explaining that to the finance team.
Problem 2: Single point of failure. One cron job fails, you have no idea unless you're watching logs. One server dies, all your scheduled tasks stop.
Problem 3: Can't handle dynamic scheduling. User signs up, you want to send them an email in 7 days. Good luck adding that to crontab programmatically.
Problem 4: No priority system. Your payment processing cron and your "cleanup old logs" cron compete for the same resources.
Problem 5: Terrible at scale. Need to schedule a million tasks? Hope you have a lot of RAM and patience.
I learned this the hard way at my last job. We were using node-cron for everything. Hit 100k users, suddenly our server was using 4GB of RAM just tracking scheduled tasks. Redis costs were approaching $800/month. Something had to change.
Why Big Tech Uses a Two-Tier Architecture
After diving into how Stripe handles subscription billing for millions of customers and how Airbnb manages booking reminders, I noticed a pattern. They all use what I call the "two-tier architecture."
Here's the insight: most scheduled tasks are in the far future. Why keep them in expensive, fast memory when they won't execute for days or weeks?
The Two Tiers:
Cold Storage (PostgreSQL): Cheap persistent storage. Stores ALL tasks, even ones scheduled months away. At $0.10/GB/month, you can store millions of tasks for pennies.
Hot Queue (Redis): Fast execution queue. Only holds tasks executing within 24-48 hours. At $10/GB/month, you only pay for what's actively being processed.
The magic is the scheduler that moves tasks from cold to hot as their execution time approaches.
Let me show you the cost difference. Say you have 10 million scheduled tasks:
All in Redis:
- 10GB of data
- At $10/GB/month = $100/month
- Plus Redis is volatile, so you need persistence setup
- Risk of data loss on restart
Two-tier approach:
- PostgreSQL: 10GB at $0.10/GB = $1/month
- Redis: 0.5GB (only next 24 hours) at $10/GB = $5/month
- Total: $6/month
- Full persistence, no data loss risk
Savings: 94%
That's not a typo. This is why Stripe can handle millions of subscription renewals without bankrupting themselves on infrastructure.
The Architecture We're Building
Before we dive into code, let me show you the high-level architecture:
┌─────────────────────────────────────────────────────────┐
│                    Your Application                     │
│               smartAPI.scheduleTask({...})              │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                    ┌────────────────┐
                    │  Intelligent   │
                    │     Router     │  ← Decides where task goes
                    └───────┬────────┘
                            │
                ┌───────────┼───────────┐
                │           │           │
                ▼           ▼           ▼
            Immediate      Hot         Cold
            < 15 min     < 24 hrs    > 24 hrs
                │           │           │
                ▼           ▼           │
            ┌──────┐    ┌──────┐        │
            │Redis │    │Redis │        │
            └───┬──┘    └───┬──┘        │
                │           │           │
                └─────┬─────┘           │
                      ▼                 ▼
                ┌─────────┐       ┌──────────┐
                │ Workers │       │PostgreSQL│
                │ Execute │       │(waiting) │
                └─────────┘       └────┬─────┘
                                       │
                                  Scheduler
                              (moves to Redis
                                 when due)
The key components:
- Smart API: Your interface. You just call scheduleTask() and forget about it.
- Intelligent Router: Decides if task goes straight to Redis or waits in the database.
- PostgreSQL: Cold storage. Holds all tasks.
- Redis + BullMQ: Hot queue. Holds tasks ready to execute.
- Scheduler: Runs every hour, moves tasks from PostgreSQL to Redis.
- Workers: Execute tasks from Redis.
Understanding the Queue Layer
Let's talk about queuing systems because this is where a lot of people get stuck. We're using BullMQ with Redis, but you have options.
Why BullMQ + Redis?
BullMQ is fantastic for job queuing. Here's what sold me:
import { Queue, Worker } from 'bullmq';

// Create a queue
const emailQueue = new Queue('emails', {
  connection: { host: 'localhost', port: 6379 }
});

// Add a job with a delay
await emailQueue.add('welcome-email', {
  to: 'user@example.com',
  subject: 'Welcome!'
}, {
  delay: 3600000, // 1 hour in milliseconds
  attempts: 3,
  backoff: {
    type: 'exponential',
    delay: 5000
  }
});

// Process jobs
const worker = new Worker('emails', async job => {
  await sendEmail(job.data);
}, { connection: { host: 'localhost', port: 6379 } });
Pros:
- Built-in retry logic with exponential backoff
- Delayed jobs out of the box
- Priority queues
- Job progress tracking
- Rate limiting
- Automatic job cleanup
- Great TypeScript support
- Very active maintenance
Cons:
- Requires Redis (another service to manage)
- Memory usage can get high
- Durability depends on how you configure Redis persistence (more on AOF later)
Alternative 1: RabbitMQ
If you're already using RabbitMQ, you can use it instead:
import amqp from 'amqplib';

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();

// Delayed delivery needs the rabbitmq-delayed-message-exchange plugin,
// which works through a special exchange type rather than a plain queue
await channel.assertExchange('delayed-tasks', 'x-delayed-message', {
  durable: true,
  arguments: { 'x-delayed-type': 'direct' }
});
await channel.assertQueue('tasks', { durable: true });
await channel.bindQueue('tasks', 'delayed-tasks', 'tasks');

// Schedule a task
channel.publish('delayed-tasks', 'tasks', Buffer.from(JSON.stringify({
  taskType: 'email',
  payload: { to: 'user@example.com' }
})), {
  persistent: true,
  headers: { 'x-delay': 3600000 } // Delay in ms, handled by the plugin
});
Pros:
- Message persistence built in
- Excellent for microservices
- Very mature ecosystem
- Better for complex routing
Cons:
- More complex to set up
- Delayed messages require plugin
- Heavier resource usage
- Steeper learning curve
Alternative 2: AWS SQS + DynamoDB
If you're on AWS:
import { SQS } from 'aws-sdk';

const sqs = new SQS();

// Schedule a task (v2 SDK: .promise() turns the request into a Promise)
await sqs.sendMessage({
  QueueUrl: 'https://sqs.us-east-1.amazonaws.com/...',
  MessageBody: JSON.stringify({
    taskType: 'email',
    payload: { to: 'user@example.com' }
  }),
  DelaySeconds: 900 // SQS caps delay at 15 minutes (900 seconds)
}).promise();
Pros:
- Fully managed, no maintenance
- Scales infinitely
- Cheap at scale
- Built-in dead letter queues
Cons:
- Max delay is 15 minutes (need DynamoDB + Lambda for longer)
- Vendor lock-in
- Costs add up with high message volume
- Requires AWS
Alternative 3: Kafka
For event-driven architectures:
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'task-scheduler',
  brokers: ['localhost:9092']
});

const producer = kafka.producer();
await producer.connect();

await producer.send({
  topic: 'scheduled-tasks',
  messages: [{
    key: 'task-123',
    value: JSON.stringify({
      taskType: 'email',
      payload: { to: 'user@example.com' }
    }),
    // Note: this only stamps message metadata; Kafka won't delay
    // delivery, so a consumer has to implement the wait itself
    timestamp: `${Date.now() + 3600000}`
  }]
});
Pros:
- Excellent for high throughput
- Message persistence and replay
- Perfect for event sourcing
- Scales horizontally
Cons:
- Overkill for simple scheduling
- Complex setup and maintenance
- Delayed messages not native
- High operational overhead
My Recommendation
For most applications, stick with BullMQ + Redis. Here's why:
- Simple setup: Redis is easy to run and manage
- Great developer experience: BullMQ API is intuitive
- Rich features: Everything you need is built in
- Battle-tested: Used by thousands of production apps
- Performance: Fast enough for 99% of use cases
Only move to alternatives if:
- You already have RabbitMQ (use it)
- You're fully AWS (use SQS + Lambda)
- You need Kafka-level throughput (rare)
The Intelligent Router: The Brain of the System
Now let's build the most important part: the router that decides where tasks go.
The logic is beautifully simple:
type Route = {
  route: 'immediate' | 'hot' | 'cold';
  action: string;
  reason: string;
};

function routeTask(scheduledDate: Date): Route {
  const now = new Date();
  // getTime() is needed; subtracting Date objects directly is a TS type error
  const minutesUntil = (scheduledDate.getTime() - now.getTime()) / (1000 * 60);
  const hoursUntil = minutesUntil / 60;

  // Task is overdue or due very soon (< 15 minutes)
  if (minutesUntil <= 15) {
    return {
      route: 'immediate',
      action: 'queue_to_redis_now',
      reason: 'Needs immediate execution'
    };
  }

  // Task is due soon (< 24 hours)
  if (hoursUntil <= 24) {
    return {
      route: 'hot',
      action: 'save_to_db_and_queue_to_redis',
      reason: 'Near future execution'
    };
  }

  // Task is far away (> 24 hours)
  return {
    route: 'cold',
    action: 'save_to_db_only',
    reason: 'Far future, scheduler will handle later'
  };
}
Let me walk through each path:
Path 1: Immediate (< 15 minutes)
// Task due in 5 minutes
scheduleTask({
  taskType: 'email',
  scheduledFor: new Date(Date.now() + 5 * 60 * 1000),
  payload: { to: 'user@example.com' }
});

// Router decision:
// "This is urgent, skip the database, go straight to Redis"
await queueToRedis(task, { delay: 5 * 60 * 1000 });
Why skip the database? Speed. Writing to PostgreSQL takes 50-100ms. Writing to Redis takes 1-5ms. For urgent tasks, every millisecond counts.
We still save it to the database asynchronously for audit purposes, but we don't wait for it.
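In code, that fire-and-forget write looks something like this (a sketch, using the same queueToRedis and saveToDatabase pseudocode helpers as above):

```typescript
// Hot path first: the Redis write is the only thing we await
await queueToRedis(task, { delay: 5 * 60 * 1000 });

// The audit write happens in the background; a failure here must not
// block or fail the scheduling call, so we only log it
saveToDatabase(task).catch(err =>
  console.error('Audit write failed (task already queued):', err)
);
```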
Path 2: Hot (< 24 hours)
// Task due in 2 hours
scheduleTask({
  taskType: 'report',
  scheduledFor: new Date(Date.now() + 2 * 60 * 60 * 1000),
  payload: { reportType: 'sales' }
});

// Router decision:
// "Save to database for persistence, also queue to Redis"
await saveToDatabase(task);
await queueToRedis(task, { delay: 2 * 60 * 60 * 1000 });
await task.update({ status: 'queued' });
This is the belt-and-suspenders approach. We save to the database so if Redis crashes, we don't lose the task. We also queue it to Redis so it executes on time.
Path 3: Cold (> 24 hours)
// Task due in 30 days
scheduleTask({
  taskType: 'subscription_renewal',
  scheduledFor: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000),
  payload: { subscriptionId: 'sub-123' }
});

// Router decision:
// "Too far away, just save to database"
await saveToDatabase(task);
// That's it! Scheduler will handle it later
This is the efficiency win. Why keep this in Redis for 30 days? It's not executing anytime soon. Save it to cheap PostgreSQL storage, and the scheduler will move it to Redis on day 29.
Implementing the Three Task Types
Real-world applications need more than just "run once" tasks. You need three types:
Type 1: One-Time Tasks
These execute once at a specific time. This is what we've been discussing.
await smartAPI.scheduleTask({
  taskType: 'email',
  taskName: 'Birthday Email',
  scheduledFor: '2025-12-25T09:00:00Z',
  payload: {
    to: 'user@example.com',
    subject: 'Happy Birthday!',
    template: 'birthday'
  },
  priority: 5
});
Implementation is straightforward: store the task in the database with status 'pending' and let the router decide when to queue it.
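Here's a sketch of that flow, wiring the router from earlier to the three paths (saveToDatabase and queueToRedis are the same pseudocode helpers used throughout):

```typescript
async function scheduleOneTimeTask(input: {
  taskType: string;
  taskName: string;
  scheduledFor: Date;
  payload: object;
  priority?: number;
}) {
  const decision = routeTask(input.scheduledFor);
  const delay = Math.max(0, input.scheduledFor.getTime() - Date.now());

  if (decision.route === 'immediate') {
    await queueToRedis(input, { delay });           // hot path first
    saveToDatabase({ ...input, status: 'queued' }); // audit write, not awaited
    return;
  }

  const task = await saveToDatabase({ ...input, status: 'pending' });
  if (decision.route === 'hot') {
    await queueToRedis(task, { delay });
    await task.update({ status: 'queued' });
  }
  // 'cold': nothing else to do; the scheduler picks it up later
}
```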
Type 2: Recurring Tasks
These run forever on a schedule. Like traditional cron, but persistent.
await smartAPI.scheduleRecurringTask({
  taskType: 'daily_report',
  taskName: 'Daily Sales Report',
  cronExpression: '0 9 * * *', // Every day at 9 AM
  payload: {
    reportType: 'sales',
    emailTo: 'boss@company.com'
  },
  timezone: 'America/New_York',
  startDate: '2025-01-01',
  endDate: '2025-12-31' // Optional
});
The trick here is maintaining state in the database:
// Database record
{
  id: 'task-123',
  scheduleType: 'recurring',
  cronExpression: '0 9 * * *',
  lastRunAt: '2025-11-23T09:00:00Z',
  nextRunAt: '2025-11-24T09:00:00Z',
  status: 'pending',
  enabled: true
}
The scheduler runs every minute, checking: "Is it time to run any recurring tasks?"
// Scheduler logic
async function checkRecurringTasks() {
  const dueTasks = await database.query(`
    SELECT * FROM scheduled_tasks
    WHERE schedule_type = 'recurring'
      AND enabled = true
      AND next_run_at <= NOW()
      AND (end_date IS NULL OR end_date >= NOW())
  `);

  for (const task of dueTasks) {
    // Queue to Redis for immediate execution
    await queueToRedis(task, { delay: 0 });

    // Calculate next run using cron-parser
    const nextRun = calculateNextRun(task.cronExpression);

    // Update database
    await task.update({
      lastRunAt: new Date(),
      nextRunAt: nextRun,
      status: 'pending' // Reset for next run
    });
  }
}
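The calculateNextRun helper is a thin wrapper around cron-parser (a minimal sketch; the timezone parameter is an assumption to match the schema's timezone column):

```typescript
import parser from 'cron-parser';

// Compute the next fire time for a cron expression in a given timezone
function calculateNextRun(cronExpression: string, timezone = 'UTC'): Date {
  const interval = parser.parseExpression(cronExpression, { tz: timezone });
  return interval.next().toDate();
}
```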
Here's the beauty of it: Redis crashes? No problem. The task definition lives in PostgreSQL, and the scheduler will simply queue it again.
Type 3: Repeating Tasks
These run N times with a fixed interval. Perfect for retry logic.
await smartAPI.scheduleRepeatingTask({
  taskType: 'payment_retry',
  taskName: 'Retry Failed Payment',
  intervalHours: 24,
  repeatCount: 5, // Try 5 times
  payload: {
    paymentId: 'pay-123',
    amount: 29.99
  },
  priority: 10 // High priority
});
Implementation tracks execution count:
// Database record
{
  id: 'task-456',
  scheduleType: 'repeating',
  intervalMs: 86400000, // 24 hours
  repeatCount: 5,
  executionCount: 2, // Already tried twice
  nextRunAt: '2025-11-25T10:00:00Z',
  status: 'pending'
}
After each execution:
async function handleRepeatingTask(task) {
  // Execute the task
  await executeTask(task);

  // Increment counter
  task.executionCount += 1;

  // Done?
  if (task.executionCount >= task.repeatCount) {
    task.status = 'completed';
    await task.save();
    return;
  }

  // Schedule next run
  task.nextRunAt = new Date(Date.now() + task.intervalMs);
  task.status = 'pending';
  await task.save();
}
This is how Stripe handles payment retries. They don't set up 5 separate tasks. One task, configured to repeat.
The Database Schema That Scales
Let's talk about the database schema. This is crucial for performance at scale.
CREATE TABLE scheduled_tasks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Task identification
  task_type VARCHAR(50) NOT NULL,
  task_name VARCHAR(255) NOT NULL,
  schedule_type VARCHAR(20) NOT NULL, -- 'one_time', 'recurring', 'repeating'

  -- One-time fields
  scheduled_for TIMESTAMP,

  -- Recurring fields
  cron_expression VARCHAR(100),
  timezone VARCHAR(50) DEFAULT 'UTC',
  start_date TIMESTAMP,
  end_date TIMESTAMP,
  last_run_at TIMESTAMP,
  next_run_at TIMESTAMP,

  -- Repeating fields
  interval_ms BIGINT,
  repeat_count INTEGER,
  execution_count INTEGER DEFAULT 0,

  -- State
  status VARCHAR(20) NOT NULL DEFAULT 'pending',
  priority INTEGER DEFAULT 5,
  enabled BOOLEAN DEFAULT true,

  -- Data
  payload JSONB NOT NULL,
  metadata JSONB DEFAULT '{}',

  -- Execution tracking
  attempts INTEGER DEFAULT 0,
  max_attempts INTEGER DEFAULT 3,
  queued_at TIMESTAMP,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  last_error TEXT,
  result JSONB,

  -- Multi-tenancy
  user_id UUID,
  tenant_id UUID,
  tags TEXT[],

  -- Timestamps
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
Now the critical part: indexes. These make or break performance.
-- THE MOST IMPORTANT INDEX
-- Scheduler queries this constantly
CREATE INDEX idx_status_scheduled_priority
  ON scheduled_tasks (status, scheduled_for, priority)
  WHERE schedule_type = 'one_time';

-- For recurring task checks (every minute)
CREATE INDEX idx_recurring_due
  ON scheduled_tasks (schedule_type, enabled, next_run_at)
  WHERE schedule_type IN ('recurring', 'repeating');

-- For finding user tasks quickly
CREATE INDEX idx_user_status
  ON scheduled_tasks (user_id, status);

-- For the scheduler's main query
CREATE INDEX idx_next_run_enabled
  ON scheduled_tasks (next_run_at, enabled)
  WHERE enabled = true;
With these indexes, querying 10 million tasks takes under 50ms. Without them, it can take 30+ seconds. Ask me how I know.
Here's the query the scheduler runs every hour:
-- Find one-time tasks ready to queue
SELECT * FROM scheduled_tasks
WHERE schedule_type = 'one_time'
  AND status = 'pending'
  AND scheduled_for >= NOW()
  AND scheduled_for <= NOW() + INTERVAL '24 hours'
ORDER BY priority DESC, scheduled_for ASC
LIMIT 10000;
With the index idx_status_scheduled_priority, PostgreSQL uses an index scan. Query time: 40ms for 10M rows.
Without the index? Sequential scan. Query time: 28 seconds. Your scheduler would fall behind immediately.
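You can verify the planner is actually using the index before relying on it:

```sql
-- Look for "Index Scan using idx_status_scheduled_priority" in the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduled_tasks
WHERE schedule_type = 'one_time'
  AND status = 'pending'
  AND scheduled_for BETWEEN NOW() AND NOW() + INTERVAL '24 hours'
ORDER BY priority DESC, scheduled_for ASC
LIMIT 10000;
```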
Building the Scheduler
The scheduler is the heart of the system. It runs periodically, moving tasks from cold storage to hot queue.
Here's the implementation:
class IntelligentScheduler {
  private isRunning = false;

  async run() {
    if (this.isRunning) {
      console.log('Scheduler already running, skipping...');
      return;
    }
    this.isRunning = true;
    console.log('Scheduler started at', new Date().toISOString());

    try {
      // Step 1: Handle overdue tasks (highest priority)
      await this.processOverdueTasks();
      // Step 2: Move near-future tasks to Redis
      await this.processUpcomingTasks();
      // Step 3: Process recurring tasks
      await this.processRecurringTasks();
    } catch (error) {
      console.error('Scheduler error:', error);
    } finally {
      this.isRunning = false;
    }
  }

  // Public: also wired to its own cron schedule below
  async processOverdueTasks() {
    const overdue = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type = 'one_time'
        AND status = 'pending'
        AND scheduled_for < NOW()
      ORDER BY priority DESC, scheduled_for ASC
      LIMIT 1000
    `);

    console.log(`Found ${overdue.length} overdue tasks`);

    for (const task of overdue) {
      // Queue immediately with high priority
      await queueToRedis(task, {
        delay: 0,
        priority: 10 // Override priority for overdue tasks
      });
      await task.update({ status: 'queued', queuedAt: new Date() });
    }
  }

  private async processUpcomingTasks() {
    const upcoming = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type = 'one_time'
        AND status = 'pending'
        AND scheduled_for >= NOW()
        AND scheduled_for <= NOW() + INTERVAL '24 hours'
      ORDER BY priority DESC, scheduled_for ASC
      LIMIT 10000
    `);

    console.log(`Found ${upcoming.length} tasks due in next 24 hours`);

    for (const task of upcoming) {
      const delay = task.scheduledFor.getTime() - Date.now();
      await queueToRedis(task, { delay: Math.max(0, delay) });
      await task.update({ status: 'queued', queuedAt: new Date() });
    }
  }

  // Public: also wired to its own cron schedule below
  async processRecurringTasks() {
    const due = await database.query(`
      SELECT * FROM scheduled_tasks
      WHERE schedule_type IN ('recurring', 'repeating')
        AND enabled = true
        AND next_run_at <= NOW()
        AND (end_date IS NULL OR end_date >= NOW())
      LIMIT 1000
    `);

    console.log(`Found ${due.length} recurring tasks due`);

    for (const task of due) {
      // Queue for immediate execution
      await queueToRedis(task, { delay: 0 });

      // Calculate next run
      if (task.scheduleType === 'recurring') {
        const nextRun = calculateNextRun(task.cronExpression, task.timezone);
        await task.update({
          lastRunAt: new Date(),
          nextRunAt: nextRun,
          status: 'pending'
        });
      } else if (task.scheduleType === 'repeating') {
        task.executionCount += 1;
        if (task.executionCount >= task.repeatCount) {
          await task.update({ status: 'completed' });
        } else {
          const nextRun = new Date(Date.now() + task.intervalMs);
          await task.update({
            executionCount: task.executionCount,
            lastRunAt: new Date(),
            nextRunAt: nextRun,
            status: 'pending'
          });
        }
      }
    }
  }
}
You wire this up with node-cron:
import * as cron from 'node-cron';

const scheduler = new IntelligentScheduler();

// One-time tasks: check every hour
cron.schedule('0 * * * *', async () => {
  await scheduler.run();
});

// Recurring tasks: check every minute
cron.schedule('* * * * *', async () => {
  await scheduler.processRecurringTasks();
});

// Overdue tasks: check every 5 minutes
cron.schedule('*/5 * * * *', async () => {
  await scheduler.processOverdueTasks();
});
The scheduler keeps no state between runs; everything lives in PostgreSQL. You can run multiple instances, but I recommend running just one to avoid double-queueing races. If you need redundancy, use a distributed lock, sketched below.
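If you do want a second instance on standby, a simple Redis lock is enough (a minimal sketch with ioredis; the key name and TTL are illustrative):

```typescript
import Redis from 'ioredis';

const redis = new Redis({ host: 'localhost', port: 6379 });

// Only the instance that wins the lock runs this scheduler cycle.
// The TTL guarantees the lock expires if the holder crashes mid-run.
async function runWithLock(fn: () => Promise<void>): Promise<void> {
  const acquired = await redis.set('scheduler:lock', String(process.pid), 'EX', 300, 'NX');
  if (acquired !== 'OK') return; // another instance holds the lock
  try {
    await fn();
  } finally {
    await redis.del('scheduler:lock');
  }
}

// Usage: wrap each cron-triggered cycle
// cron.schedule('0 * * * *', () => runWithLock(() => scheduler.run()));
```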
Worker Implementation
Workers are what actually execute your tasks. They pull jobs from Redis and process them.
import { Worker, Job } from 'bullmq';

class TaskWorker {
  private workers: Worker[] = [];

  start() {
    // High priority worker
    const highWorker = new Worker('task-execution-high',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 10 // Process 10 jobs in parallel
      }
    );

    // Normal priority worker
    const normalWorker = new Worker('task-execution-normal',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 10
      }
    );

    // Low priority worker
    const lowWorker = new Worker('task-execution-low',
      async job => await this.executeTask(job),
      {
        connection: { host: 'localhost', port: 6379 },
        concurrency: 5 // Lower concurrency for low priority
      }
    );

    this.workers = [highWorker, normalWorker, lowWorker];

    // Set up event handlers
    this.workers.forEach(worker => {
      worker.on('completed', job => {
        console.log(`Job ${job.id} completed`);
      });
      worker.on('failed', (job, error) => {
        console.error(`Job ${job?.id} failed:`, error.message);
        // Retries and dead-lettering are handled in executeTask's catch block
      });
    });

    console.log('Workers started');
  }

  private async executeTask(job: Job) {
    const { taskId, taskType, payload } = job.data;
    const startTime = Date.now();

    console.log(`Executing task ${taskId} (${taskType})`);

    try {
      // Update status in database
      const task = await database.findByPk(taskId);
      if (task) {
        await task.update({
          status: 'processing',
          startedAt: new Date(),
          attempts: task.attempts + 1
        });
      }

      // Execute based on task type
      let result;
      switch (taskType) {
        case 'email':
          result = await this.sendEmail(payload);
          break;
        case 'subscription_billing':
          result = await this.processBilling(payload);
          break;
        case 'data_cleanup':
          result = await this.cleanupData(payload);
          break;
        default:
          throw new Error(`Unknown task type: ${taskType}`);
      }

      const duration = Date.now() - startTime;

      // Mark as completed
      if (task) {
        await task.update({
          status: 'completed',
          completedAt: new Date(),
          result: result
        });
      }

      console.log(`Task ${taskId} completed in ${duration}ms`);
      return { success: true, result, duration };
    } catch (error) {
      console.error(`Task ${taskId} failed:`, error.message);

      const task = await database.findByPk(taskId);
      if (task) {
        await task.update({
          status: 'failed',
          lastError: error.message
        });
        // Send to dead letter queue if max retries exceeded
        if (task.attempts >= task.maxAttempts) {
          await this.sendToDeadLetter(job, error);
        }
      }

      throw error; // BullMQ will retry based on job config
    }
  }

  private async sendEmail(payload: any) {
    console.log(`Sending email to ${payload.to}`);
    // Your email sending logic here
    return { sent: true, messageId: 'msg-123' };
  }

  private async processBilling(payload: any) {
    console.log(`Processing billing for ${payload.subscriptionId}`);
    // Your billing logic here
    return { charged: true, amount: payload.amount };
  }

  private async cleanupData(payload: any) {
    console.log(`Cleaning up ${payload.table}`);
    // Your cleanup logic here
    return { deleted: 42 };
  }

  private async sendToDeadLetter(job: Job, error: Error) {
    // Queue to dead letter queue for manual investigation
    await deadLetterQueue.add('dead-letter', {
      taskId: job.data.taskId,
      originalData: job.data,
      error: error.message,
      attempts: job.attemptsMade,
      timestamp: new Date()
    });
  }

  async stop() {
    for (const worker of this.workers) {
      await worker.close();
    }
  }
}
Key points about workers:
- Multiple workers: Run as many as you need. Each pulls jobs independently.
- Concurrency: Each worker can process multiple jobs in parallel.
- Priority queues: Separate workers for high, normal, low priority.
- Error handling: Failed jobs retry automatically, then go to dead letter queue.
- Graceful shutdown: Workers finish current jobs before stopping.
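For completeness, here's a sketch of the queueToRedis helper used throughout, routing tasks by priority to the three queues those workers consume (queue names match the workers above; the tier thresholds are illustrative):

```typescript
import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const queues = {
  high: new Queue('task-execution-high', { connection }),
  normal: new Queue('task-execution-normal', { connection }),
  low: new Queue('task-execution-low', { connection })
};

async function queueToRedis(
  task: any,
  opts: { delay: number; priority?: number }
) {
  // Map the task's 1-10 priority onto the three queue tiers
  const priority = opts.priority ?? task.priority ?? 5;
  const tier = priority >= 8 ? 'high' : priority >= 4 ? 'normal' : 'low';

  await queues[tier].add(task.taskType, {
    taskId: task.id,
    taskType: task.taskType,
    payload: task.payload
  }, {
    delay: opts.delay,
    attempts: task.maxAttempts ?? 3,
    backoff: { type: 'exponential', delay: 5000 }
  });
}
```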
Putting It All Together
Here's how everything connects:
import * as cron from 'node-cron';

class TaskSchedulerApplication {
  private db: DatabaseService;
  private queue: QueueService;
  private scheduler: IntelligentScheduler;
  private worker: TaskWorker;
  public smartAPI: SmartTaskAPI;

  constructor() {
    this.db = new DatabaseService();
    this.queue = new QueueService();
    this.scheduler = new IntelligentScheduler(this.db, this.queue);
    this.worker = new TaskWorker(this.db, this.queue);
    this.smartAPI = new SmartTaskAPI(this.db, this.queue);
  }

  async start() {
    console.log('Starting Task Scheduler...');

    // Connect to database
    await this.db.connect();
    await this.db.sync();

    // Start workers
    this.worker.start();

    // Start scheduler cron jobs
    this.startScheduler();

    // Run scheduler once on startup
    setTimeout(() => {
      this.scheduler.run();
    }, 5000);

    console.log('Task Scheduler running');
  }

  private startScheduler() {
    // One-time tasks: every hour
    cron.schedule('0 * * * *', async () => {
      await this.scheduler.run();
    });

    // Recurring tasks: every minute
    cron.schedule('* * * * *', async () => {
      await this.scheduler.processRecurringTasks();
    });

    // Overdue check: every 5 minutes
    cron.schedule('*/5 * * * *', async () => {
      await this.scheduler.processOverdueTasks();
    });
  }

  async stop() {
    await this.worker.stop();
    await this.queue.close();
    await this.db.disconnect();
  }
}

// Usage
const app = new TaskSchedulerApplication();
await app.start();

// Now you can schedule tasks
await app.smartAPI.scheduleTask({
  taskType: 'email',
  scheduledFor: '2025-12-25T09:00:00Z',
  payload: { to: 'user@example.com' }
});
Production Considerations
After running this in production for a few companies, here's what matters:
1. Database Connection Pooling
Don't open a new connection for every query. Use a connection pool:
const sequelize = new Sequelize({
  host: 'localhost',
  database: 'scheduler',
  username: 'postgres',
  password: 'password',
  dialect: 'postgres',
  pool: {
    max: 20,  // Maximum connections
    min: 5,   // Minimum connections
    acquire: 30000,
    idle: 10000
  }
});
We learned this the hard way. Without pooling, we hit PostgreSQL's connection limit during peak hours. Requests started timing out. Adding a pool fixed it instantly.
2. Redis Persistence
Redis is in-memory, but you need persistence. Enable AOF (Append Only File):
# redis.conf
appendonly yes
appendfsync everysec
This appends every write to disk, fsyncing once per second. If Redis crashes, you lose at most one second of data. Without persistence configured (AOF or RDB snapshots), a crash can wipe every queued job.
3. Worker Scaling
Start with 2-3 worker instances. Monitor queue depth:
const stats = await queue.getJobCounts('waiting', 'active');
console.log(`Waiting: ${stats.waiting}, Active: ${stats.active}`);

// If waiting > 1000, add more workers
if (stats.waiting > 1000) {
  console.log('Queue backlog detected, scale up workers');
}
We use Kubernetes HPA (Horizontal Pod Autoscaler) to scale workers automatically based on queue depth.
4. Dead Letter Queue Monitoring
Set up alerts for the dead letter queue:
const deadLetterCount = await deadLetterQueue.getJobCounts('waiting');

if (deadLetterCount.waiting > 10) {
  // Send alert to ops team
  await sendSlackAlert({
    channel: '#ops',
    text: `⚠️ ${deadLetterCount.waiting} tasks in dead letter queue`
  });
}
Check this every 5 minutes. A growing dead letter queue means something is systematically failing.
5. Database Archiving
Archive old completed tasks to keep the table small:
-- Archive tasks older than 30 days.
-- One transaction keeps the copy and the delete consistent.
BEGIN;

INSERT INTO scheduled_tasks_archive
SELECT * FROM scheduled_tasks
WHERE status = 'completed'
  AND completed_at < NOW() - INTERVAL '30 days';

DELETE FROM scheduled_tasks
WHERE status = 'completed'
  AND completed_at < NOW() - INTERVAL '30 days';

COMMIT;
Run this weekly. Keeps your queries fast.
6. Monitoring Metrics
Track these metrics:
// Queue metrics
const queueDepth = await queue.getJobCounts();
metrics.gauge('queue.waiting', queueDepth.waiting);
metrics.gauge('queue.active', queueDepth.active);
metrics.gauge('queue.failed', queueDepth.failed);

// Database metrics
const taskStats = await database.query(`
  SELECT status, COUNT(*) as count
  FROM scheduled_tasks
  GROUP BY status
`);
taskStats.forEach(stat => {
  metrics.gauge(`tasks.${stat.status}`, stat.count);
});

// Worker metrics
metrics.gauge('workers.active', activeWorkers);
metrics.histogram('task.duration', taskDuration);
We use Prometheus for metrics and Grafana for dashboards. Essential for debugging production issues.
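If you're on Prometheus too, exporting these takes a few lines with prom-client (a sketch; the metric name is illustrative and the scrape endpoint wiring is omitted):

```typescript
import { Gauge } from 'prom-client';

const queueWaiting = new Gauge({
  name: 'scheduler_queue_waiting_jobs',
  help: 'Jobs waiting in the execution queue'
});

// Sample queue depth on an interval; Prometheus scrapes the default registry
setInterval(async () => {
  const counts = await queue.getJobCounts('waiting');
  queueWaiting.set(counts.waiting);
}, 15000);
```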
7. Graceful Shutdown
Handle shutdown properly:
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, shutting down gracefully...');

  // Stop accepting new jobs
  await worker.pause();

  // Wait for current jobs to complete
  await worker.close();

  // Close database connections
  await database.close();

  console.log('Shutdown complete');
  process.exit(0);
});
Without this, you'll lose in-flight jobs during deployments.
What I Learned the Hard Way
Let me save you from my mistakes:
Mistake 1: Not Using Idempotency Keys
Early on, we had a bug where payment tasks were scheduled twice. Without idempotency, we charged customers twice. Ouch.
Solution:
await scheduleTask({
  taskType: 'billing',
  payload: { subscriptionId: 'sub-123' },
  idempotencyKey: `billing-sub-123-2025-11`
});
Now duplicate schedules are automatically ignored.
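Enforcement is one unique index away. A sketch, assuming you add an idempotency_key column (it's not in the schema shown earlier):

```sql
ALTER TABLE scheduled_tasks ADD COLUMN idempotency_key VARCHAR(255);

-- Postgres allows many NULLs in a unique index, so keyless tasks are unaffected
CREATE UNIQUE INDEX idx_tasks_idempotency_key
  ON scheduled_tasks (idempotency_key);

-- Duplicate schedules become silent no-ops
INSERT INTO scheduled_tasks
  (task_type, task_name, schedule_type, scheduled_for, payload, idempotency_key)
VALUES
  ('billing', 'Subscription Billing', 'one_time', '2025-11-01T09:00:00Z',
   '{"subscriptionId": "sub-123"}', 'billing-sub-123-2025-11')
ON CONFLICT (idempotency_key) DO NOTHING;
```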
Mistake 2: Infinite Retry Loops
We had a task that failed because of invalid data. It kept retrying forever, filling up the queue.
Solution: Set max attempts and use exponential backoff:
{
  attempts: 3,
  backoff: {
    type: 'exponential',
    delay: 5000 // retries at 5s, then 10s (BullMQ doubles the delay each attempt)
  }
}
After 3 attempts, send to dead letter queue for manual review.
Mistake 3: Not Handling Clock Skew
Our scheduler and workers were on different servers with slightly different clocks. Tasks scheduled for "now" sometimes appeared to be in the past on workers.
Solution: Always use UTC and allow a small time buffer:
// When checking if a task is overdue, allow a 1-minute buffer
const isOverdue = task.scheduledFor < new Date(Date.now() - 60000);
Mistake 4: Forgetting to Update nextRunAt
In recurring tasks, we forgot to update nextRunAt after execution. Tasks ran once then never again.
Solution: Always update in the same transaction:
await database.transaction(async transaction => {
  // Queue task
  await queueToRedis(task);

  // Update next run
  const nextRun = calculateNextRun(task.cronExpression);
  await task.update({ nextRunAt: nextRun }, { transaction });
});
Mistake 5: No Monitoring
We had no visibility into the system. When tasks stopped running, we only found out when users complained.
Solution: Set up alerts for:
- Queue depth > 1000
- No tasks executed in last hour
- Dead letter queue > 10
- Worker errors > 5 per minute
- Database query time > 1 second
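Most of these are one query plus a notification. For example, the "no tasks executed in the last hour" check, as a sketch reusing the sendSlackAlert helper from earlier:

```typescript
// Alert if nothing has completed in the last hour;
// usually a stuck scheduler or dead workers
const [row] = await database.query(`
  SELECT COUNT(*)::int AS count
  FROM scheduled_tasks
  WHERE completed_at > NOW() - INTERVAL '1 hour'
`);

if (row.count === 0) {
  await sendSlackAlert({
    channel: '#ops',
    text: '⚠️ No tasks completed in the last hour'
  });
}
```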
Wrapping Up
This architecture has served us well. We've processed over 100 million tasks in the last year without major issues. Here's what makes it work:
- Two-tier storage: Saves 90%+ on infrastructure costs
- Intelligent routing: Tasks automatically go where they should
- Three task types: Handles all scheduling patterns
- BullMQ + Redis: Reliable, fast, feature-rich queueing
- Persistent state: Survives restarts and crashes
- Horizontal scaling: Add workers to increase throughput
- Full observability: Know what's happening at all times
The code is TypeScript, production-tested, and ready to deploy. The full implementation is about 2,000 lines, which sounds like a lot but breaks down into clean, testable modules.
Is it overkill for a side project? Absolutely. For a startup growing past 10k users? It's essential. For an enterprise handling millions of tasks? It's how you sleep at night.
If you're building something similar, feel free to steal this architecture. Just don't make the mistakes I made along the way.
Drop your thoughts. Let's argue in the comments like developers do.
I'm Elvis Sautet, Senior Full Stack Developer. Follow me on X at @elvisautet.
Questions? Thoughts? Let me know in the comments. I'm always curious to hear how others are solving this problem.
Happy scheduling!