DEV Community

caishengold

Building a Robust Task Scheduler in TypeScript

Building a Robust Task Scheduler in TypeScript: Task Templates, SQLite State Machine, and Retry Strategies

In modern AI agent operations, task scheduling is a critical component for managing workflows, background jobs, and automated processes. This tutorial walks through the implementation of a production-grade task scheduler in TypeScript using scheduler.ts as a reference model. We'll cover task templates, SQLite-based state management, cron-triggered execution, and resilient retry strategies. By the end, you'll understand how to build a system that balances flexibility, reliability, and scalability.


1. Task Templates: Defining Reusable Execution Blueprints

Task templates provide a structured way to define repeatable units of work. They separate what needs to be done from when and how it gets executed.

Template Structure

interface TaskTemplate {
  id: string;            // Unique template identifier
  name: string;          // Human-readable name
  parameters: Record<string, any>; // Default parameters
  scriptPath: string;    // Path to execution logic
  timeout: number;       // Max execution time in ms
}

Dynamic Task Creation

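import { v4 as uuidv4 } from 'uuid';

class TaskFactory {
  static createFromTemplate(
    template: TaskTemplate,
    overrides: Record<string, any> = {}
  ): ScheduledTask {
    return {
      // Spread the template first so the fields below take precedence;
      // spreading it after `id` would overwrite the new task id with template.id
      ...template,
      id: uuidv4(),            // Fresh id for this task instance
      templateId: template.id, // Assumes ScheduledTask declares a templateId field
      parameters: { ...template.parameters, ...overrides },
      status: 'pending',
      retries: 0,
      createdAt: new Date().toISOString()
    };
  }
}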

Templates enable consistent task creation while allowing runtime customization through parameter overrides. This pattern is particularly useful for AI workflows where the same model inference task might need different hyperparameters across executions.
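One detail worth seeing in isolation is that the spread-based merge in createFromTemplate is shallow. The sketch below uses a hypothetical mergeParameters helper and made-up parameter names (model, sampling) to show what that means for nested defaults; it is an illustration, not part of scheduler.ts.

```typescript
// Hypothetical helper mirroring the spread used in createFromTemplate.
type Params = Record<string, unknown>;

function mergeParameters(defaults: Params, overrides: Params = {}): Params {
  // Shallow merge: top-level keys in overrides replace those in defaults.
  return { ...defaults, ...overrides };
}

// Illustrative values only; the real parameter names depend on the template.
const defaults: Params = {
  model: 'base-model',
  sampling: { temperature: 0.7, topP: 0.9 },
};

const merged = mergeParameters(defaults, {
  sampling: { temperature: 0.2 },
});

// The nested `sampling` object is replaced wholesale, so topP is dropped.
console.log(merged);
// The template defaults themselves are left untouched.
console.log(defaults);
```

If templates carry nested parameter objects, a deep merge or flat parameter keys may be a better fit than the plain spread.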


2. SQLite State Machine: Persistent Task Management

SQLite provides a lightweight, transactional storage layer that suits task state on a single node, with no separate database server to run. Our state machine supports these core states:

stateDiagram-v2
    [*] --> Pending
    Pending --> Running : Worker starts execution
    Running --> Completed : Success
    Running --> Failed : Error threshold reached
    Failed --> RetryPending : Auto-retry scheduled
    RetryPending --> Running : Retry window elapsed
    Running --> Timeout : Execution timeout
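
The diagram above can be encoded as a transition table so that illegal state changes are rejected before any database write. This is a minimal sketch: the status strings follow the ones used in the article's code, except 'timeout', which is an assumed spelling. Note that handleTaskFailure later in the article moves a task from running straight to retry_pending, so an implementation following that code would need to add that edge.

```typescript
type TaskStatus =
  | 'pending'
  | 'running'
  | 'completed'
  | 'failed'
  | 'retry_pending'
  | 'timeout'; // assumed spelling for the diagram's Timeout state

// Allowed transitions, transcribed from the state diagram.
const TRANSITIONS: Record<TaskStatus, readonly TaskStatus[]> = {
  pending: ['running'],
  running: ['completed', 'failed', 'timeout'],
  failed: ['retry_pending'],
  retry_pending: ['running'],
  completed: [],
  timeout: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

// Throws instead of silently writing an invalid state.
function assertTransition(from: TaskStatus, to: TaskStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
}
```

Calling assertTransition before each state update turns a silent data corruption into an immediate, debuggable error.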

Schema Design

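CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  template_id TEXT NOT NULL,
  parameters TEXT NOT NULL, -- JSON object
  status TEXT NOT NULL DEFAULT 'pending',
  retries INTEGER DEFAULT 0,
  max_retries INTEGER DEFAULT 3,
  timeout INTEGER, -- Max execution time in ms
  -- Timestamps use ISO 8601 UTC so they compare correctly with toISOString() values
  scheduled_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
  last_error TEXT,
  created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
  updated_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')) -- Written by updateTaskState
);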

State Transitions with Transactions

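async function updateTaskState(
  taskId: string,
  newState: TaskStatus,
  error?: string
): Promise<void> {
  const now = new Date().toISOString();
  const lastError = error ?? null; // Bind NULL explicitly when no error is given

  // Requires an updated_at column on the tasks table.
  await db.run(
    `UPDATE tasks SET
      status = ?,
      -- Count an attempt whenever a retry is scheduled. Failing tasks move to
      -- 'retry_pending', so counting only on 'failed' would never advance it.
      retries = CASE WHEN ? = 'retry_pending' THEN retries + 1 ELSE retries END,
      last_error = COALESCE(?, last_error), -- Keep the previous error if none is given
      updated_at = ?
    WHERE id = ?`,
    [newState, newState, lastError, now, taskId]
  );
}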

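Each UPDATE statement executes atomically in SQLite, so a task row is never left half-written. Atomic writes alone do not stop two workers from picking up the same task, though: claiming a task should itself be a conditional update that succeeds only while the row is still pending.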
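To keep two workers from claiming the same task, the claim can be written as a compare-and-set: update the row only if its status is still claimable, then check how many rows changed. The sketch below models that idea with an in-memory Map standing in for the tasks table, so it runs without a database; claimTask and the table contents are hypothetical, and the SQL in the comment is the statement it is meant to mirror.

```typescript
type Status = 'pending' | 'retry_pending' | 'running';

// In-memory stand-in for the tasks table: task id -> status.
const table = new Map<string, Status>([['task-1', 'pending']]);

// Models:
//   UPDATE tasks SET status = 'running'
//   WHERE id = ? AND status IN ('pending', 'retry_pending')
// and returns the number of rows changed (0 or 1).
function claimTask(taskId: string): number {
  const status = table.get(taskId);
  if (status !== 'pending' && status !== 'retry_pending') return 0;
  table.set(taskId, 'running');
  return 1;
}

const workerA = claimTask('task-1'); // 1: this worker owns the task
const workerB = claimTask('task-1'); // 0: already claimed, so skip it
console.log({ workerA, workerB });
```

A worker that sees zero changed rows simply moves on to the next task instead of executing a duplicate.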


3. Cron Triggers: Precision Timing with Flexibility

Cron expressions provide a powerful way to schedule recurring tasks. We'll use the cron-parser library to handle schedule evaluation.

Schedule Validation

function validateCronExpression(expr: string): boolean {
  try {
    cronParser.parseExpression(expr);
    return true;
  } catch (error) {
    logger.error(`Invalid cron expression: ${expr}`, error);
    return false;
  }
}

Scheduled Execution Loop

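async function pollScheduler() {
  const now = new Date();
  // Fetch every active schedule whose next run time has passed
  const activeSchedules = await db.all(
    `SELECT * FROM schedules WHERE next_run <= ? AND active = 1`,
    [now.toISOString()]
  );

  for (const schedule of activeSchedules) {
    // createFromTemplate expects a TaskTemplate object, not an id, so load it first.
    // loadTemplate is a hypothetical lookup helper that is not shown in this article.
    const template = await loadTemplate(schedule.template_id);
    const task = TaskFactory.createFromTemplate(
      template,
      { scheduledTime: now.toISOString() }
    );

    await taskQueue.add(task);
    await updateNextRun(schedule.id); // Advance next_run so the schedule fires once per tick
  }
}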

For high-frequency schedules (those that run more often than once a minute), or whenever several scheduler nodes poll the same table, consider a hybrid approach that uses Redis for distributed locking to prevent duplicate executions across worker nodes.
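A complementary way to avoid duplicates is to give every scheduled run a deterministic key, so that a second node evaluating the same tick produces the same key and is rejected. The sketch below is an assumption-laden illustration: runKey and tryEnqueue are hypothetical names, and the in-memory Set stands in for whatever enforces uniqueness in practice (for example a unique column in SQLite or a set-if-absent operation in Redis).

```typescript
// Builds a key that is identical on every node evaluating the same tick.
// dueAt should be the schedule's stored next_run value, not the local clock,
// because local clocks differ slightly between nodes.
function runKey(scheduleId: string, dueAt: Date): string {
  return `${scheduleId}:${dueAt.toISOString()}`;
}

// In-memory stand-in for a uniqueness check shared by all nodes.
const seen = new Set<string>();

// Returns true if this caller won the right to enqueue the run.
function tryEnqueue(scheduleId: string, dueAt: Date): boolean {
  const key = runKey(scheduleId, dueAt);
  if (seen.has(key)) return false; // another node already enqueued this tick
  seen.add(key);
  return true;
}
```

Because the key is derived from the schedule's stored due time, two nodes that poll a few milliseconds apart still agree on it.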


4. Retry Strategy: Building Resilience into Failures

Transient failures are inevitable in distributed systems. Our retry strategy combines exponential backoff with jitter to prevent thundering herds.

Exponential Backoff Implementation

function calculateRetryDelay(retryCount: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 30000; // 30 seconds
  const jitter = 0.1; // 10% random variation

  const exponential = Math.min(
    baseDelay * Math.pow(2, retryCount),
    maxDelay
  );

  const jitterRange = exponential * jitter;
  return exponential - jitterRange + Math.random() * (2 * jitterRange);
}

Retry Lifecycle Management

async function handleTaskFailure(task: ScheduledTask): Promise<void> {
  if (task.retries >= task.max_retries) {
    await updateTaskState(task.id, 'failed');
    alertSystem.notify(task.id, 'failed');
    return;
  }

  const delay = calculateRetryDelay(task.retries);
  await scheduleRetry(task.id, delay);
  await updateTaskState(task.id, 'retry_pending');
}

This approach ensures temporary issues like API rate limits or network glitches don't result in permanent failures while preventing infinite retry loops.
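To make the backoff schedule concrete, the sketch below computes the minimum and maximum delay that calculateRetryDelay can return for each retry count, using the same constants (1 second base, 30 second cap, 10% jitter). retryDelayBounds is a hypothetical helper added only for illustration.

```typescript
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;
const JITTER = 0.1;

// Returns the [min, max] delay in ms that the jittered backoff can produce.
function retryDelayBounds(retryCount: number): [number, number] {
  const capped = Math.min(BASE_DELAY_MS * Math.pow(2, retryCount), MAX_DELAY_MS);
  const jitterRange = capped * JITTER;
  return [capped - jitterRange, capped + jitterRange];
}

// Print the window for the first few retries.
for (let retry = 0; retry <= 5; retry++) {
  const [min, max] = retryDelayBounds(retry);
  console.log(`retry ${retry}: ${Math.round(min)}ms to ${Math.round(max)}ms`);
}
```

The delays double from roughly 1 second up to the cap, which is reached at the fifth retry. Because jitter is applied after the cap, an individual delay can exceed maxDelay by up to 10%, that is, about 33 seconds; if a hard ceiling matters, clamp the final value as well.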


5. Component Integration: Building the Scheduler Engine

Putting it all together, the core scheduler engine follows this workflow:

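async function workerLoop() {
  while (isRunning) {
    const task = await taskQueue.getNext();
    if (!task) {
      // Queue is empty: pause briefly instead of spinning at full CPU.
      // This assumes getNext resolves immediately rather than blocking.
      await new Promise((resolve) => setTimeout(resolve, 1000));
      continue;
    }

    try {
      await updateTaskState(task.id, 'running');
      await executeTask(task);
      await updateTaskState(task.id, 'completed');
    } catch (error) {
      logger.error(`Task failed: ${task.id}`, error);
      await handleTaskFailure(task);
    }
  }
}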

The engine coordinates:

  1. Task retrieval from the queue
  2. State updates via SQLite transactions
  3. Execution through template-specific logic
  4. Error handling and retry scheduling
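
The worker loop as shown never enforces the template's timeout field, even though the state diagram includes a Timeout state. One possible way to add that step is a small wrapper around the task promise. The names withTimeout and TaskTimeoutError below are hypothetical, and the sketch only stops waiting for the task; it does not cancel the underlying work.

```typescript
class TaskTimeoutError extends Error {
  constructor(ms: number) {
    super(`Task exceeded timeout of ${ms}ms`);
    this.name = 'TaskTimeoutError';
  }
}

// Rejects with TaskTimeoutError if `work` has not settled within `ms`.
// Whichever happens first wins; later outcomes are ignored.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TaskTimeoutError(ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // Avoid a stray timer keeping the process alive
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}
```

In the worker loop this could wrap the execution step, for example withTimeout(executeTask(task), task.timeout), with the catch block checking for an error named TaskTimeoutError to decide whether to record a timeout rather than a generic failure.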

6. Comparative Analysis: Design Tradeoffs

| Feature | SQLite Approach | Redis Alternative | PostgreSQL Option |
|---|---|---|---|
| Persistence | Disk-based, ACID | In-memory (optional) | Full ACID compliance |
| Horizontal Scaling | Single-node | Cluster-friendly | Connection pooling |
| Cron Accuracy | ~1s resolution | ~50ms resolution | ~1s resolution |
| Complex Queries | Limited JSON support | Basic key-value | JSONB support |
| Distributed Locking | SQLite-based mutex | Redis Redlock | Advisory locks |
| Setup Complexity | Zero dependencies | Requires Redis server | DBA expertise needed |

For teams that need horizontal scaling, consider a hybrid approach: a Redis-backed queue for dispatch, with SQLite kept for task metadata.


7. Actionable Takeaways

  1. Template Inheritance: Create base templates for common AI operations (e.g., model training, data preprocessing) to reduce duplication.
  2. State Auditing: Add a task_events table to log every state transition with timestamps for post-mortem analysis.
  3. Dynamic Scaling: Implement worker autoscaling based on queue depth metrics using cloud provider APIs.
  4. Priority Queues: Extend the schema with a priority column and index to support weighted task processing.
  5. Circuit Breakers: Integrate Hystrix-like patterns to prevent repeated failures against unstable services.
  6. Metrics Collection: Track key metrics like task latency, retry rates, and queue depth using Prometheus.
  7. Schema Migrations: Use tools like migrate to manage SQLite schema changes during system upgrades.
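
As an illustration of the priority queue takeaway, the sketch below orders tasks by priority and then by scheduled time, which is the ordering a query sorted by priority descending and scheduled_at ascending would produce. The QueuedTask shape and the convention that a higher number means more urgent are assumptions made for this example.

```typescript
interface QueuedTask {
  id: string;
  priority: number;    // assumption: higher number = more urgent
  scheduledAt: string; // ISO 8601 UTC timestamp, so string order is time order
}

// Mirrors: ORDER BY priority DESC, scheduled_at ASC
function byPriority(a: QueuedTask, b: QueuedTask): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  if (a.scheduledAt < b.scheduledAt) return -1;
  if (a.scheduledAt > b.scheduledAt) return 1;
  return 0;
}

const queue: QueuedTask[] = [
  { id: 'a', priority: 1, scheduledAt: '2024-01-01T00:00:00.000Z' },
  { id: 'b', priority: 5, scheduledAt: '2024-01-01T00:05:00.000Z' },
  { id: 'c', priority: 5, scheduledAt: '2024-01-01T00:01:00.000Z' },
];

// Copy before sorting so the original queue order is preserved.
const ordered = [...queue].sort(byPriority).map((t) => t.id);
console.log(ordered); // [ 'c', 'b', 'a' ]
```

Breaking ties on scheduled time keeps tasks of equal priority in first-in, first-out order, which avoids starving older work.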

8. Conclusion

Building a production-ready task scheduler requires careful consideration of task definition, state management, timing precision, and failure recovery. By combining TypeScript's type safety with SQLite's reliability and cron's scheduling power, you can create a system that handles thousands of tasks daily with minimal operational overhead.

The complete implementation in scheduler.ts demonstrates how these components integrate into a cohesive system. For large-scale deployments, consider adding features like distributed workers, priority-based scheduling, and advanced monitoring integrations.

Remember: A good scheduler doesn't just run tasks—it provides visibility, resilience, and predictability to your AI operations pipeline. Start with this foundation and extend it to meet your specific throughput requirements and operational constraints.
