Building a Robust Task Scheduler in TypeScript: Task Templates, SQLite State Machine, and Retry Strategies
In modern AI agent operations, task scheduling is a critical component for managing workflows, background jobs, and automated processes. This tutorial walks through the implementation of a production-grade task scheduler in TypeScript using scheduler.ts as a reference model. We'll cover task templates, SQLite-based state management, cron-triggered execution, and resilient retry strategies. By the end, you'll understand how to build a system that balances flexibility, reliability, and scalability.
1. Task Templates: Defining Reusable Execution Blueprints
Task templates provide a structured way to define repeatable units of work. They separate what needs to be done from when and how it gets executed.
Template Structure
interface TaskTemplate {
  id: string; // Unique template identifier
  name: string; // Human-readable name
  parameters: Record<string, any>; // Default parameters
  scriptPath: string; // Path to execution logic
  timeout: number; // Max execution time in ms
}
Dynamic Task Creation
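import { v4 as uuidv4 } from 'uuid';

class TaskFactory {
  static createFromTemplate(
    template: TaskTemplate,
    overrides: Record<string, any> = {}
  ): ScheduledTask {
    return {
      ...template,
      // Assign the task's own id after the spread; placed before it,
      // template.id would overwrite the freshly generated value.
      id: uuidv4(),
      // Assumption: ScheduledTask declares templateId (the schema stores template_id).
      templateId: template.id,
      parameters: { ...template.parameters, ...overrides },
      status: 'pending',
      retries: 0,
      createdAt: new Date().toISOString()
    };
  }
}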
Templates enable consistent task creation while allowing runtime customization through parameter overrides. This pattern is particularly useful for AI workflows where the same model inference task might need different hyperparameters across executions.
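One detail of the override pattern is worth seeing in isolation: the spread-based merge is shallow. The sketch below reproduces only the merge step from createFromTemplate; the helper name mergeParameters and the sample inference defaults are hypothetical and chosen for illustration.

```typescript
type Params = Record<string, unknown>;

// Shallow merge, as in TaskFactory.createFromTemplate: a top-level key in
// `overrides` replaces the template's default for that key wholesale.
function mergeParameters(defaults: Params, overrides: Params = {}): Params {
  return { ...defaults, ...overrides };
}

// Hypothetical defaults for a model inference template.
const inferenceDefaults: Params = {
  model: "example-model",
  temperature: 0.7,
  sampling: { topP: 0.9, topK: 40 },
};

const merged = mergeParameters(inferenceDefaults, {
  temperature: 0.2,
  sampling: { topP: 0.5 },
});

console.log(merged.model); // default kept
console.log(merged.temperature); // override applied
console.log(merged.sampling); // nested object replaced, so the topK default is dropped
```

If nested defaults need to survive partial overrides, a deep merge (or flat parameter names) would be required; the shallow spread keeps the factory simple at the cost of that behavior.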
2. SQLite State Machine: Persistent Task Management
SQLite provides a lightweight, transactional storage layer that works well for managing task state on a single node (the multi-node trade-offs are compared in section 6). Our state machine supports these core states:
stateDiagram-v2
[*] --> Pending
Pending --> Running : Worker starts execution
Running --> Completed : Success
Running --> Failed : Error threshold reached
Failed --> RetryPending : Auto-retry scheduled
RetryPending --> Running : Retry window elapsed
Running --> Timeout : Execution timeout
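The edges of the diagram can also be written as a lookup table, which lets the scheduler reject an illegal transition before it reaches the database. This is a minimal sketch rather than part of scheduler.ts: the names ALLOWED_TRANSITIONS, canTransition, and assertTransition are hypothetical, and the 'timeout' status string is an assumed spelling (the other status strings appear in the code in this article).

```typescript
type TaskStatus =
  | "pending"
  | "running"
  | "completed"
  | "failed"
  | "retry_pending"
  | "timeout";

// Allowed transitions, copied edge for edge from the state diagram.
const ALLOWED_TRANSITIONS: Record<TaskStatus, readonly TaskStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "timeout"],
  failed: ["retry_pending"],
  retry_pending: ["running"],
  completed: [],
  timeout: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Throws instead of returning false, for call sites that treat an illegal
// transition as a programming error.
function assertTransition(from: TaskStatus, to: TaskStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal task transition: ${from} -> ${to}`);
  }
}
```

Calling assertTransition(current, next) before writing the new status keeps the persisted state consistent with the diagram even if a worker misbehaves.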
Schema Design
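CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  template_id TEXT NOT NULL,
  parameters TEXT NOT NULL, -- JSON object
  status TEXT NOT NULL DEFAULT 'pending',
  retries INTEGER DEFAULT 0,
  max_retries INTEGER DEFAULT 3,
  timeout INTEGER, -- milliseconds
  -- ISO 8601 UTC timestamps, the same format new Date().toISOString() produces,
  -- so string comparisons against values written from TypeScript stay correct
  scheduled_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
  last_error TEXT,
  created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
  updated_at TEXT -- written by updateTaskState on every transition
);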
State Transitions with Transactions
async function updateTaskState(
  taskId: string,
  newState: TaskStatus,
  error?: string
): Promise<void> {
  const now = new Date().toISOString();
  // Bind NULL explicitly when no error message is given.
  const errorText = error ?? null;
  await db.run(
    `UPDATE tasks SET
       status = ?,
       retries = CASE WHEN ? = 'failed' THEN retries + 1 ELSE retries END,
       last_error = CASE WHEN ? IS NOT NULL THEN ? ELSE last_error END,
       updated_at = ?
     WHERE id = ?`,
    [newState, newState, errorText, errorText, now, taskId]
  );
}
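Each UPDATE statement in SQLite is atomic, so a single state change is never half-applied. A transition that spans several statements should be wrapped in an explicit transaction (BEGIN IMMEDIATE ... COMMIT), and the UPDATE needs a guard on the current status to prevent two workers from claiming the same task.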
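One way to make task claiming safe under concurrency is a conditional UPDATE: the status check and the write happen in one statement, so only one worker can win. The sketch below is an illustration under stated assumptions, not code from scheduler.ts. The Db interface is a hypothetical minimal shape, and the changes field on the result is an assumption about the driver (adjust it to whatever row-count field your SQLite binding returns).

```typescript
// Hypothetical minimal shape of the database handle this sketch needs.
// Assumption: run() resolves to an object exposing the affected row count
// as `changes`.
interface Db {
  run(sql: string, params: unknown[]): Promise<{ changes?: number }>;
}

// Try to move a task from 'pending' or 'retry_pending' to 'running'.
// Resolves to true only for the worker whose UPDATE actually matched the
// row; every other worker sees zero changed rows and should skip the task.
async function claimTask(db: Db, taskId: string): Promise<boolean> {
  const result = await db.run(
    `UPDATE tasks
        SET status = 'running'
      WHERE id = ?
        AND status IN ('pending', 'retry_pending')`,
    [taskId]
  );
  return (result.changes ?? 0) === 1;
}
```

A worker would call claimTask before executing and move on to the next task when it resolves to false.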
3. Cron Triggers: Precision Timing with Flexibility
Cron expressions provide a powerful way to schedule recurring tasks. We'll use the cron-parser library to handle schedule evaluation.
Schedule Validation
function validateCronExpression(expr: string): boolean {
  try {
    cronParser.parseExpression(expr);
    return true;
  } catch (error) {
    logger.error(`Invalid cron expression: ${expr}`, error);
    return false;
  }
}
Scheduled Execution Loop
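async function pollScheduler() {
  const now = new Date();
  const activeSchedules = await db.all(
    `SELECT * FROM schedules WHERE next_run <= ? AND active = 1`,
    [now.toISOString()]
  );
  for (const schedule of activeSchedules) {
    // createFromTemplate expects a TaskTemplate, not an id, so resolve it first.
    // loadTemplate is a hypothetical helper (not shown) that maps a
    // template_id to its TaskTemplate.
    const template = await loadTemplate(schedule.template_id);
    const task = TaskFactory.createFromTemplate(
      template,
      { scheduledTime: now.toISOString() }
    );
    await taskQueue.add(task);
    await updateNextRun(schedule.id);
  }
}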
For high-frequency schedules (those that fire more than once per minute) running on several worker nodes, consider a hybrid approach that uses Redis for distributed locking, so that two nodes polling at the same moment do not enqueue the same run twice.
4. Retry Strategy: Building Resilience into Failures
Transient failures are inevitable in distributed systems. Our retry strategy combines exponential backoff with jitter to prevent thundering herds.
Exponential Backoff Implementation
function calculateRetryDelay(retryCount: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 30000; // 30 seconds
  const jitter = 0.1; // 10% random variation
  const exponential = Math.min(
    baseDelay * Math.pow(2, retryCount),
    maxDelay
  );
  const jitterRange = exponential * jitter;
  return exponential - jitterRange + Math.random() * (2 * jitterRange);
}
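To see what delays this formula produces, the sketch below repeats it with the random source passed in as a parameter, so the lower and upper bounds for each retry can be computed deterministically. The name retryDelay and the injectable random parameter are additions for illustration; the arithmetic matches calculateRetryDelay.

```typescript
// Same formula as calculateRetryDelay, with the random source injectable
// so the bounds can be inspected without randomness.
function retryDelay(
  retryCount: number,
  random: () => number = Math.random
): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 30000; // 30 seconds
  const jitter = 0.1; // 10% random variation
  const exponential = Math.min(baseDelay * Math.pow(2, retryCount), maxDelay);
  const jitterRange = exponential * jitter;
  return exponential - jitterRange + random() * (2 * jitterRange);
}

for (let retry = 0; retry <= 6; retry++) {
  const low = retryDelay(retry, () => 0); // smallest possible delay
  const high = retryDelay(retry, () => 1); // upper bound, never quite reached
  console.log(`retry ${retry}: ${low} ms up to ${high} ms`);
}
```

The first retry waits roughly 0.9 to 1.1 seconds and each later retry doubles that window until the cap applies at the sixth attempt (retry 5). Because jitter is added after capping, a delay can exceed maxDelay by up to 10%, so the effective ceiling is about 33 seconds rather than 30.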
Retry Lifecycle Management
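async function handleTaskFailure(task: ScheduledTask): Promise<void> {
  // Record the failure first: updateTaskState increments `retries` only when
  // the new state is 'failed'. Skipping this step would leave the counter
  // unchanged, and the retry limit below would never be reached.
  await updateTaskState(task.id, 'failed');
  if (task.retries >= task.max_retries) {
    alertSystem.notify(task.id, 'failed');
    return;
  }
  const delay = calculateRetryDelay(task.retries);
  // Mark the task as waiting before arming the retry, so a retry that fires
  // immediately finds it in 'retry_pending'.
  await updateTaskState(task.id, 'retry_pending');
  await scheduleRetry(task.id, delay);
}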
This approach ensures temporary issues like API rate limits or network glitches don't result in permanent failures while preventing infinite retry loops.
5. Component Integration: Building the Scheduler Engine
Putting it all together, the core scheduler engine follows this workflow:
async function workerLoop() {
  while (isRunning) {
    const task = await taskQueue.getNext();
    if (!task) {
      // Queue is empty: pause briefly so the loop does not spin at full CPU.
      await new Promise((resolve) => setTimeout(resolve, 1000));
      continue;
    }
    try {
      await updateTaskState(task.id, 'running');
      await executeTask(task);
      await updateTaskState(task.id, 'completed');
    } catch (error) {
      logger.error(`Task failed: ${task.id}`, error);
      await handleTaskFailure(task);
    }
  }
}
The engine coordinates:
- Task retrieval from the queue
- State updates via SQLite transactions
- Execution through template-specific logic
- Error handling and retry scheduling
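The worker loop as shown does not enforce the template's timeout field, so the Timeout state in the diagram is never reached. One possible way to add that is to race the execution against a timer. The sketch below is an assumption about how this could be done, not the approach taken in scheduler.ts; withTimeout and TimeoutError are hypothetical names.

```typescript
// Error type that lets the caller distinguish a timeout from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Task exceeded its ${ms} ms timeout`);
    this.name = "TimeoutError";
  }
}

// Reject if `work` has not settled within `ms` milliseconds.
// Note: this only stops waiting. It does not cancel the underlying work,
// which needs its own cancellation mechanism (for example an AbortSignal).
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Whichever settles first wins; the timer is always cleaned up afterwards.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

In the worker loop, the execution step could then become await withTimeout(executeTask(task), task.timeout), with the catch block checking error.name === 'TimeoutError' to decide whether to record the timeout state instead of going through the retry path.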
6. Comparative Analysis: Design Tradeoffs
| Feature | SQLite Approach | Redis Alternative | PostgreSQL Option |
|---|---|---|---|
| Persistence | Disk-based, ACID | In-memory (optional) | Full ACID compliance |
| Horizontal Scaling | Single-node | Cluster-friendly | Connection pooling |
| Cron Accuracy | ~1s resolution | ~50ms resolution | ~1s resolution |
| Complex Queries | Limited JSON support | Basic key-value | JSONB support |
| Distributed Locking | SQLite-based mutex | Redis Redlock | Advisory locks |
| Setup Complexity | Zero dependencies | Requires Redis server | DBA expertise needed |
For teams that need horizontal scaling, consider a hybrid approach: a Redis-backed queue for dispatching work, with SQLite kept for task metadata.
7. Actionable Takeaways
- Template Inheritance: Create base templates for common AI operations (e.g., model training, data preprocessing) to reduce duplication.
- State Auditing: Add a task_events table to log every state transition with timestamps for post-mortem analysis.
- Dynamic Scaling: Implement worker autoscaling based on queue depth metrics using cloud provider APIs.
- Priority Queues: Extend the schema with a priority column and index to support weighted task processing.
- Circuit Breakers: Integrate Hystrix-like patterns to prevent repeated failures against unstable services.
- Metrics Collection: Track key metrics like task latency, retry rates, and queue depth using Prometheus.
- Schema Migrations: Use tools like migrate to manage SQLite schema changes during system upgrades.
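As an illustration of the circuit breaker takeaway, here is a minimal sketch of the pattern. It is not taken from Hystrix or from scheduler.ts; the class name, its constructor parameters, and the injectable clock are all assumptions made for this example.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number, // consecutive failures before opening
    private readonly cooldownMs: number, // how long to stay open
    private readonly now: () => number = Date.now // injectable clock for tests
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: call skipped");
      }
      this.state = "half_open"; // cooldown elapsed, allow one trial call
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half_open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

A task that calls an unstable service could wrap the call as breaker.call(() => callService()), so that once the service has failed several times in a row, further tasks fail fast during the cooldown instead of adding load and burning through their retries.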
8. Conclusion
Building a production-ready task scheduler requires careful consideration of task definition, state management, timing precision, and failure recovery. By combining TypeScript's type safety with SQLite's reliability and cron's scheduling power, you can create a system that handles thousands of tasks daily with minimal operational overhead.
The complete implementation in scheduler.ts demonstrates how these components integrate into a cohesive system. For large-scale deployments, consider adding features like distributed workers, priority-based scheduling, and advanced monitoring integrations.
Remember: A good scheduler doesn't just run tasks—it provides visibility, resilience, and predictability to your AI operations pipeline. Start with this foundation and extend it to meet your specific throughput requirements and operational constraints.