Introduction
Inspired by Google’s “Reliable Cron across the Planet”, I set out to design my own version of a distributed cron system. The core goal was simple: execute a binary at scheduled times - but do so in a scalable and fault-tolerant way. In other words, my system should continue operating even when nodes crash.
To guide design decisions and avoid designing for unbounded scale, I set a few baseline expectations for system load:
- 500–1,000 worker nodes
- Thousands of scheduled tasks per day
- Task duration: 1–30 seconds
Finally, I deliberately trade off liveness (it is acceptable to skip a launch) for safety (never duplicate work).
Overall Architecture
- Centralized state store: PostgreSQL (replicated for HA)
- Coordinator-based dispatch: Coordinators own all DB I/O and task routing via gRPC.
- Workers: Push heartbeats; receive tasks over RPC.
- Scheduler: A separate component that ingests user cron specs and writes them to the DB.
Naive Solution:
The naive solution for a fault-tolerant cron is to let each worker node independently poll the database for pending tasks and execute them when found. While conceptually simple, this approach does not scale. With hundreds of workers polling every few seconds, the database becomes a central bottleneck, overwhelmed by read contention.
Moreover, implementing heartbeats under this decentralized model only amplifies database load, as each worker must constantly read and write metadata about its own liveness or task state.
Using a Coordinator:
To avoid overloading the database, I adopted a coordinator-based design. In this model, a centralized coordinator node (or set of nodes) handles all database interactions, from dispatching tasks to workers to updating task metadata. Workers do not poll the database directly; instead, they receive tasks pushed from the coordinator via RPC. This architecture shifts the burden of task coordination away from the database.
Additionally, worker heartbeats are sent to the coordinator (not the database), further reducing I/O pressure on persistent storage.
For fault tolerance, the system supports multiple coordinator instances. While leader election via consensus protocols like Raft (my own implementation) or Paxos can ensure a single active coordinator, I opted for a simpler approach: allowing multiple coordinators to run concurrently. To avoid race conditions, PostgreSQL advisory locks are used during task table scans, ensuring mutual exclusion. This design keeps coordinators stateless and enables horizontal scalability without complex coordination logic.
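As a rough sketch of that mutual exclusion, the scan loop can attempt a non-blocking advisory lock before touching the task table. The lock key, table name, and use of Go's database/sql are my own illustrative choices; the pg_try_advisory_lock / pg_advisory_unlock calls are standard PostgreSQL.

```go
// Illustrative sketch: only one coordinator scans at a time, guarded by
// a PostgreSQL advisory lock. The lock key and query are assumptions.
package coordinator

import (
	"context"
	"database/sql"
)

const scanLockKey = 42 // arbitrary application-level lock ID

func scanWithAdvisoryLock(ctx context.Context, db *sql.DB) error {
	// Advisory locks are held per session, so pin one connection for
	// the lock, the scan, and the unlock.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	var acquired bool
	// pg_try_advisory_lock returns immediately: true if we hold the
	// lock, false if another coordinator is already scanning.
	if err := conn.QueryRowContext(ctx,
		"SELECT pg_try_advisory_lock($1)", scanLockKey).Scan(&acquired); err != nil {
		return err
	}
	if !acquired {
		return nil // another coordinator owns this cycle; skip it
	}
	defer conn.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", scanLockKey)

	// ... scan the task table and dispatch eligible tasks here ...
	return nil
}
```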
Worker Nodes
The coordinator is responsible for distributing tasks to worker nodes and maintaining visibility into their availability. A naive strategy might involve the coordinator pinging workers on-demand when a task needs dispatching. However, this has major drawbacks:
- It fails to proactively detect crashes or network partitions — if a node silently dies, the system won’t know until it tries (and fails) to assign work.
- It offers no insight into system load, making it hard to balance task distribution based on available capacity.
To address these limitations, I use a heartbeat mechanism: each worker node periodically sends a heartbeat RPC to the coordinator, reporting its liveness and current workload. This allows the coordinator to maintain an up-to-date in-memory view of cluster health and to dispatch tasks based on available capacity.
A simplified version of the heartbeat payload looks like this:
```json
{
  "worker_id": "string",       // worker ID
  "timestamp": 1234567890,     // Unix epoch (in ms) when this heartbeat was sent
  "status": "idle" | "busy",   // state flag
  "load": {
    "running_tasks": 2,        // number of tasks currently executing
    "max_capacity": 8          // total parallel slots this worker can handle
  }
}
```
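To make the mechanism concrete, here is a minimal sketch of the worker-side loop, assuming the payload above maps onto a Heartbeat struct and a small client interface; the names and the 5-second interval are illustrative stand-ins for the real gRPC types.

```go
// Illustrative worker-side heartbeat loop. CoordinatorClient and
// Heartbeat stand in for the generated gRPC client and message types.
package worker

import (
	"context"
	"time"
)

type Heartbeat struct {
	WorkerID     string
	TimestampMs  int64
	Status       string // "idle" or "busy"
	RunningTasks int
	MaxCapacity  int
}

type CoordinatorClient interface {
	SendHeartbeat(ctx context.Context, hb Heartbeat) error
}

// heartbeatLoop reports liveness and load until the context is cancelled.
// snapshot returns the worker's current status and task counts.
func heartbeatLoop(ctx context.Context, c CoordinatorClient, workerID string, snapshot func() Heartbeat) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			hb := snapshot()
			hb.WorkerID = workerID
			hb.TimestampMs = time.Now().UnixMilli()
			// Best effort: a lost heartbeat only delays detection until
			// the next tick; the coordinator's reaper applies the timeout.
			_ = c.SendHeartbeat(ctx, hb)
		}
	}
}
```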
Coordinator
To track worker node health, the coordinator maintains an in-memory map of active nodes, keyed by node ID, with status and last heartbeat timestamp. A background reaper loop periodically scans this map to detect unresponsive nodes. If multiple coordinators may be active, this map would typically be backed by a centralized store such as Redis for shared access.
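A minimal single-coordinator sketch of that map and reaper might look like the following; the struct names and timeout parameter are mine, and a multi-coordinator deployment would move this state into a shared store.

```go
// Illustrative coordinator-side health map and reaper. With multiple
// coordinators, this state would live in a shared store such as Redis.
package coordinator

import (
	"sync"
	"time"
)

type workerState struct {
	Status        string // "idle" or "busy", from the last heartbeat
	RunningTasks  int
	MaxCapacity   int
	LastHeartbeat time.Time
}

type healthTracker struct {
	mu      sync.Mutex
	workers map[string]workerState // keyed by worker_id
}

// Record is called by the heartbeat RPC handler.
func (h *healthTracker) Record(workerID string, s workerState) {
	h.mu.Lock()
	defer h.mu.Unlock()
	s.LastHeartbeat = time.Now()
	h.workers[workerID] = s
}

// Reap runs on a background loop; it drops workers whose last heartbeat
// is older than the timeout and returns their IDs so the caller can
// reconcile any tasks assigned to them.
func (h *healthTracker) Reap(timeout time.Duration) []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var dead []string
	for id, w := range h.workers {
		if time.Since(w.LastHeartbeat) > timeout {
			dead = append(dead, id)
			delete(h.workers, id)
		}
	}
	return dead
}
```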
Task Dispatching:
To assign tasks, the coordinator periodically scans the database for eligible tasks (e.g., scheduled and not yet completed). A naive approach might wrap the scan, the dispatch RPC, and the database update in a single transaction, but this is inefficient: it holds database locks during potentially slow RPC calls, causing contention under load. Instead, the scan is kept separate from the dispatch RPC and the subsequent status update, as sketched below.
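Under that split, the scan on its own is just a read. Here is a sketch, with an assumed tasks table (id, status, scheduled_at); no locks are held across the later RPCs.

```go
// Illustrative scan for dispatchable tasks. The table layout is an
// assumption for this sketch.
package coordinator

import (
	"context"
	"database/sql"
)

func eligibleTasks(ctx context.Context, db *sql.DB, limit int) ([]int64, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id FROM tasks
		 WHERE status = 'NEW' AND scheduled_at <= now()
		 ORDER BY scheduled_at
		 LIMIT $1`, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```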
Ensuring At-Most-Once Delivery:
Decoupling these steps is tricky, however, because it raises the classic problem of exactly-once delivery semantics. If the coordinator's dispatch RPC reaches the worker node but the response is lost to a network partition, the coordinator may conclude that the task was never dispatched and dispatch it again on the next scan, resulting in duplicate processing. To craft a solution, let me first restate the business requirement from earlier: favor skipping a task launch over risking a duplicate execution.
Two-Phase Dispatch Protocol:
To guard against RPC/network partitions, I designed a lightweight, two-phase dispatch protocol (inspired by 2PC’s prepare/commit) that enforces at-most-once task execution:
The flow is straightforward:
- Phase 1: Pre-Assignment
  - Coordinator offers the task to a worker with a pre-assignment RPC.
  - Worker immediately replies "yes"/"no."
  - yes → mark the task TEMPORARY_ASSIGNED; no → leave it NEW.
- Phase 2: Completion Confirmation
  - After finishing the work, the worker calls ConfirmDone(taskID) (retrying up to N times).
  - If the task is still TEMPORARY_ASSIGNED, the coordinator marks it DONE and returns "ok."
  - Otherwise it returns "not ok," and the worker aborts.
In the face of lost RPCs, a task may be skipped in one cycle and retried later, but it is never executed twice. This achieves at-most-once semantics with minimal overhead compared to full 2PC.
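In code, the coordinator side of both phases reduces to conditional UPDATEs, so each state transition stays atomic even with several coordinators running. The WorkerClient interface and table columns below are illustrative; the real transport is gRPC.

```go
// Illustrative coordinator side of the two-phase dispatch. Conditional
// UPDATEs make each transition atomic; names and schema are assumptions.
package coordinator

import (
	"context"
	"database/sql"
)

// WorkerClient stands in for the gRPC client used to reach a worker.
type WorkerClient interface {
	ID() string
	PreAssign(ctx context.Context, taskID int64) (bool, error) // "yes"/"no"
}

// Phase 1: offer the task; record the pre-assignment only if the worker
// accepted and the task is still NEW.
func preAssign(ctx context.Context, db *sql.DB, w WorkerClient, taskID int64) error {
	accepted, err := w.PreAssign(ctx, taskID)
	if err != nil || !accepted {
		return err // leave the task NEW; a later scan may retry it
	}
	_, err = db.ExecContext(ctx,
		`UPDATE tasks SET status = 'TEMPORARY_ASSIGNED', worker_id = $2
		 WHERE id = $1 AND status = 'NEW'`, taskID, w.ID())
	return err
}

// Phase 2: handler for the worker's ConfirmDone(taskID) call. The UPDATE
// only matches if the task is still TEMPORARY_ASSIGNED, so a stale or
// duplicate confirmation can never mark it DONE a second time.
func confirmDone(ctx context.Context, db *sql.DB, taskID int64) (bool, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE tasks SET status = 'DONE'
		 WHERE id = $1 AND status = 'TEMPORARY_ASSIGNED'`, taskID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err // true -> "ok", false -> "not ok" (worker aborts)
}
```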
Worker Node Crashes:
When a worker completes its task, it sends a response back to the coordinator, which then updates the database to mark the task as completed.
If a worker crashes or becomes unresponsive, the reaper loop detects this via the in-memory heartbeat map. The coordinator then queries the database for any active tasks previously assigned to that node and logs them for further action — such as for re-dispatching, retries, or administrative reporting.
Google’s approach, as noted earlier, employs sophisticated checkpointing to ensure idempotent task processing and seamless failover for running tasks. To keep my system straightforward, I opted for a simpler strategy: record crash context for manual or automated remediation.
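Here is a sketch of that remediation step, wired to the reaper output from earlier; the column names are assumed as before.

```go
// Illustrative crash handling: when the reaper reports a dead worker,
// find the tasks it still held and record them for remediation.
package coordinator

import (
	"context"
	"database/sql"
	"log"
)

func reportOrphanedTasks(ctx context.Context, db *sql.DB, workerID string) error {
	rows, err := db.QueryContext(ctx,
		`SELECT id FROM tasks
		 WHERE worker_id = $1 AND status = 'TEMPORARY_ASSIGNED'`, workerID)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return err
		}
		// Logged context can feed re-dispatch, retries, or an admin report.
		log.Printf("worker %s died while holding task %d", workerID, id)
	}
	return rows.Err()
}
```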
Scheduler
We now have the coordinator and worker nodes. The final component of the service is the scheduler, which accepts user cron definitions (e.g. “0 0 * * *”) and writes them into the DB as NEW tasks.
A naive approach would have the coordinator handle client requests directly. While feasible, this is uncommon in production systems, as it makes it difficult to scale components independently. Instead, introducing a separate scheduler node enables a clean separation of responsibilities and improves scalability.
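As a sketch of the scheduler's write path, here is one way to parse a spec and materialize its next occurrence as a NEW task. The use of the robfig/cron parser and the tasks schema are my own illustrative choices.

```go
// Illustrative scheduler ingest: parse a cron spec and insert its next
// run as a NEW task. Library choice and schema are assumptions.
package scheduler

import (
	"context"
	"database/sql"
	"time"

	"github.com/robfig/cron/v3"
)

func ingestCronSpec(ctx context.Context, db *sql.DB, spec, binaryPath string) error {
	sched, err := cron.ParseStandard(spec) // e.g. "0 0 * * *"
	if err != nil {
		return err
	}
	nextRun := sched.Next(time.Now())

	// The coordinator's scan picks this row up once scheduled_at passes.
	_, err = db.ExecContext(ctx,
		`INSERT INTO tasks (binary_path, scheduled_at, status)
		 VALUES ($1, $2, 'NEW')`, binaryPath, nextRun)
	return err
}
```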
Conclusion
Ensuring exactly-once task delivery in distributed systems is inherently challenging. My design consciously favors occasional task skipping over duplicate execution, striking a balance between reliability and simplicity. While it doesn’t aim to meet the scale of Google’s stringent SLAs, I hope it provides a practical and effective solution for moderate workloads.