DEV Community

Cover image for Design a Job Scheduler at 10M Jobs/Day: 4 Components, 3 Failure Modes
Gabriel Anhaia
Gabriel Anhaia

Posted on

Design a Job Scheduler at 10M Jobs/Day: 4 Components, 3 Failure Modes


Cron is fine until it isn't. One box, a flat file, a tab character that nobody can type from memory. Works great for the nightly backup. Then somebody asks you to fire ten million scheduled jobs a day, across timezones, with sub-minute accuracy, and the interviewer wants to know what k8s CronJob controllers and Airflow schedulers do under the hood.

This is one of the cleanest system design prompts to practice on because the failure modes are real, the components are small, and the wrong answer is loud. Let's build it.

What "scheduler" actually means

A scheduler is four things glued together. People conflate them and the design suffers.

  1. A schedule store: the durable truth about what should run and when.
  2. A tick coordinator: the thing that wakes up every second (or 100ms, or 10ms), reads the store, and decides "these N jobs are due now."
  3. A dispatcher: the thing that hands due jobs to a queue. Owns idempotency keys.
  4. An executor: the worker pool that actually runs the job, with timeouts and retries.

Confuse "dispatcher" with "executor" and you get a design where the same box reads the schedule and runs the user's code. That's cron. It doesn't scale past one machine because the work the executor does (run arbitrary user code for arbitrary lengths of time) has wildly different failure modes from the work the dispatcher does (publish a tiny message to a queue, deterministically).

Keep these separated. The interviewer will nod.

Component 1: the schedule store

This is the part most candidates get wrong because they reach for Redis. Redis is fine for the due-now queue. It is not the source of truth for "every Tuesday at 14:30 Europe/Berlin, run job 47 forever."

The store needs three things: durability, an index on the next firing time, and the ability to update both atomically when a job runs. Postgres works at this scale. So does CockroachDB or any decent OLTP database with a proper index.

CREATE TABLE scheduled_jobs (
    job_id           UUID PRIMARY KEY,
    tenant_id        UUID NOT NULL,
    name             TEXT NOT NULL,
    payload          JSONB NOT NULL,
    -- cron expression OR a one-shot timestamp; never both
    cron_expr        TEXT,
    timezone         TEXT NOT NULL DEFAULT 'UTC',
    -- the calculated next firing time, always in UTC
    next_run_at      TIMESTAMPTZ NOT NULL,
    last_run_at      TIMESTAMPTZ,
    -- versioned so dispatcher can do optimistic locking
    version          BIGINT NOT NULL DEFAULT 0,
    enabled          BOOLEAN NOT NULL DEFAULT true,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    CONSTRAINT one_of_schedule
      CHECK ((cron_expr IS NULL) <> (next_run_at IS NOT NULL))
);

-- the only index the tick coordinator hits
CREATE INDEX scheduled_jobs_due_idx
  ON scheduled_jobs (next_run_at)
  WHERE enabled = true;

-- audit table, append-only, never updated
CREATE TABLE scheduled_runs (
    run_id           UUID PRIMARY KEY,
    job_id           UUID NOT NULL REFERENCES scheduled_jobs(job_id),
    scheduled_for    TIMESTAMPTZ NOT NULL,
    dispatched_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- the idempotency key the executor will use
    idempotency_key  TEXT NOT NULL UNIQUE,
    status           TEXT NOT NULL  -- 'dispatched','succeeded','failed','skipped'
);
Enter fullscreen mode Exit fullscreen mode

A few things to call out before the interviewer asks.

next_run_at is always UTC. The timezone column exists so you can re-compute the next firing time correctly when DST hits. Storing local time in the index is how you end up firing twice in November and zero times in March.

The version column is the optimistic lock the dispatcher uses to claim a job. No SELECT ... FOR UPDATE row locks held across a tick; that doesn't scale. The dispatcher reads candidates, then issues an UPDATE ... WHERE version = $expected and only proceeds if the row count is 1.

The partial index on enabled = true is a real win at 10M rows. You almost never query disabled jobs, and the index stays small.

The scheduled_runs audit table is append-only. This is the table you'll be glad you had at 03:00 when someone asks "did job 47 actually run on Tuesday or did we just say it did."

A back-of-envelope sanity check: 10M jobs/day is ~116 inserts/second sustained on scheduled_runs. A single Postgres primary handles that without sweating. The schedule store itself is mostly read traffic.

Component 2: the tick coordinator

The tick coordinator is the dangerous part. One job here: every second, find the jobs whose next_run_at <= now() and hand them to the dispatcher.

Single instance is a single point of failure. Multiple instances without coordination is duplicate dispatch. The textbook answer is leader election: run N instances, exactly one is leader, the rest are warm standbys.

// the leader-election loop, etcd-style
func RunCoordinator(ctx context.Context, etcd *clientv3.Client) error {
    session, err := concurrency.NewSession(etcd, concurrency.WithTTL(10))
    if err != nil { return err }
    defer session.Close()

    election := concurrency.NewElection(session, "/scheduler/leader")

    // blocks until we win the election
    if err := election.Campaign(ctx, hostname()); err != nil {
        return err
    }

    log.Info("became leader, starting tick loop")
    return tickLoop(ctx, session.Done())
}

func tickLoop(ctx context.Context, lost <-chan struct{}) error {
    // 1-second ticks. Don't try to be cleverer than this.
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-lost:
            // we lost the lease; step down immediately
            return fmt.Errorf("leader lease lost")
        case now := <-ticker.C:
            // hand the tick to the dispatcher; don't block here
            if err := dispatchDue(ctx, now); err != nil {
                log.Error("dispatch failed", "err", err)
            }
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

Two non-obvious things here.

The <-lost channel matters more than the Campaign call. etcd's lease can be lost without you noticing if you only listen for context cancellation. If you keep dispatching after losing the lease, the new leader is also dispatching, and you've recreated the duplicate-dispatch problem you were trying to avoid. Step down the moment the session ends.

Don't tick faster than your dispatch can finish. If dispatchDue takes longer than 1 second sometimes (it will, under load), tickers in Go won't queue; the next tick is dropped. That's the behavior you want. What you don't want is goroutine-per-tick where you fan out and overlap dispatches. Serialize the tick. Parallelism happens inside dispatchDue, where it's bounded.

For 10M jobs/day, peak rate matters more than average. If 60% of jobs are on 0 0 * * * (midnight UTC), you're dispatching 6M jobs in a 1-minute window. The tick coordinator doesn't run those; it hands them off. So the question becomes how fast the dispatcher can drain that backlog.

Component 3: the dispatcher

The dispatcher reads due jobs, publishes to a queue (SQS, Kafka, RabbitMQ; pick one), updates the schedule store with the next firing time, and writes an entry to scheduled_runs. All under an idempotency key.

func dispatchDue(ctx context.Context, tickAt time.Time) error {
    // batch read, process in chunks so a single tick can drain a spike
    const batchSize = 1000
    for {
        rows, err := db.QueryContext(ctx, `
            SELECT job_id, tenant_id, name, payload,
                   cron_expr, timezone, next_run_at, version
            FROM scheduled_jobs
            WHERE enabled = true
              AND next_run_at <= $1
            ORDER BY next_run_at
            LIMIT $2
        `, tickAt, batchSize)
        if err != nil { return err }

        jobs, err := scanJobs(rows)
        if err != nil { return err }
        if len(jobs) == 0 { return nil }

        for _, j := range jobs {
            // idempotency key = job + scheduled instant.
            // Same key for retries; never collides across schedules.
            idem := fmt.Sprintf("%s:%d", j.JobID, j.NextRunAt.Unix())

            nextRun := computeNext(j.CronExpr, j.Timezone, tickAt)

            // claim-and-advance in one statement
            res, err := db.ExecContext(ctx, `
                UPDATE scheduled_jobs
                SET next_run_at = $1,
                    last_run_at = $2,
                    version     = version + 1
                WHERE job_id = $3 AND version = $4
            `, nextRun, j.NextRunAt, j.JobID, j.Version)
            if err != nil { return err }

            n, _ := res.RowsAffected()
            if n == 0 {
                // another dispatcher (shouldn't happen with leader
                // election, but belt-and-braces) already claimed it
                continue
            }

            if err := publish(ctx, j, idem); err != nil {
                // queue down; the job will be picked up next tick
                // because we'll revert the version on a retry tick
                log.Warn("publish failed", "job", j.JobID, "err", err)
                continue
            }

            _, _ = db.ExecContext(ctx, `
                INSERT INTO scheduled_runs
                  (run_id, job_id, scheduled_for, idempotency_key, status)
                VALUES ($1, $2, $3, $4, 'dispatched')
            `, uuid.New(), j.JobID, j.NextRunAt, idem)
        }

        if len(jobs) < batchSize { return nil }
    }
}
Enter fullscreen mode Exit fullscreen mode

The idempotency key is job_id + scheduled_instant_unix. Two retries of the same scheduled fire share the key. Two different schedulings of the same job (today's midnight and tomorrow's midnight) get different keys because the timestamp differs. The executor uses this key to enforce at-most-once execution at the application layer if the user opted in.

Note the at-most-once vs at-least-once choice lives here, not in the queue. The queue is at-least-once (because SQS, Kafka, RabbitMQ all are). The dispatcher's INSERT into scheduled_runs happens after publish, so if the publish succeeds and the insert fails, you get a duplicate dispatch on the next tick. Flip the order and you can lose dispatches. Pick one, document the failure mode loudly. Most real schedulers pick at-least-once because losing jobs is worse than running them twice.

Component 4: the executor

The executor is a worker pool consuming from the queue. Standard stuff: pull a message, run the user's job with a timeout, ack on success, nack on failure with a retry policy.

import asyncio
import signal
from contextlib import asynccontextmanager

class Executor:
    def __init__(self, queue, registry, *, concurrency=64,
                 default_timeout=300):
        self.queue = queue
        self.registry = registry          # name -> handler
        self.sem = asyncio.Semaphore(concurrency)
        self.default_timeout = default_timeout
        self._running = True

    async def run(self):
        while self._running:
            msg = await self.queue.receive()
            if msg is None:
                continue
            asyncio.create_task(self._handle(msg))

    async def _handle(self, msg):
        async with self.sem:
            handler = self.registry.get(msg.job_name)
            if handler is None:
                # unknown job; quarantine, don't drop silently
                await self.queue.dead_letter(msg, reason="unknown_handler")
                return

            timeout = msg.timeout_s or self.default_timeout
            try:
                async with asyncio.timeout(timeout):
                    await handler(msg.payload, idempotency_key=msg.idem)
                await self.queue.ack(msg)
            except TimeoutError:
                # log scheduled_for, not now(): the audit trail wants the
                # original tick, not the time the worker happened to drain it
                await self.queue.nack(msg, retryable=True,
                                      reason="timeout")
            except Exception as e:
                await self.queue.nack(msg, retryable=is_retryable(e),
                                      reason=str(e))
Enter fullscreen mode Exit fullscreen mode

Two things to highlight for the interviewer.

The default timeout is per-job, not global. A schedule that runs send_daily_email and one that runs rebuild_search_index have different tolerances. Storing timeout_s on the job and shipping it with the message means workers don't need to know about job semantics.

The bounded semaphore is what keeps a single worker from blowing out memory when a spike arrives. Without it, every received message creates a task immediately and you fan out to whatever the queue gave you. 10k pending tasks waiting on shared resources is how an executor box dies.

Failure mode 1: missed ticks during failover

The leader dies. The standby takes over. The transition isn't instant; etcd's TTL is 10 seconds in the code above, so up to 10 seconds of ticks didn't fire.

What did you miss? Anything with next_run_at between the leader's death and the new leader's first tick.

The fix is the catch-up window. The dispatcher query is next_run_at <= $1 where $1 is the current tick instant, not next_run_at = $1. So when the new leader fires its first tick, it picks up everything that was due during the gap. Late, but fired.

For schedules where lateness matters (cron, but not "the next-run-at" semantics of a one-shot job), this is the right tradeoff. For schedules where firing late is actively wrong (a stock-market open trigger at 09:30:00 doesn't want a 09:30:09 firing), you need a different model. The job carries a "skip if older than" tolerance, and the dispatcher checks now() - next_run_at < tolerance before publishing.

The catch-up window also covers the case where the dispatcher itself is slow. If a tick takes 8 seconds to drain a spike, the next tick fires immediately when the previous one returns, and it sees everything that became due during the slow batch.

What you do not want is the new leader rewinding its clock and re-firing already-dispatched ticks. The version check in the UPDATE plus the audit table's unique constraint on idempotency_key are the belt and braces against that.

Failure mode 2: clock skew across executors

Your tick coordinator's clock is 400ms ahead of one executor box. The executor box's clock is 600ms behind a third box. A job named mark_subscription_expired runs at 00:00:00 UTC against a row that has expires_at = 00:00:00.000. The executor's clock says it's 23:59:59.4. The row isn't expired yet.

This isn't a thought experiment. It's a class of bug that gets shipped quarterly somewhere in the industry.

Two rules. UTC everything. No localtime in the schedule store, no localtime in the queue message, no localtime in the executor's comparisons. The user's timezone is metadata used to compute next_run_at once, not a thing you carry around at dispatch time.

NTP is non-negotiable. Every box runs chronyd or systemd-timesyncd against a known-good pool. AWS EC2 has its time-sync endpoint at 169.254.169.123; GCP has Google's NTP servers; on-prem teams run their own stratum-2. The interviewer wants to hear that you'd alert on chrony tracking reporting more than 100ms of offset.

A third belt-and-braces move: include scheduled_for in every queue message and have the executor compare its own clock to that timestamp on receipt. If the executor sees scheduled_for = 00:00:00 and its own clock says 23:59:59, it knows to either wait or refuse, and either is better than running the job against a state that isn't yet the state the schedule was designed for.

Failure mode 3: long jobs blocking the next tick

The dispatcher publishes to a queue. The executor pulls from the queue. A job that takes 4 hours doesn't block the dispatcher, it blocks an executor worker. This is the right separation, and it's exactly the reason dispatcher and executor are different components.

The failure shape: a job takes 4 hours. The schedule fires every 1 hour. By hour 5, you have five concurrent runs of the same job, all working on overlapping data, racing each other to corrupt state.

The schedule store needs one more field for this.

ALTER TABLE scheduled_jobs ADD COLUMN concurrency_policy TEXT
    NOT NULL DEFAULT 'allow'
    CHECK (concurrency_policy IN ('allow','forbid','replace'));
Enter fullscreen mode Exit fullscreen mode

Three policies, taken straight from k8s CronJob because Kubernetes already solved this and there's no point re-inventing the vocabulary.

  • allow: fire it, let them overlap. Default for stateless jobs.
  • forbid: skip the new run if the previous run hasn't completed.
  • replace: cancel the previous run and start the new one.

The dispatcher enforces this by checking the most recent scheduled_runs row for the job before publishing. forbid means: if there's a dispatched row without a succeeded/failed row, write a skipped row and don't publish. replace means: send a cancellation message for the in-flight run, then publish the new one.

Queue depth isn't execution depth. A backed-up queue and a backed-up executor pool look the same to a naive dashboard. You want metrics for both, and the alert that matters is executions_in_flight > expected_max for any job tagged forbid. That's the one that catches the silent overlap.

The 90-second answer

You'll be asked to summarize at the end. Here's the version that fits in the time you'll be given:

"Four components, kept separate. A schedule store in Postgres with the cron expression, the next-run-at as UTC, and a partial index on the next-run-at where enabled is true. A tick coordinator running with etcd leader election so exactly one instance is firing per second; standbys are warm. A dispatcher that reads due rows in batches, advances the next-run-at with an optimistic version check, then publishes to SQS with an idempotency key that's job_id + scheduled_unix. An executor worker pool consuming from the queue, with per-job timeouts and a bounded concurrency semaphore. The audit table scheduled_runs is append-only and tells operators exactly what was dispatched and when.

Three failure modes I'd call out. Missed ticks during failover: the dispatcher query is next_run_at <= now(), not equality, so the catch-up window absorbs the leader-transition gap. Clock skew: UTC stored, NTP enforced via chronyd, alert on >100ms offset, and scheduled_for ships in the message so the executor can sanity-check its own clock. Long jobs overlapping their own next run: concurrency_policy column with allow/forbid/replace, dispatcher checks the latest scheduled_runs row before publishing. Queue depth and execution depth are different metrics; alerting on one without the other hides the overlap."

That's the answer. 90 seconds, hits every component, names the failure modes by their actual shapes, and shows you know that cron and Airflow and k8s CronJob all converge on roughly this architecture for the same reasons.


If this was useful

This walk-through is the exact shape the System Design Pocket Guide: Interviews uses across all 15 designs in the book: components first, failure modes second, a 90-second summary you can actually deliver in the room. The job-scheduler chapter goes deeper into the catch-up-window math and the cron-expression edge cases (DST, leap seconds, the L and W operators most parsers get wrong). If you liked this one, that's the chapter to start with.

What part of the scheduler do you spend the most time arguing about in interviews: the leader-election story, the at-least-once vs at-most-once tradeoff, or the concurrency-policy column? Drop your take in the comments.

System Design Pocket Guide: Interviews

Top comments (0)