1. Intro
What is it?
Tickr is a background job scheduler built using Go, Redis, and MySQL.
It consists of a pool of workers waiting to be assigned jobs. Jobs move through waiting and ready queues implemented in Redis, and are assigned to workers by a scheduler. Everything runs concurrently and is event-driven.
Why did I build it?
I was getting tired of building CRUD backend APIs in Node.js and Go. I wanted to work on something different, something unfamiliar, that would force me to think about concurrency, failure handling, and system behavior instead of just endpoints and schemas.
2. What broke in v1?
Polling
In v1, the scheduler polled Redis every second to check whether any job was ready to execute. This worked, but it was inefficient and didn’t feel right.
Fixing this wasn’t trivial for me at first, but I eventually replaced polling with unbuffered Go channels and blocking operations. Channels ended up being one of the most important concepts in this entire project.
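The shift from polling to blocking can be sketched in a few lines. Here a plain Go channel stands in for Redis’s blocking pop (the real scheduler blocks on Redis, not on an in-process channel):

```go
package main

import "fmt"

// waitForJob blocks until a job arrives — no ticker, no wasted wake-ups.
// A plain channel stands in here for Redis's blocking pop.
func waitForJob(ready <-chan string) string {
	return <-ready
}

func main() {
	ready := make(chan string) // unbuffered: handoff happens only when both sides are ready

	go func() {
		ready <- "job-42" // a job becomes ready
	}()

	// v1 would have woken up every second to check; v2 just blocks here.
	fmt.Println("picked up", waitForJob(ready))
}
```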
Assumptions
In v1, I assumed too many things would just work. Once those assumptions broke, debugging and recovery became difficult.
One example:
If the server was terminated while jobs were still in the waiting queue, recovering those jobs on restart was messy and unreliable.
Redis as single source of truth
Initially, I was fine with Redis being the only source of truth. But as I tested more edge cases, I started asking uncomfortable questions:
- What happens if Redis crashes while jobs are still queued?
- What if Redis loses its state after a crash?
- What if a job fails midway?
- What if I need logs or history for completed jobs?
Answering these questions properly meant rethinking the architecture — which led to Tickr v2.
3. v2 Architecture
MySQL as the source of truth
In v2, MySQL became the source of truth.
Jobs are first persisted in MySQL. Only the JobID along with scheduling information is pushed to Redis. Redis is used purely for scheduling and coordination, not durability.
This approach:
- prevents job loss during crashes
- avoids unnecessary MySQL polling
- allows jobs to move across queues using lightweight IDs
- lets workers fetch the full job data from MySQL only when execution is needed
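As a sketch, the write path might look like this; `Store`, `Queue`, and the method names are hypothetical stand-ins for the MySQL and Redis clients, not Tickr’s actual API:

```go
package main

import "fmt"

type Job struct {
	ID      int64
	Payload string
	RunAt   int64 // unix timestamp
}

// Store and Queue are illustrative interfaces standing in for the
// MySQL and Redis clients.
type Store interface {
	SaveJob(j Job) (int64, error) // persist the full job, return its ID
}

type Queue interface {
	PushWaiting(jobID, runAt int64) error // only the ID plus schedule goes to Redis
}

// ScheduleJob persists first (MySQL is the source of truth), then
// enqueues the lightweight ID for scheduling. If the push fails, the
// job is still safe in MySQL and recovery can re-enqueue it.
func ScheduleJob(s Store, q Queue, j Job) (int64, error) {
	id, err := s.SaveJob(j)
	if err != nil {
		return 0, err
	}
	if err := q.PushWaiting(id, j.RunAt); err != nil {
		return 0, err
	}
	return id, nil
}

// In-memory stand-ins so the sketch runs without real servers.
type memStore struct{ nextID int64 }

func (m *memStore) SaveJob(j Job) (int64, error) { m.nextID++; return m.nextID, nil }

type memQueue struct{ ids []int64 }

func (m *memQueue) PushWaiting(id, runAt int64) error { m.ids = append(m.ids, id); return nil }

func main() {
	s, q := &memStore{}, &memQueue{}
	id, _ := ScheduleJob(s, q, Job{Payload: "send-email", RunAt: 0})
	fmt.Println("persisted job", id, "queued IDs:", q.ids)
}
```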
Scheduler and Workers
The scheduler runs a single goroutine (PopReadyQueue) that blocks on Redis until a job becomes available.
If Redis disconnects:
- the scheduler waits until Redis is reachable again
- checks whether Redis lost its state
- if state is lost, it rebuilds queues from MySQL
Jobs popped from the ready queue are sent into an unbuffered job channel, which all workers are listening to. When a job appears on the channel, exactly one worker picks it up and executes it.
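That handoff can be simulated entirely in memory. The buffered `readyQueue` channel below is a stand-in for the Redis ready queue, while `jobs` is the real pattern: an unbuffered channel where each job is received by exactly one worker.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool forwards IDs from a ready queue into an unbuffered jobs
// channel consumed by a pool of workers, and returns how many jobs ran.
func runPool(ids []int, workers int) int {
	readyQueue := make(chan int, len(ids)) // stand-in for the Redis ready queue
	jobs := make(chan int)                 // unbuffered: one sender, one receiver per job

	// Scheduler goroutine: blocks on the ready queue, feeds the workers.
	go func() {
		for id := range readyQueue {
			jobs <- id // blocks until some worker is free to receive
		}
		close(jobs)
	}()

	var wg sync.WaitGroup
	var mu sync.Mutex
	ran := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs { // each job is received by exactly one worker
				mu.Lock()
				ran++
				mu.Unlock()
			}
		}()
	}

	for _, id := range ids {
		readyQueue <- id
	}
	close(readyQueue)
	wg.Wait()
	return ran
}

func main() {
	fmt.Println("jobs completed:", runPool([]int{1, 2, 3}, 2))
}
```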
If a job fails:
- it is retried with a delay
- retries are bounded (max 3 attempts)
- delay increases linearly with each attempt
4. Edge Cases Handled
Redis Crash
When Redis goes down, the scheduler pauses and waits until Redis becomes available again. Once Redis is back, it checks whether state was lost and triggers recovery from MySQL if required. Otherwise, it continues normally.
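One way to detect state loss is to compare what Redis holds against what MySQL says is still pending. This is a hypothetical sketch; the interfaces and method names are illustrative, not Tickr’s actual code:

```go
package main

import "fmt"

// Illustrative interfaces standing in for the two stores.
type redisView interface {
	QueuedIDs() []int64 // job IDs currently present in Redis queues
}
type mysqlView interface {
	PendingIDs() []int64 // jobs MySQL says are not yet completed
}

// missingJobs returns pending jobs Redis no longer knows about.
// A non-empty result means Redis lost state, and the queues must be
// rebuilt from MySQL.
func missingJobs(r redisView, m mysqlView) []int64 {
	seen := make(map[int64]bool)
	for _, id := range r.QueuedIDs() {
		seen[id] = true
	}
	var missing []int64
	for _, id := range m.PendingIDs() {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

// ids implements both views for the demo below.
type ids []int64

func (v ids) QueuedIDs() []int64  { return v }
func (v ids) PendingIDs() []int64 { return v }

func main() {
	// Redis came back empty after a crash; MySQL still has three pending jobs.
	fmt.Println("to re-enqueue:", missingJobs(ids{}, ids{1, 2, 3}))
}
```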
Overdue jobs
If Redis is down while jobs are scheduled, their execution time may pass without being triggered.
Originally, these jobs wouldn’t execute until a new job entered the queue. I fixed this by signaling the scheduler through a channel when the waiting queue is refilled, forcing it to re-evaluate overdue jobs immediately.
Shutdown during execution
If the server is shut down while jobs are executing:
- workers stop accepting new jobs
- in-flight jobs are allowed to complete
- the program waits for workers to finish before exiting
One issue I ran into was that completed jobs couldn’t update their status in MySQL because the global context was already cancelled.
I solved this by using a background context specifically for final status updates, ensuring correctness even during shutdown.
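A sketch of that shutdown path: the worker detaches the final write from the cancelled global context by building a fresh one from `context.Background()` (the function names are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// markCompleted builds its own context from context.Background(), so
// the final status write succeeds even after the global context is
// cancelled. A real version would pass ctx to the MySQL UPDATE.
func markCompleted(jobID int) string {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_ = ctx
	return fmt.Sprintf("job %d marked completed", jobID)
}

func main() {
	global, shutdown := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		<-global.Done()               // shutdown arrives while the job is in flight
		fmt.Println(markCompleted(7)) // still works: detached from the global context
	}()

	shutdown() // stop accepting new jobs ...
	wg.Wait()  // ... but wait for in-flight work before exiting
}
```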
5. What did I learn?
This project was very different from anything I had built before, and I enjoyed it a lot.
I learned:
- why ownership of state matters
- how fragile assumptions can be
- how to reason about failure instead of avoiding it
- how to design systems that recover instead of restarting
- how to write backend code that doesn’t break easily under edge cases
Tickr v2 taught me more about backend systems than any tutorial ever did.
The GitHub repository is available here: Tickr
Feedback is welcome, especially around architecture and failure handling.
