Uthman Oladele

I built a distributed job queue in Go to understand how they actually work

I have used job queues my whole developer life without knowing what was inside them.

So I built one.

Not a wrapper around an existing queue. A full implementation from scratch
with Redis, PostgreSQL, goroutines, and real failure handling.

Here is everything I learned.


Why Dual Storage

Most job queues use one store. Redis is fast. PostgreSQL is durable. I wanted both.

Redis handles dispatch via a sorted set priority queue. Fast enqueue, fast dequeue.

PostgreSQL is the source of truth. Every job lives there permanently.

The rule: no critical state lives only in Redis. If Redis wipes completely,
no job is lost. PostgreSQL has everything.
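
Here is a minimal sketch of that rule using in-memory stand-ins. It is not the project's actual code: `durableStore` plays the role of PostgreSQL, `dispatchIndex` plays the role of the Redis sorted set, and every type, function name, and the "higher priority first" ordering is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical job record; the real schema may differ.
type Job struct {
	ID       string
	Priority int // assumption: higher value is dispatched first
}

// durableStore stands in for PostgreSQL, the source of truth.
type durableStore struct{ jobs map[string]Job }

// dispatchIndex stands in for the Redis sorted set: job ID -> score.
type dispatchIndex struct{ scores map[string]int }

// Enqueue writes to the durable store first, then to the dispatch index,
// so the index never holds a job the durable store does not know about.
func Enqueue(db *durableStore, idx *dispatchIndex, j Job) {
	db.jobs[j.ID] = j
	idx.scores[j.ID] = j.Priority
}

// Pop removes and returns the ID with the highest score, similar to popping
// the top member of a sorted set. Ties break by ID so output is deterministic.
func (idx *dispatchIndex) Pop() (string, bool) {
	if len(idx.scores) == 0 {
		return "", false
	}
	ids := make([]string, 0, len(idx.scores))
	for id := range idx.scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if idx.scores[ids[a]] != idx.scores[ids[b]] {
			return idx.scores[ids[a]] > idx.scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	delete(idx.scores, ids[0])
	return ids[0], true
}

// Rebuild creates a fresh index from the durable store. This is what makes
// a complete wipe of the fast store survivable.
func Rebuild(db *durableStore) *dispatchIndex {
	idx := &dispatchIndex{scores: map[string]int{}}
	for id, j := range db.jobs {
		idx.scores[id] = j.Priority
	}
	return idx
}

func main() {
	db := &durableStore{jobs: map[string]Job{}}
	idx := &dispatchIndex{scores: map[string]int{}}
	Enqueue(db, idx, Job{ID: "email-1", Priority: 1})
	Enqueue(db, idx, Job{ID: "billing-1", Priority: 5})

	idx = Rebuild(db) // simulate the fast store being wiped and rebuilt
	first, _ := idx.Pop()
	fmt.Println(first) // billing-1
}
```

A real rebuild would presumably re-add only jobs that are still pending. The sketch skips job status to keep the write order and the recovery path easy to see.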


Three Things Running Concurrently

  • A worker pool that executes jobs
  • A scheduler that promotes jobs from PostgreSQL into Redis when their time arrives
  • A stale reaper that detects crashed workers and requeues their jobs automatically

All three run as goroutines. All three coordinate without stepping on each other.


What Happens When a Worker Crashes

This is the part most tutorials skip.

When a worker picks up a job, it marks it as in-progress. If that worker crashes
mid-execution, the job stays marked in-progress forever unless something intervenes.

The stale reaper scans for jobs that have been in progress longer than their timeout.
It requeues them automatically with exponential backoff.

No manual intervention. No lost jobs.


The Numbers

Metric              Result
Job registration    52 ns/op, 0 allocations
Job execution       950 ns/op

Benchmarked with Go's built-in benchmark tooling.
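
For readers who have not used it, this is roughly what a measurement with Go's `testing` package looks like. The `Registry` type and its `Register` method are hypothetical stand-ins, assuming "job registration" means storing a handler under a job type name; the numbers this sketch prints will not match the table above.

```go
package main

import (
	"fmt"
	"testing"
)

// Handler is a hypothetical job handler signature.
type Handler func(payload []byte) error

// Registry maps job type names to handlers (assumed design).
type Registry struct{ handlers map[string]Handler }

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{handlers: make(map[string]Handler)}
}

// Register stores h under name, replacing any previous handler.
func (r *Registry) Register(name string, h Handler) {
	r.handlers[name] = h
}

// noop is a handler that does nothing, used only for measurement.
func noop(payload []byte) error { return nil }

// BenchmarkRegister measures the cost of one Register call.
func BenchmarkRegister(b *testing.B) {
	r := NewRegistry()
	b.ReportAllocs() // include allocation counts in the result
	b.ResetTimer()   // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		r.Register("send_email", noop)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(BenchmarkRegister)
	fmt.Println(res.String(), res.MemString())
}
```

In a normal project the benchmark function would live in a `_test.go` file and run with `go test -bench=. -benchmem`; calling `testing.Benchmark` from `main` just keeps the sketch in one file.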

Ships with a Prometheus metrics endpoint and a pre-built Grafana dashboard
covering queue depth, throughput, and failure rates by job type.


What I Actually Understand Now

  • Why Redis alone is not enough for a job queue
  • Why crashed-worker recovery needs to be a first-class feature, not an afterthought
  • Why exponential backoff matters more than immediate retries

The project is open source with one external contributor already.

  • GitHub: github.com/codetesla51/kyu
  • Landing page: kyu-job-queue.vercel.app

