Every Node.js project eventually needs background jobs. Send this email. Process this file. Run this alert evaluation at midnight. The default answer in the ecosystem is Redis + BullMQ. It's fast, battle-tested, and has a great API.
It also means running Redis.
For projects already running PostgreSQL, that's a second database to provision, monitor, back up, and pay for. On AWS, an ElastiCache instance starts at ~$15/month for the smallest node: not catastrophic, but not nothing either. More importantly, it's another moving part that can fail.
I recently shipped a Redis-free deployment mode for an open-source project I maintain. The job queue runs entirely on PostgreSQL using graphile-worker. Here's everything I learned from the experience: what graphile-worker does well, where it has real limits, and when you should just keep Redis.
The Problem With "Just Add Redis"
Before getting into the comparison, it's worth being honest about what the Redis dependency actually costs.
Operationally, Redis is simple to run but adds surface area. Every additional service is another thing that can go down, run out of memory, or need a version upgrade. In Docker Compose deployments (which is how most self-hosted tools get deployed), it's another container, another volume, another health check.
On AWS, the options are:
- ElastiCache (managed, ~$15-50/month for a usable instance)
- Redis on EC2 (self-managed, cheaper but more work)
- Redis on the same instance as your app (fine for dev, risky in prod)
For multi-instance scaling, Redis becomes mandatory: you can't share BullMQ queues across processes without it. But for a single-instance deployment, you're paying the Redis tax without getting the multi-instance benefit.
The question I asked: if I'm already running PostgreSQL and my job volume doesn't justify a dedicated queue broker, what do I lose by using Postgres as the queue?
How graphile-worker Works
graphile-worker stores jobs in a PostgreSQL table and uses SELECT ... FOR UPDATE SKIP LOCKED to claim them. That one clause is the key insight: it's an atomic claim that lets multiple workers poll the same table concurrently without blocking each other or grabbing the same job.
-- Simplified sketch of the claim query (graphile-worker's real internals differ slightly)
SELECT id, queue_name, task_identifier, payload, run_at
FROM graphile_worker.jobs
WHERE run_at <= NOW()
AND locked_by IS NULL
AND attempts < max_attempts
ORDER BY priority DESC, run_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1;
When a worker claims a job, it locks the row. If the worker crashes, the lock is released automatically when its connection drops. No dead-letter queue configuration is needed: just max_attempts and exponential backoff.
Job results are kept in the same table. Completed jobs get deleted (or archived, if you configure it). Failed jobs increment their attempt counter and get rescheduled with backoff.
It's genuinely elegant. The entire queue state lives in a place you already know how to query, back up, and monitor.
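The retry schedule can be sketched as a pure function. Hedged: the exact formula below (exp of the attempt number, capped) is my reading of graphile-worker's exponential-backoff behavior, not an authoritative spec of its internals.

```typescript
// Sketch of exponential retry backoff in the style graphile-worker applies.
// Assumption: delay ≈ exp(min(10, attempts)) seconds, so retries spread out
// quickly but the wait is capped at exp(10) ≈ 22,026 s (~6.1 hours).
function retryDelaySeconds(attempts: number): number {
  return Math.exp(Math.min(10, attempts));
}

// First few retries: ~2.7 s, ~7.4 s, ~20 s, ~55 s, ...
const delays = [1, 2, 3, 4].map(retryDelaySeconds);
```

The useful property is that the schedule needs no configuration: a handful of quick retries for transient failures, then increasingly long waits until max_attempts is exhausted.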
Setting It Up in a Node.js/TypeScript Project
import { run, makeWorkerUtils } from 'graphile-worker'
// Register task handlers. (EmailPayload/AlertPayload/ReportPayload and the
// sendEmail/evaluateAlert/generateReport functions are app-defined.)
const runner = await run({
connectionString: process.env.DATABASE_URL,
concurrency: 5,
taskList: {
send_email: async (payload, helpers) => {
const { to, subject, body } = payload as EmailPayload
await sendEmail({ to, subject, body })
},
evaluate_alert: async (payload, helpers) => {
const { alertId } = payload as AlertPayload
await evaluateAlert(alertId)
},
generate_report: async (payload, helpers) => {
const { reportId } = payload as ReportPayload
await generateReport(reportId)
},
},
})
// Add jobs from anywhere in your app
const utils = await makeWorkerUtils({
connectionString: process.env.DATABASE_URL,
})
// One-off job
await utils.addJob('send_email', {
to: 'user@example.com',
subject: 'Your report is ready',
body: '...',
})
// Scheduled job (run at specific time)
await utils.addJob('generate_report', { reportId: 123 }, {
runAt: new Date('2026-03-15T09:00:00Z'),
})
// Recurring job. Note: graphile-worker has built-in crontab support (via the
// `crontab` option or a crontab file), which is the idiomatic way to do this.
// The pattern below emulates recurrence with jobKey replacement; cronNextRun
// is a hypothetical helper that computes the next Date matching a cron expression.
await utils.addJob('evaluate_alert', { alertId: 456 }, {
jobKey: 'alert-456-eval',
jobKeyMode: 'replace',
runAt: cronNextRun('*/5 * * * *'), // every 5 minutes
})
The API is intentionally minimal. No queue configuration, no connection pooling setup, no separate Redis client. You point it at your existing PostgreSQL connection string and start adding jobs.
BullMQ vs graphile-worker: The Real Comparison
Let me be direct about where each one wins.
Where graphile-worker wins
Zero additional infrastructure. If you're already on RDS PostgreSQL or Aurora, graphile-worker is free. No ElastiCache, no Redis on EC2, no second managed service to babysit.
Full SQL visibility. Your jobs are rows in a table. You can query them, join them against other tables, build admin UIs with a SELECT, and debug failures with psql. Compare this to inspecting BullMQ queues via the Bull Board UI or raw Redis commands.
-- Find all failed jobs in the last hour
SELECT task_identifier, payload, last_error, attempts
FROM graphile_worker.jobs
WHERE last_error IS NOT NULL
AND updated_at > NOW() - INTERVAL '1 hour'
ORDER BY updated_at DESC;
-- Count pending jobs by type
SELECT task_identifier, COUNT(*) as pending
FROM graphile_worker.jobs
WHERE locked_by IS NULL
GROUP BY task_identifier;
Transactional job enqueueing. This is the killer feature that BullMQ can't match. You can enqueue a job inside a database transaction, guaranteeing it only gets scheduled if the transaction commits:
await db.transaction(async (trx) => {
// Create the user
const user = await trx.insertInto('users').values(userData).returningAll().executeTakeFirstOrThrow()
// Enqueue welcome email — only runs if user creation succeeds
// Kysely-style raw SQL; graphile_worker.add_job is the SQL-level enqueue function
await sql`SELECT graphile_worker.add_job('send_welcome_email', ${JSON.stringify({ userId: user.id })}::json)`.execute(trx)
})
With BullMQ, you'd add the job after the transaction commits; if your process crashes between the commit and the queue.add() call, the job never gets scheduled. Not a common failure mode, but a real one. To achieve this guarantee with BullMQ, you'd have to implement the Transactional Outbox pattern: write the job to a database table first, then run a separate relay worker to move it into Redis. graphile-worker gives you this for free.
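For contrast, here is a minimal in-memory sketch of the Transactional Outbox machinery BullMQ users would need. Everything here (the `outbox` array standing in for a database table, the `queue` array standing in for Redis) is a hypothetical simplification to show the shape of the relay, not BullMQ's API:

```typescript
// Transactional outbox, simulated in memory. In a real system the outbox is a
// DB table written in the same transaction as your business data, and the
// relay moves rows into Redis (e.g. via queue.add) before marking them sent.
type OutboxEntry = { id: number; task: string; payload: unknown; sent: boolean };

const outbox: OutboxEntry[] = []; // stands in for the outbox table
const queue: OutboxEntry[] = [];  // stands in for the Redis-backed queue
let nextId = 1;

function enqueueInTransaction(task: string, payload: unknown): void {
  // Written atomically with the business data: commit implies the job exists.
  outbox.push({ id: nextId++, task, payload, sent: false });
}

function relayOnce(): number {
  // The relay worker: push unsent entries to the queue, then mark them sent.
  // Marking `sent` only after the queue write gives at-least-once delivery,
  // so the downstream handler must be idempotent.
  let moved = 0;
  for (const entry of outbox) {
    if (!entry.sent) {
      queue.push(entry);
      entry.sent = true;
      moved++;
    }
  }
  return moved;
}
```

graphile-worker collapses this whole apparatus into a single INSERT that commits or rolls back with the rest of your transaction.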
Operational simplicity for single-instance deployments. One less service to configure in Docker Compose, one less thing to include in your backup strategy, one less connection string to manage in environment variables.
Where BullMQ wins
Throughput. Redis is an in-memory data structure store purpose-built for this. BullMQ can process thousands of jobs per second. graphile-worker tops out around 100-200 jobs/second on typical PostgreSQL hardware before you start hitting lock contention. For most applications this is irrelevant. For high-volume pipelines (image processing, webhook delivery at scale, bulk email), it matters.
Advanced queue features. BullMQ has rate limiting, job priorities with fine-grained control, delayed jobs with millisecond precision, parent-child job dependencies, and repeatable jobs with complex scheduling. graphile-worker has most of these, but BullMQ's implementation is more complete and battle-hardened.
Real-time job events. BullMQ emits events (completed, failed, progress) via Redis pub/sub. You can build live job monitoring dashboards easily. With graphile-worker, you'd poll the jobs table.
Multi-instance horizontal scaling. BullMQ was designed from the ground up for multiple workers across multiple processes/machines, all sharing the same Redis. graphile-worker supports this too (multiple workers polling the same PostgreSQL), but the throughput ceiling is lower.
The honest performance numbers
On commodity hardware (the same AMD Ryzen 5 3600 used in my earlier benchmark article):
| Scenario | BullMQ | graphile-worker |
|---|---|---|
| Job enqueue rate | ~5,000/s | ~500/s |
| Job processing throughput (simple tasks) | ~2,000/s | ~100-200/s |
| Job processing throughput (I/O bound tasks) | ~500/s | ~100/s |
| Latency from enqueue to pickup | <10ms | <10ms (LISTEN/NOTIFY), 2s max (poll fallback) |
graphile-worker polls for new jobs at a configurable interval (default: every 2 seconds) and also uses LISTEN/NOTIFY for immediate pickup. For most background-job use cases (sending emails, generating reports, running scheduled checks), a couple of seconds of worst-case pickup latency is completely acceptable. For near-real-time processing where pickup latency matters, BullMQ wins.
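If the poll fallback matters for your workload, the interval is configurable when you start the runner. A config sketch, reusing the task list from earlier (the handlers themselves are elided):

```typescript
import { run } from 'graphile-worker'

// Tighten the poll fallback from the 2 s default to 500 ms. This only affects
// jobs that LISTEN/NOTIFY can't announce immediately (e.g. scheduled jobs
// whose run_at has just passed); freshly added jobs are picked up via NOTIFY.
const runner = await run({
  connectionString: process.env.DATABASE_URL,
  concurrency: 5,
  pollInterval: 500, // milliseconds
  taskList: { /* ...same handlers as above... */ },
})
```

Shorter intervals trade a little extra query load on PostgreSQL for lower worst-case latency.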
The AWS Decision Framework
This is where the choice becomes concrete.
Use graphile-worker when:
- You're already on RDS PostgreSQL or Aurora
- Your job volume is under ~100 jobs/second
- You have a single-instance deployment or modest horizontal scale
- You want transactional job enqueueing
- You want SQL-queryable job state
- You want to avoid ElastiCache costs
Use BullMQ when:
- You need >200 jobs/second sustained throughput
- You have real-time job progress tracking requirements
- You're scaling to many workers across many instances
- You already have ElastiCache for other purposes (caching, sessions)
- You need fine-grained rate limiting (e.g., "max 10 API calls/second to this external service")
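To make the rate-limiting point concrete: BullMQ workers accept a `limiter` option (max jobs per duration window). The enforcement is conceptually a token bucket, sketched below in plain TypeScript. This is an illustration of the concept, not BullMQ's implementation:

```typescript
// Minimal token bucket: allow at most `max` operations per `windowMs`.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private max: number, private windowMs: number, now = Date.now()) {
    this.tokens = max;
    this.lastRefill = now;
  }

  // Returns true if the call may proceed, false if it should wait.
  tryAcquire(now = Date.now()): boolean {
    if (now - this.lastRefill >= this.windowMs) {
      this.tokens = this.max; // refill the whole window's budget
      this.lastRefill = now;
    }
    if (this.tokens > 0) {
      this.tokens--;
      return true;
    }
    return false;
  }
}
```

With graphile-worker you'd have to build something like this yourself inside the task handler; with BullMQ it's a worker option.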
The cost math on AWS (rough estimates, us-east-1):
| Setup | Monthly cost (approx) |
|---|---|
| RDS PostgreSQL db.t3.medium | ~$30 |
| RDS PostgreSQL db.t3.medium + ElastiCache cache.t3.micro | ~$45 |
| Aurora PostgreSQL (serverless v2, min capacity) | ~$45 |
| Aurora PostgreSQL + ElastiCache cache.t3.micro | ~$60 |
If you're already paying for RDS and your jobs fit within graphile-worker's throughput ceiling, you're spending money on ElastiCache for infrastructure you don't need.
The Migration Path
If you're currently on BullMQ and considering a migration, it's straightforward. graphile-worker runs its schema migrations automatically on startup; you don't manage the tables yourself.
// Before: BullMQ
import { Queue } from 'bullmq'
const emailQueue = new Queue('email', { connection: redisConnection })
await emailQueue.add('send', { to, subject, body }, { attempts: 3, backoff: { type: 'exponential', delay: 1000 } })
// After: graphile-worker
const utils = await makeWorkerUtils({ connectionString: process.env.DATABASE_URL })
await utils.addJob('send_email', { to, subject, body }, { maxAttempts: 3 })
Per-job retry counts map directly (attempts becomes maxAttempts). The backoff schedule, by contrast, is no longer configured per job: graphile-worker applies its own exponential backoff. The task handler API is nearly identical.
One thing to handle explicitly: BullMQ lets you attach removeOnComplete and removeOnFail policies per job. graphile-worker always removes completed jobs (keeping failed ones with their error details). If you need a completed job archive, add a separate table and write to it from your task handlers.
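One way to get that archive is to wrap your handlers in a helper that records successful runs. A sketch with an in-memory archive standing in for the real table (`archive` and `wrapWithArchive` are illustrative names, not graphile-worker API):

```typescript
// Task handler shape, simplified from graphile-worker's (payload, helpers) form.
type Handler = (payload: unknown) => Promise<void>;

type ArchiveRow = { task: string; payload: unknown; completedAt: Date };
const archive: ArchiveRow[] = []; // stands in for an INSERT into a completed_jobs table

function wrapWithArchive(task: string, handler: Handler): Handler {
  return async (payload) => {
    await handler(payload); // a throw means graphile-worker retries; nothing is archived
    archive.push({ task, payload, completedAt: new Date() });
  };
}

// Usage sketch: taskList: { send_email: wrapWithArchive('send_email', sendEmailHandler) }
```

In production you'd replace the array push with an INSERT, ideally in the same transaction as the handler's own writes so the archive stays consistent.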
What I Actually Run in Production
The project I maintain ships two Docker Compose configurations: one with Redis + BullMQ for teams that need horizontal scaling, and one with graphile-worker only for single-instance deployments that want minimal operational overhead.
The Redis-free setup works well for SMB deployments: teams running their own observability stack on a single VPS or a modest EC2 instance. The full setup with Redis makes sense when you're running multiple backend instances behind a load balancer.
Both queue implementations share the same task handler interface. Switching between them is a config change, not a code change.
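The shared-interface idea can be as small as this. The names (`JobQueue`, `InMemoryQueue`) are illustrative, not from the project:

```typescript
// A backend-agnostic queue interface: the BullMQ and graphile-worker adapters
// both implement it, so application code never imports either library directly.
interface JobQueue {
  addJob(
    task: string,
    payload: unknown,
    opts?: { runAt?: Date; maxAttempts?: number },
  ): Promise<void>;
}

// Trivial in-memory adapter, handy in unit tests.
class InMemoryQueue implements JobQueue {
  jobs: Array<{ task: string; payload: unknown }> = [];
  async addJob(task: string, payload: unknown): Promise<void> {
    this.jobs.push({ task, payload });
  }
}
```

The BullMQ adapter would forward to queue.add(...), the graphile-worker adapter to workerUtils.addJob(...); application code only ever sees JobQueue, which is what makes switching a config change.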
Summary
PostgreSQL-based job queues aren't a new idea: Delayed Job in Ruby, django-q in Python, and several others have proven the pattern works. graphile-worker brings it to Node.js with a clean API and genuine PostgreSQL integration.
The choice isn't "which is better." It's "which matches your constraints." If you're paying for ElastiCache already, BullMQ is probably the right call. If you're running PostgreSQL and your job volume fits within graphile-worker's ceiling, eliminating Redis simplifies your stack without meaningful cost.
The SELECT ... FOR UPDATE SKIP LOCKED pattern is one of those PostgreSQL features that most developers don't know exists until they need it. Now you do.
The Redis-optional deployment mode ships in Logtide (since v0.5.0), a self-hosted observability platform built on Node.js + TimescaleDB. The docker-compose.simple.yml uses graphile-worker; the standard docker-compose.yml uses BullMQ.