The first background job I ever shipped was a setTimeout inside an Express route handler. It sent a welcome email two seconds after signup so the user would see a "you're in" screen before the inbox notification hit. It worked on my laptop. It worked on staging. It worked in production for about three weeks, until the day my server restarted in the middle of a deploy and forty-one in-flight timers vanished with the process. Forty-one people got no welcome email. I found out from a support ticket.
That was the day I learned what a job queue is actually for. Not throughput. Not scaling. Durability. The promise that if the work is acknowledged, the work will eventually happen, even if the box catches fire in the next second.
This is the post I wish someone had handed me before that deploy. It covers what background jobs really are, the moment a config-file solution stops being enough, the four lanes of 2026 background work, and the setup I run today on a project doing about three hundred thousand jobs a month with one engineer and no Kubernetes.
What A Background Job Actually Is
A background job is any unit of work that should not happen in the request your user is waiting on. That is the whole definition. Everything else is engineering on top.
The three reasons to move work into the background are the same three reasons every job queue post starts with, and they are worth being honest about because most of the time only one of them applies to a solo developer.
The first is latency. Your user clicked a button, you can return the response in fifty milliseconds, but the work behind the button takes eight seconds. You move the work to the background and return the fifty-millisecond response.
The second is fan-out. One event triggers ten or a hundred or a thousand downstream actions. A new signup triggers a welcome email, a Slack notification, a CRM sync, an analytics event, an audit log, a referral check, a feature flag bootstrap. Doing them inline turns one cheap request into one slow request.
The third is durability. Some piece of work has to happen, eventually, even if the process that scheduled it dies before it finishes. Charging a customer's card on a recurring schedule. Reconciling against a Stripe webhook. Sending a one-week-later onboarding email. Cleaning up an abandoned file upload. The work is not urgent. It is important.
For indie developers the third reason is almost always the one that actually matters. Latency you can paper over with a loading state. Fan-out you can do in series and still hit a two-second total. But the day a process restart eats a charge attempt or a payment-failed email is the day you realise that durability is not a "nice to have" you bolt on later.
The Moment You Outgrow setTimeout
setTimeout and friends are not background jobs. They are scheduled in-process timers. If the process dies, they die with it. If the process is on Vercel or any serverless platform, the process may die the moment your response returns, which means your "background" job never runs at all.
The same is true of setImmediate, process.nextTick, "fire and forget" promises, and any pattern that lives inside the request lifecycle. They are fine for one thing only: work that is genuinely best-effort, where losing it is acceptable. Logging to a metric backend is best-effort. Sending a payment confirmation is not.
You have outgrown setTimeout the moment any of the following is true:
- The work is connected to money or to a record the user can see.
- The work runs on a schedule that does not match a request (every hour, every day).
- The work has to retry if a downstream service is down.
- The work can take longer than the longest acceptable request timeout for your runtime.
You also have outgrown it the moment you deploy to a serverless runtime. Vercel functions, Cloudflare Workers, AWS Lambda, anything that can freeze or tear down the process the moment your response is sent. There is no "background" in serverless. There is only inside-the-request and not-at-all. Anything that needs to happen after the response has to live in something else.
That something else is a queue.
What A Queue Actually Buys You
A queue is a list of work items that survives the death of the process that added them. That is the entire core feature. Everything else (retries, backoff, dead letter queues, scheduling, fan-out, concurrency control, priority lanes) is built on top of that one property.
The minimum useful queue has four pieces.
A producer that puts work on the queue. Usually this is your web request handler. The handler decides "this work needs to happen," writes a row to a queue, and returns immediately to the user.
A storage layer that holds the work durably. A Redis list, a Postgres table, an SQS queue, a managed queue service. The shape does not matter. The promise does: once the producer is told "I have it," the work will not be lost.
A consumer (or worker) that pulls work off the queue and runs it. Usually a long-lived process, or in serverless land, a function that is invoked when a job appears.
A retry policy that decides what happens when a job fails. Did the worker crash? Did the downstream API return 500? Is this a permanent failure or a transient one? Without a retry policy the queue is just a slightly fancier inbox.
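If it helps to see those four pieces without any library at all, here is a bare-bones sketch on a single Postgres table, using node-postgres. The table, column, and handler names are illustrative, and a real queue would also need a visibility timeout so a job claimed by a crashed worker gets picked up again.

import { Pool } from "pg";

// Assumed table (illustrative):
//   CREATE TABLE jobs (id BIGSERIAL PRIMARY KEY, type TEXT NOT NULL,
//     payload JSONB NOT NULL, attempts INT NOT NULL DEFAULT 0,
//     started_at TIMESTAMPTZ, done_at TIMESTAMPTZ, last_error TEXT);
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Producer: the request handler records the work and returns immediately.
async function enqueue(type: string, payload: object) {
  await pool.query("INSERT INTO jobs (type, payload) VALUES ($1, $2::jsonb)", [
    type,
    JSON.stringify(payload),
  ]);
}

// Consumer: claim one unstarted job. SKIP LOCKED keeps concurrent workers safe.
async function workOnce(handlers: Record<string, (p: unknown) => Promise<void>>) {
  const { rows } = await pool.query(
    `UPDATE jobs SET started_at = now()
     WHERE id = (SELECT id FROM jobs
                 WHERE started_at IS NULL AND attempts < 5
                 ORDER BY id LIMIT 1
                 FOR UPDATE SKIP LOCKED)
     RETURNING id, type, payload`
  );
  if (rows.length === 0) return;
  const job = rows[0];
  try {
    await handlers[job.type](job.payload);
    await pool.query("UPDATE jobs SET done_at = now() WHERE id = $1", [job.id]);
  } catch (err) {
    // Retry policy, crude version: release the claim and count the attempt.
    await pool.query(
      "UPDATE jobs SET started_at = NULL, attempts = attempts + 1, last_error = $2 WHERE id = $1",
      [job.id, String(err)]
    );
  }
}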
Every queue product on the market is some combination of those four pieces with different defaults and a different bill. The decision is mostly about which combination makes sense at the size you are now, and which makes sense at the size you will be in a year. Those are different answers more often than the marketing pages admit.
The Four Lanes Of Background Work In 2026
There is no single best queue for indie developers. There are four lanes, each with a different cost profile and a different operational story, and the right choice depends on which lane your work actually lives in.
Lane 1: Postgres As The Queue
You already have Postgres. If you are doing fewer than a few thousand jobs an hour and you can tolerate polling latency on the order of seconds, the simplest queue you can run is a table in your existing database.
Three options worth knowing in this lane:
pgmq is a Postgres extension that gives you SQS-style queues backed by tables. Send a message, receive it with a visibility timeout, archive it on success. Production-grade, used in actual stacks, supported by Neon and Supabase out of the box.
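For a feel of the flow, here is what that looks like from Node, going through node-postgres and the function names in the pgmq docs. Treat it as a sketch: the queue name and payload are illustrative.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

await pool.query("SELECT pgmq.create('emails')"); // one-time queue setup
await pool.query("SELECT pgmq.send('emails', $1::jsonb)", [JSON.stringify({ userId: 42 })]); // producer

// Consumer: read one message; it stays invisible to other readers for 30 seconds.
const { rows } = await pool.query("SELECT * FROM pgmq.read('emails', 30, 1)");
if (rows.length > 0) {
  // ...do the work, then acknowledge by archiving...
  await pool.query("SELECT pgmq.archive('emails', $1::bigint)", [rows[0].msg_id]);
}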
graphile-worker is a job runner for Node that uses Postgres LISTEN/NOTIFY rather than relying on polling. Sub-second job latency, retries built in, cron-style scheduled jobs, transactional enqueue. The transactional bit is what nobody else gives you: you can enqueue a job in the same transaction as the database write that produced it, so either both happen or neither does.
River is the Go equivalent of graphile-worker, also Postgres-backed, also using LISTEN/NOTIFY, also transactional. If your worker is in Go, this is the obvious pick.
The reason Postgres-as-queue is the right default for most indie projects is the same reason a config-file feature flag is the right default for most indie projects (the same instinct I wrote about in feature flags for solo developers): you already have the infrastructure, the operational story is "the database is up," and you do not pay for a second service that can also be down at three in the morning.
The transactional enqueue is genuinely a killer feature. If you have ever had a webhook handler that wrote to a database, then crashed before it could fan out to a queue, then woke up to a customer whose record exists but whose downstream side effects never happened, you know what I am talking about. Postgres queues make that bug structurally impossible.
The limit is throughput. Postgres can do tens of thousands of jobs an hour comfortably, and beyond that it depends on your hardware and your job shape. You will know when you outgrow this lane because your dashboard will tell you. You will not outgrow it on day one.
Lane 2: Redis And A Worker (BullMQ)
BullMQ is the modern Node version of the original Bull. It uses Redis as the queue storage, supports priority queues, repeatable jobs, rate limiting, and a usable dashboard.
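A minimal sketch of both sides, assuming a local Redis and a sendEmail helper of your own:

import { Queue, Worker } from "bullmq";
import { sendEmail } from "./email"; // your own mailer (assumption)

const connection = { host: "127.0.0.1", port: 6379 };
const emails = new Queue("emails", { connection });

// Producer: retry and backoff policy are set per job.
await emails.add(
  "welcome",
  { userId: 42 },
  { attempts: 5, backoff: { type: "exponential", delay: 30_000 } }
);

// Consumer: a long-lived worker process, run somewhere that stays up.
const worker = new Worker(
  "emails",
  async (job) => {
    await sendEmail(job.data);
  },
  { connection }
);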
This was the default indie-developer queue for most of the late 2010s and early 2020s. It is still good. The reason to pick it in 2026 is that you already have Redis running for caching or session storage and adding a queue on top is essentially free.
The reason not to pick it is that running your own Redis adds an operational burden that Postgres does not. Redis is fine. Redis going down at 3am is also fine if you remember the password and have monitoring. If you do not, this is one more thing to keep alive.
If you are on Upstash, the managed Redis story is genuinely good and BullMQ works against it with no surprises. If you are running your own Redis on Hetzner, the math gets less friendly.
The real reason BullMQ has lost some ground in 2026 is not the queue itself. It is that "I need a worker process running 24/7 to poll Redis" stops fitting nicely once your app is on Vercel or Cloudflare or any serverless platform. You can still run the worker somewhere (Railway, Fly, a small VPS), but the operational story stops being "one box."
Lane 3: Managed Serverless Queues
The lane that did not really exist five years ago is now the default for half the indie projects I see in 2026. Managed queue services that are explicitly designed for serverless platforms and bill per job.
Inngest is the one most indie devs reach for first. You write your job as a function in your app. Inngest invokes it as a webhook into your serverless platform. Retries, scheduling, fan-out, and step functions are first-class. The pricing has a real free tier and the dev experience inside Next.js or Hono is excellent.
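A sketch of the shape, with the event name, function id, and sendEmail helper as placeholders rather than anything official:

import { Inngest } from "inngest";
import { sendEmail } from "./email"; // your own mailer (assumption)

const inngest = new Inngest({ id: "my-app" });

// The job: Inngest invokes this over HTTP when the event arrives.
export const sendWelcome = inngest.createFunction(
  { id: "send-welcome-email", retries: 5 },
  { event: "user/signup" },
  async ({ event, step }) => {
    // Each step is retried independently and its result is memoised.
    await step.run("send-email", () => sendEmail(event.data));
  }
);

// The producer, anywhere in your app:
await inngest.send({ name: "user/signup", data: { email: "a@example.com" } });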
Vercel Queues is the new entrant from Vercel, in public beta in 2026, built on Fluid Compute. At-least-once delivery, dead letter queues, exponential backoff, fan-out, all wired into the same project as your functions. If your stack is already on Vercel this is the path of least resistance.
Upstash QStash is a delivery service that calls a webhook on a schedule or with delay. The mental model is "scheduled HTTP request." Cheap, simple, works against any HTTP endpoint anywhere. Perfect for "send this email in three days."
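The mental model really is that small. A sketch, with the endpoint URL and the three-day delay as placeholders:

import { Client } from "@upstash/qstash";

const qstash = new Client({ token: process.env.QSTASH_TOKEN! });

// "Send this email in three days": QStash calls the endpoint when the delay elapses.
await qstash.publishJSON({
  url: "https://your-app.example.com/api/jobs/onboarding-email",
  body: { userId: 42 },
  delay: 3 * 24 * 60 * 60, // seconds
});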
Trigger.dev sits between Inngest and a workflow engine. Long-running jobs are first-class, you can pause and resume, retries are deep, observability is the best of the bunch. The pricing scales with usage and can get expensive at high volume, but the dev experience is hard to beat for AI-heavy workloads.
The common thread across all four is that you do not run a worker process. The service calls your serverless function when there is work to do. That fits the modern indie stack the way the older Redis-worker model fit the Heroku stack.
The cost is that you are now coupled to a service. If Inngest is down, your jobs do not run. If your webhook signature scheme is wrong, your jobs run but fail silently. The operational story is "two vendors" instead of "one database." That trade is usually fine. Sometimes it is not.
Lane 4: Cron Plus A Webhook
The lane that the engineering blogs do not write about because it is too embarrassing, but that ships an absurd amount of indie SaaS, is the "cron job hits a webhook" pattern.
Vercel Cron, GitHub Actions cron, a crontab on a small box, cloud scheduler, any of them. The cron fires every minute or every hour. The webhook reads a "things to do" table from your database and processes them. There is no queue server. There is no worker. There is a table and a scheduled HTTP request.
This is honestly the right answer for a surprisingly large class of indie SaaS work. Daily digest emails, weekly report generation, abandoned cart reminders, anything that is "do this every X minutes." It is not durable in the queue sense (a failed run will not retry until the next cron tick), but for work that is naturally idempotent and runs on a clock anyway, it is fine.
The trap is reaching for cron when what you actually need is a queue. If the work is event-driven (a user did a thing), cron is wrong. If the work is time-driven (it is 9am Monday), cron is right. The shape of the trigger tells you the shape of the tool.
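For the time-driven case, the whole pattern fits in one handler. A sketch in the shape of a Next.js route handler, with the table, columns, and sendDigestEmail helper as assumptions:

import { Pool } from "pg";
import { sendDigestEmail } from "./email"; // your own helper (assumption)

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// The cron hits this endpoint; the endpoint drains whatever is due.
export async function GET() {
  const { rows } = await pool.query(
    `SELECT id, payload FROM scheduled_work
     WHERE run_after <= now() AND done_at IS NULL
     ORDER BY run_after
     LIMIT 100`
  );

  for (const row of rows) {
    await sendDigestEmail(row.payload); // make this idempotent: a failed run is retried next tick
    await pool.query("UPDATE scheduled_work SET done_at = now() WHERE id = $1", [row.id]);
  }

  return new Response(`processed ${rows.length}`);
}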
My Default Pick And Why
For a new project in 2026 I default to graphile-worker on top of the Postgres instance I am already paying for, unless one of three things is true.
The first is that the project is on Vercel and I do not want to run a worker process anywhere. Then it is Inngest or Vercel Queues. The choice between them is a coin flip. I lean Inngest because it has been around longer and the docs are better, but Vercel Queues is fine and the integration is tighter.
The second is that the project is doing more than ten thousand jobs an hour from day one (rare for indie work, but it happens). Then it is BullMQ on managed Redis, or it is one of the managed serverless options if I can swallow the per-job pricing.
The third is that the work is purely scheduled (no event triggers, just "every day at midnight"). Then it is Vercel Cron or a GitHub Actions cron hitting a webhook. No queue at all.
The reason graphile-worker is the default is the transactional enqueue. The same database connection that writes the "subscription created" row also writes the "send welcome email" job, in the same transaction. If anything in the transaction fails, both rows roll back, and your user does not get a half-created subscription with a phantom welcome email queued behind it. That property is a free correctness upgrade that no managed queue service can give you, because managed services live in a different database than your business logic.
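In practice that is one transaction doing both writes. A sketch using graphile-worker's SQL-level enqueue (the graphile_worker.add_job function) with node-postgres; the table and task names are illustrative:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function createSubscription(userId: number, plan: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(
      "INSERT INTO subscriptions (user_id, plan) VALUES ($1, $2)",
      [userId, plan]
    );
    // Same transaction: if the insert rolls back, the job rolls back with it.
    await client.query(
      "SELECT graphile_worker.add_job('send_welcome_email', json_build_object('userId', $1::int))",
      [userId]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}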
It is also free in the literal sense. You are already paying for Postgres. You do not pay anyone else.
Idempotency Is The Whole Game
The single most important thing to know about background jobs is that they will run more than once. Not in theory. In practice, on Tuesday, with a customer watching.
Every queue worth using delivers at-least-once. Some advertise exactly-once. They are lying or they are defining the word weirdly. Network partitions exist. Worker crashes exist. The window between "I finished the job" and "I told the queue I finished the job" exists. Jobs run twice. Sometimes three times.
Your job code has to assume this. The technical word is idempotency. The plain English version is "running this job twice has to produce the same result as running it once." That is the same instinct that drives idempotent Stripe webhook handlers, and the implementation pattern is the same.
The cleanest way to make a job idempotent is to give it a stable key and refuse to do the work twice for the same key. A welcome email job for a user gets the key welcome_email:user_42. Before doing the work, the job inserts that key into an idempotency_keys table with a UNIQUE constraint. If the insert succeeds, do the work. If the insert fails on the unique constraint, the job has already run, so log and exit.
CREATE TABLE idempotency_keys (
  key TEXT PRIMARY KEY,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Inside the job:
INSERT INTO idempotency_keys (key) VALUES ($1)
ON CONFLICT DO NOTHING
RETURNING key;
If the RETURNING row is empty, the job already ran. Bail. If it returns the key, do the work, and let the row stay. You can clean the table up with a cron that deletes rows older than thirty days, because by then the same job retrying is no longer realistic.
The version of this that does not work, and that I have seen ship more than once, is "check the table, then do the work." The check-then-act pattern has a race condition wide enough to drive a billing bug through. Two workers see the key is missing at the same time, both decide to do the work, both do it. The unique constraint plus ON CONFLICT is the only version that survives concurrent workers.
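Wrapped in a helper, the whole insert-first pattern is a few lines. A sketch: the pool is a node-postgres Pool, and the key and job body in the usage comment are placeholders.

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Insert-first: the unique constraint is the lock, not a prior SELECT.
async function runOnce(key: string, doWork: () => Promise<void>) {
  const { rows } = await pool.query(
    `INSERT INTO idempotency_keys (key) VALUES ($1)
     ON CONFLICT DO NOTHING
     RETURNING key`,
    [key]
  );
  if (rows.length === 0) {
    console.log(`skipping ${key}: already ran`);
    return;
  }
  await doWork();
}

// Inside a job handler:
// await runOnce(`welcome_email:user_${payload.userId}`, () => sendEmail(payload));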
Every job you write should have an idempotency key. Every one. The day you skip it is the day you discover what your queue actually delivers.
Retries, Backoff, And The Dead Letter Queue
Every queue has retry semantics. Most have sensible defaults. Almost none have defaults that match your specific workload, which is why the retry settings are the first thing I tune on a new project.
Three knobs matter.
Max attempts. Five is a reasonable default. More than ten is almost always wrong, because if something has failed ten times it is not a transient error, it is a real bug, and retrying it again is just spamming your logs.
Backoff strategy. Exponential is almost always the right answer. Try at 0s, 30s, 2min, 10min, 1hr. Do not use a fixed delay. Do not retry immediately. A failed downstream API is not going to be healthy again in five seconds.
The dead letter queue. When a job exhausts its retries, where does it go? The answer "into the void" is wrong. The answer "an error line in a log file you will never read" is also wrong. The right answer is a dead letter table or queue that you actually monitor, with a Slack alert when something lands there.
Here is what the knobs look like on the worker I actually run, graphile-worker, as a sketch (sendEmail is your own mailer):

import { run } from "graphile-worker";
import { sendEmail } from "./email";

// A long-lived worker process pointed at the same Postgres database.
const runner = await run({
  connectionString: process.env.DATABASE_URL,
  concurrency: 5,
  taskList: {
    send_welcome_email: async (payload, helpers) => {
      await sendEmail(payload);
    },
  },
});

// graphile-worker retries failed jobs with exponential backoff by default.
// The attempt limit is set per job when you enqueue it, for example:
//   await addJob("send_welcome_email", { userId }, { maxAttempts: 5 });
The most useful pattern is a dashboard view or a small admin page that lists dead letter jobs with their last error and a "retry" button. Once you have that, the dead letter queue stops being a graveyard and starts being a triage list. That is the difference between an indie SaaS that loses occasional jobs and one that quietly fixes them within the day.
Observability For Jobs
The thing nobody tells you about background jobs is that they are invisible by default. Your user does not see them. Your logs scroll past them. Your error tracker does not pick up async failures unless you wire it in. The first sign of a broken job is usually a customer support ticket two days later, which is exactly the wrong direction for that information to flow.
Three things to wire up on day one.
A jobs dashboard. Whatever queue you pick, set up the dashboard. Inngest has a hosted one. graphile-worker can use graphile-worker-pro for a UI, or you can write a small Astro page that reads the jobs table directly. BullMQ has Bull Board. If you cannot see what is queued, what is running, and what is failing, you cannot fix anything.
Error tracking inside the worker. Sentry, Highlight, whatever you use for the web app. Wire it into the worker process. The "job failed" errors should look exactly like the "request failed" errors in your error tracker, with the job name as the "endpoint" and the payload as the "params." This is a five-minute setup that turns invisible failures into visible ones.
Slack alerts on dead letter. When a job lands in the dead letter queue, post to Slack. Not for every failure (that is noise), but for every job that has exhausted its retries (that is a signal). The body of the message should include the job name, the last error, and a link to the dashboard. This is the single most useful piece of observability I have on background work, and it has caught real bugs within an hour of them shipping.
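The alert itself is tiny. A sketch, assuming a SLACK_WEBHOOK_URL environment variable and a hook in your worker that fires once a job has exhausted its retries:

// Post to a Slack incoming webhook when a job lands in the dead letter queue.
async function alertDeadLetter(jobName: string, lastError: string, jobId: string) {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `Job permanently failed: ${jobName}\n${lastError}\nhttps://your-app.example.com/admin/jobs/${jobId}`,
    }),
  });
}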
The same instinct that drives production observability for solo developers applies here in miniature: you do not need a full APM stack to see what is going wrong, you just need the few signals that turn unknown unknowns into known ones.
When To Graduate To Something Heavier
The point of all of this is that you do not need to graduate for a long time. Most indie SaaS projects can run their entire first year on Postgres-as-a-queue or one managed serverless service, and never feel the limit.
You have graduated when one of three things is true.
The first is sustained throughput. If you are running more than fifty thousand jobs an hour for hours at a time, Postgres as a queue starts to feel the pressure. Move to a dedicated Redis queue or to a managed service that does not bill per job at that volume.
The second is latency requirements you cannot meet. If "send this email" needs to happen within two seconds of the trigger, and your Postgres polling loop runs every five, you have outgrown polling. LISTEN/NOTIFY solves most of this, but at very high event rates you eventually want a real broker.
The third is workflow complexity. If your "job" is actually a fifteen-step process with branching paths, human-in-the-loop approvals, and pauses that last days, you have outgrown a queue and you want a workflow engine. Trigger.dev, Inngest's step functions, Temporal, Vercel Workflow. That is a different tool and a different post.
You will know when you are there. You will not be there for a while.
What I'd Tell Past Me
Most of what I had to learn about background jobs the hard way is in the boring sections above. The unsexy stuff. The job table, the unique constraint, the retry policy, the dead letter monitor. None of it would have made a good blog post when I was rushing to ship. All of it would have saved me from the bugs that did.
If I could hand my past self one paragraph it would be this. Pick the lane that matches your platform: Postgres if you have a real database, managed serverless if you are all-in on Vercel, BullMQ if you are running your own infra and have Redis anyway. Make every job idempotent on day one, with a unique key in a table, not a check-then-act. Set up the dead letter monitor before you set up the first real job. Wire Slack alerts to it. Then go ship.
The queue is not the interesting part. The interesting part is the product you can build because work happens reliably, in the background, even when the server restarts in the middle of a deploy and forty-one timers vanish into the void.
You only learn that the first time you ship without a queue. Hopefully now you do not have to.