Taras H
Background Jobs in Production: The Problems Queues Don’t Solve

Moving work out of the request path is one of the most common ways to
speed up backend systems.

Emails are sent asynchronously.
Invoices are generated by workers.
Webhooks are delivered through queues.
Image processing and indexing run in background jobs.

Latency improves immediately.

But many teams eventually notice something strange in production:

  • duplicate emails appear
  • retries increase system load
  • dead-letter queues slowly grow
  • workflows technically "succeed"... but the outcome is wrong

The queue is healthy.
Workers are running.

Yet the system behaves incorrectly.

Moving work to the background changes where failures happen.
It does not remove them.

This post is a shorter version of a deeper engineering write-up
originally published on CodeNotes.


The Assumption Behind Background Jobs

Background job systems are usually introduced with a simple expectation:

If a job fails, the queue will retry it until it succeeds.

Queues also provide useful features:

  • buffering traffic spikes
  • independent worker scaling
  • retry handling
  • isolation from request latency

Because of this, async processing often feels safer than synchronous
execution.

But that assumption depends on something rarely guaranteed in
production:

that running a job multiple times produces the same result as running
it once.


What "At-Least-Once Delivery" Actually Means

Most queue systems guarantee at-least-once delivery.

That means the system will try hard to deliver a message - even if it
results in duplicate execution.

It does not mean:

  • the job runs exactly once
  • side effects happen exactly once
  • messages are processed in order

In other words, the queue protects against message loss, not
duplicate work.
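A toy queue makes this concrete. The sketch below (all names illustrative) removes a message only after the handler acknowledges it by returning, so a handler that performs its side effect and then crashes before acking sees the same message redelivered:

```typescript
// A toy at-least-once queue: a message is removed only after a successful
// (acked) handler run, so a crash after the side effect means redelivery.
const queue: string[] = ["msg-1"];
const log: string[] = [];

function deliver(handler: (m: string) => void) {
  const m = queue[0];
  if (m === undefined) return;
  try {
    handler(m);
    queue.shift(); // ack: remove the message only after success
  } catch {
    // no ack: the message stays in the queue for redelivery
  }
}

let firstAttempt = true;
deliver((m) => {
  log.push(m); // the side effect happens...
  if (firstAttempt) {
    firstAttempt = false;
    throw new Error("crash before ack"); // ...then the worker dies
  }
});
deliver((m) => log.push(m)); // redelivery runs the side effect again
// log now contains the same message twice: no message was lost,
// but the work was duplicated.
```

No message was lost, which is exactly what the queue promised; the duplicate side effect is the part the queue never promised to prevent.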

Once duplicate execution becomes possible, correctness has to come from
somewhere else.

Usually that means:

  • idempotent handlers
  • deduplication keys
  • explicit state transitions
  • retry boundaries

Without those protections, the infrastructure is reliable while the
workflow is not.
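A deduplication key is the simplest of these protections. This is a minimal sketch, assuming an in-memory `Set` stands in for a database table with a unique constraint on the job id:

```typescript
// Minimal deduplication guard: the job id is recorded once the work
// succeeds, and any later delivery of the same id is skipped.
const processed = new Set<string>();

function handleOnce(jobId: string, work: () => void) {
  if (processed.has(jobId)) return; // duplicate delivery: skip side effects
  work();
  processed.add(jobId); // record completion only after the work succeeds
}

let sent = 0;
handleOnce("job-42", () => { sent += 1; });
handleOnce("job-42", () => { sent += 1; }); // redelivery is a no-op
// sent === 1: the side effect ran exactly once
```

Note the remaining gap: a crash between `work()` and the record write still causes a re-run, so in a real system the completion record must commit atomically with the side effect (or the handler must be idempotent anyway).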


A Classic Failure Scenario

Consider a worker that sends a payment receipt:

await emailClient.send(...)   // side effect: the email goes out

// ← a crash here leaves the email sent but never recorded

await db.payment.update({
  receiptSentAt: new Date()
})

If the worker crashes after sending the email but before updating the
database, the job will be retried.

Now the customer receives two receipts.

The queue behaved exactly as designed.

But the business outcome is incorrect.
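One common fix is to reverse the order: claim the work in the database first, then perform the side effect. A hedged sketch, with an in-memory record standing in for the payment row (all names illustrative):

```typescript
type Payment = { id: string; receiptSentAt: Date | null };

let emailsSent = 0;
const sendEmail = () => { emailsSent += 1; };

// Claim the row before the side effect. A retry after a crash mid-send
// sees the claim and skips, so duplicate emails become impossible; the
// trade-off is that a crash between claim and send drops the email instead.
function sendReceipt(p: Payment) {
  if (p.receiptSentAt !== null) return; // already claimed: retry is a no-op
  p.receiptSentAt = new Date(); // real DB: UPDATE ... WHERE receiptSentAt IS NULL
  sendEmail();
}

const payment: Payment = { id: "p1", receiptSentAt: null };
sendReceipt(payment); // first delivery sends the email
sendReceipt(payment); // retry skips the send
// emailsSent === 1
```

Whether "maybe zero emails" beats "maybe two emails" is a product decision; for receipts it usually is, and the full article covers patterns (like the transactional outbox) that close the gap entirely.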


Why Production Systems Break Here

Background job systems introduce two things that make correctness
harder.

1. Duplicate execution

Workers can crash after performing side effects but before acknowledging
the message.

2. Time separation

Jobs may execute minutes or hours after they were created, when system
state has already changed.

Because of this, retries often interact with partial state or
outdated context.
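The time-separation problem suggests a rule of its own: a handler should re-read current state at run time instead of trusting the payload it was enqueued with. A small sketch, assuming an in-memory map stands in for the orders table:

```typescript
type Order = { id: string; status: "open" | "cancelled" };
const orders = new Map<string, Order>([
  ["o1", { id: "o1", status: "open" }],
]);

const shipped: string[] = [];

// The job payload only carries an id; the handler checks the *current*
// state of the order, which may have changed since the job was enqueued.
function shipOrder(orderId: string) {
  const order = orders.get(orderId);
  if (!order || order.status !== "open") return; // state moved on: skip
  shipped.push(orderId);
}

// The order is cancelled between enqueue time and execution time.
orders.get("o1")!.status = "cancelled";
shipOrder("o1"); // the stale job becomes a safe no-op
```

The job "succeeds" either way; the guard is what keeps a delayed execution from acting on a world that no longer exists.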


The Design Rule Most Teams Learn Later

A background job should never be treated as a one-time action.

It should be treated as a replayable command.

Every handler should be safe if it runs:

  • twice
  • later than expected
  • after partial completion
  • out of order

If those conditions break the workflow, retries will eventually corrupt
system behavior.
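Explicit state transitions are one way to get all four properties at once. A sketch of the idea (types and names are illustrative): the handler only moves a record forward from an expected state, so a replayed, late, or out-of-order command is rejected rather than re-applied.

```typescript
type InvoiceState = "pending" | "generated" | "sent";
const invoice = { id: "i1", state: "pending" as InvoiceState };

// Apply a transition only if the record is in the expected source state.
// Running the same command twice, or after a later step, changes nothing.
function transition(from: InvoiceState, to: InvoiceState): boolean {
  if (invoice.state !== from) return false; // replay or out-of-order: rejected
  invoice.state = to;
  return true;
}

transition("pending", "generated");  // applied
transition("pending", "generated");  // duplicate replay: rejected
// invoice.state is "generated", no matter how many times the job ran
```

In a real database this is the same `UPDATE ... WHERE state = 'pending'` compare-and-set, with the affected row count telling the handler whether it won the transition.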


The Monitoring Trap

Teams often monitor queue infrastructure:

  • queue depth
  • worker throughput
  • retry counts
  • dead-letter volume

Those metrics matter - but they don't answer questions like:

  • Did users receive duplicate emails?
  • Did a payment create multiple ledger entries?
  • Did downstream systems receive conflicting updates?

A queue dashboard can look completely healthy while the workflow is
incorrect.


Read the Full Production Breakdown

This post only covers the core failure patterns.

The full article explains:

  • why retries can make outages worse
  • how idempotent background jobs are designed
  • why dead-letter queues silently grow
  • what production teams monitor beyond queue depth
  • a practical rollout checklist for new background jobs

👉 Full article:
https://codenotes.tech/blog/background-jobs-in-production
