Taras H
Background Jobs in Production: The Problems Queues Don’t Solve

Moving work out of the request path is one of the most common ways to
speed up backend systems.

Emails are sent asynchronously.
Invoices are generated by workers.
Webhooks are delivered through queues.
Image processing and indexing run in background jobs.

Latency improves immediately.

But many teams eventually notice something strange in production:

  • duplicate emails appear
  • retries increase system load
  • dead-letter queues slowly grow
  • workflows technically "succeed"... but the outcome is wrong

The queue is healthy.
Workers are running.

Yet the system behaves incorrectly.

Moving work to the background changes where failures happen.
It does not remove them.

This post is a shorter version of a deeper engineering write-up
originally published on CodeNotes.


The Assumption Behind Background Jobs

Background job systems are usually introduced with a simple expectation:

If a job fails, the queue will retry it until it succeeds.

Queues also provide useful features:

  • buffering traffic spikes
  • independent worker scaling
  • retry handling
  • isolation from request latency

Because of this, async processing often feels safer than synchronous
execution.

But that assumption depends on something rarely guaranteed in
production:

that running a job multiple times produces the same result as running
it once.


What "At-Least-Once Delivery" Actually Means

Most queue systems guarantee at-least-once delivery.

That means the system will try hard to deliver a message - even if it
results in duplicate execution.

It does not mean:

  • the job runs exactly once
  • side effects happen exactly once
  • messages are processed in order

In other words, the queue protects against message loss, not
duplicate work.
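A toy queue makes this concrete. The sketch below (all names illustrative) removes a message only after the handler acknowledges it by returning, so a handler that performs its side effect and then crashes before acking sees the same message redelivered:

```typescript
// A toy at-least-once queue: a message is removed only after a successful
// (acked) handler run, so a crash after the side effect means redelivery.
const queue: string[] = ["msg-1"];
const log: string[] = [];

function deliver(handler: (m: string) => void) {
  const m = queue[0];
  if (m === undefined) return;
  try {
    handler(m);
    queue.shift(); // ack: remove the message only after success
  } catch {
    // no ack: the message stays in the queue for redelivery
  }
}

let firstAttempt = true;
deliver((m) => {
  log.push(m); // the side effect happens...
  if (firstAttempt) {
    firstAttempt = false;
    throw new Error("crash before ack"); // ...then the worker dies
  }
});
deliver((m) => log.push(m)); // redelivery runs the side effect again
// log now contains the same message twice: no message was lost,
// but the work was duplicated.
```

No message was lost, which is exactly what the queue promised; the duplicate side effect is the part the queue never promised to prevent.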

Once duplicate execution becomes possible, correctness has to come from
somewhere else.

Usually that means:

  • idempotent handlers
  • deduplication keys
  • explicit state transitions
  • retry boundaries

Without those protections, the infrastructure is reliable while the
workflow is not.
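A deduplication key is the simplest of these protections. This is a minimal sketch, assuming an in-memory `Set` stands in for a database table with a unique constraint on the job id:

```typescript
// Minimal deduplication guard: the job id is recorded once the work
// succeeds, and any later delivery of the same id is skipped.
const processed = new Set<string>();

function handleOnce(jobId: string, work: () => void) {
  if (processed.has(jobId)) return; // duplicate delivery: skip side effects
  work();
  processed.add(jobId); // record completion only after the work succeeds
}

let sent = 0;
handleOnce("job-42", () => { sent += 1; });
handleOnce("job-42", () => { sent += 1; }); // redelivery is a no-op
// sent === 1: the side effect ran exactly once
```

Note the remaining gap: a crash between `work()` and the record write still causes a re-run, so in a real system the completion record must commit atomically with the side effect (or the handler must be idempotent anyway).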


A Classic Failure Scenario

Consider a worker that sends a payment receipt:

await emailClient.send(...)   // side effect: the email goes out

// ← a crash here leaves the email sent but never recorded

await db.payment.update({
  receiptSentAt: new Date()
})

If the worker crashes after sending the email but before updating the
database, the job will be retried.

Now the customer receives two receipts.

The queue behaved exactly as designed.

But the business outcome is incorrect.
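One common fix is to reverse the order: claim the work in the database first, then perform the side effect. A hedged sketch, with an in-memory record standing in for the payment row (all names illustrative):

```typescript
type Payment = { id: string; receiptSentAt: Date | null };

let emailsSent = 0;
const sendEmail = () => { emailsSent += 1; };

// Claim the row before the side effect. A retry after a crash mid-send
// sees the claim and skips, so duplicate emails become impossible; the
// trade-off is that a crash between claim and send drops the email instead.
function sendReceipt(p: Payment) {
  if (p.receiptSentAt !== null) return; // already claimed: retry is a no-op
  p.receiptSentAt = new Date(); // real DB: UPDATE ... WHERE receiptSentAt IS NULL
  sendEmail();
}

const payment: Payment = { id: "p1", receiptSentAt: null };
sendReceipt(payment); // first delivery sends the email
sendReceipt(payment); // retry skips the send
// emailsSent === 1
```

Whether "maybe zero emails" beats "maybe two emails" is a product decision; for receipts it usually is, and the full article covers patterns (like the transactional outbox) that close the gap entirely.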


Why Production Systems Break Here

Background job systems introduce two things that make correctness
harder.

1. Duplicate execution

Workers can crash after performing side effects but before acknowledging
the message.

2. Time separation

Jobs may execute minutes or hours after they were created, when system
state has already changed.

Because of this, retries often interact with partial state or
outdated context.
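The time-separation problem suggests a rule of its own: a handler should re-read current state at run time instead of trusting the payload it was enqueued with. A small sketch, assuming an in-memory map stands in for the orders table:

```typescript
type Order = { id: string; status: "open" | "cancelled" };
const orders = new Map<string, Order>([
  ["o1", { id: "o1", status: "open" }],
]);

const shipped: string[] = [];

// The job payload only carries an id; the handler checks the *current*
// state of the order, which may have changed since the job was enqueued.
function shipOrder(orderId: string) {
  const order = orders.get(orderId);
  if (!order || order.status !== "open") return; // state moved on: skip
  shipped.push(orderId);
}

// The order is cancelled between enqueue time and execution time.
orders.get("o1")!.status = "cancelled";
shipOrder("o1"); // the stale job becomes a safe no-op
```

The job "succeeds" either way; the guard is what keeps a delayed execution from acting on a world that no longer exists.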


The Design Rule Most Teams Learn Later

A background job should never be treated as a one-time action.

It should be treated as a replayable command.

Every handler should be safe if it runs:

  • twice
  • later than expected
  • after partial completion
  • out of order

If those conditions break the workflow, retries will eventually corrupt
system behavior.
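Explicit state transitions are one way to get all four properties at once. A sketch of the idea (types and names are illustrative): the handler only moves a record forward from an expected state, so a replayed, late, or out-of-order command is rejected rather than re-applied.

```typescript
type InvoiceState = "pending" | "generated" | "sent";
const invoice = { id: "i1", state: "pending" as InvoiceState };

// Apply a transition only if the record is in the expected source state.
// Running the same command twice, or after a later step, changes nothing.
function transition(from: InvoiceState, to: InvoiceState): boolean {
  if (invoice.state !== from) return false; // replay or out-of-order: rejected
  invoice.state = to;
  return true;
}

transition("pending", "generated");  // applied
transition("pending", "generated");  // duplicate replay: rejected
// invoice.state is "generated", no matter how many times the job ran
```

In a real database this is the same `UPDATE ... WHERE state = 'pending'` compare-and-set, with the affected row count telling the handler whether it won the transition.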


The Monitoring Trap

Teams often monitor queue infrastructure:

  • queue depth
  • worker throughput
  • retry counts
  • dead-letter volume

Those metrics matter - but they don't answer questions like:

  • Did users receive duplicate emails?
  • Did a payment create multiple ledger entries?
  • Did downstream systems receive conflicting updates?

A queue dashboard can look completely healthy while the workflow is
incorrect.


Read the Full Production Breakdown

This post only covers the core failure patterns.

The full article explains:

  • why retries can make outages worse
  • how idempotent background jobs are designed
  • why dead-letter queues silently grow
  • what production teams monitor beyond queue depth
  • a practical rollout checklist for new background jobs

👉 Full article:
https://codenotes.tech/blog/background-jobs-in-production
