DEV Community

Yury

The Other Half of the Dual-Write Problem: What Happens When a Job Finishes

In the first post of this series I talked about the dual-write problem on the producer side — the moment your app inserts a user and enqueues a "send welcome email" job, and one of those writes goes to Postgres while the other goes to Redis or Temporal. Two systems, no shared transaction, two ways to be inconsistent.

Queuert closes that gap by having startChain write the job as a row in the same transaction as the user insert. If the commit fails, the job never existed.

But the producer side is only half the story. The same dual-write hazard shows up again at the other end of the pipe — inside the attempt handler, when the job is actually running.

The Symmetric Problem

Same setup as the first post — Temporal activity, since that's the example we already have shared vocabulary for. The shape is the same in BullMQ, Sidekiq, Resque, SQS, anything where the queue's "completed" state lives outside your DB:

export async function chargePaymentActivity(orderId: number) {
  const order = await db.orders.findById(orderId);

  const { paymentId } = await stripe.charge(order.amount);

  await db.orders.update(order.id, { paymentId });

  return { paymentId };
  // ← Activity returns. The worker now reports completion to the
  //   Temporal cluster. Separate connection, separate transaction,
  //   separate system.
}

Four steps: read the order, charge Stripe, write payment_id back, return so Temporal records the activity as completed. Looks atomic. It isn't.

The seam is the last line. The DB update is one transaction against Postgres. The "mark this activity completed" write is a different transaction against the Temporal cluster's persistence layer, issued by the worker after the function returns. If the worker crashes between them — pod eviction, OOM, network blip, anything — Postgres has the payment_id, Temporal still thinks the activity is in flight, and the workflow re-runs the activity on the next attempt. Including the Stripe charge.
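
To make that window concrete, here is a minimal simulation, assuming three in-memory stand-ins for Stripe, the application DB, and the queue's completion store. Every name in it (ordersDb, stripeCharges, queueCompleted, runAttempt, runUntilAcked) is hypothetical and exists only for illustration; none of it is Temporal's API.

```typescript
// Hypothetical in-memory stand-ins for three independent systems.
type Order = { id: number; amount: number; paymentId: string | null };

const ordersDb = new Map<number, Order>([[1, { id: 1, amount: 500, paymentId: null }]]);
const stripeCharges: string[] = [];        // every charge the payment provider made
const queueCompleted = new Set<string>();  // what the queue believes is finished

// One activity attempt. `crashBeforeAck` models the worker dying after the
// DB write has committed but before the completion report reaches the queue.
function runAttempt(activityId: string, orderId: number, crashBeforeAck: boolean): "acked" | "crashed" {
  const order = ordersDb.get(orderId);
  if (!order) throw new Error(`order ${orderId} not found`);

  const paymentId = `pay_${stripeCharges.length + 1}`;
  stripeCharges.push(paymentId);   // write 1: external charge
  order.paymentId = paymentId;     // write 2: application DB

  if (crashBeforeAck) return "crashed";
  queueCompleted.add(activityId);  // write 3: the queue's own store
  return "acked";
}

// The queue retries any activity it has not seen complete.
function runUntilAcked(activityId: string, orderId: number, crashOn: Set<number>): number {
  let attemptNumber = 0;
  while (!queueCompleted.has(activityId)) {
    attemptNumber += 1;
    runAttempt(activityId, orderId, crashOn.has(attemptNumber));
  }
  return attemptNumber;
}

const attempts = runUntilAcked("charge-order-1", 1, new Set([1]));
console.log({ attempts, charges: stripeCharges.length, paymentId: ordersDb.get(1)?.paymentId });
```

With a crash on the first attempt, the run takes two attempts and produces two charges for one order. The second attempt also overwrites the first payment ID in the order row, so the first charge is no longer referenced anywhere in the application DB.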

The fixes everyone reaches for are familiar: pass an idempotency key to Stripe, add a WHERE payment_id IS NULL to the update, maybe wrap the handler in a "did I already do this work?" guard. All of that is real, all of it works, and all of it is defensive plumbing that exists purely because the DB and the queue can't agree on whether the work happened.
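
A sketch of what that plumbing tends to look like, again with hypothetical in-memory stand-ins. The charge function imitates a provider that deduplicates on an idempotency key, and setPaymentIdIfNull plays the role of an UPDATE ... WHERE payment_id IS NULL; neither is a real SDK call.

```typescript
// Hypothetical stand-ins; none of these names come from a real SDK.
type Order = { id: number; amount: number; paymentId: string | null };

const orders = new Map<number, Order>([[1, { id: 1, amount: 500, paymentId: null }]]);

// Fake payment provider that deduplicates on an idempotency key.
const chargesByKey = new Map<string, string>();
let totalCharged = 0;
function charge(amount: number, idempotencyKey: string): string {
  const existing = chargesByKey.get(idempotencyKey);
  if (existing !== undefined) return existing; // replay: same charge, no new money moved
  const paymentId = `pay_${chargesByKey.size + 1}`;
  chargesByKey.set(idempotencyKey, paymentId);
  totalCharged += amount;
  return paymentId;
}

// Models: UPDATE orders SET payment_id = $1 WHERE id = $2 AND payment_id IS NULL
// Returns the number of rows changed.
function setPaymentIdIfNull(orderId: number, paymentId: string): number {
  const order = orders.get(orderId);
  if (!order || order.paymentId !== null) return 0;
  order.paymentId = paymentId;
  return 1;
}

// With both guards in place the handler can run any number of times.
function chargePayment(orderId: number): { paymentId: string; rowsUpdated: number } {
  const order = orders.get(orderId);
  if (!order) throw new Error(`order ${orderId} not found`);
  const paymentId = charge(order.amount, `order-${order.id}`); // key is stable across retries
  const rowsUpdated = setPaymentIdIfNull(order.id, paymentId);
  return { paymentId, rowsUpdated };
}

const firstRun = chargePayment(1);  // original attempt
const replayRun = chargePayment(1); // re-run after a lost ack
console.log({ firstRun, replayRun, totalCharged });
```

The replay returns the same payment ID, updates zero rows, and moves no additional money. The price is that the handler now carries two guards whose only job is to compensate for the missing shared transaction.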

This is the same dual-write problem, mirrored. Producer side: did the row and the job both get created? Consumer side: did the result write and the ack both land?

Why Most Libraries Don't Fix This

The honest answer is that most job libraries can't fix this, because the library's completion write doesn't share a transaction with the writes your handler makes. The queue's notion of "this job is done" and your DB's notion of "this work happened" live in different transactions, and the library's answer is to hand you idempotency as homework.

Receipts, one library at a time:

BullMQ, Bee, Sidekiq, Resque, SQS, Cloud Tasks
Anything where the queue lives in Redis, RabbitMQ, or a managed broker. The worker does business work against your DB, then sends a separate ACK over the wire to the broker. Two systems, no shared transaction. The standard advice — "just make your handlers idempotent" — is real advice, but it's the library handing you back the dual-write problem and calling it a feature. You're now responsible for inventing an idempotency key for every write your handler performs.

Temporal
Temporal owns the truth, but in a third cluster. The activity finishes, reports success to the Temporal server, and that's a separate write from whatever your activity did to your application DB. The workflow engine reconciles the world by re-running activities until they look done — which is powerful, but it pushes you toward idempotency-by-default for every activity that touches state, and it's an entire service tier to operate.

graphile-worker
Postgres-native, which is the right substrate, but the task handler's withPgClient pulls a fresh pooled connection and the job's completion runs in a separate transaction issued by the worker pool after the task returns. Two transactions, no way to merge them. Better than Redis, but the dual-write surface is still there.

pg-boss
Worth a careful look, because its README markets "exactly-once job delivery" and it's easy to assume that means the consumer side is solved. It doesn't mean that. The phrase refers to SKIP LOCKED on the fetch path: two workers can't claim the same row. The standard work() worker still runs your handler, returns, and then pg-boss calls complete() against its own pool in a separate transaction. If your handler commits domain writes and the worker crashes before complete() lands, or the job's lease expires (expireInSeconds, default 15 minutes), the job is re-fetched and the handler runs again. Domain writes commit twice.

pg-boss v12.17 added a { db } option that's accepted on send(), complete(), fetch(), and friends, so the primitives for atomic completion exist. But the work() worker loop doesn't surface them. To actually fuse the handler's transaction with the completion transaction you have to opt out of work(), write your own fetch() + handler + complete(jobId, output, { db: tx }) loop, and re-implement the lease, retry, and backoff machinery yourself. Atomic consumer-side processing is possible, but it's a parallel API you build, not the shape of the supported worker.

Atomic Mode: When Dual-Write Never Happens

Queuert handlers come in two shapes. The simpler one — and the one most jobs should use — is atomic mode: the entire handler runs inside one transaction in your DB (Postgres or SQLite — Queuert supports both).

Here's reserve-inventory, which decrements stock, records the reservation, and queues the next step in the chain:

'reserve-inventory': {
  attemptHandler: async ({ job, complete }) =>
    complete(async ({ sql, continueWith }) => {
      const [item] = await sql`
        SELECT stock FROM items WHERE id = ${job.input.itemId} FOR UPDATE
      `;
      if (item.stock < 1) throw new Error("Out of stock");

      await sql`UPDATE items SET stock = stock - 1 WHERE id = ${job.input.itemId}`;
      await sql`
        INSERT INTO reservations (order_id, item_id)
        VALUES (${job.input.orderId}, ${job.input.itemId})
      `;

      return continueWith({
        typeName: "charge-payment",
        input: { orderId: job.input.orderId },
      });
    }),
}

One transaction wraps all of this:

  • The SELECT ... FOR UPDATE that locks the item row.
  • The UPDATE that decrements stock.
  • The INSERT into reservations.
  • The mark-as-completed write on the job row.
  • The insert of the charge-payment continuation.

If anything throws, nothing commits — the savepoint rolls back, the job stays pending, and the engine reschedules it with backoff. If the worker process dies mid-transaction, the DB rolls everything back on connection close. The next attempt starts from a clean slate.

There is no dual-write window because there is no second write. You never need an idempotency key for reserve-inventory, because the engine cannot leave the job half-done: the database won't let it. If your handlers are DB-bound (state machines, counters, ledger updates, workflow transitions), this is the mode you live in, and the consumer-side dual-write problem simply doesn't exist for your code.
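
The all-or-nothing behaviour can be modelled with a toy copy-on-commit store. This is a sketch of the property only, assuming a hypothetical transaction helper and State shape; it is not how Queuert or Postgres implement it.

```typescript
// Toy database with all-or-nothing commits. A hypothetical stand-in for a
// real transaction, not Queuert's implementation.
type Job = { type: string; status: "pending" | "completed" };
type State = {
  stock: number;
  reservations: { orderId: number; itemId: number }[];
  jobs: Job[];
};

let committed: State = {
  stock: 1,
  reservations: [],
  jobs: [{ type: "reserve-inventory", status: "pending" }],
};

// Run `work` against a private copy; publish the copy only if `work` returns.
function transaction(work: (draft: State) => void): boolean {
  const draft: State = JSON.parse(JSON.stringify(committed));
  try {
    work(draft);
  } catch {
    return false;      // rollback: the draft is simply discarded
  }
  committed = draft;   // commit: every change becomes visible at once
  return true;
}

function reserveInventory(draft: State, orderId: number, itemId: number): void {
  const job = draft.jobs.find((j) => j.type === "reserve-inventory");
  if (!job) throw new Error("job row missing");
  if (draft.stock < 1) throw new Error("Out of stock");
  draft.stock -= 1;                                                // domain write
  draft.reservations.push({ orderId, itemId });                    // domain write
  job.status = "completed";                                        // job completion
  draft.jobs.push({ type: "charge-payment", status: "pending" });  // continuation
}

// Attempt 1: the worker dies after the domain writes, before completion.
const crashedCommit = transaction((draft) => {
  draft.stock -= 1;
  draft.reservations.push({ orderId: 7, itemId: 42 });
  throw new Error("worker died mid-transaction");
});
const stockAfterCrash = committed.stock; // still 1: nothing leaked out

// Attempt 2: runs from a clean slate and commits everything together.
const retryCommit = transaction((draft) => reserveInventory(draft, 7, 42));
console.log({ crashedCommit, stockAfterCrash, retryCommit, committed });
```

After the failed attempt the committed state is untouched, so the retry sees stock of 1 and succeeds. After the retry, the stock decrement, the reservation, the completed job, and the pending charge-payment continuation all appear in the same commit.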

Staged Mode: When You Have to Call Out

Atomic mode works because nothing in the handler escapes the transaction. The moment you need to call an external system — Stripe, SES, an internal microservice — that stops being true. You can't hold a database transaction open across a Stripe API call (the lock contention alone would melt your DB), and you don't want network failures rolling back DB work that's already valid.

That's what staged mode is for: read in one transaction, do the external work outside any transaction, write the result in a second transaction.

'charge-payment': {
  attemptHandler: async ({ job, prepare, complete }) => {
    const order = await prepare({ mode: "staged" }, async ({ sql }) => {
      const [row] = await sql`SELECT * FROM orders WHERE id = ${job.input.orderId}`;
      return row;
    });

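    // External call: outside any transaction, must be idempotent.
    // The key is derived from the order alone so that every retry reuses it;
    // a key that included the attempt number would change on each retry
    // and let Stripe create a second charge.
    const { paymentId } = await stripe.charge(order.amount, {
      idempotencyKey: `charge-order-${order.id}`,
    });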

    return complete(async ({ sql, continueWith }) => {
      await sql`UPDATE orders SET payment_id = ${paymentId} WHERE id = ${order.id}`;
      return continueWith({
        typeName: "send-payment-receipt",
        input: { orderId: order.id, paymentId },
      });
    });
  },
}

The Stripe call still has to be idempotent — that's physics, not a library problem. You can't take back a network request, and any handler that calls an external system has to assume it might re-run after a retry. The idempotency key is unavoidable.

But everything that records what happened — the UPDATE, the job's transition to completed, the send-payment-receipt continuation — lands in one DB commit. There is no second system to ack. The job's "done" is a row update next to your UPDATE orders. Either both commit or neither does.

If the worker dies after Stripe returns but before complete finishes, nothing has been recorded: the order's payment_id is still null and the job is still pending. The next attempt reads the order again, re-runs Stripe with the same idempotency key (Stripe returns the original charge instead of creating a new one), and writes payment_id. One source of truth, no reconciliation, no "did the ack make it back?" branch.

The thing staged mode buys you is this: even though the external call breaks the single-transaction property, the recording of its result stays atomic with everything downstream of it.
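
That crash-and-retry path can be traced with a small model. It assumes a fake provider that deduplicates on a key derived from the order alone, and it represents the final commit as a single assignment; the names (Db, charge, attempt) are hypothetical and this is not Queuert's actual API.

```typescript
// Hypothetical in-memory model of staged mode; not Queuert's actual API.
type Db = {
  orderPaymentId: string | null;
  jobStatus: "pending" | "completed";
  continuations: string[];
};

let db: Db = { orderPaymentId: null, jobStatus: "pending", continuations: [] };

// Fake provider that deduplicates on the idempotency key.
const charges = new Map<string, string>();
function charge(idempotencyKey: string): string {
  const existing = charges.get(idempotencyKey);
  if (existing !== undefined) return existing;
  const paymentId = `pay_${charges.size + 1}`;
  charges.set(idempotencyKey, paymentId);
  return paymentId;
}

// One attempt: external call outside any transaction, then a single commit
// that records the result, completes the job, and queues the next step.
function attempt(orderId: number, crashBeforeCommit: boolean): "committed" | "crashed" {
  const paymentId = charge(`charge-order-${orderId}`); // stable across attempts
  if (crashBeforeCommit) return "crashed";             // nothing was written to db
  db = {
    orderPaymentId: paymentId,
    jobStatus: "completed",
    continuations: db.continuations.concat("send-payment-receipt"),
  };
  return "committed";
}

const firstAttempt = attempt(1, true);   // dies after the charge, before the commit
const secondAttempt = attempt(1, false); // retry reuses the key and commits
console.log({ firstAttempt, secondAttempt, chargeCount: charges.size, db });
```

The provider ends up with one charge, and the database moves from "nothing recorded" to "payment recorded, job completed, receipt queued" in a single step. There is no intermediate state in which the payment is recorded but the job is still pending.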

The Shape of the Guarantee

Put the two posts together and the picture is:

  • Producer side: "I created a user and a job." → One commit. Atomic.
  • Consumer side, atomic mode: "My job is DB-bound." → One commit. Reads, writes, completion, next chain step — all atomic. No idempotency keys needed.
  • Consumer side, staged mode: "My job calls an external system." → External call must be idempotent (always true, in every library). Recording the result is one commit, atomic with marking the job done and queuing the next step.

No Temporal cluster keeping a third ledger in sync. No outbox table to babysit. No "what if the ack didn't make it back?" branch in your retry logic. The job table lives next to your business tables in the same DB — Postgres or SQLite — and the database handles atomicity for both. That's the one thing relational databases are unambiguously good at.

The dual-write problem was never really about queues. It was about having two systems that needed to agree and no way to make them. Removing the second system — or, in pg-boss's case, opting into a wiring pattern that papers it over — is the only real fix. Queuert just makes it the default shape of the API rather than a feature you have to remember to turn on.

There's one more tier of side effect that doesn't fit either mode — the "fire after commit, but I'm fine if it gets dropped" kind. That's what transaction hooks are for, and they have an important non-guarantee worth understanding before you reach for them. That's the next post.
