
The dual-write problem (and a Postgres-native fix for Node.js background jobs)

What Temporal taught me about state

I've spent years building on Temporal — workflows-as-code, durable execution, the whole gospel. The thing Temporal got right, the thing that made it hard for me to go back to ordinary Redis-backed queues, is this: workflow state is durable at every step. It lives in a transactional database. There's no in-memory state that disappears on a crash, no message that's "in flight" without being persisted, no step that completes without its result being safely written.

But Temporal's durability story has a quiet caveat: it only protects state inside a workflow. The moment you have an application database alongside the workflow service — and most projects do — you're back to two systems.

Consider this signup handler:

import { Client } from "@temporalio/client";
import { sendWelcomeEmail } from "./workflows";

const temporal = new Client();

// db is the application's ORM / query-builder handle (illustrative)
await db.transaction(async (tx) => {
  const user = await tx.users.create({ name: "Alice", email: "alice@example.com" });

  await temporal.workflow.start(sendWelcomeEmail, {
    taskQueue: "main",
    workflowId: `welcome-${user.id}`,
    args: [{ userId: user.id, email: user.email }],
  });
});

Innocent enough. Insert a user, start a workflow, life moves on.

Except sometimes Alice never gets the email. Other times Alice gets a welcome email pointing to a userId that doesn't exist. Why?

The dual-write problem

The application database and the workflow service are two different systems. The transaction in the snippet above is a Postgres transaction. temporal.workflow.start() writes to the Temporal cluster's own database. Neither system knows about the other.

So: tx.users.create() succeeds. workflow.start() succeeds. Then commit fails — constraint violation, network blip, anything. You now have a workflow running for a user that doesn't exist.

Reverse it: move workflow.start() after the commit, the obvious dodge for the first failure mode. Now tx.users.create() commits and workflow.start() fails. You have a user with no welcome workflow, and no record of the missing one.

Reverse it again: the start request reaches Temporal, but the connection drops before the handle comes back. The workflow is actually running. You retry: with a deterministic workflowId like welcome-${user.id}, the retry collides and fails as already-started; without one, you start a second workflow.
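
With a deterministic workflowId, at least the collision is detectable. A sketch of the retry path, assuming the handler above (everything else elided):

import { WorkflowExecutionAlreadyStartedError } from "@temporalio/client";

try {
  await temporal.workflow.start(sendWelcomeEmail, {
    taskQueue: "main",
    workflowId: `welcome-${user.id}`, // deterministic: a retry collides here
    args: [{ userId: user.id, email: user.email }],
  });
} catch (err) {
  // A previous attempt already started this workflow; safe to treat as done.
  if (!(err instanceof WorkflowExecutionAlreadyStartedError)) throw err;
}

That dedupes the retry, but it does nothing for the first two failure modes: the user row and the workflow still commit in two unrelated systems.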

This is the dual-write problem, and it shows up any time job or workflow state lives in a different system from your business state. BullMQ-on-Redis is the most obvious case, because Redis isn't transactional with your Postgres; Temporal hides the problem better, but it doesn't eliminate it.

Temporal's official answer is to make the workflow service the source of truth — keep workflow-relevant state inside Temporal, lean on Sagas for cross-service consistency, and treat your app DB as a read model fed by workflow events. That works, and it's the right answer for Temporal-native projects. But it's a real architectural commitment: a service tier with its own cluster, history/matching/frontend services, a Postgres or Cassandra of its own, and SDKs to keep in sync.

The standard workaround: transactional outbox

The textbook fix in the BullMQ-style world is the transactional outbox pattern, sketched in code after the list. Instead of writing directly to your queue, you:

  1. Insert a row into an outbox table inside the same transaction as your business write.
  2. A separate process polls the outbox and forwards rows to your queue.
  3. Once acknowledged, the outbox row is deleted (or marked sent).
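
In code, the shape is roughly this. A sketch, assuming a hypothetical outbox table with findUnsent/markSent helpers and a BullMQ-style queue (db and queue come from your app's setup):

// 1) Enqueue inside the same transaction as the business write.
await db.transaction(async (tx) => {
  const user = await tx.users.create({ name: "Alice", email: "alice@example.com" });
  await tx.outbox.create({
    topic: "send-welcome-email",
    payload: { userId: user.id, email: user.email },
  });
});

// 2) A separate poller forwards committed rows to the real queue.
async function pollOutbox() {
  const rows = await db.outbox.findUnsent({ limit: 100 });
  for (const row of rows) {
    // Reusing the outbox row id as the jobId keeps redelivery idempotent
    // if we crash between add() and markSent().
    await queue.add(row.topic, row.payload, { jobId: String(row.id) });
    await db.outbox.markSent(row.id);
  }
}
setInterval(pollOutbox, 1000);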

This works. It also means you're now maintaining:

  • An outbox table and its schema migrations.
  • A poller process and its supervisor.
  • An idempotency layer (your queue can receive the same outbox row twice if the poller crashes between forwarding and marking sent).
  • Monitoring for outbox lag.
  • A retry policy.

For a small team shipping background jobs, that's a lot of infrastructure to maintain just to avoid orphaned data. And once you're maintaining it, you have to ask: what is Redis actually buying me here that's worth this much glue code?

A smaller-shape question

Temporal's answer to the dual-write problem, as above, is "run the whole workflow service." That's the right answer when you need what that service tier buys you: workflows-as-code, replay semantics, polyglot SDKs, a managed control plane, history that survives any single component dying.

For a lot of Node.js projects, that's overkill. You just want background jobs with the same durability guarantees Temporal taught you to expect — atomic with your business writes, no orphaned state, no in-flight messages that vanish on a crash.

What if that durability instinct could come in a smaller shape? What if the job queue were your existing Postgres — queue.add() becoming an INSERT into a jobs table inside your transaction?

That's the design behind Queuert, an open-source library I wrote to scratch exactly that itch.

What it looks like

Same signup handler, rewritten with Queuert (Kysely + Postgres in this example):

import { withTransactionHooks } from "queuert";

// client is a configured Queuert client (setup omitted here)

await withTransactionHooks(async (transactionHooks) =>
  db.transaction().execute(async (tx) => {
    const user = await tx
      .insertInto("users")
      .values({ name: "Alice", email: "alice@example.com" })
      .returningAll()
      .executeTakeFirstOrThrow();

    return client.startChain({
      db: tx,
      transactionHooks,
      typeName: "send-welcome-email",
      input: { userId: user.id, email: user.email },
    });
  }),
);

Two key things:

  1. startChain takes the same transaction tx as the user insert. They're literally part of the same Postgres transaction. If the transaction rolls back, the job is never created. If the transaction commits, the job is created exactly once. No outbox needed.
  2. transactionHooks defers side effects until after commit. Things like notifying workers (so they pick up the new job immediately) are buffered during the transaction and only fire if commit succeeds. If commit fails, the hooks are discarded. The general shape of the pattern is sketched below.
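
The hook mechanism itself is a generic buffer-then-flush pattern. A minimal sketch of the idea, not Queuert's actual internals:

// Collect side-effect callbacks during the transaction; flush them only if
// the wrapped function (and therefore the COMMIT) resolved successfully.
async function withAfterCommitHooks<T>(
  run: (afterCommit: (fn: () => Promise<void>) => void) => Promise<T>,
): Promise<T> {
  const buffered: Array<() => Promise<void>> = [];
  // run is expected to wrap the DB transaction, as in the example above.
  const result = await run((fn) => buffered.push(fn));
  // Reached only after a successful commit; a rollback throws past this
  // point and the buffered hooks are simply discarded.
  for (const fn of buffered) await fn();
  return result;
}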

That's the whole story for atomic consistency. No second system, no forwarder, no retry loop — just a BEGIN/COMMIT wrapped around your business write and an INSERT into the queuert_jobs table.
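
Stripped of the library, the committed unit of work really is one transaction. A conceptual sketch with node-postgres (Queuert owns the real jobs schema; the queuert_jobs columns here are illustrative):

import { Pool } from "pg";

const pool = new Pool();
const conn = await pool.connect();
try {
  await conn.query("BEGIN");
  const { rows: [user] } = await conn.query(
    "INSERT INTO users (name, email) VALUES ($1, $2) RETURNING id",
    ["Alice", "alice@example.com"],
  );
  await conn.query(
    "INSERT INTO queuert_jobs (type_name, input) VALUES ($1, $2)",
    ["send-welcome-email", { userId: user.id }],
  );
  await conn.query("COMMIT"); // both rows land, or neither does
} catch (err) {
  await conn.query("ROLLBACK");
  throw err;
} finally {
  conn.release();
}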

What you stop needing

  • Redis as a state store. You can still use Redis for low-latency wake-up notifications (Pub/Sub-style), but it's optional and stateless. Jobs are rows in your database. If Redis disappears, workers fall back to polling and nothing is lost.
  • An outbox table. The jobs table is the outbox.
  • A separate forwarder process. Workers query the jobs table directly with SELECT … FOR UPDATE SKIP LOCKED — a pattern Postgres has had since 9.5; the claim query is sketched after this list.
  • Two backup strategies, two replication setups, two monitoring stacks. One database, one set of operational tooling.
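
That SKIP LOCKED claim looks roughly like this (column names illustrative; pool is a pg Pool as in the sketch above):

// Atomically claim one pending job. Concurrent workers skip rows another
// worker has already locked instead of blocking on them.
const { rows: [job] } = await pool.query(`
  UPDATE queuert_jobs
     SET status = 'running', locked_at = now()
   WHERE id = (
         SELECT id
           FROM queuert_jobs
          WHERE status = 'pending'
          ORDER BY created_at
          LIMIT 1
            FOR UPDATE SKIP LOCKED
       )
  RETURNING *
`);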

What you gain (besides the consistency)

If jobs are first-class rows in your database, a lot of things become easier:

  • You can JOIN on them. "Find all pending welcome emails for users created in the last hour" is a query, not a custom analytics pipeline; see the sketch after this list.
  • You can use your existing audit logging. Every job state transition is a database write. If you already track changes via triggers or CDC, you get job history for free.
  • You can run integration tests against a real database. No mocking the queue. The job table is just another table.
  • Schema migrations stay in one place. Job table changes ride along with your application's migrations.
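
For instance, the first bullet's query is plain SQL (column names illustrative, reusing the pg Pool from earlier):

// "All pending welcome emails for users created in the last hour."
const { rows } = await pool.query(`
  SELECT u.email, j.id AS job_id, j.created_at AS enqueued_at
    FROM queuert_jobs j
    JOIN users u ON u.id::text = j.input->>'userId'
   WHERE j.type_name = 'send-welcome-email'
     AND j.status = 'pending'
     AND u.created_at > now() - interval '1 hour'
`);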

What this isn't

This isn't a Temporal replacement, and it isn't trying to be. If you need workflows-as-code, replay semantics, deterministic execution, polyglot SDKs, or a managed control plane — use Temporal. That's still what I reach for on bigger projects, and it's worth every operational dollar when the workflow surface justifies it.

It also isn't going to outscale dedicated queue infrastructure at the high end. Postgres handles a lot — the published benchmarks show ~21k chain creations per second batched and ~770 jobs/sec processed atomically on a Dockerized Postgres against a single worker — but if you're at the scale where a job queue is its own infrastructure tier with a dedicated team, dedicated tools probably make sense.

The target is the in-between: small-to-medium Node.js projects where Temporal is too heavy and BullMQ reintroduces the dual-write problem.

Try it

Queuert is MIT-licensed and pre-1.0. It ships with Postgres and SQLite adapters; in-process, Redis (including Cluster), NATS, and Postgres LISTEN/NOTIFY options for the optional notify layer; an embeddable web dashboard; and OpenTelemetry tracing across job chains.

A follow-up post on the TypeScript story — how job chains stay type-checked end-to-end across continueWith, branching, loops, and fan-in blockers — is up next.
