Durable Workflows on Postgres: What "You Don't Need Temporal" Actually Buys You

A DBOS blog post titled "Postgres is all you need for durable execution" reached the Hacker News front page this week (306 points, 132 comments). The thread split the way these threads always do. One half read it as relief: no more standing up a separate workflow service only to make a multi-step job survive a crash. The other half read it as a warning: you are about to put orchestration load on the same database that already runs your app. Both reactions are correct, and which one applies to you depends on a question the headline skips.

I run a site that compares developer tools, so I read a lot of "X vs Y." Durable execution is one of the categories where the architecture choice matters far more than the brand on the box. Here is the grounded version, checked against DBOS's own docs, an independent write-up from Supabase, and the source libraries themselves.

What durable execution solves

Picture a function that runs five steps and crashes on step four. Steps one through three already happened, step four half-happened, and a naive retry runs all five again. If step two charged a customer, you charged them twice. Durable execution fixes this with checkpointing. Each step records its result before the next one starts, so a crash-and-restart resumes from the last completed step instead of the top. The promise is that a workflow finishes exactly once even if the machine running it dies in the middle. Payment capture, order fulfillment, and any other multi-step process where a double run costs real money are the use cases that make this worth the effort.
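To make the checkpointing idea concrete, here is a minimal sketch in Python. It is not how DBOS or any particular library implements it. I am assuming an in-memory SQLite database as a stand-in for Postgres so the example runs anywhere, and the table name, the run_step helper, and the five step names are all hypothetical.

```python
import json
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres. The "checkpoints"
# table and every function name here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints ("
    " workflow_id TEXT, step INTEGER, result TEXT,"
    " PRIMARY KEY (workflow_id, step))"
)


def run_step(workflow_id, step, fn):
    """Return the recorded result if the step already completed, else run it."""
    row = conn.execute(
        "SELECT result FROM checkpoints WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # replay: skip the work, reuse the result
    result = fn()
    with conn:  # commit the checkpoint before the next step starts
        conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (workflow_id, step, json.dumps(result)),
        )
    return result


charges = []  # stands in for the payment provider's ledger


def charge_customer():
    charges.append(42)
    return "charged"


def workflow(workflow_id, crash_at=None):
    steps = [
        lambda: "reserved",
        charge_customer,  # step two: the one that must not run twice
        lambda: "packed",
        lambda: "shipped",
        lambda: "notified",
    ]
    results = []
    for i, fn in enumerate(steps):
        if crash_at == i:
            raise RuntimeError("simulated crash")
        results.append(run_step(workflow_id, i, fn))
    return results


try:
    workflow("order-1", crash_at=3)  # dies before the fourth step runs
except RuntimeError:
    pass

final = workflow("order-1")  # the first three steps are replayed from checkpoints
```

On the second run the first three steps return their stored results without executing, so the customer is charged once even though the workflow function ran twice.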

How the Postgres approach works

The traditional model, used by systems like Temporal, runs a separate orchestrator. A central service holds the workflow state and hands tasks out to workers over the network. DBOS's argument is that the database can play that role directly, so you skip the extra service entirely.

Mechanically, the workers checkpoint each step straight to Postgres as they go, rather than reporting back to an orchestrator. State lives in ordinary tables. When a server crashes, another server reads the latest checkpoint and resumes the workflow from its last completed step. Duplicate work gets caught the way Postgres catches any duplicate: with integrity constraints. If two workers grab the same workflow, the second one fails the constraint on checkpoint and backs off.
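The constraint-based duplicate detection can be sketched in a few lines. Again this is an illustration under assumptions, not DBOS's schema: SQLite stands in for Postgres, and the step_checkpoints table and try_checkpoint function are names I made up. The same primary-key idea carries over to Postgres, where the failed insert surfaces as a unique-violation error.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres; table and function
# names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE step_checkpoints ("
    " workflow_id TEXT NOT NULL, step INTEGER NOT NULL, worker TEXT NOT NULL,"
    " PRIMARY KEY (workflow_id, step))"
)


def try_checkpoint(worker, workflow_id, step):
    """Return True if this worker recorded the step, False if another already did."""
    try:
        with conn:  # rolls back automatically if the insert fails
            conn.execute(
                "INSERT INTO step_checkpoints VALUES (?, ?, ?)",
                (workflow_id, step, worker),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation: the step is already checkpointed, so back off.
        return False


first = try_checkpoint("worker-a", "order-1", 0)
second = try_checkpoint("worker-b", "order-1", 0)
owner = conn.execute("SELECT worker FROM step_checkpoints").fetchone()[0]
```

The database decides the winner, so the workers need no coordination protocol of their own.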

The exactly-once claim is the part worth understanding, because it is where the database earns its keep. When a step is itself a database operation, DBOS commits the step's writes and its checkpoint inside the same Postgres transaction. The work and the record of the work commit together or not at all, which closes the gap where a step succeeds but the system forgets it did. Systems that keep workflow state in a separate service typically settle for at-least-once execution and ask you to make every step idempotent yourself.
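Here is a small sketch of what "the work and the record of the work commit together" looks like. It is my own illustration rather than DBOS code: SQLite stands in for Postgres, the accounts and checkpoints tables are hypothetical, and the crash is simulated with an exception. A real crash kills the process, and the database discards the uncommitted transaction in the same way.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE checkpoints ("
    " workflow_id TEXT, step INTEGER, PRIMARY KEY (workflow_id, step))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()


def debit_step(workflow_id, step, account, amount, crash_before_checkpoint=False):
    """Debit an account and record the checkpoint in one transaction."""
    done = conn.execute(
        "SELECT 1 FROM checkpoints WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if done:
        return  # already committed together with its checkpoint: do nothing
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account),
        )
        if crash_before_checkpoint:
            raise RuntimeError("simulated crash")
        conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (workflow_id, step))


def balance(account):
    return conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account,)
    ).fetchone()[0]


try:
    debit_step("order-1", 0, "alice", 30, crash_before_checkpoint=True)
except RuntimeError:
    pass
after_crash = balance("alice")  # the debit rolled back along with the checkpoint

debit_step("order-1", 0, "alice", 30)  # recovery: the step runs and commits
debit_step("order-1", 0, "alice", 30)  # a later retry finds the checkpoint
final_balance = balance("alice")
```

There is no state in which the debit is committed but the checkpoint is missing, which is the whole point of sharing the transaction.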

Two smaller properties fall out of putting everything in one database. Queues become a SELECT ... FOR UPDATE SKIP LOCKED off a table, so workers cooperatively pull jobs without a separate broker. And observability is plain SQL: "how many workflows are stuck on step three" is a query you already know how to write, not a dashboard you have to buy.
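Both properties are easier to judge with the SQL in front of you. The sketch below is mine, with hypothetical jobs and workflows tables that do not match any library's real schema. The claim query is Postgres syntax and is only shown as text, since SQLite has no SKIP LOCKED. The observability query is plain enough to run against an in-memory SQLite stand-in.

```python
import sqlite3

# Postgres-only SQL, shown as text and NOT executed here (SQLite has no
# SKIP LOCKED). Hypothetical "jobs" table. Each worker locks one pending row;
# rows already locked by other workers are skipped instead of waited on.
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""

# "How many workflows are stuck on each step" as an ordinary query.
# Hypothetical "workflows" table with a current_step column.
STUCK_BY_STEP_SQL = """
SELECT current_step, COUNT(*)
FROM workflows
WHERE status = 'running'
GROUP BY current_step
ORDER BY current_step
"""

# Assumption: in-memory SQLite stands in for Postgres for the runnable part.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflows (id TEXT PRIMARY KEY, status TEXT, current_step INTEGER)"
)
conn.executemany(
    "INSERT INTO workflows VALUES (?, ?, ?)",
    [
        ("a", "running", 3),
        ("b", "running", 3),
        ("c", "running", 1),
        ("d", "done", 5),
    ],
)
conn.commit()

stuck = conn.execute(STUCK_BY_STEP_SQL).fetchall()  # list of (step, count) rows
```

In a real deployment you would run the claim query inside a transaction from each worker and add a time filter to the second query, so that "stuck" means running for longer than you expect.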

What you give up

None of this is free, and the honest framing matters more than the pitch. Putting workflow state in your primary database means your workflow load and your application load share one Postgres instance. For modest volumes that is a feature, because it is one fewer system to run, back up, and reason about. At high throughput it becomes a coupling you have to plan around, since a workflow spike now competes with user queries for the same connections and I/O.

There is a subtler caveat too. The exactly-once guarantee is exactly-once for the database write, not for the outside world. A step that calls a third-party API and crashes after the call but before its checkpoint commits can still fire that call twice on recovery. The transaction protects the part Postgres owns; anything reaching past the database still wants an idempotency key. That is true of every durable execution system, and it is the line item people skip when they read "exactly once."
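A sketch of the idempotency-key pattern, under loud assumptions: FakePaymentAPI is a hypothetical stand-in for a provider that deduplicates on a key you supply, and deriving the key from the workflow ID and step number is one common choice, not something the sources prescribe. What matters is that the key is identical on every retry of the same step.

```python
import hashlib


class FakePaymentAPI:
    """Hypothetical third-party API that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount actually charged
        self.calls = 0     # how many requests reached the API

    def charge(self, idempotency_key, amount):
        self.calls += 1
        # setdefault stores the amount only the first time a key is seen;
        # repeat requests with the same key return the original charge.
        return self.charges.setdefault(idempotency_key, amount)


def idempotency_key(workflow_id, step):
    """Derive a key that is stable across retries of the same workflow step."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()


api = FakePaymentAPI()
key = idempotency_key("order-1", 1)

api.charge(key, 30)  # first attempt; imagine the process dies before the checkpoint
api.charge(key, 30)  # recovery re-runs the step and sends the same key

total_charged = sum(api.charges.values())
```

The request goes out twice, but the provider records one charge. If the API you call does not support idempotency keys, the durable execution layer cannot add that guarantee for you.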

Temporal earns its complexity at the other end of the curve. It is language-agnostic by design, it separates orchestration from your data store so the two scale independently, and it carries years of production mileage at serious scale. The price is real: you rearchitect your code around a worker and a client, you run or pay for the Temporal service, and you operate one more piece of infrastructure. DBOS asks for none of that, in exchange for living inside the database you already have.

The decision that counts

So the calibration comes out clean, the way it does for most infrastructure choices. If you already run Postgres, your workflow volume is bounded by your application's volume, and you want the smallest number of moving parts, the database-backed approach removes an entire service from your diagram and gives you transactional exactly-once on your database writes without extra infrastructure. That description covers a large share of real applications.

Reach for a dedicated orchestrator when the workflow load is genuinely independent of your app, like fan-outs that dwarf your user traffic or pipelines that need to scale on their own schedule, or when you need polyglot workers and a managed control plane more than you need fewer systems. The mistake in both directions is identical: picking the architecture for the logo instead of for where your load really comes from.

The libraries that implement the Postgres approach, DBOS among them, ship for TypeScript, Go, Python, and Java, so you can adopt durable execution without leaving your existing stack. If you are wiring it into a Node or Go service, I keep practical setup kits for both: a Full-Stack TypeScript Cookbook and a Go Development Cookbook, each with CLAUDE.md rules, editor hooks, and project patterns already wired in. They will not write your workflows for you, but they cut the first-day setup so you reach the interesting part faster.

Durable execution used to mean adopting a platform. The Postgres approach reframes it as a library decision, which is a much smaller bet and a much easier one to reverse. Start with the one workflow that currently ruins your day when it half-completes, make it durable, and watch what a crash does to it now. The behavior you see on your own code settles the architecture argument faster than any blog post, this one included.

Sources

DBOS, "Postgres Is All You Need for Durable Execution": https://www.dbos.dev/blog/postgres-is-all-you-need-for-durable-execution
DBOS Architecture (official docs): https://docs.dbos.dev/architecture
Supabase, "Running Durable Workflows in Postgres using DBOS": https://supabase.com/blog/durable-workflows-in-postgres-dbos

I build small reliability tools on exactly this kind of Postgres-first plumbing (cron dead-man's-switch monitoring, request tracing). More writeups and the tools at tools.thesoundmethod.me.