Paulo Victor Leite Lima Gomes

Posted on May 29

postgres is all you need for durable execution

#postgres #architecture #workflows #engineering

There is a corner of the internet that keeps re-discovering the same truth: Postgres can do more than most people give it credit for.

The latest iteration is a post called "Just Use Postgres for Durable Workflows" that made the rounds on HN this week—227 points, the kind of number that means a lot of people have been thinking about this quietly.

The argument is simple: durable execution does not need a dedicated engine. You can build it on Postgres with transactions, advisory locks, LISTEN/NOTIFY, and a bit of discipline. No Temporal cluster. No queue broker. No Step Functions state machine. No framework at all, really. Just Postgres, doing what Postgres does.

I read it and nodded along for most of it. Then I started thinking about why the idea keeps surfacing.

Because people keep reaching for expensive orchestration before they have outgrown cheap transactions.

durable execution is not new

Let me be clear about what "durable execution" means before the debate starts.

It means a workflow's progress survives the death of the process running it. The workflow survives restarts, crashes, deployments, and network failures. If your function finished one step and then crashed partway through the next, the retry picks up from the last completed step, not from scratch.
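As a minimal sketch of what resuming looks like, assume each workflow keeps a small persisted record of the next step to run. The names here (`run_workflow`, and the `state` dict standing in for a database row) are illustrative assumptions, not taken from any particular engine.

```python
from typing import Callable

Step = Callable[[dict], None]


def run_workflow(steps: list[Step], state: dict) -> dict:
    """Run steps in order, skipping any that a previous attempt finished.

    `state` stands in for a persisted row; in a real system every update
    to it would be a committed database write.
    """
    context = state.setdefault("context", {})
    start = state.get("next_step", 0)
    for index in range(start, len(steps)):
        steps[index](context)
        # Checkpoint: record progress only after the step succeeded.
        state["next_step"] = index + 1
    state["status"] = "done"
    return state
```

Calling `run_workflow` again with the same `state` after a crash starts at `next_step`, so completed steps are not repeated. A step can still run twice if the crash lands between finishing the step and recording the checkpoint.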

Temporal does this beautifully. AWS Step Functions does it too, in a different way. Azure Durable Functions. DBOS. Even Sidekiq in the Ruby world has been doing a version of this for years.

The value is real. Agent workflows need it badly: a multi-step agent task might call an LLM, write to a database, call another tool, wait for human approval, and then finish. If the process crashes after step two, you want to resume at step three, not start over.

Nobody disputes that.

What people dispute is whether you need a dedicated workflow engine for every durable execution need.

the temporal-shaped hole

Temporal is great software. If you are running multi-service orchestration with complex compensation logic, long-running sagas, human-in-the-loop steps, and strict history guarantees across dozens of services, Temporal is probably the right answer.

But a lot of teams adopt Temporal (or Step Functions, or Durable Functions) for problems Postgres already solves.

A single service that needs to retry a few steps with some state? Postgres transaction with a status column.

A background job that should survive process restarts? Postgres row with a status enum and a retry counter.

A multi-step task that needs to checkpoint progress? A table with a step identifier and the serialized state.

The pattern is everywhere: teams import a workflow engine because the problem feels big, when the real problem is that nobody made the simple version explicit.

The DBOS post makes this concrete. The authors show how a few Postgres features—transactional DDL, advisory locks for idempotent execution, LISTEN/NOTIFY for waking workers, and pg_cron for scheduling—cover a surprising amount of what people use workflow engines for.

Not everything. But enough that the default should be Postgres, not a new cluster.

what postgres actually needs for this

The core pattern for Postgres-based durable execution looks something like this:

You have a tasks table with a row per execution. Each row has a status, current step, input, output, retry count, and maybe serialized context. Workers pick up eligible rows, mark them as running, execute the step, update the row, and move on.
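One way to sketch that table and the "pick up a row and mark it running" step is below. The column names, the `tasks` table name, and the use of `FOR UPDATE SKIP LOCKED` are my assumptions for illustration; `cur` is assumed to be any DB-API cursor, such as one from psycopg.

```python
# Hypothetical schema matching the columns described above.
CREATE_TASKS_SQL = """
CREATE TABLE IF NOT EXISTS tasks (
    id            bigserial PRIMARY KEY,
    status        text NOT NULL DEFAULT 'pending',
    step          text,
    input         jsonb,
    output        jsonb,
    context       jsonb,
    retries       integer NOT NULL DEFAULT 0,
    heartbeat_at  timestamptz
)
"""

# Claim one pending row and mark it running in a single statement.
# FOR UPDATE SKIP LOCKED makes concurrent workers skip rows that another
# transaction is already claiming, instead of blocking on them.
CLAIM_SQL = """
UPDATE tasks
SET status = 'running', heartbeat_at = now()
WHERE id = (
    SELECT id FROM tasks
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, step, input, context
"""


def claim_next(cur):
    """Return the claimed row, or None if nothing is pending."""
    cur.execute(CLAIM_SQL)
    return cur.fetchone()
```

The claim only becomes visible to other workers once the surrounding transaction commits, so the caller is expected to commit right after `claim_next` returns a row.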

If the worker crashes, the row stays at "running", but its heartbeat timestamp stops advancing. A recovery process finds rows whose heartbeat is older than a timeout and resets them to "pending" so they are retried.
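The recovery sweep can be sketched as a single statement plus the predicate it encodes. The 60-second timeout, the `heartbeat_at` column, and the helper name `is_stale` are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# SQL form of the sweep; the 60-second timeout is an arbitrary example.
REQUEUE_STALE_SQL = """
UPDATE tasks
SET status = 'pending', retries = retries + 1
WHERE status = 'running'
  AND heartbeat_at < now() - interval '60 seconds'
"""


def is_stale(
    status: str,
    heartbeat_at: datetime,
    now: datetime,
    timeout: timedelta = timedelta(seconds=60),
) -> bool:
    """Same predicate as the WHERE clause above, usable in unit tests."""
    return status == "running" and now - heartbeat_at > timeout
```

The timeout should be comfortably longer than the heartbeat interval. Otherwise a slow but still-alive worker gets re-queued while it is running, and the same step executes twice.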

That is the simplest version. It is also surprisingly effective.

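Advisory locks prevent two workers from executing the same task. LISTEN/NOTIFY lets workers wake up when new tasks arrive, instead of polling. Transactions ensure that updating the task status and writing side effects to the same database happen atomically. Side effects outside the database, such as an LLM or HTTP call, are not covered by the transaction, so those steps still need to be safe to retry.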
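A rough sketch of the two primitives from a worker's point of view follows. `pg_try_advisory_xact_lock` and `pg_notify` are built-in Postgres functions; the helper names, the `tasks` channel name, and the choice of the task id as the lock key are assumptions. `cur` is assumed to be a DB-API cursor using `%s` placeholders, as psycopg does.

```python
def try_lock_task(cur, task_id: int) -> bool:
    """Try to take a transaction-scoped advisory lock keyed by task id.

    Returns True if this transaction now holds the lock, False if another
    session already holds it. Postgres releases the lock automatically
    at commit or rollback.
    """
    cur.execute("SELECT pg_try_advisory_xact_lock(%s)", (task_id,))
    return bool(cur.fetchone()[0])


def notify_new_task(cur, task_id: int) -> None:
    """Wake sessions that ran LISTEN on the (hypothetical) 'tasks' channel.

    The notification is delivered only when the sending transaction commits.
    """
    cur.execute("SELECT pg_notify(%s, %s)", ("tasks", str(task_id)))
```

Advisory lock keys share one namespace per database, so if anything else in the application uses them, the two-integer form of the function is a simple way to keep task locks separate.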

The DBOS post adds more sophistication: versioned workflows, event-driven scheduling, timeout guarantees, and multi-step orchestration. But the core is the same. Postgres already has the primitives.

the trap is reaching for scale you do not have

I think the real reason durable execution engines proliferate is not technical necessity. It is premature architecture.

A team building an agent orchestrator or a background job system reads about Temporal, sees the feature list, and decides they need the full stack. They set up a cluster. They learn the SDK. They model everything as workflows and activities. They install a separate state store and visibility store. The operational surface keeps growing.

Three months later, they are running a few hundred workflows a day. Postgres could have handled that on a single instance, with less code, less operational overhead, and fewer moving parts to debug.

The engine is not wrong. But it is expensive for the wrong reason: it solves a scaling problem the team does not have yet, while adding operational surface area that shows up immediately.

If you are Google or Uber, the scalability and isolation story of a dedicated engine matters. You have millions of workflows, dozens of services, and teams that need independent lifecycle management.

If you are a ten-person team building an agent platform, Postgres is probably fine.

when you should actually upgrade

Durable execution on Postgres has limits. They are worth naming honestly.

First, Postgres does not handle cross-service retries with workflow-level semantics natively. If your workflow spans service boundaries and needs to compensate a transaction in service A when service B fails, you are writing that compensation logic, and making it durable, yourself. Temporal gives you durable building blocks for it out of the box.

Second, history visibility. Postgres can store workflow history easily—just a JSONB column or a side table. But querying history at scale, especially across many workflows with filtering and sorting, is not what Postgres is optimized for. Temporal has a dedicated visibility store for this.

Third, rate limits and backpressure. Postgres advisory locks work fine for a few hundred concurrent workers. At higher concurrency, contention becomes real. Dedicated engines use task partitioning, sticky queues, and smart routing to keep throughput smooth.

Fourth, SDK ergonomics. Temporal gives you a strong SDK with replay, timeouts, signals, queries, and testing utilities. The Postgres approach means you are building or borrowing those utilities yourself.

The question is not whether dedicated engines have advantages. They do. The question is whether those advantages are relevant at your scale.

For most teams, the answer is no.

what i would actually do

If I were building a durable execution layer today, I would start with Postgres and keep it there until the pain became concrete.

Not "until we worry about scale." Concrete pain. Real incidents where the Postgres approach failed.

I would define my task shape, pick the Postgres primitives that match, and build a thin layer on top. The shape would be something like:

A tasks table with status, step, input, context, retries, heartbeats, and result
A worker that polls eligible rows, acquires a lock, executes, and commits
A recovery worker that re-queues stale running tasks
Maybe LISTEN/NOTIFY for low-latency wakeup
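The worker from that list can be sketched with the database access injected as plain callables, which keeps the control flow visible and testable without a running Postgres. The function name `work_once`, its parameters, and the `max_retries` default are all assumptions for illustration.

```python
from typing import Callable, Optional


def work_once(
    claim: Callable[[], Optional[dict]],
    execute: Callable[[dict], dict],
    save: Callable[[dict], None],
    max_retries: int = 3,
) -> bool:
    """Process at most one task. Returns False when nothing was eligible."""
    task = claim()  # in Postgres: claim a pending row and take its lock
    if task is None:
        return False
    try:
        task["result"] = execute(task)
        task["status"] = "done"
    except Exception as exc:
        # Record the failure and decide between retrying and giving up.
        task["retries"] += 1
        task["last_error"] = str(exc)
        task["status"] = "failed" if task["retries"] >= max_retries else "pending"
    save(task)  # in Postgres: UPDATE the row and COMMIT
    return True
```

A real worker would call `work_once` in a loop and, when it returns False, sleep briefly or block on a notification before trying again.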

That would get me a long way. Probably far enough to decide whether the problem is genuinely hard or just unfamiliar.

If I outgrew it, I would migrate to Temporal or a similar engine. The migration path is usually straightforward because the task shape is the same. The engine changes the delivery mechanics, not the domain model.

Starting with Postgres does not lock you out of a dedicated engine later. Starting with a dedicated engine locks you into its operational model immediately.

this is a boring technology argument

The durable execution pattern is getting attention because agent workflows need it. Every agent platform post on the internet mentions retries, checkpoints, and idempotent execution. The workflow-engine vendors are happy to fill that gap.

But the "boring technology wins" pattern keeps repeating for a reason. Postgres is already part of most teams' infrastructure. It has decades of reliability engineering behind it. It handles transactional durability better than most new systems. And the added operational cost is close to zero if you already run it.

The question is not whether durable execution engines are useful. They are.

The question is whether you need one before your problem is bigger than a database row with a status column.

The answer, for most teams, is probably not.

So start simple. Write to a tasks table. Let the worker crash. Read from the table again. Use a transaction when the state matters.

If that breaks in a way that a dedicated engine would fix, you will know exactly why. And you will be in a much better position to appreciate what the engine actually does for you.

Until then, Postgres is enough.
