אחיה כהן

Posted on Jul 2

Exactly-Once by Default: How Durable Execution Changed the Way I Build Automations

#typescript #backend #tutorial #architecture

In the previous article I described moving 34 production automations off a visual no-code platform and rewriting them in TypeScript. The single feature that made that migration worth the effort was durable execution with exactly-once semantics. This post is the deep dive.

The problem: a crash in the middle

Here's a scenario every automation eventually hits. A workflow receives a new lead, sends them a welcome message, then writes them to the CRM:

1. Send welcome message
2. Save to CRM

Now imagine the process crashes exactly between step 1 and step 2 — a deploy, an OOM kill, a dropped node. What happens on restart?

Re-run the whole thing → the lead gets the welcome message twice.
Don't re-run it → the lead never lands in the CRM.

Both outcomes are wrong. This is the at-least-once vs at-most-once dilemma, and in a system doing real side effects (sending messages, charging cards, creating records) it is not academic.
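The two restart policies can be played out in a tiny simulation. This is only a sketch: `simulateCrash` and `Outcome` are hypothetical names, and the counters stand in for what the lead and the CRM actually receive.

```typescript
// Hypothetical sketch: counts side effects under each naive restart policy.
type Outcome = { messages: number; crmRows: number };

function simulateCrash(policy: "rerun" | "skip"): Outcome {
  const outcome: Outcome = { messages: 0, crmRows: 0 };

  outcome.messages++; // step 1 ran: welcome message sent
  // ...the process crashes here, before step 2...

  if (policy === "rerun") {
    // At-least-once: start the whole workflow again from the top.
    outcome.messages++; // step 1 runs a second time
    outcome.crmRows++;  // step 2 finally runs
  }
  // policy === "skip" is at-most-once: nothing more happens.
  return outcome;
}
```

Under "rerun" the lead ends up with two messages and one CRM row; under "skip" they get one message and no CRM row.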

The usual fix, and why it hurts

Most tools give you retry-on-failure. But retry alone re-runs side effects. To get exactly-once you build it yourself:

Generate an idempotency key per lead.
Before each side effect, check "did I already do this?" against some store.
Persist progress after each step so a restart knows where to resume.
Repeat this bookkeeping for every workflow you ever write.

It works, but it's tedious, easy to get subtly wrong, and it clutters every automation with plumbing that has nothing to do with the business logic.
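To make the plumbing concrete, here is a minimal sketch of the hand-rolled version. Everything in it is an assumption for illustration: `once`, `runOnboarding`, and the in-memory `completed` set are hypothetical, and a real implementation would keep that set in a durable store rather than in process memory.

```typescript
// Hypothetical sketch of hand-rolled idempotency bookkeeping.
type Lead = { id: string; phone: string };

// Stands in for a durable "already done?" store; a Set only lives in memory.
const completed = new Set<string>();

// Stand-ins for the real side effects, so the effect count is observable.
const sent: string[] = [];
const crmRows: string[] = [];

// Run one side effect unless its idempotency key is already recorded.
function once(key: string, effect: () => void): void {
  if (completed.has(key)) return; // "did I already do this?"
  effect();
  completed.add(key);             // persist progress after the step
}

function runOnboarding(lead: Lead, crashAfterSend = false): void {
  once(`${lead.id}:sendWelcome`, () => { sent.push(lead.phone); });
  if (crashAfterSend) throw new Error("simulated crash");
  once(`${lead.id}:saveToCRM`, () => { crmRows.push(lead.id); });
}
```

Calling `runOnboarding(lead, true)`, catching the error, and then calling `runOnboarding(lead)` again sends one message and writes one CRM row. The cost is that `once` and its key scheme have to wrap every effect in every workflow.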

How DBOS makes it the default

DBOS flips this: durability is the baseline, not a feature you assemble. You annotate ordinary TypeScript functions. A workflow orchestrates; steps are the units that do side effects and get checkpointed.

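import { DBOS } from "@dbos-inc/dbos-sdk";
// Lead, whatsapp, and crm are application-specific and defined elsewhere.

export class Onboarding {
  // The workflow only orchestrates; it performs no side effects itself.
  @DBOS.workflow()
  static async welcomeLead(lead: Lead) {
    await Onboarding.sendWelcome(lead);   // step 1
    await Onboarding.saveToCRM(lead);     // step 2
  }

  // Each step's completion is checkpointed, so a finished step is not re-run.
  @DBOS.step()
  static async sendWelcome(lead: Lead) {
    await whatsapp.send(lead.phone, "Welcome aboard!");
  }

  @DBOS.step()
  static async saveToCRM(lead: Lead) {
    await crm.upsert(lead);
  }
}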

As the workflow runs, DBOS records the completion of each step in Postgres. From the docs:

"If a workflow is interrupted for any reason (e.g., an executor restarts or crashes), when your program restarts the workflow automatically resumes execution from the last completed step."

And crucially:

"Steps are tried at least once but are never re-executed after they complete."

So in our crash scenario: sendWelcome already completed and was recorded. On restart, DBOS skips it and resumes at saveToCRM. The welcome message is not sent twice; the CRM write finally happens. For every step whose completion was recorded, that is exactly-once, with zero idempotency bookkeeping in my code.
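The skip-on-replay behaviour can be modelled in a few lines. This is a toy model and not how DBOS is implemented: `ToyWorkflow` and the `Checkpoints` map are hypothetical stand-ins for the SDK and its Postgres table.

```typescript
// Toy model only; ToyWorkflow and Checkpoints are hypothetical, not the DBOS API.
type Checkpoints = Map<string, unknown>; // stands in for the Postgres table

class ToyWorkflow {
  private seq = 0;
  private checkpoints: Checkpoints;

  constructor(checkpoints: Checkpoints) {
    this.checkpoints = checkpoints;
  }

  // Run fn unless this step already completed; otherwise return the recorded result.
  step<T>(name: string, fn: () => T): T {
    const key = `${this.seq++}:${name}`;
    if (this.checkpoints.has(key)) return this.checkpoints.get(key) as T;
    const result = fn();
    this.checkpoints.set(key, result); // recorded only after fn returns
    return result;
  }
}

const messagesSent: string[] = [];
const crmRecords: string[] = [];

function welcomeLead(wf: ToyWorkflow, phone: string, crashBeforeCrm: boolean): void {
  wf.step("sendWelcome", () => { messagesSent.push(phone); });
  if (crashBeforeCrm) throw new Error("simulated crash");
  wf.step("saveToCRM", () => { crmRecords.push(phone); });
}
```

Running `welcomeLead` once with the crash flag set, then again with a fresh `ToyWorkflow` over the same `Checkpoints` map, skips the recorded send and only performs the CRM write.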

No separate workflow server, no queue broker to babysit — just your program and Postgres.

The one rule to internalize

Durability isn't free magic; there's a contract. The workflow function must be deterministic: given the same recorded step results, replaying it must take the same path. So anything non-deterministic (network calls, random values, reading the clock, DB writes) belongs inside a step, never loose in the workflow body. Steps are the checkpointed boundary; the workflow is the replayable script that ties them together.

Once that clicks, the mental model is clean: workflow = the plan, steps = the effects.
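A small replay model shows why the rule matters. It is a sketch under stated assumptions: `makeStepRunner` and `chooseChannel` are hypothetical, and recording results by call order is a simplification of what a real durable-execution engine does.

```typescript
// Toy replay model (hypothetical, not the DBOS API): results are recorded by call order.
function makeStepRunner(log: unknown[]) {
  let index = 0;
  return function step<T>(fn: () => T): T {
    if (index < log.length) return log[index++] as T; // replay the recorded result
    const result = fn();
    log.push(result);
    index++;
    return result;
  };
}

// The random roll happens inside a step, so its value is recorded.
// On replay the workflow body sees the same roll and takes the same branch.
function chooseChannel(log: unknown[], random: () => number): string {
  const step = makeStepRunner(log);
  const roll = step(() => random());
  return roll < 0.5 ? step(() => "sms") : step(() => "email");
}
```

If the first run rolls 0.2 and a replay would have rolled 0.9, the replay still returns "sms" because the roll comes from the log. Had `random()` been called directly in the workflow body, the replay would have branched to "email" and no longer matched the recorded steps.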

What this replaced

On the visual platform, I got retry and error branches, but exactly-once across a crash was something I had to design per flow — manual idempotency keys and "already done?" checks. Here it's the substrate. My code shrank to the business logic, and the reliability guarantee got stronger, not weaker.

That reliability is also what I sell to clients: fewer leads slipping through the cracks, no duplicate messages, no half-finished processes. (See the client-facing angle in the LinkedIn series.)

A note on how I built it

I'm one person, and wiring durable execution into 34 real automations is a lot of surface area. I did it pairing with Claude Code, which turned "I understand exactly-once in theory" into workflows running in production, one TypeScript module at a time. The barrier between a concept and a shipped system is thinner than it's ever been.

Sources: DBOS Workflows tutorial · Workflows & Steps reference

How do you handle mid-workflow crashes today — hand-rolled idempotency, an outbox, something else? Curious what patterns people have settled on.

Top comments (6)

Valentin Monteiro • Jul 4

You named the exact gap: n8n-style retries are at-least-once, never exactly-once across a crash, so you end up hand-rolling idempotency keys per flow anyway. The usual escape hatch before durable execution is outbox plus a dedup table. One thing I'd still watch with the DBOS model: a step's external call (send WhatsApp, charge a card) can succeed while the checkpoint write fails, and since steps are "at least once" that effect can fire again on replay, so non-idempotent calls still want a provider-level idempotency key. What does it do if the process dies mid-step, after the effect but before the record lands?

אחיה כהן • Jul 6

You've got the model exactly right, and the honest answer to your question is: the effect fires twice. DBOS checkpoints a step as complete after the function returns, so a crash in the window between "WhatsApp accepted the send" and "the completion row commits" leaves the step looking un-run — recovery replays it. That's the whole reason I called the guarantee "exactly-once by default" and not "exactly-once, period."

The precise framing that unlocked it for me: DBOS gives you exactly-once for the workflow's durable state machine — which step you're on, what the inputs were, the orchestration — and at-least-once for the steps themselves. It cannot give exactly-once to a side effect that lives on someone else's server, because that's not its state to checkpoint. So the boundary is exactly where you drew it: idempotent steps get exactly-once for free, non-idempotent external calls still need a provider-level idempotency key (or the outbox+dedup you mentioned) to collapse the at-least-once back down.
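As a rough sketch of what a provider-level key does for an at-least-once step: `FakeProvider` below is hypothetical and only imitates a provider that deduplicates on a caller-supplied key. How a real provider accepts the key (header, field, or not at all) varies.

```typescript
// FakeProvider is a hypothetical stand-in; real providers differ in how they accept keys.
class FakeProvider {
  private seen = new Map<string, string>(); // idempotency key -> message id
  readonly outbox: { to: string; body: string }[] = [];

  send(to: string, body: string, idempotencyKey: string): string {
    const prior = this.seen.get(idempotencyKey);
    if (prior !== undefined) return prior; // duplicate request: no second delivery
    this.outbox.push({ to, body });
    const id = `msg-${this.outbox.length}`;
    this.seen.set(idempotencyKey, id);
    return id;
  }
}

// Derive the key from stable workflow identity so every replay of the step reuses it.
function sendWelcomeStep(provider: FakeProvider, workflowId: string, phone: string): string {
  return provider.send(phone, "Welcome aboard!", `${workflowId}:sendWelcome`);
}
```

The important design choice is that the key is built from values that survive a restart (workflow id and step name), never from something generated fresh inside the step.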

What durable execution actually buys you over a hand-rolled outbox isn't removing that key — it's that you write it once per non-idempotent call instead of rebuilding the entire replay/dedup harness per flow. The dangerous ones become a small, countable list instead of "every node, just in case." Great question — it's the exact seam people miss.

Valentin Monteiro • Jul 8

That matches what I'd expect, and the provider-level key is the right patch, with one gotcha: those idempotency keys have a finite dedup window (Stripe holds them ~24h, some providers far less). Durable execution can replay a step long after that, for example after a crash left unrecovered overnight or a workflow resumed hours later. Once the window has expired, the provider treats the replay as a fresh call and the effect fires twice anyway. So the key only closes the gap inside the provider's retention, not across an arbitrary recovery delay. Do you pin a max replay age to stay inside it, or keep your own dedup record that outlives the provider's?
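The expiry problem is easy to see in a toy model. `WindowedProvider` is hypothetical, and the window length passed in is an arbitrary parameter, not any real provider's documented retention.

```typescript
// Hypothetical provider that forgets idempotency keys after a fixed window.
class WindowedProvider {
  private seen = new Map<string, number>(); // key -> time of the accepted call (ms)
  private windowMs: number;
  delivered = 0;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  send(idempotencyKey: string, nowMs: number): void {
    const firstSeen = this.seen.get(idempotencyKey);
    // Inside the window the call is recognised as a duplicate and dropped.
    if (firstSeen !== undefined && nowMs - firstSeen < this.windowMs) return;
    // Outside the window the same key looks like a brand-new request.
    this.delivered++;
    this.seen.set(idempotencyKey, nowMs);
  }
}
```

With a 24-hour window, a replay one hour after the original call is deduplicated, while a replay 25 hours later is delivered again.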

אחיה כהן • Jul 10

Own dedup record — for exactly the reason you named: the provider's window is their retention policy, and my recovery delay is mine, so leaning on theirs makes my correctness depend on a knob I don't control.

Concretely, the ledger is just a table in the same Postgres DBOS checkpoints into, keyed by (workflow id, step name). The send step checks it before calling out and records the provider's message id after. On a replay — even days later — the ledger hit short-circuits the call. For the WhatsApp flows this was never optional anyway: WAHA has no idempotency keys at all, so a self-owned ledger was the only dedup on offer.
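A minimal sketch of that ledger, with a `Map` standing in for the Postgres table. `SendLedger` and `sendOnce` are hypothetical names, and the `send` callback represents whatever call reaches the provider and returns its message id.

```typescript
// Hypothetical sketch; the Map stands in for a table keyed by (workflow id, step name).
type LedgerRow = { providerMessageId: string };

class SendLedger {
  private rows = new Map<string, LedgerRow>();

  private key(workflowId: string, stepName: string): string {
    return `${workflowId}|${stepName}`;
  }

  get(workflowId: string, stepName: string): LedgerRow | undefined {
    return this.rows.get(this.key(workflowId, stepName));
  }

  record(workflowId: string, stepName: string, providerMessageId: string): void {
    this.rows.set(this.key(workflowId, stepName), { providerMessageId });
  }
}

function sendOnce(
  ledger: SendLedger,
  workflowId: string,
  stepName: string,
  send: () => string, // calls the provider and returns its message id
): string {
  const hit = ledger.get(workflowId, stepName);
  if (hit !== undefined) return hit.providerMessageId; // replay short-circuits the call
  const id = send(); // not atomic with the record below
  ledger.record(workflowId, stepName, id);
  return id;
}
```

Because the ledger has no expiry, a replay days later still hits the row and skips the call.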

Being honest about what it buys: check-then-send still isn't atomic with the provider call, so the crash-between-"provider accepted"-and-"row commits" window from my previous reply still exists. The ledger closes the long-tail replay (resumed hours later, past any provider window) — which is the failure mode that actually scares me — not the millisecond one.

And yes to a max replay age too, but as a circuit breaker rather than dedup: a workflow that's been dead longer than a few hours gets parked for manual review instead of auto-resumed. If it slept through the night, the world it's replaying into may have moved in ways an idempotency key can't express — prices, stock, whether the customer still wants the thing.
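The circuit breaker itself is a one-line decision. In this sketch `decideRecovery` and its parameters are hypothetical, and the threshold is whatever "a few hours" means for a given flow.

```typescript
// Hypothetical sketch: park stale workflows for manual review instead of auto-resuming.
type RecoveryDecision = "resume" | "park";

function decideRecovery(
  lastProgressMs: number, // when the workflow last recorded a completed step
  nowMs: number,
  maxReplayAgeMs: number,
): RecoveryDecision {
  return nowMs - lastProgressMs > maxReplayAgeMs ? "park" : "resume";
}
```

This is a staleness check, not dedup: it says nothing about whether an effect already fired, only whether the world may have moved on since the workflow stopped.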

Valentin Monteiro • Jul 10

The ledger-for-dedup, replay-age-for-circuit-breaker split is the right cut. Those are two different jobs, and conflating them is where most "exactly-once" claims quietly fall apart. On the crash-between-accepted-and-commits window: you can't fully close it without a provider-side idempotency key, but you can shrink what it costs you. Write an "intent" row (about to send, id X) inside the checkpoint tx before the call, flip it to "sent" after. On replay the only ambiguous state is intent-with-no-sent, a bounded queue sized by in-flight sends, not by how long the workflow was dead. Your manual-review park then fires on that specific state instead of on wall-clock age, so a workflow that slept all night but had no open send just resumes clean.
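The intent-row idea can be sketched as a three-way decision on replay. `IntentTable` and `onReplay` are hypothetical names, and the in-memory `Map` is an assumption standing in for rows written inside the same transaction as the checkpoint.

```typescript
// Hypothetical sketch; a real version would write these rows in the checkpoint transaction.
type SendState = "intent" | "sent";
type ReplayAction = "send" | "skip" | "park";

class IntentTable {
  private rows = new Map<string, SendState>();

  // Written before the provider call: "about to send, id X".
  markIntent(sendId: string): void {
    if (!this.rows.has(sendId)) this.rows.set(sendId, "intent");
  }

  // Written after the provider call returns.
  markSent(sendId: string): void {
    this.rows.set(sendId, "sent");
  }

  stateOf(sendId: string): SendState | undefined {
    return this.rows.get(sendId);
  }
}

function onReplay(table: IntentTable, sendId: string): ReplayAction {
  const state = table.stateOf(sendId);
  if (state === undefined) return "send"; // never attempted: safe to send
  if (state === "sent") return "skip";    // confirmed: do not send again
  return "park";                          // intent with no sent: ambiguous, needs review
}
```

Only the "park" branch needs a human, and the number of rows that can be in that state is bounded by how many sends were in flight at the moment of the crash.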

אחיה כהן • Jul 12

The intent row is the piece I was missing, and it fixes the exact thing that bothered me about the wall-clock circuit breaker: age is a proxy, "there was an open send when we died" is the actual condition. Parking on the proxy means a workflow that slept overnight with nothing in flight gets a human's attention it doesn't need, while a workflow that died three seconds into a send gets auto-resumed because it's young. Backwards on both ends.

Writing the intent inside the checkpoint transaction is what makes it work — it can't be lost separately from the step's own progress, so replay always sees a consistent pair. And the ambiguous set really is bounded by concurrency, not by downtime, which is the property that lets you park on it: a handful of rows to reconcile, not a night's worth.

The remaining honest gap is that intent-with-no-sent still doesn't tell you which way it went — the provider may have accepted and the ack died in flight. You've turned an unbounded silent-duplicate risk into a small, explicit, reviewable set. That's the trade I want: not "impossible to double-send," but "if it can double-send, a human sees the row." I'm going to steal this for the send steps that have no provider-side key at all, where the ledger is doing all the work alone.