In the previous article I described moving 34 production automations off a visual no-code platform and rewriting them in TypeScript. The single feature that made that migration worth the effort was durable execution with exactly-once semantics. This post is the deep-dive.
The problem: a crash in the middle
Here's a scenario every automation eventually hits. A workflow receives a new lead, sends them a welcome message, then writes them to the CRM:
1. Send welcome message
2. Save to CRM
Now imagine the process crashes exactly between step 1 and step 2 — a deploy, an OOM kill, a dropped node. What happens on restart?
- Re-run the whole thing → the lead gets the welcome message twice.
- Don't re-run it → the lead never lands in the CRM.
Both outcomes are wrong. This is the at-least-once vs at-most-once dilemma, and in a system doing real side effects (sending messages, charging cards, creating records) it is not academic.
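To make the two failure modes concrete, here is a toy simulation of that crash. Everything in it is hypothetical (the counters stand in for a real messaging client and CRM), so read it as a sketch of the dilemma rather than real integration code.

```typescript
// Toy model: counters stand in for a real messaging client and CRM.
type Outcome = { messagesSent: number; crmRecords: number };

// Runs the two-step flow. If crashAfterSend is true, the "process" dies
// after the message goes out but before the CRM write.
function runOnce(state: Outcome, crashAfterSend: boolean): boolean {
  state.messagesSent += 1;          // step 1: send welcome message
  if (crashAfterSend) return false; // simulated crash between the steps
  state.crmRecords += 1;            // step 2: save to CRM
  return true;
}

// Strategy A: on restart, re-run the whole flow (at-least-once).
function rerunEverything(): Outcome {
  const state: Outcome = { messagesSent: 0, crmRecords: 0 };
  const finished = runOnce(state, true);
  if (!finished) runOnce(state, false);
  return state;
}

// Strategy B: on restart, do nothing (at-most-once).
function neverRerun(): Outcome {
  const state: Outcome = { messagesSent: 0, crmRecords: 0 };
  runOnce(state, true);
  return state;
}
```

In this model, rerunEverything() ends with two messages and one CRM record, while neverRerun() ends with one message and no CRM record.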
The usual fix, and why it hurts
Most tools give you retry-on-failure. But retry alone re-runs side effects. To get exactly-once you build it yourself:
- Generate an idempotency key per lead.
- Before each side effect, check "did I already do this?" against some store.
- Persist progress after each step so a restart knows where to resume.
- Repeat this bookkeeping for every workflow you ever write.
It works, but it's tedious, easy to get subtly wrong, and it clutters every automation with plumbing that has nothing to do with the business logic.
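The bookkeeping above might look roughly like this. It is a minimal sketch under loud assumptions: the in-memory Set stands in for a durable store (in production it would be a database table), and the names once and onboardLead are made up for illustration.

```typescript
// Hypothetical in-memory stand-ins; a real version would persist these.
const completed = new Set<string>();
const sentMessages: string[] = [];
const crmRows: string[] = [];

// Runs `effect` only if this key is not yet recorded as done, then records it.
function once(key: string, effect: () => void): void {
  if (completed.has(key)) return; // "did I already do this?"
  effect();
  completed.add(key);             // persist progress after the step
}

// The business logic is two lines; the rest is idempotency plumbing.
function onboardLead(leadId: string, crashBeforeCrm = false): void {
  once(`welcome:${leadId}`, () => sentMessages.push(leadId));
  if (crashBeforeCrm) throw new Error("simulated crash");
  once(`crm:${leadId}`, () => crmRows.push(leadId));
}
```

Calling onboardLead("lead-42", true), catching the simulated crash, and then calling onboardLead("lead-42") again leaves one sent message and one CRM row. Note the gap between effect() and completed.add(key): a crash right there repeats the effect on the next run, which is one of the subtle ways a hand-rolled version goes wrong.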
How DBOS makes it the default
DBOS flips this: durability is the baseline, not a feature you assemble. You annotate ordinary TypeScript functions. A workflow orchestrates; steps are the units that do side effects and get checkpointed.
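import { DBOS } from "@dbos-inc/dbos-sdk";

// Lead, whatsapp, and crm are the application's own type and clients (not shown here).
export class Onboarding {
  // The workflow only orchestrates; it performs no side effects itself.
  @DBOS.workflow()
  static async welcomeLead(lead: Lead) {
    await Onboarding.sendWelcome(lead); // step 1
    await Onboarding.saveToCRM(lead); // step 2
  }

  // Each step does one side effect and is checkpointed when it completes.
  @DBOS.step()
  static async sendWelcome(lead: Lead) {
    await whatsapp.send(lead.phone, "Welcome aboard!");
  }

  @DBOS.step()
  static async saveToCRM(lead: Lead) {
    await crm.upsert(lead);
  }
}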
As the workflow runs, DBOS records the completion of each step in Postgres. From the docs:
"If a workflow is interrupted for any reason (e.g., an executor restarts or crashes), when your program restarts the workflow automatically resumes execution from the last completed step."
And crucially:
"Steps are tried at least once but are never re-executed after they complete."
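So in our crash scenario: sendWelcome already completed and was recorded. On restart, DBOS skips it and resumes at saveToCRM. The welcome message is not sent twice, and the CRM write finally happens. For completed steps that is effectively exactly-once, with no idempotency bookkeeping in my code. The "tried at least once" half of the quote still matters: if the process dies inside a step, after the message went out but before the step's completion was recorded, that step runs again on recovery, so a side effect that cannot tolerate a duplicate still needs to be idempotent on the receiving end.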
No separate workflow server, no queue broker to babysit — just your program and Postgres.
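If it helps to see the mechanism, here is a toy model of checkpoint-and-resume. It is not how DBOS is implemented; the Map stands in for the Postgres table, and runStep and welcomeLeadToy are hypothetical names. It only illustrates the idea that a completed step's result is recorded and replayed instead of re-executed.

```typescript
// Toy model of step checkpointing; NOT the DBOS implementation.
// `checkpoints` stands in for the Postgres table of recorded step results.
const checkpoints = new Map<string, string>();
const effectsLog: string[] = [];

// Runs a step once per (workflowId, stepIndex); on replay, returns the recorded result.
function runStep(workflowId: string, stepIndex: number, fn: () => string): string {
  const key = `${workflowId}:${stepIndex}`;
  const recorded = checkpoints.get(key);
  if (recorded !== undefined) return recorded; // completed earlier: skip the effect
  const result = fn();
  checkpoints.set(key, result); // record completion
  return result;
}

function welcomeLeadToy(workflowId: string, crashAfterStep1: boolean): string {
  const messageId = runStep(workflowId, 0, () => {
    effectsLog.push("sendWelcome");
    return "msg-1";
  });
  if (crashAfterStep1) throw new Error("simulated crash between steps");
  return runStep(workflowId, 1, () => {
    effectsLog.push("saveToCRM");
    return `crm-row-for-${messageId}`;
  });
}
```

Running welcomeLeadToy("wf-1", true), catching the crash, and then running welcomeLeadToy("wf-1", false) leaves effectsLog with exactly one "sendWelcome" followed by one "saveToCRM". The second run never re-executes step 0; it just reads the recorded message ID.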
The one rule to internalize
Durability isn't free magic; there's a contract. The workflow function must be deterministic: given the same recorded step results, replaying it must take the same path. So anything non-deterministic (network calls, random values, reading the clock, DB writes) belongs inside a step, never loose in the workflow body. Steps are the checkpointed boundary; the workflow is the replayable script that ties them together.
Once that clicks, the mental model is clean: workflow = the plan, steps = the effects.
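A small toy replay shows why the rule matters. Again, this is a sketch and not DBOS code: recordedRolls stands in for checkpointed step results, and nextRoll is a made-up stand-in for any non-deterministic source (a random number, the clock, a network response).

```typescript
// Toy replay model (not DBOS itself): stands in for checkpointed step results.
const recordedRolls = new Map<number, number>();

// Returns the recorded value for a step on replay; otherwise runs it and records it.
function step(index: number, fn: () => number): number {
  const prior = recordedRolls.get(index);
  if (prior !== undefined) return prior;
  const value = fn();
  recordedRolls.set(index, value);
  return value;
}

// Stand-in for a non-deterministic source: alternates 0.9, 0.1, 0.9, ...
let calls = 0;
const nextRoll = (): number => (calls++ % 2 === 0 ? 0.9 : 0.1);

// Bad: the value is read loose in the workflow body, so a replay can branch differently.
function pathLoose(): string {
  return nextRoll() > 0.5 ? "premium" : "standard";
}

// Good: the value is produced inside a step, so a replay sees the recorded result.
function pathInStep(): string {
  return step(0, nextRoll) > 0.5 ? "premium" : "standard";
}
```

Two consecutive calls to pathLoose() disagree with each other, which is what a diverging replay looks like. Two consecutive calls to pathInStep() agree, because the second one reads the recorded value instead of asking the source again.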
What this replaced
On the visual platform, I got retry and error branches, but exactly-once across a crash was something I had to design per flow — manual idempotency keys and "already done?" checks. Here it's the substrate. My code shrank to the business logic, and the reliability guarantee got stronger, not weaker.
That reliability is also what I sell to clients: fewer leads slipping through the cracks, no duplicate messages, no half-finished processes. (See the client-facing angle in the LinkedIn series.)
A note on how I built it
I'm one person, and wiring durable execution into 34 real automations is a lot of surface area. I did it pairing with Claude Code, which turned "I understand exactly-once in theory" into workflows running in production, TypeScript module by TypeScript module. The barrier between a concept and a shipped system is thinner than it's ever been.
Sources: DBOS Workflows tutorial · Workflows & Steps reference
How do you handle mid-workflow crashes today — hand-rolled idempotency, an outbox, something else? Curious what patterns people have settled on.