fjavierm

Posted on Jun 29 • Originally published at binarycoders.wordpress.com on May 15

Durable Execution: The Runtime for Distributed Systems

#systemsinfrastructur #distributedsystems #durableexecution #workfloworchestratio

Note: This article has two main sections. The first one is an abstract explanation of the Durable Execution concept. The second one is a simple workflow example to try to reduce the abstraction, and show a more realistic view to anchor the explanation. Depending on how you like to learn, feel free to read the explanation first, the example first, or even alternate between them while reading.

There is a quiet but important change happening in how we build software. It isn’t a sudden “revolution”, but rather a new way of thinking about how programs run across multiple servers. We call this Durable Execution.

Durable Execution could be described as a simple inversion of responsibility: instead of treating failure as something applications must anticipate and recover from, durable execution systems assume that failure is constant and design the execution model itself to survive it. This means we are no longer just coordinating work across services; we are starting to treat execution itself as a persistent, stateful entity.

Most modern backend systems are built on an architecture that, on paper, looks clean and composable. A request enters a system, an orchestrator decomposes it into tasks, and a fleet of stateless workers executes those tasks independently. For example, when you click “buy” on a website, a request goes to a server, which then talks to a database, a payment processor, a storage system, and a shipping service, among other things. In this setup, each service is responsible for doing one thing well, and persistence is delegated to databases and queues.

This model scales remarkably well in terms of throughput and organisational clarity. It is the backbone of microservices architecture. But as systems grow in complexity, something inevitable and subtle happens: workflows begin to leak across boundaries.

A “simple” business process such as processing a payment, or fulfilling an order, quietly evolves into a distributed orchestration of services. Each step is straightforward in isolation, yet the overall process becomes fragile not because any single component is complex, but because no single component owns the lifecycle of the workflow itself. The orchestration of these systems eventually involves:

retry policies embedded in clients and workers
state stored in databases with evolving schemas
queues that act as implicit progress trackers
compensating logic scattered across services
and operational heuristics encoded in dashboards and alerts

Eventually, the boundaries of the workflow blur: the workflow ceases to exist as a distinct entity and becomes highly entangled with the infrastructure. This is the root tension that durable execution tries to address.

To understand the shift, it helps to contrast two models of thinking about distributed systems.

In the traditional worker-based architecture, the system guarantees delivery of work. A message will eventually reach a worker. A job will eventually be retried. A task will eventually be processed. A completed or failed result will be published. But what is not guaranteed is the continuity of execution. If a process begins, partially completes, and then fails mid-way, nothing in the infrastructure inherently remembers what step it was on or what should happen next. That responsibility is pushed upward into application code and external state stores.

Durable execution flips this assumption: instead of treating an execution as ephemeral, it treats it as a persistent object. A workflow is not something that is “run”; it is something that exists over time. It has a history, a state, and a deterministic progression that can be paused, resumed, replayed, or migrated. This is the core idea behind systems such as Temporal, which model workflows as durable state machines whose execution history is recorded and reconstructed as needed. The runtime becomes responsible not only for executing steps, but for preserving the identity of the execution itself.

At the centre of durable execution lies a constraint that initially feels unnatural to most engineers: workflow code must be deterministic. This does not mean the system itself is deterministic in the mathematical sense. It means that given the same recorded history of events, the workflow must always reconstruct the same state and make the same decisions. This requirement exists because durable systems often rely on replay. When a workflow resumes after a failure, the runtime does not “continue” execution in the traditional sense. Instead, it reconstructs the workflow by replaying prior decisions and rehydrating state from a persisted event history. This has an important consequence: side effects cannot be executed freely during replay. External interactions, such as API calls, database writes, and message emissions, must be carefully separated from the logical flow of the workflow. In practice, this introduces a separation between:

activities (the side-effecting operations performed by workers)
workflow logic (the durable orchestration layer)

This separation is what allows executions to be safely paused and resumed without ambiguity. While this may feel restrictive, it is precisely this constraint that enables durability.
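To make the separation concrete, here is a minimal sketch in Python. It is not the API of Temporal or any other engine: `ReplayRuntime`, `activity`, and the two activity names are hypothetical, and the "persisted" history is just an in-memory list. It only illustrates the idea that, during replay, recorded results are returned instead of re-running side effects.

```python
from typing import Any, Callable, List, Optional


class ReplayRuntime:
    """Hypothetical mini-runtime: records activity results and replays them."""

    def __init__(self, history: Optional[List[Any]] = None) -> None:
        self.history: List[Any] = list(history or [])  # recorded results, in order
        self.cursor = 0                                # how far replay has progressed
        self.executed: List[str] = []                  # activities really run this time

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history):
            # Replay: reuse the recorded result, do NOT run the side effect again.
            result = self.history[self.cursor]
        else:
            # First execution: run the side effect and record its result.
            result = fn()
            self.history.append(result)
            self.executed.append(name)
        self.cursor += 1
        return result


def workflow(rt: ReplayRuntime) -> str:
    # Workflow logic: deterministic, no direct side effects.
    order_id = rt.activity("take_order", lambda: "ORDER-1")
    return rt.activity("charge_payment", lambda: f"PAID-{order_id}")


first = ReplayRuntime()
workflow(first)

# Simulate a crash: a fresh runtime is rebuilt from the recorded history.
second = ReplayRuntime(history=first.history)
replayed_result = workflow(second)
```

In the second run, `second.executed` stays empty: the workflow reaches the same result without charging the payment twice. That only works because the workflow function asks for the same activities in the same order every time.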

Let’s try to put side by side some of the characteristics of a traditional system, and the characteristics of Durable Execution systems.

In traditional systems, retries are usually an implementation detail. A worker fails, a message is requeued, and eventually the task is attempted again. But retries quickly become more complex than they first appear, especially when failures happen mid-workflow rather than at the boundaries of tasks. What should happen if a payment succeeds but inventory reservation fails? Should the system retry the inventory step, or compensate the payment? What if compensation itself fails? What if the system crashes between deciding to compensate and actually doing so? These questions are not edge cases; they are the natural consequence of long-running distributed coordination.

Durable execution systems turn retries into a first-class runtime concept. Instead of scattering retry logic across services, the workflow engine tracks execution attempts as part of its history. Time itself becomes a managed dimension of the system, with timers, delays, and waiting periods becoming durable constructs rather than external scheduling hacks. Even waiting for days becomes structurally simple, because the workflow state is persisted independently of process memory. In this sense, durable execution is not just about reliability under failure. It is about treating time as a durable resource.
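As a rough illustration of retries living in the execution history rather than in scattered client code, consider the sketch below. The function `run_with_retries`, the event names, and the flaky `ReserveInventory` activity are all assumptions made for this example; a real engine would also persist the history and apply backoff between attempts.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Tuple[str, str, Any]  # (event type, activity name, detail)


def run_with_retries(history: List[Event], name: str,
                     fn: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Run an activity, recording every attempt as an event in the history."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except Exception as exc:
            # The failed attempt becomes part of the workflow's history.
            history.append(("ActivityFailed", name, f"attempt {attempt}: {exc}"))
            continue
        history.append(("ActivityCompleted", name, result))
        return result
    raise RuntimeError(f"{name} failed after {max_attempts} attempts")


calls: Dict[str, int] = {"n": 0}


def flaky_reserve() -> str:
    # Hypothetical activity that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("inventory service unavailable")
    return "RESERVED"


history: List[Event] = []
result = run_with_retries(history, "ReserveInventory", flaky_reserve)
```

After this runs, the history holds two `ActivityFailed` events followed by one `ActivityCompleted` event, so the number of attempts is something the runtime can inspect rather than something buried in logs.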

Moving deeper, once a workflow spans multiple services, failure is no longer binary; it is often partial. Some steps succeed, others fail, and the system must reconcile an inconsistent reality. This is where compensation logic enters the picture, often through patterns such as sagas.

If you are not familiar with the term, a saga is essentially a distributed transaction without atomicity. Instead of committing or rolling back everything as a single unit, the system defines compensating actions that attempt to undo completed steps when later steps fail. In traditional architectures, sagas are notoriously difficult to implement correctly because their logic is distributed across services and tightly coupled to operational state.

Durable execution brings sagas into the workflow layer itself. Compensation is no longer a scattered concern but part of the execution model. The workflow runtime knows what has succeeded, what has failed, and what needs to be undone. This does not eliminate complexity, but it changes where complexity lives. Instead of being embedded in infrastructure glue code, it becomes explicit in the structure of the workflow.
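A minimal sketch of what "compensation in the workflow layer" might look like is shown below. `run_saga` and the step names are hypothetical, and the actions are stubs; the point is only that one place knows which steps completed and undoes them in reverse order when a later step fails. It deliberately ignores the harder case where a compensation itself fails.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name} failed")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name} compensated")
            return False
        log.append(f"{name} done")
        completed.append(step)
    return True


def fail_reservation() -> None:
    # Stub for an inventory service that rejects the reservation.
    raise RuntimeError("out of stock")


log: List[str] = []
ok = run_saga(
    [
        ("ChargePayment", lambda: None, lambda: None),      # compensation = refund
        ("ReserveInventory", fail_reservation, lambda: None),
    ],
    log,
)
```

Here the payment succeeds, the reservation fails, and the log ends with `ChargePayment compensated`. In a durable execution system the `completed` list would come from the recorded history, so the compensation could still run after a crash.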

Perhaps the most important conceptual shift introduced by durable execution is the normalisation of long-running processes. In traditional request-driven systems, time is implicitly assumed to be short. A request is expected to complete within milliseconds or seconds. Anything longer is pushed out of band into queues, schedulers, or cron jobs. But many real-world processes do not fit this model. They are inherently extended in time:

waiting for human approval
integrating with third-party systems
coordinating multi-stage financial flows
handling asynchronous physical-world processes

Durable execution embraces this directly. A workflow can span seconds, hours, or weeks without requiring external orchestration mechanisms to simulate persistence. These systems treat long-running execution not as an anomaly, but as a primary use case.

An increasingly important driver of interest in durable execution comes from the domain of AI agents. Modern agentic systems are not stateless request handlers. They maintain evolving context, interact with external tools, retry operations, and often run through multi-step reasoning and action loops that can span long periods of time. Without durable execution, these systems are fragile in predictable ways:

a crash loses context
a timeout breaks continuity
partial tool execution leads to inconsistent state
retries can duplicate side effects

What is emerging is a recognition that AI agents are, structurally, workflows. They are not single computations; they are long-running, stateful processes that require persistence, replayability, and controlled side effects. This is why durable workflow engines are increasingly being explored as the underlying runtime for agent orchestration, not just business process automation.

It is tempting to view durable execution systems as simply better workflow engines, but that framing underestimates what is actually changing. Traditional orchestrators coordinate tasks across workers. They are, in essence, message routers with state tracking bolted on. Durable execution systems, by contrast, begin to resemble execution environments. They manage:

state persistence
execution history
scheduling and timers
retries and failure recovery
deterministic replay
coordination across services

This is why a useful mental model is to think of them as a kind of distributed operating system. Not for hardware resources, but for business logic execution over time. The comparison is not perfect, but it is instructive. Just as operating systems abstracted away hardware complexity to allow applications to run reliably on unstable machines, durable execution abstracts away distributed failure to allow workflows to run reliably on unstable networks.

Durable execution is still an emerging paradigm. It is not yet the default model for building distributed systems, and many teams continue to rely successfully on queues, workers, and stateless services. But the pressure that led to its development is becoming more pronounced. Systems are becoming more distributed, workflows are becoming longer-lived, and AI systems are introducing a new class of stateful computation that does not fit neatly into request/response paradigms.

What durable execution offers is not a new tool, but a new boundary. It draws a line around execution itself and says: this is something worth making persistent. If databases taught us how to make data reliable, durable execution is attempting something analogous for computation. And if that trajectory continues, workflow engines may evolve from infrastructure components into something closer to what operating systems became for the hardware era: a foundational layer that quietly defines how everything else runs.

Simple workflow example

If you are like me, after reading this you are probably thinking “Yeah, that sounds good, but what does it actually look like?”. For that reason, let’s try to make the abstraction more concrete, and look at a single workflow.

Imagine a simple order fulfilment process: ordering a pizza. In a traditional system, this would be split across services, queues, and callbacks. In a durable execution system, it is expressed as a single continuous workflow, but importantly, it does not execute like a function call. It behaves more like a stateful timeline.

The workflow (conceptual view)


Workflow: PizzaOrder

Step 1 → Take order ("margherita")
Step 2 → Prepare pizza (orderId)
Step 3 → Wait for 5 seconds (or 5 hours, or 5 days)
Step 4 → Deliver pizza
Step 5 → Send receipt

At first glance, this looks trivial. The important part is what happens under the hood.

What actually happens at runtime

When the workflow starts, the runtime does not “execute everything”. It begins a controlled sequence of recorded decisions.

1. Start of execution

The system creates a durable record:


WorkflowInstance: PizzaOrder-128
State: STARTED
History: []

It then executes Step 1.


→ Schedule activity: TakeOrder("margherita")

This is not executed inline. It is dispatched to a worker. The workflow pauses.

2. First suspension point (important idea)

At this moment, the workflow is not “running”. It is:

persisted
waiting
fully safe to crash
resumable from history


WorkflowInstance: PizzaOrder-128
State: WAITING_FOR(TakeOrder)
History:
  - Scheduled TakeOrder("margherita")

A worker eventually responds:


Result: ORDER-MARGHERITA
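To see why the workflow is "fully safe to crash" at a suspension point, it can help to picture the record as plain serialisable data. The sketch below is an assumption about shape only: the `WorkflowInstance` class and its JSON encoding are made up for illustration, and the JSON string stands in for a database write.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WorkflowInstance:
    workflow_id: str
    state: str = "STARTED"
    history: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Everything needed to resume lives in this record, not in process memory.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "WorkflowInstance":
        return WorkflowInstance(**json.loads(raw))


instance = WorkflowInstance("PizzaOrder-128")
instance.history.append('Scheduled TakeOrder("margherita")')
instance.state = "WAITING_FOR(TakeOrder)"
stored = instance.to_json()  # stand-in for persisting to a database

# The process can die here. A new process rebuilds the instance from storage.
recovered = WorkflowInstance.from_json(stored)
```

The recovered instance is equal to the original one, including its `WAITING_FOR(TakeOrder)` state, which is all the runtime needs to know what it is waiting for.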

3. Replay begins (the non-obvious part)

Now comes the key durable execution concept. If the workflow needs to continue (or recover from failure), the runtime does not simply “continue execution”. Instead, it replays the workflow from the beginning using recorded history:


Replay:
  - Step 1: TakeOrder → already completed (from history)
  - Step 2: PreparePizza(orderId=ORDER-MARGHERITA)

This is where determinism matters: the workflow code must behave consistently during replay.

4. Another suspension

The system schedules the next activity:


→ Schedule activity: PreparePizza(ORDER-MARGHERITA)

Again, execution pauses.


State: WAITING_FOR(PreparePizza)
History:
  - TakeOrder completed
  - PreparePizza scheduled

A worker completes it:


Result: PIZZA-ORDER-MARGHERITA

5. Time becomes a first-class construct

Now the workflow reaches an unusual step:


WAIT 5 seconds

In a traditional system, this would require:

cron jobs
timers
sleep threads
external schedulers

In a durable system, time itself is persisted:


Workflow state:
  Timer scheduled: +5 seconds

The workflow is now completely idle, but still alive as a durable object. It can survive:

process crashes
machine restarts
deployments
network partitions

Nothing is lost.
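One way to picture a persisted timer is as an absolute deadline rather than a sleeping thread. The sketch below assumes exactly that; `DurableTimer` is a hypothetical name, and the clock is passed in explicitly so the behaviour after a restart is easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurableTimer:
    fire_at: float  # absolute time in seconds; this value is what gets persisted

    @staticmethod
    def schedule(now: float, delay_seconds: float) -> "DurableTimer":
        return DurableTimer(fire_at=now + delay_seconds)

    def remaining(self, now: float) -> float:
        # Never negative: an overdue timer simply fires immediately.
        return max(0.0, self.fire_at - now)

    def is_due(self, now: float) -> bool:
        return now >= self.fire_at


# Timer scheduled at t=1000 for +5 seconds.
timer = DurableTimer.schedule(now=1000.0, delay_seconds=5.0)

# The process restarts at t=1003 and reloads the timer from storage.
remaining_after_restart = timer.remaining(now=1003.0)

# Even after a long outage, the timer is simply overdue, not lost.
due_after_long_outage = timer.is_due(now=5000.0)
```

Because only the deadline is stored, it makes no difference whether the wait is 5 seconds or 5 days, or how many times the process restarts in between.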

6. Resumption after time passes

When the timer fires:


→ Resume workflow

Replay reconstructs state again:


History:
  - TakeOrder ✓
  - PreparePizza ✓
  - Wait(5s) ✓

Now execution continues:


→ Schedule activity: DeliverPizza(PIZZA-ORDER-MARGHERITA)

7. Completion

Finally:


→ Schedule activity: SendReceipt(deliveryId)
→ Workflow COMPLETE

At the end, the system has not just executed steps. It has produced a durable execution trace:


Workflow History:
  1. TakeOrder → ORDER-MARGHERITA
  2. PreparePizza → PIZZA-ORDER-MARGHERITA
  3. Timer → 5s elapsed
  4. DeliverPizza → DELIVERED
  5. SendReceipt → RECEIPT SENT
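The whole walkthrough can be approximated in a few lines of Python. This is a toy, not how any particular engine is implemented: the workflow is written as a generator, `advance` and `ACTIVITIES` are hypothetical names, the activities are stubs, and the timer "elapses" instantly. Each call to `advance` starts from scratch, replays the recorded history, performs at most one new step, and returns, as if the process had stopped between every step.

```python
from typing import Any, Callable, Dict, Generator, List, Tuple

Command = Tuple[str, Any]  # (activity name, argument)


def pizza_order() -> Generator[Command, Any, None]:
    # Workflow logic: yields commands, receives results, has no side effects.
    order_id = yield ("TakeOrder", "margherita")
    pizza_id = yield ("PreparePizza", order_id)
    yield ("Timer", 5)
    delivery = yield ("DeliverPizza", pizza_id)
    yield ("SendReceipt", delivery)


# Stub activities standing in for real workers.
ACTIVITIES: Dict[str, Callable[[Any], Any]] = {
    "TakeOrder": lambda pizza: f"ORDER-{pizza.upper()}",
    "PreparePizza": lambda order_id: f"PIZZA-{order_id}",
    "Timer": lambda seconds: f"{seconds}s elapsed",  # simulated, no real waiting
    "DeliverPizza": lambda pizza_id: "DELIVERED",
    "SendReceipt": lambda delivery: "RECEIPT SENT",
}


def advance(history: List[Tuple[str, Any]]) -> bool:
    """Replay the history, then run at most one new step. True when complete."""
    wf = pizza_order()
    try:
        command = next(wf)
        for name, result in history:
            # Replay: the workflow must ask for the same step that was recorded.
            assert name == command[0], "non-deterministic workflow"
            command = wf.send(result)
    except StopIteration:
        return True
    name, arg = command
    history.append((name, ACTIVITIES[name](arg)))  # the only real side effect
    return False


history: List[Tuple[str, Any]] = []
while not advance(history):
    pass  # each iteration behaves like a fresh process resuming the workflow
```

When the loop ends, `history` contains the same five entries as the trace above, even though no single call to `advance` ever ran the workflow from start to finish in one go.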

The key insight this example is trying to surface is that what matters is not the steps themselves; any system can execute steps. What matters is that the workflow is not stored as state we manage, but as history the runtime owns. This is why durable execution feels different from:

queues
cron jobs
orchestration services
worker pools

It is not just “a better way to coordinate tasks”, it is a system where execution itself becomes a recoverable data structure. Once we internalise this model, several earlier concepts become much clearer: