Mark Effect

Posted on May 31

Steered, Not Replayed: Execution Graphs vs Workflow Graphs

#architecture #backend #distributedsystems #systemdesign

Originally published at docs.cmdop.com/blog/execution-state-continuity-03-steered-not-replayed — part of the series The Command-Operator Execution Layer.

There are two fundamentally different ways to make a running computation survive failure, and the industry keeps confusing them because both ship under the word durable and both draw something they call an execution graph.

The first way is replay. You record every decision and side effect a program made into an append-only journal, and when the host dies you start the program over from its entry point on a fresh worker, except that the runtime feeds the journaled results back in so the re-execution lands on exactly the state it had before the crash. This is durable execution as Temporal, Cadence, Dapr, and Azure Durable Functions practice it. It is excellent engineering, and it is the right tool for an enormous class of problems.

The second way is steering. You do not re-derive state from a log; you maintain a live OS-level execution graph and keep it alive by directly observing the running system, so that interactive clients (a human, an AI agent, a monitoring service) can attach to it concurrently and act on it while it runs.

By execution graph this article means something concrete: the live OS process tree — real parent/child lineage — together with its associated PTY, file-descriptor, and socket resource state. It is an OS-level resource graph, not a steerable action-DAG; it carries no orchestration semantics of its own. "Graph" here is reserved for that resource topology and nothing more.
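As a rough illustration of that definition, the graph can be pictured as a tree of process records, each carrying its own resource state. The sketch below is a toy under stated assumptions: `ProcNode`, `build_tree`, and `descendants` are hypothetical names, and the sample PIDs, PTY path, and socket are made up rather than read from a real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProcNode:
    """One process in the live tree plus the resources it holds (hypothetical shape)."""
    pid: int
    ppid: int
    command: str
    pty: Optional[str] = None                          # controlling terminal, if any
    fds: List[str] = field(default_factory=list)       # open file descriptors
    sockets: List[str] = field(default_factory=list)   # open sockets
    children: List["ProcNode"] = field(default_factory=list)


def build_tree(nodes: List[ProcNode], root_pid: int) -> ProcNode:
    """Link each node to its parent by ppid and return the root."""
    by_pid: Dict[int, ProcNode] = {n.pid: n for n in nodes}
    for node in nodes:
        parent = by_pid.get(node.ppid)
        if parent is not None and node.pid != root_pid:
            parent.children.append(node)
    return by_pid[root_pid]


def descendants(node: ProcNode) -> List[int]:
    """PIDs below `node`, each parent listed before its own children."""
    found: List[int] = []
    for child in node.children:
        found.append(child.pid)
        found.extend(descendants(child))
    return found


# Made-up sample: a shell on a PTY, an ssh child holding a socket, a REPL with a worker.
sample = [
    ProcNode(100, 1, "bash", pty="/dev/pts/3", fds=["0", "1", "2"]),
    ProcNode(101, 100, "ssh build-host", sockets=["tcp:10.0.0.5:22"]),
    ProcNode(102, 100, "python3 -i"),
    ProcNode(103, 102, "python3 worker.py"),
]
root = build_tree(sample, root_pid=100)
print(descendants(root))
```

Nothing in this structure says what should happen next. It records lineage and resources only, which is the sense in which the graph carries no orchestration semantics.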

The first is reconstructed. The second is held. One is replayed; the other is steered. To be clear up front: steering live state is not new. A debugger attaching with gdb, a REPL, and a notebook kernel have always steered live execution. Each of those, though, is a single actor working against a session that dies with its host and carries no identity beyond it. The load this article puts on the word "steered" is narrower and newer: concurrent multi-operator steering of one identity-bearing live graph that survives host and transport change. This article is about why that distinction is not pedantry, and why agentic systems are about to need the second kind whether the vocabulary exists for it or not.

A short, fair account of durable execution by replay

Let's give replay-based durable execution the respectful, accurate description it deserves, because it is one of the most quietly important ideas in modern backend design.

The core problem durable execution solves is this: in an unreliable world — flaky networks, crashing nodes, downstream APIs that time out — you want to write ordinary-looking imperative code that runs to completion anyway. You want to write:

chargeCard(user)
sleep(30 days)
sendRenewalReceipt(user)

and have it be true that the charge happens once, the sleep survives a datacenter reboot, and the receipt eventually goes out — even if the machine that started the function no longer exists.

Durable-execution engines achieve this with event sourcing plus deterministic replay. The runtime does not try to serialize the language's native thread stack or heap (the JVM, V8, and the CLR do not natively let you freeze and ship a live call stack). Instead it separates code into two roles. Workflow code is the orchestration logic and it must be strictly deterministic. Any genuinely non-deterministic act — calling an external API, reading the clock, generating a random number, writing to a database — is pushed out into an Activity, and the result of each Activity is written to an append-only event history.

When the workflow worker crashes, the orchestrator detects the lost liveness and schedules the workflow on a different worker. That new worker re-runs the workflow code from the top. As the code re-executes and reaches each Activity call, the SDK intercepts it, finds the already-recorded result in the event history, and returns it immediately — without re-executing the side effect. The program "fast-forwards" through everything it already did and resumes exactly where it left off. Each workflow instance carries a stable workflow ID, so signals and queries can find it regardless of which physical worker is hosting it.
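The fast-forward behavior can be sketched in a few lines. This is a minimal toy, not any engine's SDK: `ReplayContext`, `WorkerCrash`, the `crash_after` hook, and the two activity functions are hypothetical names invented for the illustration, and the event history is just an in-memory list.

```python
class WorkerCrash(Exception):
    """Stands in for the host dying mid-workflow."""


class ReplayContext:
    """Toy worker: returns recorded activity results before running anything new."""

    def __init__(self, history=None, crash_after=None):
        self.history = list(history or [])  # (activity name, result), in call order
        self.cursor = 0                     # next history entry to consume
        self.executed = []                  # activities that really ran on this worker
        self.crash_after = crash_after      # simulation hook: die after N recorded events

    def activity(self, name, fn, *args):
        if self.cursor < len(self.history):
            # Already recorded: hand back the result without repeating the side effect.
            _, result = self.history[self.cursor]
            self.cursor += 1
            return result
        result = fn(*args)                  # first time: run the real side effect
        self.history.append((name, result))
        self.cursor += 1
        self.executed.append(name)
        if self.crash_after == len(self.history):
            raise WorkerCrash("host died after recording " + name)
        return result


charges, receipts = [], []


def charge_card(user):
    charges.append(user)
    return f"charge-{len(charges)}"


def send_receipt(user, charge_id):
    receipts.append((user, charge_id))
    return "sent"


def renewal_workflow(ctx, user):
    charge_id = ctx.activity("charge_card", charge_card, user)
    ctx.activity("send_receipt", send_receipt, user, charge_id)
    return charge_id


first = ReplayContext(crash_after=1)
try:
    renewal_workflow(first, "alice")        # charges the card, then the worker dies
except WorkerCrash:
    pass

second = ReplayContext(history=first.history)
result = renewal_workflow(second, "alice")  # re-runs from the top on a new worker
print(charges, receipts, second.executed)
```

The second worker runs the same workflow function from its first line, yet the card is charged only once because the recorded result is returned in place of the call.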

This is what makes "sleep for a month" real rather than a metaphor. When the workflow hits its 30-day delay, the worker frees all in-memory resources and the service registers a timer. Thirty days later a worker picks the task back up, replays the event log to rebuild the in-memory variables, and continues. Idle cost approaches zero. Long-running business processes — subscriptions, onboarding flows, multi-step sagas, human-approval chains — become ordinary code.

It is genuinely powerful, and the constraints are the source of the power: because state is derived from a deterministic re-execution of a log, the engine never has to capture volatile heap pointers or CPU registers, and it can run on stock language runtimes. The price is the determinism contract. Read the wall clock outside an Activity and your replay diverges; the rebuilt state no longer matches reality and the engine raises a non-determinism error to protect you from silent corruption.
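A toy version of that divergence check is sketched below. It assumes the simplest possible rule, comparing the activity name the code requests with the name at the same position in the history; real engines compare richer command metadata. `Replayer`, `NonDeterminismError`, and both workflow functions are hypothetical names.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for something the history does not contain."""


class Replayer:
    def __init__(self, history=None):
        self.history = list(history or [])  # (activity name, result), in call order
        self.cursor = 0

    def activity(self, name, fn):
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r}, code asked for {name!r}"
                )
            self.cursor += 1
            return result
        result = fn()
        self.history.append((name, result))
        self.cursor += 1
        return result


def broken_workflow(ctx, wall_clock_hour):
    # Bug on purpose: branches on a clock value read outside any activity.
    if wall_clock_hour < 12:
        return ctx.activity("send_morning_email", lambda: "morning")
    return ctx.activity("send_evening_email", lambda: "evening")


def fixed_workflow(ctx, read_hour):
    # The clock read is itself an activity, so replay sees the recorded hour.
    hour = ctx.activity("read_clock", read_hour)
    if hour < 12:
        return ctx.activity("send_morning_email", lambda: "morning")
    return ctx.activity("send_evening_email", lambda: "evening")


first = Replayer()
broken_workflow(first, wall_clock_hour=9)  # original run at 09:00
try:
    broken_workflow(Replayer(first.history), wall_clock_hour=15)  # replay at 15:00
    diverged = False
except NonDeterminismError:
    diverged = True

original = Replayer()
fixed_workflow(original, read_hour=lambda: 9)
replayed = fixed_workflow(Replayer(original.history), read_hour=lambda: 15)
print(diverged, replayed)
```

In the fixed variant the replay at 15:00 still takes the morning branch, because the hour comes from the history rather than from the machine doing the replay.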

What replay actually models — and what it doesn't

Here is the key observation. Replay reconstructs logical workflow state: which steps completed, what they returned, where the program counter logically sits in the orchestration graph. That logical graph is the workflow graph — a state machine over completed activities and pending steps.

It does not model, and structurally cannot model, live OS-level execution state.

A deterministic replay engine cannot represent a running bash process mid-command with a half-filled input buffer. It cannot hold a PTY whose scrollback a human is reading right now. It cannot keep a TCP socket in flight, an ssh child waiting on a prompt, a long-lived REPL with a populated namespace, or a process tree where killing the parent should cascade to the children. None of that is journalable as a sequence of deterministic decisions, because none of it is a sequence of deterministic decisions — it is the messy, non-deterministic, concurrently-mutated reality of a live operating system. Replay deliberately forbids exactly the thing a live interactive environment is made of.
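To make "live state" concrete, the sketch below starts a shell on a pseudo-terminal and shows that a value set by one command exists only inside that running process. It assumes a Unix-like host with `sh` on the PATH; `read_until` is a hypothetical helper, and the shell may print harmless job-control warnings because the PTY is not made its controlling terminal.

```python
import os
import pty
import select
import subprocess
import time


def read_until(fd, marker, timeout=5.0):
    """Collect PTY output until `marker` appears or the timeout expires."""
    deadline = time.monotonic() + timeout
    buf = b""
    while marker not in buf and time.monotonic() < deadline:
        ready, _, _ = select.select([fd], [], [], 0.1)
        if not ready:
            continue
        try:
            chunk = os.read(fd, 4096)
        except OSError:
            break                      # the far side of the PTY went away
        if not chunk:
            break
        buf += chunk
    return buf


master, slave = pty.openpty()
shell = subprocess.Popen(
    ["sh"], stdin=slave, stdout=slave, stderr=slave,
    close_fds=True, start_new_session=True,
)
os.close(slave)                        # the child keeps its own copy of the slave end

os.write(master, b"X=41\n")                       # state now lives in the shell process
os.write(master, b"echo result=$((X+1))\n")       # a later command depends on that state
out = read_until(master, b"result=42")

shell.kill()                           # interactive shells may ignore SIGTERM
shell.wait()
os.close(master)
print(b"result=42" in out)
```

The variable `X` is never written to a journal. It sits in the memory of one process, which is why killing that process loses it and why a replayed log of decisions has nothing to rebuild it from.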

Stated as plainly as possible:

Execution continuity ≠ workflow orchestration. Durable-execution engines reconstruct a logical workflow by deterministic replay; they do not maintain a live OS-level execution graph that heterogeneous interactive clients attach to. Steered, not replayed.

Now consider the opposite extreme, because it clarifies the middle. CRIU (Checkpoint/Restore in Userspace) is the purest form of "capture the live OS state": it uses ptrace to seize a process, dumps its memory pages, CPU registers, file-descriptor table and even TCP connection state (via TCP_REPAIR), and can restore the whole thing — bit for bit — on another machine. CRIU captures precisely what replay throws away.

But CRIU is a snapshot mechanism, not a continuity architecture. It is single-process-tree oriented; it has no control plane, no logical identity that outlives the snapshot, no way for multiple interactive clients to attach to one coherent live object, no model for serializing and attributing writes when more than one operator acts on it, no routing fabric that lets "the execution" be addressed independent of which host currently holds it. It is architecture-locked, too — restore generally demands a matching ISA and kernel. CRIU answers "how do I freeze this one process," not "how do I make a live execution a first-class, addressable, multi-actor object."

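So we have three points on a map, not two: replay-based durable execution at one edge, the CRIU single-process snapshot at the other, and the live execution-state graph between them.

The industry has names for the two edges but not for the center column.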

The comparison, dimension by dimension

The picture above lays the three columns side by side; here is the same comparison spelled out dimension by dimension.

Dimension	Replay-based durable execution	Live execution-state graph	CRIU single-process snapshot
State model	Logical workflow state (completed steps, activity results, logical position)	Live OS execution graph: process tree with lineage, PTY, file descriptors, sockets — held live	Raw OS state of one process tree (pages, regs, fds, TCP), serialized to disk
Determinism requirement	Mandatory; non-determinism outside an activity diverges	None; the live system is inherently non-deterministic and that's fine	None; captures state as-is, no re-exec ever happens
What's persisted	Append-only event history (inputs / outputs / side effects)	The live execution graph as a first-class, addressable object	A point-in-time image; nothing between snapshots
Multi-client interactive attach (ownerless)	No — clients send signals/queries to a logical instance; no shared live surface	Yes — heterogeneous clients attach concurrently to one ownerless object; no privileged host-occupant	No — restore yields one process for one restorer; no shared live surface
AI / human steering	Drive the workflow between steps via signals; cannot grab a live shell mid-run	Observe and mutate the running environment mid-flight as operator; hand-off live	None while running; you freeze, ship, thaw — you do not steer
Recovery method	Re-run from entry point, replay journal to rebuild state	Re-home the live graph where checkpoint available; else re-establish from persisted session state	Restore the dumped image on a (matching) host

The shape of the table is the whole argument. Replay buys you indefinite, cheap, deterministic durability for logical processes. CRIU buys you a faithful freeze of one live process tree. Neither buys you a live execution graph that several operators can observe at once and steer under serialized, attributed authority, and that — while the live graph survives — re-homes as an identity rather than as a re-derivation or a re-thaw. That center column is a distinct architectural category.

A precise word on recovery, because the table's recovery row carries a condition that's easy to over-read. The "identity survives host change" claim holds where checkpoint/restore of the live graph is available: the still-live graph is re-homed, identity intact, and recovery is genuinely not a re-derivation. But that capability is not unconditional. When the host dies with no checkpoint of the live graph, the live process tree is simply gone — what persists is the identity and the durable session state, and recovery then re-establishes the environment from that persisted session state. That re-establishment is, honestly, closer to a reconstruction than to re-homing a live graph. The steered-not-replayed thesis is a claim about recovery mode when the live graph survives; it does not claim that a live OS process tree can be conjured back out of nothing. Where the graph is gone, only the identity and session state carry across — and the category's job there is to make that boundary explicit rather than pretend the live graph is immortal.
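The conditional in that paragraph can be written down directly. The sketch is only a decision rule under assumed names: `ExecutionRecord`, `recover`, the checkpoint handle string, and the session-state dictionary are all hypothetical, and no real checkpoint or restore happens here.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ExecutionRecord:
    identity: str                          # stable logical identity
    host: str                              # the single host currently holding it
    session_state: dict                    # durable session state (cwd, env, and so on)
    live_checkpoint: Optional[str] = None  # hypothetical handle to a live-graph checkpoint


def recover(record: ExecutionRecord, new_host: str) -> Tuple[str, ExecutionRecord]:
    """Pick the recovery mode; the identity carries across in both cases."""
    if record.live_checkpoint is not None:
        mode = "re-home"        # the live graph survives and moves, identity intact
    else:
        mode = "re-establish"   # the live graph is gone; rebuild from session state
    return mode, replace(record, host=new_host)


with_ckpt = ExecutionRecord("exec-7f3a", "host-a", {"cwd": "/srv/app"}, "ckpt-001")
without_ckpt = ExecutionRecord("exec-9b21", "host-a", {"cwd": "/srv/app"})

mode_a, moved_a = recover(with_ckpt, "host-b")
mode_b, moved_b = recover(without_ckpt, "host-b")
print(mode_a, moved_a.identity, mode_b, moved_b.identity)
```

Returning the mode alongside the record keeps the boundary visible to callers, so a reconstruction is never reported as if the live graph had survived.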

A line through history

The lineage helps locate where this category sits, because each prior era solved one axis and dropped the others (a screen → tmux → tmate → Jupyter → Guacamole → cloud-workspace → agent-runtime arc runs through the whole series; here we trace the execution-continuity thread specifically).

Distributed-OS process migration (Sprite, Mosix, Locus, Condor, late 1980s–1990s). The first serious attempt to make a live process outlive its host. Sprite migrated a running process to a new node, forwarding host-specific syscalls back to a "home node" via kernel RPC; Mosix shipped the user address space between processors. Visionary — and fatally dependent on the home node: a partition or a home-node crash killed every migrated process. Live state, no durable identity.
CRIU (2011). Migration done right at the snapshot level: userspace ptrace seizure, full memory/register/fd/TCP capture, restore anywhere with a matching kernel and ISA. It nailed capturing live OS state — and stopped exactly there. No control plane, no multi-client coherence, no logical identity layer.
Durable-execution engines (Temporal, Cadence, Dapr, Azure Durable Functions, ~2014 onward). Solved durability and identity at the logical level — stable workflow IDs, indefinite sleeps, exactly-once steps, scale-to-zero while idle — by abandoning live OS state entirely in favor of deterministic replay. Durable identity, no live state.

Lay those out and the gap is obvious. Sprite had live state but no durable identity. Temporal has durable identity but no live state. CRIU can capture live state but offers no continuity architecture around it. The execution-state-continuity direction is the synthesis the lineage keeps pointing at but never reaches: a live OS-level execution graph that has a stable logical identity and is concurrently observable and serially steerable by multiple operators with no privileged host-occupant, that is, an ownerless identity.

That last qualifier is load-bearing, and it is sharper than "survives host change." Durable hosting alone is no longer rare: a collaboration tool on a cloud backend keeps a shared session alive across a dropped laptop too. What none of the prior art has is ownerless identity: every prior shared-session design routes through one privileged occupant whose departure ends the session. The synthesis axis is therefore not bare host-durability but concurrent multi-operator steering with no privileged owner, an identity that no single occupant, including the one that created it, can take down by leaving.

Note what is and isn't distributed here. The live graph is single-homed by design: it runs on one host at a time, and there is no replicated copy executing elsewhere. What is distributed is the access topology: the operators and clients that reach the graph are spread across machines and transports, addressing one single-homed execution object rather than a replicated one. "Distributed access to a single-homed execution object" is the precise claim, not "a distributed object."
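Both properties, ownerless identity and distributed access to a single-homed object, fit in a small model. Everything here is hypothetical and purely illustrative: `LiveExecution`, `Router`, the identity string, and the host names are invented for the sketch and describe no particular system's API.

```python
class LiveExecution:
    """Hypothetical single-homed execution object with an ownerless identity."""

    def __init__(self, identity, host, creator):
        self.identity = identity
        self.host = host            # exactly one host at a time, never a replica set
        self.creator = creator      # kept for attribution only, grants no special rights
        self.operators = {creator}
        self.alive = True

    def attach(self, operator):
        self.operators.add(operator)

    def detach(self, operator):
        # No occupant is privileged: even the creator leaving does not end the execution.
        self.operators.discard(operator)

    def rehome(self, new_host):
        # The identity stays fixed while the host underneath it changes.
        self.host = new_host


class Router:
    """Resolves an identity to whichever host currently holds the execution."""

    def __init__(self):
        self._by_identity = {}

    def register(self, execution):
        self._by_identity[execution.identity] = execution

    def resolve(self, identity):
        return self._by_identity[identity].host


router = Router()
session = LiveExecution("exec-7f3a", host="host-a", creator="alice")
router.register(session)

session.attach("agent-1")
session.detach("alice")      # the creator leaves; the execution carries on
session.rehome("host-b")     # the host changes; the address operators use does not
print(session.alive, sorted(session.operators), router.resolve("exec-7f3a"))
```

Operators only ever hold the identity. Which host answers for it is the router's concern, and that separation is what lets access be distributed while the execution itself stays on one machine.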

Why this matters now

For a decade the gap was tolerable because the thing on the other end of an execution was a program — deterministic, headless, content to run to completion and report back. Replay fits that world perfectly. You do not need to "take over" a Temporal workflow mid-step; you need it to finish reliably.

Agentic systems break that assumption in two specific ways.

First, agents must pause for a human and then resume a live environment. Not resume a logical position in a state machine — resume an actual shell with a half-built project in it, a dev server still listening on a port, a database connection still open, a Python REPL with two hours of populated namespace. Replay can pause a workflow for a month and rebuild its variables; it cannot rebuild a live process tree and a PTY, because those were never deterministically derivable in the first place. The thing the agent is working inside is exactly the thing replay does not model.

Second, a human needs to take over a running session mid-flight — and then hand it back. The agent is three commands into a deploy, something looks wrong, an engineer wants to attach to the same live execution the agent is in, inspect it, type a few commands, and let the agent continue from the now-modified reality. That is multiple operators observing one live execution at once, with the write handed across actors as transferable, attributed authority — a turn passes from agent to human and back, not two hands fighting over one keyboard. Replay has no shared live surface to attach to — it linearizes signals to a logical instance, it does not host a live shell two parties can both touch. CRIU has the live surface but no multi-client coherence — restore hands one restorer one process; there is no model for several operators observing one process while authority to act on it passes between them, no routing identity for "the session" independent of the host.
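One way to picture transferable, attributed write authority is a single writer slot with an audit trail. The sketch assumes the simplest policy, where only the current writer can hand off and only to someone already observing; `SharedExecution`, its methods, and the sample commands are hypothetical.

```python
class SharedExecution:
    """Many observers, one writer at a time, every action attributed."""

    def __init__(self, writer):
        self.observers = {writer}
        self.writer = writer
        self.audit = []                    # (operator, action), in order

    def observe(self, operator):
        self.observers.add(operator)

    def write(self, operator, command):
        if operator != self.writer:
            raise PermissionError(f"{operator} does not hold write authority")
        self.audit.append((operator, command))

    def hand_off(self, current, successor):
        if current != self.writer:
            raise PermissionError(f"{current} cannot hand off authority it does not hold")
        if successor not in self.observers:
            raise ValueError(f"{successor} is not attached to this execution")
        self.audit.append((current, f"hand-off to {successor}"))
        self.writer = successor


deploy = SharedExecution(writer="agent")
deploy.observe("engineer")                 # the engineer watches the same live session

deploy.write("agent", "make deploy")
try:
    deploy.write("engineer", "make rollback")   # observing is not the same as writing
    blocked = False
except PermissionError:
    blocked = True

deploy.hand_off("agent", "engineer")       # the turn passes to the human
deploy.write("engineer", "tail -n 50 deploy.log")
deploy.hand_off("engineer", "agent")       # and back again
deploy.write("agent", "make verify")
print(blocked, deploy.audit)
```

The audit list is the attribution: every command and every hand-off names the operator responsible, so the agent resumes with a record of what the engineer changed.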

The agentic workload sits precisely in the hole between them. It needs replay's durable, host-independent identity and continuity, and it needs CRIU-grade live OS state, and it needs something neither has: concurrent observation of one live execution with serialized, attributed, transferable write authority across multiple operators. Pause for human input and resume a live environment; let a human grab the wheel of a running session and give it back. Replay cannot do live shared steering at all. Raw snapshot cannot do multi-client coherence.

You can watch the convergence pressure in the field already, and it splits along the two axes. On continuity (a live environment that persists across the dying client), sandbox runtimes are reaching for it from different starting points, the way E2B's whole-guest snapshots capture live OS state much as CRIU does. On the harder multi-actor / transferable-authority axis (more than one party acting on one live execution), the cleanest signal is Warp, with multiple humans and an agent steering a single live session under grant-based edit access. But even the systems that admit a second actor admit it as a guest of a privileged host whose departure ends the session. That is the line the center column draws and the others do not cross: not "survives the host changing," which a cloud-backed collaboration tool now does too, but an ownerless identity with no privileged host-occupant. They are converging into the center column from different edges, without yet having agreed on a name for it.

Steering is the verb that marks the column

The center column deserves its name. It is the command-operator execution layer: live OS-level execution state — the process tree with its lineage, the PTY, the file descriptors, the sockets — elevated to a single-homed, first-class, persistent, addressable object, decoupled from any one client or transport, that humans, AI agents, devices, and services reach as operators (peers in one live execution) over distributed access. The operative distinction it draws is the operator model (operators reach a live, durable, ownerless, identity-bearing execution — one with no privileged host-occupant — and act on it under transferable authority) against the controller model (one actor drives a process to completion). What is distributed is the operator and client topology, not the execution state itself.

Hold the slogan, because it is the cleanest way to keep the three paradigms apart: a durable-execution engine replays a logical workflow; CRIU freezes and thaws one process; an execution-state system steers a live graph. Replayed, frozen, and steered are three different verbs, and only one of them describes a running environment that a human and an agent can stand inside together.

The systems that exist today are not wrong; they are aimed elsewhere. Temporal and Orleans are superb at making logical processes invincible, and that will remain true and important. But the agentic era needs the other thing too, and the other thing has been the missing column on the map all along.

One implementation built explicitly around this center column (a live, single-homed, addressable, multi-operator execution graph that is steered rather than replayed) is cmdop. It is offered here as a reference point for what the category looks like in practice, not as the category itself.

Next in the series — Part 4 of 7: "AI as Operator, Not Controller: The Multi-Actor Execution Model." Why "the model calls tools" is the wrong shape for systems where humans and agents share one live execution.

Previous — Part 2 of 7: Persistent Memory Is Not Persistent Execution State
