About 2 years ago, I needed a way to run a handful of tasks reliably: validate an order, charge a card, check inventory — in parallel where possible, with retries, and crash recovery. That's it.
So I evaluated Temporal. I spun up the server cluster, read the SDK docs, and hit the first wall: a whole complex framework to learn, a platform to operate, and a heavy upfront investment. I moved on to Airflow — wrote a DAG file, set up the scheduler, the webserver, and the metadata DB, just to run three functions. I looked at Celery, Prefect, and Step Functions. Each time, I was paying an infrastructure or complexity tax wildly disproportionate to what I actually needed.
So I built my own: Sayiir. This post isn't a pitch for it — it's the five design decisions I wrestled with and what I landed on. If you've ever been curious about what goes into a workflow engine, or if you're building something similar, maybe this saves you some wrong turns.
Decision 1: How do you represent a workflow?
The first question is deceptively simple: what data structure describes "do A, then B, then C in parallel with D, then E"?
Most engines use a directed acyclic graph — nodes are tasks, edges are dependencies. I went with a continuation tree: a recursive structure where each node carries a pointer to what comes next.
enum WorkflowContinuation {
    Task { id: String, next: Option<Box<Self>>, ... },
    Fork { branches: Box<[Arc<Self>]>, join: Option<Box<Self>>, ... },
    Loop { body: Box<Self>, next: Option<Box<Self>>, ... },
    // Delay, AwaitSignal, Branch, ChildWorkflow...
}
Two reasons I preferred this over a flat graph:
"Where am I?" is trivial. In a continuation tree, execution position is a pointer to a node. In a DAG, you have to track completed nodes and compute the frontier. That simplicity pays off hugely when you need to checkpoint and resume — which is the entire point of a durable engine.
Nesting is natural. Loops, conditionals, and child workflows are just recursive nodes. In a flat DAG, these become subgraphs with synthetic entry/exit nodes and painful bookkeeping.
The trade-off: continuation trees can't represent arbitrary dependency graphs — only structured concurrency patterns (fork/join, not "A and B both independently feed into C"). In practice, that constraint hasn't been limiting. Most real workflows are structured.
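To make the "execution position is a pointer" point concrete, here is a minimal Python sketch — illustrative only, not Sayiir's actual types. A sequential chain of `Task` nodes is walked by advancing a single pointer; nothing tracks a completed set or computes a frontier:

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Task:
    """One node in a (sequential-only) continuation chain."""
    id: str
    next: Optional["Task"] = None

def walk(node: Optional[Task]) -> Iterator[str]:
    """Execution position is just this pointer; resuming means
    restarting the walk from whatever node was checkpointed."""
    while node is not None:
        yield node.id
        node = node.next

# A three-step workflow: validate -> charge -> finalize
wf = Task("validate", Task("charge", Task("finalize")))
```

Fork and Loop nodes extend the same idea recursively; the position is then a path into the tree rather than a single pointer.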
Decision 2: How do you survive crashes?
This is where workflow engines diverge the most. The dominant approach — used by Temporal, Azure Durable Functions, and Restate — is deterministic replay: when a process crashes, re-execute the entire workflow from the beginning, skipping steps that already completed by matching them against an event log.
It works. But it has a brutal constraint: your workflow code must produce the exact same sequence of operations on every execution. No datetime.now(), no random(), no API calls that might return different data. You have to split your code into "workflow code" (deterministic orchestration) and "activities" (the actual work). It's a programming model you have to learn, and the bugs you get when you violate it are subtle and production-only.
I went with continuation-based checkpointing instead. The idea is simple:
- Execute a task.
- Checkpoint the result and the current position in the continuation tree.
- Move to the next node.
- On crash, load the last checkpoint and resume from that position.
No re-execution. No event log to replay. No determinism requirements. Your tasks are just functions — call whatever you want inside them.
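Here is a minimal sketch of that loop, using SQLite as a stand-in store. The schema and function names are illustrative, not Sayiir's; the point is that a crashed run can be restarted with the same call and completed tasks are read back rather than re-executed:

```python
import json
import sqlite3

def run_with_checkpoints(tasks, instance_id, initial, db):
    """Run (task_id, fn) pairs in order, checkpointing after each task.
    On restart, completed tasks are loaded from the store, never re-run."""
    done = dict(db.execute(
        "SELECT task_id, output FROM checkpoints WHERE instance_id = ?",
        (instance_id,)))
    value = initial
    for task_id, fn in tasks:
        if task_id in done:
            value = json.loads(done[task_id])  # resume: skip completed task
            continue
        value = fn(value)
        db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)",
                   (instance_id, task_id, json.dumps(value)))
        db.commit()  # checkpoint before advancing the position
    return value
```

Because each task's output is persisted before the engine advances, the task bodies are free to be non-deterministic — there is no replay that could diverge.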
Here's what the checkpoint looks like (simplified):
struct WorkflowSnapshot {
    instance_id: String,
    state: WorkflowSnapshotState, // InProgress | Completed | Failed | Paused | Cancelled
    // ...
}
enum ExecutionPosition {
    AtTask { task_id: String },
    AtFork { fork_id: String, completed_branches: HashMap<String, TaskResult> },
    AtDelay { delay_id: String, wake_at: DateTime<Utc> },
    InLoop { loop_id: String, iteration: u32 },
    // ...
}
The ExecutionPosition enum knows exactly where to resume — which task, which fork branch, which loop iteration. Combined with the map of completed tasks and their outputs, you have everything you need to pick up exactly where you left off.
The trade-off? Checkpointing is more write-heavy than replay (you write to storage after every task, not just at the end). For most workloads, this is fine — a PostgreSQL INSERT per task is negligible compared to the task itself. But if you have workflows with thousands of sub-millisecond tasks, replay might serve you better.
Decision 3: Where does state live?
A workflow engine is only as reliable as its storage. But I didn't want to bake in a specific database — I wanted to start with in-memory for development and tests, use PostgreSQL in production, and leave the door open for other backends without touching the engine.
The answer was hexagonal architecture — the core engine talks to a trait (interface), not to a concrete database:
trait SnapshotStore {
    fn save_snapshot(&self, snapshot: &WorkflowSnapshot) -> Result<()>;
    fn load_snapshot(&self, instance_id: &str) -> Result<WorkflowSnapshot>;
    fn save_task_result(&self, instance_id: &str, task_id: &str, output: Bytes) -> Result<()>;
    // ...
}
trait SignalStore: SnapshotStore {
    fn store_signal(&self, instance_id: &str, kind: SignalKind, ...) -> Result<()>;
    fn send_event(&self, instance_id: &str, signal_name: &str, payload: Bytes) -> Result<()>;
    // + cancel, pause, unpause...
}
The persistence layer is decomposed into focused sub-traits: SnapshotStore (5 methods) is enough for single-process runners. SignalStore adds lifecycle operations (cancel, pause). TaskClaimStore is opt-in, only needed for distributed workers claiming tasks with TTL-based locks.
One non-obvious decision: save_task_result is separate from save_snapshot. When two fork branches complete concurrently, you don't want them to overwrite each other's snapshot. Instead, each branch writes its own result atomically, and the join step reads them all. This was a bug I hit early and was glad to have caught before anyone else did.
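The shape of that fix is easy to sketch. In this illustrative Python store (names are mine, not Sayiir's), each branch writes its own `(instance, task)` row under a lock, so two branches finishing at the same moment never clobber a shared snapshot blob:

```python
import threading

class InMemoryResults:
    """Illustrative store: each fork branch writes its own keyed row,
    so concurrent branches never overwrite each other's output."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def save_task_result(self, instance_id: str, task_id: str, output: bytes):
        # Atomic per-branch write; no read-modify-write of a shared snapshot.
        with self._lock:
            self._rows[(instance_id, task_id)] = output

    def branch_results(self, instance_id: str, task_ids):
        """What the join step reads once every branch has reported."""
        return {t: self._rows[(instance_id, t)] for t in task_ids}
```

In the PostgreSQL backend the same property comes for free from one `INSERT` per branch result.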
Today, there are two backends: InMemory (zero setup, used in the playground and tests) and PostgreSQL (ACID transactions, advisory locks for distributed task claiming). The trait boundary means adding a new backend — Redis, DynamoDB, SQLite — is a contained effort with no changes to the engine.
Decision 4: One engine, three languages
I write Python at work and TypeScript for side projects. I wanted the same workflow engine in both, without maintaining two implementations that inevitably drift apart. Writing the core in Rust and exposing it through native bindings was the obvious play.
Python gets bindings through PyO3. The Rust core handles all orchestration, checkpointing, and task claiming. Python provides task implementations — plain functions stored in a dict, looked up by task ID at runtime. The @task decorator is pure Python sugar that registers a callable; when the engine needs to run a task, it calls back into Python with the deserialized input and gets the output back.
Node.js uses NAPI-RS. This one had a subtle challenge: V8's microtask queue only drains at the outermost native callback boundary. If you try to await a JavaScript promise from inside a Rust loop, it never resolves. The solution was a stepper pattern — the Rust engine yields (taskId, input) pairs to JavaScript, JavaScript runs the task and feeds the output back, and Rust advances the continuation. It's cooperative multitasking across the FFI boundary.
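The stepper pattern is easiest to see in miniature. Here is a Python-generator analogue — the real version crosses the Rust/JS FFI boundary, and all names here are illustrative. The "engine" yields a `(task_id, input)` pair and suspends; the "host" runs the task and sends the output back, so control returns to the host between every step and its event loop can drain:

```python
def engine(task_ids, initial):
    """Engine side: yield (task_id, input), suspend until the host
    sends the task's output back."""
    value = initial
    for task_id in task_ids:
        value = yield (task_id, value)

def drive(stepper, registry):
    """Host side: run each yielded task, feed the result back, repeat.
    Between sends, the host is back at its own top level."""
    output = None
    try:
        while True:
            task_id, inp = stepper.send(output)
            output = registry[task_id](inp)
    except StopIteration:
        return output
```

In the Node.js binding the `drive` role is played by JavaScript, which is exactly what lets promises resolve between steps.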
Rust is first-class. A #[task] proc macro generates a struct implementing the CoreTask trait, handling serialization, retry policies, and timeout configuration. A workflow! macro lets you declare workflows inline:
let workflow = workflow! {
    name: "order-processing",
    steps: [validate_order || charge_payment || check_inventory, finalize]
};
The key insight was keeping the language bindings thin. They're FFI wrappers and nothing more. All logic lives in Rust crates. When I add a feature — say, loop support — I implement it once in sayiir-core, and the Python and Node.js bindings get it by updating the bridge code. The surface area for drift is minimal.
Decision 5: What should the API feel like?
The whole point was "just write normal code." If the API itself felt foreign, I'd have failed.
I settled on a builder pattern with type-tracked chaining. In Python and TypeScript, it looks like this:
from sayiir import task, Flow, run_workflow

@task
def fetch_user(user_id: int) -> dict:
    return {"id": user_id, "name": "Alice"}

@task(timeout="30s", retries=3)
def send_email(user: dict) -> str:
    return f"Sent welcome to {user['name']}"

workflow = Flow("welcome").then(fetch_user).then(send_email).build()
result = run_workflow(workflow, 42)
And in TypeScript:

const fetchUser = task("fetch-user", (id: number) => ({ id, name: "Alice" }));
const sendEmail = task("send-email", (user) => `Sent to ${user.name}`, { timeout: "30s", retries: 3 });

const workflow = flow<number>("welcome")
  .then(fetchUser)
  .then(sendEmail)
  .build();
const result = await runWorkflow(workflow, 42);
No YAML. No custom DSL. Tasks are regular functions. The workflow is built with method chaining. When you need parallel execution, .fork() opens a builder that collects .branch() calls and closes with .join(). It reads top-to-bottom, like the execution order.
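Here is a toy Python version of that builder shape. The API names follow the description above; the real implementation lives in the Rust core, and the real engine runs branches in parallel, whereas this sketch runs them sequentially:

```python
class ForkBuilder:
    """Collects .branch() calls until .join() closes the fork."""
    def __init__(self, flow):
        self._flow = flow
        self._branches = []

    def branch(self, fn):
        self._branches.append(fn)
        return self

    def join(self, fn):
        branches = list(self._branches)
        # The join step receives branch outputs in declaration order.
        self._flow._steps.append(lambda v: fn([b(v) for b in branches]))
        return self._flow

class Flow:
    def __init__(self, name):
        self.name = name
        self._steps = []

    def then(self, fn):
        self._steps.append(fn)
        return self

    def fork(self):
        return ForkBuilder(self)

    def build(self):
        steps = list(self._steps)
        def run(value):
            for step in steps:
                value = step(value)
            return value
        return run
```

Reading the chain top-to-bottom — then, fork, branches, join — matches the order the steps execute in, which is the property the API is built around.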
Under the hood, the Rust builder uses phantom type parameters to track input/output types through the chain — so then(fetchUser).then(sendEmail) is statically checked: sendEmail must accept what fetchUser returns. The Python and TypeScript wrappers can't enforce this at build time (dynamic languages), but you get a clear error at workflow construction if types don't line up.
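The same trick can be approximated in Python's type hints — a sketch, not Sayiir's wrapper code. A generic `TypedFlow[A, B]` carries the workflow input type `A` and the current output type `B`, and each `.then()` shifts `B`, so a static checker like mypy (rather than the runtime) flags a step whose parameter doesn't match the previous output:

```python
from typing import Callable, Generic, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")

class TypedFlow(Generic[A, B]):
    """A = workflow input type, B = current output type (the 'phantom')."""
    def __init__(self, steps: List[Callable]):
        self._steps = steps

    def then(self, fn: Callable[[B], C]) -> "TypedFlow[A, C]":
        # Shifts the tracked output type from B to C; mypy rejects a step
        # whose parameter type doesn't accept the previous step's output.
        return TypedFlow(self._steps + [fn])

    def run(self, value: A) -> B:
        for step in self._steps:
            value = step(value)
        return value

def start(name: str) -> "TypedFlow[A, A]":
    """A fresh flow whose output type equals its input type."""
    return TypedFlow([])
```

At runtime the type parameters are erased, which mirrors why the dynamic-language wrappers can only check at construction time.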
Switching from in-memory to durable is a one-line change:
# In-memory (development)
result = run_workflow(workflow, 42)
# Durable (production)
backend = PostgresBackend("postgresql://localhost/sayiir")
status = run_durable_workflow(workflow, "welcome-42", 42, backend=backend)
Same workflow definition. Same task functions. You just plug in a backend and an instance ID.
What's working, and what's still hard
Sayiir is at v0.3 with sequential, parallel, conditional, and looping workflows all stable. Crash recovery, durable delays, external signals, distributed workers over PostgreSQL, and OpenTelemetry tracing are all in production shape. It supports Python 3.10–3.13, Node.js 18/20/22, and Rust natively. There's an interactive playground if you want to try it without installing anything.
What's still hard: workflow versioning (what happens when you change a workflow definition while instances are in flight), streaming outputs from long-running tasks, and edge runtime support (Cloudflare Workers, where you can't run a full Rust binary). These are active areas of work.
If any of this was interesting — whether you're building your own engine, evaluating workflow tools, or just like reading about systems design — the [source is on GitHub](https://github.com/sayiir/sayiir) and the docs have a 5-minute getting-started guide.
Feedback and contributions are welcome!