DEV Community

Theodor Diaconu

Durable Workflows for Agents

Everyone is building "agent workflow" libraries right now.

Many of them optimize the easiest part.

It is not that hard to wire:

  • model calls
  • tool calls
  • retries
  • fallback models
  • structured outputs

Give that task to a strong model and you can get a surprisingly decent orchestration layer very quickly.

I am not saying those abstractions are useless.

I am saying they are usually not the real problem.

The real problem begins when your workflow lasts longer than the HTTP request that created it.

That is where agent systems stop being a prompt-engineering exercise and start becoming a systems problem. One answer is Runner's durable workflows: https://runner.bluelibs.com/guide/durable-workflows

The Real Problem Is Time

Real agent workflows often span minutes, hours, or days.

They wait for:

  • human approval
  • revised input
  • compliance review
  • a webhook from another system
  • a scheduled retry window
  • a sub-workflow to finish somewhere else

And while they wait, life happens:

  • your server restarts
  • you deploy a new version
  • a worker crashes
  • a queue redelivers
  • another node picks up the execution
  • a user wants to cancel the whole thing right now

That is the part many agent libraries underplay.

They help you choreograph the model.
They do not help you survive time.

And sadly, a JavaScript closure is not a distributed systems primitive. It has great vibes, but terrible crash recovery.

Why Durable Workflows Fit Agent Systems So Well

Durable workflows solve the part that actually hurts:

  • they persist progress
  • they survive crashes and restarts
  • they let you wait for outside signals cleanly
  • they support safe cancellation
  • they scale horizontally without relying on in-memory timers

This makes them an excellent fit for agent systems, because agent systems are full of pauses, retries, gates, and long-lived state transitions.

You do not adopt durable workflows because your LLM is clever.
You adopt them because your servers are mortal.

The Mental Model

A durable workflow is a normal async function with checkpoints.

In Runner, the workflow does not resume from some frozen JavaScript instruction pointer.

Instead, on recovery or wake-up, it re-runs from the top and fast-forwards through completed steps using persisted results.

The important rule is simple:

Put side effects inside step(). Code outside a durable step may run more than once.

That rule is what prevents "oops, we charged the customer twice" or "great news, we emailed them three times."

And one more practical note: keep step IDs stable for in-flight workflows across deploys.

This mental model feels a bit like useEffect() for the backend, if useEffect() had persistence, signals, a helmet, and significantly better survival instincts.

A Tiny Workflow Example

This is the shape I want from an agent workflow:

// assume Approved (a typed event) and the durable resource are declared elsewhere
const workflow = r
  .task("publishArticleWorkflow")
  .dependencies({ durable })
  .tags([
    tags.durableWorkflow.with({
      category: "content",
      signals: [Approved],
    }),
  ])
  .run(async (input: { articleId: string }, { durable }) => {
    const d = durable.use();

    await d.step("generate-draft", async () => {
      // call model, save draft, persist metadata
      return { articleId: input.articleId };
    });

    await d.waitForSignal(Approved, { stepId: "wait-approval" });
    await d.sleep(60 * 60 * 1000, { stepId: "cooldown-before-publish" });

    await d.step("publish", async () => {
      // publish once, even across crashes/recovery
      return { ok: true };
    });
  })
  .build();

That is a much better foundation for AI work than "please do not restart this container for the next six hours."

Runner's Durable Workflows

Runner is a backend framework, but its durable workflow capability can be used on its own inside an existing app.

Today, Runner durable workflows live on the Node side via @bluelibs/runner/node.

The setup is intentionally small:

import { r, run } from "@bluelibs/runner";
import { resources, tags } from "@bluelibs/runner/node";

const Approved = r.event<{ approvedBy: string }>("approved").build();

const durable = resources.memoryWorkflow.fork("app-durable");

const workflow = r
  .task("publishArticleWorkflow")
  .dependencies({ durable })
  .tags([
    tags.durableWorkflow.with({
      category: "articles",
      signals: [Approved],
    }),
  ])
  .run(async (input: { articleId: string }, { durable }) => {
    const d = durable.use();

    const result = await d.step("prepare-article", async () => {
      return { articleId: input.articleId, ready: true };
    });

    const approval = await d.waitForSignal(Approved, {
      stepId: "wait-approval",
      timeoutMs: 24 * 60 * 60 * 1000,
    });

    if (approval.kind === "timeout") {
      return { status: "timed_out" as const };
    }

    return await d.step("publish-article", async () => {
      return {
        status: "published" as const,
        approvedBy: approval.payload.approvedBy,
        publishedAt: Date.now(),
      };
    });
  })
  .build();

const app = r
  .resource("app")
  .register([
    resources.durable,
    durable.with({
      queue: { consume: true },
      polling: { enabled: true, interval: 1000 },
      recovery: { onStartup: true },
    }),
    workflow,
    Approved,
  ])
  .build();

const runtime = await run(app);
const durableRuntime = runtime.getResourceValue(durable);

Then you start it:

const executionId = await durableRuntime.start(workflow, {
  articleId: "article-42",
});

And later, from an API route, admin screen, webhook, or another service:

await durableRuntime.signal(executionId, Approved, {
  approvedBy: "editor@company.com",
});

That signal wakes the workflow and it continues from the persisted point.
Not from scratch. Not from memory. From durable state.

Why This Matters for Human-in-the-Loop AI

This is the part I really like.

A lot of "AI workflows" are actually:

  1. do some work
  2. pause
  3. wait for a human or another system
  4. continue safely

Runner models that cleanly with typed signals.

In the agent-orchestration example, the workflow can:

  • create an initial draft
  • wait for a review decision
  • handle approve vs revise
  • wait for a revised draft
  • publish only inside a durable step

That means the workflow can survive real editorial loops instead of only looking impressive in a single uninterrupted demo.
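The shape of that editorial loop is worth seeing on its own. Below is a toy sketch of it, where `nextDecision` is a hypothetical stand-in for awaiting a typed signal like Runner's `d.waitForSignal`; the loop structure, not the API, is the point.

```typescript
// Toy approve/revise loop. `nextDecision` stands in for a durable
// wait on a typed signal; each iteration is one editorial round.
type Decision =
  | { kind: "approve"; approvedBy: string }
  | { kind: "revise"; notes: string };

async function editorialLoop(
  nextDecision: () => Promise<Decision>,
  maxRevisions = 3,
): Promise<{ status: "published" | "abandoned"; revisions: number }> {
  let revisions = 0;
  while (revisions <= maxRevisions) {
    const decision = await nextDecision(); // the workflow pauses here
    if (decision.kind === "approve") {
      // in a real workflow, publishing happens inside a durable step
      return { status: "published", revisions };
    }
    revisions += 1; // revise: regenerate the draft, then wait again
  }
  return { status: "abandoned", revisions };
}
```

Because each wait is a durable checkpoint, the loop can sit mid-revision for days and still resume exactly where it left off.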

Replay Is the Secret Sauce

The most important concept to understand is replay.

If your process crashes while the workflow is sleeping, waiting for approval, or simply sitting in the queue, recovery does not mean "run everything again and hope for the best."

It means:

  • completed steps return cached results
  • waits remain satisfied once their signal is stored
  • sleeps continue from persisted timers
  • new work executes only from the next unfinished checkpoint

This is what makes durable workflows feel safe.

Without replay, long-lived workflows eventually turn into a haunted house of ad hoc tables, boolean flags, timer drift, and very nervous engineers.
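To make the replay idea concrete, here is a deliberately tiny sketch of the mechanism. This is not Runner's implementation, just the core idea: completed step results are persisted, and a re-run fast-forwards through them instead of repeating side effects.

```typescript
// Toy replay engine: persisted step results let a re-run from the
// top skip already-completed work.
type Store = Map<string, unknown>; // stands in for Redis

async function replayStep<T>(
  store: Store,
  stepId: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (store.has(stepId)) {
    // Replay: return the cached result, skip the side effect.
    return store.get(stepId) as T;
  }
  const result = await fn();
  store.set(stepId, result); // checkpoint before moving on
  return result;
}

let charges = 0; // the side effect we must not duplicate

async function runWorkflow(store: Store, crash: boolean): Promise<string> {
  await replayStep(store, "charge-customer", async () => {
    charges += 1;
    return { charged: true };
  });
  if (crash) throw new Error("process died"); // simulated crash
  return await replayStep(store, "publish", async () => "published");
}
```

Run it once with a crash, then again against the same store: the customer is charged exactly once, and the second run fast-forwards straight to the unfinished `publish` step.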

Cancellation Is Not an Afterthought

Agent systems also need smart cancellation.

Not "we removed it from the UI, good luck everybody."

Runner exposes cancellation directly:

await durableRuntime.cancelExecution(executionId, "User requested");

If a step is currently running, its AbortSignal is triggered.
If the workflow is sleeping or waiting for a signal, cancellation can complete immediately.

That gives you a sane model for:

  • user-aborted workflows
  • policy shutdowns
  • stale executions
  • operator interventions
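For a step to stop promptly, its body has to cooperate with that AbortSignal. A sketch of what a cancellable step body can look like, assuming the runtime hands the step a signal (Runner triggers one on `cancelExecution`):

```typescript
// Sketch: a step body that honors cancellation by checking its
// AbortSignal between units of work instead of finishing the batch.
async function summarizeDocuments(
  docs: string[],
  signal: AbortSignal,
): Promise<string[]> {
  const summaries: string[] = [];
  for (const doc of docs) {
    if (signal.aborted) {
      // Bail out cleanly; the workflow records the cancellation.
      throw new Error(`cancelled: ${signal.reason}`);
    }
    summaries.push(doc.slice(0, 10)); // stand-in for a model call
  }
  return summaries;
}
```

The same signal can also be forwarded directly to `fetch` or an SDK call that accepts one, so in-flight network work is torn down too.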

Horizontal Scaling Without Fear

This is the other big reason durable workflows fit agent systems.

Once workflows can span real time, you do not want ownership to depend on one process keeping a timer alive in RAM.

Runner's production durable setup uses:

  • Redis for durable execution state
  • RabbitMQ for work distribution
  • polling/recovery for timers and orphaned executions

That split is healthy:

  • the store makes it correct
  • the queue makes it fast

Or put differently: pub/sub can be speedy, but storage is where truth goes to pay taxes.

A production-flavored setup looks like this:

const durable = resources.redisWorkflow.fork("content-durable");

const durableRegistration = durable.with({
  redis: { url: process.env.REDIS_URL! },
  queue: {
    url: process.env.RABBITMQ_URL!,
    consume: true,
    quorum: true,
  },
  polling: { enabled: true, interval: 1000, concurrency: 10 },
  recovery: { onStartup: true },
});

This gives you a workflow model that can survive deploys, recover orphaned executions, and scale across workers without duplicating side effects.

What Durable Workflows Buy You for AI Systems

If you are building AI systems that involve any of the following:

  • human approval
  • review and revision loops
  • long-running research
  • multi-step tool execution
  • scheduled follow-ups
  • compliance gates
  • child workflows

then durable workflows are probably a better primitive than another layer of clever prompt choreography.

The hard part is rarely "how do I call the model?"

The hard part is:

  • how do I pause safely?
  • how do I resume safely?
  • how do I avoid doing the same side effect twice?
  • how do I cancel safely?
  • how do I scale safely?

Durable workflows answer those questions directly.

Final Thought

I think a lot of the market is solving the shiny part of agent orchestration.

The serious part is persistence, replay, signaling, cancellation, and scale.

That is why durable workflows feel like such a natural fit for agent systems.

Not because they make your prompts smarter.
Because they make your workflows harder to kill.

And in production, that is a very beautiful quality.

Take a look at this simple example to better understand the capability:

https://github.com/bluelibs/runner/tree/main/examples/agent-orchestration
