Everyone is building "agent workflow" libraries right now.
Many of them optimize the easiest part.
It is not that hard to wire:
- model calls
- tool calls
- retries
- fallback models
- structured outputs
Give that task to a strong model and you can get a surprisingly decent orchestration layer very quickly.
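To be concrete about how small that wiring really is, here is a toy sketch of retries with a fallback model in plain TypeScript. Nothing here is from any particular library; `ModelCall` and both helpers are illustrative names:

```typescript
// Toy retry-with-fallback wiring. `ModelCall` stands in for whatever
// model client you actually use; the whole sketch is illustrative.
type ModelCall = (prompt: string) => Promise<string>;

async function withRetries(
  call: ModelCall,
  prompt: string,
  attempts: number,
): Promise<string> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call(prompt);
    } catch (err) {
      lastError = err; // swallow and retry
    }
  }
  throw lastError;
}

// Try the primary model first, then fall back to a secondary one.
async function withFallback(
  primary: ModelCall,
  fallback: ModelCall,
  prompt: string,
): Promise<string> {
  try {
    return await withRetries(primary, prompt, 2);
  } catch {
    return await withRetries(fallback, prompt, 2);
  }
}
```

Twenty lines, and a strong model will happily write them for you. That is exactly why this layer is not where the difficulty lives.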
I am not saying those abstractions are useless.
I am saying they are usually not the real problem.
The real problem begins when your workflow lasts longer than the HTTP request that created it.
That is where agent systems stop being a prompt-engineering exercise and start becoming a systems problem. One solution is durable workflows from Runner: https://runner.bluelibs.com/guide/durable-workflows
The Real Problem Is Time
Real agent workflows often span minutes, hours, or days.
They wait for:
- human approval
- revised input
- compliance review
- a webhook from another system
- a scheduled retry window
- a sub-workflow to finish somewhere else
And while they wait, life happens:
- your server restarts
- you deploy a new version
- a worker crashes
- a queue redelivers
- another node picks up the execution
- a user wants to cancel the whole thing right now
That is the part many agent libraries underplay.
They help you choreograph the model.
They do not help you survive time.
And sadly, a JavaScript closure is not a distributed systems primitive. It has great vibes, but terrible crash recovery.
Why Durable Workflows Fit Agent Systems So Well
Durable workflows solve the part that actually hurts:
- they persist progress
- they survive crashes and restarts
- they let you wait for outside signals cleanly
- they support safe cancellation
- they scale horizontally without relying on in-memory timers
This makes them an excellent fit for agent systems, because agent systems are full of pauses, retries, gates, and long-lived state transitions.
You do not adopt durable workflows because your LLM is clever.
You adopt them because your servers are mortal.
The Mental Model
A durable workflow is a normal async function with checkpoints.
In Runner, the workflow does not resume from some frozen JavaScript instruction pointer.
Instead, on recovery or wake-up, it re-runs from the top and fast-forwards through completed steps using persisted results.
The important rule is simple:
Put side effects inside step(). Code outside a durable step may run more than once.
That rule is what prevents "oops, we charged the customer twice" or "great news, we emailed them three times."
And one more practical note: keep step IDs stable for in-flight workflows across deploys.
This mental model feels a bit like useEffect() for the backend, if useEffect() had persistence, signals, a helmet, and significantly better survival instincts.
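To make the re-run-and-fast-forward idea concrete, here is a toy version of the mechanic in plain TypeScript. This is not Runner's implementation, just the shape of the idea: a completed step's result is persisted, so on replay the step returns its cached result instead of executing again. The `Map` stands in for a durable store:

```typescript
// Toy replay engine: persist each step's result, fast-forward on re-run.
// A Map plays the role of a durable store (Redis, a database, ...).
type Store = Map<string, unknown>;

function makeStep(store: Store) {
  return async function step<T>(id: string, fn: () => Promise<T>): Promise<T> {
    if (store.has(id)) {
      // Replay: this step already completed; return the persisted result.
      return store.get(id) as T;
    }
    const result = await fn();
    store.set(id, result); // checkpoint before moving on
    return result;
  };
}
```

Run the same workflow function twice against the same store and the side effects inside step() fire exactly once, while anything outside runs both times. That is the whole rule, mechanized.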
A Tiny Workflow Example
This is the shape I want from an agent workflow:
// assume Approved is a typed event declared elsewhere
const workflow = r
  .task("publishArticleWorkflow")
  .dependencies({ durable })
  .tags([
    tags.durableWorkflow.with({
      category: "content",
      signals: [Approved],
    }),
  ])
  .run(async (input: { articleId: string }, { durable }) => {
    const d = durable.use();

    await d.step("generate-draft", async () => {
      // call model, save draft, persist metadata
      return { articleId: input.articleId };
    });

    await d.waitForSignal(Approved, { stepId: "wait-approval" });

    await d.sleep(60 * 60 * 1000, { stepId: "cooldown-before-publish" });

    await d.step("publish", async () => {
      // publish once, even across crashes/recovery
      return { ok: true };
    });
  })
  .build();
That is a much better foundation for AI work than "please do not restart this container for the next six hours."
Runner's Durable Workflows
Runner is a backend framework, but its durable workflow capability can be used on its own inside an existing app.
Today, Runner durable workflows live on the Node side via @bluelibs/runner/node.
The setup is intentionally small:
import { r, run } from "@bluelibs/runner";
import { resources, tags } from "@bluelibs/runner/node";

const Approved = r.event<{ approvedBy: string }>("approved").build();

const durable = resources.memoryWorkflow.fork("app-durable");

const workflow = r
  .task("publishArticleWorkflow")
  .dependencies({ durable })
  .tags([
    tags.durableWorkflow.with({
      category: "articles",
      signals: [Approved],
    }),
  ])
  .run(async (input: { articleId: string }, { durable }) => {
    const d = durable.use();

    const result = await d.step("prepare-article", async () => {
      return { articleId: input.articleId, ready: true };
    });

    const approval = await d.waitForSignal(Approved, {
      stepId: "wait-approval",
      timeoutMs: 24 * 60 * 60 * 1000,
    });

    if (approval.kind === "timeout") {
      return { status: "timed_out" as const };
    }

    return await d.step("publish-article", async () => {
      return {
        status: "published" as const,
        approvedBy: approval.payload.approvedBy,
        publishedAt: Date.now(),
      };
    });
  })
  .build();

const app = r
  .resource("app")
  .register([
    resources.durable,
    durable.with({
      queue: { consume: true },
      polling: { enabled: true, interval: 1000 },
      recovery: { onStartup: true },
    }),
    workflow,
    Approved,
  ])
  .build();

const runtime = await run(app);
const durableRuntime = runtime.getResourceValue(durable);
Then you start it:
const executionId = await durableRuntime.start(workflow, {
  articleId: "article-42",
});
And later, from an API route, admin screen, webhook, or another service:
await durableRuntime.signal(executionId, Approved, {
  approvedBy: "editor@company.com",
});
That signal wakes the workflow and it continues from the persisted point.
Not from scratch. Not from memory. From durable state.
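Wiring that signal into an endpoint is mostly input validation plus one call. A framework-free sketch, where `signalApproval` stands in for the `durableRuntime.signal(executionId, Approved, payload)` call shown above and the handler shape is an assumption, not Runner's API:

```typescript
// Illustrative approval-endpoint handler, decoupled from any HTTP framework.
// `SignalApproval` stands in for durableRuntime.signal(executionId, Approved, payload).
type SignalApproval = (
  executionId: string,
  payload: { approvedBy: string },
) => Promise<void>;

async function approveHandler(
  signalApproval: SignalApproval,
  body: { executionId?: string; approvedBy?: string },
): Promise<{ status: number; message: string }> {
  if (!body.executionId || !body.approvedBy) {
    return { status: 400, message: "executionId and approvedBy are required" };
  }
  await signalApproval(body.executionId, { approvedBy: body.approvedBy });
  return { status: 202, message: "approval delivered" };
}
```

Returning 202 rather than 200 is deliberate: the signal is accepted and persisted, and the workflow continues on its own schedule.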
Why This Matters for Human-in-the-Loop AI
This is the part I really like.
A lot of "AI workflows" are actually:
- do some work
- pause
- wait for a human or another system
- continue safely
Runner models that cleanly with typed signals.
In the agent-orchestration example, the workflow can:
- create an initial draft
- wait for a review decision
- handle approve vs revise
- wait for a revised draft
- publish only inside a durable step
That means the workflow can survive real editorial loops instead of only looking impressive in a single uninterrupted demo.
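That editorial loop is really a small state machine. A framework-free sketch of the transitions (the state and decision names are my own, not Runner's API); in a durable workflow, each transition would sit behind a waitForSignal so the loop survives restarts between decisions:

```typescript
// The review loop as a reducer: each human decision moves the draft along.
// State and decision names are illustrative.
type ReviewState = "drafted" | "awaiting_review" | "revising" | "published";
type Decision = "submit" | "approve" | "revise" | "resubmit";

function reviewReducer(state: ReviewState, decision: Decision): ReviewState {
  switch (state) {
    case "drafted":
      return decision === "submit" ? "awaiting_review" : state;
    case "awaiting_review":
      if (decision === "approve") return "published";
      if (decision === "revise") return "revising";
      return state;
    case "revising":
      return decision === "resubmit" ? "awaiting_review" : state;
    case "published":
      return state; // terminal: publishing happened inside a durable step
  }
}
```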
Replay Is the Secret Sauce
The most important concept to understand is replay.
If your process crashes while the workflow is sleeping, waiting for approval, or simply sitting in the queue, recovery does not mean "run everything again and hope for the best."
It means:
- completed steps return cached results
- waits remain satisfied once their signal is stored
- sleeps continue from persisted timers
- new work executes only from the next unfinished checkpoint
This is what makes durable workflows feel safe.
Without replay, long-lived workflows eventually turn into a haunted house of ad hoc tables, boolean flags, timer drift, and very nervous engineers.
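Persisted timers are the same trick applied to time: store the wake-up deadline once, instead of holding a setTimeout alive in RAM. A toy sketch, not Runner's implementation:

```typescript
// Toy durable sleep: the deadline is computed once and persisted, so a
// restart mid-sleep resumes with the remaining time, not a fresh timer.
// The Map stands in for a durable store; `now` is injectable for testing.
type TimerStore = Map<string, number>;

async function durableSleep(
  store: TimerStore,
  id: string,
  ms: number,
  now: () => number = Date.now,
): Promise<void> {
  let wakeAt = store.get(id);
  if (wakeAt === undefined) {
    wakeAt = now() + ms;   // first pass: checkpoint the absolute deadline
    store.set(id, wakeAt);
  }
  const remaining = wakeAt - now(); // after a crash, only what is left
  if (remaining > 0) {
    await new Promise((resolve) => setTimeout(resolve, remaining));
  }
}
```

If the process dies during the sleep and a recovered worker replays the workflow, it re-enters this call with the same id, finds the stored deadline, and waits only for whatever time remains, possibly zero.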
Cancellation Is Not an Afterthought
Agent systems also need smart cancellation.
Not "we removed it from the UI, good luck everybody."
Runner exposes cancellation directly:
await durableRuntime.cancelExecution(executionId, "User requested");
If a step is currently running, its AbortSignal is triggered.
If the workflow is sleeping or waiting for a signal, cancellation can complete immediately.
That gives you a sane model for:
- user-aborted workflows
- policy shutdowns
- stale executions
- operator interventions
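Honoring that AbortSignal inside a long step looks roughly like this. Plain TypeScript sketch; how the signal reaches your step is Runner's plumbing, but checking it between units of work is on you:

```typescript
// A long-running step that checks its AbortSignal between items,
// so cancellation lands cleanly between iterations instead of being ignored.
async function processBatch(
  items: string[],
  handle: (item: string) => Promise<void>,
  signal: AbortSignal,
): Promise<{ done: number; cancelled: boolean }> {
  let done = 0;
  for (const item of items) {
    if (signal.aborted) {
      return { done, cancelled: true }; // stop cleanly; progress is known
    }
    await handle(item);
    done++;
  }
  return { done, cancelled: false };
}
```

Returning the progress count instead of throwing away work means a later run, or an operator, can decide what to do with the partially processed batch.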
Horizontal Scaling Without Fear
This is the other big reason durable workflows fit agent systems.
Once workflows can span real time, you do not want ownership to depend on one process keeping a timer alive in RAM.
Runner's production durable setup uses:
- Redis for durable execution state
- RabbitMQ for work distribution
- polling/recovery for timers and orphaned executions
That split is healthy:
- the store makes it correct
- the queue makes it fast
Or put differently: pub/sub can be speedy, but storage is where truth goes to pay taxes.
A production-flavored setup looks like this:
const durable = resources.redisWorkflow.fork("content-durable");

const durableRegistration = durable.with({
  redis: { url: process.env.REDIS_URL! },
  queue: {
    url: process.env.RABBITMQ_URL!,
    consume: true,
    quorum: true,
  },
  polling: { enabled: true, interval: 1000, concurrency: 10 },
  recovery: { onStartup: true },
});
This gives you a workflow model that can survive deploys, recover orphaned executions, and scale across workers without duplicating side effects.
What Durable Workflows Buy You for AI Systems
If you are building AI systems that involve any of the following:
- human approval
- review and revision loops
- long-running research
- multi-step tool execution
- scheduled follow-ups
- compliance gates
- child workflows
then durable workflows are probably a better primitive than another layer of clever prompt choreography.
The hard part is rarely "how do I call the model?"
The hard part is:
- how do I pause safely?
- how do I resume safely?
- how do I avoid doing the same side effect twice?
- how do I cancel safely?
- how do I scale safely?
Durable workflows answer those questions directly.
Final Thought
I think a lot of the market is solving the shiny part of agent orchestration.
The serious part is persistence, replay, signaling, cancellation, and scale.
That is why durable workflows feel like such a natural fit for agent systems.
Not because they make your prompts smarter.
Because they make your workflows harder to kill.
And in production, that is a very beautiful quality.
Take a look at this simple example to better understand the capability:
https://github.com/bluelibs/runner/tree/main/examples/agent-orchestration