Piotr Wachowski

Posted on May 30

Your AI Agent Just Crashed at Step 9 of 12. Here's How to Make That Not Matter.

#python #ai #agents #llm

How to build crash-proof, resumable AI agents with Temporal's durable execution: a DeepAgents-style developer experience where killing the process doesn't kill the run.

If you've built an AI agent that does real work (calling tools, delegating to sub-agents, looping until a task is done), you've probably felt this particular kind of pain:

The agent is nine steps into a twelve-step job. It has searched the web, written three files, and delegated to a sub-agent. Then the process dies. A deploy, an OOM kill, a dropped network connection, a transient 500 from your model provider. Whatever the cause, the result is the same: the entire run is gone. All that state lived in process memory, and process memory just evaporated.

Durability usually isn't the first thing you reach for when prototyping an agent, and for good reason: it's plumbing, not the fun part. But once an agent starts doing real work, it's worth taking seriously. This article is about a mental model that makes that durability almost free, an agent is not an object in memory, it's a durable workflow, and how you can build agents that survive crashes, restarts, and infrastructure failures by running them on Temporal.

I'll also show you a small open-source library I've been building, durable-agents, that packages this pattern so you don't have to write the plumbing yourself. But the ideas matter more than the library: you can apply them with raw Temporal, and you'll learn something even if you never touch my repo.

The core insight: an agent run is a workflow

Here's the whole idea in one sentence:

Stop storing your agent's state in RAM. Store it in an append-only event
history that survives any crash.

That's exactly what Temporal's durable execution gives you. Temporal is a system for running workflows, functions whose every step is persisted to an event history as it happens. If the worker process dies, Temporal replays that history on a new worker and your function continues from precisely where it left off. No checkpoint code. No "resume from step N" logic. It just continues.

Map an agent onto that model and everything clicks into place:

Agent concept	Temporal primitive
The agent run	A workflow
An LLM call (plan / execute / synthesize)	An activity
A tool invocation	An activity
A sub-agent	A child workflow
The agent's memory	The workflow's event history

The split between workflow and activity is the key. Workflow code is deterministic: it's the orchestration logic, and it must replay identically every time. Anything non-deterministic or with side effects (an HTTP call to OpenAI, reading a file, running a tool) happens in an activity. Activities are retried automatically on failure, and their results are recorded in history so they never re-run after they've succeeded.

This is why a crash is survivable: when the worker comes back, Temporal replays the workflow up to the last recorded event and resumes. The plan you already generated, the three files you already wrote, the sub-agent results you already collected, all still there. Only the in-flight step retries.

What this looks like in practice

Let me make it concrete. Here's a complete research agent. The thing to notice is how little ceremony there is, and that the durability is invisible. You write ordinary async Python.

First, a tool. A tool is just an async function with a decorator; the JSON schema the model needs is generated from your type hints and docstring:

from durable_agents import tool

@tool
async def web_search(query: str) -> str:
    """Search the web for information about the given query."""
    # ... call your search API ...
    return f"Results for {query}: ..."

Now the agent. This lives on the worker, the process that does the work:

from durable_agents import create_durable_agent

agent = create_durable_agent(
    model="openai:gpt-4o-mini",
    tools=[web_search],
    system_prompt="You are a helpful research assistant.",
    task_queue="research-agent",
)

await agent.run_worker()   # blocks; serves the task queue

And triggering it. Here's a detail worth pausing on: the client is thin. It imports no tools, knows no schemas. It only knows a task-queue name:

from durable_agents import DurableAgentClient

client = DurableAgentClient(task_queue="research-agent")
result = await client.run("What is quantum entanglement?")

The agent definition (model, tools, prompt, sub-agents) lives only on the worker. The worker is the agent. A web handler, a cron job, or another service can trigger an agent without depending on its implementation, and your tool code and credentials never leave the worker. Tool schemas are never sent over the wire.

The detail that changes how you think about agents: two retry layers

This is the part that reframes a problem every agent builder hits. Agents fail in two completely different ways, and they need two completely different recovery strategies:

1. Infrastructure faults. The network blips. The model API returns a 503. The worker is redeployed. These are transient and not the agent's fault. The right response is: retry the exact same operation, with backoff, until it works. Temporal does this for you, automatically, at the activity level. You write zero retry code.

2. Semantic faults. The model returns malformed JSON. It calls a tool that doesn't exist. A tool raises an exception. Retrying the identical call won't help: the input needs to change. The right response is to feed the failure back to the model as an observation and let it correct itself on the next step.

It's tempting to collapse these into one mechanism, usually a try/except that crashes the loop on bad output. Keeping them separate is what makes the loop resilient:

A tool that raises does not crash the agent. The exception is caught and returned to the model as ERROR calling tool 'x': ..., an observation it can reason about.
Malformed model output (often just a truncated response) becomes an empty result the model sees and retries differently.
Meanwhile, underneath all of that, Temporal is transparently retrying any activity that failed for infrastructure reasons.

Bad model output is data, not an exception. That single reframing makes agents dramatically more robust.

Watching it work: a multi-agent pipeline

To show this with something more interesting than a single agent, here's the example I use: a Code Archaeologist, a four-agent pipeline that modernises legacy Python on disk.

An orchestrator plans the work and delegates each phase. It has no tools of its own; it only coordinates.
An archaeologist reads the legacy code and reports what's wrong (missing type hints, %-formatting, global state).
A modernizer rewrites the files: annotations, f-strings, pathlib, context managers.
A documenter adds docstrings and writes a README.

Each sub-agent runs as a child workflow on its own task queue, with its own isolated, independently-visible history. The orchestrator delegates by handing each child a self-contained task that carries forward the previous stage's findings; the child never inherits the parent's full message history, so context stays clean and scoped.

In the Temporal Web UI you can see the whole thing: the orchestrator spawning three child workflows in sequence, each one's plan-then-execute loop, every LLM call and tool call as a discrete, inspectable event. The question "what did my agent actually do?" has a precise, visual answer.

The orchestrator plus its three child workflows: archaeologist, modernizer, and documenter, each running as its own Temporal workflow.

The money shot: I killed the worker, and the agent didn't care

Here's the test that proves the whole thesis.

I started the pipeline and let it run. The three sub-agents finished their work: the analysis, the rewritten files, the documentation were all done and recorded. Then, while the orchestrator was on its final step (synthesize_result, writing up the summary), I killed the worker. Ctrl-C. Dead process.

In an in-memory framework, that's a total loss. Three agents' worth of completed work, gone. Start over.

Here's what actually happened:

Worker killed mid-run: synthesize_result sits in Attempt 2 / ∞, while the plan and all three child workflows stay Completed. Nothing re-ran.

The synthesize_result activity went into a retry loop, patiently attempting against an empty task queue. Crucially, none of the completed work re-ran. The plan, the three child workflows, all still marked completed in history. Only the single in-flight activity was pending.

Then I restarted the worker. The instant it reconnected:

The instant the worker reconnected, the pending activity finished and the workflow completed. Total wall-clock includes the dead time, but no work was lost or repeated.

It picked up the pending activity, completed it, and the client that submitted the task got its result as if nothing had happened, no work lost, none repeated.

That's durable execution. The agent's state lived in Temporal's event history, not in the process I killed.

Where to start, and what's honest about this

If you want to try the pattern, you don't need my library: you can wire LLM calls and tools as Temporal activities directly. The Temporal docs are excellent, and the workflow / activity split is the only concept you really need.

If you want the DeepAgents-style ergonomics (@tool, @skill, sub-agent
delegation, the plan-then-execute loop) on top of that durability,
durable-agents packages it.
Let me be straight about its status: it's alpha, and currently OpenAI-only.
The core loop, sub-agents, skills, and filesystem tools work today; things like persistent memory, human-in-the-loop, and more model providers are on the roadmap. I'm sharing it mainly because the idea is worth sharing; the code is just one small, runnable example of wiring it up, and you might wire it differently.

What's next

This is the first piece in a series. Durable execution is the foundation, but it's also what unlocks the features that are genuinely hard to do well in memory-bound agents:

Human-in-the-loop: pausing an agent for days waiting on human approval, with no process held open, by parking the workflow on a Temporal signal.
Persistent memory: facts and preferences that outlive a single run.
Deeper observability: tracing and streaming on top of the event history.

I'll write each of those up as I build them.

If the "agent is a workflow" framing was useful, I'd love to hear how you're thinking about durability in your own agents, and if you want to poke at the code, the repo and the runnable crash-test are linked below.

Repo: https://github.com/piotrwachowski/durable-agents
Reproduce the crash test: see the Durability test (crash the worker and watch it resume) section in docs/09-examples.md.

Top comments (1)

Harjot Singh • May 31

Crashed at step 9 of 12, make it not matter is the exact problem that separates a toy agent from a production one, and the fix is durable, resumable execution. The naive agent holds all its progress in memory, so a crash at step 9 throws away steps 1 to 8 and restarts from zero, which is expensive (re-spending all those tokens) and dangerous (re-firing side effects that already happened). Making it not matter means two things together. Checkpoint each step's result to durable storage so a restart resumes at 9 instead of 1, and make side effects idempotent so the steps that did run aren't re-executed on resume, no double charges, no duplicate emails. Get those two right and a crash becomes a pause, not a disaster. The subtle part is the boundary between deciding to do something and having done it, you want to record completion atomically with the side effect, otherwise a crash in that gap leaves you unsure whether step 9 happened, which is the classic exactly-once problem agents inherit from distributed systems. Treat the agent run like a durable workflow, not a script you hope finishes. Checkpoint progress, make steps idempotent, resume instead of restart. That design-for-crash-recovery instinct is core to how I think about agents in Moonshift. Are you persisting a per-step checkpoint yourself, or leaning on a durable-execution engine to get resume-from-step-9 for free?