Your AI Agent Crashed at Step 47. Why Isn't Crash Recovery the Default?

#ai #python #agents #durability

Your agent is running a 50-step data pipeline. Extract, validate, transform, deduplicate, load. 25 minutes in.

Step 47. OOM killed. Process gone. 25 minutes of work gone.

You restart the agent. It starts from step 1.

The "You Should Have Configured It" Problem

Every framework has an answer for this. And every answer is the same: you should have set it up before the crash.

# LangGraph - opt-in persistence
from langgraph.checkpoint.postgres import PostgresSaver

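# from_conn_string is a context manager in current langgraph-checkpoint-postgres releases
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use
    graph = builder.compile(checkpointer=checkpointer)  # forgot this? start over.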

# CrewAI - limited state management
# "Failures typically require restart"

# Swarm - no persistence at all
# State exists only in memory

# Raw Python - hope you wrote your own

The pattern is consistent: durability is an add-on. Something you bolt on after you build the agent. Something you forget until the first crash.

And the checkpoint code is rarely simple. With LangGraph's PostgresSaver you also manage database connections, schema migrations when LangGraph updates, cleanup of old checkpoints, serialization errors when state objects change shape, and resume logic. That's 30 to 50 lines of infrastructure code unrelated to what your agent actually does.
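To make the "write your own" case concrete, here is a minimal sketch of hand-rolled checkpointing in plain Python, assuming a single process and a local JSON file. The names `save_checkpoint`, `load_checkpoint`, and `run_pipeline` are illustrative, and `crash_at` is only a hook for simulating a failure. It is not what any framework does internally.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as f:
        return json.load(f)


def run_pipeline(steps, path, crash_at=None):
    """Run steps in order, checkpointing after each completed step.

    crash_at simulates the process dying just before the given step index.
    """
    state = load_checkpoint(path)
    for index in range(state["next_step"], len(steps)):
        if crash_at == index:
            raise RuntimeError(f"simulated crash before step {index}")
        state["results"].append(steps[index]())
        state["next_step"] = index + 1
        save_checkpoint(path, state)
    return state["results"]
```

Even this small version leaves gaps: results must be JSON-serializable, the file only helps if the restarted process lands on the same disk, and a step that finishes just before the crash but after the last save will run again on resume.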

Why Durability Should Be the Default

Think about how you use Stripe. You don't write checkpoint code in case your server crashes mid-payment. You send an idempotency key and Stripe handles the rest: safe retries and durable state on their side.

Managed services like this make durability the platform's job. Agent operations are the exception: the one place where durability is still opt-in. Still your problem.

Agent Stateless, Platform Stateful

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/data-pipeline",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})
result = client.wait_for(intent_id)

No PostgresSaver. No checkpoint database. No serialization code.

The state lives in the platform. The agent is stateless. When the agent crashes:

The intent stays at its current state in PostgreSQL
The agent restarts (Cloud Run, Kubernetes, whatever)
The platform redelivers the intent
The agent resumes from where it stopped

Up to 3 delivery attempts by default. Configurable per intent type.
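The redelivery loop described above can be sketched as a toy simulation. This is an assumption-heavy illustration, not AXME's implementation: `ToyPlatform`, `record_progress`, and `make_flaky_agent` are hypothetical names, a dict stands in for PostgreSQL, and the way the agent reports progress is a guess, since the post does not show it.

```python
MAX_ATTEMPTS = 3  # mirrors the default delivery limit mentioned above


class ToyPlatform:
    """Hypothetical durable platform; a dict stands in for the database."""

    def __init__(self):
        self.intents = {}

    def submit(self, intent_id, steps):
        self.intents[intent_id] = {
            "steps": steps, "done": 0, "attempts": 0, "status": "pending",
        }

    def record_progress(self, intent_id, done):
        # Progress lives in the platform, not in the agent process.
        self.intents[intent_id]["done"] = done

    def deliver(self, intent_id, agent):
        intent = self.intents[intent_id]
        while intent["attempts"] < MAX_ATTEMPTS:
            intent["attempts"] += 1
            try:
                agent(self, intent_id, intent["steps"], intent["done"])
            except Exception:
                continue  # agent died; stored progress is intact, so redeliver
            intent["status"] = "completed"
            return intent["status"]
        intent["status"] = "failed"
        return intent["status"]


def make_flaky_agent(crash_on_step, executed):
    """Build an agent that crashes once, just before the named step."""
    crashed = {"yet": False}  # test harness flag, not pipeline state

    def agent(platform, intent_id, steps, start_at):
        for index in range(start_at, len(steps)):
            if steps[index] == crash_on_step and not crashed["yet"]:
                crashed["yet"] = True
                raise MemoryError("simulated OOM kill")
            executed.append(steps[index])  # the real work would happen here
            platform.record_progress(intent_id, index + 1)

    return agent
```

Submitting `["extract", "validate", "transform", "load"]` with an agent that crashes on `"transform"` completes on the second delivery, and each step runs once. The agent receives its starting position from the platform on every delivery, which is what lets it stay stateless.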

The Real Cost of Opt-In Durability

It's not just the code. It's the incidents. The agent that crashed at record 98k of 100k and started over. The deployment pipeline that failed at step 9, re-ran all 10, and double-deployed services 1 through 9. The enrichment job that crashed and hit the same API 50,000 times on restart.

These incidents happen not because teams are careless, but because they were busy building the product and hadn't gotten to the checkpoint code yet.
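The double-deploy and repeated-API-call incidents share one cause: a restart re-executes side effects that already happened. A common guard is an idempotency key per side effect. The sketch below is illustrative only; `IdempotentExecutor` and `deploy_all` are hypothetical names, and the in-memory dict would need to be a durable store to survive a real crash.

```python
import hashlib


class IdempotentExecutor:
    """Runs each side effect once per key; repeats return the stored result."""

    def __init__(self):
        self._results = {}  # assumption: replace with a durable store in practice

    @staticmethod
    def key_for(job_id, step_name, item_id):
        raw = f"{job_id}:{step_name}:{item_id}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def run(self, key, action):
        if key in self._results:
            return self._results[key]  # already done; skip the side effect
        result = action()
        self._results[key] = result
        return result


def deploy_all(executor, job_id, services, deploy, fail_at=None):
    """Deploy each service in order; fail_at simulates a crash at that index."""
    for position, service in enumerate(services):
        if fail_at == position:
            raise RuntimeError(f"simulated failure at position {position}")
        key = executor.key_for(job_id, "deploy", service)
        executor.run(key, lambda s=service: deploy(s))
```

With this in place, re-running the whole job after a failure at the last service deploys only that service; the earlier ones are recognized by key and skipped.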

Comparison

Feature	LangGraph	CrewAI	AXME
Durability	Opt-in (PostgresSaver)	None	Default
Checkpoint code	30-50 lines	N/A	0
DB management	You operate	N/A	Managed
Resume after crash	From last checkpoint	Start over	Automatic redelivery
Cross-machine	Only with a shared checkpoint DB	No	Yes (state in platform)
Framework lock-in	LangGraph only	CrewAI only	Any framework