Your agent is running a 50-step data pipeline. Extract, validate, transform, deduplicate, load. 25 minutes in.
Step 47. OOM killed. Process gone. 25 minutes of work gone.
You restart the agent. It starts from step 1.
The "You Should Have Configured It" Problem
Every framework has an answer for this. And every answer is the same: you should have set it up before the crash.
# LangGraph - opt-in persistence
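from langgraph.checkpoint.postgres import PostgresSaver
# from_conn_string is a context manager in current langgraph-checkpoint-postgres releases
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)  # forgot this? start over.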
# CrewAI - limited state management
# "Failures typically require restart"
# Swarm - no persistence at all
# State exists only in memory
# Raw Python - hope you wrote your own
The pattern is consistent: durability is an add-on. Something you bolt on after you build the agent. Something you forget until the first crash.
And the checkpoint code is never simple. With LangGraph's PostgresSaver you also manage database connections, schema migrations when LangGraph updates, cleanup of old checkpoints, serialization errors when state objects change shape, and resume logic. That's 30 to 50 lines of infrastructure code unrelated to what your agent actually does.
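To make the "write your own" case concrete, here is a minimal sketch of hand-rolled checkpointing in raw Python. It assumes the pipeline state is JSON-serializable and that each step is a function from state to state. The names `load_checkpoint`, `save_checkpoint`, and `run_pipeline` are hypothetical and not taken from any framework.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return (index of the next step to run, saved state)."""
    if not os.path.exists(path):
        return 0, {}
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["next_step"], data["state"]


def save_checkpoint(path, next_step, state):
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_step": next_step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def run_pipeline(steps, path):
    """Run steps in order, skipping any that completed before a previous crash."""
    next_step, state = load_checkpoint(path)
    for index in range(next_step, len(steps)):
        state = steps[index](state)
        save_checkpoint(path, index + 1, state)
    return state
```

Even this toy version leaves out most of the list above: there is no cleanup of old checkpoints, no handling for state that changes shape between releases, and the file lives on one machine.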
Why Durability Should Be the Default
Think about how you use Stripe. You don't write checkpoint code in case your server crashes mid-payment. Stripe handles it - idempotency keys, retry logic, durable state on their side.
Agent operations are the exception. The one place where durability is still opt-in. Still your problem.
Agent Stateless, Platform Stateful
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Submit the pipeline as an intent; the platform owns its state from here on.
intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/data-pipeline",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})

# Block until the intent reaches a terminal state.
result = client.wait_for(intent_id)
No PostgresSaver. No checkpoint database. No serialization code.
The state lives in the platform. The agent is stateless. When the agent crashes:
- The intent stays at its current state in PostgreSQL
- The agent restarts (Cloud Run, Kubernetes, whatever)
- The platform redelivers the intent
- The agent resumes from where it stopped
Up to 3 delivery attempts by default. Configurable per intent type.
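The redelivery loop above can be modeled in a few lines. This is a toy in-memory sketch, not the AXME implementation or API: `ToyPlatform`, its methods, and the per-step progress counter are all assumptions made for illustration. The only details taken from the text are that state lives on the platform side and that delivery is attempted up to 3 times.

```python
class ToyPlatform:
    """In-memory stand-in for a durable platform (hypothetical, not the AXME API)."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.intents = {}  # intent_id -> steps, progress, attempts, status

    def submit(self, intent_id, steps):
        self.intents[intent_id] = {
            "steps": list(steps),
            "completed": 0,
            "attempts": 0,
            "status": "pending",
        }

    def deliver(self, intent_id, agent):
        """Deliver the intent, redelivering after each agent crash."""
        intent = self.intents[intent_id]
        while intent["attempts"] < self.max_attempts:
            intent["attempts"] += 1
            try:
                # Hand the agent only the steps that have not completed yet.
                for step in intent["steps"][intent["completed"]:]:
                    agent(step)
                    intent["completed"] += 1  # progress is recorded platform-side
                intent["status"] = "done"
                return intent
            except Exception:
                continue  # agent crashed: the intent keeps its state, try again
        intent["status"] = "failed"
        return intent
```

The agent function holds no state of its own. A crash costs it only the step that was in flight, because the progress counter belongs to the platform object.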
The Real Cost of Opt-In Durability
It's not just the code. It's the incidents. The agent that crashed at record 98k of 100k and started over. The deployment pipeline that failed at step 9, re-ran all 10, and double-deployed services 1 through 9. The enrichment job that crashed and hit the same API 50,000 times on restart.
These incidents happen not because teams are careless, but because they were busy building the product and hadn't gotten to the checkpoint code yet.
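The double-deploy and repeated-API-call incidents are what idempotency keys (the Stripe technique mentioned earlier) are meant to prevent. A minimal sketch, assuming each side effect can be identified by its operation name and parameters. `IdempotentExecutor` is a hypothetical helper, and the in-memory dict stands in for a durable store that would have to survive the crash.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each (operation, params) pair at most once per stored key."""

    def __init__(self):
        self._results = {}  # assumption: a durable store in real use

    @staticmethod
    def key_for(operation, params):
        # Stable key: the same operation with the same params hashes the same.
        payload = json.dumps({"op": operation, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, operation, params, side_effect):
        key = self.key_for(operation, params)
        if key in self._results:
            return self._results[key]  # already done: skip the side effect
        result = side_effect(**params)
        self._results[key] = result
        return result
```

With this in place, re-running all 10 deployment steps after a failure at step 9 would skip the 8 that already succeeded. A crash between the side effect and the write to the store can still cause one repeat, which is why the key usually has to be honored by the downstream service as well.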
Comparison
| | LangGraph | CrewAI | AXME |
|---|---|---|---|
| Durability | Opt-in (PostgresSaver) | None | Default |
| Checkpoint code | 30-50 lines | N/A | 0 |
| DB management | You operate | N/A | Managed |
| Resume after crash | From last checkpoint | Start over | Automatic redelivery |
| Cross-machine | Only if workers share the checkpoint database | No | Yes (state in platform) |
| Framework lock-in | LangGraph only | CrewAI only | Any framework |
Try It
Working example - submit a multi-step pipeline, kill the agent mid-processing, restart it, watch it resume automatically:
github.com/AxmeAI/ai-agent-checkpoint-and-resume
Built with AXME - durable execution for agent operations. Alpha - feedback welcome.