Your AI Agent Crashed at Step 47. Now What?

#ai #python #agents #durability

Your agent is running a 50-step data pipeline. Extract, validate, transform, load. It's been working for 20 minutes.

Step 47. OOM kill. Process gone. State gone.

Now what?

The State of Crash Recovery in 2026

LangGraph:  "Did you configure PostgresSaver?" No? Start over.
CrewAI:     "Limited state management, failures typically require restart."
Swarm:      "No persistence, state exists only in memory."
Raw Python: Hope you wrote checkpoint logic yourself.

Every framework has its own answer to this. Most of them are "you should have thought about this earlier."
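For the raw-Python case, here is a minimal sketch of what "checkpoint logic yourself" tends to look like. Every name in it (save_checkpoint, load_checkpoint, run_pipeline, the JSON file layout) is made up for illustration, and it assumes step results are JSON-serializable:

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a
    # half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "results": []}


def run_pipeline(steps, path):
    """Run steps in order, persisting progress after each one."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())  # assumed JSON-serializable
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

Calling run_pipeline again with the same path after a crash skips the steps already recorded. A step that crashed halfway through still reruns from its start, so each step has to be safe to repeat.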

The Checkpoint Tax

If you want crash recovery in LangGraph, you write this:

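from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost/checkpoints"

# In current langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager: it opens the connection and closes it on exit.
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables; required on first use

    # builder is your StateGraph; job_id and initial_state come from your app
    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": job_id}}
    state = graph.get_state(config)
    if state and state.values:
        result = graph.invoke(None, config)   # resume from the last checkpoint
    else:
        result = graph.invoke(initial_state, config)  # start fresh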

Plus: manage DB connections. Handle schema migrations when LangGraph updates. Clean up old checkpoints. Deal with serialization errors when your state objects change shape.
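The "state objects change shape" problem usually ends in a version field and a chain of migrations. A rough sketch, not tied to LangGraph or any other framework; the field names (rows, rows_done, rows_failed) and the version numbers are invented for the example:

```python
import json

CURRENT_VERSION = 2


def migrate_v1_to_v2(state):
    # Hypothetical change: v1 stored one "rows" counter, v2 splits it in two.
    return {"version": 2, "rows_done": state["rows"], "rows_failed": 0}


# Maps a stored version to the function that upgrades it by one version.
MIGRATIONS = {1: migrate_v1_to_v2}


def load_state(raw):
    """Parse a stored checkpoint and upgrade it to the current shape."""
    state = json.loads(raw)
    version = state.get("version", 1)  # checkpoints without a version are v1
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)
        version = state["version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"checkpoint version {version} is newer than this code")
    return state
```

Every change to the state shape adds one more entry to MIGRATIONS, and none of it has anything to do with the agent's actual job.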

And this only works with LangGraph. Switch to CrewAI? Rewrite everything. Use both? Two checkpoint systems.

This is the checkpoint tax - code you write that has nothing to do with your agent's job, just to survive crashes.

What If Durability Was the Default?

Agent starts ETL pipeline via durable intent:
  [1/4] Extract    - done (state in platform)
  [2/4] Validate   - done (state in platform)
  [3/4] Transform  - done (state in platform)
  [4/4] Load       - CRASH

Restart agent:
  Platform redelivers the intent (state: IN_PROGRESS)
  Agent resumes. No data lost. No code changes.

The agent is stateless. The platform is stateful. State lives in PostgreSQL, not in process memory.
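To make the split concrete, here is a toy in-memory model of it. This is not AXME's implementation or API: ToyPlatform and run_agent are hypothetical names, and a real platform would keep this state in a database rather than in a Python list.

```python
class ToyPlatform:
    """In-memory stand-in for a durable platform (illustration only)."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.done = []  # completed steps live here, not in the agent
        self.status = "IN_PROGRESS"

    def remaining(self):
        """Steps not yet recorded as done."""
        return self.steps[len(self.done):]

    def mark_done(self, step):
        self.done.append(step)
        if len(self.done) == len(self.steps):
            self.status = "COMPLETED"


def run_agent(platform, crash_on=None):
    """Stateless worker: everything it knows comes from the platform."""
    for step in platform.remaining():
        if step == crash_on:
            # Stands in for an OOM kill or any other abrupt exit.
            raise RuntimeError(f"crashed during {step}")
        platform.mark_done(step)
```

Run the agent once with crash_on="load" and again with no crash: the second run sees only "load" in remaining(), because the first three steps were recorded outside the process that died.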

In Code

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/pipeline-agent",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})

result = client.wait_for(intent_id)

No PostgresSaver. No checkpoint DB. No schema migrations. No serialization code.

If the agent crashes, the intent stays at its current state. Restart the agent - the platform redelivers. Up to 3 attempts by default.
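Bounded redelivery can be pictured as a loop on the platform side. The sketch below is a guess at the general shape, not AXME's code: deliver_with_retries, the status strings, and the use of an exception to stand in for a crash are all assumptions. Only the default of 3 attempts comes from the text above.

```python
def deliver_with_retries(handler, intent, max_attempts=3):
    """Redeliver intent until the handler succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = handler(intent)
        except Exception as exc:  # a raised error stands in for a crash
            last_error = exc
            continue
        return {"status": "COMPLETED", "attempts": attempt, "result": result}
    return {"status": "FAILED", "attempts": max_attempts, "error": str(last_error)}
```

The cap matters: an intent that crashes the agent every time should end up in a failed state someone can inspect, not in an endless redelivery loop.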

How It Compares

Feature	LangGraph	CrewAI	Swarm	AXME
Persistence	PostgresSaver (opt-in)	None	None	Default
Checkpoint code	20+ lines	N/A	N/A	0 lines
DB setup	You manage	N/A	N/A	Managed
Resume after crash	From last checkpoint	Start over	Start over	Automatic
Cross-machine	Only via a shared checkpoint DB	No	No	Yes
Framework lock-in	LangGraph only	CrewAI only	Swarm only	Any

The Key Insight

Durability shouldn't be opt-in. It should be the default.

When you send an HTTP request to Stripe, you don't write checkpoint code in case your process crashes mid-request. You attach an idempotency key and retry. Stripe keeps the durable state on its side and makes sure the operation happens only once.

Agent operations should work the same way. Submit the operation. The platform guarantees it completes - through crashes, restarts, network failures.
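The idempotency-key half of that pattern is small enough to sketch. This is a toy server, not Stripe's API: ToyPaymentServer, charge, and the response fields are hypothetical, and a real service would persist the key table instead of holding it in a dict.

```python
import uuid


class ToyPaymentServer:
    """Remembers the response for each idempotency key it has seen."""

    def __init__(self):
        self._responses = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored response and does no new work.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": f"ch_{self.charges_made}", "amount": amount}
        self._responses[idempotency_key] = response
        return response


server = ToyPaymentServer()
key = str(uuid.uuid4())              # the client generates the key once
first = server.charge(key, 500)
retry = server.charge(key, 500)      # client retries after a crash or timeout
```

The client's only job is to reuse the same key on retry. The server owns the state that makes the retry safe.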

Try It

Working example - submit a 4-step pipeline, kill the agent mid-processing, restart, watch it pick up where it left off:

github.com/AxmeAI/ai-agent-checkpoint-and-resume

Built with AXME - durable execution for agent operations. Alpha - feedback welcome.
