
George Belsky

Your AI Agent Crashed at Step 47. Now What?

Your agent is running a 50-step data pipeline. Extract, validate, transform, load. It's been working for 20 minutes.

Step 47. OOM kill. Process gone. State gone.

Now what?

The State of Crash Recovery in 2026

LangGraph:  "Did you configure PostgresSaver?" No? Start over.
CrewAI:     "Limited state management, failures typically require restart."
Swarm:      "No persistence, state exists only in memory."
Raw Python: Hope you wrote checkpoint logic yourself.

Every framework has its own answer to this. Most of them are "you should have thought about this earlier."

The Checkpoint Tax

If you want crash recovery in LangGraph, you write this:

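from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost/checkpoints"

# from_conn_string is a context manager in current langgraph-checkpoint-postgres releases
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run

    # builder, job_id and initial_state are defined elsewhere in your app
    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": job_id}}
    state = graph.get_state(config)
    if state.values:
        result = graph.invoke(None, config)   # resume from the last checkpoint
    else:
        result = graph.invoke(initial_state, config)  # start fresh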

Plus: manage DB connections. Handle schema migrations when LangGraph updates. Clean up old checkpoints. Deal with serialization errors when your state objects change shape.

And this only works with LangGraph. Switch to CrewAI? Rewrite everything. Use both? Two checkpoint systems.

This is the checkpoint tax - code you write that has nothing to do with your agent's job, just to survive crashes.
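To make the tax concrete, here is roughly what the "raw Python" version looks like when you write it yourself. This is a minimal sketch, not a recommended design: the file-based store, the `STEPS` list, and the `load_checkpoint`, `save_checkpoint` and `run_pipeline` helpers are all made up for illustration, and it assumes each step's result is JSON-serializable.

```python
import json
import os
import tempfile

STEPS = ["extract", "validate", "transform", "load"]  # hypothetical pipeline


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_step": 0, "data": {}}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def run_pipeline(path, run_step):
    """Run the remaining steps, saving a checkpoint after each one."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(STEPS)):
        state["data"][STEPS[i]] = run_step(STEPS[i])
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

Calling `run_pipeline` again with the same path after a crash skips the steps already recorded. None of these thirty-odd lines does any extracting, validating, transforming or loading.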

What If Durability Was the Default?

Agent starts ETL pipeline via durable intent:
  [1/4] Extract    - done (state in platform)
  [2/4] Validate   - done (state in platform)
  [3/4] Transform  - done (state in platform)
  [4/4] Load       - CRASH

Restart agent:
  Platform redelivers the intent (state: IN_PROGRESS)
  Agent resumes. No data lost. No code changes.

The agent is stateless. The platform is stateful. State lives in PostgreSQL, not in process memory.
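The split between a stateless agent and a stateful platform can be sketched with a toy model. Everything here is an illustrative assumption rather than AXME's actual API: `ToyPlatform` and `make_agent` are hypothetical names, a plain dict stands in for PostgreSQL, and the `COMPLETED` status is invented for the example.

```python
class ToyPlatform:
    """Stands in for the durable side. A dict replaces PostgreSQL here."""

    def __init__(self):
        self.intents = {}

    def submit(self, intent_id, steps):
        self.intents[intent_id] = {"status": "IN_PROGRESS", "steps": steps, "done": []}

    def record_step(self, intent_id, step):
        self.intents[intent_id]["done"].append(step)

    def complete(self, intent_id):
        self.intents[intent_id]["status"] = "COMPLETED"

    def deliver(self, intent_id, agent):
        """Hand the intent and its recorded progress to an agent."""
        intent = self.intents[intent_id]
        if intent["status"] == "IN_PROGRESS":
            agent(self, intent_id, intent["steps"], list(intent["done"]))
        return intent["status"]


def make_agent(crash_on=None):
    """Build an agent that keeps nothing in memory between deliveries."""

    def agent(platform, intent_id, steps, done):
        for step in steps:
            if step in done:
                continue  # the platform already recorded this step, skip it
            if step == crash_on:
                raise MemoryError(f"simulated OOM during {step}")
            platform.record_step(intent_id, step)
        platform.complete(intent_id)

    return agent
```

Deliver the intent to `make_agent(crash_on="load")`, catch the `MemoryError`, then deliver it again to a fresh `make_agent()`: the second agent sees three steps already done and runs only the last one.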

In Code

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/pipeline-agent",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})

result = client.wait_for(intent_id)

No PostgresSaver. No checkpoint DB. No schema migrations. No serialization code.

If the agent crashes, the intent stays at its current state. Restart the agent and the platform redelivers the intent, up to 3 attempts by default.
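Bounded redelivery is easy to picture as a loop. The sketch below only illustrates the idea and is not how AXME implements it: `deliver_with_retries` is a hypothetical helper, the `COMPLETED` and `FAILED` status strings are invented, and a real platform would redeliver across process restarts rather than inside one function call.

```python
def deliver_with_retries(handler, intent, max_attempts=3):
    """Call handler(intent) until it succeeds or the attempts run out."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = handler(intent)
        except Exception as exc:
            # Record the failure and redeliver the same intent.
            errors.append(f"attempt {attempt}: {exc}")
            continue
        return {"status": "COMPLETED", "attempts": attempt, "result": result}
    return {"status": "FAILED", "attempts": max_attempts, "errors": errors}
```

The cap matters: without it, an intent that always crashes the agent would be redelivered forever.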

How It Compares

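|                    | LangGraph              | CrewAI      | Swarm      | AXME      |
|--------------------|------------------------|-------------|------------|-----------|
| Persistence        | PostgresSaver (opt-in) | None        | None       | Default   |
| Checkpoint code    | 20+ lines              | N/A         | N/A        | 0 lines   |
| DB setup           | You manage             | N/A         | N/A        | Managed   |
| Resume after crash | From last checkpoint   | Start over  | Start over | Automatic |
| Cross-machine      | No (local state)       | No          | No         | Yes       |
| Framework lock-in  | LangGraph only         | CrewAI only | Swarm only | Any       |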

The Key Insight

Durability shouldn't be opt-in. It should be the default.

When you send an HTTP request to Stripe, you don't write checkpoint code in case your process crashes mid-request. You attach an idempotency key and retry; Stripe keeps the durable state on its side, so a retry does not repeat the operation.

Agent operations should work the same way. Submit the operation. The platform guarantees it completes - through crashes, restarts, network failures.
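The server-side half of that contract, deduplicating retries by idempotency key, fits in a few lines. This is a toy sketch of the general technique, not Stripe's or AXME's implementation: `IdempotentEndpoint` is a hypothetical class, and the in-memory dict stands in for durable storage.

```python
class IdempotentEndpoint:
    """Runs an operation at most once per idempotency key."""

    def __init__(self, operation):
        self.operation = operation
        self.results = {}    # idempotency key -> stored result
        self.executions = 0  # how many times the operation really ran

    def handle(self, key, payload):
        if key in self.results:
            # A retry of a request we already served: replay the stored result.
            return self.results[key]
        self.executions += 1
        result = self.operation(payload)
        self.results[key] = result
        return result
```

A client that crashes after sending and retries with the same key gets the original result back, and the operation runs once. A production version would also need to store the key durably and handle two requests with the same key arriving at the same time.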

Try It

Working example - submit a 4-step pipeline, kill the agent mid-processing, restart, watch it pick up where it left off:

github.com/AxmeAI/ai-agent-checkpoint-and-resume

Built with AXME - durable execution for agent operations. Alpha - feedback welcome.
