George Belsky

Your AI Agent Crashed at Step 47. Why Isn't Crash Recovery the Default?

Your agent is running a 50-step data pipeline. Extract, validate, transform, deduplicate, load. 25 minutes in.

Step 47. OOM killed. Process gone. 25 minutes of work gone.

You restart the agent. It starts from step 1.

The "You Should Have Configured It" Problem

Every framework has an answer for this. And every answer is the same: you should have set it up before the crash.

# LangGraph - opt-in persistence
from langgraph.checkpoint.postgres import PostgresSaver

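# from_conn_string is a context manager in current langgraph-checkpoint-postgres releases
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use
    graph = builder.compile(checkpointer=checkpointer)  # forgot this? start over.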

# CrewAI - limited state management
# "Failures typically require restart"

# Swarm - no persistence at all
# State exists only in memory

# Raw Python - hope you wrote your own

The pattern is consistent: durability is an add-on. Something you bolt on after you build the agent. Something you forget until the first crash.

And the checkpoint code is never simple. With LangGraph's PostgresSaver you also manage database connections, schema migrations when LangGraph updates, cleanup of old checkpoints, serialization errors when state objects change shape, and resume logic. That's 30-50 lines of infrastructure code unrelated to what your agent actually does.
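
To make that hidden cost concrete, here is a minimal sketch of hand-written checkpointing in raw Python. It uses a local JSON file instead of Postgres, and the names `load_checkpoint`, `save_checkpoint`, and `run_pipeline` are hypothetical, chosen only for illustration. A production version would still need the connection handling, migrations, and cleanup listed above.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def save_checkpoint(path, state):
    """Write to a temp file first so a crash mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # swaps the file in one step (atomic on POSIX)

def run_pipeline(steps, path):
    """Run steps in order, skipping any that a previous run already completed."""
    state = load_checkpoint(path)
    for index in range(state["next_step"], len(steps)):
        state["results"].append(steps[index]())  # results must be JSON-serializable
        state["next_step"] = index + 1
        save_checkpoint(path, state)
    return state["results"]
```

Even this toy version has to decide when to write, how to avoid torn files, and what shape the saved state takes. None of it is about the pipeline itself.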

Why Durability Should Be the Default

Think about how you use Stripe. You don't write checkpoint code in case your server crashes mid-payment. Stripe handles it - idempotency keys, retry logic, durable state on their side.
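
The idempotency-key idea is easy to sketch. The toy `PaymentService` below is hypothetical and is not Stripe's API. It only illustrates how a server that remembers each key lets a client retry after a crash without repeating the side effect.

```python
import uuid

class PaymentService:
    """Toy server that stores one result per idempotency key (in memory)."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result

service = PaymentService()
key = str(uuid.uuid4())  # the client generates the key once per logical payment

first = service.charge(key, 1999)
retry = service.charge(key, 1999)  # e.g. the client crashed and retried
```

Here `retry` is the same stored result as `first`, and `service.charges_made` stays at 1. The durable state lives on the service side, which is the property the rest of this post argues agents should get by default.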

Durability built into the service is the norm for infrastructure you depend on. Agent operations are the exception. The one place where durability is still opt-in. Still your problem.

Agent Stateless, Platform Stateful

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.pipeline.process.v1",
    "to_agent": "agent://myorg/production/data-pipeline",
    "payload": {
        "pipeline": "etl-customers",
        "steps": ["extract", "validate", "transform", "load"],
        "total_rows": 500000,
    },
})
result = client.wait_for(intent_id)

No PostgresSaver. No checkpoint database. No serialization code.

The state lives in the platform. The agent is stateless. When the agent crashes:

  1. The intent stays at its current state in PostgreSQL
  2. The agent restarts (Cloud Run, Kubernetes, whatever)
  3. The platform redelivers the intent
  4. The agent resumes from where it stopped

Up to 3 delivery attempts by default. Configurable per intent type.
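
The redelivery steps above can be sketched as a small simulation. This is not AXME's implementation or API: `IntentStore`, `deliver`, and `make_flaky_agent` are hypothetical names, and the in-memory object stands in for the PostgreSQL-backed state. Only the limit of 3 attempts comes from the text.

```python
MAX_ATTEMPTS = 3  # default delivery attempts mentioned above

class IntentStore:
    """Stands in for platform-side state; it survives agent crashes."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []  # steps finished so far
        self.attempts = 0
        self.status = "pending"

def deliver(store, agent):
    """Redeliver the intent until it completes or attempts run out."""
    while store.attempts < MAX_ATTEMPTS and store.status != "completed":
        store.attempts += 1
        try:
            agent(store)
            store.status = "completed"
        except Exception:
            store.status = "failed" if store.attempts == MAX_ATTEMPTS else "pending"
    return store.status

def make_flaky_agent(crash_at):
    """Build an agent that crashes once, just before running `crash_at`."""
    crashed = []  # failure injection for the simulation, not agent state

    def agent(store):
        # Resume point: skip whatever the store already marks as completed.
        for step in store.steps[len(store.completed):]:
            if step == crash_at and not crashed:
                crashed.append(True)
                raise RuntimeError("simulated crash")
            store.completed.append(step)  # progress is recorded platform-side
    return agent

store = IntentStore(["extract", "validate", "transform", "load"])
status = deliver(store, make_flaky_agent("transform"))
```

In this run the first delivery crashes before `transform`, and the second picks up at `transform` because the completed list lives in the store rather than in the agent process.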

The Real Cost of Opt-In Durability

It's not just the code. It's the incidents. The agent that crashed at record 98k of 100k and started over. The deployment pipeline that failed at step 9, re-ran all 10, and double-deployed services 1 through 9. The enrichment job that crashed and hit the same API 50,000 times on restart.

These incidents don't happen because teams are careless. They happen because teams were busy building the product and hadn't gotten to the checkpoint code yet.

Comparison

| | LangGraph | CrewAI | AXME |
|---|---|---|---|
| Durability | Opt-in (PostgresSaver) | None | Default |
| Checkpoint code | 30-50 lines | N/A | 0 |
| DB management | You operate | N/A | Managed |
| Resume after crash | From last checkpoint | Start over | Automatic redelivery |
| Cross-machine | No (state is local) | No | Yes (state in platform) |
| Framework lock-in | LangGraph only | CrewAI only | Any framework |

Try It

Working example - submit a multi-step pipeline, kill the agent mid-processing, restart it, watch it resume automatically:

github.com/AxmeAI/ai-agent-checkpoint-and-resume

Built with AXME - durable execution for agent operations. Alpha - feedback welcome.
