DEV Community

Vatryok

We Gave Our AI Agent a Memory. Here Is What Broke.

Building an LLM agent pipeline feels straightforward until step four of twelve calls a model that times out, the worker dies, and you realize you just spent $3.20 on tokens that are now gone.
No checkpoint. No resume. Just a dead process and an empty response.
This is the part of agent development that nobody mentions in the happy-path demo. The real problem is not the prompt; it is what happens to the execution when the infrastructure does not cooperate.

The specific failure mode
An agent loop is just a workflow. It calls a model, gets a response, maybe calls a tool, feeds the result back, calls the model again. Ten steps, maybe twenty. Each step costs time and money.
If any step fails and you have no checkpointing, you restart from zero. Every time.
At development scale this is annoying. At production scale, with parallel agent runs across hundreds of user requests, it compounds fast.
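To make the failure mode concrete, here is a minimal sketch of a checkpoint-free agent loop. Everything in it is a stand-in: `call_model` fakes a billed LLM call and logs each invocation, and the timeout at step 4 is hard-coded to show what a restart costs.

```python
call_log = []

def call_model(step):
    # stand-in for a real (billed) LLM call; logs every invocation
    call_log.append(step)
    if step == 4 and call_log.count(4) == 1:
        raise TimeoutError("provider timeout")  # first attempt at step 4 dies
    return f"output-{step}"

def run_pipeline(steps=5):
    # naive loop: no checkpointing, so a crash discards all prior outputs
    state = None
    for step in range(1, steps + 1):
        state = call_model(step)
    return state

try:
    run_pipeline()
except TimeoutError:
    pass  # the worker "dies" here, taking steps 1-3 with it

result = run_pipeline()  # restart from zero: steps 1-3 are billed again
```

After the restart, `call_log` shows steps 1 through 3 invoked twice each. That double billing is exactly the compounding cost at production scale.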

What we actually needed
Not a retry mechanism. Retries re-run the whole thing.
Not a message broker. That solves delivery, not state.
What we needed was atomic step checkpointing: write the output before marking the step done, so a crashed worker that restarts picks up exactly where it stopped.
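The pattern itself fits in a few lines. This is not Gravtory's implementation, just a sketch of the idea using SQLite as a stand-in for your existing database; `run_step` and `expensive_model_call` are invented names:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the Postgres you already run
db.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints "
    "(run_id TEXT, step INTEGER, output TEXT, PRIMARY KEY (run_id, step))"
)

calls = 0

def expensive_model_call():
    # stand-in for an LLM call you do not want to pay for twice
    global calls
    calls += 1
    return "model output"

def run_step(run_id, step, fn):
    row = db.execute(
        "SELECT output FROM checkpoints WHERE run_id=? AND step=?",
        (run_id, step),
    ).fetchone()
    if row:
        return row[0]  # checkpoint hit: replay the stored output, skip the call
    output = fn()
    with db:  # one transaction: the output lands before the step counts as done
        db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)", (run_id, step, output))
    return output

first = run_step("run-1", 1, expensive_model_call)   # executes, then checkpoints
second = run_step("run-1", 1, expensive_model_call)  # replays; no second call
```

The transaction is what makes it atomic: a worker that crashes mid-step leaves no checkpoint row, so the step re-runs; a worker that crashes after the commit finds the row on resume and never re-bills the call.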

Gravtory

Gravtory is a Python library that backs durable workflow execution with your existing database. No broker, no orchestration server, no new infrastructure. Just your Postgres or whatever you are already running.

@grav.workflow(id="agent-{run_id}")
class AgentWorkflow:

    @step(1)
    @llm_step(model="claude-sonnet-4-20250514", fallback="gpt-4o")
    async def reason(self, prompt: str) -> str:
        return prompt

    @step(2, depends_on=1)
    async def call_tool(self, result: str) -> dict:
        return await tool_router.execute(result)

    @step(3, depends_on=2)
    @llm_step(model="claude-sonnet-4-20250514")
    async def synthesize(self, tool_output: dict) -> str:
        return f"Given this tool result: {tool_output}, produce final answer"

If the process dies after step one, step one does not repeat. The model response was checkpointed. On resume, it is loaded from the database and passed forward.
Token usage is tracked per step. You know exactly what each step cost, per run.
Model fallback is built in. If the primary provider fails, it moves to the fallback automatically, and the fallback call is also checkpointed.
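How Gravtory wires this internally is not shown here, but the fallback pattern generalizes. A sketch, with `flaky_primary`, `fallback_model`, and the `usage` dict all as invented stand-ins: try the primary provider, route the same prompt to the fallback on failure, and record which provider answered (and its token count) per step, so whichever response succeeded is what gets checkpointed.

```python
usage = {}  # hypothetical per-step accounting: (provider, tokens)

def flaky_primary(prompt):
    # stand-in for a primary provider that is currently down
    raise ConnectionError("primary provider unavailable")

def fallback_model(prompt):
    # stand-in fallback; returns (text, tokens) like a real client might
    return "fallback answer", 42

def call_with_fallback(step, prompt, primary, fallback):
    # try the primary provider; on failure, send the same prompt to the fallback
    try:
        text, tokens = primary(prompt)
        usage[step] = ("primary", tokens)
    except Exception:
        text, tokens = fallback(prompt)
        usage[step] = ("fallback", tokens)
    return text  # whichever call succeeded is the value worth checkpointing

answer = call_with_fallback(1, "summarize the tool output", flaky_primary, fallback_model)
```

Checkpointing the fallback's response matters: on resume, the run replays the answer it actually got, rather than retrying a primary provider that may still be down.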

The thing that actually saved us
The testing framework. In-memory, no database required, with crash simulation:

runner.simulate_crash_after(step=1)
result = await runner.run(AgentWorkflow, run_id="test")
result = await runner.resume("agent-test")
assert result.steps[1].was_replayed  # model was not called twice
Being able to write that assertion in a unit test, without spinning up any infrastructure, changed how we thought about reliability in the agent layer. It moved crash recovery from a thing you hope works to a thing you can verify.

One line to get started

pip install gravtory[postgres]

Source: github.com/vatryok/Gravtory
