Agent State Desync: Why Your Agent Forgets and How to Fix It

#ai #llm #python #agents

Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You give your agent a task: "book the Lisbon trip, stay under €900, prefer window seats." It searches flights. It calls the hotel API. It reserves a seat. On step seven, the pod restarts. A deploy, an OOM kill, a spot instance reclaimed. Does not matter which.

The user retries. The agent starts over from an empty messages list. It searches flights again. It calls the hotel API again. It reserves a second seat, because the first reservation happened in a process that no longer exists and nothing wrote it down. The agent has amnesia, and the amnesia just cost your user a duplicate booking.

This is state desync. The model's context and the real state of the task drifted apart, and the agent believed the wrong one.

Two kinds of state, and only one of them is durable

There are two things people call "agent state," and conflating them is where the bug lives.

The first is the model context: the messages list you send on each turn. System prompt, user turns, tool calls, tool results. This is short-term memory. It rides in the prompt and it lives in RAM.

The second is the real state of the task: which flights you already searched, which reservation you already made, which €340 you already spent. Some of that lives in the messages list too, but the authoritative copy lives in the outside world: in Stripe, in the hotel's booking system, in a row in your database.

When those two agree, the agent works. When they drift, the agent either repeats work it already did or skips work it never did. A crash is the fastest way to make them drift, because a crash wipes the RAM copy and leaves the world copy untouched.

The fix is not more memory. The fix is durability: write the trajectory somewhere that survives the process, and rebuild from it when the process comes back.

The scratchpad is the thing you actually need to persist

Inside the loop, the agent keeps a scratchpad: the Thought-Action-Observation history of the current task. Step one, I searched flights, I got these three. Step two, I picked the 09:40, I reserved seat 14A. The user never sees this. It exists so that on step seven the agent remembers what it did on step three.

That scratchpad is the state that matters. If it dies with the process, the agent forgets. So model it as an explicit object, not as a slice of the messages array you hope stays intact.

# scratchpad.py
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    thought: str
    action: str       # tool name
    args: dict
    observation: str


@dataclass
class Scratchpad:
    task_id: str
    task: str
    steps: list[Step] = field(default_factory=list)

    def record(self, thought, action, args, obs):
        self.steps.append(
            Step(thought, action, args, obs)
        )

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "Scratchpad":
        steps = [Step(**s) for s in d["steps"]]
        return cls(d["task_id"], d["task"], steps)

to_dict and from_dict are the whole point. A scratchpad you can serialize is a scratchpad you can checkpoint. One you can only hold in memory is one you lose on the next OOM kill.
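A quick way to convince yourself the round trip holds, as a minimal sketch: the block below repeats the two dataclasses from scratchpad.py so it runs on its own, pushes a scratchpad through json.dumps and json.loads, and compares the result with the original. The task ID and the step values are made up for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    thought: str
    action: str
    args: dict
    observation: str


@dataclass
class Scratchpad:
    task_id: str
    task: str
    steps: list[Step] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "Scratchpad":
        steps = [Step(**s) for s in d["steps"]]
        return cls(d["task_id"], d["task"], steps)


# Hypothetical task and step, for illustration only.
pad = Scratchpad("trip-42", "book the Lisbon trip")
pad.steps.append(
    Step("need flights", "search_flights", {"to": "LIS"}, "3 results")
)

wire = json.dumps(pad.to_dict())          # the string a checkpoint would store
restored = Scratchpad.from_dict(json.loads(wire))

# Dataclass equality compares field by field, nested steps included.
round_trip_ok = restored == pad
```

The one constraint this leans on is that args and observation stay JSON-serializable. A tool that returns a datetime or a custom object needs converting before it goes into the scratchpad.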

Checkpoint after every step, not at the end

The naive instinct is to save state when the task finishes. That is exactly backwards. A task that finished does not need a checkpoint. The one that dies on step seven does, and it dies before it finishes, so you have to write after every step.

Here is a durable store backed by SQLite. Swap it for Postgres or Redis in production; the shape is the same.

# checkpoint.py  -- stdlib only
import json
import sqlite3
from scratchpad import Scratchpad

_DDL = """
CREATE TABLE IF NOT EXISTS checkpoints (
    task_id TEXT PRIMARY KEY,
    state   TEXT NOT NULL,
    updated INTEGER NOT NULL
)
"""


class CheckpointStore:
    def __init__(self, path: str = "agent.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(_DDL)
        self.db.commit()

    def save(self, pad: Scratchpad) -> None:
        self.db.execute(
            "REPLACE INTO checkpoints VALUES "
            "(?, ?, strftime('%s','now'))",
            (pad.task_id, json.dumps(pad.to_dict())),
        )
        self.db.commit()

    def load(self, task_id: str) -> Scratchpad | None:
        row = self.db.execute(
            "SELECT state FROM checkpoints "
            "WHERE task_id = ?",
            (task_id,),
        ).fetchone()
        if row is None:
            return None
        return Scratchpad.from_dict(json.loads(row[0]))

Two details earn their place. REPLACE INTO on a task_id primary key means each save overwrites the last checkpoint for that task, so the store holds one row per in-flight task, not a growing log. And save() commits every call. That commit is the durability boundary. If the process dies one instruction after it returns, the checkpoint is on disk.
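Both details are easy to check in isolation. The sketch below uses the same table and the same REPLACE INTO statement against a throwaway database file, saves twice for one task, then reopens the file from a second connection. Closing and reopening is only a stand-in for a process restart, not a real crash test, and the task ID and state payloads are invented for the example.

```python
import json
import os
import sqlite3
import tempfile

DDL = """
CREATE TABLE IF NOT EXISTS checkpoints (
    task_id TEXT PRIMARY KEY,
    state   TEXT NOT NULL,
    updated INTEGER NOT NULL
)
"""

# Throwaway file so the example leaves agent.db alone.
path = os.path.join(tempfile.mkdtemp(), "agent.db")


def save(db, task_id, state):
    db.execute(
        "REPLACE INTO checkpoints VALUES (?, ?, strftime('%s','now'))",
        (task_id, json.dumps(state)),
    )
    db.commit()  # the durability boundary


db = sqlite3.connect(path)
db.execute(DDL)
db.commit()
save(db, "trip-42", {"steps": ["search_flights"]})
save(db, "trip-42", {"steps": ["search_flights", "reserve_seat"]})
db.close()  # stand-in for the process going away after the commit

# A new connection plays the role of the restarted process.
db2 = sqlite3.connect(path)
row_count = db2.execute("SELECT COUNT(*) FROM checkpoints").fetchone()[0]
row = db2.execute(
    "SELECT state FROM checkpoints WHERE task_id = ?", ("trip-42",)
).fetchone()
latest = json.loads(row[0])
db2.close()
```

Two saves, one row, and the row holds the second save. That is the one-row-per-task behavior, seen from the side of the process that comes back.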

Rebuild, then continue where you left off

Now the loop. On start, try to load a checkpoint. If one exists, replay it into the model context and keep going from the step after the last one recorded. If none exists, this is a fresh task.

# agent.py  -- pip install "anthropic==0.69.0"
from anthropic import Anthropic
from scratchpad import Scratchpad
from checkpoint import CheckpointStore

client = Anthropic()
store = CheckpointStore()
MODEL = "claude-sonnet-4-5"


def context_from(pad: Scratchpad) -> list[dict]:
    lines = [f"Task: {pad.task}"]
    for i, s in enumerate(pad.steps, 1):
        lines.append(
            f"{i}. {s.action}({s.args}) "
            f"-> {s.observation[:200]}"
        )
    return [{"role": "user",
             "content": "\n".join(lines)}]


# done(), decide(), and dispatch() are task-specific helpers
# not defined in this post.
def run(task_id: str, task: str) -> Scratchpad:
    pad = store.load(task_id) or Scratchpad(
        task_id=task_id, task=task,
    )
    if pad.steps:
        # Recorded steps are already done; continue with the next.
        print(f"resuming at step {len(pad.steps) + 1}")

    while not done(pad):
        resp = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=context_from(pad),
        )
        thought, action, args = decide(resp)
        obs = dispatch(action, args)
        pad.record(thought, action, args, obs)
        store.save(pad)          # durability boundary
    return pad

The line that matters is store.save(pad) immediately after pad.record(...). The scratchpad in memory and the scratchpad on disk are never more than one step apart. When the pod restarts and the user retries with the same task_id, store.load returns every step that was checkpointed before the crash, context_from replays them into the model's context, and the agent picks up at the next step instead of step one.

The model gets its memory back by reading state you owned, not state you hoped survived.
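The resume path can be exercised without calling a model at all. In this sketch, which is a simplification and not the loop above, a fixed PLAN list stands in for the model's decisions, a plain dict stands in for CheckpointStore, and a raised exception stands in for the pod restart. Every name in it is hypothetical. The crash lands between steps, so no step is left half-finished.

```python
# Scripted stand-in for the model: the action it would pick at each step.
PLAN = ["search_flights", "search_hotels", "reserve_seat", "book_hotel"]

durable = {}   # stands in for the checkpoint store; outlives the "crash"
calls = []     # every tool call that reached the outside world


class Crash(Exception):
    """Simulated pod restart."""


def run(task_id, crash_before_step=None):
    # Load the checkpoint if there is one, else start fresh.
    steps = list(durable.get(task_id, []))
    while len(steps) < len(PLAN):
        step_no = len(steps) + 1
        if step_no == crash_before_step:
            raise Crash(f"pod died before step {step_no}")
        action = PLAN[len(steps)]
        calls.append(action)             # dispatch the tool
        steps.append(action)             # record the step
        durable[task_id] = list(steps)   # checkpoint after every step
    return steps


try:
    run("trip-42", crash_before_step=3)  # first attempt dies after two steps
except Crash:
    pass

result = run("trip-42")                  # the retry, same task_id
```

After the retry, calls holds each action exactly once: the second run started at step three because the first two were already in durable. Move the durable[task_id] assignment outside the loop and the retry repeats both searches.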

The trap: side effects that already happened

Rebuilding the scratchpad closes the gap between context and your own storage. It does not close the gap between your storage and the outside world. That seat reservation on step six was a real API call with a real effect. If the checkpoint was written, replaying the scratchpad tells the agent it already reserved a seat. If the crash landed between the API call and the store.save, the checkpoint has no record of step six, the resumed agent runs it again, and nothing in the scratchpad can un-reserve the duplicate.

The honest fix is idempotency at the effect boundary. Every tool call that mutates the world carries a key derived from the task and the step, and the downstream system dedupes on it.

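# booking_api stands in for your booking client. step is this
# call's position in the scratchpad, so a replay builds the same key.
def reserve_seat(task_id, step, flight, seat):
    key = f"{task_id}:{step}:reserve"
    return booking_api.reserve(
        flight, seat, idempotency_key=key,
    )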

Now a replayed step six reaches the booking API with the same key it used the first time, and the API returns the original reservation instead of making a second one. Stripe accepts an idempotency key for exactly this, many payment and booking APIs offer an equivalent, and any internal API you control should too. Checkpointing gives the agent its memory back. Idempotency keys keep that memory from turning into duplicate charges.
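To see the dedupe from the receiving side, here is a minimal sketch of a downstream service that honors the key. FakeBookingAPI is a hypothetical in-memory stand-in, not any real booking client, and the flight number is invented. The two reserve_seat calls model the bad case: the first call succeeds, the process dies before the checkpoint, and the resumed agent runs step six again.

```python
class FakeBookingAPI:
    """Hypothetical downstream service that dedupes on idempotency_key."""

    def __init__(self):
        self._by_key = {}        # idempotency_key -> stored response
        self.reservations = []   # what actually exists in the "world"

    def reserve(self, flight, seat, idempotency_key):
        # A key seen before returns the stored response, with no new effect.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        reservation = {
            "id": len(self.reservations) + 1,
            "flight": flight,
            "seat": seat,
        }
        self.reservations.append(reservation)
        self._by_key[idempotency_key] = reservation
        return reservation


booking_api = FakeBookingAPI()


def reserve_seat(task_id, step, flight, seat):
    # Same task and same step give the same key on every replay.
    key = f"{task_id}:{step}:reserve"
    return booking_api.reserve(flight, seat, idempotency_key=key)


# Step six runs, then the process dies before the checkpoint is written.
first = reserve_seat("trip-42", 6, "TP1351", "14A")
# The resumed agent has no record of step six, so it runs it again.
replayed = reserve_seat("trip-42", 6, "TP1351", "14A")
```

Both calls return the same reservation and booking_api.reservations holds one entry. The sketch returns the stored response without comparing arguments. A real service may instead reject a reused key whose parameters differ, which matters if the model picks a different seat on the replay.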

The rule

Model context is bounded by the window and lives in RAM. Real state is bounded by your storage and the outside world. A crash wipes the first and leaves the second, and desync is what you get when the agent trusts the wrong one on resume.

Serialize the scratchpad. Checkpoint after every step, not at the end. Rebuild from the checkpoint on restart. Put an idempotency key on every tool call that touches the world. Do those four things and a mid-trajectory crash becomes a resume instead of a duplicate booking.

If you are building agents that have to survive real infrastructure (pods that restart, deploys mid-task, spot instances that vanish), Agents in Production is the book for the loop, the checkpointing, and the idempotency patterns above. Its companion, Observability for LLM Applications, is where you learn to trace the resume so you can see, after the fact, exactly which step the agent replayed and whether it double-fired an effect. Together they are The AI Engineer's Library.