I Built a Memory Checkpoint System for My AI Agents (Stop Losing Context Mid-Task)

The Problem

Every AI agent operator knows this feeling: you set up a complex multi-step task, step away, and come back to find your agent has lost the thread entirely. It starts re-explaining things it already understood, contradicts itself, or just freezes up because the context window got crowded.

I faced this constantly. My agents would hallucinate solutions to problems that had already been solved, or worse — silently skip steps because they couldn't fit everything in context.

The Fix: Stateful Memory Checkpoints

I built a lightweight checkpoint system that lets agents save their progress at key decision points, then resume cleanly. Think of it like a game save — the agent can restore to a known good state instead of starting from scratch.

Here's the core pattern:

import json
from datetime import datetime

class AgentCheckpoint:
    def __init__(self, agent_id: str, checkpoint_dir: str = "checkpoints"):
        self.agent_id = agent_id
        self.checkpoint_dir = checkpoint_dir
        self.state = {}

    def save(self, step_name: str, memory: dict, decisions: list):
        checkpoint = {
            "agent_id": self.agent_id,
            "step": step_name,
            "timestamp": datetime.utcnow().isoformat(),
            "memory": memory,
            "decisions": decisions,
            "context_length": len(str(memory))
        }
        path = f"{self.checkpoint_dir}/{self.agent_id}_{step_name}.json"
        with open(path, "w") as f:
            json.dump(checkpoint, f, indent=2)
        return path

    def restore(self, step_name: str) -> dict:
        path = f"{self.checkpoint_dir}/{self.agent_id}_{step_name}.json"
        with open(path, "r") as f:
            return json.load(f)

    def prune_old_checkpoints(self, keep_last: int = 5):
        import os
        import glob
        checkpoints = sorted(glob.glob(f"{self.checkpoint_dir}/{self.agent_id}_*.json"))
        for old in checkpoints[:-keep_last]:
            os.remove(old)

How I Use It

Before each major decision, my agent calls checkpoint.save(). If something goes wrong downstream, it can call checkpoint.restore() to get back to that exact moment — complete with memory state and decision history.

The prune_old_checkpoints method keeps disk usage manageable for long-running agents.

Results

After adding this to my production agents:

Context errors dropped by ~60% — agents stopped repeating work
Recovery time after failures went from minutes to seconds — restore instead of re-explain
Debugging became trivial — I could read any checkpoint file to see exactly what the agent knew at any moment

Get the Full Toolkit

This checkpoint system is part of my AI agent tools catalog — utilities I built to solve real operator problems. You can explore the full collection here:

Full catalog of my AI agent tools at https://thebookmaster.zo.space/bolt/market