From 45 Minutes to 3: Automated Testing for AI Agent Memory

#python #programming

At 2 AM, a colleague dropped a message in the group chat: “Our agent messed up the VIP client’s budget by an order of magnitude.” It wasn’t an LLM hallucination. The long-term memory layer had silently swallowed a record, and our manual regression tests never touched that edge case.

How reliable an AI agent’s memory is isn’t something you can gauge by feel. It needs to be verified relentlessly and automatically. If you’re still clicking through conversations and manually inspecting databases to confirm that your agent remembers correctly, this article shows how to turn memory verification into an automated pipeline with GitHub Actions, and the peace that settles over your world once you do.

Breaking down the problem: why manual memory testing is almost like no testing at all

Our agent handles multi-turn user conversations. It uses LangChain’s ConversationBufferMemory for short-term memory, plus a long-term memory layer that persists critical information in SQLite (we started with Chroma, then switched to cut complexity). Whenever a user mentions a quote, preference, or constraint, the agent calls a tool to write it into long-term memory, and retrieves it in the next conversation.

The trouble starts during iteration: we frequently tweak prompts, swap memory strategies, and adjust retrieval thresholds. Every change meant re-running an entire set of conversation scenarios, then validating by hand in the database: was the record written? Was it duplicated? Was stale information expired and cleaned up? It is tedious work, clicking through dialogues one by one and checking rows one by one. A full round took at least 45 minutes, and there was no way I could think of every edge case every time.

The root cause is obvious: verifying memory storage depends on state, and in a small team manual testing simply cannot guarantee that every state combination is covered. Regular unit tests don’t help either, because memory persists across sessions — you need an integration test environment and a clean initial state.

Designing the solution: let GitHub Actions be that “never-annoyed quality inspector”

The idea is brutally simple:

On every push or PR, the runner spins up an isolated, deterministic test memory store (a SQLite file, zero external dependencies).
A Python test script runs a series of simulated multi-turn conversations. After each turn it directly verifies the underlying stored data — not just comparing what the agent said, but asserting what records exist in the database.
After the tests, the SQLite file is thrown away. No external services are involved, so the environment stays clean.
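
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `fake_agent_turn` is a hypothetical stand-in for the real agent (it just pulls a budget figure out of the message with a regex), and the table is a stripped-down version of the real schema. The point is the shape of the test: temporary store, one assertion target per turn, then everything is discarded.

```python
import re
import sqlite3
import tempfile
from pathlib import Path


def open_store(db_path):
    """Create a throwaway SQLite store with a simplified memories table."""
    conn = sqlite3.connect(str(db_path))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        " session_id TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
        " UNIQUE(session_id, key))"
    )
    return conn


def fake_agent_turn(conn, session_id, user_message):
    """Hypothetical agent stand-in: remembers any 'budget [is] <number>' it sees."""
    match = re.search(r"budget\s+(?:is\s+)?(\d+)", user_message, re.IGNORECASE)
    if match:
        conn.execute(
            "INSERT INTO memories (session_id, key, value) VALUES (?, 'budget', ?) "
            "ON CONFLICT(session_id, key) DO UPDATE SET value = excluded.value",
            (session_id, match.group(1)),
        )
        conn.commit()


def stored_rows(conn, session_id):
    """Read the raw rows so the test asserts on storage, not on the agent's reply."""
    return conn.execute(
        "SELECT key, value FROM memories WHERE session_id = ? ORDER BY key",
        (session_id,),
    ).fetchall()


def run_scenario():
    """Run a three-turn conversation and snapshot the table after every turn."""
    messages = ["Hi there", "Our budget is 50000", "Actually the budget is 500000"]
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_store(Path(tmp) / "memory.db")
        try:
            snapshots = []
            for message in messages:
                fake_agent_turn(conn, "vip-1", message)
                snapshots.append(stored_rows(conn, "vip-1"))
            return snapshots
        finally:
            conn.close()  # close before the temp directory is removed
```

Each snapshot is what the database held after that turn, so a swallowed or duplicated record shows up as a failed comparison instead of a 2 AM message.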

Why not other approaches?

Spinning up services with Docker Compose: Adding Chroma/Postgres would slow CI down, and we’re only testing business logic — no reason to introduce the non-determinism of a real vector database.
Only mocking database calls: That bypasses real SQL/vector retrieval logic, rendering the tests meaningless. We genuinely want to verify “it was really written, it was really read back.”
UI automation only: That’s yet another maintenance hell. Testing the storage layer directly is small, stable, and costs almost nothing to maintain.
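
To make the mocking point concrete, here is a small sketch of how a mocked connection can hide a broken statement. The `save_memory` helper and the deliberately incomplete table are assumptions made up for this illustration, not our production code. The upsert targets `(session_id, key)`, but the table has no matching UNIQUE constraint.

```python
import sqlite3
from unittest.mock import MagicMock

UPSERT_SQL = (
    "INSERT INTO memories (session_id, key, value) VALUES (?, ?, ?) "
    "ON CONFLICT(session_id, key) DO UPDATE SET value = excluded.value"
)


def save_memory(conn, session_id, key, value):
    """Hypothetical write path that both checks below exercise."""
    conn.execute(UPSERT_SQL, (session_id, key, value))
    conn.commit()


def check_with_mock():
    """A mock only proves execute() was called, not that SQLite accepts the SQL."""
    conn = MagicMock()
    save_memory(conn, "s1", "budget", "50000")
    conn.execute.assert_called_once()
    return "passed"


def check_with_real_sqlite():
    """The same call against a real (in-memory) database surfaces the schema bug."""
    conn = sqlite3.connect(":memory:")
    # Deliberately missing UNIQUE(session_id, key), so the ON CONFLICT target is invalid
    conn.execute("CREATE TABLE memories (session_id TEXT, key TEXT, value TEXT)")
    try:
        save_memory(conn, "s1", "budget", "50000")
        return "passed"
    except sqlite3.OperationalError as exc:
        return f"failed: {exc}"
    finally:
        conn.close()
```

The mocked check reports success while the real database rejects the statement, which is exactly the kind of gap we wanted the tests to close.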

So the final setup: a pure Python script + pytest + tmp_path to create a temporary SQLite file + the default GitHub Actions Ubuntu runner. No Docker, no cloud services. The CI configuration is clean, under 40 lines.
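
As a rough sketch of what such tests can look like, here are two pytest-style functions built on pytest's built-in `tmp_path` fixture. The helpers `make_store`, `upsert`, and `retrieve` are simplified, hypothetical functions written for this example so that it stands alone; they mirror the idea of the real store but are not its code.

```python
import sqlite3


def make_store(db_path):
    """Create a fresh SQLite file with a simplified memories table."""
    conn = sqlite3.connect(str(db_path))
    conn.execute("""
        CREATE TABLE memories (
            session_id TEXT NOT NULL,
            key TEXT NOT NULL,
            value TEXT NOT NULL,
            created_at TIMESTAMP DEFAULT (datetime('now')),
            ttl_seconds INTEGER DEFAULT 86400,
            UNIQUE(session_id, key)
        )
    """)
    return conn


def upsert(conn, session_id, key, value, ttl_seconds=86400):
    """Insert the memory, or overwrite it if (session_id, key) already exists."""
    conn.execute(
        "INSERT INTO memories (session_id, key, value, ttl_seconds) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(session_id, key) DO UPDATE SET "
        "value = excluded.value, ttl_seconds = excluded.ttl_seconds, "
        "created_at = datetime('now')",
        (session_id, key, value, ttl_seconds),
    )
    conn.commit()


def retrieve(conn, session_id, key):
    """Return the stored value, or None if it is missing or past its TTL."""
    row = conn.execute(
        "SELECT value FROM memories WHERE session_id = ? AND key = ? "
        "AND datetime(created_at, '+' || ttl_seconds || ' seconds') > datetime('now')",
        (session_id, key),
    ).fetchone()
    return row[0] if row else None


def test_upsert_overwrites_instead_of_duplicating(tmp_path):
    # tmp_path is a per-test temporary directory that pytest provides
    conn = make_store(tmp_path / "memory.db")
    upsert(conn, "s1", "budget", "50000")
    upsert(conn, "s1", "budget", "500000")
    count = conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
    assert count == 1
    assert retrieve(conn, "s1", "budget") == "500000"
    conn.close()


def test_expired_memory_is_not_returned(tmp_path):
    conn = make_store(tmp_path / "memory.db")
    # A TTL of 0 seconds means the record is already expired when it is read back
    upsert(conn, "s1", "promo", "10% off", ttl_seconds=0)
    assert retrieve(conn, "s1", "promo") is None
    conn.close()
```

Because every test gets its own `tmp_path`, there is no shared state to reset between tests, and nothing is left behind on the runner.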

Core implementation

1. A testable memory store abstraction

The store takes its database path as a constructor argument, so tests can point it at a throwaway SQLite file with no production wiring involved.

# memory_store.py
import sqlite3
import json
from datetime import datetime, timezone

class MemoryStore:
    """Long-term memory store: handles writes, retrieval, and expiry cleanup."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.row_factory = sqlite3.Row
        self._init_table()

    def _init_table(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memories (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT NOT NULL,
                key TEXT NOT NULL,
                value TEXT NOT NULL,
                created_at TIMESTAMP DEFAULT (datetime('now')),
                ttl_seconds INTEGER DEFAULT 86400,
                -- Required by the ON CONFLICT(session_id, key) clause in upsert()
                UNIQUE(session_id, key)
            )
        """)
        self.conn.commit()

    def upsert(self, session_id: str, key: str, value: str, ttl_seconds: int = 86400):
        # Atomic upsert: update the row if the key already exists, otherwise insert it
        self.conn.execute("""
            INSERT INTO memories (session_id, key, value, ttl_seconds, created_at)
            VALUES (?, ?, ?, ?, datetime('now'))
            ON CONFLICT(session_id, key) DO UPDATE SET
                value = excluded.value,
                created_at = datetime('now'),
                ttl_seconds = excluded.ttl_seconds
        """, (session_id, key, value, ttl_seconds))
        self.conn.commit()

    def retrieve(self, session_id: str, key: str) -> str | None:
        row = self.conn.execute(
            "SELECT value FROM memories WHERE session_id = ? AND key = ? "
            "AND datetime(created_at, '+' || ttl_seconds || ' seconds') > datetime('now')",
            (session_id, key)
        ).fetchone()
        return row["va