Automating Agent Memory Regression with pytest & Vector DB: 5x Defect Discovery Speedup

#python #programming

It was 1 AM when my phone rang. "You updated the embedding model this afternoon, and now the agent’s memory retrieval is completely broken — every user session is spouting nonsense." I scrolled through the chat history and realized the issue: after the model change, a memory that used to get perfect hits was now ranked eighth. I had manually tested only two examples and thought everything was fine. While rolling back the change that night, I made a decision: regression testing for agent memory storage must be automated.

Breaking down the problem

Long-term memory in an AI agent boils down to “store vectors, query by similarity.” Facts, preferences, and context generated during user interactions are embedded into high-dimensional vectors and inserted into a vector database. Later, when the agent needs to make a decision, it converts the current question into a vector, retrieves the top‑N most similar memories, and feeds them as part of the prompt.
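To make "store vectors, query by similarity" concrete, here is a minimal pure-Python sketch. It only illustrates the concept; it is not the test strategy used later. The `ToyMemoryStore` class, the three-dimensional vectors, and the memory IDs are all made up for illustration, since real embeddings have hundreds of dimensions and come from a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyMemoryStore:
    """Hypothetical stand-in for a vector database: brute-force search, no index."""

    def __init__(self):
        self._items = []  # list of (memory_id, vector) pairs

    def add(self, memory_id, vector):
        self._items.append((memory_id, list(vector)))

    def top_n(self, query_vector, n=3):
        """Return the n most similar (memory_id, score) pairs, best first."""
        scored = [
            (memory_id, cosine_similarity(query_vector, vector))
            for memory_id, vector in self._items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n]


store = ToyMemoryStore()
store.add("likes_sichuan_food", [0.9, 0.1, 0.0])
store.add("drives_tesla", [0.0, 0.2, 0.9])
store.add("dislikes_cilantro", [0.8, 0.3, 0.1])

# A "food-like" query vector; the two food memories should outrank the car memory
results = store.top_n([1.0, 0.0, 0.0], n=2)
top_ids = [memory_id for memory_id, _ in results]
```

A real vector database replaces the brute-force loop with an approximate index, which is one of the reasons its ranking can shift in ways a toy like this never shows.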

The crux: semantic similarity is not an exact match. Every change to the embedding model, chunking strategy, or even a version bump of the vector database can silently alter retrieval results. Manual verification is a disaster — you throw a few keywords into the UI and eyeball whether the recalled results “look right.” This approach has three fatal flaws:

No quantification: “Looks about right” can’t be turned into an assertion; ranking shifts often go unnoticed.
Not repeatable: Each manual test uses different query terms, so missed regressions are the norm rather than the exception.
Maintenance cost explosion: Once the number of stored memories grows large, manually running a full regression is impossible — and therefore simply doesn’t happen.

Ordinary unit tests can check that a function “returns a list,” but they can’t verify that “the first item in the list is the sentence the user said last week.” We need a regression test suite that can make precise assertions about vector search results while being fast and repeatable.
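One way to turn "looks about right" into a number is to assert on the rank of the memory you expect. The sketch below is an assumption about how such a check could look: the helper names `rank_of` and `assert_within_top_k` and the two ID lists are hypothetical, not taken from the project.

```python
def rank_of(expected_id, ranked_ids):
    """Return the 1-based rank of expected_id in ranked_ids, or None if it is absent."""
    try:
        return ranked_ids.index(expected_id) + 1
    except ValueError:
        return None


def assert_within_top_k(expected_id, ranked_ids, k):
    """Fail with a readable message when expected_id is not among the first k results."""
    rank = rank_of(expected_id, ranked_ids)
    assert rank is not None and rank <= k, (
        f"{expected_id!r} expected within top {k}, got rank {rank} in {ranked_ids}"
    )


# Illustrative rankings for the same query before and after a model change
before_change = ["mem_2", "mem_6", "mem_1"]
after_change = ["mem_6", "mem_1", "mem_4", "mem_7", "mem_3", "mem_5", "mem_2"]

assert_within_top_k("mem_2", before_change, k=1)  # passes: rank 1

try:
    assert_within_top_k("mem_2", after_change, k=3)  # rank 7, so this fails
    regression_detected = False
except AssertionError:
    regression_detected = True
```

Because the check is a plain `assert`, pytest reports the failing rank and the full result order, which is exactly the information an eyeball test loses.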

Designing the solution

My goal was clear: drive real vector database queries through pytest, in a controlled environment, and validate retrieval accuracy and stability.

For the tech stack, I chose ChromaDB’s in-memory mode as the test vector store, with all-MiniLM-L6-v2 as a lightweight embedding model. Why not just compute cosine similarity with numpy and assert on that? Because we want to test real production behavior — the vector store’s indexing algorithm (e.g., HNSW), distance metric, and result ordering. None of these can be simulated by numpy. If we recalculate similarity ourselves, we’re testing a toy that never runs in production.

ChromaDB’s EphemeralClient fits pytest well: a function-scoped fixture creates a fresh in-memory collection for each test and deletes it afterwards, so nothing leaks between tests. Each test case gets its own pre-populated memories, runs its queries, and makes assertions in isolation.

I structured the solution in three layers:

Fixture layer: responsible for creating the Chroma client and collection, and inserting a standard set of test memories.
Case layer: uses @pytest.mark.parametrize to define different query scenarios and calls the agent’s memory retrieval function.
Assertion layer: validates the ID order of top‑k results, text prefixes, score thresholds, and fallback behavior for empty results.
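The case and assertion layers could be sketched roughly as follows. This is a hedged illustration, not the project's code: the helper names, the `QUERY_CASES` table, and every threshold are hypothetical placeholders. The result dictionary is hand-written in the nested-list shape that a Chroma `collection.query` call returns for a single query, so the block runs without a database. A table like `QUERY_CASES` is the kind of data you would pass to `@pytest.mark.parametrize`.

```python
# Case layer: (query, expected leading ids in order, max cosine distance for the best hit).
# All values are illustrative placeholders, not measured results.
QUERY_CASES = [
    ("What food does Zhang Ming like?", ["mem_2"], 0.6),
    ("What car does Zhang Ming drive?", ["mem_3"], 0.6),
]


# Assertion layer: small helpers that read a Chroma-shaped query result.
def assert_top_ids(result, expected_ids):
    got = result["ids"][0][: len(expected_ids)]
    assert got == expected_ids, f"expected leading ids {expected_ids}, got {got}"


def assert_best_distance_at_most(result, max_distance):
    distances = result["distances"][0]
    assert distances, "query returned no results"
    assert distances[0] <= max_distance, (
        f"best distance {distances[0]} exceeds threshold {max_distance}"
    )


def assert_top_text_startswith(result, prefix):
    text = result["metadatas"][0][0]["text"]
    assert text.startswith(prefix), f"top text {text!r} does not start with {prefix!r}"


def assert_no_hits(result):
    assert result["ids"][0] == [], f"expected no hits, got {result['ids'][0]}"


# Hand-written sample standing in for a real query result
sample_result = {
    "ids": [["mem_2", "mem_6"]],
    "distances": [[0.21, 0.48]],
    "metadatas": [[
        {"text": "Zhang Ming likes Sichuan cuisine"},
        {"text": "Zhang Ming said he does not like cilantro"},
    ]],
}
empty_result = {"ids": [[]], "distances": [[]], "metadatas": [[]]}

query, expected_ids, max_distance = QUERY_CASES[0]
assert_top_ids(sample_result, expected_ids)
assert_best_distance_at_most(sample_result, max_distance)
assert_top_text_startswith(sample_result, "Zhang Ming likes")
assert_no_hits(empty_result)
```

Keeping the helpers separate from the case table means a new scenario is one more tuple, while a new kind of check is one more function.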

With this setup, after any change — model, chunking, or library version — a single pytest -v command runs 30+ regression scenarios in under five seconds.

Core implementation

First code block: conftest.py. It gives each test case an isolated, pre-populated vector database.

# conftest.py
import uuid

import chromadb
import pytest
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Use a small model for test speed; keep the model name fixed so results do not drift
EMBEDDING_MODEL = SentenceTransformer("all-MiniLM-L6-v2")

@pytest.fixture(scope="function")
def memory_collection():
    """Create a temporary collection per test function, insert standard memories, and yield it."""
    # In-memory client: nothing is written to disk
    client = chromadb.EphemeralClient(settings=Settings(anonymized_telemetry=False))
    collection = client.create_collection(
        name=f"test_memory_{uuid.uuid4().hex}",  # Unique name per test to prevent collisions
        metadata={"hnsw:space": "cosine"},  # Critical: use cosine distance, matching production
    )

    # Standard memory set: 7 facts covering varying lengths and semantic distances
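    memories = [
        {"id": "mem_1", "text": "The user's name is Zhang Ming and he lives in Chaoyang District, Beijing"},
        {"id": "mem_2", "text": "Zhang Ming likes Sichuan cuisine, especially poached fish in chili oil"},
        {"id": "mem_3", "text": "Zhang Ming's car is a white Tesla Model 3"},
        {"id": "mem_4", "text": "The user plans to travel to Hangzhou next month"},
        {"id": "mem_5", "text": "Zhang Ming's phone number is 138xxxx1234"},
        {"id": "mem_6", "text": "Zhang Ming said he does not like cilantro"},
        {"id": "mem_7", "text": "The user is a backend engineer"},
    ]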

    texts = [m["text"] for m in memories]
    ids = [m["id"] for m in memories]
    embeddings = EMBEDDING_MODEL.encode(texts).tolist()

    collection.add(
        ids=ids,
        embeddings=embeddings,
        metadatas=[{"text": t} for t in texts]  # Redundant storage for easy assertion
    )

    yield collection  # The test case receives the collection for direct querying

    # teardown: delete collection to prevent chroma in-memory leftovers
    client.delete_collection(collection.name)

Second code block: memory_retrieval.py. This is the actual retrieval function called within the agent (the system under test).

# memory_retrieval.py
class AgentMemory:
    def __init__(self, collection, embedding_model):
        self.collection = collection
        self.embedding_model = embedding_model

    def retrieve(self, query:
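The listing above stops mid-signature, and the rest of the method is not shown. Purely as an illustration of the shape such a method could take (this is not the author's code), the sketch below embeds the query, calls Chroma's `collection.query`, and flattens the nested result into a list of hits. The `top_k` parameter, the empty-query fallback, the returned dictionary keys, and the `StubModel` and `StubCollection` classes are all assumptions; the stubs only exist so the block runs without a database or a model download.

```python
class AgentMemory:
    def __init__(self, collection, embedding_model):
        self.collection = collection
        self.embedding_model = embedding_model

    def retrieve(self, query: str, top_k: int = 3) -> list:
        """Return up to top_k hits as dicts with id, text, and distance (assumed shape)."""
        if not query.strip():
            return []  # assumed fallback: an empty query yields no memories
        embedding = self.embedding_model.encode([query])
        if hasattr(embedding, "tolist"):  # numpy array from sentence-transformers
            embedding = embedding.tolist()
        result = self.collection.query(
            query_embeddings=embedding,
            n_results=top_k,
            include=["metadatas", "distances"],
        )
        # Chroma nests results per query; only one query was sent, so take index 0
        ids = result["ids"][0]
        distances = result["distances"][0]
        metadatas = result["metadatas"][0]
        return [
            {"id": i, "text": m["text"], "distance": d}
            for i, m, d in zip(ids, metadatas, distances)
        ]


class StubModel:
    """Hypothetical embedding model returning a fixed vector per input text."""

    def encode(self, texts):
        return [[1.0, 0.0] for _ in texts]


class StubCollection:
    """Hypothetical collection returning canned hits in Chroma's result shape."""

    def query(self, query_embeddings, n_results, include):
        ids = ["mem_2", "mem_6"][:n_results]
        distances = [0.12, 0.35][:n_results]
        metadatas = [
            {"text": "Zhang Ming likes Sichuan cuisine"},
            {"text": "Zhang Ming said he does not like cilantro"},
        ][:n_results]
        return {"ids": [ids], "distances": [distances], "metadatas": [metadatas]}


memory = AgentMemory(StubCollection(), StubModel())
hits = memory.retrieve("What food does Zhang Ming like?", top_k=1)
no_hits = memory.retrieve("   ")
```

In a real test, the `memory_collection` fixture and `EMBEDDING_MODEL` from conftest.py would take the place of the two stubs.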

DEV Community
