From 2 Hours to 10 Minutes: Pytest Parametrization for LLM Memory Tests

#python #programming

It was 1 a.m. The product manager was furiously @mentioning me in the group chat: “A user reported that the agent got their dietary preferences wrong – they clearly said no cilantro, but every recommended dish is packed with cilantro.” Squinting at the logs, I found that a critical record was missing from the recalled memories. To reproduce the issue, I spent half an hour crafting curl commands and scrolling through the returned memory list line by line in the terminal. That’s when I realized that if I kept testing by hand like this, a disaster was inevitable.

Breaking Down the Problem

Anyone building LLM agents knows that memory storage is the key to personality consistency. Whether you use vector databases (Chroma, Pinecone) for long-term memory or LangChain’s ConversationBufferMemory for session windows, every time you tweak the retrieval strategy – changing the embedding model, adjusting top_k, shifting the similarity threshold, adding summarization compression – you must re-verify two core metrics:

Recall: Are all the memories that should be remembered actually recalled, with no missing ones?
Precision: Is the retrieved content truly relevant and free of hallucinations?
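
To make the two metrics concrete, here is a minimal sketch that scores a single query by comparing memory IDs as sets. The function name `recall_precision` and the sample IDs are made up for illustration, and it only covers the ID-level part of precision; checking for hallucinated content needs separate assertions on the returned text.

```python
from typing import Iterable, Tuple


def recall_precision(
    retrieved_ids: Iterable[str], expected_ids: Iterable[str]
) -> Tuple[float, float]:
    """Return (recall, precision) for one query, comparing memory IDs as sets."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = retrieved & expected
    # Convention chosen for this sketch: an empty denominator counts as a perfect score.
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision


# The cilantro memory (m1) is recalled, but an unrelated memory (m7) slips in.
recall, precision = recall_precision(["m1", "m7"], ["m1"])
print(recall, precision)  # 1.0 0.5
```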

If you’ve ever done manual testing, you know the pain: you first inject conversations one by one into the memory store according to a test case table, then manually ask questions, and finally scrutinize the returned content to check if it contains the expected key information. Five scenarios take over an hour; a full regression before a release easily takes two hours. The worst part: human eyes get tired and miss things. “But I thought I tested that” becomes the most common excuse when something breaks.

Why Conventional Approaches Fail

You might suggest writing a few hard-coded cases with unittest. But LLM memory test scenarios often involve combinatorial explosions – varying conversation history lengths, semantically similar distractors, different recall limits. unittest’s built-in parameterization is limited, and maintaining setup code plus a pile of test data makes you want to delete the repository. Running a quick check in a Jupyter Notebook gives you no regression safety – you run it once and throw it away. What we need is an automated validation system that is repeatable, extensible, and clear in its reporting.

Solution Design

The tech choice is straightforward: Pytest + a pluggable memory abstraction.

Pytest: Its parametrization (@pytest.mark.parametrize) is a natural fit for generating batches of test cases combining “injected conversation × query condition × expected result”. Fixtures elegantly manage the lifecycle of memory instances. The plugin ecosystem is rich – generating HTML reports or integrating with CI takes very little setup.
Why not unittest? Data-driven tests there lean on third-party add-ons such as ddt and parameterized, which feel bolted on – you end up with a pile of decorators, and the result is painful enough that newcomers want to give up right away.
Why not just use online monitoring like LangSmith? That’s for production observability. What we need is offline regression: after changing a single line of retrieval code, we can run the entire test suite in under 5 minutes, leaving no hidden risks for production.
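
As a rough illustration of why parametrization suits the combinatorial cases, stacking two `@pytest.mark.parametrize` decorators makes pytest generate the cross product of both lists. The axes (`history_len`, `top_k`) and the list-slicing "retrieval" are placeholders I picked for the sketch, not part of any real backend.

```python
import pytest


# Two stacked decorators yield the cross product: 3 top_k values x 2 history lengths = 6 cases.
@pytest.mark.parametrize("top_k", [1, 3, 5])
@pytest.mark.parametrize("history_len", [2, 50])
def test_recall_limit_is_respected(history_len, top_k):
    # Placeholder for a real retrieval call: a history of `history_len` turns,
    # truncated to the requested limit.
    history = [f"turn {i}" for i in range(history_len)]
    results = history[:top_k]
    assert len(results) == min(top_k, history_len)
```

Running `pytest -v` on a file like this lists each combination as its own test, so a failure points at the exact history length and limit that broke.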

Architecture idea: extract an abstract memory interface so the tests are not coupled to a real vector database; unit tests then need no external services and execute in seconds. Then build an injectable test dataset in which each case defines a “preloaded conversation list”, a “query string”, the “set of expected memory_ids to recall”, and the “expected key text snippets”. Pytest takes this data and automatically runs the flow: “load memories → trigger retrieval → assert both metrics”. The same pattern can be adapted to any memory backend that implements the interface.
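
One way this architecture could be written down is sketched below. The names `MemoryStore`, `MemoryCase`, and `run_case` are hypothetical, and the strict "no unexpected IDs" check is just one possible reading of the precision assertion; a looser threshold may fit your data better.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Set, Tuple


class MemoryStore(Protocol):
    """Hypothetical minimal interface that any backend adapter would satisfy."""

    def add_memory(self, memory_id: str, content: str) -> None: ...

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]: ...


@dataclass
class MemoryCase:
    """One row of the injectable test dataset."""

    preload: List[Tuple[str, str]]  # (memory_id, content) pairs to inject
    query: str
    expected_ids: Set[str]
    expected_snippets: List[str] = field(default_factory=list)


def run_case(store: MemoryStore, case: MemoryCase, top_k: int = 5) -> None:
    """Load memories, trigger retrieval, then assert on recall and precision."""
    for memory_id, content in case.preload:
        store.add_memory(memory_id, content)
    results = store.retrieve(case.query, top_k=top_k)
    recalled = {r["memory_id"] for r in results}

    # Recall: every expected memory must come back.
    missing = case.expected_ids - recalled
    assert not missing, f"missing memories: {missing}"
    # Precision (strict variant): nothing unexpected may come back.
    unexpected = recalled - case.expected_ids
    assert not unexpected, f"unexpected memories: {unexpected}"
    # Key text: each expected snippet appears somewhere in the returned content.
    joined = " ".join(r["content"] for r in results)
    for snippet in case.expected_snippets:
        assert snippet in joined, f"snippet not found: {snippet!r}"
```

Because `MemoryStore` is a `typing.Protocol`, a fake store and a real adapter both qualify without inheriting from anything.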

Core Implementation

Let’s dive into runnable code, broken down by module, with each snippet explaining what problem it solves.

1. Abstract a minimal memory store, removing external dependencies

To make tests run instantly on any dev machine, we first write a minimal FakeVectorMemory that stores memories in a list and simulates “semantic retrieval” via simple keyword matching. In a real project, you would just replace retrieve with a call to vectorstore.similarity_search().

# fake_memory.py: solves the problem of simulating vector retrieval offline
from typing import List, Dict

class FakeVectorMemory:
    """Simulated vector memory store: insert memories with IDs; retrieval returns records containing query keywords."""

    def __init__(self):
        self._memories: List[Dict] = []  # each memory: {"memory_id": str, "content": str}

    def add_memory(self, memory_id: str, content: str) -> None:
        """Insert one memory."""
        # Skip duplicate IDs so they cannot distort the recall measurement
        if not any(m["memory_id"] == memory_id for m in self._memories):
            self._memories.append({"memory_id": memory_id, "content": content})

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """
        Simulated retrieval: find memories whose content contains any keyword
        from the query (here, any token longer than one character).
        A real implementation would call an embedding model plus vector search.
        """
        # Naive whitespace tokenization (use a real tokenizer in production)
        keywords = [w.strip().lower() for w in query.split() if len(w.strip()) > 1]
        if not keywords:
            return []

        matched = []
        for mem in self._memories:
            content_lower = mem["content"].lower()
            # A memory counts as a "hit" if it contains any keyword, standing in for similarity matching
            if any(kw in content_lower for kw in keywords):
                matched.append(mem)
                if len(matched) >= top_k:
                    break
        return matched
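
For the real-project swap mentioned above, a thin adapter might look like the sketch below. It assumes a LangChain-style vector store whose `similarity_search(query, k=...)` returns documents with `page_content` and `metadata`, and it assumes each memory was stored with a `memory_id` key in its metadata. `VectorStoreMemory`, `StubDoc`, and `StubVectorStore` are names invented here; the stubs exist only so the example runs without a backend.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


class VectorStoreMemory:
    """Adapter exposing retrieve() on top of a store with similarity_search(query, k=...)."""

    def __init__(self, vectorstore: Any):
        self._vs = vectorstore

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        docs = self._vs.similarity_search(query, k=top_k)
        # Assumption: each memory was stored with a "memory_id" entry in its metadata.
        return [
            {"memory_id": d.metadata["memory_id"], "content": d.page_content}
            for d in docs
        ]


# --- Stubs so the sketch runs on its own (not a real library) ---
@dataclass
class StubDoc:
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class StubVectorStore:
    def __init__(self, docs: List[StubDoc]):
        self._docs = docs

    def similarity_search(self, query: str, k: int = 4) -> List[StubDoc]:
        # No real similarity here: just return the first k documents.
        return self._docs[:k]


memory = VectorStoreMemory(
    StubVectorStore([StubDoc("User dislikes cilantro", {"memory_id": "m1"})])
)
print(memory.retrieve("cilantro", top_k=1))
```

The returned dictionaries have the same shape as FakeVectorMemory's results, so the same assertions can run against either store.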

2. Manage memory lifecycle with a Pytest fixture

Before each test case, the reset_memory fixture provides a fresh, empty memory instance, preventing data contamination between tests. This is where fixtures are more elegant than setUp/tearDown.

# conftest.py: gives every test case a clean memory environment
import pytest
from fake_memory import
