BAOFUFAN

Posted on Jun 2

From Manual Checks to Pytest + Vector DB: 10x Faster AI Agent Memory Testing

#python #programming

At 2 a.m., I was jolted awake by an alert call. A user was complaining that our AI customer service agent "Xiao Yi" suddenly lost its memory—it had just been told the user’s name was "Lao Wang" in the previous turn, yet in the next turn it asked, "May I have your name, please?" Nothing else: the memory was gone. I opened the database and manually combed through thousands of vector records, trying to find the log of that conversation, then comparing embedding similarities. I worked until dawn, my eyes nearly bleeding, before I finally pinpointed the issue: a concurrent timing problem caused an update operation to overwrite the freshly written memory with an old version. Right then I thought: This kind of crap will never be verified manually a second time.

Breaking Down the Problem

If you’ve ever built a memory module for an AI agent, you know exactly the kind of pain I’m talking about. These days, agent memory storage is typically backed by a vector database—conversation history, user preferences, factual information are all encoded into vectors and written to Chroma / Qdrant / Milvus, then retrieved by similarity search for relevant memory later. Sounds simple, but consistency verification is hellish:

Updates aren’t exact replacements: inserting an updated memory isn’t a simple UPDATE SET vector=? WHERE id=?. Instead, you insert a new vector and mark the old record as invalid. If a retrieval happens between these two steps, the user may get stale memory.
The "uncertainty" of approximate matching: the same text embedded twice might not yield a cosine similarity of exactly 1.0 due to model precision and floating-point errors. So in tests, should you assert equality to 1.0, or just > 0.99? What threshold works? Manual verification was all gut feeling.
Decoupling of metadata and vectors: you can’t visually tell whether a vector is correct. You have to rely on auxiliary metadata fields (like user_id, session_id, ts) to locate records. During manual verification, you had to keep three windows open simultaneously: logs, database queries, and a script to compute vector distances.

Standard unit testing approaches simply couldn’t handle this—because the embedding process usually depends on an external model service, unit tests either mock it or run unbearably slow. Integration tests, on the other hand, often only check "can a result be returned" without caring "is the returned result precisely correct". So back then, our testing consisted of manually running 20 conversations before release, then querying the database—a process with a huge miss rate, and exactly how that concurrent bug escaped into production.

Designing the Solution

After that painful experience, I decided to build a deterministic, automated consistency verification pipeline. The choices:

Test framework: Pytest. Its rich ecosystem, fixtures, and parameterization make it easy to manage test resources and multiple scenarios. I skipped unittest because it’s too verbose, and Robot Framework because it’s too heavy.
Vector database: Continue using Chroma (already in production). But for testing, switch to an isolated test collection, using fixtures to create it before and destroy it after each test.
Embedding approach: Hardcode a fake embedding function that returns controllable, fixed-length vectors. Why not mock with random vectors? Because I need precisely calculable distances. I can craft vectors for two pieces of text so that their distance is exactly 0.5 or 0.0, enabling hard assertions without tolerating any floating-point error. This completely solves the "approximate matching uncertainty" problem.
Architecture: Each test case simulates an agent memory operation pipeline (write, update, delete), then queries Chroma to verify that the vectors and metadata match expectations. The key is aligning the "operation interface" with the real business code while being able to inject our controllable embedding function.

Why not other approaches? Simple: using Chroma’s REST API directly is too thin—it only tests connectivity. Full end-to-end integration tests that hit the real OpenAI embedding every time are slow, expensive, and uncontrollable. Pytest + fake embeddings is fast, deterministic, and costs nothing.

Core Implementation

First, install dependencies:

pip install pytest chromadb numpy

1. Controllable-Distance Fake Embedding Function

What this code solves: it lets us compute exact similarity values in advance, completely eliminating the pain of "unstable tests due to floating-point errors".

# fake_embedding.py
import numpy as np
from chromadb.api.types import EmbeddingFunction, Documents

class FakeEmbeddingFunction(EmbeddingFunction):
    """返回确定性的、可计算距离的向量。每个文本映射到固定维度的 one-hot 风格向量，方便距离断言。"""
    def __call__(self, input: Documents) -> list[list[float]]:
        dim = 128
        vectors = []
        for text in input:
            # 用文本的 hash 决定有效维度的索引，保证相同文本产生相同向量
            idx = abs(hash(text)) % dim
            vec = [0.0] * dim
            vec[idx] = 1.0   # one-hot，使得同文本内积=1，不同文本大概率内积=0
            vectors.append(vec)
        return vectors

Here, using one-hot construction creates a simple, extreme vector space where identical texts have an inner product of 1.0, and different texts almost certainly have 0.0. This way, similarity assertions become hard assertions—no more ambiguous values like 0.998.

2. Pytest Fixture: Managing Test-Level Collections

What this code solves: each test case runs in an independent collection, ensuring data isolation and automatic cleanup after execution.

# conftest.py
import pytest
import chromadb
from fake_embedding import FakeEmbeddingFunction

@pytest.fixture
def chroma_memory_store():
    """提供记忆存储的客户端和 collection 名称，测试结束后自动删除。"""
    client = chromadb.Client()   # 内存模式，无需持久化
    ef = FakeEmbeddingFunction()
    collection_name = "test_memory"
    collection = client.create_collection(
        name=collection_name,
        embedding_function=ef,
        metadata={"hnsw:space": "cosine"}
    )
    yield collection, client, collection_name
    # 后置清理：直接删除 collection，不留痕迹
    client.delete_collection(collection_name)

Using Chroma’s in-memory client means no persistent state; after the test, we explicitly delete the collection to leave no trace. The fake embedding function is injected directly into the collection, so all vector operations are under our control.

3. Writing Test Cases: Simulating the Memory Pipeline

Now let’s write a test that simulates a typical memory lifecycle: write a memory, then update it, and finally verify that the database holds exactly the expected vectors and metadata.

# test_memory_consistency.py
import pytest
from fake_embedding import FakeEmbeddingFunction

def test_write_and_update_memory(chroma_memory_store):
    collection, client, name = chroma_memory_store
    ef = FakeEmbeddingFunction()

    # 1. 写入初始记忆：用户名叫“老王”
    doc_id = "mem_001"
    metadata = {"user_id": "u1", "session_id": "s1", "label": "name"}
    original_text = "老王"
    collection.add(
        ids=[doc_id],
        documents=[original_text],
        metadatas=[metadata]
    )

    # 2. 更新记忆：用户改叫“小王”
    updated_text = "小王"
    # 模拟更新逻辑：插入新记录并标记旧记录失效（这里简化为直接 upsert）
    collection.upsert(
        ids=[doc_id],
        documents=[updated_text],
        metadatas=[{**metadata, "label": "name_updated"}]
    )

    # 3. 查询该 ID，验证向量与元数据
    result = collection.get(ids=[doc_id], include=["embeddings", "metadatas", "documents"])
    assert result["documents"][0] == updated_text
    assert result["metadatas"][0]["label"] == "name_updated"

    # 4. 验证向量：应与更新后文本的假嵌入一致
    expected_vec = ef([updated_text])[0]
    actual_vec = result["embeddings"][0]
    assert actual_vec == expected_vec

def test_similarity_search_returns_correct_memory(chroma_memory_store):
    collection, client, name = chroma_memory_store
    ef = FakeEmbeddingFunction()

    # 写入两条记忆
    collection.add(
        ids=["mem_1", "mem_2"],
        documents=["用户喜欢喝咖啡", "用户住在北京"],
        metadatas=[{"type": "preference"}, {"type": "fact"}]
    )

    # 用与“咖啡”完全相同的文本进行查询
    query_text = "用户喜欢喝咖啡"
    results = collection.query(query_texts=[query_text], n_results=2, include=["documents", "distances"])
    docs = results["documents"][0]
    distances = results["distances"][0]

    # 相同文本距离应为 0.0（one-hot 下余弦距离为 0）
    assert docs[0] == query_text
    assert distances[0] == 0.0

Notice the assertions: no epsilon tolerances, no guesswork. Because we control the embedding function, the distance for an identical text is exactly 0.0, and for different texts it’s 1.0 (since our one-hot vectors are orthogonal). This makes tests rock-solid.

4. Parameterized Testing for Multiple Scenarios

Pytest’s @pytest.mark.parametrize makes it trivial to cover concurrent updates, deletions, and edge cases without duplicating code.

@pytest.mark.parametrize("operation,initial_text,new_text,expected_text", [
    ("upsert", "老王", "小王", "小王"),
    ("add_then_delete", "临时信息", None, None),  # delete 后查询应为空
])
def test_memory_operations(chroma_memory_store, operation, initial_text, new_text, expected_text):
    collection, _, _ = chroma_memory_store
    doc_id = "op_test"

    collection.add(ids=[doc_id], documents=[initial_text], metadatas=[{"op": "test"}])

    if operation == "upsert":
        collection.upsert(ids=[doc_id], documents=[new_text], metadatas=[{"op": "test"}])
    elif operation == "add_then_delete":
        collection.delete(ids=[doc_id])

    result = collection.get(ids=[doc_id], include=["documents"])
    if expected_text is None:
        assert len(result["documents"]) == 0
    else:
        assert result["documents"][0] == expected_text

Results and Experience

After replacing manual verification with this pipeline, the testing efficiency genuinely improved by 10x—what used to take half a day of manual database digging now runs in under a minute, right inside CI. The concurrent update bug that once took an all-nighter to locate? Now it’s caught instantly by a parameterized test that simulates two interleaved write operations.

Key takeaways for fellow developers working on AI agent memory:

Determinism beats realism in testing: A fake, controllable embedding function not only speeds up tests but eliminates flakiness caused by model variance.
Isolate test data completely: A per-test collection with automatic teardown keeps tests Hermetic and allows parallel execution.
Assert on both vectors and metadata: Just checking "memory was retrieved" isn’t enough; explicitly verifying the exact vector and metadata ensures consistency at all layers.

If you’re still manually verifying AI agent memory by running conversations and eyeballing database records, give this approach a try. Your sleep schedule will thank you.

Original Chinese text, code comments preserved. The fake embedding approach demonstrated here can be adapted to Qdrant, Milvus, or any vector DB that supports custom embedding functions. The core idea remains the same: make your tests fast, deterministic, and trustworthy.

DEV Community