10x Your AI Conversation Memory Regression Tests: From Manual Checks to Pytest + LangChain Memory

#python #programming

At 1 a.m., the product manager @mentioned me in the group chat: “The assistant completely forgot what was said three turns ago—this is the third time this month.” I opened the code, swapped ConversationBufferMemory for ConversationSummaryBufferMemory, tweaked a few parameters, and then started testing manually. To verify memory behavior across five typical conversation paths, I typed in the terminal for 20 minutes, repeatedly “clearing context, re-entering inputs, observing responses”—by the end my brain was as scrambled as the memory I was testing. That’s when I thought: isn’t there an automated regression test for this?

Turns out, no. Most teams still “run a few turns by hand” to validate memory logic, which is about where login testing stood ten years ago, when someone filled in the form by hand. The testability of AI conversation memory is still in the stone age. This article shows how to build a repeatable, automatable test suite using Pytest and LangChain’s Memory module, with the goal of making memory regression checks more than 10x faster than the manual routine.

Breaking down the problem: why is testing conversation memory so painful?

At its core, testing conversation memory means verifying that a multi-turn, stateful interaction system accumulates, compresses, and forgets state correctly. It’s harder than an ordinary API test for three reasons:

Cross-turn state dependency: The correct answer on turn 5 may depend on whether the history from the first four turns was preserved intact. In manual testing, just retyping those previous turns exactly the same way drives you crazy.
LLM non-determinism: Given the same input and history, the model can produce semantically similar but textually different output. That makes traditional exact-match assertions on the returned text unreliable: a test can fail even though the memory behaved correctly.
Variety of memory strategies: ConversationBufferMemory keeps everything, ConversationSummaryMemory compresses history into a summary, ConversationTokenBufferMemory truncates by token count… Each strategy has its own trigger boundary, and those boundaries are nearly impossible to cover manually.
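To make “trigger boundary” concrete, here is a minimal, library-free sketch. `WindowMemory` and `run_turns` are hypothetical stand-ins invented for this illustration, not LangChain classes. The only point is that a scripted replay hits the boundary the same way on every run, which is hard to guarantee when typing by hand.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Hypothetical stand-in: keeps only the last `k` (user, ai) exchanges."""

    def __init__(self, k: int) -> None:
        self.k = k
        # deque(maxlen=k) silently drops the oldest entry once it is full.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, user: str, ai: str) -> None:
        self._turns.append((user, ai))

    def history(self) -> List[Tuple[str, str]]:
        return list(self._turns)


def run_turns(memory: WindowMemory, n: int) -> None:
    """Replay n scripted turns so every run feeds the memory identical input."""
    for i in range(1, n + 1):
        memory.save(f"user-{i}", f"ai-{i}")


# Exactly at the limit: nothing should have been dropped yet.
at_limit = WindowMemory(k=3)
run_turns(at_limit, 3)

# One turn past the limit: the oldest exchange should be gone.
past_limit = WindowMemory(k=3)
run_turns(past_limit, 4)
```

The two interesting cases are “exactly at the limit” and “one turn past it”. Checking both by hand means retyping seven turns without a single slip; a script does it in milliseconds.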

The usual “just run it by hand” approach is not only slow but unreliable: you never know if it passed because the memory really works or because you accidentally mistyped a test sentence. Ad-hoc scripts written just for the current change? They usually cover only the scenario you’re touching, and the next time someone modifies a different strategy, that script is long gone.

Solution design: Pytest as the testing framework, Fake LLM to eliminate non-determinism

The idea is simple: treat conversation memory as an ordinary Python object under test, and don’t let a real LLM participate in your assertions.

Why these choices:

Test framework: Pytest. Its fixtures naturally manage “reusable conversation chain instances,” and parametrization easily covers multiple memory strategies.
System under test: LangChain’s ConversationChain combined with various Memory implementations. What we verify is the message list, summary content, and other data stored inside the memory object—not the natural language text returned by the model.
LLM substitute: A Fake LLM. Its only job is to return a fixed string (or some predictable mapping) based on the input, so the whole chain runs without errors, but the memory reads and writes are completely real.
Why not just unit-test the Memory class? Because ConversationChain internally calls memory.save_context and memory.load_memory_variables with specific timing and argument ordering. Only an integration test can surface problems at that level (you’ll see a pitfall later).

Architecturally, a test case follows three steps:

Use a fixture to create a ConversationChain that is equipped with a specific Memory and a Fake LLM underneath.
Call chain.predict(input=...) for several rounds to simulate a conversation.
Assert the content of chain.memory.chat_memory.messages (or other memory variables) matches expectations.
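The three steps can be sketched without LangChain installed. Everything below (`EchoLLM`, `ListMemory`, `TinyChain`, `make_chain`) is a hypothetical stand-in that only mimics the shape of the real objects, so treat it as an outline of the test structure rather than the library's API.

```python
from typing import List, Tuple


class EchoLLM:
    """Hypothetical fake model: the reply is derived only from the user input."""

    def __call__(self, prompt: str, user_input: str) -> str:
        return f"Echo: {user_input}"


class ListMemory:
    """Hypothetical memory: stores (role, text) pairs in arrival order."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def load(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.messages)

    def save(self, user_input: str, reply: str) -> None:
        self.messages.append(("human", user_input))
        self.messages.append(("ai", reply))


class TinyChain:
    """Hypothetical chain: load history, call the model, then save the turn."""

    def __init__(self, llm: EchoLLM, memory: ListMemory) -> None:
        self.llm = llm
        self.memory = memory

    def predict(self, input: str) -> str:  # keyword mirrors chain.predict(input=...)
        prompt = f"{self.memory.load()}\nhuman: {input}\nai:"
        reply = self.llm(prompt, input)
        self.memory.save(input, reply)
        return reply


def make_chain() -> TinyChain:
    """Step 1: plays the role of the fixture."""
    return TinyChain(EchoLLM(), ListMemory())


def test_two_turns_are_stored_in_order() -> None:
    chain = make_chain()
    # Step 2: simulate the conversation.
    chain.predict(input="My name is Ann")
    chain.predict(input="What is my name?")
    # Step 3: assert on stored state, not on the wording of the reply.
    assert chain.memory.messages == [
        ("human", "My name is Ann"),
        ("ai", "Echo: My name is Ann"),
        ("human", "What is my name?"),
        ("ai", "Echo: What is my name?"),
    ]
```

The assertion targets what the memory stored. The reply text appears in it only because the fake model makes that text predictable.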

Core implementation: from Fake LLM to a complete test suite

1. Implementing a controllable FakeLLM

This code solves the problem of “make the LLM shut up completely so it doesn’t interfere with memory assertions.” We let the _generate method return a fixed string that contains the prompt, so we can track calls without the model derailing our tests with unpredictable output.

from typing import Any, List, Optional
from langchain.llms.base import BaseLLM
from langchain.schema import LLMResult, Generation

class FakeLLM(BaseLLM):
    """An LLM that returns fixed text, used to isolate memory tests."""
    response_template: str = "Echo: {prompt}"

    @property
    def _llm_type(self) -> str:
        return "fake"

    def _generate(
        self,
        prompts: List[str],
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> LLMResult:
        # One Generation per prompt
        generations = [
            [Generation(text=self.response_template.format(prompt=prompt))]
            for prompt in prompts
        ]
        return LLMResult(generations=generations)

Why not use unittest.mock.MagicMock? Because LangChain wraps LLM calls through several layers internally (generate → _generate), and a mock can easily miss parts of the internal call chain. FakeLLM fully implements the BaseLLM interface, so its behavior is completely controlled.
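The difference between a mock and a subclass-based fake can be shown with a toy wrapper. This is a minimal sketch: `BaseTextModel` and `FakeModel` are hypothetical classes that imitate the “public method wraps a private hook” pattern, not LangChain's actual internals.

```python
from typing import List
from unittest.mock import MagicMock


class BaseTextModel:
    """Hypothetical base class: public generate() wraps the _generate() hook."""

    def generate(self, prompts: List[str]) -> List[str]:
        # Wrapper logic that callers depend on (here: trimming whitespace).
        return [text.strip() for text in self._generate(prompts)]

    def _generate(self, prompts: List[str]) -> List[str]:
        raise NotImplementedError


class FakeModel(BaseTextModel):
    """Overrides only the hook, so the real wrapper above still runs."""

    def _generate(self, prompts: List[str]) -> List[str]:
        return [f"  Echo: {p}  " for p in prompts]


# The fake goes through the real generate(): output is a trimmed list of str.
fake_result = FakeModel().generate(["hi"])

# The mock replaces generate() itself: the wrapper never runs and the
# return value is another MagicMock unless every layer is configured by hand.
mocked = MagicMock(spec=BaseTextModel)
mock_result = mocked.generate(["hi"])
```

With the fake, downstream code receives the same types it would get in production. With the bare mock, each layer's return value has to be configured by hand, and a forgotten layer fails far from where the mistake was made.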

2. Pytest fixture: creating a reusable conversation chain

This code solves the problem of “reassembling the chain in every test case.” With fixtures we can inject different memory strategies into different tests.

import pytest
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

@pytest.fixture
def fake_llm():
    return FakeLLM()

@pytest.fixture
def memory():
    # Defaults to a buffer; individual tests can override it via parametrization
    return ConversationBufferMemory(return_messages=True)

@pytest.fixture
def conversation_chain(fake_llm, memory):
    """返