I Spent 6 Hours Fixing LangChain's ConversationBufferMemory — Here's the Automated Test You Need

At 4:59 PM on a Friday, I was about to close my laptop and sneak out when a QA colleague's avatar flashed on DingTalk: "Come check this out. The support bot remembers I'm Zhang San, but when I ask for my order number, it insists it belongs to Li Si." I pulled up the logs and saw LangChain's ConversationBufferMemory behaving like it had severe amnesia — Session A was pulling in chat history from Session B. In that moment, I knew that unless I built an automated test suite to lock down the accuracy and consistency of memory storage, the next blow-up would definitely happen at 2 AM.

Breaking Down the Problem

In LLM-powered chat products, the memory module is responsible for remembering context across multiple turns — so that when the user says "I live in Beijing" early on, a later weather query automatically carries "Beijing" along. Sounds easy, but things get messy once you land in LangChain: ConversationBufferMemory stores all conversations in plain text. It works fine as long as the memory fits in RAM, but switch to Redis or a database for persistence, and a whole bunch of issues bubble up — serialization/deserialization, concurrent reads/writes, and trimming old messages.
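
Here's that happy path in miniature: the purely in-memory round trip that works out of the box (a minimal sketch; the Beijing example mirrors the one above):

# In-memory round trip: no Redis, no LLM, just the buffer itself
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True)
memory.save_context({"input": "I live in Beijing"}, {"output": "Got it, Beijing."})

# A later turn reads the buffer back and prepends it to the prompt
print(memory.load_memory_variables({})["history"])
# [HumanMessage(content='I live in Beijing'), AIMessage(content='Got it, Beijing.')]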

In our production scenario, a customer service bot handled hundreds of concurrent users. Each user session was independent, but they all shared a common Redis instance. When we first launched, QA manually tested a dozen typical conversation paths and found absolutely no cross-session memory leaks — not because there were none, but because manual testing simply can't cover race conditions under high concurrency, nor reproduce edge cases where trim_messages mixes up adjacent sessions when a Redis connection blinks out. Once real traffic hit, bugs popped up like whack-a-mole — you fixed one, another sprang out. We desperately needed a set of regression tests that could directly verify memory read/write accuracy and cross-session isolation.
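
For readers who haven't met it, trim_messages is the message-windowing utility in langchain_core that decides which old messages get dropped. A toy illustration of what it's supposed to do (a sketch assuming a reasonably recent langchain_core; token_counter=len counts whole messages instead of real tokens):

from langchain_core.messages import AIMessage, HumanMessage, trim_messages

history = [
    HumanMessage("My name is Zhang San"),
    AIMessage("Hello Zhang San"),
    HumanMessage("What is my order number?"),
    AIMessage("Your order number is #1123"),
]

# Keep only the last two messages; in production this runs per session,
# which is exactly where a flaky Redis read can hand it the wrong list
trimmed = trim_messages(history, max_tokens=2, token_counter=len, strategy="last")
assert [m.content for m in trimmed] == [
    "What is my order number?",
    "Your order number is #1123",
]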

Designing the Solution

The goal was clear: run through the core logic of the memory module right in our local CI, without a real LLM or a real Redis instance, and catch issues before any code landed.

Framework choice was a no-brainer — Pytest. Its fixture capabilities are perfect for assembling different memory instances. LangChain's memory abstraction is fairly clean: BaseChatMemory provides uniform save_context and load_memory_variables interfaces, so we could write the same set of tests against different memory backends. A real Redis is too heavy, so we chose fakeredis to simulate a Redis instance in memory — quick to spin up and zero side effects. All LLM calls were banished with unittest.mock, because we were testing memory, not the LLM.

Why not use the built-in langchain.tests? They only cover the shallowest of interfaces, none of the hard-won scenarios like message type conversion or multi-session isolation. We also didn't want to run Redis in a Docker container — our CI resources are already stretched thin; adding one more container would jam the build queue by an extra 3 minutes.

The overall architecture: define a fake_redis_memory fixture inside Pytest's conftest.py, use it to construct different Memory subclasses (ConversationBufferMemory, ConversationSummaryMemory), simulate multi-turn conversations with helper functions, and then assert that the history returned by load_memory_variables is both complete and free of cross-session contamination.

Core Implementation

1. Building a Zero-Dependency Test Harness

This snippet packages fakeredis, mock LLM, and Memory instantiation into a fixture. All subsequent test cases run on top of it. The non-negotiable requirement: zero network requests, and any single test completes in under 0.3 seconds.

# conftest.py
import pytest
from unittest.mock import MagicMock
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from fakeredis import FakeRedis

@pytest.fixture
def fake_redis_memory():
    # Build a fake Redis client with fakeredis: pure in-memory, nothing to start
    fake_redis_client = FakeRedis()

    def _create_memory(session_id: str):
        # langchain_community's RedisChatMessageHistory builds its own client
        # from a URL rather than accepting one, so construct it with a placeholder...
        history = RedisChatMessageHistory(
            session_id=session_id,
            url="redis://localhost:6379/0"  # placeholder; not used for message I/O
        )
        # ...then swap in the fake client, so every read/write stays in memory
        # and each test's sessions stay isolated
        history.redis_client = fake_redis_client
        # With return_messages=True, ConversationBufferMemory returns Message objects
        memory = ConversationBufferMemory(
            chat_memory=history,
            return_messages=True  # key: structured messages are easy to assert on
        )
        return memory

    return _create_memory
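
The same harness extends to ConversationSummaryMemory, which genuinely needs an LLM in the loop. A raw MagicMock tends to trip the validation on the llm field, so a fake LLM with canned responses keeps everything offline. A sketch (the fixture name and canned summary text are ours):

# conftest.py (continued): sketch for the summary-memory variant
from langchain.memory import ConversationSummaryMemory
from langchain_community.llms import FakeListLLM

@pytest.fixture
def fake_summary_memory(fake_redis_memory):
    def _create_memory(session_id: str):
        # FakeListLLM returns its canned strings in order: no network, no tokens
        canned_llm = FakeListLLM(responses=["User is Zhang San; asked about order #1123."])
        return ConversationSummaryMemory(
            llm=canned_llm,
            chat_memory=fake_redis_memory(session_id).chat_memory,
        )
    return _create_memory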

2. Testing Accuracy: Every Message Written Must Come Back

This test simulates two rounds of conversation and verifies that the history returned by load_memory_variables has the exact length and content we expect. It puts an end to the mysterious "I stored two lines but only got one back" bug.

# test_memory_accuracy.py
from langchain.schema import HumanMessage, AIMessage

def test_buffer_memory_keeps_all_messages(fake_redis_memory):
    memory = fake_redis_memory("session_1202")

    # Simulate the first round of conversation
    memory.save_context(
        {"input": "My name is Zhang San"},
        {"output": "Hello Zhang San"}
    )
    # Simulate the second round of conversation
    memory.save_context(
        {"input": "What is my order number?"},
        {"output": "Your order number is #1123"}
    )

    variables = memory.load_memory_variables({})
    history = variables.get("history", [])

    # Assert: four messages total (two questions, two answers), in order
    assert len(history) == 4
    assert isinstance(history[0], HumanMessage)
    assert history[0].content == "My name is Zhang San"
    assert isinstance(history[3], AIMessage)
    assert "#1123" in history[3].content
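
And the companion check the whole suite exists for: cross-session isolation. It follows the exact same pattern; a sketch of how it could look (the test body is ours, modeled on the Zhang San / Li Si incident above):

# test_memory_isolation.py: two sessions share one fake Redis instance,
# and neither should ever see the other's history

def test_sessions_do_not_leak(fake_redis_memory):
    memory_a = fake_redis_memory("session_A")
    memory_b = fake_redis_memory("session_B")

    memory_a.save_context({"input": "My name is Zhang San"}, {"output": "Hello Zhang San"})
    memory_b.save_context({"input": "My name is Li Si"}, {"output": "Hello Li Si"})

    history_a = memory_a.load_memory_variables({})["history"]
    history_b = memory_b.load_memory_variables({})["history"]

    # Each session sees exactly its own two messages and nothing else
    assert len(history_a) == 2
    assert len(history_b) == 2
    assert all("Li Si" not in m.content for m in history_a)
    assert all("Zhang San" not in m.content for m in history_b)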
