At 1 AM, I was fixing a bug in a customer support Agent. A user sent an order number, the Agent looked up the shipment, and then in the next turn the user asked “What about changing the address?” The Agent replied, “May I have your order number?” — the context was completely lost. My boss tagged me in the group: “Why does every conversation seem like amnesia?” It took me 6 hours to discover that the issue wasn’t the model itself; it was that LangChain’s memory storage hadn’t been properly tested. Some innocent-looking changes had silently broken a critical assumption. If you’re building Agents with LangChain, the risk of context breakage might already be hiding inside that memory module that “seems fine.” Here’s how I used pytest to lock down the memory layer and avoid these pitfalls.
Problem Breakdown
LangChain’s memory components (ConversationBufferMemory, ConversationSummaryMemory, etc.) essentially inject historical messages into the prompt so that the LLM knows what has been said. Context breakage usually shows up in a few ways: the Agent suddenly forgets parameters you just gave it; history disappears after a cross-tool call; summary memory generation lags, leaving blank middle turns.
The root cause almost always comes down to one thing: we only ever checked memory state by hand. During development we’d click through a few rounds in the chat UI, see that the answers were roughly right, and ship it, without ever verifying memory content, length, or key names in an automated test. As soon as someone renames a variable in the prompt template (say, {history} → {chat_history}) or swaps the summarization model, context can break silently until a user complains. Conventional “output comparison” tests rarely catch this kind of flaw: the LLM is non-deterministic, so answers can differ and still look correct, while the fatal context loss hides in the intermediate state.
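To make the {history} → {chat_history} rename concrete, here is a minimal sketch that uses only the standard library. It assumes the prompt is a plain `str.format`-style string; `template_variables`, `memory_key_matches`, `MEMORY_KEY`, and both template strings are hypothetical names made up for illustration, not part of LangChain.

```python
from string import Formatter
from typing import Set


def template_variables(template: str) -> Set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def memory_key_matches(template: str, memory_key: str) -> bool:
    """True if the key the memory writes under is a placeholder in the template."""
    return memory_key in template_variables(template)


# Hypothetical values: the memory still writes under the old key.
MEMORY_KEY = "history"
old_template = "Conversation so far:\n{history}\nUser: {input}"
new_template = "Conversation so far:\n{chat_history}\nUser: {input}"

print(memory_key_matches(old_template, MEMORY_KEY))  # key and template agree
print(memory_key_matches(new_template, MEMORY_KEY))  # template renamed, memory not
```

A check like this costs nothing to run and fails the moment the template and the memory key drift apart, which is exactly the kind of change that manual clicking in a chat UI tends to miss.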
Solution Design
We have to treat memory storage itself as a testable component, not as an invisible ghost buried inside the Agent’s execution chain. The tech stack is straightforward: pytest + langchain.memory + a fixed deterministic LLM (for summary tests or as a mock), together with fixtures to build reproducible conversation flows.
I compared several testing approaches:
- Integration tests with a real LLM: slow, expensive, and flaky; an OpenAI API hiccup fails the test even when the memory code is fine. Rejected immediately.
- Scripts that manually check printed conversation: that’s what we used to do. Everything relied on human eyes, with zero regression coverage, so it was doomed to fail.
- Unit tests on the memory object with pytest: pull the memory out of the chain, push messages into it directly, read the variables back, and assert that message lists, token counts, and keys are correct. Of the three, this is the only approach that prevents breakage at low cost.
Architecturally, I’d extract the memory instantiation logic into a factory function. In tests, use the same factory to create the memory object, then simulate the message inputs/outputs that happen before and after an Agent execution. Focus assertions on three things: whether the chat_history string (or message list) stored in memory contains the key entities; whether the dictionary keys returned by load_memory_variables match the prompt template; and whether memory behaves correctly at boundaries (message limits exceeded, summary triggers).
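The factory-plus-boundary idea can be sketched without any LLM or LangChain dependency. The class below is a hypothetical stand-in for a windowed memory, with simplified method signatures that only loosely mirror the `save_context` / `load_memory_variables` pair; `WindowMemoryStandIn`, `build_memory`, and the window size `k` are assumptions made for this example.

```python
from collections import deque


class WindowMemoryStandIn:
    """Hypothetical stand-in: keeps only the last `k` turns (human + AI pair)."""

    def __init__(self, memory_key: str = "chat_history", k: int = 2):
        self.memory_key = memory_key
        self._turns = deque(maxlen=k)  # the oldest turn is dropped automatically

    def save_context(self, human: str, ai: str) -> None:
        self._turns.append((human, ai))

    def load_memory_variables(self) -> dict:
        lines = []
        for human, ai in self._turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return {self.memory_key: "\n".join(lines)}


def build_memory(k: int = 2) -> WindowMemoryStandIn:
    """Factory shared by application code and tests (hypothetical name)."""
    return WindowMemoryStandIn(memory_key="chat_history", k=k)


def test_window_boundary_drops_oldest_turn():
    mem = build_memory(k=2)
    mem.save_context("My order number is XD12345", "XD12345 is in transit.")
    mem.save_context("Can I change the address?", "Yes, what is the new address?")
    history = mem.load_memory_variables()["chat_history"]
    assert "XD12345" in history  # first turn is still inside the window

    mem.save_context("12 Example Street", "Got it.")
    history = mem.load_memory_variables()["chat_history"]
    assert "XD12345" not in history  # first turn has fallen out of the window
```

The second assertion documents the boundary on purpose: if the window is too small for the conversation, the order number disappears, and the test states that explicitly instead of leaving it to be discovered in production.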
Core Implementation
First, build the test skeleton, which takes care of creating and isolating memory instances. A pytest fixture creates a ConversationBufferMemory, and a helper function writes multiple conversation turns into it. This way each test gets a clean memory and can freely populate prior history.
# test_memory_base.py
import pytest
from langchain.memory import ConversationBufferMemory
from langchain.schema import HumanMessage, AIMessage


@pytest.fixture
def memory():
    """Return a clean ConversationBufferMemory for each test."""
    return ConversationBufferMemory(
        memory_key="chat_history",  # must match the prompt template placeholder
        return_messages=True
    )


def add_messages(mem, messages):
    """Write (role, text) pairs into memory by role, simulating a real conversation."""
    for role, text in messages:
        if role == "human":
            mem.chat_memory.add_user_message(text)
        else:
            mem.chat_memory.add_ai_message(text)
First critical test: verify memory correctly retains key entities to prevent breakage. Many context losses appear as “forgetting the previous message,” but the root cause is that memory never stored that user input. The test below bluntly checks whether the order number exists in the history.
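# test_memory_content.py
from test_memory_base import memory, add_messages


def test_memory_retains_order_number(memory):
    """Simulate the Agent handling an order and assert the order number is in the history."""
    messages = [
        ("human", "My order number is XD12345, please check the shipping status"),
        ("ai", "Sure, XD12345 is in transit and should arrive tomorrow."),
        ("human", "I've moved. Can I change the address?")
    ]
    add_messages(memory, messages)

    # Load the memory variables (a list of messages here)
    memory_vars = memory.load_memory_variables({})
    history = memory_vars["chat_history"]

    # Flatten the message list to plain text for a content assertion
    history_text = " ".join([msg.content for msg in history])
    assert "XD12345" in history_text, (
        "Context broken! Order number XD12345 is missing from the history, "
        "which means memory did not store the user input correctly."
    )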
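The stored history can be either a string or a message list. As far as I understand it, ConversationBufferMemory returns a single string when `return_messages=False`, so a content assertion that only handles lists would break if that flag changes. A small helper can absorb the difference. This is a sketch: `history_to_text` is a hypothetical name, and the `SimpleNamespace` objects merely stand in for message objects that expose a `.content` attribute.

```python
from types import SimpleNamespace


def history_to_text(history) -> str:
    """Normalize a history value (a string or a list of messages) to plain text."""
    if isinstance(history, str):
        return history
    # Otherwise assume an iterable of objects with a `.content` attribute.
    return " ".join(msg.content for msg in history)


# Stand-ins for message objects; real tests would pass what the memory returns.
as_messages = [
    SimpleNamespace(content="My order number is XD12345"),
    SimpleNamespace(content="XD12345 is in transit."),
]
as_string = "Human: My order number is XD12345\nAI: XD12345 is in transit."

assert "XD12345" in history_to_text(as_messages)
assert "XD12345" in history_to_text(as_string)
```

With this in place, the entity assertion stays the same regardless of how the memory is configured.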
Second critical test: check memory_key alignment to prevent silent breakage caused by template changes. This is the best-hidden trap: you change the variable name in the prompt, but the memory still uses the old key, and the Agent receives an empty history every time.
# test_memory_key.py
from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain
from langchain.prompts import Ch