LangChain Memory Pitfall: A Concurrency Race Condition That Cost Me 6 Hours

#python #programming

At 2:17 AM, I was jolted awake by a phone call. The ops team told me the production chatbot had suddenly developed a “split personality”—User A was watching the bot chat with User B about a loan. They took screenshots and posted them on social media. I snapped wide awake, grabbed my laptop, and started digging through the logs.

Just a few hours earlier we had shipped a “session memory” feature, built on LangChain’s ConversationBufferMemory paired with our own session_id caching strategy. Why were memories bleeding across users? I couldn’t reproduce it manually no matter how hard I tried. In the end I wrote an automated Pytest concurrency test, and only then did I flush out the deeply hidden race condition. That experience drilled one thing into my skull: If you don’t write automated concurrency tests for AI agent memory, you’re burying a landmine.

The Illusion of Static Isolation for Session Memory

Our setup was typical: a customer-service bot that keeps independent context for each visitor. Session memory lived in a server-side dict—keys were session_id, values were ConversationBufferMemory instances. The happy path looked like: request arrives → fetch the session’s memory → inject it into a ConversationChain → generate a response → update the memory. In dev and QA, clicking buttons by hand, memory isolation was flawless.

But once we hit production, cross-talk appeared at random whenever the same user sent messages in rapid succession or several users made requests concurrently. The root cause turned out to be this: in an attempt to save memory, our cache reused a single ConversationBufferMemory instance, so every fetch from the dict returned the exact same object reference. Coroutines serving different sessions interleaved at their await points and all wrote to that one buffer, so naturally their messages got tangled.
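The aliasing mistake can be sketched without LangChain at all. In this minimal illustration, FakeMemory, BuggyStore, and PerSessionStore are hypothetical names invented for the example, not our production code. The buggy store hands the same pre-built object to every session_id, while the fixed store builds a fresh buffer the first time it sees a key.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeMemory:
    """Hypothetical stand-in for a per-session conversation buffer."""
    messages: List[str] = field(default_factory=list)


class BuggyStore:
    """Anti-pattern: every session_id ends up mapped to the same object."""

    def __init__(self) -> None:
        self._template = FakeMemory()  # created once, then reused
        self._cache: Dict[str, FakeMemory] = {}

    def get(self, session_id: str) -> FakeMemory:
        # Stores a reference to the template, not a fresh buffer.
        return self._cache.setdefault(session_id, self._template)


class PerSessionStore:
    """Fix: build a new buffer the first time a session_id is seen."""

    def __init__(self) -> None:
        self._cache: Dict[str, FakeMemory] = {}

    def get(self, session_id: str) -> FakeMemory:
        if session_id not in self._cache:
            self._cache[session_id] = FakeMemory()
        return self._cache[session_id]


buggy = BuggyStore()
buggy.get("user-a").messages.append("A: my loan balance?")
leaked = buggy.get("user-b").messages  # user B sees user A's message

fixed = PerSessionStore()
fixed.get("user-a").messages.append("A: my loan balance?")
isolated = fixed.get("user-b").messages  # stays empty
```

The sketch shows only the aliasing itself, not the timing. A quick identity check such as `store.get("a") is store.get("b")` is a cheap way to spot this class of bug before any concurrency is involved.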

Why couldn’t a manual test catch this? Manual testing is effectively sequential: one person clicking through a UI sends a request, waits for the reply, and only then sends the next, so it never produces the fine-grained coroutine interleaving that happens under real load. You might do what I did at first: write a simple script that loops for i in range(100), awaits each call in turn, sees no problem, and declares it safe. That loop is just as sequential as the manual clicks. Only by using asyncio.gather to put two coroutines in flight at once can you open the race window wide enough to reproduce the bug reliably.
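The difference between the sequential loop and asyncio.gather can be seen in a few lines of plain asyncio. This is a minimal sketch, not the production code path: fake_turn is a hypothetical stand-in for one chat turn, and asyncio.sleep(0) stands in for the point where a real LLM call would yield to the event loop.

```python
import asyncio
from typing import List


async def fake_turn(buffer: List[str], user: str, text: str) -> None:
    """Hypothetical chat turn: record input, await the 'LLM', record reply."""
    buffer.append(f"{user} said: {text}")
    await asyncio.sleep(0)  # yield to the event loop, as a real LLM call would
    buffer.append(f"bot to {user}: ok")


async def run_sequential() -> List[str]:
    # Each turn finishes before the next one starts: no interleaving.
    buffer: List[str] = []
    await fake_turn(buffer, "A", "hi")
    await fake_turn(buffer, "B", "loan?")
    return buffer


async def run_concurrent() -> List[str]:
    # Both turns are in flight at once and share one buffer.
    buffer: List[str] = []
    await asyncio.gather(
        fake_turn(buffer, "A", "hi"),
        fake_turn(buffer, "B", "loan?"),
    )
    return buffer


sequential = asyncio.run(run_sequential())
concurrent = asyncio.run(run_concurrent())
```

In the sequential run, A's input and reply sit next to each other in the buffer. In the concurrent run, B's input lands between A's input and A's reply, which is exactly the ordering a click-through test never produces.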

Designing the Test: Creating a Miniature Concurrency Crash with Pytest

To build an automated guard, you need a way to reliably reproduce the cross-talk. The toolbox was clear:

Pytest + pytest-asyncio: the foundation for async tests, giving precise control over the event loop and the ability to simulate real concurrency.
LangChain native components: ConversationChain + ConversationBufferMemory, tested directly from the production code path.
Fake LLM: instead of hitting the OpenAI API, a simple async fake that returns predetermined responses—stable and fast.
asyncio.gather: to fire two coroutines at nearly the same instant via chain.apredict, intentionally creating the interleaving.

Why not Locust or JMeter? Those load-testing tools excel at traffic simulation but are poor at making precise in-memory assertions. I needed to inspect the memory after a conversation and verify that it didn’t contain any user input from the wrong session. That requires unit-test-level freedom. Pytest fits perfectly: it can manufacture concurrency and, with assert, give you a precise postmortem.
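The kind of in-memory assertion meant here might look like the following. It is a sketch under assumptions: find_cross_talk is a hypothetical helper, and the two history lists are made-up snapshots standing in for whatever the memory object holds after the concurrent run.

```python
from typing import Iterable, List


def find_cross_talk(history: List[str], foreign_inputs: Iterable[str]) -> List[str]:
    """Return every foreign input that appears anywhere in the history."""
    return [
        text for text in foreign_inputs
        if any(text in entry for entry in history)
    ]


# Hypothetical post-run snapshots of two sessions' memory.
session_a_history = ["Human: What is my order status?", "AI: Shipped."]
session_b_history = [
    "Human: I need a loan.",
    "AI: Sure.",
    "Human: What is my order status?",  # leaked from session A
]

clean = find_cross_talk(session_a_history, ["I need a loan."])
leaks = find_cross_talk(session_b_history, ["What is my order status?"])
```

Inside a test, the check reduces to `assert not find_cross_talk(history, other_users_inputs)`, and a failure lists the exact strings that leaked. That is the level of detail a load-testing tool does not give you.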

Even more important: with automated tests I could pin down the anti-pattern (the shared memory instance) as a regression case and let CI run it on every commit, so that any future refactor that breaks isolation again fails the build instead of reaching production.

Core Implementation: From “Reproduce the Bug with Tests” to “Guard the Fix with Tests”

1. A Fake LLM Built Solely to Reproduce the Problem

This snippet solves the “no OpenAI keys, but we still need async LangChain tests” pain point. The fake LLM returns fixed responses from a preset list, in order, and implements both the sync and the async entry points.

from typing import Any, List, Optional
from langchain_core.language_models.llms import LLM
from langchain_core.callbacks import CallbackManagerForLLMRun

class FakeAsyncLLM(LLM):
    """Async LLM that returns preset responses in order; for tests only."""
    responses: List[str]
    i: int = 0

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None,
              **kwargs: Any) -> str:
        resp = self.responses[self.i % len(self.responses)]
        self.i += 1
        return resp

    async def _acall(self, prompt: str, stop: Optional[List[str]] = None,
                     run_manager: Optional[CallbackManagerForLLMRun] = None,
                     **kwargs: Any) -> str:
        # Simulates an async call; returns synchronously so results stay predictable
        return self._call(prompt, stop, run_manager)

    @property
    def _llm_type(self) -> str:
        return "fake_async"

2. The Test That Deliberately Shares Memory—Exposing the Race Condition

This test proves that when two ConversationChain instances share a single ConversationBufferMemory, concurrent calls will inevitably cross-contaminate the memory. Run it, and you’ll see red.

import pytest
import asyncio
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

@pytest.fixture
def shared_memory():
    """A shared Memory instance: the anti-pattern under test."""
    return ConversationBufferMemory()

@pytest.mark.asyncio
async def test_concurrent_corruption_with_shared_memory(shared_memory):
    llm = FakeAsyncLLM(responses=["Hello", "How can I help?", "Sure", "Goodbye"])
    # Wrong: both chains share the same memory object
    chain_a = ConversationChain(llm=llm, memory=shared_memory)
    chain_b = Conversat