How I Fixed LangChain’s Agent Memory Loss with 23 Pytest Cases

#python #programming

At 1 AM, my boss @-ed me in the company chat: “What’s wrong with the customer service bot? The user just said their name, and the very next message it asks again — like a goldfish.” I stumbled to my laptop and pulled up the logs. Sure enough, the context queue for that session_id in Redis was clean as a whistle, as if nothing had ever been written. That was the fourth “memory evaporation” incident this month. Every previous time, we’d papered over it with a service restart and a redeploy. But that night I decided to stop coddling the bug — I was going to test the AI agent’s context persistence logic from top to bottom with Pytest.

Where exactly does an agent’s memory get lost?

An AI agent’s “memory” is basically a list of messages. With each turn, you push the user’s input and the AI’s reply into that list. On the next request, you pull out the history and prepend it to the prompt so the model knows what was said before. A typical implementation looks like this:

from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(return_messages=True)
# After each turn: memory.save_context({"input": user_msg}, {"output": ai_msg})
# Later: memory.load_memory_variables({}) to get the history

When we wire this up with Redis for persistence (e.g. using RedisChatMessageHistory), memory should survive as long as Redis does. But in the real world, sometimes a conversation gets routed to a different machine, and fields get lost in the serialized JSON or Pickle payload. Sometimes the async connection pool gets exhausted and save_context returns without really flushing. Even spookier: when return_messages=True, load_memory_variables can return SystemMessage and HumanMessage objects, but if you serialize them with pickle, the slightest difference in class definitions on the deployment server causes a silent deserialization failure — you end up with an empty list and not a single error.

These problems share one trait: they only surface under multi-turn, concurrent conversations, and manual testing can never reproduce them. Debug locally step by step and everything works fine. Deploy to production and you get intermittent amnesia. Without an automated test suite that simulates real, multi-turn interactions, you’re effectively planting a time bomb in every release.

Design decisions: Why Pytest + fakeredis instead of a real Redis cluster

I needed a test setup that could run in CI, execute in milliseconds, and reliably reproduce every edge case. The options on the table:

Real Redis + Testcontainers — Spins up a legitimate Redis container, very close to production. But each test requires a few seconds to start the container, which slows down CI pipelines noticeably. It also ties you to Docker; some teammates use M1 Macs, and certain images cause compatibility headaches.
unittest + mock — Swap out the Redis client with Python’s standard mock library. It can test the logic, but the code becomes verbose (mock every single Redis command), and mocks don’t reproduce the real behavior of serialization/deserialization.
Pytest + fakeredis — fakeredis is a pure-Python Redis simulator that supports over 90% of common commands, fully emulating serialization, expiration, and data structures — with zero external dependencies. Combined with Pytest’s fixtures, parametrization, and pytest-asyncio, you can polish every corner of the memory storage logic.

Another reason not to use a real Redis cluster: some of the scenarios I wanted to test depended on specific Redis error responses (e.g. OOM, connection timeouts). With fakeredis, simulating those anomalies is actually easier than with a real instance, because you can inject specific faults. This way the test environment becomes “a version of production that knows your code better than you do” — exposing those sporadic boundary conditions.

The final architecture is simple: a MemoryStore abstract class that encapsulates the agent’s memory operations (add messages, get history, trim, clear). The underlying implementation can be Redis or an in-memory dict. During testing, we inject fakeredis; in development, we reuse the exact same test suite to validate any new storage backend.

Core implementation: from fixtures to multi-turn conversation tests

Setting up an isolated memory store so every test starts fresh

We start with a Pytest fixture that provides an isolated, async memory store, guaranteeing no cross-test interference:

# conftest.py
import pytest
import fakeredis.aioredis  # async fakeredis
from myapp.memory_store import RedisMemoryStore

@pytest.fixture
async def memory_store():
    fake_redis = fakeredis.aioredis.FakeRedis(decode_responses=True)
    store = RedisMemoryStore(redis_client=fake_redis)
    yield store
    await store.clear_all()  # clean up after each test
    await fake_redis.aclose()

We built RedisMemoryStore ourselves, avoiding a direct dependency on LangChain’s History classes. This lets us test serialization logic in isolation and makes it easy to swap out for DynamoDB or PostgreSQL later.

Verifying the most critical behavior: you can save and retrieve what you stored

# test_memory_basic.py
import pytest
from myapp.schemas import ChatMessage

pytestmark = pytest.mark.asyncio  # the whole module is async

async def test_save_and_retrieve_messages(memory_store):
    session_id = "session-001"
    messages = [
        ChatMessage(role="user", content="My name is Wang"),
        ChatMessage(role="assistant", content="Got it, Wang"),
    ]
    await memory_store.add_messages(session_id, messages)

    history = await memory_store.get_history(session_id)

    assert len(history) == 2
    assert history[0].role == "user"
    assert history[0].content == "My name is Wang"
    assert history[1].role == "assistant"

Here ChatMessage is a Pydantic model. Serialization uses .json() and stores the result in a Redis Hash, avoiding Pickle and its class-versioning issues altogether.

Simulating real-world multi-turn conversations concurrently

Now the part that catches the “goldfish” bug: multiple interleaved sessions, each with multiple turns, running under asyncio.gather to trigger race conditions.

async def test_concurrent_multi_session_isolation(memory_store):
    session_a = "sess-A"
    session_b = "sess-B"

    await memory_store.add_messages(session_a, [ChatMessage(role="user", content="From A1")])
    await memory_store.add_messages(session_b, [ChatMessage(role="user", content="From B1")])

    # Simulate two sessions continuing at the same time
    async def continue_session(sid, content):
        await memory_store.add_messages(sid, [ChatMessage(role="assistant", content=content)])

    await asyncio.gather(
        continue_session(session_a, "Reply A"),
        continue_session(session_b, "Reply B"),
    )

    hist_a = await memory_store.get_history(session_a)
    hist_b = await memory_store.get_history(session_b)

    assert len(hist_a) == 2
    assert all(h.role in ("user", "assistant") for h in hist_a)
    assert hist_a[-1].content == "Reply A"

    assert len(hist_b) == 2
    assert hist_b[-1].content == "Reply B"

Testing what happens when Redis misbehaves

With fakeredis we can also simulate infrastructure failures:

async def test_graceful_fallback_on_redis_timeout(memory_store):
    # Forge the fake client to raise on the next write
    memory_store.redis_client.set = AsyncMock(side_effect=ConnectionError("timeout"))

    with pytest.raises(MemoryStoreBackendError):
        await memory_store.add_messages("s1", [ChatMessage(role="user", content="this will fail")])

In total, I wrote 23 such cases — covering session isolation, message trimming when a conversation grows too long, clearing history, TTL expiration, serialization edge cases, and concurrent write collisions. Every time a new “memory evaporation” pattern appeared, I added a test that reproduced it, fixed the code, and watched it go green.

Results and takeaways

After integrating the test suite into our CI pipeline, memory-related incidents disappeared from the #oncall channel. The boss now knows me as “the guy who ended the goldfish era.” More importantly, the team now treats memory persistence as a first-class concern — every new storage adapter gets validated against the same 23 cases before it ever touches production.

If your LangChain agent ever “forgets” what the user just said, don’t reach for a redeploy. Write a test that replays the exact conversation pattern, break it, and only then fix it. You’ll sleep better — and your users won’t have to introduce themselves three times.