I Used pytest for 3 Years Before Realizing I Never Actually Tested LLM Memory Persistence

#python #programming

It was 2:30 a.m. when a user complaint exploded on my phone: “Your chatbot has amnesia again—it forgot everything we talked about three days ago!” I scrambled out of bed, dug through the logs, and discovered that a seemingly harmless “optimize Redis serialization” PR from the previous week had silently caused the legacy message loader to throw an exception whenever it encountered the new structure. Worse, our live memory storage module had zero coverage for that regression path. The irony? I had been writing pytest for three years, but every single test mocked Redis, mocked the storage layer, and only exercised business logic. I had never actually verified memory persistence and consistency regression against a real database. After that night, I changed everything.

This article is my blood-and-tears postmortem: how to build a reproducible, regression-proof integration test suite for LLM memory storage with pytest and Docker—so that “what you stored yesterday can be read back exactly today, and it survives a container restart tomorrow.”

Breaking down the problem: why mocks can’t save memory storage

LLM memory storage sounds trivial: save conversations, load conversations. But in the real world, three tricky dimensions immediately surface:

Version compatibility of serialization/deserialization

You store a Message object, but when you read it back, you deserialize it into a dict or a Pydantic model. The moment a timestamp format, a role enum value, or an extra metadata field changes, a mock-based test that fakes disk reads and writes will never notice—because the mock always returns idealized data you prepared in advance.
Concurrency consistency

A user can fire off several messages in rapid succession within the same session, triggering concurrent writes. Redis’s HSET and ZADD are atomic, but if you wrap them in a “read-then-write” merge step (e.g., merging a context window), you can easily get overlapping updates and lost messages. A pure unit test running single-threaded will never hit such race conditions, not even after 100 runs.
Persistence regression

After a Docker container restart, can the AOF/RDB files on the mounted volume restore everything correctly? Are you relying on Redis’s default save configuration, or did you happen to land exactly within the 900-second window where no snapshot triggered? If an ops colleague casually bumps the Redis image version or tweaks the persistence policy, previous conversations can vanish into thin air. Unless you actually kill the container and then read the data back, this pitfall stays forever in your blind spot.

The root cause is simple: memory problems are always compound failures of storage engine × access pattern × time. Mocks amputate the storage engine, and manual testing removes the high-frequency repetition—you will never wait long enough for the failure to surface.

Designing the solution: real dependencies with Docker, sharp regression with pytest

We need a real storage environment that can be created and destroyed on demand, and we need to run the exact same set of consistency and persistence test cases against it over and over.

Comparing the candidates:

Option A: Point your dev environment directly at a local Redis. Downside: local data gets polluted, you have to manually flush between tests, and your colleague’s Redis version may differ from yours—no guaranteed regression consistency.
Option B: Bring up the full dependency stack with docker-compose and run pytest by hand. Downside: extra process management, not atomic enough, and tricky to achieve unit-test-level retries in CI.
Option C (the one I settled on): testcontainers-python to spin up Redis/Postgres containers on-demand inside pytest fixtures, automatically destroyed when the test finishes. Combine that with pytest’s scope and autouse to control container lifetimes, achieving isolation while keeping startup costs reasonable.

Why not pytest-docker? It’s better suited for running a long-lived container that you start once in CI. testcontainers, on the other hand, lets you treat the database as part of the test itself, dynamically created. When writing tests, you don’t have to care whether the external environment is already prepared, and parameterization becomes far easier—e.g., running the same test against Redis 6 and Redis 7.

Architecturally, we define an abstract MemoryStore base class. If we later swap in Postgres/pgvector, we can reuse the exact same consistency test suite. The key principle: tests don’t care about implementation details; they care about the contract. What you write is what you should read back; after a restart, historical data must not be lost.

Core implementation: inviting the Redis container into a pytest fixture

1. Abstract memory storage interface

This snippet decouples the tests from the concrete storage implementation—we only need to define the contract for reading, writing, and persistence behavior.

# llm_memory/store.py
from abc import ABC, abstractmethod
from typing import List, Optional
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Message:
    role: str          # 'user' | 'assistant' | 'system'
    content: str
    timestamp: float   # epoch second, 保证跨语言兼容

class MemoryStore(ABC):
    @abstractmethod
    def append_message(self, session_id: str, msg: Message) -> None:
        ...

    @abstractmethod
    def get_history(self, session_id: str, limit: int = 100) -> List[Message]:
        ...

    @abstractmethod
    def close(self) -> None:
        ...

2. RedisMemoryStore implementation—must handle serialization edges

This piece addresses real-storage serialization and deserialization consistency. We deliberately use JSON instead of pickle because pickle has terrible cross-version compatibility and is unsafe.

# llm_memory/redis_store.py
import json
import redis
from typing import List
from llm_memory.store import MemoryStore, Message

class RedisMemoryStore(MemoryStore):
    def __init__(self, host: str, port: int, db: int = 0):
        self.client = redis.Redis(host=host, port=port, db=db,
                                  decode_responses=True)

    def _session_key(self, session_id: str) -> str:
        return f"session:{session_id}:messages"

    def append_message(se