Bringing LLM Memory Regression Tests from 30 Minutes Down to 90 Seconds with pytest + Redis

#python #programming

At 2:17 a.m., a chain of alarms yanked me out of sleep: the LLM in production had suddenly "lost its memory." One moment users were discussing project timelines, the next it was naïvely asking "How can I help you?" After digging through logs, I found the culprit: a one-line change to Redis’s expiry policy that shipped with a memory-store release. Nobody had tested the memory persistence flow before deployment, so every session’s context expired within 5 minutes. While fixing the bug, I cursed to myself: if only we had automated, repeatable regression tests that never touch production data, this would never have happened.

Breaking down the problem

At its core, an LLM memory store is a key–value system with TTLs: session IDs serve as keys, and dialogue history, summaries, and vector indexes are serialized and written to Redis. The business requirement is clear: "multi-turn conversations must never lose memory, 24/7." Yet our testing process looked like this:

Fire a few messages manually via Postman, then eyeball Redis keys to guess if it’s working.
Test data shares the same Redis instance as production; one slip and you’ve deleted real sessions.
Redis specifics such as lazy expiration and asynchronous deletion are ignored, so timeout tests amount to little more than superstition.

The root cause isn’t that Redis is unreliable — it’s that we never integrated Redis’s real behavior into regression tests. Mocking Redis with a plain dict means you’re not testing timeouts, serialization, or connection-pool exhaustion. Deploying without those checks is like flying blind.
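To see why a dict-backed mock gives false confidence, consider the minimal sketch below. DictRedis and save_session are hypothetical names made up for illustration, not part of any library. The fake accepts the same ex argument as a real SET call but silently discards it, so the check passes no matter what TTL the application sets.

```python
import json


class DictRedis:
    """Hypothetical dict-backed stand-in for a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ex=None):
        # The expiry argument is accepted but never enforced.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def save_session(client, session_id, messages, ttl):
    # Hypothetical application code: serialize and store with a TTL.
    client.set(f"session:{session_id}", json.dumps(messages), ex=ttl)


fake = DictRedis()
messages = [{"role": "user", "content": "hi"}]
save_session(fake, "user_abc", messages, ttl=300)

# The round trip succeeds, and it would succeed for any ttl value,
# so a misconfigured expiry policy goes unnoticed.
assert json.loads(fake.get("session:user_abc")) == messages
```

A test like this exercises serialization at best; expiry behavior is never touched.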

Solution design

What we need is a test setup that’s “real Redis, fully isolated, auto-cleaned, and repeatable.”

Technology choices:

pytest: Its fixture system is perfect for managing test resources. Tagging tests with @pytest.mark.redis lets you flexibly skip real-backend tests in CI when needed.
Real Redis: No fakeredis. It offers no way to fast-forward the clock, so you still have to wait for keys to expire, and its support for Lua scripts, Streams, and modules lags behind the real server, which leads to tests passing while production explodes. For local development, use a dedicated database (db=15); in CI, spin up a dedicated instance with docker-compose.
Why not testcontainers-python: For small teams with flaky networks, pulling an image on every test run is painfully slow. Running Docker-in-Docker inside CI containers is also a configuration mess. Directly connecting to a “prepared, dedicated Redis” is far more practical.

Architecture idea: the redis_memory_store fixture defined in conftest.py gives each test a client paired with a unique key prefix (its namespace). After the test, every key under that prefix is deleted in SCAN batches, so parallel tests cannot interfere with one another.

Core implementation

1. Fixture providing a fully isolated Redis client

The code below solves the problem of “dirty data leaking between tests.” A random per-test prefix plus a teardown step after the fixture’s yield gives each test its own sandbox.

# conftest.py
import os
import uuid

import pytest
import redis

# Command-line option: opt in to tests that need a real Redis instance
def pytest_addoption(parser):
    parser.addoption("--real-redis", action="store_true", default=False,
                     help="run tests that require a real Redis instance")

@pytest.fixture(scope="session")
def redis_url():
    # Defaults to the local test database (db 15); override it through an
    # environment variable elsewhere. The name TEST_REDIS_URL is an assumption.
    return os.environ.get("TEST_REDIS_URL", "redis://localhost:6379/15")

@pytest.fixture
def redis_memory_store(redis_url):
    """
    Create a Redis client with an isolated key prefix for each test function
    and delete every key under that prefix once the test finishes.
    """
    prefix = f"test:{uuid.uuid4().hex[:8]}:"  # avoids key collisions between parallel tests
    client = redis.Redis.from_url(redis_url, decode_responses=True)
    client._test_prefix = prefix              # ad-hoc attribute so tests can pass the prefix on

    yield client

    # Cleanup: delete in SCAN batches so we never block the server with KEYS *
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor, match=f"{prefix}*", count=100)
        if keys:
            client.delete(*keys)
        if cursor == 0:
            break
    client.close()
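The --real-redis option above is registered but never consulted, so on its own it skips nothing. One common way to wire it to the redis marker is a collection hook in the same conftest.py. This is a sketch that assumes the marker name redis used by the tests; the skip reason text is arbitrary.

```python
# conftest.py (continued): a sketch, assuming the "redis" marker used by the tests
import pytest


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "redis: test needs a real Redis instance")


def pytest_collection_modifyitems(config, items):
    # With --real-redis, leave every collected test untouched.
    if config.getoption("--real-redis"):
        return
    # Otherwise attach a skip marker to each test tagged with @pytest.mark.redis.
    skip_redis = pytest.mark.skip(reason="needs --real-redis to run")
    for item in items:
        if "redis" in item.keywords:
            item.add_marker(skip_redis)
```

With this in place, a plain pytest run skips the Redis-backed tests, and pytest --real-redis runs them.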

2. Business tests for the memory store

These tests cover the core scenarios: write a dialogue memory → read it back → automatic expiration → isolation between prefixes. MemoryStore receives its Redis client through the constructor (dependency injection), so tests pass in redis_memory_store while production passes a client backed by a connection pool.

# test_memory_store.py
import time
import pytest
from myapp import MemoryStore  # your actual memory store module

@pytest.mark.redis
class TestMemoryPersistence:
    """Regression tests for memory persistence."""

    def test_write_and_read_conversation(self, redis_memory_store):
        """Basic read/write: a saved conversation must come back unchanged."""
        store = MemoryStore(redis_memory_store, prefix=redis_memory_store._test_prefix)
        session_id = "user_abc"
        messages = [{"role": "user", "content": "My name is Xiao Ming"},
                    {"role": "assistant", "content": "Hello, Xiao Ming"}]

        store.save(session_id, messages)
        loaded = store.load(session_id)

        assert loaded == messages, f"expected {messages}, got {loaded}"

    def test_memory_ttl(self, redis_memory_store):
        """TTL expiry: data written with a short TTL should be gone after the wait."""
        store = MemoryStore(redis_memory_store, prefix=redis_memory_store._test_prefix, ttl=2)
        session_id = "user_ttl"
        store.save(session_id, [{"role": "user", "content": "test"}])

        time.sleep(3)  # wait for the key to expire
        loaded = store.load(session_id)

        assert loaded is None or loaded == [], f"expected empty result after expiry, got {loaded}"

    def test_parallel_isolation(self, redis_memory_store):
        """Isolation: two stores with different prefixes must not see each other's data."""
        # Nest both prefixes under the per-test prefix so fixture cleanup removes them too.
        base = redis_memory_store._test_prefix
        store_a = MemoryStore(redis_memory_store, prefix=f"{base}a:")
        store_b = MemoryStore(redis_memory_store, prefix=f"{base}b:")
        store_a.save("1", [{"role": "x"}])
        store_b.save("1", [{"role": "y"}])

        assert store_a.load("1") == [{"role": "x"}]
        assert store_b.load("1") == [{"role": "y"}]
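The tests above import MemoryStore from myapp, which this post does not show. For readers who want something to run them against, here is a minimal hypothetical sketch that matches the constructor and the save/load calls used in the tests. The JSON serialization, the session: key layout, and the 7-day default TTL are all assumptions, not the actual implementation.

```python
import json


class MemoryStore:
    """Hypothetical minimal store; the real myapp.MemoryStore is not shown here."""

    def __init__(self, client, prefix="", ttl=7 * 24 * 3600):
        self._client = client  # any object with redis-py style set/get methods
        self._prefix = prefix  # namespace prepended to every key
        self._ttl = ttl        # seconds; the default is an assumed value

    def _key(self, session_id):
        return f"{self._prefix}session:{session_id}"

    def save(self, session_id, messages):
        # SET with ex= writes the value and its expiry in a single command.
        self._client.set(self._key(session_id), json.dumps(messages), ex=self._ttl)

    def load(self, session_id):
        # GET returns None for a missing or expired key.
        raw = self._client.get(self._key(session_id))
        return None if raw is None else json.loads(raw)
```

Because the client is injected, the same class works with the fixture's client in tests and with a pooled client in production.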

What we gained

We went from 30‑minute manual verification to a 90‑second fully automated regression suite that catches real-world Redis edge cases before they hit production. Now every PR that touches memory‑related code must pass these tests, and the “LLM amnesia” pager has stayed silent ever since. A real Redis instance, namespace isolation, and pytest fixtures are the three pillars that made it possible. If your team is still testing LLM memory with mocks or shared instances, it’s time to give this approach a shot.