Redis Crash Recovery Testing: The 2 Bugs Manual Testing Missed

#python #programming

At 3 a.m., an ops colleague called urgently: "The AI customer service bot's memories are all messed up—users were just chatting about their orders, and the next second the bot forgot everything. We found that Redis had crashed and restarted." I opened monitoring and saw that after Redis restarted, many session memories had been partially lost, but not completely—it was like "selective forgetting." This was worse than total loss—no one noticed the data was corrupted until users started complaining. Ultimately, the root cause pointed to one thing: we had never truly automated the verification of memory consistency after Redis crash recovery. Every time, we manually typed redis-cli to test a few keys and concluded "probably okay."

The deception of manual testing is that the failure scenarios you simulate are always idealized. You flush, kill, check a few data entries, and everything looks fine. But in real production, Redis can lose partial data in ways you never imagined—due to differences between RDB and AOF, fsync policies, and the loading order of mixed persistence during restarts. In this article, I'll break down the painful lesson and the solution: Redis crash recovery consistency verification based on Pytest + Testcontainers, and reveal the two hidden bugs that could only be caught by automated testing.

Problem Breakdown — Why Manual Testing Misses "Selective Forgetting"?

We use Redis as the short-term memory store for LLM. Each session's memories are stored in a List under the key session:<id>:messages, and each message is a serialized JSON. The recovery requirement is simple:

No matter how Redis crashes and restarts, the memories must either be all there or all gone (when full persistence is used), never half-present.

The real scenario is more complex: the failure could happen right after an RDB snapshot is completed, while AOF is rewriting, or in the middle of mixed persistence. When simulating manually, it's nearly impossible to kill Redis at exactly those precise moments. You can only do a few docker restarts, then LRANGE to check, see the data still there, and sign off. But the key to reproducing "selective forgetting" lies in:

Bug1: With only RDB enabled (no AOF), after a restart, all incremental writes since the last snapshot are lost, while developers assumed automatic saving.
Bug2: With AOF enabled but appendfsync everysec, when the process is killed abruptly, the last 1–2 seconds of write operations disappear, causing the memory list to lack a few tail entries, but the old data can still be read from the list, making it very difficult to spot with the naked eye.

These two issues are almost invisible with manual redis-cli get because you tend to pick a specific key to check, and if that key happens to fall within the snapshot range or before the AOF fsync, you'll be tricked. Only automated, parameterized, and repeatable fault injection testing can systematically expose them.

Solution Design — Why Pytest + Testcontainers?

To simulate real Redis crash recovery, there are several hard requirements:

Each test must start with a clean Redis instance with configurable persistence policies.
Be able to force-kill the Redis process at any point in the code execution, then restart and verify data integrity.
Tests need to be fully programmable and CI-integrable, not manual commands.

Common alternatives were all rejected:

fakeredis: purely in-memory simulation, no real persistence or process crashes—totally incapable of testing fault recovery.
docker-compose + manual scripts: decent environment isolation, but fault injection is limited to docker kill, can't trigger at precise code logic points, and hard to parameterize and assert.
Direct host Redis operations: dirty environment, interference between tests, impossible for CI.

Testcontainers for Python hits all these points perfectly: it starts a real Redis container directly in test code, we can obtain a connection via with redis_container, and at any moment call container.stop() then container.start() to simulate a crash restart. Combined with Pytest's fixture mechanism, each test case gets an independent Redis lifecycle, and the persistence configuration file can be injected via mounted volumes. The entire verification becomes repeatable unit tests, with fault injection granularity down to the line of code.

Core Implementation — Building Memory Consistency Tests Step by Step

Let's dive into the code. Assume a memory storage class RedisMemoryStore provides append_message(session_id, msg) and get_messages(session_id), backed by a Redis List. Now we need to write a test to verify its flaws under the "RDB without AOF" configuration.

Dependency Installation and Container Fixture (conftest.py)

This fixture solves the “fresh Redis per test” problem and allows passing persistence configuration via an environment variable.

# conftest.py
import pytest
import redis
from testcontainers.redis import RedisContainer

# 自定义 Redis 容器，允许传入配置文件
class ConfigurableRedisContainer(RedisContainer):
    def __init__(self, image="redis:7-alpine", port=6379, config_path=None):
        super().__init__(image=image, port=port)
        self.config_path = config_path

    def _configure(self):
        super()._configure()
        if self.config_path:
            # 挂载配置文件到容器
            self.volumes = {self.config_path: {"bind": "/usr/local/etc/redis/redis.conf", "mode": "ro"}}
            self.with_command("redis-server /usr/local/etc/redis/redis.conf")

@pytest.fixture(scope="function")
def redis_no_aof():
    """仅开启 RDB 的 Redis 实例（Bug1 专用）"""
    # 配置文件内容：只打开 RDB，关闭 AOF
    config = """
    save 900 1
    save 300 10
    save 60 10000
    appendonly no
    """
    import tempfile, os
    with tempfile.NamedTemporaryFile(mode='w', suffix='.conf', delete=False) as f:
        f.write(config)
        config_path = f.name
    try:
        container = ConfigurableRedi