Redis Persistence Pitfall: How RDB+AOF Hybrid Persistence Silently Lost Data — I Reproduced 30 Failure Scenarios with pytest + Docker

#python #programming

2:17 AM. I was blasted awake by a chain of alerts — the order service was frantically throwing “inventory deduction failed,” and the logs were drowning in KeyError. I jumped onto the Redis dashboard: memory usage was only 30%, but the inventory keys for dozens of hot products had simply vanished. Just yesterday we had rolled out the hybrid persistence strategy, thinking we could finally sleep soundly. That confidence was brutally shattered. After six hours of digging, the root cause surfaced: under a specific restart timing, Redis silently dropped a batch of keys that hadn’t yet been flushed to the AOF, while the RDB snapshot happened to sit right in an empty window. This pitfall drilled a lesson into me: if you haven’t thrown your persistence strategy into real failure scenarios and let it “explode” a few times, you can never really trust it.

Breaking down the problem

Our use case is e-commerce inventory deduction, following a “write to Redis first, then async persist to DB” model. Data consistency is critical — we absolutely cannot accept losing confirmed deduction results after a restart. To prevent data loss, we enabled RDB + AOF hybrid persistence (aof-use-rdb-preamble yes), with AOF fsync every second (default) and RDB snapshots every 5 minutes. It all looked beautiful on paper, but the failure timing was the trap.

During a rolling restart of the Redis cluster, the ops script executed kill -9 (because SHUTDOWN had been misconfigured and timed out). At that exact moment, Redis had just completed an RDB save (the 5-minute cycle), while the writes from the last 2 seconds were still sitting in the AOF buffer and had not yet been fsynced to disk. The RDB didn’t contain those keys either. With AOF enabled, Redis rebuilds its state on restart from the AOF rather than from dump.rdb; under hybrid persistence, the RDB-format preamble serves as the base dataset and the commands appended after it are replayed on top. However, if the last few commands in the AOF file are truncated, Redis (with the default aof-load-truncated yes) simply discards the incomplete tail, meaning all writes during those 2 seconds were lost. Conventional monitoring (memory usage, connection counts) would never catch this time-window bug.
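To make the truncated-tail behavior concrete, here is a toy model in Python. It is not Redis's actual loader: it only assumes that the incremental part of the AOF is a sequence of RESP-encoded commands, and it ignores the RDB preamble entirely. The helper names encode_command and parse_complete_commands are hypothetical.

```python
def encode_command(*args: bytes) -> bytes:
    """Encode one command the way it appears in a RESP-formatted AOF."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        out += b"$%d\r\n%s\r\n" % (len(arg), arg)
    return out


def _read_count(data: bytes, pos: int, prefix: bytes):
    """Parse a '<prefix><int>\\r\\n' header and return (next_pos, value)."""
    if data[pos:pos + 1] != prefix:
        raise ValueError("unexpected or missing prefix")
    eol = data.find(b"\r\n", pos)
    if eol == -1:
        raise ValueError("truncated header")
    return eol + 2, int(data[pos + 1:eol])


def parse_complete_commands(data: bytes):
    """Return (commands, valid_length).

    commands holds every fully written command; valid_length is the byte
    offset where the last complete command ends. Anything after that
    offset is an incomplete tail and is dropped.
    """
    commands, pos = [], 0
    while pos < len(data):
        cursor = pos
        try:
            cursor, argc = _read_count(data, cursor, b"*")
            args = []
            for _ in range(argc):
                cursor, size = _read_count(data, cursor, b"$")
                end = cursor + size
                if data[end:end + 2] != b"\r\n":
                    raise ValueError("truncated argument")
                args.append(data[cursor:end])
                cursor = end + 2
        except ValueError:
            break  # incomplete tail: stop here and discard the rest
        commands.append(args)
        pos = cursor
    return commands, pos


# Two writes, with the crash cutting off the end of the second one.
first = encode_command(b"SET", b"stock:1", b"10")
second = encode_command(b"DECR", b"stock:1")
commands, valid_length = parse_complete_commands((first + second)[:-5])
```

In this model the DECR never reaches the replayed command list, even though the client that issued it may already have received a successful reply. That is the gap the rest of this article tries to measure.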

Why do “textbook” protections fail? Because the theoretical persistence guarantees weaken under extreme timing conditions: filesystem caches, process signals, and I/O scheduling can all make data you thought was persisted quietly disappear. The only way forward is to repeatedly simulate various failure scenarios through automated tests and observe the boundary conditions of data recovery.

Designing the approach

I needed a test environment that could spin up quickly, inject faults, and verify data integrity. Here’s the tech stack I chose:

Docker instead of real machines: Spin up an isolated Redis instance in seconds, kill it freely, and never pollute the host.
pytest to drive the tests: Writing test cases feels like writing documentation, and fixtures are a natural fit for managing Redis container lifecycles.
docker-py instead of docker‑compose: I needed to dynamically start/stop containers and send signals from within the tests. container.kill(signal='SIGKILL') offers far more flexibility than compose.
No Redis built‑in commands like DEBUG sleep: Those don’t simulate realistic failures; process‑level signals are much closer to what happens in production.
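As a rough illustration of the signal-based approach, the failure modes could be wrapped in one small helper. The function name inject_fault and its mode names are hypothetical; the only docker-py calls it relies on are Container.kill(signal=...) and Container.stop(timeout=...).

```python
def inject_fault(container, mode: str) -> None:
    """Apply one failure mode to a docker-py Container object."""
    if mode == "sigkill":
        # Equivalent to kill -9: Redis gets no chance to flush anything.
        container.kill(signal="SIGKILL")
    elif mode == "sigterm":
        # Redis receives SIGTERM and can run its normal shutdown path.
        container.kill(signal="SIGTERM")
    elif mode == "graceful":
        # docker stop: stop signal first, SIGKILL once the timeout expires.
        container.stop(timeout=10)
    else:
        raise ValueError(f"unknown fault mode: {mode}")
```

Keeping the modes behind a single function makes it straightforward to feed them into pytest.mark.parametrize later, so the same write-and-verify test body runs once per failure mode.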

Architecture idea: A session-scoped pytest fixture creates the shared Docker client, and docker-py pulls the Redis image on first use if it is not already present locally. A function-scoped fixture starts a brand-new container for each test with the desired persistence parameters. Inside each test function, I write known data, simulate a failure (kill -9 / power loss / abrupt stop), restart the container, and check key integrity.
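The final "check key integrity" step can be sketched independently of Docker. The snippet below is a minimal illustration, assuming the test keeps a plain dict of everything it wrote and reads the recovered data back into another dict; RecoveryReport and compare_snapshots are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryReport:
    missing: list = field(default_factory=list)    # written before the fault, absent afterwards
    corrupted: list = field(default_factory=list)  # present, but holding a different value

    @property
    def ok(self) -> bool:
        return not self.missing and not self.corrupted


def compare_snapshots(expected: dict, recovered: dict) -> RecoveryReport:
    """Compare what the test wrote with what Redis returned after the restart."""
    report = RecoveryReport()
    for key, value in expected.items():
        if key not in recovered:
            report.missing.append(key)
        elif recovered[key] != value:
            report.corrupted.append(key)
    return report


expected = {"stock:1": "10", "stock:2": "5", "stock:3": "7"}
recovered = {"stock:1": "10", "stock:2": "4"}
report = compare_snapshots(expected, recovered)
```

Separating missing keys from corrupted ones matters for diagnosis: a missing key points at writes that never reached disk, while a stale value suggests that an older snapshot was loaded and later updates were lost.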

Core implementation

Part 1: pytest fixture — launching a Redis container with persistence settings

This code solves the problem “How do I quickly create a customizable Redis instance for each test case?” It uses docker-py to pull the image, create a container, and mount a temporary directory for the persistence files (dump.rdb and the AOF files), so data survives a restart.

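# conftest.py
import time

import docker
import pytest

@pytest.fixture(scope="session")
def docker_client():
    # Requires a reachable Docker daemon
    client = docker.from_env()
    return client

@pytest.fixture
def redis_container(docker_client, tmp_path):
    """A dedicated Redis container for each test case, cleaned up automatically."""
    data_dir = tmp_path / "redis-data"
    data_dir.mkdir()

    # Persistence is configured through command-line arguments, so no redis.conf edits are needed
    container = docker_client.containers.run(
        "redis:7-alpine",
        command=[
            "redis-server",
            "--appendonly", "yes",              # enable AOF
            "--aof-use-rdb-preamble", "yes",    # hybrid persistence
            "--save", "5 1",                    # RDB snapshot if at least 1 change within 5 seconds
            "--appendfsync", "everysec",
        ],
        volumes={str(data_dir): {"bind": "/data", "mode": "rw"}},
        detach=True,
        remove=True,  # delete the container automatically once it stops
        ports={"6379/tcp": None},  # random host port
    )
    # Wait for Redis to come up; exec_run has no retry option, so poll manually
    for _ in range(10):
        result = container.exec_run("redis-cli ping")
        if result.exit_code == 0 and b"PONG" in result.output:
            break
        time.sleep(0.5)
    else:
        container.stop(timeout=0)
        raise RuntimeError("Redis did not become ready in time")
    container.reload()  # refresh attrs so container.ports shows the assigned host port
    yield container
    # Fixture teardown: force cleanup; the container may already be gone after a SIGKILL
    try:
        container.stop(timeout=0)
    except docker.errors.NotFound:
        pass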

Using tmp_path guarantees isolation of persistence files across tests. Why not use --dir /data? The official Redis Docker image already sets the working directory to /data; we just need to mount our volume there.
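Because the data directory lives on the host, a test can also look at which persistence files Redis actually wrote before and after a fault. The helper below is a small sketch with a hypothetical name, list_persistence_files. One assumption to keep in mind: with Redis 7 images the AOF is stored as several files under an appendonlydir/ subdirectory rather than as a single appendonly.aof, so the walk is recursive.

```python
from pathlib import Path


def list_persistence_files(data_dir: Path) -> dict:
    """Map each file under data_dir (relative POSIX path) to its size in bytes."""
    return {
        path.relative_to(data_dir).as_posix(): path.stat().st_size
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
```

Calling it right after the kill and again after the restart shows, for example, whether the AOF shrank because a truncated tail was cut off. Depending on the host, files created by the container user may need adjusted permissions before they can be inspected this way.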

Part 2: Simulating a failure — kill -9, restart, and verify data

This test case specifically validates whether data is lost under a kill -9 with hybrid persistence enabled. It writes 1000 keys, immediately sends SIGKILL, then restarts the container and checks the key count. A crucial detail: the restart command must remain consistent, so Redis loads the previously persisted files.


# test_persistence.py
import redis
import time

def test_mixed_persistence_survives_sigkill(redis_container):
    # Look up the dynamically assigned host port
    port = redis_container.ports["6379/tcp"][0]["HostPort"]
    r = redis.Redis(host="local