At 2 a.m., the on-call phone jolted me awake. The SRE from the user service sounded panicked: “Redis was restarted in production, and now a ton of user sessions are gone. Everyone’s logged out, and the customer support lines are exploding.” As I pulled up the monitoring dashboards, my heart sank. This Redis instance was dedicated to storing user sessions. A year ago, we had explicitly requested that persistence be enabled, and it was configured with both RDB and AOF for safety. So how on earth did we lose data? By daybreak, I had finally pieced together the root cause. Along the way, I built an automated verification system with Pytest and Docker to make sure this never happens again. Let me walk you through what went wrong and how this validation tool can help you avoid the same trap.
Breaking Down the Problem: Why Does Data Still Disappear When Persistence Is Enabled?
The setup was straightforward: we used Redis to store web sessions and required data to survive a restart. The ops team had configured RDB (save 900 1) and AOF (appendfsync everysec), which looked rock solid. But the failure sequence went like this:
- The container had been running for 2 hours; the last RDB save was 30 minutes ago.
- A rolling restart was triggered; the Redis process was gracefully shut down with docker stop.
- After the restart, we discovered that of the nearly 20,000 sessions written in the past 30 minutes, around 10,000 were lost.
The root cause is that this combination of RDB and AOF does not cover the scenario of “writing a large volume of data in a short time and then immediately restarting.”

The save 900 1 rule triggers a snapshot only once more than 900 seconds have passed since the last save and at least 1 key has changed. A burst of writes can therefore sit unsnapshotted for up to 15 minutes.

AOF with appendfsync everysec fsyncs once per second. On SIGTERM, Redis shuts down gracefully: it flushes and fsyncs its AOF buffer and, because save points are configured, writes a final RDB snapshot. However, docker stop gives the process only 10 seconds to exit by default. If the dataset is large and this shutdown work takes longer than 10 seconds, Docker sends SIGKILL, leaving an incomplete AOF file behind.

On the next start, Redis handles the truncated AOF according to aof-load-truncated. With the default of yes, it loads what it can and discards the incomplete tail. With no, it refuses to start until the file is repaired with redis-check-aof. Either way, the unflushed data is gone.
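To make the save 900 1 semantics concrete, here is a small pure-Python model of how a set of save rules decides whether a snapshot is due. It is a simplification, and assumes each rule fires once more than its seconds threshold has elapsed since the last successful save and at least its changes threshold of writes has accumulated. The helper name should_snapshot is hypothetical.

```python
from typing import Iterable, Tuple

# A save rule is (seconds, changes), as in "save 900 1".
SaveRule = Tuple[int, int]


def should_snapshot(rules: Iterable[SaveRule],
                    seconds_since_last_save: int,
                    changes_since_last_save: int) -> bool:
    """Simplified model of Redis save points (hypothetical helper).

    A snapshot is due if ANY rule is satisfied: more than `seconds` have
    elapsed since the last save AND at least `changes` keys were modified.
    """
    return any(
        seconds_since_last_save > seconds and changes_since_last_save >= changes
        for seconds, changes in rules
    )


rules = [(900, 1)]
# 20,000 writes within 10 minutes: still no snapshot under "save 900 1" alone.
print(should_snapshot(rules, 600, 20_000))   # False
# A single write, but more than 15 minutes have passed: the snapshot fires.
print(should_snapshot(rules, 901, 1))        # True
# Adding a tighter rule such as "60 10000" catches write bursts sooner.
print(should_snapshot([(900, 1), (60, 10000)], 120, 20_000))  # True
```

Under this model, a burst of writes shortly after a snapshot exists only in memory, and in the AOF if it is enabled, until the next rule fires. That is why the AOF's behavior at shutdown becomes decisive.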
Why don’t typical approaches work? In the past, to test persistence, we would spin up a local Docker container, SET a few keys, restart it, and check whether the data was still there. This “eyeball test” cannot simulate massive writes or race conditions near the shutdown timeout, nor can it cover the many ways a process can be killed unexpectedly. What we need is an automated, repeatable validation method that covers real-world failure modes.
Designing the Solution: Parameterized Validation with Pytest and Docker
I needed to do three things:
- Quickly spin up a Redis container with a predefined persistence policy.
- Simulate real write pressure.
- Kill the container in different ways, then restart it and check data consistency.
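The “kill the container in different ways” step can be sketched as a small dispatcher over failure modes. This assumes the container is a docker-py Container object, for example the one testcontainers returns from DockerContainer.get_wrapped_container(). The names FailureMode, inject_failure and crash_and_restart are hypothetical.

```python
from enum import Enum


class FailureMode(Enum):
    # docker stop: SIGTERM, then SIGKILL once the grace period expires
    GRACEFUL_SHUTDOWN = "graceful_shutdown"
    # Immediate SIGKILL: the closest a container test gets to pulling the plug
    POWER_LOSS = "power_loss"


def inject_failure(container, mode: FailureMode, grace_seconds: int = 10) -> None:
    """Stop a docker-py Container in the requested way (hypothetical helper).

    `container` is assumed to be a docker-py Container object, for example
    the one returned by testcontainers' DockerContainer.get_wrapped_container().
    """
    if mode is FailureMode.GRACEFUL_SHUTDOWN:
        # Same as `docker stop -t <grace_seconds>`; Docker's default is 10 s
        container.stop(timeout=grace_seconds)
    elif mode is FailureMode.POWER_LOSS:
        # Same as `docker kill --signal SIGKILL`: Redis never runs its shutdown sequence
        container.kill(signal="SIGKILL")
    else:
        raise ValueError(f"unknown failure mode: {mode!r}")


def crash_and_restart(container, mode: FailureMode, grace_seconds: int = 10) -> None:
    """Inject the failure, then start the same container so its /data volume is reused."""
    inject_failure(container, mode, grace_seconds)
    container.start()
```

Sweeping grace_seconds (say 1, 10 and 60) around how long Redis actually needs to shut down is one way to probe the timeout race. Note that SIGKILL only ends the process. Bytes Redis has already passed to the kernel with write() still reach disk, so this approximates a real power cut rather than reproducing it.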
For the test framework, I went with Pytest because of its powerful parameterization: you can easily generate test combinations of different persistence configurations and failure modes. For container management, I chose testcontainers-python. It handles container lifecycles automatically, cleans up after itself, and is easy to run in CI. Why not use docker-compose? Because I needed to run dozens of configurations with the same image, and writing a pile of YAML files would be tedious and inflexible. With testcontainers, I can build configurations dynamically in code and even mount a custom redis.conf.
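Here is a sketch of the configuration × failure-mode matrix that parameterization produces. The configuration names and values are hypothetical examples, written in the dictionary shape the fixture below reads (save, aof, appendfsync).

```python
from itertools import product

# Hypothetical persistence configurations, in the shape the fixture expects.
PERSISTENCE_CONFIGS = {
    "rdb_900_1": {"save": "900 1", "aof": "no"},
    "aof_everysec": {"save": "", "aof": "yes", "appendfsync": "everysec"},
    "rdb_plus_aof_everysec": {"save": "900 1", "aof": "yes", "appendfsync": "everysec"},
    "aof_always": {"save": "", "aof": "yes", "appendfsync": "always"},
}
FAILURE_MODES = ["graceful_shutdown", "power_loss"]


def build_matrix():
    """Return (test_id, config, failure_mode) for every combination."""
    return [
        (f"{name}-{mode}", config, mode)
        for (name, config), mode in product(PERSISTENCE_CONFIGS.items(), FAILURE_MODES)
    ]


matrix = build_matrix()
print(len(matrix))     # 8
print(matrix[0][0])    # rdb_900_1-graceful_shutdown
```

In Pytest, these tuples can be passed to a single parametrize call over ("redis_container", "failure_mode") with indirect=["redis_container"]. Pytest then delivers each config to the fixture as request.param, and the first tuple elements serve as ids. Alternatively, stacking two parametrize decorators makes Pytest build the same cross product itself.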
The overall architecture: a conftest.py provides a parameterizable Redis container fixture. Test cases verify two scenarios: “graceful shutdown” and “unexpected power loss.” Finally, a report generated with the pytest-csv plugin (its --csv option) or with Allure shows how much data each strategy loses.
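The per-strategy number the report needs can be computed with a helper like the one below. It compares the keys written before the failure with the keys found afterwards. The function name loss_report and its output fields are hypothetical; the example reuses the 20,000 / 10,000 figures from the incident.

```python
def loss_report(written, surviving) -> dict:
    """Summarize how many written keys survived a restart (hypothetical helper)."""
    written_set = set(written)
    lost = written_set - set(surviving)
    total = len(written_set)
    return {
        "written": total,
        "lost": len(lost),
        # Guard against division by zero when nothing was written
        "loss_ratio": (len(lost) / total) if total else 0.0,
    }


written = [f"session:{i}" for i in range(20_000)]
surviving = written[:10_000]
print(loss_report(written, surviving))
# {'written': 20000, 'lost': 10000, 'loss_ratio': 0.5}
```

Each test can record one such dictionary per configuration and failure mode, which maps directly onto one row of the report.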
Core Implementation: Building the Automated Validation Step by Step
Step 1: Define a container fixture that can configure Redis on demand
This snippet solves the problem of running the same image with different persistence policies. testcontainers ships a Redis module (RedisContainer). For precise control, however, I used the generic DockerContainer and mounted a custom redis.conf instead of passing command-line arguments.
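```python
# conftest.py
import os
import tempfile

import pytest
import redis  # redis-py, used by the assumed completion at the end of the fixture
from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs


def _write_conf(conf_content: str) -> str:
    """Write the config to a temporary file and return its path on the host."""
    tmp = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".conf")
    tmp.write(conf_content)
    tmp.close()
    # NamedTemporaryFile is created with mode 0600, but the official image runs
    # redis-server as the unprivileged "redis" user, so make the file readable.
    os.chmod(tmp.name, 0o644)
    return tmp.name


@pytest.fixture
def redis_container(request):
    """
    Create a Redis container with the parametrized persistence settings,
    mounting a custom redis.conf.
    Example request.param: {"save": "1 1", "appendfsync": "no", "aof": "no"}
    """
    params = getattr(request, "param", {})
    save_rule = params.get("save", "")  # e.g. "900 1"
    appendfsync = params.get("appendfsync", "no")
    use_aof = params.get("aof", "no")

    conf_lines = []
    if save_rule:
        conf_lines.append(f"save {save_rule}")
    else:
        conf_lines.append('save ""')  # disable RDB entirely
    if use_aof.lower() == "yes":
        conf_lines.append("appendonly yes")
        conf_lines.append(f"appendfsync {appendfsync}")
    else:
        conf_lines.append("appendonly no")
    conf_path = _write_conf("\n".join(conf_lines))

    # Build the container, mount the config, and start it
    container = DockerContainer("redis:7-alpine")
    container.with_volume_mapping(conf_path, "/usr/local/etc/redis/redis.conf")
    container.with_command("redis-server /usr/local/etc/redis/redis.conf")
    container.with_exposed_ports(6379)
    container.start()
    # Assumed completion from here on: wait for Redis, yield a client, clean up.
    wait_for_logs(container, "Ready to accept connections")
    redis_host = container.get_container_host_ip()
    redis_port = int(container.get_exposed_port(6379))
    client = redis.Redis(host=redis_host, port=redis_port)
    try:
        yield container, client
    finally:
        client.close()
        container.stop()
        os.unlink(conf_path)
```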
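For the “simulate real write pressure” step, here is a sketch of a bulk session writer. It uses redis-py pipelines with transaction=False, so commands are sent in batches with fewer round trips. The client is assumed to be a redis-py client pointed at the test container, for example redis.Redis(host, port). The function name, key prefix and random payloads are hypothetical stand-ins for real sessions.

```python
import uuid


def write_sessions(client, count: int, batch_size: int = 1000, prefix: str = "session:"):
    """Write `count` fake session keys in pipelined batches; return the keys written.

    `client` is assumed to be a redis-py client; key names and payloads are
    hypothetical stand-ins for real session data.
    """
    keys = []
    for start in range(0, count, batch_size):
        # Non-transactional pipeline: just batches commands, no MULTI/EXEC
        pipe = client.pipeline(transaction=False)
        for i in range(start, min(start + batch_size, count)):
            key = f"{prefix}{i}"
            pipe.set(key, uuid.uuid4().hex)
            keys.append(key)
        pipe.execute()
    return keys
```

A test compares the returned key list against the keys still present after the container is killed and restarted.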