Redis Memory Consistency Pitfalls: 3 Bugs That Cost Me 2 Days of Debugging


At 2 AM, our user group blew up: “The AI forgot my girlfriend’s birthday — I’m dead.” We run an emotional companion chatbot, and its long‑term memory lives entirely in Redis. Lately, users kept complaining that the bot spontaneously developed amnesia — sometimes a few lines of conversation disappeared, sometimes an entire session’s memory just reset to zero. My gut reaction: Redis doesn’t lose data. Then I dragged myself out of bed and checked the dashboards. Write QPS was totally normal, yet keys were simply vanishing into thin air.

Breaking Down the Problem

Our memory model is straightforward: each user session maps to a Redis Hash where the field is a memory ID and the value is a serialized memory snippet. At the end of a conversation, the code calls HMSET to merge the current turn’s new memories and uses EXPIRE to extend the TTL. The trouble surfaced with concurrent sessions — a user might chat from both the web and the mobile app, and both connections could trigger a merge at the same time. We used a “read‑then‑write” pattern: HGETALL to fetch all memories, merge in the application layer, then HMSET everything back. No transactions, no optimistic locking — pure hope.

During a load test we saw that when two clients wrote simultaneously, one client's memories simply disappeared. The root cause was a classic read-modify-write race: A read {1,2}, B also read {1,2}, A wrote {1,2,3}, B wrote {1,2,4} → final state {1,2,4}, memory 3 lost. Note that a bare HMSET only overwrites the fields it is given, so a loss like this implies that the write-back replaces the whole hash rather than adding fields to it. On top of that, we caught an EXPIRE mistake: someone had hard-coded EXPIRE 3600 (one hour), so by the time a user had chatted for two hours, all their memories had already expired.
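The lost update can be reproduced without Redis at all. The following is a minimal sketch that uses a plain dict in place of the server and assumes the write-back replaces the whole hash; the function names (`write_back_whole_hash`, `merge_fields_only`, `run_race`) are hypothetical and exist only for this illustration.

```python
def write_back_whole_hash(store, key, snapshot, new_fields):
    """Naive pattern: merge into a (possibly stale) snapshot, then replace the hash."""
    merged = dict(snapshot)
    merged.update(new_fields)
    store[key] = merged  # overwrites anything another client wrote in between


def merge_fields_only(store, key, new_fields):
    """Field-level merge: only the new fields are touched, nothing is replaced."""
    store.setdefault(key, {}).update(new_fields)


def run_race(use_field_merge):
    store = {"session": {"1": "m1", "2": "m2"}}
    # Both clients read before either one writes.
    snapshot_a = dict(store["session"])
    snapshot_b = dict(store["session"])
    if use_field_merge:
        merge_fields_only(store, "session", {"3": "m3"})
        merge_fields_only(store, "session", {"4": "m4"})
    else:
        write_back_whole_hash(store, "session", snapshot_a, {"3": "m3"})
        write_back_whole_hash(store, "session", snapshot_b, {"4": "m4"})
    return sorted(store["session"])


print(run_race(use_field_merge=False))  # ['1', '2', '4']  (memory 3 is lost)
print(run_race(use_field_merge=True))   # ['1', '2', '3', '4']
```

The second variant is what an atomic, field-level merge on the server side is meant to guarantee.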

The typical remedies are distributed locks or Lua scripts for atomic merges, but we wanted to go beyond that: build an automated test suite that cages every concurrency, expiry, and crash scenario, and runs on every commit. That’s how this memory‑consistency test framework came to life.

Design Choices

We had a few options: mock Redis in unit tests, or use an in‑memory library like fakeredis. But we needed to test real Redis behavior — expiration policies, eviction, replication lag — things mocks simply can’t reproduce. So we chose to connect to a real Redis instance and spin one up inside Docker in CI.

Why not test against Redis Cluster or Sentinel? Because our production environment is also a single Redis instance (don't ask, historic debt). We decided to nail single-instance consistency first. The core idea of the framework is to simulate real client conflicts with multi-threaded or async concurrency, then use assertions to verify that the final state matches exactly what we expect. To avoid test pollution, every test case gets its own isolated keyspace, either a separate Redis database (via SELECT on a different DB number) or keys isolated by a per-test namespace prefix.
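As a rough illustration of the "simulate conflicts, then assert on the final state" idea, here is a small helper that releases all workers at the same moment and collects any exceptions. `run_concurrently` is a hypothetical name, not part of the framework described here, and the usage example writes to an in-memory set rather than Redis so that it runs anywhere.

```python
import threading


def run_concurrently(worker, n_workers):
    """Run worker(i) in n_workers threads that all start together.

    Returns the list of exceptions raised by the workers (empty on success).
    """
    barrier = threading.Barrier(n_workers)
    errors = []

    def wrapped(index):
        try:
            barrier.wait()  # hold every thread until all of them are ready
            worker(index)
        except Exception as exc:  # surface failures to the test instead of losing them
            errors.append(exc)

    threads = [threading.Thread(target=wrapped, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


# Usage: every worker records its own id; afterwards none may be missing.
seen = set()
seen_lock = threading.Lock()


def record(index):
    with seen_lock:
        seen.add(index)


errors = run_concurrently(record, 8)
assert errors == []
assert seen == set(range(8))
```

The barrier matters: without it, threads tend to run one after another and the race window the test is supposed to exercise may never open.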

Another key point: the tests must cover crash-recovery scenarios. We explicitly kill the Redis process during a test and bring it back (controlling a Docker container via subprocess), then verify that the memory is intact after an AOF replay. Once all of this was rolled into pytest and run on every commit, the recurring nightmare of “I swear I wrote it, where did it go?” finally ended.
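A crash-recovery step along these lines might look like the sketch below. It assumes the Redis container already exists and was started with AOF enabled; the container name passed in, the helper names, and the injectable `run` and `ping` parameters are all assumptions made for illustration.

```python
import subprocess
import time


def crash_and_restart(container, run=subprocess.run):
    """Kill the container abruptly (SIGKILL by default), then start it again."""
    run(["docker", "kill", container], check=True)
    run(["docker", "start", container], check=True)


def wait_until_ready(ping, timeout=10.0, interval=0.1):
    """Poll ping() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if ping():
                return True
        except Exception:
            pass  # server still coming back up; retry after a short pause
        time.sleep(interval)
    return False
```

In a test, `ping` would typically be the client's own `ping` method, and the assertions on the hash contents would run only after `wait_until_ready` returns True. Passing `run` and `ping` in as parameters keeps the helpers easy to exercise without Docker.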

Core Implementation

I’ll walk through the skeleton of the test framework, explaining what each part solves before the code. All the snippets are ready to run.

1. Base fixture: a fully isolated Redis client

This fixture ensures test isolation: every test receives its own Redis connection and a unique key namespace, and any leftover keys under that namespace are cleaned up automatically, so tests never stomp on each other.

import pytest
import redis
import uuid

@pytest.fixture
def redis_client():
    """Return a test-only Redis client that uses a unique key prefix to avoid collisions."""
    r = redis.Redis(
        host='localhost',
        port=6379,
        db=0,
        decode_responses=True
    )
    namespace = f"test:{uuid.uuid4().hex[:8]}:"
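    # Attach the namespace to the client object so tests can build keys from it
    r._namespace = namespace
    # Before the test: remove any stale data left under this namespace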
    for key in r.scan_iter(f"{namespace}*"):
        r.delete(key)
    yield r
    # After the test: clean up again
    for key in r.scan_iter(f"{namespace}*"):
        r.delete(key)
    r.close()

2. Lua atomic merge and the concurrency test

Here's the heavy hitter: simulate concurrent writes to the same Hash and verify that our custom Lua atomic merge really eliminates the race condition. We wrote a Lua script, merge_memories.lua, that takes the Hash key, a TTL, and the new memories as a flat list of field/value pairs, writes each field with HSET (HSETNX would be the variant if existing memories must never be overwritten), and refreshes the TTL.

The test goal: with multiple threads calling this script simultaneously, the final Hash must contain every field written by every thread — not a single one missing.
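Because the script reads its input as a TTL followed by alternating field/value arguments, a dictionary of memories has to be flattened before the call. A minimal sketch of that step, with `build_merge_args` as a hypothetical helper name:

```python
def build_merge_args(ttl_seconds, memories):
    """Flatten a dict of memories into [ttl, field1, value1, field2, value2, ...]."""
    args = [int(ttl_seconds)]
    for field, value in memories.items():
        args.extend([field, value])
    return args


print(build_merge_args(3600, {"mem_1": "likes tea", "mem_2": "birthday in May"}))
# [3600, 'mem_1', 'likes tea', 'mem_2', 'birthday in May']
```

With redis-py, the resulting list can be passed as the `args` of a registered script, for example `merge_script(keys=[key], args=build_merge_args(3600, memories))`.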

import threading
import time
from redis import Redis

LUA_MERGE = """
local key = KEYS[1]
local ttl = tonumber(ARGV[1])
-- From ARGV[2] onward: even indexes are fields, odd indexes are values
for i = 2, #ARGV, 2 do
    redis.call('HSET', key, ARGV[i], ARGV[i+1])
end
if ttl > 0 then
    redis.call('EXPIRE', key, ttl)
end
return redis.call('HLEN', key)
"""

def test_concurrent_merge_no_loss(redis_client):
    """
    10 threads merge memories at the same time, each writing 100 fields;
    the final Hash size must equal 1000.
    """
    key = redis_client._namespace + "user:123:memories"
    merge_script = redis_client.register_script(LUA_MERGE)
    errors = []
    threads = []

    def thread_worker(thread_id):
        try:
            # Build the batch of fields this thread will write
            fields = []
            for i in range(100):
                field = f"mem_{thread_id}_{i}"