The 3-Hour Redis Cache Bug: Automated with Pytest

#python #programming

It was 2 AM when my phone started buzzing like crazy. Operations had posted three frantic messages in a row: “Customer order status shows ‘Cancelled’, but the payment gateway just confirmed a successful charge!” I dragged myself out of bed and checked the database — the record was clearly PAID, but Redis still held CANCELLED with 4+ minutes of TTL left. That meant every single request hitting the cache over the next 4 minutes would return the wrong status. I stared at a single line of SETEX code from 2 AM until 5 AM and finally understood: it wasn’t Redis’s fault. It was a textbook race condition hidden between expiration policy and write sequencing. Manual testing would never reproduce it — only real production traffic with enough concurrency could trigger it. After fixing the bug, the first thing I did wasn’t sleep. I opened a terminal and typed pip install pytest freezegun fakeredis. I needed this scenario to be interrogated ten thousand times by automated tests, or someday someone else would be cursing my code in the middle of the night.

Why Common Patterns Fail at the Margin

Our caching strategy was the classic Cache-Aside pattern: read requests check Redis first; on a miss we fall through to MySQL, then backfill the cache with a 300-second TTL. Updates write directly to the database and then DEL the corresponding cache key, letting the next read rebuild it. Sounds bulletproof, right? The problem lived in this sequence:

Thread A updates the order status to PAID, writes MySQL, and is about to send DEL cache.
Right after Thread A’s write commits to MySQL, but before its DEL command reaches Redis, Thread B issues a read request.
The cache key is still present, holding the old value CANCELLED with 250 seconds of TTL remaining.
Thread B reads CANCELLED and returns it directly, never touching the database.
Thread A’s DEL finally arrives — too late. Dirty data has already been served to the frontend.
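
The five steps above can be replayed deterministically without Redis at all. What follows is a minimal sketch, not the production code: plain dicts stand in for MySQL and Redis, two threading.Event objects force the interleaving, and the key name order:42 and the function names are made up for illustration.

```python
import threading

KEY = "order:42"                    # hypothetical key name
db = {KEY: "CANCELLED"}             # stand-in for MySQL
cache = {KEY: "CANCELLED"}          # stand-in for Redis, key still within its TTL

db_written = threading.Event()      # Thread A has committed to the database
read_done = threading.Event()       # Thread B has finished its read
observed = []                       # what the reader handed back to the caller


def writer():
    db[KEY] = "PAID"                # step 1: the database write lands
    db_written.set()
    read_done.wait(timeout=5)       # hold the DEL back, as network latency would
    cache.pop(KEY, None)            # step 5: the DEL arrives too late


def reader():
    db_written.wait(timeout=5)      # step 2: the read arrives after the DB write
    value = cache.get(KEY)          # steps 3-4: cache hit, the DB is never consulted
    if value is None:
        value = db[KEY]
    observed.append(value)
    read_done.set()


threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("reader saw:", observed, "| db:", db[KEY], "| cache:", cache.get(KEY))
```

The events only pin down an ordering that production traffic produces by chance; they do not change what either thread does. The reader returns CANCELLED even though the database already says PAID.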

You might think, “Just delete the cache first, then update the database.” That introduces another pitfall: after the cache is deleted but before the database update completes, a new read request will miss the cache, fetch the old row, and write it back into the cache, so the stale value outlives the update. That’s why people resort to workarounds like “double-delete”, which are still unreliable under high concurrency. Even more subtle: when the key also carries its own expiration, concurrent reads and writes, Redis memory eviction, or replication lag can make the moment of expiry behave unexpectedly. Typing commands one by one in redis-cli will never reproduce these millisecond-level races. We needed a verification approach that can manipulate time, orchestrate concurrency, and repeat deterministically.
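
The delete-first variant fails in a different way, and it can be pinned down in the same style. Again this is only a sketch under stated assumptions: dicts replace MySQL and Redis, the events force the unlucky ordering, and every name in it is hypothetical.

```python
import threading

KEY = "order:42"                        # hypothetical key name
db = {KEY: "CANCELLED"}                 # stand-in for MySQL
cache = {KEY: "CANCELLED"}              # stand-in for Redis

cache_deleted = threading.Event()       # the writer has removed the cache entry
stale_backfilled = threading.Event()    # the reader has rebuilt the cache


def writer_delete_first():
    cache.pop(KEY, None)                # 1. delete the cache entry first
    cache_deleted.set()
    stale_backfilled.wait(timeout=5)    # 2. the DB update is still in flight
    db[KEY] = "PAID"                    # 4. the update finally commits


def reader():
    cache_deleted.wait(timeout=5)
    value = cache.get(KEY)              # cache miss, because of the delete
    if value is None:
        value = db[KEY]                 # 3. reads the old row...
        cache[KEY] = value              #    ...and writes it back into the cache
    stale_backfilled.set()


threads = [threading.Thread(target=writer_delete_first), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("db:", db[KEY], "| cache:", cache[KEY])
```

Here the damage lasts longer than a single response: the cache keeps CANCELLED until the key expires or something deletes it again, which is the gap that “double-delete” tries to close.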

Designing the Test: Turning Time into a Rewindable Tape

The problem wasn’t Redis. It was that the “cache-database interaction protocol” had never been stress-tested under a realistic concurrency model. I needed a test suite that met three hard requirements:

Precise time control: freeze, fast-forward, rewind to verify the behavior of EXPIRE/SETEX/TTL.
Real concurrency: spin up dozens of threads/coroutines with interleaved execution to simulate production scheduling.
Repeatable and lightweight: no heavy Docker dependencies — a single pytest command locally should be enough.

Why reject other options?

Manual + redis-cli: no concurrency, no time freeze — hopeless.
Integration environment + real Redis: time cannot be frozen; tests rely on real sleep, making them either painfully slow or non-deterministic.
Pure mock: unittest.mock can fake a Redis client, but to simulate key expiry and eviction you’d have to implement your own LRU and event loop — effectively writing a crippled Redis inside the test, which itself would be bug-prone.
Celery-based async integration tests: too heavy, and focused on task scheduling rather than atomic consistency verification.

Final choice: pytest + freezegun + fakeredis (or real Redis). The core idea:

Use fakeredis as an in-memory Redis substitute; most commands are compatible, though time-related behavior needs care. Where expiry has to be checked against the real thing, connect to a local Redis instance (or a Redis container in CI). Keep one limitation in mind: freezegun patches the clock of the Python test process only, while Redis expires keys against its own server clock, both lazily on access and through periodic background sampling. That holds for SETEX and equally for SET followed by EXPIREAT or PEXPIREAT with an absolute timestamp, so freezing time in the test cannot make a real server expire a key early, and the client cannot force it to. The more robust approach is to not depend on the exact moment Redis drops a key and to verify the logic code path instead: let the read path decide from the TTL or from key existence, freeze time past the expiration, and expect the lookup to return None. With an in-process substitute such as fakeredis the clock belongs to the test process, which is what makes this feasible; confirm that it holds for the fakeredis version you run. This puts control back in the test’s hands.
Simulate concurrency with concurrent.futures.ThreadPoolExecutor, giving each thread its own Redis connection to avoid state contamination.
pytest’s fixture handles connection lifecycle and cleanup; parametrize exhaustively covers various expiry windows and concurrency levels.

With this, the “millisecond race condition” becomes a deterministic, replayable test — a CT scan for your code.
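
To make the “time as a tape” idea concrete without any server, here is a minimal sketch of a cache whose clock is injected by the test. FakeClock and TTLCache are hypothetical helpers written for this illustration; they are not part of fakeredis, freezegun, or redis-py, and they only imitate the SETEX/GET/TTL behavior the tests care about.

```python
class FakeClock:
    """Hypothetical test clock: it only moves when the test says so (milliseconds)."""

    def __init__(self, start_ms=0):
        self.now_ms = start_ms

    def advance(self, ms):
        self.now_ms += ms


class TTLCache:
    """Tiny in-memory stand-in for SETEX/GET/TTL; an illustration, not a Redis replacement."""

    def __init__(self, clock):
        self._clock = clock
        self._data = {}  # key -> (value, absolute expiry in ms)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, self._clock.now_ms + ttl_seconds * 1000)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock.now_ms >= expires_at:
            del self._data[key]  # lazy expiry: the key is dropped when someone looks
            return None
        return value

    def ttl(self, key):
        """Remaining whole seconds, or -2 when the key is absent (as Redis TTL reports)."""
        if self.get(key) is None:
            return -2
        return (self._data[key][1] - self._clock.now_ms) // 1000


clock = FakeClock()
cache = TTLCache(clock)
cache.setex("order:42", 300, "CANCELLED")

clock.advance(50_000)                    # 50 s later
ttl_mid = cache.ttl("order:42")          # 250 s remaining

clock.advance(249_999)                   # 1 ms before the boundary
just_before = cache.get("order:42")

clock.advance(1)                         # exactly at the boundary
at_boundary = cache.get("order:42")
ttl_after = cache.ttl("order:42")

print(ttl_mid, just_before, at_boundary, ttl_after)
```

Integer milliseconds keep the boundary exact; with floating-point seconds, a comparison one millisecond before expiry can round in either direction.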

Core Implementation: Let Concurrency and Expiry Battle in Code

Below are three critical code sections, each solving one problem:

1) Precisely control the expiry boundary

2) Verify the Cache-Aside protocol under concurrent updates

3) Parameterized bulk scenario coverage

1. Building a “time cage” with freezegun to verify instant consistency at expiry

This snippet answers: at the exact moment a cache key expires, could it still return stale data? We freeze time, issue a SETEX, advance the clock to the TTL boundary, then step one second past it and check what GET returns.

# test_redis_expiry.py
import time
import pytest
import redis
from freezegun import freeze_time

@pytest.fixture
def redis_client():
    """Give each test case its own Redis connection and clean up afterwards."""
    client = redis.Redis(host='localhost', port=6379, db=15, decode_responses=True)
    yield client
    client.flushdb()        # wipe the test database so tests don't interfere
    client.close()

@freeze_time("2025-01-01 12:00:00", tick=True)
def test_setex_expires_after_ttl(redis_client):
    ...  # body omitted in this snippet

The test then sets a value, moves time forward, and asserts that after expiration the key is gone. I omitted the full function body in the snippet; the key is that we can now prove the boundary.

2. Concurrent update verification of Cache-Aside

This part spawns writer threads that update the database and delete the cache, while reader threads continuously query the API. Using a Barrier we synchronize them to maximize overlap, then assert that no stale value is ever returned after a successful write.

# test_cache_consistency.py (snippet)
def test_concurrent_update_and_read(redis_client, db_session):
    # Use a Barrier so reader and writer threads start together, maximizing the race window
    ...
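
Since the full test body is not shown here, the following is only a sketch of how a Barrier-driven harness of this kind could look, not the article's actual test. Everything in it is an assumption made for illustration: OrderStore is a hypothetical in-memory stand-in for MySQL plus Redis, the order status is modelled as an increasing version number so that staleness is easy to detect, and the process-wide lock exists only to keep the stand-in well-behaved. It is not the fix applied in production and would not carry over to a multi-process deployment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

KEY = "order:42"  # hypothetical key name


class OrderStore:
    """Hypothetical stand-in: dicts replace MySQL and Redis, a lock serializes access."""

    def __init__(self):
        self.db = {KEY: 0}      # status modelled as an increasing version number
        self.cache = {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            if key not in self.cache:            # miss: fall through and backfill
                self.cache[key] = self.db[key]
            return self.cache[key]

    def write(self, key):
        with self._lock:
            self.db[key] += 1                    # update the database...
            self.cache.pop(key, None)            # ...then invalidate the cache
            return self.db[key]


def run_race(n_readers=8, n_writers=4, writes_each=50, reads_each=200):
    store = OrderStore()
    parties = n_readers + n_writers
    barrier = threading.Barrier(parties)         # everyone starts at the same instant
    acked = [0]                                  # highest version a writer has finished
    acked_lock = threading.Lock()
    violations = []                              # (acknowledged, observed) pairs

    def writer():
        barrier.wait(timeout=10)
        for _ in range(writes_each):
            version = store.write(KEY)
            with acked_lock:
                acked[0] = max(acked[0], version)

    def reader():
        barrier.wait(timeout=10)
        for _ in range(reads_each):
            with acked_lock:
                floor = acked[0]                 # writes acknowledged before this read
            seen = store.read(KEY)
            if seen < floor:                     # stale value after a successful write
                violations.append((floor, seen))

    # The pool must be at least as large as the barrier, or the barrier never releases.
    with ThreadPoolExecutor(max_workers=parties) as pool:
        futures = [pool.submit(writer) for _ in range(n_writers)]
        futures += [pool.submit(reader) for _ in range(n_readers)]
        for future in futures:
            future.result(timeout=30)            # re-raises any exception from a thread
    return violations, store


violations, store = run_race()
print("stale reads:", len(violations), "| final version:", store.db[KEY])
```

The invariant checked is the one stated above: a read that starts after a write has been acknowledged must never return an older value. Removing the lock from OrderStore turns the harness into a way of hunting for the race rather than a proof of its absence.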

3. Parametrize for broad coverage

We use pytest’s parametrize to run the same concurrency test across different TTLs and reader/writer thread counts.

@pytest.mark.parametrize("ttl_seconds, n_readers, n_writers", [
    (1, 10, 5),
    (5, 20, 10),
    (30, 50, 20),
])
def test_concurrent_cache_consistency_matrix(ttl_seconds, n_readers, n_writers, redis_client, db_session):
    ...

With this, every commit is forced through a gauntlet that would take a human days to replicate.

The result: that 2 AM panic never came back. And when a colleague asked, “Are you sure this cache design is race-proof?” I could answer with a single command: pytest -k cache_consistency. That’s the kind of confidence automation buys.