50 pytest Tests That Caught Redis Cache Bugs Before Production — 90% Fewer Incidents

#python #programming

At 2:07 a.m., my phone vibrated three times. The alert channel lit up with a red message: “DB connection pool exhausted, API responses timing out after 20 seconds.” I squinted at Grafana — sure enough, dozens of hot keys in Redis had expired at exactly the same moment. Tens of thousands of requests per second slammed straight into MySQL, and the CPU shot to 95%. I knew instantly: another cache avalanche, and another developer had changed the TTL without telling anyone.

At the next morning’s standup, I made a call: every cache behavior must be covered by automated pytest tests. No merge without passing. Three months and 50 test cases later, those tests caught 12 high-risk defects before production. Cache-related incidents dropped by 90%. Here’s how we built that safety net.

The real problem: why cache bugs only show up at 2 a.m.

We run an e-commerce middle platform — products, inventory, pricing all go through Redis. Two patterns keep hitting us:

Cache penetration: a request for data that doesn’t exist in the database (e.g., a product ID of -1) passes straight through Redis to MySQL every time. A malicious actor can flood the database in seconds.
Cache avalanche: a batch of keys shares the same TTL. When that TTL elapses, they all expire together, and peak traffic lands on the database at once.

The team knew the theory. We had defences: caching null values for absent data, adding random jitter to TTLs. But those defences lived in business code — refactoring or feature iteration could silently remove or “optimise” them. Once, a developer thought “null caches waste memory” and deleted the logic. Code review didn’t catch it, and three days later an attacker exploited the hole. Human vigilance can’t guarantee consistent cache policies. An automated gatekeeper can.
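To make the jitter defence concrete, here is a small simulation of why it matters. It is only a sketch: the function name `expiry_histogram` and the key count are illustrative assumptions, and the 0 to 600 second jitter range is borrowed from the production function shown later in this post.

```python
import random
from collections import Counter


def expiry_histogram(n_keys: int, base_ttl: int, jitter: int, seed: int = 0) -> Counter:
    """Count how many keys expire in each second after being cached together."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return Counter(base_ttl + rng.randint(0, jitter) for _ in range(n_keys))


fixed = expiry_histogram(10_000, base_ttl=3600, jitter=0)
jittered = expiry_histogram(10_000, base_ttl=3600, jitter=600)

# Without jitter every key lands in the same second.
print("worst second, fixed TTL:   ", max(fixed.values()))
# With jitter the same keys are spread over up to 601 distinct seconds.
print("worst second, jittered TTL:", max(jittered.values()))
print("distinct expiry seconds:   ", len(jittered))
```

With a fixed TTL the database absorbs all 10,000 reloads in one second; with jitter the worst second sees only a small fraction of them.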

Design: why we didn’t mock, but used real Redis + pytest

We evaluated three approaches:

1. Mock every Redis call: too fake. It misses serialisation quirks, network latency, and pipeline differences.
2. Docker Redis in CI alone: too slow. The 50 test cases took 8 minutes.
3. fakeredis for fast regression plus Docker Redis for critical-path integration: speed and realism.

We picked option 3. Daily development uses fakeredis for unit tests — instant, zero dependencies. The CI merge_request pipeline spins up a real Redis container and runs integration tests, catching behavioural gaps between fakeredis and real Redis. This keeps development fast while catching edge cases.
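One way to wire up that split is to let an environment variable decide which backend a test run gets. The sketch below is an assumption about how this could look, not our exact setup: the `REDIS_URL` variable name and both helper functions are hypothetical. The third-party imports are deferred so that local runs never need the real client.

```python
import os
from typing import Mapping


def redis_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return 'real' when REDIS_URL is set (e.g. by the CI job), else 'fake'."""
    return "real" if env.get("REDIS_URL") else "fake"


def make_client(env: Mapping[str, str] = os.environ):
    """Build a client for the chosen backend; imports are lazy on purpose."""
    if redis_backend(env) == "real":
        import redis  # redis-py, talking to the Redis container in CI

        return redis.Redis.from_url(env["REDIS_URL"], decode_responses=True)

    from fakeredis import FakeRedis  # in-process fake for local development

    return FakeRedis(decode_responses=True)
```

A pytest fixture can then yield `make_client()` and flush it afterwards, so the same test body runs against either backend.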

pytest was the obvious framework. Its fixtures are a natural fit for managing Redis connections, and parametrize makes it easy to cover the three states we care about: cached, not cached, and expired.
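As a rough illustration of the parametrize pattern, the sketch below drives one test body through all three states. Everything in it is hypothetical: `InMemoryCache` is a tiny stand-in so the example runs without any Redis dependency (in a real suite the fixture would yield a FakeRedis client), `lookup` is a generic cache-aside read rather than our business code, and expiry is simulated by deleting the key.

```python
import pytest


class InMemoryCache:
    """Tiny stand-in exposing get/setex/delete; not a Redis client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl, value):
        self._data[key] = value  # the TTL is ignored by this stand-in

    def delete(self, key):
        self._data.pop(key, None)


def lookup(cache, key, load_from_db):
    """Cache-aside read: return the cached value, else load it and cache it."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.setex(key, 300, value)
    return value


@pytest.fixture
def cache():
    return InMemoryCache()


@pytest.mark.parametrize(
    "state, expected_db_calls",
    [("cached", 0), ("not_cached", 1), ("expired", 1)],
)
def test_lookup_states(cache, state, expected_db_calls):
    key = "product:1"
    if state in ("cached", "expired"):
        cache.setex(key, 300, "widget")
    if state == "expired":
        cache.delete(key)  # simulate the TTL having elapsed

    db_calls = []

    def load_from_db(k):
        db_calls.append(k)
        return "widget"

    assert lookup(cache, key, load_from_db) == "widget"
    assert len(db_calls) == expected_db_calls
```

pytest reports each state as its own test case, so a failure names the exact state that regressed.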

Implementation: turning cache logic into reusable pytest tests

Ensuring cache-penetration guards work: requesting a non-existent ID hits the DB once, then always hits the empty cache

import time
import pytest
from fakeredis import FakeRedis
from myapp.cache import get_product  # business function under test

FAKE_PRODUCT_ID = "product:99999"      # ID that doesn’t exist in DB
NULL_CACHE_TTL = 60                    # TTL for null cache entries

@pytest.fixture
def redis_client():
    """Independent Redis instance per test, auto-cleaned"""
    client = FakeRedis(decode_responses=True)
    yield client
    client.flushall()

def test_cache_penetration_guard(redis_client, mocker):
    """
    Verify null caching: query a non-existent product twice.
    The DB must be accessed only once.
    """
    # Mock DB query to return None (not found)
    mock_db = mocker.patch("myapp.cache.query_db", return_value=None)

    # First call: should hit the DB
    result1 = get_product(redis_client, FAKE_PRODUCT_ID)
    assert result1 is None
    assert mock_db.call_count == 1, "First call should hit DB"

    # Second call immediately: must hit the null cache, no DB call
    result2 = get_product(redis_client, FAKE_PRODUCT_ID)
    assert result2 is None
    assert mock_db.call_count == 1, "Second call must not hit DB — null cache active"

    # Verify the null marker was stored in Redis
    cached = redis_client.get(FAKE_PRODUCT_ID)
    assert cached == "NULL"
    # TTL should be within the expected range (FakeRedis supports ttl)
    ttl = redis_client.ttl(FAKE_PRODUCT_ID)
    assert 0 < ttl <= NULL_CACHE_TTL

Note: the null-cache entry stores the special string "NULL" as its value, which keeps it distinct from a real serialised object. It must carry a TTL rather than being permanent; otherwise null entries accumulate and memory slowly fills up.
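For readers who want to see what the test is guarding, here is a minimal sketch of a lookup function that would satisfy it. It is not our production `get_product`: the name `get_product_sketch`, the `PRODUCT_TTL` value, the JSON serialisation, and passing the DB loader as an argument (instead of patching `myapp.cache.query_db`) are all assumptions made to keep the example self-contained. It also assumes a client created with `decode_responses=True`, so `get` returns strings.

```python
import json
from typing import Callable, Optional

NULL_MARKER = "NULL"    # value stored for IDs the database does not know
NULL_CACHE_TTL = 60     # seconds; short, so null entries age out
PRODUCT_TTL = 3600      # hypothetical TTL for real products


def get_product_sketch(
    redis, key: str, query_db: Callable[[str], Optional[dict]]
) -> Optional[dict]:
    """Cache-aside lookup that also caches 'not found' results."""
    cached = redis.get(key)
    if cached is not None:
        # A null marker means the DB was already asked and had nothing.
        return None if cached == NULL_MARKER else json.loads(cached)

    row = query_db(key)
    if row is None:
        # Remember the miss so repeated requests stop at the cache.
        redis.setex(key, NULL_CACHE_TTL, NULL_MARKER)
        return None

    redis.setex(key, PRODUCT_TTL, json.dumps(row))
    return row
```

The important branch is the `row is None` one: delete it and every request for a missing ID reaches the database again, which is exactly what the test detects.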

Preventing cache avalanche: a batch of keys must not all expire in the same second

import random
from typing import List

# Production logic: sets cache with a base TTL + random 0–600 seconds
def set_product_with_random_ttl(redis, key: str, value: str, base_ttl: int = 3600):
    """
    Actual business function: TTL = base_ttl + random.randint(0, 600)
    """
    ttl = base_ttl + random.randint(0, 600)
    redis.setex(key, ttl, value)

def test_avalanche_prevention_ttl_distribution(redis_client):
    """
    Test: set 100 keys. Their TTLs must not all be identical.
    """
    base_ttl = 300
    keys: List[str] = [f"hot:item:{i}" for i in range(100)]

    for key in keys:
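        set_product_with_random_ttl(redis_client, key, "data", base_ttl)

    # Read back the remaining TTL of every key
    ttls = [redis_client.ttl(key) for key in keys]

    # No TTL may exceed base_ttl plus the maximum jitter of 600 seconds
    assert all(0 < ttl <= base_ttl + 600 for ttl in ttls)
    # The TTLs must not all be identical, or the keys would expire together
    assert len(set(ttls)) > 1, "All keys share one TTL: avalanche risk"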

How these tests blocked real bugs

Once the pipeline was in place, the 50 test cases started catching mistakes in pull requests that humans missed. A few real examples:

A refactor that accidentally removed the null cache logic — caught by test_cache_penetration_guard.
A configuration change that set a shared TTL of exactly 10 minutes for all hot keys — caught by the TTL distribution test.
A serialisation change that broke the "NULL" marker, causing false DB misses — the guard test failed immediately.

Each one would have been a production firefighting session at 2 a.m. Instead, the tests failed loud and early, and developers fixed the issues before merging.
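The serialisation case is worth a closer look because it is so easy to miss in review. The snippet below is a hypothetical reconstruction, not the actual change from that pull request: it assumes the marker started being passed through `json.dumps` like every other cached value, which is one plausible way for the comparison to break.

```python
import json

NULL_MARKER = "NULL"


def is_null_marker(cached: str) -> bool:
    """The check a cache-aside reader performs on a cached value."""
    return cached == NULL_MARKER


raw = NULL_MARKER                  # marker stored as a plain string
encoded = json.dumps(NULL_MARKER)  # marker stored after JSON encoding

print(is_null_marker(raw))      # True
print(is_null_marker(encoded))  # False: the stored value is '"NULL"', quotes included
```

Once the check returns False, the reader no longer recognises the entry as a known miss, which matches the false DB misses described above.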

Closing the loop

Automated cache testing isn’t optional when Redis is central to your performance. The combination of fakeredis, pytest fixtures, and real Redis integration gave us a fast, reliable safety net. The 50 test cases now run on every merge request, and cache-related production alerts are rare.

If your team relies on Redis for hot data, take a weekend to write the first 10 tests. Those tests will repay the time tenfold the first time they prevent a midnight page.