kol kol

My Redis Cache Returned Stale Data for 3 Hours — The Bug Nobody Warns You About

Last Tuesday, our monitoring dashboard lit up at 2:17 PM. API response times spiked from 45ms to 1,200ms. Error rates jumped to 8%. The root cause? Not a database crash, not a network partition — a cache invalidation bug so subtle it hid in plain sight.

Here's what happened, how I found it, and the pattern that prevents it forever.

The Symptom

Users reported seeing outdated product prices. Our Redis cache was serving data that was hours old, even though the database had the correct values.

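# This looks innocent, right?
import json

def get_product(product_id):
    cache_key = f"product:v2:{product_id}"  # read path already uses the v2 format
    cached = redis.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # cache hit: decode the stored JSON
    # Cache miss: load from the database and cache the row (note: no expiry)
    data = db.query("SELECT * FROM products WHERE id = ?", product_id)
    redis.set(cache_key, json.dumps(data))
    return data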

The problem isn't the cache miss. It's that the cache never misses when it should: nothing here sets an expiry, so an entry lives until something deletes it.

The Bug: Silent Key Corruption

Our product update function looked like this:

def update_product(product_id, updates):
    db.execute("UPDATE products SET ... WHERE id = ?", product_id, updates)
    # Invalidate cache
    redis.delete(f"product:{product_id}")

Seems fine. But here's what actually happened:

  1. A background worker updates prices at 11:00 AM
  2. The delete() call fails silently: Redis returns 0 because the key it targeted didn't exist
  3. Why? A previous deploy had changed the cache key format on the read path from product:{id} to product:v2:{id}
  4. The update function was never changed to match, so it kept deleting the old-format key

The write path deleted old-format keys while the read path kept serving new-format ones. Three hours of stale data.
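The sequence above can be reproduced in a few lines. This is a minimal sketch, not our production code: FakeRedis is a hypothetical in-memory stand-in for the three Redis calls involved, and the database dict stands in for the products table.

```python
import json


class FakeRedis:
    """In-memory stand-in for the few Redis calls used here (hypothetical helper)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def delete(self, key):
        # Like Redis DEL, report how many keys were removed instead of raising.
        return 1 if self._store.pop(key, None) is not None else 0


cache = FakeRedis()
database = {42: {"id": 42, "price": 9.99}}  # stand-in for the products table


def get_product(product_id):
    key = f"product:v2:{product_id}"  # read path: new key format
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    row = dict(database[product_id])  # copy so callers never alias the "table"
    cache.set(key, json.dumps(row))
    return row


def update_product(product_id, updates):
    database[product_id].update(updates)
    return cache.delete(f"product:{product_id}")  # write path: old key format


first_read = get_product(42)                    # populates product:v2:42
deleted = update_product(42, {"price": 12.99})  # targets product:42, which never existed
stale_read = get_product(42)                    # still served from product:v2:42

print(deleted)              # 0: nothing was removed, and nothing was raised
print(stale_read["price"])  # 9.99, although the database now holds 12.99
```

Nothing in this run raises or logs, which is why the mismatch only surfaced through user reports.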

The Root Cause: Cache Key Drift

This is what I call cache key drift — when cache keys diverge between read and write paths due to:

  • Format changes (v1 → v2)
  • Namespace mismatches (tenant A vs tenant B)
  • Serialization differences (JSON vs MessagePack)
  • Case sensitivity (product:123 vs Product:123)

The worst part? Redis delete() on a non-existent key returns 0, not an error. You won't know it failed until users start complaining.

The Fix: Cache Invalidation with Verification

import logging

def update_product(product_id, updates):
    cache_key = f"product:v2:{product_id}"

    db.execute("UPDATE products SET ... WHERE id = ?", product_id, updates)

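    # Invalidate and verify. delete() returns the number of keys removed;
    # 0 also happens when the product was never cached, so warn rather than fail.
    deleted = redis.delete(cache_key)
    if deleted == 0:
        logging.warning("Cache key %s not found during invalidation", cache_key)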

    # Optional: force-refresh to prevent the next read from getting stale data
    # new_data = db.query("SELECT * FROM products WHERE id = ?", product_id)
    # redis.set(cache_key, json.dumps(new_data), ex=3600)

The Prevention Pattern

Here's what we implemented to make this class of bug much harder to ship:

1. Centralized Key Builder

class CacheKeys:
    @staticmethod
    def product(product_id: int) -> str:
        return f"product:v2:{product_id}"

    @staticmethod
    def user_products(user_id: int) -> str:
        return f"user:{user_id}:products:v2"

Every cache operation (read, write, delete) uses CacheKeys.product(). Change the format once and every call site picks it up.
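To show the builder wired through both paths, here is a rough sketch under a few assumptions: DictCache is a hypothetical dict-backed stand-in for the Redis client, and the database dict replaces the real products table. Only CacheKeys.product() comes from the snippet above.

```python
import json


class CacheKeys:
    """Single place where key formats are defined."""

    @staticmethod
    def product(product_id: int) -> str:
        return f"product:v2:{product_id}"


class DictCache:
    """Dict-backed stand-in for a Redis client (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def delete(self, key):
        return 1 if self._store.pop(key, None) is not None else 0


cache = DictCache()
database = {7: {"id": 7, "price": 9.99}}  # stand-in for the products table


def get_product(product_id):
    key = CacheKeys.product(product_id)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    row = dict(database[product_id])
    cache.set(key, json.dumps(row))
    return row


def update_product(product_id, updates):
    database[product_id].update(updates)
    # Same builder as the read path, so the two cannot disagree on the format.
    return cache.delete(CacheKeys.product(product_id))


get_product(7)                                 # warm the cache
deleted = update_product(7, {"price": 12.99})  # removes the warmed entry
fresh = get_product(7)                         # cache miss, reloads from the database
```

Because neither function spells out the key format itself, a future v3 change touches one line in CacheKeys and nothing else.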

2. Invalidation with TTL Safety Net

# Always set a TTL — even if invalidation fails, stale data expires
redis.set(cache_key, json.dumps(data), ex=3600)  # 1 hour max
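To see why the TTL bounds the damage, consider this small sketch. TTLCache is a hypothetical dict-backed stand-in that mimics SET with ex, and the clock is injected so the example can move time forward without sleeping. It is an illustration of the idea, not how Redis implements expiry.

```python
import json


class TTLCache:
    """Dict-backed stand-in for Redis SET with `ex` (hypothetical helper).

    `clock` is any zero-argument callable returning seconds, so the example
    can advance time without sleeping.
    """

    def __init__(self, clock):
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ex):
        self._store[key] = (value, self._clock() + ex)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like missing keys
            return None
        return value


now = [0.0]  # mutable fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])

cache.set("product:v2:42", json.dumps({"price": 9.99}), ex=3600)
# Suppose invalidation targeted the wrong key, so this entry is never deleted.

now[0] = 3599
before_expiry = cache.get("product:v2:42")  # stale value is still served

now[0] = 3600
after_expiry = cache.get("product:v2:42")   # None: the next read goes to the database
```

With a one-hour TTL, a missed invalidation like ours would have served stale prices for at most about an hour instead of three.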

3. Invalidation Audit Logs

import logging
logger = logging.getLogger(__name__)

# Log every invalidation attempt
def invalidate(key: str):
    result = redis.delete(key)
    logger.info(f"Cache invalidation: {key} deleted={result}")
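Log lines are only useful if someone looks at them, so one possible extension is to tally outcomes as well. The sketch below is an assumption on my part rather than what we run: the outcomes counter, miss_ratio(), and DictCache are all hypothetical names, and a real setup would likely send the counts to a metrics system.

```python
import logging
from collections import Counter

logger = logging.getLogger("cache.invalidation")
outcomes = Counter()  # hypothetical in-process tally; a metrics client could replace it


def invalidate(client, key: str) -> int:
    """Delete `key` via `client` and record whether anything was removed."""
    removed = client.delete(key)
    outcomes["hit" if removed else "miss"] += 1
    logger.info("Cache invalidation: %s deleted=%s", key, removed)
    return removed


def miss_ratio() -> float:
    """Share of invalidations that removed nothing (0.0 when none were recorded)."""
    total = outcomes["hit"] + outcomes["miss"]
    return outcomes["miss"] / total if total else 0.0


class DictCache:
    """Dict-backed stand-in exposing only delete() (hypothetical helper)."""

    def __init__(self, initial):
        self._store = dict(initial)

    def delete(self, key):
        return 1 if self._store.pop(key, None) is not None else 0


client = DictCache({"product:v2:1": "{}"})
invalidate(client, "product:v2:1")  # current format: one key removed
invalidate(client, "product:1")     # old format: nothing removed
invalidate(client, "product:2")     # old format again: nothing removed
ratio = miss_ratio()
```

A single miss is normal, since the entry may never have been cached. A ratio that jumps right after a deploy is the signal worth alerting on.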

4. Integration Test

def test_cache_invalidation():
    # Write (get_product returns a dict, so index by key)
    product = create_product(name="Widget", price=9.99)
    assert get_product(product.id)["price"] == 9.99

    # Update
    update_product(product.id, {"price": 12.99})

    # Cache must reflect the change
    assert get_product(product.id)["price"] == 12.99

The Takeaway

Cache invalidation is the hardest problem in computer science — right up there with naming things and off-by-one errors. But most cache bugs aren't algorithmic. They're organizational: read paths and write paths diverging over time.

The fix isn't smarter algorithms. It's:

  1. One source of truth for key formats
  2. Always-set TTLs as a safety net
  3. Tests that verify invalidation actually works

Your cache will lie to you eventually. Build systems that catch it before your users do.


What's the worst cache bug you've dealt with? Drop it in the comments — I want to know I'm not alone.
