My Redis Cache Returned Stale Data for 3 Hours — The Bug Nobody Warns You About
Last Tuesday, our monitoring dashboard lit up at 2:17 PM. API response times spiked from 45ms to 1,200ms. Error rates jumped to 8%. The root cause? Not a database crash, not a network partition — a cache invalidation bug so subtle it hid in plain sight.
Here's what happened, how I found it, and the pattern that prevents it forever.
The Symptom
Users reported seeing outdated product prices. Our Redis cache was serving data that was hours old, even though the database had the correct values.
# This looks innocent, right?
import json

def get_product(product_id):
    cache_key = f"product:{product_id}"
    data = redis.get(cache_key)
    if data is None:
        # Cache miss: load the row, then cache its JSON form
        row = db.query("SELECT * FROM products WHERE id = ?", product_id)
        data = json.dumps(row)
        redis.set(cache_key, data)
    return json.loads(data)
The problem isn't the cache miss. It's that the cache never misses when it should.
The Bug: Silent Key Corruption
Our product update function looked like this:
def update_product(product_id, updates):
    db.execute("UPDATE products SET ... WHERE id = ?", product_id, updates)
    # Invalidate cache
    redis.delete(f"product:{product_id}")
Seems fine. But here's what actually happened:
- A background worker updates prices at 11:00 AM
- The delete() call fails silently: Redis returns 0 (key didn't exist)
- Why? Because the cache key format was changed in a previous deploy from product:{id} to product:v2:{id}
- The update function was never updated to match
Old cache keys, new format. Three hours of stale data.
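To see the failure in isolation, here is a minimal sketch that reproduces it without a Redis server. FakeRedis is a hypothetical in-memory stand-in that only mimics the return values of get, set, and delete; it is not a real library, and the product ID and price are made up for illustration.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value
        return True

    def delete(self, *keys):
        # Like Redis DEL: return how many of the given keys actually existed.
        removed = 0
        for key in keys:
            if key in self._store:
                del self._store[key]
                removed += 1
        return removed


cache = FakeRedis()

# Read path (after the deploy) caches under the new key format.
cache.set("product:v2:42", '{"price": 9.99}')

# Write path still invalidates the old key format.
deleted = cache.delete("product:42")

print(deleted)                     # 0: nothing was removed, and nothing raised
print(cache.get("product:v2:42"))  # the stale entry is still being served
```

Nothing in this flow raises or logs, which is why the mismatch can sit unnoticed until someone compares the cached value with the database.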
The Root Cause: Cache Key Drift
This is what I call cache key drift — when cache keys diverge between read and write paths due to:
- Format changes (v1 → v2)
- Namespace mismatches (tenant A vs tenant B)
- Serialization differences (JSON vs MessagePack)
- Case sensitivity (product:123 vs Product:123)
The worst part? Redis delete() on a non-existent key returns 0, not an error. You won't know it failed until users start complaining.
The Fix: Cache Invalidation with Verification
import logging

def update_product(product_id, updates):
    cache_key = f"product:v2:{product_id}"
    db.execute("UPDATE products SET ... WHERE id = ?", product_id, updates)
    # Invalidate and verify
    deleted = redis.delete(cache_key)
    if deleted == 0:
        logging.warning(f"Cache key {cache_key} not found during invalidation")
    # Optional: force-refresh to prevent the next read from getting stale data
    # new_data = db.query("SELECT * FROM products WHERE id = ?", product_id)
    # redis.set(cache_key, json.dumps(new_data), ex=3600)
The Prevention Pattern
Here's what we implemented to make this class of bug much harder to ship:
1. Centralized Key Builder
class CacheKeys:
    @staticmethod
    def product(product_id: int) -> str:
        return f"product:v2:{product_id}"

    @staticmethod
    def user_products(user_id: int) -> str:
        return f"user:{user_id}:products:v2"
Every cache operation (read, write, delete) uses CacheKeys.product(). Change the format once, and every call site picks it up.
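As a rough sketch of how that looks end to end, the read and write paths below both go through the same builder. The `cache` dict and `database` dict are hypothetical stand-ins so the example runs without Redis or a real database; only `CacheKeys.product` comes from the class above.

```python
import json


class CacheKeys:
    """Single place where key formats live (mirrors the class above)."""

    @staticmethod
    def product(product_id: int) -> str:
        return f"product:v2:{product_id}"


# Hypothetical stand-ins so the sketch runs without Redis or a database.
cache = {}
database = {42: {"id": 42, "price": 9.99}}


def get_product(product_id: int) -> dict:
    key = CacheKeys.product(product_id)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    row = dict(database[product_id])  # copy so callers cannot mutate the "db"
    cache[key] = json.dumps(row)
    return row


def update_product(product_id: int, updates: dict) -> bool:
    database[product_id].update(updates)
    # Same builder as the read path, so the two cannot drift apart.
    # Returns True if a cached entry was actually removed.
    return cache.pop(CacheKeys.product(product_id), None) is not None
```

Calling `get_product(42)`, then `update_product(42, {"price": 12.99})`, then `get_product(42)` again returns the new price, because the key that was deleted is the key that was written.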
2. Invalidation with TTL Safety Net
# Always set a TTL — even if invalidation fails, stale data expires
redis.set(cache_key, json.dumps(data), ex=3600) # 1 hour max
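To illustrate why the TTL helps even when invalidation misses, here is a small sketch of expiry behavior. `TTLCache` is a hypothetical in-memory class with an injectable clock, written only for this example; a real Redis server handles expiry itself when `ex` is passed to `set`.

```python
import time


class TTLCache:
    """Hypothetical in-memory cache with per-key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ex):
        self._store[key] = (value, self._clock() + ex)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value


# A fake clock lets us fast-forward instead of sleeping for an hour.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])

cache.set("product:v2:42", '{"price": 9.99}', ex=3600)

now[0] = 3599
still_cached = cache.get("product:v2:42")  # entry is still there

now[0] = 3600
after_expiry = cache.get("product:v2:42")  # None: the entry has expired
```

With a TTL in place, a missed invalidation is bounded by the TTL instead of lasting until someone notices. The TTL does not replace invalidation; it only caps how long a mistake can live.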
3. Invalidation Audit Logs
# Log every invalidation attempt
import logging

logger = logging.getLogger(__name__)

def invalidate(key: str):
    result = redis.delete(key)
    logger.info(f"Cache invalidation: {key} deleted={result}")
4. Integration Test
def test_cache_invalidation():
    # Write
    product = create_product(name="Widget", price=9.99)
    # get_product returns the decoded JSON, so index it like a dict
    assert get_product(product.id)["price"] == 9.99
    # Update
    update_product(product.id, {"price": 12.99})
    # Cache must reflect the change
    assert get_product(product.id)["price"] == 12.99
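The test above needs a database and a Redis instance. A narrower check, sketched below under the assumption that the cache client can be injected, asserts directly that the key the read path wrote is the key the write path deleted. `RecordingCache`, `product_key`, `read_product`, and `write_product` are hypothetical names used only for this illustration.

```python
import json


class RecordingCache:
    """Hypothetical in-memory cache that remembers which keys were deleted."""

    def __init__(self):
        self.store = {}
        self.deleted_keys = []

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ex=None):
        self.store[key] = value

    def delete(self, key):
        self.deleted_keys.append(key)
        return 1 if self.store.pop(key, None) is not None else 0


def product_key(product_id):
    return f"product:v2:{product_id}"


def read_product(cache, db, product_id):
    key = product_key(product_id)
    cached = cache.get(key)
    if cached is None:
        cached = json.dumps(db[product_id])
        cache.set(key, cached, ex=3600)
    return json.loads(cached)


def write_product(cache, db, product_id, updates):
    db[product_id] = {**db[product_id], **updates}
    return cache.delete(product_key(product_id))


def test_update_removes_the_key_the_read_path_wrote():
    cache, db = RecordingCache(), {7: {"price": 9.99}}
    read_product(cache, db, 7)  # populates the cache
    keys_written = set(cache.store)

    removed = write_product(cache, db, 7, {"price": 12.99})

    assert removed == 1  # something was actually deleted
    assert set(cache.deleted_keys) == keys_written
    assert read_product(cache, db, 7)["price"] == 12.99
```

If the read and write paths ever build different keys, `removed` comes back as 0 and the second assertion fails, so the drift shows up in CI rather than in production.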
The Takeaway
Cache invalidation is one of the hard problems in computer science, right up there with naming things and off-by-one errors. But most cache bugs aren't algorithmic. They're organizational: read paths and write paths diverging over time.
The fix isn't smarter algorithms. It's:
- One source of truth for key formats
- Always-set TTLs as a safety net
- Tests that verify invalidation actually works
Your cache will lie to you eventually. Build systems that catch it before your users do.
What's the worst cache bug you've dealt with? Drop it in the comments — I want to know I'm not alone.