Mehmet TURAÇ

Posted on May 30

Great Stack to Doesn't Work #3 — Redis: "99% Cache Hit Ratio, System Down"

#redis #backend #devops #discuss

A survival guide for when everything goes wrong in production.

Your Redis dashboard looks perfect. Hit ratio: 99.2%. Latency: sub-millisecond. Memory usage: 60% of available. Every metric says healthy.

Then at 2:47 PM, your API starts returning 500s. Response times spike to 30 seconds. Users can't log in. The dashboard still shows 99% hit ratio because the cache is working — it's serving cached errors to everyone equally fast.

Redis is doing exactly what you told it to do. The problem is what you told it to do.

Why Single-Threaded Is Fast (Until It Isn't)

Redis processes commands on a single thread. No locks. No context switching. No synchronization overhead. One CPU core, fully utilized, can handle 100K+ operations per second because it never waits for another thread to release a lock.

The event loop model (similar to Node.js) multiplexes thousands of client connections on a single thread using non-blocking I/O. Read a request, process it, write the response, move to the next. When your commands are simple — GET, SET, INCR — each one takes microseconds.

The trap: slow commands block everything. KEYS * on a million-key database? That's a full keyspace scan on the main thread. While it runs, every other client waits. SORT on a large set? Same. LRANGE 0 -1 on a list with 10 million elements? Same.

Redis 6.0 introduced I/O threading (io-threads config) for reading and writing network data on multiple threads, but command execution is still single-threaded. Redis 7.0 improved this further, but the fundamental model hasn't changed. Long-running commands on the main thread stall everything.

Rules:

Never use KEYS in production. Use SCAN instead — it's cursor-based and returns results incrementally.
Watch out for O(N) commands such as LRANGE, SMEMBERS, and HGETALL on million-element structures.
Use SLOWLOG to find commands that are blocking the event loop.
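
The first rule can be sketched in Python. This is a minimal sketch, assuming a redis-py style client whose scan() returns a (next_cursor, keys) pair; the function name scan_keys and the batch_hint parameter are illustrative, not from the original post.

```python
from typing import Iterator


def scan_keys(client, pattern: str = "*", batch_hint: int = 500) -> Iterator:
    """Yield keys matching `pattern` in small batches instead of one KEYS call.

    `client` is assumed to be a redis-py style client. COUNT is only a hint
    for how much work the server does per call, not an exact page size.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_hint)
        yield from keys
        if cursor == 0:  # the server returns cursor 0 when the iteration is complete
            break
```

Each SCAN call does a bounded amount of work, so other clients get served between calls. SCAN can return the same key more than once, so deduplicate if that matters for your use case.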

Pipelining: The Easiest 10x You'll Ever Get

Every Redis command involves a network round trip: send request, wait for response. If you're executing 100 commands sequentially, that's 100 round trips. At 0.5ms per round trip, you're waiting 50ms for what should take 1ms of actual processing.

Pipelining batches commands into a single network write and reads all responses at once.

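import redis

client = redis.Redis()  # assumes a local Redis at localhost:6379
# transaction=False: plain pipelining, no MULTI/EXEC wrapper
pipe = client.pipeline(transaction=False)
for user_id in user_ids:
    pipe.get(f"user:{user_id}:profile")
results = pipe.execute()  # one round trip; results come back in command order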

Instead of 100 round trips, you make 1. The server processes all commands in sequence (it's single-threaded, remember) and buffers the responses. Your client sends the batch, waits once, and gets everything back.

Pipelining doesn't reduce server-side processing time — each command still runs individually. It eliminates network latency, which is almost always the dominant cost for simple commands.

The catch: if one command in the pipeline fails, the others still execute. Pipelining is not transactional. If you need atomicity, use MULTI/EXEC or Lua scripts.

Lua Scripting: Atomic Operations Without the Complexity

Redis evaluates Lua scripts atomically. While a script runs, nothing else executes. This makes Lua scripts the right tool for read-modify-write operations that would otherwise need distributed locking.

Classic example — rate limiting:

-- KEYS[1] = rate limit key
-- ARGV[1] = max requests
-- ARGV[2] = window in seconds
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0  -- rate limited
end
return 1  -- allowed

This increments a counter and sets expiry atomically. No race condition between INCR and EXPIRE. No chance of two requests both reading "0" and both thinking they're first.

Use EVALSHA instead of EVAL in production. EVALSHA references the script by its SHA1 hash, avoiding sending the full script text with every call. Load the script once with SCRIPT LOAD, then call it by hash.
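
From Python, one way to get the EVALSHA behavior is redis-py's register_script(), which returns a callable that invokes the script by hash and loads it when the server does not have it cached. A minimal sketch, assuming a redis-py client; make_rate_limiter and is_allowed are hypothetical names, and the script body is the rate limiter shown earlier.

```python
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
end
return 1
"""


def make_rate_limiter(client, max_requests: int, window_seconds: int):
    """Return a function that reports whether a request for `key` is allowed.

    `client` is assumed to be a redis-py client. The script text is registered
    once here; each call afterwards refers to it by its SHA1 hash.
    """
    script = client.register_script(RATE_LIMIT_LUA)

    def is_allowed(key: str) -> bool:
        # KEYS[1] = key, ARGV[1] = max requests, ARGV[2] = window in seconds
        return script(keys=[key], args=[max_requests, window_seconds]) == 1

    return is_allowed
```

Usage would look like allow = make_rate_limiter(client, 100, 60) followed by allow("rate:ip:203.0.113.7") on each request; the key format is only an example.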

Caveat: Lua scripts block the main thread for their entire duration. Keep them short. A script that queries 10 keys is fine. A script that iterates over 100,000 keys is a production incident waiting to happen.

Pub/Sub vs Streams: Two Very Different Tools

Pub/Sub is fire-and-forget. Publisher sends a message, all connected subscribers receive it instantly. If a subscriber disconnects and reconnects, it misses everything published while it was gone. No message persistence. No consumer groups. No acknowledgment.

Use Pub/Sub for: real-time notifications where missing a message is acceptable. Chat typing indicators. Cache invalidation signals. Dashboard live updates.

Streams (introduced in Redis 5.0) are persistent, append-only logs with consumer groups. Think of them as "Kafka Lite inside Redis."

XADD orders * user_id 42 amount 99.99
XREADGROUP GROUP payment_processors consumer_1 COUNT 10 BLOCK 5000 STREAMS orders >
XACK orders payment_processors 1234567890-0

Streams persist messages. Consumer groups track which consumer has read what. Unacknowledged messages can be claimed by other consumers if one dies. You get at-least-once delivery semantics.
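
The read, process, acknowledge cycle behind at-least-once delivery can be sketched in Python. This assumes a redis-py style client using the default (RESP2) reply shape and a consumer group that already exists; process_batch and handler are hypothetical names.

```python
def process_batch(client, handler, stream="orders", group="payment_processors",
                  consumer="consumer_1", count=10, block_ms=5000) -> int:
    """Read one batch for this consumer and ack only what `handler` completes.

    Returns the number of acknowledged messages.
    """
    acked = 0
    # ">" asks for messages never delivered to any consumer in the group.
    response = client.xreadgroup(group, consumer, {stream: ">"},
                                 count=count, block=block_ms)
    for _stream_name, messages in response or []:
        for message_id, fields in messages:
            try:
                handler(fields)
            except Exception:
                # No XACK: the message stays pending and can be retried
                # or claimed by another consumer later.
                continue
            client.xack(stream, group, message_id)
            acked += 1
    return acked
```

Because a message is acknowledged only after the handler succeeds, a crash between processing and XACK means the message is delivered again. Handlers should therefore be idempotent.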

Use Streams for: job queues, event sourcing, lightweight message processing where you don't want to deploy Kafka but need more than Pub/Sub.

Don't use Streams to replace Kafka at scale. Redis Streams are bounded by single-node memory. Kafka is designed for multi-broker distributed throughput. Different tools, different scale.

Memory Eviction: The Policy That Saves or Kills You

When Redis hits maxmemory, it needs to decide what to delete. The eviction policy determines what goes.

noeviction: Redis returns errors for write commands. Reads still work. Use this when you absolutely cannot lose data and you'd rather fail loudly than have keys silently disappear. Common for session stores.

allkeys-lru: Evicts the least recently used key across all keys. The safest general-purpose policy. If you're using Redis purely as a cache, this is your default.

volatile-lru: Only evicts keys with a TTL set. Keys without TTL are never evicted. Use this when you have a mix of permanent data (config, feature flags) and cache data (user sessions, query results). The permanent data stays; the cache data gets evicted under pressure. If no key has a TTL, there is nothing to evict and writes fail just as they do under noeviction.

allkeys-lfu (Least Frequently Used): Evicts keys accessed least often, regardless of recency. Better than LRU when you have a mix of frequently-accessed hot data and occasionally-accessed warm data. A key accessed 1,000 times yesterday but not today won't be evicted as quickly as with LRU.

The disaster scenario: noeviction on a cache. Redis fills up. Every write fails. Your application treats the write failure as a cache miss and hits the database directly. Now your database is handling the full load that Redis was supposed to absorb. The database slows down. API latency spikes. Cascading failure.

Monitor evicted_keys in Redis INFO stats. A sudden spike means you're running out of memory and eviction is kicking in aggressively. Either add memory or investigate why your keyspace is growing.
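
Since evicted_keys is a counter that only grows while the server is up, a spike shows up as a change in rate between two samples. A small sketch of that calculation, assuming each sample is a dict with an integer evicted_keys field (for example what a redis-py client returns from info("stats")); the function name is illustrative.

```python
def evictions_per_second(previous_stats: dict, current_stats: dict,
                         interval_seconds: float) -> float:
    """Eviction rate between two INFO stats samples taken interval_seconds apart."""
    delta = current_stats["evicted_keys"] - previous_stats["evicted_keys"]
    if delta < 0:
        # The counter went backwards: the server restarted (or stats were
        # reset) between samples, so count from zero.
        delta = current_stats["evicted_keys"]
    return delta / interval_seconds
```

Alerting on this rate rather than on the raw counter avoids false alarms on long-running instances where the total is large but stable.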

Persistence: RDB vs AOF vs "I Thought Redis Was Just a Cache"

Many teams deploy Redis without persistence, treating it as a pure cache. Then the server restarts and 6 hours of cached data vanishes. Cold cache stampede: every request hits the database simultaneously.

RDB (snapshotting): Redis forks the process and writes the entire dataset to disk at intervals. Fast restores. Compact files. But you can lose data between snapshots — if Redis saves every 5 minutes and crashes 4 minutes after the last save, those 4 minutes are gone.

AOF (Append Only File): Redis logs every write operation. Three sync modes: always (fsync every write — safe but slow), everysec (fsync every second — good balance), no (let the OS decide — fastest but risky). On restart, Redis replays the log to rebuild state.

RDB + AOF: Use both. RDB for fast restores and backups. AOF for durability. On restart, Redis prefers AOF because it's more complete.
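
As a rough sketch, the combined setup could be applied at runtime through CONFIG SET. This assumes a redis-py style client and a server that allows CONFIG commands (many managed services do not); the snapshot interval "300 1" is an example value, not a recommendation from the post.

```python
PERSISTENCE_SETTINGS = {
    "appendonly": "yes",        # turn on the append-only file
    "appendfsync": "everysec",  # fsync once per second
    "save": "300 1",            # RDB snapshot if >= 1 key changed in 300 s (example)
}


def apply_persistence_settings(client, settings=None) -> None:
    """Apply each setting with CONFIG SET on the running server."""
    for name, value in (settings or PERSISTENCE_SETTINGS).items():
        client.config_set(name, value)
```

CONFIG SET changes only the running server. To survive a restart, the same values need to be in redis.conf, either written by hand or via CONFIG REWRITE.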

The real question: what happens to your system when Redis restarts with an empty cache? If the answer is "everything melts," you need persistence. If the answer is "things are slow for a few minutes while the cache warms up," maybe you don't — but you should still have RDB snapshots for disaster recovery.

The Thundering Herd: Cache Invalidation's True Face

You cache a popular product page for 5 minutes. 10,000 users are viewing it. The TTL expires. All 10,000 requests simultaneously hit the database for the same data. The database buckles under the sudden spike.

This is the thundering herd problem, and it's not theoretical. Any high-traffic system with TTL-based caching will encounter it.

Solutions:

Staggered TTLs. Add random jitter to expiration times: TTL = base_ttl + random(0, 60). Keys expire at different times, spreading the database load.
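
The jitter formula translates directly to Python. A minimal sketch; jittered_ttl and the rng parameter are illustrative names, and the 60-second jitter mirrors the formula above.

```python
import random


def jittered_ttl(base_ttl_seconds: int, max_jitter_seconds: int = 60, rng=random) -> int:
    """Return base TTL plus a random 0..max_jitter_seconds offset (inclusive)."""
    return base_ttl_seconds + rng.randint(0, max_jitter_seconds)
```

With a redis-py style client this would be used as client.set(key, value, ex=jittered_ttl(300)), so keys written in the same burst do not all expire in the same second.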

Lock-based refresh. When a key expires, only one request acquires a lock and rebuilds the cache. All others wait or serve stale data. Implementation with Lua:

-- Returns {'hit', value}, {'rebuild'} (caller holds the lock), or {'wait'}
local value = redis.call('GET', KEYS[1])
if value then return {'hit', value} end
local lock = redis.call('SET', KEYS[1] .. ':lock', 'locked', 'NX', 'EX', 5)
if lock then
    return {'rebuild'}  -- this caller rebuilds the cache
else
    return {'wait'}     -- another caller is rebuilding: retry shortly or serve stale
end
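
The caller's side of lock-based refresh might look like the following in Python. This sketch uses plain GET and SET NX EX calls rather than the Lua script, and assumes a redis-py style client; get_or_rebuild, rebuild, and the retry parameters are hypothetical.

```python
import time


def get_or_rebuild(client, key, rebuild, ttl_seconds=300, lock_seconds=5,
                   retries=10, retry_delay=0.05):
    """Return the cached value, letting only one caller rebuild it on a miss."""
    value = client.get(key)
    if value is not None:
        return value

    lock_key = f"{key}:lock"
    # SET NX EX: only one caller gets the lock, and it expires on its own
    # if that caller dies before releasing it.
    if client.set(lock_key, "locked", nx=True, ex=lock_seconds):
        try:
            value = rebuild()
            client.set(key, value, ex=ttl_seconds)
            return value
        finally:
            client.delete(lock_key)

    # Another caller holds the lock: poll briefly for the fresh value.
    for _ in range(retries):
        time.sleep(retry_delay)
        value = client.get(key)
        if value is not None:
            return value
    return rebuild()  # fallback: the lock holder was too slow or failed
```

One simplification to be aware of: if the rebuild takes longer than lock_seconds, the final delete can remove a lock that a different caller has since acquired. Storing a unique token in the lock and checking it before deleting closes that gap.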

Early refresh. Refresh the cache before it expires. If TTL is 5 minutes, start a background refresh at 4 minutes. The cache never actually expires under normal operation.
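
The decision of when to trigger the background refresh can be sketched as a check on the key's remaining TTL. This assumes the remaining time comes from the TTL command (for example client.ttl(key) in redis-py); the function name and the 240-second default are illustrative, matching the 4-of-5-minutes example.

```python
def needs_early_refresh(remaining_ttl_seconds: int, full_ttl_seconds: int = 300,
                        refresh_after_seconds: int = 240) -> bool:
    """True once the key has lived for at least refresh_after_seconds.

    TTL returns -2 when the key does not exist and -1 when it has no expiry.
    """
    if remaining_ttl_seconds == -2:
        return True   # already gone: rebuild now
    if remaining_ttl_seconds == -1:
        return False  # never expires, nothing to refresh ahead of
    elapsed = full_ttl_seconds - remaining_ttl_seconds
    return elapsed >= refresh_after_seconds
```

A background worker or the request path can call this and enqueue a refresh when it returns True, while continuing to serve the still-valid cached value.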

How We Crashed Production 3 Times

Crash #1: Hot key. A flash sale product page was cached under a single key. 500,000 requests per second hit that one key. Redis can handle the throughput, but the single-threaded nature means this one key's reads were queuing behind each other. Latency spiked to 50ms — fine for one request, fatal for the 499,999 behind it.

Fix: cache the hot key locally in-process with a short TTL (1-2 seconds). Application memory serves 99% of requests, Redis serves the refresh.
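
A minimal sketch of such an in-process layer, assuming a loader function that fetches the value from Redis on a local miss. LocalTTLCache, loader, and clock are hypothetical names, and this version is not thread-safe.

```python
import time


class LocalTTLCache:
    """Tiny in-process cache that sits in front of Redis for hot keys."""

    def __init__(self, loader, ttl_seconds=1.0, clock=time.monotonic):
        self._loader = loader      # called on a local miss, e.g. a Redis GET
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # served from application memory
        value = self._loader(key)  # at most one Redis read per TTL window
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 1-second TTL, each application instance reads the hot key from Redis roughly once per second regardless of request volume. The trade-off is that clients can see data up to one TTL window old.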

Crash #2: Serialization bomb. Someone cached a full user object including activity history — 50MB serialized. Every time the app read that key, Redis had to send 50MB over the network. The single thread was blocked for 200ms per read. At 100 concurrent reads, the event loop was saturated.

Fix: cache only what you need. User profile: 2KB. User activity: separate key, paginated, never cached as a monolith.

Crash #3: Cache invalidation race. Service A updates a user record in the database and deletes the cache key. Service B reads the cache, gets a miss, reads the stale data from a read replica (replication lag), and writes the stale data back to cache. Now the cache has stale data and it won't refresh until the TTL expires.

Fix: don't write to cache after a miss if the data might be stale. Use read-from-primary for cache rebuilds, or use a TTL short enough that stale data self-corrects quickly.

When Redis, When Memcached?

This is a shorter decision than people make it.

Redis when: you need data structures beyond key-value (lists, sets, sorted sets, hashes, streams), persistence, pub/sub, Lua scripting, cluster mode, or any feature beyond simple caching.

Memcached when: you need a simple, multi-threaded cache with predictable memory allocation and you're caching large blobs (images, rendered HTML). Memcached's multi-threaded architecture handles large-value workloads more efficiently than Redis's single-threaded model.

In practice: Redis, almost always. The feature set is so much broader that the rare cases where Memcached wins are outweighed by Redis's versatility. The exception is if you're caching very large objects at very high throughput and you're hitting Redis's single-threaded bottleneck. Then Memcached's multi-threaded reads genuinely help.

Key Takeaways

Redis is fast by default and slow by mistake. The mistakes are predictable: slow commands on the main thread, missing pipelining, wrong eviction policy, no persistence on a critical cache, and hot keys.

Monitor INFO commandstats to see which commands are running and how long they take. Monitor SLOWLOG to find the ones that are too slow. Monitor evicted_keys to know when you're running out of memory.
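
A small sketch of ranking commands by average cost. It assumes the input is a dict shaped like the parsed INFO commandstats section, with keys such as cmdstat_get mapping to calls and usec_per_call fields (the shape redis-py is assumed to return from info("commandstats")); the function name is illustrative.

```python
def slowest_commands(commandstats: dict, top_n: int = 5) -> list:
    """Return (command, usec_per_call, calls) tuples, slowest average first."""
    rows = [
        (name.replace("cmdstat_", "", 1), stats["usec_per_call"], stats["calls"])
        for name, stats in commandstats.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]
```

A command like KEYS or a full-range LRANGE near the top of this list, even with few calls, is a candidate for the event-loop stalls described earlier.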

The 99% hit ratio dashboard doesn't mean your cache is healthy. It means your cache is serving something fast. Whether that something is correct, fresh, and useful — that's a different question.

Over to You

What's your worst Redis incident? Hot key? Thundering herd? Wrong eviction policy? The cache invalidation race condition stories are always the best.

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.


This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.

Top comments (1)

Self-Correcting Systems • Jun 1

Crash #3 is the most structurally interesting failure in the series because the cache
isn't broken; it's working exactly as designed. The failure happened upstream: nothing
in the write path told the cache "this data is superseded." Your read-from-primary fix
is essentially adding an authority check before serving. Instead of "data exists →
serve it," the system has to ask "data exists AND came from an authoritative source →
serve it."

I've been running into the same structural gap in AI agent memory systems: retrieval
layers that optimize for query relevance but have no mechanism to distinguish a current
instruction from a superseded one. A retriever finds the right memory, reports a hit,
and the action layer acts confidently on something that's been overridden. Your 99% hit
ratio headline is the exact same failure mode: the metric is real, it's just measuring
the wrong thing.

The staggered TTL fix in the thundering herd section is underused. Will be sharing that
one.