When Our Redis Cluster Blew Up Because We Ignored a 15-Line Config File

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We were building a treasure-hunt engine for an AR festival in Lisbon. Users would scan NFC tags around the city, and our backend would stream those events into a Flink job that joined them with game state, computed scores in real time, and pushed live leaderboards to a Node.js frontend. The hot path was the lookup: for each tag event, we had to fetch the user's current score, check if they'd already unlocked the checkpoint, then write the new score back, all in under 300 ms P99.

By week three the Flink job was chewing through 1.2 M events/sec, and the bottleneck shifted from CPU to the Postgres read path. We needed something faster. Redis was the obvious hammer, so we spun up a 9-node cluster on AWS Memory-optimized R6g.4xlarge instances with 123 GB RAM each, and set maxmemory 100gb because the vendor docs said keep 20 % headroom. The rest of the config stayed at defaults: noeviction policy, 30-second background save (RDB), and client-side timeouts at 1 second.

First weekend traffic hit 1.8 M events/sec. The Redis cluster's RSS climbed to 118 GB, then three nodes crashed within 90 seconds. Clients started timing out with NOAUTH Authentication required, which was a lie; we were authenticating. The real failure was memory pressure: we'd filled memory to 98 %, and noeviction doesn't mean Redis copes gracefully when it is full. It means Redis never evicts keys to make room and rejects writes with an OOM error instead, which crashes the client library first.
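
A toy model may make the noeviction behaviour concrete: a store at its memory limit refuses new writes with an error instead of dropping keys, while reads keep working. The `ToyStore` class, its byte accounting, and the `OutOfMemoryError` name are hypothetical and written only for illustration; real Redis checks its own memory accounting and returns an OOM error reply to write commands.

```python
from typing import Optional


class OutOfMemoryError(Exception):
    """Stands in for the OOM error reply Redis sends to write commands."""


class ToyStore:
    """Hypothetical in-memory store that mimics maxmemory + noeviction."""

    def __init__(self, maxmemory_bytes: int) -> None:
        self.maxmemory_bytes = maxmemory_bytes
        self.data: dict = {}

    def used_bytes(self) -> int:
        # Crude accounting (an assumption): key length plus value length.
        return sum(len(k) + len(v) for k, v in self.data.items())

    def set(self, key: str, value: bytes) -> None:
        old = self.data.get(key)
        old_size = len(key) + len(old) if old is not None else 0
        projected = self.used_bytes() - old_size + len(key) + len(value)
        if projected > self.maxmemory_bytes:
            # noeviction: nothing is dropped to make room; the write fails.
            raise OutOfMemoryError("write rejected: used memory > maxmemory")
        self.data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Reads are still served when the store is full.
        return self.data.get(key)
```

A client that does not handle this error on every write path will surface it as a crash or a retry storm, which matches what the incident looked like from the outside.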

What We Tried First (And Why It Failed)

Naively scaling the cluster was the first reflex. We spun up six more nodes and rebalanced the cluster's 16384 hash slots across them; the slot count is fixed in Redis Cluster, so adding nodes always means moving slots. Traffic temporarily recovered, but the tail latency jumped from 15 ms to 800 ms during rebalancing because Flink's Redis sink was buffering writes in a single TCP connection per task manager. Users in the field started complaining their checkpoints weren't unlocking in real time.

Next, we tried increasing client-output-buffer-limit to 2 MB per client, thinking the issue was backpressure. It only masked the problem; the next traffic spike saturated the cluster's outbound bandwidth, and the Node.js frontend started rejecting connections with 502s.

The last roll of the dice was moving to a managed Redis Enterprise cluster on GCP. The managed service gave us auto-tiered storage, but the SLA only covered availability, not P99 latency spikes during GC. We still saw 200 ms spikes every 60 seconds when the managed cluster's AOF rewrite kicked in. Our Flink checkpointing window was 30 seconds, so late-arriving events from the NFC scanners started missing their game state windows, and scores froze at stale values.

The Architecture Decision

We finally admitted the caching model was the rotten core. The treasure-hunt engine's access pattern was 92 % reads, 8 % writes, and the writes were always on the same keys (user scores). That's a textbook case for a write-through cache with an LRU eviction policy, not a no-eviction policy.
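
As a rough illustration of that pattern, here is a minimal write-through cache with LRU eviction. The `WriteThroughLRU` class is hypothetical, a plain dict stands in for the database, and capacity is counted in entries instead of bytes. Redis approximates LRU by sampling keys, whereas this sketch evicts the exact least recently used entry.

```python
from collections import OrderedDict
from typing import Optional


class WriteThroughLRU:
    """Hypothetical write-through cache: every write goes to the backing
    store first, then to the cache; the least recently used entry is
    evicted once the cache exceeds its capacity."""

    def __init__(self, backing: dict, capacity: int) -> None:
        self.backing = backing              # stands in for the database
        self.capacity = capacity            # max number of cached entries
        self.cache: OrderedDict = OrderedDict()

    def _touch(self, key: str, value: int) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)         # mark as most recently used
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def write(self, key: str, value: int) -> None:
        self.backing[key] = value           # authoritative write first
        self._touch(key, value)             # then refresh the cache

    def read(self, key: str) -> Optional[int]:
        if key in self.cache:
            self.cache.move_to_end(key)     # a hit also counts as a use
            return self.cache[key]
        value = self.backing.get(key)       # miss: fall back to the store
        if value is not None:
            self._touch(key, value)
        return value
```

With a read-heavy workload that keeps rewriting the same keys, the hot scores stay near the recently used end and never become eviction candidates, while a full cache degrades into extra database reads instead of rejected writes.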

We rebuilt the state layer as a two-tier design:

Hot tier: A Redis cluster sized to 40 % of peak memory (45 GB) with maxmemory 45gb, maxmemory-policy allkeys-lru, and a 30-second TTL on score keys. We turned off RDB and AOF for this tier; durability wasn't required because the authoritative source was Postgres.
Cold tier: A single Postgres Aurora Serverless v2 instance with read replicas in three AZs. Flink now writes scores directly to Postgres and then updates the Redis hot tier in the same write path, using a Lua script so the Redis update is atomic and avoids race conditions. The Lua script is the only Redis-side code we maintain; the client calls eval with a 5-ms timeout and retries once on failure.
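
The ordering of that write path can be sketched as follows: write to the authoritative store first, then refresh the cache, retrying the cache update once if it times out. The `write_score` function and its `write_db` and `update_cache` callables are hypothetical placeholders, not the actual Flink operator or Lua script; the sketch also assumes that a failed cache update is tolerable because the TTL on score keys bounds how long a stale value can live.

```python
from typing import Callable


def write_score(
    user_id: str,
    score: int,
    write_db: Callable[[str, int], None],
    update_cache: Callable[[str, int], None],
    cache_retries: int = 1,
) -> bool:
    """Write to the authoritative store, then refresh the cache.

    Returns True if the cache was refreshed and False if every cache
    attempt timed out. A database failure propagates to the caller.
    """
    write_db(user_id, score)               # authoritative write; may raise
    for _ in range(1 + cache_retries):     # first attempt plus retries
        try:
            update_cache(user_id, score)
            return True
        except TimeoutError:
            continue                       # assumed transient; try again
    # The cache entry stays stale until its TTL expires or the next write.
    return False
```

Writing the database first means a cache failure can only leave the cache behind the source of truth, never ahead of it.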

We replaced the Flink Redis sink with a custom RedisAsyncSink that uses connection pooling (HikariCP-style) and a 100-connection pool per task manager. The sink batches writes in 500-record chunks and flushes every 100 ms, which reduced tail latency from 800 ms to 35 ms under 1.8 M events/sec.
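
The size-or-time flush rule is the core of that sink, and a minimal sketch of it looks like this. `BatchingSink`, `flush_fn`, and the injectable `clock` are hypothetical names; the real sink would also need the connection pool, thread safety, and Flink checkpoint integration, none of which are shown here.

```python
import time
from typing import Any, Callable, List, Optional


class BatchingSink:
    """Hypothetical buffer that flushes when it holds max_records items or
    when the oldest buffered record has waited max_delay_s seconds."""

    def __init__(
        self,
        flush_fn: Callable[[List[Any]], None],
        max_records: int = 500,
        max_delay_s: float = 0.1,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._max_records = max_records
        self._max_delay_s = max_delay_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._oldest: Optional[float] = None  # arrival time of oldest record

    def add(self, record: Any) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()                      # size-based flush

    def poll(self) -> None:
        """Call periodically (for example from a timer) for the time-based flush."""
        if self._buffer and self._clock() - self._oldest >= self._max_delay_s:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._oldest = None
        self._flush_fn(batch)                 # e.g. one pipelined write
```

The size limit bounds the work per round trip under load, while the time limit bounds how long a lone event can sit in the buffer when traffic is light.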

To avoid OOM during traffic spikes, we set the OS-level redis-server jemalloc arenas to 4, capped RSS with memory-soft-limit at 90 % of the instance size, and added a Prometheus alert that pages us when RSS > 85 % for more than 60 seconds.
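
The alert condition, RSS above 85 % for more than 60 seconds, is a sustained-threshold check. The sketch below expresses that logic in plain Python; `should_page` and its sample format are hypothetical, and in practice the rule would live in the Prometheus alerting configuration, not in application code.

```python
from typing import Iterable, Optional, Tuple


def should_page(
    samples: Iterable[Tuple[float, float]],
    limit_bytes: float,
    threshold: float = 0.85,
    hold_s: float = 60.0,
) -> bool:
    """Return True if RSS stayed above threshold * limit_bytes for more
    than hold_s seconds without dipping back under.

    samples: (timestamp_seconds, rss_bytes) pairs in time order.
    """
    breach_start: Optional[float] = None
    for ts, rss in samples:
        if rss > threshold * limit_bytes:
            if breach_start is None:
                breach_start = ts          # breach begins at this sample
            if ts - breach_start > hold_s:
                return True                # sustained long enough to page
        else:
            breach_start = None            # a dip resets the timer
    return False
```

Requiring the breach to persist keeps short spikes, such as a burst of scans at one checkpoint, from paging anyone.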

Most importantly, we stopped treating the Redis config as a comment block. We pinned every parameter in our Ansible playbooks, versioned the config alongside the application code, and added a CI step that runs redis-check-config against the playbook before merging.
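
A CI gate of this kind can be approximated in a few lines. The sketch below is not the tool named above; it is a hypothetical stand-in that parses redis.conf-style text (one directive per line, `#` for comments) and reports any pinned parameter that is missing or different. The function names and the pinned values in the usage note are illustrative assumptions.

```python
from typing import Dict, List


def parse_redis_conf(text: str) -> Dict[str, str]:
    """Parse 'directive value...' lines, skipping blanks and comments.

    Simplification: if a directive repeats, the last occurrence wins.
    """
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[name] = value
    return settings


def check_pinned(text: str, pinned: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means the config passes."""
    actual = parse_redis_conf(text)
    problems: List[str] = []
    for name, expected in pinned.items():
        if name not in actual:
            problems.append(f"{name}: missing (expected {expected!r})")
        elif actual[name] != expected:
            problems.append(f"{name}: got {actual[name]!r}, expected {expected!r}")
    return problems
```

Calling `check_pinned(conf_text, {"maxmemory": "45gb", "maxmemory-policy": "allkeys-lru"})` and failing the build on a non-empty result would catch a config that silently fell back to defaults.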

What The Numbers Said After

After rolling out the two-tier design, the tail latency (P99) for checkpoint unlocks dropped from 800 ms to 35 ms. The Redis cluster's memory usage stabilized at 67 % on average, with peaks at 82 % during the AR game finale when 20,000 users were scanning simultaneously. The Aurora Serverless v2 instance's CPU utilization stayed under 35 %, and the read replicas handled 60 % of the traffic during peak hours, reducing Postgres primary load by 42 %.

Error rates dropped to zero for Redis-related failures; we no longer saw NOAUTH or OOM crashes. The Lua script retry loop added 0.8 ms per write on hot keys, but the overall throughput increased because the connection pool eliminated the