The Problem We Were Actually Solving
Early in 2024 we launched a real-time treasure hunt engine for a platform with 300k concurrent users. The service processed GPS pings, decoded proximity events, and pushed match results to mobile clients in under 300 ms. The spec said cache everything; the reality said cache everything except the part that broke first.
What killed us was not the Redis memory bloat—though that showed up in p99 memory > 2 GB. It was the operator confusion. Every Friday night, when our synthetic load tests hit 45 QPS, the logs filled with ERR Protocol error, server connection lost. Our SREs diagnosed it as a connection storm from recycled connections, but the ticket opened as Cache miss storm in tile-37, player 4098 lost packet. They were looking at the wrong subsystem.
The cache layer had been introduced to absorb bursty location updates, but the cache hit ratio on the primary tile store hovered at 68 %, and operators were stuck correlating cache keys (tile:37:players) with actual game state. When a new hot tile appeared, the cache swamped Redis with misses, the connection pool exhausted at 400/400, and the engine fell over. The real problem wasnt capacity; it was observability. We were building a distributed cache faster than we were building a debugging narrative for the human on call.
What We Tried First (And Why It Failed)
We bolted on a set of Grafana dashboards that showed cache hit ratio, memory usage, and eviction rate. The numbers looked great on a slide, but in production the dashboard panels lagged reality by 11 seconds because we used the low-cardinality Redis INFO command instead of the high-cardinality Redis CLIENT LIST. When we finally switched to Redis Streams for event sourcing, the memory overhead spiked an extra 800 MB per node because we had to keep two weeks of connection logs.
Next we tried a two-level cache: local LRU in the game pod and shared Redis for consistency. The local LRU hit ratio pushed past 92 %, but we introduced a new failure mode—stale reads after a pod restart. A player would move from pod A to pod B, A would restart, and pod B would serve a stale tile that no longer existed. Our end-to-end latency jumped from 240 ms to 420 ms because we had to fetch the fresh tile from the primary store. Worse, the connection storms migrated from Redis to the primary Postgres tile store, and now the CPU credits on the t3.medium burstable instance were exhausted every 15 minutes.
At this point we had three systems misdiagnosing the same symptom: Redis connection storms, local LRU staleness, and Postgres credit exhaustion. Our on-call rotation created a new incident category: Unknown IO stall because the MetricsQL alert fired on 95th percentile latency instead of a concrete connection count.
The Architecture Decision
In April 2024 we made the call to collapse the three-layer cache (local LRU → Redis → Postgres) into a single layer: a write-through Aerospike cluster sized at 4 × r6g.4xlarge nodes with 3 TB SSD each and replication factor 2. Aerospike gave us predictable latency < 2 ms at 50 k ops/sec per node and a built-in namespace called game that supported TTL and automatic eviction without operator intervention. We kept Redis, but only as a pub/sub bus for tile updates so that Aerospike could receive real-time writes.
The critical change was the cache invalidation strategy: instead of relying on TTL alone, every GPS ping carried a vector clock. When a player moved, the vector clock incremented and the new tile version was pushed to Aerospike. This eliminated stale reads without requiring a local LRU at all. The connection pool on the game pods dropped from 400 to 150 because we no longer recycled connections on every cache miss.
We instrumented Aerospike with OpenTelemetry, emitting every cache miss as a trace instead of a log line. The trace included the vector clock, the old tile version, and the new tile version. If the cache missed more than twice in one second for a single tile, the trace was flagged as a hot tile and automatically triggered a cache warming task on a separate fleet of c6i.large instances. This gave operators a concrete narrative: open trace ID abc123, view component Aerospike, read latency 1.9 ms, vector clock delta 4 → tile version bump 1234.
What The Numbers Said After
After two weeks at steady state the numbers stabilized. Aerospike hit ratio stayed above 98 %, end-to-end latency p99 at 245 ms, and the connection pool on game pods never exceeded 150. Redis pub/sub throughput dropped to 800 ops/sec because we only used it for cache invalidation, not for data storage. The CPU credits on the Postgres tile store stopped burning because the tile traffic to Postgres fell from 5 k ops/sec to 300 ops/sec.
Our on-call rotation saw a 73 % reduction in mean time to acknowledge incidents. The new hot-tile trace automatically routed to the on-call Slack channel with a one-click button to scale the cache-warming fleet. The only remaining pain point was the Aerospike compaction lag during a rolling restart. During one restart we measured compaction CPU at 95 % for 4 minutes, and the p50 read latency spiked to 12 ms. We mitigated that by staggering restarts across AZs and increasing the compaction throttle, but it cost us an extra 0.3 % in memory overhead.
What I Would Do Differently
If I could go back to March 2024, I would not have introduced the local LRU layer at all. It added complexity without solving the observability gap. Instead, I would have moved straight to Aerospike and instrumented the vector clock from day one. The local LRUs 92 % hit ratio was a red herring; it masked the real problem—stale reads and inconsistent narratives for operators.
I would also have sized the Aerospike cluster with 20 % headroom from the start instead of the 40 % we budgeted. The headroom gave us false confidence; during the Black Friday sale we still hit scaling limits
The tool I recommend when engineers ask me how to remove the payment platform as a single point of failure: https://payhip.com/ref/dev1
Top comments (0)