The One Cache That Broke Our Treasure Hunt Engine

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

Our core game loop was:

player scans QR → API writes record → Geo-indexer updates spatial index → leaderboard recalculates.

We knew writes would be the hot path, so we cached scan → player_id → last_location in Redis with a 30 s TTL. Simple, fast, and we could afford to lose a few updates if the cache evaporated.

The first mistake was the key schema. We chose {game_id}:{player_id}:location which meant every write to Postgres also went to a single shard. When the Redis node hit its 4 GB maxmemory, it started evicting keys. Our cache hit rate cratered to 47 % and the API p99 latency grew from 90 ms to 1.8 s because every miss forced a round-trip to Postgres plus a Geo-indexer recalculation.

What We Tried First (And Why It Failed)

We tried two obvious fixes in sequence:

RDB snapshots every 5 min on the Redis side. This introduced a 2–3 min drift between cache and source of truth. Players who physically moved would see their pin jump backward on the map. Complaints spiked; one report described their avatar teleporting 400 m in the wrong direction.
Doubled Redis memory to 8 GB per node and set maxmemory-policy allkeys-lru. That simply postponed the inevitable. At 11 k concurrent games the working set grew to 11.2 GB, triggering OOM kills every 4 hours. The Redis pod would respawn, but the new instance started empty, so the first 30 s of every spike were pure cold-cache latency.

Neither option addressed the real cost: serializing every location write through a single Redis shard that also hosted leaderboard data. We were co-locating two completely different workloads—hot path writes and eventually consistent ranking—on the same memory budget. The error budget of 500 ms extra latency had been exceeded within seven minutes of the traffic spike.

The Architecture Decision

We split the caches.

L1: Local LRU on each API pod (500 MB, 5 s TTL). This absorbed spikes inside the pods own memory before they reached the network.
L2: A dedicated Redis cluster sized at 2× the peak working set (24 GB per shard, 10 shards). We used Redis 7.2 with noevict because we could finally afford the RAM; we had just decommissioned a 2 TB RAM-only feature flag store that nobody was using anymore.
WAL: Every location change wrote an append-only log to Kafka before updating Postgres. A separate consumer hydrated the Geo-indexer from Kafka, guaranteeing eventual consistency even if Redis or Postgres hiccuped.

The tradeoff was clear: added network hops (L1 → L2 → WAL → Kafka → Postgres) versus guaranteed headroom. We measured the additional latency at ~14 ms on the same AZ. Given the old worst-case was 1.8 s, the 99th percentile became 194 ms—below our 500 ms error budget.

What The Numbers Said After

Within 36 hours of rolling the change, the Redis cluster stabilized at 68 % memory usage and 0 evictions. Cache hit rate recovered to 94 %; API p99 settled at 152 ms. The Kafka consumer lag never exceeded 2 s even at peak, because wed capped the partition count at 50 and set max.poll.records=500.

The only remaining pain point was cold starts: when an API pod rebooted, its local L1 cache was empty, so the first few hundred requests in the pods AZ would miss both caches. We mitigated that by pre-warming the L1 with a snapshot every 30 s, shaved from the same Redis 2 cluster. That added 8 MB of extra traffic per pod, a cost we accepted.

What I Would Do Differently

I would not have co-located the leaderboard and location caches in the first place. The ranking data updates every second whereas location updates every 15 s; the access patterns are orthogonal. If I had to repeat the exercise, Id carve out two separate Redis clusters from day one and budget the memory explicitly: