How We Blew Up Our Cache Layer And Learned To Live With 400 ms P99

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2022 our event pipeline served 12 k QPS at peak for Treasure Hunt Engine. Each event carried a 2 kB payload that we had to enrich with user metadata, session identity, and game state before writing to Kafka. Our original design was a simple Go service backed by a single Postgres 14 read replica. That replica melted at 2 k QPS during a live event when we hit 115 % CPU and began dropping 8 % of writes. The error message canceling statement due to statement timeout filled our logs every 400 ms. We were losing 300-400 events per minute for 17 minutes until the replica recovered.

We needed a cache layer that could absorb the read load for user and session lookups and keep the Postgres replica from melting. Our first target was 50 k QPS sustained and 100 k QPS burst.

What We Tried First (And Why It Failed)

We went with a global Redis cluster on AWS ElastiCache with Cluster Mode enabled. The cluster had 18 shards, 3 replicas per shard, and we used the Redis TIMESERIES module to store session timestamps. We benchmarked it internally with 50 k QPS and hit 2.1 ms P99 under load. So we flipped the switch.

Week 1 went fine. Week 2 we noticed increasing latency on the Redis side. The Redis CLI command redis-cli --latency -h our-cluster.endpoint.amazonaws.com:6379 returned 45 ms P99 on some nodes. Digging into CloudWatch metrics, we saw evicted_keys climbing to 18 k per minute and connected_clients spiking to 11 k on the hottest shard. Our maxmemory policy was volatile-lru with a 80 % watermark. The problem wasnt CPU; it was memory pressure. We increased the node type from cache.r6g.2xlarge (16 GB) to cache.r6g.4xlarge (32 GB), but within 36 hours the same nodes were evicting again.

We tried several fixes:

Doubling the replica count to 6 per shard increased replication lag from 2 ms to 47 ms on the weakest AZ. Our game events require cross-region replication in under 100 ms, so we had to shut that down.
Switching to allkeys-lru reduced evictions, but also caused 1.8 % GET misses that sent us back to Postgres. That reintroduced the timeout errors we were trying to fix.
We tried Redis Cluster Proxy (twemproxy) to shard reads, but the proxy itself became a bottleneck; our P99 latency jumped from 2.1 ms to 18 ms at 30 k QPS, and we saw proxy_connection_errors spike to 400 per minute.

After 6 weeks we were back to square one: Postgres melting, events dropping, and ops on fire.

The Architecture Decision

We stopped trying to scale a single global cache and split the cache into two layers:

A local L1 cache on each game pod using a 100 MB in-process LRU with 1 ms access time. This handled 60 % of reads for user metadata that rarely changed.
A regional L2 cache using Dragonfly 2.0.1, a Redis-compatible fork that uses a single-threaded shard but supports multi-threading for I/O and scales vertically on large instances. We deployed Dragonfly in three regions (us-east-1, eu-west-1, ap-southeast-1) each with a single node of type df-5.6xlarge (16 vCPU, 120 GB RAM). Dragonflys RAM limits are soft; we configured it to use 90 GB RAM and let the OS page aggressively. We used the CLUSTER DFLY DIRECT command to route keys by user ID hash to the correct node.

The L2 Dragonfly cluster handles 40 k sustained QPS per region and 80 k QPS burst with a P99 latency of 3.2 ms. We use RedisJSON to store nested user objects and Lua scripts to update session state atomically. The Lua script looks like this:

local key = KEYS[1]
local new_val = ARGV[1]
local old_val = redis.call('GET', key)
if old_val == false then
 return redis.error_reply('session not found')
end
redis.call('JSON.SET', key, '$', new_val)
return 'OK'

We moved the TIMESERIES data out of Redis entirely and into TimescaleDB 2.10 with compression enabled. Timescale gave us 95 % storage savings and 50 % faster range queries for session windows.

The tradeoff: Dragonfly is memory hungry. Our df-5.6xlarge node costs $4.12 per hour versus a cache.r6g.4xlarge at $2.16 per hour. But the operational simplicity of a single-threaded, vertically scaled node outweighed the cost. We also lost cross-shard transactions; any operation that needs to update user state and session at the same time now uses Postgres with a single UPDATE statement and a 50 ms timeout. Thats acceptable because the timeout now fails fast instead of hanging the entire cache.

What The Numbers Said After

After 5 months in production:

P99 end-to-end event processing latency: 84 ms (was 450 ms during the Postgres meltdown).
Event loss rate: 0.0002 % (was 8 % during the cache meltdown).
Dragonfly memory usage: 85 GB steady state, 110 GB peak. We set a 120 GB soft limit; the OS reclaims pages without evictions.
TimescaleDB storage for session windows: 1.2 TB compressed, down from 24 TB in Redis.
Ops incidents related to caching: 2 in 5 months (both memory pressure alerts that were