The Day We Blew Up Our Cache Tier and What It Taught Us About Treasure Hunt Engines

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I joined Veltrix in 2023 to run the backend fleet for their treasure-hunt-as-a-service platform. Two years in, our Redis Cluster tier hit a wall every time the top 20 customers reached ~10TB of active game state. The pattern was ugly: p99 latency spiked from 8ms to 700ms inside 72 hours, followed by a cascade of 503s from the API layer. The root cause wasn't CPU or memory exhaustion; it was fork() latency from Redis's fork-based RDB snapshots. With 8 CPU cores and 64GB RAM, the kernel spent 3.2 seconds just copying page tables before the OOM killer stepped in. We tried bigger boxes; the snapshot time scaled linearly until we hit the physical limit of the AWS i3.8xlarge.
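To get a feel for why fork() gets slow on a large heap, here is a back-of-the-envelope sketch of how much page-table data the kernel has to duplicate. It assumes x86-64 with 4 KiB pages, 8-byte page-table entries, no huge pages, and the full 64 GiB resident in the Redis process; none of those details were measured on our fleet, so treat the result as an order-of-magnitude estimate only.

```python
# Rough estimate of the page-table memory fork() must copy.
# Assumptions (illustrative, not measured): x86-64, 4 KiB pages,
# 8-byte last-level page-table entries, no transparent huge pages,
# and the whole 64 GiB resident in one process.

PAGE_SIZE = 4 * 1024   # bytes per page
PTE_SIZE = 8           # bytes per last-level page-table entry


def page_table_bytes(resident_bytes: int) -> int:
    """Approximate size of the last-level page tables for a resident set."""
    pages = resident_bytes // PAGE_SIZE
    return pages * PTE_SIZE


resident = 64 * 1024**3                      # 64 GiB box
pages = resident // PAGE_SIZE
table_mib = page_table_bytes(resident) / 1024**2

print(f"{pages:,} pages, about {table_mib:.0f} MiB of page-table entries")
# prints: 16,777,216 pages, about 128 MiB of page-table entries
```

The copy itself is only part of the pause, and the estimate grows linearly with resident memory, which fits the linear scaling we saw when moving to bigger boxes.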

What We Tried First (And Why It Failed)

First we punted to bigger primary instances. We promoted an i3.16xlarge with 128GB RAM and local NVMe. The snapshot pause dropped to 1.8s, but fork() still ran every 5 minutes, and the cost curve looked like a hockey stick at $2.10/node-hour. Next we tried turning off persistence altogether and running in pure in-memory mode. Within a week we lost 0.34% of active game state after a single AZ failure. That's 1,200 concurrent users re-spawning in a Santorini lobby with no progress. The CEO's Slack ping was literally "We can't lose progress."

The Architecture Decision

We ripped out Redis entirely and built a tiered, append-only log on top of Apache BookKeeper 2.9.1. The new layout:

Hot tier: 4 m6g.4xlarge instances running a 3-way replicated BookKeeper ledger. We kept the same amount of RAM but removed fork(), so snapshot cycles became 15ms writes to S3 via Tiered Storage. Latency p99 stayed under 12ms for SETs <1KB.
Warm tier: 2 c6g.large bookies holding 48h of tail data. Reads from the warm tier were 30ms p99, but we throttled clients to 10 QPS per session to avoid a thundering herd during startup bursts.
Cold tier: S3 IA for everything older than 48h. A nightly compaction job rewrote segments into 64MB Parquet files. The restore path used Athena + Lambda to replay logs into the hot tier. We ate a 5-minute cold-start hit when players returned after a week-long hiatus, but the game state was still intact.
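The 10 QPS per-session throttle on warm-tier reads can be sketched as a token bucket. This is a minimal illustration, not our production limiter: the class name `SessionThrottle`, the burst size of 10, and the in-process dictionary are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class SessionThrottle:
    """Token bucket that allows `rate` reads per second per session."""
    rate: float = 10.0    # 10 QPS per session
    burst: float = 10.0   # assumed burst allowance (hypothetical value)
    _buckets: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def allow(self, session_id: str, now: Optional[float] = None) -> bool:
        """Return True if this session may issue one more read right now."""
        now = time.monotonic() if now is None else now
        # A new session starts with a full bucket.
        tokens, last = self._buckets.get(session_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[session_id] = (tokens - 1.0, now)
            return True
        self._buckets[session_id] = (tokens, now)
        return False


throttle = SessionThrottle()
# Twelve back-to-back reads at the same instant: only the burst gets through.
results = [throttle.allow("session-a", now=0.0) for _ in range(12)]
print(results.count(True))  # prints: 10
```

Keeping one bucket per session means a single reconnecting client cannot starve the others during a startup burst. A shared store would be needed if the limiter has to work across several API nodes.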

We paid for the migration by burning two engineering sprints (112 hours of senior eng time) and a 15% increase in infra cost per player. The upside was zero progress loss and predictable tail latency.

What The Numbers Said After

After six weeks on BookKeeper:

p99 write latency: 8ms (down from 700ms).
Snapshot time: 15ms every 5 minutes (was 3.2s).
Daily cost per active player: $0.023 (was $0.019 on Redis i3).
Player loss due to state corruption: 0% (was 0.34% per AZ failure).
fork() syscalls per second: 0 (was 1,200 during snapshots).

The one surprise was that BookKeeper's segment merge storms would occasionally spike CPU on the warm tier. We mitigated it by capping segment size at 64MB and adding a 1-minute cooldown between merges. Without that cap, the warm tier would sometimes hit 95% CPU for 45s, causing read-latency outliers up to 180ms.
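The size cap plus cooldown can be expressed as a small gate in front of whatever schedules merges. The sketch below is illustrative only: `MergeGate` and `may_merge` are hypothetical names, not BookKeeper APIs, and the real mitigation lived in our own scheduling code and configuration.

```python
from typing import Optional

MAX_SEGMENT_BYTES = 64 * 1024 * 1024   # 64MB segment cap
MERGE_COOLDOWN_S = 60.0                # 1-minute cooldown between merges


class MergeGate:
    """Decides whether a merge may start now (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_merge: Optional[float] = None

    def may_merge(self, merged_size: int, now: float) -> bool:
        """Allow a merge only if it fits the cap and the cooldown has passed."""
        if merged_size > MAX_SEGMENT_BYTES:
            return False   # merged segment would exceed the size cap
        if self._last_merge is not None and now - self._last_merge < MERGE_COOLDOWN_S:
            return False   # previous merge was too recent
        self._last_merge = now
        return True


gate = MergeGate()
half_cap = 32 * 1024 * 1024
print(gate.may_merge(half_cap, now=0.0))    # prints: True
print(gate.may_merge(half_cap, now=30.0))   # prints: False
print(gate.may_merge(half_cap, now=60.0))   # prints: True
```

In a real scheduler the `now` argument would come from a monotonic clock such as `time.monotonic()`; it is passed in explicitly here so the behaviour is easy to check.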

What I Would Do Differently

I would not have trusted Redis snapshots for anything more than ephemeral caching. The moment the dataset exceeds RAM by even 10%, you're playing Russian roulette with fork(). I would also have started with BookKeeper from day one if we'd known the game state would keep growing. The initial infra cost delta was painful, but the operational headroom it bought (no surprise midnight pages, no emergency Redis upgrades) was worth every cent.
