This Single-Node Treasure Hunt Engine Cost Us 14 Grafana Alerts

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

Our game's treasure-hunt feature was built as a stateful microservice: it kept the entire current treasure map in RAM, updated distances in real time, and accepted WebSocket connections from mobile clients. The product team wanted sub-50 ms latency for every GET /treasure/{id}/distance call, even when 10 k concurrent players were racing to the same city block.

At first we prototyped it with Node + Redis. We stored each treasure as a Redis hash with fields lat, lng, value, and updatedAt. The Node service fetched those hashes with pipelined HGETALL calls on every player move. That lasted until Black Friday, when marketing pushed the promotion on TikTok. Redis memory usage hit 32 GB; pipeline latency climbed to 217 ms; the daily Redis bill doubled. The caching layer had become the bottleneck instead of the hero.

What We Tried First (And Why It Failed)

Next came a Go rewrite we called treasure-srv-v2. We ditched Redis and parsed the 300 MB map once at startup into an in-memory red-black tree keyed by treasure ID. The tree gave us O(log n) lookups by ID, which we assumed would be fast enough to feed the geospatial math. We set a 2 GiB soft memory limit (GOMEMLIMIT=2GiB) and crossed our fingers.

It failed at 7:32 AM during UAT when 500 synthetic clients hammered the endpoint. Our Go profiler showed the GC was spending 35 % of CPU time; the RSS grew from 2 GB to 3.6 GB in four minutes flat. The OOM killer finally stepped in. Logs revealed the Red-Black tree was allocating 8 KB nodes for each treasure, and we had 400 k treasures in the wild. The memory density was catastrophic.

We also tried an off-the-shelf geospatial library, Turf.go, because someone remembered a Twitter thread that claimed it was faster than hand-rolled math. Turf bombed even harder: it converted every lat/lng pair to geom.Point, which internally created slices and mutexes. GC pauses spiked to 1.2 s, and the endpoint error rate hit 11 %.

The Architecture Decision

On the third refactor attempt, we accepted the only tradeoff that actually moved the needle: split the treasure map into immutable shards we call chunks. Each chunk is a 16 MB gzipped GeoJSON file stored in S3, versioned by city tile (Z14). The runtime service, treasure-engine, keeps only three things in RAM:

A 1 KB index (chunk_id → [min_lat, max_lng, s3_version]).
A tiny in-memory LRU cache (32 MB) for the hottest 2 k treasures keyed by player proximity.
A RocksDB instance (on a 20 GB EBS gp3 volume) for deltas that players actually change (treasure claimed or updated).
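For the hot-treasure cache in the list above, a minimal LRU sketch built on the standard library's container/list looks like the following. It is an illustration under assumptions, not the treasure-engine code: the `Treasure` shape and the `LRU` type are hypothetical, the cache is keyed by treasure ID, and it is bounded by entry count rather than by the 32 MB byte budget.

```go
package main

import (
	"container/list"
	"fmt"
)

// Treasure is a hypothetical record shape; fields mirror the original Redis hash.
type Treasure struct {
	ID       string
	Lat, Lng float64
	Value    int64
}

// LRU is a fixed-capacity, least-recently-used cache. Capacity must be > 0.
// It is not safe for concurrent use; a real service would add a mutex.
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // treasure ID -> node in order
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached treasure and marks it as most recently used.
func (c *LRU) Get(id string) (Treasure, bool) {
	el, ok := c.items[id]
	if !ok {
		return Treasure{}, false
	}
	c.order.MoveToFront(el)
	return el.Value.(Treasure), true
}

// Put inserts or refreshes a treasure, evicting the least recently used
// entry when the cache is full.
func (c *LRU) Put(t Treasure) {
	if el, ok := c.items[t.ID]; ok {
		el.Value = t
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		if oldest := c.order.Back(); oldest != nil {
			c.order.Remove(oldest)
			delete(c.items, oldest.Value.(Treasure).ID)
		}
	}
	c.items[t.ID] = c.order.PushFront(t)
}

func main() {
	cache := NewLRU(2)
	cache.Put(Treasure{ID: "a"})
	cache.Put(Treasure{ID: "b"})
	cache.Get("a")               // "a" becomes the most recently used entry
	cache.Put(Treasure{ID: "c"}) // cache is full, so "b" is evicted
	_, ok := cache.Get("b")
	fmt.Println("b cached:", ok)
}
```

Both Get and Put are O(1): the map finds the list node, and the doubly linked list moves or removes it without scanning.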

At startup, treasure-engine downloads the relevant chunk set from S3 once, maps it into memory with mmap(2), and never parses the JSON again. Distance math is a k-nearest-neighbor search over the mmapped vertices; we dropped Turf entirely. The RocksDB store is append-only; we flush it to S3 every 2 s and ship the WAL to Kafka for replay.
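The distance math can be sketched as a haversine function plus a brute-force k-nearest scan. This is a simplified illustration, not the production code: the `Vertex` layout, the `haversineM` and `kNearest` names, and the plain in-memory slice (standing in for the mmapped region) are all assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const earthRadiusM = 6_371_000.0 // mean Earth radius in metres

// Vertex is a hypothetical flat record; the real mmapped layout is not shown here.
type Vertex struct {
	ID       uint32
	Lat, Lng float64 // degrees
}

// haversineM returns the great-circle distance in metres between two points.
func haversineM(lat1, lng1, lat2, lng2 float64) float64 {
	const rad = math.Pi / 180
	dLat := (lat2 - lat1) * rad
	dLng := (lng2 - lng1) * rad
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(lat1*rad)*math.Cos(lat2*rad)*math.Sin(dLng/2)*math.Sin(dLng/2)
	// Clamp to 1 to guard against floating-point drift before Asin.
	return 2 * earthRadiusM * math.Asin(math.Sqrt(math.Min(1, a)))
}

// kNearest returns up to k vertices closest to (lat, lng), nearest first.
// It copies the input and sorts it, so it runs in O(n log n).
func kNearest(vs []Vertex, lat, lng float64, k int) []Vertex {
	out := make([]Vertex, len(vs))
	copy(out, vs)
	sort.Slice(out, func(i, j int) bool {
		return haversineM(lat, lng, out[i].Lat, out[i].Lng) <
			haversineM(lat, lng, out[j].Lat, out[j].Lng)
	})
	if k < 0 {
		k = 0
	}
	if k > len(out) {
		k = len(out)
	}
	return out[:k]
}

func main() {
	vs := []Vertex{{1, 0, 0.03}, {2, 0, 0.01}, {3, 0, 0.02}}
	for _, v := range kNearest(vs, 0, 0, 2) {
		fmt.Printf("treasure %d at %.0f m\n", v.ID, haversineM(0, 0, v.Lat, v.Lng))
	}
}
```

A full sort is the simplest correct form. For large chunks, a bounded max-heap of size k or a spatial grid would avoid sorting every vertex on each request.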

The shard strategy meant we could now scale horizontally: a single treasure-engine pod serves one city tile. When load exceeds 800 RPS per pod, the Kubernetes HPA scales the deployment up to three pods, each pinned to a zone via the topology.kubernetes.io/zone label. Latency stayed flat because the hot path became mmap reads from the page cache.
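To make the scaling rule concrete, here is a small sketch of the replica calculation the HPA documents: ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. The function name `desiredReplicas`, the minimum of one pod, and the sample loads are assumptions; the 800 RPS target and the three-pod ceiling come from the setup described above. The sketch ignores the HPA's tolerance band and stabilization windows.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies ceil(current * currentMetric / targetMetric) and
// clamps the result to [minPods, maxPods]. Tolerance and stabilization
// behaviour of the real controller are deliberately left out.
func desiredReplicas(current int, currentRPSPerPod, targetRPSPerPod float64, minPods, maxPods int) int {
	want := int(math.Ceil(float64(current) * currentRPSPerPod / targetRPSPerPod))
	if want < minPods {
		want = minPods
	}
	if want > maxPods {
		want = maxPods
	}
	return want
}

func main() {
	// One pod seeing 2,000 RPS against an 800 RPS target, capped at three pods.
	fmt.Println(desiredReplicas(1, 2000, 800, 1, 3))
}
```

With one pod at 2,000 RPS the ratio is 2.5, which rounds up to three pods. Anything beyond that is absorbed by the cap, so sustained load above 2,400 RPS per tile would need a higher ceiling.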

What The Numbers Said After

Six weeks after the shard rollout, we ran a controlled blast with Locust: 5 k concurrent clients hitting the same city tile. P99 latency stayed at 38 ms. RSS per pod averaged 780 MB, with a peak of 1.1 GB. The 95th-percentile GC pause was 22 ms. Our Grafana dashboard now shows 14 green alerts and zero red.

Cost per 1 k requests dropped from $0.0047 (Redis + Node) to $0.0003 (S3 + gp3). The monthly AWS bill for the hunt service fell by $12 k, allowing us to finally fund the mobile team's iOS 18 beta slots.

What I Would Do Differently

I would not have written a Go service at all. We burned three weeks rolling our own tree and parser; the Red-Black nodes alone cost us 1.8 GB of heap. If I could rewind, I would have started with a simple PostgreSQL table with a GiST index on location GEOMETRY(POINT, 4326) and used PostGIS distance queries. The initial query latency would have been ~80 ms, but we would have avoided rearchitecting twice and saved 14 on-call pages.

I would also have skipped the RocksDB delta store. After load testing, we discovered that 98 % of the treasure state is read-only; writes were under 5 RPS per city tile. We could have kept the RocksDB layer only for the 2 % write path and saved ourselves the EBS and Kafka costs. In hindsight, optimizing for write throughput was premature.