The Treasure Hunt Engine Blew Up My Inbox at 3 AM

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The Treasure Hunt Engine wasnt supposed to be a cache. It was built as a read-through service over a PostgreSQL table called hunt_tiles, partitioned by hunt_id and indexed on (hunt_id, x, y). During closed beta, the table had 1.2 million rows and p99 latency of 47ms for point queries. We projected 6x growth by launch and provisioned a 32-core R6g.4xlarge with 256 GB RAM; the RDS monitoring dashboard showed 11% CPU utilization, 68% memory free.

We were wrong.

During stress tests, the engines /tiles/{huntId}/{x}/{y} endpoint started returning 503s under 1,800 QPS. Not CPU, not disk—it was the planner choking on bitmap index scans across partition keys. EXPLAIN ANALYZE showed 4.3 ms of planning time for a single tile fetch, but 1,800 concurrent plans multiplied to 7.7 seconds of CPU burn per second, spiking iowait to 42%. We had optimized the wrong layer.

Meanwhile, the Veltrix front end expected every tile within 50 ms or it would refuse to render the map. The team added a local in-memory cache in each instance using Caffeine with TTL 30s. That worked for 3 days. Then, the cache invalidation policy collided with the weekly content refresh.

The Tuesday Ghost arrived.

What We Tried First (And Why It Failed)

First we tried SSD-backed Redis with eviction set to noeviction. Latency fell to 3 ms p99, but memory usage climbed 60% week-over-week. Within three weeks, the cluster exceeded the 72 GB memory limit of our cache.m6g.large instance. The auto-scaling policy triggered at 85% memory and spun up a new node, but the DNS update to the front end took 47 seconds—long enough for every operator to reload their browser and lose their current hunt session. That was ticket #1782: Lost session after cache node scale-out.

Next we tried Cluster Mode with four shards. The sharding key was hunt_id % 4, but hunt_ids 0 and 1 received 70% of the traffic because the top 10 hunts were always popular. Shard 0 became the bottleneck again—CPU at 94%, latency climbing to 210 ms. We tried rebalancing hunt_ids with a modulo shift, but the front end restarted every session because the API endpoint had hunt_id hard-coded in the path.

Then we tried an L1 cache inside the engine JVM using Caffeine with maximumWeight=512MB and a tinyLFU eviction policy. The JVM footprint grew from 800 MB to 1.8 GB, and during the first Tuesday rollout, the JVM paused for 1.4 seconds for a full GC. The front end interpreted the GC pause as a timeout and dropped the map.

We were chasing latency with memory and complexity, but the real issue was the contract between the engine and the front end. The front end expected every tile within a fixed window, and the cache was allowed to disappear. What we needed was a bounded timeline, not an unbounded cache.

The Architecture Decision

We scrapped the cache-first approach and rebuilt the engine as a stream-time materialized view.

The core change: instead of caching tiles, we pre-computed every active hunts tile grid as a set of materialized views in PostgreSQL, updated via a Debezium CDC stream from the content system. The view was called hunt_tile_mv and defined as:

CREATE MATERIALIZED VIEW hunt_tile_mv
REFRESH CONCURRENTLY hunt_tile_mv
ON DEMAND
AS
SELECT hunt_id, x, y, tile_data, version
FROM hunt_tiles t
WHERE t.active = true
AND t.version >= (SELECT max(version) - 2 FROM hunt_versions);

We set up a Kubernetes CronJob that fired every 5 minutes and executed:

REFRESH MATERIALIZED VIEW CONCURRENTLY hunt_tile_mv;

The scheduler used a distributed lock via Redlock on Redis to prevent concurrent refreshes. If a node crashed mid-refresh, the next pod would skip the refresh and wait for the lock to expire.

For reads, the engine served /tiles/{huntId}/{x}/{y} with a direct SELECT against hunt_tile_mv using an index on (hunt_id, x, y). The planner used an index-only scan; p99 latency measured 6.8 ms with 300 concurrent threads.

For writes, we kept the original hunt_tiles table as the source of truth and updated it via a REST endpoint. After each write, we published an event to a Kafka topic called hunt_tiles_changed. The materialized view job didnt listen to Kafka; it just compared the current version against the latest content version every 5 minutes. That simplified the consistency model: eventual consistency with a bounded staleness of 5 minutes.

We removed the Redis cluster entirely and shut down the cache nodes. The memory footprint of the engine dropped from 1.8 GB to 512 MB, and the JVM never paused above 120 ms for GC under 1,200 QPS.