Treasure Hunt Engine Blew Up Because We Trusted the Demo

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Every Friday at 17:00 UTC the Veltrix platform queued up 120 000 concurrent players for a treasure-hunt event. The engines job was to resolve spatial queries against a 110 GB world graph in under 400 ms. During the first test, the median response time was 180 ms—comfortably inside the SLA—but the p99 spiked to 1 900 ms. We traced the spike to a single PostgreSQL CTE that joined treasure locations, player inventories, and dynamic loot tables. Autovacuum froze on the loot table while autovacuum_wraparound_* counters climbed from 200 to 2 000 in 30 seconds. The DB logs simply said vacuuming is in progress, which is the kind of shrug you never see in a demo.

What We Tried First (And Why It Failed)

Our first reflex was to shard the world graph horizontally. We split the graph into 16 shards by chunk coordinates (X/8192, Y/8192, Z/8192) and routed queries with a simple modulo. The shard-level latency dropped to 60 ms median and 320 ms p99, looking great on the monitoring dashboard. In production, however, half of the queries were cross-shard because players clustered around a single dungeon entrance. The coordinator node then started serializing 100-way cross-shard joins, and the p99 climbed to 2 800 ms again. At that point the cluster CPU was 78 % idle and the network RTT was 0.6 ms; the bottleneck was not compute, it was coordination skew.

We also tried a Redis-layer cache: LUA scripts that cached the entire player inventory per shard for 30 seconds. The first hit ratio reached 89 %, but the cache stampede after each event start caused 45 000 cache misses in 5 seconds. We watched the Redis eviction rate spike to 9 800 keys/sec and the p99 latency climbed to 1 100 ms. The manifests never mentioned cache coherency or key invalidation batches.

The Architecture Decision

We ripped out both solutions and replaced them with a single vertical partition inside PostgreSQL.

The key insight was that the treasure-hunt queries only needed four tables: world_nodes, player_inventories, loot_tables, and events_metadata. All three queries in the engine could be expressed as a star join on a central fact table: treasure_hunts(id, world_id, loot_table_id, player_id, status, updated_at). We denormalized the star into a single 140 GB hypertree table with BRIN indexes on world_id and updated_at, and kept a tiny 3 GB materialized view for per-player summaries refreshed every 5 minutes.

The planner now used the BRIN indexes for range scans on world_id, avoiding the large CTE join. We set autovacuum_naptime to 10 s and added a custom extension, pg_partial_agg, that computes the per-player summary incrementally. The vacuum workload dropped to 3 % of what it was before, and the p99 stabilized at 210 ms under 130 000 concurrent queries.

The tradeoff was disk: the hypertree table ballooned from 110 GB to 140 GB, but we gained 1.4 TB of SSD headroom on the Veltrix nodes after we decommissioned the Redis cluster. We also lost the horizontal scaling story; if the next event hits 250 000 players we will need to repartition manually—there is no hot-swap here.

What The Numbers Said After

After the switch:

p95 response time stayed at 180 ms across all 15 Friday events.
p99 stayed below 280 ms, with 99.9 % of events finishing under 300 ms.
CPU utilization on the primary replica dropped from 48 % to 12 %.
Autovacuum wraparound warnings disappeared entirely.
The only remaining failure mode is when the BRIN pages are still cold after a node restart; the first query after reboot can stall for 800 ms while the OS page cache loads. We mitigated that by pre-warming the BRIN pages with pg_prewarm during the node boot cycle.

What I Would Do Differently

I would not trust any marketing slide that shows linear scaling without cold-start data. The demo cluster was 1/10th the size and already warmed up; we learned nothing about vacuum storms or buffer cache misses. I would also insist on a chaos-engineering budget: every Friday we should simulate a node loss at event start to verify that the p99 does not collapse during a failover. Our current failover time is 4.2 s, which is still visible in telemetry as a 1 100 ms spike for the unlucky 1 % of requests that land on the newly promoted leader.

Finally, I would push back against the feature team that wanted to add a real-time loot-tiering algorithm to the treasure engine. That feature would have meant another hot table with 1 MHz updates and killed our current p99. Instead, we moved loot tiering to a background job that publishes to Kafka and the engine only reads the pre-computed tier. The theater of dynamic loot is impressive, but in production it is just another source of latency variance we do not need.