The Day the Treasure Hunt Engine Drowned in 300 ms Queries

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2024 the Veltrix festival needed a treasure-hunt engine that could handle 12,000 concurrent players across three venues while still giving each user the illusion that their phone was whispering secrets directly to them. We thought the bottleneck would be network I/O or client rendering. We were wrong—it was the lookup query that ran every time a player tilted their phone:

SELECT treasure_box.uuid FROM treasure_box
WHERE venue_id = ? AND ST_DWithin(geom, player_location, 5)
ORDER BY RANDOM() LIMIT 1;

At 12 k concurrent connections that became 92 k queries per second. PostgreSQL started returning 300 ms–500 ms spans, then connection pool exhaustion, then festival staff holding flip-phones to the sky asking why the map was blinking red. The real problem wasnt hardware—it was the assumption that random spatial queries over 1.2 million polygons could stay in a single relational table.

What We Tried First (And Why It Failed)

First we tried Redis with GEOADD and GEORADIUS. It was fast—sub-millisecond—but the ST_DWithin 5 m radius meant we needed 5,000+ keys per venue because each polygon had to be tessellated at 1 m. The memory footprint exploded to 42 GB just for the index, and every GEORADIUS call still did a linear scan inside the radius. After two days of swapping we bailed.

Next we tried CockroachDB multi-region clusters because their spatial functions are PostGIS compatible. Reads were fine, but range-splitting during writes turned every INSERT of a new polygon into a 2–3 second blocking operation. At 200 writes per second (new treasures being placed) the cluster spent more time gossiping than serving. P99 latency climbed past 1.8 seconds—well past the human eyes tolerance for a blinking dot.

Finally we tried a write-through cache with PostGIS on the hot path: every treasure box was pre-indexed into 1 m tiles, stored as JSON {x,y,tile_id,box_id}, and sharded across four Redis nodes. It solved the read path but introduced cache invalidation hell. When a player picked up a treasure, we had to delete the tile keys that contained that box across all shards. Deletes in Redis Cluster are eventually consistent; we saw souvenir screens reporting We couldnt find your treasure even though the server knew youd grabbed it. The event staff started muttering about cursed phones.

The Architecture Decision

We abandoned the all-in-one spatial database idea and split the engine into two layers:

Static Geometry Microservice
• One job: store every venue polygon exactly once.
• Served by a single PostgreSQL 15 instance with PostGIS, autovacuum turned off, and a 64 GB shared_buffers tuned by pgtune.
• Writes only happen during setup and once per hour overnight when curators upload new maps. Reads are simple ST_Contains(point, polygon) calls against a primary key. That query averages 1.2 ms with a 10 k cache hit ratio on the connection pool.
Dynamic Treasure Microservice
• Receives every players GPS tick via gRPC, converts lat/lon to venue-local tile (1 m grid) using a Rust micro-crate called geo-tile at 0.03 ms per call.
• Looks up active treasure boxes from an in-memory hash map sharded by tile_id. Each hash map lives in its own process on a 32-core EC2 m6i.4xl machine running Redis 7.2 with jemalloc and no persistence.
• When a box is collected, we publish an internal event on NATS 2.9.6 with a 10 ms TTL. Other tiles can drop stale references asynchronously; we dont block the players next GPS tick.

The boundary between the two services is strict: geometry never leaks into the treasure map. That killed the random-order requirement, so we replaced ORDER BY RANDOM() with a deterministic round-robin per tile. The visual difference is invisible to players—we rotate the order every 30 seconds anyway for fairness—but the cache misses on the treasure service dropped to 0.4 %.

What The Numbers Said After

Three weeks before Veltrix 2025 we ran a soak test with 15 k synthetic players on three c6g.4xl Spot fleets. The geometry service handled 200 k TPS at 0.8 ms p95. The treasure service handled 1.1 M TPS (yes, million) at 0.12 ms p95 while still leaving 45 % CPU idle on the 32-core box. Total AWS spend was $197 for the three-day weekend, including NATS and Redis Enterprise licenses for high availability. The staffs flip-phones never blinked red.

We instrumented everything with OpenTelemetry and Grafana. The error metric we watched most closely was no-treasure-found, which dropped from 8.2 % on day one of the festival in 2024 to 0.03 % in 2025. The single largest outage—Redis node failover at 14:23—caused a 170 ms spike in p99 for that venue. We had an SLO of 200 ms; we still met it.

What I Would Do Differently

I would not put PostGIS in the critical path again. A single spatial database is a single point of failure and tuning hell. We should have started with the split architecture from day one and treated geometry as a read-mostly reference dataset, not a runtime query target.

I would also replace our in-house geo-tile crate with the official Rust bindings to GDALs Rasterlite. Our 1 m tiles were accurate to 0.3 m on average, but for an AR treasure hunt thats the difference between a glowing orb appearing at your feet and appearing three meters away—a ticket generator for support tickets.

Last, I would budget for a dedicated NATS cluster from the start. In our soak test the gossip traffic between Redis nodes and NATS caused a 4 % increase in p99 every time we scaled the treasure service beyond 12 shards. We ended up co-locating NATS on the same AZ as the treasure boxes, but that

DEV Community

The Day the Treasure Hunt Engine Drowned in 300 ms Queries

Top comments (0)