Why the Treasure Hunt Engine Killed Our Weekend Before the Scale-Out

#ai #webdev #programming #machinelearning

The Problem We Were Actually Solving

We needed to distinguish between real treasure spawns and synthetic spam. The original design used a lightweight LLM filter called TreasureLLM that ran on top of every /spawn request; it cost 12 ms and dropped only 0.3 % of fake spawns in the demo. The problem was that the filter was blocking, pure-Python code, and our traffic model showed that once we crossed 300k CCU the filter would dominate tail latency at 100 ms. At that point the geofence lookup we already had in Redis would have to do extra round-trips to validate the result, stacking up latency we had not budgeted for. The documentation for TreasureLLM promised sub-5 ms responses with ONNX, but the compiled artifact came with a 256 MB model that fit into neither our 512 MB Redis container nor our 1 MB hot cache.

What We Tried First (And Why It Failed)

We tried three things in the same weekend:

Fuse TreasureLLM directly into the geofence micro-service using coroutines. This reduced the extra latency to 8 ms per spawn, but the service started OOMing every ten minutes because the 256 MB model was loaded twice: once in the Python runtime and once in the Redis module we used for sidecar inference. The memory spike didn't show up in k6 because our load test capped at 200k users.
Off-load inference to a dedicated GPU node running vLLM. The throughput looked good (2000 req/s on a single A100), but the round-trip latency from mobile to the inference cluster was 60 ms plus 20 ms of carrier network jitter. We replaced a roughly 10 ms latency tax with an 80 ms tax that varied by carrier; it broke our p95 budget.
Replace TreasureLLM with a hand-rolled probability filter that used a 16 KB LMDB shard to store historical spawn patterns. We thought we could get 1 ms latency and zero additional memory. On the first day in production we discovered that the filter used a 512-byte critical section that serialized every /spawn request; at 500k CCU the mutex wait averaged 90 ms and tail latency exploded past 5 seconds.
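The failure mode in the third attempt, one lock in front of every request, is easier to see next to a common alternative. The sketch below is illustrative Python, not code from any of the attempts above: `StripedCounter`, the stripe count, and the use of `zlib.crc32` to pick a stripe are all assumptions made for the example. It shows lock striping, where requests for different users usually take different locks instead of queueing on one mutex.

```python
import threading
import zlib
from collections import defaultdict


class StripedCounter:
    """Per-user spawn counters guarded by N independent locks (hypothetical sketch)."""

    def __init__(self, stripes: int = 64) -> None:
        # One lock and one dict per stripe; a single global lock is the stripes=1 case.
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [defaultdict(int) for _ in range(stripes)]

    def _stripe(self, user_id: str) -> int:
        # Stable hash so the same user always maps to the same stripe.
        return zlib.crc32(user_id.encode("utf-8")) % len(self._locks)

    def record_spawn(self, user_id: str) -> int:
        """Increment and return the user's count, holding only that user's stripe lock."""
        i = self._stripe(user_id)
        with self._locks[i]:
            self._counts[i][user_id] += 1
            return self._counts[i][user_id]

    def count(self, user_id: str) -> int:
        i = self._stripe(user_id)
        with self._locks[i]:
            return self._counts[i][user_id]
```

Striping only reduces contention; two users that hash to the same stripe still wait on each other, which is part of why a lockless design is attractive on a hot path like /spawn.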

Every fix solved one problem and created two new ones. We were performing patch theatre instead of building an instrumentation loop.

The Architecture Decision

On Monday we discarded the LLM entirely. The actual requirement was not semantic sophistication but temporal consistency: we needed to prevent a single user from spawning more than 50 treasures in 5 minutes without locking the whole table. We migrated to a two-tier system:

Tier 1 was a Lua script inside OpenResty that ran on every edge node. It checked a 10 MB ring buffer of user actions maintained in shared memory. The script used a 128-byte lockless ring buffer and returned in 0.12 ms on average. Rejecting an attacker required a single Redis SADD op, which took 1.2 ms at p99.
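The ring-buffer check can be sketched in a few lines. This is a minimal single-threaded illustration in Python, not the Lua script described above: the class name `SpawnRing`, the one-ring-per-user layout, and the use of float timestamps are assumptions. The idea is that a ring sized to the limit (50) always has the oldest accepted spawn at its head, so one comparison against the 5-minute window decides the request.

```python
from typing import Dict, List, Optional


class SpawnRing:
    """Ring of the most recent accepted spawn timestamps for one user (hypothetical)."""

    def __init__(self, limit: int = 50, window_s: float = 300.0) -> None:
        self.window_s = window_s
        self._slots: List[Optional[float]] = [None] * limit
        self._head = 0  # index of the oldest accepted timestamp

    def allow(self, now: float) -> bool:
        """Return True and record the spawn if the user is under the limit.

        Assumes `now` never goes backwards between calls.
        """
        oldest = self._slots[self._head]
        if oldest is not None and now - oldest < self.window_s:
            # The ring is full and its oldest entry is still inside the window,
            # so `limit` spawns have already been accepted in the last window_s.
            return False
        # Overwrite the oldest entry and advance the head.
        self._slots[self._head] = now
        self._head = (self._head + 1) % len(self._slots)
        return True


# One ring per user; rejected requests are not recorded.
rings: Dict[str, SpawnRing] = {}


def check_spawn(user_id: str, now: float) -> bool:
    return rings.setdefault(user_id, SpawnRing()).allow(now)
```

Because each decision reads one slot and writes at most one, the cost per request is constant regardless of how many users are active, and no table-wide lock is involved.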

Tier 2 was a periodic batch job that ran every 30 seconds and used a PostgreSQL advisory lock to reconcile long-term spawn rates. The job had zero effect on latency because it ran asynchronously and only wrote to a separate user_spawn_stats table we synced every minute to an S3 bucket. We stopped paying the 12 ms plus 60 ms plus 90 ms tax; our p99 dropped back to 150 ms and the Redis memory footprint stayed flat.

We also replaced the Redis geofence cache with a Rust rewrite of the same C module that served the exact same Lua API, reducing memory by 45 % and latency by 3 ms. Instead of exotic ML we bought ourselves predictability with boring systems work.

What The Numbers Said After

Comparing before and after the change, we saw:

TreasureLLM path: 12 ms median, 140 ms p99, 42 % cache miss under load.
New path: 0.12 ms median, 1.4 ms p99, 0 % external ML cost, 99.8 % cache hit at edge.
Monthly inference bill dropped from $8 k to $0.
Player reports of missing treasures fell from 1.2 % to 0.08 %; we traced the remainder to a separate bug in the client's GPS smoothing filter.

We added a Prometheus metric called player_spawns_filtered_total that counts the number of spawns rejected by the Lua ring buffer. It fires at ~20k events per second at peak, but the cost is a single increment in shared memory—no network hop, no model load, no context switch.
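For readers unfamiliar with how cheap a counter like this is, here is a stdlib-only stand-in in Python. It is not the edge implementation and not a Prometheus client library: the class `FilteredCounter` and the help text are made up for the example. Only the metric name comes from the setup above, and the output follows the Prometheus text exposition format.

```python
class FilteredCounter:
    """In-process monotonic counter with Prometheus text output (illustrative stand-in)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        # Counters only go up; the hot path is a single addition.
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

    def render(self) -> str:
        # Text exposition format: HELP line, TYPE line, then the sample.
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


filtered = FilteredCounter(
    "player_spawns_filtered_total",
    "Spawns rejected by the edge rate filter.",  # hypothetical help text
)
filtered.inc()  # called once per rejected spawn
```

The scrape endpoint calls `render()` on its own schedule, so the request path never pays for serialization or a network hop.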

What I Would Do Differently

I would never have let the demo version of TreasureLLM graduate to production without a load test that included mobile network jitter and Redis eviction storms. The demo ran on a MacBook Pro with a 7.5 MB model compiled without quantization; the production container had to run a 256 MB model on a 512 MB budget and still answer in less than 5 ms. A roughly 34x jump in model size matters.

I would also have instrumented Redis cluster memory in the load test environment. Our on-call rotation spent six hours debugging why the filter kept evicting the geofence set every time the model was loaded. We found the cause only after the service OOMed in a 300k-user load test that used a 10 GB dataset instead of the 1 GB sample in the demo.

Finally, I would have architected the anti-spam logic as an edge-native Lua module from week one instead of bolting a Python service onto the side. The marginal cost of shipping a 256 MB model to the edge