Veltrix Treasure Hunt Engines Blew Up in Production—This Is How We Fixed Them

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The engine's job is simple: collect a list of 3 to 12 coordinate clusters from a 2,048 × 2,048 tile grid, then let the client render them as glowing geodes. The catch is that treasure spawns can change every 90 seconds, and clients demand fresh clusters within 250 ms. Under load we saw two concrete failures:

MongoDB 6.0 reads with ReadConcern majority would block WiredTiger checkpoint commits, raising the WT_UPDATE_CONFLICT rate from <1 % to 19 %.
The engine used a single capped collection per shard, treasure_clusters_v1, capped at 1,024 documents. After three days each shard had absorbed 1.2 million capped inserts, and with the 16 MB default document size limit every insert landed in a new record, spiking fsync time to 300 ms.

The ops runbook had a one-line note about updating maxTimeMS to 250 ms, but the real failure was architectural: we were treating a real-time feed like an analytics cache.

What We Tried First (And Why It Failed)

We started with the obvious: scale up. We forked the engine into four pods, each pinned to a separate shard, and gave each pod 4 CPUs and 8 GB RAM. Load balancers sat in front of an nginx reverse proxy that did 1-second retry loops on 5xx. The first 10 minutes looked good: p99 settled at 60 ms. Then the first daily spike hit, and the nginx queue grew to 12,000 requests. The nginx error log showed upstream_timed_out while the Veltrix pods themselves were idle, so the bottleneck had moved from compute to connection churn.

Next we tried MongoDB Atlas on a dedicated M30 cluster tier (8 GB RAM, 2 vCPUs). We set readPreference to secondaryPreferred, but the secondary nodes lagged up to 3 s during daily ETL jobs, and we started seeing WriteConcernTimeout –11600 on every fourth write. The MongoDB Atlas console showed storage IOPS at 8,000 but CPU credit balance at 0 %, so Atlas silently throttled us to 1,200 IOPS during the spike window.

Finally we rewrote the query to use an in-memory grid stored in Redis 7.0 with a Lua script that did 2D array lookups. Redis consumed 14 GB of RAM at rest and would restart every 12 hours because jemalloc fragmentation hit 96 % residency. The Lua script itself is 67 lines long and contains six pcall blocks; once, a nil tile coordinate caused a silent script exit, and the next client poll waited 256 ms for a nil return that was never cached.

We had optimized for scale and speed, but ignored the one invariant we should never ignore: the tile grid is immutable for 90-second epochs. Every previous solution treated the problem as a hot-cache problem instead of a bounded-state problem.

The Architecture Decision

We drew a service boundary around the tile grid itself. Instead of letting hundreds of game servers query MongoDB directly every frame, we created a new micro-service called GridProducer that publishes a compact delta file every 90 seconds. The file is an Apache Parquet row-group containing only the eight bytes for each changed tile: x, y, value. We store 60 days of deltas in S3-IA and keep the last 60 minutes in memory via a custom mmap-based Parquet reader built on Apache Arrow 14.
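
To see why eight bytes per changed tile is enough for a 2,048 × 2,048 grid, here is a minimal sketch of one possible fixed-width layout. The field widths (16-bit x and y, 32-bit value), the byte order, and the function names are assumptions for illustration; the post does not specify them, and the real deltas are stored as Parquet, not as raw structs.

```python
import struct

# Assumed layout, not taken from the post: x and y as unsigned 16-bit
# integers, value as an unsigned 32-bit integer, little-endian (2 + 2 + 4 = 8).
TILE_DELTA = struct.Struct("<HHI")
GRID_SIZE = 2048  # tiles per side, from the article


def pack_delta(x: int, y: int, value: int) -> bytes:
    """Encode one changed tile as an 8-byte record."""
    if not (0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE):
        raise ValueError(f"tile ({x}, {y}) is outside the {GRID_SIZE} x {GRID_SIZE} grid")
    return TILE_DELTA.pack(x, y, value)


def unpack_deltas(buf: bytes) -> list[tuple[int, int, int]]:
    """Decode a buffer of back-to-back 8-byte records into (x, y, value) tuples."""
    return list(TILE_DELTA.iter_unpack(buf))
```

Coordinates up to 2,047 fit comfortably in 16 bits, which leaves four bytes for the tile value under this assumed layout.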

GridProducer runs on a single k8s pod with 1 CPU and 1.5 GB RAM, but it pre-computes the entire grid map into a zstd-compressed 42 MB file and uploads it to S3. Game servers download only the file they need via HTTP range requests (RFC 7233). The server decompresses the Parquet in 18 ms on a 2022 Intel i5-1240P and keeps an LRU cache of the last three files.
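
The "LRU cache of the last three files" can be sketched roughly as follows. This is an illustration, not the production code: the class name EpochFileCache, the epoch-number key, and the loader callback (standing in for the download-and-decompress step) are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class EpochFileCache:
    """Keeps the most recently used grid files in memory (three by default)."""

    def __init__(self, loader: Callable[[int], bytes], capacity: int = 3) -> None:
        self._loader = loader      # hypothetical: fetches and decompresses one file
        self._capacity = capacity
        self._files: OrderedDict[int, bytes] = OrderedDict()

    def get(self, epoch: int) -> bytes:
        if epoch in self._files:
            # Cache hit: mark this file as most recently used.
            self._files.move_to_end(epoch)
            return self._files[epoch]
        # Cache miss: load the file, then evict the least recently used one.
        data = self._loader(epoch)
        self._files[epoch] = data
        if len(self._files) > self._capacity:
            self._files.popitem(last=False)
        return data
```

With a capacity of three, a game server that mostly reads the current and previous epoch should rarely need to go back to the network.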

Trade-off: we replaced high-concurrency MongoDB traffic with eventual consistency. If a client asks for clusters at 11:03:05 but the file published at 11:03:00 hasn't propagated yet, they get stale data for ≤ 90 s. We measured this and decided it's acceptable because the gameplay loop resyncs every 90 s anyway.
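
The epoch arithmetic behind that trade-off can be sketched like this. It assumes epochs are aligned to multiples of 90 seconds of Unix time, which the post does not state; the function names and the fallback rule in choose_epoch are likewise illustrative.

```python
from datetime import datetime, timezone

EPOCH_SECONDS = 90  # epoch length from the article


def epoch_index(unix_ts: float) -> int:
    """Number of whole 90-second epochs elapsed since the Unix epoch."""
    return int(unix_ts // EPOCH_SECONDS)


def epoch_start(unix_ts: float) -> float:
    """Unix timestamp at which the epoch containing unix_ts began."""
    return float(epoch_index(unix_ts) * EPOCH_SECONDS)


def choose_epoch(unix_ts: float, newest_published: int) -> int:
    """Serve the current epoch if its file exists, else the newest published one."""
    return min(epoch_index(unix_ts), newest_published)


# Hypothetical request at 11:03:05 UTC; the date itself is arbitrary.
request = datetime(2024, 1, 1, 11, 3, 5, tzinfo=timezone.utc).timestamp()
current = epoch_index(request)
seconds_into_epoch = request - epoch_start(request)  # 5.0 under the alignment assumption
fallback = choose_epoch(request, current - 1)         # previous epoch's file
```

If the 11:03:00 file has not propagated, the server falls back to the previous epoch's file, so the data it serves is at most one epoch old.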

What The Numbers Said After

After one week on this architecture:

99.95 % of cluster requests served from in-memory Parquet files, p99 24 ms.
S3 GET cost dropped to $0.0004 per 1,000 requests (we serve ~2.4 M requests/day).
MongoDB Atlas traffic fell from 4.2 M reads/day to 42 writes/day—we only write the delta file once.
Redis 7.0 cluster mode was decommissioned, saving $1,800/month in managed-memory fees.
No more WT_UPDATE_CONFLICT or WriteConcernTimeout errors.
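
As a rough sanity check on the S3 line item, the figures above can be combined as follows. This is back-of-the-envelope arithmetic, assuming that every request not served from memory turns into exactly one S3 GET; the post does not confirm that mapping.

```python
REQUESTS_PER_DAY = 2_400_000      # from the article
IN_MEMORY_HIT_RATE = 0.9995       # 99.95 %, from the article
S3_GET_PRICE_PER_1000 = 0.0004    # USD, from the article


def s3_get_cost(requests: float) -> float:
    """Daily GET cost in USD for the given number of requests."""
    return requests / 1000 * S3_GET_PRICE_PER_1000


# Upper bound: every single request goes to S3.
worst_case_daily = s3_get_cost(REQUESTS_PER_DAY)

# Assumed case: only in-memory misses reach S3.
misses_per_day = REQUESTS_PER_DAY * (1 - IN_MEMORY_HIT_RATE)
miss_daily = s3_get_cost(misses_per_day)

print(f"worst case: ${worst_case_daily:.2f}/day")
print(f"misses only: about {misses_per_day:.0f} GETs/day, ${miss_daily:.5f}/day")
```

Even the upper bound stays under a dollar a day, so the GET cost is negligible next to the Redis savings.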

The only new failure mode is S3 throttling (503 Slow Down responses) when GridProducer and hundreds of game servers hit the same object simultaneously. We mitigated it by adding CloudFront in front, setting TTL to 90 s, and turning on S3 Transfer Acceleration. The first CloudFront hit still requires ~80 ms, but edge caches absorb 78 % of traffic and the fallback to origin is rare.

What I Would Do Differently

I would not have built GridProducer as a separate service on day one. Two weeks of profiling showed that 87 % of latency spikes were caused by MongoDB write stalls during daily ETL. Instead of forking the engine, I would have isolated the write path first—switched from capped collections to a time-series collection (MongoDB 6.0 feature) and set expireAfterSeconds to 7,