Hytales Veltrix Treasure Hunt Engine Blew Up My Prometheus Budget

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Players triggered the Veltrix sequence the moment they entered a zone: runtime loaded a 4 MB JSON manifest, ran a Python UDF that matched loot tables against 128 K rows, and emitted a Kafka message that the rendering cluster consumed at 14 K messages per second. The SLA was sub-linear tail latency on p99. Within two weeks the p99 was 462 ms, and Prometheus buckets were redder than a Crimsun loot palette. On AWS c6g.large the Python interpreter was already spending 68 ms per request just to parse the manifest.

Worse, the manifest changed three times an hour during live ops because live ops wanted to A/B test loot drop rates. That meant cold starts after every deploy were not acceptable. We could throw GPUs at the embeddings, but the real bottleneck was the UDF that treated the treasure table like a document instead of a dimension table.

What We Tried First (And Why It Failed)

We started with a PostgreSQL materialized view that joined the loot tables with the zone schema. The query planner spent 90 ms on the join and then another 40 ms on the JSONB->TEXT conversion for the UDF. The view refresh lagged 1.8 s behind writes, so at peak traffic we were returning stale loot tables to 20 % of concurrent players.

Next we tried RedisJSON with Lua scripts. Lua serialized the entire treasure table into a single key and used O(n) iteration inside Lua to filter. At 100 K operations per second the Lua GC paused reached 12 ms, which broke the p99 target. Also, during an EC2 failover, the Lua JIT recompilation added 47 ms of warmup jitter.

Finally we tried pgvector with a pre-filtered vector store. The cosine-distance query returned the correct loot IDs, but the Postgres planner insisted on a bitmap index scan that touched 48 % of the table instead of using the IVFFlat index. The planner cost estimation was off by two orders of magnitude, so we disabled it with a row estimate hint and the p99 dropped to 70 ms—but then the vector index rebuild took 45 minutes during every deploy window, causing a manifest blackout.

The Architecture Decision

We ripped the treasure manifest into a dedicated ClickHouse table called hunt_treasure. We partitioned by zone_id and clustered on drop_weight so the filtering stage became a single partition scan that skipped 98 % of rows. We pre-aggregated the loot tables into a denormalized view once per manifest epoch (every 10 minutes) using ClickHouses MATERIALIZED VIEW engine with a TTL of 600 seconds. The view was distributed across 6 shards, each with 16 GB RAM and NVMe storage for hot partitions.

The rendering cluster now fetches a zone ID from a Redis sorted set that is updated by the live-ops pipeline, then does a point lookup in hunt_treasure where cluster_by = zone_id. The query returns in 1.2 ms to 2.8 ms 99 % of the time, and the worst-case 99.9 % is still under 15 ms. We replaced the Python UDF with ClickHouses native JSONExtractRaw function, which runs in the vectorized engine instead of the interpreter.

For the live-ops manifest, we stopped baking the entire JSON into a single artifact. Instead, we publish a manifest key that references a ClickHouse dictionary keyed by asset_id. Each asset row contains only the delta for that asset. When live ops wants to change drop rates, they update the asset row and the dictionary fork-lift rebuilds in under 5 seconds without touching player sessions.

What The Numbers Said After

After the switch, the rendering clusters Kafka consumer lag dropped from 1.2 s to 18 ms. The Prometheus scrape interval of 30 s now shows p99 latency at 2.3 ms and p99.9 at 14.8 ms. The ClickHouse cluster CPU idle dropped from 22 % to 6 %, and storage IOPS fell from 4 K to 180 after we archived old epochs to S3 with 6-hour granularity.

We did lose one feature: the old system allowed per-player loot personalization via a second UDF that reranked results by player UUID. We replaced it with a lightweight Boosted Columns feature in ClickHouse that reranks within the partition scan without touching Python. The p95 personalization latency is 8 ms, which is acceptable for an optional feature.

What I Would Do Differently

I would not have let the initial manifest design grow to 4 MB. If we had capped the manifest at 512 KB and paginated the loot tables from day one, the PostgreSQL approach might have survived longer. I would also have budgeted for a dedicated ClickHouse cluster two sprints earlier instead of treating it as an optimization after the fire drill.

Last, I would standardize the manifest schema so that every asset row carries an explicit version vector. The live-ops pipeline currently relies on atomic key replacement, which is great for throughput but terrible for rollback. A version vector would let us atomically switch manifests and roll back in under 200 ms without a global lock.

The treasure hunt engine now works, but the lesson is obvious: search volume does not equal engineering complexity. When the users dopamine spike depends on milliseconds, the system you ship is not a demo—its a distributed system that must prove every cycle.