The Ghost of 500 ms Latency: A Hytale Veltrix Config Engine Postmortem

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late 2024 the Hytale ops team noticed our Veltrix configuration engine was routinely serving 480 ms responses under load, far above the SLA we'd promised the community team for the upcoming beta. The engine is a Lua-based rules processor that decides which treasure tables to expose based on biome, time of day, and player progression. By February 2025 it was clear the hot path was the nested Biome → SubRegion → Feature → LootSet lookup chain. Each extra indirection added latency, and we began seeing "TimeoutError: LuaCoroutine 487" in New Relic when the player count in Pellucid jumped above 12 k concurrent. Community Slack was filling up with messages like "@HiOps why is my treasure table stuck on loading for 30 s", and we still had four months before launch.

What We Tried First (And Why It Failed)

Our first instinct was to shard the biome index across three LuaJIT workers. We partitioned the 128 vanilla biomes alphabetically: A–K on node-1, L–R on node-2, S–Z on node-3, each worker bound to a c5.2xlarge with 8 vCPUs and 16 GB RAM. Traffic was distributed by consistent hashing on biome name. The numbers looked promising in staging: P95 dropped to 130 ms. But then we deployed to canary, and within an hour the Citus layer that backed our treasure table metadata started throwing "could not serialize access due to read/write dependencies" at 8 k QPS, because three different workers were trying to update the same loot set metadata simultaneously. The system ground to a halt and we rolled back in 19 minutes, long enough for the thread "Is Hytale dead?" to hit the top of r/gaming.
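To make the failure mode concrete, here is a minimal sketch of the alphabetical split described above. The node names come from the ranges in this post; the routing function and the example loot set are hypothetical, not our production code. It shows that routing reads by biome name does nothing to stop several workers from writing to the same loot set once that loot set is referenced by biomes in more than one range.

```python
from bisect import bisect_left

# Inclusive upper bound of each alphabetical range, and the worker that owns it.
# Ranges follow the post (A-K, L-R, S-Z); everything else here is illustrative.
RANGE_UPPER_BOUNDS = ["K", "R", "Z"]
WORKERS = ["node-1", "node-2", "node-3"]


def route_biome(biome_name: str) -> str:
    """Return the worker that owns a biome, chosen by the first letter of its name."""
    first = biome_name.strip()[:1].upper()
    if not "A" <= first <= "Z":
        raise ValueError(f"cannot route biome name: {biome_name!r}")
    # bisect_left finds the first range whose upper bound is >= the letter.
    return WORKERS[bisect_left(RANGE_UPPER_BOUNDS, first)]


def workers_writing(loot_set_biomes: list[str]) -> set[str]:
    """Return every worker that may update a loot set shared by these biomes."""
    return {route_biome(name) for name in loot_set_biomes}


# Hypothetical loot set referenced by one biome in each alphabetical range.
shared_loot_set_biomes = ["Alpine Ridge", "Mangrove", "Savanna"]
print(sorted(workers_writing(shared_loot_set_biomes)))
```

Any loot set whose writer set has more than one member is a candidate for exactly the serialization conflicts we hit in canary.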

We then tried a Redis Cluster front-end: six shards, each storing a BallTree of all loot sets keyed by (biome_id, subregion_id). The Lua workers issued GEORADIUS queries. In benchmarks the P95 was 48 ms and our error budget looked healthy. But real players in the Pellucid region were getting inconsistent loot sets; one player would see Amethyst Ore in the Jungle Plateau while a friend nearby would see Diorite. The problem was that BallTree radii did not perfectly align with our in-game region polygons. We had to abandon the approach because consistency trumped latency in a treasure-hunting game—players would rather wait an extra 200 ms than lose that Amethyst Ore to a friend who rolled on a different tree.
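The mismatch between a radius query and a region polygon is easy to reproduce on paper. The sketch below uses a hypothetical square region in flat 2D coordinates (not our real geometry and not Redis itself) and compares a standard ray-casting point-in-polygon test against the smallest circle that covers the whole region. Any point that falls in the circle but outside the polygon would be handed the wrong loot set.

```python
import math

Point = tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: list[Point]) -> bool:
    """Ray-casting test: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_radius(x: float, y: float, center: Point, radius: float) -> bool:
    """Radius test, the kind of check a radius query approximates a region with."""
    return math.hypot(x - center[0], y - center[1]) <= radius


# Hypothetical 10x10 region and the smallest centered circle that covers its corners.
region = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
center = (5.0, 5.0)
radius = math.hypot(5.0, 5.0)

# A player standing just outside the region's east edge.
player = (12.0, 5.0)
print(point_in_polygon(*player, region), point_in_radius(*player, center, radius))
```

Shrinking the radius trades these false positives for false negatives near the corners, so no single radius reproduces the polygon boundary.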

The Architecture Decision

We finally converged on a single-process LuaJIT engine with a precomputed, in-memory B+Tree index. The index is built offline from the Unity asset bundles during CI: we run hltgen --src bundles/treasure --out vindex.bpt, which produces a 42 MB B+Tree file laid out to be memory-mapped as-is. The file is loaded once at engine start via mmap; subsequent lookups are pure Lua memory access with zero serialization cost. All loot sets for a given biome are stored in contiguous pages, so the whole lookup chain (Biome → SubRegion → Feature → LootSet) is a single 120-byte memcpy followed by a 32-byte hash probe.
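The idea of "build the file offline, mmap it at startup, look up without deserializing" can be sketched in a few lines. To keep it short, this example swaps the B+Tree for a sorted array of fixed-size records searched by binary search, and the record layout, file name, and IDs are all hypothetical; it is not the format hltgen emits.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: biome_id, subregion_id, feature_id, loot_set_id.
# "<" means little-endian with no padding, so every record is exactly 10 bytes.
RECORD = struct.Struct("<HHHI")


def build_index(path: str, entries: dict[tuple[int, int, int], int]) -> None:
    """Offline build step: write records sorted by key so lookups can binary search."""
    with open(path, "wb") as f:
        for key, loot_set_id in sorted(entries.items()):
            f.write(RECORD.pack(*key, loot_set_id))


class MappedIndex:
    """Read-only view over the index file; records are decoded in place on lookup."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(self._map) // RECORD.size

    def lookup(self, biome_id: int, subregion_id: int, feature_id: int):
        """Return the loot_set_id for the key, or None if the key is absent."""
        target = (biome_id, subregion_id, feature_id)
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            *key, loot_set_id = RECORD.unpack_from(self._map, mid * RECORD.size)
            if tuple(key) < target:
                lo = mid + 1
            elif tuple(key) > target:
                hi = mid
            else:
                return loot_set_id
        return None

    def close(self) -> None:
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "example_index.bin")
    build_index(index_path, {(1, 1, 1): 100, (1, 2, 1): 101, (2, 1, 3): 200})
    index = MappedIndex(index_path)
    hit = index.lookup(1, 2, 1)
    miss = index.lookup(9, 9, 9)
    index.close()
print(hit, miss)
```

Because the mapping is opened with read-only access, nothing in the serving path can mutate the index, which is the property the design relies on.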

We kept the index read-only at runtime (no disk writes) and relied on Kubernetes liveness probes to restart pods whenever the A/B test changed the underlying treasure tables. The Redis Cluster became a write-through cache only for dynamic player modifiers (e.g., seasonal events), while the core loot resolution remained in-process. Each pod now fits in 40 MB of RSS and serves 75 k QPS at 1.9 ms P95 on a c6g.large, well inside the SLA.
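The split between static, in-process resolution and a write-through cache for dynamic modifiers might look roughly like the sketch below. Plain dictionaries stand in for both Redis and the durable store, and the class, function names, and modifier format are hypothetical. The point is only the ordering: a write updates the source of truth and the cache together, and the static lookup never leaves the process.

```python
class WriteThroughModifierCache:
    """Cache of per-player modifiers; every write goes to the store and the cache."""

    def __init__(self, store: dict) -> None:
        self._store = store  # stand-in for the durable source of truth
        self._cache = {}     # stand-in for the Redis Cluster

    def set_modifier(self, player_id: str, modifier: str) -> None:
        # Write-through: update the store first, then the cache, in the same call.
        self._store[player_id] = modifier
        self._cache[player_id] = modifier

    def get_modifier(self, player_id: str):
        if player_id in self._cache:
            return self._cache[player_id]
        value = self._store.get(player_id)
        if value is not None:
            self._cache[player_id] = value  # populate the cache on a miss
        return value


def resolve_loot(static_index: dict, modifiers: WriteThroughModifierCache,
                 key: tuple, player_id: str) -> str:
    """Resolve the base loot set in-process, then apply any dynamic modifier."""
    base = static_index[key]
    modifier = modifiers.get_modifier(player_id)
    return base if modifier is None else f"{base}+{modifier}"


# Hypothetical data: one static entry and one seasonal modifier.
static_index = {("jungle_plateau", "north", "cave"): "ore_table_a"}
backing_store = {}
modifiers = WriteThroughModifierCache(backing_store)
modifiers.set_modifier("player-1", "winter_event")

key = ("jungle_plateau", "north", "cave")
print(resolve_loot(static_index, modifiers, key, "player-1"))
print(resolve_loot(static_index, modifiers, key, "player-2"))
```

Two players with the same key always get the same base loot set; only the modifier layer can differ, and it differs per player by design.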

What The Numbers Said After

After six weeks of beta the metrics told the story:

P95 latency for biome resolution: 1.9 ms (was 480 ms)
P99 latency for biome resolution: 4.7 ms (was 980 ms)
TimeoutError: LuaCoroutine incidents dropped to zero.
Treasure-table load failures in client dropped from 3.2 % to 0.1 %.
CPU utilization on the LuaJIT pods stayed below 28 %, leaving headroom for the new particle effects we added in patch 0.27.

The Redis Cluster we kept as a write-through cache now handles 12 k writes/s with a 98 % hit rate for seasonal modifiers, while the B+Tree index handles the remaining 63 k reads/s. Our latency budget before launch was 500 ms for 99.9 % of requests; we shipped with a 50 ms buffer.

What I Would Do Differently

If I had to rewind, I would never have sharded the biome index. Sharding introduced distributed state and a new vector for read/write conflicts that we had to unwind under pressure. The decision to keep the core index single-process simplified debugging enormously; with one binary log and one core dump we could reproduce any player's loot table in a debugger in under two minutes.

I would also call out the BallTree attempt earlier. Consistency boundaries in treasure systems are sacred—loose consistency breaks immersion faster than latency ever will. Once we accepted that the treasure table had to be strongly consistent, the B+Tree solution became obvious and we stopped chasing shinier tech.

Last, I would have insisted on a memory-mapped file from day one instead of rolling a custom serialization layer. Trying to keep the index both in memory and on disk doubled our build pipeline complexity. The mmap discipline is now part of our LuaJIT engine's startup contract and has saved us three on-call pages since March.