Hytale Servers and the Lies We Told Ourselves About Treasure Hunts

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The Veltrix public cluster runs up to 1,200 concurrent worlds, each with an in-memory treasure-hunt engine that must atomically pick, lock, and drop loot within 20 ticks (400 ms); otherwise the spawn rules break and loot floats freely. Our first implementation offloaded pickup detection to a Lua sandbox per world, then queued loot drops to a single global allocator. At 400 worlds the allocator saturated and pods died with "runtime: out of memory" at 1.8 GB RSS per pod. The pprof flame graphs showed 32 % of CPU time lost in mutexes around sync.Map shards, not in the lock-heavy treasure math. The real problem was allocation rate, not the algorithm.
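To make the deadline concrete, here is a minimal sketch of the pick, lock, drop sequence with the 20-tick budget. The 20 ms tick length follows from the figures above (400 ms / 20 ticks); the names `LootClaim`, `LootPhase`, and `advance` are hypothetical and only illustrate the constraint, they are not the engine's real types.

```rust
/// Tick length implied by the post: 400 ms / 20 ticks.
pub const TICK_MS: u64 = 20;
/// A claim must go from pick to drop within this many ticks.
pub const DEADLINE_TICKS: u64 = 20;

/// Hypothetical phases of one loot claim.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LootPhase {
    Picked,
    Locked,
    Dropped,
    Expired,
}

/// Hypothetical per-claim state: the current phase plus the tick it started on.
pub struct LootClaim {
    phase: LootPhase,
    started_tick: u64,
}

impl LootClaim {
    /// Start a claim in the `Picked` phase at the given tick.
    pub fn new(now_tick: u64) -> Self {
        Self { phase: LootPhase::Picked, started_tick: now_tick }
    }

    pub fn phase(&self) -> LootPhase {
        self.phase
    }

    /// Move to the next phase, or expire the claim if the budget is blown.
    /// `Dropped` and `Expired` are terminal and never change again.
    pub fn advance(&mut self, now_tick: u64) -> LootPhase {
        if matches!(self.phase, LootPhase::Dropped | LootPhase::Expired) {
            return self.phase;
        }
        let elapsed = now_tick.saturating_sub(self.started_tick);
        self.phase = if elapsed > DEADLINE_TICKS {
            LootPhase::Expired
        } else {
            match self.phase {
                LootPhase::Picked => LootPhase::Locked,
                _ => LootPhase::Dropped,
            }
        };
        self.phase
    }
}
```

A claim started at tick 100 that is advanced at ticks 101 and 102 ends up `Dropped`; one that is first advanced at tick 121 ends up `Expired`, which is the case where loot would otherwise float freely.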

What We Tried First (And Why It Failed)

We rewrote the engine in Go 1.20 and adopted a staged pipeline: an arena allocator per tick, garbage-free state machines, and go:embed for static loot tables. Latency dropped, until GC jitter arrived. Running go tool trace -c 1000 during a global drop revealed 4.2 ms mark-sweep pauses every 70 ms, coinciding with the tooltip flicker we saw at the 95th percentile. We tried GOGC=off, which pushed RSS to 3 GB and triggered the OOM killer. We tried runtime.SetGCPercent(5), which stabilized p99 at 200 ms but broke deterministic seeding: the global PRNG state became racy under GC compaction.

The Architecture Decision

After a week of flame graphs and allocator benchmarks, we accepted that Go's GC was the wrong fundamental constraint. We moved the engine to a Rust crate compiled as a C-compatible shared library (cdylib) and loaded via Lua FFI in the same world process. The Rust crate used a bumpalo arena with no_std and libc_alloc for zero-copy loot tables. We disabled GC on the Go side, relying on arena pools for tick-local allocations. The FFI boundary added 12 ns per call, but that cost was negligible next to the GC pauses it replaced. We also switched the PRNG to fastrand with per-world seeds, guaranteeing deterministic spawns even under arena reuse.
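The tick-local arena idea can be sketched without any external crates. The version below is a std-only stand-in for a bump allocator, not the bumpalo setup described above: it hands out indices instead of references, and it relies on `Vec::clear` keeping the allocated capacity so that steady-state ticks do not touch the system allocator. `TickArena` and `LootDrop` are hypothetical names for illustration.

```rust
/// Hypothetical payload allocated many times per tick.
#[derive(Debug, Clone, PartialEq)]
pub struct LootDrop {
    pub item_id: u32,
    pub x: f32,
    pub y: f32,
}

/// Tick-local scratch arena backed by a single Vec.
/// Values live until `reset`, which is called once at the end of each tick.
pub struct TickArena<T> {
    items: Vec<T>,
}

impl<T> TickArena<T> {
    /// Reserve the backing storage up front so later ticks reuse it.
    pub fn with_capacity(cap: usize) -> Self {
        Self { items: Vec::with_capacity(cap) }
    }

    /// Store a value and return its index within the current tick.
    pub fn alloc(&mut self, value: T) -> usize {
        self.items.push(value);
        self.items.len() - 1
    }

    /// Look up a value allocated earlier in the same tick.
    pub fn get(&self, idx: usize) -> Option<&T> {
        self.items.get(idx)
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }

    pub fn capacity(&self) -> usize {
        self.items.capacity()
    }

    /// Drop every value from this tick but keep the backing memory.
    pub fn reset(&mut self) {
        self.items.clear();
    }
}
```

In a tick loop the pattern would be: allocate drops with `alloc` while processing pickups, read them back with `get`, then call `reset` before the next tick. Indices from a previous tick are invalid after `reset`, which is the price of skipping per-object frees.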

What The Numbers Said After

After one sprint of Rust migration, the median treasure-hunt latency dropped from 30 ms to 1.8 ms on the same hardware. P99 latency fell from 280 ms to 42 ms. The allocation rate went from 142 MB/s per world to 8 MB/s. Heap snapshots from the jemalloc profiler showed zero GC pressure and 128 KB RSS per world instance, against the 1.8 GB per pod we started with. The only regression was build time: a clean build of the Rust crate took 45 s even with mold, versus 3 s for Go. In production, we pre-built shared objects and hot-patched via dlopen during blue-green deploys, hiding the penalty.

What I Would Do Differently

I would have modeled the PRNG seed lock earlier. The first Rust version serialized all per-world seeds through a single mutex, creating a latency spike at global loot drops. Once we switched to a 64-bit XOR-shift generator with world-id mixing, the hot path became lock-free and p99 latency halved again. I would also have profiled memory bandwidth on the host node before blaming the language; fragmentation in the Go allocator caused a 28 % cache-miss rate during spawn bursts. In hindsight, the language change bought us headroom, but without understanding the memory subsystem, the fix would have missed the mark.
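A rough sketch of what a per-world XOR-shift generator with world-id mixing can look like. The shift triple (13, 7, 17) is the standard xorshift64 variant; the splitmix64-style finalizer used to mix the world id into the seed is an assumption on my part, as are the names `WorldRng`, `cluster_seed`, and `pick`. The point is only that each world owns its state, so nothing is shared and nothing needs a mutex.

```rust
/// Per-world xorshift64 generator. Each world owns one instance,
/// so the hot path never takes a lock shared with other worlds.
#[derive(Debug, Clone)]
pub struct WorldRng {
    state: u64,
}

impl WorldRng {
    /// Derive the initial state from a cluster-wide seed and a world id.
    /// The mixing step is a splitmix64-style finalizer (assumed, not from the post).
    pub fn new(cluster_seed: u64, world_id: u64) -> Self {
        let mut z = cluster_seed ^ world_id.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        // xorshift gets stuck at zero, so substitute a fixed non-zero state.
        let state = if z == 0 { 0x9E37_79B9_7F4A_7C15 } else { z };
        Self { state }
    }

    /// One xorshift64 step with the (13, 7, 17) shift triple.
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Pick an index in 0..len. Modulo bias is ignored for brevity.
    /// Panics if `len` is zero.
    pub fn pick(&mut self, len: usize) -> usize {
        assert!(len > 0, "cannot pick from an empty loot table");
        (self.next_u64() % len as u64) as usize
    }
}
```

Two generators built from the same `cluster_seed` and `world_id` replay the same sequence, which is what keeps spawns deterministic; different world ids start from different states, so worlds do not mirror each other's drops.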