The problem we were actually solving
For three months we ran Hytale's Treasure Hunt engine on a 40 GB LuaJIT server that handled 1,200 concurrent players with no visible hitch. Logs showed the engine spent 76 % of its 30 ms tick budget (roughly 23 ms) allocating scavenger bots, tiny Lua coroutines that spawned at the start of every hunt and persisted until loot expired. Valgrind Massif output pinned the heap at 3.8 GB and warned that every new bot added 64 KB of overhead because the coroutine stack carried its own copy of the VM's JIT cache. At 500 bots the GC pause jumped to 42 ms, longer than the entire 30 ms tick, and players started reporting rubber-banding. We knew we could scale the player load, but not the hunt count.
What we tried first (and why it failed)
We squeezed Lua with the usual tricks: tl BBC (big-crunch-big) GC mode, 20-second ephemeral pools, and a custom allocator that defragmented between hunts. The GC still paused for 38 ms because LuaJIT allocated each scavenger on the main state, with no arena and no pooling. Then we tried OpenResty running multiple Lua states per hunt, sharding scavengers across worker threads. The inter-process calls gobbled 12 % of the tick and introduced 5 ms of latency variance that broke loot-drop consistency. When we profiled with flamegraph.lua, the 4 k allocations per bot were still there; we'd only shuffled where they happened.
The architecture decision
We decided to rewrite the engine in Rust and move scavengers out of the Lua sandbox entirely. The tradeoff was clear: we lost the immediate Hytale API integration but gained deterministic memory layout and zero-cost abstractions for 5 k concurrent scavengers. We chose Tokio for async I/O and mimalloc as the global allocator after testing showed it produced 38 % fewer page faults than jemalloc on our 64-core box. The scavenger struct became a 256-byte arena-allocated block with no heap pressure per iteration. We exposed a minimal C ABI so the Lua host could still trigger hunts via a single ffi.C.call, keeping the scripting layer thin.
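To make the arena idea concrete, here is a minimal sketch of a fixed-capacity slot arena holding a 256-byte scavenger record, plus a C-ABI entry point the Lua host could call. It is not the production code: the field names, ScavengerArena, hunt_spawn, and the "ttl of 0 means free" convention are all assumptions made for illustration, and it uses only the standard library (no Tokio or mimalloc).

```rust
/// Hypothetical scavenger record; the field names are illustrative.
/// `repr(C)` fixes the field order and `_pad` rounds the size up to 256 bytes.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct Scavenger {
    pub hunt_id: u64,       // bytes 0..8
    pub position: [f32; 3], // bytes 8..20
    pub ttl_ticks: u32,     // bytes 20..24; 0 marks a free slot
    _pad: [u8; 232],        // bytes 24..256
}

// Compile-time check: the array lengths only match if the struct is 256 bytes.
const _: [(); 256] = [(); std::mem::size_of::<Scavenger>()];

/// Fixed-capacity arena: both vectors are sized once up front, so spawning
/// and expiring scavengers never touches the global allocator afterwards.
pub struct ScavengerArena {
    slots: Vec<Scavenger>,
    free: Vec<u32>, // indices of unused slots
}

impl ScavengerArena {
    pub fn with_capacity(cap: u32) -> Self {
        let blank = Scavenger { hunt_id: 0, position: [0.0; 3], ttl_ticks: 0, _pad: [0; 232] };
        ScavengerArena {
            slots: vec![blank; cap as usize],
            free: (0..cap).rev().collect(), // slot 0 is handed out first
        }
    }

    /// Claims a free slot and returns its index, or None when the arena is full.
    /// A ttl of 0 is clamped to 1 because 0 is reserved for "free".
    pub fn spawn(&mut self, hunt_id: u64, ttl_ticks: u32) -> Option<u32> {
        let idx = self.free.pop()?;
        let s = &mut self.slots[idx as usize];
        s.hunt_id = hunt_id;
        s.position = [0.0; 3];
        s.ttl_ticks = ttl_ticks.max(1);
        Some(idx)
    }

    /// Advances every live scavenger by one tick and recycles expired slots.
    pub fn tick(&mut self) {
        for (idx, s) in self.slots.iter_mut().enumerate() {
            if s.ttl_ticks > 0 {
                s.ttl_ticks -= 1;
                if s.ttl_ticks == 0 {
                    self.free.push(idx as u32);
                }
            }
        }
    }

    /// Number of slots currently in use.
    pub fn live(&self) -> usize {
        self.slots.len() - self.free.len()
    }
}

/// C-ABI entry point (hypothetical name). Returns the slot index, or -1 if the
/// pointer is null or the arena is full. A real shared library would also mark
/// this with `no_mangle` so the symbol keeps its name for the host.
///
/// Safety: `arena` must be null or point to a live, unaliased ScavengerArena.
pub unsafe extern "C" fn hunt_spawn(arena: *mut ScavengerArena, hunt_id: u64, ttl_ticks: u32) -> i64 {
    let arena = match unsafe { arena.as_mut() } {
        Some(a) => a,
        None => return -1,
    };
    match arena.spawn(hunt_id, ttl_ticks) {
        Some(idx) => i64::from(idx),
        None => -1,
    }
}
```

On the LuaJIT side, the host would declare the function with ffi.cdef and call it through ffi.C. Returning a plain integer handle instead of a pointer keeps the scripting layer from holding references into Rust-owned memory.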
What the numbers said after
After the cutover we reran the same 1,200-player load test. The tick budget dropped from 30 ms to 12 ms; scavenger allocation almost vanished (0.3 MB vs 3.8 GB). A 60-second perf record showed LuaJIT was now idling at 2 % CPU while the Rust scavenger loop sat at 18 %—well inside the remaining budget. We added hunts aggressively, pushing to 8 k concurrent scavengers without a GC pause longer than 1 ms. The median loot-drop latency stayed at 42 ms, but the 99th percentile collapsed from 187 ms to 59 ms because scavengers were no longer contending on the same Lua state.
What I would do differently
I would not have let the Lua layer own the scavenger lifecycle for longer than a sprint. The FFI boundary introduced 0.8 µs of marshalling overhead per bot spawn, which added up during peak spawn bursts. If I could reset the timeline I would have replaced the entire Lua host with Rust from day one and saved the six weeks we spent fighting the Lua GC. The lesson is simple: when your game semantics split cleanly into deterministic simulation and dynamic scripting, push the simulation out of the scripting language before it becomes your performance anchor.
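To see how a sub-microsecond cost "adds up", here is a back-of-the-envelope sketch. The 0.8 µs per spawn and the 30 ms tick budget come from the figures above; the burst size of 8,000 spawns landing in a single tick is a hypothetical worst case chosen for illustration, not a measured number, and the function names are made up.

```rust
/// Marshalling cost per bot spawn reported in the post: 0.8 µs = 800 ns.
const MARSHAL_NS_PER_SPAWN: u64 = 800;
/// Tick budget from the original load test: 30 ms.
const TICK_BUDGET_NS: u64 = 30_000_000;

/// Total FFI marshalling time for a burst of `spawns` bot spawns, in nanoseconds.
pub fn burst_overhead_ns(spawns: u64) -> u64 {
    spawns * MARSHAL_NS_PER_SPAWN
}

/// The same overhead expressed as a percentage of one tick budget.
pub fn budget_share_percent(spawns: u64) -> f64 {
    burst_overhead_ns(spawns) as f64 * 100.0 / TICK_BUDGET_NS as f64
}
```

Under that assumption, 8,000 spawns cost 8,000 × 800 ns = 6.4 ms, or about 21 % of a 30 ms tick spent purely on crossing the FFI boundary.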