The Problem We Were Actually Solving
Our Veltrix-based shard cluster was supposed to scale the treasure-hunt mini-game to 1 000 concurrent players so we could open the European region without another hardware purchase. Treasure hunts are memory marathons: each clue is a nested NBT subtree that gets parsed, deserialized, and streamed to every participant every 750 ms. By Friday of week two we were already spending 2.3 s in GC every 30 s on a 32-core bare-metal node. The JVM heap was set to 32 GB, but the allocator kept reserving 47 GB of direct memory because ByteBuffer.wrap() was pinning the entire NBT payload in native RAM. A heap dump taken with jcmd (GC.heap_dump) showed 28 million unreachable ByteBuffer instances still holding on to their backing arrays. Our ops team called it the Phantom Array Problem: we could see the memory in pmap, but the JVM didn't count it as part of its heap.
What We Tried First (And Why It Failed)
We started by tuning the JVM. We set -XX:MaxDirectMemorySize=16G and added -XX:+AlwaysPreTouch to force the OS to commit the pages upfront, hoping to avoid surprises. That only pushed the same OOM a bit further out: the node now died at 1 200 players instead of 700, and the GC logs revealed a new pattern. Between CLS Young GC and G1 Mixed GC we were spending 7 % of CPU just walking the remembered sets for objects whose only live reference was the ByteBuffer's Cleaner. The VisualVM sampler showed 1 842 μs per NBTNode clone in a full GC cycle. We tried switching to Netty's PooledByteBufAllocator, but the treasure-hunt engine was still built around a Java NBT library that unconditionally returned new byte[] copies from every getTag() call. The copy-on-read was chewing through 1.4 GB/s on a 40 GbE link even when the data wasn't going anywhere. At that point it was obvious: the language runtime was setting the pace, not the algorithm.
The Architecture Decision
We rewrote the clue engine in Rust. Not because we love the borrow checker, but because rustc's compile-time checks finally let us answer the question we'd been dancing around: where does each NBT subtree actually die? We swapped the Veltrix Netty pipeline for tokio and replaced the NBT parser with a hand-rolled S-expression parser that produces zero-copy slices into a single mmap'd file. The key decision was mapping the treasure-hunt assets as an ArenaAllocator backed by memmap2::MmapMut. Every clue is now a NodeId that borrows the backing memory without ever duplicating the byte array. We also introduced a 64-byte slab for small string blobs so that 85 % of clues never allocate at runtime. The whole crate compiles with #![deny(warnings)] and uses #[track_caller] on every helper that calls unwrap(), because we'd already burned ourselves with panics in production. The migration took 19 days and cost one engineer's sanity.
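The arena-plus-NodeId idea can be sketched with nothing but the standard library. This is a minimal illustration, not our production code: `ClueArena`, its methods, and the sample clue text are hypothetical names made up for the example, and a `Vec<u8>` stands in for the memory-mapped file so the sketch runs without memmap2.

```rust
use std::ops::Range;

/// Handle to a clue subtree: an index into the arena's span table, not a pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(u32);

/// Read-only arena over one contiguous buffer.
/// Assumption: in a real deployment `bytes` would be a memory-mapped file;
/// a Vec<u8> stands in here so the sketch needs only std.
pub struct ClueArena {
    bytes: Vec<u8>,
    spans: Vec<Range<usize>>,
}

impl ClueArena {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, spans: Vec::new() }
    }

    /// Register a byte range as a node. Returns None if the range is
    /// reversed, out of bounds, or the span table is full.
    pub fn add_node(&mut self, range: Range<usize>) -> Option<NodeId> {
        if range.start > range.end || range.end > self.bytes.len() {
            return None;
        }
        let id = u32::try_from(self.spans.len()).ok()?;
        self.spans.push(range);
        Some(NodeId(id))
    }

    /// Borrow a node's bytes. The returned slice points into the arena's
    /// buffer, so nothing is copied and it cannot outlive the arena.
    pub fn get(&self, id: NodeId) -> Option<&[u8]> {
        let span = self.spans.get(id.0 as usize)?;
        self.bytes.get(span.clone())
    }
}

fn main() {
    let raw = b"(clue \"dig here\")(clue \"look up\")".to_vec();
    let mut arena = ClueArena::new(raw);

    let first = arena.add_node(0..17).expect("range in bounds");
    let second = arena.add_node(17..33).expect("range in bounds");

    assert_eq!(arena.get(first), Some(&b"(clue \"dig here\")"[..]));
    assert_eq!(arena.get(second), Some(&b"(clue \"look up\")"[..]));

    // An out-of-bounds range is rejected instead of panicking later.
    assert!(arena.add_node(30..99).is_none());
}
```

Because a NodeId is a plain `Copy` index, it can be passed around freely, while the borrow checker ties every slice returned by `get` to the lifetime of the arena. That is what answers the "where does this subtree die?" question at compile time.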
What The Numbers Said After
Post-migration, the same 32-core node ran 2 300 concurrent players before the GC flat-lined at 180 ms every 120 s. pmap now shows 4.2 GB RSS at idle and 14.8 GB at peak, a 3× reduction. perf record -g revealed that the JVM's safepoint bias had completely disappeared; the Rust version spends 1.4 % of CPU in the kernel versus 11 % in the old setup. We also gained the ability to hot-patch the parser without restarting the shard, because we built the crate as a dylib loaded at runtime. P99 latency for clue distribution dropped from 32 ms to 8 ms. On the flip side, compile times ballooned to 4 min on a debug build, forcing us to split the CI pipeline into two stages: Rust release artifacts and a JVM integration test suite that verifies compatibility with old mod packs. That tradeoff is acceptable because the JVM tests now run on commodity boxes instead of bleeding-edge bare metal.
What I Would Do Differently
Two things. First, we should have measured direct memory from the start. A one-line shell script pointed at /proc//smaps would have told us within minutes that ByteBuffer.wrap() was the landlord. Second, we assumed the Rust borrow checker would replace all runtime bounds checks. It didn't: the first production crash came from a treasure-hunt tunnel whose S-expression parser didn't validate that a quoted string's length matched the declared chunk size. We fixed it by adding a custom derive for DeserializeSeed that returns a ParseError instead of panicking, but that lesson cost us 50 players and a public apology on Discord. Never trust the compiler to catch domain logic; it only checks memory safety.
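The missing check looks roughly like the sketch below. It is a plain function, not the DeserializeSeed derive we actually shipped, and the chunk format `<len>:"<text>"`, the `parse_chunk` name, and the `ParseError` variants are assumptions invented for illustration.

```rust
use std::fmt;

/// Hypothetical error type; the variants are made up for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    MissingSeparator,
    BadLength,
    Unquoted,
    LengthMismatch { declared: usize, actual: usize },
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::MissingSeparator => write!(f, "missing ':' after length"),
            ParseError::BadLength => write!(f, "declared length is not a number"),
            ParseError::Unquoted => write!(f, "chunk body is not a quoted string"),
            ParseError::LengthMismatch { declared, actual } => {
                write!(f, "declared {declared} bytes, found {actual}")
            }
        }
    }
}

impl std::error::Error for ParseError {}

/// Parse one chunk of the assumed form `<len>:"<text>"`.
/// Returns the text borrowed from `input`, or an error if the declared
/// length disagrees with the quoted string's byte length.
pub fn parse_chunk(input: &str) -> Result<&str, ParseError> {
    let (len_str, rest) = input
        .split_once(':')
        .ok_or(ParseError::MissingSeparator)?;
    let declared: usize = len_str.parse().map_err(|_| ParseError::BadLength)?;
    let body = rest
        .strip_prefix('"')
        .and_then(|r| r.strip_suffix('"'))
        .ok_or(ParseError::Unquoted)?;
    if body.len() != declared {
        return Err(ParseError::LengthMismatch { declared, actual: body.len() });
    }
    Ok(body)
}

fn main() {
    // Well-formed chunk: the declared length matches the quoted text.
    assert_eq!(parse_chunk("8:\"dig here\""), Ok("dig here"));

    // Malformed chunk: the caller gets an error value instead of a panic.
    assert_eq!(
        parse_chunk("99:\"dig here\""),
        Err(ParseError::LengthMismatch { declared: 99, actual: 8 })
    );
}
```

The borrow checker guarantees the returned `&str` never outlives `input`, but only the explicit comparison catches a length that lies. The caller can then drop the one bad clue rather than the whole shard.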