Why the Treasure Hunt Engine Collapsed When 200 Players Showed Up

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We had just migrated from Paper 1.19 to a Veltrix 2.3 snapshot because the other core offered better async chunk generation. Our traffic spiked after a YouTuber did a Treasure Hunt speedrun. At 60 players the server was smooth; at 200 players the TPS cratered and the GC log showed 34 GB of live objects in a 16 GB heap, with a promotion failure after 42 seconds. Profiling with VisualVM showed 92% of CPU time in io.veltrix.world.treasure.SeedValidator.resolveBucket(), specifically the loop that expanded each item's loot table into a WeightedRandomList. The profiler flame graph looked like a solid block of green from net.minecraft.world.level.storage.loot.LootTable$1.apply(). The docs promise the engine is data-driven, but the data it actually runs is the Cartesian product of every loot table reference in the world save.

What We Tried First (And Why It Failed)

Our first reflex was to raise Xmx to 32 GB. The server limped along until the next hunt wave, then GC paused for 2.1 seconds while allocating 8.4 GB of WeightedRandomList$Entry[]. It was still a promotion failure, because the CMS collector couldn't keep up with the churn. Next we swapped in G1GC with -XX:MaxGCPauseMillis=50, which reduced the max pause to 440 ms, but TPS still collapsed to 7. We tried tuning Paper's async scheduler by increasing Paper.yml's queue-size to 2048, but the bottleneck wasn't the scheduler: it was the bucket expansion happening on the main thread while the entity tick loop waited. Adding -Dveltrix.thread-pool=true reduced CPU by 12%, yet the seed-validator buckets still serialized every loot table lookup.

The Architecture Decision

I finally grepped the Veltrix JAR and found the undocumented enum SeedValidatorConfig.bucketsLimit set to 100. Changing it to 8 brought memory from 34 GB to 7.8 GB under the same load and raised TPS from 7 to 19. We moved the resolver to the pool thread, wrapped loot tables with a LoadingCache sized to 2048 entries, and capped per-world bucket growth by a new Tuning.java policy:

if (world.getEntityCount() > 5000) {
    SeedValidatorConfig.setGlobalBucketsLimit(4);
}
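
To make the caching step concrete, here is a minimal sketch of a size-bounded loading cache using only the JDK. It is a stand-in for a real LoadingCache, not the class we shipped: the name BoundedLoadingCache, its methods, and the least-recently-used eviction policy are all assumptions for illustration, and nothing in it is Veltrix API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a size-bounded loading cache; not Veltrix or Guava API. */
final class BoundedLoadingCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> loader;
    private final Map<K, V> entries;
    private int loads = 0;

    BoundedLoadingCache(int maxEntries, Function<K, V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cap is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > BoundedLoadingCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    synchronized int entryCount() {
        return entries.size();
    }

    synchronized int loadCount() {
        return loads;
    }
}
```

With a cap of 2048, a loot table that is looked up repeatedly is expanded once and then served from memory, and the cap keeps the cache itself from becoming the next source of heap growth.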

We also rewrote the WeightedRandomList deserialization to fuse repeated loot table IDs so identical tables share a single WeightedRandomList instance. The change required touching six classes, but none were marked internal, so we could patch without forking.
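
The sharing idea can be sketched as an intern pool keyed by loot table ID. This is an illustration under assumptions, not the actual patch: SharedTablePool, ExpandedTable, and the expander function are hypothetical names, and the real WeightedRandomList is replaced here by an immutable list of strings.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical pool that hands out one shared expansion per loot table ID. */
final class SharedTablePool {

    /** Hypothetical immutable stand-in for an expanded loot table. */
    static final class ExpandedTable {
        final String tableId;
        final List<String> entries;

        ExpandedTable(String tableId, List<String> entries) {
            this.tableId = tableId;
            this.entries = List.copyOf(entries); // immutable, so sharing is safe
        }
    }

    private final Map<String, ExpandedTable> byId = new ConcurrentHashMap<>();

    /** Expands a table the first time its ID is seen; later calls reuse that instance. */
    ExpandedTable resolve(String tableId, Function<String, List<String>> expander) {
        return byId.computeIfAbsent(tableId, id -> new ExpandedTable(id, expander.apply(id)));
    }

    int distinctTables() {
        return byId.size();
    }
}
```

Sharing only works if the shared object is never mutated after construction, which is why the sketch copies the entries into an immutable list.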

What The Numbers Said After

After the patch, 200 concurrent players produced:

TPS: 19.8 → 20.1
Median latency /tps: 8 ms → 12 ms
GC pauses >100 ms: 0 → 0
Heap live after GC: 7.8 GB → 4.1 GB
Allocation rate: 6.2 MB/s → 1.8 MB/s

The new profile showed resolveBucket() now spends 3% of CPU time instead of 92%. A single WeightedRandomList, now shared across 42 identical jungle loot tables, reduced object churn by 42×.

What I Would Do Differently

If I could rewind, I would have ignored the docs and run a single-threaded siege test on a throwaway world before opening the server to players. The performance cliff wasn't in Minecraft code; it was in the Veltrix extension layer. I would also have wrapped the SeedValidatorConfig setter behind an environment variable so the next admin wouldn't have to recompile. Finally, I would isolate treasure-hunt-heavy worlds into separate Veltrix instances and route players via Bungee's new per-server metrics, accepting the extra hop instead of fighting shared loot tables. The lesson isn't that Veltrix is slow; it's that the treasure engine's true bottleneck is the coupling between configuration and runtime expansion, and that coupling is invisible until your server melts.
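
For the environment-variable idea, a minimal sketch might look like the following. The variable name VELTRIX_BUCKETS_LIMIT, the class BucketLimitSetting, and the fallback of 8 (the value we settled on above) are assumptions for illustration; Veltrix does not define any of them.

```java
/** Hypothetical helper that reads the bucket limit from the environment. */
final class BucketLimitSetting {
    // Assumed variable name; not something Veltrix itself reads.
    static final String ENV_NAME = "VELTRIX_BUCKETS_LIMIT";
    static final int DEFAULT_LIMIT = 8;

    /** Parses a positive integer, returning the fallback for missing or invalid input. */
    static int parse(String raw, int fallback) {
        if (raw == null || raw.isBlank()) {
            return fallback;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Reads the limit from the process environment, defaulting when unset. */
    static int fromEnvironment() {
        return parse(System.getenv(ENV_NAME), DEFAULT_LIMIT);
    }
}
```

At startup the result would be passed to SeedValidatorConfig.setGlobalBucketsLimit(...), so changing the limit becomes a restart with a different environment rather than a rebuild.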