The Day Our Cache Became the Enemy: How We Broke a 400ms Latency Floor in the Hytale Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The treasure hunt engine spends 68% of its CPU cycles inside a single function, resolve_drop_candidates. It walks a tree of loot tiers, then does a binary search inside each tier's drop table. The drop tables are 1.2 MB each, stored as a flat Vec<u64> with an inline prefix of (min, max, offset) tuples.
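To make the layout concrete, here is a minimal sketch of a flat table with a (min, max, offset) prefix and a two-stage binary search. The exact word layout, the DropTable type, and the from_ranges and find names are assumptions for illustration, not the engine's actual code.

```rust
/// Hypothetical flat drop table. Assumed layout (not from the real engine):
/// data[0] is the number of ranges r, followed by r (min, max, offset)
/// triples, followed by the sorted drop ids of every range back to back.
/// `offset` is the absolute index in `data` where that range's ids start.
pub struct DropTable {
    data: Vec<u64>,
}

impl DropTable {
    /// Build from non-empty, sorted ranges given in ascending order.
    pub fn from_ranges(ranges: &[Vec<u64>]) -> Self {
        let r = ranges.len();
        let header = 1 + 3 * r;
        let total: usize = ranges.iter().map(Vec::len).sum();
        let mut data = Vec::with_capacity(header + total);
        data.push(r as u64);
        let mut offset = header;
        for ids in ranges {
            assert!(!ids.is_empty(), "ranges must be non-empty");
            data.push(ids[0]);
            data.push(ids[ids.len() - 1]);
            data.push(offset as u64);
            offset += ids.len();
        }
        for ids in ranges {
            data.extend_from_slice(ids);
        }
        DropTable { data }
    }

    /// Binary search the prefix for the range covering `id`, then binary
    /// search that range's slice. Returns the index of `id` in the table.
    pub fn find(&self, id: u64) -> Option<usize> {
        let r = self.data[0] as usize;
        let prefix = &self.data[1..1 + 3 * r];
        let (mut lo, mut hi) = (0usize, r);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let (min, max) = (prefix[3 * mid], prefix[3 * mid + 1]);
            if id < min {
                hi = mid;
            } else if id > max {
                lo = mid + 1;
            } else {
                let start = prefix[3 * mid + 2] as usize;
                // A range ends where the next one starts, or at the end of the table.
                let end = if mid + 1 < r {
                    prefix[3 * (mid + 1) + 2] as usize
                } else {
                    self.data.len()
                };
                return self.data[start..end]
                    .binary_search(&id)
                    .ok()
                    .map(|i| start + i);
            }
        }
        None
    }
}
```

Both searches stay inside one allocation, which is why the access pattern is so sensitive to what else lives in the cache.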

We had benchmarked this on synthetic data in December and hit 2.1 µs median latency. In production, under live load, it climbed to 400 ms and never came back. The logs showed no GC, no page faults, no context switches. The only anomaly was the 400 ms floor.

What We Tried First (And Why It Failed)

First, we blamed the allocator. We switched from jemalloc to mimalloc, hoping for fewer splits. The miss rate dropped to 840 K, but the 400 ms floor remained. Then we tried huge pages for the drop tables. The latency histogram shifted left, but the tail never dropped below 350 ms—still unacceptable for a real-time game loop.

Next, we introduced a Bloom filter before the binary search. We saved 31% of lookups, but the misses that did occur still paid the same cache penalty. The profiler showed the Bloom filter itself was now polluting the cache. At 32 KB it was the perfect size to evict one cacheline's worth of drop table metadata on every access.
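For readers who haven't put a Bloom filter in front of a search before, a minimal sketch of the idea follows. The Bloom type, the double-hashing scheme, the choice of DefaultHasher, and the lookup helper are assumptions for illustration; only the 32 KB size comes from our setup.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical Bloom filter over drop ids.
pub struct Bloom {
    bits: Vec<u64>,
    k: u32, // number of probe positions per id
}

/// Derive k bit positions from one 64-bit hash via double hashing.
fn positions(id: u64, nbits: u64, k: u32) -> impl Iterator<Item = usize> {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    let h1 = hasher.finish();
    let h2 = h1.rotate_left(32) | 1; // odd stride, derived from h1
    (0..u64::from(k)).map(move |i| (h1.wrapping_add(i.wrapping_mul(h2)) % nbits) as usize)
}

impl Bloom {
    pub fn with_bytes(bytes: usize, k: u32) -> Self {
        assert!(bytes >= 8 && k > 0);
        Bloom { bits: vec![0; bytes / 8], k }
    }

    fn nbits(&self) -> u64 {
        (self.bits.len() * 64) as u64
    }

    pub fn insert(&mut self, id: u64) {
        let (nbits, k) = (self.nbits(), self.k);
        for p in positions(id, nbits, k) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// False means "definitely absent"; true means "maybe present".
    pub fn may_contain(&self, id: u64) -> bool {
        positions(id, self.nbits(), self.k)
            .all(|p| (self.bits[p / 64] & (1u64 << (p % 64))) != 0)
    }
}

/// Skip the binary search when the filter rules the id out.
pub fn lookup(filter: &Bloom, sorted_ids: &[u64], id: u64) -> Option<usize> {
    if !filter.may_contain(id) {
        return None;
    }
    sorted_ids.binary_search(&id).ok()
}
```

Each may_contain call touches up to k scattered cachelines of the filter, which is consistent with the pollution the profiler showed: the filter saves work only when those lines are cheaper than the search they replace.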

We even tried hand-rolling a 256-byte arena inside the hot drop path. The allocator noise vanished, but we crashed the server when two threads raced to extend the arena past its guard page. Rust's borrow checker had protected us from data races, but it couldn't protect us from our own unsafe block.

The Architecture Decision

I flew to Amsterdam and sat in a room with the Veltrix team for three days. We measured cache-references and cache-misses for every allocator, every page size, every prefetch hint. The clear inflection point came when we ran perf c2c and saw true sharing on the Vec's internal pointer. Two threads in the same process kept bouncing the cacheline that held the drop table's ptr, len, and cap.

We decided to split the engine process into two: a high-priority resolver that owned the drop tables in a single contiguous chunk, and a low-priority configurator that rebuilt that chunk asynchronously. The resolver ran in a thread pinned to CPU 1 with pthread_setaffinity_np, while the configurator ran on CPU 3. We connected the two with a crossbeam::channel backed by a no-copy buffer of 128 MiB so the resolver never touched the allocator mid-game.
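The ownership split can be sketched roughly as follows. To keep the example dependency-free it uses std::sync::mpsc as a stand-in for crossbeam::channel and leaves out CPU pinning; the Resolver type and spawn_configurator function are hypothetical names, not our production code.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Hypothetical resolver: the only owner of the live drop table.
pub struct Resolver {
    table: Box<[u64]>,
    updates: Receiver<Box<[u64]>>,
}

impl Resolver {
    pub fn new(mut initial: Vec<u64>, updates: Receiver<Box<[u64]>>) -> Self {
        initial.sort_unstable();
        Resolver { table: initial.into_boxed_slice(), updates }
    }

    /// Swap in the newest rebuilt table, if one is waiting. Receiving a Box
    /// moves only a pointer and a length, so the table data is never copied.
    /// Note: dropping the old table here still frees memory on this thread;
    /// a stricter version would hand it back to the configurator instead.
    pub fn poll_update(&mut self) -> bool {
        let mut swapped = false;
        while let Ok(next) = self.updates.try_recv() {
            self.table = next;
            swapped = true;
        }
        swapped
    }

    pub fn contains(&self, id: u64) -> bool {
        self.table.binary_search(&id).is_ok()
    }
}

/// Hypothetical configurator: rebuilds a table off the hot path and ships it.
pub fn spawn_configurator(
    tx: Sender<Box<[u64]>>,
    mut ids: Vec<u64>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        ids.sort_unstable();
        ids.dedup();
        // Ignore the error if the resolver has already shut down.
        let _ = tx.send(ids.into_boxed_slice());
    })
}
```

The point of the design is that the resolver only ever reads a table it owns outright, so there is no shared ptr, len, or cap for another thread to write.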

For the allocator itself we switched to tcmalloc's huge page support. The 2 MB pages reduced TLB misses from 1200 to 48 per 10k lookups. We also changed the drop table representation from Vec<u64> to a bespoke DenseMap<u32, u32> that used 8-byte sentinel values to pad to 64-byte boundaries. A single nightly build with miri caught three potential iterator invalidation paths before we even ran the game.
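A rough sketch of what a cacheline-sized bucket can look like is below. The seven-slot bucket, the u32::MAX empty marker, the multiplicative hash, and the linear probing are all assumptions made for illustration; the real DenseMap differs in its details.

```rust
const SLOTS: usize = 7;
const EMPTY: u32 = u32::MAX; // reserved key marking a free slot (assumption)
const PAD_SENTINEL: u64 = u64::MAX; // 8-byte pad that fills the bucket to 64 bytes

/// One bucket per cacheline: 7 keys (28 B) + 7 values (28 B) + 8 B pad = 64 B.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
struct Bucket {
    keys: [u32; SLOTS],
    values: [u32; SLOTS],
    #[allow(dead_code)]
    pad: u64,
}

/// Hypothetical insert-only map from u32 to u32 (no removal in this sketch).
pub struct DenseMap {
    buckets: Vec<Bucket>,
}

impl DenseMap {
    pub fn with_buckets(n: usize) -> Self {
        assert!(n.is_power_of_two());
        let empty = Bucket { keys: [EMPTY; SLOTS], values: [0; SLOTS], pad: PAD_SENTINEL };
        DenseMap { buckets: vec![empty; n] }
    }

    fn home(&self, key: u32) -> usize {
        ((key.wrapping_mul(0x9E37_79B9) >> 16) as usize) & (self.buckets.len() - 1)
    }

    /// Insert or overwrite. Returns false when every slot is taken.
    pub fn insert(&mut self, key: u32, value: u32) -> bool {
        assert_ne!(key, EMPTY, "u32::MAX is reserved as the empty marker");
        let n = self.buckets.len();
        let home = self.home(key);
        for step in 0..n {
            let b = &mut self.buckets[(home + step) & (n - 1)];
            for i in 0..SLOTS {
                if b.keys[i] == key || b.keys[i] == EMPTY {
                    b.keys[i] = key;
                    b.values[i] = value;
                    return true;
                }
            }
        }
        false
    }

    /// Probe bucket by bucket; the first empty slot ends the search.
    pub fn get(&self, key: u32) -> Option<u32> {
        if key == EMPTY {
            return None;
        }
        let n = self.buckets.len();
        let home = self.home(key);
        for step in 0..n {
            let b = &self.buckets[(home + step) & (n - 1)];
            for i in 0..SLOTS {
                if b.keys[i] == key {
                    return Some(b.values[i]);
                }
                if b.keys[i] == EMPTY {
                    return None;
                }
            }
        }
        None
    }
}
```

Because each bucket is exactly one aligned cacheline, a lookup that hits its home bucket touches a single line, and no bucket straddles two.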

What The Numbers Said After

Production traffic at 1.4 M concurrent hunters showed:

median latency: 1.8 µs
95th percentile latency: 2.9 µs
99.9th percentile latency: 32.1 µs
cache misses: 1.4 → 0.3 per lookup
TLB misses: 1200 → 48 per 10k lookups
allocator calls: 4.2 → 0.0 per hunt

The 400 ms floor vanished completely. The resolver process used 120 MB RSS and 0.8% CPU, while the configurator idled at 0.3%. We sustained 60k drops per second without a single timeout over the next 30 days.

What I Would Do Differently

I should have measured cache-line utilization on day one. The profiler stack we used (perf, cachegrind, valgrind) didn't expose cacheline ping-pong until we added perf c2c. If I'd put the Vec's header (ptr, len, cap) inside a simple #[repr(align(64))] wrapper in the first week, we could have saved three months of fire-fighting.
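For reference, the attribute can't be applied to a field inside Vec itself, so in practice it goes on a wrapper type. A minimal sketch, with CacheAligned and EngineState as hypothetical names:

```rust
use std::sync::atomic::AtomicU64;

/// Hypothetical wrapper: forces its contents onto their own 64-byte cacheline.
#[repr(align(64))]
pub struct CacheAligned<T>(pub T);

/// Hypothetical shared state: the Vec header (ptr, len, cap) and a hot counter
/// no longer sit on the same line, so writes to one don't invalidate the other.
pub struct EngineState {
    pub drop_table: CacheAligned<Vec<u64>>,
    pub hunts_served: CacheAligned<AtomicU64>,
}

impl EngineState {
    pub fn new(table: Vec<u64>) -> Self {
        EngineState {
            drop_table: CacheAligned(table),
            hunts_served: CacheAligned(AtomicU64::new(0)),
        }
    }
}
```

One caveat: alignment only isolates the header from its neighbours. If two threads write the header itself, the line still bounces, and the fix has to be an ownership change like the resolver and configurator split.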

I also regret pushing unsafe too early. The arena code worked, but it introduced a class of memory-safety bugs that only showed up under 64-thread load. In hindsight, we should have stuck with Rust's safe abstractions and paid the allocator cost with arc-swap and crossbeam channels. Safety wasn't the bottleneck; scalability was.