The Moment the Runtime Was the Constraint in Hytrix

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The treasure-hunt engine had to keep state for each actor: current tile, inventory, active effects, and a position in a global path cache.
In Go we modeled that with a 48-byte struct and 8-byte atomic pointers, hoping escape analysis would stack-allocate everything.
Our SLA required p95 < 300 ms during the first 60 s of a hunt window, otherwise players would abandon and leave us with 8× more server load in the next patch.
Under synthetic load in Veltrix we hit the wall when the runtimes scheduler preempted goroutines while they were still writing to the shared path cache.
Perf showed 16 % of the latency spike came from CAS retries on sync.Map and another 23 % from GC sweeping the same hot structs.
We could have tuned GC GOGC or added more P, but the fundamental issue was that Gos two-state scheduler (M and G) could not guarantee deterministic pre-emption slots.
We needed either predictable pre-emption or a different runtime.

What We Tried First (And Why It Failed)

We rewrote the hot path in Rust first, keeping the Go APIs so we could do a staged rollout.
Using Tokio 1.27 with a multi-threaded runtime we pinned the treasure-hunt handler to a single worker thread to minimize cross-thread cache misses.
The allocator was still jemalloc, same as Go, so we compared RSS under /usr/bin/time -v.
RSS dropped from 1.4 GB to 820 MB at 1 200 actors, but the p99 still oscillated between 800 ms and 2 s.
Running tokio-console revealed that every time the OS jittered the worker thread, the futures were being re-polld from the middle, causing partial path-cache rebuilds.
We tried disabling the OS jitter with taskset, but that broke the nginx ingress CPU balancing and we lost 15 % request throughput.
We switched to Rusts native green-thread model with async-std on async-io with SO_REUSEPORT, and the p99 steadied at 210 ms, but now the allocator was fire-hosing 2.1 million Vec reallocations per second.
Thats when we realised the runtime scheduler was not the only bottleneck – the allocator strategy was.

The Architecture Decision

We abandoned green-thread runtimes altogether and rewrote the engine in Rust using only synchronous, single-threaded code with a manually-tuned mimalloc allocator.
The synchronous executor ran on a dedicated logical core with taskset -c 3, and we used crossbeam-queue for lock-free work-stealing between the hunt tick loop and the global path cache.
The key trick was to allocate the actor state once during onboarding and reuse it for the entire hunt window.
We measured alloc throughput with jemalloc --enable-prof and found that Go was doing 700 kB/s of nursery allocations, while the synchronous Rust version stayed under 12 kB/s.
The GC pressure graph in Go looked like a mountain range; the Rust version was a flat line.
We deployed the new engine to a 20 % canary in Veltrix and watched the p99 hunt-completion time drop from 2 100 ms to 180 ms.
The memory profile showed zero generational GC pauses – the RSS stabilised at 380 MB even under 3 000 actors.

What The Numbers Said After

Here are the concrete deltas from a one-hour load test at 2 500 actors, normalized to the Go baseline:

Metric	Go 1.21 (baseline)	Rust sync + mimalloc	Delta
RSS at peak	1.9 GB	390 MB	–80 %
p50 hunt-complete	210 ms	160 ms	–24 %
p99 hunt-complete	1 800 ms	175 ms	–90 %
Alloc rate	700 kB/s	11 kB/s	–98 %
GC pauses (> 10 ms)	47 / min	0 / min	–100 %
Mean scheduler delay	110 µs	18 µs	–84 %

Perf recorded 32 % less L1 cache misses in the Rust build because the jemalloc arena did not thrash.
The single-threaded scheduler eliminated cross-core cache-line bouncing and the manual allocator removed the nursery sweep entirely.
We also disabled jemallocs background thread (MALLOC_CONF background_thread:false) which shaved another 30 ms of mean latency by removing one context switch per hunt tick.

What I Would Do Differently

I would not have trusted green-thread runtimes for latency-sensitive state machines.
Green threads promise concurrency without complexity, but they do not give you control over pre-emption or allocation cadence.
The moment the working set grew beyond the nursery threshold, Gos GC became the dominant latency source.
We should have benchmarked allocator behaviour on day one instead of treating it as an afterthought.
If I had run jemalloc --profile.active true against our Go allocator under load, we would have seen the 700 kB/s nursery churn immediately.
Also, we treated Rusts safety guarantees as a bonus rather than a correctness requirement.
Under heavy load the Go code panicked during a race between path-cache updates and actor state snapshots.
Switching to Rust caught that class of bug automatically, but we paid for it in compile-time feedback loops.
The borrow checker flag