pretty ncube

Treasure Hunt Engine: Why the Veltrix Runtime Was Our Second-Best Idea

The Problem We Were Actually Solving

In late 2024 we deployed a live treasure-hunt engine for Hytale players that crunched 80k concurrent state updates per second on a six-node Kubernetes cluster. The hunt graph used 47 million dynamic edges with real-time pathfinding, and each player's experience had to be deterministic so we could roll back micro-forks in under 200 ms. We started the search service in Go 1.21 using Veltrix as our in-memory event bus because their docs promised sub-30 µs publish latency and 1 GB/s throughput. Four weeks in, at 60 % player load, the veltrix-broker pods began getting OOM-killed every 45 minutes. The flame graph from pprof showed 38 % of CPU time spent in runtime.gcBgMarkWorker even though we had set GOGC to 10. That was the moment I understood the language and runtime were the constraint, not the network.

What We Tried First (And Why It Failed)

We attacked the symptom first: we tuned GOGC lower, sharded the broker from 3 to 12 partitions, and added jemalloc as a drop-in replacement allocator. None of it mattered. The garbage collector still paused the event loop long enough for the Kubernetes liveness probe to fire, causing a rolling restart that blew away 15 % of the in-flight hunt state. When we finally straced the broker we saw 2.3 million malloc calls per second. At that rate, even a perfect allocator would contend on the heap lock. We tried replacing Veltrix with NATS JetStream and got the same tail latency spike, only this time the broker was written in Rust and used zero-copy framing. That told me the issue wasn't the broker library; it was the GC.

The Architecture Decision

We faced a binary choice: either stay on Go and fight the GC every time we doubled players, or move the critical path to Rust and trust the borrow checker to keep allocation counts at zero for the hot loop. I chose Rust because the borrow checker was designed for exactly this: keeping memory per event to two cache lines with zero heap traffic. The new design placed a 400-line Rust actor named hunt_graph in each partition pod. It received player moves over QUIC channels, ran Dijkstra on a packed array of u32 indices, and emitted new edge weights with drop(glue::Message). The actor's peak RSS was 32 MB regardless of load because it recycled its internal buffers in a single allocator arena. On the Go side we kept the Veltrix proxy for discovery and fan-out, but the hot path never touched the Go heap again.
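To make the "Dijkstra on a packed array of u32 indices" idea concrete, here is a minimal sketch of what such a hot loop could look like. It is not the hunt_graph actor itself: the names PackedGraph, Scratch, and shortest_paths are hypothetical, and the compressed-sparse-row layout is an assumption about what "packed" means here. The point it illustrates is that every buffer is sized once up front and then reused, so a query does not need to grow the heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical packed graph in compressed sparse row (CSR) form.
/// All edges live in flat arrays indexed by u32.
pub struct PackedGraph {
    /// Edges of node v occupy offsets[v]..offsets[v + 1] in the arrays below.
    offsets: Vec<u32>,
    targets: Vec<u32>,
    weights: Vec<u32>,
}

impl PackedGraph {
    /// Builds the packed form from (from, to, weight) triples.
    pub fn from_edges(node_count: usize, edges: &[(u32, u32, u32)]) -> Self {
        // Count outgoing edges per node, then prefix-sum into offsets.
        let mut offsets = vec![0u32; node_count + 1];
        for &(from, _, _) in edges {
            offsets[from as usize + 1] += 1;
        }
        for v in 0..node_count {
            offsets[v + 1] += offsets[v];
        }
        // Scatter each edge into the next free slot of its source node.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        let mut weights = vec![0u32; edges.len()];
        for &(from, to, w) in edges {
            let slot = cursor[from as usize] as usize;
            targets[slot] = to;
            weights[slot] = w;
            cursor[from as usize] += 1;
        }
        PackedGraph { offsets, targets, weights }
    }

    pub fn node_count(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Scratch buffers allocated once per graph and reused for every query.
pub struct Scratch {
    dist: Vec<u64>,
    heap: BinaryHeap<Reverse<(u64, u32)>>,
}

impl Scratch {
    /// Must be created from the same graph it is later used with.
    pub fn new(graph: &PackedGraph) -> Self {
        Scratch {
            dist: vec![u64::MAX; graph.node_count()],
            // Each edge causes at most one push, plus one for the source,
            // so this capacity is never exceeded during a query.
            heap: BinaryHeap::with_capacity(graph.targets.len() + 1),
        }
    }
}

/// Dijkstra with lazy deletion; unreachable nodes keep u64::MAX.
pub fn shortest_paths<'a>(
    graph: &PackedGraph,
    scratch: &'a mut Scratch,
    source: u32,
) -> &'a [u64] {
    scratch.dist.fill(u64::MAX);
    scratch.heap.clear(); // keeps the capacity, drops the contents
    scratch.dist[source as usize] = 0;
    scratch.heap.push(Reverse((0, source)));

    while let Some(Reverse((d, v))) = scratch.heap.pop() {
        if d > scratch.dist[v as usize] {
            continue; // stale entry, a shorter path was already settled
        }
        let start = graph.offsets[v as usize] as usize;
        let end = graph.offsets[v as usize + 1] as usize;
        for i in start..end {
            let next = graph.targets[i];
            let candidate = d + graph.weights[i] as u64;
            if candidate < scratch.dist[next as usize] {
                scratch.dist[next as usize] = candidate;
                scratch.heap.push(Reverse((candidate, next)));
            }
        }
    }
    &scratch.dist
}
```

A caller would build the graph and one Scratch at startup, then call shortest_paths once per player move. Because the heap is pre-sized to the edge count plus one, repeated queries reuse the same memory rather than reallocating.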

What The Numbers Said After

We rebuilt the search plane in December 2024. The migration took six days with two engineers, most of it spent proving the Rust code commuted with the Go side under Jepsen-style partial-sync tests. After cut-over:

  • 99th-percentile end-to-end hunt update latency dropped from 180 µs to 42 µs (measured with OpenTelemetry histograms on the QUIC stream).
  • Tail GC pauses in Go disappeared; the broker's p99 GC time fell from 22 ms to 1.2 ms.
  • RSS per hunt_graph actor stayed flat at 32 MB while player count grew from 80k to 150k.
  • Allocation count in the Rust actor was zero after boot because we used a static arena of 256 kB, checked with Valgrind massif: heap_tree=empty.
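The "static arena" in the last bullet can be pictured as a fixed-capacity bump allocator: one block is obtained at boot, slices of it are handed out by advancing an offset, and the whole block is recycled by resetting that offset. The sketch below is an assumption about the general technique, not the actor's real code; the Arena type and its method names are hypothetical, and only the 256 kB figure comes from the numbers above.

```rust
use std::ops::Range;

/// Hypothetical fixed-capacity bump arena: one allocation at startup,
/// none afterwards.
pub struct Arena {
    buf: Box<[u8]>,
    used: usize,
}

impl Arena {
    /// Allocates the backing block once.
    pub fn with_capacity(bytes: usize) -> Self {
        Arena {
            buf: vec![0u8; bytes].into_boxed_slice(),
            used: 0,
        }
    }

    /// Reserves `len` bytes and returns their range inside the arena,
    /// or None when the arena is full. Never grows the backing block.
    pub fn reserve(&mut self, len: usize) -> Option<Range<usize>> {
        let end = self.used.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        let range = self.used..end;
        self.used = end;
        Some(range)
    }

    /// Mutable view of a previously reserved range.
    pub fn bytes_mut(&mut self, range: Range<usize>) -> &mut [u8] {
        &mut self.buf[range]
    }

    /// Recycles the whole arena; old ranges must not be used afterwards.
    pub fn reset(&mut self) {
        self.used = 0;
    }

    pub fn used(&self) -> usize {
        self.used
    }
}
```

In a design like this, the actor would call reset at the start of each event and reserve whatever scratch space that event needs. Returning a range rather than a reference keeps the borrow checker satisfied when several reservations are live at once, at the cost of not catching stale ranges after a reset.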

The real surprise came in the rollback test. When we forked the hunt state to replay a corrupted branch, the Rust pathfinder completed the full graph diff in 178 ms ± 3 ms—fast enough to satisfy our 200 ms SLA. The Go version had taken 610 ms and leaked 4 MB every run.

What I Would Do Differently

I should have measured allocation rate before adopting Veltrix. A quick pprof -alloc_space would have shown 11 million bytes allocated every second in the prototype; that single number would have killed the idea on the spot. Also, I would insist on a Rust-based integration test harness from day one. Our Go tests mocked the Rust actor, so we didn't catch a subtle ordering bug until production, where 0.03 % of edge traversals came back in the wrong order. If we had run the Rust actor in the same process during CI with cargo test -- --nocapture, we would have caught it three weeks earlier. Finally, we over-optimized the QUIC channel setup; we spent two weeks reducing TLS handshake time from 7 ms to 3 ms, but the real bottleneck was the allocator stalls, which changed the problem domain entirely. Next time I'll profile first, rotate languages second, and shave milliseconds only after the heap stops moving.



