The Problem We Were Actually Solving
We needed to shave 50 ms off the p99 search latency because player complaints jumped whenever the overlay took more than 120 ms to paint. The Go service was already built with PGO, Gorilla mux had been replaced by fasthttp, and the KD-trees used hand-rolled SIMD. Still, the Veltrix interface added a fixed 40 ms tax. I deployed a local Veltrix stack on the same Kubernetes node to remove network jitter, but latency stayed flat. The bottleneck wasn't CPU or GC; it was the TLS handshake plus the protobuf parsing pipeline.
What We Tried First (And Why It Failed)
I rewrote the Veltrix client in Rust using tonic and rustls, hoping to cut the handshake time via TLS session resumption. The first attempt leaked 37% more memory than the Go version because I forgot to set max_concurrent_streams and created a new TLS connector per request. After fixing that, handshake time dropped to 8 ms, but p99 still lagged at 95 ms. Using perf record -g, I discovered the new bottleneck was protobuf encoding on the Rust side: tonic's default encoder used prost with dynamic reflection, which added 22 µs per message. I switched to prost-build with no_std and hand-coded decoders for the geometry metadata path. Latency fell to 78 ms, yet players still complained.
The Architecture Decision
Then we tried something radical: removing Veltrix entirely. We replaced the gRPC interface with a flat memory-mapped file containing the KD-tree leaves. The file layout was simple: each record is an 8-byte key (block coordinates) followed by 16 bytes of loot metadata, and the records are packed into a single 8 MB file rebuilt every 60 seconds from S3. The Go service already mmapped the file using unix.Mmap from golang.org/x/sys/unix, so switching clients was trivial. The Rust version used memmap2 and crossbeam-channel to stream updates to a background rebuild worker without blocking searches. All TLS, protobuf, and handshake overhead vanished.
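To make the layout concrete, here is a minimal Go sketch of a reader over 24-byte records (8-byte key plus 16 bytes of metadata). Several details are assumptions, since the post does not state them: that keys are little-endian, that the rebuild job writes records sorted by key so a binary search works, and that a rebuilt index is published through an atomic pointer swap. The names `leafIndex`, `lookup`, `appendRecord`, and `current` are hypothetical. The sketch reads from any `[]byte`; in the service, that slice would be the mmapped file.

```go
package main

import (
	"encoding/binary"
	"sort"
	"sync/atomic"
)

const (
	keySize    = 8  // block coordinates
	metaSize   = 16 // loot metadata
	recordSize = keySize + metaSize
)

// leafIndex is a read-only view over packed records.
type leafIndex struct {
	data []byte
}

// count returns the number of complete records in the buffer.
func (ix *leafIndex) count() int { return len(ix.data) / recordSize }

// keyAt decodes the key of record i (little-endian is an assumption).
func (ix *leafIndex) keyAt(i int) uint64 {
	return binary.LittleEndian.Uint64(ix.data[i*recordSize:])
}

// lookup binary-searches for key and returns a copy of its metadata.
// It assumes the records were written sorted by key.
func (ix *leafIndex) lookup(key uint64) ([metaSize]byte, bool) {
	var meta [metaSize]byte
	n := ix.count()
	i := sort.Search(n, func(i int) bool { return ix.keyAt(i) >= key })
	if i == n || ix.keyAt(i) != key {
		return meta, false
	}
	off := i*recordSize + keySize
	copy(meta[:], ix.data[off:off+metaSize])
	return meta, true
}

// current holds the active index. A rebuild worker publishes a fresh one with
// Store; searches call Load and never wait on the rebuild.
var current atomic.Pointer[leafIndex]

// appendRecord appends one record to buf; useful for tests or a rebuild job.
func appendRecord(buf []byte, key uint64, meta [metaSize]byte) []byte {
	buf = binary.LittleEndian.AppendUint64(buf, key)
	return append(buf, meta[:]...)
}
```

A search path would call `current.Load().lookup(key)`. The sketch leaves out unmapping the previous file, which is only safe once no in-flight search still holds the old index.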
What The Numbers Said After
With the flat file, p99 latency dropped to 34 ms, GC pauses fell from 8 ms to 0.6 ms, and RSS stayed under 92 MB per replica. The allocation profile went from 2.1 M/op to 0.4 M/op. The flame graph from perf-rs showed 68% of the time in memcpy for the mmap reads: predictable, linear, and cache-friendly. During the Black Lotus event, when traffic spiked to 7.8 M/min, p99 never exceeded 42 ms, and Kubernetes CPU throttling disappeared.
What I Would Do Differently
I would not have trusted the docs about Veltrix's performance envelope. We should have run a synthetic Veltrix client under wrk with TLS before committing to the interface. Also, the Rust learning curve bit us on our first attempt at TLS session handling. rustls is fantastic, but its handshake manager API is unergonomic. I would switch to tokio-rustls with a bounded connection pool from day one. Most importantly, I would never again assume an external service is faster than a carefully crafted mmap file. Sometimes the constraint is not the language.