The Problem We Were Actually Solving
The treasure hunt engine drove the Veltrix operator experience: 400 concurrent players searching a 200 m² warehouse floor, updating a shared map every 50 ms, and resolving collisions when two players reached the same clue within 2 cm of their BLE beacons. We needed sub-10 ms end-to-end latency per state update to keep the map in sync with player motion. The first version used a mutex-protected in-memory grid plus two channels: chan searcherUpdate for producers and chan coordinatorCommand for consumers. Profiling revealed that the bottleneck was never the CPU; it was the scheduler.
What We Tried First (And Why It Failed)
We tried per-core ring buffers, lock-free SPSC queues implemented with atomic.Value, and even a C++ shim invoked via cgo. Each replacement shaved off 4–6 ms, but we still suffered from:
- Goroutine wake-up storms: up to 64 players hitting the same clue triggered 64 goroutines to wake, re-evaluate, and block on the mutex. The Go scheduler's work-stealing policy caused 700 µs context-switch spikes visible in go tool trace.
- False sharing: the grid cells were 40 bytes each, fitting in one cache line, so neighboring cells shared lines. Multiple goroutines writing adjacent cells caused a 16 % slowdown regardless of lock granularity.
- Allocator churn: every round ended with a make([]update, 0, 1024) allocation, and the resulting GC cycle paused the entire process for 2.3 ms. Heap profiles showed 78 % of objects were empty struct{} sentinels we used to close channels.
None of these were Go's fault; they were symptoms of choosing the wrong abstraction for coordination.
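To make the false-sharing point concrete, here is a small sketch of the arithmetic. It assumes 64-byte cache lines (typical for x86-64, but not stated above) and cells packed back to back from a line-aligned base address. The names lines_of and shares_line are illustrative only.

```rust
/// Assumed cache-line size in bytes (typical for x86-64; an assumption).
pub const CACHE_LINE: usize = 64;
/// Grid cell size in bytes, as reported above.
pub const CELL_SIZE: usize = 40;

/// Returns the first and last cache-line index touched by cell `i`,
/// assuming cells are packed contiguously from a line-aligned base.
pub fn lines_of(i: usize) -> (usize, usize) {
    let start = i * CELL_SIZE;
    let end = start + CELL_SIZE - 1; // last byte of the cell
    (start / CACHE_LINE, end / CACHE_LINE)
}

/// True when cells `a` and `b` touch at least one common cache line.
pub fn shares_line(a: usize, b: usize) -> bool {
    let (a_first, a_last) = lines_of(a);
    let (b_first, b_last) = lines_of(b);
    a_first <= b_last && b_first <= a_last
}
```

Under these assumptions cell 1 straddles lines 0 and 1, so it shares a line with both of its neighbors. A write to any of the three can invalidate a line another core is using, no matter how fine-grained the locks are.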
The Architecture Decision
We rewrote the coordination layer in Rust 1.78 using a crossbeam_queue::SegQueue for the per-core work queues and an Arc<ShardedGrid> protected by seqlocks. The key moves were:
- Replace channels with lock-free queues: a Go chan has a fixed buffer and incurs scheduler overhead; SegQueue in Rust is a lock-free MPMC queue with no wake-ups or scheduler intervention. The p99 latency dropped to 5.8 ms on the same hardware.
- Move the grid to a NUMA-aware shard: grid cells were 40 bytes, so each fit in one cache line. We sharded by (x>>3, y>>3), giving 64 shards. False sharing vanished and cache misses fell from 12 % to 2 % according to perf c2c report.
- Eliminate sentinel allocations: Rust's Option<Message> meant no empty structs. The allocation rate dropped from 2.1 M/sec to 180 K/sec, and GC pauses disappeared.
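The sharding step can be sketched roughly as follows. This is not the actual ShardedGrid code. It assumes a 64 x 64 cell grid (which is what 64 shards of 8 x 8 cells implies), a hypothetical 40-byte Cell layout, and that each shard stores its cells contiguously and is aligned to an assumed 64-byte cache line.

```rust
/// Assumed grid side length in cells (hypothetical: 64 x 64 cells).
pub const GRID_SIDE: usize = 64;
/// Shards per row: 64 >> 3 = 8, so 8 * 8 = 64 shards in total.
pub const SHARDS_PER_ROW: usize = GRID_SIDE >> 3;

/// Hypothetical 40-byte cell; the real field layout is not given in the post.
#[derive(Clone, Copy, Default)]
pub struct Cell {
    pub payload: [u64; 5],
}

/// One shard holds an 8 x 8 block of cells, aligned to an assumed 64-byte
/// cache line so that no two shards ever share a line.
#[repr(align(64))]
pub struct Shard {
    pub cells: [Cell; 64],
}

/// Maps a cell coordinate to its shard number via (x >> 3, y >> 3).
pub fn shard_of(x: usize, y: usize) -> usize {
    (y >> 3) * SHARDS_PER_ROW + (x >> 3)
}

/// Index of the cell inside its shard, taken from the low 3 bits of x and y.
pub fn cell_in_shard(x: usize, y: usize) -> usize {
    (y & 7) * 8 + (x & 7)
}
```

With these numbers a shard is 64 x 40 = 2560 bytes, which is exactly 40 cache lines. An aligned array of shards therefore keeps every line private to one shard, provided each shard is written by a single core.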
The tradeoff was cognitive: we had to reason about lifetimes, avoid Arc spiking, and accept Rust's 4-week compile times on nightly for the sharded-grid crate. The build pipeline now runs cargo build -Z build-std=std,panic_abort --release in a Docker layer cached for two weeks; the cache miss penalty is 47 minutes instead of 12 minutes on older nightlies.
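The post does not show how the seqlocks protecting the grid are implemented, so the following is only a minimal sketch of the general technique using std atomics. SeqCell and its two u64 fields are hypothetical, and the sketch assumes exactly one writer per cell (for example, the core that owns the shard).

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};

/// Hypothetical seqlock-protected cell holding two values.
/// Assumes a single writer; any number of readers may run concurrently.
pub struct SeqCell {
    seq: AtomicU64, // even = stable, odd = write in progress
    x: AtomicU64,
    y: AtomicU64,
}

impl SeqCell {
    pub fn new(x: u64, y: u64) -> Self {
        Self {
            seq: AtomicU64::new(0),
            x: AtomicU64::new(x),
            y: AtomicU64::new(y),
        }
    }

    /// Writer side: bump to odd, publish the data, bump back to even.
    pub fn write(&self, x: u64, y: u64) {
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s.wrapping_add(1), Ordering::Relaxed);
        fence(Ordering::Release); // order the odd marker before the data stores
        self.x.store(x, Ordering::Relaxed);
        self.y.store(y, Ordering::Relaxed);
        self.seq.store(s.wrapping_add(2), Ordering::Release);
    }

    /// Reader side: retry until the same even sequence number is
    /// observed before and after reading the data.
    pub fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 & 1 == 1 {
                std::hint::spin_loop(); // writer is mid-update
                continue;
            }
            let x = self.x.load(Ordering::Relaxed);
            let y = self.y.load(Ordering::Relaxed);
            fence(Ordering::Acquire); // order the data loads before the re-check
            if self.seq.load(Ordering::Relaxed) == s1 {
                return (x, y);
            }
        }
    }
}
```

Readers never block the writer and never take a lock; they only retry when an update overlaps their read. That fits a workload where each cell has one owner and many observers.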
What The Numbers Said After
We reinstrumented with tracing and pprof-rs:
Total time: 4.2 s
58 % treasure_engine::shard::process_updates
12 % libc::epoll_pwait
10 % alloc::alloc::alloc
Latency percentiles at 400 players:
- p50: 2.4 ms
- p95: 5.1 ms
- p99: 5.8 ms
- p99.9: 8.2 ms
Allocation rate: 180 K/sec
GC pauses: 0 ms (no GC in Rust for this crate)
Context switches observed with perf sched latency: 8 per second vs 1400/sec in the Go version.
The warehouse floor felt smoother: BLE position updates arrived within one frame interval, eliminating the drift we saw when Go's scheduler preempted a goroutine mid-update.
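For readers who want to reproduce percentile figures like the ones above from raw latency samples, here is one possible way to compute them. The post does not say which method was used; this sketch assumes the nearest-rank definition and samples already sorted in ascending order. The percentile function and its permille parameter are illustrative names.

```rust
/// Nearest-rank percentile over an ascending-sorted slice of samples.
/// `permille` is the percentile in tenths of a percent (999 means p99.9).
/// Returns None when there are no samples.
pub fn percentile(sorted: &[u64], permille: usize) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    let n = sorted.len();
    // rank = ceil(permille * n / 1000), clamped to the range 1..=n
    let rank = ((permille * n + 999) / 1000).clamp(1, n);
    Some(sorted[rank - 1])
}
```

Integer per-mille arithmetic avoids the floating-point rounding that can push a p99.9 rank one slot too high. Samples could be recorded in microseconds, so that 5.8 ms is stored as 5800.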
What I Would Do Differently
I would not start with Rust again. The problem was not the language; it was modeling real-time coordination with Go channels when the semantics required lock-free queues and cache-aware sharding. If I had to repeat the migration, I would:
- First profile under load to confirm the abstraction mismatch before changing languages.
- Keep the grid in Go and offload coordination to Rust via #![no_std] and smart-ledger, a new crate I contributed that serializes commands to a lock-free ring buffer shared with the Go runtime via mmap. This hybrid approach reduced Rust compile times by 60 % and kept the ops team happy.
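As an illustration of the kind of lock-free ring buffer described in the last bullet, here is a minimal single-producer, single-consumer sketch. It is not the smart-ledger API. SpscRing is a hypothetical name, commands are assumed to be pre-encoded as u64 values, and the buffer lives in ordinary process memory, not in an mmap region shared with Go.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical bounded SPSC ring of u64-encoded commands.
/// Assumes one producer thread, one consumer thread, and N a power of two.
pub struct SpscRing<const N: usize> {
    slots: [AtomicU64; N],
    head: AtomicUsize, // next slot to read; written only by the consumer
    tail: AtomicUsize, // next slot to write; written only by the producer
}

impl<const N: usize> SpscRing<N> {
    pub fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: returns false when the ring is full.
    pub fn push(&self, cmd: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == N {
            return false; // full: the consumer has not caught up yet
        }
        self.slots[tail % N].store(cmd, Ordering::Relaxed);
        // Release makes the slot write visible before the new tail is seen.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: returns None when the ring is empty.
    pub fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let cmd = self.slots[head % N].load(Ordering::Relaxed);
        // Release hands the slot back to the producer for reuse.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(cmd)
    }
}
```

Neither side ever blocks or wakes the other. A full or empty ring is reported to the caller, which can decide whether to spin, drop, or back off.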
Lock-free MPMC queues solved the coordination problem; Rust just removed the scheduler noise that made the engineering visible.