The Problem We Were Actually Solving
We shipped a treasure-hunt engine for Veltrix that let 6,000 concurrent players solve 480 puzzles spread across 12 districts. Each puzzle emitted events: clue_discovered, team_solved, district_opened. On launch night the Go event bus ran at 42,000 events/sec with a 700-byte average payload and a p99 latency of 18 ms. Then the map editor team added dynamic districts and the rate jumped to 120 k events/sec. The Go bus couldn't keep up: GC pauses spiked to 9 ms, allocations grew to 1.4 GB/s, and the p99 latency graph looked like a seismograph during an earthquake. Prometheus showed GC time jumping from 3 % to 18 % of CPU, and the profiler revealed the scheduler was burning 40 % of its time moving goroutines between Ps (the Go scheduler's logical processors). We hadn't modeled the cost of GC under bursty load because our load tests only exercised steady state. The docs never made it obvious to us that GC cost scales with allocation rate, not with CPU core count.
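A back-of-envelope check makes the allocation problem concrete. The sketch below assumes the 1.4 GB/s figure was measured at the 120 k events/sec peak, that the 700-byte average payload still held at that rate, and that GB means 10^9 bytes; the function names are made up for illustration.

```rust
/// Heap bytes allocated per event, given a measured allocation rate
/// (bytes/sec) and an event rate (events/sec). Integer division, so the
/// result is rounded down.
fn alloc_bytes_per_event(alloc_bytes_per_sec: u64, events_per_sec: u64) -> u64 {
    alloc_bytes_per_sec / events_per_sec
}

/// Raw payload throughput in bytes/sec.
fn payload_bytes_per_sec(events_per_sec: u64, avg_payload_bytes: u64) -> u64 {
    events_per_sec * avg_payload_bytes
}
```

With the numbers above, `payload_bytes_per_sec(120_000, 700)` is 84,000,000 (84 MB/s of actual payload) and `alloc_bytes_per_event(1_400_000_000, 120_000)` is 11,666. Under those assumptions each 700-byte event was costing roughly 11.7 KB of heap allocation, about 16 to 17 times its own size, which is the kind of ratio that points at per-event copies and wrappers rather than the payload itself.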
What We Tried First (And Why It Failed)
We rewrote the bus in Java using the LMAX Disruptor pattern. It kept the same single-threaded design, but the JVM's G1GC introduced 12 ms worst-case pauses every 256 ms even at 50 k events/sec. Java Flight Recorder showed regions taking 3.2 ms to evacuate while the ring buffer was locked. We tried YGC tuned to a 1 ms pause target; the heap grew to 8 GB and we still saw 6 ms evacuation pauses when ten districts opened at once. The docs claim G1GC can bound pauses, but they assume uniform allocation size and do not model 1 KB hot objects being retired every 100 µs under 100 k events/sec. We also hit a silent pathological case: when the Disruptor's sleep strategy fired, the JVM's safepoint timer stretched from 20 ms to 400 ms because the compiler couldn't inline the LockSupport.parkNanos call under tiered JIT compilation. The JVM specs do not warn you that tiered compilation can turn 50 µs sleeps into 400 ms stop-the-world events.
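To put those pauses in terms of events rather than milliseconds, here is a small arithmetic sketch. It assumes a pause stops the consumer completely while producers keep arriving at the quoted rate; both function names are hypothetical.

```rust
/// Events that arrive while the consumer is stopped for `pause_us`
/// microseconds at a steady `events_per_sec` arrival rate.
fn events_queued_during_pause(events_per_sec: u64, pause_us: u64) -> u64 {
    events_per_sec * pause_us / 1_000_000
}

/// Fraction of wall-clock time spent paused when a pause of `pause_ms`
/// recurs every `period_ms`.
fn pause_duty_cycle(pause_ms: f64, period_ms: f64) -> f64 {
    pause_ms / period_ms
}
```

A 12 ms pause at 50 k events/sec gives `events_queued_during_pause(50_000, 12_000)`, which is 600 events arriving with nobody draining the ring. A 12 ms pause every 256 ms is a duty cycle of about 4.7 %, so the damage shows up in the tail latency of those backlogged events rather than in average throughput.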
The Architecture Decision
I prototyped a ring buffer in Rust using tokio::sync::mpsc with a fixed-size channel of 2^16 slots. The first run showed zero heap allocations beyond the initial Vec allocation. jemalloc reported 128 KB/sec of allocations versus 1.4 GB/sec in Go. We switched to tokio::task::LocalSet with pinned cores so the buffer ran on a single core without cross-thread ping-pong. The key tradeoff was giving up dynamic scaling in favor of deterministic latency; we moved from four event-bus pods to two, but the p99 dropped from 18 ms to 1.4 ms at 120 k events/sec. We accepted 30 % higher CPU usage per pod because we freed two cores that were otherwise spinning on mutex contention. The Rust compiler's borrow checker caught a data race where a district update tried to append an event while the serializer was reading the same slot; Valgrind never would have caught that. The docs didn't help. We learned the hard way that Rust's Arc adds 16 bytes of reference-count header (strong and weak counts) to every allocation on x86-64, and that tasks kept on a single-threaded local queue do not bounce between cores, so they avoid cross-core cache misses.
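The core idea, a buffer that allocates once and then only moves indices, can be sketched with the standard library alone. This is not the tokio::sync::mpsc channel described above; `Ring` and its methods are illustrative names, and the sketch is single-threaded, with no synchronization at all.

```rust
/// A fixed-capacity ring buffer that allocates its slots once, at
/// construction. Capacity must be a power of two so that wrapping an
/// index is a single bit mask instead of a modulo.
pub struct Ring<T> {
    slots: Vec<Option<T>>,
    mask: usize,
    head: usize, // total number of pops so far
    tail: usize, // total number of pushes so far
}

impl<T> Ring<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        // The only heap allocation the ring itself ever performs.
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, || None);
        Ring { slots, mask: capacity - 1, head: 0, tail: 0 }
    }

    pub fn len(&self) -> usize {
        self.tail.wrapping_sub(self.head)
    }

    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    pub fn is_full(&self) -> bool {
        self.len() == self.slots.len()
    }

    /// Stores `value`, or hands it back if the ring is full. It never
    /// blocks and never grows.
    pub fn push(&mut self, value: T) -> Result<(), T> {
        if self.is_full() {
            return Err(value);
        }
        self.slots[self.tail & self.mask] = Some(value);
        self.tail = self.tail.wrapping_add(1);
        Ok(())
    }

    /// Removes and returns the oldest value, if any.
    pub fn pop(&mut self) -> Option<T> {
        if self.is_empty() {
            return None;
        }
        let value = self.slots[self.head & self.mask].take();
        self.head = self.head.wrapping_add(1);
        value
    }
}
```

`Ring::<u64>::with_capacity(1 << 16)` gives the 2^16-slot shape mentioned above. Note that the ring only guarantees that it does not allocate itself: if `T` owns heap data (a `String`, say), creating each event still allocates.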
What The Numbers Said After
After two weeks in production:
perf top – 3.4 % time in evbuffer::write versus 22 % in Go's runtime scheduler
jemalloc heap – 192 MB resident versus 2.1 GB in Go
p99 latency – 1.4 ms at 120 k events/sec versus 42 ms worst-case in Go
CPU cores – 1.8 cores per bus pod versus 0.9 in Go (because we removed GC threads)
GC pressure – 0.3 % CPU time in Rust versus 18 % in Go
We measured GC pressure by sampling /proc/pressure/cpu every second. During the 10-minute peak the Go process spent 1,112 ms in GC; the Rust process spent 18 ms.
What I Would Do Differently
I would not expose the ring buffer size as a config knob; we let product set it to 65,536 and saw head-of-line blocking when 8,000 players triggered the same district open at once. The fix was a dynamic back-pressure protocol that dropped low-priority events instead of blocking the whole bus. I would also avoid Arc<String> for payloads; we burned 4 ns per clone on the hot path, because every clone is an atomic reference-count increment and Rust makes it very easy to share. Switching to &'static str reduced the clone cost to effectively 0 ns, since copying one is just a pointer-and-length copy. Finally, I would not use tokio::time::sleep for pacing; every 100 µs sleep in tokio acquires a lock that becomes a convoy under 100 k events/sec. We replaced it with a spin loop built on hint::spin_loop and bounded by a max spin count, cutting p99 latency variance in half. The docs never warn you that cooperative scheduling can serialize your latency distribution.
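Two of those fixes are easy to sketch with the standard library: shedding low-priority events when the queue is full, and a spin wait with a hard upper bound. This is a minimal illustration, not our production protocol. `SheddingQueue`, `Offer`, `Priority`, `spin_until`, and the choice of which event kinds count as low priority are all assumptions made for the example.

```rust
use std::collections::VecDeque;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    Low,
    High,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub kind: &'static str, // Copy: passing it around never touches a refcount
    pub priority: Priority,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Offer {
    Accepted,
    /// Accepted after evicting the oldest low-priority event.
    AcceptedAfterEvicting(Event),
    /// Dropped: the queue is full and nothing of lower priority can go.
    Rejected(Event),
}

/// A bounded queue that sheds load instead of blocking the producer.
pub struct SheddingQueue {
    events: VecDeque<Event>,
    capacity: usize,
}

impl SheddingQueue {
    pub fn new(capacity: usize) -> Self {
        SheddingQueue { events: VecDeque::with_capacity(capacity), capacity }
    }

    pub fn offer(&mut self, event: Event) -> Offer {
        if self.events.len() < self.capacity {
            self.events.push_back(event);
            return Offer::Accepted;
        }
        if event.priority == Priority::Low {
            return Offer::Rejected(event);
        }
        // Full, and the newcomer is high priority: evict the oldest
        // low-priority event. The linear scan keeps the sketch short; a
        // real queue would more likely keep one deque per priority.
        match self.events.iter().position(|e| e.priority == Priority::Low) {
            Some(i) => {
                let evicted = self.events.remove(i).expect("index came from position()");
                self.events.push_back(event);
                Offer::AcceptedAfterEvicting(evicted)
            }
            None => Offer::Rejected(event),
        }
    }

    pub fn pop(&mut self) -> Option<Event> {
        self.events.pop_front()
    }
}

/// Spins until `ready` is set or `max_spins` iterations have passed.
/// Returns whether `ready` was observed, so the caller can fall back to
/// yielding or parking instead of spinning forever.
pub fn spin_until(ready: &AtomicBool, max_spins: u32) -> bool {
    for _ in 0..max_spins {
        if ready.load(Ordering::Acquire) {
            return true;
        }
        hint::spin_loop();
    }
    ready.load(Ordering::Acquire)
}
```

The producer never waits in `offer`: it either gets its event in or gets it back immediately, which is what removes the head-of-line blocking. The spin bound matters just as much as the spin itself, since an unbounded spin on a pinned core would starve whatever is supposed to set the flag.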