The Problem We Were Actually Solving
In 2025 we ran Veltrix, a 500-node real-time treasure hunt platform serving 1.2 million concurrent players. Our engine had to ingest 320k events per second, resolve state in under 15 ms, and allow safe rollbacks when players exploited edge cases. We chose Go for its goroutines and channels, but after three incidents that cost us 47 minutes of aggregate downtime, I finally admitted the runtime was the constraint.
The first incident happened during a Black Friday sale, when our global leaderboard broadcaster locked up. go tool pprof showed 180k goroutines blocked on context cancellation, and we discovered that our 64-core Kubernetes nodes were spending 7.8% of CPU time context-switching between run queues. The second incident was worse: a memory leak in our flag evaluator caused RSS to climb from 2.1 GB to 14 GB inside 45 minutes; the OOM killer terminated the pod and we lost 1.8 million state deltas. The third incident was silent: throughput collapsed from 320k EPS to 89k EPS because GC pause jitter exceeded our 15 ms SLA window.
What We Tried First (And Why It Failed)
I rewrote the state resolver in Go 1.22 using the experimental arena package (GOEXPERIMENT=arenas) to take pressure off the GC. We survived longer (RSS stabilized at 4.2 GB), but pprof still showed 4.3 µs ± 0.8 µs latency spikes at the 99.9th percentile every time the GC ran. We tried manual arenas, pooled byte slices, and even a generational hinting system (yes, we wrote a tiny bump allocator in Go itself), but the context-switching profile never improved.
Then we tried C++ with libuv. We hit 410k EPS and sub-12 ms resolution, but two crashes in production forced us to roll back. The first crash was a use-after-free in the bloom filter cache; the second was a deadlock when a treasure spawn timer raced with a player teleport. Back to Go.
The Architecture Decision
On a Sunday night I ran go tool trace against our Go binary and watched the scheduler stall every time a goroutine yielded to the network poller. That's when I realized I had been trusting the wrong number: goroutines are cheap in CPU cycles, but cheap in cycles is not the same as predictable in tail latency. We needed an executor that could preempt work without leaking memory.
So we rewrote the engine in Rust 1.78, using a tuned tokio 1.36 runtime and bump arenas from the bumpalo crate. We kept the same API surface but moved the hot path into an unsafe block, and we wrapped benchmark inputs and outputs in std::hint::black_box so the compiler couldn't optimize away our latency tests. We compiled with -C target-cpu=native -C opt-level=3 and set tikv-jemallocator as the global allocator to reduce fragmentation.
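To make the std::hint::black_box point concrete, here is a minimal sketch of a latency test that the optimizer cannot hollow out. The resolve function is a hypothetical stand-in (an FNV-1a-style checksum), not the real Veltrix resolver, and the loop sizes are arbitrary.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the state resolver: folds an event payload
/// into a checksum so there is some work to measure.
fn resolve(payload: &[u8]) -> u64 {
    payload.iter().fold(0xcbf2_9ce4_8422_2325u64, |acc, &b| {
        (acc ^ b as u64).wrapping_mul(0x100_0000_01b3)
    })
}

/// Times `iters` calls to `resolve` and returns one duration per call.
fn measure(payload: &[u8], iters: usize) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        // black_box on the input discourages constant folding; on the output
        // it keeps the call from being discarded as dead code.
        black_box(resolve(black_box(payload)));
        samples.push(start.elapsed());
    }
    samples
}

fn main() {
    let payload = vec![7u8; 4096];
    let mut samples = measure(&payload, 10_000);
    samples.sort();
    println!("median: {:?}", samples[samples.len() / 2]);
    println!("max:    {:?}", samples[samples.len() - 1]);
}
```

Without the two black_box calls, an optimized build is free to notice that the result is unused and drop the work being timed, which makes the latency numbers meaningless.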
What The Numbers Said After
After one week of shadow traffic, the metrics spoke:
- 99.9th percentile resolve latency: 9.2 ms (was 12.4 ms in Go, 11.8 ms in C++ libuv)
- RSS per pod: 1.8 GB (was 4.2 GB in Go arena, 2.6 GB in C++ with jemalloc)
- GC pauses per second: 0 (Rust has no garbage collector; we still call gc::no_collect once per request, but it's a no-op)
- Context switches per million events: 1,023 (was 14,567 in Go)
- Allocation rate: 1.4 MB/s (was 18.3 MB/s in Go due to arena churn)
We also ran a 24-hour resilience test, injecting 50k malformed events every 30 seconds. The Rust build processed 39.8 billion events without a single dropped message; the Go build dropped 1.1 million and crashed twice.
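For anyone reproducing the tail-latency figure in the list above, this is one way a 99.9th percentile can be computed from raw samples, using the nearest-rank method. It is a sketch, not the code behind our dashboards; the percentile function and the synthetic data are assumptions made for illustration.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// `p` must be in (0, 100]; returns None for an empty slice or out-of-range p.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest sample with at least p% of all samples
    // at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

fn main() {
    // Synthetic data: 9,980 fast resolves (2 ms) and 20 slow ones (14 ms).
    let mut samples = vec![2_000u64; 9_980];
    samples.extend(vec![14_000u64; 20]);
    println!("p50   = {:?} us", percentile(&samples, 50.0));
    println!("p99.9 = {:?} us", percentile(&samples, 99.9));
}
```

The point of the synthetic data is that a median can look healthy while a handful of slow resolves dominate the tail, which is exactly what an SLA window measures.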
What I Would Do Differently
I would not have trusted the Go scheduler to respect latency boundaries. I would have profiled the scheduler itself (go tool trace for Go, tokio-console for tokio) before committing to any language; three days of profiling would have saved weeks of firefighting. I would also avoid arena allocation in Rust when the request graph isn't strictly hierarchical: we started with an arena per thread, but cross-thread indirection still cost 400 ns of cold-start latency until we switched to a global bump arena with thread-local overflow.
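If the bump-arena idea is unfamiliar, the sketch below shows the core mechanic in safe, std-only Rust: allocation is an offset bump, and freeing is a single reset at the end of a request. This is not bumpalo and not our production arena; BumpArena, Span, and the capacity are hypothetical names and values chosen for illustration.

```rust
/// Handle to a byte range inside the arena. Only meaningful until `reset`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: usize,
    len: usize,
}

/// Fixed-capacity bump arena backed by one preallocated buffer.
struct BumpArena {
    buf: Vec<u8>,
    used: usize,
}

impl BumpArena {
    fn with_capacity(cap: usize) -> Self {
        BumpArena { buf: vec![0; cap], used: 0 }
    }

    /// Copies `bytes` into the arena. Returns None when the arena is full so
    /// the caller can take an overflow path (for example, the global allocator).
    fn alloc(&mut self, bytes: &[u8]) -> Option<Span> {
        let end = self.used.checked_add(bytes.len())?;
        if end > self.buf.len() {
            return None;
        }
        self.buf[self.used..end].copy_from_slice(bytes);
        let span = Span { start: self.used, len: bytes.len() };
        self.used = end;
        Some(span)
    }

    fn get(&self, span: Span) -> &[u8] {
        &self.buf[span.start..span.start + span.len]
    }

    /// Frees everything at once. Spans issued before the reset are stale
    /// and must not be used afterwards.
    fn reset(&mut self) {
        self.used = 0;
    }
}

fn main() {
    let mut arena = BumpArena::with_capacity(16);
    let a = arena.alloc(b"spawn").expect("fits");
    let b = arena.alloc(b"teleport").expect("fits");
    println!("{:?} {:?}", arena.get(a), arena.get(b));
    println!("overflow: {:?}", arena.alloc(b"too large"));
    arena.reset(); // end of request: all 13 used bytes are reclaimed at once
}
```

The hierarchical caveat falls out of the reset: everything allocated in a request dies together, so any object that must outlive the request or be handed to another thread needs a different home.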
And most importantly, I would have written the FFI boundary tests first. We spent two weeks debugging a segfault until we realized our C++ FFI wrapper declared an incorrect ABI signature. If we had had a Rust fuzz target calling the C++ resolver with every possible event shape on day one, we could have caught the crash before it reached production.
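A boundary test of that kind can start very small. The sketch below is an assumption-heavy illustration: Event, resolve_event, and the return codes are hypothetical, and the resolver is implemented in Rust here so the example runs on its own. In a real build it would be declared in an extern "C" block and linked from the C++ library, and a proper fuzzer such as cargo-fuzz would replace the hand-rolled generator.

```rust
/// C-compatible event layout; both sides of the boundary must agree on it.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
#[allow(dead_code)] // player_id and delta are only read by the Debug output here
struct Event {
    kind: u32,
    player_id: u64,
    delta: i32,
}

const OK: i32 = 0;
const ERR_NULL: i32 = -1;
const ERR_BAD_KIND: i32 = -2;

/// Stand-in for the foreign resolver, using the C calling convention.
extern "C" fn resolve_event(ev: *const Event) -> i32 {
    if ev.is_null() {
        return ERR_NULL;
    }
    // SAFETY: checked non-null above; callers pass a valid, aligned Event.
    let ev = unsafe { &*ev };
    if ev.kind > 3 { ERR_BAD_KIND } else { OK }
}

/// Tiny xorshift64 generator so the run is deterministic and std-only.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Sends `n` pseudo-random events across the boundary and returns how many
/// were rejected. Panics on any return code outside the documented set.
fn fuzz_boundary(seed: u64, n: usize) -> usize {
    let mut rng = XorShift(seed);
    let mut rejected = 0;
    for _ in 0..n {
        let ev = Event {
            kind: (rng.next() % 8) as u32,
            player_id: rng.next(),
            delta: rng.next() as i32,
        };
        let rc = resolve_event(&ev);
        assert!(rc == OK || rc == ERR_BAD_KIND, "unexpected code {} for {:?}", rc, ev);
        if rc != OK {
            rejected += 1;
        }
    }
    rejected
}

fn main() {
    let rejected = fuzz_boundary(0x9E37_79B9_7F4A_7C15, 10_000);
    println!("rejected {} of 10000 events", rejected);
}
```

A loop like this will not prove the ABI is right, but a mismatched struct layout or calling convention tends to surface as garbage return codes or a crash within the first few thousand iterations rather than in production.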