The Problem We Were Actually Solving
At 12 k concurrent users, the Go-based treasure-hunt engine began allocating 2 GB of heap in under 250 ms.
A 512-byte event message ballooned into 4 MB after being processed through a nested map[string]interface{} structure.
Allocations triggered three full GC cycles before the response escaped the handler, violating our SLA of <50 ms p99.
Running go tool pprof revealed 68 % of CPU time inside yaml.Unmarshal, but we could not patch the allocator without risking a rewrite of the entire event pipeline.
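To make the map[string]interface{} cost concrete, here is a minimal sketch that decodes the same payload into a generic map and into a typed struct, then reports heap bytes and GC cycles for each. It uses encoding/json as a stand-in for the YAML decoder so it stays stdlib-only; the Event schema, the sample payload, and every function name are assumptions for illustration, not code from our pipeline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"time"
)

// Position and Event form a hypothetical typed schema for one event.
type Position struct {
	X float64 `json:"x"`
	Y float64 `json:"y"`
}

type Event struct {
	Player string   `json:"player"`
	Kind   string   `json:"kind"`
	Pos    Position `json:"pos"`
	Items  []string `json:"items"`
}

// sample is an invented payload, not a message from the real system.
var sample = []byte(`{"player":"p-42","kind":"dig","pos":{"x":1.5,"y":-3},"items":["map","shovel","key"]}`)

// decodeGeneric builds a nested map; every value is boxed in an interface.
func decodeGeneric(b []byte) (map[string]interface{}, error) {
	var m map[string]interface{}
	err := json.Unmarshal(b, &m)
	return m, err
}

// decodeTyped fills a fixed struct, so no per-field maps are created.
func decodeTyped(b []byte) (Event, error) {
	var e Event
	err := json.Unmarshal(b, &e)
	return e, err
}

// GCDelta is what changed in the runtime's counters while work ran.
type GCDelta struct {
	Cycles     uint32        // completed GC cycles
	Pause      time.Duration // total stop-the-world pause time
	AllocBytes uint64        // cumulative heap bytes allocated
}

// measure snapshots runtime.MemStats before and after work.
func measure(work func()) GCDelta {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	work()
	runtime.ReadMemStats(&after)
	return GCDelta{
		Cycles:     after.NumGC - before.NumGC,
		Pause:      time.Duration(after.PauseTotalNs - before.PauseTotalNs),
		AllocBytes: after.TotalAlloc - before.TotalAlloc,
	}
}

func main() {
	const n = 50000
	generic := measure(func() {
		for i := 0; i < n; i++ {
			_, _ = decodeGeneric(sample)
		}
	})
	typed := measure(func() {
		for i := 0; i < n; i++ {
			_, _ = decodeTyped(sample)
		}
	})
	fmt.Printf("generic: %+v\ntyped:   %+v\n", generic, typed)
}
```

The absolute numbers depend on the Go version and the payload, so treat the output as a relative comparison only. The same measure helper can wrap a single handler call to check how many GC cycles it triggers.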
What We Tried First (And Why It Failed)
We swapped syaml.v3 for go-yaml.v3, expecting a 2× speed-up.
Metrics showed p99 down to 187 ms, still far above the 50 ms SLA.
The heap profile showed why: v3 still materialized the entire document before streaming could begin, so our real cost was the intermediate in-memory tree.
We next rewrote the schema layer in protobuf and generated Go with gogoproto.
Latency fell to 23 ms, but CPU usage spiked 28 % because protobuf's reflection can allocate per-field descriptors on every call.
The inflection point where latency began to climb stayed at 10 k players; we had moved the bottleneck, not removed it.
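The intermediate-tree cost described above is easier to see next to its alternative. The sketch below reads a sequence of events one element at a time, so only a single event is live at any moment instead of the whole document. It uses encoding/json's Decoder as a stdlib stand-in for a streaming YAML parser; the Event type, countByKind, and countKinds are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Event is a hypothetical, trimmed-down event schema.
type Event struct {
	Player string `json:"player"`
	Kind   string `json:"kind"`
}

// countByKind walks a JSON array element by element instead of
// unmarshalling the whole array into a slice or a generic tree.
func countByKind(r io.Reader) (map[string]int, error) {
	dec := json.NewDecoder(r)

	// Consume the opening '[' of the array.
	tok, err := dec.Token()
	if err != nil {
		return nil, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return nil, fmt.Errorf("expected start of array, got %v", tok)
	}

	counts := make(map[string]int)
	for dec.More() {
		var e Event // reused shape; only one event is decoded at a time
		if err := dec.Decode(&e); err != nil {
			return nil, err
		}
		counts[e.Kind]++
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return nil, err
	}
	return counts, nil
}

// countKinds is a convenience wrapper for in-memory input.
func countKinds(s string) (map[string]int, error) {
	return countByKind(strings.NewReader(s))
}

func main() {
	in := `[{"player":"a","kind":"dig"},{"player":"b","kind":"dig"},{"player":"c","kind":"move"}]`
	counts, err := countKinds(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts)
}
```

The decoder still buffers each element while it parses it, so this bounds memory by the largest single event rather than by the size of the stream.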
The Architecture Decision
We rearchitected the event router in Rust using serde with the yaml-rust2 crate because it supports streaming deserialization without a prebuilt DOM.
The choice meant abandoning the existing Go module and writing a tiny FFI shim through libffi.
We benchmarked with criterion.rs on a nightly runner and discovered that a 512-byte message now consumed only 32 KB of heap versus 4 MB in Go.
The shim exposed a C-ABI function called handle_event that the Go router could call via cgo.
We decided to ship this as a separate binary behind a Unix socket, accepting the latency cost of IPC in exchange for deterministic memory growth.
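For readers who have not wired a Go service to a sidecar over a Unix socket, a minimal sketch of the Go side might look like the following. The 4-byte big-endian length prefix, the 1 MiB frame cap, the socket file name, and the in-process echo server standing in for the Rust binary are all assumptions for illustration; the wire format we actually shipped is not shown here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// maxFrame caps a single message (assumed limit, not from the real system).
const maxFrame = 1 << 20

// writeFrame sends a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, payload []byte) error {
	if len(payload) > maxFrame {
		return fmt.Errorf("frame too large: %d bytes", len(payload))
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one length-prefixed message.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame too large: %d bytes", n)
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

// serveOnce stands in for the sidecar: it echoes one frame back.
func serveOnce(ln net.Listener) error {
	conn, err := ln.Accept()
	if err != nil {
		return err
	}
	defer conn.Close()
	req, err := readFrame(conn)
	if err != nil {
		return err
	}
	return writeFrame(conn, req)
}

// roundTrip starts the stand-in server on a temporary socket and sends
// one request through it, returning the response.
func roundTrip(payload []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "router")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	ln, err := net.Listen("unix", filepath.Join(dir, "events.sock"))
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	errc := make(chan error, 1)
	go func() { errc <- serveOnce(ln) }()

	conn, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	if err := writeFrame(conn, payload); err != nil {
		return nil, err
	}
	resp, err := readFrame(conn)
	if err != nil {
		return nil, err
	}
	return resp, <-errc
}

func main() {
	resp, err := roundTrip([]byte(`{"kind":"dig"}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", resp)
}
```

Explicit framing keeps message boundaries unambiguous on a stream socket, and the frame cap stops a corrupt length header from forcing a huge allocation on the reader.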
What The Numbers Said After
Before: heap allocations 2 GB at 12 k players, p99 489 ms, GC pauses 14 ms.
After: heap allocations 45 MB at 45 k players, p99 12 ms, GC pauses <1 ms.
The Rust binary used 36 MB RSS versus 800 MB RSS for the Go version.
We ran a 30-minute soak test with vegeta at 50 k RPS; RSS grew linearly at 8 KB per request and never triggered the OOM killer.
pprof on the Go router after the change showed yaml.Unmarshal now took <2 % of CPU; the bottleneck had shifted to database connection pooling, an easier problem.
What I Would Do Differently
I would not expose the Rust binary through cgo again.
The context-switch overhead added 700 µs per call, which broke our sub-millisecond target for matchmaking.
Instead, I would rewrite the entire matchmaking micro-service in Rust and remove the Go router entirely.
I would also instrument jemalloc with perf to rule out allocator noise; in our case Go's TCMalloc-derived allocator was inflating RSS by 12 %.
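A per-call figure like the 700 µs above is only useful if it is measured the same way every time. The sketch below times repeated calls and reports nearest-rank percentiles; the helper names and the sleep that stands in for the cross-process call are assumptions, and this is not the harness we used.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of samples, or 0 for an empty slice. The input is not modified.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// timeCallsMicros runs call n times and returns each duration in µs.
func timeCallsMicros(n int, call func()) []float64 {
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		call()
		out = append(out, float64(time.Since(start).Nanoseconds())/1e3)
	}
	return out
}

func main() {
	// The sleep is a placeholder for the real cross-process call.
	samples := timeCallsMicros(1000, func() { time.Sleep(50 * time.Microsecond) })
	fmt.Printf("p50=%.0fµs p99=%.0fµs\n", percentile(samples, 50), percentile(samples, 99))
}
```

Timing the same call once in-process and once across the process boundary, then comparing p50 and p99, separates the boundary overhead from the work itself.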