pretty ncube

The moment Veltrix's configuration layer betrayed us

The Problem We Were Actually Solving

At 12 k concurrent users, the Golang-based treasure-hunt engine began allocating 2 GB of heap in <250 ms.
A 512-byte event message ballooned into 4 MB after being processed through a nested map[string]interface{} structure.
Allocations triggered three full GC cycles before the response escaped the handler, violating our SLA of <50 ms p99.
Running go tool pprof revealed 68 % of CPU time inside yaml.Unmarshal, but we could not patch the allocator without risking a rewrite of the entire event pipeline.
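
To make the cost of the generic map concrete, here is a minimal sketch that compares allocations per decode for a map[string]interface{} target against a typed struct. It is not our production code: encoding/json stands in for the YAML decoder so the example runs with the standard library alone, and the Event type, its fields, and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// Event is a hypothetical typed schema for one event message.
type Event struct {
	Player string `json:"player"`
	Kind   string `json:"kind"`
	Coords struct {
		X int `json:"x"`
		Y int `json:"y"`
	} `json:"coords"`
	Tags []string `json:"tags"`
}

// sample is an invented payload, not a real engine message.
var sample = []byte(`{"player":"p1","kind":"dig","coords":{"x":3,"y":7},"tags":["a","b","c"]}`)

// allocsUntyped reports the average number of heap allocations
// per decode into a generic nested map.
func allocsUntyped() float64 {
	return testing.AllocsPerRun(1000, func() {
		var m map[string]interface{}
		if err := json.Unmarshal(sample, &m); err != nil {
			panic(err)
		}
	})
}

// allocsTyped reports the same figure for a decode into the typed struct.
func allocsTyped() float64 {
	return testing.AllocsPerRun(1000, func() {
		var e Event
		if err := json.Unmarshal(sample, &e); err != nil {
			panic(err)
		}
	})
}

func main() {
	fmt.Printf("untyped: %.0f allocs/op\n", allocsUntyped())
	fmt.Printf("typed:   %.0f allocs/op\n", allocsTyped())
}
```

The generic target boxes every scalar into an interface value and allocates a map per nesting level, which is the kind of amplification described above. The exact counts depend on the Go version and the decoder in use.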

What We Tried First (And Why It Failed)

We swapped syaml.v3 for go-yaml.v3, expecting a 2× speed-up.
Metrics showed p99 down to 187 ms, still unacceptable.
Digging into the heap profile, v3 still materialized the entire document before streaming could begin—our real cost was the intermediate in-memory tree.
We next rewrote the schema layer in protobuf and generated Go with gogoproto.
Latency fell to 23 ms, but CPU usage spiked 28 % because protobuf's reflection can allocate per-field descriptors on every call.
The inflection point where growth took off stayed at 10 k players; we had moved the bottleneck, not removed it.

The Architecture Decision

We rearchitected the event router in Rust using serde with the yaml-rust2 crate because it supports streaming deserialization without a prebuilt DOM.
The choice meant abandoning the existing Go module and writing a tiny FFI shim through libffi.
We benchmarked with criterion.rs on a nightly runner and discovered that a 512-byte message now consumed only 32 KB of heap versus 4 MB in Go.
The shim exposed a C-ABI function called handle_event that the Go router could call via cgo.
We decided to ship this as a separate binary behind a Unix socket, accepting the latency cost of IPC in exchange for deterministic memory growth.
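
For readers who want to picture the Go side of that socket, here is a rough sketch assuming a simple length-prefixed framing (4-byte big-endian length, then payload). The framing, the socket file name, the 1 MiB frame cap, and the serveEcho stand-in for the Rust process are all assumptions for illustration; the post does not describe the actual wire protocol.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// maxFrame is an assumed upper bound so a bad header cannot force a huge allocation.
const maxFrame = 1 << 20

// writeFrame sends a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one length-prefixed frame.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame of %d bytes exceeds limit", n)
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

// serveEcho is a stand-in for the separate event binary:
// it answers every frame with "ok:" plus the payload.
func serveEcho(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			for {
				msg, err := readFrame(c)
				if err != nil {
					return
				}
				if err := writeFrame(c, append([]byte("ok:"), msg...)); err != nil {
					return
				}
			}
		}(conn)
	}
}

// roundTrip starts the stand-in server on a temporary socket,
// sends one event, and returns the reply.
func roundTrip(event []byte) (string, error) {
	dir, err := os.MkdirTemp("", "router")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	ln, err := net.Listen("unix", filepath.Join(dir, "events.sock"))
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go serveEcho(ln)

	conn, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if err := writeFrame(conn, event); err != nil {
		return "", err
	}
	reply, err := readFrame(conn)
	return string(reply), err
}

func main() {
	reply, err := roundTrip([]byte("dig x=3 y=7"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reply)
}
```

A real router would keep a pool of long-lived connections rather than dialing per event, since the per-call cost of the boundary is exactly what the IPC trade-off is about.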

What The Numbers Said After

Before: heap allocations 2 GB at 12 k players, p99 489 ms, GC pauses 14 ms.
After: heap allocations 45 MB at 45 k players, p99 12 ms, GC pauses <1 ms.
The Rust binary used 36 MB RSS versus 800 MB RSS for the Go version.
We ran 30 minutes of soak under vegeta with 50 k RPS; RSS grew linearly at 8 KB per request and never triggered the OOM killer.
pprof on the Go router after the change showed yaml.Unmarshal now took <2 % of CPU; the bottleneck had shifted to database connection pooling—an easier problem.
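
Since every figure here is judged against a p99 target, it helps to be explicit about how a p99 can be computed from raw latency samples. The sketch below uses the nearest-rank method; the percentile and demoSamples helpers and the synthetic 1 ms to 100 ms samples are illustrative assumptions, not our metrics pipeline.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (1..100) of samples using the
// nearest-rank method: the smallest value with at least p% of samples at or below it.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Integer form of ceil(p/100 * n), which avoids floating-point rounding.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// demoSamples returns synthetic latencies of 1 ms, 2 ms, ..., 100 ms.
func demoSamples() []time.Duration {
	out := make([]time.Duration, 0, 100)
	for i := 1; i <= 100; i++ {
		out = append(out, time.Duration(i)*time.Millisecond)
	}
	return out
}

func main() {
	sla := 50 * time.Millisecond // the p99 target quoted earlier in the post
	p99 := percentile(demoSamples(), 99)
	fmt.Println("p99:", p99, "within SLA:", p99 < sla)
}
```

Monitoring systems often estimate percentiles from histograms instead, so numbers from a dashboard and from raw samples can differ slightly.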

What I Would Do Differently

I would not expose the Rust binary through cgo again.
The context-switch overhead added 700 µs per call, which broke our sub-millisecond target for matchmaking.
Instead, I would rewrite the entire matchmaking micro-service in Rust and remove the Go router entirely.
I would also instrument jemalloc with perf to rule out allocator noise; in our case Go's TCMalloc-derived allocator was inflating RSS by 12 %.

