The Problem We Were Actually Solving
Our treasure-hunt engine at Veltrix was not exploding; it was quietly drowning. During the 2025 Black Mesa event, we served 380 000 concurrent players searching 1.4 million geo-cached items across ten shards. Latency percentiles looked fine from the outside (p95 of 85 ms, p99 of 142 ms), yet the Jaeger traces revealed a hidden cliff. Under load, every trace showed 60 % of the time spent inside the configuration parsing loop. Not the search itself, not the database round-trip, but the 6 ms per-player deserialization of dynamic event rules. At peak we were spawning 2 800 parser goroutines per second, each allocating 128 bytes for the AST that the garbage collector then had to reclaim. GC pauses had climbed from 1.2 ms to 34 ms, and the traces were literally turning orange because the CPU governor throttled the flame graph. We needed to cut that parsing overhead before the next event; otherwise the same pattern would repeat with 600 000 players.
What We Tried First (And Why It Failed)
First we tried replacing the JSON rules with Protocol Buffers. The protobuf schema cut the payload size by 42 %, but the Go protobuf runtime still boxed and unboxed every rule into an interface{} tree before we could evaluate it. The p99 deserialization went from 6 ms to 4.8 ms, which sounded good until we factored in the now-visible GC ticks: 12 ms every 1.2 s from the arena allocator. Then we tried FlatBuffers. FlatBuffers gave us zero-copy reads, and the p99 dropped to 0.8 ms. Victory? Not quite. Two minutes into a chaos test, the game servers started panicking because the FlatBuffers verifier threw alignment errors on rules that came from mobile devices with older Android runtimes. The error message was:
flatcc encountered misaligned primitive, possible memory corruption
We rolled the change back after 22 minutes and our p95 latency shot back up to 120 ms. The Jaeger traces now showed red spans labeled MisalignedFlatbuffers.
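To make the alignment failure concrete, here is a minimal sketch of the kind of check a zero-copy reader has to make before handing out a reference into a raw buffer. It is not the FlatBuffers verifier; the names `view_u32`, `read_u32_le`, `ReadError`, and `Aligned` are hypothetical and exist only for this illustration.

```rust
use std::mem::{align_of, size_of};

/// Why a zero-copy read was refused (hypothetical error type).
#[derive(Debug, PartialEq)]
pub enum ReadError {
    OutOfBounds,
    Misaligned,
}

/// Zero-copy view of a native-endian u32 at `offset`.
/// Allowed only when the address is aligned for u32.
pub fn view_u32(buf: &[u8], offset: usize) -> Result<&u32, ReadError> {
    let end = offset
        .checked_add(size_of::<u32>())
        .ok_or(ReadError::OutOfBounds)?;
    if end > buf.len() {
        return Err(ReadError::OutOfBounds);
    }
    let ptr = buf[offset..end].as_ptr();
    if (ptr as usize) % align_of::<u32>() != 0 {
        return Err(ReadError::Misaligned);
    }
    // SAFETY: bounds and alignment were checked above, every bit pattern
    // is a valid u32, and the returned reference borrows from `buf`.
    Ok(unsafe { &*(ptr as *const u32) })
}

/// Copying fallback that works at any offset: assemble the value from bytes.
pub fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let bytes = buf.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Demo helper: a byte buffer whose first byte is guaranteed 4-byte aligned.
#[repr(align(4))]
pub struct Aligned(pub [u8; 8]);
```

With an `Aligned` buffer, `view_u32` succeeds at offset 4 and returns `ReadError::Misaligned` at offset 1, while `read_u32_le` works at both offsets at the cost of a copy. A payload whose fields do not land on aligned addresses forces exactly this choice between rejecting the buffer and giving up zero-copy.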
The Architecture Decision
Enough profilers. We stopped trying to fix the data format and asked whether the entire runtime was the constraint. I ran a back-of-the-envelope calculation: 85 % of the treasure rules were static for the duration of the event. They only changed when the event designers pushed a hot patch, which happened twice a day at most. That meant we did not need a dynamic parser at request time; we needed a code generator. We chose Rust for the rule engine because it compiles to native code, has no hidden allocations in the hot path, and lets us embed the generated module directly into the Go server via cgo. The workflow became:
- Event designers write YAML in a private repo.
- CI runs a custom proc-macro that emits Rust code.
- rustc produces a .so that is dlopen'd once at server start.
- The Go side calls into the Rust module with the player's request; no serialization until the final JSON response.
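As a rough picture of what the generated code in that workflow could look like, the sketch below bakes the rules into a static table so nothing is parsed or heap-allocated per request. The `Rule` fields, the table values, and the function names `can_collect` and `veltrix_can_collect` are all assumptions made for illustration, not the actual generator output.

```rust
/// One treasure rule (hypothetical fields). `#[repr(C)]` keeps the layout
/// predictable for C callers such as cgo.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rule {
    pub item_id: u32,
    pub min_level: u32,
    pub radius_m: f32,
}

/// Illustrative values only; a real build would emit this table from the
/// designers' YAML at compile time.
pub static RULES: [Rule; 3] = [
    Rule { item_id: 1, min_level: 1, radius_m: 50.0 },
    Rule { item_id: 2, min_level: 5, radius_m: 25.0 },
    Rule { item_id: 3, min_level: 10, radius_m: 10.0 },
];

/// May a player of this level collect this item from this distance?
/// Reads only the static table, so the call does not allocate.
pub fn can_collect(item_id: u32, player_level: u32, distance_m: f32) -> bool {
    RULES
        .iter()
        .find(|r| r.item_id == item_id)
        .map_or(false, |r| player_level >= r.min_level && distance_m <= r.radius_m)
}

/// C ABI wrapper returning 1 or 0. To export it from a cdylib, the function
/// would also need a no_mangle attribute (written `#[unsafe(no_mangle)]` on
/// the 2024 edition); it is omitted here so the sketch compiles on any edition.
pub extern "C" fn veltrix_can_collect(item_id: u32, player_level: u32, distance_m: f32) -> u8 {
    can_collect(item_id, player_level, distance_m) as u8
}
```

Under these assumed values, `can_collect(2, 5, 25.0)` is true and `can_collect(2, 4, 10.0)` is false. A linear scan is enough for a three-entry table; a generator working with a large rule set might emit a `match` or a sorted table instead.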
The key trade-off was build time versus runtime speed. A full rebuild of the rule engine took 47 s on our CI runner, which is acceptable because it happens twice a day. We accepted that build-time cost to eliminate 2–4 ms of parsing and GC pressure per request.
What The Numbers Said After
We redeployed the Rust engine two days before Black Mesa 2026. p95 dropped to 41 ms, p99 to 79 ms. The GC in the Go workers steadied out at 1.4 ms pauses, down from the previous 22 ms worst case. Allocations for rule metadata fell from 128 bytes per player to zero in the hot code path. The flame graph now shows the biggest red span as the actual search algorithm, not configuration parsing. The rustc output confirms the generated module has 124 KB of .text and zero heap allocations at runtime. Most importantly, the MisalignedFlatbuffers errors vanished because we are no longer shipping unaligned data from mobile clients into the deserialization layer.
What I Would Do Differently
I would not have trusted the Jaeger traces alone. The Go runtime's internal CPU limiter hid the real bottleneck until we graphed the goroutine wake-up latency. Next time I will set up eBPF-based off-CPU flame graphs before touching the code. Second, I should have measured the cost of the Rust ↔ Go boundary earlier. The cgo call itself adds 2–3 µs of latency, which is negligible at p95 but shows up in the tail. If we had known that up front, we could have batched calls to the Rust module instead of one-per-player, reducing the boundary crossings by 60 %. Finally, I would avoid the temptation to use serde_json anywhere in the hot path, even if it's only for debugging. We had one debug log line inside the Rust module that was still using serde_json, and it added 400 ns per call. It took a week to notice because the cumulative effect only showed in the p99.9.
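The batching idea can be sketched as a single entry point that takes a slice of requests, so the fixed boundary cost is paid once per batch rather than once per player. Everything here is assumed for illustration: the `Request` layout, the thresholds in `eval_one`, and the names `eval_batch`, `veltrix_eval_batch`, and `amortized_boundary_ns`.

```rust
use std::slice;

/// One player request (hypothetical layout), readable from C.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Request {
    pub player_level: u32,
    pub distance_m: f32,
}

/// Stand-in for the generated rule logic, with made-up thresholds.
fn eval_one(req: &Request) -> u8 {
    (req.player_level >= 5 && req.distance_m <= 25.0) as u8
}

/// Safe core: writes one result byte per request and returns how many
/// requests were evaluated (limited by the shorter of the two slices).
pub fn eval_batch(reqs: &[Request], out: &mut [u8]) -> usize {
    let n = reqs.len().min(out.len());
    for (slot, req) in out[..n].iter_mut().zip(&reqs[..n]) {
        *slot = eval_one(req);
    }
    n
}

/// C ABI wrapper: one boundary crossing for `len` requests.
/// A cdylib export would also need a no_mangle attribute, omitted here.
///
/// Safety: `reqs` must point to `len` readable `Request`s and `out` to
/// `len` writable bytes; the two regions must not overlap.
pub unsafe extern "C" fn veltrix_eval_batch(
    reqs: *const Request,
    len: usize,
    out: *mut u8,
) -> usize {
    if reqs.is_null() || out.is_null() {
        return 0;
    }
    let reqs = unsafe { slice::from_raw_parts(reqs, len) };
    let out = unsafe { slice::from_raw_parts_mut(out, len) };
    eval_batch(reqs, out)
}

/// Boundary overhead per request when one crossing serves a whole batch.
pub fn amortized_boundary_ns(crossing_ns: f64, batch_size: usize) -> f64 {
    crossing_ns / batch_size as f64
}
```

Taking the upper figure of 3 µs per crossing, `amortized_boundary_ns(3000.0, 100)` gives 30 ns per request for a batch of 100. The trade-off is that requests must wait for a batch to fill, so the batch size has to be weighed against the added queuing delay.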