The Problem We Were Actually Solving
The treasure-hunt server receives 50 MB/s of dynamic map events—player moves, loot spawns, fog-of-war reveals—and must broadcast deltas to 100 k sockets without re-serializing the entire world every tick.
The public docs show a simple YAML snippet under config.yaml:
world:
  width: 1024
  height: 1024
  chunk_size: 32
What they do not mention is the hidden oltp_workers: 4 knob that the YAML parser silently casts to a u16 and then divides by the core count.
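A minimal sketch of why that combination is dangerous, assuming the cast is a plain truncating cast and the division is ordinary integer division. The function name `effective_workers` and its parameters are hypothetical; the parser's real internals are not documented.

```rust
/// Hypothetical model of the hidden knob: the configured value is narrowed
/// to u16 and then integer-divided by the number of cores.
fn effective_workers(configured: u32, cores: u16) -> u16 {
    // `as` truncates silently: values above u16::MAX wrap instead of erroring.
    let narrowed = configured as u16;
    // Integer division rounds toward zero; guard against a zero core count.
    narrowed / cores.max(1)
}
```

Under this assumed model, `effective_workers(4, 8)` is 0 while `effective_workers(4, 2)` is 2, so the same `oltp_workers: 4` line would behave differently depending on the machine it lands on.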
Our perf profile at 28 k sessions with perf record -F99 -g -p <pid> showed 42 % of CPU burned in serde_yaml::from_reader waiting for the lock around the global IndexMap.
The real constraint was never CPU or GC; it was the JSON/YAML bridge that blocked on every config reload even though the server never changed those values at runtime.
What We Tried First (And Why It Failed)
We started with serde_yaml because the helm chart shipped a ConfigMap volume.
After profiling with flamegraph-rs we saw 1.8 μs per config reload, but multiplied by 28 k sessions and the Kubernetes watch events, we added 50 ms of tail latency every time the ConfigMap updated—even when the file content was identical.
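The 50 ms figure is just the per-reload cost multiplied out. A small sketch of that arithmetic, assuming every session pays the reload cost one after another behind the same lock (the helper name `serialized_reload_ms` is made up for illustration):

```rust
/// Worst-case total reload time in milliseconds when each session's reload
/// runs serially, e.g. behind a single global lock.
fn serialized_reload_ms(per_reload_us: f64, sessions: u32) -> f64 {
    // microseconds * sessions, then convert microseconds to milliseconds
    per_reload_us * f64::from(sessions) / 1_000.0
}
```

With the numbers above, `serialized_reload_ms(1.8, 28_000)` comes out at about 50.4 ms, which is where the tail-latency spike comes from.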
The stack trace was:
serde_yaml::indexmap::IndexMap<K,V>::entry
└── _raw_vec::RawVec<T,A>::reserve
The IndexMap kept reallocating the backing array on every watch trigger.
We tried serde_json with the same file; the parser was 2× faster, but the blocking I/O still destroyed tail latency.
The benchmark at 10 k players showed p99 = 34 ms; we needed < 50 ms to pass the load-test gate.
The Architecture Decision
We ripped out the whole config layer and replaced it with a two-part system:
- A compile-time constants module generated from a tiny TOML file (constants.toml) with build.rs.
- A sidecar gRPC service that only accepts runtime state diffs and streams them to the main process over a Unix domain socket.
The constants are embedded in the binary, so the treasure-hunt server never parses anything at runtime.
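Roughly, the code-generation half could look like the sketch below. It is not our actual build script: `render_constants` is a hypothetical helper, it only understands flat `key = integer` lines (no tables, strings, or arrays), and a real build.rs would more likely lean on a TOML crate.

```rust
use std::fmt::Write as _;

/// Turn flat `key = integer` lines (a tiny subset of TOML) into Rust `const`
/// items. Anything else, including `[section]` headers, is reported as an error.
fn render_constants(toml_src: &str) -> Result<String, String> {
    let mut out = String::from("// @generated from constants.toml; do not edit.\n");
    for (idx, raw) in toml_src.lines().enumerate() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("line {}: expected `key = value`", idx + 1))?;
        let value: u32 = value
            .trim()
            .parse()
            .map_err(|e| format!("line {}: {}", idx + 1, e))?;
        writeln!(out, "pub const {}: u32 = {};", key.trim().to_uppercase(), value)
            .map_err(|e| e.to_string())?;
    }
    Ok(out)
}

// Assumed wiring in build.rs (sketch only, error handling elided):
//   let src = std::fs::read_to_string("constants.toml")?;
//   let dest = std::path::Path::new(&std::env::var("OUT_DIR")?).join("constants.rs");
//   std::fs::write(dest, render_constants(&src)?)?;
//   println!("cargo:rerun-if-changed=constants.toml");
//
// And in the server crate:
//   include!(concat!(env!("OUT_DIR"), "/constants.rs"));
```

Because the output is ordinary `const` items, a bad value in constants.toml fails the build instead of surfacing at runtime.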
We moved the dynamic knobs—collision radius, loot table seed, rate limits—into a separate protobuf schema served by the sidecar.
The protobuf schema is versioned, delta-encoded, and uses the tonic async runtime, so the config change path is lock-free and non-blocking.
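To make "versioned, delta-encoded" concrete, here is a plain-Rust sketch of applying one diff. The struct and field names are hypothetical stand-ins for the protobuf messages, not the real schema, and the tonic transport is left out entirely.

```rust
/// Hypothetical snapshot of the runtime knobs named in the text.
#[derive(Debug, Clone, PartialEq)]
struct RuntimeConfig {
    version: u64,
    collision_radius: f32,
    loot_table_seed: u64,
    rate_limit_per_sec: u32,
}

/// A delta: only the fields that changed are `Some`.
#[derive(Debug, Default)]
struct ConfigDiff {
    version: u64,
    collision_radius: Option<f32>,
    loot_table_seed: Option<u64>,
    rate_limit_per_sec: Option<u32>,
}

/// Build a new snapshot from `current` plus `diff`.
/// Returns `None` when the diff is stale (its version is not newer).
fn apply_diff(current: &RuntimeConfig, diff: &ConfigDiff) -> Option<RuntimeConfig> {
    if diff.version <= current.version {
        return None;
    }
    Some(RuntimeConfig {
        version: diff.version,
        collision_radius: diff.collision_radius.unwrap_or(current.collision_radius),
        loot_table_seed: diff.loot_table_seed.unwrap_or(current.loot_table_seed),
        rate_limit_per_sec: diff.rate_limit_per_sec.unwrap_or(current.rate_limit_per_sec),
    })
}
```

Returning a fresh snapshot rather than mutating in place is one way to keep readers lock-free: the receiver can publish the new value with an atomic pointer swap (for example via the arc-swap crate) while in-flight readers keep the old one.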
The gRPC sidecar is itself written in Rust, and the main server now spends zero CPU on config parsing and zero wall time on file I/O.
What The Numbers Said After
After the change we re-ran the 28 k session test with perf stat -e cache-misses,instructions -d and saw:
Before:
42.1 % cache misses
1.3 s p99 w/ config updates
2 RTS (runtime scaling stalls)
After:
11.8 % cache misses
29 ms p99
8 RTS (no stalls)
Tail latency at 1 ms granularity (collected with tokio-console) dropped from 48 ms to 6 ms.
The sidecar measured 120 B/s of traffic even under load, so the diff protocol is effectively free.
We also removed the jemalloc dependency in the main process because the config hot path was gone; RSS dropped from 1.4 GB to 920 MB.
What I Would Do Differently
We should have asked on day one: Which subsystems are actually dynamic?
The docs hint at a combined.yaml that mixes compile-time constants with runtime overrides; that hint is a footgun.
Next time I see a YAML file in the critical path I will pre-process it with serde during the build, emit a generated Rust source file, and include! it: no runtime parsing, no locks, no surprises.
The only runtime configuration that survives will be the gRPC diff service, and that path is already async and lock-free by design.
The moment the YAML config parser became the enemy was the moment we stopped reading the docs and started profiling the real bottleneck.