The Configuration Layer Blew Up Before We Even Hit 10k Concurrent Users

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were solving the classic scale-to-zero problem, with a twist: configuration itself had become the load-bearing layer. At 256 bytes per connection, the gateway's heap was already 70% occupied by connection context, and the YAML parser was allocating a new HashMap for every key path. The key paths exploded because we had nested configuration for feature flags, logger levels, route maps, and circuit-breaker timeouts. By the time we reached 6k connections, the YAML parser had allocated 35k temporary hashmaps, each one 256 bytes on average, totaling roughly 9 MB (35,000 × 256 bytes) of short-lived memory. The Go runtime's GC, running every 200 ms, could not keep up. Tail latency grew from 42 ms to 1.4 s in 30 seconds. When the failover traffic landed, the GC pause spiked to 150 ms, the connection pool was exhausted, and the kernel OOM killer fired.

What We Tried First (And Why It Failed)

We rewrote the config layer on top of Viper to get hot reload and type-safe defaults. Viper promised no GC pressure because it kept a single global map. What actually happened: Viper's internal unmarshaler allocated a new reflect.Value slice for every nested struct. At 5k connections, each request triggered a full config revalidation, creating 1.2k reflect.Value slices. The GC ran every 150 ms, pausing the event loop for up to 98 ms. The 98th-percentile latency jumped to 290 ms. The failover drill failed again.

Next, we tried Consul-template to push config via environment variables. Consul-template forked a subprocess per template, each holding a 4 MB jemalloc arena. The arena size was proportional to the number of keys in the KV store, which at peak was 10k. After 30 minutes, RSS grew from 320 MB to 1.4 GB. The server swapped, then got OOM-killed.

The Architecture Decision

We needed a configuration runtime that did not allocate in the hot path and could update atomically without forking. We evaluated Rust's config-rs crate for its zero-copy parsing and in-place updates. The first obstacle was not the language; it was the team's collective confidence in memory safety guarantees. We ran a controlled experiment: one region kept the Go Viper layer, the other ran a Rust micro-service that exposed a gRPC endpoint for config. The Rust service used serde to parse TOML once at start-up, then served the config blob via gRPC using prost. We measured RSS via /proc/self/statm and latency via an io_uring ring buffer. After 24 hours at 12k connections, the Go region's RSS was 2.1 GB with GC pauses averaging 45 ms; the Rust region's RSS was 180 MB with zero GC pauses. The 98th-percentile latency in the Rust region was 37 ms, within the error margin of the idle baseline.
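
To make the "no allocation in the hot path, atomic updates" requirement concrete, here is a minimal sketch using only the Rust standard library. It is not the service's actual code, which parsed with serde and served the blob over gRPC. `ConfigStore`, `Config`, and the field names are hypothetical, and the `RwLock<Arc<_>>` swap is a simple stand-in; a lock-free pointer swap (for example the arc-swap crate) would be a natural substitute.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical config snapshot; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub version: u64,
    pub circuit_breaker_timeout_ms: u64,
    pub log_level: String,
}

/// Holds the current immutable snapshot. Readers clone an `Arc`
/// (a reference-count increment, no heap allocation); writers replace
/// the pointer in a single step.
pub struct ConfigStore {
    current: RwLock<Arc<Config>>,
}

impl ConfigStore {
    /// Build the store from a config that was parsed once at start-up.
    pub fn new(initial: Config) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Hot path: take a cheap handle to the current snapshot.
    pub fn load(&self) -> Arc<Config> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Cold path: publish a fully built snapshot. Requests that already
    /// hold the old `Arc` keep using it until they drop it, so no reader
    /// ever observes a half-applied update.
    pub fn publish(&self, next: Config) {
        let next = Arc::new(next); // allocate outside the write lock
        *self.current.write().expect("config lock poisoned") = next;
    }
}
```

A request handler would call `store.load()` once at the start of a request and read fields from the returned handle; the update path calls `store.publish(new_config)` after the new snapshot has been parsed and validated off the hot path.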

The trade-off was how quickly a config update reached request handling. The Go layer updated config in under 5 ms; the Rust layer via gRPC added 3 ms of round-trip. We decided 3 ms was acceptable because config updates are infrequent (at most once per five minutes) and the gain in tail latency and memory density outweighed the delta. We also dropped TOML in favor of a flat JSON blob generated by a build script, cutting the parser's peak RSS from 3.4 MB to 1.2 MB.
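
For readers wondering what "flat" means here, the sketch below shows one way a build step could collapse nested config into dotted key paths. It is an assumption about the shape of the transformation, not our build script: the `Value` type, the `flatten` function, and the dotted-key convention are all hypothetical, and serializing the resulting map to JSON is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical nested config value; the real schema is not shown here.
pub enum Value {
    Leaf(String),
    Table(Vec<(String, Value)>),
}

/// Walk the tree once and emit one `a.b.c -> value` entry per leaf.
pub fn flatten(root: &Value) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    let mut path = String::new();
    walk(root, &mut path, &mut out);
    out
}

fn walk(node: &Value, path: &mut String, out: &mut BTreeMap<String, String>) {
    match node {
        Value::Leaf(v) => {
            out.insert(path.clone(), v.clone());
        }
        Value::Table(entries) => {
            for (key, child) in entries {
                let saved = path.len();
                if !path.is_empty() {
                    path.push('.');
                }
                path.push_str(key);
                walk(child, path, out);
                // Restore the parent prefix before visiting the next sibling.
                path.truncate(saved);
            }
        }
    }
}
```

With a flat map, a lookup such as `circuit_breaker.timeout_ms` is a single key access at runtime instead of a walk through nested tables, and all the nesting work happens at build time.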

What The Numbers Said After

We ran a 72-hour chaos test on the Rust config layer. Peak connections: 35k. Peak RSS: 840 MB. GC pauses: none, because there is no GC. 99.9th-percentile latency: 82 ms. During a rolling restart of three config pods, latency spiked to 89 ms for 2.3 seconds, then recovered. No OOM events. The HAProxy logs showed zero backend errors. The on-call engineer's pager stayed silent.

The memory density improvement let us shrink the compute footprint by about 40%, dropping from 18 nodes to 11 nodes at the same request volume. The bill from the cloud provider fell from $3.2k per day to $1.9k. The team stopped fearing the 10k user mark; we started planning for 100k.

What I Would Do Differently

I would not have trusted the Go GC to handle the connection context heap. The Go runtime's GC is excellent for general-purpose servers, but when the configuration layer allocates under load, the GC becomes a coupling point that drags the entire tail latency distribution with it. I would also avoid any framework that encourages hot-path allocation, even if it promises type safety. The team's first mistake was optimizing for ergonomics instead of memory density. The second mistake was measuring only happy-path latency and ignoring GC telemetry in production. The third mistake was not profiling the parser with a flame graph before rolling it to staging. Today, we run perf record on the config micro-service on every deploy, and we have a Prometheus metric called config_allocation_rate_bytes_per_second. The moment that metric exceeds 1 KB/s, we roll back the build automatically.
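
One way a metric like config_allocation_rate_bytes_per_second could be sourced is a counting wrapper around the system allocator. The sketch below is an illustration under assumptions, not our production exporter: `CountingAlloc`, `allocated_bytes`, `allocation_rate`, and `should_roll_back` are hypothetical names, 1 KB is taken to mean 1024 bytes, and the Prometheus export and the rollback trigger itself are omitted.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Total bytes requested from the allocator since process start.
static ALLOCATED_BYTES: AtomicU64 = AtomicU64::new(0);

/// Wraps the system allocator and counts every requested byte.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Current value of the monotonically increasing byte counter.
pub fn allocated_bytes() -> u64 {
    ALLOCATED_BYTES.load(Ordering::Relaxed)
}

/// Assumed reading of the 1 KB/s threshold: 1024 bytes per second.
pub const ROLLBACK_THRESHOLD_BYTES_PER_SEC: f64 = 1024.0;

/// Bytes allocated per second between two counter samples.
/// Returns 0.0 for an empty interval to avoid dividing by zero.
pub fn allocation_rate(before: u64, after: u64, elapsed: Duration) -> f64 {
    if elapsed.is_zero() {
        return 0.0;
    }
    after.saturating_sub(before) as f64 / elapsed.as_secs_f64()
}

/// True when the sampled rate is strictly above the threshold.
pub fn should_roll_back(rate_bytes_per_sec: f64) -> bool {
    rate_bytes_per_sec > ROLLBACK_THRESHOLD_BYTES_PER_SEC
}
```

A sampler would read `allocated_bytes()` at a fixed interval, feed two consecutive samples into `allocation_rate`, and export the result as a gauge. Counting requested bytes rather than live bytes is deliberate in this sketch: short-lived allocations are exactly what hurt us, and they never show up in a resident-memory graph.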
