That 0.8-Second P99 Latency Cliff in Production Wasn't Supposed to Happen

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

We built the Treasure Hunt Engine to process millions of concurrent matchmaking rounds. Each round required sub-300 ms latency end-to-end: ingest a player request, resolve their region, queue them, and return an assignment. Early on we'd solved the core game logic in Go, but as traffic crossed 50 k concurrent sessions we realized the bottleneck wasn't the Go service. It was the Redis-backed configuration layer named Veltrix.

Veltrix was billed as a lightweight configuration overlay that let us toggle game parameters without redeploying. In practice it did three things:

Stored live configs in Redis with a 30-second cache TTL.
Published changes via a built-in Lua publisher-subscriber script.
Exposed a gRPC endpoint so services could fetch configs on every request.

That third point is where we went wrong. By design, every player request triggered a gRPC call to Veltrix before the round could even start. At 150 k req/s, that amounted to 150 k gRPC round trips per second, all backed by a single Redis instance. The Lua pub-sub meant every config change flushed the entire cache on every node, which in turn triggered a thundering herd of gRPC calls to repopulate it. At 02:47 one such flush coincided with an upstream dependency timing out after 250 ms, and suddenly we had 30 k in-flight gRPC calls, each waiting for a cache miss to resolve. A single cache stampede turned a routine traffic uptick into a 700 ms P99 outage.
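
To make the stampede concrete, here is a minimal simulation, not our production code. The `nodeCache` type, the `stampede` function, and the `weights-v1`/`weights-v2` values are hypothetical names made up for illustration. It models the worst case described above: a flush lands while n requests are in flight, every one of them observes the miss before any refill completes, and so every one of them goes to the backend.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// nodeCache is a hypothetical stand-in for one node's config cache.
type nodeCache struct {
	mu    sync.RWMutex
	value string
	valid bool
}

func (c *nodeCache) get() (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.value, c.valid
}

func (c *nodeCache) set(v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value, c.valid = v, true
}

func (c *nodeCache) flush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value, c.valid = "", false
}

// stampede flushes the cache while n requests are in flight and returns how
// many of them fall through to the backend.
func stampede(n int) int64 {
	cache := &nodeCache{}
	cache.set("weights-v1")
	cache.flush() // a config change invalidates the cached entry

	var backendCalls int64
	var observed, done sync.WaitGroup
	observed.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			_, hit := cache.get()
			observed.Done()
			if hit {
				return
			}
			// Modelling assumption: every request sees the miss before
			// any refill lands, so nobody benefits from a neighbour's fetch.
			observed.Wait()
			atomic.AddInt64(&backendCalls, 1) // stands in for one gRPC + Redis round trip
			cache.set("weights-v2")
		}()
	}
	done.Wait()
	return atomic.LoadInt64(&backendCalls)
}

func main() {
	fmt.Println("backend fetches after one flush:", stampede(1000))
}
```

With 1000 in-flight requests the simulation reports 1000 backend fetches for a single logical config change. Real traffic is messier than this barrier-based model, but the shape is the same: the cost of a flush scales with the number of requests in flight, not with the number of configs that changed.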

What We Tried First (And Why It Failed)

Our first reflex was to increase the Veltrix instance size. We moved from a c6g.large to a c6g.4xlarge and doubled the Redis memory limit. That helped for a day, but the next traffic spike still caused the same cascade: Redis memory spiked to 95 % and the Go runtime began blocking during GC, which pushed gRPC calls past their deadlines, which in turn caused more client retries. Worse, the Lua flushes now had to invalidate more memory, so the flush operation itself took 400 ms instead of 80 ms. So we tried disabling the Lua flush entirely and setting a longer TTL, but then pushing a config change required a rolling restart of every node, which took six minutes and still left us with stale configs on some boxes.

Next we tried colocating a local Redis replica on each k8s node so a cache miss wouldn't have to cross the network. The idea sounded good until we discovered that the local replicas weren't in sync; one node's TTL timer fired a second early and propagated a stale weight parameter, causing the matchmaker to assign players to the wrong region for 45 seconds. After rolling that back we tried running Veltrix in cluster mode, but the Lua pub-sub didn't scale horizontally. All nodes still listened to the same channel, so any config change still flushed every local cache anyway.

The Architecture Decision

By the third day we accepted that Veltrix as originally designed was fundamentally incompatible with our load profile. The team gathered in a war room and hashed out a replacement called ConfigEdge.

ConfigEdge split the problem into two layers:

A control plane that held authoritative configs in a Git-backed store (we chose Flux CD + a CRD).
A data plane that replicated configs to every node via a sidecar called ConfigRelay, which used a file-system watcher instead of gRPC.

The control plane exposed a single REST endpoint for operators to push config updates, and Flux reconciled the Git commit to every k8s cluster within 15 seconds. The data plane used a tiny WASM runtime that watched the node-local filesystem, refreshed configs every 5 seconds without blocking the game loop, and exposed a read-only memory-mapped file that the Go service could map once and then read in about 50 ns. No gRPC, no Lua flushes, no Redis at all.
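
For readers who haven't used mmap from Go, here is a minimal sketch of the read side, not the actual ConfigRelay code. The `GameConfig` fields, the JSON encoding, the `game.json` file name, and the `readMapped` and `writeThenRead` helpers are all assumptions for illustration; the post doesn't depend on any of them. It uses `syscall.Mmap`, so it runs on Linux and macOS but not on Windows.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// GameConfig is a hypothetical schema; the real config shape is not shown here.
type GameConfig struct {
	RegionWeight float64 `json:"region_weight"`
	MaxQueueSize int     `json:"max_queue_size"`
}

// readMapped maps the file read-only and decodes it straight from the mapping.
func readMapped(path string) (GameConfig, error) {
	var cfg GameConfig
	f, err := os.Open(path)
	if err != nil {
		return cfg, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return cfg, err
	}
	if info.Size() == 0 {
		// Mapping a zero-length file fails, so reject it explicitly.
		return cfg, fmt.Errorf("config file %q is empty", path)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return cfg, err
	}
	// Safe to unmap on return: the decoded struct holds no references
	// into the mapped bytes.
	defer syscall.Munmap(data)

	err = json.Unmarshal(data, &cfg)
	return cfg, err
}

// writeThenRead plays the sidecar's role for the demo: it writes a
// read-only config file into a temp dir, then reads it back via mmap.
func writeThenRead(raw string) (GameConfig, error) {
	dir, err := os.MkdirTemp("", "configedge")
	if err != nil {
		return GameConfig{}, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "game.json")
	if err := os.WriteFile(path, []byte(raw), 0o444); err != nil {
		return GameConfig{}, err
	}
	return readMapped(path)
}

func main() {
	cfg, err := writeThenRead(`{"region_weight": 0.75, "max_queue_size": 500}`)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

This sketch maps and unmaps on every load for simplicity. A long-lived service would more likely keep the mapping open and only re-parse when the sidecar swaps the file, which is what keeps the per-request cost down to a memory read.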

We kept one Redis instance for the legacy Veltrix path for two weeks while we instrumented ConfigEdge. During that period we finally isolated the original failure: a single Lua publish call lasted 47 ms when Redis was at 92 % memory, and that delay triggered the 250 ms upstream timeout, which in turn caused 30 k client retries. With ConfigEdge in place, the same config push took 5 ms for the Git commit and 15 seconds for the reconciliation wave, and the 50 ns mmap read meant the Go service never blocked.

What The Numbers Said After

Two weeks after the rollout we ran a 400 k concurrent load test. The Treasure Hunt Engine stayed below 220 ms P99 for the entire test, and the longest config refresh took only 17 ms on the control plane and added no blocking time on the data plane. Redis was completely retired from the critical path.

Traffic after go-live showed a 37 % reduction in average CPU per pod because we removed the gRPC hops. The ConfigEdge sidecar used 1.2 MB of RAM per node and had a startup latency of 8 ms, well within our SLA for cold starts. Most importantly, the on-call rotation stopped getting paged for Redis cache stampedes at 3 a.m.

What I Would Do Differently

Never let a configuration system piggyback on the hot path. If a player request can't complete without fetching a config, that config must live either in memory or in a local cache that never blocks. I would have built the mmap file first and used Redis only for operator dashboards, not for live gameplay.
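
One way to get "in memory and never blocks" in Go is an atomically swapped snapshot: a background goroutine refreshes it, and the request path only ever does an atomic load. The sketch below is a generic illustration of that pattern under assumed names (`Params`, `Snapshot`, `Update`, `Refresh`, `errLoad`), not the ConfigEdge implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// Params is a hypothetical set of game parameters.
type Params struct {
	Version      int
	RegionWeight float64
}

// errLoad is a placeholder error used to demonstrate a failed refresh.
var errLoad = errors.New("load failed")

// Snapshot holds the latest Params; readers never take a lock.
type Snapshot struct {
	v atomic.Value // always stores a Params value
}

func NewSnapshot(initial Params) *Snapshot {
	s := &Snapshot{}
	s.v.Store(initial)
	return s
}

// Get is the hot-path read: one atomic load, no I/O, no waiting.
func (s *Snapshot) Get() Params {
	return s.v.Load().(Params)
}

// Update runs one refresh. On error the previous snapshot stays in place,
// so a broken source degrades to stale config instead of blocked requests.
func (s *Snapshot) Update(load func() (Params, error)) error {
	p, err := load()
	if err != nil {
		return err
	}
	s.v.Store(p)
	return nil
}

// Refresh calls Update on every tick until stop is closed.
func (s *Snapshot) Refresh(interval time.Duration, load func() (Params, error), stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			_ = s.Update(load) // errors are ignored here; real code would log them
		case <-stop:
			return
		}
	}
}

func main() {
	s := NewSnapshot(Params{Version: 1, RegionWeight: 0.5})

	stop := make(chan struct{})
	go s.Refresh(10*time.Millisecond, func() (Params, error) {
		return Params{Version: 2, RegionWeight: 0.75}, nil
	}, stop)

	time.Sleep(50 * time.Millisecond)
	close(stop)

	// A failing source leaves the last good snapshot untouched.
	_ = s.Update(func() (Params, error) { return Params{}, errLoad })
	fmt.Printf("%+v\n", s.Get())
}
```

The design choice that matters is the failure mode: when the loader fails or is slow, requests keep being served from the last good snapshot rather than queueing behind a fetch.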

Also, we should have asked earlier why Veltrix's own documentation warned against high-frequency config changes. The answer was buried in a footnote: at