Veltrix Went Down at 10K QPS Because We Didn't Model the Config Layer as a First-Class Service

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

By month six the treasure-hunt engine was doing 4,000 QPS with P99 latency of 32 ms. Product asked for dynamic difficulty curves that could change every minute without a deploy. We sketched a simple path: expose a /config/difficulty endpoint, cache the JSON in Redis, and let the Go backend poll the cache.

The real problem wasn't latency; it was cognitive load. We had twenty-four knobs (spawn rate, loot tables, multiplier decay, season start times), all living in a single config.yaml under ./deploy/values. Helm rendered it, Kustomize layered secrets on top, and ConfigCat sidecars watched GitHub for new commits and then triggered a rolling restart of the stateless pods. The rollout strategy gave us a 37-second window where old and new config coexisted. At 4K QPS that window was fine. At 10K it became a denial-of-service engine.

The first outage happened at 09:14 when someone updated loot tiers. The new config raised the spawn rate from 0.3 to 0.9 and cut multiplier decay from 8 s to 3 s. The sudden spike in spawn events overwhelmed our event bus (Kafka) and the downstream loot services. P99 jumped to 1.1 s. Pages fired. We rolled back to the previous chart version. Helm diff showed no drift. ConfigCat didn't detect the rollback until its next poll cycle, twenty seconds later, so pods kept using the bad config. The SLO burn was 42 minutes.

What We Tried First (And Why It Fails)

Our first fix was to move the YAML into a key-value store backed by etcd. We deployed etcd as a three-node Raft cluster, wrapped it in a thin gRPC service called ConfigSrv, and let the Go pods subscribe via a watch. This killed the rolling-restart problem because we no longer restarted pods on every config change. The etcd watch propagated changes in under two milliseconds.

But ConfigSrv itself turned into a snowflake service. It required TLS everywhere, ACLs, and quorum writes. During a regional failover test we lost one node and the service dropped 10% of writes. The queue backed up, ConfigCat retries saturated the control plane, and we spent two days tuning backoff and max-inflight.
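For readers who have not tuned these two knobs before, here is a minimal sketch of capped exponential backoff with jitter plus a max-inflight limit. It is illustrative only, not the ConfigSrv client code: the names backoffDelay, limiter, and writeWithRetry, and the 10 ms base and 1 s ceiling, are assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the capped exponential delay for a retry attempt
// (attempt 0 is the first retry). base and ceiling are hypothetical tunables.
func backoffDelay(attempt int, base, ceiling time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		d = ceiling
	}
	return d
}

// withJitter picks a random delay in [0, d] so that many clients retrying
// at once do not synchronize into waves.
func withJitter(d time.Duration) time.Duration {
	if d <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// limiter caps the number of in-flight writes using a buffered channel.
type limiter chan struct{}

func newLimiter(maxInflight int) limiter { return make(limiter, maxInflight) }

// tryAcquire reports whether a slot was free. Callers shed load on false
// instead of queueing without bound.
func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l limiter) release() { <-l }

// writeWithRetry runs op up to maxAttempts times, sleeping a jittered
// backoff between failed attempts and giving up early if ctx is cancelled.
func writeWithRetry(ctx context.Context, l limiter, maxAttempts int, op func() error) error {
	if !l.tryAcquire() {
		return errors.New("too many in-flight writes")
	}
	defer l.release()

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		delay := withJitter(backoffDelay(attempt, 10*time.Millisecond, time.Second))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	calls := 0
	err := writeWithRetry(context.Background(), newLimiter(2), 5, func() error {
		calls++
		if calls < 3 {
			return errors.New("quorum unavailable")
		}
		return nil
	})
	fmt.Println(calls, err) // the third call succeeds
}
```

The limiter rejects work instead of queueing it, which is the behavior you want when a backed-up queue is itself the thing saturating the control plane.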

Then we tried ConfigCat's own HTTP distribution endpoint. It promised instant push to any pod that registered. In practice the endpoint had a 100 ms latency spike every fifth request. At 10K QPS the spikes became a regular contributor to a 1-second P99. More importantly, ConfigCat's webhook signature verification was single-threaded and couldn't keep up with our 500 pods re-registering after each pod restart.

The Architecture Decision

We scrapped ConfigCat and ConfigSrv. We rebuilt the configuration layer as an in-memory service mesh called Static Mesh. Static Mesh is a single static binary that embeds a lightweight KV store (go-memkv) and exposes a gRPC endpoint /v1/config.Get. The binary is shipped once per container image. There is no rolling restart. Static Mesh watches etcd for changes, gossips updates via a CRDT protocol, and never talks to the control plane after startup.
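To make the read path concrete, here is a minimal sketch of an in-memory store that serves lock-free reads from an immutable snapshot and swaps the whole snapshot atomically. It is not the Static Mesh or go-memkv source: the Snapshot and Store types are hypothetical, and the merge rule is a plain highest-version-wins register, which is far simpler than a full CRDT gossip protocol. It needs Go 1.19 or later for atomic.Pointer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot is an immutable view of the config at one version.
type Snapshot struct {
	Version int
	Values  map[string]string
}

// Store serves reads from the current snapshot and replaces it atomically,
// so a reader never observes a mix of old and new values.
type Store struct {
	cur atomic.Pointer[Snapshot]
}

// NewStore starts at version 1 with a private copy of the initial values.
func NewStore(initial map[string]string) *Store {
	s := &Store{}
	s.cur.Store(&Snapshot{Version: 1, Values: copyMap(initial)})
	return s
}

// Get reads one key from the current snapshot without taking a lock.
func (s *Store) Get(key string) (value string, version int, ok bool) {
	snap := s.cur.Load()
	value, ok = snap.Values[key]
	return value, snap.Version, ok
}

// Apply installs a newer snapshot. Stale or duplicate versions are ignored,
// so out-of-order or repeated delivery is harmless.
func (s *Store) Apply(version int, values map[string]string) bool {
	next := &Snapshot{Version: version, Values: copyMap(values)}
	for {
		old := s.cur.Load()
		if version <= old.Version {
			return false
		}
		if s.cur.CompareAndSwap(old, next) {
			return true
		}
	}
}

// copyMap keeps callers from mutating a snapshot after it is published.
func copyMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func main() {
	s := NewStore(map[string]string{"spawn_rate": "0.3"})
	s.Apply(2, map[string]string{"spawn_rate": "0.9"})
	s.Apply(1, map[string]string{"spawn_rate": "0.1"}) // stale, ignored
	v, ver, _ := s.Get("spawn_rate")
	fmt.Println(v, ver)
}
```

Swapping a whole snapshot rather than updating keys one at a time is the design choice that matters here: a request sees either the old config or the new one, never a mix of the two.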

Key tradeoffs:

We traded GitOps immutability for instant, synchronous config access.
We traded ConfigCat's managed dashboards for a simple Prometheus metric static_mesh_last_update_timestamp.
We traded Helm templating for a single TOML file baked into the image.
We gained 99.98 % availability for config reads (measured over 30 days).
We lost the ability to change config without a container image rebuild.

The image rebuild cycle is now our deployment pipeline; it takes 3 minutes from git commit to new pods rolling. That's acceptable because we only change config when we change core game balance, which happens roughly once per sprint.

What The Numbers Said After

Static Mesh has been running in prod for 87 days. The metrics are brutal:

Config read latency P99: 0.4 ms (was 100 ms under ConfigCat).
Outage minutes per month: 0.12 (was 42).
Control-plane egress traffic: 2.1 MB/day (was 180 MB/day).
Rollback time for bad config: 2 minutes (was 42 minutes).
Pod restarts that trigger config reload: zero.

The only new failure mode is operator error—someone edits a TOML file, forgets to bump the image tag, and the change never ships. We mitigated it by adding a pre-commit check that ensures every config change increments a version field in the TOML. CI enforces the rule.
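A check like this can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the actual hook: it assumes the TOML carries a top-level integer key literally named version, it finds that key with a regular expression instead of a real TOML parser, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionLine matches a line such as `version = 7`, with an optional
// trailing comment. Assumption: the key is a top-level integer.
var versionLine = regexp.MustCompile(`(?m)^[ \t]*version[ \t]*=[ \t]*(\d+)[ \t]*(?:#.*)?$`)

// parseVersion extracts the first integer version field from TOML text.
func parseVersion(toml string) (int, error) {
	m := versionLine.FindStringSubmatch(toml)
	if m == nil {
		return 0, fmt.Errorf("no integer version field found")
	}
	return strconv.Atoi(m[1])
}

// checkVersionBump returns an error when the config content changed but
// the version field did not strictly increase.
func checkVersionBump(oldTOML, newTOML string) error {
	if oldTOML == newTOML {
		return nil // nothing changed, nothing to enforce
	}
	oldV, err := parseVersion(oldTOML)
	if err != nil {
		return fmt.Errorf("old config: %w", err)
	}
	newV, err := parseVersion(newTOML)
	if err != nil {
		return fmt.Errorf("new config: %w", err)
	}
	if newV <= oldV {
		return fmt.Errorf("config changed but version went from %d to %d; bump it", oldV, newV)
	}
	return nil
}

func main() {
	before := "version = 3\nspawn_rate = 0.3\n"
	forgot := "version = 3\nspawn_rate = 0.9\n"
	bumped := "version = 4\nspawn_rate = 0.9\n"
	fmt.Println(checkVersionBump(before, forgot)) // error: version not bumped
	fmt.Println(checkVersionBump(before, bumped)) // <nil>
}
```

In a real hook the two strings would come from the committed and staged copies of the file, and a nonzero exit on error would block the commit.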

What I Would Do Differently

I should have modeled the configuration subsystem as a first-class service from day one. Calling it a sidecar and giving it two different implementations (etcd, ConfigCat) was premature generalization. Next time I'll start with a single Go binary that embeds an in-memory store and add gossip later if the scale actually demands it.

Second, I would have instrumented the old system with a metric called config_staleness_seconds and set a 5-second alert. We only added the metric after the outage, and it would have saved us the first page.
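One possible definition of that metric, sketched in Go: staleness is how long a newer config version has been published without the pod applying it. The definition, the function names, and the use of integer Unix seconds are assumptions made for illustration. A real exporter would publish the value as a Prometheus gauge rather than print it.

```go
package main

import (
	"fmt"
	"time"
)

// alertThresholdSeconds mirrors the 5-second alert described above.
const alertThresholdSeconds = 5

// configStalenessSeconds returns how long a newer config version has been
// published without this pod applying it. It is 0 when the pod is current.
func configStalenessSeconds(nowUnix, publishedAtUnix int64, appliedVersion, publishedVersion int) int64 {
	if appliedVersion >= publishedVersion {
		return 0
	}
	if nowUnix < publishedAtUnix {
		return 0 // clock skew: never report negative staleness
	}
	return nowUnix - publishedAtUnix
}

// shouldAlert reports whether staleness has exceeded the threshold.
func shouldAlert(stalenessSeconds int64) bool {
	return stalenessSeconds > alertThresholdSeconds
}

func main() {
	now := time.Now().Unix()
	publishedAt := now - 20 // a newer version was published 20 s ago
	s := configStalenessSeconds(now, publishedAt, 41, 42)
	fmt.Printf("config_staleness_seconds=%d alert=%v\n", s, shouldAlert(s))
}
```

With a 20-second poll cycle like the one in the first outage, a pod running the old version would cross the 5-second threshold well before the next poll.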

Finally, I would not delegate config distribution to a third-party tool when the blast radius is higher than five minutes of SLO burn. Static Mesh is less flashy than ConfigCat's dashboard, but it never wakes me up at 3 a.m.