The Day Veltrix Scale Decided to Lie Down and Die at 200 RPS

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late 2024, the product team pushed us to support real-time treasure hunting events—imagine massive multiplayer games where users race to collect virtual artifacts, and every movement, click, and skip triggers an event. The SLA was 100ms p99 latency and zero data loss. The math was brutal: 100k concurrent users could generate up to 120k events per second at peak. We needed a routing engine that could fan out events with millisecond precision and reconfigure on the fly.

Veltrix was born: a YAML-driven configuration layer that mapped event types to downstream consumers, with a tiny Go runtime to reload configs without restarting. It looked elegant. It felt declarative. And it fell over at 200 RPS like a house of cards in a hurricane. The fatal flaw wasn't in the routing logic; it was in the configuration reload path. Every time we updated the routing table, the runtime walked the entire config directory, parsed every YAML file, and rebuilt an in-memory tree. On a cold start with 30 event types, that took 45ms. Under load, it spiraled into 200ms pauses. And when Kafka's consumer lag climbed, the runtime tried to reload again. And again. Until Prometheus screamed about high GC pressure and the Go runtime choked on a 32GB heap.
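
To make the cost of that reload path concrete, here is a minimal sketch of a "walk everything, parse everything" reload. It is an illustration under stated assumptions, not the original Veltrix code: the file layout, the parseRoute helper (a stand-in for a real YAML parser), and the event and consumer names are all hypothetical, and an in-memory filesystem replaces the real config directory.

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// parseRoute is a hypothetical stand-in for a real YAML parser: it only
// understands flat "key: value" lines, which is enough for this sketch.
func parseRoute(data []byte) map[string]string {
	fields := make(map[string]string)
	for _, line := range strings.Split(string(data), "\n") {
		if k, v, ok := strings.Cut(line, ":"); ok {
			fields[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return fields
}

// reloadAll rebuilds the whole routing table from scratch and reports how
// many files it had to parse to do so.
func reloadAll(fsys fs.FS) (map[string]string, int, error) {
	table := make(map[string]string)
	parsed := 0
	err := fs.WalkDir(fsys, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return nil
		}
		data, err := fs.ReadFile(fsys, path)
		if err != nil {
			return err
		}
		route := parseRoute(data)
		table[route["event"]] = route["consumer"]
		parsed++
		return nil
	})
	return table, parsed, err
}

// demoFullReload edits one of 30 route files, reloads, and returns how
// many files the reload parsed.
func demoFullReload() int {
	fsys := fstest.MapFS{}
	for i := 0; i < 30; i++ {
		name := fmt.Sprintf("routes/event%02d.yaml", i)
		body := fmt.Sprintf("event: event%02d\nconsumer: consumer-a", i)
		fsys[name] = &fstest.MapFile{Data: []byte(body)}
	}
	// Change a single route, then reload the way a full-walk runtime would.
	fsys["routes/event07.yaml"].Data = []byte("event: event07\nconsumer: consumer-b")
	_, parsed, err := reloadAll(fsys)
	if err != nil {
		panic(err)
	}
	return parsed
}

func main() {
	fmt.Println("files parsed after editing one route:", demoFullReload())
}
```

One edited file still costs 30 parses and a fresh table allocation. The work scales with the size of the config directory, not with the size of the change, and every rebuild leaves the previous table behind for the garbage collector.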

What We Tried First (And Why It Failed)

Our first attempt was to optimize the YAML parser. We swapped the standard gopkg.in/yaml.v3 for github.com/goccy/go-yaml, which promised 2–3x faster parsing and lower memory. It helped (parsing dropped from 15ms to 6ms), but the real bottleneck wasn't parsing. It was the full directory walk on every reload. We tried file watchers: fsnotify, watcher, even a custom inotify wrapper. All triggered reloads on any file change, even if only one YAML line changed. Result: 17 reloads per second during peak chaos, each rebuilding the entire routing graph. We tried debouncing: 500ms, then 1s, then 2s. The latency SLA still died. At 400 RPS, the Go runtime spent 70% of its time in GC, and the p99 latency ballooned to 480ms. Worse, the memory profiler showed we were leaking yaml.Node objects because we weren't closing file handles properly. The error message in the logs was a quiet killer: runtime: out of memory, followed by fatal error: runtime exhausted.
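
For readers who have not built one, a trailing-edge debouncer looks roughly like the sketch below. The Debouncer type and its method names are hypothetical, and the window is shortened to 200ms so the demo finishes quickly; the pattern itself only relies on time.AfterFunc from the standard library.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Debouncer collapses a burst of change notifications into a single call
// to fn, fired once the notifications have been quiet for one window.
type Debouncer struct {
	mu     sync.Mutex
	timer  *time.Timer
	window time.Duration
	fn     func()
}

func NewDebouncer(window time.Duration, fn func()) *Debouncer {
	return &Debouncer{window: window, fn: fn}
}

// Notify records one file-change event. Each call cancels the pending
// timer (if any) and pushes the reload deadline back by one window.
func (d *Debouncer) Notify() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.window, d.fn)
}

// demoBurst sends 17 notifications about 1ms apart and returns how many
// reloads actually ran.
func demoBurst() int64 {
	var reloads int64
	d := NewDebouncer(200*time.Millisecond, func() {
		atomic.AddInt64(&reloads, 1) // a real callback would rebuild the routing table
	})
	for i := 0; i < 17; i++ {
		d.Notify()
		time.Sleep(time.Millisecond)
	}
	time.Sleep(500 * time.Millisecond) // give the trailing timer time to fire
	return atomic.LoadInt64(&reloads)
}

func main() {
	fmt.Println("reloads after a burst of 17 changes:", demoBurst())
}
```

The burst collapses into a single reload, which is exactly what debouncing is good for. What it cannot do is make that one reload cheaper: the callback is still a full rebuild, and a longer window only delays the moment the pause happens.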

We even tried to offload the routing logic to Lua scripts via gopher-lua. The theory was that hot-reloading Lua scripts would be cheaper than rebuilding a Go struct. But the bridge between Go and Lua added 8–12ms per event, and the Lua GC fought with Go's for the same heap. We reverted after 10 days. The benchmarks told the truth: Lua didn't save us; it just gave us a new place to hang.

The Architecture Decision

Enough was enough. We ripped out the YAML-based configuration layer and replaced it with a routing table compiled at build time. Here's how it worked: we defined routes in .proto files, compiled them to Go code using protoc and protoc-gen-go, and embedded the generated route.pb.go directly into the binary. No file walks. No YAML parsing. No file-based hot reloads in production.

Instead of reloading configs, we introduced a control plane: a tiny Go gRPC service called Veltrix-CP that exposed a single endpoint, UpdateRouteMap(bytes []byte). The request carried a base64-encoded RouteMap protobuf. The runtime received it, deserialized it in place using a pre-allocated buffer, and swapped the routing table under a sync.RWMutex write lock, so readers never saw a half-built table. The lock had 1ns uncontended latency. The whole update took 3–4ms. Zero GC churn. Zero directory walks.
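
The swap itself can be sketched in a few lines. This is a simplified illustration, not the production code: RouteMap here is a plain Go struct standing in for the generated protobuf type, the Router type and the event and consumer names are hypothetical, and decoding is left out so that only the locking pattern is visible.

```go
package main

import (
	"fmt"
	"sync"
)

// RouteMap is a stand-in for the generated protobuf message. Once a
// RouteMap has been handed to Swap it must be treated as read-only.
type RouteMap struct {
	Version int
	Routes  map[string][]string // event type -> consumer names
}

// Router holds the current routing table behind a read-write lock.
type Router struct {
	mu      sync.RWMutex
	current *RouteMap
}

func NewRouter(initial *RouteMap) *Router {
	if initial == nil {
		initial = &RouteMap{}
	}
	return &Router{current: initial}
}

// Lookup is the hot path. It holds the read lock only long enough to copy
// the table pointer, then reads from that immutable snapshot.
func (r *Router) Lookup(eventType string) []string {
	r.mu.RLock()
	rm := r.current
	r.mu.RUnlock()
	return rm.Routes[eventType] // callers must not modify the returned slice
}

// Swap installs a fully built table. The write lock is held for a single
// pointer assignment, so readers are blocked only for that instant.
func (r *Router) Swap(next *RouteMap) {
	if next == nil {
		return // never replace a working table with nothing
	}
	r.mu.Lock()
	r.current = next
	r.mu.Unlock()
}

func main() {
	r := NewRouter(&RouteMap{
		Version: 1,
		Routes:  map[string][]string{"artifact.collected": {"scoring"}},
	})
	fmt.Println(r.Lookup("artifact.collected"))

	r.Swap(&RouteMap{
		Version: 2,
		Routes:  map[string][]string{"artifact.collected": {"scoring", "leaderboard"}},
	})
	fmt.Println(r.Lookup("artifact.collected"))
}
```

The important property is that the new table is built completely before the lock is taken. All the expensive work happens outside the critical section, and the event path only ever sees the old table or the new one. An atomic pointer would be another way to get the same effect; the mutex version is shown because it is what the post describes.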

We also decoupled the config update path from the event path. Veltrix-CP ran in a sidecar. The main Veltrix engine only listened to Kafka events and a single RouteUpdate gRPC stream. If the control plane died, the engine kept routing with the last known good route map. If the engine panicked, Kubernetes restarted it in 1.2 seconds thanks to our livenessProbe set to /healthz.
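
The "last known good" behavior is easy to sketch once updates arrive over a stream. In the illustration below a Go channel stands in for the RouteUpdate gRPC stream and JSON stands in for the protobuf payload, purely so that the example needs nothing beyond the standard library. The Engine type, its methods, and the event names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// RouteMap stands in for the protobuf message; JSON replaces protobuf
// here only to keep the sketch dependency-free.
type RouteMap struct {
	Routes map[string][]string `json:"routes"`
}

// Engine routes events using whatever table was last accepted.
type Engine struct {
	mu     sync.RWMutex
	routes *RouteMap
}

// decode parses and validates a payload before anything is published.
func decode(payload []byte) (*RouteMap, error) {
	var rm RouteMap
	if err := json.Unmarshal(payload, &rm); err != nil {
		return nil, err
	}
	if len(rm.Routes) == 0 {
		return nil, errors.New("route map has no routes")
	}
	return &rm, nil
}

// ConsumeUpdates drains the update stream until it closes and returns how
// many updates were applied. A bad payload is skipped, and a closed stream
// (the control plane going away) leaves the last accepted table in place.
func (e *Engine) ConsumeUpdates(updates <-chan []byte) int {
	applied := 0
	for payload := range updates {
		rm, err := decode(payload)
		if err != nil {
			continue // keep the last known good table
		}
		e.mu.Lock()
		e.routes = rm
		e.mu.Unlock()
		applied++
	}
	return applied
}

// Route returns the consumers for an event type from the current table.
func (e *Engine) Route(eventType string) []string {
	e.mu.RLock()
	rm := e.routes
	e.mu.RUnlock()
	return rm.Routes[eventType]
}

// demoFallback sends one valid update and one corrupt one, then closes
// the stream to simulate the control plane dying.
func demoFallback() (int, []string) {
	e := &Engine{routes: &RouteMap{
		Routes: map[string][]string{"artifact.collected": {"scoring"}},
	}}
	updates := make(chan []byte, 2)
	updates <- []byte(`{"routes":{"artifact.collected":["scoring","leaderboard"]}}`)
	updates <- []byte(`not a route map`)
	close(updates)
	applied := e.ConsumeUpdates(updates)
	return applied, e.Route("artifact.collected")
}

func main() {
	applied, consumers := demoFallback()
	fmt.Println("updates applied:", applied)
	fmt.Println("consumers still routed:", consumers)
}
```

The corrupt payload and the closed stream both leave the engine routing with the table from the last valid update. Nothing on the event path depends on the control plane being alive.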

It wasn't pretty. We had to write a custom YAML-to-protobuf converter during migration. It produced 176 lines of protobuf schema and 4,200 lines of generated Go. But it worked. The build pipeline now ran protoc in CI, and buf generate replaced our old make config. The binary size grew from 12MB to 14MB. Acceptable.

What The Numbers Said After

We reran the load test at 500 RPS. The p99 latency stayed under 85ms. GC pauses dropped from 120ms to 2ms. Memory usage plateaued at 280MB RSS. Kafka consumer lag stayed flat at zero. During a real user event storm at 120k RPS, the error rate never exceeded 0.08%. The Grafana spike graph finally looked like a healthy EKG.

But the best metric wasn't in Prometheus. It was in our on-call rotation. Before Veltrix-CP, we averaged 4 pages per week during traffic spikes. After, we averaged one every six weeks, and that one was usually a Kafka broker restart, not our engine.