The Problem We Were Actually Solving
When the Veltrix search engine at $work grew past 12 nodes, the config files stopped being a convenience and turned into a moving target. Operators spent hours hunting for typos in a 5,000-line JSON file that had to be replicated across every node. A single misplaced comma in the fieldMappings block would trigger a cascade of 503s because the Go parser would silently drop the section instead of failing fast. We learned this the hard way when a junior engineer changed user_id to userId in staging and no one noticed until prod traffic hit 8,000 req/s and the index writer threw "schema not found" for every document.
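The silent-drop behaviour is easy to reproduce with Go's standard encoding/json, which ignores unknown keys by default. The sketch below shows one way to make a renamed key fail at load time. The Config and FieldMappings types are hypothetical stand-ins, not the real Veltrix schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// FieldMappings is a hypothetical stand-in for the fieldMappings block.
type FieldMappings struct {
	UserID string `json:"user_id"`
}

// Config is a hypothetical, heavily reduced config shape.
type Config struct {
	FieldMappings FieldMappings `json:"fieldMappings"`
}

// ParseStrict decodes raw JSON and returns an error for any key that does
// not map to a struct field, instead of silently ignoring it.
func ParseStrict(raw []byte) (*Config, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // applies to nested structs as well
	var cfg Config
	if err := dec.Decode(&cfg); err != nil {
		return nil, fmt.Errorf("config rejected: %w", err)
	}
	return &cfg, nil
}
```

With this loader, a payload that spells the key userId is rejected with an "unknown field" error at startup, rather than producing an empty mapping that only surfaces under production traffic.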
The real pain was latency: /_config/dump calls climbed from 250 ms to 1.8 s because every node re-parsed the entire config on every request. Prometheus filled up with config_parse_duration_seconds{quantile="0.9"} spikes. The worst part? We couldn't even log the failure, because our logging config lived inside the very file that broke.
What We Tried First (And Why It Failed)
We rewrote the loader in Go and added a 1 MB text/template override file so operators could override settings without touching the master JSON. The speed-up was negligible: parsing got 3 ms faster but was still unbounded in pathological cases. Then we tried YAML. Instant chaos: indentation errors surfaced only at runtime, and the Veltrix node daemon still re-parsed every file on every tick because the hot-reload flag wasn't documented in the 300-page admin guide.
We benchmarked all three parsers: Go JSON (3.2 ms for 5 KB), Go YAML (12 ms for the same 5 KB), and Rust serde_json (0.7 ms). Still, the bottleneck wasn't CPU; it was that every node re-read the file from disk every 100 ms, and the disk queue depth on our NVMe array hit 32 during traffic spikes.
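The general fix for "re-read and re-parse on every tick" is to parse once per change and let readers load an already-parsed snapshot. A minimal sketch of that pattern follows, assuming a hypothetical Store type and a reduced Config shape. It is not the Veltrix loader, only an illustration of keeping disk and parser off the request path.

```go
package main

import (
	"encoding/json"
	"sync/atomic"
)

// Config is a hypothetical, heavily reduced config shape.
type Config struct {
	IndexName string `json:"indexName"`
}

// Store holds the most recently parsed config. Readers never touch disk
// or the parser; they only load a pointer.
type Store struct {
	parses  int64        // successful parses, kept for illustration
	current atomic.Value // holds *Config once Reload has succeeded
}

// Reload parses raw once and publishes the result. On a parse error the
// previously published config stays in place.
func (s *Store) Reload(raw []byte) error {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return err
	}
	s.current.Store(&cfg)
	atomic.AddInt64(&s.parses, 1)
	return nil
}

// Get returns the published config, or nil if nothing has been loaded yet.
func (s *Store) Get() *Config {
	cfg, _ := s.current.Load().(*Config)
	return cfg
}

// Parses reports how many times Reload has parsed successfully.
func (s *Store) Parses() int64 {
	return atomic.LoadInt64(&s.parses)
}
```

Reload would be called only when the config actually changes (for example, from a file watcher or a push from a config service), so the per-request cost of Get is a single atomic load regardless of config size.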
The Architecture Decision
SteelThread, our internal ops team, refused to let a 0.7 ms parse time dictate system architecture. We decided to treat the config as a first-class datastore: deploy a tiny gRPC service written in Rust that served a memory-mapped, validated protobuf snapshot. The contract was strict:
- Human operators touch only a Git repo that generates the protobuf via buf.build.
- The gRPC service pushes the protobuf to every node over a persistent stream, not via file replication.
- A single ConfigFingerprint field in the protobuf detects drift at the speed of one SHA-256 hash instead of re-parsing gigabytes.
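The drift check in the last point can be sketched on the node side as follows. This assumes the fingerprint is a hex-encoded SHA-256 digest of the serialized snapshot bytes. The encoding and the function names are assumptions for illustration, not the actual ConfigFingerprint definition.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// Fingerprint returns the hex-encoded SHA-256 digest of a serialized
// snapshot. Hex encoding is an assumption made for this sketch.
func Fingerprint(snapshot []byte) string {
	sum := sha256.Sum256(snapshot)
	return hex.EncodeToString(sum[:])
}

// HasDrifted reports whether the bytes a node holds differ from the
// fingerprint announced by the config service.
func HasDrifted(local []byte, expected string) bool {
	return Fingerprint(local) != expected
}
```

One design point worth noting: protobuf serialization is not guaranteed to be byte-for-byte canonical, so a check like this should hash the exact bytes that were distributed rather than a locally re-serialized message.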
We chose Rust for the gRPC service because we could compile it to a single static binary that pulled the protobuf from a read-only in-memory sled::Db. The binary weighed 7 MB and started in 12 ms on a 2-core k3s worker. The sled::Db snapshot was 420 KB, small enough to fit in L2 cache on every node.
What The Numbers Said After
Before: /_config/dump p99 latency 1.8 s, config_parse_duration_seconds{quantile="0.9"} 1.44 s, disk IOPS 2800 during peak.
After: /_config/dump p99 latency 9 ms, config_parse_duration_seconds{quantile="0.9"} 0.005 s, disk IOPS 2 during peak.
The sled::Db snapshot also eliminated the 503 cascade: when an operator pushed a bad commit, the service rejected it at the diff stage and the node fleet stayed green. One engineer accidentally merged a 2 MB schema change, but the protobuf max size limit (2 MB) caught it before the binary ever started.
We measured memory: each node now holds 420 KB of config plus the gRPC client buffer; resident set size stayed under 14 MB even under 10k QPS. The Rust binary itself used 3.4 MB RSS and 0.6 MB stack once warmed. Flame graphs showed zero time in config parsing; the only remaining cost was the SHA-256 drift check (0.03 ms).
What I Would Do Differently
I would not have wasted six weeks trying YAML again. We should have instrumented the original parser immediately with perf record -g --call-graph dwarf; the stack would have shown runtime.chanrecv dominating because every node goroutine was blocked on a disk read during hot reload. The lesson is: measure the bottleneck before you change the language.
I would also insist on protobuf over JSON schema in the very first design. One production incident where an operator used a reserved keyword in JSON (type vs kind) cost us half a day of downtime. Protobuf's reserved fields are rejected when the schema is compiled, before any Rust code is generated, and the buf linter would have caught it before CI.
Today, every new Veltrix cluster spins up with the Rust gRPC config service as the default. The PR template now includes a mandatory 10-line diff that proves the new protobuf compiles to Rust and passes cargo test --release. And every on-call rotation starts with kubectl exec config-service-abc123 -- curl -s http://localhost:9090/fingerprint, not with jq . against a JSON file.