The Problem We Were Actually Solving
It was the third day of Veltrix's soft launch and our Slack was on fire. The on-call rotation paged me at 2 AM because the peak batch import rate had dropped from 1.2 M events/sec to 300 K events/sec while the cluster CPU was flatlining at 58 %. The Prometheus dashboard screamed "Not enough workers to drain the queue." At that moment I learned that the Veltrix Config Engine, our shiny abstraction that was supposed to let any operator tune concurrency with YAML, wasn't just leaking performance; it was the fuse for the whole circuit.
The issue wasn't the YAML itself. It was the way the engine mapped those YAML values into the Go worker pool. The default pool size in the Helm chart was still hard-coded to 64, even though the Config Engine's MaxConcurrency field had been cranked up to 512 by the SRE team trying to handle early Black-Friday-style load. The gap between the advertised concurrency and the actual worker count was never logged; the Config Engine just happily accepted the override and silently fell back to the hard-coded value inside the pool constructor. The metrics counter veltrix_workers_configured stayed at 64 while veltrix_workers_running climbed to 512 before the runtime panicked with thousands of goroutines stuck in a mutex wait. The actual error we saw in the log was: panic: sync: negative WaitGroup counter.
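To make the failure mode concrete, here is a minimal sketch of a pool constructor that does not fall back silently. It is not the Veltrix code: Pool, NewPool, and defaultPoolSize are hypothetical names chosen for illustration, and the only number taken from the incident is the default of 64. The constructor starts exactly the number of workers requested and returns an error whenever it has to substitute the default, so a mismatch between configured and running workers can be logged.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultPoolSize mirrors the hard-coded value from the incident (64).
// The name is an assumption made for this sketch.
const defaultPoolSize = 64

// Pool is a hypothetical fixed-size worker pool.
type Pool struct {
	size int
	jobs chan func()
	wg   sync.WaitGroup
}

// NewPool starts exactly `requested` workers. It substitutes the default
// only for a non-positive request, and reports that through the error
// instead of hiding it.
func NewPool(requested int) (*Pool, error) {
	size := requested
	var err error
	if requested <= 0 {
		size = defaultPoolSize
		err = fmt.Errorf("invalid pool size %d, using default %d", requested, size)
	}
	p := &Pool{size: size, jobs: make(chan func())}
	for i := 0; i < size; i++ {
		// Add is called before the goroutine starts, so Done can never
		// be called more often than Add.
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job()
			}
		}()
	}
	return p, err
}

// Size reports how many workers were actually started.
func (p *Pool) Size() int { return p.size }

// Submit blocks until a worker picks up the job.
func (p *Pool) Submit(job func()) { p.jobs <- job }

// Close stops accepting jobs and waits for all workers to exit.
func (p *Pool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	p, err := NewPool(512)
	if err != nil {
		fmt.Println("warning:", err)
	}
	fmt.Printf("configured=%d running=%d\n", 512, p.Size())
	p.Close()
}
```

The "negative WaitGroup counter" panic is what Go raises when Done is called more times than Add on the same WaitGroup. Keeping the Add call next to the goroutine launch, as in the loop above, is one common way to keep the two counts paired.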
What We Tried First (And Why It Failed)
We first blamed the Helm chart's values.yaml. I rolled a change that set worker-pool-size: 512 in the global section, bumped the chart to v0.8.3, and did a rolling restart at 3 AM. Within ten minutes the slew of 500 errors dropped, but the CPU climbed to 94 % and the p95 latency jumped from 42 ms to 380 ms. The cluster autoscaler kicked in and spun up three new nodes, but the new workers immediately got stuck waiting for a single global lock inside the Config Engine's environment cache. The cache was still using a sync.Map with no partitioning, so every time a pod restarted it had to deserialize 200 MB of configuration tags using reflection. The deserialization routine alone took 1.8 seconds per pod, and the lock contention showed up as netdata graph spikes labeled sync.Map contention: 142 ms avg hold time.
Next we tried disabling the cache entirely by setting config-cache-enabled: false in the operator CRD. Within five minutes the CPU dropped back to 70 %, but now every configuration lookup fell through to a round-trip to etcd. The etcd cluster, running on three m5.xlarge masters with 10 k IOPS gp3 volumes, started returning watch timeouts every 47 seconds. The operator logs filled with etcdserver: mvcc: store rev too large errors and the Veltrix API began returning 503s at a rate of 1800 per minute. We rolled that change back inside twenty minutes, but the damage was done: three hours of manual failover and a permanent dent in the SLA dashboard.
The Architecture Decision
I called a war-room on Zoom at 6 AM with the platform team. We had three paths:
- Fork the Config Engine and add a runtime override path for the pool size that actually respected the YAML (too much change, risk of new race conditions).
- Replace sync.Map with a partitioned concurrent hash map (sound, but would take three sprints to validate).
- Change the abstraction so the Config Engine never touched runtime concurrency at all; instead, it emitted a structured OpenTelemetry metric veltrix_config_max_concurrency, and the operator daemon veltrix-operator-sidecar read that metric via the OTel collector and resized the pool through a gRPC endpoint exposed by the worker binary.
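For reference, the partitioned map in the second option could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the Config Engine: ShardedMap, shardCount, and the choice of string keys with an FNV-1a hash are all hypothetical. The idea is that each key hashes to one of several independently locked shards, so writers to different shards do not wait on each other.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardCount is an arbitrary value picked for this sketch.
const shardCount = 32

// shard is one independently locked partition of the map.
type shard struct {
	mu sync.RWMutex
	m  map[string]string
}

// ShardedMap is a hypothetical partitioned concurrent map.
type ShardedMap struct {
	shards [shardCount]*shard
}

// NewShardedMap allocates every shard up front.
func NewShardedMap() *ShardedMap {
	sm := &ShardedMap{}
	for i := range sm.shards {
		sm.shards[i] = &shard{m: make(map[string]string)}
	}
	return sm
}

// shardFor picks a shard by hashing the key with FNV-1a.
func (sm *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return sm.shards[h.Sum32()%shardCount]
}

// Set stores a value, locking only the shard that owns the key.
func (sm *ShardedMap) Set(key, value string) {
	s := sm.shardFor(key)
	s.mu.Lock()
	s.m[key] = value
	s.mu.Unlock()
}

// Get reads a value under the owning shard's read lock.
func (sm *ShardedMap) Get(key string) (string, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	v, ok := s.m[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	sm := NewShardedMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sm.Set(fmt.Sprintf("tag-%d", i), "value")
		}(i)
	}
	wg.Wait()
	v, ok := sm.Get("tag-42")
	fmt.Println(v, ok)
}
```

A sketch like this only addresses lock contention. It does nothing about the cost of deserializing the configuration on pod restart, which would need to be handled separately.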
We chose option three because it isolated the failure domain: if the sidecar misread the metric or the worker ignored it, only that replica's pool was mis-sized; the rest of the cluster wasn't affected. We wrote a thin gRPC service called WorkerPoolController inside the worker binary, exposed on a Unix domain socket at /var/run/vwc.sock. The sidecar ran in the same pod and watched the metric every 500 ms with a simple PromQL query:
max(veltrix_config_max_concurrency) by (pod)
If the query result deviated from the current pool size by more than 10 %, the sidecar sent a gRPC SetConcurrency request containing only the delta. The worker pool responded with a channel that carried either a resize acknowledgment or a backoff timeout. We added a feature flag config.engine.use_grpc_pool=true that defaulted to false for twelve hours so we could observe tail latencies. During the observation window the p99 latency stayed at 48 ms, and the error budget burned only 0.003 % of the total SLO.
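The threshold check described above can be sketched in a few lines. The function name resizeDelta and its signature are assumptions for illustration, and the gRPC call itself is left out. The sketch uses integer arithmetic so that a deviation of exactly 10 % is not treated as "more than 10 %", and it treats an empty pool as always needing a resize, which is a choice made here rather than something the original design specifies.

```go
package main

import "fmt"

// resizeDelta compares the desired concurrency (from the metric) with
// the current pool size. It returns the signed delta and whether the
// deviation exceeds 10 % of the current size.
func resizeDelta(current, desired int) (delta int, resize bool) {
	delta = desired - current
	if current <= 0 {
		// Assumption: an empty pool is resized whenever the target differs.
		return delta, delta != 0
	}
	abs := delta
	if abs < 0 {
		abs = -abs
	}
	// abs/current > 10/100, rewritten without floating point.
	return delta, abs*100 > current*10
}

func main() {
	for _, c := range []struct{ current, desired int }{
		{64, 512},  // far above the threshold
		{100, 105}, // within 10 %, no resize
		{100, 80},  // shrink by more than 10 %
	} {
		delta, resize := resizeDelta(c.current, c.desired)
		fmt.Printf("current=%d desired=%d delta=%d resize=%t\n",
			c.current, c.desired, delta, resize)
	}
}
```

Sending only the delta keeps the request small, but it also means the worker and the sidecar must agree on the current size. A sidecar built this way would need to re-read the pool size after each acknowledgment rather than tracking it locally.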
What The Numbers Said After
After two weeks in production, the new design held steady during the Black-Friday synthetic load test: 1.8 M events/sec sustained, 0 panics, 90 % CPU on the worker nodes. The Config Engine itself now emitted 40× fewer metrics because we removed the legacy Prometheus exporter. The OTel pipeline collected 12 K config events per second with a 95th percentile export latency of 12 ms. The sidecar's CPU never exceeded 1.8 % of a 2 vCPU request, and memory stayed under 40 MB RSS even under peak load.
The critical mistake we avoided was pushing complex logic into the operator's CRD status subresource. Instead, we let the worker binary expose a narrow gRPC surface and relied on the sidecar to do the heavy lifting. That isolation paid off when we later introduced a new shard key feature: we only had to update the sidecar's PromQL parser and the worker's gRPC handler; the Config Engine's YAML schema stayed unchanged.
What I Would Do Differently
If I could go back, I would push harder to make the initial Config Engine stateless from day one. The environment cache should have been optional from the start; we added it in a hurry to reduce etcd load during the first load test, and it introduced a global synchronization point that cost us three incidents. Today every config override still triggers a cache invalidation message that fans out to every pod, and that