When Metrics Lie and Operators Panic: How We Fixed Veltrix Configuration at 2 AM

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Hytale's treasure hunt engine ingests 1.2 million configuration files per day from Veltrix, the in-house event orchestrator. Each file is a tiny JSON blob describing a treasure location, drop table, or spawn rule. The ingestion pipeline is a three-tier Kafka → Flink → Cassandra topology. The middle tier, Veltrix-config, is a 9-node Raft cluster (etcd 3.5.9) that stores the authoritative copy of every active treasure configuration. The config service exposes a cluster-aware gRPC endpoint through a headless Kubernetes service, veltrix-config-client.

The operational surface area was perfect: we had a strict 300 ms 99.9th-percentile latency SLO and zero tolerance for stale reads. What we did not have was a coherent leader election story. The etcd cluster would lose quorum during rolling Flink upgrades when Cassandra compaction storms coincided with a short cloud provider network flap. The compaction storms were our fault: we had set compaction_throughput_mb_per_sec=256 to squeeze 500 MB/s of writes onto cheaper io1 disks. The network flaps were the provider's fault, again. But the outcome was ours: 503 responses from veltrix-config for exactly 10 minutes every time quorum dropped, because the leader election loop in our Go operator relied on the default etcd client retry policy: 5 retries at 100 ms, then hard fail.

Worse, the operator playbook said: restart the config-service deployment. That forced Flink to reread the entire treasure config from S3, which took 8 minutes and chewed up 40 GiB of egress bandwidth. We measured it. So our attempt to keep the system stable made it fragile instead.

What We Tried First (And Why It Failed)

The first attempt was to lift the Flink checkpoint interval from 60 s to 600 s so fewer metadata writes went through veltrix-config. That reduced traffic by ~15 %, but the leader election still blew up every time etcd membership changed. The operator still rebooted the service.

Next we added an HPA on veltrix-config pods scaled by QPS. When the 503s hit, pods scaled to 20 replicas in 45 seconds. The headless service's DNS records became a 2000-host SRV record. Flink's gRPC client balancer panicked:
grpc: failed to resolve the initial address list:
lookup veltrix-config-client.prod.svc.cluster.local on 10.96.0.10:53: no such host
Even after DNS stabilized, Flink kept pooled connections for 60 s and reused stale ones that pointed to terminated pods. We saw 4 % of traffic still go to pods that did not exist. We rolled back the HPA.

The third try was to swap etcd for a managed Consul cluster. We ran a 4-week dark test on staging with Consul 1.15. Instead of leader_not_found we got raft_replication_failed errors every 10 minutes because Consul's snapshot subsystem holds a global lock that blocked every write during snapshot creation. The staging environment locked up for 2.3 seconds each snapshot. We measured throughput drops from 4100 ops/s to 300 ops/s during those windows. The tradeoff was worse: managed Consul gave us runbooks, but it introduced latency spikes that violated our p99 SLO.

Finally we attached a custom readiness gate that made veltrix-config pods report ready only after etcd leadership was stable for 30 seconds. That prevented Flink from sending traffic to a cluster that could not elect a leader. It still did not solve leader election races; it only masked them until the next network flap. We were papering over the crack.

The Architecture Decision

We replaced the Raft leader election mechanism in Veltrix-config with a split-brain-resistant CP subsystem: CockroachDB 23.2 running in the same Kubernetes cluster but isolated via node affinity. CockroachDB gave us:

Strong consistency without split-brain on three availability zones (we had to add a 10 ms max_offset_adjustment in the cluster settings to allow clock skew).
A built-in health check endpoint /health?ready=1 that returns 503 until the node is both part of the quorum and has caught up on raft logs.
A raft_proposal_failure_threshold metric we wired to a custom Kubernetes controller that can trigger a controlled restart of a single pod instead of the whole deployment.

The Flink consumer now uses a two-phase fetch: first it talks to CockroachDB via the pgx driver to read the active treasure list, then it copies the actual JSON blobs from S3. The S3 blobs are immutable per configuration version, so we relaxed the consistency to eventual for those reads.

We moved the veltrix-config headless service to an Istio VirtualService with locality-aware routing and an outlier detection policy that ejects pods that return 503 for 3 consecutive health checks with a 5-second grace period. We set the gateway's connection pool to 50 connections per pod with a 10-second idle timeout to shed load during leader transitions.
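The counting rule behind that ejection policy is worth spelling out. The Go sketch below is not Istio or Envoy configuration. It is a hypothetical Ejector type that only models the "3 consecutive 503s" rule, and it leaves out the grace period and re-admission of ejected pods.

```go
package main

import "fmt"

// Ejector tracks consecutive 503 responses per pod and marks a pod ejected
// once its streak reaches the threshold. Hypothetical type for illustration.
type Ejector struct {
	threshold int
	streak    map[string]int
	ejected   map[string]bool
}

func NewEjector(threshold int) *Ejector {
	return &Ejector{
		threshold: threshold,
		streak:    make(map[string]int),
		ejected:   make(map[string]bool),
	}
}

// Record notes one health-check status for a pod and reports whether the pod
// is ejected afterwards. Any status other than 503 resets the streak.
func (e *Ejector) Record(pod string, status int) bool {
	if status == 503 {
		e.streak[pod]++
	} else {
		e.streak[pod] = 0
	}
	if e.streak[pod] >= e.threshold {
		e.ejected[pod] = true
	}
	return e.ejected[pod]
}

func main() {
	e := NewEjector(3)
	// A single healthy check in the middle resets the streak.
	for _, status := range []int{503, 503, 200, 503, 503, 503} {
		fmt.Printf("status=%d ejected=%v\n", status, e.Record("pod-a", status))
	}
}
```

Requiring consecutive failures rather than a failure ratio means a pod that briefly returns 503 during a leader transition and then recovers stays in rotation, while a pod that is stuck gets removed after three checks.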

The entire migration took 14 days. The numbers looked like this after:

Leader election latency dropped from 10 seconds to 150 milliseconds.
99th-percentile latency for config fetch stayed at 280 milliseconds even when one AZ was partitioned.