Veltrix Cluster Configuration or, How We Spent Two Weeks Chasing a 3% Packet Loss Spike

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We had built the planner as a single logical service called hytrace-core. It exposed a gRPC endpoint /v1/sessions/{id}/state that returned the full session state (a 3–4 KB JSON blob) and updated it atomically. The state blob lived in etcd under the /hunt/state prefix. Every time a player started a dig session, the backend issued:

one etcd Txn: compare-and-swap on the session's revision
one etcd Put: new state snapshot
one etcd lease keep-alive

That looked cheap until we pushed 4 k concurrent sessions. The etcd leader spent 40 % of its CPU time serializing those 4 KB writes into the WAL and raft log. The 3 % packet loss turned out to be TCP retransmits between the planner pods and the etcd cluster: the TCP send buffers filled up during 120 ms jitter spikes.
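To make the per-session write cost concrete, here is a minimal sketch of the compare-and-swap plus snapshot write described above. It uses an in-memory stand-in (FakeKV) rather than the real etcd client, and the key layout under /hunt/state and the function name start_dig_session are assumptions made for illustration, not the actual hytrace-core code. The lease keep-alive is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FakeKV:
    """In-memory stand-in for etcd; this is NOT the etcd client API."""
    data: dict = field(default_factory=dict)  # key -> (mod_revision, value)
    revision: int = 0                         # store-wide revision counter
    writes: int = 0                           # committed writes, to count cost

    def get(self, key):
        # Missing keys report revision 0, mirroring "key does not exist yet".
        return self.data.get(key, (0, None))

    def put(self, key, value):
        self.revision += 1
        self.writes += 1
        self.data[key] = (self.revision, value)
        return self.revision

    def cas(self, key, expected_rev, value):
        """Write only if the key's mod revision still equals expected_rev."""
        current_rev, _ = self.get(key)
        if current_rev != expected_rev:
            return False
        self.put(key, value)
        return True


def start_dig_session(kv, session_id, snapshot):
    """Hypothetical write sequence: one CAS on the revision, one snapshot Put."""
    meta_key = f"/hunt/state/{session_id}/meta"        # assumed key layout
    state_key = f"/hunt/state/{session_id}/snapshot"   # assumed key layout
    rev, _ = kv.get(meta_key)
    if not kv.cas(meta_key, rev, "active"):
        return False  # another writer won the race; the caller would retry
    kv.put(state_key, snapshot)
    return True
```

Even in this toy model, every session start costs two committed writes, and in a real cluster each of them goes through the leader's WAL and raft log.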

What We Tried First (And Why It Failed)

We first tuned etcd: increased quota-backend-bytes from 2 GB to 8 GB, bumped the wal-fsync-duration threshold to 500 ms, and moved the WAL storage to NVMe volumes. Packet loss fell to 1.7 %, but the 99th-percentile raft commit latency stayed at 700 ms. Next we tried increasing the etcd peer count from 5 to 7 to spread the load. The cluster stabilized for about six hours, then the same leader fell behind again when a follower got a noisy neighbor on the same hypervisor rack.

While etcd was the obvious bottleneck, we also tried sharding the planner state into three shards by session_id hash. This meant three separate etcd clusters, each with its own raft group. We wired in a client-side shard resolver that ended up doubling the latency budget because the planner now had to do three round-trips for every mutation. Worse, a single hot shard still caused leader thrashing: the planner traffic was not evenly distributed but clustered around mid-tier sessions.
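A rough sketch of why hashing by session_id did not help with skew: the hash spreads session IDs evenly, but not the write volume behind each ID. The choice of SHA-256 and the helper names shard_for and shard_load are assumptions for illustration; the post does not say which hash the real resolver used, and this sketch does not model the extra round-trips.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # from the text: three separate etcd clusters


def shard_for(session_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a session_id to a shard with a stable hash (SHA-256 is assumed)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_load(writes):
    """Sum writes per shard for an iterable of (session_id, write_count)."""
    load = Counter()
    for session_id, count in writes:
        load[shard_for(session_id)] += count
    return load


# Thirty quiet sessions plus one busy one: whichever shard owns the busy
# session carries almost all of the write load.
writes = [(f"s{i}", 1) for i in range(30)] + [("hot", 1000)]
load = shard_load(writes)
hottest = max(load, key=load.get)
```

Here hottest is always the shard that owns the "hot" session, no matter how evenly the other thirty IDs are spread.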

The Architecture Decision

At this point we had two paths:

Option A: push the planner state into Redis Cluster and keep etcd only for configuration. Redis had 8 ms local writes, but we'd lose transactions and durability guarantees. Hytrace sessions had to survive a node loss in minutes, not seconds, and we couldn't risk a lost dig.

Option B: keep the planner state in etcd but split the write path from the read path. We decided on Option B with an explicit tradeoff: we hardened etcd for writes and punted reads to an asynchronous snapshot served by the etcd followers.

Concrete changes:

Moved /hunt/state writes to the etcd leader only, via gRPC with a LeaseKeepAlive stream that piggy-backed state deltas. Every 100 ms the leader batches deltas for the same session into a single 2 KB write. We set etcd's max-wals-size to 64 MB to keep disk usage flat.
For reads, we deployed a sidecar called etcd-snapshot-proxy that tails the raft log via etcd's watch API, builds an in-memory LRU cache of the latest state per session_id, and serves reads at 1 ms p99. We kept the cache size capped at 200 MB so that in a failover we could rebuild it in under 30 seconds.
We replaced the planner's state struct with a delta-only representation: instead of storing the full 4 KB JSON, we stored three ops (dig_started, tiles_revealed, loot_claimed) and a revision counter. When the client fetches state, the snapshot-proxy reconstructs the full JSON on demand by replaying the deltas from its cache. This cut the write amplification by 6×.
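The batching and replay steps above can be sketched roughly as follows. Only the three op names come from the post; the payload fields (tiles, item), the shape of the reconstructed state, and the class and function names are assumptions, and the 100 ms timer is left to the caller. The LRU cache is not modelled here.

```python
from collections import defaultdict

# The three op names come from the text; the payload fields are assumptions.
OPS = ("dig_started", "tiles_revealed", "loot_claimed")


class DeltaBatcher:
    """Collects deltas per session and flushes each session as one write."""

    def __init__(self):
        self.pending = defaultdict(list)

    def add(self, session_id, op, payload):
        if op not in OPS:
            raise ValueError(f"unknown op: {op}")
        self.pending[session_id].append({"op": op, **payload})

    def flush(self):
        """Return one batch per session; a timer would call this every 100 ms."""
        batches, self.pending = dict(self.pending), defaultdict(list)
        return batches


def replay(deltas):
    """Rebuild a full state dict by applying deltas in order."""
    state = {"started": False, "tiles": [], "loot": [], "revision": 0}
    for d in deltas:
        if d["op"] == "dig_started":
            state["started"] = True
        elif d["op"] == "tiles_revealed":
            state["tiles"].extend(d["tiles"])
        elif d["op"] == "loot_claimed":
            state["loot"].append(d["item"])
        state["revision"] += 1  # one revision bump per applied delta
    return state


batcher = DeltaBatcher()
batcher.add("s1", "dig_started", {})
batcher.add("s1", "tiles_revealed", {"tiles": [1, 2]})
batcher.add("s2", "dig_started", {})
batcher.add("s1", "loot_claimed", {"item": "gem"})
batches = batcher.flush()       # two writes instead of four
state = replay(batches["s1"])   # full state rebuilt on the read side
```

Four deltas across two sessions collapse into two writes, and the read side only needs the ordered delta list to rebuild the full state.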

The new write path committed in 12 ms p99 on the leader, and follower reads served at 0.8 ms p99. Packet loss dropped to 0.04 % and stayed there for two weeks.

What The Numbers Said After

After rolling the delta writes and snapshot reads to 100 % of prod traffic for 14 days:

etcd leader CPU utilization: 18 % (down from 75 %)
raft commit latency p99: 12 ms (down from 850 ms)
planner p99 tail latency: 19 ms (down from 280 ms)
session state durability: we replayed every 10-second snapshot onto a warm follower every minute; zero committed session data was lost in a chaos-monkey node kill test.

The Redis fallback we tested in staging showed 2 ms writes, but the planner then required a separate transaction manager to keep state consistent across Redis shards. When we killed one Redis shard in staging, we lost three live sessions because the transaction manager tried to commit to two out of three shards and the third never responded. That experience convinced us the tradeoff was worth it.
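The failure mode is easy to reproduce in miniature. The sketch below is a toy model, not the transaction manager we ran in staging: Shard and naive_commit are hypothetical names, and the commit loop deliberately has no prepare phase and no rollback.

```python
class Shard:
    """Toy shard; up=False simulates a shard that never responds."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            raise TimeoutError("shard did not respond")
        self.data[key] = value


def naive_commit(shards, key, value):
    """Write to each shard in turn with no prepare phase and no rollback."""
    committed = []
    for i, shard in enumerate(shards):
        try:
            shard.write(key, value)
            committed.append(i)
        except TimeoutError:
            break  # earlier shards keep the write: the state is now split
    return committed


shards = [Shard(), Shard(), Shard(up=False)]
committed = naive_commit(shards, "session:42", "loot_claimed")
```

After the call, committed holds the first two shard indexes and the third shard has no record of the session, which is the kind of split state that cost us live sessions.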