Veltrix Treasure Hunt Engine: When We Switched From ZooKeeper To etcd And Lost 23 Seconds Of Wall-Clock Time

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our Veltrix Treasure Hunt Engine had grown to 270 nodes across three regions. Every hunt configuration—a time window, a geofence, a reward tier—was stored in ZooKeeper under /hunts/{huntId}/config as a 2 kB JSON blob. At peak we wrote 12 k updates per minute: new hunts created, tiers promoted, geofences expanded, all pushed through a single Golang service we called the Configor.

The first symptom was p99 write latency on Configor rising from 8 ms to 98 ms over three weeks. The second symptom came at 3× traffic: the service started to crash-loop with java.lang.OutOfMemoryError: unable to create new native thread. We knew the problem wasn't disk or network; it was ZooKeeper's single-threaded leader handling all writes. The 2 kB blob itself wasn't huge, but the herd of followers trying to replicate it synchronously created back-pressure. The Veltrix docs cheerfully suggested multi() batches and persistence=true, but we had already tried both; the leader still serialized every transaction.

Our SLI was simple: publish a hunt config within 500 ms, 99.9 % of the time. We were failing.
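A minimal sketch of how an SLI like this could be checked mechanically, assuming publish latencies are collected as millisecond samples over some window. The function names and the sample data are hypothetical, not taken from Configor.

```go
package main

import "fmt"

// fractionWithin returns the share of latency samples (in milliseconds)
// that finished at or below thresholdMs.
func fractionWithin(samplesMs []float64, thresholdMs float64) float64 {
	if len(samplesMs) == 0 {
		return 1.0 // no publishes, so nothing missed the objective
	}
	within := 0
	for _, s := range samplesMs {
		if s <= thresholdMs {
			within++
		}
	}
	return float64(within) / float64(len(samplesMs))
}

// meetsSLI reports whether at least target (for example 0.999) of the
// samples finished within thresholdMs.
func meetsSLI(samplesMs []float64, thresholdMs, target float64) bool {
	return fractionWithin(samplesMs, thresholdMs) >= target
}

func main() {
	// Hypothetical window: 998 fast publishes and 2 slow ones.
	samples := make([]float64, 0, 1000)
	for i := 0; i < 998; i++ {
		samples = append(samples, 40)
	}
	samples = append(samples, 700, 900)

	fmt.Printf("within 500 ms: %.3f, SLI met: %v\n",
		fractionWithin(samples, 500), meetsSLI(samples, 500, 0.999))
}
```

With a 99.9 % target, two slow publishes in a thousand are already enough to miss the objective, which is why a rising p99 matters long before the median moves.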

What We Tried First (And Why It Failed)

Before touching etcd we A/B tested four mitigation paths.

Sharding the key space by huntId into 16 logical prefixes so Configor could distribute writes.
Outcome: p99 latency dropped to 62 ms, but memory usage in Configor increased 38 % because we now maintained 16 separate ZooKeeper clients. We hit a new ceiling at 400 nodes; latency crept back up.
Switching to Redis with MULTI and AOF fsync every second.
Outcome: p99 write latency hit 7 ms, but every Redis node averaged 1.2 GB RSS. Our team budgeted for $3.2 k/month in Redis memory, but at 270 nodes we needed 8 Redis instances, pushing us to $8.4 k/month. The finance team vetoed it.
Replacing ZooKeeper with PostgreSQL JSONB and LISTEN/NOTIFY.
Outcome: p99 latency 22 ms, but LISTEN/NOTIFY flooding caused PostgreSQL to flush WAL every 10 ms, tanking TPS from 200 k to 80 k on the primary. We couldn't afford a hot standby.
Pushing the JSON blob into a sidecar file and storing only a pointer in ZooKeeper.
Outcome: Configor became stateless, but we reintroduced eventual consistency. Two nodes could read stale files for 3–5 seconds after an update, causing hunters to miss a live treasure window. We rejected it because treasure windows were 15 minutes long; 3–5 seconds of staleness cost us 1.7 % conversion.
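
The first mitigation, sharding by huntId into 16 logical prefixes, can be sketched roughly as follows. This is an illustration only: the FNV-1a hash and the /hunts/shard-NN/{huntId}/config key layout are assumptions, since the exact scheme is not described above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardCount matches the 16 logical prefixes mentioned in the text.
const shardCount = 16

// shardFor maps a hunt ID to a shard index in [0, shardCount) using FNV-1a.
// The choice of hash is an assumption made for this sketch.
func shardFor(huntID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(huntID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % shardCount
}

// configPath builds a sharded key. The layout is hypothetical.
func configPath(huntID string) string {
	return fmt.Sprintf("/hunts/shard-%02d/%s/config", shardFor(huntID), huntID)
}

func main() {
	for _, id := range []string{"hunt-001", "hunt-002", "hunt-003"} {
		fmt.Println(configPath(id))
	}
}
```

The mapping is deterministic, so every Configor instance routes a given hunt to the same prefix, but each prefix still needs its own client, which is where the extra memory went.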

Each of the four approaches violated at least one of our invariants: total order, observability, or budget. We were stuck.

The Architecture Decision

We chose etcd v3.5 over ZooKeeper 3.8 after a three-day spike in staging.

The tradeoff we documented internally was this: ZooKeeper offered a mature Java client and cross-language support, but etcd gave us a linearizable write pipeline backed by Raft. Our new invariants:

Every hunt update must be reflected globally within 250 ms.
The system must fit inside the existing Kubernetes cluster (60 GiB memory budget across 270 nodes).
No rolled-back writes; if Configor declares a hunt active, it must remain active unless explicitly deleted.

etcd fit the bill because:

It exposes a gRPC API on client port 2379 (with an optional gRPC-gateway for HTTP/JSON), so Configor could stream watch events instead of polling.
Memory footprint per node was 180 MiB RSS when storing 400 k hunt configs (each <2 kB).
We could push writes to about 10 k txn/sec before saturating a single vCPU core, well above our peak of 12 k updates per minute (roughly 200 per second).
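
The throughput comparison mixes two units (peak load is quoted per minute, capacity per second), so it is worth spelling out the arithmetic. A small sketch using only the figures quoted in this post; the helper names are hypothetical.

```go
package main

import "fmt"

// perSecond converts a per-minute rate into a per-second rate.
func perSecond(perMinute float64) float64 {
	return perMinute / 60.0
}

// headroom returns how many times the load fits into the capacity.
func headroom(capacityPerSec, loadPerSec float64) float64 {
	return capacityPerSec / loadPerSec
}

func main() {
	peak := perSecond(12000) // 12 k updates per minute at peak
	fmt.Printf("peak: %.0f writes/s, headroom: %.0fx\n", peak, headroom(10000, peak))
	// prints "peak: 200 writes/s, headroom: 50x"
}
```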

We rolled it out in a blue-green fashion:

Week 1: etcd cluster on three m5.large nodes (8 GiB RAM, 2 vCPU), Configor still pointing to ZooKeeper but reading from etcd as a read replica. Metrics looked good: p99 write latency 12 ms, memory 150 MiB.

Week 2: Configor began writing to etcd. We set --lease-ttl-seconds=30 on every Configor pod to avoid orphaned locks; if a pod crashed, the lease evaporated in 30 s instead of 10 minutes. The old ZooKeeper ensemble was left read-only for two weeks as a fallback in case we rolled back.
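
To make the 30-second lease behaviour concrete, here is a toy, in-memory model of a TTL lease. It is a sketch of the semantics only and does not talk to etcd; with the real Go client the equivalent steps would go through clientv3's Grant and KeepAlive calls. The lease type and its methods below are hypothetical.

```go
package main

import "fmt"

// lease is a toy model of a TTL lease: it expires unless a keep-alive
// arrives within ttlSeconds of the previous one.
type lease struct {
	ttlSeconds int64
	renewedAt  int64 // time of the last keep-alive, in seconds
}

func newLease(ttlSeconds, now int64) *lease {
	return &lease{ttlSeconds: ttlSeconds, renewedAt: now}
}

// expired reports whether the TTL has elapsed since the last keep-alive.
func (l *lease) expired(now int64) bool {
	return now-l.renewedAt >= l.ttlSeconds
}

// keepAlive renews the lease and returns false if it had already expired,
// mirroring the fact that an expired lease cannot be revived.
func (l *lease) keepAlive(now int64) bool {
	if l.expired(now) {
		return false
	}
	l.renewedAt = now
	return true
}

func main() {
	l := newLease(30, 0)
	l.keepAlive(10)
	l.keepAlive(20) // last heartbeat before a hypothetical pod crash

	for _, t := range []int64{30, 49, 50} {
		fmt.Printf("t=%ds expired=%v\n", t, l.expired(t))
	}
}
```

In this model the lock held by a crashed pod disappears 30 seconds after its last heartbeat, which is the property we wanted from the shorter TTL.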

Week 3: We decommissioned ZooKeeper. The last error message we saw from the ZooKeeper client was org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hunts. It felt weirdly respectful.

What The Numbers Said After

Post-migration, we ran a synthetic load of 15 k writes/min for seven days.

p99 write latency: 14 ms stable, no upward trend.
Median watch propagation: 89 ms (etcd watch server to Configor).
Memory per etcd node: 210 MiB RSS at 480 k keys, 30 % below the 300 MiB alert threshold.
Recovery from leader loss: 23 seconds wall-clock time to elect a new leader and resume writes. We accepted those 23 seconds because our hunt windows were 15 minutes long; engineering for a faster failover would have been overkill.
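
For capacity planning, the two memory figures in this post (180 MiB at 400 k configs, 210 MiB at 480 k keys) can be extrapolated to estimate when the 300 MiB alert would fire. This is a rough sketch that assumes memory grows linearly with key count and that the two measurements are comparable, neither of which was verified here; the function name is hypothetical.

```go
package main

import "fmt"

// keysAtThreshold linearly extrapolates from two (keys, MiB) observations
// to the key count at which memory would reach limitMiB.
// It assumes m2 != m1; a flat line has no crossing point.
func keysAtThreshold(k1, m1, k2, m2, limitMiB float64) float64 {
	return k2 + (limitMiB-m2)*(k2-k1)/(m2-m1)
}

func main() {
	keys := keysAtThreshold(400000, 180, 480000, 210, 300)
	fmt.Printf("estimated keys at 300 MiB: %.0f\n", keys)
	// prints "estimated keys at 300 MiB: 720000"
}
```

Under those assumptions the alert threshold sits at roughly 720 k keys, about 1.5 times the key count in the load test.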

We also introduced a new metric: etcd defrag operations. Every six hours we ran etcdctl defrag and watched the etcd_disk_wal_fsync_duration_seconds histogram buckets. Without defrag, fsync latency climbed from 2 ms to 18 ms in 10 days, which we attributed to on-disk fragmentation (defrag itself rewrites the backend database file rather than the WAL). We set an alert at 10 ms and automated defrag in the operator chart.
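
Because etcd_disk_wal_fsync_duration_seconds is a histogram, a latency alert has to be derived from cumulative buckets rather than read directly. The sketch below estimates a quantile by linear interpolation inside the matching bucket, a simplified version of what Prometheus's histogram_quantile does (it ignores the +Inf bucket). The bucket values in main are invented for illustration, not our production data.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // seconds, the "le" label
	cumulative float64 // observations at or below upperBound
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets
// sorted by upperBound, interpolating linearly inside the matching bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	total := buckets[len(buckets)-1].cumulative
	if total == 0 {
		return 0
	}
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.cumulative >= rank {
			if b.cumulative == lowerCount {
				return b.upperBound // empty bucket, nothing to interpolate
			}
			frac := (rank - lowerCount) / (b.cumulative - lowerCount)
			return lowerBound + (b.upperBound-lowerBound)*frac
		}
		lowerBound, lowerCount = b.upperBound, b.cumulative
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Hypothetical fsync histogram: most syncs are fast, with a slow tail.
	buckets := []bucket{
		{0.002, 900}, {0.004, 950}, {0.008, 980}, {0.016, 1000},
	}
	const alertSeconds = 0.010 // the 10 ms alert threshold from the text
	p99 := quantile(0.99, buckets)
	fmt.Printf("p99 = %.1f ms, alert: %v\n", p99*1000, p99 > alertSeconds)
}
```

With these made-up buckets the estimated p99 lands at about 12 ms, so the 10 ms alert would fire even though 98 % of syncs finish in under 8 ms.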