Why Veltrix Thought It Could Buy Its Way Out of a Distributed Lock Problem

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

Hytale runs a server-side treasure hunt engine that must hand out unique rewards every second during global events. Each reward is a non-fungible item, so we needed strict linearizable ordering: the same treasure ID must never be emitted twice, even if the cluster partitioned. The business rule was simple: no duplicate keys, no manual recovery, SLA 5 ms p99 latency. Redis Cluster gave us eventual consistency within the slot shard, but it could not do cross-slot linearizable writes. When the cluster rebalanced—even for a second—requests started to race, and we saw duplicate quest keys in prod logs. That violated the spec, and we had to backfill 14,000 duplicate items in the account database.

What actually broke was not Redis itself; it was the optimistic assumption that Redis Cluster could behave like a single atomic register under partial failure. The client library redislock-py was retrying with exponential backoff, but without a fencing token, two clients could both believe theyd won the lock and emit the same treasure ID. The error we chased for two days was MISCONF Redis is configured to save RDB snapshots, but the replica is too slow to persist, which masked the real race: two processes incrementing the same counter under split-brain.

What We Tried First (And Why It Failed)

First fix: shard the writes per realm so each key is single-slot. We rolled out a realm-to-slot mapping, but the mapping table grew to 120 MB and had to live in client memory. Any realm rebalance still forced a full client rollout, and we hit a bug in hytale-realm-client where the in-memory map was stale after a ZooKeeper re-election, leading to slot not served 3.2 % of the time.

Second fix: use Redlock algorithm inside the game service. We pulled the redis-py Redlock implementation and ran it against a 9-node cluster. The first problem was clock drift: the game servers were running on NTP-skew-prone Windows containers, and Redlock requires clocks to be within 50 ms. Our max drift was 137 ms, so we saw lock lost retries even on healthy nodes. The second problem was lease renewal: if a game server died mid-lease, the lock expired only after 30 seconds, releasing the treasure ID to the next lucky client, which violated the business rule.

Third fix: switch to a CP database instead of AP. We spun up three FoundationDB clusters per AWS region, each with three stateless resolvers. FoundationDB promised strict serializability and multi-region ACID, but the 60 MB transaction buffers caused the resolvers to exceed their 300 ms soft limit under 50 K tps. We saw client timeouts labeled foundationdb.client.unavailable: ClusterNotReady for 8.4 % of requests during cross-region failover. The ops team then set the resolver batch size to 200, which fixed latency but opened a new problem: the transaction retry loop in hytale-foundation-client could spin for 1.2 seconds, returning duplicate keys if the retry happened before the prior commit finished.

The Architecture Decision

After the third failure, we stopped trying to bolt linearizability onto a distributed cache. We went back to first principles: if we need a single atomic sequence, give it a single owner. We created a dedicated ID service called Anchor.

Architecture:

One stateless Anchor service per AWS region.
Each Anchor has a local etcd cluster with raft-each-reach configuration.
Client requests hit a gRPC endpoint /next/{namespace} which returns a monotonically increasing 64-bit integer.
The namespace is partitioned: treasure IDs go to namespace=1, ship logs to namespace=2, etc.
We use etcd lease-based leader election so only the leader can append to the raft log.
If the leader steps down, the new leader starts from the last committed index, ensuring no gaps or duplicates.

Tradeoffs:

Anchor is now a single hot shard in each region. If it dies, that entire region cannot hand out new IDs until the raft recovers. To mitigate, we run three Anchor replicas with 100 ms election timeout and a 5-second client retry with backoff.
Memory usage: the raft log for namespace=1 reached 2.1 GB in 72 hours under 120 K ids/minute. We added a nightly compaction job that snapshots every 24 hours and truncates the log to the last committed entry.
Latency: the p99 for /next requests inside the same AZ is 2.1 ms. Cross-AZ calls spike to 42 ms when the raft is unstable, but we accept this because the primary user, the treasure engine, batches calls every 10 ms anyway.

We shipped this on 15 March and watched the duplicate-id metric drop from 0.04 % to 0.0001 %. That number saved us two weeks of backfill.

What The Numbers Said After

Data from 18 March to 25 March—one full global event cycle:

Anchor p99 latency: 2.9 ms in us-east-1, 38 ms in ap-northeast-1 (cross-region).
Anchor memory footprint per region: 3.2 GB (stable), spiked to 4.1 GB during compaction.
Treasure engine duplicate keys: 2 events out of 34 million, rate 0.00006 %.
Cost per million IDs: $0.00012 in us-east-1, dominated by etcd disk IOPS.

Compared to the FoundationDB attempt