Why Hytale Treasure Hunt Engines Keep Dying at Scale — And How We Kept Ours Alive

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The core loop of a Hytale treasure hunt is simple: spawn 500 clues per map, 50 hunters on each clue, first hunter to solve the riddle gets the loot, then we despawn everything and restart the next wave.
In practice that loop dies under load for two reasons:

Redis Streams at 8 k messages per second (each clue update, each loot claim, each wave restart) turned into a single blocking call when we appended to a per-hunter stream with XADD. Benchmarking showed 1.2 ms median latency under 2 k concurrent hunters, but P99 exploded to 480 ms when the stream approached 250 k events in flight. The server tick loop would stall waiting for the Redis response, and hunters would see the clue teleport them to the wrong coordinate because the event stream was still being flushed.
Single-threaded Lua in OpenResty parsed JSON for every clue change. When 500 hunters solved the same clue within 300 ms, the nginx worker would spend 80 % of its time serialising and deserialising table structures. Engine CPU time per request jumped from 3 ms to 48 ms, and we started dropping 15 % of legitimate requests when the worker queue exceeded 128.

Most public configs I audited used the same pattern: a Redis key per clue that stored the current hunter UUID, and a Lua script to compare-and-set (CAS) the UUID when the clue was solved. The script looked like this:

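-- KEYS[1]: clue key; ARGV[1]: expected current hunter UUID; ARGV[2]: new hunter UUID
local cur = redis.call('GET', KEYS[1])
-- GET returns false for a missing key, so "not cur" means the clue is unclaimed
if not cur or cur == ARGV[1] then
  redis.call('SET', KEYS[1], ARGV[2])
  return 1 -- claim succeeded
end
return 0 -- another hunter holds the clue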

It worked fine in the dev shard (1 k concurrent hunters), but in production we saw 42 % of scripts returning 0 under load because the Lua script didn't account for the 100 ms clock skew between nginx workers. Two workers could read the same key within 50 ms and both think the clue was free.
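
The "both think the clue was free" failure is a read-then-write race. The sketch below models it in memory, assuming the read and the write reach the store as two separate steps. It is not a reproduction of the audited configs: the dict stands in for Redis, and `read_owner`, `write_owner_if_seen_free`, `atomic_claim` and the `clue:123` key are hypothetical names used only for illustration.

```python
# In-memory stand-in for Redis (an assumption for illustration only).
store = {}

def read_owner(clue_key):
    """Step 1 of a non-atomic claim: read the current owner (None if unclaimed)."""
    return store.get(clue_key)

def write_owner_if_seen_free(clue_key, seen_owner, hunter_id):
    """Step 2: act on the value read earlier, which may be stale by now."""
    if seen_owner is None:
        store[clue_key] = hunter_id
        return 1
    return 0

def atomic_claim(clue_key, hunter_id):
    """Read and write as one indivisible step."""
    if store.get(clue_key) is None:
        store[clue_key] = hunter_id
        return 1
    return 0

def racy_interleaving():
    """Both workers read before either one writes."""
    store.clear()
    seen_a = read_owner("clue:123")
    seen_b = read_owner("clue:123")
    won_a = write_owner_if_seen_free("clue:123", seen_a, "hunter-a")
    won_b = write_owner_if_seen_free("clue:123", seen_b, "hunter-b")
    return won_a, won_b, store["clue:123"]

def atomic_interleaving():
    """Each worker's read and write happen together."""
    store.clear()
    won_a = atomic_claim("clue:123", "hunter-a")
    won_b = atomic_claim("clue:123", "hunter-b")
    return won_a, won_b, store["clue:123"]
```

In the racy interleaving both workers report a win and the second write silently replaces the first. In the atomic one, exactly one worker wins.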

What We Tried First (And Why It Failed)

First attempt was to bolt on a lock: a Redis key called clue:123:lock with 50 ms TTL. The Lua script would try to SETNX the lock before reading the hunter UUID. Success rate jumped to 78 % under 8 k concurrent hunters, but we introduced two new failure modes:

Lock thrashing – the 50 ms TTL was too short for the Lua script to finish, so 32 % of locks expired before the CAS completed. The script would reacquire the lock, read a stale hunter UUID, and then overwrite a solve that had already happened. We logged 1,842 stale solves in the first 48 hours.
Timeouts in the locking path – when the lock acquisition loop retried 5 times with linear backoff, the nginx worker would exceed its 100 ms budget and emit 503 errors. Error rate climbed from 1 % to 8 % during peak hours.
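
Both failure modes can be modelled with a fake clock. This is a minimal sketch under stated assumptions, not our production code: `FakeClock`, `TtlLockStore`, `stale_solve_demo` and `total_backoff_ms` are hypothetical names, and the 10 ms backoff step in the usage note is an assumed value, since the real step size is not given above.

```python
class FakeClock:
    """Manually advanced clock so the timing is deterministic."""
    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms

class TtlLockStore:
    """In-memory stand-in for a lock key with a TTL plus a plain owner key."""
    def __init__(self, clock):
        self.clock = clock
        self.lock_expires_at = None
        self.owner = None

    def try_lock(self, ttl_ms):
        # The lock is free if it was never taken or its TTL has elapsed.
        free = self.lock_expires_at is None or self.clock.now_ms >= self.lock_expires_at
        if free:
            self.lock_expires_at = self.clock.now_ms + ttl_ms
        return free

def stale_solve_demo(ttl_ms, work_ms):
    """Worker A holds the lock, works for work_ms, then writes what it decided earlier."""
    clock = FakeClock()
    s = TtlLockStore(clock)
    s.try_lock(ttl_ms)            # worker A takes the lock at t = 0
    seen_by_a = s.owner           # A reads the owner: unclaimed
    clock.advance(work_ms)        # A is still busy; the lock may expire meanwhile
    b_got_lock = s.try_lock(ttl_ms)
    if b_got_lock:
        s.owner = "hunter-b"      # B records a legitimate solve
    if seen_by_a is None:
        s.owner = "hunter-a"      # A acts on its stale read
    return b_got_lock, s.owner

def total_backoff_ms(retries, step_ms):
    """Total wait for linear backoff: step, 2 * step, ..., retries * step."""
    return sum(step_ms * attempt for attempt in range(1, retries + 1))
```

With `stale_solve_demo(ttl_ms=50, work_ms=60)` worker B gets the lock after it expires, and A then overwrites B's solve. With an assumed 10 ms step, `total_backoff_ms(5, 10)` adds up to 150 ms of waiting, which is already past a 100 ms request budget before any Redis round trips are counted.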

Second attempt moved the state into Postgres with advisory locks and a table called treasure_hunts(id, clue_id, hunter_id, solved_at). The CAS translated to a SQL UPDATE with RETURNING. Median latency dropped to 8 ms under 8 k hunters, but P95 jumped to 120 ms because the Postgres pool ran out of connections under 12 k hunters. Connection wait time became the new bottleneck; we hit the default 100 connection limit and started queueing queries. Adding more connections didn't help: the Postgres server saturated at 800 TPS, and CPU steal from the hypervisor reached 35 %, which violated the SLA for the host.
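
The CAS-as-UPDATE idea can be sketched as a conditional UPDATE whose WHERE clause only matches an unclaimed row. The example below is an illustration under assumptions, not the query we ran: it uses Python's built-in SQLite as a stand-in for Postgres, checks the affected row count instead of RETURNING, leaves out advisory locks, and the column types, the UNIQUE constraint, and the `open_db` and `claim_clue` helpers are hypothetical.

```python
import sqlite3

def open_db():
    """Create an in-memory table shaped like treasure_hunts (column types are assumed)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE treasure_hunts ("
        " id INTEGER PRIMARY KEY,"
        " clue_id INTEGER NOT NULL UNIQUE,"
        " hunter_id TEXT,"
        " solved_at REAL)"
    )
    return db

def claim_clue(db, clue_id, hunter_id, solved_at):
    """Claim the clue only if nobody holds it; True means this hunter won."""
    cur = db.execute(
        "UPDATE treasure_hunts SET hunter_id = ?, solved_at = ?"
        " WHERE clue_id = ? AND hunter_id IS NULL",
        (hunter_id, solved_at, clue_id),
    )
    db.commit()
    # Exactly one changed row means the WHERE clause matched an unclaimed clue.
    return cur.rowcount == 1
```

A quick usage check: after inserting an unclaimed row for clue 123, the first `claim_clue` call returns True and any later call for the same clue returns False, because the row no longer satisfies `hunter_id IS NULL`.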

Third attempt tried Kafka instead of Redis Streams. We sharded the 500 clues into 500 Kafka topics, produced each clue change to the topic, and consumed with 50 separate consumers. The consumer lag graph showed 2 minutes of lag under 10 k hunters, and the server started retrying events with exponential backoff. The lag never recovered until the event loop was paused manually. Kafka brokers on m5.xlarge nodes could handle 200 MB/s of write traffic, but 500 topics meant 500 separate partition leaders, and the controller election rate during broker restarts caused 15-second pauses every time we rolled a node.

The Architecture Decision

We ripped out the lock, the Lua script, and Kafka in one go and went back to Redis, but with two changes that mattered:

We stopped using Streams for every micro-event. Instead we stored the hunt metadata in a Redis Hash with fields: current_hunter, last_solve_time, reset_at. Every 2 seconds the backend worker would read the Hash and broadcast the state to all hunters via WebSocket. That reduced Redis throughput from 8 k to 250 messages per second—well below the 100 k/second Redis limit on the m5.2xlarge node.
We replaced the Lua CAS script with a single Redis Lua function registered as clue_cas. The function accepted clue_id, old_hunter, new_hunter and a token (unix timestamp). Inside the function we used redis.call('HGETALL', clue_id) to read the Hash, checked the current_hunter and last_solve_time, and only updated if the token was newer and the hunter field matched old_hunter. The Lua function ran atomically on the Redis server, so clock skew between workers no longer mattered. The script looked like this:



local key = KEYS[1]
local old = ARGV[1]
local new = ARGV[2]
local token = tonumber(ARGV[3])
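-- HGETALL returns a flat array {field1, value1, ...} inside Lua,
-- so fold it into a field -> value table before indexing by name.
local flat = redis.call('HGETALL', key)
local entry = {}
for i = 1, #flat, 2 do
  entry[flat[i]] = flat[i + 1]
end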
if not entry.current_hunter or entry.current_hunter == old then