Why Hytale Treasure Hunts Crashed Under Load: The One Docs Page That Broke Three Production Clusters

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our clusters run on Kubernetes 1.28 with 16-node autoscaling pools of 8-core/32 GB machines. Each node hosts a Hytale shard plus a dedicated Redis 7.2 cluster for the player session cache. The treasure-hunt feature promised to ingest 500 events per second per shard, persist them to a single file, then fan them out to every other shard so players could watch global leaderboards in real time. The docs showed three variables: tierCount, rewardTable, and stateFilePath. They said nothing about consistency and gave no hint that stateFilePath was actually a shared NFS mount (EFS, 15 ms latency, 120 MB/s throughput). The first load test at 5k concurrent users produced a Wall of Pink: Redis Lua scripts timing out, C# Task.WhenAll calls deadlocking on the same file handle, and 3-second P99 latency on leaderboard reads. The stack traces screamed FileStream.Dispose() contention, but the real crime was architectural: one file, many writers, and no plan for sharding it.

What We Tried First (And Why It Failed)

We bolted on a naive optimistic-locking layer: before each write we read the state, computed a new JSON object, and performed a compare-and-swap via a shell script that called fcntl(F_SETLK). At 10k users the lock contention showed up as 45% of all Redis Lua keys blocked behind the same Lua script. NFS latency jumped from 15 ms to 180 ms because every compare-and-swap forced a full read-modify-write cycle across the wire. We tried converting the state file to Protocol Buffers and using proto lock files, but the fcntl advisory lock still serialized writes. Swapping NFS for an EBS gp3 volume attached to a single master node only moved the bottleneck: the EBS volume itself saturated at 2,000 IOPS while the C# process burned 4 vCPUs just waiting for the kernel page cache to flush every 50 ms. The dedicated EBS volume alone cost us $1,800 a month, more than the entire Redis cluster.
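To make the serialization concrete, here is a minimal sketch of that read-modify-write cycle under an fcntl advisory lock. It is written in Python for brevity rather than the shell-plus-C# setup described above, and the function name, the JSON layout (player ID mapped to points), and the blocking form of the lock are all assumptions for illustration. The point is only that every writer must hold the one lock for the full read, rewrite, and fsync, so writers queue behind each other no matter how the payload is encoded.

```python
import fcntl
import json
import os
import tempfile


def locked_update(path, player_id, points):
    """Read-modify-write the whole JSON state under an exclusive advisory lock.

    Hypothetical helper: the name and the {player_id: points} layout are
    assumptions for illustration, not the real state format.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        # Blocks until no other process holds the lock: writers queue here.
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            raw = f.read()                      # full read of the state
            state = json.loads(raw) if raw else {}
            state[player_id] = state.get(player_id, 0) + points
            f.seek(0)
            f.truncate()                        # full rewrite of the state
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())                # a network round trip on NFS
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
    return state


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "state.json")
    locked_update(demo_path, "p1", 10)
    print(locked_update(demo_path, "p1", 5))
```

Because the lock covers the whole file, the time each writer holds it grows with the size of the state and the latency of the mount, which fits the latency jump described above.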

The Architecture Decision

After a week of poking at flame graphs, we concluded the treasure-hunt state had to be partitioned by shard and by hunt tier. Instead of one global JSON file we split the state into 32 shard-specific SQLite files (WAL mode, synchronous=NORMAL) and one Redis Stream per tier (XADD with MAXLEN 100000). Each shard writes its own SQLite file and publishes events to the Redis Stream. A separate sidecar service we called tier-router consumes the stream, aggregates the top 100 scores per tier using Redis sorted sets (ZADD with the score and player_id as the member), and exposes an in-memory gRPC service for the leaderboard endpoints. We migrated the reward table from a flat JSON array to a protocol-buffer file compiled into the binary, so tier definitions became compile-time constants rather than runtime hot-swaps. The biggest tradeoff was losing atomic cross-tier consistency: if two tiers awarded the same treasure chest key, we now had to implement a saga pattern with idempotency tokens. We accepted eventual consistency for global treasure counts and used a background deduplication job that runs every 30 seconds (Redis keys with a TTL of 86400 seconds).
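The per-shard write path can be sketched with nothing but the standard library. This is a minimal illustration in Python, not the production C# code: the file naming, the events table, its columns, and the function names are all hypothetical. Only the WAL and synchronous=NORMAL pragmas and the idea of idempotency tokens come from the design above. Publishing to the Redis Stream is left out because it needs a live server.

```python
import os
import sqlite3
import tempfile


def open_shard(state_dir, shard_id):
    """Open (or create) one shard's SQLite file with the pragmas named above."""
    # Hypothetical naming scheme: one file per shard, e.g. shard-03.db.
    path = os.path.join(state_dir, f"shard-{shard_id:02d}.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    # Assumed schema: the idempotency token is the primary key, so a
    # replayed event cannot be stored twice.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " token TEXT PRIMARY KEY,"
        " tier INTEGER NOT NULL,"
        " player_id TEXT NOT NULL,"
        " score INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def record_event(conn, token, tier, player_id, score):
    """Insert an event once. Returns True if stored, False if the token was seen."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (token, tier, player_id, score)"
            " VALUES (?, ?, ?, ?)",
            (token, tier, player_id, score),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_shard(tempfile.mkdtemp(), 3)
    print(record_event(conn, "tok-1", 2, "alice", 40))  # first delivery
    print(record_event(conn, "tok-1", 2, "alice", 40))  # replay is ignored
```

Making the token the primary key pushes deduplication into the storage layer for a single shard. Duplicates across tiers or shards would still need something like the saga and background job described above.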

What The Numbers Said After

On the 50k concurrent user load test we saw P99 leaderboard latency drop from 3,200 ms to 180 ms. SQLite write throughput per shard settled at 1,200 writes/sec with 5 ms disk latency (gp3 again, but now 8,000 IOPS provisioned). Redis memory usage climbed 18% because we kept 24 hours of event history, but the Stream MAXLEN cap prevented unbounded growth. Our hardware bill stayed flat: the new SQLite files lived on ephemeral NVMe instance storage (i3en.large) while the Redis Streams were persisted through the Redis AOF, fsynced every second. The only regression was compute: the tier-router sidecar added 0.4 vCPU per shard, pushing our per-shard cost from 8 to 8.4 cores, a 5% delta we deemed acceptable.

What I Would Do Differently

First, I should have ignored the docs page entirely and started by profiling the state file in production at 1k users. Had I run iostat -x 1 during that first test, I would have seen the NFS latency spike before the Redis cluster melted. Second, I would not have trusted SQLite WAL mode on NFS. After the third cluster failure we moved SQLite to local NVMe and rsynced WAL files to a standby node, which was still cheaper than EBS and faster than Redis Streams for our write pattern. Finally, we over-optimized the reward table format: it never changed after compilation, yet we spent two sprints designing a plugin system. The docs suggested flexibility, but the real requirement was speed. Freeze the constants next time.
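Alongside iostat, a quick application-level probe can show what a write actually costs on whatever directory stateFilePath points at. The following is a minimal sketch in Python under stated assumptions: the function names, the 4 KiB payload, and the sample count are arbitrary, and it measures write plus fsync round trips only, not the full read-modify-write cycle.

```python
import os
import statistics
import tempfile
import time


def probe_fsync_latency(directory, samples=50, payload=b"x" * 4096):
    """Time write+fsync round trips in `directory`. Returns latencies in ms."""
    latencies = []
    # The probe file is created inside the directory under test and removed
    # afterwards, so the measurement hits the same mount as the state file.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Median, approximate P99, and max of a list of latencies in ms."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Point this at the directory holding the state file; the system temp
    # directory is used here only so the sketch runs anywhere.
    print(summarize(probe_fsync_latency(tempfile.gettempdir())))
```

Running it once against the shared mount and once against local disk gives a rough side-by-side comparison before any load test starts.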