Why Hytale Treasure Hunts Explode In Production (And How We Fixed It)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Treasure hunts in Hytale aren't just about generating loot. They're about generating loot simultaneously for thousands of players while keeping the world state consistent. We started with the assumption that events are stateless notifications: a hunt starts, we fire an event, clients react. That model worked fine when we had 200 concurrent players. At 2,000 players, the event bus turned into a 40 MB/s firehose of JSON blobs. Each loot drop required serializing the entire chunk state (blocks, entities, metadata) so clients could render the drop in real time. The JVM's G1GC couldn't handle the allocation rate. Every 47 minutes, a GC cycle would pause for 4.2 seconds, the chunk cache would fragment, and the server would hard crash with an OutOfMemoryError in net.minecraft.server.MinecraftServer#processQueue.

The real problem wasn't the hunt logic. It was the architectural laziness of treating events as a catch-all glue layer instead of a boundary layer with explicit interfaces.

What We Tried First (And Why It Failed)

We tried Kafka as the event bus. The plan was to shard hunts by region and stream loot drops as compacted topics. The first run worked for about 6 hours before the compacted topics started to bloat. Each hunt was generating 700 KB of serialized chunk state per drop. At 30 drops per hunt per minute, that's 21 MB per hunt per minute. With 400 active hunts, the brokers couldn't keep up. The lag grew to 12 seconds, clients started rubber-banding, and we got a flood of Discord reports: "You sank my boat!" The event stream was now the bottleneck, not the event source.
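To make the scale concrete, here is the arithmetic behind those figures as a small sketch. The per-drop size, drop rate, and hunt count come from the paragraph above; the decimal unit conversion (1 MB = 1000 KB) is an assumption.

```python
# Inputs taken from the text; decimal units (1 MB = 1000 KB) are assumed.
KB_PER_DROP = 700
DROPS_PER_HUNT_PER_MIN = 30
ACTIVE_HUNTS = 400

mb_per_hunt_per_min = KB_PER_DROP * DROPS_PER_HUNT_PER_MIN / 1000
gb_per_min_total = mb_per_hunt_per_min * ACTIVE_HUNTS / 1000
mb_per_sec_total = mb_per_hunt_per_min * ACTIVE_HUNTS / 60

print(f"{mb_per_hunt_per_min:.0f} MB per hunt per minute")
print(f"{gb_per_min_total:.1f} GB per minute across all hunts")
print(f"{mb_per_sec_total:.0f} MB/s sustained")
```

Under those assumptions the aggregate works out to roughly 8.4 GB per minute, or about 140 MB/s sustained, before any replication overhead.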

Next, we tried Redis Streams with a Lua script to aggregate loot drops per chunk. Within 30 minutes, we hit the 4 GB maxmemory limit because Lua scripts were stacking dropped items in memory while waiting for the next batch. The script was elegant (O(1) per drop), but the memory footprint made it unusable in production.

Finally, we tried a sidecar service: a small Go process that listened to the event bus, aggregated drops per chunk, and published a single delta per second. That reduced the bus load by 92%, but introduced a new problem: stale clients. Players who joined mid-hunt saw no loot until the next delta. We got reports of players digging through empty spots for 18 minutes before loot appeared. The event bus was now consistent, but the UX was broken.
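The aggregation idea can be sketched in a few lines. This is not the Go sidecar itself, just a minimal Python illustration of "collect drops per chunk, publish one batch per tick"; the class name, method names, and the chunk and loot identifiers are all hypothetical.

```python
from collections import defaultdict


class ChunkAggregator:
    """Collects loot drops per chunk and emits one batch per flush."""

    def __init__(self):
        self._pending = defaultdict(list)

    def add_drop(self, chunk_id, loot_id):
        # Buffer the drop instead of publishing it immediately.
        self._pending[chunk_id].append(loot_id)

    def flush(self):
        """Return pending drops grouped by chunk and reset the buffer."""
        batch = {chunk: list(items) for chunk, items in self._pending.items()}
        self._pending.clear()
        return batch


agg = ChunkAggregator()
agg.add_drop((0, 0), "loot-1")
agg.add_drop((0, 0), "loot-2")
agg.add_drop((1, 0), "loot-3")

first = agg.flush()   # one message on the bus instead of three
second = agg.flush()  # empty: nothing new since the last flush
print(first, second)
```

In this sketch, a client that subscribes after the first flush receives only later batches and never learns about the drops already published, which is one plausible way a late joiner ends up with an empty view.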

The Architecture Decision

We drew a hard service boundary around the treasure hunt engine. Inside the boundary, hunts are stateful objects with their own memory heap. Outside the boundary, we only expose three operations:

StartHunt(playerId, regionId) → HuntId
ClaimLoot(playerId, HuntId, position) → LootResult
EndHunt(HuntId) → TimedMetadata
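A minimal in-memory sketch of that boundary might look like the following. Only the three operation names and their argument lists come from the text; the fields of LootResult and TimedMetadata, the internal _Hunt record, and the place_loot helper are assumptions made for illustration.

```python
import itertools
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Position = Tuple[int, int, int]


@dataclass
class LootResult:  # fields are hypothetical
    claimed: bool
    loot_id: Optional[str] = None


@dataclass
class TimedMetadata:  # fields are hypothetical
    hunt_id: int
    duration_seconds: float
    claims: int


@dataclass
class _Hunt:  # internal state, never exposed across the boundary
    player_id: str
    region_id: str
    started_at: float
    loot: Dict[Position, str] = field(default_factory=dict)
    claims: int = 0


class HuntEngine:
    def __init__(self):
        self._hunts: Dict[int, _Hunt] = {}
        self._ids = itertools.count(1)

    def start_hunt(self, player_id: str, region_id: str) -> int:
        hunt_id = next(self._ids)
        self._hunts[hunt_id] = _Hunt(player_id, region_id, time.monotonic())
        return hunt_id

    def place_loot(self, hunt_id: int, position: Position, loot_id: str) -> None:
        # Internal helper for the sketch; not one of the three public operations.
        self._hunts[hunt_id].loot[position] = loot_id

    def claim_loot(self, player_id: str, hunt_id: int, position: Position) -> LootResult:
        hunt = self._hunts.get(hunt_id)
        if hunt is None:
            return LootResult(claimed=False)
        loot_id = hunt.loot.pop(position, None)  # each piece of loot is claimable once
        if loot_id is None:
            return LootResult(claimed=False)
        hunt.claims += 1
        return LootResult(claimed=True, loot_id=loot_id)

    def end_hunt(self, hunt_id: int) -> TimedMetadata:
        hunt = self._hunts.pop(hunt_id)
        return TimedMetadata(hunt_id, time.monotonic() - hunt.started_at, hunt.claims)
```

Callers only ever see IDs and result objects, so the hunt's internal chunk cache can change shape without touching anything outside the boundary.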

Each hunt object maintains its own chunk cache for loot drops. When a player drops loot, the hunt engine only serializes the delta: block change, entity spawn, and loot ID. That delta is 4 KB instead of 700 KB.
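The difference between shipping full chunk state and shipping a delta can be shown with a toy example. The chunk layout (a 16x16x16 map of string keys to block names), the JSON encoding, and the field names are assumptions, so the byte counts it prints are illustrative and not the 4 KB and 700 KB figures above.

```python
import json


def block_delta(before, after):
    """Return only the positions whose block changed (None means removed)."""
    keys = before.keys() | after.keys()
    return {k: after.get(k) for k in keys if before.get(k) != after.get(k)}


# A toy chunk: 4096 stone blocks keyed by "x,y,z".
before = {f"{x},{y},{z}": "stone"
          for x in range(16) for y in range(16) for z in range(16)}
after = dict(before)
after["3,4,5"] = "chest"  # the loot drop changes a single block

delta = {
    "blocks": block_delta(before, after),
    "entity_spawn": "loot_chest",  # hypothetical entity type
    "loot_id": "loot-42",          # hypothetical loot ID
}

full_bytes = len(json.dumps(after).encode())
delta_bytes = len(json.dumps(delta).encode())
print(f"full chunk: {full_bytes} bytes, delta: {delta_bytes} bytes")
```

The delta's size depends on how much changed, not on how large the chunk is, which is the property that makes it safe to send per drop.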

To keep clients in sync without flooding the bus, we introduced a two-tier broadcast:

Tier 1: Real-time delta for players within 32 blocks of a hunt. We use a WebSocket connection directly to the hunt engine. This is O(1) per player and keeps latency under 150 ms.

Tier 2: Aggregated snapshot every 2 seconds for all other players. The snapshot is a flat Protobuf message with only active hunt IDs and loot positions. Clients merge this snapshot into their local state. This keeps the bus load to 2 MB/s even at 5,000 players.
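The tier assignment itself is a distance check. Here is a minimal sketch, assuming Euclidean distance in block coordinates and a boundary that includes exactly 32 blocks; the text does not say which metric the engine uses, and the function and variable names are hypothetical.

```python
import math

REALTIME_RADIUS_BLOCKS = 32  # radius from the text


def assign_tiers(hunt_pos, players):
    """Split players into (realtime, snapshot) lists by distance to the hunt.

    players maps a player ID to an (x, y, z) position.
    """
    realtime, snapshot = [], []
    for player_id, pos in players.items():
        if math.dist(pos, hunt_pos) <= REALTIME_RADIUS_BLOCKS:
            realtime.append(player_id)   # Tier 1: direct WebSocket deltas
        else:
            snapshot.append(player_id)   # Tier 2: periodic aggregated snapshot
    return realtime, snapshot


players = {
    "a": (0, 0, 0),
    "b": (32, 0, 0),
    "c": (33, 0, 0),
    "d": (100, 64, 100),
}
realtime, snapshot = assign_tiers((0, 0, 0), players)
print(realtime, snapshot)
```

A production version would also need to move players between tiers as they walk in and out of range, which this sketch leaves out.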

We also switched from G1GC to ZGC with a 16 MB max heap per hunt engine instance. The allocation rate dropped from 40 MB/s to 2.3 MB/s, and GC pauses fell below 1 ms.

What The Numbers Said After

After the rewrite, the server ran for 14 days without a single crash. The memory usage stabilized at 1.8 GB across all hunt engines, down from 3.4 GB before. The WebSocket tier handled 3,200 concurrent connections with a 99th percentile latency of 89 ms. The aggregated snapshot tier kept the Redis stream at 1.8 MB/s, which the brokers handled with 150 ms lag at peak.

Player reports shifted from "loot not appearing" to "loot appearing too fast." We had to dial back the WebSocket broadcast to 250 ms updates to match client expectations. The hunt completion rate increased by 22% because players no longer abandoned hunts waiting for loot.
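One way to pace a broadcast to a fixed interval is to coalesce deltas and release them as a batch. The sketch below assumes that approach; the class and method names are hypothetical, and the clock is passed in explicitly so the behavior is easy to check.

```python
class CoalescingBroadcaster:
    """Buffers deltas and releases them at most once per interval."""

    def __init__(self, interval_s=0.25):
        self.interval_s = interval_s
        self._pending = []
        self._last_sent = None

    def offer(self, delta, now):
        """Queue a delta; return a batch to send if the interval elapsed, else None."""
        self._pending.append(delta)
        if self._last_sent is None or now - self._last_sent >= self.interval_s:
            batch, self._pending = self._pending, []
            self._last_sent = now
            return batch
        return None


b = CoalescingBroadcaster()
sent = [b.offer("d1", 0.0), b.offer("d2", 0.1), b.offer("d3", 0.2), b.offer("d4", 0.25)]
print(sent)
```

This version only sends when a new delta arrives, so a real implementation would also need a timer to flush whatever is still pending after the last delta.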

What I Would Do Differently

I would not have started with an event bus as the primary synchronization mechanism. Events are a great decoupling tool, but they are a terrible state synchronization tool. The moment you serialize chunk state into an event, you've coupled the event to the internal state of the system.

I would also have resisted the urge to aggregate loot drops in Lua scripts. Lua is fast, but its memory model is opaque. A single Redis stream instance is not enough for a global game. We ended up sharding Redis by region anyway, which added operational overhead. If we had started with the hunt engine as a first-class service, we could have avoided that refactor.

Finally, I would have put the hunt engine behind a gRPC boundary from day one. REST endpoints were tempting because they're easy to debug with curl, but the latency added up. gRPC's bidirectional streaming let us push deltas in real time without polling, and the Protobuf schema kept the message size small. The hunt engine now speaks gRPC internally to the world server, and we can scale it independently.

The lesson isn't that event buses are evil. It