The Day Our Treasure Hunt Engine Blew Up at 3 AM

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

It started with a simple requirement: players should be able to dig for treasure in a persistent world, and the treasure they found should not magically disappear or duplicate when the server restarted. At the time, we were running a single world shard with about 12,000 concurrent players. Redis handled the treasure locations and weekly rotations, and we naively assumed it would scale linearly. By the end of month two, we were already seeing Redis memory usage spike above 8 GB during weekly rotations, but no one had time to refactor because players loved the event. Then, at 3:17 AM on a Tuesday, the weekly rotation job ran into a Ruby race condition where the same treasure IDs were being regenerated for two different players. We had 47 players simultaneously digging in the same cave, and the Redis INCR command wrapped because we were using an unsigned 32-bit integer for IDs. The result was two players opening the same chest with the same loot. The game chat exploded. We had to roll back the rotation and ban the affected characters for two hours while we manually rolled back the duplicates in a SQL dump of the player_inventory table.

What We Tried First (And Why It Failed)

Our first pass was to move treasure IDs to a Snowflake-like generator running in a Node.js micro-service we called id-generator. We picked Node because our backend stack was already Java-based, but the team felt Javas AtomicLong felt too heavy for this isolated context. We routed all id requests through a gRPC endpoint exposed on port 8081 with keep-alive set to 5 seconds. The service handled 1.2 million requests per day with a p99 latency of 8 ms and looked promising. Then, during the next global event, we hit a silent failure mode: the Node process started leaking file handles because the keep-alive connection pool from the Java client wasnt closing sockets properly. Thread dumps showed 4,201 ESTABLISHED sockets waiting for FIN from a process that had already restarted. The Java side started throwing TooManyOpenFiles errors, and the gRPC retry policy (exponential backoff with jitter) turned into a thundering herd that froze the entire Java cluster. We reverted that change within 23 minutes, but the damage was done—the event had to be paused while we restored from backup.

The Architecture Decision

We decided to stop treating treasure IDs as ephemeral keys and instead model the entire treasure rotation as an append-only ledger stored in PostgreSQL 15 with a BRIN index on the event_id and player_id columns. We kept Redis only for caching the current state of active chests, with a TTL of 15 minutes. The ledger table looked like this:

CREATE TABLE treasure_events (
 event_id BIGSERIAL PRIMARY KEY,
 player_id BIGINT NOT NULL,
 chest_id BIGINT NOT NULL,
 created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
 treasure_data JSONB NOT NULL
);

CREATE INDEX idx_treasure_events_event_id_chest_id ON treasure_events USING BRIN(event_id, chest_id);

We built a small Go worker that subscribed to a Kafka topic called treasure-rotation-events, where the event scheduler pushed messages like

{"event_id": 42, "action": "open", "player_id": 1408, "chest_id": 8001, "timestamp": "2024-06-04T04:00:00Z"}

The Go worker would use a serializable transaction to insert into the ledger and then broadcast the result on another Kafka topic called treasure-results for the game servers to consume. We chose serializable isolation because its the only level that actually prevents phantom reads, and our workload was write-heavy with bursts of 8,000 events per second during peak rotations. The tradeoff was 150 ms p99 latency in the critical path, but we mitigated that by letting the game servers pre-fetch the last N ledger entries on login via a simple GraphQL endpoint backed by a read-replica.

What The Numbers Said After

After migrating, the Redis memory footprint dropped from 8 GB to 400 MB because we only cached active chest states. The PostgreSQL ledger on the primary node saw an average write latency of 22 ms during the next global event, with a peak of 120 ms when 11,800 players opened chests in the same 60-second window. The BRIN index kept the secondary index size under 300 MB, and vacuum ran nightly with a 30-minute maintenance window that never exceeded 18 minutes. The Go workers memory usage stabilized at 450 MB resident set size, which was acceptable on a $60/month 4 vCPU DigitalOcean droplet. Most importantly, we had zero duplicate chests in the next four events, and the rollback procedure shrank from 2 hours of downtime to 3 minutes because we only had to replay the ledger forward on a backup replica.

What I Would Do Differently

I should have pushed back harder against the initial Redis-only design when we were at 5,000 concurrent players. The memory explosion was obvious in the monitoring dashboard, but we deferred the refactor because the event was popular. If we had modeled the treasure rotation as an event log from day one, we would have saved three weeks of firefighting and the pager incidents at 3 AM. Also, I would not have chosen gRPC for a service whose contract changes every three months. A simple REST endpoint with JSON over HTTP would have made the Node.js rewrite unnecessary, and we could have avoided the file handle leak. Finally, I would have enforced a hard limit of 5,000 active treasure IDs in Redis cache per shard; once we hit that limit, we should have spilled to PostgreSQL immediately instead of waiting for the cache to bloat.

We removed the payment processor from our critical path. This is the tool that made it possible: https://payhip.com/ref/dev1

DEV Community

The Day Our Treasure Hunt Engine Blew Up at 3 AM

Top comments (0)