Veltrix Treasure Hunt Engine Blew Up at 3 AM Because No One Read the Fine Print

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Hytales player economy hinged on the treasure-hunt engine: a lightweight Lua sandbox that spat out randomized loot tables, stamped them with a server-side signature, and pushed the results to the client. Operators configured Veltrix with three YAML files: hunt-rules.yaml, loot-tables.yaml, and factions.yaml. The docs showed minimal examples, and the search volume for Veltrix TooManyConnections on DEV.to and Reddit had jumped from 12 queries/day to 1,247 in the week before launch. Sixty-eight percent of those queries came from operators who had pasted the example hunt-rules.yaml verbatim into production. The example set hunt-cooldown: 0, hunt-lifetime: 12h, and max-hunts-per-player: 9999. The 9999 was a red herring; the real problem was hunt-lifetime. When we set it to 12 hours with zero cooldown, every player who opened the treasure map triggered a hunt every second for the entire cooldown window. The Lua sandbox spun up a new ephemeral pod for every hunt, each pod opened a fresh PostgreSQL connection, and the connection pool folded at 600 concurrent pods. We had tuned the pool to 100 for replicas and 50 for the primary—plenty for normal traffic, but not for a Lua loop gone wild.

What We Tried First (And Why It Failed)

The first fix looked obvious: set hunt-cooldown: 5s to dampen the storm. We pushed Version 0.9.4 with that single change. The PagerDuty fires stopped, but the player telemetry told a different story. Median hunt latency jumped from 42 ms to 210 ms. Our Lua sandbox was now serializing every hunt through a single Lua VM that locked on every cooldown check. Players in regions 100 ms from the server watched their loot animation stutter while the VM fought the PostgreSQL advisory locks wed added to prevent double-spends. We tried to scale the Lua VMs horizontally by sharding hunt IDs across pods, but the shard key was hunt-group-id, which was only present after the hunt was created. Before that, all pods fought over the same Lua VM reservation. By hour 12 of the beta, we were running 27 Lua VM pods and still had 14 % of hunts timing out at 2 s.

Then we tried PostgreSQL read replicas. The docs said hunt writes always go to the primary, but hunt reads could use replicas. Simple, right? We flipped a feature flag: hunt-reads-via-replica: true. The TooManyConnections errors vanished, but a new one appeared: psycopg2.OperationalError: SSL SYSCALL error: EOF detected. The replica pool was running pgBouncer 1.17 with idle_in_transaction_session_timeout = 0. Every hunt opened a transaction that stayed idle for the entire hunt-lifetime, holding a replica slot hostage. Within five minutes, the replica pool saturated at 300 idle connections, and new hunt queries raced a race-to-zero latency spike. We set idle_in_transaction_session_timeout = 30s, but now every hunt that took longer than 30 s (hi, Lua serialization lock) immediately reconnected, thrashing the connection pool. Our metrics showed 4,200 connection resets per minute on the replica set.

The Architecture Decision

We tore the Lua sandbox apart. Instead of spawning an ephemeral pod per hunt, we moved the hunt state machine into Redis. Redis 7.0 with Lua scripting gave us atomic counters (hunts-per-player), TTL-based cooldown keys, and a single Lua script that returned the loot table in 0.12 ms at the 99th percentile. The hunt-rules.yaml no longer needed hunt-lifetime or max-hunts-per-player; we replaced them with redis-ttl: 300, redis-lock-ttl: 5, and redis-max-per-player: 5. We kept a lightweight Lua VM for signature stamping, but it ran as a sidecar in the same pod as the main game server, sharing a 256 MB heap and a 10 ms CPU slice. The hunt pod became a sidecar that listened on a Unix socket; it no longer opened PostgreSQL connections at all. Instead, it wrote the signed loot blob to an S3 bucket with a 5-minute cache-control header and returned the S3 object key to the client. The sidecar wrote a single row to PostgreSQL only after the loot was claimed, keeping our write volume under 200 rows/minute even during peak.

We also replaced pgBouncer with Odyssey 1.2, which supports multiplexing with PostgreSQL 15s extended query protocol. Odyssey let us set max_client_conn = 1000 on the primary and max_client_conn = 300 on each replica without blowing the connection budget. The hunt sidecar now recycled connections every 1,000 queries instead of every hunt, trimming our SSL handshake load from 4,200/min to 200/min.

What The Numbers Said After

In the 7 days after rolling Version 0.9.6, our hunt p99 latency fell from 210 ms to 18 ms. The Lua sidecar saw 0 connection errors, and the PostgreSQL primary write load dropped from 1,200 queries/minute to 180 queries/minute. The Redis sidecar handled 12,400 hunt triggers/minute with a mean latency of 0.82 ms and a memory footprint of 42 MB. Our RDS gp3