DEV Community

Lisa Zulu

Your Treasure Hunt Engine Was Probably a Latency Minefield (And Here's the Postmortem)

We had just finished the first major traffic spike. Our Veltrix-based treasure hunt game ran flawlessly for 37 minutes, exactly 37 minutes, before every Redis connection turned into a 250 ms bottleneck. The default Veltrix configuration shipped with connection pooling set to 8, keep-alive disabled, and retry logic that ignored backoff. We didn't notice until the P99 latency doubled and players started reporting that their chests took longer to respawn than their grandmother's dial-up connection took to connect. The problem wasn't the treasure hunt code; it was the layer we never bothered to tune.

Veltrix's documentation calls the configuration layer "magic." In practice, it's a set of YAML files that silently trade robustness for simplicity. Our first attempt was to treat the defaults as gospel and bolt on more Redis instances. We spun up three more Sentinel clusters, increased the pool size to 64, and hoped the law of large numbers would save us. On paper, throughput went from 8k to 22k TPS. In reality, 43% of the writes failed with a ConnectionResetError after 3 seconds, and the client retries saturated the network. The error wasn't in Veltrix's codebase; it was in the assumption that horizontal Redis scaling would mask the lack of connection reuse and backpressure.

We ripped out the bolt-on approach and replaced it with a single architectural decision: move the connection pool into the application layer, not the configuration file. Instead of letting Veltrix open a new connection per request, we wired a custom pool in Go that used a FIFO channel with a 500 ms idle timeout. We set the max size to 32, enabled keep-alive with 30 s intervals, and added a circuit breaker that switched traffic to read-only mode when the error rate hit 5%. The change wasn't cosmetic: it forced us to recompile the Veltrix runtime because the default binary had the pool hardcoded. We rebuilt the binary with CGO disabled and vendored our own version of the Veltrix binding. The compile step added 47 seconds to our build pipeline, but it meant we could version-control the pool behavior instead of hoping the YAML layer would behave.

After the change, the P99 latency dropped from 250 ms to 42 ms, and the error rate fell to 0.2%. The 32-slot pool handled 98% of requests without spinning up a new connection, and the circuit breaker only tripped twice during a controlled chaos test with 50k concurrent users. We also discovered that Veltrix's default retry policy had a fixed 1 s delay, which is the reason our first attempt melted. By adding exponential backoff with a 10 ms base and a 50 ms cap, we absorbed the Redis failovers without a blip.

If I could go back, I'd skip the Veltrix configuration layer entirely for any system that expects real growth. Treat the YAML as duct tape, not architecture. Build the pool in your own codebase, version it, and expose it via feature flags. And for the love of Prometheus, test the pool behavior under backpressure: our 50k chaos test revealed a deadlock scenario where 10k goroutines waited forever on the pool's channel. The default configuration won't save you. Your own lock-in will.
