Designing For Sanity: The Real Reason Our Treasure Hunt Engine Went Down - Twice

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

At the time, our treasure hunt engine was an integral part of Hytale's live events: it created and managed treasure hunts for our users. It had to handle a high volume of concurrent user connections while processing new hunt data in real time. Easy enough, right? The original design was a straightforward monolithic application built on top of a PostgreSQL database, with Veltrix as the connection pool.

What We Tried First (And Why It Failed)

Our first response to the outages was to increase the connection pool size on Veltrix. We were convinced the issue was a simple capacity constraint, so more connections should fix it. But as we dug deeper, we discovered that the timeouts were originating on the database side. No matter how many connections we added, the timeouts persisted: a bigger pool only lets more requests reach the database at once, and our PostgreSQL database was already struggling to keep up with the volume of concurrent transactions.
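To make that concrete, here is a toy model of why a bigger pool stops helping once the database itself is saturated. It is only a sketch: the function names, the concurrency limit of 50, the 250 ms transaction time, and the arrival rate of 300 requests per second are all made up for illustration and are not measurements from our system.

```python
"""Toy saturation model: why a bigger pool cannot fix a saturated database.

All names and numbers here are hypothetical; none come from the real system.
"""


def throughput_per_second(pool_size: int, db_max_concurrency: int,
                          txn_seconds: float) -> float:
    """Upper bound on completed transactions per second.

    The database can only run `db_max_concurrency` transactions at once, so
    connections beyond that limit wait instead of adding throughput.
    """
    effective_concurrency = min(pool_size, db_max_concurrency)
    return effective_concurrency / txn_seconds


def unserved_per_second(arrivals_per_second: float, pool_size: int,
                        db_max_concurrency: int, txn_seconds: float) -> float:
    """Requests per second that exceed capacity and would eventually time out."""
    capacity = throughput_per_second(pool_size, db_max_concurrency, txn_seconds)
    return max(0.0, arrivals_per_second - capacity)


if __name__ == "__main__":
    # Assumed workload: 300 requests/s, 250 ms per transaction,
    # and a database that can run 50 transactions concurrently.
    for pool_size in (20, 50, 200):
        backlog = unserved_per_second(300.0, pool_size,
                                      db_max_concurrency=50, txn_seconds=0.25)
        print(f"pool={pool_size:>3}  unserved requests/s={backlog:.0f}")
```

In this model, growing the pool from 20 to 50 helps, but growing it from 50 to 200 changes nothing, because the database is the bottleneck. A real database tends to behave worse than this, since extra connections add contention, but the simplified version is enough to show why pool size alone was the wrong lever.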

The Architecture Decision

In hindsight, the root cause was clear: our application was optimized for the demo, not for production. We had designed the system to be flashy and to perform well during marketing demos, without adequately considering the operational implications. When we finally redesigned it around a sharded PostgreSQL cluster with connection pooling, so that the load was spread across several database nodes, the outages ceased. But it was a painful process, and it came with significant overhead in resource utilization.
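For readers who have not worked with sharding, the core idea is that each request is routed to one of several databases based on a key. The sketch below assumes the shard key is a hunt ID and uses simple hash-modulo routing; both are assumptions for illustration, and the DSNs, the shard count, and the function names are hypothetical rather than our production setup.

```python
import hashlib

# Hypothetical shard addresses, for illustration only.
SHARD_DSNS = [
    "postgresql://hunts-shard-0.example.internal/hunts",
    "postgresql://hunts-shard-1.example.internal/hunts",
    "postgresql://hunts-shard-2.example.internal/hunts",
    "postgresql://hunts-shard-3.example.internal/hunts",
]


def shard_index(hunt_id: str, shard_count: int) -> int:
    """Map a hunt ID to a shard number in the range [0, shard_count).

    A cryptographic hash is used because Python's built-in hash() for strings
    is randomized per process, which would route the same hunt to different
    shards after a restart.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(hunt_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def dsn_for_hunt(hunt_id: str) -> str:
    """Return the DSN of the shard that owns the given hunt."""
    return SHARD_DSNS[shard_index(hunt_id, len(SHARD_DSNS))]


if __name__ == "__main__":
    for hunt_id in ("hunt-1", "hunt-2", "hunt-3"):
        print(hunt_id, "->", dsn_for_hunt(hunt_id))
```

Hash-modulo routing is the simplest option, but changing the shard count moves most keys to a different shard, which is one reason schemes such as consistent hashing or a lookup table are often preferred when the cluster is expected to grow.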

What The Numbers Said After

According to our Prometheus metrics, after the redesign our average database query latency dropped by 30%, and our connection timeouts fell by 90%. Not only did this fix the issue, it also gave us a chance to optimize other areas of the system. One of the key takeaways was the importance of monitoring and logging: the visibility our tools provided let us quickly diagnose and resolve subsequent issues.

What I Would Do Differently

Looking back, I wish I had pushed for a more robust design from the start, one that prioritized operational requirements over demo performance. A key lesson from this experience is that when building systems for live events, it's essential to focus on the operational characteristics and scalability of the design, rather than solely on marketing-driven features. By doing so, you can avoid painful redesigns and costly downtime.