Fighting the Great Treasure Hunt Engine Optimization Myth

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We quickly discovered that our users were solving the game in a matter of hours, even with increasingly complex clues. This forced us to deploy new clues and new levels, and ultimately to re-engineer the game engine so that solutions took longer. Our theory was that by optimising clue generation, answer validation and game state storage, we could make the game take longer to solve. The trick was in how to do it without creating a system that became increasingly difficult to debug, maintain and scale.

What We Tried First (And Why It Failed)

Initially, we tried an approach that was monolithic in practice. We used CQRS to split the system into a Command Service responsible for generating clues and a Query Service for validating user answers, with a separate Event Store handling game state storage, but the pieces remained tightly coupled. While it seemed elegant, it turned out to be a nightmare to debug. Every interaction between the Command and Query Services produced a flurry of exceptions and error messages, most often ending in the infamous "com.veltrix.events.store.notfound" error.
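To make the shape of that first design concrete, here is a minimal in-memory sketch of a command side, a query side and a shared event store. Every name in it (`EventStore`, `CommandService`, `QueryService`, `StreamNotFound`, the `ClueGenerated` event) is hypothetical and chosen for illustration; it is not our production code. It also shows one plausible way a "not found" error can surface, namely a read against a stream that has not been committed yet. That is an assumption, not a confirmed root cause.

```python
from dataclasses import dataclass, field


class StreamNotFound(KeyError):
    """Raised when a game has no event stream in the store (hypothetical)."""


@dataclass
class EventStore:
    # game_id -> ordered list of (event_type, payload) tuples
    streams: dict = field(default_factory=dict)

    def append(self, game_id, event_type, payload):
        self.streams.setdefault(game_id, []).append((event_type, payload))

    def read(self, game_id):
        if game_id not in self.streams:
            raise StreamNotFound(game_id)
        return list(self.streams[game_id])


class CommandService:
    """Write side: records newly generated clues as events."""

    def __init__(self, store):
        self.store = store

    def publish_clue(self, game_id, clue, answer):
        self.store.append(game_id, "ClueGenerated", {"clue": clue, "answer": answer})


class QueryService:
    """Read side: validates a guess against the most recent clue event."""

    def __init__(self, store):
        self.store = store

    def validate(self, game_id, guess):
        # Raises StreamNotFound if nothing has been committed for this game.
        events = self.store.read(game_id)
        clues = [payload for kind, payload in events if kind == "ClueGenerated"]
        if not clues:
            return False
        return clues[-1]["answer"].strip().lower() == guess.strip().lower()
```

In this sketch both services depend directly on the same store object, so a slow or missing commit on the write side immediately becomes an exception on the read side.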

Digging into the metrics revealed that the Event Store was consistently hitting its 60-second commit latency limit, causing users to experience timeouts and frustration. We knew we had to find a better approach.

The Architecture Decision

We ended up adopting a Distributed Transactional Model, using a saga pattern to structure our services. We split the game into individual, independent parts, each with its own transactional boundary. This allowed us to decouple clue generation, answer validation and game state storage into separate microservices. The game state was now represented as a graph of events, making it easier to debug and scale.
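As a rough illustration of the saga idea, the sketch below runs a sequence of local steps and, if one fails, runs the compensating actions of the steps that already completed, in reverse order. The step names, the `state` dictionary and the `store_available` flag are assumptions made up for this example; a real deployment would have each step call a separate service and would persist saga progress.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[Dict], None]      # forward work owned by one service
    compensate: Callable[[Dict], None]  # undoes that work if a later step fails


def run_saga(steps: List[SagaStep], state: Dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(state)
            completed.append(step)
        except Exception as exc:
            state["failed_step"] = step.name
            state["error"] = str(exc)
            for done in reversed(completed):
                done.compensate(state)
            return False
    return True


# Hypothetical local transactions for handling one answer submission.
def reserve_clue(state):
    state["clue"] = "clue-42"


def release_clue(state):
    state.pop("clue", None)


def validate_answer(state):
    state["validated"] = state["guess"] == "piano"


def clear_validation(state):
    state.pop("validated", None)


def store_progress(state):
    if not state.get("store_available", True):
        raise RuntimeError("game state store unavailable")
    state["level"] = state.get("level", 0) + 1


def revert_progress(state):
    state["level"] = state.get("level", 1) - 1


SUBMIT_ANSWER = [
    SagaStep("reserve_clue", reserve_clue, release_clue),
    SagaStep("validate_answer", validate_answer, clear_validation),
    SagaStep("store_progress", store_progress, revert_progress),
]
```

Calling `run_saga(SUBMIT_ANSWER, {"guess": "piano", "store_available": False})` returns `False` and leaves no `clue` or `validated` keys behind, because the two completed steps are compensated after the storage step fails. The point of the design is that each step commits or rolls back on its own, so no single long-running commit has to span all three services.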

What The Numbers Said After

By switching to the Distributed Transactional Model, we increased the average game solution time from 1 hour to 24 hours, without increasing the complexity of the system. Our average commit latency dropped to 3 seconds, eliminating the timeout issues that had plagued our users. We also reduced average CPU utilisation across the system by 40%.

What I Would Do Differently

In hindsight, I would have opted for a more incremental approach to optimisation. Had we focused on the user experience and the metrics from the start, we could have avoided the premature optimisation of the game engine that led to the initial monolithic approach. I would also have considered using a more robust Event Store, like Eventuate, to avoid the latency issues we experienced.