The Event Subscription Fallacy: Why Veltrix Operators Hate Me

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In our treasure hunt engine, collecting a treasure or completing a quest triggers a sequence of events. These events notify other players, update the leaderboard, and kick off side quests. With 10,000 concurrent players each generating 50 events per minute (about 500,000 events per minute in total), our event bus was backing up. Our operators were getting blamed for the crashes and slowdowns, but the real problem was the event subscription configuration: a mess of hardcoded event buses, brittle queue sizes, and unexplained timeouts.
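To make the load concrete, here is the back-of-the-envelope arithmetic as a small Python sketch. The two input figures come from the paragraph above; the function and constant names are just illustrative.

```python
# Figures quoted in the post; names are illustrative, not part of any Veltrix API.
CONCURRENT_PLAYERS = 10_000
EVENTS_PER_PLAYER_PER_MINUTE = 50


def events_per_minute(players: int, per_player_per_minute: int) -> int:
    """Aggregate event rate across all concurrent players."""
    return players * per_player_per_minute


total_per_minute = events_per_minute(CONCURRENT_PLAYERS, EVENTS_PER_PLAYER_PER_MINUTE)
total_per_second = total_per_minute / 60

print(f"{total_per_minute:,} events/minute, about {total_per_second:,.0f} events/second")
```

That works out to roughly 8,300 events per second that the bus has to absorb at steady state, before any bursts.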

What We Tried First (And Why It Failed)

Our first approach was to use the default Veltrix event bus settings and hope for the best. Spoiler alert: it didn't work out. The default queue size was 10,000 messages, and the queue filled up within an hour of production startup. Once it was full, events were delayed by 30 minutes, so players waited what felt like an eternity for their quest rewards. We tried increasing the queue size, but that only led to memory bloat and eventual crashes. Our operators were stumped, and I was at a loss for how to fix it.
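A rough way to see why a bigger queue did not help: a bounded queue fills whenever events arrive faster than they drain, and extra capacity only postpones that moment. The sketch below assumes a steady arrival rate derived from the player figures earlier in the post and a purely hypothetical drain rate (we never measured one at the time), so treat the numbers as illustration rather than a reconstruction of what happened.

```python
import math


def seconds_until_full(capacity: int, arrival_rate: float, drain_rate: float) -> float:
    """Time for an initially empty bounded queue to fill at steady rates (events/second)."""
    surplus = arrival_rate - drain_rate
    if surplus <= 0:
        return math.inf  # consumers keep up, so the queue never fills
    return capacity / surplus


ARRIVAL_RATE = 10_000 * 50 / 60   # events/second, from the player figures in the post
ASSUMED_DRAIN_RATE = 8_330.0      # events/second, hypothetical: chosen only for illustration
DEFAULT_CAPACITY = 10_000         # default queue size quoted in the post

default_fill = seconds_until_full(DEFAULT_CAPACITY, ARRIVAL_RATE, ASSUMED_DRAIN_RATE)
tenfold_fill = seconds_until_full(10 * DEFAULT_CAPACITY, ARRIVAL_RATE, ASSUMED_DRAIN_RATE)

print(f"default queue full after about {default_fill / 60:.0f} minutes")
print(f"10x queue full after about {tenfold_fill / 3600:.1f} hours")
```

Under these assumed rates, ten times the capacity buys ten times the time and costs ten times the memory, but the queue still fills. The only real fixes are draining faster or producing less.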

The Architecture Decision

I decided to implement a distributed event subscription system using Apache Kafka and a custom message broker. Each event producer pushes messages to a Kafka topic; the custom broker consumes from that topic and forwards each event to the relevant event handlers. This setup let us decouple event producers from event consumers, scale both independently, and improve reliability. I also introduced a circuit breaker and a retry mechanism to handle transient failures in our event handlers. Together these gave us visibility into event subscription failures, so we could detect and resolve issues before they escalated.
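For readers who have not built one, here is a minimal sketch of the circuit breaker plus retry idea wrapped around an event handler. It is not our production code: the class, function, and parameter names (`CircuitBreaker`, `deliver_with_retry`, `failure_threshold`, `reset_timeout`) and all default values are assumptions for illustration, and it is deliberately independent of Kafka so it runs on its own.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call once reset_timeout passes."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # seconds, hypothetical default
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, handler: Callable[[Any], Any], event: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("handler circuit is open")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = handler(event)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def deliver_with_retry(breaker: CircuitBreaker, handler: Callable[[Any], Any], event: Any,
                       max_attempts: int = 3, base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry failures with exponential backoff; stop at once if the circuit is open."""
    for attempt in range(1, max_attempts + 1):
        try:
            return breaker.call(handler, event)
        except CircuitOpenError:
            raise  # retrying against an open circuit would only add load
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a handler that succeeds on the first attempt.
print(deliver_with_retry(CircuitBreaker(), lambda e: f"handled {e}", "treasure_collected"))
```

The design choice worth noting is that retries handle short blips, while the breaker stops a persistently failing handler from soaking up retries and backing up everything behind it. The counts of opened circuits and exhausted retries are also the natural place to hang failure metrics.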

What The Numbers Said After

After deploying our new event subscription system, we saw an immediate 90% reduction in event subscription failures and a 50% decrease in overall system latency. Our event bus queue sizes decreased by 95%, freeing up 1 GB of memory per hour of production operation. Most importantly, our players saw faster quest rewards and fewer errors when interacting with the event-driven sections of the game.

What I Would Do Differently

If I were to do it again, I would instrument the event subscription system much earlier. I would set up monitoring for event producer throughput, consumer latency, and message broker queue sizes to catch issues before they become critical. I would also invest more time in understanding the specific event subscription patterns in our system, rather than relying on generic distributed-systems principles. That would have let me tune our event subscription configuration more effectively and avoid some of the initial failures we experienced.
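As a sketch of what that early instrumentation could look like, the snippet below tracks the three signals named above (producer throughput, consumer latency, queue depth) in memory over a sliding window and reports simple threshold alerts. Everything here is hypothetical: the `SubscriptionMetrics` class, the thresholds, and the window length are placeholders, and a real setup would export these values to whatever monitoring stack is already in place.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class SubscriptionMetrics:
    window_seconds: float = 60.0
    max_queue_depth: int = 8_000   # hypothetical alert threshold
    max_p95_latency: float = 0.5   # seconds, hypothetical alert threshold
    queue_depth: int = 0
    _publishes: Deque[float] = field(default_factory=deque)
    _latencies: Deque[Tuple[float, float]] = field(default_factory=deque)

    def record_publish(self, now: float) -> None:
        self._publishes.append(now)

    def record_consume(self, now: float, latency: float) -> None:
        self._latencies.append((now, latency))

    def _trim(self, now: float) -> None:
        """Drop samples that have fallen out of the sliding window."""
        cutoff = now - self.window_seconds
        while self._publishes and self._publishes[0] <= cutoff:
            self._publishes.popleft()
        while self._latencies and self._latencies[0][0] <= cutoff:
            self._latencies.popleft()

    def producer_throughput(self, now: float) -> float:
        """Published events per second over the window."""
        self._trim(now)
        return len(self._publishes) / self.window_seconds

    def p95_latency(self, now: float) -> float:
        """Nearest-rank 95th percentile of consumer latency in the window."""
        self._trim(now)
        values = sorted(latency for _, latency in self._latencies)
        if not values:
            return 0.0
        rank = (95 * len(values) + 99) // 100  # integer ceil of 0.95 * n
        return values[rank - 1]

    def alerts(self, now: float) -> List[str]:
        found = []
        if self.queue_depth > self.max_queue_depth:
            found.append(f"queue depth {self.queue_depth} exceeds {self.max_queue_depth}")
        p95 = self.p95_latency(now)
        if p95 > self.max_p95_latency:
            found.append(f"p95 consumer latency {p95:.2f}s exceeds {self.max_p95_latency}s")
        return found


# Usage: one slow consumer sample and a deep queue both raise alerts.
metrics = SubscriptionMetrics(window_seconds=10.0)
metrics.record_publish(1.0)
metrics.record_consume(1.5, latency=0.9)
metrics.queue_depth = 9_500
print(metrics.alerts(now=2.0))
```

Even something this crude would have told us within minutes that the queue was filling and which side, producers or consumers, was responsible.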