The Default Config Conundrum: How Our Events System Became a Free-for-All

#webdev #programming #security #appsec

The Problem We Were Actually Solving

At first, we thought we were solving the age-old problem of "events not being delivered". On closer investigation, we realized that our default config was never set up to handle the traffic we were expecting. We had a mix of event publishers and consumers all competing for the same resources, and we hadn't considered what that design meant for overall event delivery.

What We Tried First (And Why It Failed)

Our first instinct was to throw more resources at the problem. We added more event brokers, increased the queue size, and tweaked the retry policies. To our surprise, the issues persisted: the system was still dropping events, and we were seeing high CPU utilization on the event brokers. Only when we started looking at the metrics did we realize that the problem wasn't the event brokers but the default config itself.

The Architecture Decision

It turned out that our default config was set up to favor event delivery over event acknowledgment: it cared about pushing events out, not about confirming that they arrived. In other words, if an event couldn't be delivered within a certain time frame, it was dropped, and the publisher simply moved on to the next event. This approach might seem intuitive, but it led to a "free-for-all" scenario where events were thrown over the wall with nobody checking whether they actually landed.
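To make the difference concrete, here is a minimal sketch of the two policies. It is not our actual broker config: `publish_all`, `flaky_consumer`, `require_ack`, and `max_attempts` are hypothetical names, and the "time frame" is simplified to a single delivery attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PublishResult:
    delivered: List[str] = field(default_factory=list)
    dropped: List[str] = field(default_factory=list)


def publish_all(
    events: List[str],
    try_deliver: Callable[[str, int], bool],
    require_ack: bool,
    max_attempts: int = 3,
) -> PublishResult:
    """Publish events under one of two hypothetical policies.

    try_deliver(event, attempt) returns True when the consumer acknowledged
    the event on that attempt (attempts are numbered from 1).
    """
    result = PublishResult()
    for event in events:
        # Delivery-first: one attempt, then move on to the next event.
        # Acknowledgment-first: retry until acked or attempts run out.
        attempts = max_attempts if require_ack else 1
        acked = any(try_deliver(event, n) for n in range(1, attempts + 1))
        (result.delivered if acked else result.dropped).append(event)
    return result


def flaky_consumer(event: str, attempt: int) -> bool:
    # Hypothetical consumer: "a" is acked at once, "b" on the second
    # attempt, and anything else is never acked.
    first_successful_attempt = {"a": 1, "b": 2}
    return first_successful_attempt.get(event, 99) <= attempt


old = publish_all(["a", "b", "c"], flaky_consumer, require_ack=False)
new = publish_all(["a", "b", "c"], flaky_consumer, require_ack=True)
print(old.dropped)  # ['b', 'c']
print(new.dropped)  # ['c']
```

In this toy run, the delivery-first policy loses the event that only needed one retry, while the acknowledgment-first policy loses only the event that can never be delivered.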

What The Numbers Said After

After implementing a new config that prioritized event acknowledgment, we saw a dramatic reduction in dropped events (down 80% in the first week) and a corresponding increase in event delivery success rates (up 25% in the same period). The new config also allowed us to reduce the number of event brokers by 30% while maintaining the same level of performance.

What I Would Do Differently

In hindsight, I would have taken a more structured approach to event configuration from the get-go. I would have started by defining clear event delivery requirements and then worked backwards to design a config that meets those requirements. I would have also invested more time in testing and validation, rather than relying on intuition and guesswork.
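Working backwards from requirements could look something like the sketch below. The field names (`max_drop_rate`, `must_acknowledge`, `drop_on_timeout`, and so on) are illustrative assumptions, not settings from any particular broker; the point is only that a config can be checked against stated requirements before it ships.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeliveryRequirements:
    max_drop_rate: float    # tolerated fraction of lost events, 0.0 to 1.0
    must_acknowledge: bool  # publishers need confirmation of delivery


@dataclass(frozen=True)
class EventConfig:
    require_ack: bool
    max_attempts: int
    drop_on_timeout: bool


def config_violations(req: DeliveryRequirements, cfg: EventConfig) -> List[str]:
    """Return human-readable mismatches between requirements and config."""
    problems = []
    if req.must_acknowledge and not cfg.require_ack:
        problems.append("requirements need acknowledgments, but require_ack is False")
    if req.max_drop_rate <= 0.0 and cfg.drop_on_timeout:
        problems.append("no drops are tolerated, but drop_on_timeout is True")
    if cfg.max_attempts < 1:
        problems.append("max_attempts must be at least 1")
    return problems


requirements = DeliveryRequirements(max_drop_rate=0.0, must_acknowledge=True)
# Hypothetical stand-in for a delivery-first default.
default_like = EventConfig(require_ack=False, max_attempts=1, drop_on_timeout=True)

for problem in config_violations(requirements, default_like):
    print(problem)
```

A check like this can run in CI or at service startup, so that a mismatch shows up as a failed validation rather than as dropped events in production.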

Looking back, the default config conundrum had a lot in common with "security through obscurity": we thought that by throwing enough complexity at the problem, we could hide the underlying issues. But complexity is the enemy of security, and in this case it led to a cascade of problems that took us months to untangle.