The Problem We Were Actually Solving
In 2018, we launched "Treasure Hunt", an online platform for immersive, real-time events. Treasure Hunt featured large-scale multiplayer game modes, live chat, and near-instantaneous updates. Behind the scenes, our event-driven architecture handled tens of thousands of events per second, with each event representing a user action, from joining a game to posting a chat message.
However, we soon discovered that our initial configuration of the Veltrix event processing engine allowed unbounded concurrency to creep into our system. A single misbehaving task could launch a cascade of events, choking our processing pipeline and causing our website to slow to a crawl. The problem was silent – users wouldn't see an error message, but they would definitely notice the response times growing from tens of milliseconds to tens of seconds.
What We Tried First (And Why It Failed)
Initially, we attempted to address this issue by adjusting our task queue timeouts and raising our task execution timeouts. We also enabled retries for failed tasks, hoping that this would "iron out" the rough edges. These changes only masked the issue: longer timeouts let slow tasks linger, retries put more work onto an already busy pipeline, and the root problem of unbounded concurrency remained unaddressed.
To make matters worse, our velocity-based metrics concealed the truth. Our APM dashboards showed healthy task completion rates and per-task response times within acceptable limits. It wasn't until our users began complaining about the site's responsiveness that our engineers were forced to dig deeper. We found that while individual task execution times were within tolerance, task volumes had grown exponentially due to the unbounded concurrency issue, so events spent more and more time waiting before they were processed. Our system was getting slower and slower, but our metrics told us everything was fine.
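To illustrate how execution time can look healthy while users wait longer and longer, here is a minimal sketch of a single-worker FIFO queue. It is a toy model, not our production pipeline: the function name `simulate` and all of the numbers are assumptions chosen for the example.

```python
def simulate(arrival_interval_ms, service_ms, n_events):
    """Toy single-worker FIFO queue (hypothetical model, not the real engine).

    Returns one (execution_ms, end_to_end_ms) pair per event.
    """
    results = []
    worker_free_at = 0  # time at which the worker finishes its current task
    for i in range(n_events):
        arrived = i * arrival_interval_ms
        # An event starts when it has arrived AND the worker is free.
        started = max(arrived, worker_free_at)
        finished = started + service_ms
        worker_free_at = finished
        execution_ms = finished - started   # what a per-task metric reports
        end_to_end_ms = finished - arrived  # what the user experiences
        results.append((execution_ms, end_to_end_ms))
    return results


# Events arrive every 5 ms but each takes 10 ms to process.
results = simulate(arrival_interval_ms=5, service_ms=10, n_events=1000)
print(results[0])   # (10, 10)
print(results[-1])  # (10, 5005)
```

In this model every task still executes in 10 ms, so a metric based only on execution time stays flat, while the end-to-end latency of the last event has grown to about five seconds because of queueing.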
The Architecture Decision
The breakthrough came when we integrated a circuit breaker library and used it to cap the number of concurrent tasks (strictly speaking, a concurrency cap is a bulkhead pattern, while a circuit breaker proper trips on repeated failures). We also implemented a token bucket to limit event admission to 10,000 tasks per minute. Together, these two changes bounded both how many tasks ran at once and how quickly new ones were admitted, which prevented the "silent slowdown" from recurring. We also invested in deeper metrics collection, monitoring both task execution times and event arrival rates. This helped us detect performance degradation before it became visible to users.
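The combination described above can be sketched roughly as follows. This is an illustrative outline rather than the library we used: the class names `TokenBucket` and `BoundedExecutor`, the burst capacity, and the `max_concurrent` value are all assumptions; only the 10,000-per-minute rate comes from our setup.

```python
import threading
import time


class TokenBucket:
    """Admits on average `rate_per_minute` tasks per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last_refill = clock()
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            refill = (now - self.last_refill) * self.rate_per_sec
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


class BoundedExecutor:
    """Rejects work once the concurrency cap or the rate limit is reached."""

    def __init__(self, bucket, max_concurrent):
        self.bucket = bucket
        self.slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, task):
        """Run `task` in the caller's thread; return (accepted, result)."""
        if not self.slots.acquire(blocking=False):
            return False, None  # concurrency cap reached
        try:
            if not self.bucket.try_acquire():
                return False, None  # rate limit reached
            return True, task()
        finally:
            self.slots.release()


# Hypothetical configuration: 10,000 tasks/minute, burst of 500, 200 in flight.
executor = BoundedExecutor(TokenBucket(10_000, capacity=500), max_concurrent=200)
accepted, result = executor.submit(lambda: "processed")
```

Rejecting rather than blocking when a limit is hit is a design choice made for this sketch; it turns overload into an explicit, countable signal instead of a growing hidden queue. A real deployment might instead queue with a bounded wait or shed load selectively.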
What The Numbers Said After
The results were striking. Average event processing latency dropped by 30%, from 50ms to 35ms. Our APM dashboards showed shorter task queues and fewer dropped tasks. But the most telling metric was the reduction in user complaints about site responsiveness: a drop of about 95% within a week of deploying the circuit breaker and token bucket. Our customers loved the site, but our system had been failing them in subtle but significant ways.
What I Would Do Differently
While the circuit breaker and token bucket combination fixed the problem, I would now tackle this issue differently. I'd invest more time upfront in understanding how events interact with each other and with our system. I'd design a more robust event conflict resolution strategy to prevent issues like event duplication and inconsistent state. And I'd push for a more explicit, task-execution-time-based SLA for our event processing engine, forcing us to address performance from the outset. With this focus, we could have caught this problem much earlier and avoided the silent slowdown of unbounded concurrency altogether.