The problem we were actually solving was that our Treasure Hunt Engine was producing an unacceptable number of deserialization errors on the front end of our application. At first, it looked like a simple case of mishandling our eventual consistency model. We used Apache Kafka and Apache Cassandra together for event sourcing, which meant there was a delay between an event being written and the data being reflected in the UI. During that delay the application was loading stale data, and that stale data was what triggered the deserialization errors.
What we tried first, and why it failed: we simply increased the number of Cassandra nodes to reduce the delay. This was the classic 'throw hardware at the problem' approach, a common trap that I fell into myself. I thought more nodes would reduce latency and make our eventual consistency model work. Instead, the additional nodes increased our operational overhead and made the cluster harder to manage, and they did nothing to solve the core issue: our eventual consistency model was still causing deserialization errors.
The architecture decision that ultimately fixed this problem was to implement a Read Repair Cache layer using Redis. This cache layer stored a copy of the most current data from our Cassandra cluster, effectively creating a 'read view' of the data that was consistent with our business logic. The tradeoff was an additional layer of complexity in our system, but in exchange, reads served from the cache reflected our latest writes. That was the crucial piece of the puzzle, because it prevented the deserialization errors from occurring in the first place.
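To make the idea concrete, here is a minimal sketch of a write-through read view, under some loud assumptions. It is not our production code: the class names `LaggingStore` and `ReadViewCache` are hypothetical, a plain dict stands in for Redis, and a toy store with an explicit `flush()` stands in for the replication delay in Cassandra. It only illustrates why a read that goes through the cache can see a write before the backing store does.

```python
from typing import Dict, Optional


class LaggingStore:
    """Toy stand-in for an eventually consistent store (hypothetical).

    Writes stay invisible to readers until flush() is called, which
    simulates the propagation delay described above.
    """

    def __init__(self) -> None:
        self._visible: Dict[str, str] = {}
        self._pending: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._pending[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._visible.get(key)

    def flush(self) -> None:
        # Make all pending writes visible to readers.
        self._visible.update(self._pending)
        self._pending.clear()


class ReadViewCache:
    """Write-through cache in front of the store (hypothetical names).

    `cache` is a plain dict here; in a real system it would be a Redis client.
    """

    def __init__(self, cache: Dict[str, str], store: LaggingStore) -> None:
        self.cache = cache
        self.store = store

    def write(self, key: str, value: str) -> None:
        # Write to the store and the cache together, so the cache
        # always holds the latest value written through this layer.
        self.store.put(key, value)
        self.cache[key] = value

    def read(self, key: str) -> Optional[str]:
        # Serve from the cache when possible.
        value = self.cache.get(key)
        if value is not None:
            return value
        # On a miss, fall back to the store and populate the cache.
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

In this sketch, a `write()` followed immediately by a `read()` returns the new value even though `LaggingStore.get()` would still return nothing. Note that a cache miss still falls back to the store, so this pattern only helps for keys written through the cache layer.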
What the numbers said afterward: our deserialization error rate dropped from 3.2% to 0.05% after we implemented the Read Repair Cache layer, a 64-fold reduction. This was a huge victory for our team, because it removed a major source of user frustration and improved our overall user experience. Our metrics also showed that the additional latency introduced by the cache layer was negligible, and our average response time remained within our SLA.
What I would do differently is take a more gradual approach to rolling out this change. I would have deployed the Read Repair Cache layer incrementally, monitoring our metrics and adjusting the configuration as needed. I also would have considered introducing it earlier in the life of the system, before the deserialization errors ever reached users. Looking back, I can see that we were so focused on treating the symptom (by adding Cassandra nodes) that we neglected the root cause (how our reads interacted with our eventual consistency model). This was a costly mistake, but one that I learned from and will never make again.
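One way to do that kind of incremental deployment is a deterministic percentage rollout. The sketch below is an illustration rather than something we ran: the function name `in_rollout`, the use of a user ID as the bucketing key, and the percentage steps are all assumptions.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets.

    Hashing the ID keeps each user in the same bucket across requests,
    so raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

A request handler could read through the cache layer only when `in_rollout(user_id, percent)` is true, with `percent` raised in steps (for example 1, 10, 50, 100) while watching the deserialization error rate and response time at each step.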