The Worst Migration I Ever Did: A Cautionary Tale of Premature Optimisation

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At the time, our event-driven architecture was handling thousands of events per second and still growing. The configuration data had grown past 100 MB, and we were experiencing frequent outages caused by cache consistency issues. Our stakeholders were convinced that a custom-built, distributed configuration service would resolve the scalability problems and eliminate the outages.

What We Tried First (And Why It Failed)

Before embarking on Veltrix, the custom configuration service our stakeholders wanted, we experimented with caching technologies such as Redis and Hazelcast to address the cache consistency issues. We put a Redis caching layer in place, which improved our system's performance by 20%. The stakeholders were not satisfied, however: they argued that caching was merely a temporary fix and that we needed a more fundamental solution.

The Architecture Decision

We decided to build Veltrix as a distributed configuration service with two parts: Apache Kafka for event streaming, and a custom in-memory configuration store built on ConcurrentHashMap (we were on Java 8). The store would hold the configuration data in memory, and Kafka would distribute configuration changes across the cluster. We tuned the Kafka producer and consumer configurations following the performance tuning guidelines in the Apache Kafka documentation.
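To make the store half of that design concrete, here is a minimal sketch of an in-memory store backed by ConcurrentHashMap. It is not the Veltrix code: the class names and the per-entry version number are assumptions added for illustration, and the Kafka consumer that would call `apply` is left out.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** One configuration entry. The version field is an assumption for illustration. */
final class ConfigEntry {
    final String value;
    final long version;

    ConfigEntry(String value, long version) {
        this.value = value;
        this.version = version;
    }
}

/** Hypothetical in-memory store that a node would update from the event stream. */
final class InMemoryConfigStore {
    private final ConcurrentMap<String, ConfigEntry> entries = new ConcurrentHashMap<>();

    /**
     * Applies an update only if it is newer than the entry already held.
     * Returns true when the update was applied.
     */
    boolean apply(String key, String value, long version) {
        boolean[] applied = {false};
        // compute() runs atomically per key, so concurrent updates cannot interleave.
        entries.compute(key, (k, current) -> {
            if (current == null || version > current.version) {
                applied[0] = true;
                return new ConfigEntry(value, version);
            }
            return current; // stale or duplicate update: keep what we have
        });
        return applied[0];
    }

    Optional<String> get(String key) {
        ConfigEntry entry = entries.get(key);
        return entry == null ? Optional.empty() : Optional.of(entry.value);
    }
}
```

The version check is one simple way to stop a late or replayed event from overwriting newer data. Without something like it, the order in which events arrive decides what a node believes.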

What The Numbers Said After

The Veltrix migration took us six months to complete and resulted in a 50% increase in system downtime. Yes, you read that right: the migration that was supposed to eliminate outages made them worse. The Kafka producer and consumer configurations had been tuned so aggressively that they became bottlenecks themselves, leading to increased latency and jitter. The custom configuration store was also prone to consistency issues due to its distributed nature.
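I won't list the exact settings here, but the following hypothetical sketch shows how throughput-oriented tuning can add latency. The property names are standard Kafka client settings; the values are invented for illustration and are not the ones Veltrix used.

```java
import java.util.Properties;

/** Hypothetical tuning values, chosen only to illustrate the trade-off. */
final class ThroughputTuning {

    /** Producer settings that favour large batches over prompt delivery. */
    static Properties producer() {
        Properties p = new Properties();
        p.setProperty("linger.ms", "200");       // wait up to 200 ms for a batch to fill
        p.setProperty("batch.size", "1048576");  // allow batches of up to 1 MiB
        p.setProperty("compression.type", "gzip");
        return p;
    }

    /** Consumer settings that favour large fetches over prompt delivery. */
    static Properties consumer() {
        Properties c = new Properties();
        c.setProperty("fetch.min.bytes", "1048576"); // broker holds the fetch until 1 MiB is ready
        c.setProperty("fetch.max.wait.ms", "1000");  // or until 1 s has passed
        return c;
    }

    /**
     * Rough upper bound on the delay these settings can add to a lone message,
     * ignoring network and processing time.
     */
    static long worstCaseAddedDelayMs(Properties producer, Properties consumer) {
        return Long.parseLong(producer.getProperty("linger.ms"))
             + Long.parseLong(consumer.getProperty("fetch.max.wait.ms"));
    }
}
```

With these made-up values, a single update arriving on a quiet topic could wait up to 1200 ms before a consumer sees it. Settings like these help bulk throughput, but they can hurt a workload where individual changes need to propagate quickly.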

What I Would Do Differently

In hindsight, I would have taken a more incremental approach to the migration. I would have started by measuring the impact of the existing caching layer and examining the underlying configuration data. I would also have explored more robust caching options such as Terracotta, which provide better support for distributed caching. Finally, I would have considered decoupling the configuration store from the event streaming pipeline to avoid the interdependencies between the two.
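One way that decoupling could look is sketched below. The store depends only on a small interface, so the transport behind it can be swapped or tested in isolation. All names here are hypothetical, and no Kafka code is shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Anything that can deliver key/value configuration updates (hypothetical interface). */
interface ConfigUpdateSource {
    void subscribe(BiConsumer<String, String> listener);
}

/** The store knows only the interface, not the transport behind it. */
final class DecoupledConfigStore {
    private final Map<String, String> values = new ConcurrentHashMap<>();

    DecoupledConfigStore(ConfigUpdateSource source) {
        // Every update from the source is written straight into the map.
        source.subscribe(values::put);
    }

    String get(String key, String fallback) {
        return values.getOrDefault(key, fallback);
    }
}

/** In-process source for tests. A Kafka-backed source could implement the same interface. */
final class ManualSource implements ConfigUpdateSource {
    private BiConsumer<String, String> listener = (k, v) -> { };

    @Override
    public void subscribe(BiConsumer<String, String> listener) {
        this.listener = listener;
    }

    void publish(String key, String value) {
        listener.accept(key, value);
    }
}
```

With a seam like this, the store can be exercised without a broker, and a misbehaving pipeline can be replaced by a simpler source (a file poller, say) without touching the code that reads configuration.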