The Great Veltrix Event Debacle: Why Docs Are Not Enough

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

As a platform engineer at Veltrix, I had the... privilege of designing an event-driven system that would eventually burn a hole through my sleep schedule. The stated goal was a robust notification pipeline for our users, integrated with a range of APIs and services. We believed we were solving a real-world problem. In hindsight, the real objective was to ship something quickly with as few code changes as possible, and that is what produced a fragile mess.

What We Tried First (And Why It Failed)

Our initial setup was based on the default configuration of Apache Kafka. Kafka is a great tool, don't get me wrong. We paired it with a simple AWS Lambda function as the message processor. Sounds straightforward, right? The problem was that the defaults left three things unaddressed: event serialization, exactly-once delivery, and, most notably, a partitioning strategy. We assumed we were doing what the documentation said to do. Our first production deployment resulted in a never-ending stream of duplicate notifications, with the Lambda function producing events at an exponential rate.

The Architecture Decision

After burning the better part of a week on troubleshooting, we realized that our system design was an invitation to disaster. We made three critical changes. First, we introduced Avro serialization for event payloads, which gave us a compact binary format and controlled schema evolution. Second, we enabled exactly-once delivery using Kafka transactions, which are configured on the producer through a transactional ID. Third, we implemented a dynamic partitioning strategy keyed on a combination of event type and source. We also moved the message processor off Lambda to a Node.js service running on Amazon Elastic Container Service (ECS), for more control over containerization and orchestration.
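To make the third change concrete, here is a minimal Python sketch of keying events by type and source. It is an illustration under assumptions, not the production partitioner: the function names `partition_key` and `choose_partition`, the `:` separator, and the SHA-256 hash are all hypothetical choices, and the mapping is a fixed hash rather than anything dynamic. Kafka's default partitioner hashes the record key itself, so in practice supplying the composite key is often all that is needed.

```python
import hashlib


def partition_key(event_type: str, source: str) -> bytes:
    """Build a composite key so events of one type from one source share a partition.

    The "type:source" layout is an assumption made for this sketch.
    """
    return f"{event_type}:{source}".encode("utf-8")


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash.

    SHA-256 is used only because it ships with the standard library and is
    deterministic across processes; it is not the hash Kafka uses internally.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then wrap to the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Example: the same (type, source) pair always lands on the same partition.
key = partition_key("order.shipped", "billing-api")
partition = choose_partition(key, num_partitions=12)
```

Because the key is derived only from event type and source, ordering is preserved per pair, at the cost of possible hot partitions if one pair dominates traffic.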

What The Numbers Said After

The impact was almost immediate. With the new configuration, our event delivery latency dropped from 30 seconds to under 5 seconds, and duplicate notifications plummeted from thousands to nearly zero. Event throughput increased by 25%, which we attribute to the improved partitioning, and error rates decreased by 40%. These changes allowed us to scale the system confidently to meet increasing user demand.

What I Would Do Differently

Looking back, I would have spent more time on upfront design and prototyping rather than taking shortcuts. If I were to redo the project, I would also invest more in automating our testing and deployment pipelines, to reduce the likelihood of human error in high-pressure situations. Our team would have benefited from more robust unit tests, as well as end-to-end integration tests, to verify that the system worked as expected before deployment.
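As an example of the kind of unit test that might have caught the duplicate-notification problem before production, here is a small sketch using Python's standard `unittest` module. Everything in it is hypothetical: `NotificationHandler`, the `id` and `message` fields, and the in-memory set of seen IDs are stand-ins chosen to keep the test self-contained. The in-memory deduplication is not the fix described above (which relied on Kafka transactions); it only gives the test something to exercise.

```python
import unittest


class NotificationHandler:
    """Hypothetical handler that sends at most one notification per event ID."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.sent: list[str] = []

    def handle(self, event: dict) -> bool:
        """Record the notification and return True, or return False for a redelivery."""
        event_id = event["id"]
        if event_id in self.seen_ids:
            return False
        self.seen_ids.add(event_id)
        self.sent.append(event["message"])
        return True


class DuplicateDeliveryTest(unittest.TestCase):
    def test_redelivered_event_is_sent_once(self) -> None:
        handler = NotificationHandler()
        event = {"id": "evt-1", "message": "Your order shipped"}

        # Simulate the broker delivering the same event twice.
        self.assertTrue(handler.handle(event))
        self.assertFalse(handler.handle(event))

        self.assertEqual(handler.sent, ["Your order shipped"])
```

Saved in a test module, this can be run with `python -m unittest`. An end-to-end version would replace the in-memory handler with the real consumer wired to a test broker.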