Five Mistakes We Made Configuring Enterprise Event Logging

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

We needed a scalable event logging system that wouldn't overwhelm our servers. The existing pipeline, Apache Kafka feeding AWS S3, was inadequate. End-to-end latency was around 10 minutes, and costs were climbing because of high S3 storage and transfer fees. Debugging queries was close to impossible, because each one took an average of 15 minutes to execute. Query cost averaged $15 per second, so we had a huge bill every month.

What We Tried First (And Why It Failed)

We first tried sharding the Kafka cluster to reduce latency. We split it into four clusters and spread them across multiple zones. Latency improved by 2-3 seconds, but overall performance barely changed. The problem was network latency between zones, which was around 10-20 milliseconds per hop, and each request crossed 5-7 hops. Our Kafka producers and consumers spent an inordinate amount of time waiting for each other to respond, which turned into a throughput bottleneck.
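To put the hop figures in perspective, here is a quick back-of-the-envelope calculation. It assumes the hops are traversed one after another and that their latencies simply add up, which is a simplification.

```python
# Per-hop latency range and hop count are the figures quoted above.
PER_HOP_MS = (10, 20)
HOPS = (5, 7)


def path_latency_ms(per_hop_ms, hops):
    """Return (best, worst) network latency in milliseconds for one request path."""
    best = min(per_hop_ms) * min(hops)
    worst = max(per_hop_ms) * max(hops)
    return best, worst


best_ms, worst_ms = path_latency_ms(PER_HOP_MS, HOPS)
print(f"cross-zone overhead per request path: {best_ms}-{worst_ms} ms")
```

That works out to 50-140 ms per path. If a producer waits for an acknowledgement before sending the next message, it pays that overhead on every send, which would fit the waiting described above.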

Next, we tried to reduce the number of queries by batching events. We set the batch size to 1000 events, and latency dropped to 4-5 seconds. However, the cost didn't go down as much as we expected, because our users were still running expensive queries. Our average query cost only fell from $15 to $10 per second, which still added up to a massive bill every month.
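The batching step can be sketched roughly like this. It is a minimal illustration, not our production code: `EventBatcher` and the `flush` callback are hypothetical names, and the only number taken from the text is the batch size of 1000.

```python
from typing import Callable, List


class EventBatcher:
    """Buffers events and hands them to `flush` in groups of `batch_size`."""

    def __init__(self, flush: Callable[[List[dict]], None], batch_size: int = 1000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush = flush
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def add(self, event: dict) -> None:
        """Buffer one event and flush as soon as the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown as well."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._flush(batch)


# Usage: `sent` stands in for a real sink such as a producer call.
sent: List[List[dict]] = []
batcher = EventBatcher(flush=sent.append, batch_size=1000)
for i in range(2500):
    batcher.add({"id": i})
batcher.flush()  # push the final partial batch
print([len(b) for b in sent])  # [1000, 1000, 500]
```

A size-only trigger like this one holds events back when traffic is slow, so a real pipeline would probably pair it with a time-based flush.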

The Architecture Decision

We decided to switch to a real-time event logging system using AWS Kinesis. We created event streams for each application, and we used AWS Glue to create a data warehouse. We also implemented a simple caching layer using Redis to reduce the number of queries. We set a freshness SLA of 5 minutes for all queries. The Kinesis streams had 1-second latency and a very low cost of $0.005 per event.
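A caching layer along these lines could look like the sketch below. The key scheme, `cached_query`, and the JSON serialization are assumptions for illustration. The cache only needs `get` and `setex`, which a redis-py client provides, and `InMemoryCache` is a stand-in so the example runs without a Redis server. The TTL matches the 5-minute freshness SLA.

```python
import hashlib
import json
import time
from typing import Any, Callable

FRESHNESS_SLA_SECONDS = 5 * 60  # the 5-minute freshness SLA


class InMemoryCache:
    """Stand-in exposing the get/setex subset of a Redis client."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, name: str):
        entry = self._data.get(name)
        if entry is None or entry[1] <= time.monotonic():
            self._data.pop(name, None)  # drop expired entries
            return None
        return entry[0]

    def setex(self, name: str, ttl: int, value: str) -> None:
        self._data[name] = (value, time.monotonic() + ttl)


def cached_query(cache, sql: str, run_query: Callable[[str], Any],
                 ttl: int = FRESHNESS_SLA_SECONDS) -> Any:
    """Return a cached result if one is fresh, otherwise run the query and cache it."""
    key = "query:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query(sql)
    cache.setex(key, ttl, json.dumps(result))
    return result


# Usage: `fake_warehouse` is a hypothetical query runner that counts its calls.
calls = []


def fake_warehouse(sql: str) -> dict:
    calls.append(sql)
    return {"rows": 3}


cache = InMemoryCache()
first = cached_query(cache, "SELECT count(*) FROM events", fake_warehouse)
second = cached_query(cache, "SELECT count(*) FROM events", fake_warehouse)
print(first == second, len(calls))  # True 1
```

Hashing the query text keeps keys short and uniform, at the price of treating two queries that differ only in whitespace as different entries.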

The caching layer helped us reduce the number of queries by 80%. Our average query cost went down to $0.05 per second, which reduced our bill by 93%. We also implemented a query retry mechanism to reduce the number of failed queries.
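The post does not spell out how the retry mechanism worked, so the following is only one common shape for it: exponential backoff with jitter. `TransientQueryError`, `run_with_retry`, and the default attempt count and delays are all hypothetical choices.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientQueryError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, throttling)."""


def run_with_retry(query: Callable[[], T], max_attempts: int = 3,
                   base_delay: float = 0.5,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `query`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return query()
        except TransientQueryError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries
    raise AssertionError("unreachable")


# Usage: a query that fails twice and then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientQueryError("throttled")
    return "ok"


print(run_with_retry(flaky, sleep=lambda _: None))  # ok
```

Only errors marked as transient are retried here. Retrying a query that fails because of bad input would just spend more money on the same failure.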

What The Numbers Said After

After the implementation, our pipeline latency dropped from 10 minutes to 1 second. Our average query cost fell from the post-batching $10 per second to $0.05 per second, saving us $350,000 per month. We also reduced our storage cost by 90% by using Kinesis and S3. Our overall monthly cost fell by 95%.

What I Would Do Differently

If I had to do it over again, I would implement more robust data quality checks at the ingestion boundary. We had errors in our event streams that we couldn't catch until it was too late. Our error rate was around 1% across all streams, which is a significant number. I would add more automated checks to catch errors like wrong data types, incorrect date formats, and missing required fields. This would have reduced the number of failed queries and saved us more money in the long run.
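As a rough sketch of what such a check could look like, the function below tests for the three error classes mentioned above. The schema in `REQUIRED_FIELDS` is hypothetical, since the real event fields are not shown here, and the timestamp is assumed to be an ISO 8601 string.

```python
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {
    "event_id": str,
    "app": str,
    "timestamp": str,  # assumed to be ISO 8601
    "payload": dict,
}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event is accepted."""
    errors: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected):
            actual = type(event[field]).__name__
            errors.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    ts = event.get("timestamp")
    if isinstance(ts, str):
        try:
            datetime.fromisoformat(ts)
        except ValueError:
            errors.append(f"incorrect date format: {ts!r}")
    return errors


# Usage: one clean event and one with all three kinds of error.
good = {
    "event_id": "e-1",
    "app": "billing",
    "timestamp": "2024-01-15T10:30:00+00:00",
    "payload": {},
}
bad = {"event_id": 42, "app": "billing", "timestamp": "15/01/2024"}

print(validate_event(good))  # []
for problem in validate_event(bad):
    print(problem)
```

Returning every problem at once, rather than stopping at the first, makes it easier to see which producers are sending malformed events and why.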