Category: Events

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

Behind the scenes, this application relies on a complex interplay of Apache Kafka topics, Redshift views, and Presto queries to fetch user information, generate quest narratives, and push notifications. As users began to interact with the system, performance issues began to surface: query costs were piling up, and we were hitting Redshift's maximum concurrent query limit. We were also unable to meet the 2-minute query latency SLA for our most-used quest types. The root cause of these issues lay in the configuration decisions we made around Apache Kafka partitioning and Presto query optimization.

What We Tried First (And Why It Failed)

Initially, we designed the Treasure Hunt Engine to partition its Kafka topics by user ID. This approach seemed reasonable at first glance, as it would group related events together and enable more efficient aggregation in Presto. However, we soon discovered that this design led to hot partitions – those topics that receive a disproportionate share of writes – which caused uneven Kafka cluster utilization and subsequent latency issues. To make matters worse, our initial Presto query optimization strategy focused on reducing the number of queries executed by caching intermediate results. However, this approach only exacerbated the hot partition problem, as it encouraged the system to produce more queries overall.

The Architecture Decision

After debugging the system and analyzing our logs, we realized that our configuration decisions were driving the performance issues. We decided to pivot and partition our Kafka topics by quest type instead. This change evened out the writes across the cluster and reduced the likelihood of hot partitions. We also shifted our Presto query optimization strategy to focus on reducing the cardinality of our queries. By rewriting our queries to use more efficient aggregation functions and indexing, we were able to reduce query costs and meet our latency SLAs. In addition, we implemented row-level caching on our intermediate Redshift views to reduce the load on our Presto cluster.

What The Numbers Said After

After deploying the new configuration, we saw a 30% reduction in query costs and a 25% decrease in query latency. Our Kafka cluster utilization normalized, and we were able to meet our 2-minute query latency SLA for all quest types. We also noticed a reduction in our Redshift cluster's active sessions, which translated to cost savings.

What I Would Do Differently

In retrospect, I would have prioritized the partitioning decision from the outset, as it would have significantly reduced the likelihood of hot partitions and subsequent performance issues. I would also have advocated for a more incremental deployment strategy, where we would have iteratively tested and refined our Presto query optimization strategy between deployments. This approach would have allowed us to identify and address potential problems more quickly, rather than relying on a large-scale redeployment to fix the system.

DEV Community