Treasure Map Architectures: When "Default Config" Becomes a Luxury Item

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

Our platform was designed to stream events from producers to consumers in under 10ms, with 99th percentile latency goals ranging from 50ms to 2 seconds. We aimed to store and query events globally, and we expected to handle 5 million events per second with a modest team. However, when we first deployed our platform, it stalled at about 100k events per second, crippling performance and making it nearly impossible to meet our latency SLAs.

What We Tried First (And Why It Failed)

We initially applied a traditional sharding strategy, splitting our event stream into multiple shards based on timestamp. This approach didn't work as expected for a real-time stream because it led to an uneven distribution of data across shards, significantly impacting query performance and making it challenging to maintain data freshness. Furthermore, when we tried to scale our event processing infrastructure to handle increased loads, our shard rebalancing strategy caused performance bottlenecks and cascading failures.

The Architecture Decision

After several redesigns and re-architecting exercises, we landed on a hierarchical event map that allows for efficient event routing and better data distribution. This decision leveraged hierarchical routing to divide our event stream into "treasure map"-style layers, each representing a specific time window or type of event. This change allowed us to effectively load balance both event producers and consumers, while also ensuring that query performance remained consistent across all layers. We used Apache Kafka as our message broker and Apache Cassandra as our time series database to handle the hierarchical event routing and storage needs.

What The Numbers Said After

After implementing our new hierarchical routing architecture, our event processing pipeline achieved an impressive latency of around 150ms at 5 million events per second. We reduced our query cost by 75% and maintained 99th percentile latency at under 1 second across the entire 5-second time window. Our data freshness SLAs were consistently met with a median data age of less than 100ms. We could finally confidently ship our platform to clients with production-ready event streaming capabilities.

What I Would Do Differently

I would have liked to see an additional caching layer in front of our database to accelerate query performance. While our hierarchical routing architecture solved the scalability issues, it introduced a noticeable delay in the data ingestion pipeline, especially when events were generated rapidly. This impact made it essential to maintain data freshness through a solid caching strategy. With hindsight, we could have improved our initial event ingestion strategy to accommodate a cache, making our eventual solution even more efficient. The key takeaway for us was that a "default config" can quickly turn into a luxury item in high-scale event processing systems.