The Myth of the Perfect Event Processing System for Treasure Hunts

#webdev #programming #ai #machinelearning

The Problem We Were Actually Solving

At first glance, the system seemed straightforward - ingest data from a mix of GPS coordinates, user inputs, and sensor readings, and dispatch updates to the mobile app in real-time. However, we soon realized that the complexity lay in handling the millions of events pouring in from mobile devices, each with their own time-to-live (TTL) and priorities. The system had to balance performance with consistency, and ensure that every user received the correct set of updates, even when their device was offline or behind a firewall.

What We Tried First (And Why It Failed)

Our initial approach involved using Redis Pub/Sub to fan out events to workers, which then pushed updates to the mobile app via WebSockets. We chose this setup because it allowed for efficient in-memory processing and guaranteed delivery of messages. However, we soon discovered that the sheer volume of events was causing contention at the Redis Pub/Sub channel, leading to event duplication, lost updates, and frustrating delays for the end users. It turned out that our Redis instance, while designed for high performance, was not equipped to handle the extreme spikes in traffic and concurrent reads.

The Architecture Decision

We decided to pivot to a message-driven architecture, leveraging Apache Kafka as the central event hub. By doing so, we decoupled the event producers from the consumers and eliminated the risk of single-point failures. We also set up a multi-broker Kafka cluster with multiple partitions for each topic, ensuring that the system could scale horizontally and handle the heavy loads without any significant bottlenecks. To further improve latency and reduce the load on our infrastructure, we implemented a caching layer using a commercial Redis instance with write-through cache enabled.

What The Numbers Said After

The Kafka-based system proved to be a game-changer. Event latency plummeted from an average of 200 milliseconds to under 50 milliseconds, while throughput increased by a factor of five. Our Redis caching layer further reduced latency to just 10-15 milliseconds for the most frequent queries. The end result was a seamless, intuitive experience for the treasure hunt participants, with precise and timely updates reflected on their mobile devices.

What I Would Do Differently

In retrospect, I would have recommended a more careful sizing of the Redis instance from the start. Perhaps with a more aggressive configuration for connection pooling, load balancing, and client-side queuing, we could have mitigated the issues with contention and event duplication. Nevertheless, our experience taught us the importance of prioritizing event processing reliability and performance in high-stakes systems, especially those exposed to unpredictable mobile traffic and concurrent reads. By taking a structured approach to the design and implementation, we managed to deliver a world-class treasure hunt experience that exceeded client expectations.