Why Our Treasure Hunt Engine Kept Exploding and How We Fixed It

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

It wasn't about the treasure hunt. It was about proving that we could deliver a high-concurrency experience on mobile devices without the AWS bill exploding or the SRE team revolting. The feature was simple: a user walks near a beacon, their phone sends a location event, and if they're within 2 meters of the correct spot, they get the next clue. We promised sub-second freshness from event ingestion to answer delivery. The catch: we had to scale from 10 events/second to 2,000 events/second in under 5 seconds when a clue went live.

Our initial SLA was simple:

End-to-end latency < 1s
Freshness < 2s
Cost per 1,000 events < $0.05
No data loss during beacon storms

We missed all of them during the first test run in staging with 500 simulated phones. The Redis cluster became a bottleneck because we were using it both as a message queue and as a state store. At 400 messages/second, the pub/sub fanout cost 28ms per message just to send to 50 subscribers. When 300 users triggered simultaneous state reads, the GET requests queued behind WATCH/MULTI operations and spiked to 1.4 seconds. That was the day we learned that Redis pub/sub is not a message queue; it's an unreliable firehose.

What We Tried First (And Why It Failed)

We tried Kafka first. It seemed perfect: durable, scalable, and we already ran it for analytics. But we didn't want to pay for Kafka Streams or ksqlDB, so we bolted on a Python consumer that wrote to Redis using Lua scripts. We used topic compaction to keep only the latest user location per UUID, with a 60-second retention. That was a mistake.

The first failure mode was Redis memory pressure. Each location event was 320 bytes, and with 12,000 users beaconing every 2 seconds, we were writing 1.92 MB/s. But compaction didn't work because we used user_id as the key, and each new location from the same phone overwrote the previous one. So when a phone went offline for 30 seconds and reconnected, its location was stale, and the consumer replayed old events. Redis memory usage shot from 4 GB to 12 GB in 20 minutes. Our Redis cost went from $320/month to $1,100/month.
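The 1.92 MB/s figure follows directly from the numbers in that paragraph. A quick back-of-the-envelope check, using only those three inputs (the function name is just for illustration):

```python
# Inputs quoted in the paragraph above.
EVENT_BYTES = 320          # size of one location event
USERS = 12_000             # phones beaconing concurrently
BEACON_INTERVAL_S = 2      # each phone sends one event every 2 seconds


def write_rate_bytes_per_s(users: int, interval_s: float, event_bytes: int) -> float:
    """Sustained write rate in bytes per second for periodic beaconing."""
    events_per_s = users / interval_s
    return events_per_s * event_bytes


rate = write_rate_bytes_per_s(USERS, BEACON_INTERVAL_S, EVENT_BYTES)
print(f"{USERS / BEACON_INTERVAL_S:.0f} events/s, {rate / 1e6:.2f} MB/s")
# prints "6000 events/s, 1.92 MB/s"
```

Note that 6,000 events/s is already three times the 2,000 events/s burst the feature was sized for, which is worth keeping in mind when reading the memory numbers.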

The second failure was in the consumer logic. We used a single Python process consuming a compacted topic. During the load test, the consumer lag grew to 45 seconds. Why? Because compaction meant the consumer had to read every message between the last offset it processed and the latest compacted offset for each user. With 12,000 users and 60-second retention, that meant scanning 60 * 12,000 = 720,000 messages just to find the latest location for a phone that had reconnected. The consumer CPU went to 100% and the lag grew linearly. We blew through our cost SLA by 3x.

The third failure was data quality at the ingestion boundary. Phones would send location events with timestamps 10 seconds in the future. Our consumer didn't validate them, so it wrote them to Redis. When we later tried to compute beacon proximity with a window function, we got false positives: a user appeared near the clue 10 seconds before they actually were. That led to 8% incorrect clue awards in staging. Users got frustrated. The ops team got paged.

The Architecture Decision

After the second rebuild exploded, we stopped trying to make Redis do things it wasn't designed for. We chose Apache Pulsar with tiered storage enabled, and moved the state store to PostgreSQL with the TimescaleDB extension for time-series location data. Here's why:

Pulsar: native tiered storage meant we didn't have to pay for long-term Kafka storage just to support compaction replay. With tiered storage, old events are offloaded to S3, and the broker serves only the last 24 hours. This cut our storage cost by 60%.
PostgreSQL + TimescaleDB: we used a time-partitioned table with a retention policy of 7 days. Each message wrote a row with (user_id, beacon_id, ts, lat, lng, accuracy). We added a BRIN index on ts for fast time-range queries. Query latency for latest location per user was 12ms at 2,000 reads/second.
Consumer: we switched to a Go consumer using the Pulsar Go client. It used a shared subscription with 8 workers, each processing a partition. We added a 2-second debounce window on the client side so phones wouldn't send duplicate events during poor network conditions. We also added a sanity check: if the event timestamp was more than 30 seconds in the future or in the past, we rejected it with a 400 error. That dropped our incorrect clue awards to 0.1%.
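The timestamp sanity check is the easiest of these to show in code. Our production consumer is in Go; the sketch below is in Python purely for illustration, and the function name and the `now` parameter (there to make it testable) are made up for this post. Only the 30-second tolerance comes from the real system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Tolerance described above: reject anything more than 30 s off the server clock.
MAX_SKEW = timedelta(seconds=30)


def is_timestamp_plausible(event_ts: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if event_ts is within MAX_SKEW of `now`, in either direction."""
    if now is None:
        now = datetime.now(timezone.utc)
    # abs() on a timedelta handles both future and past timestamps.
    return abs(now - event_ts) <= MAX_SKEW


server_now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_timestamp_plausible(server_now + timedelta(seconds=45), server_now))  # False
print(is_timestamp_plausible(server_now - timedelta(seconds=5), server_now))   # True
```

An ingestion endpoint would return the 400 when this function returns False. Both timestamps need to be timezone-aware for the subtraction to work.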

We also introduced a real-time proximity service. Instead of querying Redis for every user location during the hunt, we precomputed proximity at ingestion time. The service consumed Pulsar, joined location events with static beacon coordinates using PostGIS, and published a stream of eligible users per clue. That reduced the number of Redis reads from 2,000/s to 0 during clue evaluation; Redis only handled the final answer delivery.
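To make the idea of precomputing proximity concrete, here is a minimal sketch of the eligibility test. It is not our PostGIS query: it swaps in a plain haversine distance in Python, and the beacon table, the coordinates, and the function names are hypothetical. Only the 2-meter radius comes from the feature itself.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
PROXIMITY_M = 2.0           # eligibility radius from the feature description

# Hypothetical static beacon table: beacon_id -> (lat, lng) in degrees.
BEACONS = {"clue-1": (52.520000, 13.405000)}


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def eligible_beacons(lat: float, lng: float) -> list:
    """Return the ids of all beacons within PROXIMITY_M of the given point."""
    return [
        beacon_id
        for beacon_id, (blat, blng) in BEACONS.items()
        if haversine_m(lat, lng, blat, blng) <= PROXIMITY_M
    ]


# About 1.1 m north of the beacon: eligible. About 11 m north: not eligible.
print(eligible_beacons(52.520010, 13.405000))
print(eligible_beacons(52.520100, 13.405000))
```

Running this once per incoming event and publishing the result is what lets clue evaluation skip per-user location lookups. A real deployment would also have to account for the accuracy field on each event, which this sketch ignores.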

For the final system:

End-to-end latency: 450ms from beacon to clue delivery (measured with 2,000 concurrent users)
Freshness: 1.2s (P99 lag from beacon to eligibility stream)
Cost per 1,000 events: $0.035 (
