The Day Our Treasure Hunt Engine Blew Up at 3 AM (And How We Rebuilt It Right)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our event platform at Veltrix ran a treasure hunt game that gave users real-world rewards. It started as a simple Rails app with a PostgreSQL counter column for each hunt. By 3 AM on Black Friday, that counter column had become a single point of failure. Every leaderboard update took a row-level lock on the same hot counter row, so concurrent writers serialized behind one another and leaderboard requests backed up behind them. Our error rate jumped from 0.2% to 18% under 2000 concurrent writes. The system didn't just slow down; it started failing writes with "could not serialize access due to concurrent update" serialization failures. We lost $47K in rewards payouts before we could scale up the database.

What We Tried First (And Why It Failed)

Our first fix was to shard the PostgreSQL counter by hunt ID, splitting the hot row into 1024 partitions. That reduced the lock contention but introduced new problems. Each hunt now needed its own sequence, and our Rails code had to route writes to the correct shard. The shard routing added 400ms of latency to leaderboard queries because we had to union results across 1024 tables. Meanwhile, PostgreSQL sequences had gaps of up to 1024 when nodes restarted, so our reward payouts were off by thousands on high-traffic hunts. Our Redis cache didn't help because the leaderboard queries were point lookups against 1024 tables, and Redis couldn't pipeline those efficiently.
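To make the tradeoff concrete, here is a minimal Python sketch of a sharded counter. It is an in-memory stand-in, not our Rails code: the `ShardedCounter` class, the CRC32 routing, and the use of a writer ID as the shard key are all assumptions for illustration. It shows why a write gets cheaper (it touches one shard) while a read gets more expensive (it has to visit all 1024).

```python
import zlib

NUM_SHARDS = 1024  # shard count from the post


class ShardedCounter:
    """Hypothetical in-memory stand-in for a counter split across many rows."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [0] * num_shards
        self.shard_reads = 0  # how many individual shard reads we have issued

    def _shard_for(self, writer_id):
        # Stable hash, so the same writer always lands on the same shard.
        return zlib.crc32(writer_id.encode("utf-8")) % len(self.shards)

    def increment(self, writer_id, points=1):
        # A write touches exactly one shard, which is what relieves contention.
        self.shards[self._shard_for(writer_id)] += points

    def total(self):
        # A read has to visit every shard, which is where the extra latency comes from.
        self.shard_reads += len(self.shards)
        return sum(self.shards)


counter = ShardedCounter()
for i in range(5000):
    counter.increment(f"user-{i}")

print(counter.total(), counter.shard_reads)  # prints: 5000 1024
```

One total costs 1024 shard reads here. Against real tables, each of those reads is a query or a branch of a union, which matches the read-side penalty described above.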

The Architecture Decision

We ripped out the PostgreSQL counter and replaced it with a Kafka Streams-based event sourcing system called HuntStream. Every hunt action (point earned, reward claimed) became an immutable event in a Kafka topic. We built a materialized view on top of RocksDB that consumed the topic and maintained the current leaderboard state. The materialized view was partitioned by hunt ID, which meant leaderboard queries only hit one RocksDB partition per hunt. We used RocksDB's built-in caching to keep hot leaderboards in memory and fell back to disk for cold ones. The tradeoff was operational complexity: we now ran a Kafka cluster and three Streams apps, and we had to monitor RocksDB compaction pauses. But we gained exactly-once semantics, horizontal scalability, and the ability to replay events if state was ever corrupted.
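The core idea can be sketched without Kafka at all. The Python below is a simplified model, not HuntStream itself: a plain list stands in for the Kafka topic, nested dicts stand in for the RocksDB-backed state, and the names `HuntEvent`, `LeaderboardView`, and `rebuild` are hypothetical. It shows the two properties we cared about: state is a pure fold over immutable events, and a leaderboard query only touches the partition for one hunt.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class HuntEvent:
    hunt_id: str
    user_id: str
    kind: str  # "point_earned" or "reward_claimed"
    amount: int


class LeaderboardView:
    """Materialized view: per-hunt scores folded from the event log."""

    def __init__(self):
        # hunt_id -> user_id -> score; each inner dict plays the role of one partition.
        self.scores = defaultdict(lambda: defaultdict(int))

    def apply(self, event):
        # Only point events change the leaderboard; other kinds are ignored here.
        if event.kind == "point_earned":
            self.scores[event.hunt_id][event.user_id] += event.amount

    def top(self, hunt_id, n=3):
        # Reads a single hunt's partition; ties are broken by user ID.
        board = self.scores.get(hunt_id, {})
        return sorted(board.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def rebuild(log):
    """Replay the whole log into a fresh view, e.g. after state corruption."""
    view = LeaderboardView()
    for event in log:
        view.apply(event)
    return view


log = [
    HuntEvent("hunt-1", "ana", "point_earned", 10),
    HuntEvent("hunt-1", "bo", "point_earned", 25),
    HuntEvent("hunt-2", "ana", "point_earned", 5),
    HuntEvent("hunt-1", "ana", "point_earned", 20),
    HuntEvent("hunt-1", "bo", "reward_claimed", 1),
]

view = rebuild(log)
print(view.top("hunt-1"))  # prints: [('ana', 30), ('bo', 25)]
```

Because `rebuild` derives everything from the log, throwing the view away and replaying always produces the same leaderboard. That is the replay property the real system gets from keeping events in a Kafka topic.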

What The Numbers Said After

After rolling out HuntStream to 100% of traffic, our error rate dropped from 18% to 0.02% under the same 2000 concurrent writes. Leaderboard latency dropped from 400ms to 12ms p99. Our Kafka brokers handled 45,000 events per second, with 90% of events completing end-to-end in under 5ms. The RocksDB materialized views used 1.8GB of RAM per hunt instance, and we scaled horizontally by adding more containers when hunt concurrency spiked. The biggest surprise was that our reward-payout correctness improved: the event log meant we could replay events and reconcile payouts exactly, eliminating the sequence-gap problem that had thrown payouts off by up to 1024.
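Reconciling payouts from the log looks roughly like the Python sketch below. The tuple layout, the cent amounts, and the `expected_payouts` and `reconcile` helpers are made up for illustration; in practice the events would come from replaying the topic and the paid amounts from a payout ledger.

```python
from collections import Counter


def expected_payouts(claim_events):
    """Fold reward-claim events into what each (hunt, user) should have been paid."""
    totals = Counter()
    for hunt_id, user_id, cents in claim_events:
        totals[(hunt_id, user_id)] += cents
    return totals


def reconcile(claim_events, paid):
    """Return {(hunt_id, user_id): paid - expected} for every mismatch."""
    expected = expected_payouts(claim_events)
    diffs = {}
    # Sorted so the report comes out in a stable order.
    for key in sorted(set(expected) | set(paid)):
        delta = paid.get(key, 0) - expected.get(key, 0)
        if delta != 0:
            diffs[key] = delta
    return diffs


# Hypothetical replayed events: (hunt_id, user_id, reward in cents).
claim_events = [
    ("hunt-1", "ana", 500),
    ("hunt-1", "ana", 250),
    ("hunt-1", "bo", 1000),
]

# Hypothetical ledger of what was actually paid out.
paid = {
    ("hunt-1", "ana"): 750,
    ("hunt-1", "bo"): 1200,  # overpaid by 200
    ("hunt-2", "cy"): 300,   # paid with no matching claim event
}

print(reconcile(claim_events, paid))
# prints: {('hunt-1', 'bo'): 200, ('hunt-2', 'cy'): 300}
```

A positive delta is an overpayment and a negative one is an underpayment. An empty result means the ledger agrees with the log.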

What I Would Do Differently

I wouldn't have sharded the PostgreSQL counter. Sharding introduced as much complexity as the eventual solution, but without the scalability benefits. We also underestimated the cost of RocksDB compaction. During our first traffic spike, one hunt's RocksDB instance stalled on compaction for 8 seconds, which left that leaderboard stale. We had to tune compaction intervals and increase disk IOPS. Next time, I'd use a managed streaming platform like Confluent Cloud or Redpanda instead of self-hosting Kafka, unless we absolutely needed on-prem control. Finally, I'd write an integration test that simulates 5000 concurrent writes and verifies leaderboard correctness before every deployment. Our post-incident test suite only covered latency, not correctness, and we paid for that omission.
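The shape of that correctness test is simple enough to sketch in Python. This is a simplified version under stated assumptions: the `Leaderboard` class is a hypothetical lock-protected, in-memory stand-in, where a real integration test would drive the deployed write path and read the leaderboard back through its API. The point is the assertion: compute the expected totals up front, fire the writes concurrently, then compare.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Leaderboard:
    """Hypothetical stand-in for the system under test."""

    def __init__(self):
        self._lock = threading.Lock()
        self._scores = Counter()

    def add_points(self, user_id, points):
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._scores[user_id] += points

    def snapshot(self):
        with self._lock:
            return dict(self._scores)


def run_concurrent_writes(board, num_writes=5000, num_users=50, workers=64):
    """Fire num_writes writes from a thread pool and return the expected totals."""
    writes = [(f"user-{i % num_users}", (i % 7) + 1) for i in range(num_writes)]

    # Expected totals are computed sequentially, independent of the system under test.
    expected = Counter()
    for user_id, points in writes:
        expected[user_id] += points

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any worker exception surfaces here.
        list(pool.map(lambda w: board.add_points(*w), writes))

    return dict(expected)


board = Leaderboard()
expected = run_concurrent_writes(board)
assert board.snapshot() == expected, "leaderboard diverged from expected totals"
print("leaderboard matches expected totals for", sum(expected.values()), "points")
```

A latency-only suite would pass even if some of those writes were silently lost. Comparing against independently computed totals is what catches that.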

We removed the payment processor from our critical path. This is the tool that made it possible: https://payhip.com/ref/dev1
