When the Treasure Hunt Engine Buried Us Alive Under 300k RPS

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

The Treasure Hunt Engine did one thing: it matched user events to static treasure definitions and returned a rank and a prize within 200 milliseconds. The treasure definitions rarely changed—maybe 50 updates per day—but the user events were a firehose: likes, shares, comments, skips, everything.

Our first architecture used DynamoDB with a composite key of user_id + event_ts. Simple, right? Wrong. The partition key was user_id, which meant every user's events were stored together. Within weeks we had hot partitions for users who posted ten times a second. P99 latency spiked to 1.8 seconds during peak, and the AWS throttling errors read like a horror story: ProvisionedThroughputExceededException, HTTP 400s for every request targeting that table.

We tried sharding the table into 32 buckets using a hash of user_id. That bought us two weeks until marketing launched the feature in three new languages and we suddenly had 20x more users in Korea and Brazil. The DynamoDB hot partition problem crossed continents.
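To make the limit of that approach concrete, here is a minimal sketch of hash bucketing. The FNV-1a hash and the bucketFor name are assumptions for illustration; the post does not say which hash was used.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets mirrors the 32 buckets mentioned above.
const numBuckets = 32

// bucketFor maps a user ID to one of numBuckets shards using FNV-1a.
// The hash choice is an assumption made for this sketch.
func bucketFor(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numBuckets
}

func main() {
	// The same user always maps to the same bucket, so hashing spreads
	// users across shards but cannot spread one hot user's writes.
	for i := 0; i < 3; i++ {
		fmt.Println(bucketFor("user-42"))
	}
}
```

Because the mapping is deterministic per user, a user posting ten times a second still concentrates load on one bucket, and a 20x jump in users simply fills all 32 buckets faster.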

What We Tried First (And Why It Failed)

Our second attempt was RabbitMQ + Redis. We would buffer events in RabbitMQ, then fan out to 128 workers that would process each event and write the result to Redis sorted sets. The Redis layer stored the top 100 treasures per user, and we used Lua scripts to compute rank and prize in one round trip.

This lasted until the first time our Redis cluster ran out of memory. We had set maxmemory-policy to allkeys-lru, but the Lua script was returning 4MB of serialized JSON per user instead of the 2KB we expected. Our Lua script looked innocent but had a hidden monster: it fetched every treasure definition for every user event because we forgot to implement a filter.
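The missing filter is easy to picture. The sketch below is written in Go rather than the original Lua, and the TreasureDef shape, the EventType field, and the matching function are all hypothetical. It only shows the idea of narrowing definitions to the event's type and capping the result before anything is serialized.

```go
package main

import "fmt"

// TreasureDef is a hypothetical shape for a treasure definition.
type TreasureDef struct {
	ID        string
	EventType string // event kind this treasure reacts to, e.g. "like"
}

// matching returns only the definitions for eventType, capped at limit.
// Without this step every definition is returned for every event.
func matching(defs []TreasureDef, eventType string, limit int) []TreasureDef {
	if limit <= 0 {
		return nil
	}
	out := make([]TreasureDef, 0, limit)
	for _, d := range defs {
		if d.EventType != eventType {
			continue
		}
		out = append(out, d)
		if len(out) == limit {
			break
		}
	}
	return out
}

func main() {
	defs := []TreasureDef{
		{ID: "t1", EventType: "like"},
		{ID: "t2", EventType: "share"},
		{ID: "t3", EventType: "like"},
		{ID: "t4", EventType: "like"},
	}
	// Only "like" definitions come back, and never more than two of them.
	fmt.Println(len(matching(defs, "like", 2)))
}
```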

On call that night, I watched the Redis eviction rate climb to 40k keys per second. The entire cluster became unresponsive, and the workers piled 2 million unacknowledged messages into RabbitMQ. The disk alarm triggered on the RabbitMQ nodes. Total downtime: 47 minutes. Total PR damage: public.

The Architecture Decision

We stopped trying to handle state and started handling flow.

The winning design was a single Go binary that read directly from Kafka, partitioned by user_id, and used a write-optimized LSM store (RocksDB) embedded in the same process. No buffering, no fan-out, no Lua scripts. Each node owned a contiguous range of user IDs, and the total memory footprint never exceeded 24GB because we ran with arena allocation and mmap.
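Range ownership can be sketched as a binary search over sorted ranges. This assumes numeric user IDs and a static range table. The Range type and ownerOf function are illustrative names, and the real mapping from Kafka partitions to nodes is not described in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a half-open interval [Start, End) of user IDs owned by Node.
type Range struct {
	Start, End uint64
	Node       string
}

// ownerOf returns the node that owns userID.
// ranges must be sorted by Start and must not overlap.
func ownerOf(ranges []Range, userID uint64) (string, bool) {
	// Find the first range whose End lies beyond userID.
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].End > userID })
	if i < len(ranges) && ranges[i].Start <= userID {
		return ranges[i].Node, true
	}
	return "", false
}

func main() {
	ranges := []Range{
		{Start: 0, End: 1000, Node: "node-a"},
		{Start: 1000, End: 2000, Node: "node-b"},
		{Start: 2000, End: 3000, Node: "node-c"},
	}
	node, ok := ownerOf(ranges, 1500)
	fmt.Println(node, ok) // prints "node-b true"
}
```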

The critical part was the merge process. Every 60 seconds, each node merged its in-memory delta with the on-disk RocksDB tree and checkpointed to S3. The merge was single-threaded but bounded: treasure definitions changed at most about 50 times per day across the whole fleet, so the merge never took more than 300ms.
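The swap-then-write pattern behind a merge like this can be sketched as follows. The Store interface and mapStore type are stand-ins invented for this example (a real build would back Store with RocksDB), and the S3 checkpoint is left out entirely.

```go
package main

import (
	"fmt"
	"sync"
)

// Store stands in for the embedded on-disk store.
type Store interface {
	Put(key string, value int64)
}

// mapStore is an in-memory Store used only to make this sketch runnable.
type mapStore map[string]int64

func (m mapStore) Put(key string, value int64) { m[key] = value }

// Delta buffers writes in memory between merges.
type Delta struct {
	mu      sync.Mutex
	pending map[string]int64
}

func NewDelta() *Delta { return &Delta{pending: make(map[string]int64)} }

// Record keeps the latest value for key until the next merge.
func (d *Delta) Record(key string, value int64) {
	d.mu.Lock()
	d.pending[key] = value
	d.mu.Unlock()
}

// MergeInto swaps out the pending map under the lock, then writes the
// batch to the store without blocking new Record calls.
// It returns the number of keys merged.
func (d *Delta) MergeInto(s Store) int {
	d.mu.Lock()
	batch := d.pending
	d.pending = make(map[string]int64)
	d.mu.Unlock()

	for k, v := range batch {
		s.Put(k, v)
	}
	return len(batch)
}

func main() {
	d := NewDelta()
	store := mapStore{}
	d.Record("user-1:rank", 3)
	d.Record("user-1:rank", 2) // a later write wins within one interval
	d.Record("user-2:rank", 7)

	// In a service this call would be driven by a time.Ticker every 60 seconds.
	fmt.Println(d.MergeInto(store), store["user-1:rank"]) // prints "2 2"
}
```

Holding the lock only for the map swap keeps the hot path short, which is one way a single-threaded merge can stay bounded.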

We chose RocksDB over BoltDB because we needed crash safety and background compaction. We chose Kafka as the only durable source of truth because we had already paid the network cost to replicate events to three AZs.

The tradeoff was human: every operator now had to read RocksDB sstables to debug a missing treasure. No Redis CLI shortcuts. No DynamoDB console. Just rocksdb_admin and a grep.

Still better than 47 minutes of downtime.

What The Numbers Said After

After the rollout, P99 latency dropped from 1.8 seconds to 85 milliseconds. The 99.9th percentile stayed under 150ms even at 310k RPS. Memory usage per node stayed flat at 22GB. We stopped getting throttling alerts from DynamoDB because we deleted that table.

Error rate across the fleet fell from 0.8% to 0.003%. The remaining errors were all timeouts caused by slow clients on mobile networks, which we fixed by increasing the per-request timeout from 200ms to 500ms.

Our monitoring stack told the real story. We instrumented every RocksDB compaction step and exported the timings as a Prometheus metric, rocksdb_compaction_seconds_total. When a compaction took more than 400ms, we fired an alert so the operator could restart the pod before its Kafka consumer lag grew. It never happened at scale, but it happened enough during canary tests to teach us humility.
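The threshold check itself is small. This sketch uses only the standard library and leaves the Prometheus client out. The timed and isSlow helpers are hypothetical names, and the sleep is a placeholder for a real compaction step.

```go
package main

import (
	"fmt"
	"time"
)

// slowCompaction is the 400ms alert threshold described above.
const slowCompaction = 400 * time.Millisecond

// isSlow reports whether a compaction duration should raise an alert.
func isSlow(d time.Duration) bool { return d > slowCompaction }

// timed runs step and returns how long it took and whether that
// duration crossed the alert threshold.
func timed(step func()) (time.Duration, bool) {
	start := time.Now()
	step()
	elapsed := time.Since(start)
	return elapsed, isSlow(elapsed)
}

func main() {
	// The sleep stands in for a real compaction step.
	elapsed, slow := timed(func() { time.Sleep(10 * time.Millisecond) })
	fmt.Printf("compaction took %v, alert=%v\n", elapsed.Round(time.Millisecond), slow)
}
```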

What I Would Do Differently

I would never again let a weekend hackathon project become a production pillar without a circuit breaker that actually kills feature flags.
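One possible shape for such a breaker, as a sketch only: count consecutive failures and switch the flag off once a threshold is reached. FlagBreaker and its methods are hypothetical names, and a real system would flip the flag in whatever flag service it uses rather than in a local boolean.

```go
package main

import (
	"fmt"
	"sync"
)

// FlagBreaker disables a feature flag after too many consecutive failures.
type FlagBreaker struct {
	mu        sync.Mutex
	enabled   bool
	failures  int
	threshold int
}

func NewFlagBreaker(threshold int) *FlagBreaker {
	return &FlagBreaker{enabled: true, threshold: threshold}
}

// Report records the outcome of one request.
func (b *FlagBreaker) Report(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.failures = 0 // any success resets the streak
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.enabled = false // stays off until an operator re-enables it
	}
}

// Enabled reports whether the feature should still serve traffic.
func (b *FlagBreaker) Enabled() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.enabled
}

func main() {
	b := NewFlagBreaker(3)
	for i := 0; i < 3; i++ {
		b.Report(false)
	}
	fmt.Println(b.Enabled()) // prints "false"
}
```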

We should have built the Kafka consumer first and proven we could shed events safely at peak load before we ever stored a single treasure. Instead, we optimized for correctness under low load and got blindsided by traffic.
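Dropping events under pressure can be prototyped with a bounded queue and a non-blocking send. This is a sketch under assumptions: the Event and Shedder types are invented here, and a real consumer would also have to decide how to commit Kafka offsets for the events it drops.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Event is a hypothetical user event.
type Event struct {
	UserID string
	Kind   string
}

// Shedder holds a bounded queue and counts the events it refuses.
type Shedder struct {
	dropped int64 // first field so atomic access stays 64-bit aligned
	queue   chan Event
}

func NewShedder(capacity int) *Shedder {
	return &Shedder{queue: make(chan Event, capacity)}
}

// Offer enqueues e if there is room and otherwise drops it immediately
// instead of blocking the caller.
func (s *Shedder) Offer(e Event) bool {
	select {
	case s.queue <- e:
		return true
	default:
		atomic.AddInt64(&s.dropped, 1)
		return false
	}
}

// Dropped returns how many events have been shed so far.
func (s *Shedder) Dropped() int64 { return atomic.LoadInt64(&s.dropped) }

func main() {
	s := NewShedder(2)
	accepted := 0
	for i := 0; i < 5; i++ {
		if s.Offer(Event{UserID: "user-1", Kind: "like"}) {
			accepted++
		}
	}
	// Nothing drains the queue here, so only the first two events fit.
	fmt.Println(accepted, s.Dropped()) // prints "2 3"
}
```

Exporting the dropped counter as a metric gives a direct answer to the question of how much load is being shed at peak.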

I would also invest in tooling around RocksDB. We wrote a custom rocksdb_debug CLI that dumps the bloom filter stats and sstable tombstones. If I had spent two days building that tool before launch, we would have saved three hours of debugging during the first outage.

Finally, I would rename the service. Treasure Hunt Engine sounds like a game, not a critical pipeline. When the CEO starts demoing it to investors, on-call starts melting.
