The Day the Treasure Hunt Engine Blew Up at 3 AM and What We Learned About Cost Before It Happened Again

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

Players drop digital coins into 200,000 events that last anywhere from five minutes to three days. Each event needs a live leaderboard refreshed every thirty seconds so the frontend can show Youre in 7th place, 123 coins behind the leader. The feature shipped first, the infra second, and the finance team hadnt even been invited to the roadmap.

When we scaled past 50 concurrent events, the Redshift WLM queues backed up because every new event triggered a 30-second micro-batch that copied the entire leaderboard into memory, recalculated ranks with a window function that touched every row, then wrote back 20 GB of deltas. The simple fix—just add more clusters—failed the day the marketing team launched #SuperBowlChallenge with 2 million expected players. The cluster count tripled overnight, credits vanished, and the CFO blocked my next purchase order.

What We Tried First (And Why It Failed)

We bolted on Kinesis Data Streams to ingest every coin drop in real time. The stream consumer ran on seven c5.2xlarge Fargate tasks, each polling 20 shards and writing to Redshift via a single COPY command every 30 seconds. The AWS bill printed $3,200/day for the streams plus $4,800/day for Redshift RA3 nodes.

The first meltdown happened when a mis-sized JSON blob in a single shard caused 7 MB of unparsed data to sit in Kinesis for nine minutes. The consumers memory spiked to 92 %, Docker killed it, and the next task inherited 200,000 duplicate messages. Redshift landed 200,000 duplicate rows, the window function ran for 37 minutes instead of 30 seconds, and the real-time leaderboard froze while users spammed refresh buttons.

We switched to a Kafka cluster on three m6g.2xlarge brokers. That cut ingestion latency to 1.2 seconds on average, but the brokers cost $2,100/month and we still needed Redshift for the ranking query. The ingestion latency looked good, yet the warehouse query latency still had a 95th percentile of 28 seconds. The real problem was the scan depth: every leaderboard rebuild touched 5 % of the 300 million rows because the DENSE_RANK window function didnt have a partition.

The Architecture Decision

I stopped trying to keep every rank calculation inside Redshift. Instead, we sharded the leaderboard by event ID into 200 separate tables on a Snowflake multi-cluster warehouse. Each events table held only the rows that mattered: the top 1,000 players by coin count. That dropped the table size for any single leaderboard from 150 GB to 3 GB.

The ingestion pipeline became two stages:

Kinesis → Flink job (three task managers on Kubernetes) that groups messages by event ID, maintains an in-memory LRU cache of the top 1,000 scores per event using RocksDB state backend, and flushes a Kafka topic when the cache hits 1,000 changes or 10 seconds pass.
A lightweight Snowpipe file copy that loads the topic into the corresponding Snowflake table every 60 seconds.

The Flink job kept the end-to-end latency at 3.8 seconds 99 % of the time and reduced our compute footprint to one r6g.xlarge instance plus three small Snowflake clusters running at $1,200/day total. We also added a checksum column that detected duplicate rows before they landed in Snowflake, eliminating the previous downstream corruption bug.

What The Numbers Said After

Latency: 99th percentile leaderboard refresh after a coin drop now 5.2 seconds (previously 38 seconds).
Cost: Snowflake warehouse spend dropped from $4,800/day to $1,200/day; Redshift cluster count went from seven ra3.4xlarge to zero.
Freshness SLA: 30-second cadence met 99.8 % of the time; the worst 15-minute stretch was during the Super Bowl spike when we hit 98 %.
Data quality: duplicate rows dropped from 12 per minute to zero since we added a checksum check at the Flink sink.
P95 query cost in Snowflake: $0.003 per leaderboard rebuild.

What I Would Do Differently

I would have built the sharding key into the very first database schema instead of trying to bolt it on after 50 events went live. That would have let me use a simple CREATE TABLE … WITH CLUSTER BY (event_id) in Snowflake from day one instead of rewriting every downstream job.

I would also refuse to let the marketing team set freshness expectations without a load test that pushes 2× expected peak. Our first blast of 200,000 concurrent players taught me that a ten-second latency promise must be backed by a load test that generates 500,000 messages per second, not the 20,000 we simulated.

Finally, I would force the product manager to sign a document that says if the freshness SLA is met 99 % of the time, we accept 1 % of the time the board freezes for 90 seconds. That document saves engineering weeks of sleep.