Treasure Hunting at Scale: Why Our Cache-Aside Cache Cost Us 40% in Tail Latency During Black Friday

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

During load testing at 50k concurrent hunters hitting the hunt endpoints, p99 latencies stayed under 200ms. But at 270k concurrent users in production, the hunt page suddenly took 1.8 seconds to load, triggering cascading 502s from our CDN. The error surfaced in Datadog as hunt_page_render_time_bucket{le=2.0} = 42% while le=0.5 dropped to 18%. The fingerprints were identical across three regions: high latency correlated exactly with Redis cache miss rate spiking from 12% to 48% during the hunt start window. Our cache-aside pattern with a 30-second TTL was amplifying miss storms.

We discovered that the treasure hunt start time was synchronised by marketing campaigns. When the clock struck 10:00:00 UTC, 270k users hit the endpoint within 30 seconds. Each request would check the cache (miss), fetch from PostgreSQL, render the page, and write the cache entry. But PostgreSQL couldnt keep up with 9k queries per second during that window, causing query queueing and connection exhaustion. The Redis layer, designed for 150k ops/sec, was not the bottleneck. The database was.

What We Tried First (And Why It Failed)

Our first attempt was to increase Redis TTL from 30 seconds to 5 minutes. This reduced cache misses from 48% to 24%, and p99 latency improved to 650ms. But at 320k concurrent users, the latency still spiked to 1.4s because the underlying database queries were still hitting the same table with the same indexes. The Redis layer was masking symptoms, not solving the root cause.

Next, we tried database read replicas. We spun up three read replicas and routed hunt queries to them using a weighted service mesh. This worked for a few minutes, but then we hit replication lag. The replicas fell 800ms behind primary, causing hunt pages to display stale treasure locations. Our operators started getting customer complaints about seeing the wrong treasure coordinates. We rolled back within 15 minutes.

We even tried increasing PostgreSQL shared_buffers from 4GB to 8GB and vacuuming aggressively. This temporarily helped, but the real issue was the table scan on the treasures table. The query used a JSON column storing treasure coordinates, and the index on hunt_id and created_at wasnt covering the coordinate lookup. In production, this translated to Seq Scan on treasures (cost=0.00..12543.21 rows=12456 width=244) showing up in every explain plan during peak load. Our attempts to patch the cache were treating symptoms of a missing index.

The Architecture Decision

We redesigned the treasure hunt page to use a write-through cache pattern instead of cache-aside. Heres how it worked:

We introduced a write path: when a new treasure hunt was scheduled in our CMS, it emitted an event to a Kafka topic called treasure-hunt-scheduled. The event contained the hunt_id, start_time, and a precomputed tile grid of treasure coordinates.
A dedicated microservice called hunt-publisher consumed this topic and wrote the entire hunt configuration into a new Redis hash called hunt:{hunt_id}. The write included the tile grid, start time, and status.
We replaced the hunt page endpoint with a read-through endpoint that first checks Redis for the hunt metadata. If the hunt is not cached, it returns a 404 redirect to a fallback page. We accepted that cachable hunts would be best effort.
We added a background job that pre-warmed Redis caches 10 minutes before each hunt start time by publishing a synthetic cache-warm event to the same Kafka topic.

The write-through cache reduced cache misses to under 2% during peak traffic because the treasure data was preloaded. We also added a new index on the treasures table: CREATE INDEX idx_treasures_hunt_id_coordinate ON treasures(hunt_id, ST_AsGeoJSON(coordinates)::jsonb); This covered both the hunt lookup and the coordinate extraction without table scans.

The tradeoff was operational complexity. We now had to manage two ingestion paths: the CMS publishing to Kafka and the hunt-publisher service keeping Redis in sync. We also introduced eventual consistency—if the hunt was updated after scheduling, the Redis cache would be stale until the next scheduled hunt. But in practice, hunt schedules rarely changed after publication, and the staleness window was under 60 seconds during normal operations.

What The Numbers Said After

After deploying the write-through cache in April 2024, the Black Friday load test showed dramatic improvement. During the peak 500k concurrent users window, p99 hunt page latency stayed at 210ms, down from 1.8s. Cache miss rate dropped to 1.8%, and Redis ops stayed under 180k per second, within its capacity. The database query load on the primary dropped from 12k queries per second to 1.8k, eliminating the queueing bottleneck.

Error rates fell to zero during the hunt start window. The only failure mode we saw was when a hunt was updated after scheduling, causing a single hunt page to show stale coordinates. We measured this as a 0.3% occurrence and accepted it as a product tradeoff rather than an engineering failure.

Our operators stopped paging at 11am on Black Friday. The metrics told the story: hunt_page_render_time_bucket{le=0.5} = 82% and hunt_cache_miss_rate stayed under 2% for all hunt pages.