It Was 2024 When We Tried to Outsmart the Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We needed to survive the Black Friday of game launches without throwing hardware at the problem. The treasure hunt engine is peak write-amplification: every coin placement and every search query mutates state and then broadcasts an event to hundreds of listening clients. Our first architecture carved the grid into shards, each backed by a dedicated PostgreSQL 14 instance with pg_bouncer in transaction pooling mode. It looked clean on paper: 64 shards, 8 read replicas per shard, Prometheus scraping every pg_stat_bgwriter.metrics at 5 s intervals.

The inflection came when a single viral TikTok clip drove 15× normal traffic. At 12 000 concurrent sessions we started seeing:

pg_bouncer logs: failed to get connection: timeout after 5000 ms
Prometheus counter pgsql_connections_max_overflow spiked to 42 on shard 23, exactly the one hosting the TikTok hotspot.
The nf_conntrack table grew to 1.2 million entries, and the kernel started dropping packets because the table was only 1.5× our peak.

We realized the abstraction that worked for 1 000 players was now a liability: every shard held its own connection pool, and the pool bounded the number of simultaneous writes. The TikTok wave concentrated in one shard, and the write load saturated not the CPU but the pool slots. We had optimized for cardinality, not tail latency.

What We Tried First (And Why It Failed)

First defensive move: scale the pool. We bumped pg_bouncer max_client_conn from 10 000 to 40 000 and upped shared_buffers to 1 GB. The immediate effect was the database OOM-killer stepping in after 47 minutes; apparently 64 × 1 GB = 64 GB wasnt trivial on our 64 GB RAM hosts. Second try: move to statement pooling to reduce the pool footprint. That bought 15 minutes before we hit the same RST storms. The root cause wasnt the pool size; it was that each client kept one long-lived WebSocket open, and every search issued a SELECT that joined three tables: grid_cells, coin_placements, and player_searches. The planner was doing nested loops because grid_cells had no spatial index.

Dumping the plan showed:

Nested Loop (cost=100.01..45678.90 rows=12345 width=42)
 Join Filter: (ST_DWithin(grid_cells.geom, player_searches.point, 50))

The ST_DWithin without a GiST index translated to 12 000 × 500 000 rows scanned per request. Our metrics showed 3.4 s average execution time under load, which explained the TCP backlog filling before we saturated CPU.

The Architecture Decision

By June 2024 we had two constraints: we could not rewrite the spatial queries overnight, and we could not keep doubling PostgreSQL RAM. We elected to place a read-through cache layer between the application and PostgreSQL. The cache would store the result of each unique search radius query for 500 ms—a compromise between consistency and memory usage. The tool was Dragonfly 0.13 (then called DragonflyDB), a Redis-compatible in-memory store forked from KeyDB. We chose it because it offered multi-threading via redis-threaded and 1 million ops/s on a single 32-core host, enough to absorb the search workload while PostgreSQL handled writes.

The tradeoff was eventual consistency: players could still see a coin that had been removed 300 ms ago if another player removed it at the same time. We documented it as acceptable for this simulation game. We sharded the Dragonfly cluster into 32 logical shards, consistent-hash based on the search center point. Each shard used 16 GB RAM, giving us 512 GB total L3 cache at a cost of $0.08 per GB-hour on our Kubernetes cluster.

Writes still went straight to PostgreSQL. We introduced a change-data-capture pipeline using Debezium 2.4 and Kafka 3.6. A small Go service consumed the PostgreSQL logical replication stream, filtered for grid_cells and coin_placements, and emitted cache-invalidations to a dedicated Kafka topic. A second consumer, co-located on the same Kubernetes node as the Dragonfly shard, listened to that topic and issued DEL commands only to the shard that held the affected geometry. We measured end-to-end invalidation latency at P99 120 ms, which matched our SLA for this game session.

What The Numbers Said After

Three weeks after rolling out Dragonfly we replayed the same Black-Friday traffic profile. The metrics told the story:

Connection pool overflows dropped from 42 per minute to 0.
Average application-layer latency fell from 250 ms to 32 ms (P99 87 ms).
PostgreSQL CPU usage declined from 85 % to 35 % because the planner no longer saw high-cost scans.
Memory usage on Dragonfly hosts plateaued at 82 % after 48 hours, comfortably within our 90 % safety threshold.
Debezium lag stayed under 200 ms, meaning cache invalidations were timely enough for gameplay.

The cost delta was 120 dollars per day for the extra Dragonfly nodes versus 450 dollars per day we had been