When 100k Seconds of Wall Time Stole a Week from My Life and the Treasure Hunt Engine Paid the Price

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The treasure-hunt engine is a write-heavy, read-hot workload. Every hunt step inserts a row in events_ephemeral (player action), updates a counter on players_current (ranking), and fires a search index update. At 150 k QPS these three operations turn into ~450 k writes per second. We started on a single Postgres 15.3 instance (i3.4xlarge, 1.5 TB gp3) with synchronous_commit=on, fsync=on, and synchronous_standby=on for HA. The primary could keep up for about an hour; then replication lag hit 30 s, CPU saturated at 98 %, and P99 latency ballooned to 4.2 s.

The Veltrix docs at the time suggested two fixes:

– Horizontal sharding to split writes by hunt_id.
– Move ephemeral writes to TimescaleDB with continuous aggregates.

We implemented both. The sharded Postgres cluster ran 15 shards on r6g.2xlarge nodes, and TimescaleDB handled the time-series filters. Yet the 30-second inventory cache invalidation window remained, and the on-call operator had to restart the cache service every 15–20 minutes because the live inventory object was changing faster than the TTL could cover. Prometheus showed:

p95_inventory_cache_age_seconds{quantile="0.95"} 17
cache_hit_ratio 0.32

Those metrics told us the cache was useless in practice. The root cause was never the sharding topology; it was a 30-second synchronous commit window buried in a connection string parameter we never questioned.
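For readers who want to reproduce numbers like the two above, here is a minimal sketch of how a hit ratio and a p95 age can be derived from raw counters and age samples. The function names and the nearest-rank percentile method are assumptions for illustration, not the exporter we actually ran.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical counters chosen to match the ratio quoted in the text.
    print(hit_ratio(hits=32, misses=68))
    print(percentile_nearest_rank([float(s) for s in range(1, 21)], 95))
```

At a hit ratio of 0.32, roughly two out of three inventory lookups fall through to the database, which is why the cache bought us so little.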

What We Tried First (And Why It Flailed)

Attempt 1: Time-based sharding in TimescaleDB
We split the events_ephemeral table by hour. The rollback window grew to 4 minutes instead of 30 seconds (good), but every hunt step now needed a cross-chunk SELECT to compute the live inventory. P99 latency jumped to 1.1 s and our marketing team axed the double-xp weekend promotions because the hunt finish page timed out.
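The cross-chunk cost is easy to see with a small sketch. It assumes chunks are aligned to the clock hour (an assumption; the actual chunk boundaries depend on how the hypertable was created) and counts how many hourly chunks a single hunt has to read to rebuild its inventory.

```python
from datetime import datetime, timedelta, timezone

CHUNK = timedelta(hours=1)  # assumed chunk interval, matching "split by hour"


def chunk_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its (assumed clock-aligned) hourly chunk."""
    return ts.replace(minute=0, second=0, microsecond=0)


def chunks_touched(first_event: datetime, last_event: datetime) -> list[datetime]:
    """Start times of every hourly chunk that holds events for one hunt."""
    if last_event < first_event:
        raise ValueError("last_event precedes first_event")
    touched = []
    current = chunk_start(first_event)
    while current <= last_event:
        touched.append(current)
        current += CHUNK
    return touched


if __name__ == "__main__":
    start = datetime(2024, 1, 5, 13, 50, tzinfo=timezone.utc)
    end = datetime(2024, 1, 5, 15, 10, tzinfo=timezone.utc)
    # An 80-minute hunt that straddles two hour boundaries reads three chunks.
    print(len(chunks_touched(start, end)))
```

Any hunt that crosses an hour boundary pays for at least two chunk scans on every inventory read, which fits the latency regression we saw.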

Attempt 2: Application-level event sourcing with Kafka
We replaced the Postgres events table with a Kafka topic (100 partitions, acks=all). The move dropped Postgres CPU from 98 % to 42 % and replication lag to zero, but the inventory cache had to listen to a compacted topic named inventory_updates_v2. The very first deployment pushed 8 GB of retained messages overnight, the consumer group reset three times, and the on-call spent 48 hours in #prod-ops chasing: Consumer group inv-agg-0 is rebalancing: [ConsumerId]:: [NodeId] rebalance failed due to UnknownMemberId.
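For context on what the cache consumer was rebuilding: a compacted topic retains at least the latest record per key, and a record with a null value (a tombstone) marks the key as deleted. The sketch below folds a replayed log into current state under those semantics. The record shape (a player key and an inventory dict) is hypothetical, and this is plain Python, not a Kafka client.

```python
from typing import Iterable, Optional


def fold_compacted_log(
    records: Iterable[tuple[str, Optional[dict]]],
) -> dict[str, dict]:
    """Replay (key, value) records in offset order and keep the latest value per key.

    A value of None is treated as a tombstone and removes the key.
    """
    state: dict[str, dict] = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)  # tombstone: forget this key
        else:
            state[key] = value    # later records overwrite earlier ones
    return state


if __name__ == "__main__":
    log = [
        ("player-1", {"gold": 10}),
        ("player-2", {"gold": 3}),
        ("player-1", {"gold": 25}),  # supersedes the first player-1 record
        ("player-2", None),          # tombstone
    ]
    print(fold_compacted_log(log))
```

Every consumer group reset forces a replay like this from the start of the retained log, so 8 GB of retained messages made each of our three resets expensive.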

Attempt 3: Distributed in-memory cache with Redis Cluster 7.2
We sharded Redis by hunt_id, set maxmemory-policy to noeviction, and tuned client-side consistent hashing. The cache hit ratio improved to 0.93 and the median inventory age dropped to 2 ms, yet every Friday at 14:00 UTC the cluster froze for 7–8 seconds because the underlying AWS Transit Gateway had a 100 ms RTT spike that triggered a full resharding scan. Our SLA called for 5-second latency; we were at 7.8 seconds worst-case.
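As an illustration of the client-side consistent hashing mentioned above, here is a minimal hash ring with virtual nodes. It is a generic sketch, not our client code: the node names, the MD5-based point function, and the vnode count are all assumptions. It is also separate from Redis Cluster's own key distribution, which maps keys to 16384 hash slots via CRC16.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each node owns many points; a key maps to the next point clockwise."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # kept sorted by point
        for node in nodes:
            self.add(node)

    @staticmethod
    def _point(label: str) -> int:
        digest = hashlib.md5(label.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._point(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, hunt_id: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (self._point(f"hunt:{hunt_id}"), ""))
        if idx == len(self._ring):  # wrap around past the last point
            idx = 0
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["redis-a", "redis-b", "redis-c"])  # hypothetical node names
    print(ring.node_for("hunt-42"))
```

The property that matters is that removing a node only remaps the keys that node owned; everything else stays put, so a topology change does not require scanning every key.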

Each attempt scratched one itch and introduced a new bottleneck. The real problem—a 30-second synchronous_commit window disguised in a connection pooler parameter—was still lurking.

The Architecture Decision

We ripped out the monolithic shard and rebuilt the treasure-hunt engine around CQRS and change-data-capture (CDC).

Layer 0 – WAL source
Postgres 15.3 primary with synchronous_commit=remote_apply and max_wal_senders=20. We kept fsync=on because the primary must not lose data; the rest of the writes could be asynchronous.

Layer 1 – CDC with Debezium 2.4.0
Debezium connects to the primary and streams changes from the players_current, players_events, and hunt_state tables in postgres.public into Kafka topics named cdc.players_current, cdc.players_events, and cdc.hunt_state, with message.key=player_id. Setting the Debezium parameter snapshot.mode=initial_only cut the initial 30 GB snapshot time from 11 minutes to 2 minutes.
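A connector registration along these lines might look like the sketch below, built as a Python dict and serialized to JSON. The connector name, hostname, user, and password are placeholders. Using message.key.columns for the player_id key and a RegexRouter transform to drop the schema from topic names are assumptions about how the setup described above could be expressed, not a copy of our config.

```python
import json

TABLES = ["players_current", "players_events", "hunt_state"]


def build_connector_config(schema: str = "public", prefix: str = "cdc") -> dict:
    """Build a Kafka Connect registration payload for a Debezium Postgres connector."""
    qualified = [f"{schema}.{table}" for table in TABLES]
    return {
        "name": "treasure-hunt-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "pg-primary.internal",  # placeholder
            "database.port": "5432",
            "database.user": "debezium",                 # placeholder
            "database.password": "CHANGE_ME",            # placeholder
            "database.dbname": "postgres",
            "topic.prefix": prefix,
            "table.include.list": ",".join(qualified),
            # Assumption: key every table's messages by player_id.
            "message.key.columns": ";".join(f"{t}:player_id" for t in qualified),
            # Value taken from the text. As I read the Debezium docs, this mode
            # snapshots and then stops without streaming, so verify it for your case.
            "snapshot.mode": "initial_only",
            # Assumption: rewrite cdc.public.<table> to cdc.<table>.
            "transforms": "dropSchema",
            "transforms.dropSchema.type": "org.apache.kafka.connect.transforms.RegexRouter",
            "transforms.dropSchema.regex": rf"{prefix}\.{schema}\.(.*)",
            "transforms.dropSchema.replacement": f"{prefix}.$1",
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_connector_config(), indent=2))
```

The printed JSON is the shape of payload that gets POSTed to the Kafka Connect REST API's /connectors endpoint.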

Layer 2 – Command side (writes)
The hunt-service writes hunt actions to an events table in Postgres. Each hunt_id is hashed to a shard (shard_count=32, shard_key=hash(hunt_id) % 32). The hunt-service uses pgbouncer 1.21 with pool_mode=transaction and server_reset_query=DISCARD ALL TEMPORARY. We set synchronous_commit=off for every hunt action; the client application explicitly calls SELECT pg_current_wal_lsn() after every hunt step so that write persistence can be verified when the hunt ends.
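The shard routing rule can be sketched in a few lines. The text only says hash(hunt_id) % 32, so the choice of CRC32 here is an assumption. The one firm requirement is a hash that is stable across processes, which rules out Python's built-in hash() for strings because it is salted per process by default.

```python
import zlib

SHARD_COUNT = 32  # shard_count from the text


def shard_for(hunt_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a hunt_id to a shard index in [0, shard_count) using a process-stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return zlib.crc32(hunt_id.encode("utf-8")) % shard_count


if __name__ == "__main__":
    for hunt_id in ("hunt-1", "hunt-2", "hunt-3"):  # hypothetical ids
        print(hunt_id, "-> shard", shard_for(hunt_id))
```

Note that plain modulo routing remaps most keys if shard_count ever changes, so growing past 32 shards would mean a data migration rather than a config change.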

Layer 3 – Query side (reads)
Two materialized-view services subscribe to the CDC topics:

– inventory-materializer consumes cdc.players_current, computes live inventory per hunt, and writes to Redis Cluster 7.2 with TTL 10
