Lillian Dube


When 100k Seconds of Wall Time Stole a Week from My Life and the Treasure Hunt Engine Paid the Price

The Problem We Were Actually Solving

The treasure-hunt engine is a write-heavy, read-hot workload. Every hunt step inserts a row in events_ephemeral (player action), updates a counter on players_current (ranking), and fires a search index update. At 150k QPS these three operations turn into ~450k writes per second. We started on a single Postgres 15.3 instance (i3.4xlarge, 1.5 TB gp3) with synchronous_commit=on, fsync=on, and a synchronous standby for HA. The primary could keep up for about an hour; then replication lag hit 30 s, CPU saturated at 98 %, and P99 latency ballooned to 4.2 s.
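The write fan-out above is simple arithmetic, sketched here for reference. The dictionary labels are illustrative names for the three operations described in the paragraph, and the one-write-per-operation count is an assumption that ignores index maintenance and WAL overhead.

```python
# Each hunt step triggers one write per downstream operation (assumed 1:1).
WRITES_PER_STEP = {
    "events_ephemeral insert": 1,
    "players_current counter update": 1,
    "search index update": 1,
}


def writes_per_second(step_qps: int) -> int:
    """Total write operations per second for a given hunt-step rate."""
    return step_qps * sum(WRITES_PER_STEP.values())


print(writes_per_second(150_000))  # 450000
```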

The Veltrix docs at the time suggested two fixes:

  1. Horizontal sharding to split writes by hunt_id.
  2. Move ephemeral writes to TimescaleDB with continuous aggregates.

We implemented both. The sharded Postgres cluster gave us 15 shards on r6g.2xlarge nodes; TimescaleDB handled the time-series filters. Yet the 30-second inventory cache invalidation window remained, and the on-call operator had to restart the cache service every 15–20 minutes because the live inventory object was changing faster than the TTL could cover. Prometheus showed:

p95_inventory_cache_age_seconds{quantile="0.95"} 17
cache_hit_ratio 0.32

Those metrics told us the cache was useless in practice. The root cause was never the sharding topology; it was a 30-second synchronous commit window buried in a connection string parameter we never questioned.

What We Tried First (And Why It Flailed)

Attempt 1: Time-based sharding in TimescaleDB
We split the events_ephemeral table by hour. The rollback window grew to 4 minutes instead of 30 seconds (good), but every hunt step now needed a cross-chunk SELECT to compute the live inventory. P99 latency jumped to 1.1 s and our marketing team axed the double-xp weekend promotions because the hunt finish page timed out.

Attempt 2: Application-level event sourcing with Kafka
We replaced the Postgres events table with a Kafka topic (100 partitions, acks=all). The move dropped Postgres CPU from 98 % to 42 % and replication lag to zero, but the inventory cache had to listen to a compacted topic named inventory_updates_v2. The very first deployment pushed 8 GB of retained messages overnight, the consumer group reset three times, and the on-call spent 48 hours in #prod-ops chasing: Consumer group inv-agg-0 is rebalancing: [ConsumerId]:: [NodeId] rebalance failed due to UnknownMemberId.

Attempt 3: Distributed in-memory cache with Redis Cluster 7.2
We sharded Redis by hunt_id, set maxmemory-policy to noeviction, and tuned client-side consistent hashing. The cache hit ratio improved to 0.93 and the median inventory age dropped to 2 ms, yet every Friday at 14:00 UTC the cluster froze for 7–8 seconds because the underlying AWS Transit Gateway had a 100 ms RTT spike that triggered a full resharding scan. Our SLA called for 5-second latency; we were at 7.8 seconds worst-case.
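For readers unfamiliar with the technique, here is a minimal sketch of client-side consistent hashing keyed by hunt_id. It is a generic hash ring with virtual nodes, not the code we ran and not how Redis Cluster assigns keys internally (the cluster itself uses fixed hash slots). The class name, node names, and the choice of SHA-256 and 64 virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_point(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not a Redis client)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []    # sorted list of (point, node)
        self._points = []  # sorted list of points, parallel to _ring
        for node in nodes:
            self.add(node)

    def _reindex(self):
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def add(self, node):
        # Each physical node owns several points to even out the key distribution.
        self._ring.extend(
            (ring_point(f"{node}#{i}"), node) for i in range(self._vnodes)
        )
        self._reindex()

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]
        self._reindex()

    def node_for(self, hunt_id):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First virtual node clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, ring_point(hunt_id))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["redis-a", "redis-b", "redis-c"])  # hypothetical node names
owner = ring.node_for("hunt-42")
```

The property that matters operationally is that removing a node only remaps the keys that node owned; every other hunt_id keeps its owner, so a single node failure does not invalidate the whole cache.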

Each attempt scratched one itch and introduced a new bottleneck. The real problem—a 30-second synchronous_commit window disguised in a connection pooler parameter—was still lurking.

The Architecture Decision

We ripped out the monolithic shard and rebuilt the treasure-hunt engine around CQRS and change-data-capture (CDC).

Layer 0 – WAL source
Postgres 15.3 primary with synchronous_commit=remote_apply and max_wal_senders=20. We kept fsync=on because the primary must not lose data; the rest of the writes could be asynchronous.

Layer 1 – CDC with Debezium 2.4.0
Debezium connects to the primary and streams changes from the players_current, players_events, and hunt_state tables (database postgres, schema public) into Kafka topics named cdc.players_current, cdc.players_events, and cdc.hunt_state with message.key=player_id. Setting the Debezium parameter snapshot.mode=initial_only cut the initial 30 GB snapshot time from 11 minutes to 2 minutes.
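A rough sketch of what a connector registration payload for this layer could look like, built as a Python dict that would be posted to the Kafka Connect REST API. The connector name, host, credentials, and the RegexRouter transform (used here to drop the schema segment so the topics come out as cdc.<table>) are assumptions, not our actual config. The key column mapping assumes every table has a player_id column, and snapshot.mode is copied from the text above; check the Debezium documentation for how each snapshot mode behaves before reusing it.

```python
import json

TABLES = ["players_current", "players_events", "hunt_state"]


def debezium_connector_config(host: str, dbname: str, user: str, password: str) -> dict:
    """Build a Kafka Connect payload for a Debezium Postgres connector (sketch)."""
    return {
        "name": "treasure-hunt-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Debezium 2.x names topics <topic.prefix>.<schema>.<table> by default.
            "topic.prefix": "cdc",
            "table.include.list": ",".join(f"public.{t}" for t in TABLES),
            # Value as stated in the article.
            "snapshot.mode": "initial_only",
            # Assumption: each table carries a player_id column to key messages on.
            "message.key.columns": ";".join(f"public.{t}:player_id" for t in TABLES),
            # Assumption: strip the schema segment to get cdc.<table> topic names.
            "transforms": "route",
            "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
            "transforms.route.regex": r"cdc\.public\.(.*)",
            "transforms.route.replacement": "cdc.$1",
        },
    }


payload = json.dumps(
    debezium_connector_config("pg-primary.internal", "postgres", "debezium", "change-me"),
    indent=2,
)
```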

Layer 2 – Command side (writes)
The hunt-service writes hunt actions to an events table in Postgres. Each hunt_id is hashed to a shard (shard_count=32, shard_key=hash(hunt_id) % 32). The hunt-service uses PgBouncer 1.21 with pool_mode=transaction and server_reset_query=DISCARD ALL. We set synchronous_commit=off for every hunt action; the client application explicitly calls SELECT pg_current_wal_lsn() after every hunt step and keeps the returned LSN so that write persistence can be verified when the hunt ends.
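The routing rule and the per-step transaction can be sketched as follows. This is an illustration under assumptions, not the hunt-service source: the hash function (SHA-256 truncated to 64 bits), the events column names, and both function names are hypothetical. A stable hash is used on purpose, because Python's built-in hash() is randomized per process and would route the same hunt_id differently after a restart.

```python
import hashlib

SHARD_COUNT = 32


def shard_for(hunt_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Stable implementation of hash(hunt_id) % shard_count."""
    digest = hashlib.sha256(hunt_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def hunt_step_statements() -> list:
    """SQL issued for one hunt step, in order (column names are hypothetical)."""
    return [
        "BEGIN",
        # SET LOCAL is scoped to this transaction, so it is safe when PgBouncer
        # hands the server connection to another client afterwards.
        "SET LOCAL synchronous_commit = off",
        "INSERT INTO events (hunt_id, player_id, action) VALUES (%s, %s, %s)",
        "COMMIT",
        # Record the WAL position after the step for the end-of-hunt check.
        "SELECT pg_current_wal_lsn()",
    ]


shard = shard_for("hunt-42")  # an integer in [0, 32)
```

With pool_mode=transaction, a session-level SET would leak to whichever client gets the server connection next, which is why the sketch scopes the setting with SET LOCAL inside the transaction.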

Layer 3 – Query side (reads)
Two materialized-view services subscribe to the CDC topics:

– inventory-materializer consumes cdc.players_current, computes live inventory per hunt, and writes to Redis Cluster 7.2 with TTL 10



