The Moment We Realized Our Treasure Hunt Engine Was Lying to Us

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The engine began as a single Elixir cluster running on 9 beefy bare-metal nodes in Frankfurt, each with 256 GB RAM and 64 cores. We used Redis for pub/sub to broadcast hunt progress and PostgreSQL for authoritative state. Our core loop was:

Player action → Phoenix channel → Redis pub
All live clients receive update → push to browser via WebSocket
Every 5 seconds, the server snapshots Redis state into PostgreSQL to survive restarts

The first red flag appeared when player count passed 150k. PostgreSQL replication lag started climbing past 8 seconds. We added more replicas, but the lag just moved downstream: now Redis replication lag became the bottleneck. We tried Read Replicas for PostgreSQL reads, but the ORM churned out enough N+1 queries to bring a 32-core replica to its knees. The Veltrix docs boasted about horizontal sharding, but they never mentioned the 20ms+ tail latency that emerged once we sharded PostgreSQL into 40 databases.

At 210k players, the Phoenix nodes began hitting beam.smp memory pressure alarms. We traced it to a single process: the stateful GenServer that owned each hunt instance. Each hunt could have 1M active players, and that GenServer kept the entire player set in memory as a map. When 3,000 concurrent hunts ran, each with an average of 70 players, the GenServer memory ballooned to 12 GB per node. The cluster simply couldnt GC fast enough under load. We werent solving scalability; we were fighting the garbage collector.

What We Tried First (And Why It Failed)

First, we tried vertical scaling. We doubled RAM on each node to 512 GB and added more cores. The immediate effect was a 12% reduction in GC pauses, but the hunt coordinator dashboard still froze whenever Redis hit 100k concurrent connections. The Redis instance itself was using 320 GB RAM and a single thread for persistence. We enabled AOF every second, which doubled disk I/O and caused fsync stalls under load.

Next, we moved to Redis Cluster with 16 shards. The cluster handled the volume, but our Phoenix app now needed to fan out 16 connections per broadcast update. The latency p99 jumped from 45ms to 180ms because each hunt update had to fan out to every shard. Worse, if one shard became slow, the entire hunt coordinator UI stuttered—a cascading latency disaster.

We next tried Kafka as a fan-out layer. We sharded hunts into 200 Kafka partitions and wrote a consumer service in Go that pushed updates to WebSocket clients. The Kafka pipeline reduced Redis pub/sub load, but the consumer service became a memory black hole: each consumer kept a 500 MB buffer per hunt, and 3,000 active hunts meant 1.5 TB RAM for buffers alone. We hit OOM again, this time on the consumer pods.

Finally, we tried PostgreSQL Citus. We sharded the hunt tables by hunt_id, but the ORMs join queries exploded. A single hunt coordinator view, which once took 12ms, now took 2.3 seconds because it had to fan out across 200 shards. The Citus planner admitted it couldnt push down the query; it pulled all rows into a single worker and then joined in memory. The experiment lasted 72 hours before we rolled it back.

Each attempt solved part of the problem and exposed a new bottleneck. The docs never warned us that scaling the treasure hunt engine would require trading latency for memory, or durability for fan-out complexity. We were optimizing for the wrong axis: we needed to stop trying to scale a stateful monolith and instead redesign the boundaries.

The Architecture Decision

We abandoned the monolith completely and split the engine into three bounded contexts:

Hunt State Service (HSS): a stateless Go service that holds no hunt state. It receives player actions via gRPC, validates them against hunt rules (stored in PostgreSQL), and then emits immutable events to Kafka. Each event is a Protobuf with hunt_id, player_id, action, and version vector. The HSS never keeps state; its pure computation. We run 30 replicas behind an NLB, and each replica handles ~46k events per second at p99 <20ms.
Hunt State Store (HSS): a sharded PostgreSQL cluster using Citus, but only for authoritative hunt state. We turned off ORM joins; every read and write is a single hunt_id lookup. We pre-calculate hunt scoreboards and cache them in Redis, but only as read-through caches. If a scoreboard cache misses, we recompute it from the event log, not from the ORM. We shard by hunt_id modulo 256, which gives us even distribution and keeps each shard under 200 GB. We keep a hot standby in the same AZ for failover.
Event Fan-out Service (EFS): a Go service that subscribes to the Kafka topic hunt.events and pushes updates to WebSocket clients via NATS JetStream. NATS JetStream keeps a 24-hour log of each hunts events in memory-mapped files. If a client reconnects, EFS streams events from NATS instead of recomputing state. EFS runs 40 replicas; each replica handles ~35k connections. NATS JetStream gives us fan-out at p99 <15ms even when 300k clients reconnect simultaneously.

The key decision was to treat the hunt as an immutable event log. Every player