12 Years Later, I Still Use the Same Event Bus for Treasure Hunts

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

What we needed was a single source of truth for every spawn, capture, drop, and teleport event that could be replayed reliably for anti-cheat audits, replay UIs, and leaderboard recalculations. Naïve designs used PostgreSQL LISTEN/NOTIFY, but the 140-byte JSON blobs multiplied into 180 GB/day of write-ahead log we had to ship to replicas. Replica lag grew to 3.2 seconds at 8,000 TPS, and players were literally teleporting into walls because the follower node hadn't caught up. Our metrics dashboard screamed red at a p99 of 720 ms for event delivery, far from the 50 ms ceiling we promised in the SLA.

What We Tried First (And Why It Failed)

We first tried Redis Streams with consumer groups. It looked perfect on the whiteboard: XADD with MAXLEN=1000000, XREADGROUP for each shard, and automatic trimming. Within a week we hit the wall: consumer lag stacked up like a Los Angeles freeway whenever a single hunt zone got popular. The Redis process pegged a CPU core at 100% during compaction, and XCLAIM started returning MOVED errors because slots shifted under us. The ops team rolled back to PostgreSQL, but now the lag was 3 seconds and climbing. Back to square one.

The worst part was the replay code. Every audit tool demanded deterministic replay, yet our sharded Redis Streams setup gave us no total order across shards. Three different operators wrote three different replay scripts that produced three different leaderboards. Customers noticed immediately; one guild posted a video of their score dropping 14,000 points after the replay script deduplicated spawns in the wrong order.
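To make the dedup-order problem concrete, here is a minimal Go sketch of what "deterministic replay" has to mean. The Event fields (ID, Seq, PlayerID, Points) are hypothetical names for illustration, not our real schema; the only assumption that matters is that every event carries a position from one ordered log.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical event shape; the field names are assumptions.
type Event struct {
	ID       string // unique event ID used for deduplication
	Seq      int64  // position in the ordered log (for example, an offset)
	PlayerID string
	Points   int
}

// Replay orders events by log position, keeps only the earliest occurrence
// of each event ID, and sums points per player. Because it sorts a copy
// first, the result does not depend on the order the caller supplies.
func Replay(events []Event) map[string]int {
	ordered := make([]Event, len(events))
	copy(ordered, events)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Seq != ordered[j].Seq {
			return ordered[i].Seq < ordered[j].Seq
		}
		return ordered[i].ID < ordered[j].ID // tie-break so the sort is total
	})

	seen := make(map[string]bool)
	scores := make(map[string]int)
	for _, e := range ordered {
		if seen[e.ID] {
			continue // a later duplicate never overrides the first write
		}
		seen[e.ID] = true
		scores[e.PlayerID] += e.Points
	}
	return scores
}

func main() {
	inOrder := []Event{
		{ID: "spawn-1", Seq: 1, PlayerID: "ana", Points: 100},
		{ID: "spawn-1", Seq: 2, PlayerID: "bo", Points: 100}, // duplicate delivery
		{ID: "spawn-2", Seq: 3, PlayerID: "bo", Points: 50},
	}
	shuffled := []Event{inOrder[2], inOrder[1], inOrder[0]}

	// Both replays credit ana with 100 and bo with 50.
	fmt.Println(Replay(inOrder))
	fmt.Println(Replay(shuffled))
}
```

A script that deduplicates in arrival order instead would credit bo with 150 on the shuffled input, which is the kind of divergence that produced three different leaderboards.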

The Architecture Decision

We needed a durable, ordered log that still allowed horizontal readers. Apache Kafka 3.7 with idempotent producers and exactly-once semantics fit the bill. We partitioned the topic hunt-events by hunt_id modulo 32, so each hunt's events stayed in one partition and kept a total order within it. We set acks=all and linger.ms=5 so we could still hit 70k msgs/sec on three c6g.4xlarge brokers. To keep the wall-clock latency under 50 ms at p99 we left retention.ms at seven days but capped segment.bytes at 1 GB to shorten log-compaction sweeps.
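The partitioning rule is simple enough to sketch in a few lines of Go. This is an illustration under assumptions, not our producer code: PartitionFor and ProducerSettings are hypothetical helpers, acks and linger.ms come from the paragraph above, and enable.idempotence is the standard Kafka producer property for idempotent producers.

```go
package main

import "fmt"

// numPartitions matches the "hunt_id modulo 32" rule described in the text.
const numPartitions = 32

// PartitionFor maps a hunt to a partition. Every event of a given hunt maps
// to the same partition, so Kafka's per-partition ordering covers the hunt.
func PartitionFor(huntID uint64) int {
	return int(huntID % numPartitions)
}

// ProducerSettings returns the producer properties named in the text, plus
// enable.idempotence for the idempotent producer. Treat it as a sketch of
// the relevant knobs, not a complete configuration.
func ProducerSettings() map[string]string {
	return map[string]string{
		"acks":               "all",
		"linger.ms":          "5",
		"enable.idempotence": "true",
	}
}

func main() {
	// Hunts 7, 39, and 71 share partition 7; hunt 8 lands on partition 8.
	for _, id := range []uint64{7, 39, 71, 8} {
		fmt.Printf("hunt %d -> partition %d\n", id, PartitionFor(id))
	}
	fmt.Println(ProducerSettings())
}
```

Note that ordering is only guaranteed inside a partition: one hunt is totally ordered, but several hunts share each partition, and nothing orders events across two different partitions.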

For the replay problem we built a tiny Go worker called hunt-replay that consumed from the compacted topic and wrote into a materialized view in ClickHouse replica-1. The table was sorted with ORDER BY (hunt_id, hunt_timestamp), which ClickHouse also uses as the primary key by default, so every replay query ran in under 400 ms. We also added a second topic called audit-events with one partition per hunt so auditors could replay without touching the primary log.

The trade-off was cost: three Kafka brokers plus three ClickHouse replicas cost us $4,200 a month compared to the $1,800 we spent on Redis. But the SLA held, and the audits stopped coming at 3 A.M.

What The Numbers Said After

Seventeen days after the Kafka migration we measured:

p99 event delivery latency: 42 ms (down from 720 ms)
Replica lag on ClickHouse: 0 ms (materialized views updated within 500 ms of Kafka commit)
Audit replay time: 1.4 minutes (down from 23 minutes) on a 1.2 million event hunt
Cost per million events: $0.0021 (down from $0.0087) once we stopped storing raw WAL snapshots

The most surprising metric came from anti-cheat: we detected three auto-clicker scripts within 48 hours of turning on deterministic replay because their spawn patterns now had to follow the immutable log. No one had caught them before.

What I Would Do Differently

I would not have given each hunt zone its own partition key. During the 2025 Lunar New Year spike we had 14,000 concurrent zones and our Kafka cluster maxed out at 60 MB/s ingress. Adding hunt_zone to the partition key gave us 2,000 partitions and smoothed the load, but it introduced cross-partition ordering problems for leaderboards spanning multiple zones. In hindsight we should have kept hunt_id as the only partition key and used a separate Kafka Streams table for zone-local queries.
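Here is a rough Go sketch of why the compound key broke ordering. Everything in it is illustrative: the function names are hypothetical, and FNV-1a stands in for a real partitioner (Kafka's default Java partitioner hashes keys with murmur2). The 32 and 2,000 partition counts are the ones mentioned in this post.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// PartitionByHunt keys on hunt_id only: one hunt, one partition.
func PartitionByHunt(huntID uint64, partitions int) int {
	return int(huntID % uint64(partitions))
}

// PartitionByHuntZone keys on hunt_id plus zone. FNV-1a is an assumption
// made for this sketch, not the hash a Kafka client would use.
func PartitionByHuntZone(huntID uint64, zone string, partitions int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d/%s", huntID, zone)
	return int(h.Sum32() % uint32(partitions))
}

// DistinctPartitions counts how many partitions one hunt's zones touch
// under the compound key.
func DistinctPartitions(huntID uint64, zones []string, partitions int) int {
	seen := make(map[int]bool)
	for _, z := range zones {
		seen[PartitionByHuntZone(huntID, z, partitions)] = true
	}
	return len(seen)
}

func main() {
	zones := make([]string, 0, 50)
	for i := 0; i < 50; i++ {
		zones = append(zones, fmt.Sprintf("zone-%d", i))
	}

	fmt.Println("hunt-only key, partition:", PartitionByHunt(42, 32))
	fmt.Println("hunt+zone key, partitions touched:",
		DistinctPartitions(42, zones, 2000))
}
```

With the hunt-only key a leaderboard reads one partition in log order. With the compound key the same hunt is scattered across many partitions, so any cross-zone leaderboard has to merge them by timestamp and loses the ordering guarantee the log was supposed to provide.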

The second regret is ClickHouse. It solved the replay problem but became a single point of failure during the 2025 Pride event when a rogue UPDATE cascaded into a partition explosion. Today I would pair Kafka with Materialize running on the same rack so we can replay into the same materialized views without the storage tax of ClickHouse replicas.