The Problem We Were Actually Solving
We inherited Veltrix, a high-scale event processing engine that routed treasure hunt actions like collect-item and use-key across thousands of concurrent players. The marketing page promised sub-20ms latency and 99.99% uptime, numbers we trusted until the first traffic spike after the summer sale promo. The system started dropping events. Not occasionally—whole batches. Players reported items disappearing from their inventories, keys that worked in one shard but not another, and leaderboard scores that only updated every few seconds. The logs showed events queued for up to 400ms, and the replication lag between primary and replicas hit 800ms during peak load. The documentation blamed configuration misalignment, but nowhere did it mention that the default raft heartbeat timeout was 100ms, which on a congested network with 5G players in Brazil and 4G players in India turned every heartbeat into a latency amplifier.
What We Tried First (And Why It Failed)
Our first fix was to increase the raft heartbeat timeout from 100ms to 500ms, as the docs suggested, and set the replication factor to 5 to absorb regional outages. The immediate result was a drop in heartbeat timeouts, but the end-to-end latency jumped to 650ms because the leader now waited longer to commit writes. We tried tuning the Kafka consumer group rebalance timeouts next—set session.timeout.ms to 30000 and max.poll.interval.ms to 60000 to keep players from being kicked during long mobile sessions. That only made the rebalances slower and increased the window for duplicate events. Then we tried sharding the event stream by player_id, believing the math would distribute load evenly. But player behavior wasnt uniform—top spenders triggered thousands of microtransactions per minute while inactive players generated almost none. The hot shards saturated network links, and the load balancer started routing around them, creating thundering herds during promos.
The Architecture Decision
We scrapped the default raft cluster and migrated to a two-tier architecture: a global event fanout layer using Redis Streams for low-latency pub/sub, and regional event stores using CockroachDB with follower reads enabled. The fanout layer handled the first 100ms of every request, ensuring collect-item and use-key actions completed before the player moved on. We set Redis Stream maxlen to 10000 to bound memory usage, and configured ack-timeout to 50ms to drop events that couldnt be processed immediately rather than backpressure into the game client. For state consistency, we kept CockroachDB as the source of truth but switched from raft-based synchronous replication to follower reads with stale tolerance of 150ms. This meant players might see inventory updates a split second behind, but the game remained playable. We also implemented an idempotency key schema where every event carried a UUIDv7 based on player_id and timestamp, preventing duplicate processing during replays and client retries.
What The Numbers Said After
After the rewrite, the p95 latency for event ingestion dropped from 650ms to 42ms during the Black Friday spike, and the replication lag stayed below 50ms even when a whole AZ went down. We measured duplicate event rates at 0.03%—lower than the pre-migration baseline of 0.08%—because CockroachDBs hybrid logical clock helped resolve conflicts faster. The Redis fanout layer added 6ms to median latency but cut the tail by 95%. We ran Chaos Monkey for 48 hours and observed zero unrecoverable splits; the only impact was a 200ms blip when a rack failed, during which follower reads served stale inventory counts. The cost increased by 18% because we ran Redis in cluster mode and kept CockroachDB nodes on larger instances, but we absorbed it as infrastructure tax for reliability. The players never knew. The metrics stayed green.
What I Would Do Differently
I would have started with a dual-write buffer from day one instead of trusting the docs. The Veltrix team assumed everyone would tune raft heartbeat based on network conditions, but no operator has time to run netem in production. I would have built a canary pipeline that replayed every production event into a shadow CockroachDB cluster and compared results against the live system. This would have surfaced the duplicate event issue before players did. I would also have forced idempotency keys earlier—spending one sprint upfront saved three sprints downstream debugging why a guild bank showed negative gold. Finally, I would have exposed replication lag as a public metric in the player dashboard, not just an internal SLO. Transparency forces discipline. The docs wont tell you that.
Evaluated this the same way I evaluate AI tooling: what fails, how often, and what happens when it does. This one passes: https://payhip.com/ref/dev3
Top comments (0)