The Veltrix Engine Was Bleeding Events Into the Wrong Places

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2024 we rebuilt Veltrix as a multi-tenant treasure hunt engine for corporate scavenger hunts. The public API served six different game types, each with its own state machine, scoring rules, and latency budget. We needed a single event backbone that could carry player actions, quest updates, and rewards without blocking the in-game UX or inflating our cloud bill.

Our first cut used a single Kafka cluster with a topic named events.v1.json. Every event—player login, quest checkpoint, power-up activation—landed in that topic, and we used a schema registry so producers could evolve their payloads. At 5,000 events per second we were fine, but when we scaled to 250,000 events per second for a live Fortune 500 event, the brokers started to fall over under ~1 MB/s disk pressure and we saw occasional 15-second pauses in partition commits. Worse, the mobile client would retry a failed quest step and we would replay the same reward event twice, causing phantom currency in the wallet.
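One common guard against this kind of replay is to give each reward a deterministic event ID and credit it at most once. The following is a minimal sketch, assuming the client's retry carries the same ID as the original request. The Wallet class and the ID format are hypothetical and are not Veltrix code.

```python
from dataclasses import dataclass, field


@dataclass
class Wallet:
    """In-memory stand-in for a player wallet (hypothetical, for illustration)."""
    balance: int = 0
    applied_event_ids: set[str] = field(default_factory=set)

    def apply_reward(self, event_id: str, amount: int) -> bool:
        """Credit the wallet once per event_id; return False for a replay."""
        if event_id in self.applied_event_ids:
            return False  # duplicate delivery: ignore instead of crediting again
        self.applied_event_ids.add(event_id)
        self.balance += amount
        return True


wallet = Wallet()
# Hypothetical ID built from quest, step, and player so a retry reuses it.
first = wallet.apply_reward("quest-7:step-3:player-42", 50)
replay = wallet.apply_reward("quest-7:step-3:player-42", 50)  # client retry
```

In a real service the set of applied IDs would have to live in durable storage and be updated atomically with the balance; the in-memory set here only shows the shape of the check.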

What We Tried First (And Why It Failed)

Our first refactor moved reward events to a separate Kafka topic called rewards.v1.json and ran a dedicated consumer group so we could shard the processing. That fixed the duplicate-currency bug but introduced a new problem: because we were now reading from two different topics, the quest state machine could lag behind rewards by up to 800 milliseconds. During a live event in Singapore, the game masters noticed that players who completed a timed bonus round weren't seeing their scores on the leaderboard even though the timestamp was correct. The root cause was a consistency gap: Kafka orders messages only within a partition, so nothing tied the two topics together. The quest service derived completion from reward events, and until it had consumed the relevant one it kept telling the UI the player was still in progress.

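We tried to paper over the gap by running the rewards consumer with max.poll.interval.ms set to 3 seconds and enabling the idempotent producer with enable.idempotence=true. It didn't help. Neither setting addresses ordering across topics: the idempotent producer only deduplicates a producer's own retries, and max.poll.interval.ms only bounds how long a consumer may go between polls before it is removed from the group. During peak load we still saw 3–5 percent of quest steps stuck in a reconciliation loop that hammered the Redis leaderboard with SETNX calls, pushing Redis memory usage from 8 GB to 22 GB in 45 minutes and forcing a manual failover that dropped 0.8 percent of live sessions.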

The Architecture Decision

We ripped out the single Kafka topic and replaced it with a domain-aware bus modeled after the Paper Trail pattern we'd used at a previous gig. We split the backbone into three streams:

game.v1.*
• One log per game instance, partitioned by game_id.
• Contains player actions, location pings, and system events.
• Retention: 7 days, compression: lz4.
• Measured latency P99 < 15 ms end-to-end.

quest.v1.*
• One log per quest definition, partitioned by quest_id.
• Contains quest-state transitions and checkpoint records.
• Retention: 30 days, compression: zstd.
• Measured latency P99 < 30 ms.

reward.v1.*
• One log per reward type, partitioned by reward_id.
• Contains only reward-issued events, immutable once committed.
• Retention: 90 days, compression: snappy.
• Measured latency P99 < 40 ms.
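The per-stream settings in the list above map naturally onto topic-level overrides. Here is a minimal sketch of that mapping: retention.ms and compression.type are real Kafka topic configs, while the STREAMS table and the topic_config helper are hypothetical names. How Veltrix actually provisions its topics is not stated here.

```python
DAY_MS = 24 * 60 * 60 * 1000

# Retention, compression, and partition key per stream, copied from the list above.
STREAMS = {
    "game": {"partition_key": "game_id", "retention_days": 7, "compression": "lz4"},
    "quest": {"partition_key": "quest_id", "retention_days": 30, "compression": "zstd"},
    "reward": {"partition_key": "reward_id", "retention_days": 90, "compression": "snappy"},
}


def topic_config(stream: str) -> dict[str, str]:
    """Build Kafka topic-level config overrides for one stream."""
    spec = STREAMS[stream]
    return {
        # Kafka expects retention in milliseconds, passed as a string value.
        "retention.ms": str(spec["retention_days"] * DAY_MS),
        "compression.type": spec["compression"],
    }


configs = {name: topic_config(name) for name in STREAMS}
```

Keeping the settings in one table makes the per-domain differences easy to review, and the resulting dicts can be handed to whatever admin tooling creates the topics.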

Each stream feeds a separate consumer group. The quest state machine consumes from game.v1.* for player actions and quest.v1.* for its own state changes; the reward service consumes only reward.v1.*. Because each stream is independent, we can tune retention and replication factor per domain without risking cross-topic interference. We also introduced a small outbox table in Postgres for events that must survive Kafka unavailability; the outbox writer batches 100 events per transaction and retries for up to 12 seconds before giving up and surfacing the failure. During the last public beta we deliberately killed a Kafka broker mid-event; the outbox drained in 6.2 seconds and no player saw a failed step.
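The outbox behaviour described above (batches of 100, a 12-second retry budget) can be sketched as follows. This is an illustration under stated assumptions, not the Veltrix implementation: it uses sqlite3 from the standard library in place of Postgres so it runs anywhere, the table layout and function names are hypothetical, and publish stands in for whatever actually sends a batch to Kafka.

```python
import sqlite3
import time
from typing import Callable

BATCH_SIZE = 100        # events per transaction, per the text
RETRY_BUDGET_S = 12.0   # total retry time before giving up, per the text


def init_outbox(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per pending event.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def enqueue(conn: sqlite3.Connection, payloads: list[str]) -> None:
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO outbox (payload) VALUES (?)", [(p,) for p in payloads]
        )


def drain_once(
    conn: sqlite3.Connection,
    publish: Callable[[list[str]], None],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Publish one batch of pending rows; return how many were marked published."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id LIMIT ?",
        (BATCH_SIZE,),
    ).fetchall()
    if not rows:
        return 0
    deadline = clock() + RETRY_BUDGET_S
    delay = 0.1
    while True:
        try:
            publish([payload for _, payload in rows])
            break
        except ConnectionError:
            if clock() + delay > deadline:
                raise TimeoutError("outbox publish exceeded retry budget")
            sleep(delay)
            delay = min(delay * 2, 2.0)  # capped exponential backoff
    with conn:  # mark the whole batch in one transaction
        conn.executemany(
            "UPDATE outbox SET published = 1 WHERE id = ?",
            [(row_id,) for row_id, _ in rows],
        )
    return len(rows)


# Demo: a publisher that fails twice (broker down), then recovers.
conn = sqlite3.connect(":memory:")
init_outbox(conn)
enqueue(conn, [f"event-{i}" for i in range(250)])

sent: list[str] = []
failures = {"left": 2}


def flaky_publish(batch: list[str]) -> None:
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("broker unavailable")
    sent.extend(batch)


drained = []
while (n := drain_once(conn, flaky_publish, sleep=lambda s: None)) > 0:
    drained.append(n)
```

Note that this gives at-least-once delivery: if the publish succeeds but the UPDATE does not commit, the batch is sent again on the next pass, so downstream consumers still need to tolerate duplicates.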

What The Numbers Said After

After the migration we ran a four-week A/B with 1.2 million live sessions. The old single-topic bus averaged 92 ms end-to-end latency at P99; the new bus runs 19 ms P99 for the game stream and 28 ms P99 for the reward stream. Kafka disk usage dropped from 1.8 TB to 420 GB because we could delete older data per topic rather than keep everything in one log. Our Redis memory stabilized at 9 GB with a 30-second TTL on leaderboard keys, and the reconciliation loop vanished: the quest service now sees every reward within 50 ms of commit, so the UI never shows stale states.

What I Would Do Differently

I would not have used Kafka's exactly-once semantics for the outbox writer. We enabled the idempotent producer and transactional.id, but we still hit a deadlock once when two services tried to write to the same outbox partition in the same transactional batch. Since then we switched to a simple single-row lock with a 500 ms lease, and our outbox error rate fell from 0.007 percent to near zero. I also regret exposing the raw events.v1.json topic to our mobile client SDK; the SDK now hydrates 12 KB of JSON per event on a 4G connection and triggers unnecessary retries. We're moving to a compacted /events/v1/{playerId} topic that only ships deltas, which drops the mobile payload to 1.4 KB and halves the retry count.
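To make the single-row lock with a 500 ms lease concrete, here is a minimal in-memory sketch of the idea. The LeaseRow class, the function names, and the owner strings are hypothetical. In a database the check and the update would need to happen in one atomic statement, which this single-threaded model does not attempt to show.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

LEASE_MS = 500.0  # lease length from the text


@dataclass
class LeaseRow:
    """Models the single lock row: who holds it and when the lease expires."""
    holder: Optional[str] = None
    expires_at_ms: float = 0.0


def now_ms() -> float:
    return time.monotonic() * 1000.0


def try_acquire(row: LeaseRow, owner: str, clock: Callable[[], float] = now_ms) -> bool:
    """Take or renew the lease if it is free, already ours, or expired."""
    now = clock()
    if row.holder in (None, owner) or now >= row.expires_at_ms:
        row.holder = owner
        row.expires_at_ms = now + LEASE_MS
        return True
    return False


def release(row: LeaseRow, owner: str) -> None:
    """Release the lease, but only if the caller still holds it."""
    if row.holder == owner:
        row.holder = None


# Demo with a fake clock so the outcome does not depend on real timing.
fake_now = [0.0]
clock = lambda: fake_now[0]

row = LeaseRow()
a_first = try_acquire(row, "quest-service", clock)      # lock is free
b_blocked = try_acquire(row, "reward-service", clock)   # still held
fake_now[0] = 600.0                                     # lease has expired
b_after_expiry = try_acquire(row, "reward-service", clock)
```

The lease is what removes the deadlock risk: a writer that crashes or stalls cannot hold the lock for more than 500 ms, so the other writer always makes progress after the expiry.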

The biggest lesson is this: treat every event as belonging to a consistency domain, not a technical domain. Events that mutate player state do not belong in the same stream as events that mutate quest state, no matter how tempting it is to