The Veltrix Failure Mode That Always Kills Event-Driven Treasure Hunts

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We needed to track three events per player per second while maintaining exactly-once semantics for physical-product redemption. The business required both fast path (claim a product) and slow path (audit trail for customs). Our first model treated every interaction as a flat Kafka record with a composite key of user_id + hunt_id + event_type. At 200k players, that gave us 60M keys in Redis sorted sets and 2.1TB of compressed Kafka segments per day. The query to join user clicks to product claims took 47 seconds on a 16-core RDS instance. That latency killed our warehouse pipeline; by the time the join finished, the next batch of events had already arrived and invalidated the result.

What We Tried First (And Why It Failed)

We swapped to a columnar warehouse—S3 + Athena—and ran the same query. The planner chose a Merge Join on hunt_id, but hunt_id was not the leading sort key on the fact table, so Athena spilled 84GB to disk and timed out after 18 minutes. We tried Redshift with a 16-node dc2.8xlarge cluster. The WLM queue filled up with long-running analytical queries, causing the API layer to see 94th percentile p99 latency of 2.4 seconds—above the 500 ms SLA for checkout. The engineering intuition was correct: we needed a separate analytics store, but the columnar approach introduced too much skew.

The Architecture Decision

We rebuilt the event envelope using the Event Store DB 22.10 event-sourcing pattern: each aggregate had its own stream, and we pushed the slow-path audit directly into the same stream as a custom metadata section. The key decision was to partition the stream by user_id only, not hunt_id. This inverted the write pattern: writes became 99.4 % sequential in the user stream, while reads became a single key lookup in Redis for the users current state. We kept Kafka for high-throughput fan-out, but we used it solely as a write-behind log, not as the source of truth. The EventRouter v2 now emitted two message types: real-time claims (to Kafka) and immutable audit entries (to Event Store DB). The audit stream was immediately replicated to a TimescaleDB hypertable for time-series queries. A small polling service watched the audit stream and updated a Redis Sorted Set with user-level metrics every 500 ms.

The tradeoff was operational complexity: we now ran three systems where we used to run one, and every redeploy required verifying idempotency across all three. We mitigated by wrapping the deploy in a custom ArgoCD sync hook that checked the consistency of the Redis Sorted Set against the TimescaleDB count before flipping the ingress.

What The Numbers Said After

Within two weeks we saw:

Average claim-to-dispatch latency drop from 47 seconds to 140 milliseconds.
Redis memory usage fell from 3.2 TB to 420 GB because we stopped storing hunt-level aggregates.
The Event Store DB cluster (3 nodes, i3en.3xlarge) handled 4.3 million events per minute with 99.9 % tail latency under 20 ms. We set the tombstone retention at 7 days; it cost us $800/month in S3 storage and saved us from rebuilding the stream three times during the campaign.
The TimescaleDB hypertable grew at 1.2 GB/day and the query planner chose a BRIN index scan that completed in under 300 ms for our audit report.

The only outage that quarter was when an operator accidentally ran a DELETE on the TimescaleDB hypertable during a maintenance window. The Event Store DB kept the audit stream, so we replayed the last 48 hours of events in 20 minutes and rebuilt the hypertable. Downtime: 3 minutes.

What I Would Do Differently

I would not have introduced the TimescaleDB layer for the treasure-hunt engine. The hypertable gave us 300 ms queries, but it added another replication lag source and forced us to write a custom backfill job that ran every campaign. Next time I would keep the audit stream in Event Store DB and run analytical queries directly against it using the $all stream projection. I would also replace the custom ArgoCD hook with a Kubernetes admission controller that verifies the Redis Sorted Set cardinality against the Event Store DB count before allowing the deployment to proceed. The hook we used was fragile; a single race during redeploy caused a split-brain state where Redis reported 199,842 players but Event Store DB showed 200,000. That mismatch cost us $24k in duplicate product dispatches before we caught it.