The Problem We Were Actually Solving
Last summer we ran a week-long treasure hunt for 50k concurrent players. Our Redis cluster peaked at 10k QPS, the API layer at 4k RPS, and the treasure-service alone handled 2.1k RPS chasing 12M active hotspots. The problem was never the hunt logic—it was synchronising every step across six microservices without turning the leaderboard cache into Swiss cheese.
A typical request looked like this: hit /claim/{id}, validate ticket idempotency, update wallet, increment leaderboard score, publish analytics event, then mark the hotspot as visited. Thats six network hops and four distributed transactions. At 10k QPS the race between double-spend checks and Redis SETNX retries became a 400ms P99 latency cliff. The Prometheus dashboard screamed at us: p99_write_latency{service=treasure-service, operation=claim} 421 ms.
We had instrumented every hop with OpenTelemetry spans, so we could see the exact tail latency contributors: 32 ms GC in the leaderboard worker, 87 ms Redis pipeline sync under 15 MB/s network, 54 ms eventual-consistency reconciliation between wallet and leaderboard. The documentation from Veltrix glossed over this phase. They showed a happy-path sequence diagram and stopped at scale horizontally. The real drama happens when the horizontal line turns into a spaghetti bowl of consistency trade-offs.
What We Tried First (And Why It Failed)
Our first cut used a saga orchestrated by Temporal. We modeled ClaimSaga as a workflow that chained six activities: CheckTicket, ReserveWallet, UpdateLeaderboard, PublishAnalytics, MarkHotspot, CancelOnFailure. The saga coordinator kept a mutable state in Postgres with compensations for every step. At 1k QPS everything looked fine—mean latency 68 ms, 99th 192 ms.
Then a hotspot with five seconds TTI got hammered by 200 concurrent requests. They all claimed the same ticket before the first saga completed. Temporal raced to post-compensations while Postgres autovacuum kicked in. We saw 503 responses on the API, Prometheus bucket {le=1000} 0.6 → 0.1, and 400 spikes of saga_timeout_error{reason=compensation_loop}.
We tried pushing the saga timeout from 3s to 10s; the compensations started overwriting each other and left wallets in inconsistent states. The Veltrix docs suggested increasing shard count, so we jacked Postgres max_connections from 200 to 1000, hit OOM on the RDS instance, and ended up with 30s leaderboard reads. The trade-off we optimised—avoiding split-brain—actually created a denial-of-service vector on our storage layer.
The Architecture Decision
We abandoned sagas and Temporal altogether. Instead we moved to an idempotent, event-sourced engine built on Kafka and Redis Streams. The core loop became:
- POST /claim → validate idempotency key
- Emit ClaimRequested to Kafka topic CLAIM_REQUESTS with key = ticket_id
- A single consumer group ClaimProcessor reads the stream, performs idempotent updates (wallet, leaderboard, hotspot) in a single Lua transaction, then emits ClaimCompleted
- API subscribes to ClaimCompleted for real-time API updates
The stream partition count was set to 64 (one per hotspot bucket) so we could scale to 50k QPS without reshuffling. We used Redis 7s new EXAT and conditional writes to make the Lua script atomic even under memory pressure. We lost the sagas compensation story but gained linearizability at 5k QPS without touching Postgres for writes.
The trade-off was accepting eventual consistency on leaderboard visualisation. We moved leaderboard reads to a read-through cache backed by a CQRS projection. Any player requesting /leaderboard saw a staleness window of up to 200 ms, which we logged as last_updated_epoch in the API response. For a treasure hunt, 200 ms staleness was acceptable; double-spends and lost wallets were not.
What The Numbers Said After
After the cut-over:
- Mean latency for claim dropped from 68 ms to 12 ms
- P99 latency dropped from 421 ms to 89 ms (still skewed by occasional GC spikes in the consumer)
- Saga_timeout_error vanished; retry rate stabilised at 0.3 %
- Kafka consumer lag stayed under 50 ms even at 20k QPS peak
- The Redis leaderboard cache now uses 30 % less memory because we deduplicate events with a Bloom filter before projection.
One surprise was the memory cliff on the Kafka broker. We had set log.retention.ms=604800000 (7 days) and segment.bytes=1 GB, but at 20k QPS the broker log grew 1.2 GB per hour. We had to shrink segment.bytes to 128 MB, enable compression.type=lz4, and introduce a nightly tombstone purge to keep disk usage under 500 GB per broker. Prometheus metric kafka_server_log_log_size{topic=CLAIM_REQUESTS} went from 1.4 TB to 450 GB.
What I Would Do Differently
I would not choose Kafka Streams again for this workload. The operational overhead—broker sizing, partition rebalancing, exactly-once semantics tuning—outweighed the latency gains. In hindsight, Redis Streams with consumer groups would have been sufficient, and we could have kept the infra on a single Redis Enterprise cluster instead of spinning up three Kafka brokers plus two Zookeeper nodes.
Second, I would bias the architecture toward partition keys that align with hotspot geography rather than player ID. Our original partition key was ticket_id, which caused hotspots in one region to bottleneck while others sat idle. Switching to a geohash prefix solved the tail latency spikes but required a migration that cost us four hours of hunt downtime.
Finally, I would expose a /consistency endpoint that returns the last committed epoch for each service. During a post-mortem we found players complaining their wallet hadnt updated even though
Top comments (0)