The Day the Treasure Hunt Engine Decided to Lie to Us About Latency

#ai #webdev #programming #machinelearning

The Problem We Were Actually Solving

The marketing department wanted treasure hunts to feel instant. Not just responsive, but psychologically immediate: a sub-second confirmation that a chest was open, a key found, a prize unlocked. The CFO signed off on it because higher perceived speed meant higher revenue per session. The system had to return a 200 OK before the player's thumb finished lifting from the screen.

The default implementation used an event bus with a fan-out pattern. Every action (open chest, claim prize) was a message published to three downstream services: inventory, wallet, analytics. Each service had its own database. The promise was atomic consistency via the saga pattern with compensating transactions. In practice, the saga orchestrator added 80–140ms of round-trip latency under load. The marketing dashboard showed a 95th percentile latency of 85ms because it only measured the orchestrator's completion time, not the time until every downstream service had acknowledged.
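To make the measurement gap concrete, here is a minimal sketch of the two timestamps involved. It is a toy model, not our orchestrator: the service delays, the `downstream_ack` helper, and the assumption that the orchestrator counts itself done as soon as it has published are all hypothetical and chosen only for illustration.

```python
import asyncio
import time

# Hypothetical per-service acknowledgment delays in seconds (illustrative only).
DOWNSTREAM_DELAYS = {"inventory": 0.02, "wallet": 0.05, "analytics": 0.08}


async def downstream_ack(name: str, delay: float) -> str:
    """Stand-in for a downstream service consuming a message and acknowledging it."""
    await asyncio.sleep(delay)
    return name


async def handle_action() -> dict:
    start = time.perf_counter()
    # Fan out: hand the message to every downstream service without waiting.
    tasks = [
        asyncio.create_task(downstream_ack(name, delay))
        for name, delay in DOWNSTREAM_DELAYS.items()
    ]
    # What a dashboard sees if it only times the orchestrator (assumed here
    # to be finished once publishing is finished).
    orchestrator_s = time.perf_counter() - start
    # What consistency actually costs: every downstream has acknowledged.
    await asyncio.gather(*tasks)
    all_acked_s = time.perf_counter() - start
    return {"orchestrator_s": orchestrator_s, "all_acked_s": all_acked_s}


if __name__ == "__main__":
    print(asyncio.run(handle_action()))
```

The second number is bounded below by the slowest downstream service, which is why a dashboard built on the first number can look healthy while players wait.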

What We Tried First (And Why It Failed)

We tried sharding each service by player ID. That cut the fan-out path by 60%, but introduced a new failure mode: the dreaded double-credit edge case. A wallet service would process a claim and emit a success event, but die before emitting to analytics. The saga orchestrator, seeing the timeout, rolled back the wallet update. The player's balance was restored, but the success event had already been consumed by a separate dashboard loader that assumed it was authoritative. The CFO got a report showing revenue spikes that never materialized. We had to refund $87k in two weeks.

Then we tried Redis Streams with consumer groups. The streams promised ordered processing and exactly-once semantics. We turned off the saga orchestrator entirely. The first outage happened when a consumer group rebalance took 4.2 seconds. During that window, 1,800 duplicate treasure chest openings were processed because the consumer offsets didn't advance atomically with the acknowledgment. Our retry budget was 120ms, and the backlog grew faster than we could scale pods.
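The duplicate openings follow from a general property of acknowledge-after-processing delivery: if the side effect and the acknowledgment are separate steps, a consumer that dies between them leaves the message pending, and whoever picks it up next processes it again. A toy in-memory simulation of that gap follows. `ToyStream` and its methods are hypothetical stand-ins written for this post, not the Redis API.

```python
from collections import OrderedDict


class ToyStream:
    """In-memory stand-in for a stream with a pending list (not the Redis API)."""

    def __init__(self, messages):
        self.queue = list(messages)
        self.pending = OrderedDict()  # message id -> consumer currently holding it

    def read(self, consumer):
        """Deliver the next new message and mark it pending for this consumer."""
        if not self.queue:
            return None
        msg_id = self.queue.pop(0)
        self.pending[msg_id] = consumer
        return msg_id

    def ack(self, msg_id):
        """Remove a message from the pending list."""
        self.pending.pop(msg_id, None)

    def reclaim(self, new_consumer):
        """Hand every unacknowledged message to another consumer."""
        ids = list(self.pending)
        for msg_id in ids:
            self.pending[msg_id] = new_consumer
        return ids


def simulate():
    opened = []  # side effects: one entry per processed chest opening
    stream = ToyStream(["chest-1"])

    msg = stream.read("consumer-a")
    opened.append(msg)  # the side effect happens...
    # ...and consumer-a dies here, before it can call stream.ack(msg).

    for msg in stream.reclaim("consumer-b"):
        opened.append(msg)  # consumer-b processes the same message again
        stream.ack(msg)
    return opened
```

Calling `simulate()` returns the same chest ID twice: one message, two openings.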

The Architecture Decision

We ripped out the event bus and went back to one database transaction that updates inventory, wallet, and analytics in a single ACID block. The tradeoff: we can only use PostgreSQL. We sharded the primary key on player ID, and the entire treasure hunt operation is a single UPDATE statement with RETURNING. The 95th percentile latency improved to 15ms, but we sacrificed service independence. If the inventory schema changes, wallet breaks. If analytics needs a new column, the entire treasure hunt endpoint must deploy together.

We mitigated the coupling with feature flags. The endpoint first checks a LaunchDarkly flag: if disabled, it falls back to the event bus path with saga plus compensating transactions. We use the flag to gradually roll out the new path, but only to players with session IDs that are multiples of 7. That gives us a canary group of roughly 14% (one in seven), enough to catch edge cases without polluting the global success metrics. We also added a circuit breaker that flips to fallback if 500 errors exceed 0.3% in a 30-second window.
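The routing logic can be sketched roughly as follows. This is an illustration under assumptions, not our production code: the flag value is passed in as a plain boolean instead of calling the LaunchDarkly SDK, session IDs are assumed to be integers, any 5xx status is counted as an error, and the breaker is assumed to recover on its own once the window clears. `ErrorRateBreaker` and `choose_path` are hypothetical names.

```python
import time
from collections import deque


def in_canary(session_id: int) -> bool:
    """Route sessions whose ID is a multiple of 7 to the new path (1 in 7, about 14%)."""
    return session_id % 7 == 0


class ErrorRateBreaker:
    """Signals fallback when the error share in a sliding window exceeds a threshold."""

    def __init__(self, threshold=0.003, window_s=30.0, clock=time.monotonic):
        self.threshold = threshold  # 0.3% expressed as a fraction
        self.window_s = window_s
        self.clock = clock
        self.events = deque()  # (timestamp, is_error) pairs, oldest first
        self.errors = 0

    def _evict(self, now):
        # Drop observations that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            _, was_error = self.events.popleft()
            self.errors -= was_error

    def record(self, status_code: int) -> None:
        now = self.clock()
        is_error = status_code >= 500  # assumption: any 5xx counts as an error
        self.events.append((now, is_error))
        self.errors += is_error
        self._evict(now)

    def use_fallback(self) -> bool:
        self._evict(self.clock())
        if not self.events:
            return False
        return self.errors / len(self.events) > self.threshold


def choose_path(session_id: int, flag_enabled: bool, breaker: ErrorRateBreaker) -> str:
    """Pick the single-transaction path only for healthy canary traffic."""
    if flag_enabled and in_canary(session_id) and not breaker.use_fallback():
        return "single_transaction"
    return "saga"
```

Injecting the clock keeps the window logic testable without sleeping, and keeping the canary check a pure function of the session ID means a given session always lands on the same path.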

What The Numbers Said After

The new path ran for 47 days without a false-positive treasure award. The 95th percentile latency stayed at 15ms with 99.7% of requests completing in under 50ms. The database CPU spiked during peak hours, so we added a read replica in us-west-2 to absorb analytics reads. We also discovered that 12% of treasure hunts were actually run by bots: players using automation scripts. By abandoning the saga pattern, we removed two async steps where bots had previously exploited timing gaps to trigger double claims.

The real surprise was operational. Before, when the event bus lagged, we could restart a single service without touching the rest. Now, if the inventory schema has a breaking change, we must deploy the entire cluster. The rollback process is a full database restore from an S3 snapshot. We rehearsed the rollback on staging and it took 8 minutes 22 seconds. We decided to keep the old saga path behind the feature flag for emergencies only. It's the theatrical version of speed, but it's the one that can be rolled back in minutes.

What I Would Do Differently

I would have pushed back on the sub-second requirement from day one. The psychological truth is that anything under 100ms feels instant, and anything under 200ms feels acceptable. We spent three sprints chasing a 15ms gain that only mattered to marketing dashboards. If I had insisted on measuring perceived latency (time from tap to visual feedback rather than endpoint completion), we could have saved months of engineering drama.

I would also have refused the sharded Redis Streams experiment. Consumer group rebalances are non-deterministic under production load. The only streams that work at scale are ones where the partition key aligns with an immutable business key, such as player ID or session ID. If you can't guarantee that, avoid streams.

Finally, I would have built observability earlier. We added a custom Prometheus metric called treasure_latency_seconds_histogram that measures the time from tap to successful treasure claim confirmation on the client. That metric immediately exposed that 8% of players were seeing 300ms+ latency due to mobile network jitter, not our backend. We fixed it by pushing the confirmation modal only after the client received the 200 OK, not before. The marketing promise remained intact, but the engineering lie was finally visible.
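To show how a histogram surfaces that kind of tail, here is a stdlib-only sketch of cumulative "less than or equal" buckets in the Prometheus style. It does not use the `prometheus_client` library, and the bucket bounds, the `LatencyHistogram` class, and the `share_above` helper are hypothetical. Each observation is assumed to be the client-reported time from tap to confirmation, in seconds.

```python
import bisect

# Hypothetical upper bounds in seconds; each bucket holds values <= its bound.
BUCKETS = [0.05, 0.1, 0.2, 0.3, 0.5, 1.0]


class LatencyHistogram:
    """Counts tap-to-confirmation latencies into fixed buckets."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0

    def observe(self, seconds: float) -> None:
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += 1

    def share_above(self, bound: float) -> float:
        """Fraction of observations strictly above one of the configured bounds."""
        idx = self.bounds.index(bound)
        at_or_below = sum(self.counts[: idx + 1])
        return 1 - at_or_below / self.total if self.total else 0.0
```

With 92 observations at 40ms and 8 at 350ms, `share_above(0.3)` comes out at 0.08 up to floating-point rounding, which is the kind of tail an endpoint-only average hides.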