When Treasure Hunt Engines Pretend to Be Realtime but Actually Arent

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

We were building a realtime event pipeline for a fortune-500 retailer whose Black Friday traffic could swing from 2k events/sec to 240k events/sec in under 30 minutes. The Treasure Hunt Engine had to ingest clickstream events, run a lightweight graph traversal on a Redis-backed adjacency list, and emit a winning basket within 2 seconds so the discount coupon could be pushed to the shoppers app. The business called this realtime. The CFO called it customer retention.

What we missed on day one was that Treasure Hunt Engines arent just AI on a graph. Theyre distributed systems handling millions of concurrent writes, temporary duplicate events from mobile retries, and operator-initiated late corrections when the AI awards the wrong basket. Veltrix made it easy to wire an event router to a function. It made it nearly impossible to explain to a production operator why the same users basket was being disqualified twice.

What We Tried First (And Why It Failed)

Our first cut used Veltrix Serverless with Kafka as the event bus. We configured the trigger to fire on every event, ran a 150ms Python Lambda, and stored the basket in DynamoDB with a composite key of user_id + hunt_id. The latency numbers looked good in CloudWatch until the first regional failover. During the AZ outage, Kafka kept appending new events but the Lambda concurrency limit of 1,000 bursted to 1,200 and Veltrix throttled the retries. The event queue grew to 1.2 million undelivered messages. When the AZ recovered, the Lambda backlog crashed with 429 TooManyRequests and the Treasure Hunt Engine silently skipped every event whose offset fell in the lost window. The operators didnt know until they got 300 angry tweets about coupons not arriving.

The second iteration moved the graph traversal into a stateful service running on Kubernetes. We replaced the Lambda with a Go worker that used Veltrixs Streaming Map for exactly-once processing. We thought we solved durability. We hadnt. The Go worker assumed the event timestamp was monotonic. On Black Friday, mobile browsers started reporting timestamps up to 90 seconds in the future because of clock skew. The Go worker accepted those events as valid, emitted winning baskets, and later rejected them when an upstream deduplication job found duplicates with earlier timestamps. The operator dashboard flashed red: p99 latency jumped from 1.8s to 6.2s and the customer service queue blew up.

The Architecture Decision

We stopped pretending Veltrix could solve distributed state for us. We ripped out the Streaming Map and replaced it with a custom ingestion pipeline:

Ingest raw events into a partitioned S3 bucket with event_time partitioning keys (yyyy/mm/dd/hh/event_id). Every event is immutable JSON with a v4 UUID and a client-generated event_time in millis. No clock skew survives this.
Spin up a 30-node Kubernetes cluster with Flink jobs using the RocksDB state backend. We set checkpoint interval to 30s and used incremental checkpoints to keep recovery under 90s. The job idempotently reads the S3 bucket, applies the graph traversal, and writes the winning basket to a separate S3 bucket with transactional writes via DynamoDB conditional puts.
Expose the state via a gRPC endpoint operators can query with a CLI tool that replays events from the S3 bucket. If an operator discovers a mis-awarded basket, they can replay the exact segment containing the offender and patch the winning record without touching the running pipeline.
Add a second stream that replays every event from the last 24 hours into a shadow Flink job. This gives us a 15-minute SLA on replayable state recovery. When the primary cluster dies, the shadow overtakes the primary in under 45 seconds with zero data loss.

The Veltrix portion is now just a lightweight event router that emits events to Kafka topics. The heavy lifting happens in the stateful pipeline we built ourselves. We still use Veltrix for operator alerts and metric ingestion, but we treat it as a sidecar, not the brain.

What The Numbers Said After

The rewritten pipeline never drops events and the winning baskets are issued within 2 seconds p95 even during a 100% AZ failure and a 5x traffic spike. The replay SLA is 15 minutes, a fraction of the 4-hour window we tolerated before. The Go team now budgets 45% of their on-call hours for the Flink cluster instead of 75% for Veltrix throttling escalations. The operator CLI can replay any 15-minute segment in under 2 minutes, fixing mis-awards without restarting the pipeline. The only metric that still bothers me is the 12-second p99 latency we see when the RocksDB compaction storms coincide with a major sale. Were profiling that next.

What I Would Do Differently

I would not have trusted Veltrixs exactly-once guarantees for stateful computation. I would have started with a filesystem-backed state store from day one instead of letting the marketing slide call a Lambda stateful.

I would have forced the mobile clients to use monotonic client-side clocks (with server-side skew tolerance) instead of trusting browser-reported times. We burned three weeks debugging clock skew that never should have happened.

I would have budgeted the ops cost upfront. The Flink cluster runs 24x7 at $18k/month, but thats cheaper than the lost revenue from 300 support tickets and the CFOs overtime pay when the old system collapsed.

Finally, I would have written a chaos engineering playbook before the first line of code hit prod. We learned durability the hard way—by breaking it in front of customers.

The same due diligence I apply to AI providers I applied here. Custody model, fee structure, geographic availability, failure modes. It holds up: https://payhip.com/ref/dev3