Why the Treasure Hunt Demo Broke Every Query Tool We Fed It

#ai #webdev #programming #machinelearning

The Problem We Were Actually Solving

We were not building a demo. We needed to let Veltrix operators run A/B experiments on synthetic user journeys without melting the underlying SQL warehouse. The real question was how close we could push the warehouse to the AI inference layer before the planner started dropping predicates and the warehouse started returning rows that made no sense for the user journey.

The warehouse in question was a Snowflake XL on AWS, billed by the second. Our synthetic user model generated 250k journeys per minute during peak. The AI layer had to annotate each journey with intent tags (shopping, support, fraud) within 200 ms to stay ahead of the next batch. That was the operating envelope, not the sales slide.

What We Tried First (And Why It Failed)

First cut: put the intent model in a sidecar container next to the Spark cluster that generated the journeys. We picked ONNX Runtime v1.14 with a DistilBERT fine-tuned on our own corpus because the latency slide said 30 ms. Reality: ONNX packaged the tokenizer as a separate DLL. Tokenization alone took 85–110 ms on c6i.large instances, pushing the total inference time to 190 ms when the warehouse was cold and 280 ms when Snowflake decided to spike the warehouse cluster. The operator dashboards immediately showed orange pings; the business called it a red fire drill.

Worse, the tokenizer DLL leaked memory. After two hours on a 64-core cluster, each pod's RSS climbed to 2.4 GB, and the Kubernetes scheduler evicted five pods in a row. The warehouse downstream received duplicate rows with NULL intents, so every metric we exported was off by 7–12 %.

The Architecture Decision

We ripped out the sidecar entirely. Instead, the Spark jobs write raw event JSON to an S3 bucket every 60 seconds. A Lambda function (Python 3.12 runtime) picks up the new objects from the bucket, tokenizes them offline, and stores the tokenized blobs back in S3. A nightly Kubernetes job then loads the tokenized chunks into Snowflake as temporary tables. The AI inference happens inside Snowflake via Snowpark Container Services, so the raw data never leaves the warehouse boundary.
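For readers who want the shape of the Lambda step, here is a minimal sketch, not our production handler. The `raw/` and `tokenized/` prefixes, the `text` and `journey_id` fields, and the whitespace tokenizer are all assumptions made for illustration. Storage access is passed in as plain functions so the logic can be exercised without AWS.

```python
import json
from urllib.parse import unquote_plus

RAW_PREFIX = "raw/"              # hypothetical prefix for Spark output
TOKENIZED_PREFIX = "tokenized/"  # hypothetical prefix for tokenized blobs


def placeholder_tokenize(text):
    """Stand-in tokenizer (assumption); the real one is model-specific."""
    return text.lower().split()


def tokenized_key(raw_key):
    """Map a raw object key to the key of its tokenized counterpart."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"unexpected key: {raw_key!r}")
    return TOKENIZED_PREFIX + raw_key[len(RAW_PREFIX):]


def process_event(event, read_object, write_object, tokenize=placeholder_tokenize):
    """Tokenize every object named in an S3 event notification.

    read_object(bucket, key) -> str and write_object(bucket, key, body)
    are injected so the function has no hard dependency on an AWS client.
    """
    written = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        raw = json.loads(read_object(bucket, key))
        blob = {
            "journey_id": raw.get("journey_id"),  # assumed field name
            "tokens": tokenize(raw["text"]),      # assumed field name
        }
        out_key = tokenized_key(key)
        write_object(bucket, out_key, json.dumps(blob))
        written.append(out_key)
    return written
```

In a real Lambda, `read_object` and `write_object` would be thin wrappers around the boto3 S3 client's `get_object` and `put_object` calls. Keeping them injectable means the key mapping and tokenization can be tested on a laptop.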

We accepted a 30-second lag between journey generation and intent annotation, but that lag is fixed and predictable. Snowpark Container Services runs the model inside a 4 vCPU / 16 GB container sharing the warehouse cluster, so the tokenizer memory pressure is isolated. The warehouse planner now receives pre-tokenized text, so predicate pushdown works correctly and the scan sizes drop by 40 %.

What The Numbers Said After

Latency: steady 185–205 ms for the entire downstream pipeline, measured end-to-end from event write to intent tag in the warehouse fact table. We added a custom metric—p95 delta from click to intent tag—and instrumented it with OpenTelemetry traces. The worst spike we saw after cutover was 245 ms, still under the 250 ms SLA.
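The p95 arithmetic behind that metric is simple enough to sketch. This is an illustration under assumptions, not our OpenTelemetry instrumentation: the `clicked_ms` and `tagged_ms` field names are hypothetical, and the percentile uses the nearest-rank definition, which may differ from what a given metrics backend computes.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # ceil(pct / 100 * n) computed with integers to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def click_to_tag_deltas_ms(events):
    """Delta between click and intent tag, per event, in milliseconds."""
    return [e["tagged_ms"] - e["clicked_ms"] for e in events]


def within_sla(events, sla_ms=250, pct=95):
    """True if the chosen percentile of click-to-tag deltas meets the SLA."""
    deltas = click_to_tag_deltas_ms(events)
    return nearest_rank_percentile(deltas, pct) <= sla_ms
```

A single 245 ms spike among otherwise steady samples does not move the p95, which is why we tracked the worst case separately from the percentile.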

Cost: Snowflake credit burn stayed flat at 1.3 credits/minute during peak, whereas the old sidecar model peaked at 2.1 credits/minute because of duplicated compute and spilled memory. The Lambda tokenizer added $18/day in AWS compute, a rounding error compared to the warehouse credit delta.

Errors: NULL intent rows dropped from 7 % to 0.18 %. The Snowflake query profile showed 99 % of scans respected the pushed-down filters, cutting warehouse scan bytes by 41 % and evening out query concurrency.

What I Would Do Differently

I would not have trusted the ONNX Runtime slide deck. Every production system I've integrated that includes a tokenizer has had hidden initialization cost: loading vocab files, compiling the tokenizer DLL, JIT warm-up. We should have benchmarked tokenization in isolation on the same instance class before wiring it to Spark.
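An isolated benchmark does not need to be elaborate. Below is a rough sketch of the kind of harness I mean, using only the standard library. `benchmark_tokenizer` and its parameters are hypothetical names. The point is to time the first call separately from the warm calls, so initialization cost cannot hide inside an average.

```python
import statistics
import time


def time_call_ms(fn, arg):
    """Wall-clock time of a single call, in milliseconds."""
    start = time.perf_counter()
    fn(arg)
    return (time.perf_counter() - start) * 1000.0


def benchmark_tokenizer(tokenize, samples, warmup=10):
    """Time the cold first call, discard warm-up calls, then time the rest."""
    if len(samples) < warmup + 2:
        raise ValueError("need at least warmup + 2 samples")
    cold_ms = time_call_ms(tokenize, samples[0])
    for text in samples[1:1 + warmup]:
        tokenize(text)  # warm-up calls, deliberately not recorded
    warm = [time_call_ms(tokenize, text) for text in samples[1 + warmup:]]
    return {
        "cold_ms": cold_ms,
        "warm_p50_ms": statistics.median(warm),
        "warm_max_ms": max(warm),
        "n": len(warm),
    }
```

Run it on the same instance class you plan to deploy on, with `tokenize` bound to the real tokenizer and `samples` drawn from real journeys. Numbers from a laptop tell you very little about a c6i.large.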

I would have pushed harder to run the tokenizer inside Snowflake from day one. Snowpark Container Services was in private preview when we started; by the time we adopted it, the breakage was already baked in.

Finally, I would have exposed the tokenization latency as a separate metric instead of burying it inside end-to-end latency. That way, when ONNX releases a new tokenizer version, we'd catch the regression before it hit the warehouse.
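One way to keep the stages separate is a small per-stage timer. This is a sketch only: `StageTimer` is a hypothetical helper, not part of OpenTelemetry or any library we used, and in practice the samples would be exported to whatever metrics backend you already run.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects latency samples per named stage, in milliseconds."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so tests can use a fake clock
        self.samples_ms = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the sample even if the stage raised.
            elapsed_ms = (self._clock() - start) * 1000.0
            self.samples_ms.setdefault(name, []).append(elapsed_ms)

    def worst_ms(self, name):
        """Largest recorded sample for a stage, or None if it never ran."""
        samples = self.samples_ms.get(name)
        return max(samples) if samples else None
```

Wrapping the tokenizer call in `timer.stage("tokenize")` inside an outer `timer.stage("end_to_end")` yields two independent series. A tokenizer regression then shows up under its own name instead of as a vague bump in the total.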