The Problem We Were Actually Solving
We needed to serve real-time search suggestions from user events (clicks, queries, filters) with two hard constraints: the suggestions had to appear within 30 seconds of the event, and the hourly BigQuery cost could not exceed $120. Our initial throughput was 80 GB/day of raw JSON, but by Black Friday we were at 280 GB/day and the 5-minute freshness window was slipping to 18 minutes on average. The Veltrix docs kept pointing to a sample Terraform module that used BQ streaming inserts, but every time the module hit 10,000 rows/second it started getting 429s from the streaming quota and buffering writes for 90 seconds. I traced the issue to the hidden daily quota of 100,000 rows/second per project, which the docs buried under layer 4 of the FAQ. We were hitting that quota at 09:01 UTC every day, not because of traffic spikes, but because our micro-batching job ran at 09:00 sharp.
What We Tried First (And Why It Failed)
First, we copied the sample module and used Apache Beam on Dataflow with a 15-second window and 2 minutes of allowed lateness. The latency histogram looked good: median 4.2s, 99th percentile 17s. The cost did not: $230/hour, because Dataflow picked n2-standard-8 machines by default and we were paying for 60 vCPU-minutes per GB processed. That was roughly 1.9x the $120 budget, and the ops team nearly yanked the pipeline during the cost review. So we switched to micro-batching every 5 minutes into partitioned BigQuery tables clustered on user_id and query_text. Freshness improved to 7 minutes on average, but the 09:00 spike still pushed us to 14 minutes because the batch window was locked to wall-clock time. Worse, the clustered queries against 1.2 TB of data started timing out after 200 seconds, so we added a materialized view that refreshed hourly. The dashboard now read 18 minutes of freshness, and because the materialized view coalesced all writes into one hourly job, the 09:01 spike was hidden inside the hourly window. The docs never mentioned the wall-clock lock-in or the hidden quota ceiling, so every fix was a firefight.
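To see why a batch window locked to wall-clock time lines up with the 09:00 spike, here is a minimal sketch of where aligned batch boundaries fall. The function name and the sample timestamp are illustrative assumptions, not code from our pipeline.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def next_aligned_boundary(ts: datetime, interval: timedelta) -> datetime:
    """Return the first wall-clock-aligned batch boundary strictly after ts.

    ts must be timezone-aware. Boundaries are counted from the Unix epoch,
    so a 5-minute interval always fires at :00, :05, :10, and so on.
    """
    interval_s = int(interval.total_seconds())
    elapsed_s = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=(elapsed_s // interval_s + 1) * interval_s)


# Hypothetical event shortly before the top of the hour.
event = datetime(2025, 11, 28, 8, 57, 30, tzinfo=timezone.utc)
boundary = next_aligned_boundary(event, timedelta(minutes=5))
wait = boundary - event  # time the event sits before its batch even starts
```

Every producer that schedules this way lands on the same boundary, which is how a job with no traffic spike of its own can still collide with everything else that runs on the hour.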
The Architecture Decision
We tore it all down and rebuilt with a tiered pipeline:
- Tier 0: Pub/Sub topic for raw events, with 16 shards and max throughput 20,000 messages/sec. We set publish latency alerts at 200 ms.
- Tier 1: Dataflow streaming with 60-second windows and 3 minutes of allowed lateness, using the shuffle service to hit exactly 8,000 vCPU-minutes per TB.
- Tier 2: BigQuery table partitioned by event_time with ingestion-time partitioning and clustered on user_id, query_text, and device_type. We set the partition expiration to 30 days to keep storage under control.
- Tier 3: Real-time materialized view refreshed every 30 seconds via scheduled query, costing an extra $1.20 per refresh but dropping the late-data log volume by 60%. We also enabled a BI Engine reservation for the suggestions table, set to 1,000 slots, which reduced scan costs by 42%.
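The Tier 1 windowing parameters are easier to reason about with a small model. The sketch below is plain Python, not Beam code, and it simplifies the real watermark semantics: it only shows which 60-second window an event belongs to and whether an event would still be accepted given 3 minutes of allowed lateness. The function names are hypothetical.

```python
WINDOW_S = 60             # Tier 1 window size, in seconds
ALLOWED_LATENESS_S = 180  # Tier 1 allowed lateness, in seconds


def window_start(event_time_s: int, window_s: int = WINDOW_S) -> int:
    """Return the start of the fixed window that contains event_time_s."""
    return event_time_s - (event_time_s % window_s)


def is_accepted(
    event_time_s: int,
    watermark_s: int,
    window_s: int = WINDOW_S,
    allowed_lateness_s: int = ALLOWED_LATENESS_S,
) -> bool:
    """Simplified rule: keep the event while the watermark has not yet
    passed the end of its window plus the allowed lateness."""
    window_end = window_start(event_time_s, window_s) + window_s
    return watermark_s < window_end + allowed_lateness_s


# An event stamped t=125s falls in the [120, 180) window.
on_time = is_accepted(event_time_s=125, watermark_s=150)   # window still open
late_ok = is_accepted(event_time_s=125, watermark_s=300)   # late, within 3 min
dropped = is_accepted(event_time_s=125, watermark_s=400)   # past 180 + 180
```

In this model the event at t=125s stays acceptable until the watermark reaches 360s; anything arriving later would show up in the late-data logs instead of the suggestions table.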
The key decision was to stop using the streaming insert quota entirely and switch to partitioned tables with ingestion-time partitioning. The docs called streaming inserts the high-throughput path, but we needed durability and cost predictability more than raw speed. The quota ceiling was real, the cost curves for streaming inserts vs partitioned tables crossed at 150 GB/day, and our freshness SLA was soft enough that 30 seconds of lateness was acceptable if we stayed under budget.
What The Numbers Said After
After the rebuild, freshness held at 6 minutes at the 95th percentile, and the 99th percentile never exceeded 11 minutes even during the 280 GB/day surge. The hourly BigQuery spend dropped from $230 to $87, well under the $120 ceiling. The ingestion pipeline latency stayed below 200 ms 99.9% of the time, and the streaming quota errors vanished because we were no longer using streaming inserts. The only hiccup was a timezone bug in the ingestion-time partitioning key that caused two hours of duplicate suggestions on 2025-11-02. We fixed it by switching to a TIMESTAMP column with explicit UTC conversion and adding a deduplication job in Dataflow with exactly-once semantics via idempotent writes.
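The two halves of that fix, explicit UTC conversion and idempotent writes, can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not our Dataflow job: the `event_id` key, the `IdempotentSink` class, and the sample row are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Convert an aware timestamp to UTC; reject naive ones rather than
    guessing which timezone they were recorded in."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: timezone must be explicit")
    return ts.astimezone(timezone.utc)


class IdempotentSink:
    """Toy sink keyed by event_id, so replaying an event is a no-op."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def write(self, event_id: str, row: dict) -> bool:
        """Store the row once; return False if event_id was already written."""
        if event_id in self._rows:
            return False
        self._rows[event_id] = row
        return True

    def __len__(self) -> int:
        return len(self._rows)


# Hypothetical event recorded with a UTC-5 offset.
local = datetime(2025, 11, 3, 1, 30, tzinfo=timezone(timedelta(hours=-5)))
row = {"query_text": "running shoes", "event_time": to_utc(local)}

sink = IdempotentSink()
first = sink.write("evt-001", row)   # new event, stored
replay = sink.write("evt-001", row)  # retry of the same event, ignored
```

Keying the write on a stable event identifier is what makes retries safe: the pipeline can deliver the same event more than once and the table still ends up with one row.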
What I Would Do Differently
I would not have trusted the Veltrix sample module. It assumed unbounded streaming quotas and omitted the daily ceiling entirely. Next time, I'll baseline the exact quotas and cost curves for every component before writing a line of code. I'd also put a canary alert on the partition lag in BigQuery, not just on the pipeline latency. The docs never mentioned partition lag as a metric, but it was the first signal that our freshness was degrading. Finally, I would insist on a chaos budget for the pipeline: once every sprint, I'd replay two hours of production events at 5x volume to stress-test the quotas and cost curves. The rebuilds were painful, but every one taught me that when the docs say "no tuning required", they're lying by omission.
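A partition-lag canary could start as small as the sketch below. It assumes "partition lag" means the gap between now and the newest event time that has landed in the table, and that some other job supplies that newest timestamp. The 5-minute threshold and the function names are placeholders, not values we ran in production.

```python
from datetime import datetime, timedelta, timezone

# Assumed alert threshold; tune it against the freshness target.
LAG_THRESHOLD = timedelta(minutes=5)


def partition_lag(newest_event_time: datetime, now: datetime) -> timedelta:
    """Return how far the newest landed event trails the current time."""
    return max(now - newest_event_time, timedelta(0))


def should_alert(lag: timedelta, threshold: timedelta = LAG_THRESHOLD) -> bool:
    """Fire the canary once the lag exceeds the threshold."""
    return lag > threshold


# Hypothetical reading: newest row is 7 minutes old.
now = datetime(2025, 11, 28, 9, 10, tzinfo=timezone.utc)
newest = datetime(2025, 11, 28, 9, 3, tzinfo=timezone.utc)
lag = partition_lag(newest, now)
alert = should_alert(lag)
```

Measuring lag at the table rather than in the pipeline catches the case where the pipeline looks healthy but rows are not landing where queries read them.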