bigquery slots are the single most misunderstood unit in the entire Google Cloud analytics stack — and the one most platform teams discover too late, usually right after the first month-end bill in which a single ad-hoc query melted a quarter of the budget. A slot is not a thread. It is not a row. It is not a query. It is a unit of CPU plus memory plus shuffle bandwidth, billed by the millisecond, that BigQuery dispatches in parallel across every sub-step of a query plan — and how you buy, share, and cap that unit is the entire game of bigquery finops in 2026.
This guide is the senior-DE walkthrough you wished existed the first time an architect asked you "when do we move off bigquery on demand to bigquery reservations with the new bigquery editions?" or "what is the right baseline + bigquery autoscaler shape for our ETL workload?" It walks through Editions (Standard, Enterprise, Enterprise Plus), the per-second bigquery pricing model that replaced the old 60-second floor, slot accounting (baseline + max + idle), reservations and multi-project sharing, the chargeback patterns that turn slot allocation into a per-team budget, and the 5-question decision tree senior engineers use to pick a bigquery capacity commitment for a new workload. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library → for the windowing + INFORMATION_SCHEMA queries this guide leans on, rehearse on optimization problems → for the query-shape rewrites that drop slot-ms 10x, and work through ETL pipeline drills → for the batch-vs-streaming reservation-sizing patterns.
On this page
- Why BQ FinOps is its own discipline in 2026
- Editions explained — Standard, Enterprise, Enterprise Plus
- Slots, baseline + autoscaler
- Reservations + projects + workload isolation
- FinOps practices — monitoring, attribution, alerts
- Cheat sheet — BQ FinOps recipes
- Frequently asked questions
- Practice on PipeCode
1. Why BQ FinOps is its own discipline in 2026
BigQuery has two pricing models — per-byte (on-demand) and per-slot (Editions) — and the FinOps job is to pick, size, and cap the right one
The one-sentence invariant: BigQuery sells you compute two ways — by the byte scanned (on-demand) or by the slot reserved (Editions) — and every other lever in the FinOps stack (autoscaler, baseline, idle reservation lending, Resource Monitor caps, INFORMATION_SCHEMA chargeback) sits on top of that one binary choice. Once you internalise "byte vs slot," the entire bigquery pricing interview surface collapses to a sequence of consequences from that root decision.
The two pricing axes — what you actually pay for.
- On-demand pricing. You are billed per byte scanned. Roughly $6.25 per TB scanned in most regions (check the live rate; it moved twice in 2024–2025). No capacity to manage, no baseline to size, no autoscaler — you just run queries. The price is predictable per query, terrifying per month, because a single SELECT * on a 500 TB table costs ~$3,125. There is a soft slot ceiling around 2,000 slots per project, which is plenty for ad-hoc work and lethal for production ETL.
- Editions pricing. You buy slots at a per-slot-hour rate that varies by edition (Standard ≈ $0.04/slot-hour, Enterprise ≈ $0.06, Enterprise Plus ≈ $0.10 — again, check live rates). You commit to a baseline (always-on) and configure an autoscaler maximum. Billing is per-second since the 2023 shake-up — no more 60-second-minimum floors that punished sub-second jobs.
The 2023 Editions shake-up — what changed.
- Flat-rate is dead. The legacy "flat-rate" annual / monthly slot commitments were retired in mid-2023 and replaced with the three-tier Editions model (Standard / Enterprise / Enterprise Plus) — each tier is a feature bundle plus a per-slot-hour price.
- Per-second billing replaced the 60-second floor. Pre-2023, every reservation billed at minimum 60 seconds. Post-2023, you pay only for the seconds you actually use; the autoscaler can spin up and spin down within the same minute without rounding up.
-
Autoscaler is the default. Before 2023, you reserved a fixed number of slots and watched the dial. After 2023, you set a
baselinefloor and amaxceiling; BigQuery scales between them automatically. Most production accounts now run baseline 0 with a max of N, paying only for what they use. -
1-year and 3-year
bigquery capacity commitmentdiscounts. On top of the Editions per-slot-hour rate, you can commit to a baseline for 1 or 3 years and get a 20% / 40% discount on the committed portion. The autoscaler-above-baseline portion is always at the on-demand-Editions rate.
The four "must-answer" axes for any BQ workload.
- On-demand or Editions? If your monthly bytes scanned would cost more than a comparable slot reservation could deliver the work for, switch to Editions. The crossover is workload-dependent; below ~50 TB scanned per month, on-demand often wins on simplicity.
- Baseline vs autoscaler split. Baseline is the always-on slot floor. Autoscaler scales up to a max during peaks. The right split depends on your duty cycle: 24/7 ETL → high baseline; bursty BI → low baseline + high max.
-
Idle reservation policy. If you have multiple reservations, do you let idle slots in one lend to others (
ignoreIdleSlots = false) or hoard them per reservation (ignoreIdleSlots = true)? Lending is the default and usually the right call; hoarding is the lever for hard guarantees. - Cap per project / per team. Without a Resource Monitor or a per-team reservation, one runaway query can drain the whole autoscaler pool. The cap is the floor under your worst-case bill.
What interviewers actually probe.
- Do you say "slot-ms is the atomic unit of cost in BigQuery" in the first minute? — senior signal.
- Do you describe Editions as "three tiers, per-second billing, baseline + autoscaler, optional 1y/3y commitment discount"? — required answer.
- Do you mention "per-team labels + INFORMATION_SCHEMA.JOBS = chargeback" unprompted? — senior signal.
- Do you push back on "BigQuery is on-demand" with "since 2023 most large customers are on Editions with autoscaler"? — senior signal.
Worked example — on-demand vs Editions break-even
Detailed explanation. The most common interview probe — "we scan 80 TB per month on-demand; should we move to Editions?" — is a math problem. On-demand costs are predictable per query but unbounded per month. Editions costs are predictable per month but require sizing the baseline + max. The break-even is where the slot-hours you would consume to do the same work at Editions rates equals the byte-scan bill on on-demand.
Question. A team scans 80 TB per month on on-demand at $6.25/TB. Their slot-hour-equivalent for the same workload is roughly 1,500 slot-hours per month (measured via INFORMATION_SCHEMA.JOBS slot_ms). Compute the on-demand monthly bill, the Enterprise Editions equivalent, and the break-even point.
Input.
| Metric | Value |
|---|---|
| Monthly bytes scanned | 80 TB |
| On-demand rate | $6.25 / TB |
| Slot-hours equivalent | 1,500 slot-hours / month |
| Enterprise rate | $0.06 / slot-hour |
| Standard rate | $0.04 / slot-hour |
Code.
-- Compute slot-hours from INFORMATION_SCHEMA.JOBS — the data behind the question
SELECT
DATE_TRUNC(creation_time, MONTH) AS month,
SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours,
SUM(total_bytes_billed) / POW(10, 12) AS tb_scanned,
SUM(total_bytes_billed) / POW(10, 12) * 6.25 AS on_demand_cost_usd,
(SUM(total_slot_ms) / (1000 * 60 * 60)) * 0.06 AS enterprise_cost_usd,
(SUM(total_slot_ms) / (1000 * 60 * 60)) * 0.04 AS standard_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
AND job_type = 'QUERY'
AND statement_type IS NOT NULL
GROUP BY month
ORDER BY month;
Step-by-step explanation.
- Pull the last 90 days of completed queries from
INFORMATION_SCHEMA.JOBSin the region. The view exposestotal_slot_ms(the atomic compute unit) andtotal_bytes_billed(the on-demand metric). - Aggregate per month: convert slot-ms to slot-hours (divide by 3.6e6), convert bytes to TB (divide by 1e12). These are the two sides of the cost equation.
- Apply the three pricing models in parallel — on-demand bytes × $6.25, slot-hours × $0.06 for Enterprise, slot-hours × $0.04 for Standard. The lowest one wins for that month.
- The break-even is where on-demand cost equals Editions cost. In this case, on-demand is 80 × $6.25 = $500/month. Enterprise is 1,500 × $0.06 = $90/month. Standard is 1,500 × $0.04 = $60/month.
- The verdict: switch to Editions immediately — on-demand is paying 5–8x for the same work. Pick Standard unless you need BI Engine, BQML, or Gemini features.
Output.
| Pricing model | Monthly cost |
|---|---|
| On-demand (80 TB × $6.25) | $500 |
| Enterprise (1,500 slot-hours × $0.06) | $90 |
| Standard (1,500 slot-hours × $0.04) | $60 |
Rule of thumb. Run the INFORMATION_SCHEMA query above before every pricing decision. If Editions wins by 3x or more, switch; if it wins by < 1.5x, stay on on-demand for the operational simplicity. The crossover is workload-shaped, not byte-count-shaped.
Worked example — baseline + autoscaler sizing from a duty cycle
Detailed explanation. A second common interview question — "we have 1,500 slot-hours per month; should we buy a 100-slot baseline reservation or run with baseline 0 and autoscaler max 200?" The answer depends on your duty cycle — what fraction of the month do you actually need slots?
Question. Given a workload that consumes 1,500 slot-hours per month, evenly distributed across business hours (10h/day × 22 weekdays), compute the optimal baseline and max for an Enterprise reservation. Compare a flat-baseline-100 vs an autoscaler-only configuration.
Input.
| Metric | Value |
|---|---|
| Monthly slot-hours | 1,500 |
| Active hours per month | 220 (10h × 22 weekdays) |
| Average slot demand during active hours | 1,500 / 220 ≈ 6.8 slots |
| Peak slot demand | 100 (observed in JOBS view) |
| Enterprise per-slot-hour | $0.06 |
| 1-year baseline discount | 20% → $0.048 / slot-hour |
Code.
# Sizing model — baseline + autoscaler vs autoscaler-only
hours_per_month = 730 # full month
active_hours = 220 # weekdays 10h
slot_hours_used = 1500
peak_slots = 100
rate_pay_as_you_go = 0.06 # $/slot-hour Enterprise
rate_commit_1y = 0.048 # 20% off baseline (1-year commit)
# Option A — flat baseline 100, no autoscaler
cost_a = 100 * hours_per_month * rate_commit_1y
# = 100 * 730 * 0.048 = $3,504 / month (way over!)
# Option B — baseline 0, autoscaler max 200
cost_b = slot_hours_used * rate_pay_as_you_go
# = 1500 * 0.06 = $90 / month
# Option C — baseline 5 (avg-low), autoscaler max 200
baseline_slots = 5
cost_c = (baseline_slots * hours_per_month * rate_commit_1y) \
+ ((slot_hours_used - baseline_slots * active_hours) * rate_pay_as_you_go)
# = (5 * 730 * 0.048) + ((1500 - 5 * 220) * 0.06)
# = 175.2 + (400 * 0.06) = 175.2 + 24 = $199.2 / month
print(cost_a, cost_b, cost_c)
Step-by-step explanation.
- Option A buys a flat 100-slot baseline 24/7. You pay for 730 hours × 100 slots × $0.048 = $3,504/month even though active demand is 220h. Massive waste.
- Option B uses pure autoscaler — baseline 0, max 200. You pay only the 1,500 slot-hours actually consumed × $0.06 = $90/month. The cheapest by far.
- Option C is a small baseline (5 slots, the average light demand) + autoscaler on top. The baseline is committed at the 20% discount; the bursts above baseline are pay-as-you-go. The math comes out to $199/month — worse than B because the duty cycle is too low to amortise the baseline.
- The verdict: when active hours / total hours < ~30%, pure autoscaler wins. When duty cycle > 50%, a baseline at the average load + autoscaler on top wins. The sweet spot for committed-discount math is "24/7-ish" workloads — ETL pipelines that run hourly, not BI queries that run weekdays only.
- Always model all three options against your real JOBS data; do not trust gut feeling on duty cycle.
Output.
| Configuration | Monthly cost | Wins when |
|---|---|---|
| Flat baseline 100 | $3,504 | never, on this workload |
| Autoscaler-only (max 200) | $90 | low duty cycle (< 30%) |
| Baseline 5 + autoscaler 200 | $199 | duty cycle 30–50% |
Rule of thumb. Default to baseline 0 + autoscaler max = your observed P99 slot demand. Only buy a baseline when duty cycle > 50% and the 20% commit discount more than pays for the always-on cost.
Worked example — slot-ms is the unit of cost
Detailed explanation. A common confusion: junior engineers think "this query ran for 10 seconds" is the cost metric. It is not. The cost metric is slot-ms — slots × milliseconds — because BigQuery dispatches the query in parallel and the bill is the total compute consumed across all slots, not the wall-clock latency.
Question. Given a query that took 5 seconds of wall clock but consumed 50 slots in parallel, compute the slot-ms, the slot-hours, and the per-query cost at Enterprise rate.
Input.
| Metric | Value |
|---|---|
| Wall-clock duration | 5 seconds |
| Average slots during run | 50 |
| Enterprise rate | $0.06 / slot-hour |
Code.
-- Per-query slot-ms from JOBS view
SELECT
job_id,
user_email,
total_slot_ms,
total_slot_ms / 1000.0 AS slot_seconds,
total_slot_ms / (1000.0 * 60 * 60) AS slot_hours,
(total_slot_ms / (1000.0 * 60 * 60)) * 0.06 AS cost_usd_enterprise
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_id = 'job_abc123'
LIMIT 1;
Step-by-step explanation.
- Wall-clock duration alone is meaningless; a 5-second query at 50 slots costs the same as a 50-second query at 5 slots. Both consume 250 slot-seconds = 250,000 slot-ms.
-
total_slot_msis the value to track. Convert to slot-hours by dividing by 3,600,000 (3.6e6 ms in an hour). - 250,000 slot-ms ÷ 3,600,000 = 0.0694 slot-hours per query. At $0.06/slot-hour Enterprise, the query costs $0.00417 — less than half a cent.
- Now imagine a single SELECT * on a 500 TB table that takes 60s wall-clock at 1,000 slots: 60,000,000 slot-ms = 16.7 slot-hours = $1.00 per query on Editions. The same query on on-demand would scan ~500 TB and cost $3,125. Hence "Editions wins for big scans."
- The slot-ms metric is invariant; it does not care about your edition or your reservation. It is the single number you track for cost attribution.
Output.
| Slot-ms | Slot-hours | Cost at Enterprise |
|---|---|---|
| 250,000 | 0.069 | $0.00417 |
| 60,000,000 | 16.67 | $1.00 |
| 600,000,000 | 166.67 | $10.00 |
Rule of thumb. Always quote query cost in slot-ms (or slot-hours for large queries), never in wall-clock seconds. Slot-ms is the only metric that maps directly to your bill regardless of edition or autoscaler shape.
Senior interview question on pricing-model selection
A senior interviewer often opens with: "Walk me through how you would pick between BigQuery on-demand and Editions for a new analytics workload. What are the four questions you ask, in order, and what answer to each one pushes you to one model over the other?"
Solution Using a 4-question pricing framework
Pricing framework — BigQuery on-demand vs Editions
==================================================
1. What is the monthly bytes-scanned bill on on-demand?
- < $200/month → on-demand wins (simplicity)
- > $500/month → Editions almost always wins
- between → run the slot-hours math (Q2)
2. What is the slot-hours-equivalent from JOBS view?
- slot-hours × $0.06 << on-demand cost → Editions Enterprise
- slot-hours × $0.04 << on-demand cost AND no BI Engine / BQML / Gemini needed → Standard
3. What is the duty cycle?
- active hours / total hours < 30% → baseline 0 + autoscaler
- 30–60% → small baseline + autoscaler
- > 60% → baseline ≈ average + autoscaler
- 24/7 saturated → consider 1y/3y commit at 20%/40% off
4. What feature do you actually need?
- none extra → Standard
- BI Engine, BQML, Gemini → Enterprise
- cross-region, advanced ML, multi-region replication → Enterprise Plus
Step-by-step trace.
| Workload | Q1 monthly bytes | Q2 slot-hours | Q3 duty cycle | Q4 features | Picked |
|---|---|---|---|---|---|
| Tiny ad-hoc BI team, 5 TB / month | $31 | 50 | < 30% | none | On-demand |
| ETL pipeline 80 TB / month | $500 | 1,500 | low | none | Standard + autoscaler |
| Looker + dashboards, BI Engine | $1,200 | 4,000 | 30–60% | BI Engine | Enterprise + small baseline |
| Cross-region ML training | $9,000 | 30,000 | > 60% | Enterprise Plus features | Enterprise Plus + 1y commit |
After the 4-question pass, the model is unambiguous. The remaining 5% — where both work — defaults to whatever the team already has set up.
Output:
| Model | When it wins |
|---|---|
| On-demand | < $200/month bytes-scanned bill, no slot capacity needed, sub-2K-slot ceiling fine |
| Standard Editions | Cheap slots, autoscaler, no BI Engine / BQML / Gemini needed |
| Enterprise Editions | Full feature set, autoscaler, 1y/3y commit discount possible |
| Enterprise Plus Editions | Cross-region replication, advanced ML, regulatory requirements |
Why this works — concept by concept:
- Pricing axis is binary first — byte vs slot is the root choice; every other lever (autoscaler, baseline, commitment) is a sub-decision inside Editions. Asking the pricing question first short-circuits feature-vs-cost confusion.
- Slot-hours equivalent is the conversion key — the only way to compare on-demand to Editions fairly is to compute the slot-hours your workload actually consumes (via JOBS view). Skipping this is the #1 FinOps mistake.
- Duty cycle drives baseline — low duty cycle workloads waste money on baseline. High duty cycle workloads save money on commitment discounts. The autoscaler is the bridge in the middle.
- Feature gates the edition — Standard is functionally complete for raw SQL; you upgrade to Enterprise only when you need BI Engine, BQML, or Gemini; Enterprise Plus is for regulatory / multi-region needs. Most workloads ship on Standard.
- Cost — Editions is O(slot-hours) consumed; on-demand is O(bytes scanned). The break-even shifts every time Google adjusts a rate (twice in 2024–2025) — re-run the JOBS view query quarterly.
SQL
Topic — sql
SQL aggregation and analytics problems
2. Editions explained — Standard, Enterprise, Enterprise Plus
bigquery editions is a three-tier feature ladder priced per slot-hour — pick by features needed, not by raw slot price
The mental model in one line: Standard / Enterprise / Enterprise Plus are three feature bundles on a per-slot-hour ladder ($0.04 / $0.06 / $0.10 in most regions), and the right edition is determined by which features the workload requires — not by which slot rate is cheapest. Once you say "feature ladder, not slot ladder," the entire bigquery editions interview surface becomes a sequence of feature-gate decisions.
Standard — the cheapest tier, no premium features.
- Per-slot-hour rate ≈ $0.04 (region-dependent; check live pricing).
- What you get. Raw BigQuery SQL, autoscaler, multi-region storage, INFORMATION_SCHEMA views, the standard SDK and bq CLI, Resource Monitor for project caps. Sufficient for ~70% of analytics workloads.
- What you don't get. BI Engine (the in-memory accelerator for Looker / Looker Studio), BQML (in-database machine learning), Gemini in BigQuery (the AI assistant + auto-completion features), AEAD encryption helpers, customer-managed Cloud KMS, fine-grained column-level security beyond the basic IAM.
- When it wins. Pure ETL pipelines, ad-hoc analytics, batch reporting where you do not need acceleration, ML, or AI features. Most production data engineering workloads ship here.
Enterprise — the default for analytics teams.
- Per-slot-hour rate ≈ $0.06.
- What you get. Everything in Standard, plus BI Engine, BQML, Gemini in BigQuery, customer-managed encryption keys (CMEK), column-level security, dynamic data masking, fine-grained row-level access policies, materialised views with auto-refresh.
- When it wins. BI workloads that need BI Engine for sub-second dashboard latency, ML workloads that benefit from BQML's in-database training, regulated environments that need CMEK, and teams that want Gemini auto-completion in the BigQuery UI.
- Pricing trick. Most teams pick Enterprise even when they only use one of the premium features. The 50% premium over Standard is small compared to the engineering cost of working around a missing feature later.
Enterprise Plus — the regulated / cross-region tier.
- Per-slot-hour rate ≈ $0.10 (a 67% premium over Enterprise).
- What you get. Everything in Enterprise, plus cross-region dataset replication, advanced ML capabilities, customer-managed encryption with external key management (Cloud EKM), enhanced VPC-SC support, compliance certifications relevant to financial services and healthcare, longer time-travel retention.
- When it wins. Regulated industries (finance, healthcare, government), workloads that require multi-region active-active for DR, advanced ML scenarios beyond BQML, customer-managed key escrow via Cloud EKM.
- When it loses. Most teams do not need any of these features and should not pay 67% more for them. The cheat-sheet test: if your security architect cannot name three Enterprise Plus features off the top of their head, you do not need this tier.
Per-second billing — what the 2023 shake-up actually changed.
- Pre-2023. Every slot reservation billed at a 60-second minimum. A 5-second query running at 100 slots was billed as 60 seconds × 100 = 6,000 slot-seconds.
- Post-2023. A 5-second query running at 100 slots is billed as 5 × 100 = 500 slot-seconds. Tenfold accuracy improvement on small, fast queries.
- The autoscaler implication. Pre-2023, scaling up was expensive — every burst paid a 60-second tax. Post-2023, the autoscaler can scale up for the 8 seconds your dashboard needs, then scale down — exactly the use case the new model is designed for.
Commitment discounts — the 1-year and 3-year options.
- 1-year commit on the baseline portion → 20% off the per-slot-hour rate.
- 3-year commit on the baseline portion → 40% off the per-slot-hour rate.
- Autoscaler portion is always at the on-demand-Editions rate. Only the always-on baseline is discounted.
- Break-even logic. 20% off pays for itself when you would have used the baseline > 80% of the time. 40% off pays for itself at > 60% utilisation. Below that, pay-as-you-go Editions is cheaper.
Edition-picking decision tree — the production rule.
- Do you need BI Engine, BQML, or Gemini? Enterprise.
- Do you need cross-region replication, Cloud EKM, or advanced ML? Enterprise Plus.
- Neither? Standard.
- Pick down by default; you can upgrade an edition later without re-platforming.
Common interview probes on Editions.
- "What's the difference between Standard and Enterprise?" — feature set, primarily BI Engine, BQML, Gemini, CMEK, RLS / CLS.
- "When does Enterprise Plus pay for itself?" — multi-region active-active, Cloud EKM, regulated industry.
- "What changed in 2023?" — flat-rate was retired, replaced by three-tier Editions; per-second billing replaced the 60-second floor; autoscaler became default.
- "How do commitment discounts stack?" — only on the baseline portion, 20% / 40% for 1y / 3y, autoscaler always at PAYG rate.
Worked example — pick the edition for a Looker dashboard
Detailed explanation. A team runs a Looker dashboard against BigQuery; the BI lead wants sub-second p95 query latency. Their current on-demand setup is doing 4-5 seconds because every dashboard refresh re-scans cold data. BI Engine — an Enterprise-only feature — fits the data in memory and serves at sub-second.
Question. Pick the edition. Compute the cost delta vs Standard. Justify the upgrade.
Input.
| Metric | Value |
|---|---|
| Monthly slot-hours | 4,000 |
| Standard rate | $0.04 |
| Enterprise rate | $0.06 |
| Current p95 latency | 4.5 s |
| Target p95 latency | < 1 s |
| BI Engine availability | Enterprise only |
Code.
-- Cost comparison helper — Standard vs Enterprise on the same slot-hours
WITH usage AS (
SELECT
SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
)
SELECT
slot_hours,
slot_hours * 0.04 AS standard_cost_usd,
slot_hours * 0.06 AS enterprise_cost_usd,
slot_hours * 0.10 AS enterprise_plus_cost_usd,
slot_hours * 0.02 AS enterprise_premium_over_standard
FROM usage;
Step-by-step explanation.
- Pull last 30 days of slot-hours from JOBS view. The number reflects actual compute consumed, not theoretical bytes scanned.
- Multiply by each edition rate to get monthly cost. Standard at 4,000 × $0.04 = $160. Enterprise at 4,000 × $0.06 = $240. Delta: $80/month.
- BI Engine is Enterprise-only. Without Enterprise, the Looker dashboard cannot benefit from in-memory acceleration; the 4.5s p95 stays 4.5s.
- With BI Engine on the active table, the same dashboard hits sub-second p95 — measurable user-facing improvement worth $80/month for any non-trivial team.
- The verdict: pay the $80/month Enterprise premium. The cost of not upgrading is engineer time spent on cache layers, materialised views, or pre-aggregation tables — easily 5–10x the price difference.
Output.
| Edition | Monthly cost (4K slot-hours) | BI Engine | Verdict |
|---|---|---|---|
| Standard | $160 | no | rejected — feature gap |
| Enterprise | $240 | yes | picked |
| Enterprise Plus | $400 | yes (extra features unused) | rejected — overspend |
Rule of thumb. When the feature gap costs more engineering time than the edition premium, upgrade the edition. The cost delta between tiers is usually a rounding error compared to the value of the feature unlock.
Worked example — Enterprise Plus only when you need it
Detailed explanation. A common over-spend trap: a team picks Enterprise Plus "for the future" without using any of its features. Enterprise Plus costs 67% more per slot-hour than Enterprise. On a 10,000 slot-hour/month workload, that is $400/month wasted unless you actually need multi-region replication, Cloud EKM, or advanced ML.
Question. Given a workload that needs CMEK (customer-managed encryption keys, available on Enterprise) but does not need cross-region replication or Cloud EKM, prove Enterprise wins over Enterprise Plus.
Input.
| Feature | Enterprise | Enterprise Plus | Needed? |
|---|---|---|---|
| CMEK | yes | yes | yes |
| Cross-region replication | no | yes | no |
| Cloud EKM (external KMS) | no | yes | no |
| Advanced ML | partial | full | no |
| BQML | yes | yes | yes |
| Per-slot-hour rate | $0.06 | $0.10 | — |
Code.
-- Edition cost comparison
WITH usage AS (
SELECT SUM(total_slot_ms) / 3.6e6 AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
)
SELECT
slot_hours,
slot_hours * 0.06 AS enterprise_cost,
slot_hours * 0.10 AS enterprise_plus_cost,
slot_hours * 0.04 AS enterprise_plus_premium -- the wasted 67%
FROM usage;
Step-by-step explanation.
- Confirm the feature requirement list: CMEK and BQML are needed; cross-region replication, Cloud EKM, and advanced ML are not.
- Both Enterprise and Enterprise Plus include CMEK and BQML. The unique features of Enterprise Plus (cross-region replication, Cloud EKM, advanced ML) are not on the requirement list.
- The cost of Enterprise Plus is 10,000 × $0.10 = $1,000/month. The cost of Enterprise is 10,000 × $0.06 = $600/month. Delta is $400/month or $4,800/year — a full month of a junior engineer's salary, wasted on features you do not use.
- The decision: Enterprise. Document the requirements list so future tech leads do not "upgrade for the future" without re-checking the feature gap.
- Rule of thumb: list the named Enterprise Plus features you will use in the next 12 months. If the list is empty, do not upgrade.
Output.
| Edition | Monthly cost (10K slot-hours) | Verdict |
|---|---|---|
| Standard | $400 | rejected — no CMEK |
| Enterprise | $600 | picked — meets requirements |
| Enterprise Plus | $1,000 | rejected — unused features |
Rule of thumb. Pick down, not up. Default to the lowest tier that meets your named feature requirements. Re-evaluate quarterly; upgrade when a new feature requirement appears, not "just in case."
Worked example — 1-year commitment break-even
Detailed explanation. A team is on Enterprise PAYG and considering a 1-year baseline commitment for a 20% discount. The question is: at what baseline size does the commitment pay for itself given the team's duty cycle?
Question. Compute the break-even baseline size for a 1-year Enterprise commitment, given an observed average slot demand and a target duty cycle.
Input.
| Metric | Value |
|---|---|
| Average slot demand | 50 slots (during active hours) |
| Active hours per month | 220 (10h × 22 days) |
| Total hours per month | 730 |
| Enterprise PAYG rate | $0.06 / slot-hour |
| 1-year commit rate | $0.048 / slot-hour (20% off) |
| Autoscaler rate (above baseline) | $0.06 / slot-hour |
Code.
# Break-even baseline for 1-year commit
payg_rate = 0.06
commit_rate = 0.048
hours_total = 730
hours_active = 220
avg_slots = 50
# At baseline = B slots:
# commit cost = B * hours_total * commit_rate
# extra payg = (avg_slots - B) * hours_active * payg_rate if B < avg_slots
# else extra = 0
# Pure PAYG cost = avg_slots * hours_active * payg_rate
pure_payg = avg_slots * hours_active * payg_rate
print(f"Pure PAYG cost: ${pure_payg:.2f}") # $660
for B in [0, 10, 20, 30, 40, 50]:
commit_part = B * hours_total * commit_rate
extra = max(avg_slots - B, 0) * hours_active * payg_rate
total = commit_part + extra
print(f"baseline {B} → commit ${commit_part:.2f} + extra ${extra:.2f} = ${total:.2f}")
Step-by-step explanation.
- Pure PAYG cost is
50 × 220 × $0.06 = $660/month. This is the baseline-0 cost — no commitment, fully pay-as-you-go. - With baseline = 10: commit cost is
10 × 730 × $0.048 = $350.40plus extra PAYG of40 × 220 × $0.06 = $528= $878.40. Worse than pure PAYG. - With baseline = 50: commit cost is
50 × 730 × $0.048 = $1,752plus zero extra. Way worse — the commit pays 24/7 for slots only used 30% of the time. - The break-even is where
(B × 730 × $0.048) + (avg_slots − B) × 220 × $0.06 < avg_slots × 220 × $0.06. Solving: B × 730 × $0.048 < B × 220 × $0.06 → 35.04 B < 13.2 B → never positive at any B > 0 for this duty cycle. - Verdict: at a 30% duty cycle, the 1-year commit never pays off, no matter the baseline size. Commitment makes sense only for workloads with > ~60% duty cycle.
Output.
| Baseline | Monthly cost | vs PAYG $660 |
|---|---|---|
| 0 (pure PAYG) | $660 | baseline |
| 10 | $878 | +$218 worse |
| 20 | $1,096 | +$436 worse |
| 30 | $1,314 | +$654 worse |
| 40 | $1,533 | +$873 worse |
| 50 | $1,752 | +$1,092 worse |
Rule of thumb. Do not commit unless your duty cycle exceeds 60%. For bursty BI workloads (active 30%, idle 70%), pure PAYG autoscaler always wins. For 24/7 ETL pipelines, the 1-year or 3-year commit at the average slot demand wins.
Senior interview question on edition selection
A senior interviewer might ask: "Your company is rolling out BigQuery for the first time. Three teams want to use it: a data engineering team running hourly ETL, a BI team building dashboards in Looker, and a fraud team building ML models. Walk me through how you would pick editions and structure the spend."
Solution Using one edition per team aligned to features needed
Edition selection — three teams, three workloads
================================================
Team 1: Data Engineering — hourly ETL
Features needed: none beyond Standard
Verdict: Standard, baseline 0, autoscaler max 200
Rationale: ETL is pure SQL; no BI Engine, BQML, or Gemini value
Team 2: BI — Looker dashboards
Features needed: BI Engine (sub-second dashboard p95)
Verdict: Enterprise, baseline 0, autoscaler max 100
Rationale: BI Engine is Enterprise-only; without it, dashboards lag
Team 3: Fraud ML — BQML models
Features needed: BQML (in-database training)
Verdict: Enterprise, baseline 0, autoscaler max 300
Rationale: BQML is Enterprise-only; ML jobs are bursty, autoscaler ideal
Shared infrastructure
- One Enterprise org-level reservation as the default
- One Standard project-level reservation for the ETL team (cheaper rate)
- Idle slot lending ON between the two reservations
- Resource Monitor per project to cap spend
Step-by-step trace.
| Team | Workload type | Slot-hours/month | Edition | Baseline | Max |
|---|---|---|---|---|---|
| Data Eng | ETL, hourly | 800 | Standard | 0 | 200 |
| BI | Looker, weekday hours | 600 | Enterprise | 0 | 100 |
| Fraud ML | Bursty training | 1,200 | Enterprise | 0 | 300 |
| Total org | combined | 2,600 | mixed | 0 | 600 |
The org has two reservations: a Standard one for the ETL team (cheaper) and an Enterprise one for BI + ML (feature requirements). Idle lending lets the ETL team borrow autoscaler headroom from the Enterprise pool when needed.
Output:
| Reservation | Edition | Workloads | Monthly cost (PAYG) |
|---|---|---|---|
etl-res |
Standard | ETL | 800 × $0.04 = $32 |
bi-ml-res |
Enterprise | BI + ML | 1,800 × $0.06 = $108 |
| Idle lending savings | — | ~10% headroom shared | ~ −$10 |
| Total | — | — | ≈ $130/month |
Why this works — concept by concept:
- One edition per workload, not per company — different teams have different feature needs. Locking the whole org to Enterprise wastes money on ETL; locking to Standard blocks BI Engine. Per-team reservations let each pay only for what it needs.
- Standard is the default — pick down by default. The ETL team has no premium feature needs; Standard at $0.04 is 33% cheaper than Enterprise.
- Enterprise unlocks two features for $80/month — BI Engine alone justifies the upgrade for the BI team; BQML alone justifies it for ML. Both teams use both features once unlocked.
- Idle slot lending is free FinOps — the org pays once for the autoscaler ceiling and the two reservations share it. Without lending, you would buy two separate ceilings.
- Cost — Editions cost is O(slot-hours consumed × per-slot-hour rate). The structural choice (which edition, which reservation) drives 60% of the bill; query optimisation drives the other 40%. Spend FinOps energy on both.
SQL
Topic — sql
BigQuery SQL practice problems
3. Slots, baseline + autoscaler
A bigquery slot is one unit of CPU + memory; baseline is always-on, autoscaler scales to a max, slot-ms is the cost atom
The mental model in one line: a slot is a unit of parallel compute (roughly 1 vCPU + RAM + shuffle), baseline is the always-on floor, autoscaler is the elastic ceiling, and the bill is the integral of slots × seconds over your run-time — measured atomically in slot-ms. Once you say "baseline + autoscaler bounded by max, billed per slot-ms," the entire slot allocation and bigquery autoscaler interview surface becomes deductions from that one shape.
What a slot actually is.
- A slot is BigQuery's unit of parallel work — roughly 1 vCPU + memory + shuffle bandwidth, dispatched by the Dremel query engine to execute one parallel sub-step of a query plan.
- A query plan is a DAG of stages (scan, filter, join, aggregate, output). Each stage parallelises across slots; a stage with 100 input shards uses 100 slots concurrently if available.
- A single query can use anywhere from 1 slot (a small scan over a single shard) to thousands of slots (a wide shuffle on a large fact table). The plan dictates the demand; the reservation dictates the supply.
- Slots are fungible — there is no "ETL slot" vs "BI slot" hardware difference. The difference is which reservation owns them at a given moment.
Baseline — the always-on floor.
- A baseline is the number of slots a reservation has committed to, billed continuously whether you use them or not.
- Baseline > 0 makes sense when (a) you have a 24/7 workload that consistently needs at least that many slots, or (b) you want the 20%/40% commitment discount on the baseline portion.
- Baseline = 0 makes sense when your workload is bursty — you pay $0 when idle, autoscaler-rate when active.
- The default in 2026 is baseline 0 + autoscaler max = your P99 slot demand. This is the lowest-cost pattern for bursty workloads.
Autoscaler — the elastic ceiling.
- Autoscaler scales the reservation from
baselineup tomaxslots automatically, in 100-slot increments (the granularity is region-dependent; check current docs). - Scale-up is triggered by queued queries; scale-down happens after a brief idle window.
- The autoscaler-allocated slots above baseline are billed at the pay-as-you-go Editions rate (no commit discount applies to the autoscaler portion).
- The autoscaler
maxis the hard ceiling — if your workload demands more, queries queue. This is intentional: the cap is your bill protection.
Slot-ms — the atomic billing unit.
- Every query records
total_slot_msinINFORMATION_SCHEMA.JOBS— the integral of slots used × milliseconds. - A 10-slot query running for 5 seconds = 50,000 slot-ms. A 100-slot query running for 500 ms = also 50,000 slot-ms. The bill is identical.
- Slot-hours = slot-ms ÷ 3,600,000. The per-slot-hour rate × slot-hours = the bill for that query.
- Per-second billing means BigQuery never rounds up; if a query consumed 250,000 slot-ms, you pay for 250,000 slot-ms — not "rounded to nearest minute."
Job-level vs reservation-level slot accounting.
- Job-level. Each query consumes slot-ms; the JOBS view exposes it per job. This is the unit of attribution — which user, which team, which dataset is consuming the slots.
- Reservation-level. Each reservation has a slot pool (baseline + current autoscaler position). The reservation pool fills as queries run; queries beyond the pool either queue or borrow idle slots (if lending is enabled) from other reservations.
- The conservation law. Sum of all job slot-ms in a window = sum of (slots in pool × ms) used. The two views must match within rounding; if they don't, look for slot leakage (e.g. queries running against the on-demand pool while you thought they were on Editions).
Idle reservations and slot borrowing.
- A reservation with
ignoreIdleSlots = false(default) lends its currently-idle slots to other reservations when they exceed their own pool. - A reservation with
ignoreIdleSlots = truehoards — its idle slots stay idle, and other reservations queue against their own caps. - Lending is usually the right default — it improves overall slot utilisation, costs nothing extra, and only hurts when a reservation's "borrower" creates contention right when the "lender" needs the slots back.
- Borrowed slots are returned within seconds when the original reservation needs them — the lender always has priority.
The "1,200 slot-ms in dollars" calculation.
- 1,200 slot-ms = 1.2 slot-seconds = 0.000333 slot-hours.
- At Enterprise $0.06/slot-hour: 0.000333 × $0.06 = $0.00002 — about 2 thousandths of a cent.
- This is a tiny query. Most data engineering queries consume 10K–10M slot-ms, mapping to $0.0002–$0.20 per query.
- Big shuffles on multi-TB tables can hit 10M–1B slot-ms ($0.20 – $20 per query). Those are the queries you want to optimise first.
Common interview probes on slots and autoscaler.
- "What is a BigQuery slot?" — one unit of CPU + memory + shuffle, dispatched in parallel per query sub-step.
- "What's the difference between baseline and autoscaler?" — baseline is always-on (and discount-eligible); autoscaler is elastic up to max (PAYG).
- "How is slot usage billed?" — slot-ms aggregated to slot-hours × per-slot-hour rate; per-second precision since 2023.
- "What is idle reservation lending?" — reservations with
ignoreIdleSlots=falselend their unused slots to busy reservations; lender always has priority on return.
Worked example — a query consuming 1,200 slot-ms
Detailed explanation. An interviewer hands you a JOBS view row: total_slot_ms = 1,200. What does it mean, and how do you turn it into a dollar figure?
Question. Given a query with total_slot_ms = 1,200 running on an Enterprise reservation, compute the slot-seconds, slot-hours, and cost.
Input.
| Field | Value |
|---|---|
| total_slot_ms | 1,200 |
| Edition | Enterprise |
| Rate | $0.06 / slot-hour |
Code.
WITH q AS (
SELECT 1200 AS slot_ms
)
SELECT
slot_ms AS slot_ms,
slot_ms / 1000.0 AS slot_seconds,
slot_ms / (1000.0 * 60 * 60) AS slot_hours,
ROUND((slot_ms / (1000.0 * 60 * 60)) * 0.06, 8) AS cost_usd
FROM q;
Step-by-step explanation.
- Divide slot_ms by 1,000 to get slot-seconds: 1,200 ÷ 1,000 = 1.2 slot-seconds.
- Divide slot_ms by 3,600,000 to get slot-hours: 1,200 ÷ 3.6e6 = 0.000333 slot-hours.
- Multiply slot-hours by the Enterprise rate $0.06 to get the cost: 0.000333 × 0.06 = $0.0000200.
- This single query cost about 2 thousandths of a cent. Negligible. But scale matters: 1 million such queries in a month = $20.
- The lesson: a 1,200 slot-ms query is cheap. The cost concentration is in the long-tail of multi-second, multi-thousand-slot queries — find those in JOBS view and optimise them first.
Output.
| Metric | Value |
|---|---|
| slot-ms | 1,200 |
| slot-seconds | 1.2 |
| slot-hours | 0.000333 |
| cost (Enterprise) | $0.00002 |
Rule of thumb. Convert every JOBS view slot-ms reading to dollars at your edition rate. Anything < $0.001 per query is noise; anything > $0.10 per query is a candidate for optimisation; anything > $1 per query is an emergency.
Worked example — picking autoscaler max from a histogram
Detailed explanation. A team is moving from on-demand to Editions. They need to pick the autoscaler max. The right approach is to look at the historical slot-demand histogram from JOBS view, then set the max at the P99 of that distribution — so 99% of the time queries flow freely, and 1% of the time bursts queue briefly to protect the bill.
Question. Given a JOBS view aggregated by 1-minute intervals over 30 days, compute the P50, P95, P99, and max of slot demand. Recommend an autoscaler max.
Input.
| Time bucket (1-min) | Slots used |
|---|---|
| t1 | 12 |
| t2 | 18 |
| ... | ... |
| t43200 | 245 |
Code.
-- Per-minute slot demand histogram + percentiles
WITH per_minute AS (
SELECT
TIMESTAMP_TRUNC(creation_time, MINUTE) AS minute,
SUM(total_slot_ms) / 60000.0 AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
GROUP BY minute
)
SELECT
APPROX_QUANTILES(avg_slots, 100)[OFFSET(50)] AS p50,
APPROX_QUANTILES(avg_slots, 100)[OFFSET(95)] AS p95,
APPROX_QUANTILES(avg_slots, 100)[OFFSET(99)] AS p99,
MAX(avg_slots) AS max_demand
FROM per_minute;
Step-by-step explanation.
- Bucket every job by minute and compute the average slot demand per minute (sum of slot-ms in the minute, divided by 60,000 to convert to slots).
- Compute percentiles via
APPROX_QUANTILES. Result: P50 = 8, P95 = 95, P99 = 180, max = 245. - Setting autoscaler max = 245 (the observed max) means you almost never queue, but you pay for the worst-case bill.
- Setting autoscaler max = 180 (P99) means 99% of minutes flow freely; 1% queue for seconds. That is usually acceptable and caps the worst-case bill.
- Setting autoscaler max = 95 (P95) means 5% of minutes queue — too aggressive for production; users notice. Save P95 for non-critical reservations (ad-hoc, analyst sandboxes).
- Recommend max = 180 (P99) for production. Recommend max = 95 (P95) for analyst-sandbox-style reservations where slight queuing is tolerable.
Output.
| Percentile | Slots | Recommended for |
|---|---|---|
| P50 | 8 | (info only) |
| P95 | 95 | non-critical reservations |
| P99 | 180 | production |
| max | 245 | budget-no-object only |
Rule of thumb. Autoscaler max = P99 of per-minute slot demand. Tune by tier: P99 for production, P95 for ad-hoc, P90 for sandboxes. Re-compute every 30 days; workloads drift.
Worked example — idle slot lending across reservations
Detailed explanation. A team has two reservations: etl-res (baseline 50, max 100) and bi-res (baseline 0, max 200). They observe that bi-res is sometimes queuing while etl-res has idle slots. Enable idle lending and prove the bill does not change.
Question. Configure both reservations with ignoreIdleSlots = false. Show how bi-res borrows from etl-res during a spike, and confirm the bill is unchanged because idle slots are already paid for in the baseline.
Input.
| Reservation | Baseline | Max | ignoreIdleSlots |
|---|---|---|---|
| etl-res | 50 | 100 | false (lend) |
| bi-res | 0 | 200 | false (borrow) |
Code.
-- Spec for both reservations (illustrative — actual API is gcloud or Terraform)
-- gcloud bq reservations create etl-res --slots=50 --location=us
-- gcloud bq reservations update etl-res --autoscale-max-slots=100 --ignore-idle-slots=false
-- gcloud bq reservations create bi-res --slots=0 --location=us
-- gcloud bq reservations update bi-res --autoscale-max-slots=200 --ignore-idle-slots=false
-- Verify slot usage by reservation over the spike window
SELECT
reservation_id,
TIMESTAMP_TRUNC(period_start, MINUTE) AS minute,
SUM(period_assigned_slots) AS slots_in_use
FROM `region-us`.INFORMATION_SCHEMA.RESERVATIONS_TIMELINE
WHERE period_start >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY reservation_id, minute
ORDER BY minute, reservation_id;
Step-by-step explanation.
-
etl-reshas 50 baseline slots, autoscaler to 100. During off-hours, ETL is idle — the 50 baseline slots are paid-for but unused. -
bi-reshas 0 baseline, autoscaler to 200. During a BI spike, it needs 150 slots. - With
ignoreIdleSlots=falseon both, the BigQuery scheduler letsbi-resborrow up to 50 idle slots frometl-res(the baseline portion that is currently unused). -
bi-resautoscaler still scales up to its own 200-slot ceiling — borrowing is additional headroom on top, not a replacement for the autoscaler. - When
etl-resneeds its baseline back (a query arrives), the lender has priority — the borrower's queries on the borrowed slots are paused or drained within a few seconds. -
The bill does not change.
etl-resalready pays for its 50 baseline regardless; lending them tobi-resis free utilisation.
Output.
| Time | etl-res in use | bi-res in use (own) | bi-res borrowed | Total slots in use |
|---|---|---|---|---|
| 12:00 | 0 | 0 | 0 | 0 |
| 12:05 | 0 | 150 (max 200) | 50 (from etl-res) | 200 |
| 12:10 | 25 (ETL job starts) | 150 | 25 (lender takes some back) | 200 |
| 12:15 | 50 | 100 (queue 50) | 0 | 150 |
Rule of thumb. Default to ignoreIdleSlots=false on every reservation. Lending is free utilisation; hoarding only makes sense for hard-isolation requirements (regulated workloads, hard SLOs). Almost no team needs hoarding.
Senior interview question on baseline + autoscaler sizing
A senior interviewer might ask: "You join a team that just moved to BigQuery Enterprise and they set baseline 200 + autoscaler max 1,000. The bill is way over budget. Walk me through how you'd diagnose and fix it."
Solution Using JOBS-view-driven re-sizing of baseline + autoscaler
# Diagnosis flow — baseline + autoscaler are mis-sized
# Step 1: pull historical slot demand from JOBS view
# Step 2: compute P50, P95, P99 of per-minute slot demand
# Step 3: compute duty cycle (active hours / total hours)
# Step 4: recommend baseline + max
# Pseudocode of the resize plan
def recommend(slot_hours_used_per_month,
p99_slot_demand,
duty_cycle,
monthly_hours=730):
"""Return (baseline, max) recommendation."""
if duty_cycle < 0.30:
baseline = 0
elif duty_cycle < 0.60:
# Baseline = average light demand (commit discount on it)
baseline = max(0, slot_hours_used_per_month // (monthly_hours * 1))
else:
# 24/7-ish; baseline at the steady-state floor
baseline = max(0, p99_slot_demand // 2)
max_slots = p99_slot_demand
return baseline, max_slots
Step-by-step trace.
| Step | Before fix | Diagnosis | After fix |
|---|---|---|---|
| Baseline | 200 (24/7) | duty cycle ~30%, not 24/7 | 0 |
| Autoscaler max | 1,000 | P99 observed = 220 | 250 |
| Monthly baseline cost | 200 × 730 × $0.06 = $8,760 | wasted | $0 |
| Monthly autoscaler cost | ~$60 (real usage) | accurate | ~$66 (real usage with margin) |
| Total monthly bill | $8,820 | — | $66 |
| Savings | — | — | $8,754 / month (~99%) |
The fix is to drop baseline to 0 (duty cycle is too low for an always-on commitment) and reduce max to the P99 (no point paying for headroom you never use). The actual slot-hours consumed stay the same, so the active-slot bill is unchanged.
Output:
| Metric | Before | After |
|---|---|---|
| Baseline | 200 | 0 |
| Max | 1,000 | 250 |
| Monthly bill | ~$8,820 | ~$66 |
| Annual savings | — | ~$105,000 |
| Risk | over-provisioned | none (P99 covers spikes) |
Why this works — concept by concept:
- Baseline 0 for bursty workloads — the worst FinOps mistake in BigQuery is buying a baseline you do not need 24/7. Bursty workloads should be baseline 0; the autoscaler will deliver the same throughput on demand.
- P99 max, not P100 max — the autoscaler max should cap your bill, not service every conceivable spike. Setting max = observed-max means you never queue but pay for headroom you almost never use. P99 trades 1% of queuing for 80% bill reduction.
- Duty cycle decides baseline existence — duty cycle < 30% → baseline 0. 30–60% → small baseline at average light demand. > 60% → baseline at steady-state floor. Above 60% with a 1-year commit, the discount math finally pays off.
-
Real usage data trumps gut feeling — every resize must be driven by
INFORMATION_SCHEMA.JOBSdata, never by "what feels right." The JOBS view is the source of truth. - Cost — bill = baseline × month-hours × baseline-rate + autoscaler-slot-hours × PAYG-rate. The baseline term is O(month-hours); the autoscaler term is O(actual usage). Cutting baseline to 0 cuts the constant term to 0.
SQL
Topic — sql
INFORMATION_SCHEMA and analytics queries
4. Reservations + projects + workload isolation
bigquery reservations are slot pools assigned to projects — one reservation per workload class is the senior pattern
The mental model in one line: a reservation is a block of slots (baseline + autoscaler), assigned to one or more projects via assignments, with the senior pattern being one reservation per workload class (ETL, BI, ad-hoc) and idle slot lending turned on between them. Once you say "one reservation per workload class, assignments map projects to reservations, lending fills the gaps," the entire bigquery reservations and workload-isolation interview surface becomes a deduction from those three primitives.
Reservations vs assignments vs commitments — the three primitives.
-
Reservation. A named slot pool with
baseline+max+edition+ignoreIdleSlots. Lives at the org or project level. -
Assignment. A mapping that says "this project / folder / organisation, when running this
jobType(QUERY / PIPELINE), uses this reservation." A project without an assignment falls back to on-demand. - Commitment. An optional 1-year or 3-year purchase of baseline slots at a 20% / 40% discount. Attached to a reservation but managed separately so you can change the reservation shape without breaking the commitment.
Why one reservation per workload class is the senior pattern.
- ETL workloads. Predictable schedule (hourly / daily), tolerant of small latency, long-running queries. Want a baseline that covers steady state + autoscaler for monthly peaks. Priority: throughput.
- BI workloads. Bursty (working hours), latency-sensitive (dashboards), short queries. Want baseline 0 + tight autoscaler max + idle borrowing to absorb spikes. Priority: latency.
- Ad-hoc / sandbox. Best-effort, no SLO, often individual analysts running exploratory queries. Want baseline 0 + small autoscaler max + idle-only mode (no commitment). Priority: cap the spend.
- Mixing them in one reservation means BI queries queue behind ETL jobs, or ad-hoc spikes drive up the bill for ETL. Isolating them by class makes the bill — and the SLO — predictable.
Assignments — how projects find reservations.
- An assignment has three parts: a target reservation, a parent (org / folder / project), and a
jobType(QUERY, PIPELINE, ML, or ALL). - Assignments are inheritable down the resource hierarchy — an org-level assignment applies to every project below unless overridden.
- A project can have multiple assignments for different job types (e.g. QUERY →
bi-res, PIPELINE →etl-res). - A project without an assignment runs against the on-demand pricing pool — a common gotcha when migrating to Editions.
Multi-project reservations — sharing slots across business units.
- One reservation can serve many projects via separate assignments. This is the standard pattern for a central data platform team.
- The reservation's slots are shared across all assigned projects; there is no per-project carve-out within a single reservation.
- If you want per-project caps, use either separate reservations per project or a Resource Monitor that caps a project's slot-ms consumption.
- For chargeback / showback, use labels on the jobs (set by the producing project) and aggregate slot-ms by label in
INFORMATION_SCHEMA.JOBS.
Idle reservation slot lending — the default that should stay default.
-
ignoreIdleSlots = false(default) lets a reservation lend its currently-idle slots to other reservations that are queuing. - Lending is automatic and bounded: the lender always gets its slots back when it needs them.
- The lent slots come from the baseline portion of the lender; the autoscaler portion is never lent.
-
The only reason to set
ignoreIdleSlots = trueis a hard isolation requirement — a regulated workload that must never share compute with another team's queries. For 95% of teams, lending stays on.
Resource Monitor — the budget circuit breaker.
- A Resource Monitor caps the total slot-ms a project (or reservation, or org) can consume in a fixed time window (typically 24 hours).
- When the cap is hit, new jobs in the protected scope queue or fail; existing jobs continue to completion.
- This is the only mechanism that bounds the worst-case bill from a single runaway team or a misbehaving CI pipeline.
- The cap is a limit, not a target — set it ~2x the team's normal daily slot-ms so accidental spikes are caught without throttling normal work.
Workload isolation patterns in production.
-
The 3-reservation default.
etl-res(baseline 50, max 200, lend) +bi-res(baseline 0, max 200, lend) +adhoc-res(baseline 0, max 50, lend). Covers most data teams. - The per-team-reservation pattern. Each business unit gets its own reservation with its own cap. Useful when chargeback is by reservation rather than by label.
- The dev / staging / prod split. Three reservations along environment lines; dev gets the smallest cap, prod gets the most. Useful when developers run heavy queries that should not impact production dashboards.
Common interview probes on reservations.
- "Why one reservation per workload class?" — different SLOs, different latency tolerances, different priorities. Mixing them creates contention.
- "How do projects get attached to reservations?" — via assignments at org / folder / project level; assignments are job-type-specific and inheritable.
- "What is idle slot lending and when do you disable it?" — automatic borrowing across reservations; disable only for hard-isolation regulated workloads.
- "How do you cap spending per project?" — Resource Monitor on the project's slot-ms in a 24-hour window, set to ~2x normal daily usage.
Worked example — 3-reservation pattern for a data platform
Detailed explanation. A data platform team wants to isolate ETL, BI, and ad-hoc workloads. They have a 5-project setup: prod-pipelines, staging, looker, tableau, analyst-sandbox. The senior pattern is three reservations with assignments mapping projects to reservations.
Question. Define the three reservations, the assignments, and verify the slot accounting via INFORMATION_SCHEMA.
Input.
| Project | Workload | Reservation target |
|---|---|---|
| prod-pipelines | hourly ETL | etl-res |
| staging | ETL test | etl-res |
| looker | BI dashboards | bi-res |
| tableau | BI dashboards | bi-res |
| analyst-sandbox | ad-hoc | adhoc-res |
Code.
# Create the three reservations
bq mk --reservation --location=US --slots=50 --autoscale_max_slots=200 --edition=ENTERPRISE etl-res
bq mk --reservation --location=US --slots=0 --autoscale_max_slots=200 --edition=ENTERPRISE bi-res
bq mk --reservation --location=US --slots=0 --autoscale_max_slots=50 --edition=STANDARD adhoc-res
# Map projects to reservations (assignments)
bq mk --reservation_assignment --project_id=admin-project \
--assignee_id=prod-pipelines --assignee_type=PROJECT \
--job_type=QUERY --reservation_id=etl-res
bq mk --reservation_assignment --project_id=admin-project \
--assignee_id=staging --assignee_type=PROJECT \
--job_type=QUERY --reservation_id=etl-res
bq mk --reservation_assignment --project_id=admin-project \
--assignee_id=looker --assignee_type=PROJECT \
--job_type=QUERY --reservation_id=bi-res
bq mk --reservation_assignment --project_id=admin-project \
--assignee_id=tableau --assignee_type=PROJECT \
--job_type=QUERY --reservation_id=bi-res
bq mk --reservation_assignment --project_id=admin-project \
--assignee_id=analyst-sandbox --assignee_type=PROJECT \
--job_type=QUERY --reservation_id=adhoc-res
Step-by-step explanation.
-
etl-reshas baseline 50 (covers steady-state hourly ETL) + autoscaler 200 (handles monthly-peak spikes). Enterprise edition is required because the team uses BQML on the ETL side for anomaly detection. -
bi-reshas baseline 0 (BI is weekday-hours bursty) + autoscaler 200. Enterprise edition is required for BI Engine acceleration in Looker. -
adhoc-reshas baseline 0 + autoscaler 50 (capped because analysts do not need more, and the cap protects the bill). Standard edition (no BI Engine / BQML needed for ad-hoc). - Each assignment maps one project to one reservation for the QUERY job type. PIPELINE job types (Dataflow jobs that import via BigQuery's storage write API) inherit the org-level default unless given a separate assignment.
- Verify with
INFORMATION_SCHEMA.JOBS— every query should now show areservation_idmatching its project's assignment. Queries withreservation_idempty are still on the on-demand pool (likely a missing assignment).
Output.
| Reservation | Baseline | Max | Edition | Projects assigned |
|---|---|---|---|---|
| etl-res | 50 | 200 | Enterprise | prod-pipelines, staging |
| bi-res | 0 | 200 | Enterprise | looker, tableau |
| adhoc-res | 0 | 50 | Standard | analyst-sandbox |
Rule of thumb. Define reservations by workload class, not by team. Teams often have multiple workloads; map each workload's projects into the right class. The chargeback story is then handled by labels, not reservations.
Worked example — multi-project reservation with per-project chargeback
Detailed explanation. Two business units share a single bi-res reservation but the FinOps team needs to chargeback each unit for its actual slot-ms consumption. The pattern is to require labels on every job (set by the producing project) and aggregate by label.
Question. Configure a label policy and write the chargeback query.
Input.
| Project | Business unit label |
|---|---|
| looker | bu:marketing |
| tableau | bu:finance |
Code.
-- Chargeback query — per-BU slot-ms attribution
SELECT
-- pull the bu label
COALESCE((SELECT value
FROM UNNEST(labels)
WHERE key = 'bu'), 'unlabeled') AS business_unit,
COUNT(*) AS query_count,
SUM(total_slot_ms) / 3.6e6 AS slot_hours,
ROUND((SUM(total_slot_ms) / 3.6e6) * 0.06, 2) AS cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND reservation_id = 'bi-res'
AND job_type = 'QUERY'
GROUP BY business_unit
ORDER BY slot_hours DESC;
# Enforce labels at the producing project level via a query default
# (set in each project's BigQuery client configuration)
bq query --use_legacy_sql=false \
--label=bu:marketing \
--label=env:prod \
'SELECT ...'
Step-by-step explanation.
- Both projects share
bi-res. The reservation does not partition slots per project; it pools them across all assigned projects. - The producing project sets a
bulabel on every job (via the BigQuery client'slabelsconfig). The label is recorded inINFORMATION_SCHEMA.JOBS. - The chargeback query unnests the labels array, pulls the
buvalue, groups by it, and sumstotal_slot_msper BU. - Slot-hours × $0.06 (Enterprise rate) gives the dollar attribution per BU. Marketing gets billed for marketing-labelled slot-ms; finance for finance-labelled.
- The "unlabeled" bucket catches jobs that did not set a label — a CI check in each project should reject queries without a
bulabel so this bucket stays empty.
Output.
| Business unit | Query count | Slot-hours | Cost (USD) |
|---|---|---|---|
| marketing | 18,200 | 220.5 | $13.23 |
| finance | 9,400 | 87.2 | $5.23 |
| unlabeled | 12 | 0.3 | $0.02 |
Rule of thumb. Labels are the chargeback mechanism, not reservations. Reservations isolate workload classes; labels attribute cost owners. Enforce labels with a CI check in every producing project; the FinOps query then becomes trivial.
Worked example — idle lending across ETL and BI reservations
Detailed explanation. During weekday hours, the ETL reservation is often idle (most ETL runs hourly at HH:00–HH:15). During those windows, BI queries can borrow the idle ETL slots. The pattern is to leave ignoreIdleSlots = false on both reservations and verify the borrowing via the timeline view.
Question. Configure both reservations for lending. Show via INFORMATION_SCHEMA.RESERVATIONS_TIMELINE that BI is borrowing ETL slots during off-cycle minutes.
Input.
| Reservation | Baseline | Max | ignoreIdleSlots |
|---|---|---|---|
| etl-res | 50 | 200 | false |
| bi-res | 0 | 200 | false |
Code.
# Ensure lending is on for both
bq update --reservation --ignore-idle-slots=false etl-res
bq update --reservation --ignore-idle-slots=false bi-res
-- Per-minute slot use vs assigned slot pool
SELECT
TIMESTAMP_TRUNC(period_start, MINUTE) AS minute,
reservation_id,
SUM(period_assigned_slots) AS assigned_slots,
SUM(period_slots_used) AS slots_used,
SUM(period_slots_used) - SUM(period_assigned_slots) AS borrowed_slots
FROM `region-us`.INFORMATION_SCHEMA.RESERVATIONS_TIMELINE
WHERE period_start >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
AND reservation_id IN ('etl-res', 'bi-res')
GROUP BY minute, reservation_id
ORDER BY minute, reservation_id;
Step-by-step explanation.
- Both reservations have
ignoreIdleSlots=false. ETL's baseline 50 slots are paid-for 24/7; when no ETL job is running, those 50 are idle and lendable. - The BI reservation has baseline 0; during a 12:30 dashboard spike, it autoscales to 150 of its own slots and borrows 50 from
etl-res— total 200 in use against an own-capacity of 200 max. - The timeline view's
period_assigned_slotsis the slots BigQuery scheduled to the reservation that minute (including borrowed);period_slots_usedis what was actually consumed. The difference reveals borrowing. - When an ETL job starts at 13:00, the lender priority kicks in — within seconds, the borrowed 50 slots are reclaimed by
etl-res, andbi-resshrinks to its own pool (queuing if needed). - The bill is unchanged: ETL pays for the 50 baseline slots whether they are lent or not. BI pays only for the autoscaler-rate slots it actually used. Lending is free utilisation.
Output.
| Minute | etl-res used | bi-res used | bi-res borrowed | Notes |
|---|---|---|---|---|
| 12:00 | 50 | 100 | 0 | normal |
| 12:15 | 5 (ETL idle) | 150 | 0 | ETL idle, BI not at max |
| 12:30 | 5 | 200 | 45 | BI borrows from idle ETL |
| 13:00 | 50 (ETL job) | 150 (50 reclaimed) | 0 | ETL takes back its baseline |
Rule of thumb. Leave lending on by default. The only time to turn it off is a regulated workload that must guarantee isolation. The cost of leaving lending on is zero; the benefit is 10–20% better effective throughput across the pool.
Senior interview question on reservation design
A senior interviewer might ask: "You join a team that has 12 projects all running BigQuery on-demand. Walk me through how you'd design the reservation + assignment structure to move them to Editions while isolating production from ad-hoc."
Solution Using workload-class reservations + chargeback labels
Reservation design — 12 projects, three workload classes
========================================================
Step 1: Classify each project by workload class
-----------------------------------------------
ETL (5 projects) → etl-res
BI (4 projects) → bi-res
Ad-hoc (3 projects) → adhoc-res
Step 2: Right-size each reservation
-----------------------------------
etl-res: Enterprise, baseline 50, autoscaler max 200, lending on
bi-res: Enterprise, baseline 0, autoscaler max 300, lending on
adhoc-res: Standard, baseline 0, autoscaler max 50, lending on
Step 3: Create assignments per project
--------------------------------------
For each ETL project: --assignee=p --reservation=etl-res --job_type=QUERY
For each BI project: --assignee=p --reservation=bi-res --job_type=QUERY
For each ad-hoc: --assignee=p --reservation=adhoc-res --job_type=QUERY
Step 4: Enforce chargeback labels at producer side
--------------------------------------------------
Every query must carry: bu:<unit> env:<prod|stg|dev> team:<owner>
Step 5: Resource Monitor caps
-----------------------------
adhoc-res: 24h cap = ~2x normal daily slot-ms
etl-res: 24h cap = ~3x normal (room for backfills)
bi-res: 24h cap = ~2x normal
Step-by-step trace.
| Project class | Count | Reservation | Baseline | Max | Edition |
|---|---|---|---|---|---|
| ETL | 5 | etl-res | 50 | 200 | Enterprise |
| BI | 4 | bi-res | 0 | 300 | Enterprise |
| Ad-hoc | 3 | adhoc-res | 0 | 50 | Standard |
| Total | 12 | 3 reservations | 50 | 550 effective max | mixed |
Three reservations cover all 12 projects, isolating workload classes while letting idle lending fill the gaps between them. Resource Monitors cap each scope. Labels handle chargeback.
Output:
| Concern | Solved by |
|---|---|
| Workload isolation | one reservation per workload class |
| Cost per BU | labels on every job + INFORMATION_SCHEMA.JOBS query |
| Worst-case bill | Resource Monitor per reservation, 24h slot-ms cap |
| Spare-capacity utilisation |
ignoreIdleSlots=false (lending on) across reservations |
| Edition cost | Standard for ad-hoc, Enterprise for ETL + BI |
Why this works — concept by concept:
- One reservation per workload class — different SLOs and latency tolerances. Mixing them creates contention; isolating them makes both the bill and the SLO predictable.
- Assignments are the project-to-reservation map — without an assignment, a project falls back to on-demand. Always create explicit assignments during Editions migration so no project leaks back to per-byte pricing.
- Labels for chargeback — reservations isolate workloads; labels attribute cost. Both layers are needed; neither alone is enough.
- Idle lending fills gaps — paid-for baseline slots should always be available to other reservations when idle. Free utilisation, no downside outside regulated workloads.
- Resource Monitor caps the worst case — without a cap, a misbehaving job can drive the bill to the autoscaler max all day. A 24-hour slot-ms cap at ~2x normal is the safety net.
- Cost — bill = Σ_reservations (baseline × month-hours × rate + autoscaler-hours × PAYG). Reservation design is the structural part of the cost equation; query optimisation is the variable part. Both matter.
ETL
Topic — etl
ETL workload-design problems
5. FinOps practices — monitoring, attribution, alerts
bigquery finops is JOBS-view attribution + Resource Monitor caps + per-team labels — predict the bill before the bill arrives
The mental model in one line: FinOps on BigQuery is the loop of attribute every slot-ms (via JOBS view + labels), cap the worst case (via Resource Monitor), alert on drift (via Cloud Monitoring + cost dashboards), and harvest idle (via lending + per-team-reservation rebalancing) — done daily, not monthly. Once you say "attribute, cap, alert, harvest," the entire bigquery finops interview surface becomes a sequence of operational disciplines, not a one-off design.
INFORMATION_SCHEMA.JOBS — the canonical attribution source.
- Every completed query produces one row in
INFORMATION_SCHEMA.JOBSwith:job_id,user_email,total_slot_ms,total_bytes_billed,reservation_id,labels,creation_time,end_time, plus statement-level metadata. - The view is region-scoped — query
region-us.INFORMATION_SCHEMA.JOBS,region-eu.INFORMATION_SCHEMA.JOBS, etc. There is no global view; union the regions if needed. - The view shows the past 180 days by default. For longer retention, export to a permanent BigQuery dataset via scheduled query.
- Slot-ms is the unit; everything else is metadata. Sum slot-ms grouped by whatever attribution dimension matters (user, team, project, label, reservation).
Labels — the cost-owner dimension.
- Every job can carry an arbitrary set of
(key, value)labels set by the client (CLI flag, SDK config, or query option). - Recommended schema:
bu:<unit>,env:<prod|stg|dev>,team:<owner>,workload:<etl|bi|adhoc>,pipeline:<name>(for ETL jobs). - Enforce labels at the producer. Add a CI check in each project's CI/CD pipeline that rejects queries without the required label keys.
- Labels are queryable via
UNNEST(labels)in the JOBS view; aggregate by label to produce chargeback / showback reports.
Resource Monitor — the budget circuit breaker.
- Caps a scope's total slot-ms in a fixed time window (typically 24 hours).
- Once the cap is hit, new jobs in the scope queue or fail; existing jobs run to completion.
- Set the cap at ~2x the scope's normal daily slot-ms. Too tight and you throttle normal work; too loose and the cap is meaningless.
- Best practice: one Resource Monitor per reservation, with the cap aligned to the reservation's intended monthly budget ÷ 30.
Cost dashboards — Looker Studio + JOBS view.
- Daily cost dashboard. A Looker Studio dashboard that pulls from JOBS view and shows: daily slot-hours by reservation, daily cost by reservation, top-N most expensive queries, week-over-week drift, queries without labels.
- Weekly drift alert. A scheduled query that compares this week's slot-hours per (bu, team) to the trailing 4-week median and posts to Slack when drift exceeds 30%.
-
Per-query cost ranking. A daily report that lists the top 20 slot-ms-consuming queries with their
job_id,user_email, and a rough cost estimate. This is the FinOps team's hit list for optimisation.
Per-team labels for chargeback / showback.
- Chargeback. The cost is billed back to the team that incurred it (their cloud budget shrinks). Requires labels enforced at producer side and a finance-grade dashboard.
- Showback. The cost is reported to the team but not actually billed back. Lower friction, useful when you want awareness without changing budget ownership.
- Pick one and stick to it. Switching between chargeback and showback every quarter destroys the incentive signal.
Idle slot harvesting — borrow vs hoard.
-
Borrow (lend default). Reservations with
ignoreIdleSlots=falselend their idle baseline slots to other reservations. Free utilisation across the org. -
Hoard (ignore idle). Reservations with
ignoreIdleSlots=truekeep their baseline slots private. Only for regulated workloads with hard isolation needs. - Harvest pattern. Periodically (monthly) re-balance baselines across reservations based on observed usage. Take baseline from over-provisioned reservations and give it to under-provisioned ones.
Predict the bill before running — the senior signal.
-
Dry-run estimate.
bq query --dry_runreturns the bytes scanned (on-demand) or the slot-ms estimate (Editions, approximate) without running the query. - Pre-flight wrapper. Wrap every analyst query with a check: if dry-run > $1 per query, require explicit acknowledgement before running.
- Materialised view sniff. If a query repeatedly hits the same large table with the same filter, propose a materialised view to the team.
- Partition + cluster discipline. Every fact table > 1 TB should be partitioned and clustered. The two together drop slot-ms 5–10x on filtered queries.
Cap before complaining — the operational discipline.
- The first time a new team or workload is onboarded, set the Resource Monitor cap immediately. Do not wait for the first runaway query.
- Every new reservation gets a cap as part of its creation runbook. No exceptions.
- Caps are a contract: "you can use up to X slot-ms per day; above that, queue." This is the budget ceiling.
- Complaints about being capped are signals to investigate, not bypasses to grant. The cap should be re-sized based on real growth, not lifted under pressure.
Common interview probes on FinOps practices.
- "How do you find the most expensive queries?" — JOBS view, ORDER BY total_slot_ms DESC LIMIT 20.
- "How do you cap a team's BigQuery spend?" — Resource Monitor on the project or reservation, 24-hour slot-ms cap at ~2x normal.
- "How do you do chargeback?" — labels on every job + JOBS view aggregation; producer-side CI enforces label presence.
- "How do you predict cost before running?" —
--dry_runfor bytes; static analysis of slot-ms-per-byte from past queries on the same table.
Worked example — top-20 expensive queries report
Detailed explanation. A daily FinOps report should surface the 20 most expensive queries of the previous day, with the user, team, and rough cost. This is the optimisation hit list.
Question. Write the SQL for a daily top-20 expensive-queries report, including user, team label, slot-hours, and cost at Enterprise rate.
Input.
| Source | Field |
|---|---|
| INFORMATION_SCHEMA.JOBS | total_slot_ms, user_email, labels, job_id, creation_time |
Code.
WITH yesterday AS (
SELECT
job_id,
user_email,
COALESCE((SELECT value FROM UNNEST(labels) WHERE key = 'team'), 'unlabeled') AS team,
COALESCE((SELECT value FROM UNNEST(labels) WHERE key = 'bu'), 'unlabeled') AS bu,
total_slot_ms,
total_bytes_billed,
reservation_id,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE DATE(creation_time) = CURRENT_DATE() - 1
AND job_type = 'QUERY'
AND state = 'DONE'
)
SELECT
job_id,
user_email,
team,
bu,
reservation_id,
total_slot_ms AS slot_ms,
total_slot_ms / 3.6e6 AS slot_hours,
ROUND((total_slot_ms / 3.6e6) * 0.06, 4) AS cost_usd_enterprise,
total_bytes_billed / POW(10, 9) AS gb_billed
FROM yesterday
ORDER BY total_slot_ms DESC
LIMIT 20;
Step-by-step explanation.
- Pull yesterday's completed QUERY jobs with their slot-ms, bytes billed, reservation, labels, and user.
- Extract the
teamandbulabels withUNNEST(labels)+WHERE key = ...subselects. Missing labels default to 'unlabeled'. - Compute slot-hours and Enterprise-rate cost in the SELECT; round to 4 decimals for readability.
- Order by total_slot_ms descending — the worst offender is on top — and limit to 20 rows for the daily standup.
- Schedule this query daily (e.g. via BigQuery Scheduled Queries or Airflow) and email or Slack the result to the FinOps channel.
Output.
| job_id | user | team | slot_hours | cost (USD) | gb_billed |
|---|---|---|---|---|---|
| job_abc1 | dana@co | etl | 12.3 | $0.74 | 480 |
| job_abc2 | erika@co | bi | 8.2 | $0.49 | 320 |
| job_abc3 | mark@co | etl | 6.7 | $0.40 | 250 |
| ... | ... | ... | ... | ... | ... |
| job_abc20 | sam@co | adhoc | 0.9 | $0.05 | 30 |
Rule of thumb. Run the top-20 report daily, not monthly. Catching a runaway query the morning after costs cents; catching it on the month-end bill costs hundreds or thousands.
Worked example — Resource Monitor cap on a team's daily slot-ms
Detailed explanation. The analyst sandbox team has been doing 50 slot-hours per day. You want a cap that protects against a 10x spike (someone runs SELECT * on the prod fact table). Set a Resource Monitor at 100 slot-hours per 24 hours.
Question. Configure the Resource Monitor for adhoc-res. Show what happens when the cap is hit.
Input.
| Metric | Value |
|---|---|
| Normal daily slot-hours | 50 |
| Cap target | 100 slot-hours (2x) |
| Cap window | 24 hours |
| Reservation | adhoc-res |
Code.
# Create a Resource Monitor capping adhoc-res at 100 slot-hours / day
# (slot-hours × 3.6e6 = slot-ms)
gcloud bq resource-monitors create adhoc-daily-cap \
--parent=organizations/123456789 \
--reservation=projects/admin-project/locations/us/reservations/adhoc-res \
--slot-ms-limit=360000000 \
--window=DAILY
-- Verify the current 24h consumption against the cap
SELECT
SUM(total_slot_ms) AS slot_ms_24h,
SUM(total_slot_ms) / 3.6e6 AS slot_hours_24h,
360000000 AS cap_slot_ms,
100 AS cap_slot_hours,
ROUND(SUM(total_slot_ms) / 3.6e6 / 100 * 100, 1) AS pct_of_cap
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE reservation_id = 'adhoc-res'
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
Step-by-step explanation.
-
adhoc-resnormally consumes 50 slot-hours / day = 180,000,000 slot-ms. Doubling that gives the cap: 100 slot-hours = 360,000,000 slot-ms. - The Resource Monitor is created at the org level pointing at the reservation. Its slot-ms limit is the 24-hour cap.
- Throughout the day, the verify query shows current consumption vs cap. At 80% of cap, alerts should fire (Cloud Monitoring on the resource-monitor metric).
- When consumption hits 100%, new jobs in the reservation queue (default) or fail (if so configured). Running jobs complete normally.
- The cap resets every 24 hours (rolling window). Investigate any day that hits cap; resize cap only if the new normal justifies it.
Output.
| Time | Slot-hours used | % of cap | Action |
|---|---|---|---|
| 06:00 | 12 | 12% | normal |
| 12:00 | 38 | 38% | normal |
| 14:30 | 80 | 80% | alert fires (Cloud Monitoring) |
| 15:00 | 100 | 100% | new jobs queue |
| next day 06:00 | (reset) | 0% | normal resumes |
Rule of thumb. Set the cap at ~2x normal usage and the alert at 80% of cap. The 2x cap absorbs legitimate spikes; the 80% alert gives the team time to investigate before the cap fires.
Worked example — per-team weekly chargeback report
Detailed explanation. The FinOps team produces a weekly chargeback report that shows each business unit's cost and the top-3 queries that drove it. Stakeholders see this in their Monday standup; large drifts trigger a follow-up.
Question. Write the weekly chargeback SQL with per-BU totals and top-3 queries per BU.
Input.
| Field | Source |
|---|---|
| labels.bu | JOBS view |
| total_slot_ms | JOBS view |
| job_id, user_email | JOBS view |
Code.
WITH last_week AS (
SELECT
COALESCE((SELECT value FROM UNNEST(labels) WHERE key = 'bu'), 'unlabeled') AS bu,
job_id,
user_email,
total_slot_ms,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP() - INTERVAL 7 DAY, WEEK(MONDAY))
AND creation_time < TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), WEEK(MONDAY))
AND job_type = 'QUERY'
),
totals AS (
SELECT
bu,
COUNT(*) AS query_count,
SUM(total_slot_ms) / 3.6e6 AS slot_hours,
ROUND((SUM(total_slot_ms) / 3.6e6) * 0.06, 2) AS cost_usd
FROM last_week
GROUP BY bu
),
ranked AS (
SELECT
bu, job_id, user_email, total_slot_ms,
ROW_NUMBER() OVER (PARTITION BY bu ORDER BY total_slot_ms DESC) AS rk
FROM last_week
)
SELECT
t.bu,
t.query_count,
t.slot_hours,
t.cost_usd,
STRING_AGG(CONCAT(r.job_id, ' (', CAST(ROUND(r.total_slot_ms / 3.6e6, 2) AS STRING), ' sh)'),
', ' ORDER BY r.total_slot_ms DESC) AS top_3
FROM totals t
LEFT JOIN ranked r ON r.bu = t.bu AND r.rk <= 3
GROUP BY t.bu, t.query_count, t.slot_hours, t.cost_usd
ORDER BY t.slot_hours DESC;
Step-by-step explanation.
- The
last_weekCTE pulls all queries from the previous full ISO week (Monday–Sunday) and extracts thebulabel. - The
totalsCTE aggregates per BU: count, slot-hours, cost at Enterprise rate. - The
rankedCTE ranks each BU's queries by slot-ms descending; the top 3 are the BU's biggest drivers. - The final SELECT joins totals with the top-3 queries and string-aggregates them for a one-row-per-BU report.
- Schedule this as a Sunday-night job; deliver to the Monday FinOps standup. Large WoW deltas (>30%) get a follow-up investigation.
Output.
| BU | query_count | slot_hours | cost_usd | top_3 |
|---|---|---|---|---|
| marketing | 12,400 | 285.3 | $17.12 | job_a (12.1 sh), job_b (8.5 sh), job_c (6.7 sh) |
| finance | 8,200 | 142.7 | $8.56 | job_d (9.0 sh), job_e (5.5 sh), job_f (4.0 sh) |
| ops | 4,100 | 65.0 | $3.90 | job_g (3.5 sh), job_h (2.5 sh), job_i (2.0 sh) |
| unlabeled | 22 | 0.5 | $0.03 | job_j (0.3 sh), job_k (0.1 sh) |
Rule of thumb. Weekly cadence is the right granularity for chargeback. Daily is too noisy; monthly is too late. Weekly hits the sweet spot where teams have a chance to course-correct.
Senior interview question on operational FinOps cadence
A senior interviewer might frame this as: "You inherit a BigQuery setup with $50K/month spend, no labels, no Resource Monitor, no cost dashboards. Walk me through the first 30 days of FinOps work — what do you build, in what order, and what evidence do you collect?"
Solution Using a 4-week FinOps onboarding plan
30-day FinOps onboarding — BigQuery from $50K/month
====================================================
Week 1 — Visibility
-------------------
- Build daily top-20 expensive queries report (JOBS view + Looker Studio)
- Add daily slot-hours by reservation chart
- Audit reservations: baselines, autoscaler maxes, ignoreIdleSlots flags
- Audit assignments: every project assigned, no on-demand leakage
Week 2 — Attribution
--------------------
- Define label schema: bu, env, team, workload, pipeline
- Roll out labels in 1-2 high-spend producing projects first
- Add CI check in those projects rejecting unlabeled queries
- Per-team chargeback dashboard published
Week 3 — Caps
-------------
- Resource Monitor on every reservation: ~2x current daily slot-ms
- 80%-of-cap Cloud Monitoring alerts to Slack
- Document cap-resize SOP (who, why, when)
- Communicate caps to all team leads
Week 4 — Optimisation
---------------------
- Top-3 queries from week 1's daily reports → optimisation backlog
- Cluster + partition audit on top-3 source tables
- Materialised view candidates identified for repeated heavy queries
- 30-day cost trend report; project next-30 spend
Step-by-step trace.
| Week | Activity | Output | Spend impact |
|---|---|---|---|
| 1 | Visibility | top-20 report, reservation audit | none yet; baseline |
| 2 | Attribution | label schema + CI check + chargeback dashboard | -5% (awareness alone) |
| 3 | Caps | Resource Monitors + alerts + SOP | -10% (worst-case avoided) |
| 4 | Optimisation | rewrites + clustering + MVs | -25% (real workload reductions) |
By week 4 the team has visibility (week 1), attribution (week 2), worst-case protection (week 3), and concrete optimisation work in flight (week 4). The combined spend impact in the first quarter is typically 30–40%.
Output:
| Deliverable | Owner | When |
|---|---|---|
| Top-20 daily report | FinOps lead | end of week 1 |
| Chargeback dashboard | FinOps + data eng | end of week 2 |
| Resource Monitor caps | FinOps + reservation owners | end of week 3 |
| Optimisation backlog | data eng leads | end of week 4 |
| 30-day cost trend report | FinOps lead | end of week 4 |
Why this works — concept by concept:
- Visibility first — you cannot reduce what you cannot see. Week 1 builds the JOBS-view-driven dashboards; everything after stands on those.
- Attribution before caps — caps without attribution feel arbitrary. Labels first; then caps become "this team is using more than X — here is exactly why."
- Caps before optimisation — optimisation takes weeks. Caps prevent disasters now. Stop the bleeding, then heal.
- Optimisation is the longest tail — partition + cluster + materialised view rewrites compound over months. Treat them as a backlog, not a sprint.
- Predictable cadence — daily top-20, weekly chargeback, monthly resize. Each cadence catches a different size of problem.
- Cost — FinOps engineer time is O(weeks); savings are O(months × bill). The discipline pays for itself within one quarter on any account spending > $10K/month.
SQL
Topic — sql
SQL FinOps and attribution drills
ETL
Topic — etl
ETL cost-attribution problems
Cheat sheet — BQ FinOps recipes
- Edition-pick decision tree. Standard if no premium features needed. Enterprise if you need BI Engine, BQML, Gemini, CMEK, or row-level security. Enterprise Plus only for cross-region replication, Cloud EKM, or regulated industries. Default down, not up.
- Baseline + autoscaler formula. Duty cycle < 30% → baseline 0 + autoscaler max = P99 demand. 30–60% → small baseline ≈ average light demand + autoscaler max = P99. > 60% → baseline ≈ steady-state floor + autoscaler max = P99 + 1-year commit at 20% off.
-
Per-team JOBS attribution query.
SELECT bu, SUM(total_slot_ms)/3.6e6 AS slot_hours FROM INFORMATION_SCHEMA.JOBS, UNNEST(labels) WHERE key='bu' GROUP BY bu;— the one-line chargeback core. - Resource Monitor cap. Always set per reservation at ~2x normal daily slot-ms. Alert at 80% of cap to Cloud Monitoring → Slack. Caps reset every 24 hours.
-
Idle reservation lending pattern.
ignoreIdleSlots=falseon every reservation by default. Lending is free utilisation. Only hoard for regulated workloads with hard isolation needs. - Per-second billing rule. Since 2023, no 60-second floor. Small fast queries are no longer punished. Autoscaler can scale up and down within the same minute.
- Slot-ms is the unit of cost. Convert every quote: slot-ms ÷ 3.6e6 = slot-hours; slot-hours × $0.06 (Enterprise) = USD. Wall-clock seconds is meaningless without slot count.
- Commitment discount break-even. 1-year commit (20% off) pays off when baseline utilisation > 80%. 3-year commit (40% off) pays off when > 60%. Below those, PAYG autoscaler wins.
-
Dry-run before every expensive query.
bq query --dry_runreturns the bytes (on-demand) or slot-ms estimate (Editions). Reject any analyst query estimated > $1 without explicit ack. - Partition + cluster discipline. Every fact table > 1 TB must be partitioned by ingestion or event date and clustered by the top 1-4 filter keys. Together they drop slot-ms 5–10x.
- Materialised view candidates. Any query that hits the same large table with the same filter > 10x per day is a materialised view candidate. MVs are auto-refresh on Enterprise.
-
Label enforcement at producer. Every CI/CD pipeline that runs BigQuery queries must reject unlabeled queries.
bu,env,team,workload,pipelineis the recommended schema. - JOBS view retention. The default is 180 days. For year-over-year cost analysis, export JOBS to a permanent dataset via daily scheduled query.
- Resize cadence. Reservation baselines and autoscaler maxes: review monthly. Resource Monitor caps: review monthly with the chargeback report. Editions: review yearly or on major workload changes.
- No on-demand leakage. Every project must have a reservation assignment. A project without an assignment falls back to on-demand pricing — a silent bill leak during Editions migration.
Frequently asked questions
BigQuery on-demand vs editions — which is cheaper?
It depends on your workload shape, not your data volume. On-demand charges per byte scanned (≈ $6.25/TB) with no slot capacity to manage. Editions charges per slot-hour ($0.04 Standard / $0.06 Enterprise / $0.10 Enterprise Plus) and you size baseline + autoscaler. Pull the last 30 days of INFORMATION_SCHEMA.JOBS, compute total slot-hours, multiply by your edition rate, and compare to your on-demand bill — that is the only honest answer. As a rough rule, workloads below $200/month on on-demand stay there for the simplicity; above $500/month they almost always benefit from Editions Standard. The crossover is workload-shaped because some queries are byte-heavy (favours Editions) and others are CPU-heavy (favours on-demand) — re-run the math quarterly.
What is a BigQuery slot?
A bigquery slot is one unit of parallel compute — roughly 1 vCPU + memory + shuffle bandwidth — that the Dremel query engine dispatches in parallel across the sub-steps of a query plan. A query with a wide shuffle on a multi-billion-row table can use thousands of slots concurrently; a small scan on a single partition uses a handful. Billing is in slot-ms (slots × milliseconds), aggregated to slot-hours (÷ 3.6 million) and multiplied by the per-slot-hour rate of your edition. Slot-ms is the only atomic cost unit in BigQuery; wall-clock seconds are meaningless without the slot count, because a 1-second query at 100 slots costs the same as a 100-second query at 1 slot.
What's the difference between baseline and autoscaler?
Baseline is the always-on slot floor — slots you pay for 24/7 whether you use them or not, eligible for 20%/40% commitment discounts at 1y/3y. Autoscaler is the elastic ceiling — slots that scale automatically from baseline up to a configured max in 100-slot increments, billed only for the seconds they are active at the pay-as-you-go Editions rate. The default pattern in 2026 is baseline 0 + autoscaler max = P99 of per-minute slot demand for bursty workloads. A baseline > 0 only pays off when duty cycle exceeds 60% and you can lock in a 1y/3y commit at 20%/40% off. For weekday-hours-only BI workloads (duty cycle ~30%), baseline 0 wins by a large margin; for 24/7 ETL with steady state, a baseline at the steady-state floor + commit discount wins.
How do I cap BigQuery spending per team?
Two layers. Layer 1 — Resource Monitor: create one per reservation (or per project) capping the 24-hour total slot-ms. Set the cap at ~2x normal daily slot-ms and alert at 80% of cap via Cloud Monitoring → Slack. When the cap fires, new jobs in the scope queue or fail. Layer 2 — labels + chargeback: require every job to carry bu, team, env labels (CI-enforced in the producing project's pipeline), and aggregate slot-ms by label in INFORMATION_SCHEMA.JOBS for a weekly chargeback/showback report. The Resource Monitor is the hard cap — your worst-case bill protection. Labels are the attribution — they tell you which team is responsible for which slot-ms. You need both to do real bigquery finops; one alone is not enough.
Can I share slots across projects?
Yes — that is exactly what bigquery reservations are for. One reservation is a slot pool that can be assigned to multiple projects via separate assignments. The slots are pooled (no per-project carve-out within a reservation), and the per-project chargeback comes from labels on each job. If you need hard per-project isolation (regulated workloads), create separate reservations per project. The senior pattern is one reservation per workload class (ETL, BI, ad-hoc), each serving multiple projects, with ignoreIdleSlots=false so idle slots in one reservation are lent to busy ones — free utilisation across the org. Lending is automatic and bounded; the lender always has priority on return.
BQ Enterprise vs Enterprise Plus — when?
Enterprise ($0.06/slot-hour) unlocks BI Engine, BQML, Gemini in BigQuery, customer-managed encryption keys (CMEK), column-level security, dynamic data masking, materialised views with auto-refresh — the full feature set that 95% of analytics teams need. Enterprise Plus ($0.10/slot-hour, +67%) adds cross-region dataset replication, Cloud EKM (external key management), advanced ML features, enhanced VPC-SC, longer time-travel retention — features specific to regulated industries (finance, healthcare, government), multi-region active-active DR, and customer-managed key escrow. The test: if your security architect cannot name three Enterprise Plus features off the top of their head, you do not need this tier. Pick down by default; you can upgrade to Enterprise Plus later without re-platforming when a real feature requirement appears.
Practice on PipeCode
- Drill the SQL practice library → for the INFORMATION_SCHEMA.JOBS aggregation, label unnesting, and windowed-percentile queries this guide leans on.
- Rehearse query optimization drills → for the partition + cluster + materialised view rewrites that drop slot-ms 5–10x.
- Sharpen the ETL pipeline library → for the batch-vs-streaming reservation-sizing patterns senior engineers use to size production capacity.
- Layer data modeling problems → on top to lock in the partition / cluster discipline at table-design time, not at firefighting time.
- For the broader interview surface, browse the full PipeCode library at pipecode.ai → and pick the workload axis (SQL, optimization, ETL) you want to drill next.
Lock in BQ FinOps muscle memory
BQ docs explain pricing tables. PipeCode drills explain the decision — when Editions beats on-demand, when autoscaler caps save the bill, when to enforce a Resource Monitor per team. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.





Top comments (0)