bigquery studio is the single biggest workbench change Google has shipped to BigQuery since the standard-SQL migration in 2018 — and the one most senior data engineers underuse, because they still think of BQ as "the SQL console plus the API" instead of as a unified workspace where SQL, Python notebooks, visual pipelines, and bigquery gemini assistance all share the same query engine. In 2026, the GCP console's default BQ entry point is Studio, the Gemini integration is GA across the surface, and the line between "data engineer" and "analytics engineer" runs right through the question of whether you actually use the notebook and canvas surfaces or fall back to the editor by reflex.
This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "how do you use gemini in bigquery for production-grade SQL without shipping a hallucinated join condition?" or "when does a bigquery data canvas export beat a hand-rolled scheduled query?" It walks through the four Studio surfaces — the Gemini-assisted SQL editor (auto-complete, explain-this-query, NL-to-SQL with schema grounding), bigquery notebooks (Colab-style Python + SQL cells, bigframes pandas-on-BigQuery, BQML training inline), the visual Data Canvas (drag-and-drop branch-and-merge pipelines that compile to SQL), and Studio's positioning against Looker and Vertex Workbench — and finishes with the senior-interview-grade separation-of-concerns between workbench, BI tool, and ML platform. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on ETL practice problems →, and sharpen the warehouse-aggregation axis with the aggregation drills →.
On this page
- Why Studio is the new BQ workbench
- SQL editor + Gemini assist
- Notebooks — Python + SQL cells
- Data Canvas — visual exploration
- Studio vs Looker vs Vertex — choosing
- Cheat sheet — Studio + Gemini recipes
- Frequently asked questions
- Practice on PipeCode
1. Why Studio is the new BQ workbench
bigquery studio collapses the SQL editor, the notebook, and the visual canvas into one workspace — the question is no longer "which tool" but "which surface for which task"
The mental model in one line: BigQuery Studio is a unified workspace sharing one query engine, one IAM surface, one billing scope, and one set of datasets across SQL editing, Python notebooks, Data Canvas pipelines, and Gemini-assisted code generation. Before Studio, those four surfaces lived in four different products — the legacy SQL editor, Vertex Workbench for notebooks, a third-party tool like Tableau Prep for canvas-style flows, and no AI assist at all. Studio turns them into tabs in the same console, reading and writing the same datasets through the same access policies.
The four must-answer axes interviewers actually probe.
-
SQL editor with Gemini. The BQ SQL pane has
bigquery ai assistbaked in — auto-complete for whole window functions and CTEs, "Explain this query" prose narration, and atext to sql bigqueryprompt grounded on dataset schema. The senior question is never "does Gemini work?" — it is "how do you keep wrong answers out of production?" -
Notebooks. Colab-style runtime sandboxes mixing SQL and Python cells. The Python kernel ships with
bigframes(pandas-on-BQ) andbigframes.ml(BQML wrapper). This is the surface that replaces Vertex Workbench for BQ-centric ML. - Data Canvas. A drag-and-drop visual editor for branch-and-merge SQL transformations; the graph exports to a single SQL script or schedules as a query. BigQuery is the engine — no intermediate Excel or in-memory tier.
- AI assist across the surface. Gemini is a sidecar in every surface — completes in the editor, generates cells in notebooks, suggests joins on canvas. Senior engineers treat it as a paired pilot, never the captain.
Why the unification matters in 2026.
-
One IAM surface. A
roles/bigquery.dataVieweron the source dataset androles/bigquery.jobUseron the project is enough to read in SQL, in a notebook, and on the canvas. No separate notebook-runtime IAM. - One billing scope. Every query — editor, notebook cell, canvas, or Gemini — bills the same project against the same on-demand or reservation slot pool. Slot usage is observable in one INFORMATION_SCHEMA view.
-
One copy of the data.
bigframesoperations stay server-side; the canvas's intermediate steps are CTEs, not materialised tables. The "single source of truth" promise is preserved. - Gemini grounded on schema. Every Gemini call is grounded with the project's schema metadata (table names, column types, descriptions, PK tags). That cuts hallucinated column names dramatically.
The 2026 reality — what changed since 2023.
- Studio is the default GCP console BQ entry since mid-2024 — the classic SQL editor is reachable but no longer the default.
- Gemini is GA across the surface. The old "Duet AI for BigQuery" brand was renamed and unified under Gemini in BigQuery; SQL-assist, notebook-assist, and canvas-assist are GA.
-
bigframes1.x is stable. Notebooks default tobigframesfor DataFrame work rather thanpandas+bqclient. - Data Canvas exports clean SQL. The 2026 exporter produces idiomatic CTEs with named steps that map back to canvas nodes.
What interviewers listen for.
- Saying "Studio shares one query engine, one IAM, and one billing scope across editor, notebook, and canvas" in the first sentence — senior signal.
- Describing Gemini as a paired pilot with a mandatory human review step on generated SQL — required answer.
- Naming
bigframespushdowns as why notebook ETL is cheap and Vertex Workbench is no longer the default for BQ workflows — senior signal. - Pushing back on "Studio replaces Looker" with "Studio is a workbench, Looker is a governed-BI semantic layer — different sides of the build-vs-serve line" — senior signal.
Worked example — same task, three surfaces
Detailed explanation. A common interviewer warm-up — "monthly revenue per region for the last 12 months on a sales table partitioned by order_date and clustered by region. Show me this in the SQL editor, in a notebook with bigframes, and on the Data Canvas. Which surface fits each scenario: ad-hoc, one-off chart in a model card, production-scheduled query?"
The question tests whether you can map the same logical query onto each Studio surface and which surface is the right home for which lifecycle stage.
Question. Express the query in the SQL editor (raw SQL), in a notebook (bigframes), and as a Data Canvas pipeline. State which surface fits each of the three scenarios.
Input.
| surface | scenario fit | input artefact |
|---|---|---|
| SQL editor | ad-hoc exploration | typed query |
Notebook + bigframes
|
one-off chart in a model card |
.ipynb with markdown + cell |
| Data Canvas | production-scheduled query | canvas graph exported to SQL |
Code.
-- Surface 1: SQL editor (Gemini may auto-complete this from "monthly revenue per region")
SELECT
region,
DATE_TRUNC(order_date, MONTH) AS month,
SUM(amount) AS revenue
FROM `proj.sales.orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
GROUP BY region, month
ORDER BY region, month;
# Surface 2: Notebook with bigframes (pandas-style, pushes down to BQ)
import bigframes.pandas as bpd
orders = bpd.read_gbq("proj.sales.orders")
last12 = orders[orders["order_date"] >= bpd.Timestamp.now() - bpd.Timedelta(days=365)]
monthly = (
last12
.assign(month=last12["order_date"].dt.to_period("M").dt.to_timestamp())
.groupby(["region", "month"])["amount"]
.sum()
.reset_index()
.sort_values(["region", "month"])
)
# bigframes only materialises when you call .to_pandas() or render — until then it's a SQL plan
monthly.head()
# Surface 3: Data Canvas (described as a graph; export produces SQL below)
Node 1 — Source : proj.sales.orders
Node 2 — Filter : order_date >= CURRENT_DATE - INTERVAL 12 MONTH
Node 3 — Derive column : month = DATE_TRUNC(order_date, MONTH)
Node 4 — Group + aggregate: GROUP BY region, month → SUM(amount) AS revenue
Node 5 — Sort : region ASC, month ASC
Node 6 — Sink : proj.sales.monthly_revenue_by_region (overwrite)
Step-by-step explanation.
- All three surfaces emit BigQuery SQL — the editor types it,
bigframescompiles pandas to SQL on.to_pandas(), the canvas exports its node graph to CTEs. - The SQL editor is the fastest path for ad-hoc exploration; Gemini auto-complete shortens typing but does not change what runs.
-
bigframesshines when surrounding work is Python — feature engineering, matplotlib plotting, sklearn calls. SQL pushdown means no data leaves BQ. - The canvas wins when branching logic + non-engineer maintainers meet — the visual flow is easier to onboard onto than a 200-line CTE chain.
- Mapping surfaces to scenarios: editor for ad-hoc; notebook for narrative one-shots; canvas for visual handoff to scheduled queries.
Output.
| scenario | best surface | why |
|---|---|---|
| Ad-hoc exploration | SQL editor | fastest iteration; Gemini accelerates typing |
| One-off chart in a model card | Notebook + bigframes
|
inline Python, matplotlib, markdown narrative |
| Production-scheduled query | Data Canvas → exported SQL | visual flow + scheduled-query target |
Rule of thumb. Pick the surface matching the lifecycle stage of the work, not the one you're most fluent in. Studio's whole point is that switching surfaces costs zero IAM, zero data movement, and zero context.
Worked example — what Studio replaces (and what it does not)
Detailed explanation. A senior probe — "what tools does Studio actually replace in a typical GCP data stack, and what is still needed alongside?" Studio replaces some adjacent tools fully, partially replaces others, and is irrelevant to a third group. Mapping that surface area precisely is a senior signal.
Question. For each adjacent tool (Vertex Workbench, Looker, Composer, Dataform, Dataproc), state whether Studio fully replaces it, partially replaces it, or is unrelated.
Input.
| Adjacent tool | Primary use case |
|---|---|
| Vertex Workbench | General-purpose ML notebook runtime |
| Looker | Governed BI semantic layer for end users |
| Cloud Composer | Workflow orchestration across many services |
| Dataform | Version-controlled SQL transformation (dbt-like) |
| Dataproc | Spark/Hadoop cluster runtime |
Code.
Replacement matrix — Studio vs adjacent tools
=============================================
Vertex Workbench → Fully replaced for BQ-centric notebooks (bigframes + BQML)
Still needed for GPU training, non-BQ ML, Model Registry
Looker → NOT replaced — Studio is build, Looker is serve
Composer → NOT replaced — no DAG scheduler in Studio
Dataform → Partial overlap — both produce versioned SQL
Dataform wins on Git + assertions + dependency graph
Dataproc → NOT replaced — no Spark engine in Studio
Step-by-step explanation.
- Vertex Workbench loses ground for BQ-centric ML because
bigframes+ BQML cover the warehouse-native model lifecycle inside Studio. Workbench is still required for GPU, custom dependencies, or Vertex Model Registry. - Looker is the most common "Studio replaces this" misconception — Looker is a semantic layer (LookML dimensions, measures, explores). Studio has no semantic layer. The build-vs-serve line separates them.
- Composer/Airflow is unaffected — Studio has no DAG scheduler. Canvas exports to a single scheduled query, not multi-step DAGs with retries and SLAs.
- Dataform sits closest to overlap. Both produce versioned SQL. Dataform is Git-native, has assertions, supports a multi-table dependency graph. Pattern: prototype on canvas, productionise in Dataform.
- Dataproc owns Spark workloads. Studio cannot run Spark; for Spark, Dataproc remains the answer.
Output.
| Tool | Studio's effect | Recommendation |
|---|---|---|
| Vertex Workbench | Fully replaced for BQ-centric ML | Keep for GPU + non-BQ ML |
| Looker | Not replaced | Studio builds, Looker serves |
| Composer | Not replaced | Composer owns cross-service DAGs |
| Dataform | Partial overlap | Canvas for proto, Dataform for prod |
| Dataproc | Not replaced | Dataproc owns Spark |
Rule of thumb. If a tool is BQ-specific and single-user-facing, Studio absorbs it. If it is cross-service, governed-multi-user, or non-SQL, Studio sits alongside it.
Worked example — the four-axis Studio readiness check
Detailed explanation. Before a team migrates from "classic BQ console + separate notebook IDE" to Studio, a senior engineer runs a four-axis readiness check — SQL editor depth, notebook depth, canvas appetite, Gemini governance. Each axis has a pass/fail criterion and a remediation if the team fails.
Question. Define the four-axis Studio readiness check. For each axis, give the pass/fail criterion and the action on fail.
Input.
| Team / workload | SQL editor | Notebook | Canvas | Gemini comfort |
|---|---|---|---|---|
| ETL backend | Heavy | Medium | Low | High |
| Analytics | Heavy | High | Medium | Medium |
| ML team | Medium | Heavy | Low | High |
| Business analysts | Light | Light | Heavy | Low |
Code.
Studio readiness check
======================
Axis 1 — SQL editor depth
Pass: team writes >50% of queries in BQ SQL today
Action on fail: pilot the editor for a week before full migration
Axis 2 — Notebook depth
Pass: team uses Python notebooks at least weekly
Action on fail: train on bigframes; keep Vertex Workbench as fallback
Axis 3 — Canvas appetite
Pass: at least one workflow maps to branch-and-merge SQL
Action on fail: skip the canvas; editor + scheduled queries
Axis 4 — Gemini governance
Pass: team has a code-review policy for AI-generated SQL
Action on fail: write the review checklist before enabling Gemini
Step-by-step explanation.
- Axis 1 (SQL editor) is the easiest to pass — most BQ-heavy teams already write SQL in the console. The fail mode is teams who live in third-party IDEs.
- Axis 2 (notebooks) is the most commonly failed axis on ETL teams. The fix is two weeks of
bigframespractice and pair sessions before relying on notebooks for critical work. - Axis 3 (canvas) is binary — branch-and-merge workloads benefit, linear single-query workloads do not. Don't force it.
- Axis 4 (Gemini governance) is the one that bites teams hardest in production. Insert a human review step between Gemini output and production — checked into code review, not optional.
- Studio's surfaces are not free — each requires team practice. The readiness check decides where to invest the practice budget.
Output.
| Team | Axes passed | Studio adoption plan |
|---|---|---|
| ETL backend | 1, 4 | Pilot canvas; train on notebooks |
| Analytics | 1, 2, 4 | Full Studio; defer canvas |
| ML team | 2, 4 | Notebooks first; defer canvas |
| Business analysts | 3 | Canvas-first; SQL editor fallback |
Rule of thumb. Migration is not all-or-nothing — the readiness check picks the order of adoption.
Senior interview question on Studio's place in the GCP stack
A senior interviewer often opens with: "BigQuery Studio launched in 2023 and became the default BQ console in 2024. From a senior data engineer's perspective, what did it actually change in your day-to-day workflow, and what stayed the same? Where do you still reach for tools outside Studio, and why?"
Solution Using surface-by-surface workflow mapping
Studio workflow mapping — what changed, what stayed
===================================================
Day-to-day change matrix
------------------------
Workflow Pre-Studio With Studio
Ad-hoc query BQ classic console Studio editor + Gemini auto-complete
Exploration with charts Workbench notebook + bq client Studio notebook + bigframes
Visual branch-merge transform Tableau Prep / hand-written SQL Data Canvas → exported SQL
Schedule a query Scheduled Queries UI Same UI, accessible from canvas
Train BQML model SQL editor + manual EVAL queries Notebook + bigframes.ml
End-user dashboard Looker Looker (unchanged)
Cross-service DAG Composer / Airflow Composer (unchanged)
Spark batch job Dataproc Dataproc (unchanged)
What changed: notebook + canvas + Gemini are now in one workspace, sharing IAM and billing
What stayed: BI (Looker), orchestration (Composer), Spark (Dataproc), GPU ML (Vertex)
Step-by-step trace.
| Workflow | Where I used to live | Where I live now | Why I moved |
|---|---|---|---|
| Ad-hoc SQL | Classic editor | Studio editor | Gemini auto-complete + schema panel |
| Notebook exploration | Vertex Workbench | Studio notebook |
bigframes pushdown; shared IAM |
| Visual transform | Tableau Prep | Studio canvas | One IAM; export to scheduled SQL |
| BQML training | SQL editor | Studio notebook |
bigframes.ml + chart in same cell |
| Looker dashboard | Looker | Looker | Studio has no semantic layer |
| Composer DAG | Composer | Composer | Studio has no cross-service scheduler |
| Spark job | Dataproc | Dataproc | Studio has no Spark engine |
After the mapping pass, the verdict is clear: Studio absorbs the BQ-specific surfaces (editor, notebook, canvas, AI assist) and leaves the cross-service, governed-BI, and Spark surfaces alone.
Output:
| Verdict | Reasoning |
|---|---|
| Studio is a workbench, not a platform | One IAM + one billing across editor / notebook / canvas / Gemini |
| Studio does not replace Looker, Composer, or Dataproc | Different problem domains |
| Studio mostly replaces Vertex Workbench for BQ work |
bigframes + BQML are first-class |
| Studio's edge: AI assist + zero-friction surface switching | Grounded Gemini + shared schema |
Why this works — concept by concept:
- Surface-by-surface mapping — walk each workflow and identify where the work used to live. "Studio is a unified workspace" is the junior answer; the senior answer enumerates surfaces.
- Unified IAM and billing — the structural change is "one IAM and one billing scope across four surfaces," not more features. That is what lets a notebook query a table the editor just looked at with no re-auth.
- Build vs serve line — Studio sits on build; Looker sits on serve. They are adjacent, not competitors. "Studio replaces Looker" reveals the line has not been internalised.
- BQ-specific vs cross-service — Studio absorbs BQ-specific tools (editor, BQ notebook, BQ canvas, BQ AI assist). It leaves cross-service tools (Composer, Dataproc, Vertex GPU) alone.
- Cost — migration is O(team workflows) of retraining. Staying on classic tools costs O(time) of fragmented IAM and surface switching. Most teams come out ahead — but only if they do the readiness check first.
SQL
Topic — sql
BigQuery SQL practice problems
2. SQL editor + Gemini assist
Gemini turns the SQL editor into a paired pilot — bigquery ai assist accelerates typing, but only a disciplined review pattern keeps hallucinated joins out of production
The mental model in one line: the Gemini-assisted SQL editor offers three distinct AI surfaces — inline auto-complete, "explain this query," and natural-language-to-SQL — and a senior engineer treats each as a draft that must pass a four-point review before it ships. The features are powerful because they surface without you leaving the SQL pane; the cost is that "just paste Gemini's answer" is the most common production-bug source in 2026 BQ codebases.
The three AI surfaces in the editor.
-
Inline auto-complete. As you type, Gemini suggests the next tokens — column names, function calls, full window expressions. Tab to accept. Grounded on the
FROMtable schema. - Explain this query. Highlight a SQL block, click "Explain", and Gemini returns a prose narration — joins, filters, aggregations, window logic. Useful for inheriting unfamiliar SQL.
- Natural-language-to-SQL. A "Generate SQL" prompt that takes free-text English and emits a query grounded on the active project's schema. Strong on standard patterns; weaker on multi-step pipelines and uncommon window frames.
Schema grounding — why Gemini hallucinates less inside Studio.
- The editor passes table names, column names, types, descriptions, and PK tags from INFORMATION_SCHEMA as context to Gemini.
- The grounding cuts hallucinated column names dramatically vs a generic ChatGPT prompt.
- Grounding fails when schema is unannotated. Populating column descriptions pays for itself in NL-to-SQL accuracy.
The four-point review pattern (senior signal).
-
Joins. Verify each
ONclause against a documented FK relationship. -
Filters. Check WHERE semantics and date math (
DATE_SUBvsDATE_TRUNCare not the same). -
Aggregations. Match the aggregation function and
GROUP BYto business intent. -
Window frames. Count the rows in
ROWS BETWEEN N PRECEDING AND CURRENT ROW.
Where Gemini struggles in 2026.
- Multi-step pipelines. Beyond ~4 CTEs produces verbose, repetitive SQL.
- Uncommon window patterns. Sessionisation, gap-and-island, top-N-with-ties — close but often wrong.
- Cost optimisation. Does not reliably partition-prune or cluster-prune.
- Schema evolution. New columns may not appear in older training data; grounding mitigates but does not eliminate.
The senior-engineer pattern — Gemini as drafter, not committer. Gemini drafts; engineer reviews; CI validates; code review approves. Treat every generated query as if a smart-but-inexperienced colleague wrote it.
Worked example — auto-complete on a window function
Detailed explanation. A common senior probe — "you're typing a 7-day moving average; Gemini auto-completes the window frame. What should it suggest, what does it actually suggest, and how do you tell correct from 'close enough but wrong'?" The question tests whether you read auto-completed code or rubber-stamp it.
Question. Type a 7-day trailing moving average over a sales table partitioned by order_date. Show the prompt-prefix, the likely auto-complete, and the four-point review you'd perform.
Input.
| order_date | region | revenue |
|---|---|---|
| 2026-06-01 | US | 100 |
| 2026-06-02 | US | 120 |
| 2026-06-03 | US | 90 |
| 2026-06-04 | US | 110 |
| 2026-06-05 | US | 130 |
| 2026-06-06 | US | 95 |
| 2026-06-07 | US | 140 |
| 2026-06-08 | US | 105 |
Code.
-- You type up to here, then pause:
SELECT
order_date,
region,
revenue,
-- Gemini auto-completes the next line. The suggestion you actually want:
AVG(revenue) OVER (
PARTITION BY region
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS revenue_7d_avg
FROM `proj.sales.daily_revenue`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY)
ORDER BY region, order_date;
Step-by-step explanation.
- The intent is "7-day trailing moving average" — the window is today plus the previous 6 days.
- Gemini may auto-complete
ROWS BETWEEN 7 PRECEDING AND CURRENT ROW(8-day window) instead of6 PRECEDING(correct 7-day). The off-by-one is a classic Gemini failure mode. - Four-point review: (a) joins — none. (b) filters — partition-pruning date math verified. (c) aggregations —
AVG(revenue)correct. (d) window frames — count rows in the frame. - The senior pause on the window frame: CURRENT ROW + 6 PRECEDING = 7 rows. Accept or edit.
- The four-point review takes under 10 seconds and catches the most common bugs.
Output.
| order_date | region | revenue | revenue_7d_avg |
|---|---|---|---|
| 2026-06-01 | US | 100 | 100.00 |
| 2026-06-02 | US | 120 | 110.00 |
| 2026-06-03 | US | 90 | 103.33 |
| 2026-06-04 | US | 110 | 105.00 |
| 2026-06-05 | US | 130 | 110.00 |
| 2026-06-06 | US | 95 | 107.50 |
| 2026-06-07 | US | 140 | 112.14 |
| 2026-06-08 | US | 105 | 112.86 |
Rule of thumb. Every Gemini window frame deserves a one-second mental count: "how many rows are in this window?" If the answer isn't the number you wanted, edit before accepting.
Worked example — natural-language to SQL with schema grounding
Detailed explanation. A common demo — "top 5 customers by revenue in the last 30 days, only US region" emits a working query. Senior interviewers probe this surface because it is impressive and the source of more production bugs than any other Gemini feature.
Question. Use Generate SQL on the schema below for that prompt. Show the prompt, the likely output, and the schema-grounding details that make it reliable.
Input — schema.
| table | column | type | description |
|---|---|---|---|
proj.sales.orders |
order_id | INT64 | primary key |
proj.sales.orders |
customer_id | INT64 | FK to customers |
proj.sales.orders |
order_date | DATE | partition key |
proj.sales.orders |
amount | NUMERIC | order total |
proj.sales.orders |
region | STRING | clustered |
proj.sales.customers |
customer_id | INT64 | primary key |
proj.sales.customers |
name | STRING | display name |
Code.
Prompt to Gemini: "Top 5 customers by revenue in the last 30 days, only US region"
-- Likely Gemini output (with schema grounding from project metadata)
SELECT
c.customer_id,
c.name,
SUM(o.amount) AS revenue_30d
FROM `proj.sales.orders` AS o
JOIN `proj.sales.customers` AS c
ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND o.region = 'US'
GROUP BY c.customer_id, c.name
ORDER BY revenue_30d DESC
LIMIT 5;
Step-by-step explanation.
- The prompt is natural-language; Gemini reads it plus schema context (tables, columns, types, descriptions, PK tags) and emits parameterised SQL.
- Schema grounding gives Gemini the table names, the PK-tagged join column, the partition column, and the cluster column — all four guide the generated query.
- Four-point review: join uses the documented PK; filter partition-prunes and cluster-prunes;
SUM(o.amount)is the obvious revenue measure; no window frames. - The query is production-quality on the first pass because the schema was annotated. Without annotations, Gemini might miss the cluster column or hallucinate a region field.
- Invest in schema annotations before you expect NL-to-SQL to be reliable.
Output.
| customer_id | name | revenue_30d |
|---|---|---|
| 1001 | Acme Corp | 12,500 |
| 1002 | Beta Inc | 11,400 |
| 1003 | Gamma LLC | 9,800 |
| 1004 | Delta Co | 8,750 |
| 1005 | Epsilon Ltd | 7,600 |
Rule of thumb. NL-to-SQL accuracy tracks schema annotation quality more than Gemini model version. Annotate every column and tag every PK — the payoff is immediate.
Worked example — explain-this-query on inherited SQL
Detailed explanation. A common probe — paste a gnarly inherited SQL block (sessionisation with gap-and-island) and ask "Explain this query." The question: do you treat the prose as authoritative or as a starting point?
Question. A teammate hands you a sessionisation query. Use "Explain this query" to bootstrap understanding; identify which parts of the explanation deserve independent verification.
Input.
| user_id | event_time |
|---|---|
| u1 | 09:00:00 |
| u1 | 09:10:00 |
| u1 | 09:45:00 |
| u1 | 10:30:00 |
Code.
-- Inherited SQL, paste into the editor, select all, click "Explain"
WITH gaps AS (
SELECT
user_id,
event_time,
TIMESTAMP_DIFF(
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
MINUTE
) AS gap_min
FROM `proj.analytics.events`
),
sessions AS (
SELECT
user_id,
event_time,
SUM(CASE WHEN gap_min IS NULL OR gap_min > 30 THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM gaps
)
SELECT user_id, session_id, MIN(event_time) AS session_start, COUNT(*) AS events
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
Step-by-step explanation.
- Gemini's typical explanation: "groups events into sessions per user; gaps CTE computes the per-user gap; sessions CTE increments a counter whenever the gap exceeds 30 minutes."
- The prose is roughly right but glosses over the gap-and-island invariant — the cumulative SUM over the boundary indicator is the actual mechanism.
- Verification: at 09:00 gap is NULL → boundary, session 1. At 09:10 gap = 10 → no boundary, session 1. At 09:45 gap = 35 > 30 → boundary, session 2. At 10:30 gap = 45 > 30 → boundary, session 3.
- The explanation is a starting point, not a proof — Gemini summarises intent, not correctness.
- Senior pattern: trust the explanation's intent, independently verify the mechanism by tracing one or two inputs.
Output.
| user_id | session_id | session_start | events |
|---|---|---|---|
| u1 | 1 | 09:00:00 | 2 |
| u1 | 2 | 09:45:00 | 1 |
| u1 | 3 | 10:30:00 | 1 |
Rule of thumb. Use "Explain this query" to accelerate understanding; always trace one input through the SQL by hand before relying on the explanation. Gemini is a fast first draft, never a proof.
Senior interview question on Gemini governance
A senior interviewer might frame this as: "Your team enabled Gemini in the SQL editor a quarter ago. Three production incidents this quarter were traced back to AI-generated SQL that nobody reviewed carefully enough. How do you fix the process without disabling Gemini and losing the productivity gain?"
Solution Using a four-point review checklist + CI guard rails
-- Step 1: Codify the four-point review as a pull-request checklist
-- (Committed to .github/PULL_REQUEST_TEMPLATE.md)
-- AI-generated SQL review checklist
-- [ ] Joins: every ON clause uses a documented foreign-key relationship
-- [ ] Filters: WHERE / HAVING semantics verified against table descriptions
-- [ ] Aggregations: GROUP BY columns match the business intent
-- [ ] Window frames: row counts verified by tracing one example input
-- Step 2: Add a CI guard — every BQ SQL file with a "Generated by Gemini" tag
-- runs a dry-run plus a regression test before merge:
-- ci/validate_sql.sh (run on every PR touching sql/*.sql):
-- bq query --dry_run --use_legacy_sql=false < $sql_file
-- bq query --use_legacy_sql=false --max_bytes_billed=10000000 < $sql_file > /tmp/result.csv
-- diff /tmp/result.csv test/expected/$sql_file.expected.csv
-- Step 3: Tag the generated SQL at write time
-- Add a comment header to any query produced by Gemini:
/*
* Generated by Gemini in BQ Studio on 2026-06-22
* Prompt: "top 5 customers by revenue in the last 30 days, only US region"
* Reviewer: @teammate (sign-off required)
*/
SELECT ...
Step-by-step trace.
| Incident | Root cause | What the checklist would have caught |
|---|---|---|
| Wrong join key | Gemini joined on id not customer_id
|
Point 1 (joins): verify FK relationship |
| Off-by-one window | Gemini used 7 PRECEDING for a 7-day window |
Point 4 (frames): count rows in the frame |
| Missed cluster filter | Gemini omitted region = 'US'
|
Point 2 (filters): cross-check business intent |
| Wrong aggregation | Gemini used COUNT(*) not COUNT(DISTINCT)
|
Point 3 (aggregations): match business intent |
After the checklist + CI guard rails are in place, the team retains the typing-speed gain from Gemini and loses the production-incident rate that came with rubber-stamping its output.
Output:
| Lever | What it changes | What it preserves |
|---|---|---|
| Four-point review checklist | Every AI SQL passes a mental trace | Gemini typing-speed gain |
| CI dry-run + regression | Every AI SQL passes a syntactic + semantic gate | Iteration speed |
| Generated-by tag | Every AI SQL is traceable to a prompt + reviewer | Accountability |
Why this works — concept by concept:
- Process over disablement — disabling Gemini after incidents is the junior move. The senior fix is a cheap review step that catches common failure modes without reverting to manual SQL.
- Four points are enough — joins / filters / aggregations / window-frames catches >90% of Gemini failures. Longer checklists cost review time without proportionate accuracy gain.
- CI guard rails — humans miss things; a dry-run + regression diff catches cases a tired reviewer skipped.
-
Tag at write time — the
Generated by Geminiheader preserves traceability when a query breaks six months later. - Cost — the four-point review adds ~10 seconds per AI query and saves hours per incident. ROI is asymmetric.
SQL
Topic — sql
SQL editor + Gemini practice
3. Notebooks — Python + SQL cells
bigquery notebooks ship with bigframes — pandas semantics that compile to BigQuery SQL, so your "ETL in Python" is actually warehouse compute
The mental model in one line: a Studio notebook is a Colab-style document where SQL and Python cells share the same query engine and dataset access, with bigframes — a pandas-compatible DataFrame whose every operation lazily pushes down into BigQuery SQL. That architectural choice — Python cells compiling to SQL rather than pulling data to a notebook VM — is what makes bigquery jupyter notebooks cheap on data sizes that would crush a regular Jupyter kernel.
The notebook surface in Studio.
- Cells. SQL cells run BigQuery jobs; Python cells run on a managed Python runtime, both bill through Studio.
-
Sharing between cells. A SQL cell's result is exposed as a
bigframesDataFrame in the next Python cell via%%bigquery df --target=bigframes. - Markdown cells. First-class — build a model card or one-shot report with title, narrative, charts, and the SQL that produced them.
- Scheduling. Notebooks can be scheduled on a cron via the Notebook Executor (single-job; multi-step DAGs still belong in Composer).
bigframes — pandas on BigQuery.
- DataFrame API mirrors pandas (
.loc,.iloc,.groupby,.merge,.assign,.dt,.str). - Every operation is lazy — builds a SQL plan until you call
.to_pandas(),.head(),.cache(), or render. - Compiled SQL is visible via
df.sql— useful for debugging plan-level bugs. - Compute lives in BigQuery; the notebook VM only holds the plan and materialised results.
bigframes.ml — BQML wrapped in scikit-learn semantics.
-
LinearRegression,LogisticRegression,KMeans,RandomForest, plus BQML's DNN and AutoML. -
.fit()issuesCREATE MODEL;.predict()issuesML.PREDICT. Models live in BQ. - The notebook is the authoring environment; the model lives in the warehouse and is queryable from any SQL surface.
Python cells beyond bigframes. Standard Python kernel — numpy, scikit-learn, matplotlib, seaborn, plotly. %pip install works inline. Non-BQ data is reachable but pulls into the notebook VM.
When the notebook beats the editor. The work is >50% Python (features, ML, plotting); the output is a narrative document; you need scheduled notebook execution; you mix BQ + non-BQ data.
When the editor still wins. Pure SQL, single query, fastest possible iteration.
Worked example — running a SQL cell + Python cell with bigframes
Detailed explanation. A common notebook pattern — SQL cell to fetch data, Python cell to massage with bigframes, plot in matplotlib. The senior question: which line actually triggers a BQ job, and which line is just appending to the plan?
Question. Build a notebook that fetches monthly revenue, computes year-over-year growth with bigframes, and plots it. Identify the line that triggers the BQ job.
Input.
| order_date | region | amount |
|---|---|---|
| 2025-01-15 | US | 500 |
| 2025-02-15 | US | 600 |
| 2026-01-15 | US | 700 |
| 2026-02-15 | US | 750 |
Code.
# Cell 1 — SQL cell
%%bigquery monthly --target=bigframes
SELECT
DATE_TRUNC(order_date, MONTH) AS month,
region,
SUM(amount) AS revenue
FROM `proj.sales.orders`
WHERE order_date >= DATE('2025-01-01')
GROUP BY month, region;
# Cell 2 — Python cell using bigframes
import bigframes.pandas as bpd
# monthly is a bigframes DataFrame, NOT a pandas DataFrame
# Adding columns is lazy — appends to the SQL plan
monthly = monthly.sort_values(["region", "month"])
monthly["revenue_prev_year"] = (
monthly.groupby("region")["revenue"].shift(12)
)
monthly["yoy_growth_pct"] = (
(monthly["revenue"] - monthly["revenue_prev_year"])
/ monthly["revenue_prev_year"] * 100
)
# This line is still lazy — no BQ job yet
result = monthly.dropna(subset=["yoy_growth_pct"])
# THIS line triggers the BQ job (materialises into pandas for matplotlib)
result_pd = result.to_pandas()
# Cell 3 — Plotting
import matplotlib.pyplot as plt
for region, group in result_pd.groupby("region"):
plt.plot(group["month"], group["yoy_growth_pct"], label=region)
plt.title("YoY revenue growth by region")
plt.xlabel("month")
plt.ylabel("% growth")
plt.legend()
plt.show()
Step-by-step explanation.
- Cell 1 runs a BQ job; the result is exposed in cell 2 as
monthly, abigframesDataFrame. - In cell 2, every
bigframesoperation (sort_values,groupby,shift, assignments,dropna) adds nodes to the SQL plan — none runs a BQ job..sqlwould show the compiled plan. -
result.to_pandas()is the line that triggers the BQ job. The plan runs in BQ; the result ships to the notebook VM as a pandas DataFrame. - Cell 3 plots in matplotlib — pure notebook VM work; no BQ.
- Lazy execution is the whole point — long
bigframespipelines stay free until you materialise.
Output.
| month | region | revenue | revenue_prev_year | yoy_growth_pct |
|---|---|---|---|---|
| 2026-01-01 | US | 700 | 500 | 40.0 |
| 2026-02-01 | US | 750 | 600 | 25.0 |
Rule of thumb. Every bigframes operation is lazy until .to_pandas(), .head(), .cache(), or render. Treat .to_pandas() as the "press the button" moment; everything else is plan-building.
Worked example — training a BQML model from a notebook
Detailed explanation. A senior probe — "train and evaluate a logistic regression model end-to-end from a Studio notebook, with BQML under the hood; training data, model, and evaluation chart in the same notebook." The question tests whether you understand that the model lives in BigQuery, not the notebook VM.
Question. Train a bigframes.ml.LogisticRegression model on customers. Evaluate on held-out data. Plot ROC. Be explicit about which artefacts live where.
Input.
| feature | description |
|---|---|
| tenure_months | months as a customer |
| monthly_spend | average monthly spend |
| support_tickets | tickets in last 90 days |
| churned (label) | 1 if churned, 0 otherwise |
Code.
import bigframes.pandas as bpd
from bigframes.ml.linear_model import LogisticRegression
from bigframes.ml.model_selection import train_test_split
customers = bpd.read_gbq("proj.crm.customers")
features = ["tenure_months", "monthly_spend", "support_tickets"]
label = "churned"
X = customers[features]
y = customers[label]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
# The fit() call issues a BQML CREATE MODEL statement against BQ.
# The model object is a handle to the model in BigQuery, not a local sklearn object.
# Evaluate
metrics = model.score(X_test, y_test)
print(metrics)
# Predict probabilities for the ROC curve
preds = model.predict(X_test) # bigframes DataFrame with predicted_churned + probs
preds_pd = preds.to_pandas() # materialise for sklearn ROC
y_test_pd = y_test.to_pandas()
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, _ = roc_curve(y_test_pd, preds_pd["predicted_churned_probs"])
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.title("ROC — churn model")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
Step-by-step explanation.
-
bpd.read_gbqreturns a lazy handle — no data movement. -
train_test_splitbuilds the split plan; still lazy. -
.fit()is the first line that runs a BQ job — a BQMLCREATE MODEL. The trained model lives in BQ; the Pythonmodelobject is a handle. -
.score()issuesML.EVALUATE;.predict()issuesML.PREDICTagainst the held-out set. -
.to_pandas()materialises predictions for sklearn'sroc_curveand matplotlib. All training data and the model stay in BQ; only test predictions enter the notebook VM.
Output.
| metric | value |
|---|---|
| accuracy | 0.84 |
| precision | 0.79 |
| recall | 0.71 |
| f1 | 0.75 |
| auc_roc | 0.88 |
Rule of thumb. bigframes.ml is the right surface for warehouse-native ML — training data in BQ, served model queried from SQL. For GPU, custom architectures, or the Vertex Registry, use Vertex Workbench.
Worked example — caching to keep iteration cheap
Detailed explanation. A common gotcha — a notebook with many cells, each kicking off a new BQ job because nothing was cached. The senior pattern: materialise the expensive intermediate once with .cache(), then iterate without re-running the upstream plan.
Question. A notebook explores a 200 GB DataFrame across 15 cells, each re-running the full join. Refactor with .cache().
Input.
| Step | Job cost (GB scanned) |
|---|---|
| Cell 3 — initial query | 200 |
| Cell 5 — filter (no cache) | 200 (re-runs from source) |
| Cell 7 — groupby (no cache) | 200 |
| Cell 9 — sort + head (no cache) | 200 |
Code.
# BEFORE (every cell re-runs the upstream SQL plan)
import bigframes.pandas as bpd
big = bpd.read_gbq("proj.warehouse.fact_join_view") # cell 3 → 200 GB job
filtered = big[big["region"] == "US"] # lazy
agg = filtered.groupby("month").sum() # lazy
top = agg.sort_values("revenue", ascending=False).head(10) # cell 5+7+9 → 200 GB each
# AFTER (cache the expensive intermediate once)
big = bpd.read_gbq("proj.warehouse.fact_join_view").cache() # cell 3 → 200 GB job once
filtered = big[big["region"] == "US"] # against cached
agg = filtered.groupby("month").sum() # against cached
top = agg.sort_values("revenue", ascending=False).head(10) # against cached
Step-by-step explanation.
- Without
.cache(), every materialisation re-issues the SQL plan from source — 200 GB per cell. -
.cache()materialises into a temporary BQ table once; subsequent operations read from cache (a few GB). - The cache lifetime is session-bound. Persistent caches live as regular BQ tables for cross-session reuse.
- 15 cells × 200 GB → 200 GB once + 15 × cache scan. On on-demand pricing, the saving is real money.
- Any
bigframesDataFrame touched by 3+ downstream cells deserves a.cache()call.
Output.
| Iteration | Without cache | With cache |
|---|---|---|
| Cell 3 | 200 GB | 200 GB |
| Cell 5 | 200 GB | <1 GB cache scan |
| Cell 7 | 200 GB | <1 GB cache scan |
| Cell 9 | 200 GB | <1 GB cache scan |
| Total | 800 GB | ~203 GB |
Rule of thumb. Cache early, cache once, cache at the most-reused intermediate. Lazy execution rewards thoughtful caching with order-of-magnitude savings.
Senior interview question on bigframes vs pandas
A senior interviewer often asks: "Your team uses pandas in notebooks today, and you're proposing moving to bigframes. What changes for the team, what's the same, and where would you still pick plain pandas?"
Solution Using a feature-by-feature comparison + migration plan
# bigframes vs pandas — the senior comparison
# 1. API surface
# bigframes: ~90% of the pandas API is mirrored
# pandas: 100% (it IS the API)
# Verdict: bigframes covers the common 90%, edge cases may need pandas
# 2. Execution model
# bigframes: lazy → SQL → BigQuery compute
# pandas: eager → in-memory on the notebook VM
# Verdict: bigframes wins on data > VM RAM, pandas wins on local-only flows
# 3. Cost model
# bigframes: pays BQ on-demand or slot reservation
# pandas: pays notebook VM CPU/memory; data egress to load
# Verdict: bigframes is cheaper at scale; pandas is cheaper for tiny data
# 4. Library compatibility
# bigframes: works with scikit-learn (via to_pandas()), matplotlib, seaborn
# pandas: works with everything
# Verdict: bigframes for plan-building, materialise to pandas for niche libs
# 5. Debuggability
# bigframes: inspect compiled SQL via df.sql
# pandas: inspect intermediate DataFrames inline
# Verdict: different debug models; both workable
# When to keep plain pandas
# - Data already in-memory (CSV upload, API response)
# - Data is small (< notebook VM RAM)
# - Library being used has no bigframes equivalent and needs pandas in
# - Iterative debugging where SQL plan inspection is awkward
# Migration plan
# 1. New notebooks default to bigframes
# 2. Existing notebooks: convert the data-fetch step to bpd.read_gbq + .cache()
# 3. Keep .to_pandas() at the boundary where you need a niche library
# 4. Train the team on .sql inspection for plan debugging
Step-by-step trace.
| Workflow | Before (pandas) | After (bigframes) |
Win |
|---|---|---|---|
| Load 50 GB warehouse table | OOM on 16 GB VM | lazy plan, runs in BQ | bigframes |
| Groupby + sum + sort | 5 min in VM | 30 s in BQ | bigframes |
| Quick CSV from a vendor | 1 line in pandas | indirect via upload to BQ | pandas |
| Sklearn KMeans on tiny features | .fit(df) |
.to_pandas() then .fit()
|
tie |
| Plot with matplotlib | df.plot() |
df.to_pandas().plot() |
tie |
| Persistent feature store | hand-rolled in pandas |
bigframes writes to BQ table |
bigframes |
After the migration plan, the team uses bigframes for everything that touches warehouse data and keeps pandas at the boundary for niche-library interop.
Output:
| Surface | Tool | Reason |
|---|---|---|
| Warehouse-scale ETL in notebook | bigframes |
lazy plan + BQ compute |
| Tiny CSV from a vendor | pandas | already in-memory |
| Scikit-learn niche model | pandas at the boundary | sklearn doesn't speak bigframes
|
| Plotting | pandas at the boundary | matplotlib speaks pandas |
| Notebook-scheduled ETL | bigframes |
cost-efficient at warehouse scale |
Why this works — concept by concept:
-
Lazy plan vs eager evaluation —
bigframesbuilds a SQL plan running server-side; pandas evaluates in the notebook VM. Plan model wins on warehouse-scale; eager wins on tiny data. -
API parity —
bigframesmirrors ~90% of pandas; day-to-day code looks identical. The 10% gap is where pandas wins, or you call.to_pandas()at the boundary. -
Cost surface —
bigframespays BQ; pandas pays notebook VM. Pick by which compute pool is cheaper at your scale. -
Library interop at the boundary — sklearn, matplotlib, and other pandas-speaking libraries work via
.to_pandas().bigframes.mlcovers in-warehouse ML so the boundary moves later. -
Cost — migration is O(notebooks) of read-and-edit; staying on pandas is O(data size) per re-run.
bigframespays back fast on warehouse-scale teams.
SQL
Topic — sql
Notebook + SQL practice problems
4. Data Canvas — visual exploration
bigquery data canvas is a branch-and-merge visual editor that exports to clean BigQuery SQL — the right home for transforms that need a flow diagram more than a typed query
The mental model in one line: the Data Canvas is a drag-and-drop graph editor where each node is a SQL transformation step and edges carry result tables between nodes; the graph compiles to a single SQL script with named CTEs that BigQuery actually executes. The canvas is a UI on top of CTEs — but the visual surface is the killer feature for transforms with branch-and-merge logic, lookup joins, or non-engineer maintainers.
The canvas surface.
- Nodes. Single transformation steps — Source, Filter, Derive column, Group + aggregate, Join, Union, Sort, Sample, Sink. Form-based config panel; no SQL typing for the common cases.
- Edges. Carry output table of one node into the next. Multiple outputs (branch) and multiple inputs (merge / join / union) are supported.
-
Preview. Every node shows a 100-500 row sample inline via a lightweight
LIMITquery. - Export to SQL. Emits a single SQL script with one CTE per node, named after the node label.
- Schedule. Wraps the exported SQL as a BigQuery Scheduled Query with cron, destination, and run-as service account.
When the canvas beats the editor. Branch-and-merge logic; hand-off to non-engineer maintainers; iterative add/remove steps with preview; lookup joins with form-based join nodes.
When the editor still wins. Linear single-block transforms; production-critical transforms with Git diff requirements (keep in Dataform); uncommon SQL the form-based nodes don't cover (custom window frames, recursive CTEs, qualify).
Canvas → exported SQL pattern. Each node label becomes a CTE name; the final node is the top-level SELECT; comments preserve node provenance. The exported SQL is idiomatic and runnable in the editor without modification.
Production pattern. Build on canvas with preview at every step → export to SQL → review SQL in code review → schedule via "Schedule" button → re-open canvas only to edit; re-export and re-deploy.
Worked example — building a 4-step canvas pipeline
Detailed explanation. A classic canvas workflow — load raw events, filter to the last 30 days, derive a date column, group by date and region, sink to a reporting table. Each step is a node with an inline preview; the exported SQL is one CTE per node.
Question. Build a 4-step canvas pipeline aggregating raw events into daily-region totals. Show each node, the exported SQL, and why this is canvas-fit.
Input.
| event_id | event_time | region | user_id | revenue |
|---|---|---|---|---|
| e1 | 2026-06-01 10:15 | US | u1 | 50 |
| e2 | 2026-06-01 11:30 | EU | u2 | 35 |
| e3 | 2026-06-02 09:45 | US | u3 | 80 |
| e4 | 2026-06-02 14:20 | APAC | u4 | 25 |
Code.
Canvas graph
============
Node 1 — Source
type: BigQuery table
ref: proj.events.raw_events
Node 2 — Filter
predicate: event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
Node 3 — Derive column
new column: event_date = DATE(event_time)
Node 4 — Group + aggregate
group by: event_date, region
aggregates: SUM(revenue) AS daily_revenue, COUNT(DISTINCT user_id) AS dau
Node 5 — Sink
destination: proj.events.daily_region_totals
write mode: WRITE_TRUNCATE
-- Exported SQL from the canvas (paste verbatim into editor or schedule)
WITH
source_raw_events AS (
SELECT * FROM `proj.events.raw_events`
),
filter_last_30d AS (
SELECT *
FROM source_raw_events
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
derive_event_date AS (
SELECT
*,
DATE(event_time) AS event_date
FROM filter_last_30d
),
group_by_date_region AS (
SELECT
event_date,
region,
SUM(revenue) AS daily_revenue,
COUNT(DISTINCT user_id) AS dau
FROM derive_event_date
GROUP BY event_date, region
)
SELECT * FROM group_by_date_region;
Step-by-step explanation.
- The Source node points at the raw events table; preview shows a 100-row sample.
- The Filter node applies the 30-day predicate; preview shows only in-window rows — mistypes surface immediately.
- The Derive node adds the
event_datecolumn visible in the preview. - The Group + aggregate node collapses rows to one per
(event_date, region)with the aggregates. - The Sink writes to the destination with
WRITE_TRUNCATE; the exported SQL captures the flow as a CTE chain, ready to schedule.
Output (daily_region_totals).
| event_date | region | daily_revenue | dau |
|---|---|---|---|
| 2026-06-01 | US | 50 | 1 |
| 2026-06-01 | EU | 35 | 1 |
| 2026-06-02 | US | 80 | 1 |
| 2026-06-02 | APAC | 25 | 1 |
Rule of thumb. A 4-step pipeline is a fine canvas fit — visual, easy to hand off, exports to clean SQL. The same transform in the editor is a 15-line CTE; both work, but canvas wins on readability.
Worked example — branch-and-merge with a lookup join
Detailed explanation. Branch-and-merge is the canvas's killer use case — load orders, split into US and non-US streams, apply different aggregations, join with customers, union into one output. The visual graph collapses an 80-line CTE chain into an at-a-glance diagram.
Question. Build a canvas pipeline branching orders into US and non-US streams with different aggregates, joining each with the customers lookup, and unioning the result.
Input — orders.
| order_id | region | customer_id | amount |
|---|---|---|---|
| 1 | US | c1 | 100 |
| 2 | EU | c2 | 60 |
| 3 | US | c1 | 80 |
Input — customers (lookup).
| customer_id | name |
|---|---|
| c1 | Acme Corp |
| c2 | Beta Inc |
Code.
Canvas graph
============
Node 1 — Source: orders
Node 2 — Source: customers
Node 3 — Filter (orders): region = 'US' → us_orders
Node 4 — Filter (orders): region != 'US' → non_us_orders
Node 5 — Group US: SUM(amount), COUNT(*) per customer_id
Node 6 — Group non-US: AVG(amount), COUNT(*) per customer_id
Node 7 — Join us_agg + customers on customer_id
Node 8 — Join non_us_agg + customers on customer_id
Node 9 — Union node 7 + node 8 with a region_label column
Node 10 — Sink: proj.sales.regional_summary
-- Exported SQL
WITH
source_orders AS (
SELECT * FROM `proj.sales.orders`
),
source_customers AS (
SELECT * FROM `proj.sales.customers`
),
filter_us AS (
SELECT * FROM source_orders WHERE region = 'US'
),
filter_non_us AS (
SELECT * FROM source_orders WHERE region != 'US'
),
group_us AS (
SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS orders
FROM filter_us
GROUP BY customer_id
),
group_non_us AS (
SELECT customer_id, AVG(amount) AS avg_amount, COUNT(*) AS orders
FROM filter_non_us
GROUP BY customer_id
),
join_us AS (
SELECT g.*, c.name, 'US' AS region_label
FROM group_us g
JOIN source_customers c USING (customer_id)
),
join_non_us AS (
SELECT g.*, c.name, 'NON_US' AS region_label
FROM group_non_us g
JOIN source_customers c USING (customer_id)
)
SELECT customer_id, name, region_label, orders,
total_amount AS total_amount_us,
NULL AS avg_amount_non_us
FROM join_us
UNION ALL
SELECT customer_id, name, region_label, orders,
NULL AS total_amount_us,
avg_amount AS avg_amount_non_us
FROM join_non_us;
Step-by-step explanation.
- Two source nodes load
ordersandcustomers; preview confirms schemas. - The orders stream branches into US and non-US filter nodes.
- Each filtered stream feeds a different aggregate (sum vs avg) — preview confirms per-customer aggregates.
- Each aggregate joins with the
customerslookup. - The two enriched streams union with a
region_labelcolumn; the Sink writes to the destination. - The visual graph collapses ~50 lines of nested CTEs into a single readable diagram.
Output (regional_summary).
| customer_id | name | region_label | orders | total_amount_us | avg_amount_non_us |
|---|---|---|---|---|---|
| c1 | Acme Corp | US | 2 | 180 | NULL |
| c2 | Beta Inc | NON_US | 1 | NULL | 60 |
Rule of thumb. Branch-and-merge is the canvas's sweet spot. If your transform splits a single source into multiple streams with different treatment, the visual diagram is much easier to maintain than the equivalent CTE chain.
Worked example — exporting canvas to a scheduled query
Detailed explanation. The production pattern — you built the transform on the canvas, now it runs nightly. The "Schedule" button wraps the exported SQL in a BQ Scheduled Query with cron, destination, and run-as service account. The senior probe: do you treat the canvas state or the exported SQL as the source of truth?
Question. Schedule the 4-step pipeline from the first worked example as a daily query. Identify what goes to source control, what goes to BQ, and how you'd modify the transform six months later.
Input.
| Artefact | Lives where |
|---|---|
| Canvas state (graph JSON) | BQ Studio workspace |
| Exported SQL | Git repo + BQ Scheduled Query |
| Schedule config (cron, dest, IAM) | BQ Scheduled Query metadata |
| Audit log of runs | BQ INFORMATION_SCHEMA |
Code.
-- Production-ready scheduled query
-- (Created from the canvas's "Schedule" button; committed to Git for visibility)
-- @scheduled-query
-- @schedule: every day 02:00 America/Los_Angeles
-- @destination: proj.events.daily_region_totals (WRITE_TRUNCATE)
-- @service-account: bq-sched@proj.iam.gserviceaccount.com
-- @source: canvas/daily_region_totals.canvas
WITH
source_raw_events AS (
SELECT * FROM `proj.events.raw_events`
),
filter_last_30d AS (
SELECT *
FROM source_raw_events
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
derive_event_date AS (
SELECT *, DATE(event_time) AS event_date
FROM filter_last_30d
),
group_by_date_region AS (
SELECT
event_date,
region,
SUM(revenue) AS daily_revenue,
COUNT(DISTINCT user_id) AS dau
FROM derive_event_date
GROUP BY event_date, region
)
SELECT * FROM group_by_date_region;
Step-by-step explanation.
- "Schedule" registers the exported SQL as a BQ Scheduled Query with destination, cron, service account, and retry policy.
- The exported SQL is committed to Git; the canvas JSON is also committed for visual editing, but SQL is the deployment artefact.
- Six months later: re-open the canvas, edit, re-export, re-deploy. Git diff on the SQL shows the change.
- Run history lives in
INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS— wire it into Cloud Monitoring for SLA alerts. - Treat the exported SQL as the source of truth, the canvas as the authoring tool.
Output (artefact ledger).
| Artefact | Storage | Purpose |
|---|---|---|
daily_region_totals.canvas (graph JSON) |
Git repo | Re-open in canvas for editing |
daily_region_totals.sql (exported SQL) |
Git repo + BQ Scheduled Query | Deployment artefact |
| Schedule config | BQ Scheduled Query metadata | Cron, destination, IAM |
| Run history | INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS |
Audit / monitoring |
Rule of thumb. Exported SQL is the source of truth; canvas state is the authoring helper. Both go to Git; the SQL runs in production.
Senior interview question on canvas vs Dataform
A senior interviewer might ask: "Your team uses Data Canvas for a few transforms and Dataform for the rest. Where exactly is the line between them, and have you ever needed to migrate a transform from one to the other?"
Solution Using a feature-by-feature comparison + migration playbook
Data Canvas vs Dataform — when to use which
===========================================
Feature axes
------------
Data Canvas Dataform
Visual editor YES No (SQLX files in Git)
Branch-and-merge YES (graph) Manual via dependencies
Dependency graph Within canvas Across many tables
Version control Canvas JSON Git-native SQLX
Assertions / data tests No YES (assertions)
Variables / templating Limited YES (JavaScript blocks)
Multi-target deploy Single sched Tag-based deploys
Cron / schedule Per canvas Per workflow
Hand-off to non-engineers YES (visual) Lower (Git + SQLX)
Production governance Lower Higher
When to pick Data Canvas
- Exploration, prototyping
- Branch-and-merge transforms with few external dependencies
- Non-engineer maintainers
- Single scheduled output
When to pick Dataform
- Production ELT with many tables
- Assertions / data quality tests
- Multi-environment (dev / staging / prod) deploys
- Engineer-only maintenance
- Templated SQL across many similar tables
Migration playbook (canvas → Dataform)
1. Export canvas SQL
2. Wrap each CTE as a SQLX file with dependencies
3. Add assertions (uniqueness, not_null, accepted values)
4. Hook into Dataform's tag-based deploy
5. Decommission the scheduled query that ran the canvas
Step-by-step trace.
| Transform | Maturity stage | Surface | Reason |
|---|---|---|---|
| Daily region totals | Stable, 1 year old | Dataform | production governance + assertions |
| New branch-and-merge prototype | Week 1 | Data Canvas | visual flow + fast iteration |
| Per-customer scoring (50 tables) | Production | Dataform | dependency graph + templating |
| Quarterly board pack source | Once a quarter | Data Canvas → export | non-engineer maintains |
| Streaming-to-batch staging | Production | Dataform | environment deploys |
After the comparison, the line is clear: prototypes and hand-offs to non-engineers live in Canvas; production ELT with quality tests and multi-environment deploys lives in Dataform.
Output:
| Use case | Pick | Why |
|---|---|---|
| Visual prototype | Canvas | Branch-and-merge UI |
| Production ELT with tests | Dataform | Assertions + dependencies |
| Non-engineer maintains | Canvas | Visual flow |
| Engineer-only, many tables | Dataform | Templating + tag deploys |
| One-off scheduled query | Canvas | Export + schedule |
Why this works — concept by concept:
- Lifecycle stage decides — prototypes and hand-offs go to Canvas; productionised, governed ELT goes to Dataform. Lifecycle stage is a cleaner predicate than "which tool can express this transform."
-
Assertions are the moat — Dataform's data tests (
uniqueness,not_null,accepted_values) are the production governance Canvas cannot provide. - Dependency graph — Canvas has a single-graph view; Dataform has a cross-table graph. For 50+ table pipelines, the cross-table view is essential.
- Migration is mechanical — export SQL, split into SQLX files, add assertions, deploy. The pattern is repeatable.
- Cost — Canvas authoring is O(weeks); Dataform authoring is O(months). Prototype in Canvas, productionise in Dataform.
ETL
Topic — etl
Visual ETL pipeline problems
5. Studio vs Looker vs Vertex — choosing
Studio is a workbench, Looker is a governed BI semantic layer, Vertex is an ML platform — the senior signal is naming the build-vs-serve and warehouse-vs-platform lines crisply
The mental model in one line: Studio sits on the build side (workbench for SQL + Python + visual ETL); Looker sits on the serve side (governed BI semantic layer with LookML); Vertex sits on the ML platform side (Workbench, model registry, pipelines, non-BQ ML) — and most production stacks need all three because they solve different problems. Senior interviews probe whether you've internalised these three lines or collapse them into "GCP data tools."
The three tools, one sentence each.
- BigQuery Studio. Workbench for engineers: SQL + notebooks + canvas + Gemini, sharing one BQ engine, IAM, and billing scope. The build side.
- Looker. Governed BI semantic layer for end users: LookML dimensions / measures / explores, Looks and Dashboards, dimension-level permissions. The serve side.
- Vertex AI. ML platform: Workbench, GPU/TPU training, Model Registry, pipelines, endpoints, monitoring. The non-BQ ML side.
The build-vs-serve line. Studio for engineers writing transformations; Looker for end users consuming curated dashboards. Transformations live in Studio / Dataform and feed Looker's semantic layer — Looker does not transform; it serves. Using Looker as a transformation tool (over-leaning on Persistent Derived Tables) is a common junior anti-pattern.
The warehouse-vs-platform line. Studio is BQ-only — every surface binds to BigQuery as the engine. Vertex is multi-engine — Workbench can train on GPU, run PyTorch / TensorFlow, register in Model Registry, serve via endpoints. For BQ-centric ML, Studio is enough; for everything else, Vertex.
Hybrid stacks — the production reality.
- Studio + Looker. Most common. Engineers build in Studio; Looker serves curated outputs as dashboards.
- Studio + Vertex. When the model needs GPUs, custom code, or registry governance. Studio for BQ-side authoring; Vertex for heavy training and serving.
- All three. Large data orgs — Studio for engineer workflows, Looker for end users, Vertex for non-BQ ML.
Senior interview signals.
- Saying "Studio is build, Looker is serve, Vertex is ML platform" in the first sentence — senior signal.
- Describing the curated path raw → Dataform/Studio → mart → Looker semantic layer → dashboard — required answer.
- Naming LookML dimensions, measures, explores, persistent derived tables rather than "Looker thing" — senior signal.
- Pushing back on "BQML replaces Vertex" with "BQML covers warehouse-native; Vertex owns GPU + registry + custom serving" — senior signal.
Worked example — picking the right tool for a request
Detailed explanation. An interviewer hands you four requests and asks for the tool. They span Studio / Looker / Vertex / hybrid territory; the senior signal is naming the line that decides each one.
Question. For each request below, pick the primary tool and justify briefly. Acknowledge hybrid cases.
Input.
| # | Request |
|---|---|
| 1 | "Build a one-off chart for tomorrow's exec meeting showing weekly active users by region" |
| 2 | "Build a self-serve dashboard for marketing showing CAC, LTV, and channel mix, updated daily" |
| 3 | "Train a deep-learning fraud-detection model on a 1B-row events table with GPU acceleration" |
| 4 | "Run a quarterly batch report that re-aggregates revenue by product, with an emailed PDF" |
Code.
Tool selection by request
=========================
Request 1 — one-off exec chart
Primary: BQ Studio (notebook OR SQL editor + chart)
Why: ephemeral; not worth wiring Looker for a one-shot
Request 2 — self-serve marketing dashboard
Primary: Looker (LookML semantic layer + dashboard)
Hybrid: Studio / Dataform builds the curated table; Looker serves
Why: governed BI for end users — Looker's home turf
Request 3 — GPU fraud model on 1B rows
Primary: Vertex Workbench (notebook on GPU + Vertex training job)
Hybrid: Studio + bigframes for feature engineering; Vertex for training
Why: GPU + custom architecture exceeds BQML
Request 4 — quarterly emailed PDF
Primary: Studio (canvas → scheduled query) + Looker (PDF schedule)
Why: transform in Studio; Looker handles the email + PDF
Step-by-step explanation.
- Request 1 is ephemeral — wiring it into Looker (LookML + dashboard + access) is overkill. A Studio notebook + matplotlib is faster.
- Request 2 is governed, recurring, end-user-facing — Looker's home turf. Transformations live in Studio / Dataform; LookML exposes them.
- Request 3 needs GPU + custom architecture — BQML cannot do that. Vertex Workbench with Studio for feature prep is the hybrid pattern.
- Request 4 is BQ-native transform + scheduled PDF — Studio for the transform, Looker for the scheduled-report delivery.
- Name the build-vs-serve line and the warehouse-vs-platform line; the tool falls out.
Output.
| # | Primary tool | Hybrid? |
|---|---|---|
| 1 | Studio | none — ephemeral |
| 2 | Looker | Studio / Dataform builds the table |
| 3 | Vertex | Studio for features → Vertex for training |
| 4 | Studio + Looker | Studio transforms, Looker delivers PDF |
Rule of thumb. Most requests have a primary tool and a supporting tool. Senior answers name both; junior answers collapse to "use BigQuery."
Worked example — building the curated path from raw → BQ → Looker
Detailed explanation. A core question — "walk me through the data path from raw events to a Looker dashboard, naming every tool and every table." This is production storytelling that separates engineers who've shipped from those who haven't.
Question. Describe the path from raw_events to a Looker dashboard "WAU by region." Name the tools, tables, and transformation owners.
Input.
| Stage | Layer | Owner |
|---|---|---|
| Ingest | Pub/Sub → BQ streaming insert | Platform team |
| Raw | proj.raw.events |
Platform team |
| Transformation | Dataform | Data engineering |
| Mart | proj.mart.weekly_active_users |
Data engineering |
| Semantic layer | Looker LookML | Analytics engineering |
| Dashboard | Looker | Marketing / Exec |
Code.
End-to-end curated path
=======================
1. Raw events stream into BigQuery
Source: Pub/Sub topic events.app.activity
Sink: proj.raw.events (partitioned by event_date)
2. Dataform (or Studio canvas → exported SQL) materialises a curated table
Source: proj.raw.events
Job: weekly_active_users.sqlx
WITH last_7d AS (
SELECT user_id, region, DATE_TRUNC(event_date, WEEK) AS week
FROM proj.raw.events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
)
SELECT week, region, COUNT(DISTINCT user_id) AS wau
FROM last_7d
GROUP BY week, region
Sink: proj.mart.weekly_active_users
Schedule: nightly
3. Looker LookML exposes the mart as a view
File: weekly_active_users.view.lkml
view: weekly_active_users {
sql_table_name: proj.mart.weekly_active_users ;;
dimension: week { sql: ${TABLE}.week ;; }
dimension: region { sql: ${TABLE}.region ;; }
measure: wau { type: sum sql: ${TABLE}.wau ;; }
}
4. Looker explore stacks views; dashboard renders the chart
Explore: activity
Dashboard: marketing → "WAU by region"
Schedule: daily PDF email to marketing@
Step-by-step explanation.
- Platform team owns raw ingestion — Pub/Sub, streaming insert, partitioned raw table. Studio is not in this layer.
- Data engineering owns the curated table built in Dataform (or canvas → exported SQL). Output is
proj.mart.weekly_active_users, partitioned by week. - Analytics engineering writes LookML — dimensions (
week,region) and measures (wau) — wrapping the mart as a semantic view. - Marketing consumes the dashboard with nightly PDF emails. IAM trail: platform → raw, DE → raw+mart, AE → LookML, marketing → dashboard.
- Studio's role: DE uses Studio (notebook + editor + canvas) while building the Dataform job. Production runs in Dataform; Studio is the authoring surface.
Output (artefact ledger).
| Artefact | Storage | Owner |
|---|---|---|
proj.raw.events |
BQ raw layer | Platform |
weekly_active_users.sqlx |
Dataform Git | Data engineering |
proj.mart.weekly_active_users |
BQ mart layer | Data engineering |
weekly_active_users.view.lkml |
Looker Git | Analytics engineering |
| "WAU by region" dashboard | Looker | Marketing |
Rule of thumb. Naming tools and tables at every stage is the senior signal. Collapsing the path to "raw goes to dashboard" reveals you've not shipped this end-to-end.
Worked example — when BQML in Studio replaces Vertex Workbench
Detailed explanation. A focused probe — "you have a churn model. When does BQML in Studio fully replace Vertex Workbench and when does it not?" Tests whether you've actually compared BQML against the Vertex stack on the warehouse-native ML axis.
Question. Compare BQML in Studio against Vertex Workbench for a churn model. Identify the boundary conditions pushing each side.
Input.
| Dimension | BQML in Studio | Vertex Workbench |
|---|---|---|
| Training data location | BigQuery | BigQuery or anywhere |
| Compute | BQ slots | Notebook VM + optional GPU |
| Model architectures | Linear, logistic, KMeans, random forest, DNN, AutoML | Anything (PyTorch / TF / scikit-learn / XGBoost) |
| Hyperparameter tuning | Limited (BQML grid) | Vertex Vizier or custom |
| Model registry | INFORMATION_SCHEMA.MODELS | Vertex Model Registry |
| Serving | SQL ML.PREDICT
|
Vertex Endpoints |
| Cost model | BQ on-demand or slots | Notebook VM + endpoint billing |
Code.
# Case A — BQML in Studio is enough
# Standard churn problem, ~10 features, logistic regression, served from SQL
import bigframes.pandas as bpd
from bigframes.ml.linear_model import LogisticRegression
customers = bpd.read_gbq("proj.crm.customers")
X, y = customers[["tenure", "spend", "tickets"]], customers["churned"]
model = LogisticRegression().fit(X, y)
# Trained model lives in BQ. Predictions are SQL ML.PREDICT calls.
# Case B — Vertex Workbench is required
# Same churn problem, but with a 200-feature LSTM over event sequences and GPU training
import torch
from torch import nn
class ChurnLSTM(nn.Module):
def __init__(self, hidden=128):
super().__init__()
self.lstm = nn.LSTM(input_size=200, hidden_size=hidden, batch_first=True)
self.fc = nn.Linear(hidden, 1)
def forward(self, x):
_, (h, _) = self.lstm(x)
return torch.sigmoid(self.fc(h.squeeze(0)))
# Train on a GPU VM, register in Vertex Model Registry, serve via Vertex Endpoint
Step-by-step explanation.
- Case A is tabular logistic regression on ~10 features — BQML covers it.
LogisticRegressionissuesCREATE MODEL; predictions areML.PREDICTfrom any SQL surface. - Case B has event-sequence features and an LSTM trained on GPU — BQML cannot do this. Vertex Workbench is the answer.
- Boundary conditions: (a) architecture catalogue (BQML finite vs Vertex unrestricted); (b) compute (BQ slots vs GPU/TPU); (c) registry (INFORMATION_SCHEMA.MODELS vs Vertex Model Registry).
- Decision: BQML in Studio if architecture is in the catalogue AND training fits BQ slots AND serving is from SQL. Otherwise, Vertex.
- Hybrid pattern: features in Studio (
bigframes), curated feature table in BQ, training in Vertex on GPU, model in Vertex Registry, served from Vertex Endpoint.
Output (decision matrix).
| Scenario | Pick |
|---|---|
| Tabular churn, logistic / random forest | BQML in Studio |
| Tabular churn, AutoML | BQML in Studio |
| Sequence LSTM churn | Vertex |
| Large embedding model | Vertex |
| Model registry + multi-version governance | Vertex |
| SQL-served predictions only | BQML in Studio |
| Custom serving (REST endpoint with auth) | Vertex |
Rule of thumb. BQML in Studio is enough when the architecture is in the catalogue and serving is from SQL. Otherwise, Vertex. Senior answers enumerate both axes; junior answers collapse to "use Vertex for ML."
Senior interview question on tool boundaries
A senior interviewer might ask: "Sketch the GCP data stack for a 200-engineer data org. Where does BQ Studio fit, where does Looker fit, where does Vertex fit, and what tools do you still need alongside? Defend each boundary."
Solution Using a layered architecture with explicit boundaries
GCP data stack — 200-engineer org
=================================
Ingest Pub/Sub + Dataflow + Datastream → BigQuery
Raw proj.raw.* (partitioned, clustered)
Transformation Dataform (prod ELT) + Studio canvas (proto) + Dataproc (Spark)
Mart proj.mart.* (curated, governed)
Serving (BI) Looker (governed dashboards) + Studio (ad-hoc charts)
ML training BQML in Studio (warehouse) + Vertex Workbench (GPU/custom)
ML serving ML.PREDICT (SQL) + Vertex Endpoints (REST)
Orchestration Composer/Airflow (cross-service) + BQ Scheduled Queries
Observability Cloud Monitoring + Looker alerts + Vertex Model Monitoring
Boundaries
Build (Studio, Dataform) ←→ Serve (Looker, Endpoints)
Warehouse ML (BQML) ←→ Platform ML (Vertex)
Single-engine (BQ) ←→ Multi-engine (Dataproc, Vertex)
Step-by-step trace.
| Layer | Primary tool | Why it sits here |
|---|---|---|
| Ingest | Pub/Sub + Dataflow + Datastream | Cross-service ingestion is not BQ Studio's domain |
| Transformation | Dataform (prod) + Studio canvas (proto) | Build side; Studio is authoring, Dataform is production |
| Serving (BI) | Looker | Governed, semantic, end-user-facing; not Studio's role |
| Serving (ML) | Vertex Endpoints | Multi-architecture; not BQML's role |
| ML training | BQML in Studio (warehouse) + Vertex (GPU) | Warehouse vs platform line |
| Orchestration | Composer | Cross-service DAGs; Studio cannot do this |
After the layered architecture, every tool has a clear home and the boundaries are testable. A new request maps to a layer and a tool unambiguously.
Output:
| Stack layer | Studio's role | Looker's role | Vertex's role |
|---|---|---|---|
| Ingest | none | none | none |
| Transformation | prototyping + ad-hoc | none | none |
| Mart | reading + ad-hoc | none | none |
| Serving (BI) | one-off charts | governed dashboards | none |
| ML training | BQML | none | non-BQML training |
| ML serving | SQL ML.PREDICT | none | Endpoints |
| Orchestration | none | scheduled PDF | Vertex Pipelines |
Why this works — concept by concept:
- Layered architecture — naming layers (ingest / raw / transform / mart / serve / ML / orchestration / observability) gives every tool a clear home. The conversation shifts from "which tool" to "which layer."
- Build vs serve line — Studio + Dataform on build; Looker + Endpoints on serve. Crossing this line (Looker as transformation tool, Studio as end-user dashboard) is the most common stack mistake.
- Warehouse-ML vs platform-ML — BQML covers warehouse-native, SQL-served models. Vertex covers everything else. The line is the architecture catalogue and serving mode.
- Single-engine vs multi-engine — Studio is BQ-only; Composer and Dataproc are multi-engine. For multi-service DAGs and non-BQ compute, Studio steps aside.
- Cost — over-investing in a wrong layer (Vertex for tabular regression, Looker for one-shot exploration) wastes money and time. The layered map keeps you in your lane.
SQL
Topic — sql
Warehouse SQL design problems
ETL
Topic — etl
Stack design + tool boundary problems
Cheat sheet — Studio + Gemini recipes
- Gemini explain-this-query. Highlight SQL → click "Explain" → read the prose → trace one input through the SQL to verify the mechanism. The explanation is a starting point, not a proof.
- Gemini NL-to-SQL grounding. Annotate every column with a one-line description; tag every primary key in INFORMATION_SCHEMA. Schema grounding is what makes "top 5 customers by revenue" emit a working query.
- Four-point review. Joins (FK relationship) → Filters (date math) → Aggregations (function + GROUP BY) → Window frames (row count). Apply to every AI query before merge.
-
bigframesquick-start.import bigframes.pandas as bpd; df = bpd.read_gbq("proj.dataset.table"). Lazy until.to_pandas(). Inspect compiled SQL viadf.sql. -
bigframescache. Any DataFrame touched by 3+ downstream cells gets.cache(). Cuts iteration cost by an order of magnitude. -
bigframes.mlBQML wrapper.LogisticRegression().fit(X, y)issuesCREATE MODEL;.predict()issuesML.PREDICT. The model lives in BQ. -
Notebook + BQML template. SQL cell to fetch features →
train_test_split→.fit()→.score()→.predict()+ ROC plot. Artefacts in BQ; chart in notebook. - Data Canvas export. Treat exported SQL as the source of truth. Commit both canvas JSON and SQL to Git; re-open canvas only to edit.
-
Canvas → scheduled query. Build on canvas → preview → export to SQL → "Schedule" → cron + destination + service account. Audit via
INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS. - Canvas vs Dataform. Canvas for prototypes + non-engineer hand-offs. Dataform for prod ELT with assertions, environments, and templated SQL across tables.
- Studio vs Looker. Studio is build; Looker is serve. Curated tables in BQ marts; LookML wraps them as views; dashboards consume views. Looker is not a transformation tool.
- Studio vs Vertex. BQML in Studio for warehouse-native + SQL-served models. Vertex for custom architectures, GPU training, registry governance, REST serving.
-
Hybrid Studio + Vertex. Features in Studio (
bigframes); curated feature table in BQ; training in Vertex on GPU; model in Vertex Registry; serving via Endpoint. -
Schedule observability. Wire
INFORMATION_SCHEMA.JOBS+SCHEDULED_QUERIES_RUNSinto Cloud Monitoring; alert on long jobs, failures, and slot saturation. -
AI SQL tagging. Add
/* Generated by Gemini YYYY-MM-DD — prompt: "..." — reviewer: @x */to any AI query in production. Six months later you know the prompt and the reviewer.
Frequently asked questions
What is BigQuery Studio and how is it different from the classic BQ console?
BigQuery Studio is the unified workbench Google ships as the default GCP console BQ entry — it collapses the SQL editor, Python notebooks, the visual Data Canvas, and Gemini-assisted code generation into one workspace sharing the same query engine, IAM, and billing scope. The classic BQ console was a SQL editor and a results pane; everything else lived in adjacent products like Vertex Workbench, Tableau Prep, or external IDEs. Studio's value is not "more features" — it is "all the surfaces in one place, with zero re-auth and zero data movement between them." Engineers who treat Studio as "the new SQL editor" leave most of its value on the table; the notebook and canvas surfaces are what genuinely change how a BQ team works in 2026.
Can Gemini in BigQuery write production-grade SQL?
Gemini is excellent at drafting SQL and dramatically less reliable at production-grade SQL — treat it as a paired pilot, never a committer. Inline auto-complete is high-accuracy for column names and standard syntax because it is schema-grounded against your project's INFORMATION_SCHEMA. Natural-language-to-SQL is strong on standard patterns (top-N per group, monthly aggregation, joins on documented FKs) and weak on multi-step pipelines, uncommon window frames, and cost optimisation. The four-point review — joins, filters, aggregations, window frames — catches the common failure modes in under 10 seconds per query. Production teams pair the review with a CI guard (dry-run + regression diff) and a Generated by Gemini comment header for traceability. The legacy brand "Duet AI for BigQuery" was unified under Gemini in BigQuery.
BigFrames vs pandas — what is the difference and when should I use which?
bigframes is a pandas-compatible DataFrame whose every operation lazily compiles to BigQuery SQL and runs in BQ; pandas evaluates eagerly in-memory on the notebook VM. The API is ~90% identical — .groupby, .merge, .assign, .dt, .str all behave like pandas. The execution model is what differs: bigframes builds a SQL plan that runs on warehouse compute, with .to_pandas() as the materialise moment; pandas runs immediately and is bounded by VM RAM. Pick bigframes for warehouse-scale data, for cost efficiency on a team with BQ slot reservations, and for warehouse-native ML via bigframes.ml. Pick pandas at the boundary where you need a niche library that doesn't speak bigframes (sklearn, matplotlib, custom XGBoost) — call .to_pandas() and continue. The 2026 default for new Studio notebooks is bigframes first, pandas at the edges.
Can I run a notebook on a schedule in BigQuery Studio?
Yes — Studio notebooks can be scheduled via the Notebook Executor, which runs the notebook headlessly on a managed runtime at a cron interval. The schedule config covers cron expression, parameters, runtime VM size, and destination for the executed artefact. The caveats: scheduled notebooks are not a replacement for multi-step DAG orchestration; if your workflow has cross-service dependencies (Pub/Sub trigger, upstream Dataform run, downstream Looker refresh), keep Composer / Airflow for the DAG and let the scheduled notebook be one node in it. For pure BQ-resident notebooks (feature refresh, model retraining, scheduled report generation), the in-Studio scheduler is fine.
BigQuery Studio vs Vertex Workbench — which should I use for ML?
Use Studio when the model architecture is in the BQML catalogue (linear, logistic, KMeans, random forest, DNN, AutoML) AND the serving mode is from SQL (ML.PREDICT) — bigframes.ml gives you scikit-learn-style ergonomics over BQML, the model lives in BQ, and there is no separate notebook VM cost beyond the editing session. Use Vertex Workbench when any of these is true: the model needs a custom architecture (sequence LSTM, transformer fine-tune, custom embedding), training needs GPU/TPU acceleration, you need versioning and lineage via the Vertex Model Registry, or the serving mode is a REST endpoint with auth (Vertex Endpoint). The hybrid pattern most production teams settle on: feature engineering in Studio (bigframes + curated feature table in BQ); training in Vertex on GPU; model registered in Vertex; serving via Vertex Endpoint or back into BQ via batch prediction. Studio replaces Workbench for BQ-native ML, not for the rest.
Data Canvas vs Dataform — when do I use each?
Use Data Canvas for visual exploration, prototyping, branch-and-merge transforms with few external dependencies, and pipelines maintained by non-engineers. The visual graph collapses nested CTEs into a diagram; the exported SQL is what runs in BQ when the canvas is scheduled. Use Dataform for production ELT with many tables, version-controlled SQLX in Git, data assertions (uniqueness, not_null, accepted_values), templated SQL across similar tables, and multi-environment deploys (dev / staging / prod). The migration playbook canvas → Dataform is mechanical: export the SQL, split each CTE into a SQLX file, add assertions, hook into tag-based deploy, decommission the original scheduled query. Most production data orgs use both — Canvas for the first few weeks of a new transform, Dataform once it stabilises. The line is lifecycle stage, not "which tool can express the transform."
Practice on PipeCode
- Drill the SQL practice library → for BigQuery SQL — windows, partition pruning, qualify clauses, the gnarly frames Gemini sometimes gets wrong.
- Rehearse on ETL practice problems → for warehouse-level pipeline design — Canvas → exported SQL → scheduled query patterns.
- Sharpen warehouse aggregation with the aggregation drills → — group-by, window, and rollup patterns that dominate BQ work.
- Layer the data manipulation library → for
bigframes-style DataFrame patterns — joins, derives, splits, unions — that map onto the canvas. - Treat
bigquery studioas part of the modern warehouse stack — practice SQL hard, and Studio + Gemini ergonomics become a fluency multiplier on top of fundamentals.
Lock in Studio + Gemini muscle memory
BQ docs explain the surface. PipeCode drills explain the decision — when Studio replaces a separate notebook stack, when Gemini SQL needs review, when Data Canvas exports beat hand-rolled SQL. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.





Top comments (0)