DEV Community

Cover image for BigQuery Studio + Gemini: Notebooks, Data Canvas & AI-Assisted SQL
Gowtham Potureddi
Gowtham Potureddi

Posted on

BigQuery Studio + Gemini: Notebooks, Data Canvas & AI-Assisted SQL

bigquery studio is the single biggest workbench change Google has shipped to BigQuery since the standard-SQL migration in 2018 — and the one most senior data engineers underuse, because they still think of BQ as "the SQL console plus the API" instead of as a unified workspace where SQL, Python notebooks, visual pipelines, and bigquery gemini assistance all share the same query engine. In 2026, the GCP console's default BQ entry point is Studio, the Gemini integration is GA across the surface, and the line between "data engineer" and "analytics engineer" runs right through the question of whether you actually use the notebook and canvas surfaces or fall back to the editor by reflex.

This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "how do you use gemini in bigquery for production-grade SQL without shipping a hallucinated join condition?" or "when does a bigquery data canvas export beat a hand-rolled scheduled query?" It walks through the four Studio surfaces — the Gemini-assisted SQL editor (auto-complete, explain-this-query, NL-to-SQL with schema grounding), bigquery notebooks (Colab-style Python + SQL cells, bigframes pandas-on-BigQuery, BQML training inline), the visual Data Canvas (drag-and-drop branch-and-merge pipelines that compile to SQL), and Studio's positioning against Looker and Vertex Workbench — and finishes with the senior-interview-grade separation-of-concerns between workbench, BI tool, and ML platform. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for BigQuery Studio + Gemini — bold white headline 'BigQuery Studio + Gemini' with subtitle 'Notebooks, Data Canvas, AI-Assisted SQL' and a stylised unified-workbench scene on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on ETL practice problems →, and sharpen the warehouse-aggregation axis with the aggregation drills →.


On this page


1. Why Studio is the new BQ workbench

bigquery studio collapses the SQL editor, the notebook, and the visual canvas into one workspace — the question is no longer "which tool" but "which surface for which task"

The mental model in one line: BigQuery Studio is a unified workspace sharing one query engine, one IAM surface, one billing scope, and one set of datasets across SQL editing, Python notebooks, Data Canvas pipelines, and Gemini-assisted code generation. Before Studio, those four surfaces lived in four different products — the legacy SQL editor, Vertex Workbench for notebooks, a third-party tool like Tableau Prep for canvas-style flows, and no AI assist at all. Studio turns them into tabs in the same console, reading and writing the same datasets through the same access policies.

The four must-answer axes interviewers actually probe.

  • SQL editor with Gemini. The BQ SQL pane has bigquery ai assist baked in — auto-complete for whole window functions and CTEs, "Explain this query" prose narration, and a text to sql bigquery prompt grounded on dataset schema. The senior question is never "does Gemini work?" — it is "how do you keep wrong answers out of production?"
  • Notebooks. Colab-style runtime sandboxes mixing SQL and Python cells. The Python kernel ships with bigframes (pandas-on-BQ) and bigframes.ml (BQML wrapper). This is the surface that replaces Vertex Workbench for BQ-centric ML.
  • Data Canvas. A drag-and-drop visual editor for branch-and-merge SQL transformations; the graph exports to a single SQL script or schedules as a query. BigQuery is the engine — no intermediate Excel or in-memory tier.
  • AI assist across the surface. Gemini is a sidecar in every surface — completes in the editor, generates cells in notebooks, suggests joins on canvas. Senior engineers treat it as a paired pilot, never the captain.

Why the unification matters in 2026.

  • One IAM surface. A roles/bigquery.dataViewer on the source dataset and roles/bigquery.jobUser on the project is enough to read in SQL, in a notebook, and on the canvas. No separate notebook-runtime IAM.
  • One billing scope. Every query — editor, notebook cell, canvas, or Gemini — bills the same project against the same on-demand or reservation slot pool. Slot usage is observable in one INFORMATION_SCHEMA view.
  • One copy of the data. bigframes operations stay server-side; the canvas's intermediate steps are CTEs, not materialised tables. The "single source of truth" promise is preserved.
  • Gemini grounded on schema. Every Gemini call is grounded with the project's schema metadata (table names, column types, descriptions, PK tags). That cuts hallucinated column names dramatically.

The 2026 reality — what changed since 2023.

  • Studio is the default GCP console BQ entry since mid-2024 — the classic SQL editor is reachable but no longer the default.
  • Gemini is GA across the surface. The old "Duet AI for BigQuery" brand was renamed and unified under Gemini in BigQuery; SQL-assist, notebook-assist, and canvas-assist are GA.
  • bigframes 1.x is stable. Notebooks default to bigframes for DataFrame work rather than pandas + bq client.
  • Data Canvas exports clean SQL. The 2026 exporter produces idiomatic CTEs with named steps that map back to canvas nodes.

What interviewers listen for.

  • Saying "Studio shares one query engine, one IAM, and one billing scope across editor, notebook, and canvas" in the first sentence — senior signal.
  • Describing Gemini as a paired pilot with a mandatory human review step on generated SQL — required answer.
  • Naming bigframes pushdowns as why notebook ETL is cheap and Vertex Workbench is no longer the default for BQ workflows — senior signal.
  • Pushing back on "Studio replaces Looker" with "Studio is a workbench, Looker is a governed-BI semantic layer — different sides of the build-vs-serve line" — senior signal.

Worked example — same task, three surfaces

Detailed explanation. A common interviewer warm-up — "monthly revenue per region for the last 12 months on a sales table partitioned by order_date and clustered by region. Show me this in the SQL editor, in a notebook with bigframes, and on the Data Canvas. Which surface fits each scenario: ad-hoc, one-off chart in a model card, production-scheduled query?"

The question tests whether you can map the same logical query onto each Studio surface and which surface is the right home for which lifecycle stage.

Question. Express the query in the SQL editor (raw SQL), in a notebook (bigframes), and as a Data Canvas pipeline. State which surface fits each of the three scenarios.

Input.

surface scenario fit input artefact
SQL editor ad-hoc exploration typed query
Notebook + bigframes one-off chart in a model card .ipynb with markdown + cell
Data Canvas production-scheduled query canvas graph exported to SQL

Code.

-- Surface 1: SQL editor (Gemini may auto-complete this from "monthly revenue per region")
SELECT
  region,
  DATE_TRUNC(order_date, MONTH) AS month,
  SUM(amount)                   AS revenue
FROM `proj.sales.orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
GROUP BY region, month
ORDER BY region, month;
Enter fullscreen mode Exit fullscreen mode
# Surface 2: Notebook with bigframes (pandas-style, pushes down to BQ)
import bigframes.pandas as bpd

orders = bpd.read_gbq("proj.sales.orders")
last12 = orders[orders["order_date"] >= bpd.Timestamp.now() - bpd.Timedelta(days=365)]

monthly = (
    last12
    .assign(month=last12["order_date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["region", "month"])["amount"]
    .sum()
    .reset_index()
    .sort_values(["region", "month"])
)

# bigframes only materialises when you call .to_pandas() or render — until then it's a SQL plan
monthly.head()
Enter fullscreen mode Exit fullscreen mode
# Surface 3: Data Canvas (described as a graph; export produces SQL below)
Node 1 — Source           : proj.sales.orders
Node 2 — Filter           : order_date >= CURRENT_DATE - INTERVAL 12 MONTH
Node 3 — Derive column    : month = DATE_TRUNC(order_date, MONTH)
Node 4 — Group + aggregate: GROUP BY region, month → SUM(amount) AS revenue
Node 5 — Sort             : region ASC, month ASC
Node 6 — Sink             : proj.sales.monthly_revenue_by_region (overwrite)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. All three surfaces emit BigQuery SQL — the editor types it, bigframes compiles pandas to SQL on .to_pandas(), the canvas exports its node graph to CTEs.
  2. The SQL editor is the fastest path for ad-hoc exploration; Gemini auto-complete shortens typing but does not change what runs.
  3. bigframes shines when surrounding work is Python — feature engineering, matplotlib plotting, sklearn calls. SQL pushdown means no data leaves BQ.
  4. The canvas wins when branching logic + non-engineer maintainers meet — the visual flow is easier to onboard onto than a 200-line CTE chain.
  5. Mapping surfaces to scenarios: editor for ad-hoc; notebook for narrative one-shots; canvas for visual handoff to scheduled queries.

Output.

scenario best surface why
Ad-hoc exploration SQL editor fastest iteration; Gemini accelerates typing
One-off chart in a model card Notebook + bigframes inline Python, matplotlib, markdown narrative
Production-scheduled query Data Canvas → exported SQL visual flow + scheduled-query target

Rule of thumb. Pick the surface matching the lifecycle stage of the work, not the one you're most fluent in. Studio's whole point is that switching surfaces costs zero IAM, zero data movement, and zero context.

Worked example — what Studio replaces (and what it does not)

Detailed explanation. A senior probe — "what tools does Studio actually replace in a typical GCP data stack, and what is still needed alongside?" Studio replaces some adjacent tools fully, partially replaces others, and is irrelevant to a third group. Mapping that surface area precisely is a senior signal.

Question. For each adjacent tool (Vertex Workbench, Looker, Composer, Dataform, Dataproc), state whether Studio fully replaces it, partially replaces it, or is unrelated.

Input.

Adjacent tool Primary use case
Vertex Workbench General-purpose ML notebook runtime
Looker Governed BI semantic layer for end users
Cloud Composer Workflow orchestration across many services
Dataform Version-controlled SQL transformation (dbt-like)
Dataproc Spark/Hadoop cluster runtime

Code.

Replacement matrix — Studio vs adjacent tools
=============================================

Vertex Workbench  → Fully replaced for BQ-centric notebooks (bigframes + BQML)
                    Still needed for GPU training, non-BQ ML, Model Registry
Looker            → NOT replaced — Studio is build, Looker is serve
Composer          → NOT replaced — no DAG scheduler in Studio
Dataform          → Partial overlap — both produce versioned SQL
                    Dataform wins on Git + assertions + dependency graph
Dataproc          → NOT replaced — no Spark engine in Studio
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Vertex Workbench loses ground for BQ-centric ML because bigframes + BQML cover the warehouse-native model lifecycle inside Studio. Workbench is still required for GPU, custom dependencies, or Vertex Model Registry.
  2. Looker is the most common "Studio replaces this" misconception — Looker is a semantic layer (LookML dimensions, measures, explores). Studio has no semantic layer. The build-vs-serve line separates them.
  3. Composer/Airflow is unaffected — Studio has no DAG scheduler. Canvas exports to a single scheduled query, not multi-step DAGs with retries and SLAs.
  4. Dataform sits closest to overlap. Both produce versioned SQL. Dataform is Git-native, has assertions, supports a multi-table dependency graph. Pattern: prototype on canvas, productionise in Dataform.
  5. Dataproc owns Spark workloads. Studio cannot run Spark; for Spark, Dataproc remains the answer.

Output.

Tool Studio's effect Recommendation
Vertex Workbench Fully replaced for BQ-centric ML Keep for GPU + non-BQ ML
Looker Not replaced Studio builds, Looker serves
Composer Not replaced Composer owns cross-service DAGs
Dataform Partial overlap Canvas for proto, Dataform for prod
Dataproc Not replaced Dataproc owns Spark

Rule of thumb. If a tool is BQ-specific and single-user-facing, Studio absorbs it. If it is cross-service, governed-multi-user, or non-SQL, Studio sits alongside it.

Worked example — the four-axis Studio readiness check

Detailed explanation. Before a team migrates from "classic BQ console + separate notebook IDE" to Studio, a senior engineer runs a four-axis readiness check — SQL editor depth, notebook depth, canvas appetite, Gemini governance. Each axis has a pass/fail criterion and a remediation if the team fails.

Question. Define the four-axis Studio readiness check. For each axis, give the pass/fail criterion and the action on fail.

Input.

Team / workload SQL editor Notebook Canvas Gemini comfort
ETL backend Heavy Medium Low High
Analytics Heavy High Medium Medium
ML team Medium Heavy Low High
Business analysts Light Light Heavy Low

Code.

Studio readiness check
======================

Axis 1 — SQL editor depth
  Pass: team writes >50% of queries in BQ SQL today
  Action on fail: pilot the editor for a week before full migration

Axis 2 — Notebook depth
  Pass: team uses Python notebooks at least weekly
  Action on fail: train on bigframes; keep Vertex Workbench as fallback

Axis 3 — Canvas appetite
  Pass: at least one workflow maps to branch-and-merge SQL
  Action on fail: skip the canvas; editor + scheduled queries

Axis 4 — Gemini governance
  Pass: team has a code-review policy for AI-generated SQL
  Action on fail: write the review checklist before enabling Gemini
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Axis 1 (SQL editor) is the easiest to pass — most BQ-heavy teams already write SQL in the console. The fail mode is teams who live in third-party IDEs.
  2. Axis 2 (notebooks) is the most commonly failed axis on ETL teams. The fix is two weeks of bigframes practice and pair sessions before relying on notebooks for critical work.
  3. Axis 3 (canvas) is binary — branch-and-merge workloads benefit, linear single-query workloads do not. Don't force it.
  4. Axis 4 (Gemini governance) is the one that bites teams hardest in production. Insert a human review step between Gemini output and production — checked into code review, not optional.
  5. Studio's surfaces are not free — each requires team practice. The readiness check decides where to invest the practice budget.

Output.

Team Axes passed Studio adoption plan
ETL backend 1, 4 Pilot canvas; train on notebooks
Analytics 1, 2, 4 Full Studio; defer canvas
ML team 2, 4 Notebooks first; defer canvas
Business analysts 3 Canvas-first; SQL editor fallback

Rule of thumb. Migration is not all-or-nothing — the readiness check picks the order of adoption.

Senior interview question on Studio's place in the GCP stack

A senior interviewer often opens with: "BigQuery Studio launched in 2023 and became the default BQ console in 2024. From a senior data engineer's perspective, what did it actually change in your day-to-day workflow, and what stayed the same? Where do you still reach for tools outside Studio, and why?"

Solution Using surface-by-surface workflow mapping

Studio workflow mapping — what changed, what stayed
===================================================

Day-to-day change matrix
------------------------

Workflow                          Pre-Studio                       With Studio
Ad-hoc query                      BQ classic console               Studio editor + Gemini auto-complete
Exploration with charts           Workbench notebook + bq client   Studio notebook + bigframes
Visual branch-merge transform     Tableau Prep / hand-written SQL  Data Canvas → exported SQL
Schedule a query                  Scheduled Queries UI             Same UI, accessible from canvas
Train BQML model                  SQL editor + manual EVAL queries Notebook + bigframes.ml
End-user dashboard                Looker                           Looker (unchanged)
Cross-service DAG                 Composer / Airflow               Composer (unchanged)
Spark batch job                   Dataproc                         Dataproc (unchanged)

What changed: notebook + canvas + Gemini are now in one workspace, sharing IAM and billing
What stayed: BI (Looker), orchestration (Composer), Spark (Dataproc), GPU ML (Vertex)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Workflow Where I used to live Where I live now Why I moved
Ad-hoc SQL Classic editor Studio editor Gemini auto-complete + schema panel
Notebook exploration Vertex Workbench Studio notebook bigframes pushdown; shared IAM
Visual transform Tableau Prep Studio canvas One IAM; export to scheduled SQL
BQML training SQL editor Studio notebook bigframes.ml + chart in same cell
Looker dashboard Looker Looker Studio has no semantic layer
Composer DAG Composer Composer Studio has no cross-service scheduler
Spark job Dataproc Dataproc Studio has no Spark engine

After the mapping pass, the verdict is clear: Studio absorbs the BQ-specific surfaces (editor, notebook, canvas, AI assist) and leaves the cross-service, governed-BI, and Spark surfaces alone.

Output:

Verdict Reasoning
Studio is a workbench, not a platform One IAM + one billing across editor / notebook / canvas / Gemini
Studio does not replace Looker, Composer, or Dataproc Different problem domains
Studio mostly replaces Vertex Workbench for BQ work bigframes + BQML are first-class
Studio's edge: AI assist + zero-friction surface switching Grounded Gemini + shared schema

Why this works — concept by concept:

  • Surface-by-surface mapping — walk each workflow and identify where the work used to live. "Studio is a unified workspace" is the junior answer; the senior answer enumerates surfaces.
  • Unified IAM and billing — the structural change is "one IAM and one billing scope across four surfaces," not more features. That is what lets a notebook query a table the editor just looked at with no re-auth.
  • Build vs serve line — Studio sits on build; Looker sits on serve. They are adjacent, not competitors. "Studio replaces Looker" reveals the line has not been internalised.
  • BQ-specific vs cross-service — Studio absorbs BQ-specific tools (editor, BQ notebook, BQ canvas, BQ AI assist). It leaves cross-service tools (Composer, Dataproc, Vertex GPU) alone.
  • Cost — migration is O(team workflows) of retraining. Staying on classic tools costs O(time) of fragmented IAM and surface switching. Most teams come out ahead — but only if they do the readiness check first.

SQL
Topic — sql
BigQuery SQL practice problems

Practice →

ETL Topic — etl Warehouse ETL design problems

Practice →


2. SQL editor + Gemini assist

Gemini turns the SQL editor into a paired pilot — bigquery ai assist accelerates typing, but only a disciplined review pattern keeps hallucinated joins out of production

The mental model in one line: the Gemini-assisted SQL editor offers three distinct AI surfaces — inline auto-complete, "explain this query," and natural-language-to-SQL — and a senior engineer treats each as a draft that must pass a four-point review before it ships. The features are powerful because they surface without you leaving the SQL pane; the cost is that "just paste Gemini's answer" is the most common production-bug source in 2026 BQ codebases.

Visual diagram of the Gemini-assisted SQL editor — left pane showing a SQL editor with inline Gemini auto-complete and a 'Generate from prompt' button; right pane showing an 'Explain this query' panel; on a light PipeCode card with purple, green, and orange accents.

The three AI surfaces in the editor.

  • Inline auto-complete. As you type, Gemini suggests the next tokens — column names, function calls, full window expressions. Tab to accept. Grounded on the FROM table schema.
  • Explain this query. Highlight a SQL block, click "Explain", and Gemini returns a prose narration — joins, filters, aggregations, window logic. Useful for inheriting unfamiliar SQL.
  • Natural-language-to-SQL. A "Generate SQL" prompt that takes free-text English and emits a query grounded on the active project's schema. Strong on standard patterns; weaker on multi-step pipelines and uncommon window frames.

Schema grounding — why Gemini hallucinates less inside Studio.

  • The editor passes table names, column names, types, descriptions, and PK tags from INFORMATION_SCHEMA as context to Gemini.
  • The grounding cuts hallucinated column names dramatically vs a generic ChatGPT prompt.
  • Grounding fails when schema is unannotated. Populating column descriptions pays for itself in NL-to-SQL accuracy.

The four-point review pattern (senior signal).

  • Joins. Verify each ON clause against a documented FK relationship.
  • Filters. Check WHERE semantics and date math (DATE_SUB vs DATE_TRUNC are not the same).
  • Aggregations. Match the aggregation function and GROUP BY to business intent.
  • Window frames. Count the rows in ROWS BETWEEN N PRECEDING AND CURRENT ROW.

Where Gemini struggles in 2026.

  • Multi-step pipelines. Beyond ~4 CTEs produces verbose, repetitive SQL.
  • Uncommon window patterns. Sessionisation, gap-and-island, top-N-with-ties — close but often wrong.
  • Cost optimisation. Does not reliably partition-prune or cluster-prune.
  • Schema evolution. New columns may not appear in older training data; grounding mitigates but does not eliminate.

The senior-engineer pattern — Gemini as drafter, not committer. Gemini drafts; engineer reviews; CI validates; code review approves. Treat every generated query as if a smart-but-inexperienced colleague wrote it.

Worked example — auto-complete on a window function

Detailed explanation. A common senior probe — "you're typing a 7-day moving average; Gemini auto-completes the window frame. What should it suggest, what does it actually suggest, and how do you tell correct from 'close enough but wrong'?" The question tests whether you read auto-completed code or rubber-stamp it.

Question. Type a 7-day trailing moving average over a sales table partitioned by order_date. Show the prompt-prefix, the likely auto-complete, and the four-point review you'd perform.

Input.

order_date region revenue
2026-06-01 US 100
2026-06-02 US 120
2026-06-03 US 90
2026-06-04 US 110
2026-06-05 US 130
2026-06-06 US 95
2026-06-07 US 140
2026-06-08 US 105

Code.

-- You type up to here, then pause:
SELECT
  order_date,
  region,
  revenue,
  -- Gemini auto-completes the next line. The suggestion you actually want:
  AVG(revenue) OVER (
    PARTITION BY region
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS revenue_7d_avg
FROM `proj.sales.daily_revenue`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY)
ORDER BY region, order_date;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The intent is "7-day trailing moving average" — the window is today plus the previous 6 days.
  2. Gemini may auto-complete ROWS BETWEEN 7 PRECEDING AND CURRENT ROW (8-day window) instead of 6 PRECEDING (correct 7-day). The off-by-one is a classic Gemini failure mode.
  3. Four-point review: (a) joins — none. (b) filters — partition-pruning date math verified. (c) aggregations — AVG(revenue) correct. (d) window frames — count rows in the frame.
  4. The senior pause on the window frame: CURRENT ROW + 6 PRECEDING = 7 rows. Accept or edit.
  5. The four-point review takes under 10 seconds and catches the most common bugs.

Output.

order_date region revenue revenue_7d_avg
2026-06-01 US 100 100.00
2026-06-02 US 120 110.00
2026-06-03 US 90 103.33
2026-06-04 US 110 105.00
2026-06-05 US 130 110.00
2026-06-06 US 95 107.50
2026-06-07 US 140 112.14
2026-06-08 US 105 112.86

Rule of thumb. Every Gemini window frame deserves a one-second mental count: "how many rows are in this window?" If the answer isn't the number you wanted, edit before accepting.

Worked example — natural-language to SQL with schema grounding

Detailed explanation. A common demo — "top 5 customers by revenue in the last 30 days, only US region" emits a working query. Senior interviewers probe this surface because it is impressive and the source of more production bugs than any other Gemini feature.

Question. Use Generate SQL on the schema below for that prompt. Show the prompt, the likely output, and the schema-grounding details that make it reliable.

Input — schema.

table column type description
proj.sales.orders order_id INT64 primary key
proj.sales.orders customer_id INT64 FK to customers
proj.sales.orders order_date DATE partition key
proj.sales.orders amount NUMERIC order total
proj.sales.orders region STRING clustered
proj.sales.customers customer_id INT64 primary key
proj.sales.customers name STRING display name

Code.

Prompt to Gemini: "Top 5 customers by revenue in the last 30 days, only US region"
Enter fullscreen mode Exit fullscreen mode
-- Likely Gemini output (with schema grounding from project metadata)
SELECT
  c.customer_id,
  c.name,
  SUM(o.amount) AS revenue_30d
FROM `proj.sales.orders` AS o
JOIN `proj.sales.customers` AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND o.region = 'US'
GROUP BY c.customer_id, c.name
ORDER BY revenue_30d DESC
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The prompt is natural-language; Gemini reads it plus schema context (tables, columns, types, descriptions, PK tags) and emits parameterised SQL.
  2. Schema grounding gives Gemini the table names, the PK-tagged join column, the partition column, and the cluster column — all four guide the generated query.
  3. Four-point review: join uses the documented PK; filter partition-prunes and cluster-prunes; SUM(o.amount) is the obvious revenue measure; no window frames.
  4. The query is production-quality on the first pass because the schema was annotated. Without annotations, Gemini might miss the cluster column or hallucinate a region field.
  5. Invest in schema annotations before you expect NL-to-SQL to be reliable.

Output.

customer_id name revenue_30d
1001 Acme Corp 12,500
1002 Beta Inc 11,400
1003 Gamma LLC 9,800
1004 Delta Co 8,750
1005 Epsilon Ltd 7,600

Rule of thumb. NL-to-SQL accuracy tracks schema annotation quality more than Gemini model version. Annotate every column and tag every PK — the payoff is immediate.

Worked example — explain-this-query on inherited SQL

Detailed explanation. A common probe — paste a gnarly inherited SQL block (sessionisation with gap-and-island) and ask "Explain this query." The question: do you treat the prose as authoritative or as a starting point?

Question. A teammate hands you a sessionisation query. Use "Explain this query" to bootstrap understanding; identify which parts of the explanation deserve independent verification.

Input.

user_id event_time
u1 09:00:00
u1 09:10:00
u1 09:45:00
u1 10:30:00

Code.

-- Inherited SQL, paste into the editor, select all, click "Explain"
WITH gaps AS (
  SELECT
    user_id,
    event_time,
    TIMESTAMP_DIFF(
      event_time,
      LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
      MINUTE
    ) AS gap_min
  FROM `proj.analytics.events`
),
sessions AS (
  SELECT
    user_id,
    event_time,
    SUM(CASE WHEN gap_min IS NULL OR gap_min > 30 THEN 1 ELSE 0 END)
      OVER (PARTITION BY user_id ORDER BY event_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
  FROM gaps
)
SELECT user_id, session_id, MIN(event_time) AS session_start, COUNT(*) AS events
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Gemini's typical explanation: "groups events into sessions per user; gaps CTE computes the per-user gap; sessions CTE increments a counter whenever the gap exceeds 30 minutes."
  2. The prose is roughly right but glosses over the gap-and-island invariant — the cumulative SUM over the boundary indicator is the actual mechanism.
  3. Verification: at 09:00 gap is NULL → boundary, session 1. At 09:10 gap = 10 → no boundary, session 1. At 09:45 gap = 35 > 30 → boundary, session 2. At 10:30 gap = 45 > 30 → boundary, session 3.
  4. The explanation is a starting point, not a proof — Gemini summarises intent, not correctness.
  5. Senior pattern: trust the explanation's intent, independently verify the mechanism by tracing one or two inputs.

Output.

user_id session_id session_start events
u1 1 09:00:00 2
u1 2 09:45:00 1
u1 3 10:30:00 1

Rule of thumb. Use "Explain this query" to accelerate understanding; always trace one input through the SQL by hand before relying on the explanation. Gemini is a fast first draft, never a proof.

Senior interview question on Gemini governance

A senior interviewer might frame this as: "Your team enabled Gemini in the SQL editor a quarter ago. Three production incidents this quarter were traced back to AI-generated SQL that nobody reviewed carefully enough. How do you fix the process without disabling Gemini and losing the productivity gain?"

Solution Using a four-point review checklist + CI guard rails

-- Step 1: Codify the four-point review as a pull-request checklist
-- (Committed to .github/PULL_REQUEST_TEMPLATE.md)

-- AI-generated SQL review checklist
-- [ ] Joins: every ON clause uses a documented foreign-key relationship
-- [ ] Filters: WHERE / HAVING semantics verified against table descriptions
-- [ ] Aggregations: GROUP BY columns match the business intent
-- [ ] Window frames: row counts verified by tracing one example input

-- Step 2: Add a CI guard — every BQ SQL file with a "Generated by Gemini" tag
-- runs a dry-run plus a regression test before merge:

-- ci/validate_sql.sh (run on every PR touching sql/*.sql):
--   bq query --dry_run --use_legacy_sql=false < $sql_file
--   bq query --use_legacy_sql=false --max_bytes_billed=10000000 < $sql_file > /tmp/result.csv
--   diff /tmp/result.csv test/expected/$sql_file.expected.csv

-- Step 3: Tag the generated SQL at write time
-- Add a comment header to any query produced by Gemini:
/*
 * Generated by Gemini in BQ Studio on 2026-06-22
 * Prompt: "top 5 customers by revenue in the last 30 days, only US region"
 * Reviewer: @teammate (sign-off required)
 */
SELECT ...
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Incident Root cause What the checklist would have caught
Wrong join key Gemini joined on id not customer_id Point 1 (joins): verify FK relationship
Off-by-one window Gemini used 7 PRECEDING for a 7-day window Point 4 (frames): count rows in the frame
Missed cluster filter Gemini omitted region = 'US' Point 2 (filters): cross-check business intent
Wrong aggregation Gemini used COUNT(*) not COUNT(DISTINCT) Point 3 (aggregations): match business intent

After the checklist + CI guard rails are in place, the team retains the typing-speed gain from Gemini and loses the production-incident rate that came with rubber-stamping its output.

Output:

Lever What it changes What it preserves
Four-point review checklist Every AI SQL passes a mental trace Gemini typing-speed gain
CI dry-run + regression Every AI SQL passes a syntactic + semantic gate Iteration speed
Generated-by tag Every AI SQL is traceable to a prompt + reviewer Accountability

Why this works — concept by concept:

  • Process over disablement — disabling Gemini after incidents is the junior move. The senior fix is a cheap review step that catches common failure modes without reverting to manual SQL.
  • Four points are enough — joins / filters / aggregations / window-frames catches >90% of Gemini failures. Longer checklists cost review time without proportionate accuracy gain.
  • CI guard rails — humans miss things; a dry-run + regression diff catches cases a tired reviewer skipped.
  • Tag at write time — the Generated by Gemini header preserves traceability when a query breaks six months later.
  • Cost — the four-point review adds ~10 seconds per AI query and saves hours per incident. ROI is asymmetric.

SQL
Topic — sql
SQL editor + Gemini practice

Practice →

SQL Topic — aggregation Aggregation drills (window + group)

Practice →


3. Notebooks — Python + SQL cells

bigquery notebooks ship with bigframes — pandas semantics that compile to BigQuery SQL, so your "ETL in Python" is actually warehouse compute

The mental model in one line: a Studio notebook is a Colab-style document where SQL and Python cells share the same query engine and dataset access, with bigframes — a pandas-compatible DataFrame whose every operation lazily pushes down into BigQuery SQL. That architectural choice — Python cells compiling to SQL rather than pulling data to a notebook VM — is what makes bigquery jupyter notebooks cheap on data sizes that would crush a regular Jupyter kernel.

Visual diagram of a BigQuery Studio notebook — a vertical stack of cells alternating Python (with bigframes) and SQL, with a BQML model training step and a matplotlib chart at the bottom; on a light PipeCode card with purple, green, and orange accents.

The notebook surface in Studio.

  • Cells. SQL cells run BigQuery jobs; Python cells run on a managed Python runtime, both bill through Studio.
  • Sharing between cells. A SQL cell's result is exposed as a bigframes DataFrame in the next Python cell via %%bigquery df --target=bigframes.
  • Markdown cells. First-class — build a model card or one-shot report with title, narrative, charts, and the SQL that produced them.
  • Scheduling. Notebooks can be scheduled on a cron via the Notebook Executor (single-job; multi-step DAGs still belong in Composer).

bigframes — pandas on BigQuery.

  • DataFrame API mirrors pandas (.loc, .iloc, .groupby, .merge, .assign, .dt, .str).
  • Every operation is lazy — builds a SQL plan until you call .to_pandas(), .head(), .cache(), or render.
  • Compiled SQL is visible via df.sql — useful for debugging plan-level bugs.
  • Compute lives in BigQuery; the notebook VM only holds the plan and materialised results.

bigframes.ml — BQML wrapped in scikit-learn semantics.

  • LinearRegression, LogisticRegression, KMeans, RandomForest, plus BQML's DNN and AutoML.
  • .fit() issues CREATE MODEL; .predict() issues ML.PREDICT. Models live in BQ.
  • The notebook is the authoring environment; the model lives in the warehouse and is queryable from any SQL surface.

Python cells beyond bigframes. Standard Python kernel — numpy, scikit-learn, matplotlib, seaborn, plotly. %pip install works inline. Non-BQ data is reachable but pulls into the notebook VM.

When the notebook beats the editor. The work is >50% Python (features, ML, plotting); the output is a narrative document; you need scheduled notebook execution; you mix BQ + non-BQ data.

When the editor still wins. Pure SQL, single query, fastest possible iteration.

Worked example — running a SQL cell + Python cell with bigframes

Detailed explanation. A common notebook pattern — SQL cell to fetch data, Python cell to massage with bigframes, plot in matplotlib. The senior question: which line actually triggers a BQ job, and which line is just appending to the plan?

Question. Build a notebook that fetches monthly revenue, computes year-over-year growth with bigframes, and plots it. Identify the line that triggers the BQ job.

Input.

order_date region amount
2025-01-15 US 500
2025-02-15 US 600
2026-01-15 US 700
2026-02-15 US 750

Code.

# Cell 1 — SQL cell
%%bigquery monthly --target=bigframes
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  region,
  SUM(amount)                   AS revenue
FROM `proj.sales.orders`
WHERE order_date >= DATE('2025-01-01')
GROUP BY month, region;
Enter fullscreen mode Exit fullscreen mode
# Cell 2 — Python cell using bigframes
import bigframes.pandas as bpd

# monthly is a bigframes DataFrame, NOT a pandas DataFrame
# Adding columns is lazy — appends to the SQL plan
monthly = monthly.sort_values(["region", "month"])
monthly["revenue_prev_year"] = (
    monthly.groupby("region")["revenue"].shift(12)
)
monthly["yoy_growth_pct"] = (
    (monthly["revenue"] - monthly["revenue_prev_year"])
    / monthly["revenue_prev_year"] * 100
)

# This line is still lazy — no BQ job yet
result = monthly.dropna(subset=["yoy_growth_pct"])

# THIS line triggers the BQ job (materialises into pandas for matplotlib)
result_pd = result.to_pandas()
Enter fullscreen mode Exit fullscreen mode
# Cell 3 — Plotting
import matplotlib.pyplot as plt

for region, group in result_pd.groupby("region"):
    plt.plot(group["month"], group["yoy_growth_pct"], label=region)
plt.title("YoY revenue growth by region")
plt.xlabel("month")
plt.ylabel("% growth")
plt.legend()
plt.show()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Cell 1 runs a BQ job; the result is exposed in cell 2 as monthly, a bigframes DataFrame.
  2. In cell 2, every bigframes operation (sort_values, groupby, shift, assignments, dropna) adds nodes to the SQL plan — none runs a BQ job. .sql would show the compiled plan.
  3. result.to_pandas() is the line that triggers the BQ job. The plan runs in BQ; the result ships to the notebook VM as a pandas DataFrame.
  4. Cell 3 plots in matplotlib — pure notebook VM work; no BQ.
  5. Lazy execution is the whole point — long bigframes pipelines stay free until you materialise.

Output.

month region revenue revenue_prev_year yoy_growth_pct
2026-01-01 US 700 500 40.0
2026-02-01 US 750 600 25.0

Rule of thumb. Every bigframes operation is lazy until .to_pandas(), .head(), .cache(), or render. Treat .to_pandas() as the "press the button" moment; everything else is plan-building.

Worked example — training a BQML model from a notebook

Detailed explanation. A senior probe — "train and evaluate a logistic regression model end-to-end from a Studio notebook, with BQML under the hood; training data, model, and evaluation chart in the same notebook." The question tests whether you understand that the model lives in BigQuery, not the notebook VM.

Question. Train a bigframes.ml.LogisticRegression model on customers. Evaluate on held-out data. Plot ROC. Be explicit about which artefacts live where.

Input.

feature description
tenure_months months as a customer
monthly_spend average monthly spend
support_tickets tickets in last 90 days
churned (label) 1 if churned, 0 otherwise

Code.

import bigframes.pandas as bpd
from bigframes.ml.linear_model import LogisticRegression
from bigframes.ml.model_selection import train_test_split

customers = bpd.read_gbq("proj.crm.customers")
features = ["tenure_months", "monthly_spend", "support_tickets"]
label = "churned"

X = customers[features]
y = customers[label]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
# The fit() call issues a BQML CREATE MODEL statement against BQ.
# The model object is a handle to the model in BigQuery, not a local sklearn object.

# Evaluate
metrics = model.score(X_test, y_test)
print(metrics)

# Predict probabilities for the ROC curve
preds = model.predict(X_test)            # bigframes DataFrame with predicted_churned + probs
preds_pd = preds.to_pandas()             # materialise for sklearn ROC
y_test_pd = y_test.to_pandas()

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test_pd, preds_pd["predicted_churned_probs"])
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.title("ROC — churn model")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. bpd.read_gbq returns a lazy handle — no data movement.
  2. train_test_split builds the split plan; still lazy.
  3. .fit() is the first line that runs a BQ job — a BQML CREATE MODEL. The trained model lives in BQ; the Python model object is a handle.
  4. .score() issues ML.EVALUATE; .predict() issues ML.PREDICT against the held-out set.
  5. .to_pandas() materialises predictions for sklearn's roc_curve and matplotlib. All training data and the model stay in BQ; only test predictions enter the notebook VM.

Output.

metric value
accuracy 0.84
precision 0.79
recall 0.71
f1 0.75
auc_roc 0.88

Rule of thumb. bigframes.ml is the right surface for warehouse-native ML — training data in BQ, served model queried from SQL. For GPU, custom architectures, or the Vertex Registry, use Vertex Workbench.

Worked example — caching to keep iteration cheap

Detailed explanation. A common gotcha — a notebook with many cells, each kicking off a new BQ job because nothing was cached. The senior pattern: materialise the expensive intermediate once with .cache(), then iterate without re-running the upstream plan.

Question. A notebook explores a 200 GB DataFrame across 15 cells, each re-running the full join. Refactor with .cache().

Input.

Step Job cost (GB scanned)
Cell 3 — initial query 200
Cell 5 — filter (no cache) 200 (re-runs from source)
Cell 7 — groupby (no cache) 200
Cell 9 — sort + head (no cache) 200

Code.

# BEFORE (every cell re-runs the upstream SQL plan)
import bigframes.pandas as bpd

big = bpd.read_gbq("proj.warehouse.fact_join_view")        # cell 3 → 200 GB job
filtered = big[big["region"] == "US"]                       # lazy
agg = filtered.groupby("month").sum()                       # lazy
top = agg.sort_values("revenue", ascending=False).head(10)  # cell 5+7+9 → 200 GB each

# AFTER (cache the expensive intermediate once)
big = bpd.read_gbq("proj.warehouse.fact_join_view").cache()  # cell 3 → 200 GB job once
filtered = big[big["region"] == "US"]                        # against cached
agg = filtered.groupby("month").sum()                        # against cached
top = agg.sort_values("revenue", ascending=False).head(10)   # against cached
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Without .cache(), every materialisation re-issues the SQL plan from source — 200 GB per cell.
  2. .cache() materialises into a temporary BQ table once; subsequent operations read from cache (a few GB).
  3. The cache lifetime is session-bound. Persistent caches live as regular BQ tables for cross-session reuse.
  4. 15 cells × 200 GB → 200 GB once + 15 × cache scan. On on-demand pricing, the saving is real money.
  5. Any bigframes DataFrame touched by 3+ downstream cells deserves a .cache() call.

Output.

Iteration Without cache With cache
Cell 3 200 GB 200 GB
Cell 5 200 GB <1 GB cache scan
Cell 7 200 GB <1 GB cache scan
Cell 9 200 GB <1 GB cache scan
Total 800 GB ~203 GB

Rule of thumb. Cache early, cache once, cache at the most-reused intermediate. Lazy execution rewards thoughtful caching with order-of-magnitude savings.

Senior interview question on bigframes vs pandas

A senior interviewer often asks: "Your team uses pandas in notebooks today, and you're proposing moving to bigframes. What changes for the team, what's the same, and where would you still pick plain pandas?"

Solution Using a feature-by-feature comparison + migration plan

# bigframes vs pandas — the senior comparison

# 1. API surface
#    bigframes: ~90% of the pandas API is mirrored
#    pandas:    100% (it IS the API)
#    Verdict:   bigframes covers the common 90%, edge cases may need pandas

# 2. Execution model
#    bigframes: lazy → SQL → BigQuery compute
#    pandas:    eager → in-memory on the notebook VM
#    Verdict:   bigframes wins on data > VM RAM, pandas wins on local-only flows

# 3. Cost model
#    bigframes: pays BQ on-demand or slot reservation
#    pandas:    pays notebook VM CPU/memory; data egress to load
#    Verdict:   bigframes is cheaper at scale; pandas is cheaper for tiny data

# 4. Library compatibility
#    bigframes: works with scikit-learn (via to_pandas()), matplotlib, seaborn
#    pandas:    works with everything
#    Verdict:   bigframes for plan-building, materialise to pandas for niche libs

# 5. Debuggability
#    bigframes: inspect compiled SQL via df.sql
#    pandas:    inspect intermediate DataFrames inline
#    Verdict:   different debug models; both workable

# When to keep plain pandas
# - Data already in-memory (CSV upload, API response)
# - Data is small (< notebook VM RAM)
# - Library being used has no bigframes equivalent and needs pandas in
# - Iterative debugging where SQL plan inspection is awkward

# Migration plan
# 1. New notebooks default to bigframes
# 2. Existing notebooks: convert the data-fetch step to bpd.read_gbq + .cache()
# 3. Keep .to_pandas() at the boundary where you need a niche library
# 4. Train the team on .sql inspection for plan debugging
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Workflow Before (pandas) After (bigframes) Win
Load 50 GB warehouse table OOM on 16 GB VM lazy plan, runs in BQ bigframes
Groupby + sum + sort 5 min in VM 30 s in BQ bigframes
Quick CSV from a vendor 1 line in pandas indirect via upload to BQ pandas
Sklearn KMeans on tiny features .fit(df) .to_pandas() then .fit() tie
Plot with matplotlib df.plot() df.to_pandas().plot() tie
Persistent feature store hand-rolled in pandas bigframes writes to BQ table bigframes

After the migration plan, the team uses bigframes for everything that touches warehouse data and keeps pandas at the boundary for niche-library interop.

Output:

Surface Tool Reason
Warehouse-scale ETL in notebook bigframes lazy plan + BQ compute
Tiny CSV from a vendor pandas already in-memory
Scikit-learn niche model pandas at the boundary sklearn doesn't speak bigframes
Plotting pandas at the boundary matplotlib speaks pandas
Notebook-scheduled ETL bigframes cost-efficient at warehouse scale

Why this works — concept by concept:

  • Lazy plan vs eager evaluationbigframes builds a SQL plan running server-side; pandas evaluates in the notebook VM. Plan model wins on warehouse-scale; eager wins on tiny data.
  • API paritybigframes mirrors ~90% of pandas; day-to-day code looks identical. The 10% gap is where pandas wins, or you call .to_pandas() at the boundary.
  • Cost surfacebigframes pays BQ; pandas pays notebook VM. Pick by which compute pool is cheaper at your scale.
  • Library interop at the boundary — sklearn, matplotlib, and other pandas-speaking libraries work via .to_pandas(). bigframes.ml covers in-warehouse ML so the boundary moves later.
  • Cost — migration is O(notebooks) of read-and-edit; staying on pandas is O(data size) per re-run. bigframes pays back fast on warehouse-scale teams.

SQL
Topic — sql
Notebook + SQL practice problems

Practice →

ETL Topic — data manipulation Data manipulation drills

Practice →


4. Data Canvas — visual exploration

bigquery data canvas is a branch-and-merge visual editor that exports to clean BigQuery SQL — the right home for transforms that need a flow diagram more than a typed query

The mental model in one line: the Data Canvas is a drag-and-drop graph editor where each node is a SQL transformation step and edges carry result tables between nodes; the graph compiles to a single SQL script with named CTEs that BigQuery actually executes. The canvas is a UI on top of CTEs — but the visual surface is the killer feature for transforms with branch-and-merge logic, lookup joins, or non-engineer maintainers.

Visual diagram of the BigQuery Data Canvas — a graph with 6 nodes (source → filter → derive → branch into two filters → join → sink) showing the data flow visually, with an exported-SQL panel on the right; on a light PipeCode card with purple, green, and orange accents.

The canvas surface.

  • Nodes. Single transformation steps — Source, Filter, Derive column, Group + aggregate, Join, Union, Sort, Sample, Sink. Form-based config panel; no SQL typing for the common cases.
  • Edges. Carry output table of one node into the next. Multiple outputs (branch) and multiple inputs (merge / join / union) are supported.
  • Preview. Every node shows a 100-500 row sample inline via a lightweight LIMIT query.
  • Export to SQL. Emits a single SQL script with one CTE per node, named after the node label.
  • Schedule. Wraps the exported SQL as a BigQuery Scheduled Query with cron, destination, and run-as service account.

When the canvas beats the editor. Branch-and-merge logic; hand-off to non-engineer maintainers; iterative add/remove steps with preview; lookup joins with form-based join nodes.

When the editor still wins. Linear single-block transforms; production-critical transforms with Git diff requirements (keep in Dataform); uncommon SQL the form-based nodes don't cover (custom window frames, recursive CTEs, qualify).

Canvas → exported SQL pattern. Each node label becomes a CTE name; the final node is the top-level SELECT; comments preserve node provenance. The exported SQL is idiomatic and runnable in the editor without modification.

Production pattern. Build on canvas with preview at every step → export to SQL → review SQL in code review → schedule via "Schedule" button → re-open canvas only to edit; re-export and re-deploy.

Worked example — building a 4-step canvas pipeline

Detailed explanation. A classic canvas workflow — load raw events, filter to the last 30 days, derive a date column, group by date and region, sink to a reporting table. Each step is a node with an inline preview; the exported SQL is one CTE per node.

Question. Build a 4-step canvas pipeline aggregating raw events into daily-region totals. Show each node, the exported SQL, and why this is canvas-fit.

Input.

event_id event_time region user_id revenue
e1 2026-06-01 10:15 US u1 50
e2 2026-06-01 11:30 EU u2 35
e3 2026-06-02 09:45 US u3 80
e4 2026-06-02 14:20 APAC u4 25

Code.

Canvas graph
============

Node 1 — Source
  type: BigQuery table
  ref:  proj.events.raw_events

Node 2 — Filter
  predicate: event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

Node 3 — Derive column
  new column: event_date = DATE(event_time)

Node 4 — Group + aggregate
  group by: event_date, region
  aggregates: SUM(revenue) AS daily_revenue, COUNT(DISTINCT user_id) AS dau

Node 5 — Sink
  destination: proj.events.daily_region_totals
  write mode:  WRITE_TRUNCATE
Enter fullscreen mode Exit fullscreen mode
-- Exported SQL from the canvas (paste verbatim into editor or schedule)
WITH
source_raw_events AS (
  SELECT * FROM `proj.events.raw_events`
),
filter_last_30d AS (
  SELECT *
  FROM source_raw_events
  WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
derive_event_date AS (
  SELECT
    *,
    DATE(event_time) AS event_date
  FROM filter_last_30d
),
group_by_date_region AS (
  SELECT
    event_date,
    region,
    SUM(revenue)              AS daily_revenue,
    COUNT(DISTINCT user_id)   AS dau
  FROM derive_event_date
  GROUP BY event_date, region
)
SELECT * FROM group_by_date_region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Source node points at the raw events table; preview shows a 100-row sample.
  2. The Filter node applies the 30-day predicate; preview shows only in-window rows — mistypes surface immediately.
  3. The Derive node adds the event_date column visible in the preview.
  4. The Group + aggregate node collapses rows to one per (event_date, region) with the aggregates.
  5. The Sink writes to the destination with WRITE_TRUNCATE; the exported SQL captures the flow as a CTE chain, ready to schedule.

Output (daily_region_totals).

event_date region daily_revenue dau
2026-06-01 US 50 1
2026-06-01 EU 35 1
2026-06-02 US 80 1
2026-06-02 APAC 25 1

Rule of thumb. A 4-step pipeline is a fine canvas fit — visual, easy to hand off, exports to clean SQL. The same transform in the editor is a 15-line CTE; both work, but canvas wins on readability.

Worked example — branch-and-merge with a lookup join

Detailed explanation. Branch-and-merge is the canvas's killer use case — load orders, split into US and non-US streams, apply different aggregations, join with customers, union into one output. The visual graph collapses an 80-line CTE chain into an at-a-glance diagram.

Question. Build a canvas pipeline branching orders into US and non-US streams with different aggregates, joining each with the customers lookup, and unioning the result.

Input — orders.

order_id region customer_id amount
1 US c1 100
2 EU c2 60
3 US c1 80

Input — customers (lookup).

customer_id name
c1 Acme Corp
c2 Beta Inc

Code.

Canvas graph
============

Node 1 — Source: orders
Node 2 — Source: customers
Node 3 — Filter (orders): region = 'US'        → us_orders
Node 4 — Filter (orders): region != 'US'        → non_us_orders
Node 5 — Group US:        SUM(amount), COUNT(*) per customer_id
Node 6 — Group non-US:    AVG(amount), COUNT(*) per customer_id
Node 7 — Join us_agg  + customers on customer_id
Node 8 — Join non_us_agg + customers on customer_id
Node 9 — Union node 7 + node 8 with a region_label column
Node 10 — Sink: proj.sales.regional_summary
Enter fullscreen mode Exit fullscreen mode
-- Exported SQL
WITH
source_orders AS (
  SELECT * FROM `proj.sales.orders`
),
source_customers AS (
  SELECT * FROM `proj.sales.customers`
),
filter_us AS (
  SELECT * FROM source_orders WHERE region = 'US'
),
filter_non_us AS (
  SELECT * FROM source_orders WHERE region != 'US'
),
group_us AS (
  SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS orders
  FROM filter_us
  GROUP BY customer_id
),
group_non_us AS (
  SELECT customer_id, AVG(amount) AS avg_amount, COUNT(*) AS orders
  FROM filter_non_us
  GROUP BY customer_id
),
join_us AS (
  SELECT g.*, c.name, 'US' AS region_label
  FROM group_us g
  JOIN source_customers c USING (customer_id)
),
join_non_us AS (
  SELECT g.*, c.name, 'NON_US' AS region_label
  FROM group_non_us g
  JOIN source_customers c USING (customer_id)
)
SELECT customer_id, name, region_label, orders,
       total_amount AS total_amount_us,
       NULL AS avg_amount_non_us
FROM join_us
UNION ALL
SELECT customer_id, name, region_label, orders,
       NULL AS total_amount_us,
       avg_amount AS avg_amount_non_us
FROM join_non_us;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Two source nodes load orders and customers; preview confirms schemas.
  2. The orders stream branches into US and non-US filter nodes.
  3. Each filtered stream feeds a different aggregate (sum vs avg) — preview confirms per-customer aggregates.
  4. Each aggregate joins with the customers lookup.
  5. The two enriched streams union with a region_label column; the Sink writes to the destination.
  6. The visual graph collapses ~50 lines of nested CTEs into a single readable diagram.

Output (regional_summary).

customer_id name region_label orders total_amount_us avg_amount_non_us
c1 Acme Corp US 2 180 NULL
c2 Beta Inc NON_US 1 NULL 60

Rule of thumb. Branch-and-merge is the canvas's sweet spot. If your transform splits a single source into multiple streams with different treatment, the visual diagram is much easier to maintain than the equivalent CTE chain.

Worked example — exporting canvas to a scheduled query

Detailed explanation. The production pattern — you built the transform on the canvas, now it runs nightly. The "Schedule" button wraps the exported SQL in a BQ Scheduled Query with cron, destination, and run-as service account. The senior probe: do you treat the canvas state or the exported SQL as the source of truth?

Question. Schedule the 4-step pipeline from the first worked example as a daily query. Identify what goes to source control, what goes to BQ, and how you'd modify the transform six months later.

Input.

Artefact Lives where
Canvas state (graph JSON) BQ Studio workspace
Exported SQL Git repo + BQ Scheduled Query
Schedule config (cron, dest, IAM) BQ Scheduled Query metadata
Audit log of runs BQ INFORMATION_SCHEMA

Code.

-- Production-ready scheduled query
-- (Created from the canvas's "Schedule" button; committed to Git for visibility)

-- @scheduled-query
-- @schedule: every day 02:00 America/Los_Angeles
-- @destination: proj.events.daily_region_totals (WRITE_TRUNCATE)
-- @service-account: bq-sched@proj.iam.gserviceaccount.com
-- @source: canvas/daily_region_totals.canvas

WITH
source_raw_events AS (
  SELECT * FROM `proj.events.raw_events`
),
filter_last_30d AS (
  SELECT *
  FROM source_raw_events
  WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
derive_event_date AS (
  SELECT *, DATE(event_time) AS event_date
  FROM filter_last_30d
),
group_by_date_region AS (
  SELECT
    event_date,
    region,
    SUM(revenue)            AS daily_revenue,
    COUNT(DISTINCT user_id) AS dau
  FROM derive_event_date
  GROUP BY event_date, region
)
SELECT * FROM group_by_date_region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. "Schedule" registers the exported SQL as a BQ Scheduled Query with destination, cron, service account, and retry policy.
  2. The exported SQL is committed to Git; the canvas JSON is also committed for visual editing, but SQL is the deployment artefact.
  3. Six months later: re-open the canvas, edit, re-export, re-deploy. Git diff on the SQL shows the change.
  4. Run history lives in INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS — wire it into Cloud Monitoring for SLA alerts.
  5. Treat the exported SQL as the source of truth, the canvas as the authoring tool.

Output (artefact ledger).

Artefact Storage Purpose
daily_region_totals.canvas (graph JSON) Git repo Re-open in canvas for editing
daily_region_totals.sql (exported SQL) Git repo + BQ Scheduled Query Deployment artefact
Schedule config BQ Scheduled Query metadata Cron, destination, IAM
Run history INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS Audit / monitoring

Rule of thumb. Exported SQL is the source of truth; canvas state is the authoring helper. Both go to Git; the SQL runs in production.

Senior interview question on canvas vs Dataform

A senior interviewer might ask: "Your team uses Data Canvas for a few transforms and Dataform for the rest. Where exactly is the line between them, and have you ever needed to migrate a transform from one to the other?"

Solution Using a feature-by-feature comparison + migration playbook

Data Canvas vs Dataform — when to use which
===========================================

Feature axes
------------

                              Data Canvas      Dataform
Visual editor                 YES              No (SQLX files in Git)
Branch-and-merge              YES (graph)      Manual via dependencies
Dependency graph              Within canvas    Across many tables
Version control               Canvas JSON      Git-native SQLX
Assertions / data tests       No               YES (assertions)
Variables / templating        Limited          YES (JavaScript blocks)
Multi-target deploy           Single sched     Tag-based deploys
Cron / schedule               Per canvas       Per workflow
Hand-off to non-engineers     YES (visual)     Lower (Git + SQLX)
Production governance         Lower            Higher

When to pick Data Canvas
- Exploration, prototyping
- Branch-and-merge transforms with few external dependencies
- Non-engineer maintainers
- Single scheduled output

When to pick Dataform
- Production ELT with many tables
- Assertions / data quality tests
- Multi-environment (dev / staging / prod) deploys
- Engineer-only maintenance
- Templated SQL across many similar tables

Migration playbook (canvas → Dataform)
1. Export canvas SQL
2. Wrap each CTE as a SQLX file with dependencies
3. Add assertions (uniqueness, not_null, accepted values)
4. Hook into Dataform's tag-based deploy
5. Decommission the scheduled query that ran the canvas
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Transform Maturity stage Surface Reason
Daily region totals Stable, 1 year old Dataform production governance + assertions
New branch-and-merge prototype Week 1 Data Canvas visual flow + fast iteration
Per-customer scoring (50 tables) Production Dataform dependency graph + templating
Quarterly board pack source Once a quarter Data Canvas → export non-engineer maintains
Streaming-to-batch staging Production Dataform environment deploys

After the comparison, the line is clear: prototypes and hand-offs to non-engineers live in Canvas; production ELT with quality tests and multi-environment deploys lives in Dataform.

Output:

Use case Pick Why
Visual prototype Canvas Branch-and-merge UI
Production ELT with tests Dataform Assertions + dependencies
Non-engineer maintains Canvas Visual flow
Engineer-only, many tables Dataform Templating + tag deploys
One-off scheduled query Canvas Export + schedule

Why this works — concept by concept:

  • Lifecycle stage decides — prototypes and hand-offs go to Canvas; productionised, governed ELT goes to Dataform. Lifecycle stage is a cleaner predicate than "which tool can express this transform."
  • Assertions are the moat — Dataform's data tests (uniqueness, not_null, accepted_values) are the production governance Canvas cannot provide.
  • Dependency graph — Canvas has a single-graph view; Dataform has a cross-table graph. For 50+ table pipelines, the cross-table view is essential.
  • Migration is mechanical — export SQL, split into SQLX files, add assertions, deploy. The pattern is repeatable.
  • Cost — Canvas authoring is O(weeks); Dataform authoring is O(months). Prototype in Canvas, productionise in Dataform.

ETL
Topic — etl
Visual ETL pipeline problems

Practice →

SQL Topic — aggregation Aggregation + lookup join drills

Practice →


5. Studio vs Looker vs Vertex — choosing

Studio is a workbench, Looker is a governed BI semantic layer, Vertex is an ML platform — the senior signal is naming the build-vs-serve and warehouse-vs-platform lines crisply

The mental model in one line: Studio sits on the build side (workbench for SQL + Python + visual ETL); Looker sits on the serve side (governed BI semantic layer with LookML); Vertex sits on the ML platform side (Workbench, model registry, pipelines, non-BQ ML) — and most production stacks need all three because they solve different problems. Senior interviews probe whether you've internalised these three lines or collapse them into "GCP data tools."

Visual diagram comparing BigQuery Studio, Looker, and Vertex AI side by side — three columns showing icons for SQL editor / notebook / canvas (Studio), LookML / dashboard / explore (Looker), and Workbench / model registry / pipeline (Vertex); on a light PipeCode card with purple, green, and orange accents.

The three tools, one sentence each.

  • BigQuery Studio. Workbench for engineers: SQL + notebooks + canvas + Gemini, sharing one BQ engine, IAM, and billing scope. The build side.
  • Looker. Governed BI semantic layer for end users: LookML dimensions / measures / explores, Looks and Dashboards, dimension-level permissions. The serve side.
  • Vertex AI. ML platform: Workbench, GPU/TPU training, Model Registry, pipelines, endpoints, monitoring. The non-BQ ML side.

The build-vs-serve line. Studio for engineers writing transformations; Looker for end users consuming curated dashboards. Transformations live in Studio / Dataform and feed Looker's semantic layer — Looker does not transform; it serves. Using Looker as a transformation tool (over-leaning on Persistent Derived Tables) is a common junior anti-pattern.

The warehouse-vs-platform line. Studio is BQ-only — every surface binds to BigQuery as the engine. Vertex is multi-engine — Workbench can train on GPU, run PyTorch / TensorFlow, register in Model Registry, serve via endpoints. For BQ-centric ML, Studio is enough; for everything else, Vertex.

Hybrid stacks — the production reality.

  • Studio + Looker. Most common. Engineers build in Studio; Looker serves curated outputs as dashboards.
  • Studio + Vertex. When the model needs GPUs, custom code, or registry governance. Studio for BQ-side authoring; Vertex for heavy training and serving.
  • All three. Large data orgs — Studio for engineer workflows, Looker for end users, Vertex for non-BQ ML.

Senior interview signals.

  • Saying "Studio is build, Looker is serve, Vertex is ML platform" in the first sentence — senior signal.
  • Describing the curated path raw → Dataform/Studio → mart → Looker semantic layer → dashboard — required answer.
  • Naming LookML dimensions, measures, explores, persistent derived tables rather than "Looker thing" — senior signal.
  • Pushing back on "BQML replaces Vertex" with "BQML covers warehouse-native; Vertex owns GPU + registry + custom serving" — senior signal.

Worked example — picking the right tool for a request

Detailed explanation. An interviewer hands you four requests and asks for the tool. They span Studio / Looker / Vertex / hybrid territory; the senior signal is naming the line that decides each one.

Question. For each request below, pick the primary tool and justify briefly. Acknowledge hybrid cases.

Input.

# Request
1 "Build a one-off chart for tomorrow's exec meeting showing weekly active users by region"
2 "Build a self-serve dashboard for marketing showing CAC, LTV, and channel mix, updated daily"
3 "Train a deep-learning fraud-detection model on a 1B-row events table with GPU acceleration"
4 "Run a quarterly batch report that re-aggregates revenue by product, with an emailed PDF"

Code.

Tool selection by request
=========================

Request 1 — one-off exec chart
  Primary: BQ Studio (notebook OR SQL editor + chart)
  Why:     ephemeral; not worth wiring Looker for a one-shot

Request 2 — self-serve marketing dashboard
  Primary: Looker (LookML semantic layer + dashboard)
  Hybrid:  Studio / Dataform builds the curated table; Looker serves
  Why:     governed BI for end users — Looker's home turf

Request 3 — GPU fraud model on 1B rows
  Primary: Vertex Workbench (notebook on GPU + Vertex training job)
  Hybrid:  Studio + bigframes for feature engineering; Vertex for training
  Why:     GPU + custom architecture exceeds BQML

Request 4 — quarterly emailed PDF
  Primary: Studio (canvas → scheduled query) + Looker (PDF schedule)
  Why:     transform in Studio; Looker handles the email + PDF
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Request 1 is ephemeral — wiring it into Looker (LookML + dashboard + access) is overkill. A Studio notebook + matplotlib is faster.
  2. Request 2 is governed, recurring, end-user-facing — Looker's home turf. Transformations live in Studio / Dataform; LookML exposes them.
  3. Request 3 needs GPU + custom architecture — BQML cannot do that. Vertex Workbench with Studio for feature prep is the hybrid pattern.
  4. Request 4 is BQ-native transform + scheduled PDF — Studio for the transform, Looker for the scheduled-report delivery.
  5. Name the build-vs-serve line and the warehouse-vs-platform line; the tool falls out.

Output.

# Primary tool Hybrid?
1 Studio none — ephemeral
2 Looker Studio / Dataform builds the table
3 Vertex Studio for features → Vertex for training
4 Studio + Looker Studio transforms, Looker delivers PDF

Rule of thumb. Most requests have a primary tool and a supporting tool. Senior answers name both; junior answers collapse to "use BigQuery."

Worked example — building the curated path from raw → BQ → Looker

Detailed explanation. A core question — "walk me through the data path from raw events to a Looker dashboard, naming every tool and every table." This is production storytelling that separates engineers who've shipped from those who haven't.

Question. Describe the path from raw_events to a Looker dashboard "WAU by region." Name the tools, tables, and transformation owners.

Input.

Stage Layer Owner
Ingest Pub/Sub → BQ streaming insert Platform team
Raw proj.raw.events Platform team
Transformation Dataform Data engineering
Mart proj.mart.weekly_active_users Data engineering
Semantic layer Looker LookML Analytics engineering
Dashboard Looker Marketing / Exec

Code.

End-to-end curated path
=======================

1. Raw events stream into BigQuery
   Source: Pub/Sub topic events.app.activity
   Sink:   proj.raw.events (partitioned by event_date)

2. Dataform (or Studio canvas → exported SQL) materialises a curated table
   Source: proj.raw.events
   Job:    weekly_active_users.sqlx
           WITH last_7d AS (
             SELECT user_id, region, DATE_TRUNC(event_date, WEEK) AS week
             FROM proj.raw.events
             WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
           )
           SELECT week, region, COUNT(DISTINCT user_id) AS wau
           FROM last_7d
           GROUP BY week, region
   Sink:   proj.mart.weekly_active_users
   Schedule: nightly

3. Looker LookML exposes the mart as a view
   File: weekly_active_users.view.lkml
     view: weekly_active_users {
       sql_table_name: proj.mart.weekly_active_users ;;
       dimension: week { sql: ${TABLE}.week ;; }
       dimension: region { sql: ${TABLE}.region ;; }
       measure: wau { type: sum  sql: ${TABLE}.wau ;; }
     }

4. Looker explore stacks views; dashboard renders the chart
   Explore: activity
   Dashboard: marketing → "WAU by region"
   Schedule: daily PDF email to marketing@
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Platform team owns raw ingestion — Pub/Sub, streaming insert, partitioned raw table. Studio is not in this layer.
  2. Data engineering owns the curated table built in Dataform (or canvas → exported SQL). Output is proj.mart.weekly_active_users, partitioned by week.
  3. Analytics engineering writes LookML — dimensions (week, region) and measures (wau) — wrapping the mart as a semantic view.
  4. Marketing consumes the dashboard with nightly PDF emails. IAM trail: platform → raw, DE → raw+mart, AE → LookML, marketing → dashboard.
  5. Studio's role: DE uses Studio (notebook + editor + canvas) while building the Dataform job. Production runs in Dataform; Studio is the authoring surface.

Output (artefact ledger).

Artefact Storage Owner
proj.raw.events BQ raw layer Platform
weekly_active_users.sqlx Dataform Git Data engineering
proj.mart.weekly_active_users BQ mart layer Data engineering
weekly_active_users.view.lkml Looker Git Analytics engineering
"WAU by region" dashboard Looker Marketing

Rule of thumb. Naming tools and tables at every stage is the senior signal. Collapsing the path to "raw goes to dashboard" reveals you've not shipped this end-to-end.

Worked example — when BQML in Studio replaces Vertex Workbench

Detailed explanation. A focused probe — "you have a churn model. When does BQML in Studio fully replace Vertex Workbench and when does it not?" Tests whether you've actually compared BQML against the Vertex stack on the warehouse-native ML axis.

Question. Compare BQML in Studio against Vertex Workbench for a churn model. Identify the boundary conditions pushing each side.

Input.

Dimension BQML in Studio Vertex Workbench
Training data location BigQuery BigQuery or anywhere
Compute BQ slots Notebook VM + optional GPU
Model architectures Linear, logistic, KMeans, random forest, DNN, AutoML Anything (PyTorch / TF / scikit-learn / XGBoost)
Hyperparameter tuning Limited (BQML grid) Vertex Vizier or custom
Model registry INFORMATION_SCHEMA.MODELS Vertex Model Registry
Serving SQL ML.PREDICT Vertex Endpoints
Cost model BQ on-demand or slots Notebook VM + endpoint billing

Code.

# Case A — BQML in Studio is enough
# Standard churn problem, ~10 features, logistic regression, served from SQL
import bigframes.pandas as bpd
from bigframes.ml.linear_model import LogisticRegression

customers = bpd.read_gbq("proj.crm.customers")
X, y = customers[["tenure", "spend", "tickets"]], customers["churned"]
model = LogisticRegression().fit(X, y)
# Trained model lives in BQ. Predictions are SQL ML.PREDICT calls.
Enter fullscreen mode Exit fullscreen mode
# Case B — Vertex Workbench is required
# Same churn problem, but with a 200-feature LSTM over event sequences and GPU training
import torch
from torch import nn

class ChurnLSTM(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=200, hidden_size=hidden, batch_first=True)
        self.fc   = nn.Linear(hidden, 1)
    def forward(self, x):
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.fc(h.squeeze(0)))

# Train on a GPU VM, register in Vertex Model Registry, serve via Vertex Endpoint
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Case A is tabular logistic regression on ~10 features — BQML covers it. LogisticRegression issues CREATE MODEL; predictions are ML.PREDICT from any SQL surface.
  2. Case B has event-sequence features and an LSTM trained on GPU — BQML cannot do this. Vertex Workbench is the answer.
  3. Boundary conditions: (a) architecture catalogue (BQML finite vs Vertex unrestricted); (b) compute (BQ slots vs GPU/TPU); (c) registry (INFORMATION_SCHEMA.MODELS vs Vertex Model Registry).
  4. Decision: BQML in Studio if architecture is in the catalogue AND training fits BQ slots AND serving is from SQL. Otherwise, Vertex.
  5. Hybrid pattern: features in Studio (bigframes), curated feature table in BQ, training in Vertex on GPU, model in Vertex Registry, served from Vertex Endpoint.

Output (decision matrix).

Scenario Pick
Tabular churn, logistic / random forest BQML in Studio
Tabular churn, AutoML BQML in Studio
Sequence LSTM churn Vertex
Large embedding model Vertex
Model registry + multi-version governance Vertex
SQL-served predictions only BQML in Studio
Custom serving (REST endpoint with auth) Vertex

Rule of thumb. BQML in Studio is enough when the architecture is in the catalogue and serving is from SQL. Otherwise, Vertex. Senior answers enumerate both axes; junior answers collapse to "use Vertex for ML."

Senior interview question on tool boundaries

A senior interviewer might ask: "Sketch the GCP data stack for a 200-engineer data org. Where does BQ Studio fit, where does Looker fit, where does Vertex fit, and what tools do you still need alongside? Defend each boundary."

Solution Using a layered architecture with explicit boundaries

GCP data stack — 200-engineer org
=================================

Ingest         Pub/Sub + Dataflow + Datastream → BigQuery
Raw            proj.raw.*  (partitioned, clustered)
Transformation Dataform (prod ELT) + Studio canvas (proto) + Dataproc (Spark)
Mart           proj.mart.* (curated, governed)
Serving (BI)   Looker (governed dashboards) + Studio (ad-hoc charts)
ML training    BQML in Studio (warehouse) + Vertex Workbench (GPU/custom)
ML serving     ML.PREDICT (SQL) + Vertex Endpoints (REST)
Orchestration  Composer/Airflow (cross-service) + BQ Scheduled Queries
Observability  Cloud Monitoring + Looker alerts + Vertex Model Monitoring

Boundaries
  Build (Studio, Dataform) ←→ Serve (Looker, Endpoints)
  Warehouse ML (BQML) ←→ Platform ML (Vertex)
  Single-engine (BQ) ←→ Multi-engine (Dataproc, Vertex)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Layer Primary tool Why it sits here
Ingest Pub/Sub + Dataflow + Datastream Cross-service ingestion is not BQ Studio's domain
Transformation Dataform (prod) + Studio canvas (proto) Build side; Studio is authoring, Dataform is production
Serving (BI) Looker Governed, semantic, end-user-facing; not Studio's role
Serving (ML) Vertex Endpoints Multi-architecture; not BQML's role
ML training BQML in Studio (warehouse) + Vertex (GPU) Warehouse vs platform line
Orchestration Composer Cross-service DAGs; Studio cannot do this

After the layered architecture, every tool has a clear home and the boundaries are testable. A new request maps to a layer and a tool unambiguously.

Output:

Stack layer Studio's role Looker's role Vertex's role
Ingest none none none
Transformation prototyping + ad-hoc none none
Mart reading + ad-hoc none none
Serving (BI) one-off charts governed dashboards none
ML training BQML none non-BQML training
ML serving SQL ML.PREDICT none Endpoints
Orchestration none scheduled PDF Vertex Pipelines

Why this works — concept by concept:

  • Layered architecture — naming layers (ingest / raw / transform / mart / serve / ML / orchestration / observability) gives every tool a clear home. The conversation shifts from "which tool" to "which layer."
  • Build vs serve line — Studio + Dataform on build; Looker + Endpoints on serve. Crossing this line (Looker as transformation tool, Studio as end-user dashboard) is the most common stack mistake.
  • Warehouse-ML vs platform-ML — BQML covers warehouse-native, SQL-served models. Vertex covers everything else. The line is the architecture catalogue and serving mode.
  • Single-engine vs multi-engine — Studio is BQ-only; Composer and Dataproc are multi-engine. For multi-service DAGs and non-BQ compute, Studio steps aside.
  • Cost — over-investing in a wrong layer (Vertex for tabular regression, Looker for one-shot exploration) wastes money and time. The layered map keeps you in your lane.

SQL
Topic — sql
Warehouse SQL design problems

Practice →

ETL
Topic — etl
Stack design + tool boundary problems

Practice →


Cheat sheet — Studio + Gemini recipes

  • Gemini explain-this-query. Highlight SQL → click "Explain" → read the prose → trace one input through the SQL to verify the mechanism. The explanation is a starting point, not a proof.
  • Gemini NL-to-SQL grounding. Annotate every column with a one-line description; tag every primary key in INFORMATION_SCHEMA. Schema grounding is what makes "top 5 customers by revenue" emit a working query.
  • Four-point review. Joins (FK relationship) → Filters (date math) → Aggregations (function + GROUP BY) → Window frames (row count). Apply to every AI query before merge.
  • bigframes quick-start. import bigframes.pandas as bpd; df = bpd.read_gbq("proj.dataset.table"). Lazy until .to_pandas(). Inspect compiled SQL via df.sql.
  • bigframes cache. Any DataFrame touched by 3+ downstream cells gets .cache(). Cuts iteration cost by an order of magnitude.
  • bigframes.ml BQML wrapper. LogisticRegression().fit(X, y) issues CREATE MODEL; .predict() issues ML.PREDICT. The model lives in BQ.
  • Notebook + BQML template. SQL cell to fetch features → train_test_split.fit().score().predict() + ROC plot. Artefacts in BQ; chart in notebook.
  • Data Canvas export. Treat exported SQL as the source of truth. Commit both canvas JSON and SQL to Git; re-open canvas only to edit.
  • Canvas → scheduled query. Build on canvas → preview → export to SQL → "Schedule" → cron + destination + service account. Audit via INFORMATION_SCHEMA.SCHEDULED_QUERIES_RUNS.
  • Canvas vs Dataform. Canvas for prototypes + non-engineer hand-offs. Dataform for prod ELT with assertions, environments, and templated SQL across tables.
  • Studio vs Looker. Studio is build; Looker is serve. Curated tables in BQ marts; LookML wraps them as views; dashboards consume views. Looker is not a transformation tool.
  • Studio vs Vertex. BQML in Studio for warehouse-native + SQL-served models. Vertex for custom architectures, GPU training, registry governance, REST serving.
  • Hybrid Studio + Vertex. Features in Studio (bigframes); curated feature table in BQ; training in Vertex on GPU; model in Vertex Registry; serving via Endpoint.
  • Schedule observability. Wire INFORMATION_SCHEMA.JOBS + SCHEDULED_QUERIES_RUNS into Cloud Monitoring; alert on long jobs, failures, and slot saturation.
  • AI SQL tagging. Add /* Generated by Gemini YYYY-MM-DD — prompt: "..." — reviewer: @x */ to any AI query in production. Six months later you know the prompt and the reviewer.

Frequently asked questions

What is BigQuery Studio and how is it different from the classic BQ console?

BigQuery Studio is the unified workbench Google ships as the default GCP console BQ entry — it collapses the SQL editor, Python notebooks, the visual Data Canvas, and Gemini-assisted code generation into one workspace sharing the same query engine, IAM, and billing scope. The classic BQ console was a SQL editor and a results pane; everything else lived in adjacent products like Vertex Workbench, Tableau Prep, or external IDEs. Studio's value is not "more features" — it is "all the surfaces in one place, with zero re-auth and zero data movement between them." Engineers who treat Studio as "the new SQL editor" leave most of its value on the table; the notebook and canvas surfaces are what genuinely change how a BQ team works in 2026.

Can Gemini in BigQuery write production-grade SQL?

Gemini is excellent at drafting SQL and dramatically less reliable at production-grade SQL — treat it as a paired pilot, never a committer. Inline auto-complete is high-accuracy for column names and standard syntax because it is schema-grounded against your project's INFORMATION_SCHEMA. Natural-language-to-SQL is strong on standard patterns (top-N per group, monthly aggregation, joins on documented FKs) and weak on multi-step pipelines, uncommon window frames, and cost optimisation. The four-point review — joins, filters, aggregations, window frames — catches the common failure modes in under 10 seconds per query. Production teams pair the review with a CI guard (dry-run + regression diff) and a Generated by Gemini comment header for traceability. The legacy brand "Duet AI for BigQuery" was unified under Gemini in BigQuery.

BigFrames vs pandas — what is the difference and when should I use which?

bigframes is a pandas-compatible DataFrame whose every operation lazily compiles to BigQuery SQL and runs in BQ; pandas evaluates eagerly in-memory on the notebook VM. The API is ~90% identical — .groupby, .merge, .assign, .dt, .str all behave like pandas. The execution model is what differs: bigframes builds a SQL plan that runs on warehouse compute, with .to_pandas() as the materialise moment; pandas runs immediately and is bounded by VM RAM. Pick bigframes for warehouse-scale data, for cost efficiency on a team with BQ slot reservations, and for warehouse-native ML via bigframes.ml. Pick pandas at the boundary where you need a niche library that doesn't speak bigframes (sklearn, matplotlib, custom XGBoost) — call .to_pandas() and continue. The 2026 default for new Studio notebooks is bigframes first, pandas at the edges.

Can I run a notebook on a schedule in BigQuery Studio?

Yes — Studio notebooks can be scheduled via the Notebook Executor, which runs the notebook headlessly on a managed runtime at a cron interval. The schedule config covers cron expression, parameters, runtime VM size, and destination for the executed artefact. The caveats: scheduled notebooks are not a replacement for multi-step DAG orchestration; if your workflow has cross-service dependencies (Pub/Sub trigger, upstream Dataform run, downstream Looker refresh), keep Composer / Airflow for the DAG and let the scheduled notebook be one node in it. For pure BQ-resident notebooks (feature refresh, model retraining, scheduled report generation), the in-Studio scheduler is fine.

BigQuery Studio vs Vertex Workbench — which should I use for ML?

Use Studio when the model architecture is in the BQML catalogue (linear, logistic, KMeans, random forest, DNN, AutoML) AND the serving mode is from SQL (ML.PREDICT) — bigframes.ml gives you scikit-learn-style ergonomics over BQML, the model lives in BQ, and there is no separate notebook VM cost beyond the editing session. Use Vertex Workbench when any of these is true: the model needs a custom architecture (sequence LSTM, transformer fine-tune, custom embedding), training needs GPU/TPU acceleration, you need versioning and lineage via the Vertex Model Registry, or the serving mode is a REST endpoint with auth (Vertex Endpoint). The hybrid pattern most production teams settle on: feature engineering in Studio (bigframes + curated feature table in BQ); training in Vertex on GPU; model registered in Vertex; serving via Vertex Endpoint or back into BQ via batch prediction. Studio replaces Workbench for BQ-native ML, not for the rest.

Data Canvas vs Dataform — when do I use each?

Use Data Canvas for visual exploration, prototyping, branch-and-merge transforms with few external dependencies, and pipelines maintained by non-engineers. The visual graph collapses nested CTEs into a diagram; the exported SQL is what runs in BQ when the canvas is scheduled. Use Dataform for production ELT with many tables, version-controlled SQLX in Git, data assertions (uniqueness, not_null, accepted_values), templated SQL across similar tables, and multi-environment deploys (dev / staging / prod). The migration playbook canvas → Dataform is mechanical: export the SQL, split each CTE into a SQLX file, add assertions, hook into tag-based deploy, decommission the original scheduled query. Most production data orgs use both — Canvas for the first few weeks of a new transform, Dataform once it stabilises. The line is lifecycle stage, not "which tool can express the transform."

Practice on PipeCode

  • Drill the SQL practice library → for BigQuery SQL — windows, partition pruning, qualify clauses, the gnarly frames Gemini sometimes gets wrong.
  • Rehearse on ETL practice problems → for warehouse-level pipeline design — Canvas → exported SQL → scheduled query patterns.
  • Sharpen warehouse aggregation with the aggregation drills → — group-by, window, and rollup patterns that dominate BQ work.
  • Layer the data manipulation library → for bigframes-style DataFrame patterns — joins, derives, splits, unions — that map onto the canvas.
  • Treat bigquery studio as part of the modern warehouse stack — practice SQL hard, and Studio + Gemini ergonomics become a fluency multiplier on top of fundamentals.

Lock in Studio + Gemini muscle memory

BQ docs explain the surface. PipeCode drills explain the decision — when Studio replaces a separate notebook stack, when Gemini SQL needs review, when Data Canvas exports beat hand-rolled SQL. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice SQL problems →
Practice ETL problems →

Top comments (0)