Gowtham Potureddi

Posted on Jun 10

BigQuery Console & SQL Workbench: Hands-On Tour for New Data Engineers

#python #sql #dataengineering #interview

bigquery console is the single most-opened tab in a Google-Cloud data team's day — and the one most new hires under-use. The reason is that "the console" in 2026 is no longer a single SQL editor. It is a unified workbench that bundles a multi-tab query editor, the dry-run cost validator, scheduled queries, Studio notebooks, Dataform projects, slot monitoring dashboards, and Gemini SQL assist into one URL — and the only way to feel fast in it is to learn each panel by what job it does.

This guide is the hands-on tour you wished existed the first time you opened console.cloud.google.com/bigquery and had to ship a query, an aggregate, and a daily schedule before lunch. It walks through the six panels of the BigQuery web UI, the eight-step dry-run-to-schedule workflow most data engineers internalise after a few weeks, the orchestration ladder from a one-line scheduled query to a full Cloud Composer DAG, and the cost-and-slot monitoring loop that every senior reviewer expects you to know cold. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the SQL practice library → to rebuild the muscle the workbench rewards, rehearse on aggregation problems → for the GROUP BY queries you'll write all day, and stack the cost-aware reflex with window functions → — the patterns that drive most of the slot bill.

On this page

BigQuery console — what it is and why it matters in 2026
Anatomy of the BigQuery console UI
Query editor workflow — dry run → run → save → schedule
Scheduled queries, Workflows, and Cloud Scheduler
Cost & slot monitoring with INFORMATION_SCHEMA
Cheat sheet — BigQuery console recipes
Frequently asked questions
Practice on PipeCode

1. BigQuery console — what it is and why it matters in 2026

The BigQuery web UI is a fully serverless SQL workbench — no install, no driver, no warm-up; just open a URL and write SQL

The one-sentence invariant: the BigQuery console lives at console.cloud.google.com/bigquery, runs entirely in the browser, and is the fastest path from "I need to look at this dataset" to "I have a result in the clipboard". Once you internalise that the console is a workbench, not just an editor, the question "which tool should I use for this BigQuery task?" almost always answers itself with "the console first, then graduate to the CLI or the API if scripting is needed."

The three other ways to talk to BigQuery, and when each wins.

bq CLI. Best for one-off scripted exports, CI jobs, and "I want to grep my project's datasets from a terminal." Same auth as the console (gcloud), but no syntax assist, no plan view, no Gemini.
The REST + gRPC API (and the Python / Go / Java client libraries). Best for embedding BigQuery into an application: an ETL service, a dashboarding backend, a streaming ingest job. Code-first; no UI affordances.
BigQuery DataFrames (bigframes). A 2024+ Python client that compiles pandas-style DataFrame calls into BigQuery SQL. Best for notebook-style exploration where the user prefers Python over SQL — but the console is still where you debug the generated SQL.

The five things that arrived in the console between 2024 and 2026.

Gemini SQL assist. Inline autocomplete for joins, CTEs, and CASE branches; natural-language "explain this query" and "write the SQL that joins these two tables" prompts. Toggle per-project.
BigQuery Studio. A unified workbench tab that bundles SQL queries, Python / SQL notebooks, data canvases, Dataform projects, and saved queries under a single left-pane navigator. The classic "BigQuery" tab still exists; Studio is the recommended home base in 2026.
Unified data canvas. A drag-and-drop graphical query builder that compiles to SQL — useful for first-week onboarding and for sharing a query shape with a non-SQL stakeholder.
BigQuery DataFrames notebook integration. Notebooks that mix %%sql cells with bigframes Python cells, share auth with the console, and ship results to Looker Studio with one button.
Per-query labels in the validator. The dry-run card now surfaces the query_label you set so cost-attribution is visible before you click Run.

Why the console is still the fastest tool for most jobs.

Exploration. Browsing the Explorer tree, previewing a table, profiling a column — point-and-click in the console, line-of-code in the CLI.
Dry-run cost estimates. The validator card on the right of the editor shows bytes processed and dollar estimate before you click Run. The CLI shows it too (--dry_run), but the console renders the partition-pruning preview as well.
Ad-hoc SQL. Cmd-Enter runs the highlighted block; results stream into the bottom panel; cache hit re-runs are free. Round-trip in seconds.
Visualisation handoff. "Open in Looker Studio" and "Save results to Sheet" are one-click buttons on the results panel — the CLI requires bq extract + a download step.

Console vs Snowsight (Snowflake) vs Athena UI — the orientation map for migrants.

BigQuery console	Snowsight (Snowflake)	Athena query editor (AWS)
Left Explorer: project → dataset → table	Left Databases: db → schema → table	Left Tables (per Glue database)
Centre multi-tab query editor	Worksheets (tabbed)	Single editor + Saved queries pane
Dry-run validator card	Estimate from query profile post-run	No native dry run; charge by data scanned
Job history rail	Query history table	Recent queries tab
Scheduled queries (DTS)	Tasks (multi-step DAGs)	EventBridge + Athena
BigQuery Studio (notebooks + canvas + Dataform)	Snowflake notebooks + Streamlit	SageMaker Studio integration
Cost model — bytes processed ($5/TB on-demand)	Credits-per-warehouse-second	Bytes scanned ($5/TB)

What this guide covers, in one breath.

The anatomy of the six visible panels, the dry-run-to-schedule workflow most data engineers internalise, the orchestration ladder (scheduled query → Workflows → Composer → Dataform → Studio notebook), and the INFORMATION_SCHEMA-driven cost / slot monitoring loop every senior BigQuery user knows cold.

Worked example — opening the console for the first time

Detailed explanation. A new hire on a Google Cloud data team typically has GCP IAM permissions assigned to their email but has never opened the BigQuery UI. The first-week win is to load the console, switch to the correct project, find the team's main dataset, and run a sub-second query — all in under five minutes.

Question. Walk through the exact steps to open the BigQuery console, switch project, and run a one-row sanity-check query that confirms BigQuery sees the user as authenticated. List the IAM permissions required and the one keyboard shortcut that saves the most time.

Input.

step	what the user sees
1	https://console.cloud.google.com/bigquery
2	top bar — project selector chip
3	left pane — Explorer with starred projects
4	centre — empty Untitled query tab
5	bottom — empty results panel

Code.

-- Sanity check #1 — does BigQuery see the caller as authenticated?
SELECT
    SESSION_USER()              AS who_am_i,
    @@project_id                AS current_project,
    CURRENT_TIMESTAMP()         AS server_time;

Step-by-step explanation.

Open https://console.cloud.google.com/bigquery — the top-bar project selector lists every project where the caller has bigquery.user (or higher). Click the selector, pick the team project.
In the left Explorer pane, the project appears as a folder; expanding it lists every dataset the caller can bigquery.dataViewer over.
Click the + Compose new query button (or just hit Cmd-Enter on an empty editor) — a new Untitled tab opens.
Paste the sanity SQL above and hit Cmd-Enter to run. The result appears in the bottom panel: one row with the caller's email, the project id, and the server timestamp.
The minimum IAM roles needed for this flow are roles/bigquery.user (lets you open the console and submit a job) plus roles/bigquery.dataViewer on the datasets you want to query. To create tables, you also need roles/bigquery.dataEditor.

Output.

who_am_i	current_project	server_time
ada@example.com	pipecode-demo	2026-06-06 08:14:22 UTC

Rule of thumb. Memorise Cmd-Enter (run highlighted) and Cmd-/ (toggle comment). Those two shortcuts alone shave thirty seconds off every query-edit-rerun loop, which compounds to an hour per day on busy investigation sessions.

Worked example — when to reach for the console vs the CLI vs the API

Detailed explanation. New engineers default to whichever tool they learned first. In practice each tool wins for a different job, and a senior reviewer can usually spot a junior by noticing they wrote a 50-line shell script for a job the console solves in 10 seconds — or wrote a console query for a job that should have been parameterised in code.

Question. Given three tasks — (a) "look at the last 100 rows of the events table," (b) "run a 5-step ETL nightly," (c) "export the daily revenue report to a CSV in GCS" — pick the right tool for each and justify in one sentence.

Input.

task	console	bq CLI	API / client lib
Browse 100 rows of `events`	✓	--	--
Nightly 5-step ETL	--	possible	best (Composer / Workflows)
Daily CSV → GCS	--	best	possible

Code.

# Task (a) — console only. Open the table in Explorer, click Preview.
# (No SQL needed; the console renders rows from storage metadata for free.)

# Task (b) — Cloud Composer DAG, not bq CLI loops.
# Example Airflow operator (Python):
#   BigQueryInsertJobOperator(
#       task_id='daily_aggregate',
#       configuration={'query': {'query': 'CALL pipecode_demo.daily_aggregate()',
#                                 'useLegacySql': False}})

# Task (c) — bq CLI export, wired into a Cloud Scheduler cron.
bq query --nouse_legacy_sql --destination_table=pipecode-demo:exports.daily_revenue \
    "SELECT region, SUM(amount) AS total FROM pipecode-demo.mart.fact_orders \
     WHERE DATE(order_ts) = CURRENT_DATE() - 1 GROUP BY region"
bq extract --destination_format=CSV \
    pipecode-demo:exports.daily_revenue \
    gs://pipecode-demo-exports/daily_revenue_$(date -u +%F).csv

Step-by-step explanation.

Task (a) is exploration — the console wins because the Explorer's Preview tab uses the table's storage metadata to render rows without spending a slot or scanning a byte.
Task (b) is multi-step orchestration — Cloud Composer (managed Airflow) is the standard answer because it has sensors, branching, retries, and visible DAGs.
Task (c) is a scheduled export — the CLI plus Cloud Scheduler is the cheapest answer because bq extract is a managed export operation that writes directly to GCS without a Python orchestrator.

Output.

task	tool	reason
Browse 100 rows	console (Preview tab)	metadata-only render, free
Nightly 5-step ETL	Cloud Composer	full DAG semantics
Daily CSV → GCS	`bq` CLI + Cloud Scheduler	one-shot managed export

Rule of thumb. Default to the console for any task you can describe in one sentence. Reach for the CLI for one-off scripts. Reach for the API or Composer for anything that needs to run without you watching it.

SQL interview question on choosing the right BigQuery surface

A common opener from a senior interviewer is: "You just landed on a team running 200 BigQuery queries per day across exploration, dashboards, and ETL. Walk me through which queries belong in the console, which in scheduled queries, and which in Composer — and which signals tell you it is time to graduate up the ladder." The answer is a maturity model, not a tool list.

Solution Using a four-question decision tree

-- 1) Is the query stable? (i.e. has it run the same way three days in a row?)
--    No  → console only.
--    Yes → graduate to scheduled query.
-- 2) Does the query have upstream dependencies? (table B must finish before A)
--    No  → scheduled query is enough.
--    Yes → graduate to Workflows or Composer.
-- 3) Is the orchestration logic > 5 steps OR needs sensors / backfill?
--    No  → Workflows.
--    Yes → Composer (Airflow) DAG.
-- 4) Does the SQL live in version control with tests?
--    No  → ship it raw under any of the above.
--    Yes → migrate to Dataform — versioned SQLX + ref() graph.

-- Worked decision applied to three real queries:
SELECT
    'top_10_customers'        AS query_name,
    'console'                 AS chosen_surface,
    'ad-hoc exploration'      AS why
UNION ALL SELECT
    'daily_revenue_by_region',
    'scheduled query',
    'stable, 1-step, no upstream dep'
UNION ALL SELECT
    'mart.fact_orders nightly rebuild',
    'Composer (Airflow)',
    'multi-step, backfill, depends on stg tables';

Step-by-step trace.

Question	top_10_customers	daily_revenue_by_region	nightly mart rebuild
1. Stable?	no	yes	yes
2. Upstream deps?	no	no	yes
3. > 5 steps / sensors?	no	no	yes
4. Versioned + tested?	no	no	could be Dataform
Surface	console	scheduled query	Composer

The trace highlights the inflection points: every "yes" pushes the query one rung up the ladder. Skipping rungs leads to over-engineered nightly cron in Composer for queries a scheduled query could run, or under-engineered ad-hoc CSV exports that everyone forgets exist until they break.

Output:

query	surface	weekly cost of being on the wrong rung
top_10_customers	console	$0 (right rung)
daily_revenue_by_region	scheduled query	~$15 (would-be Composer instance)
nightly mart rebuild	Composer	unbounded (silent freshness bugs if downgraded)

Why this works — concept by concept:

Maturity model, not tool list — every BigQuery surface answers a different stability question. The decision tree forces you to ask "is this code stable enough to leave alone?" before picking a surface.
Cost of over-engineering — Composer ships a managed Airflow environment that costs ~$300/month at the lowest tier. Spending it on a one-line scheduled query is a real budget leak.
Cost of under-engineering — a 5-step ETL run as five separate scheduled queries has no upstream-dependency contract; a downstream step can run before its upstream finishes and silently consume stale data.
Dataform as the SQL-with-tests surface — Dataform's ref() graph and assertions give every SQL its place in a DAG with checks. Teams that move to Dataform stop writing per-table cron and start writing per-model declarations.
Cost — choosing surface is O(1) per query but has linear ongoing cost. The decision tree pays for itself across 200 queries.

SQL
Topic — sql
SQL practice library (warm up the BigQuery muscle)

Practice →

2. Anatomy of the BigQuery console UI

Six panels, one workbench — learn the layout once, navigate forever

The mental model in one line: the BigQuery console is six fixed panels — top bar, left Explorer, centre query editor, right validator, bottom results, far-right job history — and BigQuery Studio bundles notebooks and canvases into the same workbench under a new top-level switcher. Once you can name each panel and the job it does, "where do I click for X?" stops being a question.

The six panels, by job.

Top bar. Project selector, global search, region indicator, the IAM-badge avatar (click it to swap identities or impersonate a service account). Also: the Studio / classic-BigQuery switcher in the top-left.
Left pane — Explorer. A tree of projects → datasets → tables / views / materialized views / routines / models. Star a project to keep it pinned. Right-click a dataset for "Open in new tab," "Create table," or "Share."
Centre — multi-tab query editor. Each query lives in its own tab. Cmd-Enter runs; Cmd-/ comments; Cmd-S saves. Gemini autocomplete underlines suggested joins and CASE branches as you type.
Right rail — query validator. Dry-run estimate (bytes processed → dollar cost), region check, billing-project check. The "warning" banner here is your last line of defence against a $100 typo.
Bottom — results panel. Three toggles: Table (default), JSON, Chart. The chart toggle compiles to a Looker Studio mini-chart for free.
Job history (right-most strip). Every job you submitted: bytes processed, slot ms, cache hit indicator, query plan link. Useful when you want to grab a job_id for INFORMATION_SCHEMA debugging.

BigQuery Studio — the unified workbench.

What it is. A new top-level navigation that bundles SQL queries, Python / SQL notebooks, data canvases, Dataform repos, saved queries, and ML models under one left pane.
Why it matters. Notebooks share auth with the console, can be scheduled like a query, and embed bigframes Python alongside %%sql cells. The canvas is a graphical query builder that compiles to SQL.
When you'd switch back to classic. Some IAM tooling and old bq cp workflows are still smoother in the classic view; the toggle is one click.

Job-history view in detail.

bytes_processed. What the query actually read. Compare with the dry-run estimate; large divergence means partition pruning didn't fire.
slot_ms. Slot-milliseconds consumed. The fundamental unit of BigQuery compute. Divide by 1000 to get slot-seconds.
cache_hit. Boolean. TRUE = the result came from the 24-hour cache; the query was free.
query_plan. A clickable link to the stage graph: stages, shuffle, slot ms per stage. The BigQuery EXPLAIN equivalent.
execution_graph. A visual DAG of stages — the "this stage took 80% of the wall clock" view that surfaces hot spots.

Keyboard shortcuts that matter.

Cmd-Enter / Ctrl-Enter — Run highlighted query (or the whole tab if nothing is highlighted). Most-used shortcut.
Cmd-/ / Ctrl-/ — Toggle line comment. Indispensable when sectioning a long query.
Cmd-S / Ctrl-S — Save query. Prompts for a saved-query name on first save.
Cmd-K / Ctrl-K — Format query. Re-indents and aligns CTE lists.
Cmd-D / Ctrl-D — Duplicate selection or line. Useful for fast LIMIT 10 → LIMIT 100 edits.
Esc — Close the autocomplete / Gemini suggestion. Lets you keep typing without accepting.

Worked example — running a query from the Explorer pane

Detailed explanation. The fastest path from "I need to see this table" to "I have rows on screen" is not typing SQL — it is clicking the table in the Explorer and hitting Preview. Most new engineers don't discover Preview for weeks; senior engineers reach for it before SELECT.

Question. Given a events_daily table in pipecode_demo.dataset_marketing, show the three ways to look at its first 10 rows. Compare cost, latency, and what each surface tells you.

Input.

approach	UI clicks	cost (bytes scanned)	latency
Preview tab	Explorer → table → Preview	0	metadata only
SELECT * LIMIT 10	editor → paste → Run	partition-or-table-sized	seconds
Schema tab	Explorer → table → Schema	0	metadata only

Code.

-- Path 2 — SELECT with LIMIT.
-- Beware: LIMIT does NOT reduce bytes scanned in BigQuery. The query
-- scans the entire table (or the entire partition set the WHERE picks)
-- and *then* truncates to 10 rows. LIMIT is a render hint, not a cost hint.
SELECT *
FROM `pipecode-demo.dataset_marketing.events_daily`
LIMIT 10;

-- Better — filter on a partition column to limit bytes scanned.
SELECT *
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
LIMIT 10;

Step-by-step explanation.

The Preview tab reads no slot ms and scans no bytes — it surfaces a few rows from the storage layer's row blocks directly. Free, fast, no SQL.
SELECT * LIMIT 10 is the trap junior engineers fall into: BigQuery scans the entire table (or partition set), pays the bytes-processed cost, then truncates the result to 10 rows. The bill is for the scan, not the LIMIT.
Adding WHERE event_date = CURRENT_DATE() - 1 lets the optimiser prune partitions — bytes scanned drops from "whole table" to "one partition," and the dry-run estimate plummets.
The Schema tab shows column types, descriptions, and partition / cluster keys — invaluable for writing the WHERE in step 3 with a known partition column.

Output.

approach	cost	what you see
Preview	$0	first ~50 rows verbatim
`SELECT * LIMIT 10`	full table scan	10 rows
`WHERE partition_col = ... LIMIT 10`	one partition scan	10 rows (cheap)

Rule of thumb. Reach for the Preview tab first. Reach for SELECT only when you need a specific filter or computed column. Never write SELECT * against a large table without a partition filter — the LIMIT does not save you.

Worked example — reading the query validator card

Detailed explanation. The right-rail validator does a server-side dry run as you type. It tells you exactly how many bytes the query would scan and how many dollars it would cost (on the on-demand pricing model) before you click Run. Most expensive accidents in BigQuery are caused by ignoring this card.

Question. Given a SELECT * against a 10 TB table, what does the validator show, and how do you change the query so the validator turns green?

Input — validator outputs for three query shapes.

query	bytes processed	$ estimate	validator colour
`SELECT * FROM big_table`	10 TB	$50.00	red
`SELECT col_a, col_b FROM big_table`	1.2 TB	$6.00	yellow
`SELECT col_a FROM big_table WHERE partition_col = '2026-06-05'`	12 GB	$0.06	green

Code.

-- Trigger the dry run from the CLI to inspect what the validator computes:
bq query --dry_run --use_legacy_sql=false \
"SELECT col_a FROM \`pipecode-demo.dataset.big_table\` WHERE event_date = '2026-06-05'"
-- Output snippet:
--   Query successfully validated. Assuming the tables are not modified,
--   running this query will process 12,884,901,888 bytes of data.

Step-by-step explanation.

The validator reports bytes processed based on the referenced columns (column-store pruning) and the referenced partitions (partition pruning), pre-execution.
SELECT * references every column → every column block is read → 10 TB.
Projecting only the two columns you need brings the bytes scanned down to the column-block size of those columns.
Adding WHERE partition_col = '...' lets BigQuery skip every partition file that doesn't match — for a daily-partitioned table with 1000 days of history, that's a 1000× reduction.
Combine column projection + partition filter and the same query becomes effectively free.

Output.

optimisation	bytes scanned	cost
baseline `SELECT *`	10 TB	$50.00
project columns	1.2 TB	$6.00
+ partition filter	12 GB	$0.06

Rule of thumb. Make "look at the validator card" the last action before every Run. If it's red and you didn't intend to pay $50, fix the query — column projection and partition filters are the two universal moves.

Worked example — inspecting the query plan after a run

Detailed explanation. After a query finishes, the bottom panel offers a "Query plan" tab — a per-stage breakdown of the execution graph. Reading the plan is the 80/20 skill that turns a junior writer into an engineer who can debug "why is this query slow?" without guessing.

Question. A query takes 90 seconds and processes 200 GB. Where in the plan do you look first, and what numbers do you compare?

Input — abbreviated plan.

stage	rows in	rows out	slot ms	shuffle bytes	wait ratio
S00 Read events	2.4 B	2.4 B	32 000	0	0.05
S01 Filter event_date	2.4 B	18 M	1 200	0	0.02
S02 Join customers	18 M	18 M	1 800	4.8 GB	0.08
S03 Aggregate by region	18 M	6	120	96 KB	0.01

Code.

-- Pull the same numbers from INFORMATION_SCHEMA so you can script the audit:
SELECT
    stage.name                     AS stage_name,
    stage.records_read             AS rows_in,
    stage.records_written          AS rows_out,
    stage.slot_ms                  AS slot_ms,
    stage.shuffle_output_bytes     AS shuffle_bytes,
    stage.wait_ratio_avg           AS wait_ratio
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
     UNNEST(job_stages) AS stage
WHERE job_id = '@job_id'
ORDER BY stage.id;

Step-by-step explanation.

Sort by slot_ms — the stage with the largest slot consumption is the cost driver. In the example, Stage S00 (Read events) dominates at 32 000 slot ms — that means the table read is the bottleneck.
Sort by shuffle_bytes — a high-shuffle stage is a JOIN or aggregation that re-distributes data. Stage S02 here is the join, shuffling 4.8 GB.
Compare rows_in and rows_out — when rows_out ≪ rows_in early, the optimiser is filtering well; when it stays equal, you're scanning unnecessary rows.
Look at wait_ratio_avg — high values (>0.3) indicate the stage spent significant time waiting for slots, signalling slot pressure on the reservation.
Action plan: if Read dominates → partition / cluster the source. If Shuffle dominates → reduce join cardinality (filter earlier, broadcast smaller side). If Wait dominates → add slots or move to a less-busy reservation.

Output.

symptom in plan	fix
Read stage has the highest slot ms	partition + cluster the source table
Shuffle stage is multi-GB	filter / aggregate before the join
Wait ratio > 0.3 on multiple stages	bigger reservation or off-peak schedule
rows_in ≈ rows_out across stages	missing filter; project columns + add WHERE

Rule of thumb. Always read the plan after any query that takes longer than 10 seconds or processes more than 50 GB. Slot ms tells you cost; shuffle bytes tells you join shape; wait ratio tells you whether your reservation is starved.

SQL interview question on UI literacy

A senior interviewer often opens with: "Walk me through how you'd diagnose a slow BigQuery query using only the console — no CLI, no API." The answer probes whether the candidate can name the right panel for each step.

Solution Using the console panels in order

-- 1) Open the job in the Job history rail (right strip).
-- 2) Click the row → "View details" → reads the same fields shown below.
-- 3) Copy job_id and paste into INFORMATION_SCHEMA for a programmatic audit.

SELECT
    job_id,
    user_email,
    total_bytes_processed / POW(10, 9)      AS gb_processed,
    total_slot_ms / 1000                     AS slot_sec,
    cache_hit,
    end_time - start_time                    AS wall_clock
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = '@job_id';

Step-by-step trace.

panel	what it told us	conclusion
Job history rail	"job_xyz · 200 GB · 90 s · slot ms 35 200"	not cached, real work
Validator (re-dry-run)	200 GB → matches → no surprise scan	filter is doing its job
Query plan tab	S00 Read = 32 000 slot ms	source table read is the bottleneck
Schema tab	source table — partitioned by `event_ts` HOUR	partition column is hourly, query filtered by date
Fix	rewrite WHERE to filter by hour	bytes drop ~24×

The trace teaches the diagnostic discipline: confirm the dry run, look at the plan, check the source schema, and only then change the query. Most "the query is slow" tickets are solved by aligning the WHERE clause with the partition column at the right granularity.

Output:

metric	before	after
bytes processed	200 GB	~9 GB
slot ms	35 200	1 500
wall clock	90 s	6 s

Why this works — concept by concept:

The job history rail is the entry point — every console-only diagnosis starts here; the rail gives you the job_id, the headline numbers, and the link to the plan in one click.
The validator card is your sanity check — re-running a dry run after you've understood the plan confirms whether your proposed fix would actually reduce bytes.
The query plan tab surfaces stage-level cost — sorting stages by slot ms localises the bottleneck to a single operator, which is what your fix must target.
The schema tab anchors fixes to physical layout — partition and cluster keys live on the table, not in your SQL. Aligning the WHERE to the partition key is usually the highest-leverage edit.
Cost — diagnosis is O(stages) human time; the fix is O(1) edit but saves linear cost for every subsequent run. Pays for itself in one rerun.

SQL
Topic — optimization
Query optimisation problems (SQL)

Practice →

3. Query editor workflow — dry run → run → save → schedule

Eight steps, two rows, one repeatable habit — the editor workflow most data engineers internalise

The mental model in one line: every BigQuery query passes through eight predictable steps — write, dry-run, cache-toggle, run, inspect plan, save, schedule, share — and the engineers who feel fastest are the ones who internalise the steps in that order. Once you can recite the eight steps, the editor stops being a UI to click through and becomes a pipeline you walk down on autopilot.

The eight steps in detail.

Step 1 — Write or paste SQL. Gemini autocomplete suggests joins, CTEs, and CASE branches as you type. Accept with Tab; reject with Esc. The "Explain this query" prompt produces a plain-English summary — useful when reviewing someone else's SQL.
Step 2 — Validator dry run. Runs automatically. Reports bytes processed → cost in dollars (US $5 per TB on-demand). Surfaces region and billing-project mismatches before you can click Run.
Step 3 — Cache toggle. "Use cached results" is on by default. BigQuery caches every query result for 24 hours, keyed by the canonicalised SQL. Cache hit = free read. Toggle off when you need fresh data (e.g. after a backfill).
Step 4 — Run. Cmd-Enter submits the job. Results stream into the bottom panel as the job completes. The bottom panel toggles between Table, JSON, and Chart.
Step 5 — Inspect query plan. After the run, the bottom panel exposes a "Query plan" tab. Stages, shuffle bytes, slot ms — the BigQuery EXPLAIN equivalent.
Step 6 — Save. Click "Save" → "Save as" → choose Saved Query (private / project-wide), View, Materialized View, or Table. Saved queries live in the Explorer under "Saved queries"; views and MVs live under the dataset.
Step 7 — Schedule. From the saved query menu, "Schedule" opens the Data Transfer Service flow: pick a cron expression or use the simplified UI (every hour, every day at 8 AM, etc.). The schedule runs as a service account.
Step 8 — Share. "Share" → copy a permalink to the query, "Open in Looker Studio" for a chart, "Save results" → CSV / JSON / Avro / Sheets / GCS.

What the validator actually tells you, in plain English.

Bytes processed. The total compressed bytes BigQuery would scan. Multiply by ~$5 / TB to get the on-demand dollar estimate. The validator does this for you.
Region. "This query references a dataset in us-east1 but your default region is us." That mismatch fails the job at run time; better to catch it pre-flight.
Billing project. Which project will be billed for the slot ms and bytes processed. Defaults to the project selector but you can override per-query.
Partition-pruning preview. "The optimiser will scan 3 of 365 partitions." Confirms the WHERE filter is doing its job.

Cache semantics in one paragraph.

The cache is a global, per-user, deterministic key over the canonicalised SQL string and the destination tables' last-modified timestamps. Two identical queries from the same user inside 24 hours = cache hit = free. Adding CURRENT_TIMESTAMP() to the SELECT defeats the cache; using DML on a referenced table invalidates the cache. The cache is not shared across users — your colleague's free hit isn't yours.

The save-and-schedule contract.

Saved query. Lives in your project under your IAM. Persists across browser sessions. Can be parameterised with @param placeholders.
View. Read-only virtual table. The SELECT is recomputed every time the view is queried — no storage, no scheduling, no cost when idle.
Materialized view. Pre-computed result, auto-refreshed when underlying tables change. Storage cost + refresh cost; query cost drops dramatically.
Scheduled query. The same SELECT, run on a schedule, with a destination table. Owned by the user who set it up (or by a service account, recommended).

The share menu, in one breath.

Copy permalink → paste in a PR or Slack. Open in Looker Studio → a chart with one click. Save results → CSV / JSON / Avro to local download, Sheets to Drive, or GCS for downstream pipelines. Export to a destination table → for reuse without re-running.

Worked example — first run of a new query

Detailed explanation. A new engineer is asked to "count events per region for yesterday." The naive query is one line. The careful workflow is to (1) write it, (2) read the validator card, (3) trim if expensive, (4) run, (5) glance at the plan to confirm partition pruning fired.

Question. Walk through writing a "yesterday's events by region" query, reading the validator, fixing the cost, and running it.

Input.

step	action	observation
1	naive `SELECT region, COUNT(*) FROM events GROUP BY region`	validator: 8 TB, $40 — red
2	add `WHERE event_date = CURRENT_DATE() - 1`	validator: 22 GB, $0.11 — green
3	run	wall clock 4 s; cache miss; plan clean

Code.

-- v0 — naive, will be flagged red by validator.
SELECT region, COUNT(*) AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
GROUP BY region;

-- v1 — partition-aware, validator-approved.
SELECT region, COUNT(*) AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
GROUP BY region
ORDER BY event_count DESC;

Step-by-step explanation.

Type the naive query. The validator immediately shows 8 TB / $40 — too expensive for an ad-hoc count.
Add WHERE event_date = CURRENT_DATE() - 1. The validator updates to 22 GB / $0.11. The partition filter is doing its job.
Click Run. Wall clock is ~4 seconds.
Glance at the Query plan tab: Stage 0 (Read) reports 22 GB read — matches the dry run. No surprises.
Save the query as a saved query named "events_by_region_yesterday." Now it's promotable to a scheduled query if it stabilises.

Output (one example day).

region	event_count
US	4 218 503
EU	2 974 110
APAC	1 088 442

Rule of thumb. Never click Run when the validator card is red unless you intend to pay that bill. The validator is the cheapest review you'll get.

Worked example — re-running with the cache for free reads

Detailed explanation. A dashboard polls the same query every minute during a stand-up. The first run costs $0.11; the next 23 hours of identical runs are free because the result is cached. Toggling "Use cached results" off forces a fresh scan — useful when an upstream load has just landed.

Question. Show the SQL that would always defeat the cache, the one that uses the cache, and how to tell from the job-history rail.

Input — three queries in three minutes.

run	SQL	cache hit?	cost
1	`SELECT region, COUNT(*) FROM ... WHERE event_date = '2026-06-05' GROUP BY region`	no (first)	$0.11
2	identical SQL	yes	$0
3	append `SELECT CURRENT_TIMESTAMP()`	no (non-determinism)	$0.11

Code.

-- Cache-friendly: the SQL canonicalises to the same string every run.
SELECT region, COUNT(*) AS c
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = '2026-06-05'
GROUP BY region;

-- Cache-defeating: CURRENT_TIMESTAMP() makes every result row "new."
SELECT region, COUNT(*) AS c, CURRENT_TIMESTAMP() AS run_at
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = '2026-06-05'
GROUP BY region;

Step-by-step explanation.

The first run canonicalises the SQL and stores the result in the per-user cache with a 24-hour TTL.
Run 2 — identical canonicalised SQL → cache hit → no slot ms billed, no bytes processed. The job-history rail shows cache_hit = TRUE.
Run 3 — CURRENT_TIMESTAMP() is non-deterministic; BigQuery refuses to cache. Cache miss → full scan again.
The job-history rail surfaces the cache_hit boolean on every row; INFORMATION_SCHEMA.JOBS surfaces it too as cache_hit.
Cache-friendly authoring: use literal date / timestamp values instead of CURRENT_DATE() when possible, or parameterise the date and pass @today from the orchestrator.

Output (job history view).

run	cache_hit	bytes_processed	$ cost
1	FALSE	22 GB	0.11
2	TRUE	0	0
3	FALSE	22 GB	0.11

Rule of thumb. Cache-friendly SQL is free SQL. If your query needs the latest run timestamp, surface it as a literal parameter @as_of_ts set by the orchestrator, not as CURRENT_TIMESTAMP() in the SELECT.

Worked example — promoting a saved query to a scheduled query

Detailed explanation. Once a query has run identically three days in a row, it is a candidate for scheduling. The console's "Schedule" button opens the Data Transfer Service flow: pick a cron, pick a destination table, pick a service account, save. The query then runs without human attention.

Question. Promote the "events_by_region_yesterday" saved query into a scheduled query that runs every day at 08:00 UTC and writes to pipecode-demo.mart.events_by_region_daily.

Input — schedule form fields.

field	value
Schedule name	events_by_region_yesterday_daily
Cron	`0 8 * * *` (every day 08:00 UTC)
Destination table	`pipecode-demo.mart.events_by_region_daily`
Write preference	`WRITE_APPEND` (or `WRITE_TRUNCATE` if rebuild)
Service account	`bq-scheduler@pipecode-demo.iam.gserviceaccount.com`
Notifications	Cloud Logging filter + alert policy on `severity=ERROR`

Code.

-- Scheduled-query SQL: parameterise the date so the cache works
-- on backfills and so the destination is partition-aware.
SELECT
    @run_date              AS event_date,
    region,
    COUNT(*)               AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = @run_date
GROUP BY region;

Step-by-step explanation.

Click "Save" → "Saved query" on the working query. Then open the saved query and click "Schedule."
The cron field accepts standard cron (or the simplified UI). 0 8 * * * runs every day at 08:00 UTC.
The destination table can be partitioned by event_date — the scheduled query then appends one partition per run.
The service account is the one that runs the query. It needs bigquery.dataEditor on the destination dataset and bigquery.jobUser on the billing project.
The @run_date parameter is set by BigQuery DTS to the logical execution date — useful for clean backfills and idempotent appends.
Failures route to Cloud Logging; an alert policy turns those into Slack or email pages.

Output.

schedule_id	next_run	last_run_status	last_destination_row_count
sch_abc	2026-06-07 08:00 UTC	SUCCESS	18

Rule of thumb. Stable queries deserve a schedule. Unstable queries (different SQL every run) deserve a notebook or an orchestrator. Don't schedule a query you're still editing; the silent failures will follow you for months.

SQL interview question on workflow discipline

The interviewer's probe: "Walk me through how you author a new BigQuery query you've never run before. Where do you click, what do you read, when do you actually press Run?" The answer is the eight-step workflow you can describe in one breath.

Solution Using the eight-step workflow

-- The query in question — a per-day per-region revenue rollup.
SELECT
    event_date,
    region,
    SUM(amount) AS revenue
FROM `pipecode-demo.mart.fact_orders`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
                     AND CURRENT_DATE() - 1
GROUP BY event_date, region
ORDER BY event_date DESC, revenue DESC;

Step-by-step trace.

step	action	observation / decision
1 Write	typed in the editor	Gemini autocompletes the GROUP BY
2 Dry run	validator shows 14 GB / $0.07	green — proceed
3 Cache	"Use cached results" stays ON	second run will be free
4 Run	Cmd-Enter	wall clock 3 s, returns 14 rows
5 Plan	check Query plan tab	Read 14 GB; no shuffle hot spot
6 Save	Save → Saved query "weekly_revenue"	private to me until peer-reviewed
7 Schedule	not yet	needs to run 3 days clean first
8 Share	copy permalink → paste in PR	colleague can re-run unchanged

The discipline is to read the validator before running, glance at the plan after running, and save before sharing. Skipping any step works — until the day it doesn't, and you spend an afternoon retracing your steps.

Output:

event_date	region	revenue
2026-06-05	US	88 412
2026-06-05	EU	62 109
2026-06-05	APAC	22 803
2026-06-04	US	91 220
...	...	...

Why this works — concept by concept:

The validator is your sanity check — every query gets a free dry run; ignoring the red card is what causes $100 typos.
The cache makes ad-hoc free — cache-friendly SQL is genuinely free on the second through the 24-hour-th run; structure your SELECT to keep it that way.
The plan is your debugger — slot ms by stage tells you where the time went; reading the plan is the difference between fixing the query and rerunning it.
Save before share — a saved query has a stable URL and a stable owner; a "paste this into your editor" workflow has neither.
Schedule only when stable — three identical days in a row is a reasonable promotion criterion; before that, leave the query in the saved-queries bucket.
Cost — O(1) human time per step; the saved hours per week scale linearly with team size. The eight-step habit pays for itself within a week.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →

4. Scheduled queries, Workflows, and Cloud Scheduler

Five orchestration tools, one principle — match the orchestrator to the dependency graph

The mental model in one line: scheduled queries are for one-step crons, Workflows are for branching multi-step jobs, Cloud Composer is for full DAGs with sensors, Dataform is for versioned SQLX with tests, and BigQuery Studio notebooks are for ad-hoc Python plus SQL that occasionally needs a schedule. Match the orchestrator to the dependency graph and the bill plus the failure modes both improve.

The five layers, simplest-first.

Scheduled queries (BigQuery Data Transfer Service). Built-in cron-style scheduling for a single SQL query. UI lives inside the console. No parameter loops, no upstream dependencies, no branching.
Cloud Workflows. YAML-defined multi-step jobs with branches and retries. Lives outside BigQuery but invokes BigQuery jobs via a connector. Cheap, serverless, no infrastructure to manage.
Cloud Composer (managed Airflow). Full Airflow DAGs with sensors, backfills, complex dependencies. Managed environment costs ~$300/month at the lowest tier — worth it for any pipeline beyond a handful of steps.
Dataform (SQLX + ref()). The BigQuery-native dbt equivalent. SQL models declare upstream tables with ref('upstream_model'), Dataform builds the DAG automatically, runs assertions, and ships the SQL with the rest of your VCS.
BigQuery Studio notebooks. Mixed Python + SQL notebooks. Can be scheduled like a query. Best for ML feature engineering and ad-hoc exploration that needs to run nightly.

Scheduled queries — what to expect.

Setup. Console → saved query → Schedule → cron + destination. Lives in BigQuery Data Transfer Service.
Limitations. Single project. One SQL statement (you can CALL a stored procedure for multi-step logic, but no first-class step graph). No parameter loops. No upstream dependency on another DTS run.
Best for. Daily / hourly aggregates that produce a single destination table. The "the same SQL, every day" pattern.
Permissions. Runs as a service account; the account needs bigquery.dataEditor on the destination dataset and bigquery.jobUser on the billing project.

When to graduate to Workflows.

Two-step ETL with a branch. "Run query A; if A returns rows, run query B; otherwise alert." Scheduled queries can't branch; Workflows can.
Multi-region jobs. Run a query in us, then a follow-up in eu. Workflows orchestrates both with one YAML.
Mixing BigQuery with other services. Run a query, then call a Cloud Run service to refresh a downstream cache.

When to graduate to Composer.

DAGs with sensors and backfills. "Wait for a GCS file to appear, then run the load." Airflow's GCSObjectExistenceSensor is the canonical answer.
Complex dependencies. A pipeline with 30+ tasks, lineage, retries with exponential backoff, dynamic task generation.
Existing Airflow team. If your team already runs Airflow elsewhere, Composer matches the muscle memory.

Dataform — when to ship SQLX.

Versioned + tested SQL. Every model is a .sqlx file in a Git repo. Pull requests, code review, CI assertions.
Dependency graph from ref(). Write SELECT * FROM ${ref('events_daily')} and Dataform builds the upstream DAG automatically.
dbt-style assertions. "Every row's order_id is unique," "every row's region is in the allowed set."

BigQuery Studio notebooks — when to schedule a notebook.

Mixed SQL + Python. A Python feature-engineering cell followed by a SQL aggregation cell.
ML training. Train a model in Python; persist results to BigQuery; schedule the whole notebook nightly.
Stakeholder-facing reports. Render charts inline; schedule a daily run; email the rendered notebook.

Monitoring failures, regardless of layer.

Cloud Logging filter. Every BigQuery job emits a structured log line. Filter on severity=ERROR AND resource.type="bigquery_resource" to catch failures across the project.
Alert policy. Translate the filter into a Monitoring alert; route to Slack, email, or PagerDuty.
Per-pipeline DLQ. For Composer / Dataform / Workflows, ship failed-row data to a dead-letter table so you can replay after fixing the root cause.

Worked example — author a scheduled query from scratch

Detailed explanation. A team wants the same per-day-per-region revenue rollup to land in a mart table every morning at 06:00 UTC. No branching, no upstream dependency on another job. Scheduled queries are the right tool.

Question. Author the SQL, choose the schedule, pick the destination, and identify the IAM the runner service account needs.

Input.

field	value
Cron	`0 6 * * *`
Destination	`pipecode-demo.mart.daily_revenue`
Write preference	`WRITE_APPEND`
Service account	`bq-scheduler@pipecode-demo.iam.gserviceaccount.com`
Service account IAM	`bigquery.dataEditor` on `mart`, `bigquery.jobUser` on project

Code.

-- The SQL the scheduled query runs every morning.
INSERT INTO `pipecode-demo.mart.daily_revenue`
(event_date, region, revenue)
SELECT
    @run_date                                   AS event_date,
    region,
    SUM(amount)                                 AS revenue
FROM `pipecode-demo.mart.fact_orders`
WHERE event_date = @run_date
GROUP BY region;

Step-by-step explanation.

Save the SELECT as a saved query first; the schedule form expects a saved-query template.
Click "Schedule" → "Create new schedule." Pick the cron 0 6 * * * (06:00 UTC daily).
Set the destination table; choose "Append to existing" so each day's partition lands cleanly.
Pick the service-account runner. The account must have bigquery.dataEditor on the destination dataset and bigquery.jobUser on the billing project — without either, the run fails at job submission.
Add a Cloud Logging filter for severity=ERROR scoped to this schedule's job naming pattern; wire it to a Monitoring alert policy that posts to Slack.
Set @run_date to be _PARTITIONDATE so backfills replay correctly without parameter editing.

Output (first run).

event_date	region	revenue
2026-06-05	US	88 412
2026-06-05	EU	62 109
2026-06-05	APAC	22 803

Rule of thumb. Always make the service-account runner explicit. Never let scheduled queries run as a human user — when that human leaves the company, every schedule they owned breaks silently.

Worked example — promoting a two-step job to Cloud Workflows

Detailed explanation. A pipeline grows past one query: "first refresh the staging table, then refresh the mart table." Scheduled queries can't express the dependency; Workflows can with one tiny YAML.

Question. Write a Cloud Workflows definition that runs the staging query, waits for success, then runs the mart query.

Input.

step	what it does
refresh_staging	`CALL pipecode_demo.refresh_stg_events()`
wait	implicit — sequential
refresh_mart	`CALL pipecode_demo.refresh_mart_orders()`
notify	post Slack message on failure

Code.

main:
  steps:
    - refresh_staging:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: pipecode-demo
          body:
            query: "CALL `pipecode-demo.stg.refresh_stg_events`()"
            useLegacySql: false
        result: stg_result

    - refresh_mart:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: pipecode-demo
          body:
            query: "CALL `pipecode-demo.mart.refresh_fact_orders`()"
            useLegacySql: false
        result: mart_result

    - finished:
        return: ${mart_result}

Step-by-step explanation.

Workflows runs steps sequentially by default. The refresh_mart step doesn't start until refresh_staging returns.
Each googleapis.bigquery.v2.jobs.query call submits a job and waits for completion (sync mode).
If either step fails, the Workflow stops; the failure surfaces in Cloud Logging with the step name attached.
Trigger the Workflow on a cron schedule via Cloud Scheduler (which targets a Workflows execution).
For branching: replace the linear sequence with a switch step that picks the next step based on the previous result.

Output (a successful run).

step	duration	status
refresh_staging	12 s	OK
refresh_mart	8 s	OK
total wall clock	~20 s	OK

Rule of thumb. As soon as the pipeline has two dependent steps, you've outgrown scheduled queries. Workflows is the lowest-friction next rung — cheaper and simpler than Composer for jobs that don't need sensors or backfills.

Worked example — a Composer DAG that backfills 30 days

Detailed explanation. Composer is necessary when you need sensors, complex dependencies, or backfills. The canonical use case: a nightly DAG that ingests yesterday's GCS file, runs three BigQuery transformations, and refreshes a dashboard.

Question. Sketch an Airflow DAG that (a) waits for a GCS file to appear, (b) loads it into BigQuery staging, (c) runs the staging-to-mart transform, (d) refreshes a Looker dashboard cache.

Input — DAG shape.

task_id	operator	depends_on
wait_for_file	GCSObjectExistenceSensor	--
load_staging	GCSToBigQueryOperator	wait_for_file
stg_to_mart	BigQueryInsertJobOperator	load_staging
refresh_dashboard	HttpOperator	stg_to_mart

Code.

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from datetime import datetime

with DAG(
    "daily_marketing_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 7 * * *",
    catchup=True,                       # enables backfill
    max_active_runs=4,
) as dag:

    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_file",
        bucket="pipecode-demo-raw",
        object="events/{{ ds }}/events.json",
    )

    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="pipecode-demo-raw",
        source_objects=["events/{{ ds }}/events.json"],
        destination_project_dataset_table="pipecode-demo.stg.events_raw${{ ds_nodash }}",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )

    stg_to_mart = BigQueryInsertJobOperator(
        task_id="stg_to_mart",
        configuration={
            "query": {
                "query": "CALL `pipecode-demo.mart.refresh_events`('{{ ds }}')",
                "useLegacySql": False,
            }
        },
    )

    refresh_dashboard = SimpleHttpOperator(
        task_id="refresh_dashboard",
        method="POST",
        http_conn_id="looker",
        endpoint="api/4.0/dashboards/42/refresh",
    )

    wait_for_file >> load_staging >> stg_to_mart >> refresh_dashboard

Step-by-step explanation.

The DAG runs at 07:00 UTC daily; catchup=True means Airflow runs every missed schedule on first deploy — a built-in backfill.
GCSObjectExistenceSensor polls the file's path until it appears, then succeeds. Sensors are why Composer wins over Workflows for "wait for upstream."
GCSToBigQueryOperator loads the JSON into a date-suffixed staging table (events_raw$20260605).
BigQueryInsertJobOperator runs the stored procedure that promotes the row to mart, scoped to the run date.
SimpleHttpOperator posts to the Looker API to refresh the dashboard cache.
The >> chain expresses the dependency order. Each task waits for its predecessor's success.

Output (one day's run).

task	duration	status
wait_for_file	~30 s	OK (file landed at 06:55)
load_staging	18 s	OK (rows loaded: 2.1 M)
stg_to_mart	22 s	OK
refresh_dashboard	3 s	OK
total	~73 s	OK

Rule of thumb. Composer is worth its monthly cost the moment you need a sensor, a backfill, or more than ~5 dependent tasks. Below that threshold, Workflows + Cloud Scheduler is the cheaper answer.

SQL interview question on the orchestration ladder

A senior interviewer might ask: "Your team has 30 BigQuery jobs spread across scheduled queries and ad-hoc cron in Cloud Functions. Half are failing silently. Design the migration plan to a sane orchestration story." The answer is a tiered migration onto the ladder.

Solution Using a tiered migration

-- Inventory the 30 jobs into three buckets:
WITH job_inventory AS (
    SELECT 'daily_revenue_by_region'     AS job, 'scheduled_query' AS surface, 1 AS step_count, FALSE AS has_upstream UNION ALL
    SELECT 'mart_refresh_orders',          'scheduled_query', 3, TRUE  UNION ALL
    SELECT 'gcs_to_bq_events',             'cloud_function', 2, TRUE  UNION ALL
    SELECT 'feature_engineering_ml',       'cloud_function', 5, TRUE
)
SELECT
    job,
    CASE
        WHEN step_count = 1 AND NOT has_upstream THEN 'scheduled query (keep)'
        WHEN step_count <= 3 AND has_upstream    THEN 'Workflows (migrate)'
        WHEN step_count >= 4                     THEN 'Composer (migrate)'
        ELSE 'Workflows (default)'
    END AS target_surface
FROM job_inventory;

Step-by-step trace.

job	step_count	has_upstream	current	target
daily_revenue_by_region	1	no	scheduled_query	scheduled_query (keep)
mart_refresh_orders	3	yes	scheduled_query	Workflows
gcs_to_bq_events	2	yes	cloud_function	Workflows
feature_engineering_ml	5	yes	cloud_function	Composer

The trace surfaces the migration matrix. One-step jobs stay; two-to-three-step jobs migrate to Workflows; anything four-plus moves to Composer or Dataform. Cloud-Function-based ad-hoc cron is always migrated off because it has no visibility into BigQuery job semantics.

Output:

target surface	count of jobs	monthly cost delta
scheduled query (kept)	12	$0
Workflows (new)	10	+$5 / month
Composer (new)	8	+$300 / month
total	30	+$305 / month

Why this works — concept by concept:

Match orchestrator to dependency graph — one-step → DTS; branching → Workflows; sensors / DAGs → Composer; versioned SQL → Dataform. The shape of the dependency graph dictates the surface.
Cost vs visibility tradeoff — Composer costs more but exposes every step in a DAG view; Workflows is cheaper but trades off the visual DAG.
Cloud-Function cron is a smell — running BigQuery jobs from a Cloud Function deployed via gcloud has no built-in retry, lineage, or audit. Migrate off as a hygiene item.
Dataform is the long-term destination — once your SQL is versioned in Git with assertions, Dataform's ref() graph plus auto-generated DAG is hard to beat.
Cost — O(n_jobs) migration human time, O(monthly) infrastructure cost. The break-even point for Composer is usually 5+ dependent tasks per pipeline.

SQL
Topic — etl
ETL pipeline problems (SQL)

Practice →

5. Cost & slot monitoring with INFORMATION_SCHEMA

One view, one chart, one guardrail policy — the day-to-day BigQuery cost loop

The mental model in one line: the INFORMATION_SCHEMA.JOBS_BY_PROJECT view tells you who ran what, when, how many bytes it processed, and how many slot ms it burned — and pairing it with cost guardrails (max bytes billed per query, custom quotas, alert thresholds) closes the loop. Once you can run the top-10-cost-burners query from memory, BigQuery cost stops being a mystery line on the bill.

On-demand vs flat-rate pricing (Editions).

On-demand. $5 per TB of bytes billed, per query. No commitment, no monthly minimum. Best when usage is bursty.
Editions (Standard / Enterprise / Enterprise Plus). Reserved slot capacity, charged per slot-hour. Predictable cost, "free" queries within the reservation. Best when usage is steady or heavy.
The crossover point. Roughly 100 TB billed per month on-demand ≈ the cost of a small reservation. Below that, on-demand wins; above, Editions wins.

The one view that runs the whole monitoring loop.

INFORMATION_SCHEMA.JOBS_BY_PROJECT. Every job submitted in the project for the last 180 days. Columns include user_email, job_id, query, creation_time, total_bytes_processed, total_bytes_billed, total_slot_ms, cache_hit, query_label.
INFORMATION_SCHEMA.JOBS_BY_USER. Same shape, scoped to the calling user.
INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION. Org-wide; needs org-level IAM. Useful for FinOps dashboards.

The top-10 cost-burning queries — the query.

SELECT
    user_email,
    job_id,
    total_bytes_billed / POW(10, 12)             AS tb_billed,
    total_bytes_billed / POW(10, 12) * 5         AS usd_cost,
    total_slot_ms / 1000                          AS slot_sec,
    query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 10;

Slot usage chart — what to plot.

X-axis. Hour of the day (or 5-minute buckets for fine-grained dashboards).
Y-axis. Slots consumed.
Two series. Average slots and peak (95th percentile or max) slots per bucket.
What to look for. A flat blue line with a spiky orange line means most slots are bursty; consider a peak-time reservation. A wavy blue line means usage is even; a flat reservation matches well.

BI Engine reservation — when to enable.

What it is. An in-memory cache for sub-second dashboard queries. Reserved capacity (1–250 GB).
When it helps. Looker Studio / Looker dashboards that scan the same dimensional tables repeatedly. Latency drops from seconds to milliseconds; cost moves from per-query bytes to a flat hourly reservation.
When it doesn't. Ad-hoc queries with novel WHERE clauses (cache miss). One-shot exports.

Query labels — the chargeback primitive.

What. A key/value pair attached to a job. Examples: team:de, job:nightly, cost_center:marketing.
How to set. SET @@query_label = 'team:de,job:nightly'; at the top of the query, or via the API / CLI's --label flag.
Why. The label appears in INFORMATION_SCHEMA.JOBS and is the join key for chargeback dashboards.

Cost guardrails — three layers of defence.

Maximum bytes billed per query. SET @@maximum_bytes_billed = 50000000000; (50 GB). Any query that would exceed this cap is aborted before billing. Per-query; per-user via API.
Custom quotas. Project-level or per-user limits on bytes scanned per day. Hard cap; queries fail when exhausted.
Alert policy. Cloud Monitoring threshold on bigquery.googleapis.com/slots/total_allocated_for_reservation or on a custom log-based metric for total_bytes_billed. Routes to Slack / email.

Cost-cutting moves that actually work.

Select only the columns you need. Column-store pruning is the single biggest cost lever.
Filter on the partition column first. Pruning skips entire partitions — 10–100× reduction.
Cluster on filter columns. Clustering accelerates per-block filtering for high-cardinality WHEREs.
Materialized view for repeated aggregates. Pre-computed daily / hourly rollups make dashboards cheap.
Wrap dashboards in BI Engine. Sub-second latency, flat hourly cost, no per-query billing.

Worked example — the top-10 cost-burners audit

Detailed explanation. A FinOps engineer runs the top-10-cost-burners query every Monday morning, posts the results in the data-platform Slack channel, and follows up with the authors of the top three. The discipline puts a soft tax on expensive queries and surfaces real optimisation candidates.

Question. Given the audit query above, write the follow-up query that aggregates by user_email to find the cost owners over the last week, and rank them.

Input — five sample rows from JOBS_BY_PROJECT.

user_email	job_id	bytes_billed	slot_ms	query_label
data-team@...	job_001	2.1 TB	1 400 000	team:de,job:nightly
data-team@...	job_002	410 GB	290 000	team:bi,job:dash
analyst@...	job_003	88 GB	41 000	adhoc
ml-team@...	job_004	920 GB	680 000	team:ml,job:train
data-team@...	job_005	9 GB	4 000	team:de,job:hourly

Code.

SELECT
    user_email,
    COUNT(*)                                            AS jobs_count,
    SUM(total_bytes_billed) / POW(10, 12)               AS tb_billed,
    SUM(total_bytes_billed) / POW(10, 12) * 5           AS usd_cost,
    SUM(total_slot_ms) / 1000                            AS slot_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY usd_cost DESC
LIMIT 10;

Step-by-step explanation.

Filter to the last 7 days and to job_type = 'QUERY' (excludes load and copy jobs, which have different billing).
Group by user_email to collapse to one row per cost owner.
SUM the bytes billed (the units BigQuery actually charges) and convert to TB by dividing by 10^12.
Multiply by 5 (the per-TB on-demand price) to surface the dollar cost.
Sum slot ms separately — useful when you're on Editions (slot-second pricing) instead of on-demand.
ORDER BY the dollar cost DESC and LIMIT 10 — these are the conversations FinOps wants to have.

Output.

user_email	jobs_count	tb_billed	usd_cost	slot_sec
data-team@...	322	18.4	92.00	11 200
ml-team@...	88	6.1	30.50	4 800
analyst@...	612	1.9	9.50	880

Rule of thumb. Run this audit weekly. The biggest wins are almost always in the top-3 owners; the rest of the project tends to be statistical noise.

Worked example — finding the un-pruned-partition queries

Detailed explanation. A query that's expensive but doesn't look expensive in the source SQL usually has a missing partition filter. The audit query that surfaces it compares total_bytes_processed against the source table's total size; large ratios mean no partition pruning fired.

Question. Write the query that finds queries which processed more than 100 GB on a partitioned source table without a partition filter.

Input — assumed audit window: last 24h.

job_id	bytes_processed	source_table	partition_col	filtered_on_partition?
job_001	800 GB	events_daily	event_date	yes
job_002	4 TB	events_daily	event_date	no
job_003	200 GB	orders	order_date	yes

Code.

-- Find queries that scanned > 100 GB and likely missed a partition filter.
SELECT
    job_id,
    user_email,
    referenced_tables[OFFSET(0)].table_id   AS source_table,
    total_bytes_processed / POW(10, 9)       AS gb_processed,
    SUBSTR(query, 1, 200)                    AS query_snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
     UNNEST(referenced_tables) AS referenced_tables
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND total_bytes_processed > 100 * POW(10, 9)
  AND REGEXP_CONTAINS(query, r'(?i)events_daily')
  AND NOT REGEXP_CONTAINS(query, r'(?i)event_date\s*[=<>]')
ORDER BY total_bytes_processed DESC
LIMIT 20;

Step-by-step explanation.

Restrict to the last 24 hours of QUERY jobs above the 100 GB threshold.
UNNEST referenced_tables to surface the source-table name.
Filter to the specific table you suspect (events_daily).
Use a negative regex (NOT REGEXP_CONTAINS) to find queries that don't mention the partition column. This is a heuristic — not a hard rule — but a strong one.
Return the job_id, user, GB processed, and a snippet of the SQL for manual review.
The output is a short list of queries to rewrite with a partition filter.

Output.

job_id	user_email	source_table	gb_processed	query_snippet
job_002	analyst@...	events_daily	4 000	SELECT region, COUNT(*) FROM events_daily GROUP BY region
job_007	ml-team@...	events_daily	1 200	SELECT * FROM events_daily WHERE user_id IN (...)

Rule of thumb. Pair this audit with a Slack ping to the author. Most "I scanned 4 TB by accident" queries are written by engineers who didn't know the table was partitioned — the ping doubles as documentation.

Worked example — setting the maximum-bytes-billed guardrail

Detailed explanation. Even with good discipline, accidents happen. The maximum_bytes_billed query option is a hard cap that aborts a query before BigQuery bills more than the cap. Setting it to 50 GB per user, per query, is a sensible default for analyst seats.

Question. Show the three ways to apply the cap (per-query, per-session, per-user-default), and explain when each makes sense.

Input.

level	scope	how it's set
per-query	one statement	`SET @@maximum_bytes_billed = ...;`
per-session	the whole tab	session-level SET
per-user-default	every query that user runs	enforced via API / IAM policy

Code.

-- Per-query guardrail — 50 GB cap, set at the top of the editor tab.
SET @@maximum_bytes_billed = 50 * POW(10, 9);  -- 50 GB

SELECT region, COUNT(*) AS c
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
GROUP BY region;

-- Per-job, set via the CLI (overrides session default).
-- bq query --use_legacy_sql=false --maximum_bytes_billed=50000000000 ...

-- Per-user default, set via the python client when constructing a QueryJobConfig:
-- job_config = bigquery.QueryJobConfig(maximum_bytes_billed=50 * 10**9)

Step-by-step explanation.

SET @@maximum_bytes_billed = ...; at the top of an editor tab sets the cap for every subsequent query in the tab. Reset on tab close.
The CLI flag --maximum_bytes_billed=... enforces the same cap per-job — useful in CI / ETL scripts.
The Python client lets you set the cap per QueryJobConfig — embed it in your data-engineering library so every query inherits the cap by default.
When a query exceeds the cap, BigQuery aborts it pre-execution with a clear error: "would exceed maximum_bytes_billed."
The cap is per-query; a multi-query session can still spend more in aggregate, so pair it with a daily quota for hard upper bounds.

Output.

query	bytes_billed	result
query A (under cap)	12 GB	success
query B (over cap)	80 GB attempted	aborted with "exceeds maximum_bytes_billed"

Rule of thumb. Default the cap in your team's BigQuery client wrapper. The cost of one engineer forgetting to set it is the exact cost the cap prevents — usually 50× higher than the cap itself.

SQL interview question on cost discipline

The interviewer's probe: "Walk me through how you'd set up a from-scratch BigQuery cost monitoring loop for a 20-engineer team on on-demand pricing." The answer is a four-piece package: top-10 audit, partition-miss audit, query labels, guardrails.

Solution Using the four-piece cost loop

-- Piece 1 — top-10 cost owners weekly
CREATE OR REPLACE TABLE pipecode_demo.ops.weekly_cost_top10 AS
SELECT
    user_email,
    COUNT(*)                                             AS jobs_count,
    SUM(total_bytes_billed) / POW(10, 12) * 5            AS usd_cost
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY usd_cost DESC
LIMIT 10;

-- Piece 2 — un-pruned-partition queries (heuristic)
CREATE OR REPLACE TABLE pipecode_demo.ops.partition_miss_audit AS
SELECT
    job_id, user_email,
    total_bytes_processed / POW(10, 9) AS gb_processed,
    SUBSTR(query, 1, 200) AS snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND total_bytes_processed > 100 * POW(10, 9)
  AND REGEXP_CONTAINS(query, r'(?i)events_daily')
  AND NOT REGEXP_CONTAINS(query, r'(?i)event_date\s*[=<>]')
ORDER BY gb_processed DESC
LIMIT 20;

-- Piece 3 — enforce query labels via wrapper
-- (In Python: every QueryJobConfig sets labels={'team': team, 'job': name}.)

-- Piece 4 — maximum_bytes_billed guardrail in every editor session
SET @@maximum_bytes_billed = 100 * POW(10, 9);  -- 100 GB hard cap

Step-by-step trace.

piece	cadence	output	action it drives
top-10 owners	weekly	named list	DM the top 3
partition-miss audit	weekly	named list	rewrite + educate
query labels	always-on	chargeback table	finance reports
100 GB cap	per-session	aborted query	educates immediately

The four pieces close the loop: weekly audits surface candidates, query labels give finance a chargeback path, and the cap stops accidental damage before billing. Most teams stop at piece 1; the discipline shows when you have all four.

Output (after one month of running the loop):

metric	before loop	after loop	delta
weekly cost	$2 400	$1 600	-33%
largest-single-query bill	$480	$0 (capped)	-100%
% queries with labels	12%	94%	+82pp
partition-miss queries	27 / week	6 / week	-78%

Why this works — concept by concept:

INFORMATION_SCHEMA.JOBS as the single source of truth — every monitoring artefact derives from the same view; no out-of-band data export, no parallel dashboard to maintain.
Top-10 audit as a soft tax — public visibility on cost-burners creates social pressure to optimise without imposing a hard rule.
Partition-miss heuristic — the regex isn't perfect, but it surfaces the worst 80% of accidents. False positives are cheap; false negatives are expensive.
Query labels as the chargeback primitive — the label string is free, joinable, and propagates through INFORMATION_SCHEMA. Finance gets a chargeback table without building a parallel pipeline.
Maximum bytes billed as the safety net — even when an engineer skips the validator, the cap stops a $500 typo at the bytes threshold you actually chose.
Cost — pieces 1 and 2 are O(jobs) per week (~$0.01 each); piece 4 is free; the savings from -33% weekly cost dwarf the audit's cost in week one.

SQL
Topic — window-functions
Window function problems (SQL)

Practice →

Cheat sheet — BigQuery console recipes

Always read the validator before clicking Run. Bytes processed → dollars. A red card is your last chance to stop a $100 typo.
SELECT * against a 10 TB table is a $50 click. Always project columns. The validator turns from red to yellow on column projection alone.
Filter on the partition column first. Bytes scanned drops by 10–100× when the WHERE matches the partition layout. Look up the partition column in the Schema tab if you don't know it.
LIMIT does not reduce bytes scanned. BigQuery scans the partition set first, then truncates. Use a partition WHERE to control cost; use LIMIT only for the rendered row count.
Cache hit is free. Toggle "Use cached results" ON for ad-hoc exploration. Identical SQL within 24 hours = $0.
Cmd-Enter to run; Cmd-/ to comment; Cmd-K to format. Three shortcuts; a minute saved per query.
Use the Preview tab for "what does this table look like." Free; metadata only; no SQL needed.
Read the Query plan after any > 10 s query. Slot ms by stage tells you cost; shuffle bytes tells you join shape.
INFORMATION_SCHEMA.JOBS_BY_PROJECT is the one view that runs the cost loop. Top-10 owners, partition misses, and labels all derive from it.
Tag queries with SET @@query_label = 'team:de,job:nightly'; at the top of every editor session. Finance can join the label to JOBS for chargeback.
Save common queries; promote to scheduled queries when stable. Three identical days in a row is a reasonable promotion criterion.
Match orchestrator to dependency graph. One step → scheduled query. Two-to-three steps → Workflows. DAG with sensors → Composer. Versioned SQL with tests → Dataform.
SET @@maximum_bytes_billed = 100 * POW(10, 9); is the safety net for the editor. Set it once per tab.
BI Engine for dashboards that scan the same tables. Flat hourly cost, sub-second latency, no per-query bill.
Service-account-owned schedules survive team turnover. Never schedule a query as yourself; create a runner SA and grant the minimum two roles.

Frequently asked questions

How do I open the BigQuery console?

Navigate to https://console.cloud.google.com/bigquery in any modern browser. Sign in with the Google account that has GCP IAM permissions on the project. The top-bar project selector lists every project where you have bigquery.user (or higher); pick the one you need. The left Explorer pane shows the datasets you can bigquery.dataViewer over; the centre pane is the multi-tab query editor. Two keyboard shortcuts pay for themselves immediately: Cmd-Enter to run, Cmd-/ to toggle comments. If you want the new unified workbench (notebooks, canvases, Dataform), use the Studio toggle in the top-left.

What is a dry run in BigQuery and how do I do one?

A dry run is a server-side query validation that reports how many bytes the query would scan, what it would cost (on the on-demand model), and whether the query references invalid columns or wrong regions — without actually running. In the console, the validator card on the right of the editor performs a dry run automatically as you type. From the CLI, run bq query --dry_run --use_legacy_sql=false "...". From the Python client, set dry_run=True on the QueryJobConfig. Read the dry run before every Run, especially against partitioned tables, where a missing WHERE clause can be the difference between a $0.01 query and a $50 query.

How much does a BigQuery query cost?

On the on-demand model, BigQuery charges $5 per TB of bytes billed, where "bytes billed" is the sum of column blocks read across all referenced columns and partitions. A query that reads 100 GB pays $0.50; one that reads 1 TB pays $5. Cache hits are free. On Editions (Standard / Enterprise / Enterprise Plus), you pay per slot-hour against a reservation, and individual queries cost nothing extra within the reservation's capacity. The crossover is around 100 TB billed per month — below, on-demand wins; above, Editions wins. To audit cost per query, use INFORMATION_SCHEMA.JOBS_BY_PROJECT.total_bytes_billed.

What are scheduled queries and when should I use them instead of Airflow?

Scheduled queries are BigQuery's built-in cron, powered by the Data Transfer Service. You attach a cron expression to a saved query, pick a destination table, pick a runner service account, and BigQuery handles the rest — no infrastructure to manage. Use them for one-step jobs: a daily aggregate, an hourly rollup, a single SQL statement that writes to one destination. The moment your pipeline has two dependent steps, an upstream sensor, a backfill, or branching logic, graduate up the ladder to Cloud Workflows, Cloud Composer (managed Airflow), or Dataform. The rule of thumb: scheduled queries are for SQL that is the pipeline; Composer is for SQL that's part of a pipeline.

How do I see who's running expensive queries in my BigQuery project?

Query INFORMATION_SCHEMA.JOBS_BY_PROJECT with a WHERE on creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) and job_type = 'QUERY', then GROUP BY user_email and ORDER BY SUM(total_bytes_billed) DESC — the top 10 rows are your cost owners. Pair with a partition-miss audit (NOT REGEXP_CONTAINS(query, partition_col) on queries above 100 GB) to surface the rewrites that matter most. Run the audit weekly, post the results in your team's Slack, and follow up with the top three. The visibility alone usually drops weekly cost by 20–40% in the first month.

What's the difference between BigQuery Studio and the classic console?

The classic BigQuery console is a SQL editor with an Explorer pane, a validator card, and a job history rail — optimised for "write a query, see the result." BigQuery Studio is the 2024+ unified workbench that bundles SQL queries, Python / SQL notebooks, data canvases (drag-and-drop graphical query builder), Dataform projects, saved queries, and ML models under one left-pane navigator. You can schedule a notebook the same way you schedule a query; you can mix %%sql and bigframes Python in the same notebook. The Studio toggle is in the top-left of the console; switching is non-destructive. In 2026, most teams default to Studio for new work and keep the classic view bookmarked for legacy IAM workflows.

Practice on PipeCode

Drill the SQL practice library → for the BigQuery-flavoured ANSI SQL that the workbench rewards.
Rehearse on aggregation problems → for the GROUP BY queries you'll write all day.
Sharpen window function drills → for the patterns that drive most of the slot bill.
Layer the joins practice library → for the multi-table queries that dominate mart-layer SQL.
Stack the CTE practice library → — every multi-step BigQuery query lives or dies on clean CTEs.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the SQL axis with the SQL for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.
To frame BigQuery in a broader stack, study ETL system design for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every console recipe above ships with hands-on practice rooms where you write the partition-aware WHERE, the INFORMATION_SCHEMA audit, and the LEFT JOIN with COALESCE against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your BigQuery query would survive a code review on a real team's reservation.

Practice SQL now →
Aggregation drills →