bigquery console is the single most-opened tab in a Google-Cloud data team's day — and the one most new hires under-use. The reason is that "the console" in 2026 is no longer a single SQL editor. It is a unified workbench that bundles a multi-tab query editor, the dry-run cost validator, scheduled queries, Studio notebooks, Dataform projects, slot monitoring dashboards, and Gemini SQL assist into one URL — and the only way to feel fast in it is to learn each panel by what job it does.
This guide is the hands-on tour you wished existed the first time you opened console.cloud.google.com/bigquery and had to ship a query, an aggregate, and a daily schedule before lunch. It walks through the six panels of the BigQuery web UI, the eight-step dry-run-to-schedule workflow most data engineers internalise after a few weeks, the orchestration ladder from a one-line scheduled query to a full Cloud Composer DAG, and the cost-and-slot monitoring loop that every senior reviewer expects you to know cold. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library → to rebuild the muscle the workbench rewards, rehearse on aggregation problems → for the GROUP BY queries you'll write all day, and stack the cost-aware reflex with window functions → — the patterns that drive most of the slot bill.
On this page
- BigQuery console — what it is and why it matters in 2026
- Anatomy of the BigQuery console UI
- Query editor workflow — dry run → run → save → schedule
- Scheduled queries, Workflows, and Cloud Scheduler
- Cost & slot monitoring with INFORMATION_SCHEMA
- Cheat sheet — BigQuery console recipes
- Frequently asked questions
- Practice on PipeCode
1. BigQuery console — what it is and why it matters in 2026
The BigQuery web UI is a fully serverless SQL workbench — no install, no driver, no warm-up; just open a URL and write SQL
The one-sentence invariant: the BigQuery console lives at console.cloud.google.com/bigquery, runs entirely in the browser, and is the fastest path from "I need to look at this dataset" to "I have a result in the clipboard". Once you internalise that the console is a workbench, not just an editor, the question "which tool should I use for this BigQuery task?" almost always answers itself with "the console first, then graduate to the CLI or the API if scripting is needed."
The three other ways to talk to BigQuery, and when each wins.
-
bqCLI. Best for one-off scripted exports, CI jobs, and "I want to grep my project's datasets from a terminal." Same auth as the console (gcloud), but no syntax assist, no plan view, no Gemini. - The REST + gRPC API (and the Python / Go / Java client libraries). Best for embedding BigQuery into an application: an ETL service, a dashboarding backend, a streaming ingest job. Code-first; no UI affordances.
-
BigQuery DataFrames (
bigframes). A 2024+ Python client that compiles pandas-style DataFrame calls into BigQuery SQL. Best for notebook-style exploration where the user prefers Python over SQL — but the console is still where you debug the generated SQL.
The five things that arrived in the console between 2024 and 2026.
- Gemini SQL assist. Inline autocomplete for joins, CTEs, and CASE branches; natural-language "explain this query" and "write the SQL that joins these two tables" prompts. Toggle per-project.
- BigQuery Studio. A unified workbench tab that bundles SQL queries, Python / SQL notebooks, data canvases, Dataform projects, and saved queries under a single left-pane navigator. The classic "BigQuery" tab still exists; Studio is the recommended home base in 2026.
- Unified data canvas. A drag-and-drop graphical query builder that compiles to SQL — useful for first-week onboarding and for sharing a query shape with a non-SQL stakeholder.
-
BigQuery DataFrames notebook integration. Notebooks that mix
%%sqlcells withbigframesPython cells, share auth with the console, and ship results to Looker Studio with one button. -
Per-query labels in the validator. The dry-run card now surfaces the
query_labelyou set so cost-attribution is visible before you click Run.
Why the console is still the fastest tool for most jobs.
- Exploration. Browsing the Explorer tree, previewing a table, profiling a column — point-and-click in the console, line-of-code in the CLI.
-
Dry-run cost estimates. The validator card on the right of the editor shows bytes processed and dollar estimate before you click Run. The CLI shows it too (
--dry_run), but the console renders the partition-pruning preview as well. - Ad-hoc SQL. Cmd-Enter runs the highlighted block; results stream into the bottom panel; cache hit re-runs are free. Round-trip in seconds.
-
Visualisation handoff. "Open in Looker Studio" and "Save results to Sheet" are one-click buttons on the results panel — the CLI requires
bq extract+ a download step.
Console vs Snowsight (Snowflake) vs Athena UI — the orientation map for migrants.
| BigQuery console | Snowsight (Snowflake) | Athena query editor (AWS) |
|---|---|---|
| Left Explorer: project → dataset → table | Left Databases: db → schema → table | Left Tables (per Glue database) |
| Centre multi-tab query editor | Worksheets (tabbed) | Single editor + Saved queries pane |
| Dry-run validator card | Estimate from query profile post-run | No native dry run; charge by data scanned |
| Job history rail | Query history table | Recent queries tab |
| Scheduled queries (DTS) | Tasks (multi-step DAGs) | EventBridge + Athena |
| BigQuery Studio (notebooks + canvas + Dataform) | Snowflake notebooks + Streamlit | SageMaker Studio integration |
| Cost model — bytes processed ($5/TB on-demand) | Credits-per-warehouse-second | Bytes scanned ($5/TB) |
What this guide covers, in one breath.
The anatomy of the six visible panels, the dry-run-to-schedule workflow most data engineers internalise, the orchestration ladder (scheduled query → Workflows → Composer → Dataform → Studio notebook), and the INFORMATION_SCHEMA-driven cost / slot monitoring loop every senior BigQuery user knows cold.
Worked example — opening the console for the first time
Detailed explanation. A new hire on a Google Cloud data team typically has GCP IAM permissions assigned to their email but has never opened the BigQuery UI. The first-week win is to load the console, switch to the correct project, find the team's main dataset, and run a sub-second query — all in under five minutes.
Question. Walk through the exact steps to open the BigQuery console, switch project, and run a one-row sanity-check query that confirms BigQuery sees the user as authenticated. List the IAM permissions required and the one keyboard shortcut that saves the most time.
Input.
| step | what the user sees |
|---|---|
| 1 | https://console.cloud.google.com/bigquery |
| 2 | top bar — project selector chip |
| 3 | left pane — Explorer with starred projects |
| 4 | centre — empty Untitled query tab |
| 5 | bottom — empty results panel |
Code.
-- Sanity check #1 — does BigQuery see the caller as authenticated?
SELECT
SESSION_USER() AS who_am_i,
@@project_id AS current_project,
CURRENT_TIMESTAMP() AS server_time;
Step-by-step explanation.
- Open
https://console.cloud.google.com/bigquery— the top-bar project selector lists every project where the caller hasbigquery.user(or higher). Click the selector, pick the team project. - In the left Explorer pane, the project appears as a folder; expanding it lists every dataset the caller can
bigquery.dataViewerover. - Click the + Compose new query button (or just hit
Cmd-Enteron an empty editor) — a new Untitled tab opens. - Paste the sanity SQL above and hit
Cmd-Enterto run. The result appears in the bottom panel: one row with the caller's email, the project id, and the server timestamp. - The minimum IAM roles needed for this flow are
roles/bigquery.user(lets you open the console and submit a job) plusroles/bigquery.dataVieweron the datasets you want to query. To create tables, you also needroles/bigquery.dataEditor.
Output.
| who_am_i | current_project | server_time |
|---|---|---|
| ada@example.com | pipecode-demo | 2026-06-06 08:14:22 UTC |
Rule of thumb. Memorise Cmd-Enter (run highlighted) and Cmd-/ (toggle comment). Those two shortcuts alone shave thirty seconds off every query-edit-rerun loop, which compounds to an hour per day on busy investigation sessions.
Worked example — when to reach for the console vs the CLI vs the API
Detailed explanation. New engineers default to whichever tool they learned first. In practice each tool wins for a different job, and a senior reviewer can usually spot a junior by noticing they wrote a 50-line shell script for a job the console solves in 10 seconds — or wrote a console query for a job that should have been parameterised in code.
Question. Given three tasks — (a) "look at the last 100 rows of the events table," (b) "run a 5-step ETL nightly," (c) "export the daily revenue report to a CSV in GCS" — pick the right tool for each and justify in one sentence.
Input.
| task | console | bq CLI | API / client lib |
|---|---|---|---|
Browse 100 rows of events
|
✓ | -- | -- |
| Nightly 5-step ETL | -- | possible | best (Composer / Workflows) |
| Daily CSV → GCS | -- | best | possible |
Code.
# Task (a) — console only. Open the table in Explorer, click Preview.
# (No SQL needed; the console renders rows from storage metadata for free.)
# Task (b) — Cloud Composer DAG, not bq CLI loops.
# Example Airflow operator (Python):
# BigQueryInsertJobOperator(
# task_id='daily_aggregate',
# configuration={'query': {'query': 'CALL pipecode_demo.daily_aggregate()',
# 'useLegacySql': False}})
# Task (c) — bq CLI export, wired into a Cloud Scheduler cron.
bq query --nouse_legacy_sql --destination_table=pipecode-demo:exports.daily_revenue \
"SELECT region, SUM(amount) AS total FROM pipecode-demo.mart.fact_orders \
WHERE DATE(order_ts) = CURRENT_DATE() - 1 GROUP BY region"
bq extract --destination_format=CSV \
pipecode-demo:exports.daily_revenue \
gs://pipecode-demo-exports/daily_revenue_$(date -u +%F).csv
Step-by-step explanation.
- Task (a) is exploration — the console wins because the Explorer's Preview tab uses the table's storage metadata to render rows without spending a slot or scanning a byte.
- Task (b) is multi-step orchestration — Cloud Composer (managed Airflow) is the standard answer because it has sensors, branching, retries, and visible DAGs.
- Task (c) is a scheduled export — the CLI plus Cloud Scheduler is the cheapest answer because
bq extractis a managed export operation that writes directly to GCS without a Python orchestrator.
Output.
| task | tool | reason |
|---|---|---|
| Browse 100 rows | console (Preview tab) | metadata-only render, free |
| Nightly 5-step ETL | Cloud Composer | full DAG semantics |
| Daily CSV → GCS |
bq CLI + Cloud Scheduler |
one-shot managed export |
Rule of thumb. Default to the console for any task you can describe in one sentence. Reach for the CLI for one-off scripts. Reach for the API or Composer for anything that needs to run without you watching it.
SQL interview question on choosing the right BigQuery surface
A common opener from a senior interviewer is: "You just landed on a team running 200 BigQuery queries per day across exploration, dashboards, and ETL. Walk me through which queries belong in the console, which in scheduled queries, and which in Composer — and which signals tell you it is time to graduate up the ladder." The answer is a maturity model, not a tool list.
Solution Using a four-question decision tree
-- 1) Is the query stable? (i.e. has it run the same way three days in a row?)
-- No → console only.
-- Yes → graduate to scheduled query.
-- 2) Does the query have upstream dependencies? (table B must finish before A)
-- No → scheduled query is enough.
-- Yes → graduate to Workflows or Composer.
-- 3) Is the orchestration logic > 5 steps OR needs sensors / backfill?
-- No → Workflows.
-- Yes → Composer (Airflow) DAG.
-- 4) Does the SQL live in version control with tests?
-- No → ship it raw under any of the above.
-- Yes → migrate to Dataform — versioned SQLX + ref() graph.
-- Worked decision applied to three real queries:
SELECT
'top_10_customers' AS query_name,
'console' AS chosen_surface,
'ad-hoc exploration' AS why
UNION ALL SELECT
'daily_revenue_by_region',
'scheduled query',
'stable, 1-step, no upstream dep'
UNION ALL SELECT
'mart.fact_orders nightly rebuild',
'Composer (Airflow)',
'multi-step, backfill, depends on stg tables';
Step-by-step trace.
| Question | top_10_customers | daily_revenue_by_region | nightly mart rebuild |
|---|---|---|---|
| 1. Stable? | no | yes | yes |
| 2. Upstream deps? | no | no | yes |
| 3. > 5 steps / sensors? | no | no | yes |
| 4. Versioned + tested? | no | no | could be Dataform |
| Surface | console | scheduled query | Composer |
The trace highlights the inflection points: every "yes" pushes the query one rung up the ladder. Skipping rungs leads to over-engineered nightly cron in Composer for queries a scheduled query could run, or under-engineered ad-hoc CSV exports that everyone forgets exist until they break.
Output:
| query | surface | weekly cost of being on the wrong rung |
|---|---|---|
| top_10_customers | console | $0 (right rung) |
| daily_revenue_by_region | scheduled query | ~$15 (would-be Composer instance) |
| nightly mart rebuild | Composer | unbounded (silent freshness bugs if downgraded) |
Why this works — concept by concept:
- Maturity model, not tool list — every BigQuery surface answers a different stability question. The decision tree forces you to ask "is this code stable enough to leave alone?" before picking a surface.
- Cost of over-engineering — Composer ships a managed Airflow environment that costs ~$300/month at the lowest tier. Spending it on a one-line scheduled query is a real budget leak.
- Cost of under-engineering — a 5-step ETL run as five separate scheduled queries has no upstream-dependency contract; a downstream step can run before its upstream finishes and silently consume stale data.
-
Dataform as the SQL-with-tests surface — Dataform's
ref()graph and assertions give every SQL its place in a DAG with checks. Teams that move to Dataform stop writing per-table cron and start writing per-model declarations. - Cost — choosing surface is O(1) per query but has linear ongoing cost. The decision tree pays for itself across 200 queries.
SQL
Topic — sql
SQL practice library (warm up the BigQuery muscle)
2. Anatomy of the BigQuery console UI
Six panels, one workbench — learn the layout once, navigate forever
The mental model in one line: the BigQuery console is six fixed panels — top bar, left Explorer, centre query editor, right validator, bottom results, far-right job history — and BigQuery Studio bundles notebooks and canvases into the same workbench under a new top-level switcher. Once you can name each panel and the job it does, "where do I click for X?" stops being a question.
The six panels, by job.
- Top bar. Project selector, global search, region indicator, the IAM-badge avatar (click it to swap identities or impersonate a service account). Also: the Studio / classic-BigQuery switcher in the top-left.
- Left pane — Explorer. A tree of projects → datasets → tables / views / materialized views / routines / models. Star a project to keep it pinned. Right-click a dataset for "Open in new tab," "Create table," or "Share."
- Centre — multi-tab query editor. Each query lives in its own tab. Cmd-Enter runs; Cmd-/ comments; Cmd-S saves. Gemini autocomplete underlines suggested joins and CASE branches as you type.
- Right rail — query validator. Dry-run estimate (bytes processed → dollar cost), region check, billing-project check. The "warning" banner here is your last line of defence against a $100 typo.
- Bottom — results panel. Three toggles: Table (default), JSON, Chart. The chart toggle compiles to a Looker Studio mini-chart for free.
- Job history (right-most strip). Every job you submitted: bytes processed, slot ms, cache hit indicator, query plan link. Useful when you want to grab a job_id for INFORMATION_SCHEMA debugging.
BigQuery Studio — the unified workbench.
- What it is. A new top-level navigation that bundles SQL queries, Python / SQL notebooks, data canvases, Dataform repos, saved queries, and ML models under one left pane.
-
Why it matters. Notebooks share auth with the console, can be scheduled like a query, and embed
bigframesPython alongside%%sqlcells. The canvas is a graphical query builder that compiles to SQL. -
When you'd switch back to classic. Some IAM tooling and old
bq cpworkflows are still smoother in the classic view; the toggle is one click.
Job-history view in detail.
- bytes_processed. What the query actually read. Compare with the dry-run estimate; large divergence means partition pruning didn't fire.
- slot_ms. Slot-milliseconds consumed. The fundamental unit of BigQuery compute. Divide by 1000 to get slot-seconds.
- cache_hit. Boolean. TRUE = the result came from the 24-hour cache; the query was free.
- query_plan. A clickable link to the stage graph: stages, shuffle, slot ms per stage. The BigQuery EXPLAIN equivalent.
- execution_graph. A visual DAG of stages — the "this stage took 80% of the wall clock" view that surfaces hot spots.
Keyboard shortcuts that matter.
-
Cmd-Enter/Ctrl-Enter— Run highlighted query (or the whole tab if nothing is highlighted). Most-used shortcut. -
Cmd-//Ctrl-/— Toggle line comment. Indispensable when sectioning a long query. -
Cmd-S/Ctrl-S— Save query. Prompts for a saved-query name on first save. -
Cmd-K/Ctrl-K— Format query. Re-indents and aligns CTE lists. -
Cmd-D/Ctrl-D— Duplicate selection or line. Useful for fastLIMIT 10→LIMIT 100edits. -
Esc— Close the autocomplete / Gemini suggestion. Lets you keep typing without accepting.
Worked example — running a query from the Explorer pane
Detailed explanation. The fastest path from "I need to see this table" to "I have rows on screen" is not typing SQL — it is clicking the table in the Explorer and hitting Preview. Most new engineers don't discover Preview for weeks; senior engineers reach for it before SELECT.
Question. Given a events_daily table in pipecode_demo.dataset_marketing, show the three ways to look at its first 10 rows. Compare cost, latency, and what each surface tells you.
Input.
| approach | UI clicks | cost (bytes scanned) | latency |
|---|---|---|---|
| Preview tab | Explorer → table → Preview | 0 | metadata only |
| SELECT * LIMIT 10 | editor → paste → Run | partition-or-table-sized | seconds |
| Schema tab | Explorer → table → Schema | 0 | metadata only |
Code.
-- Path 2 — SELECT with LIMIT.
-- Beware: LIMIT does NOT reduce bytes scanned in BigQuery. The query
-- scans the entire table (or the entire partition set the WHERE picks)
-- and *then* truncates to 10 rows. LIMIT is a render hint, not a cost hint.
SELECT *
FROM `pipecode-demo.dataset_marketing.events_daily`
LIMIT 10;
-- Better — filter on a partition column to limit bytes scanned.
SELECT *
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
LIMIT 10;
Step-by-step explanation.
- The Preview tab reads no slot ms and scans no bytes — it surfaces a few rows from the storage layer's row blocks directly. Free, fast, no SQL.
-
SELECT * LIMIT 10is the trap junior engineers fall into: BigQuery scans the entire table (or partition set), pays the bytes-processed cost, then truncates the result to 10 rows. The bill is for the scan, not the LIMIT. - Adding
WHERE event_date = CURRENT_DATE() - 1lets the optimiser prune partitions — bytes scanned drops from "whole table" to "one partition," and the dry-run estimate plummets. - The Schema tab shows column types, descriptions, and partition / cluster keys — invaluable for writing the WHERE in step 3 with a known partition column.
Output.
| approach | cost | what you see |
|---|---|---|
| Preview | $0 | first ~50 rows verbatim |
SELECT * LIMIT 10 |
full table scan | 10 rows |
WHERE partition_col = ... LIMIT 10 |
one partition scan | 10 rows (cheap) |
Rule of thumb. Reach for the Preview tab first. Reach for SELECT only when you need a specific filter or computed column. Never write SELECT * against a large table without a partition filter — the LIMIT does not save you.
Worked example — reading the query validator card
Detailed explanation. The right-rail validator does a server-side dry run as you type. It tells you exactly how many bytes the query would scan and how many dollars it would cost (on the on-demand pricing model) before you click Run. Most expensive accidents in BigQuery are caused by ignoring this card.
Question. Given a SELECT * against a 10 TB table, what does the validator show, and how do you change the query so the validator turns green?
Input — validator outputs for three query shapes.
| query | bytes processed | $ estimate | validator colour |
|---|---|---|---|
SELECT * FROM big_table |
10 TB | $50.00 | red |
SELECT col_a, col_b FROM big_table |
1.2 TB | $6.00 | yellow |
SELECT col_a FROM big_table WHERE partition_col = '2026-06-05' |
12 GB | $0.06 | green |
Code.
-- Trigger the dry run from the CLI to inspect what the validator computes:
bq query --dry_run --use_legacy_sql=false \
"SELECT col_a FROM \`pipecode-demo.dataset.big_table\` WHERE event_date = '2026-06-05'"
-- Output snippet:
-- Query successfully validated. Assuming the tables are not modified,
-- running this query will process 12,884,901,888 bytes of data.
Step-by-step explanation.
- The validator reports
bytes processedbased on the referenced columns (column-store pruning) and the referenced partitions (partition pruning), pre-execution. -
SELECT *references every column → every column block is read → 10 TB. - Projecting only the two columns you need brings the bytes scanned down to the column-block size of those columns.
- Adding
WHERE partition_col = '...'lets BigQuery skip every partition file that doesn't match — for a daily-partitioned table with 1000 days of history, that's a 1000× reduction. - Combine column projection + partition filter and the same query becomes effectively free.
Output.
| optimisation | bytes scanned | cost |
|---|---|---|
baseline SELECT *
|
10 TB | $50.00 |
| project columns | 1.2 TB | $6.00 |
| + partition filter | 12 GB | $0.06 |
Rule of thumb. Make "look at the validator card" the last action before every Run. If it's red and you didn't intend to pay $50, fix the query — column projection and partition filters are the two universal moves.
Worked example — inspecting the query plan after a run
Detailed explanation. After a query finishes, the bottom panel offers a "Query plan" tab — a per-stage breakdown of the execution graph. Reading the plan is the 80/20 skill that turns a junior writer into an engineer who can debug "why is this query slow?" without guessing.
Question. A query takes 90 seconds and processes 200 GB. Where in the plan do you look first, and what numbers do you compare?
Input — abbreviated plan.
| stage | rows in | rows out | slot ms | shuffle bytes | wait ratio |
|---|---|---|---|---|---|
| S00 Read events | 2.4 B | 2.4 B | 32 000 | 0 | 0.05 |
| S01 Filter event_date | 2.4 B | 18 M | 1 200 | 0 | 0.02 |
| S02 Join customers | 18 M | 18 M | 1 800 | 4.8 GB | 0.08 |
| S03 Aggregate by region | 18 M | 6 | 120 | 96 KB | 0.01 |
Code.
-- Pull the same numbers from INFORMATION_SCHEMA so you can script the audit:
SELECT
stage.name AS stage_name,
stage.records_read AS rows_in,
stage.records_written AS rows_out,
stage.slot_ms AS slot_ms,
stage.shuffle_output_bytes AS shuffle_bytes,
stage.wait_ratio_avg AS wait_ratio
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
UNNEST(job_stages) AS stage
WHERE job_id = '@job_id'
ORDER BY stage.id;
Step-by-step explanation.
- Sort by
slot_ms— the stage with the largest slot consumption is the cost driver. In the example, Stage S00 (Read events) dominates at 32 000 slot ms — that means the table read is the bottleneck. - Sort by
shuffle_bytes— a high-shuffle stage is a JOIN or aggregation that re-distributes data. Stage S02 here is the join, shuffling 4.8 GB. - Compare
rows_inandrows_out— whenrows_out≪rows_inearly, the optimiser is filtering well; when it stays equal, you're scanning unnecessary rows. - Look at
wait_ratio_avg— high values (>0.3) indicate the stage spent significant time waiting for slots, signalling slot pressure on the reservation. - Action plan: if Read dominates → partition / cluster the source. If Shuffle dominates → reduce join cardinality (filter earlier, broadcast smaller side). If Wait dominates → add slots or move to a less-busy reservation.
Output.
| symptom in plan | fix |
|---|---|
| Read stage has the highest slot ms | partition + cluster the source table |
| Shuffle stage is multi-GB | filter / aggregate before the join |
| Wait ratio > 0.3 on multiple stages | bigger reservation or off-peak schedule |
| rows_in ≈ rows_out across stages | missing filter; project columns + add WHERE |
Rule of thumb. Always read the plan after any query that takes longer than 10 seconds or processes more than 50 GB. Slot ms tells you cost; shuffle bytes tells you join shape; wait ratio tells you whether your reservation is starved.
SQL interview question on UI literacy
A senior interviewer often opens with: "Walk me through how you'd diagnose a slow BigQuery query using only the console — no CLI, no API." The answer probes whether the candidate can name the right panel for each step.
Solution Using the console panels in order
-- 1) Open the job in the Job history rail (right strip).
-- 2) Click the row → "View details" → reads the same fields shown below.
-- 3) Copy job_id and paste into INFORMATION_SCHEMA for a programmatic audit.
SELECT
job_id,
user_email,
total_bytes_processed / POW(10, 9) AS gb_processed,
total_slot_ms / 1000 AS slot_sec,
cache_hit,
end_time - start_time AS wall_clock
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = '@job_id';
Step-by-step trace.
| panel | what it told us | conclusion |
|---|---|---|
| Job history rail | "job_xyz · 200 GB · 90 s · slot ms 35 200" | not cached, real work |
| Validator (re-dry-run) | 200 GB → matches → no surprise scan | filter is doing its job |
| Query plan tab | S00 Read = 32 000 slot ms | source table read is the bottleneck |
| Schema tab | source table — partitioned by event_ts HOUR |
partition column is hourly, query filtered by date |
| Fix | rewrite WHERE to filter by hour | bytes drop ~24× |
The trace teaches the diagnostic discipline: confirm the dry run, look at the plan, check the source schema, and only then change the query. Most "the query is slow" tickets are solved by aligning the WHERE clause with the partition column at the right granularity.
Output:
| metric | before | after |
|---|---|---|
| bytes processed | 200 GB | ~9 GB |
| slot ms | 35 200 | 1 500 |
| wall clock | 90 s | 6 s |
Why this works — concept by concept:
- The job history rail is the entry point — every console-only diagnosis starts here; the rail gives you the job_id, the headline numbers, and the link to the plan in one click.
- The validator card is your sanity check — re-running a dry run after you've understood the plan confirms whether your proposed fix would actually reduce bytes.
- The query plan tab surfaces stage-level cost — sorting stages by slot ms localises the bottleneck to a single operator, which is what your fix must target.
- The schema tab anchors fixes to physical layout — partition and cluster keys live on the table, not in your SQL. Aligning the WHERE to the partition key is usually the highest-leverage edit.
- Cost — diagnosis is O(stages) human time; the fix is O(1) edit but saves linear cost for every subsequent run. Pays for itself in one rerun.
SQL
Topic — optimization
Query optimisation problems (SQL)
3. Query editor workflow — dry run → run → save → schedule
Eight steps, two rows, one repeatable habit — the editor workflow most data engineers internalise
The mental model in one line: every BigQuery query passes through eight predictable steps — write, dry-run, cache-toggle, run, inspect plan, save, schedule, share — and the engineers who feel fastest are the ones who internalise the steps in that order. Once you can recite the eight steps, the editor stops being a UI to click through and becomes a pipeline you walk down on autopilot.
The eight steps in detail.
-
Step 1 — Write or paste SQL. Gemini autocomplete suggests joins, CTEs, and CASE branches as you type. Accept with
Tab; reject withEsc. The "Explain this query" prompt produces a plain-English summary — useful when reviewing someone else's SQL. - Step 2 — Validator dry run. Runs automatically. Reports bytes processed → cost in dollars (US $5 per TB on-demand). Surfaces region and billing-project mismatches before you can click Run.
- Step 3 — Cache toggle. "Use cached results" is on by default. BigQuery caches every query result for 24 hours, keyed by the canonicalised SQL. Cache hit = free read. Toggle off when you need fresh data (e.g. after a backfill).
-
Step 4 — Run.
Cmd-Entersubmits the job. Results stream into the bottom panel as the job completes. The bottom panel toggles between Table, JSON, and Chart. - Step 5 — Inspect query plan. After the run, the bottom panel exposes a "Query plan" tab. Stages, shuffle bytes, slot ms — the BigQuery EXPLAIN equivalent.
- Step 6 — Save. Click "Save" → "Save as" → choose Saved Query (private / project-wide), View, Materialized View, or Table. Saved queries live in the Explorer under "Saved queries"; views and MVs live under the dataset.
- Step 7 — Schedule. From the saved query menu, "Schedule" opens the Data Transfer Service flow: pick a cron expression or use the simplified UI (every hour, every day at 8 AM, etc.). The schedule runs as a service account.
- Step 8 — Share. "Share" → copy a permalink to the query, "Open in Looker Studio" for a chart, "Save results" → CSV / JSON / Avro / Sheets / GCS.
What the validator actually tells you, in plain English.
- Bytes processed. The total compressed bytes BigQuery would scan. Multiply by ~$5 / TB to get the on-demand dollar estimate. The validator does this for you.
-
Region. "This query references a dataset in
us-east1but your default region isus." That mismatch fails the job at run time; better to catch it pre-flight. - Billing project. Which project will be billed for the slot ms and bytes processed. Defaults to the project selector but you can override per-query.
- Partition-pruning preview. "The optimiser will scan 3 of 365 partitions." Confirms the WHERE filter is doing its job.
Cache semantics in one paragraph.
The cache is a global, per-user, deterministic key over the canonicalised SQL string and the destination tables' last-modified timestamps. Two identical queries from the same user inside 24 hours = cache hit = free. Adding CURRENT_TIMESTAMP() to the SELECT defeats the cache; using DML on a referenced table invalidates the cache. The cache is not shared across users — your colleague's free hit isn't yours.
The save-and-schedule contract.
-
Saved query. Lives in your project under your IAM. Persists across browser sessions. Can be parameterised with
@paramplaceholders. - View. Read-only virtual table. The SELECT is recomputed every time the view is queried — no storage, no scheduling, no cost when idle.
- Materialized view. Pre-computed result, auto-refreshed when underlying tables change. Storage cost + refresh cost; query cost drops dramatically.
- Scheduled query. The same SELECT, run on a schedule, with a destination table. Owned by the user who set it up (or by a service account, recommended).
The share menu, in one breath.
Copy permalink → paste in a PR or Slack. Open in Looker Studio → a chart with one click. Save results → CSV / JSON / Avro to local download, Sheets to Drive, or GCS for downstream pipelines. Export to a destination table → for reuse without re-running.
Worked example — first run of a new query
Detailed explanation. A new engineer is asked to "count events per region for yesterday." The naive query is one line. The careful workflow is to (1) write it, (2) read the validator card, (3) trim if expensive, (4) run, (5) glance at the plan to confirm partition pruning fired.
Question. Walk through writing a "yesterday's events by region" query, reading the validator, fixing the cost, and running it.
Input.
| step | action | observation |
|---|---|---|
| 1 | naive SELECT region, COUNT(*) FROM events GROUP BY region
|
validator: 8 TB, $40 — red |
| 2 | add WHERE event_date = CURRENT_DATE() - 1
|
validator: 22 GB, $0.11 — green |
| 3 | run | wall clock 4 s; cache miss; plan clean |
Code.
-- v0 — naive, will be flagged red by validator.
SELECT region, COUNT(*) AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
GROUP BY region;
-- v1 — partition-aware, validator-approved.
SELECT region, COUNT(*) AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
GROUP BY region
ORDER BY event_count DESC;
Step-by-step explanation.
- Type the naive query. The validator immediately shows 8 TB / $40 — too expensive for an ad-hoc count.
- Add
WHERE event_date = CURRENT_DATE() - 1. The validator updates to 22 GB / $0.11. The partition filter is doing its job. - Click Run. Wall clock is ~4 seconds.
- Glance at the Query plan tab: Stage 0 (Read) reports 22 GB read — matches the dry run. No surprises.
- Save the query as a saved query named "events_by_region_yesterday." Now it's promotable to a scheduled query if it stabilises.
Output (one example day).
| region | event_count |
|---|---|
| US | 4 218 503 |
| EU | 2 974 110 |
| APAC | 1 088 442 |
Rule of thumb. Never click Run when the validator card is red unless you intend to pay that bill. The validator is the cheapest review you'll get.
Worked example — re-running with the cache for free reads
Detailed explanation. A dashboard polls the same query every minute during a stand-up. The first run costs $0.11; the next 23 hours of identical runs are free because the result is cached. Toggling "Use cached results" off forces a fresh scan — useful when an upstream load has just landed.
Question. Show the SQL that would always defeat the cache, the one that uses the cache, and how to tell from the job-history rail.
Input — three queries in three minutes.
| run | SQL | cache hit? | cost |
|---|---|---|---|
| 1 | SELECT region, COUNT(*) FROM ... WHERE event_date = '2026-06-05' GROUP BY region |
no (first) | $0.11 |
| 2 | identical SQL | yes | $0 |
| 3 | append SELECT CURRENT_TIMESTAMP()
|
no (non-determinism) | $0.11 |
Code.
-- Cache-friendly: the SQL canonicalises to the same string every run.
SELECT region, COUNT(*) AS c
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = '2026-06-05'
GROUP BY region;
-- Cache-defeating: CURRENT_TIMESTAMP() makes every result row "new."
SELECT region, COUNT(*) AS c, CURRENT_TIMESTAMP() AS run_at
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = '2026-06-05'
GROUP BY region;
Step-by-step explanation.
- The first run canonicalises the SQL and stores the result in the per-user cache with a 24-hour TTL.
- Run 2 — identical canonicalised SQL → cache hit → no slot ms billed, no bytes processed. The job-history rail shows
cache_hit = TRUE. - Run 3 —
CURRENT_TIMESTAMP()is non-deterministic; BigQuery refuses to cache. Cache miss → full scan again. - The job-history rail surfaces the
cache_hitboolean on every row; INFORMATION_SCHEMA.JOBS surfaces it too ascache_hit. - Cache-friendly authoring: use literal date / timestamp values instead of
CURRENT_DATE()when possible, or parameterise the date and pass@todayfrom the orchestrator.
Output (job history view).
| run | cache_hit | bytes_processed | $ cost |
|---|---|---|---|
| 1 | FALSE | 22 GB | 0.11 |
| 2 | TRUE | 0 | 0 |
| 3 | FALSE | 22 GB | 0.11 |
Rule of thumb. Cache-friendly SQL is free SQL. If your query needs the latest run timestamp, surface it as a literal parameter @as_of_ts set by the orchestrator, not as CURRENT_TIMESTAMP() in the SELECT.
Worked example — promoting a saved query to a scheduled query
Detailed explanation. Once a query has run identically three days in a row, it is a candidate for scheduling. The console's "Schedule" button opens the Data Transfer Service flow: pick a cron, pick a destination table, pick a service account, save. The query then runs without human attention.
Question. Promote the "events_by_region_yesterday" saved query into a scheduled query that runs every day at 08:00 UTC and writes to pipecode-demo.mart.events_by_region_daily.
Input — schedule form fields.
| field | value |
|---|---|
| Schedule name | events_by_region_yesterday_daily |
| Cron |
0 8 * * * (every day 08:00 UTC) |
| Destination table | pipecode-demo.mart.events_by_region_daily |
| Write preference |
WRITE_APPEND (or WRITE_TRUNCATE if rebuild) |
| Service account | bq-scheduler@pipecode-demo.iam.gserviceaccount.com |
| Notifications | Cloud Logging filter + alert policy on severity=ERROR
|
Code.
-- Scheduled-query SQL: parameterise the date so the cache works
-- on backfills and so the destination is partition-aware.
SELECT
@run_date AS event_date,
region,
COUNT(*) AS event_count
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = @run_date
GROUP BY region;
Step-by-step explanation.
- Click "Save" → "Saved query" on the working query. Then open the saved query and click "Schedule."
- The cron field accepts standard cron (or the simplified UI).
0 8 * * *runs every day at 08:00 UTC. - The destination table can be partitioned by
event_date— the scheduled query then appends one partition per run. - The service account is the one that runs the query. It needs
bigquery.dataEditoron the destination dataset andbigquery.jobUseron the billing project. - The
@run_dateparameter is set by BigQuery DTS to the logical execution date — useful for clean backfills and idempotent appends. - Failures route to Cloud Logging; an alert policy turns those into Slack or email pages.
Output.
| schedule_id | next_run | last_run_status | last_destination_row_count |
|---|---|---|---|
| sch_abc | 2026-06-07 08:00 UTC | SUCCESS | 18 |
Rule of thumb. Stable queries deserve a schedule. Unstable queries (different SQL every run) deserve a notebook or an orchestrator. Don't schedule a query you're still editing; the silent failures will follow you for months.
SQL interview question on workflow discipline
The interviewer's probe: "Walk me through how you author a new BigQuery query you've never run before. Where do you click, what do you read, when do you actually press Run?" The answer is the eight-step workflow you can describe in one breath.
Solution Using the eight-step workflow
-- The query in question — a per-day per-region revenue rollup.
SELECT
event_date,
region,
SUM(amount) AS revenue
FROM `pipecode-demo.mart.fact_orders`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND CURRENT_DATE() - 1
GROUP BY event_date, region
ORDER BY event_date DESC, revenue DESC;
Step-by-step trace.
| step | action | observation / decision |
|---|---|---|
| 1 Write | typed in the editor | Gemini autocompletes the GROUP BY |
| 2 Dry run | validator shows 14 GB / $0.07 | green — proceed |
| 3 Cache | "Use cached results" stays ON | second run will be free |
| 4 Run | Cmd-Enter | wall clock 3 s, returns 14 rows |
| 5 Plan | check Query plan tab | Read 14 GB; no shuffle hot spot |
| 6 Save | Save → Saved query "weekly_revenue" | private to me until peer-reviewed |
| 7 Schedule | not yet | needs to run 3 days clean first |
| 8 Share | copy permalink → paste in PR | colleague can re-run unchanged |
The discipline is to read the validator before running, glance at the plan after running, and save before sharing. Skipping any step works — until the day it doesn't, and you spend an afternoon retracing your steps.
Output:
| event_date | region | revenue |
|---|---|---|
| 2026-06-05 | US | 88 412 |
| 2026-06-05 | EU | 62 109 |
| 2026-06-05 | APAC | 22 803 |
| 2026-06-04 | US | 91 220 |
| ... | ... | ... |
Why this works — concept by concept:
- The validator is your sanity check — every query gets a free dry run; ignoring the red card is what causes $100 typos.
- The cache makes ad-hoc free — cache-friendly SQL is genuinely free on the second through the 24-hour-th run; structure your SELECT to keep it that way.
- The plan is your debugger — slot ms by stage tells you where the time went; reading the plan is the difference between fixing the query and rerunning it.
- Save before share — a saved query has a stable URL and a stable owner; a "paste this into your editor" workflow has neither.
- Schedule only when stable — three identical days in a row is a reasonable promotion criterion; before that, leave the query in the saved-queries bucket.
- Cost — O(1) human time per step; the saved hours per week scale linearly with team size. The eight-step habit pays for itself within a week.
SQL
Topic — aggregation
Aggregation problems (SQL)
4. Scheduled queries, Workflows, and Cloud Scheduler
Five orchestration tools, one principle — match the orchestrator to the dependency graph
The mental model in one line: scheduled queries are for one-step crons, Workflows are for branching multi-step jobs, Cloud Composer is for full DAGs with sensors, Dataform is for versioned SQLX with tests, and BigQuery Studio notebooks are for ad-hoc Python plus SQL that occasionally needs a schedule. Match the orchestrator to the dependency graph and the bill plus the failure modes both improve.
The five layers, simplest-first.
- Scheduled queries (BigQuery Data Transfer Service). Built-in cron-style scheduling for a single SQL query. UI lives inside the console. No parameter loops, no upstream dependencies, no branching.
- Cloud Workflows. YAML-defined multi-step jobs with branches and retries. Lives outside BigQuery but invokes BigQuery jobs via a connector. Cheap, serverless, no infrastructure to manage.
- Cloud Composer (managed Airflow). Full Airflow DAGs with sensors, backfills, complex dependencies. Managed environment costs ~$300/month at the lowest tier — worth it for any pipeline beyond a handful of steps.
-
Dataform (SQLX +
ref()). The BigQuery-native dbt equivalent. SQL models declare upstream tables withref('upstream_model'), Dataform builds the DAG automatically, runs assertions, and ships the SQL with the rest of your VCS. - BigQuery Studio notebooks. Mixed Python + SQL notebooks. Can be scheduled like a query. Best for ML feature engineering and ad-hoc exploration that needs to run nightly.
Scheduled queries — what to expect.
- Setup. Console → saved query → Schedule → cron + destination. Lives in BigQuery Data Transfer Service.
-
Limitations. Single project. One SQL statement (you can
CALLa stored procedure for multi-step logic, but no first-class step graph). No parameter loops. No upstream dependency on another DTS run. - Best for. Daily / hourly aggregates that produce a single destination table. The "the same SQL, every day" pattern.
-
Permissions. Runs as a service account; the account needs
bigquery.dataEditoron the destination dataset andbigquery.jobUseron the billing project.
When to graduate to Workflows.
- Two-step ETL with a branch. "Run query A; if A returns rows, run query B; otherwise alert." Scheduled queries can't branch; Workflows can.
-
Multi-region jobs. Run a query in
us, then a follow-up ineu. Workflows orchestrates both with one YAML. - Mixing BigQuery with other services. Run a query, then call a Cloud Run service to refresh a downstream cache.
When to graduate to Composer.
-
DAGs with sensors and backfills. "Wait for a GCS file to appear, then run the load." Airflow's
GCSObjectExistenceSensoris the canonical answer. - Complex dependencies. A pipeline with 30+ tasks, lineage, retries with exponential backoff, dynamic task generation.
- Existing Airflow team. If your team already runs Airflow elsewhere, Composer matches the muscle memory.
Dataform — when to ship SQLX.
-
Versioned + tested SQL. Every model is a
.sqlxfile in a Git repo. Pull requests, code review, CI assertions. -
Dependency graph from
ref(). WriteSELECT * FROM ${ref('events_daily')}and Dataform builds the upstream DAG automatically. - dbt-style assertions. "Every row's order_id is unique," "every row's region is in the allowed set."
BigQuery Studio notebooks — when to schedule a notebook.
- Mixed SQL + Python. A Python feature-engineering cell followed by a SQL aggregation cell.
- ML training. Train a model in Python; persist results to BigQuery; schedule the whole notebook nightly.
- Stakeholder-facing reports. Render charts inline; schedule a daily run; email the rendered notebook.
Monitoring failures, regardless of layer.
-
Cloud Logging filter. Every BigQuery job emits a structured log line. Filter on
severity=ERROR AND resource.type="bigquery_resource"to catch failures across the project. - Alert policy. Translate the filter into a Monitoring alert; route to Slack, email, or PagerDuty.
- Per-pipeline DLQ. For Composer / Dataform / Workflows, ship failed-row data to a dead-letter table so you can replay after fixing the root cause.
Worked example — author a scheduled query from scratch
Detailed explanation. A team wants the same per-day-per-region revenue rollup to land in a mart table every morning at 06:00 UTC. No branching, no upstream dependency on another job. Scheduled queries are the right tool.
Question. Author the SQL, choose the schedule, pick the destination, and identify the IAM the runner service account needs.
Input.
| field | value |
|---|---|
| Cron | 0 6 * * * |
| Destination | pipecode-demo.mart.daily_revenue |
| Write preference | WRITE_APPEND |
| Service account | bq-scheduler@pipecode-demo.iam.gserviceaccount.com |
| Service account IAM |
bigquery.dataEditor on mart, bigquery.jobUser on project |
Code.
-- The SQL the scheduled query runs every morning.
INSERT INTO `pipecode-demo.mart.daily_revenue`
(event_date, region, revenue)
SELECT
@run_date AS event_date,
region,
SUM(amount) AS revenue
FROM `pipecode-demo.mart.fact_orders`
WHERE event_date = @run_date
GROUP BY region;
Step-by-step explanation.
- Save the SELECT as a saved query first; the schedule form expects a saved-query template.
- Click "Schedule" → "Create new schedule." Pick the cron
0 6 * * *(06:00 UTC daily). - Set the destination table; choose "Append to existing" so each day's partition lands cleanly.
- Pick the service-account runner. The account must have
bigquery.dataEditoron the destination dataset andbigquery.jobUseron the billing project — without either, the run fails at job submission. - Add a Cloud Logging filter for
severity=ERRORscoped to this schedule's job naming pattern; wire it to a Monitoring alert policy that posts to Slack. - Set
@run_dateto be_PARTITIONDATEso backfills replay correctly without parameter editing.
Output (first run).
| event_date | region | revenue |
|---|---|---|
| 2026-06-05 | US | 88 412 |
| 2026-06-05 | EU | 62 109 |
| 2026-06-05 | APAC | 22 803 |
Rule of thumb. Always make the service-account runner explicit. Never let scheduled queries run as a human user — when that human leaves the company, every schedule they owned breaks silently.
Worked example — promoting a two-step job to Cloud Workflows
Detailed explanation. A pipeline grows past one query: "first refresh the staging table, then refresh the mart table." Scheduled queries can't express the dependency; Workflows can with one tiny YAML.
Question. Write a Cloud Workflows definition that runs the staging query, waits for success, then runs the mart query.
Input.
| step | what it does |
|---|---|
| refresh_staging | CALL pipecode_demo.refresh_stg_events() |
| wait | implicit — sequential |
| refresh_mart | CALL pipecode_demo.refresh_mart_orders() |
| notify | post Slack message on failure |
Code.
main:
steps:
- refresh_staging:
call: googleapis.bigquery.v2.jobs.query
args:
projectId: pipecode-demo
body:
query: "CALL `pipecode-demo.stg.refresh_stg_events`()"
useLegacySql: false
result: stg_result
- refresh_mart:
call: googleapis.bigquery.v2.jobs.query
args:
projectId: pipecode-demo
body:
query: "CALL `pipecode-demo.mart.refresh_fact_orders`()"
useLegacySql: false
result: mart_result
- finished:
return: ${mart_result}
Step-by-step explanation.
- Workflows runs steps sequentially by default. The
refresh_martstep doesn't start untilrefresh_stagingreturns. - Each
googleapis.bigquery.v2.jobs.querycall submits a job and waits for completion (sync mode). - If either step fails, the Workflow stops; the failure surfaces in Cloud Logging with the step name attached.
- Trigger the Workflow on a cron schedule via Cloud Scheduler (which targets a Workflows execution).
- For branching: replace the linear sequence with a
switchstep that picks the next step based on the previous result.
Output (a successful run).
| step | duration | status |
|---|---|---|
| refresh_staging | 12 s | OK |
| refresh_mart | 8 s | OK |
| total wall clock | ~20 s | OK |
Rule of thumb. As soon as the pipeline has two dependent steps, you've outgrown scheduled queries. Workflows is the lowest-friction next rung — cheaper and simpler than Composer for jobs that don't need sensors or backfills.
Worked example — a Composer DAG that backfills 30 days
Detailed explanation. Composer is necessary when you need sensors, complex dependencies, or backfills. The canonical use case: a nightly DAG that ingests yesterday's GCS file, runs three BigQuery transformations, and refreshes a dashboard.
Question. Sketch an Airflow DAG that (a) waits for a GCS file to appear, (b) loads it into BigQuery staging, (c) runs the staging-to-mart transform, (d) refreshes a Looker dashboard cache.
Input — DAG shape.
| task_id | operator | depends_on |
|---|---|---|
| wait_for_file | GCSObjectExistenceSensor | -- |
| load_staging | GCSToBigQueryOperator | wait_for_file |
| stg_to_mart | BigQueryInsertJobOperator | load_staging |
| refresh_dashboard | HttpOperator | stg_to_mart |
Code.
from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from datetime import datetime
with DAG(
"daily_marketing_pipeline",
start_date=datetime(2026, 1, 1),
schedule="0 7 * * *",
catchup=True, # enables backfill
max_active_runs=4,
) as dag:
wait_for_file = GCSObjectExistenceSensor(
task_id="wait_for_file",
bucket="pipecode-demo-raw",
object="events/{{ ds }}/events.json",
)
load_staging = GCSToBigQueryOperator(
task_id="load_staging",
bucket="pipecode-demo-raw",
source_objects=["events/{{ ds }}/events.json"],
destination_project_dataset_table="pipecode-demo.stg.events_raw${{ ds_nodash }}",
source_format="NEWLINE_DELIMITED_JSON",
write_disposition="WRITE_TRUNCATE",
)
stg_to_mart = BigQueryInsertJobOperator(
task_id="stg_to_mart",
configuration={
"query": {
"query": "CALL `pipecode-demo.mart.refresh_events`('{{ ds }}')",
"useLegacySql": False,
}
},
)
refresh_dashboard = SimpleHttpOperator(
task_id="refresh_dashboard",
method="POST",
http_conn_id="looker",
endpoint="api/4.0/dashboards/42/refresh",
)
wait_for_file >> load_staging >> stg_to_mart >> refresh_dashboard
Step-by-step explanation.
- The DAG runs at 07:00 UTC daily;
catchup=Truemeans Airflow runs every missed schedule on first deploy — a built-in backfill. -
GCSObjectExistenceSensorpolls the file's path until it appears, then succeeds. Sensors are why Composer wins over Workflows for "wait for upstream." -
GCSToBigQueryOperatorloads the JSON into a date-suffixed staging table (events_raw$20260605). -
BigQueryInsertJobOperatorruns the stored procedure that promotes the row to mart, scoped to the run date. -
SimpleHttpOperatorposts to the Looker API to refresh the dashboard cache. - The
>>chain expresses the dependency order. Each task waits for its predecessor's success.
Output (one day's run).
| task | duration | status |
|---|---|---|
| wait_for_file | ~30 s | OK (file landed at 06:55) |
| load_staging | 18 s | OK (rows loaded: 2.1 M) |
| stg_to_mart | 22 s | OK |
| refresh_dashboard | 3 s | OK |
| total | ~73 s | OK |
Rule of thumb. Composer is worth its monthly cost the moment you need a sensor, a backfill, or more than ~5 dependent tasks. Below that threshold, Workflows + Cloud Scheduler is the cheaper answer.
SQL interview question on the orchestration ladder
A senior interviewer might ask: "Your team has 30 BigQuery jobs spread across scheduled queries and ad-hoc cron in Cloud Functions. Half are failing silently. Design the migration plan to a sane orchestration story." The answer is a tiered migration onto the ladder.
Solution Using a tiered migration
-- Inventory the 30 jobs into three buckets:
WITH job_inventory AS (
SELECT 'daily_revenue_by_region' AS job, 'scheduled_query' AS surface, 1 AS step_count, FALSE AS has_upstream UNION ALL
SELECT 'mart_refresh_orders', 'scheduled_query', 3, TRUE UNION ALL
SELECT 'gcs_to_bq_events', 'cloud_function', 2, TRUE UNION ALL
SELECT 'feature_engineering_ml', 'cloud_function', 5, TRUE
)
SELECT
job,
CASE
WHEN step_count = 1 AND NOT has_upstream THEN 'scheduled query (keep)'
WHEN step_count <= 3 AND has_upstream THEN 'Workflows (migrate)'
WHEN step_count >= 4 THEN 'Composer (migrate)'
ELSE 'Workflows (default)'
END AS target_surface
FROM job_inventory;
Step-by-step trace.
| job | step_count | has_upstream | current | target |
|---|---|---|---|---|
| daily_revenue_by_region | 1 | no | scheduled_query | scheduled_query (keep) |
| mart_refresh_orders | 3 | yes | scheduled_query | Workflows |
| gcs_to_bq_events | 2 | yes | cloud_function | Workflows |
| feature_engineering_ml | 5 | yes | cloud_function | Composer |
The trace surfaces the migration matrix. One-step jobs stay; two-to-three-step jobs migrate to Workflows; anything four-plus moves to Composer or Dataform. Cloud-Function-based ad-hoc cron is always migrated off because it has no visibility into BigQuery job semantics.
Output:
| target surface | count of jobs | monthly cost delta |
|---|---|---|
| scheduled query (kept) | 12 | $0 |
| Workflows (new) | 10 | +$5 / month |
| Composer (new) | 8 | +$300 / month |
| total | 30 | +$305 / month |
Why this works — concept by concept:
- Match orchestrator to dependency graph — one-step → DTS; branching → Workflows; sensors / DAGs → Composer; versioned SQL → Dataform. The shape of the dependency graph dictates the surface.
- Cost vs visibility tradeoff — Composer costs more but exposes every step in a DAG view; Workflows is cheaper but trades off the visual DAG.
- Cloud-Function cron is a smell — running BigQuery jobs from a Cloud Function deployed via gcloud has no built-in retry, lineage, or audit. Migrate off as a hygiene item.
-
Dataform is the long-term destination — once your SQL is versioned in Git with assertions, Dataform's
ref()graph plus auto-generated DAG is hard to beat. - Cost — O(n_jobs) migration human time, O(monthly) infrastructure cost. The break-even point for Composer is usually 5+ dependent tasks per pipeline.
SQL
Topic — etl
ETL pipeline problems (SQL)
5. Cost & slot monitoring with INFORMATION_SCHEMA
One view, one chart, one guardrail policy — the day-to-day BigQuery cost loop
The mental model in one line: the INFORMATION_SCHEMA.JOBS_BY_PROJECT view tells you who ran what, when, how many bytes it processed, and how many slot ms it burned — and pairing it with cost guardrails (max bytes billed per query, custom quotas, alert thresholds) closes the loop. Once you can run the top-10-cost-burners query from memory, BigQuery cost stops being a mystery line on the bill.
On-demand vs flat-rate pricing (Editions).
- On-demand. $5 per TB of bytes billed, per query. No commitment, no monthly minimum. Best when usage is bursty.
- Editions (Standard / Enterprise / Enterprise Plus). Reserved slot capacity, charged per slot-hour. Predictable cost, "free" queries within the reservation. Best when usage is steady or heavy.
- The crossover point. Roughly 100 TB billed per month on-demand ≈ the cost of a small reservation. Below that, on-demand wins; above, Editions wins.
The one view that runs the whole monitoring loop.
-
INFORMATION_SCHEMA.JOBS_BY_PROJECT. Every job submitted in the project for the last 180 days. Columns includeuser_email,job_id,query,creation_time,total_bytes_processed,total_bytes_billed,total_slot_ms,cache_hit,query_label. -
INFORMATION_SCHEMA.JOBS_BY_USER. Same shape, scoped to the calling user. -
INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION. Org-wide; needs org-level IAM. Useful for FinOps dashboards.
The top-10 cost-burning queries — the query.
SELECT
user_email,
job_id,
total_bytes_billed / POW(10, 12) AS tb_billed,
total_bytes_billed / POW(10, 12) * 5 AS usd_cost,
total_slot_ms / 1000 AS slot_sec,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 10;
Slot usage chart — what to plot.
- X-axis. Hour of the day (or 5-minute buckets for fine-grained dashboards).
- Y-axis. Slots consumed.
- Two series. Average slots and peak (95th percentile or max) slots per bucket.
- What to look for. A flat blue line with a spiky orange line means most slots are bursty; consider a peak-time reservation. A wavy blue line means usage is even; a flat reservation matches well.
BI Engine reservation — when to enable.
- What it is. An in-memory cache for sub-second dashboard queries. Reserved capacity (1–250 GB).
- When it helps. Looker Studio / Looker dashboards that scan the same dimensional tables repeatedly. Latency drops from seconds to milliseconds; cost moves from per-query bytes to a flat hourly reservation.
- When it doesn't. Ad-hoc queries with novel WHERE clauses (cache miss). One-shot exports.
Query labels — the chargeback primitive.
-
What. A key/value pair attached to a job. Examples:
team:de,job:nightly,cost_center:marketing. -
How to set.
SET @@query_label = 'team:de,job:nightly';at the top of the query, or via the API / CLI's--labelflag. -
Why. The label appears in
INFORMATION_SCHEMA.JOBSand is the join key for chargeback dashboards.
Cost guardrails — three layers of defence.
-
Maximum bytes billed per query.
SET @@maximum_bytes_billed = 50000000000;(50 GB). Any query that would exceed this cap is aborted before billing. Per-query; per-user via API. - Custom quotas. Project-level or per-user limits on bytes scanned per day. Hard cap; queries fail when exhausted.
-
Alert policy. Cloud Monitoring threshold on
bigquery.googleapis.com/slots/total_allocated_for_reservationor on a custom log-based metric fortotal_bytes_billed. Routes to Slack / email.
Cost-cutting moves that actually work.
- Select only the columns you need. Column-store pruning is the single biggest cost lever.
- Filter on the partition column first. Pruning skips entire partitions — 10–100× reduction.
- Cluster on filter columns. Clustering accelerates per-block filtering for high-cardinality WHEREs.
- Materialized view for repeated aggregates. Pre-computed daily / hourly rollups make dashboards cheap.
- Wrap dashboards in BI Engine. Sub-second latency, flat hourly cost, no per-query billing.
Worked example — the top-10 cost-burners audit
Detailed explanation. A FinOps engineer runs the top-10-cost-burners query every Monday morning, posts the results in the data-platform Slack channel, and follows up with the authors of the top three. The discipline puts a soft tax on expensive queries and surfaces real optimisation candidates.
Question. Given the audit query above, write the follow-up query that aggregates by user_email to find the cost owners over the last week, and rank them.
Input — five sample rows from JOBS_BY_PROJECT.
| user_email | job_id | bytes_billed | slot_ms | query_label |
|---|---|---|---|---|
| data-team@... | job_001 | 2.1 TB | 1 400 000 | team:de,job:nightly |
| data-team@... | job_002 | 410 GB | 290 000 | team:bi,job:dash |
| analyst@... | job_003 | 88 GB | 41 000 | adhoc |
| ml-team@... | job_004 | 920 GB | 680 000 | team:ml,job:train |
| data-team@... | job_005 | 9 GB | 4 000 | team:de,job:hourly |
Code.
SELECT
user_email,
COUNT(*) AS jobs_count,
SUM(total_bytes_billed) / POW(10, 12) AS tb_billed,
SUM(total_bytes_billed) / POW(10, 12) * 5 AS usd_cost,
SUM(total_slot_ms) / 1000 AS slot_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY usd_cost DESC
LIMIT 10;
Step-by-step explanation.
- Filter to the last 7 days and to
job_type = 'QUERY'(excludes load and copy jobs, which have different billing). - Group by
user_emailto collapse to one row per cost owner. - SUM the bytes billed (the units BigQuery actually charges) and convert to TB by dividing by 10^12.
- Multiply by 5 (the per-TB on-demand price) to surface the dollar cost.
- Sum slot ms separately — useful when you're on Editions (slot-second pricing) instead of on-demand.
- ORDER BY the dollar cost DESC and LIMIT 10 — these are the conversations FinOps wants to have.
Output.
| user_email | jobs_count | tb_billed | usd_cost | slot_sec |
|---|---|---|---|---|
| data-team@... | 322 | 18.4 | 92.00 | 11 200 |
| ml-team@... | 88 | 6.1 | 30.50 | 4 800 |
| analyst@... | 612 | 1.9 | 9.50 | 880 |
Rule of thumb. Run this audit weekly. The biggest wins are almost always in the top-3 owners; the rest of the project tends to be statistical noise.
Worked example — finding the un-pruned-partition queries
Detailed explanation. A query that's expensive but doesn't look expensive in the source SQL usually has a missing partition filter. The audit query that surfaces it compares total_bytes_processed against the source table's total size; large ratios mean no partition pruning fired.
Question. Write the query that finds queries which processed more than 100 GB on a partitioned source table without a partition filter.
Input — assumed audit window: last 24h.
| job_id | bytes_processed | source_table | partition_col | filtered_on_partition? |
|---|---|---|---|---|
| job_001 | 800 GB | events_daily | event_date | yes |
| job_002 | 4 TB | events_daily | event_date | no |
| job_003 | 200 GB | orders | order_date | yes |
Code.
-- Find queries that scanned > 100 GB and likely missed a partition filter.
SELECT
job_id,
user_email,
referenced_tables[OFFSET(0)].table_id AS source_table,
total_bytes_processed / POW(10, 9) AS gb_processed,
SUBSTR(query, 1, 200) AS query_snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
UNNEST(referenced_tables) AS referenced_tables
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND job_type = 'QUERY'
AND total_bytes_processed > 100 * POW(10, 9)
AND REGEXP_CONTAINS(query, r'(?i)events_daily')
AND NOT REGEXP_CONTAINS(query, r'(?i)event_date\s*[=<>]')
ORDER BY total_bytes_processed DESC
LIMIT 20;
Step-by-step explanation.
- Restrict to the last 24 hours of
QUERYjobs above the 100 GB threshold. - UNNEST
referenced_tablesto surface the source-table name. - Filter to the specific table you suspect (
events_daily). - Use a negative regex (
NOT REGEXP_CONTAINS) to find queries that don't mention the partition column. This is a heuristic — not a hard rule — but a strong one. - Return the job_id, user, GB processed, and a snippet of the SQL for manual review.
- The output is a short list of queries to rewrite with a partition filter.
Output.
| job_id | user_email | source_table | gb_processed | query_snippet |
|---|---|---|---|---|
| job_002 | analyst@... | events_daily | 4 000 | SELECT region, COUNT(*) FROM events_daily GROUP BY region |
| job_007 | ml-team@... | events_daily | 1 200 | SELECT * FROM events_daily WHERE user_id IN (...) |
Rule of thumb. Pair this audit with a Slack ping to the author. Most "I scanned 4 TB by accident" queries are written by engineers who didn't know the table was partitioned — the ping doubles as documentation.
Worked example — setting the maximum-bytes-billed guardrail
Detailed explanation. Even with good discipline, accidents happen. The maximum_bytes_billed query option is a hard cap that aborts a query before BigQuery bills more than the cap. Setting it to 50 GB per user, per query, is a sensible default for analyst seats.
Question. Show the three ways to apply the cap (per-query, per-session, per-user-default), and explain when each makes sense.
Input.
| level | scope | how it's set |
|---|---|---|
| per-query | one statement | SET @@maximum_bytes_billed = ...; |
| per-session | the whole tab | session-level SET |
| per-user-default | every query that user runs | enforced via API / IAM policy |
Code.
-- Per-query guardrail — 50 GB cap, set at the top of the editor tab.
SET @@maximum_bytes_billed = 50 * POW(10, 9); -- 50 GB
SELECT region, COUNT(*) AS c
FROM `pipecode-demo.dataset_marketing.events_daily`
WHERE event_date = CURRENT_DATE() - 1
GROUP BY region;
-- Per-job, set via the CLI (overrides session default).
-- bq query --use_legacy_sql=false --maximum_bytes_billed=50000000000 ...
-- Per-user default, set via the python client when constructing a QueryJobConfig:
-- job_config = bigquery.QueryJobConfig(maximum_bytes_billed=50 * 10**9)
Step-by-step explanation.
-
SET @@maximum_bytes_billed = ...;at the top of an editor tab sets the cap for every subsequent query in the tab. Reset on tab close. - The CLI flag
--maximum_bytes_billed=...enforces the same cap per-job — useful in CI / ETL scripts. - The Python client lets you set the cap per
QueryJobConfig— embed it in your data-engineering library so every query inherits the cap by default. - When a query exceeds the cap, BigQuery aborts it pre-execution with a clear error: "would exceed maximum_bytes_billed."
- The cap is per-query; a multi-query session can still spend more in aggregate, so pair it with a daily quota for hard upper bounds.
Output.
| query | bytes_billed | result |
|---|---|---|
| query A (under cap) | 12 GB | success |
| query B (over cap) | 80 GB attempted | aborted with "exceeds maximum_bytes_billed" |
Rule of thumb. Default the cap in your team's BigQuery client wrapper. The cost of one engineer forgetting to set it is the exact cost the cap prevents — usually 50× higher than the cap itself.
SQL interview question on cost discipline
The interviewer's probe: "Walk me through how you'd set up a from-scratch BigQuery cost monitoring loop for a 20-engineer team on on-demand pricing." The answer is a four-piece package: top-10 audit, partition-miss audit, query labels, guardrails.
Solution Using the four-piece cost loop
-- Piece 1 — top-10 cost owners weekly
CREATE OR REPLACE TABLE pipecode_demo.ops.weekly_cost_top10 AS
SELECT
user_email,
COUNT(*) AS jobs_count,
SUM(total_bytes_billed) / POW(10, 12) * 5 AS usd_cost
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY usd_cost DESC
LIMIT 10;
-- Piece 2 — un-pruned-partition queries (heuristic)
CREATE OR REPLACE TABLE pipecode_demo.ops.partition_miss_audit AS
SELECT
job_id, user_email,
total_bytes_processed / POW(10, 9) AS gb_processed,
SUBSTR(query, 1, 200) AS snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND total_bytes_processed > 100 * POW(10, 9)
AND REGEXP_CONTAINS(query, r'(?i)events_daily')
AND NOT REGEXP_CONTAINS(query, r'(?i)event_date\s*[=<>]')
ORDER BY gb_processed DESC
LIMIT 20;
-- Piece 3 — enforce query labels via wrapper
-- (In Python: every QueryJobConfig sets labels={'team': team, 'job': name}.)
-- Piece 4 — maximum_bytes_billed guardrail in every editor session
SET @@maximum_bytes_billed = 100 * POW(10, 9); -- 100 GB hard cap
Step-by-step trace.
| piece | cadence | output | action it drives |
|---|---|---|---|
| top-10 owners | weekly | named list | DM the top 3 |
| partition-miss audit | weekly | named list | rewrite + educate |
| query labels | always-on | chargeback table | finance reports |
| 100 GB cap | per-session | aborted query | educates immediately |
The four pieces close the loop: weekly audits surface candidates, query labels give finance a chargeback path, and the cap stops accidental damage before billing. Most teams stop at piece 1; the discipline shows when you have all four.
Output (after one month of running the loop):
| metric | before loop | after loop | delta |
|---|---|---|---|
| weekly cost | $2 400 | $1 600 | -33% |
| largest-single-query bill | $480 | $0 (capped) | -100% |
| % queries with labels | 12% | 94% | +82pp |
| partition-miss queries | 27 / week | 6 / week | -78% |
Why this works — concept by concept:
- INFORMATION_SCHEMA.JOBS as the single source of truth — every monitoring artefact derives from the same view; no out-of-band data export, no parallel dashboard to maintain.
- Top-10 audit as a soft tax — public visibility on cost-burners creates social pressure to optimise without imposing a hard rule.
- Partition-miss heuristic — the regex isn't perfect, but it surfaces the worst 80% of accidents. False positives are cheap; false negatives are expensive.
- Query labels as the chargeback primitive — the label string is free, joinable, and propagates through INFORMATION_SCHEMA. Finance gets a chargeback table without building a parallel pipeline.
- Maximum bytes billed as the safety net — even when an engineer skips the validator, the cap stops a $500 typo at the bytes threshold you actually chose.
- Cost — pieces 1 and 2 are O(jobs) per week (~$0.01 each); piece 4 is free; the savings from -33% weekly cost dwarf the audit's cost in week one.
SQL
Topic — window-functions
Window function problems (SQL)
Cheat sheet — BigQuery console recipes
- Always read the validator before clicking Run. Bytes processed → dollars. A red card is your last chance to stop a $100 typo.
-
SELECT *against a 10 TB table is a $50 click. Always project columns. The validator turns from red to yellow on column projection alone. - Filter on the partition column first. Bytes scanned drops by 10–100× when the WHERE matches the partition layout. Look up the partition column in the Schema tab if you don't know it.
-
LIMITdoes not reduce bytes scanned. BigQuery scans the partition set first, then truncates. Use a partition WHERE to control cost; use LIMIT only for the rendered row count. - Cache hit is free. Toggle "Use cached results" ON for ad-hoc exploration. Identical SQL within 24 hours = $0.
-
Cmd-Enterto run;Cmd-/to comment;Cmd-Kto format. Three shortcuts; a minute saved per query. - Use the Preview tab for "what does this table look like." Free; metadata only; no SQL needed.
- Read the Query plan after any > 10 s query. Slot ms by stage tells you cost; shuffle bytes tells you join shape.
-
INFORMATION_SCHEMA.JOBS_BY_PROJECTis the one view that runs the cost loop. Top-10 owners, partition misses, and labels all derive from it. -
Tag queries with
SET @@query_label = 'team:de,job:nightly';at the top of every editor session. Finance can join the label to JOBS for chargeback. - Save common queries; promote to scheduled queries when stable. Three identical days in a row is a reasonable promotion criterion.
- Match orchestrator to dependency graph. One step → scheduled query. Two-to-three steps → Workflows. DAG with sensors → Composer. Versioned SQL with tests → Dataform.
-
SET @@maximum_bytes_billed = 100 * POW(10, 9);is the safety net for the editor. Set it once per tab. - BI Engine for dashboards that scan the same tables. Flat hourly cost, sub-second latency, no per-query bill.
- Service-account-owned schedules survive team turnover. Never schedule a query as yourself; create a runner SA and grant the minimum two roles.
Frequently asked questions
How do I open the BigQuery console?
Navigate to https://console.cloud.google.com/bigquery in any modern browser. Sign in with the Google account that has GCP IAM permissions on the project. The top-bar project selector lists every project where you have bigquery.user (or higher); pick the one you need. The left Explorer pane shows the datasets you can bigquery.dataViewer over; the centre pane is the multi-tab query editor. Two keyboard shortcuts pay for themselves immediately: Cmd-Enter to run, Cmd-/ to toggle comments. If you want the new unified workbench (notebooks, canvases, Dataform), use the Studio toggle in the top-left.
What is a dry run in BigQuery and how do I do one?
A dry run is a server-side query validation that reports how many bytes the query would scan, what it would cost (on the on-demand model), and whether the query references invalid columns or wrong regions — without actually running. In the console, the validator card on the right of the editor performs a dry run automatically as you type. From the CLI, run bq query --dry_run --use_legacy_sql=false "...". From the Python client, set dry_run=True on the QueryJobConfig. Read the dry run before every Run, especially against partitioned tables, where a missing WHERE clause can be the difference between a $0.01 query and a $50 query.
How much does a BigQuery query cost?
On the on-demand model, BigQuery charges $5 per TB of bytes billed, where "bytes billed" is the sum of column blocks read across all referenced columns and partitions. A query that reads 100 GB pays $0.50; one that reads 1 TB pays $5. Cache hits are free. On Editions (Standard / Enterprise / Enterprise Plus), you pay per slot-hour against a reservation, and individual queries cost nothing extra within the reservation's capacity. The crossover is around 100 TB billed per month — below, on-demand wins; above, Editions wins. To audit cost per query, use INFORMATION_SCHEMA.JOBS_BY_PROJECT.total_bytes_billed.
What are scheduled queries and when should I use them instead of Airflow?
Scheduled queries are BigQuery's built-in cron, powered by the Data Transfer Service. You attach a cron expression to a saved query, pick a destination table, pick a runner service account, and BigQuery handles the rest — no infrastructure to manage. Use them for one-step jobs: a daily aggregate, an hourly rollup, a single SQL statement that writes to one destination. The moment your pipeline has two dependent steps, an upstream sensor, a backfill, or branching logic, graduate up the ladder to Cloud Workflows, Cloud Composer (managed Airflow), or Dataform. The rule of thumb: scheduled queries are for SQL that is the pipeline; Composer is for SQL that's part of a pipeline.
How do I see who's running expensive queries in my BigQuery project?
Query INFORMATION_SCHEMA.JOBS_BY_PROJECT with a WHERE on creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) and job_type = 'QUERY', then GROUP BY user_email and ORDER BY SUM(total_bytes_billed) DESC — the top 10 rows are your cost owners. Pair with a partition-miss audit (NOT REGEXP_CONTAINS(query, partition_col) on queries above 100 GB) to surface the rewrites that matter most. Run the audit weekly, post the results in your team's Slack, and follow up with the top three. The visibility alone usually drops weekly cost by 20–40% in the first month.
What's the difference between BigQuery Studio and the classic console?
The classic BigQuery console is a SQL editor with an Explorer pane, a validator card, and a job history rail — optimised for "write a query, see the result." BigQuery Studio is the 2024+ unified workbench that bundles SQL queries, Python / SQL notebooks, data canvases (drag-and-drop graphical query builder), Dataform projects, saved queries, and ML models under one left-pane navigator. You can schedule a notebook the same way you schedule a query; you can mix %%sql and bigframes Python in the same notebook. The Studio toggle is in the top-left of the console; switching is non-destructive. In 2026, most teams default to Studio for new work and keep the classic view bookmarked for legacy IAM workflows.
Practice on PipeCode
- Drill the SQL practice library → for the BigQuery-flavoured ANSI SQL that the workbench rewards.
- Rehearse on aggregation problems → for the GROUP BY queries you'll write all day.
- Sharpen window function drills → for the patterns that drive most of the slot bill.
- Layer the joins practice library → for the multi-table queries that dominate mart-layer SQL.
- Stack the CTE practice library → — every multi-step BigQuery query lives or dies on clean CTEs.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the SQL axis with the SQL for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
- To frame BigQuery in a broader stack, study ETL system design for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every console recipe above ships with hands-on practice rooms where you write the partition-aware WHERE, the INFORMATION_SCHEMA audit, and the LEFT JOIN with COALESCE against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your BigQuery query would survive a code review on a real team's reservation.





Top comments (0)