DEV Community

Cover image for GCP Data Engineering: BigQuery, Dataflow, Pub/Sub, Composer
Gowtham Potureddi
Gowtham Potureddi

Posted on

GCP Data Engineering: BigQuery, Dataflow, Pub/Sub, Composer

GCP data engineering rests on four pillars every Google Cloud data team ships against: BigQuery (the serverless analytical warehouse — Dremel query engine on Colossus storage, slot-based compute, partitioning + clustering, Editions reservations), Cloud Dataflow (the unified Apache Beam runner for batch + streaming pipelines, with autoscaling workers, windows, watermarks, and triggers), Pub/Sub (the planet-scale event bus — at-least-once delivery, ordering keys, dead-letter topics, push vs pull subscriptions), and Cloud Composer (managed Apache Airflow on GKE Autopilot, with operator-rich DAGs that glue everything together). Whether you're building a new GCP data platform or prepping for GCP data engineer interview loops at Google, Spotify, Twitter, or any cloud-native shop, the same four services dominate the architecture diagram.

This guide walks through each pillar end-to-end as a teach-by-example GCP playbook. Inside BigQuery we unpack Dremel, Colossus, the Jupiter network, slots (on-demand and Editions reservations), PARTITION BY DATE, CLUSTER BY user_id, partition pruning, cluster pruning, and BigLake Iceberg. Inside Dataflow we walk the Apache Beam unified model — ParDo, GroupByKey, Window, Combine — plus fixed / sliding / session windows, watermark + late data + triggers, and worker autoscaling. The Pub/Sub + Composer section covers Pub/Sub at-least-once + ordering keys + DLQ, push (Cloud Run) vs pull (Dataflow) subscriptions, Composer 2/3 on GKE Autopilot, and a real Airflow DAG with GCSObjectExistenceSensorDataflowTemplatedJobStartOperatorBigQueryInsertJobOperator. We close with the wider GCP ecosystem stack so you can place every service on a four-layer mental map.

PipeCode blog header for a GCP data engineering tutorial — bold white headline 'GCP Data Engineering' with subtitle 'BigQuery · Dataflow · Pub/Sub · Composer' and a stylised four-service flow on a dark gradient with blue, purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps alongside the reading, browse Python practice library →, drill SQL practice library →, grind ETL Python drills →, sharpen streaming Python drills →, or rehearse aggregation problems (SQL) →.


On this page


1. GCP data engineering overview — the four-service spine

The four services every GCP data platform leans on — and how they wire together

The one-sentence invariant: a modern GCP data engineering stack ingests events via Pub/Sub, transforms with Dataflow (batch or streaming), warehouses in BigQuery, and orchestrates everything via Cloud Composer (Airflow) — with GCS as the universal landing zone underneath. Memorising this spine is the fastest way to make every other GCP design decision concrete.

The four pillars at a glance.

  • BigQuery — serverless analytical warehouse; SQL on petabyte tables; pay per TB scanned (on-demand) or per slot-hour (Editions); cheap, fast, near-zero-ops.
  • Cloud Dataflow — managed Apache Beam runner; one programming model covers batch and streaming; autoscales workers; the canonical transform layer between Pub/Sub and BigQuery.
  • Pub/Sub — planet-scale, at-least-once messaging; topics + subscriptions; the entry point for events into the pipeline.
  • Cloud Composer — managed Apache Airflow on GKE Autopilot; the orchestration layer for batch DAGs, sensors, and dependencies between Dataflow + BigQuery jobs.

The canonical end-to-end pipeline.

Producer  →  Pub/Sub topic  →  Dataflow streaming  →  BigQuery (real-time table)
                                                     ↘  GCS (Parquet raw)
                                                                     ↑
                                                              Composer DAG
                                                              schedules nightly
                                                              Dataflow batch +
                                                              BigQueryInsertJob
                                                              loads + reports.
Enter fullscreen mode Exit fullscreen mode

Why this four-service spine wins on GCP.

  • Serverless by default — every layer scales without cluster babysitting (no Hadoop YARN, no Spark cluster lifecycle).
  • Unified programming model — Apache Beam covers batch and streaming with the same code; no Spark vs Flink split.
  • Tight integrations — Pub/Sub → Dataflow → BigQuery is a first-class path; native templates and connectors exist for every hop.
  • Cost levers are explicit — partitioning + clustering for BigQuery, autoscaling caps for Dataflow, snapshot retention for Pub/Sub — each one is a knob, not a guess.
  • Composer brings Airflow — battle-tested orchestration semantics (sensors, retries, SLAs, backfills) without standing up Airflow yourself.

What interviewers and architects test on the GCP track.

  • Do you pick the right service per layer? — Dataflow vs Dataproc (Beam vs Spark), Pub/Sub vs Kafka, Composer vs Cloud Workflows.
  • Do you optimise BigQuery cost? — partitioning, clustering, materialized views, BI Engine, Editions vs on-demand.
  • Do you handle streaming correctness? — watermark, late data, triggers, exactly-once via Dataflow + BigQuery streaming inserts.
  • Do you wire reliability? — Pub/Sub DLQs, Dataflow drains, Composer task retries, BigQuery transactional MERGE.

Dataflow vs Dataproc — the most-asked GCP comparison.

  • Dataflow — serverless Apache Beam; autoscaling; pay per vCPU-hour; the default for new pipelines.
  • Dataproc — managed Spark / Hadoop clusters; pay per VM; the right call when you have existing Spark code or need fine-grained cluster control.
  • The 2026 default — start with Dataflow for new pipelines; lift-and-shift Spark to Dataproc; migrate Dataproc to BigQuery / Dataflow when you can.

BigQuery vs Snowflake / Redshift — the warehouse choice.

  • BigQuery — separates storage (Colossus) from compute (Dremel + slots); pay per TB scanned by default; zero cluster management.
  • Snowflake — virtual warehouses (per-second billing); multi-cloud; great workload isolation.
  • Redshift — node-based (or serverless); deeply integrated with AWS.
  • GCP-native answer — BigQuery is the default when your data already lives on GCP; integrate via BigLake to read Iceberg tables on GCS without duplicating data.

Worked example — a Pub/Sub → Dataflow → BigQuery pipeline at a glance

Detailed explanation. A realistic real-time pipeline on GCP: events land in a Pub/Sub topic, a Dataflow streaming job reads them, applies a windowed aggregation, and streams the results into BigQuery. The same code (Beam pipeline) runs as a batch job over a GCS bucket for the nightly backfill — that's the unified-model superpower.

Question. Clickstream events flow at 50k events/sec. We need a 5-minute rolling per-country click count, queryable in BigQuery within 30 seconds of event time.

Pipeline shape (GCP-native).

Browser   ──HTTPS──▶  Pub/Sub topic: clicks-topic
                          │
                          ▼
                   Dataflow streaming job (Beam)
                     · windowing: 5-minute fixed
                     · trigger: AfterWatermark
                     · groupBy: country
                          │
                          ▼
                BigQuery streaming insert
                table: analytics.click_counts_5m
                clustered by country, partitioned by event_date
                          │
                          ▼
                Looker / Looker Studio dashboard
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Producer publishes to Pub/Sub — the browser sends events to an HTTPS endpoint that publishes a message to projects/X/topics/clicks-topic with an event_time attribute.
  2. Dataflow pulls the subscription — the streaming job uses a pull subscription, decodes the message, and assigns each event to a 5-minute fixed window.
  3. Beam aggregatesGroupByKey shuffles events by country; Combine.perKey(Count.ones()) sums per window.
  4. Streaming insert to BigQuery — the result tuples are written with WriteToBigQuery(...), landing in analytics.click_counts_5m partitioned by event_date and clustered by country.
  5. Composer (optional) — a Cloud Composer DAG runs every hour to re-aggregate the previous day for late events, using a BigQueryInsertJobOperator MERGE.
  6. Looker queries — a single BigQuery SQL SELECT country, window_start, clicks FROM analytics.click_counts_5m WHERE event_date = CURRENT_DATE() powers the live dashboard.

Rule of thumb: On GCP, Pub/Sub for entry, Dataflow for transform, BigQuery for warehouse, Composer for time-based orchestration — anything more exotic should justify itself against this default.

Python
Topic — etl
ETL Python drills

Practice →

Python
Topic — streaming
Streaming Python drills

Practice →


2. BigQuery — Dremel, Colossus, slots, partitioning + clustering

Diagram of BigQuery architecture — a Dremel query engine card on top with slot worker tiles, a Colossus storage card underneath with columnar shards, a Jupiter network ribbon connecting them, and a partitioned + clustered table panel on the right showing date partitions split into clustered blocks, on a light PipeCode card.

BigQuery — separated compute (Dremel) and storage (Colossus), priced by what you scan

BigQuery is the #1 service GCP data engineers touch every day. The senior one-sentence answer: BigQuery splits compute (Dremel, a serverless distributed SQL engine) from storage (Colossus, a columnar file system) and connects them with the petabit-scale Jupiter network — so you scale compute and storage independently, and you pay only for the bytes Dremel actually scans (on-demand) or the slot-hours you reserve (Editions).

Dremel — the query engine.

  • Serverless — no cluster to manage; you submit SQL and BigQuery dispatches it across thousands of workers.
  • Slot-based — a slot is one unit of CPU + RAM that runs a stage of your query; on-demand queries draw from a shared 2000-slot pool per project.
  • Tree-architecture execution — a root server splits the query into stages; each stage runs across many slot workers in parallel; intermediate results shuffle between stages over Jupiter.
  • In-memory shuffle — fast, but spills to disk for huge GROUP BYs.

Colossus — the storage layer.

  • Distributed, columnar file system (Google's successor to GFS).
  • Capacitor format — BigQuery's proprietary columnar encoding; compressed by column with run-length / dictionary encoding.
  • Independent of compute — you can keep petabytes in Colossus and pay near-S3 storage rates; you only pay for scanned bytes when you query.
  • Time-travel — every table keeps 7 days of snapshots for FOR SYSTEM_TIME AS OF queries.

Jupiter — the network.

  • Petabit-bisection-bandwidth internal network connecting Dremel slot workers to Colossus storage nodes.
  • Why it matters — shuffles between query stages travel over Jupiter; this is what makes BigQuery fast even for huge JOINs and GROUP BYs.

Slots — on-demand vs Editions.

  • On-demand — pay $6.25 / TB scanned (us-multi-region); draws from a shared 2000-slot pool; unpredictable performance under contention.
  • Editions (Standard / Enterprise / Enterprise Plus) — reserve a fixed number of slots; pay per slot-hour; predictable performance + advanced features (column-level security, BigQuery Omni, materialized views with smart tuning).
  • Autoscaling reservations — Editions can autoscale between baseline + max slots; you pay only for what you use.
  • Picking — on-demand for ad-hoc / unpredictable; Editions for production pipelines with steady load.

Partitioning — the #1 BigQuery cost lever.

  • PARTITION BY DATE(event_time) — splits the table into one physical block per date; queries with WHERE event_time >= '2026-05-01' scan only the matching partitions (partition pruning).
  • Supported types — DATE, DATETIME, TIMESTAMP (with daily / hourly / monthly / yearly granularity), integer-range, ingestion-time (_PARTITIONTIME).
  • Required-filter — set require_partition_filter = TRUE to force every query to include a partition filter (saves runaway scans).
  • Cost impact — partition pruning often cuts 99% of scanned bytes on time-series tables.

Clustering — the #2 cost lever.

  • CLUSTER BY user_id, country — sorts data inside each partition by the cluster columns; queries with WHERE user_id = 'X' scan only the relevant clustered blocks (cluster pruning).
  • Up to 4 columns in the cluster key; order matters (most-filtered first).
  • Works with partitioning — partition by date, cluster by user_id; queries on (date, user_id) get both layers of pruning.
  • Free — clustering has no extra storage cost; cluster pruning is best-effort but usually significant.

Editions feature highlights.

  • Standard — basic Editions; no advanced security or BigQuery ML.
  • Enterprise — column-level security, materialized views with smart tuning, BigQuery ML.
  • Enterprise Plus — multi-region + region replicas, longer time-travel, advanced VPC controls.

BigLake + Iceberg — open lakehouse on BigQuery.

  • BigLake tables — let BigQuery read Parquet / Iceberg / Hudi files on GCS as if they were native BigQuery tables.
  • BigLake Iceberg — full Iceberg table format with schema evolution, time-travel, partition evolution; readable by BigQuery, Spark, Trino simultaneously.
  • Use case — open lakehouse where one set of Iceberg files is shared across engines.

Worked example — partitioning + clustering a clicks table

Detailed explanation. A clicks table grows by ~50GB/day. We need fast queries by (date, country, user_id) and a hard cap on per-query scanned bytes. Partitioning by date prunes by day; clustering by country, user_id prunes within the partition; together they cut a multi-TB scan to a few GB.

Question. Build a analytics.clicks table for ~10TB of click events partitioned by event date and clustered for fast per-country lookups, with a required partition filter so nobody runs a full-table scan.

Input — sample columns.

Column Type Notes
event_time TIMESTAMP event timestamp
event_date DATE partition column (derived)
country STRING top cluster column
user_id STRING second cluster column
url STRING clicked URL
referrer STRING optional

Code.

CREATE TABLE analytics.clicks (
  event_time  TIMESTAMP NOT NULL,
  event_date  DATE      NOT NULL,
  country     STRING,
  user_id     STRING,
  url         STRING,
  referrer    STRING
)
PARTITION BY event_date
CLUSTER BY country, user_id
OPTIONS (
  require_partition_filter = TRUE,
  partition_expiration_days = 730,
  description = "Clickstream events; partitioned by event_date, clustered by country, user_id."
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. PARTITION BY event_date — BigQuery physically splits the table into one block per date; storage layout mirrors the partition column.
  2. CLUSTER BY country, user_id — inside each partition, BigQuery sorts and groups data by country, then user_id; queries with these filters benefit from cluster pruning.
  3. require_partition_filter = TRUE — any query missing a WHERE event_date … clause is rejected with an error; saves teams from accidental full scans.
  4. partition_expiration_days = 730 — partitions older than 2 years auto-delete; storage cost capped without manual cleanup.
  5. Query pathSELECT … WHERE event_date BETWEEN '2026-05-01' AND '2026-05-07' AND country = 'US' AND user_id = 'u_123' scans ~7 days × the US cluster × the u_123 sub-block — a few MB, not a few TB.

Output.

Metric Without partitioning + clustering With both
Scanned bytes (1-week, 1 user) ~10 TB ~50 MB
Query cost (on-demand) ~$62.50 ~$0.0003
Query latency minutes sub-second

Rule of thumb: On any time-series BigQuery table, PARTITION BY date column + CLUSTER BY high-cardinality filter column + require_partition_filter = TRUE is the canonical setup.

BigQuery interview question — write a partition-pruned cohort query

The most common BigQuery interview prompt: given a partitioned + clustered events table, write a query that benefits from both partition pruning and cluster pruning, then explain why it's cheap.

Question. Using analytics.clicks above, compute the number of distinct user_ids per day for the US for the week of 2026-05-01 to 2026-05-07.

Solution Using partition pruning + cluster pruning

SELECT
  event_date,
  COUNT(DISTINCT user_id) AS distinct_users
FROM analytics.clicks
WHERE event_date BETWEEN DATE '2026-05-01' AND DATE '2026-05-07'
  AND country = 'US'
GROUP BY event_date
ORDER BY event_date;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage What BigQuery does
1. Plan Reads table metadata; sees partition column event_date and cluster (country, user_id).
2. Partition prune Picks only the 7 partitions in the BETWEEN range; ignores ~720+ others.
3. Cluster prune Inside each partition, reads only the country='US' clustered block.
4. Scan Loads event_date, user_id columns only (columnar) — typically a few hundred MB.
5. Aggregate Distributed COUNT(DISTINCT) via HyperLogLog++ approximation under the hood.
6. Return One row per date, ordered.

Output:

event_date distinct_users
2026-05-01 412 391
2026-05-02 438 217
2026-05-03 401 504
2026-05-04 458 996
2026-05-05 471 002
2026-05-06 469 558
2026-05-07 482 113

Why this works — concept by concept:

  • partition pruningWHERE event_date BETWEEN … filters out 99% of partitions before any bytes are read; this is the single biggest cost lever in BigQuery.
  • cluster pruningcountry = 'US' filters out other-country clustered blocks inside each partition; further cuts scanned bytes by ~50× depending on country distribution.
  • columnar storage — only event_date and user_id columns are decoded; the wide url, referrer columns are skipped because Capacitor is column-oriented.
  • HyperLogLog++ — BigQuery's COUNT(DISTINCT) uses HLL++ for sub-linear memory; deterministic up to a configurable error.
  • CostO(rows scanned) = a few hundred MB instead of multi-TB; on-demand cost ~$0.001 per run, latency sub-second on warm slots.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →

SQL
Topic — window-functions
Window function drills (SQL)

Practice →


3. Cloud Dataflow — Apache Beam unified batch + streaming

Diagram of Dataflow — Pub/Sub topic and GCS bucket as left sources, an Apache Beam pipeline in the middle showing ParDo, Window, GroupByKey, Combine transforms with PCollection arrows, and BigQuery + GCS as right sinks, with an Auto-scaling workers bracket on top, on a light PipeCode card.

Cloud Dataflow — one Beam pipeline, batch or streaming

Cloud Dataflow is GCP's serverless runner for Apache Beam pipelines. The senior one-sentence answer: Beam expresses your pipeline as a DAG of PCollection → transform → PCollection steps, and Dataflow runs the same DAG as a batch job or a streaming job — autoscaling workers, handling shuffle, and managing exactly-once semantics on supported sinks.

Apache Beam — the unified model.

  • PCollection<T> — an immutable, possibly-infinite distributed collection of typed elements.
  • PTransform — a unit of computation (ParDo, GroupByKey, Window, Combine); composes into pipelines.
  • Pipeline — the DAG of PCollections and PTransforms.
  • Runners — Dataflow, Flink, Spark, Direct; same code, different execution engine.

The four core transforms.

  • ParDo — generalised map / filter / flat-map; you write a DoFn with process_element.
  • GroupByKey — shuffles KV<K, V> elements so all values for a key land at one worker; triggers a window boundary in streaming.
  • Window — assigns each element to one or more windows (fixed / sliding / session / global).
  • Combine — associative + commutative aggregation; supports combiners on the map side (like a MapReduce combiner).

Fixed vs sliding vs session windows.

  • Fixed (tumbling) windows — non-overlapping fixed-duration buckets; e.g. "5-minute fixed windows" buckets events at 0:00, 0:05, 0:10 …
  • Sliding windows — overlapping fixed-duration buckets that move every period; e.g. "10-minute sliding, 1-minute period" produces a 10-min average every minute.
  • Session windows — gap-based; a session ends when there's a gap longer than session_gap_duration with no events; great for per-user activity sessions.
  • Global window — one window; default for batch; rarely used unmodified in streaming.

Watermark, late data, triggers — the three streaming knobs.

  • Watermark — Dataflow's estimate of "we've seen all events with event_time ≤ X"; advances as new events arrive.
  • Late data — events arriving after the watermark passes their window; handled via allowed_lateness + Trigger.AfterPane.
  • Triggers — control when window results are emitted; AfterWatermark() for one final result per window, AfterProcessingTime(...) for low-latency early speculative results, AfterCount(N) for batched emits.
  • CombinationAfterWatermark().withEarlyFirings(AfterProcessingTime(...)).withLateFirings(AfterPane.elementCountAtLeast(1)) gives you fast first guesses, a final on-watermark answer, and late corrections.

Autoscaling workers.

  • Streaming Engine — Dataflow's optimised streaming backend; separates state + shuffle from worker VMs.
  • Horizontal autoscaling — Dataflow watches backlog and CPU; adds workers when backlog grows, removes them when it shrinks.
  • --max_num_workers — hard cap; prevents runaway cost.
  • Dataflow Prime — vertical autoscaling (per-stage right-sizing) + smart shuffle.

Templates.

  • Classic templates — a serialised pipeline JSON in GCS; launched via gcloud dataflow jobs run with parameters.
  • Flex templates — Docker-packaged pipelines; richer parameterisation; the modern default.
  • Why — lets ops teams kick off pipelines without rebuilding code; Composer's DataflowTemplatedJobStartOperator invokes them.

Exactly-once on Dataflow + BigQuery.

  • Pub/Sub source — at-least-once delivery; Dataflow dedupes via message IDs at the source.
  • BigQuery streaming insert — uses insert IDs to dedupe (best-effort); the Storage Write API offers stricter exactly-once.
  • GCS sinkFileIO.write() writes shard files atomically.

Worked example — a Beam streaming pipeline (Pub/Sub → BigQuery)

Detailed explanation. A canonical streaming Dataflow job in Python that reads from Pub/Sub, parses JSON, assigns events to 5-minute fixed windows, counts events per country, and streams the counts into BigQuery. This is the bread-and-butter pipeline shape for real-time analytics on GCP.

Question. Read clickstream events from projects/my-proj/topics/clicks-topic, count clicks per country per 5-minute window, and write the result to analytics.click_counts_5m.

Code.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window
from apache_beam.io.gcp.bigquery import WriteToBigQuery

opts = PipelineOptions(streaming=True)
opts.view_as(StandardOptions).runner = "DataflowRunner"

with beam.Pipeline(options=opts) as p:
    (
        p
        | "ReadPubSub"  >> beam.io.ReadFromPubSub(
              topic="projects/my-proj/topics/clicks-topic")
        | "ParseJSON"   >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "KeyByCountry">> beam.Map(lambda e: (e["country"], 1))
        | "5minWindow"  >> beam.WindowInto(window.FixedWindows(5 * 60))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "ToRow"       >> beam.Map(lambda kv: {
              "country": kv[0],
              "clicks":  kv[1],
              "window_start": window.IntervalWindow})
        | "WriteBQ"     >> WriteToBigQuery(
              table="my-proj:analytics.click_counts_5m",
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
              create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. ReadFromPubSub — Dataflow creates a managed pull subscription on the topic and streams messages into the pipeline.
  2. ParseJSONParDo-under-the-hood; each message is decoded and parsed.
  3. KeyByCountry — emits (country, 1) KV pairs so downstream transforms can group by key.
  4. WindowInto(FixedWindows(5*60)) — every element is assigned to a 5-minute non-overlapping window based on its processing time (or event time if you set a TimestampAttribute on ReadFromPubSub).
  5. CombinePerKey(sum) — Beam fuses a map-side combiner with a reducer-side merge; per (country, window), the total click count is computed.
  6. WriteToBigQuery — appends each row to the partitioned + clustered BigQuery table via the streaming insert API.

Output (one window's slice).

country clicks window_start
US 24 815 2026-05-24T12:00:00Z
GB 9 102 2026-05-24T12:00:00Z
DE 7 944 2026-05-24T12:00:00Z
FR 6 102 2026-05-24T12:00:00Z

Rule of thumb: For a new Dataflow pipeline, start with ReadFromPubSubWindowCombinePerKeyWriteToBigQuery and only add complexity (custom triggers, side inputs, stateful DoFn) when a real requirement forces you to.

Dataflow interview question — late events + a custom trigger

The most common Dataflow interview prompt: how do you handle late-arriving events in a windowed streaming aggregation? Show a trigger configuration that emits an early result, the final on-watermark result, and corrections for late data.

Question. Sometimes clicks land 10+ minutes late because of mobile retries. Modify the pipeline so each 5-minute window emits (a) an early speculative count after 30 seconds of processing time, (b) the final count when the watermark passes, and (c) corrected counts for events arriving up to 10 minutes late.

Solution Using AfterWatermark with early + late firings

from apache_beam.transforms import trigger

windowed = (
    events_kv
    | "5minWindow" >> beam.WindowInto(
          window.FixedWindows(5 * 60),
          trigger=trigger.AfterWatermark(
              early=trigger.AfterProcessingTime(30),
              late=trigger.AfterPane.elementCountAtLeast(1)
          ),
          allowed_lateness=10 * 60,
          accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    | "CountPerKey" >> beam.CombinePerKey(sum)
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Time (event) Watermark What Dataflow emits
12:00:30 12:00 Early pane — 30s after first element (speculative count).
12:05:10 12:05 Final on-watermark pane for window [12:00, 12:05).
12:12:00 12:12 Late event with event_time 12:03 arrives; pane re-fires (corrected count).
12:16:00 12:16 Late event with event_time 12:04 — still within 10-min lateness; re-fires.
12:17:00 12:17 Even-later event with event_time 12:03 — dropped (past allowed_lateness).

Output: the BigQuery table gets multiple rows per (country, window_start) — one early, one final, one per late firing — and the downstream MERGE / dedupe step keeps the latest by processing_time.

Why this works — concept by concept:

  • AfterWatermark() — guarantees one canonical pane fires when the watermark crosses the window end; this is the "right" answer for most cases.
  • early firingsAfterProcessingTime(30) emits a speculative pane every 30s before the watermark closes; dashboards see updates in near real-time.
  • late firingsAfterPane.elementCountAtLeast(1) re-fires on each late element so corrections aren't batched.
  • allowed_lateness — cap on how late an element can arrive and still be processed; anything later is dropped (memory bound on state).
  • accumulation modeACCUMULATING means each new pane includes all previously-seen elements for that window (so the count is a running total), vs DISCARDING which only includes new elements since the last pane.
  • CostO(elements per window) state in memory until allowed_lateness expires; trigger choice has zero impact on the underlying compute cost.

Python
Topic — streaming
Streaming Python drills

Practice →

Python
Topic — etl
ETL Python drills

Practice →


4. Pub/Sub + Composer — event-driven ingestion and Airflow orchestration

Diagram of Pub/Sub fanout and Composer orchestration — a producer publishing to a Pub/Sub topic with push (Cloud Run) and pull (Dataflow) subscription paths, plus a Cloud Composer Airflow DAG with sense_file, run_dataflow, load_bigquery, notify task nodes connected by arrows, on a light PipeCode card.

Pub/Sub — topics + subscriptions, at-least-once with ordering keys and DLQ

Pub/Sub is GCP's planet-scale, fully-managed messaging service. The senior one-sentence answer: Pub/Sub is a publish-subscribe bus where publishers push messages to a topic and any number of subscriptions deliver them to consumers — at-least-once by default, with optional ordering keys, dead-letter topics, and either push (HTTPS webhook to Cloud Run) or pull (Dataflow / custom client) delivery.

Topics and subscriptions.

  • Topic — a named channel; publishers push messages here.
  • Subscription — a named consumer attached to a topic; one topic can have many subscriptions; each subscription gets its own copy of every message.
  • Filter — subscriptions can apply a filter expression on message attributes to receive only matching messages.

At-least-once delivery — the core guarantee.

  • Every message is delivered at least once to each subscription; duplicates are possible.
  • Ack deadline — the consumer has N seconds (default 10, max 600) to ack; missing the deadline = redelivery.
  • Idempotent consumers — design your downstream so processing the same message twice is safe (e.g. dedupe by message ID, MERGE in BigQuery).

Ordering keys.

  • Same ordering_key = messages delivered in publish order to the subscriber.
  • Different keys = no order guarantee between keys (still parallelism between them).
  • Cost — slightly higher latency vs unordered; worth it when order matters (e.g. CDC events per primary key).

Dead-letter topic (DLQ).

  • Configurable per subscription — after N failed redelivery attempts, the message is forwarded to a dead-letter topic.
  • Why — prevents poison messages from blocking the subscription forever.
  • Pattern — DLQ → human inspection topic, or DLQ → repair pipeline that fixes + republishes.

Push vs pull subscriptions.

  • Push — Pub/Sub POSTs the message to your HTTPS endpoint (typically Cloud Run / Cloud Functions); great for low-traffic, request/response-style consumers.
  • Pull — your consumer (Dataflow, custom client) pulls messages on its own schedule; better for high throughput and back-pressure control.
  • Bidirectional streaming pull — modern gRPC pull keeps a long-lived connection; lowest latency.

Pub/Sub Lite — the cheaper alternative.

  • Zonal (vs Pub/Sub's global) and partition-based, like Kafka.
  • Up to 80% cheaper for very high throughput in a single zone.
  • Use case — single-zone analytics ingestion where global delivery isn't needed.

Snapshots and seek.

  • Snapshot — captures the unacked-message state of a subscription at a moment in time.
  • Seek — rewinds a subscription to a snapshot or timestamp; lets you replay messages.
  • Use case — bug in your pipeline? snapshot, fix, seek back, reprocess.

Cloud Composer — managed Apache Airflow on GKE Autopilot

Cloud Composer is GCP's managed Airflow. The senior one-sentence answer: Composer runs Apache Airflow on GKE Autopilot with native GCP integrations (operators for BigQuery, Dataflow, GCS, Dataproc, …), so you author DAGs in Python and Composer handles the scheduler, webserver, workers, and database upgrades.

Composer versions.

  • Composer 1 — runs on GKE Standard; user-managed node pool; legacy.
  • Composer 2 — runs on GKE Autopilot; serverless from your perspective; current default.
  • Composer 3 — newest; even tighter Autopilot integration, faster autoscaling, lighter ops.
  • Picking — Composer 2 or 3 for any new environment; Composer 1 is in maintenance mode.

DAG anatomy.

  • DAG — a Python file with a DAG(...) object + task instances.
  • Operator — a unit of work (PythonOperator, BigQueryInsertJobOperator, DataflowTemplatedJobStartOperator, GCSObjectExistenceSensor).
  • Sensor — an operator that polls until a condition is true (file exists, partition exists, time reached).
  • >> operator — declares dependencies (a >> b >> c).
  • Schedule — cron-like (@daily, 0 6 * * *, or timedelta(hours=1)).

The most-used GCP operators.

  • GCSObjectExistenceSensor — wait for a file to land in GCS.
  • DataflowTemplatedJobStartOperator — kick off a Flex / classic Dataflow template.
  • BigQueryInsertJobOperator — run any BigQuery SQL (query, DDL, MERGE, …).
  • GCSToBigQueryOperator — load a GCS file directly into BigQuery.
  • BigQueryToGCSOperator — export a BigQuery table to GCS.
  • DataprocSubmitJobOperator — submit a Spark job to Dataproc.

Composer scaling knobs.

  • Workers — auto-scale based on queued tasks; you set min + max.
  • Scheduler replicas — 1 (default) up to several for very large DAG sets.
  • Environment size — small / medium / large; controls CPU + memory per component.
  • DAG parsing — bigger DAG folders = slower parse; keep DAGs lean.

Composer vs Cloud Workflows vs Cloud Scheduler.

  • Composer — full Airflow; complex DAGs with branching, sensors, XComs.
  • Cloud Workflows — serverless YAML-defined workflows; great for short orchestrations across GCP APIs; no Python.
  • Cloud Scheduler — cron triggers; fires a Pub/Sub message or HTTPS POST; lightest weight.
  • Picking — Composer for daily ETL DAGs with dependencies; Workflows for short multi-step API orchestration; Scheduler for a single cron trigger.

Worked example — a Composer DAG (sense file → Dataflow → BigQuery → notify)

Detailed explanation. The canonical batch-DE DAG on GCP: wait for the daily raw file to appear in GCS, launch a Dataflow template to transform it, MERGE the result into a BigQuery table, and post a Slack notification. Each step uses a GCP-native operator so failure handling, retries, and SLAs are first-class.

Question. Build a Composer DAG that runs daily at 06:00 UTC: (1) waits for gs://raw/clicks/{ds}/clicks.parquet, (2) runs the clicks_etl Dataflow flex template, (3) MERGEs the staging table into analytics.clicks, (4) posts to Slack.

Code.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.operators.dataflow import \
    DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.operators.bigquery import \
    BigQueryInsertJobOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

with DAG(
    dag_id="clicks_daily_etl",
    start_date=datetime(2026, 5, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    tags=["gcp", "bigquery", "dataflow"],
) as dag:

    sense_file = GCSObjectExistenceSensor(
        task_id="sense_file",
        bucket="raw",
        object="clicks/{{ ds }}/clicks.parquet",
        timeout=60 * 60,
        poke_interval=60,
    )

    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        template="gs://my-proj/templates/clicks_etl",
        parameters={"input": "gs://raw/clicks/{{ ds }}/clicks.parquet",
                    "output_table": "my-proj:staging.clicks_{{ ds_nodash }}"},
        location="us-central1",
    )

    merge_to_bq = BigQueryInsertJobOperator(
        task_id="merge_to_bq",
        configuration={
            "query": {
                "query": """
                    MERGE analytics.clicks T
                    USING staging.clicks_{{ ds_nodash }} S
                    ON  T.event_time = S.event_time
                    AND T.user_id    = S.user_id
                    WHEN NOT MATCHED THEN INSERT ROW
                """,
                "useLegacySql": False,
            }
        },
    )

    notify = SlackWebhookOperator(
        task_id="notify",
        slack_webhook_conn_id="slack_default",
        message="clicks_daily_etl :: {{ ds }} :: success",
    )

    sense_file >> run_dataflow >> merge_to_bq >> notify
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. sense_fileGCSObjectExistenceSensor pokes every 60s for up to 60 minutes, waiting for the upstream producer to land the day's file.
  2. run_dataflowDataflowTemplatedJobStartOperator calls the prebuilt clicks_etl template with templated parameters (ds, ds_nodash) so each run targets the right partition.
  3. merge_to_bqBigQueryInsertJobOperator runs a MERGE to upsert into the canonical clicks table; idempotent because of the join key.
  4. notifySlackWebhookOperator posts a success message.
  5. >> dependencies — the four tasks run strictly in order; if any fails, Airflow retries twice with 5-minute backoff, then alerts.

Output.

[2026-05-24 06:00:00] sense_file STARTED
[2026-05-24 06:02:12] sense_file SUCCESS (file landed at 06:02:09)
[2026-05-24 06:02:13] run_dataflow STARTED
[2026-05-24 06:14:55] run_dataflow SUCCESS (workers: 4-12, 12m 42s)
[2026-05-24 06:14:56] merge_to_bq STARTED
[2026-05-24 06:15:31] merge_to_bq SUCCESS (24M rows merged)
[2026-05-24 06:15:32] notify SUCCESS
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: One sensor at the front, one Dataflow / Dataproc compute task, one BigQuery MERGE / load, and a notify — that's the shape of 80% of production GCP Composer DAGs.

Pub/Sub interview question — at-least-once + idempotent BigQuery sink

The most common Pub/Sub interview prompt: how do you handle the at-least-once guarantee so duplicates don't pollute your BigQuery table?

Question. A Dataflow streaming job pulls from Pub/Sub and writes to BigQuery. Sometimes Pub/Sub redelivers a message after the consumer crashes mid-process. Design the BigQuery write so duplicates are deduped.

Solution Using insert IDs + a periodic MERGE

from apache_beam.io.gcp.bigquery import WriteToBigQuery

(
    decoded
    | "WriteBQ" >> WriteToBigQuery(
          table="my-proj:analytics.click_events_raw",
          schema="event_id:STRING,event_time:TIMESTAMP,country:STRING,clicks:INT64",
          insert_retry_strategy="RETRY_ON_TRANSIENT_ERROR",
          additional_bq_parameters={
              "timePartitioning": {"type": "DAY", "field": "event_time"},
              "clustering":       {"fields": ["country"]},
          },
          # Use the message ID as the BigQuery insert ID for streaming dedupe
          insert_method="STREAMING_INSERTS")
)
Enter fullscreen mode Exit fullscreen mode
-- Periodic dedupe (Composer DAG, runs hourly)
MERGE analytics.click_events T
USING (
  SELECT * EXCEPT(rn) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY event_id ORDER BY event_time DESC) AS rn
    FROM analytics.click_events_raw
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  ) WHERE rn = 1
) S
ON T.event_id = S.event_id
WHEN NOT MATCHED THEN INSERT ROW;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage What happens
1. Publish Producer sends event with event_id = e_123.
2. Pub/Sub stores Topic durably stores e_123.
3. Dataflow pulls Subscription delivers e_123; worker processes + writes to click_events_raw.
4. Worker crashes Mid-process; ack never sent; ack deadline expires.
5. Pub/Sub redelivers Same e_123 delivered again; new worker writes to click_events_raw (duplicate row).
6. Hourly MERGE Composer DAG runs the MERGE above; ROW_NUMBER() keeps one row per event_id; canonical table click_events has no duplicates.

Output: analytics.click_events_raw may contain duplicate rows; analytics.click_events (the canonical table) is exactly-once per event_id. Dashboards read from click_events.

Why this works — concept by concept:

  • at-least-once — accept Pub/Sub's guarantee instead of fighting it; design the sink for idempotency.
  • event_id deduplication — every event carries a stable identifier; the MERGE uses it as the join key.
  • staging + canonical split — raw table absorbs duplicates cheaply; canonical table is deduped on a schedule (cheap MERGE over a 2-hour window, not the whole table).
  • partition pruning in MERGEWHERE event_time >= NOW() - 2h keeps the MERGE scan bounded to recent partitions.
  • ROW_NUMBER() OVER (PARTITION BY event_id) — picks one canonical row per id, even if there are 3+ duplicates.
  • CostO(rows in 2-hour window) per MERGE; runs in seconds even on multi-billion-row tables thanks to partition pruning.

Python
Topic — streaming
Streaming Python drills

Practice →

Python
Topic — etl
ETL Python drills

Practice →


5. GCP data engineering ecosystem — the wider stack

Diagram of the GCP data engineering ecosystem — four horizontal strata stacked top to bottom: Governance + Orchestration (Dataplex, IAM, Cloud KMS, Cloud Composer, VPC Service Controls), Compute + Transform (BigQuery, Dataflow, Dataproc, Dataform, Cloud Functions, Cloud Run), Ingestion + Streaming (Pub/Sub, Datastream, Storage Transfer, BigQuery Data Transfer), Storage (Cloud Storage, BigQuery storage, Bigtable, Cloud SQL, Spanner), on a light PipeCode card.

The four-layer GCP ecosystem map — storage, ingestion, compute, governance

Beyond the four-service spine, GCP has dozens of data services. The cleanest mental model is a four-layer stack — storage at the bottom, ingestion + streaming above it, compute + transform on top of that, and governance + orchestration covering everything. Every other GCP data service slots into one of these four layers.

Layer 1 — Storage.

  • Cloud Storage (GCS) — object storage; the universal landing zone for raw files; the lakehouse foundation.
  • BigQuery storage (Colossus) — managed columnar storage inside BigQuery tables.
  • Bigtable — wide-column NoSQL; sub-10ms reads at petabyte scale; the Cassandra / HBase equivalent.
  • Cloud SQL — managed MySQL / PostgreSQL / SQL Server; OLTP workloads.
  • Spanner — globally distributed, strongly consistent relational; the only "infinite-scale SQL" answer on GCP.
  • Firestore — document NoSQL; mobile / web app backends.

Layer 2 — Ingestion + streaming.

  • Pub/Sub — the streaming entry point; covered above.
  • Datastream — change-data-capture from Oracle, MySQL, PostgreSQL into BigQuery; near-real-time replicas.
  • Storage Transfer Service — scheduled batch transfers from S3 / Azure Blob / HTTP / on-prem to GCS.
  • BigQuery Data Transfer Service — managed transfers from Google Ads, YouTube, Search Ads, SaaS, etc. directly into BigQuery.

Layer 3 — Compute + transform.

  • BigQuery — SQL warehouse; also runs BigQuery ML and remote functions.
  • Dataflow — Apache Beam; covered above.
  • Dataproc — managed Spark / Hadoop; lift-and-shift target.
  • Dataform — SQL-first transformation framework (the dbt equivalent inside BigQuery).
  • Cloud Functions — serverless functions; quick glue code; HTTP / Pub/Sub triggered.
  • Cloud Run — serverless containers; great for Pub/Sub push subscribers and custom microservices.

Layer 4 — Governance + orchestration.

  • Cloud Composer — managed Airflow; covered above.
  • Cloud Workflows — serverless YAML workflows.
  • Cloud Scheduler — cron triggers.
  • Dataplex — data mesh + governance; data discovery, quality, lineage across GCS + BigQuery.
  • IAM — identity and access; predefined and custom roles for every resource.
  • Cloud KMS — encryption keys; CMEK for BigQuery / GCS / Pub/Sub.
  • VPC Service Controls — network perimeter; locks down BigQuery / GCS to specific VPCs and prevents data exfiltration.

Modern GCP lakehouse pattern.

  • GCS at the bottom — raw + curated zones; Iceberg / Parquet files.
  • BigLake Iceberg tables — registered in BigQuery, readable by BigQuery, Spark (Dataproc), and Trino simultaneously.
  • Dataform / dbt for transformations — SQL-first; version-controlled.
  • Composer schedules the daily refresh.
  • Dataplex governs the whole thing — lineage, access, data quality scans.
  • Looker / Looker Studio for BI on top.

GCP vs AWS vs Azure — the cross-cloud mental map.

Layer GCP AWS Azure
Object storage GCS S3 ADLS Gen2
Warehouse BigQuery Redshift Synapse
Streaming bus Pub/Sub Kinesis / MSK Event Hubs
Stream + batch compute Dataflow Kinesis Data Analytics / Glue / EMR Stream Analytics / Synapse
Managed Spark Dataproc EMR Synapse Spark / HDInsight
Managed Airflow Composer MWAA Azure Data Factory (partial)
NoSQL wide-column Bigtable DynamoDB Cosmos DB
CDC Datastream DMS Data Factory CDC
Governance Dataplex Lake Formation Purview

When to stay all-GCP vs go multi-cloud.

  • All-GCP — minimum ops; tightest integrations; fastest delivery.
  • Multi-cloud (BigQuery Omni, BigLake) — when data lives elsewhere; BigQuery Omni reads S3 / Azure without moving bytes.
  • Open lakehouse (Iceberg + Dataproc + BigLake) — when you want vendor-neutral storage + multiple compute engines reading the same files.

Python
Topic — etl
GCP / cloud ecosystem drills

Practice →

Python
Topic — data-analysis
Data analysis drills

Practice →


Choosing the right GCP service (cheat sheet)

A one-screen cheat sheet for the most common GCP data engineering decisions.

You want to … GCP service Notes
Warehouse petabytes of analytics data BigQuery Dremel + Colossus; pay per TB scanned or slot-hour
Cut BigQuery scan cost on a time-series table PARTITION BY date + CLUSTER BY user_id Set require_partition_filter = TRUE
Get predictable BigQuery performance BigQuery Editions reservations Pay per slot-hour; autoscale baseline → max
Read Iceberg / Parquet on GCS from BigQuery BigLake (incl. BigLake Iceberg) Single source of truth across engines
Run a streaming ETL pipeline Dataflow streaming Beam unified model; autoscaling workers
Run a batch ETL pipeline Dataflow batch or Dataproc Beam (new code) vs Spark (existing code)
Handle late events in streaming AfterWatermark + allowed_lateness Combine early + late firings
Ingest events at planet scale Pub/Sub At-least-once; ordering keys + DLQ
Cheap per-zone streaming bus Pub/Sub Lite Partition-based; ~80% cheaper
Trigger Cloud Run on each event Pub/Sub push subscription HTTPS POST per message
Drain a streaming job safely Dataflow drain Lets in-flight elements complete
Orchestrate a daily ETL DAG Cloud Composer (Airflow) Composer 2/3 on GKE Autopilot
Wait for a file before running a job GCSObjectExistenceSensor Polls GCS until object appears
Launch a Dataflow template from Airflow DataflowTemplatedJobStartOperator Flex template = modern default
Run a BigQuery MERGE from Airflow BigQueryInsertJobOperator Any DDL / DML / query
Replicate from on-prem MySQL / Postgres / Oracle to BigQuery Datastream Managed CDC
Run dbt-style SQL transforms in BigQuery Dataform Native; integrated with BigQuery git
Govern lakehouse data quality + lineage Dataplex Cross-GCS + BigQuery coverage
Lock BigQuery / GCS to specific VPCs VPC Service Controls Prevents data exfiltration
Wide-column NoSQL Bigtable Sub-10ms reads at PB scale
Globally consistent relational Spanner The only horizontally-scalable SQL on GCP

Frequently asked questions

What's the difference between on-demand and Editions pricing in BigQuery?

BigQuery on-demand pricing charges by the bytes Dremel scans (typically $6.25 / TB in us-multi-region); your project draws from a shared 2000-slot pool, so query latency can vary under contention. BigQuery Editions (Standard / Enterprise / Enterprise Plus) charges per slot-hour for a reservation you control — baseline + autoscaling max — so performance is predictable and many advanced features (column-level security, materialized views with smart tuning, BigQuery ML in Enterprise+, BigQuery Omni) unlock. The 2026 default: on-demand for ad-hoc and unpredictable workloads, Editions for any steady production pipeline where you can size a baseline reservation. Cost-wise, Editions tends to win once you scan more than ~10 TB / month with consistent load; on-demand wins for spiky, small workloads.

Dataflow vs Dataproc — when do I pick each?

Cloud Dataflow runs Apache Beam pipelines serverlessly with autoscaling workers and a unified batch + streaming model — pick it for new pipelines, especially streaming, where you don't want to think about clusters. Dataproc runs managed Spark / Hadoop / Flink clusters — pick it when you have existing Spark / PySpark code to lift-and-shift, when you need cluster-level control (custom Hadoop ecosystem libs, OS-level config), or when you're optimising the per-VM cost on a steady batch workload. The 2026 default: Dataflow first, Dataproc when you have a concrete reason. Many teams run Datastream / Pub/Sub → Dataflow for streaming, and Dataproc Serverless for batch Spark jobs that pre-date the GCP migration.

Pub/Sub vs Kafka — when does each win on GCP?

Pub/Sub is GCP's fully-managed planet-scale messaging — global topics, automatic regional failover, at-least-once delivery, optional ordering keys, DLQs, zero ops. Kafka (self-managed, Confluent Cloud, or MSK) gives you partition-level control, long retention by default, the Kafka Streams / KSQL ecosystem, and easier portability across clouds. Pub/Sub wins on GCP-native projects that value zero ops + global delivery; Kafka wins when you need exact partition ordering with custom partitioner logic, when you already have a Kafka-based platform (Kafka Streams apps, Connect ecosystem), or when multi-cloud portability matters. Pub/Sub Lite is the partition-based, single-zone option that closes the cost gap with Kafka for high-throughput single-zone use cases.

When should I use Cloud Composer vs Cloud Workflows vs Cloud Scheduler?

Cloud Composer (managed Airflow) is the right choice for complex multi-task DAGs with sensors, XComs, branching, retries, SLAs, and backfills — i.e. the kind of daily ETL most data teams run. Cloud Workflows is YAML-defined, serverless, and ideal for short multi-step API orchestrations (call function A, then call BigQuery API, then call function B) — no Python required, near-zero cold-start. Cloud Scheduler is just cron-as-a-service — a trigger that fires an HTTPS POST or publishes to a Pub/Sub topic; combine it with Workflows / Functions / Run for ultra-cheap glue. Rule of thumb: Composer for the ETL DAGs you wake up to, Workflows for short API choreographies, Scheduler for plain cron triggers.

How do I make my Pub/Sub → Dataflow → BigQuery pipeline exactly-once?

Pub/Sub delivery is at-least-once; duplicates are possible. To achieve exactly-once end-to-end, combine four layers: (1) stable event IDs generated at the source — UUIDs or natural keys; (2) Dataflow deduping via the Pub/Sub source's built-in messageId dedupe window (typically ~10 minutes); (3) BigQuery streaming inserts with insert IDs or the Storage Write API for stricter exactly-once semantics on the BigQuery side; (4) a periodic MERGE / dedupe job (Composer DAG with BigQueryInsertJobOperator) that takes a raw / staging table and materialises a deduplicated canonical table partitioned + clustered for cheap re-runs. The pattern is idempotent sinks + a canonical view: accept that duplicates can land in the raw table, and treat the deduped canonical table as the source of truth.

What's the cheapest way to keep cold data in BigQuery?

BigQuery storage has two tiers: active ($0.02 / GB-month in us-multi-region) and long-term ($0.01 / GB-month, automatically applied to any table or partition not modified for 90 days). To minimise storage cost: partition by date so old partitions roll into long-term storage on their own; set partition_expiration_days so very old partitions auto-delete; consider BigLake on GCS for archival data (~$0.004 / GB-month for Coldline storage) and let BigQuery query it via BigLake / external tables; use table snapshots instead of full table copies for backups. For really cold archives, write Parquet / Iceberg files to GCS Coldline / Archive storage classes and create a BigLake Iceberg table over them — you pay storage at archive-class rates and only ingest into BigQuery storage when you actually need fast queries.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL and Python practice keyed to the exact patterns this guide teaches for GCP: warehouse SQL on BigQuery-shaped schemas, streaming and event-time logic that mirrors Dataflow + Pub/Sub, and ETL DAG-style thinking that translates straight to Cloud Composer. Whether you're prepping for a Google GCP data engineer interview or building your own GCP platform, the practice library mirrors the same four-pillar mental model — BigQuery, Dataflow, Pub/Sub, Composer — this guide teaches.

Kick off via Explore practice →; drill the Python practice lane →; grind the SQL practice lane →; fan out into the ETL lane →; rehearse streaming Python drills →; sharpen aggregation problems (SQL) →; reinforce window function drills (SQL) →; widen coverage on the full data-analysis library →.

Top comments (0)