GCP data engineering rests on four pillars every Google Cloud data team ships against: BigQuery (the serverless analytical warehouse — Dremel query engine on Colossus storage, slot-based compute, partitioning + clustering, Editions reservations), Cloud Dataflow (the unified Apache Beam runner for batch + streaming pipelines, with autoscaling workers, windows, watermarks, and triggers), Pub/Sub (the planet-scale event bus — at-least-once delivery, ordering keys, dead-letter topics, push vs pull subscriptions), and Cloud Composer (managed Apache Airflow on GKE Autopilot, with operator-rich DAGs that glue everything together). Whether you're building a new GCP data platform or prepping for GCP data engineer interview loops at Google, Spotify, Twitter, or any cloud-native shop, the same four services dominate the architecture diagram.
This guide walks through each pillar end-to-end as a teach-by-example GCP playbook. Inside BigQuery we unpack Dremel, Colossus, the Jupiter network, slots (on-demand and Editions reservations), PARTITION BY DATE, CLUSTER BY user_id, partition pruning, cluster pruning, and BigLake Iceberg. Inside Dataflow we walk the Apache Beam unified model — ParDo, GroupByKey, Window, Combine — plus fixed / sliding / session windows, watermark + late data + triggers, and worker autoscaling. The Pub/Sub + Composer section covers Pub/Sub at-least-once + ordering keys + DLQ, push (Cloud Run) vs pull (Dataflow) subscriptions, Composer 2/3 on GKE Autopilot, and a real Airflow DAG with GCSObjectExistenceSensor → DataflowTemplatedJobStartOperator → BigQueryInsertJobOperator. We close with the wider GCP ecosystem stack so you can place every service on a four-layer mental map.
When you want hands-on reps alongside the reading, browse Python practice library →, drill SQL practice library →, grind ETL Python drills →, sharpen streaming Python drills →, or rehearse aggregation problems (SQL) →.
On this page
- GCP data engineering overview — the four-service spine
- BigQuery — Dremel, Colossus, slots, partitioning + clustering
- Cloud Dataflow — Apache Beam unified batch + streaming
- Pub/Sub + Composer — event-driven ingestion and Airflow orchestration
- GCP data engineering ecosystem — the wider stack
- Choosing the right GCP service (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. GCP data engineering overview — the four-service spine
The four services every GCP data platform leans on — and how they wire together
The one-sentence invariant: a modern GCP data engineering stack ingests events via Pub/Sub, transforms with Dataflow (batch or streaming), warehouses in BigQuery, and orchestrates everything via Cloud Composer (Airflow) — with GCS as the universal landing zone underneath. Memorising this spine is the fastest way to make every other GCP design decision concrete.
The four pillars at a glance.
- BigQuery — serverless analytical warehouse; SQL on petabyte tables; pay per TB scanned (on-demand) or per slot-hour (Editions); cheap, fast, near-zero-ops.
- Cloud Dataflow — managed Apache Beam runner; one programming model covers batch and streaming; autoscales workers; the canonical transform layer between Pub/Sub and BigQuery.
- Pub/Sub — planet-scale, at-least-once messaging; topics + subscriptions; the entry point for events into the pipeline.
- Cloud Composer — managed Apache Airflow on GKE Autopilot; the orchestration layer for batch DAGs, sensors, and dependencies between Dataflow + BigQuery jobs.
The canonical end-to-end pipeline.
Producer → Pub/Sub topic → Dataflow streaming → BigQuery (real-time table)
↘ GCS (Parquet raw)
↑
Composer DAG
schedules nightly
Dataflow batch +
BigQueryInsertJob
loads + reports.
Why this four-service spine wins on GCP.
- Serverless by default — every layer scales without cluster babysitting (no Hadoop YARN, no Spark cluster lifecycle).
- Unified programming model — Apache Beam covers batch and streaming with the same code; no Spark vs Flink split.
- Tight integrations — Pub/Sub → Dataflow → BigQuery is a first-class path; native templates and connectors exist for every hop.
- Cost levers are explicit — partitioning + clustering for BigQuery, autoscaling caps for Dataflow, snapshot retention for Pub/Sub — each one is a knob, not a guess.
- Composer brings Airflow — battle-tested orchestration semantics (sensors, retries, SLAs, backfills) without standing up Airflow yourself.
What interviewers and architects test on the GCP track.
- Do you pick the right service per layer? — Dataflow vs Dataproc (Beam vs Spark), Pub/Sub vs Kafka, Composer vs Cloud Workflows.
- Do you optimise BigQuery cost? — partitioning, clustering, materialized views, BI Engine, Editions vs on-demand.
- Do you handle streaming correctness? — watermark, late data, triggers, exactly-once via Dataflow + BigQuery streaming inserts.
-
Do you wire reliability? — Pub/Sub DLQs, Dataflow drains, Composer task retries, BigQuery transactional
MERGE.
Dataflow vs Dataproc — the most-asked GCP comparison.
- Dataflow — serverless Apache Beam; autoscaling; pay per vCPU-hour; the default for new pipelines.
- Dataproc — managed Spark / Hadoop clusters; pay per VM; the right call when you have existing Spark code or need fine-grained cluster control.
- The 2026 default — start with Dataflow for new pipelines; lift-and-shift Spark to Dataproc; migrate Dataproc to BigQuery / Dataflow when you can.
BigQuery vs Snowflake / Redshift — the warehouse choice.
- BigQuery — separates storage (Colossus) from compute (Dremel + slots); pay per TB scanned by default; zero cluster management.
- Snowflake — virtual warehouses (per-second billing); multi-cloud; great workload isolation.
- Redshift — node-based (or serverless); deeply integrated with AWS.
-
GCP-native answer — BigQuery is the default when your data already lives on GCP; integrate via
BigLaketo read Iceberg tables on GCS without duplicating data.
Worked example — a Pub/Sub → Dataflow → BigQuery pipeline at a glance
Detailed explanation. A realistic real-time pipeline on GCP: events land in a Pub/Sub topic, a Dataflow streaming job reads them, applies a windowed aggregation, and streams the results into BigQuery. The same code (Beam pipeline) runs as a batch job over a GCS bucket for the nightly backfill — that's the unified-model superpower.
Question. Clickstream events flow at 50k events/sec. We need a 5-minute rolling per-country click count, queryable in BigQuery within 30 seconds of event time.
Pipeline shape (GCP-native).
Browser ──HTTPS──▶ Pub/Sub topic: clicks-topic
│
▼
Dataflow streaming job (Beam)
· windowing: 5-minute fixed
· trigger: AfterWatermark
· groupBy: country
│
▼
BigQuery streaming insert
table: analytics.click_counts_5m
clustered by country, partitioned by event_date
│
▼
Looker / Looker Studio dashboard
Step-by-step explanation.
-
Producer publishes to Pub/Sub — the browser sends events to an HTTPS endpoint that publishes a message to
projects/X/topics/clicks-topicwith anevent_timeattribute. - Dataflow pulls the subscription — the streaming job uses a pull subscription, decodes the message, and assigns each event to a 5-minute fixed window.
-
Beam aggregates —
GroupByKeyshuffles events bycountry;Combine.perKey(Count.ones())sums per window. -
Streaming insert to BigQuery — the result tuples are written with
WriteToBigQuery(...), landing inanalytics.click_counts_5mpartitioned byevent_dateand clustered bycountry. -
Composer (optional) — a Cloud Composer DAG runs every hour to re-aggregate the previous day for late events, using a
BigQueryInsertJobOperatorMERGE. -
Looker queries — a single BigQuery SQL
SELECT country, window_start, clicks FROM analytics.click_counts_5m WHERE event_date = CURRENT_DATE()powers the live dashboard.
Rule of thumb: On GCP, Pub/Sub for entry, Dataflow for transform, BigQuery for warehouse, Composer for time-based orchestration — anything more exotic should justify itself against this default.
Python
Topic — etl
ETL Python drills
Python
Topic — streaming
Streaming Python drills
2. BigQuery — Dremel, Colossus, slots, partitioning + clustering
BigQuery — separated compute (Dremel) and storage (Colossus), priced by what you scan
BigQuery is the #1 service GCP data engineers touch every day. The senior one-sentence answer: BigQuery splits compute (Dremel, a serverless distributed SQL engine) from storage (Colossus, a columnar file system) and connects them with the petabit-scale Jupiter network — so you scale compute and storage independently, and you pay only for the bytes Dremel actually scans (on-demand) or the slot-hours you reserve (Editions).
Dremel — the query engine.
- Serverless — no cluster to manage; you submit SQL and BigQuery dispatches it across thousands of workers.
- Slot-based — a slot is one unit of CPU + RAM that runs a stage of your query; on-demand queries draw from a shared 2000-slot pool per project.
- Tree-architecture execution — a root server splits the query into stages; each stage runs across many slot workers in parallel; intermediate results shuffle between stages over Jupiter.
- In-memory shuffle — fast, but spills to disk for huge GROUP BYs.
Colossus — the storage layer.
- Distributed, columnar file system (Google's successor to GFS).
- Capacitor format — BigQuery's proprietary columnar encoding; compressed by column with run-length / dictionary encoding.
- Independent of compute — you can keep petabytes in Colossus and pay near-S3 storage rates; you only pay for scanned bytes when you query.
-
Time-travel — every table keeps 7 days of snapshots for
FOR SYSTEM_TIME AS OFqueries.
Jupiter — the network.
- Petabit-bisection-bandwidth internal network connecting Dremel slot workers to Colossus storage nodes.
- Why it matters — shuffles between query stages travel over Jupiter; this is what makes BigQuery fast even for huge JOINs and GROUP BYs.
Slots — on-demand vs Editions.
- On-demand — pay $6.25 / TB scanned (us-multi-region); draws from a shared 2000-slot pool; unpredictable performance under contention.
- Editions (Standard / Enterprise / Enterprise Plus) — reserve a fixed number of slots; pay per slot-hour; predictable performance + advanced features (column-level security, BigQuery Omni, materialized views with smart tuning).
- Autoscaling reservations — Editions can autoscale between baseline + max slots; you pay only for what you use.
- Picking — on-demand for ad-hoc / unpredictable; Editions for production pipelines with steady load.
Partitioning — the #1 BigQuery cost lever.
-
PARTITION BY DATE(event_time)— splits the table into one physical block per date; queries withWHERE event_time >= '2026-05-01'scan only the matching partitions (partition pruning). -
Supported types — DATE, DATETIME, TIMESTAMP (with daily / hourly / monthly / yearly granularity), integer-range, ingestion-time (
_PARTITIONTIME). -
Required-filter — set
require_partition_filter = TRUEto force every query to include a partition filter (saves runaway scans). - Cost impact — partition pruning often cuts 99% of scanned bytes on time-series tables.
Clustering — the #2 cost lever.
-
CLUSTER BY user_id, country— sorts data inside each partition by the cluster columns; queries withWHERE user_id = 'X'scan only the relevant clustered blocks (cluster pruning). - Up to 4 columns in the cluster key; order matters (most-filtered first).
-
Works with partitioning — partition by date, cluster by user_id; queries on
(date, user_id)get both layers of pruning. - Free — clustering has no extra storage cost; cluster pruning is best-effort but usually significant.
Editions feature highlights.
- Standard — basic Editions; no advanced security or BigQuery ML.
- Enterprise — column-level security, materialized views with smart tuning, BigQuery ML.
- Enterprise Plus — multi-region + region replicas, longer time-travel, advanced VPC controls.
BigLake + Iceberg — open lakehouse on BigQuery.
- BigLake tables — let BigQuery read Parquet / Iceberg / Hudi files on GCS as if they were native BigQuery tables.
- BigLake Iceberg — full Iceberg table format with schema evolution, time-travel, partition evolution; readable by BigQuery, Spark, Trino simultaneously.
- Use case — open lakehouse where one set of Iceberg files is shared across engines.
Worked example — partitioning + clustering a clicks table
Detailed explanation. A clicks table grows by ~50GB/day. We need fast queries by (date, country, user_id) and a hard cap on per-query scanned bytes. Partitioning by date prunes by day; clustering by country, user_id prunes within the partition; together they cut a multi-TB scan to a few GB.
Question. Build a analytics.clicks table for ~10TB of click events partitioned by event date and clustered for fast per-country lookups, with a required partition filter so nobody runs a full-table scan.
Input — sample columns.
| Column | Type | Notes |
|---|---|---|
| event_time | TIMESTAMP | event timestamp |
| event_date | DATE | partition column (derived) |
| country | STRING | top cluster column |
| user_id | STRING | second cluster column |
| url | STRING | clicked URL |
| referrer | STRING | optional |
Code.
CREATE TABLE analytics.clicks (
event_time TIMESTAMP NOT NULL,
event_date DATE NOT NULL,
country STRING,
user_id STRING,
url STRING,
referrer STRING
)
PARTITION BY event_date
CLUSTER BY country, user_id
OPTIONS (
require_partition_filter = TRUE,
partition_expiration_days = 730,
description = "Clickstream events; partitioned by event_date, clustered by country, user_id."
);
Step-by-step explanation.
-
PARTITION BY event_date— BigQuery physically splits the table into one block per date; storage layout mirrors the partition column. -
CLUSTER BY country, user_id— inside each partition, BigQuery sorts and groups data bycountry, thenuser_id; queries with these filters benefit from cluster pruning. -
require_partition_filter = TRUE— any query missing aWHERE event_date …clause is rejected with an error; saves teams from accidental full scans. -
partition_expiration_days = 730— partitions older than 2 years auto-delete; storage cost capped without manual cleanup. -
Query path —
SELECT … WHERE event_date BETWEEN '2026-05-01' AND '2026-05-07' AND country = 'US' AND user_id = 'u_123'scans ~7 days × the US cluster × the u_123 sub-block — a few MB, not a few TB.
Output.
| Metric | Without partitioning + clustering | With both |
|---|---|---|
| Scanned bytes (1-week, 1 user) | ~10 TB | ~50 MB |
| Query cost (on-demand) | ~$62.50 | ~$0.0003 |
| Query latency | minutes | sub-second |
Rule of thumb: On any time-series BigQuery table, PARTITION BY date column + CLUSTER BY high-cardinality filter column + require_partition_filter = TRUE is the canonical setup.
BigQuery interview question — write a partition-pruned cohort query
The most common BigQuery interview prompt: given a partitioned + clustered events table, write a query that benefits from both partition pruning and cluster pruning, then explain why it's cheap.
Question. Using analytics.clicks above, compute the number of distinct user_ids per day for the US for the week of 2026-05-01 to 2026-05-07.
Solution Using partition pruning + cluster pruning
SELECT
event_date,
COUNT(DISTINCT user_id) AS distinct_users
FROM analytics.clicks
WHERE event_date BETWEEN DATE '2026-05-01' AND DATE '2026-05-07'
AND country = 'US'
GROUP BY event_date
ORDER BY event_date;
Step-by-step trace.
| Stage | What BigQuery does |
|---|---|
| 1. Plan | Reads table metadata; sees partition column event_date and cluster (country, user_id). |
| 2. Partition prune | Picks only the 7 partitions in the BETWEEN range; ignores ~720+ others. |
| 3. Cluster prune | Inside each partition, reads only the country='US' clustered block. |
| 4. Scan | Loads event_date, user_id columns only (columnar) — typically a few hundred MB. |
| 5. Aggregate | Distributed COUNT(DISTINCT) via HyperLogLog++ approximation under the hood. |
| 6. Return | One row per date, ordered. |
Output:
| event_date | distinct_users |
|---|---|
| 2026-05-01 | 412 391 |
| 2026-05-02 | 438 217 |
| 2026-05-03 | 401 504 |
| 2026-05-04 | 458 996 |
| 2026-05-05 | 471 002 |
| 2026-05-06 | 469 558 |
| 2026-05-07 | 482 113 |
Why this works — concept by concept:
-
partition pruning —
WHERE event_date BETWEEN …filters out 99% of partitions before any bytes are read; this is the single biggest cost lever in BigQuery. -
cluster pruning —
country = 'US'filters out other-country clustered blocks inside each partition; further cuts scanned bytes by ~50× depending on country distribution. -
columnar storage — only
event_dateanduser_idcolumns are decoded; the wideurl,referrercolumns are skipped because Capacitor is column-oriented. -
HyperLogLog++ — BigQuery's
COUNT(DISTINCT)uses HLL++ for sub-linear memory; deterministic up to a configurable error. -
Cost —
O(rows scanned)= a few hundred MB instead of multi-TB; on-demand cost ~$0.001 per run, latency sub-second on warm slots.
SQL
Topic — aggregation
Aggregation problems (SQL)
SQL
Topic — window-functions
Window function drills (SQL)
3. Cloud Dataflow — Apache Beam unified batch + streaming
Cloud Dataflow — one Beam pipeline, batch or streaming
Cloud Dataflow is GCP's serverless runner for Apache Beam pipelines. The senior one-sentence answer: Beam expresses your pipeline as a DAG of PCollection → transform → PCollection steps, and Dataflow runs the same DAG as a batch job or a streaming job — autoscaling workers, handling shuffle, and managing exactly-once semantics on supported sinks.
Apache Beam — the unified model.
-
PCollection<T>— an immutable, possibly-infinite distributed collection of typed elements. -
PTransform— a unit of computation (ParDo,GroupByKey,Window,Combine); composes into pipelines. -
Pipeline— the DAG ofPCollections andPTransforms. - Runners — Dataflow, Flink, Spark, Direct; same code, different execution engine.
The four core transforms.
-
ParDo— generalised map / filter / flat-map; you write aDoFnwithprocess_element. -
GroupByKey— shufflesKV<K, V>elements so all values for a key land at one worker; triggers a window boundary in streaming. -
Window— assigns each element to one or more windows (fixed / sliding / session / global). -
Combine— associative + commutative aggregation; supports combiners on the map side (like a MapReduce combiner).
Fixed vs sliding vs session windows.
- Fixed (tumbling) windows — non-overlapping fixed-duration buckets; e.g. "5-minute fixed windows" buckets events at 0:00, 0:05, 0:10 …
-
Sliding windows — overlapping fixed-duration buckets that move every
period; e.g. "10-minute sliding, 1-minute period" produces a 10-min average every minute. -
Session windows — gap-based; a session ends when there's a gap longer than
session_gap_durationwith no events; great for per-user activity sessions. - Global window — one window; default for batch; rarely used unmodified in streaming.
Watermark, late data, triggers — the three streaming knobs.
- Watermark — Dataflow's estimate of "we've seen all events with event_time ≤ X"; advances as new events arrive.
-
Late data — events arriving after the watermark passes their window; handled via
allowed_lateness+Trigger.AfterPane. -
Triggers — control when window results are emitted;
AfterWatermark()for one final result per window,AfterProcessingTime(...)for low-latency early speculative results,AfterCount(N)for batched emits. -
Combination —
AfterWatermark().withEarlyFirings(AfterProcessingTime(...)).withLateFirings(AfterPane.elementCountAtLeast(1))gives you fast first guesses, a final on-watermark answer, and late corrections.
Autoscaling workers.
- Streaming Engine — Dataflow's optimised streaming backend; separates state + shuffle from worker VMs.
- Horizontal autoscaling — Dataflow watches backlog and CPU; adds workers when backlog grows, removes them when it shrinks.
-
--max_num_workers— hard cap; prevents runaway cost. - Dataflow Prime — vertical autoscaling (per-stage right-sizing) + smart shuffle.
Templates.
-
Classic templates — a serialised pipeline JSON in GCS; launched via
gcloud dataflow jobs runwith parameters. - Flex templates — Docker-packaged pipelines; richer parameterisation; the modern default.
-
Why — lets ops teams kick off pipelines without rebuilding code; Composer's
DataflowTemplatedJobStartOperatorinvokes them.
Exactly-once on Dataflow + BigQuery.
- Pub/Sub source — at-least-once delivery; Dataflow dedupes via message IDs at the source.
- BigQuery streaming insert — uses insert IDs to dedupe (best-effort); the Storage Write API offers stricter exactly-once.
-
GCS sink —
FileIO.write()writes shard files atomically.
Worked example — a Beam streaming pipeline (Pub/Sub → BigQuery)
Detailed explanation. A canonical streaming Dataflow job in Python that reads from Pub/Sub, parses JSON, assigns events to 5-minute fixed windows, counts events per country, and streams the counts into BigQuery. This is the bread-and-butter pipeline shape for real-time analytics on GCP.
Question. Read clickstream events from projects/my-proj/topics/clicks-topic, count clicks per country per 5-minute window, and write the result to analytics.click_counts_5m.
Code.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window
from apache_beam.io.gcp.bigquery import WriteToBigQuery
opts = PipelineOptions(streaming=True)
opts.view_as(StandardOptions).runner = "DataflowRunner"
with beam.Pipeline(options=opts) as p:
(
p
| "ReadPubSub" >> beam.io.ReadFromPubSub(
topic="projects/my-proj/topics/clicks-topic")
| "ParseJSON" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
| "KeyByCountry">> beam.Map(lambda e: (e["country"], 1))
| "5minWindow" >> beam.WindowInto(window.FixedWindows(5 * 60))
| "CountPerKey" >> beam.CombinePerKey(sum)
| "ToRow" >> beam.Map(lambda kv: {
"country": kv[0],
"clicks": kv[1],
"window_start": window.IntervalWindow})
| "WriteBQ" >> WriteToBigQuery(
table="my-proj:analytics.click_counts_5m",
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)
Step-by-step explanation.
-
ReadFromPubSub— Dataflow creates a managed pull subscription on the topic and streams messages into the pipeline. -
ParseJSON—ParDo-under-the-hood; each message is decoded and parsed. -
KeyByCountry— emits(country, 1)KV pairs so downstream transforms can group by key. -
WindowInto(FixedWindows(5*60))— every element is assigned to a 5-minute non-overlapping window based on its processing time (or event time if you set aTimestampAttributeonReadFromPubSub). -
CombinePerKey(sum)— Beam fuses a map-side combiner with a reducer-side merge; per(country, window), the total click count is computed. -
WriteToBigQuery— appends each row to the partitioned + clustered BigQuery table via the streaming insert API.
Output (one window's slice).
| country | clicks | window_start |
|---|---|---|
| US | 24 815 | 2026-05-24T12:00:00Z |
| GB | 9 102 | 2026-05-24T12:00:00Z |
| DE | 7 944 | 2026-05-24T12:00:00Z |
| FR | 6 102 | 2026-05-24T12:00:00Z |
Rule of thumb: For a new Dataflow pipeline, start with ReadFromPubSub → Window → CombinePerKey → WriteToBigQuery and only add complexity (custom triggers, side inputs, stateful DoFn) when a real requirement forces you to.
Dataflow interview question — late events + a custom trigger
The most common Dataflow interview prompt: how do you handle late-arriving events in a windowed streaming aggregation? Show a trigger configuration that emits an early result, the final on-watermark result, and corrections for late data.
Question. Sometimes clicks land 10+ minutes late because of mobile retries. Modify the pipeline so each 5-minute window emits (a) an early speculative count after 30 seconds of processing time, (b) the final count when the watermark passes, and (c) corrected counts for events arriving up to 10 minutes late.
Solution Using AfterWatermark with early + late firings
from apache_beam.transforms import trigger
windowed = (
events_kv
| "5minWindow" >> beam.WindowInto(
window.FixedWindows(5 * 60),
trigger=trigger.AfterWatermark(
early=trigger.AfterProcessingTime(30),
late=trigger.AfterPane.elementCountAtLeast(1)
),
allowed_lateness=10 * 60,
accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
| "CountPerKey" >> beam.CombinePerKey(sum)
)
Step-by-step trace.
| Time (event) | Watermark | What Dataflow emits |
|---|---|---|
| 12:00:30 | 12:00 | Early pane — 30s after first element (speculative count). |
| 12:05:10 | 12:05 | Final on-watermark pane for window [12:00, 12:05). |
| 12:12:00 | 12:12 | Late event with event_time 12:03 arrives; pane re-fires (corrected count). |
| 12:16:00 | 12:16 | Late event with event_time 12:04 — still within 10-min lateness; re-fires. |
| 12:17:00 | 12:17 | Even-later event with event_time 12:03 — dropped (past allowed_lateness). |
Output: the BigQuery table gets multiple rows per (country, window_start) — one early, one final, one per late firing — and the downstream MERGE / dedupe step keeps the latest by processing_time.
Why this works — concept by concept:
- AfterWatermark() — guarantees one canonical pane fires when the watermark crosses the window end; this is the "right" answer for most cases.
-
early firings —
AfterProcessingTime(30)emits a speculative pane every 30s before the watermark closes; dashboards see updates in near real-time. -
late firings —
AfterPane.elementCountAtLeast(1)re-fires on each late element so corrections aren't batched. - allowed_lateness — cap on how late an element can arrive and still be processed; anything later is dropped (memory bound on state).
-
accumulation mode —
ACCUMULATINGmeans each new pane includes all previously-seen elements for that window (so the count is a running total), vsDISCARDINGwhich only includes new elements since the last pane. -
Cost —
O(elements per window)state in memory untilallowed_latenessexpires; trigger choice has zero impact on the underlying compute cost.
Python
Topic — streaming
Streaming Python drills
Python
Topic — etl
ETL Python drills
4. Pub/Sub + Composer — event-driven ingestion and Airflow orchestration
Pub/Sub — topics + subscriptions, at-least-once with ordering keys and DLQ
Pub/Sub is GCP's planet-scale, fully-managed messaging service. The senior one-sentence answer: Pub/Sub is a publish-subscribe bus where publishers push messages to a topic and any number of subscriptions deliver them to consumers — at-least-once by default, with optional ordering keys, dead-letter topics, and either push (HTTPS webhook to Cloud Run) or pull (Dataflow / custom client) delivery.
Topics and subscriptions.
- Topic — a named channel; publishers push messages here.
- Subscription — a named consumer attached to a topic; one topic can have many subscriptions; each subscription gets its own copy of every message.
- Filter — subscriptions can apply a filter expression on message attributes to receive only matching messages.
At-least-once delivery — the core guarantee.
- Every message is delivered at least once to each subscription; duplicates are possible.
- Ack deadline — the consumer has N seconds (default 10, max 600) to ack; missing the deadline = redelivery.
- Idempotent consumers — design your downstream so processing the same message twice is safe (e.g. dedupe by message ID, MERGE in BigQuery).
Ordering keys.
-
Same
ordering_key= messages delivered in publish order to the subscriber. - Different keys = no order guarantee between keys (still parallelism between them).
- Cost — slightly higher latency vs unordered; worth it when order matters (e.g. CDC events per primary key).
Dead-letter topic (DLQ).
- Configurable per subscription — after N failed redelivery attempts, the message is forwarded to a dead-letter topic.
- Why — prevents poison messages from blocking the subscription forever.
- Pattern — DLQ → human inspection topic, or DLQ → repair pipeline that fixes + republishes.
Push vs pull subscriptions.
- Push — Pub/Sub POSTs the message to your HTTPS endpoint (typically Cloud Run / Cloud Functions); great for low-traffic, request/response-style consumers.
- Pull — your consumer (Dataflow, custom client) pulls messages on its own schedule; better for high throughput and back-pressure control.
- Bidirectional streaming pull — modern gRPC pull keeps a long-lived connection; lowest latency.
Pub/Sub Lite — the cheaper alternative.
- Zonal (vs Pub/Sub's global) and partition-based, like Kafka.
- Up to 80% cheaper for very high throughput in a single zone.
- Use case — single-zone analytics ingestion where global delivery isn't needed.
Snapshots and seek.
- Snapshot — captures the unacked-message state of a subscription at a moment in time.
- Seek — rewinds a subscription to a snapshot or timestamp; lets you replay messages.
- Use case — bug in your pipeline? snapshot, fix, seek back, reprocess.
Cloud Composer — managed Apache Airflow on GKE Autopilot
Cloud Composer is GCP's managed Airflow. The senior one-sentence answer: Composer runs Apache Airflow on GKE Autopilot with native GCP integrations (operators for BigQuery, Dataflow, GCS, Dataproc, …), so you author DAGs in Python and Composer handles the scheduler, webserver, workers, and database upgrades.
Composer versions.
- Composer 1 — runs on GKE Standard; user-managed node pool; legacy.
- Composer 2 — runs on GKE Autopilot; serverless from your perspective; current default.
- Composer 3 — newest; even tighter Autopilot integration, faster autoscaling, lighter ops.
- Picking — Composer 2 or 3 for any new environment; Composer 1 is in maintenance mode.
DAG anatomy.
-
DAG— a Python file with aDAG(...)object + task instances. -
Operator— a unit of work (PythonOperator,BigQueryInsertJobOperator,DataflowTemplatedJobStartOperator,GCSObjectExistenceSensor). -
Sensor— an operator that polls until a condition is true (file exists, partition exists, time reached). -
>>operator — declares dependencies (a >> b >> c). -
Schedule — cron-like (
@daily,0 6 * * *, ortimedelta(hours=1)).
The most-used GCP operators.
-
GCSObjectExistenceSensor— wait for a file to land in GCS. -
DataflowTemplatedJobStartOperator— kick off a Flex / classic Dataflow template. -
BigQueryInsertJobOperator— run any BigQuery SQL (query, DDL, MERGE, …). -
GCSToBigQueryOperator— load a GCS file directly into BigQuery. -
BigQueryToGCSOperator— export a BigQuery table to GCS. -
DataprocSubmitJobOperator— submit a Spark job to Dataproc.
Composer scaling knobs.
- Workers — auto-scale based on queued tasks; you set min + max.
- Scheduler replicas — 1 (default) up to several for very large DAG sets.
- Environment size — small / medium / large; controls CPU + memory per component.
- DAG parsing — bigger DAG folders = slower parse; keep DAGs lean.
Composer vs Cloud Workflows vs Cloud Scheduler.
- Composer — full Airflow; complex DAGs with branching, sensors, XComs.
- Cloud Workflows — serverless YAML-defined workflows; great for short orchestrations across GCP APIs; no Python.
- Cloud Scheduler — cron triggers; fires a Pub/Sub message or HTTPS POST; lightest weight.
- Picking — Composer for daily ETL DAGs with dependencies; Workflows for short multi-step API orchestration; Scheduler for a single cron trigger.
Worked example — a Composer DAG (sense file → Dataflow → BigQuery → notify)
Detailed explanation. The canonical batch-DE DAG on GCP: wait for the daily raw file to appear in GCS, launch a Dataflow template to transform it, MERGE the result into a BigQuery table, and post a Slack notification. Each step uses a GCP-native operator so failure handling, retries, and SLAs are first-class.
Question. Build a Composer DAG that runs daily at 06:00 UTC: (1) waits for gs://raw/clicks/{ds}/clicks.parquet, (2) runs the clicks_etl Dataflow flex template, (3) MERGEs the staging table into analytics.clicks, (4) posts to Slack.
Code.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.operators.dataflow import \
DataflowTemplatedJobStartOperator
from airflow.providers.google.cloud.operators.bigquery import \
BigQueryInsertJobOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
with DAG(
dag_id="clicks_daily_etl",
start_date=datetime(2026, 5, 1),
schedule="0 6 * * *",
catchup=False,
default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
tags=["gcp", "bigquery", "dataflow"],
) as dag:
sense_file = GCSObjectExistenceSensor(
task_id="sense_file",
bucket="raw",
object="clicks/{{ ds }}/clicks.parquet",
timeout=60 * 60,
poke_interval=60,
)
run_dataflow = DataflowTemplatedJobStartOperator(
task_id="run_dataflow",
template="gs://my-proj/templates/clicks_etl",
parameters={"input": "gs://raw/clicks/{{ ds }}/clicks.parquet",
"output_table": "my-proj:staging.clicks_{{ ds_nodash }}"},
location="us-central1",
)
merge_to_bq = BigQueryInsertJobOperator(
task_id="merge_to_bq",
configuration={
"query": {
"query": """
MERGE analytics.clicks T
USING staging.clicks_{{ ds_nodash }} S
ON T.event_time = S.event_time
AND T.user_id = S.user_id
WHEN NOT MATCHED THEN INSERT ROW
""",
"useLegacySql": False,
}
},
)
notify = SlackWebhookOperator(
task_id="notify",
slack_webhook_conn_id="slack_default",
message="clicks_daily_etl :: {{ ds }} :: success",
)
sense_file >> run_dataflow >> merge_to_bq >> notify
Step-by-step explanation.
-
sense_file—GCSObjectExistenceSensorpokes every 60s for up to 60 minutes, waiting for the upstream producer to land the day's file. -
run_dataflow—DataflowTemplatedJobStartOperatorcalls the prebuiltclicks_etltemplate with templated parameters (ds,ds_nodash) so each run targets the right partition. -
merge_to_bq—BigQueryInsertJobOperatorruns aMERGEto upsert into the canonical clicks table; idempotent because of the join key. -
notify—SlackWebhookOperatorposts a success message. -
>>dependencies — the four tasks run strictly in order; if any fails, Airflow retries twice with 5-minute backoff, then alerts.
Output.
[2026-05-24 06:00:00] sense_file STARTED
[2026-05-24 06:02:12] sense_file SUCCESS (file landed at 06:02:09)
[2026-05-24 06:02:13] run_dataflow STARTED
[2026-05-24 06:14:55] run_dataflow SUCCESS (workers: 4-12, 12m 42s)
[2026-05-24 06:14:56] merge_to_bq STARTED
[2026-05-24 06:15:31] merge_to_bq SUCCESS (24M rows merged)
[2026-05-24 06:15:32] notify SUCCESS
Rule of thumb: One sensor at the front, one Dataflow / Dataproc compute task, one BigQuery MERGE / load, and a notify — that's the shape of 80% of production GCP Composer DAGs.
Pub/Sub interview question — at-least-once + idempotent BigQuery sink
The most common Pub/Sub interview prompt: how do you handle the at-least-once guarantee so duplicates don't pollute your BigQuery table?
Question. A Dataflow streaming job pulls from Pub/Sub and writes to BigQuery. Sometimes Pub/Sub redelivers a message after the consumer crashes mid-process. Design the BigQuery write so duplicates are deduped.
Solution Using insert IDs + a periodic MERGE
from apache_beam.io.gcp.bigquery import WriteToBigQuery
(
decoded
| "WriteBQ" >> WriteToBigQuery(
table="my-proj:analytics.click_events_raw",
schema="event_id:STRING,event_time:TIMESTAMP,country:STRING,clicks:INT64",
insert_retry_strategy="RETRY_ON_TRANSIENT_ERROR",
additional_bq_parameters={
"timePartitioning": {"type": "DAY", "field": "event_time"},
"clustering": {"fields": ["country"]},
},
# Use the message ID as the BigQuery insert ID for streaming dedupe
insert_method="STREAMING_INSERTS")
)
-- Periodic dedupe (Composer DAG, runs hourly)
MERGE analytics.click_events T
USING (
SELECT * EXCEPT(rn) FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY event_id ORDER BY event_time DESC) AS rn
FROM analytics.click_events_raw
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
) WHERE rn = 1
) S
ON T.event_id = S.event_id
WHEN NOT MATCHED THEN INSERT ROW;
Step-by-step trace.
| Stage | What happens |
|---|---|
| 1. Publish | Producer sends event with event_id = e_123. |
| 2. Pub/Sub stores | Topic durably stores e_123. |
| 3. Dataflow pulls | Subscription delivers e_123; worker processes + writes to click_events_raw. |
| 4. Worker crashes | Mid-process; ack never sent; ack deadline expires. |
| 5. Pub/Sub redelivers | Same e_123 delivered again; new worker writes to click_events_raw (duplicate row). |
| 6. Hourly MERGE | Composer DAG runs the MERGE above; ROW_NUMBER() keeps one row per event_id; canonical table click_events has no duplicates. |
Output: analytics.click_events_raw may contain duplicate rows; analytics.click_events (the canonical table) is exactly-once per event_id. Dashboards read from click_events.
Why this works — concept by concept:
- at-least-once — accept Pub/Sub's guarantee instead of fighting it; design the sink for idempotency.
- event_id deduplication — every event carries a stable identifier; the MERGE uses it as the join key.
- staging + canonical split — raw table absorbs duplicates cheaply; canonical table is deduped on a schedule (cheap MERGE over a 2-hour window, not the whole table).
-
partition pruning in MERGE —
WHERE event_time >= NOW() - 2hkeeps the MERGE scan bounded to recent partitions. - ROW_NUMBER() OVER (PARTITION BY event_id) — picks one canonical row per id, even if there are 3+ duplicates.
-
Cost —
O(rows in 2-hour window)per MERGE; runs in seconds even on multi-billion-row tables thanks to partition pruning.
Python
Topic — streaming
Streaming Python drills
Python
Topic — etl
ETL Python drills
5. GCP data engineering ecosystem — the wider stack
The four-layer GCP ecosystem map — storage, ingestion, compute, governance
Beyond the four-service spine, GCP has dozens of data services. The cleanest mental model is a four-layer stack — storage at the bottom, ingestion + streaming above it, compute + transform on top of that, and governance + orchestration covering everything. Every other GCP data service slots into one of these four layers.
Layer 1 — Storage.
- Cloud Storage (GCS) — object storage; the universal landing zone for raw files; the lakehouse foundation.
- BigQuery storage (Colossus) — managed columnar storage inside BigQuery tables.
- Bigtable — wide-column NoSQL; sub-10ms reads at petabyte scale; the Cassandra / HBase equivalent.
- Cloud SQL — managed MySQL / PostgreSQL / SQL Server; OLTP workloads.
- Spanner — globally distributed, strongly consistent relational; the only "infinite-scale SQL" answer on GCP.
- Firestore — document NoSQL; mobile / web app backends.
Layer 2 — Ingestion + streaming.
- Pub/Sub — the streaming entry point; covered above.
- Datastream — change-data-capture from Oracle, MySQL, PostgreSQL into BigQuery; near-real-time replicas.
- Storage Transfer Service — scheduled batch transfers from S3 / Azure Blob / HTTP / on-prem to GCS.
- BigQuery Data Transfer Service — managed transfers from Google Ads, YouTube, Search Ads, SaaS, etc. directly into BigQuery.
Layer 3 — Compute + transform.
- BigQuery — SQL warehouse; also runs BigQuery ML and remote functions.
- Dataflow — Apache Beam; covered above.
- Dataproc — managed Spark / Hadoop; lift-and-shift target.
- Dataform — SQL-first transformation framework (the dbt equivalent inside BigQuery).
- Cloud Functions — serverless functions; quick glue code; HTTP / Pub/Sub triggered.
- Cloud Run — serverless containers; great for Pub/Sub push subscribers and custom microservices.
Layer 4 — Governance + orchestration.
- Cloud Composer — managed Airflow; covered above.
- Cloud Workflows — serverless YAML workflows.
- Cloud Scheduler — cron triggers.
- Dataplex — data mesh + governance; data discovery, quality, lineage across GCS + BigQuery.
- IAM — identity and access; predefined and custom roles for every resource.
- Cloud KMS — encryption keys; CMEK for BigQuery / GCS / Pub/Sub.
- VPC Service Controls — network perimeter; locks down BigQuery / GCS to specific VPCs and prevents data exfiltration.
Modern GCP lakehouse pattern.
- GCS at the bottom — raw + curated zones; Iceberg / Parquet files.
- BigLake Iceberg tables — registered in BigQuery, readable by BigQuery, Spark (Dataproc), and Trino simultaneously.
- Dataform / dbt for transformations — SQL-first; version-controlled.
- Composer schedules the daily refresh.
- Dataplex governs the whole thing — lineage, access, data quality scans.
- Looker / Looker Studio for BI on top.
GCP vs AWS vs Azure — the cross-cloud mental map.
| Layer | GCP | AWS | Azure |
|---|---|---|---|
| Object storage | GCS | S3 | ADLS Gen2 |
| Warehouse | BigQuery | Redshift | Synapse |
| Streaming bus | Pub/Sub | Kinesis / MSK | Event Hubs |
| Stream + batch compute | Dataflow | Kinesis Data Analytics / Glue / EMR | Stream Analytics / Synapse |
| Managed Spark | Dataproc | EMR | Synapse Spark / HDInsight |
| Managed Airflow | Composer | MWAA | Azure Data Factory (partial) |
| NoSQL wide-column | Bigtable | DynamoDB | Cosmos DB |
| CDC | Datastream | DMS | Data Factory CDC |
| Governance | Dataplex | Lake Formation | Purview |
When to stay all-GCP vs go multi-cloud.
- All-GCP — minimum ops; tightest integrations; fastest delivery.
- Multi-cloud (BigQuery Omni, BigLake) — when data lives elsewhere; BigQuery Omni reads S3 / Azure without moving bytes.
- Open lakehouse (Iceberg + Dataproc + BigLake) — when you want vendor-neutral storage + multiple compute engines reading the same files.
Python
Topic — etl
GCP / cloud ecosystem drills
Python
Topic — data-analysis
Data analysis drills
Choosing the right GCP service (cheat sheet)
A one-screen cheat sheet for the most common GCP data engineering decisions.
| You want to … | GCP service | Notes |
|---|---|---|
| Warehouse petabytes of analytics data | BigQuery | Dremel + Colossus; pay per TB scanned or slot-hour |
| Cut BigQuery scan cost on a time-series table |
PARTITION BY date + CLUSTER BY user_id
|
Set require_partition_filter = TRUE
|
| Get predictable BigQuery performance | BigQuery Editions reservations | Pay per slot-hour; autoscale baseline → max |
| Read Iceberg / Parquet on GCS from BigQuery | BigLake (incl. BigLake Iceberg) | Single source of truth across engines |
| Run a streaming ETL pipeline | Dataflow streaming | Beam unified model; autoscaling workers |
| Run a batch ETL pipeline | Dataflow batch or Dataproc | Beam (new code) vs Spark (existing code) |
| Handle late events in streaming |
AfterWatermark + allowed_lateness
|
Combine early + late firings |
| Ingest events at planet scale | Pub/Sub | At-least-once; ordering keys + DLQ |
| Cheap per-zone streaming bus | Pub/Sub Lite | Partition-based; ~80% cheaper |
| Trigger Cloud Run on each event | Pub/Sub push subscription | HTTPS POST per message |
| Drain a streaming job safely | Dataflow drain | Lets in-flight elements complete |
| Orchestrate a daily ETL DAG | Cloud Composer (Airflow) | Composer 2/3 on GKE Autopilot |
| Wait for a file before running a job | GCSObjectExistenceSensor |
Polls GCS until object appears |
| Launch a Dataflow template from Airflow | DataflowTemplatedJobStartOperator |
Flex template = modern default |
Run a BigQuery MERGE from Airflow |
BigQueryInsertJobOperator |
Any DDL / DML / query |
| Replicate from on-prem MySQL / Postgres / Oracle to BigQuery | Datastream | Managed CDC |
| Run dbt-style SQL transforms in BigQuery | Dataform | Native; integrated with BigQuery git |
| Govern lakehouse data quality + lineage | Dataplex | Cross-GCS + BigQuery coverage |
| Lock BigQuery / GCS to specific VPCs | VPC Service Controls | Prevents data exfiltration |
| Wide-column NoSQL | Bigtable | Sub-10ms reads at PB scale |
| Globally consistent relational | Spanner | The only horizontally-scalable SQL on GCP |
Frequently asked questions
What's the difference between on-demand and Editions pricing in BigQuery?
BigQuery on-demand pricing charges by the bytes Dremel scans (typically $6.25 / TB in us-multi-region); your project draws from a shared 2000-slot pool, so query latency can vary under contention. BigQuery Editions (Standard / Enterprise / Enterprise Plus) charges per slot-hour for a reservation you control — baseline + autoscaling max — so performance is predictable and many advanced features (column-level security, materialized views with smart tuning, BigQuery ML in Enterprise+, BigQuery Omni) unlock. The 2026 default: on-demand for ad-hoc and unpredictable workloads, Editions for any steady production pipeline where you can size a baseline reservation. Cost-wise, Editions tends to win once you scan more than ~10 TB / month with consistent load; on-demand wins for spiky, small workloads.
Dataflow vs Dataproc — when do I pick each?
Cloud Dataflow runs Apache Beam pipelines serverlessly with autoscaling workers and a unified batch + streaming model — pick it for new pipelines, especially streaming, where you don't want to think about clusters. Dataproc runs managed Spark / Hadoop / Flink clusters — pick it when you have existing Spark / PySpark code to lift-and-shift, when you need cluster-level control (custom Hadoop ecosystem libs, OS-level config), or when you're optimising the per-VM cost on a steady batch workload. The 2026 default: Dataflow first, Dataproc when you have a concrete reason. Many teams run Datastream / Pub/Sub → Dataflow for streaming, and Dataproc Serverless for batch Spark jobs that pre-date the GCP migration.
Pub/Sub vs Kafka — when does each win on GCP?
Pub/Sub is GCP's fully-managed planet-scale messaging — global topics, automatic regional failover, at-least-once delivery, optional ordering keys, DLQs, zero ops. Kafka (self-managed, Confluent Cloud, or MSK) gives you partition-level control, long retention by default, the Kafka Streams / KSQL ecosystem, and easier portability across clouds. Pub/Sub wins on GCP-native projects that value zero ops + global delivery; Kafka wins when you need exact partition ordering with custom partitioner logic, when you already have a Kafka-based platform (Kafka Streams apps, Connect ecosystem), or when multi-cloud portability matters. Pub/Sub Lite is the partition-based, single-zone option that closes the cost gap with Kafka for high-throughput single-zone use cases.
When should I use Cloud Composer vs Cloud Workflows vs Cloud Scheduler?
Cloud Composer (managed Airflow) is the right choice for complex multi-task DAGs with sensors, XComs, branching, retries, SLAs, and backfills — i.e. the kind of daily ETL most data teams run. Cloud Workflows is YAML-defined, serverless, and ideal for short multi-step API orchestrations (call function A, then call BigQuery API, then call function B) — no Python required, near-zero cold-start. Cloud Scheduler is just cron-as-a-service — a trigger that fires an HTTPS POST or publishes to a Pub/Sub topic; combine it with Workflows / Functions / Run for ultra-cheap glue. Rule of thumb: Composer for the ETL DAGs you wake up to, Workflows for short API choreographies, Scheduler for plain cron triggers.
How do I make my Pub/Sub → Dataflow → BigQuery pipeline exactly-once?
Pub/Sub delivery is at-least-once; duplicates are possible. To achieve exactly-once end-to-end, combine four layers: (1) stable event IDs generated at the source — UUIDs or natural keys; (2) Dataflow deduping via the Pub/Sub source's built-in messageId dedupe window (typically ~10 minutes); (3) BigQuery streaming inserts with insert IDs or the Storage Write API for stricter exactly-once semantics on the BigQuery side; (4) a periodic MERGE / dedupe job (Composer DAG with BigQueryInsertJobOperator) that takes a raw / staging table and materialises a deduplicated canonical table partitioned + clustered for cheap re-runs. The pattern is idempotent sinks + a canonical view: accept that duplicates can land in the raw table, and treat the deduped canonical table as the source of truth.
What's the cheapest way to keep cold data in BigQuery?
BigQuery storage has two tiers: active ($0.02 / GB-month in us-multi-region) and long-term ($0.01 / GB-month, automatically applied to any table or partition not modified for 90 days). To minimise storage cost: partition by date so old partitions roll into long-term storage on their own; set partition_expiration_days so very old partitions auto-delete; consider BigLake on GCS for archival data (~$0.004 / GB-month for Coldline storage) and let BigQuery query it via BigLake / external tables; use table snapshots instead of full table copies for backups. For really cold archives, write Parquet / Iceberg files to GCS Coldline / Archive storage classes and create a BigLake Iceberg table over them — you pay storage at archive-class rates and only ingest into BigQuery storage when you actually need fast queries.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL and Python practice keyed to the exact patterns this guide teaches for GCP: warehouse SQL on BigQuery-shaped schemas, streaming and event-time logic that mirrors Dataflow + Pub/Sub, and ETL DAG-style thinking that translates straight to Cloud Composer. Whether you're prepping for a Google GCP data engineer interview or building your own GCP platform, the practice library mirrors the same four-pillar mental model — BigQuery, Dataflow, Pub/Sub, Composer — this guide teaches.
Kick off via Explore practice →; drill the Python practice lane →; grind the SQL practice lane →; fan out into the ETL lane →; rehearse streaming Python drills →; sharpen aggregation problems (SQL) →; reinforce window function drills (SQL) →; widen coverage on the full data-analysis library →.





Top comments (0)