An ETL pipeline is the core data-engineering workflow that turns scattered raw payloads — database rows, API responses, log files, SaaS exports — into clean, trusted data inside a warehouse where analysts and BI tools can use it. ETL stands for Extract, Transform, Load: pull raw data from many source systems, reshape and clean it into a consistent schema, then write it into a destination like Amazon Redshift, Snowflake, or a data lake. Every fresher data-engineering interview probes the same three letters — and the candidate who can name the failure modes per stage wins the round.
Think of this as a beginner-friendly ETL pipeline tutorial for data engineers — a first-principles walk through the Extract → Transform → Load loop, the orchestration tools that automate it (Airflow, dbt, Spark, AWS Glue), the ETL-vs-ELT trade-off that defines modern cloud warehouses, and a runnable Python pandas example you can adapt to your own pipeline. Every section ships worked examples and an ETL interview questions-style problem with a full traced solution, in the same shape PipeCode practice problems use.
If you want hands-on reps after you read, explore practice →, drill SQL problems →, browse ETL practice →, or open ETL System Design for Data Engineering Interviews → for a structured path.
On this page
- Why ETL pipelines matter
- Extract — pulling raw data from sources
- Transform — cleaning, dedup, standardization, aggregation
- Load — destinations from warehouses to BI tools
- ETL vs ELT — transform before or after loading
- ETL orchestration tools — Airflow, dbt, Spark, AWS Glue
- Building a Python pandas ETL pipeline
- Choosing your ETL stack (checklist)
- Frequently asked questions
- Practice on PipeCode
1. Why ETL pipelines matter
Clean, trusted data is the foundation of every analytics decision
So, why ETL? Because raw source data is messy — duplicates, nulls, mixed formats, inconsistent customer IDs across systems — and dashboards can't tolerate that mess. An ETL pipeline is the automated cleaning step between the noisy source-of-truth (operational databases, third-party APIs, file dumps) and the curated layer (data warehouse, lakehouse, BI tool). Without it, analytics teams answer the same question three different ways and trust in the data evaporates.
Pro tip: When an interviewer asks "what's an ETL pipeline?", lead with the contract it provides, not the steps. The contract is: "given any source payload, downstream consumers see clean, deduplicated, type-coerced, time-aligned rows on a known schema with a known freshness SLA." The three letters (E, T, L) are just how you keep that contract.
Raw data is noisy — duplicates, nulls, mixed formats
The noise invariant: source systems were built for their own workload, not for analytics; they ship duplicates from CDC retries, nulls where the user skipped a field, three different date formats from three different teams, and inconsistent capitalisation that breaks joins. Every one of those defects either becomes a bug in the dashboard or gets cleaned out by an ETL stage.
- Duplicates — same customer recorded multiple times (CDC retries, late events).
- Nulls — missing amounts, missing emails, optional fields left blank.
- Mixed formats — `2026-05-11`, `11/05/26`, `May 11` all mean the same date.
- Inconsistent identifiers — `Ram`, `RAM`, and `Ram` with a trailing space all refer to one customer.
Worked example. A raw orders extract straight from the CRM contains three flavours of "Ram" and a null amount:
| order_id | customer_name | amount |
|---|---|---|
| 1 | Ram | 500 |
| 2 | RAM | 1000 |
| 3 | Ram | NULL |
Step-by-step.
- `SELECT COUNT(DISTINCT customer_name) FROM orders` returns 2 (`Ram` and `RAM`) when the business wants 1.
- `SELECT AVG(amount) FROM orders` returns `750` instead of the expected `500`, because `NULL` is excluded from `AVG` while the consumer expected it to be treated as `0`.
- A dashboard built directly on this raw table publishes wrong numbers to the CFO.
- The fix isn't a smarter query — it's an ETL pipeline that normalises the casing, deduplicates customers on a canonical key, and resolves null amounts using a business rule.
Worked-example solution. A minimal Transform step in SQL:
SELECT
LOWER(TRIM(customer_name)) AS customer_key,
COALESCE(amount, 0) AS amount
FROM raw_orders;
Rule of thumb: the deeper you get into a pipeline, the more expensive the fix; clean the raw payload as close to ingest as possible.
Multiple sources, one consistent schema
The unification invariant: every analytical query joins data from at least two systems — the e-commerce orders table, the payment provider's ledger, the CRM customer record, the marketing platform's campaign IDs — and they don't agree on schema, primary keys, or freshness; the ETL pipeline is what produces a **single conformed schema with a shared customer key, a shared product key, and a shared time grain**.
- Common sources — SQL databases (Postgres, MySQL), APIs (Stripe, Salesforce), CSV / Excel dumps, log files, cloud storage (S3, GCS), SaaS tools (HubSpot, Segment).
- Conformed dimensions — `dim_customer`, `dim_product`, `dim_date` shared across every fact table.
- Surrogate keys — never join on a source-system natural key; map it to an internal `BIGINT` key in the warehouse.
- Shared time grain — every fact aligns to a common date grain (day, hour, minute) so cross-source joins work.
Worked example. Three sources, three different customer-identifier columns:
| source | identifier column | example value |
|---|---|---|
| website (Postgres) | `users.id` (BIGINT) | `42` |
| payments (Stripe API) | `customer.id` (string) | `cus_abc123` |
| CRM (HubSpot export) | `contact.email` (string) | `ram@example.com` |
Step-by-step.
- The website ships orders keyed by Postgres `users.id`.
- The payment provider returns transaction rows keyed by Stripe `cus_abc123`.
- The CRM exports customer contacts keyed by email address.
- A direct three-way join is impossible — there is no shared key.
- The ETL pipeline builds a `dim_customer` table that carries all three identifiers as columns plus a single internal `customer_key BIGINT` that every downstream fact uses. After that, the three-way join is one line of SQL.
Worked-example solution. A bridge dimension that unifies all three identifiers:
CREATE TABLE dim_customer (
customer_key BIGSERIAL PRIMARY KEY,
website_id BIGINT UNIQUE,
stripe_id TEXT UNIQUE,
crm_email TEXT UNIQUE,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Rule of thumb: if you find yourself joining on a string from a third party, the join key belongs in a dimension table, not in your fact-table predicates.
Automation, repeatability, and observability
The automation invariant: an ETL pipeline runs on a schedule (or in response to an event), produces the same output on every rerun (idempotent), and emits enough metadata (row counts, hashes, error logs) for an on-call engineer to debug a failure at 3 a.m. A "pipeline" without these properties is a one-off script, and one-off scripts always rot.
- Schedule — cron (`0 2 * * *` for 2 a.m. daily), event-driven (S3 ObjectCreated), or continuous (Kafka stream).
- Idempotent — rerunning the same job produces the same output; achieved with `MERGE`, `INSERT OVERWRITE PARTITION`, or `DELETE + INSERT`.
- Observable — row counts logged per stage, schema drift alerts, success / failure notifications to Slack or PagerDuty.
- Reproducible — the pipeline definition lives in git; deploys are versioned; rollbacks are one PR away.
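A minimal sketch of the observability bullet above — logging a per-stage row count and raising an alert when drift exceeds a tolerance; the webhook environment variable and the 0.1% threshold are assumptions to adapt, not a fixed convention:
import json, logging, os, urllib.request

logging.basicConfig(level=logging.INFO)

def log_stage_count(stage, row_count, expected, tolerance=0.001):
    """Log one stage's row count and post an alert when drift exceeds the tolerance."""
    drift = abs(row_count - expected) / max(expected, 1)
    logging.info("stage=%s rows=%d expected=%d drift=%.4f", stage, row_count, expected, drift)
    if drift > tolerance:
        payload = json.dumps({"text": f"ETL drift alert: {stage} rows={row_count} expected={expected}"})
        req = urllib.request.Request(
            os.environ["ETL_ALERT_WEBHOOK"],  # hypothetical Slack / PagerDuty webhook URL
            data=payload.encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

# Example: silver count matches the extract count, so nothing fires
log_stage_count("silver.orders", row_count=12_835, expected=12_835)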
Worked example. An idempotent daily load that overwrites a single date partition.
| run | rows in target | resulting count |
|---|---|---|
| original | 0 | 12,835 (first load for 2026-05-11) |
| retry after partial failure | 6,420 (partial write) | 12,835 (overwrite produces clean state) |
| accidental rerun next morning | 12,835 | 12,835 (same data, no duplicates) |
Step-by-step.
- Day-1 run: load lands 12,835 rows for partition `ingest_date='2026-05-11'`.
- The pipeline crashes mid-write, leaving 6,420 partial rows.
- Retry runs the same job; the `INSERT OVERWRITE PARTITION` semantics drop the partial rows first, then write the full 12,835 — net state is correct.
- An accidental rerun a day later does the same thing — overwrite the partition, end at 12,835. No duplicates.
- The key property is that the final state depends on the input, not on how many times the job ran. That's idempotency.
Worked-example solution. A partition-overwrite load (Spark SQL / Hive-style syntax; warehouses without `INSERT OVERWRITE` get the same idempotent effect from `MERGE` or a partition-scoped `DELETE + INSERT`):
INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date='2026-05-11')
SELECT
    order_id,
    customer_id,
    CAST(amount AS DECIMAL(14, 2)) AS amount,
source_ts
FROM bronze.orders
WHERE ingest_date = '2026-05-11';
Rule of thumb: if your job's output depends on NOW() or on previous state in the target, it is not idempotent — restructure.
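To make that contrast concrete, a minimal sketch of the same daily load written both ways, assuming `ingest_date` is an ordinary column on `silver.orders`; only the second form converges to the same state on a rerun:
-- Non-idempotent: every rerun appends the day's rows again
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts, ingest_date
FROM bronze.orders
WHERE ingest_date = '2026-05-11';

-- Idempotent: clear the day's rows, then rewrite them inside one transaction
BEGIN;
DELETE FROM silver.orders WHERE ingest_date = '2026-05-11';
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts, ingest_date
FROM bronze.orders
WHERE ingest_date = '2026-05-11';
COMMIT;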
Common beginner mistakes
- Conflating an ETL pipeline with a one-off SQL script — pipelines are scheduled, versioned, and observable.
- Skipping the deduplication step in Transform — assuming source data is "clean enough" and shipping doubled metrics.
- Hand-mapping identifiers in every dashboard instead of building a conformed `dim_customer` once.
- Writing non-idempotent loads (`INSERT INTO ... SELECT ...`) and discovering duplicates only after the second run.
- Treating ETL as code-only without observability — silent failures rot trust faster than loud ones.
ETL Interview Question on Designing a First-Pass Pipeline
A retail company has order data spread across three systems — the Postgres-backed e-commerce site, a Stripe payments account, and a HubSpot CRM. The CFO wants a daily revenue dashboard sliced by customer segment and product category. Design the simplest end-to-end ETL pipeline that gives the CFO a trustworthy answer by tomorrow morning.
Solution Using a Daily ETL Pipeline with Bronze / Silver / Gold Layers
Code solution.
EXTRACT
├── Postgres orders → s3://lake/bronze/orders/ingest_date=…/ (Debezium CDC or daily snapshot)
├── Stripe charges → s3://lake/bronze/charges/ingest_date=…/ (REST API → JSON to S3)
└── HubSpot contacts → s3://lake/bronze/contacts/ingest_date=…/ (nightly CSV export)
TRANSFORM (dbt or Spark SQL)
├── silver.orders ← dedup + type coercion + customer_key surrogate
├── silver.charges ← join orders ↔ charges by stripe transaction_id
├── silver.contacts ← deduped contact rows keyed by email
└── silver.dim_customer ← unify all three identifiers in one dimension
LOAD
└── gold.fact_revenue ← grain: one row per (date, customer_key, product_key)
partitioned by date_key
joined to dim_customer + dim_product
ORCHESTRATION
└── Airflow DAG, daily at 02:00 UTC, with reconciliation gate before gold promotion
Step-by-step trace of the daily pipeline:
| stage | what runs | output |
|---|---|---|
| 02:00 | Airflow triggers DAG | start |
| 02:05 | Extract: Postgres snapshot → S3 bronze | 12,835 raw orders |
| 02:08 | Extract: Stripe API pulls yesterday's charges → S3 | 12,712 charges |
| 02:10 | Extract: HubSpot nightly CSV → S3 | 8,210 contacts |
| 02:15 | Transform: dbt builds silver layer (dedup, type coercion, surrogate keys) | 12,835 silver orders |
| 02:25 | Transform: dim_customer unified across all three sources | 8,210 dim rows |
| 02:30 | Reconciliation gate: silver.orders count vs Postgres source | drift < 0.1% — PASS |
| 02:32 | Load: gold.fact_revenue partition for 2026-05-11 | 12,835 fact rows |
| 02:35 | DAG complete; CFO dashboard refreshes at 02:40 | clean numbers |
Output: a single gold.fact_revenue table the BI tool reads. Every row has a customer_key, a product_key, a date_key, an exact revenue decimal, and a pipeline_version lineage column. The CFO opens Tableau, picks a date range, and sees correct numbers segmented by customer cohort.
Why this works — concept by concept:
- Bronze (append-only) → Silver (conformed) → Gold (star schema) — the medallion layering keeps source drift contained to the bronze layer; consumers never see raw payloads.
- Daily partition overwrite — every load is idempotent; reruns and backfills don't produce duplicates.
- `dim_customer` bridging three identifiers — joins are one line in the gold layer; the warehouse query plan uses a single surrogate key everywhere.
- Reconciliation gate before gold promotion — drift > tolerance pages on-call; the BI tool never sees a bad load.
- Airflow orchestration — DAG definition is versioned in git; retries, alerts, SLAs are first-class.
- Cost — one daily run scales `O(|daily delta|)` not `O(|all-time data|)`; backfill is `O(|range|)`; clear observability everywhere.
Inline CTA: drill the ETL practice page for end-to-end pipeline shapes and the SQL practice page for transformation-style queries.
2. Extract — pulling raw data from sources
How to read from databases, APIs, files, and SaaS without breaking the source
The Extract stage is the gateway between source systems and the rest of the pipeline — and the choices you make here cascade into everything downstream. Different sources have different protocols (SQL, REST, file drops, CDC streams), different freshness expectations (sub-second to weekly), and different failure modes (rate limits, schema drift, network flakiness). The right Extract strategy is the one that pulls complete, ordered, replayable raw data without disrupting the source system's own workload.
Pro tip: the most common Extract failure mode in production isn't "data is wrong" — it's "data is silently missing because the source paginated and we didn't follow the cursor." Always log the source-system pagination cursor / last-modified marker for every batch so you can replay from exactly where you stopped.
SQL databases — snapshots, CDC, and the read-replica rule
The relational-source invariant: never run extract queries against the OLTP primary database — that's the production transactional workload; extract from a read replica, a CDC stream, or a periodic snapshot to S3 / GCS. Within those three patterns, CDC is the modern default for high-freshness pipelines because it has near-zero impact on the source.
- Snapshot extract — `SELECT * FROM orders WHERE updated_at > $cursor` against a read replica.
- CDC (Change Data Capture) — Debezium reads the Postgres WAL or MySQL binlog and emits change events to Kafka.
- Read replica — point extracts at a follower, not the primary, so the OLTP workload is unaffected.
- Cursor / watermark — persist the last successfully-extracted timestamp or LSN so the next run resumes correctly.
Worked example. A daily extract that pulls only the previous day's orders.
| approach | source impact | freshness | complexity |
|---|---|---|---|
| Full table dump | high (lock + IO) | 24 h | low |
| Cursor-based incremental | medium (one indexed scan) | 24 h | medium |
| CDC (Debezium + Kafka) | near-zero | ~1 min | high |
Step-by-step.
- The naïve approach `SELECT * FROM orders` against the primary scans the entire table — millions of rows of IO that competes with the live website.
- The cursor-based approach `WHERE updated_at >= '2026-05-10' AND updated_at < '2026-05-11'` against a read replica reads only ~1 day of data — much cheaper.
- CDC reads the database's own write-ahead log, so the extract has zero query cost on the source — only the disk-tail read of the WAL.
- The right choice depends on freshness needs: daily dashboards → cursor; sub-minute analytics → CDC; one-off backfill → full snapshot to S3.
- In every case, log the cursor (or LSN) per batch — that's how you replay after a failure.
Worked-example solution. A cursor-based daily extract against a Postgres read replica:
-- Run nightly; bind $cursor to the previous run's max_updated_at
COPY (
SELECT order_id, customer_id, amount, status, updated_at
FROM orders
WHERE updated_at >= '2026-05-10'::timestamptz
AND updated_at < '2026-05-11'::timestamptz
ORDER BY updated_at
)
-- Plain Postgres COPY writes to a server-local path; shipping the file to s3://lake/bronze/orders/ingest_date=2026-05-10/ is a separate upload step (or use the aws_s3 extension on RDS / Aurora)
TO '/tmp/orders_2026-05-10.csv'
WITH (FORMAT csv, HEADER true);
Rule of thumb: every extract you write should be resumable from a cursor. Without one, a network blip becomes a full re-extract.
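One minimal way to make an extract resumable — a small watermark table the job reads before extracting and advances only after a successful load; the table and column names here are illustrative, not a fixed convention:
-- One row per extracted source table; holds the high-water mark of the last good run
CREATE TABLE IF NOT EXISTS etl_watermarks (
    source_table TEXT PRIMARY KEY,
    last_cursor  TIMESTAMPTZ NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Before extracting: read the cursor to bind into the WHERE clause above
SELECT last_cursor FROM etl_watermarks WHERE source_table = 'orders';

-- After a successful load: advance the cursor to the max updated_at just extracted
INSERT INTO etl_watermarks (source_table, last_cursor)
VALUES ('orders', '2026-05-11 00:00:00+00')
ON CONFLICT (source_table) DO UPDATE
SET last_cursor = EXCLUDED.last_cursor,
    updated_at  = NOW();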
APIs — pagination, rate limits, and idempotent paging
The API-source invariant: most REST APIs return a paginated response; the extract job must follow the cursor / next-link until all pages are consumed; rate limits force exponential backoff; the same call run twice should produce the same data unless the upstream changed. Idempotency on the extract side protects the rest of the pipeline.
- Pagination — cursor-based (`next_cursor` token), offset / limit, or `since` timestamps.
- Rate limits — read the `X-RateLimit-Remaining` header; back off when it hits 0.
- Idempotency — repeating the same API call returns the same rows (modulo true new data).
- Auth — OAuth refresh tokens, API keys via secrets manager (never in code).
Worked example. A Stripe charges extract that pages through ~10,000 records per day.
| step | request | result |
|---|---|---|
| 1 | `GET /v1/charges?limit=100&created[gte]=…` | 100 charges + `has_more=true` + next_id |
| 2 | `GET /v1/charges?limit=100&starting_after=cha_99` | 100 more + `has_more=true` |
| 3-99 | repeat with new `starting_after` cursor | 100 each |
| 100 | last page | 12 charges + `has_more=false` |
| total | 99 × 100 + 12 = 9,912 | written to S3 as one JSON Lines file |
Step-by-step.
- The first call grabs the first 100 charges with `limit=100`.
- Each response contains `has_more=true` plus the ID of the last record (`cha_99`).
- The next call uses `starting_after=cha_99` to fetch the next page — that's cursor-based pagination.
- Loop until `has_more=false`; concatenate all pages into a single JSON Lines file.
- Persist the final cursor ID so a retry knows exactly where to resume; without that, you re-extract from the start every time.
Worked-example solution. A simple Python pagination loop with rate-limit handling:
import os, time, json, requests
cursor = None
records = []
while True:
params = {"limit": 100, "created[gte]": "1715040000"}
if cursor:
params["starting_after"] = cursor
r = requests.get(
"https://api.stripe.com/v1/charges",
params=params,
auth=(os.environ["STRIPE_SECRET_KEY"], "")
)
if r.status_code == 429: # rate-limited
time.sleep(int(r.headers.get("Retry-After", 10)))
continue
page = r.json()
records.extend(page["data"])
if not page["has_more"]:
break
cursor = page["data"][-1]["id"]
with open("/tmp/charges.jsonl", "w") as fh:
for rec in records:
fh.write(json.dumps(rec) + "\n")
Rule of thumb: assume every API call can fail; build retries, backoff, and cursor-resume in from day one.
Files and SaaS — schema drift and the contract problem
The file-source invariant: CSV / Excel / JSON dumps from third parties are the highest-drift source in any pipeline — a column rename, a date-format change, or a quote-character switch silently breaks the load; defend with strict schema validation at ingest and explicit alerts on drift.
- CSV — `csv` module or `pandas.read_csv` with explicit `dtype` map and `parse_dates` list.
- Excel — `openpyxl` for `.xlsx`, but really pressure the source to send CSV / Parquet.
- JSON / JSONL — line-delimited JSON for streaming-friendly reads; flatten nested objects in Transform, not Extract.
- Schema validation — assert column names, types, and required-not-null status at ingest; fail loudly on drift.
Worked example. A nightly CSV export from a CRM that quietly renamed Email → email_address.
| date | columns extracted | result |
|---|---|---|
| 2026-05-09 | `Name, Email, Phone` | parses fine |
| 2026-05-10 | `Name, email_address, Phone` | column `Email` not found → loud error |
| 2026-05-10 (no validation) | `Name, email_address, Phone` | silently writes NULL into email — dashboard breaks |
Step-by-step.
- The CRM team renamed `Email` to `email_address` in a release note nobody read.
- The extract script asks for `df["Email"]`; pandas raises `KeyError`.
- With strict validation, the script alerts on-call and halts before writing bad data.
- Without validation, the script writes `NULL` for every email; the downstream dashboard's "users with no email" panel jumps from 0% to 100% overnight.
- The fix is to assert column presence and types at ingest — and to publish that schema to the source team as a contract.
Worked-example solution. A defensive CSV ingest with schema assertion:
import pandas as pd
EXPECTED = {"Name": "object", "Email": "object", "Phone": "object"}
df = pd.read_csv("crm_contacts.csv", dtype=EXPECTED)
missing = set(EXPECTED) - set(df.columns)
if missing:
raise RuntimeError(f"CRM CSV missing columns: {missing}")
# only after validation, write to bronze
df.to_parquet("/tmp/contacts.parquet", index=False)
Rule of thumb: every file source has a contract — declare it in code and fail loudly when the source breaks it.
Common beginner mistakes
- Pointing extract queries at the OLTP primary instead of a read replica or CDC stream.
- Skipping pagination — pulling page 1 of 100 and assuming the API gave you everything.
- Storing API secrets in code instead of a secrets manager / environment variable.
- Trusting the source schema without validation — silent drift becomes silent data loss.
- Forgetting to persist the cursor / watermark — failures force a full re-extract.
ETL Interview Question on Extracting from a Rate-Limited API
A pipeline needs to pull users data from a third-party API daily. The API returns 200 users per page, is paginated by next_cursor, and rate-limits at 60 requests / minute. The user base is ~50,000. Walk through the extract design that finishes in under 30 minutes without hitting rate-limit errors.
Solution Using Cursor-Based Pagination + Rate-Limit-Aware Backoff
Code solution.
import os, time, json, requests
API = "https://api.example.com/v1/users"
TOKEN = os.environ["API_TOKEN"]
PAGE = 200
out = []
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {TOKEN}"})
cursor = None
while True:
params = {"page_size": PAGE}
if cursor:
params["next_cursor"] = cursor
r = session.get(API, params=params)
if r.status_code == 429:
time.sleep(int(r.headers.get("Retry-After", 5)))
continue
r.raise_for_status()
body = r.json()
out.extend(body["users"])
cursor = body.get("next_cursor")
if not cursor:
break
# Stay well under 60 req/min — sleep 1.1s between requests
time.sleep(1.1)
with open("/tmp/users.jsonl", "w") as fh:
for u in out:
fh.write(json.dumps(u) + "\n")
Step-by-step trace for the 50k-user extract:
| metric | value |
|---|---|
| pages required | ⌈50,000 / 200⌉ = 250 |
| time per request (with 1.1s sleep) | ~1.4 s |
| total wall-clock time | 250 × 1.4 s ≈ 6 min |
| rate-limit headroom | ~43 req / min (well under the 60 cap) |
| failure mode handled | `429` → backoff per `Retry-After` |
| final output | one JSONL file with 50,000 user rows |
Output: a single JSONL file (users.jsonl) with one row per user. The cursor design means a retry resumes from the failure point, not the beginning. Total wall-clock: ~6 min, well under the 30-min budget. Zero rate-limit errors.
Why this works — concept by concept:
- Cursor pagination + `next_cursor` follow — extracts every user without skipping or duplicating pages.
- 1.1-second sleep between calls — keeps the request rate at ~43 req / min, comfortably under the 60-req / min limit.
- `429` → `Retry-After` backoff — handles bursty rate-limit events without crashing the pipeline.
- JSONL output — streaming-friendly; downstream Transform can read line-by-line without loading the whole file into memory.
- Persisted cursor between runs — extend the script to write `cursor` to S3 / a database so the next day resumes from there (a minimal sketch follows below).
- Cost — `O(|users| / page_size)` requests; bounded by the rate limit; failure recovery is `O(1)` cursor reload.
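A minimal sketch of that cursor persistence, using a local checkpoint file so a rerun resumes from the last completed page; the path is illustrative — in production the same two functions would read and write an S3 object or a database row instead:
import json, os

CHECKPOINT = "/tmp/users_extract_cursor.json"  # illustrative local path

def load_cursor():
    """Return the cursor saved by the last run, or None on the very first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as fh:
            return json.load(fh).get("cursor")
    return None

def save_cursor(cursor):
    """Persist the cursor so a crash or rerun resumes from the last completed page."""
    with open(CHECKPOINT, "w") as fh:
        json.dump({"cursor": cursor}, fh)

# In the pagination loop above:
#   cursor = load_cursor()   # replaces `cursor = None` before the while loop
#   save_cursor(cursor)      # after each page's rows are appended to `out`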
Inline CTA: more Python practice problems for API-extraction loops and the ETL practice page for full pipeline shapes.
3. Transform — cleaning, dedup, standardization, aggregation
Where raw data becomes useful — the meat of every ETL pipeline
The Transform stage is where 80% of an ETL pipeline's value is created. Raw data lands as-is, and Transform applies the cleaning, deduplication, type coercion, joining, and business-rule logic that turn it into something consumers can trust. In modern pipelines, Transform is usually SQL (dbt, Spark SQL) on top of a staged copy of the raw data — easy to test, easy to version, easy to backfill.
Pro tip: Transform logic should be idempotent and unit-testable. Every transformation is "pure" if its output depends only on its input — no `NOW()`, no random IDs, no calls to external services. That property is what lets you backfill, rerun, and refactor without fear.
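A small illustration of the difference, using the silver.orders shape from the examples below — the first query's output shifts every day it runs; the second is pinned to a run date the orchestrator passes in:
-- Not pure: the output depends on when the job happens to run
SELECT * FROM silver.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day';

-- Pure: the output depends only on the parameter; a backfill for any date returns the same rows
SELECT * FROM silver.orders
WHERE order_date = :run_date;  -- :run_date supplied by Airflow variables or dbt vars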
Cleaning and deduplication
The cleaning invariant: bronze data contains duplicates from CDC retries, retries on flaky network, and late-arriving records; silver must collapse them to one canonical row per business key, choosing the latest version when conflicts exist. The standard pattern is ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1.
- Duplicates — same business key with multiple physical rows.
- `ROW_NUMBER` dedup — partition by business key, order by latest `source_ts`, keep `rn = 1`.
- Bad rows — invalid types, broken refs, business-rule violations; quarantine to a separate `reject` table.
- `COALESCE` — replace nulls with a known default at the boundary, not inside business logic.
Worked example. Bronze orders arrives with two rows for order_id = 448 due to a CDC retry.
| order_id | source_ts | amount |
|---|---|---|
| 448 | 2026-05-11 09:30:00 | 500 |
| 448 | 2026-05-11 09:30:15 | 520 (corrected) |
| 449 | 2026-05-11 10:00:00 | 800 |
Step-by-step.
- Bronze has two rows for order 448 — the first one (500) is the original CDC event, the second (520) is the post-correction CDC event.
- `ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)` numbers them: the 09:30:15 row gets `rn = 1`, the 09:30:00 row gets `rn = 2`.
- For order 449, only one row → `rn = 1`.
- `WHERE rn = 1` keeps the latest version of order 448 (the corrected $520) and the only version of order 449.
- Silver now has one deterministic row per `order_id` — joins and aggregates produce the right numbers.
Worked-example solution. Dedup in SQL:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_ts DESC
) AS rn
FROM bronze.orders
WHERE ingest_date = '2026-05-11'
)
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts
FROM ranked
WHERE rn = 1;
Rule of thumb: dedup belongs in the silver layer and the silver layer alone; if downstream needs to dedup again, your silver contract is leaking.
Standardization — types, casing, dates, units
The standardization invariant: source systems disagree on date format, capitalisation, currency, and units; Transform converts everything to one canonical representation so downstream queries can compare them without LOWER() calls in every WHERE.
- Dates — `'11/05/26'`, `'2026-05-11'`, `'May 11'` → all become `DATE '2026-05-11'`.
- Casing — `'Ram'`, `'RAM'`, `'ram'` → `LOWER(TRIM(name))` → `'ram'`.
- Currency — multiple feeds in USD, EUR, INR → convert to one reporting currency in silver.
- Units — distance in miles vs km, weight in lb vs kg — canonicalise once at ingest.
Worked example. Three rows from three sources with three date formats:
| source | raw date | canonical date |
|---|---|---|
| Postgres app | `2026-05-11` (ISO) | `2026-05-11` |
| CRM CSV | `11/05/26` (DD/MM/YY) | `2026-05-11` |
| API JSON | `"May 11, 2026"` | `2026-05-11` |
Step-by-step.
- Postgres uses ISO 8601 (`YYYY-MM-DD`) natively — no transformation needed.
- The CRM exports DD/MM/YY — needs `TO_DATE(raw_date, 'DD/MM/YY')`.
- The API returns prose-style `"May 11, 2026"` — needs `TO_DATE(raw_date, 'Mon DD, YYYY')`.
- After Transform, every row has a single `DATE` value in the canonical column.
- Downstream SQL is then trivial: `WHERE order_date = '2026-05-11'` works across all three sources.
Worked-example solution. Standardize dates and casing in one pass:
SELECT
CASE
WHEN raw_date ~ '^\d{4}-\d{2}-\d{2}$' THEN raw_date::DATE
WHEN raw_date ~ '^\d{2}/\d{2}/\d{2}$' THEN TO_DATE(raw_date, 'DD/MM/YY')
WHEN raw_date ~ '^[A-Z][a-z]{2} \d{1,2}, \d{4}$' THEN TO_DATE(raw_date, 'Mon DD, YYYY')
ELSE NULL -- send to reject table
END AS order_date,
LOWER(TRIM(customer_name)) AS customer_key,
COALESCE(amount, 0)::NUMERIC(14, 2) AS amount
FROM bronze.orders_raw;
Rule of thumb: standardize at the bronze → silver boundary; never let two different formats coexist past that line.
Aggregation and business rules
The aggregation invariant: silver carries one row per source event; gold often carries pre-aggregated metrics for fast dashboard reads — daily revenue, monthly active users, average order value per cohort; the aggregation logic is a SQL GROUP BY plus business-rule CASE WHEN expressions.
- Daily totals — `SELECT date, SUM(amount) FROM silver.orders GROUP BY date`.
- `CASE WHEN` — classify rows into business buckets at aggregation time.
- Window aggregates — running totals, rolling averages, MoM deltas (sketched after the rule of thumb below).
- Cohort metrics — `GROUP BY signup_month, days_since_signup` for retention curves.
Worked example. A daily revenue rollup with a High Value business rule.
| date | order_count | total_revenue | high_value_count |
|---|---|---|---|
| 2026-05-09 | 4,210 | 1,051,200 | 38 |
| 2026-05-10 | 4,832 | 1,224,500 | 47 |
| 2026-05-11 | 5,118 | 1,387,420 | 52 |
Step-by-step.
- Silver `orders` has one row per order with `order_date`, `amount`, and `customer_id`.
- The aggregate `GROUP BY order_date` collapses to one row per date.
- `SUM(amount)` produces the daily total revenue.
- `COUNT(*) FILTER (WHERE amount > 10000)` produces the count of High Value orders for that day.
- The result lands in `gold.daily_revenue` for the dashboard to read in milliseconds — no full-table scan per page load.
Worked-example solution. Daily rollup with business rule:
INSERT INTO gold.daily_revenue
SELECT
order_date,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
COUNT(*) FILTER (WHERE amount > 10000) AS high_value_count
FROM silver.orders
WHERE order_date = '2026-05-11'
GROUP BY order_date
ON CONFLICT (order_date) DO UPDATE SET
order_count = EXCLUDED.order_count,
total_revenue = EXCLUDED.total_revenue,
high_value_count = EXCLUDED.high_value_count;
Rule of thumb: never recompute aggregates inside a BI tool when the warehouse can pre-compute them at load time.
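For the window-aggregate bullet above, a minimal sketch of a rolling 7-day average computed over the mart the rollup just produced (column names follow the rollup example):
SELECT
    order_date,
    total_revenue,
    AVG(total_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM gold.daily_revenue
ORDER BY order_date;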
Common beginner mistakes
- Skipping `ROW_NUMBER` dedup and assuming "the source won't send duplicates" — every CDC pipeline eventually retries.
- Mixing canonical and source-system date formats — every downstream query needs `WHERE` casts.
- Doing aggregation inside the BI tool instead of in the warehouse — slow dashboards, no reuse.
- Putting business rules in Transform code instead of declarative SQL — harder to test, harder to version.
- Forgetting to write rejected rows to a `reject` table — quietly losing data on cleanup.
ETL Interview Question on Cleaning Drifted Source Data
A CRM dumps a daily CSV with customer_name values like Ram, RAM, Ram (trailing space), and Ram@ (corrupted). The downstream dashboard counts distinct customers and currently reports 4 unique names when the truth is 1. Write the Transform step that produces a single canonical customer_key per real-world customer and quarantine the corrupted row.
Solution Using LOWER + TRIM + REGEXP + Reject-Table Quarantine
Code solution.
-- 1) Quarantine corrupted rows
INSERT INTO silver.customer_reject
SELECT raw_id, customer_name, 'invalid_char' AS reason, NOW() AS rejected_at
FROM bronze.customers
WHERE customer_name !~ '^[A-Za-z][A-Za-z .-]+$';
-- 2) Standardise the good rows into silver
INSERT INTO silver.customers (raw_id, customer_key, full_name)
SELECT
raw_id,
LOWER(TRIM(customer_name)) AS customer_key,
INITCAP(TRIM(customer_name)) AS full_name
FROM bronze.customers
WHERE customer_name ~ '^[A-Za-z][A-Za-z .-]+$'
ON CONFLICT (raw_id) DO UPDATE SET
customer_key = EXCLUDED.customer_key,
full_name = EXCLUDED.full_name;
-- 3) Verify
SELECT COUNT(DISTINCT customer_key) AS unique_customers
FROM silver.customers;
Step-by-step trace of the cleanup:
| raw row | regex pass? | customer_key | landed in |
|---|---|---|---|
| `Ram` | ✓ | `ram` | silver.customers |
| `RAM` | ✓ | `ram` | silver.customers (same key as above) |
| `Ram` (trailing space) | ✓ (TRIM strips the space) | `ram` | silver.customers (same key) |
| `Ram@` | ✗ (`@` not allowed) | — | silver.customer_reject |
Output: silver.customers has 3 rows mapped to a single customer_key = 'ram'; the dashboard now correctly reports 1 unique customer. The corrupted row sits in silver.customer_reject for the data steward to investigate.
Why this works — concept by concept:
- Quarantine before standardise — corrupted rows go to a separate table; clean rows enter silver; you never silently drop data.
- `LOWER(TRIM(...))` as the canonical key — collapses case + whitespace variants into one bucket.
- `INITCAP(TRIM(...))` for display name — produces a clean human-readable version while keeping the join key normalised.
- Regex gate `^[A-Za-z][A-Za-z .-]+$` — explicit allow-list of valid name characters; everything else routes to reject.
- Idempotent `INSERT ... ON CONFLICT` — rerunning produces the same final state; backfills are safe.
- Cost — single linear scan over bronze; reject volume is observable as a metric (alert when >1% of rows reject).
Inline CTA: drill the SQL practice page for cleanup queries and the ETL practice page for staged-transform patterns.
4. Load — destinations from warehouses to BI tools
Where clean data lands — warehouse, lake, lakehouse, or directly into a dashboard
The Load stage writes the curated data to one or more destinations: a cloud data warehouse (Snowflake, Redshift, BigQuery), a data lake (S3, GCS), a lakehouse (Iceberg / Delta on object storage), or directly into a BI tool's cache. The right destination depends on the access pattern — interactive analyst SQL favours warehouses; ML feature stores favour lakes; cross-team reuse favours a shared lakehouse.
Pro tip: Loads should be partitioned and idempotent. Partition by `ingest_date` so backfills touch only the affected day, and use `INSERT OVERWRITE PARTITION` or `MERGE` so reruns don't duplicate rows.
Data warehouses — Redshift, Snowflake, BigQuery
The warehouse-load invariant: cloud warehouses (Redshift, Snowflake, BigQuery) prefer bulk loads from object storage; the canonical commands are `COPY INTO` (Snowflake), `COPY` (Redshift), and `LOAD DATA` / `bq load` (BigQuery); never use single-row `INSERT INTO ... VALUES` for production loads — it's 10-100× slower and defeats columnar storage.
- `COPY INTO` / `COPY` (Snowflake / Redshift) — bulk-load Parquet / CSV / JSON from S3 / GCS in parallel.
- `bq load` (BigQuery) — same shape; loads from GCS with auto-detect schema.
- File splitting — split source into `N × num_slices` files for parallel ingest.
- `COMPUPDATE ON` — auto-pick column compression on first load (Redshift); Snowflake does this automatically.
Worked example. Daily load of 50 GB of CSV from S3 into Snowflake.
| step | command | wall-clock |
|---|---|---|
| 1. Stage | `PUT file://... @stage` (already in S3) | n/a |
| 2. Copy | `COPY INTO orders FROM @stage/2026-05-11/` | ~3 min |
| 3. Verify | `SELECT COUNT(*) FROM orders WHERE …` | <1 s |
Step-by-step.
- The Transform stage has already written 40 Parquet files of ~1.25 GB each to `s3://lake/silver/orders/ingest_date=2026-05-11/`.
- The warehouse stage points at that S3 prefix via an `EXTERNAL STAGE`.
- `COPY INTO orders FROM @stage/2026-05-11/` ingests all 40 files in parallel across the warehouse's compute slices.
- The whole load finishes in 2-3 minutes — vs hours for the row-by-row `INSERT` approach.
- `STATUPDATE ON` (Redshift) or auto-stats (Snowflake) refreshes the planner so subsequent queries pick the right plan.
Worked-example solution. A Snowflake COPY INTO for daily orders:
COPY INTO silver.orders
FROM @lake_stage/silver/orders/ingest_date=2026-05-11/
FILE_FORMAT = (TYPE = PARQUET)
ON_ERROR = 'ABORT_STATEMENT'
PURGE = FALSE;
Rule of thumb: every production load uses bulk COPY; reserve single-row INSERT for tests and one-off corrections.
Data lakes — S3, GCS, ADLS with Iceberg / Delta
The lake-load invariant: a data lake load is just a write to object storage in a columnar file format (Parquet / ORC); a lakehouse load wraps that write in a table format (Iceberg / Delta) that adds ACID, time travel, and partition evolution; both patterns scale storage and compute independently.
- Plain lake — write Parquet to a prefix; register with a catalog (Glue, Hive Metastore).
- Lakehouse — `INSERT INTO iceberg.orders` or `MERGE INTO delta.orders` — ACID on object storage.
- Partitioning — by `ingest_date` or `event_date` for prune-friendly reads.
- Compaction — periodic batch job rewrites many small files into fewer large ones.
Worked example. A Spark Structured Streaming job that writes micro-batches to an Iceberg table.
| time | event count in batch | files written | total bytes |
|---|---|---|---|
| 09:00 | 1,200 | 1 | 8 MB |
| 09:01 | 980 | 1 | 7 MB |
| 09:02 | 1,350 | 1 | 9 MB |
Step-by-step.
- The Spark streaming job reads events from Kafka in 1-minute trigger windows.
- Each batch writes one Parquet file (~8 MB) to the Iceberg table's S3 location.
- The Iceberg metadata layer records a new snapshot per batch — ACID is preserved across concurrent writers.
- After an hour, the table has 60 small files; a nightly compaction job rewrites them into a single ~500 MB file for better read performance.
- Trino, Spark, and Snowflake (via Iceberg external tables) can all read the same data without copying.
Worked-example solution. A Spark write to Iceberg:
events.writeStream \
.format("iceberg") \
.outputMode("append") \
.option("path", "s3://lake/lakehouse/orders") \
.option("checkpointLocation", "s3://lake/checkpoints/orders") \
.trigger(processingTime="1 minute") \
.start()
Rule of thumb: if multiple engines need to read the same data (Spark + Trino + Snowflake), use a lakehouse table format; if only the warehouse reads it, a managed warehouse table is simpler.
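A sketch of the nightly compaction step mentioned above, using Iceberg's rewrite_data_files procedure from Spark SQL; the catalog name and target file size are assumptions to adapt to your setup:
-- Rewrite the many small streaming files into ~512 MB files for faster reads
CALL lake_catalog.system.rewrite_data_files(
    table   => 'lakehouse.orders',
    options => map('target-file-size-bytes', '536870912')
);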
BI tools and serving layers
The serving-load invariant: BI tools (Tableau, Power BI, Looker, Metabase) read from the warehouse / lakehouse; the load stage's job is to materialise the exact shape the dashboard expects so the BI tool runs sub-second queries.
- Pre-aggregated marts — gold-layer tables shaped for one dashboard each.
- Materialised views — warehouse-native auto-refresh of frequently-queried aggregates.
- Caching layer — BI tools cache for 5-60 min after the load completes.
- Reverse-ETL — push curated data back to operational systems (Salesforce, HubSpot).
Worked example. A gold.daily_revenue mart powering the CFO dashboard.
| dashboard query | source table | rows scanned | response time |
|---|---|---|---|
| Yesterday's revenue | `gold.daily_revenue` | 1 | <100 ms |
| Last 30 days | `gold.daily_revenue` | 30 | <200 ms |
| Without the mart (raw silver) | `silver.orders` | 5 M | ~5 s |
Step-by-step.
- The CFO dashboard wants "revenue per day for the last 30 days" — a tiny output but a huge underlying scan.
- Without a gold mart, the BI tool would aggregate 5 M silver rows on every refresh — ~5 s per refresh.
- With a `gold.daily_revenue` mart (one row per day), the BI tool reads 30 rows in <200 ms.
- The Load step writes one row per day to `gold.daily_revenue` after the silver layer finishes.
- End-to-end: ETL produces the mart; the BI tool reads the mart; the CFO sees a sub-second dashboard.
Worked-example solution. The Load step that produces the daily mart:
INSERT INTO gold.daily_revenue (date_key, revenue_total, order_count)
SELECT
order_date AS date_key,
SUM(amount) AS revenue_total,
COUNT(*) AS order_count
FROM silver.orders
WHERE order_date = '2026-05-11'
GROUP BY order_date
ON CONFLICT (date_key) DO UPDATE SET
revenue_total = EXCLUDED.revenue_total,
order_count = EXCLUDED.order_count;
Rule of thumb: every dashboard should read from a gold-layer mart, never from silver — fast dashboards keep stakeholder trust.
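Where the warehouse supports it, the same mart can also be expressed as a materialized view — a sketch using the column names above; refresh behaviour varies by engine, so treat this as a shape rather than a recipe:
-- Alternative to the INSERT-based mart: declare the aggregate once
CREATE MATERIALIZED VIEW gold.daily_revenue_mv AS
SELECT
    order_date AS date_key,
    SUM(amount) AS revenue_total,
    COUNT(*)    AS order_count
FROM silver.orders
GROUP BY order_date;

-- Engines without automatic maintenance (e.g. PostgreSQL) refresh it after each load
REFRESH MATERIALIZED VIEW gold.daily_revenue_mv;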
Common beginner mistakes
- Loading row-by-row with `INSERT INTO ... VALUES` instead of bulk `COPY INTO`.
- Forgetting to partition the destination table — queries scan the whole table for one day's data.
- Skipping `ON CONFLICT` / `MERGE` and discovering duplicates on the second run.
- Letting BI tools query silver directly — slow dashboards, no reuse.
- Not refreshing planner statistics after load — wrong join plans, 10-100× slower queries.
ETL Interview Question on Choosing a Load Destination
A retail company has 50 TB of clickstream events generated daily, plus a 5 GB curated gold.fact_orders table that powers BI dashboards. For each, recommend the load destination and the load command — and justify the choice.
Solution Using Lakehouse for Clickstream + Managed Warehouse for Gold
Code solution.
CLICKSTREAM (50 TB / day)
Destination: S3 + Iceberg (lakehouse)
Command: Spark Structured Streaming with .writeStream.format("iceberg")
Reason: too big for managed warehouse storage cost; date-partition-pruned access; ML reuse
GOLD.FACT_ORDERS (5 GB)
Destination: Snowflake managed table
Command: COPY INTO gold.fact_orders FROM @stage/orders/...
Reason: hot, joined, sub-second dashboard reads; ACID; analyst-friendly SQL ergonomics
CROSS-LAYER JOIN
SELECT c.region, COUNT(*) AS clicks, SUM(o.amount) AS revenue
FROM lake.clickstream c
JOIN gold.fact_orders o ON o.user_id = c.user_id
WHERE c.event_date = '2026-05-11'
GROUP BY c.region;
Step-by-step trace of the architectural decision:
| step | question | answer |
|---|---|---|
| 1 | What's the data volume? | clickstream 50 TB, gold 5 GB |
| 2 | Hot or cold? | clickstream cold (date-pruned access); gold hot (every dashboard refresh) |
| 3 | Single-engine or multi-engine? | clickstream needs ML (Spark) + analyst SQL (Trino); gold only needs warehouse SQL |
| 4 | Pick destination for clickstream | S3 + Iceberg (lakehouse) — open format, ACID, multi-engine |
| 5 | Pick destination for gold | Snowflake managed — fast, ergonomic, ACID across multi-table updates |
| 6 | Cross-layer joins? | yes — Snowflake reads Iceberg via external tables |
Output: the clickstream lakehouse holds 50 TB in S3 at ~$1,150/month; the gold warehouse holds 5 GB in Snowflake at ~$200/month; the BI tool reads gold in <200 ms; the ML pipeline reads clickstream features directly from S3 without copying.
Why this works — concept by concept:
- Lakehouse for volume + ML reuse — 50 TB at warehouse storage cost would be ~$5K/month; on S3 it's ~$1,150 and ML can read the files directly without an export step.
- Managed warehouse for hot SQL — gold is small, hot, frequently joined; warehouse storage cost is negligible at 5 GB; SQL ergonomics + sub-second response is what BI users need.
- Iceberg as the open boundary — Snowflake reads Iceberg natively; no nightly copy job, no schema drift between systems.
- Spark Structured Streaming for clickstream — micro-batch writes; ACID via Iceberg; replay-friendly.
- `COPY INTO` for gold — bulk Parquet load; sub-3-min load time; auto-compression.
- Cost — each layer carries its own cost characteristic; the right destination per workload is what keeps the AWS bill sane.
Inline CTA: more ETL practice problems for load patterns and the dimensional modeling practice page for star-schema design.
5. ETL vs ELT — transform before or after loading
When to push transforms to the warehouse vs run them upstream
The ETL vs ELT distinction is the single most-asked architecture question in fresher data-engineering interviews. The mental model: ETL transforms data before loading it into the warehouse — clean Python / Spark jobs land curated rows in the destination; ELT loads raw data into the warehouse first, then transforms it using the warehouse's own SQL engine (dbt is the canonical example). Modern cloud warehouses tilt the answer toward ELT because compute is elastic and SQL is the most-debugged transform language on earth.
Pro tip: the textbook "ETL vs ELT" answer is "depends on the warehouse" — but in interviews, name the specific feature that flips the choice. ELT wins when the warehouse has elastic compute (Snowflake's separate warehouses, BigQuery's slots) and a mature SQL transform layer (dbt). ETL wins when the warehouse can't afford the raw-data storage cost or when transforms are non-SQL (image processing, ML feature engineering).
ETL — transform before loading
The ETL invariant: raw data is transformed by an upstream compute layer (Python, Spark, custom services) into curated rows that land directly in the warehouse; the warehouse holds only the cleaned data; transforms happen on dedicated compute (often cheaper than warehouse credits).
- Compute layer — Spark cluster, Python on Kubernetes, AWS Glue, custom services.
- Warehouse holds — only the cleaned silver / gold tables.
- Storage cost — lower (no raw data in the warehouse).
- Best fit — legacy warehouses without elastic compute, non-SQL transforms (ML features, image pipelines).
Worked example. A Python pipeline that cleans CSV → loads cleaned Parquet → Snowflake.
| stage | runs on | data shape |
|---|---|---|
| Extract | Python on EC2 | raw CSV (10 GB) |
| Transform | Python + pandas (4 CPUs) | cleaned Parquet (3 GB compressed) |
| Load | `COPY INTO snowflake.gold.orders` | warehouse holds 3 GB |
Step-by-step.
- A Python script reads the raw CSV from S3 into a pandas DataFrame.
- The script applies cleaning (`dropna`, type coercion, business rules) — all in Python memory.
- The script writes the cleaned data as Parquet back to S3.
- A `COPY INTO` ships the Parquet into Snowflake — only the cleaned 3 GB lands in the warehouse.
- The warehouse never sees the raw CSV; storage cost is bounded by the curated output.
Worked-example solution. A minimal ETL outline:
import pandas as pd
import snowflake.connector
# Extract
df = pd.read_csv("s3://lake/raw/orders.csv")
# Transform
df = df.drop_duplicates(subset=["order_id"]).dropna(subset=["amount"])
df["amount"] = df["amount"].astype(float).round(2)
df.to_parquet("s3://lake/staging/orders.parquet", index=False)
# Load — Snowflake COPY INTO from the staged Parquet
ctx = snowflake.connector.connect(...)
ctx.cursor().execute("""
COPY INTO gold.orders
FROM @lake_stage/staging/orders.parquet
FILE_FORMAT = (TYPE = PARQUET);
""")
Rule of thumb: ETL is the right shape when transforms need a non-SQL runtime (Python ML features, image processing, custom validation services).
ELT — load first, transform inside the warehouse
The ELT invariant: raw data is loaded directly into the warehouse with minimal transformation; transforms run as SQL inside the warehouse (typically orchestrated by dbt); the warehouse's elastic compute and parallel SQL engine handle the transform workload at scale.
- Tools — dbt (canonical SQL transformation framework), Dataform, custom SQL.
- Pattern — `raw.orders` → `silver.orders` (dbt model) → `gold.fact_orders` (dbt model).
- Storage cost — higher (warehouse holds both raw and curated data).
- Best fit — modern cloud warehouses with elastic compute (Snowflake, BigQuery, Databricks SQL).
Worked example. A dbt project that builds silver and gold from raw.
| layer | dbt model | runs in |
|---|---|---|
| raw | direct `COPY INTO raw.orders` | Snowflake |
| silver | `models/silver/silver_orders.sql` | Snowflake SQL via dbt |
| gold | `models/gold/fact_orders.sql` | Snowflake SQL via dbt |
Step-by-step.
- The raw CSV is loaded straight into `raw.orders` via `COPY INTO` — no Python in the middle.
- A dbt model `silver_orders.sql` reads from `raw.orders` and applies dedup + type coercion as SQL.
- A downstream dbt model `fact_orders.sql` reads from `silver_orders` and applies aggregation.
- dbt runs the whole DAG on the warehouse's compute; transforms are SQL, version-controlled, testable.
- The whole "transform" stage is a `dbt run` command — minutes for hundreds of models.
Worked-example solution. A dbt silver model:
-- models/silver/silver_orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT
order_id,
customer_id,
amount::NUMERIC(14, 2) AS amount,
source_ts AS source_ts,
CURRENT_TIMESTAMP AS silver_loaded_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE source_ts > (SELECT MAX(source_ts) FROM {{ this }})
{% endif %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
Rule of thumb: ELT is the modern default; reach for it whenever your warehouse has elastic compute (Snowflake, BigQuery) and your transforms can be expressed as SQL.
When ELT beats ETL — modern cloud warehouses
The cloud-warehouse invariant: elastic-compute warehouses (Snowflake's virtual warehouses, BigQuery's slot model, Redshift Concurrency Scaling) let you spin up and tear down transform compute on demand; the marginal cost of a transform job is small; SQL transforms become first-class with dbt.
- Snowflake — separate virtual warehouses per workload; `XS` for cheap transforms, `XL` for hourly loads.
- BigQuery — slot-based pricing; transforms run on the same pool as queries.
- Redshift — Concurrency Scaling for elastic warehouse compute.
- Databricks SQL — serverless SQL warehouse + Spark for non-SQL transforms.
Worked example. Same daily transform run two ways.
| approach | compute | wall-clock | monthly cost |
|---|---|---|---|
| ETL (Python on EC2) | dedicated EC2 instance | 30 min | $300 (always-on) |
| ELT (dbt on Snowflake XS warehouse) | warehouse, on-demand | 8 min | $40 (pay-per-second) |
Step-by-step.
- The ETL approach runs a dedicated Python service on EC2 — the server is always on, even when the pipeline isn't running.
- The ELT approach runs `dbt run` against a Snowflake XS warehouse — the warehouse spins up when the job starts and suspends when it finishes.
- Monthly cost: $300 (always-on EC2) vs $40 (pay-per-second warehouse).
- The SQL transforms in dbt are version-controlled, tested, and reviewable in PRs — easier collaboration than Python ETL.
- For most analytical workloads on modern warehouses, ELT is cheaper, faster, and easier to maintain.
Worked-example solution. A dbt project layout that captures the ELT pattern:
my_dbt_project/
├── dbt_project.yml
├── models/
│ ├── staging/
│ │ ├── stg_orders.sql -- raw → typed
│ │ └── stg_customers.sql
│ ├── silver/
│ │ ├── silver_orders.sql -- typed → deduped
│ │ └── silver_customers.sql
│ └── gold/
│ ├── fact_orders.sql -- deduped → aggregated
│ └── dim_customer.sql
├── tests/
│ └── orders_amount_positive.sql -- assertion: amount > 0
└── snapshots/
└── customer_history.sql -- SCD2 snapshots
Rule of thumb: on Snowflake / BigQuery, ELT with dbt is the modern default for ~80% of analytical pipelines.
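The tests/ entry in the layout above corresponds to a dbt singular test — any SELECT that returns rows fails the run. A minimal sketch of the assertion named in the tree, referencing the silver model from earlier:
-- tests/orders_amount_positive.sql
-- dbt fails the build if this query returns any rows
SELECT order_id, amount
FROM {{ ref('silver_orders') }}
WHERE amount <= 0 OR amount IS NULL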
Common beginner mistakes
- Saying "ELT is always better" — false; non-SQL transforms (ML features, image processing) still need ETL.
- Loading raw data straight into gold tables — skips the silver layer's dedup + type coercion contract.
- Running ELT on a legacy on-prem warehouse without elastic compute — transforms compete with analyst queries.
- Treating dbt as a "SQL runner" — it's also a testing framework, a docs generator, and a lineage tool.
- Not separating raw / silver / gold layers in dbt — every model becomes a tangle.
ETL Interview Question on ETL vs ELT Pattern Selection
A team is building a new analytics platform. Sources: 200 GB / day of clickstream events (Kafka), a 5 GB Postgres OLTP database, and a 1 GB nightly SaaS export. Warehouse choice: Snowflake. For each source, recommend ETL or ELT and justify.
Solution Using ELT for Most + ETL for Heavy-Compute Streaming
Code solution.
CLICKSTREAM (200 GB / day, Kafka)
Pattern: ETL upstream + ELT downstream
Why: volume is too high for raw-to-warehouse; Spark streaming pre-aggregates and lands hourly batches
Tools: Spark Structured Streaming → Iceberg → Snowflake external table
POSTGRES OLTP (5 GB)
Pattern: ELT (raw load + dbt transform)
Why: small, schema-stable, SQL transforms natural
Tools: Fivetran CDC → Snowflake raw schema → dbt models for silver/gold
SAAS NIGHTLY EXPORT (1 GB)
Pattern: ELT
Why: tiny, low-frequency; SQL-friendly transforms
Tools: Airbyte → Snowflake raw schema → dbt models
Step-by-step trace of the pattern selection:
| source | volume | freshness | compute | pattern |
|---|---|---|---|---|
| Clickstream | 200 GB / day | sub-minute | Spark streaming | ETL upstream + ELT downstream |
| Postgres OLTP | 5 GB | daily | warehouse SQL | ELT (Fivetran + dbt) |
| SaaS export | 1 GB | daily | warehouse SQL | ELT (Airbyte + dbt) |
Output: a hybrid architecture — Spark handles the high-volume streaming pre-aggregation (ETL pattern), then Snowflake + dbt handles the curated SQL transforms (ELT pattern) for everything that lands in the warehouse.
Why this works — concept by concept:
- ETL for clickstream — 200 GB / day raw into the warehouse is expensive; Spark pre-aggregates to a more compact form first.
- ELT for OLTP + SaaS — small data, SQL-friendly transforms, version-controlled in dbt.
- Fivetran / Airbyte for raw load — managed connectors handle schema evolution and incremental sync.
- dbt for transforms — SQL is reviewable, testable, and runs on Snowflake's elastic compute.
- Iceberg + Snowflake external tables — clickstream stays queryable from Snowflake without copying.
- Cost — Spark cluster for streaming (always-on); Snowflake compute pay-per-second for transforms; total bill stays bounded.
Inline CTA: for the structured warehouse-and-transform path see ETL System Design for Data Engineering Interviews and the related data lake architecture for data engineering interviews blog.
6. ETL orchestration tools — Airflow, dbt, Spark, AWS Glue
The tools that turn an ETL script into a production pipeline
A single Python script that runs once is not a pipeline — it's a one-off job. Real production ETL needs scheduling, dependency management, retries, alerts, lineage, and observability, and the modern stack is built around a handful of tools that each solve one slice of that problem: Apache Airflow for orchestration / DAGs, dbt for SQL transforms / tests / docs, Apache Spark for distributed batch + streaming, AWS Glue / Azure Data Factory / Dataflow for managed serverless ETL, and Fivetran / Airbyte for managed source connectors.
Pro tip: the modern ETL stack is layered, not monolithic — Airflow orchestrates dbt, dbt invokes Spark, Spark reads from S3, and Fivetran handles the source connectors. Knowing which tool owns which slice (and why) is the senior signal in any ETL design round.
Apache Airflow — DAG orchestration and scheduling
The Airflow invariant: Airflow defines pipelines as Python DAGs (Directed Acyclic Graphs); each DAG is a graph of tasks with explicit dependencies; the scheduler triggers DAGs on cron / sensor / event; failures retry per task with exponential backoff. It's the canonical orchestrator for batch ETL.
- DAG — Python file that declares tasks + dependencies + schedule.
- Operators — `BashOperator`, `PythonOperator`, `SnowflakeOperator`, `DbtRunOperator`, ...
- Sensors — wait for an external event (S3 file arrival, table refresh, API webhook); a sensor sketch follows the DAG example below.
- XCom — pass small values between tasks; for large data, use object storage as the boundary.
Worked example. A 4-task daily DAG: extract → transform → reconcile → notify.
| task | runs | depends on |
|---|---|---|
| `extract_postgres` | 02:00 daily | — |
| `transform_dbt` | after extract | `extract_postgres` |
| `reconcile_counts` | after transform | `transform_dbt` |
| `notify_slack` | after reconcile | `reconcile_counts` |
Step-by-step.
- The Airflow scheduler triggers the DAG at 02:00.
- `extract_postgres` runs a `BashOperator` invoking a Python script — extracts yesterday's rows to S3.
- Once it succeeds, `transform_dbt` runs `dbt run` against the warehouse.
- `reconcile_counts` queries silver vs source for drift > 0.1%; if it fails, the DAG halts and pages on-call.
- On success, `notify_slack` posts a "✓ pipeline complete" message; the BI dashboard refreshes.
Worked-example solution. A minimal Airflow DAG:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "daily_orders_etl",
    schedule_interval="0 2 * * *",  # 02:00 UTC daily
    start_date=datetime(2026, 5, 1),
    catchup=False,
    tags=["orders"],
) as dag:
    extract = BashOperator(task_id="extract_postgres",
                           bash_command="python /opt/scripts/extract.py")
    transform = BashOperator(task_id="transform_dbt",
                             bash_command="cd /opt/dbt && dbt run --target prod")
    reconcile = BashOperator(task_id="reconcile_counts",
                             bash_command="python /opt/scripts/reconcile.py")
    notify = BashOperator(task_id="notify_slack",
                          bash_command="python /opt/scripts/notify.py 'Pipeline OK'")

    extract >> transform >> reconcile >> notify
```
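The invariant above also promises per-task retries with exponential backoff and event triggers via sensors, neither of which appears in the minimal DAG. Here is a hedged sketch of how they are usually wired in; the retry values, bucket name, and key pattern are illustrative assumptions rather than part of the pipeline above.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Illustrative retry policy; tune the numbers per task.
default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # first wait
    "retry_exponential_backoff": True,     # then 10 min, 20 min, ...
    "max_retry_delay": timedelta(hours=1),
}

with DAG(
    "daily_orders_etl_with_sensor",
    schedule_interval="0 2 * * *",
    start_date=datetime(2026, 5, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    # Sensor: block until the upstream export lands in S3 (bucket and key are hypothetical).
    wait_for_export = S3KeySensor(
        task_id="wait_for_orders_export",
        bucket_name="example-raw-bucket",
        bucket_key="exports/orders/{{ ds }}/_SUCCESS",
        poke_interval=300,      # check every 5 minutes
        timeout=6 * 60 * 60,    # give up (and fail the task) after 6 hours
    )
    transform = BashOperator(
        task_id="transform_dbt",
        bash_command="cd /opt/dbt && dbt run --target prod",
    )

    wait_for_export >> transform
```

Putting the retry policy in default_args keeps it in one place: every task in the DAG inherits the same backoff behaviour unless it overrides it.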
Rule of thumb: Airflow is the right choice for batch-scheduled pipelines with explicit dependencies. For event-driven streaming pipelines, look at Prefect, Dagster, or native cloud orchestrators.
dbt and Apache Spark — SQL transforms and distributed processing
The dbt invariant: dbt is a SQL transformation framework that runs models (SELECT statements) against your warehouse, with built-in tests, lineage graphs, and documentation. The Spark invariant: Spark is a distributed compute engine that runs Python / Scala / SQL transforms across a cluster of machines, scaling from gigabytes to petabytes.
- dbt — models/, tests/, snapshots/, seeds/; runs on Snowflake / BigQuery / Redshift / Databricks SQL.
- Spark — pyspark.sql, structured streaming, MLlib; reads from S3, Iceberg, Delta, Hive.
- When dbt wins — SQL transforms on a managed warehouse; small-to-medium-scale analytics.
- When Spark wins — non-SQL transforms (Python ML, image / log processing), petabyte-scale data.
Worked example. A dbt model + a Spark job for the same logical transform.
| approach | code | runs on | best for |
|---|---|---|---|
| dbt | models/silver_orders.sql | Snowflake | SQL-first analytics |
| Spark | pyspark script | EMR / Databricks cluster | petabyte-scale or custom Python |
Step-by-step.
- For a 1 GB dedup transform, dbt + Snowflake is simpler — write SQL, push to git, run dbt run.
- For a 1 TB transform with custom Python UDFs (ML features), Spark on EMR is the better tool.
- dbt is declarative — write the desired output as SELECT, let dbt handle the materialisation strategy.
- Spark is imperative — write the transform as code, hand-tune partitioning and caching.
- The two tools coexist: dbt for SQL warehouse layers, Spark for upstream lake processing.
Worked-example solution. A Spark transform that dedups orders:
```python
from pyspark.sql import functions as F, Window

orders = spark.read.parquet("s3://lake/bronze/orders/ingest_date=2026-05-11/")

w = Window.partitionBy("order_id").orderBy(F.col("source_ts").desc())
deduped = (
    orders.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn")
)

deduped.write.mode("overwrite").parquet(
    "s3://lake/silver/orders/ingest_date=2026-05-11/"
)
```
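For comparison, the dbt side of the same logical dedup would live in the models/silver_orders.sql file named in the table above. This is a sketch, assuming a bronze.orders source with order_id and source_ts columns and a Snowflake target (QUALIFY is Snowflake-style window filtering):

```sql
-- models/silver_orders.sql (illustrative; source and column names are assumptions)
select *
from {{ source('bronze', 'orders') }}
qualify row_number() over (
    partition by order_id
    order by source_ts desc
) = 1
```

dbt materialises this SELECT as a table or view according to the model config; the dedup logic itself stays declarative.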
Rule of thumb: use dbt for SQL transforms on the warehouse; reach for Spark when transforms are non-SQL or when data outgrows warehouse compute budgets.
Managed services — AWS Glue, Fivetran, Airbyte
The managed-service invariant: managed ETL services (AWS Glue, Azure Data Factory, GCP Dataflow, Fivetran, Airbyte) trade flexibility for operational simplicity — they handle the infrastructure, retries, scaling, and connector maintenance so the team writes business logic, not boilerplate.
- AWS Glue — managed serverless Spark; auto-scales; AWS-native pricing.
- Fivetran — managed connectors from 300+ SaaS sources to warehouses; usage-based pricing.
- Airbyte — open-source alternative to Fivetran; self-hosted or managed.
- Dataflow / ADF — GCP / Azure equivalents.
Worked example. Pulling Stripe charges via Fivetran vs writing a custom script.
| approach | time to deploy | maintenance | cost |
|---|---|---|---|
| Custom Python | ~2 weeks | high (rate-limit + schema-drift bugs) | engineer time |
| Fivetran connector | ~2 hours | low (managed) | ~$200/month for 100K MAR |
Step-by-step.
- The custom approach: write a Python script, handle pagination, build retries, deploy to Airflow, write tests, monitor schema drift, fix bugs for years.
- Fivetran approach: click "Connect Stripe", paste API key, choose destination warehouse, click "Sync". Done in 2 hours.
- Custom code wins when the source is custom or volume is huge; Fivetran wins for the 80% case (standard SaaS sources).
- The choice depends on team size and operational maturity — a 3-person data team is better off paying Fivetran than burning engineer hours.
- Hybrid is common: Fivetran for standard sources, custom Python / Spark for unique ones.
Worked-example solution. AWS Glue ETL skeleton (Spark-based, managed):
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

orders = glueContext.create_dynamic_frame.from_catalog(
    database="bronze_db", table_name="orders"
)
cleaned = DropNullFields.apply(frame=orders)

glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://lake/silver/orders/"},
    format="parquet",
)
```
Rule of thumb: managed services are the right starting point — only go custom when a specific source or transform genuinely needs it.
Common beginner mistakes
- Building a custom orchestrator instead of using Airflow (or Prefect / Dagster) — wasted years.
- Running dbt locally only — needs a deployment story (Airflow, dbt Cloud, GitHub Actions).
- Spinning up a Spark cluster for a 1 GB transform that dbt + Snowflake would do faster.
- Writing custom Stripe / Salesforce / HubSpot connectors when Fivetran / Airbyte handle them.
- Skipping monitoring / lineage — the first incident is when you learn what observability you needed.
ETL Interview Question on Choosing an Orchestration Stack
A 4-person data team is bootstrapping a new analytics platform on Snowflake. Sources: 10 SaaS tools (Stripe, Salesforce, HubSpot, …), a Postgres OLTP database, and 50 GB / day of clickstream from Kafka. Pick the orchestration / ingestion / transform tools and explain the choices.
Solution Using Fivetran + Airflow + dbt + Spark Streaming
Code solution.
```
INGESTION
├── SaaS sources (Stripe, Salesforce, HubSpot, ...) → Fivetran → Snowflake raw.* schemas
├── Postgres OLTP → Fivetran CDC → Snowflake raw.postgres_*
└── Kafka clickstream → Spark Structured Streaming → Iceberg → Snowflake external table

ORCHESTRATION
└── Airflow (managed via MWAA or Astronomer)
    - Daily DAG: trigger dbt run after Fivetran loads
    - Sensor: wait for S3 clickstream Parquet arrival, then trigger dbt clickstream model

TRANSFORM
└── dbt (Cloud or self-hosted)
    - models/staging/  ← raw → typed
    - models/silver/   ← typed → deduped + conformed
    - models/gold/     ← curated marts for BI

OBSERVABILITY
├── Airflow alerts → Slack on DAG failure
├── dbt tests → fail loudly on schema / quality assertions
└── Snowflake account_usage → cost & query monitoring dashboard
```
Step-by-step trace of the architectural decisions:
| concern | answer | reason |
|---|---|---|
| SaaS ingestion | Fivetran | 10 sources × 2 weeks of custom code per source = 20 engineer-weeks; Fivetran handles it for ~$1K/month |
| OLTP ingestion | Fivetran CDC | zero impact on source; sub-minute freshness |
| Clickstream ingestion | Spark Structured Streaming + Iceberg | 50 GB / day too big for Fivetran's pricing tier |
| Orchestration | Airflow (MWAA) | mature DAG semantics; team can hire engineers who know it |
| Transform | dbt | SQL-first; version-controlled; testable; runs on Snowflake's elastic compute |
| Storage | Snowflake + Iceberg lakehouse | warehouse for curated; lakehouse for high-volume clickstream |
Output: an end-to-end stack that the 4-person team can build in ~2 months. Fivetran handles 80% of ingestion; Spark covers the streaming edge case; Airflow + dbt provide a clean orchestration + transform layer; Snowflake is the warehouse. Total operational burden is low; engineering time goes into business logic, not boilerplate.
Why this works — concept by concept:
- Fivetran for standard SaaS / OLTP — buys 10 connectors for the price of one engineer; bug-fixes are someone else's problem.
- Spark for streaming clickstream — 50 GB / day is outside Fivetran's sweet spot; Spark + Iceberg handles it cleanly.
- Airflow as the orchestrator — mature, hireable, integrates with everything.
- dbt for SQL transforms — version control, tests, lineage all built in; reviewable in PRs.
- Snowflake + Iceberg — warehouse for hot curated; lakehouse for high-volume cold; cross-layer joins via external tables.
- Cost — Fivetran ~$1K/month, Snowflake ~$3-5K/month, Airflow (MWAA) ~$300/month, Spark cluster ~$1K/month → total ~$5-7K/month for a 4-person team's full stack.
Inline CTA: for the structured pipeline-design path see ETL System Design for Data Engineering Interviews and PySpark Fundamentals.
7. Building a Python pandas ETL pipeline
A runnable end-to-end example you can adapt
To make all of the above concrete, here's a runnable Python pandas ETL pipeline that pulls a CSV of orders, cleans + deduplicates the data, and writes a curated Parquet ready for warehouse load. It's short enough to read in five minutes and structured enough to extend into a production pipeline.
Pro tip: pandas is great for the first 1-10 GB of data; beyond that, you want pyspark, polars, or duckdb. The shape of the pipeline is the same; only the engine changes.
Setup — installing pandas and pyarrow
The setup invariant: pandas reads / writes CSV, JSON, and Parquet natively; pyarrow is the Parquet engine pandas calls under the hood; install both with pip install pandas pyarrow. Everything else (S3 access, database connectors) is optional.
- pandas — DataFrame manipulation; the backbone of single-node Python ETL.
- pyarrow — Parquet read / write engine; columnar format for warehouse-friendly Parquet.
- boto3 — AWS SDK for S3 reads / writes.
- sqlalchemy / psycopg2 — database connectors for Postgres extraction.
Worked example. Install the minimal set:
| package | role |
|---|---|
| pandas | DataFrame ops, CSV / Parquet IO |
| pyarrow | Parquet engine |
| boto3 | S3 access (optional for local files) |
| requests | HTTP / API extraction (optional) |
Step-by-step.
- Create a virtual environment: python3 -m venv .venv && source .venv/bin/activate.
- Install dependencies: pip install pandas pyarrow boto3 requests.
- Verify: python -c "import pandas; print(pandas.__version__)" prints a 2.x version.
- Place a sample orders.csv in the current directory (or use the snippet's path).
- Run the script: python etl.py produces cleaned_orders.parquet.
Worked-example solution. A one-line install:
pip install pandas pyarrow boto3 requests
Rule of thumb: pin versions in requirements.txt so reruns produce identical environments.
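A minimal requirements.txt in that spirit might look like the sketch below; the pins are placeholders, so substitute whatever pip freeze reports in your own environment.

```text
# Example pins only; replace with the versions pip freeze reports for you.
pandas==2.2.*
pyarrow==16.*
boto3==1.34.*
requests==2.32.*
```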
Extract + Transform + Load in 30 lines of Python
The pipeline invariant: extract reads from a source into a DataFrame; transform applies cleaning + dedup + typing; load writes to the destination as Parquet. The whole script fits in one file and is testable end-to-end.
- Extract — pd.read_csv / pd.read_sql / pd.read_json.
- Transform — .drop_duplicates(), .dropna(), .astype(), business rules with np.where.
- Load — .to_parquet() for warehouse-friendly columnar output.
- Idempotency — overwrite the destination; never append blindly.
Worked example. Cleaning an orders.csv with 12,847 rows down to 12,835 unique rows.
| metric | before | after |
|---|---|---|
| rows | 12,847 | 12,835 (12 duplicates dropped) |
| null amounts | 38 | 0 (replaced with 0.0) |
| amount dtype | object | float64 |
| order_date dtype | object | datetime64 |
Step-by-step.
- pd.read_csv("orders.csv") reads the raw file into a DataFrame.
- .drop_duplicates(subset=["order_id"], keep="last") collapses 12,847 rows to 12,835 unique orders.
- .fillna({"amount": 0}) replaces 38 null amounts with 0.0 per the business rule.
- .astype({"amount": "float64", "order_date": "datetime64[ns]"}) coerces types.
- .to_parquet("cleaned_orders.parquet", index=False) writes the curated output; downstream COPY INTO reads it.
Worked-example solution. A complete ETL script:
```python
import pandas as pd
from pathlib import Path

RAW = Path("orders.csv")
OUT = Path("cleaned_orders.parquet")

# ── Extract ─────────────────────────────────────────────────────────────
df = pd.read_csv(RAW)

# ── Transform ───────────────────────────────────────────────────────────
df = (
    df.drop_duplicates(subset=["order_id"], keep="last")
      .fillna({"amount": 0.0, "status": "unknown"})
      .astype({"amount": "float64"})
)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["order_date"])

# Business rule
df["is_high_value"] = df["amount"] > 10_000

# ── Load ────────────────────────────────────────────────────────────────
df.to_parquet(OUT, index=False, engine="pyarrow")
print(f"Loaded {len(df)} rows → {OUT}")
```
Rule of thumb: keep extract, transform, and load as separate functions so you can unit-test each independently.
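To make that rule of thumb concrete, here is a sketch of the same pipeline split into extract / transform / load functions plus a tiny pytest-style check; the function names and the toy test data are illustrative assumptions.

```python
import pandas as pd


def extract(path: str) -> pd.DataFrame:
    """Read the raw CSV into a DataFrame."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Dedup, fill nulls, coerce types; pure DataFrame-in, DataFrame-out."""
    out = (
        df.drop_duplicates(subset=["order_id"], keep="last")
          .fillna({"amount": 0.0})
          .astype({"amount": "float64"})
    )
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out.dropna(subset=["order_date"])


def load(df: pd.DataFrame, path: str) -> None:
    """Write the curated Parquet output (overwrite, never append)."""
    df.to_parquet(path, index=False, engine="pyarrow")


def test_transform_dedups_and_fills_nulls():
    raw = pd.DataFrame({
        "order_id": [1, 1, 2],
        "amount": [500.0, 500.0, None],
        "order_date": ["2026-05-11", "2026-05-11", "2026-05-12"],
    })
    cleaned = transform(raw)
    assert len(cleaned) == 2                    # duplicate order_id collapsed
    assert cleaned["amount"].isna().sum() == 0  # null amount replaced with 0.0


if __name__ == "__main__":
    load(transform(extract("orders.csv")), "cleaned_orders.parquet")
```

Because transform takes a DataFrame and returns a DataFrame, the test never touches the filesystem, which keeps it fast and deterministic.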
Scaling out — when to move beyond pandas
The scale-out invariant: pandas is single-threaded and in-memory; beyond ~10 GB it slows; the upgrade path is polars (single-machine, multi-core, Arrow-backed) or pyspark (distributed across many machines).
- polars — single-machine pandas alternative with a similar DataFrame API; 5-10× faster; lazy evaluation.
- pyspark — distributed DataFrame API; same shape as pandas but scales to terabytes.
- duckdb — embedded analytical SQL engine; great for ~100 GB datasets on a laptop.
- dbt + warehouse — when transforms can be expressed as SQL, push them into the warehouse.
Worked example. Same dedup transform, three engines.
| engine | time on 10 GB | RAM needed |
|---|---|---|
| pandas | ~3 min | ~40 GB |
| polars | ~30 s | ~12 GB |
| pyspark (10-node) | ~10 s | distributed |
Step-by-step.
- For 1 GB, pandas is fine — 10-30 seconds on a laptop.
- At 10 GB, pandas starts hitting memory limits on a 16 GB laptop; polars reduces RAM by 4× and runs 5× faster.
- At 100 GB, you want duckdb (single-machine columnar) or pyspark (distributed).
- At 1 TB+, pyspark on a real cluster is the only practical option.
- The transform shape (dedup + cast + write) doesn't change — only the engine does.
Worked-example solution. The same dedup in polars:
```python
import polars as pl

df = (
    pl.read_csv("orders.csv")
      .unique(subset=["order_id"], keep="last")
      .with_columns(
          pl.col("amount").fill_null(0.0).cast(pl.Float64),
          pl.col("order_date").str.strptime(pl.Date, "%Y-%m-%d", strict=False),
      )
)
df.write_parquet("cleaned_orders.parquet")
```
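The list and table above mention duckdb without showing code, so here is a hedged sketch of the same dedup pushed into duckdb's SQL engine through its Python API. It assumes the CSV carries a source_ts column to break ties, as in the interview variant below.

```python
import duckdb

# Same dedup + null-fill, expressed as SQL and executed by duckdb's columnar engine.
con = duckdb.connect()  # in-memory database
con.execute("""
    COPY (
        SELECT * REPLACE (coalesce(amount, 0.0) AS amount)
        FROM read_csv_auto('orders.csv')
        QUALIFY row_number() OVER (
            PARTITION BY order_id ORDER BY source_ts DESC
        ) = 1
    ) TO 'cleaned_orders.parquet' (FORMAT PARQUET)
""")
```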
Rule of thumb: start with pandas; switch to polars / pyspark only when data outgrows the laptop.
Common beginner mistakes
- Reading a 50 GB CSV with pd.read_csv and watching the laptop OOM.
- Appending to the output file instead of overwriting — silently growing duplicates.
- Skipping dtype coercion — amount stays object and aggregations break.
- Forgetting errors="coerce" on pd.to_datetime — one bad date kills the whole load.
- Not separating extract / transform / load into functions — untestable script.
ETL Interview Question on Building a First Pandas Pipeline
You're asked in a live coding round to write a Python ETL pipeline that reads orders.csv (~1 GB), removes duplicate order_id rows keeping the latest by source_ts, replaces null amount with 0, and writes a Parquet file. Write the runnable script.
Solution Using pandas + pyarrow
Code solution.
```python
import pandas as pd
from pathlib import Path

RAW = Path("orders.csv")
OUT = Path("cleaned_orders.parquet")

# 1) Extract
df = pd.read_csv(RAW, parse_dates=["source_ts"])

# 2) Transform
#    Sort by source_ts ascending so .drop_duplicates(keep="last")
#    keeps the most-recent row per order_id.
df = (
    df.sort_values(["order_id", "source_ts"])
      .drop_duplicates(subset=["order_id"], keep="last")
      .assign(amount=lambda d: d["amount"].fillna(0).astype("float64"))
)

# 3) Load
df.to_parquet(OUT, index=False, engine="pyarrow")
print(f"Wrote {len(df):,} rows → {OUT}")
```
Step-by-step trace for a 12,847-row input:
| stage | rows | notes |
|---|---|---|
| Read CSV | 12,847 | raw input |
| Sort by (order_id, source_ts) | 12,847 | rows sorted, none dropped |
| drop_duplicates(keep="last") | 12,835 | 12 duplicates dropped, latest kept |
| fillna(amount=0) | 12,835 | 38 nulls replaced |
| astype(float64) | 12,835 | type coerced |
| Write Parquet | 12,835 | columnar output, ~80 MB |
Output: cleaned_orders.parquet — 12,835 unique orders with non-null amount, ready for COPY INTO the warehouse. Total wall-clock: ~10 seconds on a laptop for 1 GB of input.
Why this works — concept by concept:
- parse_dates=["source_ts"] — pandas reads timestamps directly into datetime64, skipping a later to_datetime call.
- Sort before drop_duplicates(keep="last") — guarantees the latest source_ts per order_id survives; deduping without the sort would keep an arbitrary row.
- assign(amount=lambda d: d["amount"].fillna(0).astype("float64")) — chain-friendly transform that returns a new DataFrame, easier to test than mutating in place.
- Parquet output — columnar, compressed, warehouse-friendly; ~10× smaller than CSV for the same data.
- pyarrow engine — the modern Parquet backend; faster and more compatible than fastparquet.
- Cost — O(N log N) for the sort, O(N) for the rest; the ~1 GB input is comfortable on a 16 GB laptop.
Inline CTA: more Python practice problems for pandas-style ETL and the ETL practice page for end-to-end pipeline shapes.
Choosing your ETL stack (checklist)
Pick the right tool for the workload, not the other way around:
- For SaaS sources (Stripe, Salesforce, HubSpot) → managed connectors (Fivetran, Airbyte).
- For Postgres / MySQL OLTP → CDC (Debezium + Kafka, or Fivetran CDC).
- For high-volume clickstream / logs → Spark Structured Streaming + Iceberg / Delta.
- For SQL transforms on the warehouse → dbt + Snowflake / BigQuery / Redshift.
- For non-SQL transforms (ML features, image / NLP processing) → Spark or custom Python.
- For batch orchestration → Apache Airflow (or Prefect / Dagster for newer projects).
- For BI / serving → gold-layer marts in the warehouse; never let dashboards query silver.
Frequently asked questions
What is an ETL pipeline?
An ETL pipeline is an automated workflow that Extracts raw data from multiple sources, Transforms it into a clean and structured format, and Loads it into a destination system (data warehouse, data lake, or BI tool) for reporting and analysis. The three letters describe the stages; in practice the pipeline also handles scheduling, retries, observability, and idempotency.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the warehouse (the curated, traditional pattern). ELT loads raw data into the warehouse first, then transforms it using the warehouse's own SQL engine (the modern cloud-warehouse pattern, typically with dbt). Modern Snowflake / BigQuery / Redshift workloads tilt heavily toward ELT because the warehouse compute is elastic and SQL is the most-debugged transform language.
Which tool is best for ETL orchestration?
Apache Airflow is the most-deployed batch orchestrator and the safest hireable choice. For event-driven streaming pipelines, Prefect and Dagster are modern alternatives with better Python ergonomics. For SQL-only pipelines on cloud warehouses, dbt with its built-in scheduler (dbt Cloud) handles most cases without needing a separate orchestrator.
Can I build an ETL pipeline in Python?
Yes — pandas + pyarrow for single-machine ETL, pyspark for distributed, polars for fast single-machine. The pipeline shape (extract → transform → load) is the same regardless of the engine. Real production pipelines wrap the Python script in Airflow / Prefect for scheduling, retries, and observability.
What's the most common ETL pipeline failure mode?
Schema drift — the source system silently renames a column, changes a date format, or splits a field, and the pipeline either crashes or writes wrong data. Defend with strict schema assertions at ingest, explicit alerts on drift, and a published contract with the source team. The runner-up is non-idempotent loads — reruns produce duplicates and silent data corruption.
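As a concrete illustration of that first defence, here is a hedged sketch of a schema assertion in the style of the section 7 pandas pipeline; the expected column set is an assumption based on the running orders example.

```python
import pandas as pd

# Expected contract for the orders extract; column names are illustrative.
EXPECTED_COLUMNS = {"order_id", "customer_name", "amount", "order_date"}


def assert_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly at ingest instead of writing silently wrong data downstream."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"Schema drift: source dropped columns {sorted(missing)}")
    if extra:
        # New columns are usually harmless, but surface them so the contract gets updated.
        print(f"WARNING: source added columns {sorted(extra)}")
    return df


df = assert_schema(pd.read_csv("orders.csv"))
```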
Practice on PipeCode
Reading is one thing; reps are another. To turn the ETL primitives in this guide into reliable interview answers, pair the reading with practice on real problems and structured courses.
- Drill SQL transformations — the workhorse of ELT — at the SQL practice page.
- Practice ETL pipeline design with company-flavoured problems at the ETL practice page.
- Build pandas / Python ETL fluency at the Python practice page.
- Take the structured path with ETL System Design for Data Engineering Interviews and PySpark Fundamentals.
- Read related guides — the data lake architecture for data engineering interviews blog covers the medallion zones every ETL pipeline writes into, and the SQL interview questions for data engineering blog drills the SQL primitives every Transform step uses.
The ETL pipelines you'll be asked to design in interviews use exactly the primitives in this guide — extract from messy sources, transform with idempotent SQL or Python, load into a warehouse the BI team trusts, and orchestrate the whole thing with Airflow + dbt. Practice them in the practice surface and the design rounds become reps, not surprises.





Top comments (0)