An ETL pipeline is the core data-engineering workflow that turns scattered raw payloads — database rows, API responses, log files, SaaS exports — into clean, trusted data inside a warehouse where analysts and BI tools can use it. ETL stands for Extract, Transform, Load: pull raw data from many source systems, reshape and clean it into a consistent schema, then write it into a destination like Amazon Redshift, Snowflake, or a data lake. Every fresher data-engineering interview probes the same three letters — and the candidate who can name the failure modes per stage wins the round.
Think of this as a beginner-friendly ETL pipeline tutorial for data engineers — a first-principles walk through the Extract → Transform → Load loop, the orchestration tools that automate it (Airflow, dbt, Spark, AWS Glue), the ETL-vs-ELT trade-off that defines modern cloud warehouses, and a runnable Python pandas example you can adapt to your own pipeline. Every section ships worked examples and an ETL interview questions-style problem with a full traced solution, in the same shape PipeCode practice problems use.
If you want hands-on reps after you read, explore practice →, drill SQL problems →, browse ETL practice →, or open ETL System Design for Data Engineering Interviews → for a structured path.
On this page
- Why ETL pipelines matter
- Extract — pulling raw data from sources
- Transform — cleaning, dedup, standardization, aggregation
- Load — destinations from warehouses to BI tools
- ETL vs ELT — transform before or after loading
- ETL orchestration tools — Airflow, dbt, Spark, AWS Glue
- Building a Python pandas ETL pipeline
- Choosing your ETL stack (checklist)
- Frequently asked questions
- Practice on PipeCode
1. Why ETL pipelines matter
Clean, trusted data is the foundation of every analytics decision
So, why ETL? Because raw source data is messy — duplicates, nulls, mixed formats, inconsistent customer IDs across systems — and dashboards can't tolerate that mess. An ETL pipeline is the automated cleaning step between the noisy source-of-truth (operational databases, third-party APIs, file dumps) and the curated layer (data warehouse, lakehouse, BI tool). Without it, analytics teams answer the same question three different ways and trust in the data evaporates.
Pro tip: When an interviewer asks "what's an ETL pipeline?", lead with the contract it provides, not the steps. The contract is: "given any source payload, downstream consumers see clean, deduplicated, type-coerced, time-aligned rows on a known schema with a known freshness SLA." The three letters (E, T, L) are just how you keep that contract.
Raw data is noisy — duplicates, nulls, mixed formats
The noise invariant: source systems were built for their own workload, not for analytics; they ship duplicates from CDC retries, nulls where the user skipped a field, three different date formats from three different teams, and inconsistent capitalisation that breaks joins. Every one of those defects either becomes a bug in the dashboard or gets cleaned out by an ETL stage.
- Duplicates — same customer recorded multiple times (CDC retries, late events).
- Nulls — missing amounts, missing emails, optional fields left blank.
- Mixed formats — `2026-05-11`, `11/05/26`, `May 11` all mean the same date.
- Inconsistent identifiers — `Ram`, `RAM`, and `Ram` with a trailing space all refer to one customer.
Worked example. A raw orders extract straight from the CRM contains three flavours of "Ram" and a null amount:
| order_id | customer_name | amount |
|---|---|---|
| 1 | Ram | 500 |
| 2 | RAM | 1000 |
| 3 | Ram | NULL |
Step-by-step.
- `SELECT COUNT(DISTINCT customer_name) FROM orders` returns 2 (`Ram` and `RAM`) when the business wants 1.
- `SELECT AVG(amount) FROM orders` returns `750` instead of the expected `500`, because `NULL` is excluded from `AVG` while the consumer expected it to be treated as `0`.
- A dashboard built directly on this raw table publishes wrong numbers to the CFO.
- The fix isn't a smarter query — it's an ETL pipeline that normalises the casing, deduplicates customers on a canonical key, and resolves null amounts using a business rule.
Worked-example solution. A minimal Transform step in SQL:
SELECT
LOWER(TRIM(customer_name)) AS customer_key,
COALESCE(amount, 0) AS amount
FROM raw_orders;
Rule of thumb: the deeper you get into a pipeline, the more expensive the fix; clean the raw payload as close to ingest as possible.
Multiple sources, one consistent schema
The unification invariant: every analytical query joins data from at least two systems — the e-commerce orders table, the payment provider's ledger, the CRM customer record, the marketing platform's campaign IDs — and they don't agree on schema, primary keys, or freshness; the ETL pipeline is what produces a **single conformed schema with a shared customer key, a shared product key, and a shared time grain**.
- Common sources — SQL databases (Postgres, MySQL), APIs (Stripe, Salesforce), CSV / Excel dumps, log files, cloud storage (S3, GCS), SaaS tools (HubSpot, Segment).
- Conformed dimensions — `dim_customer`, `dim_product`, `dim_date` shared across every fact table.
- Surrogate keys — never join on a source-system natural key; map it to an internal `BIGINT` key in the warehouse.
- Shared time grain — every fact aligns to a common date grain (day, hour, minute) so cross-source joins work.
Worked example. Three sources, three different customer-identifier columns:
| source | identifier column | example value |
|---|---|---|
| website (Postgres) | `users.id` (BIGINT) | `42` |
| payments (Stripe API) | `customer.id` (string) | `cus_abc123` |
| CRM (HubSpot export) | `contact.email` (string) | `ram@example.com` |
Step-by-step.
- The website ships orders keyed by Postgres `users.id`.
- The payment provider returns transaction rows keyed by Stripe `cus_abc123`.
- The CRM exports customer contacts keyed by email address.
- A direct three-way join is impossible — there is no shared key.
- The ETL pipeline builds a `dim_customer` table that carries all three identifiers as columns plus a single internal `customer_key BIGINT` that every downstream fact uses. After that, the three-way join is one line of SQL.
Worked-example solution. A bridge dimension that unifies all three identifiers:
CREATE TABLE dim_customer (
customer_key BIGSERIAL PRIMARY KEY,
website_id BIGINT UNIQUE,
stripe_id TEXT UNIQUE,
crm_email TEXT UNIQUE,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Rule of thumb: if you find yourself joining on a string from a third party, the join key belongs in a dimension table, not in your fact-table predicates.
Automation, repeatability, and observability
The automation invariant: an ETL pipeline runs on a schedule (or in response to an event), produces the same output on every rerun (idempotent), and emits enough metadata (row counts, hashes, error logs) for an on-call engineer to debug a failure at 3 a.m. A "pipeline" without these properties is a one-off script, and one-off scripts always rot.
- Schedule — cron (`0 2 * * *` for 2 a.m. daily), event-driven (S3 ObjectCreated), or continuous (Kafka stream).
- Idempotent — rerunning the same job produces the same output; achieved with `MERGE`, `INSERT OVERWRITE PARTITION`, or `DELETE + INSERT`.
- Observable — row counts logged per stage, schema drift alerts, success / failure notifications to Slack or PagerDuty.
- Reproducible — the pipeline definition lives in git; deploys are versioned; rollbacks are one PR away.
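A minimal sketch of the observability bullet above — logging a per-stage row count and raising an alert when drift exceeds a tolerance; the webhook environment variable and the 0.1% threshold are assumptions to adapt, not a fixed convention:
import json, logging, os, urllib.request

logging.basicConfig(level=logging.INFO)

def log_stage_count(stage, row_count, expected, tolerance=0.001):
    """Log one stage's row count and post an alert when drift exceeds the tolerance."""
    drift = abs(row_count - expected) / max(expected, 1)
    logging.info("stage=%s rows=%d expected=%d drift=%.4f", stage, row_count, expected, drift)
    if drift > tolerance:
        payload = json.dumps({"text": f"ETL drift alert: {stage} rows={row_count} expected={expected}"})
        req = urllib.request.Request(
            os.environ["ETL_ALERT_WEBHOOK"],  # hypothetical Slack / PagerDuty webhook URL
            data=payload.encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

# Example: silver count matches the extract count, so nothing fires
log_stage_count("silver.orders", row_count=12_835, expected=12_835)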
Worked example. An idempotent daily load that overwrites a single date partition.
| run | rows in target | resulting count |
|---|---|---|
| original | 0 | 12,835 (first load for 2026-05-11) |
| retry after partial failure | 6,420 (partial write) | 12,835 (overwrite produces clean state) |
| accidental rerun next morning | 12,835 | 12,835 (same data, no duplicates) |
Step-by-step.
- Day-1 run: load lands 12,835 rows for partition `ingest_date='2026-05-11'`.
- The pipeline crashes mid-write, leaving 6,420 partial rows.
- Retry runs the same job; the `INSERT OVERWRITE PARTITION` semantics drop the partial rows first, then write the full 12,835 — net state is correct.
- An accidental rerun a day later does the same thing — overwrite the partition, end at 12,835. No duplicates.
- The key property is that the final state depends on the input, not on how many times the job ran. That's idempotency.
Worked-example solution. A partition-overwrite load (Spark SQL / Hive-style syntax; warehouses without `INSERT OVERWRITE` get the same idempotent effect from `MERGE` or a partition-scoped `DELETE + INSERT`):
INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date='2026-05-11')
SELECT
    order_id,
    customer_id,
    CAST(amount AS DECIMAL(14, 2)) AS amount,
source_ts
FROM bronze.orders
WHERE ingest_date = '2026-05-11';
Rule of thumb: if your job's output depends on NOW() or on previous state in the target, it is not idempotent — restructure.
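To make that contrast concrete, a minimal sketch of the same daily load written both ways, assuming `ingest_date` is an ordinary column on `silver.orders`; only the second form converges to the same state on a rerun:
-- Non-idempotent: every rerun appends the day's rows again
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts, ingest_date
FROM bronze.orders
WHERE ingest_date = '2026-05-11';

-- Idempotent: clear the day's rows, then rewrite them inside one transaction
BEGIN;
DELETE FROM silver.orders WHERE ingest_date = '2026-05-11';
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts, ingest_date
FROM bronze.orders
WHERE ingest_date = '2026-05-11';
COMMIT;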
Common beginner mistakes
- Conflating an ETL pipeline with a one-off SQL script — pipelines are scheduled, versioned, and observable.
- Skipping the deduplication step in Transform — assuming source data is "clean enough" and shipping doubled metrics.
- Hand-mapping identifiers in every dashboard instead of building a conformed `dim_customer` once.
- Writing non-idempotent loads (`INSERT INTO ... SELECT ...`) and discovering duplicates only after the second run.
- Treating ETL as code-only without observability — silent failures rot trust faster than loud ones.
ETL Interview Question on Designing a First-Pass Pipeline
A retail company has order data spread across three systems — the Postgres-backed e-commerce site, a Stripe payments account, and a HubSpot CRM. The CFO wants a daily revenue dashboard sliced by customer segment and product category. Design the simplest end-to-end ETL pipeline that gives the CFO a trustworthy answer by tomorrow morning.
Solution Using a Daily ETL Pipeline with Bronze / Silver / Gold Layers
Code solution.
EXTRACT
├── Postgres orders → s3://lake/bronze/orders/ingest_date=…/ (Debezium CDC or daily snapshot)
├── Stripe charges → s3://lake/bronze/charges/ingest_date=…/ (REST API → JSON to S3)
└── HubSpot contacts → s3://lake/bronze/contacts/ingest_date=…/ (nightly CSV export)
TRANSFORM (dbt or Spark SQL)
├── silver.orders ← dedup + type coercion + customer_key surrogate
├── silver.charges ← join orders ↔ charges by stripe transaction_id
├── silver.contacts ← deduped contact rows keyed by email
└── silver.dim_customer ← unify all three identifiers in one dimension
LOAD
└── gold.fact_revenue ← grain: one row per (date, customer_key, product_key)
partitioned by date_key
joined to dim_customer + dim_product
ORCHESTRATION
└── Airflow DAG, daily at 02:00 UTC, with reconciliation gate before gold promotion
Step-by-step trace of the daily pipeline:
| stage | what runs | output |
|---|---|---|
| 02:00 | Airflow triggers DAG | start |
| 02:05 | Extract: Postgres snapshot → S3 bronze | 12,835 raw orders |
| 02:08 | Extract: Stripe API pulls yesterday's charges → S3 | 12,712 charges |
| 02:10 | Extract: HubSpot nightly CSV → S3 | 8,210 contacts |
| 02:15 | Transform: dbt builds silver layer (dedup, type coercion, surrogate keys) | 12,835 silver orders |
| 02:25 | Transform: dim_customer unified across all three sources | 8,210 dim rows |
| 02:30 | Reconciliation gate: silver.orders count vs Postgres source | drift < 0.1% — PASS |
| 02:32 | Load: gold.fact_revenue partition for 2026-05-11 | 12,835 fact rows |
| 02:35 | DAG complete; CFO dashboard refreshes at 02:40 | clean numbers |
Output: a single gold.fact_revenue table the BI tool reads. Every row has a customer_key, a product_key, a date_key, an exact revenue decimal, and a pipeline_version lineage column. The CFO opens Tableau, picks a date range, and sees correct numbers segmented by customer cohort.
Why this works — concept by concept:
- Bronze (append-only) → Silver (conformed) → Gold (star schema) — the medallion layering keeps source drift contained to the bronze layer; consumers never see raw payloads.
- Daily partition overwrite — every load is idempotent; reruns and backfills don't produce duplicates.
- `dim_customer` bridging three identifiers — joins are one line in the gold layer; the warehouse query plan uses a single surrogate key everywhere.
- Reconciliation gate before gold promotion — drift > tolerance pages on-call; the BI tool never sees a bad load.
- Airflow orchestration — DAG definition is versioned in git; retries, alerts, SLAs are first-class.
- Cost — one daily run scales `O(|daily delta|)` not `O(|all-time data|)`; backfill is `O(|range|)`; clear observability everywhere.
Inline CTA: drill the ETL practice page for end-to-end pipeline shapes and the SQL practice page for transformation-style queries.
2. Extract — pulling raw data from sources
How to read from databases, APIs, files, and SaaS without breaking the source
The Extract stage is the gateway between source systems and the rest of the pipeline — and the choices you make here cascade into everything downstream. Different sources have different protocols (SQL, REST, file drops, CDC streams), different freshness expectations (sub-second to weekly), and different failure modes (rate limits, schema drift, network flakiness). The right Extract strategy is the one that pulls complete, ordered, replayable raw data without disrupting the source system's own workload.
Pro tip: the most common Extract failure mode in production isn't "data is wrong" — it's "data is silently missing because the source paginated and we didn't follow the cursor." Always log the source-system pagination cursor / last-modified marker for every batch so you can replay from exactly where you stopped.
SQL databases — snapshots, CDC, and the read-replica rule
The relational-source invariant: never run extract queries against the OLTP primary database — that's the production transactional workload; extract from a read replica, a CDC stream, or a periodic snapshot to S3 / GCS. Within those three patterns, CDC is the modern default for high-freshness pipelines because it has near-zero impact on the source.
- Snapshot extract — `SELECT * FROM orders WHERE updated_at > $cursor` against a read replica.
- CDC (Change Data Capture) — Debezium reads the Postgres WAL or MySQL binlog and emits change events to Kafka.
- Read replica — point extracts at a follower, not the primary, so the OLTP workload is unaffected.
- Cursor / watermark — persist the last successfully-extracted timestamp or LSN so the next run resumes correctly.
Worked example. A daily extract that pulls only the previous day's orders.
| approach | source impact | freshness | complexity |
|---|---|---|---|
| Full table dump | high (lock + IO) | 24 h | low |
| Cursor-based incremental | medium (one indexed scan) | 24 h | medium |
| CDC (Debezium + Kafka) | near-zero | ~1 min | high |
Step-by-step.
- The naïve approach `SELECT * FROM orders` against the primary scans the entire table — millions of rows of IO that competes with the live website.
- The cursor-based approach `WHERE updated_at >= '2026-05-10' AND updated_at < '2026-05-11'` against a read replica reads only ~1 day of data — much cheaper.
- CDC reads the database's own write-ahead log, so the extract has zero query cost on the source — only the disk-tail read of the WAL.
- The right choice depends on freshness needs: daily dashboards → cursor; sub-minute analytics → CDC; one-off backfill → full snapshot to S3.
- In every case, log the cursor (or LSN) per batch — that's how you replay after a failure.
Worked-example solution. A cursor-based daily extract against a Postgres read replica:
-- Run nightly; bind $cursor to the previous run's max_updated_at
COPY (
SELECT order_id, customer_id, amount, status, updated_at
FROM orders
WHERE updated_at >= '2026-05-10'::timestamptz
AND updated_at < '2026-05-11'::timestamptz
ORDER BY updated_at
)
-- Plain Postgres COPY writes to a server-local path; shipping the file to s3://lake/bronze/orders/ingest_date=2026-05-10/ is a separate upload step (or use the aws_s3 extension on RDS / Aurora)
TO '/tmp/orders_2026-05-10.csv'
WITH (FORMAT csv, HEADER true);
Rule of thumb: every extract you write should be resumable from a cursor. Without one, a network blip becomes a full re-extract.
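One minimal way to make an extract resumable — a small watermark table the job reads before extracting and advances only after a successful load; the table and column names here are illustrative, not a fixed convention:
-- One row per extracted source table; holds the high-water mark of the last good run
CREATE TABLE IF NOT EXISTS etl_watermarks (
    source_table TEXT PRIMARY KEY,
    last_cursor  TIMESTAMPTZ NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Before extracting: read the cursor to bind into the WHERE clause above
SELECT last_cursor FROM etl_watermarks WHERE source_table = 'orders';

-- After a successful load: advance the cursor to the max updated_at just extracted
INSERT INTO etl_watermarks (source_table, last_cursor)
VALUES ('orders', '2026-05-11 00:00:00+00')
ON CONFLICT (source_table) DO UPDATE
SET last_cursor = EXCLUDED.last_cursor,
    updated_at  = NOW();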
APIs — pagination, rate limits, and idempotent paging
The API-source invariant: most REST APIs return a paginated response; the extract job must follow the cursor / next-link until all pages are consumed; rate limits force exponential backoff; the same call run twice should produce the same data unless the upstream changed. Idempotency on the extract side protects the rest of the pipeline.
- Pagination — cursor-based (`next_cursor` token), offset / limit, or `since` timestamps.
- Rate limits — read the `X-RateLimit-Remaining` header; back off when it hits 0.
- Idempotency — repeating the same API call returns the same rows (modulo true new data).
- Auth — OAuth refresh tokens, API keys via secrets manager (never in code).
Worked example. A Stripe charges extract that pages through ~10,000 records per day.
| step | request | result |
|---|---|---|
| 1 | `GET /v1/charges?limit=100&created[gte]=…` | 100 charges + `has_more=true` + next_id |
| 2 | `GET /v1/charges?limit=100&starting_after=cha_99` | 100 more + `has_more=true` |
| 3-99 | repeat with new `starting_after` cursor | 100 each |
| 100 | last page | 12 charges + `has_more=false` |
| total | 99 × 100 + 12 = 9,912 | written to S3 as one JSON Lines file |
Step-by-step.
- The first call grabs the first 100 charges with `limit=100`.
- Each response contains `has_more=true` plus the ID of the last record (`cha_99`).
- The next call uses `starting_after=cha_99` to fetch the next page — that's cursor-based pagination.
- Loop until `has_more=false`; concatenate all pages into a single JSON Lines file.
- Persist the final cursor ID so a retry knows exactly where to resume; without that, you re-extract from the start every time.
Worked-example solution. A simple Python pagination loop with rate-limit handling:
import os, time, json, requests
cursor = None
records = []
while True:
params = {"limit": 100, "created[gte]": "1715040000"}
if cursor:
params["starting_after"] = cursor
r = requests.get(
"https://api.stripe.com/v1/charges",
params=params,
auth=(os.environ["STRIPE_SECRET_KEY"], "")
)
if r.status_code == 429: # rate-limited
time.sleep(int(r.headers.get("Retry-After", 10)))
continue
page = r.json()
records.extend(page["data"])
if not page["has_more"]:
break
cursor = page["data"][-1]["id"]
with open("/tmp/charges.jsonl", "w") as fh:
for rec in records:
fh.write(json.dumps(rec) + "\n")
Rule of thumb: assume every API call can fail; build retries, backoff, and cursor-resume in from day one.
Files and SaaS — schema drift and the contract problem
The file-source invariant: CSV / Excel / JSON dumps from third parties are the highest-drift source in any pipeline — a column rename, a date-format change, or a quote-character switch silently breaks the load; defend with strict schema validation at ingest and explicit alerts on drift.
- CSV — `csv` module or `pandas.read_csv` with explicit `dtype` map and `parse_dates` list.
- Excel — `openpyxl` for `.xlsx`, but really pressure the source to send CSV / Parquet.
- JSON / JSONL — line-delimited JSON for streaming-friendly reads; flatten nested objects in Transform, not Extract.
- Schema validation — assert column names, types, and required-not-null status at ingest; fail loudly on drift.
Worked example. A nightly CSV export from a CRM that quietly renamed Email → email_address.
| date | columns extracted | result |
|---|---|---|
| 2026-05-09 | `Name, Email, Phone` | parses fine |
| 2026-05-10 | `Name, email_address, Phone` | column `Email` not found → loud error |
| 2026-05-10 (no validation) | `Name, email_address, Phone` | silently writes NULL into email — dashboard breaks |
Step-by-step.
- The CRM team renamed `Email` to `email_address` in a release note nobody read.
- The extract script asks for `df["Email"]`; pandas raises `KeyError`.
- With strict validation, the script alerts on-call and halts before writing bad data.
- Without validation, the script writes `NULL` for every email; the downstream dashboard's "users with no email" panel jumps from 0% to 100% overnight.
- The fix is to assert column presence and types at ingest — and to publish that schema to the source team as a contract.
Worked-example solution. A defensive CSV ingest with schema assertion:
import pandas as pd
EXPECTED = {"Name": "object", "Email": "object", "Phone": "object"}
df = pd.read_csv("crm_contacts.csv", dtype=EXPECTED)
missing = set(EXPECTED) - set(df.columns)
if missing:
raise RuntimeError(f"CRM CSV missing columns: {missing}")
# only after validation, write to bronze
df.to_parquet("/tmp/contacts.parquet", index=False)
Rule of thumb: every file source has a contract — declare it in code and fail loudly when the source breaks it.
Common beginner mistakes
- Pointing extract queries at the OLTP primary instead of a read replica or CDC stream.
- Skipping pagination — pulling page 1 of 100 and assuming the API gave you everything.
- Storing API secrets in code instead of a secrets manager / environment variable.
- Trusting the source schema without validation — silent drift becomes silent data loss.
- Forgetting to persist the cursor / watermark — failures force a full re-extract.
ETL Interview Question on Extracting from a Rate-Limited API
A pipeline needs to pull users data from a third-party API daily. The API returns 200 users per page, is paginated by next_cursor, and rate-limits at 60 requests / minute. The user base is ~50,000. Walk through the extract design that finishes in under 30 minutes without hitting rate-limit errors.
Solution Using Cursor-Based Pagination + Rate-Limit-Aware Backoff
Code solution.
import os, time, json, requests
API = "https://api.example.com/v1/users"
TOKEN = os.environ["API_TOKEN"]
PAGE = 200
out = []
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {TOKEN}"})
cursor = None
while True:
params = {"page_size": PAGE}
if cursor:
params["next_cursor"] = cursor
r = session.get(API, params=params)
if r.status_code == 429:
time.sleep(int(r.headers.get("Retry-After", 5)))
continue
r.raise_for_status()
body = r.json()
out.extend(body["users"])
cursor = body.get("next_cursor")
if not cursor:
break
# Stay well under 60 req/min — sleep 1.1s between requests
time.sleep(1.1)
with open("/tmp/users.jsonl", "w") as fh:
for u in out:
fh.write(json.dumps(u) + "\n")
Step-by-step trace for the 50k-user extract:
| metric | value |
|---|---|
| pages required | ⌈50,000 / 200⌉ = 250 |
| time per request (with 1.1s sleep) | ~1.4 s |
| total wall-clock time | 250 × 1.4 s ≈ 6 min |
| rate-limit headroom | ~43 req / min (well under the 60 cap) |
| failure mode handled | `429` → backoff per `Retry-After` |
| final output | one JSONL file with 50,000 user rows |
Output: a single JSONL file (users.jsonl) with one row per user. The cursor design means a retry resumes from the failure point, not the beginning. Total wall-clock: ~6 min, well under the 30-min budget. Zero rate-limit errors.
Why this works — concept by concept:
- Cursor pagination + `next_cursor` follow — extracts every user without skipping or duplicating pages.
- 1.1-second sleep between calls — keeps the request rate at ~43 req / min, comfortably under the 60-req / min limit.
- `429` → `Retry-After` backoff — handles bursty rate-limit events without crashing the pipeline.
- JSONL output — streaming-friendly; downstream Transform can read line-by-line without loading the whole file into memory.
- Persisted cursor between runs — extend the script to write `cursor` to S3 / a database so the next day resumes from there (a minimal sketch follows below).
- Cost — `O(|users| / page_size)` requests; bounded by the rate limit; failure recovery is `O(1)` cursor reload.
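A minimal sketch of that cursor persistence, using a local checkpoint file so a rerun resumes from the last completed page; the path is illustrative — in production the same two functions would read and write an S3 object or a database row instead:
import json, os

CHECKPOINT = "/tmp/users_extract_cursor.json"  # illustrative local path

def load_cursor():
    """Return the cursor saved by the last run, or None on the very first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as fh:
            return json.load(fh).get("cursor")
    return None

def save_cursor(cursor):
    """Persist the cursor so a crash or rerun resumes from the last completed page."""
    with open(CHECKPOINT, "w") as fh:
        json.dump({"cursor": cursor}, fh)

# In the pagination loop above:
#   cursor = load_cursor()   # replaces `cursor = None` before the while loop
#   save_cursor(cursor)      # after each page's rows are appended to `out`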
Inline CTA: more Python practice problems for API-extraction loops and the ETL practice page for full pipeline shapes.
3. Transform — cleaning, dedup, standardization, aggregation
Where raw data becomes useful — the meat of every ETL pipeline
The Transform stage is where 80% of an ETL pipeline's value is created. Raw data lands as-is, and Transform applies the cleaning, deduplication, type coercion, joining, and business-rule logic that turn it into something consumers can trust. In modern pipelines, Transform is usually SQL (dbt, Spark SQL) on top of a staged copy of the raw data — easy to test, easy to version, easy to backfill.
Pro tip: Transform logic should be idempotent and unit-testable. Every transformation is "pure" if its output depends only on its input — no `NOW()`, no random IDs, no calls to external services. That property is what lets you backfill, rerun, and refactor without fear.
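A small illustration of the difference, using the silver.orders shape from the examples below — the first query's output shifts every day it runs; the second is pinned to a run date the orchestrator passes in:
-- Not pure: the output depends on when the job happens to run
SELECT * FROM silver.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day';

-- Pure: the output depends only on the parameter; a backfill for any date returns the same rows
SELECT * FROM silver.orders
WHERE order_date = :run_date;  -- :run_date supplied by Airflow variables or dbt vars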
Cleaning and deduplication
The cleaning invariant: bronze data contains duplicates from CDC retries, retries on flaky network, and late-arriving records; silver must collapse them to one canonical row per business key, choosing the latest version when conflicts exist. The standard pattern is ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1.
- Duplicates — same business key with multiple physical rows.
- `ROW_NUMBER` dedup — partition by business key, order by latest `source_ts`, keep `rn = 1`.
- Bad rows — invalid types, broken refs, business-rule violations; quarantine to a separate `reject` table.
- `COALESCE` — replace nulls with a known default at the boundary, not inside business logic.
Worked example. Bronze orders arrives with two rows for order_id = 448 due to a CDC retry.
| order_id | source_ts | amount |
|---|---|---|
| 448 | 2026-05-11 09:30:00 | 500 |
| 448 | 2026-05-11 09:30:15 | 520 (corrected) |
| 449 | 2026-05-11 10:00:00 | 800 |
Step-by-step.
- Bronze has two rows for order 448 — the first one (500) is the original CDC event, the second (520) is the post-correction CDC event.
- `ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)` numbers them: the 09:30:15 row gets `rn = 1`, the 09:30:00 row gets `rn = 2`.
- For order 449, only one row → `rn = 1`.
- `WHERE rn = 1` keeps the latest version of order 448 (the corrected $520) and the only version of order 449.
- Silver now has one deterministic row per `order_id` — joins and aggregates produce the right numbers.
Worked-example solution. Dedup in SQL:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY source_ts DESC
) AS rn
FROM bronze.orders
WHERE ingest_date = '2026-05-11'
)
INSERT INTO silver.orders
SELECT order_id, customer_id, amount, source_ts
FROM ranked
WHERE rn = 1;
Rule of thumb: dedup belongs in the silver layer and the silver layer alone; if downstream needs to dedup again, your silver contract is leaking.
Standardization — types, casing, dates, units
The standardization invariant: source systems disagree on date format, capitalisation, currency, and units; Transform converts everything to one canonical representation so downstream queries can compare them without LOWER() calls in every WHERE.
- Dates — `'11/05/26'`, `'2026-05-11'`, `'May 11'` → all become `DATE '2026-05-11'`.
- Casing — `'Ram'`, `'RAM'`, `'ram'` → `LOWER(TRIM(name))` → `'ram'`.
- Currency — multiple feeds in USD, EUR, INR → convert to one reporting currency in silver.
- Units — distance in miles vs km, weight in lb vs kg — canonicalise once at ingest.
Worked example. Three rows from three sources with three date formats:
| source | raw date | canonical date |
|---|---|---|
| Postgres app | `2026-05-11` (ISO) | `2026-05-11` |
| CRM CSV | `11/05/26` (DD/MM/YY) | `2026-05-11` |
| API JSON | `"May 11, 2026"` | `2026-05-11` |
Step-by-step.
- Postgres uses ISO 8601 (`YYYY-MM-DD`) natively — no transformation needed.
- The CRM exports DD/MM/YY — needs `TO_DATE(raw_date, 'DD/MM/YY')`.
- The API returns prose-style `"May 11, 2026"` — needs `TO_DATE(raw_date, 'Mon DD, YYYY')`.
- After Transform, every row has a single `DATE` value in the canonical column.
- Downstream SQL is then trivial: `WHERE order_date = '2026-05-11'` works across all three sources.
Worked-example solution. Standardize dates and casing in one pass:
SELECT
CASE
WHEN raw_date ~ '^\d{4}-\d{2}-\d{2}$' THEN raw_date::DATE
WHEN raw_date ~ '^\d{2}/\d{2}/\d{2}$' THEN TO_DATE(raw_date, 'DD/MM/YY')
WHEN raw_date ~ '^[A-Z][a-z]{2} \d{1,2}, \d{4}$' THEN TO_DATE(raw_date, 'Mon DD, YYYY')
ELSE NULL -- send to reject table
END AS order_date,
LOWER(TRIM(customer_name)) AS customer_key,
COALESCE(amount, 0)::NUMERIC(14, 2) AS amount
FROM bronze.orders_raw;
Rule of thumb: standardize at the bronze → silver boundary; never let two different formats coexist past that line.
Aggregation and business rules
The aggregation invariant: silver carries one row per source event; gold often carries pre-aggregated metrics for fast dashboard reads — daily revenue, monthly active users, average order value per cohort; the aggregation logic is a SQL GROUP BY plus business-rule CASE WHEN expressions.
- Daily totals — `SELECT date, SUM(amount) FROM silver.orders GROUP BY date`.
- `CASE WHEN` — classify rows into business buckets at aggregation time.
- Window aggregates — running totals, rolling averages, MoM deltas (sketched after the rule of thumb below).
- Cohort metrics — `GROUP BY signup_month, days_since_signup` for retention curves.
Worked example. A daily revenue rollup with a High Value business rule.
| date | order_count | total_revenue | high_value_count |
|---|---|---|---|
| 2026-05-09 | 4,210 | 1,051,200 | 38 |
| 2026-05-10 | 4,832 | 1,224,500 | 47 |
| 2026-05-11 | 5,118 | 1,387,420 | 52 |
Step-by-step.
- Silver `orders` has one row per order with `order_date`, `amount`, and `customer_id`.
- The aggregate `GROUP BY order_date` collapses to one row per date.
- `SUM(amount)` produces the daily total revenue.
- `COUNT(*) FILTER (WHERE amount > 10000)` produces the count of High Value orders for that day.
- The result lands in `gold.daily_revenue` for the dashboard to read in milliseconds — no full-table scan per page load.
Worked-example solution. Daily rollup with business rule:
INSERT INTO gold.daily_revenue
SELECT
order_date,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
COUNT(*) FILTER (WHERE amount > 10000) AS high_value_count
FROM silver.orders
WHERE order_date = '2026-05-11'
GROUP BY order_date
ON CONFLICT (order_date) DO UPDATE SET
order_count = EXCLUDED.order_count,
total_revenue = EXCLUDED.total_revenue,
high_value_count = EXCLUDED.high_value_count;
Rule of thumb: never recompute aggregates inside a BI tool when the warehouse can pre-compute them at load time.
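For the window-aggregate bullet above, a minimal sketch of a rolling 7-day average computed over the mart the rollup just produced (column names follow the rollup example):
SELECT
    order_date,
    total_revenue,
    AVG(total_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM gold.daily_revenue
ORDER BY order_date;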
Common beginner mistakes
- Skipping `ROW_NUMBER` dedup and assuming "the source won't send duplicates" — every CDC pipeline eventually retries.
- Mixing canonical and source-system date formats — every downstream query needs `WHERE` casts.
- Doing aggregation inside the BI tool instead of in the warehouse — slow dashboards, no reuse.
- Putting business rules in Transform code instead of declarative SQL — harder to test, harder to version.
- Forgetting to write rejected rows to a `reject` table — quietly losing data on cleanup.
ETL Interview Question on Cleaning Drifted Source Data
A CRM dumps a daily CSV with customer_name values like Ram, RAM, Ram (trailing space), and Ram@ (corrupted). The downstream dashboard counts distinct customers and currently reports 4 unique names when the truth is 1. Write the Transform step that produces a single canonical customer_key per real-world customer and quarantine the corrupted row.
Solution Using LOWER + TRIM + REGEXP + Reject-Table Quarantine
Code solution.
-- 1) Quarantine corrupted rows
INSERT INTO silver.customer_reject
SELECT raw_id, customer_name, 'invalid_char' AS reason, NOW() AS rejected_at
FROM bronze.customers
WHERE customer_name !~ '^[A-Za-z][A-Za-z .-]+$';
-- 2) Standardise the good rows into silver
INSERT INTO silver.customers (raw_id, customer_key, full_name)
SELECT
raw_id,
LOWER(TRIM(customer_name)) AS customer_key,
INITCAP(TRIM(customer_name)) AS full_name
FROM bronze.customers
WHERE customer_name ~ '^[A-Za-z][A-Za-z .-]+$'
ON CONFLICT (raw_id) DO UPDATE SET
customer_key = EXCLUDED.customer_key,
full_name = EXCLUDED.full_name;
-- 3) Verify
SELECT COUNT(DISTINCT customer_key) AS unique_customers
FROM silver.customers;
Step-by-step trace of the cleanup:
| raw row | regex pass? | customer_key | landed in |
|---|---|---|---|
| `Ram` | ✓ | `ram` | silver.customers |
| `RAM` | ✓ | `ram` | silver.customers (same key as above) |
| `Ram` (trailing space) | ✓ (TRIM strips the space) | `ram` | silver.customers (same key) |
| `Ram@` | ✗ (`@` not allowed) | — | silver.customer_reject |
Output: silver.customers has 3 rows mapped to a single customer_key = 'ram'; the dashboard now correctly reports 1 unique customer. The corrupted row sits in silver.customer_reject for the data steward to investigate.
Why this works — concept by concept:
- Quarantine before standardise — corrupted rows go to a separate table; clean rows enter silver; you never silently drop data.
- `LOWER(TRIM(...))` as the canonical key — collapses case + whitespace variants into one bucket.
- `INITCAP(TRIM(...))` for display name — produces a clean human-readable version while keeping the join key normalised.
- Regex gate `^[A-Za-z][A-Za-z .-]+$` — explicit allow-list of valid name characters; everything else routes to reject.
- Idempotent `INSERT ... ON CONFLICT` — rerunning produces the same final state; backfills are safe.
- Cost — single linear scan over bronze; reject volume is observable as a metric (alert when >1% of rows reject).
Inline CTA: drill the SQL practice page for cleanup queries and the ETL practice page for staged-transform patterns.
4. Load — destinations from warehouses to BI tools
Where clean data lands — warehouse, lake, lakehouse, or directly into a dashboard
The Load stage writes the curated data to one or more destinations: a cloud data warehouse (Snowflake, Redshift, BigQuery), a data lake (S3, GCS), a lakehouse (Iceberg / Delta on object storage), or directly into a BI tool's cache. The right destination depends on the access pattern — interactive analyst SQL favours warehouses; ML feature stores favour lakes; cross-team reuse favours a shared lakehouse.
Pro tip: Loads should be partitioned and idempotent. Partition by `ingest_date` so backfills touch only the affected day, and use `INSERT OVERWRITE PARTITION` or `MERGE` so reruns don't duplicate rows.
Data warehouses — Redshift, Snowflake, BigQuery
The warehouse-load invariant: cloud warehouses (Redshift, Snowflake, BigQuery) prefer bulk loads from object storage; the canonical commands are `COPY INTO` (Snowflake), `COPY` (Redshift), and `LOAD DATA` / `bq load` (BigQuery); never use single-row `INSERT INTO ... VALUES` for production loads — it's 10-100× slower and defeats columnar storage.
- `COPY INTO` / `COPY` (Snowflake / Redshift) — bulk-load Parquet / CSV / JSON from S3 / GCS in parallel.
- `bq load` (BigQuery) — same shape; loads from GCS with auto-detect schema.
- File splitting — split source into `N × num_slices` files for parallel ingest.
- `COMPUPDATE ON` — auto-pick column compression on first load (Redshift); Snowflake does this automatically.
Worked example. Daily load of 50 GB of CSV from S3 into Snowflake.
| step | command | wall-clock |
|---|---|---|
| 1. Stage | `PUT file://... @stage` (already in S3) | n/a |
| 2. Copy | `COPY INTO orders FROM @stage/2026-05-11/` | ~3 min |
| 3. Verify | `SELECT COUNT(*) FROM orders WHERE …` | <1 s |
Step-by-step.
- The Transform stage has already written 40 Parquet files of ~1.25 GB each to `s3://lake/silver/orders/ingest_date=2026-05-11/`.
- The warehouse stage points at that S3 prefix via an `EXTERNAL STAGE`.
- `COPY INTO orders FROM @stage/2026-05-11/` ingests all 40 files in parallel across the warehouse's compute slices.
- The whole load finishes in 2-3 minutes — vs hours for the row-by-row `INSERT` approach.
- `STATUPDATE ON` (Redshift) or auto-stats (Snowflake) refreshes the planner so subsequent queries pick the right plan.
Worked-example solution. A Snowflake COPY INTO for daily orders:
COPY INTO silver.orders
FROM @lake_stage/silver/orders/ingest_date=2026-05-11/
FILE_FORMAT = (TYPE = PARQUET)
ON_ERROR = 'ABORT_STATEMENT'
PURGE = FALSE;
Rule of thumb: every production load uses bulk COPY; reserve single-row INSERT for tests and one-off corrections.
Data lakes — S3, GCS, ADLS with Iceberg / Delta
The lake-load invariant: a data lake load is just a write to object storage in a columnar file format (Parquet / ORC); a lakehouse load wraps that write in a table format (Iceberg / Delta) that adds ACID, time travel, and partition evolution; both patterns scale storage and compute independently.
- Plain lake — write Parquet to a prefix; register with a catalog (Glue, Hive Metastore).
- Lakehouse — `INSERT INTO iceberg.orders` or `MERGE INTO delta.orders` — ACID on object storage.
- Partitioning — by `ingest_date` or `event_date` for prune-friendly reads.
- Compaction — periodic batch job rewrites many small files into fewer large ones.
Worked example. A Spark Structured Streaming job that writes micro-batches to an Iceberg table.
| time | event count in batch | files written | total bytes |
|---|---|---|---|
| 09:00 | 1,200 | 1 | 8 MB |
| 09:01 | 980 | 1 | 7 MB |
| 09:02 | 1,350 | 1 | 9 MB |
Step-by-step.
- The Spark streaming job reads events from Kafka in 1-minute trigger windows.
- Each batch writes one Parquet file (~8 MB) to the Iceberg table's S3 location.
- The Iceberg metadata layer records a new snapshot per batch — ACID is preserved across concurrent writers.
- After an hour, the table has 60 small files; a nightly compaction job rewrites them into a single ~500 MB file for better read performance.
- Trino, Spark, and Snowflake (via Iceberg external tables) can all read the same data without copying.
Worked-example solution. A Spark write to Iceberg:
events.writeStream \
.format("iceberg") \
.outputMode("append") \
.option("path", "s3://lake/lakehouse/orders") \
.option("checkpointLocation", "s3://lake/checkpoints/orders") \
.trigger(processingTime="1 minute") \
.start()
Rule of thumb: if multiple engines need to read the same data (Spark + Trino + Snowflake), use a lakehouse table format; if only the warehouse reads it, a managed warehouse table is simpler.
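A sketch of the nightly compaction step mentioned above, using Iceberg's rewrite_data_files procedure from Spark SQL; the catalog name and target file size are assumptions to adapt to your setup:
-- Rewrite the many small streaming files into ~512 MB files for faster reads
CALL lake_catalog.system.rewrite_data_files(
    table   => 'lakehouse.orders',
    options => map('target-file-size-bytes', '536870912')
);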
BI tools and serving layers
The serving-load invariant: BI tools (Tableau, Power BI, Looker, Metabase) read from the warehouse / lakehouse; the load stage's job is to materialise the exact shape the dashboard expects so the BI tool runs sub-second queries.
- Pre-aggregated marts — gold-layer tables shaped for one dashboard each.
- Materialised views — warehouse-native auto-refresh of frequently-queried aggregates.
- Caching layer — BI tools cache for 5-60 min after the load completes.
- Reverse-ETL — push curated data back to operational systems (Salesforce, HubSpot).
Worked example. A gold.daily_revenue mart powering the CFO dashboard.
| dashboard query | source table | rows scanned | response time |
|---|---|---|---|
| Yesterday's revenue | `gold.daily_revenue` | 1 | <100 ms |
| Last 30 days | `gold.daily_revenue` | 30 | <200 ms |
| Without the mart (raw silver) | `silver.orders` | 5 M | ~5 s |
Step-by-step.
- The CFO dashboard wants "revenue per day for the last 30 days" — a tiny output but a huge underlying scan.
- Without a gold mart, the BI tool would aggregate 5 M silver rows on every refresh — ~5 s per refresh.
- With a `gold.daily_revenue` mart (one row per day), the BI tool reads 30 rows in <200 ms.
- The Load step writes one row per day to `gold.daily_revenue` after the silver layer finishes.
- End-to-end: ETL produces the mart; the BI tool reads the mart; the CFO sees a sub-second dashboard.
Worked-example solution. The Load step that produces the daily mart:
INSERT INTO gold.daily_revenue (date_key, revenue_total, order_count)
SELECT
order_date AS date_key,
SUM(amount) AS revenue_total,
COUNT(*) AS order_count
FROM silver.orders
WHERE order_date = '2026-05-11'
GROUP BY order_date
ON CONFLICT (date_key) DO UPDATE SET
revenue_total = EXCLUDED.revenue_total,
order_count = EXCLUDED.order_count;
Rule of thumb: every dashboard should read from a gold-layer mart, never from silver — fast dashboards keep stakeholder trust.
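Where the warehouse supports it, the same mart can also be expressed as a materialized view — a sketch using the column names above; refresh behaviour varies by engine, so treat this as a shape rather than a recipe:
-- Alternative to the INSERT-based mart: declare the aggregate once
CREATE MATERIALIZED VIEW gold.daily_revenue_mv AS
SELECT
    order_date AS date_key,
    SUM(amount) AS revenue_total,
    COUNT(*)    AS order_count
FROM silver.orders
GROUP BY order_date;

-- Engines without automatic maintenance (e.g. PostgreSQL) refresh it after each load
REFRESH MATERIALIZED VIEW gold.daily_revenue_mv;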
Common beginner mistakes
- Loading row-by-row with `INSERT INTO ... VALUES` instead of bulk `COPY INTO`.
- Forgetting to partition the destination table — queries scan the whole table for one day's data.
- Skipping `ON CONFLICT` / `MERGE` and discovering duplicates on the second run.
- Letting BI tools query silver directly — slow dashboards, no reuse.
- Not refreshing planner statistics after load — wrong join plans, 10-100× slower queries.
ETL Interview Question on Choosing a Load Destination
A retail company has 50 TB of clickstream events generated daily, plus a 5 GB curated gold.fact_orders table that powers BI dashboards. For each, recommend the load destination and the load command — and justify the choice.
Solution Using Lakehouse for Clickstream + Managed Warehouse for Gold
Code solution.
CLICKSTREAM (50 TB / day)
Destination: S3 + Iceberg (lakehouse)
Command: Spark Structured Streaming with .writeStream.format("iceberg")
Reason: too big for managed warehouse storage cost; date-partition-pruned access; ML reuse
GOLD.FACT_ORDERS (5 GB)
Destination: Snowflake managed table
Command: COPY INTO gold.fact_orders FROM @stage/orders/...
Reason: hot, joined, sub-second dashboard reads; ACID; analyst-friendly SQL ergonomics
CROSS-LAYER JOIN
SELECT c.region, COUNT(*) AS clicks, SUM(o.amount) AS revenue
FROM lake.clickstream c
JOIN gold.fact_orders o ON o.user_id = c.user_id
WHERE c.event_date = '2026-05-11'
GROUP BY c.region;
Step-by-step trace of the architectural decision:
| step | question | answer |
|---|---|---|
| 1 | What's the data volume? | clickstream 50 TB, gold 5 GB |
| 2 | Hot or cold? | clickstream cold (date-pruned access); gold hot (every dashboard refresh) |
| 3 | Single-engine or multi-engine? | clickstream needs ML (Spark) + analyst SQL (Trino); gold only needs warehouse SQL |
| 4 | Pick destination for clickstream | S3 + Iceberg (lakehouse) — open format, ACID, multi-engine |
| 5 | Pick destination for gold | Snowflake managed — fast, ergonomic, ACID across multi-table updates |
| 6 | Cross-layer joins? | yes — Snowflake reads Iceberg via external tables |
Output: the clickstream lakehouse holds 50 TB in S3 at ~$1,150/month; the gold warehouse holds 5 GB in Snowflake at ~$200/month; the BI tool reads gold in <200 ms; the ML pipeline reads clickstream features directly from S3 without copying.
Why this works — concept by concept:
- Lakehouse for volume + ML reuse — 50 TB at warehouse storage cost would be ~$5K/month; on S3 it's ~$1,150 and ML can read the files directly without an export step.
- Managed warehouse for hot SQL — gold is small, hot, frequently joined; warehouse storage cost is negligible at 5 GB; SQL ergonomics + sub-second response is what BI users need.
- Iceberg as the open boundary — Snowflake reads Iceberg natively; no nightly copy job, no schema drift between systems.
- Spark Structured Streaming for clickstream — micro-batch writes; ACID via Iceberg; replay-friendly.
- `COPY INTO` for gold — bulk Parquet load; sub-3-min load time; auto-compression.
- Cost — each layer carries its own cost characteristic; the right destination per workload is what keeps the AWS bill sane.
Inline CTA: more ETL practice problems for load patterns and the dimensional modeling practice page for star-schema design.
5. ETL vs ELT — transform before or after loading
When to push transforms to the warehouse vs run them upstream
The ETL vs ELT distinction is the single most-asked architecture question in fresher data-engineering interviews. The mental model: ETL transforms data before loading it into the warehouse — clean Python / Spark jobs land curated rows in the destination; ELT loads raw data into the warehouse first, then transforms it using the warehouse's own SQL engine (dbt is the canonical example). Modern cloud warehouses tilt the answer toward ELT because compute is elastic and SQL is the most-debugged transform language on earth.
Pro tip: the textbook "ETL vs ELT" answer is "depends on the warehouse" — but in interviews, name the specific feature that flips the choice. ELT wins when the warehouse has elastic compute (Snowflake's separate warehouses, BigQuery's slots) and a mature SQL transform layer (dbt). ETL wins when the warehouse can't afford the raw-data storage cost or when transforms are non-SQL (image processing, ML feature engineering).
ETL — transform before loading
The ETL invariant: raw data is transformed by an upstream compute layer (Python, Spark, custom services) into curated rows that land directly in the warehouse; the warehouse holds only the cleaned data; transforms happen on dedicated compute (often cheaper than warehouse credits).
- Compute layer — Spark cluster, Python on Kubernetes, AWS Glue, custom services.
- Warehouse holds — only the cleaned silver / gold tables.
- Storage cost — lower (no raw data in the warehouse).
- Best fit — legacy warehouses without elastic compute, non-SQL transforms (ML features, image pipelines).
Worked example. A Python pipeline that cleans CSV → loads cleaned Parquet → Snowflake.
| stage | runs on | data shape |
|---|---|---|
| Extract | Python on EC2 | raw CSV (10 GB) |
| Transform | Python + pandas (4 CPUs) | cleaned Parquet (3 GB compressed) |
| Load | `COPY INTO snowflake.gold.orders` | warehouse holds 3 GB |
Step-by-step.
- A Python script reads the raw CSV from S3 into a pandas DataFrame.
- The script applies cleaning (`dropna`, type coercion, business rules) — all in Python memory.
- The script writes the cleaned data as Parquet back to S3.
- A `COPY INTO` ships the Parquet into Snowflake — only the cleaned 3 GB lands in the warehouse.
- The warehouse never sees the raw CSV; storage cost is bounded by the curated output.
Worked-example solution. A minimal ETL outline:
import pandas as pd
import snowflake.connector
# Extract
df = pd.read_csv("s3://lake/raw/orders.csv")
# Transform
df = df.drop_duplicates(subset=["order_id"]).dropna(subset=["amount"])
df["amount"] = df["amount"].astype(float).round(2)
df.to_parquet("s3://lake/staging/orders.parquet", index=False)
# Load — Snowflake COPY INTO from the staged Parquet
ctx = snowflake.connector.connect(...)
ctx.cursor().execute("""
COPY INTO gold.orders
FROM @lake_stage/staging/orders.parquet
FILE_FORMAT = (TYPE = PARQUET);
""")
Rule of thumb: ETL is the right shape when transforms need a non-SQL runtime (Python ML features, image processing, custom validation services).
ELT — load first, transform inside the warehouse
The ELT invariant: raw data is loaded directly into the warehouse with minimal transformation; transforms run as SQL inside the warehouse (typically orchestrated by dbt); the warehouse's elastic compute and parallel SQL engine handle the transform workload at scale.
- Tools — dbt (canonical SQL transformation framework), Dataform, custom SQL.
- Pattern — `raw.orders` → `silver.orders` (dbt model) → `gold.fact_orders` (dbt model).
- Storage cost — higher (warehouse holds both raw and curated data).
- Best fit — modern cloud warehouses with elastic compute (Snowflake, BigQuery, Databricks SQL).
Worked example. A dbt project that builds silver and gold from raw.
| layer | dbt model | runs in |
|---|---|---|
| raw | direct `COPY INTO raw.orders` | Snowflake |
| silver | `models/silver/silver_orders.sql` | Snowflake SQL via dbt |
| gold | `models/gold/fact_orders.sql` | Snowflake SQL via dbt |
Step-by-step.
- The raw CSV is loaded straight into `raw.orders` via `COPY INTO` — no Python in the middle.
- A dbt model `silver_orders.sql` reads from `raw.orders` and applies dedup + type coercion as SQL.
- A downstream dbt model `fact_orders.sql` reads from `silver_orders` and applies aggregation.
- dbt runs the whole DAG on the warehouse's compute; transforms are SQL, version-controlled, testable.
- The whole "transform" stage is a `dbt run` command — minutes for hundreds of models.
Worked-example solution. A dbt silver model:
-- models/silver/silver_orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT
order_id,
customer_id,
amount::NUMERIC(14, 2) AS amount,
source_ts AS source_ts,
CURRENT_TIMESTAMP AS silver_loaded_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE source_ts > (SELECT MAX(source_ts) FROM {{ this }})
{% endif %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
Rule of thumb: ELT is the modern default; reach for it whenever your warehouse has elastic compute (Snowflake, BigQuery) and your transforms can be expressed as SQL.
When ELT beats ETL — modern cloud warehouses
The cloud-warehouse invariant: elastic-compute warehouses (Snowflake's virtual warehouses, BigQuery's slot model, Redshift Concurrency Scaling) let you spin up and tear down transform compute on demand; the marginal cost of a transform job is small; SQL transforms become first-class with dbt.
- Snowflake — separate virtual warehouses per workload; `XS` for cheap transforms, `XL` for hourly loads.
- BigQuery — slot-based pricing; transforms run on the same pool as queries.
- Redshift — Concurrency Scaling for elastic warehouse compute.
- Databricks SQL — serverless SQL warehouse + Spark for non-SQL transforms.
Worked example. Same daily transform run two ways.
| approach | compute | wall-clock | monthly cost |
|---|---|---|---|
| ETL (Python on EC2) | dedicated EC2 instance | 30 min | $300 (always-on) |
| ELT (dbt on Snowflake XS warehouse) | warehouse, on-demand | 8 min | $40 (pay-per-second) |
Step-by-step.
- The ETL approach runs a dedicated Python service on EC2 — the server is always on, even when the pipeline isn't running.
- The ELT approach runs `dbt run` against a Snowflake XS warehouse — the warehouse spins up when the job starts and suspends when it finishes.
- Monthly cost: $300 (always-on EC2) vs $40 (pay-per-second warehouse).
- The SQL transforms in dbt are version-controlled, tested, and reviewable in PRs — easier collaboration than Python ETL.
- For most analytical workloads on modern warehouses, ELT is cheaper, faster, and easier to maintain.
Worked-example solution. A dbt project layout that captures the ELT pattern:
my_dbt_project/
├── dbt_project.yml
├── models/
│ ├── staging/
│ │ ├── stg_orders.sql -- raw → typed
│ │ └── stg_customers.sql
│ ├── silver/
│ │ ├── silver_orders.sql -- typed → deduped
│ │ └── silver_customers.sql
│ └── gold/
│ ├── fact_orders.sql -- deduped → aggregated
│ └── dim_customer.sql
├── tests/
│ └── orders_amount_positive.sql -- assertion: amount > 0
└── snapshots/
└── customer_history.sql -- SCD2 snapshots
Rule of thumb: on Snowflake / BigQuery, ELT with dbt is the modern default for ~80% of analytical pipelines.
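The tests/ entry in the layout above corresponds to a dbt singular test — any SELECT that returns rows fails the run. A minimal sketch of the assertion named in the tree, referencing the silver model from earlier:
-- tests/orders_amount_positive.sql
-- dbt fails the build if this query returns any rows
SELECT order_id, amount
FROM {{ ref('silver_orders') }}
WHERE amount <= 0 OR amount IS NULL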
Common beginner mistakes
- Saying "ELT is always better" — false; non-SQL transforms (ML features, image processing) still need ETL.
- Loading raw data straight into gold tables — skips the silver layer's dedup + type coercion contract.
- Running ELT on a legacy on-prem warehouse without elastic compute — transforms compete with analyst queries.
- Treating dbt as a "SQL runner" — it's also a testing framework, a docs generator, and a lineage tool.
- Not separating raw / silver / gold layers in dbt — every model becomes a tangle.
ETL Interview Question on ETL vs ELT Pattern Selection
A team is building a new analytics platform. Sources: 200 GB / day of clickstream events (Kafka), a 5 GB Postgres OLTP database, and a 1 GB nightly SaaS export. Warehouse choice: Snowflake. For each source, recommend ETL or ELT and justify.
Solution Using ELT for Most + ETL for Heavy-Compute Streaming
Code solution.
CLICKSTREAM (200 GB / day, Kafka)
Pattern: ETL upstream + ELT downstream
Why: volume is too high for raw-to-warehouse; Spark streaming pre-aggregates and lands hourly batches
Tools: Spark Structured Streaming → Iceberg → Snowflake external table
POSTGRES OLTP (5 GB)
Pattern: ELT (raw load + dbt transform)
Why: small, schema-stable, SQL transforms natural
Tools: Fivetran CDC → Snowflake raw schema → dbt models for silver/gold
SAAS NIGHTLY EXPORT (1 GB)
Pattern: ELT
Why: tiny, low-frequency; SQL-friendly transforms
Tools: Airbyte → Snowflake raw schema → dbt models
Step-by-step trace of the pattern selection:
| source | volume | freshness | compute | pattern |
|---|---|---|---|---|
| Clickstream | 200 GB / day | sub-minute | Spark streaming | ETL upstream + ELT downstream |
| Postgres OLTP | 5 GB | daily | warehouse SQL | ELT (Fivetran + dbt) |
| SaaS export | 1 GB | daily | warehouse SQL | ELT (Airbyte + dbt) |
Output: a hybrid architecture — Spark handles the high-volume streaming pre-aggregation (ETL pattern), then Snowflake + dbt handles the curated SQL transforms (ELT pattern) for everything that lands in the warehouse.
Why this works — concept by concept:
- ETL for clickstream — 200 GB / day raw into the warehouse is expensive; Spark pre-aggregates to a more compact form first.
- ELT for OLTP + SaaS — small data, SQL-friendly transforms, version-controlled in dbt.
- Fivetran / Airbyte for raw load — managed connectors handle schema evolution and incremental sync.
- dbt for transforms — SQL is reviewable, testable, and runs on Snowflake's elastic compute.
- Iceberg + Snowflake external tables — clickstream stays queryable from Snowflake without copying.
- Cost — Spark cluster for streaming (always-on); Snowflake compute pay-per-second for transforms; total bill stays bounded.
Inline CTA: for the structured warehouse-and-transform path see ETL System Design for Data Engineering Interviews and the related data lake architecture for data engineering interviews blog.
6. ETL orchestration tools — Airflow, dbt, Spark, AWS Glue
The tools that turn an ETL script into a production pipeline
A single Python script that runs once is not a pipeline — it's a one-off job. Real production ETL needs scheduling, dependency management, retries, alerts, lineage, and observability, and the modern stack is built around a handful of tools that each solve one slice of that problem: Apache Airflow for orchestration / DAGs, dbt for SQL transforms / tests / docs, Apache Spark for distributed batch + streaming, AWS Glue / Azure Data Factory / Dataflow for managed serverless ETL, and Fivetran / Airbyte for managed source connectors.
Pro tip: the modern ETL stack is layered, not monolithic — Airflow orchestrates dbt, dbt invokes Spark, Spark reads from S3, and Fivetran handles the source connectors. Knowing which tool owns which slice (and why) is the senior signal in any ETL design round.
Apache Airflow — DAG orchestration and scheduling
The Airflow invariant: Airflow defines pipelines as Python DAGs (Directed Acyclic Graphs); each DAG is a graph of tasks with explicit dependencies; the scheduler triggers DAGs on cron / sensor / event; failures retry per task with exponential backoff. It's the canonical orchestrator for batch ETL.
- DAG — Python file that declares tasks + dependencies + schedule.
- Operators — `BashOperator`, `PythonOperator`, `SnowflakeOperator`, `DbtRunOperator`, ...
- Sensors — wait for an external event (S3 file arrival, table refresh, API webhook); a sensor sketch follows the DAG example below.
- XCom — pass small values between tasks; for large data, use object storage as the boundary.
Worked example. A 4-task daily DAG: extract → transform → reconcile → notify.
| task | runs | depends on |
|---|---|---|
| `extract_postgres` | 02:00 daily | — |
| `transform_dbt` | after extract | `extract_postgres` |
| `reconcile_counts` | after transform | `transform_dbt` |
| `notify_slack` | after reconcile | `reconcile_counts` |
Step-by-step.
- The Airflow scheduler triggers the DAG at 02:00.
- `extract_postgres` runs a `BashOperator` invoking a Python script — extracts yesterday's rows to S3.
- Once it succeeds, `transform_dbt` runs `dbt run` against the warehouse.
- `reconcile_counts` queries silver vs source for drift > 0.1%; if it fails, the DAG halts and pages on-call.
- On success, `notify_slack` posts a "✓ pipeline complete" message; the BI dashboard refreshes.
Worked-example solution. A minimal Airflow DAG:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "daily_orders_etl",
    schedule_interval="0 2 * * *",  # 02:00 UTC daily
    start_date=datetime(2026, 5, 1),
    catchup=False,
    tags=["orders"],
) as dag:
    extract = BashOperator(task_id="extract_postgres",
                           bash_command="python /opt/scripts/extract.py")
    transform = BashOperator(task_id="transform_dbt",
                             bash_command="cd /opt/dbt && dbt run --target prod")
    reconcile = BashOperator(task_id="reconcile_counts",
                             bash_command="python /opt/scripts/reconcile.py")
    notify = BashOperator(task_id="notify_slack",
                          bash_command="python /opt/scripts/notify.py 'Pipeline OK'")

    extract >> transform >> reconcile >> notify
```
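The invariant above also promises per-task retries with exponential backoff and event triggers via sensors, neither of which appears in the minimal DAG. Here is a hedged sketch of how they are usually wired in; the retry values, bucket name, and key pattern are illustrative assumptions rather than part of the pipeline above.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Illustrative retry policy; tune the numbers per task.
default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # first wait
    "retry_exponential_backoff": True,     # then 10 min, 20 min, ...
    "max_retry_delay": timedelta(hours=1),
}

with DAG(
    "daily_orders_etl_with_sensor",
    schedule_interval="0 2 * * *",
    start_date=datetime(2026, 5, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    # Sensor: block until the upstream export lands in S3 (bucket and key are hypothetical).
    wait_for_export = S3KeySensor(
        task_id="wait_for_orders_export",
        bucket_name="example-raw-bucket",
        bucket_key="exports/orders/{{ ds }}/_SUCCESS",
        poke_interval=300,      # check every 5 minutes
        timeout=6 * 60 * 60,    # give up (and fail the task) after 6 hours
    )
    transform = BashOperator(
        task_id="transform_dbt",
        bash_command="cd /opt/dbt && dbt run --target prod",
    )

    wait_for_export >> transform
```

Putting the retry policy in default_args keeps it in one place: every task in the DAG inherits the same backoff behaviour unless it overrides it.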
Rule of thumb: Airflow is the right choice for batch-scheduled pipelines with explicit dependencies. For event-driven streaming pipelines, look at Prefect, Dagster, or native cloud orchestrators.
dbt and Apache Spark — SQL transforms and distributed processing
The dbt invariant: dbt is a SQL transformation framework that runs models (SELECT statements) against your warehouse, with built-in tests, lineage graphs, and documentation. The Spark invariant: Spark is a distributed compute engine that runs Python / Scala / SQL transforms across a cluster of machines, scaling from gigabytes to petabytes.
- dbt — models/, tests/, snapshots/, seeds/; runs on Snowflake / BigQuery / Redshift / Databricks SQL.
- Spark — pyspark.sql, structured streaming, MLlib; reads from S3, Iceberg, Delta, Hive.
- When dbt wins — SQL transforms on a managed warehouse; small-to-medium-scale analytics.
- When Spark wins — non-SQL transforms (Python ML, image / log processing), petabyte-scale data.
Worked example. A dbt model + a Spark job for the same logical transform.
| approach | code | runs on | best for |
|---|---|---|---|
| dbt | models/silver_orders.sql | Snowflake | SQL-first analytics |
| Spark | pyspark script | EMR / Databricks cluster | petabyte-scale or custom Python |
Step-by-step.
- For a 1 GB dedup transform, dbt + Snowflake is simpler — write SQL, push to git, run dbt run.
- For a 1 TB transform with custom Python UDFs (ML features), Spark on EMR is the better tool.
- dbt is declarative — write the desired output as SELECT, let dbt handle the materialisation strategy.
- Spark is imperative — write the transform as code, hand-tune partitioning and caching.
- The two tools coexist: dbt for SQL warehouse layers, Spark for upstream lake processing.
Worked-example solution. A Spark transform that dedups orders:
```python
from pyspark.sql import functions as F, Window

orders = spark.read.parquet("s3://lake/bronze/orders/ingest_date=2026-05-11/")

w = Window.partitionBy("order_id").orderBy(F.col("source_ts").desc())
deduped = (
    orders.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn")
)

deduped.write.mode("overwrite").parquet(
    "s3://lake/silver/orders/ingest_date=2026-05-11/"
)
```
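For comparison, the dbt side of the same logical dedup would live in the models/silver_orders.sql file named in the table above. This is a sketch, assuming a bronze.orders source with order_id and source_ts columns and a Snowflake target (QUALIFY is Snowflake-style window filtering):

```sql
-- models/silver_orders.sql (illustrative; source and column names are assumptions)
select *
from {{ source('bronze', 'orders') }}
qualify row_number() over (
    partition by order_id
    order by source_ts desc
) = 1
```

dbt materialises this SELECT as a table or view according to the model config; the dedup logic itself stays declarative.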
Rule of thumb: use dbt for SQL transforms on the warehouse; reach for Spark when transforms are non-SQL or when data outgrows warehouse compute budgets.
Managed services — AWS Glue, Fivetran, Airbyte
The managed-service invariant: managed ETL services (AWS Glue, Azure Data Factory, GCP Dataflow, Fivetran, Airbyte) trade flexibility for operational simplicity — they handle the infrastructure, retries, scaling, and connector maintenance so the team writes business logic, not boilerplate.
- AWS Glue — managed serverless Spark; auto-scales; AWS-native pricing.
- Fivetran — managed connectors from 300+ SaaS sources to warehouses; usage-based pricing.
- Airbyte — open-source alternative to Fivetran; self-hosted or managed.
- Dataflow / ADF — GCP / Azure equivalents.
Worked example. Pulling Stripe charges via Fivetran vs writing a custom script.
| approach | time to deploy | maintenance | cost |
|---|---|---|---|
| Custom Python | ~2 weeks | high (rate-limit + schema-drift bugs) | engineer time |
| Fivetran connector | ~2 hours | low (managed) | ~$200/month for 100K MAR |
Step-by-step.
- The custom approach: write a Python script, handle pagination, build retries, deploy to Airflow, write tests, monitor schema drift, fix bugs for years.
- Fivetran approach: click "Connect Stripe", paste API key, choose destination warehouse, click "Sync". Done in 2 hours.
- Custom code wins when the source is custom or volume is huge; Fivetran wins for the 80% case (standard SaaS sources).
- The choice depends on team size and operational maturity — a 3-person data team is better off paying Fivetran than burning engineer hours.
- Hybrid is common: Fivetran for standard sources, custom Python / Spark for unique ones.
Worked-example solution. AWS Glue ETL skeleton (Spark-based, managed):
```python
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

orders = glueContext.create_dynamic_frame.from_catalog(
    database="bronze_db", table_name="orders"
)
cleaned = DropNullFields.apply(frame=orders)

glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://lake/silver/orders/"},
    format="parquet",
)
```
Rule of thumb: managed services are the right starting point — only go custom when a specific source or transform genuinely needs it.
Common beginner mistakes
- Building a custom orchestrator instead of using Airflow (or Prefect / Dagster) — wasted years.
- Running dbt locally only — needs a deployment story (Airflow, dbt Cloud, GitHub Actions).
- Spinning up a Spark cluster for a 1 GB transform that dbt + Snowflake would do faster.
- Writing custom Stripe / Salesforce / HubSpot connectors when Fivetran / Airbyte handle them.
- Skipping monitoring / lineage — the first incident is when you learn what observability you needed.
ETL Interview Question on Choosing an Orchestration Stack
A 4-person data team is bootstrapping a new analytics platform on Snowflake. Sources: 10 SaaS tools (Stripe, Salesforce, HubSpot, …), a Postgres OLTP database, and 50 GB / day of clickstream from Kafka. Pick the orchestration / ingestion / transform tools and explain the choices.
Solution Using Fivetran + Airflow + dbt + Spark Streaming
Code solution.
```
INGESTION
├── SaaS sources (Stripe, Salesforce, HubSpot, ...) → Fivetran → Snowflake raw.* schemas
├── Postgres OLTP → Fivetran CDC → Snowflake raw.postgres_*
└── Kafka clickstream → Spark Structured Streaming → Iceberg → Snowflake external table

ORCHESTRATION
└── Airflow (managed via MWAA or Astronomer)
    - Daily DAG: trigger dbt run after Fivetran loads
    - Sensor: wait for S3 clickstream Parquet arrival, then trigger dbt clickstream model

TRANSFORM
└── dbt (Cloud or self-hosted)
    - models/staging/  ← raw → typed
    - models/silver/   ← typed → deduped + conformed
    - models/gold/     ← curated marts for BI

OBSERVABILITY
├── Airflow alerts → Slack on DAG failure
├── dbt tests → fail loudly on schema / quality assertions
└── Snowflake account_usage → cost & query monitoring dashboard
```
Step-by-step trace of the architectural decisions:
| concern | answer | reason |
|---|---|---|
| SaaS ingestion | Fivetran | 10 sources × 2 weeks of custom code per source = 20 engineer-weeks; Fivetran handles it for ~$1K/month |
| OLTP ingestion | Fivetran CDC | zero impact on source; sub-minute freshness |
| Clickstream ingestion | Spark Structured Streaming + Iceberg | 50 GB / day too big for Fivetran's pricing tier |
| Orchestration | Airflow (MWAA) | mature DAG semantics; team can hire engineers who know it |
| Transform | dbt | SQL-first; version-controlled; testable; runs on Snowflake's elastic compute |
| Storage | Snowflake + Iceberg lakehouse | warehouse for curated; lakehouse for high-volume clickstream |
Output: an end-to-end stack that the 4-person team can build in ~2 months. Fivetran handles 80% of ingestion; Spark covers the streaming edge case; Airflow + dbt provide a clean orchestration + transform layer; Snowflake is the warehouse. Total operational burden is low; engineering time goes into business logic, not boilerplate.
Why this works — concept by concept:
- Fivetran for standard SaaS / OLTP — buys 10 connectors for the price of one engineer; bug-fixes are someone else's problem.
- Spark for streaming clickstream — 50 GB / day is outside Fivetran's sweet spot; Spark + Iceberg handles it cleanly.
- Airflow as the orchestrator — mature, hireable, integrates with everything.
- dbt for SQL transforms — version control, tests, lineage all built in; reviewable in PRs.
- Snowflake + Iceberg — warehouse for hot curated; lakehouse for high-volume cold; cross-layer joins via external tables.
- Cost — Fivetran ~$1K/month, Snowflake ~$3-5K/month, Airflow (MWAA) ~$300/month, Spark cluster ~$1K/month → total ~$5-7K/month for a 4-person team's full stack.
Inline CTA: for the structured pipeline-design path see ETL System Design for Data Engineering Interviews and PySpark Fundamentals.
7. Building a Python pandas ETL pipeline
A runnable end-to-end example you can adapt
To make all of the above concrete, here's a runnable Python pandas ETL pipeline that pulls a CSV of orders, cleans + deduplicates the data, and writes a curated Parquet ready for warehouse load. It's short enough to read in five minutes and structured enough to extend into a production pipeline.
Pro tip: pandas is great for the first 1-10 GB of data; beyond that, you want pyspark, polars, or duckdb. The shape of the pipeline is the same; only the engine changes.
Setup — installing pandas and pyarrow
The setup invariant: pandas reads / writes CSV, JSON, and Parquet natively; pyarrow is the Parquet engine pandas calls under the hood; install both with pip install pandas pyarrow. Everything else (S3 access, database connectors) is optional.
- pandas — DataFrame manipulation; the backbone of single-node Python ETL.
- pyarrow — Parquet read / write engine; columnar format for warehouse-friendly Parquet.
- boto3 — AWS SDK for S3 reads / writes.
- sqlalchemy / psycopg2 — database connectors for Postgres extraction.
Worked example. Install the minimal set:
| package | role |
|---|---|
| pandas | DataFrame ops, CSV / Parquet IO |
| pyarrow | Parquet engine |
| boto3 | S3 access (optional for local files) |
| requests | HTTP / API extraction (optional) |
Step-by-step.
- Create a virtual environment: python3 -m venv .venv && source .venv/bin/activate.
- Install dependencies: pip install pandas pyarrow boto3 requests.
- Verify: python -c "import pandas; print(pandas.__version__)" prints a 2.x version.
- Place a sample orders.csv in the current directory (or use the snippet's path).
- Run the script: python etl.py produces cleaned_orders.parquet.
Worked-example solution. A one-line install:
pip install pandas pyarrow boto3 requests
Rule of thumb: pin versions in requirements.txt so reruns produce identical environments.
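A minimal requirements.txt in that spirit might look like the sketch below; the pins are placeholders, so substitute whatever pip freeze reports in your own environment.

```text
# Example pins only; replace with the versions pip freeze reports for you.
pandas==2.2.*
pyarrow==16.*
boto3==1.34.*
requests==2.32.*
```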
Extract + Transform + Load in 30 lines of Python
The pipeline invariant: extract reads from a source into a DataFrame; transform applies cleaning + dedup + typing; load writes to the destination as Parquet. The whole script fits in one file and is testable end-to-end.
- Extract — pd.read_csv / pd.read_sql / pd.read_json.
- Transform — .drop_duplicates(), .dropna(), .astype(), business rules with np.where.
- Load — .to_parquet() for warehouse-friendly columnar output.
- Idempotency — overwrite the destination; never append blindly.
Worked example. Cleaning an orders.csv with 12,847 rows down to 12,835 unique rows.
| metric | before | after |
|---|---|---|
| rows | 12,847 | 12,835 (12 duplicates dropped) |
| null amounts | 38 | 0 (replaced with 0.0) |
| amount dtype | object | float64 |
| order_date dtype | object | datetime64 |
Step-by-step.
- pd.read_csv("orders.csv") reads the raw file into a DataFrame.
- .drop_duplicates(subset=["order_id"], keep="last") collapses 12,847 rows to 12,835 unique orders.
- .fillna({"amount": 0}) replaces 38 null amounts with 0.0 per the business rule.
- .astype({"amount": "float64", "order_date": "datetime64[ns]"}) coerces types.
- .to_parquet("cleaned_orders.parquet", index=False) writes the curated output; downstream COPY INTO reads it.
Worked-example solution. A complete ETL script:
```python
import pandas as pd
from pathlib import Path

RAW = Path("orders.csv")
OUT = Path("cleaned_orders.parquet")

# ── Extract ─────────────────────────────────────────────────────────────
df = pd.read_csv(RAW)

# ── Transform ───────────────────────────────────────────────────────────
df = (
    df.drop_duplicates(subset=["order_id"], keep="last")
      .fillna({"amount": 0.0, "status": "unknown"})
      .astype({"amount": "float64"})
)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["order_date"])

# Business rule
df["is_high_value"] = df["amount"] > 10_000

# ── Load ────────────────────────────────────────────────────────────────
df.to_parquet(OUT, index=False, engine="pyarrow")
print(f"Loaded {len(df)} rows → {OUT}")
```
Rule of thumb: keep extract, transform, and load as separate functions so you can unit-test each independently.
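To make that rule of thumb concrete, here is a sketch of the same pipeline split into extract / transform / load functions plus a tiny pytest-style check; the function names and the toy test data are illustrative assumptions.

```python
import pandas as pd


def extract(path: str) -> pd.DataFrame:
    """Read the raw CSV into a DataFrame."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Dedup, fill nulls, coerce types; pure DataFrame-in, DataFrame-out."""
    out = (
        df.drop_duplicates(subset=["order_id"], keep="last")
          .fillna({"amount": 0.0})
          .astype({"amount": "float64"})
    )
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out.dropna(subset=["order_date"])


def load(df: pd.DataFrame, path: str) -> None:
    """Write the curated Parquet output (overwrite, never append)."""
    df.to_parquet(path, index=False, engine="pyarrow")


def test_transform_dedups_and_fills_nulls():
    raw = pd.DataFrame({
        "order_id": [1, 1, 2],
        "amount": [500.0, 500.0, None],
        "order_date": ["2026-05-11", "2026-05-11", "2026-05-12"],
    })
    cleaned = transform(raw)
    assert len(cleaned) == 2                    # duplicate order_id collapsed
    assert cleaned["amount"].isna().sum() == 0  # null amount replaced with 0.0


if __name__ == "__main__":
    load(transform(extract("orders.csv")), "cleaned_orders.parquet")
```

Because transform takes a DataFrame and returns a DataFrame, the test never touches the filesystem, which keeps it fast and deterministic.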
Scaling out — when to move beyond pandas
The scale-out invariant: pandas is single-threaded and in-memory; beyond ~10 GB it slows; the upgrade path is polars (single-machine, multi-core, Arrow-backed) or pyspark (distributed across many machines).
- polars — single-machine pandas alternative with a similar DataFrame API; 5-10× faster; lazy evaluation.
- pyspark — distributed DataFrame API; same shape as pandas but scales to terabytes.
- duckdb — embedded analytical SQL engine; great for ~100 GB datasets on a laptop.
- dbt + warehouse — when transforms can be expressed as SQL, push them into the warehouse.
Worked example. Same dedup transform, three engines.
| engine | time on 10 GB | RAM needed |
|---|---|---|
| pandas | ~3 min | ~40 GB |
| polars | ~30 s | ~12 GB |
| pyspark (10-node) | ~10 s | distributed |
Step-by-step.
- For 1 GB, pandas is fine — 10-30 seconds on a laptop.
- At 10 GB, pandas starts hitting memory limits on a 16 GB laptop; polars reduces RAM by 4× and runs 5× faster.
- At 100 GB, you want duckdb (single-machine columnar) or pyspark (distributed).
- At 1 TB+, pyspark on a real cluster is the only practical option.
- The transform shape (dedup + cast + write) doesn't change — only the engine does.
Worked-example solution. The same dedup in polars:
```python
import polars as pl

df = (
    pl.read_csv("orders.csv")
      .unique(subset=["order_id"], keep="last")
      .with_columns(
          pl.col("amount").fill_null(0.0).cast(pl.Float64),
          pl.col("order_date").str.strptime(pl.Date, "%Y-%m-%d", strict=False),
      )
)
df.write_parquet("cleaned_orders.parquet")
```
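The list and table above mention duckdb without showing code, so here is a hedged sketch of the same dedup pushed into duckdb's SQL engine through its Python API. It assumes the CSV carries a source_ts column to break ties, as in the interview variant below.

```python
import duckdb

# Same dedup + null-fill, expressed as SQL and executed by duckdb's columnar engine.
con = duckdb.connect()  # in-memory database
con.execute("""
    COPY (
        SELECT * REPLACE (coalesce(amount, 0.0) AS amount)
        FROM read_csv_auto('orders.csv')
        QUALIFY row_number() OVER (
            PARTITION BY order_id ORDER BY source_ts DESC
        ) = 1
    ) TO 'cleaned_orders.parquet' (FORMAT PARQUET)
""")
```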
Rule of thumb: start with pandas; switch to polars / pyspark only when data outgrows the laptop.
Common beginner mistakes
- Reading a 50 GB CSV with pd.read_csv and watching the laptop OOM.
- Appending to the output file instead of overwriting — silently growing duplicates.
- Skipping dtype coercion — amount stays object and aggregations break.
- Forgetting errors="coerce" on pd.to_datetime — one bad date kills the whole load.
- Not separating extract / transform / load into functions — untestable script.
ETL Interview Question on Building a First Pandas Pipeline
You're asked in a live coding round to write a Python ETL pipeline that reads orders.csv (~1 GB), removes duplicate order_id rows keeping the latest by source_ts, replaces null amount with 0, and writes a Parquet file. Write the runnable script.
Solution Using pandas + pyarrow
Code solution.
```python
import pandas as pd
from pathlib import Path

RAW = Path("orders.csv")
OUT = Path("cleaned_orders.parquet")

# 1) Extract
df = pd.read_csv(RAW, parse_dates=["source_ts"])

# 2) Transform
#    Sort by source_ts ascending so .drop_duplicates(keep="last")
#    keeps the most-recent row per order_id.
df = (
    df.sort_values(["order_id", "source_ts"])
      .drop_duplicates(subset=["order_id"], keep="last")
      .assign(amount=lambda d: d["amount"].fillna(0).astype("float64"))
)

# 3) Load
df.to_parquet(OUT, index=False, engine="pyarrow")
print(f"Wrote {len(df):,} rows → {OUT}")
```
Step-by-step trace for a 12,847-row input:
| stage | rows | notes |
|---|---|---|
| Read CSV | 12,847 | raw input |
| Sort by (order_id, source_ts) | 12,847 | rows sorted, none dropped |
| drop_duplicates(keep="last") | 12,835 | 12 duplicates dropped, latest kept |
| fillna(amount=0) | 12,835 | 38 nulls replaced |
| astype(float64) | 12,835 | type coerced |
| Write Parquet | 12,835 | columnar output, ~80 MB |
Output: cleaned_orders.parquet — 12,835 unique orders with non-null amount, ready for COPY INTO the warehouse. Total wall-clock: ~10 seconds on a laptop for 1 GB of input.
Why this works — concept by concept:
- parse_dates=["source_ts"] — pandas reads timestamps directly into datetime64, skipping a later to_datetime call.
- Sort before drop_duplicates(keep="last") — guarantees the latest source_ts per order_id survives; deduping without the sort would keep an arbitrary row.
- assign(amount=lambda d: d["amount"].fillna(0).astype("float64")) — chain-friendly transform that returns a new DataFrame, easier to test than mutating in place.
- Parquet output — columnar, compressed, warehouse-friendly; ~10× smaller than CSV for the same data.
- pyarrow engine — the modern Parquet backend; faster and more compatible than fastparquet.
- Cost — O(N log N) for the sort, O(N) for the rest; the ~1 GB input is comfortable on a 16 GB laptop.
Inline CTA: more Python practice problems for pandas-style ETL and the ETL practice page for end-to-end pipeline shapes.
Choosing your ETL stack (checklist)
Pick the right tool for the workload, not the other way around:
- For SaaS sources (Stripe, Salesforce, HubSpot) → managed connectors (Fivetran, Airbyte).
- For Postgres / MySQL OLTP → CDC (Debezium + Kafka, or Fivetran CDC).
- For high-volume clickstream / logs → Spark Structured Streaming + Iceberg / Delta.
- For SQL transforms on the warehouse → dbt + Snowflake / BigQuery / Redshift.
- For non-SQL transforms (ML features, image / NLP processing) → Spark or custom Python.
- For batch orchestration → Apache Airflow (or Prefect / Dagster for newer projects).
- For BI / serving → gold-layer marts in the warehouse; never let dashboards query silver.
Frequently asked questions
What is an ETL pipeline?
An ETL pipeline is an automated workflow that Extracts raw data from multiple sources, Transforms it into a clean and structured format, and Loads it into a destination system (data warehouse, data lake, or BI tool) for reporting and analysis. The three letters describe the stages; in practice the pipeline also handles scheduling, retries, observability, and idempotency.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the warehouse (the curated, traditional pattern). ELT loads raw data into the warehouse first, then transforms it using the warehouse's own SQL engine (the modern cloud-warehouse pattern, typically with dbt). Modern Snowflake / BigQuery / Redshift workloads tilt heavily toward ELT because the warehouse compute is elastic and SQL is the most-debugged transform language.
Which tool is best for ETL orchestration?
Apache Airflow is the most-deployed batch orchestrator and the safest hireable choice. For event-driven streaming pipelines, Prefect and Dagster are modern alternatives with better Python ergonomics. For SQL-only pipelines on cloud warehouses, dbt with its built-in scheduler (dbt Cloud) handles most cases without needing a separate orchestrator.
Can I build an ETL pipeline in Python?
Yes — pandas + pyarrow for single-machine ETL, pyspark for distributed, polars for fast single-machine. The pipeline shape (extract → transform → load) is the same regardless of the engine. Real production pipelines wrap the Python script in Airflow / Prefect for scheduling, retries, and observability.
What's the most common ETL pipeline failure mode?
Schema drift — the source system silently renames a column, changes a date format, or splits a field, and the pipeline either crashes or writes wrong data. Defend with strict schema assertions at ingest, explicit alerts on drift, and a published contract with the source team. The runner-up is non-idempotent loads — reruns produce duplicates and silent data corruption.
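As a concrete illustration of that first defence, here is a hedged sketch of a schema assertion in the style of the section 7 pandas pipeline; the expected column set is an assumption based on the running orders example.

```python
import pandas as pd

# Expected contract for the orders extract; column names are illustrative.
EXPECTED_COLUMNS = {"order_id", "customer_name", "amount", "order_date"}


def assert_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly at ingest instead of writing silently wrong data downstream."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"Schema drift: source dropped columns {sorted(missing)}")
    if extra:
        # New columns are usually harmless, but surface them so the contract gets updated.
        print(f"WARNING: source added columns {sorted(extra)}")
    return df


df = assert_schema(pd.read_csv("orders.csv"))
```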
Practice on PipeCode
Reading is one thing; reps are another. To turn the ETL primitives in this guide into reliable interview answers, pair the reading with practice on real problems and structured courses.
- Drill SQL transformations — the workhorse of ELT — at the SQL practice page.
- Practice ETL pipeline design with company-flavoured problems at the ETL practice page.
- Build pandas / Python ETL fluency at the Python practice page.
- Take the structured path with ETL System Design for Data Engineering Interviews and PySpark Fundamentals.
- Read related guides — the data lake architecture for data engineering interviews blog covers the medallion zones every ETL pipeline writes into, and the SQL interview questions for data engineering blog drills the SQL primitives every Transform step uses.
The ETL pipelines you'll be asked to design in interviews use exactly the primitives in this guide — extract from messy sources, transform with idempotent SQL or Python, load into a warehouse the BI team trusts, and orchestrate the whole thing with Airflow + dbt. Practice them in the practice surface and the design rounds become reps, not surprises.





Top comments (0)