etl vs elt is the single most-asked architectural question on modern data engineering interviews and the most-debated pattern choice on greenfield pipelines, because the same five letters reordered describe two fundamentally different cost profiles, two different team skill mixes, and two different compliance postures. The etl elt difference is not just where the "T" runs — it is which compute you pay for, how fast new sources land, and whether sensitive columns ever touch the warehouse in raw form.
This guide walks through the etl vs elt architecture end-to-end: the classic transform-first ETL shape that dominated pre-2015 data warehousing, the modern transform-after-load ELT shape that took over once Snowflake, BigQuery, and Databricks made warehouse compute cheaper than dedicated ETL servers, the six-dimension decision matrix that picks the right shape for a given workload, and the etl elt pipeline hybrid (EtLT) that most production teams actually ship in 2026. Each section pairs a teaching block with a worked example — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works — so you leave with both the extract transform load vs extract load transform vocabulary and the design instincts to defend either shape in an interview round.
When you want hands-on reps immediately after reading, drill the ETL practice library →, browse data pipeline + transformation problems →, and rehearse on the ETL system design course →.
On this page
- The 3-letter difference and why it matters
- Classic ETL — transform-first, before the warehouse load
- Modern ELT — load-first, transform inside the warehouse
- When ETL still wins — the 6-dimension decision matrix
- The hybrid pattern — EtLT (light pre-load + warehouse transforms)
- Cheat sheet — pick ETL vs ELT vs EtLT
- Frequently asked questions
- Practice on PipeCode
1. The 3-letter difference and why it matters
etl vs elt collapses to a single decision: where does the T run — outside the warehouse or inside it?
The one-sentence invariant: ETL transforms data BEFORE it lands in the warehouse — on a dedicated ETL server with its own compute and licensing; ELT loads raw data INTO the warehouse first and then transforms it there using the warehouse's own compute (typically via dbt or native SQL). Once you internalise that mental model — "T before L" vs "T after L" — every downstream etl elt difference discussion (cost, latency, compliance, team skill) becomes a derivative question rather than a fresh debate.
ETL vs ELT in five bullets.
- ETL — transform-then-load. Source rows hit a dedicated ETL server (Informatica, Talend, SSIS, DataStage) where they are cleaned, joined, deduplicated, and modelled. Only modelled rows land in the warehouse. The warehouse stays small and pristine.
-
ELT — load-then-transform. Source rows are streamed through a lightweight loader (Fivetran, Airbyte, custom Python) directly into a
rawschema in the warehouse. Transforms run inside the warehouse via dbt models, Spark SQL, or native warehouse SQL. The warehouse holds both raw and modelled data. - The cost flip. Pre-2015, warehouse compute was expensive ($/TB/hour on Teradata, Netezza, on-prem Oracle) — so you minimised warehouse work and paid for a dedicated ETL server. Post-2020, Snowflake / BigQuery / Databricks made warehouse compute cheaper than dedicated ETL boxes — so you maximise warehouse work and pay only for the queries you actually run.
- The latency flip. ETL was batch-shaped because the ETL server was a finite resource you scheduled around. ELT is closer to micro-batch because the warehouse autoscales — a 200M-row dbt model rebuild runs whenever you trigger it, not when the ETL queue clears.
- The compliance flip. ETL can mask / encrypt / drop PII before the warehouse ever sees it — a hard win for regulated industries. ELT lands PII raw and relies on warehouse-level row security or post-load masking — a softer compliance posture that demands stricter access controls.
The 2026 reality (post-Snowflake, post-dbt, post-Fivetran).
-
ELT is the default. Almost every cloud-native data team starts with
Fivetran → Snowflake → dbtorAirbyte → BigQuery → dbtbecause it goes from "no warehouse" to "first dashboard" in days, not months. - ETL is alive in two niches. Strict-PII regulated industries (banking, healthcare, EU GDPR-heavy) and budget-capped warehouses (Redshift Serverless with per-query spending limits) still ship ETL as the primary pattern — the warehouse is too sensitive or too expensive to be the transform engine.
- EtLT is the production reality. Most multi-source pipelines ship a hybrid: light "Et" (PII mask, type cast, decrypt) outside the warehouse, then heavy "T" inside the warehouse via dbt. You get compliance + cloud economics in one shape.
What interviewers listen for.
- Do you frame the choice as a trade-off, not a religion? — senior signal. ELT-only or ETL-only answers are a red flag.
- Do you reach for
acksof compliance the moment PII shows up in the question? — required answer (mask before warehouse for regulated data). - Do you mention dbt or warehouse-native transforms when explaining ELT? — current-default signal.
- Do you sketch the 6-dimension decision matrix (PII, latency, cost, source velocity, team skill, on-prem vs cloud) rather than picking a side? — staff-level signal.
Worked example — frame the ETL vs ELT decision for a new pipeline
Detailed explanation. A realistic interview probe sounds like: "We have eight SaaS sources (Salesforce, Stripe, HubSpot, Zendesk, Mixpanel, Segment, Jira, GitHub) feeding a BI dashboard. The warehouse is fresh — pick ETL or ELT, defend it." The honest answer is a decision tree, not a one-word pick.
Question. Eight SaaS sources, Snowflake as the target warehouse, a 3-person analytics-engineering team, no PII in scope. Choose ETL or ELT and justify the choice on cost, time-to-first-dashboard, and team skill mix.
Input (the workload).
| Factor | Value |
|---|---|
| Sources | 8 SaaS (Salesforce, Stripe, HubSpot, Zendesk, Mixpanel, Segment, Jira, GitHub) |
| Target | Snowflake (cloud warehouse, autoscaling) |
| PII | None in scope |
| Latency SLO | 1 hour freshness |
| Team | 3 analytics engineers (SQL fluent, light Python) |
| Budget | Pay-per-query, no fixed cap |
Code (the ELT shape you would ship).
# fivetran-connectors.yaml — managed loader handles every source
connectors:
- source: salesforce
schedule: 15min
schema: raw_salesforce
- source: stripe
schedule: 15min
schema: raw_stripe
- source: hubspot
schedule: 60min
schema: raw_hubspot
- source: zendesk
schedule: 60min
schema: raw_zendesk
- source: mixpanel
schedule: 60min
schema: raw_mixpanel
- source: segment
schedule: 15min
schema: raw_segment
- source: jira
schedule: 60min
schema: raw_jira
- source: github
schedule: 60min
schema: raw_github
-- dbt/models/staging/stg_stripe_charges.sql — the "T" inside Snowflake
{{ config(materialized='view') }}
SELECT
id AS charge_id,
customer AS customer_id,
amount / 100.0 AS amount_usd,
currency,
status,
TO_TIMESTAMP(created) AS charged_at,
_fivetran_synced AS loaded_at
FROM {{ source('raw_stripe', 'charges') }}
WHERE _fivetran_deleted IS NULL
Step-by-step explanation.
-
Fivetran handles "EL". Each connector pulls source rows on its schedule (15 or 60 min), normalises them lightly, and writes them to a per-source
raw_*schema in Snowflake. No code on the analytics team's side beyond the YAML. -
Snowflake does "T". dbt models in the
staging,intermediate, andmartslayers read fromraw_*and produce business-ready tables. Compute scales with the model — a 100M-row mart rebuild fires up a Medium warehouse for 90 seconds, then suspends. - Cost shape. You pay Fivetran per row (predictable) and Snowflake per query-second (variable). Adding a 9th source is a 15-minute YAML change; adding a 9th ETL pipeline in Informatica would be a multi-week dev cycle.
-
Time-to-first-dashboard. Day 1 — connectors are flowing into
raw_*. Day 2 — first staging models. Day 3 — first mart. Day 4 — first dashboard. - Team fit. A 3-person SQL-fluent team can own the full pipeline. There is no separate "ETL developer" role, no Informatica licensing call, no specialist hire.
Output.
| Decision | Pick | Why |
|---|---|---|
| Pattern | ELT | cloud warehouse + SaaS sources + small SQL team |
| Loader | Fivetran | 8 SaaS connectors, schema drift handled |
| Transform engine | Snowflake + dbt | warehouse pays only for queries that run |
| Time to first dashboard | ~4 days | vs ~4 weeks for an ETL-server build |
| Headcount | 3 analytics engineers | no specialist ETL dev needed |
Rule of thumb. When the workload is "cloud warehouse + SaaS sources + SQL-fluent team + no strict pre-warehouse PII gate," ELT wins on every dimension that matters. The question gets harder only when PII, on-prem sources, or hard budget caps enter the room.
ETL vs ELT interview question — picking the right shape from raw constraints
A senior interviewer often shapes this round as: "Given the constraints I'm about to read out, would you ship ETL or ELT — and why?" It blends architectural literacy with trade-off reasoning — the two muscles every etl elt pipeline interview opens with.
Solution Using a constraint-driven decision tree
┌────────────────────────────────────────┐
│ Pick ETL vs ELT for a new pipeline │
└────────────────────────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
▼ ▼ ▼
Strict pre-warehouse Cloud warehouse + On-prem warehouse
PII / GDPR rule? SQL-fluent team? or legacy ETL stack?
│ │ │
yes │ yes│ yes│
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ ETL or │ │ ELT │ │ ETL │
│ EtLT │ │ default│ │ legacy │
│ (mask │ │ shape │ │ install│
│ first) │ │ │ │ │
└────────┘ └────────┘ └────────┘
Step-by-step trace.
| Workload sketch | PII gate? | Cloud target? | Team skill | Verdict |
|---|---|---|---|---|
| Healthcare claims → on-prem DW | yes (HIPAA PHI) | no | specialist ETL devs | ETL — mask PHI on the ETL server before it reaches the DW |
| 8 SaaS → Snowflake | no | yes | SQL-fluent analytics engineers | ELT — Fivetran + dbt is the default |
| Banking transactions → BigQuery | yes (account numbers) | yes | mixed | EtLT — mask PII pre-load, dbt models post-load |
| Legacy SSIS on-prem, no cloud budget | varies | no | SSIS specialists | ETL — stay on the legacy install, don't force a migration |
| 500-source ad-tech firehose → Databricks | partial | yes | analytics engineers | ELT — manual ETL would never keep up with source velocity |
Output:
| Workload | Pattern | Primary driver |
|---|---|---|
| Healthcare claims | ETL | compliance — PHI never reaches DW raw |
| 8 SaaS → Snowflake | ELT | cloud economics + SQL team |
| Banking → BigQuery | EtLT | compliance + cloud both required |
| Legacy SSIS | ETL | sunk cost + on-prem |
| 500-source firehose | ELT | source velocity dominates |
Why this works — concept by concept:
- Compliance gate as the first branch — if the law says raw PII cannot land in the warehouse, the question is already answered: you need some transformation outside the warehouse. ETL or EtLT, never pure ELT.
- Cloud + SQL team as the ELT signal — every dimension (cost, time-to-ship, hire-ability, source velocity) favours ELT when the warehouse is cloud-native and the team speaks SQL. Picking ETL here is fighting both economics and labour markets.
- Source velocity as the firehose signal — when you onboard a new source every week, the per-source ETL dev cycle (days) cannot keep up. Connectors (30 min each) are the only way through.
- Legacy install as the inertia signal — re-platforming a healthy on-prem ETL install for the sake of "modern" is a multi-year project with negative ROI. Stay on the legacy until a separate forcing function appears.
- Cost — picking the right pattern up-front saves quarters of rework; picking the wrong one creates a "platform migration" line item that dominates the next two roadmaps.
A short history — why ETL was the only option pre-2015
Detailed explanation. Understanding why the industry inverted the letters helps you defend either shape on the day. The story is not "engineers got smarter" — it is "warehouse compute got cheaper."
Question. Walk through the four eras of data warehousing and explain when (and why) ETL gave way to ELT as the default.
Input (the four eras at a glance).
| Era | Years | Warehouse class | Compute economics | Default pattern |
|---|---|---|---|---|
| Era 1 — Appliance | 2000-2010 | Teradata, Netezza, Exadata | $/TB/hour expensive, fixed capacity | ETL |
| Era 2 — Hadoop | 2010-2015 | HDFS + Hive | cheap storage, slow query | ETL + ad-hoc |
| Era 3 — Cloud DW | 2015-2020 | Redshift, BigQuery, Snowflake | per-second metered compute | ELT (new default) |
| Era 4 — Lakehouse | 2020-2026 | Snowflake, BigQuery, Databricks, Delta / Iceberg | autoscale + cheap storage | ELT + EtLT |
Step-by-step explanation.
- Era 1 (Appliance). Warehouse compute was scarce and expensive — every CPU-cycle on Teradata mattered. The cheapest way to keep the warehouse fast was to do all the heavy work elsewhere. Dedicated ETL servers (Informatica) emerged as the natural transform layer. The cost of a separate ETL stack was worth it.
- Era 2 (Hadoop). Storage got cheap (HDFS); compute got distributed (MapReduce) but slow per-query. Teams still ran ETL into the warehouse for fast queries; Hadoop was a data lake, not a warehouse replacement. ETL stayed dominant.
- Era 3 (Cloud DW). Snowflake landed in 2014; BigQuery became serious in 2015; per-second compute metering changed everything. A 60-second warehouse query cost cents, not dollars. Now the calculus flipped — paying for warehouse compute was cheaper than maintaining an Informatica install. dbt (2016) gave the warehouse a clean transform-orchestration layer. ELT became the default for cloud-native teams.
- Era 4 (Lakehouse). Snowflake / BigQuery / Databricks all add cheap storage to elastic compute; open table formats (Delta, Iceberg, Hudi) decouple storage from any single warehouse vendor. ELT remains the default; EtLT emerges to handle the compliance + cost-cap niches that pure ELT cannot.
- Why the cost flip is permanent. Cloud warehouses get cheaper per query every year (autoscale + serverless). Dedicated ETL servers do not get cheaper — they get more expensive as the SaaS source landscape grows and connector lag widens.
Output.
| Era | Pattern | Why |
|---|---|---|
| 2000-2010 | ETL | warehouse compute is the bottleneck — move work out |
| 2010-2015 | ETL + Hadoop | data lake offloads storage; warehouse still expensive |
| 2015-2020 | ELT | per-second warehouse compute cheaper than ETL servers |
| 2020-2026 | ELT + EtLT | warehouse autoscale + dbt + compliance gates |
Rule of thumb. The pattern in vogue follows the cheapest compute. When warehouse compute is expensive, transform outside (ETL). When warehouse compute is cheap and elastic, transform inside (ELT). EtLT is what you ship when compliance is the cost-of-doing-business factor pure ELT cannot satisfy.
ETL
Topic — etl
ETL pipeline design problems (extract, transform, load)
2. Classic ETL — transform-first, before the warehouse load
etl vs elt architecture starts with the classic shape: dedicated compute outside the warehouse, only modelled rows inside
The mental model in one line: in classic ETL, source rows fan through a dedicated ETL server (Informatica, Talend, SSIS, DataStage) that runs the full cleaning + modelling pipeline, then writes only the modelled output to the warehouse — so the warehouse stays "clean" and small, but you pay for a separate compute footprint and a separate licensing stack. This was the only viable shape pre-2015 because warehouse compute was scarce and expensive enough that you minimised it at all costs.
The three stages of classic ETL.
- Extract. Pull rows from source systems (OLTP databases, flat files, mainframe extracts, SaaS APIs). Often nightly batch; some shops ship hourly micro-batch.
- Transform. The big middle. Cleaning, type casting, joining lookup tables, applying business rules, generating surrogate keys, building Slowly Changing Dimension (SCD) histories — all on the ETL server's CPU and RAM.
-
Load. Write the modelled rows (cleaned facts + dimensions) to the warehouse via bulk insert /
MERGE/ load utility. The warehouse never sees raw source data.
The tooling that defined this era.
- Informatica PowerCenter — the dominant enterprise ETL platform from the 2000s. Visual mapping designer, deep connector library, expensive licensing.
- Talend Data Integration — open-source-rooted competitor; broad ecosystem; Java codegen under the hood.
- Microsoft SSIS (SQL Server Integration Services) — bundled with SQL Server; common in mid-market Windows shops.
- IBM DataStage — the legacy choice in banking, insurance, telco. Still humming along inside many global Fortune 500 stacks.
Why ETL was right for its era.
- Warehouse compute was scarce. Teradata, Netezza, on-prem Oracle Exadata — every CPU-second cost real money. Moving transforms off the warehouse onto a cheaper ETL box was a net cost win.
- PII compliance gate. Banks, hospitals, insurers had auditable workflows that required sensitive columns to be masked before they entered the warehouse. The ETL server was a natural gate.
- Batch-shaped business. Reports refreshed daily; nightly batch windows were normal; the warehouse went quiet at 4 AM and woke up at 7 AM with fresh modelled data. ELT's "transform on demand" wasn't required.
Where ETL hurts in 2026.
- Two compute stacks to operate. The ETL server cluster and the warehouse cluster. Two scaling stories, two incident channels, two on-call rotations.
- Per-source dev cost. Adding a new source is a development project — a new mapping, new tests, new deploy. Days to weeks per source.
- Connector lag. Commercial ETL vendors take quarters to support new SaaS APIs (Stripe, Mixpanel, Segment) compared to Fivetran or Airbyte shipping the same connector in a sprint.
- Hire-ability. "Informatica developer" is a shrinking labour market; "analytics engineer who writes dbt" is a growing one.
Worked example — Informatica → SQL Server warehouse with row-level transforms
Detailed explanation. A classic ETL job ingests a customers CSV from an OLTP extract, cleans the email column, derives a country code from the address, joins a country lookup table to get the ISO-3 code, generates a surrogate key, and writes the modelled dim_customer row to the SQL Server warehouse. Each row is processed on the Informatica server before any of it touches the warehouse.
Question. Walk through an Informatica-shaped ETL pipeline that ingests a daily customers.csv extract (1M rows), cleans the email, derives the country code from the address, joins to a country lookup, generates a surrogate key, and lands dim_customer rows in the SQL Server warehouse — with row-level transforms running entirely on the ETL server.
Input (the source CSV).
| customer_id | address | created_at | |
|---|---|---|---|
| 1001 | JANE@example.com |
"221B Baker St, London, UK" | 2026-06-04 10:11 |
| 1002 | bob@ACME.org |
"10 Rue Bonaparte, Paris, France" | 2026-06-04 10:12 |
| 1003 | Alice@beta.io |
"1 Apple Park Way, Cupertino, USA" | 2026-06-04 10:13 |
Code (an Informatica-style mapping, expressed in pseudocode + the equivalent SQL on the ETL server).
# informatica_mapping.py — runs on the ETL server, not the warehouse
import pandas as pd
raw = pd.read_csv("/landing/customers_2026-06-04.csv")
lookup = pd.read_csv("/lookups/country_iso3.csv") # country -> iso3
# 1) Cleanse email — lowercase + strip whitespace
raw["email"] = raw["email"].str.strip().str.lower()
# 2) Derive country from the last comma-separated token in the address
raw["country"] = (
raw["address"].str.rsplit(",", n=1).str[-1].str.strip()
)
# 3) Join to country ISO-3 lookup
modelled = raw.merge(lookup, how="left", on="country")
# 4) Generate the surrogate key
modelled = modelled.reset_index(drop=True)
modelled["customer_sk"] = modelled.index + 1_000_001
# 5) Project the warehouse-facing columns only
out = modelled[["customer_sk", "customer_id", "email",
"country", "iso3", "created_at"]]
# 6) Bulk-load to SQL Server (only the modelled rows leave the ETL box)
out.to_sql(
"dim_customer",
con="mssql+pyodbc://etl_user@DW01/PROD_DW",
if_exists="append",
index=False,
method="multi",
chunksize=10_000,
)
Step-by-step explanation.
- Read the CSV on the ETL server. The 1M rows live in RAM (or stream from disk if the file is huge). The warehouse hasn't seen a single byte yet.
-
Cleanse the email column.
str.strip().str.lower()normalises whitespace and case. This is exactly the kind of work the ETL server is sized for — row-level CPU operations on a few tens of GB of RAM. -
Derive the country. The address column is parsed; the last comma-separated token becomes the country. In a real Informatica mapping this would be an
Expression Transformationwith aSUBSTR+INSTRpair. -
Join the country lookup.
merge(... on="country")enriches each row with the ISO-3 code. In a real mapping this is aLookup Transformation. -
Generate the surrogate key. A monotonically-increasing integer drawn from a sequence (in Informatica, a
Sequence Generator). Surrogate keys decouple the warehouse from any volatility in source IDs. - Project the final columns. Only the warehouse-facing columns survive. Any raw fields not needed in the warehouse stay on the ETL server — that's the "clean warehouse" guarantee in action.
-
Bulk-load. A single
INSERTintodim_customer. The warehouse spends its compute on the load, not on the transform.
Output (the warehouse table after the load).
| customer_sk | customer_id | country | iso3 | created_at | |
|---|---|---|---|---|---|
| 1000001 | 1001 | jane@example.com | UK | GBR | 2026-06-04 10:11 |
| 1000002 | 1002 | bob@acme.org | France | FRA | 2026-06-04 10:12 |
| 1000003 | 1003 | alice@beta.io | USA | USA | 2026-06-04 10:13 |
Rule of thumb. Classic ETL is at its best when row-level cleansing dominates the work, the warehouse is expensive per CPU-second, and PII compliance demands a pre-warehouse gate. Once any of those constraints lifts, ELT or EtLT is almost always a cheaper, faster shape.
ETL interview question — defending classic ETL on a regulated workload
A senior interviewer often probes ETL like this: "Your bank's compliance team forbids any raw account number landing in the warehouse. The legacy stack is Informatica + Oracle Exadata. Design the ETL job and defend why ETL still wins here."
Solution Using pre-warehouse PII tokenisation in an Informatica mapping
# informatica_pii_etl.py — runs on the ETL server, NEVER in the warehouse
import pandas as pd
import hashlib
import hmac
ROTATION_KEY = open("/secrets/pii_hmac_key.bin", "rb").read() # rotated quarterly
def tokenise_account(account_number: str) -> str:
"""HMAC-SHA256 tokenisation — deterministic, irreversible without the key."""
h = hmac.new(ROTATION_KEY, account_number.encode(), hashlib.sha256)
return "AC_" + h.hexdigest()[:24]
raw = pd.read_csv("/landing/transactions_2026-06-04.csv")
# 1) Tokenise the account number BEFORE anything else
raw["account_token"] = raw["account_number"].apply(tokenise_account)
# 2) Drop the raw column — guarantees it cannot leak downstream
raw = raw.drop(columns=["account_number"])
# 3) Cleanse + cast + business rules (still on the ETL server)
raw["amount_usd"] = raw["amount_cents"] / 100.0
raw["txn_date"] = pd.to_datetime(raw["txn_timestamp"]).dt.date
# 4) Project the warehouse-facing columns only
out = raw[["txn_id", "account_token", "amount_usd",
"txn_date", "merchant_id", "country"]]
# 5) Bulk-load into Exadata
out.to_sql(
"fact_transactions",
con="oracle+cx_oracle://etl_user@EXADATA/PROD_DW",
if_exists="append",
index=False,
method="multi",
chunksize=10_000,
)
Step-by-step trace.
| Step | Action | Effect on PII boundary |
|---|---|---|
| 1 | Read transactions_2026-06-04.csv into ETL-server RAM |
raw account numbers exist only inside the ETL server process |
| 2 | HMAC-SHA256 with rotating key → account_token
|
irreversible token replaces the raw account number |
| 3 | drop(columns=["account_number"]) |
the raw column cannot be selected by any downstream step |
| 4 | Cast amount_cents → amount_usd, parse timestamps |
normal business-rule work, no PII implication |
| 5 | Bulk-load to Exadata | warehouse receives only tokens — auditable, attestable, audit-log clean |
Output:
| txn_id | account_token | amount_usd | txn_date | merchant_id | country |
|---|---|---|---|---|---|
| t_001 | AC_3f9c2a1e8b4d6f0a8c2e7b1d | 42.50 | 2026-06-04 | m_42 | GBR |
| t_002 | AC_a1b2c3d4e5f60718293a4b5c | 178.90 | 2026-06-04 | m_91 | FRA |
| t_003 | AC_8e7d6c5b4a3f2e1d0c9b8a7e | 9.99 | 2026-06-04 | m_42 | USA |
Why this works — concept by concept:
- Pre-warehouse PII gate — the raw account number lives only inside the ETL server's process memory. The warehouse never sees it on disk, in logs, in query plans, or in audit trails. This is the single biggest compliance posture difference between ETL and ELT.
- HMAC tokenisation — deterministic (same account → same token, so joins still work), irreversible (cannot recover the account number without the rotation key), rotatable (re-tokenise quarterly to limit blast radius of a key compromise).
-
Column drop after tokenisation —
drop(columns=["account_number"])is the structural guarantee that no downstream code path can select the raw column. The PII is not present in the rest of the pipeline. - Bulk load only — the warehouse's only role is storing and querying tokens; it never participates in the PII boundary. Auditors can attest the warehouse contains no raw PHI / PCI data.
- Cost — extra ETL-server compute for the tokenisation pass; warehouse storage is unchanged; the per-source dev cost is the price you pay for a regulator-clean pipeline.
ETL
Topic — etl
Classic ETL transforms (cleanse, derive, surrogate keys)
3. Modern ELT — load-first, transform inside the warehouse
etl elt difference flips the order: load raw rows first, then transform them with the warehouse's own compute (dbt-native)
The mental model in one line: in modern ELT, a lightweight loader (Fivetran, Airbyte, Stitch, custom Python) pushes raw source rows directly into a raw schema in the warehouse, and the transforms run inside the warehouse — typically as dbt models, sometimes as Spark SQL or native warehouse SQL — using the warehouse's own elastic compute. The warehouse holds both the raw and the modelled data; you pay only for the query-seconds the transforms actually consume.
The three stages of modern ELT.
- Extract. Loader pulls from sources (SaaS APIs, OLTP CDC streams, event buses). Schemas are auto-detected; drift is handled by the loader.
-
Load. Raw rows land in the warehouse in a per-source
raw_*schema. No transformation beyond the loader's normalisation (column type casts, JSON unnesting). -
Transform. dbt models read from
raw_*and build a layered cake:staging(rename + clean),intermediate(joins + business logic),marts(BI-facing star schemas and exposures).
The tooling that defines this era.
- Loaders (the "EL"). Fivetran (managed), Airbyte (open-source / cloud), Stitch (managed), Meltano (open-source orchestration), custom Python on Airflow.
- Warehouses (the "T" engine). Snowflake, BigQuery, Databricks SQL, Redshift, ClickHouse.
- Transform orchestration. dbt (the dominant choice), Dataform, SQLMesh.
-
The 3-layer dbt project.
staging→intermediate→marts. Each layer is a directory of.sqlmodel files, each compiled to a SQLCREATE TABLE AS SELECTagainst the warehouse.
Why ELT is the 2026 default.
- Warehouse compute is cheaper than ETL boxes. A Snowflake X-Small warehouse at $2/hour runs 90 seconds of dbt model rebuilds and then suspends. The equivalent Informatica licence is a 6-figure annual line item.
- New sources land in 30 minutes. Fivetran's connector for a new SaaS → click, OAuth, sync. No mapping designer, no deploy, no test cycle.
- SQL is the lingua franca. Analytics engineers write dbt models; product analysts read them; data scientists query the marts. One language, one warehouse, one set of access controls.
-
CI/CD is straightforward. dbt's
dbt testanddbt buildrun in GitHub Actions. Pull requests rebuild the model graph against a Snowflake clone in seconds. - Replay is cheap. Re-run any model — or the whole graph — by triggering a fresh dbt build. The raw zone is the source of truth; you can always recompute from it.
Where ELT has rough edges.
- Storage cost goes up. You hold raw + staging + intermediate + marts — typically 2–4× the storage of an ETL-only landing strategy.
-
PII lands raw. Unless you mask pre-load (EtLT — see §5), sensitive columns sit in
raw_*in the warehouse. You need row-level security or dynamic data masking to compensate. -
Bad data propagates fast. A source schema change can poison
martswithin minutes if dbt tests don't catch it. Strictdbt testdiscipline is non-negotiable.
Worked example — Fivetran → Snowflake → dbt 3-layer model
Detailed explanation. A managed loader streams Stripe charges into raw_stripe.charges every 15 minutes. A dbt project then builds a staging model (clean + cast), an intermediate model (joined with customer dimension), and a mart (daily revenue facts). All three transform stages run inside Snowflake.
Question. Walk through a Fivetran → Snowflake → dbt 3-layer pipeline that loads Stripe charges into raw_stripe.charges, builds a stg_stripe__charges view, an int_charges_enriched ephemeral model, and a fct_daily_revenue table — entirely inside Snowflake.
Input (the raw zone after a Fivetran sync).
raw_stripe.charges (Fivetran-managed schema, one row per charge):
| id | customer | amount | currency | status | created | _fivetran_synced | _fivetran_deleted |
|---|---|---|---|---|---|---|---|
| ch_001 | cus_42 | 4250 | usd | succeeded | 1717488000 | 2026-06-04 10:15:00 | NULL |
| ch_002 | cus_91 | 17890 | usd | succeeded | 1717488060 | 2026-06-04 10:15:00 | NULL |
| ch_003 | cus_42 | 999 | usd | failed | 1717488120 | 2026-06-04 10:15:00 | NULL |
Code (the three dbt models).
-- 1) models/staging/stg_stripe__charges.sql (VIEW — cheap to re-run)
{{ config(materialized='view') }}
SELECT
id AS charge_id,
customer AS customer_id,
amount / 100.0 AS amount_usd,
currency,
status,
TO_TIMESTAMP(created) AS charged_at,
_fivetran_synced AS loaded_at
FROM {{ source('raw_stripe', 'charges') }}
WHERE _fivetran_deleted IS NULL
-- 2) models/intermediate/int_charges_enriched.sql (EPHEMERAL — compiled inline)
{{ config(materialized='ephemeral') }}
SELECT
c.charge_id,
c.customer_id,
c.amount_usd,
c.status,
c.charged_at,
cust.country,
cust.plan_tier
FROM {{ ref('stg_stripe__charges') }} c
LEFT JOIN {{ ref('dim_customers') }} cust
ON cust.customer_id = c.customer_id
WHERE c.status = 'succeeded'
-- 3) models/marts/fct_daily_revenue.sql (TABLE — materialised)
{{ config(materialized='table') }}
SELECT
DATE(charged_at) AS charge_date,
country,
plan_tier,
COUNT(*) AS charge_count,
SUM(amount_usd) AS revenue_usd
FROM {{ ref('int_charges_enriched') }}
GROUP BY 1, 2, 3
Step-by-step explanation.
-
Fivetran does "EL" — no code. A 15-minute connector schedule pulls Stripe events through Stripe's API, normalises columns, and writes them to
raw_stripe.charges. Schema drift is handled transparently — new columns appear inraw_stripe.charges; missing rows are tagged with_fivetran_deleted. -
stg_stripe__charges— cast + clean (VIEW). A view is cheap because every query recomputes fromraw_stripe.charges; you only pay storage for the underlying raw table. The view casts cents to dollars, parses the Unix timestamp, drops soft-deleted rows. -
int_charges_enriched— join + filter (EPHEMERAL). An ephemeral model is not materialised; dbt inlines it into the downstream mart's SQL. This avoids an unnecessary intermediate table while keeping the model graph readable. -
fct_daily_revenue— aggregate (TABLE). Materialised as a table so BI tools (Looker, Mode, Hex) can query it without re-running the join + aggregate every time. dbt's incremental materialisation is also an option for very large fact tables. -
Snowflake handles the transforms. When
dbt buildruns, Snowflake spins up a Medium warehouse, executes the three model SQLs (in graph order), and suspends. You pay for ~60 seconds of compute — far less than the cost of an idle ETL server. -
Tests run alongside.
dbt testenforcesunique,not_null,relationshipson the model surfaces; a schema change in Stripe that breaks the contract surfaces as a red CI run, not a silent data corruption.
Output (the mart after dbt build).
| charge_date | country | plan_tier | charge_count | revenue_usd |
|---|---|---|---|---|
| 2026-06-04 | USA | pro | 124 | 12450.00 |
| 2026-06-04 | USA | enterprise | 18 | 28900.00 |
| 2026-06-04 | GBR | pro | 47 | 4700.00 |
| 2026-06-04 | FRA | starter | 91 | 1820.00 |
Rule of thumb. ELT shines when the warehouse is cloud-native, the source mix is SaaS-heavy, and the team is SQL-fluent. Once the workload settles into "Fivetran → Snowflake → dbt," adding a 9th source is a 15-minute change and a 10th source is a 15-minute change — the architecture scales linearly with effort.
ELT interview question — defending the dbt 3-layer model graph
A common probe: "Why three layers — staging, intermediate, marts? Why not just one big CREATE TABLE per mart that does all the work?"
Solution Using the dbt layered cake to keep tests and re-use clean
-- staging — one model per source table, 1:1 rename + cast + clean
-- intermediate — joins + business logic, NOT BI-facing
-- marts — star schemas + exposures for dashboards
-- models/staging/stg_stripe__charges.sql (VIEW)
-- models/staging/stg_stripe__customers.sql (VIEW)
-- models/intermediate/int_charges_enriched.sql (EPHEMERAL)
-- models/marts/fct_daily_revenue.sql (TABLE)
-- models/marts/fct_charge_failures.sql (TABLE — re-uses int_charges_enriched)
-- dbt tests live alongside the models:
-- tests/staging/stg_stripe__charges.yml — unique(charge_id), not_null(amount_usd)
-- tests/marts/fct_daily_revenue.yml — not_null(revenue_usd), >=0
Step-by-step trace.
| Layer | Purpose | Materialisation | Re-use |
|---|---|---|---|
| 1. staging | rename source columns, cast types, drop deletes | view | every downstream layer reads from staging |
| 2. intermediate | joins + business filters (e.g. status='succeeded') |
ephemeral or table | re-used by multiple marts |
| 3. marts | star schemas + exposures for BI | table or incremental | endpoints — BI tools query these |
Output:
| Metric | One-shot mart | 3-layer dbt cake |
|---|---|---|
| Model files | 1 huge file | 5 small files |
| Test surface | the mart only | each layer (catches bad data earlier) |
| Reuse | nothing reusable |
int_charges_enriched feeds fct_daily_revenue AND fct_charge_failures
|
| Cost of a Stripe schema change | rewrite the whole mart | edit stg_stripe__charges only |
| Time to add a new mart | hours (re-derive everything) | minutes (read existing staging / intermediate) |
Why this works — concept by concept:
-
Staging as a 1:1 contract — every downstream model reads from staging, never from
raw_*. A source schema change is absorbed in one place. The blast radius of any upstream change collapses to the staging file. - Intermediate as the join layer — once a join (e.g. charges enriched with customer country) is needed by two marts, it lives in intermediate. DRY by construction.
- Marts as the exposure surface — BI tools, exposures, and downstream products read only from marts. The "public API" of the warehouse is a small, stable set of star-schema tables.
-
Tests at every layer —
dbt testrunsunique,not_null,relationshipson staging and marts. Bad data surfaces in CI seconds after the schema change, not as a silent dashboard discrepancy days later. - Cost — extra storage = one view per staging model (negligible) + one ephemeral per intermediate (zero — inlined) + one table per mart (the same as a one-shot). Net storage premium ≈ 0; engineering velocity premium = quarterly.
Worked example — incremental dbt model for a 1B-row event fact
Detailed explanation. ELT scales because dbt's incremental materialisation only rebuilds the new rows since the last run. A fact table that holds 1B events doesn't get fully rewritten every 15 minutes — only the latest sliver lands on each schedule. The pattern uses a MERGE (or INSERT ... ON CONFLICT) keyed on a business id, which also gives you idempotency on retries.
Question. Write an incremental dbt model for a fct_user_events table that holds 1B rows, ingests ~5M new rows every 15 minutes, and must be safely retry-able without producing duplicates.
Input (the source staging model stg_events).
| event_id | user_id | event_name | event_ts | properties |
|---|---|---|---|---|
| e_a | u_42 | page_view | 2026-06-04 10:00 | {"path":"/"} |
| e_b | u_42 | click | 2026-06-04 10:01 | {"target":"cta"} |
| e_c | u_91 | page_view | 2026-06-04 10:02 | {"path":"/pricing"} |
| ... | ... | ... | ... | ... |
Code (the incremental dbt model).
-- models/marts/fct_user_events.sql
{{
config(
materialized = 'incremental',
unique_key = 'event_id',
incremental_strategy = 'merge',
on_schema_change = 'append_new_columns',
cluster_by = ['event_date', 'user_id']
)
}}
SELECT
event_id,
user_id,
event_name,
event_ts,
DATE(event_ts) AS event_date,
properties,
CURRENT_TIMESTAMP() AS loaded_at
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Only consider rows newer than the latest row already in the target.
WHERE event_ts > (SELECT COALESCE(MAX(event_ts), '1970-01-01') FROM {{ this }})
{% endif %}
Step-by-step explanation.
-
First run (full refresh).
is_incremental()is false; the WHERE block is skipped; dbt rebuildsfct_user_eventsfrom scratch with all 1B rows. This is a one-time cost. -
Subsequent runs (every 15 min).
is_incremental()is true; only rows withevent_ts > MAX(existing event_ts)are SELECTed fromstg_events. The result set is ~5M rows, not 1B. -
incremental_strategy = 'merge'— Snowflake'sMERGE INTO fct_user_events USING (SELECT ...) ON event_id = src.event_idupserts on the unique key. If the sameevent_idre-arrives (a duplicate from the source CDC stream or a retried connector sync), the existing row is updated, not double-inserted. -
Retry safety. If the dbt run fails halfway, the next run picks up where the last successful run left off (because
MAX(event_ts)reflects only committed rows). No duplicates, no gaps. -
cluster_byon Snowflake. Co-locates rows with similarevent_date+user_idin the same micro-partitions; downstream user-level analytics queries prune to a tiny slice of the table.
Output (the cost shape over time).
| Schedule slot | Source rows scanned | Target rows written | Compute time |
|---|---|---|---|
| First run | 1,000,000,000 | 1,000,000,000 | ~30 min on a Large warehouse |
| 10:15 incremental | 5,000,000 | ~5,000,000 (MERGE) | ~90 seconds on X-Small |
| 10:30 incremental | 5,000,000 | ~5,000,000 (MERGE) | ~90 seconds on X-Small |
| 10:45 incremental (retry of failed 10:30) | 5,000,000 | 0 duplicates (MERGE upserts) | ~90 seconds on X-Small |
Rule of thumb. Reach for incremental materialisation when the target table is >50M rows and the per-run delta is <10% of the total. Use merge as the strategy for any table where the same business id might arrive twice; use append only when the source is strictly append-only and CDC duplicates are impossible.
SQL
Topic — data transformation
Warehouse-native transforms (dbt-style SQL modelling)
4. When ETL still wins — the 6-dimension decision matrix
etl vs elt architecture is not a religion — six concrete dimensions decide the right shape per workload
Even in 2026, ETL is not "dead." It wins on at least two dimensions for at least two workload classes. The trick is to score the workload across six concrete dimensions and let the matrix decide — not the pattern's reputation.
The six dimensions every interview probes.
- PII / compliance gate. Must raw sensitive columns ever land in the warehouse? If no, ETL or EtLT only.
- Latency to insight. Hours-stale (overnight reports) or minutes-stale (operational dashboards)?
- Cost model. Predictable fixed (ETL licence + server) or variable per-query (warehouse compute)?
- Source velocity. How many new sources per quarter? One? Twenty?
- Team skill mix. Specialist ETL developers, or SQL-fluent analytics engineers?
- On-prem vs cloud. Hard requirement to stay on-prem? Or full cloud-native target?
Dimension 1 — PII / compliance gate.
- ETL wins. PHI / PCI / GDPR-regulated columns can be tokenised, masked, or dropped on the ETL server before the warehouse sees a single byte. Auditors love this.
-
ELT struggles. Raw PII lands in
raw_*; you compensate with warehouse-level row security, dynamic data masking, or post-load redaction — softer posture, more moving parts. - EtLT bridge. Light "Et" outside the warehouse to mask only the sensitive columns; everything else flows raw into the warehouse.
Dimension 2 — Latency to insight.
- ETL. Batch-shaped — hours. The ETL server is a finite queue you schedule around.
- ELT. Minutes — the warehouse spins up on demand. dbt rebuilds the affected models whenever the loader's next sync lands.
- EtLT. Inherits ELT's latency once past the pre-load step.
Dimension 3 — Cost model.
- ETL. Fixed and predictable — licence + server + ops. Easy to budget; expensive at low volumes.
- ELT. Pay-per-query-second. Cheap at low volumes; can explode with a runaway query if not governed.
- Gotcha. Snowflake auto-suspend + per-query timeouts + resource monitors are the equivalent of "budget control" in ELT-land.
Dimension 4 — Source velocity.
- ETL. Days per source — mapping, test, deploy. Falls behind for fast-moving SaaS landscapes.
- ELT. Minutes per source — Fivetran / Airbyte connectors. Wins outright on this axis.
Dimension 5 — Team skill mix.
- ETL. Specialist Informatica / SSIS / DataStage developers. Shrinking labour market; long ramp.
- ELT. Analytics engineers (SQL + dbt + Python). Larger, growing labour market; faster onboarding.
Dimension 6 — On-prem vs cloud.
- ETL. First-class on-prem story. Decades of mature deployments on Linux / Windows / mainframes.
- ELT. Cloud-first. On-prem variants exist (dbt-core can target on-prem Postgres / Trino) but ergonomics suffer.
Worked example — score a real workload across the matrix
Detailed explanation. A FinTech is choosing the architecture for a new transaction analytics pipeline. The data is PCI-regulated (account numbers + card BINs), the target is BigQuery (cloud), the team is 4 analytics engineers + 1 platform engineer, and they need 15-minute freshness for fraud-ops dashboards. Score it.
Question. A FinTech needs a new analytics pipeline. Constraints: PCI data (account numbers + card BINs), BigQuery target, 5-person team (mostly SQL-fluent), 15-minute freshness SLO, two new sources per quarter. Score the workload across the six dimensions and pick a pattern.
Input (the constraint table).
| Dimension | Value |
|---|---|
| PII / compliance | PCI — account numbers + card BINs |
| Latency to insight | 15 minutes |
| Cost model | flexible — BigQuery on-demand |
| Source velocity | ~2 new sources / quarter |
| Team skill | 4 analytics engineers + 1 platform engineer |
| On-prem vs cloud | cloud (BigQuery) |
Code (the scoring sketch — runs as a notebook before the architecture review).
# decision_matrix.py — score the workload across six dimensions
dimensions = {
"PII / compliance": {"etl": "WIN", "elt": "loss", "weight": 3, "note": "PCI — mask BEFORE warehouse"},
"Latency to insight": {"etl": "loss", "elt": "WIN", "weight": 2, "note": "15-min SLO favours warehouse autoscale"},
"Cost model": {"etl": "loss", "elt": "WIN", "weight": 1, "note": "on-demand BQ scales with usage"},
"Source velocity": {"etl": "loss", "elt": "WIN", "weight": 1, "note": "~2/quarter — both can keep up, ELT easier"},
"Team skill": {"etl": "loss", "elt": "WIN", "weight": 2, "note": "SQL-fluent team, no Informatica devs"},
"On-prem vs cloud": {"etl": "tie", "elt": "WIN", "weight": 1, "note": "BigQuery is cloud-only"},
}
etl_score = sum(d["weight"] for d in dimensions.values() if d["etl"] == "WIN")
elt_score = sum(d["weight"] for d in dimensions.values() if d["elt"] == "WIN")
print("ETL score:", etl_score, " ELT score:", elt_score)
# ETL score: 3 ELT score: 7
Step-by-step explanation.
- PII dimension flips the decision toward "mask first." PCI demands that raw account numbers and card BINs never reach BigQuery in cleartext. Pure ELT is off the table.
- Every other dimension favours ELT. Latency, cost, team skill, on-prem-vs-cloud — all wins for the warehouse-native shape.
- The verdict is EtLT, not pure ETL. Light "Et" outside the warehouse to mask the PCI columns (HMAC tokenisation), then full ELT for everything else. You don't need a full Informatica install to do one tokenisation step — a small Python service or a Fivetran column transform suffices.
- The matrix is honest about it. The PII column is the only "ETL wins" row; five out of six favour ELT. The hybrid (EtLT) captures both wins.
Output (the architecture decision).
| Dimension | Verdict | Pattern handles it via |
|---|---|---|
| PII / compliance | ETL | pre-load HMAC tokenisation of account numbers + BINs |
| Latency to insight | ELT | dbt models rebuild on 15-min schedule |
| Cost model | ELT | BigQuery on-demand pricing |
| Source velocity | ELT | Fivetran connectors for new sources |
| Team skill | ELT | SQL-fluent team, no Informatica devs |
| On-prem vs cloud | ELT | BigQuery is cloud-only |
| Final pattern | EtLT | light pre-load mask + warehouse-native transforms |
Rule of thumb. Score every workload across the six dimensions. If one dimension is a hard compliance gate, EtLT or ETL is required — but the other five dimensions still drive the bulk of the shape. Most production pipelines in 2026 score 4-or-more for ELT and either zero or one for ETL; EtLT exists precisely to capture both wins.
ETL vs ELT interview question — defending the matrix in a design round
A senior probe: "If ELT wins five of six dimensions and ETL wins one, why ship EtLT instead of pure ELT with post-load masking?" — testing whether the candidate understands the compliance posture difference.
Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely
# eltl_preload_mask.py — runs as a small ECS / Cloud Run service
# Sits BETWEEN the source CDC stream and the BigQuery loader.
import hashlib
import hmac
import json
ROTATION_KEY = open("/secrets/pci_hmac_key.bin", "rb").read()
PII_COLUMNS = {"account_number", "card_bin", "cardholder_name"}
def tokenise(value: str) -> str:
h = hmac.new(ROTATION_KEY, value.encode(), hashlib.sha256)
return "TKN_" + h.hexdigest()[:24]
def mask_record(row: dict) -> dict:
out = {}
for k, v in row.items():
if k in PII_COLUMNS and v is not None:
out[k + "_token"] = tokenise(str(v))
elif k in PII_COLUMNS:
out[k + "_token"] = None
else:
out[k] = v
return out
# Streamed from Kafka or Pub/Sub; written to BigQuery via the loader
for msg in cdc_stream:
masked = mask_record(json.loads(msg.value()))
bigquery_loader.write(masked) # only tokens land in BigQuery
Step-by-step trace.
| Step | Action | Where PII lives |
|---|---|---|
| 1 | CDC source emits a transactions row |
source system (in-scope for PCI audit) |
| 2 | Streaming masker reads the row | ephemeral memory in the masker service (PCI-scope, short-lived) |
| 3 | Masker tokenises account_number, card_bin, cardholder_name
|
raw values exist for microseconds, never persisted |
| 4 | Masker writes the masked row to BigQuery via the loader | warehouse sees only *_token columns |
| 5 | dbt models transform the masked rows | every downstream layer works on tokens |
| 6 | Auditor inspects BigQuery | no raw PCI columns present anywhere |
Output:
| Layer | Raw PCI present? | Why |
|---|---|---|
| Source CDC | yes | inherent — source system is the system of record |
| Masker service | microseconds only | values exist only in process memory during the tokenisation call |
BigQuery raw_*
|
no | only *_token columns land |
| dbt staging / marts | no | downstream layers never see raw |
| Looker / BI exposures | no | only tokens leave the warehouse |
Why this works — concept by concept:
- Pre-load mask as the compliance gate — PCI scope shrinks dramatically when raw card data does not persist anywhere in the warehouse stack. The masker service is in scope; everything downstream is out.
- HMAC tokenisation, not encryption — tokens are deterministic (joins still work) and irreversible without the rotating key (limits blast radius). Encryption is reversible by anyone with the key, which broadens audit scope.
- Light Et + heavy T — the pre-load step does only what compliance requires (mask N specific columns). All other transformation (cast, dedupe, join, aggregate) happens inside BigQuery via dbt. You don't pay for an Informatica install just to do tokenisation.
-
Key rotation discipline —
ROTATION_KEYrotates quarterly; old tokens are rotated by a one-time re-tokenisation pass. Compromise of any single key is bounded to one quarter of data. - Cost — masker service ≈ a few dollars / day at typical FinTech volumes; warehouse cost identical to pure ELT; PCI audit scope shrinks from "the whole warehouse" to "the masker service" — the real win.
Worked example — model the 3-year TCO of ETL vs ELT for a mid-market workload
Detailed explanation. Senior interviewers often probe the cost dimension quantitatively: "Sketch the 3-year TCO of an Informatica ETL stack versus a Fivetran + Snowflake + dbt ELT stack for a mid-market workload." The answer separates one-time costs, recurring licence costs, recurring compute costs, and per-source dev costs.
Question. A mid-market company runs 12 sources, 2 TB of monthly transform compute, ships ~4 new sources per year. Build a 3-year TCO model comparing Informatica + on-prem warehouse versus Fivetran + Snowflake + dbt.
Input (the workload assumptions).
| Driver | Value |
|---|---|
| Sources | 12 active, +4 per year |
| Transform volume | 2 TB / month |
| Latency SLO | 1 hour |
| Team | 3 data engineers |
| Existing licence | none — greenfield decision |
Code (the TCO model — a small Python sketch you would run before the architecture review).
# tco_etl_vs_elt.py — 3-year cost comparison
YEARS = 3
# --- ETL stack (Informatica + on-prem warehouse) ---
etl_informatica_licence = 250_000 # annual, mid-market
etl_server_hardware = 80_000 # one-time
etl_server_ops_per_year = 60_000 # SRE + storage
etl_warehouse_licence_per_year= 180_000 # on-prem warehouse
etl_per_source_dev_cost = 20_000 # specialist labour, per new source
etl_new_sources_total = 4 * YEARS
etl_total = (
etl_server_hardware
+ etl_informatica_licence * YEARS
+ etl_server_ops_per_year * YEARS
+ etl_warehouse_licence_per_year * YEARS
+ etl_per_source_dev_cost * etl_new_sources_total
)
# --- ELT stack (Fivetran + Snowflake + dbt) ---
fivetran_per_million_rows = 1.50 # blended pricing
monthly_rows_millions = 800 # 800M rows / month across 12 sources
fivetran_per_year = fivetran_per_million_rows * monthly_rows_millions * 12
snowflake_per_credit = 2.00 # X-Small at $2/hr ~= 1 credit/hr
credits_per_year = 8_000 # 2 TB compute / month at moderate WH size
snowflake_per_year = credits_per_year * snowflake_per_credit
dbt_cloud_team_seat = 100 # per seat/month
dbt_per_year = 3 * dbt_cloud_team_seat * 12
elt_per_source_dev_cost = 1_000 # mostly YAML + a staging model
elt_new_sources_total = 4 * YEARS
elt_total = (
(fivetran_per_year + snowflake_per_year + dbt_per_year) * YEARS
+ elt_per_source_dev_cost * elt_new_sources_total
)
print(f"ETL 3-year TCO: ${etl_total:>12,.0f}")
print(f"ELT 3-year TCO: ${elt_total:>12,.0f}")
print(f"Delta: ${etl_total - elt_total:>12,.0f}")
Step-by-step explanation.
- One-time hardware vs ongoing meter. ETL has an $80K hardware spend up front; ELT has zero — the warehouse is the meter. Over 3 years, the one-time cost is small but the structural difference matters for cash-flow planning.
- Licence dominates ETL. Informatica + on-prem warehouse licences run ~$430K/year in this scenario. Over 3 years, that is ~$1.3M of fixed spend that is paid whether the pipeline runs one row or one billion.
- Meter dominates ELT. Fivetran (per-row) + Snowflake (per-credit) + dbt Cloud (per seat) run ~$33K/year here. Over 3 years, that is ~$100K of variable spend that scales with usage.
- Per-source dev cost differs by an order of magnitude. ETL = ~$20K per new source (specialist labour, mapping designer, deploy cycle). ELT = ~$1K per new source (YAML + a staging model). At 12 new sources over 3 years, that's $240K vs $12K.
- Total 3-year delta. ETL ≈ $1.6M; ELT ≈ $110K. Mid-market workload, no compliance gate — ELT wins by an order of magnitude.
Output (the 3-year TCO comparison).
| Cost category | ETL stack | ELT stack |
|---|---|---|
| Hardware (one-time) | $80,000 | $0 |
| Informatica licence (3 yr) | $750,000 | $0 |
| ETL server ops (3 yr) | $180,000 | $0 |
| On-prem warehouse licence (3 yr) | $540,000 | $0 |
| Fivetran (3 yr) | $0 | $43,200 |
| Snowflake compute (3 yr) | $0 | $48,000 |
| dbt Cloud (3 yr) | $0 | $10,800 |
| New-source dev (12 sources × 3 yr) | $240,000 | $12,000 |
| 3-year TCO | $1,790,000 | $114,000 |
| Delta | — | ~16× cheaper |
Rule of thumb. Build the TCO model with both fixed and variable components, and amortise the per-source dev cost — it's the line item that quietly dominates over the contract length. If you find the ETL stack winning on TCO for a mid-market workload, double-check the assumptions: usually it's because someone pre-loaded the model with an existing licence the team is not paying for at the margin.
ETL
Topic — etl
ETL design under compliance + cost constraints
5. The hybrid pattern — EtLT (light pre-load + warehouse transforms)
etl elt pipeline in 2026 is almost always EtLT — light Et outside, heavy T inside
The pragmatic reality: most production pipelines in 2026 ship a hybrid — a thin "Et" stage outside the warehouse (mask PII, decrypt, type cast, dedupe) and the heavy "T" inside the warehouse (joins, aggregates, business logic, exposures). EtLT inherits ELT's cloud economics + team-skill match while preserving ETL's compliance gate. It is not a compromise; it is a clean split of work into the layer where each task belongs.
The four stages of EtLT.
- Extract. Same as ETL or ELT — pull rows from sources.
- Light Et (pre-load). Only what cannot wait until inside the warehouse: PII masking, decryption, hard type casting, deduplication. Runs as a small streaming service or as a loader-level transform.
-
Load. Push the masked / cast rows into the warehouse's
rawzone. - Heavy T (in-warehouse). dbt staging → intermediate → marts. All business logic, all joins, all aggregates.
What belongs in the pre-load step (and what doesn't).
- Belongs in pre-load. PII masking, decryption of source payloads, hard type casts that the warehouse cannot do (e.g. binary blob → JSON), dedupe of true source duplicates, schema-shape normalisation.
- Belongs in the warehouse. Joins, aggregates, business rules, SCD-2 history, dimensional modelling, exposure layers, BI-facing marts.
- The rule. If it requires a business decision (what is the right join key? which currency conversion to apply?), it belongs in the warehouse. If it is a compliance or technical constraint (PII mask, decrypt, cast), it belongs in pre-load.
Why EtLT has become the default.
- Best of both worlds. Compliance gate (ETL win) + cloud economics + team skill match + source velocity (ELT wins).
- Pre-load step is small. The "Et" stage handles a handful of columns across all sources — typically a single shared service, not a full ETL platform.
- Loader handles the boilerplate. Fivetran / Airbyte connectors do the extract + load; the pre-load step plugs in as a column-level transform on the loader, or as a sidecar service.
EtLT vs pure ELT vs pure ETL — when each fits.
- Pure ETL. On-prem warehouses, legacy stacks, strict pre-warehouse PII with no cloud target.
- Pure ELT. Cloud warehouse, no PII to mask pre-load, SQL-fluent team.
- EtLT. Cloud warehouse + at least one pre-load constraint (PII, decrypt, hard cast). This is most pipelines.
Worked example — hybrid Fivetran (column mask) → Snowflake raw → dbt staging → marts
Detailed explanation. A B2B SaaS ingests Salesforce Contact records into Snowflake. The email and phone columns are PII under GDPR; everything else (account, opportunity, deal_stage) is free to land raw. Fivetran's column-level transform masks the two PII columns; raw rows land in raw_salesforce.contact; dbt models build the rest.
Question. Walk through an EtLT pipeline that ingests Salesforce Contact records into Snowflake, masks email and phone at the Fivetran loader (pre-load), and builds the rest via dbt — showing exactly where each transform runs.
Input (the source Contact record from Salesforce).
{
"Id": "003abc123",
"FirstName": "Jane",
"LastName": "Doe",
"Email": "jane.doe@example.com",
"Phone": "+44-20-7946-0958",
"AccountId": "001xyz789",
"Title": "Director of Data",
"CreatedDate": "2026-06-04T10:11:00Z"
}
Code (the three layers of EtLT).
# 1) Light Et — Fivetran column-level transform (pre-load)
# fivetran-salesforce-contact.yaml
connector: salesforce
schema: raw_salesforce
table: contact
column_transforms:
- column: Email
type: hash
algo: sha256
salt: "${secrets.email_salt}"
output: email_hash
drop_original: true
- column: Phone
type: hash
algo: sha256
salt: "${secrets.phone_salt}"
output: phone_hash
drop_original: true
-- 2) Load — Fivetran writes the masked row into raw_salesforce.contact
-- (Snowflake table — written by the loader, NOT by the team)
CREATE OR REPLACE TABLE raw_salesforce.contact (
id STRING,
first_name STRING,
last_name STRING,
email_hash STRING, -- pre-loaded by Fivetran's column transform
phone_hash STRING, -- pre-loaded by Fivetran's column transform
account_id STRING,
title STRING,
created_date TIMESTAMP,
_fivetran_synced TIMESTAMP,
_fivetran_deleted BOOLEAN
);
-- 3) Heavy T — dbt staging model inside Snowflake
-- models/staging/stg_salesforce__contact.sql
{{ config(materialized='view') }}
SELECT
id AS contact_id,
first_name,
last_name,
email_hash, -- already masked
phone_hash, -- already masked
account_id,
title,
created_date,
_fivetran_synced AS loaded_at
FROM {{ source('raw_salesforce', 'contact') }}
WHERE _fivetran_deleted IS NULL
-- 4) Heavy T — dbt mart that joins contacts with accounts and opportunities
-- models/marts/fct_account_engagement.sql
{{ config(materialized='table') }}
SELECT
a.account_id,
a.account_name,
COUNT(DISTINCT c.contact_id) AS contact_count,
COUNT(DISTINCT o.opportunity_id) AS open_opportunities,
SUM(o.amount_usd) AS pipeline_usd
FROM {{ ref('stg_salesforce__account') }} a
LEFT JOIN {{ ref('stg_salesforce__contact') }} c USING (account_id)
LEFT JOIN {{ ref('stg_salesforce__opportunity') }} o USING (account_id)
WHERE o.is_open = TRUE
GROUP BY a.account_id, a.account_name
Step-by-step explanation.
-
Salesforce emits the Contact record. Includes raw
EmailandPhone— PII under GDPR. -
Fivetran's column-level transform masks PII pre-load.
EmailandPhoneare SHA-256-hashed with a per-column salt; the original columns are dropped before the row is written to Snowflake. -
Snowflake's
raw_salesforce.contactonly ever contains the hashes. Auditors can confirm the warehouse contains no raw email or phone for any contact. -
stg_salesforce__contactis a 1:1 view. Renameid → contact_id, cast types, filter soft-deletes. No PII work — it's already done. -
fct_account_engagementjoins three staging models. Pure SQL inside Snowflake; no Python; no separate ETL server; no per-mart deploy cycle. - The whole pipeline runs on two infrastructures. Fivetran (managed, handles E + light t + L). Snowflake (handles heavy T via dbt). That's it. No third box.
Output (a row of the final mart).
| account_id | account_name | contact_count | open_opportunities | pipeline_usd |
|---|---|---|---|---|
| 001xyz789 | Acme GmbH | 14 | 3 | 187500.00 |
| 001def456 | Beta SA | 7 | 1 | 42000.00 |
| 001ghi012 | Gamma Inc | 23 | 5 | 312000.00 |
Rule of thumb. Put PII / decrypt / hard-cast work in the loader's column transform (or a sidecar service). Put every other transform in dbt. The split is invariant under most workload changes — adding a new source means adding a new connector + a new staging model, never a full new pipeline.
EtLT interview question — defending the hybrid against pure ELT
A common probe: "If Snowflake has dynamic data masking, why bother with pre-load tokenisation in the loader? Just land the data raw and mask at read-time." — testing whether the candidate understands the audit-scope difference between masking at read and not landing raw at all.
Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely
# Hybrid EtLT — masks happen BEFORE the warehouse, not on read
# This is the structural guarantee that no raw PII exists at rest in Snowflake.
connector: salesforce
column_transforms:
- column: Email
type: hash # SHA-256 of (value || per-column salt)
output: email_hash
drop_original: true # critical — the raw column is GONE before LOAD
- column: Phone
type: hash
output: phone_hash
drop_original: true
- column: SSN
type: tokenise # HMAC token, deterministic, reversible only by key holder
output: ssn_token
drop_original: true
-- Compare against the pure-ELT alternative (NOT what we're shipping)
-- The pure-ELT shape lands raw PII and masks on read:
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
ELSE SHA2(val, 256)
END;
ALTER TABLE raw_salesforce.contact MODIFY COLUMN email
SET MASKING POLICY mask_email;
-- Raw email STILL EXISTS in the table; only the SELECT result is masked.
-- Backups, time-travel, replication, query history all retain the raw values.
Step-by-step trace.
| Concern | EtLT pre-load mask | Pure ELT + dynamic masking |
|---|---|---|
| Raw PII at rest in warehouse? | no — never written | yes — written then masked on read |
| Raw PII in backups / time travel? | no | yes — up to retention window |
| Raw PII in replication streams? | no | yes — replicated cleartext |
| Raw PII in query history? | no | yes — full query text may include raw values |
| Audit scope | masker service only | entire warehouse + backups + replicas + query history |
| Compliance evidence | "drop_original: true" + masker logs | masking policy + per-role audit trail |
| Cost of a key compromise | re-tokenise forward | full re-load + re-mask + backup purge |
Output:
| Aspect | EtLT (pre-load mask) | Pure ELT (mask on read) |
|---|---|---|
| Raw PII landed in warehouse | NO | YES |
| Audit scope | masker only | warehouse + backups + replicas |
| GDPR "right to be forgotten" cost | low (forward-only) | high (purge backups + replicas) |
| Read-time masking still possible | yes (additional layer) | this is the only layer |
Why this works — concept by concept:
- Masking on read is not the same as not landing raw — a masking policy hides the value from the SELECT result but the underlying bytes still exist on disk, in backups, in time-travel snapshots, in replication streams, and in query history. Auditors care about bytes at rest, not bytes returned.
-
drop_original: trueis the structural guarantee — once the loader drops the raw column before LOAD, the warehouse cannot retain it anywhere. No backup-leak risk, no time-travel-leak risk, no replication-leak risk. - Audit-scope contraction — under EtLT, the PCI / GDPR audit scope is the masker service (one container, one log stream, one secrets vault). Under pure-ELT-with-masking-policy, the audit scope is the entire warehouse plus every replica plus every backup retention window.
- GDPR cost-to-forget — under EtLT, a "right to be forgotten" request is a forward-only operation (delete the user's rows, the historical hashes are already irreversible). Under pure ELT, you must purge the user's raw PII from every backup, every time-travel snapshot, every replica — quarters of cleanup.
- Cost — masker service ≈ a few dollars / day at typical SaaS volumes; the audit-scope contraction is worth several engineer-quarters per compliance cycle. EtLT pays for itself before the first audit.
Worked example — orchestrate the EtLT pipeline with Airflow + dbt
Detailed explanation. A common follow-up: "Show me how you'd orchestrate the whole EtLT shape end-to-end — masker, loader, dbt — so it survives partial failure and re-runs cleanly." The canonical 2026 answer is Airflow (or Dagster / Prefect) as the orchestrator, with dbt as the in-warehouse transform engine. Each step is idempotent so a retry never produces duplicates.
Question. Sketch the Airflow DAG that runs the EtLT pipeline: (1) wait for new Salesforce events, (2) trigger the masker service on the new partition, (3) trigger the Fivetran sync, (4) run the dbt project, (5) publish freshness metrics. Show how partial failure is handled.
Code (the Airflow DAG).
# dags/etlt_salesforce.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.fivetran.operators.fivetran import FivetranOperator
from airflow.providers.fivetran.sensors.fivetran import FivetranSensor
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
default_args = {
"owner": "data-platform",
"retries": 3,
"retry_delay": timedelta(minutes=5),
"depends_on_past": False,
}
with DAG(
"etlt_salesforce",
start_date=datetime(2026, 6, 1),
schedule="*/15 * * * *", # every 15 minutes
default_args=default_args,
catchup=False,
max_active_runs=1, # one run at a time, no overlap
) as dag:
def run_masker(**ctx):
"""Idempotent — keyed on (source, partition_ts)."""
from masker_client import run
run(source="salesforce",
partition_ts=ctx["data_interval_start"])
mask = PythonOperator(
task_id="mask_salesforce_pii",
python_callable=run_masker,
)
sync = FivetranOperator(
task_id="trigger_fivetran_salesforce",
connector_id="salesforce_prod",
)
wait_sync = FivetranSensor(
task_id="wait_fivetran_completion",
connector_id="salesforce_prod",
poke_interval=30,
timeout=600,
)
dbt_build = DbtCloudRunJobOperator(
task_id="dbt_build_salesforce_models",
job_id=42,
check_interval=20,
wait_for_termination=True,
)
def publish_freshness(**ctx):
from monitoring import emit_metric
emit_metric("etlt.salesforce.freshness_minutes", 15)
publish = PythonOperator(
task_id="publish_freshness",
python_callable=publish_freshness,
)
mask >> sync >> wait_sync >> dbt_build >> publish
Step-by-step explanation.
-
Schedule. Cron
*/15 * * * *triggers the DAG every 15 minutes.max_active_runs=1prevents two runs from racing — if the previous run is still going, the new one queues. -
Step 1 — Mask (the "Et").
run_maskercalls the masker service for the source + partition window. The masker is keyed on(source, partition_ts)so a retry of the same partition produces the same tokens — idempotent. -
Step 2 — Trigger Fivetran sync.
FivetranOperatorPOSTs to Fivetran's REST API to kick the connector. Fivetran handles the actual extract + load. -
Step 3 — Wait for Fivetran completion.
FivetranSensorpolls every 30 seconds until the connector reports success. Times out at 10 minutes; Airflow retries the whole task up to 3 times before paging. -
Step 4 — Run dbt.
DbtCloudRunJobOperatortriggers the dbt Cloud job (or in self-hosted, aDbtRunOperator). dbt's ownMERGE-based incremental models are idempotent by construction. -
Step 5 — Publish freshness. A small metric emits
etlt.salesforce.freshness_minutes. A dashboard alert fires if freshness drifts above SLO. -
Partial-failure behaviour. If
maskfails, sync and dbt do not run — the partition is reprocessed on next schedule. Ifdbt_buildfails, the warehouse is left in a known-good state (last successful commit); the next run picks up. No duplicates, no data loss.
Output (a typical DAG run timeline).
| Step | Duration | Idempotent on retry? |
|---|---|---|
| mask_salesforce_pii | ~30s | yes (deterministic tokens) |
| trigger_fivetran_salesforce | ~5s | yes (Fivetran connector dedupes) |
| wait_fivetran_completion | ~2-5 min | yes (sensor) |
| dbt_build_salesforce_models | ~90s | yes (MERGE on unique_key) |
| publish_freshness | ~2s | yes (metric emit) |
| End-to-end | ~5-7 min | fully idempotent DAG |
Rule of thumb. Orchestrate EtLT with one DAG per source family (one for Salesforce, one for Stripe, etc.) rather than a single mega-DAG — failures localise, retries are cheaper, and the dependency graph stays readable. Use max_active_runs=1 to avoid two runs racing on the same partition.
ETL
Topic — etl
EtLT hybrid pipeline design (pre-load mask + warehouse transforms)
Cheat sheet — pick ETL vs ELT vs EtLT
- Need to keep raw PII out of the warehouse entirely? Pick ETL or EtLT — pure ELT cannot satisfy a strict pre-warehouse compliance gate, even with read-time masking.
- Have a cloud warehouse + SQL-fluent team + no strict pre-load PII rule? Pick ELT — Fivetran/Airbyte → Snowflake/BigQuery/Databricks → dbt is the default 2026 shape.
- Have a cloud warehouse + at least one pre-load constraint (PII / decrypt / hard cast)? Pick EtLT — light "Et" outside, heavy "T" inside. This is most production pipelines.
- Have a legacy on-prem Informatica / SSIS / DataStage install with no migration mandate? Pick ETL — re-platforming for the sake of "modern" rarely pays off; stay on the legacy until a separate forcing function appears.
- Onboarding twenty new sources per quarter? Pick ELT or EtLT — connector-based loaders are the only way to keep up; ETL per-source dev cycles cannot scale.
- Need 15-minute freshness for operational dashboards? Pick ELT or EtLT — warehouse autoscale rebuilds dbt models on demand; ETL servers are batch-shaped.
- Hard cap on warehouse compute spend? Consider ETL — fixed licence cost is more predictable than per-query metering; or use ELT with strict resource monitors and per-query timeouts.
-
Need exactly-once into the warehouse? Use ELT or EtLT with
MERGE-based incremental dbt models keyed on a business id — the warehouse owns the idempotency boundary cleanly.
Frequently asked questions
ETL vs ELT — which is faster?
For modern cloud warehouses, ELT is almost always faster end-to-end because the warehouse compute (Snowflake, BigQuery, Databricks) is more elastic than a dedicated ETL server — a 100M-row transform spins up a large warehouse, runs in 90 seconds, and suspends. ETL's batch-shaped server pipeline is bounded by the slowest queued job and the fixed compute size. ETL can be faster in narrow cases — small row-level cleansing jobs on a beefy on-prem ETL box, or cases where the warehouse is intentionally throttled by a strict budget cap.
Is ETL dead?
No — ETL is the right shape for two persistent niches: strict pre-warehouse PII / compliance gates (banking, healthcare, EU GDPR-heavy) where raw sensitive columns cannot land in the warehouse, and legacy on-prem stacks (Informatica, SSIS, DataStage) where the warehouse is on-prem and the institutional investment in ETL tooling is mature. What is dead is "pure ETL by default" — for new cloud-native pipelines, ELT or EtLT wins on four-to-five of the six decision dimensions.
Does dbt count as ELT?
Yes — dbt is the canonical transform engine for ELT, doing the "T" inside the warehouse after a separate loader (Fivetran, Airbyte, Stitch, custom Python) handles the "EL." dbt has no extract or load capability of its own; it compiles SQL SELECTs into CREATE TABLE AS or MERGE statements that the warehouse runs against tables that already exist. A pipeline of Fivetran → Snowflake → dbt is the textbook ELT shape; Fivetran (with column-level masks) → Snowflake → dbt is the textbook EtLT shape.
ETL vs ELT for compliance / GDPR?
For strict-PII workloads under GDPR, HIPAA, or PCI, EtLT (or pure ETL) wins on audit-scope grounds because the raw PII never lands in the warehouse — only tokens or hashes do. Pure ELT with read-time masking is not equivalent: the underlying bytes still exist in warehouse storage, backups, time-travel snapshots, replication streams, and query history, all of which fall inside the compliance audit boundary. The pre-load mask in EtLT contracts the audit scope from the entire warehouse stack down to the masker service alone.
Cost comparison — ETL vs ELT?
ETL is predictable fixed cost (server + licence + ops) — easy to budget but expensive at low volumes and hard to scale up quickly. ELT is variable per-query cost — cheap at low volumes, can explode with a runaway query if not governed, but scales with usage. For most cloud-native workloads under 10TB of monthly transforms, ELT lands well below ETL's fixed cost; at very high volumes (100+ TB transforms), the per-query meter starts approaching the licence cost — at which point resource monitors, per-query timeouts, and incremental dbt models become the cost-control tools.
Can I do ETL on Snowflake?
Yes — Snowflake supports the transform-before-final-load pattern via Snowpark, Java/Python UDFs, and ephemeral compute. You can run row-level transforms inside Snowflake's compute before writing the result to the "real" target table, which is effectively in-warehouse ETL. Practically, this is rare — if you're already inside the warehouse, dbt's view/ephemeral/table materialisations cover the same intent more idiomatically. Pure ETL with Snowflake usually means a dedicated transform service (Snowpark / external Python) runs before the canonical raw_* schema receives the data — i.e. EtLT.
Practice on PipeCode
- Drill the ETL practice library → for hands-on extract-transform-load problems with input tables and expected outputs.
- Stretch with data transformation drills → for warehouse-native SQL modelling reps.
- Rehearse SQL aggregation problems → and SQL join problems → for the dbt staging + intermediate layer muscles.
- Layer in data modelling problems → and slowly-changing-data problems → for the marts layer.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Build the design-round muscles with the ETL system design course →.
- Pair the modelling intuition with the data modelling course →.
Pipecode.ai is Leetcode for Data Engineering — every ETL vs ELT concept above ships with hands-on practice rooms where you write real transforms, model real warehouses, and defend real architectures. Start with the ETL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)