Gowtham Potureddi

Posted on Jun 5

ETL vs ELT: Architecture, Trade-offs & When Each Wins

#python #sql #dataengineering #interview

etl vs elt is the single most-asked architectural question on modern data engineering interviews and the most-debated pattern choice on greenfield pipelines, because the same five letters reordered describe two fundamentally different cost profiles, two different team skill mixes, and two different compliance postures. The etl elt difference is not just where the "T" runs — it is which compute you pay for, how fast new sources land, and whether sensitive columns ever touch the warehouse in raw form.

This guide walks through the etl vs elt architecture end-to-end: the classic transform-first ETL shape that dominated pre-2015 data warehousing, the modern transform-after-load ELT shape that took over once Snowflake, BigQuery, and Databricks made warehouse compute cheaper than dedicated ETL servers, the six-dimension decision matrix that picks the right shape for a given workload, and the etl elt pipeline hybrid (EtLT) that most production teams actually ship in 2026. Each section pairs a teaching block with a worked example — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works — so you leave with both the extract transform load vs extract load transform vocabulary and the design instincts to defend either shape in an interview round.

When you want hands-on reps immediately after reading, drill the ETL practice library →, browse data pipeline + transformation problems →, and rehearse on the ETL system design course →.

On this page

The 3-letter difference and why it matters
Classic ETL — transform-first, before the warehouse load
Modern ELT — load-first, transform inside the warehouse
When ETL still wins — the 6-dimension decision matrix
The hybrid pattern — EtLT (light pre-load + warehouse transforms)
Cheat sheet — pick ETL vs ELT vs EtLT
Frequently asked questions
Practice on PipeCode

1. The 3-letter difference and why it matters

`etl vs elt` collapses to a single decision: where does the `T` run — outside the warehouse or inside it?

The one-sentence invariant: ETL transforms data BEFORE it lands in the warehouse — on a dedicated ETL server with its own compute and licensing; ELT loads raw data INTO the warehouse first and then transforms it there using the warehouse's own compute (typically via dbt or native SQL). Once you internalise that mental model — "T before L" vs "T after L" — every downstream etl elt difference discussion (cost, latency, compliance, team skill) becomes a derivative question rather than a fresh debate.

ETL vs ELT in five bullets.

ETL — transform-then-load. Source rows hit a dedicated ETL server (Informatica, Talend, SSIS, DataStage) where they are cleaned, joined, deduplicated, and modelled. Only modelled rows land in the warehouse. The warehouse stays small and pristine.
ELT — load-then-transform. Source rows are streamed through a lightweight loader (Fivetran, Airbyte, custom Python) directly into a raw schema in the warehouse. Transforms run inside the warehouse via dbt models, Spark SQL, or native warehouse SQL. The warehouse holds both raw and modelled data.
The cost flip. Pre-2015, warehouse compute was expensive ($/TB/hour on Teradata, Netezza, on-prem Oracle) — so you minimised warehouse work and paid for a dedicated ETL server. Post-2020, Snowflake / BigQuery / Databricks made warehouse compute cheaper than dedicated ETL boxes — so you maximise warehouse work and pay only for the queries you actually run.
The latency flip. ETL was batch-shaped because the ETL server was a finite resource you scheduled around. ELT is closer to micro-batch because the warehouse autoscales — a 200M-row dbt model rebuild runs whenever you trigger it, not when the ETL queue clears.
The compliance flip. ETL can mask / encrypt / drop PII before the warehouse ever sees it — a hard win for regulated industries. ELT lands PII raw and relies on warehouse-level row security or post-load masking — a softer compliance posture that demands stricter access controls.

The 2026 reality (post-Snowflake, post-dbt, post-Fivetran).

ELT is the default. Almost every cloud-native data team starts with Fivetran → Snowflake → dbt or Airbyte → BigQuery → dbt because it goes from "no warehouse" to "first dashboard" in days, not months.
ETL is alive in two niches. Strict-PII regulated industries (banking, healthcare, EU GDPR-heavy) and budget-capped warehouses (Redshift Serverless with per-query spending limits) still ship ETL as the primary pattern — the warehouse is too sensitive or too expensive to be the transform engine.
EtLT is the production reality. Most multi-source pipelines ship a hybrid: light "Et" (PII mask, type cast, decrypt) outside the warehouse, then heavy "T" inside the warehouse via dbt. You get compliance + cloud economics in one shape.

What interviewers listen for.

Do you frame the choice as a trade-off, not a religion? — senior signal. ELT-only or ETL-only answers are a red flag.
Do you reach for acks of compliance the moment PII shows up in the question? — required answer (mask before warehouse for regulated data).
Do you mention dbt or warehouse-native transforms when explaining ELT? — current-default signal.
Do you sketch the 6-dimension decision matrix (PII, latency, cost, source velocity, team skill, on-prem vs cloud) rather than picking a side? — staff-level signal.

Worked example — frame the ETL vs ELT decision for a new pipeline

Detailed explanation. A realistic interview probe sounds like: "We have eight SaaS sources (Salesforce, Stripe, HubSpot, Zendesk, Mixpanel, Segment, Jira, GitHub) feeding a BI dashboard. The warehouse is fresh — pick ETL or ELT, defend it." The honest answer is a decision tree, not a one-word pick.

Question. Eight SaaS sources, Snowflake as the target warehouse, a 3-person analytics-engineering team, no PII in scope. Choose ETL or ELT and justify the choice on cost, time-to-first-dashboard, and team skill mix.

Input (the workload).

Factor	Value
Sources	8 SaaS (Salesforce, Stripe, HubSpot, Zendesk, Mixpanel, Segment, Jira, GitHub)
Target	Snowflake (cloud warehouse, autoscaling)
PII	None in scope
Latency SLO	1 hour freshness
Team	3 analytics engineers (SQL fluent, light Python)
Budget	Pay-per-query, no fixed cap

Code (the ELT shape you would ship).

# fivetran-connectors.yaml  — managed loader handles every source
connectors:
  - source: salesforce
    schedule: 15min
    schema: raw_salesforce
  - source: stripe
    schedule: 15min
    schema: raw_stripe
  - source: hubspot
    schedule: 60min
    schema: raw_hubspot
  - source: zendesk
    schedule: 60min
    schema: raw_zendesk
  - source: mixpanel
    schedule: 60min
    schema: raw_mixpanel
  - source: segment
    schedule: 15min
    schema: raw_segment
  - source: jira
    schedule: 60min
    schema: raw_jira
  - source: github
    schedule: 60min
    schema: raw_github

-- dbt/models/staging/stg_stripe_charges.sql  — the "T" inside Snowflake
{{ config(materialized='view') }}

SELECT
  id                       AS charge_id,
  customer                 AS customer_id,
  amount       / 100.0     AS amount_usd,
  currency,
  status,
  TO_TIMESTAMP(created)    AS charged_at,
  _fivetran_synced         AS loaded_at
FROM {{ source('raw_stripe', 'charges') }}
WHERE _fivetran_deleted IS NULL

Step-by-step explanation.

Fivetran handles "EL". Each connector pulls source rows on its schedule (15 or 60 min), normalises them lightly, and writes them to a per-source raw_* schema in Snowflake. No code on the analytics team's side beyond the YAML.
Snowflake does "T". dbt models in the staging, intermediate, and marts layers read from raw_* and produce business-ready tables. Compute scales with the model — a 100M-row mart rebuild fires up a Medium warehouse for 90 seconds, then suspends.
Cost shape. You pay Fivetran per row (predictable) and Snowflake per query-second (variable). Adding a 9th source is a 15-minute YAML change; adding a 9th ETL pipeline in Informatica would be a multi-week dev cycle.
Time-to-first-dashboard. Day 1 — connectors are flowing into raw_*. Day 2 — first staging models. Day 3 — first mart. Day 4 — first dashboard.
Team fit. A 3-person SQL-fluent team can own the full pipeline. There is no separate "ETL developer" role, no Informatica licensing call, no specialist hire.

Output.

Decision	Pick	Why
Pattern	ELT	cloud warehouse + SaaS sources + small SQL team
Loader	Fivetran	8 SaaS connectors, schema drift handled
Transform engine	Snowflake + dbt	warehouse pays only for queries that run
Time to first dashboard	~4 days	vs ~4 weeks for an ETL-server build
Headcount	3 analytics engineers	no specialist ETL dev needed

Rule of thumb. When the workload is "cloud warehouse + SaaS sources + SQL-fluent team + no strict pre-warehouse PII gate," ELT wins on every dimension that matters. The question gets harder only when PII, on-prem sources, or hard budget caps enter the room.

ETL vs ELT interview question — picking the right shape from raw constraints

A senior interviewer often shapes this round as: "Given the constraints I'm about to read out, would you ship ETL or ELT — and why?" It blends architectural literacy with trade-off reasoning — the two muscles every etl elt pipeline interview opens with.

Solution Using a constraint-driven decision tree

                     ┌────────────────────────────────────────┐
                     │  Pick ETL vs ELT for a new pipeline    │
                     └────────────────────────────────────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            ▼                          ▼                          ▼
  Strict pre-warehouse        Cloud warehouse +            On-prem warehouse
  PII / GDPR rule?            SQL-fluent team?             or legacy ETL stack?
            │                          │                          │
        yes │                       yes│                       yes│
            ▼                          ▼                          ▼
       ┌────────┐                 ┌────────┐                 ┌────────┐
       │ ETL or │                 │  ELT   │                 │  ETL   │
       │ EtLT   │                 │ default│                 │ legacy │
       │ (mask  │                 │ shape  │                 │ install│
       │ first) │                 │        │                 │        │
       └────────┘                 └────────┘                 └────────┘

Step-by-step trace.

Workload sketch	PII gate?	Cloud target?	Team skill	Verdict
Healthcare claims → on-prem DW	yes (HIPAA PHI)	no	specialist ETL devs	ETL — mask PHI on the ETL server before it reaches the DW
8 SaaS → Snowflake	no	yes	SQL-fluent analytics engineers	ELT — Fivetran + dbt is the default
Banking transactions → BigQuery	yes (account numbers)	yes	mixed	EtLT — mask PII pre-load, dbt models post-load
Legacy SSIS on-prem, no cloud budget	varies	no	SSIS specialists	ETL — stay on the legacy install, don't force a migration
500-source ad-tech firehose → Databricks	partial	yes	analytics engineers	ELT — manual ETL would never keep up with source velocity

Output:

Workload	Pattern	Primary driver
Healthcare claims	ETL	compliance — PHI never reaches DW raw
8 SaaS → Snowflake	ELT	cloud economics + SQL team
Banking → BigQuery	EtLT	compliance + cloud both required
Legacy SSIS	ETL	sunk cost + on-prem
500-source firehose	ELT	source velocity dominates

Why this works — concept by concept:

Compliance gate as the first branch — if the law says raw PII cannot land in the warehouse, the question is already answered: you need some transformation outside the warehouse. ETL or EtLT, never pure ELT.
Cloud + SQL team as the ELT signal — every dimension (cost, time-to-ship, hire-ability, source velocity) favours ELT when the warehouse is cloud-native and the team speaks SQL. Picking ETL here is fighting both economics and labour markets.
Source velocity as the firehose signal — when you onboard a new source every week, the per-source ETL dev cycle (days) cannot keep up. Connectors (30 min each) are the only way through.
Legacy install as the inertia signal — re-platforming a healthy on-prem ETL install for the sake of "modern" is a multi-year project with negative ROI. Stay on the legacy until a separate forcing function appears.
Cost — picking the right pattern up-front saves quarters of rework; picking the wrong one creates a "platform migration" line item that dominates the next two roadmaps.

A short history — why ETL was the only option pre-2015

Detailed explanation. Understanding why the industry inverted the letters helps you defend either shape on the day. The story is not "engineers got smarter" — it is "warehouse compute got cheaper."

Question. Walk through the four eras of data warehousing and explain when (and why) ETL gave way to ELT as the default.

Input (the four eras at a glance).

Era	Years	Warehouse class	Compute economics	Default pattern
Era 1 — Appliance	2000-2010	Teradata, Netezza, Exadata	$/TB/hour expensive, fixed capacity	ETL
Era 2 — Hadoop	2010-2015	HDFS + Hive	cheap storage, slow query	ETL + ad-hoc
Era 3 — Cloud DW	2015-2020	Redshift, BigQuery, Snowflake	per-second metered compute	ELT (new default)
Era 4 — Lakehouse	2020-2026	Snowflake, BigQuery, Databricks, Delta / Iceberg	autoscale + cheap storage	ELT + EtLT

Step-by-step explanation.

Era 1 (Appliance). Warehouse compute was scarce and expensive — every CPU-cycle on Teradata mattered. The cheapest way to keep the warehouse fast was to do all the heavy work elsewhere. Dedicated ETL servers (Informatica) emerged as the natural transform layer. The cost of a separate ETL stack was worth it.
Era 2 (Hadoop). Storage got cheap (HDFS); compute got distributed (MapReduce) but slow per-query. Teams still ran ETL into the warehouse for fast queries; Hadoop was a data lake, not a warehouse replacement. ETL stayed dominant.
Era 3 (Cloud DW). Snowflake landed in 2014; BigQuery became serious in 2015; per-second compute metering changed everything. A 60-second warehouse query cost cents, not dollars. Now the calculus flipped — paying for warehouse compute was cheaper than maintaining an Informatica install. dbt (2016) gave the warehouse a clean transform-orchestration layer. ELT became the default for cloud-native teams.
Era 4 (Lakehouse). Snowflake / BigQuery / Databricks all add cheap storage to elastic compute; open table formats (Delta, Iceberg, Hudi) decouple storage from any single warehouse vendor. ELT remains the default; EtLT emerges to handle the compliance + cost-cap niches that pure ELT cannot.
Why the cost flip is permanent. Cloud warehouses get cheaper per query every year (autoscale + serverless). Dedicated ETL servers do not get cheaper — they get more expensive as the SaaS source landscape grows and connector lag widens.

Output.

Era	Pattern	Why
2000-2010	ETL	warehouse compute is the bottleneck — move work out
2010-2015	ETL + Hadoop	data lake offloads storage; warehouse still expensive
2015-2020	ELT	per-second warehouse compute cheaper than ETL servers
2020-2026	ELT + EtLT	warehouse autoscale + dbt + compliance gates

Rule of thumb. The pattern in vogue follows the cheapest compute. When warehouse compute is expensive, transform outside (ETL). When warehouse compute is cheap and elastic, transform inside (ELT). EtLT is what you ship when compliance is the cost-of-doing-business factor pure ELT cannot satisfy.

ETL
Topic — etl
ETL pipeline design problems (extract, transform, load)

Practice →

2. Classic ETL — transform-first, before the warehouse load

`etl vs elt architecture` starts with the classic shape: dedicated compute outside the warehouse, only modelled rows inside

The mental model in one line: in classic ETL, source rows fan through a dedicated ETL server (Informatica, Talend, SSIS, DataStage) that runs the full cleaning + modelling pipeline, then writes only the modelled output to the warehouse — so the warehouse stays "clean" and small, but you pay for a separate compute footprint and a separate licensing stack. This was the only viable shape pre-2015 because warehouse compute was scarce and expensive enough that you minimised it at all costs.

The three stages of classic ETL.

Extract. Pull rows from source systems (OLTP databases, flat files, mainframe extracts, SaaS APIs). Often nightly batch; some shops ship hourly micro-batch.
Transform. The big middle. Cleaning, type casting, joining lookup tables, applying business rules, generating surrogate keys, building Slowly Changing Dimension (SCD) histories — all on the ETL server's CPU and RAM.
Load. Write the modelled rows (cleaned facts + dimensions) to the warehouse via bulk insert / MERGE / load utility. The warehouse never sees raw source data.

The tooling that defined this era.

Informatica PowerCenter — the dominant enterprise ETL platform from the 2000s. Visual mapping designer, deep connector library, expensive licensing.
Talend Data Integration — open-source-rooted competitor; broad ecosystem; Java codegen under the hood.
Microsoft SSIS (SQL Server Integration Services) — bundled with SQL Server; common in mid-market Windows shops.
IBM DataStage — the legacy choice in banking, insurance, telco. Still humming along inside many global Fortune 500 stacks.

Why ETL was right for its era.

Warehouse compute was scarce. Teradata, Netezza, on-prem Oracle Exadata — every CPU-second cost real money. Moving transforms off the warehouse onto a cheaper ETL box was a net cost win.
PII compliance gate. Banks, hospitals, insurers had auditable workflows that required sensitive columns to be masked before they entered the warehouse. The ETL server was a natural gate.
Batch-shaped business. Reports refreshed daily; nightly batch windows were normal; the warehouse went quiet at 4 AM and woke up at 7 AM with fresh modelled data. ELT's "transform on demand" wasn't required.

Where ETL hurts in 2026.

Two compute stacks to operate. The ETL server cluster and the warehouse cluster. Two scaling stories, two incident channels, two on-call rotations.
Per-source dev cost. Adding a new source is a development project — a new mapping, new tests, new deploy. Days to weeks per source.
Connector lag. Commercial ETL vendors take quarters to support new SaaS APIs (Stripe, Mixpanel, Segment) compared to Fivetran or Airbyte shipping the same connector in a sprint.
Hire-ability. "Informatica developer" is a shrinking labour market; "analytics engineer who writes dbt" is a growing one.

Worked example — Informatica → SQL Server warehouse with row-level transforms

Detailed explanation. A classic ETL job ingests a customers CSV from an OLTP extract, cleans the email column, derives a country code from the address, joins a country lookup table to get the ISO-3 code, generates a surrogate key, and writes the modelled dim_customer row to the SQL Server warehouse. Each row is processed on the Informatica server before any of it touches the warehouse.

Question. Walk through an Informatica-shaped ETL pipeline that ingests a daily customers.csv extract (1M rows), cleans the email, derives the country code from the address, joins to a country lookup, generates a surrogate key, and lands dim_customer rows in the SQL Server warehouse — with row-level transforms running entirely on the ETL server.

Input (the source CSV).

customer_id	email	address	created_at
1001	`JANE@example.com`	"221B Baker St, London, UK"	2026-06-04 10:11
1002	`bob@ACME.org`	"10 Rue Bonaparte, Paris, France"	2026-06-04 10:12
1003	`Alice@beta.io`	"1 Apple Park Way, Cupertino, USA"	2026-06-04 10:13

Code (an Informatica-style mapping, expressed in pseudocode + the equivalent SQL on the ETL server).

# informatica_mapping.py  — runs on the ETL server, not the warehouse
import pandas as pd

raw       = pd.read_csv("/landing/customers_2026-06-04.csv")
lookup    = pd.read_csv("/lookups/country_iso3.csv")    # country -> iso3

# 1) Cleanse email — lowercase + strip whitespace
raw["email"] = raw["email"].str.strip().str.lower()

# 2) Derive country from the last comma-separated token in the address
raw["country"] = (
    raw["address"].str.rsplit(",", n=1).str[-1].str.strip()
)

# 3) Join to country ISO-3 lookup
modelled = raw.merge(lookup, how="left", on="country")

# 4) Generate the surrogate key
modelled = modelled.reset_index(drop=True)
modelled["customer_sk"] = modelled.index + 1_000_001

# 5) Project the warehouse-facing columns only
out = modelled[["customer_sk", "customer_id", "email",
                "country", "iso3", "created_at"]]

# 6) Bulk-load to SQL Server (only the modelled rows leave the ETL box)
out.to_sql(
    "dim_customer",
    con="mssql+pyodbc://etl_user@DW01/PROD_DW",
    if_exists="append",
    index=False,
    method="multi",
    chunksize=10_000,
)

Step-by-step explanation.

Read the CSV on the ETL server. The 1M rows live in RAM (or stream from disk if the file is huge). The warehouse hasn't seen a single byte yet.
Cleanse the email column. str.strip().str.lower() normalises whitespace and case. This is exactly the kind of work the ETL server is sized for — row-level CPU operations on a few tens of GB of RAM.
Derive the country. The address column is parsed; the last comma-separated token becomes the country. In a real Informatica mapping this would be an Expression Transformation with a SUBSTR + INSTR pair.
Join the country lookup. merge(... on="country") enriches each row with the ISO-3 code. In a real mapping this is a Lookup Transformation.
Generate the surrogate key. A monotonically-increasing integer drawn from a sequence (in Informatica, a Sequence Generator). Surrogate keys decouple the warehouse from any volatility in source IDs.
Project the final columns. Only the warehouse-facing columns survive. Any raw fields not needed in the warehouse stay on the ETL server — that's the "clean warehouse" guarantee in action.
Bulk-load. A single INSERT into dim_customer. The warehouse spends its compute on the load, not on the transform.

Output (the warehouse table after the load).

customer_sk	customer_id	email	country	iso3	created_at
1000001	1001	jane@example.com	UK	GBR	2026-06-04 10:11
1000002	1002	bob@acme.org	France	FRA	2026-06-04 10:12
1000003	1003	alice@beta.io	USA	USA	2026-06-04 10:13

Rule of thumb. Classic ETL is at its best when row-level cleansing dominates the work, the warehouse is expensive per CPU-second, and PII compliance demands a pre-warehouse gate. Once any of those constraints lifts, ELT or EtLT is almost always a cheaper, faster shape.

ETL interview question — defending classic ETL on a regulated workload

A senior interviewer often probes ETL like this: "Your bank's compliance team forbids any raw account number landing in the warehouse. The legacy stack is Informatica + Oracle Exadata. Design the ETL job and defend why ETL still wins here."

Solution Using pre-warehouse PII tokenisation in an Informatica mapping

# informatica_pii_etl.py  — runs on the ETL server, NEVER in the warehouse
import pandas as pd
import hashlib
import hmac

ROTATION_KEY = open("/secrets/pii_hmac_key.bin", "rb").read()  # rotated quarterly

def tokenise_account(account_number: str) -> str:
    """HMAC-SHA256 tokenisation — deterministic, irreversible without the key."""
    h = hmac.new(ROTATION_KEY, account_number.encode(), hashlib.sha256)
    return "AC_" + h.hexdigest()[:24]

raw = pd.read_csv("/landing/transactions_2026-06-04.csv")

# 1) Tokenise the account number BEFORE anything else
raw["account_token"] = raw["account_number"].apply(tokenise_account)

# 2) Drop the raw column — guarantees it cannot leak downstream
raw = raw.drop(columns=["account_number"])

# 3) Cleanse + cast + business rules (still on the ETL server)
raw["amount_usd"] = raw["amount_cents"] / 100.0
raw["txn_date"]   = pd.to_datetime(raw["txn_timestamp"]).dt.date

# 4) Project the warehouse-facing columns only
out = raw[["txn_id", "account_token", "amount_usd",
           "txn_date", "merchant_id", "country"]]

# 5) Bulk-load into Exadata
out.to_sql(
    "fact_transactions",
    con="oracle+cx_oracle://etl_user@EXADATA/PROD_DW",
    if_exists="append",
    index=False,
    method="multi",
    chunksize=10_000,
)

Step-by-step trace.

Step	Action	Effect on PII boundary
1	Read `transactions_2026-06-04.csv` into ETL-server RAM	raw account numbers exist only inside the ETL server process
2	HMAC-SHA256 with rotating key → `account_token`	irreversible token replaces the raw account number
3	`drop(columns=["account_number"])`	the raw column cannot be selected by any downstream step
4	Cast `amount_cents → amount_usd`, parse timestamps	normal business-rule work, no PII implication
5	Bulk-load to Exadata	warehouse receives only tokens — auditable, attestable, audit-log clean

Output:

txn_id	account_token	amount_usd	txn_date	merchant_id	country
t_001	AC_3f9c2a1e8b4d6f0a8c2e7b1d	42.50	2026-06-04	m_42	GBR
t_002	AC_a1b2c3d4e5f60718293a4b5c	178.90	2026-06-04	m_91	FRA
t_003	AC_8e7d6c5b4a3f2e1d0c9b8a7e	9.99	2026-06-04	m_42	USA

Why this works — concept by concept:

Pre-warehouse PII gate — the raw account number lives only inside the ETL server's process memory. The warehouse never sees it on disk, in logs, in query plans, or in audit trails. This is the single biggest compliance posture difference between ETL and ELT.
HMAC tokenisation — deterministic (same account → same token, so joins still work), irreversible (cannot recover the account number without the rotation key), rotatable (re-tokenise quarterly to limit blast radius of a key compromise).
Column drop after tokenisation — drop(columns=["account_number"]) is the structural guarantee that no downstream code path can select the raw column. The PII is not present in the rest of the pipeline.
Bulk load only — the warehouse's only role is storing and querying tokens; it never participates in the PII boundary. Auditors can attest the warehouse contains no raw PHI / PCI data.
Cost — extra ETL-server compute for the tokenisation pass; warehouse storage is unchanged; the per-source dev cost is the price you pay for a regulator-clean pipeline.

ETL
Topic — etl
Classic ETL transforms (cleanse, derive, surrogate keys)

Practice →

3. Modern ELT — load-first, transform inside the warehouse

`etl elt difference` flips the order: load raw rows first, then transform them with the warehouse's own compute (dbt-native)

The mental model in one line: in modern ELT, a lightweight loader (Fivetran, Airbyte, Stitch, custom Python) pushes raw source rows directly into a raw schema in the warehouse, and the transforms run inside the warehouse — typically as dbt models, sometimes as Spark SQL or native warehouse SQL — using the warehouse's own elastic compute. The warehouse holds both the raw and the modelled data; you pay only for the query-seconds the transforms actually consume.

The three stages of modern ELT.

Extract. Loader pulls from sources (SaaS APIs, OLTP CDC streams, event buses). Schemas are auto-detected; drift is handled by the loader.
Load. Raw rows land in the warehouse in a per-source raw_* schema. No transformation beyond the loader's normalisation (column type casts, JSON unnesting).
Transform. dbt models read from raw_* and build a layered cake: staging (rename + clean), intermediate (joins + business logic), marts (BI-facing star schemas and exposures).

The tooling that defines this era.

Loaders (the "EL"). Fivetran (managed), Airbyte (open-source / cloud), Stitch (managed), Meltano (open-source orchestration), custom Python on Airflow.
Warehouses (the "T" engine). Snowflake, BigQuery, Databricks SQL, Redshift, ClickHouse.
Transform orchestration. dbt (the dominant choice), Dataform, SQLMesh.
The 3-layer dbt project. staging → intermediate → marts. Each layer is a directory of .sql model files, each compiled to a SQL CREATE TABLE AS SELECT against the warehouse.

Why ELT is the 2026 default.

Warehouse compute is cheaper than ETL boxes. A Snowflake X-Small warehouse at $2/hour runs 90 seconds of dbt model rebuilds and then suspends. The equivalent Informatica licence is a 6-figure annual line item.
New sources land in 30 minutes. Fivetran's connector for a new SaaS → click, OAuth, sync. No mapping designer, no deploy, no test cycle.
SQL is the lingua franca. Analytics engineers write dbt models; product analysts read them; data scientists query the marts. One language, one warehouse, one set of access controls.
CI/CD is straightforward. dbt's dbt test and dbt build run in GitHub Actions. Pull requests rebuild the model graph against a Snowflake clone in seconds.
Replay is cheap. Re-run any model — or the whole graph — by triggering a fresh dbt build. The raw zone is the source of truth; you can always recompute from it.

Where ELT has rough edges.

Storage cost goes up. You hold raw + staging + intermediate + marts — typically 2–4× the storage of an ETL-only landing strategy.
PII lands raw. Unless you mask pre-load (EtLT — see §5), sensitive columns sit in raw_* in the warehouse. You need row-level security or dynamic data masking to compensate.
Bad data propagates fast. A source schema change can poison marts within minutes if dbt tests don't catch it. Strict dbt test discipline is non-negotiable.

Worked example — Fivetran → Snowflake → dbt 3-layer model

Detailed explanation. A managed loader streams Stripe charges into raw_stripe.charges every 15 minutes. A dbt project then builds a staging model (clean + cast), an intermediate model (joined with customer dimension), and a mart (daily revenue facts). All three transform stages run inside Snowflake.

Question. Walk through a Fivetran → Snowflake → dbt 3-layer pipeline that loads Stripe charges into raw_stripe.charges, builds a stg_stripe__charges view, an int_charges_enriched ephemeral model, and a fct_daily_revenue table — entirely inside Snowflake.

Input (the raw zone after a Fivetran sync).

raw_stripe.charges (Fivetran-managed schema, one row per charge):

id	customer	amount	currency	status	created	_fivetran_synced	_fivetran_deleted
ch_001	cus_42	4250	usd	succeeded	1717488000	2026-06-04 10:15:00	NULL
ch_002	cus_91	17890	usd	succeeded	1717488060	2026-06-04 10:15:00	NULL
ch_003	cus_42	999	usd	failed	1717488120	2026-06-04 10:15:00	NULL

Code (the three dbt models).

-- 1) models/staging/stg_stripe__charges.sql  (VIEW — cheap to re-run)
{{ config(materialized='view') }}

SELECT
  id                          AS charge_id,
  customer                    AS customer_id,
  amount       / 100.0        AS amount_usd,
  currency,
  status,
  TO_TIMESTAMP(created)       AS charged_at,
  _fivetran_synced            AS loaded_at
FROM {{ source('raw_stripe', 'charges') }}
WHERE _fivetran_deleted IS NULL

-- 2) models/intermediate/int_charges_enriched.sql  (EPHEMERAL — compiled inline)
{{ config(materialized='ephemeral') }}

SELECT
  c.charge_id,
  c.customer_id,
  c.amount_usd,
  c.status,
  c.charged_at,
  cust.country,
  cust.plan_tier
FROM {{ ref('stg_stripe__charges') }} c
LEFT JOIN {{ ref('dim_customers') }} cust
  ON cust.customer_id = c.customer_id
WHERE c.status = 'succeeded'

-- 3) models/marts/fct_daily_revenue.sql  (TABLE — materialised)
{{ config(materialized='table') }}

SELECT
  DATE(charged_at)               AS charge_date,
  country,
  plan_tier,
  COUNT(*)                       AS charge_count,
  SUM(amount_usd)                AS revenue_usd
FROM {{ ref('int_charges_enriched') }}
GROUP BY 1, 2, 3

Step-by-step explanation.

Fivetran does "EL" — no code. A 15-minute connector schedule pulls Stripe events through Stripe's API, normalises columns, and writes them to raw_stripe.charges. Schema drift is handled transparently — new columns appear in raw_stripe.charges; missing rows are tagged with _fivetran_deleted.
stg_stripe__charges — cast + clean (VIEW). A view is cheap because every query recomputes from raw_stripe.charges; you only pay storage for the underlying raw table. The view casts cents to dollars, parses the Unix timestamp, drops soft-deleted rows.
int_charges_enriched — join + filter (EPHEMERAL). An ephemeral model is not materialised; dbt inlines it into the downstream mart's SQL. This avoids an unnecessary intermediate table while keeping the model graph readable.
fct_daily_revenue — aggregate (TABLE). Materialised as a table so BI tools (Looker, Mode, Hex) can query it without re-running the join + aggregate every time. dbt's incremental materialisation is also an option for very large fact tables.
Snowflake handles the transforms. When dbt build runs, Snowflake spins up a Medium warehouse, executes the three model SQLs (in graph order), and suspends. You pay for ~60 seconds of compute — far less than the cost of an idle ETL server.
Tests run alongside. dbt test enforces unique, not_null, relationships on the model surfaces; a schema change in Stripe that breaks the contract surfaces as a red CI run, not a silent data corruption.

Output (the mart after dbt build).

charge_date	country	plan_tier	charge_count	revenue_usd
2026-06-04	USA	pro	124	12450.00
2026-06-04	USA	enterprise	18	28900.00
2026-06-04	GBR	pro	47	4700.00
2026-06-04	FRA	starter	91	1820.00

Rule of thumb. ELT shines when the warehouse is cloud-native, the source mix is SaaS-heavy, and the team is SQL-fluent. Once the workload settles into "Fivetran → Snowflake → dbt," adding a 9th source is a 15-minute change and a 10th source is a 15-minute change — the architecture scales linearly with effort.

ELT interview question — defending the dbt 3-layer model graph

A common probe: "Why three layers — staging, intermediate, marts? Why not just one big CREATE TABLE per mart that does all the work?"

Solution Using the dbt layered cake to keep tests and re-use clean

-- staging  — one model per source table, 1:1 rename + cast + clean
-- intermediate  — joins + business logic, NOT BI-facing
-- marts  — star schemas + exposures for dashboards

-- models/staging/stg_stripe__charges.sql        (VIEW)
-- models/staging/stg_stripe__customers.sql      (VIEW)
-- models/intermediate/int_charges_enriched.sql  (EPHEMERAL)
-- models/marts/fct_daily_revenue.sql            (TABLE)
-- models/marts/fct_charge_failures.sql          (TABLE — re-uses int_charges_enriched)

-- dbt tests live alongside the models:
-- tests/staging/stg_stripe__charges.yml — unique(charge_id), not_null(amount_usd)
-- tests/marts/fct_daily_revenue.yml     — not_null(revenue_usd), >=0

Step-by-step trace.

Layer	Purpose	Materialisation	Re-use
1. staging	rename source columns, cast types, drop deletes	view	every downstream layer reads from staging
2. intermediate	joins + business filters (e.g. `status='succeeded'`)	ephemeral or table	re-used by multiple marts
3. marts	star schemas + exposures for BI	table or incremental	endpoints — BI tools query these

Output:

Metric	One-shot mart	3-layer dbt cake
Model files	1 huge file	5 small files
Test surface	the mart only	each layer (catches bad data earlier)
Reuse	nothing reusable	`int_charges_enriched` feeds `fct_daily_revenue` AND `fct_charge_failures`
Cost of a Stripe schema change	rewrite the whole mart	edit `stg_stripe__charges` only
Time to add a new mart	hours (re-derive everything)	minutes (read existing staging / intermediate)

Why this works — concept by concept:

Staging as a 1:1 contract — every downstream model reads from staging, never from raw_*. A source schema change is absorbed in one place. The blast radius of any upstream change collapses to the staging file.
Intermediate as the join layer — once a join (e.g. charges enriched with customer country) is needed by two marts, it lives in intermediate. DRY by construction.
Marts as the exposure surface — BI tools, exposures, and downstream products read only from marts. The "public API" of the warehouse is a small, stable set of star-schema tables.
Tests at every layer — dbt test runs unique, not_null, relationships on staging and marts. Bad data surfaces in CI seconds after the schema change, not as a silent dashboard discrepancy days later.
Cost — extra storage = one view per staging model (negligible) + one ephemeral per intermediate (zero — inlined) + one table per mart (the same as a one-shot). Net storage premium ≈ 0; engineering velocity premium = quarterly.

Worked example — incremental dbt model for a 1B-row event fact

Detailed explanation. ELT scales because dbt's incremental materialisation only rebuilds the new rows since the last run. A fact table that holds 1B events doesn't get fully rewritten every 15 minutes — only the latest sliver lands on each schedule. The pattern uses a MERGE (or INSERT ... ON CONFLICT) keyed on a business id, which also gives you idempotency on retries.

Question. Write an incremental dbt model for a fct_user_events table that holds 1B rows, ingests ~5M new rows every 15 minutes, and must be safely retry-able without producing duplicates.

Input (the source staging model stg_events).

event_id	user_id	event_name	event_ts	properties
e_a	u_42	page_view	2026-06-04 10:00	{"path":"/"}
e_b	u_42	click	2026-06-04 10:01	{"target":"cta"}
e_c	u_91	page_view	2026-06-04 10:02	{"path":"/pricing"}
...	...	...	...	...

Code (the incremental dbt model).

-- models/marts/fct_user_events.sql
{{
  config(
    materialized   = 'incremental',
    unique_key     = 'event_id',
    incremental_strategy = 'merge',
    on_schema_change = 'append_new_columns',
    cluster_by     = ['event_date', 'user_id']
  )
}}

SELECT
  event_id,
  user_id,
  event_name,
  event_ts,
  DATE(event_ts)                AS event_date,
  properties,
  CURRENT_TIMESTAMP()           AS loaded_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  -- Only consider rows newer than the latest row already in the target.
  WHERE event_ts > (SELECT COALESCE(MAX(event_ts), '1970-01-01') FROM {{ this }})
{% endif %}

Step-by-step explanation.

First run (full refresh). is_incremental() is false; the WHERE block is skipped; dbt rebuilds fct_user_events from scratch with all 1B rows. This is a one-time cost.
Subsequent runs (every 15 min). is_incremental() is true; only rows with event_ts > MAX(existing event_ts) are SELECTed from stg_events. The result set is ~5M rows, not 1B.
incremental_strategy = 'merge' — Snowflake's MERGE INTO fct_user_events USING (SELECT ...) ON event_id = src.event_id upserts on the unique key. If the same event_id re-arrives (a duplicate from the source CDC stream or a retried connector sync), the existing row is updated, not double-inserted.
Retry safety. If the dbt run fails halfway, the next run picks up where the last successful run left off (because MAX(event_ts) reflects only committed rows). No duplicates, no gaps.
cluster_by on Snowflake. Co-locates rows with similar event_date + user_id in the same micro-partitions; downstream user-level analytics queries prune to a tiny slice of the table.

Output (the cost shape over time).

Schedule slot	Source rows scanned	Target rows written	Compute time
First run	1,000,000,000	1,000,000,000	~30 min on a Large warehouse
10:15 incremental	5,000,000	~5,000,000 (MERGE)	~90 seconds on X-Small
10:30 incremental	5,000,000	~5,000,000 (MERGE)	~90 seconds on X-Small
10:45 incremental (retry of failed 10:30)	5,000,000	0 duplicates (MERGE upserts)	~90 seconds on X-Small

Rule of thumb. Reach for incremental materialisation when the target table is >50M rows and the per-run delta is <10% of the total. Use merge as the strategy for any table where the same business id might arrive twice; use append only when the source is strictly append-only and CDC duplicates are impossible.

SQL
Topic — data transformation
Warehouse-native transforms (dbt-style SQL modelling)

Practice →

4. When ETL still wins — the 6-dimension decision matrix

`etl vs elt architecture` is not a religion — six concrete dimensions decide the right shape per workload

Even in 2026, ETL is not "dead." It wins on at least two dimensions for at least two workload classes. The trick is to score the workload across six concrete dimensions and let the matrix decide — not the pattern's reputation.

The six dimensions every interview probes.

PII / compliance gate. Must raw sensitive columns ever land in the warehouse? If no, ETL or EtLT only.
Latency to insight. Hours-stale (overnight reports) or minutes-stale (operational dashboards)?
Cost model. Predictable fixed (ETL licence + server) or variable per-query (warehouse compute)?
Source velocity. How many new sources per quarter? One? Twenty?
Team skill mix. Specialist ETL developers, or SQL-fluent analytics engineers?
On-prem vs cloud. Hard requirement to stay on-prem? Or full cloud-native target?

Dimension 1 — PII / compliance gate.

ETL wins. PHI / PCI / GDPR-regulated columns can be tokenised, masked, or dropped on the ETL server before the warehouse sees a single byte. Auditors love this.
ELT struggles. Raw PII lands in raw_*; you compensate with warehouse-level row security, dynamic data masking, or post-load redaction — softer posture, more moving parts.
EtLT bridge. Light "Et" outside the warehouse to mask only the sensitive columns; everything else flows raw into the warehouse.

Dimension 2 — Latency to insight.

ETL. Batch-shaped — hours. The ETL server is a finite queue you schedule around.
ELT. Minutes — the warehouse spins up on demand. dbt rebuilds the affected models whenever the loader's next sync lands.
EtLT. Inherits ELT's latency once past the pre-load step.

Dimension 3 — Cost model.

ETL. Fixed and predictable — licence + server + ops. Easy to budget; expensive at low volumes.
ELT. Pay-per-query-second. Cheap at low volumes; can explode with a runaway query if not governed.
Gotcha. Snowflake auto-suspend + per-query timeouts + resource monitors are the equivalent of "budget control" in ELT-land.

Dimension 4 — Source velocity.

ETL. Days per source — mapping, test, deploy. Falls behind for fast-moving SaaS landscapes.
ELT. Minutes per source — Fivetran / Airbyte connectors. Wins outright on this axis.

Dimension 5 — Team skill mix.

ETL. Specialist Informatica / SSIS / DataStage developers. Shrinking labour market; long ramp.
ELT. Analytics engineers (SQL + dbt + Python). Larger, growing labour market; faster onboarding.

Dimension 6 — On-prem vs cloud.

ETL. First-class on-prem story. Decades of mature deployments on Linux / Windows / mainframes.
ELT. Cloud-first. On-prem variants exist (dbt-core can target on-prem Postgres / Trino) but ergonomics suffer.

Worked example — score a real workload across the matrix

Detailed explanation. A FinTech is choosing the architecture for a new transaction analytics pipeline. The data is PCI-regulated (account numbers + card BINs), the target is BigQuery (cloud), the team is 4 analytics engineers + 1 platform engineer, and they need 15-minute freshness for fraud-ops dashboards. Score it.

Question. A FinTech needs a new analytics pipeline. Constraints: PCI data (account numbers + card BINs), BigQuery target, 5-person team (mostly SQL-fluent), 15-minute freshness SLO, two new sources per quarter. Score the workload across the six dimensions and pick a pattern.

Input (the constraint table).

Dimension	Value
PII / compliance	PCI — account numbers + card BINs
Latency to insight	15 minutes
Cost model	flexible — BigQuery on-demand
Source velocity	~2 new sources / quarter
Team skill	4 analytics engineers + 1 platform engineer
On-prem vs cloud	cloud (BigQuery)

Code (the scoring sketch — runs as a notebook before the architecture review).

# decision_matrix.py — score the workload across six dimensions
dimensions = {
    "PII / compliance":   {"etl": "WIN", "elt": "loss",  "weight": 3, "note": "PCI — mask BEFORE warehouse"},
    "Latency to insight": {"etl": "loss", "elt": "WIN",  "weight": 2, "note": "15-min SLO favours warehouse autoscale"},
    "Cost model":         {"etl": "loss", "elt": "WIN",  "weight": 1, "note": "on-demand BQ scales with usage"},
    "Source velocity":    {"etl": "loss", "elt": "WIN",  "weight": 1, "note": "~2/quarter — both can keep up, ELT easier"},
    "Team skill":         {"etl": "loss", "elt": "WIN",  "weight": 2, "note": "SQL-fluent team, no Informatica devs"},
    "On-prem vs cloud":   {"etl": "tie",  "elt": "WIN",  "weight": 1, "note": "BigQuery is cloud-only"},
}

etl_score = sum(d["weight"] for d in dimensions.values() if d["etl"] == "WIN")
elt_score = sum(d["weight"] for d in dimensions.values() if d["elt"] == "WIN")
print("ETL score:", etl_score, "  ELT score:", elt_score)
# ETL score: 3   ELT score: 7

Step-by-step explanation.

PII dimension flips the decision toward "mask first." PCI demands that raw account numbers and card BINs never reach BigQuery in cleartext. Pure ELT is off the table.
Every other dimension favours ELT. Latency, cost, team skill, on-prem-vs-cloud — all wins for the warehouse-native shape.
The verdict is EtLT, not pure ETL. Light "Et" outside the warehouse to mask the PCI columns (HMAC tokenisation), then full ELT for everything else. You don't need a full Informatica install to do one tokenisation step — a small Python service or a Fivetran column transform suffices.
The matrix is honest about it. The PII column is the only "ETL wins" row; five out of six favour ELT. The hybrid (EtLT) captures both wins.

Output (the architecture decision).

Dimension	Verdict	Pattern handles it via
PII / compliance	ETL	pre-load HMAC tokenisation of account numbers + BINs
Latency to insight	ELT	dbt models rebuild on 15-min schedule
Cost model	ELT	BigQuery on-demand pricing
Source velocity	ELT	Fivetran connectors for new sources
Team skill	ELT	SQL-fluent team, no Informatica devs
On-prem vs cloud	ELT	BigQuery is cloud-only
Final pattern	EtLT	light pre-load mask + warehouse-native transforms

Rule of thumb. Score every workload across the six dimensions. If one dimension is a hard compliance gate, EtLT or ETL is required — but the other five dimensions still drive the bulk of the shape. Most production pipelines in 2026 score 4-or-more for ELT and either zero or one for ETL; EtLT exists precisely to capture both wins.

ETL vs ELT interview question — defending the matrix in a design round

A senior probe: "If ELT wins five of six dimensions and ETL wins one, why ship EtLT instead of pure ELT with post-load masking?" — testing whether the candidate understands the compliance posture difference.

Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely

# eltl_preload_mask.py  — runs as a small ECS / Cloud Run service
# Sits BETWEEN the source CDC stream and the BigQuery loader.
import hashlib
import hmac
import json

ROTATION_KEY = open("/secrets/pci_hmac_key.bin", "rb").read()

PII_COLUMNS = {"account_number", "card_bin", "cardholder_name"}

def tokenise(value: str) -> str:
    h = hmac.new(ROTATION_KEY, value.encode(), hashlib.sha256)
    return "TKN_" + h.hexdigest()[:24]

def mask_record(row: dict) -> dict:
    out = {}
    for k, v in row.items():
        if k in PII_COLUMNS and v is not None:
            out[k + "_token"] = tokenise(str(v))
        elif k in PII_COLUMNS:
            out[k + "_token"] = None
        else:
            out[k] = v
    return out

# Streamed from Kafka or Pub/Sub; written to BigQuery via the loader
for msg in cdc_stream:
    masked = mask_record(json.loads(msg.value()))
    bigquery_loader.write(masked)   # only tokens land in BigQuery

Step-by-step trace.

Step	Action	Where PII lives
1	CDC source emits a `transactions` row	source system (in-scope for PCI audit)
2	Streaming masker reads the row	ephemeral memory in the masker service (PCI-scope, short-lived)
3	Masker tokenises `account_number`, `card_bin`, `cardholder_name`	raw values exist for microseconds, never persisted
4	Masker writes the masked row to BigQuery via the loader	warehouse sees only `*_token` columns
5	dbt models transform the masked rows	every downstream layer works on tokens
6	Auditor inspects BigQuery	no raw PCI columns present anywhere

Output:

Layer	Raw PCI present?	Why
Source CDC	yes	inherent — source system is the system of record
Masker service	microseconds only	values exist only in process memory during the tokenisation call
BigQuery `raw_*`	no	only `*_token` columns land
dbt staging / marts	no	downstream layers never see raw
Looker / BI exposures	no	only tokens leave the warehouse

Why this works — concept by concept:

Pre-load mask as the compliance gate — PCI scope shrinks dramatically when raw card data does not persist anywhere in the warehouse stack. The masker service is in scope; everything downstream is out.
HMAC tokenisation, not encryption — tokens are deterministic (joins still work) and irreversible without the rotating key (limits blast radius). Encryption is reversible by anyone with the key, which broadens audit scope.
Light Et + heavy T — the pre-load step does only what compliance requires (mask N specific columns). All other transformation (cast, dedupe, join, aggregate) happens inside BigQuery via dbt. You don't pay for an Informatica install just to do tokenisation.
Key rotation discipline — ROTATION_KEY rotates quarterly; old tokens are rotated by a one-time re-tokenisation pass. Compromise of any single key is bounded to one quarter of data.
Cost — masker service ≈ a few dollars / day at typical FinTech volumes; warehouse cost identical to pure ELT; PCI audit scope shrinks from "the whole warehouse" to "the masker service" — the real win.

Worked example — model the 3-year TCO of ETL vs ELT for a mid-market workload

Detailed explanation. Senior interviewers often probe the cost dimension quantitatively: "Sketch the 3-year TCO of an Informatica ETL stack versus a Fivetran + Snowflake + dbt ELT stack for a mid-market workload." The answer separates one-time costs, recurring licence costs, recurring compute costs, and per-source dev costs.

Question. A mid-market company runs 12 sources, 2 TB of monthly transform compute, ships ~4 new sources per year. Build a 3-year TCO model comparing Informatica + on-prem warehouse versus Fivetran + Snowflake + dbt.

Input (the workload assumptions).

Driver	Value
Sources	12 active, +4 per year
Transform volume	2 TB / month
Latency SLO	1 hour
Team	3 data engineers
Existing licence	none — greenfield decision

Code (the TCO model — a small Python sketch you would run before the architecture review).

# tco_etl_vs_elt.py — 3-year cost comparison
YEARS = 3

# --- ETL stack (Informatica + on-prem warehouse) ---
etl_informatica_licence       = 250_000   # annual, mid-market
etl_server_hardware           =  80_000   # one-time
etl_server_ops_per_year       =  60_000   # SRE + storage
etl_warehouse_licence_per_year= 180_000   # on-prem warehouse
etl_per_source_dev_cost       =  20_000   # specialist labour, per new source
etl_new_sources_total         = 4 * YEARS

etl_total = (
    etl_server_hardware
    + etl_informatica_licence       * YEARS
    + etl_server_ops_per_year       * YEARS
    + etl_warehouse_licence_per_year * YEARS
    + etl_per_source_dev_cost       * etl_new_sources_total
)

# --- ELT stack (Fivetran + Snowflake + dbt) ---
fivetran_per_million_rows     = 1.50      # blended pricing
monthly_rows_millions         = 800       # 800M rows / month across 12 sources
fivetran_per_year             = fivetran_per_million_rows * monthly_rows_millions * 12

snowflake_per_credit          = 2.00      # X-Small at $2/hr ~= 1 credit/hr
credits_per_year              = 8_000     # 2 TB compute / month at moderate WH size
snowflake_per_year            = credits_per_year * snowflake_per_credit

dbt_cloud_team_seat           = 100       # per seat/month
dbt_per_year                  = 3 * dbt_cloud_team_seat * 12

elt_per_source_dev_cost       = 1_000     # mostly YAML + a staging model
elt_new_sources_total         = 4 * YEARS

elt_total = (
    (fivetran_per_year + snowflake_per_year + dbt_per_year) * YEARS
    + elt_per_source_dev_cost * elt_new_sources_total
)

print(f"ETL 3-year TCO: ${etl_total:>12,.0f}")
print(f"ELT 3-year TCO: ${elt_total:>12,.0f}")
print(f"Delta:          ${etl_total - elt_total:>12,.0f}")

Step-by-step explanation.

One-time hardware vs ongoing meter. ETL has an $80K hardware spend up front; ELT has zero — the warehouse is the meter. Over 3 years, the one-time cost is small but the structural difference matters for cash-flow planning.
Licence dominates ETL. Informatica + on-prem warehouse licences run ~$430K/year in this scenario. Over 3 years, that is ~$1.3M of fixed spend that is paid whether the pipeline runs one row or one billion.
Meter dominates ELT. Fivetran (per-row) + Snowflake (per-credit) + dbt Cloud (per seat) run ~$33K/year here. Over 3 years, that is ~$100K of variable spend that scales with usage.
Per-source dev cost differs by an order of magnitude. ETL = ~$20K per new source (specialist labour, mapping designer, deploy cycle). ELT = ~$1K per new source (YAML + a staging model). At 12 new sources over 3 years, that's $240K vs $12K.
Total 3-year delta. ETL ≈ $1.6M; ELT ≈ $110K. Mid-market workload, no compliance gate — ELT wins by an order of magnitude.

Output (the 3-year TCO comparison).

Cost category	ETL stack	ELT stack
Hardware (one-time)	$80,000	$0
Informatica licence (3 yr)	$750,000	$0
ETL server ops (3 yr)	$180,000	$0
On-prem warehouse licence (3 yr)	$540,000	$0
Fivetran (3 yr)	$0	$43,200
Snowflake compute (3 yr)	$0	$48,000
dbt Cloud (3 yr)	$0	$10,800
New-source dev (12 sources × 3 yr)	$240,000	$12,000
3-year TCO	$1,790,000	$114,000
Delta	—	~16× cheaper

Rule of thumb. Build the TCO model with both fixed and variable components, and amortise the per-source dev cost — it's the line item that quietly dominates over the contract length. If you find the ETL stack winning on TCO for a mid-market workload, double-check the assumptions: usually it's because someone pre-loaded the model with an existing licence the team is not paying for at the margin.

ETL
Topic — etl
ETL design under compliance + cost constraints

Practice →

5. The hybrid pattern — EtLT (light pre-load + warehouse transforms)

`etl elt pipeline` in 2026 is almost always EtLT — light Et outside, heavy T inside

The pragmatic reality: most production pipelines in 2026 ship a hybrid — a thin "Et" stage outside the warehouse (mask PII, decrypt, type cast, dedupe) and the heavy "T" inside the warehouse (joins, aggregates, business logic, exposures). EtLT inherits ELT's cloud economics + team-skill match while preserving ETL's compliance gate. It is not a compromise; it is a clean split of work into the layer where each task belongs.

The four stages of EtLT.

Extract. Same as ETL or ELT — pull rows from sources.
Light Et (pre-load). Only what cannot wait until inside the warehouse: PII masking, decryption, hard type casting, deduplication. Runs as a small streaming service or as a loader-level transform.
Load. Push the masked / cast rows into the warehouse's raw zone.
Heavy T (in-warehouse). dbt staging → intermediate → marts. All business logic, all joins, all aggregates.

What belongs in the pre-load step (and what doesn't).

Belongs in pre-load. PII masking, decryption of source payloads, hard type casts that the warehouse cannot do (e.g. binary blob → JSON), dedupe of true source duplicates, schema-shape normalisation.
Belongs in the warehouse. Joins, aggregates, business rules, SCD-2 history, dimensional modelling, exposure layers, BI-facing marts.
The rule. If it requires a business decision (what is the right join key? which currency conversion to apply?), it belongs in the warehouse. If it is a compliance or technical constraint (PII mask, decrypt, cast), it belongs in pre-load.

Why EtLT has become the default.

Best of both worlds. Compliance gate (ETL win) + cloud economics + team skill match + source velocity (ELT wins).
Pre-load step is small. The "Et" stage handles a handful of columns across all sources — typically a single shared service, not a full ETL platform.
Loader handles the boilerplate. Fivetran / Airbyte connectors do the extract + load; the pre-load step plugs in as a column-level transform on the loader, or as a sidecar service.

EtLT vs pure ELT vs pure ETL — when each fits.

Pure ETL. On-prem warehouses, legacy stacks, strict pre-warehouse PII with no cloud target.
Pure ELT. Cloud warehouse, no PII to mask pre-load, SQL-fluent team.
EtLT. Cloud warehouse + at least one pre-load constraint (PII, decrypt, hard cast). This is most pipelines.

Worked example — hybrid Fivetran (column mask) → Snowflake raw → dbt staging → marts

Detailed explanation. A B2B SaaS ingests Salesforce Contact records into Snowflake. The email and phone columns are PII under GDPR; everything else (account, opportunity, deal_stage) is free to land raw. Fivetran's column-level transform masks the two PII columns; raw rows land in raw_salesforce.contact; dbt models build the rest.

Question. Walk through an EtLT pipeline that ingests Salesforce Contact records into Snowflake, masks email and phone at the Fivetran loader (pre-load), and builds the rest via dbt — showing exactly where each transform runs.

Input (the source Contact record from Salesforce).

{
  "Id":          "003abc123",
  "FirstName":   "Jane",
  "LastName":    "Doe",
  "Email":       "jane.doe@example.com",
  "Phone":       "+44-20-7946-0958",
  "AccountId":   "001xyz789",
  "Title":       "Director of Data",
  "CreatedDate": "2026-06-04T10:11:00Z"
}

Code (the three layers of EtLT).

# 1) Light Et — Fivetran column-level transform (pre-load)
# fivetran-salesforce-contact.yaml
connector: salesforce
schema:    raw_salesforce
table:     contact
column_transforms:
  - column: Email
    type:   hash
    algo:   sha256
    salt:   "${secrets.email_salt}"
    output: email_hash
    drop_original: true
  - column: Phone
    type:   hash
    algo:   sha256
    salt:   "${secrets.phone_salt}"
    output: phone_hash
    drop_original: true

-- 2) Load — Fivetran writes the masked row into raw_salesforce.contact
-- (Snowflake table — written by the loader, NOT by the team)
CREATE OR REPLACE TABLE raw_salesforce.contact (
  id                STRING,
  first_name        STRING,
  last_name         STRING,
  email_hash        STRING,        -- pre-loaded by Fivetran's column transform
  phone_hash        STRING,        -- pre-loaded by Fivetran's column transform
  account_id        STRING,
  title             STRING,
  created_date      TIMESTAMP,
  _fivetran_synced  TIMESTAMP,
  _fivetran_deleted BOOLEAN
);

-- 3) Heavy T — dbt staging model inside Snowflake
-- models/staging/stg_salesforce__contact.sql
{{ config(materialized='view') }}

SELECT
  id                        AS contact_id,
  first_name,
  last_name,
  email_hash,                            -- already masked
  phone_hash,                            -- already masked
  account_id,
  title,
  created_date,
  _fivetran_synced          AS loaded_at
FROM {{ source('raw_salesforce', 'contact') }}
WHERE _fivetran_deleted IS NULL

-- 4) Heavy T — dbt mart that joins contacts with accounts and opportunities
-- models/marts/fct_account_engagement.sql
{{ config(materialized='table') }}

SELECT
  a.account_id,
  a.account_name,
  COUNT(DISTINCT c.contact_id)       AS contact_count,
  COUNT(DISTINCT o.opportunity_id)   AS open_opportunities,
  SUM(o.amount_usd)                  AS pipeline_usd
FROM {{ ref('stg_salesforce__account') }}        a
LEFT JOIN {{ ref('stg_salesforce__contact') }}   c USING (account_id)
LEFT JOIN {{ ref('stg_salesforce__opportunity') }} o USING (account_id)
WHERE o.is_open = TRUE
GROUP BY a.account_id, a.account_name

Step-by-step explanation.

Salesforce emits the Contact record. Includes raw Email and Phone — PII under GDPR.
Fivetran's column-level transform masks PII pre-load. Email and Phone are SHA-256-hashed with a per-column salt; the original columns are dropped before the row is written to Snowflake.
Snowflake's raw_salesforce.contact only ever contains the hashes. Auditors can confirm the warehouse contains no raw email or phone for any contact.
stg_salesforce__contact is a 1:1 view. Rename id → contact_id, cast types, filter soft-deletes. No PII work — it's already done.
fct_account_engagement joins three staging models. Pure SQL inside Snowflake; no Python; no separate ETL server; no per-mart deploy cycle.
The whole pipeline runs on two infrastructures. Fivetran (managed, handles E + light t + L). Snowflake (handles heavy T via dbt). That's it. No third box.

Output (a row of the final mart).

account_id	account_name	contact_count	open_opportunities	pipeline_usd
001xyz789	Acme GmbH	14	3	187500.00
001def456	Beta SA	7	1	42000.00
001ghi012	Gamma Inc	23	5	312000.00

Rule of thumb. Put PII / decrypt / hard-cast work in the loader's column transform (or a sidecar service). Put every other transform in dbt. The split is invariant under most workload changes — adding a new source means adding a new connector + a new staging model, never a full new pipeline.

EtLT interview question — defending the hybrid against pure ELT

A common probe: "If Snowflake has dynamic data masking, why bother with pre-load tokenisation in the loader? Just land the data raw and mask at read-time." — testing whether the candidate understands the audit-scope difference between masking at read and not landing raw at all.

Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely

# Hybrid EtLT — masks happen BEFORE the warehouse, not on read
# This is the structural guarantee that no raw PII exists at rest in Snowflake.
connector: salesforce
column_transforms:
  - column: Email
    type:   hash         # SHA-256 of (value || per-column salt)
    output: email_hash
    drop_original: true  # critical — the raw column is GONE before LOAD
  - column: Phone
    type:   hash
    output: phone_hash
    drop_original: true
  - column: SSN
    type:   tokenise     # HMAC token, deterministic, reversible only by key holder
    output: ssn_token
    drop_original: true

-- Compare against the pure-ELT alternative (NOT what we're shipping)
-- The pure-ELT shape lands raw PII and masks on read:
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    ELSE SHA2(val, 256)
  END;

ALTER TABLE raw_salesforce.contact MODIFY COLUMN email
  SET MASKING POLICY mask_email;
-- Raw email STILL EXISTS in the table; only the SELECT result is masked.
-- Backups, time-travel, replication, query history all retain the raw values.

Step-by-step trace.

Concern	EtLT pre-load mask	Pure ELT + dynamic masking
Raw PII at rest in warehouse?	no — never written	yes — written then masked on read
Raw PII in backups / time travel?	no	yes — up to retention window
Raw PII in replication streams?	no	yes — replicated cleartext
Raw PII in query history?	no	yes — full query text may include raw values
Audit scope	masker service only	entire warehouse + backups + replicas + query history
Compliance evidence	"drop_original: true" + masker logs	masking policy + per-role audit trail
Cost of a key compromise	re-tokenise forward	full re-load + re-mask + backup purge

Output:

Aspect	EtLT (pre-load mask)	Pure ELT (mask on read)
Raw PII landed in warehouse	NO	YES
Audit scope	masker only	warehouse + backups + replicas
GDPR "right to be forgotten" cost	low (forward-only)	high (purge backups + replicas)
Read-time masking still possible	yes (additional layer)	this is the only layer

Why this works — concept by concept:

Masking on read is not the same as not landing raw — a masking policy hides the value from the SELECT result but the underlying bytes still exist on disk, in backups, in time-travel snapshots, in replication streams, and in query history. Auditors care about bytes at rest, not bytes returned.
drop_original: true is the structural guarantee — once the loader drops the raw column before LOAD, the warehouse cannot retain it anywhere. No backup-leak risk, no time-travel-leak risk, no replication-leak risk.
Audit-scope contraction — under EtLT, the PCI / GDPR audit scope is the masker service (one container, one log stream, one secrets vault). Under pure-ELT-with-masking-policy, the audit scope is the entire warehouse plus every replica plus every backup retention window.
GDPR cost-to-forget — under EtLT, a "right to be forgotten" request is a forward-only operation (delete the user's rows, the historical hashes are already irreversible). Under pure ELT, you must purge the user's raw PII from every backup, every time-travel snapshot, every replica — quarters of cleanup.
Cost — masker service ≈ a few dollars / day at typical SaaS volumes; the audit-scope contraction is worth several engineer-quarters per compliance cycle. EtLT pays for itself before the first audit.

Worked example — orchestrate the EtLT pipeline with Airflow + dbt

Detailed explanation. A common follow-up: "Show me how you'd orchestrate the whole EtLT shape end-to-end — masker, loader, dbt — so it survives partial failure and re-runs cleanly." The canonical 2026 answer is Airflow (or Dagster / Prefect) as the orchestrator, with dbt as the in-warehouse transform engine. Each step is idempotent so a retry never produces duplicates.

Question. Sketch the Airflow DAG that runs the EtLT pipeline: (1) wait for new Salesforce events, (2) trigger the masker service on the new partition, (3) trigger the Fivetran sync, (4) run the dbt project, (5) publish freshness metrics. Show how partial failure is handled.

Code (the Airflow DAG).

# dags/etlt_salesforce.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.fivetran.operators.fivetran import FivetranOperator
from airflow.providers.fivetran.sensors.fivetran import FivetranSensor
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

default_args = {
    "owner": "data-platform",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

with DAG(
    "etlt_salesforce",
    start_date=datetime(2026, 6, 1),
    schedule="*/15 * * * *",          # every 15 minutes
    default_args=default_args,
    catchup=False,
    max_active_runs=1,                 # one run at a time, no overlap
) as dag:

    def run_masker(**ctx):
        """Idempotent — keyed on (source, partition_ts)."""
        from masker_client import run
        run(source="salesforce",
            partition_ts=ctx["data_interval_start"])

    mask = PythonOperator(
        task_id="mask_salesforce_pii",
        python_callable=run_masker,
    )

    sync = FivetranOperator(
        task_id="trigger_fivetran_salesforce",
        connector_id="salesforce_prod",
    )

    wait_sync = FivetranSensor(
        task_id="wait_fivetran_completion",
        connector_id="salesforce_prod",
        poke_interval=30,
        timeout=600,
    )

    dbt_build = DbtCloudRunJobOperator(
        task_id="dbt_build_salesforce_models",
        job_id=42,
        check_interval=20,
        wait_for_termination=True,
    )

    def publish_freshness(**ctx):
        from monitoring import emit_metric
        emit_metric("etlt.salesforce.freshness_minutes", 15)

    publish = PythonOperator(
        task_id="publish_freshness",
        python_callable=publish_freshness,
    )

    mask >> sync >> wait_sync >> dbt_build >> publish

Step-by-step explanation.

Schedule. Cron */15 * * * * triggers the DAG every 15 minutes. max_active_runs=1 prevents two runs from racing — if the previous run is still going, the new one queues.
Step 1 — Mask (the "Et"). run_masker calls the masker service for the source + partition window. The masker is keyed on (source, partition_ts) so a retry of the same partition produces the same tokens — idempotent.
Step 2 — Trigger Fivetran sync. FivetranOperator POSTs to Fivetran's REST API to kick the connector. Fivetran handles the actual extract + load.
Step 3 — Wait for Fivetran completion. FivetranSensor polls every 30 seconds until the connector reports success. Times out at 10 minutes; Airflow retries the whole task up to 3 times before paging.
Step 4 — Run dbt. DbtCloudRunJobOperator triggers the dbt Cloud job (or in self-hosted, a DbtRunOperator). dbt's own MERGE-based incremental models are idempotent by construction.
Step 5 — Publish freshness. A small metric emits etlt.salesforce.freshness_minutes. A dashboard alert fires if freshness drifts above SLO.
Partial-failure behaviour. If mask fails, sync and dbt do not run — the partition is reprocessed on next schedule. If dbt_build fails, the warehouse is left in a known-good state (last successful commit); the next run picks up. No duplicates, no data loss.

Output (a typical DAG run timeline).

Step	Duration	Idempotent on retry?
mask_salesforce_pii	~30s	yes (deterministic tokens)
trigger_fivetran_salesforce	~5s	yes (Fivetran connector dedupes)
wait_fivetran_completion	~2-5 min	yes (sensor)
dbt_build_salesforce_models	~90s	yes (MERGE on unique_key)
publish_freshness	~2s	yes (metric emit)
End-to-end	~5-7 min	fully idempotent DAG

Rule of thumb. Orchestrate EtLT with one DAG per source family (one for Salesforce, one for Stripe, etc.) rather than a single mega-DAG — failures localise, retries are cheaper, and the dependency graph stays readable. Use max_active_runs=1 to avoid two runs racing on the same partition.

ETL
Topic — etl
EtLT hybrid pipeline design (pre-load mask + warehouse transforms)

Practice →

Cheat sheet — pick ETL vs ELT vs EtLT

Need to keep raw PII out of the warehouse entirely? Pick ETL or EtLT — pure ELT cannot satisfy a strict pre-warehouse compliance gate, even with read-time masking.
Have a cloud warehouse + SQL-fluent team + no strict pre-load PII rule? Pick ELT — Fivetran/Airbyte → Snowflake/BigQuery/Databricks → dbt is the default 2026 shape.
Have a cloud warehouse + at least one pre-load constraint (PII / decrypt / hard cast)? Pick EtLT — light "Et" outside, heavy "T" inside. This is most production pipelines.
Have a legacy on-prem Informatica / SSIS / DataStage install with no migration mandate? Pick ETL — re-platforming for the sake of "modern" rarely pays off; stay on the legacy until a separate forcing function appears.
Onboarding twenty new sources per quarter? Pick ELT or EtLT — connector-based loaders are the only way to keep up; ETL per-source dev cycles cannot scale.
Need 15-minute freshness for operational dashboards? Pick ELT or EtLT — warehouse autoscale rebuilds dbt models on demand; ETL servers are batch-shaped.
Hard cap on warehouse compute spend? Consider ETL — fixed licence cost is more predictable than per-query metering; or use ELT with strict resource monitors and per-query timeouts.
Need exactly-once into the warehouse? Use ELT or EtLT with MERGE-based incremental dbt models keyed on a business id — the warehouse owns the idempotency boundary cleanly.

Frequently asked questions

ETL vs ELT — which is faster?

For modern cloud warehouses, ELT is almost always faster end-to-end because the warehouse compute (Snowflake, BigQuery, Databricks) is more elastic than a dedicated ETL server — a 100M-row transform spins up a large warehouse, runs in 90 seconds, and suspends. ETL's batch-shaped server pipeline is bounded by the slowest queued job and the fixed compute size. ETL can be faster in narrow cases — small row-level cleansing jobs on a beefy on-prem ETL box, or cases where the warehouse is intentionally throttled by a strict budget cap.

Is ETL dead?

No — ETL is the right shape for two persistent niches: strict pre-warehouse PII / compliance gates (banking, healthcare, EU GDPR-heavy) where raw sensitive columns cannot land in the warehouse, and legacy on-prem stacks (Informatica, SSIS, DataStage) where the warehouse is on-prem and the institutional investment in ETL tooling is mature. What is dead is "pure ETL by default" — for new cloud-native pipelines, ELT or EtLT wins on four-to-five of the six decision dimensions.

Does dbt count as ELT?

Yes — dbt is the canonical transform engine for ELT, doing the "T" inside the warehouse after a separate loader (Fivetran, Airbyte, Stitch, custom Python) handles the "EL." dbt has no extract or load capability of its own; it compiles SQL SELECTs into CREATE TABLE AS or MERGE statements that the warehouse runs against tables that already exist. A pipeline of Fivetran → Snowflake → dbt is the textbook ELT shape; Fivetran (with column-level masks) → Snowflake → dbt is the textbook EtLT shape.

ETL vs ELT for compliance / GDPR?

For strict-PII workloads under GDPR, HIPAA, or PCI, EtLT (or pure ETL) wins on audit-scope grounds because the raw PII never lands in the warehouse — only tokens or hashes do. Pure ELT with read-time masking is not equivalent: the underlying bytes still exist in warehouse storage, backups, time-travel snapshots, replication streams, and query history, all of which fall inside the compliance audit boundary. The pre-load mask in EtLT contracts the audit scope from the entire warehouse stack down to the masker service alone.

Cost comparison — ETL vs ELT?

ETL is predictable fixed cost (server + licence + ops) — easy to budget but expensive at low volumes and hard to scale up quickly. ELT is variable per-query cost — cheap at low volumes, can explode with a runaway query if not governed, but scales with usage. For most cloud-native workloads under 10TB of monthly transforms, ELT lands well below ETL's fixed cost; at very high volumes (100+ TB transforms), the per-query meter starts approaching the licence cost — at which point resource monitors, per-query timeouts, and incremental dbt models become the cost-control tools.

Can I do ETL on Snowflake?

Yes — Snowflake supports the transform-before-final-load pattern via Snowpark, Java/Python UDFs, and ephemeral compute. You can run row-level transforms inside Snowflake's compute before writing the result to the "real" target table, which is effectively in-warehouse ETL. Practically, this is rare — if you're already inside the warehouse, dbt's view/ephemeral/table materialisations cover the same intent more idiomatically. Pure ETL with Snowflake usually means a dedicated transform service (Snowpark / external Python) runs before the canonical raw_* schema receives the data — i.e. EtLT.

Practice on PipeCode

Drill the ETL practice library → for hands-on extract-transform-load problems with input tables and expected outputs.
Stretch with data transformation drills → for warehouse-native SQL modelling reps.
Rehearse SQL aggregation problems → and SQL join problems → for the dbt staging + intermediate layer muscles.
Layer in data modelling problems → and slowly-changing-data problems → for the marts layer.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Build the design-round muscles with the ETL system design course →.
Pair the modelling intuition with the data modelling course →.

Pipecode.ai is Leetcode for Data Engineering — every ETL vs ELT concept above ships with hands-on practice rooms where you write real transforms, model real warehouses, and defend real architectures. Start with the ETL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice ETL now →
ETL system design course →

1. The 3-letter difference and why it matters

etl vs elt collapses to a single decision: where does the T run — outside the warehouse or inside it?

Worked example — frame the ETL vs ELT decision for a new pipeline

ETL vs ELT interview question — picking the right shape from raw constraints

Solution Using a constraint-driven decision tree

A short history — why ETL was the only option pre-2015

2. Classic ETL — transform-first, before the warehouse load

etl vs elt architecture starts with the classic shape: dedicated compute outside the warehouse, only modelled rows inside

Worked example — Informatica → SQL Server warehouse with row-level transforms

ETL interview question — defending classic ETL on a regulated workload

Solution Using pre-warehouse PII tokenisation in an Informatica mapping

3. Modern ELT — load-first, transform inside the warehouse

etl elt difference flips the order: load raw rows first, then transform them with the warehouse's own compute (dbt-native)

Worked example — Fivetran → Snowflake → dbt 3-layer model

ELT interview question — defending the dbt 3-layer model graph

Solution Using the dbt layered cake to keep tests and re-use clean

Worked example — incremental dbt model for a 1B-row event fact

4. When ETL still wins — the 6-dimension decision matrix

etl vs elt architecture is not a religion — six concrete dimensions decide the right shape per workload

Worked example — score a real workload across the matrix

ETL vs ELT interview question — defending the matrix in a design round

Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely

Worked example — model the 3-year TCO of ETL vs ELT for a mid-market workload

5. The hybrid pattern — EtLT (light pre-load + warehouse transforms)

etl elt pipeline in 2026 is almost always EtLT — light Et outside, heavy T inside

Worked example — hybrid Fivetran (column mask) → Snowflake raw → dbt staging → marts

EtLT interview question — defending the hybrid against pure ELT

Solution Using pre-load tokenisation to keep raw PII out of the warehouse entirely

Worked example — orchestrate the EtLT pipeline with Airflow + dbt

Cheat sheet — pick ETL vs ELT vs EtLT

Frequently asked questions

ETL vs ELT — which is faster?

Is ETL dead?

Does dbt count as ELT?

ETL vs ELT for compliance / GDPR?

Cost comparison — ETL vs ELT?

Can I do ETL on Snowflake?

Practice on PipeCode

`etl vs elt` collapses to a single decision: where does the `T` run — outside the warehouse or inside it?

`etl vs elt architecture` starts with the classic shape: dedicated compute outside the warehouse, only modelled rows inside

`etl elt difference` flips the order: load raw rows first, then transform them with the warehouse's own compute (dbt-native)

`etl vs elt architecture` is not a religion — six concrete dimensions decide the right shape per workload

`etl elt pipeline` in 2026 is almost always EtLT — light Et outside, heavy T inside