DEV Community

Cover image for 🔄 ETL vs. ELT: The Evolution of Data Integration
De' Clerke
De' Clerke

Posted on • Edited on

🔄 ETL vs. ELT: The Evolution of Data Integration

In ETL, you transform data before it lands in your warehouse. In ELT, you load raw data first and transform it inside the warehouse using its own compute. The distinction sounds simple, but it shapes your entire pipeline architecture: where transformation logic lives, what breaks when requirements change, and how much you trust your raw data.


ETL: Transform Before You Load

ETL (Extract, Transform, Load) is the traditional approach. Data is extracted from a source, cleaned and shaped by an external processing layer, and only then loaded into the target system.

The transformation happens in your pipeline code, not inside the database. Here's the pattern using Airflow's TaskFlow API:

from airflow.decorators import dag, task
from sqlalchemy import create_engine, text
from datetime import datetime
import requests, os

@dag(schedule="0 7 * * 1-5", start_date=datetime(2025, 1, 1), catchup=False)
def nse_etl_pipeline():

    @task
    def extract() -> list[dict]:
        response = requests.get("https://api.example.com/prices")
        response.raise_for_status()
        return response.json()["records"]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Transformation happens here, before the data reaches the database
        return [
            {
                "symbol":      r["symbol"].upper().strip(),
                "close_price": round(float(r["close"]), 4),
                "volume":      int(r["volume"]),
                "fetched_at":  datetime.utcnow().isoformat(),
            }
            for r in records
            if r.get("close") is not None and float(r["close"]) > 0
        ]

    @task
    def load(records: list[dict]) -> int:
        engine = create_engine(os.environ["DATABASE_URL"])
        with engine.connect() as conn:
            for r in records:
                conn.execute(text("""
                    INSERT INTO stock_prices (symbol, close_price, volume, fetched_at)
                    VALUES (:symbol, :close_price, :volume, :fetched_at)
                    ON CONFLICT (symbol, fetched_at) DO NOTHING
                """), r)
            conn.commit()
        return len(records)

    raw     = extract()
    cleaned = transform(raw)
    load(cleaned)

nse_etl_pipeline()
Enter fullscreen mode Exit fullscreen mode

The data that enters the database is already clean. Bad records, nulls, and wrong formats never reach storage.

When ETL makes sense:

  • Sensitive data that must be masked or filtered before storage
  • Strict compliance requirements where raw data can't sit in the warehouse
  • Target systems with limited compute (you can't transform inside them)
  • Relatively stable transformations that don't change often

ELT: Load Raw, Transform Inside

ELT (Extract, Load, Transform) is the modern, cloud-native approach. Raw data lands in the target system first, and transformations happen inside the warehouse using SQL. This is where tools like dbt come in.

The Airflow DAG still handles extraction and loading. dbt handles everything after:

# Airflow: extract and load raw data only
@task
def load_raw(records: list[dict]) -> int:
    engine = create_engine(os.environ["DATABASE_URL"])
    with engine.connect() as conn:
        for r in records:
            conn.execute(text("""
                INSERT INTO raw.stock_prices_raw (symbol, close, volume, fetched_at)
                VALUES (:symbol, :close, :volume, :fetched_at)
                ON CONFLICT DO NOTHING
            """), r)
        conn.commit()
    return len(records)
Enter fullscreen mode Exit fullscreen mode
-- dbt: transformation runs inside the warehouse
-- models/staging/stg_stock_prices.sql
{{ config(materialized='view') }}

SELECT
    id,
    UPPER(TRIM(symbol))         AS symbol,
    close_price::NUMERIC(12, 4) AS close_price,
    volume::BIGINT              AS volume,
    timestamp::TIMESTAMPTZ      AS ts,
    CURRENT_TIMESTAMP           AS dbt_updated_at
FROM {{ source('raw', 'stock_prices_raw') }}
WHERE symbol IS NOT NULL
  AND close_price > 0
Enter fullscreen mode Exit fullscreen mode
-- models/marts/mart_daily_summary.sql
{{ config(materialized='table') }}

SELECT
    symbol,
    DATE(ts)         AS trade_date,
    MIN(close_price) AS low,
    MAX(close_price) AS high,
    AVG(close_price) AS avg_price,
    SUM(volume)      AS total_volume
FROM {{ ref('stg_stock_prices') }}
GROUP BY symbol, DATE(ts)
Enter fullscreen mode Exit fullscreen mode

Because the raw data is preserved, you can re-run transformations whenever requirements change without re-extracting from the source. The history is always there.

When ELT makes sense:

  • Cloud warehouse with sufficient compute (BigQuery, Snowflake, DuckDB, PostgreSQL)
  • Analytical requirements that change frequently
  • You want to preserve raw data for future re-processing
  • Team uses dbt for transformation versioning and testing

The Medallion Architecture

Most production data stacks use a layered approach that combines both patterns. This is commonly called the Medallion Architecture, organizing data into three layers:

Bronze (Raw): Data lands here exactly as it came from the source. No transformations, no filtering. This is your source of truth for re-processing.

Silver (Cleaned): Data is validated, standardized, and joined. Null values handled, types cast, duplicates removed. This is what dbt staging models produce.

Gold (Business-Ready): Aggregated, business-logic-applied, ready for dashboards and reports. This is what dbt mart models produce.

-- Bronze: raw insert from Airflow
INSERT INTO bronze.stock_prices_raw (raw_payload, ingested_at)
VALUES (:payload, NOW());

-- Silver: cleaned staging model (dbt view)
SELECT
    UPPER(TRIM(symbol))    AS symbol,
    close_price::NUMERIC   AS close_price,
    timestamp::TIMESTAMPTZ AS ts
FROM bronze.stock_prices_raw
WHERE close_price > 0 AND symbol IS NOT NULL;

-- Gold: aggregated mart (dbt table)
SELECT
    symbol,
    DATE(ts)    AS trade_date,
    MAX(close_price) AS high,
    MIN(close_price) AS low,
    SUM(volume) AS total_volume
FROM silver.stg_stock_prices
GROUP BY symbol, DATE(ts);
Enter fullscreen mode Exit fullscreen mode

The Airflow DAG orchestrates extraction and bronze loading. dbt handles silver and gold. Each layer is independently testable and rerunnable.


Comparison

ETL ELT
Where transformation happens External engine (Python, Spark) Inside the warehouse (SQL, dbt)
Raw data preserved No (only cleaned data stored) Yes (Bronze layer)
Re-processing on logic change Re-extract from source Re-run dbt models
Flexibility for changing requirements Lower Higher
Data quality guarantee Enforced before load Enforced inside warehouse
Best fit Compliance-heavy, sensitive data Cloud-native, analytics-first

The Practical Decision

If your source data is sensitive and must never land raw in storage, ETL. If you're on a cloud warehouse and your business logic changes every quarter, ELT. Most modern data stacks use ELT with a Medallion structure because the ability to replay transformations from preserved raw data is worth more than the simplicity of cleaning data before it arrives.


Follow me on dev.to for more data engineering content, or browse the project code at github.com/declerke.

Top comments (0)