DEV Community

Cover image for ETL Tools Compared: Airflow, dbt, Fivetran, Glue, Talend, Informatica — A Deep Engineering Guide
Gowtham Potureddi
Gowtham Potureddi

Posted on

ETL Tools Compared: Airflow, dbt, Fivetran, Glue, Talend, Informatica — A Deep Engineering Guide

etl tools comparison has become a category — not a single decision — because the modern etl tool stack spans four very different jobs: workflow orchestration (Airflow), in-warehouse transformation (dbt), saas data ingest (Fivetran), and full-platform etl (AWS Glue, Talend, Informatica). The best etl tools 2026 lists keep clumping these together as if they competed head-to-head, but in production they layer on top of each other. This deep-guide counterpart to a feature-matrix comparison walks the etl tools list one tool at a time — DAGs, executors, models, materializations, connectors, MAR billing, DPU-hour pricing, IPU consumption, and the etl architecture patterns that bind them — so you can defend each pick the way a staff data engineer defends one in a system-design round.

Every section below uses the data engineering interview questions teaching shape: a one-paragraph etl explained preamble dense with keyword phrases, two or three ### Title sub-sections that unpack the moving parts (DAG anatomy, executor choices, model layering, schema-drift handling, the Glue Data Catalog, the Talend Studio canvas, the IDMC console), and at least one #### Worked example block that ships a runnable snippet plus a step-by-step trace. We close with a ### Solution flavoured pricing comparison worked example (a 1 TB / day pipeline costed across all six tools), a ### Decision framework ASCII tree, and three reference data engineering stack patterns (Fivetran + dbt + Airflow, pure AWS with Glue + Step Functions, legacy enterprise with Informatica IDMC) so the etl pipeline tools decision lands as a concrete pick, not a vendor brochure.

PipeCode blog header for a deep-dive ETL tools guide — bold white headline 'ETL Tools · Complete Guide' with subtitle 'Airflow · dbt · Fivetran · Glue · Talend · Informatica' and a stylised seven-step layered pipeline (sources → SaaS ingest → orchestration → transformation → warehouse → governance → BI) with six small tool-logo cards arranged along the pipeline, on a dark gradient with purple, orange, green, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps on the SQL, modeling, and ETL skills every data engineering loop tests, browse ETL practice library →, drill data-modeling drills →, sharpen Python ETL patterns →, rehearse aggregation problems →, reinforce joins drills →, or widen coverage on the full SQL practice library →.


On this page


1. Why the ETL toolscape is fragmented in 2026

The etl toolscape is fragmented because there are four distinct jobs inside a modern pipeline — ingest, orchestrate, transform, and govern — and no single vendor has won all four cleanly. The etl tools list that interview prep articles love to recite (Airflow, dbt, Fivetran, Glue, Talend, Informatica, Airbyte, Stitch, Matillion, Prefect, Dagster, SQLMesh) is really four small lists glued together by the etl pipeline label. Once you see the four-category taxonomy, every etl tools comparison prompt — "should we use airflow vs dbt?", "is fivetran vs airbyte about cost or control?", "glue vs talend vs informatica for a regulated enterprise?" — collapses into "which category does this question live in?"

This first section frames the etl tool taxonomy, explains the modern data stack shape that drives 2026 defaults, and names the etl architecture trade-offs the rest of the guide unpacks. Treat it as the map; sections 2–5 are the territory.

The four-category etl tool taxonomyorchestration · transformation · el ingest · full etl

The etl tool selection decision starts with a category, not a tool. Map every tool onto exactly one of the four layers below — that is the etl tools comparison mental model every senior data engineer carries.

  • Orchestration (workflow orchestration) — schedules and monitors other jobs. Airflow is the dominant choice; Prefect and Dagster are modern alternatives. Hands off the actual data movement to operators (Python, Spark, dbt, Fivetran triggers). The unit of work is a dag (directed acyclic graph) of tasks.
  • In-warehouse transformation (transformation tools) — runs SQL inside the warehouse. dbt is the standard; SQLMesh is the newer challenger. Reads from raw tables, writes to mart tables, never moves data across systems. The unit of work is a model (a SELECT statement materialised as a view or table).
  • EL / SaaS ingest (saas data ingest) — pulls data from operational sources into the warehouse. Fivetran is the canonical managed SaaS; Airbyte is the open-source alternative; Stitch is the simpler budget option. Pre-built connectors are the moat — every connector you don't write is engineer-time saved. Pairs with dbt for the T (EL + T, not ETL).
  • Full ETL platforms (enterprise etl) — ship ingest, transform, and orchestration in one product. AWS Glue is the AWS-native serverless Spark choice; Talend is the long-tenured GUI-first option; Informatica is the Fortune-500 enterprise incumbent (IDMC SaaS, PowerCenter legacy). These platforms trade composability for convenience.

Why the taxonomy beats a one-dimensional ranking.

  • Airflow and dbt are not competitors — they live in different layers; many real pipelines run both.
  • Fivetran and Airflow are not competitors — Fivetran moves data in, Airflow schedules the downstream T and load.
  • Glue and Airflow are not competitors eitherStep Functions orchestrates Glue jobs in pure-AWS stacks; Airflow orchestrates Glue jobs in multi-cloud or hybrid stacks.
  • Talend and Informatica are near-competitors — both are GUI-first full-ETL platforms; the pick is usually ecosystem + price.
  • Fivetran and Airbyte are direct competitors — same shape (managed connectors), different commercial model (closed SaaS vs open-source).

Why no single ETL tool wins — the four pressures

A single etl tools list ranking always lies because four different pressures pull buyers in different directions.

  • Composability vs convenience — code-first composable stacks (Airflow + dbt + Fivetran) win for engineering-led teams; GUI-first all-in-one platforms (Talend, Informatica) win for low-code teams.
  • Cloud lock vs portabilityGlue is AWS-only; Fivetran is cloud-only SaaS; Airflow / dbt / Talend / Informatica run multi-cloud or hybrid.
  • Pricing unit — free OSS, per-developer seat, MAR (Monthly Active Rows), $/DPU-hour, per-user license, IPU (Informatica Processing Unit) consumption — five different cost shapes that don't compare apples-to-apples.
  • Learning curvedbt (SQL + YAML) is fastest; Airflow and Glue are medium (Python / PySpark); Talend and Informatica carry the steepest GUI learning curves.

The 2026 default — Fivetran + dbt + Airflow as the modern data stack

The modern data stack has standardised around Fivetran for ingest + dbt for transform + Airflow for orchestration on top of a cloud warehouse (Snowflake, BigQuery, Databricks, or Redshift). Greenfield mid-market teams default here because:

  • Connectors are someone else's problem — Fivetran ships 500+, you ship zero.
  • Transformation stays in SQL — every SQL-fluent engineer can contribute, and the warehouse compute does the heavy lifting.
  • Orchestration is Python-first — Airflow DAGs are code, version-controlled, and reviewable like every other repo.
  • Total tooling cost is predictable — dbt Cloud + Fivetran Starter + Astronomer or MWAA lands in the $3–10k / month range for most teams.

Legacy Fortune-500 enterprises run a different default — Informatica IDMC + Airflow + cloud warehouse — because their regulated data-governance posture maps onto Informatica's catalog and lineage tooling. Pure-AWS shops often run Glue + Step Functions + Glue Data Catalog to avoid OSS Airflow operations entirely.

Worked example — placing eight real tools into the four-category taxonomy

Detailed explanation. A new data engineer joins a team and is told "we use Airflow, dbt, Fivetran, Snowflake, Looker, Glue, Soda, and Great Expectations." Before doing anything, the right first move is to place each tool into one of the four ETL categories (plus the adjacent warehouse, BI, and quality categories). This is the etl tool taxonomy drill every onboarding interview runs.

Question. Categorise the following eight tools and explain which layer of the pipeline they own: Airflow, dbt, Fivetran, Snowflake, Looker, AWS Glue, Soda, Great Expectations.

Input.

Tool Surface API Cloud model What it touches
Airflow Python DAGs OSS / MWAA / Astronomer Schedules tasks
dbt SQL + YAML Core (CLI) / Cloud (SaaS) Compiles SQL inside warehouse
Fivetran Web UI SaaS-only Pulls from SaaS apps and DBs
Snowflake SQL SaaS warehouse Stores raw + mart tables
Looker LookML SaaS BI Reads marts, renders dashboards
AWS Glue PySpark + visual AWS-managed serverless Crawls, transforms, loads
Soda YAML checks OSS / Cloud Asserts data quality
Great Expectations Python suites OSS Asserts data quality

Code (a one-screen mental-model snippet you can keep in a README).

# Layer 1: Sources (operational systems — Salesforce, Stripe, Postgres, etc.)
#   |
#   v
# Layer 2: EL / SaaS ingest -----> Fivetran (managed) | Airbyte (OSS)
#   |
#   v
# Layer 3: Warehouse / Lakehouse -> Snowflake | BigQuery | Databricks | Redshift
#   |
#   v
# Layer 4: Transformation --------> dbt models (staging -> intermediate -> marts)
#   |                              (or AWS Glue for AWS-native Spark ETL)
#   v
# Layer 5: Orchestration ---------> Airflow DAGs trigger 2, 3, 4 (or Step Functions on AWS)
#   |
#   v
# Layer 6: Quality ---------------> Great Expectations / Soda assert at every layer
#   |
#   v
# Layer 7: BI / Activation -------> Looker / Mode / Tableau read marts
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Airflow is the only orchestration tool in the list — it sits across the pipeline and triggers everything below.
  2. Fivetran owns Layer 2 (EL / SaaS ingest) — pure ingest, no transforms.
  3. Snowflake is the warehouse — not an ETL tool itself, but the substrate every layer above and below reads / writes.
  4. dbt owns Layer 4 (in-warehouse transformation) — runs SQL inside Snowflake; never moves data out.
  5. AWS Glue is a full-ETL platform — in this stack it's not used end-to-end because Fivetran + dbt already cover ingest + transform. It might run a one-off PySpark batch (e.g. crawl S3 logs into Snowflake).
  6. Looker is BI (Layer 7) — reads dbt marts; not an ETL tool.
  7. Soda and Great Expectations are quality tools (Layer 6) — they check what the ETL tools produce; not ETL themselves.

Output (a one-line category map you can paste into a data-stack.md).

Airflow            -> Orchestration
dbt                -> Transformation (in-warehouse)
Fivetran           -> EL / SaaS ingest
Snowflake          -> Warehouse (adjacent, not ETL)
Looker             -> BI (adjacent, not ETL)
AWS Glue           -> Full ETL (used selectively for S3 + Spark batches)
Soda               -> Quality (adjacent, not ETL)
Great Expectations -> Quality (adjacent, not ETL)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: if you can't place a tool into the four-category etl tool taxonomy in one breath, you don't yet understand what problem it solves — and you won't defend it well in a system-design interview either.

SQL / Python
Topic — etl
ETL practice library

Practice →

Data modeling
Topic — dimensional-modeling
Dimensional modeling drills

Practice →


2. Orchestration deep-dive — Airflow (with Prefect + Dagster sidenotes)

apache airflow is the single most-tested workflow orchestration tool in data engineering interview questions, and for one reason: every other layer of the modern data stack hands off scheduling to it. Whether the actual work is a Fivetran sync, a dbt build, a PySpark batch, or a bash export, the airflow dag is what owns the when and the if it failed, retry it. If you can explain DAG, operator, executor, scheduler, metadata DB, and sensor in one minute, you've cleared the bar for most senior orchestration questions.

This section walks the airflow internals every senior interview round tests: DAG anatomy, the four operator families, the three executor choices (Celery, Kubernetes, Local), the scheduler + metadata DB pair, sensors, XComs, dynamic task mapping, and the three managed airflow hosting options (MWAA, Astronomer, Cloud Composer). It ends with sidenotes on Prefect and Dagster — the two modern challengers — and a worked example that wires a five-task daily_etl DAG.

Visual diagram of Apache Airflow DAG anatomy — a single 'Pipeline DAG' card on top spanning the diagram width; below it five task nodes connected by arrows (sense_source → load_raw → transform → quality_check → publish); a small executor card on the right (Celery / Kubernetes / Local) plus a metadata DB and webserver/scheduler chips at the bottom; on a light PipeCode card.

DAG anatomytasks, operators, dependencies, schedule_interval

A dag (directed acyclic graph) is a Python file that defines what runs, when it runs, and what depends on what. Five primitives cover 95% of airflow dag interview questions.

  • DAG object — the top-level container; carries dag_id, schedule_interval, start_date, catchup, tags, and default_args.
  • Task — a single unit of work; an instance of an operator class. Tasks have task_id, retries, retry_delay, and timeouts.
  • Operator — the what of a task; the four families are BashOperator, PythonOperator, provider operators (SnowflakeOperator, DbtRunOperator, S3ToRedshiftOperator), and sensors (S3KeySensor, ExternalTaskSensor).
  • Dependencies — set with >> (downstream) and << (upstream); t1 >> t2 >> t3 makes a three-task chain. set_upstream / set_downstream are the explicit forms.
  • schedule_interval — a cron expression ("0 7 * * *"), a preset ("@daily"), a timedelta(hours=1), or None (manual / triggered).

The minimal DAG.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_callable(**ctx):
    print(f"Transforming for execution_date={ctx['ds']}")

default_args = {
    "owner": "data-eng",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}

with DAG(
    dag_id="daily_etl",
    description="Daily ETL — extract, transform, load, notify",
    schedule_interval="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args=default_args,
    tags=["etl", "daily"],
) as dag:

    extract   = BashOperator(task_id="extract",   bash_command="python /opt/scripts/extract.py")
    transform = PythonOperator(task_id="transform", python_callable=transform_callable)
    load      = BashOperator(task_id="load",      bash_command="python /opt/scripts/load.py")
    notify    = BashOperator(task_id="notify",    bash_command="curl -X POST $SLACK_HOOK")

    extract >> transform >> load >> notify
Enter fullscreen mode Exit fullscreen mode
  • catchup=False — don't backfill missed runs on first deploy; almost always what you want.
  • retries=3 + retry_delay — survives transient infra blips without paging on-call.
  • >> chain — reads top-to-bottom; the four tasks run sequentially.
  • @daily — preset equivalent to "0 0 * * *" (midnight UTC); @hourly, @weekly, @monthly exist too.

ExecutorsLocal, Celery, Kubernetes, CeleryKubernetes

The airflow executor is the how and where tasks run. Pick by scale + ops budget.

  • LocalExecutor — runs tasks as subprocesses on the scheduler node; fine for <50 concurrent tasks, single-machine dev, small teams.
  • CeleryExecutor — distributes tasks across a pool of worker nodes via a Celery broker (Redis or RabbitMQ); the historic production default for self-hosted Airflow.
  • KubernetesExecutor — every task runs as a fresh pod on a Kubernetes cluster; perfect isolation, auto-scale, per-task resource limits. The 2026 production default.
  • CeleryKubernetesExecutor — hybrid; short tasks go to Celery workers, heavy tasks to fresh pods. Niche.

Why KubernetesExecutor won the 2026 default. Each task gets a clean pod with its own requirements.txt / Docker image, resources are billed per second, failed pods don't pollute the next run, and Kubernetes' built-in autoscaling means you pay zero for idle capacity overnight.

Scheduler + metadata DB — how Airflow remembers everything

Airflow has three long-running services and one database.

  • scheduler — parses DAG files every scheduler_heartbeat_sec, decides which task instances are ready, and pushes them to the executor. The scheduler is the brain.
  • webserver — Flask UI that reads the metadata DB; lets engineers trigger / pause / inspect DAGs.
  • triggerer — runs deferrable operators (async sensors that don't block a worker slot); a 2.2+ feature.
  • metadata DB — Postgres (production) or MySQL; remembers every DAG run, task instance, XCom value, connection, variable, pool, and log location. The metadata DB is the source of truth — if you lose it, you lose state.

Sensors, XComs, dynamic task mapping — the three power features

  • Sensors — tasks that wait for an external condition (a file lands in S3, an upstream DAG finishes, a Snowflake row appears). S3KeySensor, ExternalTaskSensor, SqlSensor are the staples. mode="reschedule" frees the worker slot between pokes; deferrable=True moves the wait to the triggerer entirely.
  • XComs (cross-communication) — small key-value blobs that one task pushes and another pulls. Limited to ~48KB per row (stored in the metadata DB); for larger payloads use S3 / GCS as the transport with the XCom holding a URI.
  • Dynamic task mappingMyOp.partial(...).expand(...) (Airflow 2.3+) fans out N parallel task instances from a list of inputs. The pythonic alternative to manually generating tasks at DAG-parse time.

Managed AirflowMWAA, Astronomer, Cloud Composer

Self-hosting Airflow on Kubernetes is non-trivial. Three managed options dominate.

  • Amazon MWAA (Managed Workflows for Apache Airflow) — AWS-native; you pick mw1.small/medium/large instance sizes; metadata DB and scheduler are managed; you bring the DAG repo + plugins.
  • Astronomer — vendor-neutral managed Airflow on Astronomer Cloud or your own Kubernetes; ships Astro CLI and Astro Runtime (their hardened image); strong developer experience.
  • Google Cloud Composer — GCP-native managed Airflow on GKE; tight integration with BigQuery, Dataflow, and Cloud Functions.

Prefect + Dagster — the two modern challengers

  • Prefect — Python-first orchestrator; replaces DAGs with @flow and @task decorators; Prefect Cloud is the managed control plane. Strength: better dynamic-workflow ergonomics (task.map(...) first-class). Weakness: smaller ecosystem of provider operators.
  • Dagster — opinionated about software-defined assets; every materialised dataset is an asset, and the DAG is the asset graph. Strength: lineage and asset observability are first-class. Weakness: steeper learning curve; smaller community.

Both are credible alternatives in greenfield 2026 builds. Airflow still wins on ecosystem breadth (200+ provider packages, every cloud, every tool) and is what interviewers reach for by default.

Worked example — a five-task daily_etl Airflow DAG with sensor + dbt + Snowflake

Detailed explanation. The canonical production DAG shape is sense → load → transform → quality-check → publish. Each task uses a different operator family — S3KeySensor waits for the file, S3ToSnowflakeOperator loads it, DbtCloudRunJobOperator transforms, a PythonOperator runs a quality check, and a BashOperator notifies Slack. This worked example wires all five with the >> chain.

Question. Build an Airflow 2.7+ DAG daily_etl that runs at 07:00 UTC daily and (1) waits for s3://lake/raw/{ds}.csv to appear, (2) loads it into RAW.ORDERS in Snowflake, (3) triggers a dbt Cloud job, (4) runs a row-count quality check, (5) notifies Slack. The DAG must retry 3× with a 5-minute back-off and not backfill historical dates on first deploy.

Input.

Source Path / connection Notes
S3 lake s3://lake/raw/{ds}.csv One file per day
Snowflake SNOWFLAKE_DEFAULT connection Loads into RAW.ORDERS
dbt Cloud job id 12345 Runs the prod environment
Slack webhook URL in Variable.get("slack_hook") Notify on success

Code.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def quality_check_callable(**ctx):
    hook = SnowflakeHook(snowflake_conn_id="SNOWFLAKE_DEFAULT")
    row_count = hook.get_first("SELECT COUNT(*) FROM RAW.ORDERS")[0]
    if row_count < 100:
        raise ValueError(f"Row count too low: {row_count}")
    print(f"Quality check passed — {row_count} rows")

default_args = {
    "owner": "data-eng",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}

with DAG(
    dag_id="daily_etl",
    schedule_interval="0 7 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args=default_args,
    tags=["etl", "daily", "snowflake"],
) as dag:

    sense_source = S3KeySensor(
        task_id="sense_source",
        bucket_key="raw/{{ ds }}.csv",
        bucket_name="lake",
        aws_conn_id="aws_default",
        mode="reschedule",
        poke_interval=300,
        timeout=60 * 60 * 6,
    )

    load_raw = S3ToSnowflakeOperator(
        task_id="load_raw",
        snowflake_conn_id="SNOWFLAKE_DEFAULT",
        s3_keys=["raw/{{ ds }}.csv"],
        table="RAW.ORDERS",
        stage="LAKE_STAGE",
        file_format="(TYPE = CSV, FIELD_DELIMITER=',', SKIP_HEADER=1)",
    )

    transform = DbtCloudRunJobOperator(
        task_id="transform",
        dbt_cloud_conn_id="dbt_cloud_default",
        job_id=12345,
        wait_for_termination=True,
    )

    quality_check = PythonOperator(
        task_id="quality_check",
        python_callable=quality_check_callable,
    )

    publish = BashOperator(
        task_id="publish",
        bash_command=(
            "curl -X POST -H 'Content-Type: application/json' "
            "-d '{\"text\":\"daily_etl finished for {{ ds }}\"}' "
            f"{Variable.get('slack_hook')}"
        ),
    )

    sense_source >> load_raw >> transform >> quality_check >> publish
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. S3KeySensor waits for s3://lake/raw/2026-05-26.csv to appear; mode="reschedule" frees the worker slot between 5-minute pokes, so it doesn't burn a Celery / pod slot for 6 hours.
  2. S3ToSnowflakeOperator issues a COPY INTO RAW.ORDERS FROM @LAKE_STAGE/raw/2026-05-26.csv once the sensor passes.
  3. DbtCloudRunJobOperator triggers dbt Cloud job 12345 and polls until it finishes; failure here fails the task with the dbt run URL in the log.
  4. PythonOperator runs an inline SELECT COUNT(*) quality check; raises if fewer than 100 rows landed — exception kills the DAG before notification.
  5. BashOperator posts to Slack with the {{ ds }} Jinja-templated date; runs only if the previous four succeeded.

Output (one DAG run in the Airflow UI).

2026-05-26 07:00:00  sense_source  ── success (waited 0m17s)
2026-05-26 07:00:17  load_raw      ── success (loaded 14,532 rows)
2026-05-26 07:00:42  transform     ── success (dbt Cloud run 9876)
2026-05-26 07:03:11  quality_check ── success (14,532 rows >= 100)
2026-05-26 07:03:12  publish       ── success
Total elapsed: 3m12s
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: every senior airflow dag ends with sense → do work → check → notify; tasks that skip the check step are why pipelines silently produce wrong dashboards.

Python
Topic — etl
Airflow + ETL drills

Practice →

Python
Topic — design
Workflow design problems

Practice →


3. Transformation deep-dive — dbt (with SQLMesh sidenote)

dbt (data build tool) is the in-warehouse transformation standard, and its design choice — every transformation is a SELECT statement compiled by dbt and executed by the warehouse — is what made it the second-most-tested tool after Airflow in modern data engineering interview questions. dbt does not move data across systems; it never spins up its own compute. The warehouse runs the SQL, dbt orchestrates the compilation, dependency graph, materialization, testing, and documentation. If Airflow is the when, dbt is the what.

This section unpacks the dbt project anatomy every interview round tests: models, sources, refs, the four materializations (view, table, incremental, ephemeral), the canonical staging → intermediate → marts layering, tests, snapshots, dbt Cloud vs dbt Core, and dbt Mesh. It closes with a SQLMesh sidenote — the modern challenger that adds first-class column-level lineage and virtual-environment dev/prod separation.

Visual diagram of a dbt project — sources on the left, four staging models, three intermediate models, and three mart models on the right, all connected by ref() arrows; an in-warehouse compute icon (Snowflake / BigQuery / Databricks) sits below the project; a thin tests + docs ribbon runs underneath; on a light PipeCode card.

Models, refs, and sources — the three primitives of every dbt project

A dbt model is a .sql file containing a single SELECT. dbt compiles each model with ref() and source() macros to resolve dependencies and physical table names at runtime.

  • modelmodels/staging/stg_orders.sql contains select order_id, ... from {{ source('raw', 'orders') }}. dbt compiles this into select order_id, ... from raw.orders and runs CREATE OR REPLACE VIEW analytics.stg_orders AS ... (or table, depending on materialization).
  • sourcesources.yml declares external tables that dbt does not own. {{ source('raw', 'orders') }} resolves to raw.orders and lets dbt source freshness check that the upstream data is fresh.
  • ref{{ ref('stg_orders') }} resolves to the physical table name of another model. Refs are how dbt builds the DAG; every ref() becomes an edge in the dependency graph.

Why ref() is the load-bearing primitive.

  • Environment isolationdev, staging, prod all resolve ref() to the right schema (e.g. dev_alice.stg_orders vs analytics.stg_orders).
  • Dependency graphdbt run runs models in topological order based on ref() calls; you never hand-code an execution order.
  • No hard-coded table names — refactoring a model name updates every downstream consumer automatically.

Materializationsview, table, incremental, ephemeral

The dbt materialization decides how the compiled SQL becomes a physical artifact in the warehouse. Four choices cover 99% of projects.

  • viewCREATE OR REPLACE VIEW analytics.stg_orders AS SELECT .... Fastest to materialise (just metadata); slow to query (re-runs every time). Default for staging.
  • tableCREATE OR REPLACE TABLE analytics.fct_orders AS SELECT .... Slowest to materialise (full scan + write); fastest to query. Default for marts.
  • incremental — first run creates a table; subsequent runs MERGE only new / changed rows. Required for any model on top of an append-only raw table with >100M rows. Driven by is_incremental() macro and a unique_key.
  • ephemeral — no physical artifact; the SQL is inlined as a CTE in downstream models. Saves storage; loses observability. Niche.

The incremental shape every dbt interview tests.

{{ config(
    materialized = 'incremental',
    unique_key   = 'order_id',
    on_schema_change = 'fail'
) }}

select
    order_id,
    customer_id,
    amount,
    order_ts,
    current_timestamp() as dbt_loaded_at
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
  where order_ts > (select max(order_ts) from {{ this }})
{% endif %}
Enter fullscreen mode Exit fullscreen mode
  • {% if is_incremental() %} — runs only after the first build; gates the WHERE filter that bounds the merge to new rows.
  • unique_key = 'order_id' — the natural key dbt uses to MERGE (UPDATE if exists, INSERT otherwise).
  • on_schema_change = 'fail' — refuses to deploy a model whose schema changed without explicit migration; the safest default.

Staging → intermediate → marts — the canonical dbt layering

Every senior dbt project enforces the same three-layer shape.

  • staging (stg_*) — one-to-one with each source table; renames columns, casts types, light cleaning. Materialised as view. Naming: stg_<source>__<entity> (e.g. stg_stripe__charges).
  • intermediate (int_*) — composes staging models; pre-joins, pre-aggregates, derives flags. Materialised as view or ephemeral. Hidden from the BI tool.
  • marts (fct_* / dim_*) — business-facing facts and dimensions; the only layer BI tools query. Materialised as table or incremental. Naming: fct_orders, dim_users, fct_funnel.

Why the layering matters in interviews.

  • Single source of truth — every concept lives in exactly one mart; downstream consumers reuse it.
  • Refactor safety — renaming a source column updates only the matching stg_* model; downstream consumers don't break.
  • Test isolation — schema tests on stg_* catch source-data regressions early; tests on fct_* catch business-logic regressions.

Tests, snapshots, seeds — the dbt quality toolkit

  • tests — generic (unique, not_null, accepted_values, relationships) declared in schema.yml; or singular (tests/*.sql) one-off SELECT statements that fail if any row is returned. Run with dbt test.
  • snapshots — type-2 SCD (slowly changing dimension) capture; dbt snapshot writes a row every time tracked columns change; gives you "what did this row look like on 2026-03-01?" queries.
  • seeds — CSV files committed to the repo and loaded as tables via dbt seed; perfect for small reference tables (country_code → region, fee-tier lookups).

dbt Cloud vs dbt Core

  • dbt Core — the OSS CLI; free; runs locally or under Airflow / Dagster / Prefect. The base layer.
  • dbt Cloud — the managed SaaS; adds an IDE, scheduling, hosted docs, CI/CD on PR, semantic layer, and lineage UI. Priced per-developer-seat (~$50/seat/month) plus a Team / Enterprise jump for advanced features.
  • dbt Mesh — multi-project, multi-team support; lets a marketing project depend on a finance project's published models via cross-project ref(). The 2024+ enterprise pattern.

SQLMesh — the modern challenger

SQLMesh is a newer transformation tool that fixes two dbt pain points: (1) first-class virtual environments so dev and prod share data physically but isolate logically (no more expensive dev clones), and (2) column-level lineage baked into the planner. The teaching shape is similar (models, refs, materializations), but the planner is more aggressive — it produces a plan you review before applying, like Terraform. Worth knowing for 2026 greenfield builds; dbt still wins on ecosystem and hiring market.

Worked example — a three-layer dbt project for an e-commerce funnel

Detailed explanation. This worked example ships the minimum production-shaped dbt project: one source declaration, two staging models, one intermediate model, one fact mart, and one schema test. The pattern is what every senior dbt project looks like — staging is thin and renames-only; intermediate composes; marts are business-facing.

Question. Given a raw schema with raw.orders(order_id, customer_id, amount_cents, status, order_ts) and raw.customers(customer_id, region, signup_ts), build a dbt project that produces analytics.fct_paid_orders_by_region with one row per region containing paid_orders and paid_revenue_usd.

Input.

Source table Columns Volume
raw.orders order_id, customer_id, amount_cents, status, order_ts ~10M / day
raw.customers customer_id, region, signup_ts ~100M total

Code.

models/staging/stg_orders.sql

{{ config(materialized='view') }}

select
    order_id::varchar       as order_id,
    customer_id::varchar    as customer_id,
    amount_cents / 100.0    as amount_usd,
    lower(status)           as status,
    order_ts::timestamp     as order_ts
from {{ source('raw', 'orders') }}
Enter fullscreen mode Exit fullscreen mode

models/staging/stg_customers.sql

{{ config(materialized='view') }}

select
    customer_id::varchar as customer_id,
    region::varchar      as region,
    signup_ts::timestamp as signup_ts
from {{ source('raw', 'customers') }}
Enter fullscreen mode Exit fullscreen mode

models/intermediate/int_orders_enriched.sql

{{ config(materialized='ephemeral') }}

select
    o.order_id,
    o.customer_id,
    o.amount_usd,
    o.status,
    o.order_ts,
    c.region
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c
  on c.customer_id = o.customer_id
Enter fullscreen mode Exit fullscreen mode

models/marts/fct_paid_orders_by_region.sql

{{ config(materialized='table') }}

select
    region,
    count(*)        as paid_orders,
    sum(amount_usd) as paid_revenue_usd
from {{ ref('int_orders_enriched') }}
where status = 'paid'
group by region
Enter fullscreen mode Exit fullscreen mode

models/marts/schema.yml

version: 2

models:
  - name: fct_paid_orders_by_region
    description: "Paid orders + revenue aggregated by region."
    columns:
      - name: region
        tests: [unique, not_null]
      - name: paid_orders
        tests: [not_null]
      - name: paid_revenue_usd
        tests: [not_null]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. stg_orders is a view that renames amount_cents → amount_usd and lowercases status — pure normalisation, no joins.
  2. stg_customers is a view that re-types customer_id and signup_ts — pure normalisation.
  3. int_orders_enriched is ephemeral — it joins the two staging models but never materialises; the mart inlines its CTE.
  4. fct_paid_orders_by_region is a table — filters to status = 'paid', groups by region, sums revenue. The only model the BI tool queries.
  5. schema.yml declares three tests on the mart — unique + not_null on region, not_null on the two metrics. dbt test fails the run if any test fails.

Output (the mart after dbt build).

region paid_orders paid_revenue_usd
US 14,532 1,247,850.00
EU 9,128 765,432.50
APAC 5,341 412,108.75

Rule of thumb: a senior dbt review reads three files — sources.yml (where data comes from), models/marts/*.sql (what the business sees), and schema.yml (what we promise is true). Everything else is plumbing.

SQL
Topic — joins
SQL joins drills

Practice →

Data modeling
Topic — dimensional-modeling
dbt + dimensional modeling

Practice →


4. EL / SaaS ingest deep-dive — Fivetran (with Airbyte + Stitch sidenotes)

Fivetran is the saas data ingest tool that taught the industry to say "EL, not ETL" — by separating extract + load (its job) from transform (dbt's job), Fivetran made ingest a buy-not-build decision for the first time. If you've ever copied a Salesforce API client into your repo and watched it rot, you understand the fivetran value proposition in one sentence: every connector you don't write is engineer-time saved, every schema-drift bug you don't debug is a Friday night saved.

This section unpacks the fivetran internals every modern data engineering interview questions round circles back to: the connector model, MAR (Monthly Active Rows) billing, schema drift handling, incremental sync, change data capture (CDC), when Fivetran beats a hand-rolled ingest, and the Airbyte + Stitch alternatives. It closes with a worked example that wires a Salesforce → Snowflake sync with a downstream dbt staging layer.

Visual diagram of a Fivetran-style SaaS ingest tool — a left column of source connector cards (Salesforce, Stripe, Postgres CDC, Google Ads, NetSuite), a middle Fivetran orchestrator card with sync scheduling, a right column of destination cards (Snowflake, BigQuery, Databricks, Redshift); a small 'MAR · Monthly Active Rows' billing pill floats above; on a light PipeCode card.

Connector model — 500+ pre-built integrations as the moat

A fivetran connector is a managed integration that handles three things on your behalf: authentication (OAuth, API keys, IAM roles), schema discovery (introspecting the source API to derive a table layout), and incremental sync (only pulling rows that changed since the last run). 500+ pre-built connectors span SaaS apps (Salesforce, HubSpot, Stripe, Zendesk), databases (Postgres CDC, MySQL, SQL Server, Oracle), event streams (Kafka, Kinesis), and files (S3, Google Drive).

The connector contract.

  • AuthenticationFivetran stores credentials in its control plane; you configure once via the web UI.
  • Schema discovery — connector introspects the source (Salesforce describe API, Postgres information_schema) and creates one destination table per source object, with columns matching the source schema.
  • Incremental sync — every connector tracks a per-table cursor (a timestamp, a change-log offset, or a CDC LSN) and only pulls rows after that cursor on each run.
  • Sync frequency — configurable from 1 minute (Business Critical) to 24 hours; default 5 minutes.

MAR — Monthly Active Rows billing, explained

MAR (Monthly Active Rows) is Fivetran's billing unit and the source of every "why is our Fivetran bill huge?" Slack thread. A row counts as active in a month if it is inserted, updated, or deleted during that month — not the total row count of the table.

  • A 10M-row dimension that doesn't change in a month = 0 MAR.
  • A 1M-row order table where every row updates once = 1M MAR.
  • A 100k-row table that fully refreshes daily = 3M MAR (100k × 30 days).

The four price tiers.

Tier What it adds Pricing slope
Free 500k MAR / month, 5-min sync $0
Starter All Free + standard connectors ~$1 per 1k MAR (declining with volume)
Standard + Database connectors + 1-min sync ~$1.50 per 1k MAR
Enterprise + Business Critical + advanced security ~$2 per 1k MAR

MAR optimisation patterns.

  • Disable columns you don't needFivetran lets you de-select columns per table; fewer columns sometimes means fewer MAR (depends on the connector).
  • Disable tables you don't need — every selected table contributes MAR; audit quarterly.
  • Prefer history mode offhistory mode doubles MAR (tracks updates as new rows); only enable on slowly-changing dimensions you need to audit.
  • Use Postgres CDC carefully — high-write Postgres tables can blow MAR; consider Airbyte CDC (cheaper) or a hand-rolled Debezium + Kafka pipeline for very high write volumes.

Schema drift handling — the load-bearing differentiator

Schema drift is what happens when a source adds a column, renames a column, changes a column's type, or drops a column. Hand-rolled ingest scripts break loudly when this happens; Fivetran handles it gracefully.

  • Added columnFivetran adds the column to the destination table on the next sync.
  • Renamed columnFivetran treats it as a new column + a dropped column (the old data stays under the old name; new writes go to the new name).
  • Type changeFivetran widens the destination type (e.g. INT → BIGINT, VARCHAR(50) → VARCHAR(255)) without dropping data.
  • Dropped column — column stays in the destination, marked as deleted in Fivetran's sync log; downstream models keep working until you explicitly drop them.

Incremental sync + CDC — what Fivetran actually does on each run

  • Cursor-based — for SaaS APIs with a updated_at column; Fivetran queries WHERE updated_at > cursor.
  • Log-based CDC — for Postgres, MySQL, SQL Server; Fivetran reads the WAL / binlog / transaction log and replays inserts, updates, deletes.
  • Append-only — for event streams; every event is a new row.
  • Full table refresh — fallback for tiny tables or sources without a cursor; the most expensive in MAR terms.

When Fivetran beats hand-rolled ingest — the four trigger conditions

  • >5 SaaS sources — the engineer-hours saved on auth, schema, retries dominate the licensing cost.
  • Schema drift is frequent — Salesforce admins add custom fields constantly; Fivetran handles it.
  • You don't have an on-call rotationFivetran is the on-call.
  • The MAR math works — pencil in $1 / 1k MAR; if MAR × $1 is less than 0.25× of an engineer-month, Fivetran wins.

Airbyte + Stitch sidenotes

  • Airbyte — open-source Fivetran alternative; 350+ connectors; self-host on Kubernetes (free) or Airbyte Cloud (managed). Strength: full source/code access, cheaper at high MAR. Weakness: thinner connector polish; more ops overhead.
  • Stitch — the budget option (acquired by Talend in 2018); fewer connectors (~150); flat per-row pricing; perfect for tiny ingest needs.

Worked example — wiring Salesforce + Postgres CDC to Snowflake via Fivetran + dbt

Detailed explanation. The canonical 2026 ingest pattern is Fivetran lands raw → dbt staging cleans → dbt marts publish. This worked example wires two connectors (Salesforce + Postgres CDC), shows the destination table shape Fivetran lands, and shows the matching dbt stg_* model that turns the raw Fivetran output into a usable shape.

Question. A team has Salesforce (Accounts + Opportunities) and a Postgres application database (Users + Orders). Land everything into Snowflake via Fivetran, then build a stg_* layer in dbt that renames _FIVETRAN_* columns and lowercases ID prefixes.

Input.

Source Destination schema Sync mode Estimated MAR / month
Salesforce Accounts raw_salesforce.accounts Cursor-based (SystemModstamp) 50k
Salesforce Opportunities raw_salesforce.opportunities Cursor-based 200k
Postgres users raw_app.users Log-based CDC 30k
Postgres orders raw_app.orders Log-based CDC 4.5M

Code (dbt staging model that cleans the Fivetran raw output).

-- models/staging/stg_salesforce__accounts.sql
{{ config(materialized='view') }}

select
    id                          as account_id,
    name                        as account_name,
    industry,
    annualrevenue::numeric      as annual_revenue,
    billingcountry              as billing_country,
    createddate::timestamp      as created_at,
    systemmodstamp::timestamp   as updated_at,
    _fivetran_synced::timestamp as dbt_loaded_at
from {{ source('raw_salesforce', 'accounts') }}
where _fivetran_deleted is false
Enter fullscreen mode Exit fullscreen mode
-- models/staging/stg_app__orders.sql
{{ config(materialized='view') }}

select
    id                          as order_id,
    user_id,
    amount_cents / 100.0        as amount_usd,
    status,
    created_at::timestamp       as order_ts,
    _fivetran_synced::timestamp as dbt_loaded_at
from {{ source('raw_app', 'orders') }}
where _fivetran_deleted is false
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Fivetran lands raw_salesforce.accounts with the source column names plus three system columns: _fivetran_synced (when the row was loaded), _fivetran_deleted (soft-delete flag), _fivetran_id (deterministic primary key).
  2. dbt stg_salesforce__accounts renames columns to snake_case, casts types, and filters out soft-deleted rows with where _fivetran_deleted is false.
  3. Postgres CDC behaves identically — Fivetran lands raw_app.orders with the same three system columns; the stg_app__orders view rents the same shape.
  4. MAR math50k + 200k + 30k + 4.5M = 4.78M MAR / month$4,780 on the Standard tier (~$1.50 / 1k MAR with volume discount lands closer to $3.5k–4k).
  5. Downstream — every int_* and fct_* model ref()s the stg_* views; the rest of the pipeline is identical to a hand-rolled ingest world.

Output (Snowflake INFORMATION_SCHEMA.TABLES after a few syncs).

TABLE_SCHEMA      | TABLE_NAME      | ROW_COUNT  | BYTES
RAW_SALESFORCE    | ACCOUNTS        | 8,432       | 12 MB
RAW_SALESFORCE    | OPPORTUNITIES   | 31,587      | 48 MB
RAW_APP           | USERS           | 1,204,332   | 380 MB
RAW_APP           | ORDERS          | 28,402,118  | 12.4 GB
ANALYTICS         | STG_SALESFORCE__ACCOUNTS   | (view)
ANALYTICS         | STG_APP__ORDERS            | (view)
Enter fullscreen mode Exit fullscreen mode

Rule of thumb: Fivetran wins when the source surface is wide (>5 connectors) and the team is small (<10 engineers); a hand-rolled ingest wins when the source surface is narrow, the volume is extreme, or the MAR math doesn't pencil.

SQL
Topic — etl
ETL ingest drills

Practice →

SQL
Topic — data-transformation
Data-transformation patterns

Practice →


5. Full ETL platforms — AWS Glue · Talend · Informatica

Full ETL platforms bundle ingest, transformation, orchestration, and governance into a single product — trading the composability of the modern Fivetran + dbt + Airflow stack for a one-vendor-throat-to-choke shape. Three platforms dominate this category, each anchored in a different ecosystem: AWS Glue (the AWS-native serverless Spark choice), Talend (the long-tenured GUI-first mid-market option), and Informatica (the Fortune-500 enterprise incumbent). The etl tools comparison question "should we go composable or all-in-one?" almost always lives here.

This section unpacks each platform's architecture: Glue (DPU autoscale, Glue Data Catalog, Glue Crawler, Glue Studio), Talend (Open Studio OSS, Talend Cloud SaaS, the Talend Studio drag-drop designer, Java code generation), and Informatica (IDMC SaaS, PowerCenter legacy, IPU consumption pricing, the Cloud Data Integration console). Each platform gets a worked example so the etl architecture trade-offs land as concrete picks.

Visual side-by-side comparison of three full ETL platforms — AWS Glue (serverless Spark + Data Catalog + Crawler), Talend (drag-and-drop GUI + code generation + Talend Studio), Informatica (IDMC SaaS + PowerCenter legacy + IPU pricing); each as a vertical column with three layered tiles (compute, catalog/metadata, governance); on a light PipeCode card.

AWS Glue — serverless Spark + Data Catalog + Crawler

AWS Glue is AWS's serverless spark etl service. You write PySpark (or Scala Spark) jobs, AWS provisions Spark clusters on demand, and you pay per DPU-hour (Data Processing Unit hour, 4 vCPU + 16 GB RAM). Four primitives carry the platform.

  • Glue Jobs — PySpark / Scala Spark scripts that read from S3 / RDS / Redshift / JDBC, transform, write back. Glue 4.0 runs Spark 3.3 with Iceberg / Hudi / Delta support baked in.
  • Glue Crawlers — scan an S3 prefix, infer schema, register the result as a table in the Glue Data Catalog. Run on a schedule or trigger.
  • Glue Data Catalog — Hive-metastore-compatible metadata layer; every AWS analytics tool (Athena, EMR, Redshift Spectrum, Lake Formation) reads schema from it.
  • Glue Studio — low-code visual editor for authoring Spark jobs; generates the underlying PySpark for you.

Glue pricing — $0.44 / DPU-hour (US-East), minimum 1-minute billing. A job that runs for 10 minutes on 10 DPUs costs 10 × (10/60) × $0.44 = $0.73. Autoscale (Glue 4.0) lowers idle DPU cost; flex execution discounts non-urgent jobs by ~30%.

When Glue wins.

  • Pure-AWS stacks — Glue + Step Functions + Lake Formation + Athena keeps everything inside one cloud.
  • S3-as-lakehouse workloads — Glue is the de-facto S3 Spark engine.
  • Bursty workloads — serverless billing means you pay zero between runs.

Talend — GUI drag-drop + Java codegen + Open Studio vs Enterprise

Talend is the long-tenured (2005-founded; Qlik-acquired 2023) gui etl platform. The signature Talend Studio desktop app lets engineers drag components onto a canvas, draw arrows between them, and click RunTalend generates Java code under the hood and executes it.

  • Talend Open Studio — the free OSS edition; runs on your laptop or any JVM; produces standalone .jar files you deploy where you like.
  • Talend Data Integration (Enterprise) — the paid tier; adds scheduling, monitoring, change-management, MDM, team collaboration.
  • Talend Cloud / Data Fabric — the SaaS replatform; cloud-managed runtime, governance, lineage; replaces the legacy on-prem TAC (Talend Administration Center).
  • Talend Studio canvas — components like tFileInputDelimited, tMap, tFilterRow, tAggregateRow, tDBOutput; the tMap component is the heart of every real Talend job (joins + transforms + conditional routing in one tile).

When Talend wins.

  • Low-code mid-market — non-Python teams that want point-and-click ETL.
  • On-prem + cloud hybrid — Talend Studio runs anywhere a JVM does.
  • Mature data-governance needs — Talend Data Catalog + MDM is integrated.

InformaticaIDMC SaaS + PowerCenter legacy + IPU billing

Informatica has been the enterprise etl incumbent since 1993. Two products dominate today.

  • PowerCenter — the legacy on-prem product; massive installed base in Fortune 500 finance / healthcare / telco; mainframe + SAP integration is unmatched. Slowly being migrated to IDMC.
  • Informatica IDMC (Intelligent Data Management Cloud) — the modern SaaS replacement; unified console for Cloud Data Integration (the ETL service), Cloud Data Governance and Catalog, Cloud Mass Ingestion (the Fivetran-style EL competitor), Cloud Data Quality, and Cloud MDM.

IDMC pricing — IPU (Informatica Processing Unit) consumption. One IPU is a unit of compute consumption; the IDMC console shows you IPU usage per service. Prepaid IPU bundles (e.g. 50k IPU / year) are the typical procurement shape; enterprise contracts run from $100k / year to multi-million for Fortune 50 deployments.

When Informatica wins.

  • Regulated industriesInformatica's governance, lineage, and catalog satisfy banking / healthcare auditors out of the box.
  • Mainframe + SAP integrationPowerCenter's legacy connectors are unmatched.
  • VP-of-Data pickability — single vendor, single contract, single throat to choke.
  • Migrating from PowerCenter — IDMC is the only path that re-uses existing mappings and skills.

Comparison at a glance

Aspect AWS Glue Talend Informatica
Origin 2017 (AWS) 2005 (OSS GUI) 1993 (enterprise)
Surface API PySpark + Glue Studio visual Drag-drop Talend Studio (Java codegen) Drag-drop IDMC + PowerCenter
Compute model Serverless Spark (per DPU-hour) JVM runtime (self or Cloud-managed) IDMC SaaS (consumption)
Catalog Glue Data Catalog (Hive-compatible) Talend Data Catalog IDMC Data Governance + Catalog
Pricing unit $/DPU-hour Per-user / per-server license + SaaS IPU consumption
Cloud lock AWS-only Multi-cloud + on-prem Multi-cloud + on-prem
Open source No Open Studio only No
Strongest at AWS-native S3 Spark ETL Mid-market GUI ETL Regulated enterprise governance
Weakest at Non-AWS targets Cloud-native polish vs Fivetran Cost + learning curve

Worked example — costing a 1 TB / day pipeline three ways

Detailed explanation. The fairest etl tools comparison is a head-to-head cost exercise on identical workload. This worked example takes a single 1 TB / day pipeline — 8 SaaS sources, 1 large Postgres DB, 3 transformation stages, 1 Snowflake target — and prices it on all three full-ETL platforms.

Question. Cost the same workload (1 TB / day raw, 8 SaaS sources, 1 Postgres CDC source, 3 transformation steps, Snowflake target) on (a) AWS Glue + Step Functions, (b) Talend Cloud, (c) Informatica IDMC. Use US-East list prices, 2026.

Input.

Workload component Volume / day Notes
8 SaaS source ingests 4M MAR-equivalent Salesforce, Stripe, Shopify, etc.
1 Postgres CDC ingest 4M MAR-equivalent Order events
3 transformation steps 6 DPU-hours / day Joins, aggregations, dedupe
Snowflake target 1 TB / day landed Warehouse cost excluded

Code (the rough monthly cost math).

# AWS Glue + Step Functions (pure AWS)
#   Ingest:      8 sources × 0.5 DPU-hr × 30 days = 120 DPU-hr  -> $52.80
#   CDC:         1 source  × 1.0 DPU-hr × 30 days =  30 DPU-hr  -> $13.20
#   Transform:   6 DPU-hr × 30 days              = 180 DPU-hr  -> $79.20
#   Step Fns:    negligible (~$1)
#   ---------------------------------------------
#   Glue total:  ~$145 / month  (compute only; S3 + Snowflake separate)

# Talend Cloud
#   Talend Cloud Data Integration: 3 named users × $1,170/mo  = $3,510
#     (or volume-based equivalent; pricing varies by deal)
#   Talend runtime VMs (or Cloud runtime): ~$500 / mo
#   ---------------------------------------------
#   Talend total: ~$4,000 / month

# Informatica IDMC
#   Estimated IPU consumption: ~8k IPU / month
#   Pre-paid bundle:           ~$4 / IPU
#   ---------------------------------------------
#   Informatica total: ~$32,000 / month  (entry enterprise; quotes vary 2-5x)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Glue is the cheapest in raw compute — at $0.44 / DPU-hour, 330 DPU-hours / month is ~$145. The catch: you wrote the PySpark, you own the operations, you fix the failures.
  2. Talend Cloud lands around ~$4k / month for a small team — three named users on Data Integration plus a runtime VM. The GUI saves engineer-time; the licensing is the cost.
  3. Informatica IDMC is the most expensive — entry deployments start around $30k / month and scale to $250k+ for Fortune 50. You're paying for governance, catalog, lineage, and a vendor relationship.
  4. Why the spread is 200× — Glue is compute-only and you do the work; Informatica is a managed platform with governance built in.
  5. Real picksGlue for engineering-led AWS teams, Talend for mid-market GUI shops, Informatica for regulated Fortune 500 enterprises.

Output (monthly cost summary).

Platform Monthly cost (entry) Engineer-hours / month Total $ + opportunity cost
AWS Glue ~$145 ~40 hrs (own the PySpark) ~$5–7k loaded
Talend Cloud ~$4,000 ~10 hrs (GUI maintenance) ~$5–6k loaded
Informatica IDMC ~$32,000 ~5 hrs (vendor handles most) ~$33–34k loaded

Rule of thumb: Glue is "cheap if your team is senior"; Talend is "moderate, flat, predictable"; Informatica is "expensive but the auditors smile" — and the right pick depends on which line of that sentence describes your company.

SQL
Topic — etl
Full ETL practice

Practice →

SQL
Topic — data-aggregation
Data aggregation drills

Practice →


6. Pricing and licensing models — comparing 6 cost shapes

etl tools pricing is where every etl tool selection decision gets real — and where most procurement decks lie because the six tools use six different billing units that don't compare apples-to-apples. Airflow is free OSS; dbt Core is free OSS; dbt Cloud is per-seat; Fivetran is MAR (Monthly Active Rows); Glue is $ / DPU-hour; Talend is per-user-license + runtime; Informatica is IPU consumption. To compare, you have to flatten every cost into a monthly dollar figure on the same workload.

This section walks the six cost shapes, then runs the canonical workload (1 TB / day, 8 SaaS sources, 3 transform stages) across all six tools so the etl pricing comparison lands as a single side-by-side table you can show your VP.

The six cost shapes — what each tool actually charges for

  • Airflow (OSS) — free software; you pay for the underlying infrastructure (Kubernetes cluster, metadata Postgres, scheduler / webserver / workers). Self-hosted ~$500–2,000 / month for small teams; managed (MWAA, Astronomer) ~$500–5,000 / month.
  • dbt Core (OSS) — free CLI; you run it under Airflow / Dagster; effectively zero incremental cost.
  • dbt Cloud (per-seat)Developer (free, single seat), Team (~$100 / developer / month), Enterprise (custom; usually $1,500+ / month all-in for SSO, MS Teams, etc.).
  • Fivetran (MAR) — volume tiers; rough $1–2 per 1k MAR with declining unit cost at scale; a 5M MAR workload lands ~$3,500–5,000 / month.
  • AWS Glue (DPU-hour)$0.44 / DPU-hour; minimum 1-minute billing; serverless autoscale. A heavy daily pipeline runs ~$100–500 / month.
  • Talend Cloud (per-user + runtime)Data Integration named-user licenses (~$1,170 / user / month) plus Talend Cloud runtime; small teams ~$3–5k / month.
  • Informatica IDMC (IPU) — pre-paid IPU bundles; ~$3–5 / IPU consumed; mid-size enterprise ~$30k+ / month, large enterprise $250k+ / month.

The licensing posture each cost shape implies

  • OSS + infra (Airflow, dbt Core) — flat infra cost, scales with cluster size, predictable. The "if our engineers are senior" pick.
  • Per-seat (dbt Cloud) — flat per developer; predictable; doesn't scale with data volume. The "10-engineer team" pick.
  • Volume (MAR, DPU-hour, IPU) — variable with workload; great when small, painful when big. The "we'll grow into the bill" pick.
  • Per-license (Talend) — flat per named user; predictable; doesn't scale with data volume but does scale with team. The "non-engineering teams co-build" pick.

Worked example — 1 TB / day, 8 SaaS sources, 3 transforms across all six tools

Detailed explanation. The fairest pricing comparison is a single workload priced across every option. We use the same workload as §5's worked example (1 TB / day, 8 SaaS + 1 CDC source, 3 transform stages, Snowflake target) and add Airflow orchestration plus dbt transformation rows.

Question. Estimate the monthly tooling cost (compute + license, excluding Snowflake) for the canonical workload across all six tools (or stacks involving them).

Input. Same workload as §5.

Component Volume / month
SaaS source MAR ~5M MAR
Postgres CDC MAR ~4M MAR
Glue DPU-hours ~330 DPU-hr
dbt models ~80 models, 6 runs / day
Airflow DAGs ~10 DAGs, 24 runs / day
Team size 3 data engineers

Code (cost shape per tool / stack).

# Tool / stack                          # Monthly cost (entry)
Airflow (MWAA mw1.medium)               ~$500
Airflow (Astronomer Hosted Standard)    ~$1,500
dbt Core under Airflow                  $0
dbt Cloud (Team, 3 devs)                ~$300
Fivetran Standard (9M MAR)              ~$3,500-5,000
AWS Glue (330 DPU-hr / month)           ~$145
Talend Cloud (3 named users + runtime)  ~$4,000
Informatica IDMC (entry mid-size)       ~$32,000

# Three canonical 2026 stacks:
# Stack A: Fivetran + dbt Cloud + MWAA  ~$4,300-5,800 / mo
# Stack B: Glue + Step Functions only   ~$200-300 / mo
# Stack C: Informatica IDMC end-to-end  ~$32,000 / mo
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Stack A (modern data stack)Fivetran ($4k) + dbt Cloud Team ($300) + MWAA ($500) = ~$4,800 / mo. Predictable, code-first, easy to hire for.
  2. Stack B (pure AWS)Glue ($145) + Step Functions (~$1) + Glue Data Catalog (~$1) = ~$150 / mo. Cheapest by far; trades the buy with engineer-hours owning the PySpark.
  3. Stack C (legacy enterprise)Informatica IDMC end-to-end = ~$32k / mo. Buys governance, lineage, vendor relationship; 200× the cost of Stack A.
  4. Cost per TB ingestedFivetran is ~$160 / TB on this workload; Glue is ~$5 / TB; Informatica is ~$1,000 / TB.
  5. Hidden costs not shown — Snowflake compute (~$3–10k / month at this volume) is the same across all three stacks; Airflow engineer-hours running self-hosted are ~10–20 hours / month; Informatica admin training and licensing add another 20–30%.

Output (apples-to-apples monthly bill summary).

Stack Tooling / month Cost per TB ingested Best for
Stack A — Fivetran + dbt Cloud + MWAA ~$4,800 ~$160 / TB Modern data team, 3–20 engineers
Stack B — Glue + Step Functions ~$150 ~$5 / TB Pure-AWS engineering-led shops
Stack C — Informatica IDMC ~$32,000 ~$1,000 / TB Regulated enterprise, governance-first

Rule of thumb: expect the modern data stack (Stack A) to be the default for greenfield 2026 builds, AWS-native (Stack B) for cost-pressured pure-AWS shops, and Informatica (Stack C) for regulated incumbents with existing PowerCenter footprints. Every other stack is a variant of one of these three.

SQL
Topic — etl
Cost-aware ETL drills

Practice →

SQL
Topic — design
System-design ETL problems

Practice →


7. Decision framework + 2026 production stack patterns

The etl tool selection decision is rarely "which is best" and almost always "which fits my team, ecosystem, and budget?" — so a decision framework beats a feature-matrix every time. This final section ships a two-question decision tree as ASCII art, then walks three realistic data engineering stack patterns you'll see in the wild: the modern data stack, the pure AWS stack, and the legacy enterprise stack. By the end you can defend a 2026 stack pick in a system-design round with the same confidence you defend an O(n) SQL plan.

The two-question decision tree

                            START — pick an ETL tool stack
                                        |
                                        v
                        Q1: Is your warehouse already cloud-native
                             (Snowflake / BigQuery / Databricks)?
                                        |
                         +--------------+--------------+
                         |                             |
                        YES                            NO
                         |                             |
                         v                             v
        Q2a: Is your team code-first       Q2b: Is your team in a regulated
            (Python / SQL / Git)?              industry (banking / pharma /
                         |                       healthcare / insurance)?
            +-----------+-----------+                       |
            |                       |              +-------+--------+
           YES                      NO             |                |
            |                       |             YES               NO
            v                       v              |                |
   Stack A: Fivetran +     Stack T: Talend         v                v
   dbt + Airflow          Cloud (GUI-first)   Stack C:        Stack B:
   (modern data stack)                         Informatica     AWS Glue +
                                                IDMC          Step Functions
                                              (regulated      (pure AWS,
                                               enterprise)     engineer-led)
Enter fullscreen mode Exit fullscreen mode

Stack AFivetran + dbt + Airflow (the modern data stack)

The 2026 greenfield default. Pre-built Fivetran connectors land raw data into Snowflake / BigQuery / Databricks; dbt models clean, layer, and publish marts; Airflow orchestrates Fivetran syncs, dbt build, and downstream notifications.

  • Strengths — code-first, Git-reviewable, easy to hire for, every layer is best-in-class.
  • WeaknessesFivetran MAR cost ramps with volume; you operate three tools, not one.
  • Reference cost — $3–6k / month for a small / mid-market team.
  • Reference team size — 2–20 data engineers.
  • Reference industries — SaaS, e-commerce, ad-tech, marketplaces, fintech.

Stack BGlue + Step Functions + Glue Data Catalog (pure AWS)

The cost-conscious engineering-led AWS-native stack. Glue Crawlers register S3 schemas in the Glue Data Catalog; Glue Jobs run PySpark transforms; Step Functions orchestrates; Lake Formation adds row / column security; Athena / Redshift Spectrum queries the result.

  • Strengths — cheapest by an order of magnitude, AWS-native security, zero idle cost.
  • Weaknesses — AWS-only, you own all the PySpark, less polish than dbt.
  • Reference cost — $150–600 / month at modest volume.
  • Reference team size — 3–10 senior engineers.
  • Reference industries — AWS-native startups, ML-platform teams, gaming, IoT.

Stack CInformatica IDMC + Airflow + cloud warehouse (legacy enterprise)

The Fortune-500 incumbent shape. IDMC Mass Ingestion replaces Fivetran for SaaS ingest; Cloud Data Integration runs ETL mappings; Cloud Data Governance ships lineage / catalog / quality as one console; Airflow runs alongside for non-Informatica jobs (Spark batches, custom Python).

  • Strengths — governance / lineage / compliance auditors love it; mainframe + SAP connectors unmatched; single vendor relationship.
  • Weaknesses — 10–100× the cost of Stack A; steeper learning curve; less Git-native dev workflow.
  • Reference cost — $30k–500k / month depending on Fortune ranking.
  • Reference team size — 20+ data engineers, dedicated Informatica admins.
  • Reference industries — banking, insurance, healthcare, pharma, telco, public sector.

Picking — the staff-engineer one-liner

The senior interview answer always starts the same: "It depends on three things — your warehouse, your team profile, and your regulatory posture." If your warehouse is cloud-native and your team is code-first, Stack A. If you're AWS-only and cost-pressured, Stack B. If you're regulated and have a PowerCenter footprint, Stack C. Everything else is a variant.

SQL / Python
Topic — design
System-design practice

Practice →

SQL
Topic — etl
End-to-end ETL drills

Practice →


Choosing the right ETL tool (cheat sheet)

A one-screen cheat sheet for etl tool selection — pick the tool that matches your job-to-be-done.

You want to … Reach for Notes
Orchestrate any kind of job Airflow (or Prefect / Dagster) DAGs as code; 200+ operators
Run SQL transforms inside a cloud warehouse dbt (Core or Cloud) Models, refs, materializations
Pull from 50+ SaaS sources without writing code Fivetran MAR-billed; schema-drift safe
Same job as Fivetran but open-source Airbyte 350+ connectors; self-host
Run serverless Spark on AWS S3 AWS Glue DPU-hour billing; Glue Data Catalog
Drag-and-drop ETL for a mixed-skill team Talend Cloud Java codegen; named-user licenses
Enterprise governance + lineage + MDM Informatica IDMC IPU consumption; PowerCenter migration path
Greenfield modern data stack Fivetran + dbt + Airflow Stack A — the 2026 default
Pure-AWS cost-conscious stack Glue + Step Functions + Lake Formation Stack B — cheapest at scale
Regulated enterprise rebuild Informatica IDMC + Airflow Stack C — governance-first
Track data quality alongside any of the above Soda or Great Expectations Layer 6 quality, not ETL itself
Pair Fivetran with a transform tool Fivetran → dbt EL + T; the canonical 2026 ingest+transform
Replace dbt with a planner-style alternative SQLMesh Virtual envs; column-level lineage
Add an asset-centric orchestrator Dagster Software-defined assets; lineage-first
Cost a workload across tools MAR × $1 vs DPU-hr × $0.44 vs IPU × $4 Flatten to monthly $ on same workload
Avoid vendor lock-in OSS stack: Airflow + dbt Core + Airbyte Higher ops burden; full portability

Frequently asked questions

Which ETL tool should I learn first as a data engineer in 2026?

Start with Airflow and dbt — together they cover the two highest-frequency interview topics and the two most-deployed tools in the modern data stack. Airflow gives you DAGs, operators, executors, sensors, and dynamic task mapping — the canonical orchestration vocabulary every senior round tests. dbt gives you models, refs, materializations, the staging → intermediate → marts layering, and tests — the canonical in-warehouse transformation pattern. After those two, learn Fivetran conceptually (you'll configure it through a UI, not code, so the learning curve is hours not weeks) and pick one full-ETL platform based on your target employers — AWS Glue if you're targeting cloud-native shops, Informatica or Talend if you're targeting regulated enterprises. Skip the "which is best" debate; the right answer in 2026 is "all of them serve different layers, and senior engineers know what each is for."

Is Airflow being replaced by Prefect or Dagster?

No, but the competitive field has matured. Airflow still dominates by ecosystem breadth — 200+ provider packages, every cloud, every operator, every interview prep deck. Prefect is a credible Python-first alternative with better dynamic-workflow ergonomics (task.map(...) is first-class), Prefect Cloud is the managed control plane, and many greenfield 2026 builds pick Prefect. Dagster is the most opinionated of the three; it elevates software-defined assets to first-class status so every materialised dataset is an asset and the DAG is the asset graph — strong for lineage and observability use cases. In interviews, Airflow is the default you should be fluent in; mentioning Prefect / Dagster as alternatives is a senior-signal move, but the loop will keep coming back to Airflow concepts (DAGs, operators, executors, sensors).

What's the practical difference between dbt and Fivetran?

dbt and Fivetran are not substitutes — they live in different layers of the pipeline and 2026's canonical stack runs both. Fivetran is EL / ingest — it pulls data from operational sources (Salesforce, Stripe, Postgres) into your warehouse, with pre-built connectors handling auth, schema discovery, and incremental sync; you never write SQL inside Fivetran. dbt is in-warehouse transformation — it runs SQL SELECT statements inside the warehouse to turn raw landed tables into clean marts; you never move data across systems with dbt. The canonical 2026 stack is Fivetran lands raw → dbt staging cleans → dbt marts publish → Airflow orchestrates. If anyone tells you to pick dbt vs Fivetran, they've misread the question — pick both, in different layers.

When does AWS Glue beat Fivetran + dbt for the same workload?

Three conditions. First, when your data is already on S3 (logs, click-streams, IoT, ML feature stores) — Glue Crawlers and Glue Jobs are built for S3-as-lakehouse workloads, while Fivetran has weaker S3-source connectors. Second, when you're AWS-only and your security team prefers IAM + Lake Formation over a third-party SaaS — Glue lives inside your VPC; Fivetran is external SaaS that needs source credentials. Third, when MAR math doesn't pencil — at very high write volumes (10s of millions of rows / day), Fivetran's MAR bill exceeds Glue's DPU-hour bill by 10–50×. The trade-off: Glue requires PySpark fluency and you own the operations, while Fivetran + dbt ships managed connectors and SQL transforms. The default for most teams is still Fivetran + dbt + Airflow (modern data stack), with Glue slotted in for specific S3-heavy workloads.

Is Informatica still relevant in 2026, or should we go cloud-native?

Informatica is still very relevant in regulated Fortune-500 enterprises — banking, healthcare, insurance, pharma, telco, public sector — and it isn't going away. PowerCenter still runs many Fortune-500 data warehouses; IDMC is the modern SaaS replacement that bundles Cloud Data Integration (ETL), Cloud Data Governance (catalog + lineage), and Cloud Mass Ingestion (the Fivetran-style EL competitor). The reason regulated enterprises stick with Informatica is the governance and lineage posture — auditors trust the platform, and the migration cost from PowerCenter to anything else is enormous. For greenfield builds in cloud-native sectors (SaaS, e-commerce, ad-tech, fintech, marketplaces), the cloud-native Fivetran + dbt + Airflow stack wins on cost, agility, and hiring. Pick Informatica when your industry forces it; pick the modern data stack when it doesn't.

How do I compare the price of ETL tools when they all use different billing units?

Flatten everything to monthly dollars on the same workload. Pick a single reference workload — for example, 1 TB / day landed, 8 SaaS sources, 1 Postgres CDC, 3 transform stages, 3 engineers — and compute the monthly cost across each option. Airflow OSS ≈ $500–1,500 / month (infra only); dbt Cloud Team ≈ $100 / developer / month; Fivetran Standard ≈ $1.50 per 1k MAR (with volume discounts at scale); AWS Glue ≈ $0.44 / DPU-hour × your DPU-hours; Talend Cloud ≈ $1,170 / named user / month plus runtime; Informatica IDMC ≈ $3–5 / IPU consumed. Then add up the stack: Fivetran + dbt + MWAA lands ~$4–6k / month; Glue + Step Functions lands ~$150–500 / month; Informatica IDMC end-to-end lands ~$30k+ / month. Compare on the same workload, and the apples-to-apples picture emerges immediately.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL, Python, ETL, data-modeling, aggregation, joins, window functions, dimensional modeling, and the system-design rounds where etl tool selection lands. Whether you're prepping for an Airflow + dbt modern-data-stack interview or a system-design round defending a Glue + Step Functions pick, the practice library mirrors the same four-category taxonomy this guide teaches.

Kick off via Explore practice →; drill the SQL practice lane →; fan out into the ETL practice lane →; reinforce data-modeling drills →; rehearse joins patterns →; sharpen aggregation problems →; reinforce Python practice →; or widen coverage on the full dimensional-modeling library →.

Top comments (0)