How to effectively orchestrate transformation pipelines using open-source tools, containerization, and engineering best practices.
Introduction
Building reliable, repeatable, and scalable data pipelines is now foundational to analytics engineering. By combining Apache Airflow for orchestration, dbt for SQL-based transformations, and Docker for reproducible environments, data teams can design workflows that are modular, maintainable, and cloud-ready.
This guide covers how each tool fits together and walks through a complete example that loads, transforms, and tests data automatically with Airflow and dbt inside Docker.
Tooling Foundations
Airflow: Orchestration Engine
Airflow is an open-source workflow manager that lets you define, schedule, and monitor data pipelines programmatically. It handles task dependencies, retries, and scheduling, and acts as the control plane connecting ingestion, transformation, and reporting.
dbt: Transformations as Code
dbt brings the principles of software engineering—modularity, testing, and version control—to analytics.
Core concepts:
- Models: SQL files that define how raw data should be shaped into usable tables or views.
- Tests: Defined in YAML, they validate your data for quality. Common types include unique, not_null, accepted_values, and relationships.
- Materializations: Control how dbt models are built and stored in the warehouse. Common materializations include:
- view: The model is saved as a database view (default). It always reflects the latest data but might be slower for complex queries.
- table: The model is built as a physical table, which can improve performance but requires refreshing.
- incremental: Only new or updated data is processed and added to an existing table.
- ephemeral: The model is not stored in the database but is embedded as a subquery in downstream models.
 
Choosing the right materialization depends on your data freshness requirements and warehouse performance.
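A materialization is set inside the model's SQL file with a config() block (as you will see in the hands-on example) or per folder in dbt_project.yml. The sketch below shows a hypothetical incremental model; recent_orders, stg_orders, and the column names are placeholders rather than part of the project built later in this guide.
-- models/recent_orders.sql (hypothetical sketch of an incremental model)
{{ config(materialized='incremental', unique_key='order_id') }}
select
    order_id,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- on incremental runs, only process rows newer than what the table already holds
where order_date > (select max(order_date) from {{ this }})
{% endif %}
On the first run dbt builds the full table; on later runs only the rows passing the is_incremental() filter are processed, with unique_key preventing duplicates.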
PostgreSQL: The Data Warehouse
PostgreSQL is an open-source, highly extensible relational database that serves effectively as a data warehouse. It handles large volumes of data, supports complex queries, and integrates well with analytics tools, making it ideal for scalable and reliable data workflows.
Docker: Consistent Environments
Docker packages applications, dependencies, and configurations together—ensuring the same setup runs identically on local machines, CI/CD pipelines, or cloud VMs.
Hands-On Example: Airflow, dbt, and Docker Pipeline
In this example, you’ll set up a local workflow with Airflow orchestrating dbt commands within Docker. The pipeline will load seed data, run a dbt model, test the results, and display everything in Airflow’s UI.
Step 1: Set Up Project Structure
Goal: Set up the folders that organize both dbt and Airflow files.
Create your working directory and required folders:
mkdir dbt_sample_project
cd dbt_sample_project
mkdir airflow dbt_project
Your directory layout:
dbt_sample_project/
├── airflow/
└── dbt_project/
Step 2: Configure dbt
Goal: Connect dbt to your Postgres database and define the base project structure.
Inside the dbt_project directory, create the following configuration files.
profiles.yml
default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: password
      port: 5432
      dbname: sampledb
      schema: public
What it does:
Defines connection details so dbt knows which database and schema to write models into.
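A hardcoded password is fine for a local demo, but dbt's env_var() function lets profiles.yml read credentials from the environment instead. A sketch of the same profile, assuming you export DBT_PASSWORD in the container:
default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      # read the password from an environment variable instead of committing it
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: sampledb
      schema: public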
dbt_project.yml
name: 'dbt_project'
version: '1.0'
config-version: 2
profile: 'default'
model-paths: ["models"]
What it does:
Defines project-level settings such as version, folder paths, and linked profile.
Step 3: Seed Sample Data
Goal: Create a source dataset to load into Postgres using dbt’s seeding feature.
Create a seeds folder inside dbt_project and add a CSV file named raw_orders.csv.
order_id,customer_id,order_date,total_amount,order_status
1,101,2025-10-01,250,completed
2,102,2025-10-03,450,pending
3,103,2025-10-05,100,completed
4,104,2025-10-10,300,completed
What it does:
This CSV simulates your “raw orders” dataset. Running dbt seed will upload it into the database as a base table for transformations.
Step 4: Create a dbt Model and Tests
Goal: Transform data and define validation tests to ensure accuracy.
Model: dbt_project/models/orders.sql
{{ config(materialized='table') }}
select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ ref('raw_orders') }}
where order_status = 'completed'
What it does:
Transforms the raw orders into a curated table containing only completed orders.
Schema Tests: dbt_project/models/orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
What it does:
Defines tests that run after the transformation to verify data integrity.
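The accepted_values and relationships tests mentioned earlier follow the same YAML structure. As an optional sketch, you could also validate the seed itself with a hypothetical dbt_project/seeds/raw_orders.yml (not required for the pipeline to run):
version: 2
seeds:
  - name: raw_orders
    columns:
      - name: order_status
        tests:
          # fail if a status outside this list ever appears in the seed
          - accepted_values:
              values: ['completed', 'pending']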
Step 5: Add the Airflow DAG
Goal: Schedule and manage your dbt steps (seed, run, and test) through Airflow.
Create the airflow/dags folder, then add dbt_dag.py with the following:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2025, 10, 31),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
with DAG('dbt_workflow',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:
    run_dbt_seed = BashOperator(
        task_id='dbt_seed',
        bash_command='dbt seed --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='dbt test --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_seed >> run_dbt_run >> run_dbt_test
What it does:
Defines a DAG (Directed Acyclic Graph) where:
- dbt_seed loads the CSVs into the database.
- dbt_run executes your dbt models.
- dbt_test validates data integrity.
The arrows indicate dependency — each step runs only after the previous one completes.
Also, create an empty folder for storing Airflow logs.
mkdir airflow/logs
Step 6: Build the Airflow + dbt Docker Image
Goal: Create a Docker image containing both Airflow and dbt for consistent execution environments.
Inside airflow, create a Dockerfile:
FROM apache/airflow:2.5.3-python3.8
USER root
# Optional but handy for compiling deps
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
USER airflow
# Install dbt (pin versions that match your adapter)
RUN pip install --no-cache-dir \
    dbt-core==1.5.0 \
    dbt-postgres==1.5.0 \
    psycopg2-binary
What it does:
Starts from an official Airflow base image, installs build tools, and adds dbt dependencies.
This image will power both the Airflow webserver and the scheduler.
Step 7: Create the Docker Compose Setup
Goal: Define services (Airflow, Postgres, and supporting components) that run together.
In your project root, create docker-compose.yaml:
version: '3'
x-airflow-common:
  &airflow-common
  build:
    context: ./airflow
    dockerfile: Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:password@postgres/sampledb
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
  volumes:
    - ./airflow/dags:/opt/airflow/dags
    - ./airflow/logs:/opt/airflow/logs
    - ./dbt_project:/dbt
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    postgres:
      condition: service_healthy
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
      POSTGRES_DB: sampledb
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      retries: 5
  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db upgrade
        airflow users create \
          --username ${_AIRFLOW_WWW_USER_USERNAME:-airflow} \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@example.com \
          --password ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    environment:
      <<: *airflow-common-env
      _AIRFLOW_WWW_USER_CREATE: 'true'
  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always
volumes:
  pgdata:
What it does:
- Defines postgres as the database used by both dbt and Airflow.
- Configures Airflow’s webserver, scheduler, and initialization steps.
- Mounts your project folders into the containers for live development.
Step 8: Run Everything
Goal: Start the entire environment and trigger your workflow.
Initialize Airflow:
docker-compose up airflow-init
Then start all services:
docker-compose up
When it's running, open your browser to http://localhost:8080.
Log in using:
Username: airflow
Password: airflow
From the Airflow UI, unpause the dbt_workflow DAG and trigger it.
You'll see three tasks (dbt_seed, dbt_run, and dbt_test), each turning green as it succeeds.
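If you prefer the command line, the same steps can be run from inside a container; a sketch using the docker-compose commands from the previous step:
# unpause the DAG (it starts paused because DAGS_ARE_PAUSED_AT_CREATION is 'true')
docker-compose exec airflow-webserver airflow dags unpause dbt_workflow
# trigger a run immediately
docker-compose exec airflow-webserver airflow dags trigger dbt_workflow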
Optional Verification:
Run this command to open a psql session against Postgres:
PGPASSWORD=password psql -h localhost -p 5432 -U postgres -d sampledb
Querying the orders table should return rows only for completed orders, confirming that your dbt model worked end to end.
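For example, inside the psql session (assuming the default public schema set in profiles.yml):
-- rows produced by the orders model
select order_id, customer_id, order_date, total_amount
from public.orders
order by order_id;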
Advanced: Connecting to Cloud Warehouses
Extend the same setup to use BigQuery, Snowflake, or Microsoft Fabric.
- Update profiles.yml with credentials for the target warehouse (see the sketch after this list).
- Use Airflow Connections for key management.
- Run your workflows through managed Airflow services like MWAA or Cloud Composer to reduce infrastructure maintenance.
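For illustration, a Snowflake target in profiles.yml could look roughly like the sketch below; every value is a placeholder, and you would also need to install the matching adapter (dbt-snowflake) in the Dockerfile in place of dbt-postgres.
default:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: your_account_identifier   # placeholder
      user: your_user                     # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      database: ANALYTICS
      warehouse: TRANSFORMING
      schema: PUBLIC
      threads: 4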
Conclusion
By combining dbt, Airflow, and Docker, you’ve created a fully automated, tested, and reproducible analytics pipeline.
Together they deliver:
- Data transformations as version-controlled code
- Reliable orchestration and scheduling
- Consistent environments for every developer or CI/CD job
- Easy scalability to cloud setups
This setup forms a strong foundation for production ETL pipelines and analytics workflows — one that scales effortlessly as your data and team grow.