
Shivangi Varshney

Integrating Airflow, dbt, Postgres and Docker: Building a Modern, Scalable Data Workflow

How to effectively orchestrate transformation pipelines using open-source tools, containerization, and engineering best practices.


Introduction

Building reliable, repeatable, and scalable data pipelines is now foundational to analytics engineering. By combining Apache Airflow for orchestration, dbt for SQL-based transformations, and Docker for reproducible environments, data teams can design workflows that are modular, maintainable, and cloud-ready.

This guide covers how each tool fits together and walks through a complete example that loads, transforms, and tests data automatically with Airflow and dbt inside Docker.


Tooling Foundations

Airflow: Orchestration Engine

Airflow is an open-source workflow manager that lets you author, schedule, and monitor data pipelines as Python code. It handles task dependencies, retries, and scheduling, making it the control plane that connects ingestion, transformation, and reporting.

dbt: Transformations as Code

dbt brings the principles of software engineering—modularity, testing, and version control—to analytics.

Core concepts:

  • Models: SQL files that define how raw data should be shaped into usable tables or views.
  • Tests: Defined in YAML, they validate your data for quality. Common types include unique, not_null, accepted_values, and relationships.
  • Materializations: Control how dbt models are built and stored in the warehouse. Common materializations include:
    • view: The model is saved as a database view (default). It always reflects the latest data but might be slower for complex queries.
    • table: The model is built as a physical table, which can improve performance but requires refreshing.
    • incremental: Only new or updated data is processed and added to an existing table.
    • ephemeral: The model is not stored in the database but is embedded as a subquery in downstream models.

Choosing the right materialization depends on your data freshness requirements and warehouse performance.
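
As an example of how a materialization is declared, here is a minimal sketch of an incremental model; the model name, columns, and upstream reference are all hypothetical:

-- models/stg_events.sql (hypothetical)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    event_ts,
    payload
from {{ ref('raw_events') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what the table already holds
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}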

PostgreSQL: The Data Warehouse

PostgreSQL is an open-source, highly extensible relational database that serves effectively as a data warehouse. It handles large volumes of data, supports complex queries, and integrates well with analytics tools, making it ideal for scalable and reliable data workflows.

Docker: Consistent Environments

Docker packages applications, dependencies, and configurations together—ensuring the same setup runs identically on local machines, CI/CD pipelines, or cloud VMs.


Hands-On Example: Airflow, dbt, and Docker Pipeline

In this example, you’ll set up a local workflow with Airflow orchestrating dbt commands within Docker. The pipeline will load seed data, run a dbt model, test the results, and display everything in Airflow’s UI.


Step 1: Set Up Project Structure

Goal: Set up the folders that organize both dbt and Airflow files.

Create your working directory and required folders:

mkdir dbt_sample_project
cd dbt_sample_project
mkdir airflow dbt_project

Your directory layout:

dbt_sample_project/
├── airflow/
└── dbt_project/

Step 2: Configure dbt

Goal: Connect dbt to your Postgres database and define the base project structure.

Inside the dbt_project directory, create the following configuration files.

profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: password
      port: 5432
      dbname: sampledb
      schema: public

What it does:
Defines connection details so dbt knows which database and schema to write models into.
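
Hardcoding the password is acceptable for a local demo; for anything shared, dbt can read it from an environment variable instead. A sketch, assuming DBT_PASSWORD is exported in the container environment:

default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: sampledb
      schema: public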

dbt_project.yml

name: 'dbt_project'
version: '1.0'

config-version: 2

profile: 'default'

model-paths: ["models"]


What it does:
Defines project-level settings such as version, folder paths, and linked profile.
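
The same file can also set folder-level defaults, such as the default materialization for everything under models/. A minimal sketch using the project name defined above:

models:
  dbt_project:
    +materialized: view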


Step 3: Seed Sample Data

Goal: Create a source dataset to load into Postgres using dbt’s seeding feature.

Inside dbt_project/seeds/, add a CSV file named raw_orders.csv.

order_id,customer_id,order_date,total_amount,order_status
1,101,2025-10-01,250,completed
2,102,2025-10-03,450,pending
3,103,2025-10-05,100,completed
4,104,2025-10-10,300,completed

What it does:

This CSV simulates your “raw orders” dataset. Running dbt seed will upload it into the database as a base table for transformations.
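
If dbt is installed on your machine, you can sanity-check the seed outside Airflow. This assumes you run it from inside dbt_project/ and that the host named in profiles.yml resolves to a reachable Postgres instance:

cd dbt_project
dbt seed --profiles-dir . --project-dir .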


Step 4: Create a dbt Model and Tests

Goal: Transform data and define validation tests to ensure accuracy.

Model: dbt_project/models/orders.sql

{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ ref('raw_orders') }}
where order_status = 'completed'

What it does:

Transforms the raw orders into a curated table containing only completed orders.

Schema Tests: dbt_project/models/orders.yml

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null

What it does:

Defines tests that run after the transformation to verify data integrity.
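
Beyond these generic tests, dbt also supports singular tests: standalone SQL files under tests/ that fail if they return any rows. A hedged sketch, with an illustrative file name:

-- dbt_project/tests/assert_positive_total_amount.sql (hypothetical)
-- fails if any curated order has a non-positive amount
select *
from {{ ref('orders') }}
where total_amount <= 0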


Step 5: Add the Airflow DAG

Goal: Schedule and manage your dbt steps (seed, run, and test) through Airflow.

In airflow/dags/dbt_dag.py, add:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2025, 10, 31),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('dbt_workflow',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    run_dbt_seed = BashOperator(
        task_id='dbt_seed',
        bash_command='dbt seed --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='dbt test --profiles-dir /dbt --project-dir /dbt'
    )

    run_dbt_seed >> run_dbt_run >> run_dbt_test


What it does:

Defines a DAG (Directed Acyclic Graph) where:

  1. dbt_seed loads the CSVs into the database.
  2. dbt_run executes your dbt models.
  3. dbt_test validates data integrity.

The arrows indicate dependency — each step runs only after the previous one completes.

(Screenshot: the dbt_workflow DAG graph in the Airflow UI)

Also, create an empty folder for storing Airflow logs.

mkdir airflow/logs

(Screenshot: task logs in the Airflow UI)


Step 6: Build the Airflow + dbt Docker Image

Goal: Create a Docker image containing both Airflow and dbt for consistent execution environments.

Inside airflow, create a Dockerfile:

FROM apache/airflow:2.5.3-python3.8

USER root
# Optional but handy for compiling deps
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
USER airflow
# Install dbt (pin versions that match your adapter)
RUN pip install --no-cache-dir \
    dbt-core==1.5.0 \
    dbt-postgres==1.5.0 \
    psycopg2-binary


What it does:

Starts from an official Airflow base image, installs build tools, and adds dbt dependencies.

This image will power both the Airflow webserver and the scheduler.
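
To check that the image builds and dbt is importable before wiring up Compose, you can build and probe it manually. The airflow-dbt tag is only an illustrative name; Compose will build its own image from the same Dockerfile:

docker build -t airflow-dbt ./airflow
docker run --rm airflow-dbt bash -c "dbt --version"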


Step 7: Create the Docker Compose Setup

Goal: Define services (Airflow, Postgres, and supporting components) that run together.

In your project root, create docker-compose.yaml:

version: '3'

x-airflow-common:
  &airflow-common
  build:
    context: ./airflow
    dockerfile: Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:password@postgres/sampledb
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
  volumes:
    - ./airflow/dags:/opt/airflow/dags
    - ./airflow/logs:/opt/airflow/logs
    - ./dbt_project:/dbt
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
      POSTGRES_DB: sampledb
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      retries: 5

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db upgrade
        airflow users create \
          --username ${_AIRFLOW_WWW_USER_USERNAME:-airflow} \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@example.com \
          --password ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    environment:
      <<: *airflow-common-env
      _AIRFLOW_WWW_USER_CREATE: 'true'

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always

volumes:
  pgdata:

What it does:

  • Defines postgres as the database used by both dbt and Airflow.
  • Configures Airflow’s webserver, scheduler, and initialization steps.
  • Mounts your project folders into the containers for live development.
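
Before bringing anything up, it can be worth confirming that the file parses and that the &airflow-common anchors expand as expected:

docker-compose config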

Step 8: Run Everything

Goal: Start the entire environment and trigger your workflow.

Initialize Airflow:

docker-compose up airflow-init

Then start all services:

docker-compose up


When it's running, open your browser to http://localhost:8080.

Log in using:
Username: airflow
Password: airflow

From the Airflow UI, unpause the dbt_workflow DAG (new DAGs start paused per the compose configuration), then trigger it.

You’ll see three tasks: dbt_seed, dbt_run, and dbt_test, each turning green as it succeeds.
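
If you prefer the command line, the same DAG can be unpaused and triggered from inside the webserver container; the service name matches the compose file above, and the stack must already be running:

docker-compose exec airflow-webserver airflow dags unpause dbt_workflow
docker-compose exec airflow-webserver airflow dags trigger dbt_workflow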

(Screenshot: all three tasks completed successfully)

Optional Verification:

Connect to Postgres from your host to inspect the transformed data:

PGPASSWORD=password psql -h localhost -p 5432 -U postgres -d sampledb
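
Once connected, querying the model's target table (the orders model materializes as a table in the public schema configured in profiles.yml) shows the curated rows:

select * from public.orders order by order_id;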

You should see rows only for completed orders — proving your dbt model worked end to end.


Advanced: Connecting to Cloud Warehouses

Extend the same setup to use BigQuery, Snowflake, or Microsoft Fabric.

  • Update profiles.yml with the adapter type and credentials for the target warehouse (see the sketch below).
  • Use Airflow Connections or a secrets backend to manage credentials instead of hardcoding them.
  • Run your workflows through managed Airflow services like MWAA or Cloud Composer for less infrastructure maintenance.
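
As a hedged sketch, a Snowflake target in profiles.yml might look roughly like this; it requires the dbt-snowflake adapter, and every value below is a placeholder:

default:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: your_account_identifier
      user: your_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS
      schema: PUBLIC
      threads: 4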

Conclusion

By combining dbt, Airflow, and Docker, you’ve created a fully automated, tested, and reproducible analytics pipeline.

Together they deliver:

  • Data transformations as version-controlled code
  • Reliable orchestration and scheduling
  • Consistent environments for every developer or CI/CD job
  • Easy scalability to cloud setups

This setup forms a strong foundation for production ETL pipelines and analytics workflows, one that scales as your data and team grow.
