How to effectively orchestrate transformation pipelines using open-source tools, containerization, and engineering best practices.
Introduction
Building reliable, repeatable, and scalable data pipelines is now foundational to analytics engineering. By combining Apache Airflow for orchestration, dbt for SQL-based transformations, and Docker for reproducible environments, data teams can design workflows that are modular, maintainable, and cloud-ready.
This guide covers how each tool fits together and walks through a complete example that loads, transforms, and tests data automatically with Airflow and dbt inside Docker.
Tooling Foundations
Airflow: Orchestration Engine
Airflow is an open-source workflow manager that lets you define, schedule, and monitor data pipelines programmatically. It handles task dependencies, retries, and scheduling, and acts as the control plane connecting ingestion, transformation, and reporting.
dbt: Transformations as Code
dbt brings the principles of software engineering—modularity, testing, and version control—to analytics.
Core concepts:
- Models: SQL files that define how raw data should be shaped into usable tables or views.
- Tests: Defined in YAML, they validate your data for quality. Common types include unique, not_null, accepted_values, and relationships.
- Materializations: Control how dbt models are built and stored in the warehouse. Common materializations include:
- view: The model is saved as a database view (default). It always reflects the latest data but might be slower for complex queries.
- table: The model is built as a physical table, which can improve performance but requires refreshing.
- incremental: Only new or updated data is processed and added to an existing table.
- ephemeral: The model is not stored in the database but is embedded as a subquery in downstream models.
 
Choosing the right materialization depends on your data freshness requirements and warehouse performance.
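A materialization is set inside the model's SQL file with a config() block (as you will see in the hands-on example) or per folder in dbt_project.yml. The sketch below shows a hypothetical incremental model; recent_orders, stg_orders, and the column names are placeholders rather than part of the project built later in this guide.
-- models/recent_orders.sql (hypothetical sketch of an incremental model)
{{ config(materialized='incremental', unique_key='order_id') }}
select
    order_id,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- on incremental runs, only process rows newer than what the table already holds
where order_date > (select max(order_date) from {{ this }})
{% endif %}
On the first run dbt builds the full table; on later runs only the rows passing the is_incremental() filter are processed, with unique_key preventing duplicates.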
PostgreSQL: The Data Warehouse
PostgreSQL is an open-source, highly extensible relational database that serves effectively as a data warehouse. It handles large volumes of data, supports complex queries, and integrates well with analytics tools, making it ideal for scalable and reliable data workflows.
Docker: Consistent Environments
Docker packages applications, dependencies, and configurations together—ensuring the same setup runs identically on local machines, CI/CD pipelines, or cloud VMs.
Hands-On Example: Airflow, dbt, and Docker Pipeline
In this example, you’ll set up a local workflow with Airflow orchestrating dbt commands within Docker. The pipeline will load seed data, run a dbt model, test the results, and display everything in Airflow’s UI.
Step 1: Set Up Project Structure
Goal: Set up the folders that organize both dbt and Airflow files.
Create your working directory and required folders:
mkdir dbt_sample_project
cd dbt_sample_project
mkdir airflow dbt_project
Your directory layout:
dbt_sample_project/
├── airflow/
└── dbt_project/
Step 2: Configure dbt
Goal: Connect dbt to your Postgres database and define the base project structure.
Inside the dbt_project directory, create the following configuration files.
profiles.yml
default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: password
      port: 5432
      dbname: sampledb
      schema: public
What it does:
Defines connection details so dbt knows which database and schema to write models into.
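A hardcoded password is fine for a local demo, but dbt's env_var() function lets profiles.yml read credentials from the environment instead. A sketch of the same profile, assuming you export DBT_PASSWORD in the container:
default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      # read the password from an environment variable instead of committing it
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: sampledb
      schema: public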
dbt_project.yml
name: 'dbt_project'
version: '1.0'
config-version: 2
profile: 'default'
model-paths: ["models"]
What it does:
Defines project-level settings such as version, folder paths, and linked profile.
Step 3: Seed Sample Data
Goal: Create a source dataset to load into Postgres using dbt’s seeding feature.
Create a seeds folder inside dbt_project and add a CSV file named raw_orders.csv.
order_id,customer_id,order_date,total_amount,order_status
1,101,2025-10-01,250,completed
2,102,2025-10-03,450,pending
3,103,2025-10-05,100,completed
4,104,2025-10-10,300,completed
What it does:
This CSV simulates your “raw orders” dataset. Running dbt seed will upload it into the database as a base table for transformations.
Step 4: Create a dbt Model and Tests
Goal: Transform data and define validation tests to ensure accuracy.
Model: dbt_project/models/orders.sql
{{ config(materialized='table') }}
select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ ref('raw_orders') }}
where order_status = 'completed'
What it does:
Transforms the raw orders into a curated table containing only completed orders.
Schema Tests: dbt_project/models/orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
What it does:
Defines tests that run after the transformation to verify data integrity.
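The accepted_values and relationships tests mentioned earlier follow the same YAML structure. As an optional sketch, you could also validate the seed itself with a hypothetical dbt_project/seeds/raw_orders.yml (not required for the pipeline to run):
version: 2
seeds:
  - name: raw_orders
    columns:
      - name: order_status
        tests:
          # fail if a status outside this list ever appears in the seed
          - accepted_values:
              values: ['completed', 'pending']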
Step 5: Add the Airflow DAG
Goal: Schedule and manage your dbt steps (seed, run, and test) through Airflow.
Create the airflow/dags folder, then add dbt_dag.py with the following:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2025, 10, 31),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
with DAG('dbt_workflow',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:
    run_dbt_seed = BashOperator(
        task_id='dbt_seed',
        bash_command='dbt seed --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_run = BashOperator(
        task_id='dbt_run',
        bash_command='dbt run --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='dbt test --profiles-dir /dbt --project-dir /dbt'
    )
    run_dbt_seed >> run_dbt_run >> run_dbt_test
What it does:
Defines a DAG (Directed Acyclic Graph) where:
- dbt_seed loads the CSVs into the database.
- dbt_run executes your dbt models.
- dbt_test validates data integrity.
The arrows indicate dependency — each step runs only after the previous one completes.
Also, create an empty folder for storing Airflow logs.
mkdir airflow/logs
Step 6: Build the Airflow + dbt Docker Image
Goal: Create a Docker image containing both Airflow and dbt for consistent execution environments.
Inside airflow, create a Dockerfile:
FROM apache/airflow:2.5.3-python3.8
USER root
# Optional but handy for compiling deps
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
USER airflow
# Install dbt (pin versions that match your adapter)
RUN pip install --no-cache-dir \
    dbt-core==1.5.0 \
    dbt-postgres==1.5.0 \
    psycopg2-binary
What it does:
Starts from an official Airflow base image, installs build tools, and adds dbt dependencies.
This image will power both the Airflow webserver and the scheduler.
Step 7: Create the Docker Compose Setup
Goal: Define services (Airflow, Postgres, and supporting components) that run together.
In your project root, create docker-compose.yaml:
version: '3'
x-airflow-common:
  &airflow-common
  build:
    context: ./airflow
    dockerfile: Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:password@postgres/sampledb
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
  volumes:
    - ./airflow/dags:/opt/airflow/dags
    - ./airflow/logs:/opt/airflow/logs
    - ./dbt_project:/dbt
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    postgres:
      condition: service_healthy
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
      POSTGRES_DB: sampledb
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      retries: 5
  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db upgrade
        airflow users create \
          --username ${_AIRFLOW_WWW_USER_USERNAME:-airflow} \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@example.com \
          --password ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    environment:
      <<: *airflow-common-env
      _AIRFLOW_WWW_USER_CREATE: 'true'
  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always
volumes:
  pgdata:
What it does:
- Defines postgres as the database used by both dbt and Airflow.
- Configures Airflow’s webserver, scheduler, and initialization steps.
- Mounts your project folders into the containers for live development.
Step 8: Run Everything
Goal: Start the entire environment and trigger your workflow.
Initialize Airflow:
docker-compose up airflow-init
Then start all services:
docker-compose up
When it's running, open your browser to http://localhost:8080.
Log in using:
Username: airflow
Password: airflow
From the Airflow UI, unpause the dbt_workflow DAG and trigger it.
You'll see three tasks (dbt_seed, dbt_run, and dbt_test), each turning green as it succeeds.
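If you prefer the command line, the same steps can be run from inside a container; a sketch using the docker-compose commands from the previous step:
# unpause the DAG (it starts paused because DAGS_ARE_PAUSED_AT_CREATION is 'true')
docker-compose exec airflow-webserver airflow dags unpause dbt_workflow
# trigger a run immediately
docker-compose exec airflow-webserver airflow dags trigger dbt_workflow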
Optional Verification:
Run this command to open a psql session against Postgres:
PGPASSWORD=password psql -h localhost -p 5432 -U postgres -d sampledb
Querying the orders table should return rows only for completed orders, confirming that your dbt model worked end to end.
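For example, inside the psql session (assuming the default public schema set in profiles.yml):
-- rows produced by the orders model
select order_id, customer_id, order_date, total_amount
from public.orders
order by order_id;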
Advanced: Connecting to Cloud Warehouses
Extend the same setup to use BigQuery, Snowflake, or Microsoft Fabric.
- Update profiles.yml with credentials for the target warehouse (see the sketch after this list).
- Use Airflow Connections for key management.
- Run your workflows through managed Airflow services like MWAA or Cloud Composer to reduce infrastructure maintenance.
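For illustration, a Snowflake target in profiles.yml could look roughly like the sketch below; every value is a placeholder, and you would also need to install the matching adapter (dbt-snowflake) in the Dockerfile in place of dbt-postgres.
default:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: your_account_identifier   # placeholder
      user: your_user                     # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      database: ANALYTICS
      warehouse: TRANSFORMING
      schema: PUBLIC
      threads: 4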
Conclusion
By combining dbt, Airflow, and Docker, you’ve created a fully automated, tested, and reproducible analytics pipeline.
Together they deliver:
- Data transformations as version-controlled code
- Reliable orchestration and scheduling
- Consistent environments for every developer or CI/CD job
- Easy scalability to cloud setups
This setup forms a strong foundation for production ETL pipelines and analytics workflows — one that scales effortlessly as your data and team grow.