DEV Community

peter muriya

Containerizing Apache Airflow: Building Portable Data Pipelines with Docker

Apache Airflow is one of the most widely used orchestration tools in data engineering. It enables teams to schedule, monitor, and manage complex workflows using Directed Acyclic Graphs, commonly known as DAGs. Running Airflow inside Docker containers improves portability and simplifies environment setup for developers and organizations.

Why Containerize Apache Airflow?

Traditional Airflow installations can be difficult to configure because they require multiple components such as the scheduler, webserver, database, and executor. Docker solves this challenge by packaging all dependencies into isolated environments that are easy to reproduce.

Core Components in a Dockerized Airflow Setup

  • Airflow Webserver
  • Airflow Scheduler
  • Metadata Database
  • Executor
  • ETL Scripts and DAGs

Sample Docker Compose File for Apache Airflow

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-webserver:
    image: apache/airflow:2.9.0
    command: webserver
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    depends_on:
      - postgres
    ports:
      - "8080:8080"

  airflow-scheduler:
    image: apache/airflow:2.9.0
    command: scheduler
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    depends_on:
      - postgres

Each Airflow container needs an explicit command (`webserver` or `scheduler`) and the same metadata-database connection string, otherwise the services start without knowing about each other. Start the stack with `docker compose up -d` and the Airflow UI becomes available at http://localhost:8080.
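Once the stack is running, the webserver's `/health` endpoint reports whether the metadata database and scheduler are reachable. A small sketch of polling it from Python (the URL and port assume the Compose mapping above; the function simply returns False if nothing is listening):

```python
import json
import urllib.request

def airflow_is_healthy(base_url="http://localhost:8080"):
    """Return True if the Airflow webserver reports a healthy metadata DB."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            payload = json.load(resp)
        # The /health endpoint returns JSON with per-component statuses.
        return payload.get("metadatabase", {}).get("status") == "healthy"
    except (OSError, ValueError):
        # Connection refused, timeout, or non-JSON response: not healthy.
        return False
```

This is handy in CI pipelines that need to wait for the containerized stack before running integration tests.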

Example Airflow DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract_data():
    print("Running ETL task")

with DAG(
    dag_id="sample_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # "schedule" replaces the deprecated schedule_interval
    catchup=False,      # do not backfill runs between start_date and today
) as dag:

    task = PythonOperator(
        task_id="extract_task",
        python_callable=extract_data,
    )
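Because the task logic lives in a plain Python function, it can be unit-tested without starting Airflow at all. A minimal sketch that calls the callable directly and checks its output (the function body mirrors the DAG above):

```python
import io
from contextlib import redirect_stdout

def extract_data():
    print("Running ETL task")

# Capture stdout and verify the callable behaves as expected,
# with no Airflow installation or running scheduler required.
buffer = io.StringIO()
with redirect_stdout(buffer):
    extract_data()

assert buffer.getvalue().strip() == "Running ETL task"
```

Keeping business logic in importable functions, with the DAG file acting only as wiring, is what makes this kind of fast local testing possible.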

Advantages of Using Docker with Airflow

  • Portable workflow orchestration
  • Simplified dependency management
  • Easy scaling with Kubernetes integration
  • Improved development consistency
  • Faster testing and deployment

External Resource

Apache Airflow official documentation

Conclusion

Containerizing Apache Airflow provides data engineers with a reliable and portable orchestration platform. By combining Docker and Airflow, teams can create scalable workflows that are easy to deploy, monitor, and maintain across different environments.
