Presentation: Building Scalable Data Pipelines with Docker and Docker Compose

Lagat Josiah


1. Introduction to Containerization

  • Containerization is the process of packaging applications and their dependencies into isolated, portable environments called containers.
  • Each container includes everything needed to run: code, runtime, system tools, libraries, and configurations.
  • Containers ensure consistency across environments — whether in local development, testing, or production.

Key Benefits:

  • Portability
  • Lightweight compared to virtual machines
  • Faster deployment
  • Simplified dependency management
  • Easy scaling and orchestration

2. Docker Overview

Docker is the leading containerization platform. It enables you to:

  • Build containers using a Dockerfile
  • Manage images and containers
  • Run isolated applications efficiently

Core Docker Concepts:

  • Image – A blueprint of your application (immutable)
  • Container – A running instance of an image
  • Volume – Persistent storage for container data
  • Network – Allows inter-container communication
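These objects also map directly onto code. As a quick, optional illustration (assuming Docker is running locally and the Docker SDK for Python is installed with pip install docker), you can create and inspect them programmatically:

import docker

# Connect to the local Docker daemon
client = docker.from_env()

# Image + container: run a throwaway container from a small public image
output = client.containers.run("alpine:3.19", ["echo", "hello from a container"], remove=True)
print(output.decode())

# Volumes and networks are first-class objects too
print([volume.name for volume in client.volumes.list()])
print([network.name for network in client.networks.list()])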

3. Docker Compose for Multi-Service Applications

Docker Compose simplifies running multi-container applications using a single YAML configuration file.

Example (from the project's docker-compose.yaml):

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./postgres-data:/var/lib/postgresql/data

  airflow:
    build:
      context: .
      dockerfile: Dockerfile.airflow
    env_file:
      - .env
    depends_on:
      - postgres
    ports:
      - "8080:8080"

This configuration:

  • Spins up PostgreSQL and Airflow containers
  • Passes environment variables via .env
  • Defines inter-service dependencies
  • Exposes ports for web UI access

4. The Role of Docker in Data Pipelines

In modern data engineering, Docker provides:

  • Isolation: Each pipeline component (e.g., API extraction, Spark transformation, PostgreSQL loading) runs independently.
  • Reproducibility: Consistent environments across all developers.
  • Scalability: Multiple containers can run parallel tasks efficiently.

Two DAGs from this project serve as real-world examples:

  1. youtube_analytics_dag.py — Automates ingestion and transformation of YouTube channel data using Airflow and PySpark.
  2. stock_market_dag.py — Fetches, transforms, and loads stock data from Polygon API to PostgreSQL.

Both DAGs are orchestrated by Apache Airflow, running inside a Dockerized environment for reliability.
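To make the orchestration concrete, here is a minimal sketch of how such a DAG is wired together in Airflow. The DAG id, task names, and callables below are simplified stand-ins, not the exact contents of the project files:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real DAGs implement the full extract/transform/load logic
def extract(**context): ...
def transform(**context): ...
def load(**context): ...

with DAG(
    dag_id="stock_market_pipeline",   # illustrative id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task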


5. Step-by-Step Pipeline Development with Docker

Step 1: Create a Dockerfile

A Dockerfile defines how to build your image.

Example:

# Base image: official Apache Airflow
FROM apache/airflow:2.7.3

# System-level packages must be installed as root
USER root
RUN apt-get update && apt-get install -y python3-pip && rm -rf /var/lib/apt/lists/*

# Python dependencies are installed as the airflow user, per the official image guidance
USER airflow
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

# Copy DAGs and supporting scripts into the image
COPY dags/ /opt/airflow/dags/
COPY scripts/ /opt/airflow/scripts/

Explanation:

  • Uses the official Airflow image as a base.
  • Installs dependencies.
  • Copies DAGs and scripts into the Airflow container.

Step 2: Add a requirements.txt

List all Python dependencies required for your project:

apache-airflow==2.7.3
pandas
requests
sqlalchemy
pyspark
psycopg2-binary

Step 3: Configure Environment Variables (.env)

Sensitive credentials and connection details are managed via .env:

POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POLYGON_API_KEY=your_polygon_api_key
YOUTUBE_API_KEY=your_youtube_api_key
YOUTUBE_CHANNEL_ID=your_channel_id
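Inside the Compose network, services reach each other by service name, which is why POSTGRES_HOST is set to postgres rather than localhost. As a sketch of how a task can consume these variables (the helper get_engine is illustrative, not taken from the project):

import os

from sqlalchemy import create_engine


def get_engine():
    """Build a SQLAlchemy engine from the variables supplied via .env."""
    user = os.getenv("POSTGRES_USER")
    password = os.getenv("POSTGRES_PASSWORD")
    host = os.getenv("POSTGRES_HOST", "postgres")  # Compose service name, not localhost
    port = os.getenv("POSTGRES_PORT", "5432")
    db = os.getenv("POSTGRES_DB")
    return create_engine(f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}")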

Step 4: Define Services with docker-compose.yaml

This orchestrates Airflow, PostgreSQL, and any Spark or supporting services.

Run the stack:

docker compose up -d

Airflow will automatically detect your DAGs (e.g., stock_market_dag.py, youtube_analytics_dag.py) and schedule tasks.


Step 5: Develop and Deploy the Pipeline

  • Extract Phase: Fetch data from APIs
    Example: extract_stock_data() in stock_market_dag.py uses requests with API keys stored in .env.

  • Transform Phase: Clean, validate, and enrich data
    Example: transform_stock_data() in the DAG leverages Pandas.

  • Load Phase: Push data to databases
    Example: load_to_postgres() loads into PostgreSQL using SQLAlchemy.

  • Validation Phase: Verify pipeline results
    Example: validate_pipeline_execution() performs a final data-integrity check and logs a run summary.
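A condensed sketch of what these phases can look like in code is shown below. The endpoint, column names, and table name are illustrative assumptions, not the actual contents of stock_market_dag.py:

import os

import pandas as pd
import requests
from sqlalchemy import create_engine


def extract_stock_data(ticker: str = "AAPL") -> list:
    """Extract: fetch previous-day aggregates from the Polygon API (illustrative endpoint)."""
    url = f"https://api.polygon.io/v2/aggs/ticker/{ticker}/prev"
    resp = requests.get(url, params={"apiKey": os.getenv("POLYGON_API_KEY")}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", [])


def transform_stock_data(records: list) -> pd.DataFrame:
    """Transform: clean and reshape the raw records with Pandas."""
    df = pd.DataFrame(records)
    df = df.rename(columns={"o": "open", "h": "high", "l": "low", "c": "close", "v": "volume"})
    return df.dropna()


def load_to_postgres(df: pd.DataFrame, table: str = "stock_prices") -> None:
    """Load: append the transformed frame to PostgreSQL via SQLAlchemy."""
    # Connection details come from the .env variables shown in Step 3
    url = (
        f"postgresql+psycopg2://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}"
        f"@{os.getenv('POSTGRES_HOST', 'postgres')}:{os.getenv('POSTGRES_PORT', '5432')}"
        f"/{os.getenv('POSTGRES_DB')}"
    )
    df.to_sql(table, create_engine(url), if_exists="append", index=False)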


6. Running and Monitoring Containers

Useful Commands:

docker ps              # List running containers
docker logs airflow    # Check logs
docker exec -it airflow bash  # Enter container shell
docker compose down    # Stop and remove all containers

The Airflow UI is available at:
👉 http://localhost:8080


7. Benefits of This Approach

  • Consistency – Identical environment for all developers
  • Automation – Pipelines run on schedule via Airflow
  • Portability – Easily deployed on a cloud or local machine
  • Scalability – Add more containers for different pipelines
  • Maintainability – Easy updates to individual services

8. Conclusion

Docker and Docker Compose revolutionize how we design and deploy data pipelines.
By encapsulating tools like Airflow, Spark, and PostgreSQL within containers, we achieve reliability, reproducibility, and simplicity in managing end-to-end analytics systems.

The examples above — the Stock Market ETL and the YouTube Analytics Pipeline — demonstrate real-world, production-ready containerized workflows that run seamlessly under Docker.

