Presentation: Building Scalable Data Pipelines with Docker and Docker Compose
1. Introduction to Containerization
- Containerization is the process of packaging applications and their dependencies into isolated, portable environments called containers.
- Each container includes everything needed to run: code, runtime, system tools, libraries, and configurations.
- Containers ensure consistency across environments — whether in local development, testing, or production.
Key Benefits:
- Portability
- Lightweight compared to virtual machines
- Faster deployment
- Simplified dependency management
- Easy scaling and orchestration
2. Docker Overview
Docker is the leading containerization platform. It enables you to:
- Build containers using a Dockerfile
- Manage images and containers
- Run isolated applications efficiently
Core Docker Concepts:
- Image – A blueprint of your application (immutable)
- Container – A running instance of an image
- Volume – Persistent storage for container data
- Network – Allows inter-container communication
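These concepts can also be exercised programmatically. The snippet below is a minimal sketch using the Docker SDK for Python (the docker package, an extra dependency not used elsewhere in this presentation) to pull an image, create a network and a volume, and start a container:
```python
# Minimal sketch using the Docker SDK for Python (pip install docker).
# The resource names here are illustrative, not part of the project.
import docker

client = docker.from_env()  # talks to the local Docker daemon

client.images.pull("postgres:15")                 # Image: immutable blueprint
network = client.networks.create("pipeline-net")  # Network: inter-container communication
volume = client.volumes.create(name="pgdata")     # Volume: persistent storage

container = client.containers.run(                # Container: running instance of an image
    "postgres:15",
    name="demo-postgres",
    detach=True,
    environment={"POSTGRES_PASSWORD": "airflow"},
    network=network.name,
    volumes={volume.name: {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    ports={"5432/tcp": 5432},
)

print(container.name, container.status)
```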
3. Docker Compose for Multi-Service Applications
Docker Compose simplifies running multi-container applications using a single YAML configuration file.
Example (from your docker-compose.yaml):
```yaml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./postgres-data:/var/lib/postgresql/data

  airflow:
    build:
      context: .
      dockerfile: Dockerfile.airflow
    env_file:
      - .env
    depends_on:
      - postgres
    ports:
      - "8080:8080"
```
This configuration:
- Spins up PostgreSQL and Airflow containers
- Passes environment variables via .env
- Defines inter-service dependencies
- Exposes ports for web UI access
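Because Compose places both services on a shared network, the airflow container can reach the database by using the service name postgres as its hostname. The snippet below is a small, hypothetical connectivity check (run from inside the airflow container, using the SQLAlchemy and psycopg2-binary packages listed later in requirements.txt):
```python
# Hypothetical connectivity check from inside the airflow container.
# The hostname "postgres" resolves via Compose's internal DNS.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:airflow@postgres:5432/airflow")

with engine.connect() as conn:
    version = conn.execute(text("SELECT version()")).scalar()
    print("Connected to:", version)
```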
4. The Role of Docker in Data Pipelines
In modern data engineering, Docker provides:
- Isolation: Each pipeline component (e.g., API extraction, Spark transformation, PostgreSQL loading) runs independently.
- Reproducibility: Consistent environments across all developers.
- Scalability: Multiple containers can run parallel tasks efficiently.
The project's DAGs are real-world examples of this:
- youtube_analytics_dag.py automates ingestion and transformation of YouTube channel data using Airflow and PySpark.
- stock_market_dag.py fetches, transforms, and loads stock data from the Polygon API into PostgreSQL.
Both DAGs are orchestrated by Apache Airflow, running inside a Dockerized environment for reliability.
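To show how such a DAG is wired together, here is a simplified sketch (not the actual contents of stock_market_dag.py) that chains the extract, transform, load, and validation steps described in Step 5; the task bodies are left as stubs:
```python
# Simplified wiring sketch of a daily stock-market ETL DAG.
# The stub functions stand in for the real task logic sketched in Step 5.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_stock_data():
    ...  # fetch raw data from the Polygon API


def transform_stock_data():
    ...  # clean and enrich the data with Pandas


def load_to_postgres():
    ...  # write the results to PostgreSQL


def validate_pipeline_execution():
    ...  # final integrity check and summary log


with DAG(
    dag_id="stock_market_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # illustrative schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_stock_data", python_callable=extract_stock_data)
    transform = PythonOperator(task_id="transform_stock_data", python_callable=transform_stock_data)
    load = PythonOperator(task_id="load_to_postgres", python_callable=load_to_postgres)
    validate = PythonOperator(task_id="validate_pipeline_execution", python_callable=validate_pipeline_execution)

    # Extract -> Transform -> Load -> Validate
    extract >> transform >> load >> validate
```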
5. Step-by-Step Pipeline Development with Docker
Step 1: Create a Dockerfile
A Dockerfile defines how to build your image.
Example:
```dockerfile
FROM apache/airflow:2.7.3

# System packages are installed as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Python packages should be installed as the airflow user (per the official image docs)
USER airflow
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

COPY dags/ /opt/airflow/dags/
COPY scripts/ /opt/airflow/scripts/
```
Explanation:
- Uses the official Airflow image as a base.
- Installs dependencies.
- Copies DAGs and scripts into the Airflow container.
Step 2: Add a requirements.txt
List all Python dependencies required for your project:
```
apache-airflow==2.7.3
pandas
requests
sqlalchemy
pyspark
psycopg2-binary
```
Step 3: Configure Environment Variables (.env)
Sensitive credentials and connection details are managed via .env:
```
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POLYGON_API_KEY=your_polygon_api_key
YOUTUBE_API_KEY=your_youtube_api_key
YOUTUBE_CHANNEL_ID=your_channel_id
```
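Docker Compose's env_file setting injects these entries into the containers as ordinary environment variables, so pipeline code can read them without hard-coding secrets. A minimal sketch, assuming the variable names above:
```python
import os

# Variable names match the .env file above; Compose injects them via env_file.
POSTGRES_HOST = os.getenv("POSTGRES_HOST", "postgres")
POSTGRES_PORT = os.getenv("POSTGRES_PORT", "5432")
POSTGRES_DB = os.getenv("POSTGRES_DB", "airflow")
POLYGON_API_KEY = os.environ["POLYGON_API_KEY"]  # fail fast if the key is missing

DB_URI = (
    f"postgresql+psycopg2://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}"
    f"@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"
)
```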
Step 4: Define Services with docker-compose.yaml
This orchestrates Airflow, PostgreSQL, and any Spark or supporting services.
Run the stack:
```bash
docker compose up -d
```
Airflow will automatically detect your DAGs (e.g., stock_market_dag.py, youtube_analytics_dag.py) and schedule their tasks.
Step 5: Develop and Deploy the Pipeline
- Extract Phase: Fetch data from APIs. Example: extract_stock_data() in stock_market_dag.py uses requests with API keys stored in .env.
- Transform Phase: Clean, validate, and enrich data. Example: transform_stock_data() in the DAG leverages Pandas.
- Load Phase: Push data to databases. Example: load_to_postgres() loads into PostgreSQL using SQLAlchemy.
- Validation Phase: A final check (validate_pipeline_execution) ensures data integrity and logs a summary.
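To make the four phases concrete, here is a compact sketch of what such task functions might look like. It is an illustration rather than the project's actual code: the Polygon endpoint, ticker, and table name are assumptions, and in the real DAGs the hand-off between tasks would typically go through XCom or intermediate storage rather than direct function calls.
```python
import os

import pandas as pd
import requests
from sqlalchemy import create_engine

# Assumed endpoint and ticker; the real DAG may use different Polygon routes.
POLYGON_URL = "https://api.polygon.io/v2/aggs/ticker/AAPL/prev"


def extract_stock_data() -> list:
    """Extract: call the Polygon API with the key stored in .env."""
    resp = requests.get(
        POLYGON_URL,
        params={"apiKey": os.environ["POLYGON_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


def transform_stock_data(records: list) -> pd.DataFrame:
    """Transform: clean, validate, and enrich the raw records with Pandas."""
    df = pd.DataFrame(records).dropna()
    df["ingested_at"] = pd.Timestamp.now(tz="UTC")
    return df


def load_to_postgres(df: pd.DataFrame) -> int:
    """Load: write the DataFrame into PostgreSQL via SQLAlchemy."""
    engine = create_engine(
        f"postgresql+psycopg2://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}"
    )
    df.to_sql("stock_prices", engine, if_exists="append", index=False)  # assumed table name
    return len(df)


def validate_pipeline_execution(row_count: int) -> None:
    """Validate: final integrity check that logs a summary."""
    if row_count == 0:
        raise ValueError("No rows were loaded; failing the pipeline run.")
    print(f"Pipeline complete: {row_count} rows loaded into stock_prices.")


if __name__ == "__main__":
    validate_pipeline_execution(load_to_postgres(transform_stock_data(extract_stock_data())))
```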
6. Running and Monitoring Containers
Useful Commands:
```bash
docker ps                      # List running containers
docker logs airflow            # Check logs
docker exec -it airflow bash   # Enter container shell
docker compose down            # Stop and remove all containers
```
Airflow UI available at:
👉 http://localhost:8080
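The stack can also be checked programmatically: the Airflow webserver exposes a /health endpoint that reports scheduler and metadata-database status. A small sketch, assuming the default port mapping above (the exact JSON shape can vary between Airflow versions):
```python
import requests

# Query Airflow's health endpoint through the port published by Docker Compose.
resp = requests.get("http://localhost:8080/health", timeout=10)
resp.raise_for_status()
health = resp.json()

print("Metadatabase:", health["metadatabase"]["status"])
print("Scheduler:", health["scheduler"]["status"])
```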
7. Benefits of This Approach
| Feature | Benefit |
|---|---|
| Consistency | Identical environment for all developers |
| Automation | Pipelines run on schedule via Airflow |
| Portability | Easily deployed in the cloud or on a local machine |
| Scalability | Add more containers for different pipelines |
| Maintainability | Easy updates to individual services |
8. Conclusion
Docker and Docker Compose revolutionize how we design and deploy data pipelines.
By encapsulating tools like Airflow, Spark, and PostgreSQL within containers, we achieve reliability, reproducibility, and simplicity in managing end-to-end analytics systems.
The Stock Market ETL and YouTube Analytics Pipeline examples demonstrate real-world, production-ready containerized workflows that run seamlessly under Docker.