Presentation: Building Scalable Data Pipelines with Docker and Docker Compose
1. Introduction to Containerization
- Containerization is the process of packaging applications and their dependencies into isolated, portable environments called containers.
- Each container includes everything needed to run: code, runtime, system tools, libraries, and configurations.
- Containers ensure consistency across environments — whether in local development, testing, or production.
Key Benefits:
- Portability
- Lightweight compared to virtual machines
- Faster deployment
- Simplified dependency management
- Easy scaling and orchestration
2. Docker Overview
Docker is the leading containerization platform. It enables you to:
- Build containers using a Dockerfile
- Manage images and containers
- Run isolated applications efficiently
Core Docker Concepts:
- Image – A blueprint of your application (immutable)
- Container – A running instance of an image
- Volume – Persistent storage for container data
- Network – Allows inter-container communication
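These concepts can also be exercised programmatically. The snippet below is a minimal sketch using the Docker SDK for Python (the docker package, an extra dependency not used elsewhere in this presentation) to pull an image, create a network and a volume, and start a container:
```python
# Minimal sketch using the Docker SDK for Python (pip install docker).
# The resource names here are illustrative, not part of the project.
import docker

client = docker.from_env()  # talks to the local Docker daemon

client.images.pull("postgres:15")                 # Image: immutable blueprint
network = client.networks.create("pipeline-net")  # Network: inter-container communication
volume = client.volumes.create(name="pgdata")     # Volume: persistent storage

container = client.containers.run(                # Container: running instance of an image
    "postgres:15",
    name="demo-postgres",
    detach=True,
    environment={"POSTGRES_PASSWORD": "airflow"},
    network=network.name,
    volumes={volume.name: {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    ports={"5432/tcp": 5432},
)

print(container.name, container.status)
```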
3. Docker Compose for Multi-Service Applications
Docker Compose simplifies running multi-container applications using a single YAML configuration file.
Example (from your docker-compose.yaml):
```yaml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./postgres-data:/var/lib/postgresql/data

  airflow:
    build:
      context: .
      dockerfile: Dockerfile.airflow
    env_file:
      - .env
    depends_on:
      - postgres
    ports:
      - "8080:8080"
```
This configuration:
- Spins up PostgreSQL and Airflow containers
- Passes environment variables via .env
- Defines inter-service dependencies
- Exposes ports for web UI access
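Because Compose places both services on a shared network, the airflow container can reach the database by using the service name postgres as its hostname. The snippet below is a small, hypothetical connectivity check (run from inside the airflow container, using the SQLAlchemy and psycopg2-binary packages listed later in requirements.txt):
```python
# Hypothetical connectivity check from inside the airflow container.
# The hostname "postgres" resolves via Compose's internal DNS.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:airflow@postgres:5432/airflow")

with engine.connect() as conn:
    version = conn.execute(text("SELECT version()")).scalar()
    print("Connected to:", version)
```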
4. The Role of Docker in Data Pipelines
In modern data engineering, Docker provides:
- Isolation: Each pipeline component (e.g., API extraction, Spark transformation, PostgreSQL loading) runs independently.
- Reproducibility: Consistent environments across all developers.
- Scalability: Multiple containers can run parallel tasks efficiently.
The project's DAGs are real-world examples of this:
- youtube_analytics_dag.py automates ingestion and transformation of YouTube channel data using Airflow and PySpark.
- stock_market_dag.py fetches, transforms, and loads stock data from the Polygon API into PostgreSQL.
Both DAGs are orchestrated by Apache Airflow, running inside a Dockerized environment for reliability.
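To show how such a DAG is wired together, here is a simplified sketch (not the actual contents of stock_market_dag.py) that chains the extract, transform, load, and validation steps described in Step 5; the task bodies are left as stubs:
```python
# Simplified wiring sketch of a daily stock-market ETL DAG.
# The stub functions stand in for the real task logic sketched in Step 5.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_stock_data():
    ...  # fetch raw data from the Polygon API


def transform_stock_data():
    ...  # clean and enrich the data with Pandas


def load_to_postgres():
    ...  # write the results to PostgreSQL


def validate_pipeline_execution():
    ...  # final integrity check and summary log


with DAG(
    dag_id="stock_market_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # illustrative schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_stock_data", python_callable=extract_stock_data)
    transform = PythonOperator(task_id="transform_stock_data", python_callable=transform_stock_data)
    load = PythonOperator(task_id="load_to_postgres", python_callable=load_to_postgres)
    validate = PythonOperator(task_id="validate_pipeline_execution", python_callable=validate_pipeline_execution)

    # Extract -> Transform -> Load -> Validate
    extract >> transform >> load >> validate
```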
5. Step-by-Step Pipeline Development with Docker
Step 1: Create a Dockerfile
A Dockerfile defines how to build your image.
Example:
```dockerfile
FROM apache/airflow:2.7.3

# System packages are installed as root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Python packages should be installed as the airflow user (per the official image docs)
USER airflow
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt

COPY dags/ /opt/airflow/dags/
COPY scripts/ /opt/airflow/scripts/
```
Explanation:
- Uses the official Airflow image as a base.
- Installs dependencies.
- Copies DAGs and scripts into the Airflow container.
Step 2: Add a requirements.txt
List all Python dependencies required for your project:
```
apache-airflow==2.7.3
pandas
requests
sqlalchemy
pyspark
psycopg2-binary
```
Step 3: Configure Environment Variables (.env)
Sensitive credentials and connection details are managed via .env:
```
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POLYGON_API_KEY=your_polygon_api_key
YOUTUBE_API_KEY=your_youtube_api_key
YOUTUBE_CHANNEL_ID=your_channel_id
```
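Docker Compose's env_file setting injects these entries into the containers as ordinary environment variables, so pipeline code can read them without hard-coding secrets. A minimal sketch, assuming the variable names above:
```python
import os

# Variable names match the .env file above; Compose injects them via env_file.
POSTGRES_HOST = os.getenv("POSTGRES_HOST", "postgres")
POSTGRES_PORT = os.getenv("POSTGRES_PORT", "5432")
POSTGRES_DB = os.getenv("POSTGRES_DB", "airflow")
POLYGON_API_KEY = os.environ["POLYGON_API_KEY"]  # fail fast if the key is missing

DB_URI = (
    f"postgresql+psycopg2://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}"
    f"@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"
)
```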
Step 4: Define Services with docker-compose.yaml
This orchestrates Airflow, PostgreSQL, and any Spark or supporting services.
Run the stack:
```bash
docker compose up -d
```
Airflow will automatically detect your DAGs (e.g., stock_market_dag.py, youtube_analytics_dag.py) and schedule their tasks.
Step 5: Develop and Deploy the Pipeline
- Extract Phase: Fetch data from APIs. Example: extract_stock_data() in stock_market_dag.py uses requests with API keys stored in .env.
- Transform Phase: Clean, validate, and enrich data. Example: transform_stock_data() in the DAG leverages Pandas.
- Load Phase: Push data to databases. Example: load_to_postgres() loads into PostgreSQL using SQLAlchemy.
- Validation Phase: A final check (validate_pipeline_execution) ensures data integrity and logs a summary.
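To make the four phases concrete, here is a compact sketch of what such task functions might look like. It is an illustration rather than the project's actual code: the Polygon endpoint, ticker, and table name are assumptions, and in the real DAGs the hand-off between tasks would typically go through XCom or intermediate storage rather than direct function calls.
```python
import os

import pandas as pd
import requests
from sqlalchemy import create_engine

# Assumed endpoint and ticker; the real DAG may use different Polygon routes.
POLYGON_URL = "https://api.polygon.io/v2/aggs/ticker/AAPL/prev"


def extract_stock_data() -> list:
    """Extract: call the Polygon API with the key stored in .env."""
    resp = requests.get(
        POLYGON_URL,
        params={"apiKey": os.environ["POLYGON_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


def transform_stock_data(records: list) -> pd.DataFrame:
    """Transform: clean, validate, and enrich the raw records with Pandas."""
    df = pd.DataFrame(records).dropna()
    df["ingested_at"] = pd.Timestamp.now(tz="UTC")
    return df


def load_to_postgres(df: pd.DataFrame) -> int:
    """Load: write the DataFrame into PostgreSQL via SQLAlchemy."""
    engine = create_engine(
        f"postgresql+psycopg2://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}"
    )
    df.to_sql("stock_prices", engine, if_exists="append", index=False)  # assumed table name
    return len(df)


def validate_pipeline_execution(row_count: int) -> None:
    """Validate: final integrity check that logs a summary."""
    if row_count == 0:
        raise ValueError("No rows were loaded; failing the pipeline run.")
    print(f"Pipeline complete: {row_count} rows loaded into stock_prices.")


if __name__ == "__main__":
    validate_pipeline_execution(load_to_postgres(transform_stock_data(extract_stock_data())))
```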
6. Running and Monitoring Containers
Useful Commands:
```bash
docker ps                      # List running containers
docker logs airflow            # Check logs
docker exec -it airflow bash   # Enter container shell
docker compose down            # Stop and remove all containers
```
Airflow UI available at:
👉 http://localhost:8080
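The stack can also be checked programmatically: the Airflow webserver exposes a /health endpoint that reports scheduler and metadata-database status. A small sketch, assuming the default port mapping above (the exact JSON shape can vary between Airflow versions):
```python
import requests

# Query Airflow's health endpoint through the port published by Docker Compose.
resp = requests.get("http://localhost:8080/health", timeout=10)
resp.raise_for_status()
health = resp.json()

print("Metadatabase:", health["metadatabase"]["status"])
print("Scheduler:", health["scheduler"]["status"])
```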
7. Benefits of This Approach
| Feature | Benefit |
|---|---|
| Consistency | Identical environment for all developers |
| Automation | Pipelines run on schedule via Airflow |
| Portability | Easily deployed in the cloud or on a local machine |
| Scalability | Add more containers for different pipelines |
| Maintainability | Easy updates to individual services |
8. Conclusion
Docker and Docker Compose revolutionize how we design and deploy data pipelines.
By encapsulating tools like Airflow, Spark, and PostgreSQL within containers, we achieve reliability, reproducibility, and simplicity in managing end-to-end analytics systems.
The Stock Market ETL and YouTube Analytics Pipeline examples demonstrate real-world, production-ready containerized workflows that run seamlessly under Docker.