Oliver Samuel

Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose

Introduction

For many aspiring data engineers, Docker sounds intimidating—complex containers, YAML files, and endless docker commands. But here's the truth: Docker isn't just for backend developers. It's your best friend when managing complex data pipelines with multiple moving parts: databases, schedulers, dashboards, and storage systems.

In this guide, I'll demonstrate how I containerized a full YouTube analytics pipeline using Docker and Docker Compose.

The goal? To automate data extraction, transformation, storage, and visualization—all running seamlessly across containers.


Why Containerize Data Pipelines?

Without containers, setting up tools like Airflow, Spark, PostgreSQL, Grafana, and MinIO locally would take hours, since each tool requires its own dependencies and configuration.

With Docker Compose, all these services run together with a single command:

```bash
docker compose up -d
```

Docker creates isolated environments for each service, ensuring portability, consistency, and easy scaling.
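To make "all these services" concrete, here is a trimmed-down sketch of what a compose file for this kind of stack can look like. The images, credentials, and ports below are illustrative placeholders, not the exact values from my repo:

```yaml
# docker-compose.yml (illustrative sketch, not the full stack)
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-data:/var/lib/postgresql/data

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

volumes:
  postgres-data:
```

However many services end up in the file, bringing them up (or tearing them down with docker compose down) stays a single command.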


The Engine-Cartridge Architecture

A key design pattern I used in this project was splitting the setup into two distinct layers:

1. airflow-docker/ → The Engine

This is the core infrastructure. It defines all containers, networks, environment variables, and Airflow services.

Responsibilities:

  • Defines the Docker Compose stack (Airflow + PostgreSQL + Grafana + MinIO + Spark)
  • Acts as the "orchestration engine"
  • Mounts DAGs and pipeline code dynamically (see the mount sketch just below)
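The "mounts DAGs dynamically" part boils down to bind mounts in the compose file: the engine's Airflow containers see the cartridge's code as local folders. A minimal sketch of the idea, with paths assumed from the layout described in this post rather than copied from the repo:

```yaml
# Sketch of the Airflow service's volume mounts (paths are assumptions)
services:
  airflow-scheduler:
    image: apache/airflow:2.9.0
    volumes:
      # The cartridge lives outside the engine; bind-mount its DAGs and
      # pipeline code into Airflow's standard locations
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags
      - ../airflow-youtube-analytics/pipelines:/opt/airflow/pipelines
```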

2. airflow-youtube-analytics/ → The Cartridge

This is the plug-and-play ETL project, which lives outside the engine but connects seamlessly to it.

Think of it like a "cartridge" you can load into the Airflow engine.

Responsibilities:

  • Contains all DAGs and ETL scripts (extract.py, transform.py, load.py); a minimal DAG sketch follows this list
  • Handles API calls, data transformations, and loading logic
  • Can be swapped or extended without touching the engine
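For a feel of what the cartridge contributes, here is a minimal sketch of a DAG like dags/youtube_pipeline.py. The callables, import paths, and schedule are assumptions for illustration; the real logic lives in extract.py, transform.py, and load.py:

```python
# dags/youtube_pipeline.py (minimal sketch; details are assumed)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed entry points exposed by the cartridge's pipeline modules
from pipelines.youtube.extract import extract
from pipelines.youtube.transform import transform
from pipelines.youtube.load import load

with DAG(
    dag_id="youtube_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Extract >> Transform >> Load, matching the flow described in this post
    extract_task >> transform_task >> load_task
```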

Relationship Diagram:

```
+--------------------------+
|  airflow-docker/         |   ---> Engine (Airflow + Services)
|  ├── docker-compose.yml  |
|  ├── .env                |
|  └── dags/ <mount> ------┼──> Mounts DAGs from cartridge
+--------------------------+

            ⬇

+-----------------------------+
| airflow-youtube-analytics/  |  ---> Cartridge (ETL logic)
| ├── pipelines/youtube/      |
| │    ├── extract.py         |
| │    ├── transform.py       |
| │    └── load.py            |
| └── dags/youtube_pipeline.py|
+-----------------------------+
```

With this modular setup:

  • I can add new "cartridges" (projects) like airflow-nasa-apod/ or airflow-weather-analytics/
  • The airflow-docker/ engine never changes—it simply mounts the new DAGs and runs them (one way to wire this up is sketched below)
  • This makes the system scalable and reusable across multiple ETL projects
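One way to add cartridges without touching the engine's services is to mount each project's DAG folder under its own subdirectory; Airflow scans the DAGs folder recursively, so every cartridge's DAGs get picked up. This is an illustrative sketch, not the repo's exact setup:

```yaml
# Sketch: one engine, several cartridges mounted side by side (paths assumed)
services:
  airflow-scheduler:
    image: apache/airflow:2.9.0
    volumes:
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags/youtube
      - ../airflow-nasa-apod/dags:/opt/airflow/dags/nasa_apod
      - ../airflow-weather-analytics/dags:/opt/airflow/dags/weather
```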

Project Setup Overview

Our pipeline components:

| Layer | Tool | Purpose |
| --- | --- | --- |
| Orchestration | Apache Airflow | Automates the ETL workflow |
| Data Storage | MinIO | Acts as a local S3-compatible data lake |
| Transformation | PySpark / Pandas | Cleans and processes raw data |
| Warehouse | PostgreSQL | Stores transformed metrics |
| Visualization | Grafana | Visualizes channel performance |

Architecture Diagram:

[Image: the containerized pipeline architecture.]

Each service runs as a Docker container defined in the docker-compose.yaml.

This approach allowed me to test and run everything from extraction to Grafana visualization on my local machine.


Container Orchestration in Action

Here's how the services are spun up together:

```bash
cd ~/airflow-docker
docker compose up -d
```

To sync environment variables between the project and the containers:

```bash
./sync_env.sh
```
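The repo has the actual script; the gist is to keep one source of truth for credentials and copy it to where Compose expects it before starting the stack. A minimal sketch of what such a script might do (the paths and the direction of the copy are assumptions):

```bash
#!/usr/bin/env bash
# sync_env.sh (sketch) -- assumes the cartridge holds the canonical .env
set -euo pipefail

CARTRIDGE_ENV="$HOME/airflow-youtube-analytics/.env"
ENGINE_ENV="$HOME/airflow-docker/.env"

# Copy the cartridge's environment file into the engine directory so
# docker compose picks it up on the next `docker compose up -d`
cp "$CARTRIDGE_ENV" "$ENGINE_ENV"
echo "Synced $CARTRIDGE_ENV -> $ENGINE_ENV"
```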

Result:

  • Airflow runs the DAG (Extract >> Transform >> Load)
  • Spark handles transformations
  • Data is stored in PostgreSQL and visualized in Grafana
  • All communication happens inside containers through a shared Docker network
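That last point deserves a quick illustration: on the shared Compose network, containers reach each other by service name instead of localhost. A load step might write to the warehouse like this (the hostname, credentials, and table name are placeholders, not the project's real values):

```python
# load.py (sketch) -- connect to the Postgres container by its service name
import pandas as pd
from sqlalchemy import create_engine

# "postgres" resolves via Docker's internal DNS on the shared network;
# requires the psycopg2 driver installed in the Airflow image
engine = create_engine("postgresql+psycopg2://airflow:airflow@postgres:5432/analytics")

def load(df: pd.DataFrame) -> None:
    # Append transformed metrics into the warehouse table
    df.to_sql("youtube_channel_metrics", engine, if_exists="append", index=False)
```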

Running Containers:

[Image: all containers running simultaneously via Docker Compose.]


Key Takeaways

  • Docker simplifies multi-service setup for data engineering projects
  • Containerized Airflow pipelines are reproducible and portable
  • Local MinIO + PostgreSQL simulate a cloud-style data lake and warehouse on your own machine
  • With Docker Compose, you can spin up a production-grade analytics stack in minutes

Conclusion

Containerization removes the friction between development and deployment. Instead of juggling tool installations, Docker lets you focus on what matters: data flow, not setup.

If you've ever been scared to touch Docker, this is your sign:

Start with one project, one docker-compose.yaml, and build from there.

By the end, you'll realize containers don't complicate data pipelines—they liberate them.


You can explore the complete codebase and pipeline setup in my GitHub repository.

GitHub Repo
