Oliver Samuel

Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose

Introduction

For many aspiring data engineers, Docker sounds intimidating—complex containers, YAML files, and endless docker commands. But here's the truth: Docker isn't just for backend developers. It's your best friend when managing complex data pipelines with multiple moving parts: databases, schedulers, dashboards, and storage systems.

In this guide, I'll demonstrate how I containerized a full YouTube analytics pipeline using Docker and Docker Compose.

The goal? To automate data extraction, transformation, storage, and visualization—all running seamlessly across containers.


Why Containerize Data Pipelines?

Without containers, setting up tools like Airflow, Spark, PostgreSQL, Grafana, and MinIO locally would take hours, since each tool requires its own dependencies and configuration.

With Docker Compose, all these services run together with a single command:

```bash
docker compose up -d
```

Docker creates isolated environments for each service, ensuring portability, consistency, and easy scaling.
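To make "all these services" concrete, here is a trimmed-down sketch of what a compose file for this kind of stack can look like. The images, credentials, and ports below are illustrative placeholders, not the exact values from my repo:

```yaml
# docker-compose.yml (illustrative sketch, not the full stack)
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-data:/var/lib/postgresql/data

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

volumes:
  postgres-data:
```

However many services end up in the file, bringing them up (or tearing them down with docker compose down) stays a single command.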


The Engine-Cartridge Architecture

A key design pattern I used in this project was splitting the setup into two distinct layers:

1. airflow-docker/ → The Engine

This is the core infrastructure. It defines all containers, networks, environment variables, and Airflow services.

Responsibilities:

  • Defines the Docker Compose stack (Airflow + PostgreSQL + Grafana + MinIO + Spark)
  • Acts as the "orchestration engine"
  • Mounts DAGs and pipeline code dynamically (see the mount sketch just below)
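The "mounts DAGs dynamically" part boils down to bind mounts in the compose file: the engine's Airflow containers see the cartridge's code as local folders. A minimal sketch of the idea, with paths assumed from the layout described in this post rather than copied from the repo:

```yaml
# Sketch of the Airflow service's volume mounts (paths are assumptions)
services:
  airflow-scheduler:
    image: apache/airflow:2.9.0
    volumes:
      # The cartridge lives outside the engine; bind-mount its DAGs and
      # pipeline code into Airflow's standard locations
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags
      - ../airflow-youtube-analytics/pipelines:/opt/airflow/pipelines
```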

2. airflow-youtube-analytics/ → The Cartridge

This is the plug-and-play ETL project, which lives outside the engine but connects seamlessly to it.

Think of it like a "cartridge" you can load into the Airflow engine.

Responsibilities:

  • Contains all DAGs and ETL scripts (extract.py, transform.py, load.py); a minimal DAG sketch follows this list
  • Handles API calls, data transformations, and loading logic
  • Can be swapped or extended without touching the engine
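For a feel of what the cartridge contributes, here is a minimal sketch of a DAG like dags/youtube_pipeline.py. The callables, import paths, and schedule are assumptions for illustration; the real logic lives in extract.py, transform.py, and load.py:

```python
# dags/youtube_pipeline.py (minimal sketch; details are assumed)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed entry points exposed by the cartridge's pipeline modules
from pipelines.youtube.extract import extract
from pipelines.youtube.transform import transform
from pipelines.youtube.load import load

with DAG(
    dag_id="youtube_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Extract >> Transform >> Load, matching the flow described in this post
    extract_task >> transform_task >> load_task
```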

Relationship Diagram:

```
+--------------------------+
|  airflow-docker/         |   ---> Engine (Airflow + Services)
|  ├── docker-compose.yml  |
|  ├── .env                |
|  └── dags/ <mount> ------┼──> Mounts DAGs from cartridge
+--------------------------+

            ⬇

+-----------------------------+
| airflow-youtube-analytics/  |  ---> Cartridge (ETL logic)
| ├── pipelines/youtube/      |
| │    ├── extract.py         |
| │    ├── transform.py       |
| │    └── load.py            |
| └── dags/youtube_pipeline.py|
+-----------------------------+
```

With this modular setup:

  • I can add new "cartridges" (projects) like airflow-nasa-apod/ or airflow-weather-analytics/
  • The airflow-docker/ engine never changes—it simply mounts the new DAGs and runs them (one way to wire this up is sketched below)
  • This makes the system scalable and reusable across multiple ETL projects
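One way to add cartridges without touching the engine's services is to mount each project's DAG folder under its own subdirectory; Airflow scans the DAGs folder recursively, so every cartridge's DAGs get picked up. This is an illustrative sketch, not the repo's exact setup:

```yaml
# Sketch: one engine, several cartridges mounted side by side (paths assumed)
services:
  airflow-scheduler:
    image: apache/airflow:2.9.0
    volumes:
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags/youtube
      - ../airflow-nasa-apod/dags:/opt/airflow/dags/nasa_apod
      - ../airflow-weather-analytics/dags:/opt/airflow/dags/weather
```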

Project Setup Overview

Our pipeline components:

| Layer | Tool | Purpose |
| --- | --- | --- |
| Orchestration | Apache Airflow | Automates the ETL workflow |
| Data Storage | MinIO | Acts as a local S3-compatible data lake |
| Transformation | PySpark / Pandas | Cleans and processes raw data |
| Warehouse | PostgreSQL | Stores transformed metrics |
| Visualization | Grafana | Visualizes channel performance |

Architecture Diagram:

[Image: the containerized pipeline architecture.]

Each service runs as a Docker container defined in the docker-compose.yaml.

This approach allowed me to test and run everything from extraction to Grafana visualization on my local machine.


Container Orchestration in Action

Here's how the services are spun up together:

```bash
cd ~/airflow-docker
docker compose up -d
```

To sync environment variables between the project and the containers:

```bash
./sync_env.sh
```
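The repo has the actual script; the gist is to keep one source of truth for credentials and copy it to where Compose expects it before starting the stack. A minimal sketch of what such a script might do (the paths and the direction of the copy are assumptions):

```bash
#!/usr/bin/env bash
# sync_env.sh (sketch) -- assumes the cartridge holds the canonical .env
set -euo pipefail

CARTRIDGE_ENV="$HOME/airflow-youtube-analytics/.env"
ENGINE_ENV="$HOME/airflow-docker/.env"

# Copy the cartridge's environment file into the engine directory so
# docker compose picks it up on the next `docker compose up -d`
cp "$CARTRIDGE_ENV" "$ENGINE_ENV"
echo "Synced $CARTRIDGE_ENV -> $ENGINE_ENV"
```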

Result:

  • Airflow runs the DAG (Extract >> Transform >> Load)
  • Spark handles transformations
  • Data is stored in PostgreSQL and visualized in Grafana
  • All communication happens inside containers through a shared Docker network
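That last point deserves a quick illustration: on the shared Compose network, containers reach each other by service name instead of localhost. A load step might write to the warehouse like this (the hostname, credentials, and table name are placeholders, not the project's real values):

```python
# load.py (sketch) -- connect to the Postgres container by its service name
import pandas as pd
from sqlalchemy import create_engine

# "postgres" resolves via Docker's internal DNS on the shared network;
# requires the psycopg2 driver installed in the Airflow image
engine = create_engine("postgresql+psycopg2://airflow:airflow@postgres:5432/analytics")

def load(df: pd.DataFrame) -> None:
    # Append transformed metrics into the warehouse table
    df.to_sql("youtube_channel_metrics", engine, if_exists="append", index=False)
```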

Running Containers:

[Image: all containers running simultaneously via Docker Compose.]


Key Takeaways

  • Docker simplifies multi-service setup for data engineering projects
  • Containerized Airflow pipelines are reproducible and portable
  • Local MinIO + PostgreSQL simulate a cloud-style data lake and warehouse on your own machine
  • With Docker Compose, you can spin up a production-grade analytics stack in minutes

Conclusion

Containerization removes the friction between development and deployment. Instead of juggling tool installations, Docker lets you focus on what matters: data flow, not setup.

If you've ever been scared to touch Docker, this is your sign:

Start with one project, one docker-compose.yaml, and build from there.

By the end, you'll realize containers don't complicate data pipelines—they liberate them.


You can explore the complete codebase and pipeline setup in my GitHub repository.

GitHub Repo
