Data engineers juggle multiple tools — databases, ETL scripts, schedulers, APIs — each with its own dependencies.
Containerization makes it easy to run everything consistently, anywhere.
Let’s see how you can containerize a simple data pipeline using Docker and Docker Compose.
🚀 Why Containerize?
- ✅ Consistent environments (no “works on my machine” issues)
- ⚙️ Easy orchestration of multiple services
- 🧩 Simple scaling and local testing
- 🔁 Reproducible pipelines for dev & prod
Step 1: Project Structure
Create a new project folder with this structure:
sales-data-pipeline/
├── data/
│   └── sales.csv
├── app/
│   ├── requirements.txt
│   └── etl.py
├── docker-compose.yml
└── Dockerfile
Sample Data (data/sales.csv)
date,product,quantity,price
2023-01-01,Widget A,10,15.99
2023-01-02,Widget B,5,22.50
2023-01-03,Widget A,8,15.99
Python Dependencies (app/requirements.txt)
pandas==2.0.3
psycopg2-binary==2.9.7
ETL Script (app/etl.py)
import pandas as pd
import psycopg2

# Load CSV
df = pd.read_csv('/app/data/sales.csv')

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="db",
    database="salesdb",
    user="user",
    password="password"
)
cur = conn.cursor()

# Create table
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        date DATE,
        product VARCHAR(50),
        quantity INT,
        price NUMERIC
    );
""")

# Insert data (cast NumPy scalars to plain Python types so psycopg2 can adapt them)
for _, row in df.iterrows():
    cur.execute(
        "INSERT INTO sales VALUES (%s, %s, %s, %s)",
        (row['date'], row['product'], int(row['quantity']), float(row['price']))
    )

conn.commit()
cur.close()
conn.close()
print("Data loaded successfully!")
💡 Note: The host is "db", which must match the name of the PostgreSQL service we will define in Docker Compose (not the image name, postgres).
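Row-by-row inserts are fine for this tiny CSV. If the file grows, a batched variant using psycopg2's executemany is a small change. Here is a minimal sketch that reuses the df, cur, and conn objects from the script above:

# Batched insert (sketch): build plain-Python tuples, then send them in one call
records = [
    (row.date, row.product, int(row.quantity), float(row.price))
    for row in df.itertuples(index=False)
]
cur.executemany("INSERT INTO sales VALUES (%s, %s, %s, %s)", records)
conn.commit()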
Step 2: Write the Dockerfile
The Dockerfile defines how to build your Python application container.
# Use an official Python runtime as base image
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Copy requirements and install dependencies
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the app
COPY app/ .
COPY data/ ./data/
# Run the ETL script when container starts
CMD ["python", "etl.py"]
This image:
- Starts from a lightweight Python image
- Installs only what’s needed
- Copies your code and data
- Runs the ETL script automatically
Step 3: Orchestrate with Docker Compose
Now, let’s define our full environment, including PostgreSQL, in docker-compose.yml:
version: '3.8'

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: salesdb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  etl:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data

volumes:
  postgres_data:
What’s happening here?
- db service: Runs PostgreSQL with our credentials and persists data using a named volume.
- etl service: Builds from your Dockerfile and starts after the database container (depends_on). Note that depends_on only waits for the container to start, not for PostgreSQL to accept connections; see the retry sketch after this list.
- Volumes: Share the data/ folder so your CSV is accessible inside the container.
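Because of that, the ETL container can occasionally start before PostgreSQL is ready to accept connections. One lightweight fix is a retry loop at the top of etl.py, replacing the direct psycopg2.connect call. The sketch below assumes the same credentials as above; connect_with_retry is just an illustrative helper name:

import time
import psycopg2

def connect_with_retry(retries=10, delay=3):
    # Keep trying to connect until PostgreSQL is ready (or we give up)
    for attempt in range(retries):
        try:
            return psycopg2.connect(
                host="db", database="salesdb", user="user", password="password"
            )
        except psycopg2.OperationalError:
            print(f"Database not ready (attempt {attempt + 1}/{retries}), retrying...")
            time.sleep(delay)
    raise RuntimeError("Could not connect to PostgreSQL")

conn = connect_with_retry()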
Step 4: Run the Pipeline
From your project root, run:
docker-compose up --build
You should see output like:
etl_1 | Data loaded successfully!
To verify the data:
- Connect to PostgreSQL on localhost:5432 (you can use psql or a GUI like DBeaver or DataGrip by JetBrains, like me 😉)
- Run:
SELECT * FROM sales;
You’ll see your CSV data—loaded entirely through containers!
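If you prefer to check from Python rather than a GUI, here is a quick sketch you could run on your host machine; it assumes the 5432 port mapping from the Compose file and psycopg2 installed locally:

import psycopg2

# Connect through the published port on the host
conn = psycopg2.connect(
    host="localhost", port=5432,
    database="salesdb", user="user", password="password"
)
cur = conn.cursor()
cur.execute("SELECT product, SUM(quantity * price) AS revenue FROM sales GROUP BY product;")
for product, revenue in cur.fetchall():
    print(product, revenue)
cur.close()
conn.close()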
To stop and clean up:
docker-compose down -v # -v removes the volume (optional)
Bonus: Make It Reusable
Want to run this pipeline daily? Add a scheduler like Apache Airflow in another container. The beauty of Docker Compose is that you can add new services without touching your core logic. (We shall cover this another time; for now we stick to the basics.)
Congratulations! Your entire data stack (database, ETL, and analysis) is now containerized!
Common Pitfalls & Tips
- 🚫 Don’t hardcode passwords in production → Use Docker secrets or environment files (.env); see the sketch after this list.
- 🚫 Avoid large images → Use .dockerignore and multi-stage builds for complex apps.
- ✅ Test locally first → Run docker build . before docker-compose up.
- ✅ Use volume mounts for dev → Edit code without rebuilding the image.
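For example, instead of hardcoding credentials in etl.py, you could read them from the environment and supply them through the environment: or env_file: keys in docker-compose.yml. A minimal sketch (the variable names such as DB_HOST are just illustrative):

import os
import psycopg2

# Read connection settings from environment variables, with dev-only fallbacks
conn = psycopg2.connect(
    host=os.environ.get("DB_HOST", "db"),
    database=os.environ.get("POSTGRES_DB", "salesdb"),
    user=os.environ.get("POSTGRES_USER", "user"),
    password=os.environ["POSTGRES_PASSWORD"],  # no fallback: fail fast if it's missing
)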
Conclusion
Containerization isn’t just for DevOps engineers; it’s a superpower for data engineers too.