Aineah Simiyu

Data Engineering with Docker: A Hands-On Guide to Containerization

Data engineers juggle multiple tools — databases, ETL scripts, schedulers, APIs — each with its own dependencies.
Containerization makes it easy to run everything consistently, anywhere.

Let’s walk through containerizing a simple data pipeline with Docker and Docker Compose.

🚀 Why Containerize?

  • ✅ Consistent environments (no “works on my machine” issues)
  • ⚙️ Easy orchestration of multiple services
  • 🧩 Simple scaling and local testing
  • 🔁 Reproducible pipelines for dev & prod

Step 1: Project Structure

Create a new project folder with this structure:

sales-data-pipeline/
├── data/
│   └── sales.csv
├── app/
│   ├── requirements.txt
│   └── etl.py
├── docker-compose.yml
└── Dockerfile

Sample Data (data/sales.csv)

date,product,quantity,price
2023-01-01,Widget A,10,15.99
2023-01-02,Widget B,5,22.50
2023-01-03,Widget A,8,15.99

Python Dependencies (app/requirements.txt)

pandas==2.0.3
psycopg2-binary==2.9.7

ETL Script (app/etl.py)

import pandas as pd
import psycopg2

# Load CSV
df = pd.read_csv('/app/data/sales.csv')

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="db",
    database="salesdb",
    user="user",
    password="password"
)
cur = conn.cursor()

# Create table
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        date DATE,
        product VARCHAR(50),
        quantity INT,
        price NUMERIC
    );
""")

# Insert data
for _, row in df.iterrows():
    cur.execute(
        "INSERT INTO sales VALUES (%s, %s, %s, %s)",
        (row['date'], row['product'], row['quantity'], row['price'])
    )

conn.commit()
cur.close()
conn.close()

print("Data loaded successfully!")

💡 Note: The host is "db" because that is the name we will give the PostgreSQL service in docker-compose.yml; containers on the same Compose network can reach each other by service name.
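
If you'd rather not hardcode connection details at all (see the pitfalls at the end), one option is to read them from environment variables. Here's a minimal sketch; the variable name DB_HOST is my own choice and would need a matching entry under the etl service's environment in docker-compose.yml:

import os

import psycopg2

# Hypothetical variable names; the defaults fall back to the values used in this post
conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "db"),
    database=os.getenv("POSTGRES_DB", "salesdb"),
    user=os.getenv("POSTGRES_USER", "user"),
    password=os.getenv("POSTGRES_PASSWORD", "password"),
)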


Step 2: Write the Dockerfile

The Dockerfile defines how to build your Python application container.

# Use an official Python runtime as base image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Copy requirements and install dependencies
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the app
COPY app/ .
COPY data/ ./data/

# Run the ETL script when container starts
CMD ["python", "etl.py"]

This image:

  • Starts from a lightweight Python image
  • Installs only what’s needed
  • Copies your code and data
  • Runs the ETL script automatically
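
Before wiring everything into Compose, it's worth confirming the image builds cleanly on its own (this mirrors the "test locally first" tip later on). The tag sales-etl is just an example name:

docker build -t sales-etl .

Running this image by itself won't load any data, since the hostname db only resolves on the Compose network; a clean build simply confirms your dependencies install correctly.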

Step 3: Orchestrate with Docker Compose

Now, let’s define our full environment—including PostgreSQL—in docker-compose.yml:

version: '3.8'

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: salesdb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  etl:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data

volumes:
  postgres_data:

What’s happening here?

  • db service: Runs PostgreSQL with our credentials and persists data using a named volume.
  • etl service: Builds from your Dockerfile and starts after the database container. Note that depends_on only controls start order, not readiness; see the retry sketch after this list.
  • Volumes: Share the data/ folder so your CSV is accessible inside the container.
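
Because depends_on doesn't wait for PostgreSQL to actually accept connections, etl.py can occasionally fail on startup. A simple fix is to retry the connection; here's a rough sketch (the retry count and delay are arbitrary choices):

import time

import psycopg2

def connect_with_retry(retries=10, delay=2):
    """Keep trying to connect while the db service finishes starting up."""
    for attempt in range(1, retries + 1):
        try:
            return psycopg2.connect(
                host="db",
                database="salesdb",
                user="user",
                password="password",
            )
        except psycopg2.OperationalError:
            print(f"Database not ready (attempt {attempt}/{retries}), retrying...")
            time.sleep(delay)
    raise RuntimeError("Could not connect to PostgreSQL after several attempts")

conn = connect_with_retry()

Depending on your Compose version, you can achieve the same thing declaratively with a healthcheck on the db service plus the long form of depends_on (condition: service_healthy).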

Step 4: Run the Pipeline

From your project root, run:

docker-compose up --build

You should see output like:

etl_1  | Data loaded successfully!

To verify the data:

  1. Connect to PostgreSQL on localhost:5432 (you can use psql, or a GUI like DBeaver or JetBrains DataGrip, which is my pick 😉)
  2. Run:
   SELECT * FROM sales;

You’ll see your CSV data—loaded entirely through containers!
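
If you'd rather verify from Python than a SQL client, a quick read-back script run on your host machine works too. This is just a convenience sketch; note it connects to localhost:5432, the port published by the db service, not to the db hostname:

import pandas as pd
import psycopg2

# Connect through the port published to the host in docker-compose.yml
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    database="salesdb",
    user="user",
    password="password",
)

# pandas may warn that it prefers a SQLAlchemy connection, but this is fine for a quick check
df = pd.read_sql("SELECT * FROM sales", conn)
print(df)

conn.close()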

To stop and clean up:

docker-compose down -v  # -v removes the volume (optional)

Bonus: Make It Reusable

Want to run this pipeline daily? Add a scheduler like Apache Airflow in another container. The beauty of Docker Compose is that you can add new services without touching your core logic. (We'll cover this another time; for now we stick to the basics.)

Congratulations! Your entire data stack (database and ETL) is now containerized!

Common Pitfalls & Tips

  • 🚫 Don’t hardcode passwords in production → Use Docker secrets or environment files (.env); see the sketch after this list.
  • 🚫 Avoid large images → Use .dockerignore and multi-stage builds for complex apps.
  • Test locally first → Run docker build . before docker-compose up.
  • Use volume mounts for dev → Edit code without rebuilding the image.
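
For the first tip, Docker Compose substitutes ${...} variables in docker-compose.yml from a .env file sitting next to it, which keeps credentials out of version-controlled YAML. A rough sketch (the value is a placeholder):

# .env  (add this file to .gitignore)
POSTGRES_PASSWORD=changeme

# docker-compose.yml (excerpt of the db service)
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: salesdb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}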

Conclusion

Containerization isn’t just for DevOps engineers; it’s a superpower for data engineers too.
