Data engineers juggle multiple tools — databases, ETL scripts, schedulers, APIs — each with its own dependencies.
Containerization makes it easy to run everything consistently, anywhere.
Let’s see how you can containerize a simple data pipeline using Docker and Docker Compose.
🚀 Why Containerize?
- ✅ Consistent environments (no “works on my machine” issues)
- ⚙️ Easy orchestration of multiple services
- 🧩 Simple scaling and local testing
- 🔁 Reproducible pipelines for dev & prod
Step 1: Project Structure
Create a new project folder with this structure:
sales-data-pipeline/
├── data/
│   └── sales.csv
├── app/
│   ├── requirements.txt
│   └── etl.py
├── docker-compose.yml
└── Dockerfile
Sample Data (data/sales.csv)
date,product,quantity,price
2023-01-01,Widget A,10,15.99
2023-01-02,Widget B,5,22.50
2023-01-03,Widget A,8,15.99
Python Dependencies (app/requirements.txt)
pandas==2.0.3
psycopg2-binary==2.9.7
ETL Script (app/etl.py)
import pandas as pd
import psycopg2

# Load CSV
df = pd.read_csv('/app/data/sales.csv')

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="db",
    database="salesdb",
    user="user",
    password="password"
)
cur = conn.cursor()

# Create table
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        date DATE,
        product VARCHAR(50),
        quantity INT,
        price NUMERIC
    );
""")

# Insert data (cast NumPy scalars to plain Python types so psycopg2 can adapt them)
for _, row in df.iterrows():
    cur.execute(
        "INSERT INTO sales VALUES (%s, %s, %s, %s)",
        (row['date'], row['product'], int(row['quantity']), float(row['price']))
    )

conn.commit()
cur.close()
conn.close()
print("Data loaded successfully!")
💡 Note: The host is "db", which must match the name of the PostgreSQL service we will define in Docker Compose (not the image name, postgres).
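Row-by-row inserts are fine for this tiny CSV. If the file grows, a batched variant using psycopg2's executemany is a small change. Here is a minimal sketch that reuses the df, cur, and conn objects from the script above:

# Batched insert (sketch): build plain-Python tuples, then send them in one call
records = [
    (row.date, row.product, int(row.quantity), float(row.price))
    for row in df.itertuples(index=False)
]
cur.executemany("INSERT INTO sales VALUES (%s, %s, %s, %s)", records)
conn.commit()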
Step 2: Write the Dockerfile
The Dockerfile defines how to build your Python application container.
# Use an official Python runtime as base image
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Copy requirements and install dependencies
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the app
COPY app/ .
COPY data/ ./data/
# Run the ETL script when container starts
CMD ["python", "etl.py"]
This image:
- Starts from a lightweight Python image
- Installs only what’s needed
- Copies your code and data
- Runs the ETL script automatically
Step 3: Orchestrate with Docker Compose
Now, let’s define our full environment, including PostgreSQL, in docker-compose.yml:
version: '3.8'

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: salesdb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  etl:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data

volumes:
  postgres_data:
What’s happening here?
- db service: Runs PostgreSQL with our credentials and persists data using a named volume.
- etl service: Builds from your Dockerfile and starts after the database container (depends_on). Note that depends_on only waits for the container to start, not for PostgreSQL to accept connections; see the retry sketch after this list.
- Volumes: Share the data/ folder so your CSV is accessible inside the container.
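Because of that, the ETL container can occasionally start before PostgreSQL is ready to accept connections. One lightweight fix is a retry loop at the top of etl.py, replacing the direct psycopg2.connect call. The sketch below assumes the same credentials as above; connect_with_retry is just an illustrative helper name:

import time
import psycopg2

def connect_with_retry(retries=10, delay=3):
    # Keep trying to connect until PostgreSQL is ready (or we give up)
    for attempt in range(retries):
        try:
            return psycopg2.connect(
                host="db", database="salesdb", user="user", password="password"
            )
        except psycopg2.OperationalError:
            print(f"Database not ready (attempt {attempt + 1}/{retries}), retrying...")
            time.sleep(delay)
    raise RuntimeError("Could not connect to PostgreSQL")

conn = connect_with_retry()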
Step 4: Run the Pipeline
From your project root, run:
docker-compose up --build
You should see output like:
etl_1 | Data loaded successfully!
To verify the data:
- Connect to PostgreSQL on localhost:5432 (you can use psql or a GUI like DBeaver or DataGrip by JetBrains, like me 😉)
- Run:
SELECT * FROM sales;
You’ll see your CSV data—loaded entirely through containers!
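If you prefer to check from Python rather than a GUI, here is a quick sketch you could run on your host machine; it assumes the 5432 port mapping from the Compose file and psycopg2 installed locally:

import psycopg2

# Connect through the published port on the host
conn = psycopg2.connect(
    host="localhost", port=5432,
    database="salesdb", user="user", password="password"
)
cur = conn.cursor()
cur.execute("SELECT product, SUM(quantity * price) AS revenue FROM sales GROUP BY product;")
for product, revenue in cur.fetchall():
    print(product, revenue)
cur.close()
conn.close()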
To stop and clean up:
docker-compose down -v # -v removes the volume (optional)
Bonus: Make It Reusable
Want to run this pipeline daily? Add a scheduler like Apache Airflow in another container. The beauty of Docker Compose is that you can add new services without touching your core logic. (We shall cover this another time; for now we stick to the basics.)
Congratulations! Your entire data stack (database, ETL, and analysis) is now containerized!
Common Pitfalls & Tips
- 🚫 Don’t hardcode passwords in production → Use Docker secrets or environment files (.env); see the sketch after this list.
- 🚫 Avoid large images → Use .dockerignore and multi-stage builds for complex apps.
- ✅ Test locally first → Run docker build . before docker-compose up.
- ✅ Use volume mounts for dev → Edit code without rebuilding the image.
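For example, instead of hardcoding credentials in etl.py, you could read them from the environment and supply them through the environment: or env_file: keys in docker-compose.yml. A minimal sketch (the variable names such as DB_HOST are just illustrative):

import os
import psycopg2

# Read connection settings from environment variables, with dev-only fallbacks
conn = psycopg2.connect(
    host=os.environ.get("DB_HOST", "db"),
    database=os.environ.get("POSTGRES_DB", "salesdb"),
    user=os.environ.get("POSTGRES_USER", "user"),
    password=os.environ["POSTGRES_PASSWORD"],  # no fallback: fail fast if it's missing
)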
Conclusion
Containerization isn’t just for DevOps engineers; it’s a superpower for data engineers too.