Gathuru_M

Posted on Jun 14

Containerizing an ETL Pipeline with Docker Compose and Clean Git Hygiene

#dataengineering #docker #git #tutorial

In my last post, we broke down the core concepts of Docker and why packaging your code into standardized containers is ideal for avoiding "dependency hell."

In this article, we build a small ETL project and run it with Docker Compose. The project uses the Coinpaprika API to fetch and normalize ticker data.
We will also push the whole project to GitHub using atomic commits to keep our version control history clean. Let’s dive into how it works!

Why Docker Compose?

Instead of managing containers individually, Docker Compose lets us define a multi-container application in a single file, docker-compose.yml.

Our project requires two distinct services/containers:

db (PostgreSQL): The data warehouse destination where our normalized coin data will be loaded.
etl_script (Python): Our custom application container that sends requests to the Coinpaprika API, transforms the JSON response using pandas, and pushes it to our database.
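To make the transform step concrete, here is a minimal sketch of flattening one ticker record into a row ready for loading. The sample record is made up, its field names follow what we understand the Coinpaprika tickers response to look like (an assumption worth checking against the API docs), and normalize_ticker is a hypothetical helper. The project itself uses pandas for this step; plain dictionaries are used here only to keep the sketch dependency-free.

```python
from typing import Any

# Illustrative record: field names are assumed, values are made up.
SAMPLE_TICKER: dict[str, Any] = {
    "id": "btc-bitcoin",
    "name": "Bitcoin",
    "symbol": "BTC",
    "rank": 1,
    "quotes": {
        "USD": {"price": 100.0, "volume_24h": 2500.0, "market_cap": 1000000.0}
    },
}


def normalize_ticker(ticker: dict[str, Any], currency: str = "USD") -> dict[str, Any]:
    """Flatten the nested quote for one currency into a single flat row."""
    # Missing quotes or currencies yield None values instead of raising.
    quote = ticker.get("quotes", {}).get(currency, {})
    return {
        "coin_id": ticker["id"],
        "name": ticker["name"],
        "symbol": ticker["symbol"],
        "rank": ticker.get("rank"),
        "currency": currency,
        "price": quote.get("price"),
        "volume_24h": quote.get("volume_24h"),
        "market_cap": quote.get("market_cap"),
    }


if __name__ == "__main__":
    print(normalize_ticker(SAMPLE_TICKER))
```

Each flat row maps onto one table row in Postgres, which is what makes the nested JSON loadable.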

Step 1: Writing the Configuration Files

To build this, we create a clean directory structure. First, write a Dockerfile for the Python ETL script to ensure it has all the necessary packages installed:

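# Dockerfile
# Slim official Python base image keeps the final image small
FROM python:3.11-slim

# All following commands run relative to /app
WORKDIR /app

# Copy the dependency list first so this layer stays cached until it changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the ETL script last; editing it does not invalidate the pip layer
COPY etl_script.py .

# Run the ETL script when the container starts
CMD ["python", "etl_script.py"]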

Step 2: Keeping Secrets Secret with a `.env` File

Next, add a .env file: a plain-text file of KEY=VALUE pairs that holds sensitive environment variables.
In this case, the file holds our database credentials.

Git does not skip this file on its own, so add .env to .gitignore before the first commit to keep it out of GitHub.

Here is what the .env file looks like:

# .env
DB_USER=crypto_admin
DB_PASSWORD=SuperSecretPassword123
DB_NAME=crypto_warehouse

Step 3: Composing the Multi-Container Network

Next, write the docker-compose.yml file to define the database and the script together.
Notice how it references the variables from our .env file using the ${VARIABLE_NAME} syntax. Docker Compose automatically reads a .env file sitting next to docker-compose.yml and substitutes the values when it parses the file, so the credentials never have to be hard-coded in the configuration:

version: '3.8'

services:
  db:
    image: postgres:15
    container_name: crypto_postgres
    restart: always
    environment:
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_NAME}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  etl_script:
    build: .
    container_name: crypto_etl_runner
    depends_on:
      - db
    environment:
      - DB_HOST=db
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}

volumes:
  postgres_data:
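To build intuition for the ${VARIABLE_NAME} substitution, here is a rough Python approximation of what happens to the file before the containers start. This is a simplified sketch, not how Docker Compose is implemented: parse_env and interpolate are hypothetical helpers, and the real tool also handles quoting and defaults such as ${VAR:-default}, which this version does not.

```python
from string import Template


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def interpolate(snippet: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders with values from env.

    Raises KeyError if a placeholder has no value in env.
    """
    return Template(snippet).substitute(env)


if __name__ == "__main__":
    env = parse_env("# .env\nDB_USER=crypto_admin\nDB_NAME=crypto_warehouse\n")
    print(interpolate("POSTGRES_USER: ${DB_USER}", env))
```

The point to take away is that the substitution happens on the text of docker-compose.yml itself, which is why the same ${DB_USER} can feed both the db and etl_script services.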

Networking in Docker:

Under etl_script, the DB_HOST environment variable is set to db instead of localhost. Because Docker Compose spins up both containers on a shared default network, they can find each other using their service names as hostnames!
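On the Python side, the script can pick these values up from its environment. Below is a minimal sketch of one way to do that, assuming a hypothetical build_db_url helper; DB_PORT and its 5432 default are our own additions, since the compose file above does not set a port variable.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def build_db_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a PostgreSQL connection URL from DB_* environment variables."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like @ or / cannot break the URL.
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    # Inside the Compose network the host is the service name, db.
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")  # DB_PORT is an assumed, optional variable
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    demo_env = {
        "DB_HOST": "db",
        "DB_USER": "crypto_admin",
        "DB_PASSWORD": "example-password",
        "DB_NAME": "crypto_warehouse",
    }
    print(build_db_url(demo_env))
```

Falling back to localhost when DB_HOST is unset lets the same script run outside Docker against the published 5432 port.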

Step 4: Spinning it Up

With our files defined, launching the entire multi-container architecture requires just one command in the terminal:

docker compose up --build

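Docker Compose pulls the Postgres image, builds our custom Python image, substitutes the values from our .env file, creates a default network for the project, and starts the db container followed by etl_script (depends_on controls start order only, not whether Postgres is ready to accept connections).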

Terminal logs from running docker compose up
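Because the ETL container can start before Postgres is accepting connections, a retry loop around the first connection is a common safeguard. The sketch below is one way to write it, not part of the original script: connect_with_retry is a hypothetical helper, and connect stands for whatever function your database driver uses to open a connection.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 10,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as error:  # narrow this to your driver's connection error
            last_error = error
            print(f"Database not ready (attempt {attempt}/{attempts}): {error}")
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(
        f"Database still unreachable after {attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    # Stand-in for a real driver call: fails twice, then succeeds.
    state = {"calls": 0}

    def fake_connect() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("connection refused")
        return "connected"

    print(connect_with_retry(fake_connect, delay_seconds=0.1))
```

An alternative that keeps the script untouched is a healthcheck on the db service combined with a depends_on condition, at the cost of a slightly longer compose file.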

Step 5: Practicing Clean Version Control (Atomic Commits)

Instead of working for three hours and writing a giant, generic commit message like "fixed code and added files", let's practice how to write atomic commits.

An atomic commit means each commit does exactly one logical thing. It makes your GitHub commit history readable, and if something breaks, it’s very easy to roll back to the exact step where things went wrong.

Here is an example of our commit timeline for this project:

idx: Starting point, Initializing project scaffolding
infra: Added Docker Infrastructure: Dockerfile, docker-compose.yml and env vars
feature: Add ETL script with CoinPaprika API integration and PostgreSQL connection handler
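The habit behind a timeline like this is staging only the files that belong to one logical change before each commit. Here is a small sketch of that workflow as a script. The commit messages come from the timeline above, but the grouping of files into commits is our assumption (including the .gitignore and requirements.txt placement), and git_commands and run are hypothetical helpers.

```python
import shlex
import subprocess

# (commit message, files staged for that commit); the file grouping is assumed.
COMMIT_PLAN = [
    (
        "infra: Added Docker Infrastructure: Dockerfile, docker-compose.yml and env vars",
        ["Dockerfile", "docker-compose.yml", ".gitignore"],
    ),
    (
        "feature: Add ETL script with CoinPaprika API integration and PostgreSQL connection handler",
        ["etl_script.py", "requirements.txt"],
    ),
]


def git_commands(plan: list[tuple[str, list[str]]]) -> list[list[str]]:
    """Turn a commit plan into one 'git add' and one 'git commit' per entry."""
    commands: list[list[str]] = []
    for message, paths in plan:
        commands.append(["git", "add", "--", *paths])
        commands.append(["git", "commit", "-m", message])
    return commands


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(git_commands(COMMIT_PLAN))  # dry run: only prints the commands
```

Staging explicit paths instead of running git add . is what keeps each commit limited to one change, and it also makes it harder to commit the .env file by accident.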

Key Takeaways

Volumes prevent data loss: Mounting the named volume postgres_data into the Postgres container means that even if we stop and remove our containers with docker compose down, the crypto data persists in the volume on the host. Running docker compose down -v removes the volume as well, and the data with it.
Environments are modular: Our configuration is now decoupled from the code. As long as .env is listed in .gitignore, the repository on GitHub contains no secrets, and anyone who clones it can plug in their own database credentials by creating a local .env file.

All the best as you continue learning Docker!
