DEV Community

Abdi Omari

Dockerizing a crypto ETL pipeline

You know that moment when you switch machines and suddenly nothing works?

You installed Postgres on your laptop months ago, forgot the setup steps, and now you're on a different machine trying to remember which version you had, what the auth config looked like, and why psql won't connect. Multiply that by every tool in your stack and you get the classic "works on my machine" problem.

Docker fixes this by packaging the environment along with the code. No more reinstalling Postgres from scratch, no more guessing which Python version broke your psycopg2 build. You define it once, and it runs the same everywhere.

For this project, I used Docker to package a small crypto ETL pipeline: pull live prices from the CoinPaprika API, clean the data, and load it into Postgres. Nothing fancy, but it's a good example of the pattern.

From notebook to modules

Every project like this starts messy, and mine was no different. I built it first in a test notebook, iterating cell by cell until the extraction, transformation, and load steps actually worked end to end. From there I collapsed everything into one script with hardcoded credentials and no separation of concerns.

That's fine for a first pass. But a single script that mixes API calls, dataframe cleanup, and database credentials in one file is hard to test and impossible to reuse. So the next step was splitting it up:

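# extract.py
import requests

def extract():
    url = "https://api.coinpaprika.com/v1/tickers/{}"
    coin_ids = ['btc-bitcoin', 'usdt-tether', 'sol-solana']
    coin_data = []
    for coin_id in coin_ids:
        # requests has no default timeout, so set one to avoid hanging forever
        response = requests.get(url.format(coin_id), timeout=10)
        # Fail loudly on 4xx/5xx instead of loading an error payload
        response.raise_for_status()
        coin_data.append(response.json())
    return coin_data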
# transform.py
import pandas as pd

def transform(coin_data):
    coin_df = pd.json_normalize(coin_data)
    # Drop percent-change columns the pipeline doesn't need
    coin_df.drop(
        columns=[
            'quotes.USD.percent_change_30m',
            'quotes.USD.percent_change_6h',
            'quotes.USD.percent_change_12h',
            'quotes.USD.percent_change_7d',
        ],
        inplace=True,
    )
    return coin_df
# load.py
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()

def load(coin_df):
    DATABASE_NAME = os.getenv('POSTGRES_DB')
    USER = os.getenv('POSTGRES_USER')
    PORT = os.getenv('PORT')
    DATABASE_PASSWORD = os.getenv('POSTGRES_PASSWORD')
    HOST = os.getenv('HOST')


    engine = create_engine(f'postgresql+psycopg2://{USER}:{DATABASE_PASSWORD}@{HOST}:{PORT}/{DATABASE_NAME}')
    coin_df.to_sql('crypto_etl', engine, if_exists='append', index=False)
# main.py
from extract import extract
from transform import transform
from load import load

def main():
    raw_data = extract()
    transform_data = transform(raw_data)
    load(transform_data)
    print("Pipeline run is successful")

if __name__ == '__main__':
    main()

extract.py calls the CoinPaprika tickers endpoint once per coin and returns the raw JSON. transform.py takes that raw JSON and turns it into a clean dataframe, dropping columns like percent_change_30m that add noise without adding value. load.py handles the database connection and writes to Postgres. main.py just wires the three together.

Three functions, one job each. If the API changes its response format, I know exactly which file to open.
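
Part of why a response-format change lands in transform.py: the dotted column names it drops come from pd.json_normalize, which flattens nested JSON objects and joins the keys with a dot. Here is a rough stdlib-only approximation of that flattening for a single record. The flatten helper and the sample values are mine for illustration, and the real json_normalize handles more cases than this.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects such as "quotes" -> "USD".
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Trimmed, made-up record in the shape the tickers endpoint returns.
sample = {
    "id": "btc-bitcoin",
    "symbol": "BTC",
    "quotes": {"USD": {"price": 100.0, "percent_change_30m": 0.1}},
}

flat = flatten(sample)
print(sorted(flat))
# ['id', 'quotes.USD.percent_change_30m', 'quotes.USD.price', 'symbol']
```

So if the API ever renames or moves the quotes block, the column names in transform.py are the only thing that needs to follow.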

Getting the credentials out of the code

The hardcoded version had DATABASE_PASSWORD = '1234' sitting right in pipeline.py. That's a habit worth breaking early, so load.py reads everything from environment variables instead:

DATABASE_NAME = os.getenv('POSTGRES_DB')
USER = os.getenv('POSTGRES_USER')
PORT = os.getenv('PORT')
DATABASE_PASSWORD = os.getenv('POSTGRES_PASSWORD')
HOST = os.getenv('HOST')

Those values live in a .env file, which never gets committed thanks to .gitignore and never gets copied into the image thanks to .dockerignore. Two ignore files, two different jobs: .gitignore keeps the values out of git, and .dockerignore keeps them out of the container image.
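
One thing to watch with os.getenv: it returns None when a variable is not set, so a missing entry in .env quietly ends up as the literal text "None" inside the connection string. Below is a minimal sketch of a fail-fast helper. build_db_url and REQUIRED_VARS are hypothetical names, not part of the project, and the example values are placeholders.

```python
import os
from urllib.parse import quote_plus

# Hypothetical list of the variables load.py reads.
REQUIRED_VARS = ("POSTGRES_USER", "POSTGRES_PASSWORD", "HOST", "PORT", "POSTGRES_DB")


def build_db_url(env=None):
    """Build the SQLAlchemy URL, failing fast if a variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Quote the credentials so characters like '@' or '/' don't break the URL.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{env['HOST']}:{env['PORT']}/{env['POSTGRES_DB']}"
    )


# Placeholder values for illustration only.
example_env = {
    "POSTGRES_USER": "etl",
    "POSTGRES_PASSWORD": "p@ss/word",
    "HOST": "db",
    "PORT": "5432",
    "POSTGRES_DB": "crypto",
}
print(build_db_url(example_env))
# postgresql+psycopg2://etl:p%40ss%2Fword@db:5432/crypto
```

In load.py, the result could be passed straight to create_engine in place of the inline f-string.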

Wiring up Docker

The Dockerfile itself is short. Python 3.12 slim base, install requirements, copy the code, run main.py:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "crypto_project/main.py"]

The interesting part is docker-compose.yml, since that's what actually solves the "different machine" problem. It defines two services: db, running Postgres 16 with a named volume so data survives restarts, and etl, which builds from the Dockerfile and is started after the db container.

services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - pgdata:/var/lib/postgresql/data

  etl:
    build: .
    env_file:
      - .env
    depends_on:
      - db

# Named volumes used by a service must also be declared at the top level
volumes:
  pgdata:

One command, docker compose up --build -d, and both containers spin up together, on any machine with Docker installed. No manual Postgres install, no version mismatches, no "wait, what port did I use last time."
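
A caveat on depends_on: in the short list form used here, it only controls start order. It does not wait for Postgres to finish initializing, so on a fresh volume the first load() can race the database. Compose can handle this with a healthcheck plus condition: service_healthy, but a small retry on the Python side also works. This is a hedged sketch, not what the project does. port_is_open and wait_until are hypothetical helpers, and an open TCP port is only an approximation of "ready to accept queries".

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hostnames.
        return False


def wait_until(probe, attempts=10, delay=2.0):
    """Call probe() until it returns True; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    raise TimeoutError(f"Not ready after {attempts} attempts")


# In main.py this might run before load(), for example:
# wait_until(lambda: port_is_open("db", 5432))
```

Taking the probe as a function keeps the retry loop testable without a running database.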

For the full project with all the files, check out my GitHub: https://github.com/abdiomari/Docker-Crypto-Project

The one gotcha worth calling out

There's a small detail that trips people up the first time: inside the Compose network, the ETL container talks to Postgres using HOST=db, the service name that Docker's internal DNS resolves automatically. But if you ever run main.py directly on your host machine, outside Docker, that same hostname means nothing. You need HOST=localhost instead, and Postgres has to be reachable there, for example by publishing the db port in the compose file or by running a local instance.

It's a small thing, but it's the kind of detail that costs twenty minutes of confused debugging if nobody tells you about it. Now it's written down, in the README and here.
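
One way to soften this is a fallback in code. The sketch below assumes HOST is set only for the etl service (for example under environment: in the compose file) and left out of the shared .env, which is a different setup from the one described above. resolve_db_host is a hypothetical helper name.

```python
import os


def resolve_db_host(env=None):
    """Use HOST if it is set, otherwise assume a local run and use localhost."""
    env = os.environ if env is None else env
    return env.get("HOST") or "localhost"


print(resolve_db_host({"HOST": "db"}))  # inside Compose: db
print(resolve_db_host({}))              # on the host: localhost
```

With the setup as written, where HOST=db lives in .env, the simpler route is to set HOST=localhost in your shell before running main.py: load_dotenv() does not override variables that already exist in the environment by default.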

That's the whole pipeline: three small Python modules, a Postgres container, and a compose file tying them together. Nothing about it needs a specific laptop, a specific OS, or a specific person's memory of how they set up their local database three months ago. Clone it, add a .env, run one command. That's the point.
