Module 1: Docker, SQL & Terraform
These are my notes and walkthrough for Module 1 of the Data Engineering Zoomcamp. If you're new to data engineering, they should help you understand the basics.
What is Data Engineering?
Data Engineering is basically about building systems that collect, store, and analyze data at scale. Think of it as the plumbing that makes data flow from point A to point B so analysts and data scientists can do their thing.
A data pipeline is just a service that takes data in, does something with it, and outputs more data. Simple example: read a CSV file, clean it up, store it in a database.
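To make that concrete, here's a rough sketch of that CSV-to-database pipeline in Python. The file names (trips.csv, trips.db) and the table name are made up for illustration; this just shows the ingest / transform / load shape, using SQLite so it runs without any setup:
import sqlite3
import pandas as pd
# Ingest: read the raw CSV (hypothetical file name)
df = pd.read_csv('trips.csv')
# Transform: drop rows with missing values and normalize column names
df = df.dropna()
df.columns = [c.strip().lower() for c in df.columns]
# Load: store the cleaned data in a database (SQLite here for simplicity)
conn = sqlite3.connect('trips.db')
df.to_sql('trips', conn, if_exists='replace', index=False)
conn.close()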
Part 1: Docker Basics
Why Docker?
Docker lets you package your application and all its dependencies into a "container". This solves the classic "it works on my machine" problem.
Main benefits:
- Reproducibility - same environment everywhere
- Isolation - apps run independently, won't mess with your system
- Portability - works on any machine with Docker installed
Containers are different from virtual machines - they're much lighter because they share the host OS kernel.
Getting Started with Docker
First, check if Docker is installed:
docker --version
Run your first container:
docker run hello-world
Try running Ubuntu:
docker run -it ubuntu
The -it flags mean interactive mode (-i keeps STDIN open) with a terminal attached (-t allocates a pseudo-TTY). Without them, the container starts, runs its default command, and exits immediately.
Important: Containers are Stateless
This tripped me up at first. Any changes you make inside a container are lost when the container stops. For example:
docker run -it ubuntu
apt update && apt install -y python3
exit
# Run it again
docker run -it ubuntu
python3 # Error! Python is not installed
This is actually a feature, not a bug. It means you can always start fresh.
Managing Containers
See all containers (including stopped ones):
docker ps -a
Clean up old containers:
docker rm $(docker ps -aq)
Better approach - use --rm to auto-delete when container stops:
docker run -it --rm ubuntu
Using Different Base Images
You can use pre-built images with software already installed:
# Python image - starts Python interpreter
docker run -it --rm python:3.13
# If you want bash instead of Python:
docker run -it --rm --entrypoint=bash python:3.13-slim
Volumes - Persisting Data
Since containers are stateless, we need volumes to save data. There are two types:
Named volumes (Docker manages them):
docker run -it -v my_data:/app/data ubuntu
Bind mounts (map to a folder on your computer):
docker run -it -v $(pwd)/my_folder:/app/data ubuntu
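To convince yourself the bind mount actually persists data, you can run a tiny script from the mounted folder that appends to a file and check that the file is still on your machine after the container is gone. This is just a sketch — the script name persist_demo.py is made up, and it uses the python:3.13-slim image so Python is available:
# persist_demo.py
# Run with (twice, to see both lines accumulate):
#   docker run --rm -v $(pwd):/app python:3.13-slim python /app/persist_demo.py
from datetime import datetime
# /app is the bind-mounted folder, so this file lands in your current directory on the host
with open('/app/runs.txt', 'a') as f:
    f.write(f'container ran at {datetime.now()}\n')
After two runs, runs.txt in your folder should contain two lines, even though both containers were deleted.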
Part 2: Creating a Dockerfile
A Dockerfile is a recipe for building your own Docker image.
Simple Example
Create a file called pipeline.py:
import sys
import pandas as pd  # not used yet - importing it just confirms pandas is installed in the image
# Arguments passed after the image name show up in sys.argv
print(sys.argv)
day = sys.argv[1]
print(f'Job finished for day = {day}')
Create a Dockerfile:
FROM python:3.13-slim
RUN pip install pandas pyarrow
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]
What each line does:
- FROM - base image to build on
- RUN - execute commands during the build
- WORKDIR - set the working directory inside the image
- COPY - copy files from your machine to the image
- ENTRYPOINT - the command that runs when the container starts
Build and run:
docker build -t test:pandas .
docker run -it test:pandas some_argument
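If the image built correctly, you should see the argument list echoed back, followed by the final message from pipeline.py:
['pipeline.py', 'some_argument']
Job finished for day = some_argument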
Part 3: Running PostgreSQL with Docker
Now let's do some real data engineering. We'll run Postgres in a container.
docker run -it --rm \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:17
Breaking this down:
- -e sets environment variables (username, password, database name)
- -v creates a named volume so data persists
- -p 5432:5432 maps the container port to the same port on your machine
Connecting to Postgres
Install pgcli (a nice command-line client):
pip install pgcli
# or with uv:
uv add --dev pgcli
Connect:
pgcli -h localhost -p 5432 -u root -d ny_taxi
Try some SQL:
\dt -- list tables
CREATE TABLE test (id INTEGER);
INSERT INTO test VALUES (1);
SELECT * FROM test;
\q -- quit
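From here, the natural next step is loading real data into this database from Python — exactly the kind of pipeline described at the top. A minimal sketch, assuming sqlalchemy and psycopg2-binary are installed alongside pandas, and using a hypothetical local CSV (swap in whatever dataset you're actually working with):
import pandas as pd
from sqlalchemy import create_engine
# Connection string matches the container settings above: user root, password root, db ny_taxi
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
# Hypothetical local file name - adjust to your data
df = pd.read_csv('yellow_tripdata_sample.csv')
# Write the dataframe into Postgres; creates the table if it doesn't exist
df.to_sql('yellow_taxi_data', engine, if_exists='replace', index=False)
print(f'Loaded {len(df)} rows into yellow_taxi_data')
You can then go back to pgcli and run SELECT COUNT(*) FROM yellow_taxi_data; to confirm the rows landed.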