Module 1: Docker, SQL & Terraform
These are my notes and walkthrough for Module 1 of the Data Engineering Zoomcamp. If you're new to data engineering, they should help you understand the basics.
What is Data Engineering?
Data Engineering is basically about building systems that collect, store, and analyze data at scale. Think of it as the plumbing that makes data flow from point A to point B so analysts and data scientists can do their thing.
A data pipeline is just a service that takes data in, does something with it, and outputs more data. Simple example: read a CSV file, clean it up, store it in a database.
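To make that concrete, here's a rough sketch of that CSV-to-database pipeline in Python. The file names (trips.csv, trips.db) and the table name are made up for illustration; this just shows the ingest / transform / load shape, using SQLite so it runs without any setup:
import sqlite3
import pandas as pd
# Ingest: read the raw CSV (hypothetical file name)
df = pd.read_csv('trips.csv')
# Transform: drop rows with missing values and normalize column names
df = df.dropna()
df.columns = [c.strip().lower() for c in df.columns]
# Load: store the cleaned data in a database (SQLite here for simplicity)
conn = sqlite3.connect('trips.db')
df.to_sql('trips', conn, if_exists='replace', index=False)
conn.close()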
Part 1: Docker Basics
Why Docker?
Docker lets you package your application and all its dependencies into a "container". This solves the classic "it works on my machine" problem.
Main benefits:
- Reproducibility - same environment everywhere
- Isolation - apps run independently, won't mess with your system
- Portability - works on any machine with Docker installed
Containers are different from virtual machines - they're much lighter because they share the host OS kernel.
Getting Started with Docker
First, check if Docker is installed:
docker --version
Run your first container:
docker run hello-world
Try running Ubuntu:
docker run -it ubuntu
The -it flags mean interactive mode (-i keeps STDIN open) with a terminal attached (-t allocates a pseudo-TTY). Without them, the container starts, runs its default command, and exits immediately.
Important: Containers are Stateless
This tripped me up at first. Any changes you make inside a container are lost when the container stops. For example:
docker run -it ubuntu
apt update && apt install -y python3
exit
# Run it again
docker run -it ubuntu
python3 # Error! Python is not installed
This is actually a feature, not a bug. It means you can always start fresh.
Managing Containers
See all containers (including stopped ones):
docker ps -a
Clean up old containers:
docker rm $(docker ps -aq)
Better approach - use --rm to auto-delete when container stops:
docker run -it --rm ubuntu
Using Different Base Images
You can use pre-built images with software already installed:
# Python image - starts Python interpreter
docker run -it --rm python:3.13
# If you want bash instead of Python:
docker run -it --rm --entrypoint=bash python:3.13-slim
Volumes - Persisting Data
Since containers are stateless, we need volumes to save data. There are two types:
Named volumes (Docker manages them):
docker run -it -v my_data:/app/data ubuntu
Bind mounts (map to a folder on your computer):
docker run -it -v $(pwd)/my_folder:/app/data ubuntu
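To convince yourself the bind mount actually persists data, you can run a tiny script from the mounted folder that appends to a file and check that the file is still on your machine after the container is gone. This is just a sketch — the script name persist_demo.py is made up, and it uses the python:3.13-slim image so Python is available:
# persist_demo.py
# Run with (twice, to see both lines accumulate):
#   docker run --rm -v $(pwd):/app python:3.13-slim python /app/persist_demo.py
from datetime import datetime
# /app is the bind-mounted folder, so this file lands in your current directory on the host
with open('/app/runs.txt', 'a') as f:
    f.write(f'container ran at {datetime.now()}\n')
After two runs, runs.txt in your folder should contain two lines, even though both containers were deleted.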
Part 2: Creating a Dockerfile
A Dockerfile is a recipe for building your own Docker image.
Simple Example
Create a file called pipeline.py:
import sys
import pandas as pd  # not used yet - importing it just confirms pandas is installed in the image
# Arguments passed after the image name show up in sys.argv
print(sys.argv)
day = sys.argv[1]
print(f'Job finished for day = {day}')
Create a Dockerfile:
FROM python:3.13-slim
RUN pip install pandas pyarrow
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]
What each line does:
- FROM - base image to build on
- RUN - execute commands during the build
- WORKDIR - set the working directory inside the image
- COPY - copy files from your machine to the image
- ENTRYPOINT - the command that runs when the container starts
Build and run:
docker build -t test:pandas .
docker run -it test:pandas some_argument
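If the image built correctly, you should see the argument list echoed back, followed by the final message from pipeline.py:
['pipeline.py', 'some_argument']
Job finished for day = some_argument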
Part 3: Running PostgreSQL with Docker
Now let's do some real data engineering. We'll run Postgres in a container.
docker run -it --rm \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:17
Breaking this down:
- -e sets environment variables (username, password, database name)
- -v creates a named volume so data persists
- -p 5432:5432 maps the container port to the same port on your machine
Connecting to Postgres
Install pgcli (a nice command-line client):
pip install pgcli
# or with uv:
uv add --dev pgcli
Connect:
pgcli -h localhost -p 5432 -u root -d ny_taxi
Try some SQL:
\dt -- list tables
CREATE TABLE test (id INTEGER);
INSERT INTO test VALUES (1);
SELECT * FROM test;
\q -- quit
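From here, the natural next step is loading real data into this database from Python — exactly the kind of pipeline described at the top. A minimal sketch, assuming sqlalchemy and psycopg2-binary are installed alongside pandas, and using a hypothetical local CSV (swap in whatever dataset you're actually working with):
import pandas as pd
from sqlalchemy import create_engine
# Connection string matches the container settings above: user root, password root, db ny_taxi
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
# Hypothetical local file name - adjust to your data
df = pd.read_csv('yellow_tripdata_sample.csv')
# Write the dataframe into Postgres; creates the table if it doesn't exist
df.to_sql('yellow_taxi_data', engine, if_exists='replace', index=False)
print(f'Loaded {len(df)} rows into yellow_taxi_data')
You can then go back to pgcli and run SELECT COUNT(*) FROM yellow_taxi_data; to confirm the rows landed.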