Abdelrahman Adnan

Data Engineering ZoomCamp Module 1 Notes Part 1

Module 1: Docker, SQL & Terraform

These are my notes and walkthrough for Module 1 of the Data Engineering Zoomcamp. If you're new to data engineering, they should help you understand the basics.

What is Data Engineering?

Data Engineering is basically about building systems that collect, store, and analyze data at scale. Think of it as the plumbing that makes data flow from point A to point B so analysts and data scientists can do their thing.

A data pipeline is just a service that takes data in, does something with it, and outputs more data. Simple example: read a CSV file, clean it up, store it in a database.
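
Here's a rough sketch of that CSV example using only Python's standard library. The column names, the trips table, and SQLite as the database are all my own placeholders for illustration, not something from the course material.

```python
import csv
import io
import sqlite3


def run_pipeline(csv_text: str, conn: sqlite3.Connection) -> int:
    """Read CSV rows, drop incomplete ones, and store the rest in a table."""
    reader = csv.DictReader(io.StringIO(csv_text))

    # "Clean it up": strip whitespace and skip rows with a missing value.
    cleaned = []
    for row in reader:
        name = (row.get("name") or "").strip()
        amount = (row.get("amount") or "").strip()
        if not name or not amount:
            continue
        cleaned.append((name, float(amount)))

    # "Store it in a database": hypothetical table, created on first use.
    conn.execute("CREATE TABLE IF NOT EXISTS trips (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


if __name__ == "__main__":
    sample = "name,amount\n alice ,10.5\nbob,\ncarol,3\n"
    conn = sqlite3.connect(":memory:")
    print(run_pipeline(sample, conn), "rows loaded")    # 2 rows loaded
    print(conn.execute("SELECT * FROM trips").fetchall())
```

The row for bob has no amount, so the cleaning step drops it and only two rows reach the table.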


Part 1: Docker Basics

Why Docker?

Docker lets you package your application and all its dependencies into a "container". This solves the classic "it works on my machine" problem.

Main benefits:

  • Reproducibility - same environment everywhere
  • Isolation - apps run independently, won't mess with your system
  • Portability - works on any machine with Docker installed

Containers are different from virtual machines: they share the host OS kernel instead of booting a full guest operating system, which makes them much lighter and faster to start.

Getting Started with Docker

First, check if Docker is installed:

docker --version

Run your first container:

docker run hello-world

Try running Ubuntu:

docker run -it ubuntu

The -it flag is really two flags: -i keeps STDIN open (interactive) and -t allocates a terminal. Without them, the shell has no input to read, so the container starts and exits right away.

Important: Containers are Stateless

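This tripped me up at first. Every docker run creates a brand-new container from the image, so changes you made inside a previous container do not carry over. (They stay in the stopped container until you remove it, but a fresh docker run never sees them.) For example: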

docker run -it ubuntu
# Inside the container:
apt update && apt install -y python3
exit
# Run it again - this starts a new container from the same image
docker run -it ubuntu
python3  # Error! Python is not installed

This is actually a feature, not a bug. It means you can always start fresh.

Managing Containers

See all containers (including stopped ones):

docker ps -a

Remove all stopped containers (docker rm refuses to remove running ones unless you add -f):

docker rm $(docker ps -aq)

A better approach: pass --rm so the container is deleted automatically when it exits:

docker run -it --rm ubuntu

Using Different Base Images

You can use pre-built images with software already installed:

# Python image - starts Python interpreter
docker run -it --rm python:3.13

# If you want bash instead of Python:
docker run -it --rm --entrypoint=bash python:3.13-slim
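
If you end up launching containers from a script, it can help to build the docker run command as a list instead of a string. This is only a sketch: docker_run_args is a helper name I made up, and it only knows the three flags used so far in this post (-it, --rm, --entrypoint).

```python
import shlex


def docker_run_args(image, *, interactive=True, remove=True, entrypoint=None):
    """Build the argument list for `docker run` (hypothetical helper)."""
    args = ["docker", "run"]
    if interactive:
        args.append("-it")   # keep STDIN open and allocate a terminal
    if remove:
        args.append("--rm")  # delete the container when it exits
    if entrypoint is not None:
        args.append(f"--entrypoint={entrypoint}")
    args.append(image)
    return args


if __name__ == "__main__":
    args = docker_run_args("python:3.13-slim", entrypoint="bash")
    print(shlex.join(args))
    # docker run -it --rm --entrypoint=bash python:3.13-slim
    # To actually start the container: subprocess.run(args, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems, which is the main reason to prefer it over a formatted string.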

Volumes - Persisting Data

Since a container's filesystem goes away with the container, we mount storage from outside it to keep data. There are two common options:

Named volumes (Docker manages them):

docker run -it -v my_data:/app/data ubuntu

Bind mounts (map to a folder on your computer):

docker run -it -v "$(pwd)/my_folder":/app/data ubuntu
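
The difference between the two -v forms comes down to the part before the colon: as I understand the Docker docs, an absolute path is treated as a bind mount and a plain name as a named volume. Here is a small sketch of building both values in Python. The function names are mine, and it assumes Linux or macOS style paths.

```python
from pathlib import Path


def named_volume(name, container_dir):
    """Return a -v value for a Docker-managed named volume."""
    return f"{name}:{container_dir}"


def bind_mount(host_dir, container_dir):
    """Return a -v value that bind-mounts a host folder.

    The host path is made absolute, which is what "$(pwd)/my_folder"
    does in the shell version.
    """
    return f"{Path(host_dir).resolve()}:{container_dir}"


if __name__ == "__main__":
    print(named_volume("my_data", "/app/data"))  # my_data:/app/data
    print(bind_mount("my_folder", "/app/data"))  # absolute path, then :/app/data
```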

Part 2: Creating a Dockerfile

A Dockerfile is a recipe for building your own Docker image.

Simple Example

Create a file called pipeline.py:

import sys
import pandas as pd

print(sys.argv)
day = sys.argv[1]
print(f'Job finished for day = {day}')
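
One thing to watch: sys.argv[1] raises an IndexError if you forget the argument. Below is a sketch of a slightly more defensive version. Treating the day as an ISO date (YYYY-MM-DD) is my own assumption, since the original script accepts any string.

```python
import sys
from datetime import date


def parse_day(argv):
    """Return the day argument as a date, or raise ValueError with a hint."""
    if len(argv) < 2:
        raise ValueError("usage: python pipeline.py YYYY-MM-DD")
    # Assumption: the day is an ISO date; fromisoformat rejects anything else.
    return date.fromisoformat(argv[1])


def main(argv):
    day = parse_day(argv)
    print(f"Job finished for day = {day}")
    return day


if __name__ == "__main__":
    try:
        main(sys.argv)
    except ValueError as err:
        print(err)
```

Running it with no argument prints the usage hint instead of a traceback.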

Create a Dockerfile:

FROM python:3.13-slim

RUN pip install pandas pyarrow

WORKDIR /app
COPY pipeline.py pipeline.py

ENTRYPOINT ["python", "pipeline.py"]

What each line does:

  • FROM - base image to build on
  • RUN - execute commands during build
  • WORKDIR - set the working directory
  • COPY - copy files from your machine to the image
  • ENTRYPOINT - the command that runs when the container starts; anything you pass after the image name in docker run is appended to it as arguments

Build and run:

docker build -t test:pandas .
docker run -it test:pandas some_argument
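
To see where some_argument ends up: with an exec-form ENTRYPOINT like the one in the Dockerfile above, Docker appends whatever follows the image name to the entrypoint list. This tiny sketch just models that concatenation in Python; the helper names are mine.

```python
ENTRYPOINT = ["python", "pipeline.py"]  # copied from the Dockerfile above


def container_command(entrypoint, run_args):
    """Exec-form ENTRYPOINT plus the arguments given after the image name."""
    return list(entrypoint) + list(run_args)


def script_argv(command):
    """What the script sees as sys.argv: everything after the interpreter."""
    return command[1:]


cmd = container_command(ENTRYPOINT, ["some_argument"])
print(cmd)               # ['python', 'pipeline.py', 'some_argument']
print(script_argv(cmd))  # ['pipeline.py', 'some_argument']
```

That is why print(sys.argv) in pipeline.py shows the script name first and sys.argv[1] is the value you passed to docker run.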

Part 3: Running PostgreSQL with Docker

Now let's do some real data engineering. We'll run Postgres in a container.

docker run -it --rm \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:17

Breaking this down:

  • -e sets environment variables (username, password, database name)
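  • -v mounts the named volume ny_taxi_postgres_data at Postgres's data directory, so the data survives even after the container is removed
  • -p 5432:5432 maps port 5432 on your machine (left side) to port 5432 inside the container (right side)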
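
A quick way to confirm the port mapping works is to try a TCP connection from the host. This is a minimal sketch with a helper name I picked. A successful connection only tells you something is listening on the port, not that Postgres has finished starting up.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Port 5432 reachable:", port_is_open("localhost", 5432))
```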

Connecting to Postgres

Install pgcli (a nice command-line client):

pip install pgcli
# or with uv:
uv add --dev pgcli

Connect:

pgcli -h localhost -p 5432 -u root -d ny_taxi
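
The same connection details can also be written as a single connection URI, which is the form many Postgres tools and libraries take. Here is a sketch of building one in Python. The helper name is mine, and as far as I know pgcli accepts a URI like this as its only argument, but treat that as an assumption and check pgcli --help.

```python
from urllib.parse import quote


def postgres_uri(user, password, host, port, db):
    """Build a connection URI; user, password, and db name are URL-escaped."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(db, safe='')}"
    )


print(postgres_uri("root", "root", "localhost", 5432, "ny_taxi"))
# postgresql://root:root@localhost:5432/ny_taxi
```

Escaping matters once the password contains characters like @ or /, which would otherwise break the URI.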

Try some SQL:

-- list tables (keep comments off the backslash lines, or they are read as arguments)
\dt
CREATE TABLE test (id INTEGER);
INSERT INTO test VALUES (1);
SELECT * FROM test;
-- quit
\q
