Part 4: Data Ingestion with Python
We're going to load the NYC Taxi dataset into Postgres.
Setting Up
Install dependencies:
pip install pandas sqlalchemy psycopg2-binary jupyter
Or with uv:
uv add pandas sqlalchemy psycopg2-binary
uv add --dev jupyter
The Dataset
We use the NYC Taxi trip data. Download it:
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
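Before loading anything, it's worth peeking at the first few rows to see which columns and types you're dealing with. A quick sketch, assuming the file downloaded above:

import pandas as pd

# Read only the first rows so we don't pull the whole file into memory
df_sample = pd.read_csv('yellow_tripdata_2021-01.csv.gz', nrows=5)
print(df_sample.dtypes)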
Loading Data into Postgres
Here's the basic approach:
import pandas as pd
from sqlalchemy import create_engine
# Create connection
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
# Read CSV in chunks (it's a big file)
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv.gz',
                      iterator=True,
                      chunksize=100000)
# Create table from first chunk
first_chunk = next(df_iter)
first_chunk.head(0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
# Insert first chunk
first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
# Insert remaining chunks
for chunk in df_iter:
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    print(f'Inserted {len(chunk)} rows')
The key things here:
- chunksize prevents loading the whole file into memory
- if_exists='replace' creates the table (first time)
- if_exists='append' adds rows (subsequent chunks)
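One caveat: pandas reads the pickup/dropoff timestamps as plain strings, so they land in Postgres as TEXT. If you want proper TIMESTAMP columns, a minimal tweak to the insert loop above (using the column names from the yellow taxi CSV) is:

for chunk in df_iter:
    # Parse the timestamp columns so Postgres stores them as TIMESTAMP rather than TEXT
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

Apply the same conversion to first_chunk before its insert so the column types line up from the start.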
Part 5: Docker Compose
Running multiple docker run commands is annoying. Docker Compose lets you define everything in one file.
Create docker-compose.yaml:
services:
  pgdatabase:
    image: postgres:17
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql/data"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8080:80"
volumes:
  ny_taxi_postgres_data:
  pgadmin_data:
Now just run:
docker-compose up # start everything
docker-compose up -d # start in background
docker-compose down # stop everything
docker-compose down -v # stop and remove volumes
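A couple of other standard subcommands come in handy for checking on things:

docker-compose ps                  # list services and their status
docker-compose logs -f pgdatabase  # follow the logs of one service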
Docker Compose automatically creates a network so containers can talk to each other using their service names (e.g., pgdatabase instead of localhost).
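For example, if the ingestion script from Part 4 runs in a container on the same Compose network, the connection string uses the service name instead of localhost (a sketch, assuming the credentials above):

from sqlalchemy import create_engine

# 'pgdatabase' is resolved by Docker's internal DNS on the Compose network
engine = create_engine('postgresql://root:root@pgdatabase:5432/ny_taxi')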
Connecting to Postgres from pgAdmin
- Open http://localhost:8080 in a browser
- Log in with the email/password from docker-compose
- Right-click Servers > Create > Server
- Name it whatever you want
- Under the Connection tab:
  - Host: pgdatabase (the service name, not localhost!)
  - Port: 5432
  - Username: root
  - Password: root
Part 6: SQL Refresher
Quick review of SQL queries we'll use a lot.
JOINs
There are two ways to write an INNER JOIN:
-- Implicit join (old style)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t, zones z
WHERE t."PULocationID" = z."LocationID";
-- Explicit join (preferred)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t
JOIN zones z ON t."PULocationID" = z."LocationID";
For multiple joins:
SELECT
    t.total_amount,
    zpu."Zone" AS pickup_zone,
    zdo."Zone" AS dropoff_zone
FROM yellow_taxi_data t
JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo ON t."DOLocationID" = zdo."LocationID";
GROUP BY and Aggregations
Count trips per day:
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trip_count
FROM yellow_taxi_data
GROUP BY CAST(tpep_dropoff_datetime AS DATE)
ORDER BY day;
Multiple aggregations:
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trips,
    MAX(total_amount) AS max_amount,
    SUM(total_amount) AS total_revenue
FROM yellow_taxi_data
GROUP BY 1  -- 1 refers to the first column in the SELECT list (the day)
ORDER BY trips DESC;
Data Quality Checks
Find NULL values:
SELECT COUNT(*) FROM yellow_taxi_data
WHERE "PULocationID" IS NULL;
Find values not in lookup table:
SELECT * FROM yellow_taxi_data
WHERE "PULocationID" NOT IN (SELECT "LocationID" FROM zones);
Part 7: Terraform & GCP
Terraform is Infrastructure as Code (IaC). Instead of clicking around in a cloud console, you write config files describing what you want, and Terraform creates it.
Why Terraform?
- Version control your infrastructure
- Reproducible environments
- Easy to replicate across dev/staging/production
- Works with AWS, GCP, Azure, and many more
GCP Setup
- Create a Google Cloud account (free tier gives you $300 in credits)
- Create a new project
- Create a service account:
  - Go to IAM & Admin > Service Accounts
  - Create a new service account
  - Give it these roles: Storage Admin, BigQuery Admin
  - Download the JSON key file
- Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/key.json"
Terraform Basics
Main files:
- main.tf - main configuration
- variables.tf - variable definitions (a sketch of one follows the main.tf example below)
Basic main.tf example:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"
    }
  }
}

provider "google" {
  project = "your-project-id"
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name          = "your-unique-bucket-name"
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "trips_data"
  location   = "US"
}
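And a minimal variables.tf to go with it, so the project ID and region aren't hard-coded (the variable names here are just an illustration):

variable "project" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "Region for GCP resources"
  type        = string
  default     = "us-central1"
}

In main.tf you would then write project = var.project and region = var.region in the provider block.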
Terraform Commands
The workflow is always:
# 1. Initialize (download providers)
terraform init
# 2. Preview changes
terraform plan
# 3. Apply changes
terraform apply
# 4. When you're done, destroy resources
terraform destroy
For auto-approving (skips confirmation):
terraform apply -auto-approve
terraform destroy -auto-approve
Common Terraform Flags
- -auto-approve - don't ask for confirmation
- -var="name=value" - pass variables
- -var-file="file.tfvars" - use a variables file
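For example, combining them (the variable name and .tfvars file here are hypothetical):

terraform apply -var="project=your-project-id" -var-file="prod.tfvars" -auto-approve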
Useful Tips
Docker Cleanup Commands
# Remove all stopped containers
docker container prune
# Remove unused images
docker image prune
# Remove unused volumes
docker volume prune
# Nuclear option - remove everything unused
docker system prune -a
Checking Ports
If a port is already in use:
# Find what's using port 5432
lsof -i :5432
# or
netstat -tulpn | grep 5432
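If something (often a locally installed Postgres) is already holding 5432, one simple workaround is to map a different host port when starting the container and connect to that instead:

# Host port 5433 forwards to the container's 5432
docker run -p 5433:5432 ...
# then connect with localhost:5433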
Docker Networking
When containers need to talk to each other:
- In Docker Compose: use service names as hostnames
- Manual setup: create a network with docker network create
docker network create my_network
docker run --network=my_network --name=container1 ...
docker run --network=my_network --name=container2 ...
# container2 can reach container1 using hostname "container1"
Summary
What we covered:
- Docker - containerization for reproducible environments
- PostgreSQL - relational database running in Docker
- Data Ingestion - loading data with Python/pandas/SQLAlchemy
- Docker Compose - orchestrating multiple containers
- SQL - querying and aggregating data
- Terraform - infrastructure as code for GCP
The main takeaway: these tools help you build reproducible, scalable data pipelines. Docker ensures your code runs the same everywhere, and Terraform ensures your infrastructure is consistent and version-controlled.