Abdelrahman Adnan
Data Engineering ZoomCamp Module 1 Notes Part 2

Part 4: Data Ingestion with Python

We're going to load the NYC Taxi dataset into Postgres.

Setting Up

Install dependencies:

pip install pandas sqlalchemy psycopg2-binary jupyter

Or with uv:

uv add pandas sqlalchemy psycopg2-binary
uv add --dev jupyter

The Dataset

We use the NYC yellow taxi trip data for January 2021, from the DataTalksClub mirror of the TLC data. Download it:

wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

Loading Data into Postgres

Here's the basic approach:

import pandas as pd
from sqlalchemy import create_engine

# Create connection
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# Read CSV in chunks (it's a big file)
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv.gz', 
                       iterator=True, 
                       chunksize=100000)

# Create table from first chunk
first_chunk = next(df_iter)
first_chunk.head(0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')

# Insert first chunk
first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

# Insert remaining chunks
for chunk in df_iter:
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    print(f'Inserted {len(chunk)} rows')

The key things here:

  • chunksize prevents loading the whole file into memory
  • if_exists='replace' creates the table (first time)
  • if_exists='append' adds rows (subsequent chunks)
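
One caveat the basic script glosses over: pandas reads the pickup/dropoff timestamps as plain strings, so they land in Postgres as TEXT. Passing parse_dates converts them at read time and the table gets proper TIMESTAMP columns. A minimal tweak to the read_csv call above (the column names are from the standard yellow taxi schema):

df_iter = pd.read_csv('yellow_tripdata_2021-01.csv.gz',
                      iterator=True,
                      chunksize=100000,
                      # Parse timestamps so Postgres stores TIMESTAMP, not TEXT
                      parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])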

Part 5: Docker Compose

Running multiple docker run commands is annoying. Docker Compose lets you define everything in one file.

Create docker-compose.yaml:

services:
  pgdatabase:
    image: postgres:17
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql/data"
    ports:
      - "5432:5432"

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8080:80"

volumes:
  ny_taxi_postgres_data:
  pgadmin_data:

Now just run (on newer Docker installations the command is docker compose, with a space, but both forms behave the same):

docker-compose up      # start everything
docker-compose up -d   # start in background
docker-compose down    # stop everything
docker-compose down -v # stop and remove volumes

Docker Compose automatically creates a network so containers can talk to each other using their service names (e.g., pgdatabase instead of localhost).
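
This matters for the connection string: from your host machine you still connect via localhost and the published port, but a script running in another container on the same Compose network must use the service name. A minimal sketch with the credentials from the docker-compose.yaml above:

from sqlalchemy import create_engine

# From the host machine (port 5432 is published):
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# From another container on the same Compose network,
# the hostname is the service name:
engine = create_engine('postgresql://root:root@pgdatabase:5432/ny_taxi')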

Connecting to Postgres from pgAdmin

  1. Open http://localhost:8080 in your browser
  2. Log in with the email/password from docker-compose.yaml
  3. Right-click Servers > Create > Server
  4. Name it whatever you want
  5. Under Connection tab:
    • Host: pgdatabase (the service name, not localhost!)
    • Port: 5432
    • Username: root
    • Password: root

Part 6: SQL Refresher

Quick review of SQL queries we'll use a lot.
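
The join examples below use a zones lookup table that the Part 4 script doesn't load. A minimal sketch for ingesting it, assuming the taxi zone lookup CSV from the same DataTalksClub releases mirror (adjust the URL if it has moved):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# The zone lookup is tiny (a few hundred rows), so no chunking needed.
# URL assumed from the course's data mirror.
zones = pd.read_csv('https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv')
zones.to_sql(name='zones', con=engine, if_exists='replace')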

JOINs

There are two ways to write an INNER JOIN:

-- Implicit join (old style)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t, zones z
WHERE t."PULocationID" = z."LocationID";

-- Explicit join (preferred)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t
JOIN zones z ON t."PULocationID" = z."LocationID";

For multiple joins:

SELECT 
    t.total_amount,
    zpu."Zone" AS pickup_zone,
    zdo."Zone" AS dropoff_zone
FROM yellow_taxi_data t
JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo ON t."DOLocationID" = zdo."LocationID";

GROUP BY and Aggregations

Count trips per day:

SELECT 
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trip_count
FROM yellow_taxi_data
GROUP BY CAST(tpep_dropoff_datetime AS DATE)
ORDER BY day;

Multiple aggregations (GROUP BY 1 groups by the first column in the SELECT list, here the day):

SELECT 
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trips,
    MAX(total_amount) AS max_amount,
    SUM(total_amount) AS total_revenue
FROM yellow_taxi_data
GROUP BY 1
ORDER BY trips DESC;

Data Quality Checks

Find NULL values:

SELECT COUNT(*) FROM yellow_taxi_data
WHERE "PULocationID" IS NULL;

Find values that are missing from the lookup table:

SELECT * FROM yellow_taxi_data
WHERE "PULocationID" NOT IN (SELECT "LocationID" FROM zones);
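
One caveat on the NOT IN pattern: if the subquery ever returns a NULL, NOT IN matches nothing, so a LEFT JOIN ... WHERE ... IS NULL anti-join is the safer habit. You can also run these checks from Python against the same engine; a minimal sketch reusing the Part 4 connection:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# Count trips with no pickup location
nulls = pd.read_sql('SELECT COUNT(*) AS n FROM yellow_taxi_data WHERE "PULocationID" IS NULL', con=engine)
print(nulls['n'][0])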

Part 7: Terraform & GCP

Terraform is Infrastructure as Code (IaC). Instead of clicking around in a cloud console, you write config files describing what you want, and Terraform creates it.

Why Terraform?

  • Version control your infrastructure
  • Reproducible environments
  • Easy to replicate across dev/staging/production
  • Works with AWS, GCP, Azure, and many more

GCP Setup

  1. Create a Google Cloud account (the free trial gives you $300 in credits)
  2. Create a new project
  3. Create a service account:
    • Go to IAM & Admin > Service Accounts
    • Create new service account
    • Give it these roles: Storage Admin, BigQuery Admin
  4. Download the JSON key file
  5. Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/key.json"
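
A quick way to confirm the key is picked up: the Google client libraries read GOOGLE_APPLICATION_CREDENTIALS automatically, so listing your buckets should succeed. A minimal sketch, assuming the google-cloud-storage package is installed (pip install google-cloud-storage):

from google.cloud import storage

# The client reads GOOGLE_APPLICATION_CREDENTIALS from the environment;
# the project is inferred from the service account key file.
client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])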

Terraform Basics

Main files:

  • main.tf - the main configuration (providers and resources)
  • variables.tf - variable definitions, for values you want to parameterize (project ID, region, bucket names)

Basic main.tf example:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"
    }
  }
}

provider "google" {
  project = "your-project-id"
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name          = "your-unique-bucket-name"
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "trips_data"
  location   = "US"
}

Terraform Commands

The workflow is always:

# 1. Initialize (download providers)
terraform init

# 2. Preview changes
terraform plan

# 3. Apply changes
terraform apply

# 4. When you're done, destroy resources
terraform destroy

To skip the confirmation prompt:

terraform apply -auto-approve
terraform destroy -auto-approve

Common Terraform Flags

  • -auto-approve - don't ask for confirmation
  • -var="name=value" - pass variables
  • -var-file="file.tfvars" - use a variables file

Useful Tips

Docker Cleanup Commands

# Remove all stopped containers
docker container prune

# Remove unused images
docker image prune

# Remove unused volumes
docker volume prune

# Nuclear option - remove everything unused
docker system prune -a

Checking Ports

If a port is already in use:

# Find what's using port 5432
lsof -i :5432
# or
netstat -tulpn | grep 5432

Docker Networking

When containers need to talk to each other:

  • In Docker Compose: use service names as hostnames
  • Manual setup: create a network with docker network create, as shown below

docker network create my_network
docker run --network=my_network --name=container1 ...
docker run --network=my_network --name=container2 ...
# container2 can reach container1 using hostname "container1"

Summary

What we covered:

  1. Docker - containerization for reproducible environments
  2. PostgreSQL - relational database running in Docker
  3. Data Ingestion - loading data with Python/pandas/SQLAlchemy
  4. Docker Compose - orchestrating multiple containers
  5. SQL - querying and aggregating data
  6. Terraform - infrastructure as code for GCP

The main takeaway: these tools help you build reproducible, scalable data pipelines. Docker ensures your code runs the same everywhere, and Terraform ensures your infrastructure is consistent and version-controlled.

