Part 4: Data Ingestion with Python
We're going to load the NYC Taxi dataset into Postgres.
Setting Up
Install dependencies:
pip install pandas sqlalchemy psycopg2-binary jupyter
Or with uv:
uv add pandas sqlalchemy psycopg2-binary
uv add --dev jupyter
The Dataset
We use the NYC Taxi trip data. Download it:
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
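Before loading anything, it's worth peeking at the first few rows to see which columns and types you're dealing with. A quick sketch, assuming the file downloaded above:

import pandas as pd

# Read only the first rows so we don't pull the whole file into memory
df_sample = pd.read_csv('yellow_tripdata_2021-01.csv.gz', nrows=5)
print(df_sample.dtypes)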
Loading Data into Postgres
Here's the basic approach:
import pandas as pd
from sqlalchemy import create_engine
# Create connection
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
# Read CSV in chunks (it's a big file)
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv.gz',
                      iterator=True,
                      chunksize=100000)
# Create table from first chunk
first_chunk = next(df_iter)
first_chunk.head(0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
# Insert first chunk
first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
# Insert remaining chunks
for chunk in df_iter:
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
    print(f'Inserted {len(chunk)} rows')
The key things here:
- chunksize prevents loading the whole file into memory
- if_exists='replace' creates the table (first time)
- if_exists='append' adds rows (subsequent chunks)
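One caveat: pandas reads the pickup/dropoff timestamps as plain strings, so they land in Postgres as TEXT. If you want proper TIMESTAMP columns, a minimal tweak to the insert loop above (using the column names from the yellow taxi CSV) is:

for chunk in df_iter:
    # Parse the timestamp columns so Postgres stores them as TIMESTAMP rather than TEXT
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

Apply the same conversion to first_chunk before its insert so the column types line up from the start.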
Part 5: Docker Compose
Running multiple docker run commands is annoying. Docker Compose lets you define everything in one file.
Create docker-compose.yaml:
services:
  pgdatabase:
    image: postgres:17
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql/data"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8080:80"
volumes:
  ny_taxi_postgres_data:
  pgadmin_data:
Now just run:
docker-compose up # start everything
docker-compose up -d # start in background
docker-compose down # stop everything
docker-compose down -v # stop and remove volumes
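A couple of other standard subcommands come in handy for checking on things:

docker-compose ps                  # list services and their status
docker-compose logs -f pgdatabase  # follow the logs of one service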
Docker Compose automatically creates a network so containers can talk to each other using their service names (e.g., pgdatabase instead of localhost).
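For example, if the ingestion script from Part 4 runs in a container on the same Compose network, the connection string uses the service name instead of localhost (a sketch, assuming the credentials above):

from sqlalchemy import create_engine

# 'pgdatabase' is resolved by Docker's internal DNS on the Compose network
engine = create_engine('postgresql://root:root@pgdatabase:5432/ny_taxi')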
Connecting to Postgres from pgAdmin
- Open http://localhost:8080 in a browser
- Log in with the email/password from docker-compose
- Right-click Servers > Create > Server
- Name it whatever you want
- Under the Connection tab:
  - Host: pgdatabase (the service name, not localhost!)
  - Port: 5432
  - Username: root
  - Password: root
Part 6: SQL Refresher
Quick review of SQL queries we'll use a lot.
JOINs
There are two ways to write an INNER JOIN:
-- Implicit join (old style)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t, zones z
WHERE t."PULocationID" = z."LocationID";
-- Explicit join (preferred)
SELECT t.*, z."Zone"
FROM yellow_taxi_data t
JOIN zones z ON t."PULocationID" = z."LocationID";
For multiple joins:
SELECT
    t.total_amount,
    zpu."Zone" AS pickup_zone,
    zdo."Zone" AS dropoff_zone
FROM yellow_taxi_data t
JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo ON t."DOLocationID" = zdo."LocationID";
GROUP BY and Aggregations
Count trips per day:
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trip_count
FROM yellow_taxi_data
GROUP BY CAST(tpep_dropoff_datetime AS DATE)
ORDER BY day;
Multiple aggregations:
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS day,
    COUNT(1) AS trips,
    MAX(total_amount) AS max_amount,
    SUM(total_amount) AS total_revenue
FROM yellow_taxi_data
GROUP BY 1  -- 1 refers to the first column in the SELECT list (the day)
ORDER BY trips DESC;
Data Quality Checks
Find NULL values:
SELECT COUNT(*) FROM yellow_taxi_data
WHERE "PULocationID" IS NULL;
Find values not in lookup table:
SELECT * FROM yellow_taxi_data
WHERE "PULocationID" NOT IN (SELECT "LocationID" FROM zones);
Part 7: Terraform & GCP
Terraform is Infrastructure as Code (IaC). Instead of clicking around in a cloud console, you write config files describing what you want, and Terraform creates it.
Why Terraform?
- Version control your infrastructure
- Reproducible environments
- Easy to replicate across dev/staging/production
- Works with AWS, GCP, Azure, and many more
GCP Setup
- Create a Google Cloud account (free tier gives you $300 in credits)
- Create a new project
- Create a service account:
  - Go to IAM & Admin > Service Accounts
  - Create a new service account
  - Give it these roles: Storage Admin, BigQuery Admin
  - Download the JSON key file
- Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/key.json"
Terraform Basics
Main files:
- main.tf - main configuration
- variables.tf - variable definitions (a sketch of one follows the main.tf example below)
Basic main.tf example:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"
    }
  }
}

provider "google" {
  project = "your-project-id"
  region  = "us-central1"
}

resource "google_storage_bucket" "data_lake" {
  name          = "your-unique-bucket-name"
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "trips_data"
  location   = "US"
}
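And a minimal variables.tf to go with it, so the project ID and region aren't hard-coded (the variable names here are just an illustration):

variable "project" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "Region for GCP resources"
  type        = string
  default     = "us-central1"
}

In main.tf you would then write project = var.project and region = var.region in the provider block.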
Terraform Commands
The workflow is always:
# 1. Initialize (download providers)
terraform init
# 2. Preview changes
terraform plan
# 3. Apply changes
terraform apply
# 4. When you're done, destroy resources
terraform destroy
For auto-approving (skips confirmation):
terraform apply -auto-approve
terraform destroy -auto-approve
Common Terraform Flags
- -auto-approve - don't ask for confirmation
- -var="name=value" - pass variables
- -var-file="file.tfvars" - use a variables file
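For example, combining them (the variable name and .tfvars file here are hypothetical):

terraform apply -var="project=your-project-id" -var-file="prod.tfvars" -auto-approve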
Useful Tips
Docker Cleanup Commands
# Remove all stopped containers
docker container prune
# Remove unused images
docker image prune
# Remove unused volumes
docker volume prune
# Nuclear option - remove everything unused
docker system prune -a
Checking Ports
If a port is already in use:
# Find what's using port 5432
lsof -i :5432
# or
netstat -tulpn | grep 5432
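If something (often a locally installed Postgres) is already holding 5432, one simple workaround is to map a different host port when starting the container and connect to that instead:

# Host port 5433 forwards to the container's 5432
docker run -p 5433:5432 ...
# then connect with localhost:5433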
Docker Networking
When containers need to talk to each other:
- In Docker Compose: use service names as hostnames
- Manual setup: create a network with docker network create
docker network create my_network
docker run --network=my_network --name=container1 ...
docker run --network=my_network --name=container2 ...
# container2 can reach container1 using hostname "container1"
Summary
What we covered:
- Docker - containerization for reproducible environments
- PostgreSQL - relational database running in Docker
- Data Ingestion - loading data with Python/pandas/SQLAlchemy
- Docker Compose - orchestrating multiple containers
- SQL - querying and aggregating data
- Terraform - infrastructure as code for GCP
The main takeaway: these tools help you build reproducible, scalable data pipelines. Docker ensures your code runs the same everywhere, and Terraform ensures your infrastructure is consistent and version-controlled.