Part 5: Building a ZIP Code Dimension Table

#tutorial #database #analytics #dataengineering

Why?, The Need for it!

Fact tables (like taxi trips) are optimized for events:

Pickup time
Distance
Fare
Pickup ZIP

But analytics teams ask questions like:

Trips by region
Revenue by state

Storing these attributes repeatedly in the fact table:

Increases storage
Slows joins

Breaking these into dimensional modeling is the best practice. In our project, the use case of knowing region of pickup / drop zip code would pave way for creating the dimension table zip_dim.

In real projects, ZIP metadata comes from:

Census data
Exposed via APIs
Internal reference tables

For this project, we simulate it with some random range based hardcoded values.

Where Does the ZIP Dimension Belong?

Layer Responsibility
Bronze Raw ZIP values as they appear
Silver Create and maintain ZIP dimension
Gold Join ZIP dimension for analytics

Even though ZIPs appear in Bronze data, the dimension itself is curated, so it belongs in Silver, not Bronze!
We derive ZIPs from the Bronze Delta table, not directly from raw files.

Zip_dim builder

Step 1: Create the schema for the zip_dim table

%sql
CREATE SCHEMA IF NOT EXISTS nyc_taxi.raw;
CREATE TABLE IF NOT EXISTS nyc_taxi.raw.zip_dim (
zip_code INT,
state STRING,
region STRING
)
USING DELTA;

Step 2: Read the unique and valid list of Zips - both pick up and drop from the bronze data.

from pyspark.sql.functions import *

zip_stream = (
spark.readStream
.table("nyc_taxi.bronze.taxi_trips")
.selectExpr("pickup_zip as zip")
.union(
spark.readStream
.table("nyc_taxi.bronze.taxi_trips")
.selectExpr("dropoff_zip as zip")
)
.where("zip IS NOT NULL")
.dropDuplicates(["zip"])
)

Step 3: Assign random metadata to the Zip values to simulate the actual metadata seeding.

def upsert_zip_dim(batch_df, batch_id):
batch_df.createOrReplaceTempView("zip_updates")
spark.sql("""
MERGE INTO nyc_taxi.raw.zip_dim t
USING (
SELECT
CAST(zip AS INT) AS zip_code,
CASE
WHEN zip BETWEEN 10001 AND 10282 THEN 'NY'
WHEN zip BETWEEN 11201 AND 11256 THEN 'US'
WHEN zip BETWEEN 10451 AND 10475 THEN 'IN'
WHEN zip BETWEEN 10301 AND 10314 THEN 'AD'
ELSE 'SA'
END AS state,
CASE
WHEN zip BETWEEN 10001 AND 10282 THEN 'Manhattan'
WHEN zip BETWEEN 11201 AND 11256 THEN 'Brooklyn'
WHEN zip BETWEEN 10451 AND 10475 THEN 'Bronx'
WHEN zip BETWEEN 10301 AND 10314 THEN 'Staten Island'
ELSE 'Queens'
END AS region
FROM zip_updates
) s
ON t.zip_code = s.zip_code
WHEN NOT MATCHED THEN INSERT *
""")

Step 4: Populate the nyc_taxi.raw.zip_dim delta table with zip meta data by batch processing.

(
zip_stream.writeStream
.foreachBatch(upsert_zip_dim)
.option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/raw/zip_dim_data")
.trigger(availableNow=True)
.start()
)

The Zip dimention table nyc_taxi.raw.zip_dim is now ready.

Happy learning!