DEV Community

Cover image for Part 5: Building a ZIP Code Dimension Table
Nithyalakshmi Kamalakkannan
Nithyalakshmi Kamalakkannan

Posted on

Part 5: Building a ZIP Code Dimension Table

Why?, The Need for it!

Fact tables (like taxi trips) are optimized for events:

  • Pickup time
  • Distance
  • Fare
  • Pickup ZIP

But analytics teams ask questions like:

  • Trips by region
  • Revenue by state

Storing these attributes repeatedly in the fact table:

  • Increases storage
  • Slows joins

Breaking these into dimensional modeling is the best practice. In our project, the use case of knowing region of pickup / drop zip code would pave way for creating the dimension table zip_dim.

In real projects, ZIP metadata comes from:

  • Census data
  • Exposed via APIs
  • Internal reference tables

For this project, we simulate it with some random range based hardcoded values.

Where Does the ZIP Dimension Belong?

Layer Responsibility
Bronze Raw ZIP values as they appear
Silver Create and maintain ZIP dimension
Gold Join ZIP dimension for analytics

Even though ZIPs appear in Bronze data, the dimension itself is curated, so it belongs in Silver, not Bronze!
We derive ZIPs from the Bronze Delta table, not directly from raw files.

Zip_dim builder

Step 1: Create the schema for the zip_dim table

%sql
CREATE SCHEMA IF NOT EXISTS nyc_taxi.raw;
CREATE TABLE IF NOT EXISTS nyc_taxi.raw.zip_dim (
zip_code INT,
state STRING,
region STRING
)
USING DELTA;

Step 2: Read the unique and valid list of Zips - both pick up and drop from the bronze data.

from pyspark.sql.functions import *

zip_stream = (
spark.readStream
.table("nyc_taxi.bronze.taxi_trips")
.selectExpr("pickup_zip as zip")
.union(
spark.readStream
.table("nyc_taxi.bronze.taxi_trips")
.selectExpr("dropoff_zip as zip")
)
.where("zip IS NOT NULL")
.dropDuplicates(["zip"])
)

Step 3: Assign random metadata to the Zip values to simulate the actual metadata seeding.

def upsert_zip_dim(batch_df, batch_id):
batch_df.createOrReplaceTempView("zip_updates")
spark.sql("""
MERGE INTO nyc_taxi.raw.zip_dim t
USING (
SELECT
CAST(zip AS INT) AS zip_code,
CASE
WHEN zip BETWEEN 10001 AND 10282 THEN 'NY'
WHEN zip BETWEEN 11201 AND 11256 THEN 'US'
WHEN zip BETWEEN 10451 AND 10475 THEN 'IN'
WHEN zip BETWEEN 10301 AND 10314 THEN 'AD'
ELSE 'SA'
END AS state,
CASE
WHEN zip BETWEEN 10001 AND 10282 THEN 'Manhattan'
WHEN zip BETWEEN 11201 AND 11256 THEN 'Brooklyn'
WHEN zip BETWEEN 10451 AND 10475 THEN 'Bronx'
WHEN zip BETWEEN 10301 AND 10314 THEN 'Staten Island'
ELSE 'Queens'
END AS region
FROM zip_updates
) s
ON t.zip_code = s.zip_code
WHEN NOT MATCHED THEN INSERT *
""")

Step 4: Populate the nyc_taxi.raw.zip_dim delta table with zip meta data by batch processing.

(
zip_stream.writeStream
.foreachBatch(upsert_zip_dim)
.option("checkpointLocation", "/Volumes/nyc_taxi/infra/checkpoints/raw/zip_dim_data")
.trigger(availableNow=True)
.start()
)

The Zip dimention table nyc_taxi.raw.zip_dim is now ready.

Happy learning!

Top comments (0)