DEV Community

Abdelrahman Adnan
Abdelrahman Adnan

Posted on

Part 2: dbt Project Structure & Building Models πŸ“

DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling

Why Model Data? πŸ“

Raw data is messy and hard to query. Dimensional modeling organizes data into a structure that's:

  • Easy to understand
  • Fast to query
  • Flexible for different analyses

Fact Tables vs Dimension Tables

This is the core of dimensional modeling (also called "star schema"):

Fact Tables (fct_)

  • Contain measurements or events
  • One row per thing that happened
  • Usually have many rows (millions/billions)
  • Contain numeric values you want to analyze

Examples:

  • fct_trips - one row per taxi trip
  • fct_sales - one row per sale
  • fct_orders - one row per order
-- Example fact table
CREATE TABLE fct_trips AS
SELECT
    trip_id,           -- unique identifier
    pickup_datetime,   -- when it happened
    dropoff_datetime,
    pickup_zone_id,    -- foreign keys to dimensions
    dropoff_zone_id,
    fare_amount,       -- numeric measures
    tip_amount,
    total_amount
FROM transformed_trips;
Enter fullscreen mode Exit fullscreen mode

Dimension Tables (dim_)

  • Contain attributes or descriptive information
  • One row per entity
  • Usually fewer rows
  • Provide context for fact tables

Examples:

  • dim_zones - one row per taxi zone
  • dim_customers - one row per customer
  • dim_products - one row per product
-- Example dimension table
CREATE TABLE dim_zones AS
SELECT
    location_id,       -- primary key
    borough,           -- descriptive attributes
    zone_name,
    service_zone
FROM zone_lookup;
Enter fullscreen mode Exit fullscreen mode

The Star Schema ⭐

When you join facts and dimensions, you get a star shape:

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  dim_zones   β”‚
                    β”‚  (pickup)    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  dim_vendors │────│  fct_trips   │────│  dim_zones   β”‚
β”‚              β”‚    β”‚  (center)    β”‚    β”‚  (dropoff)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
                    β”‚ dim_payment  β”‚
                    β”‚    types     β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Enter fullscreen mode Exit fullscreen mode

Why it's powerful:

-- Easy to answer business questions!
SELECT 
    z.borough,
    COUNT(*) as trip_count,
    SUM(f.total_amount) as total_revenue
FROM fct_trips f
JOIN dim_zones z ON f.pickup_zone_id = z.location_id
GROUP BY z.borough
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

dbt Project Structure

A dbt project has a specific folder structure. Understanding this helps you navigate any project:

taxi_rides_ny/
β”œβ”€β”€ dbt_project.yml      # Project configuration (most important!)
β”œβ”€β”€ profiles.yml         # Database connection (often in ~/.dbt/)
β”œβ”€β”€ packages.yml         # External packages to install
β”‚
β”œβ”€β”€ models/              # ⭐ YOUR SQL MODELS LIVE HERE
β”‚   β”œβ”€β”€ staging/         # Raw data, minimally cleaned
β”‚   β”œβ”€β”€ intermediate/    # Complex transformations
β”‚   └── marts/           # Final, business-ready tables
β”‚
β”œβ”€β”€ seeds/               # CSV files to load as tables
β”œβ”€β”€ macros/              # Reusable SQL functions
β”œβ”€β”€ tests/               # Custom test files
β”œβ”€β”€ snapshots/           # Track data changes over time
└── analysis/            # Ad-hoc queries (not built)
Enter fullscreen mode Exit fullscreen mode

The dbt_project.yml File

This is the most important file - dbt looks for it first:

name: 'taxi_rides_ny'
version: '1.0.0'
profile: 'taxi_rides_ny'  # Must match profiles.yml!

# Default configurations
models:
  taxi_rides_ny:
    staging:
      materialized: view  # Staging models become views
    marts:
      materialized: table # Mart models become tables
Enter fullscreen mode Exit fullscreen mode

The Three Model Layers

dbt recommends organizing models into three layers:

1. Staging Layer (staging/)

Purpose: Clean copy of raw data with minimal transformations

What happens here:

  • Rename columns (snake_case, clear names)
  • Cast data types
  • Filter obviously bad data
  • Keep 1:1 with source (same rows, similar columns)
-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
    select * 
    from {{ source('staging', 'green_tripdata') }}
    where vendorid is not null  -- filter bad data
)

select
    -- Rename and cast columns
    cast(vendorid as integer) as vendor_id,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid as integer) as pickup_location_id,
    cast(dolocationid as integer) as dropoff_location_id,
    cast(passenger_count as integer) as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    cast(fare_amount as numeric) as fare_amount,
    cast(total_amount as numeric) as total_amount
from tripdata
Enter fullscreen mode Exit fullscreen mode

2. Intermediate Layer (intermediate/)

Purpose: Complex transformations, joins, business logic

What happens here:

  • Combine multiple staging models
  • Apply business rules
  • Heavy data manipulation
  • NOT exposed to end users
-- models/intermediate/int_trips_unioned.sql
with green_trips as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips
Enter fullscreen mode Exit fullscreen mode

3. Marts Layer (marts/)

Purpose: Final, business-ready tables for end users

What happens here:

  • Final fact and dimension tables
  • Ready for dashboards and reports
  • Only these should be exposed to BI tools!
-- models/marts/fct_trips.sql
{{ config(materialized='table') }}

select
    t.trip_id,
    t.service_type,
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    z_pickup.zone as pickup_zone,
    z_dropoff.zone as dropoff_zone,
    t.passenger_count,
    t.trip_distance,
    t.fare_amount,
    t.total_amount
from {{ ref('int_trips_unioned') }} t
left join {{ ref('dim_zones') }} z_pickup 
    on t.pickup_location_id = z_pickup.location_id
left join {{ ref('dim_zones') }} z_dropoff 
    on t.dropoff_location_id = z_dropoff.location_id
Enter fullscreen mode Exit fullscreen mode

Sources and the source() Function πŸ“₯

What are Sources?

Sources tell dbt where your raw data lives in the warehouse. They're defined in YAML files:

# models/staging/sources.yml
version: 2

sources:
  - name: staging           # Logical name (you choose)
    database: my_project    # Your GCP project or database
    schema: nytaxi          # BigQuery dataset or schema
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
Enter fullscreen mode Exit fullscreen mode

Using the source() Function

Instead of hardcoding table names, use source():

-- ❌ Bad - hardcoded path
SELECT * FROM my_project.nytaxi.green_tripdata

-- βœ… Good - using source()
SELECT * FROM {{ source('staging', 'green_tripdata') }}
Enter fullscreen mode Exit fullscreen mode

Benefits:

  • Change database/schema in one place (YAML file)
  • dbt tracks dependencies automatically
  • Can add freshness tests on sources

The ref() Function - Building Dependencies πŸ”—

This is the most important dbt function!

source() vs ref()

Function Use When Example
source() Reading raw/external data {{ source('staging', 'green_tripdata') }}
ref() Reading another dbt model {{ ref('stg_green_tripdata') }}

How ref() Works

-- models/marts/fct_trips.sql
select *
from {{ ref('int_trips_unioned') }}  -- References the int_trips_unioned model
Enter fullscreen mode Exit fullscreen mode

What ref() does:

  1. βœ… Resolves to the correct schema/table name
  2. βœ… Builds the dependency graph automatically
  3. βœ… Ensures models run in the correct order

The DAG (Directed Acyclic Graph)

dbt builds a dependency graph from your ref() calls:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ stg_green_trips  β”‚     β”‚ stg_yellow_trips β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                        β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚ int_trips_unionedβ”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚    fct_trips     β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Enter fullscreen mode Exit fullscreen mode

When you run dbt build, models run in dependency order automatically!


Seeds - Loading CSV Files 🌱

Seeds let you load small CSV files into your warehouse as tables.

When to Use Seeds

βœ… Good use cases:

  • Lookup tables (zone names, country codes)
  • Static mappings (vendor ID β†’ vendor name)
  • Small reference data that rarely changes

❌ Not good for:

  • Large datasets (use proper data loading)
  • Frequently changing data

How to Use Seeds

  1. Put CSV files in the seeds/ folder:
seeds/
└── taxi_zone_lookup.csv
Enter fullscreen mode Exit fullscreen mode
locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...
Enter fullscreen mode Exit fullscreen mode
  1. Run dbt seed:
dbt seed
Enter fullscreen mode Exit fullscreen mode
  1. Reference in models using ref():
-- models/marts/dim_zones.sql
select
    locationid as location_id,
    borough,
    zone,
    service_zone
from {{ ref('taxi_zone_lookup') }}
Enter fullscreen mode Exit fullscreen mode

Top comments (0)