#DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling
Macros - Reusable SQL Functions
Macros are like functions in Python - write once, use everywhere.
Why Use Macros?
Without macros, you repeat code:
-- Repeated everywhere
CASE
WHEN payment_type = 1 THEN 'Credit card'
WHEN payment_type = 2 THEN 'Cash'
WHEN payment_type = 3 THEN 'No charge'
WHEN payment_type = 4 THEN 'Dispute'
WHEN payment_type = 5 THEN 'Unknown'
ELSE 'Unknown'
END as payment_type_description
With macros, write it once:
-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
CASE {{ payment_type }}
WHEN 1 THEN 'Credit card'
WHEN 2 THEN 'Cash'
WHEN 3 THEN 'No charge'
WHEN 4 THEN 'Dispute'
WHEN 5 THEN 'Unknown'
ELSE 'Unknown'
END
{% endmacro %}
Use it in any model:
-- models/staging/stg_green_tripdata.sql
select
payment_type,
{{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}
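When dbt compiles the model, the macro call is expanded in place, so the compiled SQL (under `target/compiled/`) contains no Jinja at all. Roughly (the fully qualified source name depends on your warehouse and `sources` config; `my_database.staging` here is a placeholder):

```sql
-- Compiled output: the macro call has been replaced by plain SQL
select
    payment_type,
    case payment_type
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        else 'Unknown'
    end as payment_type_description
from my_database.staging.green_tripdata
```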
Jinja Templating
dbt uses Jinja, a Python-based templating language. You'll recognize it by {{ }} and {% %}:
| Syntax | Purpose | Example |
|---|---|---|
| `{{ }}` | Output expression | `{{ ref('my_model') }}` |
| `{% %}` | Logic/control flow | `{% if is_incremental() %}` |
| `{# #}` | Comments | `{# This is a comment #}` |
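As a sketch of what `{% %}` control flow buys you (the column names here are illustrative, not from the course models), a Jinja for-loop can generate repetitive SQL instead of hand-writing one column per payment type:

```sql
-- Hypothetical model: one trip-count column per payment type,
-- generated by a Jinja loop
select
    pickup_locationid,
    {%- for pt in [1, 2, 3, 4, 5] %}
    sum(case when payment_type = {{ pt }} then 1 else 0 end)
        as payment_type_{{ pt }}_trips{{ "," if not loop.last }}
    {%- endfor %}
from {{ source('staging', 'green_tripdata') }}
group by 1
```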
dbt Packages - Community Libraries
Packages let you use macros and models built by others.
Popular Packages
| Package | What it Does |
|---|---|
| dbt_utils | Common SQL helpers (surrogate keys, pivot, etc.) |
| dbt_codegen | Auto-generate YAML and SQL |
| dbt_expectations | Great Expectations-style tests |
| dbt_audit_helper | Compare model outputs when refactoring |
Installing Packages
1. Create `packages.yml`:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

2. Run `dbt deps`:

dbt deps

3. Use the macros:
-- Using dbt_utils to generate surrogate keys
select
{{ dbt_utils.generate_surrogate_key(['vendorid', 'pickup_datetime']) }} as trip_id,
*
from {{ source('staging', 'green_tripdata') }}
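Under the hood, `generate_surrogate_key` hashes the concatenated, null-safe field values with md5. The compiled SQL looks roughly like this on BigQuery (exact syntax varies by adapter):

```sql
-- Approximate compiled form of generate_surrogate_key
to_hex(md5(cast(
    coalesce(cast(vendorid as string), '_dbt_utils_surrogate_key_null_')
    || '-' ||
    coalesce(cast(pickup_datetime as string), '_dbt_utils_surrogate_key_null_')
    as string))) as trip_id
```

The null placeholder means two rows that differ only in which field is NULL still get distinct keys.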
Testing in dbt
Tests ensure your data meets expectations. dbt has several test types:
1. Generic Tests (Most Common)
Built-in tests you apply in YAML:
# models/staging/schema.yml
version: 2
models:
- name: stg_green_tripdata
columns:
- name: trip_id
tests:
- unique # No duplicate values
- not_null # No null values
- name: payment_type
tests:
- accepted_values:
values: [1, 2, 3, 4, 5, 6] # Only these values allowed
- name: pickup_location_id
tests:
- relationships: # Referential integrity
to: ref('dim_zones')
field: location_id
The four built-in tests:
| Test | What it Checks |
|------|----------------|
| unique | No duplicate values in column |
| not_null | No NULL values in column |
| accepted_values | Values must be in specified list |
| relationships | Values must exist in another table |
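Each generic test compiles to a query that should return zero rows; any row returned is a failure. For example, the `not_null` test on `trip_id` compiles to roughly:

```sql
-- Compiled not_null test: passes only when this returns no rows
select trip_id
from {{ ref('stg_green_tripdata') }}
where trip_id is null
```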
2. Singular Tests
Custom SQL tests in the tests/ folder:
-- tests/assert_positive_fare_amount.sql
-- Test FAILS if any rows are returned
select
trip_id,
fare_amount
from {{ ref('fct_trips') }}
where fare_amount < 0 -- Find negative fares (bad data!)
3. Source Freshness Tests
Check if your source data is up to date:
sources:
- name: staging
tables:
- name: green_tripdata
freshness:
warn_after: {count: 24, period: hour}
error_after: {count: 48, period: hour}
loaded_at_field: pickup_datetime
Running Tests
# Run all tests
dbt test
# Run tests for a specific model
dbt test --select stg_green_tripdata
# Run models and tests together
dbt build
# Check source freshness (runs separately from dbt test)
dbt source freshness
Documentation
dbt generates beautiful documentation automatically!
Adding Descriptions
In your schema YAML:
version: 2
models:
- name: fct_trips
description: >
Fact table containing all taxi trips (yellow and green).
One row per trip with fare details and zone information.
columns:
- name: trip_id
description: Unique identifier for each trip (surrogate key)
- name: service_type
description: Type of taxi service - 'Yellow' or 'Green'
- name: total_amount
description: Total trip cost including fare, tips, taxes, and fees
Generating Docs
# Generate documentation
dbt docs generate
# Serve locally (opens browser)
dbt docs serve
This creates an interactive website with:
- Model descriptions
- Column definitions
- Dependency graph (visual DAG)
- Source information
Essential dbt Commands
The Big Four
| Command | What it Does |
|---|---|
| `dbt run` | Build all models (create views/tables) |
| `dbt test` | Run all tests |
| `dbt build` | Run + test together (recommended!) |
| `dbt compile` | Generate SQL without executing |
Other Useful Commands
# Check connection
dbt debug
# Load seed files
dbt seed
# Install packages
dbt deps
# Generate docs
dbt docs generate
# Retry failed models
dbt retry
Selecting Specific Models
Use --select (or -s) to run specific models:
# Single model
dbt run --select stg_green_tripdata
# Model and all upstream dependencies
dbt run --select +fct_trips
# Model and all downstream models
dbt run --select stg_green_tripdata+
# Both directions
dbt run --select +fct_trips+
# All models in a folder
dbt run --select staging.*
# Multiple models
dbt run --select stg_green_tripdata stg_yellow_tripdata
Target Environments
# Development (default)
dbt run
# Production
dbt run --target prod
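Targets are defined in `profiles.yml`. A minimal sketch for the DuckDB setup - the profile name and file paths below are illustrative, not the course's actual values:

```yaml
# ~/.dbt/profiles.yml - hypothetical profile with dev and prod targets
taxi_rides_ny:
  target: dev          # default when no --target flag is given
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
    prod:
      type: duckdb
      path: prod.duckdb
```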
Materializations - Views vs Tables
Materialization controls how dbt persists your models in the warehouse.
Types of Materializations
| Type | What it Creates | Use Case |
|---|---|---|
| view | SQL view (query stored, runs on access) | Staging models, frequently changing logic |
| table | Physical table (data stored) | Final marts, large datasets, performance |
| incremental | Appends new data only | Very large tables, event data |
| ephemeral | Not created (CTE in downstream) | Helper models, intermediate steps |
Setting Materializations
In the model file:
{{ config(materialized='table') }}
select * from {{ ref('stg_trips') }}
In dbt_project.yml (project-wide):
models:
my_project:
staging:
materialized: view
marts:
materialized: table
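The `incremental` materialization from the table above deserves its own sketch - it processes only new rows on each run. Column names follow the taxi models; `unique_key` is optional and used for deduplication:

```sql
{{ config(
    materialized='incremental',
    unique_key='trip_id'
) }}

select * from {{ ref('stg_green_tripdata') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what's already loaded;
-- {{ this }} refers to the existing target table
where pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}
```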
View vs Table Decision
Should I use view or table?
             |
             v
   Is the query expensive?
        /          \
      Yes           No
       |             |
       v             v
     TABLE         VIEW
Use VIEW when:
- Staging models (simple transformations)
- Logic changes frequently
- Storage cost is a concern
Use TABLE when:
- Final marts queried often
- Complex joins/aggregations
- Query performance matters
Putting It All Together - The NYC Taxi Project
In this module, we build a complete dbt project for NYC taxi data:
What We Build
RAW DATA
  green_tripdata (GCS/BigQuery)      yellow_tripdata (GCS/BigQuery)
          |                                    |
          v                                    v
STAGING LAYER
  stg_green_tripdata                 stg_yellow_tripdata
  (cleaned, renamed)                 (cleaned, renamed)
          |                                    |
          +-----------------+-----------------+
                            |
                            v
INTERMEDIATE LAYER
  int_trips_unioned (green + yellow combined)
                            |
                            v
MARTS LAYER
  dim_zones (dimension)   fct_trips (fact)   fct_monthly_zone_revenue (report)
The Models We Create
| Model | Type | Description |
|---|---|---|
| `stg_green_tripdata` | Staging | Cleaned green taxi data |
| `stg_yellow_tripdata` | Staging | Cleaned yellow taxi data |
| `int_trips_unioned` | Intermediate | Combined yellow + green trips |
| `dim_zones` | Dimension | Zone lookup table |
| `fct_trips` | Fact | One row per trip |
| `fct_monthly_zone_revenue` | Report | Monthly revenue by zone |
Setup Options
Option 1: Local Setup (DuckDB + dbt Core)
Pros: Free, no cloud account needed
Cons: Limited to your machine's power
# 1. Install dbt with DuckDB adapter
pip install dbt-duckdb
# 2. Clone the project
git clone https://github.com/DataTalksClub/data-engineering-zoomcamp
cd data-engineering-zoomcamp/04-analytics-engineering/taxi_rides_ny
# 3. Create profiles.yml in ~/.dbt/
# 4. Run dbt debug to test connection
dbt debug
# 5. Build the project
dbt build --target prod
Option 2: Cloud Setup (BigQuery + dbt Cloud)
Pros: Powerful, team collaboration, scheduler
Cons: Requires GCP account (free tier available)
1. Create a dbt Cloud account (free)
2. Connect it to your BigQuery project
3. Clone the repo in the dbt Cloud IDE
4. Run `dbt build --target prod`
Troubleshooting Common Issues
"Profile not found"
- Check
dbt_project.ymlprofile name matchesprofiles.yml - Ensure
profiles.ymlis in~/.dbt/
"Source not found"
- Verify database/schema names in
sources.yml - Check your data is actually loaded in the warehouse
"Model depends on model that was not found"
- Check for typos in
ref()calls - Ensure referenced model exists
DuckDB Out of Memory
- Add memory settings to profiles.yml:
settings:
memory_limit: '2GB'
Key Takeaways
- **Analytics Engineering** bridges data engineering and data analysis
- **dbt** brings software engineering best practices to SQL transformations
- **Dimensional modeling** organizes data into facts (events) and dimensions (attributes)
- **Three layers** - staging (raw copy), intermediate (transformations), marts (final)
- **`ref()` and `source()`** are your main functions for building dependencies
- **Testing** ensures data quality - use `unique`, `not_null`, `accepted_values`, `relationships`
- **Documentation** is auto-generated from YAML descriptions
- **`dbt build`** runs and tests everything in dependency order
Additional Resources
- dbt Documentation
- dbt Fundamentals Course (free)
- SQL Refresher for Window Functions
- dbt Community Slack