DEV Community

Abdelrahman Adnan
Abdelrahman Adnan

Posted on

Part 3: Testing, Documentation & Deployment ๐Ÿš€

DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling

Macros - Reusable SQL Functions ๐Ÿ”ง

Macros are like functions in Python - write once, use everywhere.

Why Use Macros?

Without macros, you repeat code:

-- โŒ Repeated everywhere
CASE 
    WHEN payment_type = 1 THEN 'Credit card'
    WHEN payment_type = 2 THEN 'Cash'
    WHEN payment_type = 3 THEN 'No charge'
    WHEN payment_type = 4 THEN 'Dispute'
    WHEN payment_type = 5 THEN 'Unknown'
    ELSE 'Unknown'
END as payment_type_description
Enter fullscreen mode Exit fullscreen mode

With macros, write it once:

-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
    CASE {{ payment_type }}
        WHEN 1 THEN 'Credit card'
        WHEN 2 THEN 'Cash'
        WHEN 3 THEN 'No charge'
        WHEN 4 THEN 'Dispute'
        WHEN 5 THEN 'Unknown'
        ELSE 'Unknown'
    END
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

Use it in any model:

-- models/staging/stg_green_tripdata.sql
select
    payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}
Enter fullscreen mode Exit fullscreen mode

Jinja Templating

dbt uses Jinja - a Python templating language. You'll recognize it by {{ }} and {% %}:

Syntax Purpose Example
{{ }} Output expression {{ ref('my_model') }}
{% %} Logic/control flow {% if is_incremental() %}
{# #} Comments {# This is a comment #}

dbt Packages - Community Libraries ๐Ÿ“ฆ

Packages let you use macros and models built by others.

Popular Packages

Package What it Does
dbt_utils Common SQL helpers (surrogate keys, pivot, etc.)
dbt_codegen Auto-generate YAML and SQL
dbt_expectations Great Expectations-style tests
dbt_audit_helper Compare model outputs when refactoring

Installing Packages

  1. Create packages.yml:
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
Enter fullscreen mode Exit fullscreen mode
  1. Run dbt deps:
dbt deps
Enter fullscreen mode Exit fullscreen mode
  1. Use the macros:
-- Using dbt_utils to generate surrogate keys
select
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'pickup_datetime']) }} as trip_id,
    *
from {{ source('staging', 'green_tripdata') }}
Enter fullscreen mode Exit fullscreen mode

Testing in dbt ๐Ÿงช

Tests ensure your data meets expectations. dbt has several test types:

1. Generic Tests (Most Common)

Built-in tests you apply in YAML:

# models/staging/schema.yml
version: 2

models:
  - name: stg_green_tripdata
    columns:
      - name: trip_id
        tests:
          - unique       # No duplicate values
          - not_null     # No null values

      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]  # Only these values allowed

      - name: pickup_location_id
        tests:
          - relationships:  # Referential integrity
              to: ref('dim_zones')
              field: location_id
Enter fullscreen mode Exit fullscreen mode

The four built-in tests:
| Test | What it Checks |
|------|----------------|
| unique | No duplicate values in column |
| not_null | No NULL values in column |
| accepted_values | Values must be in specified list |
| relationships | Values must exist in another table |

2. Singular Tests

Custom SQL tests in the tests/ folder:

-- tests/assert_positive_fare_amount.sql
-- Test FAILS if any rows are returned

select
    trip_id,
    fare_amount
from {{ ref('fct_trips') }}
where fare_amount < 0  -- Find negative fares (bad data!)
Enter fullscreen mode Exit fullscreen mode

3. Source Freshness Tests

Check if your source data is up to date:

sources:
  - name: staging
    tables:
      - name: green_tripdata
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
        loaded_at_field: pickup_datetime
Enter fullscreen mode Exit fullscreen mode

Running Tests

# Run all tests
dbt test

# Run tests for specific model
dbt test --select stg_green_tripdata

# Run tests and models together
dbt build
Enter fullscreen mode Exit fullscreen mode

Documentation ๐Ÿ“

dbt generates beautiful documentation automatically!

Adding Descriptions

In your schema YAML:

version: 2

models:
  - name: fct_trips
    description: >
      Fact table containing all taxi trips (yellow and green).
      One row per trip with fare details and zone information.

    columns:
      - name: trip_id
        description: Unique identifier for each trip (surrogate key)

      - name: service_type
        description: Type of taxi service - 'Yellow' or 'Green'

      - name: total_amount
        description: Total trip cost including fare, tips, taxes, and fees
Enter fullscreen mode Exit fullscreen mode

Generating Docs

# Generate documentation
dbt docs generate

# Serve locally (opens browser)
dbt docs serve
Enter fullscreen mode Exit fullscreen mode

This creates an interactive website with:

  • Model descriptions
  • Column definitions
  • Dependency graph (visual DAG)
  • Source information

Essential dbt Commands ๐Ÿ’ป

The Big Four

Command What it Does
dbt run Build all models (create views/tables)
dbt test Run all tests
dbt build Run + test together (recommended!)
dbt compile Generate SQL without executing

Other Useful Commands

# Check connection
dbt debug

# Load seed files
dbt seed

# Install packages
dbt deps

# Generate docs
dbt docs generate

# Retry failed models
dbt retry
Enter fullscreen mode Exit fullscreen mode

Selecting Specific Models

Use --select (or -s) to run specific models:

# Single model
dbt run --select stg_green_tripdata

# Model and all upstream dependencies
dbt run --select +fct_trips

# Model and all downstream models
dbt run --select stg_green_tripdata+

# Both directions
dbt run --select +fct_trips+

# All models in a folder
dbt run --select staging.*

# Multiple models
dbt run --select stg_green_tripdata stg_yellow_tripdata
Enter fullscreen mode Exit fullscreen mode

Target Environments

# Development (default)
dbt run

# Production
dbt run --target prod
Enter fullscreen mode Exit fullscreen mode

Materializations - Views vs Tables ๐Ÿ“Š

Materialization controls how dbt persists your models in the warehouse.

Types of Materializations

Type What it Creates Use Case
view SQL view (query stored, runs on access) Staging models, frequently changing logic
table Physical table (data stored) Final marts, large datasets, performance
incremental Appends new data only Very large tables, event data
ephemeral Not created (CTE in downstream) Helper models, intermediate steps

Setting Materializations

In the model file:

{{ config(materialized='table') }}

select * from {{ ref('stg_trips') }}
Enter fullscreen mode Exit fullscreen mode

In dbt_project.yml (project-wide):

models:
  my_project:
    staging:
      materialized: view
    marts:
      materialized: table
Enter fullscreen mode Exit fullscreen mode

View vs Table Decision

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 Should I use view or table?                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚ Is the query expensive?  โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚            โ”‚
                    Yes          No
                     โ”‚            โ”‚
                     โ–ผ            โ–ผ
               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
               โ”‚  TABLE  โ”‚  โ”‚  VIEW   โ”‚
               โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Enter fullscreen mode Exit fullscreen mode

Use VIEW when:

  • Staging models (simple transformations)
  • Logic changes frequently
  • Storage cost is a concern

Use TABLE when:

  • Final marts queried often
  • Complex joins/aggregations
  • Query performance matters

Putting It All Together - The NYC Taxi Project ๐Ÿš•

In this module, we build a complete dbt project for NYC taxi data:

What We Build

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      RAW DATA                                 โ”‚
โ”‚  green_tripdata (GCS/BigQuery) โ”‚ yellow_tripdata (GCS/BigQuery)โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                    โ”‚                     โ”‚
                    โ–ผ                     โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    STAGING LAYER                              โ”‚
โ”‚      stg_green_tripdata    โ”‚    stg_yellow_tripdata          โ”‚
โ”‚      (cleaned, renamed)    โ”‚    (cleaned, renamed)           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                    โ”‚                     โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                               โ”‚
                               โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                  INTERMEDIATE LAYER                           โ”‚
โ”‚                   int_trips_unioned                           โ”‚
โ”‚            (green + yellow combined)                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                โ”‚
                                โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      MARTS LAYER                              โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚  โ”‚ dim_zones   โ”‚  โ”‚   fct_trips   โ”‚  โ”‚fct_monthly_zone_rev โ”‚ โ”‚
โ”‚  โ”‚ (dimension) โ”‚  โ”‚    (fact)     โ”‚  โ”‚     (report)        โ”‚ โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Enter fullscreen mode Exit fullscreen mode

The Models We Create

Model Type Description
stg_green_tripdata Staging Cleaned green taxi data
stg_yellow_tripdata Staging Cleaned yellow taxi data
int_trips_unioned Intermediate Combined yellow + green trips
dim_zones Dimension Zone lookup table
fct_trips Fact One row per trip
fct_monthly_zone_revenue Report Monthly revenue by zone

Setup Options ๐Ÿ”ง

Option 1: Local Setup (DuckDB + dbt Core)

Pros: Free, no cloud account needed
Cons: Limited to your machine's power

# 1. Install dbt with DuckDB adapter
pip install dbt-duckdb

# 2. Clone the project
git clone https://github.com/DataTalksClub/data-engineering-zoomcamp
cd data-engineering-zoomcamp/04-analytics-engineering/taxi_rides_ny

# 3. Create profiles.yml in ~/.dbt/
# 4. Run dbt debug to test connection
dbt debug

# 5. Build the project
dbt build --target prod
Enter fullscreen mode Exit fullscreen mode

Option 2: Cloud Setup (BigQuery + dbt Cloud)

Pros: Powerful, team collaboration, scheduler
Cons: Requires GCP account (free tier available)

  1. Create dbt Cloud account (free)
  2. Connect to your BigQuery project
  3. Clone the repo in dbt Cloud IDE
  4. Run dbt build --target prod

Troubleshooting Common Issues ๐Ÿ”

"Profile not found"

  • Check dbt_project.yml profile name matches profiles.yml
  • Ensure profiles.yml is in ~/.dbt/

"Source not found"

  • Verify database/schema names in sources.yml
  • Check your data is actually loaded in the warehouse

"Model depends on model that was not found"

  • Check for typos in ref() calls
  • Ensure referenced model exists

DuckDB Out of Memory

  • Add memory settings to profiles.yml:
settings:
  memory_limit: '2GB'
Enter fullscreen mode Exit fullscreen mode

Key Takeaways ๐ŸŽ“

  1. Analytics Engineering bridges data engineering and data analysis

  2. dbt brings software engineering best practices to SQL transformations

  3. Dimensional modeling organizes data into facts (events) and dimensions (attributes)

  4. Three layers - staging (raw copy), intermediate (transformations), marts (final)

  5. ref() and source() are your main functions for building dependencies

  6. Testing ensures data quality - use unique, not_null, accepted_values, relationships

  7. Documentation is auto-generated from YAML descriptions

  8. dbt build runs and tests everything in dependency order


Additional Resources ๐Ÿ“š


Top comments (0)