#DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling
Macros - Reusable SQL Functions
Macros are like functions in Python - write once, use everywhere.
Why Use Macros?
Without macros, you repeat code:
-- Repeated everywhere
CASE
WHEN payment_type = 1 THEN 'Credit card'
WHEN payment_type = 2 THEN 'Cash'
WHEN payment_type = 3 THEN 'No charge'
WHEN payment_type = 4 THEN 'Dispute'
WHEN payment_type = 5 THEN 'Unknown'
ELSE 'Unknown'
END as payment_type_description
With macros, write it once:
-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
CASE {{ payment_type }}
WHEN 1 THEN 'Credit card'
WHEN 2 THEN 'Cash'
WHEN 3 THEN 'No charge'
WHEN 4 THEN 'Dispute'
WHEN 5 THEN 'Unknown'
ELSE 'Unknown'
END
{% endmacro %}
Use it in any model:
-- models/staging/stg_green_tripdata.sql
select
payment_type,
{{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}
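When dbt compiles the model, the macro call is expanded in place, so the compiled SQL (under `target/compiled/`) contains no Jinja at all. Roughly (the fully qualified source name depends on your warehouse and `sources` config; `my_database.staging` here is a placeholder):

```sql
-- Compiled output: the macro call has been replaced by plain SQL
select
    payment_type,
    case payment_type
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        else 'Unknown'
    end as payment_type_description
from my_database.staging.green_tripdata
```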
Jinja Templating
dbt uses Jinja, a Python-based templating language. You'll recognize it by {{ }} and {% %}:
| Syntax | Purpose | Example |
|---|---|---|
| `{{ }}` | Output expression | `{{ ref('my_model') }}` |
| `{% %}` | Logic/control flow | `{% if is_incremental() %}` |
| `{# #}` | Comments | `{# This is a comment #}` |
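As a sketch of what `{% %}` control flow buys you (the column names here are illustrative, not from the course models), a Jinja for-loop can generate repetitive SQL instead of hand-writing one column per payment type:

```sql
-- Hypothetical model: one trip-count column per payment type,
-- generated by a Jinja loop
select
    pickup_locationid,
    {%- for pt in [1, 2, 3, 4, 5] %}
    sum(case when payment_type = {{ pt }} then 1 else 0 end)
        as payment_type_{{ pt }}_trips{{ "," if not loop.last }}
    {%- endfor %}
from {{ source('staging', 'green_tripdata') }}
group by 1
```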
dbt Packages - Community Libraries
Packages let you use macros and models built by others.
Popular Packages
| Package | What it Does |
|---|---|
| dbt_utils | Common SQL helpers (surrogate keys, pivot, etc.) |
| dbt_codegen | Auto-generate YAML and SQL |
| dbt_expectations | Great Expectations-style tests |
| dbt_audit_helper | Compare model outputs when refactoring |
Installing Packages
1. Create `packages.yml`:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

2. Run `dbt deps`:

dbt deps

3. Use the macros:
-- Using dbt_utils to generate surrogate keys
select
{{ dbt_utils.generate_surrogate_key(['vendorid', 'pickup_datetime']) }} as trip_id,
*
from {{ source('staging', 'green_tripdata') }}
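Under the hood, `generate_surrogate_key` hashes the concatenated, null-safe field values with md5. The compiled SQL looks roughly like this on BigQuery (exact syntax varies by adapter):

```sql
-- Approximate compiled form of generate_surrogate_key
to_hex(md5(cast(
    coalesce(cast(vendorid as string), '_dbt_utils_surrogate_key_null_')
    || '-' ||
    coalesce(cast(pickup_datetime as string), '_dbt_utils_surrogate_key_null_')
    as string))) as trip_id
```

The null placeholder means two rows that differ only in which field is NULL still get distinct keys.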
Testing in dbt
Tests ensure your data meets expectations. dbt has several test types:
1. Generic Tests (Most Common)
Built-in tests you apply in YAML:
# models/staging/schema.yml
version: 2
models:
- name: stg_green_tripdata
columns:
- name: trip_id
tests:
- unique # No duplicate values
- not_null # No null values
- name: payment_type
tests:
- accepted_values:
values: [1, 2, 3, 4, 5, 6] # Only these values allowed
- name: pickup_location_id
tests:
- relationships: # Referential integrity
to: ref('dim_zones')
field: location_id
The four built-in tests:
| Test | What it Checks |
|------|----------------|
| unique | No duplicate values in column |
| not_null | No NULL values in column |
| accepted_values | Values must be in specified list |
| relationships | Values must exist in another table |
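Each generic test compiles to a query that should return zero rows; any row returned is a failure. For example, the `not_null` test on `trip_id` compiles to roughly:

```sql
-- Compiled not_null test: passes only when this returns no rows
select trip_id
from {{ ref('stg_green_tripdata') }}
where trip_id is null
```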
2. Singular Tests
Custom SQL tests in the tests/ folder:
-- tests/assert_positive_fare_amount.sql
-- Test FAILS if any rows are returned
select
trip_id,
fare_amount
from {{ ref('fct_trips') }}
where fare_amount < 0 -- Find negative fares (bad data!)
3. Source Freshness Tests
Check if your source data is up to date:
sources:
- name: staging
tables:
- name: green_tripdata
freshness:
warn_after: {count: 24, period: hour}
error_after: {count: 48, period: hour}
loaded_at_field: pickup_datetime
Running Tests
# Run all tests
dbt test
# Run tests for a specific model
dbt test --select stg_green_tripdata
# Run models and tests together
dbt build
# Check source freshness (runs separately from dbt test)
dbt source freshness
Documentation
dbt generates beautiful documentation automatically!
Adding Descriptions
In your schema YAML:
version: 2
models:
- name: fct_trips
description: >
Fact table containing all taxi trips (yellow and green).
One row per trip with fare details and zone information.
columns:
- name: trip_id
description: Unique identifier for each trip (surrogate key)
- name: service_type
description: Type of taxi service - 'Yellow' or 'Green'
- name: total_amount
description: Total trip cost including fare, tips, taxes, and fees
Generating Docs
# Generate documentation
dbt docs generate
# Serve locally (opens browser)
dbt docs serve
This creates an interactive website with:
- Model descriptions
- Column definitions
- Dependency graph (visual DAG)
- Source information
Essential dbt Commands
The Big Four
| Command | What it Does |
|---|---|
| `dbt run` | Build all models (create views/tables) |
| `dbt test` | Run all tests |
| `dbt build` | Run + test together (recommended!) |
| `dbt compile` | Generate SQL without executing |
Other Useful Commands
# Check connection
dbt debug
# Load seed files
dbt seed
# Install packages
dbt deps
# Generate docs
dbt docs generate
# Retry failed models
dbt retry
Selecting Specific Models
Use --select (or -s) to run specific models:
# Single model
dbt run --select stg_green_tripdata
# Model and all upstream dependencies
dbt run --select +fct_trips
# Model and all downstream models
dbt run --select stg_green_tripdata+
# Both directions
dbt run --select +fct_trips+
# All models in a folder
dbt run --select staging.*
# Multiple models
dbt run --select stg_green_tripdata stg_yellow_tripdata
Target Environments
# Development (default)
dbt run
# Production
dbt run --target prod
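Targets are defined in `profiles.yml`. A minimal sketch for the DuckDB setup - the profile name and file paths below are illustrative, not the course's actual values:

```yaml
# ~/.dbt/profiles.yml - hypothetical profile with dev and prod targets
taxi_rides_ny:
  target: dev          # default when no --target flag is given
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
    prod:
      type: duckdb
      path: prod.duckdb
```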
Materializations - Views vs Tables
Materialization controls how dbt persists your models in the warehouse.
Types of Materializations
| Type | What it Creates | Use Case |
|---|---|---|
| view | SQL view (query stored, runs on access) | Staging models, frequently changing logic |
| table | Physical table (data stored) | Final marts, large datasets, performance |
| incremental | Appends new data only | Very large tables, event data |
| ephemeral | Not created (CTE in downstream) | Helper models, intermediate steps |
Setting Materializations
In the model file:
{{ config(materialized='table') }}
select * from {{ ref('stg_trips') }}
In dbt_project.yml (project-wide):
models:
my_project:
staging:
materialized: view
marts:
materialized: table
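The `incremental` materialization from the table above deserves its own sketch - it processes only new rows on each run. Column names follow the taxi models; `unique_key` is optional and used for deduplication:

```sql
{{ config(
    materialized='incremental',
    unique_key='trip_id'
) }}

select * from {{ ref('stg_green_tripdata') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what's already loaded;
-- {{ this }} refers to the existing target table
where pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}
```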
View vs Table Decision
Should I use view or table?
             |
             v
   Is the query expensive?
        /          \
      Yes           No
       |             |
       v             v
     TABLE         VIEW
Use VIEW when:
- Staging models (simple transformations)
- Logic changes frequently
- Storage cost is a concern
Use TABLE when:
- Final marts queried often
- Complex joins/aggregations
- Query performance matters
Putting It All Together - The NYC Taxi Project
In this module, we build a complete dbt project for NYC taxi data:
What We Build
RAW DATA
  green_tripdata (GCS/BigQuery)      yellow_tripdata (GCS/BigQuery)
          |                                    |
          v                                    v
STAGING LAYER
  stg_green_tripdata                 stg_yellow_tripdata
  (cleaned, renamed)                 (cleaned, renamed)
          |                                    |
          +-----------------+-----------------+
                            |
                            v
INTERMEDIATE LAYER
  int_trips_unioned (green + yellow combined)
                            |
                            v
MARTS LAYER
  dim_zones (dimension)   fct_trips (fact)   fct_monthly_zone_revenue (report)
The Models We Create
| Model | Type | Description |
|---|---|---|
| `stg_green_tripdata` | Staging | Cleaned green taxi data |
| `stg_yellow_tripdata` | Staging | Cleaned yellow taxi data |
| `int_trips_unioned` | Intermediate | Combined yellow + green trips |
| `dim_zones` | Dimension | Zone lookup table |
| `fct_trips` | Fact | One row per trip |
| `fct_monthly_zone_revenue` | Report | Monthly revenue by zone |
Setup Options
Option 1: Local Setup (DuckDB + dbt Core)
Pros: Free, no cloud account needed
Cons: Limited to your machine's power
# 1. Install dbt with DuckDB adapter
pip install dbt-duckdb
# 2. Clone the project
git clone https://github.com/DataTalksClub/data-engineering-zoomcamp
cd data-engineering-zoomcamp/04-analytics-engineering/taxi_rides_ny
# 3. Create profiles.yml in ~/.dbt/
# 4. Run dbt debug to test connection
dbt debug
# 5. Build the project
dbt build --target prod
Option 2: Cloud Setup (BigQuery + dbt Cloud)
Pros: Powerful, team collaboration, scheduler
Cons: Requires GCP account (free tier available)
1. Create a dbt Cloud account (free)
2. Connect it to your BigQuery project
3. Clone the repo in the dbt Cloud IDE
4. Run `dbt build --target prod`
Troubleshooting Common Issues
"Profile not found"
- Check
dbt_project.ymlprofile name matchesprofiles.yml - Ensure
profiles.ymlis in~/.dbt/
"Source not found"
- Verify database/schema names in
sources.yml - Check your data is actually loaded in the warehouse
"Model depends on model that was not found"
- Check for typos in
ref()calls - Ensure referenced model exists
DuckDB Out of Memory
- Add memory settings to profiles.yml:
settings:
memory_limit: '2GB'
Key Takeaways
- **Analytics Engineering** bridges data engineering and data analysis
- **dbt** brings software engineering best practices to SQL transformations
- **Dimensional modeling** organizes data into facts (events) and dimensions (attributes)
- **Three layers** - staging (raw copy), intermediate (transformations), marts (final)
- **`ref()` and `source()`** are your main functions for building dependencies
- **Testing** ensures data quality - use `unique`, `not_null`, `accepted_values`, `relationships`
- **Documentation** is auto-generated from YAML descriptions
- **`dbt build`** runs and tests everything in dependency order
Additional Resources
- dbt Documentation
- dbt Fundamentals Course (free)
- SQL Refresher for Window Functions
- dbt Community Slack