1. Introduction to dbt Models
- dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
- Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
- Approach: Modular data modeling using fact tables and dimensional tables.
- 
Process:
- Start with raw data (sources).
- Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
- Apply business logic to create fact and dimensional tables.
- Aggregate data for stakeholders.
 
2. Anatomy of a dbt Model
- 
Models are SQL files that contain SELECTstatements.
- 
Materialization: Determines how the model is stored in the database.
- 
Types of Materialization:
- Ephemeral: Not physically stored; exists only within other models (like a CTE).
- View: Materialized as a view in the database.
- Table: Materialized as a physical table.
- Incremental: Updates only new data in the table (avoids recreating the entire table).
 
 
- 
Types of Materialization:
- 
File Structure:
- Each model is a .sqlfile.
- Example: model_name.sqlcontains aSELECTstatement.
- dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT).
 
- Each model is a 
3. Defining Sources
- Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).
- 
YAML Configuration: - Define sources in a schema.ymlfile.
- Specify database, schema, and table names.
- 
Example: 
 version: 2 sources: - name: taxi_trips database: my_database schema: trip_data tables: - name: green_trip_data - name: yellow_trip_data
 
- Define sources in a 
- 
Benefits: - Abstracts the complexity of source locations.
- Allows for freshness testing to ensure data is up-to-date.
 
4. Using Seeds
- 
Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv).
- Purpose: Used for small, static datasets (e.g., master data tables).
- 
Benefits:
- Version-controlled.
- Can be documented and tested like other dbt models.
 
- 
Example:
- Load taxi_zone_lookup.csvas a seed.
- Use it in transformations to enrich trip data with zone information.
 
- Load 
5. Modular Data Modeling
- 
Layers:
- 
Staging: Clean and prepare raw data.
- Example: stg_green_trip_data.sqlandstg_yellow_trip_data.sql.
 
- Example: 
- 
Core: Create fact and dimensional tables.
- Example: fact_trips.sqlanddim_zones.sql.
 
- Example: 
 
- 
Staging: Clean and prepare raw data.
- 
Ref Function: - Use ref('model_name')to reference other dbt models.
- Automatically builds dependencies between models.
- 
Example: 
 SELECT * FROM {{ ref('stg_green_trip_data') }}
 
- Use 
6. Macros in dbt
- Macros: Reusable SQL code snippets (similar to functions in Python).
- Purpose: Simplify repetitive tasks and dynamically generate SQL.
- 
Example: - 
Macro to generate payment type descriptions: 
 {% macro get_payment_type_description(payment_type) %} CASE CAST({{ payment_type }} AS INTEGER) WHEN 1 THEN 'Credit Card' WHEN 2 THEN 'Cash' ELSE 'Unknown' END {% endmacro %}
 
- 
- 
Usage: - 
Call the macro in a model: 
 SELECT {{ get_payment_type_description('payment_type') }} AS payment_description
 
- 
7. Packages in dbt
- Packages: Libraries of reusable macros and models.
- 
Example: dbt_utilspackage provides common utilities (e.g.,surrogate_key).
- 
Installation: - 
Add to packages.yml:
 packages: - package: dbt-labs/dbt_utils version: 0.8.0
- Run - dbt depsto install.
 
- 
- 
Usage: - 
Example: Generate a surrogate key: 
 SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key
 
- 
8. Variables in dbt
- Variables: Dynamic values that can be defined at the project level or passed during execution.
- 
Usage: - 
Define in dbt_project.yml:
 vars: payment_type_values: [1, 2, 3, 4, 5, 6]
- 
Use in a model: 
 {% if var('is_test_run', true) %} LIMIT 100 {% endif %}
 
- 
- 
Execution: - 
Pass variables via CLI: 
 dbt run --vars '{"is_test_run": false}'
 
- 
9. Building Fact and Dimensional Tables
- 
Fact Table: Aggregates data from multiple sources.
- Example: fact_trips.sqlcombines green and yellow trip data.
- Union both datasets and join with dim_zonesfor zone information.
 
- Example: 
- 
Dimensional Table: Provides context for fact tables.
- Example: dim_zones.sqlcontains zone details (e.g., borough, service zone).
 
- Example: 
10. Testing and Deployment
- 
Testing: - Use dbt's built-in tests (e.g., uniqueness, not null).
- 
Example: 
 tests: - unique - not_null
 
- 
Deployment: - Run dbt runto compile and execute models.
- Use dbt buildto run models and tests together.
- 
Example: 
 dbt build --vars '{"is_test_run": false}'
- 
Best Practice to implement dev limitwhich can be toggle off when in production by simply settingdefault=false:
 -- dbt build --select <model_name> --vars '{'is_test_run': 'false'}' {% if var('is_test_run', default=true) %} limit 100 {% endif %}→ faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology. 
 
- Run 
11. Lineage and Dependency Management
- Lineage: Visual representation of dependencies between models.
- 
Benefits:
- Understand how data flows through the pipeline.
- Identify upstream and downstream impacts of changes.
 
- 
Example:
- 
fact_tripsdepends onstg_green_trip_data,stg_yellow_trip_data, anddim_zones.
 
- 
12. Best Practices
- Modularity: Break down transformations into reusable models.
- Documentation: Use YAML files to document sources, models, and tests.
- Version Control: Store dbt projects in Git for collaboration and history.
- Performance: Materialize large tables as tables (not views) for better performance.
13. Key Commands
- 
dbt run: Compiles and runs models.
- 
dbt test: Runs tests on models.
- 
dbt build: Runs models and tests together.
- 
dbt deps: Installs packages.
- 
dbt compile: Compiles SQL without executing it.
14. Example Project Structure
dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_green_trip_data.sql
│   │   ├── stg_yellow_trip_data.sql
│   ├── core/
│   │   ├── fact_trips.sql
│   │   ├── dim_zones.sql
├── seeds/
│   ├── taxi_zone_lookup.csv
├── macros/
│   ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml
 
 
              






 
    
Top comments (0)