DEV Community

Cover image for Study Notes 4.3.1 - Build the First dbt Models
Pizofreude
Pizofreude

Posted on

Study Notes 4.3.1 - Build the First dbt Models

1. Introduction to dbt Models

intro to dbt models

  • dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
  • Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
  • Approach: Modular data modeling using fact tables and dimensional tables.

modular data modeling theory

  • Process:
    • Start with raw data (sources).
    • Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
    • Apply business logic to create fact and dimensional tables.
    • Aggregate data for stakeholders.

2. Anatomy of a dbt Model

  • Models are SQL files that contain SELECT statements.
  • Materialization: Determines how the model is stored in the database.
    • Types of Materialization:
      1. Ephemeral: Not physically stored; exists only within other models (like a CTE).
      2. View: Materialized as a view in the database.
      3. Table: Materialized as a physical table.
      4. Incremental: Updates only new data in the table (avoids recreating the entire table).

materializations in dbt cloud

  • File Structure:
    • Each model is a .sql file.
    • Example: model_name.sql contains a SELECT statement.
    • dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT).

3. Defining Sources

  • Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).

defining sources

  • YAML Configuration:

    • Define sources in a schema.yml file.
    • Specify database, schema, and table names.
    • Example:

      version: 2
      sources:
        - name: taxi_trips
          database: my_database
          schema: trip_data
          tables:
            - name: green_trip_data
            - name: yellow_trip_data
      
      
  • Benefits:

    • Abstracts the complexity of source locations.
    • Allows for freshness testing to ensure data is up-to-date.

4. Using Seeds

  • Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv).
  • Purpose: Used for small, static datasets (e.g., master data tables).
  • Benefits:
    • Version-controlled.
    • Can be documented and tested like other dbt models.
  • Example:
    • Load taxi_zone_lookup.csv as a seed.
    • Use it in transformations to enrich trip data with zone information.

5. Modular Data Modeling

  • Layers:
    1. Staging: Clean and prepare raw data.
      • Example: stg_green_trip_data.sql and stg_yellow_trip_data.sql.
    2. Core: Create fact and dimensional tables.
      • Example: fact_trips.sql and dim_zones.sql.

Image description

  • Ref Function:

    • Use ref('model_name') to reference other dbt models.
    • Automatically builds dependencies between models.
    • Example:

      SELECT * FROM {{ ref('stg_green_trip_data') }}
      
      

6. Macros in dbt

  • Macros: Reusable SQL code snippets (similar to functions in Python).
  • Purpose: Simplify repetitive tasks and dynamically generate SQL.
  • Example:

    • Macro to generate payment type descriptions:

      {% macro get_payment_type_description(payment_type) %}
          CASE CAST({{ payment_type }} AS INTEGER)
              WHEN 1 THEN 'Credit Card'
              WHEN 2 THEN 'Cash'
              ELSE 'Unknown'
          END
      {% endmacro %}
      
      
  • Usage:

    • Call the macro in a model:

      SELECT {{ get_payment_type_description('payment_type') }} AS payment_description
      
      

7. Packages in dbt

  • Packages: Libraries of reusable macros and models.
  • Example: dbt_utils package provides common utilities (e.g., surrogate_key).
  • Installation:

    • Add to packages.yml:

      packages:
        - package: dbt-labs/dbt_utils
          version: 0.8.0
      
      
    • Run dbt deps to install.

  • Usage:

    • Example: Generate a surrogate key:

      SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key
      
      

8. Variables in dbt

  • Variables: Dynamic values that can be defined at the project level or passed during execution.
  • Usage:

    • Define in dbt_project.yml:

      vars:
        payment_type_values: [1, 2, 3, 4, 5, 6]
      
      
    • Use in a model:

      {% if var('is_test_run', true) %}
          LIMIT 100
      {% endif %}
      
      
  • Execution:

    • Pass variables via CLI:

      dbt run --vars '{"is_test_run": false}'
      
      

9. Building Fact and Dimensional Tables

  • Fact Table: Aggregates data from multiple sources.
    • Example: fact_trips.sql combines green and yellow trip data.
    • Union both datasets and join with dim_zones for zone information.
  • Dimensional Table: Provides context for fact tables.
    • Example: dim_zones.sql contains zone details (e.g., borough, service zone).

10. Testing and Deployment

  • Testing:

    • Use dbt's built-in tests (e.g., uniqueness, not null).
    • Example:

      tests:
        - unique
        - not_null
      
      
  • Deployment:

    • Run dbt run to compile and execute models.
    • Use dbt build to run models and tests together.
    • Example:

      dbt build --vars '{"is_test_run": false}'
      
      
    • Best Practice to implement dev limit which can be toggle off when in production by simply setting default=false:

      -- dbt build --select <model_name> --vars '{'is_test_run': 'false'}'
      {% if var('is_test_run', default=true) %}
      
        limit 100
      
      {% endif %}
      

      → faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology.


11. Lineage and Dependency Management

  • Lineage: Visual representation of dependencies between models.
  • Benefits:
    • Understand how data flows through the pipeline.
    • Identify upstream and downstream impacts of changes.
  • Example:
    • fact_trips depends on stg_green_trip_data, stg_yellow_trip_data, and dim_zones.

modular data modeling nyc taxi practical


12. Best Practices

  • Modularity: Break down transformations into reusable models.
  • Documentation: Use YAML files to document sources, models, and tests.
  • Version Control: Store dbt projects in Git for collaboration and history.
  • Performance: Materialize large tables as tables (not views) for better performance.

13. Key Commands

  • dbt run: Compiles and runs models.
  • dbt test: Runs tests on models.
  • dbt build: Runs models and tests together.
  • dbt deps: Installs packages.
  • dbt compile: Compiles SQL without executing it.

14. Example Project Structure

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_green_trip_data.sql
│   │   ├── stg_yellow_trip_data.sql
│   ├── core/
│   │   ├── fact_trips.sql
│   │   ├── dim_zones.sql
├── seeds/
│   ├── taxi_zone_lookup.csv
├── macros/
│   ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml

Enter fullscreen mode Exit fullscreen mode

AWS Security LIVE!

Join us for AWS Security LIVE!

Discover the future of cloud security. Tune in live for trends, tips, and solutions from AWS and AWS Partners.

Learn More

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay