James

Posted on Aug 10, 2023 • Edited on Dec 18, 2024

Running Transformations on BigQuery using dbt Cloud: step by step

#bigquery #dbt #codenewbie

Introduction
In today's data-driven world, transforming raw data into valuable insights is crucial. This process, however, often involves complex tasks that demand efficiency, scalability, and reliability. Enter dbt Cloud—a powerful tool that simplifies data transformations on Google BigQuery. In this article, I'll take you through a step-by-step guide on how to run BigQuery transformations using dbt Cloud. Let's dive in!

Prerequisites
dbt cloud account
gcp account
for bigquery ensure you have the following

Service account with elevated roles and keys.json
bigquery api enabled
two bigquery datasets: staging and production

Step 1: Setting Up Your dbt Cloud Project
Start by signing in to dbt Cloud and creating a new project.

Create a project and define the connection to bigquery. Dbt requires you to upload a keys.json file that contains your biqguery credentials.Define your target dataset, this will be the destination for your transformed data during development. In this case mentalhealth_staging. You can choose a managed repository on clone a repository from github.
Initialize your project. DBT will create files and folders for you to start with.

By default, dbt will only allow you to work on a branch, therefore create a branch or more for which you can use for development or deployment

Step 2: Creating Transformations
In dbt Cloud, transformations are defined as dbt models. A dbt model is a SQL file containing the transformation logic. Write your SQL queries to transform and reshape your data as needed. These models can join tables, aggregate data, and create calculated fields.

define the dbt_project.yml. The dbt_project.yml file is a configuration file used in dbt (data build tool) projects. It's a YAML (Yet Another Markup Language) file that allows you to define various project settings and configurations in a structured format. This file serves as the central configuration hub for your dbt project, allowing you to customize how dbt behaves when executing data transformations.

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'bigquery_proj'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'default'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  bigquery_proj:
    staging:
      materialized: table
    core:
      materialized: table

Next we define dbt models for defining transformations. We shall create staging models for development and core models for deployment, we then add a macro for enhancing our models. macrosare reusable SQL code snippets that allow you to encapsulate and parameterize common SQL operations or patterns. Macros in dbt provide a way to abstract and simplify complex SQL logic, making your dbt models more modular, maintainable, and efficient.

 {#
    This macro returns gender into three categories 
#}

{% macro get_gender_properties(Gender) -%}

    case {{ Gender }}
        when 'male' then 'male'
        when 'female' then 'female'
        when 'f' then 'female'
        when 'm' then 'male'
        else 'others'
    end

{%- endmacro %}

1.staging models

extract_mentalhealthdata.sql

{{ config(
    materialized='table'
) }}

select 
-- identifier
rand() as unique_id,

-- use macro for coverting gender types
{{ get_gender_properties('Gender') }} as gendertype,

*
from {{ source('mentalhealth_staging','mental_health_table_1') }}

the above model extracts data from a bigquery dataset mentalhealth_staging .It also uses the defined macros
below is the schema for the model


version: 2


sources:
  - name: mentalhealth_staging
    #database: 
    schema: mentalhealth_staging
    tables:
      - name: mental_health_table_1



models:
    - name: extract_mentalhealthdata
      description: "extract mental health and load to a staging table"
      columns:
          - name: unique_id
            description: random unique id for every record
            tests:
                - unique
                - not_null
          - name: Timestamp
            description: Time the survey was submitted
          - name: Age
            description: Respondent age
          - name: Gender
            description: Respondent gender
          - name: Country
            description: Respondent country
          - name: state 
            description: If you live in the United States, which state or territory do you live in?
          - name: self_employed
            description: Are you self-employed?
          - name: family_history
            description: Do you have a family history of mental illness?
          - name: treatment
            description: Have you sought treatment for a mental health condition?
          - name: work_interfere 
            description: If you have a mental health condition, do you feel that it interferes with your work?
          - name: no_employees
            description: How many employees does your company or organization have?
          - name: remote_work
            description: How many employees does your company or organization have?
          - name: tech_company 
            description: your employer primarily a tech company/organization?
          - name: benefits 
            description: Does your employer provide mental health benefits?
          - name: care_options
            description: Do you know the options for mental health care your employer provides?
          - name: wellness_program 
            description: Has your employer ever discussed mental health as part of an employee wellness program?
          - name: seek_help 
            description: Does your employer provide resources to learn more about mental health issues and how to seek help?
          - name: anonymity 
            description: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment
          - name: leave 
            description: How easy is it for you to take medical leave for a mental health condition?
          - name: mental_health_consequence
            description: Do you think that discussing a mental health issue with your employer would have negative consequences?
          - name: phy_health_consequence
            description: Do you think that discussing a physical health issue with your employer would have negative consequences?
          - name: coworkers
            description: Would you be willing to discuss a mental health issue with your coworkers?
          - name: supervisors
            description: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
          - name: mental_health_interview
            description: Would you bring up a mental health issue with a potential employer in an interview?
          - name: phys_health_interview
            description: Would you bring up a physical health issue with a potential employer in an interview
          - name: mental_vs_physical
            description: Do you feel that your employer takes mental health as seriously as physical health?
          - name: obs_consequence
            description: Have you heard of or observed negative consequences for coworkers with mental health conditions in your
          - name: comments
            description: Any additional notes or comments

2.core models
The core models ae used to load data from the staging tables and persist to production tables in mentalhealth_prod dataset

employee_dim.sql

{{ config(
    materialized='table'
) }}

select 
-- identifier
unique_id,

-- employee details
Age,
gendertype,
Country,
state,
remote_work,
tech_company
from {{ source('mentalhealth_staging','extract_mentalhealthdata')}}

below is the defined schema the core models


version: 2


sources:
  - name: mentalhealth_staging
    #database: 
    schema: mentalhealth_staging
    tables:
      - name: extract_mentalhealthdata



models:
    - name: employee_dim
      description: "create employee dim tablele"
      columns:
          - name: unique_id
            description: unique id for every record
            tests:
                - unique
                - not_null
          - name: Age
          - name: gender_type
          - name: Country
          - name: state
          - name: remote_work
          - name: tech_company

Ensure you configure the schemas and data sources correctly.

now run the staging model using dbt build --select extract_mentalhealthdata.sql this will create a table with transformed data based on the model. It will also run some tests defined in the schema.yml

Step 3: Creating deployment environment and running the jobs in development environment
In dbt Cloud, you can create environment like "production" to manage production of your data transformation process. Configure your target dataset and other settings for your environment.
Before deploying your transformations, it's wise to test them in the development environment. Trigger the job and monitor the logs to identify any issues. If everything runs smoothly, you're ready to move to the deployment phase.

Create a deployment environment in dbt and define the target dataset in this case mentalhealth_prod. Note that this deployment environment runs the production branch created.

Next you create jobs to run you models in production env. Here you can define other parameters such as scheduled time of running jobs, commands to run, generate docs etc.

Setting other parameters for the job runs

Step 7: Monitoring and Maintenance
dbt Cloud provides a dashboard to monitor the status of your jobs. Keep an eye on the execution logs and any potential errors.

Regularly update your transformations as your data and business requirements evolve.

Conclusion:
Running BigQuery transformations using dbt Cloud streamlines the process of turning raw data into actionable insights. With a clear step-by-step approach, you can easily set up, develop, test, and deploy your transformations to production. This ensures that your organization benefits from accurate and timely data-driven decisions. Harness the power of dbt Cloud to elevate your data transformation capabilities and propel your business forward.

In just a few minutes, you've learned how to leverage dbt Cloud's capabilities to transform data in BigQuery, all while simplifying complex processes and increasing efficiency. So, what are you waiting for? It's time to unlock the potential of your data!

DEV Community

Running Transformations on BigQuery using dbt Cloud: step by step

Top comments (0)