rahulbhave
Automating Data Builds with dbt and GitHub Actions

Introduction:
As more and more organizations rely on data to drive their decision-making processes, it's becoming increasingly important to ensure that this data is accurate, consistent, and up-to-date. This is where dbt (data build tool) comes in – it's a popular open-source tool that's designed to help data analysts and engineers build, test, and maintain data pipelines.

In this blog post, I will show you how to use GitHub Actions to automate data builds with dbt, along with one custom test. Specifically, I will walk you through the YAML file provided in the GitHub repository, which sets up a CI/CD pipeline for dbt. This is not a production-grade YAML file, but it should be helpful for building a conceptual understanding.

What is YAML?
Before we dive into the YAML file, let's take a step back and define what YAML is. YAML is a human-readable data serialization language that's often used for configuration files in software development. It's similar to JSON in that it uses key-value pairs to represent data, but it's designed to be more readable and easier to work with.
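For example, here is a small illustrative fragment (the keys and values are made up purely to demonstrate the syntax):

```yaml
# A hypothetical configuration, just to illustrate YAML syntax
app_name: jaffle-shop        # key-value pair with a string value
threads: 4                   # integer value
targets:                     # nested mapping
  dev:
    host: localhost
  prod:
    host: db.example.com
branches: [ main, develop ]  # inline list, the same style GitHub Actions triggers use
```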

As a DevOps engineer, you'll likely be working with YAML files on a regular basis to configure CI/CD pipelines, automate deployments, and manage infrastructure as code. Understanding how to write and work with YAML files is a crucial skill for anyone in this field.

YAML File Overview:
Now, let's take a closer look at the YAML file provided in the GitHub repository. This file is a GitHub Actions workflow that's designed to run dbt whenever changes are pushed to a specific branch in a GitHub repository.

The first section of the file defines the name of the workflow ("ci-test") and specifies when it should run. In this case, it's set to run whenever changes are pushed to the "main" branch of the repository, or whenever a pull request targets that branch. It also sets two environment variables: DBT_PROFILES_DIR, which tells dbt where to find its profiles file, and DBT_POSTGRES_PW, which pulls the database password from a GitHub secret.

name: ci-test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  DBT_PROFILES_DIR: ./
  DBT_POSTGRES_PW: ${{ secrets.DBT_POSTGRES_PW }}

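The DBT_PROFILES_DIR variable tells dbt to look for its profiles.yml in the repository root, and DBT_POSTGRES_PW makes the database password available as an environment variable. A profiles.yml along these lines would then pick up that password at runtime via dbt's env_var function (this is a sketch; the actual profile in the repository may differ):

```yaml
# profiles.yml (sketch -- the repository's actual profile may differ)
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: postgres
      password: "{{ env_var('DBT_POSTGRES_PW') }}"  # injected by the workflow env block
      port: 5432
      dbname: jaffle_shop
      schema: dbt_alice
      threads: 4
```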

The next section of the file defines the job that the workflow will run. In this case, there's a single job called "test" that runs on the latest version of Ubuntu, with a Postgres 14 service container for dbt to connect to. The steps for this job include checking out the repository, setting up Python 3.9, installing dependencies (including dbt), and then running dbt's debug, seed, run, and test commands.

jobs:

  test:
    name: Test
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:14
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: jaffle_shop
          POSTGRES_SCHEMA: dbt_alice
          POSTGRES_THREAD: 4
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:

    - uses: actions/checkout@v2
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        cd jaffle_shop/
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install dbt-postgres
        pip install dbt-core
        pip install pytest
        dbt debug
        dbt seed
        dbt run
        dbt test
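One improvement worth noting: the step above is named "Install dependencies", but it also runs the dbt commands themselves. A sketch of how the same commands could be split into two clearer steps (no behavior change, just reorganized):

```yaml
    - name: Install dependencies
      run: |
        cd jaffle_shop/
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install dbt-postgres dbt-core pytest
    - name: Build and test with dbt
      run: |
        cd jaffle_shop/
        dbt debug
        dbt seed
        dbt run
        dbt test
```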

Finally, one additional step runs after the dbt commands. This step runs tests using the pytest framework and checks the results using assertions. You can add tests in this step as per your model or business logic.

    - name: Run tests
      run: |
         cd jaffle_shop/
         python -m pytest tests/functional/test_example_failing.py -sv
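The repository's test_example_failing.py is not shown here, but a test in this style typically queries the database and asserts on the transformed data. As a self-contained sketch (the check_no_duplicate_ids helper and the sample rows are made up for illustration; a real functional test would fetch rows from the Postgres service, e.g. via psycopg2, instead of using a literal), such a pytest test might look like:

```python
# test_orders.py -- a hypothetical pytest-style data check.
# In a real functional test, `rows` would come from a query against
# the Postgres service container rather than a hard-coded list.

def check_no_duplicate_ids(rows, key="order_id"):
    """Return True if every row has a unique value for `key`."""
    ids = [row[key] for row in rows]
    return len(ids) == len(set(ids))


def test_orders_have_unique_ids():
    rows = [
        {"order_id": 1, "status": "completed"},
        {"order_id": 2, "status": "returned"},
        {"order_id": 3, "status": "completed"},
    ]
    assert check_no_duplicate_ids(rows)
```

Running `python -m pytest -sv` against a file like this prints each test's result, and any failing assertion fails the CI run.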

The final output of the workflow run will look like the screenshot below:

[Image: GitHub Actions workflow run output]

Benefits of Automating Data Builds with dbt and GitHub Actions:
By automating the data build process with dbt and GitHub Actions, you can achieve a number of benefits, including:

Increased efficiency:
Automating the build process can save time and resources, as it eliminates the need for manual intervention and reduces the risk of human error.

Improved accuracy:
By running tests and checks automatically, you can ensure that your data is accurate and consistent, which can lead to better decision-making.

Better collaboration:
By using GitHub Actions, you can collaborate more easily with other members of your team, as everyone can see the status of the build process and make changes as needed.

Increased transparency:
By automating the build process, you can create a more transparent and auditable data pipeline, which can be important for regulatory compliance and data governance.

Conclusion:
In conclusion, automating data builds with dbt and GitHub Actions can be a powerful tool for data analysts and engineers who want to ensure that their data is accurate, consistent, and up-to-date. By understanding how to write and work with YAML files, you can set up a CI/CD pipeline that automates the build process and saves time and resources.

Note:
This is just an attempt to show how dbt builds and tests can be set up. The example YAML file can still be improved, for instance by using proper naming conventions and GitHub secrets throughout. If you are planning to use this file in your daily builds, please modify it accordingly.

References:
To set up dbt, the dbt test data, dbt custom tests, and the Postgres service, I found the following references useful:

  1. To know more about dbt, you can refer to the dbt introduction

  2. To set up the dbt test data used for this blog, you can refer to this Test data setup

  3. Creating a Postgres service using GitHub Actions

  4. Example of dbt custom tests
