De' Clerke

Posted on Jun 2

GitHub Actions for Data Pipelines: The Setup I Use Across All My Projects

#cicd #githubactions #dataengineering #datapipelines

Every data engineering repo I push has a GitHub Actions workflow in it. Not because it is required, but because I got burned enough times pushing code that "worked on my machine" and broke in production. After setting up CI across 30+ pipeline projects, the same patterns come up every time and so do the same gotchas. This article covers the setup that actually works for Python pipelines, dbt, Docker, and Airflow DAG validation.

What GitHub Actions Is (In One Paragraph)

GitHub Actions is GitHub's built-in CI/CD system. You write a YAML file in .github/workflows/, push it, and GitHub runs it on a virtual machine whenever the trigger fires. The machine is ephemeral: it spins up, runs your steps, and disappears. On the Free plan you get 2,000 minutes per month for private repos, and standard GitHub-hosted runners are free for public repos.

Four terms matter:

Workflow is the .yml file. You can have multiple.
Job is a group of steps that share a runner. Jobs run in parallel by default.
Step is a single command or a reusable action from the Marketplace.
Runner is the VM. ubuntu-latest is free and covers most use cases.

That is the mental model. Everything else builds on it.

The Base Workflow for a Python Data Pipeline

This is the starting point for every project. Tests run on push to main and on every pull request.

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: '3.11'

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run tests
        run: pytest tests/ -v --tb=short

The cache: 'pip' on setup-python is not optional for data engineering projects. Your requirements.txt likely pulls in pandas, SQLAlchemy, Airflow providers, dbt, or all of the above. Without caching, you pay that install time on every single run. With caching, it restores from a cache keyed to requirements.txt and you skip most of the download.

Adding PostgreSQL as a Service

Most pipeline tests need a real database. GitHub Actions supports service containers, which spin up alongside your job and are accessible on localhost.

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - run: pip install -r requirements.txt

      - name: Run tests
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
        run: pytest tests/ -v --tb=short --junitxml=reports/junit.xml

The options: block with --health-cmd is not optional. Without it, your steps start running the moment the container starts, not when Postgres is ready to accept connections. The first time I ran a test suite without health checks, half the tests failed with connection refused errors that disappeared when I re-ran the same workflow 30 seconds later. GitHub only waits for a service container when it has a health check to wait on. You have to tell it what "ready" means.
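For intuition, a health check is just a readiness poll with retries. Here is a minimal Python sketch of the same idea, assuming a successful TCP connect is a good enough signal (pg_isready checks more than that). The helper name wait_for_port is hypothetical and its defaults only mirror the retry and interval values used above.

```python
import socket
import time


def wait_for_port(host: str, port: int, retries: int = 5, interval: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False if every retry fails."""
    for attempt in range(retries):
        try:
            # A successful connect means something is listening and accepting connections.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            # Not ready yet; wait before the next attempt (but not after the last one).
            if attempt < retries - 1:
                time.sleep(interval)
    return False
```

In CI the service health check does this for you. A helper like this is mostly useful for local runs against docker compose, where nothing waits on your behalf.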

Pass DATABASE_URL at the step level, not the workflow level. Secrets and sensitive env vars belong as close to the step that uses them as possible.
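On the test side, the code only has to read that variable. A small sketch of how a test helper might pick it apart, assuming the URL has the same shape as the one above. The function name database_settings and the returned keys are illustrative and do not come from any particular library.

```python
import os
from urllib.parse import urlsplit


def database_settings(env=None) -> dict:
    """Split DATABASE_URL into the pieces a database driver typically asks for."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "")
    if not url:
        # Fail loudly so a missing variable is not mistaken for a database problem.
        raise RuntimeError("DATABASE_URL is not set; pass it via env: on the test step")
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to the Postgres default port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }
```

Raising early when the variable is missing keeps the failure message close to the real cause instead of surfacing as a connection error deep inside a test.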

dbt CI: The Profiles Problem

By default, dbt reads connection credentials from ~/.dbt/profiles.yml. That file should never be committed to the repo. On a fresh CI runner it does not exist, so dbt fails immediately on dbt debug. The fix is to generate the file in a step.

# .github/workflows/dbt.yml
name: dbt CI

on:
  pull_request:
    branches: [main]
    paths:
      - 'dbt/**'

jobs:
  dbt-ci:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: ${{ secrets.DBT_POSTGRES_USER }}
          POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
          POSTGRES_DB: ${{ secrets.DBT_POSTGRES_DB }}
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dbt
        run: pip install dbt-core dbt-postgres

      - name: Write profiles.yml
        run: |
          mkdir -p ~/.dbt
          # Quoting 'EOF' stops the shell from expanding $ or backticks inside secret values
          cat > ~/.dbt/profiles.yml << 'EOF'
          my_project:
            target: ci
            outputs:
              ci:
                type: postgres
                host: localhost
                port: 5432
                user: ${{ secrets.DBT_POSTGRES_USER }}
                password: ${{ secrets.DBT_POSTGRES_PASSWORD }}
                dbname: ${{ secrets.DBT_POSTGRES_DB }}
                schema: public
                threads: 4
          EOF

      - name: dbt debug
        working-directory: ./dbt
        run: dbt debug

      - name: dbt deps
        working-directory: ./dbt
        run: dbt deps

      - name: dbt build
        working-directory: ./dbt
        run: dbt build --target ci

Two things to note here. First, the paths: filter on the trigger. If your repo has both Python code and dbt models, you do not want every Python change to trigger a full dbt build. Scoping the trigger to dbt/** means this workflow only runs when dbt files change.

Second, working-directory: ./dbt on each dbt step. dbt looks for dbt_project.yml in the current directory. If your dbt project lives in a subdirectory, every step needs working-directory set or dbt will not find the project.

Airflow DAG Validation Without Running Airflow

Running a full Airflow stack in CI is too heavy. What you actually want is to catch import errors, because that is how most DAG file bugs show up when the scheduler parses them. You can do that by importing each DAG file with Python directly.

- name: Validate DAG imports
  run: |
    pip install apache-airflow
    python -c "
    import importlib.util, pathlib
    # Import each DAG file directly; any exception fails the step
    dag_files = sorted(pathlib.Path('dags').glob('*.py'))
    for f in dag_files:
        spec = importlib.util.spec_from_file_location(f.stem, f)
        mod  = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        print(f'OK: {f.name}')
    "

This catches syntax errors and import errors in your DAGs without spinning up a scheduler, webserver, or database. It will not catch runtime task failures, but it will catch the most common issue: pushing a DAG with a broken import that the DAG parser cannot load.
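If the inline version gets unwieldy, the same check can live in a small script. The sketch below is one possible shape, assuming the DAG files sit directly under dags/. The function name check_dag_imports is hypothetical. Unlike the one-liner, it keeps going after a failure so a single run reports every broken file.

```python
import importlib.util
import pathlib
import traceback


def check_dag_imports(dag_dir: str = "dags") -> list[str]:
    """Import every .py file directly under dag_dir and return the names of files that failed."""
    failed = []
    for path in sorted(pathlib.Path(dag_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
            print(f"OK: {path.name}")
        except Exception:
            # Covers SyntaxError and ImportError; record the file and continue with the rest.
            traceback.print_exc()
            failed.append(path.name)
    return failed
```

In a workflow step you would call check_dag_imports() and exit non-zero when the returned list is not empty, so the job fails.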

Docker Build with Layer Caching

Data engineering Docker images are large. The base Airflow image alone is several hundred megabytes. Without caching, a docker build on every push takes 5+ minutes. With BuildKit's GitHub Actions cache backend (type=gha), unchanged layers are restored and skipped.

# .github/workflows/docker.yml
name: Build and Push Docker Image

on:
  push:
    branches: [main]
    tags: ['v*']

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
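      - uses: actions/checkout@v4

      # type=gha cache export needs a Buildx builder; the default docker driver typically does not support it
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3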

      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: myusername/my-pipeline
          tags: |
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=sha,prefix=sha-

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

Use a Docker Hub access token in DOCKERHUB_TOKEN, not your account password. Access tokens can be scoped and revoked. Your account password cannot.

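The metadata-action handles tagging automatically. Push to main and you get a main tag. Push a v1.2.3 tag and the {{version}} pattern gives you a 1.2.3 tag; if you also want the floating 1.2 and 1 tags, add type=semver,pattern={{major}}.{{minor}} and type=semver,pattern={{major}} to the list. The type=sha,prefix=sha- entry adds a commit-SHA tag so you always know exactly which commit is running in any environment.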
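To make the semver expansion concrete, here is a rough Python illustration of what the three patterns {{version}}, {{major}}.{{minor}}, and {{major}} produce for a release tag. It is not how metadata-action is implemented, the function name semver_tags is made up, and pre-release and build suffixes are deliberately ignored.

```python
import re


def semver_tags(git_tag: str) -> list[str]:
    """Expand a tag like v1.2.3 into full, minor-level, and major-level image tags."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        # Not a plain release tag (for example a branch name), so no semver tags apply.
        return []
    major, minor, patch = match.groups()
    return [f"{major}.{minor}.{patch}", f"{major}.{minor}", major]
```

The floating 1.2 and 1 tags move with every release. Pin deployments to the full version or the SHA tag when you need reproducibility.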

Multi-Job Pipelines with Dependencies

When you have test, build, and deploy as separate concerns, you want them chained. A failed test should stop the build. A failed build should stop the deploy.

name: Full Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11', cache: 'pip'}
      - run: pip install -r requirements.txt && pytest tests/ -v

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: myuser/my-pipeline:latest

  notify:
    needs: [test, build]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - run: echo "Test=${{ needs.test.result }}, Build=${{ needs.build.result }}"

needs: test means the build job will not start unless test passes. needs: [test, build] means notify waits for both. The if: always() on notify means it runs regardless of whether the earlier jobs succeeded or failed. This is where you would put a Slack alert on failure.
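The gating rules are easy to get wrong in your head, so here is a small simulation of them in Python. It is a simplified model, not GitHub's scheduler: a job runs only if all of its needs succeeded, unless it is marked always. The function name simulate and the dictionary layout are assumptions made for this sketch.

```python
from graphlib import TopologicalSorter


def simulate(jobs: dict, outcomes: dict) -> dict:
    """jobs maps name -> {"needs": [...], "always": bool}; outcomes gives the result of jobs that run."""
    graph = {name: set(cfg.get("needs", [])) for name, cfg in jobs.items()}
    results = {}
    # static_order yields each job after all of the jobs it depends on.
    for name in TopologicalSorter(graph).static_order():
        cfg = jobs[name]
        upstream_ok = all(results[dep] == "success" for dep in cfg.get("needs", []))
        if upstream_ok or cfg.get("always", False):
            results[name] = outcomes.get(name, "success")
        else:
            results[name] = "skipped"
    return results
```

With the three jobs above and a failing test job, this model reports build as skipped and notify as still running, which matches the behavior described.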

Conditional Steps

Four conditions cover most real scenarios:

# Run only on pushes to main, not on PRs
- name: Deploy to production
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: ./deploy.sh

# Run only when something fails
- name: Alert on failure
  if: failure()
  env:
    SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
  run: |
    curl -X POST -H 'Content-type: application/json' -d '{"text":"CI failed!"}' "$SLACK_WEBHOOK"

# Always run regardless of outcome
- name: Cleanup
  if: always()
  run: docker compose down

# Upload test results even if tests fail
- name: Upload test report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: reports/

The if: always() on artifact upload matters in practice. If tests fail, you want the report. Without if: always(), the upload step is skipped because a previous step failed, and you have no output to debug from.

Caching Beyond pip

setup-python caches pip for you once cache: 'pip' is set. Everything else needs a manual cache step.

- name: Cache dbt packages
  uses: actions/cache@v4
  with:
    path: dbt/dbt_packages
    key: dbt-packages-${{ hashFiles('dbt/packages.yml') }}
    restore-keys: dbt-packages-

- name: Cache Evidence.dev node_modules
  uses: actions/cache@v4
  with:
    path: evidence-app/node_modules
    key: npm-${{ hashFiles('evidence-app/package-lock.json') }}

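The key is a hash of the dependency file (packages.yml or package-lock.json). When that file changes, the key changes, the old cache no longer matches exactly, and the packages are reinstalled. When nothing changes, the cache restores and the install step has almost nothing left to do. restore-keys, set on the dbt cache above, gives a prefix-match fallback: if no exact key matches, the most recent cache whose key starts with dbt-packages- is restored, so a minor version bump still starts from a warm cache.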
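The lookup logic can be sketched in a few lines of Python. This is an illustration of the idea only: the real hashFiles function combines per-file SHA-256 hashes, so the keys below will not match what GitHub computes, and the helper names cache_key and lookup are hypothetical.

```python
import hashlib
from typing import Optional


def cache_key(prefix: str, dependency_file: bytes) -> str:
    """Build a key that changes whenever the dependency file's contents change."""
    return f"{prefix}{hashlib.sha256(dependency_file).hexdigest()}"


def lookup(caches: list[str], key: str, restore_prefixes: list[str]) -> Optional[str]:
    """caches is ordered oldest to newest. Return the cache to restore, or None on a miss."""
    if key in caches:
        return key  # exact hit: dependencies are unchanged
    for prefix in restore_prefixes:
        matches = [c for c in caches if c.startswith(prefix)]
        if matches:
            return matches[-1]  # fall back to the most recent cache with this prefix
    return None
```

Without restore-keys, any change to the dependency file is a full miss. With a prefix fallback, the job restores the previous cache and only fetches what changed.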

The Gotchas That Bite Data Engineers Specifically

Postgres health checks are not optional. Already covered above, but worth repeating: without --health-cmd pg_isready in options:, your tests will fail non-deterministically on the first run after cold start. Always add the health check.

Schedule cron is UTC. schedule: - cron: '0 6 * * *' runs at 6 AM UTC, which is 9 AM EAT (Nairobi/Kampala/Dar es Salaam). If you are scheduling a data pull for East African business hours, account for the offset. Use crontab.guru to verify before you push.
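A quick way to double-check the offset is to compute it rather than count on your fingers. A minimal sketch, assuming East Africa Time as a fixed UTC+3 offset (EAT has no daylight saving). The function name eat_hour_to_cron_utc is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# East Africa Time is UTC+3 year-round, so a fixed offset is sufficient here.
EAT = timezone(timedelta(hours=3), "EAT")


def eat_hour_to_cron_utc(hour_eat: int) -> int:
    """Return the UTC hour to put in a cron expression for a given EAT wall-clock hour."""
    local = datetime(2024, 1, 1, hour_eat, tzinfo=EAT)
    return local.astimezone(timezone.utc).hour
```

For a 9 AM EAT run this returns 6, which matches the '0 6 * * *' expression above. Note that early-morning EAT hours land on the previous UTC day, which matters for day-of-week schedules.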

Secrets are not available in fork PRs. When someone forks your public repo and opens a pull request, the pull_request event does not have access to your secrets. Steps that need ${{ secrets.* }} will silently get empty strings. This is a security feature, not a bug. Design your CI so that secret-dependent steps only run on push events, not on pull_request from forks.

Do not write secrets to .env files. Some setups do echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env and then read from .env. That writes the secret to the runner filesystem. Even though the runner is ephemeral, this is unnecessary exposure. Pass secrets as env: directly on the step that needs them.
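Because a missing secret arrives as an empty string rather than an error, it helps to check for it explicitly at the top of the script that needs it. A small sketch of that check, with a hypothetical helper name missing_secrets. It reports names only and never prints values.

```python
import os


def missing_secrets(names, env=None):
    """Return the names whose value is unset, empty, or whitespace; never the values themselves."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]
```

A pipeline entry point might call missing_secrets(["DB_PASSWORD"]) and exit with a clear message when the list is not empty, instead of failing later with an authentication error that is harder to trace.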

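Step outputs use $GITHUB_OUTPUT, not set-output. The old ::set-output name=<name>::<value> workflow command was deprecated in October 2022. GitHub planned to disable it in 2023 but postponed the removal, so it may still run with a deprecation warning; do not rely on that. If you are copying workflow snippets from older articles, update to the current pattern: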

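# git describe needs tags, so check out with fetch-depth: 0 (the default shallow clone usually has none)
- name: Get version
  id: version
  run: echo "tag=$(git describe --tags)" >> "$GITHUB_OUTPUT"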

- name: Use version
  run: echo "Version is ${{ steps.version.outputs.tag }}"
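The same mechanism works from a Python step, since $GITHUB_OUTPUT is just a path to a file that accepts name=value lines. A minimal sketch, assuming single-line values only. The helper name set_output is my own and is not part of any GitHub library.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions exposes through $GITHUB_OUTPUT."""
    if "\n" in value:
        # Multiline values need the heredoc-style delimiter syntax, which this sketch does not cover.
        raise ValueError("multiline values are not supported by this helper")
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

A later step reads the value the same way as before, through steps.<id>.outputs.<name>, as long as the Python step has an id.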

Security Scanning (Required, Not Optional)

Every repo I push includes a security workflow. For Python, pip-audit checks your dependencies against published vulnerability advisories and covers most dependency vulnerabilities:

# .github/workflows/security.yml
name: Security Audit

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * 1'  # Weekly on Monday at 8 AM UTC

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install pip-audit
      - run: pip-audit -r requirements.txt

Do not use safety for this. It requires a paid API key as of 2024. pip-audit is completely free: by default it queries the PyPI advisory database, and you can pass -s osv to query OSV instead.

For Node.js projects (Evidence.dev, React frontends):

- name: npm audit
  run: npm audit --audit-level=moderate
  working-directory: ./evidence-app

Day-to-Day Management with gh CLI

Once workflows are running, the gh CLI is faster than the GitHub web UI for most operations:

# See what workflows exist
gh workflow list

# Trigger a workflow manually (requires workflow_dispatch trigger)
gh workflow run ci.yml
gh workflow run ci.yml --ref develop

# Watch a run in real time
gh run watch

# See recent runs
gh run list --workflow ci.yml

# View logs from a specific run
gh run view <run-id> --log

# Re-run only failed jobs
gh run rerun <run-id> --failed-only

# Download artifacts without opening the browser
gh run download <run-id> -n test-results

gh run watch is the one I use most during active development. It streams the current run to your terminal so you do not have to refresh the GitHub UI to see progress.

A Practical Starting Point

For a new data engineering project, I add three workflow files from day one:

ci.yml for pytest with Postgres service, triggered on push and pull request
security.yml for pip-audit, triggered on push and weekly on a schedule
docker.yml for building and pushing the image, triggered on push to main

That covers: tests passing, no known CVEs in dependencies, and a fresh image on every merge. Everything else (dbt CI, DAG validation, matrix builds) gets added when the project grows to need it.

The full patterns for all three are in this article. The only setup required in GitHub before any of this works is: go to your repo, open Settings, then Secrets and variables, then Actions, and add whatever credentials your tests need. For a PostgreSQL-backed pipeline, that is typically TEST_DATABASE_URL. For dbt, it is the three Postgres credentials. For Docker, it is DOCKERHUB_USERNAME and DOCKERHUB_TOKEN.

Automate the verification work. The pipeline should tell you what broke before a reviewer has to.

All the workflows in this article are patterns I use across my open data engineering projects. If you want to see them in context, the repos are linked on my GitHub profile.

Follow me on dev.to for more articles on Airflow, dbt, and building production-grade data pipelines.
