Every data engineering repo I push has a GitHub Actions workflow in it. Not because it is required, but because I got burned enough times pushing code that "worked on my machine" and broke in production. After setting up CI across 30+ pipeline projects, the same patterns come up every time and so do the same gotchas. This article covers the setup that actually works for Python pipelines, dbt, Docker, and Airflow DAG validation.
What GitHub Actions Is (In One Paragraph)
GitHub Actions is GitHub's built-in CI/CD system. You write a YAML file in .github/workflows/, push it, and GitHub runs it on a virtual machine whenever the trigger fires. The machine is ephemeral: it spins up, runs your steps, and disappears. On the Free plan you get 2,000 free minutes per month for private repos, and unlimited minutes on public ones.
Four terms matter:
- Workflow is the .yml file. You can have multiple.
- Job is a group of steps that share a runner. Jobs run in parallel by default.
- Step is a single command or a reusable action from the Marketplace.
- Runner is the VM. ubuntu-latest is free and covers most use cases.
That is the mental model. Everything else builds on it.
The Base Workflow for a Python Data Pipeline
This is the starting point for every project. Tests run on push to main and on every pull request.
# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  PYTHON_VERSION: '3.11'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/ -v --tb=short
The cache: 'pip' on setup-python is not optional for data engineering projects. Your requirements.txt likely pulls in pandas, SQLAlchemy, Airflow providers, dbt, or all of the above. Without caching, you pay that install time on every single run. With caching, it restores from a cache keyed to requirements.txt and you skip most of the download.
Adding PostgreSQL as a Service
Most pipeline tests need a real database. GitHub Actions supports service containers, which spin up alongside your job and are accessible on localhost.
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install -r requirements.txt
      - name: Run tests
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/testdb
        run: pytest tests/ -v --tb=short --junitxml=reports/junit.xml
The options: block with --health-cmd is not optional. Without it, your steps start running the moment the container starts, not when Postgres is ready to accept connections. The first time I ran a test suite without health checks, half the tests failed with connection refused errors that disappeared when I re-ran the same workflow 30 seconds later. GitHub does not wait for services automatically. You have to tell it what "ready" means.
Pass DATABASE_URL at the step level, not the workflow level. Secrets and sensitive env vars belong as close to the step that uses them as possible.
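If you run the same test suite outside CI (for example against a local container with no health check), a small guard in the test setup can give a clearer error than a wall of connection failures. The sketch below is an assumption-laden illustration, not part of the workflow above: db_host_port and wait_for_port are hypothetical helper names, and a successful TCP connection only shows the port is open, not that Postgres is ready for queries, so it complements the --health-cmd check rather than replacing it.

```python
import socket
import time
from urllib.parse import urlsplit


def db_host_port(url: str) -> tuple[str, int]:
    """Extract (host, port) from a postgresql:// URL, defaulting to port 5432."""
    parts = urlsplit(url)
    return parts.hostname or "localhost", parts.port or 5432


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection succeeds, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Port not accepting connections yet: back off briefly and retry.
            time.sleep(0.5)
    return False
```

A session-scoped pytest fixture could call wait_for_port(*db_host_port(os.environ["DATABASE_URL"])) and fail fast with a readable message when it returns False.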
dbt CI: The Profiles Problem
dbt reads connection credentials from ~/.dbt/profiles.yml. That file is never committed to the repo. In CI, the home directory is empty, so dbt fails immediately on dbt debug. The fix is to generate the file in a step.
# .github/workflows/dbt.yml
name: dbt CI
on:
  pull_request:
    branches: [main]
    paths:
      - 'dbt/**'
jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: ${{ secrets.DBT_POSTGRES_USER }}
          POSTGRES_PASSWORD: ${{ secrets.DBT_POSTGRES_PASSWORD }}
          POSTGRES_DB: ${{ secrets.DBT_POSTGRES_DB }}
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dbt
        run: pip install dbt-core dbt-postgres
      - name: Write profiles.yml
        run: |
          mkdir -p ~/.dbt
          # Quoted delimiter so the shell does not expand a $ inside a secret value
          cat > ~/.dbt/profiles.yml << 'EOF'
          my_project:
            target: ci
            outputs:
              ci:
                type: postgres
                host: localhost
                port: 5432
                user: ${{ secrets.DBT_POSTGRES_USER }}
                password: ${{ secrets.DBT_POSTGRES_PASSWORD }}
                dbname: ${{ secrets.DBT_POSTGRES_DB }}
                schema: public
                threads: 4
          EOF
      - name: dbt debug
        working-directory: ./dbt
        run: dbt debug
      - name: dbt deps
        working-directory: ./dbt
        run: dbt deps
      - name: dbt build
        working-directory: ./dbt
        run: dbt build --target ci
Two things to note here. First, the paths: filter on the trigger. If your repo has both Python code and dbt models, you do not want every Python change to trigger a full dbt build. Scoping the trigger to dbt/** means this workflow only runs when dbt files change.
Second, working-directory: ./dbt on each dbt step. dbt looks for dbt_project.yml in the current directory. If your dbt project lives in a subdirectory, every step needs working-directory set or dbt will not find the project.
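The heredoc is the approach used here. As an alternative sketch, the same profiles.yml could be generated by a small Python script that reads the credentials from environment variables set on the step. Everything below is an assumption for illustration: render_profile and write_profile are hypothetical names, and the environment variable names simply mirror the secret names above.

```python
import json
import os
from pathlib import Path

# Assumed variable names, mirroring the secrets used in the workflow.
REQUIRED = ("DBT_POSTGRES_USER", "DBT_POSTGRES_PASSWORD", "DBT_POSTGRES_DB")


def render_profile(env) -> str:
    """Build the profiles.yml text for the 'ci' target from a mapping of env vars."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    # json.dumps yields a double-quoted string, which is also a valid YAML scalar,
    # so characters such as ':' or '#' in a password cannot break the file.
    user, password, dbname = (json.dumps(env[name]) for name in REQUIRED)
    return (
        "my_project:\n"
        "  target: ci\n"
        "  outputs:\n"
        "    ci:\n"
        "      type: postgres\n"
        "      host: localhost\n"
        "      port: 5432\n"
        f"      user: {user}\n"
        f"      password: {password}\n"
        f"      dbname: {dbname}\n"
        "      schema: public\n"
        "      threads: 4\n"
    )


def write_profile(env=None, home=None) -> Path:
    """Write <home>/.dbt/profiles.yml and return its path."""
    env = os.environ if env is None else env
    target = Path(home or Path.home()) / ".dbt" / "profiles.yml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_profile(env), encoding="utf-8")
    return target
```

Failing loudly on a missing variable is deliberate: an empty credential otherwise surfaces later as a confusing dbt debug error.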
Airflow DAG Validation Without Running Airflow
Running a full Airflow stack in CI is too heavy. What you actually want is to catch import errors, which is how most broken DAG files show up at runtime. You can do that by importing each DAG file with Python directly.
- name: Validate DAG imports
  run: |
    pip install apache-airflow
    python -c "
    # importlib.util must be imported explicitly; 'import importlib' alone is not guaranteed to load it
    import importlib.util, pathlib
    dag_files = list(pathlib.Path('dags').glob('*.py'))
    for f in dag_files:
        spec = importlib.util.spec_from_file_location(f.stem, f)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        print(f'OK: {f.name}')
    "
This catches syntax errors and import errors in your DAGs without spinning up a scheduler, webserver, or database. It will not catch runtime task failures, but it will catch the most common CI issue which is pushing a DAG with a broken import that takes down the whole DAG parser.
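The inline version stops at the first broken file. A standalone script can report every failing DAG in one run, which tends to be more useful on a pull request. This is a minimal sketch of that idea, assuming DAG files sit directly in one directory; check_dag_imports and main are hypothetical names, not Airflow APIs.

```python
import importlib.util
from pathlib import Path


def check_dag_imports(dag_dir: str) -> dict[str, str]:
    """Import every .py file in dag_dir and return {filename: error} for failures."""
    failures = {}
    for path in sorted(Path(dag_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # includes SyntaxError and ImportError
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures


def main(dag_dir: str = "dags") -> int:
    """Print each failure and return a process exit code (0 = all DAGs import)."""
    failures = check_dag_imports(dag_dir)
    for name, error in failures.items():
        print(f"FAIL: {name}: {error}")
    return 1 if failures else 0
```

In a workflow step you would call sys.exit(main()) under a main guard so that any failure marks the step as failed.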
Docker Build with Layer Caching
Data engineering Docker images are large. The base Airflow image alone is several hundred megabytes. Without caching, a docker build on every push takes 5+ minutes. With GitHub's built-in Docker layer cache, unchanged layers are restored and skipped.
# .github/workflows/docker.yml
name: Build and Push Docker Image
on:
  push:
    branches: [main]
    tags: ['v*']
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Buildx is needed for the type=gha cache backend used below
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: myusername/my-pipeline
          tags: |
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=sha,prefix=sha-
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
Use a Docker Hub access token in DOCKERHUB_TOKEN, not your account password. Access tokens can be scoped and revoked. Your account password cannot.
The metadata-action handles tagging automatically. Push to main and you get a main tag. Push a v1.2.3 tag and you get a 1.2.3 tag; to also publish 1.2 and 1, add type=semver entries with the {{major}}.{{minor}} and {{major}} patterns. The sha- prefixed tag means you always know exactly which commit is running in any environment.
Multi-Job Pipelines with Dependencies
When you have test, build, and deploy as separate concerns, you want them chained. A failed test should stop the build. A failed build should stop the deploy.
name: Full Pipeline
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11', cache: 'pip'}
      - run: pip install -r requirements.txt && pytest tests/ -v
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: myuser/my-pipeline:latest
  notify:
    needs: [test, build]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - run: echo "Test=${{ needs.test.result }}, Build=${{ needs.build.result }}"
needs: test means the build job will not start unless test passes. needs: [test, build] means notify waits for both. The if: always() on notify means it runs regardless of whether the earlier jobs succeeded or failed. This is where you would put a Slack alert on failure.
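To make the skip behaviour concrete, here is a toy model of how needs and if: always() interact. It is only a sketch of the documented rule (a job is skipped when any job it needs did not succeed, unless it opts in with always()), not GitHub's scheduler; plan_results and the dictionary layout are made up for the illustration.

```python
def plan_results(jobs: dict[str, dict], outcomes: dict[str, str]) -> dict[str, str]:
    """Resolve each job to 'success', 'failure' or 'skipped'.

    jobs maps a job name to {"needs": [...], "always": bool}.
    outcomes holds what a job would return if it actually ran (default: success).
    """
    results: dict[str, str] = {}
    remaining = dict(jobs)
    while remaining:
        progressed = False
        for name, spec in list(remaining.items()):
            needs = spec.get("needs", [])
            if any(dep not in results for dep in needs):
                continue  # an upstream job has not been resolved yet
            upstream_ok = all(results[dep] == "success" for dep in needs)
            if upstream_ok or spec.get("always", False):
                results[name] = outcomes.get(name, "success")
            else:
                results[name] = "skipped"
            del remaining[name]
            progressed = True
        if not progressed:
            raise ValueError("cycle or unknown job name in needs")
    return results


pipeline = {
    "test": {},
    "build": {"needs": ["test"]},
    "notify": {"needs": ["test", "build"], "always": True},
}
print(plan_results(pipeline, {"test": "failure"}))
```

With a failing test job, build resolves to skipped while notify still runs, which is why the alerting step belongs in the always() job.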
Conditional Steps
Four conditions cover most real scenarios:
# Run only on pushes to main, not on PRs
- name: Deploy to production
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: ./deploy.sh
# Run only when something fails
- name: Alert on failure
  if: failure()
  run: curl -X POST ${{ secrets.SLACK_WEBHOOK }} -d '{"text":"CI failed!"}'
# Always run regardless of outcome
- name: Cleanup
  if: always()
  run: docker compose down
# Upload test results even if tests fail
- name: Upload test report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: reports/
The if: always() on artifact upload matters in practice. If tests fail, you want the report. Without if: always(), the upload step is skipped because a previous step failed, and you have no output to debug from.
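Once the report is downloaded, a few lines of Python are enough to get the headline numbers without opening the XML. The sketch below assumes the JUnit file produced by pytest's --junitxml option, where each testsuite element carries tests, failures, errors and skipped attributes; summarize_junit is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict[str, int]:
    """Sum the tests/failures/errors/skipped counters across all test suites."""
    root = ET.fromstring(xml_text)
    # Reports may wrap suites in <testsuites> or consist of a single bare <testsuite>.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals
```

For a file on disk, pass pathlib.Path("reports/junit.xml").read_text() as the argument.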
Caching Beyond pip
setup-python caches pip once you set cache: 'pip'. Everything else needs a manual cache step.
- name: Cache dbt packages
  uses: actions/cache@v4
  with:
    path: dbt/dbt_packages
    key: dbt-packages-${{ hashFiles('dbt/packages.yml') }}
    restore-keys: dbt-packages-
- name: Cache Evidence.dev node_modules
  uses: actions/cache@v4
  with:
    path: evidence-app/node_modules
    key: npm-${{ hashFiles('evidence-app/package-lock.json') }}
The key is a hash of the lockfile. When packages.yml or package-lock.json changes, the cache busts and reinstalls. When nothing changes, the cache restores and the install step is a no-op. restore-keys gives a partial match fallback so you still get a cache hit even after a minor version bump.
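The idea behind the key and the fallback can be shown in a few lines. This is a simplified model only: it hashes the file with SHA-256 to mimic a content-derived key and is not the exact algorithm hashFiles uses, and cache_key and restore are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_key(prefix: str, lockfile: Path) -> str:
    """Derive a key that changes whenever the lockfile content changes."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    return f"{prefix}-{digest}"


def restore(cache: dict, key: str, restore_prefixes=()) -> Optional[str]:
    """Return the cache entry that would be restored, or None on a miss.

    An exact key match wins; otherwise the most recently stored entry whose
    name starts with one of the restore prefixes is used as a partial match.
    """
    if key in cache:
        return key
    for prefix in restore_prefixes:
        for existing in reversed(list(cache)):  # dicts keep insertion order
            if existing.startswith(prefix):
                return existing
    return None
```

Without a restore prefix, any change to the lockfile is a full miss; with one, the previous cache is restored and only the changed packages need to be fetched.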
The Gotchas That Bite Data Engineers Specifically
Postgres health checks are not optional. Already covered above, but worth repeating: without --health-cmd pg_isready in options:, your tests will fail non-deterministically on the first run after cold start. Always add the health check.
Schedule cron is UTC. schedule: - cron: '0 6 * * *' runs at 6 AM UTC, which is 9 AM EAT (Nairobi/Kampala/Dar es Salaam). If you are scheduling a data pull for East African business hours, account for the offset. Use crontab.guru to verify before you push.
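A quick way to sanity-check the offset is to do the conversion in code. The sketch below uses a fixed UTC+3 offset for EAT, which is safe because East Africa Time has no daylight saving; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# East Africa Time is a fixed UTC+3 offset with no daylight saving.
EAT = timezone(timedelta(hours=3), name="EAT")


def local_time_of_utc_run(utc_hour: int, utc_minute: int = 0, tz: timezone = EAT) -> str:
    """Show the local wall-clock time at which a UTC cron schedule fires."""
    run = datetime(2024, 1, 1, utc_hour, utc_minute, tzinfo=timezone.utc)
    return run.astimezone(tz).strftime("%H:%M %Z")


def utc_cron_hour_for(local_hour: int, offset_hours: int = 3) -> int:
    """Hour to put in the cron expression so the job fires at local_hour."""
    return (local_hour - offset_hours) % 24


print(local_time_of_utc_run(6))  # cron '0 6 * * *' -> 09:00 EAT
print(utc_cron_hour_for(8))      # 08:00 EAT needs cron hour 5
```

Note that a local hour earlier than the offset wraps to the previous UTC day, so a 1 AM EAT job needs cron hour 22.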
Secrets are not available in fork PRs. When someone forks your public repo and opens a pull request, the pull_request event does not have access to your secrets. Steps that need ${{ secrets.* }} will silently get empty strings. This is a security feature, not a bug. Design your CI so that secret-dependent steps only run on push events, not on pull_request from forks.
Do not write secrets to .env files. Some setups do echo "DB_PASSWORD=${{ secrets.DB_PASSWORD }}" >> .env and then read from .env. That writes the secret to the runner filesystem. Even though the runner is ephemeral, this is unnecessary exposure. Pass secrets as env: directly on the step that needs them.
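On the Python side, reading the value from the environment and failing loudly when it is empty covers both of the last two gotchas: nothing is written to disk, and a fork PR with blank secrets produces an explicit error instead of a confusing authentication failure. A minimal sketch, with require_env and MissingSecretError as hypothetical names:

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required environment variable is unset or blank."""


def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable, refusing empty strings."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value.strip():
        raise MissingSecretError(
            f"{name} is unset or empty (secrets resolve to empty strings on fork pull requests)"
        )
    return value
```

Calling require_env("DB_PASSWORD") at the top of a test session gives one clear failure instead of dozens of unrelated ones.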
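Step outputs use $GITHUB_OUTPUT, not set-output. The old ::set-output name=...:: workflow command was deprecated in October 2022. GitHub originally planned to disable it in 2023 but postponed the removal, so it may still run with a deprecation warning; do not rely on that. If you are copying workflow snippets from older articles, update to the current pattern: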
- name: Get version
  id: version
  # git describe needs tags in the clone; the default shallow checkout may not fetch them
  run: echo "tag=$(git describe --tags)" >> "$GITHUB_OUTPUT"
- name: Use version
  run: echo "Version is ${{ steps.version.outputs.tag }}"
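The same mechanism works from a Python step, since GITHUB_OUTPUT is just a path to a file that the runner reads after the step finishes. A minimal sketch for single-line values follows; set_output is a hypothetical helper, and multi-line values need the delimiter form, which is not handled here.

```python
import os
from typing import Optional


def set_output(name: str, value: str, path: Optional[str] = None) -> None:
    """Append a name=value pair to the file the runner exposes as GITHUB_OUTPUT."""
    if "\n" in value:
        raise ValueError("multi-line values need the delimiter syntax, not name=value")
    target = path or os.environ["GITHUB_OUTPUT"]
    # Append rather than overwrite: a step can set several outputs.
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")
```

A later step reads the value the usual way, through the steps.<id>.outputs.<name> expression.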
Security Scanning (Required, Not Optional)
Every repo I push includes a security workflow. For Python, pip-audit checks your dependencies against published vulnerability advisories, which covers most dependency vulnerabilities:
# .github/workflows/security.yml
name: Security Audit
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * 1'  # Weekly on Monday at 8 AM UTC
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install pip-audit
      - run: pip-audit -r requirements.txt
Do not use safety for this. It requires a paid API key as of 2024. pip-audit queries the public PyPI advisory database by default, can query OSV instead with -s osv, and is completely free.
For Node.js projects (Evidence.dev, React frontends):
- name: npm audit
  run: npm audit --audit-level=moderate
  working-directory: ./evidence-app
Day-to-Day Management with gh CLI
Once workflows are running, the gh CLI is faster than the GitHub web UI for most operations:
# See what workflows exist
gh workflow list
# Trigger a workflow manually (requires workflow_dispatch trigger)
gh workflow run ci.yml
gh workflow run ci.yml --ref develop
# Watch a run in real time
gh run watch
# See recent runs
gh run list --workflow ci.yml
# View logs from a specific run
gh run view <run-id> --log
# Re-run only failed jobs
gh run rerun <run-id> --failed
# Download artifacts without opening the browser
gh run download <run-id> -n test-results
gh run watch is the one I use most during active development. It streams the current run to your terminal so you do not have to refresh the GitHub UI to see progress.
A Practical Starting Point
For a new data engineering project, I add three workflow files from day one:
- ci.yml for pytest with Postgres service, triggered on push and pull request
- security.yml for pip-audit, triggered on push and weekly on a schedule
- docker.yml for building and pushing the image, triggered on push to main
That covers: tests passing, no known CVEs in dependencies, and a fresh image on every merge. Everything else (dbt CI, DAG validation, matrix builds) gets added when the project grows to need it.
The full patterns for all three are in this article. The only setup required in GitHub before any of this works is: go to your repo, open Settings, then Secrets and variables, then Actions, and add whatever credentials your tests need. For a PostgreSQL-backed pipeline, that is typically TEST_DATABASE_URL. For dbt, it is the three Postgres credentials. For Docker, it is DOCKERHUB_USERNAME and DOCKERHUB_TOKEN.
Automate the verification work. The pipeline should tell you what broke before a reviewer has to.
All the workflows in this article are patterns I use across my open data engineering projects. If you want to see them in context, the repos are linked on my GitHub profile.
Follow me on dev.to for more articles on Airflow, dbt, and building production-grade data pipelines.