Versioning Apache Airflow DAGs is an important topic. There are many ways to do it, and plenty of tutorials cover how to do it with a single Git repository. In this post I will walk you through how to use git-sync, Git submodules and GitHub Workflows to sync Airflow DAGs from multiple GitHub repositories.
Setup - GitHub
In this post, I will use 3 GitHub Repositories. I will refer to them as:
- Main repository: Repository where the Git submodules will be added.
- DAGs repository 1: Repository with Airflow DAGs.
- DAGs repository 2: Repository with Airflow DAGs.
Setup - Main Repository
- Create the GitHub Main Repository
- Create the GitHub DAGs repository 1
- Create the GitHub DAGs repository 2
- Clone the main repository to your local machine
- Execute the following commands to add the first submodule to the Main repository:
git submodule add git@github.com:<your-user>/<your-dag-repo-1>.git
git submodule update --init --remote <your-dag-repo-1>
git submodule update --remote <your-dag-repo-1>
git add .
git commit -m 'Adding <dag-repo-1> submodule'
git push
- Execute the following commands to add the second submodule to the Main repository:
git submodule add git@github.com:<your-user>/<your-dag-repo-2>.git
git submodule update --init --remote <your-dag-repo-2>
git submodule update --remote <your-dag-repo-2>
git add .
git commit -m 'Adding <dag-repo-2> submodule'
git push
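To see what these commands leave behind without touching GitHub, the layout can be reproduced locally. The following is only a sketch under a few assumptions: the directory names `dag-repo-1`, `dag-repo-2` and `main-repo` are placeholders, local paths stand in for the `git@github.com:...` URLs, and the `protocol.file.allow` override is needed only because of those local paths.

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway directory so nothing touches your real repositories.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"

# Hypothetical identity for the demo commits.
export GIT_AUTHOR_NAME="demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Stand-ins for the two DAGs repositories.
for repo in dag-repo-1 dag-repo-2; do
  git init -q -b main "$repo"
  echo "# placeholder DAG" > "$repo/example_dag.py"
  git -C "$repo" add .
  git -C "$repo" commit -q -m "Initial DAG"
done

# Stand-in for the Main repository.
git init -q -b main main-repo
cd main-repo
for repo in dag-repo-1 dag-repo-2; do
  # The -c override is required only because the demo clones from local paths.
  git -c protocol.file.allow=always submodule add "$WORKDIR/$repo" "$repo"
done
git commit -q -m "Adding DAG submodules"

# Each submodule is listed with the commit the Main repository pins it to.
SUBMODULE_COUNT="$(git submodule status | wc -l | tr -d ' ')"
echo "submodules registered: $SUBMODULE_COUNT"
cat .gitmodules
```

The `.gitmodules` file printed at the end is what maps each submodule path to its URL; in the real setup it should contain one entry per DAGs repository.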
Setup - Personal Access Token
You will need to create a GitHub Personal Access Token to access the Main Repository using GitHub Workflows.
To create a PAT:
- Go to Profile -> Settings
- On the left side bar -> Developer settings
- Personal access tokens -> Fine-grained tokens
- Generate new token
- Fill in the token details
- On Repository access, select "Only select repositories" and select your Main Repository
- On Permissions -> Repository permissions, select "Read and write" for "Commit statuses" and "Contents". All other permissions can be left at their defaults
- Generate the token and store the token somewhere safe. It will be used shortly.
Setup - DAGs repositories secrets
We will store the previously created PAT as a Secret in each of the two DAGs repositories.
You must do the following on both repositories:
- Go to the repository settings
- On the left side bar, go to Secrets and variables -> Actions
- Create a Secret with the PAT as its value
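If you prefer the command line, the same Secret can be created with the GitHub CLI. This is a minimal sketch, assuming `gh` is installed and authenticated; the function name `set_pat_secret`, the environment variable `MAIN_REPO_PAT` and the secret name `MAIN_REPO_TOKEN` are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Store one PAT as an Actions secret in several repositories.
# Usage: set_pat_secret <secret-name> <owner/repo> [<owner/repo> ...]
# The token is read from the MAIN_REPO_PAT environment variable (hypothetical name).
set_pat_secret() {
  local secret_name="$1"
  shift
  local repo
  for repo in "$@"; do
    # `gh secret set` creates or overwrites a repository-level Actions secret.
    gh secret set "$secret_name" --repo "$repo" --body "${MAIN_REPO_PAT:?MAIN_REPO_PAT is not set}"
    echo "secret $secret_name set on $repo"
  done
}

# Example call (placeholders as in the rest of this post):
#   export MAIN_REPO_PAT='<paste-your-token>'
#   set_pat_secret MAIN_REPO_TOKEN <your-user>/<your-dag-repo-1> <your-user>/<your-dag-repo-2>
```

Whatever name you pick for the Secret is the name the workflow has to reference in `secrets.<your-token-secret-name>`.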
Setup - DAGs repositories workflows
We will create GitHub Workflows to sync the DAGs repositories with the Main Repository.
On both DAGs repositories:
On the root folder of the repo, create the file: .github/workflows/github_ci_sync_main.yaml
With the content:
name: Sync with Main Repo
on:
  push:
    branches:
      - main
env:
  REPOSITORY_NAME: ${{ github.event.repository.name }}
jobs:
  repository-sync:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout main repo
        uses: actions/checkout@v3
        with:
          repository: <your-user>/<your-main-repository>
          ref: main
          token: ${{ secrets.<your-token-secret-name> }}
          submodules: true
      - name: Pull & update submodules recursively
        run: |
          git submodule sync $REPOSITORY_NAME
          git submodule update --init --remote $REPOSITORY_NAME
          git submodule update --remote $REPOSITORY_NAME
      - name: Commit to pipeline hub
        run: |
          git config user.email "actions@github.com"
          git config user.name "GitHub Actions"
          git add --all
          git commit -m "Update submodule $REPOSITORY_NAME" || echo "No changes to commit"
          git push
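Explaining the steps:
Checkout main repo: This step uses the token to check out the Main Repository, including its submodules.
Pull & update submodules recursively: Once the job is inside the Main Repository checkout, it runs the git commands that move the submodule named after the DAGs repository to that repository's latest commit.
Commit to pipeline hub: Commits the updated submodule reference and pushes it to the Main Repository. If nothing changed, the commit is skipped.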
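The effect of the workflow can be rehearsed on a local machine. The sketch below is an illustration, not the workflow itself: local directories replace the GitHub repositories, the helper `g` and the names `dag-repo-1` and `main-repo` are made up for the demo, and `-b main` is passed to `git submodule add` only so the demo does not depend on git's default branch resolution.

```shell
#!/usr/bin/env bash
set -eu

WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
export GIT_AUTHOR_NAME="demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Hypothetical helper: allows the local file transport, needed only for this demo.
g() { git -c protocol.file.allow=always "$@"; }

# In the workflow this value comes from github.event.repository.name.
REPOSITORY_NAME="dag-repo-1"

# Stand-in for the DAGs repository.
git init -q -b main "$REPOSITORY_NAME"
echo "v1" > "$REPOSITORY_NAME/example_dag.py"
git -C "$REPOSITORY_NAME" add .
git -C "$REPOSITORY_NAME" commit -q -m "Initial DAG"

# Stand-in for the Main repository, with the DAGs repository as a submodule.
git init -q -b main main-repo
g -C main-repo submodule add -b main "$WORKDIR/$REPOSITORY_NAME" "$REPOSITORY_NAME"
git -C main-repo commit -q -m "Adding submodule"

# A new commit in the DAGs repository: the event that triggers the workflow.
echo "v2" > "$REPOSITORY_NAME/example_dag.py"
git -C "$REPOSITORY_NAME" commit -q -am "Update DAG"

# What the workflow then runs inside the Main repository checkout.
cd main-repo
g submodule sync "$REPOSITORY_NAME"
g submodule update --init --remote "$REPOSITORY_NAME"
git add --all
git commit -q -m "Update submodule $REPOSITORY_NAME" || echo "No changes to commit"

# The commit pinned by the Main repository should now be the DAGs repository's latest.
PINNED="$(git ls-tree HEAD "$REPOSITORY_NAME" | awk '{print $3}')"
LATEST="$(git -C "$WORKDIR/$REPOSITORY_NAME" rev-parse HEAD)"
echo "pinned: $PINNED"
echo "latest: $LATEST"
```

The Main repository never stores the DAG files themselves, only a pointer to a commit in each DAGs repository. That is why every push to a DAGs repository needs a matching commit in the Main repository.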
Setup - ssh-key
To use git-sync, we will need to set up an SSH key with permission to access all 3 repositories.
We won't be able to use GitHub Deploy Keys because they're repository specific.
You will need to use an SSH key linked to a GitHub profile with read and write permissions to the 3 repositories. You can create this SSH key following this tutorial.
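For reference, generating such a key usually comes down to a single `ssh-keygen` call. This is a rough sketch, assuming an Ed25519 key without a passphrase (git-sync runs unattended); the file name `airflow_git_sync` and the comment label are placeholders.

```shell
#!/usr/bin/env bash
set -eu

KEY_DIR="$(mktemp -d)"
KEY_FILE="$KEY_DIR/airflow_git_sync"   # hypothetical file name

# -t ed25519: key type, -C: comment label, -N "": empty passphrase, -f: output file
ssh-keygen -q -t ed25519 -C "airflow-git-sync" -N "" -f "$KEY_FILE"

echo "private key (used later for the Kubernetes Secret): $KEY_FILE"
echo "public key (add this one to the GitHub profile):"
cat "$KEY_FILE.pub"

# After adding the public key on GitHub, the connection can be checked with:
#   ssh -i "$KEY_FILE" -T git@github.com
```

In a real setup you would write the key to a permanent, access-restricted location instead of a temporary directory.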
Setup - git-sync
I will use the git-sync parameters available on the Apache Airflow Official Helm Chart.
These are the relevant values:
dags:
  gitSync:
    enabled: true
    repo: git@github.com:your-user/your-main-repo.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: ""
    sshKeySecret: airflow-ssh-secret
    knownHosts: |
      github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
      github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
      github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
- dags.gitSync.sshKeySecret: The name of the Kubernetes Secret holding the private SSH key. You can create this Secret with the following command:
kubectl create secret generic airflow-ssh-secret -n <your-airflow-namespace> --from-file=gitSshKey=path/to-your/privatekey
- dags.gitSync.knownHosts: These are GitHub's public SSH host keys, in known_hosts format
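Before trusting host keys copied from a blog post, it is worth comparing them with what GitHub publishes. One way, sketched below, is to compute the fingerprint of an entry with `ssh-keygen` and compare it with the SSH key fingerprints listed in GitHub's documentation. Only the Ed25519 entry is checked here; the other two can be checked the same way.

```shell
#!/usr/bin/env bash
set -eu

# Write the host key entry to a temporary known_hosts-style file.
HOSTS_FILE="$(mktemp)"
cat > "$HOSTS_FILE" <<'EOF'
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
EOF

# -l prints the fingerprint of each key found in the file given with -f.
FINGERPRINTS="$(ssh-keygen -l -f "$HOSTS_FILE")"
echo "$FINGERPRINTS"
```

If the printed SHA256 value does not match the one GitHub documents for its Ed25519 key, do not use the entry.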
Conclusion
Now you can update your Helm Release with these new values and all the DAGs from the DAGs Repositories will be available in your Airflow Release.
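Updating the release could look like the sketch below. It assumes the release is called `airflow`, lives in the `airflow` namespace and reads its values from `values.yaml`; all three are placeholders, and the `APPLY` switch is a made-up safety guard so the script only prints the command unless you opt in.

```shell
#!/usr/bin/env bash
set -eu

RELEASE_NAME="${RELEASE_NAME:-airflow}"    # hypothetical release name
NAMESPACE="${NAMESPACE:-airflow}"          # hypothetical namespace
VALUES_FILE="${VALUES_FILE:-values.yaml}"  # file containing the gitSync values

HELM_CMD="helm upgrade --install $RELEASE_NAME apache-airflow/airflow --namespace $NAMESPACE --values $VALUES_FILE"

if [ "${APPLY:-0}" = "1" ]; then
  # Register the official chart repository, then apply the new values.
  helm repo add apache-airflow https://airflow.apache.org
  helm repo update
  $HELM_CMD
else
  echo "dry run, would execute: $HELM_CMD"
fi
```

Run it with `APPLY=1` once the printed command looks right for your environment.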