Tasrie IT Services

We Upgraded Airflow 2.8 to 3.1 on Kubernetes. Here Is What Actually Changed

We recently finished upgrading a production Airflow instance from 2.8 to 3.1 running on Amazon EKS. The whole thing took about 6 weeks from planning to production cutover.

This post covers what we did, what changed in the DAG code, how we handled the data migration, and the Kubernetes manifests that make up the new deployment. No fluff, just what happened.

Why Upgrade at All

Airflow 2.8 worked. It was running production DAGs without issues. So why bother?

A few reasons pushed us over the edge:

  • End of life. Airflow 2 reaches EOL in April 2026. No more security patches after that. For a system handling production data pipelines, that is not something we could ignore.
  • DAG Processor as a separate process. In Airflow 3, the dag-processor runs independently from the scheduler. This means a slow or broken DAG file does not block the scheduler from doing its job. We had hit this problem before where a DAG with a heavy top-level import would stall scheduling for everything else.
  • Native HA scheduler. Airflow 2 supported multiple schedulers, but it was always a bit awkward. Airflow 3 was built for it from the start.
  • The new UI. The Airflow 3 UI is significantly better. Grid view, better DAG run visualization, faster navigation. The engineering team actually likes using it now.
  • Deferrable operators are first-class. The triggerer component handles async sensors properly. Our DAGs have a lot of S3KeySensors and HTTP sensors waiting for external systems. Moving these to deferrable mode means worker pods are not sitting idle burning resources while waiting.
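To give a concrete flavor, flipping a sensor to deferrable mode is usually a one-argument change. A minimal sketch (bucket and key names are hypothetical):

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# With deferrable=True the wait is handed off to the triggerer,
# so no worker pod sits idle polling S3.
wait_for_file = S3KeySensor(
    task_id="wait_for_input_file",
    bucket_name="example-bucket",    # hypothetical
    bucket_key="incoming/data.csv",  # hypothetical
    deferrable=True,
)
```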

None of these were "the system is broken" reasons. But together they added up to "we should do this now while we have the bandwidth, not later when we are forced to."

Timeline Breakdown

The whole thing took 6 weeks end to end. Here is roughly how it broke down:

  • Week 1: Planning and infrastructure. Set up the new EKS namespace, EFS storage, IRSA roles, and base manifests. If you are provisioning EKS from scratch, we have a Terraform EKS module that handles the cluster setup. Got a bare Airflow 3.1 cluster running with no DAGs.
  • Week 2: Configuration and deployment manifests. ConfigMaps, secrets, scheduler deployment, webserver, triggerer, StatsD, RBAC. Got git-sync pulling from a test branch.
  • Week 3: DAG code migration. Created the prod-airflow3 branch, ran ruff, fixed imports, exported and imported variables/connections.
  • Week 4: Non-production validation. Ran the full DAG suite in the staging environment, caught edge cases, verified all connections and variables resolved correctly, confirmed remote logging and metrics.
  • Week 5: Production deployment and parallel run. Deployed to production EKS, ran both clusters side by side, compared DAG run results between old and new.
  • Week 6: Cutover and decommission. Switched DNS to the new cluster, monitored for a few days, then tore down the old Airflow 2.8 instance.

Most of the "work" was in weeks 1-2 (the infrastructure side). The DAG code changes in week 3 took about 2 days of actual effort. The rest was testing and building confidence before pulling the trigger on production.

Why Green Field Instead of In-Place Upgrade

We could have done an in-place upgrade. We chose not to.

The existing Airflow 2.8 cluster had been running for a while and had accumulated config drift, stale connections, and a metadata database that had seen better days. Rather than nurse it through a major version upgrade and deal with schema migration issues on a bloated database, we stood up a fresh Airflow 3.1 cluster alongside the old one.

The approach was simple:

  1. Deploy a brand new Airflow 3.1 on EKS
  2. Export variables and connections from the old instance
  3. Import them into the new one
  4. Update DAG code for Airflow 3 compatibility
  5. Point git-sync to the new branch
  6. Validate, then cut over

Since the DAGs were already version-controlled in GitHub and delivered via git-sync, there was no "migration" of DAG files. The code was already backed up. We just needed a new branch with the updated imports.

Exporting Variables and Connections

This was the easiest part. Airflow has CLI commands for exactly this:

```bash
# On the old Airflow 2.8 instance
airflow variables export variables.json
airflow connections export connections.json
```

```bash
# On the new Airflow 3.1 instance
airflow variables import variables.json
airflow connections import connections.json
```

That's it. All the variables and secrets came over cleanly. The JSON format preserves everything, including serialized JSON values inside variables. We did a quick diff after import to make sure nothing was lost.
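If you want to script that sanity check, something like this works (paths are illustrative):

```bash
# Re-export from the new instance and diff against the original export
airflow variables export /tmp/variables_new.json
jq -S . variables.json > /tmp/vars_old.sorted.json
jq -S . /tmp/variables_new.json > /tmp/vars_new.sorted.json
diff /tmp/vars_old.sorted.json /tmp/vars_new.sorted.json && echo "variables match"
```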

One thing worth noting: if you have connections with extra fields or custom connection types, double check those after import. Ours were mostly SSH and AWS connections so they came through fine.

What Actually Changed in the DAG Code

This is the part everyone wants to know. We had about 18 active DAGs and the changes fell into four categories.

1. Import Path Updates

This was the bulk of the work. Airflow 3 removed a bunch of deprecated import paths that had been hanging around since Airflow 1.x days.

```python
# Old (Airflow 2.8)
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.email_operator import EmailOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.utils.db import provide_session

# New (Airflow 3.1)
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.session import provide_session
```

The pattern is pretty consistent. `airflow.contrib.*` is gone entirely. The old `airflow.operators.*_operator` paths are now just `airflow.operators.*` without the `_operator` suffix. And `DummyOperator` got renamed to `EmptyOperator`, which honestly makes more sense.

2. schedule_interval to schedule

Every single DAG needed this change:

```python
# Old
dag = DAG('my_dag', schedule_interval='0 5 * * *', catchup=False, default_args=default_args)

# New
dag = DAG('my_dag', schedule='0 5 * * *', catchup=False, default_args=default_args)
```

Just a parameter rename. `schedule_interval` is removed in Airflow 3, not just deprecated.

3. provide_session Moved

If you have any DAGs that interact with the Airflow metadata database directly (we had one monitoring DAG that checked for delayed tasks), the provide_session decorator moved:

```python
# Old
from airflow.utils.db import provide_session

# New
from airflow.utils.session import provide_session
```
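Usage is unchanged; only the import moves. A hypothetical sketch of the kind of query a monitoring DAG like ours runs:

```python
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import TaskInstanceState

@provide_session
def count_queued_tasks(session=None):
    # provide_session injects a SQLAlchemy session when the caller passes none
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.QUEUED)
        .count()
    )
```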

4. That's It

Seriously. For our codebase, those were the only code changes needed. No DAG logic changes, no task rewrites, no workflow restructuring. The actual business logic of every DAG stayed identical.

We ran ruff with the AIR301 and AIR302 rules to catch most of these automatically:

```bash
ruff check dags/ --select AIR301,AIR302
ruff check dags/ --select AIR301,AIR302 --fix
```

That caught about 80% of the import changes. The remaining ones we fixed by hand after running the DAGs in a dev environment and watching what blew up.

The Kubernetes Deployment

We deployed Airflow 3.1 on EKS using raw Kubernetes manifests instead of the Helm chart. The Helm chart is fine for getting started, but we wanted full control over every resource for GitOps workflows and easier debugging.

Here is the high-level architecture:

*(Figure: Airflow 3.1 Kubernetes deployment, high-level architecture.)*

Key Design Decisions

  • KubernetesExecutor over CeleryExecutor. Each task gets its own pod. No Redis, no persistent Celery workers. Clean isolation, and autoscaling is built in.
  • Git-sync sidecar pulls DAGs from GitHub via SSH every 5 seconds. No baked-in images, no S3 DAG syncing.
  • EFS for shared storage. DAGs need ReadWriteMany access since git-sync writes and scheduler/workers read across multiple nodes. EBS is ReadWriteOnce only.
  • Raw manifests over Helm for full visibility. Every resource is explicit, diffable, and reviewable.

Namespace and ServiceAccounts

Each component gets its own ServiceAccount with IRSA (IAM Roles for Service Accounts) so pods get AWS credentials without static access keys:

| ServiceAccount | Component | Why |
| --- | --- | --- |
| `airflow-serviceaccount` | Scheduler, Workers | S3 logging, ECR image pull |
| `airflow-webserver` | Webserver / API Server | Pod log reading |
| `airflow-triggerer` | Triggerer | Async sensor operations |
| `airflow-statsd` | StatsD Exporter | Metrics export |

Least-privilege principle. The cleanup CronJob only needs pod list/delete. Workers need S3 access. The webserver needs pod log reading. Different roles for different jobs.
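For reference, an IRSA-enabled ServiceAccount is just an annotation (the role name here is a placeholder):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow-serviceaccount
  namespace: airflow
  annotations:
    # EKS injects temporary credentials for this role into the pod
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/airflow-workers  # placeholder
```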

The ConfigMap (airflow.cfg)

This is the core of the deployment. Key sections:

```ini
[core]
executor = KubernetesExecutor
dags_folder = /opt/airflow/dags/repo/dags
remote_logging = True
load_examples = False

[kubernetes_executor]
namespace = airflow
pod_template_file = /opt/airflow/pod_templates/pod_template_file.yaml
worker_container_repository = ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/apache/airflow
worker_container_tag = 3.1.7-extra-pips

[logging]
remote_logging = True
remote_base_log_folder = s3://YOUR-BUCKET/airflow-logs
remote_log_conn_id = aws_s3_conn

[scheduler]
enable_health_check = True
run_duration = 41460
statsd_on = True

[triggerer]
default_capacity = 1000
enable_health_check = True
```

`run_duration = 41460` is 41,460 seconds, about 11.5 hours. The scheduler restarts itself periodically to free memory. If you have seen scheduler memory creep over time, this helps.

The triggerer capacity of 1000 means it can watch 1000 deferrable triggers concurrently. If you are using async sensors (S3KeySensor, HttpSensor with deferrable=True), this is the component that handles them.

Scheduler Deployment

The scheduler is the most complex piece. It runs four containers in a single pod:

  1. scheduler - the main scheduling process
  2. dag-processor - parses DAG files (separate process in Airflow 3)
  3. git-sync - continuous sidecar pulling DAG updates every 5 seconds
  4. scheduler-log-groomer - cleans logs older than 15 days

Plus two init containers:

  • wait-for-airflow-migrations - blocks until airflow db check-migrations passes
  • git-sync-init - one-time clone so DAGs exist before the scheduler starts
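As a sketch, the migration-wait init container boils down to this (image tag matches the one used elsewhere in this post):

```yaml
initContainers:
  - name: wait-for-airflow-migrations
    image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/apache/airflow:3.1.7-extra-pips
    args:
      - bash
      - -c
      # block startup until the metadata schema is up to date
      - exec airflow db check-migrations --migration-wait-timeout=300
```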

We run 2 replicas with pod anti-affinity so they land on different nodes. Airflow 3 supports HA scheduler natively.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              component: scheduler
          topologyKey: kubernetes.io/hostname
        weight: 100
```

Resource allocation:

| Container | CPU Request | CPU Limit | Memory Request | Memory Limit |
| --- | --- | --- | --- | --- |
| scheduler | 200m | 1000m | 512Mi | 1024Mi |
| dag-processor | 100m | 500m | 256Mi | 512Mi |
| git-sync | 100m | 200m | 128Mi | 256Mi |

Webserver Deployment

One thing that tripped us up: Airflow 3 renamed the webserver command to api-server:

```yaml
args:
  - bash
  - -c
  - exec airflow api-server
```

2 replicas with rolling updates (maxSurge: 1, maxUnavailable: 0) for zero-downtime deploys. Both replicas share the same webserver-secret-key for session persistence.

Three probes protect the deployment:

  • startupProbe - gives up to 60s for initialization
  • livenessProbe - restarts if /public/health stops responding
  • readinessProbe - removes from service if unhealthy
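A sketch of what those probes look like against the api-server's health endpoint (port and timings are illustrative assumptions):

```yaml
startupProbe:
  httpGet:
    path: /public/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 6   # 6 x 10s = up to 60s for initialization
livenessProbe:
  httpGet:
    path: /public/health
    port: 8080
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /public/health
    port: 8080
  periodSeconds: 10
```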

Git-Sync DAG Delivery

*(Figure: git-sync DAG delivery flow.)*

Git-sync uses SSH with a deploy key stored as a Kubernetes Secret. It clones to /git and creates a symlink at /git/repo. The scheduler reads from /opt/airflow/dags/repo/dags via a volume mount.

We use SSH through port 443 (ssh://git@ssh.github.com:443/...) to bypass corporate firewalls that block port 22.

The init + sidecar pattern means:

  • git-sync-init runs once with GITSYNC_ONE_TIME=true to ensure DAGs exist before the scheduler starts
  • git-sync runs continuously, pulling every 5 seconds
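Put together, the sidecar looks roughly like this (repo URL, image tag, and key path are placeholders):

```yaml
- name: git-sync
  image: registry.k8s.io/git-sync/git-sync:v4.2.3  # placeholder tag
  env:
    - name: GITSYNC_REPO
      value: ssh://git@ssh.github.com:443/your-org/airflow-dags.git  # placeholder
    - name: GITSYNC_REF
      value: prod-airflow3
    - name: GITSYNC_PERIOD
      value: "5s"
    - name: GITSYNC_ROOT
      value: /git
    - name: GITSYNC_LINK
      value: repo
    - name: GITSYNC_SSH_KEY_FILE
      value: /etc/git-secret/ssh  # placeholder key path
  volumeMounts:
    - name: dags
      mountPath: /git
    - name: ssh-key
      mountPath: /etc/git-secret
      readOnly: true
```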

This is why the green field approach worked so well. DAGs were already in GitHub. We just created a prod-airflow3 branch, made the import changes, and pointed the new cluster's git-sync at it.

Secrets

| Secret | Purpose |
| --- | --- |
| `airflow-fernet-key` | Encrypts connections/variables in the metadata DB |
| `airflow-webserver-secret` | Flask session signing |
| `airflow-metadata` | PostgreSQL connection string |
| `airflow-ssh-secret` | SSH key for git-sync |

Generate the fernet key once:

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

EFS Storage

DAGs need ReadWriteMany because git-sync writes and scheduler + workers read across multiple nodes:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-XXXXXXXXXXXXXXXXX
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags
  namespace: airflow
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: "10Gi"
  storageClassName: "efs-sc"
```

If you try to use EBS here, you will hit issues the moment a second scheduler replica or worker pod starts on a different node. EFS solves this natively on AWS.

Triggerer StatefulSet

The triggerer is a StatefulSet (not a Deployment) because it needs stable network identity and persistent storage for logs. Its EBS volume is AZ-scoped, so we pin it to a single availability zone with node affinity.
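The AZ pin is a standard nodeAffinity rule on the zone label (the zone value is a placeholder):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a  # placeholder: the AZ holding the triggerer's EBS volume
```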

Pod Cleanup CronJob

Without this, your namespace fills up with completed worker pods:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: airflow-cleanup
  namespace: airflow
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/apache/airflow:3.1.7-extra-pips
              args:
                - bash
                - -c
                - exec airflow kubernetes cleanup-pods --namespace=airflow
          restartPolicy: OnFailure
          serviceAccountName: airflow-cleanup
```

Runs every hour. Without it, kubectl get pods becomes unusable after a few days of heavy scheduling.

RBAC

The KubernetesExecutor needs permission to create and manage pods. Three roles:

Pod Launcher (scheduler): create, list, get, patch, watch, delete on pods. Full lifecycle management.

Pod Log Reader (webserver + triggerer): list, get, watch on pods and get, list on pods/log. Read-only access so the UI can display live task logs.

Cleanup (CronJob): list, delete on pods. That's all it needs.
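As an example, the pod-launcher Role spelled out as a manifest (a sketch of the rules above):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-launcher
  namespace: airflow
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "list", "get", "patch", "watch", "delete"]
```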

Database Migration Job

A Kubernetes Job that runs airflow db migrate before any other component starts:

```yaml
annotations:
  helm.sh/hook: post-install,post-upgrade
  helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  helm.sh/hook-weight: "1"
```

Weight 1 means it runs before the create-user job (weight 2). All other pods have an init container that runs airflow db check-migrations --migration-wait-timeout=300 to block until this completes.
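The Job body itself is short; a sketch (the name and ServiceAccount are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: airflow-db-migrate  # illustrative name
  namespace: airflow
spec:
  template:
    spec:
      containers:
        - name: run-migrations
          image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/apache/airflow:3.1.7-extra-pips
          args:
            - bash
            - -c
            - exec airflow db migrate
      restartPolicy: Never
      serviceAccountName: airflow-serviceaccount
```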

StatsD and Prometheus

Airflow emits StatsD metrics on UDP port 9125. A StatsD exporter converts them to Prometheus format on HTTP port 9102. The mapping config translates dot-separated metric names into labeled Prometheus metrics:

```yaml
- match: airflow.dag.*.*.duration
  name: "airflow_task_duration"
  labels:
    dag_id: "$1"
    task_id: "$2"
```

Useful metrics to watch: airflow_task_duration, airflow_dagrun_schedule_delay, airflow_pool_starving_tasks, and airflow_scheduler_heartbeat. If you are setting up Prometheus on Kubernetes for the first time, the StatsD exporter integrates cleanly with standard ServiceMonitor CRDs.
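A minimal ServiceMonitor against the exporter might look like this (the label selector must match your exporter Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: airflow-statsd
  namespace: airflow
spec:
  selector:
    matchLabels:
      component: statsd  # placeholder: match the exporter Service's labels
  endpoints:
    - port: metrics      # the exporter's HTTP port (9102)
      interval: 30s
```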

Worker Pod Template

Every task the KubernetesExecutor runs gets a pod from this template:

```yaml
spec:
  containers:
    - name: base
      image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/apache/airflow:3.1.7-extra-pips
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 2Gi
      volumeMounts:
        - mountPath: "/opt/airflow/logs"
          name: logs
        - name: config
          mountPath: "/opt/airflow/airflow.cfg"
          subPath: airflow.cfg
        - name: dags
          mountPath: /opt/airflow/dags
          readOnly: true
  serviceAccountName: airflow-serviceaccount
  terminationGracePeriodSeconds: 600
  restartPolicy: Never
```

restartPolicy: Never is important. Airflow manages retries at the DAG level, not Kubernetes. terminationGracePeriodSeconds: 600 gives tasks 10 minutes to wrap up during node drains. Every container drops all Linux capabilities as a security baseline.

The LDAP Gotcha

We use LDAP authentication via FAB (Flask-AppBuilder) auth manager. It works fine in Airflow 3, but here is the catch: if you use the FAB auth manager for web login, you get the old Flask-AppBuilder login page, not the new Airflow 3 UI login screen.

Functionally it works. LDAP users can log in, self-registration works, RBAC roles are assigned. But visually the login page looks like it is from a different era.

As of now, there is no new auth manager that supports LDAP natively with the new Airflow 3 UI. There is an open discussion in the Airflow repo about building a lightweight LDAP auth manager, but nothing has shipped yet. If you are using LDAP, this is something to be aware of. It is cosmetic, not functional.
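For reference, opting into the FAB auth manager in Airflow 3 is a one-line config change (the LDAP details then live in webserver_config.py as before):

```ini
[core]
auth_manager = airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
```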

The Cutover

We ran both clusters in parallel for about a week. The old Airflow 2.8 cluster continued running production DAGs while we validated the new one.

The validation process:

  1. Triggered each DAG manually on the new cluster and compared outputs with the old cluster's runs
  2. Checked that all variables resolved correctly (some DAGs pull config from Airflow Variables at runtime)
  3. Verified remote logging to S3 was working
  4. Confirmed StatsD metrics were flowing to Prometheus
  5. Ran the scheduled DAGs for a few days and watched for failures

Once we were confident, the cutover was just a DNS change on the ALB. The old cluster stayed up for another few days as a safety net, then we decommissioned it.

The beauty of green field: if anything went wrong at any point, the rollback plan was "change DNS back." We never had to use it, but having that escape hatch made the whole process low-stress.

What Went Well

Git-sync made this painless. Since all DAG code was already in GitHub, the "migration" was really just a branch with updated imports. No file copying, no artifact management, no sync scripts.

Variable and connection export just worked. The CLI commands round-tripped cleanly. No manual recreation of dozens of variables and connections.

The code changes were mechanical. Find and replace import paths, rename schedule_interval to schedule, rename DummyOperator to EmptyOperator. No logic changes. Ruff's AIR rules caught most of it automatically.

Green field deployment eliminated risk. The old cluster kept running during the entire migration. If anything went wrong, we could roll back by just pointing DNS back. Zero pressure.

What We Would Do Differently

Clean the metadata database before exporting. We exported everything including stale variables from experiments that ran months ago. Would have been a good opportunity to audit what is actually in use.

Run ruff earlier. We started fixing imports by hand before discovering the AIR301/AIR302 rules. Would have saved us a couple hours.

Closing Thoughts

If you are still on Airflow 2.x, the migration to 3.x is not as scary as it looks. The DAG code changes are almost entirely mechanical. The bigger decision is whether to upgrade in-place or go green field, and if your DAGs are in git (they should be), green field is the way to go. If you are evaluating whether Airflow is still the right tool for your pipelines, we wrote a detailed Apache NiFi vs Airflow comparison that covers the architectural differences.

Airflow 2 reaches end of life in April 2026, so the clock is ticking.


Need Help With Your Airflow Migration?

Upgrading Airflow on Kubernetes involves more than just code changes. From EKS cluster configuration to RBAC policies, git-sync setup, and monitoring, there are a lot of moving parts.

Our team provides managed Apache Airflow services to help you:

  • Plan and execute Airflow 2 to 3 migrations with zero downtime
  • Design production-grade Kubernetes deployments with proper HA, monitoring, and security
  • Set up GitOps workflows for DAG delivery and infrastructure management

We have done this across multiple production environments and know where the sharp edges are.

Talk to our engineering team about your migration
