Saksham Awasthi for MeteorOps

Originally published at meteorops.com

Deploy Apache Airflow on AWS Elastic Kubernetes Service (EKS)

Running your data pipelines smoothly is not trivial.

Apache Airflow is an excellent option thanks to its many features and integrations, but it is not perfect: making its infrastructure scalable requires a lot of heavy lifting.

That’s where deploying Apache Airflow on Kubernetes comes in. It enables you to orchestrate multiple DAGs in parallel on multiple types of machines, leverage Kubernetes to autoscale nodes, monitor the pipelines, and distribute the processing.

This guide will help you prepare your EKS environment, deploy Airflow, and integrate it with essential add-ons. You will also pick up a few suggestions for making your Airflow deployment production-grade.

Prerequisites

Before you deploy Apache Airflow, ensure you have all the prerequisites: eksctl, kubectl, helm, and an EKS Cluster.

We’ll be using eksctl to create the EKS Cluster, but feel free to skip it if you already have one.

Set up the AWS & eksctl CLIs

1. Install the AWS CLI (skip to step 2 if you already have the CLI installed):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Please refer to the full AWS installation guide for other operating systems and architectures.

Once installed, configure the AWS CLI on your local machine. Refer to this AWS guide about configuring the CLI locally.

2. Install the eksctl CLI (skip to step 3 if you already have eksctl installed):

curl --location "https://github.com/weaveworks/eksctl/releases/download/0.104.0/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

You can also refer to the eksctl installation guide.

Create the AWS EKS (Elastic Kubernetes Service) Cluster

Create an EKS Cluster: (skip to the next section if you already have a cluster)
You can create an EKS Cluster directly from the AWS Management Console or
using the eksctl create cluster command.

Run the below command to create an EKS cluster in a public subnet in the Oregon region.

eksctl create cluster --name airflow-cluster --region us-west-2 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 4 --managed

You can find a detailed blog on setting up an EKS Cluster.

Connect to the EKS Cluster from your local machine

  1. Install kubectl on your local machine using the commands below.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version --client

Refer to the AWS kubectl & eksctl configuration guide for other operating systems and architectures.

  2. After setting up your cluster, connect to it from your local machine. The command below updates your kubeconfig file.
aws eks --region us-west-2 update-kubeconfig --name airflow-cluster


Set up Helm Locally

Run the below command to install Helm on your local machine.

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash 

Refer to the Helm installation guide for other operating systems and architectures.

Support Dynamic Volume Provisioning for Persistent Storage using EBS

For an elastic, scalable service, dynamic volume provisioning is preferred, which means persistent storage must be configured and a storage class registered.

Follow this guide to set up Amazon EBS CSI Driver Add-On and Dynamic Volume.
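
In case it helps, here is a minimal sketch of that setup, assuming the eksctl-created cluster from above and following the AWS documentation's flow (the IAM role name, account-ID placeholder, and gp3 StorageClass name are illustrative choices, not values from the guide):

# Enable IAM OIDC and create an IAM role for the EBS CSI driver
eksctl utils associate-iam-oidc-provider --cluster airflow-cluster --region us-west-2 --approve
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster airflow-cluster \
  --region us-west-2 \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-only \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --approve

# Install the EBS CSI driver as an EKS add-on using that role
eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster airflow-cluster \
  --region us-west-2 \
  --service-account-role-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:role/AmazonEKS_EBS_CSI_DriverRole

# Register a default gp3 StorageClass for dynamic provisioning
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
EOF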

What is Airflow?

Apache Airflow is an open-source platform for scheduling and orchestrating data pipelines and workflows. In simple terms, Apache Airflow is an ETL/ELT orchestration tool.

You can create, schedule, and monitor complex workflows in Apache Airflow.

You can connect multiple data sources with Airflow and send pipeline success or failure alerts over Slack or email. In Airflow, you define workflows in Python, each represented as a Directed Acyclic Graph (DAG). Airflow can be deployed anywhere, and once deployed, you can access the Airflow UI and set up workflows.

Use cases of Airflow:

  • Data ETL Automation: Streamline the extraction, transformation, and loading of data from various sources into storage systems.
  • Data Processing: Schedule and oversee tasks like data cleansing, aggregation, and enrichment.
  • Data Migration: Manage data transfer between different systems or cloud platforms.
  • Model Training: Automate the training of machine learning models on large datasets.
  • Reporting: Generate and distribute reports and analytics dashboards automatically.
  • Workflow Automation: Coordinate complex processes with multiple dependencies.
  • IoT Data: Analyze and process data from IoT devices.
  • Workflow Monitoring: Track workflow progress and receive alerts for issues.

Benefits of using Airflow in Kubernetes

Deploying Apache Airflow on a Kubernetes cluster offers several advantages over deploying it on a virtual machine:

  • Scalability: Kubernetes allows you to scale your Airflow deployment horizontally by adding more pods to handle increased workloads automatically.
  • Isolation: Enables running different tasks of the same pipeline on various cluster nodes by deploying each task as an isolated pod.
  • Automation: Kubernetes native capabilities, like auto-scaling, self-healing, and rolling updates, reduce manual intervention, improving operational efficiency.
  • Portability: Deploying on Kubernetes makes your Airflow setup more portable across different environments, whether on-premise or cloud.
  • Integration: Kubernetes integrates seamlessly with various tools for monitoring, logging, and security, enhancing the overall management of your Airflow deployment.

Airflow Architecture Diagram

[Airflow architecture diagram: DAGs, Scheduler, Executor, Workers, Triggerer, Web Server, and the metadata database]

  1. The core Airflow components are the Scheduler, Executor, Web Server, and the metadata database. The Airflow Workers and the Triggerer are also involved.
  2. As you can see in the above diagram, the Data Engineer writes Airflow DAGs. Airflow DAGs are collections of tasks that specify the dependencies between them and the order in which they are executed. A DAG is a file that contains your Python code.
  3. The Scheduler picks up these DAGs and schedules the tasks they define according to their configuration.
  4. In the above diagram, the Scheduler runs tasks using Kubernetes Executor and creates a separate pod for every task, which provides isolation.
  5. Airflow also stores pipeline metadata in an external database. The main configuration file used by the Web server, Scheduler, and workers is airflow.cfg.
  6. The Data Engineer can view the entire flow through the Airflow UI. Users can also check the logs, monitor the pipelines, and set alerts.
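
If you want the Helm deployment in the later sections to use the KubernetesExecutor shown in the diagram, a minimal sketch is to override the chart's top-level executor value in values.yaml (the official chart defaults to the CeleryExecutor):

# values.yaml (excerpt) — run one isolated pod per task via the KubernetesExecutor
executor: "KubernetesExecutor"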

Airflow Deployment Options

When deploying Apache Airflow, there are multiple approaches to consider, each with unique advantages and challenges. Let us look at the main options:

  • Amazon Managed Workflows for Apache Airflow (MWAA)
    You should configure the service through the AWS Management Console. There, you can define your environment, set up necessary permissions, and integrate with other AWS services.

  • Google Cloud Composer:
    Create an environment using the Google Cloud Console and integrate with Google Cloud services like BigQuery and Google Cloud Storage.

  • Azure Data Factory with Airflow Integration:
    Configure Airflow through the Azure Portal and integrate with other Azure services for efficient workflow automation.

  • Self-hosted on AWS EC2:
    We can launch and configure EC2 instances. We must install Airflow, set up the environment, configure databases, and set up the scheduler.

  • Running on Kubernetes (e.g., AWS EKS):
    We can create Kubernetes clusters, deploy Airflow using Helm charts or custom YAML files, and manage container orchestration and scaling.

These are the main ways to deploy Airflow; this guide focuses on deploying it on AWS EKS, as covered in the next section.

Deploy Airflow on AWS EKS

Let us install Apache Airflow in the EKS cluster using the helm chart.

1. Create a new namespace.

kubectl create namespace airflow 


2. Add the Helm chart repository.

helm repo add apache-airflow https://airflow.apache.org 

3. Update your Helm repository.

helm repo update

4. Deploy Airflow using the remote Helm chart.

helm install airflow apache-airflow/airflow --namespace airflow --debug

You will get the Airflow webserver and default Postgres connection credentials in the output. Copy them and save them somewhere.
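
If you misplace that output, one way to re-print the chart's post-install notes (which include those defaults) is:

helm get notes airflow -n airflow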


5. Examine the deployment by getting the pods.

kubectl get pods -n airflow


The Airflow instance is now set up in EKS. All the Airflow pods should be running.

Let’s prepare Airflow to run our first DAG

At this point, Airflow is deployed using the default configuration. Let's see how to fetch the chart's default values to our local machine, modify them, and roll out a new release.

1. Save the configuration values from the Helm chart by running the command below.

helm show values apache-airflow/airflow > values.yaml


This command generates a file named values.yaml in your current directory, which you can modify and save as needed.

2. Check the release version of the Helm chart by running the following command.

helm ls -n airflow


3. Let us add the ingress configuration to access the Airflow instance over the internet.

We need to deploy an ingress controller in the EKS cluster first. The commands below will install the NGINX ingress controller from the helm repository.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install nginx-ingress ingress-nginx/ingress-nginx --namespace airflow-ingress --create-namespace --set controller.replicaCount=2
kubectl get pods -n airflow-ingress


Note - All the pods should be running.

kubectl get services --namespace airflow-ingress

Look for the external IP of the ingress controller service in the output of the get services command.


After installing the ingress controller, add the required configuration in the values.yaml file and save the file. There is a section dedicated to the ingress configuration.

# Ingress configuration
ingress:
  enabled: true
  web:
    enabled: true
    annotations: {}
    path: "/"
    pathType: "ImplementationSpecific"
    host: <your domain URL>
    ingressClassName: "nginx"


After the changes to the values in the values.yaml file, we run the helm upgrade command to deploy the changes and create a new release version.

By default, the Helm Chart deploys its own Postgres instance, but using a managed Postgres instance is recommended instead.
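
As a sketch, switching to a managed database usually also means turning off the chart's bundled Postgres subchart in values.yaml (assuming the official apache-airflow chart used above):

# values.yaml (excerpt) — disable the bundled Postgres when using a managed database
postgresql:
  enabled: false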

You can modify the Helm chart's values.yaml file to point Airflow at the managed database; the connection settings live under the data section:

data:
  metadataConnection:
    user: postgres
    pass: postgres
    protocol: postgresql
    host: <YOUR_POSTGRES_HOST>
    port: 5432
    db: postgres
    sslmode: disable

Run the helm upgrade command to implement the changes done above.

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug


Check the release version after the above command is run successfully. You should observe that the revision has changed to 2.


Accessing Airflow UI

We will use port-forwarding to access the Airflow UI in this tutorial. Run the below command and access “localhost:8080” on the browser.

kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow


Use the default webserver credentials saved from the Helm install output in the “Deploy Airflow on AWS EKS” section above.


At this point, Airflow is set up and is accessible. Hurray 😀

You can also access the UI over your domain, which is added in the ingress configuration in the above section.

Create your first Airflow DAG (in Git)

No DAGs have been added to our Airflow deployment yet. Let us see how we can add them.

To set up a private GitHub repository for your DAGs, you can create a new one using the GitHub website's UI.


You can also install the git command line interface on your local machine and run commands to initialize an empty git repo.

git init

Adding DAG configs to the git repo

Once the git repo is initialized, create a DAG file like “sample_dag.py” and push it to the remote branch.

git add .
git commit -m 'Adding first DAG'
git remote add origin <your repository SSH URL>
git push -u origin main

Integrate Airflow with a private Git repo

To integrate Airflow with a private Git repository, you will need credentials, i.e., a username/password pair or an SSH key.

We will use an SSH key to connect to the git repo.

1. Generate an SSH key on your local machine and add it to your GitHub account (skip this step if one already exists).

ssh-keygen -t ed25519 -C "<your email address>"

2. Create a generic secret containing your SSH private key in the namespace where Airflow is deployed.

kubectl create secret generic airflow-ssh-git-secret --from-file=gitSshKey=<path to SSH private key> -n airflow

3. Update the Git configuration in the values.yaml file and run the helm upgrade command as in the section above.

dags:
  gitSync:
    enabled: true
    repo: <your private Git repository SSH URL>
    branch: <branch-name>
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: ""
    sshKeySecret: airflow-ssh-git-secret

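To sanity-check the sync after the upgrade, you can tail the git-sync sidecar that the chart adds to the scheduler pod (the container and deployment names below are the chart's usual defaults; adjust them if your release differs):

# Tail the git-sync sidecar logs on the scheduler to confirm the repo is being pulled
kubectl logs deploy/airflow-scheduler -c git-sync -n airflow --tail=20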

Below is a “sample_dag.py” that demonstrates a simple workflow.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default arguments applied to every task in this DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 8, 8),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# A minimal DAG that runs once a day
dag = DAG('hello_world', default_args=default_args, schedule_interval=timedelta(days=1))

# Single task: print a greeting using the BashOperator
t1 = BashOperator(
    task_id='say_hello',
    bash_command='echo "Hello World from Airflow!"',
    dag=dag,
)

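Once the DAG is synced, one way to exercise it end-to-end without waiting for the schedule is to run it from inside the scheduler pod (a sketch, assuming the release and namespace names used in this guide):

# Run a single test execution of the DAG for a given logical date
kubectl exec -it deploy/airflow-scheduler -n airflow -- airflow dags test hello_world 2024-08-08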

Upon completion, you can see the DAGs in the UI. Airflow detects new DAGs automatically, but you can also refresh a DAG manually from the DAGs page in the Airflow UI.


The UI has many views to explore, such as the DAG code, graph view, and audit logs.

You can also check cluster health and DAG run statistics from the Cluster Activity tab in the Airflow UI.


Run the Airflow job

DAGs can run on their schedule or be triggered manually from the UI. There is a run button on the rightmost side of the DAGs table.


A run can also be triggered from within the DAG's detail view.


Make your Airflow on Kubernetes Production-Grade

Apache Airflow is a powerful tool for orchestrating workflows, but making it production-ready requires careful attention to several key areas.

Below, we explore strategies to enhance security, performance, and monitoring, and to ensure high availability in your Airflow deployment.

Improved Security

a. Role-Based Access Control (RBAC)

  • Implementation: Enable RBAC in Airflow to ensure only authorized users can access specific features and data.
  • Benefits: Limits access to critical areas and reduces the risk of unauthorized changes or data breaches.

Refer to the Access Control guide.
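
For example, here is a hedged sketch of adding a read-only user with the built-in Viewer role through the Airflow CLI (RBAC is on by default in Airflow 2; the default release name is assumed, and the username and email are placeholders):

# Create a user limited to the built-in Viewer role
kubectl exec -it deploy/airflow-webserver -n airflow -- \
  airflow users create \
    --username viewer --firstname View --lastname Only \
    --role Viewer --email viewer@example.com \
    --password '<choose-a-password>'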

b. Secrets Management

  • Implementation: Integrate with external secret management tools like AWS Secrets Manager, HashiCorp Vault, or Kubernetes secrets.
  • Benefits: Securely store sensitive information like API keys and database passwords, keeping them out of your codebase.

Refer to this AWS documentation about secrets management in EKS, as well as this guide to use Kubernetes secrets in an Airflow DAG.
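
As one sketch, the chart's secret list can surface an existing Kubernetes secret to the Airflow pods as an environment variable (the secret name and key below are hypothetical):

# values.yaml (excerpt) — inject a Kubernetes secret as an environment variable
secret:
  - envName: "MY_DB_PASSWORD"
    secretName: "my-db-credentials"
    secretKey: "password"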

c. Network Security

  • Implementation: Use network policies and security groups to restrict Airflow's web interface and API access.
  • Benefits: Minimizes exposure to potential attacks by limiting network access to trusted sources only.

Refer to this guide to implement Network Security in EKS.
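
A minimal sketch of such a policy, allowing webserver traffic only from the ingress controller namespace created earlier (the pod label selector is an assumption based on the chart's usual component labels; verify it against your pods):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webserver-from-ingress
  namespace: airflow
spec:
  podSelector:
    matchLabels:
      component: webserver   # assumed chart label; check with kubectl get pods --show-labels
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: airflow-ingress
      ports:
        - protocol: TCP
          port: 8080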

Improved Performance

a. Optimized Resource Allocation

  • Implementation: Right-size your Kubernetes pods and nodes based on the workload demand. Use Kubernetes Horizontal Pod Autoscaler (HPA) to scale Airflow resources dynamically and cluster autoscaler to scale nodes.
  • Benefits: Ensures efficient use of resources, reduces costs, and prevents bottlenecks during peak loads.

Airflow relies on its executor (for example, the KubernetesExecutor) to launch and scale task pods.

Refer to this generic guide on Implementing HPA and Cluster Autoscaler in EKS.

HPA will autoscale the different Airflow components, while the Cluster Autoscaler will make sure there are nodes to satisfy those requirements.
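
As a rough sketch of the HPA side, assuming the CeleryExecutor (where the chart typically runs workers as a StatefulSet named airflow-worker; adjust the target kind and name to your release, or use the chart's built-in KEDA option for workers instead):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker-hpa
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet      # may be a Deployment if worker persistence is disabled
    name: airflow-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70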

b. Task Parallelism

  • Implementation: Configure Airflow to handle parallel task execution by optimizing the number of worker pods and setting appropriate concurrency limits.
  • Benefits: Accelerates workflow execution by running multiple tasks simultaneously, improving overall performance.

Check out this guide for Implementing parallelism in Airflow.
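
One hedged way to tune this with the Helm chart is through its config block, which maps onto sections of airflow.cfg (the numbers below are arbitrary starting points, not recommendations):

# values.yaml (excerpt) — concurrency knobs rendered into airflow.cfg
config:
  core:
    parallelism: 32               # max tasks running at once across the whole instance
    max_active_tasks_per_dag: 16  # per-DAG task concurrency
    max_active_runs_per_dag: 4    # simultaneous runs of a single DAG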

c. Use of ARM Instances

  • Implementation: Consider running workloads on ARM-based instances like AWS Graviton for cost efficiency.
  • Benefits: ARM instances often provide a better cost-to-performance ratio, especially for compute-intensive tasks.

A quick guide to Creating an EKS cluster with ARM instances.
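
For instance, a sketch of adding a Graviton node group to the cluster created earlier (the instance type and sizing are illustrative):

# Add an ARM64 (Graviton) managed node group to the existing cluster
eksctl create nodegroup \
  --cluster airflow-cluster \
  --region us-west-2 \
  --name graviton-workers \
  --node-type t4g.medium \
  --nodes 2 --nodes-min 1 --nodes-max 4 \
  --managed

Keep in mind that any images scheduled onto these nodes need arm64 variants; recent official Airflow images are published for both amd64 and arm64.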

d. Use of HTTPS for ingress host

  • Implementation: Serve the Airflow URL over HTTPS by configuring TLS/SSL certificates on the Kubernetes ingress controller.
  • Benefits: HTTPS encrypts data to enhance the security of information being transferred. This is especially crucial when handling sensitive data, as encryption helps protect it from unauthorized access during transmission.

Refer to this guide to Install NGINX ingress and configure TLS.
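
As a sketch, assuming cert-manager is installed with a ClusterIssuer named letsencrypt-prod (both assumptions, not covered in this guide), the ingress section from earlier can be extended along these lines:

# values.yaml (excerpt) — TLS for the web ingress
ingress:
  web:
    enabled: true
    ingressClassName: "nginx"
    host: <your domain URL>
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    tls:
      enabled: true
      secretName: airflow-tls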

Monitoring

a. Metrics Collection and Alerting

  • Implementation: Integrate Airflow with Prometheus to collect metrics on task performance, resource usage, and system health. Tools like Grafana or Prometheus Alertmanager can raise alerts based on critical metrics and log events.
  • Benefits: Provides visibility into Airflow’s performance, letting you identify and address issues proactively, respond quickly to problems, reduce downtime, and keep workflows reliable.

Refer to the “How to set up Prometheus and Grafana with Airflow” guide.
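
One common path, sketched here, is the chart's bundled StatsD exporter, which Prometheus can scrape (it is usually enabled by default in recent chart versions; shown explicitly for clarity):

# values.yaml (excerpt) — expose Airflow metrics via the bundled StatsD exporter
statsd:
  enabled: true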

b. Logs Collection

  • Implementation: Set up centralized logging with tools like Elasticsearch, Logstash, Kibana (ELK stack or EFK stack), or Grafana Loki.
  • Benefits: Simplifies troubleshooting by consolidating logs from all Airflow components into a single, searchable interface.

Refer to the guide on how to Setup Elastic, Fluentd, and Kibana on EKS.

High Availability

a. Redundant Components

  • Implementation: Deploy multiple replicas of Airflow’s web server, scheduler, and worker nodes to ensure redundancy.
  • Benefits: Increases resilience by preventing single points of failure, ensuring that workflows continue even if one component goes down.

To deploy multiple replicas in Apache Airflow using the Helm chart, follow these steps:

1. Set Replicas for the Scheduler:

In your values.yaml file, set scheduler.replicas to the desired number of replicas. For example:

scheduler:
  replicas: 2


2. Set Replicas for the Web Server:
Similarly, set webserver.replicas to deploy multiple web server pods:

webserver:
  replicas: 2


3. Deploy the Helm Chart:
Apply the Helm chart with the updated values.yaml file:

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml

This configuration ensures that multiple scheduler and web server pods are deployed, contributing to the high availability of your Airflow setup.
The Airflow Helm chart’s values.yaml file can be found here.

b. Database High Availability

  • Implementation: Use a highly available database solution like Amazon RDS with Multi-AZ deployment for Airflow’s metadata database.
  • Benefits: Ensures continuous operation and data integrity even during a database failure.

Refer to Amazon RDS with the Multi-AZ deployment guide.

c. Backup and Disaster Recovery

  • Implementation: Regularly backup Airflow’s database and configuration files. Implement a disaster recovery plan that includes rapid failover procedures.
  • Benefits: Protects against data loss and enables quick recovery in case of catastrophic failures. Read this document to set up automated backups in Amazon RDS.

Refer to this AWS page to learn about “Backup and Restore of EKS.”

Conclusion

Setting up Apache Airflow on Amazon EKS is a powerful way to manage your workflows at scale, but it requires careful planning and configuration to ensure it’s production-ready. Following this guide, you've deployed Airflow on EKS, created a simple DAG, connected Airflow with a private Git repository, and learned about different ways to implement security, performance, high availability, monitoring, and logging. With these optimizations, your Airflow deployment is now more efficient, cost-effective, and ready to handle the demands of real-world data orchestration.

Frequently Asked Questions

1. What is Apache Airflow?
Apache Airflow is an open-source tool that helps in orchestrating and managing workflows through Directed Acyclic Graphs (DAGs). It automates complex processes like ETL (Extract, Transform, Load) jobs, machine learning pipelines, and more.

2. Why deploy Airflow on Amazon EKS?
Deploying Airflow on Amazon EKS offers scalability, flexibility, and robust workflow management. EKS simplifies Kubernetes management, allowing you to focus on scaling and securing your Airflow environment.

3. What are the prerequisites for deploying Airflow on EKS?
You need an AWS account, an EKS cluster, kubectl configured on your local environment, a dynamic storage class using EBS volumes, and Helm for package management.

4. How do I monitor Airflow on EKS?
You can integrate Prometheus and Grafana for monitoring. Using Loki for log aggregation can also help in centralized log management and troubleshooting.

5. What Kubernetes add-ons are recommended for a production-grade Airflow setup?
Essential add-ons include External Secret Operator for secure secrets management, Prometheus and Grafana for monitoring, and possibly Loki for logging.

6. Can Airflow be integrated with external databases like RDS?
Yes, it’s common to configure Airflow to use an external PostgreSQL database hosted on Amazon RDS for production environments, providing reliability and scalability for your metadata storage.

7. How can I access the Airflow UI on EKS?
You can access the Airflow UI by setting up a LoadBalancer service or using an Ingress Controller with a DNS pointing to your load balancer for easy access.

8. How do I manage DAGs in a production environment?
For production, it’s advisable to store your DAGs in a private Git repository and integrate Airflow with this repo using GitSync to pull the latest DAG configurations automatically.
