Armaan Khan

Building and Managing ETL Pipelines with Apache Airflow – A Complete Guide (2025 Edition)


Introduction
In today's data-first economy, building reliable, automated ETL (Extract, Transform, Load) pipelines is critical. Apache Airflow is a leading open-source platform that lets data engineers author, schedule, and monitor workflows in Python. In this guide, you'll learn how to set up Airflow from scratch, build an ETL pipeline, and integrate it with modern data tools such as Snowflake and REST APIs.


🚀 What is Apache Airflow?
Apache Airflow is a workflow orchestration tool used to define, schedule, and monitor workflows as Directed Acyclic Graphs (DAGs). It turns standalone Python scripts into structured pipelines whose runs can be scheduled, retried, and monitored in production.

Core Features:

  • Python-native (Workflows as code)
  • UI to monitor, retry, and trigger jobs
  • Extensible with custom operators and plugins
  • Handles dependencies and retries

🛠️ Step-by-Step Setup on WSL (Ubuntu for Windows Users)

1. Install WSL and Ubuntu

  • Enable WSL in Windows Features
  • Download Ubuntu from Microsoft Store

2. Update and Install Python & Pip

sudo apt update && sudo apt upgrade -y  
sudo apt install python3 python3-pip -y

3. Set Up Airflow Environment

export AIRFLOW_HOME=~/airflow
pip install apache-airflow   # pinning against the official constraints file is recommended for reproducible installs

4. Initialize Airflow DB and Create Admin User

airflow db migrate   # on Airflow versions before 2.7, use "airflow db init"
airflow users create \
  --username armaan \
  --firstname Armaan \
  --lastname Khan \
  --role Admin \
  --email example@gmail.com \
  --password yourpassword

5. Start Webserver and Scheduler

airflow webserver --port 8080   # terminal 1
airflow scheduler               # terminal 2 (both processes must stay running)

Access the UI at:
http://localhost:8080
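
Tip: recent Airflow 2.x releases also include an airflow standalone command that initializes the database, creates a user, and starts the webserver and scheduler in one step, which is handy for quick local experimentation:

airflow standalone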


📘 Understanding DAGs (Directed Acyclic Graphs)
DAGs define the structure and execution order of a workflow. Each DAG contains tasks (Python functions, Bash commands, SQL statements) and the dependencies between them.

Basic ETL DAG Example:

from airflow import DAG  
from airflow.operators.python import PythonOperator  
from datetime import datetime

def extract():
    print("Extracting data...")

def transform():
    print("Transforming data...")

def load():
    print("Loading data...")

with DAG("etl_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",   # "schedule_interval" on Airflow versions before 2.4
         catchup=False) as dag:

    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # run in order: extract -> transform -> load
    t1 >> t2 >> t3

🧠 Advanced Concepts to Level Up

  • Retries & Failure Alerts: add retry settings and an alert callback to your task arguments (a full sketch follows after this list):

retries=2,
retry_delay=timedelta(minutes=5),
on_failure_callback=my_alert_function

  • Parameterized DAGs: read runtime configuration from Airflow Variables instead of hard-coding it:

from airflow.models import Variable
source_url = Variable.get("api_source_url")

  • External Triggers

    • Trigger DAGs via the REST API or CLI
    • Use TriggerDagRunOperator to connect workflows
  • Sensor Tasks

    • Wait for a file or upstream data to be ready before downstream tasks run
    • e.g., FileSensor, ExternalTaskSensor (see the sensor sketch after this list)
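
Here is a minimal sketch tying retries, failure alerts, and Variables together. It assumes Airflow 2.x and an api_source_url Variable created under Admin -> Variables; the notify_on_failure callback and the DAG name are illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Illustrative alert hook: swap in Slack, email, or PagerDuty as needed.
    print(f"Task {context['task_instance'].task_id} failed")

def extract():
    # Assumes an "api_source_url" Variable has been set in the Airflow UI.
    source_url = Variable.get("api_source_url")
    print(f"Extracting from {source_url}")

default_args = {
    "retries": 2,                          # retry each task twice before failing
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between attempts
    "on_failure_callback": notify_on_failure,
}

with DAG("resilient_etl",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False,
         default_args=default_args) as dag:

    extract_task = PythonOperator(task_id="extract", python_callable=extract)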
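
And a sketch of a sensor gating downstream work and then handing off to another DAG; the file path, the fs_default connection, and the reporting_dag ID are all assumptions for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.filesystem import FileSensor

with DAG("file_driven_etl",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:

    # Poll every 60s (up to 1 hour) for today's dump; assumes an "fs_default"
    # filesystem connection pointing at the data directory.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="incoming/sales.csv",
        fs_conn_id="fs_default",
        poke_interval=60,
        timeout=60 * 60,
    )

    # Kick off a separate (hypothetical) "reporting_dag" once the file has arrived.
    trigger_reporting = TriggerDagRunOperator(
        task_id="trigger_reporting",
        trigger_dag_id="reporting_dag",
    )

    wait_for_file >> trigger_reporting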

📡 Connect to Databases & APIs

To Snowflake:
Use SnowflakeOperator
Install:

pip install apache-airflow-providers-snowflake
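
A minimal sketch, assuming a Snowflake connection named snowflake_default has been configured in the Airflow UI; the DAG name and COPY INTO statement are illustrative, and newer provider releases may point you to the generic SQLExecuteQueryOperator instead:

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("snowflake_load",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:

    # Runs the SQL against the connection configured as "snowflake_default".
    load_to_snowflake = SnowflakeOperator(
        task_id="load_to_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO raw.sales FROM @my_stage/sales/;",  # illustrative statement
    )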

To REST APIs:
Use HttpSensor to wait for an endpoint to become available and SimpleHttpOperator (renamed HttpOperator in newer releases of the apache-airflow-providers-http package) to pull API data before the ETL steps.
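
A sketch, assuming an HTTP connection named api_default pointing at the source API; the health and orders endpoints, like the DAG name, are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

with DAG("api_ingest",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",
         catchup=False) as dag:

    # Poll the (placeholder) health endpoint until the API responds.
    api_available = HttpSensor(
        task_id="api_available",
        http_conn_id="api_default",
        endpoint="health",
        poke_interval=30,
        timeout=300,
    )

    # Fetch the (placeholder) orders endpoint; the response body can be pushed to XCom.
    fetch_orders = SimpleHttpOperator(
        task_id="fetch_orders",
        http_conn_id="api_default",
        endpoint="orders",
        method="GET",
        log_response=True,
    )

    api_available >> fetch_orders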


🧩 Monitoring & Managing Pipelines

  • UI: View task logs, retry failures, monitor DAG execution
  • CLI: Trigger or pause DAGs
airflow dags list
airflow tasks list etl_pipeline
airflow dags trigger etl_pipeline
airflow dags pause etl_pipeline     # use "unpause" to resume scheduling

✅ Summary

Apache Airflow is more than a scheduler: it's a platform for building scalable, production-grade data pipelines. By defining workflows in code rather than through a GUI, you gain flexibility, reusability, and full control over your ETL logic. It's ideal for modern data engineering, especially when integrated with tools like Snowflake, S3, and external APIs.


📅 What's Next?

  • Run Airflow in Docker for easier, reproducible deployment
  • Add SLA Miss Callbacks
  • Use Airflow Variables, XComs, and Templates
  • Store logs in AWS S3 or GCS
  • Build real DAGs for: data cleaning, scraping, model training
