peter muriya

Automating ETL Workflows with Apache Airflow: From Python Script to Scheduled Pipeline

Modern data engineering revolves around automation, reliability, and scalability. Writing an ETL script in Python is only the beginning. To transform that script into a production-grade data pipeline, you need orchestration, scheduling, monitoring, and error handling. This is where Apache Airflow shines.

Apache Airflow is one of the most popular workflow orchestration tools in data engineering. It allows you to define, schedule, and monitor workflows programmatically using Python. Instead of manually running your ETL scripts, Airflow automates the entire process and ensures your data pipelines execute reliably.

Why Apache Airflow Matters

After developing an ETL pipeline in Python, several challenges remain:

• How do you schedule it to run automatically?
• How do you monitor failures?
• How do you retry failed tasks?
• How do you manage dependencies?
• How do you scale multiple workflows?

Apache Airflow solves all these problems by acting as the orchestrator for your ETL workflows.

Prerequisites

Before using Airflow, ensure you have:

• A working Python ETL script
• Python 3.9 or newer
• Apache Airflow installed
• A database (PostgreSQL, MySQL, or SQLite)
• Basic understanding of DAGs

Step 1: Install Apache Airflow

Install Apache Airflow using pip (for reproducible installs, the official docs recommend pinning a version together with Airflow's constraints file, but the simplest form is):

pip install apache-airflow

Initialize the Airflow metadata database (note that newer Airflow releases deprecate db init in favor of airflow db migrate):

airflow db init
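
Depending on your Airflow version, you may also need to create a login user before the web UI will let you in. A sketch with placeholder credentials (choose your own):

airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com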

Step 2: Verify Your ETL Script

Suppose you already have an ETL script named etl_pipeline.py:

import pandas as pd

def extract():
    # Read the raw sales data from disk.
    return pd.read_csv("sales.csv")

def transform(df):
    # Derive a total column from quantity and price.
    df["total"] = df["quantity"] * df["price"]
    return df

def load(df):
    # Write the transformed data to its destination file.
    df.to_csv("processed_sales.csv", index=False)

def run_etl():
    data = extract()
    transformed = transform(data)
    load(transformed)

if __name__ == "__main__":
    run_etl()


Step 3: Create Your Airflow DAG

Airflow workflows are defined using DAGs (Directed Acyclic Graphs). Create a file inside the dags folder. Note that the import below only resolves if etl_pipeline.py is on Airflow's Python path (keeping it in the dags folder is the simplest option), and that Airflow tasks don't run from your script's directory, so absolute file paths are safer in practice:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from etl_pipeline import run_etl

default_args = {
    "owner": "airflow",
    "start_date": datetime(2026, 1, 1),
    "retries": 2  # re-run a failed task up to twice before giving up
}

with DAG(
    dag_id="sales_etl_pipeline",
    default_args=default_args,
    schedule="@daily",  # run once per day ("schedule" replaced "schedule_interval" in Airflow 2.4+)
    catchup=False       # don't backfill runs between start_date and today
) as dag:

    etl_task = PythonOperator(
        task_id="run_sales_etl",
        python_callable=run_etl  # the plain Python function from etl_pipeline.py
    )

Step 4: Start Airflow Services

Run the following commands in separate terminals (recent Airflow 2.x releases also provide airflow standalone, which runs everything in one process for local development):

airflow scheduler
airflow webserver --port 8080

Step 5: Access the Airflow UI

Open your browser and navigate to:

http://localhost:8080

From the Airflow dashboard, you can:

• View all DAGs
• Trigger pipelines manually (also possible from the CLI, as shown below)
• Monitor execution history
• Investigate failures
• View logs
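
Pipelines can also be triggered from the command line, using the dag_id from Step 3:

airflow dags trigger sales_etl_pipeline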

Step 6: Enable Your DAG

Place your DAG file in the dags directory. Airflow automatically discovers it.

Toggle the DAG switch in the Airflow UI to activate scheduling.
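
Alternatively, you can unpause it from the command line:

airflow dags unpause sales_etl_pipeline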

Step 7: Add Task Dependencies

For complex pipelines, split extract, transform, and load into separate tasks and chain them with Airflow's bit-shift dependency syntax:

extract_task >> transform_task >> load_task
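
A minimal sketch of that split, assuming the intermediate data is staged as files at hypothetical /tmp paths. Airflow tasks can run in separate processes or even on separate workers, so they can't hand a DataFrame to each other in memory; staging files (or, for small payloads, Airflow's built-in XCom) bridge the gap:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import pandas as pd

# Hypothetical staging paths; adjust for your environment.
RAW_PATH = "/tmp/sales.csv"
STAGED_PATH = "/tmp/staged_sales.csv"
TRANSFORMED_PATH = "/tmp/transformed_sales.csv"
OUTPUT_PATH = "/tmp/processed_sales.csv"

def extract():
    # Copy the raw source data into the staging area.
    pd.read_csv(RAW_PATH).to_csv(STAGED_PATH, index=False)

def transform():
    # Read staged data, derive the total column, write to its own path.
    df = pd.read_csv(STAGED_PATH)
    df["total"] = df["quantity"] * df["price"]
    df.to_csv(TRANSFORMED_PATH, index=False)

def load():
    # Publish the transformed data to its final destination.
    pd.read_csv(TRANSFORMED_PATH).to_csv(OUTPUT_PATH, index=False)

with DAG(
    dag_id="sales_etl_split",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task

Because each task writes to its own file, a failure in transform blocks load while leaving extract's output on disk, and individual tasks stay idempotent and safe to retry.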

Step 8: Monitor and Debug

Airflow provides detailed execution logs, retry mechanisms, and alerting; a configuration sketch follows the list below.

Key features include:

• Automatic retries
• Task-level logs
• SLA monitoring
• Email notifications
• Failure alerts
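
Retries and email alerts are configured through default_args. A sketch, assuming SMTP is configured in airflow.cfg and using a placeholder recipient address:

from datetime import timedelta

default_args = {
    "owner": "airflow",
    "retries": 2,                          # re-run a failed task up to twice
    "retry_delay": timedelta(minutes=5),   # wait five minutes between attempts
    "email": ["alerts@example.com"],       # placeholder recipient
    "email_on_failure": True,              # send mail when a task finally fails
    "email_on_retry": False                # stay quiet on intermediate retries
}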

Step 9: Production Best Practices

To build robust production pipelines (see the Connections sketch after this list):

• Store credentials securely using Airflow Connections
• Use environment variables
• Enable logging
• Implement idempotent ETL logic
• Add data quality checks
• Use a production-grade metadata database
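
For example, rather than hard-coding a database URL in the load step, you could resolve it from an Airflow Connection at runtime. A sketch, assuming a PostgreSQL target, SQLAlchemy installed, and a Connection with the hypothetical ID sales_db created under Admin > Connections:

from airflow.hooks.base import BaseHook
from sqlalchemy import create_engine
import pandas as pd

def load(df: pd.DataFrame) -> None:
    # Credentials live in Airflow's metadata DB, not in your code.
    conn = BaseHook.get_connection("sales_db")  # hypothetical Connection ID
    engine = create_engine(
        f"postgresql://{conn.login}:{conn.password}@{conn.host}:{conn.port}/{conn.schema}"
    )
    # if_exists="replace" keeps the load idempotent across re-runs.
    df.to_sql("processed_sales", engine, if_exists="replace", index=False)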

Step 10: Scale Your Pipeline

As your data platform grows, Airflow can orchestrate (a small dependency sketch follows this list):

• Multiple data sources
• Complex dependencies
• Machine learning workflows
• Data warehouse loads
• Real-time integrations
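
For instance, dependency lists let one task fan out to parallel transforms and fan back in. A sketch with hypothetical task names, each defined as a PythonOperator as in Step 3:

# One extract feeds two independent transforms; the warehouse load
# waits for both to finish before it runs.
extract_task >> [transform_orders, transform_customers] >> load_warehouse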

Conclusion

Apache Airflow transforms standalone Python ETL scripts into fully automated, scheduled, and monitored data pipelines. It handles orchestration, dependency management, retries, and observability, making it an essential tool for modern data engineers.

Once your ETL logic is complete, Airflow becomes the engine that runs it reliably in production. Whether you're processing daily reports or managing enterprise-scale data workflows, mastering Airflow is a critical skill in any data engineering toolkit.
