Airflow: A Beginner's Guide

#airflow #data

Introduction: I Built a Pipeline. Now What?

In the previous article, Understanding ETL: A Chaotic Introduction, I built a simple pipeline. After running it a few times, another problem surfaced.

How can I run this every day at a specific time to get any new articles or updates?

One solution is Apache Airflow. In this article, we will look at how to run Python scripts on a schedule without manual triggers, explore Airflow, and see why and how data engineers use it.

Why Is Airflow Necessary in Data Engineering?

What Airflow actually is: Not a data processing tool — an orchestrator. It schedules, sequences, and monitors your pipelines. It doesn't do the work; it makes sure the work happens, in the right order, at the right time.

The mental shift: From "I run scripts" to "I define workflows." Your code becomes a DAG (Directed Acyclic Graph) — a blueprint of what should happen and when.

Features that just make sense in Airflow:

A calendar-like schedule.
A UI that shows you what ran, what failed, and why.
Retries built in — if the API fails, Airflow tries again.
Dependencies between tasks — step B only runs if step A succeeds.

How to Run Airflow and Write Your First DAG

There are many ways to run Airflow, but we will use the one with the least friction: a local setup that follows the Quick Start in the official Airflow documentation.

Once the installation succeeds, run airflow standalone in your terminal.

Visit localhost:8080, and you should see the Apache Airflow sign-in page.

Sign in with these credentials:
username: admin
password: printed in the terminal output; you can also find it in the simple_auth_manager_passwords.json.generated file.

From Python Script to DAG

At this point, we will take a Python script and run it on Airflow, essentially transforming it into a set of tasks.

Your existing ETL script does three things: extract, transform, and load. In Airflow terms, each of those becomes a task. A group of tasks with a defined order becomes a DAG (Directed Acyclic Graph). The DAG is the blueprint — it says what runs, in what order, and how often.
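To make "directed" and "acyclic" concrete, here is a minimal sketch in plain Python, with no Airflow involved. It uses the standard library's graphlib module to work out a valid run order for the three steps, and shows that a graph with a cycle cannot be ordered at all. The dictionary layout is my own illustration, not how Airflow stores a DAG internally.

```python
from graphlib import TopologicalSorter, CycleError

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields each task only after all of its upstream tasks.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# A cycle makes the graph impossible to order, which is why a DAG must be acyclic.
cyclic = {"extract": {"load"}, "transform": {"extract"}, "load": {"transform"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

If load depended on extract and extract depended on load, there would be no "first" task to start with, and that is exactly the situation the acyclic rule forbids.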

Writing Your First DAG

Inside the dags/ folder created during setup, make a new file called newsapi_etl.py.

Here's what goes in it, step by step:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

DAG is the blueprint.
PythonOperator lets you run Python functions as tasks.
datetime and timedelta supply the start date and the retry delay.

default_args = {
    'owner': 'abdi',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2025, 1, 1),
}

This tells Airflow: "If a task fails, try again twice, waiting 5 minutes between attempts." The start_date is the earliest date from which runs can be scheduled. There is no need to write retry logic inside your ETL script; Airflow handles it.
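If you are wondering what retries=2 amounts to in practice, the toy sketch below mimics the idea in plain Python: one initial attempt plus up to two more. It is only an illustration of the behavior, not Airflow's actual implementation, and run_with_retries and flaky_extract are hypothetical names I made up for the example.

```python
import time


def run_with_retries(task, retries=2, retry_delay_seconds=0.0):
    """Call task(); on failure, retry up to `retries` more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # Give up once the initial attempt and all retries have failed.
            if attempts > retries:
                raise
            time.sleep(retry_delay_seconds)


calls = {"count": 0}


def flaky_extract():
    # Simulated API call that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("API unavailable")
    return "articles"


result, attempts = run_with_retries(flaky_extract, retries=2)
print(result, attempts)
```

With retries set to 2, the task gets three attempts in total, so an API that recovers by the third call still produces a successful run.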

Here, you drop in the same functions from your ETL script. They don't change:

def extract():
    # Your NewsAPI extraction code
    pass

def transform():
    # Your data cleaning code
    pass

def load():
    # Your SQLite insertion code
    pass
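One thing to keep in mind when filling these in: Airflow tasks generally run as separate processes, so they cannot share in-memory variables the way functions in a single script can. A common workaround is to hand data over through files or a database. The sketch below is one possible shape, assuming hard-coded sample articles instead of a real NewsAPI call; the paths, the articles table, and its columns are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical locations; adjust to your own project layout.
WORK_DIR = Path(tempfile.gettempdir()) / "newsapi_etl_sketch"
RAW_PATH = WORK_DIR / "raw.json"
CLEAN_PATH = WORK_DIR / "clean.json"
DB_PATH = WORK_DIR / "news.db"


def extract():
    # Stand-in for the NewsAPI request: two hard-coded sample articles.
    WORK_DIR.mkdir(parents=True, exist_ok=True)
    articles = [
        {"title": "  Airflow 101 ", "url": "https://example.com/a"},
        {"title": None, "url": "https://example.com/b"},
    ]
    RAW_PATH.write_text(json.dumps(articles))


def transform():
    # Drop articles without a title and trim surrounding whitespace.
    articles = json.loads(RAW_PATH.read_text())
    cleaned = [
        {"title": a["title"].strip(), "url": a["url"]}
        for a in articles
        if a["title"]
    ]
    CLEAN_PATH.write_text(json.dumps(cleaned))


def load():
    # Insert cleaned rows; the URL primary key keeps reruns from duplicating data.
    articles = json.loads(CLEAN_PATH.read_text())
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO articles (url, title) VALUES (:url, :title)",
            articles,
        )


if __name__ == "__main__":
    # Quick local check outside Airflow; the DAG calls these functions itself.
    extract()
    transform()
    load()
```

Because each function reads its input from disk and writes its output back, any one of them can be retried on its own without rerunning the others.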

Define the DAG

with DAG(
    dag_id='newsapi_etl_pipeline',
    default_args=default_args,
    description='Extract news data and load into SQLite',
    schedule='@daily',
    catchup=False,
) as dag:

dag_id: A unique name for your pipeline.
schedule='@daily': Runs once a day at midnight. No cron syntax required. (Airflow releases before 2.4 call this parameter schedule_interval, which Airflow 3 no longer accepts.)
catchup=False: Tells Airflow not to backfill runs for all the schedule intervals between start_date and today when you first turn the DAG on.
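For the curious, a preset like '@daily' is shorthand for a cron expression. The sketch below lists the cron strings behind the common presets and works out when the next daily run would fall. The next_daily_run helper is hypothetical and ignores time zones; it only illustrates the arithmetic, not Airflow's scheduler.

```python
from datetime import datetime, timedelta

# Cron expressions that the common schedule presets stand for.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def next_daily_run(now):
    """Return the first midnight strictly after `now`."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


print(PRESET_TO_CRON["@daily"])
print(next_daily_run(datetime(2025, 1, 1, 15, 30)))
```

When you need something the presets do not cover, such as 6:30 every morning, you can pass a cron string like '30 6 * * *' as the schedule instead.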

Define Tasks and Their Order

    task_extract = PythonOperator(
        task_id='extract_news',
        python_callable=extract,
    )

    task_transform = PythonOperator(
        task_id='transform_news',
        python_callable=transform,
    )

    task_load = PythonOperator(
        task_id='load_to_sqlite',
        python_callable=load,
    )

    # Set the order: extract >> transform >> load
    task_extract >> task_transform >> task_load
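The >> syntax can look like magic at first, but it is ordinary Python operator overloading. Here is a toy sketch of the idea, using a hypothetical ToyTask class rather than a real Airflow operator: a >> b records b as downstream of a and returns b, which is what lets the chain continue.

```python
class ToyTask:
    """Hypothetical stand-in for an Airflow operator, for illustration only."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a, then returns b so that
        # `a >> b >> c` continues the chain from b.
        self.downstream.append(other)
        return other


extract_task = ToyTask("extract_news")
transform_task = ToyTask("transform_news")
load_task = ToyTask("load_to_sqlite")

extract_task >> transform_task >> load_task

print([t.task_id for t in extract_task.downstream])
print([t.task_id for t in transform_task.downstream])
```

In a real DAG, task_extract.set_downstream(task_transform) expresses the same dependency as task_extract >> task_transform; the operator form is just easier to read.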

Once the file is saved in the dags/ folder, Airflow picks it up automatically (give it a minute or two). Refresh the UI at localhost:8080, and you'll see newsapi_etl_pipeline in the DAG list.
Hit the play button (Trigger DAG) in the top right and watch the squares turn from white to green. Your pipeline is now running, not from your terminal, but from an orchestrator that will run it every single day without you, as long as the DAG stays unpaused.