Imagine you have a daily routine that begins with grabbing data from somewhere (extract), then cleaning it up (transform), and finally putting it in a database (load). Doing this by hand every single day gets old fast. Cron jobs work, but they get messy when your tasks need to run in a specific order. That is what Airflow fixes.
What is Apache Airflow
Apache Airflow is an open-source platform for developing, scheduling and monitoring batch-oriented workflows.
Instead of running scripts manually or juggling cron jobs, you write one Python file that describes your whole workflow. Airflow then runs it, tells you what broke and can retry the failed parts automatically.
Workflows as code
Airflow workflows are defined entirely in Python. This "workflows as code" approach brings several advantages:
It enables dynamic DAG generation and parameterization.
The Airflow framework includes a wide range of built-in operators that can be extended to fit your needs.
Airflow leverages the Jinja templating engine, allowing flexible customization (see the sketch below).
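For instance, Airflow can substitute template variables such as {{ ds }} (the run's logical date) into operator arguments at runtime. A minimal sketch, assuming the task lives inside a DAG and using a made-up task_id:
from airflow.operators.bash import BashOperator

# {{ ds }} is a built-in template variable that Airflow replaces with
# the run's logical date (YYYY-MM-DD) when the task executes.
print_date = BashOperator(
    task_id="print_run_date",
    bash_command="echo 'Processing data for {{ ds }}'",
)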
What is a DAG
DAG stands for Directed Acyclic Graph. Let us break that down in simple words:
Directed - Tasks have an order, i.e. Task B waits for Task A to finish.
Acyclic - No going backwards, i.e. you cannot loop back to an earlier task. (This prevents infinite loops.)
Graph - You can see all the connected tasks as a picture.
Important note: In a DAG, once you move forward, you never go back to a previous task. That is what "acyclic" means: no cycles or loops.
A Real-World Example
Here is a daily data pipeline:
Extract Data -> Transform Data -> Load to Database
Basic DAG Code
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    print("Extracting data..")

def transform():
    print("Transforming data..")

def load():
    print("Loading data to database..")

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2026, 5, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
The >> symbol means "run after", i.e. extract_task >> transform_task >> load_task means "extract first, then transform, and finally load."
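There are a couple of other ways to write the same dependencies. The sketch below reuses the tasks from the example and adds two hypothetical branch tasks (clean_task and validate_task) purely for illustration:
# Equivalent ways to say "transform runs after extract" (pick one):
extract_task >> transform_task
extract_task.set_downstream(transform_task)

# A task can also fan out to a list of tasks that run in parallel,
# and a later task can wait for all of them to finish.
extract_task >> [clean_task, validate_task] >> load_task
If you want to try a DAG without waiting for its schedule, the Airflow CLI can run it once, e.g. airflow dags test daily_pipeline 2026-05-12.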
Tasks
Each box in your DAG picture is a task. In the example above, extract_task, transform_task and load_task are all tasks. Airflow runs each one separately.
When a task runs, it has one of these statuses:
queued - Waiting in line to run
skipped - Chose not to run this one
up_for_retry - Failed, but will be retried
running - Doing its job right now
success - Finished with no problems
failed - Something went wrong
Good to know: if your load_task fails, Airflow only retries that task. It doesn't re-run extract_task or transform_task. This saves time.
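Retries are something you opt into per task (or via default_args). A minimal sketch, with example numbers, of how the load task from above could be configured to retry:
from datetime import timedelta
from airflow.operators.python import PythonOperator

load_task = PythonOperator(
    task_id="load",
    python_callable=load,
    retries=3,                         # retry up to 3 times after a failure
    retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
)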
Operators
An operator is like a template that tells Airflow how to do a specific job. Think of it as a worker who already knows one type of job.
Common Built-In Operators
1. PythonOperator - Runs any Python code
from airflow.operators.python import PythonOperator

def say_hello():
    print("hello world!")

task = PythonOperator(
    task_id="run_python",
    python_callable=say_hello,
)
2. BashOperator - Runs shell scripts or bash commands
from airflow.operators.bash import BashOperator

task = BashOperator(
    task_id="run_bash",
    bash_command="echo 'Pipeline daily' && python3 demo/extract.py",
)
3. EmailOperator - Sends emails (great for reports)
from airflow.operators.email import EmailOperator

task = EmailOperator(
    task_id="send_email",
    to="youremail@gmail.com",
    subject="report is ready daily",
    html_content="<p>Successful pipeline..</p>",
)
4. PostgresOperator - Runs SQL on a Postgres database.
from airflow.providers.postgres.operators.postgres import PostgresOperator

task = PostgresOperator(
    task_id="run_sql",
    postgres_conn_id="postgres_connection",
    sql="INSERT INTO reports SELECT * FROM staging WHERE data = 'df';",
)
5. S3ToRedshiftOperator - Copies data from Amazon S3 to Redshift
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

task = S3ToRedshiftOperator(
    task_id="load_to_redshift",
    s3_bucket="data-bucket",
    s3_key="data/lux.csv",
    schema="public",
    table="students",
    copy_options=["CSV", "IGNOREHEADER 1"],
)
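One practical note: the Postgres and Amazon operators above ship in separate provider packages, so in an Airflow 2.x environment you would typically install them first, for example with pip install apache-airflow-providers-postgres apache-airflow-providers-amazon.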
Scheduling
Scheduling is just telling Airflow when and how often to run your DAG.
Two ways to schedule:
1. Shortcuts (easy mode):
- @once - Run one time only.
- @hourly - Every hour.
- @daily - Once a day at midnight.
- @weekly - Once a week.
- @monthly - Once a month.
2. Cron expressions (more control):
- "0 10 * * *" - Every day at 10:00 AM
- "0 10 * * 2" - Every Tuesday at 10:00 AM
- "*/30 * * * *" - Every 30 minutes
- "0 0 3 * *" - Third day of every month at midnight
Cron format: Minute Hour Day Month Day-of-week
Example: Run a pipeline every day at 10 AM
with DAG(
    dag_id="student_pipeline",
    start_date=datetime(2026, 4, 15),
    schedule_interval="0 10 * * *",
    catchup=False,
) as dag:
    # your tasks go here
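A note on catchup=False: it tells Airflow not to create back-fill runs for every interval between start_date and the current date. In most Airflow 2.x setups the default behaviour is to schedule all of those missed runs, so you usually want catchup=False unless historical back-fills are the point.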
Conclusion
Airflow makes it easy to run multiple pipelines. You write your workflow once in Python, and Airflow handles the rest.