Imagine you have a daily routine that begins with grabbing data from somewhere (extract), then cleaning it up (transform), and finally putting it in a database (load). Doing this by hand every single day gets old fast. Cron jobs work, but they get messy when your tasks need to run in a specific order. That is what Airflow fixes.
What is Apache Airflow
Apache Airflow is an open-source platform for developing, scheduling and monitoring batch-oriented workflows.
Instead of running scripts manually or juggling cron jobs, you write one Python file that describes your whole workflow. Airflow then runs it, tells you what broke and can retry the failed parts automatically.
Workflows as code
Airflow workflows are defined entirely in Python. This "workflows as code" approach brings several advantages:
It enables dynamic DAG generation and parameterization.
The Airflow framework includes a wide range of built-in operators that can be extended to fit your needs.
Airflow leverages the Jinja templating engine, allowing flexible customization (see the sketch below).
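For instance, Airflow can substitute template variables such as {{ ds }} (the run's logical date) into operator arguments at runtime. A minimal sketch, assuming the task lives inside a DAG and using a made-up task_id:
from airflow.operators.bash import BashOperator

# {{ ds }} is a built-in template variable that Airflow replaces with
# the run's logical date (YYYY-MM-DD) when the task executes.
print_date = BashOperator(
    task_id="print_run_date",
    bash_command="echo 'Processing data for {{ ds }}'",
)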
What is a DAG
DAG stands for Directed Acyclic Graph. Let us break that down in simple words:
Directed - Tasks have an order, i.e. Task B waits for Task A to finish.
Acyclic - No going backwards, i.e. you cannot loop back to an earlier task. (This prevents infinite loops.)
Graph - You can see all the connected tasks as a picture.
Important note: In a DAG, once you move forward, you never go back to a previous task. That is what "acyclic" means: no cycles or loops.
A Real-World Example
Here is a daily data pipeline:
Extract Data -> Transform Data -> Load to Database
Basic DAG Code
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    print("Extracting data..")

def transform():
    print("Transforming data..")

def load():
    print("Loading data to database..")

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2026, 5, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
The >> symbol means "run after", i.e. extract_task >> transform_task >> load_task means "extract first, then transform, and finally load."
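There are a couple of other ways to write the same dependencies. The sketch below reuses the tasks from the example and adds two hypothetical branch tasks (clean_task and validate_task) purely for illustration:
# Equivalent ways to say "transform runs after extract" (pick one):
extract_task >> transform_task
extract_task.set_downstream(transform_task)

# A task can also fan out to a list of tasks that run in parallel,
# and a later task can wait for all of them to finish.
extract_task >> [clean_task, validate_task] >> load_task
If you want to try a DAG without waiting for its schedule, the Airflow CLI can run it once, e.g. airflow dags test daily_pipeline 2026-05-12.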
Tasks
Each box in your DAG picture is a task. In the example above, extract_task, transform_task and load_task are all tasks. Airflow runs each one separately.
When a task runs, it has one of these statuses:
queued - Waiting in line to run
skipped - Chose not to run this one
up_for_retry - Failed, but will be retried
running - Doing its job right now
success - Finished with no problems
failed - Something went wrong
Good to know: if your load_task fails, Airflow only retries that task. It doesn't re-run extract_task or transform_task. This saves time.
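Retries are something you opt into per task (or via default_args). A minimal sketch, with example numbers, of how the load task from above could be configured to retry:
from datetime import timedelta
from airflow.operators.python import PythonOperator

load_task = PythonOperator(
    task_id="load",
    python_callable=load,
    retries=3,                         # retry up to 3 times after a failure
    retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
)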
Operators
An operator is like a template that tells Airflow how to do a specific job. Think of it as a worker who already knows one type of job.
Common Built-In Operators
1. PythonOperator - Runs any Python code
from airflow.operators.python import PythonOperator

def say_hello():
    print("hello world!")

task = PythonOperator(
    task_id="run_python",
    python_callable=say_hello,
)
2. BashOperator - Runs shell scripts or bash commands
from airflow.operators.bash import BashOperator

task = BashOperator(
    task_id="run_bash",
    bash_command="echo 'Pipeline daily' && python3 demo/extract.py",
)
3. EmailOperator - Sends emails (great for reports)
from airflow.operators.email import EmailOperator

task = EmailOperator(
    task_id="send_email",
    to="youremail@gmail.com",
    subject="report is ready daily",
    html_content="<p>Successful pipeline..</p>",
)
4. PostgresOperator - Runs SQL on a Postgres database.
from airflow.providers.postgres.operators.postgres import PostgresOperator

task = PostgresOperator(
    task_id="run_sql",
    postgres_conn_id="postgres_connection",
    sql="INSERT INTO reports SELECT * FROM staging WHERE data = 'df';",
)
5. S3ToRedshiftOperator - Copies data from Amazon S3 to Redshift
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

task = S3ToRedshiftOperator(
    task_id="load_to_redshift",
    s3_bucket="data-bucket",
    s3_key="data/lux.csv",
    schema="public",
    table="students",
    copy_options=["CSV", "IGNOREHEADER 1"],
)
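One practical note: the Postgres and Amazon operators above ship in separate provider packages, so in an Airflow 2.x environment you would typically install them first, for example with pip install apache-airflow-providers-postgres apache-airflow-providers-amazon.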
Scheduling
Scheduling is just telling Airflow when and how often to run your DAG.
Two ways to schedule:
1. Shortcuts (easy mode):
- @once - Run one time only.
- @hourly - Every hour.
- @daily - Once a day at midnight.
- @weekly - Once a week.
- @monthly - Once a month.
2. Cron expressions (more control):
- "0 10 * * *" - Every day at 10:00 AM
- "0 10 * * 2" - Every Tuesday at 10:00 AM
- "*/30 * * * *" - Every 30 minutes
- "0 0 3 * *" - Third day of every month at midnight
Cron format: Minute Hour Day Month Day-of-week
Example: Run a pipeline every day at 10 AM
with DAG(
    dag_id="student_pipeline",
    start_date=datetime(2026, 4, 15),
    schedule_interval="0 10 * * *",
    catchup=False,
) as dag:
    # your tasks go here
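A note on catchup=False: it tells Airflow not to create back-fill runs for every interval between start_date and the current date. In most Airflow 2.x setups the default behaviour is to schedule all of those missed runs, so you usually want catchup=False unless historical back-fills are the point.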
Conclusion
Airflow makes it easy to run multiple pipelines. You write your workflow once in Python, and Airflow handles the rest.