INTRODUCTION
When I first started learning Apache Airflow, I kept seeing terms like DAGs, tasks, operators, dependencies and scheduling. Every tutorial explained them separately, but I didn't really understand how they worked together until I wrote my first DAG.
In this article, I'll explain what a DAG is, why it's important and how I created my first one.
WHAT IS APACHE AIRFLOW?
Apache Airflow is an open-source workflow orchestration tool that lets you automate and schedule data workflows instead of running scripts manually.
Imagine you have a data pipeline that:
- Extracts data from an API
- Cleans the data
- Loads it into a PostgreSQL database
Rather than running these steps yourself every day, Airflow can do it automatically.
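To make that concrete, here is a minimal sketch of those three steps as a single script you would otherwise run by hand. The sample records, the function names and the people table are made up for illustration, and an in-memory SQLite database stands in for PostgreSQL so the snippet runs without any setup.

```python
import sqlite3


def extract():
    # Stand-in for an API call: hard-coded sample records (hypothetical data).
    return [{"name": " Alice ", "age": "30"}, {"name": "Bob", "age": None}]


def clean(records):
    # Drop rows with a missing age, trim whitespace and convert age to int.
    return [
        {"name": r["name"].strip(), "age": int(r["age"])}
        for r in records
        if r["age"] is not None
    ]


def load(records, conn):
    # SQLite stands in for PostgreSQL here.
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (:name, :age)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(clean(extract()), conn)
rows = conn.execute("SELECT name, age FROM people").fetchall()
print(rows)
```

Someone has to remember to run this script every day and check that it worked. That chore is what Airflow takes over.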
WHAT IS A DAG?
A DAG (Directed Acyclic Graph) is the blueprint of your workflow. It tells Airflow what tasks exist, when they should run and in what order they should execute.
- Directed – tasks follow a specific order
- Acyclic – tasks never loop back to previous tasks
- Graph – tasks are connected together
For Example:
Task 1: Extract
|
Task 2: Transform
|
Task 3: Load
The three tasks link together to form a DAG.
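The "directed" and "acyclic" parts can be checked in plain Python, without Airflow. The sketch below uses the standard library's graphlib module (Python 3.9 or later) to work out a valid run order for the three tasks, then shows that adding a loop makes ordering impossible. The dictionary layout is just my way of modelling the graph, not how Airflow stores it.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks that must finish before it.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Making extract wait for load creates a loop, so no valid order exists.
looped = dict(pipeline, extract={"load"})
try:
    list(TopologicalSorter(looped).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```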
WRITING MY FIRST DAG
1. Import Required Modules
A basic Airflow DAG begins by importing the necessary libraries.
from airflow import DAG
from datetime import datetime
from airflow.operators.python import PythonOperator
datetime - allows us to specify the date when Airflow can start scheduling the workflow
PythonOperator - an operator that tells Airflow to execute a Python function as a task
2. Create Python Functions
These functions contain the work each task will do. They are only needed for operators that run Python code, such as the PythonOperator.
def extract():
    print("Extracting data...")

def transform():
    print("Transforming data...")

def load():
    print("Loading data...")
Airflow creates tasks that call these functions.
3. Create the DAG
The code block below defines the workflow:
with DAG(
    dag_id="first_dag",
    start_date=datetime(2026, 6, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ...  # placeholder body; the tasks from step 4 go here
dag_id - the workflow's name.
start_date - the earliest date from which Airflow is allowed to create scheduled runs for the DAG.
schedule - how often the workflow runs.
Ways you can define schedules:
- Using built-in presets such as @hourly, @daily, @weekly, @monthly, @yearly, e.g.
schedule="@daily"
This runs the DAG once per day.
- Using timedelta to specify an interval, e.g.
from datetime import timedelta
schedule=timedelta(minutes=5)
This schedules the DAG every five minutes.
- Using cron expressions for precise scheduling. They consist of five fields: Minute Hour Day Month Weekday, e.g.
schedule='0 9 * * *'
This schedules the DAG to run every day at 9:00 AM.
- Using None if you don't want the DAG to run automatically. The DAG will only be triggered manually or via the API, e.g.
schedule=None
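To get a feel for two of these forms, the sketch below uses only the standard library. describe_cron is a hypothetical helper I wrote for illustration (it is not part of Airflow) that labels the five cron fields, and the timedelta part lists the first few run times for a five-minute interval starting from an assumed first run.

```python
from datetime import datetime, timedelta


def describe_cron(expression):
    # A standard cron expression has exactly five space-separated fields.
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    fields = expression.split()
    if len(fields) != len(names):
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(names, fields))


fields = describe_cron("0 9 * * *")
print(fields)

# A timedelta schedule is a fixed gap between consecutive runs.
first_run = datetime(2026, 6, 1, 0, 0)  # assumed starting point
next_runs = [first_run + i * timedelta(minutes=5) for i in range(3)]
print(next_runs)
```

Reading the cron fields by name makes it easier to see that '0 9 * * *' means minute 0 of hour 9, on any day, month and weekday.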
catchup - controls whether Airflow creates runs for schedule intervals that passed before the DAG was switched on.
For example:
Your DAG has a start date of June 1st, but you don't turn on the Airflow scheduler until June 10th.
If catchup=True - Airflow will attempt to create DAG runs for every missed interval between June 1st and June 10th.
If catchup=False - Airflow skips the historical runs and schedules only the latest one.
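The June example can be modelled in a few lines of plain Python. This is a simplified sketch, not Airflow's scheduler logic: missed_intervals is a hypothetical helper, the switch-on moment is assumed to be midnight on June 10th, and the real scheduler also deals with time zones and partial days, so treat the counts as illustrative.

```python
from datetime import datetime, timedelta


def missed_intervals(start_date, now, step=timedelta(days=1)):
    # List every (start, end) interval that has fully finished by `now`.
    intervals = []
    start = start_date
    while start + step <= now:
        intervals.append((start, start + step))
        start += step
    return intervals


start_date = datetime(2026, 6, 1)
now = datetime(2026, 6, 10)  # assumed moment the DAG is switched on

all_missed = missed_intervals(start_date, now)
with_catchup = all_missed          # catchup=True: one run per missed interval
without_catchup = all_missed[-1:]  # catchup=False: only the most recent one

print(len(with_catchup))
print(without_catchup)
```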
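4. Create Tasks
Tasks are created inside the with DAG(...) block from step 3, which is why they are indented. This is how Airflow knows they belong to that DAG.
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
    )
    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
    )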
PythonOperator - tells Airflow that the task should execute Python code.
Operators tell Airflow what kind of work a task performs. You choose the operator based on the type of work and how it is written.
Types of operators:
- PythonOperator - executes Python functions
- BashOperator - executes shell commands
- SQL operators - execute database queries
- EmailOperator - sends an email
- DockerOperator - executes tasks inside a Docker container
task_id - a unique identifier for the task within the DAG, used to display task status in the UI and logs
python_callable - tells Airflow which Python function to execute when the task runs
5. Task Dependencies
Dependencies tell Airflow the order in which the tasks should execute.
extract_task >> transform_task >> load_task
The >> operator means 'run this task before the next one'. By default, if one task fails, the tasks that depend on it do not run.
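If the >> syntax looks like magic, it helped me to see that Python lets any class define what >> does. The toy Task class below is my own stand-in, not Airflow's actual code. It shows how a chain like the one above can record "who runs after whom" by returning the right-hand task from each >> step.

```python
class Task:
    """Toy stand-in for an Airflow task (not the real implementation)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records b as downstream of a, then returns b so chains keep going.
        self.downstream.append(other)
        return other


extract_task = Task("extract")
transform_task = Task("transform")
load_task = Task("load")

extract_task >> transform_task >> load_task

print([t.task_id for t in extract_task.downstream])
print([t.task_id for t in transform_task.downstream])
```

Python evaluates the chain from left to right, so the first >> links extract to transform and hands back transform, which the second >> then links to load.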
A SIMPLE AIRFLOW DAG

This DAG defines a simple workflow, schedules it to run every five minutes, creates two tasks using the PythonOperator and specifies the order in which those tasks should execute.
WHY DAGs ARE USEFUL
They allow one to:
- Automate repetitive work
- Schedule pipelines
- Retry failed tasks
- Monitor workflow progress
- View execution history
CONCLUSION
Creating my first DAG helped me transform my scripts into an automated workflow and gave me a foundation for building more advanced pipelines. With these fundamentals, you're ready to move beyond simple examples and start building practical workflows.