Lawrence Murithi

Apache Airflow for Beginners: DAGs, Tasks, Operators, and Scheduling Explained

Introduction

Being a beginner in data engineering can seem very scary. People use technical words like ETL, pipelines, data warehouses, architecture, and orchestration. At that point, it is easy to feel like you need a computer science degree just to understand what they mean. However, most of these terms sound technical but are not as complicated as they seem.
Data engineering, in simple terms, involves extracting data from a source such as a website, a social media page, Excel/CSV files, or a payment system; cleaning it; and storing it somewhere (a database, data warehouse, or data lake). If you need this done once, you can run a simple Python script. However, if the job must run every hour, every day, or every week, you need a tool that can manage it for you. That's where Apache Airflow comes in.

What is Apache Airflow?

To understand Apache Airflow, think about a process like baking a cake. You do not just throw everything into the oven. You follow steps:

  • Buy the ingredients
  • Prepare the dough
  • Put the dough in the oven
  • Bake the cake
  • Let it cool
  • Add frosting
  • Serve the cake

Some steps must happen before others. You cannot frost the cake before baking it. You cannot bake the cake before preparing the dough. You also need to know how long each step should take and what to do if something goes wrong.
This kind of process is called a workflow or pipeline, and Airflow helps you manage that workflow.
NB: Airflow does not usually do the heavy data processing itself but tells other tools when to do the work.
A workflow may be a data pipeline, a machine learning pipeline, a reporting process, or any process made up of several steps.
Example

extract_data >> clean_data >> load_data >> send_email

Apache Airflow is an open-source platform used to schedule, monitor, and manage workflows. It was originally created by Airbnb in 2014 to manage large data workflows. It helps you decide what task should run first, what should follow, what should happen if something fails, and when the whole process should run again.

Airflow as an Orchestrator

Orchestration refers to arranging many tasks so they run in the right order and at the scheduled time. It makes sure that task B does not run before task A has finished. It also records whether each task succeeded or failed. Without orchestration, you end up with many scripts run manually or through separate cron jobs, which becomes difficult to manage as your project grows.

Why Airflow?

While a normal Python script runs fine for simple tasks, you need more control as the number of tasks increases, because data jobs often have many moving parts.

Airflow is useful for several reasons:

1. Scheduling
Since most data work is repetitive, scheduling lets workflows run automatically at the times you define. Airflow handles complex timezone logic natively, ensuring global data pipelines run exactly when they should.
Airflow can also automatically run a pipeline for historical dates through a process called backfilling.
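Below is a minimal sketch of what a scheduled DAG could look like, assuming a recent Airflow 2.x release; the dag_id, dates, and task are made up for illustration, and catchup=True is what triggers backfilling for dates since the start date.

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_report",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # run once every day
    catchup=True,                     # backfill every missed day since start_date
) as dag:
    run_report = EmptyOperator(task_id="run_report")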

2. Task Orchestration
Tasks are arranged so that Airflow knows which task runs first, second, and last.
Example

extract >> transform >> load

This order is critical because if the load task runs before the transform task, the database may receive dirty data. If the transform task runs before extract, there will be no data to clean.
Airflow has parallel execution capabilities to run several tasks simultaneously and wait for all of them to finish before moving to the next step.
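As a small sketch (the task names are hypothetical), a fan-out/fan-in dependency could look like this, where both cleaning tasks run at the same time and load waits for both to finish:

extract >> [clean_sales, clean_inventory] >> load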

3. Monitoring
With standard scripts, knowing whether a job ran successfully requires SSH-ing into a server and digging through log files. Airflow, however, provides a centralized web interface for the entire data ecosystem where you can monitor:

  • Task statuses - color-coded views showing what is running, successful, failed, or queued.
  • Gantt charts - visual representations of task duration, helping you identify bottlenecks in your pipeline.
  • Historical trends - the history of a specific pipeline over time, to spot intermittent failures or slowing performance.

4. Automated Retries
In the real world, tasks can fail for temporary reasons. An API may be rate-limited, a database might briefly drop a connection, or a network hiccup might occur.
Instead of waking up at 3:00 AM to manually restart a failed script, Airflow handles transient errors gracefully by trying the task again based on the number of retries set.
Example

"retries": 3,
"retry_delay": timedelta(minutes=5)

In this scenario, if the task fails, Airflow will wait for 5 minutes before trying again, up to three times.
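As a rough sketch, these settings usually live in a DAG's default_args, which apply to every task in that DAG; the dag_id and dates below are made up for illustration.

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "retries": 3,                          # try a failed task up to three more times
    "retry_delay": timedelta(minutes=5),   # wait five minutes between attempts
}

with DAG(
    dag_id="retry_demo",                   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    default_args=default_args,
) as dag:
    ...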

5. Accessible Logs
Finding out why and when a pipeline breaks is critical. Airflow attaches isolated logs to every single task execution, eliminating the need to hunt through an entire server log file.
A user is also able to click on a failed task directly in the web UI and instantly read the error message for that specific run, reducing debugging time.
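As a small sketch, anything a task logs through Python's standard logging module ends up in that task's own log; the function below is hypothetical.

import logging

logger = logging.getLogger(__name__)

def clean_data():
    # These messages appear in this task's log, viewable by clicking
    # the task instance in the Airflow web UI.
    logger.info("Cleaning step started")
    logger.info("Cleaning step finished")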

6. Failure Handling
When a task fails, letting the rest of the pipeline run can result in corrupt data or crashed databases. Airflow therefore stops execution of the downstream tasks, preventing bad data from moving through the pipeline.
Airflow can also be configured to send an automated email, Slack message, or other alert when a pipeline fails, ensuring the team is instantly aware of critical data outages.
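As a rough sketch, email alerts can be switched on through default_args, assuming SMTP is configured for the Airflow installation; the address below is made up.

default_args = {
    "email": ["data-team@example.com"],   # hypothetical alert address
    "email_on_failure": True,             # email the team when a task fails
    "email_on_retry": False,
}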

7. Clear Pipeline Structure
Airflow workflows are written entirely in Python, so the pipeline configuration is treated like any other software project. Workflows are visible, and anyone can see how tasks connect to each other; a new person joining the team can open the Airflow UI and understand the pipeline flow.
Workflows can be committed to Git, peer-reviewed, and rolled back if a mistake is made.

Important Airflow Terms

Before writing any Airflow code, it's important to understand the main terms used in the Airflow world because they describe the parts of a workflow system.

1. DAG
In Airflow, a full workflow is called a DAG (Directed Acyclic Graph).
Directed - the workflow moves in one direction. The process has a starting point and an ending point and does not move backward.
Acyclic - there are no loops. A workflow must have a clear start and a clear end, so loops are not allowed; they create endless cycles and the pipeline might never finish running.
Graph - a structure made up of points and connections. The points are tasks and the connections are dependencies.
A DAG is, therefore, a workflow made up of tasks arranged in a clear order, showing how they connect with each other.
Example

from datetime import datetime, timedelta
from airflow import DAG

with DAG(
    dag_id="stock_etl_dag",
    start_date=datetime(2026, 4, 20),
    schedule=timedelta(hours=1),
    catchup=False
) as dag:
    ...  # tasks are defined inside this block

2. Task
A task is one step inside a DAG, or one job inside a pipeline. A task should usually do one clear job. Creating one huge task that does everything makes debugging hard, so work should be split into separate tasks.
Example

from airflow.operators.python import PythonOperator

# fetch_stock is a plain Python function defined elsewhere in the DAG file
fetch = PythonOperator(
    task_id="fetch_stock_data",
    python_callable=fetch_stock
)

fetch is the task object, and fetch_stock_data is the task name shown in Airflow.

3. Operator
An operator is the tool used to create and run a task. Different operators are used for different types of jobs. Common operators include the PythonOperator (run a Python function), the BashOperator (run a shell command), and the EmailOperator (send an email).
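For instance, a BashOperator runs a shell command instead of a Python function. A minimal sketch, where the task name and command are made up for illustration:

from airflow.operators.bash import BashOperator

backup = BashOperator(
    task_id="backup_file",
    bash_command="cp /tmp/raw.csv /tmp/raw_backup.csv",
)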

4. Dependency
A dependency defines the order of tasks by telling Airflow which task must run before another task. In simple terms, a dependency is the relationship between tasks.
Example

extract >> transform >> load

This means extract runs first, transform runs after extract succeeds and load runs after transform succeeds.
You can also define parallel dependencies to show which tasks should run simultaneously.
Example

download >> [clean_data, backup_data] >> send_email

This means download runs first, clean_data and backup_data run after download, and send_email executes after both clean_data and backup_data finish.
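The >> arrows are a shorthand; the same dependencies can also be declared with method calls, which you may see in other people's DAGs:

extract >> transform                  # bit-shift shorthand
extract.set_downstream(transform)     # same dependency, method form
transform.set_upstream(extract)       # same edge, declared from the other side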

5. Scheduler
The scheduler is the brain of Airflow: it checks the DAGs and decides which tasks should run and when.
If the scheduler is not running, DAGs may appear in the UI but tasks may stay queued or show no status.
The scheduler constantly checks:

  • which DAGs exist
  • whether a DAG is due to run
  • whether a task’s upstream tasks have succeeded
  • whether a task should be queued
  • whether a failed task should retry
  • whether a DAG run is complete

The scheduler does not usually execute the task itself but decides which task is ready and sends it to the executor.

6. Executor
The executor is the part of Airflow that decides how tasks are actually run. Different Airflow setups use different executors.
Common executors include:
SequentialExecutor - This runs one task at a time, so it cannot run tasks in parallel. It is simple and often used for learning or testing.

LocalExecutor - This runs tasks locally on the same machine, and it can run more than one task at the same time. It's useful when Airflow is installed on one server and you want tasks to run on that server.

CeleryExecutor - This is used for larger setups. The scheduler sends tasks to a queue, and workers pick them up and run them. This setup usually needs a message broker such as Redis or RabbitMQ.

KubernetesExecutor - This runs each task in a separate Kubernetes pod. It's more advanced and usually used in cloud or production environments.

NB: The scheduler decides that a task should run, while the executor determines how Airflow actually runs it.

7. Worker
A worker is the process that actually executes tasks. This term is particularly important when using CeleryExecutor.
In a Celery setup, the flow looks like:

Scheduler >> Queue >> Worker >> Task runs

The scheduler decides the task is ready, the executor sends the task to a queue and the worker picks it up and runs it.
NB: The scheduler decides what should run while the worker does the actual execution.

8. XCom
XCom is short for cross-communication. It allows tasks to pass small pieces of data to each other.
XCom is for passing small messages between tasks, not for moving large datasets. Passing large datasets through XCom slows down Airflow and fills up the metadata database.

In PythonOperator, you can push data to XCom:

kwargs["ti"].xcom_push(key="raw_data", value=data)

Then another task can pull it:

data = kwargs["ti"].xcom_pull(task_ids="extract", key="raw_data")

For large data, the usual pattern is to save it somewhere else and pass only its location through XCom.
Example

  • The extract task saves data to /tmp/raw_stock_data.csv
  • XCom passes "/tmp/raw_stock_data.csv"
  • The transform task reads the file
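
Here is a minimal sketch of that pattern using two PythonOperator tasks; the file path, task names, and function bodies are made up for illustration.

from airflow.operators.python import PythonOperator

def extract(**kwargs):
    path = "/tmp/raw_stock_data.csv"
    # ... download the raw data and write it to `path` ...
    kwargs["ti"].xcom_push(key="raw_path", value=path)

def transform(**kwargs):
    path = kwargs["ti"].xcom_pull(task_ids="extract", key="raw_path")
    # ... read the file at `path` and clean it ...

extract_task = PythonOperator(task_id="extract", python_callable=extract)
transform_task = PythonOperator(task_id="transform", python_callable=transform)
extract_task >> transform_task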

9. Metadata Database
The metadata database is Airflow's internal database, which Airflow uses to remember what happened (it records the results of every DAG run).
It stores information such as:

  • DAGs
  • DAG runs
  • Task runs
  • Task states
  • Schedules
  • Retries
  • Users
  • Roles
  • Variables
  • Connections
  • XCom values

This database is very important because it acts as Airflow's memory.
For example, Airflow needs to know:

  • Did this task succeed?
  • Did this task fail?
  • How many times has it retried?
  • When did the DAG last run?
  • What logs belong to this task?
  • What DAGs exist?
  • Which users can log in?
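
Putting these terms together, here is a minimal sketch of a complete DAG with three tasks and their dependencies; the dag_id, schedule, and function bodies are made up for illustration.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from an API or a file

def transform():
    ...  # clean the extracted data

def load():
    ...  # write the cleaned data to a database

with DAG(
    dag_id="simple_etl",                  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task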

Conclusion

Apache Airflow may look difficult when you first encounter it because it comes with a lot of technical jargon, making data engineering feel more complicated than it really is. Airflow is simply a workflow manager that helps you organise work that must happen in a specific order. Apache Airflow is about control: it helps you control timing, order, failure, retries, logs, and monitoring.
Just like baking a cake, you must follow the right sequence. A data pipeline works the same way. You extract data, transform it, load it, check it, and sometimes send a notification. Each step depends on the previous one. Airflow gives you a clean way to define these steps and make sure they run correctly.
