If you followed my last post, we successfully built an ETL pipeline that fetched data from the News API, cleaned it with pandas, and loaded it into a PostgreSQL database. It felt amazing to watch it run successfully in the terminal.
But what if the News API goes down for 10 minutes? Or what if my laptop is asleep when the script is supposed to run?
In the real world, you can't just sit at your laptop and manually click "Run" on a Python script every day. You need automation, monitoring, and a way to handle failures. That is exactly where Apache Airflow comes in.
The Problem: The "Crashing Script" Nightmare
Before tools like Airflow, developers relied heavily on Cron jobs (a built-in Linux tool used to schedule scripts at specific times). Cron is great for simple things, but it has huge blind spots for data engineering:
- No Dependency Management: If your "Transform" script takes longer than usual, your "Load" script might start running before the data is even ready, causing a massive crash.
- Lack of Visibility: If a script fails at 3 AM, you won't know until you check the logs manually or notice empty tables the next morning.
- No Easy Retries: If a network glitch causes an API call to fail, Cron won't automatically try again 5 minutes later. You have to handle that messy logic yourself in Python.
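To make that last point concrete, here is a minimal sketch of the kind of retry logic you would end up writing by hand around a cron-scheduled script. The names `run_with_retries` and `flaky_fetch` are made up for illustration, and the delay is set to 0 so the demo finishes instantly instead of waiting 5 minutes between attempts.

```python
import time


def run_with_retries(func, max_attempts=3, delay_seconds=300, sleep=time.sleep):
    """Call func(); on failure, wait and try again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # in real code, catch only the errors you expect
            if attempt == max_attempts:
                raise  # out of attempts: let the error surface
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            sleep(delay_seconds)


# Hypothetical stand-in for an API call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network glitch")
    return {"status": "ok"}


result = run_with_retries(flaky_fetch, max_attempts=3, delay_seconds=0)
print(result)
```

Every script that talks to a network would need a wrapper like this, which is exactly the boilerplate an orchestrator is meant to take off your hands.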
Airflow solves all of this by acting as the workflow orchestrator. It doesn't actually store or process your data; instead, it acts as the manager telling your scripts exactly when to run, in what order, and what to do if something breaks.
The Core Concepts Explained
Let’s break down the four most important concepts using a simple analogy: Baking a Cake.
1. The DAG (Directed Acyclic Graph)
Think of a DAG as the entire recipe for your cake.
- Directed: It has a clear starting point and moves in a specific direction (you can't frost the cake before you bake it).
- Acyclic: It cannot go in circles. Step C cannot loop back and trigger Step A; otherwise, your pipeline would run forever.
- Graph: It's just a structural map of how your steps link together.
In data engineering, your DAG is the blueprint of your entire ETL pipeline.
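The "directed" and "acyclic" properties can be sketched with nothing but Python's standard library. This is not Airflow code: it uses the `graphlib` module (Python 3.9+) on a hypothetical three-step pipeline to show how a valid run order falls out of the dependencies, and how a loop makes that order impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its set lists the steps that must finish first.
etl_pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

run_order = list(TopologicalSorter(etl_pipeline).static_order())
print(run_order)  # ['extract', 'transform', 'load']

# Adding a loop (extract now waits on load) means no valid order exists.
broken_pipeline = {**etl_pipeline, "extract": {"load"}}
try:
    list(TopologicalSorter(broken_pipeline).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

Airflow applies the same idea to your tasks: a run order can be worked out only if the graph has no cycles.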
2. Operators
If the DAG is the recipe, Operators are the kitchen appliances. They are the pre-built templates that define what actually gets done. Airflow provides different types of operators:
- PythonOperator: Used to execute a piece of Python code (like our transform_data function).
- PostgresOperator: Used to run SQL queries directly inside a Postgres database.
- BashOperator: Used to run command-line terminal scripts.
3. Tasks
A Task is an operator that has been given specific instructions. It’s a single node inside your DAG. For example, a PythonOperator configured to run extract_data() becomes the "Extract Task".
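The operator-versus-task distinction maps neatly onto classes and instances. The toy class below is only a stand-in so the example runs without Airflow installed. Its `task_id` and `python_callable` arguments mirror the ones Airflow's PythonOperator accepts, but everything else, including the body of `extract_data`, is an assumption for illustration.

```python
class PythonOperatorSketch:
    """Toy stand-in for an operator: a reusable template that runs a callable."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        return self.python_callable()


def extract_data():
    # Placeholder for the real API call from the previous post.
    return [{"title": "example headline"}]


# The operator class is the appliance; this configured instance is the task.
extract_task = PythonOperatorSketch(task_id="extract", python_callable=extract_data)
print(extract_task.task_id, extract_task.execute())
```

The same template could be instantiated again with a different `task_id` and callable to produce a "Transform Task", which is why one operator type can back many tasks.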
4. XComs (Cross-Communications)
In our standalone Python script, passing data was easy: we just returned a value from one function and passed it into the next (cleaned_df = transform_data(raw_data)).
In Airflow, tasks run independently, often in separate processes or even on different machines, so they can't share in-memory variables the way functions in one script can. XComs are like little sticky notes that tasks use to pass small amounts of data or metadata down the line. One task "pushes" a note, and a later task "pulls" it.
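The push/pull pattern can be mimicked in a few lines of plain Python. In this toy model, a dictionary plays the role of the storage Airflow keeps for XComs. In real Airflow the push and pull calls are methods on the task instance. The helpers `xcom_push` and `xcom_pull` and the `article_count` key are hypothetical names chosen for this sketch.

```python
# Toy stand-in for Airflow's XCom storage: values keyed by (task_id, key).
xcom_store = {}


def xcom_push(task_id, key, value):
    xcom_store[(task_id, key)] = value


def xcom_pull(task_id, key):
    return xcom_store.get((task_id, key))


def extract_task():
    # Push a small piece of metadata, not the dataset itself.
    xcom_push("extract", "article_count", 42)


def transform_task():
    # Pull the note left by the upstream task.
    count = xcom_pull("extract", "article_count")
    return f"transforming {count} articles"


extract_task()
message = transform_task()
print(message)  # transforming 42 articles
```

Note that the two task functions never call each other or share a return value. The only link between them is the note in the store, which is why XComs suit small values such as counts, file paths, or IDs rather than whole DataFrames.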
A Quick Peek at the Airflow UI
The absolute best part of Apache Airflow is its user interface. Instead of leaving you staring at text scrolling through a dark terminal window, Airflow gives you a visual dashboard where you can watch your pipelines run in real time.
When a pipeline runs successfully, the tasks turn a satisfying dark green. If a task fails, it turns red, making it incredibly easy to spot exactly where your pipeline broke.
What We’re Doing Next
Now that we understand the foundational pillars of Airflow (DAGs, Operators, Tasks, and XComs) and why we use them over simple cron jobs, it’s time to get our hands dirty.
In my next article, we are going to break down an ETL project into Airflow Tasks and watch it run automatically inside the Airflow UI.
Are you using Airflow, or do you prefer another tool for orchestration?
