The standard
What it is and how we got here
Apache Airflow is an open-source workflow orchestration platform, used to author, schedule, and monitor complex data workflows in a reliable and scalable way.
Fun fact: Apache Airflow was originally developed at Airbnb, yes, the one with the houses and apartments.
In a nutshell, Airflow allows you to define workflows as Directed Acyclic Graphs (DAGs) of tasks, written in Python. Each task might involve data extraction, transformation, loading (ETL), model training, reporting, or any other step in a data pipeline.
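To give a feel for it, here's a minimal sketch of what a DAG file can look like (assuming Airflow 2.x; the DAG id, task name, and printed message are placeholders for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # A trivial task body; in a real pipeline this would be an ETL step,
    # a training job, a report, and so on
    print("Hello from Airflow!")


with DAG(
    dag_id="hello_world",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)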
Ever since its inception, it has only grown in popularity, and it has become a staple for companies and developers in data engineering alike.
In my opinion, it has quite a learning curve, especially in the setup, but once I got past that, I understood the appeal, and here's why:
Python-based – Workflows are defined in Python code, which makes it relatively easy to pick up, even for beginners.
Flexibility – Because it’s Python-based, it integrates easily with existing systems and APIs.
Scalability – Works for everyone from startups and individual developers to enterprises and large corporations.
Debuggability – The UI and logs make debugging pipelines straightforward and really intuitive.
Ecosystem Support – Many cloud providers (AWS MWAA, Google Cloud Composer, Astronomer) offer managed Airflow services.
Proven Track Record – Used by tech giants and enterprises for mission-critical pipelines.
A brief tour
Let's walk through how to set up and run a data pipeline. I'll explain more as we go along.
Pipelines are defined as Directed Acyclic Graphs (DAGs).
So first, Airflow has two sides: the web server UI and the scheduler.
Web server
The web server serves the main GUI, which acts as command central. This is where we can do all sorts of things with the DAGs, like debugging and monitoring them.
Run
airflow webserver
Once the server has started, you can view the GUI at the default port 8080 (http://localhost:8080). It should look something like this:
Scheduler
This is the core of Apache Airflow. It decides what should run and when.
Think of the web server as the dashboard, the speedometer, and the scheduler as the engine that actually makes the car move and drives the dashboard readings.
It parses the DAGs, schedules tasks, manages dependencies, dispatches work to the executor, and handles catchup and backfill.
In a separate terminal, run
airflow scheduler
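To connect that back to the code, the scheduler's decisions come from arguments you set when defining a DAG. A rough sketch with placeholder values (assuming Airflow 2.x, where the argument is schedule_interval; newer versions call it schedule):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="example_schedule",
    start_date=datetime(2024, 1, 1),   # the scheduler won't create runs before this date
    schedule_interval="@daily",        # how often a new run is created
    catchup=False,                     # if True, the scheduler backfills runs for past dates
) as dag:
    ...                                # tasks and their dependencies go here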
A simple DAG
Now, in the previous screenshots, the line "export AIRFLOW_HOME=$(pwd)/airflow" is a simple organisational step.
Airflow automatically creates an airflow folder (in your home directory by default), but the line above tells Airflow to create it in your current directory instead.
In a separate terminal, inside this airflow folder, create a folder named dags. Then create a Python file to write your DAGs with the text editor of your choice.
Then, proceed to write your DAG.
This is a simple DAG for the Extraction step of the ETL process in Python.
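Since the code screenshot isn't reproduced here, below is a rough sketch of what such a fetch-and-load DAG could look like (assuming Airflow 2.x). The API URL, database path, and table schema are placeholders, not the exact code from my project:

from datetime import datetime
import sqlite3

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_URL = "https://api.example.com/users"   # placeholder endpoint
DB_PATH = "/tmp/demo.db"                    # placeholder local SQLite database


def fetch(ti):
    # Extract: pull raw JSON from the API and pass it to the next task via XCom
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    ti.xcom_push(key="raw_data", value=response.json())


def load(ti):
    # Load: define the schema first, then insert the fetched rows
    rows = ti.xcom_pull(task_ids="fetch", key="raw_data")
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(row["id"], row["name"]) for row in rows],
    )
    conn.commit()
    conn.close()


with DAG(
    dag_id="fetch_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch_task = PythonOperator(task_id="fetch", python_callable=fetch)
    load_task = PythonOperator(task_id="load", python_callable=load)

    fetch_task >> load_task   # fetch must finish before load starts

Save the file inside the dags folder and the scheduler will pick it up on its next parse.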
Run your DAG
Now restart your scheduler and web server, and confirm that your DAG is present.
The fetch load DAG is present, and you can click the play button to run your DAG.
Click on your DAG's name for a closer inspection or more options for that specific DAG.
I find the logs to be the most helpful part, especially when debugging.
For instance, in the failed run below, I hadn't properly defined the schema before executing.