Eric Kahindi

Why you need to learn Apache Airflow - right now

The standard

What it is and how we got here

Apache Airflow is an open-source workflow orchestration platform, used to author, schedule, and monitor complex data workflows in a reliable and scalable way.

Fun fact: Apache Airflow was originally developed at Airbnb, yes, the one with the houses and apartments.

In a nutshell, Airflow allows you to define workflows as Directed Acyclic Graphs (DAGs) of tasks, written in Python. Each task might involve data extraction, transformation, loading (ETL), model training, reporting, or any other step in a data pipeline.
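
To make that concrete, here's a minimal sketch of a two-task DAG. The DAG id, task names, and functions are placeholders I made up for illustration, and it assumes Airflow 2.4 or newer (for the schedule argument):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system
    print("extracting data...")


def transform():
    # Placeholder: clean or reshape the extracted data
    print("transforming data...")


with DAG(
    dag_id="my_first_dag",             # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # only runs when triggered manually
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task     # extract must finish before transform starts

Each node in the graph is a task, and the >> operator declares the dependency between them.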

Ever since its inception, it has only grown in popularity, and it has become a staple for companies and data engineers alike.
In my opinion, there's a bit of a learning curve, especially in the setup, but once you get past that, it's well worth it. Here's why:

  • Python-based – Workflows are defined in Python code, which makes Airflow relatively easy to pick up, even for beginners.

  • Flexibility – Because it’s Python-based, it integrates easily with existing systems and APIs.

  • Scalability – Suitable for everyone from individual developers and startups to enterprises and large corporations.

  • Debuggability – The UI and logs make debugging pipelines straightforward and intuitive.

  • Ecosystem Support – Many cloud providers (AWS MWAA, Google Cloud Composer, Astronomer) offer managed Airflow services.

  • Proven Track Record – Used by tech giants and enterprises for mission-critical pipelines.

A brief tour

Let's walk through how to set up and run a data pipeline. I'll explain more as we go along.
Pipelines are defined as Directed Acyclic Graphs (DAGs).

So first, Airflow has two sides: the web server (the UI) and the scheduler.

Web server

The web server is the main GUI, which acts as command central. This is where we can do all sorts of things with our DAGs, like debugging and monitoring them.
Run

airflow webserver


Once the server is started, you can view the GUI at the default port, 8080. It should look something like this:

Scheduler

This is the core of Apache Airflow. It decides what should be done and when.

Think of the web server as the dashboard (the speedometer) and the scheduler as the engine that actually moves the car, driving the readings on the dashboard.

It parses the DAGs, schedules tasks, manages dependencies, dispatches work, and handles catchup and backfill.
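
Most of that behaviour is driven by a few arguments on the DAG itself. Here's a rough sketch of the knobs the scheduler reads (the DAG id and dates are placeholders, assuming Airflow 2.4 or newer):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="scheduling_example",       # hypothetical name
    start_date=datetime(2024, 1, 1),   # earliest logical date the scheduler considers
    schedule="@daily",                 # create one run per day
    catchup=True,                      # also create runs for every day missed since start_date
) as dag:
    ...  # tasks would go here

With catchup=True, the scheduler backfills a run for every interval it missed; set it to False if you only want runs from now on.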

In a separate terminal, run

airflow scheduler

A simple DAG

Now, in the previous screenshots, the line "export AIRFLOW_HOME=$(pwd)/airflow" is a simple organisational step.
Airflow automatically creates an airflow folder (in your home directory by default), but the line above tells Airflow to create it in your current directory.

In a separate terminal, inside this airflow folder, create a folder named dags. Then create a Python file for your DAGs with the text editor of your choice.

Then, proceed to write your DAG.
This is a simple DAG for the extraction step of the ETL process, written in Python.
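
A minimal sketch of such an extraction DAG might look like the following. I'm using fetch_load as the DAG id only to match the name mentioned below; the API URL, file path, and functions are placeholder assumptions rather than the exact code, and it assumes the requests library is installed and Airflow 2.4 or newer:

import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_data():
    # Placeholder source: pull JSON from a made-up API endpoint
    response = requests.get("https://api.example.com/data", timeout=30)
    response.raise_for_status()
    return response.json()  # the return value is pushed to XCom automatically


def load_data(ti):
    # Pull the fetched payload from XCom and write it to a placeholder destination
    payload = ti.xcom_pull(task_ids="fetch_data")
    with open("/tmp/fetched_data.json", "w") as f:
        json.dump(payload, f)


# Save this file inside the dags/ folder so the scheduler can pick it up
with DAG(
    dag_id="fetch_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    load = PythonOperator(task_id="load_data", python_callable=load_data)

    fetch >> load

The fetch task's return value is handed to the load task through XCom, Airflow's built-in mechanism for passing small pieces of data between tasks.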

Run your DAG

Now restart your scheduler and web server, and confirm that your DAG is present.

The fetch_load DAG is present, and you can click the play button to run it.
Click on your DAG name for closer inspection or more options for that specific DAG.

I find the Logs to be the most helpful part, especially when debugging.
For instance, in the failed run below, I hadn't properly defined the schema before executing.
