Apache Airflow - This is a tool for workflow orchestration: the automated coordination and management of data workflows.
Airflow is important in data engineering because it provides a way to orchestrate, schedule, and monitor workflows and pipelines.
Why use Airflow
Scalability and Flexibility - Airflow supports workflows ranging from small single-script jobs to large pipelines processing very large volumes of data.
- Airflow integrates with many systems: databases, cloud storage, Snowflake, etc.
Scheduling - Airflow has a built-in scheduler to run tasks at specific intervals, and also, it automates repetitive tasks, reducing manual intervention.
Monitoring - Airflow provides an interface to track task execution, progress, successes and failures.
Extensibility - Airflow provides plugins and extensions to connect with various systems, e.g. APIs, AWS, Azure, etc.
Error handling - Airflow makes error handling automated, flexible, and visible. Instead of constantly monitoring jobs yourself, you can configure retries and failure alerts so problems are handled automatically.
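To make the retry idea concrete, here is a minimal pure-Python sketch of what Airflow does conceptually when a task is configured with retries (this is not Airflow's API; `run_with_retries` and `flaky` are hypothetical names for illustration):

```python
import time

def run_with_retries(task, retries=3, retry_delay=0):
    """Sketch of retry behavior: re-run the task on failure,
    up to the retry limit, then surface the final error
    (which in Airflow would mark the task failed and fire alerts)."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: the task is marked failed
            time.sleep(retry_delay)  # wait before the next attempt

# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "done"

print(run_with_retries(flaky))  # succeeds on the third attempt
```

In a real DAG you would set this declaratively, e.g. `retries` and `retry_delay` in the task's arguments, rather than writing the loop yourself.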
Screenshot Documentation
[Screenshot: Airflow UI header with the "Apache Airflow" logo, and the DAGs list showing example DAGs]
In Airflow, a Directed Acyclic Graph (DAG) is a defined set of instructions that tells Airflow what tasks to run and in what order.
A photo example of a DAG in Airflow:
Airflow Scheduler
It is the component of Airflow responsible for deciding when and which tasks should run.
The Scheduler is responsible for triggering DAG runs and managing how many runs execute at any given time.
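Conceptually, the Scheduler compares each DAG's schedule against the clock and triggers a run for every interval that has completed. A simplified pure-Python model of that decision (`due_runs` is a hypothetical name, not Airflow's API):

```python
from datetime import datetime, timedelta

def due_runs(start, interval, now):
    """Return the logical run dates that would have been triggered
    between `start` and `now` for a fixed interval -- a simplified
    model of how the Scheduler decides a DAG run is due."""
    runs = []
    t = start
    while t + interval <= now:  # a run triggers once its interval has closed
        runs.append(t)
        t += interval
    return runs

# A daily DAG that started three days ago has three completed intervals due.
start = datetime(2024, 1, 1)
now = datetime(2024, 1, 4)
print(due_runs(start, timedelta(days=1), now))  # runs for Jan 1, Jan 2, Jan 3
```

The real Scheduler also handles cron expressions, catchup, and concurrency limits, but the core idea is the same: each completed interval produces one DAG run.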
Airflow Webserver
The Airflow Webserver is the component that provides the graphical user interface (GUI) for Airflow. It is the part you interact with in your browser to view, monitor, and manage your DAGs and tasks.
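Assuming Airflow 2.x is installed and `AIRFLOW_HOME` is configured, the Webserver and Scheduler are started with the Airflow CLI (typically in two separate terminals):

```shell
# Initialize the metadata database (first run only)
airflow db init

# Start the webserver (GUI) on the default port 8080
airflow webserver --port 8080

# In a separate terminal, start the scheduler
airflow scheduler
```

With both processes running, the UI is available at http://localhost:8080.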
Below is an image showing both the Webserver and the Scheduler running:
Below is an example of the DAG running in my browser, clearly showing the tasks, the first run, the most recent run, the success or failure of some runs, and how the tasks are scheduled.
The DAG is running on my localhost:8080.