Introduction
In data engineering we build data pipelines using approaches such as ETL (extract, transform, load) and ELT (extract, load, transform). These pipelines are typically Python programs made up of individually defined tasks that together form a workflow. A program, however, only runs when you execute it. Data pipelines extract data from sources such as websites and payment systems that continuously record new data, so the data we extract keeps changing and our code has to run again and again to pick it up. How do we do this? This is where Apache Airflow comes in.
To understand more about ETL and ELT, read through my article ETL vs ELT.
What is Apache Airflow?
Apache Airflow is an open-source platform used to schedule, monitor and manage workflows. It was created by Maxime Beauchemin at Airbnb in 2014 with the aim of managing increasingly complex data workflows. It organizes tasks into workflows called DAGs (Directed Acyclic Graphs), where each task runs in a defined order based on its dependencies.
What is a workflow?
This is a sequence of connected tasks that are executed in a specific order to complete a process. In Airflow, workflows define which tasks should run, when they should run and which tasks depend on others.
Extract/collect data --> Transform/clean data --> Load data --> Send alert
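As a sketch, this kind of workflow could be written in Airflow using its TaskFlow API (covered later in this article); the dag_id, dates and task bodies below are placeholders, not a real pipeline:
from datetime import datetime
from airflow.decorators import dag, task

@dag(dag_id="example_workflow", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def example_workflow():
    @task()
    def extract_data():
        return {"rows": 100}           # placeholder extract step

    @task()
    def transform_data(raw):
        return raw                     # placeholder transform step

    @task()
    def load_data(clean):
        print(f"Loaded {clean['rows']} rows")

    @task()
    def send_alert():
        print("Pipeline finished")

    # extract --> transform --> load --> alert
    load_data(transform_data(extract_data())) >> send_alert()

example_workflow()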
Airflow provides a scheduler, task executor and a web interface that help teams manage workflows reliably, track execution, handle failures and scale operations across multiple systems.
Image: Apache Airflow Web UI
Why use Airflow?
As a data engineer, you don't want to be called at midnight to fix a broken data pipeline and, with your Python scripts running as cron jobs, have no idea which one failed, where, or when. Airflow provides orchestration: it arranges tasks so that they run in the right order and at a scheduled time. It records whether a task succeeded or failed and surfaces the error explaining why, so a data engineer can easily manage and troubleshoot their data pipelines.
The following are features that make Airflow useful:
1. Task orchestration: Tasks are arranged so that Airflow knows which one runs first, which comes next and which runs last.
2. Scheduling: Data pipelines are repetitive; new data is constantly being generated at the sources and needs to be collected. With scheduling, this happens automatically at a set time or interval.
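Example: the schedule argument of a DAG accepts presets, cron expressions or a timedelta (the values below are illustrative):
schedule="@daily"                 # run once a day
schedule="0 6 * * 1"              # cron expression: every Monday at 06:00
schedule=timedelta(minutes=30)    # run every 30 minutes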
3. Automated retries: Tasks can fail for temporary reasons such as a network failure, a dropped database connection or API rate limiting. Airflow handles such failures by retrying the task up to the number of retries you set.
Example: telling Airflow to retry a failed task up to three times, waiting four minutes between attempts (timedelta comes from Python's datetime module):
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=4)
}
4. Logging: Airflow attaches isolated logs to every single task execution, enabling engineers to quickly find out when and why a task or pipeline failed. The logs are accessible in the UI by clicking on the failed task.
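Example: ordinary Python logging inside a task ends up in that task's log (a minimal sketch):
import logging
from airflow.decorators import task

logger = logging.getLogger(__name__)

@task()
def extract_gasprices():
    logger.info("Starting extract for state=CA")  # this line shows up in the task's log in the UI
    return "done"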
5. Failure handling: When a task fails, Airflow stops execution of its downstream tasks. This is important because letting downstream tasks run anyway could push bad or corrupt data through the pipeline. Airflow can also be configured to send automated alerts about pipeline failures via channels such as email and Slack.
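Example: email alerts on failure can be switched on through task defaults (a sketch; the address is a placeholder and SMTP has to be configured separately):
default_args = {
    "email": ["data-team@example.com"],   # placeholder address
    "email_on_failure": True,             # send an email when a task fails
    "retries": 3
}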
6. Monitoring: Airflow provides a centralized web interface to monitor the entire data pipeline. From it you can track what is queued, what is running, what succeeded and what failed, and it offers visual representations such as task-duration charts that give insight into the pipeline.
Key Concepts in Airflow
1. DAG (Directed Acyclic Graph)
A Directed Acyclic Graph is a collection of tasks arranged in order of dependency (what runs before what) that never loops back.
Let's break it down:
Directed - The workflow moves in one direction: each process has a starting point and an ending point and does not move backward.
Acyclic - There are no loops, which guarantees that the pipeline eventually ends.
Graph - It is a structure made of points (tasks) and connections (dependencies).
Example: A DAG containing multiple tasks
import http.client
import json
from datetime import datetime, timedelta
from io import StringIO

import pandas as pd
from airflow.decorators import dag, task
from sqlalchemy import create_engine

@dag(
    dag_id='gas_prices_dag',
    start_date=datetime(2022, 4, 26),
    schedule=timedelta(minutes=5),
    catchup=False
)
def gas_prices_dag():
    @task()
    def extract_gasprices():
        # Call the CollectAPI gas price endpoint for California
        conn = http.client.HTTPSConnection("api.collectapi.com")
        headers = {
            'content-type': "application/json",
            'authorization': "apikey <apikey>"  # replace <apikey> with your own key
        }
        conn.request("GET", "/gasPrice/stateUsaPrice?state=CA", headers=headers)
        response = conn.getresponse()
        data = response.read()
        decoded_data = data.decode("utf-8")
        return decoded_data

    @task()
    def transform_gasprices(raw_gas_prices):
        # Parse the raw JSON and keep only the city-level prices
        parsed_data = json.loads(raw_gas_prices)
        cities_data = parsed_data['result']['cities']
        cities_df = pd.DataFrame(cities_data)
        cities_df = cities_df.rename(columns={'name': 'cities'})
        cities_df = cities_df.drop(['lowername'], axis=1)
        json_data = cities_df.to_json(orient='records')
        return json_data

    @task()
    def load_gasprices(clean_gas_data):
        # Append the cleaned records to a Postgres table
        df = pd.read_json(StringIO(clean_gas_data))
        engine = create_engine('postgresql+psycopg2://postgres:12345@localhost:5432/postgres')
        with engine.begin() as conn:
            df.to_sql(name='california_gas_prices', con=conn, if_exists='append', index=False)

    # Define task dependencies
    raw_gas_prices = extract_gasprices()
    clean_gas_data = transform_gasprices(raw_gas_prices)
    load_gasprices(clean_gas_data)

dag = gas_prices_dag()
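In this example, the value returned by extract_gasprices() is passed into transform_gasprices(), and its result into load_gasprices(). With the TaskFlow API used here, these function calls both define the task dependencies and pass the data between tasks (via XCom, covered later in this article).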
2. Task
These are the individual units inside a DAG, each performing a specific piece of work such as extracting data.
Example: a task to extract data
@task()
def extract_gasprices():
    conn = http.client.HTTPSConnection("api.collectapi.com")
    headers = {
        'content-type': "application/json",
        'authorization': "apikey <apikey>"
    }
    conn.request("GET", "/gasPrice/stateUsaPrice?state=CA", headers=headers)
    response = conn.getresponse()
    data = response.read()
    decoded_data = data.decode("utf-8")
    return decoded_data
3. Scheduler
It is the brains of Airflow: it checks the existing DAGs and decides which tasks should run and when. A common issue when using Airflow is DAGs appearing in the UI but never running, which is usually because the scheduler is not running.
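If that happens, confirm that the scheduler process is actually up; it is started with the airflow scheduler command.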
4. Executor
It handles how tasks are actually run, and different Airflow setups use different executors. They include:
LocalExecutor: runs tasks locally on the host machine and is capable of running more than one task at the same time.
SequentialExecutor: runs one task at a time and cannot run tasks in parallel. Used mainly for learning and testing.
CeleryExecutor: the scheduler sends tasks to a queue and workers pick them up and run them. It requires a message broker such as Redis or RabbitMQ. Used in large setups.
KubernetesExecutor: used in cloud environments; runs each task in a separate Kubernetes pod.
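Which executor is used is set through the executor option in the [core] section of airflow.cfg (for example, executor = LocalExecutor).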
5. Worker
It is the process that actually executes the tasks: the scheduler decides what should run, the executor sends the task to a queue, and a worker picks it up and runs it.
Scheduler --> Executor --> Queue --> Worker
This is the flow when using the CeleryExecutor.
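In a Celery setup, the worker processes are typically started on each worker machine with the airflow celery worker command.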
6. XCom (cross-communication)
Cross-Communication allows tasks to pass small pieces of data to each other.
Note: XCom is for passing small pieces of data from one task to another, not large datasets. Moving large datasets through XCom will slow Airflow down.
Example:
Push data from one task using XCom:
kwargs['ti'].xcom_push(key='raw_gas_data', value=decoded_data)
Pull the data in another task using XCom:
raw_gasprices = kwargs['ti'].xcom_pull(key='raw_gas_data', task_ids='extracting')
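With the TaskFlow API used in the DAG example above, the same push and pull happens automatically: returning a value pushes it to XCom, and receiving it as an argument pulls it (a minimal sketch):
from airflow.decorators import task

@task()
def extract():
    return "small piece of data"   # the return value is pushed to XCom automatically

@task()
def transform(value):              # the argument is pulled from XCom automatically
    print(value)

# inside a @dag-decorated function:
transform(extract())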
7. Database
Airflow has an internal database, known as the metadata database, that serves as Airflow's memory. It stores information such as DAGs, DAG runs, task runs, task states, schedules, retries, users and roles.
This enables Airflow to know things such as whether a task succeeded or failed, how many times it has been retried, and so on.
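Out of the box, Airflow uses SQLite for this database, which is fine for learning but does not support running tasks in parallel; production deployments typically point it at PostgreSQL or MySQL instead.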
Conclusion
We have explored what Apache Airflow is, why it is widely used, and the core building blocks that make it effective for workflow orchestration. By automating and managing complex workflows, Airflow has made data pipelines more efficient, reliable, and easier to monitor and maintain.
