Abdi Omari

Posted on Jul 4

Data management patterns in Apache Airflow

#apacheairflow #data #python

What is Airflow?

Apache Airflow is an open-source orchestration tool used by data engineers for developing, scheduling, and monitoring batch-oriented workflows.

How it Works

Workflow as code: Airflow workflows are defined entirely in Python, resulting in what is commonly referred to as "Workflow as code".
DAGs: Directed Acyclic Graphs in Airflow are used to organise tasks in a specific order with clear dependencies, without looping back on themselves.
Automation: Airflow handles scheduling, automated retries if a task fails, and sends alerts if a pipeline breaks.
Web UI: It features a web-based UI to visualize, manage, and debug workflows.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_hello():
    print("hello")

with DAG(
    "simple_hello",
    start_date=datetime(2026, 6, 17),
    schedule=timedelta(minutes=5),
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="print_hello",
        python_callable=print_hello,
    )

    hello_task

How Airflow is used with Python for Data Engineering

Airflow is the standard tool for data engineers for batch-oriented workflows. Ideal when you need tasks to run on a specific schedule, rely on the successful completion of previous tasks and require robust handling.

Airflow can be integrated with Databases like Snowflake, MySQL, and Postgres, Cloud providers like AWS and Google Cloud for more complex workflows.

For a workflow to execute as intended through various tasks and for tasks that depend on data from previous tasks, data has to be handled correctly by Airflow to avoid breaking the pipeline.

How data moves between tasks

Airflow gives you several ways to pass data between tasks. Picking the right one matters, the wrong choice will slow you down.

1. XCom (cross-communication)

XCom is Airflow's built-in way for tasks to talk to each other. By default, it stores data directly in Airflow's metadata database, which makes it fast to set up but a poor fit for anything large. Think status flags and file paths, not dataframes.

Implicit XCom. When a task's Python callable returns a value, Airflow automatically stores it under the return_value key, no extra code needed.

def extract():
    return {"price": 300, "city": "Mombasa"}

Any downstream task can pull that value with xcom_pull(task_ids="extract").

Explicit XCom. Sometimes one return value isn't enough, or you want a custom key instead of return_value. That's what ti.xcom_push() and ti.xcom_pull() are for:

def extract(**context):
    context["ti"].xcom_push(key="raw_price", value=3.42)

def transform(**context):
    price = context["ti"].xcom_pull(key="raw_price", task_ids="extract")

TaskFlow API. The modern way to write Airflow DAGs. Decorate your functions with @task, and Airflow wires up the XCom push and pull for you based on function arguments and return values:

from airflow.decorators import task

@task
def extract():
    return {"price": 3.42}

@task
def transform(data):
    return data["price"] * 1.1

transform(extract())

No manual xcom_push or xcom_pull calls. It reads like normal Python, and it's the approach most new DAGs should start with.

One catch that isn't obvious at first: XCom data has to be JSON-serializable. Return a pandas DataFrame straight from a task and it'll fail silently or throw an unhelpful error. Convert it to df.to_dict("records") first, and if you're passing bytes, decode them to a string before pushing.

2. External object storage

For anything too big for the metadata database, don't push the data itself through XCom, push a pointer to it. One task uploads a file to S3 or GCS and returns the URI; the next task downloads it from that URI. The same pattern works with a shared NFS mount or a Kubernetes persistent volume if everything's running on the same cluster. XCom just carries the address, not the payload.

3. Database state changes

Sometimes the cleanest handoff isn't through Airflow at all, it's through the database. Task A runs SQL that creates a temporary table; task B runs SQL that reads from it. Or task A updates rows in an operational table, and task B picks up the new state on its next read. Airflow's job here is purely orchestration, it triggers the SQL, but the transformation happens entirely inside the warehouse.

4. Airflow global storage

Not every piece of shared data belongs to a single task run. Airflow Variables are key-value pairs stored in the metadata database and readable by any task in any DAG, useful for config values that don't change often. Connections and Hooks work the same way for credentials, defined once in the UI or an environment variable, then injected into tasks at runtime instead of hardcoded in your scripts.

5. Custom XCom backends

If you outgrow the default metadata-database backend but still want XCom's syntax, you can configure a custom backend that intercepts XCom writes and redirects them to S3, GCS, or wherever you'd rather store them. Your DAG code stays the same, xcom_push and xcom_pull work exactly as before, only the storage location changes underneath.

6. SubDAGs and TriggerDagRunOperator

For splitting work across DAGs rather than within one, SubDAGs let you nest a DAG inside a task of a parent DAG, though they've fallen out of favor because they tend to create scheduler bottlenecks. TriggerDagRunOperator is the more common approach now: a task in one DAG triggers a run of a completely separate DAG, optionally passing a config dict along with it. It's a cleaner way to decouple pipelines that still need to hand off to each other.

Most XCom bugs aren't about Airflow being complicated, they're about a wrong assumption slipping through quietly. A typo in task_ids. A DataFrame that isn't JSON-serializable. An os.getenv("USER") call that collides with your actual OS environment variable instead of the value you meant to set. None of these throw a loud error. They just make a downstream task pull None and fail somewhere that looks unrelated to the real cause.

DEV Community