Kriss

Posted on • Originally published at deadmancheck.io

How to monitor Apache Airflow DAGs so you know when they silently fail

Your Airflow DAG ran last night. All tasks: green. All durations: normal. Export job completed at 02:14.

Zero rows exported. Nobody knows.

This is the silent failure Airflow's built-in alerting doesn't catch. on_failure_callback fires when a task crashes. It doesn't fire when a task exits 0 after connecting to a stale database replica and processing nothing. That's the failure mode that eats your Monday morning.

This article shows you two ways to add external monitoring to Airflow DAGs — so you get paged for both kinds of failures.


Why Airflow's built-in alerts aren't enough

Airflow gives you several callback hooks:

  • on_failure_callback — task or DAG run failed
  • on_success_callback — task or DAG run succeeded
  • on_retry_callback — task queued for retry
  • on_execute_callback — task about to start
  • on_skipped_callback — task raised AirflowSkipException

These are useful. But none of them answer the question that actually matters for data pipelines: did the job do something?

Your export DAG catches a database timeout, logs it, and exits cleanly. Airflow marks it green. No callbacks fire. The data never lands.

You need an independent check — something outside Airflow that asks "did this DAG complete, and did it report non-zero output?" every time the schedule fires.


The approach: dead man's switch + output assertions

A dead man's switch monitor works like this:

  1. You set up a monitor with an expected interval — say, "this DAG should report in every 24 hours"
  2. Your DAG pings the monitor when it completes
  3. If the monitor doesn't hear from the DAG within the window, it alerts you

This catches missed runs, paused DAGs, scheduler issues, and slow drift.

But the more powerful feature is output assertions: you pass a count with your ping, and the monitor alerts if count is 0 — even when the job completed and pinged successfully.
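Put together, the monitor-side decision is simple enough to sketch in a few lines of Python. This is a hypothetical model of the logic, not DeadManCheck's actual implementation:

```python
from datetime import datetime, timedelta

def should_alert(last_ping, last_count, expected_interval, now):
    """Hypothetical dead man's switch logic: alert on a missed ping,
    or on a ping that reported zero output."""
    if last_ping is None or now - last_ping > expected_interval:
        return "missed ping"   # DAG went silent
    if last_count == 0:
        return "zero output"   # DAG ran, pinged, but did nothing
    return None                # healthy

now = datetime(2026, 1, 2, 3, 0)
window = timedelta(hours=25)

# Pinged on time with rows exported: healthy
assert should_alert(datetime(2026, 1, 2, 2, 14), 1250, window, now) is None
# Pinged on time but exported nothing: alert anyway
assert should_alert(datetime(2026, 1, 2, 2, 14), 0, window, now) == "zero output"
# Never pinged: alert
assert should_alert(None, None, window, now) == "missed ping"
```

The second assertion is the case Airflow's callbacks can't see: the run completed and pinged, but the count exposes that it did nothing.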

I'll use DeadManCheck for the examples. It's the only cron monitoring tool that supports output assertions, and it has a free tier for up to 5 monitors.


Option 1: DAG-level callback (cleanest approach)

If you want to monitor the whole DAG run — not individual tasks — use on_success_callback and on_failure_callback at the DAG level.

# airflow/dags/daily_export.py
# Airflow 2.x imports. For Airflow 3.x use:
#   from airflow.sdk import DAG
#   from airflow.providers.standard.operators.python import PythonOperator

import requests
import os
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

DEADMANCHECK_TOKEN = os.environ.get("DEADMANCHECK_TOKEN")
BASE_URL = f"https://deadmancheck.io/ping/{DEADMANCHECK_TOKEN}"


def ping_start(context):
    """Signal that the DAG has started — enables duration monitoring."""
    try:
        requests.get(f"{BASE_URL}/start", timeout=5)
    except Exception:
        pass  # never let monitoring break the job


def ping_success(context):
    """Signal success. Pull row count from XCom for output assertion."""
    try:
        rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
        requests.get(BASE_URL, params={"count": rows}, timeout=5)
    except Exception:
        pass


def ping_failure(context):
    """Signal explicit failure."""
    try:
        requests.get(f"{BASE_URL}/fail", timeout=5)
    except Exception:
        pass


with DAG(
    dag_id="daily_export",
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
) as dag:

    def export_data(**context):
        rows = run_export()  # your existing export logic; returns a row count
        # Push count to XCom so ping_success can read it
        context["ti"].xcom_push(key="rows_exported", value=rows)
        return rows

    export_task = PythonOperator(
        task_id="export_data",
        python_callable=export_data,
        on_execute_callback=ping_start,  # task-level in 2.x; fires when this task starts
    )

A few things to note:

Wrap every ping in try/except. The monitoring call must never fail the DAG. If DeadManCheck is unreachable, your pipeline keeps running.

Push row count via XCom. The success callback receives the context object, which includes a TaskInstance. Use xcom_pull to retrieve the count from the last task.

Set on_execute_callback for duration monitoring. In Airflow 2.x this is a task-level callback, so it lives on the first task rather than the DAG itself. It sends the /start signal before that task runs. DeadManCheck then tracks how long each run takes and alerts when a run is significantly longer than the rolling average.
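You can sanity-check the XCom fallback in ping_success without a running scheduler by using a stand-in TaskInstance. FakeTI here is hypothetical test scaffolding, not an Airflow class:

```python
class FakeTI:
    """Minimal stand-in for the TaskInstance XCom interface."""
    def __init__(self, values):
        self._values = values

    def xcom_pull(self, task_ids=None, key=None):
        return self._values.get((task_ids, key))

def rows_from_context(context):
    # Same expression ping_success uses: a missing XCom falls back to 0,
    # which trips the count=0 output assertion on the monitor side.
    return context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0

assert rows_from_context({"ti": FakeTI({("export_data", "rows_exported"): 1250})}) == 1250
assert rows_from_context({"ti": FakeTI({})}) == 0  # task never pushed a count
```

Note the deliberate behavior in the second case: if the export task is skipped or never pushes a count, the monitor receives count=0 and alerts, rather than the gap going unnoticed.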


Option 2: Final task in the DAG graph

If you want the monitoring ping visible in the Airflow task graph — useful for debugging — add it as a final PythonOperator.

from airflow.operators.python import PythonOperator


def notify_deadmancheck(**context):
    rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
    requests.get(BASE_URL, params={"count": rows}, timeout=5)


# In your DAG:
notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
)

export_task >> validate_data >> notify  # replace validate_data with your existing tasks

This approach makes the monitoring step explicit and auditable. Note that notify_deadmancheck deliberately has no try/except — if the ping fails, you want Airflow to retry it (and mark the task failed if retries are exhausted), rather than silently swallowing the error. This is the opposite of the callback approach above, where the pipeline must never be blocked by the monitoring call.
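Since failures in the notify task are meant to surface, it's worth giving it an explicit retry policy so a transient network blip gets re-attempted before the task is marked failed. A sketch using standard BaseOperator arguments (the values are illustrative):

```python
from datetime import timedelta
from airflow.operators.python import PythonOperator

notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
    retries=2,                         # re-attempt the ping on transient errors
    retry_delay=timedelta(minutes=5),  # wait between attempts before failing the task
)
```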


Configuring the monitor

In DeadManCheck, create a new monitor:

  • Type: Cron / Heartbeat
  • Interval: set to 25h (slightly longer than your 24h schedule, to allow for run time)
  • Output assertion: alert if count = 0
  • Alert channels: Slack, PagerDuty, email — whatever's in your incident flow

The output assertion is the key part. When your export runs and calls:

GET https://deadmancheck.io/ping/your-token?count=0

you get an alert, even though Airflow shows the DAG as green.


Setting the environment variable

In your Airflow deployment, add DEADMANCHECK_TOKEN as an environment variable. Where you set it depends on your setup:

Docker Compose:

environment:
  - DEADMANCHECK_TOKEN=your-token-here

Kubernetes (via Secret):

kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here
Then reference it in your pod spec:
env:
  - name: DEADMANCHECK_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadmancheck-secret
        key: token

Astronomer / MWAA: add it as an environment variable via the platform's UI. (If you store it as an Airflow Variable instead, read it with Variable.get rather than os.environ — the example code above reads from the environment.)
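However you set it, a little fail-fast validation catches a missing token at deploy time instead of at 2 a.m. A sketch (the helper name is mine, not from DeadManCheck's docs); this trades the "never break the job" rule for an early, visible failure at DAG-parse time, while runtime pings stay wrapped in try/except:

```python
import os

def deadmancheck_base_url(env=None):
    """Build the ping URL, failing loudly if the token is absent so a
    misconfigured deploy is caught when the DAG file is parsed."""
    env = os.environ if env is None else env
    token = env.get("DEADMANCHECK_TOKEN")
    if not token:
        raise RuntimeError("DEADMANCHECK_TOKEN is not set")
    return f"https://deadmancheck.io/ping/{token}"

assert deadmancheck_base_url({"DEADMANCHECK_TOKEN": "abc"}) == "https://deadmancheck.io/ping/abc"
```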


What you catch with this setup

With the callback approach + output assertion:

| Failure mode | Airflow catches? | DeadManCheck catches? |
| --- | --- | --- |
| Task raises exception | Yes | Yes (via on_failure_callback) |
| DAG paused accidentally | No | Yes (missed ping) |
| Scheduler down | No | Yes (missed ping) |
| Job exports 0 rows | No | Yes (output assertion) |
| Run takes 3× longer than usual | No | Yes (duration anomaly) |
| API token expired, job exits 0 | No | Yes (output assertion) |

Two minutes to set up

  1. Create a free account — no credit card needed
  2. Create a monitor, set interval to match your DAG schedule + buffer
  3. Enable output assertion: alert if count = 0
  4. Add the callbacks to your DAG
  5. Deploy, run once, verify the ping arrives

After the first successful run, DeadManCheck will alert you if the DAG ever goes silent — or succeeds while doing nothing.


DeadManCheck is open source and self-hostable. If you'd rather run it on your own infrastructure, the GitHub repo has setup instructions.
