Kriss

Posted on • Originally published at deadmancheck.io

How to monitor Apache Airflow DAGs so you know when they silently fail

Your Airflow DAG ran last night. All tasks: green. All durations: normal. Export job completed at 02:14.

Zero rows exported. Nobody knows.

This is the silent failure Airflow's built-in alerting doesn't catch. on_failure_callback fires when a task crashes. It doesn't fire when a task exits 0 after connecting to a stale database replica and processing nothing. That's the failure mode that eats your Monday morning.

This article shows you two ways to add external monitoring to Airflow DAGs — so you get paged for both kinds of failures.


Why Airflow's built-in alerts aren't enough

Airflow gives you several callback hooks:

  • on_failure_callback — task or DAG run failed
  • on_success_callback — task or DAG run succeeded
  • on_retry_callback — task queued for retry
  • on_execute_callback — task about to start
  • on_skipped_callback — task raised AirflowSkipException

These are useful. But none of them answer the question that actually matters for data pipelines: did the job do something?

Your export DAG catches a database timeout, logs it, and exits cleanly. Airflow marks it green. No callbacks fire. The data never lands.

You need an independent check — something outside Airflow that asks "did this DAG complete, and did it report non-zero output?" every time the schedule fires.


The approach: dead man's switch + output assertions

A dead man's switch monitor works like this:

  1. You set up a monitor with an expected interval — say, "this DAG should report in every 24 hours"
  2. Your DAG pings the monitor when it completes
  3. If the monitor doesn't hear from the DAG within the window, it alerts you

This catches missed runs, paused DAGs, scheduler issues, and slow drift.

But the more powerful feature is output assertions: you pass a count with your ping, and the monitor alerts if count is 0 — even when the job completed and pinged successfully.
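Put together, the monitor-side decision is simple enough to sketch in a few lines of Python. This is a hypothetical model of the logic, not DeadManCheck's actual implementation:

```python
from datetime import datetime, timedelta

def should_alert(last_ping, last_count, expected_interval, now):
    """Hypothetical dead man's switch logic: alert on a missed ping,
    or on a ping that reported zero output."""
    if last_ping is None or now - last_ping > expected_interval:
        return "missed ping"   # DAG went silent
    if last_count == 0:
        return "zero output"   # DAG ran, pinged, but did nothing
    return None                # healthy

now = datetime(2026, 1, 2, 3, 0)
window = timedelta(hours=25)

# Pinged on time with rows exported: healthy
assert should_alert(datetime(2026, 1, 2, 2, 14), 1250, window, now) is None
# Pinged on time but exported nothing: alert anyway
assert should_alert(datetime(2026, 1, 2, 2, 14), 0, window, now) == "zero output"
# Never pinged: alert
assert should_alert(None, None, window, now) == "missed ping"
```

The second assertion is the case Airflow's callbacks can't see: the run completed and pinged, but the count exposes that it did nothing.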

I'll use DeadManCheck for the examples. It's the only cron monitoring tool that supports output assertions, and it has a free tier for up to 5 monitors.


Option 1: DAG-level callback (cleanest approach)

If you want to monitor the whole DAG run — not individual tasks — use on_success_callback and on_failure_callback at the DAG level.

# airflow/dags/daily_export.py
# Airflow 2.x imports. For Airflow 3.x use:
#   from airflow.sdk import DAG
#   from airflow.providers.standard.operators.python import PythonOperator

import requests
import os
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

DEADMANCHECK_TOKEN = os.environ.get("DEADMANCHECK_TOKEN")
BASE_URL = f"https://deadmancheck.io/ping/{DEADMANCHECK_TOKEN}"


def ping_start(context):
    """Signal that the DAG has started — enables duration monitoring."""
    try:
        requests.get(f"{BASE_URL}/start", timeout=5)
    except Exception:
        pass  # never let monitoring break the job


def ping_success(context):
    """Signal success. Pull row count from XCom for output assertion."""
    try:
        rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
        requests.get(BASE_URL, params={"count": rows}, timeout=5)
    except Exception:
        pass


def ping_failure(context):
    """Signal explicit failure."""
    try:
        requests.get(f"{BASE_URL}/fail", timeout=5)
    except Exception:
        pass


with DAG(
    dag_id="daily_export",
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
) as dag:

    def export_data(**context):
        rows = run_export()  # your existing export logic; returns a row count
        # Push count to XCom so ping_success can read it
        context["ti"].xcom_push(key="rows_exported", value=rows)
        return rows

    export_task = PythonOperator(
        task_id="export_data",
        python_callable=export_data,
        on_execute_callback=ping_start,  # task-level in 2.x; fires when this task starts
    )

A few things to note:

Wrap every ping in try/except. The monitoring call must never fail the DAG. If DeadManCheck is unreachable, your pipeline keeps running.

Push row count via XCom. The success callback receives the context object, which includes a TaskInstance. Use xcom_pull to retrieve the count from the last task.

Set on_execute_callback for duration monitoring. In Airflow 2.x this is a task-level callback, so it lives on the first task rather than the DAG itself. It sends the /start signal before that task runs. DeadManCheck then tracks how long each run takes and alerts when a run is significantly longer than the rolling average.
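You can sanity-check the XCom fallback in ping_success without a running scheduler by using a stand-in TaskInstance. FakeTI here is hypothetical test scaffolding, not an Airflow class:

```python
class FakeTI:
    """Minimal stand-in for the TaskInstance XCom interface."""
    def __init__(self, values):
        self._values = values

    def xcom_pull(self, task_ids=None, key=None):
        return self._values.get((task_ids, key))

def rows_from_context(context):
    # Same expression ping_success uses: a missing XCom falls back to 0,
    # which trips the count=0 output assertion on the monitor side.
    return context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0

assert rows_from_context({"ti": FakeTI({("export_data", "rows_exported"): 1250})}) == 1250
assert rows_from_context({"ti": FakeTI({})}) == 0  # task never pushed a count
```

Note the deliberate behavior in the second case: if the export task is skipped or never pushes a count, the monitor receives count=0 and alerts, rather than the gap going unnoticed.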


Option 2: Final task in the DAG graph

If you want the monitoring ping visible in the Airflow task graph — useful for debugging — add it as a final PythonOperator.

from airflow.operators.python import PythonOperator


def notify_deadmancheck(**context):
    rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
    requests.get(BASE_URL, params={"count": rows}, timeout=5)


# In your DAG:
notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
)

export_task >> validate_data >> notify  # replace validate_data with your existing tasks

This approach makes the monitoring step explicit and auditable. Note that notify_deadmancheck deliberately has no try/except — if the ping fails, you want Airflow to retry it (and mark the task failed if retries are exhausted), rather than silently swallowing the error. This is the opposite of the callback approach above, where the pipeline must never be blocked by the monitoring call.
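Since failures in the notify task are meant to surface, it's worth giving it an explicit retry policy so a transient network blip gets re-attempted before the task is marked failed. A sketch using standard BaseOperator arguments (the values are illustrative):

```python
from datetime import timedelta
from airflow.operators.python import PythonOperator

notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
    retries=2,                         # re-attempt the ping on transient errors
    retry_delay=timedelta(minutes=5),  # wait between attempts before failing the task
)
```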


Configuring the monitor

In DeadManCheck, create a new monitor:

  • Type: Cron / Heartbeat
  • Interval: set to 25h (slightly longer than your 24h schedule, to allow for run time)
  • Output assertion: alert if count = 0
  • Alert channels: Slack, PagerDuty, email — whatever's in your incident flow

The output assertion is the key part. When your export runs and calls:

GET https://deadmancheck.io/ping/your-token?count=0

you get an alert, even though Airflow shows the DAG as green.


Setting the environment variable

In your Airflow deployment, add DEADMANCHECK_TOKEN as an environment variable. Where you set it depends on your setup:

Docker Compose:

environment:
  - DEADMANCHECK_TOKEN=your-token-here

Kubernetes (via Secret):

kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here
Then reference it in your pod spec:
env:
  - name: DEADMANCHECK_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadmancheck-secret
        key: token

Astronomer / MWAA: add it as an environment variable via the platform's UI. (If you store it as an Airflow Variable instead, read it with Variable.get rather than os.environ — the example code above reads from the environment.)
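However you set it, a little fail-fast validation catches a missing token at deploy time instead of at 2 a.m. A sketch (the helper name is mine, not from DeadManCheck's docs); this trades the "never break the job" rule for an early, visible failure at DAG-parse time, while runtime pings stay wrapped in try/except:

```python
import os

def deadmancheck_base_url(env=None):
    """Build the ping URL, failing loudly if the token is absent so a
    misconfigured deploy is caught when the DAG file is parsed."""
    env = os.environ if env is None else env
    token = env.get("DEADMANCHECK_TOKEN")
    if not token:
        raise RuntimeError("DEADMANCHECK_TOKEN is not set")
    return f"https://deadmancheck.io/ping/{token}"

assert deadmancheck_base_url({"DEADMANCHECK_TOKEN": "abc"}) == "https://deadmancheck.io/ping/abc"
```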


What you catch with this setup

With the callback approach + output assertion:

| Failure mode | Airflow catches? | DeadManCheck catches? |
| --- | --- | --- |
| Task raises exception | Yes | Yes (via on_failure_callback) |
| DAG paused accidentally | No | Yes (missed ping) |
| Scheduler down | No | Yes (missed ping) |
| Job exports 0 rows | No | Yes (output assertion) |
| Run takes 3× longer than usual | No | Yes (duration anomaly) |
| API token expired, job exits 0 | No | Yes (output assertion) |

Two minutes to set up

  1. Create a free account — no credit card needed
  2. Create a monitor, set interval to match your DAG schedule + buffer
  3. Enable output assertion: alert if count = 0
  4. Add the callbacks to your DAG
  5. Deploy, run once, verify the ping arrives

After the first successful run, DeadManCheck will alert you if the DAG ever goes silent — or succeeds while doing nothing.


DeadManCheck is open source and self-hostable. If you'd rather run it on your own infrastructure, the GitHub repo has setup instructions.
