Your Airflow DAG ran last night. All tasks: green. All durations: normal. Export job completed at 02:14.
Zero rows exported. Nobody knows.
This is the silent failure Airflow's built-in alerting doesn't catch. `on_failure_callback` fires when a task crashes. It doesn't fire when a task exits 0 after connecting to a stale database replica and processing nothing. That's the failure mode that eats your Monday morning.
This article shows you two ways to add external monitoring to Airflow DAGs — so you get paged for both kinds of failures.
## Why Airflow's built-in alerts aren't enough
Airflow gives you several callback hooks:
- `on_failure_callback`: task or DAG run failed
- `on_success_callback`: task or DAG run succeeded
- `on_retry_callback`: task is queued for retry
- `on_execute_callback`: task is about to start
- `on_skipped_callback`: task raised `AirflowSkipException`
These are useful. But none of them answer the question that actually matters for data pipelines: did the job do something?
Your export DAG catches a database timeout, logs it, and exits cleanly. Airflow marks it green. No callbacks fire. The data never lands.
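Here is that failure mode in miniature. A sketch, where `fetch_rows_from_replica` and `write_to_warehouse` are hypothetical stand-ins for your own export logic:

```python
import logging

log = logging.getLogger(__name__)


def fetch_rows_from_replica():
    # Hypothetical stand-in: a stale replica that never answers.
    raise TimeoutError("replica not responding")


def write_to_warehouse(rows):
    ...  # hypothetical stand-in for the real load step


def export_data():
    try:
        rows = fetch_rows_from_replica()
    except TimeoutError:
        log.warning("Database timeout; skipping export")
        return  # exits 0: Airflow marks the task green, no failure callback fires
    write_to_warehouse(rows)
```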
You need an independent check — something outside Airflow that asks "did this DAG complete, and did it report non-zero output?" every time the schedule fires.
## The approach: dead man's switch + output assertions
A dead man's switch monitor works like this:
- You set up a monitor with an expected interval — say, "this DAG should report in every 24 hours"
- Your DAG pings the monitor when it completes
- If the monitor doesn't hear from the DAG within the window, it alerts you
This catches missed runs, paused DAGs, scheduler issues, and slow drift.
But the more powerful feature is output assertions: you pass a count with your ping, and the monitor alerts if count is 0 — even when the job completed and pinged successfully.
I'll use DeadManCheck for the examples. It's the only cron monitoring tool that supports output assertions, and it has a free tier for up to 5 monitors.
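Under the hood, the ping is just an HTTP GET with the count attached; the assertion itself lives in the monitor's configuration, not in your code. A minimal sketch with a placeholder token (the full DAG integration follows below):

```python
import requests

rows = 0  # whatever the export actually produced

# The request itself succeeds, but the monitor compares `count` against
# the assertion you configured ("alert if count = 0") and pages you anyway.
requests.get(
    "https://deadmancheck.io/ping/your-token",
    params={"count": rows},
    timeout=5,
)
```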
## Option 1: DAG-level callback (cleanest approach)
If you want to monitor the whole DAG run — not individual tasks — use on_success_callback and on_failure_callback at the DAG level.
```python
# airflow/dags/daily_export.py
# Airflow 2.x imports. For Airflow 3.x use:
#   from airflow.sdk import DAG
#   from airflow.providers.standard.operators.python import PythonOperator
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

DEADMANCHECK_TOKEN = os.environ.get("DEADMANCHECK_TOKEN")
BASE_URL = f"https://deadmancheck.io/ping/{DEADMANCHECK_TOKEN}"


def ping_start(context):
    """Signal that the DAG has started; enables duration monitoring."""
    try:
        requests.get(f"{BASE_URL}/start", timeout=5)
    except Exception:
        pass  # never let monitoring break the job


def ping_success(context):
    """Signal success. Pull the row count from XCom for the output assertion."""
    try:
        rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
        requests.get(BASE_URL, params={"count": rows}, timeout=5)
    except Exception:
        pass


def ping_failure(context):
    """Signal explicit failure."""
    try:
        requests.get(f"{BASE_URL}/fail", timeout=5)
    except Exception:
        pass


with DAG(
    dag_id="daily_export",
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_success_callback=ping_success,
    on_failure_callback=ping_failure,
) as dag:

    def export_data(**context):
        rows = run_export()  # your existing export logic
        # Push the count to XCom so ping_success can read it
        context["ti"].xcom_push(key="rows_exported", value=rows)
        return rows

    export_task = PythonOperator(
        task_id="export_data",
        python_callable=export_data,
        on_execute_callback=ping_start,  # task-level in 2.x; fires when this task starts
    )
```
A few things to note:
**Wrap every ping in `try/except`.** The monitoring call must never fail the DAG. If DeadManCheck is unreachable, your pipeline keeps running.

**Push the row count via XCom.** The success callback receives the context object, which includes a `TaskInstance`. Use `xcom_pull` to retrieve the count from the export task.

**Set `on_execute_callback` for duration monitoring.** In Airflow 2.x this is a task-level callback, so it lives on the first task rather than the DAG itself. It sends the `/start` signal before that task runs. DeadManCheck then tracks how long each run takes and alerts when a run is significantly longer than the rolling average.
## Option 2: Final task in the DAG graph
If you want the monitoring ping visible in the Airflow task graph — useful for debugging — add it as a final PythonOperator.
```python
from airflow.operators.python import PythonOperator


def notify_deadmancheck(**context):
    rows = context["ti"].xcom_pull(task_ids="export_data", key="rows_exported") or 0
    requests.get(BASE_URL, params={"count": rows}, timeout=5)


# In your DAG:
notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
)

export_task >> validate_data >> notify  # replace validate_data with your existing tasks
```
This approach makes the monitoring step explicit and auditable. Note that `notify_deadmancheck` deliberately has no `try/except`: if the ping fails, you want Airflow to retry it (and mark the task failed if retries are exhausted) rather than silently swallowing the error. This is the opposite of the callback approach above, where the pipeline must never be blocked by the monitoring call.
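If you take this route, it's worth making the retry behavior explicit on the task. A sketch; the retry count and delay are illustrative, not prescriptive:

```python
from datetime import timedelta

notify = PythonOperator(
    task_id="notify_deadmancheck",
    python_callable=notify_deadmancheck,
    retries=3,                         # retry transient network failures
    retry_delay=timedelta(minutes=2),  # back off between attempts
)
```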
## Configuring the monitor
In DeadManCheck, create a new monitor:
- **Type:** Cron / Heartbeat
- **Interval:** `25h` (slightly longer than your 24h schedule, to allow for run time)
- **Output assertion:** alert if `count = 0`
- **Alert channels:** Slack, PagerDuty, email, whatever's in your incident flow
The output assertion is the key part. When your export runs and calls:

```
GET https://deadmancheck.io/ping/your-token?count=0
```

you get an alert, even though Airflow shows the DAG as green.
## Setting the environment variable
In your Airflow deployment, add `DEADMANCHECK_TOKEN` as an environment variable. Where you set it depends on your setup:
**Docker Compose:**

```yaml
environment:
  - DEADMANCHECK_TOKEN=your-token-here
```
**Kubernetes (via Secret):**

```bash
kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here
```

```yaml
env:
  - name: DEADMANCHECK_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadmancheck-secret
        key: token
```
**Astronomer / MWAA:** add it as an Airflow Variable or environment variable via the platform's UI.
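However you inject it, consider logging loudly when the token is missing, since `BASE_URL` would otherwise be built from `None` and every ping would go to an invalid URL. A small defensive sketch:

```python
import logging
import os

DEADMANCHECK_TOKEN = os.environ.get("DEADMANCHECK_TOKEN")
if not DEADMANCHECK_TOKEN:
    # Don't break DAG parsing; just make the misconfiguration visible in the logs.
    logging.getLogger(__name__).warning(
        "DEADMANCHECK_TOKEN is not set; monitoring pings will not reach DeadManCheck."
    )
```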
## What you catch with this setup
With the callback approach + output assertion:
| Failure mode | Airflow catches? | DeadManCheck catches? |
|---|---|---|
| Task raises exception | Yes | Yes (via `on_failure_callback`) |
| DAG paused accidentally | No | Yes (missed ping) |
| Scheduler down | No | Yes (missed ping) |
| Job exports 0 rows | No | Yes (output assertion) |
| Run takes 3× longer than usual | No | Yes (duration anomaly) |
| API token expired, job exits 0 | No | Yes (output assertion) |
## Two minutes to set up
1. Create a free account (no credit card needed)
2. Create a monitor and set the interval to match your DAG schedule plus a buffer
3. Enable the output assertion: alert if `count = 0`
4. Add the callbacks to your DAG
5. Deploy, run once, and verify the ping arrives (see the test snippet below)
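For step 5, you can also verify the endpoint by hand before the schedule ever fires. A sketch with placeholder values, assuming the ping endpoint returns a 2xx status on success:

```python
import requests

BASE = "https://deadmancheck.io/ping/your-token"  # placeholder token

# Simulate one full run by hand: a start signal, then success with a count.
for url, params in [(f"{BASE}/start", None), (BASE, {"count": 42})]:
    resp = requests.get(url, params=params, timeout=5)
    print(url, resp.status_code)  # expect 2xx if the monitor received the ping
```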
After the first successful run, DeadManCheck will alert you if the DAG ever goes silent — or succeeds while doing nothing.
DeadManCheck is open source and self-hostable. If you'd rather run it on your own infrastructure, the GitHub repo has setup instructions.