<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kriss</title>
    <description>The latest articles on DEV Community by Kriss (@krissv).</description>
    <link>https://dev.to/krissv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3849903%2Fb9620b3b-82b5-4fb6-be97-4683ee0859e2.png</url>
      <title>DEV Community: Kriss</title>
      <link>https://dev.to/krissv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krissv"/>
    <language>en</language>
    <item>
      <title>Kubernetes CronJobs silently fail more than you think</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Tue, 05 May 2026 14:06:03 +0000</pubDate>
      <link>https://dev.to/krissv/kubernetes-cronjobs-silently-fail-more-than-you-think-2nb9</link>
      <guid>https://dev.to/krissv/kubernetes-cronjobs-silently-fail-more-than-you-think-2nb9</guid>
      <description>&lt;p&gt;A backup job missed 24 days of runs. Nobody knew. The CronJob looked fine in &lt;code&gt;kubectl get cronjobs&lt;/code&gt;. No alerts fired. The last successful run timestamp in the status field just sat there, quietly getting older.&lt;/p&gt;

&lt;p&gt;The root cause: the CronJob controller had silently given up scheduling after missing 100 runs. Logged an error. Stopped trying. Moved on.&lt;/p&gt;

&lt;p&gt;This article explains why Kubernetes CronJobs are structurally unreliable without external monitoring, and what you can do about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three failure modes Kubernetes won't tell you about
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The 100 missed-schedule limit
&lt;/h3&gt;

&lt;p&gt;This is the one that produces the war stories.&lt;/p&gt;

&lt;p&gt;The Kubernetes CronJob controller tracks how many start times it has missed since it last scheduled a run. Once that number exceeds 100, it stops scheduling the CronJob entirely and logs a single error line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cannot determine if job needs to be started: too many missed start time (&amp;gt; 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No event. No alert. &lt;code&gt;kubectl describe cronjob&lt;/code&gt; shows the last scheduled time getting stale. The CronJob shows as &lt;code&gt;ACTIVE: 0&lt;/code&gt;. Everything looks fine until you notice your data is 24 days old.&lt;/p&gt;

&lt;p&gt;This happens if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CronJob controller crashes or the API server is unreachable for an extended period&lt;/li&gt;
&lt;li&gt;You set &lt;code&gt;startingDeadlineSeconds&lt;/code&gt; too low and the cluster was briefly overloaded&lt;/li&gt;
&lt;li&gt;A node outage prevented scheduling for long enough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is to delete and recreate the CronJob (or edit its schedule, which resets the controller's bookkeeping). But the point stands: you won't know it happened until you check manually.&lt;/p&gt;
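
&lt;p&gt;To make the limit concrete, here's a back-of-the-envelope helper (a sketch for reasoning about the failure mode, not anything Kubernetes ships): the outage length that triggers abandonment is just the schedule interval multiplied by the 100-run limit.&lt;/p&gt;

```python
from datetime import timedelta

# Default missed-run limit in the CronJob controller (the "(> 100)" in the log line).
MISSED_LIMIT = 100

def time_to_abandonment(schedule_interval: timedelta,
                        missed_limit: int = MISSED_LIMIT) -> timedelta:
    """Rough outage length after which the controller gives up on a CronJob:
    the schedule interval multiplied by the missed-run limit."""
    return schedule_interval * missed_limit

# An every-minute job is abandoned after roughly 100 minutes of disruption;
# a daily job would survive roughly 100 days.
print(time_to_abandonment(timedelta(minutes=1)))
print(time_to_abandonment(timedelta(hours=24)))
```

&lt;p&gt;The takeaway: the tighter your schedule, the easier it is to hit the limit during an ordinary incident.&lt;/p&gt;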

&lt;h3&gt;
  
  
  2. Exit code 0 is not success
&lt;/h3&gt;

&lt;p&gt;Your CronJob container can exit 0 after:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting to a read replica that's 6 hours behind&lt;/li&gt;
&lt;li&gt;Finding an empty queue and processing nothing&lt;/li&gt;
&lt;li&gt;Silently swallowing an exception in a try/catch&lt;/li&gt;
&lt;li&gt;Successfully completing a database backup of 0 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes marks the Job as &lt;code&gt;Succeeded&lt;/code&gt;. The CronJob status shows the last successful run timestamp updated. Everything looks healthy. Your data pipeline has been doing nothing for a week.&lt;/p&gt;
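
&lt;p&gt;Here's a minimal sketch of the trap, with a hypothetical &lt;code&gt;run_export&lt;/code&gt; standing in for your job logic: the swallowed exception alone would produce exit code 0, so the guard at the end is what keeps the Job status honest.&lt;/p&gt;

```python
import sys

def run_export():
    """Hypothetical export step; raising here simulates a stale replica."""
    raise ConnectionError("replica unreachable")

def main() -> int:
    try:
        rows = run_export()
    except Exception:
        # Swallowed error: without the guard below, the process exits 0
        # and Kubernetes marks the Job Succeeded.
        rows = 0
    # Guard: zero output is a failure, whatever the exception handling did.
    return 0 if rows else 1

if __name__ == "__main__":
    sys.exit(main())
```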

&lt;h3&gt;
  
  
  3. Job history purged, evidence gone
&lt;/h3&gt;

&lt;p&gt;By default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;successfulJobsHistoryLimit: 3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;failedJobsHistoryLimit: 1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After three successful runs, the oldest completed Job is deleted along with its pods, and the pod logs go with them. When you eventually notice something's wrong and go looking for "what happened on Tuesday?", the evidence no longer exists.&lt;/p&gt;

&lt;p&gt;You can increase these limits, but you'll never retain more than a handful of runs. A real audit trail requires shipping logs to an external system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The deeper problem: no external check
&lt;/h2&gt;

&lt;p&gt;All of these failure modes share the same root cause: your monitoring system lives inside the cluster, so it fails along with the cluster.&lt;/p&gt;

&lt;p&gt;If your alerting depends on the cluster being healthy, it won't alert you when the cluster is unhealthy. And CronJob failures almost always correlate with cluster health problems.&lt;/p&gt;

&lt;p&gt;What you need is a check that runs outside your cluster and asks: "did this job run? Did it do something?" If the answer is no, it pages you — regardless of what the cluster thinks.&lt;/p&gt;

&lt;p&gt;This is the dead man's switch pattern: instead of your monitoring system checking whether the job ran, the job checks in with an external system, and the external system alerts if it stops hearing from the job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementing external monitoring for a Kubernetes CronJob
&lt;/h2&gt;

&lt;p&gt;Add a start/success/fail ping to your job. Here's a minimal implementation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell wrapper (works with any container)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Signal start (enables duration monitoring) — || true so a network blip never kills the job&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/start"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Alert on any error&lt;/span&gt;
&lt;span class="nb"&gt;trap&lt;/span&gt; &lt;span class="s1"&gt;'curl -fsS "${BASE}/fail" &amp;gt; /dev/null'&lt;/span&gt; ERR

&lt;span class="c"&gt;# Your actual job&lt;/span&gt;
&lt;span class="nv"&gt;ROWS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;/app/run-export.sh&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Signal success + row count for output assertion&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROWS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python job
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="n"&gt;TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;BASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Signal start — wrapped so a monitoring outage never kills the job
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;records_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_job&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# POST count for output assertion: alert if count is 0
&lt;/span&gt;        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records_processed&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CronJob spec
&lt;/h3&gt;

&lt;p&gt;Store the token in a Kubernetes Secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic deadmancheck-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-token-here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reference it in your CronJob spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily-export&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
  &lt;span class="na"&gt;successfulJobsHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;   &lt;span class="c1"&gt;# keep more history than the default&lt;/span&gt;
  &lt;span class="na"&gt;failedJobsHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exporter&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-registry/exporter:latest&lt;/span&gt;
            &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deadmancheck-secret&lt;/span&gt;
                  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Output assertions: the check Kubernetes can't do
&lt;/h2&gt;

&lt;p&gt;Output assertions are the piece most monitoring tutorials skip. Here's why it matters.&lt;/p&gt;

&lt;p&gt;Your job runs. Exits 0. Kubernetes marks it &lt;code&gt;Succeeded&lt;/code&gt;. But the job processed 0 records.&lt;/p&gt;

&lt;p&gt;If your monitoring only checks "did the job ping?" — like every other cron monitoring tool — you don't get alerted. The job pinged. It just pinged with count=0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; lets you configure an output assertion: alert if &lt;code&gt;count &amp;lt; N&lt;/code&gt;. Set N to 1, so the assertion requires &lt;code&gt;count &amp;gt; 0&lt;/code&gt;. Now your job can't silently export nothing without triggering an alert.&lt;/p&gt;

&lt;p&gt;This catches the failure mode that pure heartbeat monitoring misses: the job that runs, succeeds by every technical measure, and still does nothing useful.&lt;/p&gt;
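
&lt;p&gt;In miniature, the monitor-side decision combines both checks. This is a sketch of the pattern under a simplified model, not DeadManCheck's actual implementation:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def should_alert(last_ping: datetime, last_count: int,
                 window: timedelta, now: datetime) -> bool:
    """Page on a missed ping (dead man's switch) or on zero reported
    output (output assertion), regardless of what the cluster says."""
    missed_ping = now - last_ping > window
    empty_output = last_count == 0
    return missed_ping or empty_output
```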




&lt;h2&gt;
  
  
  What external monitoring catches vs what it doesn't
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;kubectl catches?&lt;/th&gt;
&lt;th&gt;External monitoring catches?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod CrashLoopBackOff&lt;/td&gt;
&lt;td&gt;Visible in logs/events&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 missed-schedule limit hit&lt;/td&gt;
&lt;td&gt;No alert fires&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job exits 0, processes nothing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (output assertion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster outage kills controller&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job takes 5× longer than usual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (duration anomaly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CronJob accidentally deleted&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The realistic setup time
&lt;/h2&gt;

&lt;p&gt;For an existing CronJob:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://deadmancheck.io/register" rel="noopener noreferrer"&gt;Create a free monitor&lt;/a&gt; — takes 2 minutes&lt;/li&gt;
&lt;li&gt;Set interval to match your schedule + buffer (e.g., &lt;code&gt;25h&lt;/code&gt; for a daily job)&lt;/li&gt;
&lt;li&gt;Enable output assertion if your job reports a count&lt;/li&gt;
&lt;li&gt;Add the start/success/fail pings to your container script&lt;/li&gt;
&lt;li&gt;Create the Secret, update the CronJob spec&lt;/li&gt;
&lt;li&gt;Deploy and verify the first ping arrives&lt;/li&gt;
&lt;/ol&gt;
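
&lt;p&gt;The interval arithmetic in step 2 is worth pinning down. A tiny helper (purely illustrative) makes the "schedule plus buffer" rule explicit:&lt;/p&gt;

```python
from datetime import timedelta

def monitor_interval(run_period: timedelta,
                     buffer: timedelta = timedelta(hours=1)) -> timedelta:
    """Expected-ping interval for the monitor: the schedule period plus
    slack, so normal start-time jitter never pages you."""
    return run_period + buffer

interval = monitor_interval(timedelta(hours=24))  # 25h for a daily job
```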

&lt;p&gt;Total: 15-20 minutes including deployment. The first silent failure it catches will make you wish you'd set it up sooner.&lt;/p&gt;




&lt;h2&gt;
  
  
  One more thing: set a reasonable history limit
&lt;/h2&gt;

&lt;p&gt;While you're in the CronJob spec, increase the history limits from the defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;successfulJobsHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="na"&gt;failedJobsHistoryLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't replace external monitoring, but it gives you more context in &lt;code&gt;kubectl describe cronjob&lt;/code&gt; when you're investigating an incident. The default of 3/1 is genuinely too low for production jobs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;DeadManCheck is open source and self-hostable if you'd rather run it on your own infrastructure. &lt;a href="https://github.com/Kriss-V/deadmancheck" rel="noopener noreferrer"&gt;GitHub →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>reliability</category>
    </item>
    <item>
      <title>How to monitor Apache Airflow DAGs so you know when they silently fail</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Fri, 01 May 2026 12:14:44 +0000</pubDate>
      <link>https://dev.to/krissv/how-to-monitor-apache-airflow-dags-so-you-know-when-they-silently-fail-3pp0</link>
      <guid>https://dev.to/krissv/how-to-monitor-apache-airflow-dags-so-you-know-when-they-silently-fail-3pp0</guid>
      <description>&lt;p&gt;Your Airflow DAG ran last night. All tasks: green. All durations: normal. Export job completed at 02:14.&lt;/p&gt;

&lt;p&gt;Zero rows exported. Nobody knows.&lt;/p&gt;

&lt;p&gt;This is the silent failure Airflow's built-in alerting doesn't catch. &lt;code&gt;on_failure_callback&lt;/code&gt; fires when a task crashes. It doesn't fire when a task exits 0 after connecting to a stale database replica and processing nothing. That's the failure mode that eats your Monday morning.&lt;/p&gt;

&lt;p&gt;This article shows you two ways to add external monitoring to Airflow DAGs — so you get paged for both kinds of failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Airflow's built-in alerts aren't enough
&lt;/h2&gt;

&lt;p&gt;Airflow gives you several callback hooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;on_failure_callback&lt;/code&gt; — task or DAG run failed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_success_callback&lt;/code&gt; — task or DAG run succeeded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_retry_callback&lt;/code&gt; — task queued for retry&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_execute_callback&lt;/code&gt; — task about to start&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_skipped_callback&lt;/code&gt; — task raised AirflowSkipException&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are useful. But none of them answer the question that actually matters for data pipelines: &lt;strong&gt;did the job do something?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your export DAG catches a database timeout, logs it, and exits cleanly. Airflow marks it green. No callbacks fire. The data never lands.&lt;/p&gt;

&lt;p&gt;You need an independent check — something outside Airflow that asks "did this DAG complete, and did it report non-zero output?" every time the schedule fires.&lt;/p&gt;




&lt;h2&gt;
  
  
  The approach: dead man's switch + output assertions
&lt;/h2&gt;

&lt;p&gt;A dead man's switch monitor works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You set up a monitor with an expected interval — say, "this DAG should report in every 24 hours"&lt;/li&gt;
&lt;li&gt;Your DAG pings the monitor when it completes&lt;/li&gt;
&lt;li&gt;If the monitor doesn't hear from the DAG within the window, it alerts you&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This catches missed runs, paused DAGs, scheduler issues, and slow drift.&lt;/p&gt;

&lt;p&gt;But the more powerful feature is &lt;strong&gt;output assertions&lt;/strong&gt;: you pass a count with your ping, and the monitor alerts if count is 0 — even when the job completed and pinged successfully.&lt;/p&gt;

&lt;p&gt;I'll use &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; for the examples. It's the only cron monitoring tool that supports output assertions, and it has a free tier for up to 5 monitors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 1: DAG-level callback (cleanest approach)
&lt;/h2&gt;

&lt;p&gt;If you want to monitor the whole DAG run — not individual tasks — use &lt;code&gt;on_success_callback&lt;/code&gt; and &lt;code&gt;on_failure_callback&lt;/code&gt; at the DAG level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# airflow/dags/daily_export.py
# Airflow 2.x imports. For Airflow 3.x use:
#   from airflow.sdk import DAG
#   from airflow.providers.standard.operators.python import PythonOperator
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;DEADMANCHECK_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Signal that the DAG has started — enables duration monitoring.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# never let monitoring break the job
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Signal success. Pull row count from XCom for output assertion.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_exported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Signal explicit failure.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;


&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_export&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 2 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_success_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ping_success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_failure_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ping_failure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_export&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Push count to XCom so ping_success can read it
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_exported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;

    &lt;span class="n"&gt;export_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_execute_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ping_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# task-level in 2.x; fires when this task starts
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrap every ping in try/except.&lt;/strong&gt; The monitoring call must never fail the DAG. If DeadManCheck is unreachable, your pipeline keeps running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Push the row count via XCom.&lt;/strong&gt; The success callback receives the &lt;code&gt;context&lt;/code&gt; object, which includes a &lt;code&gt;TaskInstance&lt;/code&gt;. Use &lt;code&gt;xcom_pull&lt;/code&gt; with the task id and key to retrieve the count pushed by the export task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;on_execute_callback&lt;/code&gt; for duration monitoring.&lt;/strong&gt; In Airflow 2.x this is a task-level callback, so it lives on the first task rather than the DAG itself. It sends the &lt;code&gt;/start&lt;/code&gt; signal before that task runs. DeadManCheck then tracks how long each run takes and alerts when a run is significantly longer than the rolling average.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 2: Final task in the DAG graph
&lt;/h2&gt;

&lt;p&gt;If you want the monitoring ping visible in the Airflow task graph — useful for debugging — add it as a final &lt;code&gt;PythonOperator&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_exported&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# In your DAG:
&lt;/span&gt;&lt;span class="n"&gt;notify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notify_deadmancheck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;notify_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;export_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;validate_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;notify&lt;/span&gt;  &lt;span class="c1"&gt;# replace validate_data with your existing tasks
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach makes the monitoring step explicit and auditable. Note that &lt;code&gt;notify_deadmancheck&lt;/code&gt; deliberately has no try/except — if the ping fails, you want Airflow to retry it (and mark the task failed if retries are exhausted), rather than silently swallowing the error. This is the opposite of the callback approach above, where the pipeline must never be blocked by the monitoring call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuring the monitor
&lt;/h2&gt;

&lt;p&gt;In DeadManCheck, create a new monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; Cron / Heartbeat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interval:&lt;/strong&gt; set to &lt;code&gt;25h&lt;/code&gt; (slightly longer than your 24h schedule, to allow for run time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output assertion:&lt;/strong&gt; alert if &lt;code&gt;count = 0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert channels:&lt;/strong&gt; Slack, PagerDuty, email — whatever's in your incident flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output assertion is the key part. When your export runs and calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"count": 0}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://deadmancheck.io/ping/your-token &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get an alert. Even though Airflow shows the DAG as green.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting the environment variable
&lt;/h2&gt;

&lt;p&gt;In your Airflow deployment, add &lt;code&gt;DEADMANCHECK_TOKEN&lt;/code&gt; as an environment variable. Where you set it depends on your setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Compose:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DEADMANCHECK_TOKEN=your-token-here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes (via Secret):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic deadmancheck-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-token-here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then reference the secret in your scheduler and worker pod specs:&lt;br&gt;
&lt;/p&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deadmancheck-secret&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Astronomer / MWAA:&lt;/strong&gt; add it as an Airflow Variable or environment variable via the platform's UI.&lt;/p&gt;
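&lt;p&gt;However you set it, it's worth failing fast when the variable is missing. A minimal sketch (the helper name is ours, not part of DeadManCheck or Airflow):&lt;br&gt;
&lt;/p&gt;

```python
import os

# Hypothetical helper: resolve the token once, at DAG-parse time, so a missing
# or blank DEADMANCHECK_TOKEN fails loudly with a clear message instead of
# surfacing as a KeyError inside a callback at 2am.
def get_deadmancheck_token() -> str:
    token = os.environ.get("DEADMANCHECK_TOKEN", "").strip()
    if not token:
        raise RuntimeError("DEADMANCHECK_TOKEN is not set in the Airflow environment")
    return token
```

&lt;p&gt;Build &lt;code&gt;BASE_URL&lt;/code&gt; from this helper instead of indexing &lt;code&gt;os.environ&lt;/code&gt; directly.&lt;/p&gt;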




&lt;h2&gt;
  
  
  What you catch with this setup
&lt;/h2&gt;

&lt;p&gt;With the callback approach + output assertion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;Airflow catches?&lt;/th&gt;
&lt;th&gt;DeadManCheck catches?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task raises exception&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;td&gt;YES (via on_failure_callback)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAG paused accidentally&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler down&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (missed ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job exports 0 rows&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (output assertion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run takes 3× longer than usual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (duration anomaly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API token expired, job exits 0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;YES (output assertion)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Two minutes to set up
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://deadmancheck.io/register" rel="noopener noreferrer"&gt;Create a free account&lt;/a&gt; — no credit card needed&lt;/li&gt;
&lt;li&gt;Create a monitor, set interval to match your DAG schedule + buffer&lt;/li&gt;
&lt;li&gt;Enable output assertion: alert if count = 0&lt;/li&gt;
&lt;li&gt;Add the callbacks to your DAG&lt;/li&gt;
&lt;li&gt;Deploy, run once, verify the ping arrives&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After the first successful run, DeadManCheck will alert you if the DAG ever goes silent — or succeeds while doing nothing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;DeadManCheck is open source and self-hostable. If you'd rather run it on your own infrastructure, the &lt;a href="https://github.com/Kriss-V/deadmancheck" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has setup instructions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>python</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How to add dead man's switch monitoring to any cron job in 2 minutes</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:47:42 +0000</pubDate>
      <link>https://dev.to/krissv/how-to-add-dead-mans-switch-monitoring-to-any-cron-job-in-2-minutes-3ebj</link>
      <guid>https://dev.to/krissv/how-to-add-dead-mans-switch-monitoring-to-any-cron-job-in-2-minutes-3ebj</guid>
      <description>&lt;h1&gt;
  
  
  How to add dead man's switch monitoring to any cron job in 2 minutes
&lt;/h1&gt;

&lt;p&gt;The concept is simple: your job checks in when it runs. If it stops checking in, you get alerted.&lt;/p&gt;

&lt;p&gt;No agent to install. No SDK to integrate. Just a curl at the end of your script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one-liner
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://deadmancheck.io/ping/YOUR-TOKEN &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Stick that at the end of your cron job. If the job stops running — server dies, cron daemon crashes, script errors out before it gets there — you get an alert.&lt;/p&gt;

&lt;p&gt;The flags: &lt;code&gt;-f&lt;/code&gt; makes curl exit non-zero on HTTP errors instead of printing the error page, &lt;code&gt;-s&lt;/code&gt; suppresses progress output, and &lt;code&gt;-S&lt;/code&gt; still shows errors even when &lt;code&gt;-s&lt;/code&gt; is set. Redirect to &lt;code&gt;/dev/null&lt;/code&gt; so the response body doesn't pollute your logs.&lt;/p&gt;
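&lt;p&gt;In the crontab itself, chain the ping onto the job so it only fires when the job exits 0 (the path and schedule below are placeholders):&lt;br&gt;
&lt;/p&gt;

```shell
# Illustrative crontab entry: backup.sh stands in for your job. The ping only
# runs if the job exits 0, so a crashed or failing job means a missed ping
# and, eventually, an alert.
0 2 * * * /usr/local/bin/backup.sh &amp;&amp; curl -fsS -o /dev/null https://deadmancheck.io/ping/YOUR-TOKEN
```

&lt;p&gt;&lt;code&gt;-o /dev/null&lt;/code&gt; discards the response body the same way the shell redirect does, without needing a shell at all.&lt;/p&gt;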

&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Sign up at &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;deadmancheck.io&lt;/a&gt; (free for up to 5 monitors). Create a monitor, set the expected interval — say, every 24 hours — and copy your unique token.&lt;/p&gt;

&lt;p&gt;Then configure the alert window. If you're running a daily job, set it to alert after 25 hours of silence. That gives a 1-hour grace period for slow servers and slight scheduling drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start/end pattern for longer jobs
&lt;/h2&gt;

&lt;p&gt;The one-liner is fine for quick jobs. For anything that runs more than a few minutes, use the start/end pattern. This also catches jobs that start but hang indefinitely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Signal job started&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://deadmancheck.io/ping/YOUR-TOKEN/start &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null

&lt;span class="c"&gt;# ... your job logic ...&lt;/span&gt;

&lt;span class="c"&gt;# Signal job completed&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://deadmancheck.io/ping/YOUR-TOKEN &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the job starts but never pings the end URL within your configured timeout, you get alerted. Useful for ETL jobs that sometimes decide to run for 6 hours when they should take 20 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;DEADMANCHECK_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# never let monitoring break the job
&lt;/span&gt;
&lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_export&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;try/except&lt;/code&gt; around each ping is deliberate. Your monitoring call should never take down your job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ruby
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'net/http'&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'uri'&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;

&lt;span class="no"&gt;TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'DEADMANCHECK_TOKEN'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="no"&gt;BASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;BASE&lt;/span&gt;&lt;span class="si"&gt;}#{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'application/json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="ss"&gt;count: &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;use_ssl: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt;
  &lt;span class="c1"&gt;# don't let monitoring kill the job&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/start'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;begin&lt;/span&gt;
  &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_etl&lt;/span&gt;
  &lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/fail'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bash with error handling
&lt;/h2&gt;

&lt;p&gt;For bash scripts, use a trap to ping the fail URL on any error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR-TOKEN"&lt;/span&gt;
&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/start"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;  &lt;span class="c"&gt;# don't let an unreachable monitor abort the backup under set -e&lt;/span&gt;

&lt;span class="nb"&gt;trap&lt;/span&gt; &lt;span class="s1"&gt;'curl -fsS "${BASE}/fail" &amp;gt; /dev/null'&lt;/span&gt; ERR

/usr/local/bin/run-backup.sh

&lt;span class="nv"&gt;ROW_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &amp;lt; /backups/output.csv&lt;span class="si"&gt;)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ROW_COUNT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;set -euo pipefail&lt;/code&gt; means any unhandled error exits the script and triggers the trap. The ERR trap fires before exit, pinging the fail endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to monitor first
&lt;/h2&gt;

&lt;p&gt;If you're not sure where to start:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Database backups&lt;/strong&gt; — silent failures here are catastrophic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL/data pipeline jobs&lt;/strong&gt; — wrong data is worse than no data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invoice/billing jobs&lt;/strong&gt; — customers notice immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report generation&lt;/strong&gt; — stakeholders notice next morning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache warmers&lt;/strong&gt; — performance degrades silently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Anything that runs unattended and that you'd be embarrassed to find broken three weeks later.&lt;/p&gt;

&lt;p&gt;One token per cron job. If you have 10 jobs, create 10 monitors. DeadManCheck's free tier covers 5 monitors — the $12/mo plan covers 100, which handles most teams.&lt;/p&gt;

&lt;p&gt;Two minutes of setup. One less thing to find out about the hard way.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Monitoring GitHub Actions scheduled workflows: a practical guide</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Wed, 29 Apr 2026 16:14:13 +0000</pubDate>
      <link>https://dev.to/krissv/monitoring-github-actions-scheduled-workflows-a-practical-guide-31h7</link>
      <guid>https://dev.to/krissv/monitoring-github-actions-scheduled-workflows-a-practical-guide-31h7</guid>
      <description>&lt;h1&gt;
  
  
  Monitoring GitHub Actions scheduled workflows: a practical guide
&lt;/h1&gt;

&lt;p&gt;GitHub Actions is a surprisingly capable cron scheduler. Schedule a workflow, let it run nightly, forget about it.&lt;/p&gt;

&lt;p&gt;Until it stops running. And you don't notice for two weeks.&lt;/p&gt;

&lt;p&gt;Scheduled workflows in GitHub Actions are quietly unreliable. GitHub delays them, skips them during high load, and — most importantly — gives you no built-in alerting when they fail silently. Adding external monitoring takes about 5 minutes and saves you from that two-week discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The basic setup
&lt;/h2&gt;

&lt;p&gt;Here's a minimal scheduled workflow with monitoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Nightly export&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;  &lt;span class="c1"&gt;# 2am UTC every day&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# allows manual triggering for testing&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run export&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python scripts/export.py&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping DeadManCheck&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success()&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }} &amp;gt; /dev/null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step pings &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; only if all previous steps succeeded (&lt;code&gt;if: success()&lt;/code&gt;). If the export script fails, the ping doesn't fire, and you get alerted after your configured grace period.&lt;/p&gt;

&lt;p&gt;Set up the monitor with a 25-hour interval (giving a 1-hour buffer on the 24-hour schedule). Store your token in GitHub: Settings → Secrets and variables → Actions → New repository secret named &lt;code&gt;DEADMANCHECK_TOKEN&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding start/end pings for longer jobs
&lt;/h2&gt;

&lt;p&gt;For jobs that run more than a few minutes, use the start/end pattern. This catches jobs that hang:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping start&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/start &amp;gt; /dev/null || &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run ETL&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;etl&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;python scripts/run_etl.py&lt;/span&gt;
      &lt;span class="s"&gt;echo "rows=$(cat /tmp/etl_row_count.txt)" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping done&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success()&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;curl -fsS -X POST -H "Content-Type: application/json" \&lt;/span&gt;
        &lt;span class="s"&gt;-d "{\"count\": ${{ steps.etl.outputs.rows }}}" \&lt;/span&gt;
        &lt;span class="s"&gt;"https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}" \&lt;/span&gt;
        &lt;span class="s"&gt;&amp;gt; /dev/null || true&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping fail&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;curl -fsS https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/fail &amp;gt; /dev/null || &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your ETL script writes the row count to &lt;code&gt;/tmp/etl_row_count.txt&lt;/code&gt;. The monitoring step picks it up and includes it in the ping — so your monitor can alert on zero-output runs, not just missed runs.&lt;/p&gt;
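&lt;p&gt;On the script side, that can be as simple as writing the count at the end of the run. A sketch; &lt;code&gt;run_etl&lt;/code&gt; here is a stand-in for your real pipeline:&lt;/p&gt;

```python
# Sketch of the tail of scripts/run_etl.py. run_etl() is a stand-in
# for the real pipeline; only the count file matters to the workflow.
def run_etl():
    rows = [{"id": i} for i in range(1547)]  # pretend these came from the source
    return len(rows)

count = run_etl()
with open("/tmp/etl_row_count.txt", "w") as f:
    f.write(str(count))  # the "Ping done" step reads this into $GITHUB_OUTPUT
```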

&lt;h2&gt;
  
  
  The gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GitHub delays scheduled workflows
&lt;/h3&gt;

&lt;p&gt;This is the big one. GitHub's docs admit that scheduled workflows may be delayed during high load. A workflow scheduled for 2:00am UTC might run at 2:23am or 2:51am. During busy periods, delays of 30–60 minutes aren't unusual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't set your DeadManCheck interval to exactly 24 hours.&lt;/strong&gt; Set it to 25 hours. That buffer absorbs GitHub's scheduling jitter without letting real failures go undetected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled workflows stop on inactive repos
&lt;/h3&gt;

&lt;p&gt;If a public repository sees no activity for 60 days, GitHub disables its scheduled workflows. You'll get an email warning first. Miss that email and the job silently stops running; your external monitor catches it even when GitHub's notification doesn't reach you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test with workflow_dispatch before trusting the schedule
&lt;/h3&gt;

&lt;p&gt;Always add &lt;code&gt;workflow_dispatch&lt;/code&gt; as a trigger (it's in all examples above). You can trigger the workflow manually from the Actions tab or via the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh workflow run nightly-export.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test your monitoring integration before the first scheduled run. Confirm the ping appears in your DeadManCheck dashboard with the correct count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secrets aren't available in forks
&lt;/h3&gt;

&lt;p&gt;If your repo is public and someone forks it, &lt;code&gt;secrets.DEADMANCHECK_TOKEN&lt;/code&gt; will be empty in their fork. The curl will fail silently. This is fine — you don't want random forks pinging your monitor — but be aware of it when debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full production example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Nightly database backup&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# hard limit — prevent hung jobs accumulating&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping start&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsS \&lt;/span&gt;
            &lt;span class="s"&gt;"https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/start" \&lt;/span&gt;
            &lt;span class="s"&gt;&amp;gt; /dev/null || true  # don't fail if monitoring is down&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;aws-access-key-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-secret-access-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run backup&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;python scripts/backup.py&lt;/span&gt;
          &lt;span class="s"&gt;echo "rows=$(cat /tmp/backup_row_count.txt)" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload to S3&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws s3 cp /backups/latest.dump s3://my-backups/&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping done&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success()&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsS -X POST -H "Content-Type: application/json" \&lt;/span&gt;
            &lt;span class="s"&gt;-d "{\"count\": ${{ steps.backup.outputs.rows }}}" \&lt;/span&gt;
            &lt;span class="s"&gt;"https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}" \&lt;/span&gt;
            &lt;span class="s"&gt;&amp;gt; /dev/null || true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ping fail&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsS \&lt;/span&gt;
            &lt;span class="s"&gt;"https://deadmancheck.io/ping/${{ secrets.DEADMANCHECK_TOKEN }}/fail" \&lt;/span&gt;
            &lt;span class="s"&gt;&amp;gt; /dev/null || true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timeout-minutes: 30&lt;/code&gt; is a hard ceiling. Without it, a hung job can sit there for 6 hours consuming a runner.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;|| true&lt;/code&gt; on the monitoring pings means a DeadManCheck outage won't make your backup job itself be reported as failed.&lt;/li&gt;
&lt;li&gt;The row count flows from the backup step through &lt;code&gt;$GITHUB_OUTPUT&lt;/code&gt; to the ping step.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  After deploying
&lt;/h2&gt;

&lt;p&gt;Trigger the workflow manually and confirm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The workflow runs end-to-end without errors&lt;/li&gt;
&lt;li&gt;DeadManCheck shows a recent ping on your monitor dashboard&lt;/li&gt;
&lt;li&gt;The count looks correct for what the job processed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Wait for the first scheduled run and verify again. Two successful data points before you trust it.&lt;/p&gt;

&lt;p&gt;Scheduled workflows are one of those things that feel reliable until the day they aren't. External monitoring is the difference between finding out immediately and finding out when someone asks why the weekly report is missing.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>tutorial</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Output assertions: the cron job check most monitoring tools skip</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Tue, 28 Apr 2026 21:28:25 +0000</pubDate>
      <link>https://dev.to/krissv/output-assertions-the-cron-job-check-most-monitoring-tools-skip-15kn</link>
      <guid>https://dev.to/krissv/output-assertions-the-cron-job-check-most-monitoring-tools-skip-15kn</guid>
      <description>&lt;h1&gt;
  
  
  Output assertions: the cron job check most monitoring tools skip
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A follow-up to &lt;a href="https://dev.to/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem"&gt;A reader comment made me realise I'd only solved half the problem&lt;/a&gt; — this is a deeper reference guide on output assertions specifically.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Did it run?" is the wrong question.&lt;/p&gt;

&lt;p&gt;Every monitoring tool asks it. Heartbeat monitors, cron schedulers, even purpose-built tools like Cronitor and Healthchecks.io — they all fundamentally ask: did the job check in? If yes, green. If no, red.&lt;/p&gt;

&lt;p&gt;It's a useful question. But it's not the useful question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure mode that looks like success
&lt;/h2&gt;

&lt;p&gt;Imagine a nightly job that syncs user records from your CRM into your database. It runs at midnight, takes about 90 seconds, and exits cleanly. Your heartbeat monitor sees the ping at 12:01:34am and marks it healthy.&lt;/p&gt;

&lt;p&gt;What it doesn't see: the job synced 0 records. It has been syncing 0 records for eight days, since someone rotated the CRM API credentials and forgot to update the environment variable. The job connects, gets a 401, logs a warning, falls back to a no-op, and exits 0.&lt;/p&gt;

&lt;p&gt;All monitoring: green. Business: broken for eight days.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. Variants of this failure happen constantly. The job ran. That fact is true and also completely useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "did it do anything?" looks like
&lt;/h2&gt;

&lt;p&gt;Output assertions flip the question. Instead of only checking that the job pinged in, you also check what it reported.&lt;/p&gt;

&lt;p&gt;A job that processes records should report how many it processed. A job that generates a file should report the file size. A job that sends emails should report how many it sent. You instrument the job to emit a count — one number representing meaningful work done — and your monitoring layer validates it falls within expected bounds.&lt;/p&gt;

&lt;p&gt;The failure modes this catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero when non-zero expected&lt;/strong&gt;: sync runs, processes nothing, exits clean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suspiciously low counts&lt;/strong&gt;: normally syncs 500 records, today synced 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count drift over time&lt;/strong&gt;: weekly report used to include 10k rows, now consistently 200&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these trip a heartbeat check. All of them are real problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most tools don't do this
&lt;/h2&gt;

&lt;p&gt;Heartbeat monitoring is architecturally simple: job pings URL, URL records timestamp, alerting checks timestamp age. The data model is just "last seen at".&lt;/p&gt;

&lt;p&gt;Output assertions require more: the job must emit structured data, the tool must store it, and the alerting logic must understand what "normal" looks like for that specific job. That's a significantly more complex product to build.&lt;/p&gt;

&lt;p&gt;Most tools solve the simpler problem because it covers the obvious failure mode and is much easier to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to instrument your jobs
&lt;/h2&gt;

&lt;p&gt;The instrumentation is lightweight. Pick a number that represents meaningful work and emit it at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Database backup — report dump file size
&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-Fc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/mydb.dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dump_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getsize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/mydb.dump&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dump_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CRM sync — report records synced
&lt;/span&gt;&lt;span class="n"&gt;synced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sync_from_crm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synced&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Email campaign — report emails sent
&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;send_campaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;campaign_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;ping_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three extra lines per job. The return is knowing your job didn't just run — it did something. (&lt;code&gt;ping_monitor&lt;/code&gt; is a wrapper around your monitoring call — implementation below.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Sending the count to your monitor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; accepts a &lt;code&gt;count&lt;/code&gt; parameter with each ping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"count": 1547}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://deadmancheck.io/ping/YOUR-TOKEN &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You configure the assertion on the monitor: "alert if count is 0" or "alert if count drops below threshold". If the job checks in but reports zero records, you get alerted — even though the job technically ran fine.&lt;/p&gt;
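&lt;p&gt;And the &lt;code&gt;ping_monitor&lt;/code&gt; helper from earlier is just that curl call in Python. One possible sketch, stdlib only, with a placeholder token:&lt;/p&gt;

```python
import json
import urllib.request

PING_URL = "https://deadmancheck.io/ping/YOUR-TOKEN"  # placeholder token

def ping_monitor(count, url=PING_URL, timeout=10):
    """POST the work count as JSON. Network errors are swallowed so a
    monitoring outage never turns a healthy job into a failed one."""
    body = json.dumps({"count": count}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except OSError:
        pass  # urllib.error.URLError is a subclass of OSError
```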

&lt;p&gt;DeadManCheck also does duration monitoring with rolling average anomaly detection. If your 90-second job starts taking 45 minutes, that gets flagged too. Jobs that hang are a separate silent failure mode that output counts don't catch on their own.&lt;/p&gt;
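&lt;p&gt;The core of duration anomaly detection fits in a few lines. This is a toy version: the 4x threshold and the sample minimum are illustrative, not DeadManCheck's actual algorithm:&lt;/p&gt;

```python
def is_duration_anomaly(history, current_s, factor=4.0, min_samples=5):
    """Flag a run whose duration exceeds factor x the rolling average.
    Toy version: real detectors also weigh variance and trends."""
    if len(history) >= min_samples:
        avg = sum(history) / len(history)
        return current_s > factor * avg
    return False  # not enough history to judge

baseline = [88, 92, 90, 91, 89, 93, 90]   # seconds, a week of normal runs
assert is_duration_anomaly(baseline, 2700)      # 45 minutes: flagged
assert not is_duration_anomaly(baseline, 110)   # a bit slow: still fine
```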

&lt;h2&gt;
  
  
  The right question
&lt;/h2&gt;

&lt;p&gt;Monitoring that only asks "did it run?" will eventually lie to you at the worst possible moment.&lt;/p&gt;

&lt;p&gt;The right question is "did it do anything useful?" Output assertions are how you ask that question automatically, at 2am, every night, without anyone having to check.&lt;/p&gt;

&lt;p&gt;Start with your backup jobs. That's where the answer matters most.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>A reader comment made me realise I'd only solved half the problem</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Sat, 25 Apr 2026 13:20:52 +0000</pubDate>
      <link>https://dev.to/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem-3cpg</link>
      <guid>https://dev.to/krissv/a-reader-comment-made-me-realise-id-only-solved-half-the-problem-3cpg</guid>
      <description>&lt;h1&gt;
  
  
  A reader comment made me realise I'd only solved half the problem
&lt;/h1&gt;

&lt;p&gt;Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags.&lt;/p&gt;

&lt;p&gt;The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran. It completed. It exited zero. Every dashboard showed green. Downstream data was silently wrong.&lt;/p&gt;

&lt;p&gt;The fix I described was duration anomaly detection — once you have a few weeks of run history, you know what "normal" looks like. A job that takes 4x its baseline is a signal even if it succeeded. I built &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;DeadManCheck&lt;/a&gt; partly because I couldn't find a tool that combined silence detection with duration tracking.&lt;/p&gt;

&lt;p&gt;The article got some traction. Then someone left a comment that stopped me in my tracks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The comment
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The failure mode I keep seeing: the job runs, logs "complete," and the output silently goes nowhere.&lt;/p&gt;

&lt;p&gt;No error. No alert. Just a cron that appeared healthy while accomplishing nothing for days.&lt;/p&gt;

&lt;p&gt;The fix that actually works is external verification. Don't check that the job ran; check that the downstream artifact exists. A job that succeeds but doesn't write the expected DB record is the same as a failed job.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They were right. And I hadn't covered it.&lt;/p&gt;

&lt;p&gt;Duration anomaly detection catches "job ran slow." Silence detection catches "job didn't run." Neither catches "job ran fine, on time, but produced nothing."&lt;/p&gt;

&lt;p&gt;That's a third failure mode entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here's a simplified backup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE exported = false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backups/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE orders SET exported = true WHERE exported = false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backup complete. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows exported.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can you spot the bug?&lt;/p&gt;

&lt;p&gt;The script runs. It prints "Backup complete. 0 rows exported." It exits cleanly.&lt;/p&gt;

&lt;p&gt;The bug is in a migration from three weeks earlier. A developer renamed the &lt;code&gt;exported&lt;/code&gt; column to &lt;code&gt;is_exported&lt;/code&gt;. The WHERE clause now silently returns nothing. Every night: zero rows fetched, empty CSV written, nothing marked, exit code 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exit code: &lt;code&gt;0&lt;/code&gt;. Monitoring alert: none.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is exactly what the commenter was describing. A job that succeeds but produces nothing is functionally the same as a failed job. Your monitoring just doesn't know that yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the standard fix is hard to scale
&lt;/h2&gt;

&lt;p&gt;The commenter suggested checking the downstream artifact — verify the DB record exists, check the file isn't empty. That's the correct instinct, but it requires custom verification logic for every job. Each job writes to a different place, in a different format, with different expectations about what "something" looks like.&lt;/p&gt;
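&lt;p&gt;Here is roughly what that per-job verification looks like in practice. This is a sketch with an illustrative path and threshold, not code from any particular codebase:&lt;/p&gt;

```python
import os
import sys

def verify_backup(path, min_bytes=1):
    """Fail loudly (non-zero exit) if the artifact is missing or empty."""
    if not os.path.exists(path):
        sys.exit(f"verification failed: {path} does not exist")
    if os.path.getsize(path) < min_bytes:
        sys.exit(f"verification failed: {path} is empty")

# Usage at the end of the backup job, e.g.:
# verify_backup("/backups/orders.csv")
```

It works, but every job needs its own version of it, with its own notion of "empty".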

&lt;p&gt;What I wanted was a generalised version: tell the monitoring service what your job produced, and let it decide if that's suspicious.&lt;/p&gt;

&lt;p&gt;That's what I built into DeadManCheck as output assertions.&lt;/p&gt;




&lt;h2&gt;
  
  
  How output assertions work
&lt;/h2&gt;

&lt;p&gt;The idea is simple. When your job pings the monitoring service at completion, it includes a count of what it actually did:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"count": 0}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://deadmancheck.io/ping/YOUR-TOKEN &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You configure a rule: "alert if count is 0 more than once in a row" or "alert if count drops more than 80% below the rolling average."&lt;/p&gt;

&lt;p&gt;The job ran. It just did nothing. Now you know.&lt;/p&gt;
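&lt;p&gt;Under the hood, a rule like that reduces to a small amount of arithmetic over run history. Here is a minimal sketch of how such a rule could be evaluated; the window size and thresholds are illustrative, and this is not DeadManCheck's actual implementation:&lt;/p&gt;

```python
from statistics import mean

def should_alert(history, latest, zero_streak_limit=2, drop_ratio=0.8):
    """Alert if the count has been zero too many runs in a row,
    or has dropped too far below the rolling average."""
    # Rule 1: zero more than once in a row (newest first)
    zeros = 0
    for c in [latest] + history[::-1]:
        if c != 0:
            break
        zeros += 1
    if zeros >= zero_streak_limit:
        return True
    # Rule 2: more than drop_ratio below the rolling average
    recent = history[-30:]  # rolling window of prior run counts
    if recent and latest < mean(recent) * (1 - drop_ratio):
        return True
    return False
```

For example, `should_alert([120, 115, 130, 0], 0)` trips the streak rule, and `should_alert([100, 110, 95], 10)` trips the drop rule.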

&lt;p&gt;In Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEADMANCHECK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://deadmancheck.io/ping/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# never let monitoring break the job
&lt;/span&gt;
&lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_the_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten lines. The complexity stays in the service, not in your scripts. And unlike checking a downstream artifact, it works the same way regardless of what your job actually produces.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full picture: three failure modes
&lt;/h2&gt;

&lt;p&gt;After that comment, I updated my mental model. There are three distinct ways a cron job can fail silently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;th&gt;What catches it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job doesn't run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Silence. No ping arrives.&lt;/td&gt;
&lt;td&gt;Dead man's switch (silence detection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job runs slow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ping arrives late or after too long&lt;/td&gt;
&lt;td&gt;Duration anomaly detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job runs, produces nothing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ping arrives on time, output is empty&lt;/td&gt;
&lt;td&gt;Output assertions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most tools only cover the first row. Some cover the first two. The third is almost always a blind spot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I do now
&lt;/h2&gt;

&lt;p&gt;Every background job I write now has three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A counter variable tracking records processed&lt;/li&gt;
&lt;li&gt;A guard clause that exits non-zero when the count is zero, for jobs where zero is never a valid outcome&lt;/li&gt;
&lt;li&gt;A heartbeat ping that includes the count
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do_the_work&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rows_processed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed 0 records — investigate before marking success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;ping_deadmancheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rows_processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For jobs where zero is sometimes valid (quiet periods, weekends), skip the guard clause and let the monitoring service decide based on historical patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Credit where it's due
&lt;/h2&gt;

&lt;p&gt;I wouldn't have built output assertions without that comment. Sometimes the feature request hiding in a code review or a reply thread is the most valuable one you'll get.&lt;/p&gt;

&lt;p&gt;If you've got a background job running right now, ask yourself three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will I know if it silently stops running?&lt;/li&gt;
&lt;li&gt;Will I know if it starts taking 4x longer than normal?&lt;/li&gt;
&lt;li&gt;Will I know if it ran perfectly but accomplished nothing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of those is "no" — that's your monitoring gap.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;Try DeadManCheck free at deadmancheck.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>backend</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The cron job failure mode nobody talks about</title>
      <dc:creator>Kriss</dc:creator>
      <pubDate>Sun, 29 Mar 2026 19:04:31 +0000</pubDate>
      <link>https://dev.to/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a</link>
      <guid>https://dev.to/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a</guid>
      <description>&lt;p&gt;A few months ago, a nightly ETL job at a previous job nearly cost us a major client. Not because it failed. Because it took four hours instead of forty minutes — and nobody noticed for six days.&lt;/p&gt;

&lt;p&gt;The job ran. It completed. It exited zero. Every monitoring dashboard showed green. Meanwhile, the downstream data pipeline was ingesting half-processed records, and reports were silently wrong. By the time a client flagged it, we had six days of corrupted reporting to unpick.&lt;/p&gt;

&lt;p&gt;This is the failure mode nobody talks about: the job that doesn't die, it just... drags.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your existing monitoring misses it
&lt;/h2&gt;

&lt;p&gt;If you're using Healthchecks.io, Better Uptime, or a similar dead man's switch tool, here's how it works: your cron job pings a URL at the end of each run. If the ping doesn't arrive within a grace window, you get an alert.&lt;/p&gt;

&lt;p&gt;That's genuinely useful. It catches jobs that crash, hang indefinitely, or never start. But what it doesn't catch is a job that completes in 240 minutes when it should take 45. The ping arrives. The check passes. Everything looks fine. The tool has no idea what "normal" looks like for that job — it only knows silence vs. noise.&lt;/p&gt;

&lt;p&gt;Duration anomaly detection is the missing piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  What duration anomaly detection actually means
&lt;/h2&gt;

&lt;p&gt;The concept is simple: instead of only checking whether a job completed, you also check how long it took.&lt;/p&gt;

&lt;p&gt;Once you have a few weeks of run history, you know that your nightly job usually takes 40–50 minutes. So when it takes four hours, that's a signal — even if it succeeded. Something changed: the dataset grew, a dependency got slow, a query plan degraded, a network hop started timing out and retrying.&lt;/p&gt;

&lt;p&gt;Catching this early means you can investigate before it causes damage downstream.&lt;/p&gt;
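&lt;p&gt;The core check is simple enough to sketch. Assuming a rolling window of past run durations and a configurable multiplier (both illustrative, not any tool's actual implementation), it could look like:&lt;/p&gt;

```python
from statistics import mean

def duration_anomalous(history_minutes, latest_minutes, factor=2.0, min_runs=5):
    """Flag a run that exceeds `factor` times the rolling average duration."""
    if len(history_minutes) < min_runs:
        return False  # not enough history to know what "normal" is yet
    baseline = mean(history_minutes[-30:])  # rolling window of recent runs
    return latest_minutes > baseline * factor
```

With a few weeks of 40–50 minute runs in the history, a 240-minute run gets flagged even though it exited zero.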

&lt;h2&gt;
  
  
  The /start + /finish pattern
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Job begins&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://deadmancheck.io/ping/abc123/start"&lt;/span&gt;

&lt;span class="c"&gt;# ... your actual job logic ...&lt;/span&gt;

&lt;span class="c"&gt;# Job ends&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://deadmancheck.io/ping/abc123"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the monitoring service knows: this run started at T, it ended at T+4h. It compares that against the rolling average of previous runs and alerts if the duration exceeds a configurable threshold — say, 2x the usual runtime. Two curl calls. The complexity lives in the service, not in your scripts.&lt;/p&gt;
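&lt;p&gt;If your job is a Python script rather than a shell one-liner, the same pattern fits naturally into a context manager. A sketch (the URL shape mirrors the curl calls above; the injectable &lt;code&gt;send&lt;/code&gt; parameter exists only for testability):&lt;/p&gt;

```python
import contextlib
import urllib.request

@contextlib.contextmanager
def monitored(token, base="https://deadmancheck.io/ping", send=None):
    """Ping <base>/<token>/start on entry and <base>/<token> on clean exit."""
    if send is None:
        def send(url):
            try:
                urllib.request.urlopen(url, timeout=5)
            except OSError:
                pass  # never let monitoring break the job
    send(f"{base}/{token}/start")
    yield
    # Only reached when the body completed without raising, so a crash
    # mid-job shows up as a missing finish ping.
    send(f"{base}/{token}")

# Usage:
# with monitored("abc123"):
#     run_etl()
```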

&lt;h2&gt;
  
  
  Why this matters more as systems age
&lt;/h2&gt;

&lt;p&gt;New jobs are fast. As systems mature, things get slower in ways that creep up on you. Rows accumulate. Indexes bloat. Third-party APIs introduce latency. Your job that took 8 minutes in January takes 35 minutes in October.&lt;/p&gt;

&lt;p&gt;Without duration tracking, you have no visibility into this degradation. With it, you have a canary. The alert fires at 70 minutes, you investigate, you find the index that needs rebuilding. Crisis averted before the downstream effects compound.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I built this
&lt;/h2&gt;

&lt;p&gt;After looking for a tool that combined silence detection with duration anomaly detection and not finding one, I built DeadManCheck (deadmancheck.io). It supports the /start + /finish pattern, tracks rolling run history, and alerts you when a job takes significantly longer than its baseline. Standard silence detection is included too, so both failure modes are covered in one place.&lt;/p&gt;

&lt;p&gt;Free tier available, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The checklist
&lt;/h2&gt;

&lt;p&gt;Next time you wire up a cron job, ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will I know if this job silently stops running?&lt;/li&gt;
&lt;li&gt;Will I know if this job starts taking 4x longer than normal?&lt;/li&gt;
&lt;li&gt;Will I know before my users do?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to any of those is "no", you have a monitoring gap. It's a small one to close.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://deadmancheck.io" rel="noopener noreferrer"&gt;Try DeadManCheck free at deadmancheck.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>backend</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
