Kriss

A reader comment made me realise I'd only solved half the problem
Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags.

The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran. It completed. It exited zero. Every dashboard showed green. Downstream data was silently wrong.

The fix I described was duration anomaly detection — once you have a few weeks of run history, you know what "normal" looks like. A job that takes 4x its baseline is a signal even if it succeeded. I built DeadManCheck partly because I couldn't find a tool that combined silence detection with duration tracking.
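As an illustration of the idea (not DeadManCheck's actual algorithm), a baseline check can be as simple as comparing the latest run against the median of recent history:

```python
from statistics import median

def is_duration_anomaly(history_seconds, latest_seconds, factor=4.0, min_runs=10):
    """Flag a run that took `factor`x longer than the median of past runs."""
    if len(history_seconds) < min_runs:
        return False  # not enough history to know what "normal" looks like
    return latest_seconds > factor * median(history_seconds)

history = [40 * 60] * 14                            # two weeks of ~40-minute runs
print(is_duration_anomaly(history, 4 * 60 * 60))    # a four-hour run -> True
```

The thresholds here are made up; the point is that once you have history, "slow but successful" becomes a detectable state.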

The article got some traction. Then someone left a comment that stopped me in my tracks (original post: https://dev.to/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a).


The comment

> The failure mode I keep seeing: the job runs, logs "complete," and the output silently goes nowhere.
>
> No error. No alert. Just a cron that appeared healthy while accomplishing nothing for days.
>
> The fix that actually works is external verification. Don't check that the job ran; check that the downstream artifact exists. A job that succeeds but doesn't write the expected DB record is the same as a failed job.

They were right. And I hadn't covered it.

Duration anomaly detection catches "job ran slow." Silence detection catches "job didn't run." Neither catches "job ran fine, on time, but produced nothing."

That's a third failure mode entirely.


What this looks like in practice

Here's a simplified backup script:

import psycopg2
import csv
import os

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

cur.execute("SELECT * FROM orders WHERE exported = false")
rows = cur.fetchall()

with open("/backups/orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

cur.execute("UPDATE orders SET exported = true WHERE exported = false")
conn.commit()
conn.close()

print(f"Backup complete. {len(rows)} rows exported.")

Can you spot the bug?

The script runs. It prints "Backup complete. 0 rows exported." It exits cleanly.

The bug is in a migration from three weeks earlier. A developer renamed the exported column to is_exported. The WHERE clause now silently returns nothing. Every night: zero rows fetched, empty CSV written, nothing marked, exit code 0.

Exit code: 0. Monitoring alert: none.

This is exactly what the commenter was describing. A job that succeeds but produces nothing is functionally the same as a failed job. Your monitoring just doesn't know that yet.


Why the standard fix is hard to scale

The commenter suggested checking the downstream artifact — verify the DB record exists, check the file isn't empty. That's the correct instinct, but it requires custom verification logic for every job. Each job writes to a different place, in a different format, with different expectations about what "something" looks like.
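For comparison, here's roughly what one of those hand-rolled, per-job checks might look like; the path and freshness threshold are invented for the example:

```python
import os
import time

def artifact_looks_healthy(path, max_age_seconds=26 * 3600, min_bytes=1):
    """Return True if the file exists, is non-empty, and was written recently."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    fresh = (time.time() - st.st_mtime) <= max_age_seconds
    return st.st_size >= min_bytes and fresh

# Run this from a separate cron/systemd timer, not from the job itself,
# so a dead job can't also kill its own verification.
if not artifact_looks_healthy("/backups/orders.csv"):
    print("orders backup is missing, empty, or stale")  # or exit non-zero / page someone
```

It works, but you need a different version of it for every job: a DB-row check here, a file check there, an S3 head request somewhere else.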

What I wanted was a generalised version: tell the monitoring service what your job produced, and let it decide if that's suspicious.

That's what I built into DeadManCheck as output assertions.


How output assertions work

The idea is simple. When your job pings the monitoring service at completion, it includes a count of what it actually did:

curl -fsS "https://deadmancheck.io/ping/YOUR-TOKEN?count=0" > /dev/null

You configure a rule: "alert if count is 0 more than once in a row" or "alert if count drops more than 80% below the rolling average."
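Server-side, that second rule could be evaluated roughly like this; this is a sketch of the concept, not the service's real implementation:

```python
def count_is_suspicious(recent_counts, latest, max_drop=0.8, window=7):
    """Alert when `latest` falls more than `max_drop` below the rolling average."""
    window_counts = recent_counts[-window:]
    if not window_counts:
        return False  # no history yet; nothing to compare against
    rolling_avg = sum(window_counts) / len(window_counts)
    return latest < (1 - max_drop) * rolling_avg

history = [1000, 950, 1100, 980, 1020, 990, 1005]
print(count_is_suspicious(history, 0))    # True  -- the "silent zero" case
print(count_is_suspicious(history, 900))  # False -- a normal quiet-ish night
```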

The job ran. It just did nothing. Now you know.

In Python:

import requests
import os

def ping_deadmancheck(count=None):
    token = os.environ["DEADMANCHECK_TOKEN"]
    params = {"count": count} if count is not None else {}
    try:
        requests.get(f"https://deadmancheck.io/ping/{token}", params=params, timeout=5)
    except requests.RequestException:
        pass  # never let monitoring break the job

rows_processed = do_the_work()
ping_deadmancheck(count=rows_processed)

Ten lines. The complexity stays in the service, not in your scripts. And unlike checking a downstream artifact, it works the same way regardless of what your job actually produces.


The full picture: three failure modes

After that comment, I updated my mental model. There are three distinct ways a cron job can fail silently:

| Failure mode | What happens | What catches it |
| --- | --- | --- |
| Job doesn't run | Silence. No ping arrives. | Dead man's switch (silence detection) |
| Job runs slow | Ping arrives late, or after an abnormally long run | Duration anomaly detection |
| Job runs, produces nothing | Ping arrives on time, but the output is empty | Output assertions |

Most tools only cover the first row. Some cover the first two. The third is almost always a blind spot.


What I do now

Every background job I write now has three things:

  1. A counter variable tracking records processed
  2. A guard clause that exits non-zero when the count is zero, for jobs where zero is never a valid outcome
  3. A heartbeat ping that includes the count
rows_processed = do_the_work()

if rows_processed == 0:
    raise RuntimeError("Processed 0 records — investigate before marking success")

ping_deadmancheck(count=rows_processed)

For jobs where zero is sometimes valid (quiet periods, weekends), skip the guard clause and let the monitoring service decide based on historical patterns.


Credit where it's due

I wouldn't have built output assertions without that comment. Sometimes the feature request hiding in a code review or a reply thread is the most valuable one you'll get.

If you've got a background job running right now, ask yourself three questions:

  • Will I know if it silently stops running?
  • Will I know if it starts taking 4x longer than normal?
  • Will I know if it ran perfectly but accomplished nothing?

If any of those is "no" — that's your monitoring gap.

Try DeadManCheck free at deadmancheck.io
