A few months ago, at a previous company, a nightly ETL job nearly cost us a major client. Not because it failed. Because it took four hours instead of forty minutes — and nobody noticed for six days.
The job ran. It completed. It exited zero. Every monitoring dashboard showed green. Meanwhile, the downstream data pipeline was ingesting half-processed records, and reports were silently wrong. By the time a client flagged it, we had six days of corrupted reporting to unpick.
This is the failure mode nobody talks about: the job that doesn't die, it just... drags.
Why your existing monitoring misses it
If you're using Healthchecks.io, Better Uptime, or a similar dead man's switch tool, here's how it works: your cron job pings a URL at the end of each run. If the ping doesn't arrive within a grace window, you get an alert.
That's genuinely useful. It catches jobs that crash, hang indefinitely, or never start. But what it doesn't catch is a job that completes in 240 minutes when it should take 45. The ping arrives. The check passes. Everything looks fine. The tool has no idea what "normal" looks like for that job — it only knows silence vs. noise.
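Wired into cron, that end-of-run ping is a one-line change. A sketch (the ping URL is a placeholder — substitute your own check's URL); the `&&` means the ping only fires if the job exits zero, so a crashed job stays silent and trips the alert:

```shell
# crontab entry: run the job, then ping the check URL only on success
# (URL is a placeholder — use your own check's ping URL)
30 2 * * * /usr/local/bin/nightly_etl.sh && curl -fsS https://deadmancheck.io/ping/abc123
```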
Duration anomaly detection is the missing piece.
What duration anomaly detection actually means
The concept is simple: instead of only checking whether a job completed, you also check how long it took.
Once you have a few weeks of run history, you know that your nightly job usually takes 40–50 minutes. So when it takes four hours, that's a signal — even if it succeeded. Something changed: the dataset grew, a dependency got slow, a query plan degraded, a network hop started timing out and retrying.
Catching this early means you can investigate before it causes damage downstream.
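The comparison itself is simple arithmetic. A minimal sketch of the idea in shell — illustrative only, with made-up durations and a hardcoded window, not the service's real algorithm:

```shell
# Illustrative anomaly check: keep a rolling window of past run
# durations and alert when the latest run exceeds a multiple of
# their average. Durations are in minutes.
history="42 47 44 45 41"   # previous runs (rolling window)
latest=240                 # the suspicious four-hour run
threshold=2                # alert at 2x the baseline

avg=$(echo "$history" | tr ' ' '\n' | awk '{s += $1; n++} END {print s / n}')
limit=$(awk -v a="$avg" -v t="$threshold" 'BEGIN {print a * t}')

if awk -v l="$latest" -v lim="$limit" 'BEGIN {exit !(l > lim)}'; then
  echo "ALERT: run took ${latest}m against a ${avg}m baseline"
fi
```

With these numbers the baseline average is 43.8 minutes, so a 240-minute run is flagged immediately — even though it "succeeded".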
The /start + /finish pattern
```shell
# Job begins
curl -s "https://deadmancheck.io/ping/abc123/start"

# ... your actual job logic ...

# Job ends
curl -s "https://deadmancheck.io/ping/abc123"
```
Now the monitoring service knows: this run started at T, it ended at T+4h. It compares that against the rolling average of previous runs and alerts if the duration exceeds a configurable threshold — say, 2x the usual runtime. Two curl calls. The complexity lives in the service, not in your scripts.
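In a real script you'd usually wrap the two pings so the finish ping only fires when the job exits zero — that way a crashed job trips the silence alert instead of reporting a clean run. A sketch (the function name, URL, and job command are placeholders):

```shell
# with_check: ping /start, run the job, ping the base URL only on success.
with_check() {
  url="$1"; shift
  # Start ping; don't let a monitoring outage break the job itself.
  curl -fsS --max-time 10 "$url/start" || true
  "$@"                 # the actual job
  status=$?
  if [ "$status" -eq 0 ]; then
    curl -fsS --max-time 10 "$url"   # finish ping, success only
  fi
  return "$status"
}

# Usage:
# with_check "https://deadmancheck.io/ping/abc123" /usr/local/bin/nightly_etl.sh
```

The `--max-time 10` keeps a slow monitoring endpoint from ever blocking the job, and the `|| true` on the start ping means a monitoring outage doesn't abort the work.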
Why this matters more as systems age
New jobs are fast. As systems mature, things get slower in ways that creep up on you. Rows accumulate. Indexes bloat. Third-party APIs introduce latency. Your job that took 8 minutes in January takes 35 minutes in October.
Without duration tracking, you have no visibility into this degradation. With it, you have a canary. The alert fires at 70 minutes, you investigate, you find the index that needs rebuilding. Crisis averted before the downstream effects compound.
So I built this
After looking for a tool that combined silence detection with duration anomaly detection and not finding one, I built DeadManCheck (deadmancheck.io). It supports the /start + /finish pattern, tracks rolling run history, and alerts you when a job takes significantly longer than its baseline. Standard silence detection is included too, so both failure modes are covered in one place.
Free tier available, no credit card required.
The checklist
Next time you wire up a cron job, ask yourself:
- Will I know if this job silently stops running?
- Will I know if this job starts taking 4x longer than normal?
- Will I know before my users do?
If the answer to any of those is "no", you have a monitoring gap. It's a small one to close.
→ Try DeadManCheck free at deadmancheck.io