Ramon

The Cron Job That Lied to You

Your nightly backup ran at 2 AM. The ping arrived on schedule. No alerts, no incidents, nothing in your dashboard but a row of green checkmarks.

The backup file was empty.

The job ran. It checked in. It lied to you.

Basic heartbeat monitoring solves one problem: knowing whether your job ran at all. If the ping stops arriving, you get alerted. That is genuinely useful and it catches a whole class of failures. But there is a quieter category of failure that heartbeat monitoring alone does not cover. The job shows up, does its ping, and something is still wrong.

Here are the four ways that happens.

The job finished, but so did another copy of it

Your sync job runs every five minutes and usually completes in about 90 seconds. One night the database gets slow. The job starts taking six minutes. Cron does not know this. At the five-minute mark it fires a new instance. Now you have two copies of the same job running at the same time, both reading and writing to the same tables.

Each one eventually finishes and pings success. Your monitor sees two pings and is perfectly happy. Your data has duplicate records in it.

This is what overlap detection catches. When you send a start ping at the beginning of a job, PulseMon tracks whether the previous run finished before the new one begins. If it did not, you get an immediate alert. Not after the data is corrupted. At the moment the second instance starts.

# Start of job (quote the URL so the shell never globs the "?")
curl -fsS "https://pulsemon.dev/api/ping/sync-job?status=start"

# ... your job logic ...

# End of job
curl -fsS "https://pulsemon.dev/api/ping/sync-job"
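For jobs written in Python, that start/end pair can be wrapped once and reused. A minimal sketch, assuming nothing beyond the article's ping URLs; `monitored` and the pluggable `send` callable are illustrative names, not part of PulseMon:

```python
import contextlib

@contextlib.contextmanager
def monitored(slug, send):
    """Send a start ping, run the with-body, then send a success
    ping, or a fail ping if the body raises. `send` is any callable
    that takes a URL, so the HTTP transport stays pluggable."""
    base = f"https://pulsemon.dev/api/ping/{slug}"
    send(base + "?status=start")
    try:
        yield
    except Exception:
        send(base + "?status=fail")
        raise
    else:
        send(base)
```

In a real job, `send` would be something like `lambda url: requests.get(url, timeout=10)`, and the job body goes inside `with monitored("sync-job", send):`.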

The job finished, but it took way longer than it should have

This one is subtle because it looks completely fine from the outside. The job ran, it finished, it pinged. But it usually takes four minutes and today it took 47.

That is almost always a sign something upstream is struggling. A slow query. A downstream API responding at a crawl. A dataset that has grown past a threshold your job was not designed for. The job will probably fail completely within the next few runs. Or it will keep completing slowly, quietly degrading until it starts missing its window.

Duration thresholds let you set a ceiling on how long a job should take. If the job checks in successfully but blew past that ceiling, you get alerted. The job succeeded by every technical measure and you still get notified, because the duration is itself a signal worth acting on.
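The threshold itself lives on the monitor, but it can be useful to keep a local belt-and-braces check too, so a slow run also shows up in the job's own logs. A hedged sketch; `run_with_ceiling` is an illustrative helper of my own, not a PulseMon feature:

```python
import time

def run_with_ceiling(job, ceiling_seconds, warn):
    """Time `job` and call `warn` if the run exceeded the ceiling.
    A local complement to a server-side duration threshold, not a
    replacement for it."""
    started = time.monotonic()
    job()
    elapsed = time.monotonic() - started
    if elapsed > ceiling_seconds:
        warn(f"run took {elapsed:.1f}s, ceiling is {ceiling_seconds}s")
    return elapsed
```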

The job failed, but your monitor was going to wait it out

Without explicit failure signalling, heartbeat monitoring works on absence. You set an interval, and if no ping arrives by the deadline, the monitor marks the job as down and you get alerted.

The problem is the window. If your job is supposed to run every 30 minutes and it fails immediately, you might not find out for 30 minutes. Plus the grace period. That is a long time to wait on a payment processor job or an order fulfilment worker.

The fix is a fail ping. When your job catches an error, it can tell PulseMon directly instead of just going quiet.

import requests

try:
    run_invoice_job()
    requests.get("https://pulsemon.dev/api/ping/invoice-job", timeout=10)
except Exception:
    requests.get(
        "https://pulsemon.dev/api/ping/invoice-job?status=fail",
        timeout=10
    )
    raise

A fail ping fires an immediate alert. You find out in seconds, not when the deadline expires.
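One design detail worth getting right here: a ping should never take the job down with it. If PulseMon or the network is unreachable, the invoice job itself should still run to completion. A small sketch of a best-effort wrapper; the helper name is my own:

```python
def best_effort(send, log=print):
    """Wrap a ping sender so a monitoring outage can never crash
    the job itself: exceptions from `send` are logged, not raised."""
    def safe_send(url):
        try:
            send(url)
            return True
        except Exception as exc:
            log(f"ping to {url} failed: {exc}")
            return False
    return safe_send
```

Wrapping `lambda url: requests.get(url, timeout=10)` in `best_effort` gives you a sender you can call from any job without a second try/except.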

The job missed its deadline and you have no idea why

This one is not a lie exactly. The job did not check in, you got alerted, something is clearly wrong. But then what? You SSH into the server, check the logs, and try to piece together what happened from whatever the job managed to write before it died.

The ping body changes this. You can POST your job's output with the ping, and when the alert fires it includes that output. The failure context comes to you instead of you going to find it.

OUTPUT=$(your-job-command 2>&1)
STATUS=$?

if [ $STATUS -eq 0 ]; then
    curl -fsS -X POST \
      -d "$OUTPUT" \
      "https://pulsemon.dev/api/ping/your-job"
else
    curl -fsS -X POST \
      -d "$OUTPUT" \
      "https://pulsemon.dev/api/ping/your-job?status=fail"
fi

Now your alert email contains the last thing the job printed before it went wrong. No SSH required.
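The same capture-and-report pattern works from Python for jobs that shell out to a command. A sketch only; `run_and_report` is an illustrative name, and the transport is injected so you can back it with `requests.post` or anything else:

```python
import subprocess

def run_and_report(cmd, slug, post):
    """Run `cmd`, capture stdout and stderr, and POST the combined
    output with a success or fail ping so the alert carries the
    job's last words. `post` is any callable taking (url, body)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    url = f"https://pulsemon.dev/api/ping/{slug}"
    if result.returncode != 0:
        url += "?status=fail"
    post(url, output)
    return result.returncode
```

With `post=lambda url, body: requests.post(url, data=body, timeout=10)`, a non-zero exit code turns into an immediate fail ping that already contains the command's output.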

What a successful ping actually means

A ping tells you the job reached the line of code that fires the request. That is it. It says nothing about whether the job ran in isolation, whether it finished in a reasonable time, or whether it failed and told you immediately.

These four features are not replacements for heartbeat monitoring. They sit on top of it. A ping is still the foundation. But a ping on its own is a pretty low bar for "everything is fine."

The jobs that bite you worst are not the ones that go completely dark. Those are obvious. The hard ones are the jobs that keep showing up, keep checking in, and are quietly doing something wrong every single time.


PulseMon supports start, success, and fail pings, duration thresholds, overlap detection, and ping body in alerts on all plans. Free tier includes 30 monitors. No credit card required. pulsemon.dev
