
Krasimir Petkov

I built a small tool to notice when cron jobs fail silently

I only found out that one of my background jobs had stopped running when the data looked wrong the next day.

There was no dramatic crash. No big incident. The job just quietly failed, and I only noticed because something downstream looked stale.

That is the annoying part about cron jobs and scheduled scripts. Most of the time they run in the background, write some logs, and nobody thinks about them until something is missing.

I have a few jobs like this:

  • data updates
  • cleanup scripts
  • small imports
  • external API calls
  • recurring background tasks

None of them are very exciting. But when one of them does not run, or starts and never finishes, it can create a surprisingly annoying problem.

That is the kind of failure I wanted to make more visible.

I also built a small V1 of this idea here:

https://missedrun.com

This is not a big launch. I am mostly trying to understand if this is a real enough problem for other developers who run cron jobs, ETL jobs, backups, imports, cleanup scripts, or other scheduled tasks.

The problem

Cron jobs are easy to forget about.

They usually do not have a UI. They run somewhere on a server, maybe write logs, and then disappear into the background.

A job can fail because:

  • an API token expired
  • an environment variable is missing
  • a database connection failed
  • the server restarted
  • the script crashed
  • the job started but never finished
  • the cron entry was changed or removed

Logs are useful, but only if you go and check them.

In practice, I usually only check logs after I already suspect something is broken.

For recurring jobs, I often want a much simpler answer:

  • did it start?
  • did it finish?
  • did it fail?
  • did it miss the expected time?

The ping approach

One simple way to monitor this is to make the job report its own status.

The basic pattern is:

  1. send a start ping when the job begins
  2. send a success ping when it finishes
  3. send a failure ping if it crashes
  4. mark it as late or missed if the expected ping does not arrive

It is not a complicated idea, but I have found it very useful in practice.

Instead of checking logs manually, the job tells you whether it is still alive.

For example:

  • if the start ping arrives, the job is running
  • if the success ping arrives, the job finished
  • if the fail ping arrives, the job crashed
  • if nothing arrives when expected, the job is late or missed
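That "nothing arrives" case means the detection has to live on the monitoring side, not in the job itself. As a rough sketch of the idea (the function name, the `interval`, and the `grace` period are my own illustrative choices, not anyone's actual API), the check boils down to comparing the last ping time against the expected schedule:

```python
from datetime import datetime, timedelta


def job_status(last_ping: datetime, interval: timedelta,
               grace: timedelta, now: datetime) -> str:
    """Classify a monitor from when its last ping arrived.

    Sketch logic: a job is 'late' once the expected interval has
    passed, and 'missed' once the grace period has also elapsed.
    """
    expected_by = last_ping + interval
    if now <= expected_by:
        return "ok"
    if now <= expected_by + grace:
        return "late"
    return "missed"


# Example: a job expected hourly, with a 10-minute grace period
last = datetime(2024, 1, 1, 12, 0)
print(job_status(last, timedelta(hours=1), timedelta(minutes=10),
                 now=datetime(2024, 1, 1, 13, 5)))  # late
```

The grace period matters in practice: cron start times drift by seconds or minutes, so alerting the instant the interval elapses would mostly produce noise.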

That last case is the important one for me.

A lot of failures are not loud. The job does not always send an error. Sometimes it just does not run.

Bash example

Here is a simple shell wrapper.

This uses placeholder URLs. In a real setup, these would be the ping URLs generated by your monitoring tool.

#!/bin/bash

START_URL="https://example.com/ping/YOUR_TOKEN/start"
SUCCESS_URL="https://example.com/ping/YOUR_TOKEN"
FAIL_URL="https://example.com/ping/YOUR_TOKEN/fail"

curl -fsS -X POST --max-time 5 "$START_URL" >/dev/null || true

your-real-command-here

EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
  curl -fsS -X POST --max-time 5 "$SUCCESS_URL" >/dev/null || true
else
  curl -fsS -X POST --max-time 5 "$FAIL_URL" >/dev/null || true
fi

exit $EXIT_CODE

The important part is that the monitoring calls should not break the real job.

That is why the curl calls use || true. If the monitoring service is temporarily unavailable, the actual job should still be able to run.

Python example

The same idea can be used inside a Python job.

import requests

START_URL = "https://example.com/ping/YOUR_TOKEN/start"
SUCCESS_URL = "https://example.com/ping/YOUR_TOKEN"
FAIL_URL = "https://example.com/ping/YOUR_TOKEN/fail"


def ping(url: str) -> None:
    try:
        requests.post(url, timeout=5)
    except requests.RequestException:
        pass


try:
    ping(START_URL)

    # Run your real job here
    print("doing work...")

    ping(SUCCESS_URL)

except Exception:
    ping(FAIL_URL)
    raise
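If you have several jobs, the try/except above can also be packaged once as a context manager. This is just a sketch of that refactor (the `monitored` name and the callable-`ping` parameter are my own; in a real script you would pass the `ping` helper from the example above):

```python
import contextlib


@contextlib.contextmanager
def monitored(ping, start_url, success_url, fail_url):
    """Send start/success/fail pings around a job body.

    `ping` is any function that POSTs to a URL and swallows network
    errors, so monitoring can never break the job itself.
    """
    ping(start_url)
    try:
        yield
        ping(success_url)
    except Exception:
        ping(fail_url)
        raise


# Demo with a fake ping that just records which URLs were hit
sent = []
with monitored(sent.append, "/start", "/ok", "/fail"):
    pass
print(sent)  # ['/start', '/ok']
```

The nice property is that the failure ping and the re-raise live in one place, so every job wrapped this way reports consistently.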

Again, the pings are not meant to replace logs.

Logs still matter when you need to debug what happened. The pings are just a simple way to know that something happened.

What I built

I built a small V1 around this idea.

The flow is simple:

  1. create a monitor
  2. choose how often the job is expected to run
  3. add the ping URLs to the job
  4. get notified if the job fails, gets stuck, runs late, or misses its expected time

The current version is intentionally simple.

I am not trying to replace full observability tools. For now, I am mostly thinking about the boring jobs that do important work in the background but do not need a huge monitoring setup.

Examples:

  • nightly imports
  • database cleanup jobs
  • billing syncs
  • backup scripts
  • report generation
  • small ETL jobs
  • scripts that call third-party APIs

For these, I mostly want to know:

Did the job run when it was supposed to?

And if not, I want to know before I notice stale data later.

Why I made it

I built this because I had the problem myself.

I had background jobs running, but I did not always have a good way to know when one silently stopped.

Checking logs manually does not scale well, even for small projects. And full observability tools can feel like too much when the thing you want to monitor is just a cron job or a small script.

So I wanted something very basic:

  • one URL for start
  • one URL for success
  • one URL for failure
  • email alert when something looks wrong

That is basically it.

What I am still figuring out

The main thing I am trying to understand is whether this is useful beyond my own use case.

I know it helps me, but I am still trying to learn how other developers handle this.

Maybe people already use something like this. Maybe they use logs, cron emails, healthchecks, uptime monitors, custom scripts, or their existing observability stack.

That is the feedback I am looking for.

If you run cron jobs, ETL jobs, backups, imports, cleanup scripts, or other scheduled tasks:

How do you currently notice when one silently stops running?

And is this kind of simple ping-based monitoring something you would actually use?

Top comments (1)

Krasimir Petkov • Edited

Curious how others think about this:

What’s worse in your setup — a cron job that fails loudly, or one that never runs at all?

For me, the missed case is usually worse because there’s no visible crash. I just notice later that some data is stale.