I found out one of my background jobs had stopped running only after the data looked wrong the next day.
There was no dramatic crash. No big incident. The job just quietly failed, and I only noticed because something downstream looked stale.
That is the annoying part about cron jobs and scheduled scripts. Most of the time they run in the background, write some logs, and nobody thinks about them until something is missing.
I have a few jobs like this:
- data updates
- cleanup scripts
- small imports
- external API calls
- recurring background tasks
None of them are very exciting. But when one of them does not run, or starts and never finishes, it can create a surprisingly annoying problem.
That is the kind of failure I wanted to make more visible.
I also built a small V1 of this idea here:
This is not a big launch. I am mostly trying to understand if this is a real enough problem for other developers who run cron jobs, ETL jobs, backups, imports, cleanup scripts, or other scheduled tasks.
The problem
Cron jobs are easy to forget about.
They usually do not have a UI. They run somewhere on a server, maybe write logs, and then disappear into the background.
A job can fail because:
- an API token expired
- an environment variable is missing
- a database connection failed
- the server restarted
- the script crashed
- the job started but never finished
- the cron entry was changed or removed
Logs are useful, but only if you go and check them.
In practice, I usually only check logs after I already suspect something is broken.
For recurring jobs, I often want a much simpler answer:
- did it start?
- did it finish?
- did it fail?
- did it miss the expected time?
The ping approach
One simple way to monitor this is to make the job report its own status.
The basic pattern is:
- send a start ping when the job begins
- send a success ping when it finishes
- send a failure ping if it crashes
- mark it as late or missed if the expected ping does not arrive
It is not a complicated idea, but I have found it very useful in practice.
Instead of you checking logs manually, the job itself reports whether it started, finished, or failed.
For example:
- if the start ping arrives, the job has started
- if the success ping arrives, the job finished
- if the fail ping arrives, the job failed
- if nothing arrives when expected, the job is late or missed
That last case is the important one for me.
A lot of failures are not loud. The job does not always send an error. Sometimes it just does not run.
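To make the late or missed case concrete, here is a minimal sketch of the check a monitoring service could run on its side. The function name, the grace period, and the status strings are assumptions made for illustration, not the behavior of any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def schedule_status(
    last_success: Optional[datetime],
    expected_every: timedelta,
    grace: timedelta,
    now: datetime,
) -> str:
    """Classify a monitor using only the time of its last success ping."""
    if last_success is None:
        return "waiting"  # no ping has ever arrived
    deadline = last_success + expected_every
    if now <= deadline:
        return "on_time"
    if now <= deadline + grace:
        return "late"  # overdue, but still inside the grace window
    return "missed"


# Hypothetical nightly job whose last success ping arrived at 02:00 UTC.
last = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
every = timedelta(days=1)
grace = timedelta(minutes=30)

print(schedule_status(last, every, grace, last + timedelta(hours=20)))              # on_time
print(schedule_status(last, every, grace, last + timedelta(hours=24, minutes=10)))  # late
print(schedule_status(last, every, grace, last + timedelta(hours=26)))              # missed
```

The point of the sketch is that the check depends on the clock, not on the job: it still produces "missed" when the job never ran and therefore never sent anything.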
Bash example
Here is a simple shell wrapper.
This uses placeholder URLs. In a real setup, these would be the ping URLs generated by your monitoring tool.
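#!/bin/bash

# Placeholder ping URLs; replace them with the ones from your monitoring tool.
START_URL="https://example.com/ping/YOUR_TOKEN/start"
SUCCESS_URL="https://example.com/ping/YOUR_TOKEN"
FAIL_URL="https://example.com/ping/YOUR_TOKEN/fail"

# Report the start. "|| true" keeps a failed ping from affecting the job.
curl -fsS -X POST --max-time 5 "$START_URL" >/dev/null || true

# Run the real job and remember its exit code.
your-real-command-here
EXIT_CODE=$?

if [ "$EXIT_CODE" -eq 0 ]; then
    curl -fsS -X POST --max-time 5 "$SUCCESS_URL" >/dev/null || true
else
    curl -fsS -X POST --max-time 5 "$FAIL_URL" >/dev/null || true
fi

# Exit with the job's own status so cron still sees the real result.
exit "$EXIT_CODE"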
The important part is that the monitoring calls should not break the real job.
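That is why the curl calls use || true. If the monitoring service is temporarily unavailable, the actual job should still be able to run. The --max-time 5 flag also caps each ping at five seconds, so a slow monitoring endpoint cannot hold up the job for long.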
Python example
The same idea can be used inside a Python job.
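import requests

# Placeholder ping URLs; replace them with the ones from your monitoring tool.
START_URL = "https://example.com/ping/YOUR_TOKEN/start"
SUCCESS_URL = "https://example.com/ping/YOUR_TOKEN"
FAIL_URL = "https://example.com/ping/YOUR_TOKEN/fail"


def ping(url: str) -> None:
    """Send a ping, ignoring network errors so monitoring never breaks the job."""
    try:
        requests.post(url, timeout=5)
    except requests.RequestException:
        pass


try:
    ping(START_URL)
    # Run your real job here
    print("doing work...")
    ping(SUCCESS_URL)
except Exception:
    # Report the failure, then re-raise so the error still reaches the logs.
    ping(FAIL_URL)
    raise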
Again, the pings are not meant to replace logs.
Logs still matter when you need to debug what happened. The pings are just a simple way to know that something happened.
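If several jobs need the same start, success, and fail pings, the pattern can be wrapped once and reused. The following is a minimal sketch, assuming the same placeholder URL scheme as above; the names `monitored` and `send_ping` are hypothetical, and it uses only the standard library's `urllib` instead of `requests`.

```python
import functools
import urllib.request
from typing import Any, Callable

BASE_URL = "https://example.com/ping/YOUR_TOKEN"  # placeholder, not a real endpoint


def send_ping(url: str) -> None:
    """POST to a ping URL; swallow any error so monitoring never breaks the job."""
    try:
        request = urllib.request.Request(url, data=b"", method="POST")
        with urllib.request.urlopen(request, timeout=5):
            pass
    except Exception:
        pass


def monitored(base_url: str, ping: Callable[[str], None] = send_ping):
    """Decorator: ping /start before the job, then base_url or /fail after it."""

    def decorator(job: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(job)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            ping(f"{base_url}/start")
            try:
                result = job(*args, **kwargs)
            except Exception:
                ping(f"{base_url}/fail")
                raise  # keep the original error visible to cron and the logs
            ping(base_url)
            return result

        return wrapper

    return decorator


@monitored(BASE_URL)
def nightly_import() -> None:
    # Run your real job here
    print("doing work...")
```

Passing the `ping` function as a parameter is a design choice for testability: a test can substitute a function that records URLs instead of making network calls.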
What I built
I built a small V1 around this idea.
The flow is simple:
- create a monitor
- choose how often the job is expected to run
- add the ping URLs to the job
- get notified if the job fails, gets stuck, runs late, or misses its expected time
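As a rough illustration of how "fails" and "gets stuck" could be told apart from the pings alone, here is a small sketch. The `RunState` class, the ping kind names, and the `max_runtime` limit are hypothetical and are not a description of how the V1 is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunState:
    """Latest pings seen for one monitor (hypothetical shape)."""

    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    failed: bool = False

    def record(self, kind: str, at: datetime) -> None:
        """Store a ping; a new start ping resets the previous run."""
        if kind == "start":
            self.started_at, self.finished_at, self.failed = at, None, False
        elif kind in ("success", "fail"):
            self.finished_at = at
            self.failed = kind == "fail"
        else:
            raise ValueError(f"unknown ping kind: {kind}")

    def status(self, now: datetime, max_runtime: timedelta) -> str:
        if self.failed:
            return "failed"
        if self.started_at is not None and self.finished_at is None:
            # Started, but no success or fail ping has followed yet.
            if now - self.started_at > max_runtime:
                return "stuck"
            return "running"
        if self.finished_at is not None:
            return "ok"
        return "waiting"


t0 = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
limit = timedelta(minutes=30)

state = RunState()
state.record("start", t0)
print(state.status(t0 + timedelta(minutes=5), limit))  # running
print(state.status(t0 + timedelta(hours=2), limit))    # stuck
state.record("success", t0 + timedelta(hours=3))
print(state.status(t0 + timedelta(hours=3), limit))    # ok
```

This is why the start ping matters: without it, a job that hangs forever looks the same as a job that never ran.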
The current version is intentionally simple.
I am not trying to replace full observability tools. For now, I am mostly thinking about the boring jobs that do important work in the background but do not need a huge monitoring setup.
Examples:
- nightly imports
- database cleanup jobs
- billing syncs
- backup scripts
- report generation
- small ETL jobs
- scripts that call third-party APIs
For these, I mostly want to know:
Did the job run when it was supposed to?
And if not, I want to know before I notice stale data later.
Why I made it
I built this because I had the problem myself.
I had background jobs running, but I did not always have a good way to know when one silently stopped.
Checking logs manually does not scale well, even for small projects. And full observability tools can feel like too much when the thing you want to monitor is just a cron job or a small script.
So I wanted something very basic:
- one URL for start
- one URL for success
- one URL for failure
- email alert when something looks wrong
That is basically it.
What I am still figuring out
The main thing I am trying to understand is whether this is useful beyond my own use case.
I know it helps me, but I am still trying to learn how other developers handle this.
Maybe people already use something like this. Maybe they use logs, cron emails, healthchecks, uptime monitors, custom scripts, or their existing observability stack.
That is the feedback I am looking for.
If you run cron jobs, ETL jobs, backups, imports, cleanup scripts, or other scheduled tasks:
How do you currently notice when one silently stops running?
And is this kind of simple ping-based monitoring something you would actually use?
Curious how others think about this:
What’s worse in your setup — a cron job that fails loudly, or one that never runs at all?
For me, the missed case is usually worse because there’s no visible crash. I just notice later that some data is stale.