Ajdin Halac

Posted on Apr 9

Your cron job started. That does not mean it finished.

#monitoring #webdev #devops #backend

A lot of cron monitoring is too shallow.

It tells you that a job ran, or that the server was up when it was supposed to run, and that is where the thinking stops.

That might be enough for throwaway jobs. It is not enough for anything important.

Because for real scheduled work, "it started" and "it finished" are two different things.

And that gap is where a lot of failures hide.

The problem with most cron monitoring

A lot of teams monitor cron jobs as if the only failure mode is "it never ran."

But that is not how these jobs usually fail.

They fail like this:

the job starts and gets stuck halfway through
the script exits before the important part
the process times out after doing some of the work
the backup starts but never finishes
the sync starts on time and silently dies later

That is why "did it run?" is not a very useful question by itself.

The better questions are:

Did it start on time?
Did it finish successfully?
If it did not finish, where did it stop?

If you cannot answer those, your monitoring is giving you false confidence.

One heartbeat is fine for simple jobs

There is nothing wrong with a single heartbeat if the job is short and simple.

If a task runs quickly and one completion ping tells you the whole story, that is fine.

But a lot of jobs are not like that.

Backups are not.

Imports are not.

Sync jobs are not.

Billing tasks are not.

Scheduled reports are not.

These jobs can start correctly and still fail in a way that matters.

That is the class of failure basic cron monitoring often misses.

What you actually want to know

For anything important, you usually care about two things:

the job started
the job finished

That sounds obvious, but it changes how you monitor the job.

If there is no start signal, the job never began.

If there is a start signal but no finish signal, the job probably got stuck, timed out, or failed during execution.

If both signals arrive, the run completed.

That is a much better operational signal than a single generic success ping.
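The three cases above can be turned into a small decision function on the monitoring side. This is only a sketch: the function name, the status labels, and the idea of passing a maximum runtime are assumptions made for illustration, not something the post or any particular service prescribes. The maximum runtime is there because a missing finish signal only means trouble once the job has had enough time to complete.

```bash
#!/usr/bin/env bash

# classify_run START_TS FINISH_TS NOW MAX_RUNTIME
# START_TS and FINISH_TS are epoch seconds, or empty strings if that
# signal never arrived. NOW is the current epoch time and MAX_RUNTIME
# is how long (in seconds) the job is allowed to take.
classify_run() {
  local start_ts="$1" finish_ts="$2" now="$3" max_runtime="$4"

  if [ -z "$start_ts" ]; then
    # No start signal: the job never began.
    echo "never-started"
  elif [ -n "$finish_ts" ]; then
    # Both signals arrived: the run completed.
    echo "completed"
  elif [ $(( now - start_ts )) -le "$max_runtime" ]; then
    # Started, not finished yet, but still within its time budget.
    echo "running"
  else
    # Started, no finish, and the time budget is used up.
    echo "stuck-or-failed"
  fi
}
```

For example, `classify_run 100 "" 1000 600` describes a job that started 900 seconds ago with a 600 second budget and no finish signal, so it is reported as stuck or failed.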

Track the lifecycle, not just one event

This is really the core idea.

Instead of treating a cron job as one event, treat it like a sequence.

At minimum:

start
finish

That gives you a more useful signal immediately.

You stop asking "did something happen?" and start asking "did the job complete the way it was supposed to?"

That is much closer to how these jobs actually behave in production.

A practical example

Say you have a nightly backup.

A lot of setups only send one ping after the backup command finishes. That works if everything goes well.

But if the job starts and gets stuck halfway through, that single-ping model tells you very little.

A better version looks like this:

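#!/bin/bash
set -euo pipefail

# Pings must never break the job: cap the time and ignore ping failures.
curl -fsS -m 10 --retry 3 -o /dev/null "https://heartbeats.upti.my/v1/heartbeat/<heartbeat-id>?step=start" || true

# If pg_dump fails, set -e exits here and the finish ping is never sent.
pg_dump mydb > /backups/mydb.sql

curl -fsS -m 10 --retry 3 -o /dev/null "https://heartbeats.upti.my/v1/heartbeat/<heartbeat-id>?step=finish" || true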

Now the signal is more useful.

No start means the job never began.

A start without a finish, once the job's usual runtime has passed, means it failed or stalled during execution.

A start followed by a finish means it completed.

That is the kind of monitoring that actually helps when something breaks.
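If several jobs need the same treatment, the pattern can be factored into a small wrapper. The sketch below is one possible shape, not the service's official client: the function names and the `HEARTBEAT_URL` variable are made up for illustration, and only the `step=start` and `step=finish` parameters come from the script above. Ping failures are deliberately ignored so that a monitoring outage cannot stop the real work.

```bash
#!/usr/bin/env bash

# Placeholder URL; substitute your own heartbeat ID.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://heartbeats.upti.my/v1/heartbeat/<heartbeat-id>}"

# Send one ping for the given step. Errors are swallowed on purpose:
# the job matters more than the ping.
ping_step() {
  curl -fsS --max-time 10 --retry 3 -o /dev/null "${HEARTBEAT_URL}?step=$1" || true
}

# Run a command between a start ping and a finish ping.
# The finish ping is sent only if the command exits with status 0;
# otherwise the command's exit status is returned unchanged.
run_with_heartbeat() {
  ping_step start
  "$@" || return $?
  ping_step finish
}
```

A backup script would then call something like `run_with_heartbeat pg_dump -f /backups/mydb.sql mydb`. If `pg_dump` fails, the finish ping is never sent and the run shows up as started but not finished.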

Why this matters more than people think

Silent cron failures are annoying because they usually do not show up as immediate downtime.

They show up later.

You find out your backup was broken when you need to restore it.

You find out a sync stopped when someone notices stale data.

You find out invoices were not generated after a customer asks about billing.

You find out reports were never sent because somebody complains.

By then, the failure is old news. You are already dealing with the fallout.

That is why "the server is healthy" is not enough, and "the job probably ran" is definitely not enough.

For important scheduled work, you want direct visibility into whether the job started and whether it actually reached the end.

A simple rule

Use one heartbeat when the job is:

short
simple
easy to validate with one completion event

Use a chained heartbeat approach when the job:

takes time
can fail midway
has meaningful execution stages
matters to the business if it only partially runs

That usually covers backups, syncs, imports, exports, billing tasks, ETL pipelines, and reporting jobs.
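For jobs with meaningful stages, the same idea extends to one ping per completed stage, so the last step received shows roughly where the run stopped. Treat the following as a sketch under assumptions: the post only shows `start` and `finish` as step values, so whether other step names are accepted, and how they are configured, is something to check against the service's documentation. The stage names and commands in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash

# Placeholder URL; substitute your own heartbeat ID.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://heartbeats.upti.my/v1/heartbeat/<heartbeat-id>}"

# Send one ping for the given step, ignoring ping failures.
ping_step() {
  curl -fsS --max-time 10 --retry 3 -o /dev/null "${HEARTBEAT_URL}?step=$1" || true
}

# run_stage NAME CMD [ARGS...]
# Runs CMD and pings NAME only after CMD succeeds. On failure the
# ping is skipped and CMD's exit status is returned.
run_stage() {
  local name="$1"
  shift
  "$@" || return $?
  ping_step "$name"
}

# Usage (step names and stage commands are hypothetical):
#   ping_step start
#   run_stage dumped   dump_database  &&
#   run_stage uploaded upload_backup  &&
#   ping_step finish
```

Chaining the stages with `&&` means a failed stage stops the chain, so no later step, including finish, is ever reported.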

Final thought

For important cron jobs, "did it run?" is a weak question.

A better one is:

Did it start on time, and did it finish successfully?

That is the difference between basic monitoring and useful monitoring.

If you only track one heartbeat, you can miss a whole class of failures where the job started but never completed.

I built this into upti.my as Heartbeat with Job Chain, which lets jobs ping at different steps like start and finish.

If you want to see the original post, it is here:

https://www.upti.my/blog/why-cron-job-started-does-not-mean-finished
