DEV Community

Chalom Ellezam

Your crontab is silently failing. The 5 silent killers of VPS-based cron jobs (and the modern setup that fixes them in an afternoon).

Disclosure: I'm a senior backend tech lead and I run HostingGuru. This article mentions HostingGuru once at the end, but the patterns and the fix work on any platform. The point is to stop losing scheduled jobs in silence, not to sell you anything.

It's Monday, 9:17 AM. The weekly digest email didn't go out. The Discord channel you wired to announce new signups is suspiciously quiet. You SSH into the $12 droplet that runs your scheduled jobs. crontab -l looks fine. /var/log/syslog says the job ran at 06:00. You open the output log. Empty.

You don't know if the script crashed, if it sent the emails but logged nothing, or if cron started a second run before the first one finished and the two are now stepping on each other. You spend an hour digging instead of writing code.

I've onboarded thirty-something teams off VPS-based cron in the last 14 months. The story is always a variation of the above. Cron itself isn't broken (it's been working since the 1970s). What's broken is the operational layer around cron when you run it on a generic Linux box you maintain yourself. Below are the 5 silent failure modes I see most often, and the modern setup that retires all of them.

Killer #1: you edited the wrong crontab

There are at least four places on a stock Ubuntu box where a cron job can live: the user's crontab (crontab -e), the root crontab (sudo crontab -e), /etc/crontab, and any file inside /etc/cron.d/. They use slightly different syntax (/etc/crontab and /etc/cron.d/* require a username field, the user-level crontab does not). Copy a snippet from a blog post into the wrong file and it silently does nothing. No error, no warning, no run.

I once watched a senior engineer (ex-Oney, plenty of Linux years on him) lose an hour to this on a Friday afternoon. The fix isn't "be more careful". The fix is to not have four places where the same job could live.
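The syntax trap is exactly one field wide. Assuming a deploy user (the username here is illustrative), the same job looks like this in each location:

```shell
# User crontab (crontab -e): five schedule fields, then the command.
0 6 * * * /srv/app/scripts/digest.sh

# /etc/crontab or /etc/cron.d/digest: a sixth field, the user to run as,
# comes before the command.
0 6 * * * deploy /srv/app/scripts/digest.sh
```

Paste the first form into /etc/cron.d and cron reads the script path as a username, so the job never runs, and nothing tells you why.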

Killer #2: the environment your job runs in isn't your shell

Cron jobs run with a near-empty environment. No .bashrc, no .profile, no nvm, no pyenv. PATH is usually /usr/bin:/bin, which means your job can't find node if you installed it through nvm, can't find python3.11 if you're on a pyenv setup, and can't find psql if you installed Postgres client tools through a custom apt repo.

The classic symptom: the job works when you test it from your shell, silently does nothing under cron, and you discover three weeks later that which node returned a different path under cron than under your login session.
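A quick way to see the gap for yourself: temporarily add an entry that dumps the environment cron actually gives your jobs (the /tmp path is arbitrary), then diff it against your login shell's.

```shell
# Temporary diagnostic crontab entry: capture cron's environment once a minute.
* * * * * env > /tmp/cron-env.txt
```

Wait a minute, run `diff <(sort /tmp/cron-env.txt) <(env | sort)` from your shell, and remove the entry once you've seen how little cron gives you.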

People work around it by pinning PATH and SHELL at the top of every crontab:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SHELL=/bin/bash

That helps with binaries. It doesn't help with secrets. Your DATABASE_URL is in ~/.env and you assumed source ~/.env runs automatically. It doesn't (cron uses /bin/sh by default, not your interactive bash, and reads none of your dotfiles). You end up wrapping every job in:

0 6 * * * cd /srv/app && set -a && . ./.env && set +a && node scripts/digest.js >> /var/log/digest.log 2>&1

That's a lot of ceremony for "run this script every morning."
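One way out of the ceremony is a small wrapper that owns the environment setup so each crontab line stays short. A minimal sketch; the function name and calling convention are mine, not a standard tool:

```shell
# with_env ENVFILE CMD...: export every assignment in ENVFILE, then run CMD.
with_env() {
  envfile="$1"; shift
  # set -a marks every variable defined while it's active for export,
  # so plain KEY=value lines in the .env file become environment variables.
  set -a
  . "$envfile"
  set +a
  "$@"
}
```

Put it in a small sourced library (or make it a standalone script) and the crontab line shrinks to something like `0 6 * * * /srv/app/bin/with-env /srv/app/.env node scripts/digest.js >> /var/log/digest.log 2>&1`.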

Killer #3: output goes to /dev/null, errors go nowhere

Cron's default behavior is to email the user the job's stdout and stderr. On a fresh VPS there's no mail transport configured, so the output is discarded (on Debian/Ubuntu, syslog notes "No MTA installed, discarding output"). Most people respond by adding > /dev/null 2>&1, which silences everything. Now you have no error output anywhere.

When the job crashes, you find out because a customer complains. Then you SSH in, scroll back through /var/log/syslog, see that the cron daemon dutifully started your command, and have zero idea what happened inside it.

The minimum fix is to log everything:

0 6 * * * /srv/app/scripts/digest.sh >> /var/log/digest.log 2>&1

Better: pipe through logger so the output lands in journald with a proper tag; then journalctl -t digest returns full history:

0 6 * * * /srv/app/scripts/digest.sh 2>&1 | logger -t digest

Best: have your script itself report its own success or failure to somewhere you actually read, like a Telegram chat or a Slack channel. I wrote a whole post on the Telegram side of that (linked at the bottom).

Killer #4: your VPS is a single point of failure

The droplet reboots after a kernel update. The cron jobs run again as soon as the box comes back up, on the OS's clock. Fine, except your provider's clock skewed two minutes during the reboot and now your "every minute" health-check job and your "every five minutes" reconciliation job briefly overlap. Or the reboot took 8 minutes and the 06:00 digest didn't fire at all because the box wasn't on.

Worse: the disk fills up because nobody rotates /var/log/digest.log. The jobs themselves start failing because they can't write their logs. You don't notice until a customer tells you the app is acting strange.
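The disk-full failure at least is preventable with a logrotate policy for every log your jobs write. A minimal sketch, assuming the /var/log/digest.log path from earlier; the retention numbers are arbitrary:

```shell
# Install a minimal rotation policy (requires root).
sudo tee /etc/logrotate.d/digest >/dev/null <<'EOF'
/var/log/digest.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
EOF
```

Eight compressed weekly rotations caps the log at a bounded size forever; missingok and notifempty keep logrotate quiet when the job hasn't run.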

I have a slide I use in onboarding calls that just says "production scheduled jobs should not depend on the uptime of a Linux box you personally maintain." Half the room nods. The other half says "but it's so cheap." It is cheap, until the day it isn't.

Killer #5: overlapping runs corrupt your data

This is the one that bit a client most recently. SaaS doing inventory sync for indie ecommerce shops. The reconciliation job runs every 15 minutes. Normally it takes 3 minutes. Then their database hit a slow week, the job started taking 18 minutes, and now job N+1 starts while job N is still writing. Two simultaneous writers on the same Postgres rows. Half the inventory counts ended up wrong. They didn't notice for 4 days because the symptom was "some products show wrong stock", not "the script crashes."

The fix is locking. The simplest version is flock:

*/15 * * * * /usr/bin/flock -n /tmp/reconcile.lock /srv/app/scripts/reconcile.sh

flock -n exits immediately if another instance holds the lock, instead of queuing. That prevents the overlap. It does not tell you that an overlap happened, so you can silently skip every 15-minute job for an hour and never notice. You only know if you instrument the script itself.
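If you want the skip to be visible instead of silent, util-linux flock's -E (--conflict-exit-code) flag gives lock contention a distinct exit status you can branch on. A sketch; the exit code 99, the log tag, and the function name are all arbitrary choices of mine:

```shell
# run_locked LOCKFILE CMD...: run CMD under an exclusive lock.
# If another instance already holds the lock, record the skip and return 99.
run_locked() {
  lockfile="$1"; shift
  # -n: fail immediately instead of waiting; -E 99: exit 99 only when the
  # lock is busy, so it can't be confused with the command's own exit code.
  flock -n -E 99 "$lockfile" "$@"
  status=$?
  if [ "$status" -eq 99 ]; then
    logger -t cron-skip "skipped: $lockfile busy"
  fi
  return "$status"
}
```

Now `journalctl -t cron-skip` tells you how often overlap actually happens, instead of the skips vanishing.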

At the database level, pg_try_advisory_lock does the same thing inside Postgres:

SELECT pg_try_advisory_lock(54321);
-- ... do the work on this same connection: the lock is session-scoped,
-- and is released automatically if the session disconnects ...
SELECT pg_advisory_unlock(54321);

Either is fine. Neither shows up in the tutorial that taught you cron in the first hour. They become obvious only after you've been bitten.

The modern setup, in three pieces

I'll describe this generically because every reasonable PaaS in 2026 supports it. The three pieces are:

First, a scheduled job primitive your platform manages. You declare "this command, this schedule" in config or a UI. The platform handles when the box reboots, when the cron daemon misbehaves, when overlapping runs need to be prevented. Render and Railway both call them Cron Jobs. Fly has scheduled Machines. The naming varies. The pattern is the same: the platform owns scheduling, you own the script.

Second, per-run logs you can actually read. Not a log file you have to SSH and tail. Logs in a UI, by run, with timestamps and exit codes. This is the single biggest quality-of-life upgrade over crontab. You stop debugging "did it run?" and start debugging "what did it print?".

Third, alerts on failure that go to a channel you check. If exit code is non-zero, ping Telegram or Slack. Not email. Email gets buried (see article 4 in the series).

That's the minimum. You can add nice-to-haves: retries with backoff, run history, manual "run now" buttons for debugging, isolation so a runaway job doesn't take down your web service. But scheduling, logs, and alerts is the floor.

A real before/after

I'll show you the kind of script most indie SaaS use for a weekly digest. The "before" is what I see on VPS boxes. The "after" is the same script with proper instrumentation that works on any platform that gives you the three pieces above.

Before (digest.sh on a VPS, dropped into crontab -e):

0 9 * * 1 cd /srv/app && /usr/bin/node scripts/digest.js > /dev/null 2>&1

After (digest.js, run by the platform's scheduler, logs and alerts built in):

// scripts/digest.js
const start = Date.now();
const TELEGRAM_TOKEN = process.env.TELEGRAM_TOKEN;
const TELEGRAM_CHAT = process.env.TELEGRAM_CHAT_ID;

async function notify(text) {
  if (!TELEGRAM_TOKEN || !TELEGRAM_CHAT) return;
  await fetch(`https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ chat_id: TELEGRAM_CHAT, text }),
  });
}

async function run() {
  console.log(JSON.stringify({
    event: "digest.start",
    at: new Date().toISOString(),
  }));

  const sent = await sendWeeklyDigest(); // your code
  const ms = Date.now() - start;

  console.log(JSON.stringify({
    event: "digest.done",
    sent,
    ms,
  }));

  if (sent === 0) {
    await notify(`Weekly digest sent 0 emails (took ${ms}ms). Investigate.`);
  }
}

run().catch(async (err) => {
  console.error(JSON.stringify({
    event: "digest.error",
    error: err.message,
    stack: err.stack,
  }));
  await notify(`Weekly digest FAILED: ${err.message}`);
  process.exit(1);
});

Three things changed: structured JSON logs (your platform's log viewer can filter them), a Telegram ping on actual failure or zero-send anomaly, and an explicit non-zero exit code so the platform marks the run as failed. None of this is exotic. All of it was inconvenient enough on a VPS that most people skipped it.

The Python version is the same shape. logging.info(json.dumps({...})), requests.post to Telegram, sys.exit(1) from the exception handler. Six extra lines around your real script. The win is that you can answer "did Monday's digest succeed?" in under 10 seconds for the next two years.

What about background workers vs cron?

Quick clarifier because I see this confused. A scheduled job runs on a clock (every 15 minutes, every Monday at 9 AM). A background worker runs continuously and consumes a queue (process this Stripe webhook now, render this PDF on request). Different primitives. You usually need both. Running both off the same cron line is a different conversation.

What I built

I built HostingGuru partly because I was tired of writing the same flock plus logger plus Telegram wrapper for client after client. On the Pro tier, scheduled scripts are first-class with per-run logs, exit-code tracking, and an AI monitor that pings my Telegram bot when something looks off (silent cron failures are one of the patterns it flags, alongside the retry loops and token spikes from article 3). EU (Germany) and US (Oregon) regions, ISO 27001 and GDPR compliant. The Free Starter tier doesn't sleep if you want to try a scheduled script without paying.

You don't need HostingGuru for this. Render, Railway, Fly, and Heroku all offer scheduled jobs as a primitive. The point of the article is that "scheduled jobs as a primitive" should be table stakes for your stack in 2026, regardless of which platform you pick.

What to do tonight regardless of which platform you use

  1. List every cron line on every VPS you own. A one-liner: for u in $(cut -d: -f1 /etc/passwd); do echo "=== $u ==="; sudo crontab -u $u -l 2>/dev/null; done plus cat /etc/crontab plus ls /etc/cron.d/. Write them down. You probably have more than you remember.
  2. For each one, ask: when did it last succeed? If you can't answer in under 30 seconds, you have no observability. That's the actual problem to fix first.
  3. Add flock -n to every job that writes to a shared resource. Five-minute change. Prevents overlap-induced data corruption.
  4. Add structured logging to the script itself. JSON line on start, JSON line on finish, JSON line on error. Make grep digest.error your.log useful.
  5. Add a failure ping to a channel you actually read. Telegram is the cheapest option and there's a tutorial in this series (article 4).
  6. Pick one job, the most important one (the billing reconciliation, the daily backup, the weekly digest), and move it off the VPS to a managed scheduler this week. Don't migrate everything at once. Migrate the one that would hurt most if it silently failed for a week.
  7. Set a calendar reminder for 30 days from today to check the success rate of that migrated job. The whole point of the migration is observability. Test that it actually delivered.
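For jobs that are plain shell rather than Node or Python, the structured-logging step above can be as small as one helper. A sketch; real code should also JSON-escape its values:

```shell
# log_json EVENT [EXTRA_FIELDS]: print one structured JSON log line with a
# timestamp. EXTRA_FIELDS, if given, must start with a comma, e.g. ',"sent":42'.
log_json() {
  printf '{"event":"%s","at":"%s"%s}\n' "$1" "$(date -Is)" "${2:-}"
}

# Usage around a job (the job script path is illustrative):
#   log_json digest.start
#   if /srv/app/scripts/send-digest.sh; then log_json digest.done
#   else log_json digest.error ',"exit":'"$?"; exit 1; fi
```

That's enough to make `grep digest.error` useful and to give any log viewer a timestamped start/done/error trail per run.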

Honest closing question

The thing that keeps me up about cron isn't that it fails. It's that it fails silently and the discovery latency is measured in days. You find out when a customer complains, or the bill arrives, or the inventory is off.

What's the longest a scheduled job has been silently broken in your stack before you noticed? I want to know whether the "I lost a week on a digest that wasn't sending" story is unusually bad or whether every indie SaaS has one in the closet.


Previous posts in this series

  1. Heroku just went into "sustaining engineering mode." Here are 5 alternatives whose free tier actually doesn't sleep.
  2. I built my MVP with Claude Code. Now I need to deploy it. Here's what nobody tells you.
  3. Your AI app is silently burning $2,000/month and you don't know it. Here are the 5 patterns that bite founders.
  4. Telegram alerts for any production app, a 5-minute setup (no SaaS, no signup, just curl)
  5. How I built a Discord 'ship-tracker' bot in a weekend (and the 3-process architecture that keeps it alive 24/7)
  6. I migrated 12 client projects off Heroku. Here's the playbook (and the 7 things that bit me every single time).
  7. The Claude Code to production checklist: 15 things that aren't obvious until they bite you
  8. Your indie SaaS has zero working Postgres backups. Here's the 20-minute fix.
  9. Your Stripe webhook is going to silently drop a paid customer. Here are the 4 patterns that catch it.
