DEV Community

Kriss

Kubernetes CronJobs silently fail more than you think

A backup job missed 24 days of runs. Nobody knew. The CronJob looked fine in kubectl get cronjobs. No alerts fired. The last successful run timestamp in the status field just sat there, quietly getting older.

The root cause: the CronJob controller had silently given up scheduling after missing 100 runs. Logged an error. Stopped trying. Moved on.

This article explains why Kubernetes CronJobs are structurally unreliable without external monitoring, and what you can do about it.


The three failure modes Kubernetes won't tell you about

1. The 100 missed-schedule limit

This is the one that produces the war stories.

The Kubernetes CronJob controller checks how many schedules it missed since the last successful run. If that number exceeds 100, it permanently stops scheduling that CronJob — and logs a single error line:

Cannot determine if job needs to be started: too many missed start time (> 100)

That's it. No event. No alert. kubectl describe cronjob shows the last scheduled time getting stale. The CronJob shows as ACTIVE: 0. Everything looks fine until you notice your data is 24 days old.

This happens if:

  • The CronJob controller crashes or the API server is unreachable for an extended period
  • You set startingDeadlineSeconds too low and the cluster was briefly overloaded
  • A node outage prevented scheduling for long enough

The fix is recreating the CronJob (delete and re-apply it, or bump the schedule). Setting startingDeadlineSeconds to a sensible window also prevents a recurrence, because the controller then only counts misses within that window. But the point is: you won't know it happened until you check manually.
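Until you have real external monitoring, you can at least detect this state with a periodic script run from outside the cluster. Here's a minimal sketch; the staleness threshold and the None handling are assumptions you should adapt to your own schedules:

```python
from datetime import datetime, timezone

def is_silently_stale(last_schedule_time, max_age_hours, now=None):
    """Flag a CronJob whose status.lastScheduleTime is suspiciously old.

    last_schedule_time: ISO 8601 string from the CronJob status,
    or None if the job has never been scheduled.
    """
    now = now or datetime.now(timezone.utc)
    if last_schedule_time is None:
        return True
    last = datetime.fromisoformat(last_schedule_time.replace("Z", "+00:00"))
    return (now - last).total_seconds() > max_age_hours * 3600

# A daily job whose last schedule is days old is stale:
print(is_silently_stale("2024-06-01T02:00:00Z", 25,
                        now=datetime(2024, 6, 4, tzinfo=timezone.utc)))  # → True
```

Feed it the output of kubectl get cronjob daily-export -o jsonpath='{.status.lastScheduleTime}'. Note that running this check from inside the cluster just moves the "who watches the watcher" problem, which is the subject of the rest of this article.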

2. Exit code 0 is not success

Your CronJob container can exit 0 after:

  • Connecting to a read replica that's 6 hours behind
  • Finding an empty queue and processing nothing
  • Silently swallowing an exception in a try/catch
  • Successfully completing a database backup of 0 bytes

Kubernetes marks the Job as Succeeded. The CronJob status shows the last successful run timestamp updated. Everything looks healthy. Your data pipeline has been doing nothing for a week.
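Guarding against this means checking what the job actually produced before returning 0. A minimal sketch — the thresholds and field names here are illustrative, not from any particular tool:

```python
MIN_EXPECTED_ROWS = 1      # illustrative thresholds; tune per job
MIN_BACKUP_BYTES = 1024

def validate_output(rows_processed, backup_bytes):
    """Exit code 0 alone proves nothing; check what the run actually produced."""
    problems = []
    if rows_processed < MIN_EXPECTED_ROWS:
        problems.append(f"processed only {rows_processed} rows")
    if backup_bytes < MIN_BACKUP_BYTES:
        problems.append(f"backup is only {backup_bytes} bytes")
    return problems

# At the end of the job, fail loudly instead of returning 0:
#     if validate_output(rows, size):
#         sys.exit(1)
print(validate_output(rows_processed=0, backup_bytes=0))
# → ['processed only 0 rows', 'backup is only 0 bytes']
```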

3. Job history purged, evidence gone

By default:

  • successfulJobsHistoryLimit: 3
  • failedJobsHistoryLimit: 1

Once a fourth successful run completes, the oldest Job and its pods are deleted, and their logs go with them. When you eventually notice something's wrong and go looking for "what happened on Tuesday?", the evidence no longer exists.

You can increase these limits, but you'll never retain more than a handful of runs. A real audit trail requires shipping logs to an external system.
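One lightweight approach: have the job ship its own run summary somewhere durable before the pod dies, so history pruning can't destroy the evidence. A sketch — the record fields and the audit endpoint are assumptions, not any specific product's API:

```python
import time

def build_audit_record(job_name, rows_processed, duration_seconds):
    """Assemble a per-run record to ship to external storage before the pod dies."""
    return {
        "job": job_name,
        "finished_at": int(time.time()),
        "rows": rows_processed,
        "duration_seconds": duration_seconds,
    }

# In the job itself, POST this to something outside the cluster, e.g.:
#     requests.post(AUDIT_ENDPOINT, json=record, timeout=5)
record = build_audit_record("daily-export", 1523, 42.0)
print(sorted(record))  # → ['duration_seconds', 'finished_at', 'job', 'rows']
```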


The deeper problem: no external check

All of these failure modes share the same root cause: your monitoring system lives inside the cluster, so it fails along with the cluster.

If your alerting depends on the cluster being healthy, it won't alert you when the cluster is unhealthy. And CronJob failures almost always correlate with cluster health problems.

What you need is a check that runs outside your cluster and asks: "did this job run? Did it do something?" If the answer is no, it pages you — regardless of what the cluster thinks.

This is the dead man's switch pattern: instead of your monitoring system checking whether the job ran, the job checks in with an external system, and the external system alerts if it stops hearing from the job.
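The server side of the pattern is simple enough to sketch in a few lines. This is the general idea, not any specific tool's implementation:

```python
from datetime import datetime, timedelta, timezone

def should_alert(last_ping, expected_interval, grace, now=None):
    """Dead man's switch core: page when the job has been silent too long.

    last_ping: datetime of the most recent ping, or None if never heard from.
    """
    now = now or datetime.now(timezone.utc)
    if last_ping is None:
        return True
    return now - last_ping > expected_interval + grace

# A daily job with one hour of grace: silence past 25h pages you.
now = datetime(2024, 6, 4, 3, 0, tzinfo=timezone.utc)
last = now - timedelta(hours=26)
print(should_alert(last, timedelta(hours=24), timedelta(hours=1), now=now))  # → True
```

The crucial property: this loop runs on infrastructure that shares no failure domain with your cluster, so a cluster-wide outage is just another missed ping.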


Implementing external monitoring for a Kubernetes CronJob

Add a start/success/fail ping to your job. Here's a minimal implementation:

Shell wrapper (works with any container)

#!/bin/bash
set -euo pipefail

BASE="https://deadmancheck.io/ping/${DEADMANCHECK_TOKEN}"

# Signal start (enables duration monitoring) — || true so a network blip never kills the job
curl -fsS "${BASE}/start" > /dev/null || true

# Alert on any error
trap 'curl -fsS "${BASE}/fail" > /dev/null' ERR

# Your actual job
ROWS=$(/app/run-export.sh)

# Signal success + row count for output assertion
curl -fsS -X POST -H "Content-Type: application/json" \
  -d "{\"count\": ${ROWS}}" \
  "${BASE}" > /dev/null

Python job

import requests
import os
import sys

TOKEN = os.environ["DEADMANCHECK_TOKEN"]
BASE = f"https://deadmancheck.io/ping/{TOKEN}"

def main():
    # Signal start — wrapped so a monitoring outage never kills the job
    try:
        requests.get(f"{BASE}/start", timeout=5)
    except Exception:
        pass
    try:
        records_processed = run_job()
        # POST count for output assertion: alert if count is 0
        requests.post(BASE, json={"count": records_processed}, timeout=5)
    except Exception:
        try:
            requests.get(f"{BASE}/fail", timeout=5)
        except Exception:
            pass
        sys.exit(1)

if __name__ == "__main__":
    main()

CronJob spec

Store the token in a Kubernetes Secret:

kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here

Reference it in your CronJob spec:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-export
  namespace: production
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 5   # keep more history than the default
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: exporter
            image: your-registry/exporter:latest
            env:
            - name: DEADMANCHECK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: deadmancheck-secret
                  key: token
          restartPolicy: OnFailure

Output assertions: the check Kubernetes can't do

Output assertions are the piece most monitoring tutorials skip. Here's why it matters.

Your job runs. Exits 0. Kubernetes marks it Succeeded. But the job processed 0 records.

If your monitoring only checks "did the job ping?" — like every other cron monitoring tool — you don't get alerted. The job pinged. It just pinged with count=0.

DeadManCheck lets you configure an output assertion: alert if the reported count falls below a threshold. Set it to require count > 0, and your job can no longer silently export nothing without triggering an alert.

This catches the failure mode that pure heartbeat monitoring misses: the job that runs, succeeds by every technical measure, and still does nothing useful.
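Under the hood, an output assertion is just a comparison against the reported count. A sketch of how such a rule could be evaluated (this is the concept, not DeadManCheck's actual code):

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}

def assertion_passes(reported_count, op, threshold):
    """Evaluate one output-assertion rule, e.g. require count > 0."""
    return OPS[op](reported_count, threshold)

print(assertion_passes(0, ">", 0))     # → False: a 0-record run trips the alert
print(assertion_passes(1523, ">", 0))  # → True
```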


What external monitoring catches vs what it doesn't

Failure mode                    | kubectl catches?       | External monitoring catches?
--------------------------------|------------------------|-----------------------------
Pod CrashLoopBackOff            | Visible in logs/events | YES (missed ping)
100 missed-schedule limit hit   | No alert fires         | YES (missed ping)
Job exits 0, processes nothing  | No                     | YES (output assertion)
Cluster outage kills controller | No                     | YES (missed ping)
Job takes 5× longer than usual  | No                     | YES (duration anomaly)
CronJob accidentally deleted    | No                     | YES (missed ping)

The realistic setup time

For an existing CronJob:

  1. Create a free monitor — takes 2 minutes
  2. Set interval to match your schedule + buffer (e.g., 25h for a daily job)
  3. Enable output assertion if your job reports a count
  4. Add the start/success/fail pings to your container script
  5. Create the Secret, update the CronJob spec
  6. Deploy and verify the first ping arrives

Total: 15-20 minutes including deployment. The first time a silent failure happens, you'll wish you'd done it sooner.


One more thing: set a reasonable history limit

While you're in the CronJob spec, increase the history limits from the defaults:

successfulJobsHistoryLimit: 10
failedJobsHistoryLimit: 5

This doesn't replace external monitoring, but it gives you more context in kubectl describe cronjob when you're investigating an incident. The default of 3/1 is genuinely too low for production jobs.


DeadManCheck is open source and self-hostable if you'd rather run it on your own infrastructure. GitHub →
