quietpulse

Posted on • Originally published at quietpulse.xyz

How to Monitor Python Scripts in Production Before They Fail Silently

If you run important automation with Python, you need a way to monitor Python scripts in production beyond “the server is up” and “there are logs somewhere.” A script can stop running, hang forever, exit early, fail under cron, lose permissions, or silently skip the work it was supposed to do — while your app and server still look perfectly healthy.

That is the uncomfortable part of production scripts: they often fail quietly.

Maybe a daily import stopped pulling customer data. Maybe a billing reconciliation script crashed last Thursday. Maybe a cleanup job has not deleted old files for two weeks. Nobody notices until the downstream symptoms become visible.

This guide explains how to monitor Python scripts in production with practical signals, heartbeat checks, and simple examples that catch missed or broken runs before users do.

The problem

Python scripts are often the invisible glue in production systems.

They import data, export reports, sync APIs, clean temporary files, rotate records, generate invoices, update search indexes, send notifications, reconcile payments, or move files between systems.

A typical setup might look like this:

```
0 * * * * /usr/bin/python3 /opt/app/scripts/sync_customers.py
```

Or maybe it runs inside a virtual environment:

```
*/15 * * * * cd /opt/app && . .venv/bin/activate && python scripts/process_queue.py
```

This works well until it does not.

The script is not part of the main web request path. It may not have a dashboard. It may not expose an HTTP endpoint. It may only run every hour, every day, or every week. If it fails, there may be no immediate user-facing error.

That creates a monitoring blind spot.

Your uptime monitor can say the website is online. Your server metrics can say CPU and memory are fine. Your logs may contain an error, but only if someone looks at the right file. Meanwhile, the script that actually performs critical business work may not be running at all.

The real production question is not only:

“Is the server alive?”

It is:

“Did this Python script run successfully when it was supposed to?”

Why it happens

Python scripts fail silently for many ordinary reasons.

Cron is one of the biggest sources of surprises. A script that works from your terminal may fail under cron because the environment is different. Cron usually runs with a minimal PATH, a different working directory, and fewer environment variables.

For example, this may work manually:

```
python scripts/sync_customers.py
```

But fail under cron because python points to a different interpreter, dependencies are missing, or the script expects to be run from a specific directory.
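A quick way to see those differences is to log the runtime environment at the top of the script. The lines below are a small diagnostic sketch you can paste in temporarily; none of it is required by the rest of this guide:

```python
import os
import sys

# Temporary diagnostics: print where this interpreter actually lives.
# Under cron these values often differ from an interactive shell,
# which explains most "works in my terminal" failures.
print(f"interpreter: {sys.executable}")
print(f"working dir: {os.getcwd()}")
print(f"PATH: {os.environ.get('PATH', '(not set)')}")
```

Compare the cron run's output with a manual run and the mismatch usually jumps out.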

Virtual environments are another common issue. If the cron job does not activate the right environment, imports can fail:

```
ModuleNotFoundError: No module named 'requests'
```

File permissions can also break scripts after deployments. A script may no longer be executable. A log directory may become unwritable. A credentials file may move. A new release may change paths.

External APIs create another class of failures. A Python script may depend on a payment provider, analytics API, S3 bucket, database, webhook endpoint, or internal service. If that dependency times out or changes response format, the script may fail halfway through.

There are also logic failures. A script can exit with code 0 while doing no useful work. It may catch exceptions too broadly. It may skip records because of a bad filter. It may process only part of a batch and still report success.

For example:

```python
try:
    sync_customers()
except Exception as exc:
    print(f"Sync failed: {exc}")
```

This logs the error, but the process still exits with code 0 unless the code explicitly sets a failure exit code. From the outside, the job looks fine.

Long-running scripts can fail in a different way: they hang. No exception, no exit code, no completion log. The process is still there, but the work never finishes.

That is why monitoring Python scripts in production needs more than logs and exit codes. You need a signal that confirms the script actually completed the expected work.

Why it's dangerous

Silent script failures are dangerous because they create delayed incidents.

When a web endpoint fails, someone usually notices quickly. A user sees an error. An uptime check fails. Error tracking lights up.

When a background Python script fails, the impact may build slowly.

A missed billing reconciliation might leave payments in the wrong state. A failed import might make dashboards stale. A broken cleanup script might fill disk space over time. A failed notification script might quietly reduce activation or retention. A stuck sync job might leave two systems disagreeing for days.

The damage often appears far away from the original failure.

By the time someone notices, the team has to answer harder questions:

  • When did the script last run successfully?
  • Which records were processed?
  • Which records were skipped?
  • Did it fail completely or partially?
  • Can we safely rerun it?
  • Did users see stale or incorrect data?

For small teams, this is especially painful. Many production scripts are written because they are “just a quick automation.” They solve a real problem, but they do not always get the same operational care as the main app.

That is risky.

If a Python script is important enough to run in production, it is important enough to monitor.

How to detect it

The most reliable pattern is to monitor the script from the inside.

Instead of only checking the server or log file, make the script send a heartbeat when it finishes successfully. A heartbeat is a small HTTP request to a unique monitoring URL. The monitor expects that request within a defined schedule.

For example:

  • A script runs every 15 minutes.
  • The monitor expects a heartbeat every 15 minutes, with a small grace period.
  • The script sends the heartbeat only after it completes successfully.
  • If the heartbeat does not arrive, you get an alert.

This detects several real production failures:

  • The cron job did not run.
  • The script crashed before completion.
  • The script hung and never reached the end.
  • The server was down during the scheduled run.
  • The deployment broke the script path or environment.
  • A dependency failure prevented successful completion.

The key detail is timing.

A heartbeat should not be sent at the start of the script if your goal is to confirm success. Sending it at the start only proves that the script began. It does not prove that the work finished.

For critical scripts, send the heartbeat after the important work is done.

You can also add more signals:

  • Log start and finish timestamps.
  • Return non-zero exit codes on failure.
  • Capture exceptions in error tracking.
  • Measure duration.
  • Alert when runtime is unusually long.
  • Store last successful run in a database.
  • Track rows processed or files handled.
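Duration tracking, for instance, can be a few lines of stdlib code. The `run_with_timing` helper and the 300-second threshold below are illustrative, not part of any library:

```python
import time

def run_with_timing(task, max_seconds=300):
    # Run the task, measure how long it took, and flag unusually
    # long runs. max_seconds is an illustrative threshold you would
    # tune per script.
    start = time.monotonic()
    task()
    elapsed = time.monotonic() - start
    print(f"finished in {elapsed:.1f}s")
    if elapsed > max_seconds:
        print(f"warning: run took longer than {max_seconds}s")
    return elapsed
```

Feeding the measured duration into your logs or monitor turns "the script is slower than usual" into a signal instead of a surprise.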

But the minimum useful signal is simple:

“Did this script successfully check in when expected?”

Simple solution (with example)

Here is a basic Python script that performs work and then sends a heartbeat ping after success.

```python
import sys
import requests

PING_URL = "https://quietpulse.xyz/ping/YOUR_TOKEN_HERE"

def sync_customers():
    # Your real production logic goes here.
    # Examples:
    # - pull data from an API
    # - update your database
    # - write files
    # - send notifications
    print("Syncing customers...")

def send_heartbeat():
    response = requests.get(PING_URL, timeout=10)
    response.raise_for_status()

def main():
    try:
        sync_customers()
        send_heartbeat()
        print("Script completed successfully")
        return 0
    except Exception as exc:
        print(f"Script failed: {exc}", file=sys.stderr)
        return 1

if __name__ == "__main__":
    raise SystemExit(main())
```

The important part is that send_heartbeat() runs only after sync_customers() completes.

If the script crashes before that point, no heartbeat is sent. If the machine is down, no heartbeat is sent. If cron is misconfigured, no heartbeat is sent. If the script hangs forever, no heartbeat is sent.

That missing heartbeat becomes the alert.

You can run the script from cron like this:

```
*/15 * * * * cd /opt/app && /opt/app/.venv/bin/python scripts/sync_customers.py >> /var/log/sync_customers.log 2>&1
```

For better safety, use timeout so a stuck script does not run forever:

```
*/15 * * * * cd /opt/app && timeout 10m /opt/app/.venv/bin/python scripts/sync_customers.py >> /var/log/sync_customers.log 2>&1
```

Now you have three useful layers:

  1. Cron starts the script on schedule.
  2. timeout prevents infinite hangs.
  3. The heartbeat confirms successful completion.

Instead of building the heartbeat receiver yourself, you can use a simple heartbeat monitoring tool like QuietPulse. Create a monitor, copy its ping URL, and call https://quietpulse.xyz/ping/{token} from the script after successful completion. If the expected ping does not arrive, you get an alert.

Common mistakes

1. Sending the heartbeat too early

A common mistake is pinging the monitor at the start of the script.

```python
send_heartbeat()
sync_customers()
```

This proves only that the script started. If sync_customers() fails later, the monitor still thinks everything is fine.

For success monitoring, send the heartbeat at the end.

2. Swallowing exceptions

Catching exceptions without failing the process hides real errors.

```python
try:
    sync_customers()
except Exception as exc:
    print(exc)
```

If the script exits with code 0, cron and deployment tools may treat it as successful. Prefer returning a non-zero exit code on failure.
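A corrected version of that pattern returns a non-zero exit code. In this sketch, `sync_customers` is a placeholder that simulates a failure so the flow is visible:

```python
import sys

def sync_customers():
    # Placeholder for the real work; raises to simulate a failure.
    raise RuntimeError("API timed out")

def main():
    try:
        sync_customers()
    except Exception as exc:
        print(f"Sync failed: {exc}", file=sys.stderr)
        return 1  # non-zero: cron wrappers and tooling see the failure
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```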

3. Relying only on logs

Logs are useful, but they are not alerts by themselves.

A perfect error message in a forgotten log file does not help if nobody reads it. Logs should support debugging after an alert fires. They should not be your only detection mechanism.

4. Forgetting cron environment differences

Cron does not run like your shell.

Use absolute paths. Set the working directory. Use the correct virtual environment. Redirect output somewhere useful. Test the exact cron command manually.
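One way to make those differences explicit is to set the environment in the crontab itself. The values below are illustrative; adjust paths and the mail address to your system:

```
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin
MAILTO=ops@example.com

*/15 * * * * cd /opt/app && /opt/app/.venv/bin/python scripts/sync_customers.py >> /var/log/sync_customers.log 2>&1
```

With `MAILTO` set, cron will also email you any output the job writes to stderr, which is a crude but free extra signal.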

5. Monitoring the server instead of the script

Server-level monitoring is important, but it does not prove that a script ran. CPU, memory, disk, and uptime checks can all look normal while a production script silently stops doing its job.

Monitor the job outcome directly.

Alternative approaches

Heartbeat monitoring is not the only way to monitor Python scripts in production, but it is one of the simplest and most direct.

Logs

Logs are essential for debugging. Every important script should log when it starts, what it processed, and whether it finished.

For example:

```python
print("Starting customer sync")
print("Processed 128 customers")
print("Customer sync complete")
```

Structured logs are even better if you already use a log platform.
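If you want structured output without adding a dependency, a minimal JSON formatter built on the stdlib `logging` module might look like this. The script name and field names are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # Emit one JSON object per line so a log platform can filter
    # on script name and event instead of grepping free text.
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "script": "sync_customers",
            "event": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("sync")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Customer sync complete")
```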

But logs are passive unless you attach alerts to them. They also may not detect a script that never started.

Exit codes

Exit codes are useful for local correctness.

A script should return 0 on success and non-zero on failure. This makes failures visible to cron wrappers, CI jobs, systemd units, and deployment tools.

But exit codes alone do not notify you unless something watches them.
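One lightweight watcher is cron itself: chain the heartbeat onto a successful exit, so the ping is sent only when the script returns 0. This keeps monitoring out of the script entirely, at the cost of losing in-script control over when the ping fires:

```
*/15 * * * * cd /opt/app && .venv/bin/python scripts/sync_customers.py && curl -fsS --max-time 10 https://quietpulse.xyz/ping/YOUR_TOKEN_HERE > /dev/null
```

The `&&` means `curl` runs only if the script exits with code 0, so a swallowed exception that still exits 0 will still look healthy — the in-script heartbeat from earlier remains the stronger signal.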

Error tracking

Tools like Sentry can catch unhandled exceptions. This is valuable for Python scripts, especially when failures are caused by code bugs.

But error tracking may not detect missed runs, disabled cron jobs, hung processes, or scripts that exit successfully while doing the wrong thing.

Systemd timers

Instead of cron, you can run scripts with systemd timers. This gives you better logging, status inspection, and service management.

For some teams, systemd timers are a strong upgrade. Still, you usually want an external heartbeat if the job is important, because local service status does not always tell you whether the business task completed successfully.
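As a sketch, a oneshot service plus a timer might look like this; the unit names and paths are illustrative:

```
# /etc/systemd/system/sync-customers.service
[Unit]
Description=Customer sync script

[Service]
Type=oneshot
WorkingDirectory=/opt/app
ExecStart=/opt/app/.venv/bin/python scripts/sync_customers.py

# /etc/systemd/system/sync-customers.timer
[Unit]
Description=Run customer sync every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
```

`systemctl status sync-customers.service` and `journalctl -u sync-customers` then give you local inspection for free.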

Database “last run” records

Some teams store a last_successful_run_at timestamp in the database. This can work well, especially if you build an internal admin page around it.

The downside is that you also need to monitor that timestamp. If nobody checks it, it becomes another hidden signal.
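A minimal staleness check against such a timestamp might look like this, assuming a hypothetical `jobs` table in SQLite with a unix-timestamp column:

```python
import sqlite3
import time

STALE_AFTER = 30 * 60  # 30 minutes; an illustrative threshold

def check_last_run(db_path, job_name="sync_customers"):
    # Assumes a jobs table with a last_successful_run_at unix
    # timestamp -- the schema and names here are hypothetical.
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT last_successful_run_at FROM jobs WHERE name = ?",
        (job_name,),
    ).fetchone()
    conn.close()
    # A missing row and a stale timestamp both mean "alert someone".
    return row is not None and time.time() - row[0] <= STALE_AFTER
```

Something still has to call this on a schedule and alert on a `False` result, which is exactly the loop a heartbeat monitor closes for you.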

A heartbeat monitor is essentially a simple external version of that idea, with alerting built in.

FAQ

How do I monitor Python scripts in production?

The simplest way to monitor Python scripts in production is to send a heartbeat after each successful run. Configure a monitor that expects the heartbeat on the same schedule as the script. If the script does not run, crashes, hangs, or fails before completion, the heartbeat is missing and you get an alert.

Is cron enough for running Python scripts?

Cron is fine for scheduling, but cron alone is not monitoring. It can start scripts on a schedule, but it does not reliably tell you whether the script completed the expected work. For production scripts, combine cron with logs, non-zero exit codes, timeout protection, and heartbeat monitoring.

Should a Python script send a heartbeat at the start or end?

For success monitoring, send the heartbeat at the end. A start ping only proves that the script began. An end ping confirms that the important work completed. If you need both start and finish tracking, use separate signals, but do not treat a start ping as proof of success.

How can I detect a Python script that hangs?

Use a timeout around the script and a heartbeat monitor. The timeout prevents the process from running forever. The heartbeat monitor alerts if the script does not complete and send its success ping within the expected window.

Do I still need logs if I use heartbeat monitoring?

Yes. Heartbeats tell you that something did not run successfully. Logs help you understand why. A good setup uses both: heartbeat alerts for detection, logs for investigation.

Conclusion

Production Python scripts are easy to forget because they often run outside the main application. But they may handle some of the most important work in your system.

If you want to monitor Python scripts in production, do not rely only on server uptime or log files. Track whether each important script actually completes on schedule.

A simple heartbeat at the end of the script can catch missed runs, crashes, hangs, cron problems, and deployment mistakes early — before a quiet automation failure turns into a user-visible incident.

