<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ramon</title>
    <description>The latest articles on DEV Community by Ramon (@ramon_galego).</description>
    <link>https://dev.to/ramon_galego</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802529%2F59f25b83-e4ea-4eb0-89ae-3df33f11b329.jpg</url>
      <title>DEV Community: Ramon</title>
      <link>https://dev.to/ramon_galego</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ramon_galego"/>
    <language>en</language>
    <item>
      <title>Your LangSmith Traces Are Not an Audit Trail</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Fri, 10 Apr 2026 19:02:04 +0000</pubDate>
      <link>https://dev.to/ramon_galego/your-langsmith-traces-are-not-an-audit-trail-3d63</link>
      <guid>https://dev.to/ramon_galego/your-langsmith-traces-are-not-an-audit-trail-3d63</guid>
      <description>&lt;p&gt;You have LangSmith set up. You can see every prompt, every token, every span. Your traces are clean, your latency charts are green, and you can replay any run from the last 30 days.&lt;/p&gt;

&lt;p&gt;When your compliance officer asks what your agent did on March 14th, you send them a LangSmith link.&lt;/p&gt;

&lt;p&gt;They come back with more questions. And you realise you answered the wrong question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability tools are built for engineers
&lt;/h2&gt;

&lt;p&gt;LangSmith, Langfuse, Arize, Helicone. These are genuinely useful tools. They exist to help you debug prompts, track costs, measure latency, and understand why a chain returned something unexpected.&lt;/p&gt;

&lt;p&gt;They are built for the question: why did this not work the way I expected?&lt;/p&gt;

&lt;p&gt;That is an engineering question. It gets asked during development, during an incident, during a postmortem. The audience is you and your team. The output is a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit trails are built for auditors
&lt;/h2&gt;

&lt;p&gt;An audit trail exists to answer a different question: can you prove what your agent did, and that it was authorised to do it?&lt;/p&gt;

&lt;p&gt;That question gets asked by a compliance officer, a regulator, a customer's legal team, or an external auditor. It might get asked six months from now. The audience is not your engineering team. The output is evidence.&lt;/p&gt;

&lt;p&gt;The distinction matters because the two tools are built with completely different constraints in mind.&lt;/p&gt;

&lt;p&gt;Observability tools are optimised for developer experience. Fast search, good visualisation, easy filtering. Retention is typically short because storage is expensive and the main use case is recent debugging. The data lives in a database the vendor controls. If you delete a trace, it is gone.&lt;/p&gt;

&lt;p&gt;An audit trail needs to be the opposite. Long retention by default. Immutable records that cannot be edited or deleted after the fact. Cryptographic proof that what you are showing today is exactly what was recorded at the time. Readable by a non-technical person. Something you can hand to an auditor without a 20-minute explanation of what a span is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The specific things that are missing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Immutability.&lt;/strong&gt; LangSmith stores your traces in a database. That database can be written to. Records can be deleted. There is no cryptographic proof that a trace you show an auditor today is identical to what was recorded six months ago. A mutable log is not evidence. It is an assertion.&lt;/p&gt;
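&lt;p&gt;The difference is easy to see in miniature. A hash-chained log (a sketch of the general technique, not any specific vendor's implementation) makes every entry's hash depend on the entry before it, so a silent edit anywhere breaks every link after it:&lt;/p&gt;

```python
import hashlib
import json

def append_record(chain, record):
    """Append a record whose hash covers the previous record's hash.

    Editing or deleting any earlier entry changes its hash, which breaks
    every subsequent link. That is what makes tampering detectable.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})
    return chain

# Illustrative records only
chain = []
append_record(chain, {"action": "payment.approved", "amount": 4700})
append_record(chain, {"action": "email.sent", "to": "vendor"})
```

&lt;p&gt;A plain database row has no equivalent property: an UPDATE leaves no trace. With a chain, you can only prove the past by preserving it.&lt;/p&gt;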

&lt;p&gt;&lt;strong&gt;Chain of custody.&lt;/strong&gt; Observability traces show you what the LLM did. They typically do not capture the full sequence of tool calls, external API calls, human approval steps, and downstream effects that make up an agent's actual action in the world. An auditor does not care about your token counts. They care about what your agent did to real data and real systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention guarantees.&lt;/strong&gt; The EU AI Act requires six months minimum for high-risk systems. HIPAA requires six years. Most observability tools default to 30 or 90 days. You can pay for longer, but retention is not the same as an archived, legally defensible record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-technical readability.&lt;/strong&gt; Traces are structured for developers. They are full of span IDs, model names, raw JSON, and timing data. If your compliance team needs to understand what your agent did, they cannot read a LangSmith trace without help. An audit trail needs to be legible to the person asking the question.&lt;/p&gt;

&lt;h2&gt;
  
  
  You probably need both
&lt;/h2&gt;

&lt;p&gt;This is not an argument to stop using observability tools. Use them. They are the right tool for debugging and performance monitoring.&lt;/p&gt;

&lt;p&gt;But they should not be your answer when someone asks you to prove what your agent did. They were never designed to be that answer, and treating them as one creates a compliance gap that will surface at the worst possible time.&lt;/p&gt;

&lt;p&gt;The question to ask about your current setup: if an auditor asked you right now to produce a tamper-proof record of every action your agent took in a specific session three months ago, could you do it?&lt;/p&gt;

&lt;p&gt;If the answer is "we would pull the LangSmith traces," you have observability. You do not have an audit trail.&lt;/p&gt;




&lt;p&gt;AgentReceipt gives your AI agents a tamper-proof audit trail with hash-chained records anchored to a public transparency log. Three lines of code. No infrastructure to manage. &lt;a href="https://agentreceipt.co" rel="noopener noreferrer"&gt;agentreceipt.co&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Cron Job That Lied to You</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Fri, 10 Apr 2026 18:50:03 +0000</pubDate>
      <link>https://dev.to/ramon_galego/the-cron-job-that-lied-to-you-26nh</link>
      <guid>https://dev.to/ramon_galego/the-cron-job-that-lied-to-you-26nh</guid>
      <description>&lt;p&gt;Your nightly backup ran at 2 AM. The ping arrived on schedule. No alerts, no incidents, nothing in your dashboard but a row of green checkmarks.&lt;/p&gt;

&lt;p&gt;The backup file was empty.&lt;/p&gt;

&lt;p&gt;The job ran. It checked in. It lied to you.&lt;/p&gt;

&lt;p&gt;Basic heartbeat monitoring solves one problem: knowing whether your job ran at all. If the ping stops arriving, you get alerted. That is genuinely useful and it catches a whole class of failures. But there is a quieter category of failure that heartbeat monitoring alone does not cover. The job shows up, does its ping, and something is still wrong.&lt;/p&gt;

&lt;p&gt;Here are the four ways that happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The job finished, but so did another copy of it
&lt;/h2&gt;

&lt;p&gt;Your sync job runs every five minutes and usually completes in about 90 seconds. One night the database gets slow. The job starts taking six minutes. Cron does not know this. At the five-minute mark it fires a new instance. Now you have two copies of the same job running at the same time, both reading and writing to the same tables.&lt;/p&gt;

&lt;p&gt;Each one eventually finishes and pings success. Your monitor sees two pings and is perfectly happy. Your data has duplicate records in it.&lt;/p&gt;

&lt;p&gt;This is what overlap detection catches. When you send a start ping at the beginning of a job, PulseMon tracks whether the previous run finished before the new one begins. If it did not, you get an immediate alert. Not after the data is corrupted. At the moment the second instance starts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start of job&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://pulsemon.dev/api/ping/sync-job?status&lt;span class="o"&gt;=&lt;/span&gt;start

&lt;span class="c"&gt;# ... your job logic ...&lt;/span&gt;

&lt;span class="c"&gt;# End of job&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; https://pulsemon.dev/api/ping/sync-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The job finished, but it took way longer than it should have
&lt;/h2&gt;

&lt;p&gt;This one is subtle because it looks completely fine from the outside. The job ran, it finished, it pinged. But it usually takes four minutes and today it took 47.&lt;/p&gt;

&lt;p&gt;That is almost always a sign something upstream is struggling. A slow query. A downstream API responding at a crawl. A dataset that has grown past a threshold your job was not designed for. The job will probably fail completely within the next few runs. Or it will keep completing slowly, quietly degrading until it starts missing its window.&lt;/p&gt;

&lt;p&gt;Duration thresholds let you set a ceiling on how long a job should take. If the job checks in successfully but blew past that ceiling, you get alerted. The job succeeded by every technical measure and you still get notified, because the duration is itself a signal worth acting on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The job failed, but your monitor was going to wait it out
&lt;/h2&gt;

&lt;p&gt;Without explicit failure signalling, heartbeat monitoring works on absence. You set an interval, and if no ping arrives by the deadline, the monitor goes down and you get alerted.&lt;/p&gt;

&lt;p&gt;The problem is the window. If your job is supposed to run every 30 minutes and it fails immediately, you might not find out for 30 minutes. Plus the grace period. That is a long time to wait on a payment processor job or an order fulfilment worker.&lt;/p&gt;

&lt;p&gt;The fix is a fail ping. When your job catches an error, it can tell PulseMon directly instead of just going quiet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_invoice_job&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pulsemon.dev/api/ping/invoice-job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pulsemon.dev/api/ping/invoice-job?status=fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A fail ping fires an immediate alert. You find out in seconds, not when the deadline expires.&lt;/p&gt;

&lt;h2&gt;
  
  
  The job missed its deadline and you have no idea why
&lt;/h2&gt;

&lt;p&gt;This one is not a lie exactly. The job did not check in, you got alerted, something is clearly wrong. But then what? You SSH into the server, check the logs, and try to piece together what happened from whatever the job managed to write before it died.&lt;/p&gt;

&lt;p&gt;The ping body changes this. You can POST your job's output with the ping, and when the alert fires it includes that output. The failure context comes to you instead of you going to find it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OUTPUT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;your-job-command 2&amp;gt;&amp;amp;1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;STATUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$STATUS&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUTPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      https://pulsemon.dev/api/ping/your-job
&lt;span class="k"&gt;else
    &lt;/span&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$OUTPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      https://pulsemon.dev/api/ping/your-job?status&lt;span class="o"&gt;=&lt;/span&gt;fail
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your alert email contains the last thing the job printed before it went wrong. No SSH required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a successful ping actually means
&lt;/h2&gt;

&lt;p&gt;A ping tells you the job reached the line of code that fires the request. That is it. It says nothing about whether the job ran in isolation, whether it finished in a reasonable time, or whether it failed and told you immediately.&lt;/p&gt;

&lt;p&gt;These four features are not replacements for heartbeat monitoring. They sit on top of it. A ping is still the foundation. But a ping on its own is a pretty low bar for "everything is fine."&lt;/p&gt;

&lt;p&gt;The jobs that bite you worst are not the ones that go completely dark. Those are obvious. The hard ones are the jobs that keep showing up, keep checking in, and are quietly doing something wrong every single time.&lt;/p&gt;




&lt;p&gt;PulseMon supports start, success, and fail pings, duration thresholds, overlap detection, and ping body in alerts on all plans. Free tier includes 30 monitors. No credit card required. &lt;a href="https://pulsemon.dev" rel="noopener noreferrer"&gt;pulsemon.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>backend</category>
    </item>
    <item>
      <title>Your AI agent just took an action. Do you know what it did?</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:35:58 +0000</pubDate>
      <link>https://dev.to/ramon_galego/your-ai-agent-just-took-an-action-do-you-know-what-it-did-316j</link>
      <guid>https://dev.to/ramon_galego/your-ai-agent-just-took-an-action-do-you-know-what-it-did-316j</guid>
      <description>&lt;p&gt;A few months ago, a fintech company's accounts payable agent approved and triggered a $47,000 payment to a vendor that had been flagged for fraud two weeks earlier. The flag was in the system. The agent never saw it. By the time anyone noticed, the money was gone.&lt;/p&gt;

&lt;p&gt;The company had logs. Technically. They had server logs, database logs, error logs. What they didn't have was a clear record of what the agent saw, what it decided, and why it sent that payment. When their auditors asked, the engineering team spent three days piecing together a timeline from scattered log files that were never designed to answer that question.&lt;/p&gt;

&lt;p&gt;This is not an edge case. This is what happens when you put AI agents into production without thinking about accountability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agents are different from software
&lt;/h2&gt;

&lt;p&gt;Traditional software is deterministic. If a bug causes a wrong transaction, you look at the code, find the bug, fix it. The behavior is reproducible and the cause is traceable.&lt;/p&gt;

&lt;p&gt;AI agents don't work like that. They reason. They make judgment calls. Two identical inputs can produce different outputs depending on context, model temperature, and what happened in previous steps. When something goes wrong, "look at the code" doesn't give you answers. You need to know what the agent actually did, step by step, in that specific run.&lt;/p&gt;

&lt;p&gt;This is a fundamentally new problem. And regulators are starting to notice.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the law says now
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The EU AI Act
&lt;/h3&gt;

&lt;p&gt;The EU AI Act became enforceable in stages through 2025 and 2026. The full weight of it lands on August 2, 2026.&lt;/p&gt;

&lt;p&gt;For anyone deploying AI in high-risk categories, Article 19 is the one to know. It requires providers of high-risk AI systems to maintain automatically generated logs for a minimum of six months. Longer in some sectors. The logs must be detailed enough to reconstruct what the system did and why.&lt;/p&gt;

&lt;p&gt;High-risk categories include: employment and HR decisions, credit and financial services, healthcare, education, law enforcement, and critical infrastructure. If your AI agent touches any of those areas, Article 19 applies to you.&lt;/p&gt;

&lt;p&gt;The fines for non-compliance go up to 35 million euros or 7% of global annual turnover for the most serious violations, whichever is higher. These are not theoretical numbers. The EU has shown it will enforce them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The US picture
&lt;/h3&gt;

&lt;p&gt;The US has no single federal AI law yet. But the regulatory pressure is real and it comes from multiple directions at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOC 2&lt;/strong&gt; is the de facto standard for B2B SaaS security. If you're selling to enterprise customers, they will ask for your SOC 2 report. Auditors evaluating SOC 2 compliance specifically look for activity logs that show who or what accessed what, when, and what they did. An AI agent that sends emails or triggers payments on your behalf is a system that SOC 2 auditors will want to see logs for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HIPAA&lt;/strong&gt; applies to any system handling protected health information. If your agent reads patient records, schedules appointments, or processes healthcare data in any form, HIPAA requires six-year retention of activity logs. Six years. Most teams think about HIPAA in terms of data encryption and access controls, but the logging requirement is just as strict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOX and SEC rules&lt;/strong&gt; govern financial reporting and trading. If your agents are involved in expense approvals, transaction processing, or financial data handling, you need to be able to prove they followed the rules. Not just that the rules existed, but that they were followed, step by step, in each specific instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State laws&lt;/strong&gt; are filling the federal gap. Colorado's AI Act took effect in 2026, requiring reasonable care to prevent algorithmic discrimination and documentation to prove it. California has multiple overlapping AI transparency requirements now in effect. Texas's TRAIGA took effect on January 1, 2026. These laws are moving fast and the trend is clearly toward more documentation, not less.&lt;/p&gt;

&lt;h3&gt;
  
  
  The common thread
&lt;/h3&gt;

&lt;p&gt;Across all of these frameworks, the requirement is the same: prove what your AI did. Not in general. In the specific instance your auditor is asking about.&lt;/p&gt;

&lt;p&gt;"Our agent follows these rules" is not an answer. "Here is a timestamped, immutable record of every action the agent took on March 14 at 2:47pm, here is what it saw, here is what it decided, and here is why" is an answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with existing tools
&lt;/h2&gt;

&lt;p&gt;Most teams are using one of three approaches to deal with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application logs.&lt;/strong&gt; Standard server logs capture requests and responses but not reasoning. They tell you the agent made a call. They don't tell you what it was thinking. When something goes wrong, you're reconstructing a timeline from logs that were never designed to answer compliance questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM observability tools&lt;/strong&gt; like Langfuse or LangSmith are genuinely useful for debugging. They capture traces, spans, token counts, and latency. They're built for engineers who want to understand why a prompt failed or why costs spiked. They are not built for the compliance officer asking what your agent did on Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nothing.&lt;/strong&gt; More common than people admit. Teams move fast, get agents into production, and assume logging can be sorted out later. Later is when the auditor arrives.&lt;/p&gt;

&lt;p&gt;The gap isn't technical. The tools to capture logs exist. The gap is that nobody is building for the people who need to read those logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a real audit trail actually needs
&lt;/h2&gt;

&lt;p&gt;When regulators or auditors ask what your agent did, they need specific things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A complete timeline.&lt;/strong&gt; Every action in sequence. Not just the LLM calls but the tool calls, the decisions, the data accessed, the outputs produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reasoning, not just the result.&lt;/strong&gt; Why did the agent approve that payment? What criteria did it apply? What did it see that led it to that conclusion?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human review steps.&lt;/strong&gt; If a person signed off before the agent proceeded, that needs to be in the record too. The full chain of accountability, not just the automated parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immutability.&lt;/strong&gt; A log you can edit is not an audit trail. The record needs to be append-only with cryptographic proof that nothing was changed after the fact.&lt;/p&gt;
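&lt;p&gt;"Append-only with cryptographic proof" sounds abstract, but the core mechanism is small. A sketch of verifying a hash-chained log (illustrative only, not any particular product's record format):&lt;/p&gt;

```python
import hashlib
import json

def entry_hash(prev_hash, record):
    # Each entry's hash covers the previous entry's hash plus its own body.
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def verify_chain(entries):
    """Return the index of the first tampered entry, or None if intact."""
    prev_hash = "0" * 64
    for i, entry in enumerate(entries):
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return i
        prev_hash = entry["hash"]
    return None

# Build a two-entry log with hypothetical actions.
log = []
prev = "0" * 64
for record in [{"step": 1, "action": "read_invoice"},
               {"step": 2, "action": "approve_payment"}]:
    h = entry_hash(prev, record)
    log.append({"record": record, "hash": h})
    prev = h
```

&lt;p&gt;Any edit to any record after the fact changes its hash and the verification walk fails at exactly that entry. That failure is the proof an auditor is asking for.&lt;/p&gt;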

&lt;p&gt;&lt;strong&gt;Readability.&lt;/strong&gt; Your compliance team is not going to read JSON traces. The record needs to be something a non-technical person can actually understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention.&lt;/strong&gt; Six months minimum for EU AI Act. Six years for HIPAA. The record needs to exist when someone asks for it, not just when it's convenient.&lt;/p&gt;




&lt;h2&gt;
  
  
  The window is closing
&lt;/h2&gt;

&lt;p&gt;The EU AI Act enforcement deadline is August 2026. That is not far away. Companies that have been running agents in production without audit trails are going to face a choice: retrofit compliance into systems that were never designed for it, or get ahead of it now.&lt;/p&gt;

&lt;p&gt;Getting ahead of it now is much cheaper than getting ahead of it in July 2026 with an auditor waiting.&lt;/p&gt;

&lt;p&gt;The companies that take compliance seriously from the start will close enterprise deals faster. They will pass security reviews without delays. They will have answers when auditors ask questions. And when something goes wrong, they will know exactly what happened.&lt;/p&gt;

&lt;p&gt;The companies that wait will be doing log archaeology at the worst possible time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;In a follow-up post, I'll cover how I built &lt;a href="https://agentreceipt.co" rel="noopener noreferrer"&gt;AgentReceipt&lt;/a&gt; to solve this problem, including how hash chaining works, why we anchor receipts to a public transparency log, and how three lines of code gives your agent a tamper-proof audit trail.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>What to Tell Claude Code to Test (and What to Skip)</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Sun, 15 Mar 2026 21:02:12 +0000</pubDate>
      <link>https://dev.to/ramon_galego/what-to-tell-claude-code-to-test-and-what-to-skip-3foo</link>
      <guid>https://dev.to/ramon_galego/what-to-tell-claude-code-to-test-and-what-to-skip-3foo</guid>
      <description>&lt;p&gt;If you're using Claude Code to build apps, you've probably noticed it loves writing tests. Ask it to build a feature and it'll offer to test it. Ask it to fix a bug and it'll suggest adding coverage. Left to its own devices, it will happily generate hundreds of tests for your application.&lt;/p&gt;

&lt;p&gt;Most of them won't matter.&lt;/p&gt;

&lt;p&gt;Here's the filter I use before writing any test: &lt;strong&gt;if this breaks silently, what happens?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "nothing for a while" or "a user gets a slightly wrong result," skip it. If the answer is "data gets corrupted," "someone gets charged incorrectly," or "the failure is invisible until it's too late," write the test.&lt;/p&gt;

&lt;p&gt;That's the whole framework. Everything below is just applying it.&lt;/p&gt;

&lt;p&gt;One important caveat: this is a strategy for the build phase. If you're shipping an MVP or a side project and you're the only person working on it, this is the right approach. Once you have a team, long-lived systems, and daily deploys, your testing philosophy needs to evolve. But that's a different post.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Code will test if you don't guide it
&lt;/h2&gt;

&lt;p&gt;Claude Code defaults to thoroughness. It will test that your database queries return data, that your UI components render, that your API routes respond with 200, that your forms submit, that your auth redirects work.&lt;/p&gt;

&lt;p&gt;These tests aren't wrong. They're just low value at the start. A broken UI component is visible the moment you open the browser. A failing form submit takes five seconds of manual testing to catch. You don't need automation for things that announce themselves.&lt;/p&gt;

&lt;p&gt;What you need automation for are the things that fail quietly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The areas worth testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Business logic with edge cases
&lt;/h3&gt;

&lt;p&gt;Any function that makes a decision based on data is worth testing. Not the happy path, which you'll notice when it breaks. The edge cases are what get you.&lt;/p&gt;

&lt;p&gt;In a monitoring tool I built called &lt;a href="https://pulsemon.dev" rel="noopener noreferrer"&gt;PulseMon&lt;/a&gt;, the core checker function determines whether a monitor is late based on the last ping timestamp, the expected interval, and the grace period. It's about 20 lines of pure logic with 9 tests covering scenarios like: what if the last ping arrived exactly at the deadline, what if there's no ping at all, what if the grace period is zero.&lt;/p&gt;

&lt;p&gt;That function runs every 60 seconds for every user. If it's wrong, monitors either never alert or alert constantly. Neither failure is obvious until users complain. That's the definition of something worth testing.&lt;/p&gt;
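&lt;p&gt;The real checker isn't shown here, but the shape of that kind of function, and the edge cases worth pinning down, looks roughly like this (the name and signature are hypothetical):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def is_late(last_ping, interval_sec, grace_sec, now):
    """Hypothetical late-checker: a monitor is late once `now` is past
    last_ping + interval + grace. No ping at all counts as late."""
    if last_ping is None:
        return True
    deadline = last_ping + timedelta(seconds=interval_sec + grace_sec)
    return now > deadline
```

&lt;p&gt;The tests that matter are the boundaries: a ping that lands exactly on the deadline, a monitor that has never pinged, a grace period of zero. Each one is a single assertion, and each one encodes a decision the happy path never exercises.&lt;/p&gt;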

&lt;p&gt;&lt;strong&gt;Tell Claude Code:&lt;/strong&gt; &lt;em&gt;"Write tests for the core business logic only. Focus on edge cases, not the happy path."&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  External integrations that handle money or critical data
&lt;/h3&gt;

&lt;p&gt;Stripe webhooks. Payment processing. Anything where a bug means someone gets charged twice, doesn't get charged at all, or loses access to something they paid for.&lt;/p&gt;

&lt;p&gt;These are worth testing because the failure modes are severe and the bugs are subtle. A wrong status code, a missing field, an event type you didn't handle. These don't throw obvious errors. They silently do the wrong thing.&lt;/p&gt;

&lt;p&gt;For PulseMon's Stripe webhook handler there are 10 tests covering: subscription created, updated, deleted, payment failed, and an invalid signature. That last one matters because without it, anyone can send fake webhook events to your endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tell Claude Code:&lt;/strong&gt; &lt;em&gt;"Test the Stripe webhook handler. Cover subscription created, updated, deleted, and invalid signature. Mock the Stripe library at the module boundary."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unit tests cover your logic. Before going live, run one real test with the Stripe CLI to verify the raw body handling and your webhook secret are configured correctly. That catches the two bugs tests can't.&lt;/p&gt;
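&lt;p&gt;The invalid-signature case deserves a closer look. Stripe signs the raw request body with your webhook secret using HMAC-SHA256 over &lt;code&gt;timestamp.payload&lt;/code&gt;, as their docs describe. A standard-library sketch of that check, useful for understanding what the test is protecting:&lt;/p&gt;

```python
import hashlib
import hmac

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str) -> bool:
    """Check a Stripe-style signature header: 't=<ts>,v1=<hmac-sha256 hex>'.
    The signed payload is '<ts>.<raw body>'. A sketch for understanding only;
    in production use stripe.Webhook.construct_event, which also enforces
    a timestamp tolerance against replay attacks."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    signed_payload = parts["t"].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])
```

&lt;p&gt;Note that the signature covers the raw bytes of the body, which is why frameworks that parse JSON before your handler runs are a common source of verification failures.&lt;/p&gt;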




&lt;h3&gt;
  
  
  Authorisation boundaries
&lt;/h3&gt;

&lt;p&gt;Not "can a user log in" since that's visible immediately if it breaks. The subtle version: can user A access user B's data?&lt;/p&gt;

&lt;p&gt;In any multi-user app, the query that fetches data scoped to the current user is the one worth testing. The bug that leaks one user's data to another is catastrophic and invisible. You won't catch it in manual testing because you're always logged in as the same user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tell Claude Code:&lt;/strong&gt; &lt;em&gt;"Write tests that verify a user cannot access another user's resources. Mock the database and test that all queries include the userId filter."&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Anything that runs on a schedule without human oversight
&lt;/h3&gt;

&lt;p&gt;Cron jobs, background workers, scheduled cleaners. If nobody is watching them run, you need tests for the logic inside. Not integration tests that actually fire the job, but unit tests that cover what the job decides.&lt;/p&gt;

&lt;p&gt;If your cleanup job deletes records older than 30 days, test that it deletes the right records and leaves the wrong ones alone. These jobs run at 3am and nobody checks them.&lt;/p&gt;
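&lt;p&gt;For a 30-day cleanup, the boundary is where the bugs live. A sketch of the selection logic (names hypothetical), with the cutoff comparison made explicit:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def select_expired(records, now, max_age_days=30):
    """Return records strictly older than the retention window.
    Keeping a record that sits exactly on the boundary is a deliberate
    choice; a test that pins it down stops a refactor from silently
    flipping < to <= and deleting a day's worth of extra data."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["created_at"] < cutoff]
```

&lt;p&gt;The test for this is three records: one clearly expired, one exactly at the boundary, one recent. If the boundary record ever shows up in the result, the comparison changed and the test catches it before 3am does.&lt;/p&gt;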




&lt;h2&gt;
  
  
  What to tell Claude Code to skip
&lt;/h2&gt;

&lt;h3&gt;
  
  
  UI tests during early development
&lt;/h3&gt;

&lt;p&gt;Skipping UI tests is not a blanket recommendation. It's a prioritisation call for when you're moving fast.&lt;/p&gt;

&lt;p&gt;During the initial build, a broken button is visible the moment you look at the screen. Testing it at that stage slows you down without adding much. But once your core conversion paths exist (signup, onboarding, checkout), they are worth protecting. A global CSS change or a Tailwind config update can silently hide a button on mobile, and you won't catch it manually every time.&lt;/p&gt;

&lt;p&gt;The practical version: skip UI tests while you're building, add them for your most critical flows once they're stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tell Claude Code:&lt;/strong&gt; &lt;em&gt;"Skip UI component tests for now. We'll add coverage for critical conversion paths once they're stable."&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Simple CRUD operations
&lt;/h3&gt;

&lt;p&gt;Creating, reading, updating, and deleting records doesn't need tests if it's just calling an ORM method. The ORM is already tested. Your thin wrapper around it doesn't need coverage.&lt;/p&gt;

&lt;p&gt;The exception is CRUD that enforces business rules. A create operation that checks plan limits before inserting is worth testing. A create operation that just calls &lt;code&gt;db.insert()&lt;/code&gt; is not.&lt;/p&gt;
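
As a sketch of that distinction (the `create_project` function and the 3-project cap are hypothetical), the thing worth asserting is that the insert never happens once the limit is hit:

```python
# Hypothetical create operation that enforces a business rule.
class PlanLimitError(Exception):
    pass

FREE_PLAN_LIMIT = 3

def create_project(db, user):
    # The rule, not the insert itself, is what deserves a test.
    if user["plan"] == "free" and db.count_projects(user["id"]) >= FREE_PLAN_LIMIT:
        raise PlanLimitError("free plan is capped at 3 projects")
    return db.insert_project(user["id"])

class FakeDB:
    """Stand-in for the ORM layer; records whether an insert happened."""
    def __init__(self, existing):
        self.existing = existing
        self.inserted = False

    def count_projects(self, user_id):
        return self.existing

    def insert_project(self, user_id):
        self.inserted = True
        return {"owner": user_id}

# At the cap: the error fires and no insert is attempted.
db = FakeDB(existing=3)
try:
    create_project(db, {"id": "u1", "plan": "free"})
    raise AssertionError("expected PlanLimitError")
except PlanLimitError:
    pass
assert db.inserted is False

# Under the cap: the insert goes through.
db = FakeDB(existing=2)
create_project(db, {"id": "u1", "plan": "free"})
assert db.inserted is True
```

The ORM call is faked away entirely; only the rule is under test.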




&lt;h3&gt;
  
  
  Auth configuration, but not auth logic
&lt;/h3&gt;

&lt;p&gt;Libraries like Auth.js are tested by their maintainers. Whether your sign-in redirect fires correctly doesn't need coverage. You'll know immediately if it breaks.&lt;/p&gt;

&lt;p&gt;What is worth testing is the authorisation logic you write yourself: middleware that checks roles, functions that decide what a user can see, session handling that scopes data correctly. Those are yours to own, and that's where leaks happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tell Claude Code:&lt;/strong&gt; &lt;em&gt;"Skip testing that the auth redirect fires. Do test any custom middleware, role checks, or session scoping logic we've written."&lt;/em&gt;&lt;/p&gt;
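
One cheap way to test that scoping, assuming nothing about your stack (the `RecordingDB` and `get_invoices` names here are made up for the sketch), is to record every query and assert the user filter is always present:

```python
# Hypothetical data-access function: every query must be scoped
# to the requesting user, or one user can read another's data.
def get_invoices(db, user_id):
    return db.query("invoices", where={"user_id": user_id})

class RecordingDB:
    """Stand-in database that records the filters each query used."""
    def __init__(self):
        self.calls = []

    def query(self, table, where):
        self.calls.append((table, where))
        return []

db = RecordingDB()
get_invoices(db, "user-123")

# Every recorded query must carry the user filter.
for table, where in db.calls:
    assert where.get("user_id") == "user-123", f"{table} query missing user scope"
```

This is the mocked-database pattern from the prompt at the top: the test fails the moment someone writes a query that forgets the scope.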




&lt;h2&gt;
  
  
  A prompt that actually works
&lt;/h2&gt;

&lt;p&gt;When starting a new feature, give Claude Code this framing before asking it to write tests:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We only write tests for high-stakes areas where a silent failure would cause real damage. This means: core business logic with edge cases, external integrations that handle payments or critical data, and authorisation boundaries that prevent data leaks. Skip UI tests during initial build, basic CRUD, and standard auth library configuration. For each test you write, add a comment at the top of the test explaining in one sentence why a silent bug here would be a serious problem. If you can't write that sentence, the test shouldn't exist."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The comment becomes permanent documentation for why the test was written. Six months from now, when you're wondering whether you can delete it, the answer is right there.&lt;/p&gt;
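
In practice the comment reads like this (the subscription check is a made-up example; the shape of the comment is what matters):

```python
# Why this test exists: a silent bug here would let a cancelled
# subscription keep its paid features until the next billing cycle.
def has_paid_access(subscription):
    return subscription["status"] == "active"

def test_cancelled_subscription_loses_paid_access():
    assert has_paid_access({"status": "active"})
    assert not has_paid_access({"status": "cancelled"})

test_cancelled_subscription_loses_paid_access()
```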




&lt;h2&gt;
  
  
  Why the filter matters more than the framework
&lt;/h2&gt;

&lt;p&gt;Coverage targets and exhaustive test suites are a different conversation to the one most solo developers building with AI assistance need to have right now. The more immediate problem is Claude Code generating 80 tests when 15 would have been enough, most of them testing things that would have been obviously broken on first look.&lt;/p&gt;

&lt;p&gt;Start with the filter: test the things that fail silently and cause real damage, skip everything else, and build coverage from there as your product matures. Tell Claude Code exactly that and it'll spend its time on the things that actually matter.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>webdev</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Your Uptime Monitor Says Green. Your Users Disagree.</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Thu, 12 Mar 2026 16:35:45 +0000</pubDate>
      <link>https://dev.to/ramon_galego/your-uptime-monitor-says-green-your-users-disagree-b1o</link>
      <guid>https://dev.to/ramon_galego/your-uptime-monitor-says-green-your-users-disagree-b1o</guid>
      <description>&lt;p&gt;Your uptime monitor pings your homepage every 60 seconds and gets a 200 back. Green. Healthy. No alerts.&lt;/p&gt;

&lt;p&gt;Meanwhile, your checkout is broken because a payment webhook stopped processing three hours ago. Your welcome emails are queued and going nowhere. Your nightly sync hasn't run since Tuesday.&lt;/p&gt;

&lt;p&gt;This is the gap nobody talks about: uptime monitoring tells you your server is alive. It does not tell you your app is working.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an uptime check actually does
&lt;/h2&gt;

&lt;p&gt;A traditional uptime check sends an HTTP request to a URL and waits for a response. If it gets one, the monitor turns green. That's it.&lt;/p&gt;

&lt;p&gt;It doesn't know whether your database is accepting writes. It doesn't know whether your background workers are running. It doesn't know whether the queue has 40,000 unprocessed jobs backed up behind a dead consumer.&lt;/p&gt;

&lt;p&gt;A server can respond with 200 OK while being completely broken in every way that matters to your users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure modes your ping check misses
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Background workers dying quietly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most apps have workers running alongside the web server — email dispatch, order processing, report generation. These processes don't have a URL. Nothing pings them. When they crash or get stuck, there's no signal.&lt;/p&gt;

&lt;p&gt;Your web server keeps responding 200. The workers sit dead. Orders pile up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queue backlogs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A queue that's growing is worse than a queue that's empty. Your uptime monitor has no idea your jobs are sitting unprocessed because a consumer crashed. Users submit forms. The forms go into the queue. Nothing comes out the other side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party integration failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your app might be up but calling an external API that started returning 500s two hours ago. Stripe, Twilio, SendGrid, whatever. Your server is healthy. Your users' experience is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database write failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A read-only replica can serve your homepage and return 200 all day while your primary database rejects writes. Users think the form submitted. It didn't.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix: make your health check do real work
&lt;/h2&gt;

&lt;p&gt;The standard move is to point your uptime monitor at &lt;code&gt;/health&lt;/code&gt; and call it done. That endpoint usually just returns 200. It tells you the process is running, nothing more.&lt;/p&gt;

&lt;p&gt;Make it test something real instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This tells you nothing
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# This actually catches problems
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;db_ok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_database_write&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;queue_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_queue_depth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;worker_last_seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_worker_heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;db_ok&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;queue_depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue_depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker_last_seen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;worker_last_seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when your uptime monitor hits &lt;code&gt;/health&lt;/code&gt;, it's actually probing your database writes, your queue state, and whether your workers are alive. A 503 means something real is broken, not just that the process died.&lt;/p&gt;

&lt;p&gt;The rule of thumb: if a failure in X would affect users, X should be in your health check.&lt;/p&gt;
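
The three helper functions in the snippet above are left undefined because they depend entirely on your stack. As a rough sketch, assuming a database handle with an `execute()` method and a Redis-like client (all wiring here is hypothetical):

```python
import time

def check_database_write(db):
    # Probe a real write: a read-only replica can serve SELECTs
    # all day, so round-trip an upsert against a scratch row.
    try:
        db.execute(
            "INSERT INTO health_probe (id, ts) VALUES (1, now()) "
            "ON CONFLICT (id) DO UPDATE SET ts = now()"
        )
        return True
    except Exception:
        return False

def get_queue_depth(queue_client):
    # For a Redis-backed queue this is just the list length.
    return queue_client.llen("jobs")

def get_worker_heartbeat(kv, stale_after=120):
    # Workers write a timestamp after each cycle; report seconds
    # since then, or None if the worker has gone quiet.
    last = kv.get("worker:last_seen")
    if last is None:
        return None
    age = time.time() - float(last)
    if age > stale_after:
        return None
    return age
```

The write probe is the part most health checks skip; it's also the one that catches the read-only-replica failure mode described earlier.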

&lt;h2&gt;
  
  
  Monitoring things that don't have a URL
&lt;/h2&gt;

&lt;p&gt;Workers and scheduled tasks are harder because there's nothing to ping. The approach is to flip it — instead of you checking on them, they check in with you.&lt;/p&gt;

&lt;p&gt;At the end of each successful cycle, the worker hits a heartbeat URL. If the heartbeat stops arriving on schedule, you get alerted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_worker&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;process_queue_batch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pulsemon.dev/api/ping/order-queue-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key detail for workers: set the expected interval to at least twice the cycle time. A slow batch shouldn't trigger a false alert.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on alert fatigue
&lt;/h2&gt;

&lt;p&gt;One reason teams skip monitoring background tasks is they've been burned by flaky alerts before. A monitor that fires constantly gets muted. Then it fires for real and nobody notices.&lt;/p&gt;

&lt;p&gt;Heartbeat monitoring sidesteps this because the only alert condition is absence. There's no false positive from a brief network blip or a slow response. Either the job ran and checked in, or it didn't. That binary clarity means alerts stay trustworthy and actually get acted on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to audit in your own stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Any &lt;code&gt;/health&lt;/code&gt; endpoint that just returns 200 without touching the database&lt;/li&gt;
&lt;li&gt;Background workers with no heartbeat&lt;/li&gt;
&lt;li&gt;Queues with no depth monitoring&lt;/li&gt;
&lt;li&gt;Third-party integrations you assume are working because your server is up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If it can break without your uptime monitor noticing, it needs a second line of defence.&lt;/p&gt;




&lt;p&gt;PulseMon is heartbeat monitoring for your cron jobs, background workers, and scheduled tasks. Add a single ping to the end of any job and get alerted via email, Slack, Discord, or webhook when it stops running on schedule.&lt;/p&gt;

&lt;p&gt;Free plan has 30 monitors with 2-minute checks if you want to poke around: &lt;a href="https://pulsemon.dev/" rel="noopener noreferrer"&gt;PulseMon.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>webdev</category>
      <category>backend</category>
    </item>
    <item>
      <title>Why Your Cron Jobs Fail Silently (And How to Fix It)</title>
      <dc:creator>Ramon</dc:creator>
      <pubDate>Fri, 06 Mar 2026 22:37:31 +0000</pubDate>
      <link>https://dev.to/ramon_galego/why-your-cron-jobs-fail-silently-and-how-to-fix-it-164a</link>
      <guid>https://dev.to/ramon_galego/why-your-cron-jobs-fail-silently-and-how-to-fix-it-164a</guid>
      <description>&lt;p&gt;Your database backup runs every night at 2 AM. Your invoice generator fires every Monday morning. Your cache warmer runs every five minutes. They all work great until they don't.&lt;/p&gt;

&lt;p&gt;The problem with cron jobs is that they fail the same way they run: silently. Nobody is watching stdout at 2 AM. There's no browser to show an error page. When a cron job stops working, the only signal is the absence of something happening.&lt;/p&gt;

&lt;p&gt;You find out on a Friday afternoon that backups haven't run since Tuesday. Or a customer emails you because their weekly report never arrived. Or your disk fills up because the cleanup job died three weeks ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cron jobs fail
&lt;/h2&gt;

&lt;p&gt;The cron daemon itself is reliable. It has been running scheduled tasks on Unix systems since 1979. The daemon is not the problem. Everything around it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server reboots.&lt;/strong&gt; After a reboot, cron usually starts back up. But if your job depends on a mounted volume, a running database, or a network connection that takes 30 seconds to initialize, the first run after reboot fails. Silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk full.&lt;/strong&gt; Your job tries to write a temp file or a log entry. It can't. It crashes. Cron doesn't care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency failures.&lt;/strong&gt; The API you're calling is down. The database connection times out. The S3 bucket policy changed. Your job throws an exception on line 12 and exits with code 1. Nobody notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timezone issues.&lt;/strong&gt; You deployed to a server in UTC but wrote your cron expression assuming US Eastern. The job runs at the wrong time, or during a DST transition, it runs twice. Or zero times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The job itself crashes before logging.&lt;/strong&gt; This is the worst one. If your error handling depends on the job running long enough to reach the catch block, an early segfault or OOM kill means zero evidence that anything went wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why traditional monitoring misses this
&lt;/h2&gt;

&lt;p&gt;Most monitoring tools watch for things that are happening: high CPU, slow responses, error rate spikes. They're good at detecting active failures.&lt;/p&gt;

&lt;p&gt;Cron job failures are passive. They're the absence of something happening. Your APM won't alert you that a script didn't run. Your error tracker can't capture an exception from a process that never started.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dead man's switch pattern
&lt;/h2&gt;

&lt;p&gt;The fix is to flip the model. Instead of watching for failure, watch for the absence of success.&lt;/p&gt;

&lt;p&gt;This is called a dead man's switch, or heartbeat monitoring. The idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a monitor with an expected interval (say, "every 24 hours")&lt;/li&gt;
&lt;li&gt;Add a ping to the end of your job&lt;/li&gt;
&lt;li&gt;If the ping doesn't arrive on time, you get alerted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: you're not monitoring whether the job failed. You're monitoring whether it succeeded. If you don't hear from it, something went wrong. You don't need to know what.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Add a single HTTP request to the end of your script. If the script completes successfully, the ping fires. If it crashes, hangs, or never starts, the ping never arrives and you get an alert.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# backup-database.sh&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;  &lt;span class="c"&gt;# Exit on any error&lt;/span&gt;

pg_dump &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;gzip&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/backup.sql.gz
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; /tmp/backup.sql.gz s3://my-backups/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;.sql.gz
&lt;span class="nb"&gt;rm&lt;/span&gt; /tmp/backup.sql.gz

&lt;span class="c"&gt;# Report success&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; &lt;span class="nt"&gt;--retry&lt;/span&gt; 3 https://pulsemon.dev/api/ping/nightly-backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;set -e&lt;/code&gt; flag means the script exits on any error. The &lt;code&gt;curl&lt;/code&gt; at the end only runs if everything above it succeeded. If &lt;code&gt;pg_dump&lt;/code&gt; fails, if S3 upload fails, if the disk is full, the ping never fires.&lt;/p&gt;

&lt;p&gt;For Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# ... your job logic here ...
&lt;/span&gt;    &lt;span class="nf"&gt;run_etl_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pulsemon.dev/api/ping/nightly-etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Node.js:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ... your job logic here ...&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processQueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://pulsemon.dev/api/ping/queue-processor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What you should monitor
&lt;/h2&gt;

&lt;p&gt;Any scheduled process that runs unattended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database backups&lt;/strong&gt; are the most common silent failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email queues&lt;/strong&gt; stop processing and nobody complains for days because they assume it's normal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data syncs&lt;/strong&gt; between services. Your analytics dashboard shows stale numbers but looks fine at a glance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certificate renewals&lt;/strong&gt; from Let's Encrypt. The cert expires and your site shows a scary browser warning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup jobs&lt;/strong&gt; that free disk space. When they stop, other services start crashing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these run on your infrastructure, they should have a heartbeat monitor. It takes less time to set up than it does to recover from the failure.&lt;/p&gt;

&lt;p&gt;I built PulseMon to solve this for my own projects. Free tier with 30 monitors if you want to try it: &lt;a href="https://pulsemon.dev/" rel="noopener noreferrer"&gt;PulseMon.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>webdev</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
