DEV Community

David W. Adams

Why Your API Needs a Heartbeat (And Yours Probably Doesn't Have One)

Your API returns 200 OK. Great. But is it actually working?

I learned this the hard way. Last month, our payment webhook was "up" — returning 200s all day — but silently failing to process transactions. Three hours. Forty-seven failed payments. One very angry email from our biggest customer.

Our monitoring said everything was fine.

The 200 OK Lie

Most uptime monitoring checks one thing: does the server respond?

That's it. A simple HTTP request to your /health endpoint, a check for status code 200, and you're "healthy."

But here's the problem. Your server can return 200 OK while:

  • Your database connection pool is exhausted
  • Your message queue is backed up 10,000 messages
  • Your authentication service is timing out
  • Your critical background jobs haven't run in 6 hours

Real API monitoring checks things that actually matter:

  • Does the response match the expected schema?
  • Is latency under your SLA threshold?
  • Are auth tokens actually validating?
  • Did the database write actually happen?

What a Real Health Check Looks Like

Here is what I should have built from day one:

```
GET /health
Expected: {
  "status": "ok",
  "db": "connected",
  "queue": "processing",
  "last_job_run": "2024-04-05T12:03:00Z"
}
Max latency: 200ms
```

If any of those fields are missing, malformed, or slow, you want to know now. Not when customers start complaining on Twitter. Not when your payment processor threatens to shut you down.
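Here is a rough Python sketch of an endpoint like that. The probe functions (`check_db`, `check_queue`, `last_job_run`) are placeholders — wire them to your real connection pool, queue, and job tracker:

```python
from datetime import datetime, timedelta, timezone

# Placeholder dependency probes -- swap these for real checks,
# e.g. "SELECT 1" on the pool, a queue-depth query, a jobs-table read.
def check_db():
    return True

def check_queue():
    return True

def last_job_run():
    return datetime.now(timezone.utc)

def health_check(max_job_age=timedelta(hours=1)):
    """Build a health payload that reflects real dependencies,
    not just process liveness."""
    db_ok = check_db()
    queue_ok = check_queue()
    last_run = last_job_run()
    job_ok = datetime.now(timezone.utc) - last_run < max_job_age
    return {
        "status": "ok" if (db_ok and queue_ok and job_ok) else "degraded",
        "db": "connected" if db_ok else "down",
        "queue": "processing" if queue_ok else "backed_up",
        "last_job_run": last_run.isoformat(),
    }
```

The point is that `status` is computed from dependencies, so a dead queue or a stale background job flips the payload even while the web process is happily serving requests.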

The Three-Layer Monitoring Stack

After our incident, we rebuilt our monitoring from scratch. Here is the stack that actually works:

Layer 1: Synthetic checks. These hit your endpoints every minute from multiple locations. They validate response structure. They measure against your baseline. They tell you if your API is behaving, not just responding.
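The validation side of a synthetic check is just a schema-and-latency gate. A minimal sketch (field names and the 200ms budget mirror the example above; adjust to your own contract):

```python
# Expected shape of the /health payload and the latency budget.
REQUIRED_FIELDS = {"status": str, "db": str, "queue": str, "last_job_run": str}
MAX_LATENCY_S = 0.2

def validate_health(payload, elapsed_s):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], ftype):
            problems.append(f"bad type for {field}")
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}")
    if elapsed_s > MAX_LATENCY_S:
        problems.append(f"latency {elapsed_s * 1000:.0f}ms over budget")
    return problems
```

Run this from a cron job or scheduled function against each region's endpoint, and alert on any non-empty result.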

Layer 2: Heartbeat monitoring. Your app sends a ping to a monitoring service every few minutes. If the pings stop, something is critically wrong. This catches failures that synthetic checks miss — database locks, background job crashes, internal service outages.
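On the monitoring-service side, the heartbeat check reduces to a timestamp comparison. The interval and grace period here are illustrative defaults, not magic numbers:

```python
from datetime import datetime, timedelta, timezone

PING_INTERVAL = timedelta(minutes=5)  # how often the app should ping
GRACE = timedelta(minutes=2)          # slack for network hiccups

def heartbeat_overdue(last_ping, now=None):
    """True once the app has been silent longer than interval + grace.
    That silence is the alert: a crashed job runner can't send pings."""
    now = now or datetime.now(timezone.utc)
    return now - last_ping > PING_INTERVAL + GRACE
```

The app side is even simpler: a fire-and-forget HTTP GET to the monitoring URL at the end of each job cycle, wrapped in a try/except so a monitoring outage can never crash the thing it monitors.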

Layer 3: Real user monitoring. Track actual API calls in production. Alert on error rate spikes. Catch edge cases that no synthetic test could predict. This is your safety net.
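Error-rate spike detection can be as simple as a sliding window over recent status codes. The window size and threshold below are illustrative; tune them to your traffic:

```python
from collections import deque

class ErrorRateMonitor:
    """Alert when the 5xx rate over the last N requests crosses a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # True = request errored
        self.threshold = threshold

    def record(self, status_code):
        self.window.append(status_code >= 500)

    def should_alert(self):
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        return sum(self.window) / len(self.window) > self.threshold
```

Feed it from your request middleware and page someone when `should_alert()` flips — that catches the failures no synthetic test ever exercises.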

Why Most Teams Skip This

Because it is annoying to build.

You need:

  • A health endpoint that actually checks dependencies
  • A cron job or scheduled function to run synthetic tests
  • An alerting pipeline that does not spam you with false positives
  • A dashboard you will actually look at
  • Documentation so the next engineer understands why it exists

By the time you are done, you have spent 3 days building something that is not a feature. It is not moving metrics. It is just "reliability work" that your CEO does not understand and your users will never thank you for.

Until it saves you from a 3-hour outage.

The Build vs Buy Decision

You can build this yourself. Here is the math:

  • Lambda function: $0 per month (plus 4 hours of your time)
  • Alerting pipeline: $0 per month (plus ongoing maintenance)
  • Dashboard: $0 per month (plus updates every time someone asks a new question)
  • On-call rotation when alerts break: priceless (literally)

Or you use OwlPulse.

$9 per month. Monitoring in 90 seconds. Sleep better tonight.

I spent 3 days building our first monitoring stack. It had bugs. It had gaps. It woke me up at 3 AM with false alarms.

OwlPulse took 90 seconds to set up. It has not missed an issue. It has not cried wolf.

The math is not even close.

One Thing to Do Today

Go look at your /health endpoint. Seriously. Right now.

Does it just return {"status": "ok"}?

Or does it actually check the things your API needs to function?

If it is the first one, you are flying blind. Fix it. Or use something that fixes it for you.


Set up API heartbeat monitoring in 90 seconds: owlpulse.org
