DEV Community: velprove

Monitor a DigitalOcean App Platform or Droplet Stack

velprove — Mon, 01 Jun 2026 14:00:04 +0000

The short version: To monitor DigitalOcean App Platform properly, you have to know what its three native layers do and do not do. They are better than most people give them credit for: the App Platform health check; threshold alerts across App Platform, Droplets, and managed databases; and DigitalOcean Uptime, a real first-party external synthetic monitor. The catch is that all three answer the same question, "is the box or process up and reachable," and none of them assert the response body, validate JSON, run a multi-step flow, or sign into your login. So a green App Platform deploy whose internal health check passed and whose URL returns a 200 that DigitalOcean Uptime reads as healthy can still serve a broken build, a blank render, or a page whose managed-database read silently failed. This post maps the exact gap on each of the three DigitalOcean surfaces (App Platform components, raw Droplets, and managed databases) and shows how a free, no-code content-assertion and browser login monitor closes it.

The conceptual anchor: a green deploy that serves a broken page

Picture the failure this post is about. You push to your connected branch, App Platform rebuilds the component, the build passes, the internal health check confirms the port answers 2xx, and the deploy goes green. DigitalOcean Uptime, pointed at the public URL, reads a 200 and reports the service healthy. Every native signal you have is green. And the page your users load is blank, or it is yesterday's build, or it renders a shell while the read from your managed Postgres returns nothing. Nothing native catches it, because nothing native is looking at the body.

This is not a hypothetical specific to one bad day. DigitalOcean's own status history shows App Platform has had multi-region deploy and build-failure incidents, and the broader class, a deploy that succeeds from the platform's point of view but serves the wrong response, is the steady-state risk between named incidents. The rest of this post is the three DigitalOcean surfaces and the precise native gap on each.

What the three native DigitalOcean layers actually do

To monitor DigitalOcean App Platform honestly, you have to start by giving the native tooling full credit, because the differentiator is a capability gap, not an absence. There are three layers.

App Platform health checks. These are configurable HTTP or TCP readiness and liveness probes that run internally, from inside DigitalOcean's network, against an http_path you set. The readiness probe gates the deploy and stops routing traffic to an instance that is not ready. The liveness probe automatically restarts a component whose check fails and emails the account. They are real and you should configure them. What they confirm is narrow: the port answered 2xx from inside the platform. They never leave the network and never read the body for correctness.

Threshold metric alerts across all three surfaces. App Platform alert policies cover failed and successful deploys (failed-deploy alerting is on by default), CPU, RAM, restart count, request rate, and P95 request duration, delivered to email or Slack. Droplets get the free DigitalOcean Monitoring agent plus Alert Policies: threshold alerts on CPU, memory, disk, and bandwidth. Managed databases get their own alert policies on connection count, CPU, memory, and disk. Every one of these is a threshold on a metric. A metric crossing a line is a useful signal and a different signal from "the response is wrong."

DigitalOcean Uptime. This is DigitalOcean's own first-party external synthetic monitor, and it deserves to be named, because the lazy version of this post would pretend it does not exist. Per DigitalOcean's Uptime feature docs, it checks a URL or IP over HTTPS, HTTP, or ICMP from up to 4 global regions, alerts on downtime and latency to email and Slack, monitors SSL certificate expiry on HTTPS checks, and keeps up to 90 days of latency history. It is a competent reachability and latency monitor.

Side by side, the capability split is the whole story. The App Platform health check and DigitalOcean Uptime each assert reachability and metrics; neither reads the response body, validates JSON, or signs into a login. That body-and-login correctness layer is what an external no-code monitor adds.

Capability	App Platform health check	DigitalOcean Uptime	Velprove
Endpoint reachable / 2xx status	Yes (internal probe)	Yes (external, up to 4 regions)	Yes (external, 5 regions)
Latency and SSL expiry alerts	No	Yes	Yes
Auto-restart on failed liveness	Yes	No	No
Response body / string assertion	No	No	Yes
JSON validation and multi-step flows	No	No	Yes (multi-step up to 3 on free)
No-code browser login monitoring	No	No	Yes

The honest gap: reachability green, correctness unverified

Here is the thesis, stated precisely so nobody can read it as an overclaim. All three native layers answer "is the box or process up and reachable?" None of them assert the response body, validate JSON, drive a browser, sign into a login, or run a multi-step flow. DigitalOcean Uptime confirms a 200, acceptable latency, and a valid certificate. As verified against DigitalOcean's own feature docs, it does not do content matching, string matching, JSON validation, or login flows. It treats any status outside the 200-299 range as an outage and stops there.

The difference between DigitalOcean's health check and external monitoring is the difference between reachability and correctness: the App Platform health check confirms the component's port answers 2xx from inside the platform, while an external content monitor confirms the public response actually renders the right body. Reachability failures change the status code, so the native tools catch them. Correctness failures leave the status code at 200 and change only the body, so the native tools miss them entirely. A successful App Platform deploy can serve a 200 with a blank render, the wrong build, or a page whose database read silently returned empty, and your health check, your metric alerts, and DigitalOcean Uptime all stay green. The layer that reads the body is the external no-code assertion monitor, and it is the only thing in this stack that catches a logically broken 200.

One DigitalOcean-specific surface worth a single sentence: if you run worker components or DO Jobs, their PRE-, POST-, and FAILED-DEPLOY hooks are platform-side lifecycle events, not response correctness, and the way to make a worker observable from outside is the same freshness-endpoint approach the rest of this stack uses. The freshness-endpoint and build-SHA assertion mechanic itself is covered in our API health-check patterns guide, so this post references it rather than re-teaching it.

Monitor DigitalOcean App Platform components: assert the body the deploy left behind

To monitor a DigitalOcean App Platform app for correctness, point an external HTTP monitor at the component's public URL and assert on the response body, not just the status code. On an App Platform component the native gap is the deploy that succeeds but serves the wrong response, and the fix is a content assertion on the public URL. In the Velprove wizard, set the check type to an HTTP monitor, point it at your component's URL, and on the Verify step add two Success Conditions: a status code of 200, and a Response Body Contains assertion on a string your correct page always renders. Pick a string that is load-bearing, a known piece of copy, a marker your template emits, or, better, a value that only appears when a real read succeeded, so a blank or shell render fails the check even though the status code is 200.
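The two Success Conditions above can be sketched in a few lines of Python to show what the check evaluates. This is an illustrative sketch only, not Velprove's implementation; the function names and the example string "Acme Dashboard" are assumptions made for the example.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def evaluate(status: int, body: str, required_text: str) -> list[str]:
    """Return the failed conditions; an empty list means the check passed."""
    failures = []
    if status != 200:
        failures.append(f"expected status 200, got {status}")
    if required_text not in body:
        failures.append(f"response body does not contain {required_text!r}")
    return failures


def check_url(url: str, required_text: str, timeout: float = 10.0) -> list[str]:
    """Fetch the public URL and apply both success conditions."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError as exc:
        # urllib raises HTTPError for 4xx and 5xx responses.
        return [f"expected status 200, got {exc.code}"]
    except OSError as exc:
        # DNS failure, refused connection, or timeout.
        return [f"request failed: {exc}"]
    return evaluate(status, body, required_text)
```

A blank or shell render returns 200 and still fails here, because the second condition reads the body: `evaluate(200, "<html><body></body></html>", "Acme Dashboard")` reports a missing-string failure.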

For the deploy-skew case specifically, where the build succeeded but promoted the wrong code, the pattern is a freshness or build-SHA assertion against a light /version route your app exposes. That route and the multi-step build-SHA comparison mechanic are owned by the API health-check patterns guide, so configure it there and point a Velprove monitor at the result. The point on App Platform is only this: the platform tells you the deploy finished, your assertion tells you the deploy finished correctly.

Droplets and managed databases: same blind spot, different native tool

You can monitor a DigitalOcean Droplet, an App Platform component, and a Managed Database with the same external content monitor: each has a different native tool (Droplet metric Alert Policies, App Platform alert policies, Managed Database alert policies) but the same blind spot, because none of them assert the response body. On a Droplet, a raw VM, the native monitoring is the agent-based metric story: the free Monitoring agent plus Alert Policies on CPU, memory, disk, and bandwidth. Those are excellent for a runaway process or a full disk. They say nothing about whether the nginx in front of your app is returning your application or a default welcome page, or a 200 error page from a misconfigured reverse proxy. You point the same Velprove content-assertion monitor at the Droplet's public URL, and the metric-green-but-content-wrong case becomes visible.

On a managed database the native alert policies watch connection count, CPU, memory, and disk on the database itself. They cannot tell you the read your application performs through that database came back with the right data, or came back at all. A connection-pool exhaustion, a bad migration, or a permissions change can leave every database metric green while your app's query silently returns empty. The external probe that catches that does not touch the database directly. It asserts a string on a page or endpoint whose render depends on a real read: the metric stays green, the assertion goes red, and you learn the read broke before your customers file the ticket.
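As a rough sketch of that read-dependent assertion, the check below passes only when a string produced by a real read is present and no empty-state text appears. The marker strings and the function name are hypothetical. It also assumes the monitored page belongs to a test fixture that is known to have data.

```python
def read_backed_render_ok(
    body: str,
    data_marker: str,
    empty_markers: tuple[str, ...] = (),
) -> bool:
    """True only if the page shows data that a successful database read produces."""
    if data_marker not in body:
        # Shell rendered, but the value that depends on the read is missing.
        return False
    # Reject pages that also show an empty-state message.
    return not any(marker in body for marker in empty_markers)
```

With a fixture that always has an invoice "INV-1001", a page that renders "No invoices found" fails this check even though the status is 200 and every database metric is green.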

Note one thing this post is not claiming. Unlike a platform that sleeps idle services, DigitalOcean App Platform components stay warm, so the gap here is not cold-start or idle-sleep latency, the way it is on some sibling platforms (the idle-sleep contrast is covered in the Railway platform-layer guide). The DigitalOcean gap is purely the ceiling of native synthetic monitoring: reachability and metrics are covered, response correctness is not.

The browser login monitor on the real signed-in path

Content assertions prove the public surface renders correctly. They do not prove a real user can sign in and see their data, and on a DigitalOcean-hosted SaaS that is the failure that costs you customers. This is where the no-code login monitoring sits. Velprove's browser login monitor opens a real browser, signs in as a dedicated low-privilege test user, follows the post-login redirect, and asserts a string on the landing page that only renders if a real read from your managed database actually succeeded.

The setup is no-code: in the wizard you give the login URL, the test user's credentials, and, under Customize detection, switch Success verification from the default URL-change to "Page contains text" set to a post-login data string, a customer name, an invoice ID, a known plan label. A component can return 200 with an intact page shell while the database read behind the dashboard fails. A text-present assertion on post-login content catches that; a status-code probe never will. To be precise about the claim: Velprove is not the only tool that offers free browser checks, but the combination here, free and no-code login monitoring that signs into your own login with no Playwright code to write, is the differentiator.

Use a dedicated test account with the smallest permissions that still renders a real data-backed page, never production admin credentials. The browser login monitor is included in every plan, including the free plan, at a 15-minute interval, which is enough to catch a multi-hour database-backed outage and a login regression within one window.

How this compares to the sibling platform guides

The platform-sibling guides share this shape: native reachability is solid and native body-correctness is the gap. They differ on the exact native wedge. Heroku charges for native alerting, so external monitoring there is partly a cost play; DigitalOcean is different, because DigitalOcean actually sells you a competent external monitor in DigitalOcean Uptime, separately, and it still cannot assert your body, so the wedge here is capability, not price (the cost framing is in the Heroku platform-layer guide). For the broader pattern shared across managed-host platforms, see the Render platform-layer guide; the DigitalOcean version is distinguished by the three-surface split (App Platform, Droplet, managed database) and by the fact that DigitalOcean's first-party synthetic monitor sets the native ceiling higher than most while still stopping at reachability.

Getting started

The Velprove free plan covers 10 monitors total at a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, 5 global regions to choose from (one per monitor), email alerts, SSL expiry monitoring, and 1 status page. Commercial use is allowed on every plan, including free. No credit card required.

That is enough to land the DigitalOcean correctness layer for a single production app: a content-assertion HTTP monitor on your App Platform component or Droplet URL, a freshness or build-SHA assertion on a /version route, and one browser login monitor on the signed-in path. Keep your App Platform health checks, your metric alert policies, and DigitalOcean Uptime turned on; this sits on top of them and reads the body they never read. Start with the free plan. The first monitor takes about three minutes to configure.

Frequently Asked Questions

Isn't DigitalOcean Uptime or the App Platform health check enough?

They are real and worth turning on, but they answer a narrower question than most people assume. The App Platform health check is an internal HTTP or TCP readiness and liveness probe: it gates the deploy and auto-restarts a component whose port stops answering 2xx, and it emails the account. DigitalOcean Uptime is a genuine external synthetic monitor that checks a URL or IP over HTTPS, HTTP, or ICMP from up to 4 regions and alerts on downtime, latency, and SSL certificate expiry. What neither of them does is assert your response body, validate JSON, run a multi-step flow, or sign into your login. They confirm the box is up and reachable and the certificate is valid. They do not confirm the page renders correct content. That correctness layer is what an external no-code assertion monitor adds on top.

Can I monitor DigitalOcean App Platform on the Velprove free plan?

Yes. The Velprove free plan covers 10 monitors total at a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, SSL expiry monitoring, and 1 status page. Commercial use is allowed and no credit card is required. That is enough to put a content-assertion HTTP monitor on your App Platform component URL, a freshness or build-SHA assertion on a /version route, and one browser login monitor on the signed-in path of a DigitalOcean-hosted SaaS.

Is monitoring a DigitalOcean Droplet different from monitoring an App Platform component?

The native tools differ, the blind spot is the same. A Droplet is a raw VM, so its native monitoring is the free DigitalOcean Monitoring agent plus Alert Policies: threshold alerts on CPU, memory, disk, and bandwidth to email or Slack. An App Platform component is managed, so its native monitoring is App Platform alert policies (deploy outcome, CPU, RAM, restart count, request rate, P95 latency) plus the internal health check. Both are metrics and reachability signals. Neither asserts that the response your Droplet's nginx or your App Platform service returns is the correct content. You point a Velprove content-assertion or browser login monitor at the public URL in front of either surface, and the setup is the same regardless of which DigitalOcean product is behind it.

What about a DigitalOcean managed database? Native alerts already cover it.

Managed-database alert policies watch connection count, CPU, memory, and disk on the database itself, which is genuinely useful. They do not tell you that the read your app performs through that database returned the right data, or returned anything at all. A managed database can be green on every metric while a connection-pool exhaustion, a bad migration, or a permissions change makes your app's query silently return empty. The external probe that catches that is a content assertion on a page or endpoint whose render depends on a real database read: assert that a known string only that read produces is present. The database metric stays green; your assertion goes red.

Does DigitalOcean Uptime check the response body or sign into my login?

No. DigitalOcean Uptime confirms the endpoint is reachable, measures latency from up to 4 regions, and on HTTPS checks watches the SSL certificate expiry, with 90 days of latency history. It treats any HTTP status outside the 200-299 range as an outage. It does not do content or string matching, JSON validation, multi-step request chains, or browser-driven login flows. So a deploy that succeeds, passes the internal health check, and returns a 200 that DigitalOcean Uptime reads as healthy can still be serving a blank render, a stale build, or a page whose database read silently failed. The body-level and login-level correctness check is the no-code assertion and browser login monitor's job, not DigitalOcean Uptime's.

Your GraphQL API Returns 200 While It's Down. Here's How to Catch It.

velprove — Sun, 31 May 2026 14:00:03 +0000

The 30-second version: A GraphQL endpoint can return HTTP 200 while it is functionally down. When a field or a resolver fails on a well-formed request, GraphQL reports it inside the response body as a top-level errors array, often with null sitting in data where the real value should be. The HTTP status stays 200, so a plain status check stays green. The fix is a body-content assertion: confirm 200 AND that $.errors[*] has no matches AND that $.data.<criticalField> is not null. That is exactly what Velprove's free, no-code multi-step API monitor does: it asserts on the response body, not just the status code.

If you run a GraphQL API, your uptime monitor is probably lying to you. Not because the tool is bad, but because GraphQL breaks the assumption every status-only check is built on: that a 200 means the request worked. For REST that assumption mostly holds. For GraphQL it does not. A query can fail in a way that takes your whole feature down and still hand back a tidy 200 OK. To monitor GraphQL API uptime for real, you have to read the body.

This is the same blind spot we wrote about more generally in why a 200 OK can hide an outage. GraphQL is the sharpest example of it, because the 200-with-an-error is not an edge case here. It is the documented, default, by-design behavior.

Why GraphQL returns 200 when a field fails

Start with the scope, because the common overstatement ("GraphQL always returns 200") is wrong and will lead you to build the wrong monitor. The 200-with-errors behavior applies to one specific situation: a well-formed request that the server accepts and executes, responding with the application/json media type, where a field or a resolver fails during execution.

In that case the transport did its job. Your query parsed, validated, and ran. One of the resolvers threw, or returned null for a non-nullable field, or an upstream the resolver called timed out. The server has a valid HTTP response to send you, so it sends 200 and reports the failure in the body. The response carries a top-level errors array describing what broke, and a data object that holds null where the failed field should have been.

This is described in the graphql-over-http specification, and the clearest practitioner write-up is Nigel Sampson's "GraphQL and 200 Not OK" (2020), which frames the problem exactly the way a monitoring engineer runs into it. Sasha Solomon's "200 OK! Error Handling in GraphQL" (2019) covers the same ground from the schema-design side. The short version: in GraphQL, the HTTP status describes the transport, and the errors array describes your query. A monitor that only reads the status is reading the wrong layer.

The three failure modes a status check misses

There are three shapes this takes in production, and a status-only check is blind to all three. Each one returns 200. The samples below are responses your own GraphQL API would send back. Names and fields are illustrative.

1. Field error with a populated errors array

A resolver throws. The field it was responsible for comes back null, and the failure shows up in errors. The status is still 200.

Your dashboard's "who am I" query just failed for every signed-in user. The status check sees 200 and a non-empty body and reports green.

2. A critical field returns null while siblings resolve

The query mostly works. One important field goes null because its resolver failed, while the cheap fields around it resolve fine. Sometimes there is an errors entry, sometimes the resolver swallowed the error and just returned null. Either way the body looks populated.

The product page renders with a name and no price. Nothing is "down" by any status-code measure, but you cannot sell the thing. An assertion on $.data.product.price being non-null is the only check that catches this.

3. Partial data, the page half-loads

This is partial success: data and errors in the same response. The fields that worked are in data, the ones that failed are null, and errors explains the gaps. The graphql-over-http spec treats this as normal, expected behavior, not an error condition for the transport.

Half the page loads. The order details are there, the recommendations rail is empty. The response is 200 with a healthy-looking data object, and the only signal that something broke is the errors array nobody is reading.
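To make the three shapes concrete, here is a sketch with one hypothetical response body per failure mode, written as Python dictionaries. Every field name, message, and value is invented for illustration. The `status_only_check` function is an assumed simplification of what a status-only monitor evaluates.

```python
# Mode 1: a resolver threw; the field is null and errors is populated.
FIELD_ERROR = {
    "data": {"currentUser": None},
    "errors": [
        {
            "message": "Failed to load user",
            "path": ["currentUser"],
            "extensions": {"code": "INTERNAL_SERVER_ERROR"},
        }
    ],
}

# Mode 2: the resolver swallowed the error; no errors entry, critical field null.
NULL_CRITICAL_FIELD = {
    "data": {"product": {"name": "Trail Shoe", "price": None}},
}

# Mode 3: partial success; data and errors arrive in the same response.
PARTIAL_DATA = {
    "data": {"order": {"id": "ord_123", "total": 4200}, "recommendations": None},
    "errors": [{"message": "Upstream timed out", "path": ["recommendations"]}],
}


def status_only_check(status: int, body: dict) -> bool:
    """Roughly what a status-only monitor sees: a 2xx and a non-empty body."""
    return 200 <= status < 300 and bool(body)


# All three broken responses look healthy to a status-only check.
for sample in (FIELD_ERROR, NULL_CRITICAL_FIELD, PARTIAL_DATA):
    assert status_only_check(200, sample)
```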

When it's actually a 4xx or 5xx (and when it isn't)

GraphQL does use real HTTP error codes, just not for the failures above. Knowing where the line falls keeps you from building a monitor on a false assumption. There are three broad cases where the status does carry the signal, and one important divergence between the spec and what servers actually do.

Parse and validation errors. If your query is malformed, or asks for a field that does not exist, that is a request error caught before execution. Here the spec and real servers part ways. The graphql-over-http spec recommends returning 200 even for request errors when the response uses the application/json media type. In practice Apollo Server returns 400 for parse and validation errors. So do not assume 400 is universal, and do not assume 200 is either. It depends on the server. For a monitor this is fine, because a malformed canary query is your bug to fix before you ship the monitor, not a production signal.

Invalid variables. One trap worth a single sentence: older Apollo Server 4 returned 200 when a variable failed coercion, which meant a bad-input failure hid behind a success status. Current Apollo fixes this with status400ForVariableCoercionErrors, which returns 400 and is the default in Apollo Server 5.

Transport and server crashes. If the process is down, the load balancer has no healthy backend, or an upstream gateway times out, you get a real 5xx (or a connection failure). This is the one case a status-only check reliably catches, and it is the minority of GraphQL outages.

The newer media type. The spec defines a second media type, application/graphql-response+json, which may use non-200 statuses for errors, and the draft even sketches a non-standard 294 "Partial Success" code. Treat that as emerging, not deployed. The spec is still at Draft stage, and most servers in the wild still answer with application/json and 200. Build for what your server actually sends today.

Net of all of this: the real GraphQL blind spot is the field error that resolves to 200 with a populated errors array. No status code will surface it. You have to read the body.

How to monitor a GraphQL API for the 200-that-lies (Velprove, no code)

Velprove's free, no-code multi-step API monitor asserts on the response body, not just the status code. That is the whole game for GraphQL. The browser login monitor is the differentiator we lead with for sign-in flows, but the right tool here is the API monitor with a JSON-path assertion on the errors array. Here is the shape, in four steps, no config files.

Step 1. POST your GraphQL endpoint with a small canary query. Create an API monitor that sends a POST to your single GraphQL URL (something like /graphql) with a small, read-only query in the request body. Keep it cheap and stable. Ask for the one or two fields you most need to be alive. Run it with a dedicated low-privilege monitoring account, never real admin credentials.

Step 2. Assert the status code is 200. This is the baseline that catches the transport and crash failures from the section above. It is necessary and, on its own, nowhere near sufficient.

Step 3. Assert the JSON path $.errors[*] has no matches. This is the assertion that turns a status check into a real GraphQL health check. The [*] matches the entries inside the array, so it passes when errors is absent or an empty [], and fails the moment any error entry appears, even though the status is still 200. It catches failure modes 1 and 3 above.

Step 4. Assert $.data.<criticalField> is not null. Point this at the field your product genuinely depends on, for example $.data.currentUser.id or $.data.product.price. Use a not-null assertion, not a bare existence check. A field can be present and still null, which is exactly failure mode 2, where a critical field quietly goes null and the resolver swallowed the error so errors stays empty. Belt and suspenders: assert no error entries and a non-null value for the field you care about.
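For readers who want to see the logic behind the four steps, here is a minimal Python sketch of the same checks. It is not how Velprove runs them. The canary query, the bearer-token header, and all function names are assumptions for illustration, and the two helpers hand-roll the equivalent of the JSON paths rather than using a JSONPath library.

```python
import json
from urllib.request import Request

# Hypothetical canary query; use one or two fields your product depends on.
CANARY_QUERY = "query Canary { currentUser { id } }"


def build_canary_request(url: str, token: str) -> Request:
    """Step 1: a POST carrying a small read-only query as a JSON body."""
    payload = json.dumps({"query": CANARY_QUERY}).encode("utf-8")
    return Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumed auth scheme for a low-privilege monitoring account.
            "Authorization": f"Bearer {token}",
        },
    )


def errors_matches(body: dict) -> list:
    """Entries matched by $.errors[*]; empty when errors is absent or []."""
    errors = body.get("errors")
    return list(errors) if isinstance(errors, list) else []


def field_at(body: dict, *path: str):
    """Walk $.data.<path>; returns None if any step is missing or null."""
    node = body.get("data")
    for key in path:
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def graphql_healthy(status: int, body: dict, critical_path: tuple[str, ...]) -> bool:
    """Steps 2 to 4: status 200, no error entries, critical field not null."""
    return (
        status == 200
        and not errors_matches(body)
        and field_at(body, *critical_path) is not None
    )
```

Called as `graphql_healthy(200, body, ("product", "price"))`, this returns False for a response whose price is null even when errors is absent, which is the case a status check and an errors-only check both miss.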

That four-step, three-assertion pattern is the entire GraphQL-specific part. The mechanism underneath it, how a monitor sends a request body, reads the JSON response, and runs JSON-path assertions, is the same engine you would use for any API. If you want to extend this into a token-then-query flow, or chain several queries, that is just chaining and JSON-path assertions in a multi-step API monitor, and that guide teaches the mechanism end to end. This post only adds the GraphQL assertion shape on top of it. All of this runs on the free plan, from 5 regions, with commercial use allowed.

A GraphQL data probe is a different layer than a /healthz endpoint

It is tempting to think you already cover this because you have a /healthz endpoint. You do not. They are different layers and you want both.

A /healthz probe is an endpoint you deliberately build to report health. It returns 200 and a small body that says "I am up," usually after checking a database connection and a couple of dependencies. It is a self-report. The patterns for designing one are covered in our note on why a /healthz probe is a different layer.

A GraphQL 200-with-errors is the opposite situation. It is not a special health endpoint. It is your normal data endpoint, the one your app actually queries, telling you it is fine while a field underneath it is broken. A green /healthz can sit right next to a GraphQL query that returns null for the field that pays your bills. The health endpoint reports the service's opinion of itself. The canary query reports what a real client actually gets back. Monitor both.

Frequently asked questions

Why does my GraphQL API return 200 when there's an error?

When the request itself is well-formed and the server responds with the application/json media type, GraphQL signals field-level and resolver-level failures inside the response body, not in the HTTP status. The transport succeeded, so the status stays 200. The failure is reported as an entry in a top-level errors array, usually alongside a data object that holds null where the failed field should have been. The graphql-over-http spec describes this behavior, and most servers, including Apollo Server in its default configuration, follow it.

What's in the GraphQL errors array?

The errors array is a top-level field in a GraphQL response. Each entry is an object with a human-readable message, and usually a locations array pointing at the spot in the query that failed, a path array naming the response field that errored, and an extensions object that servers like Apollo use to carry a machine-readable code such as INTERNAL_SERVER_ERROR or UNAUTHENTICATED. When errors is present and non-empty, at least one part of your query did not resolve correctly, even though the HTTP status is 200.
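As a small illustration of those fields, the sketch below turns one errors entry into a single alert line. The function name and the output format are assumptions. It treats only message as always present and everything else as optional, matching the description above.

```python
def summarize_error(entry: dict) -> str:
    """Build a one-line summary from a single entry of a GraphQL errors array."""
    message = entry.get("message", "<no message>")
    # path names the response field that errored; it may contain list indexes.
    path = ".".join(str(part) for part in entry.get("path") or [])
    # extensions.code is a server convention (Apollo uses it), not guaranteed.
    code = (entry.get("extensions") or {}).get("code")

    parts = [message]
    if path:
        parts.append(f"at {path}")
    if code:
        parts.append(f"[{code}]")
    return " ".join(parts)
```

For an entry with message "Not authenticated", path ["currentUser"], and code UNAUTHENTICATED, this yields "Not authenticated at currentUser [UNAUTHENTICATED]".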

Can a GraphQL response have both data and errors?

Yes. This is called partial success, and it is normal GraphQL behavior. If one field's resolver throws while its siblings resolve fine, the server returns a data object containing the fields that worked plus null for the field that failed, and an errors array describing what went wrong. A status check sees 200 and a non-empty body and reports the API as healthy. The page is half-broken. This is the single most important reason to assert on the body, not the status.

Does GraphQL ever return a 4xx or 5xx?

Yes, for failures that happen before execution or in the transport. A malformed query or a request body that fails to parse is a request error, and while the graphql-over-http spec recommends 200 under application/json, Apollo Server actually returns 400 for parse and validation errors. Invalid variable values return 400 in current Apollo when status400ForVariableCoercionErrors is on, which is the default in Apollo Server 5. A crashed server or an upstream that times out returns a 5xx. The gap a status check cannot see is the field error that resolves to 200 with a populated errors array.

How do I alert on a GraphQL errors array if the status is 200?

Use a monitor that asserts on the response body, not just the HTTP status. POST a small canary query to your GraphQL endpoint, then add three assertions: status code equals 200, the JSON path $.errors[*] has no matches, and the JSON path $.data.<criticalField> is not null. If the errors array fills in or your critical field goes null, the monitor fails even though the status is still 200. Velprove's free, no-code multi-step API monitor asserts on the response body, not just the status code.

What query should I use to monitor a GraphQL endpoint?

Use a small read-only canary query that touches the field you care about most, run with a dedicated low-privilege monitoring account rather than real admin credentials. A good canary asks for one critical field and maybe one stable identifier, for example a viewer or health-style query that returns a known id. Keep it cheap so it does not load your resolvers, keep it stable so it does not break on unrelated schema changes, and assert that the one critical field comes back null-free with an empty errors array.

Set up a free GraphQL monitor with Velprove. POST a canary query, assert 200 AND no errors entries AND a non-null critical field, all no-code, from 5 regions, commercial use allowed. The next time a resolver fails behind a 200 OK, you hear about it before your users do.

Why Jetpack and ManageWP Report False Downtime: The Two Failure Modes and the Fix

velprove — Fri, 29 May 2026 14:00:03 +0000

The pattern. Jetpack and ManageWP false-report downtime for two reasons: a roughly ten-second timeout that trips on a slow shared-hosting load (or a firewall and security plugin that blocks the probe), and a homepage HTTP 200 check that stays green while a broken checkout or locked-out wp-admin sits behind it. The durable fix is depth, not a different external vendor: allowlist the probe to stop the false alarms, then monitor what a real user does. Velprove's browser login monitor signs into your own wp-login.php and asserts a logged-in-only string, so it fails the moment wp-admin breaks instead of reporting a green homepage.

To put numbers on it: Jetpack checks from WordPress.com servers in the United States every five minutes via an HTTP HEAD request, and ManageWP checks from its own external network on a similar interval. Both are external checks, and both fail in the same two opposite directions. They cry wolf, sending a phantom "your site is down" email while real visitors load the page fine. And they go blind, staying green while a real outage runs on a part of the site the homepage check never looks at. You can build all of this on the free plan, no credit card required.

The two ways a built-in WordPress uptime monitor sends false reports

Jetpack's Downtime Monitor and ManageWP's uptime monitor are the two most common built-in options a WordPress owner reaches for. Both are external checks. Jetpack, per its own support documentation, runs from WordPress.com servers: "one of our servers will start checking your site every five minutes," and it "pings your site's homepage every five minutes, via a HTTP HEAD request." ManageWP checks externally on a similar interval. Neither one is a process living on your server. They both reach in from the public internet, which is the right place to monitor from.

That shared design is exactly why both fail the same two ways. The first failure is a false alarm, a cry wolf. The monitor sends you a "site is down" email while your site is up and serving real visitors. The second failure is the opposite, a silent miss. The monitor stays green and says nothing while a real outage is happening on a part of the site the monitor never looks at.

These two failures pull in opposite directions, and that is what makes a built-in homepage monitor so frustrating to trust. It pages you when nothing is wrong, and it stays quiet when something is. Over a few weeks of phantom emails, the natural human response is to stop reading the alerts, which is the worst possible outcome: now you have a monitor that both lies to you and gets ignored. The rest of this post takes the two failure modes one at a time, with the real complaints and vendor quotes that document them, and then lands on the fix that addresses both.

Cry wolf, part one: the ten-second timeout on slow hosting

The most common false-down alert is a timeout. Jetpack decides your site is down if it does not answer within about ten seconds on a check. On shared hosting, a single request can blow past that window when a neighbor on the box is spiking, when an uncached page has to be rebuilt, or when a backup job is running, even though real visitors with warm caches never notice.

This is not a guess. A WordPress.org support thread from September 2024 captures the experience precisely. User @philipgilson posted on the WordPress.org forums (last modified 2024-09-22):

"I am repeatedly getting emails from JetPack that my website appears to be down... Then another saying it's back online. This happens a lot. however when I try to load the website it's not slow to load and loads fine. Error reference: 234058489/intermittent." *

A site that is genuinely down does not come back online thirty seconds later, over and over, while loading fine the whole time in a browser. That up-down-up flapping with an /intermittent error reference is the signature of a timeout, not an outage. And Jetpack's own staff confirm the mechanism. In the same complaint family, an Automattic staffer explained it directly:

"If your site is slow to load, it could trigger a notice that it is down... the site may actually be loading, but it's just slow, and Jetpack thinks that this slowtime is instead a sign that the site is offline." *

The same thread documents the exact threshold and the fact that you cannot change it: "the Jetpack requests timed out, meaning that your site does not respond to these requests after 10 seconds," and "there's no way to adjust the time that the Monitor checks your site, it's either on or off." So the timeout is fixed, the threshold is short, and on slow hosting it trips on a load that a human would happily wait out. This is a real, persistent, vendor-acknowledged behavior, not a recent regression, and it is the number-one source of phantom "down" emails.

The important diagnostic move here is to separate a false alarm from a site that genuinely keeps falling over. If your site really is going down on a recurring basis, the timeout alert is correct and the problem is upstream of the monitor. We walk through the real causes of a recurring outage, from memory limits to plugin conflicts to host throttling, in why your WordPress site keeps going down. If, instead, the site loads fine every time you check and the alerts flap up and down, you are looking at a cry-wolf timeout, and the fix is not to keep restarting things. It is to monitor differently.

Cry wolf, part two: your firewall or security plugin blocks the probe IPs

The second false-down cause is a block. The probe never reaches WordPress because something on your side stops it at the door: a host firewall rule, a security plugin like Wordfence, a geoIP block, or a rate limiter that decides a request hitting your homepage every five minutes from the same place looks like a bot. The page loads fine for you because your request is not the one being blocked. The probe is.

For Jetpack specifically, there is an extra dependency that makes this worse: the connection layer runs through xmlrpc.php. A WordPress.org thread from early 2022 shows the volume this can reach. User @johnnyivan reported on the WordPress.org forums (thread last modified 2022-03):

"For a couple of weeks, JetPack's sending me hundreds of emails saying my site's down, then back up again. Error reference: 142945637/intermittent... I contacted my hosting provider and they can see nothing wrong, and neither can I." *

An Automattic staffer in that thread named the cause:

"When we try to test your site's xmlrpc.php file (which Jetpack uses to communicate with your site), it is timing out... some hosts block connection requests to that file." *

Here is the tension. Many WordPress owners deliberately block or disable xmlrpc.php, because it is a well-known target for brute-force login attempts and pingback amplification attacks. That is a defensible, even recommended, security posture. But the moment you do it, Jetpack Monitor loses its connection path and starts emailing you phantom downtime. You did the right thing for security and got punished with a noisy monitor. Blocking harder does not fix it, and unblocking xmlrpc.php weakens your security to satisfy a monitor, which is backwards.

ManageWP hits the same wall from a different angle. It checks from an external network, and that external probe gets firewalled like any other outside request. A common symptom is a timeout error of the form "Response timeout, did not receive response for 30sec" thrown against a site that is loading perfectly in a browser. A security forum discussion of this exact error describes the cause plainly: a response timeout like that "usually indicates that some security configuration is blocking requests." (We are not naming ManageWP's backend probe vendor here, because the only sourcing for that is a single third-party forum, not a ManageWP document. The point stands without it: ManageWP checks externally, and the external check gets blocked the same way Jetpack's does.)

The honest fix for both is allowlisting, not blocking, and it is worth stating clearly because the forum advice often gets this backwards. You allowlist the specific monitor so your security layer stops eating its requests, and you keep the rest of your hardening in place. But allowlisting alone only stops the false alarms. It does nothing about the second, quieter failure, which is the subject of the next section.

Going blind: a homepage 200 is not proof your site works

Now flip the failure mode. Jetpack pings your homepage with an HTTP HEAD request and reads the response. ManageWP assesses your site's status from its response code, with an optional single-keyword match on the homepage. In both cases the monitor is asking one question: did the homepage answer with a 200? And an HTTP 200 is not proof that your site works. It is proof that one URL returned a success code.

Think about what actually breaks on a WordPress site and where it lives. A WooCommerce checkout that throws a fatal error on the payment step. A wp-admin that locks every editor out after a plugin update. A membership area that silently logs users out because a session table filled up. A page that renders a white screen below the fold while the header and footer come through fine. Every one of those can sit behind a homepage that still returns 200. The monitor sees green. Your customers see a dead checkout. This is the general shape of a silent outage, and we cover the broader pattern, including why most uptime tools miss it, in why uptime monitors miss real outages. The Jetpack and ManageWP version is just the most common WordPress instance of it.

This is not a complaint we are quoting from a forum, because owners rarely file a support ticket that says "my monitor stayed green during an outage" (they never knew the monitor should have caught it). It is an architectural limit you can reason about directly. A monitor that only knows the homepage status code can only ever tell you the homepage answered. It is structurally incapable of telling you that the logged-in dashboard renders, that the checkout completes, or that the members area still authenticates, because it never asks. The optional ManageWP keyword match helps a little, but only on the homepage, and only against a single static string, so a white-screen checkout three clicks deep is still invisible.

The takeaway is uncomfortable but clean. The cry-wolf problem makes the monitor annoying. The go-blind problem makes it dangerous, because it gives you a false sense of safety. A green pill on a homepage HEAD check is the monitoring equivalent of checking that the front door opens and concluding the whole house is fine.

Why "switch to another external monitor" does not fix it

The usual advice when Jetpack gets noisy is to switch to a dedicated external uptime monitor. Plenty of articles frame this as "external monitoring is the key," and Automattic staff have themselves recommended moving to a separate service in support threads. Switching off the built-in monitor and onto a dedicated one is reasonable. But if the dedicated monitor you switch to is just another homepage status check, you have not fixed anything. You have moved the same two failure modes to a different logo.

Walk it through. Both Jetpack and ManageWP already check your site from the outside, so swapping one outside vendor for another that does the same shallow thing changes nothing. A second homepage-200 monitor inherits both problems: it still times out on a slow shared-hosting load, it still gets firewalled if your security layer blocks its probe, and it still goes blind to anything behind a 200. You have changed vendors, not failure modes.

What actually moves the needle is depth plus robustness. Depth means checking what a user actually does, the login flow, the post-login content, a money path, rather than only whether one URL returns a status code. Robustness means a probe that is harder to false-trip and a configuration that does not page you on a single slow load. If you want to move off a plugin entirely and run your monitoring from outside WordPress, the mechanics of doing that cleanly, without touching wp-content, are in monitoring WordPress uptime without a plugin. The point of this section is narrower: do not let the "just go external" advice convince you that the vendor was the problem. The shallow check was.
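One common way to avoid paging on a single slow load is to require several consecutive failed checks before alerting. The sketch below illustrates that idea only; the function name and the default of two failures are assumptions, and it does not describe how Jetpack, ManageWP, or Velprove decide when to alert.

```python
from typing import Iterable, List


def alert_indexes(check_passed: Iterable[bool], failures_required: int = 2) -> List[int]:
    """Return the check indexes at which an alert would fire.

    An alert fires once, when the run of consecutive failures first
    reaches failures_required; any passing check resets the run.
    """
    alerts: List[int] = []
    streak = 0
    for index, passed in enumerate(check_passed):
        if passed:
            streak = 0
            continue
        streak += 1
        if streak == failures_required:
            alerts.append(index)
    return alerts
```

With the default, a site that flaps up, down, up, down never alerts, while two failed checks in a row do. The trade-off is one extra check interval of delay before a real outage is reported.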

The durable fix, in plain steps: monitor what a user does

The fix is to stop asking "did the homepage return 200" and start asking "can a real user do the thing they came here to do." That is four moves, and the first one is the differentiator. None of this involves any configuration file or API payload. You build each of these in a wizard, step by step, the same way you would set up a forwarding rule in your email client.

Move one: a browser login monitor on your own wp-login.php. This is the move that catches the most, and it is the one a homepage HEAD request can never replicate. Velprove's browser login monitor opens your own wp-login.php in a real browser, signs in with a dedicated Subscriber-role test account, waits for the dashboard to load, and asserts that a logged-in-only string is actually on the page. If wp-admin locks everyone out after a plugin update, this monitor fails on the next check, because the real browser cannot reach the signed-in state. A status-code probe would have stayed green the whole time. Step one in the wizard: a real browser opens yourdomain.com/wp-login.php. Step two: it signs in with the test account's username and password. Step three: it asserts that a logged-in-only string (the "Howdy" greeting, the admin bar, a dashboard widget label) is present. If you want the step-by-step setup of a wp-admin login monitor specifically, with screenshots, that lives in how to monitor your wp-admin login.

One hard rule on this move, because it is the most common mistake. Point the browser login monitor at your own wp-login.php on your own domain, where you control a dedicated low-privilege test user. Do not point it at WordPress.com or Jetpack's own login, which sit behind device verification and email codes a monitor cannot complete (this is covered in the FAQ below). And never wire your real admin credentials into a monitor. Create a throwaway Subscriber account whose only job is to prove the login flow works.

Move two: a content assertion on the homepage, not just a status code. Keep an HTTP monitor on the homepage, but make it assert on a specific string that only renders when the page actually built correctly. The title of your latest post, a footer copyright line with the current year, a product name. A white screen of death often still returns a 200 with an almost-empty body, so a status-only check passes while the page is blank. A body assertion on a real string fails the moment the content stops rendering. The wizard move: add the homepage URL, then add a body-contains assertion on a string you know is always on the page.
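The wizard needs no code, but the logic of a body assertion is small enough to show. This is a minimal sketch, assuming the status code and body have already been fetched; `homepage_ok` and the marker string are hypothetical names chosen for illustration.

```python
def homepage_ok(status_code: int, body: str, marker: str) -> bool:
    """Pass only when the status is 200 and the marker string is in the body."""
    if not marker:
        # An empty marker would match every body, including a blank page.
        raise ValueError("marker must be a non-empty string")
    return status_code == 200 and marker in body
```

A white screen that returns a 200 with an empty body passes a status-only check, but `homepage_ok(200, "", "Example Shop")` is false, which is the whole point of asserting on content.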

Move three: a multi-step monitor for a money path. If your site sells something or has a multi-page flow that matters, a multi-step monitor walks a short sequence of requests in order and asserts at each stop. Velprove's free plan covers a multi-step monitor of up to three steps, which is enough to fetch a page, follow it to the next, and assert that the expected content arrives. Each step asserts against a static expected value, a status code or a known string, not a moving target. Velprove runs each step once in sequence; there is no polling or wait-for-condition primitive, so if you need a freshness check, your own endpoint should compute that server-side and the monitor asserts a 200 on it. The wizard move: add step one, add its assertion, add step two, and so on, up to three.
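To make the "run each step once, assert a static value, stop at the first failure" shape concrete, here is a hedged Python sketch. The `Step` type, `run_steps`, and the injected `fetch` callable are all hypothetical; the fetcher is passed in so the example runs without network access, and nothing here reflects Velprove's internals.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A fetcher takes a URL and returns (status_code, body). A real one would
# wrap an HTTP client; it is injected here so the sketch is self-contained.
Fetcher = Callable[[str], Tuple[int, str]]


@dataclass
class Step:
    url: str
    expected_status: int = 200
    expected_text: Optional[str] = None  # static string the body must contain


def run_steps(steps: List[Step], fetch: Fetcher, max_steps: int = 3) -> Tuple[bool, Optional[int]]:
    """Run each step once, in order. Return (passed, index_of_failed_step)."""
    if len(steps) > max_steps:
        raise ValueError(f"at most {max_steps} steps are allowed")
    for index, step in enumerate(steps):
        status, body = fetch(step.url)
        if status != step.expected_status:
            return False, index
        if step.expected_text is not None and step.expected_text not in body:
            return False, index
    return True, None
```

There is no retry and no polling in the loop, matching the once-in-sequence behavior described above, so a failure always points at exactly one step.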

Move four: spread your HTTP and multi-step monitors across regions. A single blocked region is one of the cry-wolf causes from earlier. All five Velprove regions (North America, Europe, United Kingdom, Asia, Oceania) are available on every plan, including free. HTTP and multi-step monitors can be distributed across regions, so running the same homepage assertion from a few different locations tells you whether a failed check is your site falling over or just one regional path getting firewalled or hitting a CDN issue. The browser login monitor runs from one region at a time, so pick the region closest to most of your users for that one. If a probe from a single location trips while the others stay green, you are looking at a one-path problem, not a site-wide outage.
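The reasoning for reading multi-region results can be sketched as a small classifier. The labels and the function name are assumptions made for this example, and the input is simply a map from region name to whether that region's check passed.

```python
from typing import Dict


def classify_failure(region_passed: Dict[str, bool]) -> str:
    """Classify one round of the same check run from several regions."""
    if not region_passed:
        raise ValueError("need at least one region result")
    failed = [region for region, passed in region_passed.items() if not passed]
    if not failed:
        return "healthy"
    if len(failed) == len(region_passed):
        # Every location failed: most likely the site itself.
        return "site-wide"
    if len(failed) == 1:
        # One location failed while the rest stayed green.
        return "one-path"
    return "partial"
```

Note that with only one region configured, a failure can only ever be classified as site-wide, which is why the distinction needs at least two locations to mean anything.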

Those four moves fit inside Velprove's free plan: 10 monitors total, one browser login monitor, a multi-step monitor up to three steps, all five regions, email alerts, no credit card required, and commercial use allowed. The browser login monitor on your own wp-login is the piece that separates this from yet another homepage ping, and it is the piece neither Jetpack nor ManageWP can do.

What to do with Jetpack or ManageWP once you have real monitoring

Once a deeper monitor is watching the login flow and real content, the built-in monitor has one of two honest jobs left. You can keep it on as a free homepage backstop, a second pair of eyes on the single thing it can actually see, and treat its alerts as low-priority hints rather than pages. Or you can turn it off to stop the noise, especially if the cry-wolf timeouts and the xmlrpc.php blocks have already trained you to ignore its emails. Both are reasonable. A monitor you ignore is worse than no monitor, so if Jetpack's false alarms have burned your trust, turning it off and leaning on the deeper monitor is the cleaner choice.

The bigger structural question, whether a WordPress owner should run a built-in plugin monitor at all, which segment each tool fits, and how to choose among the options, is not what this post is for. This post is about why the false reports happen and how to make them stop. The full landscape decision lives in our complete guide to WordPress uptime monitoring, which walks the whole field including where Jetpack Monitor and ManageWP fit and when each one is enough. If you arrived here trying to decide what to use rather than why the alerts are lying to you, start there.

Frequently Asked Questions

Why does Jetpack Monitor keep emailing that my site is down when it loads fine for me?

Two causes account for most of these phantom alerts. First, Jetpack flags your site down if it does not answer within about ten seconds, and a busy shared host can blow past that window on a single check while real visitors with warm caches load the page fine. The alert body usually says your site is responding intermittently or extremely slowly and carries an error reference ending in /intermittent. Second, something on your side blocks the probe before it ever reaches WordPress: a firewall rule, a security plugin, a geoIP block on the United States where the Jetpack servers live, or a blocked xmlrpc.php file that Jetpack relies on to talk to your site. The page loads for you because your request is not blocked. The probe is. Allowlisting the probe and adding a deeper check that asserts on real content is the durable fix, not turning the timeout knob, because Jetpack does not expose one.

Why does ManageWP show my site as down when it is up?

ManageWP checks your site from an external network, the same way any outside monitor does, and that external probe gets firewalled like any other. A common report is a "Response timeout" that did not receive a response within thirty seconds on a site that is loading fine in a browser, which usually traces back to a security configuration on the host or a security plugin blocking the request before WordPress answers. The first move is to confirm the site really is up from several places, then allowlist the probe so the security layer stops eating it. The deeper move is to monitor what a signed-in user actually does, because a status-code probe will keep reporting a green homepage even when the part of the site your customers use is broken.

Can a five-minute homepage monitor miss a real outage?

Yes, in two ways. A short outage that opens and closes inside the five-minute gap between checks can be invisible simply because no check landed during it. More importantly, both Jetpack and ManageWP look at whether the homepage returns an HTTP 200, and an HTTP 200 is not proof the site works. A broken checkout, a locked-out wp-admin, a logged-out members area, or a white screen below the fold can all sit behind a 200 response. The homepage answers, the status code is green, and the part of the site that makes you money is dead. A monitor that asserts on real content or drives a real login catches that class of failure. A status-code probe cannot.
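The first of those two ways can be put in numbers. As a simplified model, assume checks fire at a fixed interval and the outage starts at a uniformly random moment relative to them; then the chance that at least one check lands inside the outage is the outage length divided by the interval, capped at 1. The function name is hypothetical, and the model ignores timeouts and any confirmation logic a real monitor applies.

```python
def detection_probability(outage_seconds: float, interval_seconds: float = 300.0) -> float:
    """Chance that a fixed-interval check lands inside a randomly timed outage."""
    if outage_seconds < 0 or interval_seconds <= 0:
        raise ValueError("outage must be >= 0 and interval must be > 0")
    return min(1.0, outage_seconds / interval_seconds)
```

Under that model, a one-minute outage against a five-minute interval is seen about one time in five, and anything of five minutes or longer is always hit by at least one check.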

Should I block the monitor's IPs or the xmlrpc.php requests to stop the noise?

Blocking is what causes the false alerts in the first place, so blocking harder is the wrong direction. Jetpack Monitor depends on a reachable xmlrpc.php to talk to your site, and many WordPress owners disable or block xmlrpc.php for security because it is a common brute-force and amplification target. That is a reasonable security posture, but it breaks Jetpack's connection and produces phantom downtime alerts. The fix is to allowlist the specific monitor's requests rather than turn off your security, or to move the uptime signal to a monitor that does not depend on xmlrpc.php at all. Pair the allowlist with a monitor that checks depth, a real content assertion or a real login, so you are protected and not paged on noise.

Will a browser login monitor work against WordPress.com or Jetpack's own login?

No. Point a browser login monitor at your own wp-login.php on your own domain, where you control a dedicated low-privilege test account. It will not work against WordPress.com, the Jetpack dashboard, or any consumer login you do not control, because those sit behind device verification, email codes, and captchas that a monitor cannot complete. That limitation is fine for the WordPress owner, because the surface you actually want to verify is your own login at yourdomain.com/wp-login.php with a Subscriber-role test user, which is exactly where a browser login monitor belongs.

Monitor a Supabase App: Auth, RLS, Edge Functions, Realtime

velprove — Thu, 28 May 2026 14:00:07 +0000

Diagnosis. Supabase's status page can be green while your customer's post-login dashboard renders an empty list. RLS-protected reads that fail because of a misconfigured policy return an empty array with HTTP 200, which a status-code probe cannot see. A Supabase Auth session can pass its first check, then fail the next one with "Invalid Refresh Token" because refresh tokens are single-use. An Edge Function that boots in 3 ms on a warm isolate can spike into a multi-second tail on the cold path. Realtime can keep its websocket open and stop delivering events. None of those show up on a 200 OK against your app URL. Velprove's free browser login monitor signs into the actual app your users open, the one that reads through Supabase RLS, and a multi-step API monitor re-authenticates against /auth/v1/token every check and asserts a known row comes back, not just a 200.

Your Supabase project returns 200 OK. Your RLS reads return [].

PostgREST is the HTTP layer that fronts every Supabase table. When you query it with a publishable key and a user JWT, it consults the row-level security policies on the target table and returns only the rows the policies allow. If the policies are missing or misconfigured, PostgREST does not return a 403 or a 500. It returns an empty JSON array and a 200 status code. A monitor that asserts status_code = 200 sees a happy response. Your customer sees a blank screen.

From Supabase's Row Level Security docs, the behavior is documented verbatim:

"Once you have enabled RLS, no data will be accessible via the API when using a publishable key, until you create policies." *

And, on the default scope:

"RLS must always be enabled on any tables stored in an exposed schema. By default, this is the public schema." *

Read those sentences together. RLS is supposed to be on. With RLS on and no policy, the API returns nothing. The failure mode is not an HTTP error code. The failure mode is an empty array dressed as a success. This is the load-bearing reason a status-code probe is not enough for a Supabase-backed app: the platform's primary authorization layer fails by returning success. The same family of silent-success failures shows up across cloud platforms in our silent outages HTTP misses writeup; the Supabase version is just the cleanest case.

A migration that drops a policy, a CI step that disables RLS on a table by accident, a refactor that renames the column the policy references: all three produce the same outside-visible shape. 200 OK, empty array, customer dashboard blank. The rest of this post is what to assert on top of the 200 so the empty array trips an alert.
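The simplest version of "assert on top of the 200" is to refuse an empty result. This sketch assumes a canary table that should always return at least one row to the test user; `rls_read_ok` is a hypothetical name, and the marker-based assertions described later in the post are the stronger form.

```python
import json


def rls_read_ok(status_code: int, body: str) -> bool:
    """Fail on the 'empty array dressed as a success' case."""
    if status_code != 200:
        return False
    try:
        rows = json.loads(body)
    except json.JSONDecodeError:
        return False
    # A dropped or misconfigured policy shows up as [] with a 200.
    return isinstance(rows, list) and len(rows) > 0
```

`rls_read_ok(200, "[]")` is false even though the status is a success, so the empty array now trips the alert instead of hiding behind the 200.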

Feb 12 2026: 3h 42m of us-east-2 dark, every surface at once

On Thursday February 12 2026 at 21:12 UTC, Supabase deployed a new internal monitoring service that took every Supabase project in an entire AWS region offline. The post-mortem, signed by CEO Paul Copplestone, names the change as the root cause:

"We deployed a new internal monitoring service on February 12th that inadvertently enabled AWS's VPC Block Public Access feature at the regional level in us-east-2. This blocked all internet gateway traffic across every VPC in the region." *

The blast radius was the whole region. Verbatim, again:

"All Supabase customers with projects hosted in the us-east-2 region were affected." *

The incident started at 21:12 UTC and resolved at 00:54 UTC the next morning. 3 hours and 42 minutes. And the affected-surface list is the punchline for this post:

"Postgres databases, Auth, Data APIs, Edge Functions, Storage, Realtime, and any other Supabase service in that region" *

That list is every Supabase primitive a customer's application reads through. A status-code probe on your own app URL during those 3h 42m would have caught the most visible surface, your web app erroring on its first database read, but it would not have told you which Supabase primitive was failing, whether Auth was issuing tokens, whether Edge Functions were accepting invocations, or whether Realtime channels were open. Each of those needs its own assertion. The pattern is the same pattern we apply to what /healthz should return on any backed-by-something API: assert on the read, not on the handler.

One incident does not justify a monitoring strategy by itself. The Feb 12 2026 outage is the recent reminder that a BaaS-backed app has a vendor surface area that lives outside your control, and the vendor's own status page is not a substitute for probing your own customer-visible path. The next four sections are the assertions, one per Supabase primitive.

Auth + RLS read in one multi-step (the core wedge)

A multi-step API monitor lets you chain two HTTP calls in sequence and extract a value from the first response into the second request. For a Supabase-backed app, the chain that matters is: get a real user's access token, then use it to read a row the user is supposed to be able to read. If either step fails, your customers cannot use the app. If both steps pass, the auth-and-authorization path is verified from outside, end to end, on every probe interval. The general mechanic is in the multi-step mechanic reference; this section is the Supabase-specific shape.

Step 1 hits the auth endpoint with a dedicated test user's credentials. Step 2 reads from a table protected by RLS, using the access token captured in step 1, and asserts a known row marker appears in the response body. Re-authenticate fresh on every check; do not try to cache a refresh token, because Supabase refresh tokens are single-use (see FAQ #5 for the full reason). The 1-hour default access-token lifetime gives you all the headroom a 5-minute monitor interval needs.

The shape is small enough to describe in a paragraph. Step 1: POST /auth/v1/token?grant_type=password with the publishable key in the apikey header, and the test user's email and password in the body. Assert status_code = 200, then assert that $.access_token exists in the JSON response. Capture $.access_token for the next step. Step 2: GET /rest/v1/canary_rows with the captured token as Authorization: Bearer and the same apikey header. Three assertions: status_code = 200, body_contains the user-A-canary marker, and body_not_contains the user-B-canary marker. Screenshot 1 shows the wizard with the chain built end-to-end.

A few details earn their place. The apikey header carries the publishable key on both steps; PostgREST requires it on every request and the Auth endpoint requires it on the token call. The Authorization: Bearer header on step 2 carries the user JWT extracted from step 1. The body_not_contains assertion is the RLS-enforcement half of the wedge: if a policy got dropped and the read now returns rows owned by user B, the assertion fails. The two body assertions together verify both that RLS lets the right rows through AND that RLS blocks the wrong rows.
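For readers who think in code, the same two-step chain can be sketched in Python. The endpoint paths, header names, and marker strings come from the description above; the function name, the `transport` callable, and the JSON body keys `email` and `password` are assumptions for illustration. The transport is injected so the sketch runs without a network; a real one would wrap an HTTP client.

```python
import json
from typing import Callable, Dict, Optional, Tuple

# transport(method, url, headers, body) -> (status_code, response_text)
Transport = Callable[[str, str, Dict[str, str], Optional[bytes]], Tuple[int, str]]


def auth_then_read(base_url: str, publishable_key: str, email: str, password: str,
                   own_marker: str, other_marker: str,
                   transport: Transport) -> Tuple[bool, str]:
    """Authenticate fresh, then read an RLS-protected table with the token."""
    # Step 1: password grant against the Auth endpoint.
    status, text = transport(
        "POST",
        f"{base_url}/auth/v1/token?grant_type=password",
        {"apikey": publishable_key, "Content-Type": "application/json"},
        json.dumps({"email": email, "password": password}).encode(),
    )
    if status != 200:
        return False, "step 1: status_code != 200"
    try:
        token = json.loads(text).get("access_token")
    except (json.JSONDecodeError, AttributeError):
        token = None
    if not token:
        return False, "step 1: $.access_token missing"

    # Step 2: RLS-protected read using the captured token.
    status, text = transport(
        "GET",
        f"{base_url}/rest/v1/canary_rows",
        {"apikey": publishable_key, "Authorization": f"Bearer {token}"},
        None,
    )
    if status != 200:
        return False, "step 2: status_code != 200"
    if own_marker not in text:
        return False, "step 2: own marker missing"
    if other_marker in text:
        return False, "step 2: other user's marker present"
    return True, "ok"
```

The token lives only inside one call and is never stored between checks, which mirrors the re-authenticate-every-time advice and sidesteps the single-use refresh token entirely.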

Velprove's multi-step monitor runs each step once in sequence with the same 6 assertion types HTTP monitors use: status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. No conditional branching, no per-step retry, no wait-for-condition. That is enough to express the Supabase auth-then-read pattern cleanly, and the simplicity is what keeps the monitor deterministic across thousands of runs.

The test user posture matters. Provision a dedicated monitoring user with read-only access to a small canary table whose rows carry a stable marker string. Do not point this monitor at a real customer's account, do not give the user write permission, and rotate the password from your monitor's secret vault on the same cadence as your other service credentials. Two test users (one for body_contains, one to source the body_not_contains marker) are the discipline that turns this monitor from a liveness check into an RLS-enforcement check.

Edge Functions: cold start as a response-time tail, not a binary

Supabase Edge Functions run on V8 isolates inside Deno, with code packaged in ESZip format for fast boot. The architecture puts cold starts in the single-digit millisecond range under normal conditions; warm invocations return in roughly the same time as the function's own work. The Supabase team shipped a 2025 fix that moved workers performing initial script evaluation onto a dedicated blocking pool, which measurably reduced boot-time spikes in the long tail.

That is the right shape to monitor as a response_time_ms assertion rather than a binary up/down signal. Cold starts happen, they recover quickly, and a single slow cold start is not an outage. A sustained shift in the p95 tail is. Set the threshold at 1.5x to 2x your warm p95 measured over a real traffic window, not at a round number pulled from the docs.
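Deriving that threshold from your own measurements is a few lines of arithmetic. The sketch below uses the nearest-rank definition of the 95th percentile, which is one of several common definitions, and the function names are hypothetical; the input is assumed to be warm response times in milliseconds collected over a real traffic window.

```python
from typing import Sequence, Tuple


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def threshold_range_ms(warm_samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Return the (1.5x, 2x) band around the warm p95."""
    base = p95(warm_samples_ms)
    return 1.5 * base, 2.0 * base
```

For example, a warm p95 of 95 ms gives a band of 142.5 to 190 ms; where you set the assertion inside that band is a judgment call about how much tail noise you are willing to be alerted on.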

A standalone HTTP monitor against the function URL is the right primitive. The function exposes a public HTTPS endpoint, and a Velprove HTTP monitor can probe it from any of the 5 global regions. Configure three success conditions: status_code = 200, response_time_ms under your warm-p95 threshold (1500 ms is a reasonable starting point for most Edge Function workloads), and body_contains a known string the function emits on the happy path. Screenshot 2 shows the Success Conditions step with all three assertions stacked.

The body_contains assertion is the part most Edge Function monitors skip and shouldn't. A 200 OK from a function that silently swapped its handler (a deploy that shipped the wrong build, an env var that flipped a feature flag) is still a 200. Asserting on a static string the function's real code path emits turns the same probe into a deploy-correctness check.

One trap to avoid: do not invoke functions that mutate state on the monitor path. The monitor runs on the configured interval from every region the monitor is configured in, forever. A function that writes a row on each call will accumulate hundreds of thousands to millions of rows over a year, depending on the interval and the number of regions. Use a read-only Edge Function path for the canary, or pass a request flag your function honors as a dry-run.
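How quickly those rows pile up depends on the interval and the number of regions, and the arithmetic is worth doing once. This sketch assumes one row written per invocation and a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def invocations_per_year(interval_minutes: float, regions: int) -> int:
    """Monitor invocations in a year across all configured regions."""
    if interval_minutes <= 0 or regions < 1:
        raise ValueError("interval must be > 0 and regions must be >= 1")
    checks_per_region = MINUTES_PER_YEAR / interval_minutes
    return int(checks_per_region * regions)
```

A 5-minute interval from 5 regions works out to 525,600 invocations a year, and a 1-minute interval from the same 5 regions to 2,628,000, so a mutating canary grows a table faster than most people expect.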

Browser-login on YOUR signed-in surface (not Supabase Studio)

A browser login monitor opens a real browser, navigates to a login page, fills the form with a test user's credentials, waits for the post-login route, and asserts that the page rendered the data it was supposed to render. For a Supabase-backed app, the page that renders post-login is the one that reads through RLS. If RLS is broken or the Data API is down, the page returns 200, renders the chrome, and shows an empty state. The browser login monitor catches that, because it asserts on the data, not on the response code. The general pattern is in our browser login monitor on your signed-in surface guide; the Supabase-specific detail is below.

Hard rule: never point this monitor at supabase.com, the Supabase dashboard, or Supabase Studio. The target is the customer's own application: the URL your real users open to sign in. Supabase's own UIs sit behind device verification, captchas, and account-level protections the monitor cannot complete. The monitor's job is to verify the path your customer takes through your app, which happens to be backed by Supabase Auth and the Supabase Data APIs.

The Supabase-distinguishing detail in the assertion: set the monitor's post-login success check to a string that only renders after the dashboard's first RLS-protected read completes. A customer name pulled from the profiles table. An invoice ID from the invoices table. The label of a plan the user is actually on. If the RLS policy on that table drops, the page loads but the string never renders, and the monitor fails. For a stronger signal, use selector_visible on the DOM element that holds that value instead of a text match. That catches the case where the page renders cached chrome but the user's data layer underneath has gone empty.

The free plan covers 1 browser login monitor at a 15-minute interval. That is enough for the production signed-in path of a single application. Paid plans add more browser monitors and tighter intervals. The browser monitor is the one assertion in this post that catches a class of failures the API-only monitors cannot: an authenticated page that renders a 200 but is functionally broken because the data layer underneath it returned nothing.

Realtime: probe a customer-side /realtime-health endpoint

Supabase Realtime runs on Elixir with the Phoenix Framework and delivers three primitives over WebSockets: Broadcast, Presence, and Postgres Changes. Postgres Changes adheres to RLS policies on the tables you subscribe to. The whole thing is fast and well-engineered. The whole thing also has no probe surface Velprove can directly assert on, because we have no websocket primitive in any of our monitor types (http, api, multi_step, browser).

The right response is to push the freshness window into the customer's own infrastructure. Stand up a tiny server-side process that subscribes to the channel you care about, records the timestamp of the last event it received, and exposes an HTTP endpoint that returns 200 or 503 based on how stale that timestamp is. The endpoint owns the freshness logic, and Velprove asserts status_code = 200 on it from outside. This is the same /healthz pattern for compound dependencies: compute the verdict server-side, expose a binary endpoint, probe the binary.

A minimal implementation in Node with the supabase-js client:
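One way to sketch that shape, under stated assumptions: the freshness verdict is a pure function, and the supabase-js v2 subscription and Node http wiring appear only as comments. The channel name, table name, port, and five-minute tolerance are hypothetical placeholders.

```javascript
// Hypothetical tolerance; tune it to the channel's real event cadence.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Server-local timestamp, updated each time the subscriber receives an event.
let lastEventAtMs = null;

function recordEvent(nowMs = Date.now()) {
  lastEventAtMs = nowMs;
}

// Verdict for /realtime-health: 200 while the last event is recent, 503 otherwise.
function realtimeHealth(nowMs = Date.now(), staleAfterMs = STALE_AFTER_MS) {
  if (lastEventAtMs === null) {
    return { status: 503, body: "realtime: no events yet" };
  }
  const ageMs = nowMs - lastEventAtMs;
  return ageMs <= staleAfterMs
    ? { status: 200, body: `realtime: ok (${ageMs} ms since last event)` }
    : { status: 503, body: `realtime: stale (${ageMs} ms since last event)` };
}

// Wiring sketch (not executed here), assuming supabase-js v2 and node:http:
//   supabase
//     .channel("canary")
//     .on("postgres_changes", { event: "*", schema: "public", table: "orders" },
//         () => recordEvent())
//     .subscribe();
//   http.createServer((req, res) => {
//     const { status, body } = realtimeHealth();
//     res.writeHead(status, { "Content-Type": "text/plain" });
//     res.end(body);
//   }).listen(3000);
```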

A few notes on the shape. The endpoint computes the verdict on each request from a server-local timestamp; nothing about the request itself drives the calculation. The Velprove monitor that probes it is the simplest HTTP monitor in this whole post: a GET against the /realtime-health URL on a 60-second interval with a single status_code = 200 assertion. No body assertion, no response-time threshold; the customer endpoint already encoded all of those concerns in its 200-versus-503 return.

That is it. The endpoint owns "what counts as stale," Velprove owns "tell me when it isn't 200," and the customer's real Realtime delivery path is being exercised continuously by the server-side subscriber. If the channel falls silent, the timestamp stops advancing, the endpoint flips to 503, and the next probe pages you.

Two disciplines matter. First, set the staleness tolerance to the real-world cadence of events on the channel: a 5-minute window on a channel that fires every few seconds catches a real silence quickly; a 5-minute window on a channel that fires once an hour will false-positive constantly. Second, the subscriber process needs its own uptime story; if it crashes the timestamp also stops advancing, which is correct alerting behavior but means you should keep the subscriber simple and run it as part of your normal application deployment, not as a one-off script.

When this post is the wrong one

Supabase is one BaaS, and this post is scoped to that surface. If you got here looking for something else, three sibling posts probably fit better.

If your question is platform-shaped, not BaaS-shaped. A Supabase-backed app still has a host that serves its frontend and a runtime that serves its backend. For platform-side monitoring of those hosts, the per-platform guides are Vercel, Render, Railway, Cloudflare Workers + Pages, and Heroku. Each covers the platform's own failure modes (cold starts, release-phase failures, Eco dyno sleep, regional outages) which are orthogonal to Supabase's.

If your question is about choosing between a browser monitor and an HTTP monitor. The general rule of thumb is in the browser vs HTTP decision tree. The short version for Supabase: use an HTTP or multi-step monitor for the API surface, and use a browser login monitor for the customer-facing signed-in path that reads through RLS. Both, not either.

If you are not sure which Velprove plan covers this. The four-monitor Supabase set in this post (multi-step auth+RLS, HTTP Edge Function, HTTP Realtime freshness, browser login) fits inside the free plan: 10 monitors total, 1 browser login monitor, multi-step up to 3 steps. If you need more browser monitors, multi-step chains longer than 3 steps, Slack/Discord/Teams/Webhooks delivery (Starter), or PagerDuty (Pro), see which Velprove plan fits your shape.

Frequently Asked Questions

How do I assert that RLS is actually enforced and not silently disabled?

Provision two low-privilege test users in your Supabase project, A and B, with disjoint row ownership: a row only A can read carries a marker string user-A-canary, and a row only B can read carries user-B-canary. Run the Auth + RLS multi-step monitor as user A. On the RLS read step, assert two conditions in this order: body_contains the user-A-canary marker, and body_not_contains the user-B-canary marker. If RLS is enforced, both pass. If RLS was disabled on the table (or the policy got dropped during a migration), the read returns both rows, the second assertion fails, and Velprove pages you. Velprove cannot tell you that the policy is misconfigured, only that the expected row scope changed. That symptom-not-cause framing is enough to put a human on the database within minutes.
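The two assertions on the RLS read step can be expressed as a small check, shown here as a sketch of the logic rather than of Velprove's internals. The function name and the reason strings are illustrative; the marker strings are the ones from the setup above.

```javascript
// Marker strings from the two-user setup described above.
const OWN_MARKER = "user-A-canary";
const FOREIGN_MARKER = "user-B-canary";

// Mirrors the two assertions in order:
// body_contains OWN_MARKER, then body_not_contains FOREIGN_MARKER.
function checkRlsScope(responseBody) {
  if (!responseBody.includes(OWN_MARKER)) {
    return { ok: false, reason: "own row missing: read failed or policy too strict" };
  }
  if (responseBody.includes(FOREIGN_MARKER)) {
    return { ok: false, reason: "foreign row visible: RLS may be disabled or dropped" };
  }
  return { ok: true, reason: "row scope as expected" };
}
```

The order matters for diagnosis: a missing own-row marker points at the read path or an over-tight policy, while a visible foreign marker points at enforcement being off.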

Can Velprove monitor a Supabase Edge Function cold start?

Yes. Create a Velprove HTTP monitor against your function URL and add a response_time_ms assertion at roughly 1.5x to 2x your warm p95. Edge Functions run on V8 isolates with ESZip cold starts in the single-digit-millisecond range under normal conditions, but boot-time spikes still happen on first invocation after idle. Setting the threshold above warm p95 catches the long tail without paging on every warm request.

What if my Supabase Realtime channel stops delivering events?

Velprove has no websocket primitive, so the realtime channel itself is not directly probeable. Move the freshness window into your own infrastructure: expose a /realtime-health endpoint that subscribes to the channel server-side, records the timestamp of the last delivered event, and returns 200 when the gap is below your tolerance or 503 when it exceeds it. A Velprove HTTP monitor asserts status_code = 200 on that endpoint on your normal interval. The endpoint owns the freshness logic, and Velprove owns the alerting and the global probe origins.

Should the multi-step monitor use the service-role key or the anon/publishable key?

The publishable (anon) key plus a real test-user JWT obtained at the first step. The service-role key bypasses RLS by design. A monitor authenticated with the service-role key will return rows whether or not the policy enforces correctly, so the entire RLS-enforcement wedge collapses to noise. Use the publishable key in the apikey header and the test user's access_token in the Authorization: Bearer header. That is the same posture your real customer's browser uses, which is the posture you want to be monitoring.

Why does my Supabase Auth multi-step fail every other check with "Invalid Refresh Token"?

Supabase refresh tokens are single-use. From the Sessions docs: a refresh token can only be used once to exchange for a new access-and-refresh-token pair. If your monitor caches a refresh token between checks and tries to refresh on the second run, the first refresh consumed the token and the second call gets "Invalid Refresh Token." The fix is to not refresh at all. Call POST /auth/v1/token?grant_type=password fresh on every check, get a brand new access token, and discard everything when the check completes. A 5-minute monitor interval against a 1-hour access-token lifetime means you never approach expiry anyway, and the refresh-token consumption problem stops existing.
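As a sketch of what each check sends, the two helpers below build the password-grant request and the headers for the follow-up read. The project URL, key, and credentials are placeholders, and the helper names are hypothetical; nothing is cached between calls.

```javascript
// Builds the request for POST /auth/v1/token?grant_type=password.
// projectUrl, publishableKey, email, and password are placeholders for your own values.
function buildPasswordGrantRequest(projectUrl, publishableKey, email, password) {
  return {
    method: "POST",
    url: `${projectUrl}/auth/v1/token?grant_type=password`,
    headers: { apikey: publishableKey, "Content-Type": "application/json" },
    body: JSON.stringify({ email, password }),
  };
}

// Headers for the follow-up RLS read: publishable key plus the fresh user JWT.
function buildAuthedHeaders(publishableKey, accessToken) {
  return { apikey: publishableKey, Authorization: `Bearer ${accessToken}` };
}
```

Because the request is rebuilt from the stored credentials on every run, there is no refresh token to consume and nothing to invalidate.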

Can a Browser Monitor Sign In With OAuth, SSO, or a Passkey?

velprove — Wed, 27 May 2026 14:00:02 +0000

Short answer: No. A form-fill browser login monitor cannot complete an OAuth redirect, a SAML or OIDC SSO bounce, a passkey ceremony, an MFA challenge, or a magic link. Stop trying to make it. Monitor the identity provider's token endpoint with a multi-step API monitor, assert one protected route with the access token, and expose a small canary route in your own app that an HTTP monitor can poll. That combination catches Entra and Okta outages in under five minutes. If your login actually is email and password on your own domain, the browser login monitor is the right tool and we already wrote that post: see how to monitor a SaaS login that IS email and password. This post is the sibling for everything else. Built for SaaS application monitoring teams whose login is anything but a single form. Start for free.

Microsoft 365 lost SSO and MFA on October 8, 2025. Status-only monitors stayed green.

On October 8, 2025, Microsoft posted incident MO1168102 against Microsoft 365. The impact list named Microsoft Teams, Exchange Online, the Microsoft 365 admin center, Microsoft Entra SSO authentication, and MFA. Microsoft's own root-cause language was "a portion of directory operations infrastructure which became imbalanced during a period of high traffic volume and caused authorization failures" (BleepingComputer coverage, MO1168102). Translation: the part of Entra that issues authorization decisions stopped issuing them, while everything else looked fine from the outside.

Status-only monitors pointed at portal.office.com kept returning 200 OK while the IdP path was failing. The portal HTML rendered. The marketing chrome loaded. The CSS came back. Real users hit sign-in and got error pages, because Entra refused to issue a token they could use. Every team whose monitoring stopped at "the host answered" learned the same lesson at the same time. This is the same defect class we walk through in five outage classes standard monitoring misses.

The honest defense against an IdP-flavored outage is not a better form-fill primitive. It is monitoring the identity provider as a third-party dependency you do not own (we cover the broader pattern in treat your IdP as a third-party dependency), plus a canary route in your app that says out loud whether authentication just round-tripped. Both are cheap. Both are unambiguous. Neither is a form-fill browser monitor.

What a browser login monitor actually drives

Velprove's browser login monitor is a form-fill primitive. Every browser login monitor has exactly one loginUrl, exactly two credentials stored in check_secrets (BROWSER_USERNAME and BROWSER_PASSWORD), three optional CSS selectors for the username field, the password field, and the submit button (auto-detected when blank), and one successIndicator chosen from url_pattern, text_present, or selector_visible. That is the entire shape.

Every check loads the URL, locates the two fields, types the credentials, clicks submit, waits for navigation, and checks the success indicator against the rendered DOM. That is exactly what a real customer signing in with email and password looks like, which is why it works for that case. It is also why it fails for everything else: there is no second URL, no token capture, no third-party redirect handling, no second factor input, no authenticator integration, no inbox reader.

Velprove ships four monitor types in total: Browser login, HTTP, API, and Multi-step API. When the login flow is not a single form, the work moves out of the browser primitive and into the multi-step API plus HTTP primitives. The rest of this post is the recipe for that move.

Five login patterns a form-fill browser monitor cannot drive

Five common login patterns push the work outside what a form-fill primitive can do. The honest answer in each case is the same: do not bend the browser monitor into the wrong shape. Use the right primitive for the actual auth flow.

Pattern 1: OAuth redirect (authorization code flow)

OAuth's authorization code flow (OAuth 2.0 spec) bounces the user to a third-party authorization server, asks for consent, redirects back with a code, and exchanges the code for an access token at the token endpoint. The browser login monitor has one loginUrl; it cannot follow the bounce to the provider, click the consent button, and ride the redirect back with the code in the query string.

Right primitive: a multi-step API monitor against the token endpoint using client_credentials for service-to-service cases, or refresh_token for a long-lived refresh token you minted out of band for the synthetic test identity. The browser chrome is not what is interesting; the token round trip is.

Pattern 2: SSO via SAML or OIDC IdP

SAML 2.0 (SAML 2.0) bounces the user from the service provider to the identity provider, posts a signed assertion back, and ends with a session in the SP. OpenID Connect (OpenID Connect) is the modern equivalent over OAuth. Neither survives a single form fill. SAML ECP (Enhanced Client or Proxy) exists, but it requires the IdP to enable an ECP profile, which most enterprise deployments deliberately disable. Do not assume ECP. Do not pitch ECP to a customer who has not turned it on.

Right primitive: OIDC client_credentials against the IdP's token endpoint, plus a canary route in your app that proves the SP-side session check still works.

Pattern 3: Passkey or WebAuthn

WebAuthn (W3C WebAuthn Level 2) requires a real authenticator: a hardware security key, a phone biometric, or a platform credential signed by the OS. Chrome DevTools and Playwright support a virtual authenticator for testing an app you control. Velprove's browser login monitor primitive does not wire up a virtual authenticator. Do not assume it does.

Right primitive: monitor the IdP's OIDC discovery endpoint and token endpoint with HTTP and multi-step API monitors. Reserve the browser login monitor for the email-and-password flow on your own login page.

Pattern 4: MFA challenge (TOTP, push, SMS)

Time-based one-time passwords need a shared secret and a clock. A push notification needs a phone and a human. An SMS needs a carrier path Velprove will not touch. None of these fit into the single-form-fill shape. Provision a synthetic test identity with MFA disabled and the lowest scope you can grant. Monitor the IdP and the protected route, not the human ceremony.

Pattern 5: Magic link

Magic-link sign-in posts an email to a server, the server emails a token, the user clicks the token, the server creates a session. Velprove does not read inboxes. The honest workaround is to assert the issuance call returns 200 with a multi-step API monitor, then assert a protected-route response using a token your app generates server-side for the synthetic test identity. If the link generator falls over, the issuance call fails. If the session backend falls over, the protected route fails. Two signals, no inbox required.

The workaround pattern: monitor the auth API, then assert the canary route

Seven steps inside this section, plus an eighth in the conclusion. About twenty minutes of work the first time, five minutes for each subsequent app. The HowTo schema embedded in this page mirrors the steps below verbatim.

Pick the right primitive: browser, multi-step API, or HTTP

Velprove has four monitor types: Browser login, HTTP, API, and Multi-step API. Pick the browser login monitor only when your sign-in page is a real email-and-password form on your own domain (see the sibling post for that recipe). For OAuth, SAML, passkey, MFA, or magic-link auth, the browser primitive is the wrong tool. Reach for the multi-step API monitor first, plus an HTTP monitor for the canary route. We unpack the choice in detail in the seven-question decision tree on browser vs HTTP.

Provision a dedicated test identity (no MFA, lowest scope)

Create a synthetic user in your identity provider. Smallest scope you can grant. MFA disabled (because the monitor cannot complete it). No real customer data behind it. No admin permissions. If the credentials leak, the blast radius is one inert account that can read nothing interesting. The safest approach is always a purpose-built test identity, never a real admin or a real customer.

Store credentials in Velprove's encrypted fields

Velprove encrypts the multi-step monitor's request body and headers at rest. Paste the client_id, client_secret, username, password, or refresh_token directly into the body or header fields of the multi-step monitor; the worker decrypts the request server-side before issuing it to your IdP. Browser login monitors use a dedicated check_secrets store keyed by name (BROWSER_USERNAME and BROWSER_PASSWORD) for the same encryption guarantee with a different shape. Plaintext secrets never sit in the database.

Build the multi-step API monitor against the token-grant endpoint

Step 1 of the multi-step monitor POSTs to your IdP's token endpoint with the grant your test identity is configured for. For service-to-service, the right grant is client_credentials. For an offline-issued user token, use refresh_token. ROPC is the last resort, and we unpack why in the section on OAuth-protected APIs below. Assert status_code is 200 and json_path on $.access_token is present. This is exactly the pattern we unpack in multi-step API monitoring with a token-grant first step. Velprove's multi-step API monitor exposes exactly six assertion types: status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. Each step runs once. There is no polling, no retry-until, no wait-for-condition, no "within N seconds" freshness primitive. Snapshot per interval, asserted against the response body the IdP just sent.
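A minimal sketch of step 1, assuming an IdP that accepts client credentials in a form-encoded body (some require HTTP Basic auth instead, and some require a scope or audience parameter). The function names are hypothetical; the check mirrors the status_code and json_path assertions.

```javascript
// Form-encoded body for an OAuth 2.0 client_credentials token request.
// scope is optional; include it only if your IdP requires one.
function buildClientCredentialsBody(clientId, clientSecret, scope) {
  const params = new URLSearchParams({
    grant_type: "client_credentials",
    client_id: clientId,
    client_secret: clientSecret,
  });
  if (scope) params.set("scope", scope);
  return params.toString();
}

// Step 1 assertions: status_code is 200 and $.access_token is present.
function checkTokenResponse(statusCode, bodyText) {
  if (statusCode !== 200) return { ok: false, reason: `status ${statusCode}` };
  let parsed;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return { ok: false, reason: "body is not JSON" };
  }
  const token = parsed && parsed.access_token;
  return typeof token === "string" && token.length > 0
    ? { ok: true, accessToken: token }
    : { ok: false, reason: "access_token missing" };
}
```

Send the body with Content-Type: application/x-www-form-urlencoded, which is what token endpoints expect for this grant.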

Extract the access token and call one protected route

Step 2 of the multi-step monitor sends the access token captured in step 1 as a Bearer token in the Authorization header, and calls one protected API route that exercises real authentication. Assert status_code is 200 and body_contains a value only an authenticated caller can see (your synthetic identity's email, a known role name, an account ID). This is the assertion that proves the IdP issued a token your API actually accepted, not just that the IdP returned a JSON blob. This pattern works equally well for API uptime monitoring for OAuth-protected endpoints where the API itself is the surface you care about.
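Step 2 reduces to one header and two assertions. The sketch below uses hypothetical function names, and the marker is whatever value only your synthetic identity can see.

```javascript
// Headers for the protected-route call, using the token captured in step 1.
function buildProtectedRouteHeaders(accessToken) {
  return { Authorization: `Bearer ${accessToken}`, Accept: "application/json" };
}

// Step 2 assertions: status_code is 200 and body_contains the expected marker.
function checkProtectedRoute(statusCode, bodyText, expectedMarker) {
  if (statusCode !== 200) return { ok: false, reason: `status ${statusCode}` };
  if (!bodyText.includes(expectedMarker)) {
    return { ok: false, reason: "marker not in body" };
  }
  return { ok: true, reason: "authenticated read succeeded" };
}
```

A 200 without the marker is the interesting failure: the API answered, but not as the identity the token was supposed to represent.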

Ship a /healthz/authed canary route in your own app

The multi-step monitor proves the IdP is up and the token round trip works from outside. The canary route proves the same thing from inside your own app, which is the perspective that matches your real customers' experience. Add an unauthenticated endpoint, conventionally /healthz/authed, that does the real work server-side: take a server-held service-account credential, exchange it for a token against the IdP, call one internal protected handler, and return 200 if the round trip succeeded or 503 if any step failed. The route is unauthenticated from the outside (no secrets sent over the wire) but its 200 is a real signal. We unpack the broader pattern in expose an unauthenticated /healthz that proxies the auth state.
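The handler logic can be sketched independently of any web framework. In the sketch below, fetchToken and callProtected are hypothetical async functions your app would supply (the token exchange and the internal protected call); the response bodies are illustrative.

```javascript
// Maps the outcome of the server-side round trip to the canary's HTTP response.
function canaryResponse(outcome) {
  if (outcome.tokenIssued && outcome.protectedCallOk) {
    return { status: 200, body: "auth: ok" };
  }
  const failedAt = outcome.tokenIssued ? "protected-call" : "token-exchange";
  return { status: 503, body: `auth: failed at ${failedAt}` };
}

// Runs the round trip. Any thrown error counts as a failure of the step that threw.
async function runAuthedCanary(fetchToken, callProtected) {
  const outcome = { tokenIssued: false, protectedCallOk: false };
  try {
    const token = await fetchToken();
    outcome.tokenIssued = Boolean(token);
    if (outcome.tokenIssued) {
      outcome.protectedCallOk = Boolean(await callProtected(token));
    }
  } catch {
    // Flags stay as they were, so the failing step is the one reported.
  }
  return canaryResponse(outcome);
}
```

Keep the 503 body free of secrets and stack traces; the route is public, so it should say which step failed and nothing more.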

Point a Velprove HTTP monitor at the canary route

Add a Velprove HTTP monitor that GETs /healthz/authed and asserts status_code is 200 plus a body_contains string the route returns when the auth round trip succeeded (the literal string "auth: ok" is plenty; anything stable and non-blank works). This is the second pair of eyes that catches IdP outages your multi-step monitor cannot reach, because the outage is between your app server and the IdP, not between the Velprove worker and the IdP.

If your app uses Clerk: the +clerk_test pattern

If your app uses Clerk, this is the cleanest pattern. Clerk supports a deterministic test-mode flow that fits Velprove's browser login primitive exactly. Per Clerk's testing documentation (test emails and phones): "Any email with the +clerk_test subaddress is a test email address. No emails will be sent, and they can be verified with the code 424242."

Provision one synthetic test user with an email like monitor+clerk_test@yourapp.com. Point a Velprove browser login monitor at your Clerk-hosted sign-in URL. The username field gets the +clerk_test email; the password field gets the fixed verification code 424242. Choose a text_present success indicator that only renders for signed-in users (a known piece of dashboard chrome, the user's name in the navbar, a logout button). The monitor runs every fifteen minutes from the region of your choice on the free plan.

One caveat directly from Clerk's documentation: "Every development instance has test mode enabled by default. If you need to use test mode on a production instance, you can enable it in the Clerk Dashboard. However, this is highly discouraged." Translation: prefer pointing this monitor at a staging environment, not at production. Production stays clean; the staging monitor proves the auth path works end to end. We are documenting the pattern, not shipping a case study; if you want a worked staging-environment example, the Clerk docs page above is the canonical source.

OAuth-protected APIs: client credentials grant is your friend, ROPC is the last resort

Two grants matter for monitoring an OAuth-protected API. Pick the first one you can.

OAuth 2.0 client_credentials grant. Designed for service-to-service auth. No user in the loop. You exchange a client_id and client_secret for an access token. Store both in the multi-step monitor's encrypted body or header fields. The monitor never touches a real user account. This is the right grant for almost every Velprove multi-step API monitor that authenticates against an IdP.

Resource Owner Password Credentials (ROPC) grant. The user's username and password go straight to the token endpoint. Older, simpler, and actively discouraged by every modern IdP. Microsoft's own documentation on the Entra ROPC flow (Microsoft Entra ROPC documentation) is unusually direct:

"Microsoft recommends you do not use the ROPC flow; it's incompatible with multifactor authentication (MFA). In most scenarios, more secure alternatives are available and recommended. This flow requires a very high degree of trust in the application, and carries risks that aren't present in other flows. You should only use this flow when more secure flows aren't viable."

"As MFA becomes more prevalent, some Microsoft web APIs will only accept access tokens if they have passed MFA requirements. Applications and test rigs relying on ROPC will be locked out."

That second paragraph is the one that matters for monitoring. As more Microsoft APIs gate on MFA-enforced tokens, an Entra ROPC monitor will quietly stop working when the API behind it tightens its conditional access policy. Plan for ROPC to be a temporary workaround, not a long-term monitoring strategy on Entra.

Other IdPs land in different places. Amazon Cognito supports username-and-password auth via ADMIN_USER_PASSWORD_AUTH on the admin API (note: the modern flow is ADMIN_USER_PASSWORD_AUTH, not the older ADMIN_NO_SRP_AUTH). Google has no public ROPC-equivalent on its OAuth endpoints; service accounts, which exchange a signed JWT assertion for an access token, are the supported path. Auth0 supports ROPC behind a tenant setting that is off by default. Okta supports the resource-owner password flow on tenants that explicitly enable it. In every case, the recommended order is the same: try client_credentials first, fall back to refresh_token with a long-lived refresh token minted out of band, and only reach for ROPC when both of those are unavailable.
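That preference order is simple enough to encode. The sketch below assumes you know which grants your IdP tenant allows, for example from the grant_types_supported list in its OIDC discovery document; "password" is the grant_type value for ROPC. The function name is hypothetical.

```javascript
// Preferred grants for a monitoring identity, best first.
const GRANT_PREFERENCE = ["client_credentials", "refresh_token", "password"];

// Returns the most preferred grant the IdP supports, or null if none match.
function pickGrant(supportedGrants) {
  return GRANT_PREFERENCE.find((grant) => supportedGrants.includes(grant)) ?? null;
}
```

A null result means none of the three grants is available, which points back to the canary route as the only outside-in signal.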

When to use HTTP or multi-step API instead of a browser login monitor

Three short questions decide it. The honest decision rubric, in order:

Is your sign-in page an email-and-password form on a URL you own? If yes, use a browser login monitor. That is exactly its shape and it will catch the post-200-OK assertion failures status-only checks miss. We covered this case in the sibling post on monitor a SaaS login that IS email and password.

Is your sign-in a redirect to a third-party IdP, a passkey ceremony, an MFA challenge, or a magic link? If yes, the browser login monitor is the wrong primitive. Use a multi-step API monitor against the IdP's token endpoint plus an HTTP monitor against an unauthenticated canary route in your own app.

Is your API the surface you actually care about? If yes, skip the browser layer entirely. A multi-step API monitor that does client_credentials against the IdP and calls one protected route covers the case in two steps.
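Read in order, the three questions amount to a short function. This is only an illustration of the rubric; the field names and return labels are made up for the sketch.

```javascript
// Answers to the three questions, evaluated in the order they are asked.
function choosePrimitive({ ownEmailPasswordForm, thirdPartyOrCeremony, apiIsTheSurface }) {
  if (ownEmailPasswordForm) return ["browser_login"];
  if (thirdPartyOrCeremony) return ["multi_step_api", "http_canary"];
  if (apiIsTheSurface) return ["multi_step_api"];
  return []; // none of the three applied; the full decision tree covers the rest
}
```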

We unpack the full version of this rubric, with the seven branches that matter, in the seven-question decision tree on browser vs HTTP. The short version above is enough for ninety percent of the choices teams actually face.

Monitor the IdP as a third-party dependency

Once you accept that Entra, Okta, Auth0, Clerk, or Google is on the critical path for your sign-in, the right framing is to treat your IdP as a third-party dependency and monitor it the way you monitor any other vendor in your request path. Two endpoints carry most of the signal.

The OpenID Connect discovery endpoint, conventionally /.well-known/openid-configuration, is a public JSON document every OIDC IdP exposes. Velprove can hit it with an HTTP monitor, assert status_code 200 and body_contains on the literal "token_endpoint" field name, and you have a no-credentials smoke test that the IdP is reachable and serving configuration. The token endpoint itself, exercised by the multi-step monitor described above, is the second signal. When Entra incident MO1168102 hit in October 2025, the discovery endpoint kept responding while the token endpoint returned authorization failures. Two monitors, one signal each, no overlap.
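The discovery smoke test can be sketched as the same two assertions plus an optional parse, which also gives you the token endpoint URL for the multi-step monitor. The function name is hypothetical, and the HTTPS check is an extra assumption on top of what the text asserts.

```javascript
// Smoke test for /.well-known/openid-configuration.
function checkDiscoveryDocument(statusCode, bodyText) {
  if (statusCode !== 200) return { ok: false, reason: `status ${statusCode}` };
  // Equivalent of body_contains on the literal field name.
  if (!bodyText.includes('"token_endpoint"')) {
    return { ok: false, reason: "token_endpoint field missing" };
  }
  let doc;
  try {
    doc = JSON.parse(bodyText);
  } catch {
    return { ok: false, reason: "body is not JSON" };
  }
  const endpoint = doc.token_endpoint;
  // Extra assumption: a usable token endpoint is an HTTPS URL.
  return typeof endpoint === "string" && endpoint.startsWith("https://")
    ? { ok: true, tokenEndpoint: endpoint }
    : { ok: false, reason: "token_endpoint is not an HTTPS URL" };
}
```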

Frequently asked questions

Can Velprove's browser login monitor complete an OAuth redirect to Google or Microsoft?

No. Velprove's browser login monitor is a form-fill primitive. It loads one login URL, types a username and a password into two fields, clicks a submit button, and asserts one of three success indicators (a URL pattern, a piece of text, or a CSS selector). It cannot follow an OAuth redirect to a third-party identity provider, handle the consent screen, or complete the authorization code exchange. Use the multi-step API monitor against the IdP's token endpoint instead, plus an unauthenticated canary route in your own app.

Can a browser login monitor sign in via SAML SSO or Okta?

No. SAML SSO bounces the user through the identity provider, posts a signed assertion back to the service provider, and ends with a session in the SP. A form-fill browser monitor with one loginUrl and two credential fields cannot drive that bounce chain. The clean workaround is OIDC client_credentials against the IdP's token endpoint plus a canary route in your app that exercises the SP-side session check.

What about passkey or WebAuthn login?

No. WebAuthn requires a real authenticator (a security key, a phone biometric, or a platform credential). A form-fill browser monitor has no authenticator attached and no way to attach a virtual one in this primitive. Monitor the IdP's discovery and token endpoints with HTTP and multi-step API monitors instead. Reserve browser login monitors for the email-and-password flow on your own login page.

Does Velprove handle TOTP, push, or SMS MFA?

No. The browser login monitor has one login URL, two credentials, and three optional selectors. It cannot read a TOTP code from an authenticator app, approve a push notification on a phone, or receive an SMS. Use a dedicated synthetic test identity with MFA disabled and the lowest scope you can grant. Monitor the IdP and the protected route, not the human MFA ceremony.

How do I monitor a Clerk-protected app?

If your app uses Clerk, the cleanest pattern is the +clerk_test subaddress in development mode. Clerk's documentation states that any email with the +clerk_test subaddress is a test email address, no emails are sent, and they can be verified with the code 424242. Create a synthetic test user with a +clerk_test email, point a Velprove browser login monitor at your Clerk-hosted sign-in page, fill the email and the fixed 424242 verification code, and assert a post-login element. Clerk also notes that test mode is highly discouraged on production instances, so prefer running this pattern against a staging environment.

What about magic-link login?

No. Magic-link auth requires reading an email inbox to extract the token, which Velprove does not do. The workaround is the same as for SSO: monitor the magic-link issuance endpoint with a multi-step API monitor that asserts the issuance call returns 200, then assert the protected-route response with a token your app generates server-side for the synthetic test identity. If the magic-link generator falls over, the issuance call fails. If the session backend falls over, the protected route fails.

Is the canary-route workaround a real monitor or a hack?

It is a real monitor and it is also a pattern your own application code controls. A /healthz/authed route exposed by your app that round-trips against your IdP using a server-side service account is a defensible signal: if the route returns 200 the auth backend is reachable and a token round trip just worked; if the route returns 503 something on the auth path broke. Velprove polls the route with an HTTP monitor and asserts the body. The route does the real auth work in your own infrastructure where you can keep secrets and run logic.

How do I monitor an OAuth-protected API without exposing user credentials?

Use the OAuth 2.0 client_credentials grant if your API supports it. The grant is designed for service-to-service auth: you exchange a client_id and client_secret for an access token with no user in the loop. Store both in the multi-step monitor's encrypted request fields, build a multi-step API monitor where step 1 fetches the token and step 2 calls a protected route with the Bearer header, and you have a monitor that holds no user credentials and never touches a real user account. If client_credentials is not available, the Resource Owner Password Credentials grant works but Microsoft and most modern IdPs actively discourage it.

Where this connects

Login is the single most expensive blind spot in commercial monitoring, and it splits cleanly in two. If your sign-in is a real email-and-password form on a URL you own, the browser login monitor is the right primitive and we wrote the recipe in monitor a SaaS login that IS email and password. If your sign-in is anything else (OAuth, SAML SSO, OIDC, passkey, MFA, magic link), the browser primitive is the wrong tool and the workaround above is the right one: multi-step API monitor against the IdP token endpoint, HTTP monitor against an unauthenticated canary route in your own app, both running as a synthetic test identity with nothing of value behind it.

Wire the alert to the channel that wakes the right person

Velprove ships email alerts on every plan including free. Slack, Discord, Teams, and outbound webhooks unlock on Starter. PagerDuty unlocks on Pro. Pick the channel the on-call actually reads. An alert that lands in a muted Slack channel is worse than no alert, because it teaches the team to trust the silence.

Free plan, your choice of five regions, browser login monitor every fifteen minutes, multi-step API and HTTP monitors at five-minute minimums, commercial use explicitly allowed, no credit card. Start for free. Monitor the login your customers actually use. If your stack is SaaS-shaped, the SaaS application monitoring page is the right next read; if the surface you care about is an OAuth-protected API, API uptime monitoring for OAuth-protected endpoints covers the same pattern from the API side.

WHMCS Does Not Retry Failed Provisioning. Here Is How to Catch the Silent Order Chain.

velprove — Tue, 26 May 2026 14:00:05 +0000

The mechanic: WHMCS does not automatically retry failed module actions. When the upstream cPanel, Plesk, or SolusVM module returns an error during account creation, WHMCS quietly drops the failure into its Module Queue and waits for you to notice. The customer paid the invoice. The order shows Active. The portal login works. The hosting account does not exist. A browser login monitor on clientarea.php (the WHMCS portal monitor we covered previously) by design will not catch this. What does catch it: a single API monitor that hits GetModuleQueue and asserts the queue is empty, plus an optional 3-step API chain that simulates the full order, accept-order, and provisioning verification path. Both recipes fit the Velprove free plan.

WHMCS does not retry failed provisioning. Your customer finds out before you do.

If you have already set up the browser login monitor on your WHMCS client portal, this post covers what that monitor cannot see by design. The browser login monitor confirms clientarea.php renders and accepts credentials. It tells you nothing about whether a customer who just placed an order has a working hosting account waiting for them.

That gap is where WHMCS silent failures live. The order chain (AddOrder, then AcceptOrder, then the underlying provisioning module call against your cPanel, Plesk, or SolusVM box) runs after the login succeeds. When the upstream module fails, WHMCS does not retry. It drops the failure into the Module Queue and waits for you to look at the admin dashboard. The customer finds out before you do, usually by email, usually after they have already tried to use the service that does not exist.

The WHMCS docs are explicit about this. From the official Module Queue troubleshooting page: “WHMCS will not automatically retry a failed action. You must click Retry to attempt the failed action again.” The retry is a manual button click inside the admin UI. If no human opens that admin UI, the failed order sits in the queue. That is the silent-outage shape this post is about. The broader taxonomy of failures that look green on the dashboard lives in the silent-outage taxonomy.

The Module Queue is the silent-failure inspection point.

The Module Queue is a WHMCS internal log of every automated module action that failed. From the docs: “The Module Queue list displays your WHMCS installation's failed automated actions. This includes any action that WHMCS performs using a module, either as part of the system cron tasks or in direct response to a user or admin action.” You can access it at Utilities > Module Queue in the WHMCS admin area. The list shows the client name, the associated service or domain, the action that failed, the error details, and the time of the attempt. Two buttons sit next to each entry: Retry and Mark Resolved.

WHMCS surfaces the queue count on the admin dashboard as a blue Pending Module Actions badge. From the WHMCS feature spotlight: “These represent times that a WHMCS installation attempted to perform an action with an external system (via a module) but did not receive a successful response back.” The badge is helpful if you are already inside the admin area. It is not a monitor. It does not page you at 2 AM. It does not Slack your operations channel. It requires a human to open the dashboard and look at the badge.

The corresponding API endpoint is GetModuleQueue. It returns a JSON response with a result field, a count field, and a queue array containing one object per failed action. Each queue entry carries the serviceId, moduleName, moduleAction, lastAttempt timestamp, and the verbatim lastAttemptError message from the upstream provisioning module. That last field is the signal you actually want when you triage a red alert.

The one-monitor recipe: a single GetModuleQueue API monitor (free plan ready).

The load-bearing recipe is one API monitor against GetModuleQueue. It catches every queued failure regardless of which provisioning module produced it. It runs on the Velprove free plan. It takes about three minutes to configure.

Set up a Velprove monitor of type API (not multi-step, so your multi-step quota stays free for the optional 3-step chain in the next section). HTTP method POST. URL https://your.whmcs.example.com/includes/api.php. Set the request header Content-Type to application/x-www-form-urlencoded. The form-encoded body:
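A minimal sketch of how that body is assembled, assuming WHMCS API credential authentication through the identifier and secret fields and a JSON response requested with responsetype=json. The placeholder values are hypothetical; in practice the credentials belong in the monitor's secret storage, not in plain text.

```python
from urllib.parse import urlencode


def get_module_queue_body(identifier: str, secret: str) -> str:
    """Build the form-encoded body for a GetModuleQueue call."""
    return urlencode({
        "action": "GetModuleQueue",
        "identifier": identifier,   # API credential identifier
        "secret": secret,           # API credential secret
        "responsetype": "json",     # ask WHMCS for JSON output
    })


if __name__ == "__main__":
    print(get_module_queue_body("YOUR_API_IDENTIFIER", "YOUR_API_SECRET"))
    # action=GetModuleQueue&identifier=YOUR_API_IDENTIFIER&secret=YOUR_API_SECRET&responsetype=json
```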

Add a single json_path assertion: path $.count, operator equals, expected value 0. That is the entire recipe. When the queue is empty, the monitor stays green. When any module action fails (anywhere in your WHMCS install, across any provisioning module), the count goes above zero and the monitor flips red.

On the Velprove free plan, set the interval to 5 minutes. On Starter ($19/month), set it to 1 minute. On Pro ($49/month), the floor drops to 30 seconds. The Velprove free plan also covers all six assertion types, so you have everything you need to express this monitor without upgrading.

When the monitor flips red, the alert tells you the queue is no longer empty. To triage, hit GetModuleQueue manually (curl, Postman, your browser) and read the queue array. Each entry tells you which client, which service, which module, which action, and the exact error message the module returned. From there you have everything you need to walk the remediation flow that WHMCS documents at the Resolving a Failed Hosting Account Creation page. The 1-step recipe is the right starting point for any WHMCS-using operator. Most teams ship this one and never need the 3-step chain.
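For triage, a short script can turn the raw response into one line per failed action. This is a sketch that assumes the response shape described earlier (a count field and a queue array with serviceId, moduleName, moduleAction, lastAttempt, and lastAttemptError); the sample payload is invented for illustration.

```python
import json
from typing import List


def queue_is_empty(response_text: str) -> bool:
    """Mirror of the monitor's assertion: $.count equals 0."""
    return json.loads(response_text).get("count") == 0


def summarize_queue(response_text: str) -> List[str]:
    """Return one human-readable line per queued failure."""
    data = json.loads(response_text)
    lines = []
    for entry in data.get("queue", []):
        lines.append(
            f"service {entry.get('serviceId')}: "
            f"{entry.get('moduleName')}/{entry.get('moduleAction')} "
            f"at {entry.get('lastAttempt')} -> {entry.get('lastAttemptError')}"
        )
    return lines


# Hypothetical sample response, not captured from a real install.
SAMPLE = json.dumps({
    "result": "success",
    "count": 1,
    "queue": [{
        "serviceId": 4,
        "moduleName": "cpanel",
        "moduleAction": "CreateAccount",
        "lastAttempt": "2026-05-26 13:55:02",
        "lastAttemptError": "Access denied",
    }],
})

if __name__ == "__main__":
    for line in summarize_queue(SAMPLE):
        print(line)
```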

The three-step chain: trigger a test order and assert the queue stays clean.

The more advanced recipe creates a real test order on every monitor run and then checks the queue. It catches a different class of failure: cases where the upstream module appears healthy to GetModuleQueue (because nothing real has been ordered recently) but actually fails when an order arrives. It is most useful when your order volume is low enough that the 1-step recipe could go days without exercising the provisioning path.

The chain uses three API calls. The load-bearing detail is in the WHMCS AddOrder API documentation, which is explicit: “For more flow control, this method ignores the ‘Automatically setup the product as soon as an order is placed.’ option. When you call this method, you must make a subsequent explicit call to AcceptOrder.” So AddOrder alone does not provision. You need AcceptOrder afterward to trigger the provisioning module.

The chain (Velprove monitor type: multi-step). Every step uses the header Content-Type: application/x-www-form-urlencoded and a form-encoded body:

Step 1: AddOrder. POST to /includes/api.php (no real gateway fires because paymentmethod=mailin). Body:

Assert json_path path $.result, operator equals, expected success. In the step's extract config, capture $.orderid from the response into a variable named order_id.

Step 2: AcceptOrder. POST to /includes/api.php. Body:

Assert json_path path $.result, operator equals, expected success. This call triggers the provisioning module against your cPanel, Plesk, or SolusVM box.

Step 3: GetModuleQueue. POST to /includes/api.php. Body:

Assert json_path path $.count, operator equals, expected 0. If your test order's provisioning failed, the failure entry is sitting at the top of the queue and the assertion trips.
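The three request bodies can be sketched as follows. Treat this as an illustration under assumptions: the client ID (123), product ID (45), billing cycle, and credential placeholders are hypothetical, and the array-style `pid[0]` field follows the usual PHP form-encoding convention. Depending on your product configuration, AcceptOrder may also need its autosetup parameter; check the WHMCS API reference for your version.

```python
from urllib.parse import urlencode

# Hypothetical placeholders; store real credentials as monitor secrets.
AUTH = {
    "identifier": "YOUR_API_IDENTIFIER",
    "secret": "YOUR_API_SECRET",
    "responsetype": "json",
}


def form_body(action: str, **params: str) -> str:
    """Form-encode one WHMCS API call.

    safe="{}[]" keeps the {{order_id}} template and the pid[0]
    array syntax readable instead of percent-encoding them.
    """
    fields = {"action": action, **AUTH, **params}
    return urlencode(fields, safe="{}[]")


STEPS = [
    # Step 1: create the canary order (mailin avoids a real gateway).
    form_body("AddOrder", clientid="123", paymentmethod="mailin",
              **{"pid[0]": "45", "billingcycle[0]": "monthly"}),
    # Step 2: accept it, referencing the order ID extracted from step 1.
    form_body("AcceptOrder", orderid="{{order_id}}"),
    # Step 3: confirm nothing landed in the Module Queue.
    form_body("GetModuleQueue"),
]

if __name__ == "__main__":
    for number, body in enumerate(STEPS, start=1):
        print(f"step {number}: {body}")
```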

The {{order_id}} template syntax in Step 2 is Velprove's flat variable interpolation: any name you used in a previous step's extract config can be referenced inside double braces in subsequent steps. Names are word-character only (no dots, no dashes). For the underlying variable-extraction semantics, see the multi-step API monitoring guide.
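To make the interpolation rule concrete, here is a small sketch of the behavior as described: double braces around a word-character name, replaced by a previously extracted value. It illustrates the rule and is not Velprove's implementation; raising on an unknown name is an assumption of this sketch.

```python
import json
import re
from typing import Dict

# Word characters only: letters, digits, underscore. No dots or dashes.
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def extract_order_id(add_order_response: str) -> Dict[str, str]:
    """Capture $.orderid from an AddOrder response as order_id."""
    return {"order_id": str(json.loads(add_order_response)["orderid"])}


def interpolate(template: str, variables: Dict[str, str]) -> str:
    """Replace each {{name}} with its extracted value."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no extracted variable named {name!r}")
        return variables[name]
    return _VARIABLE.sub(replace, template)
```

For example, `interpolate("orderid={{order_id}}", extract_order_id('{"result":"success","orderid":981}'))` yields `orderid=981`, while a name containing a dash, such as `{{order-id}}`, does not match the pattern and is left untouched.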

The chain fits Velprove's free plan exactly: 3 steps is the free cap. Starter raises it to 5, Pro to 10.

One auto-setup footnote. The WHMCS Order Statuses documentation notes: “If you have configured provisioning to occur while orders are in the Pending status, they will occur without you having accepted the order.” If your install runs in that mode, step 2 (AcceptOrder) is optional and a 2-step chain (AddOrder, then GetModuleQueue) is sufficient. Most resellers run in the safer default of provisioning on Accept, so the 3-step chain is the right shape for most readers.

A cleanup note: every monitor run leaves a test order in tblorders. Add a nightly cron that calls the WHMCS DeleteOrder API to garbage-collect every velprove-canary tagged order from the previous day. Without cleanup, your orders table fills up with thousands of synthetic test records inside a year.

Which recipe do you actually need?

The 1-step recipe (a single GetModuleQueue monitor) catches every queued failure that already happened in your install, across every module, regardless of who triggered it. It is cheap, read-only, and the recommended baseline for everyone including Free-plan readers. Most operators ship this and stop here.

The 3-step chain catches a different class of failure: the case where AddOrder itself errors out (validation rejection, invalid payment method, malformed custom field), the case where AcceptOrder succeeds but the underlying module never gets called, and the case where the module call fails on a code path that the 1-step recipe would not have caught between organic orders. The two recipes are additive, not alternatives. Most teams add the chain only after the 1-step monitor has caught its first incident and they want active provisioning verification on top of the passive queue check.

For the broader framework on which third-party-like systems (WHMCS-as-a-dependency is one) justify a synthetic monitor at all, see the 3-of-12 rule for which dependencies to monitor synthetically. WHMCS scores high on every axis (blast radius across all customers, revenue attribution on the order path, no useful vendor status page because the vendor is you), so it lands in the must-monitor bucket for any WHMCS-running operator.

The least-privilege service account.

Do not point these monitors at your existing admin API credentials. Create a dedicated WHMCS admin role for the monitor service account. The role needs four API permissions and nothing else: GetModuleQueue, AddOrder, AcceptOrder, and GetClientsProducts (the last one if you want to extend the chain later to verify the service row landed correctly). Disable every other API capability on the role, including any read access to client billing data or server credentials.

Create a dedicated API credential pair under that role and use it only for Velprove monitors. Rotate the credential quarterly with the same calendar reminder you use to rotate your other service credentials. If WHMCS supports IP allowlisting on the role (depends on your WHMCS version and any third-party security modules you have installed), restrict the credential to Velprove's monitor egress range so a leaked secret cannot be replayed from elsewhere.

For the test client used by the 3-step chain: create a normal client account, tag it with a unique identifier like velprove-canary, and use it as the sentinel clientid for every AddOrder call. The tag makes the DeleteOrder cleanup cron trivial to write and keeps your production client tables clean. The hosting-stack adjacency story (what to monitor on the underlying cPanel/WHM box itself) lives in how to monitor the cPanel/WHM box itself.

Alerting and incident response when the monitor flips red.

Velprove's alert channels today: email on every plan, including Free. Slack, Discord, Microsoft Teams, and webhook on Starter ($19/month) and above. PagerDuty on Pro ($49/month). Route WHMCS monitor alerts to whichever channel your on-call human actually watches at 2 AM. For most resellers, that is PagerDuty for the on-call rotation plus a Slack mirror for shared visibility.

Pick a home region from Velprove's 5 (North America, Europe, UK, Asia, Oceania) closest to your WHMCS install for the tightest baseline latency. All 5 regions are available on every plan, including Free. Each monitor runs from one region you pick, not fanned out across all five.

When GetModuleQueue reports count > 0, your runbook is the WHMCS-documented remediation flow:

1. Open the WHMCS admin area and navigate to Utilities > Module Queue.
2. Read the error code on each pending action. The error field carries the verbatim message from the upstream module.
3. Click the client name or service name on each entry to jump into the affected client's profile and the Products/Services tab.
4. Fix the underlying cause (username collision, server quota exceeded, API token expired, IP allowlist mismatch).
5. Click Retry inside the Module Queue UI to re-attempt the failed action. The page displays the new attempt result immediately.

The InMotion Hosting troubleshooting guide catalogues the common error shapes you will see in the queue: “Module Create Failed - Service ID: 4 - Error: Access denied” (cPanel rejected the credential), “Server Command Error - Curl Error - Couldn't connect to host (7)” (network partition or port 2087 blocked), “406 Not Acceptable” (ModSecurity rule fired), and “Allowed memory size of xxxxx bytes exhausted” (PHP memory_limit too low on the WHMCS host). Each has a known fix. The Velprove monitor surfaces the failure; the remediation stays in your hands. Velprove monitors observe and report; they do not retry or resolve queued actions.

Why WHMCS is a high-value surface to monitor in the first place.

The Lagom Client Theme cascade through 2024 (HostUS in February, Hosturly mid-year, DigiRDP later in the year) taught the hosting industry that WHMCS panels sit at the center of billing, identity, and provisioning, and that vulnerabilities in any one popular theme or add-on can cascade across the entire customer base. RSStudio shipped Lagom 2.2.7 in September 2024 with the security fix. WHMCS itself shipped a security update on June 3, 2025 covering v8.13, v8.12, and v8.11 LTS. Disclosure language was vague. The cadence is the point: the flows running through your WHMCS install warrant external monitoring you control, not just admin-dashboard widgets you have to remember to check.

Patterns to avoid (honest about Velprove's primitive set).

Five patterns WHMCS-community guides commonly recommend for provisioning monitoring do not fit Velprove's primitive set. Naming them is faster than pretending they are options.

No polling primitive. Velprove's multi-step API monitor runs each step exactly once in sequence and records the result. “Wait 60 seconds after AddOrder, then keep hitting GetOrders until status flips to Active” is not expressible. The replacement is the monitor interval. If you need 30-second detection, set the interval to 30 seconds on the Pro plan.

No time-relative freshness assertion. The six assertions are status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. “Assert this order was created within the last 5 minutes” is not a primitive. The replacement is your endpoint computing freshness server-side and returning 200 or 503, or a json_path assertion against a static expected value like $.count = 0.

No multi-page browser navigation through the WHMCS order wizard. The browser login monitor drives one form submit (the login page). It does not click through the order placement wizard, add items to the cart, fill the billing form, and complete checkout. Order creation in this post lives in the API path (AddOrder), not the browser path. The browser monitor pattern stays where it shines: client-area login coverage on clientarea.php.

No retry-from-monitor. Velprove monitors detect failures; they do not remediate them. The Retry button inside the Module Queue UI is the operator action that re-attempts the failed module call. The monitor surfaces the failure; the operator runs the retry. Trying to make the monitor re-run AcceptOrder on a failed order or call the WHMCS RetryQueueItem API to self-heal a failed provision is an antipattern: it papers over the underlying cause (quota exceeded, credentials wrong, server full) and lets real provisioning errors accumulate at scale.

No mobile push channel. Velprove's alert channels today are email, Slack, Discord, webhook, Microsoft Teams, and PagerDuty. There is no mobile push alert on any plan. Plan your alert routing around what exists.

Frequently Asked Questions

What is the difference between monitoring my WHMCS portal login and monitoring the order and provisioning flow?

A browser login monitor on your WHMCS client portal confirms that clientarea.php renders and accepts customer credentials. It tells you the auth layer is up and the database row exists. It tells you nothing about whether the order, invoice, and provisioning chain that runs after login succeeds actually completes. The order-flow monitor in this post covers the AddOrder, AcceptOrder, and provisioning module call chain that produces a working hosting account. The two monitors catch different failure modes and are additive, not alternatives. Most operators run both: one browser login monitor for portal availability, one GetModuleQueue API monitor for silent provisioning failures.

How do I monitor WHMCS provisioning failures without triggering real charges?

Create a dedicated test client in WHMCS with a unique tag like velprove-canary so your cleanup scripts can identify it. Set up a sentinel product SKU and place each test order with paymentmethod=mailin (the Mail In Payment gateway), so the invoice waits for a manual payment and no real payment gateway fires. Run the 3-step chain (AddOrder, AcceptOrder, GetModuleQueue) against this test client and sentinel product. Schedule a nightly cron with the DeleteOrder API to garbage-collect velprove-canary tagged orders so they do not accumulate in tblorders. No customer-facing charges are ever generated, and your WHMCS database stays clean.

What does the WHMCS Module Queue actually catch that a portal login monitor misses?

Module-by-module provisioning failures that happen after the customer paid and logged in. cPanel CreateAccount failing on quota exceeded. Plesk CreateAccount failing on a subscription template mismatch. SolusVM Create failing on stockout. Domain registrar module timeouts. ResellerClub authentication failures after API key rotation. WHMCS appends every one of these to the Module Queue with the verbatim error from the upstream module (Access denied, Couldn't connect to host (7), 406 Not Acceptable from ModSecurity, Allowed memory size exhausted). All of them are silent from the portal-login perspective: the tblclients row exists, clientarea.php works, the customer logs in to find an account that does not exist.

How often should I run the GetModuleQueue monitor?

5-minute intervals on Velprove Free, 1-minute on Starter, 30-second on Pro. GetModuleQueue is a cheap, read-only API call against your WHMCS install, so the frequency tradeoff is detection lag versus WHMCS server load, and WHMCS server load is not a real constraint at this endpoint. If your order volume is high enough that a 5-minute detection lag means several failed provisions before you find out, move to Starter at $19 per month for 1-minute intervals. If your order volume is low (under 50 new orders per day), 5-minute Free-plan intervals are fine.

Can I do this on the Velprove free plan with the 3-step multi-step monitor limit?

Yes. The 3-step order chain (AddOrder, AcceptOrder, GetModuleQueue) fits the Free plan's 3-step multi-step monitor cap exactly. The simpler 1-step recipe (a single GetModuleQueue API monitor) also fits Free and is the recommended starting point for most operators. For the underlying multi-step primitive, see the multi-step API monitoring guide. Free includes 10 monitors total at 5-minute intervals, all six assertion types, and email alerts.

How do I monitor a WHMCS provisioning module if my install does not expose the API externally?

If your WHMCS API is internal-only, the order-chain recipe in this post will not work as written. The cleanest fallback is to open the API to Velprove's published monitor egress IPs (see velprove.com/ips for the live list and JSON feed), with a dedicated low-privilege API service account that only has GetModuleQueue, AddOrder, AcceptOrder, GetClientsProducts permissions. If even that is off the table, fall back to a Velprove browser login monitor on the WHMCS client portal (free plan, one browser login monitor included at a 15-minute interval). The browser monitor catches login-layer failures but cannot catch the silent provisioning failures this post is about.

Monitor a Heroku App: Eco Sleep, Release Phase, Scheduler

velprove — Tue, 26 May 2026 14:00:03 +0000

The short version: On June 10 2025 Heroku went down for up to 24 hours and Heroku's own status page went down with it, because both ran on the same affected infrastructure. External monitoring is not an extra. For 7 hours and 42 minutes it was the only signal anyone had. Beyond named incidents, Heroku has three platform primitives a 200 OK on your dyno URL cannot see: Eco web dynos sleep after 30 minutes of inbound idle, Release Phase failures email the deployer but leave your public URL serving yesterday's code, and Heroku Scheduler is documented as "expected but not guaranteed." Classic Cedar Eco ($5) and Basic ($7) dynos get zero native threshold alerting. You upgrade to a Standard-1x dyno at $25 per dyno per month just to unlock email alerts on response time. The Velprove free plan covers the same gap with 10 monitors total, 1 browser login monitor, and multi-step API monitors, no credit card, commercial use allowed.

Heroku's own status page went down with the platform on June 10 2025

On Tuesday June 10 2025, Heroku went down for up to 24 hours, and Heroku's own status page went down with it. The incident started at 06:00 UTC when an automated operating system update ran against production infrastructure that was meant to have automated upgrades disabled. The update restarted host networking, the routes did not reapply, and outbound connectivity for every dyno on every affected host was severed at once. Heroku identified the root cause at 13:42 UTC, seven hours and forty-two minutes after the first dynos failed. Customer impact persisted for up to 24 hours on the long tail.

From Heroku's own postmortem, published 2025-06-15:

"Our internal tools and the Heroku Status Page were running on this same affected infrastructure. This meant that as your applications failed, our ability to respond and communicate with you was also severely impaired." *

That sentence is the entire reason this post exists. For nearly eight hours, the vendor status page that every Heroku customer reflexively refreshes during an outage could not tell them anything, because the status page was inside the outage. External monitoring stopped being theoretical insurance and became the only signal anyone had. Heroku has since said no system changes will occur outside its controlled deployment process going forward. That is the right corrective action. It does not change the structural lesson: a status page sitting on the same platform as the product it reports on is a single point of failure, and external monitoring is the independent second signal.

The rest of this post is what an external monitor should watch on a Heroku app between outages of that scale, which is most of the time. Three platform primitives, one Standard-only alerting wedge, and four concrete Velprove monitors.

Why a 200 OK on your Heroku app URL is not enough

A single GET on your Heroku app URL watches one thing: the web dyno answering on port $PORT. That is the smallest part of most real Heroku deployments. Behind the web dyno sit Release Phase steps that run migrations and asset compiles before a new release promotes, Scheduler jobs that fire on cron and run as one-off dynos, worker dynos that drain queues with no inbound traffic at all, and an auto-restart cycle that bounces every dyno at least once every 24 hours. None of those have a public URL, and a status-code probe pointed at / cannot see any of them.

Free-tier Heroku ended on 2022-11-28. Eco dynos at $5 a month for a shared 1,000-hour pool replaced the old free tier, and most indie-hacker Heroku apps now run on Eco or Basic. The hosting-economics half of that decision is its own conversation, covered in the indie-hacker free-stack guide. This post assumes you have already made that call and is about the platform surface a URL monitor cannot see.

The reason the distinction matters: page-level failures change the page, so a URL monitor catches them. Platform-level failures degrade the product without changing the page. Your marketing site can keep serving a clean 200 for hours after the Scheduler job that bills your customers skipped a run, after Release Phase failed and left you on yesterday's code, or after the database the public URL queries silently lost its connection pool. The rest of this post is the platform layer.

Eco dyno sleep is inbound-idle, the opposite of Railway

Heroku Eco web dynos sleep when no web traffic arrives, not when the dyno stops sending outbound traffic. Heroku's Eco Dyno Hours docs state the rule verbatim:

"If an app has an Eco web dyno and that dyno receives no web traffic in a 30-minute period, it sleeps. Eco web dynos do not consume Eco dyno hours while sleeping." *

And on wake:

"the dyno becomes active again after a short delay." *

Heroku does not publish a cold-start latency number. Community observation puts it at a handful of seconds for Node, a few more for Rails or Django, longer for the JVM. The honest framing is a short delay, qualitatively, with the actual number determined by your stack.

The mechanic runs in the opposite direction from Railway. On Railway, a service sleeps when it has not sent outbound traffic for 10 minutes, and an external probe arrives as inbound traffic that does not reset the sleep clock (covered in the Railway platform-layer guide). On Heroku Eco the clock is inbound. A Velprove HTTP probe on a 5-minute interval arrives as inbound web traffic six times every 30 minutes, so the Eco web dyno never sees a full 30 minutes of silence and never sleeps. The probe is keeping the dyno warm whether or not you want it to.

That comes with a cost. The Eco dyno-hour pool is 1,000 hours shared across every Eco dyno on your account. A single Eco web dyno held awake 24/7 burns about 720 hours per month (24 hours times roughly 30 days). One always-awake Eco web dyno fits comfortably in the pool with headroom for a second small Eco service. Two always-awake Eco web dynos overflow into billed dyno time at the Basic per-second rate. The Velprove pattern on Eco is honest about that tradeoff: a 5-minute probe interval keeps your one production Eco web dyno warm and observable, and if you have a second Eco service you slow the second probe to a 10-minute interval or accept the overflow.
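The arithmetic behind that tradeoff, as a small sketch. The constants come from the figures above (a 1,000-hour pool, a 30-minute idle window, a 30-day month); the function names are illustrative.

```python
ECO_POOL_HOURS = 1000      # shared monthly pool across all Eco dynos
SLEEP_WINDOW_MIN = 30      # inbound-idle minutes before an Eco web dyno sleeps


def always_on_hours(dynos: int, days: int = 30) -> int:
    """Dyno-hours consumed by dynos that never sleep."""
    return dynos * 24 * days


def probes_per_window(interval_min: int) -> int:
    """How many probes land inside one idle window."""
    return SLEEP_WINDOW_MIN // interval_min


def probe_keeps_dyno_awake(interval_min: int) -> bool:
    """A probe shorter than the idle window prevents sleep."""
    return interval_min < SLEEP_WINDOW_MIN


if __name__ == "__main__":
    for dynos in (1, 2):
        used = always_on_hours(dynos)
        print(f"{dynos} always-on dyno(s): {used} h, "
              f"pool remaining: {ECO_POOL_HOURS - used} h")
```

One always-on dyno leaves 280 hours of headroom; two need 1,440 hours, which is 440 more than the pool holds. A 10-minute probe still lands three times per idle window, so slowing the second probe saves probe volume but does not by itself let that dyno sleep.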

For a Heroku-hosted SaaS where cold-start latency matters, an always-warm Eco web dyno is the correct configuration. For a side project that genuinely does not care about a few seconds of cold-start delay on first request, a slower probe interval saves pool hours at the cost of detection lag. The rule is the principle, not a number: match the probe interval to how fast the failure matters.

Release Phase failures leave your public URL on yesterday's code

Release Phase is the lifecycle stage that runs after a build and before a release is promoted to the dyno formation. It is where database migrations and asset compile steps typically live. When the release command fails, the new release does not promote. The public URL keeps serving the previous release.

From Heroku's Release Phase docs:

"If the release command exits with a non-zero exit status, or if it's shut down by the dyno manager, the release fails. In this case, the release is not deployed to the app's dyno formation." *

Heroku does send an email when this happens:

"An email notification is generated in the event of a release phase failure." *

The honest wedge is sharper than "Release Phase fails silently." The email arrives. The problem is what the email covers and what it does not.

The email goes to the deployer, the developer who pushed the release. On a one-developer indie project that is the same person who would have configured an external monitor. On a small team, the deployer is whichever developer last pushed, not the on-call engineer. On a larger team with a deploy bot or a CI/CD pipeline, the email may be going to a shared inbox no human reads. The notification is real, but it is point-to-point email to a known address, not a routable alert into PagerDuty or Slack.

The bigger problem is what the URL looks like. From the outside, a Heroku app whose Release Phase just failed looks identical to a Heroku app where the release succeeded and did not break anything: the public URL returns 200, the HTML looks correct, and the responses are consistent. The previous release is still running. If the migration that just failed was the one that adds a column three new code paths depend on, the next time those code paths run they will 500, but right now the URL is fine. Nothing has actually deployed, and the URL still looks normal.

The pattern that closes the gap is a build-version probe. Expose a /version endpoint that returns the current git SHA, wired from an environment variable that Heroku sets at build time. Have a Velprove HTTP monitor assert body_contains the SHA your CI just produced. When the release succeeds, the new SHA serves and the assertion passes. When Release Phase fails and the previous release stays live, the old SHA serves and the assertion fails on the next probe. The mechanic is exactly the same shape used on Render and Railway; the Heroku-specific framing is the lifecycle stage. The full assertion pattern is in the build-SHA assertion pattern guide; this section is the Release Phase framing on top of it.
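A minimal sketch of both halves of the probe. It assumes the SHA reaches the running app in an environment variable, here given the hypothetical name GIT_SHA, and it models the monitor's body_contains check as a plain substring test. The route wiring in your web framework is left out.

```python
import json
import os
from typing import Mapping


def version_body(env: Mapping[str, str] = os.environ) -> str:
    """Body for a /version route: the SHA of the release being served.

    GIT_SHA is an assumed variable name; use whichever variable
    your build actually populates.
    """
    return json.dumps({"sha": env.get("GIT_SHA", "unknown")})


def body_contains(body: str, expected_sha: str) -> bool:
    """What the monitor asserts: the body includes the SHA CI produced."""
    return expected_sha in body
```

If Release Phase fails, the previous release keeps serving its old SHA, so `body_contains(version_body(), new_sha)` is false on the next probe even though the route still returns 200.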

Recovery on Heroku is two commands. heroku releases:retry reruns the release without a new build, useful when the failure was an external dependency such as a Postgres instance that was briefly unavailable. heroku rollback promotes a prior release if the failed release uncovered something that needs a code fix. Either way, the monitor told you to run them.

Heroku Scheduler is "expected but not guaranteed"

Heroku Scheduler is a free add-on that runs jobs on a cron-like schedule by spawning a one-off dyno that executes the configured command. The killer detail is in Heroku's own Scheduler docs:

"Scheduler job execution is expected but not guaranteed. Scheduler is known to occasionally (but rarely) miss the execution of scheduled jobs." *

And, in the same article:

"In very rare instances, a job may be skipped. In very rare instances, a job may run twice." *

Read those sentences twice. Heroku is documenting that Scheduler can skip a run and can run a job twice, with no native alert for either case. If you bill customers from a Scheduler job, settle balances from a Scheduler job, or send a daily report from a Scheduler job, the platform has formally disclaimed the guarantee. The contract is best-effort.

Both failure modes are silent from outside. A job that runs and exits non-zero produces logs in your platform-aggregated log stream, which you have to be looking at. A job that Scheduler skips produces nothing, because from Scheduler's side nothing happened. The Render counterpart (covered in the Render platform-layer guide) at least emails on a failed run; Heroku Scheduler does not even emit that signal for the skip case.

The pattern that works is a heartbeat URL the job hits at the end of its successful run, paired with a freshness endpoint a probe asserts against. The job writes a timestamp to Postgres or a key-value store on durable success, after the work is done, not on entry. A small companion route reads that timestamp, computes its age, and returns 503 when the age exceeds the job cadence plus a grace window, and 200 otherwise.
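A minimal sketch of that companion route's decision logic, in Python. The function name and the cadence and grace values are assumptions for illustration; how you load the last-success timestamp from Postgres or a key-value store is left out.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_success, cadence, grace, now=None):
    """Return 200 while the last successful run is recent enough, else 503."""
    now = now or datetime.now(timezone.utc)
    if last_success is None:
        return 503  # the job has never recorded a success
    age = now - last_success
    # Stale once the stamp is older than one full cadence plus the grace window.
    return 503 if age > cadence + grace else 200


# Example: a daily job with a one-hour grace window (illustrative values).
status = freshness_status(
    last_success=datetime.now(timezone.utc) - timedelta(hours=2),
    cadence=timedelta(hours=24),
    grace=timedelta(hours=1),
)
```

Keeping the comparison on the server means the probe never has to reason about time; it only sees 200 or 503.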

A Velprove HTTP monitor asserts status_code = 200 on that endpoint. The endpoint flips to 503 the moment the job goes stale, so a 200 is the whole check. Both Scheduler failure modes, skipped run and run-that-exited-non-zero-without-writing-the-stamp, collapse into one signal: the timestamp did not advance. The detection lag is bounded by the probe interval, not by Scheduler.

One discipline matters: the job must write the timestamp on real progress, not on entry. A job that fails partway through and exits non-zero before the final write looks correctly stale from the freshness endpoint. A job that writes the timestamp before doing the work would look fresh while never actually completing.

Eco and Basic dynos get no native threshold alerting

This is the load-bearing economic wedge of the post. Heroku has a first-party alerting feature called Threshold Alerting that emails you or pages PagerDuty when response time or failed-response rate crosses a configured threshold. It runs on top of App Metrics, which is the dashboard view of your dyno's performance over time.

Two quotes from Heroku's Application Metrics docs define the tier boundary:

"Application metrics aren't available for apps using eco dynos." *

"The Threshold Alerting feature is available to apps running on Professional dynos (standard-1x, standard-2x and performance) and all Fir dynos." *

The economic shape is this: classic Cedar Eco dynos at $5 a month and classic Cedar Basic dynos at $7 a month do not have App Metrics, so they cannot have Threshold Alerting, so they have zero native uptime alerting from Heroku at all. The cheapest classic dyno that includes Threshold Alerting is Standard-1x at $25 per dyno per month. That is a 5x jump in dyno cost for an Eco shop and a roughly 3.6x jump for a Basic shop, paid not for more compute but for the right to receive an email when response time crosses a threshold.

The Fir-generation entry-tier dyno (Heroku's next-generation Kubernetes-based runtime) does include alerting on its low-cost tier per the Threshold Alerting tier quote. The Eco-no-alerting claim scopes specifically to classic Cedar Eco dynos. If you are running on Fir already, your alerting story is different and worth checking against the current Fir docs.

The third option is external. The Velprove free plan covers 10 monitors total at a 5-minute interval, 1 browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, and 1 status page, with email alerts on every plan including free. That gives a Cedar Eco shop response-time and failed-response alerting without changing dyno tier, plus the Release Phase and Scheduler coverage Threshold Alerting cannot give you even on a Standard dyno. The math is straightforward: $0 for external alerting versus an extra $240 a year per dyno (the $20-a-month step from Eco to Standard-1x) to unlock native alerting. The right answer is both, for most teams, but the cost of starting with external is zero and the marginal benefit is high.
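The dyno-tier arithmetic above, worked through in a few lines of Python. The prices are the per-dyno monthly figures quoted in this section; the variable names are only for illustration.

```python
ECO, BASIC, STANDARD_1X = 5, 7, 25  # USD per dyno per month, as quoted above

eco_multiple = STANDARD_1X / ECO      # 5.0
basic_multiple = STANDARD_1X / BASIC  # about 3.57

# Extra yearly spend per dyno to reach the first tier with Threshold Alerting.
extra_per_year_from_eco = (STANDARD_1X - ECO) * 12      # 240
extra_per_year_from_basic = (STANDARD_1X - BASIC) * 12  # 216
```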

Setting up the 4 Velprove monitors for a Heroku app

Put the patterns above together and the Heroku-side coverage lands in four concrete monitors. All four fit inside the Velprove free plan: 10 monitors total at a 5-minute interval, 1 browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, SSL expiry monitoring, and 1 status page with a Velprove badge. Each monitor probes from one of 5 global regions you pick at setup time. Every plan picks from the same 5 regions; to cover multiple regions, you create multiple monitors.

1. Create a browser login monitor on the signed-in path. This is the Velprove differentiator and the monitor that catches the most subtle Heroku failures, so it goes first in the canonical order. Create a new browser login monitor against your Heroku app's login URL with a dedicated low-privilege test user. The monitor drives a real browser, signs in as the test user, follows the post-login redirect, and asserts on the landing page. Under Customize detection, switch Success verification from the default URL-change to Page contains text and set it to a string that only renders when a real database read succeeded: a customer name, an invoice ID, a known plan label. A Release Phase that fails the migration adding a column the login flow depends on will land the user on an error page with a 200 status, which a URL probe would miss and the browser login monitor catches. The monitor is free on every plan, including the free plan, at a 15-minute interval.

2. Add an HTTP monitor on the public URL with a build-SHA assertion. Create an HTTP monitor against your public Heroku app URL on a 5-minute interval. On the Verify step, add two Success Conditions in order: status_code = 200 and body_contains set to the build SHA exposed at /version. The body_contains rule turns the same probe into a Release Phase detector, because a stuck release keeps serving the previous SHA at a 200 URL. Pick the region closest to your real users. On Eco, this probe also keeps the dyno warm by arriving as inbound web traffic every 5 minutes, which resets the 30-minute sleep clock.

3. Add an API monitor on the Scheduler heartbeat endpoint. Create an HTTP monitor (API-shaped) against the freshness endpoint your Scheduler job updates on real success. Assert status_code = 200. The endpoint returns 503 when the timestamp goes stale, so a 200 is the whole check. Match the probe interval to the job cadence: a daily Scheduler job is comfortable on a 5-minute probe with a 25-hour grace window in the endpoint logic; an hourly job wants a tighter grace window. The detection lag is bounded by your probe interval, not by Scheduler.

4. Add a multi-step API monitor for deploy verification. Create a multi-step API monitor. Step 1 hits /version and captures $.build_sha into a variable using a json_path assertion. Step 2 hits a second route that compares its own runtime SHA against the captured value and returns non-2xx on mismatch. Multi-step monitors run each step once in sequential order, with the same 6 assertion types HTTP monitors use: status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. No polling, no retry-until, no wait-for-condition. The free plan covers multi-step up to 3 steps; Starter covers up to 5 and Pro up to 10. This monitor is the upgrade path from the body_contains assertion in monitor (2): the SHA comparison lives server-side in your app, so the setup survives every future deploy unchanged.
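The comparison route that step 2 of the multi-step monitor calls can be sketched as a small pure function. This is a minimal illustration, not Velprove's or Heroku's API: the function name, the 409 status on mismatch, and the response fields are all assumptions, and the source only requires that a mismatch return some non-2xx status.

```python
def verify_build(expected_sha, runtime_sha):
    """Return (status_code, body) for a hypothetical build-verification route.

    expected_sha is the value the monitor captured from /version in step 1;
    runtime_sha is what the process handling this request believes it runs.
    """
    if not expected_sha or not runtime_sha:
        return 400, {"match": False, "reason": "missing sha"}
    # Compare case-insensitively and ignore stray whitespace around the hex.
    if expected_sha.strip().lower() == runtime_sha.strip().lower():
        return 200, {"match": True}
    return 409, {"match": False, "expected": expected_sha, "running": runtime_sha}
```

Because the route answers with a status code, the monitor only needs a status_code assertion on step 2 and never has to hold the expected SHA in its own configuration.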

That is four monitors out of your ten total slots: one browser login monitor, two HTTP monitors, and one multi-step monitor. The remaining six slots are room for a database health endpoint, a third-party API dependency, a second region on a critical path, or a second environment such as staging.

Email alerts are included on every plan, including free. Slack, Discord, Microsoft Teams, and webhook alerts unlock on Starter. PagerDuty integration is on Pro for teams that route alerts into an on-call rotation. The free plan's status page carries a Velprove badge; the badge comes off on paid plans.

What Velprove cannot catch on Heroku

A monitor that pretends to catch everything is lying. The honest boundary on a Heroku app has four parts.

Most multi-factor authentication flavors on the browser login monitor. If your Heroku app's login flow requires an SMS code, an email code, a magic link, a push approval, or a passkey, the browser login monitor cannot complete it. Velprove cannot read your phone, your inbox, or your authenticator app. The monitor works on login flows where the dedicated test user can sign in with a username and password. For consumer SaaS where every user is forced through SMS or email-code MFA, the browser login monitor pattern is not the right tool; an HTTP monitor on a post-login API endpoint with a service token is. This is not a Heroku-specific limit, but it bites Heroku-hosted apps the same way it bites apps anywhere else.

Heroku platform internals that are invisible to an external probe. Velprove sees what the public URL returns. Velprove does not see dyno-level CPU or memory pressure before the request reaches the dyno, the state of the router queue, or the internal health of Heroku Postgres beyond what your application code exposes. App Metrics on Standard-1x and up sees those; an external probe sees the consequences. The two views complement each other, and on a small Eco app the external view is the only view available.

Fir-generation entry-tier dyno alerting. The Eco-no-alerting framing in this post scopes to classic Cedar Eco dynos. The Fir generation, Heroku's next-generation Kubernetes-based runtime, includes Threshold Alerting on its equivalent low-cost tier. If you are on Fir, your native alerting story is meaningfully different and worth checking against current Heroku docs before assuming this post's wedge applies to you. The external pattern still helps for the Scheduler skip and Release Phase build-SHA cases on Fir, because those are not threshold-shaped signals.

Dyno restart skew. Per Heroku's Dyno Restarts docs, the dyno manager restarts every dyno at least once per day, on a jittered cycle of 24 hours plus up to 216 random minutes. During a deploy, some dynos can be on the new release and some on the previous release for a short window. This is documented, intentional, and customer-tolerated behavior, not a failure mode you should wire alerting around.

Getting started

The Velprove free plan covers 10 monitors total at a 5-minute interval, 1 browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, 5 global regions to choose from (one per monitor), email alerts, SSL expiry monitoring, and 1 status page with a Velprove badge. Commercial use is allowed on every plan, including free. No credit card required.

That is enough to land the four-monitor Heroku set described above for a single production app: a browser login monitor on the signed-in path, an HTTP monitor on the public URL with a build-SHA assertion, an HTTP monitor on a Scheduler heartbeat endpoint, and a multi-step API monitor for deploy verification. Start with the free plan. The first monitor takes about three minutes to configure.

Frequently Asked Questions

How do I monitor a Heroku Eco dyno that sleeps after 30 minutes?

Point a Velprove HTTP monitor at your public Heroku app URL on a 5-minute interval and assert status_code = 200 plus body_contains on a static marker your real app emits. The probe arrives as inbound web traffic, which resets the 30-minute Eco sleep clock and wakes the dyno on the first hit after a sleep. The tradeoff is dyno-hour pool burn: a single Eco dyno held awake 24/7 consumes about 720 hours of the 1,000-hour Eco pool every month, which is fine for one app but not for two, since two always-awake dynos would need about 1,440 hours. If you have a second Eco app on the same account, slow the probe interval or accept overflow billing.

How do I detect a Heroku Release Phase failure when the public URL still returns 200?

Expose a /version endpoint that returns the current git SHA from a build-time environment variable, then have a Velprove HTTP monitor assert body_contains the SHA your CI just produced. When Release Phase fails, Heroku emails the deployer but does not promote the new release to the dyno formation, so your public URL keeps serving the previous build's SHA. The body_contains assertion fails on the next probe and Velprove alerts you. The full multi-step capture-and-assert variant lives in our API health check patterns reference.

Does Heroku Scheduler alert me when a job does not fire?

No. Heroku's own Scheduler docs say job execution is expected but not guaranteed and that jobs may occasionally be skipped or run twice. Heroku sends no email and no webhook when a scheduled job fails to fire. The pattern that closes the gap is a heartbeat URL the job hits on real success, with a companion freshness endpoint that returns 503 when the timestamp goes stale.

Why do I need external monitoring if I am paying for Heroku Standard dynos?

Heroku's Threshold Alerting on Standard-1x and above watches response time and failed responses on the web dyno. It does not watch Scheduler runs, it does not catch a Release Phase failure that leaves you on the previous build SHA, and on June 10 2025 it could not tell you anything because Heroku's own status page went down with the platform on the same affected infrastructure. Threshold Alerting is a useful inside-the-platform signal. An external probe from outside Heroku is what gives you signal when Heroku itself is the failure. The two complement each other; one does not replace the other.

Can I monitor a Heroku app on the Velprove free plan?

Yes. The Velprove free plan covers 10 monitors total at a 5-minute interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, SSL expiry monitoring, and 1 status page. Commercial use is allowed and no credit card is required. That is enough to land an HTTP probe on your web URL, a build-SHA assertion on /version, a Scheduler heartbeat on a freshness endpoint, and a browser login monitor on the signed-in path of a Heroku-hosted SaaS.

Does a Velprove probe keep my Heroku Eco dyno from sleeping?

Yes, and that is the opposite of how Railway works. Heroku's Eco sleep clock is inbound-idle: if an Eco web dyno receives no web traffic in a 30-minute period, it sleeps. A Velprove HTTP probe arrives as inbound web traffic, so a 5-minute probe interval resets the sleep clock every 5 minutes and the dyno stays warm. The cost side of that decision is the 1,000-hour Eco pool shared across all Eco dynos on the account. One always-awake Eco web dyno burns about 720 hours per month, leaving room for one more small Eco service; two always-awake Eco dynos overflow into billed time. The Railway inverse, where service sleep is outbound-idle and an inbound probe does not reset the clock, is covered in the Railway outbound-idle sleep writeup.

The 3 of 12 Rule: Choosing Which Third-Party Dependencies to Monitor Synthetically

velprove — Mon, 25 May 2026 14:00:03 +0000

Triage: Most SaaS apps run on 10 to 20 third-party dependencies. You cannot afford a custom monitor recipe for each one, and you should not try. The right move is dependency-graph triage. Rank every vendor on three axes: blast-radius (what breaks downstream when they fail), revenue-attribution (what dollars stop), and vendor-status-lag-history (how late their status badge tells the truth). The top 3 get a 5-minute synthetic of the exact call your app makes. The next 5 stay on the vendor's status page as a secondary signal. The bottom 4 you accept as quiet risk. This post walks the triage across the transactional email path, the vendor status page itself, and the webhook receiver you cannot host, using the Datadog External Provider Status launch from October 2025 as evidence that the largest SRE teams already treat the dependency-monitoring problem as tier-1.

You have 12 third-party dependencies. You should monitor 3 of them synthetically.

Count the third-party dependencies your app calls in production. Auth provider. Email vendor. Payments. Object storage. CDN. DNS. Search. Queue. Webhook senders. AI provider if you have one. SMS provider if you have another. CRM API. Most SaaS apps land between 10 and 20 once you write the list down honestly. A custom synthetic per vendor on a 5-minute interval is ten to twenty monitors, and on most paid uptime tools that is real money per month before you have caught a single incident.

The triage rule we run is this. Score every vendor on three axes from 0 to 3. Blast-radius is what breaks for your customers the moment the vendor goes dark. A payments vendor sits at 3. A vendor used for an offline weekly report sits at 0. Revenue-attribution is what dollars stop. A vendor in the checkout path is 3. A vendor in the marketing-email path is 1, because delayed marketing email rarely breaks the buy. Vendor-status-lag-history is how late their public status page has run on their last three incidents. A vendor that posted within 5 minutes of impact three times in a row scores 0. A vendor that has shown a 30-plus minute lag at least once scores 3.

Add the three axes. Vendors that score 7 or higher belong in the top bucket. Cap the bucket at 3, the budget you actually have. Vendors that score 4 to 6 sit in the next 5: their status page is a secondary signal, you accept the lag, you do not run your own probe. Vendors that score 3 or under sit in the bottom 4: you accept that you find out late. The dependency-graph triage is the same triage every SRE team eventually arrives at. The work is making the scoring explicit and revisiting the list every quarter when vendors change.
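The scoring rule above fits in a few lines of Python. This is a sketch under stated assumptions: the vendor names and scores are invented for illustration, and the handling of a vendor that scores 7 or higher after the top bucket is already full (it drops to the status-page bucket) is one reasonable reading of "cap the bucket at 3", not something the rule spells out.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    name: str
    blast_radius: int  # 0 to 3
    revenue: int       # 0 to 3
    status_lag: int    # 0 to 3

    @property
    def score(self):
        return self.blast_radius + self.revenue + self.status_lag


def triage(vendors, top_cap=3):
    """Split vendors into synthetic / status-page / accepted-risk buckets."""
    buckets = {"synthetic": [], "status_page": [], "accept": []}
    # Highest score first; ties keep their original order (sort is stable).
    for v in sorted(vendors, key=lambda v: v.score, reverse=True):
        if v.score >= 7 and len(buckets["synthetic"]) < top_cap:
            buckets["synthetic"].append(v.name)
        elif v.score >= 4:
            buckets["status_page"].append(v.name)
        else:
            buckets["accept"].append(v.name)
    return buckets


# Hypothetical portfolio, scores made up for the example.
portfolio = [
    Vendor("payments", 3, 3, 2),
    Vendor("auth", 3, 2, 3),
    Vendor("email", 3, 2, 2),
    Vendor("cdn", 3, 3, 1),
    Vendor("search", 2, 1, 1),
    Vendor("weekly-report", 0, 0, 1),
]
result = triage(portfolio)
```

Keeping the scores in a file like this also makes the quarterly revisit a diff instead of a meeting.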

This is the generalized parent of two posts we already shipped. For the vendor-specific worked examples, see the LLM-scoped version of this triage and the Stripe-scoped version of this triage. Both posts walk one vendor through the triage at full depth. This post walks the framework across a portfolio.

What a synthetic of your own call actually buys you (the Datadog 2025 anchor)

On October 21, 2025, Datadog launched External Provider Status alongside a free public companion at Updog.ai. The launch pitch is dependency-graph monitoring as a product category. In Datadog's own words: “Datadog External Provider Status provides real-time visibility into the health of more than 40 third-party providers, including 13 AWS services across global regions and widely used SaaS APIs such as GitHub, Stripe, and OpenAI.”

The detection-time claim is the load-bearing number. From the same launch: “During a DynamoDB degradation on July 3, 2025, Datadog surfaced the issue 32 minutes before AWS acknowledged it on their status page.” Thirty-two minutes is the gap between when Datadog's customer telemetry started flagging the vendor and when the AWS status page caught up. The same 32-minute number appears in the companion Updog.ai launch post: “Instead of depending on provider updates, Updog.ai is powered by aggregated, anonymized observability data and AI models.”

Datadog's model is telemetry-derived: it needs APM agents installed and live customer traffic flowing to spot the vendor degradation in aggregate. Velprove's wedge is the opposite shape: a 5-minute synthetic of the exact dependency call your app makes, run from one region you pick out of the 5 available, with no agents and no instrumentation. It works on day one. If Datadog needed to ship a new product to close the 32-minute gap, you can ship the lightweight version yourself in 15 minutes per top-3 vendor.

This is the structural reason a plain HTTP probe on the vendor's documented endpoint is not the same primitive. The vendor's endpoint often returns 200 while the call your app actually makes (the one with your auth, your headers, your payload shape) fails. The broader version of that argument lives in why HTTP probes alone miss vendor-degradation outages.

What status-page-lag history tells you about a vendor

The third triage axis (vendor-status-lag-history) is the one teams skip because it requires looking at the vendor's last three incidents and measuring the delta between impact start and the first Investigating update. The work is worth it, because the lag varies wildly across vendors. Stripe routinely posts within 5 to 10 minutes. GitHub posted the May 15 2026 Actions degradation 30 minutes after impact. AWS posted the July 3 2025 DynamoDB degradation 32 minutes after Datadog's telemetry caught it. The lag is structural to the vendor, not random.

Third-party aggregator IsDown reports a wider gap across its provider pool. In their own product blog, they write: “In January 2026, IsDown detected outages up to 2.2 hours before vendors acknowledged them, and caught 101 incidents that vendors never reported at all.” Two caveats on that number. It is self-reported product telemetry from a competing monitoring tool, not third-party-audited. And it is an aggregate across the IsDown provider pool in a single month, not a per-vendor claim. Read it as an upper bound on how bad vendor status pages can run, not as the median.

The rule we use is rougher and easier to apply. Pull the vendor's status page incident history. Look at the last three incidents. If all three were posted within 5 minutes of impact, the status page is acceptable as a secondary signal. If any one of the three was posted 30 minutes or more after impact, the status page is not the monitor: a synthetic is. The broader structural argument that vendor status pages lag for the same reason customer-facing dashboards always lag the truth lives in vendor status-page lag is a structural problem.
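The three-incident rule can be written down as a small function, sketched here in Python. The function name and return strings are illustrative. The rule as stated does not say what to do when every lag falls between 5 and 30 minutes, so the sketch returns an explicit "judgment call" for that band instead of guessing.

```python
def status_page_role(lag_minutes):
    """Classify a vendor status page from the posting lag of its last three incidents."""
    if len(lag_minutes) != 3:
        raise ValueError("expected the lag, in minutes, of exactly three incidents")
    if any(lag >= 30 for lag in lag_minutes):
        return "run a synthetic"
    if all(lag <= 5 for lag in lag_minutes):
        return "status page is an acceptable secondary signal"
    # Not covered by the rule: every lag is under 30 but at least one is over 5.
    return "judgment call"
```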

Triage worked example #1: the transactional email path

Vendor candidates: SendGrid, Postmark, Mailgun. Score the axes. Blast-radius is HIGH: password resets, invoice receipts, double-opt-in confirmations, and magic-link auth all die together when the send API fails. Revenue attribution is MEDIUM: rarely the direct buy path, but onboarding-email failures kill activation and churn-recovery email failures cost real dollars at the long tail. Vendor-status-lag-history is LOW for the three named vendors: their status pages have been generally honest. Combined score: high enough to land in the top 3 for most SaaS shops.

The Velprove recipe is a single API monitor. HTTP POST to the provider's send endpoint with a sentinel recipient address you own (something like monitor@yourdomain.com routed to /dev/null on your end). Assert status_code eq 202 and header_contains x-message-id on the response. That is the SendGrid response shape; other providers may differ, so check your provider's docs for its success status and message identifier. Both assertions run on every Velprove plan, including Free. Each monitor run is one snapshot of the send path at that interval, not a poll: the monitor fires once, reads the response, and records the result. The synthetic catches API-side failures (rate limits, auth-key rotation breakage, provider 5xx). It does not catch deliverability failures (the message accepted by the vendor but never landing in the inbox). Velprove does not read inboxes and the browser login monitor cannot click an email link. For deliverability, layer a dedicated inbox-monitoring tool on top.
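To make the two assertions concrete, here is a minimal Python sketch of the same check applied to one response snapshot. It illustrates the logic only and is not how Velprove evaluates assertions internally; treating header_contains as "the header is present and non-empty" is an assumption.

```python
def check_send_response(status_code, headers):
    """Apply the recipe's two assertions to a single response; return failures."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    failures = []
    if status_code != 202:
        failures.append(f"status_code was {status_code}, expected 202")
    if not normalized.get("x-message-id"):
        failures.append("x-message-id header missing or empty")
    return failures
```

An empty list means the snapshot passed. A rotated API key or a provider 5xx shows up as a status_code failure, which is the API-side class of failure this monitor is meant to catch.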

A second pattern fits the same triage shape when the vendor exposes a customer dashboard you actually sign in to (Stripe Dashboard, AWS Console, SendGrid web app). Velprove's browser login monitor drives a real browser through the vendor's sign-in page with a dedicated low-privilege test account, then asserts on a post-login element only authenticated users see. If the vendor's auth backend is degraded but their public API returns 200, the API synthetic stays green and the browser login monitor flips red. The Free plan includes one, running every 15 minutes from any one of the 5 available regions. The two synthetics layer cleanly on the same vendor.

The shape generalizes to any vendor whose API you call to produce a side-effect (send, charge, upload, dispatch). The foundational primitive is the multi-step API monitor primitive, which the Free plan supports at up to 3 steps (5 on Starter, 10 on Pro). For email, 1 step covers it. The Stripe-checkout pattern needs 3 steps to land the full flow.

Triage worked example #2: the vendor status page itself

The triage question for the vendor status page is narrower than it sounds. The question is not “should I scrape my vendor's status page?” The question is “when can I trust this vendor's status page enough to skip running my own probe against the vendor?” The answer is the third axis. If the vendor's last three incidents posted within 5 minutes of impact, treating the status page as the secondary signal is reasonable. If any one of the three posted 30 minutes late, the status page is not the monitor.

AWS sits in the second group, where the status page is not the monitor. The July 3 2025 DynamoDB degradation is the Datadog anchor (32-minute gap) and the October 20 2025 AWS US-EAST-1 cascade is the worst-case example. Slack sits in the second group. Most SaaS APIs in the long tail (auth providers, search vendors, CRM webhooks) sit there too: their status pages exist, but they update slowly because they are driven by manual SRE confirmation, not by customer telemetry. The right action is to scrape the status page as a secondary signal if you want richer context, but never to rely on it as the primary detection instrument for any vendor that scored 3 on the lag axis. The cluster-fold variant of this pattern (other failure modes that look green on the dashboard) lives in the silent-outage taxonomy.

Triage worked example #3: the webhook receiver you cannot host

The triage question for webhook-driven vendors is constrained by what Velprove can and cannot do. Velprove does not host an inbound webhook receiver. We cannot accept the vendor's POST, parse it, and assert on the payload. That is a category of product (webhook capture and replay) we do not ship. The pattern that works inside Velprove's primitive set is two monitors that compose: trigger the workflow on the vendor side with one monitor, then check the downstream effect on your own application endpoint with a second monitor on the next interval.

Stripe checkout is the canonical example. Monitor A is an API monitor that creates a test checkout session against Stripe's test mode. Monitor B is an API monitor that hits your own /api/orders/test-canary endpoint (or whatever you name it), with a json_path assertion that $.status equals the literal string paid. Your endpoint records the most recent test-canary state server-side, returns 200 with the static JSON if it has been updated by the Stripe webhook within your acceptable window, and returns 503 if it has not. Velprove does not need to know about the webhook itself. It only asserts on what your endpoint says about the webhook's effect. The deeper version of this pattern, including how to design the /api/orders/test-canary endpoint, lives in trigger-and-check-effect for webhooks. The shape applies to any vendor whose only outage signal is a webhook you cannot receive: SMS delivery callbacks, payment confirmations, CRM record-update events, build-finished notifications.
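The server-side half of that pattern can be sketched as follows. This is a minimal illustration in Python, not the design from the linked guide: the class name, the in-memory storage, and the "stale" body are assumptions, and a real app would persist the timestamp somewhere that survives restarts and is shared across processes.

```python
import time


class CanaryState:
    """Tracks when the vendor's webhook last marked the test-canary order paid."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.last_paid_at = None

    def record_webhook(self, now=None):
        # Call this from your webhook handler after it has processed the event.
        self.last_paid_at = time.time() if now is None else now

    def respond(self, now=None):
        """Return (status_code, json_body) for the test-canary endpoint."""
        now = time.time() if now is None else now
        fresh = (
            self.last_paid_at is not None
            and now - self.last_paid_at <= self.max_age_seconds
        )
        if fresh:
            return 200, {"status": "paid"}
        return 503, {"status": "stale"}
```

Size max_age_seconds to cover at least one full Monitor A interval plus the webhook's usual delivery time, so a single slow delivery does not flip Monitor B red.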

The 5-region pattern and partial regional degradation

Velprove offers 5 regions to choose from on every plan, including Free. Each monitor runs from one region you pick, not from all five at once. The triage implication is straightforward. For a vendor with global failure modes (Cloudflare data plane, AWS US-EAST-1 cascades), a single monitor in any one region catches the incident. For a vendor with regional failure modes (a CDN with PoP-specific issues, an auth provider whose European cluster degrades independently of US-East), you create one monitor per region you want to cover.

The Cloudflare November 18 2025 outage is the clean global example. Cloudflare's own post-mortem at blog.cloudflare.com/18-november-2025-outage records 11:20 UTC to 17:06 UTC, roughly 5 hours and 46 minutes of global data plane impact. Core CDN and security services returned HTTP 5xx status codes across every Cloudflare region. A Velprove synthetic from any of the 5 regions would have flipped red inside one monitor interval. No region selection wisdom was required.

The October 20 2025 AWS DynamoDB cascade is the contrasting story. ThousandEyes' post-incident analysis at thousandeyes.com/blog/aws-outage-analysis-october-20-2025 documents the shape: a DynamoDB DNS race condition surfaced at 6:49 AM UTC October 20, AWS engineers identified the cause by 7:26 AM UTC, DNS was fully restored between 9:25 and 9:40 UTC, and EC2 instance launches continued failing until 8:50 PM UTC, with Redshift cluster backlogs not cleared until 11:05 AM UTC October 21. The customer-visible window ran over 15 hours. Many of the downstream phases hit US-EAST-1 specifically. A monitor from a non-US region would have stayed green for the EC2-launch phase while a US-region monitor turned red. That asymmetry is the case for putting your top-3 vendor monitors in two or three regions when you can spend the monitor budget.

When NOT to monitor a dependency synthetically (the third bucket)

The honest counterweight to the triage rule is the bottom-4 bucket. Some vendors do not justify a synthetic, because the cost of running the monitor (the time to set it up, the alert noise, the slot it takes in your monitor budget) exceeds the cost of finding out late.

Three concrete shapes land in the bottom bucket reliably. A vendor used for an offline batch report that runs nightly: a 10-minute outage at 4 AM costs nothing real, and your nightly job retries on its own. A vendor used for a low-traffic internal admin feature: you will find out the next time you click the button, which is rare enough that the monitor is overhead. A vendor with a fast, honest status page and an email-subscription pipeline where 5-minute-late detection is acceptable to your operation. Calling these out explicitly is part of the triage: the rule is “3 of 12,” not “all 12.” The point of triage is to spend your monitor budget where it earns its keep, and to consciously accept the risk on the rest.

Patterns to avoid (honest about Velprove's primitive set)

Five patterns commonly recommended for third-party API monitoring do not fit Velprove's primitive set. Naming them is faster than pretending they are options.

No polling primitive. Velprove's multi-step API monitor runs each step exactly once in sequence, then records the result. There is no “keep hitting this endpoint until X” option, no retry-until-success loop, no condition-wait. The replacement is the monitor interval itself. If you need 30-second granularity, set the interval to 30 seconds on the Pro plan.

No time-relative assertion type. The six assertions Velprove supports are status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. There is no “assert this timestamp field is within the last 60 seconds” primitive. The replacement is your endpoint computing freshness server-side and returning 200 or 503, or a json_path assertion against a static expected value.

No percentile latency thresholds. response_time_ms is a per-request budget, not a p95 or p99 aggregate. The replacement is to set a per-request threshold that allows for some single-request noise, and to configure your alert rule to fire on N consecutive failures. The same goal (catch sustained slowdown, not single slow requests) is met by the consecutive-failure rule.

No inbound webhook receiver. As described in worked example 3, Velprove does not host an endpoint that catches third-party webhooks. The replacement is trigger-and-check-effect: two monitors that compose, where the second asserts on your own application state after the vendor's webhook has had time to fire.

No distributed tracing, no RUM. Velprove is the outside-in synthetic layer. APM tracing (Datadog, Honeycomb) and Real User Monitoring (Splunk, LogicMonitor) are complementary categories, not replacements. The right view of the dependency call from inside your application is the trace; the right view from outside is the synthetic.
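The consecutive-failure rule mentioned under percentile latency is simple enough to state as code. The Python below is a sketch of the idea only, not Velprove's alerting implementation; the function name and the list-of-booleans input are assumptions.

```python
def should_alert(results, threshold):
    """True when the most recent `threshold` check results are all failures.

    results is ordered oldest to newest; True means the check passed.
    """
    if threshold <= 0:
        raise ValueError("threshold must be a positive integer")
    recent = results[-threshold:]
    # Require a full window so a single early failure cannot trigger an alert.
    return len(recent) == threshold and not any(recent)
```

With a threshold of 3 on a 5-minute interval, one slow request is ignored and a sustained slowdown alerts after the third consecutive failed check.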

One final note on alert channels. Today, Velprove's alert channels are email (every plan); Slack, Discord, webhook, and Microsoft Teams (Starter and above); and PagerDuty (Pro). There is no mobile push channel on any plan today. Plan your alert routing around what exists, not what should exist. The opposite-prescription view of dependency monitoring lives in the inverse guide: why your own /healthz should NOT deep-probe these dependencies inside a liveness probe. The complement holds: synthetic-from-outside, plain-liveness-from-inside.

ThousandEyes' analysis of the October 20 2025 AWS incident captures the recovery-shape implication well: “Recovery timelines are sums of dependent phases, not parallel operations.” The triage rule above tells you which vendors to monitor. The recovery shape tells you why your incident playbook should keep the monitor running through the all-clear: the vendor's status page going green is the first phase, not the last.

Frequently Asked Questions

How do I decide which of my third-party dependencies to monitor synthetically?

Score every vendor on three axes from 0 to 3. Blast-radius is what breaks for your customers when the vendor goes dark (payments 3, weekly report 0). Revenue-attribution is what dollars stop (checkout 3, marketing email 1). Vendor-status-lag-history is how late the vendor posted its last three incidents (within 5 minutes 0, 30+ minutes 3). Sum the axes. Vendors scoring 7 or higher belong in the top 3. Vendors scoring 4 to 6 belong in the next 5, with the vendor status page as secondary signal. Vendors scoring 3 or under sit in the bottom 4: you consciously accept that you find out late. Revisit the list every quarter when vendors and traffic patterns change.
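The scoring rule above is small enough to keep in a script next to your vendor list. The sketch below is one way to express it; the `VendorScore` class, the bucket labels, and the example vendors and their scores are illustrative assumptions, not part of any Velprove feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorScore:
    name: str
    blast_radius: int  # 0-3: what breaks for customers when the vendor goes dark
    revenue: int       # 0-3: what dollars stop
    status_lag: int    # 0-3: how late the vendor posted its recent incidents

    def total(self) -> int:
        axes = (self.blast_radius, self.revenue, self.status_lag)
        if any(not 0 <= axis <= 3 for axis in axes):
            raise ValueError(f"{self.name}: every axis must be between 0 and 3")
        return sum(axes)


def bucket(score: VendorScore) -> str:
    """Map a summed score onto the three triage buckets described above."""
    total = score.total()
    if total >= 7:
        return "top"
    if total >= 4:
        return "middle"
    return "bottom"


# Hypothetical vendors and scores, for illustration only.
vendors = [
    VendorScore("payments", blast_radius=3, revenue=3, status_lag=2),
    VendorScore("transactional-email", blast_radius=2, revenue=1, status_lag=2),
    VendorScore("weekly-report", blast_radius=0, revenue=0, status_lag=1),
]
for vendor in sorted(vendors, key=VendorScore.total, reverse=True):
    print(vendor.name, vendor.total(), bucket(vendor))
```

Keeping the scores in code makes the quarterly revisit a diff rather than a meeting.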

What is the realistic monthly cost of running a synthetic monitor per third-party vendor?

On Velprove's Free plan, three synthetic monitors on your top-3 vendors cost $0 per month, assuming your total monitor count stays under the 10-monitor Free cap. Free includes 5-minute intervals, multi-step API monitors up to 3 steps, 1 browser login monitor (every 15 minutes), all six assertion types, and email alerts, with commercial use allowed. Starter at $19 per month unlocks 1-minute intervals plus Slack, Discord, webhook, and Teams channels. PagerDuty ships on Pro at $49 per month. By comparison, Datadog Synthetic prices per-test-per-region: three vendor synthetics from three regions run into low three figures monthly at Datadog's current list price.

How do I know when a vendor's status page is reliable enough that I do not need my own monitor?

Pull the vendor's status page incident history and look at the last three incidents. Measure the gap between impact start (usually disclosed in the Resolved update) and the first Investigating update. If all three incidents posted within 5 minutes of impact, the status page is acceptable as a secondary signal: you can lean on it instead of running your own probe. If any one of the three incidents posted 30 minutes or more after impact, the status page is not the monitor. Stripe sits in the first bucket. AWS, Slack, and most long-tail SaaS sit in the second. Status page subscriptions are still useful for downstream context, even for vendors in the second bucket. They just are not the primary detection instrument.
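The lag check above can be written down as a tiny function so the verdict is repeatable. This is a sketch under stated assumptions: the function name and verdict labels are hypothetical, and the "judgment-call" branch covers the gap between 5 and 30 minutes that the rule above does not address.

```python
from datetime import datetime, timedelta

FAST = timedelta(minutes=5)
SLOW = timedelta(minutes=30)


def status_page_verdict(incidents):
    """incidents: (impact_start, first_investigating_update) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident to judge the status page")
    lags = [first_update - impact_start for impact_start, first_update in incidents]
    if any(lag >= SLOW for lag in lags):
        return "not-the-monitor"
    if all(lag <= FAST for lag in lags):
        return "lean-on-status-page"
    # Assumption: lags between 5 and 30 minutes are left to your own judgment.
    return "judgment-call"
```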

How do I monitor a vendor whose API is bursty and noisy on the happy path?

Bursty vendors generate single-request slow responses that are not real incidents. The per-request response_time_ms assertion is per-request, not an aggregate, so a single slow response will trip a raw threshold. The fix is two configuration choices. First, set response_time_ms to a threshold that allows for some single-request noise (often 2x or 3x the observed p50 from your own client telemetry). Second, configure the monitor's alert rule to fire on N consecutive failures instead of a single failure. Three consecutive 5-minute checks failing is 10 to 15 minutes of sustained degradation, which is the signal you actually want. The consecutive-failure rule is available on every plan.
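To see why the two choices work together, here is a minimal local model of the logic. It is not Velprove's implementation; the function names and the default multiplier are assumptions chosen to mirror the paragraph above.

```python
def latency_threshold_ms(observed_p50_ms: float, multiplier: float = 3.0) -> float:
    """Per-request budget sized as a multiple of your own observed p50."""
    return observed_p50_ms * multiplier


def should_alert(results: list[bool], consecutive: int) -> bool:
    """results is oldest-first; True means the check passed.

    Alert only when the most recent `consecutive` checks all failed.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(results) < consecutive:
        return False
    return not any(results[-consecutive:])


# One slow request in the middle of healthy checks does not alert.
print(should_alert([True, False, True, True], consecutive=3))
# Three failures in a row does.
print(should_alert([True, False, False, False], consecutive=3))
```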

How do I monitor a vendor whose only outage signal is a webhook I cannot receive in Velprove?

Velprove does not host an inbound webhook receiver. We cannot accept the vendor's POST and parse the payload. The pattern that works is trigger-and-check-effect with two composed monitors. Monitor A is an API monitor that triggers the workflow on the vendor side (POST a test checkout, dispatch a test SMS, kick off a test build). Monitor B is an API monitor that hits your own application endpoint (/api/canary/whatever) on the next monitor interval, with a json_path assertion against a static expected value. Your endpoint records the most recent webhook-driven state server-side and returns 200 with the static JSON when the webhook arrived, or 503 when it did not. The deeper version with the Stripe checkout shape lives in monitor Stripe webhooks.

Can I do this on the free plan?

Yes. Velprove's Free plan includes 10 monitors at a 5-minute interval, multi-step API monitors up to 3 steps, 1 browser login monitor (every 15 minutes), HTTP and API monitors with all six assertion types, and email alerts. Three synthetic API monitors on your top-3 vendors fit inside Free as long as your overall monitor count stays under 10. No credit card. Commercial use allowed.

The GitHub Actions May 2026 Degradation: A Detection-Time Teardown

velprove — Sat, 23 May 2026 14:00:03 +0000


Monitor a Railway App: Sleep, Private Net, Cron Services

velprove — Wed, 20 May 2026 14:00:02 +0000

TL;DR: To monitor a Railway app properly, you have to probe it from outside Railway. Railway's own healthcheck only runs at deploy time and explicitly is not for continuous monitoring. Its sleep timer is outbound-driven so an external probe does not keep a service awake, services on *.railway.internal are unreachable from the public internet, and a cron service has no native did-not-fire alert. Four Velprove patterns close those gaps: a public HTTP monitor on the web service, a /deps probe for private services, a heartbeat probe for crons, and a browser login monitor on the real signed-in path. Free-tier Railway spin-down economics are a separate hosting decision covered in the indie-hacker free-stack guide. Every monitor probes from one of 5 global regions you pick, on the Velprove free plan. No credit card required.

Why Railway's native healthcheck is not your uptime monitor

Railway has a built-in healthcheck and it is not what most people assume. Railway's healthchecks reference reads: "The healthcheck endpoint is currently not used for continuous monitoring as it is only called at the start of the deployment, to ensure it is healthy prior to routing traffic to it." That single sentence is the entire reason this post exists. Railway itself is telling you the native healthcheck is a deploy-time gate, not a runtime alert.

The mechanic is precise. When a new deployment is triggered, Railway repeatedly queries the configured healthcheck endpoint until it receives an HTTP 200, then activates the deployment and starts routing traffic. The default timeout is 300 seconds. The probe originates from healthcheck.railway.app, which matters only if you have host-restricted access on that route. Once the new instance is live, Railway stops calling the endpoint. It does not come back later to confirm the instance is still answering.

Two consequences follow. First, an instance that passes its deploy-time check and then stops responding an hour later produces zero signal from the native healthcheck, because the native healthcheck is not watching anymore. Second, an endpoint that returns 200 with an empty body or a stale cached error page will satisfy the gate just as easily as a real, working endpoint will. Deploy-time gating and runtime monitoring are two different problems, and Railway covers the first one. The second one is an external probe's job. This post is about the platform surface a 200 OK on your public URL cannot see.

Service sleep is outbound-driven, not inbound-idle

This is the Railway fact drafters get wrong most often, so it goes early. Railway's app-sleeping docs state the rule verbatim: "For Railway to put a service to sleep, a service must not send outbound traffic for at least 10 minutes." And, also verbatim: "Inbound traffic is excluded from considering when to sleep a service." The clock that triggers sleep is outbound. The clock has nothing to do with how many requests arrive at your service.

What counts as outbound is broader than most people expect. Telemetry pushed to a logging or APM service, database connection pool keepalives, NTP queries, requests to another service in the same project over the private network, and external API calls all count as outbound traffic that keeps the service awake. What does not count is anything arriving at your service, including a request from a customer's browser and a probe from an external monitor.

That last point is the one to internalize: a Velprove HTTP probe is inbound traffic from Railway's perspective, so it does not keep your Railway service awake. A probe will wake a slept service on the first hit, because Railway wakes a slept service on any inbound request from the internet or from another service in the same project over the private network. But it will not prevent the next sleep. Ten minutes after your service's last outbound packet, it sleeps again, regardless of how often the probe arrives. If you need the service awake, do that with outbound activity originating inside the service: a periodic outbound heartbeat to a logging or telemetry endpoint is the honest way. Treat "monitoring keeps my service warm" as a trap on Railway, because the mechanic runs in the opposite direction from inbound-idle platforms.

Private services on `*.railway.internal` are invisible from outside

Railway exposes a private network between services in the same project. Railway's private-networking docs define the hostname pattern: "<service-name>.railway.internal. For example, a service named api would be reachable at api.railway.internal." The transport is an encrypted Wireguard mesh, which is why the docs consider HTTP over the mesh acceptable: the tunnel itself is encrypted. Isolation is per-project and per-environment, so services in a different project or a different environment cannot resolve your railway.internal hostnames at all.

The load-bearing consequence for monitoring is structural. *.railway.internal is unreachable from the public internet by design. An external probe, including a Velprove monitor running from any of the 5 global regions, cannot resolve those names and cannot reach those ports. There is no flag to flip and no header to add. The private network is private. That is the whole feature.

The pattern that works is a public companion route. Pick one of the public web services in the project and add a route, conventionally /deps, that exercises the actual dependency call you care about and returns 200 only if the private service responded correctly. A Velprove HTTP monitor against /deps then gives you an external signal for a structurally internal service. Frame this honestly to yourself: /deps is a userland convention, not a Railway primitive. The probe is watching the public companion, and the companion is watching the private dependency. If the companion lies or stops being deployed, the probe lies too. Keep the route's implementation small and obvious, and put the same Wireguard-side call your real code uses behind it.
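A minimal sketch of such a companion route, using only the Python standard library, might look like the following. The private URL, port, and function names are assumptions for illustration; swap in the call your real code makes over the private network.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable

# Hypothetical private URL; substitute your real service name, port, and path.
PRIVATE_HEALTH_URL = "http://api.railway.internal:8080/health"


def check_private_api() -> bool:
    """Call the private dependency the same way the real code would."""
    with urllib.request.urlopen(PRIVATE_HEALTH_URL, timeout=2) as response:
        return response.status == 200


def deps_status(check: Callable[[], bool]) -> int:
    """Return 200 only when the dependency check succeeds; otherwise 503."""
    try:
        return 200 if check() else 503
    except Exception:
        # Timeouts, DNS failures, and HTTP errors all mean "not healthy".
        return 503


class DepsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status = deps_status(check_private_api) if self.path == "/deps" else 404
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the companion route on the public service (blocks forever)."""
    HTTPServer(("", port), DepsHandler).serve_forever()
```

Calling `serve()` from the public service's entry point exposes /deps. Keeping `deps_status` separate from the handler means the pass/fail mapping can be unit-tested without a network.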

Cron-as-service has no didn't-fire alert

Railway crons are not a separate service type. They are a setting on a normal service. Railway's cron-jobs reference describes the model: you define a 5-field crontab string, in UTC, on a service's settings. On schedule, Railway invokes the service's start command. The service is expected to do its work and terminate. The minimum frequency is every 5 minutes. On Render, a cron is its own separate billable service; on Railway it is a setting on a normal service, and that difference shapes everything downstream (see the Render platform-layer guide for the contrast).

The concurrency rule is the one that bites silently. Railway's docs state, verbatim: "If a previous execution is still running when the next scheduled execution is due, Railway will skip the new cron job." That is the silent-skip failure mode. A cron that hangs once because a downstream API is slow can quietly suppress every subsequent run until you notice the data is stale. From outside Railway, a hung cron and a skipped cron look identical: nothing happened. No surfaced email, no surfaced webhook for "the next run did not start."

The pattern that works is a heartbeat. The cron writes a timestamp on real success into Postgres or a key-value store, after the work is durably done, not on entry. A small companion web service reads that timestamp, computes its age, and returns 503 when the cron has been quiet longer than its expected cadence plus a grace window, 200 otherwise. A Velprove HTTP monitor against /last-run/<job> asserts status_code = 200. The endpoint flips to 503 the moment the cron goes stale, and the monitor catches that within one probe interval.

Velprove does not receive passive heartbeats from your cron. The freshness logic lives on your /last-run/<job> endpoint, on your service, and Velprove asserts the status code from outside. A static body_contains assertion that looks for today's date does not work for this; the monitor stores whatever string you type once at setup, then keeps asserting that stale value forever. Let the endpoint compute freshness server-side, and let the status code carry the signal.
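The server-side freshness computation is a few lines. The sketch below shows only the decision the /last-run/<job> endpoint makes; how the timestamp is stored and read is left out, and the function name and parameters are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def last_run_status(
    last_success: Optional[datetime],
    now: datetime,
    cadence: timedelta,
    grace: timedelta,
) -> int:
    """Return 200 while the cron is fresh, 503 once it has gone stale."""
    if last_success is None:
        # The cron has never recorded a success: treat it as stale.
        return 503
    age = now - last_success
    return 503 if age > cadence + grace else 200


# Example: an hourly cron with a 10-minute grace window.
now = datetime(2026, 5, 20, 14, 0, tzinfo=timezone.utc)
print(last_run_status(now - timedelta(minutes=50), now, timedelta(hours=1), timedelta(minutes=10)))
print(last_run_status(now - timedelta(minutes=90), now, timedelta(hours=1), timedelta(minutes=10)))
```

Passing `now` in explicitly keeps the function deterministic under test; the real endpoint would pass `datetime.now(timezone.utc)`.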

Verify your deploy actually came up: the `/version` SHA pattern

Railway auto-deploys on every push to the connected branch by default. A green deploy in the Railway dashboard means the native healthcheck returned 200 once at activation. It does not mean the build that came up is the build you intended, and it does not mean the build is still serving correct responses now.

The cheap fix is a /version endpoint that returns the current git SHA, wired from an environment variable Railway sets at build time. Two ways to assert it with Velprove, and the right one depends on where you want the SHA comparison to live.

The recommended form is a multi-step API monitor: Step 1 hits /version and captures $.build_sha into a variable, Step 2 calls a second route that compares its own runtime SHA against the captured value and returns non-2xx on mismatch. The comparison lives server-side in your app, the monitor just orchestrates, and the setup survives every future deploy unchanged. Available on every plan including free up to 3 steps.
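The app-side half of that flow can be sketched as two small functions. The `BUILD_SHA` variable name, the function names, and the 400 and 409 status codes are assumptions for illustration; the only requirement from the pattern is that a mismatch returns non-2xx.

```python
import os
from typing import Optional


def runtime_sha() -> str:
    # BUILD_SHA is a hypothetical name; use whichever variable your build exposes.
    return os.environ.get("BUILD_SHA", "unknown")


def version_body() -> dict:
    """Body for /version; step 1 captures $.build_sha from this."""
    return {"build_sha": runtime_sha()}


def sha_check_status(expected_sha: Optional[str], actual_sha: str) -> int:
    """Status for the step 2 route: 200 on match, non-2xx otherwise."""
    if not expected_sha:
        return 400  # the monitor did not send a captured value
    return 200 if expected_sha == actual_sha else 409
```

If both routes are served by the same instance the comparison always matches, so the check earns its keep when step 2 can land on a different instance or service than step 1.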

The shorter setup is a plain HTTP monitor with body_contains set to the SHA your build just produced. It works for the current deploy and stales on the next, because the deployed app starts returning the new SHA while the monitor keeps asserting the old one. Use this form only when your CI/CD pipeline updates the assertion on every deploy via Velprove's PUT /api/checks/<id> API. When a deploy reports green but serves a stale or wrong build, either assertion fails and the monitor pages you. Velprove does not provide a native deploy-skew detector; your /version assertion is the detector.

The full multi-step capture-and-assert flow, including the X-Expected-SHA header variant for capturing a value from one step and asserting it in the next, is already walked through in the multi-step build_sha pattern in the API health-check guide. If multi-step is new, the same flow framed for API teams is in the multi-step API monitoring walkthrough. This section is the Railway-specific framing on top of that pattern, not a re-derivation of it.

The four Railway monitors in Velprove (free plan)

Put the patterns above together and the Railway-side coverage lands in four concrete monitors. All four fit inside the Velprove free plan: 10 monitors total, a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, and 1 status page. Each monitor probes from one of 5 global regions you pick at setup time. If your Railway service runs Next.js, pair this set with how to monitor a Next.js app in production for the render-layer half.

(a) Public web-service HTTP probe. A plain HTTP monitor against your public Railway URL, or its public custom domain, asserting status_code = 200 and a body_contains rule on a static string that only your real app emits (a footer tagline, a known marker in the HTML). The body_contains rule keeps a cached gateway error page that happens to return 200 from passing. Set the interval to 5 minutes on free or 1 minute on a paid plan, and pick whichever of the 5 global regions is closest to your real customers.

(b) Private-service /deps probe. You cannot point a Velprove monitor at db.railway.internal or worker.railway.internal, because those names resolve only inside your project's Wireguard mesh. Expose a /deps route on a public service in the same project that calls the private dependency and returns 200 only on real success. Point a Velprove HTTP monitor at /deps, assert status_code = 200, and the private service becomes externally observable without giving it a public surface.

(c) Cron heartbeat probe. On the companion route that reports cron freshness, set up an HTTP monitor against /last-run/<job> asserting status_code = 200. The endpoint returns 503 when the cron has gone stale, so a 200 is the whole check. Match the probe interval to the cron cadence: a 5-minute cron is comfortable on a 5-minute probe; a daily cron is comfortable on a slower probe with a generous grace window. The detection lag is bounded by your probe interval, not Railway's cron minimum.

(d) Browser login monitor on the signed-in path. The three monitors above prove the platform's pieces are alive. They do not prove a real user can sign in and see their data. The browser login monitor opens a real browser, signs in as a dedicated low-privilege test user, follows the post-login redirect, and asserts the landing page looks right. By default it verifies success by confirming the URL changed; that catches a login that fails outright but not a login that lands on an empty shell because the database read behind it silently failed. Under Customize detection, switch Success verification from the default URL-change to "Page contains text" or "Element is visible", and set it to a string or selector that only renders when a real database read returned data: a customer name, an invoice ID, a known plan label. This is the clearest case of when a browser monitor beats an HTTP probe. Use a dedicated test account, never real admin credentials. The free plan includes one browser login monitor at a 15-minute interval, which is enough to catch a multi-hour database-backed outage and a login regression inside one window.

No credit card required. The set lands on free and stays on free unless you want sub-5-minute intervals or more than one browser login monitor.

The honest probe-cost tradeoff on Railway

Probes cost request volume on your service, not Railway pricing dollars in this post's frame. The math is easy. A 1-minute probe from a single region hits your endpoint about once per minute, which is 1,440 per day, which is roughly 43,200 requests per month at that single endpoint. A 5-minute cron-heartbeat probe from a single region is about 288 per day, roughly 8,640 per month. Both numbers are small relative to any real traffic, but they are not zero, and they are the load you are adding by deciding to probe continuously.
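The arithmetic generalizes to any interval and region count. A small helper, assuming a 30-day month as the figures above do (the function name is hypothetical):

```python
def probes_per_month(interval_seconds: int, regions: int = 1, days: int = 30) -> int:
    """Requests a continuous probe adds to one endpoint over `days` days."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    seconds_per_day = 86_400
    return seconds_per_day * days * regions // interval_seconds


print(probes_per_month(60))   # 1-minute probe, one region
print(probes_per_month(300))  # 5-minute probe, one region
```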

The sane default for Railway on the Velprove free plan is HTTP probes at 300-second (5-minute) intervals, which is what the free plan includes. That is enough to catch a multi-minute outage and small enough to stay invisible on any real Railway service's billing. If you need the 1-minute interval, you need it for the customer-facing paths where one minute of detection lag is one minute of silent revenue loss, not for the cron heartbeat that fires hourly anyway.

The same probe-cost discipline applies across the Platform sibling guides: Render, Vercel, and Cloudflare Workers and Pages carry the same four-pattern shape, with platform-specific plumbing under each pattern.

Frequently Asked Questions

Does Velprove keep my Railway service from sleeping?

No. Railway's sleep timer is outbound-driven. Per Railway's app-sleeping docs, a service goes to sleep when it has not sent outbound traffic for at least 10 minutes, and inbound traffic is explicitly excluded from that decision. A Velprove HTTP probe arrives at your service as inbound traffic, so it does not reset the sleep clock. It will wake a slept service on the first request, then sleep again 10 minutes after your service stops sending outbound traffic. If you need the service awake, do that with outbound activity inside the service, not with an external monitor.

How do I monitor a Railway service that only runs on `railway.internal`?

You cannot reach *.railway.internal from outside. The private network is a Wireguard mesh scoped to a single project and environment, structurally unreachable from the public internet. The working pattern is a public companion route, for example /deps on a public web service in the same project, that exercises the internal call and returns 200 only if the private service responded. A Velprove HTTP monitor against /deps, probing from one of 5 global regions you pick, then tells you when the private service stops answering.

How do I detect a Railway cron that did not fire?

Heartbeat pattern. The cron writes a timestamp on success into Postgres or a key-value store. A small companion web service reads the timestamp, computes its age, and returns 503 when the cron has gone stale, 200 otherwise. A Velprove HTTP monitor asserts status_code = 200 on that endpoint. Railway's cron docs state that if a previous execution is still running when the next scheduled run is due, Railway skips the new run, so a hung cron looks identical to a missing cron from outside. Velprove does not receive passive heartbeats, so the freshness lives on your endpoint and Velprove asserts the status code.

Does Railway alert me when a deploy serves the wrong build SHA?

No. Railway's native healthcheck only gates the deploy at activation time, not its content afterwards. Expose /version returning the git SHA from a build-time environment variable, then assert it with Velprove. The recommended form is a multi-step API monitor: Step 1 captures $.build_sha from /version into a variable, Step 2 hits a second route that compares its own runtime SHA against the captured value and returns non-2xx on mismatch. The comparison lives in your app, the monitor just orchestrates, and the setup survives every deploy unchanged. The lighter alternative is a plain HTTP monitor with body_contains set to the current SHA, but body_contains goes stale on your next deploy unless your CI/CD updates it via Velprove's PUT /api/checks/<id> API. When a deploy reports green but serves a stale or wrong build, either assertion fails and the monitor pages you.

Is Railway's native healthcheck enough for uptime monitoring?

No, and Railway says so. The healthchecks reference page states, verbatim: "The healthcheck endpoint is currently not used for continuous monitoring as it is only called at the start of the deployment, to ensure it is healthy prior to routing traffic to it." It is a deploy-time gate that lets a new instance start receiving traffic once it returns 200, not a runtime alert that fires when the instance later stops responding. Continuous uptime needs an external probe.

What is the cheapest way to monitor a Railway app?

The Velprove free plan. It covers 10 monitors total, a 5-minute HTTP interval, one browser login monitor at a 15-minute interval, multi-step API monitors up to 3 steps, email alerts, and 1 status page, with each monitor probing from one of 5 global regions you pick. That is enough to land a public HTTP monitor on your web service, a /deps monitor on a private dependency, a heartbeat monitor on a cron, and one browser login monitor on the signed-in path. Start with the free plan. No credit card required.

Monitor AI App Uptime When OpenAI or Anthropic Degrades

velprove — Tue, 19 May 2026 14:00:04 +0000

Bottom line: If your core feature is an AI call, a degraded provider is a degraded product, and a naive ping of api.openai.com will mostly return 200 while your users watch the feature fail. The check that actually catches this is a multi-step API monitor that signs in and calls your own in-app AI endpoint, asserting the response shape, the HTTP status, and a response_time_ms budget, paired with a browser login monitor that signs in as a real user and confirms the AI feature actually rendered. Velprove proves your AI endpoint responded fast and in the expected shape. It does not and cannot judge whether the model's answer was good or correct.

The 15 hours OpenAI's API ran at 75%

On June 9 and 10, 2024, OpenAI had an incident that lasted roughly 15.5 hours. It was not a clean outage. According to OpenAI's June 2024 postmortem, "ChatGPT users experienced elevated error rates reaching ~35% errors at peak, while API users experienced error rates peaking at ~25%," and for the API, "Availability dropped to 75% during the incident." The root cause was mundane: "a daily scheduled system update inadvertently restarted the network management service (systemd-networkd) on affected nodes, causing a conflict with a networking agent."

Read the 75% number again, because it is the whole point of this post. The API was not down. It was up, and serving correct responses, roughly three times out of four, for fifteen hours. A monitor that asks "is api.openai.com reachable, does it return 200" would have passed most of the time, because most of the time it genuinely did. Meanwhile an app that calls that API on every user action, without retry-with-jitter, was surfacing roughly one in four AI-feature requests as a failure to real users. The provider endpoint was nominally up. The product was not.

That gap did not show up at api.openai.com. It showed up inside your own AI feature, as elevated errors, latency blowout, 429s and 529s, or a stream that returned HTTP 200 and then died mid-completion. The general case of what an HTTP 200 misses is its own subject, covered in why uptime monitors miss real outages. This post is the AI-provider-specific case, and it has failure modes that the general catalogue does not.

How an LLM provider actually degrades to your app

When a provider degrades, it does not politely return a single clean error code. Here is the detectable surface, by primary source.

Latency blowout. The most common partial degradation is not an error at all. Time-to-first-token climbs, the call still completes, the status is still 200, and your users sit watching a spinner. This is invisible to a status-code check and visible to a latency assertion.

HTTP 429, which means two different things. This distinction is underused and it matters. A rate_limit_exceeded 429 means request frequency exceeded your account or tier limit. It is transient; retry with backoff. An insufficient_quota 429 means billing or credits are exhausted. It is not transient. Retrying never helps, and your AI feature is silently dead until you top up. No provider status page will ever show this, because it is account-scoped, not a provider outage.

5xx and timeouts during incidents. During the documented OpenAI incidents, traffic returned 500s, 503s, and timeouts. Anthropic's errors documentation lists "500 - api_error" and "504 - timeout_error" explicitly.

Anthropic 529, overloaded_error. Anthropic's docs define "529 - overloaded_error: The API is temporarily overloaded" and warn that "529 errors can occur when APIs experience high traffic across all users. In rare cases, if your organization has a sharp increase in usage, you might see 429 errors because of acceleration limits on the API." The error body shape is {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}, which is exactly the kind of shape you can assert against.

The AI-specific one: a stream that errors after a 200. Anthropic documents this directly: "When receiving a streaming response over SSE, it's possible that an error can occur after returning a 200 response, in which case error handling wouldn't follow these standard mechanisms." A status-code-only check passes, because the status really was 200, while the user gets a truncated or errored completion. This failure mode does not exist for a static REST endpoint, and it is the reason the rest of this post is not just the silent-outage argument with an LLM example. Providers can also serve a slower or lower-tier path under load; treat that qualitatively, as a latency signal, not as a documented behavior with a name.
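Inside your own app, the same distinctions can be drawn with a small classifier so that logs and alerts name the failure mode rather than just the status code. This is a sketch that matches on substrings of the raw body, the same way a body_not_contains rule would; the function name and return labels are assumptions.

```python
def classify_provider_failure(status: int, body: str) -> str:
    """Name the degradation mode from a provider response's status and raw body."""
    if status == 429:
        if "insufficient_quota" in body:
            return "quota-exhausted"  # not transient: retrying never helps
        return "rate-limited"  # transient: retry with backoff
    if status == 529 or "overloaded_error" in body:
        # Covers a 529 and an overloaded error delivered after a 200.
        return "overloaded"
    if status in (500, 503, 504):
        return "provider-error"
    if 200 <= status < 300:
        return "ok"
    return "other"
```

Latency blowout is deliberately absent here: it has no status or body signature, which is why it needs a separate response-time budget.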

What Velprove can and cannot see here

This goes before the setup on purpose, because if you misunderstand the boundary, you will build the wrong check and trust it for the wrong thing.

Velprove can prove your AI endpoint responded, with the expected JSON shape, the expected HTTP status, and within a latency budget. It cannot judge whether the model's answer was good, correct, relevant, or not hallucinated. It asserts that the AI feature responded correctly-shaped and fast, not that the answer was right. Output-quality, evals, and hallucination testing are a different tool category and explicitly out of scope. There is no semantic or answer-quality assertion in the product, and there is no way to construct one. The only assertion types that exist are status_code, body_contains, body_not_contains, json_path, response_time_ms, and header_contains. None of those reads meaning.

Here is the turn, and it is an honest one. Shape, latency, status, and the stream-error-after-200 case catch the overwhelming majority of provider-degradation incidents anyway, precisely because those failures change the shape, status, or latency of the response, not just its quality. A timeout is a latency failure. A 529 is a status failure. A quota-dead account is a body failure. A stalled stream is a completion-marker failure. The class of failure that a Velprove check genuinely cannot see, a confidently-worded but wrong answer returned fast and in the right shape, is real, but it is an evals problem, and conflating the two is how monitoring tools lose credibility. Velprove will not claim that ground.

The monitor: a multi-step API check on your own AI endpoint

The useful check is a Velprove multi-step API monitor pointed at your own AI endpoint, not a ping. It has two steps. Step one authenticates as a dedicated low-privilege synthetic test account. Step two calls your in-app AI endpoint with a fixed synthetic test prompt. Use a generic endpoint shape to make this concrete: POST /api/ai/generate returning { "answer": "..." }.

On step two, set these success conditions. A status_code assertion that catches 429, 500, 503, 504, and 529 (do not assert 200-only if your endpoint streams, see the next section). A response_time_ms threshold sized to your real time-to-first-token budget, because latency blowout is the partial degradation a status check misses entirely. A json_path assertion with the exists operator on the answer field, which catches a 200 wrapped around a body that is missing the answer field entirely, a malformed error-shape response. And two body_not_contains rules, one for overloaded_error and one for insufficient_quota, so the two failures that no provider status page will ever show fail your check loudly.
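Before wiring these conditions into a monitor, it can help to run the same logic against a few captured responses from your endpoint. The sketch below is a local approximation, not Velprove's implementation; the function name and failure labels are assumptions, and the `answer` field follows the generic endpoint shape used above.

```python
import json

FAILING_STATUSES = {429, 500, 503, 504, 529}
FORBIDDEN_SUBSTRINGS = ("overloaded_error", "insufficient_quota")


def check_ai_response(status: int, body: str, elapsed_ms: float, budget_ms: float) -> list[str]:
    """Return the list of conditions that failed; an empty list means pass."""
    failures = []
    if status in FAILING_STATUSES:
        failures.append("status_code")
    if elapsed_ms > budget_ms:
        failures.append("response_time_ms")
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    # Equivalent of a json_path "exists" check on the answer field.
    if not (isinstance(payload, dict) and "answer" in payload):
        failures.append("json_path:$.answer")
    for needle in FORBIDDEN_SUBSTRINGS:
        if needle in body:
            failures.append(f"body_not_contains:{needle}")
    return failures
```

Feeding it a saved 529 body or a deliberately slow response shows which condition fires for which failure before the monitor ever runs.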

I am deliberately not re-explaining how multi-step monitors chain requests, extract a token, and carry it forward. That is its own walkthrough; see the multi-step API monitoring guide for the mechanics and come back. The plan math here: the Free plan caps multi-step API monitors at 3 steps, so a 2-step check fits Free comfortably. Starter at 19 dollars lifts the cap to 5 steps and the interval to 1 minute; Pro at 49 dollars goes to 10 steps and a 30-second interval. Each monitor runs from one of 5 global regions; if you want regional coverage, run a separate monitor per region, because providers can degrade asymmetrically by region.

This is the same pattern our guide to monitoring Stripe API health applies to your payment provider: your app depends on a third-party API that degrades in ways the vendor status page does not show, so you monitor your own integration point synthetically rather than trusting the dependency to tell you. The dependency is different. The pattern is the same.

The stream that returns 200 and then dies

If your AI endpoint streams, a status_code assertion alone is not enough, and Anthropic's own docs are why. An error can occur after a 200 has already been returned. The HTTP status was 200. It was honestly 200. The stream then errored, stalled, or truncated, and your user got half an answer or a broken one. A monitor that only checks the status code records a pass and tells you everything is fine.

The fix is to assert on a stable end-of-stream marker, not on the status. If your endpoint emits a final structured event or sets a completion field once the full answer is assembled, assert it with a json_path or body_contains rule: for example, a done: true field, or a sentinel token your server only writes after the stream closes cleanly. A 200-then-broken stream will not contain that marker, so the check fails on exactly the failure a status-code check waves through. This is the single beat that separates monitoring an AI feature from monitoring any other endpoint, and it is the reason the boundary section above is honest rather than defensive: this failure changes the response shape, so Velprove can see it.
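As a sketch of what that marker assertion amounts to, the Python below scans a captured server-sent-events body for a final event carrying done: true. The event shapes, the done field, and the stream_completed name are assumptions for illustration. Your endpoint's completion marker may look different, and this is not how Velprove evaluates the rule internally.

```python
import json


def stream_completed(sse_body: str) -> bool:
    """True only if some `data:` event is a JSON object with done == true."""
    for line in sse_body.splitlines():
        if not line.startswith("data:"):
            continue
        raw = line[len("data:"):].strip()
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore non-JSON payloads
        if isinstance(event, dict) and event.get("done") is True:
            return True
    return False


# A stream that closed cleanly (hypothetical event shapes).
clean = 'data: {"delta": "Hel"}\n\ndata: {"delta": "lo"}\n\ndata: {"done": true}\n\n'
# A stream that returned 200, then errored before the marker was written.
broken = 'data: {"delta": "Hel"}\n\nevent: error\ndata: {"type": "overloaded_error"}\n\n'

print(stream_completed(clean))
print(stream_completed(broken))
```

Both samples would have arrived under an HTTP 200. Only the presence of the marker tells them apart.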

The browser login monitor: the AI feature as a signed-in user

The API check proves the endpoint answers correctly-shaped and fast. It does not prove a signed-in user can actually use the AI feature through your real interface, and that is where Velprove's strongest differentiator lives. A browser login monitor opens a real browser, signs in as the dedicated low-privilege test account, and verifies the post-login page rendered correctly. To make it watch the AI feature, sign it in as an account whose post-login landing page surfaces AI-dependent content, then open Customize detection and set Success verification to Page contains text, matching a string that only renders when a real AI result actually loaded. By default this monitor only checks that the URL changed after login, which would pass even if the AI content never rendered, so the default is not enough here. One honest limit: the browser login monitor logs in and checks a single success condition on the resulting page. It does not script clicking into a feature and submitting a prompt. Driving the AI endpoint itself is the multi-step API monitor's job, above.

This catches failures the API check structurally cannot. A front-end that swallows a 500 and shows a generic toast. A spinner that never resolves because the stream stalled client-side. An auth-gated AI route that the API check authenticated into directly but a real browser session cannot reach because a session or CSRF step broke. A client-side error boundary that renders an empty panel while the network tab shows a clean 200. The API monitor sees a healthy endpoint; the user sees a dead feature; the browser login monitor sees what the user sees.

Free includes 1 browser login monitor at a 15-minute interval, which is enough to catch a multi-hour provider degradation and a UI regression within one window. Starter includes 3 at a 10-minute interval, and Pro 10 at a 5-minute interval. Point it at the dedicated test account with the smallest permissions that still renders a real AI result, and use a fixed synthetic prompt, never real user data and never a prod-mutating or expensive call, because it runs on every interval.

Why your provider's status page is not the monitor

Start with the argument that is fully sourced and not a matter of timing at all. An OpenAI insufficient_quota 429 and a per-account acceleration-limit 429 will never appear on status.openai.com or status.anthropic.com, because they are not provider outages. They are account-scoped. The provider is fine. Your account is out of credits or over an acceleration limit, and a synthetic monitor of your own AI endpoint is the only thing that catches them, because there is no public incident to subscribe to.

Then the timing argument, kept qualitative, because no defensible minute-count exists. OpenAI's December 11, 2024 postmortem describes a control-plane cascade: from 3:16 PM PST to 7:38 PM PST, about 4 hours 22 minutes, after "a new telemetry service deployment that unintentionally overwhelmed the Kubernetes control plane." DNS caching held stale-but-working records for a while, which delayed when services visibly started failing, and OpenAI states plainly that "Remediation was very slow because of the locked out effect." You do not need an invented number to take the point: impact and provider-side acknowledgement and recovery are not the same clock. A monitor of your own endpoint runs on the impact clock. The status page runs on the acknowledgement clock.

This is not a substitute for error tracking or evals

One last honest boundary, because credibility is the only thing this post is selling. Velprove tells you fast that your AI endpoint stopped responding correctly-shaped, fast, and with the right status. It does not replace application error tracking, which owns the stack traces and the per-request diagnostics when you go to fix the failure. It does not replace model-output evals, which own whether the answers are actually any good. Three different layers, three different jobs. A synthetic uptime monitor is the layer that tells you the AI feature is failing for users right now, which is the layer most teams launching AI features do not have wired and the one a degraded provider exposes first. Keep the eval suite. Keep the error tracker. Add the monitor that watches the endpoint the way a user hits it.

Frequently Asked Questions

Can Velprove tell me if the AI gave a wrong or bad answer?

No. Velprove asserts that your AI endpoint responded, in the expected JSON shape, with the expected HTTP status, inside a latency budget. It does not judge whether the answer was correct, relevant, or hallucinated. That is an evals and output-quality tool category, and it is explicitly out of scope. What Velprove does instead is catch the failure modes that change the shape, status, or latency of the response: timeouts, 429, 500, 503, Anthropic 529, a response missing the answer field, and a stream that returned 200 and then died. Those cover the overwhelming majority of provider-degradation incidents.

How do I monitor my AI feature without sending real user data to the model?

Use a dedicated low-privilege synthetic test account and a fixed synthetic test prompt that you control. Never send real user data through the monitor, and never point it at a prod-mutating or expensive AI call. The prompt should be short, deterministic in shape, and cheap, because it runs on every probe interval. The point of the check is to prove the endpoint responds correctly-shaped and fast, not to exercise real customer content.

Will an uptime monitor catch an OpenAI or Anthropic outage before their status page does?

It catches the class of failures a provider status page structurally cannot show, because some of them are account-scoped, not provider outages. An OpenAI insufficient_quota 429 or a per-account acceleration-limit 429 will never appear on status.openai.com or status.anthropic.com because they are not platform incidents. A synthetic monitor of your own AI endpoint sees the impact where it actually lands, at your request, without waiting for the provider to detect, confirm, and post. OpenAI's own December 11, 2024 postmortem describes remediation that was very slow because of the locked out effect, which is a sourced way of saying impact and acknowledgement are not the same clock.

What does a 429 from OpenAI actually mean for my app?

Two different things. A rate_limit_exceeded 429 means request frequency exceeded your account or tier limit. It is transient, and retrying with backoff is the right response. An insufficient_quota 429 means billing or credits are exhausted. It is not transient. Retrying never helps, and no provider status page will ever show it because it is account-scoped. Assert body_not_contains on insufficient_quota so a quota-dead AI feature fails the check loudly instead of degrading silently until a customer notices.
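If you also want your own application code to treat the two 429s differently, a minimal sketch looks like this. It assumes the error body is a JSON object with an error object whose code or type field carries the identifier. The classify_429 name and the retry, alert, and unknown labels are hypothetical.

```python
import json


def classify_429(body: str) -> str:
    """Map a 429 error body to an action: 'retry', 'alert', or 'unknown'."""
    try:
        error = json.loads(body)["error"]
        code = error.get("code") or error.get("type")
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return "unknown"
    if code == "rate_limit_exceeded":
        return "retry"  # transient: back off and try again
    if code == "insufficient_quota":
        return "alert"  # not transient: retrying never helps
    return "unknown"


print(classify_429('{"error": {"code": "rate_limit_exceeded"}}'))
print(classify_429('{"error": {"code": "insufficient_quota"}}'))
```

The monitor-side rule stays the same either way: body_not_contains on insufficient_quota is what makes the non-transient case fail loudly.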

My AI endpoint returns 200 but the answer is cut off. Why doesn't my monitor catch it?

Streaming responses can return HTTP 200 and then error mid-stream. Anthropic documents this directly: when receiving a streaming response over SSE, an error can occur after a 200 response has already been returned. A status-code-only assertion passes because the status really was 200. Assert a stable end-of-stream or completion marker with a json_path or body_contains rule so a 200-then-broken-stream still fails the check.

Can I do this on the free plan?

Yes. A 2-step API monitor (authenticate, then call your AI endpoint) fits the Free plan, which caps multi-step API monitors at 3 steps. Free also includes 1 browser login monitor at a 15-minute interval and email alerts, with a 5-minute HTTP interval and commercial use allowed. Starter at 19 dollars lifts multi-step to 5 steps, drops the interval to 1 minute, and adds Slack, Discord, Teams, and webhook alerts. Pro at 49 dollars goes to 10 steps and a 30-second interval. Start with the free plan. No credit card required.

Which region does the AI endpoint check run from?

From any one of 5 global regions. Each monitor runs from a single region you pick, not all of them at once. If you want regional coverage of your AI endpoint, create separate monitors per region. This matters for AI features because a provider can degrade asymmetrically by region, and a single-region monitor only sees its own region's path.

Detecting a Hacked WordPress Site: Skimmers and Silent Defacement

velprove — Tue, 19 May 2026 14:00:03 +0000

The honest take: when an attacker injects a card skimmer, defaces a page, or hijacks your wp-admin, your host dashboard keeps returning a green HTTP 200, because the site is still being served, it is just serving the attacker's version. Velprove is not a security scanner and has no server access. Its browser login monitor signs in to your wp-admin in a real browser, and content assertions check the page a visitor actually loads, so it can flag that your expected checkout or admin markup is gone, or that a known bad marker appeared, as a fast external tripwire. It is complementary to server-side tools like Wordfence or Sucuri, not a replacement for them.

A card skimmer ran on a live store for weeks, and the uptime dashboard never blinked

In September 2024, Sucuri documented a WooCommerce credit card skimmer with a detail worth sitting with. As Sucuri put it in their September 12, 2024 writeup, "All the attackers did was simply edit the checkout page source, either from wp-admin (using a compromised administrator user) or directly through the database." The payload was not even a conspicuous <script> tag. It hid inside a <style> tag and executed through an onload handler once the page finished loading, with the skimmer body heavily obfuscated through custom character substitution and shuffling.

Now hold that next to what every availability check saw. The store was up. The checkout page returned a 200. The cart worked, the product pages loaded, the host status panel was green. A skimmer that quietly copies every card number a customer types can run for weeks this way, because nothing about availability changes: the site stays up and keeps returning 200 while the attacker collects card data on every order. If you run a store, the natural next step is dedicated WooCommerce checkout monitoring, and this post is about the adversarial half of that: not a checkout that broke, a checkout an attacker quietly rewrote.

Why an HTTP 200 dashboard is structurally blind to this

A status-code check asks one question: did the server answer? A compromised site answers fine. It returns a 200 with a fully rendered page, because the page is doing exactly what the attacker wants it to do. This is not the familiar "200 but the page is blank" problem where a build failed and the body is empty. This is a 200 serving a page an attacker now controls: the markup is present, it renders, it just contains a skimmer or a defacement or a login that now belongs to someone else. Availability monitoring is the wrong instrument for it, for the same structural reason uptime monitors miss outages like this. The fix is not a better status-code check. It is checking the content of the page a visitor actually loads.

The three ways a compromise shows up on a page a visitor can see

This post is about your site being compromised by an attacker (an injected skimmer, a silent defacement, a hijacked admin), not about a legitimate plugin update breaking your own site, which is a different problem. That sibling is about your site breaking itself when a good-faith update throws a fatal. This one is about the page doing precisely what an attacker intends while your host dashboard shows a green 200. Three patterns dominate the externally visible side of it.

Injected scripts and skimmers. The largest example on record is Balada Injector. Sucuri, in their campaign synopsis, wrote that "since 2017, we estimate that over one million WordPress websites have been infected by this campaign," and noted it "consistently ranks in the top 3 of the infections that we detect and clean." It typically enters through a vulnerable plugin: BleepingComputer reported that Sucuri detected it on over 17,000 WordPress sites in September 2023, more than 9,000 of them through one plugin XSS flaw. The injected code redirects visitors and adds backdoors. To a visitor it is a script that should not be there.

Silent defacement. The content or appearance of a page is changed without the site going down. A pricing page now reads differently, a banner appears that you did not put there, a section is replaced. The page still returns 200 and renders cleanly. Nothing about it is "down." It is just no longer your page.

Hijacked wp-admin. In Sucuri's 2023 hacked-website report, among the sites Sucuri remediated, "malicious WordPress admin users were found in 55.2% of infected databases," and SEO spam appeared on 42.22% of infected sites. That is a remediation-sample figure, not a rate across all WordPress sites, but the direction is clear: when an attacker gets in, control of the admin is a common outcome, and the externally visible consequence is a wp-admin login that no longer behaves the way it should.

What Velprove is not, and read this before you set anything up

This is the part that keeps this post honest, so it goes before the setup, not after it. Velprove is not a security scanner and it has no server access. It does not read your files, it does not scan for malware, it does not do file-integrity monitoring, and it is not a web application firewall or a vulnerability scanner. It sees exactly one thing: the page a visitor's browser receives from the outside.

That means Velprove detects the symptom, not the cause. It can tell you the expected checkout markup vanished, a known injected marker is present, or the wp-admin login is broken. It cannot tell you which plugin was vulnerable, which file was modified, or that a rogue admin row was written to the database. Server-side tools like Wordfence and Sucuri do that work, scanning files and doing integrity monitoring on the server itself. Velprove is the fast external tripwire that fires from outside the host you are trying to verify. The two are complementary, and the rest of this post is written on that understanding.

The setup: a positive DOM tripwire, a known-bad check, and a wp-admin login monitor

Three monitors cover the externally visible surface of a compromise. The first is the lead, and it is the one almost no uptime tool gives you on a free plan.

A browser login monitor on wp-admin. This is the strongest leg, and it is the differentiator. Velprove opens a real browser, navigates to your wp-login.php, signs in as a dedicated low-privilege test account, and asserts that the expected post-login admin markup rendered. When an attacker changes credentials, locks accounts out, or replaces the login flow, this monitor fails, which is the externally visible face of the hijacked-admin pattern and the 55.2% figure above. The mechanics are walked through in detail in the wp-admin browser login monitor guide. Use a dedicated low-privilege test account, never your real administrator credentials. One configuration detail here is load-bearing: in the monitor's Customize detection options, set Success verification to Page contains text and point it at a stable string only the real admin dashboard renders. Leave it on the default URL-change check, and a hijacked login that still redirects can pass while a visitor is seeing the attacker's page. The post-login markup assertion is what makes this monitor detect the symptom, not just that some redirect happened.

A positive body_contains tripwire. Add an HTTP monitor with a body_contains assertion on a stable piece of markup that must be present on a healthy page: a checkout form field name, an admin shell element, a distinctive footer string. If a defacement or a skimmer rewrites the page, that expected markup is the first thing to disappear, and the assertion fails. This is the robust play, because it does not require knowing the attacker's payload in advance.

A targeted body_not_contains check. Add a body_not_contains assertion on a specific known bad string only when you or your security tool have already identified one: a specific injected script source, a known exfiltration host, a known defacement banner string. This is a narrower secondary layer, not the primary defense.
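The two content assertions reduce to plain substring checks on the page body. The Python below is a local sketch of that logic, not Velprove's code. The check_page name, the billing_email field, the footer string, and the evil.example host are all made-up examples.

```python
def check_page(html: str, must_contain: list[str], must_not_contain: list[str]) -> list[str]:
    """Return human-readable failures; an empty list means the page looks intact."""
    failures = [f"missing: {s}" for s in must_contain if s not in html]
    failures += [f"found: {s}" for s in must_not_contain if s in html]
    return failures


# Hypothetical markers for a healthy checkout page.
EXPECTED = ['name="billing_email"', "Example Store Ltd"]
KNOWN_BAD = ["evil.example"]

healthy = '<form class="checkout"><input name="billing_email"></form><footer>Example Store Ltd</footer>'
defaced = "<h1>owned</h1>"
skimmed = healthy + '<script src="https://evil.example/x.js"></script>'

for page in (healthy, defaced, skimmed):
    print(check_page(page, EXPECTED, KNOWN_BAD))
```

The defaced page trips the positive check without any knowledge of the attacker's content. The skimmed page trips only the known-bad check, which is why that one is worth adding once a marker has been identified.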

One product fact to set expectations correctly: each Velprove monitor probes from a single region. You can choose which of the 5 global regions runs a given monitor, or run separate monitors per region if you want multi-region coverage. There is no "every check from all regions at once."

The honest limitation: a fixed substring cannot catch a rotating skimmer

Velprove's assertions are fixed substring matches, not regular expressions or pattern matching. body_contains checks whether an exact string is present, and body_not_contains checks whether an exact string is absent. That has a direct consequence you should know before you rely on it. Modern skimmers, as the Sucuri Woo skimmer writeup showed, obfuscate their payload with custom character substitution and rotate their exfiltration domains. A body_not_contains assertion catches only a known fixed string. The moment the attacker re-obfuscates or rotates, that exact string changes and the known-bad check goes silent.
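A few lines of Python show the limit plainly. This is an illustration of fixed-substring matching in general, with invented hostnames. It does not reproduce any real skimmer.

```python
def body_not_contains(body: str, needle: str) -> bool:
    """Passes (True) when the exact string is absent from the body."""
    return needle not in body


# The one fixed string a security tool identified (hypothetical host).
KNOWN_BAD = "cdn-stats.example/collect"

before_rotation = '<script src="https://cdn-stats.example/collect.js"></script>'
after_rotation = '<script src="https://img-cache.example/c.js"></script>'

# False: the assertion fails, so the monitor alerts.
print(body_not_contains(before_rotation, KNOWN_BAD))
# True: the assertion passes, although the page is still compromised.
print(body_not_contains(after_rotation, KNOWN_BAD))
```

Nothing about the check got worse between the two runs. The attacker changed one string, and an exact-match rule has no way to follow it.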

This is why the positive tripwire leads and the known-bad check is secondary. Asserting that your expected checkout or admin markup is present does not depend on predicting the attacker's payload. Most page replacements and many injection techniques disturb the legitimate markup, so a positive-presence assertion is the durable signal. The honest framing is: Velprove reliably tells you the expected page is no longer intact, and it can catch a specific known marker you already know about, but it is not a promise to catch every obfuscated or rotating payload. Anyone selling you that promise from the outside of your server is overclaiming.

Where this fits next to Wordfence and Sucuri

Wordfence and Sucuri are dedicated WordPress security platforms. They do server-side malware scanning and file-integrity monitoring: they read the files on your server and tell you when one changed in a way it should not have. That is real, important work, and Velprove does not do it and does not claim to.

Velprove sits in a different and narrower place. It watches the rendered page from outside the server, with no plugin and no server access, and fires fast when the page a visitor loads stops looking like your page. A server-side scanner answers "did a file on my server change?" Velprove answers "did the page my customer sees change?" Those are not the same question, and a real compromise often trips one before the other. The right posture is both: a server-side scanner for file and integrity coverage, an external content tripwire so you hear about the visible symptom quickly even from a network position the attacker does not control.

Set this up in the next ten minutes, free

You do not need a paid plan to put this in place. Velprove's free plan includes 10 monitors, one browser login monitor at a 15-minute interval, a 5-minute HTTP interval, email alerts, and a choice of 5 global regions, with no credit card required. That is enough for the wp-admin browser login monitor plus the body_contains positive tripwire and a targeted body_not_contains check on the same page.

The browser login monitor is the leg to set up first, because a hijacked admin is both common and the hardest of the three to notice on your own. Point it at your wp-login.php, sign in with a dedicated low-privilege test account, and let it run. Then layer the positive content tripwire on your checkout or a high-value page. None of it touches your server, and all of it runs from outside the host you are trying to trust. Pair it with a server-side scanner and you have covered both the file and the page. Start with the free plan. No credit card required.

Frequently Asked Questions

How do I detect if my WordPress site has been hacked?

From the outside, you watch the symptom on the page a real visitor loads: the expected checkout or admin markup is missing, a known injected marker appeared, or the wp-admin login no longer works. Velprove does this with a body_contains assertion on the markup that should be present, a body_not_contains assertion on a known bad string, and a browser login monitor that signs in to wp-admin. That is symptom detection. It does not replace a server-side scanner that reads your files for malware and file-integrity changes. Run both.

Does Velprove need a plugin or server access to detect a compromise?

No. Velprove is not a security scanner and has no access to your server, files, or database. It reads only the externally rendered page, the same one a visitor's browser receives. It cannot scan files for malware or do file-integrity monitoring, and it asks for nothing to be installed on the site. That boundary is the point: it is a fast external tripwire that runs independently of the host you are trying to verify, and it is complementary to, not a replacement for, a server-side security tool.

Can an uptime monitor catch a credit card skimmer on my WooCommerce checkout?

It can catch the symptom, with one honest caveat. A body_contains assertion that the expected checkout form markup is intact is the robust play, because a skimmer that tampers with the checkout often disturbs that markup. A body_not_contains assertion on a specific known bad string catches that exact string. The caveat: assertions are fixed substring matches, not patterns, and modern skimmers obfuscate and rotate their payload, so the positive presence check is the durable one and the known-bad check is a targeted secondary, not a promise to catch every variant.

Why does my uptime monitor still show green when my site is hacked?

Because the site is still being served. A status-code check asks whether the server answered, and a compromised site answers fine: it returns a 200 with a fully rendered page. The page is just doing exactly what the attacker wants. The skimmer collects cards, the defaced content is live, the hijacked admin is theirs, and every layer that only watches availability reports healthy. You need a check that inspects the content of the page a visitor actually loads, not just the status line.

Is Velprove a replacement for Wordfence or Sucuri?

No, and it is not trying to be. Wordfence and Sucuri are server-side WordPress security platforms that scan your files for malware and do file-integrity monitoring on the server. Velprove has no server access and does none of that. It watches the rendered page from the outside as a fast external tripwire. The two answer different questions: a server-side scanner tells you a file changed, Velprove tells you the page a visitor sees changed. Use both. They cover different layers of the same problem.

What should I assert to detect a defaced or skimmed page?

Lead with a positive tripwire. In Velprove, use a body_contains assertion on a stable piece of markup that must be present on a healthy page: a checkout form field name, an admin shell element, a footer string the attacker is unlikely to preserve when they replace the page. Then add a body_not_contains assertion on a specific known bad string only if you or your security tool have already identified one. The positive check is more robust because it does not depend on knowing the attacker's exact payload in advance, which with obfuscated and rotating skimmers you usually do not.

Can Velprove tell me if someone created a fake admin account?

Not directly, and being honest about that matters. Velprove has no server access and cannot read your WordPress user table, so it cannot tell you a rogue admin row exists. What it can catch is the symptom: a browser login monitor that signs in to wp-admin with a dedicated low-privilege test account will fail when an attacker has changed credentials, locked accounts out, or replaced the login flow. That is the externally visible consequence of a hijacked admin, not the database state itself. For the underlying account audit you still need a server-side tool.