DevHelm

Posted on Jun 19 • Originally published at devhelm.io

Synthetic Monitoring Best Practices: What to Monitor and How Often


Most synthetic monitoring setups fail in one of a few predictable ways. They monitor everything and alert on nothing useful. They assert on status code 200 and miss the empty response body. They run flaky browser checks that page someone at 2 AM for a problem that fixed itself by 2:01. Or they go stale: the checkout flow changed three months ago, and the check has been failing and ignored ever since.

These are not exotic failures. They are the default outcome of setting up synthetic monitoring without a discipline. Here is the discipline.

1. Monitor the journeys that cost money, not everything

Every browser check costs compute and, more importantly, costs maintenance. A check on a path that does not matter is worse than no check — it generates noise that trains your team to ignore alerts.

Rank your journeys by cost of silent failure and monitor the top of the list:

Authentication — login, signup. The gate to everything else.
The revenue path — checkout, upgrade, add payment method.
The core product action — the one thing your product exists to do.
Critical third-party handoffs — OAuth redirects, payment iframes, SSO.

Leave static pages, read-only endpoints, and admin screens to cheaper uptime and API checks. A good rule: if a path breaking would not generate a support ticket or lose revenue, it does not need a browser check.
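The rule of thumb above can be written down as a small filter-and-sort. This is only a sketch: the `Journey` shape, its field names, and the cost estimate are hypothetical and exist purely to illustrate the ranking.

```typescript
// Hypothetical shape for this sketch; the field names are not from any tool.
interface Journey {
  name: string;
  losesRevenue: boolean;            // would a silent break lose revenue?
  generatesSupportTickets: boolean; // would a silent break produce tickets?
  estimatedHourlyCostUsd: number;   // rough cost of one hour of silent failure
}

// Keep only journeys that pass the rule of thumb, most expensive first.
function browserCheckCandidates(journeys: Journey[]): Journey[] {
  return journeys
    .filter((j) => j.losesRevenue || j.generatesSupportTickets)
    .sort((a, b) => b.estimatedHourlyCostUsd - a.estimatedHourlyCostUsd);
}
```

Anything the filter drops is a candidate for a cheaper uptime or API check instead.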

2. Assert on what the user sees, not just the status code

The entire point of synthetic monitoring is catching the failure that a 200 OK hides. So your assertions have to go past the status code.

// Weak: passes even when the page renders an error
await page.goto("https://shop.example.com/checkout");
expect(page.url()).toContain("/checkout");

// Strong: asserts the user can actually complete the action
await page.getByRole("button", { name: "Pay now" }).click();
await expect(page.getByText("Order confirmed")).toBeVisible({
  timeout: 10000,
});
await expect(page.getByTestId("order-number")).not.toBeEmpty();

For API checks, the same principle applies: assert on the response body and JSON paths, not just the code. Check that data.user.role equals "admin", that the array is non-empty, that the token is present. A status code tells you the server answered; an assertion tells you it answered correctly.
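As a rough illustration of body-level assertions, the sketch below collects every failed assertion for an already-parsed response. The `ApiResult` type, the `data.items` array, and the top-level `token` field are assumptions made for the example; substitute the JSON paths your endpoint actually returns.

```typescript
interface ApiResult {
  status: number;
  body: any; // parsed JSON; the shape checked below is assumed for this sketch
}

// Returns a list of failed assertions; an empty list means the check passed.
function checkApiResult(res: ApiResult): string[] {
  const failures: string[] = [];
  if (res.status !== 200) {
    failures.push("status was " + res.status + ", expected 200");
  }
  if (res.body?.data?.user?.role !== "admin") {
    failures.push('data.user.role was not "admin"');
  }
  const items = res.body?.data?.items; // hypothetical array field
  if (!Array.isArray(items) || items.length === 0) {
    failures.push("data.items was missing or empty");
  }
  const token = res.body?.token; // hypothetical token field
  if (typeof token !== "string" || token.length === 0) {
    failures.push("token was missing");
  }
  return failures;
}
```

A response of `200 OK` with an empty JSON object passes the status assertion and still fails the other three, which is exactly the case a status-only check misses.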

3. Set the interval to your tolerance for silent failure

Your check interval sets your worst-case detection latency. A 5-minute interval means a broken deploy can bleed for up to five minutes before anything notices. For revenue-critical journeys, 30 seconds is a common target.

But faster is not automatically better, because interval drives cost. A browser check every 30 seconds from three regions is roughly 259,200 runs per month — for one check. On metered pricing that is real money, and a misconfigured 10-second check can produce a surprise four-figure bill. Match the interval to the journey: 30 seconds for the money path, 1–5 minutes for secondary flows, and reserve sub-30-second intervals for the handful of checks where every second of downtime is quantifiably expensive.
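The run count above is simple arithmetic, and it is worth computing before changing an interval. A minimal sketch, assuming a 30-day month (the function name is illustrative):

```typescript
// Number of check executions per month for one check.
function runsPerMonth(
  intervalSeconds: number,
  regions: number,
  daysInMonth: number = 30
): number {
  const secondsInMonth = daysInMonth * 24 * 60 * 60; // 2,592,000 for 30 days
  return (secondsInMonth / intervalSeconds) * regions;
}

const every30s = runsPerMonth(30, 3);  // 259200
const every10s = runsPerMonth(10, 3);  // 777600
const every5min = runsPerMonth(300, 3); // 25920
```

Multiplying the result by your vendor's per-run price gives the monthly cost; dropping from 30 seconds to 10 seconds triples it.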

4. Run checks from multiple regions

Failures are often regional. A CDN edge certificate expires in one region; DNS propagates unevenly; a deploy rolls out zone by zone; an SSL chain is misconfigured on one edge. A single-origin check is blind to all of these.

Run each critical check from at least two or three regions that match where your users are. Multi-region also disambiguates incidents: if a check fails from one region but passes from the others, you have a regional problem, not a global outage — a distinction that changes who you wake up and how hard you panic.
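The regional-versus-global distinction can be derived mechanically from per-region results. The sketch below assumes a hypothetical `RegionResult` shape and region names; it is one possible classification, not a description of any particular platform.

```typescript
type Scope = "healthy" | "regional" | "global";

// Hypothetical per-region outcome of a single check run.
interface RegionResult {
  region: string;
  passed: boolean;
}

// "global" only when every region failed; "regional" when some did.
function classifyFailure(results: RegionResult[]): Scope {
  const failed = results.filter((r) => !r.passed).length;
  if (failed === 0) return "healthy";
  return failed === results.length ? "global" : "regional";
}
```

With a single region, every failure is classified as global, which is the blindness the section describes.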

5. Engineer out flakiness before it trains your team to ignore alerts

A flaky check is worse than no check, because it teaches your team that the alert is noise. The three biggest sources of flakiness and their fixes:

Hard waits. Never waitForTimeout(3000). Wait for a condition — an element visible, a network response received, a URL reached. Conditional waits adapt to real timing; fixed sleeps race against it.
Single-sample failures. A genuine 30-second blip should not page anyone. Use confirm-on-failure: when a check fails, immediately re-run it (ideally from another region) before declaring an incident. This filters out most transient false positives and delays detection of a real outage by only the duration of one re-run.
Shared mutable state. Two checks that log in as the same user and mutate the same cart will trip over each other. Give each check its own isolated test account and idempotent steps.
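Confirm-on-failure is small enough to sketch in a few lines. The `RunCheck` runner and the region names here are hypothetical stand-ins for whatever actually executes your check; only the control flow is the point.

```typescript
type Verdict = "ok" | "unconfirmed" | "incident";

// Hypothetical runner: resolves true when the check passes from that region.
type RunCheck = (region: string) => Promise<boolean>;

async function confirmOnFailure(
  run: RunCheck,
  primaryRegion: string,
  confirmRegion: string
): Promise<Verdict> {
  if (await run(primaryRegion)) return "ok";
  // First sample failed: re-run immediately from another region before paging.
  const retryPassed = await run(confirmRegion);
  return retryPassed ? "unconfirmed" : "incident";
}
```

Only an `"incident"` verdict should page. An `"unconfirmed"` verdict is worth logging, since a failure that repeats in one region while the re-run passes elsewhere may be a regional problem rather than a blip.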

6. Keep checks as code, in version control

Synthetic checks are infrastructure, and infrastructure that lives only in a vendor's web UI rots. Define your checks as code — a Playwright spec, a YAML config — committed to your repository alongside the application they test.

The payoff is concrete: when a developer changes the checkout button's label, the check that depends on it is right there in the same pull request, so it gets updated in the same change instead of silently breaking in production. Config-as-code also gives you code review, history, and the ability to recreate your entire monitoring setup from scratch. This is the same monitoring-as-code discipline that keeps the rest of your reliability tooling honest.
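One benefit of checks living in the repository is that they can be linted like any other code. The schema and rules below are hypothetical, not any vendor's format; the two lint rules simply encode the interval and multi-region guidance from earlier sections.

```typescript
// Hypothetical schema for a check definition kept in the repository.
interface CheckDefinition {
  name: string;
  kind: "browser" | "api";
  intervalSeconds: number;
  regions: string[];
  critical: boolean;
}

const checks: CheckDefinition[] = [
  { name: "checkout", kind: "browser", intervalSeconds: 30,
    regions: ["us-east", "eu-west", "ap-south"], critical: true },
  { name: "reports-page", kind: "api", intervalSeconds: 300,
    regions: ["us-east"], critical: false },
];

// A lint step that could run in CI on every pull request.
function lintCheck(check: CheckDefinition): string[] {
  const problems: string[] = [];
  if (check.critical && check.regions.length < 2) {
    problems.push(check.name + ": critical checks need at least two regions");
  }
  if (check.intervalSeconds < 30) {
    problems.push(check.name + ": sub-30-second interval, confirm the cost is intended");
  }
  return problems;
}
```

Because the definitions are plain data, a reviewer sees an interval or region change in the diff before it reaches production.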

7. Use test data safely

Synthetic checks run against production, repeatedly, forever. That has consequences:

Use dedicated synthetic accounts, never a real customer's. Tag them so they are excluded from analytics and billing.
Make steps idempotent or self-cleaning. A checkout check that creates a real order every 30 seconds will pollute your data and possibly charge a real card. Use a test payment token and a path that does not commit real state, or clean up after each run.
Never hard-code real secrets in a check. Use the platform's secret storage; a check definition in Git must not leak credentials.
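For the last point, a common pattern is to have the platform inject secrets as environment variables and to fail fast when one is missing. This sketch assumes that injection model; the `readSecret` helper and the variable name in the comment are hypothetical.

```typescript
type Env = { [name: string]: string | undefined };

// Reads a required secret and fails loudly instead of running with a blank one.
function readSecret(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error("missing secret: " + name);
  }
  return value;
}

// In Node this would be called with the real environment, for example:
//   const password = readSecret(process.env, "SYNTHETIC_USER_PASSWORD");
```

The check definition committed to Git then contains only the variable name, never the value.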

8. Route alerts by severity and correlate with dependencies

Not every failed check deserves the same response. A failed checkout check is a wake-someone-up event; a failed check on a secondary report page is a business-hours ticket. Map check severity to routing so the right alerts reach the right channels — and tie it to your incident severity levels so the response is consistent.

Then correlate. A checkout check that fails because Stripe is degraded is a vendor incident, not your bug. Grouping dependent checks and subscribing to the relevant vendor status feeds means a third-party outage shows up next to your failing checks, so you spend the first five minutes fixing instead of diagnosing whose fault it is. That correlation is the difference between a low MTTR and a long one.
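Routing and correlation can be combined when the alert is built. The types, channel names, and vendor identifiers below are hypothetical; the sketch assumes you already have a list of vendors currently reporting degradation from their status feeds.

```typescript
type Severity = "critical" | "minor";

interface FailedCheck {
  name: string;
  severity: Severity;
  dependsOn: string[]; // vendors this journey relies on, e.g. "stripe"
}

interface Alert {
  channel: "pager" | "ticket";
  degradedDependencies: string[]; // non-empty suggests a vendor incident
}

function buildAlert(check: FailedCheck, degradedVendors: string[]): Alert {
  const degraded = check.dependsOn.filter(
    (vendor) => degradedVendors.indexOf(vendor) !== -1
  );
  return {
    channel: check.severity === "critical" ? "pager" : "ticket",
    degradedDependencies: degraded,
  };
}
```

A critical checkout failure still pages, but the responder opens the alert already knowing whether a dependency is reporting trouble.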

9. Treat checks as living code — they rot

The single most common failure of a mature synthetic setup is staleness. The product changes; the check does not; the check starts failing; someone mutes it "temporarily"; six weeks later the journey is genuinely broken and the muted check never said a word.

Prevent it with the same hygiene you apply to tests: review checks when the flow they cover changes, fail loudly rather than allowing silent mutes, and periodically audit which checks have been red-and-ignored. A check you do not trust is a check you do not have.
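The periodic audit can be automated if your platform exposes check status. The `CheckStatus` shape and the threshold are assumptions for this sketch; it flags any check that is failing while muted, or that has been red for longer than the allowed number of days.

```typescript
interface CheckStatus {
  name: string;
  failingSince: Date | null; // null when the check is currently green
  muted: boolean;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the names of checks that are red-and-ignored.
function findNeglectedChecks(
  checks: CheckStatus[],
  now: Date,
  maxRedDays: number
): string[] {
  return checks
    .filter((c) => {
      if (c.failingSince === null) return false;
      const redDays = (now.getTime() - c.failingSince.getTime()) / MS_PER_DAY;
      return c.muted || redDays > maxRedDays;
    })
    .map((c) => c.name);
}
```

Running this on a schedule and posting the result to the owning team turns "temporarily" muted checks into a visible list.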

Start with the foundation

Best practices compound from the bottom up: get your endpoint and uptime coverage right first — multi-region, real assertions, severity-routed alerts — then layer browser journeys on top. For tool selection see the best synthetic monitoring tools in 2026, and for turning an existing test suite into monitors see Playwright monitoring.

Set up multi-region uptime and API checks with config-as-code, severity-based alert routing, and a status page that updates from the same data at app.devhelm.io — your first monitor is live in about 60 seconds, no credit card.
