DEV Community: DevHelm

Monitoring as Code: Why Your Monitors Should Live in Git

DevHelm — Wed, 08 Jul 2026 15:29:29 +0000

Your infrastructure is defined in Terraform. Your application deploys through CI/CD. Your database schema migrates through version-controlled files. But your monitors — the thing that tells you everything else is working — live in a vendor's web UI, maintained by whoever clicked the buttons last, with no history, no review process, and no way to recreate them if the vendor loses the configuration.

Monitoring as code is the practice of defining monitors, alert channels, notification policies, escalation policies, and status pages in version-controlled configuration files that deploy through the same pipeline as your application. When a developer changes the checkout flow, the monitor that watches the checkout flow is updated in the same pull request — not three days later when someone remembers to update the dashboard.

Why monitors rot in a web UI

Every team that maintains monitors in a vendor dashboard eventually hits the same failure modes:

Configuration drift. Two engineers edit the same monitor at the same time. One changes the threshold, the other changes the URL. The vendor keeps the last write. Nobody knows what the "correct" state is.

Orphaned monitors. A service is decommissioned. Its monitors keep running, alerting on expected failures, training the team to ignore the alert channel. Six months later a real failure in that channel goes unnoticed because everyone learned to tune it out.

Undocumented changes. Someone increased the alert threshold from 500ms to 5000ms during an incident "temporarily." There is no record of who changed it, when, or why. The threshold never goes back.

Unreviewed configuration. A new engineer creates a monitor that alerts the entire team on every 404 response. No peer review caught it because there is no review process for monitor changes in a web UI.

Disaster recovery failure. The vendor has an outage. Or you switch vendors. Your monitoring configuration — dozens of monitors, alert channels, notification policies — cannot be recreated because it was never captured in a reproducible format.

What monitoring as code looks like in practice

Monitoring as code means your repository contains declarative files that define your monitoring setup. The format varies by tooling — YAML, HCL, JSON, TypeScript — but the principle is the same: the repository is the source of truth, and the running state is derived from it.

# monitors/checkout-api.yaml
name: Checkout API Health
type: http
url: https://api.example.com/v1/checkout/health
method: GET
frequency: 30s
regions:
  - us-east
  - eu-west
  - ap-southeast
assertions:
  - type: status_code
    value: 200
  - type: response_time
    threshold: 1000ms
  - type: json_body
    path: $.status
    value: "healthy"
alerts:
  - channel: pagerduty-checkout-team
    severity: critical
  - channel: slack-engineering
    severity: warning
    condition: degraded

# monitors.tf — Terraform approach
resource "devhelm_monitor" "checkout_api" {
  name      = "Checkout API Health"
  type      = "http"
  url       = "https://api.example.com/v1/checkout/health"
  frequency = 30

  regions = ["us-east", "eu-west", "ap-southeast"]

  assertions {
    status_code    = 200
    response_time  = 1000
    json_path      = "$.status"
    expected_value = "healthy"
  }

  alert_channels = [
    devhelm_alert_channel.pagerduty_checkout.id,
    devhelm_alert_channel.slack_engineering.id,
  ]
}

Both formats express the same intent. The HCL version integrates with Terraform's plan/apply workflow. The YAML version integrates with a CLI (devhelm apply -f monitors/). Either way, the monitor definition lives in Git.

The payoff: review, history, and reproducibility

Once monitors are code, you get everything version control provides:

Pull request review for monitor changes. When someone changes an alert threshold, the diff shows up in a PR. A teammate can ask "why did you change the checkout timeout from 1s to 5s?" before it ships. Bad configurations get caught before they reach production — the same way bad code does.

Git blame for debugging. "Who changed this monitor, and when?" is a git log query, not a support ticket to the vendor.

Atomic deploys with the application. When a developer renames an endpoint from /v1/users to /v2/users, the monitor definition updates in the same commit. The monitor never points at a dead endpoint.

Environment parity. The same monitor definitions deploy to staging and production with environment-specific variables. Staging monitors catch issues before production monitors do.
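
As a rough sketch of environment parity, one base definition can be expanded per environment. Everything named here (`MonitorDefinition`, `ENV_VARS`, `monitorFor`, the staging hostname, the 60-second staging frequency) is hypothetical and only mirrors the YAML example above; it is not a DevHelm schema or API.

```typescript
// Hypothetical shape mirroring the YAML example; not a real vendor schema.
interface MonitorDefinition {
  name: string;
  url: string;
  frequencySeconds: number;
  alertChannel: string;
}

type Environment = "staging" | "production";

interface EnvVars {
  host: string;
  alertChannel: string;
  frequencySeconds: number;
}

// Assumed per-environment variables; hostnames and channels are illustrative.
const ENV_VARS: Record<Environment, EnvVars> = {
  staging: {
    host: "api.staging.example.com",
    alertChannel: "slack-engineering",
    frequencySeconds: 60,
  },
  production: {
    host: "api.example.com",
    alertChannel: "pagerduty-checkout-team",
    frequencySeconds: 30,
  },
};

// Build the same logical monitor for a given environment.
function monitorFor(env: Environment): MonitorDefinition {
  const vars = ENV_VARS[env];
  return {
    name: `Checkout API Health (${env})`,
    url: `https://${vars.host}/v1/checkout/health`,
    frequencySeconds: vars.frequencySeconds,
    alertChannel: vars.alertChannel,
  };
}

console.log(monitorFor("staging").url);
console.log(monitorFor("production").url);
```

Only the variables differ between environments; the path, assertions, and naming stay in one place, so a change to the monitor reaches staging and production together.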

Disaster recovery. If you need to recreate your entire monitoring setup — new vendor, new account, after an outage — you run one command: devhelm apply -f monitors/ or terraform apply. Everything recreates from the source of truth.

Audit trail. For SOC 2 and ISO 27001, you need evidence that monitoring changes are reviewed and authorized. Git history, combined with required pull request approvals on the main branch, provides this as a by-product of normal work.

The workflow

A mature monitoring-as-code workflow looks like:

1. Developer changes application code
2. Developer updates monitor definition in the same branch
3. PR review covers both code and monitoring changes
4. CI validates monitor syntax (lint, dry-run)
5. Merge to main triggers deploy pipeline
6. Application deploys first
7. Monitors deploy/update second (same pipeline)
8. Monitors validate the deploy succeeded

Step 7 is the key: monitors deploy through CI/CD, not through a web UI. The running state always matches what's in the repository.
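
The validation in step 4 can be as small as a function that CI runs over every definition before anything is applied. This is a minimal sketch under stated assumptions: `MonitorDefinition` and `lintMonitor` are hypothetical names, and the rules (HTTPS only, a 30-second minimum frequency, unique regions) are example policies rather than requirements of any tool.

```typescript
// Hypothetical monitor shape; field names follow the YAML example, not a real schema.
interface MonitorDefinition {
  name: string;
  url: string;
  frequencySeconds: number;
  regions: string[];
}

// Returns human-readable problems; an empty array means the definition passes.
function lintMonitor(m: MonitorDefinition): string[] {
  const problems: string[] = [];
  if (m.name.trim() === "") problems.push("name must not be empty");
  if (!m.url.startsWith("https://")) problems.push("url must use https");
  // The 30-second floor is an assumed team policy, not a vendor limit.
  if (!Number.isInteger(m.frequencySeconds) || m.frequencySeconds < 30) {
    problems.push("frequencySeconds must be an integer >= 30");
  }
  if (m.regions.length === 0) problems.push("at least one region is required");
  if (new Set(m.regions).size !== m.regions.length) {
    problems.push("regions must be unique");
  }
  return problems;
}

const checkoutMonitor: MonitorDefinition = {
  name: "Checkout API Health",
  url: "https://api.example.com/v1/checkout/health",
  frequencySeconds: 30,
  regions: ["us-east", "eu-west", "ap-southeast"],
};

const brokenMonitor: MonitorDefinition = {
  name: "",
  url: "http://api.example.com/v1/checkout/health",
  frequencySeconds: 5,
  regions: [],
};

console.log(lintMonitor(checkoutMonitor)); // no problems
console.log(lintMonitor(brokenMonitor)); // four problems
```

A CI job would fail the build whenever any definition returns a non-empty list, which is what keeps a bad monitor out of the deploy in step 7.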

What should be code — and what should not

Not everything needs to be version-controlled configuration:

Should be code:

Monitor definitions (URL, frequency, assertions, regions).
Alert channel configuration (which Slack channel, which PagerDuty service).
Notification policy rules (which severity routes where).
Status page component mappings (which monitors feed which status page components).
Escalation policies.

Can stay in a UI:

One-time ad-hoc debugging checks (temporary monitors during incident investigation).
Dashboard visualizations (the view of data, not the data source).
Historical alert data and incident timelines (these are records, not configuration).

The rule: if losing it would require manual recreation, it should be code. If it's ephemeral or a view of data, a UI is fine.

Common objections — and responses

"Our team isn't technical enough for config files." If your team manages monitors, they're technical enough to edit YAML. A config file with clear field names is less complex than navigating a 15-field form in a web UI. And unlike the form, the config file has documentation, examples, and review.

"We'd need to deploy every time we change a monitor." Yes — that's the point. An unreviewed, un-deployed change to a production monitor is exactly as risky as an unreviewed change to production code. The deploy pipeline is the safety net.

"What about quick changes during an incident?" Some tools support both: config-as-code for steady-state, with a UI for temporary overrides that sync back to the repo. If your tool doesn't, keep a documented escape hatch ("during incidents, mute via UI; file a follow-up to update the config post-incident").

Getting started

1. Export your current monitors. Most tools have an export or API endpoint. Capture your current state into files.
2. Pick a format. YAML for simplicity, HCL for Terraform users, TypeScript/JSON for programmatic generation.
3. Version control them. Commit to your application repository (same repo = atomic changes) or a dedicated infrastructure repo (if you manage many services).
4. Add a deploy step. Wire your CI/CD to apply monitor definitions on merge to main. Start with a dry-run step that validates syntax.
5. Enforce the workflow. Once monitors deploy from code, disable (or restrict) direct UI edits so drift cannot accumulate.
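
Drift becomes visible once you can compare what the repository declares with what the vendor is running. The sketch below is one way to compute that difference, in the spirit of a dry-run. `planChanges`, `Plan`, and the monitor names are hypothetical, and fetching the running state from a vendor API is left out on purpose.

```typescript
// Hypothetical monitor shape; only the fields needed for the comparison.
interface MonitorDefinition {
  name: string;
  url: string;
  frequencySeconds: number;
}

interface Plan {
  create: string[];
  update: string[];
  remove: string[];
}

// Compare desired state (repository) with actual state (vendor), keyed by name.
function planChanges(desired: MonitorDefinition[], actual: MonitorDefinition[]): Plan {
  const actualByName = new Map<string, MonitorDefinition>();
  for (const m of actual) actualByName.set(m.name, m);

  const desiredNames = new Set<string>();
  const plan: Plan = { create: [], update: [], remove: [] };

  for (const want of desired) {
    desiredNames.add(want.name);
    const have = actualByName.get(want.name);
    if (have === undefined) {
      plan.create.push(want.name);
    } else if (have.url !== want.url || have.frequencySeconds !== want.frequencySeconds) {
      plan.update.push(want.name);
    }
  }

  // Anything running that the repository does not declare is an orphan.
  for (const have of actual) {
    if (!desiredNames.has(have.name)) plan.remove.push(have.name);
  }
  return plan;
}

const desired: MonitorDefinition[] = [
  { name: "Checkout API Health", url: "https://api.example.com/v1/checkout/health", frequencySeconds: 30 },
  { name: "Search API Health", url: "https://api.example.com/v1/search/health", frequencySeconds: 60 },
];

const actual: MonitorDefinition[] = [
  // Someone raised the interval in the UI: this is drift.
  { name: "Checkout API Health", url: "https://api.example.com/v1/checkout/health", frequencySeconds: 300 },
  // The service was decommissioned but its monitor kept running.
  { name: "Legacy Billing Health", url: "https://api.example.com/v1/billing/health", frequencySeconds: 60 },
];

console.log(planChanges(desired, actual));
```

Here the plan would create the search monitor, update the drifted checkout monitor, and remove the orphaned billing monitor. An empty plan on every run is the signal that the UI is no longer being used for edits.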

The first time a PR catches a bad monitor change before it reaches production, the investment pays for itself. The second time — when you recreate your entire monitoring setup in a new environment in 30 seconds — it pays for itself again.

Set up monitors as code with a CLI (devhelm apply), a Terraform provider, or the dashboard — and a status page that updates from the same monitor data — at app.devhelm.io. Your first monitor is live in 60 seconds, no credit card.

Originally published on DevHelm.

API Testing vs API Monitoring: Different Problems, Different Tools

DevHelm — Wed, 08 Jul 2026 15:28:53 +0000

API testing and API monitoring share a surface similarity — both send requests to endpoints and check responses — which is why teams confuse them and assume one covers the other. They do not. They solve different problems, at different times, against different failure modes.

API testing answers: "does this code work correctly before we deploy it?" API monitoring answers: "is this endpoint working correctly right now, in production, for real users?" The first catches bugs. The second catches incidents. A test suite that passes in CI cannot tell you that your production database connection pool is exhausted, that a third-party API your service depends on is returning 503, or that a deploy rolled out a misconfigured environment variable.

When each runs

	API Testing	API Monitoring
When	Before deploy (CI/CD pipeline)	After deploy (continuous, on a schedule)
Triggered by	Code change (push, PR, merge)	Time (every 30s, 1min, 5min)
Environment	Staging, test, local	Production
Failure means	Bug in the code; block the deploy	Incident in production; alert on-call
Duration	Minutes (test suite runtime)	Forever (runs until you stop it)

This timing difference is fundamental. A test runs once per code change and validates correctness. A monitor runs continuously and validates availability. You need both because production fails in ways that tests cannot predict.

What each catches — and what each misses

API testing catches

Logic errors. An endpoint returns the wrong status code for an edge case. A query parameter is parsed incorrectly. A validation rule rejects valid input.
Regression. A refactor broke an endpoint that was working last week. A dependency upgrade changed behavior.
Contract violations. The response schema changed — a field was renamed, a type changed from string to number, a required field became nullable.
Performance regression. A new database query added 2 seconds of latency that wasn't there before (if your test suite includes performance assertions).

API testing misses

Infrastructure failures. The database is up in CI but the production replica is lagging 30 seconds behind. The Redis cluster lost a node. The connection pool is exhausted under production load.
Third-party dependency failures. Your payment provider's API is returning 503. The OAuth provider changed their JWKS endpoint. A CDN edge is serving stale certificates.
Configuration drift. A deploy shipped with the wrong environment variable. The production secret rotated but the app is using the old one. A feature flag was toggled off accidentally.
Gradual degradation. Latency creeping up over hours as a memory leak consumes the heap. Disk filling up until writes start failing. Connection pool exhaustion under sustained load.
Regional failures. The endpoint works from your CI runner in us-east-1 but returns timeouts from eu-west-1 because of a routing misconfiguration.

API monitoring catches

Everything testing misses — because it runs in production, continuously, from multiple locations. A monitor does not know what the "correct" behavior is for every edge case (that's testing's job). It knows what "working" looks like: responds within a latency budget, returns expected status codes, response body contains expected fields.

API monitoring misses

Logic correctness. A monitor can assert that /users/123 returns a 200 with a name field. It cannot assert that the value of the name field is correct for that specific user ID — that's a test.
Edge cases. A monitor runs one synthetic request on a schedule. It does not cover the 47 edge cases your test suite validates (malformed input, concurrent writes, boundary conditions).
Pre-deploy validation. By definition, monitoring runs post-deploy. If you ship a broken endpoint, monitoring tells you after users are affected.

Same tool, different workflows

The confusion deepens because the same tools appear in both workflows. Postman is used for both ad-hoc API testing and scheduled monitors. Playwright is used for both E2E tests in CI and production browser monitors. The difference is not the tool — it's the workflow.

┌─────────────────────────────────────────────────────────┐
│                    Development                           │
│                                                         │
│  Write code → Run tests → Push → CI runs test suite     │
│                                    │                    │
│                                    ▼                    │
│                              Tests pass?                │
│                              Yes → Deploy               │
│                              No  → Fix                  │
│                                                         │
└─────────────────────────────────────────────────────────┘
                              │
                              ▼ deploy
┌─────────────────────────────────────────────────────────┐
│                    Production                            │
│                                                         │
│  Monitor runs every 30s from 3 regions                  │
│      │                                                  │
│      ▼                                                  │
│  Response OK?                                           │
│  Yes → Continue                                         │
│  No  → Alert on-call → Incident → [MTTR clock starts]  │
│                                                         │
└─────────────────────────────────────────────────────────┘

Testing gates the deploy. Monitoring watches the deploy's aftermath. Skipping either leaves a gap.

The gap between them — and what fills it

Teams with mature API infrastructure have three layers:

1. Unit + integration tests: validate logic correctness in isolation and against test databases. Run in CI, block merge.
2. Contract tests: validate that the API response schema matches what consumers expect. Catch breaking changes before deploy.
3. Production monitoring: validate availability, latency, and response correctness continuously. Catch incidents after deploy.

The gap between layers 2 and 3 is where most P0 incidents originate. The code is correct (tests prove it). The contract is intact (schema hasn't changed). But the system fails because of infrastructure, configuration, or dependency issues that no pre-deploy validation covers.

Some teams add a fourth layer: synthetic API tests against production — essentially API tests that run post-deploy against real infrastructure, before the traffic shifts fully. Canary deploys and smoke tests fill this gap, but they are time-bounded. Monitoring is continuous.

What good API monitoring actually asserts

A common mistake is monitoring only the HTTP status code. A 200 OK from an endpoint that returns an empty JSON body, a 200 that returns an error message in the body, or a 200 that takes 12 seconds to arrive — these are all failures that status-code monitoring misses.

Good API monitoring asserts on:

Status code — the baseline, but not sufficient alone.
Response time — within your latency SLO (e.g., p95 < 500ms).
Response body content — the expected fields exist and contain valid data. data.users is a non-empty array. meta.total is a number > 0.
Response headers — cache headers are present, CORS headers are correct.
Certificate validity — the TLS certificate won't expire within 14 days.

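# Example: a check that asserts on status, latency, and body content (bash).
# -o keeps the body out of stdout, so jq only ever sees JSON.
body=$(mktemp)
read -r code secs < <(curl -s -o "$body" -w '%{http_code} %{time_total}\n' \
  -H "Authorization: Bearer $TOKEN" \
  https://api.example.com/v1/health)
[ "$code" = "200" ] &&
  awk -v t="$secs" 'BEGIN { exit !(t < 1.0) }' &&  # 1s budget is illustrative
  jq -e '.status == "healthy" and .database == "connected"' "$body"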
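
The same assertions can also be written as a pure function that takes a measured response and returns every reason it failed. This is a sketch, not how any particular monitoring product evaluates checks: `CheckResult`, `Expectation`, and `evaluate` are hypothetical names, and the 500ms budget is borrowed from the latency example above.

```typescript
// What a single check run measured (hypothetical shape).
interface CheckResult {
  statusCode: number;
  durationMs: number;
  body: unknown;
}

// What the monitor expects (hypothetical shape).
interface Expectation {
  statusCode: number;
  maxDurationMs: number;
  requiredFields: Record<string, string>;
}

// Returns one message per failed assertion; an empty array means the check passed.
function evaluate(result: CheckResult, expected: Expectation): string[] {
  const failures: string[] = [];

  if (result.statusCode !== expected.statusCode) {
    failures.push(`status ${result.statusCode}, expected ${expected.statusCode}`);
  }
  if (result.durationMs > expected.maxDurationMs) {
    failures.push(`took ${result.durationMs}ms, budget is ${expected.maxDurationMs}ms`);
  }

  // Treat a missing or non-object body as an empty object so every field check fails.
  const body =
    typeof result.body === "object" && result.body !== null
      ? (result.body as Record<string, unknown>)
      : {};
  for (const field of Object.keys(expected.requiredFields)) {
    if (body[field] !== expected.requiredFields[field]) {
      failures.push(`body.${field} is ${JSON.stringify(body[field])}, expected "${expected.requiredFields[field]}"`);
    }
  }
  return failures;
}

const expectation: Expectation = {
  statusCode: 200,
  maxDurationMs: 500,
  requiredFields: { status: "healthy", database: "connected" },
};

// A 200 that is slow and carries an error in the body: status-only monitoring would pass it.
const slowErrorBehind200: CheckResult = {
  statusCode: 200,
  durationMs: 12000,
  body: { status: "error" },
};

console.log(evaluate(slowErrorBehind200, expectation));
```

The example response is a 200, yet it fails three assertions: the latency budget, the `status` field, and the missing `database` field.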

Common mistakes

"We have tests, we don't need monitoring." Tests validate code correctness. They cannot validate production infrastructure health, third-party availability, or configuration correctness. These are different failure classes.

"We have monitoring, we don't need tests." Monitoring tells you something broke. It doesn't tell you what the correct behavior is, doesn't validate edge cases, and doesn't prevent broken code from deploying.

"Our staging environment is identical to production." It is not. Staging has different load, different data, different third-party credentials (often sandbox/test keys), and different infrastructure (usually smaller). Passing in staging does not guarantee passing in production.

"We monitor the health endpoint, that covers it." A health endpoint that returns 200 proves the process is running. It does not prove that the /checkout endpoint can reach the payment provider, that the /search endpoint can query Elasticsearch, or that the /upload endpoint can write to S3.

Start with both

If you have tests but no monitoring: your next incident will be a production failure that tests could never have caught. Add multi-region API monitoring with response body assertions on your 5 most critical endpoints.

If you have monitoring but no tests: your next bug will ship to production before monitoring detects it, because it will affect an edge case the monitor doesn't cover. Add integration tests for your API's critical paths.

Set up API monitoring with multi-region checks, response body assertions, latency thresholds, and a status page that updates from the same monitoring data at app.devhelm.io — your first monitor is live in 60 seconds, no credit card. For the full tooling landscape, see the best API monitoring tools in 2026.


How to Set Up Browser Monitoring with Playwright

DevHelm — Wed, 08 Jul 2026 15:28:16 +0000

You want to know — continuously, automatically — that a user can complete a critical task in your production application. Not that the server returns 200. Not that the homepage loads HTML. That a real browser can navigate to your checkout page, fill in payment details, click "Pay now," and see "Order confirmed."

This tutorial walks through setting up browser monitoring with Playwright from scratch: writing a check that asserts on user-visible outcomes, capturing forensic evidence on failure, scheduling it against production, and routing alerts to the right channel.

Prerequisites

Node.js 18+ with an existing project (or create one: mkdir browser-monitors && cd browser-monitors && npm init -y).
Playwright installed: npm install -D @playwright/test && npx playwright install chromium.
A production (or staging) URL to monitor.
A dedicated synthetic test account — credentials stored in environment variables, never in source.

Step 1 — Write a check that asserts on outcomes

A monitoring check is not a test in the QA sense — it does not cover edge cases or validate business logic. It answers one question: can a user complete this journey right now?

// monitors/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("checkout completes successfully", async ({ page }) => {
  await page.goto(process.env.BASE_URL! + "/products");

  // Add item to cart
  await page.getByRole("button", { name: "Add to cart" }).first().click();
  await page.getByRole("link", { name: "Cart" }).click();

  // Begin checkout
  await page.getByRole("button", { name: "Checkout" }).click();

  // Fill payment (test card — non-charging token)
  await page.getByLabel("Email").fill(process.env.SYNTHETIC_EMAIL!);
  await page.getByLabel("Card number").fill("4242424242424242");
  await page.getByLabel("Expiry").fill("12/28");
  await page.getByLabel("CVC").fill("123");

  // Submit and assert on the user-visible outcome
  await page.getByRole("button", { name: "Pay now" }).click();
  await expect(page.getByText("Order confirmed")).toBeVisible({
    timeout: 15000,
  });
});

The assertion is the key: page.getByText("Order confirmed") proves the entire chain worked — frontend rendering, API call, payment provider round-trip, database write, and confirmation page render. A status code check would miss failures in any of those layers.

Step 2 — Use stable selectors that survive deploys

Browser checks break when the UI changes. Minimize breakage by using selectors that survive visual redesigns:

// Fragile — breaks when CSS classes change
await page.locator(".btn-primary.checkout-submit").click();

// Stable — survives any visual redesign
await page.getByRole("button", { name: "Pay now" }).click();

// Also stable — explicit test IDs for elements without semantic roles
await page.getByTestId("order-total").textContent();

Priority order for selectors: ARIA roles > getByLabel / getByText > data-testid > CSS selectors. Roles, labels, and visible text are tied to user-visible semantics that rarely change, and test IDs are an explicit contract that a redesign has no reason to touch; CSS classes change on every design sprint.

Step 3 — Wait for conditions, never for time

The number-one cause of flaky monitoring checks is fixed sleeps:

// Flaky: assumes 3 seconds is enough, or wastes 3 seconds when it's fast
await page.waitForTimeout(3000);
const text = await page.getByTestId("balance").textContent();
expect(text).toBeTruthy();

// Stable: retries automatically until the condition holds or times out
await expect(page.getByTestId("balance")).toBeVisible({ timeout: 10000 });
await expect(page.getByTestId("balance")).not.toHaveText("");

Playwright's web-first assertions (expect(locator).toBeVisible(), expect(locator).toHaveText()) retry until the condition is met. They pass in 200ms when the app is fast and only fail when something is genuinely broken.

Step 4 — Capture evidence on failure

When a production check fails at 3 AM, the on-call engineer needs to diagnose why without re-running it. Configure Playwright to preserve forensic evidence on failure:

// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./monitors",
  use: {
    baseURL: process.env.BASE_URL,
    screenshot: "only-on-failure",
    trace: "retain-on-failure",
    video: "retain-on-failure",
  },
  timeout: 30000,
  retries: 1,
});

On failure, you get: a screenshot of the page state at the moment of failure, a trace (full DOM snapshot + network log + console, viewable at trace.playwright.dev), and a video of the entire check execution. The retries: 1 setting re-runs a failed check once before declaring failure — this eliminates most transient network blips without masking real incidents.

Step 5 — Schedule checks against production

A monitoring check runs on a clock, not on commits. The simplest scheduler is a GitHub Actions cron workflow:

# .github/workflows/browser-monitor.yml
name: browser-monitoring
on:
  schedule:
    - cron: "*/5 * * * *"
  workflow_dispatch:

jobs:
  checkout-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci && npx playwright install --with-deps chromium
      - run: npx playwright test monitors/checkout.spec.ts
        env:
          BASE_URL: https://app.yourproduct.com
          SYNTHETIC_EMAIL: ${{ secrets.SYNTHETIC_EMAIL }}
          SYNTHETIC_PASSWORD: ${{ secrets.SYNTHETIC_PASSWORD }}
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: failure-evidence-${{ github.run_id }}
          path: test-results/
          retention-days: 14

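This gives you browser monitoring today with no extra tooling (GitHub-hosted minutes are free for public repositories and metered for private ones). The limits are honest: GitHub Actions schedules floor at 5-minute intervals and can start late when the platform is busy, you cannot choose the region GitHub-hosted runners execute in, and you need to wire alerting separately.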

Step 6 — Add multi-region coverage

Failures are often regional — a CDN edge cert expires in one zone, DNS propagates unevenly, a deploy rolls out region by region. Running checks from a single location is blind to all of these.

For a self-hosted approach, deploy the same Playwright check as a container to multiple regions and report results to a central collector. For a managed approach, platforms like Checkly run your Playwright scripts from 20+ regions with built-in scheduling and alerting.

At minimum, run critical checks from two geographically distinct locations. If a check fails from one location but passes from others, you have a regional incident — a distinction that changes the severity and response.
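
The regional-versus-global distinction is simple to encode once results from each location are collected in one place. A minimal sketch, assuming results arrive as a map of region name to pass/fail; `classify` and `Scope` are hypothetical names.

```typescript
type Scope = "healthy" | "regional" | "global";

// results maps a region name to whether the check passed there.
function classify(results: Record<string, boolean>): Scope {
  const outcomes = Object.keys(results).map((region) => results[region]);
  if (outcomes.length === 0) {
    throw new Error("no region results to classify");
  }
  const failures = outcomes.filter((passed) => !passed).length;
  if (failures === 0) return "healthy";
  return failures === outcomes.length ? "global" : "regional";
}

console.log(classify({ "us-east": true, "eu-west": false, "ap-southeast": true })); // regional
console.log(classify({ "us-east": false, "eu-west": false, "ap-southeast": false })); // global
```

With only one region, every failure is classified as global, which is exactly why a single location cannot tell the two cases apart.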

Step 7 — Route failures to the right channel

A failed browser check is only useful if it reaches a human who can act on it. Wire failure notifications to match your incident severity levels:

Revenue-critical journey failure (checkout, signup) → page on-call via PagerDuty/OpsGenie.
Important but not revenue-blocking (settings page, profile update) → Slack channel, business hours.
Secondary paths (about page, blog) → logged, reviewed weekly.
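
One way to keep this routing reviewable is to hold it in a lookup table next to the checks. The sketch below assumes three tiers matching the list above; the tier names, action names, and `routeFailure` are hypothetical, and actually sending the page or Slack message is out of scope.

```typescript
type Tier = "revenue-critical" | "important" | "secondary";
type Action = "page-on-call" | "slack-business-hours" | "log-for-weekly-review";

// What to do for each tier, mirroring the three levels above.
const ACTION_BY_TIER: Record<Tier, Action> = {
  "revenue-critical": "page-on-call",
  important: "slack-business-hours",
  secondary: "log-for-weekly-review",
};

// Which tier each journey belongs to (illustrative journey names).
const TIER_BY_JOURNEY = new Map<string, Tier>([
  ["checkout", "revenue-critical"],
  ["signup", "revenue-critical"],
  ["settings", "important"],
  ["profile-update", "important"],
  ["about", "secondary"],
  ["blog", "secondary"],
]);

function routeFailure(journey: string): Action {
  const tier = TIER_BY_JOURNEY.get(journey);
  if (tier === undefined) {
    // Refuse to guess: an unclassified journey should be fixed in review.
    throw new Error(`journey "${journey}" has no severity tier assigned`);
  }
  return ACTION_BY_TIER[tier];
}

console.log(routeFailure("checkout")); // page-on-call
console.log(routeFailure("blog")); // log-for-weekly-review
```

Throwing on an unknown journey is a deliberate choice in this sketch: a new check cannot ship without someone deciding who gets woken up when it fails.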

For the GitHub Actions approach, add a failure notification step:

      - name: Alert on failure
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d '{"text": "Browser check FAILED: checkout journey. Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'

Step 8 — Monitor the layer underneath

A browser journey sits on top of API endpoints that can fail independently. When your checkout check goes red, you want to immediately know whether the problem is in the frontend, the API, or a third-party dependency.

Monitoring those API endpoints directly — with response body assertions, latency thresholds, and multi-region coverage — turns "the whole flow is red" into "the /api/payment-intent endpoint is returning 500 from us-east." That diagnostic specificity is the difference between a fast MTTR and a slow one.

Set up API and uptime monitors for the endpoints your browser checks depend on — with multi-region checks, config-as-code, and a status page that updates from the same monitoring data — at app.devhelm.io. Your first monitor is live in 60 seconds, no credit card.

Headless Browser Monitoring: What It Is and When You Need It

DevHelm — Wed, 08 Jul 2026 15:27:40 +0000

An HTTP check tells you a server responded. A headless browser check tells you a user could actually complete a task. The difference matters every time a deploy ships a JavaScript error that breaks a button, a third-party script blocks rendering, a payment iframe fails to load, or a frontend routing change returns 200 OK with an empty page body.

Headless browser monitoring runs a real browser — Chromium, headless, with no visible window — against your production application on a recurring schedule. It clicks buttons, fills forms, waits for elements, and asserts on outcomes exactly the way a user would, except it does it every 30 seconds from multiple geographic regions without getting tired or forgetting to check.

How it works

A headless browser monitor executes a script — typically written in Playwright or Puppeteer — inside a Chromium instance that runs without a display. The script performs a user journey: navigate to a URL, interact with the page, assert that the expected outcome appears.

import { chromium } from "playwright";

const browser = await chromium.launch({ headless: true });
try {
  const page = await browser.newPage();

  await page.goto("https://app.example.com/login");
  await page.getByLabel("Email").fill("synthetic@example.com");
  await page.getByLabel("Password").fill(process.env.SYNTHETIC_PASSWORD!);
  await page.getByRole("button", { name: "Sign in" }).click();

  await page.waitForURL("**/dashboard");

  // waitFor retries until the heading is visible and throws on timeout;
  // isVisible() returns immediately and would flake on a slow render.
  await page
    .getByRole("heading", { name: "Welcome" })
    .waitFor({ state: "visible", timeout: 10000 });
} finally {
  // Always release the browser, even when the journey fails.
  await browser.close();
}

When this script runs on a schedule — say every 60 seconds from three regions — it becomes a monitor. A failure triggers an alert. The evidence (screenshot, trace, network waterfall) ships with the alert so on-call knows what broke without re-running the check manually.

What headless browser monitoring catches that API checks miss

An API check validates that an endpoint returns a response matching expectations. A headless browser check validates that the assembled experience works end-to-end. The gap between those two is where most user-facing incidents live:

Client-side JavaScript errors. A deploy ships a bundling error that throws Uncaught TypeError on the checkout page. The API returns 200 with valid JSON. The button does nothing. An HTTP check passes. A browser check fails because page.getByText("Order confirmed") never appears.

Third-party script failures. Your payment provider's iframe stops loading because their CDN has a regional outage. Your server is fine. Your API is fine. The user sees a blank payment box. Only a browser check that actually attempts the payment flow detects this.

CSS/layout regressions. A deploy changes a z-index and the "Submit" button is now behind an overlay. The button exists in the DOM. The API endpoint works. But page.getByRole("button", { name: "Submit" }).click() fails because the button is not interactable — it's obscured.

SPA routing failures. A React/Next.js app has a routing bug that renders a blank page on certain navigation paths. The server returns 200 with the HTML shell. The JavaScript that should render the page crashes silently. Only a browser check that navigates the path and asserts on rendered content catches this.

Authentication flow breakage. OAuth redirects, SAML flows, cookie-based sessions — these involve multiple round trips across domains. An API check on any single endpoint passes. The combined flow fails because a redirect URI changed or a cookie domain attribute is wrong.

When you need headless browser monitoring — and when you do not

Headless browser monitoring is expensive relative to HTTP checks. A browser check consumes 10–100x more compute, takes 5–30 seconds to complete (versus ~200ms for an HTTP check), and requires maintaining scripts that break when the UI changes. You should use it selectively.

You need it when:

Revenue-critical user journeys exist (checkout, signup, upgrade, payment method update).
Your application is an SPA or heavily JavaScript-dependent — the server response alone does not represent the user experience.
You depend on third-party iframes or scripts (payment, auth, analytics) that can break independently of your infrastructure.
Your deploy pipeline does not include end-to-end tests that cover production-specific configurations (real OAuth, real payment providers, production CDN).

You do not need it when:

Your service is a pure API (no browser-rendered UI). API monitoring with response body assertions covers this.
The same Playwright tests already run in CI against a staging environment that mirrors production faithfully. (Though even then, a scheduled production run catches configuration drift and third-party failures — see Playwright monitoring.)
Your pages are mostly static content served from a CDN. A simple HTTP check with content assertion is cheaper and sufficient.

The cost question

Every headless browser monitor has three costs: compute (running Chromium), maintenance (updating scripts when the UI changes), and signal quality (flaky checks that erode trust).

Compute cost depends on your tooling choice. Managed platforms like Checkly bill per-run (~$4–6.50 per 1,000 browser runs). Self-hosted approaches (GitHub Actions cron, dedicated containers) trade billing for infrastructure management. Either way, a single browser check at 30-second intervals from 3 regions is ~260,000 runs/month — budget accordingly.
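
The run count is worth deriving rather than trusting. A small worked calculation, assuming a 30-day month and the per-1,000-run price range quoted above; `runsPerMonth` and `monthlyCostUsd` are hypothetical helper names.

```typescript
// Number of check executions in a month for one check definition.
function runsPerMonth(intervalSeconds: number, regions: number, days: number = 30): number {
  const secondsPerMonth = days * 24 * 60 * 60;
  return (secondsPerMonth / intervalSeconds) * regions;
}

// Cost at a given price per 1,000 runs.
function monthlyCostUsd(runs: number, pricePerThousandUsd: number): number {
  return (runs / 1000) * pricePerThousandUsd;
}

const runs = runsPerMonth(30, 3);
console.log(runs); // 259200, i.e. roughly 260,000
console.log(monthlyCostUsd(runs, 4).toFixed(2)); // low end of the quoted range
console.log(monthlyCostUsd(runs, 6.5).toFixed(2)); // high end of the quoted range
```

At the quoted prices that single check lands at roughly $1,037 to $1,685 per month. Moving from a 30-second to a 5-minute interval divides the run count by ten, which is usually the first lever to pull.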

Maintenance cost scales with UI volatility. If your team ships frontend changes daily, browser checks will break often unless they use stable selectors (data-testid, ARIA roles) rather than CSS classes or XPath. The best practices guide covers selector discipline in detail.

Signal quality is the hidden cost. A browser check that flakes once a week trains your team to ignore it. Invest in confirm-on-failure (re-run from a second region before alerting), conditional waits (never waitForTimeout), and isolated test accounts. A reliable check that alerts once a quarter is worth more than a flaky check that alerts twice a day.
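
Confirm-on-failure can be sketched as a small wrapper around whatever actually runs the check. In the sketch below, `run_check` is a hypothetical callable that returns True when the browser check passes from a given region; the region names are placeholders:

```python
from typing import Callable


def should_alert(run_check: Callable[[str], bool], primary_region: str, confirm_region: str) -> bool:
    """Alert only if the check fails in the primary region and again in a second region."""
    if run_check(primary_region):
        return False  # passed on the first attempt: nothing to confirm
    # First failure: re-run from a different region before paging anyone.
    return not run_check(confirm_region)


# Simulated results: a regional blip fails from one region only, an outage fails from both.
blip = {"us-east": False, "eu-west": True}
outage = {"us-east": False, "eu-west": False}

print(should_alert(lambda region: blip[region], "us-east", "eu-west"))    # False
print(should_alert(lambda region: outage[region], "us-east", "eu-west"))  # True
```

The trade-off is detection latency: a confirmed alert arrives one extra check duration later than an unconfirmed one.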

Headless browser monitoring vs synthetic monitoring vs RUM

These three terms overlap and are easy to confuse:

Headless browser monitoring is a specific implementation: run Chromium headlessly against production on a schedule. It is a subset of synthetic monitoring.
Synthetic monitoring is the broader category: any proactive, scripted check against production — including HTTP pings, multi-step API checks, and browser checks.
Real User Monitoring (RUM) is passive: instrument real browser sessions and record what actual users experience.

Headless browser monitoring catches failures before users encounter them. RUM catches failures as users encounter them. The first is proactive; the second is reactive. Most teams need both — synthetic checks for critical journeys (instant detection), RUM for coverage breadth and performance baselines.

Getting started

The fastest path from zero to a working headless browser monitor:

Write a Playwright test that asserts on your most important user journey (login, checkout, or the core product action).
Run it locally against production: npx playwright test --headed to verify it passes.
Schedule it: a GitHub Actions cron workflow is the simplest starting point.
Add failure evidence: screenshots, traces, and video on failure so alerts carry context.
Wire alerting: route failures to your on-call channel, matched to your severity levels.

Once you have one check running reliably, add checks for your top 3–5 revenue-critical paths. Stop there until you've proven the maintenance cost is manageable at your team's deployment frequency.

For the API endpoints and uptime layer underneath your browser checks — multi-region HTTP monitoring with response body assertions, config-as-code, and a status page that updates from the same data — start at app.devhelm.io. Your first monitor is live in 60 seconds, no credit card.

Originally published on DevHelm.

Checkly Alternatives in 2026: Synthetic Monitoring Tools Compared

DevHelm — Wed, 08 Jul 2026 15:27:03 +0000

Checkly built its reputation on a single conviction: synthetic monitoring should be code, not click-through wizards. You write Playwright tests, Checkly runs them on a schedule from global locations, and you get alerts when a user journey breaks. For teams already fluent in TypeScript and Playwright, it fits like a natural extension of CI.

But Checkly is not the only option, and its trade-offs become clear at scale. Per-run billing means a busy monitoring setup can produce surprise invoices — a single browser check running every 30 seconds from three regions generates 259,200 billable runs per month. There is no recorder for non-developers, no native on-call, and SSO requires the Enterprise plan. If any of those gaps matter to your team, an alternative deserves evaluation.

We compared five alternatives across the dimensions where teams actually switch: pricing predictability, authoring mode, on-call integration, browser engine fidelity, and developer surface (CLI, Terraform, MCP).

What makes Checkly good — and where teams hit limits

Checkly's strengths are real. It runs your actual Playwright suites (multi-file, fixtures, stored state) with the highest fidelity in the category. The developer surface is deep: a CLI (checkly test and checkly deploy), a Terraform provider, Pulumi support, and Prometheus export. If your definition of monitoring as code is "my monitors live in the same repo as my app and deploy in the same pipeline," Checkly delivers that better than anyone else.

The limits show up in three places:

Pricing at scale. Browser checks cost ~$4–6.50 per 1,000 runs, and the platform splits into three separately metered products (Synthetics, Alerting, Private Locations). A team with 20 browser checks at 1-minute intervals from 3 regions pays for ~2.6 million runs/month. At those rates, that is roughly $10,000–17,000 per month for the browser runs alone.

Non-developer authoring. There is no recorder. Every check is TypeScript. If your QA team or product managers need to create checks, they cannot — they need a developer to write the code. This limits adoption to engineering teams only.

On-call and incident management. Checkly alerts via webhook, Slack, PagerDuty, or OpsGenie — but has no native escalation, rotation, or incident timeline. You need a separate on-call tool.

Datadog Synthetic Monitoring

Datadog is the enterprise pick, and its differentiator is correlation. A failed browser check links directly to the APM trace, the infrastructure metrics, and the RUM session that explains it. No other tool in this list can show you "the checkout button failed because the payment-intent endpoint spiked to 4s latency because the Postgres replica lagged 12s behind primary" in a single pane.

Where it wins over Checkly: nine test types (including mobile device), self-healing locators in the recorder, SOC 2 + ISO 27001 + HIPAA + FedRAMP compliance, SAML/SCIM with custom RBAC, and native access to Datadog's catalog of 700+ integrations.

Where it loses: Browser checks cost ~$12–18 per 1,000 runs — roughly 3x Checkly's rate. CI test runs draw from the same quota. The frequency floor in the UI is 5 minutes (1-minute requires a support ticket). The code-first story is weaker — you can write tests in JavaScript, but the workflow assumes the recorder as the starting point. Session replay sits behind separately-billed RUM.

Best for: enterprises already on Datadog that want synthetic checks correlated with full-stack observability and can absorb metered browser pricing. For the broader platform comparison, see Datadog vs Dynatrace.

Grafana Cloud Synthetic Monitoring (k6)

Grafana Cloud has the most generous free tier in synthetic monitoring — 100,000 API plus 10,000 browser executions per month, no credit card — backed by the credibility of open-source k6. If your team values owning the stack and paying nothing to start, Grafana is the obvious candidate.

Where it wins over Checkly: free tier that actually covers meaningful usage, OSS foundation (k6 is AGPL-3.0-licensed), Playwright-to-k6 script conversion, a first-party authoring MCP, and Terraform support. If you already run Grafana for dashboards, adding synthetic monitoring is a config change, not a vendor decision.

Where it loses: past the free tier, browser pricing gets steep and confusing (~$50 per 10,000 executions, billed per-probe-per-minute). There is no in-product recorder — k6 Studio is a separate desktop app. The browser interval floor is 60 seconds. And the whole-stack complexity is real: you are adopting Grafana's ecosystem, not a focused monitoring tool.

Best for: engineering teams that value OSS, need a strong free tier, and author everything in code. Not for teams that need a recorder or sub-minute browser intervals.

Better Stack

Better Stack bundles uptime monitoring, Playwright browser checks, incident management, on-call rotations, logs, and status pages in one product. Its native on-call and escalation are the best in this list — the one thing Checkly fundamentally lacks.

Where it wins over Checkly: native on-call with escalation policies and rotations, a bundled status page, Playwright/Chromium engine (not Selenium), trace-viewer artifacts on failure, and a Terraform provider. For small teams that want monitoring + on-call + status pages in one bill, Better Stack eliminates three separate vendor relationships.

Where it loses: locations are coarse (four regional groups — US, EU, Asia, Australia). Private synthetic locations are weak and lightly documented. There is no AI authoring and no visual regression. Pricing is per-minute (~$1 per 100 Playwright-minutes) on top of a required $29/responder seat, which gets unpredictable at scale.

Best for: small-to-mid teams that want monitoring, on-call, and status pages bundled. Teams that need deep location control or high-volume browser checks at predictable pricing will hit limits.

Sematext

Sematext offers the one fully predictable pricing model in this list: a flat per-monitor fee (~$2 for HTTP, ~$7 for a browser monitor per month) with no per-run meter. The engine runs Playwright on Chromium, and private locations deploy as Docker containers.

Where it wins over Checkly: completely predictable billing. A team with 20 browser monitors pays $140/month regardless of check frequency or location count. No surprise invoices. No metering math. Docker-based private locations that work without an enterprise contract.
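
Whether flat pricing is actually cheaper depends on how often you run. A rough comparison, using only the approximate figures quoted in this article (about $7 per browser monitor per month flat, and the low end of the ~$4–6.50 per 1,000 runs range) and assuming a 30-day month; real plans add minimums and included quotas that this sketch ignores:

```python
FLAT_BROWSER_MONITOR = 7.0  # ~$7 per browser monitor per month (figure quoted above)
PER_1000_RUNS = 4.0         # low end of the per-run range quoted earlier


def per_run_monthly_cost(interval_minutes, regions, price_per_1000=PER_1000_RUNS, days=30):
    """Monthly cost in dollars of one metered browser check."""
    runs = (24 * 60 / interval_minutes) * days * regions
    return runs / 1000 * price_per_1000


def cheaper_model(interval_minutes, regions):
    """Which model costs less for one monitor at this frequency: 'flat' or 'per-run'."""
    metered = per_run_monthly_cost(interval_minutes, regions)
    return "flat" if FLAT_BROWSER_MONITOR < metered else "per-run"


# Start at 5 minutes, the browser interval floor mentioned for the flat-priced tool.
for interval, regions in [(5, 3), (15, 3), (60, 3), (60, 1)]:
    metered = per_run_monthly_cost(interval, regions)
    print(f"every {interval} min x {regions} region(s): "
          f"per-run ${metered:.2f} vs flat ${FLAT_BROWSER_MONITOR:.2f} -> {cheaper_model(interval, regions)}")
```

Under these assumptions, metered billing only wins for infrequent, single-region checks; anything frequent or multi-region favors the flat fee.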

Where it loses: the developer surface is minimal — no Terraform provider, no CLI for synthetics, no MCP. There is no recorder, no video capture, no HAR archive. Multi-step journeys only report the last page's metrics. The browser interval floor is 5 minutes. It is a thin feature set with one clear advantage: predictability.

Best for: teams that want predictable per-monitor pricing on a handful of browser checks and do not need a developer surface or deep forensics.

When you need the layer underneath

All synthetic monitoring tools — Checkly included — run checks from the outside. They tell you that a journey failed. They do not tell you which API endpoint caused the failure, or whether the root cause is your infrastructure or a degraded third-party dependency.

Layering API monitoring underneath your browser checks turns "the checkout flow is red" into "the /payment-intent endpoint is returning 500 because Stripe's API is degraded." That correlation is the difference between a 5-minute diagnosis and a 45-minute scramble.

How to choose

The decision tree is shorter than the feature matrix suggests:

You write Playwright and want maximum fidelity: Checkly remains the best choice if you can manage per-run billing.
You need APM correlation: Datadog, if you can absorb the price.
You want free and open source: Grafana Cloud / k6.
You need on-call bundled: Better Stack.
You need predictable billing above all: Sematext.
You need API and uptime monitoring with config-as-code underneath your synthetic layer: DevHelm — 50 monitors on the free tier, flat pricing, multi-region, with a status page that updates from the same check data. Your first monitor takes 60 seconds, no credit card.

Whatever you choose, read the synthetic monitoring best practices guide before you configure your first check — the difference between a useful setup and a flaky one that trains your team to ignore alerts is in the details, not the vendor.

Originally published on DevHelm.

Opsgenie Is Shutting Down: What You Need to Know and When to Migrate

DevHelm — Wed, 08 Jul 2026 15:26:27 +0000

Atlassian has confirmed that Opsgenie will reach end-of-life in April 2027. New signups are already blocked, and existing customers are being guided toward Jira Service Management (JSM) Cloud as the replacement. If your team relies on Opsgenie for on-call scheduling, alert routing, or escalation policies, you have roughly ten months to plan and execute a migration.

This article covers what's happening, why, and what your options look like.

What happened

Atlassian announced that Opsgenie — acquired in 2018 as a standalone incident alerting and on-call product — will be fully deprecated. The core capabilities (alert routing, on-call schedules, escalation policies, and incident timelines) have been rebuilt inside Jira Service Management Cloud, and Atlassian no longer sees a reason to maintain a separate product.

Existing Opsgenie customers received migration notices starting in early 2026. The Opsgenie web console now displays a banner pointing to migration documentation, and the API will continue functioning until the shutdown date — but no new features will ship.

Timeline

Date	Event
Q1 2026	New Opsgenie signups blocked
Q2 2026	Migration tooling available in JSM
April 2027	Opsgenie reaches end-of-life — APIs, web console, and mobile app stop functioning

Atlassian is providing roughly 12–14 months of overlap between the announcement and the hard cutoff. That sounds generous, but migrations that involve on-call schedules, hundreds of integrations, and team-specific escalation logic rarely go smoothly in a single sprint.

Why Atlassian is doing this

The strategic logic is straightforward: Atlassian wants one incident management surface, not two. Maintaining a standalone product that overlaps with JSM's built-in incident features creates engineering duplication, confuses the sales motion, and splits the user base.

JSM Cloud now includes:

Alert routing and deduplication (ported from Opsgenie's engine)
On-call scheduling with rotation rules
Escalation policies with multi-channel notification
Incident timelines and postmortem workflows
A mobile app for on-call acknowledgment

For teams already paying for JSM, this consolidation removes a separate line item. For teams that used Opsgenie standalone (without Jira), it forces a platform decision.

What changes for existing users

If you do nothing, your on-call system stops working in April 2027. Specifically:

Integrations break. Every monitoring tool, CI pipeline, or custom webhook that currently POSTs to Opsgenie's API will need a new destination. If you route alerts from Datadog, Prometheus, Grafana, CloudWatch, or any other source through Opsgenie, those integrations require rewiring.

On-call schedules need recreation. While Atlassian provides migration tooling, the schedule model in JSM differs from Opsgenie's. Complex rotations with overrides, restrictions, and multi-team handoffs may not map 1:1.

Mobile app changes. The Opsgenie mobile app will stop receiving alerts after shutdown. JSM uses the Jira Cloud mobile app (or the Jira Ops companion app) for on-call notifications.

Data export. Atlassian provides export tooling for historical alert and incident data. If you need audit trails or postmortem archives, export before the cutoff.

Migration options

You have three paths, and the right one depends on your existing Atlassian footprint.

Option A: Move to JSM Cloud

This is the path of least resistance if your team already uses Jira for ticketing. The migration tooling handles:

On-call schedules and rotations
Escalation policies
Integration configurations (partial — some require manual reconnection)
Team structures

Considerations: JSM pricing is per-agent, which can be significantly more expensive than Opsgenie's per-user model for large teams. JSM also requires Jira Cloud — if you're on Jira Server or Data Center, this migration includes a platform migration too.

Option B: Evaluate alternatives

If you were already unhappy with Opsgenie — pricing, mobile app reliability, or integration depth — this is a natural breakpoint to evaluate the market. The on-call and alerting alternatives space has matured since 2018, with several strong options in PagerDuty, Rootly, incident.io, Grafana OnCall, and others.

When this makes sense: your team doesn't use Jira, you want to avoid JSM's per-agent pricing, or you need capabilities Opsgenie never had (AI-powered triage, native Slack workflows, or deeper runbook automation).

Option C: Hybrid migration

Some teams separate concerns: move on-call scheduling to a dedicated tool while keeping Jira for ticket tracking and postmortems. This avoids JSM lock-in for the real-time alerting path while preserving the Jira integration for retrospective workflows.

Timeline recommendations

Regardless of which path you choose:

Now (Q2 2026): Audit your current Opsgenie configuration. Document every integration, escalation policy, and schedule. Identify which are actively used vs. legacy.
Q3 2026: Run a proof-of-concept migration in a staging environment. Test alert routing, escalation timing, and mobile notifications.
Q4 2026: Execute production migration with a parallel-run period. Keep Opsgenie active as a fallback while validating the new system handles real incidents correctly.
Q1 2027: Decommission Opsgenie integrations and export historical data.

What to evaluate in any migration

Whether you're moving to JSM or a third-party tool, these are the dimensions that matter for on-call rotation setup:

On-call scheduling flexibility. Can you model your actual rotation? Multi-team handoffs, timezone-aware shifts, holiday overrides, temporary swaps. Simple round-robin is table stakes — the complexity lives in the exceptions.

Escalation policies. How many escalation layers? Can you escalate to a different team after N minutes? Do escalations respect business hours vs. 24/7? Can you route by incident severity level?

Integration breadth. Count your current Opsgenie integrations. Verify that the destination tool supports each one natively or via webhook. Pay special attention to bidirectional integrations (where the on-call tool writes back to the source).

Pricing model. Opsgenie charged per-user with a generous free tier. JSM charges per-agent. PagerDuty charges per-user with add-on costs. Some newer tools offer flat-rate pricing. Model your actual team size and growth.

Mobile experience. On-call is a mobile-first workflow. Test the actual notification reliability, acknowledgment flow, and override UX on both iOS and Android. Unreliable push notifications during an outage are worse than no tool at all.

Start planning now

Ten months feels like plenty of time until you factor in procurement cycles, security reviews, integration testing, and the reality that on-call migrations can only be validated during actual incidents. Teams that start evaluating in Q2 2026 will have time for a proper parallel-run period. Teams that wait until Q1 2027 will be scrambling.

Document your current state, pick a direction, and get a proof-of-concept running before summer ends.

For the monitoring and alerting layer that feeds your on-call tool — multi-region checks, notification policies, and a status page — start at app.devhelm.io.

Originally published on DevHelm.

Opsgenie Alternatives in 2026: Where to Migrate Before the Shutdown

DevHelm — Wed, 08 Jul 2026 15:25:50 +0000

Atlassian announced it is sunsetting Opsgenie as a standalone product, folding its on-call features into Jira Service Management. If your team runs on Opsgenie today, you have a deadline — and a decision.

The forced migration is an opportunity to re-evaluate. Opsgenie was a solid mid-market on-call tool, but the market has changed since most teams adopted it. Newer entrants like incident.io and Rootly have reimagined how on-call connects to incident response. Grafana OnCall made open-source on-call viable. PagerDuty matured its platform further. And Atlassian's own JSM absorbed Opsgenie's features into a broader ITSM surface.

This guide compares six alternatives on the dimensions that actually matter for an on-call rotation: scheduling flexibility, escalation depth, integration breadth, pricing model, mobile experience, and how well each tool connects to the rest of your incident workflow.

What to look for in an Opsgenie replacement

Before comparing tools, clarify what your team actually needs. Opsgenie covered a broad surface — on-call scheduling, escalation policies, alert routing, and integrations with monitoring tools. Not every replacement covers all of it equally well.

On-call scheduling. The basics are table stakes: rotations, overrides, time-of-day restrictions. The differentiators are schedule previews, gap detection, and how painful it is to set up a follow-the-sun rotation across three time zones.

Escalation policies. How many levels can you define? Can you branch based on alert severity or service? Can you escalate to a Slack channel instead of a person? Some tools treat escalation as a linear chain; others allow conditional routing trees.

Integration count and quality. Opsgenie had 200+ integrations. If your monitoring stack sends alerts through a specific integration, verify the replacement supports it natively — not just via generic webhook. Native integrations carry metadata (severity, service, deduplication keys) that generic webhooks lose.

Pricing model. The industry split is per-seat vs. usage-based. Per-seat is predictable but punishes large on-call rosters where most people are only paged occasionally. Usage-based (per-incident or per-notification) is cheaper for large teams but can spike during outages — exactly when you need the tool most.

Alert routing intelligence. Can the tool suppress duplicates? Correlate related alerts? Auto-resolve when the source clears? Route based on alert content, not just the integration it arrived on? Opsgenie's alert policies were underrated — make sure your replacement covers the same ground.
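
To make the duplicate-suppression and auto-resolve questions concrete, here is a minimal in-memory sketch of the behavior to test for. The class and field names are illustrative and do not correspond to any specific vendor's API:

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    dedup_key: str  # stable identifier chosen by the alert source, e.g. "checkout:http-5xx"
    status: str     # "firing" or "resolved"


@dataclass
class AlertRouter:
    open_alerts: dict = field(default_factory=dict)
    pages_sent: list = field(default_factory=list)

    def handle(self, alert: Alert) -> str:
        if alert.status == "resolved":
            # Auto-resolve: the source cleared, so close the matching open alert if there is one.
            if self.open_alerts.pop(alert.dedup_key, None) is not None:
                return "resolved"
            return "ignored"
        if alert.dedup_key in self.open_alerts:
            return "suppressed"  # duplicate of an alert that is already open
        self.open_alerts[alert.dedup_key] = alert
        self.pages_sent.append(alert.dedup_key)
        return "paged"


router = AlertRouter()
events = [
    Alert("checkout:http-5xx", "firing"),
    Alert("checkout:http-5xx", "firing"),    # same problem reported again
    Alert("checkout:http-5xx", "resolved"),
    Alert("checkout:http-5xx", "firing"),    # new occurrence after recovery
]
print([router.handle(e) for e in events])
# ['paged', 'suppressed', 'resolved', 'paged']
```

A replacement that loses the deduplication key (for example, behind a generic webhook) cannot do the second step, and every repeat notification becomes a new page.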

Mobile app quality. On-call is a mobile-first job. The acknowledge-and-escalate flow needs to work reliably on a lock screen notification at 3 AM. Test the mobile app before committing.

PagerDuty

PagerDuty is the incumbent and the most mature platform in the category. It has been doing on-call since 2009, and the depth shows — 700+ integrations, multi-level escalation with conditional branching, event intelligence (ML-based alert grouping and suppression), and a mobile app that has had fifteen years of iteration.

The escalation engine is the deepest in this list. You can define policies that route by urgency, time of day, and service. The event orchestration layer lets you transform, suppress, or re-route alerts before they page anyone. If you have complex routing needs — "page the database team for Postgres alerts, but only if severity is critical and it's outside business hours" — PagerDuty can express that without custom code.
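
The quoted rule is a useful test case for any replacement. Written out as plain logic, it looks like the sketch below. This is not PagerDuty's configuration syntax or API; the alert fields, the team name, and the 09:00–17:00 weekday business hours are all assumptions made for illustration:

```python
from datetime import datetime
from typing import Optional


def outside_business_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on weekends, or on weekdays outside start_hour..end_hour (local time)."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (start_hour <= ts.hour < end_hour)


def route(alert: dict, now: datetime) -> Optional[str]:
    """Return the team to page, or None if this rule does not page anyone."""
    if (
        alert.get("service") == "postgres"
        and alert.get("severity") == "critical"
        and outside_business_hours(now)
    ):
        return "database-team"
    return None


alert = {"service": "postgres", "severity": "critical"}
print(route(alert, datetime(2026, 7, 8, 3, 0)))   # Wednesday 03:00 -> database-team
print(route(alert, datetime(2026, 7, 8, 14, 0)))  # Wednesday 14:00 -> None
```

If a candidate tool cannot express all three conditions in one rule, you end up maintaining this logic in a webhook relay instead.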

The downside is cost. Plans start at $21/user/month for the base tier and reach $49/user/month for the full platform (AIOps, analytics, status pages). For a 20-person on-call roster, that is $5,000–12,000/year. The platform also carries the weight of its age — the UI has layers of legacy concepts, and configuring event orchestration requires working through a learning curve that newer tools avoid.

Best for: mid-to-large engineering orgs with complex routing needs, deep integration requirements, and budget for the premium tier. Overkill for a 5-person startup.

Grafana OnCall

Grafana OnCall is the open-source option. You can self-host it (AGPLv3 license) or use the managed version on Grafana Cloud, which includes a free tier for up to 100 users.

The primary value proposition is native integration with the Grafana ecosystem. If you already use Grafana for dashboards and Grafana Alerting for your alert rules, OnCall plugs in without a new vendor relationship. Alerts flow directly from Grafana Alerting into on-call schedules and escalation chains — no webhook glue required.

The scheduling and escalation features cover the essentials: rotations, overrides, multi-step escalation, and notification through Slack, Telegram, phone, and SMS. The web UI is clean and functional. Where it lags behind PagerDuty is in the edges — the mobile app is newer and less polished, the alert routing logic is simpler (no ML-based grouping), and the integration catalog outside the Grafana ecosystem is smaller.

Best for: teams already invested in Grafana that want on-call without adding another vendor or another bill. Also strong for teams that value open source and want the option to self-host.

incident.io

incident.io started as an incident management tool and expanded into on-call. The result is a product where on-call and incident response are tightly coupled — when a page fires, the same tool handles the response, the communication, and the retrospective.

The differentiator is Slack-native workflows. Declaring an incident creates a dedicated channel, assigns roles, posts status updates, and tracks action items — all without leaving Slack. The on-call layer feeds directly into this: when an alert fires and nobody acknowledges, escalation can auto-declare an incident with the full response machinery attached.

On-call features include rotations, escalation policies, and a catalog-driven routing model where you define services and link them to teams. Pricing starts at $20/user/month for the on-call product, with incident management as a separate (or bundled) line item.

Best for: teams that run their incident response in Slack and want on-call tightly integrated with the declare-respond-retrospect lifecycle. Less compelling if your team uses Microsoft Teams or prefers a standalone on-call tool.

Rootly

Rootly occupies similar territory to incident.io — incident management with on-call — but differentiates on AI-powered retrospectives and broader chat platform support (both Slack and Microsoft Teams).

The incident timeline is the standout feature. Rootly automatically constructs a chronological record of actions taken during an incident — who was paged, what was acknowledged, which runbooks were triggered, what messages were posted. The retrospective template then pulls from this timeline, reducing the manual work of writing a postmortem.

On-call scheduling and escalation are solid but not as deep as PagerDuty's. The integration catalog is growing but smaller than the incumbents. Pricing is competitive with incident.io.

Best for: teams that value automated retrospectives and want incident management + on-call in one tool. Particularly relevant if your organization uses Microsoft Teams, where incident.io's Slack-native approach is a non-starter.

Better Stack

Better Stack takes a bundled approach: uptime monitoring, on-call, incident management, and status pages in one product. If you want to consolidate vendors — replace Opsgenie and your uptime monitoring tool at the same time — Better Stack is the most integrated option.

The on-call features are competent: rotations, escalation policies, multi-channel notifications (phone, SMS, Slack, Teams, email). The scheduling UI is straightforward. What makes it interesting for Opsgenie refugees is the monitoring layer underneath — you get HTTP, keyword, and heartbeat checks that feed directly into on-call without configuring a separate integration.

The trade-off is depth. Better Stack's on-call is solid for straightforward routing (alert fires, page the on-call engineer, escalate if unacknowledged), but it lacks the conditional routing and event orchestration that PagerDuty offers. For teams with simple on-call needs and a desire to reduce vendor count, that trade-off is acceptable.

Best for: small-to-mid teams that want monitoring + on-call + status pages in one subscription. Not for teams with complex multi-service routing needs.

Jira Service Management (JSM)

JSM is Atlassian's own migration path. The on-call features in JSM are, in large part, Opsgenie's features rebuilt into the JSM platform. If you are already paying for JSM Cloud, you get on-call included at no additional cost on Premium and Enterprise plans.

The integration with Jira is the obvious advantage. Alerts can create Jira issues. Incidents link to change requests. The service catalog connects to your CMDB. If your organization's workflow revolves around Jira, the operational data staying in the same platform has genuine value.

The downsides: JSM is an ITSM tool first, and on-call is one feature among many. The configuration surface is large, the UI carries Jira's complexity, and the mobile experience for on-call is embedded within the broader JSM app rather than being a focused paging tool. For more on the Opsgenie shutdown timeline and migration planning, see our detailed breakdown.

Best for: organizations already on Atlassian Cloud (Jira, Confluence, JSM) that want the simplest migration path and value tight Jira integration over a standalone on-call UX.

Migration decision framework

Dimension	PagerDuty	Grafana OnCall	incident.io	Rootly	Better Stack	JSM
Starting price	$21/user/mo	Free (Cloud)	$20/user/mo	Custom	$29/mo (team)	Included w/ JSM Premium
Integrations	700+	50+ (Grafana-native)	100+	80+	100+	200+ (Jira ecosystem)
Slack-native	Partial	Yes	Yes	Yes	Partial	No
Teams support	Yes	Yes	No	Yes	Yes	Yes
Open source	No	Yes (AGPLv3)	No	No	No	No
Bundled monitoring	No	Yes (Grafana Cloud)	No	No	Yes	No
Mobile app maturity	High	Medium	Medium	Medium	Medium	Medium
Event orchestration	Deep	Basic	Catalog-based	Basic	Basic	Moderate

There is no single best tool here. The decision depends on your existing stack:

Already on Grafana? Grafana OnCall is the lowest-friction path.
Run incidents in Slack? incident.io or Rootly, depending on whether you need Teams support.
Want the deepest routing engine? PagerDuty.
Want to consolidate monitoring + on-call? Better Stack.
Already paying for JSM Cloud Premium? On-call is included at no extra cost, and the platform is already familiar.
Budget-constrained? Grafana OnCall (free) or JSM (included).

Whichever tool you choose, the on-call layer is only as good as the alerts feeding it. Noisy, low-context alerts create alert fatigue regardless of how well the escalation policy is configured. The monitoring system upstream — what generates the alerts, how it classifies severity, and how quickly it detects problems — determines whether your on-call engineers get paged for real incidents or wake up for false positives.

The monitoring layer underneath

Your on-call tool routes alerts. Something else has to generate them.

For the multi-region monitoring and alerting layer that feeds your on-call tool — HTTP, DNS, TCP, and heartbeat checks with configurable notification policies and a public status page — take a look at DevHelm. Your first 50 monitors are free, with checks running from multiple regions and alerts routed to whichever on-call platform you picked from this list.

Originally published on DevHelm.

On-Call Rotation Best Practices for Engineering Teams

DevHelm — Wed, 08 Jul 2026 15:25:14 +0000

On-call is the tax engineering teams pay for running production systems. Every organization that ships software to users eventually reaches the point where someone needs to be reachable when things break at 2 AM. Done well, on-call is sustainable — a shared responsibility that distributes evenly across the team and improves with every incident. Done poorly, it burns people out, concentrates knowledge in a few overloaded individuals, and quietly degrades the product as exhausted engineers stop investing in reliability.

The difference between those outcomes is almost never the tooling. It is the structure: how rotations are designed, how escalations work, how burden is distributed, and how the team treats on-call as a first-class engineering practice rather than an afterthought.

Rotation structures

There is no single correct rotation model. The right choice depends on team size, geographic distribution, and the criticality of the services being covered.

Weekly rotation is the most common default. One engineer carries the pager for a full week, then passes it to the next person. Simple to understand, simple to schedule. The downside is that a bad week — multiple incidents, high alert volume — falls entirely on one person with no relief until the handoff.

Biweekly rotation extends the shift to reduce the overhead of context-switching between on-call and non-on-call weeks. It works when alert volume is low (fewer than five pages per week on average). Beyond that threshold, two weeks starts to feel punishing.

Follow-the-sun distributes coverage across time zones so no one takes overnight pages. A team with engineers in US Pacific, European, and Asian time zones can cover 24 hours without anyone waking up at 3 AM. The tradeoff is coordination cost: handoff quality becomes critical, and you need at least three engineers per timezone to avoid single points of failure.

Primary/secondary split assigns two engineers per shift. The primary takes the initial page; if they don't acknowledge within the escalation window (typically 5–10 minutes), the secondary gets paged. This provides redundancy without doubling the roster size.

Service-ownership rotation maps on-call to specific services rather than a team-wide roster. The payments team covers payments; the platform team covers infrastructure. This works at scale (50+ engineers) where generalist on-call produces too much context-switching, but requires well-defined service boundaries and ownership clarity.
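
As a rough illustration of the weekly and primary/secondary models above, the sketch below generates a schedule from a roster. The names, the start date, and the rule that next week's primary serves as this week's secondary are assumptions made for the example, not a prescribed policy.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (shift_start, primary, secondary) tuple per week.

    The secondary for a given week is the next person in the roster, so
    everyone serves as secondary the week before they become primary.
    """
    if len(engineers) < 2:
        raise ValueError("a primary/secondary split needs at least two engineers")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


roster = ["ana", "bo", "chen"]  # hypothetical roster
for shift_start, primary, secondary in build_rotation(roster, date(2026, 7, 13), 4):
    print(shift_start.isoformat(), primary, secondary)
```

Rotating the secondary one step ahead of the primary means each engineer gets a quieter week of context before carrying the pager.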

Schedule design

Shift length matters. Seven-day shifts are standard, but consecutive days beyond seven correlate strongly with burnout and error rates. If your rotation runs longer than a week, build in explicit rest days or shorten the active hours.

Overlap periods between shifts prevent the "not my problem anymore" gap. A 30-minute handoff window — where both outgoing and incoming engineer are reachable — catches incidents that fire right at the boundary.

Handoff rituals are the difference between "good luck" and useful context transfer. At minimum, the outgoing engineer should communicate: open incidents, anything flapping or degraded, recent deploys that haven't fully baked, and upcoming maintenance windows.

Weekend coverage deserves explicit design. Some teams split weekends separately from weekdays; others absorb them into the weekly rotation. The key decision is whether weekend pages carry additional compensation or comp time. Leaving this ambiguous creates resentment.

Time-zone-aware scheduling goes beyond follow-the-sun. Even within a single-timezone team, shift start times matter. Starting on Monday morning rather than Sunday night means the outgoing engineer doesn't stay up late on their last day.

Escalation policies

A page that goes unacknowledged is worse than no page at all — it means the system thinks someone is handling the incident while no one is. Escalation policies exist to guarantee every alert eventually reaches a human who acts on it.

Basic chain: Primary (acknowledge within 5 min) → Secondary (acknowledge within 5 min) → Engineering Manager → VP Engineering. Each tier adds urgency without skipping the people closest to the code.

Time-based escalation increases the responder pool as time passes without acknowledgment. This is the minimum viable escalation policy. If you have nothing else, implement this.
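
A minimal sketch of time-based escalation, assuming the basic chain above. The 5-minute windows for primary and secondary come from the text; the 10-minute wait before the manager escalates further is an assumption added for illustration.

```python
# Each tier is (role, minutes to wait for an acknowledgment before escalating).
# The final tier has no further escalation, so its wait is None.
CHAIN = [
    ("primary", 5),
    ("secondary", 5),
    ("engineering_manager", 10),  # assumed wait; not specified in the text
    ("vp_engineering", None),
]


def paged_so_far(minutes_unacknowledged, chain=CHAIN):
    """Return every role that has been paged after the given number of minutes."""
    paged = []
    elapsed = 0
    for role, wait in chain:
        paged.append(role)
        if wait is None:
            break
        elapsed += wait
        if minutes_unacknowledged < elapsed:
            break
    return paged


print(paged_so_far(7))
```

The responder pool only grows: earlier tiers stay in the list, so an acknowledgment from anyone in it can stop the escalation.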

Severity-based routing sends different severity levels to different responders. A P3 informational alert goes to Slack only. A P2 goes to the primary on-call. A P1 pages both primary and secondary immediately. A P0 pages the entire escalation chain simultaneously.
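
The severity mapping described above can be sketched as a lookup table. The role and channel names are placeholders, and rejecting unknown severities rather than guessing a route is a design choice made for this example.

```python
# Severity -> who is notified, following the P0-P3 description in the text.
ROUTES = {
    "P3": ["slack"],
    "P2": ["primary"],
    "P1": ["primary", "secondary"],
    "P0": ["primary", "secondary", "engineering_manager", "vp_engineering"],
}


def route(severity):
    """Return the list of targets for a severity, or raise on an unknown one."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


print(route("P1"))
```

Failing loudly on an unrecognized severity avoids the quiet case where a mislabeled alert reaches nobody.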

De-escalation matters too. When an incident resolves without needing backup, the secondary should be notified (not paged) so they know the situation is handled. Over-escalation erodes trust in the system and trains people to ignore pages.

Reducing on-call burden

The single highest-impact action for improving on-call is reducing alert noise. Teams that page 20+ times per week cannot retain engineers on the rotation. The target is fewer than five pages per on-call shift — ideally fewer than two that require actual intervention.

Alert quality over quantity. Every alert should be actionable. If the on-call engineer looks at a page and routinely says "I can ignore this," that alert should be tuned, suppressed, or deleted. Document what the responder should actually do in structured runbooks — not just what triggered the alert.

For the full policy model behind that cleanup, see Monitoring Alerts: The Definitive Guide to Alerting Without Alert Fatigue.

Auto-remediation for known responses. If the runbook for an alert is "restart the pod" or "clear the queue," that's a candidate for automation. Every automated response removes a page from the rotation permanently.

Blameless postmortems feed back into the system. When an incident occurs, the postmortem should ask: could this alert have been prevented? Could the responder have been given better information? Could the resolution have been automated? Each answer improves future on-call shifts.

Consolidation windows batch non-urgent alerts rather than paging for each one individually. A monitoring system that fires three separate alerts for related symptoms — high latency, increased errors, connection pool exhaustion — should consolidate into a single incident rather than paging three times.
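
One way to picture a consolidation window is the sketch below. It assumes alerts carry a timestamp, a service name, and a symptom, and that "related" simply means "same service within five minutes of the incident's first alert". Real systems usually apply richer grouping rules.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    opened_at: float
    alerts: list = field(default_factory=list)


def consolidate(alerts, window_seconds=300):
    """Group (timestamp, service, symptom) alerts into incidents.

    An alert joins the latest incident for its service if it arrives within
    window_seconds of that incident's first alert; otherwise it opens a new one.
    """
    incidents = []
    latest = {}  # service -> most recent Incident for that service
    for ts, service, symptom in sorted(alerts):
        current = latest.get(service)
        if current is None or ts - current.opened_at > window_seconds:
            current = Incident(service, ts)
            incidents.append(current)
            latest[service] = current
        current.alerts.append(symptom)
    return incidents


raw = [
    (0, "checkout", "high latency"),
    (40, "checkout", "increased errors"),
    (90, "checkout", "connection pool exhaustion"),
]
print(len(consolidate(raw)))
```

With this grouping, the three related symptoms from the example produce one incident and therefore one page.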

Compensation and fairness

On-call carries real cost to the engineer: interrupted sleep, restricted evenings, constrained weekend plans. Teams that treat this as "just part of the job" with no explicit compensation build resentment and face retention problems.

Paid on-call is the clearest model. Common structures include a flat per-shift stipend ($200–500/week in US markets), a per-page bonus, or a combination. Some organizations pay a higher rate for pages outside business hours.

Comp time offers time off in exchange for on-call shifts — typically 0.5–1 day per week of on-call. This works well when the team values time flexibility more than additional pay.

Rotation equity tracking ensures burden distributes fairly over time. Track pages-per-person, incidents-per-person, and weekend-shifts-per-person quarterly. If one engineer consistently gets paged more due to scheduling luck, adjust the rotation.

Opt-out for life events. Moving, new baby, medical issues, family emergencies — there should be an explicit, no-questions-asked path to temporarily exit the rotation. The team absorbs the extra coverage. Building this in structurally prevents people from burning through goodwill to get relief.
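
The rotation equity tracking described above can start as a simple count. This sketch assumes you can export one engineer name per page for the quarter; the 1.5x tolerance is an arbitrary illustrative threshold, not a recommendation from the text.

```python
from collections import Counter
from statistics import mean


def page_load(pages, roster):
    """Count pages per engineer, including engineers who received none."""
    counts = Counter(pages)
    return {name: counts.get(name, 0) for name in roster}


def overloaded(load, tolerance=1.5):
    """Return engineers whose page count exceeds tolerance x the team mean."""
    average = mean(load.values())  # load must contain at least one engineer
    return sorted(
        name for name, n in load.items() if average and n > tolerance * average
    )


pages = ["ana"] * 9 + ["bo"] * 2 + ["chen"]  # hypothetical quarter
load = page_load(pages, ["ana", "bo", "chen"])
print(load, overloaded(load))
```

The same shape works for incidents per person or weekend shifts per person by swapping the input list.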

Common anti-patterns

The hero. One engineer who "doesn't mind" being on-call all the time. They accumulate all the context, make the rotation smaller for everyone else, and become an invisible single point of failure. When they leave — and they eventually leave — the team discovers how much tribal knowledge walked out the door.

Alerts that cry wolf. Pages that fire but require no action train the on-call engineer to ignore alerts. This is the most dangerous anti-pattern: it directly causes real incidents to get slow responses because the engineer assumes "probably nothing again." See Opsgenie alternatives for tools that help with noise reduction at the routing layer.

No handoff notes. The incoming engineer starts their shift blind, with no context about what has been flapping, what was recently deployed, or what the previous shift was watching. Every shift start should include at minimum a 5-minute written or async handoff.

Scope creep punishment. "You're on-call anyway, so can you also handle this customer request / deploy this hotfix / review this PR?" On-call is for incident response. Loading unrelated work onto on-call engineers makes the rotation feel punitive and discourages participation.

No feedback loop. Pages fire, engineers respond, nothing changes. Without postmortems that feed back into alert tuning, runbook updates, and automation investments, on-call burden stays static or grows. The rotation should get measurably better each quarter.

Start with the signals

Rotation design determines who gets woken up and how quickly they respond. But the quality of on-call depends entirely on the quality of the alerts feeding it. A rotation staffed by great engineers still fails if the monitoring underneath generates noise instead of signal.

On-call works when the alerts feeding it are precise. Multi-region monitoring with configurable severity and notification policies reduces noise before it reaches your rotation — start at app.devhelm.io.

Originally published on DevHelm.

SSL Certificate Monitoring: Prevent Outages Before Your Certs Expire

DevHelm — Wed, 08 Jul 2026 15:24:37 +0000

A Let's Encrypt certificate is valid for 90 days. When auto-renewal fails silently (a DNS record changed, an ACME challenge path broke during a migration, a permissions change on the webroot), the certificate eventually expires, your users see a browser security warning, and your site is effectively down. Browsers refuse to load the page, API clients reject the connection, and mobile apps show a blank error screen.

The frustrating part: every one of these outages is preventable. Certificate expiry is not unpredictable. The expiration date is baked into the certificate itself, readable by any TLS client. The only reason teams get bitten is that nobody was checking.

Why SSL certificates expire (and why auto-renewal fails)

TLS certificates have a maximum validity period — 90 days for Let's Encrypt, up to 398 days for commercial CAs. This is by design: shorter lifetimes reduce the window during which a compromised private key can be exploited.

Most teams rely on automated renewal through ACME clients like certbot, acme.sh, or cloud-native solutions (AWS ACM, Cloudflare Origin CA). When these work, you never think about certificate expiry. When they fail, the failure is silent.

Common failure modes:

DNS changes that break ACME challenges. You migrate DNS from one provider to another. The ACME DNS-01 challenge was configured for the old provider's API. Certbot tries to renew, cannot create the required TXT record, and the failure goes unnoticed. With certbot's default of renewing about 30 days before expiry, the certificate keeps serving for roughly 30 more days until it expires.

Infrastructure migrations that break file paths. You move from nginx to a reverse proxy. The /.well-known/acme-challenge/ path no longer routes to the certbot webroot. HTTP-01 challenges fail. The cert renewal log shows errors, but nobody is watching that log.

Permissions and credential rotation. Cloud-provider IAM credentials used by the ACME client get rotated. The renewal script runs as a different service account than when it was first configured. It fails with a 403, and the retry logic gives up after three attempts.

Vendor certificates you do not control. A SaaS dependency serves its API behind a certificate you cannot renew. A CDN edge node has its own certificate lifecycle. A legacy internal service uses a certificate issued by an internal CA with its own renewal cadence. None of these are in your automation — they expire on their own schedule.

What SSL certificate monitoring actually checks

A useful SSL monitoring system goes beyond "is the certificate expired right now." By the time the answer is "yes," the outage is already happening. Meaningful checks include:

Days until expiry. The primary signal. Alerting at 30 days gives your team time to investigate a broken renewal pipeline before it becomes an outage. Alerting at 14 days is the escalation threshold — something is wrong, and it needs attention today.

Certificate chain validity. An expired intermediate certificate breaks the chain even when the leaf certificate is current. Older clients (Android < 7.1, some embedded devices) that do not perform AIA fetching will reject a connection with a missing intermediate.

Hostname matching. A certificate issued for *.example.com does not cover example.com (the bare apex) unless explicitly included as a SAN. After a CDN migration or load balancer swap, the served certificate may not match the requested hostname.

Issuer changes. If the certificate issuer changes unexpectedly — say, your Let's Encrypt cert is suddenly signed by an unknown CA — that is a signal worth investigating. It may indicate a CDN misconfiguration, a MITM proxy in the path, or a compromised renewal pipeline.
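
The days-until-expiry signal can be computed with Python's standard library alone. The sketch below is illustrative: the 30-day and 14-day thresholds come from the text, while the function names and the choice to treat "fewer than N days" as crossing the threshold are assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_remaining(not_after, now):
    """Whole days between a timezone-aware `now` and a notAfter string."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def classify(days, warn_at=30, fail_at=14):
    """Map days remaining to 'ok', 'warn', or 'fail'."""
    if days < fail_at:
        return "fail"
    if days < warn_at:
        return "warn"
    return "ok"


def fetch_not_after(host, port=443, timeout=5.0):
    """Read notAfter from the certificate a server presents (needs network).

    The default context verifies the chain and hostname, so an already
    expired or mismatched certificate raises an ssl.SSLError here instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


now = datetime(2026, 7, 8, 12, 0, tzinfo=timezone.utc)
print(classify(days_remaining("Jul 28 12:00:00 2026 GMT", now)))
```

Calling `classify(days_remaining(fetch_not_after("example.com"), datetime.now(timezone.utc)))` would check a live host. A verification error from the handshake is itself a useful signal for the chain and hostname checks listed above.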

When you need dedicated SSL monitoring

If you have a single domain with one certificate behind a managed provider (ACM, Cloudflare), and your infrastructure never changes, you might survive without monitoring. Everyone else needs it:

Multiple domains and subdomains. Each certificate has its own renewal lifecycle. Ten domains means ten independent renewal processes, each with its own failure modes. One forgotten subdomain is all it takes.

Wildcard certificates. Wildcards cover *.example.com but not nested subdomains (api.staging.example.com). Teams assume the wildcard covers everything and discover the gap at 2 AM when the staging API breaks.

Internal services with self-signed certificates. Internal CA certificates have their own expiry. The internal PKI that issues them may not have automated renewal. The operations team that set it up two years ago may no longer be on the team.

Vendor certificates you depend on. Your payment gateway, your authentication provider, your CDN — all serve certificates you cannot renew. If their certificate expires or their chain breaks, your integration fails. Monitoring their certificate from outside tells you about the problem before your on-call pager does.
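
The wildcard gap is easy to verify with a small matcher. This is a simplified sketch of the common rule that a wildcard stands in for exactly one left-most label. It ignores partial wildcards such as f*.example.com, and the function names are made up for the example.

```python
def san_covers(pattern, hostname):
    """True if one SAN entry covers the hostname.

    A wildcard is honored only as the entire left-most label and matches
    exactly one label, so it never spans a dot.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def cert_covers(sans, hostname):
    """True if any SAN entry on the certificate covers the hostname."""
    return any(san_covers(pattern, hostname) for pattern in sans)


for name in ("api.example.com", "example.com", "api.staging.example.com"):
    print(name, cert_covers(["*.example.com"], name))
```

Running a check like this over your full hostname inventory shows which names need their own SAN entry or a second wildcard.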

Approaches to SSL monitoring

Manual calendar reminders. You look up the expiration date, set a calendar event 30 days before, and hope whoever gets the reminder knows what to do. This breaks the moment someone changes teams, the domain list grows, or the certificate gets replaced early (resetting the expiry date without updating the calendar).

Cron scripts with openssl. A bash script runs openssl s_client -connect example.com:443 and parses the notAfter date. This works until the script's host is down, the output format changes between OpenSSL versions, or the alert channel it writes to gets archived. It also only checks from one location — useless for catching regional CDN cert issues.

Assertion-based monitoring. An HTTP monitor runs from multiple regions on a fixed schedule and includes an SSL expiry assertion alongside its other checks — status code, response time, body content. The SSL check is part of the monitor, not a separate system. When the certificate drops below the threshold, it triggers the same alert pipeline as a 500 error or a timeout. This is the approach that scales.

Setting up SSL monitoring with assertions

DevHelm's HTTP monitors support an ssl_expiry assertion that checks the certificate's remaining validity on every request. You configure the minimum days remaining, and the assertion fails when the certificate crosses that threshold.

A YAML configuration that monitors a production API with both a warning and a failure threshold:

name: Production API SSL
type: http
url: https://api.example.com/health
frequency: 300s
regions:
  - us-east
  - eu-west
  - ap-southeast
assertions:
  - type: status_code
    value: 200
  - type: ssl_expiry
    minDaysRemaining: 30
    severity: warn
  - type: ssl_expiry
    minDaysRemaining: 14
    severity: fail

A similar monitor via CLI, setting only the 30-day threshold:

devhelm monitor create \
  --type http \
  --url https://api.example.com/health \
  --frequency 300 \
  --regions us-east,eu-west,ap-southeast \
  --assertion "status_code=200" \
  --assertion "ssl_expiry>=30"

The minDaysRemaining: 30 threshold fires a warning when the certificate has 30 days left — enough time to investigate why auto-renewal is failing, fix the issue, and verify the fix before the certificate actually expires. The minDaysRemaining: 14 threshold fires a failure alert — this is the escalation point.

Multi-region coverage matters here. A CDN edge node may have an expired or misconfigured certificate in one region while other regions serve a valid cert. If you only monitor from one location, you will not see the problem until users in that region report it. Running the assertion from multiple regions catches the discrepancy — the same monitor passes from us-east and fails from eu-west, telling you exactly where the problem is.

For teams managing their monitors as code, the ssl_expiry assertion fits into the same version-controlled config as your other monitoring definitions — reviewed in PRs, deployed through CI, reproducible across environments. See monitoring as code for the full workflow.

Beyond expiry: what else to monitor

Certificate expiry is the most common failure, but it is not the only one:

Certificate chain changes. If the chain your server presents changes — different intermediate, different root, different leaf issuer — that is worth alerting on. It can indicate a CDN misconfiguration (wrong origin pull), a man-in-the-middle proxy that was not there yesterday, or an unintended renewal that picked up a different issuer. For more on what SSL errors mean and how to diagnose them, the chain is usually where it starts.

Protocol downgrades. A server that suddenly negotiates TLS 1.0 instead of TLS 1.3 may have a misconfigured load balancer or a fallback rule that should not be active. Compliance frameworks (PCI DSS 4.0) require TLS 1.2 minimum — a protocol downgrade is a compliance violation before it is a security one.

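Mixed content and HSTS gaps. A site that serves HTTPS but loads resources over HTTP gets degraded in browsers. If your monitoring confirms the TLS connection is valid but users still see warnings, the problem may be mixed content rather than the certificate itself. A missing Strict-Transport-Security header is the related gap: without it, browsers may still send requests over plain HTTP before being redirected.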

Building SSL monitoring into your stack

SSL certificate monitoring is not a separate discipline. It is one assertion on an HTTP monitor you are probably already running. The same monitor that checks your API returns a 200 in under 500ms can also check that the certificate will not expire in the next 30 days, using the same alert pipeline and the same on-call routing.

For a broader view of how SSL monitoring fits alongside uptime, latency, and content checks, see the best website monitoring tools comparison.

The practical setup for most teams: add an ssl_expiry assertion with minDaysRemaining: 30 to every external-facing HTTP monitor. For critical services — payment endpoints, authentication providers, API gateways — add a second assertion at 14 days with failure severity. For vendor dependencies you cannot renew yourself, the monitoring is your only early-warning system.

Add an ssl_expiry assertion to any HTTP monitor in 60 seconds — from the dashboard, the CLI, Terraform, or YAML config. Start at app.devhelm.io, free for your first 50 monitors.

DNS Monitoring: What to Track, Why It Breaks, and How to Set It Up

DevHelm — Wed, 08 Jul 2026 15:24:01 +0000

Every request your application serves starts with a DNS lookup. If that lookup fails — or returns the wrong IP — your perfectly healthy server is unreachable. The database is fine. The load balancer is fine. But users see a blank page because a record expired, a zone file has a typo, or a resolver is returning stale data.

Most outages traced to "DNS issues" were detectable hours before users noticed. The record was already wrong. Nobody was checking.

This guide covers what DNS monitoring tracks, why DNS breaks in production, the types of checks that catch real problems, and how to set it up with automated assertions.

What DNS monitoring tracks

DNS monitoring verifies that your domain's DNS infrastructure is healthy across five dimensions.

Resolution success. Can the domain be resolved at all? An NXDOMAIN response or a SERVFAIL means the domain is effectively offline for anyone whose cache has expired. This is the baseline — if resolution fails, nothing else matters.

Response time. A healthy DNS lookup completes in under 50 ms. When resolution crosses 200 ms, it adds perceptible latency to every new connection. Slow DNS is slow everything: every HTTPS handshake, every API call, every page load starts with a lookup. For a deeper dive into diagnosing resolution latency, see How to Fix Slow DNS Lookup.

Record accuracy. The resolved values need to be correct, not just present. An A record pointing to a decommissioned IP, a CNAME targeting a deleted CDN distribution, or a missing MX record silently breaks traffic routing, email delivery, or TLS certificate validation. Accuracy checks verify that A/AAAA records match expected IPs, CNAME records point to the right targets, and TXT records contain the correct SPF, DKIM, and domain verification strings.

TTL health. Time-to-live values control how long resolvers cache a record. A TTL that's too low (under 60 seconds) forces constant re-resolution, creating unnecessary load on authoritative servers and adding latency. A TTL that's too high (over 86,400 seconds) means changes take a day or more to propagate — dangerous during a migration or an incident. Monitoring TTL drift catches both extremes before they cause problems.

Propagation consistency. A record change that's visible in us-east might still be stale in eu-west for hours, depending on TTL and resolver cache behavior. Multi-region DNS checks detect propagation failures that single-location monitoring misses entirely.

Why DNS breaks

DNS failures rarely look like DNS failures. They look like "the site is down" or "email stopped arriving" or "the CDN is serving the old version." Here are the actual root causes.

Registrar expiry. The domain registration lapses because the renewal credit card expired. The registrar points the nameservers to a parking page. Every record disappears. This happens to large companies more often than anyone admits.

Zone file typos. A missing trailing dot on a CNAME target, a transposed octet in an A record, or a malformed SPF string in a TXT record. The change passes the DNS provider's syntax check but breaks resolution for specific record types.

TTL misconfiguration. Setting TTL to 60 seconds before a migration (to speed propagation) and forgetting to raise it afterward creates a thundering herd — every resolver re-queries your authoritative server every minute instead of every hour. Conversely, a 24-hour TTL on a record you're about to change means the old value persists in caches long after you've updated it.

DNS hijacking. An attacker modifies DNS responses — through cache poisoning, BGP hijacking, or compromised registrar credentials — to redirect traffic to a server they control. Without record-value assertions, you won't know until users report seeing a different site. Cloudflare's DNSSEC documentation covers how DNSSEC validation protects against some of these vectors, but DNSSEC misconfiguration is itself a common source of outages.

Provider outages. Your authoritative DNS provider has an incident. If you run a single-provider setup with no secondary, resolution fails for every domain hosted there.

Types of DNS checks

DNS monitoring maps to four categories, each catching a different failure class.

Resolution checks

The most basic assertion: does the domain resolve? A dns_resolves check queries the domain and passes if it gets a valid answer: no NXDOMAIN, no SERVFAIL, and no timeout. This catches expired domains, deleted zones, and authoritative server outages.

Performance checks

DNS response time directly affects connection setup latency. A dns_response_time assertion fails when resolution exceeds a threshold (e.g., 500 ms), catching overloaded resolvers, network path degradation, or authoritative server issues before they compound into visible user-facing slowness. A dns_response_time_warn variant produces a warning instead of a failure for softer thresholds.

Accuracy checks

Record-value assertions verify that DNS returns what you expect:

IP matching (dns_expected_ips) — A/AAAA records resolve to addresses in your configured allow-list. Catches migrations where old IPs linger and hijacking where IPs change without authorization.
CNAME verification (dns_expected_cname) — CNAME records point to the expected target. Critical for CDN configurations where a wrong CNAME means serving from the wrong origin.
TXT validation (dns_txt_contains) — TXT records contain the correct SPF, DKIM, or domain verification strings. A broken SPF record means your email gets flagged as spam.
Exact match (dns_record_equals) and substring match (dns_record_contains) — verify any record type against an expected value.
Record count (dns_min_answers, dns_max_answers) — verify records haven't been silently deleted or duplicated. A domain that should have two A records for failover dropping to one is a signal, even though DNS still "works."
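
To make the accuracy checks concrete, here is a sketch of how allow-list and record-count assertions could be evaluated against a list of resolved addresses. It illustrates the idea only and is not DevHelm's implementation; the function names and pass/fail semantics are assumptions.

```python
def check_expected_ips(answers, allowed):
    """Pass only if there is at least one answer and all are in the allow-list.

    Returns (passed, unexpected_addresses).
    """
    unexpected = sorted(set(answers) - set(allowed))
    return (bool(answers) and not unexpected), unexpected


def check_answer_count(answers, minimum=1, maximum=None):
    """Pass if the number of distinct answers is within [minimum, maximum]."""
    count = len(set(answers))
    if count < minimum:
        return False
    return maximum is None or count <= maximum


allowed = ["203.0.113.10", "203.0.113.11"]
resolved = ["203.0.113.10", "198.51.100.7"]  # hypothetical lookup result
print(check_expected_ips(resolved, allowed))
print(check_answer_count(["203.0.113.10"], minimum=2))
```

Returning the unexpected addresses alongside the verdict gives the responder the value to investigate, which matters when an unfamiliar IP may indicate hijacking.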

Health checks

TTL assertions monitor cache hygiene. dns_ttl_low warns when any record's TTL drops below a floor (catching the "forgot to raise TTL after migration" pattern), while dns_ttl_high warns when TTL exceeds a ceiling (catching stale-cache risk before a planned change).

Multi-region DNS monitoring

DNS propagation is not instant. When you update a record, the old value persists in every resolver's cache until its TTL expires. A record change at 14:00 UTC with a 3,600-second TTL won't be fully propagated until 15:00 UTC — and that's the optimistic case. Some resolvers ignore TTL or cap it at their own maximum.

Running DNS checks from a single location tells you whether that resolver sees the correct value. It says nothing about what users in other regions see. Multi-region monitoring catches the scenarios that actually cause user-facing incidents: a propagation failure that affects Frankfurt but not Virginia, a geo-DNS rule returning wrong IPs for Asian resolvers, or a CDN CNAME that's correct in one region and stale in another.
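
The propagation arithmetic and the per-region comparison can be sketched as follows. The region names mirror the ones used elsewhere in this article, the lookup results are invented for the example, and the deadline applies only to resolvers that honor TTL.

```python
from datetime import datetime, timedelta, timezone


def fully_propagated_by(changed_at, old_ttl_seconds):
    """Time by which TTL-respecting resolvers have dropped the old value."""
    return changed_at + timedelta(seconds=old_ttl_seconds)


def stale_regions(answers_by_region, expected):
    """Return regions whose resolved addresses differ from the expected set."""
    expected = set(expected)
    return sorted(
        region
        for region, answers in answers_by_region.items()
        if set(answers) != expected
    )


change = datetime(2026, 7, 8, 14, 0, tzinfo=timezone.utc)
print(fully_propagated_by(change, 3600).isoformat())

observed = {  # hypothetical per-region lookup results
    "us-east": ["203.0.113.10"],
    "eu-west": ["198.51.100.7"],
    "ap-southeast": ["203.0.113.10"],
}
print(stale_regions(observed, ["203.0.113.10"]))
```

If a region still appears in the stale list after the computed deadline, the cause is more likely a resolver ignoring TTL or a geo-DNS rule than ordinary cache lag.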

Setting up DNS monitoring

DevHelm's DNS monitor type supports stacking multiple assertions on a single check, so you can verify resolution, performance, accuracy, and TTL health in one monitor running from multiple regions.

YAML configuration

A monitoring-as-code configuration covering the most common assertions:

monitors:
  - name: "Production DNS - example.com"
    type: dns
    target: example.com
    frequency_seconds: 300
    regions:
      - us-east
      - eu-west
      - ap-southeast
    assertions:
      - type: dns_resolves
      - type: dns_response_time
        max_ms: 500
      - type: dns_expected_ips
        values:
          - "203.0.113.10"
          - "203.0.113.11"
      - type: dns_ttl_low
        min_ttl: 120
      - type: dns_ttl_high
        max_ttl: 43200

CLI

Create the same monitor from the command line:

devhelm monitor create \
  --type dns \
  --target example.com \
  --frequency 300 \
  --region us-east --region eu-west --region ap-southeast \
  --assertion dns_resolves \
  --assertion "dns_response_time<500" \
  --assertion "dns_expected_ips=203.0.113.10,203.0.113.11" \
  --assertion "dns_ttl_low>120" \
  --assertion "dns_ttl_high<43200"

Common configurations

Domain migration monitoring. Before cutting DNS to a new provider, add dns_expected_ips with the new IP addresses and run checks from all regions. Once every region returns the new IPs consistently, the migration is complete.

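Email deliverability. Monitor your SPF and MX records to catch silent breakage. DKIM keys are published under selector._domainkey hostnames, so they need a separate monitor per selector: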

devhelm monitor create \
  --type dns \
  --target example.com \
  --assertion dns_resolves \
  --assertion "dns_txt_contains=v=spf1" \
  --assertion "dns_min_answers:mx>=1"

CDN CNAME verification. Verify that your CDN CNAME hasn't drifted:

devhelm monitor create \
  --type dns \
  --target cdn.example.com \
  --assertion "dns_expected_cname=d1234.cloudfront.net"

When DNS monitoring fires

When an alert triggers, the assertion type tells you where to look:

dns_resolves fails — check authoritative nameserver health, domain registration status, and zone file presence. Run dig +trace example.com to find where the resolution chain breaks.
dns_response_time exceeds threshold — compare response times across resolvers. If slow from all regions, the authoritative server is overloaded or rate-limiting. If slow from one region, it's a network path issue.
dns_expected_ips mismatch — verify the zone file. If the IP is one you don't recognize, investigate immediately — this is a hijacking signal.
dns_ttl_low fires — someone set a low TTL during a migration and forgot to raise it. Update the TTL in your DNS provider.
dns_txt_contains fails — check whether a recent zone change removed or modified your SPF/DKIM records. Email deliverability may already be degraded.

A well-configured DNS monitor with stacked assertions turns these from "something feels off" into a specific, actionable alert within minutes — not hours after users start complaining. For more on building an API monitoring layer alongside DNS checks, see our guide on the tools developers actually use.

DevHelm's DNS monitors check resolution, response time, record values, and TTL health from multiple regions — catching propagation failures and hijacking before users notice. Start at app.devhelm.io, free for your first 50 monitors.

Monitoring Alerts: The Definitive Guide to Alerting Without Alert Fatigue

DevHelm — Wed, 08 Jul 2026 15:23:24 +0000

How do you set up monitoring alerts that wake the responsible on-call engineer, but do not train the team to ignore the pager? Your API can fail from two regions while Slack gets five duplicate warnings. A certificate can expire in 30 days and still page someone at 3 AM. A checkout outage can go to a general engineering channel because nobody set the owner.

Alerting fails when engineers have not defined which cases should wake the on-call engineer. A short warning gets treated like a production outage. Three signs of the same dependency failure become three incidents. Database alerts reach people who cannot fix the database. The fix starts with a clear decision: keep alerts that need action now, downgrade alerts that can wait, and remove alerts nobody uses.

Monitoring alerts connect a failed check to a human response. That connection is expensive. It can stop work, wake someone from sleep, or break a weekend. If the alert is real, that cost is worth it. If the alert is unclear, repeated, or impossible to act on, it creates alert fatigue.

This guide explains a simple rule: every page should be urgent, useful, and require a human. We will cover what to alert on, how severity should control routing, how a notification policy reduces noise, how alert quality affects the on-call engineer, and how to set up DevHelm so the right channel gets the right alert.

Why monitoring alerts matter

Monitoring answers "is the system healthy?" Alerting answers "who needs to act now?"

Those are different questions. A dashboard can show many metrics because people open it when they need details. An alert is pushed to a person. It needs a higher bar.

A healthy monitoring system usually has three output paths:

Signal type	Destination	Example
Urgent and actionable	Page on-call	Checkout API returns 500s from multiple regions
Important but not urgent	Ticket or Slack	TLS certificate expires in 21 days
Useful context only	Dashboard or log	CPU crossed 70% for two minutes

The mistake is sending all three to the same place. A Slack channel with deploy notes, warning alerts, dependency pings, and real outages becomes a junk drawer. A phone alert for non-urgent warnings becomes noise. The alerting layer should sort the message before a human sees it.
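
One way to read the three output paths is as a decision on two questions: is it urgent, and can someone act on it? The sketch below encodes that reading. Mapping "important but not urgent" to "actionable but not urgent" is an interpretation, and the destination labels are placeholders.

```python
def destination(urgent, actionable):
    """Pick an output path for a signal based on urgency and actionability."""
    if urgent and actionable:
        return "page_on_call"
    if actionable:
        return "ticket_or_slack"  # important, but it can wait for business hours
    return "dashboard_or_log"  # context only; nobody is interrupted


print(destination(urgent=True, actionable=True))    # checkout API returning 500s
print(destination(urgent=False, actionable=True))   # certificate expires in 21 days
print(destination(urgent=False, actionable=False))  # CPU briefly crossed 70%
```

An urgent signal with no available action also lands on the dashboard path here, since a page with no next step only produces noise.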

Start with good monitoring coverage: uptime checks, DNS checks, SSL checks, API checks, and the user journeys that prove your product works. This guide starts after those checks exist. The goal is to turn many check results into a small number of trusted monitoring alerts.

The right approach is easier to see side by side:

Bad approach	Right approach
Every failed check pages the on-call engineer	Only urgent, user-impacting failures page
Warnings and outages use the same channel	Critical, warning, and info alerts use different routes
Alerts go to a general engineering channel	Alerts go to the responsible team or on-call rotation
One outage creates five separate alerts	Related alerts are grouped into one incident
Old noisy alerts stay forever	Unused alerts are deleted or downgraded
Planned maintenance pages the team	Maintenance windows mute only expected alerts

Alert fatigue: what to page, downgrade, or remove

Alert fatigue happens when engineers receive so many low-value alerts that they stop trusting the alert channel. This is usually a process problem, not a personal problem, and the common causes are predictable.

The alert has no clear action. "CPU is high" tells the responder almost nothing. Is customer traffic failing? Is a queue backing up? Is there a deploy in progress? If the alert does not point to a next step, it should not page.
The alert is not urgent. A certificate that expires in 30 days matters, but it does not need a 3 AM phone call. It belongs in Slack, email, or a ticket queue. A certificate that expires tomorrow may need business-hours escalation. An expired production certificate needs a page.
The alert repeats another alert. One database outage can trigger API errors, checkout failures, queue warnings, failed jobs, and status page changes. If those arrive as separate alerts, the on-call engineer spends the first ten minutes sorting them instead of fixing the issue.
The alert has no owner. A general engineering channel is not ownership. Every alert should map to a service, a team, or an on-call rotation. "Someone should look" usually means nobody will.
The alert has too little context. The responder needs the failed service, severity, region, status, runbook link, dashboard link, and escalation path. An unclear alert makes the engineer rebuild context while half awake.

The durable fix is a notification policy that sorts alerts before they reach people.

Monitoring alerts: signal vs noise

Good alerts start from user impact. Bad alerts start from a metric that was easy to graph.

Use this test: if the alert fires, can the responder take a clear action right now to protect users or revenue? If yes, it may deserve a page. If no, downgrade it.

Strong monitoring alerts usually look like this:

Case	Why it matters	Best route
Customer-facing endpoint is down from more than one region	Users are blocked, not just one probe	Page the on-call engineer
Error rate crosses the service's SLO burn threshold	You are spending the error budget too fast	Page or Slack with notification, based on severity
P95 or P99 latency stays above the user-visible limit	The product is slow enough for users to feel it	Slack with notification, page if revenue path is affected
Core background job stops making progress	Work is stuck even if the API still returns 200	Page if user data or billing is affected
Database load is high because one user runs repeated expensive queries	One customer can slow the whole product for everyone	Slack with notification, page if production health is at risk
Dependency needed for checkout, login, billing, or deployment is unreachable	Your product may fail because another service failed	Page for critical paths, Slack for non-critical paths
Certificate is expired or close enough to expiry that renewal probably failed	Users or API clients may be blocked soon	Slack or email early, page when expired
DNS records no longer match expected values	Traffic may go to the wrong place	Page for production domains

For database load, make the alert specific. "DB load high" is weak. "User 123 is running a 10-year count query 80 times in 5 minutes and production latency is rising" is useful. That tells the responsible engineer what to check and what to stop.

Weak alert candidates should usually be downgraded:

Case	Better route	Why
Single-server CPU or memory usage without customer impact	Dashboard or log warning	It may be useful context, but it is not a page
One failed check from one region that recovers on the next run	Log warning or silent Slack	One probe can fail because of network noise
A warning that has fired every week for months without action	Delete, tune, or make a ticket	Repeated ignored alerts create alert fatigue
Informational deploy events	Log or silent Slack	Useful history, not an alert
Errors from non-production routed to production on-call	Team Slack only	Staging should not wake production on-call
Runbook says "wait and see"	Dashboard or ticket	If no action is expected, it should not page

Some important alerts start inside your code, not only in external monitoring. Use your logger and error tools to mark business-critical cases.

Code-level signal	Log level	Sentry/Grafana rule	Best route
Parser fails for a top-tier customer feed	Critical/error	Alert when failures > 0 for that customer in 5 minutes	Page responsible engineer
Billing webhook signature fails repeatedly	Critical/error	Alert when failures spike above baseline	Page on-call engineer
Login token validation fails for many users	Critical/error	Alert on error rate by endpoint and status code	Page on-call engineer
Data export job misses its SLA	Warning	Alert when job age exceeds expected window	Slack with notification
Optional enrichment API times out	Warning	Track rate and latency on dashboard	Slack or ticket
Debug-only parsing mismatch in staging	Warning/info	Dashboard only	Log warning

For example, if your app has parsers that extract data for top-tier clients, do not rely only on a generic "worker failed" alert. Add a clear code-level signal that carries fields such as:

client tier: top-tier, paid, trial
parser name: invoices, shipments, uptime imports, HTML status parser
input source: webhook, S3 file, API sync
failure reason: schema mismatch, HTML changed, selector missing, auth error, empty response, timeout
affected customer or workspace id
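One way to carry those fields is a single structured log event per failure. The sketch below uses only Python's standard library; the logger name, the function names, and the choice of log level per tier are illustrative assumptions, not part of any specific tool.

```python
import json
import logging

logger = logging.getLogger("parsers")  # hypothetical logger name


def parser_failure_event(client_tier, parser_name, input_source,
                         failure_reason, workspace_id):
    """Build one structured event carrying the fields listed above."""
    return {
        "event": "parser_failed",
        "client_tier": client_tier,        # top-tier, paid, trial
        "parser_name": parser_name,        # invoices, shipments, ...
        "input_source": input_source,      # webhook, S3 file, API sync
        "failure_reason": failure_reason,  # schema mismatch, timeout, ...
        "workspace_id": workspace_id,
    }


def log_parser_failure(**fields):
    """Emit the event as one JSON line so alert rules can filter on it."""
    event = parser_failure_event(**fields)
    # Assumption: top-tier failures log at ERROR, everything else at WARNING.
    level = logging.ERROR if event["client_tier"] == "top-tier" else logging.WARNING
    logger.log(level, json.dumps(event, sort_keys=True))
    return event
```

Because every field is a named key, a rule in your error or log tool can filter on `client_tier` or `failure_reason` instead of parsing a free-text message.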

Then build rules on top of that signal:

Critical: top-tier parser fails and no successful run happens within the next 5 minutes.
Warning: paid-customer parser has more than 3 failures in 30 minutes but later recovers.
Log only: trial-customer parser has one malformed optional field and the import still succeeds.
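The three rules above can be sketched as a small classifier. This is a minimal illustration, not a Sentry or Grafana rule definition: the `ParserFailure` class, the `classify` function, and the fallback to log-only for cases the rules do not mention are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ParserFailure:
    client_tier: str      # "top-tier", "paid", or "trial"
    failed_at: datetime


def classify(failure: ParserFailure,
             failures_last_30_min: int,
             last_success: Optional[datetime],
             now: datetime) -> str:
    """Return "critical", "warning", or "log" for one parser failure."""
    recovered = last_success is not None and last_success > failure.failed_at
    grace_over = now - failure.failed_at >= timedelta(minutes=5)

    # Critical: top-tier parser failed and nothing succeeded within 5 minutes.
    if failure.client_tier == "top-tier" and grace_over and not recovered:
        return "critical"
    # Warning: paid-customer parser failed more than 3 times in 30 minutes.
    if failure.client_tier == "paid" and failures_last_30_min > 3:
        return "warning"
    # Assumption: anything the rules above do not cover is log-only.
    return "log"
```

The function is meant to be evaluated when the five-minute grace period ends, so a top-tier parser that recovers on its next run never reaches the pager.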

Parser alerts need a clear failure reason. "Parser failed" is too broad. If your parser reads HTML from a third-party site and that site changes its markup, the alert should say what changed:

expected selector was not found
table column count changed
required JSON-LD field is missing
date format changed
login page appeared instead of data page
response is empty or blocked

That detail makes the fix much faster. The responsible engineer can open the parser code, update the selector or schema handling, add a test case for the new HTML, and deploy the fix. Without the reason, they first have to reproduce the failure, compare old and new HTML, and guess what broke.

Tools like Sentry can turn repeated exceptions into issues and alert rules. Grafana can turn logs and metrics into dashboards and alert rules. The important part is the filter: do not alert on every exception. Alert on the exceptions that match a real business case, a responsible owner, and a clear action.

Logging is the raw material for many code-level alerts. If your team is still deciding how to structure logs in Node.js, see Winston vs Pino: Choosing a Node.js Logger in 2026. For the broader split between logs, metrics, and alerts, see Monitoring and Logging: How They Work Together.

Defining alert rules in DevHelm, Sentry, and Grafana

Most teams use more than one alert source. That is fine, but each tool should own the rule type it is best at.

Tool	Best alert rules	Bad alert rules
DevHelm	External checks: uptime, API, DNS, SSL, heartbeat, dependency health	Internal stack traces that only exist inside app code
Sentry	Exceptions, failed jobs, parser errors, customer-impacting code failures	Simple uptime checks or "is the site reachable?"
Grafana	Metrics and logs: database load, queue depth, CPU, memory, error-rate trends	Single business events that need customer context

Use DevHelm when the question is "does the service work from outside?"

DevHelm rule	Example route
API returns 5xx from two regions	Page on-call engineer
SSL certificate expires in under 30 days	Slack or ticket
DNS record changed from expected value	Page platform on-call
Heartbeat job missed two runs	Slack with notification, page if billing/data is affected

Use Sentry when the question is "did the code fail in a way that matters?"

Sentry rule	Example route
New critical exception in checkout flow	Page on-call engineer
Parser fails for top-tier customer	Page responsible engineer
Same auth exception affects 20 users in 5 minutes	Slack with notification or page
Error appears only in staging	Team Slack or issue, no production page

Use Grafana when the question is "is a metric moving into a dangerous range?"

Grafana rule	Example route
Database CPU/load stays high for 10 minutes	Slack with notification, page if latency rises
Queue depth grows for 15 minutes	Slack with notification
Error rate crosses SLO burn threshold	Page or Slack based on severity
Memory rises slowly but users are fine	Dashboard or ticket

The rule of thumb: DevHelm watches the outside user path, Sentry watches code failures, and Grafana watches system trends. Route them through the same notification policy model so the on-call engineer sees one clear incident instead of three disconnected alerts.

The stronger pattern is to alert on user-visible symptoms and keep causes as context. "Checkout is failing from us-east and eu-west" is the page. "Database connection pool is full" is useful detail. The on-call engineer needs both, but the user-visible problem should decide whether someone gets woken up.

Google's SRE guidance makes this point clearly: pages should be urgent, useful, new, and tied to user-visible problems. Prometheus Alertmanager follows the same idea with grouping, inhibition, and silences. The modern version is simple: classify first, notify second.

How severity levels route monitoring alerts

Severity connects detection to notification. Without severity, every alert looks equally urgent. With severity, critical outages can page people, while warnings stay out of the pager path.

Use a small scale. Most teams need three alert severities:

Severity	Meaning	Notification behavior
Critical	Customer-facing outage, data loss risk, or broken revenue path	Page on-call immediately
Warning	Degradation, higher risk, or upcoming failure	Slack, email, or business-hours escalation
Info	Useful operational context	Dashboard, log, or digest

This should match your incident severity model. A critical alert usually maps to a Sev1 or Sev2 incident. A warning may become a Sev3 if it continues or gets worse. Info should rarely become an incident unless a human promotes it.

Severity should be set at the source. Do not ask the notification channel to guess. The monitor, alert rule, or policy should know whether a failed check is critical or warning based on the service, environment, assertion, and confirmation window.
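As a rough sketch of setting severity at the source, the function below derives it from the check result itself. The `CheckResult` fields, the set of critical services, and the two-failure confirmation window are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumption: these services sit on the revenue or login path.
CRITICAL_SERVICES = {"checkout", "login", "billing"}


@dataclass(frozen=True)
class CheckResult:
    service: str
    environment: str           # "production", "staging", ...
    failed_regions: int        # regions where the assertion failed
    consecutive_failures: int  # failed runs in a row


def severity(result: CheckResult, confirm_after: int = 2) -> str:
    """Decide severity before any notification channel is involved."""
    # Confirmation window: a single failed run is context, not an alert.
    if result.consecutive_failures < confirm_after:
        return "info"
    multi_region = result.failed_regions >= 2
    if (result.environment == "production"
            and result.service in CRITICAL_SERVICES
            and multi_region):
        return "critical"
    return "warning"
```

The channel never has to guess: it receives an alert that already carries `critical`, `warning`, or `info`.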

For a deeper incident triage model, see Incident Severity Levels: Sev1-Sev4 with Triage Matrix. The simple alerting rule is this: if two severities notify the same people, through the same channel, with the same urgency, one level is not needed.

Notification channels for monitoring alerts

Alert fatigue often gets worse because teams treat channels as personal choices instead of alerting tools. Slack, email, SMS, phone, webhooks, PagerDuty, and OpsGenie each have a job.

Slack is good for team awareness. It works for warnings, incident channels, resolved alerts, and routing context. It is weak as the only path for critical pages because people mute channels, miss threads, and close laptops.

Email is good for low-urgency follow-up. It works for reports, digests, and non-urgent warnings. It is poor for incident response because inboxes are crowded and delivery is not always fast.

SMS and phone are good for critical pages. They are loud by design. Save them for problems that need a human now.

Webhooks are good for automation. Send alerts to a workflow tool, ticket system, incident tool, or custom responder. A webhook should not be the only record of a critical incident unless the receiver is reliable.

PagerDuty and OpsGenie-style tools are good for on-call schedules. They know who is on-call, how to escalate, and whether the first responder acknowledged. They work best when the monitoring tool sends clean alerts with clear severity.

Here are ten common routing examples:

Case	Best route	Why
Checkout API returns 500s from two or more regions	Page the on-call engineer	Users cannot pay, so action is needed now
Login is down for production users	Page the on-call engineer	Users cannot enter the product
Database is unavailable for the main app	Page the responsible backend or platform on-call	The owner can act fastest
Payment provider webhook is failing	Page on-call and post to incident Slack	Revenue path is at risk
API latency is high but requests still succeed	Slack with notification	The team should look soon, but it may not need a page
SSL certificate expires in 21-30 days	Email or ticket	Important, but not urgent
SSL certificate expires tomorrow	Slack with notification, then escalate during business hours	Renewal likely failed and needs fast follow-up
Staging monitor fails	Slack without notifying the whole team	Useful for developers, not production on-call
CPU crosses 70% for two minutes	Log warning or dashboard	Context only unless users are affected
A deploy starts or finishes	Log or silent Slack message	Useful history, not an alert

Notification policy setup for monitoring alerts

A notification policy is a set of rules for alerts. It decides where an alert goes, who sees it, how it groups with related alerts, and whether it should be muted.

Think of it as a routing tree:

What service or monitor generated the alert?
What environment did it come from?
What severity is it?
Is an incident already open for the same failure?
Is a maintenance window active?
Is quiet-hours behavior different for this severity?
Which channel or escalation chain should receive it?

That tree gives you control. A critical checkout outage can page the primary on-call engineer, post to the incident channel, and update the status page. A warning on a staging monitor can go to Slack during business hours. A planned maintenance event can mute expected failures without turning the monitor off.

Good policies usually route by these attributes:

Severity: critical, warning, info.
Environment: production, staging, development.
Service or tag: checkout, API, auth, billing, database.
Region: global failure vs. single-region degradation.
Monitor type: HTTP, DNS, SSL, TCP, heartbeat.
Ownership: team or rotation responsible for the service.
Time window: business hours, quiet hours, planned maintenance.

The default policy matters too. Alerts that do not match a specific rule should not disappear. Send them to a visible catch-all channel with enough context to fix the policy. A missing route is a config bug.
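A first-match routing table with a catch-all can be sketched in a few lines. The rule attributes come from the list above; the channel names and the `route` function are hypothetical placeholders, not DevHelm identifiers.

```python
# Ordered from specific to broad; the first matching rule wins.
# Channel names are placeholders for illustration only.
RULES = [
    ({"service": "checkout", "environment": "production", "severity": "critical"},
     ["primary-pager", "incident-slack", "status-page"]),
    ({"environment": "production", "severity": "critical"}, ["primary-pager"]),
    ({"environment": "production", "severity": "warning"}, ["engineering-slack"]),
    ({"environment": "staging"}, ["team-slack"]),
]
CATCH_ALL = ["unrouted-alerts"]


def route(alert, rules=RULES):
    """Return the channels for the first rule whose attributes all match."""
    for match, channels in rules:
        if all(alert.get(key) == value for key, value in match.items()):
            return channels
    # A missing route is a config bug, so surface it instead of dropping it.
    return CATCH_ALL
```

An alert that lands in the catch-all channel is a prompt to add or fix a rule, not a destination to keep using.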

For teams using monitoring as code, notification policies should live beside monitor definitions. The monitor says what to check. The policy says who gets notified when that check fails.

Alert routing and escalation for on-call engineers

Routing answers one question: who should get this alert first? Escalation answers a different one: what happens if they do not respond?

Mixing these two ideas creates fragile alerting. A route should be based on ownership and context. An escalation chain should be based on time and responsibility.

Example escalation:

Time since alert	Action
0 minutes	Notify primary on-call
5 minutes unacknowledged	Notify secondary on-call
15 minutes unacknowledged	Notify engineering lead
30 minutes unresolved for critical severity	Open broader incident response

Test both layers. A good routing policy is useless if the backup never gets paged. A good escalation chain is useless if every database alert goes to the wrong team first.
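The example chain above can be expressed as a small function, which also makes it easy to test. This is a sketch under assumptions: times are minutes since the alert fired, and `notified_so_far` and its parameters are illustrative names.

```python
# (delay in minutes, who to notify) while the alert is unacknowledged.
UNACKED_STEPS = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering lead"),
]


def notified_so_far(minutes_elapsed, acked_at=None, resolved=False,
                    severity="critical"):
    """List everyone the chain has notified by `minutes_elapsed`."""
    targets = []
    for delay, who in UNACKED_STEPS:
        # A step fires only if nobody acknowledged before its delay passed.
        still_unacked = acked_at is None or acked_at > delay
        if delay <= minutes_elapsed and (delay == 0 or still_unacked):
            targets.append(who)
    # The 30-minute step depends on resolution, not acknowledgment.
    if severity == "critical" and not resolved and minutes_elapsed >= 30:
        targets.append("broader incident response")
    return targets
```

For instance, `notified_so_far(20, acked_at=3)` returns only the primary on-call, because the acknowledgment at minute 3 stops the chain before the secondary step.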

Alert deduplication, grouping, and suppression

The most useful alerting work usually happens before the notification is sent.

Deduplication stops the same alert from notifying again and again. If the checkout monitor fails every 30 seconds for ten minutes, the on-call engineer should not receive twenty identical pages. They should receive the first alert, updates when the state changes, and a resolved message when the monitor recovers.
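A minimal way to get that behavior is to notify only on state changes. The sketch below assumes two states, "up" and "down", and treats a monitor it has never seen as "up"; the class and method names are illustrative.

```python
class Deduplicator:
    """Notify on state transitions, not on every failed check run."""

    def __init__(self):
        self._last_state = {}

    def should_notify(self, monitor_id, state):
        # Assumption: a monitor with no history starts in the "up" state.
        previous = self._last_state.get(monitor_id, "up")
        self._last_state[monitor_id] = state
        return state != previous
```

Twenty consecutive "down" results for the checkout monitor produce one notification, and the first "up" result afterwards produces the resolved message.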

Grouping combines related alerts into one incident or notification. During a dependency outage, many monitors may fail for the same reason. Grouping by service, dependency, severity, and time window keeps the responder focused.
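Grouping can be sketched as bucketing alerts by a shared key. The example below assumes each alert is a dict with `monitor`, `dependency`, `severity`, and `minute` keys, and uses fixed ten-minute buckets; both the shape and the window are illustrative choices.

```python
from collections import defaultdict


def group_alerts(alerts, window_minutes=10):
    """Map (dependency, severity, time bucket) to the monitors that fired."""
    groups = defaultdict(list)
    for alert in alerts:
        bucket = alert["minute"] // window_minutes
        key = (alert["dependency"], alert["severity"], bucket)
        groups[key].append(alert["monitor"])
    return dict(groups)
```

Each key becomes one notification that lists the affected monitors. Fixed buckets are the simplest option but can split one incident across a boundary; a sliding window avoids that at the cost of more state.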

Suppression stops expected or lower-value alerts from notifying people. Maintenance windows are the common case. If you know the database will restart during a planned migration, mute that alert for the window but keep the monitor running.

Inhibition mutes smaller symptoms when a bigger alert already explains them. If "cluster unavailable" is firing, individual pod warnings do not need to page. They can stay visible as context without becoming separate pages.
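The cluster example can be sketched as follows. The alert names `cluster_unavailable` and `pod_warning`, the `cluster` key, and the function name are assumptions for illustration; this is not the configuration syntax of any particular tool.

```python
def split_inhibited(firing):
    """Separate alerts that should notify from those kept only as context."""
    down_clusters = {
        alert["cluster"] for alert in firing
        if alert["name"] == "cluster_unavailable"
    }
    notify, context = [], []
    for alert in firing:
        # A pod warning is inhibited only when its own cluster is down.
        if alert["name"] == "pod_warning" and alert["cluster"] in down_clusters:
            context.append(alert)
        else:
            notify.append(alert)
    return notify, context
```

Matching on the cluster keeps the rule narrow: a pod warning in a healthy cluster still notifies.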

These controls separate alert management from simple message delivery. A PagerDuty alternative, an incident management platform, or a monitoring tool can all send messages. The important question is whether the tool reduces duplicate human work before the alert reaches a human.

On-call engineers need trusted alerts

An on-call rotation depends on alert quality. You can design a fair schedule, pay engineers properly, and write good handoff notes. The team can still burn out if monitoring alerts are low value.

The target is trusted alerts, not a quiet pager. A quiet system may simply not monitor enough, and a noisy system can still miss the failure that matters.

Trusted alerts have five traits:

The responder understands what broke.
The alert maps to customer or business impact.
The route points to the team that can act.
The alert includes context and a runbook.
The alert history shows that pages usually matter.

For schedule design, handoffs, pay, and rotation models, read On-Call Rotation Best Practices for Engineering Teams. This guide focuses on the input side: making sure the on-call engineer receives signal instead of noise.

Quiet hours and maintenance windows for alert fatigue

Quiet hours do not mean critical alerts go silent. They mean non-critical alerts wait until a better time.

A good quiet-hours policy usually looks like this:

Alert severity	During business hours	During quiet hours
Critical	Page immediately	Page immediately
Warning	Slack or ticket	Queue for morning unless worsening
Info	Dashboard or digest	Dashboard or digest
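The table above translates into a small decision function. The 22:00 to 07:00 quiet window, the return labels, and the choice to send a worsening warning to Slack immediately are assumptions; adjust them to your own policy.

```python
from datetime import time


def in_quiet_hours(now, start=time(22, 0), end=time(7, 0)):
    """True if `now` falls inside the quiet window (which may wrap midnight)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def delivery(severity, now, worsening=False):
    """Decide how an alert is delivered at local time `now`."""
    if severity == "critical":
        return "page"  # quiet hours never silence critical alerts
    if severity == "warning":
        if in_quiet_hours(now) and not worsening:
            return "queue for morning"
        return "slack"
    return "digest"
```

Note that `delivery("critical", time(3, 0))` still returns "page": quiet hours only change what happens to warnings.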

Maintenance windows mute expected alerts during planned work. The important word is expected. If you are restarting a database, mute the database availability alert for that window. Keep the rest of production alerting active.

Use narrow matchers:

Specific service or monitor.
Specific environment.
Specific time window.
Specific expected failure mode.

Broad silences are dangerous because they hide unrelated incidents. A maintenance window for checkout should not mute auth, DNS, or the public status page. Keep the policy narrow so an unrelated critical failure still reaches on-call.
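A narrow matcher can be sketched as a window that mutes an alert only when every attribute matches. The `MaintenanceWindow` class and the alert dict keys below are hypothetical and mirror the four matchers listed above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str
    environment: str
    failure_mode: str
    start: datetime
    end: datetime

    def mutes(self, alert, at):
        """Mute only if service, environment, failure mode, and time all match."""
        return (
            self.start <= at < self.end
            and alert["service"] == self.service
            and alert["environment"] == self.environment
            and alert["failure_mode"] == self.failure_mode
        )
```

With a window scoped to database availability in production, an auth outage or a database latency alert during the same hour still reaches on-call.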

Measuring monitoring alert quality

You can measure alert quality. If you do not measure it, alerting gets worse over time.

Track these metrics monthly:

Metric	What it tells you
Alerts per on-call shift	Human load
Pages per incident	How well duplicate alerts are grouped
Actionable alert rate	Signal-to-noise ratio
MTTA	Whether alerts are trusted and routed to the right person
False positive count	Whether thresholds are too sensitive
Alerts with runbooks	Whether responders have clear next steps
Alerts by service owner	Ownership gaps
Suppressed alerts during maintenance	Whether planned work is being handled cleanly

MTTA is especially useful. It measures the time between alert delivery and human acknowledgment. A rising MTTA can mean the alert is hard to see, routed to the wrong person, or ignored because the channel is noisy. For the full incident metric model, see MTTA, MTTR, MTBF, MTTF - The Four Incident Metrics, Compared.
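Computing MTTA from alert history is straightforward. The sketch below assumes each record is a `(delivered_at, acknowledged_at)` pair, with `None` for alerts nobody acknowledged; the function name is illustrative.

```python
from datetime import datetime
from statistics import mean


def mtta_minutes(alerts):
    """Mean minutes from delivery to acknowledgment, or None if nothing was acked."""
    gaps = [
        (acked - delivered).total_seconds() / 60
        for delivered, acked in alerts
        if acked is not None
    ]
    return mean(gaps) if gaps else None
```

Unacknowledged alerts are excluded from the mean here, which flatters the number, so it is worth tracking their count alongside MTTA.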

Run a monthly alert review with three questions:

Which alerts fired most often?
Which alerts did not lead to action?
Which incidents produced more than one page?

Delete, downgrade, group, or rewrite anything that fails the review. Alerting is not a one-time setup. Review it every month.

Monitoring alert setup example

Here is a simple policy for a small SaaS team with an API, web app, checkout flow, and public status page.

Start with monitors:

Monitor	Failure behavior
Homepage uptime	Warning for one-region failure, critical for multi-region failure
API health endpoint	Critical on multi-region 5xx or timeout
Checkout endpoint	Critical on failed assertion or elevated latency
DNS records	Warning on record drift, critical on resolution failure
SSL certificate	Warning under 30 days to expiry, critical when expired
Background job heartbeat	Warning after one missed interval, critical after sustained failure

Then define alert channels:

Channel	Purpose
Engineering Slack	Warnings, resolved alerts, context
Primary pager	Critical production pages
Secondary pager	Unacknowledged critical escalation
Email or ticket queue	Non-urgent follow-up
Webhook	Automation, incident creation, or audit record

Then define notification policies:

Rule	Destination
production + critical	Primary pager immediately, engineering Slack for visibility
production + critical + unacknowledged after 5 minutes	Secondary pager
production + warning	Engineering Slack during business hours, queue after hours
staging + critical	Team Slack only
info	Dashboard or digest
maintenance window match	Mute matched warning/critical alerts for that service only

Finally, attach runbook and ownership metadata:

Metadata	Why it matters
Service owner	Prevents "who owns this?" delay
Runbook URL	Gives the first responder a next step
Dashboard URL	Speeds diagnosis
Status page component	Keeps customer updates tied to the same signal
Recent deploy link	Helps spot change-related incidents

Runbooks are the most important link in this table. A good alert should not just say what broke; it should point to the first action. For a deeper format, see Runbooks: Anatomy, Examples, and the AI-Executable Format.

This setup is small on purpose. Start with a few alerts you trust. Add more only after the signal is good. Teams get into trouble when they create dozens of alerts before they have routing, severity, and review habits.

Setting up monitoring alerts with DevHelm

DevHelm splits alerting into two reusable parts: alert channels and notification policies.

Alert channels are destinations. They can be Slack, email, SMS, webhooks, PagerDuty, OpsGenie, or another place an alert can go. Create the destination once, test it, and reuse it across monitors. For the exact setup flow and supported channel types, see the alert channels reference.

Notification policies are routing rules. They decide which channel fires for each monitor, severity, tag, service, or escalation step. This is where you define quiet hours, severity routing, deduplication, escalation chains, and maintenance windows. For match rules, priority, and evaluation order, see the notification policies reference.

A clean DevHelm setup follows this order:

Create the first monitor and first alert channel. If you are starting from zero, follow First HTTP monitor, then First alert.
Create channels for the real destinations: engineering Slack, critical pager, secondary escalation, and ticket/email follow-up. For end-to-end setup, see the alerting guide.
Tag monitors by service and environment: production, staging, checkout, API, auth, billing. If you route by ownership, use alert routing by tag.
Set severity based on impact: critical for customer-facing outages, warning for degradation or upcoming failure, info for context.
Build notification policies from specific to broad: checkout critical first, production critical next, warnings after that, catch-all last.
Add escalation only where the first responder might miss the alert. For a three-step model, see tiered escalation.
Add quiet-hours behavior for warnings and info.
Add maintenance windows for planned work. For scoped planned downtime, see maintenance windows.
Test the full path from failed monitor to delivered notification before trusting it in production. For the test flow, see testing your alerts.

The key is reuse. Do not configure custom alert behavior on every monitor unless the monitor is truly special. A policy should define the team's alerting model once. New monitors should follow that model through tags, severity, and ownership.

That is how you keep alerting clean. When the payments team changes its escalation chain, update the payments policy. When Slack channel names change, update the channel. When checkout becomes more critical, update the severity rule. Do not edit twenty monitors by hand.

PagerDuty alternative vs better monitoring alerts

Many teams searching for a PagerDuty alternative are really searching for less alert fatigue. These problems are related, but each layer has a different job.

PagerDuty, OpsGenie, Grafana OnCall, incident.io, Rootly, Better Stack, and other incident tools handle schedules, escalation, acknowledgment, and response workflows. They are useful when you need on-call schedules and escalation. They do not automatically fix bad monitoring alerts.

If the monitoring layer sends noisy alerts with little context, the on-call tool will just route noisy alerts faster. Signal quality has to improve upstream of the pager.

The better evaluation questions are:

Can the monitoring layer classify severity before paging?
Can related failures deduplicate into one incident?
Can warnings avoid the pager path?
Can maintenance windows suppress expected failures without hiding unrelated outages?
Can the alert include owner, runbook, region, and service context?
Can the incident tool escalate only after a clean alert has reached the right first responder?

For tool comparisons, see Best Incident Management Tools in 2026 and Opsgenie Alternatives in 2026. This guide has a narrower point: whatever pager you use, fix the signal before you optimize the route.

Monitoring alert audit checklist

Use this checklist for every paging alert:

Question	What a good answer looks like
Is the condition user-visible or almost user-visible?	It affects users, revenue, data, or a critical dependency
Does it require action now?	Waiting until tomorrow would make the incident worse
Is the responder able to act?	The route goes to the team that owns the service
Is the owner clear?	The alert names a service, team, or on-call rotation
Is the severity correct?	Critical pages; warning goes to Slack or ticket; info stays quiet
Is the runbook linked?	The first responder has a clear next step
Will duplicate alerts group together?	One root cause creates one incident, not five pages
Is there a business-hours path for warnings?	Non-urgent warnings do not wake people at night
Is maintenance muting narrow?	Only the planned service and time window are muted
Did this alert lead to action in the last 90 days?	If not, delete it, tune it, or downgrade it

If an alert fails the checklist, do not leave it as a page. Rewrite it, downgrade it, group it, or delete it.

The best alerting systems are boring in the right way. Most signals go to dashboards, tickets, or quiet Slack channels. A small number reach the on-call engineer. When they do, the engineer trusts the page.

FAQ

What is the difference between monitoring and alerting?

Monitoring collects health signals from systems, services, and user journeys. Alerting decides which signals need human attention and sends them to the right place.

What is the best way to reduce alert fatigue quickly?

Start with the ten noisiest alerts from the last 30 days. Delete alerts that never led to action, downgrade non-urgent warnings, and group duplicate symptoms from the same incident.

Should warning alerts ever wake up an on-call engineer?

Usually no. A warning should page only when it predicts an imminent customer-facing failure and waiting until business hours would make the incident worse.

How many notification policies should a small team have?

Most small teams can start with four notification policies: production critical, production warning, non-production, and catch-all. Add service-specific policies only when ownership or severity handling is different.

Is PagerDuty a monitoring tool?

PagerDuty is mainly an on-call and incident response tool. It routes, escalates, and tracks alerts from monitoring systems, but it usually depends on another tool to detect the failure first.

What should every alert include?

Every alert should include the affected service, severity, environment, current state, first failed time, owner, runbook link, dashboard link, and the notification policy that routed it.

DevHelm routes monitoring alerts through reusable alert channels and notification policies, with severity-based routing, escalation chains, quiet hours, deduplication, and maintenance windows. Start with your first production monitor at app.devhelm.io, then set up the alerting policy before the next incident tests it for you.

Originally published on DevHelm.

Alert Fatigue: Why Your Team Ignores Pages and How to Fix It

DevHelm — Wed, 08 Jul 2026 15:22:47 +0000

Your on-call engineer gets 47 alerts before lunch. Twelve are the same flapping health check. Eight are downstream effects of a single database hiccup. Six are informational warnings that never require action. By 2 PM, a new alert arrives — checkout is returning 500s from three regions — and it sits in the channel for nine minutes because the person on call has learned to stop reading.

That is alert fatigue. It is one of the most common ways a well-monitored system ends up with a longer incident response time than a system with no monitoring at all.

What alert fatigue looks like in practice

Alert fatigue is not laziness. It is a rational response to a noisy system. When most alerts are false positives, low-priority warnings, or duplicates of the same underlying problem, engineers train themselves to ignore the pager — because ignoring it is usually the right call.

PagerDuty's State of Digital Operations research found that the median on-call team receives over 300 alerts per week. Of those, roughly 30–40% are noise: alerts that fire, get acknowledged, and close without any action taken. The remaining alerts often cluster around a handful of real incidents, buried under duplicates and false alarms.

The damage compounds over time. Engineers rotate off on-call shifts feeling burned out. Response times creep upward. Critical pages blend in with the noise. The team builds informal workarounds — muting channels, filtering notifications, checking Slack "when they get around to it" — that undermine the entire alerting pipeline. For more on how this affects on-call teams specifically, see on-call rotation best practices.

Why your alerting is too noisy

Alert fatigue rarely has one cause. It accumulates from a series of reasonable decisions that compound into an unreasonable system.

Too many monitors, not enough intent

Teams add monitors reactively. An outage happens, someone creates an alert so it "never happens again," and the alert stays forever — even after the underlying architecture changes. The monitor count grows while the number of meaningful signals stays flat.

The question is not "does this metric matter?" It usually does. The question is "does this metric need to wake someone up?" Most metrics belong on a dashboard, not in a pager.

No severity differentiation

When every alert pages the on-call engineer with the same urgency, the engineer cannot tell what is actually urgent. A certificate expiring in 28 days should not arrive in the same channel, with the same sound, as a checkout outage affecting live transactions.

Severity levels exist to solve this. Critical means customers are affected right now. Warning means something needs attention during business hours. Info means the team should know, but nobody needs to act. Without that separation, everything feels equally important — which means nothing does. See incident severity levels for a practical framework.

Duplicate alerts from correlated failures

A single database going unhealthy can trigger alerts from every service that depends on it. The API returns 500s — alert. The background job queue stalls — alert. The health check for the admin panel fails — alert. The status page component goes red — alert. One root cause, four pages.

The on-call engineer spends the first ten minutes triaging instead of fixing the database. Worse, the volume itself signals "something big is happening" without clarifying what.

Alerts without actions

"CPU at 78%." So what? Should someone scale up? Is it a runaway query? Is it normal traffic? If the alert does not point to a specific action or a runbook, it creates noise without value. Every alert that fires without a next step trains the team to treat alerts as background chatter.

How to fix it

The fix is not adding more rules. It is removing the ones that do not earn their place and restructuring the rest so each alert is worth the interruption.

Audit the alert-to-action ratio

Pull the last 30 days of alert history. For each alert rule, count how many times it fired and how many of those resulted in a human taking action. If an alert fires 50 times a month and gets acted on twice, it needs to be downgraded, tuned, or removed.

A healthy target: at least 70% of pages should lead to a meaningful response. If your ratio is below 50%, the pager has become a notification feed, not an incident signal.
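The audit can be scripted once the alert history is exported. The following is a minimal sketch, assuming the history is available as a list of records with a rule name and a flag for whether a human acted; `AlertEvent`, `action_ratios`, and `needs_review` are hypothetical names for illustration, not part of any vendor API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    rule: str       # name of the alert rule that fired (hypothetical field)
    acted_on: bool  # True if a human took action in response


def action_ratios(events):
    """Return {rule: (fired, acted, ratio)} for a list of AlertEvent."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for event in events:
        fired[event.rule] += 1
        if event.acted_on:
            acted[event.rule] += 1
    return {
        rule: (count, acted[rule], acted[rule] / count)
        for rule, count in fired.items()
    }


def needs_review(ratios, threshold=0.5):
    """Rules whose action ratio is below the threshold, noisiest first."""
    flagged = [rule for rule, (_, _, ratio) in ratios.items() if ratio < threshold]
    return sorted(flagged, key=lambda rule: ratios[rule][0], reverse=True)


# The example from the text: 50 firings in a month, acted on twice.
history = (
    [AlertEvent("cpu-high", acted_on=False)] * 48
    + [AlertEvent("cpu-high", acted_on=True)] * 2
    + [AlertEvent("checkout-down", acted_on=True)] * 3
)
ratios = action_ratios(history)
print(ratios["cpu-high"])    # (50, 2, 0.04)
print(needs_review(ratios))  # ['cpu-high']
```

Sorting the flagged rules by how often they fired puts the candidates that would remove the most noise at the top of the review list.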

Set severity thresholds that mean something

Map each alert to a severity level and route accordingly:

Severity	Meaning	Channel
Critical	Customers affected now, revenue at risk	Phone call or push notification to on-call
Warning	Needs attention within hours, not minutes	Slack or email to the owning team
Info	Useful context, no action required	Dashboard or daily digest

Then enforce the mapping in your notification policies. Critical pages the on-call rotation. Warning posts to a team channel. Info goes nowhere near a phone. This separation alone can cut page volume by 40–60% without losing coverage.
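One way to keep the mapping enforceable is to hold it in a single lookup that every notification passes through. This is only a sketch: the channel names and the `route` and `pages_someone` helpers are assumptions made for illustration, not DevHelm's notification policy format.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical channel names, one per row of the table above.
ROUTES = {
    Severity.CRITICAL: "page-on-call",
    Severity.WARNING: "team-chat",
    Severity.INFO: "daily-digest",
}

# Only these channels are allowed to interrupt a human.
PAGING_CHANNELS = {"page-on-call"}


def route(severity):
    """Return the channel for a severity; an unmapped severity raises KeyError."""
    return ROUTES[severity]


def pages_someone(severity):
    """True only if the severity routes to a paging channel."""
    return route(severity) in PAGING_CHANNELS


print(route(Severity.WARNING))       # team-chat
print(pages_someone(Severity.INFO))  # False
```

Because an unmapped severity raises an error instead of falling back to a default, a new alert cannot reach the pager without someone deciding that it belongs there.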

Group correlated failures into one alert

When five services fail because one dependency went down, the on-call engineer needs one alert that says "the database is unhealthy and these services are affected" — not five independent pages.

Resource groups help here. Group monitors by shared dependency — your payment provider, your primary database, your CDN — so that correlated failures produce a single notification instead of a cascade. The engineer sees one alert with context, not a wall of symptoms.
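The grouping idea can be illustrated in a few lines. The sketch below assumes you maintain a mapping from each monitor to the dependency it shares with others; the monitor names, `DEPENDENCY_OF`, and `group_failures` are hypothetical and do not describe how any particular product implements resource groups.

```python
from collections import defaultdict

# Hypothetical mapping from monitor to the shared dependency it relies on.
DEPENDENCY_OF = {
    "api-5xx": "primary-db",
    "job-queue-stalled": "primary-db",
    "admin-health": "primary-db",
    "status-page-component": "primary-db",
    "image-latency": "cdn",
}


def group_failures(failing_monitors):
    """Collapse failing monitors into one notification per shared dependency."""
    groups = defaultdict(list)
    for monitor in failing_monitors:
        # A monitor with no known dependency forms its own group.
        groups[DEPENDENCY_OF.get(monitor, monitor)].append(monitor)
    return [
        f"{dep} is unhealthy; affected: {', '.join(sorted(monitors))}"
        for dep, monitors in sorted(groups.items())
    ]


failing = ["api-5xx", "job-queue-stalled", "admin-health", "status-page-component"]
for message in group_failures(failing):
    print(message)
# primary-db is unhealthy; affected: admin-health, api-5xx, job-queue-stalled, status-page-component
```

Four failing monitors produce one message that names the likely root cause and lists the symptoms as context.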

Tune flapping and transient detection

A health check that fails once because of a network blip and recovers 30 seconds later is not an incident. But if your alerting fires on the first failure, the on-call engineer gets paged for something that resolved before they could open a terminal.

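Require consecutive failures before alerting: two or three failed checks in a row from the same region, or failures from multiple regions simultaneously. This filters transient noise at the cost of a short, bounded delay (one or two extra check intervals) before a sustained outage pages, and multi-region confirmation can still fire on the first failed round when the failure is widespread.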
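As a rough sketch of that rule, the function below pages only when one region has failed several checks in a row or when enough regions are failing at the same time. The data shape, the `should_alert` name, and the default thresholds are assumptions for illustration.

```python
def should_alert(results_by_region, consecutive=3, min_regions=2):
    """Decide whether to page.

    results_by_region maps a region name to its check results, oldest
    first, where True means the check passed.
    """
    failing_now = 0
    for history in results_by_region.values():
        if not history:
            continue
        if not history[-1]:
            failing_now += 1
        recent = history[-consecutive:]
        if len(recent) == consecutive and not any(recent):
            return True  # sustained failure in a single region
    return failing_now >= min_regions  # simultaneous failure across regions


# A single blip that recovered: no page.
print(should_alert({"us-east": [True, False, True]}))          # False
# Three failures in a row from one region: page.
print(should_alert({"us-east": [True, False, False, False]}))  # True
# Latest check failing in two regions at once: page.
print(should_alert({"us-east": [True, False], "eu-west": [True, False]}))  # True
```

The two thresholds trade noise against detection time, so they are worth tuning per monitor instead of setting once globally.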

Require every alert to have a runbook link

This is the simplest and most effective rule to adopt. If the alert does not link to a runbook or documented response procedure, it should not page anyone. The runbook does not need to be perfect — a three-line document that says "check the database dashboard, look for long-running queries, escalate to the database team if replication lag exceeds 30 seconds" is enough.

This forces the team to think through the response before wiring the alert, which naturally filters out alerts nobody knows how to act on.
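If alert rules are stored as data, the runbook requirement can be checked automatically, for example in CI. This sketch assumes each rule is a dictionary with hypothetical `name`, `pages`, and `runbook` fields, and the example.com URL is a placeholder; adapt the field names to whatever format your rules actually use.

```python
def rules_missing_runbook(rules):
    """Return the names of paging rules that lack a usable runbook link."""
    missing = []
    for rule in rules:
        if not rule.get("pages", False):
            continue  # non-paging rules are exempt from the requirement
        runbook = (rule.get("runbook") or "").strip()
        if not runbook.startswith(("http://", "https://")):
            missing.append(rule["name"])
    return missing


rules = [
    {"name": "checkout-down", "pages": True,
     "runbook": "https://example.com/runbooks/checkout"},
    {"name": "cpu-high", "pages": True, "runbook": ""},
    {"name": "cert-expiry", "pages": False},
]
print(rules_missing_runbook(rules))  # ['cpu-high']
```

Failing the build when this list is non-empty turns the runbook rule from a convention into a gate.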

Review and prune quarterly

Alert rules are not permanent. Services change, architectures evolve, and the alert that mattered six months ago may now fire on a deprecated endpoint. Schedule a quarterly review: sort alerts by frequency, check the action ratio, and delete or downgrade anything that has become noise.

The alert your team cannot afford to miss

Alert fatigue is not a tooling problem at its root. It is a prioritization problem. The goal is not zero alerts — it is a system where every page is urgent, actionable, and trusted. When the on-call engineer's phone buzzes, they should think "something real is happening" instead of "probably another flapping check."

Start with the audit. Pull your alert history. Find the rules that fire without action. Set severity levels that route alerts to the right channel. Group correlated failures so one root cause produces one page. Attach a runbook to every alert rule. And prune what no longer matters.

The on-call shift should be boring most of the time. When it is not boring, every alert should matter. If you're ready to fix the noise, DevHelm's notification policies let you map severity levels to channels, group correlated monitors into resource groups, and require multi-region confirmation before any alert pages your team. For a deeper dive into the configuration, see the full monitoring alerts guide. To reduce mean time to resolution once a real alert fires, pair your alerts with standardized runbooks and clear escalation paths.

Originally published on DevHelm.

DEV Community: DevHelm

Monitoring as Code: Why Your Monitors Should Live in Git

Why monitors rot in a web UI

What monitoring as code looks like in practice

The payoff: review, history, and reproducibility

The workflow

What should be code — and what should not

Common objections — and responses

Getting started

API Testing vs API Monitoring: Different Problems, Different Tools

When each runs

What each catches — and what each misses

API testing catches

API testing misses

API monitoring catches

API monitoring misses

Same tool, different workflows

The gap between them — and what fills it

What good API monitoring actually asserts

Common mistakes

Start with both

How to Set Up Browser Monitoring with Playwright

Prerequisites

Step 1 — Write a check that asserts on outcomes

Step 2 — Use stable selectors that survive deploys

Step 3 — Wait for conditions, never for time

Step 4 — Capture evidence on failure

Step 5 — Schedule checks against production

Step 6 — Add multi-region coverage

Step 7 — Route failures to the right channel

Step 8 — Monitor the layer underneath

What to read next

Headless Browser Monitoring: What It Is and When You Need It

How it works

What headless browser monitoring catches that API checks miss

When you need headless browser monitoring — and when you do not

The cost question

Headless browser monitoring vs synthetic monitoring vs RUM

Getting started

Checkly Alternatives in 2026: Synthetic Monitoring Tools Compared

What makes Checkly good — and where teams hit limits

Datadog Synthetic Monitoring

Grafana Cloud Synthetic Monitoring (k6)

Better Stack

Sematext

When you need the layer underneath

How to choose

Opsgenie Is Shutting Down: What You Need to Know and When to Migrate

What happened

Timeline

Why Atlassian is doing this

What changes for existing users

Migration options

Option A: Move to JSM Cloud

Option B: Evaluate alternatives

Option C: Hybrid migration

Timeline recommendations

What to evaluate in any migration

Start planning now

Opsgenie Alternatives in 2026: Where to Migrate Before the Shutdown

What to look for in an Opsgenie replacement

PagerDuty

Grafana OnCall

incident.io

Rootly

Better Stack

Jira Service Management (JSM)

Migration decision framework

The monitoring layer underneath

On-Call Rotation Best Practices for Engineering Teams

Rotation structures

Schedule design

Escalation policies

Reducing on-call burden

Compensation and fairness

Common anti-patterns

Start with the signals

SSL Certificate Monitoring: Prevent Outages Before Your Certs Expire

Why SSL certificates expire (and why auto-renewal fails)

What SSL certificate monitoring actually checks