<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevHelm</title>
    <description>The latest articles on DEV Community by DevHelm (@devhelm).</description>
    <link>https://dev.to/devhelm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936382%2Fe8a13abc-de71-41f3-a5eb-70eb7efde5e6.png</url>
      <title>DEV Community: DevHelm</title>
      <link>https://dev.to/devhelm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devhelm"/>
    <language>en</language>
    <item>
      <title>Runbooks: Anatomy, Examples, and the AI-Executable Format</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:14:30 +0000</pubDate>
      <link>https://dev.to/devhelm/runbooks-anatomy-examples-and-the-ai-executable-format-440a</link>
      <guid>https://dev.to/devhelm/runbooks-anatomy-examples-and-the-ai-executable-format-440a</guid>
      <description>&lt;p&gt;The wiki page nobody opens. The Confluence doc that's six months stale. The Notion entry that gets read once during the postmortem and then forgotten. Most "runbooks" fail because they were written for nobody in particular — neither a fresh on-caller at 3 AM, nor a tenured engineer who already knows the system, nor an AI agent that might be the first responder. They serve no one, and they rot quietly.&lt;/p&gt;

&lt;p&gt;A useful runbook is a specific, narrow thing: a tightly scoped, executable procedure that turns one known failure into one known recovery. This post pins down what a runbook actually is (and what it isn't), shows the seven sections a good one contains, walks through a worked example you can copy, and ends with the structure that makes a runbook executable by an AI agent — because increasingly that's who reads it first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a runbook (and what it isn't)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;runbook&lt;/strong&gt; is a document that tells you how to handle one specific operational situation, end to end: the trigger that brings you to it, the symptoms you should see, the commands that confirm what's wrong, the steps that fix it, and the checks that prove it's fixed. One runbook covers one failure mode.&lt;/p&gt;

&lt;p&gt;It's not the same as some adjacent documents people lump under the term:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Audience&lt;/th&gt;
&lt;th&gt;When you reach for it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runbook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One specific failure mode (e.g. "API p95 latency above SLO")&lt;/td&gt;
&lt;td&gt;On-caller, AI agent, or a teammate paged into an active incident&lt;/td&gt;
&lt;td&gt;When that exact alert fires&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SOP (standard operating procedure)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routine, non-incident operations (e.g. "Rotate database credentials quarterly")&lt;/td&gt;
&lt;td&gt;Operator on a schedule&lt;/td&gt;
&lt;td&gt;On a calendar trigger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Playbook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A class of incidents with branching (e.g. "Customer reports degraded API performance")&lt;/td&gt;
&lt;td&gt;Incident commander making routing decisions&lt;/td&gt;
&lt;td&gt;At the start of an unknown incident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A live view of system state&lt;/td&gt;
&lt;td&gt;Anyone investigating&lt;/td&gt;
&lt;td&gt;Continuously, during and outside incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most common mistake is conflating runbooks with playbooks. A playbook is a tree of questions ("Is the database the bottleneck? If yes, go to runbook X. If no, check Y."). A runbook is a leaf of that tree — the actual recovery procedure once you've narrowed down which failure you're looking at. (The &lt;a href="https://response.pagerduty.com/" rel="noopener noreferrer"&gt;PagerDuty incident response guide&lt;/a&gt; is a good example of a playbook that links to many runbook-like procedures.) If your "runbook" is more than ~500 lines or covers more than one failure mode, it's a playbook and the runbooks it would link to don't exist yet.&lt;/p&gt;

&lt;p&gt;The second common mistake is writing one runbook per &lt;em&gt;service&lt;/em&gt;. A service has dozens of failure modes; lumping them all into one document means nobody can find the relevant section under pressure. One runbook, one failure mode, one alert. A &lt;a href="https://devhelm.io/blog/how-to-fix-slow-dns-lookup" rel="noopener noreferrer"&gt;slow DNS lookup&lt;/a&gt; and an &lt;a href="https://devhelm.io/blog/what-ssl-error-means-and-how-to-fix-it" rel="noopener noreferrer"&gt;SSL certificate error&lt;/a&gt; are two different failure modes — they get two different runbooks, even though they may live on the same load balancer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The anatomy of a useful runbook
&lt;/h2&gt;

&lt;p&gt;Most runbook templates you'll find on the internet ask for a dozen sections: purpose, scope, owners, dependencies, change history, related links, escalation matrix, last-reviewed date. Almost none of that is useful while an alert is paging. The reader has perhaps 30 seconds of attention to spare and is looking for what to do.&lt;/p&gt;

&lt;p&gt;A good runbook contains exactly seven sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt; — the precise alert or signal that brought the reader here. Not "this is for API issues"; &lt;em&gt;"this runbook is opened when the &lt;code&gt;api-latency-p95-high&lt;/code&gt; alert fires."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symptoms&lt;/strong&gt; — what the reader can confirm &lt;em&gt;right now&lt;/em&gt;. Specific commands, expected output. &lt;em&gt;"The p95 latency panel shows &amp;gt;1s for 5+ minutes; error-rate panel is flat (rules out a 5xx storm)."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnosis&lt;/strong&gt; — commands to confirm the failure and rule out lookalikes. Each command in a fenced code block; expected output annotated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitigation steps&lt;/strong&gt; — ordered, idempotent, each with a runnable command. If a step depends on the previous one succeeding, say so.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt; — how the reader knows it worked. Concrete checks: "the &lt;code&gt;http_request_duration_seconds&lt;/code&gt; p95 drops below 500ms for 10 consecutive scrape intervals."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTO and what data you lose&lt;/strong&gt; — expected duration of the recovery and any acceptable data loss. The reader needs to know whether this is a 30-second fix or a 30-minute restore so they can communicate up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation path&lt;/strong&gt; — when and to whom you escalate if the steps don't work. Real names or rotation references, not "the DBA team."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Everything else (owners, related links, last-reviewed date) belongs in the file's front-matter or repository metadata, not in the body the on-caller reads while their phone is buzzing. For more on why RTO matters as a success criterion, see &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR Full Form&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worked example: API p95 latency runbook
&lt;/h2&gt;

&lt;p&gt;Below is a condensed runbook for a common SaaS failure mode — API latency crossing an SLO threshold while error rates stay flat (often a saturation or dependency slowdown, not a hard outage). The names are illustrative; swap in your service labels and metric names.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scenario: API p95 latency above SLO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; the &lt;code&gt;api-latency-p95-high&lt;/code&gt; alert (Prometheus rule: p95 &amp;gt; 1s for 5m, error rate &amp;lt; 1%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Grafana "API latency" panel red; "API errors" panel green. Check CI for a deploy in the last 30 minutes: a recent deploy points to the rollback path; no deploy points to a dependency slowdown or a traffic spike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis:&lt;/strong&gt; (1) &lt;code&gt;kubectl get pods -n api -l app=api&lt;/code&gt; — any Not Ready? (2) &lt;code&gt;curl -s http://api.internal/health | jq '.status'&lt;/code&gt; — expect &lt;code&gt;"UP"&lt;/code&gt;. (3) Compare p95 by route in Grafana — one route or all routes?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; if post-deploy → roll back to previous revision (&lt;code&gt;kubectl rollout undo deployment/api -n api&lt;/code&gt;). If all routes slow and health is UP → check upstream dependency status pages; throttle non-critical traffic if you have a feature flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification:&lt;/strong&gt; p95 &amp;lt; 500ms for 10 consecutive scrape intervals; error rate unchanged; no new pages in 15 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTO:&lt;/strong&gt; 5–15 minutes for rollback path; 30–60 minutes for dependency-wait path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalation:&lt;/strong&gt; if rollback fails twice or p95 still &amp;gt;1s after 30 minutes → page platform lead with dashboard link and deploy SHA.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice the shape: each section has a single job. There's no preamble about "the importance of SLOs." The reader who arrived from the alert wants four things in this order — &lt;em&gt;is this the right runbook, what should I see, what should I run, did it work&lt;/em&gt; — and the document delivers all four within the first screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-readable runbooks: structure that an agent can execute
&lt;/h2&gt;

&lt;p&gt;Increasingly the first responder to an incident is not a human. An on-call agent (Cursor, Claude Code, or a dedicated SRE bot) can receive the same alert payload as a human and start triage before anyone is paged — if the alert carries a &lt;code&gt;runbook_url&lt;/code&gt; and the runbook body is structured for machines, not just humans.&lt;/p&gt;

&lt;p&gt;For that to work, the runbook has to be structured so an agent can extract steps and act on them. The seven sections above are necessary but not sufficient. Five additional properties make a runbook AI-executable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The trigger is a machine-parseable query, not a description.&lt;/strong&gt; "Looks slow" can't be matched against telemetry; &lt;code&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m])) &amp;gt; 1.0&lt;/code&gt; can.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commands live in fenced code blocks with language tags&lt;/strong&gt; (&lt;code&gt;bash&lt;/code&gt;, &lt;code&gt;sql&lt;/code&gt;, &lt;code&gt;yaml&lt;/code&gt;). The agent (and any markdown parser) needs structural cues to know what's executable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected output is colocated with the command.&lt;/strong&gt; A step that says "run &lt;code&gt;kubectl get pods&lt;/code&gt;" without telling the agent what success looks like is non-executable — there's no way to verify the step worked before moving on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes branch explicitly.&lt;/strong&gt; "If health is UP but p95 is still high, go to Check 3 (dependency status); if pods are Not Ready, go to Check 2 (roll back)" is executable. "If needed, escalate" is not — the agent can't decide what "needed" means.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No prose-only sections in the recovery body.&lt;/strong&gt; Every step has a runnable artifact or a verifiable check. Background narrative belongs in a separate "Why this happens" section that the agent can skip if it's already remediating.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A human-only version of a step:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Check whether latency is still elevated. You can look at the metrics in Grafana, or curl the health endpoint. If it's still slow, you'll want to investigate why."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The same step, AI-executable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check 1 — is the API still degraded?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://api.internal/health | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.status'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: &lt;code&gt;UP&lt;/code&gt;. Then confirm latency in Prometheus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: below 0.5 (500ms). If above 1.0 for two consecutive evaluations, proceed to Check 2 (recent deploy; roll back if one is found).&lt;/p&gt;

&lt;p&gt;Same information, but the agent can run it, parse the output, and decide whether to advance. That's the bar.&lt;/p&gt;
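&lt;p&gt;To make "parse the output and decide whether to advance" concrete, here is a minimal Python sketch of how an agent might apply Check 1's expectations to output it has already collected. The function name, the &lt;code&gt;CheckResult&lt;/code&gt; type, and the step labels are hypothetical, and the post does not say what to do when the health endpoint is not UP, so that branch is an assumption.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one runbook check, as an agent might record it."""
    passed: bool
    next_step: str


def evaluate_check_1(health_status: str, p95_samples: list[float]) -> CheckResult:
    """Apply Check 1's documented expectations to already-collected output.

    health_status: output of the curl | jq -r '.status' command.
    p95_samples: recent p95 values in seconds, oldest first (hypothetical
                 input; how the agent queries Prometheus is out of scope).
    """
    # The runbook expects the literal string UP from the health endpoint.
    if health_status.strip() != "UP":
        # Assumption: the runbook text does not define this branch.
        return CheckResult(passed=False, next_step="health-not-up")

    # "Above 1.0 for two consecutive evaluations" sends the agent to Check 2.
    last_two = p95_samples[-2:]
    if len(last_two) == 2 and all(value > 1.0 for value in last_two):
        return CheckResult(passed=False, next_step="check-2-recent-deploy")

    # Healthy latency is documented as below 0.5 s (500ms).
    if p95_samples and 0.5 > p95_samples[-1]:
        return CheckResult(passed=True, next_step="done")

    # Values between the two thresholds are not covered, so observe again.
    return CheckResult(passed=False, next_step="re-run-check-1")
```

&lt;p&gt;The useful property is that every branch returns a named next step, so there is no point where the agent has to guess what "investigate" means.&lt;/p&gt;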

&lt;h2&gt;
  
  
  Runbook hygiene: where to store them, how to find them at 3 AM
&lt;/h2&gt;

&lt;p&gt;A runbook that exists but can't be found in an incident is worse than no runbook — it costs minutes while the on-caller searches for it. Three rules cover most of the discoverability problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store runbooks in Git, next to the code.&lt;/strong&gt; Confluence and Notion fail in two ways: they can be unreachable during the very outage you're fighting when they share a dependency with your stack (the same DNS provider, the same auth provider), and they have no review workflow that catches stale content. A runbook in &lt;code&gt;runbooks/api-latency-p95-high.md&lt;/code&gt; can be reviewed every time the surrounding service changes, because a pull request puts it in front of the authors, who either update the runbook or explain why not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Link every alert to its runbook.&lt;/strong&gt; Use the annotation field your alerting system provides. For Prometheus / Grafana, that's &lt;code&gt;runbook_url&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApiLatencyP95High&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m])) &amp;gt; 1&lt;/span&gt;
    &lt;span class="s"&gt;and on() (sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) or vector(0)) / sum(rate(http_requests_total{job="api"}[5m])) &amp;lt; 0.01&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;p95&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1s&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1%."&lt;/span&gt;
    &lt;span class="na"&gt;runbook_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.your-company.com/runbooks/api-latency-p95-high"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert payload that reaches the pager (and the AI agent, if you run one) carries the URL. The on-caller's first click is straight into the right procedure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One runbook per failure mode, named for the failure.&lt;/strong&gt; &lt;code&gt;api-latency-p95-high.md&lt;/code&gt;, not &lt;code&gt;api.md&lt;/code&gt;. When the page fires, the alert name and the file name match — no search needed.&lt;/p&gt;
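&lt;p&gt;The name-matching rule is easy to check mechanically. The sketch below is one possible lint, assuming alert names are CamelCase (as in &lt;code&gt;ApiLatencyP95High&lt;/code&gt;) and runbook files are kebab-case; both function names and the &lt;code&gt;DbReplicaLag&lt;/code&gt; alert in the usage note are hypothetical.&lt;/p&gt;

```python
def runbook_filename(alert_name: str) -> str:
    """Convert a CamelCase alert name to a kebab-case runbook file name."""
    parts = []
    for index, char in enumerate(alert_name):
        # Start a new hyphen-separated word at every capital after the first.
        if char.isupper() and index > 0:
            parts.append("-")
        parts.append(char.lower())
    return "".join(parts) + ".md"


def missing_runbooks(alert_names: list[str], runbook_files: set[str]) -> list[str]:
    """Return the alerts that have no matching runbook file."""
    return [name for name in alert_names if runbook_filename(name) not in runbook_files]
```

&lt;p&gt;Called as &lt;code&gt;missing_runbooks(["ApiLatencyP95High", "DbReplicaLag"], {"api-latency-p95-high.md"})&lt;/code&gt;, it reports only the alert without a runbook. Note that the conversion splits on every capital letter, so an acronym-style name such as &lt;code&gt;APILatency&lt;/code&gt; would need a different rule.&lt;/p&gt;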

&lt;p&gt;For decay management: review each runbook quarterly, archive any with zero hits in 90 days, and treat a stale runbook found mid-incident as a sev3 of its own — the on-caller files a ticket to fix it; otherwise nobody does.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DevHelm fits runbooks into your incident flow
&lt;/h2&gt;

&lt;p&gt;DevHelm is built for the moment an alert fires and someone (human or agent) needs context fast. What's shipped today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alert channels&lt;/strong&gt; (PagerDuty, Slack, webhook, email) pass through the payload your upstream system sends. If your Prometheus or Grafana alert includes a &lt;code&gt;runbook_url&lt;/code&gt; annotation, that URL can ride along in the notification DevHelm dispatches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor status context&lt;/strong&gt; on &lt;a href="https://devhelm.io/status/github" rel="noopener noreferrer"&gt;dependency status pages&lt;/a&gt; — when latency looks like an upstream problem, the runbook's "check dependency status" step has a concrete destination instead of a generic Google search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource groups&lt;/strong&gt; (see &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR Full Form&lt;/a&gt;) collapse multiple monitors that share one failure mode into one incident — so the runbook link in the notification matches one root cause, not three duplicate pages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's not yet shipped: a first-class &lt;code&gt;runbook_url&lt;/code&gt; field on DevHelm monitors — the kind that would let you set it once on the monitor and have it flow into every notification and MCP tool response automatically. Until then, put the URL in the monitor description and your alert template. The &lt;a href="https://dev.to/reliability"&gt;reliability page&lt;/a&gt; covers how we operate our own stack; you don't need our internal runbook repo to apply the patterns in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;Pick your noisiest recurring alert — the one that woke someone up twice last quarter — and write one runbook for it. Seven sections, under 500 lines, stored in Git, linked from the alert annotation. That's the whole commitment.&lt;/p&gt;

&lt;p&gt;If you've been troubleshooting &lt;a href="https://devhelm.io/blog/how-to-fix-slow-dns-lookup" rel="noopener noreferrer"&gt;slow DNS lookups&lt;/a&gt; or &lt;a href="https://devhelm.io/blog/what-ssl-error-means-and-how-to-fix-it" rel="noopener noreferrer"&gt;SSL certificate errors&lt;/a&gt;, you've already done most of the work: those investigations follow exactly the trigger → diagnosis → fix → verify shape described above. Turning them into a runbook is a matter of formatting what you already know so the next person (or agent) doesn't have to rediscover it. And once the runbook exists, measuring whether it actually shortens recovery is what &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; is for.&lt;/p&gt;

&lt;p&gt;Spin up a free account at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; and connect your first dependency status feed in 60 seconds — useful when your runbook's diagnosis step says "check if the vendor is degraded." For AI-native setup, &lt;code&gt;npx devhelm skills install --target cursor&lt;/code&gt; installs the skill bundle that can create monitors from your editor.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
      <category>ai</category>
    </item>
    <item>
      <title>SLO vs SLA vs SLI: What Each One Means and How to Set Them</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:13:44 +0000</pubDate>
      <link>https://dev.to/devhelm/slo-vs-sla-vs-sli-what-each-one-means-and-how-to-set-them-137</link>
      <guid>https://dev.to/devhelm/slo-vs-sla-vs-sli-what-each-one-means-and-how-to-set-them-137</guid>
      <description>&lt;p&gt;Most SLO guides start with the same three-paragraph definitional exercise — SLI is the indicator, SLO is the objective, SLA is the agreement — and then stop. You leave knowing the vocabulary but not how to use it. You can't answer the questions that actually matter: which metric should I measure, what target is realistic for my service, and what happens when I miss it?&lt;/p&gt;

&lt;p&gt;This guide starts with the definitions because you need a shared vocabulary, but it spends most of its time on the decisions behind each one: choosing the right SLI for your service, setting an SLO that's strict enough to matter but loose enough to survive, computing and spending an error budget, and knowing when (and when not) to turn an SLO into an SLA.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three letters, disambiguated
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SLI — Service Level Indicator.&lt;/strong&gt; A quantitative measurement of one dimension of your service's behavior. Latency, availability, throughput, error rate, ticket resolution time. An SLI is always a number with units, derived from real telemetry. "Our API is fast" is not an SLI. "The 95th percentile of API response latency, measured at the load balancer over a 5-minute window" is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLO — Service Level Objective.&lt;/strong&gt; A target you set on an SLI. "p95 latency &amp;lt; 500ms, measured over a rolling 30-day window" is an SLO. It's an internal commitment — your team agrees that the service should meet this bar, and when it doesn't, you treat that as an incident or at least an engineering priority. An SLO is a tool for your team, not a legal document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLA — Service Level Agreement.&lt;/strong&gt; An SLO that's been written into a contract with a customer, usually with financial consequences for missing it. If your SLO says "99.9% availability" and you publish that as an SLA, a customer who experiences more than about 43 minutes of downtime in a 30-day month has grounds for a credit. SLAs are legal; SLOs are operational. Most internal services should have SLOs and should not have SLAs.&lt;/p&gt;

&lt;p&gt;The relationship is directional: you measure an &lt;strong&gt;SLI&lt;/strong&gt;, set an &lt;strong&gt;SLO&lt;/strong&gt; against it, and optionally externalize that SLO as an &lt;strong&gt;SLA&lt;/strong&gt;. Every SLA implies an SLO, but not every SLO should become an SLA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right SLI
&lt;/h2&gt;

&lt;p&gt;The hardest step is the first one: picking what to measure. A service with three SLIs that capture what users actually experience is more useful than one with fifteen SLIs that capture what the infrastructure is doing.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://sre.google/workbook/implementing-slos/" rel="noopener noreferrer"&gt;Google SRE Workbook&lt;/a&gt; recommends starting from user journeys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;User journey&lt;/th&gt;
&lt;th&gt;SLI category&lt;/th&gt;
&lt;th&gt;Example SLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"The page loads"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proportion of HTTP requests returning non-5xx, measured at the edge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"The page loads quickly"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;p95 of response time, measured at the load balancer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"My data is processed"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Freshness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Age of the most recent successful pipeline run, measured in minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"My report is accurate"&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Correctness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proportion of API responses returning the expected result (requires a canary or known-answer test)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two rules of thumb:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measure at the boundary your user sees, not inside your stack.&lt;/strong&gt; If you measure latency at the application layer and your CDN adds 200ms, you're lying to yourself. Measure at the load balancer or the edge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer SLIs, more confidence.&lt;/strong&gt; Start with availability + latency for any request-serving system. Add freshness only if you run a pipeline. Add correctness only if you have a way to verify it. Three SLIs that are trustworthy beat ten that nobody looks at.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A common mistake: using CPU utilization or memory pressure as SLIs. Those are infrastructure signals, not user-facing indicators. A machine running at 95% CPU but serving all requests under 200ms is fine. A machine running at 30% CPU but dropping 5% of connections is not. SLIs are about the user's experience, not the server's.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting a realistic SLO
&lt;/h2&gt;

&lt;p&gt;An SLO has three parts: the SLI, the target, and the measurement window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; "99.9% of HTTP requests return a non-5xx response, measured over a rolling 30-day window."&lt;/p&gt;

&lt;p&gt;The target is the part teams argue about. Here's a way to pick it that doesn't require a week of meetings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Measure your current SLI for 30 days.&lt;/strong&gt; Don't set a target yet; just observe. If your service has been running at 99.95% availability without anyone trying, setting 99.9% is reasonable. Setting 99.99% is aspirational. Setting 99% is embarrassing: it promises far less than the service already delivers without effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Set the target slightly below your current baseline.&lt;/strong&gt; If you've been running at 99.95%, set your SLO at 99.9%. This gives you room to breathe. The point of an SLO is not to describe your best day but to define the minimum acceptable level of service. If you set it at your best day, every normal fluctuation is a "violation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Convert the target to an error budget.&lt;/strong&gt; This is where SLOs get useful. A 30-day window contains 43,200 minutes, so:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO target&lt;/th&gt;
&lt;th&gt;Error budget&lt;/th&gt;
&lt;th&gt;Allowed downtime per 30 days&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;0.1%&lt;/td&gt;
&lt;td&gt;43.2 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.95%&lt;/td&gt;
&lt;td&gt;0.05%&lt;/td&gt;
&lt;td&gt;21.6 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.99%&lt;/td&gt;
&lt;td&gt;0.01%&lt;/td&gt;
&lt;td&gt;4.3 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those numbers are the entire content of most "what should my SLO be?" debates. A 99.99% SLO on a 30-day window gives you 4.3 minutes of total downtime. If your &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; is 25 minutes per incident, you can afford zero incidents. That's either an aspirational commitment backed by redundant infrastructure, or it's a lie. Be honest about which one.&lt;/p&gt;
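&lt;p&gt;The arithmetic behind the table is short enough to script. This is a minimal Python sketch assuming a fixed 30-day window; the function names are illustrative, and "incidents affordable" simply counts how many incidents of average (MTTR) length fit in the budget.&lt;/p&gt;

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo_target_percent: float) -> float:
    """Allowed downtime, in minutes, for an availability target over 30 days."""
    return (100.0 - slo_target_percent) / 100.0 * WINDOW_MINUTES


def incidents_affordable(slo_target_percent: float, mttr_minutes: float) -> int:
    """How many full incidents of average length fit inside the budget."""
    return int(error_budget_minutes(slo_target_percent) // mttr_minutes)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        budget = error_budget_minutes(target)
        count = incidents_affordable(target, 25)
        print(f"{target}% -> {budget:.1f} min budget, {count} incident(s) at 25 min MTTR")
```

&lt;p&gt;With a 25-minute MTTR, only the 99.9% target leaves room for a single full incident; the two tighter targets leave room for none.&lt;/p&gt;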

&lt;h2&gt;
  
  
  The error budget: what it is and how to spend it
&lt;/h2&gt;

&lt;p&gt;The error budget is the gap between 100% and your SLO target. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes. That budget is not "waste allowance" — it's a resource you can spend deliberately.&lt;/p&gt;

&lt;p&gt;Useful ways to spend error budget:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy a risky change.&lt;/strong&gt; If you have 30 minutes left in the budget and the deploy might cause 5 minutes of degradation, that's a calculated risk. If you have 2 minutes left, hold the deploy until earlier downtime ages out of the rolling window and the budget recovers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a chaos experiment.&lt;/strong&gt; Kill a database replica, fail over a region, inject latency on a dependency. Each experiment consumes budget. If you can't afford to run experiments, your SLO is probably too tight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let a known low-severity issue ride.&lt;/strong&gt; A p99 latency blip at 3 AM that affects 0.01% of requests is consuming budget, but if the alternative is waking someone up, spending budget is the right call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The error budget policy is the written agreement about what happens when the budget runs out. Typical policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Budget exhausted -&amp;gt; feature freeze.&lt;/strong&gt; All engineering effort goes to reliability until the budget recovers. This is the model Google's SRE practice popularized, and it works only if leadership actually enforces it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget below 50% -&amp;gt; deploy gate.&lt;/strong&gt; Deploys require explicit approval from the on-call engineer. This slows shipping but prevents the "one more deploy" cascade that burns the remaining budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget healthy -&amp;gt; ship freely.&lt;/strong&gt; This is the reward for investing in reliability. A team with a full error budget has earned the right to move fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: error budgets turn reliability from a vague mandate ("be more reliable") into a quantitative tradeoff ("we have 20 minutes left this month — is this deploy worth 5 of them?"). Teams that track error budgets make better decisions than teams that track uptime, because uptime has no built-in notion of "how much risk can we take."&lt;/p&gt;
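&lt;p&gt;The three policy tiers listed above can be written down as a single decision function. This is a sketch under stated assumptions: the tier labels are made up for illustration, the 50% cut-off comes from the example policy, and &lt;code&gt;budget_minutes&lt;/code&gt; is assumed to be positive.&lt;/p&gt;

```python
def deploy_policy(budget_minutes: float, consumed_minutes: float) -> str:
    """Map remaining error budget to one of the three example policy tiers."""
    remaining = budget_minutes - consumed_minutes
    if remaining / budget_minutes >= 0.5:
        return "ship-freely"      # budget healthy
    if remaining > 0:
        return "deploy-gate"      # below 50% remaining: on-call approval required
    return "feature-freeze"       # budget exhausted
```

&lt;p&gt;For a 99.9% SLO (43.2 minutes of budget), 10 consumed minutes still means shipping freely, 30 consumed minutes triggers the deploy gate, and 43.2 or more triggers the freeze.&lt;/p&gt;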

&lt;h2&gt;
  
  
  When an SLO becomes an SLA
&lt;/h2&gt;

&lt;p&gt;Most internal SLOs should stay internal. An SLA adds legal weight, customer expectations, and credit obligations. Promote an SLO to an SLA only when all three conditions hold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You've hit the SLO consistently for 3+ months.&lt;/strong&gt; If you haven't proven you can meet it internally, you definitely can't promise it externally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have a remediation path for breaches.&lt;/strong&gt; What credits do you issue? How are they calculated? Who approves them? If you can't answer these, you don't have an SLA — you have a marketing claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The SLA target is looser than your internal SLO.&lt;/strong&gt; Your SLA should be 99.9% if your SLO is 99.95%. The gap is your operational buffer. If the SLA and SLO are the same number, every SLO breach is also a contract breach, and your team will either burn out or game the measurement.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A public status page (like the ones DevHelm hosts at &lt;a href="https://devhelm.io/status/github" rel="noopener noreferrer"&gt;/status/github&lt;/a&gt;) is a middle ground between internal SLOs and contractual SLAs: it shows real uptime data without attaching legal obligations, so it builds trust through transparency rather than through a contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DevHelm gives you the data for SLOs
&lt;/h2&gt;

&lt;p&gt;DevHelm doesn't have a first-class SLO resource that you configure with a target and measure against a budget — that's a feature we're building, not one we ship today. What it does give you is the raw material SLOs are made of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor uptime data.&lt;/strong&gt; Every monitor computes availability as a weighted daily percentage: &lt;code&gt;(86400 - major_seconds - partial_seconds * 0.3) / 86400 * 100&lt;/code&gt;. Major outages count fully against uptime; partial degradations count at 30%. That formula runs across the status page, the dashboard, and the API — all three stay in sync. If your SLI is availability, the monitor's uptime history is the measurement.&lt;/p&gt;
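
&lt;p&gt;To make the weighting concrete, here is a small sketch of the formula quoted above. Only the formula itself comes from the text; the function and constant names are assumptions for illustration:&lt;/p&gt;

```python
SECONDS_PER_DAY = 86400
PARTIAL_WEIGHT = 0.3  # partial degradations count at 30%


def daily_uptime_percent(major_seconds, partial_seconds):
    """Weighted availability for one day, per the formula quoted above."""
    impaired = major_seconds + partial_seconds * PARTIAL_WEIGHT
    return (SECONDS_PER_DAY - impaired) / SECONDS_PER_DAY * 100


# A day with a 10-minute major outage and a 1-hour partial degradation.
print(round(daily_uptime_percent(600, 3600), 3))
```

&lt;p&gt;The one-hour degradation costs the same as an 18-minute major outage, so that day lands at roughly 98.06% rather than the 95.1% it would show if both were counted in full.&lt;/p&gt;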

&lt;p&gt;&lt;strong&gt;Status page uptime bars.&lt;/strong&gt; The public status page at &lt;code&gt;/status/&amp;lt;service&amp;gt;&lt;/code&gt; renders daily uptime per component with a "tracking since" date. An internal team or a customer can see exactly when the service was degraded and for how long — the same data that would feed an error budget computation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert channels for SLO-boundary signals.&lt;/strong&gt; If your SLI is latency and your monitor checks every 30 seconds, you can set a monitor threshold at the SLO boundary (e.g. p95 &amp;gt; 500ms) and route the alert through DevHelm's notification policies. That's not burn-rate alerting in the formal sense (you'd want a multi-window approach per the &lt;a href="https://sre.google/workbook/alerting-on-slos/" rel="noopener noreferrer"&gt;Google SRE Workbook&lt;/a&gt;), but it catches SLO breaches as they happen rather than at the end of the month.&lt;/p&gt;

&lt;p&gt;What we'd tell you honestly: if you need formal error budgets with automated freeze policies, you need a dedicated SLO tool (Nobl9, Sloth, or a Prometheus recording rule setup). DevHelm gives you the uptime data and the alerting layer; the budget math is yours today, ours tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you've never set an SLO, start with one. Pick your most important user-facing service, measure its availability SLI for two weeks, then set the SLO 0.05% below the observed baseline. Compute the error budget in minutes. Write it on a whiteboard. The first time someone asks "can we deploy this risky change?" and the answer is "we have 18 minutes of budget left — let's wait until Monday," the SLO has paid for itself.&lt;/p&gt;

&lt;p&gt;If your incidents tend to be dependency-driven — AWS degrades, your CDN edge has a regional issue — your SLO's biggest enemy is something outside your stack. A &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbook&lt;/a&gt; for each known dependency failure mode and a &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;vendor status feed&lt;/a&gt; that tells you when the dependency degraded before your monitors notice are the two cheapest investments in protecting your error budget.&lt;/p&gt;

&lt;p&gt;Spin up a free account at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; and wire your first monitor in 60 seconds. The uptime data starts accumulating immediately — you'll have your first 30-day SLI baseline before next month's planning meeting.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/slo-vs-sla-vs-sli" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Incident Severity Levels: Sev1–Sev4 with Triage Matrix</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:13:43 +0000</pubDate>
      <link>https://dev.to/devhelm/incident-severity-levels-sev1-sev4-with-triage-matrix-54cc</link>
      <guid>https://dev.to/devhelm/incident-severity-levels-sev1-sev4-with-triage-matrix-54cc</guid>
      <description>&lt;p&gt;Most teams define their severity levels as a table in a Confluence page, link to it from onboarding docs, and then never reference it during an actual incident. The levels exist, but nobody uses them. Three months later someone opens a sev1 for a broken CSS gradient and the on-call engineer gets paged at 2 AM.&lt;/p&gt;

&lt;p&gt;Severity levels only work when three things are true: the scale is simple enough to apply under stress, the response expectations are explicit, and the routing is automated. This guide covers all three — the scale itself, the decision framework for assigning it, and the wiring that turns a severity label into the right alert at the right time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four levels
&lt;/h2&gt;

&lt;p&gt;Most incident management systems converge on a four-level scale. The labels vary — sev1/sev2/sev3/sev4, P0/P1/P2/P3, critical/major/minor/info — but the structure is nearly universal.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Also called&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Response expectation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sev1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;P0, Critical&lt;/td&gt;
&lt;td&gt;Complete outage of a production system, data loss, or security breach affecting customers&lt;/td&gt;
&lt;td&gt;All-hands. Incident commander assigned. Stakeholder updates every 15 minutes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sev2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;P1, Major&lt;/td&gt;
&lt;td&gt;Significant degradation — a core feature is broken or a significant percentage of users are affected. Service is up but materially impaired.&lt;/td&gt;
&lt;td&gt;On-call responds immediately. Updates every 30 minutes. Escalation if unresolved in 1 hour.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sev3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;P2, Minor&lt;/td&gt;
&lt;td&gt;Limited degradation — a non-critical feature is broken, a workaround exists, or the impact is confined to a small subset of users.&lt;/td&gt;
&lt;td&gt;Addressed within business hours. No page. Tracked in the incident backlog.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sev4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;P3, Info&lt;/td&gt;
&lt;td&gt;Cosmetic issue, minor inconvenience, or an anomaly that warrants investigation but has no user-facing impact.&lt;/td&gt;
&lt;td&gt;Sprint backlog. No incident channel. Closed in the next cycle.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The exact boundaries shift between organizations. A company whose revenue runs through a single API endpoint has a lower threshold for sev1 than a company with redundant payment processors. The table above is a starting point — calibrate it to your blast radius.&lt;/p&gt;

&lt;p&gt;What matters more than the exact definitions is that everyone on the team can assign the right level within 60 seconds of seeing the alert. If your engineers argue about severity during an incident, the definitions are too ambiguous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Severity vs priority — they are not the same
&lt;/h2&gt;

&lt;p&gt;This distinction trips up most teams. Severity describes the impact of the incident — how bad it is right now. Priority describes the urgency of the response — how fast you need to fix it. They usually correlate, but not always:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;sev1 in a staging environment&lt;/strong&gt; is critical severity, low priority. The environment is completely down, but no customers are affected.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;sev3 that blocks a contractual deadline&lt;/strong&gt; is minor severity, high priority. The feature works for most users, but the one user who matters is the enterprise customer whose annual renewal depends on it shipping by Friday.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;sev2 that self-resolves in 90 seconds&lt;/strong&gt; is significant severity, reduced priority after the fact. The incident was real, but by the time an engineer opened the laptop, the system recovered. The retro still matters, but the live response is over.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://sre.google/workbook/incident-response/" rel="noopener noreferrer"&gt;Google SRE Workbook&lt;/a&gt; formalizes this as "severity is an attribute of the incident; priority is a decision made by the responder." The practical consequence: if your alerting system routes by severity alone, you get the right response most of the time. The rest requires human override — someone promoting a sev3 to high-priority or silencing a sev1 that fired in a non-production context.&lt;/p&gt;

&lt;h2&gt;
  
  
  A triage matrix that works under stress
&lt;/h2&gt;

&lt;p&gt;When an alert fires, you have roughly 30 seconds of attention before the responder either acts or dismisses. The triage question is: "what severity is this?" The fastest way to answer it is a two-axis matrix of &lt;strong&gt;customer impact&lt;/strong&gt; and &lt;strong&gt;scope&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Single user / account&lt;/th&gt;
&lt;th&gt;Significant minority (10-30%)&lt;/th&gt;
&lt;th&gt;Majority or all users&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature broken, no workaround&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sev3&lt;/td&gt;
&lt;td&gt;Sev2&lt;/td&gt;
&lt;td&gt;Sev1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature degraded, workaround exists&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sev4&lt;/td&gt;
&lt;td&gt;Sev3&lt;/td&gt;
&lt;td&gt;Sev2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Non-functional impact (slow, noisy, ugly)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sev4&lt;/td&gt;
&lt;td&gt;Sev4&lt;/td&gt;
&lt;td&gt;Sev3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The matrix is intentionally coarse. Three scope buckets, three impact buckets, nine cells. A responder can place an incident in the right cell in seconds without reading a paragraph of definitions.&lt;/p&gt;

&lt;p&gt;Two overrides escalate an incident straight to sev1, whichever cell it lands in:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data loss or security exposure.&lt;/strong&gt; A bug that leaks PII to unauthorized users is sev1 regardless of scope — even if it affects one account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue impact.&lt;/strong&gt; If the checkout flow is broken and orders are failing, that's sev1 even if the monitoring dashboard reports 95% availability — because the 5% that's failing is the 5% that pays the bills.&lt;/li&gt;
&lt;/ol&gt;
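
&lt;p&gt;The matrix and its two overrides are small enough to encode as a lookup. A rough sketch, assuming hypothetical key names for the impact rows and scope columns, and treating either override as an immediate sev1 as the examples above describe:&lt;/p&gt;

```python
# Rows are impact, columns are scope; values mirror the matrix above.
# The key names are illustrative, not a standard vocabulary.
TRIAGE_MATRIX = {
    "broken_no_workaround": {"single": 3, "minority": 2, "majority": 1},
    "degraded_with_workaround": {"single": 4, "minority": 3, "majority": 2},
    "non_functional": {"single": 4, "minority": 4, "majority": 3},
}


def triage(impact, scope, data_or_security=False, revenue_blocking=False):
    """Return a severity label such as 'sev2' for an incident."""
    if data_or_security or revenue_blocking:
        return "sev1"  # the overrides win regardless of the cell
    return f"sev{TRIAGE_MATRIX[impact][scope]}"


print(triage("degraded_with_workaround", "minority"))
print(triage("non_functional", "single", data_or_security=True))
```

&lt;p&gt;Keeping the table as data rather than nested conditionals means recalibrating a cell is a one-character change that the whole team can review.&lt;/p&gt;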

&lt;h2&gt;
  
  
  What each severity triggers
&lt;/h2&gt;

&lt;p&gt;The scale has no value unless it drives concrete actions. Every severity level should map to four things: who gets notified, how fast they respond, what communication cadence they maintain, and whether a post-incident review is mandatory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sev1:&lt;/strong&gt; page on-call + backup + engineering lead. Acknowledge within 5 minutes. Incident channel created, stakeholder updates every 15 minutes, customer-facing status page updated. Mandatory blameless retro within 48 hours with tracked action items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sev2:&lt;/strong&gt; page on-call. Acknowledge within 15 minutes. Incident channel, updates every 30 minutes. Retro recommended at team discretion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sev3:&lt;/strong&gt; Slack channel or email notification. Response within the next business hour. Ticket created, no incident channel. Retro optional, only if the pattern is recurring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sev4:&lt;/strong&gt; logged but no active notification. Next sprint. No communication, no retro.&lt;/p&gt;

&lt;p&gt;If your sev1 and sev2 have the same notification channel, the same response time, and the same retro expectation, you don't have two severity levels — you have one with two names. Merge them or differentiate them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How severity drives MTTR
&lt;/h2&gt;

&lt;p&gt;Your &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; target should vary by severity — and if you're tracking the full set of &lt;a href="https://devhelm.io/blog/mtta-mttr-mtbf-difference" rel="noopener noreferrer"&gt;MTTA, MTTR, MTBF, and MTTF&lt;/a&gt;, severity determines which metric matters most at each tier. A sev1 with a 4-hour MTTR means your most critical incidents take half a workday to resolve — probably too slow. A sev4 with a 4-hour MTTR means you're spending on-call energy on cosmetic issues — probably too fast.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;MTTR target&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sev1&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 hour&lt;/td&gt;
&lt;td&gt;Revenue is actively lost, users are actively blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sev2&lt;/td&gt;
&lt;td&gt;&amp;lt; 4 hours&lt;/td&gt;
&lt;td&gt;Significant impact but not existential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sev3&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 business day&lt;/td&gt;
&lt;td&gt;Limited scope, workaround available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sev4&lt;/td&gt;
&lt;td&gt;Next sprint&lt;/td&gt;
&lt;td&gt;Not time-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These targets feed directly into your &lt;a href="https://devhelm.io/blog/slo-vs-sla-vs-sli" rel="noopener noreferrer"&gt;SLO&lt;/a&gt; error budget. A 99.9% availability SLO on a 30-day window gives you 43 minutes of total downtime. If your sev1 MTTR target is 1 hour, a single sev1 incident blows the budget. That tension is the point — it forces you to invest in the &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbooks&lt;/a&gt; and automation that keep resolution time below the budget threshold.&lt;/p&gt;
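
&lt;p&gt;One way to see that tension is to ask how many incidents at a given MTTR target fit inside the budget. A minimal sketch, assuming every incident runs to its full target and counts entirely as downtime; the function name is illustrative:&lt;/p&gt;

```python
def incidents_within_budget(slo_percent, mttr_minutes, window_days=30):
    """How many incidents of this length the error budget can absorb."""
    budget_minutes = window_days * 24 * 60 * (100 - slo_percent) / 100
    return budget_minutes / mttr_minutes


# A 99.9% SLO against a 60-minute sev1 target and a 15-minute one.
print(round(incidents_within_budget(99.9, 60), 2))
print(round(incidents_within_budget(99.9, 15), 2))
```

&lt;p&gt;Any result under 1 means a single incident at that target already exceeds the budget, so either the target or the SLO has to move.&lt;/p&gt;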

&lt;h2&gt;
  
  
  How DevHelm routes by severity
&lt;/h2&gt;

&lt;p&gt;DevHelm models incident severity as three operational states: &lt;strong&gt;DOWN&lt;/strong&gt;, &lt;strong&gt;DEGRADED&lt;/strong&gt;, and &lt;strong&gt;MAINTENANCE&lt;/strong&gt;. This is deliberately simpler than a sev1-through-sev4 scale. The numbered scale requires human judgment about scope and blast radius; DevHelm's model is automated from check results. When a monitor's trigger rule fires, the rule specifies whether the incident is &lt;code&gt;DOWN&lt;/code&gt; (the service is not responding or failing critically) or &lt;code&gt;DEGRADED&lt;/code&gt; (the service is responding but outside acceptable bounds — slow, returning partial errors, or failing specific assertions).&lt;/p&gt;

&lt;p&gt;The routing happens in notification policies. Each policy has match rules, and one of those rules is &lt;code&gt;severity_gte&lt;/code&gt; — "match when incident severity is greater than or equal to this threshold." Severity is ordered: DOWN &amp;gt; DEGRADED &amp;gt; MAINTENANCE. In practice, this gives you two-track routing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A policy with &lt;code&gt;severity_gte: DOWN&lt;/code&gt; routes to PagerDuty — page the on-call engineer immediately.&lt;/li&gt;
&lt;li&gt;A policy with &lt;code&gt;severity_gte: DEGRADED&lt;/code&gt; routes to a Slack channel — notify the team, no page.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first policy fires only for DOWN incidents — your sev1 equivalent. The second fires for both DOWN and DEGRADED, so a DOWN incident sends both a page and a Slack message (the on-call gets paged, the wider team stays informed). A DEGRADED incident reaches Slack but never PagerDuty. You've split your alert routing by severity without writing any code.&lt;/p&gt;

&lt;p&gt;For richer routing, combine &lt;code&gt;severity_gte&lt;/code&gt; with other match rules. A policy that matches &lt;code&gt;severity_gte: DOWN&lt;/code&gt; AND &lt;code&gt;monitor_tag_in: ["payments", "checkout"]&lt;/code&gt; pages someone for critical payment failures but not for a down developer docs site. That's severity combined with business context — the same intersection the triage matrix above describes, except it's automated instead of decided in the heat of the moment.&lt;/p&gt;
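
&lt;p&gt;The matching logic can be simulated locally to check that a set of policies behaves as intended. This is a sketch of the semantics described above, not DevHelm's API or configuration schema: only the rule names &lt;code&gt;severity_gte&lt;/code&gt; and &lt;code&gt;monitor_tag_in&lt;/code&gt; come from the text, and the dictionary shapes, policy names, and channels are hypothetical.&lt;/p&gt;

```python
SEVERITY_ORDER = ["MAINTENANCE", "DEGRADED", "DOWN"]  # lowest to highest

# Hypothetical policies mirroring the examples above.
POLICIES = [
    {"name": "page-on-call", "severity_gte": "DOWN", "channel": "pagerduty"},
    {"name": "notify-team", "severity_gte": "DEGRADED", "channel": "slack"},
    {
        "name": "payments-escalation",
        "severity_gte": "DOWN",
        "monitor_tag_in": ["payments", "checkout"],
        "channel": "payments-lead",
    },
]


def matches(policy, incident):
    """True when the incident satisfies every match rule the policy sets."""
    threshold = SEVERITY_ORDER.index(policy["severity_gte"])
    if incident["severity"] not in SEVERITY_ORDER[threshold:]:
        return False
    wanted_tags = policy.get("monitor_tag_in")
    if wanted_tags is not None and set(wanted_tags).isdisjoint(incident["tags"]):
        return False
    return True


def route(incident, policies=POLICIES):
    """Channels that should receive this incident, in policy order."""
    return [p["channel"] for p in policies if matches(p, incident)]


print(route({"severity": "DOWN", "tags": ["docs"]}))
print(route({"severity": "DEGRADED", "tags": ["checkout"]}))
```

&lt;p&gt;A down docs site reaches the pager and Slack but not the payments escalation, and a degraded checkout reaches Slack only, which matches the two-track behavior described above.&lt;/p&gt;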

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If your team doesn't have severity levels, start by writing the four definitions in a shared doc and getting three people to agree on them. That takes 30 minutes and pays for itself the first time someone opens an incident.&lt;/p&gt;

&lt;p&gt;Then automate the routing. Set up a monitor in &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;, configure a trigger rule that fires as &lt;code&gt;DOWN&lt;/code&gt; after two consecutive failures confirmed across regions, and wire a notification policy that pages your on-call for DOWN incidents and sends DEGRADED incidents to Slack. You've just built a severity-routed alerting pipeline that distinguishes between "wake someone up" and "the team should know" — running 24/7 without anyone remembering to check the definitions page.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>MTBF Full Form: Mean Time Between Failures — Meaning, Formula, and When It Matters</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:12:56 +0000</pubDate>
      <link>https://dev.to/devhelm/mtbf-full-form-mean-time-between-failures-meaning-formula-and-when-it-matters-2jl2</link>
      <guid>https://dev.to/devhelm/mtbf-full-form-mean-time-between-failures-meaning-formula-and-when-it-matters-2jl2</guid>
      <description>&lt;p&gt;Most reliability conversations start with uptime percentage and stop there. "We're at 99.95% availability" feels like enough, until you realize that a service with 99.95% availability could be down for 22 minutes once a month, or for about 2 seconds every hour. Both hit the same uptime number. MTBF tells you which pattern you actually have.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MTBF stands for
&lt;/h2&gt;

&lt;p&gt;MTBF (Mean Time Between Failures) measures the average elapsed time from the start of one failure to the start of the next. It's a frequency metric: high MTBF means failures are rare; low MTBF means they're frequent. A service with an MTBF of 720 hours (30 days) averages one failure per month. A service with an MTBF of 24 hours averages one failure per day. Both might have the same uptime percentage if the frequent failures resolve quickly, but the operational burden is completely different.&lt;/p&gt;

&lt;p&gt;The term originates in hardware reliability engineering — &lt;a href="https://en.wikipedia.org/wiki/Mean_time_between_failures" rel="noopener noreferrer"&gt;MIL-HDBK-217&lt;/a&gt;, published by the US Department of Defense in 1961, defined MTBF for electronic components. In software, we borrow the concept but adapt it: a "failure" is an incident that crosses a &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;severity threshold&lt;/a&gt; (typically sev1 or sev2), not a hardware component burning out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The formula
&lt;/h2&gt;

&lt;p&gt;For a repairable system (which every software service is), MTBF equals the total elapsed time in the measurement window divided by the number of failures during that window:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTBF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total elapsed time / Number of failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total downtime / Number of failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total uptime / Number of failures (= MTBF - MTTR)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three metrics are related: &lt;strong&gt;MTBF = MTTF + MTTR&lt;/strong&gt;. MTTF is the time the system runs without failing; MTTR is the time it takes to recover. Together they span the full cycle from one failure to the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A payment processing service runs for 30 days (720 hours). During that window, it experiences 3 incidents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;Started&lt;/th&gt;
&lt;th&gt;Resolved&lt;/th&gt;
&lt;th&gt;Downtime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1&lt;/td&gt;
&lt;td&gt;Day 4, 14:00&lt;/td&gt;
&lt;td&gt;Day 4, 14:45&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2&lt;/td&gt;
&lt;td&gt;Day 12, 03:20&lt;/td&gt;
&lt;td&gt;Day 12, 04:10&lt;/td&gt;
&lt;td&gt;50 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#3&lt;/td&gt;
&lt;td&gt;Day 25, 09:00&lt;/td&gt;
&lt;td&gt;Day 25, 09:30&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total downtime: 125 minutes (2.08 hours). Total uptime: 720 - 2.08 = 717.92 hours.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MTBF&lt;/td&gt;
&lt;td&gt;720 / 3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;240 hours (10 days)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;125 min / 3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7 minutes (0.69 hours)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTF&lt;/td&gt;
&lt;td&gt;240 - 0.69&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;239.3 hours&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uptime %&lt;/td&gt;
&lt;td&gt;717.92 / 720&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.71%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The uptime number (99.71%) tells stakeholders the service was available most of the time. The MTBF (10 days) tells the engineering team that failures happen roughly every week and a half — often enough to warrant investment in prevention, not just faster recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  MTBF vs MTTR vs MTTF — side by side
&lt;/h2&gt;

&lt;p&gt;Teams often confuse these three metrics. Here's the clean distinction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Measures&lt;/th&gt;
&lt;th&gt;Improves by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTBF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time Between Failures&lt;/td&gt;
&lt;td&gt;How often failures occur&lt;/td&gt;
&lt;td&gt;Preventing failures (better testing, redundancy, dependency isolation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Recovery&lt;/td&gt;
&lt;td&gt;How fast you recover&lt;/td&gt;
&lt;td&gt;Faster detection, better &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbooks&lt;/a&gt;, automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Failure&lt;/td&gt;
&lt;td&gt;How long the system runs before failing&lt;/td&gt;
&lt;td&gt;Same as MTBF — it's the operating-time component&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two less common but useful metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Detect&lt;/td&gt;
&lt;td&gt;Lag between failure start and the first alert firing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Acknowledge&lt;/td&gt;
&lt;td&gt;Lag between alert firing and a human acknowledging it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full incident timeline runs: &lt;strong&gt;failure occurs -&amp;gt; MTTD -&amp;gt; alert fires -&amp;gt; MTTA -&amp;gt; responder acknowledges -&amp;gt; works the issue -&amp;gt; MTTR -&amp;gt; recovery&lt;/strong&gt;. MTBF spans the gap between recoveries.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to measure MTBF in practice
&lt;/h2&gt;

&lt;p&gt;MTBF requires two inputs: a time window and a count of failures. Both are harder to pin down than they sound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining "failure."&lt;/strong&gt; Not every alert is a failure. A monitor that flaps for 30 seconds and self-recovers is an anomaly, not an incident. Most teams count only incidents above a severity threshold — sev1 and sev2 — when computing MTBF. If you include sev3 and sev4, your MTBF drops dramatically but the number stops being useful because it mixes service-impacting failures with noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the window.&lt;/strong&gt; A 7-day window is too noisy — one bad week skews the number. A 365-day window smooths out seasonality but hides recent trends. The sweet spot for most teams is a &lt;strong&gt;rolling 30-day or 90-day window&lt;/strong&gt;, reported weekly. If your MTBF is trending down over 4 consecutive weeks, something systemic is degrading — even if no single week looks alarming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling planned maintenance.&lt;/strong&gt; Exclude planned maintenance windows from the failure count and from operating time. A team that takes its service down for 2 hours every Sunday for database maintenance should not count those windows as "failures." If you include them, MTBF becomes a meaningless number that punishes disciplined operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-service, not fleet-wide.&lt;/strong&gt; A fleet-wide MTBF that averages your payment service (fails once a quarter) with your notification service (fails weekly) tells you nothing actionable. Compute MTBF per service or per component. The payment team needs to know their MTBF, not the company average.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MTBF tells you that uptime percentage does not
&lt;/h2&gt;

&lt;p&gt;Uptime percentage compresses all failure information into a single number. Two services can have identical uptime (99.9%) with completely different failure patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Uptime&lt;/th&gt;
&lt;th&gt;Failures/month&lt;/th&gt;
&lt;th&gt;Avg downtime per failure&lt;/th&gt;
&lt;th&gt;MTBF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Service A&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;43 minutes&lt;/td&gt;
&lt;td&gt;720 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service B&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4.3 minutes&lt;/td&gt;
&lt;td&gt;72 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Service A is stable but slow to recover. Service B is fragile but recovers fast. The &lt;a href="https://devhelm.io/blog/slo-vs-sla-vs-sli" rel="noopener noreferrer"&gt;SLO&lt;/a&gt; error budget treats them identically — both consume the same 43 minutes of downtime per month. But the operational strategies are opposite: Service A needs faster MTTR (better runbooks, automation). Service B needs higher MTBF (better testing, dependency isolation, circuit breakers).&lt;/p&gt;
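
&lt;p&gt;The comparison is easy to reproduce. A short sketch, assuming a 720-hour window and MTBF measured as window length divided by failure count, as in the table above; the function name is illustrative:&lt;/p&gt;

```python
def profile(failures_per_month, avg_downtime_minutes, window_hours=720):
    """Summarize a service's failure pattern over one window."""
    downtime_hours = failures_per_month * avg_downtime_minutes / 60
    uptime = (window_hours - downtime_hours) / window_hours * 100
    return {
        "uptime_percent": round(uptime, 2),
        "mtbf_hours": window_hours / failures_per_month,
        "mttr_minutes": avg_downtime_minutes,
    }


print("Service A:", profile(1, 43))
print("Service B:", profile(10, 4.3))
```

&lt;p&gt;Both profiles round to the same uptime, while the MTBF and MTTR fields differ by a factor of ten in opposite directions.&lt;/p&gt;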

&lt;p&gt;If you only track uptime, you prescribe the same medicine to both patients. MTBF and MTTR together give you the diagnosis.&lt;/p&gt;

&lt;h2&gt;
  
  
  When MTBF is misleading
&lt;/h2&gt;

&lt;p&gt;MTBF is an average, and averages lie when the underlying distribution is skewed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rare catastrophic failures.&lt;/strong&gt; A service that runs perfectly for 364 days and then suffers a 12-hour outage on day 365 has an MTBF of 8,736 hours. That number sounds excellent — until the one failure costs $2M in lost revenue. MTBF doesn't capture tail risk. For services where the cost of a single failure is extreme, pair MTBF with a worst-case downtime metric (longest incident duration over the window).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heterogeneous failure modes.&lt;/strong&gt; If your service fails due to three completely different root causes (DNS resolution, database connection pool exhaustion, and a memory leak), averaging them into a single MTBF obscures the fact that one cause dominates. Compute MTBF per root cause category when possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early-life systems.&lt;/strong&gt; A new service has no meaningful MTBF. You need at least 3-5 failure cycles to compute a statistically useful average. Reporting MTBF after one incident in two weeks is technically correct ("MTBF = 336 hours") but practically useless — the confidence interval is enormous.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DevHelm gives you the raw data
&lt;/h2&gt;

&lt;p&gt;DevHelm does not expose MTBF as a dashboard metric today — that's on the roadmap. What it does give you is the incident history that MTBF is computed from.&lt;/p&gt;

&lt;p&gt;Every incident records a &lt;code&gt;created_at&lt;/code&gt; timestamp (when the incident opened), a &lt;code&gt;resolved_at&lt;/code&gt; timestamp (when it closed), and a severity (&lt;code&gt;DOWN&lt;/code&gt; or &lt;code&gt;DEGRADED&lt;/code&gt;). The dashboard's incident summary already computes &lt;strong&gt;MTTR over a rolling 30-day window&lt;/strong&gt; from these timestamps. MTBF is the complementary calculation: take the &lt;code&gt;resolved_at&lt;/code&gt; of incident N and the &lt;code&gt;created_at&lt;/code&gt; of incident N+1, average those gaps, and you have MTTF — then add MTTR to get MTBF.&lt;/p&gt;

&lt;p&gt;If you need the number today, the API returns the full incident list for any monitor at &lt;code&gt;GET /api/v1/incidents&lt;/code&gt;, filtered by severity and date range. A script that walks the list and computes MTBF per monitor is straightforward — and when we ship native MTBF in the analytics view, the underlying calculation will use the same timestamps.&lt;/p&gt;
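
&lt;p&gt;Such a script might look like the sketch below. It assumes the incident list has already been fetched and parsed into dictionaries carrying ISO 8601 &lt;code&gt;created_at&lt;/code&gt; and &lt;code&gt;resolved_at&lt;/code&gt; strings; the response shape, the function name, and the sample data are assumptions, so check the API documentation for the actual payload.&lt;/p&gt;

```python
from datetime import datetime


def reliability_metrics(incidents):
    """Compute MTTF, MTTR and MTBF (in hours) from resolved incidents."""
    spans = sorted(
        (datetime.fromisoformat(i["created_at"]),
         datetime.fromisoformat(i["resolved_at"]))
        for i in incidents
    )
    if len(spans) in (0, 1):
        raise ValueError("need at least two incidents to measure a gap")

    opened = [start for start, _ in spans]
    closed = [end for _, end in spans]

    # MTTF: average gap from one resolution to the next incident opening.
    gaps = [(nxt - prev).total_seconds() for prev, nxt in zip(closed, opened[1:])]
    mttf = sum(gaps) / len(gaps) / 3600

    # MTTR: average time from incident open to resolve.
    repairs = [(end - start).total_seconds() for start, end in spans]
    mttr = sum(repairs) / len(repairs) / 3600

    return {"mttf": mttf, "mttr": mttr, "mtbf": mttf + mttr}


# Hypothetical incident history for one monitor.
sample = [
    {"created_at": "2026-05-01T10:00:00", "resolved_at": "2026-05-01T10:30:00"},
    {"created_at": "2026-05-11T10:30:00", "resolved_at": "2026-05-11T11:30:00"},
    {"created_at": "2026-05-21T11:30:00", "resolved_at": "2026-05-21T12:00:00"},
]
print(reliability_metrics(sample))
```

&lt;p&gt;Filter the input to the severities you count as failures before calling it, and run it once per monitor so that a noisy service does not drag down the figure for a stable one.&lt;/p&gt;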

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you're tracking uptime but not MTBF, pick your highest-traffic service, pull its sev1+sev2 incident history for the last 90 days, and compute MTBF with the formula above. Compare it to your &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt;. If MTBF is low and MTTR is low, you have a fragile-but-fast-recovering service — invest in prevention. If MTBF is high and MTTR is high, you have a stable-but-slow-recovering service — invest in &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbooks&lt;/a&gt; and detection speed.&lt;/p&gt;

&lt;p&gt;Set up monitoring for the service at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; and let the incident history accumulate. After 30 days you'll have enough data points to compute your first real MTBF — and the trend line from that point forward is worth more than any snapshot.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/mtbf-full-form" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Jaeger Tracing Explained: How Distributed Tracing Works</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:12:56 +0000</pubDate>
      <link>https://dev.to/devhelm/jaeger-tracing-explained-how-distributed-tracing-works-5334</link>
      <guid>https://dev.to/devhelm/jaeger-tracing-explained-how-distributed-tracing-works-5334</guid>
      <description>&lt;p&gt;Distributed tracing answers the question that uptime monitoring can't: a request failed, but which service in the chain caused it?&lt;/p&gt;

&lt;p&gt;When your checkout endpoint returns a 500 and your monitoring dashboard shows the API is degraded, you know &lt;em&gt;that&lt;/em&gt; something broke. You don't know &lt;em&gt;where&lt;/em&gt; in the chain of payment-service -&amp;gt; inventory-service -&amp;gt; shipping-service the failure originated. Distributed tracing instruments every service in the path and stitches the results into a single timeline — a trace — that shows exactly where latency accumulated or where an error propagated from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jaegertracing.io/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; is one of the most widely deployed open-source distributed tracing backends and a &lt;a href="https://www.cncf.io/projects/jaeger/" rel="noopener noreferrer"&gt;CNCF graduated project&lt;/a&gt;. This guide covers how it works, how to set it up with OpenTelemetry, and when tracing complements (rather than replaces) uptime monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Jaeger does
&lt;/h2&gt;

&lt;p&gt;Jaeger collects, stores, and visualizes traces. A trace is a tree of spans — each span represents one unit of work (an HTTP request handler, a database query, a gRPC call to another service). Spans carry timing data, status codes, and arbitrary tags.&lt;/p&gt;

&lt;p&gt;When you instrument your services with OpenTelemetry SDKs, each outgoing request propagates a trace ID via HTTP headers. Jaeger collects spans from every service, groups them by trace ID, and renders the full request path as a timeline.&lt;/p&gt;
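
&lt;p&gt;To make the propagation step concrete, here is a rough sketch of what travels in that header, assuming the W3C Trace Context &lt;code&gt;traceparent&lt;/code&gt; format that OpenTelemetry SDKs use by default. The helper names are hypothetical; in a real service the SDK's propagator builds and parses this header for you.&lt;/p&gt;

```python
import secrets


def new_traceparent():
    """Build a traceparent value for the first span of a new trace."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by every span in the trace
    span_id = secrets.token_hex(8)    # 16 hex chars, unique to this span
    return f"00-{trace_id}-{span_id}-01"  # version 00, flags 01 = sampled


def child_traceparent(incoming):
    """Keep the caller's trace ID but mint a new span ID for the outgoing call."""
    version, trace_id, _caller_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts a trace; service B continues it on its own downstream call.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
same_trace = header_a.split("-")[1] == header_b.split("-")[1]
print(header_a, header_b, same_trace)
```

&lt;p&gt;Because both headers carry the same trace ID, the backend can group the spans from both services into one trace.&lt;/p&gt;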

&lt;p&gt;The architecture has four components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight daemon that receives spans from the SDK and forwards to the collector. Optional with OTLP — the SDK can send directly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Collector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Receives spans, validates, indexes, and writes to storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API + UI for searching and visualizing traces.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pluggable backend — Elasticsearch, Cassandra, Kafka, Badger, or in-memory for development.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For development and small deployments, Jaeger ships an &lt;strong&gt;all-in-one&lt;/strong&gt; binary that bundles all four components with in-memory storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 16686:16686 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4317:4317 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\&lt;/span&gt;
  jaegertracing/all-in-one:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Port 16686 is the Jaeger UI. Ports 4317 (gRPC) and 4318 (HTTP) receive OTLP spans from OpenTelemetry SDKs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting with OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;Jaeger originally shipped its own client libraries, but the project now officially recommends &lt;a href="https://opentelemetry.io/docs/languages/" rel="noopener noreferrer"&gt;OpenTelemetry SDKs&lt;/a&gt; for instrumentation. The OTel SDK instruments your code; the OTLP exporter sends spans to Jaeger's collector endpoint.&lt;/p&gt;

&lt;p&gt;A minimal Python example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkout-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-1234&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order.total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;89.99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;BatchSpanProcessor&lt;/code&gt; buffers spans and flushes them in batches to avoid blocking your request path. For production, add resource attributes (service name, version, environment) so you can filter traces in the Jaeger UI.&lt;/p&gt;

&lt;p&gt;The equivalent in TypeScript with auto-instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@opentelemetry/exporter-trace-otlp-grpc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getNodeAutoInstrumentations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@opentelemetry/auto-instrumentations-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;traceExporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:4317&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;getNodeAutoInstrumentations&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;checkout-service&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;auto-instrumentations-node&lt;/code&gt; package automatically instruments HTTP, gRPC, Express, database clients, and dozens of other libraries — no manual span creation needed for most frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jaeger vs Zipkin vs commercial APMs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Jaeger&lt;/th&gt;
&lt;th&gt;Zipkin&lt;/th&gt;
&lt;th&gt;Datadog APM / New Relic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Elasticsearch, Cassandra, Kafka, Badger&lt;/td&gt;
&lt;td&gt;Elasticsearch, Cassandra, MySQL&lt;/td&gt;
&lt;td&gt;Vendor-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OTLP native, Thrift legacy&lt;/td&gt;
&lt;td&gt;Zipkin JSON/Thrift, OTLP via collector&lt;/td&gt;
&lt;td&gt;OTLP, proprietary agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-instrumentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via OpenTelemetry SDKs&lt;/td&gt;
&lt;td&gt;Via OpenTelemetry or Brave&lt;/td&gt;
&lt;td&gt;Proprietary agents with deep integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure only&lt;/td&gt;
&lt;td&gt;Infrastructure only&lt;/td&gt;
&lt;td&gt;$15-75/host/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trace analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search + compare in UI&lt;/td&gt;
&lt;td&gt;Search in UI&lt;/td&gt;
&lt;td&gt;ML anomaly detection, dashboards, alerting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Choose Jaeger&lt;/strong&gt; when you want full control over your tracing data, you're already running Elasticsearch or Cassandra, and you have the capacity to operate the backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose a commercial APM&lt;/strong&gt; when you need trace-based alerting, ML anomaly detection, or you don't want to operate storage infrastructure. The cost is real, but so is the operational burden of self-hosted tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Zipkin&lt;/strong&gt; when you have an existing Zipkin deployment. For new setups, Jaeger has better OTLP support and a more active community.&lt;/p&gt;

&lt;p&gt;For larger deployments, the &lt;a href="https://devhelm.io/blog/otel-collector-explained" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; sits between your SDKs and Jaeger, handling batching, sampling, and multi-backend fan-out.&lt;/p&gt;

&lt;h2&gt;
  
  
  When tracing isn't enough
&lt;/h2&gt;

&lt;p&gt;Tracing answers "which service in the chain is slow?" It doesn't answer "is the service reachable from the outside?" A trace only exists when a request is made — if your service is completely down and rejecting connections, there are no spans to collect.&lt;/p&gt;

&lt;p&gt;This is where uptime monitoring and tracing complement each other:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;th&gt;Blind spot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uptime monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the service responding? How fast? Is the SSL cert valid?&lt;/td&gt;
&lt;td&gt;Why is it slow? Which internal dependency is the bottleneck?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where in the call chain did latency accumulate? Which downstream service errored?&lt;/td&gt;
&lt;td&gt;Is the service reachable at all? Did the DNS resolve? Did the cert expire?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most useful setup is both: an external monitor that checks your endpoints every 30 seconds from multiple regions, combined with internal tracing that captures the request path when those endpoints are hit. When the monitor fires an alert because p95 latency crossed your &lt;a href="https://devhelm.io/blog/slo-vs-sla-vs-sli" rel="noopener noreferrer"&gt;SLO&lt;/a&gt; threshold, the trace for that time window shows you exactly which downstream call caused the spike.&lt;/p&gt;

&lt;p&gt;For dependency-driven outages — a cloud provider degrades, your payment service slows down, your checkout endpoint breaches its latency budget — a &lt;a href="https://devhelm.io/status/aws" rel="noopener noreferrer"&gt;vendor status feed&lt;/a&gt; tells you the dependency degraded before you start digging through traces. That head start on root cause identification is the difference between a 15-minute &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; and an hour-long investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you've never used distributed tracing, start with Jaeger all-in-one in Docker (the command above), instrument one service with the OpenTelemetry SDK, and trace a single request end-to-end. The first time you see a 2-second span on a database query that you thought took 50ms, tracing has paid for itself.&lt;/p&gt;

&lt;p&gt;Pair it with external monitoring. Set up a monitor at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; for the same endpoint you're tracing — you'll know both &lt;em&gt;that&lt;/em&gt; the service is degraded and &lt;em&gt;where&lt;/em&gt; in the call chain the problem lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/jaeger-tracing" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>MTTA, MTTR, MTBF, MTTF — The Four Incident Metrics, Compared</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:12:05 +0000</pubDate>
      <link>https://dev.to/devhelm/mtta-mttr-mtbf-mttf-the-four-incident-metrics-compared-584j</link>
      <guid>https://dev.to/devhelm/mtta-mttr-mtbf-mttf-the-four-incident-metrics-compared-584j</guid>
      <description>&lt;p&gt;Four acronyms show up in every incident management conversation: MTTA, MTTR, MTBF, and MTTF. They get jumbled together in slide decks, confused in retro discussions, and mixed up in job interviews. They measure four different things, from four different timestamps, with four different improvement levers.&lt;/p&gt;

&lt;p&gt;This guide puts all four side by side, traces them through a single incident timeline, and answers the question that matters: which one should you track, and when?&lt;/p&gt;

&lt;h2&gt;
  
  
  One incident, four metrics
&lt;/h2&gt;

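&lt;p&gt;The cleanest way to understand the four metrics is to walk through one incident and label where each measurement starts and stops. The second table also includes MTTD (mean time to detect), the detection metric that usually travels with them.&lt;/p&gt;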

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;14:00:00&lt;/td&gt;
&lt;td&gt;Service starts returning 500 errors (&lt;strong&gt;failure begins&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:02:30&lt;/td&gt;
&lt;td&gt;Monitor fires an alert (&lt;strong&gt;detection&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:00&lt;/td&gt;
&lt;td&gt;On-call engineer acknowledges the page (&lt;strong&gt;acknowledgment&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:42:00&lt;/td&gt;
&lt;td&gt;Service restored, incident resolved (&lt;strong&gt;recovery&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Next failure occurs 12 days later&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From these timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Measures&lt;/th&gt;
&lt;th&gt;Start&lt;/th&gt;
&lt;th&gt;End&lt;/th&gt;
&lt;th&gt;This incident&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time to detect&lt;/td&gt;
&lt;td&gt;Failure begins (14:00)&lt;/td&gt;
&lt;td&gt;Alert fires (14:02:30)&lt;/td&gt;
&lt;td&gt;2.5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time to acknowledge&lt;/td&gt;
&lt;td&gt;Alert fires (14:02:30)&lt;/td&gt;
&lt;td&gt;Engineer acks (14:05)&lt;/td&gt;
&lt;td&gt;2.5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time to recovery&lt;/td&gt;
&lt;td&gt;Failure begins (14:00)&lt;/td&gt;
&lt;td&gt;Recovery (14:42)&lt;/td&gt;
&lt;td&gt;42 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time to next failure&lt;/td&gt;
&lt;td&gt;Recovery (14:42)&lt;/td&gt;
&lt;td&gt;Next failure start&lt;/td&gt;
&lt;td&gt;~12 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTBF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full cycle&lt;/td&gt;
&lt;td&gt;This failure start&lt;/td&gt;
&lt;td&gt;Next failure start&lt;/td&gt;
&lt;td&gt;~12 days + 42 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MTTD and MTTA are the early-warning metrics — they tell you how fast you detected and responded. MTTR is the incident-duration metric — it measures total impact time. MTTF and MTBF are the reliability metrics — they measure how often failures happen.&lt;/p&gt;
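
&lt;p&gt;The arithmetic behind the table can be sketched in a few lines. The calendar date is hypothetical, and the sketch assumes the "12 days later" is counted from recovery. Each value is a single sample; the "mean" in each metric comes from averaging these samples across many incidents.&lt;/p&gt;

```python
from datetime import datetime, timedelta

day = "2026-06-01"  # hypothetical date; only the times come from the timeline
failure = datetime.fromisoformat(f"{day}T14:00:00")
alert = datetime.fromisoformat(f"{day}T14:02:30")
ack = datetime.fromisoformat(f"{day}T14:05:00")
recovery = datetime.fromisoformat(f"{day}T14:42:00")
next_failure = recovery + timedelta(days=12)  # assumption: counted from recovery


def minutes(delta):
    return delta.total_seconds() / 60


ttd = minutes(alert - failure)     # time to detect
tta = minutes(ack - alert)         # time to acknowledge
ttr = minutes(recovery - failure)  # time to recovery
ttf = next_failure - recovery      # operating time until the next failure
tbf = next_failure - failure       # full cycle, failure start to failure start

print(ttd, tta, ttr, ttf, tbf)
```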

&lt;h2&gt;
  
  
  MTTA — Mean Time To Acknowledge
&lt;/h2&gt;

&lt;p&gt;MTTA measures the gap between an alert firing and a human confirming they're working on it. It's a measure of your on-call process, not your technical system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it captures:&lt;/strong&gt; pager responsiveness, on-call discipline, alert routing effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it misses:&lt;/strong&gt; everything after acknowledgment. A team with a 30-second MTTA and a 4-hour resolution time has a fast paging system and a slow debugging process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvement levers:&lt;/strong&gt; better alert routing (fewer false positives means alerts get trusted and acknowledged faster), escalation policies that page a backup after N minutes, on-call overlap during shift handoffs.&lt;/p&gt;

&lt;p&gt;DevHelm tracks &lt;code&gt;confirmedAt&lt;/code&gt; (multi-region incident confirmation) but does not yet record a separate human acknowledgment timestamp — the acknowledgment step lives in your PagerDuty, Opsgenie, or Slack integration today.&lt;/p&gt;

&lt;h2&gt;
  
  
  MTTR — Mean Time To Recovery
&lt;/h2&gt;

&lt;p&gt;MTTR is the most widely tracked incident metric. It measures the total elapsed time from failure start to service recovery. This is the metric your &lt;a href="https://devhelm.io/blog/slo-vs-sla-vs-sli" rel="noopener noreferrer"&gt;SLO&lt;/a&gt; error budget cares about: every minute in MTTR is a minute of downtime consumed.&lt;/p&gt;

&lt;p&gt;The deep dive is in the &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR full form guide&lt;/a&gt;, but the key point for comparison: MTTR includes detection time, acknowledgment time, diagnosis, and fix. It's an end-to-end metric, which makes it the most useful for external stakeholders but the hardest to improve because the bottleneck could be anywhere in the chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it captures:&lt;/strong&gt; total customer-facing impact time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it misses:&lt;/strong&gt; failure frequency. A service with a 5-minute MTTR that fails ten times a month has a fundamentally different problem than one with a 5-minute MTTR that fails once a year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvement levers:&lt;/strong&gt; faster detection (monitoring with short check intervals), better &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbooks&lt;/a&gt; (reduce diagnosis time), automated remediation, multi-region failover.&lt;/p&gt;

&lt;h2&gt;
  
  
  MTBF — Mean Time Between Failures
&lt;/h2&gt;

&lt;p&gt;MTBF measures the average time from the start of one failure to the start of the next. It's the reliability metric: high MTBF means the system is stable; low MTBF means it breaks often.&lt;/p&gt;

&lt;p&gt;The deep dive is in the &lt;a href="https://devhelm.io/blog/mtbf-full-form" rel="noopener noreferrer"&gt;MTBF full form guide&lt;/a&gt;. The key point here: &lt;strong&gt;MTBF = MTTF + MTTR&lt;/strong&gt;. It spans the entire failure cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it captures:&lt;/strong&gt; system stability, failure frequency, whether your reliability investments are working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it misses:&lt;/strong&gt; failure severity. A service with 100 sev4 flaps per month has a terrible MTBF but no real reliability problem. Filter by &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;severity level&lt;/a&gt; (sev1+sev2 only) to keep MTBF meaningful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvement levers:&lt;/strong&gt; root cause elimination, dependency isolation (circuit breakers, fallbacks), better testing, capacity planning.&lt;/p&gt;

&lt;h2&gt;
  
  
  MTTF — Mean Time To Failure
&lt;/h2&gt;

&lt;p&gt;MTTF measures the operating time between recovery and the next failure — "how long does the system run before it breaks again?"&lt;/p&gt;

&lt;p&gt;In hardware reliability, MTTF is for non-repairable components (light bulbs, hard drives) while MTBF is for repairable systems. In software, everything is repairable, so MTTF is the uptime component of MTBF: &lt;strong&gt;MTTF = MTBF - MTTR&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it captures:&lt;/strong&gt; the same thing as MTBF minus the recovery time. For services with low MTTR (minutes), MTTF and MTBF are nearly identical.&lt;/p&gt;

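&lt;p&gt;&lt;strong&gt;What it misses:&lt;/strong&gt; how long recovery takes. MTTF counts only operating time, so a 5-minute outage and a 5-hour outage look identical. It does reflect recovery quality, though: if you "fix" an incident by restarting a pod and the root cause is still there, MTTF will be short because the failure recurs quickly. MTTF rewards durable fixes.&lt;/p&gt;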

&lt;p&gt;&lt;strong&gt;When MTTF matters more than MTBF:&lt;/strong&gt; when your MTTR is highly variable. If some incidents take 5 minutes and others take 5 hours, MTBF averages the downtime in, masking the variance. MTTF isolates the operating-time question from the recovery-time question.&lt;/p&gt;

&lt;h2&gt;
  
  
  When you need which one
&lt;/h2&gt;

&lt;p&gt;Not every team needs every metric. Here's the decision framework:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question you're asking&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"How fast do we respond to alerts?"&lt;/td&gt;
&lt;td&gt;MTTA&lt;/td&gt;
&lt;td&gt;Measures on-call process health&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"How long are our customers affected?"&lt;/td&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;Measures total incident duration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"How often do things break?"&lt;/td&gt;
&lt;td&gt;MTBF&lt;/td&gt;
&lt;td&gt;Measures failure frequency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Are our fixes durable?"&lt;/td&gt;
&lt;td&gt;MTTF&lt;/td&gt;
&lt;td&gt;Isolates operating time from recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Is our monitoring fast enough?"&lt;/td&gt;
&lt;td&gt;MTTD&lt;/td&gt;
&lt;td&gt;Measures detection lag&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Start with MTTR.&lt;/strong&gt; Every team should track it, because it directly maps to customer impact and error budgets. The &lt;a href="https://sre.google/workbook/implementing-slos/" rel="noopener noreferrer"&gt;Google SRE Workbook&lt;/a&gt; centers its SLO framework on availability, and availability falls as cumulative recovery time rises: every minute added to total downtime is a minute taken out of it.&lt;/p&gt;
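
&lt;p&gt;One way to make the link between availability and these metrics explicit is the standard steady-state formula, availability = MTTF / (MTTF + MTTR). A minimal sketch, using the numbers from the incident timeline earlier in this article as if they were long-run averages (an assumption made purely for illustration):&lt;/p&gt;

```python
def availability(mttf_minutes, mttr_minutes):
    """Steady-state availability: the uptime share of one full failure cycle."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)


mttf = 12 * 24 * 60  # 12 days of uptime, in minutes
mttr = 42            # minutes of downtime per incident
a = availability(mttf, mttr)
print(f"{a:.4%}")
```

&lt;p&gt;Halving MTTR or doubling MTTF both raise the result, which is why the two metrics are usually read together.&lt;/p&gt;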

&lt;p&gt;&lt;strong&gt;Add MTBF when MTTR is stable but incidents are too frequent.&lt;/strong&gt; If your MTTR is 15 minutes but you're having incidents three times a week, the problem isn't response speed — it's system stability. MTBF makes that visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add MTTA when you suspect paging is the bottleneck.&lt;/strong&gt; If incidents take 45 minutes to resolve but 20 of those minutes are "waiting for someone to respond," MTTA makes the on-call gap visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track MTTF when you suspect fixes aren't durable.&lt;/strong&gt; If the same incident recurs within days of being "resolved," the time-to-failure gaps after those fixes will be conspicuously short, even when the averaged MTTF and MTBF still look acceptable (because the mean blends in the stable periods between recurrence clusters). Look at the individual gaps, not only the mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Averaging across severity levels.&lt;/strong&gt; Ten sev4 flaps of about a minute each plus one 2-hour sev1 outage produce an "MTTR of 12 minutes" that hides the 2-hour sev1. Always segment metrics by &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;severity level&lt;/a&gt;.&lt;/p&gt;
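
&lt;p&gt;A small sketch of the segmentation, using hypothetical data (ten one-minute sev4 flaps and one two-hour sev1 outage, values chosen for illustration only):&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (severity, duration in minutes).
incidents = [("sev4", 1)] * 10 + [("sev1", 120)]

# Blended MTTR across every severity.
blended = mean(minutes for _, minutes in incidents)

# MTTR per severity level.
by_severity = defaultdict(list)
for severity, minutes in incidents:
    by_severity[severity].append(minutes)
segmented = {sev: mean(durations) for sev, durations in by_severity.items()}

print(round(blended, 1), segmented)
```

&lt;p&gt;The blended figure lands near 12 minutes, while the segmented view shows the sev1 at its full 120.&lt;/p&gt;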

&lt;p&gt;&lt;strong&gt;Counting self-healing as incidents.&lt;/strong&gt; If your system auto-recovers in 30 seconds, is that a "failure" for MTBF purposes? Most teams exclude incidents that resolve within the confirmation window (e.g., DevHelm's multi-region confirmation requires failures across at least 2 regions before opening an incident). If you don't exclude auto-recoveries, MTBF becomes noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing MTBF across services.&lt;/strong&gt; A payment service and a notification service have fundamentally different blast radii. Comparing their MTBF is like comparing a car engine's MTBF to a light switch's. Track each service independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring partial recoveries.&lt;/strong&gt; An incident where the service is "up but slow" for 2 hours, then fully recovered, has a different MTTR depending on whether you measure to partial recovery or full recovery. Define your measurement convention and stick to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you're tracking nothing, start with MTTR. Pull the last 90 days of sev1+sev2 incidents, compute the average duration from start to resolution, and write that number down. Next month, compute it again. The trend matters more than the absolute number.&lt;/p&gt;

&lt;p&gt;Once MTTR is stable, add MTBF. Together they tell you whether you're dealing with a fragile-but-fast-recovering system (invest in prevention) or a stable-but-slow-recovering system (invest in &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbooks&lt;/a&gt; and detection speed). That diagnostic drives your reliability roadmap more than any single metric could.&lt;/p&gt;

&lt;p&gt;Set up monitoring at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; — every incident records the timestamps you need for both metrics. The 30-day rolling MTTR is already in your dashboard; MTBF is a script away from the incident API.&lt;/p&gt;
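
&lt;p&gt;As a rough idea of what that script could look like, here is a minimal sketch. It assumes you have already exported incidents as (started, resolved) timestamp pairs, oldest first; the sample data and the &lt;code&gt;mtbf&lt;/code&gt; and &lt;code&gt;mttr&lt;/code&gt; helper names are hypothetical, and no particular incident API is implied. MTBF is taken here as the mean uptime between the end of one incident and the start of the next, which is one common convention.&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (started, resolved) pairs, oldest first.
# In practice these would come from your incident tracker export.
incidents = [
    (datetime(2026, 5, 1, 2, 0), datetime(2026, 5, 1, 2, 30)),
    (datetime(2026, 5, 11, 2, 30), datetime(2026, 5, 11, 3, 30)),
    (datetime(2026, 5, 25, 3, 30), datetime(2026, 5, 25, 3, 45)),
]


def mtbf(rows):
    """Mean uptime between the end of one incident and the start of the next."""
    gaps = [
        later_start - earlier_end
        for (_, earlier_end), (later_start, _) in zip(rows, rows[1:])
    ]
    if not gaps:
        raise ValueError("need at least two incidents to compute MTBF")
    return sum(gaps, timedelta()) / len(gaps)


def mttr(rows):
    """Mean duration from start to resolution."""
    durations = [resolved - started for started, resolved in rows]
    return sum(durations, timedelta()) / len(durations)


print("MTBF:", mtbf(incidents))
print("MTTR:", mttr(incidents))
```

&lt;p&gt;Filter the input to one service and one severity band before calling either function, for the reasons covered in the pitfalls above.&lt;/p&gt;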




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/mtta-mttr-mtbf-difference" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>OpenTelemetry Collector Explained: What It Does and When You Need One</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:12:04 +0000</pubDate>
      <link>https://dev.to/devhelm/opentelemetry-collector-explained-what-it-does-and-when-you-need-one-4bc9</link>
      <guid>https://dev.to/devhelm/opentelemetry-collector-explained-what-it-does-and-when-you-need-one-4bc9</guid>
      <description>&lt;p&gt;The first question every team asks when adopting OpenTelemetry is: "Do I need a Collector, or can my SDK export directly to Jaeger / Prometheus / Datadog?" The answer determines whether you run an extra piece of infrastructure or skip it entirely. Most guides explain what the Collector is without answering that question first. This one starts there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need a Collector?
&lt;/h2&gt;

&lt;p&gt;Three questions decide it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do you export to more than one backend?&lt;/strong&gt; If your traces go to &lt;a href="https://devhelm.io/blog/jaeger-tracing" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; and your metrics go to Prometheus and your logs go to Loki, the Collector fans out from a single OTLP stream. Without it, every service needs three exporters configured in its SDK — three sets of credentials, three retry policies, three failure modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do you need to transform telemetry before it lands?&lt;/strong&gt; Scrubbing PII from span attributes, sampling 10% of low-priority traces, enriching resource labels with Kubernetes metadata — these are processor-layer concerns. The SDK can do basic attribute manipulation, but batch logic, tail-based sampling, and cross-signal correlation live in the Collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do you want to decouple services from backend changes?&lt;/strong&gt; If you switch from Jaeger to Tempo next quarter, a Collector means you change one exporter config in one place. Without it, you redeploy every instrumented service.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you answered "no" to all three — you export to one backend, you don't transform telemetry, and your backend choice is stable — skip the Collector. Configure the OTLP exporter in your SDK to point directly at the backend. You can always add a Collector later without changing your instrumentation code, because the SDK speaks OTLP either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Collector actually is
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; is a vendor-agnostic proxy that receives telemetry, processes it, and exports it. The pipeline model has three stages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Receivers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ingest data from sources&lt;/td&gt;
&lt;td&gt;OTLP, Prometheus scrape, Jaeger Thrift, Fluent Forward, host metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Processors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transform, filter, enrich, sample&lt;/td&gt;
&lt;td&gt;Batch, memory limiter, attributes, tail sampling, k8s attributes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exporters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send data to backends&lt;/td&gt;
&lt;td&gt;OTLP (accepted natively by Jaeger and Tempo), Prometheus remote write, Elasticsearch, Datadog, Loki&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You wire them together in a YAML config. One Collector can run multiple pipelines — traces, metrics, and logs each get their own receiver-processor-exporter chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent vs gateway deployment
&lt;/h2&gt;

&lt;p&gt;Two deployment models, not mutually exclusive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent mode.&lt;/strong&gt; Run one Collector instance per node (DaemonSet in Kubernetes, sidecar in ECS). Each instance receives spans from local services over localhost, batches them, and forwards to the backend. Low latency, no network hop for span delivery, but you run N instances for N nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway mode.&lt;/strong&gt; Run a small cluster of Collector instances behind a load balancer. All services send OTLP to the gateway endpoint. Centralized processing, easier to manage sampling and routing rules, but introduces a network hop and a single point of failure (mitigate with horizontal scaling and health checks).&lt;/p&gt;

&lt;p&gt;Most production setups use both: agents on every node handle local buffering and basic processing, then forward to a gateway that handles tail-based sampling and multi-backend fan-out.&lt;/p&gt;

&lt;h2&gt;
  
  
  A working configuration
&lt;/h2&gt;

&lt;p&gt;Here's a Collector config for a common stack: receive OTLP from instrumented services, batch spans to reduce export overhead, and send to a Jaeger backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
  &lt;span class="na"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;limit_mib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
    &lt;span class="na"&gt;spike_limit_mib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/jaeger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-collector:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/jaeger&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



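&lt;p&gt;The &lt;code&gt;memory_limiter&lt;/code&gt; processor is not optional in production. Without it, a burst of spans can OOM the Collector. &lt;code&gt;check_interval: 1s&lt;/code&gt; makes it poll memory usage every second. When usage crosses the soft limit (&lt;code&gt;limit_mib&lt;/code&gt; minus &lt;code&gt;spike_limit_mib&lt;/code&gt;, 384 MiB here), the processor starts refusing incoming data, so senders either retry or drop it; above &lt;code&gt;limit_mib&lt;/code&gt; it also forces garbage collection. Shedding load is better than crashing.&lt;/p&gt;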

&lt;p&gt;Run it with the contrib distribution, which includes all community-maintained receivers and exporters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; otel-collector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./otel-config.yaml:/etc/otelcol-contrib/config.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4317:4317 &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\&lt;/span&gt;
  otel/opentelemetry-collector-contrib:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory bloat from unbounded batching.&lt;/strong&gt; The default batch processor has no memory limit. In a spike, it buffers everything in memory until the export succeeds or the process dies. Always pair &lt;code&gt;batch&lt;/code&gt; with &lt;code&gt;memory_limiter&lt;/code&gt; and set &lt;code&gt;spike_limit_mib&lt;/code&gt; to at most 25% of your container's memory limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exporter back-pressure cascading upstream.&lt;/strong&gt; If Jaeger is slow to ingest, the Collector's export queue fills up, the batch processor blocks, and incoming OTLP requests start timing out in your application SDKs. Set &lt;code&gt;sending_queue.queue_size&lt;/code&gt; on the exporter and accept that data will be dropped under sustained back-pressure — dropping spans is better than adding latency to your application's hot path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the core distribution instead of contrib.&lt;/strong&gt; The core Collector (&lt;code&gt;otel/opentelemetry-collector&lt;/code&gt;) ships with a minimal set of components. Most real deployments need contrib receivers (Prometheus, Fluent Forward) or exporters (Datadog, Elasticsearch). Use &lt;code&gt;otel/opentelemetry-collector-contrib&lt;/code&gt; unless you're building a custom distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping resource detection.&lt;/strong&gt; Without the &lt;code&gt;resourcedetection&lt;/code&gt; processor (or &lt;code&gt;k8sattributes&lt;/code&gt; in Kubernetes), your spans lack metadata like &lt;code&gt;k8s.namespace.name&lt;/code&gt;, &lt;code&gt;k8s.pod.name&lt;/code&gt;, and &lt;code&gt;cloud.region&lt;/code&gt;. Debugging a trace without knowing which pod produced it is painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collector vs SDK-direct export
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;SDK-direct&lt;/th&gt;
&lt;th&gt;Via Collector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One fewer network hop&lt;/td&gt;
&lt;td&gt;Adds ~1-5ms (localhost agent) or ~5-20ms (gateway)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SDK retries on failure; spans may be lost if the app crashes&lt;/td&gt;
&lt;td&gt;Collector buffers and retries independently of the app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One exporter per backend, configured per service&lt;/td&gt;
&lt;td&gt;Fan-out, sampling, enrichment in one place&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero extra infra&lt;/td&gt;
&lt;td&gt;DaemonSet or gateway to run and monitor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a single-service, single-backend setup (one API exporting traces to Jaeger), SDK-direct is simpler and has fewer moving parts. For anything beyond that — multiple services, multiple backends, or any processing requirement — the Collector is worth the operational cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How monitoring complements the Collector pipeline
&lt;/h2&gt;

&lt;p&gt;The Collector is infrastructure, and infrastructure fails. An OOM, a misconfigured exporter, or a network partition between the Collector and your backend means spans are silently dropped. You won't notice until someone asks "why are there no traces for the last 2 hours?"&lt;/p&gt;

&lt;p&gt;External uptime monitoring closes that gap. A monitor that checks your Collector's health endpoint (&lt;code&gt;http://collector:13133/&lt;/code&gt;, served by the &lt;code&gt;health_check&lt;/code&gt; extension, which has to be enabled in the Collector config) every 30 seconds catches Collector failures before the gap in your trace data becomes an incident. If your &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; for "Collector is down" is 2 hours because nobody noticed, a 30-second check interval cuts that to minutes.&lt;/p&gt;
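
&lt;p&gt;To make the check concrete, here is a minimal sketch of a probe. It assumes the Collector's &lt;code&gt;health_check&lt;/code&gt; extension is enabled and reachable from wherever the probe runs; the function name &lt;code&gt;collector_is_healthy&lt;/code&gt; and the URL in the comment are illustrative. A hosted monitor does the same thing from outside your network, on a schedule, with alerting attached.&lt;/p&gt;

```python
import urllib.error
import urllib.request


def collector_is_healthy(url, timeout=5.0):
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


# Example call (hypothetical address, adjust to your deployment):
#   collector_is_healthy("http://collector:13133/")
```

&lt;p&gt;Treating every failure mode as "unhealthy" is deliberate: for alerting purposes, an unreachable Collector and an unhealthy one call for the same response.&lt;/p&gt;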

&lt;p&gt;Set up a monitor at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; for every piece of your observability stack — the Collector, Jaeger, Prometheus, Grafana. The irony of observability infrastructure is that it's the last thing teams monitor, and the first thing that causes blind spots when it fails.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/otel-collector-explained" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>MTTR Full Form: Meaning, Formula, and How to Reduce It</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Wed, 27 May 2026 19:50:41 +0000</pubDate>
      <link>https://dev.to/devhelm/mttr-full-form-meaning-formula-and-how-to-reduce-it-3o2n</link>
      <guid>https://dev.to/devhelm/mttr-full-form-meaning-formula-and-how-to-reduce-it-3o2n</guid>
      <description>&lt;p&gt;Ask three SREs what MTTR stands for and you'll get three answers. Mean Time To Recovery. Mean Time To Repair. Mean Time To Respond. Sometimes Resolve. They are not the same metric. They measure different parts of an incident, they imply different ownership boundaries, and conflating them is the single most common reason an engineering team's "MTTR is improving" slide is meaningless. This guide pins down the MTTR full form, walks through the formula with real numbers, distinguishes MTTR from MTBF, MTTF, MTTA, and MTTD, and ends with the six changes that actually reduce the number — including one that most monitoring tools cannot do at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MTTR stands for
&lt;/h2&gt;

&lt;p&gt;MTTR can expand to any of four metrics that share the same letters but have very different definitions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Acronym&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Typical owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mean Time To Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;From the moment the service degrades to the moment it is fully back to normal, averaged across incidents&lt;/td&gt;
&lt;td&gt;SRE / on-call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;Mean Time To &lt;strong&gt;Repair&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;From the moment an engineer starts working on a fix to the moment the fix is deployed&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;Mean Time To &lt;strong&gt;Respond&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;From alert to first human acknowledgement&lt;/td&gt;
&lt;td&gt;On-call rotation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;Mean Time To &lt;strong&gt;Resolve&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;From incident creation to the incident being closed in the tracker (includes paperwork)&lt;/td&gt;
&lt;td&gt;Incident commander&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The version you almost always want, and the one used in the SRE literature and the &lt;a href="https://sre.google/workbook/" rel="noopener noreferrer"&gt;Google SRE Workbook&lt;/a&gt;, is &lt;strong&gt;Mean Time To Recovery&lt;/strong&gt;, the customer-visible downtime metric. Repair and Respond are sub-stages inside it; Resolve runs past it, because it keeps counting until the paperwork is closed. When you see a public claim such as "MTTR is 8 minutes" without further qualification, assume Recovery. When a vendor pitches you a 30-second MTTR, ask which one they mean: it is almost certainly Mean Time To Respond, which is the easiest to game.&lt;/p&gt;

&lt;p&gt;In this guide, MTTR means Mean Time To Recovery unless explicitly noted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MTTR formula
&lt;/h2&gt;

&lt;p&gt;The formula is simple: &lt;strong&gt;MTTR = total downtime / number of incidents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A worked example. In May 2026 a hypothetical SaaS had four customer-impacting incidents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;Detected&lt;/th&gt;
&lt;th&gt;Recovered&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stripe webhook backlog&lt;/td&gt;
&lt;td&gt;02:14&lt;/td&gt;
&lt;td&gt;02:51&lt;/td&gt;
&lt;td&gt;37 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres failover&lt;/td&gt;
&lt;td&gt;09:22&lt;/td&gt;
&lt;td&gt;09:34&lt;/td&gt;
&lt;td&gt;12 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI 429 spike&lt;/td&gt;
&lt;td&gt;14:01&lt;/td&gt;
&lt;td&gt;14:48&lt;/td&gt;
&lt;td&gt;47 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDN cert expiry&lt;/td&gt;
&lt;td&gt;18:30&lt;/td&gt;
&lt;td&gt;18:34&lt;/td&gt;
&lt;td&gt;4 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MTTR = 100 min / 4 incidents = 25 minutes.&lt;/strong&gt;&lt;/p&gt;
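
&lt;p&gt;The same arithmetic as a short script, using the four incidents from the table. The &lt;code&gt;minutes_between&lt;/code&gt; helper is illustrative and assumes both clock times fall on the same day.&lt;/p&gt;

```python
from datetime import datetime

# The four incidents from the table: (name, detected, recovered).
incidents = [
    ("Stripe webhook backlog", "02:14", "02:51"),
    ("Postgres failover", "09:22", "09:34"),
    ("OpenAI 429 spike", "14:01", "14:48"),
    ("CDN cert expiry", "18:30", "18:34"),
]


def minutes_between(start, end):
    """Minutes between two HH:MM clock times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


durations = [minutes_between(start, end) for _, start, end in incidents]
total_downtime = sum(durations)
mttr = total_downtime / len(durations)

print(f"total downtime: {total_downtime:.0f} min")
print(f"MTTR: {mttr:.0f} min")
```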

&lt;p&gt;Two warnings before you start tracking this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning 1: averages hide the long tail.&lt;/strong&gt; A team with one 4-hour incident and ten 5-minute incidents has nearly the same MTTR (about 26 minutes) as a team with eleven 25-minute incidents. The first team has a tail problem; the second has a baseline problem. Always look at MTTR alongside the p95 incident duration and the count of incidents over 1 hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning 2: the start time is contested.&lt;/strong&gt; Customer-visible downtime starts when the first customer experiences a problem, not when your alert fires. If your detection has a 2-minute lag and your MTTR is 8 minutes from alert, the customer-visible MTTR is 10 minutes. Most teams quietly start the clock at alert time because it makes the number smaller. Don't.&lt;/p&gt;
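
&lt;p&gt;Both warnings can be checked with a few lines. This sketch compares the two teams from Warning 1 and applies the 2-minute detection lag from Warning 2. The &lt;code&gt;p95&lt;/code&gt; function uses the nearest-rank method, which is one of several percentile definitions, and &lt;code&gt;summarize&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import math
from statistics import mean


def p95(durations):
    """Nearest-rank 95th percentile of a list of durations."""
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(durations, detection_lag=0):
    """Mean, p95, and count of incidents over 1 hour, in minutes.

    detection_lag is added to every duration so the clock starts at
    customer impact rather than at alert time.
    """
    visible = [d + detection_lag for d in durations]
    return {
        "mttr": round(mean(visible), 1),
        "p95": p95(visible),
        "over_1h": sum(1 for d in visible if d > 60),
    }


tail_team = [240] + [5] * 10   # one 4-hour incident, ten 5-minute incidents
baseline_team = [25] * 11      # eleven 25-minute incidents

print("tail team:", summarize(tail_team))
print("baseline team:", summarize(baseline_team))
print("8 min from alert, 2 min lag:", summarize([8], detection_lag=2))
```

&lt;p&gt;The means come out within a couple of minutes of each other, while the p95 and the over-1-hour count separate the two teams immediately.&lt;/p&gt;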

&lt;h2&gt;
  
  
  MTTR vs MTBF, MTTF, MTTA, and MTTD
&lt;/h2&gt;

&lt;p&gt;The reliability metric family is large and the acronyms overlap enough that even seasoned SREs slip up. Here's the canonical lineup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Acronym&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Typical good value (SaaS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTBF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time Between Failures&lt;/td&gt;
&lt;td&gt;Average uptime between two consecutive incidents&lt;/td&gt;
&lt;td&gt;Weeks to months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Failure&lt;/td&gt;
&lt;td&gt;Average uptime of a component before it fails (used for non-repairable parts)&lt;/td&gt;
&lt;td&gt;Years (hardware), N/A for services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Detect&lt;/td&gt;
&lt;td&gt;From the moment a problem starts to the moment it's detected&lt;/td&gt;
&lt;td&gt;&amp;lt; 1 minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Acknowledge&lt;/td&gt;
&lt;td&gt;From alert fired to first human acknowledgement&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mean Time To Recovery&lt;/td&gt;
&lt;td&gt;From problem start to service restored&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MTTD and MTTA are not added on top of MTTR; they are the first two stages inside it. The shorthand is &lt;strong&gt;MTTD + MTTA + the time spent on repair = MTTR (Recovery)&lt;/strong&gt;. If you split it that way, you can see which sub-stage is dragging the number, and the sub-stages almost always need different fixes.&lt;/p&gt;
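
&lt;p&gt;A small sketch of that split, assuming each incident record carries four timestamps. The field names (&lt;code&gt;started&lt;/code&gt;, &lt;code&gt;detected&lt;/code&gt;, &lt;code&gt;acknowledged&lt;/code&gt;, &lt;code&gt;recovered&lt;/code&gt;) and the sample incidents are hypothetical.&lt;/p&gt;

```python
from datetime import datetime
from statistics import mean

# Hypothetical timeline per incident; field names are illustrative.
incidents = [
    {"started": "14:00", "detected": "14:01", "acknowledged": "14:04", "recovered": "14:30"},
    {"started": "09:20", "detected": "09:23", "acknowledged": "09:30", "recovered": "09:40"},
]


def minutes(start, end):
    """Minutes between two HH:MM clock times on the same day."""
    fmt = "%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60


def stage_means(rows):
    """Mean minutes spent in each stage, plus end-to-end recovery time."""
    return {
        "mttd": mean(minutes(r["started"], r["detected"]) for r in rows),
        "mtta": mean(minutes(r["detected"], r["acknowledged"]) for r in rows),
        "repair": mean(minutes(r["acknowledged"], r["recovered"]) for r in rows),
        "mttr": mean(minutes(r["started"], r["recovered"]) for r in rows),
    }


stages = stage_means(incidents)
print(stages)
```

&lt;p&gt;In this sample the three stage means add up to the recovery mean, and the repair stage is clearly the largest share, so that is where the first improvement effort would go.&lt;/p&gt;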

&lt;p&gt;For an internal dashboard, track all five and let each have its own threshold. The teams that ship the fastest improvements treat MTTD as a detection problem, MTTA as an alerting and routing problem, and MTTR-the-repair as an engineering practice problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's a good MTTR?
&lt;/h2&gt;

&lt;p&gt;There is no industry-wide answer. The right MTTR depends on what your service does, how many customers it has, and what you're willing to pay to move the number. A few benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DORA elite performers&lt;/strong&gt; (State of DevOps Report, 2023): MTTR under 1 hour. Low performers: more than 1 week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google SRE textbook&lt;/strong&gt;: target depends on the service's &lt;a href="https://sre.google/sre-book/embracing-risk/" rel="noopener noreferrer"&gt;error budget&lt;/a&gt;. A 99.9% SLO over a 30-day window allows 43 minutes of downtime — your MTTR per incident must fit inside what's left of that budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atlassian incident management benchmark&lt;/strong&gt; (&lt;a href="https://www.atlassian.com/incident-management/kpis/common-metrics" rel="noopener noreferrer"&gt;source&lt;/a&gt;): teams at the median run an MTTR around 4 hours. Top quartile is under 30 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Banking and trading platforms&lt;/strong&gt;: regulators sometimes mandate MTTR thresholds tied to capital reserves. 5 minutes is not unusual for top-tier financial services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal-only B2B SaaS with no real-time SLA&lt;/strong&gt;: 1-4 hours is acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use these as anchors, not as targets. The right MTTR for &lt;em&gt;your&lt;/em&gt; service comes from your SLO and your customer impact model, not from a benchmark.&lt;/p&gt;
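
&lt;p&gt;Deriving a target from an SLO is mostly arithmetic. The sketch below converts an availability SLO into a downtime budget and shows how many incidents of a given average length fit inside it; both function names are hypothetical, and the model deliberately ignores partial degradation.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def incidents_that_fit(slo_percent, mttr_minutes, window_days=30):
    """How many incidents of average length fit inside the budget."""
    return int(error_budget_minutes(slo_percent, window_days) // mttr_minutes)


budget = error_budget_minutes(99.9)
print(f"99.9% over 30 days: {budget:.1f} min of budget")
print("incidents at 25 min MTTR:", incidents_that_fit(99.9, 25))
print("incidents at 8 min MTTR:", incidents_that_fit(99.9, 8))
```

&lt;p&gt;With a 99.9% SLO, a 25-minute MTTR leaves room for a single incident per 30-day window, which is a more useful framing than any external benchmark.&lt;/p&gt;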

&lt;h2&gt;
  
  
  How to actually reduce MTTR
&lt;/h2&gt;

&lt;p&gt;Six changes consistently move the number. They're listed in order of how quickly they pay off for a small team that's just starting to take this seriously.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Write runbooks for the top five recurring incidents
&lt;/h3&gt;

&lt;p&gt;Most incidents repeat. If you've fought the same Postgres failover behavior twice this quarter, you'll fight it again next quarter. Pick the five most common incident types in your tracker and write a runbook for each. The runbook needs five sections: trigger, symptoms, diagnosis steps, mitigation, verification. Aim for 200-400 words per runbook. Store them where on-call can find them in 30 seconds at 3 AM — not in a wiki you have to log into, not in a Notion folder buried three clicks deep. We keep ours in the same git repository as the service code under &lt;code&gt;runbooks/&lt;/code&gt;, linked directly from every alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tune alert routing to skip people who can't act
&lt;/h3&gt;

&lt;p&gt;If a Stripe webhook backlog wakes up the whole team but only one person knows how to clear it, every other pager-recipient adds noise without adding hands. Route severity-2 alerts to a primary; promote to a wider group only if the primary doesn't acknowledge in 5 minutes. The MTTA drops, the MTTR drops, the team stops resenting on-call.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Make rollback the cheap option
&lt;/h3&gt;

&lt;p&gt;If your rollback path takes 30 minutes and your forward-fix path takes 20 minutes, every incident becomes "let's try to fix it forward." Forward-fixes under pressure produce more incidents. Cut your rollback path to under 5 minutes and the first response to most incidents becomes "roll back, then debug calmly." Every CI/CD platform supports this — most teams just don't drill it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Run blameless post-mortems and actually follow the action items
&lt;/h3&gt;

&lt;p&gt;A post-mortem that produces an action-items list which nobody owns is theatre. Assign each action item to a single named person with a due date, track them in your issue tracker, and report on closed-vs-open in your weekly engineering review. The action items from last quarter's incidents are the cheapest way to reduce next quarter's MTTR — they're already pre-prioritized by the fact that the incident hurt enough to investigate.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Detect dependency failures before your monitors do
&lt;/h3&gt;

&lt;p&gt;This is the change most monitoring tools cannot make for you — and it's one of the key differences when &lt;a href="https://devhelm.io/vs/checkly" rel="noopener noreferrer"&gt;comparing monitoring platforms&lt;/a&gt;. A substantial share of customer-visible SaaS incidents — many teams report something near a third of them — are caused by an upstream dependency degradation: Stripe slows down, OpenAI rate-limits, your CDN edge has a regional issue, your database provider has a partial outage. When that happens, your own monitors fire — but you spend the first 15 minutes deciding &lt;em&gt;whose&lt;/em&gt; problem it is. Status pages that watch your vendors and surface their incidents alongside your own check failures collapse that diagnostic window. We'll come back to this in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Schedule game days
&lt;/h3&gt;

&lt;p&gt;A team that has never practiced an incident response will respond slowly. A team that runs a chaos exercise once a quarter — kill the database primary, simulate a CDN outage, page someone who isn't on-call — recovers from real incidents noticeably faster. The investment is one engineering day per quarter. The payback is measured in customer hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DevHelm reduces MTTR
&lt;/h2&gt;

&lt;p&gt;DevHelm is built around one specific MTTR problem: incidents where the root cause is a vendor your service depends on, not your service itself. When Stripe degrades, your checkout monitor fires. Your billing-webhook monitor fires. Your subscription-sync monitor fires. Three pages. Three Slack pings. Twenty minutes of "is anyone seeing the Stripe dashboard?" before someone confirms that Stripe themselves posted a status update.&lt;/p&gt;

&lt;p&gt;DevHelm shortens that pattern in two concrete ways today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Vendor status pages, watched continuously.&lt;/strong&gt; We aggregate 100+ vendor status pages — &lt;a href="https://devhelm.io/status/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://devhelm.io/status/aws" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;, &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;, &lt;a href="https://devhelm.io/status/datadog" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;, and every major dependency. Each page has its own incident feed; you can subscribe each one to the same Slack channel your own monitors page to, so an AWS degradation lands in your incident channel within seconds of AWS posting it, alongside your monitor alerts. That collapses the diagnostic loop: instead of fifteen minutes of "is it the dependency or is it us?", the answer is in the same channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Resource groups, for collapsing self-noise.&lt;/strong&gt; When several of your monitors share a single failure mode (e.g. all of them call Stripe), you put them in a resource group with a single notification policy. One incident, not three. The recovery clock keeps ticking until Stripe themselves recover, but your on-call isn't paged three times for one root cause. The YAML for the on-call-friendly version looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# devhelm.yml&lt;/span&gt;
&lt;span class="na"&gt;resourceGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-fleet&lt;/span&gt;
    &lt;span class="na"&gt;monitors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;checkout-page-uptime&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;billing-webhook-uptime&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;subscription-sync-uptime&lt;/span&gt;
&lt;span class="na"&gt;notificationPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stripe-fleet-sev1&lt;/span&gt;
    &lt;span class="na"&gt;matchRules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitor_id_in&lt;/span&gt;
        &lt;span class="na"&gt;monitorNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;checkout-page-uptime&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;billing-webhook-uptime&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;subscription-sync-uptime&lt;/span&gt;
    &lt;span class="na"&gt;escalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;oncall-slack&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The piece DevHelm doesn't yet do automatically is the cross-correlation step — when Stripe goes down, your in-flight monitor alerts don't get auto-marked as "probably caused by that vendor" without you wiring up the resource group + Slack subscription yourself. That auto-correlation is on the roadmap. Until it ships, the manual wire-up is the work — but the work pays back at every dependency incident, which is a meaningful slice of total incidents for most modern SaaS teams.&lt;/p&gt;

&lt;p&gt;The other action items in the previous section — runbooks, alert routing, rollback discipline, post-mortem follow-through, game days — DevHelm can't do for you. But it can give you back the time you currently spend correlating vendor outages by hand. If you're troubleshooting a specific failure mode right now, the &lt;a href="https://devhelm.io/blog/how-to-fix-slow-dns-lookup" rel="noopener noreferrer"&gt;DNS resolution guide&lt;/a&gt; and the &lt;a href="https://devhelm.io/status/github" rel="noopener noreferrer"&gt;vendor status feeds&lt;/a&gt; (GitHub, &lt;a href="https://devhelm.io/status/aws" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;, &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;, and 100+ more) are useful starting points.&lt;/p&gt;

&lt;p&gt;If your last incident was a vendor problem and your team spent the first 20 minutes figuring out whose problem it was, that diagnostic loop is the cheapest fix target. Spin up a free account at &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;app.devhelm.io&lt;/a&gt; and connect your first vendor in 60 seconds — no credit card.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Why We Built DevHelm</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Sun, 17 May 2026 14:27:44 +0000</pubDate>
      <link>https://dev.to/devhelm/why-we-built-devhelm-4ppj</link>
      <guid>https://dev.to/devhelm/why-we-built-devhelm-4ppj</guid>
      <description>&lt;p&gt;The era of monitoring tools built for human engineers is over.&lt;/p&gt;

&lt;p&gt;Site reliability is undergoing a profound shift triggered by the agentic AI wave, and to understand why it matters, it helps to look at the pattern that came before. Every major innovation cycle until now was defined by a new environment that software runs in: on-premise to cloud, cloud to mobile, monolith to microservices. Through each of those transitions, humans remained indispensable to the SRE process. Humans debugged. Humans investigated. Humans identified root causes and remediated them, while communicating the full picture to customers and stakeholders along the way.&lt;/p&gt;

&lt;p&gt;The next wave is different. AI SRE agents can now automate large parts of that process and free human time for the decisions that actually require judgment. An AI agent can conduct a root cause investigation, understand the blast radius of an incident, classify its priority, and surface that context to on-call engineers — all before a human has finished reading the first alert. In this environment, the speed of iteration on reliability increases dramatically. AI can investigate faster, identify patterns earlier, and be far more proactive about surfacing deep underlying issues by synthesizing information from sources that no single engineer would think to check at once.&lt;/p&gt;

&lt;p&gt;In that world, the monitoring infrastructure itself becomes the agent's most critical tool. And for it to be effective, it has to be built in an agent-first, developer-first way. It must provide clear primitives for management, operations, and forensic investigation — alongside the external-facing artifacts that reliability demands, like status pages and incident communications. The old approach to SRE tooling, built around beautiful but unintegrated dashboards designed for human eyes, is fundamentally incompatible with this new paradigm.&lt;/p&gt;

&lt;p&gt;That is why we built DevHelm. Our primary focus was to deliver a developer-first, API-first platform that supports operations in this new AI-driven reality. We are launching with uptime monitoring, dependency monitoring, status pages, and developer artifacts purpose-built for agentic operations: a native CLI, Cursor and Claude skills, Python and TypeScript SDKs, an MCP server, and a Terraform provider — all included from the free tier.&lt;/p&gt;

&lt;p&gt;Our long-term goal is to build a unified reliability platform that powers the next generation of applications and services built in the agentic AI era.&lt;/p&gt;

&lt;p&gt;We are just getting started. &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;Try DevHelm free&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/why-we-built-devhelm" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>product</category>
      <category>launch</category>
    </item>
    <item>
      <title>Introducing DevHelm</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Sun, 17 May 2026 14:27:39 +0000</pubDate>
      <link>https://dev.to/devhelm/introducing-devhelm-1fne</link>
      <guid>https://dev.to/devhelm/introducing-devhelm-1fne</guid>
      <description>&lt;p&gt;Today we are launching DevHelm — a reliability platform built to bring developer-first, agent-first monitoring infrastructure to the teams that need it most.&lt;/p&gt;

&lt;p&gt;Seeing a &lt;a href="https://devhelm.io/blog/why-we-built-devhelm" rel="noopener noreferrer"&gt;massive shift in how site reliability is practiced&lt;/a&gt;, driven by the agentic AI wave, we decided that monitoring needed to be rebuilt around a different premise: something developers define in code, AI agents operate programmatically, and your entire team understands through clear external-facing artifacts. Everything we ship reflects that premise, from the core monitoring infrastructure to the way you interact with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring that understands your stack
&lt;/h2&gt;

&lt;p&gt;At the foundation, DevHelm provides multi-protocol uptime monitoring — HTTP, DNS, TCP, and ICMP checks running from five continents at 30-second intervals with multi-region confirmation before any alert fires. That part is table stakes, and we made sure it works well.&lt;/p&gt;

&lt;p&gt;What makes DevHelm different is what sits on top of it. We track over 100 external services — Stripe, AWS, GitHub, Auth0, OpenAI, and dozens more — and correlate their health with your monitors in real time. When a vendor degrades, you don't get a separate alert for every endpoint that happens to depend on it. You get one resource group alert that tells you what's affected and why, with the vendor incident already linked. The goal is signal, not noise: your team should spend time fixing problems, not figuring out whether a problem is even yours to fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built for developers and their agents
&lt;/h2&gt;

&lt;p&gt;The monitoring infrastructure is only useful if it's accessible to the tools and workflows that actually operate your stack. That's why every capability in DevHelm is available through a full developer surface from the free tier: a native CLI, Python and TypeScript SDKs, a Terraform provider, an MCP server, and pre-built skills for Cursor and Claude. Monitors can be defined in a YAML config in your repo and deployed through your CI pipeline — no dashboard clicks required.&lt;/p&gt;

&lt;p&gt;In practice, this means an AI agent working in Cursor or Claude can define monitors, configure alert routing, set up a status page, and investigate an incident through the same programmatic interfaces a human developer would use. The platform doesn't distinguish between the two, because in the operating model we're building for, it shouldn't have to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Status pages and incident communication
&lt;/h2&gt;

&lt;p&gt;Reliability isn't just an internal discipline — it has an external face. Every DevHelm account includes a public status page with custom domain support, real-time monitor status, and subscriber notifications. Status pages are not an upsell; they are part of the reliability infrastructure, and they're included from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;We are launching with uptime monitoring, dependency intelligence, status pages, and the full developer artifact surface. This is the foundation. Our roadmap builds toward a unified reliability platform: deeper forensic investigation tools, richer incident lifecycle management, and tighter integration with the AI agents that increasingly operate alongside engineering teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;Try DevHelm free&lt;/a&gt; — 50 monitors, a status page with custom domain, and the full developer surface. No credit card.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/introducing-devhelm" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>product</category>
      <category>launch</category>
    </item>
    <item>
      <title>What SSL Error Means and How to Fix It</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Sun, 17 May 2026 14:22:13 +0000</pubDate>
      <link>https://dev.to/devhelm/what-ssl-error-means-and-how-to-fix-it-1joi</link>
      <guid>https://dev.to/devhelm/what-ssl-error-means-and-how-to-fix-it-1joi</guid>
      <description>&lt;p&gt;An SSL error means your browser or HTTP client could not complete the TLS handshake with the server. The connection was dropped before any data was exchanged. Instead of your page, your users see a full-screen warning — and most of them leave.&lt;/p&gt;

&lt;p&gt;The term "SSL error" is a holdover. SSL (Secure Sockets Layer) was deprecated in 2015 when &lt;a href="https://datatracker.ietf.org/doc/html/rfc7568" rel="noopener noreferrer"&gt;RFC 7568&lt;/a&gt; declared SSL 3.0 obsolete. Every modern HTTPS connection uses TLS (Transport Layer Security) — versions 1.2 or 1.3. Browsers still display "SSL" in error codes because the names stuck, but the protocol under the hood is always TLS. Throughout this article, "SSL error" refers to any TLS handshake failure your browser surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens during a TLS handshake
&lt;/h2&gt;

&lt;p&gt;When a browser connects to an HTTPS server, the TLS handshake negotiates a shared encryption key. The server presents its certificate, the browser verifies the chain of trust back to a root CA, checks the hostname, and confirms the certificate has not expired. If any step fails, the browser aborts and shows an error page.&lt;/p&gt;

&lt;p&gt;The three most common failure points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Certificate validity&lt;/strong&gt; — the cert is expired, not yet valid, or revoked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hostname mismatch&lt;/strong&gt; — the cert was issued for &lt;code&gt;api.example.com&lt;/code&gt; but the browser hit &lt;code&gt;www.example.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain of trust&lt;/strong&gt; — an intermediate certificate is missing, or the cert is self-signed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding which step failed tells you exactly where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decode the error message — Chrome, Firefox, and Safari
&lt;/h2&gt;

&lt;p&gt;Different browsers surface different error codes for the same underlying TLS failure. This table maps the most common ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Chrome&lt;/th&gt;
&lt;th&gt;Firefox&lt;/th&gt;
&lt;th&gt;Safari&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Expired certificate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET::ERR_CERT_DATE_INVALID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SEC_ERROR_EXPIRED_CERTIFICATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"This certificate has expired"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong hostname&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET::ERR_CERT_COMMON_NAME_INVALID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SSL_ERROR_BAD_CERT_DOMAIN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"This certificate is not valid for the requested site"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-signed cert&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET::ERR_CERT_AUTHORITY_INVALID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SEC_ERROR_UNKNOWN_ISSUER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"This certificate was signed by an unknown authority"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incomplete chain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET::ERR_CERT_AUTHORITY_INVALID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SEC_ERROR_UNKNOWN_ISSUER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"This certificate is not trusted"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS version too old&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ERR_SSL_VERSION_OR_CIPHER_MISMATCH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SSL_ERROR_UNSUPPORTED_VERSION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Safari can't establish a secure connection to the server" (no specific code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revoked certificate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET::ERR_CERT_REVOKED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SEC_ERROR_REVOKED_CERTIFICATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"This certificate has been revoked"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you see &lt;code&gt;ERR_SSL_PROTOCOL_ERROR&lt;/code&gt; in Chrome, the server likely rejected the handshake outright — possibly because it only supports TLS 1.0/1.1 (both deprecated) or has a misconfigured cipher suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1 — Expired or not-yet-valid certificate
&lt;/h2&gt;

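&lt;p&gt;An expired certificate is the single most common cause of SSL errors. Certificates have a fixed validity window: typically 90 days for &lt;a href="https://letsencrypt.org/docs/faq/" rel="noopener noreferrer"&gt;Let's Encrypt&lt;/a&gt; and, historically, up to 398 days for paid CAs. The CA/Browser Forum is lowering that ceiling in stages, starting with a 200-day maximum for certificates issued from March 15, 2026.&lt;/p&gt;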

&lt;p&gt;&lt;strong&gt;Diagnose it&lt;/strong&gt; with &lt;code&gt;openssl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl s_client &lt;span class="nt"&gt;-connect&lt;/span&gt; yoursite.com:443 &lt;span class="nt"&gt;-servername&lt;/span&gt; yoursite.com &amp;lt;/dev/null 2&amp;gt;/dev/null | openssl x509 &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="nt"&gt;-dates&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;notBefore=Feb 15 00:00:00 2026 GMT
notAfter=May 16 23:59:59 2026 GMT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;notAfter&lt;/code&gt; is in the past, the cert has expired.&lt;/p&gt;
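&lt;p&gt;To turn that check into something a script can act on, the &lt;code&gt;notAfter&lt;/code&gt; line can be parsed and compared against the current time. The following is a minimal Python sketch, not part of any DevHelm tooling. The function name &lt;code&gt;days_until_expiry&lt;/code&gt; is hypothetical, and the code assumes the English month abbreviations and &lt;code&gt;GMT&lt;/code&gt; suffix shown in the example output above:&lt;/p&gt;

```python
from datetime import datetime, timezone


def days_until_expiry(not_after_line, now=None):
    """Return whole days until the notAfter timestamp (negative once expired).

    Hypothetical helper: expects one line of `openssl x509 -noout -dates`
    output, e.g. "notAfter=May 16 23:59:59 2026 GMT".
    """
    # Keep only the part after "notAfter=".
    value = not_after_line.strip().split("=", 1)[1]
    # openssl reports the time in GMT, so attach UTC explicitly.
    expires = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z").replace(
        tzinfo=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    # timedelta.days rounds toward negative infinity, so any expired
    # certificate yields a negative number.
    return (expires - now).days
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; explicitly keeps the function easy to test; in a cron job you would omit it and alert when the result drops below whatever margin your team has chosen.&lt;/p&gt;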

&lt;p&gt;&lt;strong&gt;Fix it:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Renew the certificate through your CA or ACME client (&lt;code&gt;certbot renew&lt;/code&gt;, for example)&lt;/li&gt;
&lt;li&gt;Reload your web server — &lt;code&gt;sudo systemctl reload nginx&lt;/code&gt; or &lt;code&gt;sudo systemctl reload apache2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Verify the new cert is live: re-run the &lt;code&gt;openssl&lt;/code&gt; command above and confirm the dates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your cert is not yet valid (&lt;code&gt;notBefore&lt;/code&gt; is in the future), either the cert was installed before its validity window opened, or the clock on the machine doing the validation is wrong. Check with &lt;code&gt;date -u&lt;/code&gt; and sync via NTP if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2 — Wrong hostname or missing SAN
&lt;/h2&gt;

&lt;p&gt;Your certificate must cover the exact hostname the client connects to. A cert issued for &lt;code&gt;example.com&lt;/code&gt; does not automatically cover &lt;code&gt;www.example.com&lt;/code&gt; — that requires a Subject Alternative Name (SAN) entry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnose it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl s_client &lt;span class="nt"&gt;-connect&lt;/span&gt; yoursite.com:443 &lt;span class="nt"&gt;-servername&lt;/span&gt; yoursite.com &amp;lt;/dev/null 2&amp;gt;/dev/null | openssl x509 &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="nt"&gt;-text&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A1&lt;/span&gt; &lt;span class="s2"&gt;"Subject Alternative Name"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;X509v3 Subject Alternative Name:
    DNS:example.com, DNS:www.example.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the hostname your users hit is not listed, you need to reissue the certificate with the correct SANs — or use a wildcard cert (&lt;code&gt;*.example.com&lt;/code&gt;). Wildcard certs cover one level of subdomains only; &lt;code&gt;*.example.com&lt;/code&gt; matches &lt;code&gt;api.example.com&lt;/code&gt; but not &lt;code&gt;v2.api.example.com&lt;/code&gt;.&lt;/p&gt;
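&lt;p&gt;The one-level wildcard rule is easy to get wrong, so here is a small Python sketch of it. It is a simplification for illustration only: real clients apply further rules (internationalized names, IP address SANs, partial-label wildcards), and the function names &lt;code&gt;san_matches&lt;/code&gt; and &lt;code&gt;covered&lt;/code&gt; are hypothetical:&lt;/p&gt;

```python
def san_matches(hostname, san):
    """Return True if one SAN entry covers the hostname.

    Simplified rule: labels are compared one by one, and a leading "*"
    stands in for exactly one label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    # A wildcard never spans more than one label, so the counts must agree.
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def covered(hostname, sans):
    """Return True if any SAN entry in the list covers the hostname."""
    return any(san_matches(hostname, san) for san in sans)
```

&lt;p&gt;With this rule, &lt;code&gt;covered("api.example.com", ["*.example.com"])&lt;/code&gt; is true, while &lt;code&gt;v2.api.example.com&lt;/code&gt; and the bare &lt;code&gt;example.com&lt;/code&gt; are both rejected for the same SAN list.&lt;/p&gt;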

&lt;p&gt;A common mistake: deploying behind a load balancer or CDN and forgetting that the cert on the edge must match the public hostname, not the origin hostname.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 3 — Incomplete chain or self-signed certificate
&lt;/h2&gt;

&lt;p&gt;Browsers verify certificates by walking the chain from your server cert through intermediate CAs to a trusted root. If an intermediate is missing, the chain breaks and Chrome shows &lt;code&gt;NET::ERR_CERT_AUTHORITY_INVALID&lt;/code&gt; (Firefox reports &lt;code&gt;SEC_ERROR_UNKNOWN_ISSUER&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnose the chain:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl s_client &lt;span class="nt"&gt;-connect&lt;/span&gt; yoursite.com:443 &lt;span class="nt"&gt;-servername&lt;/span&gt; yoursite.com &lt;span class="nt"&gt;-showcerts&lt;/span&gt; &amp;lt;/dev/null 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"s:"&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A healthy chain shows your cert followed by one or two intermediates. Many servers omit the root itself, since clients already have it in their trust store; this example includes it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;s:CN = yoursite.com
s:CN = R11, O = Let's Encrypt
s:CN = ISRG Root X1, O = Internet Security Research Group
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see only your cert with no intermediates, your server is not sending the full chain. Fix it by concatenating the intermediate cert(s) with your server cert. For Nginx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;server.crt intermediate.crt &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bundle.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reference &lt;code&gt;bundle.crt&lt;/code&gt; in your Nginx config's &lt;code&gt;ssl_certificate&lt;/code&gt; directive and reload.&lt;/p&gt;
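&lt;p&gt;Before reloading, it can be worth sanity-checking the bundle. The Python sketch below counts the certificate blocks in a PEM file and flags the classic concatenation mistake where a missing trailing newline glues an &lt;code&gt;END&lt;/code&gt; marker to the next &lt;code&gt;BEGIN&lt;/code&gt; marker. The function name &lt;code&gt;count_pem_certificates&lt;/code&gt; is hypothetical, and the check only looks at markers; it does not validate signatures or ordering:&lt;/p&gt;

```python
BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def count_pem_certificates(pem_text):
    """Return the number of certificate blocks found in PEM text."""
    if END + BEGIN in pem_text:
        # Happens when a file had no trailing newline before `cat` joined it.
        raise ValueError("END and BEGIN markers share a line; add a newline")
    begins = pem_text.count(BEGIN)
    if begins != pem_text.count(END):
        raise ValueError("unbalanced BEGIN/END CERTIFICATE markers")
    return begins
```

&lt;p&gt;For a server cert plus one intermediate, reading &lt;code&gt;bundle.crt&lt;/code&gt; and passing its contents to this function should return 2. A result of 1 means the intermediate never made it into the file.&lt;/p&gt;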

&lt;p&gt;&lt;strong&gt;Self-signed certificates&lt;/strong&gt; fail on public-facing sites because they are not issued by a trusted CA. Replace them with a cert from Let's Encrypt (free) or any recognized CA. Self-signed certs are fine for internal services — but add them to your internal trust store explicitly rather than telling users to "click through the warning."&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 4 — Mixed content and HSTS issues
&lt;/h2&gt;

&lt;p&gt;Mixed content errors happen when an HTTPS page loads a resource (image, script, stylesheet) over plain HTTP. Modern browsers block mixed active content (scripts, iframes) entirely and show a broken padlock for mixed passive content (images).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find mixed content&lt;/strong&gt; using your browser's developer console (&lt;code&gt;F12&lt;/code&gt; → Console tab). The browser logs every blocked resource with its URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix it&lt;/strong&gt; by updating hardcoded &lt;code&gt;http://&lt;/code&gt; URLs to &lt;code&gt;https://&lt;/code&gt; or using protocol-relative paths. If you use a CMS, update the site URL in settings.&lt;/p&gt;
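&lt;p&gt;For a large site, clicking through pages with the console open does not scale. As a rough sketch, the Python standard library's &lt;code&gt;html.parser&lt;/code&gt; can list resource references that still use &lt;code&gt;http://&lt;/code&gt;. The class and function names here are hypothetical, the tag list is an assumption you should adjust, and the scan only sees static markup, not URLs built by JavaScript or CSS:&lt;/p&gt;

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collect src/href values on resource tags that start with http://."""

    # Tags that load a subresource; plain anchor links are ignored on purpose.
    RESOURCE_TAGS = {"img", "script", "link", "iframe", "audio", "video", "source"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html_text):
    """Return a list of (tag, url) pairs for insecure resource references."""
    finder = MixedContentFinder()
    finder.feed(html_text)
    return finder.insecure
```

&lt;p&gt;Feed it the rendered HTML of each page and treat a non-empty result as a list of URLs to update. Note that &lt;code&gt;link&lt;/code&gt; tags are included because they load stylesheets, so a canonical link pointing at &lt;code&gt;http://&lt;/code&gt; will also show up.&lt;/p&gt;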

&lt;p&gt;HSTS (HTTP Strict Transport Security) adds another layer: once a browser has seen an HSTS header for a host, it refuses to connect to that host over plain HTTP and treats certificate errors as fatal. If you deploy a broken cert while HSTS is active, users cannot click through the warning. The only fix is deploying a valid cert. You can inspect cached HSTS policies in Chrome at &lt;code&gt;chrome://net-internals/#hsts&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 5 — Client-side false positives
&lt;/h2&gt;

&lt;p&gt;Not every SSL error is a server problem. Three common client-side causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System clock skew.&lt;/strong&gt; Certificates are time-sensitive. If a laptop's clock is set to 2024, a cert valid from 2026 appears "not yet valid." Fix: enable automatic time sync in the OS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Antivirus TLS inspection.&lt;/strong&gt; Some antivirus software intercepts HTTPS connections by inserting its own root certificate. If the AV root is not trusted by the browser — or if the AV botches the re-encryption — the browser shows an SSL error. Temporarily disabling the AV's "web shield" or "HTTPS scanning" confirms this as the cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corporate proxy.&lt;/strong&gt; Transparent HTTPS proxies (common in enterprise networks) perform the same kind of TLS interception. The corporate root CA must be installed on the client machine. If it is not, every HTTPS site shows a certificate warning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real scenarios, not edge cases. If users report SSL errors that you cannot reproduce, ask about their local environment first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevent SSL errors with certificate monitoring
&lt;/h2&gt;

&lt;p&gt;You have seen how certificates break: expiry, hostname mismatches, incomplete chains, mixed content, client-side false positives. Every one of these failures is predictable. Certificates do not expire by surprise — they have a fixed lifetime printed right in the X.509 data. The problem is never that the failure was unknowable. The problem is that nobody was watching. And when an SSL failure does hit production, the &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;incident severity&lt;/a&gt; depends on how much of your traffic is affected — which comes back to whether you caught it in one region or all of them.&lt;/p&gt;

&lt;p&gt;The fix-then-forget cycle is the real trap. You renew the cert, confirm it works, and move on. Ninety days later, the same &lt;code&gt;NET::ERR_CERT_DATE_INVALID&lt;/code&gt; reappears because the auto-renewal cron broke silently two weeks ago and nobody noticed until a customer opened a support ticket at 2 AM. That diagnostic loop is exactly what a &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbook&lt;/a&gt; prevents — and measuring how long it takes is what &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; is for.&lt;/p&gt;

&lt;p&gt;Here is how to build a monitor that catches every failure mode we covered — before your users do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up two expiry thresholds, not one
&lt;/h3&gt;

&lt;p&gt;Most monitoring setups check whether the certificate expires within some number of days and call it done. That is not enough. You need two thresholds: an early warning and a hard deadline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/docs/faq/" rel="noopener noreferrer"&gt;Let's Encrypt&lt;/a&gt; certificates last 90 days. If auto-renewal is working, you will never think about expiry. But if it breaks — a DNS validation failure, a misconfigured certbot hook, a container rebuild that lost the renewal cron — you want to know at 30 days remaining, not the day it expires. A &lt;code&gt;WARN&lt;/code&gt;-severity assertion at 30 days gives your team two full weeks to investigate and fix the renewal pipeline without any urgency. A &lt;code&gt;FAIL&lt;/code&gt;-severity assertion at 14 days is the hard deadline: drop everything and renew manually, because you are two weeks from a full outage.&lt;/p&gt;
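&lt;p&gt;The two-threshold logic can be sketched in a few lines of Python. This is an illustration of the idea rather than DevHelm's implementation: the function name &lt;code&gt;expiry_severity&lt;/code&gt; is hypothetical, and it assumes a threshold passes while the remaining days are at or above it:&lt;/p&gt;

```python
WARN_DAYS = 30  # early warning: investigate the renewal pipeline
FAIL_DAYS = 14  # hard deadline: renew manually


def expiry_severity(days_remaining):
    """Map days until expiry to OK, WARN, or FAIL.

    Assumption: a threshold is satisfied when days_remaining is at or
    above it, so exactly 30 days left is still OK.
    """
    if days_remaining >= WARN_DAYS:
        return "OK"
    if days_remaining >= FAIL_DAYS:
        return "WARN"
    return "FAIL"
```

&lt;p&gt;With these values, a cert with 29 days left reports &lt;code&gt;WARN&lt;/code&gt;, one with 13 days left reports &lt;code&gt;FAIL&lt;/code&gt;, and an already expired cert (negative days) also reports &lt;code&gt;FAIL&lt;/code&gt;.&lt;/p&gt;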

&lt;h3&gt;
  
  
  Validate the endpoint, not just the certificate
&lt;/h3&gt;

&lt;p&gt;A valid certificate does not mean your site works. The cert could be fine while your origin returns 502s, or while a misconfigured cipher suite causes 15-second handshakes that make the page feel broken. Adding a status code check (&lt;code&gt;expected: 200&lt;/code&gt;) and a response time threshold (&lt;code&gt;thresholdMs: 2000&lt;/code&gt;) catches the class of problems where TLS technically succeeds but the user experience is degraded. Slow TLS handshakes often point to missing OCSP stapling, oversized certificate chains, or a server negotiating an expensive cipher when a faster one is available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor from multiple regions
&lt;/h3&gt;

&lt;p&gt;This is the one most teams skip, and it is the one that bites hardest. If you run behind a CDN — &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;, &lt;a href="https://devhelm.io/status/aws" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; CloudFront, Fastly — your certificates are managed per edge location. A cert that is perfectly valid on the &lt;code&gt;us-east&lt;/code&gt; edge node might already be expired on an &lt;code&gt;ap-south&lt;/code&gt; node because the edge cert rotation did not propagate uniformly. Checking from a single region gives you a false sense of security. Checking from &lt;code&gt;us-east&lt;/code&gt;, &lt;code&gt;eu-west&lt;/code&gt;, and &lt;code&gt;ap-south&lt;/code&gt; catches regional cert failures before the affected users report them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pick the right check frequency
&lt;/h3&gt;

&lt;p&gt;Certificates change slowly. Unlike an API endpoint that might go down and recover in seconds, certificate state transitions happen about once every 90 days (less often for longer-lived certificates from paid CAs). Running SSL checks every 30 seconds is wasteful: you are burning check quota on a signal that changes a handful of times per year. A 5-minute interval (&lt;code&gt;frequencySeconds: 300&lt;/code&gt;) gives you more than enough visibility. If a cert expires, you will know within 5 minutes. The trade-off is worth it: save the high-frequency checks for your API health endpoints where seconds matter.&lt;/p&gt;
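&lt;p&gt;To put numbers on that trade-off, here is the arithmetic as a short Python sketch. The function name &lt;code&gt;checks_per_day&lt;/code&gt; is hypothetical, and it assumes each region runs its own check on every interval; how a given plan actually counts quota may differ:&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checks_per_day(frequency_seconds, regions=1):
    """Return how many checks run per day across all regions."""
    return (SECONDS_PER_DAY // frequency_seconds) * regions


# Three regions at 30-second and 5-minute intervals.
every_30s = checks_per_day(30, regions=3)
every_5m = checks_per_day(300, regions=3)
print(every_30s, every_5m)  # prints: 8640 864
```

&lt;p&gt;Under that assumption, moving from a 30-second to a 5-minute interval cuts the daily check count by a factor of ten while still detecting an expired cert within five minutes.&lt;/p&gt;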

&lt;h3&gt;
  
  
  The full config
&lt;/h3&gt;

&lt;p&gt;Here is a DevHelm monitor that covers everything above — expiry thresholds, endpoint validation, multi-region checks, and a sensible frequency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production-ssl-health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTP"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"frequencySeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"regions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-west"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ap-south"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://yourapp.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assertions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"equals"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"response_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"thresholdMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ssl_expiry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minDaysRemaining"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ssl_expiry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minDaysRemaining"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first two assertions confirm the endpoint is healthy and responsive. The third fires a critical alert at 14 days before expiry — your hard deadline. The fourth fires a warning at 30 days — your early warning that gives you time to fix the renewal pipeline without scrambling.&lt;/p&gt;

&lt;p&gt;You can create this monitor from the CLI in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;devhelm monitor create &lt;span class="nt"&gt;--type&lt;/span&gt; http
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or configure it through the dashboard. 50 monitors free, no credit card required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;Start monitoring free&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/what-ssl-error-means-and-how-to-fix-it" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
    </item>
    <item>
      <title>How to Fix Slow DNS Lookup: A Complete Troubleshooting Guide</title>
      <dc:creator>DevHelm</dc:creator>
      <pubDate>Sun, 17 May 2026 14:22:07 +0000</pubDate>
      <link>https://dev.to/devhelm/how-to-fix-slow-dns-lookup-a-complete-troubleshooting-guide-4ono</link>
      <guid>https://dev.to/devhelm/how-to-fix-slow-dns-lookup-a-complete-troubleshooting-guide-4ono</guid>
      <description>&lt;p&gt;Every connection your application makes starts with a DNS lookup. When that lookup is slow — or fails entirely — the symptoms range from vague latency increases to hard-down pages that return &lt;code&gt;ERR_NAME_NOT_RESOLVED&lt;/code&gt;. This guide walks through how to fix slow DNS lookup issues, diagnose two of the most common DNS errors (&lt;code&gt;DNS_PROBE_FINISHED_NXDOMAIN&lt;/code&gt; and "DNS server not responding"), and set up monitoring so these problems never wake you up at 3 AM again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DNS lookups slow down
&lt;/h2&gt;

&lt;p&gt;A DNS lookup traverses multiple layers before returning an IP address. Your stub resolver asks a recursive resolver, which on a cold cache queries root nameservers, then TLD nameservers, then the authoritative nameserver for the domain. Each hop adds latency. In the best case (an answer already in your machine's local cache) resolution takes under 1 ms, and a warm cache hit on a nearby recursive resolver costs a single network round trip. In the worst case (a cold cache, long CNAME chains, DNSSEC validation, and an authoritative server on another continent) it can exceed 500 ms.&lt;/p&gt;

&lt;p&gt;The most common causes of slow DNS resolution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overloaded or distant ISP resolvers.&lt;/strong&gt; ISP DNS servers are shared infrastructure. During peak hours, query times spike from 20 ms to 200 ms or more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low TTL values.&lt;/strong&gt; A TTL of 60 seconds means every cache expires every minute, forcing full recursive lookups. TTLs under 300 seconds are a common source of unnecessary latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNAME chains.&lt;/strong&gt; Each CNAME adds an extra lookup. A domain with three CNAME hops requires four total resolutions before returning an A record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6 fallback.&lt;/strong&gt; When a system queries for AAAA records first and the authoritative server is slow to respond (or drops AAAA queries instead of answering them), the client waits for a timeout before falling back to A records, which can add 2–5 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPN and split-tunnel DNS conflicts.&lt;/strong&gt; Corporate VPNs often route DNS traffic through a tunnel to an internal resolver, adding 50–150 ms of round-trip latency that doesn't exist when the VPN is off.&lt;/li&gt;
&lt;/ul&gt;
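&lt;p&gt;To check how long a CNAME chain actually is, one option is to count the alias lines in &lt;code&gt;dig +short&lt;/code&gt; output, which prints each CNAME target with a trailing dot before the final addresses. The sketch below assumes that output format; the function name &lt;code&gt;count_cname_hops&lt;/code&gt; is hypothetical and not part of any tool.&lt;/p&gt;

```shell
# count_cname_hops: hypothetical helper, not part of dig.
# Reads `dig +short NAME` output on stdin and counts lines that end in a dot,
# which is how dig prints CNAME targets (addresses have no trailing dot).
count_cname_hops() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c '\.$' || true
}

# Usage (needs network access):
#   dig +short www.example.com | count_cname_hops
```

&lt;p&gt;A result of 0 means the name points straight at an address record. Anything above 2 is probably worth a closer look.&lt;/p&gt;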

&lt;h2&gt;
  
  
  Measure first — what "slow" actually means
&lt;/h2&gt;

&lt;p&gt;Before changing anything, measure your current DNS performance. The &lt;code&gt;dig&lt;/code&gt; command (Linux/macOS) and &lt;code&gt;nslookup&lt;/code&gt; (Windows) are the standard diagnostic tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure with &lt;code&gt;dig&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig devhelm.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output you care about is at the bottom:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; ANSWER SECTION:
&lt;span class="go"&gt;devhelm.io.          300     IN      A       143.198.168.42

&lt;/span&gt;&lt;span class="gp"&gt;;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; Query &lt;span class="nb"&gt;time&lt;/span&gt;: 24 msec
&lt;span class="gp"&gt;;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; SERVER: 1.1.1.1#53&lt;span class="o"&gt;(&lt;/span&gt;1.1.1.1&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;UDP&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; WHEN: Mon May 11 14:32:07 UTC 2026
&lt;span class="gp"&gt;;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; MSG SIZE  rcvd: 56
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Query time&lt;/code&gt; line is what matters. Here is a reference table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query time&lt;/th&gt;
&lt;th&gt;Rating&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 15 ms&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;No action needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15–50 ms&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Acceptable for most workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50–100 ms&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Switch resolver or investigate upstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100+ ms&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;Immediate action required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
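&lt;p&gt;If you want to apply the table in a script, a small function can map a measured time to a rating. This is a sketch, not a standard tool: the name &lt;code&gt;rate_query_time&lt;/code&gt; is hypothetical, and because the table's ranges share endpoints, it assumes boundary values (15, 50, 100) fall into the slower bucket.&lt;/p&gt;

```shell
# rate_query_time: hypothetical helper that maps a query time in whole
# milliseconds to the ratings in the table above.
# Assumption: boundary values (15, 50, 100) belong to the slower bucket.
rate_query_time() {
  ms=$1
  if [ "$ms" -lt 15 ]; then
    echo "Excellent"
  elif [ "$ms" -lt 50 ]; then
    echo "Good"
  elif [ "$ms" -lt 100 ]; then
    echo "Poor"
  else
    echo "Critical"
  fi
}

# Usage (needs network access); dig prints a line like ";; Query time: 24 msec":
#   rate_query_time "$(dig devhelm.io | awk '/Query time/ {print $4}')"
```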

&lt;p&gt;&lt;strong&gt;Compare resolvers directly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig @1.1.1.1 devhelm.io | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Query time"&lt;/span&gt;
dig @8.8.8.8 devhelm.io | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Query time"&lt;/span&gt;
dig @9.9.9.9 devhelm.io | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Query time"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your default resolver is 3–5x slower than Cloudflare (1.1.1.1) or Google (8.8.8.8), that is the first thing to fix.&lt;/p&gt;
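&lt;p&gt;To compare more than a handful of resolvers, you can collect one "resolver, milliseconds" pair per line and let &lt;code&gt;sort&lt;/code&gt; pick the lowest. The helper name &lt;code&gt;fastest_resolver&lt;/code&gt; and the input format are assumptions made for this sketch.&lt;/p&gt;

```shell
# fastest_resolver: hypothetical helper.
# Reads "RESOLVER MILLISECONDS" lines on stdin and prints the resolver
# with the lowest time.
fastest_resolver() {
  sort -k2,2n | head -n 1 | awk '{print $1}'
}

# Usage (needs network access): one sample per resolver
#   for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
#     echo "$r $(dig @"$r" devhelm.io | awk '/Query time/ {print $4}')"
#   done | fastest_resolver
```

&lt;p&gt;A single sample is noisy, since the first query may miss the resolver's cache. Repeating the loop a few times gives a fairer comparison.&lt;/p&gt;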

&lt;p&gt;&lt;strong&gt;Measure with &lt;code&gt;nslookup&lt;/code&gt; on Windows:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;nslookup devhelm.io
Server:  resolver1.isp.net
Address:  192.168.1.1

Non-authoritative answer:
Name:    devhelm.io
Address:  143.198.168.42
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;nslookup&lt;/code&gt; does not show query time directly. For timing on Windows, use PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Measure-Command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Resolve-DnsName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;devhelm.io&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Select-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;TotalMilliseconds&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix slow DNS lookup on your machine
&lt;/h2&gt;

&lt;p&gt;These fixes address the most common causes of slow resolution, in order of impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Flush your local DNS cache
&lt;/h3&gt;

&lt;p&gt;Stale or corrupted cache entries can cause lookups to hang or return wrong results. Flush first, then re-test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dscacheutil &lt;span class="nt"&gt;-flushcache&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;killall &lt;span class="nt"&gt;-HUP&lt;/span&gt; mDNSResponder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Linux (systemd-resolved):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;resolvectl flush-caches
resolvectl statistics | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Current Cache Size"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ipconfig /flushdns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Switch to a faster public resolver
&lt;/h3&gt;

&lt;p&gt;If your ISP resolver is slow, change to Cloudflare (1.1.1.1), Google (8.8.8.8), or Quad9 (9.9.9.9). These resolvers run on &lt;a href="https://developers.cloudflare.com/1.1.1.1/" rel="noopener noreferrer"&gt;global anycast networks&lt;/a&gt; and typically answer cached queries in under 15 ms from most locations. Confirm the improvement with the &lt;code&gt;dig&lt;/code&gt; comparison above rather than assuming it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linux (&lt;code&gt;systemd-resolved&lt;/code&gt;; replace &lt;code&gt;eth0&lt;/code&gt; with your interface name):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;resolvectl dns eth0 1.1.1.1 1.0.0.1
&lt;span class="nb"&gt;sudo &lt;/span&gt;resolvectl dns eth0 &lt;span class="c"&gt;# verify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;macOS (System Settings &amp;gt; Network &amp;gt; DNS, or from the terminal; replace &lt;code&gt;Wi-Fi&lt;/code&gt; with your network service name):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;networksetup &lt;span class="nt"&gt;-setdnsservers&lt;/span&gt; Wi-Fi 1.1.1.1 1.0.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Disable IPv6 DNS if you do not use it
&lt;/h3&gt;

&lt;p&gt;If your network does not have working IPv6 connectivity, AAAA queries add timeout delays to every lookup. Test whether IPv6 is the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig AAAA devhelm.io @1.1.1.1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Query time"&lt;/span&gt;
dig A devhelm.io @1.1.1.1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Query time"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the AAAA query is significantly slower or times out, consider disabling IPv6 resolution on your machine or configuring your resolver to deprioritize AAAA lookups.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Check your VPN's DNS configuration
&lt;/h3&gt;

&lt;p&gt;VPNs commonly override DNS settings, routing queries through the tunnel. If DNS is slow only when connected to a VPN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/resolv.conf   &lt;span class="c"&gt;# Linux: check which DNS server is active&lt;/span&gt;
scutil &lt;span class="nt"&gt;--dns&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt; &lt;span class="c"&gt;# macOS: check DNS configuration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the resolver points to a VPN-provided address (e.g., 10.x.x.x), configure split-tunnel DNS so that only internal domains route through the VPN resolver.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix DNS_PROBE_FINISHED_NXDOMAIN
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;DNS_PROBE_FINISHED_NXDOMAIN&lt;/code&gt; means the DNS resolver returned an &lt;strong&gt;NXDOMAIN&lt;/strong&gt; response — the domain does not exist in DNS. Chrome, Edge, and Brave all surface this as an error page. The domain either genuinely does not exist, or something between your machine and the authoritative nameserver is blocking or misconfiguring the lookup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis, in order:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Verify the domain is correct.&lt;/strong&gt; Typos are one of the most common causes of NXDOMAIN errors. Check for swapped letters, missing hyphens, and wrong TLDs (.com vs .io vs .dev).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test from multiple resolvers.&lt;/strong&gt; If your default resolver returns NXDOMAIN but a public resolver resolves the domain, your resolver has stale or filtered data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig example.com @1.1.1.1
dig example.com @8.8.8.8
dig example.com @&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/^nameserver/ {print $2; exit}'&lt;/span&gt; /etc/resolv.conf&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



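&lt;p&gt;&lt;strong&gt;3. Check the authoritative nameserver directly.&lt;/strong&gt; This confirms whether the domain's NS records are configured correctly at the registrar. Look up the NS records first, then query one of the returned nameservers (&lt;code&gt;ns1.registrar.com&lt;/code&gt; below is a placeholder for it):&lt;br&gt;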
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig NS example.com @1.1.1.1
dig example.com @ns1.registrar.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the authoritative server itself returns NXDOMAIN, the domain's DNS zone is misconfigured or the domain has expired. Check with your registrar.&lt;/p&gt;
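&lt;p&gt;When you run this check against several servers, it helps to pull out just the response code instead of reading the full output each time. The sketch below assumes dig's usual header line (&lt;code&gt;status: NXDOMAIN&lt;/code&gt; and so on); the function name &lt;code&gt;dig_status&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# dig_status: hypothetical helper.
# Reads full dig output on stdin and prints the response code from the header
# line, for example NOERROR, NXDOMAIN, or SERVFAIL.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Usage (needs network access; ns1.registrar.com is the placeholder used above):
#   dig example.com @ns1.registrar.com | dig_status
```

&lt;p&gt;If the function prints nothing, dig received no reply at all, which is the "DNS server not responding" case covered in the next section.&lt;/p&gt;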

&lt;p&gt;&lt;strong&gt;4. Flush DNS and restart the DNS client.&lt;/strong&gt; A cached NXDOMAIN response (negative caching, per &lt;a href="https://datatracker.ietf.org/doc/html/rfc2308" rel="noopener noreferrer"&gt;RFC 2308&lt;/a&gt;) can persist for the SOA minimum TTL, which defaults to hours on some zones.&lt;/p&gt;
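&lt;p&gt;To estimate how long a cached NXDOMAIN can linger, you can read the zone's SOA record: under RFC 2308 the negative-caching TTL is the smaller of the SOA record's own TTL and its MINIMUM field. A minimal sketch, assuming the answer-line layout printed by &lt;code&gt;dig +noall +answer SOA&lt;/code&gt;; the name &lt;code&gt;negative_cache_ttl&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# negative_cache_ttl: hypothetical helper.
# Reads SOA answer lines on stdin, as printed by `dig +noall +answer SOA ZONE`:
#   NAME TTL IN SOA MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM
# Prints the negative-caching TTL in seconds: the smaller of the record's TTL
# (field 2) and the SOA MINIMUM (last field), per RFC 2308.
negative_cache_ttl() {
  awk '$4 == "SOA" {
    if ($2 + 0 > $NF + 0) print $NF; else print $2
    exit
  }'
}

# Usage (needs network access):
#   dig +noall +answer SOA example.com | negative_cache_ttl
```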

&lt;p&gt;&lt;strong&gt;5. Check your &lt;code&gt;hosts&lt;/code&gt; file.&lt;/strong&gt; A local override in &lt;code&gt;/etc/hosts&lt;/code&gt; (Linux/macOS) or &lt;code&gt;C:\Windows\System32\drivers\etc\hosts&lt;/code&gt; (Windows) can shadow DNS entirely. Remove any stale entries for the domain.&lt;/p&gt;
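&lt;p&gt;On Linux or macOS, a short function can list the active hosts-file entries for a name so you do not have to scan the file by eye. This is a sketch; &lt;code&gt;hosts_overrides&lt;/code&gt; is a hypothetical name, and it matches the hostname as a whole field while skipping commented-out lines.&lt;/p&gt;

```shell
# hosts_overrides: hypothetical helper.
# Prints active (non-comment) lines in a hosts file that list the given
# hostname as a whole field.
# Arguments: HOSTNAME [HOSTS_FILE]  (HOSTS_FILE defaults to /etc/hosts)
hosts_overrides() {
  awk -v host="$1" '
    /^[[:space:]]*#/ { next }          # skip lines that are entirely comments
    {
      for (i = 2; NF >= i; i++) {
        if ($i ~ /^#/) break           # stop at a trailing comment
        if ($i == host) { print; break }
      }
    }
  ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_overrides example.com
```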

&lt;p&gt;&lt;strong&gt;6. Check Chrome's secure DNS setting.&lt;/strong&gt; Chrome aggressively prefetches DNS for links on a page, and when "Use secure DNS" is enabled it can send those lookups to a different resolver than your system default. If that resolver cannot see the domain (an internal-only hostname, for example), Chrome reports NXDOMAIN even though other tools resolve the name. Navigate to &lt;code&gt;chrome://settings/security&lt;/code&gt;, find the "Use secure DNS" setting, and make sure it matches your intended resolver, or turn it off temporarily to test.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix DNS server not responding
&lt;/h2&gt;

&lt;p&gt;"DNS server not responding" means your machine sent a DNS query and received no reply at all — not even an error. This is different from NXDOMAIN (which is a valid response saying "this domain does not exist"). No response means the resolver itself is unreachable or unresponsive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Systematic diagnosis:&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Confirm basic connectivity
&lt;/h3&gt;

&lt;p&gt;Separate "network is down" from "DNS is down":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping &lt;span class="nt"&gt;-c&lt;/span&gt; 3 1.1.1.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If ping fails, the problem is your network connection, not DNS. Check cables, Wi-Fi, and router.&lt;/p&gt;

&lt;p&gt;If ping succeeds, your network is fine but DNS is specifically broken. Continue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Test the DNS port directly
&lt;/h3&gt;

&lt;p&gt;DNS uses UDP port 53 (and TCP 53 for large responses). Test whether your resolver is accepting connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig @1.1.1.1 devhelm.io +tcp +timeout&lt;span class="o"&gt;=&lt;/span&gt;5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this works but normal queries fail, something is blocking UDP port 53 — commonly a firewall, router ACL, or ISP filter.&lt;/p&gt;
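&lt;p&gt;The UDP-versus-TCP comparison can be scripted from dig's exit codes, which are 0 when a reply arrives and 9 when no server replies. The function below is a sketch with a hypothetical name, &lt;code&gt;diagnose_transport&lt;/code&gt;; it only interprets two exit codes you pass in.&lt;/p&gt;

```shell
# diagnose_transport: hypothetical helper.
# Takes the exit codes of a UDP dig and a TCP dig against the same resolver
# and prints a one-line interpretation.
# dig exits 0 when it received a reply and 9 when no server replied.
diagnose_transport() {
  udp_rc=$1
  tcp_rc=$2
  if [ "$udp_rc" -eq 0 ]; then
    echo "UDP 53 reachable"
  elif [ "$tcp_rc" -eq 0 ]; then
    echo "UDP 53 blocked, TCP 53 reachable"
  else
    echo "resolver unreachable on both UDP and TCP"
  fi
}

# Usage (needs network access):
#   dig @1.1.1.1 devhelm.io +notcp +timeout=5 >/dev/null; udp=$?
#   dig @1.1.1.1 devhelm.io +tcp +timeout=5 >/dev/null; tcp=$?
#   diagnose_transport "$udp" "$tcp"
```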

&lt;h3&gt;
  
  
  Step 3: Check your router
&lt;/h3&gt;

&lt;p&gt;Home and office routers often run a local DNS forwarder. If the router's DNS process crashes or its upstream configuration is wrong, all devices on the network lose DNS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access your router admin panel (typically 192.168.1.1)&lt;/li&gt;
&lt;li&gt;Check the configured upstream DNS servers&lt;/li&gt;
&lt;li&gt;Try setting them to 1.1.1.1 and 8.8.8.8 as primary and secondary&lt;/li&gt;
&lt;li&gt;Reboot the router&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Check for firewall or security software blocking DNS
&lt;/h3&gt;

&lt;p&gt;Firewalls (especially on corporate networks), antivirus software, and parental control tools sometimes intercept or block DNS traffic. Temporarily disable them one at a time to isolate the cause. On Linux, you can also list the firewall rules that mention port 53:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Try DNS over HTTPS (DoH)
&lt;/h3&gt;

&lt;p&gt;If your ISP is throttling or intercepting standard DNS (UDP/TCP port 53), &lt;a href="https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/" rel="noopener noreferrer"&gt;DNS over HTTPS&lt;/a&gt; bypasses the interception by sending queries over HTTPS on port 443:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firefox:&lt;/strong&gt; Settings &amp;gt; Privacy &amp;amp; Security &amp;gt; Enable DNS over HTTPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome:&lt;/strong&gt; Settings &amp;gt; Security &amp;gt; Use secure DNS &amp;gt; Select Cloudflare or Google&lt;/li&gt;
&lt;li&gt;
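&lt;strong&gt;System-wide (Linux):&lt;/strong&gt; Configure &lt;code&gt;systemd-resolved&lt;/code&gt; with &lt;code&gt;DNSOverTLS=yes&lt;/code&gt;. This uses DNS over TLS (port 853) rather than DoH, but it likewise encrypts queries so they cannot be intercepted on port 53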
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When the problem is upstream
&lt;/h2&gt;

&lt;p&gt;Sometimes slow DNS is outside your control. Before blaming your resolver or network, check whether the problem is upstream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authoritative nameserver issues.&lt;/strong&gt; The domain owner's nameserver may be slow or misconfigured. Test with &lt;code&gt;dig +trace example.com&lt;/code&gt; to see exactly where in the resolution chain the delay occurs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDN misrouting.&lt;/strong&gt; CDNs like &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; and &lt;a href="https://devhelm.io/status/aws" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; CloudFront use DNS-based geographic routing. If your resolver's IP geolocation is wrong, you may be routed to a distant edge node. This is common with VPNs and small ISP resolvers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registrar glue record problems.&lt;/strong&gt; If a domain's nameservers are under the same domain (e.g., &lt;code&gt;ns1.example.com&lt;/code&gt; for &lt;code&gt;example.com&lt;/code&gt;), the registrar must provide &lt;a href="https://datatracker.ietf.org/doc/html/rfc1034#section-4.2.1" rel="noopener noreferrer"&gt;glue records&lt;/a&gt;, which are the address records for the nameservers themselves. Missing glue records create a circular dependency that manifests as timeouts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise split-horizon DNS.&lt;/strong&gt; In corporate environments, internal and external DNS zones overlap. A query for &lt;code&gt;api.company.com&lt;/code&gt; might resolve to an internal IP on VPN and a public IP off VPN — or fail entirely if the split-horizon configuration has gaps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prevent DNS failures with monitoring
&lt;/h2&gt;

&lt;p&gt;Everything you have done so far in this guide — flushing caches, switching resolvers, tracing NXDOMAIN responses, checking firewall rules — is reactive. You noticed a problem, diagnosed it, and fixed it. That reactive investigation is exactly the kind of work a &lt;a href="https://devhelm.io/blog/runbooks" rel="noopener noreferrer"&gt;runbook&lt;/a&gt; codifies so the next engineer doesn't repeat it from scratch. But the next DNS failure will not look like this one. An A record vanishes because someone fat-fingers a Terraform apply. A TTL gets dropped to 30 seconds during a migration and never gets reverted. Resolution times creep from 20 ms to 150 ms over three weeks because an upstream nameserver is quietly degrading. None of these announce themselves. They just erode your reliability until a user files a ticket or your on-call phone rings at 3 AM — and your &lt;a href="https://devhelm.io/blog/mttr-full-form" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; climbs because the failure mode was unfamiliar.&lt;/p&gt;

&lt;p&gt;A single "is DNS working?" check does not cover this. What you need is a layered set of assertions that catches the different ways DNS silently breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Does it resolve at all?
&lt;/h3&gt;

&lt;p&gt;The most fundamental check. A &lt;code&gt;dns_resolves&lt;/code&gt; assertion confirms that your domain actually returns records — that the A record exists, that the AAAA record exists, that the response is not NXDOMAIN or SERVFAIL. If your A record disappears because of a zone file mistake or a registrar lapse, you find out in five minutes instead of five hours when customers start reporting a blank page.&lt;/p&gt;

&lt;p&gt;Check both A and AAAA record types. Even if your application is IPv4-only, a broken AAAA record causes timeout-based fallback delays on clients that try IPv6 first — the exact problem covered in the IPv6 section above. Monitoring both means you catch issues on either path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Does it resolve fast enough?
&lt;/h3&gt;

&lt;p&gt;DNS that technically resolves but takes 200 ms adds 200 ms to every single page load, every API call, every webhook delivery. This latency is invisible in dashboards that only track HTTP response time because the DNS overhead happens before the connection even opens.&lt;/p&gt;

&lt;p&gt;Two thresholds give you the coverage you need. A hard failure assertion (&lt;code&gt;dns_response_time&lt;/code&gt; with a &lt;code&gt;maxMs&lt;/code&gt; of 100) fires when resolution exceeds a critical ceiling — something is actively broken, whether that is an overloaded resolver, a network path change, or an authoritative server on another continent. A softer warning assertion (&lt;code&gt;dns_response_time_warn&lt;/code&gt; with a &lt;code&gt;warnMs&lt;/code&gt; of 50) fires at a lower threshold so you catch gradual degradation before it compounds into an outage. The warning gives you time to investigate during business hours. The hard failure pages your on-call immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Are the TTLs healthy?
&lt;/h3&gt;

&lt;p&gt;Low TTLs are a silent performance killer, and they show up constantly in the kinds of issues this guide covers. A TTL of 30 seconds means every visitor's browser, every edge server, and every recursive resolver on the planet discards the cached record every half minute and triggers a full recursive lookup. During a migration, it is common practice to temporarily lower TTLs to speed up propagation — and then forget to raise them back afterward.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;dns_ttl_low&lt;/code&gt; assertion with a &lt;code&gt;minTtl&lt;/code&gt; of 300 catches exactly this. If someone — or an automated provisioning tool — drops your TTL below five minutes, you get a warning before the extra lookup load starts inflating resolution times across the board.&lt;/p&gt;
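&lt;p&gt;You can run the same kind of check by hand with dig. The sketch below filters answer lines whose TTL is under a minimum; &lt;code&gt;low_ttl_records&lt;/code&gt; is a hypothetical helper name, and it assumes the column layout of &lt;code&gt;dig +noall +answer&lt;/code&gt;.&lt;/p&gt;

```shell
# low_ttl_records: hypothetical helper.
# Reads answer lines on stdin, as printed by `dig +noall +answer NAME`
# (NAME TTL CLASS TYPE DATA), and prints those whose TTL is below the minimum
# given as the first argument (default: 300 seconds).
low_ttl_records() {
  awk -v min="${1:-300}" 'min + 0 > $2 + 0 { print }'
}

# Usage (needs network access):
#   dig +noall +answer yourapp.com | low_ttl_records 300
```

&lt;p&gt;A recursive resolver reports the remaining cache lifetime, which counts down between refreshes. To see the configured TTL, point dig at one of the zone's authoritative nameservers.&lt;/p&gt;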

&lt;h3&gt;
  
  
  Layer 4: Check from multiple vantage points
&lt;/h3&gt;

&lt;p&gt;DNS is not globally consistent. A record that resolves correctly from a probe in &lt;code&gt;us-east&lt;/code&gt; might be stale, missing, or pointing to the wrong IP in &lt;code&gt;ap-south&lt;/code&gt; because of propagation delays, regional resolver differences, or geo-DNS misconfigurations. If you only check from one region, you are testing your DNS health from one perspective and assuming the rest of the world agrees. It often does not.&lt;/p&gt;

&lt;p&gt;Running checks from at least three regions — &lt;code&gt;us-east&lt;/code&gt;, &lt;code&gt;eu-west&lt;/code&gt;, and &lt;code&gt;ap-south&lt;/code&gt; — ensures your monitoring reflects what your actual users experience rather than what a single datacenter sees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Check against specific nameservers
&lt;/h3&gt;

&lt;p&gt;By default, each probe region uses whatever recursive resolver is locally available. That is usually fine, but it means you can miss issues that are specific to a particular public resolver. Explicitly setting your nameservers to &lt;code&gt;1.1.1.1&lt;/code&gt; and &lt;code&gt;8.8.8.8&lt;/code&gt; — Cloudflare and Google, the two most widely used public resolvers — lets you test resolution from the same infrastructure your users are most likely hitting. If your domain resolves from Google but not Cloudflare (or vice versa), that points to a propagation issue or a resolver-specific caching problem that would otherwise be invisible until someone on the affected resolver reports it.&lt;/p&gt;
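&lt;p&gt;The same comparison is easy to spot-check from a terminal. This sketch compares the answer sets from two resolvers while ignoring order; the function name &lt;code&gt;resolvers_agree&lt;/code&gt; is hypothetical, and it assumes the one-record-per-line output of &lt;code&gt;dig +short&lt;/code&gt;.&lt;/p&gt;

```shell
# resolvers_agree: hypothetical helper.
# Takes two newline-separated answer lists (as printed by
# `dig +short NAME @RESOLVER`) and succeeds only when both contain the same
# records, ignoring order.
resolvers_agree() {
  a=$(printf '%s\n' "$1" | sort)
  b=$(printf '%s\n' "$2" | sort)
  [ "$a" = "$b" ]
}

# Usage (needs network access):
#   if resolvers_agree "$(dig +short yourapp.com @1.1.1.1)" \
#                      "$(dig +short yourapp.com @8.8.8.8)"; then
#     echo "consistent"
#   else
#     echo "mismatch between resolvers"
#   fi
```

&lt;p&gt;A mismatch is not always a fault: domains behind geo-DNS or a CDN can legitimately return different addresses to different resolvers, so treat a difference as a prompt to investigate.&lt;/p&gt;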

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;Here is a complete DNS monitor configuration that implements all five layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production-dns-health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DNS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"frequencySeconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"regions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-west"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ap-south"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yourapp.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recordTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AAAA"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nameservers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1.1.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8.8.8.8"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assertions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dns_resolves"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dns_response_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"maxMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dns_response_time_warn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"warnMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dns_ttl_low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"minTtl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every five minutes, from three continents, this monitor resolves &lt;code&gt;yourapp.com&lt;/code&gt; for both A and AAAA records against Cloudflare's (&lt;code&gt;1.1.1.1&lt;/code&gt;) and Google's (&lt;code&gt;8.8.8.8&lt;/code&gt;) public resolvers. It fails hard if the domain does not resolve at all or if resolution takes longer than 100 ms. It warns if resolution exceeds 50 ms or if the TTL drops below 300 seconds.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;severity: "WARN"&lt;/code&gt; on the TTL and response time warning assertions is deliberate. These are degradation signals, not outage signals — they belong in a dashboard and a Slack channel, not in your PagerDuty rotation. The resolution check and the hard response time ceiling default to error severity, which is what triggers your incident workflow based on your &lt;a href="https://devhelm.io/blog/incident-severity-levels" rel="noopener noreferrer"&gt;severity levels&lt;/a&gt;. The distinction matters: you want to know about creeping latency during business hours, and you want to be woken up for a missing A record.&lt;/p&gt;
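
&lt;p&gt;To make the two-tier behavior concrete, here is a minimal Python sketch of how the four assertions could be evaluated and routed. It is an illustration, not DevHelm's implementation: &lt;code&gt;DnsResult&lt;/code&gt;, &lt;code&gt;violated&lt;/code&gt;, &lt;code&gt;evaluate&lt;/code&gt;, and &lt;code&gt;route&lt;/code&gt; are hypothetical names, and both the &lt;code&gt;"ERROR"&lt;/code&gt; default label and the &lt;code&gt;"page"&lt;/code&gt; / &lt;code&gt;"slack"&lt;/code&gt; destinations are assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass

# Assertions copied from the monitor config above. "ERROR" as the default
# severity label is an assumption; the article only says "error severity".
ASSERTIONS = [
    {"config": {"type": "dns_resolves"}},
    {"config": {"type": "dns_response_time", "maxMs": 100}},
    {"config": {"type": "dns_response_time_warn", "warnMs": 50}, "severity": "WARN"},
    {"config": {"type": "dns_ttl_low", "minTtl": 300}, "severity": "WARN"},
]


@dataclass
class DnsResult:
    """One hypothetical probe result; the field names are illustrative."""
    resolved: bool
    response_ms: float
    ttl_s: int


def violated(config: dict, result: DnsResult) -> bool:
    """Return True when the assertion described by config fails."""
    kind = config["type"]
    if kind == "dns_resolves":
        return not result.resolved
    if kind == "dns_response_time":
        return result.response_ms > config["maxMs"]
    if kind == "dns_response_time_warn":
        return result.response_ms > config["warnMs"]
    if kind == "dns_ttl_low":
        return config["minTtl"] > result.ttl_s
    raise ValueError(f"unknown assertion type: {kind}")


def evaluate(result: DnsResult, assertions=ASSERTIONS) -> dict:
    """Group the violated assertion types by severity."""
    findings = {"ERROR": [], "WARN": []}
    for assertion in assertions:
        if violated(assertion["config"], result):
            # Assertions without an explicit severity fall back to ERROR.
            severity = assertion.get("severity", "ERROR")
            findings[severity].append(assertion["config"]["type"])
    return findings


def route(findings: dict) -> str:
    """Page on any ERROR finding; send WARN-only findings to chat."""
    if findings["ERROR"]:
        return "page"
    if findings["WARN"]:
        return "slack"
    return "none"
```

&lt;p&gt;Under these assumptions, a probe that resolves in 72 ms with a 60-second TTL trips only the two warning assertions, so &lt;code&gt;route(evaluate(DnsResult(True, 72.0, 60)))&lt;/code&gt; returns &lt;code&gt;"slack"&lt;/code&gt;. A failed resolution, or a lookup slower than 100 ms, lands in the &lt;code&gt;ERROR&lt;/code&gt; bucket and returns &lt;code&gt;"page"&lt;/code&gt;.&lt;/p&gt;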

&lt;p&gt;If your DNS infrastructure sits behind Cloudflare, you can also track their operational status through their &lt;a href="https://devhelm.io/status/cloudflare" rel="noopener noreferrer"&gt;public status feed&lt;/a&gt; — useful for distinguishing between your DNS issues and theirs.&lt;/p&gt;

&lt;p&gt;You can create this monitor through the &lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;DevHelm dashboard&lt;/a&gt;, or from the terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;devhelm monitor create &lt;span class="nt"&gt;--type&lt;/span&gt; dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://app.devhelm.io" rel="noopener noreferrer"&gt;Start monitoring free&lt;/a&gt; — DNS, HTTP, TCP, and ICMP checks from five continents, with the full CLI and API surface included.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://devhelm.io/blog/how-to-fix-slow-dns-lookup" rel="noopener noreferrer"&gt;DevHelm&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>guides</category>
      <category>reliability</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
