DevHelm

Posted on • Originally published at devhelm.io

Incident Severity Levels: Sev1–Sev4 with Triage Matrix

Most teams define their severity levels as a table in a Confluence page, link to it from onboarding docs, and then never reference it during an actual incident. The levels exist, but nobody uses them. Three months later someone opens a sev1 for a broken CSS gradient and the on-call engineer gets paged at 2 AM.

Severity levels only work when three things are true: the scale is simple enough to apply under stress, the response expectations are explicit, and the routing is automated. This guide covers all three — the scale itself, the decision framework for assigning it, and the wiring that turns a severity label into the right alert at the right time.

The four levels

Most incident management systems converge on a four-level scale. The labels vary — sev1/sev2/sev3/sev4, P0/P1/P2/P3, critical/major/minor/info — but the structure is nearly universal.

| Level | Also called | Definition | Response expectation |
| --- | --- | --- | --- |
| Sev1 | P0, Critical | Complete outage of a production system, data loss, or security breach affecting customers | All-hands. Incident commander assigned. Stakeholder updates every 15 minutes. |
| Sev2 | P1, Major | Significant degradation: a core feature is broken or a significant percentage of users are affected. Service is up but materially impaired. | On-call responds immediately. Updates every 30 minutes. Escalation if unresolved in 1 hour. |
| Sev3 | P2, Minor | Limited degradation: a non-critical feature is broken, a workaround exists, or the impact is confined to a small subset of users. | Addressed within business hours. No page. Tracked in the incident backlog. |
| Sev4 | P3, Info | Cosmetic issue, minor inconvenience, or an anomaly that warrants investigation but has no user-facing impact. | Sprint backlog. No incident channel. Closed in the next cycle. |

The exact boundaries shift between organizations. A company whose revenue runs through a single API endpoint has a lower threshold for sev1 than a company with redundant payment processors. The table above is a starting point — calibrate it to your blast radius.

What matters more than the exact definitions is that everyone on the team can assign the right level within 60 seconds of seeing the alert. If your engineers argue about severity during an incident, the definitions are too ambiguous.

Severity vs priority — they are not the same

This distinction trips up most teams. Severity describes the impact of the incident — how bad it is right now. Priority describes the urgency of the response — how fast you need to fix it. They usually correlate, but not always:

  • A sev1 in a staging environment is critical severity, low priority. The environment is completely down, but no customers are affected.
  • A sev3 that blocks a contractual deadline is minor severity, high priority. The feature works for most users, but the one user who matters is the enterprise customer whose annual renewal depends on it shipping by Friday.
  • A sev2 that self-resolves in 90 seconds is significant severity, reduced priority after the fact. The incident was real, but by the time an engineer opened the laptop, the system recovered. The retro still matters, but the live response is over.

The Google SRE Workbook formalizes this as "severity is an attribute of the incident; priority is a decision made by the responder." The practical consequence: if your alerting system routes by severity alone, you get the right response most of the time. The rest requires human override — someone promoting a sev3 to high-priority or silencing a sev1 that fired in a non-production context.

A triage matrix that works under stress

When an alert fires, you have roughly 30 seconds of attention before the responder either acts or dismisses. The triage question is: "what severity is this?" The fastest way to answer it is a two-axis matrix of customer impact and scope.

| Impact \ Scope | Single user / account | Significant minority (10-30%) | Majority or all users |
| --- | --- | --- | --- |
| Feature broken, no workaround | Sev3 | Sev2 | Sev1 |
| Feature degraded, workaround exists | Sev4 | Sev3 | Sev2 |
| Non-functional impact (slow, noisy, ugly) | Sev4 | Sev4 | Sev3 |

The matrix is intentionally coarse. Three scope buckets, three impact buckets, nine cells. A responder can place an incident in the right cell in seconds without reading a paragraph of definitions.

Two overrides escalate an incident straight to sev1, whichever cell it lands in:

  1. Data loss or security exposure. A bug that leaks PII to unauthorized users is sev1 regardless of scope — even if it affects one account.
  2. Revenue impact. If the checkout flow is broken and orders are failing, that's sev1 even if the monitoring dashboard reports 95% availability — because the 5% that's failing is the 5% that pays the bills.
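The matrix and its overrides are small enough to encode as a lookup. The following is a minimal sketch, not part of any tool: the names `Sev`, `MATRIX`, and `triage`, and the short keys for impact and scope, are hypothetical labels chosen for illustration. It treats both overrides as forcing sev1, which is how the two list items above describe them.

```python
from enum import IntEnum


class Sev(IntEnum):
    """Severity levels; a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Rows are impact, columns are scope. Values are copied from the matrix above.
# The keys ("broken", "single", ...) are illustrative shorthand.
MATRIX = {
    "broken":   {"single": Sev.SEV3, "minority": Sev.SEV2, "majority": Sev.SEV1},
    "degraded": {"single": Sev.SEV4, "minority": Sev.SEV3, "majority": Sev.SEV2},
    "cosmetic": {"single": Sev.SEV4, "minority": Sev.SEV4, "majority": Sev.SEV3},
}


def triage(impact, scope, data_or_security=False, revenue_blocked=False):
    """Return the severity for an incident.

    Either override forces sev1 regardless of the matrix cell.
    """
    if data_or_security or revenue_blocked:
        return Sev.SEV1
    return MATRIX[impact][scope]


print(triage("degraded", "minority").name)
print(triage("cosmetic", "single", data_or_security=True).name)
```

Keeping the overrides as explicit boolean arguments means the responder has to answer them as separate questions instead of folding them into the scope estimate.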

What each severity triggers

The scale has no value unless it drives concrete actions. Every severity level should map to four things: who gets notified, how fast they respond, what communication cadence they maintain, and whether a post-incident review is mandatory.

Sev1: page on-call + backup + engineering lead. Acknowledge within 5 minutes. Incident channel created, stakeholder updates every 15 minutes, customer-facing status page updated. Mandatory blameless retro within 48 hours with tracked action items.

Sev2: page on-call. Acknowledge within 15 minutes. Incident channel, updates every 30 minutes. Retro recommended at team discretion.

Sev3: Slack channel or email notification. Response within the next business hour. Ticket created, no incident channel. Retro optional, only if the pattern is recurring.

Sev4: logged but no active notification. Next sprint. No communication, no retro.

If your sev1 and sev2 have the same notification channel, the same response time, and the same retro expectation, you don't have two severity levels — you have one with two names. Merge them or differentiate them.
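One way to keep that check honest is to write the four response expectations down as data and compare them. This is a rough sketch under assumptions: `ResponsePolicy`, `POLICIES`, and `duplicate_levels` are hypothetical names, the field values are transcribed from the four paragraphs above, and "the next business hour" for sev3 is approximated as 60 minutes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResponsePolicy:
    notify: Tuple[str, ...]         # who gets notified
    ack_minutes: Optional[int]      # response deadline; None means no deadline
    update_minutes: Optional[int]   # communication cadence; None means no updates
    retro: str                      # "mandatory", "recommended", "optional", "none"


POLICIES = {
    "sev1": ResponsePolicy(("on-call", "backup", "engineering-lead"), 5, 15, "mandatory"),
    "sev2": ResponsePolicy(("on-call",), 15, 30, "recommended"),
    # Assumption: one business hour is treated as 60 minutes.
    "sev3": ResponsePolicy(("slack-or-email",), 60, None, "optional"),
    "sev4": ResponsePolicy((), None, None, "none"),
}


def duplicate_levels(policies):
    """Return pairs of levels whose response policies are identical."""
    names = sorted(policies)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if policies[a] == policies[b]
    ]


print(duplicate_levels(POLICIES))
```

An empty result means every level triggers a distinct response. Any pair it returns is a candidate for merging or differentiating.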

How severity drives MTTR

Your MTTR target should vary by severity — and if you're tracking the full set of MTTA, MTTR, MTBF, and MTTF, severity determines which metric matters most at each tier. A sev1 with a 4-hour MTTR means your most critical incidents take half a workday to resolve — probably too slow. A sev4 with a 4-hour MTTR means you're spending on-call energy on cosmetic issues — probably too fast.

| Level | MTTR target | Rationale |
| --- | --- | --- |
| Sev1 | < 1 hour | Revenue is actively lost, users are actively blocked |
| Sev2 | < 4 hours | Significant impact but not existential |
| Sev3 | < 1 business day | Limited scope, workaround available |
| Sev4 | Next sprint | Not time-sensitive |

These targets feed directly into your SLO error budget. A 99.9% availability SLO on a 30-day window gives you about 43 minutes of total downtime (0.1% of 43,200 minutes). If your sev1 MTTR target is 1 hour, a single sev1 outage that runs the full hour exceeds the whole budget. That tension is the point: it forces you to invest in the runbooks and automation that keep resolution time below the budget threshold.
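The budget arithmetic is worth spelling out, since the same formula applies to any SLO and window. A minimal sketch follows; the function name `error_budget_minutes` is a hypothetical helper, and the calculation assumes downtime is the only thing drawing on the budget.

```python
def error_budget_minutes(slo, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


budget = error_budget_minutes(0.999, 30)   # 0.1% of 43,200 minutes
sev1_mttr_target = 60                      # minutes, from the table above

print(round(budget, 1))                    # about 43.2 minutes
print(sev1_mttr_target > budget)           # a full-length sev1 overspends the budget
```

Running the same function with 0.9999 shows how quickly the budget shrinks: four nines over 30 days leaves a little over four minutes.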

How DevHelm routes by severity

DevHelm models incident severity as three operational states: DOWN, DEGRADED, and MAINTENANCE. This is deliberately simpler than a sev1-through-sev4 scale. The numbered scale requires human judgment about scope and blast radius; DevHelm's model is automated from check results. When a monitor's trigger rule fires, the rule specifies whether the incident is DOWN (the service is not responding or failing critically) or DEGRADED (the service is responding but outside acceptable bounds — slow, returning partial errors, or failing specific assertions).

The routing happens in notification policies. Each policy has match rules, and one of those rules is severity_gte — "match when incident severity is greater than or equal to this threshold." Severity is ordered: DOWN > DEGRADED > MAINTENANCE. In practice, this gives you two-track routing:

  1. A policy with severity_gte: DOWN routes to PagerDuty — page the on-call engineer immediately.
  2. A policy with severity_gte: DEGRADED routes to a Slack channel — notify the team, no page.

The first policy fires only for DOWN incidents — your sev1 equivalent. The second fires for both DOWN and DEGRADED, so a DOWN incident sends both a page and a Slack message (the on-call gets paged, the wider team stays informed). A DEGRADED incident reaches Slack but never PagerDuty. You've split your alert routing by severity without writing any code.

For richer routing, combine severity_gte with other match rules. A policy that matches severity_gte: DOWN AND monitor_tag_in: ["payments", "checkout"] pages someone for critical payment failures but not for a down developer docs site. That's severity combined with business context — the same intersection the triage matrix above describes, except it's automated instead of decided in the heat of the moment.
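The matching logic can be sketched in a few lines to show why a DOWN incident reaches both destinations. This is an illustration of the semantics described above, not DevHelm's API or configuration syntax: only the names `severity_gte` and `monitor_tag_in` come from the text, while `Severity`, `Policy`, and `route` are hypothetical, and treating an empty tag set as "match any monitor" is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Ordered so that DOWN > DEGRADED > MAINTENANCE."""
    MAINTENANCE = 0
    DEGRADED = 1
    DOWN = 2


@dataclass(frozen=True)
class Policy:
    channel: str
    severity_gte: Severity
    # Assumption: an empty set means the policy does not filter on tags.
    monitor_tag_in: frozenset = frozenset()

    def matches(self, severity, tags):
        if severity < self.severity_gte:
            return False
        return not self.monitor_tag_in or bool(self.monitor_tag_in & set(tags))


def route(policies, severity, tags=()):
    """Return the channels of every policy that matches the incident."""
    return [p.channel for p in policies if p.matches(severity, tags)]


policies = [
    Policy("pagerduty", Severity.DOWN),
    Policy("slack", Severity.DEGRADED),
]

print(route(policies, Severity.DOWN))      # both policies match
print(route(policies, Severity.DEGRADED))  # only the Slack policy matches
```

Because every matching policy fires, the thresholds nest: raising `severity_gte` narrows a policy, and adding a tag filter narrows it further without affecting the others.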

Where to start

If your team doesn't have severity levels, start by writing the four definitions in a shared doc and getting three people to agree on them. That takes 30 minutes and pays for itself the first time someone opens an incident.

Then automate the routing. Set up a monitor in DevHelm, configure a trigger rule that fires as DOWN after two consecutive failures confirmed across regions, and wire a notification policy that pages your on-call for DOWN incidents and sends DEGRADED incidents to Slack. You've just built a severity-routed alerting pipeline that distinguishes between "wake someone up" and "the team should know" — running 24/7 without anyone remembering to check the definitions page.


Originally published on DevHelm.
