Samson Tanimawo
Incident Severity Levels: SEV-1 to SEV-5 Calibration

Why Severity Is Broken at Most Companies

Everyone has severity levels. Almost nobody agrees on what they mean.

Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:

  • Under-paged incidents (people thought SEV-3 meant "no rush")
  • Over-paged incidents (everything is SEV-1)
  • Exhausted on-call (false alarms)
  • Missed SLOs (incidents not escalated in time)

Calibration matters. Here's a definition that actually works.

The Five Levels

SEV-1: Critical

  • Primary product is completely down for all users
  • Active data loss
  • Security breach in progress
  • Core business stopped (can't process payments, can't log in)

Target response: 5 minutes
Escalation: Immediate, all hands
Post-mortem: Required, public within 5 days

SEV-2: High

  • Primary product is degraded for most users
  • Core feature unavailable for a subset
  • Significant customer impact but workaround exists
  • Performance significantly degraded (>5x normal latency)

Target response: 15 minutes
Escalation: Page primary on-call, notify secondary
Post-mortem: Required, internal within 5 days

SEV-3: Medium

  • Non-critical feature broken
  • Affects a small percentage of users
  • Degraded performance within tolerance
  • Bug in new feature rollout

Target response: 1 hour
Escalation: Page during business hours, ticket overnight
Post-mortem: Recommended

SEV-4: Low

  • Minor bug with workaround
  • Internal tooling broken
  • Non-customer-facing issue
  • Cosmetic problems

Target response: 1 business day
Escalation: Ticket only
Post-mortem: Not required

SEV-5: Informational

  • Not actually broken
  • Preemptive warning
  • "This might become a problem"
  • Observed anomaly without impact

Target response: Backlog
Escalation: None
Post-mortem: Not required

The Calibration Problem

Levels written on paper are useless. What matters is consistent application.

Run this exercise: take your last 50 incidents. Ask three SRE leads to independently assign severity levels. Compare.

If more than 20% disagree by at least one level, your definitions aren't calibrated. Run training.
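Scoring that exercise is a few lines of code. A minimal Python sketch (function name, rater names, and sample data are mine, purely illustrative):

```python
from itertools import combinations

def disagreement_rate(ratings):
    """ratings: one dict per incident, mapping rater -> assigned SEV level (int).
    Returns the fraction of incidents where any two raters differ by a level or more."""
    disagreed = 0
    for incident in ratings:
        levels = list(incident.values())
        if any(abs(a - b) >= 1 for a, b in combinations(levels, 2)):
            disagreed += 1
    return disagreed / len(ratings)

# Hypothetical sample: three leads rating four incidents
ratings = [
    {"lead_a": 2, "lead_b": 2, "lead_c": 2},  # full agreement
    {"lead_a": 1, "lead_b": 2, "lead_c": 2},  # one level apart
    {"lead_a": 3, "lead_b": 3, "lead_c": 3},
    {"lead_a": 4, "lead_b": 2, "lead_c": 3},  # wide spread
]
rate = disagreement_rate(ratings)
print(f"{rate:.0%} of incidents had disagreement")  # 50%
if rate > 0.20:
    print("Definitions aren't calibrated -- run training")
```

Run it over your real incident list; anything above the 20% line means the written definitions and the lived definitions have drifted apart.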

The "When In Doubt" Rules

When severity is ambiguous, default to higher severity and downgrade if wrong.

Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.

Specific rules:

  • User data loss → always SEV-1 or SEV-2, never lower
  • Security issue → always SEV-1 or SEV-2
  • Revenue impact → SEV-2 minimum if measurable
  • Uncertain scope → start at higher severity, downgrade when scope is clear
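These floor rules are mechanical enough to encode. A toy Python sketch, assuming a convention where a lower number means more severe (the function and its signature are mine, not a standard API):

```python
def minimum_severity(proposed_sev, data_loss=False, security_issue=False,
                     revenue_impact=False, scope_uncertain=False):
    """Clamp a proposed SEV level to the 'when in doubt' floors.
    Lower number = more severe; returns the final SEV level."""
    floor = 5
    if data_loss or security_issue or revenue_impact:
        floor = 2  # never lower than SEV-2 for these
    if scope_uncertain:
        # start one level higher while scope is unknown; downgrade later
        floor = min(floor, max(proposed_sev - 1, 1))
    return min(proposed_sev, floor)

print(minimum_severity(3, data_loss=True))       # 2
print(minimum_severity(4, scope_uncertain=True)) # 3
print(minimum_severity(4))                       # 4
```

The point is that the "when in doubt" defaults stop being a judgment call under stress; the responder proposes a level and the rules apply the floors.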

Customer Impact Matrix

For fast calibration, use a matrix:

```
                  | <1% users | 1-10% users | 10-50% users | >50% users
Product down      | SEV-2     | SEV-1       | SEV-1        | SEV-1
Major degraded    | SEV-3     | SEV-2       | SEV-2        | SEV-1
Minor degraded    | SEV-4     | SEV-3       | SEV-2        | SEV-2
Workaround exists | SEV-4     | SEV-4       | SEV-3        | SEV-2
```

This gives you a fast severity assignment without relying on intuition.
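The matrix is small enough to drop straight into tooling. A Python sketch (the category keys and bucket boundaries mirror the matrix above; everything else is illustrative):

```python
# Rows: impact category; columns: user-impact buckets (<1%, 1-10%, 10-50%, >50%)
SEVERITY_MATRIX = {
    "product_down":      [2, 1, 1, 1],
    "major_degraded":    [3, 2, 2, 1],
    "minor_degraded":    [4, 3, 2, 2],
    "workaround_exists": [4, 4, 3, 2],
}

def assign_severity(impact, pct_users_affected):
    """Map (impact category, % of users affected) to a SEV level via the matrix."""
    if pct_users_affected < 1:
        col = 0
    elif pct_users_affected <= 10:
        col = 1
    elif pct_users_affected <= 50:
        col = 2
    else:
        col = 3
    return SEVERITY_MATRIX[impact][col]

print(assign_severity("product_down", 0.5))       # 2
print(assign_severity("major_degraded", 60))      # 1
print(assign_severity("workaround_exists", 30))   # 3
```

Wiring this into the incident-creation form means the first severity guess is consistent, even at 3 a.m.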

Time-Based Escalation

Severity isn't fixed for the incident lifetime. It escalates:

```yaml
sev_2:
  auto_escalate_to_sev_1:
    - if_not_resolved_in: 60_minutes
    - if_user_impact_grows: above_10_percent
    - if_revenue_loss_exceeds: $10000/hour
```

Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.
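The escalation check itself is trivial; what matters is that something runs it periodically. A Python sketch of the three conditions from the config above (thresholds hard-coded for illustration):

```python
def should_escalate_to_sev1(minutes_open, user_impact_pct, revenue_loss_per_hour):
    """True if a SEV-2 should auto-escalate per the thresholds above:
    open >= 60 min, impact above 10% of users, or revenue loss above $10k/hour."""
    return (
        minutes_open >= 60
        or user_impact_pct > 10
        or revenue_loss_per_hour > 10_000
    )

print(should_escalate_to_sev1(75, 2, 0))      # True  (open too long)
print(should_escalate_to_sev1(30, 5, 5_000))  # False (all under thresholds)
```

Run this from the incident bot on a timer so escalation doesn't depend on a tired responder noticing the clock.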

The Downgrade Rule

Downgrading is allowed but must be justified in writing in the incident channel.

```
Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234
```

This prevents silent downgrades that understate severity for retro analysis.

SLO Integration

Your SLOs and severity levels should align:

```
SLO: 99.95% availability (21.6 min/month budget)

If this month's error budget burned:
  <25%    → normal operations
  25-50%  → no SEV-3 burn-down deploys
  50-75%  → SEV-2 threshold lowered
  >75%    → any degradation is SEV-1
```

When you're running low on error budget, everything gets more severe.
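The budget-to-policy mapping above can be computed directly. A Python sketch, assuming a 30-day month (43,200 minutes, so 0.05% downtime allowance = 21.6 minutes):

```python
MONTHLY_BUDGET_MIN = 21.6  # 0.05% of a 30-day month at 99.95% availability

def budget_policy(minutes_burned):
    """Map error-budget burn (in downtime minutes) to the operating policy."""
    burned = minutes_burned / MONTHLY_BUDGET_MIN
    if burned < 0.25:
        return "normal operations"
    if burned < 0.50:
        return "no SEV-3 burn-down deploys"
    if burned < 0.75:
        return "SEV-2 threshold lowered"
    return "any degradation is SEV-1"

print(budget_policy(5))   # normal operations   (~23% burned)
print(budget_policy(20))  # any degradation is SEV-1 (~93% burned)
```

This is how "everything gets more severe" stops being a vibe and becomes a policy the on-call can point to.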

Practical Incident Categories

Beyond numeric severity, label incidents by type:

```yaml
INCIDENT_TYPES:
  - infrastructure (AWS, networking)
  - application (code bug)
  - deployment (bad release)
  - capacity (scaling failure)
  - data (corruption, loss)
  - security (breach, exposure)
  - external (3rd-party dependency)
```

Severity tells you how urgent. Type tells you who to page.
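The "who to page" half is just a routing table. A Python sketch where every rotation name is hypothetical; substitute your own on-call schedules:

```python
# Hypothetical rotation names -- replace with your actual on-call schedules
PAGE_ROUTING = {
    "infrastructure": "platform-oncall",
    "application":    "service-owner-oncall",
    "deployment":     "release-oncall",
    "capacity":       "platform-oncall",
    "data":           "data-oncall",
    "security":       "security-oncall",
    "external":       "vendor-liaison",
}

def who_to_page(incident_type):
    """Route an incident type to a rotation, with a safe default."""
    return PAGE_ROUTING.get(incident_type, "primary-oncall")

print(who_to_page("security"))  # security-oncall
print(who_to_page("unknown"))   # primary-oncall
```

The default matters: an unclassified incident should still page someone, not fall into a routing gap.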

The Monthly Review

Once a month, review:

  • All SEV-1s and SEV-2s
  • Any SEV-3 that should have been SEV-2
  • Any SEV-2 that should have been SEV-3
  • Average time from incident open to correct severity assignment

Adjust the definitions based on what you learn. Severity is a living standard.

Common Mistakes

  1. Pet severity: every team invents its own levels. Standardize company-wide.
  2. SEV-0: don't add levels above SEV-1. Just use "SEV-1, all hands."
  3. Severity inflation: if every incident is SEV-2, nobody takes SEV-2 seriously.
  4. Severity deflation: pressure to avoid post-mortems leads to fake SEV-4s.
  5. Unchanging severity: escalation is a tool; use it.

The Goal

Severity should mean the same thing to every person in the org: engineers, PMs, execs, customer support.

When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.

When you achieve that, incident response gets dramatically better.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
