Samson Tanimawo

Posted on May 3

Incident Severity Levels: SEV-1 to SEV-5 Calibration

#incidents #sre #oncall #process

Why Severity Is Broken at Most Companies

Everyone has severity levels. Almost nobody agrees on what they mean.

Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:

Under-paged incidents (people thought SEV-3 meant "no rush")
Over-paged incidents (everything is SEV-1)
Exhausted on-call (false alarms)
Missed SLOs (incidents not escalated in time)

Calibration matters. Here's a definition that actually works.

The Five Levels

SEV-1: Critical

Primary product is completely down for all users
Active data loss
Security breach in progress
Core business stopped (can't process payments, can't log in)

Target response: 5 minutes
Escalation: Immediate, all hands
Post-mortem: Required, public within 5 days

SEV-2: High

Primary product is degraded for most users
Core feature unavailable for a subset
Significant customer impact but workaround exists
Performance significantly degraded (>5x normal latency)

Target response: 15 minutes
Escalation: Page primary on-call, notify secondary
Post-mortem: Required, internal within 5 days

SEV-3: Medium

Non-critical feature broken
Affects a small percentage of users
Degraded performance within tolerance
Bug in new feature rollout

Target response: 1 hour
Escalation: Page during business hours, ticket overnight
Post-mortem: Recommended

SEV-4: Low

Minor bug with workaround
Internal tooling broken
Non-customer-facing issue
Cosmetic problems

Target response: 1 business day
Escalation: Ticket only
Post-mortem: Not required

SEV-5: Informational

Not actually broken
Preemptive warning
"This might become a problem"
Observed anomaly without impact

Target response: Backlog
Escalation: None
Post-mortem: Not required
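
One way to keep these definitions from drifting is to store them as data that paging and ticketing tools can read. The sketch below is a minimal illustration, not a reference implementation: the `Severity`, `Policy`, and `postmortem_required` names are assumptions made up for this example, and the values are copied from the level definitions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching SEV-1..SEV-5."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class Policy:
    target_response: str
    escalation: str
    postmortem: str


# Values copied from the five level definitions in this post.
POLICIES = {
    Severity.SEV1: Policy("5 minutes", "Immediate, all hands",
                          "Required, public within 5 days"),
    Severity.SEV2: Policy("15 minutes", "Page primary on-call, notify secondary",
                          "Required, internal within 5 days"),
    Severity.SEV3: Policy("1 hour", "Page during business hours, ticket overnight",
                          "Recommended"),
    Severity.SEV4: Policy("1 business day", "Ticket only", "Not required"),
    Severity.SEV5: Policy("Backlog", "None", "Not required"),
}


def postmortem_required(sev: Severity) -> bool:
    """True only for levels whose post-mortem is mandatory."""
    return POLICIES[sev].postmortem.startswith("Required")


print(postmortem_required(Severity.SEV2))  # True
print(postmortem_required(Severity.SEV3))  # False
```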

The Calibration Problem

Levels written on paper are useless. What matters is consistent application.

Run this exercise: take your last 50 incidents. Ask three SRE leads to independently assign severity levels. Compare.

If the leads disagree on more than 20% of the incidents, your definitions aren't calibrated. Run training.
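
The exercise is easy to score in a few lines. This is a rough sketch, assuming each incident's ratings are collected as a tuple of integers (1 for SEV-1 through 5 for SEV-5); the `disagreement_rate` function and the sample ratings are hypothetical, and "disagreement" here means the raters did not all choose the same level.

```python
from typing import Sequence


def disagreement_rate(ratings: Sequence[Sequence[int]]) -> float:
    """Fraction of incidents where the raters did not all pick the same level.

    Each inner sequence holds one incident's severity as assigned by each rater.
    """
    if not ratings:
        raise ValueError("need at least one rated incident")
    disagreements = sum(1 for incident in ratings if len(set(incident)) > 1)
    return disagreements / len(ratings)


# Hypothetical ratings for five past incidents from three SRE leads.
sample = [
    (1, 1, 1),
    (2, 3, 2),  # one lead rated this a level lower
    (3, 3, 3),
    (2, 2, 2),
    (4, 3, 4),  # another split
]

rate = disagreement_rate(sample)
print(f"{rate:.0%} of incidents had a disagreement")
if rate > 0.20:
    print("Definitions are not calibrated; schedule training.")
```

The incidents where the leads split are also the best training material: walk through each one and agree on which definition should have applied.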

The "When In Doubt" Rules

When severity is ambiguous, default to higher severity and downgrade if wrong.

Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.

Specific rules:

User data loss → always SEV-1 or SEV-2, never lower
Security issue → always SEV-1 or SEV-2
Revenue impact → SEV-2 minimum if measurable
Uncertain scope → start at higher severity, downgrade when scope is clear
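
The first three rules are severity floors, so they can be enforced mechanically after a human proposes a level. A minimal sketch, assuming severity is an integer from 1 to 5 where a smaller number is more severe; `apply_floor` and its flag names are hypothetical. The "uncertain scope" rule is left to human judgment here because the post does not say how many levels to bump.

```python
def apply_floor(
    proposed: int,
    *,
    data_loss: bool = False,
    security: bool = False,
    measurable_revenue_impact: bool = False,
) -> int:
    """Return the severity after applying the 'never lower than SEV-2' rules.

    Severity is an int 1..5; a smaller number is more severe.
    """
    if not 1 <= proposed <= 5:
        raise ValueError("severity must be between 1 and 5")
    if data_loss or security or measurable_revenue_impact:
        # All three rules set a floor of SEV-2; never make it less severe.
        return min(proposed, 2)
    return proposed


print(apply_floor(4, data_loss=True))  # 2
print(apply_floor(1, security=True))   # 1
print(apply_floor(3))                  # 3
```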

Customer Impact Matrix

For fast calibration, use a matrix:

| Impact            | <1% users | 1-10% users | 10-50% users | >50% users |
|-------------------|-----------|-------------|--------------|------------|
| Product Down      | SEV-2     | SEV-1       | SEV-1        | SEV-1      |
| Major Degraded    | SEV-3     | SEV-2       | SEV-2        | SEV-1      |
| Minor Degraded    | SEV-4     | SEV-3       | SEV-2        | SEV-2      |
| Workaround Exists | SEV-4     | SEV-4       | SEV-3        | SEV-2      |

This gives you a fast severity assignment without relying on intuition.
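
The matrix translates directly into a lookup table. The sketch below is one possible encoding: the row keys, `user_bucket`, and `severity_for` are names invented for this example, and the handling of exact boundaries (10% and 50% fall into the lower bucket) is an assumption, since the matrix does not say which column a boundary value belongs to.

```python
# Rows: impact type. Columns: share of users affected, in the order
# <1%, 1-10%, 10-50%, >50%. Values are the SEV levels from the matrix.
MATRIX = {
    "product_down":      (2, 1, 1, 1),
    "major_degraded":    (3, 2, 2, 1),
    "minor_degraded":    (4, 3, 2, 2),
    "workaround_exists": (4, 4, 3, 2),
}


def user_bucket(percent_affected: float) -> int:
    """Map a percentage of affected users to a column index (0..3)."""
    if not 0 <= percent_affected <= 100:
        raise ValueError("percent_affected must be between 0 and 100")
    if percent_affected < 1:
        return 0
    if percent_affected <= 10:  # assumption: exactly 10% stays in the 1-10% column
        return 1
    if percent_affected <= 50:  # assumption: exactly 50% stays in the 10-50% column
        return 2
    return 3


def severity_for(impact: str, percent_affected: float) -> int:
    """Look up the SEV level for an impact type and user share."""
    return MATRIX[impact][user_bucket(percent_affected)]


print(severity_for("product_down", 0.5))   # 2
print(severity_for("major_degraded", 60))  # 1
```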

Time-Based Escalation

Severity isn't fixed for the incident lifetime. It escalates:

sev_2:
  auto_escalate_to_sev_1:
    - if_not_resolved_in: 60_minutes
    - if_user_impact_grows: above_10_percent
    - if_revenue_loss_exceeds: $10000/hour

Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.
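
A periodic check against an open incident could evaluate those three triggers. This is a sketch under stated assumptions: `IncidentState` and `should_escalate_sev2` are hypothetical names, the incident is assumed to be unresolved when checked, and the thresholds are the ones from the config above.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    minutes_open: float
    percent_users_affected: float
    revenue_loss_per_hour: float  # in dollars


def should_escalate_sev2(state: IncidentState) -> bool:
    """Return True if any SEV-2 auto-escalation trigger has fired."""
    return (
        state.minutes_open >= 60                 # if_not_resolved_in: 60_minutes
        or state.percent_users_affected > 10     # if_user_impact_grows: above_10_percent
        or state.revenue_loss_per_hour > 10_000  # if_revenue_loss_exceeds: $10000/hour
    )


print(should_escalate_sev2(IncidentState(10, 2, 0)))   # False
print(should_escalate_sev2(IncidentState(61, 2, 0)))   # True
print(should_escalate_sev2(IncidentState(5, 12, 0)))   # True
```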

The Downgrade Rule

Downgrading is allowed but must be justified in writing in the incident channel.

"Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234"

This prevents silent downgrades that understate severity for retro analysis.
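
If downgrades go through a bot or script, the written justification can be made mandatory. A minimal sketch, assuming integer severities where a larger number is less severe; `downgrade_message` is a hypothetical helper, and the date in the usage example is made up (the time, reason, and ticket echo the sample message above).

```python
from datetime import datetime


def downgrade_message(current: int, new: int, at: datetime,
                      reason: str, ticket: str) -> str:
    """Build the channel message for a downgrade, refusing silent ones."""
    if new <= current:
        raise ValueError("a downgrade must move to a less severe (higher-numbered) level")
    if not reason.strip():
        raise ValueError("a downgrade needs a written justification")
    return (
        f"Downgrading from SEV-{current} to SEV-{new} at {at:%H:%M}. "
        f"{reason.strip()} Ticket: {ticket}"
    )


msg = downgrade_message(
    1, 2,
    datetime(2024, 5, 3, 10, 23),  # hypothetical date
    "Real impact is ~5% of users in us-west-2 only.",
    "INC-1234",
)
print(msg)
```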

SLO Integration

Your SLOs and severity levels should align:

SLO: 99.95% availability (21.6 min/month budget)

If this month's error budget burned:
<25% → normal operations
25-50% → no SEV-3 burn-down deploys
50-75% → SEV-2 threshold lowered
>75% → any degradation is SEV-1

When you're running low on error budget, everything gets more severe.
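
The 21.6-minute figure comes from a 30-day month: 30 × 24 × 60 = 43,200 minutes, and 0.05% of that is 21.6 minutes. The sketch below reproduces the arithmetic and maps a burn fraction to the tiers above. The function names are hypothetical, and placing the exact 50% boundary in the stricter tier is an assumption, since the ranges above overlap at that point.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return days * 24 * 60 * (1 - slo_percent / 100)


def budget_posture(burned_fraction: float) -> str:
    """Map the fraction of this month's error budget burned to a posture."""
    if burned_fraction < 0.25:
        return "normal operations"
    if burned_fraction < 0.50:
        return "no SEV-3 burn-down deploys"
    if burned_fraction <= 0.75:  # assumption: exactly 50% lands in this tier
        return "SEV-2 threshold lowered"
    return "any degradation is SEV-1"


budget = error_budget_minutes(99.95)
print(round(budget, 1))              # 21.6
print(budget_posture(17.0 / budget))  # 17 minutes of downtime so far this month
```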

Practical Incident Categories

Beyond numeric severity, label incidents by type:

INCIDENT_TYPES:
- infrastructure (AWS, networking)
- application (code bug)
- deployment (bad release)
- capacity (scaling failure)
- data (corruption, loss)
- security (breach, exposure)
- external (3rd-party dependency)

Severity tells you how urgent. Type tells you who to page.
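
In tooling, that split often ends up as a routing table keyed by incident type. The sketch below is illustrative only: the type keys come from the list above, but every rotation name is hypothetical and would be replaced by your own on-call schedules.

```python
# Incident type -> on-call rotation. All rotation names are hypothetical.
ROUTING = {
    "infrastructure": "platform-oncall",
    "application": "service-owner-oncall",
    "deployment": "release-oncall",
    "capacity": "platform-oncall",
    "data": "data-oncall",
    "security": "security-oncall",
    "external": "service-owner-oncall",
}


def who_to_page(incident_type: str) -> str:
    """Return the rotation for an incident type, rejecting unknown types."""
    try:
        return ROUTING[incident_type]
    except KeyError:
        raise ValueError(f"unknown incident type: {incident_type!r}") from None


print(who_to_page("security"))  # security-oncall
```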

The Monthly Review

Once a month, review:

All SEV-1s and SEV-2s
Any SEV-3 that should have been SEV-2
Any SEV-2 that should have been SEV-3
Average time from incident open to correct severity assignment

Adjust the definitions based on what you learn. Severity is a living standard.
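
The review numbers are straightforward to compute if each incident records its initial severity, its final severity, and how long it took to reach the correct one. A rough sketch, assuming those three fields exist in your incident tracker; `ReviewedIncident`, `monthly_review`, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class ReviewedIncident:
    initial_severity: int  # 1..5, smaller is more severe
    final_severity: int
    minutes_to_correct_severity: float


def monthly_review(incidents: Sequence[ReviewedIncident]) -> dict:
    """Summarize misclassifications and time to the correct severity."""
    if not incidents:
        raise ValueError("no incidents to review")
    # Under-classified: the final level was more severe than the initial one.
    under = sum(1 for i in incidents if i.final_severity < i.initial_severity)
    # Over-classified: the final level was less severe than the initial one.
    over = sum(1 for i in incidents if i.final_severity > i.initial_severity)
    return {
        "under_classified": under,
        "over_classified": over,
        "avg_minutes_to_correct_severity": mean(
            i.minutes_to_correct_severity for i in incidents
        ),
    }


# Hypothetical month with three incidents.
summary = monthly_review([
    ReviewedIncident(2, 2, 0),   # correct from the start
    ReviewedIncident(3, 2, 30),  # should have been SEV-2
    ReviewedIncident(2, 3, 15),  # should have been SEV-3
])
print(summary)
```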

Common Mistakes

Pet severity: every team invents its own levels. Standardize company-wide.
SEV-0: don't add levels above SEV-1. Just use "SEV-1 all hands."
Severity inflation: if every incident is SEV-2, nobody takes SEV-2 seriously.
Severity deflation: pressure to avoid post-mortems leads to fake SEV-4s.
Unchanging severity: escalation is a tool, so use it.

The Goal

Severity should mean the same thing to every person in the org: engineers, PMs, execs, and customer support.

When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.

When you achieve that, incident response gets dramatically better.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
