Why Severity Is Broken at Most Companies
Everyone has severity levels. Almost nobody agrees on what they mean.
Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:
- Under-paged incidents (people thought SEV-3 meant "no rush")
- Over-paged incidents (everything is SEV-1)
- Exhausted on-call (false alarms)
- Missed SLOs (incidents not escalated in time)
Calibration matters. Here's a definition that actually works.
The Five Levels
SEV-1: Critical
- Primary product is completely down for all users
- Active data loss
- Security breach in progress
- Core business stopped (can't process payments, can't log in)
Target response: 5 minutes
Escalation: Immediate, all hands
Post-mortem: Required, public within 5 days
SEV-2: High
- Primary product is degraded for most users
- Core feature unavailable for a subset
- Significant customer impact but workaround exists
- Performance significantly degraded (>5x normal latency)
Target response: 15 minutes
Escalation: Page primary on-call, notify secondary
Post-mortem: Required, internal within 5 days
SEV-3: Medium
- Non-critical feature broken
- Affects a small percentage of users
- Degraded performance within tolerance
- Bug in new feature rollout
Target response: 1 hour
Escalation: Page during business hours, ticket overnight
Post-mortem: Recommended
SEV-4: Low
- Minor bug with workaround
- Internal tooling broken
- Non-customer-facing issue
- Cosmetic problems
Target response: 1 business day
Escalation: Ticket only
Post-mortem: Not required
SEV-5: Informational
- Not actually broken
- Preemptive warning
- "This might become a problem"
- Observed anomaly without impact
Target response: Backlog
Escalation: None
Post-mortem: Not required
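One way to keep these definitions from drifting is to store them as data that tooling can read. The sketch below is illustrative only: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `policy_for` are hypothetical, and the values are copied from the five levels above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching the SEV-N names."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    target_response: str
    escalation: str
    post_mortem: str


# Values taken directly from the level definitions in this article.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        "Critical", "5 minutes", "Immediate, all hands",
        "Required, public within 5 days"),
    Severity.SEV2: SeverityPolicy(
        "High", "15 minutes", "Page primary on-call, notify secondary",
        "Required, internal within 5 days"),
    Severity.SEV3: SeverityPolicy(
        "Medium", "1 hour", "Page during business hours, ticket overnight",
        "Recommended"),
    Severity.SEV4: SeverityPolicy(
        "Low", "1 business day", "Ticket only", "Not required"),
    Severity.SEV5: SeverityPolicy(
        "Informational", "Backlog", "None", "Not required"),
}


def policy_for(level: int) -> SeverityPolicy:
    """Look up the policy for a numeric level (1-5); raises ValueError otherwise."""
    return POLICIES[Severity(level)]
```

With this in place, `policy_for(2).escalation` gives the SEV-2 escalation text, so paging tools and runbooks read from one definition.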
The Calibration Problem
Levels written on paper are useless. What matters is consistent application.
Run this exercise: take your last 50 incidents. Ask three SRE leads to independently assign severity levels. Compare.
If the leads disagree by one level or more on over 20% of those incidents, your definitions aren't calibrated. Run training.
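The exercise above can be scored with a few lines of code. This is a minimal sketch that assumes each incident is rated by every lead as an integer level from 1 to 5; `disagreement_rate` and `is_calibrated` are hypothetical names, and "disagree" is read as "the ratings are not all identical."

```python
def disagreement_rate(ratings: list[tuple[int, ...]]) -> float:
    """Fraction of incidents where the raters did not all pick the same level."""
    if not ratings:
        raise ValueError("need at least one rated incident")
    disagreements = sum(1 for levels in ratings if len(set(levels)) > 1)
    return disagreements / len(ratings)


def is_calibrated(ratings: list[tuple[int, ...]], threshold: float = 0.20) -> bool:
    """Calibrated means the disagreement rate is not above the threshold (20% here)."""
    return disagreement_rate(ratings) <= threshold


# Each tuple holds the levels three leads assigned to one incident.
sample = [(1, 1, 1), (2, 2, 3), (3, 3, 3), (4, 4, 4), (2, 2, 2)]
print(disagreement_rate(sample), is_calibrated(sample))
```

Running it on your real incident history gives a number to track from one training round to the next.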
The "When In Doubt" Rules
When severity is ambiguous, default to higher severity and downgrade if wrong.
Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.
Specific rules:
- User data loss → always SEV-1 or SEV-2, never lower
- Security issue → always SEV-1 or SEV-2
- Revenue impact → SEV-2 minimum if measurable
- Uncertain scope → start at higher severity, downgrade when scope is clear
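These rules amount to a floor applied on top of whatever severity was first proposed. A rough sketch follows. The function name and flags are hypothetical, and bumping exactly one level for uncertain scope is an assumption, since the rule only says to start higher.

```python
def apply_when_in_doubt(
    proposed: int,
    *,
    data_loss: bool = False,
    security: bool = False,
    revenue_impact: bool = False,
    scope_uncertain: bool = False,
) -> int:
    """Return the severity (1 = most severe) after applying the rules above."""
    severity = proposed
    # Data loss, security issues, and measurable revenue impact are never below SEV-2.
    if data_loss or security or revenue_impact:
        severity = min(severity, 2)
    # Assumption: "start at higher severity" is modeled as one level up, capped at SEV-1.
    if scope_uncertain:
        severity = max(1, severity - 1)
    return severity
```

For example, `apply_when_in_doubt(4, data_loss=True)` returns 2: a data-loss report filed as SEV-4 is raised to the SEV-2 floor.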
Customer Impact Matrix
For fast calibration, use a matrix:
| Impact            | <1% users | 1-10% users | 10-50% users | >50% users |
|-------------------|-----------|-------------|--------------|------------|
| Product Down      | SEV-2     | SEV-1       | SEV-1        | SEV-1      |
| Major Degraded    | SEV-3     | SEV-2       | SEV-2        | SEV-1      |
| Minor Degraded    | SEV-4     | SEV-3       | SEV-2        | SEV-2      |
| Workaround Exists | SEV-4     | SEV-4       | SEV-3        | SEV-2      |
This gives you a fast severity assignment without relying on intuition.
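The matrix translates directly into a lookup table. In the sketch below, the key names and `severity_for` are hypothetical. The matrix leaves the bucket edges ambiguous, so exactly 10% is assumed to fall in the 1-10% bucket and exactly 50% in the 10-50% bucket.

```python
# Rows: impact category. Columns: <1%, 1-10%, 10-50%, >50% of users.
MATRIX = {
    "product_down":      (2, 1, 1, 1),
    "major_degraded":    (3, 2, 2, 1),
    "minor_degraded":    (4, 3, 2, 2),
    "workaround_exists": (4, 4, 3, 2),
}


def _bucket(pct_users: float) -> int:
    """Map a percentage of affected users to a column index in MATRIX."""
    if not 0 <= pct_users <= 100:
        raise ValueError("pct_users must be between 0 and 100")
    if pct_users < 1:
        return 0
    if pct_users <= 10:  # assumption: the 10% boundary belongs to the lower bucket
        return 1
    if pct_users <= 50:  # assumption: the 50% boundary belongs to the lower bucket
        return 2
    return 3


def severity_for(impact: str, pct_users: float) -> int:
    """Return the SEV number for an impact category and affected-user percentage."""
    return MATRIX[impact][_bucket(pct_users)]
```

So `severity_for("major_degraded", 60)` yields 1, matching the bottom-right reading of the matrix for a major degradation hitting most users.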
Time-Based Escalation
Severity isn't fixed for the lifetime of an incident. It should escalate as conditions change:
sev_2:
  auto_escalate_to_sev_1:
    - if_not_resolved_in: 60_minutes
    - if_user_impact_grows: above_10_percent
    - if_revenue_loss_exceeds: $10000/hour
Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.
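A periodic check against the open incident is one way to implement that config. This sketch hard-codes the three thresholds from the config above. `IncidentState` and `maybe_escalate` are hypothetical names, and how you measure user impact or revenue loss is left to your own monitoring.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentState:
    severity: int                 # 1 = SEV-1, 2 = SEV-2, ...
    open_for: timedelta           # time since the incident was opened
    user_impact_pct: float        # percentage of users currently affected
    revenue_loss_per_hour: float  # estimated loss in dollars per hour


def maybe_escalate(state: IncidentState) -> int:
    """Return the severity the incident should have now; only SEV-2 auto-escalates."""
    if state.severity != 2:
        return state.severity
    unresolved_too_long = state.open_for >= timedelta(minutes=60)
    impact_grew = state.user_impact_pct > 10
    losing_revenue = state.revenue_loss_per_hour > 10_000
    if unresolved_too_long or impact_grew or losing_revenue:
        return 1
    return 2
```

Calling this every few minutes from the incident bot keeps escalation from depending on someone remembering to re-evaluate.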
The Downgrade Rule
Downgrading is allowed but must be justified in writing in the incident channel.
"Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234"
This prevents silent downgrades that understate severity for retro analysis.
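If downgrades go through a bot or CLI, the written justification can be made mandatory. A small sketch, assuming a hypothetical `downgrade_note` helper that builds the channel message in the format shown above and refuses to do so without a reason:

```python
from datetime import datetime


def downgrade_note(current: int, new: int, reason: str, ticket: str, at: datetime) -> str:
    """Build the incident-channel message for a downgrade; reject unjustified ones."""
    if new <= current:
        raise ValueError("not a downgrade: new severity must be a larger SEV number")
    if not reason.strip():
        raise ValueError("a written justification is required")
    return (
        f"Downgrading from SEV-{current} to SEV-{new} at {at:%H:%M}. "
        f"{reason.strip()} Ticket: {ticket}"
    )
```

The returned string is what gets posted to the channel, so every downgrade leaves a timestamped record for the retro.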
SLO Integration
Your SLOs and severity levels should align:
SLO: 99.95% availability (21.6 min/month budget)
If this month's error budget burned:
<25% → normal operations
25-50% → no SEV-3 burn-down deploys
50-75% → SEV-2 threshold lowered
>75% → any degradation is SEV-1
When you're running low on error budget, everything gets more severe.
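The 21.6-minute figure is the unavailable fraction times the minutes in a 30-day month: (1 - 0.9995) × 30 × 24 × 60. The sketch below reproduces that arithmetic and maps budget burn to the tiers above. The function names are hypothetical, and exactly 50% burned is assumed to fall in the stricter 50-75% tier.

```python
def monthly_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month for an availability SLO such as 0.9995."""
    return (1 - slo) * days * 24 * 60


def budget_policy(burned_fraction: float) -> str:
    """Map the fraction of error budget burned this month (0.0-1.0+) to a policy tier."""
    if burned_fraction < 0.25:
        return "normal operations"
    if burned_fraction < 0.50:
        return "no SEV-3 burn-down deploys"
    if burned_fraction <= 0.75:  # assumption: 50% exactly lands in this tier
        return "SEV-2 threshold lowered"
    return "any degradation is SEV-1"


budget = monthly_budget_minutes(0.9995)
downtime_so_far = 12.0  # minutes of downtime this month (example value)
print(round(budget, 1), budget_policy(downtime_so_far / budget))
```

With 12 minutes already spent against a 21.6-minute budget, a little over half the budget is gone, which puts the month in the "SEV-2 threshold lowered" tier.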
Practical Incident Categories
Beyond numeric severity, label incidents by type:
INCIDENT_TYPES:
- infrastructure (AWS, networking)
- application (code bug)
- deployment (bad release)
- capacity (scaling failure)
- data (corruption, loss)
- security (breach, exposure)
- external (3rd-party dependency)
Severity tells you how urgent. Type tells you who to page.
The Monthly Review
Once a month, review:
- All SEV-1s and SEV-2s
- Any SEV-3 that should have been SEV-2
- Any SEV-2 that should have been SEV-3
- Average time from incident open to correct severity assignment
Adjust the definitions based on what you learn. Severity is a living standard.
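The last review metric is easy to compute if you record when each incident was opened and when it first carried the severity it ended up with. A minimal sketch, assuming those two timestamps are available per incident (the function name and input shape are hypothetical):

```python
from datetime import datetime, timedelta


def mean_time_to_correct_severity(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average gap between incident open and the moment the final severity was set.

    Each tuple is (opened_at, correct_severity_assigned_at).
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((fixed - opened for opened, fixed in incidents), timedelta())
    return total / len(incidents)
```

A rising average from month to month suggests the definitions or the training need attention, even if the final severities look right.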
Common Mistakes
- Pet severity: every team invents its own levels. Standardize company-wide.
- SEV-0: don't add levels above SEV-1. Just use "SEV-1 all hands."
- Severity inflation: if every incident is SEV-2, nobody takes SEV-2 seriously.
- Severity deflation: pressure to avoid post-mortems leads to fake SEV-4s.
- Unchanging severity: escalation is a tool, so use it.
The Goal
Severity should mean the same thing to every person in the org. Engineers, PMs, execs, customer support.
When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.
When you achieve that, incident response gets dramatically better.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com