incident.io is a Slack-native incident response and management tool that scales as your team grows. Hypergrowth companies use incident.io to automate incident processes, focus on fixing the issue, and learn from incident insights to improve site reliability and fix vulnerabilities. See how it works on incident.io.
We wrote this article in response to a question asked in our Slack Community. Click here to join hundreds of technology leaders discussing best practices for incident response! ✨
We know a thing or two about incident response. As such, we're often asked to advise when companies are designing their incident response processes.
A common question is "How do you design your incident severity levels?" It's a great question, given how central severity levels are to incident response!
In this article, we walk through:
- What incident severity levels are
- Things to consider when designing your incident severity levels
- Our recommended severity levels
What are incident severity levels?
Severity levels measure the impact of an incident. They answer the question "how bad is this incident?" If you've ever seen labels like SEV-1 or P1, those are severity levels.
Severity levels are used for communicating impact to your coworkers, customers, and stakeholders.
Well-designed severity levels create shared expectations between the people responding to an incident. This makes it easier to coordinate and prioritise effectively.
Different severity levels may trigger different processes or automation. When we launched Workflows, we found that most organisations wanted their automation driven by incident severities. For example, you might notify the executive team when a Critical incident is declared.
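To make the idea of severity-driven automation concrete, here is a minimal sketch in Python. It is not how Workflows is built or configured: the action names and the `actions_for` helper are hypothetical placeholders, and the four severity names are the ones recommended later in this article.

```python
# Hypothetical mapping from severity name to the automation it triggers.
# The action names are illustrative placeholders, not real Workflows steps.
AUTOMATION_BY_SEVERITY = {
    "Low": ["create_ticket"],
    "Medium": ["create_ticket", "post_in_team_channel"],
    "High": ["create_ticket", "post_in_team_channel", "page_on_call"],
    "Critical": [
        "create_ticket",
        "post_in_team_channel",
        "page_on_call",
        "notify_executive_team",
    ],
}


def actions_for(severity: str) -> list[str]:
    """Return the actions to run when an incident is declared at this severity."""
    try:
        return AUTOMATION_BY_SEVERITY[severity]
    except KeyError:
        # Fail loudly on labels outside the agreed set (for example "SEV-1").
        raise ValueError(f"Unknown severity: {severity!r}") from None
```

With this shape, only `actions_for("Critical")` includes `notify_executive_team`, and an unrecognised label is rejected rather than silently ignored.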
Things to consider when designing your incident severity levels
We have the privilege of talking to people working on their incident response processes all the time. Here are the top 4 tips we've learned from those conversations.
Have one set of severity levels for the whole organisation
The primary benefit of severity levels is a set of common definitions shared across teams, so people can quickly understand the urgency of an incident without digging into the detail.
Set your severity levels consistently across your organisation
If your severity levels are different team-to-team, it'll be hard for newcomers to understand how bad any particular incident is, and vital time will be wasted.
Add clear guidance on how to set them
Add business-specific guidance on how to triangulate and set the severity of an incident, and make sure it's front and centre in your organisation. Responders may be stressed or tired, and will be looking for clear boundaries and instructions.
You should be able to summarise each level in 1-2 sentences. For example, 'Any financial loss above £100K will be considered a Critical incident.'
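A clear boundary like the one above is easy to write down as a rule. The sketch below is only an illustration: the £100K Critical boundary comes from the example, while the High and Medium thresholds and the `severity_from_loss` helper are made up for this sketch and should be replaced with your own business-specific numbers.

```python
# Thresholds in GBP, checked from most to least severe.
# Only the £100K Critical boundary comes from the example in the text;
# the High and Medium boundaries are hypothetical placeholders.
LOSS_THRESHOLDS_GBP = [
    (100_000, "Critical"),
    (10_000, "High"),
    (1_000, "Medium"),
]


def severity_from_loss(loss_gbp: float) -> str:
    """Suggest a severity from an estimated financial loss in GBP."""
    for threshold, severity in LOSS_THRESHOLDS_GBP:
        # "Above" is a strict comparison: exactly £100K is not Critical.
        if loss_gbp > threshold:
            return severity
    return "Low"
```

Writing the rule out forces you to decide the edge cases up front, such as whether a loss of exactly £100K counts as Critical, so a tired responder doesn't have to.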
Choose the smallest number you can get away with
You need to apply the Goldilocks principle to severity levels.
You don't want too few: they won't capture the nuance between different incidents effectively. You don't want too many: the boundaries between them will be confusing and hard to discern, and they'll ultimately lose their power as a communication mechanism.
You want it just right. In our experience, 3-5 severity levels is the right number. Startups should start with 3, and add severity levels as the maximum possible impact increases through company and product growth.
Choose human words over code words
Choose human words like Low or Medium over code words like SEV-1 or P1. Some people will expect P1 to be more severe than P5 and others, well, won't! 🙈
Human words communicate severity clearly, with little room for misinterpretation, which is important in a stressful situation.
Our recommended severity levels
Even when they approach severity levels from first principles, most organisations end up with very similar ones. We recommend adopting the following tried-and-tested severity levels:
- Low: minimal impact and can be handled in work hours
- Medium: business impact (either internal or external) but doesn't significantly impact normal operations
- High: impact warrants immediate response and may disrupt normal operations
- Critical: a higher level where executives are involved and there could be reputational damage or severe business impact. This generally isn't necessary unless your organisation has 100+ people.
There's a clear distinction between levels: enough for nuance, but not so many that they overload responders who are trying to decide which severity applies.
Use these levels as a starting place, and customise the description to your organisation.
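If you keep your severity levels in code or configuration, one possible encoding is sketched below. The names and one-line descriptions are taken from the list above; the numeric ranks, the `DESCRIPTIONS` table, and the `needs_immediate_response` helper are assumptions made for this sketch.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Human-readable severity names; a higher value means more severe."""

    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# One-line summaries adapted from the recommended levels; customise to your organisation.
DESCRIPTIONS = {
    Severity.LOW: "Minimal impact; can be handled in work hours.",
    Severity.MEDIUM: "Business impact, but normal operations are not significantly affected.",
    Severity.HIGH: "Impact warrants immediate response and may disrupt normal operations.",
    Severity.CRITICAL: "Executives are involved; possible reputational damage or severe business impact.",
}


def needs_immediate_response(severity: Severity) -> bool:
    """High and Critical incidents warrant an immediate response."""
    return severity >= Severity.HIGH
```

Because the ordering is explicit, a comparison such as `Severity.CRITICAL > Severity.LOW` reads the way people expect, which avoids the "is P1 worse than P5?" ambiguity.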
We hope that was helpful. If you've got any questions on how to design your incident severity levels, come ask us in our Slack Community and we'd be happy to help.