DEV Community

Discussion on: How do you deal with incidents?

Mitch Pomery (he/him)

If your incidents are chaos, it sounds like people aren't prepared for them (unless something truly chaotic has happened like a lightning strike taking out your primary DC).

I really like the PagerDuty Incident Response docs for outlining what the roles are in an incident and who needs to do what (and who needs to stay out of it). My work doesn't have documentation similar to it, but we do have dedicated incident response managers who help co-ordinate incidents and get the right people involved.

When I'm personally in an incident, I make sure to talk about what has happened, not who did it (e.g. "The firewall rules have changed" instead of "name changed the firewall rules"), and to state explicitly what I am going to do instead of asking permission (e.g. "I am going to redeploy X" instead of "Can I redeploy X?"). I have found that both habits keep the incident limited to the people who are needed and reduce the time to recovery.

As an example, my team was called at 8AM with a major incident. The person responding asked the incident manager "Can I redeploy?", to which the incident manager replied "Can you? Who do we need to ask?". Soon there were 5 other teams involved in the incident, all asking "Can this be redeployed?" and all answering "I don't know". When our team changed from asking "Can we redeploy?" to stating "We are redeploying", the incident manager immediately agreed, and the incident was resolved soon after.