How We Handled Our First Major Outage (And Survived)

#sre #devops #incident #culture

Three years ago we had our first real outage. Six hours of downtime. Thousands of angry users. Multiple executives on the call. Here's what we did right, what we did wrong, and what we'd do differently.

What we did right

1. Communicated immediately. The moment we knew we had a problem, we updated the status page and emailed our biggest customers personally. We didn't wait until we had answers; we communicated while all we had were questions.

2. Had a single incident commander (IC). One person making calls, not a committee. When the CEO tried to direct technical work, the IC politely redirected her to where her help was actually needed: talking to customers.

3. Took care of our people. During hour 4, I ordered food. During hour 5, I forced the primary engineer off the call for 20 minutes to walk outside. Long incidents destroy people. You have to feed them and force them to rest.

4. Wrote it down as we went. We had a shared doc with a live timeline. When the post-mortem came, we had every decision captured.
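
For teams that would rather script the live timeline than maintain it by hand, here is a minimal sketch of the idea in Python. It is not what we ran during the outage (we used a shared doc); the class names, the `kind` labels, and the output format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # when it happened (UTC)
    author: str    # who logged it
    kind: str      # hypothetical labels: "observation", "decision", "action"
    note: str      # what was seen, decided, or done


@dataclass
class IncidentTimeline:
    """Illustrative append-only incident log; not a real tool's API."""
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, author: str, kind: str, note: str,
            at: Optional[datetime] = None) -> TimelineEntry:
        # Default to "now" so logging during an incident is one short call.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, kind, note)
        self.entries.append(entry)
        return entry

    def decisions(self) -> List[TimelineEntry]:
        # The post-mortem mostly cares about what was decided and when.
        return [e for e in self.entries if e.kind == "decision"]

    def render(self) -> str:
        # Sort by timestamp in case entries were added out of order.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC [{e.kind}] {e.author}: {e.note}" for e in ordered
        )
```

Calling `timeline.log("ic", "decision", "Roll back the last deploy")` as things happen, then `timeline.render()` when the post-mortem starts, gives roughly what the shared doc gave us: every decision, in order, with a timestamp.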

What we did wrong

1. Tried to fix the root cause during the incident. For the first 2 hours, we were digging into why the database was struggling. We should have been mitigating (rolling back) first.

2. Let too many people 'help.' By hour 3, we had 12 engineers on the call, and half of them had nothing useful to do. The IC should have sent people away sooner.

3. Gave optimistic estimates. We said, 'We'll be back in 30 minutes.' We were not back in 30 minutes. Missing that estimate did more damage than saying 'unknown' would have.

4. Didn't prepare the executive communication. The CEO had to answer customer questions in real time with no script. We should have drafted talking points for her after hour 1.

What we'd do differently

1. Mitigate first, investigate second. Always.

2. Cap the number of active engineers at 4 during an incident. Others go on standby.

3. Default to 'unknown' for estimates. Only give a number when we're sure.

4. Assign someone explicitly to 'executive liaison.' Their job is to keep the C-suite informed without interrupting the technical team.
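
If you want these rules enforced by tooling rather than by memory, they are simple enough to encode. The sketch below is one possible shape in Python, not something we actually built: `IncidentRoster`, `format_eta`, and the `confident` flag are hypothetical names, and only the cap of 4 and the 'unknown' default come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_ACTIVE = 4  # cap on active engineers, taken from the list above


@dataclass
class IncidentRoster:
    """Hypothetical roster for one incident; names are illustrative."""
    commander: str
    executive_liaison: str  # keeps the C-suite informed, off the tech channel
    active: List[str] = field(default_factory=list)
    standby: List[str] = field(default_factory=list)

    def join(self, engineer: str) -> str:
        """Add an engineer; anyone beyond the cap goes to standby."""
        if engineer in self.active or engineer in self.standby:
            return "already on roster"
        if len(self.active) < MAX_ACTIVE:
            self.active.append(engineer)
            return "active"
        self.standby.append(engineer)
        return "standby"

    def release(self, engineer: str) -> Optional[str]:
        """Release an active engineer and promote the first standby, if any.

        Raises ValueError if the engineer is not currently active.
        """
        self.active.remove(engineer)
        if self.standby:
            promoted = self.standby.pop(0)
            self.active.append(promoted)
            return promoted
        return None


def format_eta(minutes: Optional[int] = None, confident: bool = False) -> str:
    """Only publish a number when the IC has marked it as confident."""
    if minutes is not None and confident:
        return f"Estimated recovery in {minutes} minutes"
    return "Time to recovery: unknown"
```

The design choice worth noting is that 'unknown' is the default path: `format_eta(30)` still returns the 'unknown' message, so publishing a number takes a deliberate `confident=True` from whoever owns the estimate.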

The aftermath

The post-mortem was brutal and cathartic. We identified 14 action items and completed 11 of them over the next quarter.

The outage was the best thing that happened to our reliability culture. It turned reliability from 'a thing SRE owns' into 'a thing everyone takes seriously.' I wouldn't wish a 6-hour outage on anyone, but I also wouldn't trade the lessons.

The final lesson

Your first major outage will happen. Prepare for it by running game days. The game days will feel silly until the real thing happens, at which point every muscle you trained will kick in.

Incident response is a skill. Skills need practice. Practice now.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com