Running a production incident is a skill. Most of the skill isn't technical. Here's what nobody told me when I started running incidents.
Skill 1: Calling the cadence
During an incident, time warps. Everyone is heads-down in logs. Nobody remembers when they last updated the status channel.
The incident commander's job is to force a cadence: 'Update in 5 minutes. What do we know? What do we need?' Without this, the incident drags on because no one is aggregating context.
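If you'd rather not rely on your own sense of time, a small timer can do the nagging for you. What follows is a minimal sketch, not a description of any real tool: the function names `next_update_due` and `cadence_prompt` and the 5-minute default are assumptions made for illustration. It only decides whether an update is overdue and produces the prompt text. Posting it to a status channel is left out.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_update_due(last_update: datetime, interval_minutes: int = 5) -> datetime:
    """Return the time at which the next status update is due."""
    return last_update + timedelta(minutes=interval_minutes)


def cadence_prompt(
    now: datetime, last_update: datetime, interval_minutes: int = 5
) -> Optional[str]:
    """Return a prompt if an update is due, otherwise None.

    Hypothetical helper: the wording mirrors the commander's questions
    from the text and is not taken from any real incident tool.
    """
    due = next_update_due(last_update, interval_minutes)
    if now < due:
        return None
    # Whole minutes past the due time, rounded down.
    overdue = int((now - due).total_seconds() // 60)
    return (
        f"Update due ({overdue} min overdue). "
        "What do we know? What do we need?"
    )
```

Call `cadence_prompt(datetime.now(), last_update)` on a loop or from a scheduler, and reset `last_update` each time someone posts. The point is not the code. It is that the cadence should not depend on anyone remembering it.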
Skill 2: Saying 'I don't know, and here's what we're doing to find out'
Stakeholders want certainty. You can't give it. The temptation is to guess.
Don't guess. Say 'We don't know the cause yet. We're investigating X and Y. I'll update in 10 minutes.' Trust builds on honesty, not performance.
Skill 3: Interrupting your engineers
Your engineers are investigating. They don't want to stop and explain. But if you don't interrupt, you can't make decisions.
Do it anyway. Say 'I know you're busy. 30 seconds: what have you learned?' Most engineers will appreciate the structure, even if they complain.
Skill 4: Knowing when to stop investigating and start mitigating
The temptation in an incident is to chase the root cause before doing anything else. The right action is usually to mitigate first and investigate second.
'We don't know why, but rolling back stops it' is a win. Don't feel bad. The post-mortem can figure out why.
Skill 5: Managing morale
Long incidents grind people down. Notice when your team is flagging. Bring in a relief shift. Say 'good job, let's take 10.' Acknowledge that it's hard.
The worst incidents I've been in were the ones where the team ran out of emotional energy before the problem was fixed. That's on the commander.
Skill 6: Declaring the incident over
Incidents often drag on past actual resolution because nobody wants to declare victory. Declare it. 'Issue resolved at 15:47. We'll keep monitoring for 30 minutes but the incident is closed.' People need permission to exhale.
The real job
Incident command is emotional labor disguised as technical work. The best commanders I know are calm, honest, and generous with credit. The worst are the ones who try to be the smartest technical person in the room. That's not the job.
You can't learn this from a book. You learn it by running incidents badly and asking for feedback afterward. The feedback is usually uncomfortable. It's also how you get good.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com