Samson Tanimawo

Posted on Apr 21

The Incident Commander Role: Running Incidents Without Chaos

#sre #incidents #leadership #devops

Everyone's Debugging, Nobody's Leading

Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting.

This is what incidents look like without an Incident Commander.

What the IC Does

The IC doesn't debug. They coordinate.

IC Responsibilities:
✓ Declare incident severity
✓ Assign roles (debugger, communicator, scribe)
✓ Coordinate investigation streams
✓ Make decisions (rollback? escalate? wait?)
✓ Manage communication (status page, stakeholders)
✓ Call for help when needed
✓ Declare all-clear

IC Does NOT:
✗ Write code
✗ Run queries
✗ SSH into servers
✗ Debug the issue

The IC Playbook

Minute 0-5: Declaration

1. Acknowledge the page
2. Open incident channel: #inc-YYYY-MM-DD-description
3. Post severity declaration:

"I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing

Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)

First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes."
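A declaration like this is easy to assemble from a few fields so the IC is not typing it from scratch under pressure. The following is a minimal sketch, not a prescribed tool: the `Declaration` class, its field names, and the slug rules in `channel_name` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Declaration:
    """Inputs for the IC's opening message (hypothetical field names)."""
    severity: str
    summary: str
    impact: str
    roles: dict[str, str]          # handle -> role description
    first_actions: dict[str, str]  # handle -> first task
    update_interval_min: int = 10


def channel_name(day: date, description: str) -> str:
    """Build a channel name following the #inc-YYYY-MM-DD-description pattern."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def declaration_message(d: Declaration) -> str:
    """Render the opening message in the same shape as the example above."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {d.severity} - {d.summary}",
        f"Impact: {d.impact}",
        "",
        "Roles:",
    ]
    lines += [f"- {handle}: {role}" for handle, role in d.roles.items()]
    lines += ["", "First actions:"]
    lines += [f"- {handle}: {task}" for handle, task in d.first_actions.items()]
    lines.append(f"- I'll update every {d.update_interval_min} minutes.")
    return "\n".join(lines)
```

Calling `channel_name(date(2024, 4, 21), "Checkout down!")` yields `#inc-2024-04-21-checkout-down`; the date here is only an example value.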

Minute 5-15: Investigation

The IC runs a structured investigation loop:

Every 5 minutes:
1. "@alice, what have you found?"
2. Synthesize information
3. Decide next action
4. Assign next task
5. Update channel: "Current theory: [X]. Testing: [Y]."
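The loop above is mostly a cadence plus a fixed update format. As a rough sketch (the function names and the idea of precomputing check-in times are assumptions, not part of any particular tool), both can be captured in a few lines:

```python
from datetime import datetime, timedelta


def check_in_times(start: datetime, count: int, interval_min: int = 5) -> list[datetime]:
    """Return the next `count` check-in times, spaced `interval_min` minutes apart."""
    return [start + timedelta(minutes=interval_min * i) for i in range(1, count + 1)]


def channel_update(theory: str, testing: str) -> str:
    """Format the step-5 channel update: 'Current theory: [X]. Testing: [Y].'"""
    return f"Current theory: {theory}. Testing: {testing}."
```

An IC who starts the loop at 12:05 would get check-ins at 12:10, 12:15, and so on, and each one ends with a message such as `channel_update("bad deploy", "rollback on canary")`.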

Minute 15+: Decision Points

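def ic_decision_tree(situation):
    """Return the IC's next move; `situation.duration` is in minutes."""
    # Known root cause: ship a fix if one exists, otherwise roll back.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"

    # Stalled for more than 15 minutes: bring in help.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    # Impact is spreading: raise severity and limit the damage.
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    return "Continue investigation, update in 5 min"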
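To see how the decision tree above behaves, it helps to feed it a concrete situation. The sketch below assumes a `Situation` dataclass whose fields mirror the attributes the function reads; that class and its defaults are hypothetical, and the function is repeated only so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical snapshot of incident state, one field per attribute the tree reads."""
    root_cause_known: bool = False
    fix_available: bool = False
    duration: int = 0  # minutes since the incident was declared
    making_progress: bool = True
    customer_impact_growing: bool = False


def ic_decision_tree(situation: Situation) -> str:
    # Same branching as the decision tree in the article.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        return "Rollback to last known good"
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"
    return "Continue investigation, update in 5 min"


# Twenty minutes in with no progress: the tree says to escalate.
stalled = Situation(duration=20, making_progress=False)
print(ic_decision_tree(stalled))  # Escalate: bring in additional expertise

# Root cause found but no fix ready: roll back.
known_no_fix = Situation(root_cause_known=True)
print(ic_decision_tree(known_no_fix))  # Rollback to last known good
```

Note that the order of the checks matters: a known root cause always wins over the escalation branches, even if customer impact is still growing.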

Communication Templates

Pre-written templates save precious minutes:

templates:
  internal_update:
    format: |
      **Incident Update [{severity}] {time} UTC**
      Status: {investigating|identified|monitoring|resolved}
      Impact: {impact_description}
      Current action: {what_we_are_doing}
      Next update: {time_of_next_update}

  status_page_update:
    format: |
      We are {status} an issue affecting {service}.
      Some users may experience {symptom}.
      Our team is actively working on a resolution.
      Next update in {minutes} minutes.

  executive_escalation:
    format: |
      P1 Incident: {title}
      Duration: {duration} minutes
      Customer impact: {impact}
      Revenue impact: ~${revenue}/hour
      Current status: {status}
      ETA to resolution: {eta}
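Templates only save time if filling them in is mechanical. Here is a minimal sketch of rendering the internal update with Python's built-in `str.format`. It assumes the `{investigating|identified|monitoring|resolved}` placeholder is replaced by a single `{status}` field restricted to those four values; that substitution, the constant names, and the function name are illustrative choices, not part of the templates above.

```python
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)

# The four status values listed in the internal_update template.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}


def render_internal_update(**fields: str) -> str:
    """Fill the internal update template, rejecting unknown status values."""
    if fields.get("status") not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    # str.format raises KeyError if any template field is missing.
    return INTERNAL_UPDATE.format(**fields)
```

Failing loudly on a missing field or an unknown status is deliberate here: a half-filled update posted mid-incident is worse than an error the communicator can fix in seconds.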

Training New ICs

We use game days to train ICs:

Week 1: Shadow an experienced IC during a game day
Week 2: IC a simulated P2 incident (game day)
Week 3: IC a simulated P1 incident (game day)
Week 4: IC a real P3/P4 incident with a mentor observing
Week 5+: IC rotation for all severities

The IC Rotation

ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on-call compensation
    - IC counts as on-call time
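A weekly rotation over a fixed pool is simple round-robin. The sketch below shows one way to generate it; the function name, the placeholder handles, and the choice to enforce the minimum pool size in code are assumptions for illustration.

```python
from datetime import date, timedelta

MIN_POOL_SIZE = 6  # mirrors pool_size in the rotation config


def weekly_rotation(pool: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one IC per week, cycling through the pool in order."""
    if len(pool) < MIN_POOL_SIZE:
        raise ValueError(f"need at least {MIN_POOL_SIZE} trained ICs, got {len(pool)}")
    return [(start + timedelta(weeks=i), pool[i % len(pool)]) for i in range(weeks)]


# Hypothetical handles; with six people each one is IC every sixth week.
pool = ["ic1", "ic2", "ic3", "ic4", "ic5", "ic6"]
for week_start, ic in weekly_rotation(pool, date(2024, 1, 1), 7):
    print(week_start.isoformat(), ic)
```

With a pool of six, the seventh week wraps back to the first person, which is what makes the one-week-in-six load predictable.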

Before and After

Metric	Without IC	With IC
MTTR (P1)	67 min	28 min
Communication gaps	Frequent	Rare
Duplicate work	~40%	~5%
Stakeholder satisfaction	Low	High
Post-mortem quality	Incomplete	Thorough
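For the two numeric rows, the relative change is quick to work out. This is plain arithmetic on the table's figures, with the caveat that the duplicate-work numbers are approximate in the table itself; the helper name is arbitrary.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


mttr_drop = relative_reduction(67, 28)      # MTTR (P1), minutes
duplicate_drop = relative_reduction(40, 5)  # duplicate work, percent (approximate)

print(f"MTTR reduction: {mttr_drop:.0%}")                 # MTTR reduction: 58%
print(f"Duplicate work reduction: {duplicate_drop:.1%}")  # Duplicate work reduction: 87.5%
```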

The IC doesn't make incidents shorter because they're smarter. They make incidents shorter because someone is actually managing the response.

If you want AI-assisted incident coordination that makes every engineer an effective IC, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
