DEV Community

Samson Tanimawo


The Incident Commander Role: Running Incidents Without Chaos

Everyone's Debugging, Nobody's Leading

Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting.

This is what incidents look like without an Incident Commander.

What the IC Does

The IC doesn't debug. They coordinate.

```
IC Responsibilities:
✓ Declare incident severity
✓ Assign roles (debugger, communicator, scribe)
✓ Coordinate investigation streams
✓ Make decisions (rollback? escalate? wait?)
✓ Manage communication (status page, stakeholders)
✓ Call for help when needed
✓ Declare all-clear

IC Does NOT:
✗ Write code
✗ Run queries
✗ SSH into servers
✗ Debug the issue
```
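The split between coordinating and doing can also be checked mechanically before the roles are announced. A minimal sketch, assuming a hypothetical `Roster` type whose role names come from the lists above; it is not drawn from any incident-management tool, and the person names are placeholders.

```python
from dataclasses import dataclass, field

# Hands-on work the IC should never own (from the "IC Does NOT" list).
HANDS_ON_TASKS = {"write code", "run queries", "ssh into servers", "debug the issue"}


@dataclass
class Roster:
    """Hypothetical roster: who holds each incident role."""
    ic: str
    debugger: str
    communicator: str
    scribe: str
    # Maps a person to the tasks currently assigned to them.
    tasks: dict[str, set[str]] = field(default_factory=dict)


def validate_roster(roster: Roster) -> list[str]:
    """Return a list of problems; an empty list means the roster is sound."""
    problems = []
    # The IC coordinates, so they should not also hold a working role.
    if roster.ic in (roster.debugger, roster.communicator, roster.scribe):
        problems.append(f"{roster.ic} is IC and also holds another role")
    # Any overlap between the IC's tasks and hands-on work is a violation.
    hands_on = roster.tasks.get(roster.ic, set()) & HANDS_ON_TASKS
    if hands_on:
        problems.append(f"IC assigned hands-on work: {sorted(hands_on)}")
    return problems


roster = Roster(ic="dana", debugger="alice", communicator="bob", scribe="charlie")
print(validate_roster(roster))  # []
```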

The IC Playbook

Minute 0-5: Declaration

```
1. Acknowledge the page
2. Open incident channel: #inc-YYYY-MM-DD-description
3. Post severity declaration:

"I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing

Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)

First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes."
```
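The channel name and the opening message follow fixed patterns, so they are easy to generate instead of typing under pressure. A minimal sketch, assuming the chat tool wants lowercase, hyphenated channel names; `incident_channel` and `declaration` are hypothetical helpers, not part of any chat SDK, and actually posting the message is left to whatever integration you already use.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-description format."""
    # Lowercase the description and collapse anything non-alphanumeric to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def declaration(severity: str, summary: str, impact: str,
                roles: dict[str, str], update_minutes: int = 10) -> str:
    """Render the IC's opening message from the same fields as the example."""
    role_lines = "\n".join(f"- @{who}: {role}" for who, role in roles.items())
    return (
        "I'm IC for this incident.\n"
        f"Severity: {severity} - {summary}\n"
        f"Impact: {impact}\n\n"
        f"Roles:\n{role_lines}\n\n"
        f"I'll update every {update_minutes} minutes."
    )


print(incident_channel(date(2024, 5, 14), "Checkout down"))
# #inc-2024-05-14-checkout-down
```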

Minute 5-15: Investigation

The IC runs a structured investigation loop:

```
Every 5 minutes:
1. "@alice, what have you found?"
2. Synthesize information
3. Decide next action
4. Assign next task
5. Update channel: "Current theory: [X]. Testing: [Y]."
```
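Writing the loop as code makes the cadence and the shape of each channel update explicit. A minimal sketch, assuming a 5-minute interval measured from declaration; `checkin_minutes`, `investigation_round`, and the toy `naive_decider` are hypothetical, and in a real incident the "decide" step is the IC's judgment rather than a function.

```python
from typing import Callable

# A "decider" turns the latest findings into (current theory, next test to run).
Decider = Callable[[str], tuple[str, str]]


def checkin_minutes(start: int, end: int, interval: int = 5) -> list[int]:
    """Minutes since declaration at which the IC asks for findings."""
    return list(range(start, end + 1, interval))


def investigation_round(minute: int, findings: str, decide: Decider) -> str:
    """One pass of the loop: gather, synthesize, decide, assign, broadcast."""
    theory, next_test = decide(findings)  # steps 2-4
    # Step 5: a one-line update everyone in the channel can read.
    return f"[T+{minute}m] Current theory: {theory}. Testing: {next_test}."


def naive_decider(findings: str) -> tuple[str, str]:
    # Hypothetical rule of thumb for the demo: suspect a recent deploy first.
    if "deploy" in findings:
        return "bad deploy", "rollback on a canary host"
    return "unknown", "widen log search"


for m in checkin_minutes(5, 15):
    print(investigation_round(m, "errors started after deploy", naive_decider))
```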

Minute 15+: Decision Points

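```python
from dataclasses import dataclass


@dataclass
class Situation:
    # Fields inferred from how ic_decision_tree uses them.
    root_cause_known: bool
    fix_available: bool
    duration: int  # minutes since the incident was declared
    making_progress: bool
    customer_impact_growing: bool


def ic_decision_tree(situation: Situation) -> str:
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"

    # Stalled for more than 15 minutes with no progress: add expertise.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    return "Continue investigation, update in 5 min"
```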

Communication Templates

Pre-written templates save precious minutes:

```yaml
templates:
  internal_update:
    format: |
      **Incident Update [{severity}] {time} UTC**
      Status: {investigating|identified|monitoring|resolved}
      Impact: {impact_description}
      Current action: {what_we_are_doing}
      Next update: {time_of_next_update}

  status_page_update:
    format: |
      We are {status} an issue affecting {service}.
      Some users may experience {symptom}.
      Our team is actively working on a resolution.
      Next update in {minutes} minutes.

  executive_escalation:
    format: |
      P1 Incident: {title}
      Duration: {duration} minutes
      Customer impact: {impact}
      Revenue impact: ~${revenue}/hour
      Current status: {status}
      ETA to resolution: {eta}
```
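The `{name}` placeholders in these templates map directly onto Python's `str.format`. A minimal sketch of filling `internal_update`, assuming the template is kept as a Python string (loading it from YAML would need a third-party parser such as PyYAML, omitted here) and that the `{investigating|identified|monitoring|resolved}` choice becomes a single validated `{status}` field; `STATUSES` and `render_internal_update` are hypothetical names.

```python
STATUSES = ("investigating", "identified", "monitoring", "resolved")

# The internal_update template, with the status choice turned into one field.
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)


def render_internal_update(**fields: str) -> str:
    """Fill the template, refusing statuses outside the allowed set."""
    if fields.get("status") not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    # str.format raises KeyError if any placeholder is left unfilled.
    return INTERNAL_UPDATE.format(**fields)


print(render_internal_update(
    severity="P1",
    time="14:05",
    status="identified",
    impact_description="~30% of checkout attempts failing",
    what_we_are_doing="Rolling back the last deploy",
    time_of_next_update="14:15 UTC",
))
```

The `KeyError` on a missing field is useful here: it stops a half-filled update from going out.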

Training New ICs

We use game days to train ICs:

```
Week 1: Shadow an experienced IC during a game day
Week 2: IC a simulated P2 incident (game day)
Week 3: IC a simulated P1 incident (game day)
Week 4: IC a real P3/P4 incident with a mentor observing
Week 5+: IC rotation for all severities
```
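Because the ladder is strictly ordered, a trainee's next step, and which real incidents they may command, can be derived from the number of weeks completed. A minimal sketch, assuming one stage per week with no repeats; `LADDER`, `next_step`, and `may_ic_real_incident` are hypothetical, and the severity gate encodes the "P3/P4 before all severities" rule from the schedule.

```python
# Training ladder from the schedule above; week 5 is the steady state.
LADDER = {
    1: "Shadow an experienced IC during a game day",
    2: "IC a simulated P2 incident (game day)",
    3: "IC a simulated P1 incident (game day)",
    4: "IC a real P3/P4 incident with a mentor observing",
    5: "IC rotation for all severities",
}


def next_step(weeks_completed: int) -> str:
    """Return the activity for the trainee's next week (capped at week 5+)."""
    week = min(weeks_completed + 1, max(LADDER))
    return LADDER[week]


def may_ic_real_incident(weeks_completed: int, severity: str) -> bool:
    """Gate real incidents by training stage."""
    if weeks_completed < 3:
        return False  # weeks 1-3 are game days only
    if weeks_completed == 3:
        return severity in ("P3", "P4")  # week 4: low severity, mentor observing
    return True  # week 5+: full rotation
```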

The IC Rotation

```yaml
ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on-call compensation
    - IC counts as on-call time
```
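The three requirements translate into an eligibility filter, and a weekly schedule over the eligible pool is a round-robin. A minimal sketch, assuming tenure is tracked in whole months and shadowing as a count; `Engineer`, `eligible`, and `weekly_schedule` are hypothetical. With exactly six eligible engineers, each person is IC one week in six.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Engineer:
    name: str
    trained: bool            # completed the IC training program
    months_on_team: int
    incidents_shadowed: int  # real incidents shadowed


def eligible(e: Engineer) -> bool:
    """Apply the three requirements from the ic_rotation config."""
    return e.trained and e.months_on_team >= 6 and e.incidents_shadowed >= 3


def weekly_schedule(pool: list[Engineer], start: date, weeks: int,
                    min_pool: int = 6) -> list[tuple[date, str]]:
    """Round-robin the eligible engineers, one IC per week starting at `start`."""
    names = [e.name for e in pool if eligible(e)]
    if len(names) < min_pool:
        raise ValueError(f"need at least {min_pool} eligible ICs, have {len(names)}")
    return [(start + timedelta(weeks=i), names[i % len(names)]) for i in range(weeks)]
```

Refusing to build a schedule below the minimum mirrors the `pool_size: 6` comment: a smaller pool puts each person on IC more often than one week in six.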

Before and After

| Metric | Without IC | With IC |
|---|---|---|
| MTTR (P1) | 67 min | 28 min |
| Communication gaps | Frequent | Rare |
| Duplicate work | ~40% | ~5% |
| Stakeholder satisfaction | Low | High |
| Post-mortem quality | Incomplete | Thorough |

The IC doesn't make incidents shorter because they're smarter. They make incidents shorter because someone is actually managing the response.

If you want AI-assisted incident coordination that makes every engineer an effective IC, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
