Everyone's Debugging, Nobody's Leading
Five engineers in an incident channel. All debugging independently. Nobody coordinating. Three people checking the same dashboard. Two trying conflicting fixes. Customers waiting.
This is what incidents look like without an Incident Commander.
What the IC Does
The IC doesn't debug. They coordinate.
IC Responsibilities:
✓ Declare incident severity
✓ Assign roles (debugger, communicator, scribe)
✓ Coordinate investigation streams
✓ Make decisions (rollback? escalate? wait?)
✓ Manage communication (status page, stakeholders)
✓ Call for help when needed
✓ Declare all-clear
IC Does NOT:
✗ Write code
✗ Run queries
✗ SSH into servers
✗ Debug the issue
The IC Playbook
Minute 0-5: Declaration
1. Acknowledge the page
2. Open incident channel: #inc-YYYY-MM-DD-description
3. Post severity declaration:
"I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing
Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)
First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes."
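The declaration above follows a fixed shape, so it can be generated rather than typed under pressure. The following is a minimal sketch, not a definitive implementation: the helper names `incident_channel` and `declaration` are hypothetical, and the slug rules (lowercase, hyphen-separated) are an assumption about how the `description` part of the channel name is formed.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-description pattern."""
    # Assumption: the description is slugified to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def declaration(severity: str, summary: str, impact: str,
                roles: dict, update_minutes: int = 10) -> str:
    """Assemble the IC's opening message from its fixed parts."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {severity} - {summary}",
        f"Impact: {impact}",
        "Roles:",
    ]
    # One line per assigned role, in the order the roles were given.
    lines += [f"- @{handle}: {role}" for handle, role in roles.items()]
    lines.append(f"I'll update every {update_minutes} minutes.")
    return "\n".join(lines)


print(incident_channel(date(2024, 3, 5), "Checkout down"))
print(declaration(
    "P1",
    "Customer-facing checkout is down",
    "~30% of checkout attempts failing",
    {"alice": "Primary debugger", "bob": "Comms", "charlie": "Scribe"},
))
```

Generating the message keeps the severity, impact, and role lines in the same order every time, which makes declarations easier to scan.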
Minute 5-15: Investigation
The IC runs a structured investigation loop:
Every 5 minutes:
1. "@alice, what have you found?"
2. Synthesize information
3. Decide next action
4. Assign next task
5. Update channel: "Current theory: [X]. Testing: [Y]."
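The loop above is mostly cadence plus a fixed update format. As a rough sketch, the check-in times and the channel update can be derived like this. The function names `checkin_times` and `channel_update` are hypothetical, and the example start time is made up for illustration.

```python
from datetime import datetime, timedelta


def checkin_times(start: datetime, interval_minutes: int = 5, count: int = 3) -> list:
    """Return the next `count` check-in times, one every `interval_minutes`."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(1, count + 1)]


def channel_update(theory: str, testing: str) -> str:
    """Format the channel update posted at the end of each loop iteration."""
    return f"Current theory: {theory}. Testing: {testing}."


# Hypothetical start of the investigation phase.
start = datetime(2024, 3, 5, 14, 5)
for t in checkin_times(start):
    print(t.strftime("%H:%M"), "check in with the primary debugger")
print(channel_update("bad deploy", "rollback in staging"))
```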
Minute 15+: Decision Points
def ic_decision_tree(situation):
    """Return the IC's next action for the current incident state."""
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"
    # duration is measured in minutes
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"
    return "Continue investigation, update in 5 min"
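The decision tree reads attributes off a `situation` object that the article does not define. Here is one possible way to exercise it, assuming a hypothetical `Situation` dataclass with those five fields and `duration` in minutes. The function is repeated so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical container for the fields the decision tree reads."""
    root_cause_known: bool = False
    fix_available: bool = False
    duration: int = 0  # minutes (assumed unit)
    making_progress: bool = True
    customer_impact_growing: bool = False


def ic_decision_tree(situation):
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        return "Rollback to last known good"
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"
    return "Continue investigation, update in 5 min"


# Walk through each branch with a made-up incident state.
examples = [
    Situation(root_cause_known=True, fix_available=True),
    Situation(root_cause_known=True),
    Situation(duration=20, making_progress=False),
    Situation(customer_impact_growing=True),
    Situation(duration=10),
]
for s in examples:
    print(ic_decision_tree(s))
```

Note the ordering: a stalled investigation past 15 minutes escalates for expertise before growing customer impact is considered, so the branch order is itself a policy decision worth reviewing.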
Communication Templates
Pre-written templates save precious minutes:
templates:
  internal_update:
    format: |
      **Incident Update [{severity}] {time} UTC**
      Status: {investigating|identified|monitoring|resolved}
      Impact: {impact_description}
      Current action: {what_we_are_doing}
      Next update: {time_of_next_update}
  status_page_update:
    format: |
      We are {status} an issue affecting {service}.
      Some users may experience {symptom}.
      Our team is actively working on a resolution.
      Next update in {minutes} minutes.
  executive_escalation:
    format: |
      P1 Incident: {title}
      Duration: {duration} minutes
      Customer impact: {impact}
      Revenue impact: ~${revenue}/hour
      Current status: {status}
      ETA to resolution: {eta}
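Because the templates use `{name}` placeholders, they map naturally onto Python's `str.format`. The sketch below renders the internal update under two assumptions that are not in the original: the status line takes a single `{status}` field whose value must be one of the four listed states, and the helper name `render_internal_update` is hypothetical.

```python
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)

# The four states named in the template's status line.
ALLOWED_STATUSES = ("investigating", "identified", "monitoring", "resolved")


def render_internal_update(**fields: str) -> str:
    """Fill the internal update template, rejecting unknown statuses."""
    if fields.get("status") not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {ALLOWED_STATUSES}")
    # str.format raises KeyError if any placeholder is left unfilled.
    return INTERNAL_UPDATE.format(**fields)


print(render_internal_update(
    severity="P1",
    time="14:20",
    status="investigating",
    impact_description="~30% of checkout attempts failing",
    what_we_are_doing="Checking last deploy and error logs",
    time_of_next_update="14:30 UTC",
))
```

Failing loudly on a missing field or an unknown status is a deliberate choice here: a half-filled update is worse than a delayed one.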
Training New ICs
We use game days to train ICs:
Week 1: Shadow an experienced IC during a game day
Week 2: IC a simulated P2 incident (game day)
Week 3: IC a simulated P1 incident (game day)
Week 4: IC a real P3/P4 incident with a mentor observing
Week 5+: IC rotation for all severities
The IC Rotation
ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on-call compensation
    - IC counts as on-call time
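A weekly rotation over a fixed pool reduces to modular arithmetic on the number of weeks elapsed. This is a minimal sketch assuming a simple round-robin order. The function name `ic_for_week`, the start date, and the last three names in the pool are hypothetical.

```python
from datetime import date

MIN_POOL_SIZE = 6  # matches pool_size in the rotation config


def ic_for_week(pool: list, rotation_start: date, today: date) -> str:
    """Return who is IC during the week containing `today` (round-robin)."""
    if len(pool) < MIN_POOL_SIZE:
        raise ValueError(f"pool needs at least {MIN_POOL_SIZE} trained ICs")
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    # Whole weeks since the rotation began, wrapped around the pool.
    weeks_elapsed = (today - rotation_start).days // 7
    return pool[weeks_elapsed % len(pool)]


pool = ["alice", "bob", "charlie", "dana", "erin", "frank"]
start = date(2024, 1, 1)
print(ic_for_week(pool, start, date(2024, 1, 10)))
```

With six people on a weekly schedule, each IC is on duty one week in six, which is what the minimum pool size in the config implies.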
Before and After
| Metric | Without IC | With IC |
|---|---|---|
| MTTR (P1) | 67 min | 28 min |
| Communication gaps | Frequent | Rare |
| Duplicate work | ~40% | ~5% |
| Stakeholder satisfaction | Low | High |
| Post-mortem quality | Incomplete | Thorough |
The IC doesn't make incidents shorter because they're smarter. They make incidents shorter because someone is actually managing the response.
If you want AI-assisted incident coordination that makes every engineer an effective IC, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com