Silence Destroys Trust
During our worst outage, we went 35 minutes without updating the status page. Twitter filled the void. Theories ranged from a data breach to bankruptcy. A customer posted "They're probably shutdown," which got 200 retweets.
The incident was a 20-minute database migration gone wrong. The reputation damage lasted months.
The Update Cadence
```yaml
status_page_update_cadence:
  P1_customer_facing:
    first_update: within 5 minutes of detection
    subsequent_updates: every 15 minutes
    post_resolution: within 30 minutes
  P2_degraded_service:
    first_update: within 15 minutes
    subsequent_updates: every 30 minutes
    post_resolution: within 2 hours
  P3_minor_impact:
    first_update: within 30 minutes
    subsequent_updates: every hour
    post_resolution: next business day
```
The rule: if there's nothing new to say, say "We're still investigating. Next update in 15 minutes." An update with no news is still better than silence.
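The cadence table above is easy to encode. Here is a minimal sketch that computes when the next status-page update is due for a given severity; the `CADENCE` dict and function name are our own illustration, not a real API:

```python
from datetime import datetime, timedelta

# Minutes derived from the cadence table above. P3's post-resolution update
# is "next business day", so we leave it as None here.
CADENCE = {
    "P1": {"first": 5, "interval": 15, "post_resolution": 30},
    "P2": {"first": 15, "interval": 30, "post_resolution": 120},
    "P3": {"first": 30, "interval": 60, "post_resolution": None},
}

def next_update_due(severity: str, detected_at: datetime, updates_sent: int) -> datetime:
    """Return the deadline for the next status-page update."""
    c = CADENCE[severity]
    if updates_sent == 0:
        # First update is measured from detection.
        return detected_at + timedelta(minutes=c["first"])
    # Subsequent updates follow at a fixed interval after the first.
    return detected_at + timedelta(minutes=c["first"] + updates_sent * c["interval"])
```

Wiring this into the paging system means nobody has to remember the schedule while firefighting; the bot reminds the incident commander when an update is overdue.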
Writing Good Status Updates
What NOT to Write
Bad examples:
"We're experiencing issues." (vague)
"Our database is overloaded." (too technical)
"A junior engineer deployed bad code." (blaming)
"Everything should be fine soon." (no commitment)
"We're working on it." (says nothing)
What TO Write
Good example:

> "Some users are experiencing slow page loads when accessing their dashboard. Our team has identified the cause and is deploying a fix. We expect full resolution within 20 minutes. API access and core functionality are not affected."
Structure:
1. What users see (symptom)
2. What we're doing (action)
3. When it'll be fixed (ETA)
4. What's NOT affected (scope)
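The four-part structure can be enforced in code so a half-written update never reaches the status page. A minimal sketch (the function and part names are our own illustration):

```python
# The four required parts, in the order they should appear:
# symptom -> action -> ETA -> unaffected scope.
REQUIRED_PARTS = ("symptom", "action", "eta", "scope")

def compose_update(**parts: str) -> str:
    """Assemble a status update, refusing to publish if any part is missing."""
    missing = [p for p in REQUIRED_PARTS if not parts.get(p)]
    if missing:
        raise ValueError(f"Status update missing: {', '.join(missing)}")
    return " ".join(parts[p] for p in REQUIRED_PARTS)
```

Example usage: `compose_update(symptom="Some users are experiencing slow page loads.", action="Our team is deploying a fix.", eta="We expect full resolution within 20 minutes.", scope="API access is not affected.")`.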
The Template System
We pre-built templates for common scenarios:
```python
templates = {
    'investigating': {
        'title': '{service} Investigating {symptom}',
        'body': (
            'We are investigating reports of {symptom} '
            'affecting {service}. Some users may experience '
            '{user_impact}. We are actively working to identify '
            'the cause and will provide an update within '
            '{next_update_minutes} minutes.'
        )
    },
    'identified': {
        'title': '{service} Issue Identified',
        'body': (
            'We have identified the cause of {symptom} '
            'affecting {service}. {root_cause_plain}. '
            'Our team is implementing a fix. We expect '
            'resolution by {eta}. {scope_statement}'
        )
    },
    'monitoring': {
        'title': '{service} Fix Deployed, Monitoring',
        'body': (
            'A fix for {symptom} has been deployed. '
            'We are monitoring to confirm full resolution. '
            '{service} should be functioning normally. '
            'If you continue to experience issues, please '
            'contact support. We will provide a final update '
            'within {next_update_minutes} minutes.'
        )
    },
    'resolved': {
        'title': '{service} Resolved',
        'body': (
            'The issue affecting {service} has been fully resolved. '
            'Total impact duration: {duration}. {affected_count} '
            'were affected. We will publish a detailed incident '
            'report within {report_timeline}. We apologize for '
            'any inconvenience.'
        )
    }
}
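Filling a template is plain `str.format`. A self-contained example using the 'investigating' body (the field values are illustrative):

```python
# The 'investigating' body from the templates dict above.
investigating_body = (
    'We are investigating reports of {symptom} '
    'affecting {service}. Some users may experience '
    '{user_impact}. We are actively working to identify '
    'the cause and will provide an update within '
    '{next_update_minutes} minutes.'
)

# Fill in the blanks at incident time; these values are illustrative.
update = investigating_body.format(
    symptom="slow page loads",
    service="Dashboard",
    user_impact="delays of up to 30 seconds",
    next_update_minutes=15,
)
print(update)
```

Because the hard writing is done in advance, the on-call engineer only supplies four facts instead of drafting prose under pressure.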
Internal vs External Communication
| Internal (Slack/war room) | External (status page/Twitter) |
| --- | --- |
| Technical details | User-facing impact |
| "DB connection pool exhausted" | "Some users may see errors" |
| Specific error messages | Expected resolution time |
| Code-level discussion | What's NOT affected |
| Blame-free technical discussion | Empathetic, professional tone |
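One way to keep the two audiences from getting crossed is to make every incident update carry both messages and let a router decide where each goes. A hypothetical sketch; the channel names and types are our own:

```python
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    internal: str  # technical detail for the war room
    external: str  # user-facing impact for the status page

def route(update: IncidentUpdate) -> dict:
    """Map each half of the update to its audience's channel."""
    return {
        "#incident-war-room": update.internal,  # hypothetical Slack channel
        "status_page": update.external,
    }
```

Forcing both fields to exist at authoring time also prevents the common failure where the war room gets updates and the status page gets silence.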
The Post-Incident Report
Publish a public post-incident report within 48 hours for P1/P2:
# Incident Report: Dashboard Loading Issues (March 15, 2024)
## Summary
On March 15 from 2:15 PM to 2:42 PM UTC, approximately 30% of
users experienced slow or failed dashboard loading. The issue
was caused by a database configuration change that reduced
connection pool capacity.
## Timeline
- 2:15 PM UTC: Monitoring detected elevated error rates
- 2:18 PM: Engineering team began investigation
- 2:25 PM: Root cause identified (connection pool misconfiguration)
- 2:32 PM: Fix deployed
- 2:42 PM: Full resolution confirmed
## Impact
- Duration: 27 minutes
- Users affected: ~30% (dashboard only)
- Data loss: None
- Other services: Not affected
## Root Cause
A planned configuration update inadvertently reduced the database
connection pool from 100 to 10 connections.
## What We're Doing to Prevent This
1. Adding automated validation for configuration changes
2. Implementing canary deployment for config updates
3. Adding connection pool monitoring with alerting threshold
We sincerely apologize for the inconvenience and are committed
to maintaining the reliability you expect from us.
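The first prevention item, automated validation for configuration changes, can start very small. A minimal sketch that would have caught this exact incident; the config key and threshold are illustrative assumptions:

```python
# Safe floor for the pool; 10 connections is what caused the outage above.
MIN_POOL_SIZE = 50  # illustrative threshold

def validate_config(new_config: dict) -> list:
    """Return a list of validation errors; empty means the change may proceed."""
    errors = []
    pool = new_config.get("db_connection_pool_size")
    if pool is None:
        errors.append("db_connection_pool_size missing")
    elif pool < MIN_POOL_SIZE:
        errors.append(
            f"db_connection_pool_size={pool} below safe minimum {MIN_POOL_SIZE}"
        )
    return errors
```

Run as a CI gate on config changes, a check like this turns a 27-minute outage into a failed pull request.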
Transparency builds trust. Companies that handle incidents well publish honest post-mortems.
If you want automated incident communication with AI-generated status updates, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com