Silence Destroys Trust
During our worst outage, we went 35 minutes without updating the status page. Twitter filled the void. Theories ranged from a data breach to bankruptcy. A customer posted "They're probably shutdown," which got 200 retweets.
The incident was a 20-minute database migration gone wrong. The reputation damage lasted months.
The Update Cadence
```yaml
status_page_update_cadence:
  P1_customer_facing:
    first_update: within 5 minutes of detection
    subsequent_updates: every 15 minutes
    post_resolution: within 30 minutes
  P2_degraded_service:
    first_update: within 15 minutes
    subsequent_updates: every 30 minutes
    post_resolution: within 2 hours
  P3_minor_impact:
    first_update: within 30 minutes
    subsequent_updates: every hour
    post_resolution: next business day
```
The rule: if there's nothing new to say, say "We're still investigating. Next update in 15 minutes." An update with no news is still better than silence.
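The cadence table above is easy to encode. Here is a minimal sketch that computes when the next status-page update is due for a given severity; the `CADENCE` dict and function name are our own illustration, not a real API:

```python
from datetime import datetime, timedelta

# Minutes derived from the cadence table above. P3's post-resolution update
# is "next business day", so we leave it as None here.
CADENCE = {
    "P1": {"first": 5, "interval": 15, "post_resolution": 30},
    "P2": {"first": 15, "interval": 30, "post_resolution": 120},
    "P3": {"first": 30, "interval": 60, "post_resolution": None},
}

def next_update_due(severity: str, detected_at: datetime, updates_sent: int) -> datetime:
    """Return the deadline for the next status-page update."""
    c = CADENCE[severity]
    if updates_sent == 0:
        # First update is measured from detection.
        return detected_at + timedelta(minutes=c["first"])
    # Subsequent updates follow at a fixed interval after the first.
    return detected_at + timedelta(minutes=c["first"] + updates_sent * c["interval"])
```

Wiring this into the paging system means nobody has to remember the schedule while firefighting; the bot reminds the incident commander when an update is overdue.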
Writing Good Status Updates
What NOT to Write
Bad examples:
"We're experiencing issues." (vague)
"Our database is overloaded." (too technical)
"A junior engineer deployed bad code." (blaming)
"Everything should be fine soon." (no commitment)
"We're working on it." (says nothing)
What TO Write
Good example:

> "Some users are experiencing slow page loads when accessing their dashboard. Our team has identified the cause and is deploying a fix. We expect full resolution within 20 minutes. API access and core functionality are not affected."
Structure:
1. What users see (symptom)
2. What we're doing (action)
3. When it'll be fixed (ETA)
4. What's NOT affected (scope)
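The four-part structure can be enforced in code so a half-written update never reaches the status page. A minimal sketch (the function and part names are our own illustration):

```python
# The four required parts, in the order they should appear:
# symptom -> action -> ETA -> unaffected scope.
REQUIRED_PARTS = ("symptom", "action", "eta", "scope")

def compose_update(**parts: str) -> str:
    """Assemble a status update, refusing to publish if any part is missing."""
    missing = [p for p in REQUIRED_PARTS if not parts.get(p)]
    if missing:
        raise ValueError(f"Status update missing: {', '.join(missing)}")
    return " ".join(parts[p] for p in REQUIRED_PARTS)
```

Example usage: `compose_update(symptom="Some users are experiencing slow page loads.", action="Our team is deploying a fix.", eta="We expect full resolution within 20 minutes.", scope="API access is not affected.")`.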
The Template System
We pre-built templates for common scenarios:
```python
templates = {
    'investigating': {
        'title': '{service} Investigating {symptom}',
        'body': (
            'We are investigating reports of {symptom} '
            'affecting {service}. Some users may experience '
            '{user_impact}. We are actively working to identify '
            'the cause and will provide an update within '
            '{next_update_minutes} minutes.'
        )
    },
    'identified': {
        'title': '{service} Issue Identified',
        'body': (
            'We have identified the cause of {symptom} '
            'affecting {service}. {root_cause_plain}. '
            'Our team is implementing a fix. We expect '
            'resolution by {eta}. {scope_statement}'
        )
    },
    'monitoring': {
        'title': '{service} Fix Deployed, Monitoring',
        'body': (
            'A fix for {symptom} has been deployed. '
            'We are monitoring to confirm full resolution. '
            '{service} should be functioning normally. '
            'If you continue to experience issues, please '
            'contact support. We will provide a final update '
            'within {next_update_minutes} minutes.'
        )
    },
    'resolved': {
        'title': '{service} Resolved',
        'body': (
            'The issue affecting {service} has been fully resolved. '
            'Total impact duration: {duration}. {affected_count} '
            'were affected. We will publish a detailed incident '
            'report within {report_timeline}. We apologize for '
            'any inconvenience.'
        )
    }
}
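Filling a template is plain `str.format`. A self-contained example using the 'investigating' body (the field values are illustrative):

```python
# The 'investigating' body from the templates dict above.
investigating_body = (
    'We are investigating reports of {symptom} '
    'affecting {service}. Some users may experience '
    '{user_impact}. We are actively working to identify '
    'the cause and will provide an update within '
    '{next_update_minutes} minutes.'
)

# Fill in the blanks at incident time; these values are illustrative.
update = investigating_body.format(
    symptom="slow page loads",
    service="Dashboard",
    user_impact="delays of up to 30 seconds",
    next_update_minutes=15,
)
print(update)
```

Because the hard writing is done in advance, the on-call engineer only supplies four facts instead of drafting prose under pressure.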
Internal vs External Communication
| Internal (Slack/war room) | External (status page/Twitter) |
| --- | --- |
| Technical details | User-facing impact |
| "DB connection pool exhausted" | "Some users may see errors" |
| Specific error messages | Expected resolution time |
| Code-level discussion | What's NOT affected |
| Blame-free technical discussion | Empathetic, professional tone |
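One way to keep the two audiences from getting crossed is to make every incident update carry both messages and let a router decide where each goes. A hypothetical sketch; the channel names and types are our own:

```python
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    internal: str  # technical detail for the war room
    external: str  # user-facing impact for the status page

def route(update: IncidentUpdate) -> dict:
    """Map each half of the update to its audience's channel."""
    return {
        "#incident-war-room": update.internal,  # hypothetical Slack channel
        "status_page": update.external,
    }
```

Forcing both fields to exist at authoring time also prevents the common failure where the war room gets updates and the status page gets silence.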
The Post-Incident Report
Publish a public post-incident report within 48 hours for P1/P2:
# Incident Report: Dashboard Loading Issues (March 15, 2024)
## Summary
On March 15 from 2:15 PM to 2:42 PM UTC, approximately 30% of
users experienced slow or failed dashboard loading. The issue
was caused by a database configuration change that reduced
connection pool capacity.
## Timeline
- 2:15 PM UTC: Monitoring detected elevated error rates
- 2:18 PM: Engineering team began investigation
- 2:25 PM: Root cause identified (connection pool misconfiguration)
- 2:32 PM: Fix deployed
- 2:42 PM: Full resolution confirmed
## Impact
- Duration: 27 minutes
- Users affected: ~30% (dashboard only)
- Data loss: None
- Other services: Not affected
## Root Cause
A planned configuration update inadvertently reduced the database
connection pool from 100 to 10 connections.
## What We're Doing to Prevent This
1. Adding automated validation for configuration changes
2. Implementing canary deployment for config updates
3. Adding connection pool monitoring with alerting threshold
We sincerely apologize for the inconvenience and are committed
to maintaining the reliability you expect from us.
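The first prevention item, automated validation for configuration changes, can start very small. A minimal sketch that would have caught this exact incident; the config key and threshold are illustrative assumptions:

```python
# Safe floor for the pool; 10 connections is what caused the outage above.
MIN_POOL_SIZE = 50  # illustrative threshold

def validate_config(new_config: dict) -> list:
    """Return a list of validation errors; empty means the change may proceed."""
    errors = []
    pool = new_config.get("db_connection_pool_size")
    if pool is None:
        errors.append("db_connection_pool_size missing")
    elif pool < MIN_POOL_SIZE:
        errors.append(
            f"db_connection_pool_size={pool} below safe minimum {MIN_POOL_SIZE}"
        )
    return errors
```

Run as a CI gate on config changes, a check like this turns a 27-minute outage into a failed pull request.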
Transparency builds trust. Companies that handle incidents well publish honest post-mortems.
If you want automated incident communication with AI-generated status updates, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com