Where is your SRE team on the maturity curve? I've worked with teams at every stage. Here's a rough map.
Stage 0: Reactive
The site goes down, someone scrambles to fix it, the cycle repeats. No on-call rotation. No dashboards. Alerts are emails nobody reads.
Characteristic phrase: 'We'll look at that after launch.'
Stage 1: Foundation
On-call rotation exists. Alerts route to a paging tool. Basic dashboards for CPU, memory, error rate. Post-mortems happen sometimes.
Characteristic phrase: 'Did anyone see that spike last night?'
Stage 2: Measured
SLOs defined for critical services. Error budgets tracked. Alert volume is monitored and pruned. Post-mortems are written and reviewed.
Characteristic phrase: 'We're at 80% of our error budget for the quarter.'
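That phrase implies simple arithmetic any Stage 2 team does routinely. A minimal sketch of the calculation, with illustrative numbers (the SLO target, request counts, and function name are my own, not from any particular tool):

```python
# Hedged sketch: error budget consumption for a request-based availability SLO.
# All numbers are illustrative.

def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Return error budget stats for a request-availability SLO."""
    # A 99.9% SLO means 0.1% of requests are allowed to fail.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "failed": failed_requests,
        "budget_consumed": consumed,  # 0.8 reads as "80% of the budget"
    }

report = error_budget_report(slo_target=0.999,
                             total_requests=10_000_000,
                             failed_requests=8_000)
print(f"{report['budget_consumed']:.0%} of error budget consumed")
```

With 10 million requests at a 99.9% target, 10,000 failures exhaust the budget; 8,000 failures puts the team at the 80% mark in the quote above.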
Stage 3: Automated
Runbooks exist for top alerts. Toil is measured and reduced. Deployment pipeline has automatic rollback. Chaos engineering is practiced.
Characteristic phrase: 'The auto-rollback caught it.'
Stage 4: Predictive
Anomaly detection catches issues before alerts fire. Capacity planning is data-driven. New services have SLOs and dashboards at launch, not after. AI/ML assists incident response.
Characteristic phrase: 'We caught that before customers noticed.'
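"Catching it before customers notice" doesn't require heavy ML to start; a rolling z-score over a key metric is the classic first step. A minimal sketch, with window size, threshold, and sample latencies chosen for illustration only:

```python
# Hedged sketch: flag anomalies via a rolling z-score over a metric series.
# Window and threshold are illustrative, not tuned production values.
from collections import deque
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    """Yield (index, value) where a point deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    buf = deque(maxlen=window)
    for i, x in enumerate(series):
        if len(buf) == window:
            mu, sigma = mean(buf), stdev(buf)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                yield i, x
        buf.append(x)

latencies = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 240, 100]
print(list(anomalies(latencies)))  # flags the 240 ms spike
```

Real anomaly detection adds seasonality handling and multi-signal correlation, but this is the shape of the idea: the detector fires on statistical deviation, not on a hand-set static threshold.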
Where most teams are
Most teams I've worked with are at Stage 1 or Stage 2, trying to get to Stage 3. The jump from 2 to 3 is the hardest: it requires sustained investment with no immediate crisis to justify it.
The trap
Don't try to skip stages. Teams that install ML anomaly detection at Stage 0 just have prettier chaos. Get the foundation right first. Then automate. Then predict.
The highest maturity team I've seen was boring. Almost nothing broke. The engineers had time to work on interesting problems. That's the goal.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com