New SRE team at your company? Here's a 90-day plan I've used twice. It works because it balances 'show immediate value' with 'build for the long term.'
Days 1-14: Observe
Resist the urge to change things. Watch the current system, read existing post-mortems, shadow on-call, talk to engineers about their pain.
Output: a list of the top 5 reliability problems ranked by 'engineering time lost per week.'
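One way to make that ranking concrete is a small script. This is only a sketch: the `Problem` fields and the sample numbers are illustrative assumptions, not data from a real team, and "time lost" is estimated here as occurrences × hours per occurrence × engineers pulled in.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One reliability problem noted during the observation phase (hypothetical shape)."""
    name: str
    occurrences_per_week: float
    hours_per_occurrence: float
    engineers_affected: int

    @property
    def hours_lost_per_week(self) -> float:
        # Rough estimate: how often it happens, how long it takes, how many people it pulls in.
        return (self.occurrences_per_week
                * self.hours_per_occurrence
                * self.engineers_affected)


def top_problems(problems, n=5):
    """Return the n problems costing the most engineering time per week, worst first."""
    return sorted(problems, key=lambda p: p.hours_lost_per_week, reverse=True)[:n]


# Made-up observations, for illustration only.
observed = [
    Problem("flaky disk alert", 10, 0.25, 1),
    Problem("manual cert rotation", 1, 3.0, 1),
    Problem("broken latency dashboard", 5, 0.5, 4),
]

for p in top_problems(observed):
    print(f"{p.hours_lost_per_week:5.1f} h/week  {p.name}")
```

With these sample numbers the broken dashboard ranks first, even though the flaky alert fires more often, because it wastes the time of several engineers at once.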
Days 15-30: Quick wins
Pick the top 2 from your list. Fix them. Make the fixes visible: announce them at the eng all-hands.
Good quick wins: delete a flaky alert, automate a repetitive runbook, fix a broken dashboard everyone complains about.
Bad quick wins: rewriting the deployment pipeline. It's too big and would take the full 90 days on its own.
Output: visible reliability improvements + trust from engineering teams.
Days 31-60: Foundations
Now use your trust. Introduce the boring stuff:
- Define SLOs for the top 3 critical services
- Set up an error budget dashboard
- Establish a weekly reliability review (10 minutes, not an hour)
- Write an incident response runbook template
Output: measurable reliability targets that engineering can rally around.
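For the SLO and error budget items, the arithmetic behind a dashboard can be sketched in a few lines. This assumes a simple request-based availability SLO. The function name, the 99.9% target, and the request counts are hypothetical; a real dashboard would pull the counts from your metrics system.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute error budget figures for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    # The budget is the number of failures the SLO tolerates over the window.
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    consumed = failed_requests / allowed_failures if allowed_failures else 0.0

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
    }


# Illustrative numbers: 1,000,000 requests in the window, 250 of them failed.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Budget consumed: {budget['budget_consumed']:.0%}")
```

In this example a 99.9% target allows roughly 1,000 failed requests, so 250 failures spend about a quarter of the budget. A number like that is a good fit for the 10-minute weekly review.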
Days 61-90: Programmatic change
Start turning the reliability work into ongoing programs:
- Post-mortem process with action-item tracking
- Monthly toil survey (what did engineers do this month that could've been automated?)
- Quarterly reliability review with leadership
- A clear hand-off process: when does reliability work become product engineering work?
Output: processes that continue working when you take a vacation.
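Action-item tracking does not need special tooling to start. As a minimal sketch, assuming action items live in a list or spreadsheet export (the `ActionItem` fields and sample entries below are hypothetical), a script can surface open items that are past their due date:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A follow-up task from a post-mortem (hypothetical shape)."""
    postmortem: str
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose due date has passed, oldest first."""
    return sorted(
        (item for item in items if not item.done and item.due < today),
        key=lambda item: item.due,
    )


# Made-up entries, for illustration only.
items = [
    ActionItem("checkout outage", "add retry to payment client", "alice", date(2024, 3, 1)),
    ActionItem("checkout outage", "alert on queue depth", "bob", date(2024, 3, 15), done=True),
    ActionItem("db failover", "document failover steps", "carol", date(2024, 2, 20)),
    ActionItem("db failover", "test replica promotion", "dave", date(2024, 4, 30)),
]

for item in overdue(items, today=date(2024, 3, 20)):
    print(f"{item.due}  {item.owner:6}  {item.description}  ({item.postmortem})")
```

Passing `today` in explicitly keeps the check easy to test and to run against any date. The output of a script like this can feed the weekly reliability review.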
The trap
'This is fine, we can do all this in the first week.' You cannot. Every team I've seen try it either burned out or ended up resented by the engineers it was meant to help. 90 days is the minimum. More is normal.
The real goal
At day 91, the engineering team should be able to describe your team's value in one sentence. If they can't, you spent 90 days on the wrong things.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com