The Google Rule That Breaks at Startups
Google's SRE book is explicit: toil should consume no more than 50% of an SRE's time. The other 50% must go to engineering work that reduces toil.
At a 10-person startup, your "SRE team" is one overworked engineer. They're already at 95% toil. There is no slack to reduce it.
So you have to be ruthless about what work is worth automating and what work is worth eliminating entirely.
Defining Toil Precisely
Google's definition:
- Manual
- Repetitive
- Automatable
- Tactical (not strategic)
- Lacks enduring value
- Scales linearly with service growth
If a task checks all six boxes, it's toil. If it checks some but not all, it might be legitimate engineering work.
Example: responding to an alert is tactical and lacks enduring value, but if it's not repetitive, it's not toil.
Example: writing a new runbook is manual, but it's strategic and has enduring value, so it's not toil.
The Startup-Sized Toil Audit
Track for 2 weeks. Every 30 minutes, write down:
- What am I doing right now?
- Is it toil (manual, repetitive, automatable)?
- How long have I been doing it this week?
At the end of 2 weeks, you'll have a toil ranking. Pick the top 3.
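A plain text file is enough. As a sketch (the format and the entries here are hypothetical), each 30-minute check-in might look like:

- time: 2024-03-04 10:30
  task: restarted stuck billing worker
  toil: true          # manual, repetitive, automatable
  week_total: 45 min  # running total for this task
- time: 2024-03-04 11:00
  task: drafting postmortem for Friday's outage
  toil: false         # manual, but strategic with enduring value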
Typical top offenders:
- Manually running deploys (30+ min/week)
- Responding to known false-positive alerts (3+ hours/week)
- Provisioning new dev environments (1+ hour per request)
- Checking on flaky CI runs (2+ hours/week)
- Writing the same runbook context in every incident (1+ hour/incident)
Rule 1: Eliminate Before Automating
The best toil is toil you don't do at all.
Before automating the deploy process, ask: why do we deploy manually?
- If the answer is "we don't trust our tests" → fix the tests
- If the answer is "we need human approval" → build a self-serve approval flow
- If the answer is "production is scary" → build better rollback, then trust the automation
Before automating alert response, ask: why is the alert firing?
- If it's a false positive → fix the alert
- If it's a symptom of something deeper → fix the root cause
- If it's expected behavior → delete the alert
Automating the response to a false positive is worse than handling it manually: it hides the underlying problem.
Rule 2: Automate the Second-Most Common Task
Counterintuitive, but it works:
The most common manual task is usually the one you've already optimized manually. You've gotten fast at it.
The second-most common task is where you're slow, it's still frequent, and automation has the highest ROI.
Example: you spend 3 hours/week on deploys (already optimized with scripts). You spend 2 hours/week manually provisioning dev environments (still done via the UI). Automate the environments first.
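Laid out side by side (numbers from the example above; the format is just illustrative):

automation_candidates:
  deploys:
    time_spent: 180 min/week        # already optimized with scripts
    realistic_savings: 30 min/week
  dev_environments:
    time_spent: 120 min/week        # still fully manual, via the UI
    realistic_savings: 100 min/week # automate this first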
Rule 3: Measure Before and After
toil_metrics:
  deploy_manual_time: 35 min/week
  deploy_automated_time: 5 min/week              # saves 30 min/week
  alert_response_time: 180 min/week
  alert_response_time_after_tuning: 45 min/week  # saves 135 min/week
If you can't measure the savings, you don't know if the automation worked.
Rule: if automating a toil task would take 10x longer than the time that task costs you per year, don't automate. Eliminate.
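A worked example with hypothetical numbers: 30 min/week of toil costs roughly 26 hours/year, so the 10x threshold is 260 hours of automation effort.

break_even_check:
  yearly_toil_cost: 30 min/week x 52 = ~26 hours/year
  threshold: 10 x 26 hours = 260 hours
  automation_estimate: 16 hours   # well under the threshold -> automate
  # if the estimate were 300+ hours, eliminate the task instead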
Rule 4: Self-Service Is the Force Multiplier
Toil grows linearly with team size: every new engineer generates more of the same requests. Self-service breaks the linear relationship.
Examples:
- Instead of SRE provisioning dev environments → Terraform module + docs + CI approval
- Instead of SRE running database queries → read-only proxy with query approval flow
- Instead of SRE creating alerts → YAML templates engineers can copy (sketched below)
Rule of thumb: if three different engineers have asked you to do the same thing, build it as self-service.
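For the alert-template example, here's a minimal sketch in Prometheus's alerting-rule format, assuming you run Prometheus; the service name, metric, threshold, and runbook link are placeholders for engineers to fill in:

groups:
  - name: <service>-alerts
    rules:
      - alert: <Service>HighErrorRate
        expr: rate(http_errors_total{service="<service>"}[5m]) > 0.05  # placeholder metric
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "<service> error rate above 5% for 10 minutes"
          runbook: <link-to-your-runbook>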
Rule 5: Runbooks Are Temporary Debt
A runbook says "here's the manual procedure." A good runbook is instructions for a bot, not a human.
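What that looks like in practice: write each runbook as discrete check/act/verify steps, so converting it to automation later is mechanical. A hypothetical sketch (the thresholds and the kubectl command are placeholders):

runbook: restart_stuck_worker
steps:
  - check: queue_depth > 1000 for 10 minutes
  - act: kubectl rollout restart deployment/worker
  - verify: queue_depth drops within 15 minutes
  - escalate: page the on-call if it does not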
Every runbook should have an expiration date:
runbook: restart_stuck_worker
created: 2024-01-15
expires: 2024-04-15
automation_ticket: "#4872"
If a runbook is still manual after 3 months, either:
- It's rare enough to not need automation
- We've failed to allocate time for automation
- The underlying issue should be fixed instead
Either way, reconsider.
The 80/20 Rule for Startup SRE
At Google scale, you can justify building a platform team. At startup scale, you can't. So you apply Pareto ruthlessly.
What to automate (20% effort, 80% value):
- Deploys (frees 30+ min/week, prevents errors)
- Dev environment provisioning (frees 2+ hrs/week)
- Known-cause alert response (frees 3+ hrs/week)
- Secret rotation
- Backup verification (see the sketch at the end of this section)
What not to automate (80% effort, 20% value):
- Provisioning one-off infrastructure
- Debugging novel issues
- Writing custom dashboards
- Responding to security incidents (needs judgment)
Save your engineering cycles for the high-leverage automation.
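Backup verification is the clearest case of 20%-effort automation: a scheduled job that restores the latest backup and runs a smoke query against it. A hypothetical sketch in GitHub Actions syntax (both scripts are placeholders for your own restore and check procedures):

name: verify-backups
on:
  schedule:
    - cron: "0 6 * * *"   # daily, after the nightly backup finishes
jobs:
  restore-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/restore_latest_backup.sh --target staging-db  # placeholder
      - run: ./scripts/smoke_query.sh staging-db                     # placeholder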
The Monthly Toil Review
Every month, ask:
- What toil do I do now that I didn't do last month?
- What toil did I eliminate in the last month?
- Is my total toil going up or down?
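Answering these in a dated record keeps the trend honest. A hypothetical entry:

toil_review: 2024-03
new_toil:
  - triaging dependency-update PRs (~1 hr/week)
eliminated:
  - manual deploys (now automated, saves 30 min/week)
total: ~4 hrs/week   # was ~5 last month, trending down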
If toil is going up and there's no plan to reduce it, that's your biggest reliability problem. Not the outages. Not the alerts. The toil.
Because toil crowds out the time you need to fix the underlying systems.
The Hardest Part
The hardest part of toil reduction isn't technical. It's psychological.
Toil feels productive. You finish tasks. You feel needed. You're the hero who fixed the broken thing.
Engineering work to eliminate toil feels slow. You build for weeks before seeing results. Nobody pages you for doing it.
Resist the dopamine of toil. The goal is to make yourself less needed, not more.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com