The Google Rule That Breaks at Startups
Google's SRE book is explicit: toil should consume no more than 50% of an SRE's time. The other 50% must go to engineering work that reduces toil.
At a 10-person startup, your "SRE team" is one overworked engineer. They're already at 95% toil. There is no slack to reduce it.
So you have to be ruthless about what work is worth automating and what work is worth eliminating entirely.
Defining Toil Precisely
Google's definition:
- Manual
- Repetitive
- Automatable
- Tactical (not strategic)
- Lacks enduring value
- Scales linearly with service growth
If a task checks all six boxes, it's toil. If it checks some but not all, it might be legitimate engineering work.
Example: responding to an alert is tactical and lacks enduring value, but if it's not repetitive, it's not toil.
Example: writing a new runbook is manual, but it's strategic and has enduring value, so it's not toil.
The Startup-Sized Toil Audit
Track for 2 weeks. Every 30 minutes, write down:
- What am I doing right now?
- Is it toil (manual, repetitive, automatable)?
- How long have I been doing it this week?
At the end of 2 weeks, you'll have a toil ranking. Pick the top 3.
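A plain text file is enough. As a sketch (the format and the entries here are hypothetical), each 30-minute check-in might look like:

- time: 2024-03-04 10:30
  task: restarted stuck billing worker
  toil: true          # manual, repetitive, automatable
  week_total: 45 min  # running total for this task
- time: 2024-03-04 11:00
  task: drafting postmortem for Friday's outage
  toil: false         # manual, but strategic with enduring value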
Typical top offenders:
- Manually running deploys (30+ min/week)
- Responding to known false-positive alerts (3+ hours/week)
- Provisioning new dev environments (1+ hour per request)
- Checking on flaky CI runs (2+ hours/week)
- Writing the same runbook context in every incident (1+ hour/incident)
Rule 1: Eliminate Before Automating
The best toil is toil you don't do at all.
Before automating the deploy process, ask: why do we deploy manually?
- If the answer is "we don't trust our tests" → fix the tests
- If the answer is "we need human approval" → build a self-serve approval flow
- If the answer is "production is scary" → build better rollback, then trust the automation
Before automating alert response, ask: why is the alert firing?
- If it's a false positive → fix the alert
- If it's a symptom of something deeper → fix the root cause
- If it's expected behavior → delete the alert
Automating the response to a false positive is worse than handling it manually: it hides the underlying problem.
Rule 2: Automate the Second-Most Common Task
Counterintuitive, but it works:
The most common manual task is usually the one you've already optimized manually. You've gotten fast at it.
The second-most common task is where you're slow, it's still frequent, and automation has the highest ROI.
Example: you spend 3 hours/week on deploys (already optimized with scripts). You spend 2 hours/week manually provisioning dev environments (still done via the UI). Automate the environments first.
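Laid out side by side (numbers from the example above; the format is just illustrative):

automation_candidates:
  deploys:
    time_spent: 180 min/week        # already optimized with scripts
    realistic_savings: 30 min/week
  dev_environments:
    time_spent: 120 min/week        # still fully manual, via the UI
    realistic_savings: 100 min/week # automate this first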
Rule 3: Measure Before and After
toil_metrics:
  deploy_manual_time: 35 min/week
  deploy_automated_time: 5 min/week              # saves 30 min/week
  alert_response_time: 180 min/week
  alert_response_time_after_tuning: 45 min/week  # saves 135 min/week
If you can't measure the savings, you don't know if the automation worked.
Rule: if automating a toil task would take 10x longer than the time that task costs you per year, don't automate. Eliminate.
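A worked example with hypothetical numbers: 30 min/week of toil costs roughly 26 hours/year, so the 10x threshold is 260 hours of automation effort.

break_even_check:
  yearly_toil_cost: 30 min/week x 52 = ~26 hours/year
  threshold: 10 x 26 hours = 260 hours
  automation_estimate: 16 hours   # well under the threshold -> automate
  # if the estimate were 300+ hours, eliminate the task instead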
Rule 4: Self-Service Is the Force Multiplier
Toil grows linearly with team size: every new engineer generates more of the same requests. Self-service breaks the linear relationship.
Examples:
- Instead of SRE provisioning dev environments → Terraform module + docs + CI approval
- Instead of SRE running database queries → read-only proxy with query approval flow
- Instead of SRE creating alerts → YAML templates engineers can copy (sketched below)
Rule of thumb: if three different engineers have asked you to do the same thing, build it as self-service.
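For the alert-template example, here's a minimal sketch in Prometheus's alerting-rule format, assuming you run Prometheus; the service name, metric, threshold, and runbook link are placeholders for engineers to fill in:

groups:
  - name: <service>-alerts
    rules:
      - alert: <Service>HighErrorRate
        expr: rate(http_errors_total{service="<service>"}[5m]) > 0.05  # placeholder metric
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "<service> error rate above 5% for 10 minutes"
          runbook: <link-to-your-runbook>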
Rule 5: Runbooks Are Temporary Debt
A runbook says "here's the manual procedure." A good runbook is instructions for a bot, not a human.
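What that looks like in practice: write each runbook as discrete check/act/verify steps, so converting it to automation later is mechanical. A hypothetical sketch (the thresholds and the kubectl command are placeholders):

runbook: restart_stuck_worker
steps:
  - check: queue_depth > 1000 for 10 minutes
  - act: kubectl rollout restart deployment/worker
  - verify: queue_depth drops within 15 minutes
  - escalate: page the on-call if it does not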
Every runbook should have an expiration date:
runbook: restart_stuck_worker
created: 2024-01-15
expires: 2024-04-15
automation_ticket: "#4872"
If a runbook is still manual after 3 months, either:
- It's rare enough to not need automation
- We've failed to allocate time for automation
- The underlying issue should be fixed instead
Either way, reconsider.
The 80/20 Rule for Startup SRE
At Google scale, you can justify building a platform team. At startup scale, you can't. So you apply Pareto ruthlessly.
What to automate (20% effort, 80% value):
- Deploys (frees 30+ min/week, prevents errors)
- Dev environment provisioning (frees 2+ hrs/week)
- Known-cause alert response (frees 3+ hrs/week)
- Secret rotation
- Backup verification (see the sketch at the end of this section)
What not to automate (80% effort, 20% value):
- Provisioning one-off infrastructure
- Debugging novel issues
- Writing custom dashboards
- Responding to security incidents (needs judgment)
Save your engineering cycles for the high-leverage automation.
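Backup verification is the clearest case of 20%-effort automation: a scheduled job that restores the latest backup and runs a smoke query against it. A hypothetical sketch in GitHub Actions syntax (both scripts are placeholders for your own restore and check procedures):

name: verify-backups
on:
  schedule:
    - cron: "0 6 * * *"   # daily, after the nightly backup finishes
jobs:
  restore-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/restore_latest_backup.sh --target staging-db  # placeholder
      - run: ./scripts/smoke_query.sh staging-db                     # placeholder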
The Monthly Toil Review
Every month, ask:
- What toil do I do now that I didn't do last month?
- What toil did I eliminate in the last month?
- Is my total toil going up or down?
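Answering these in a dated record keeps the trend honest. A hypothetical entry:

toil_review: 2024-03
new_toil:
  - triaging dependency-update PRs (~1 hr/week)
eliminated:
  - manual deploys (now automated, saves 30 min/week)
total: ~4 hrs/week   # was ~5 last month, trending down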
If toil is going up and there's no plan to reduce it, that's your biggest reliability problem. Not the outages. Not the alerts. The toil.
Because toil crowds out the time you need to fix the underlying systems.
The Hardest Part
The hardest part of toil reduction isn't technical. It's psychological.
Toil feels productive. You finish tasks. You feel needed. You're the hero who fixed the broken thing.
Engineering work to eliminate toil feels slow. You build for weeks before seeing results. Nobody pages you for doing it.
Resist the dopamine of toil. The goal is to make yourself less needed, not more.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com