This is a continuation of my Google SRE Review - Cheat Sheet. You can keep reading here or jump to the cheat sheet first.
Every person will have their own ideas and perceptions. These are my own.
A few chapters and points that stuck in my mind were:
Service Level Objectives (SLOs)
Probably the most important chapter to me. It is about defining measurable metrics that tell you how well the service you provide is doing.
Many teams say:
"We need 100% uptime."
Google instead asks:
"How reliable does the service actually need to be?"
This introduces:
- SLIs (Service Level Indicators)
- SLOs (Service Level Objectives)
- Error Budgets
Example:
Instead of saying
"Never have downtime"
you might define
- Availability SLO = 99.95%
- Latency SLO = 95% of requests under 200 ms
Now engineering decisions become measurable rather than emotional.
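To make the numbers above concrete, here is a minimal sketch of how the two SLOs turn into something you can compute. The function names, the 30-day window, and the sample latencies are my own assumptions for illustration; the book does not prescribe this code.

```python
# Hypothetical helpers; the 30-day window is an assumption, not a rule.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO allows in the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that finished under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic in the window: treat the objective as met
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


budget = error_budget_minutes(0.9995)
sample = [120, 80, 95, 310, 150, 60, 175, 90, 110, 130]  # made-up latencies
sli = latency_sli(sample)

print(f"error budget: {budget:.1f} minutes per 30 days")
print(f"latency SLI: {sli:.0%}, meets 95% SLO: {sli >= 0.95}")
```

With a 99.95% availability SLO the budget works out to roughly 21.6 minutes of downtime per 30 days. In the made-up sample, one request out of ten is slower than 200 ms, so the latency SLI is 90% and the 95% objective is missed.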
Eliminating Toil
Removing repetitive manual work before it consumes the team - sounds familiar?
Google defines toil as repetitive operational work that:
- is manual
- provides no lasting value
- scales linearly with service growth
- can be automated
Examples:
- restarting services
- cleaning queues
- rotating logs manually
- clicking deployment buttons
- manual failovers
If someone performs the same task every week, the system—not the person—is the problem.
This chapter changed how many engineering organizations think about operational work.
Monitoring Distributed Systems
This chapter provides excellent practical advice.
It explains the famous "Golden Signals":
- latency
- traffic
- errors
- saturation
Even today these are the foundation of monitoring systems.
This chapter also explains why dashboards aren't enough—you need actionable alerts.
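As a rough illustration, the four signals can be derived from one window of request records. This is only a sketch: the `Request` class, the worker-pool view of saturation, and the sample data are assumptions of mine, and a real system would pull these values from its metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, busy_workers, total_workers):
    """Summarise one time window into the four golden signals."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer maths for the rank.
    rank = (95 * n + 99) // 100
    p95 = latencies[rank - 1] if n else 0.0
    return {
        "latency_p95_ms": p95,
        "traffic_rps": n / window_s,
        "error_rate": sum(not r.ok for r in requests) / n if n else 0.0,
        # Saturation here means "how full is the worker pool" (an assumption).
        "saturation": busy_workers / total_workers,
    }


# Made-up window: 20 requests in 10 seconds, one of them failed.
window = [Request(latency_ms=100 + 10 * i, ok=(i != 7)) for i in range(20)]
signals = golden_signals(window, window_s=10, busy_workers=6, total_workers=8)
print(signals)
```

Which resource you measure for saturation depends on the service; a worker pool is just one option, and disk, memory, or queue depth are equally valid choices.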
When to have Alerts
One of the most applicable chapters for all engineers. The book argues against alerting on infrastructure metrics like:
- CPU > 80%
- Memory > 90%
Instead, alert when users are actually affected.
For example, alert if:
- checkout success rate drops
- API latency exceeds the SLO
- login failures increase
rather than merely because CPU is high.
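A small sketch of what that rule looks like as code. The function name and the thresholds (99.9% success, 200 ms p95) are hypothetical values I picked for the example; the point is only that CPU is recorded but never triggers a page on its own.

```python
def paging_reasons(success_rate, p95_latency_ms, cpu_utilisation,
                   success_slo=0.999, latency_slo_ms=200.0):
    """Return the user-facing reasons to page; an empty list means no page."""
    reasons = []
    if success_rate < success_slo:
        reasons.append(f"success rate {success_rate:.2%} is below the SLO")
    if p95_latency_ms > latency_slo_ms:
        reasons.append(f"p95 latency {p95_latency_ms:.0f} ms exceeds the SLO")
    # cpu_utilisation is intentionally not a paging condition: high CPU
    # without user impact belongs on a dashboard or in a ticket.
    return reasons


# High CPU, but users are fine: nobody gets paged.
quiet = paging_reasons(success_rate=0.9995, p95_latency_ms=150, cpu_utilisation=0.95)

# Normal CPU, but checkouts are failing: page.
noisy = paging_reasons(success_rate=0.97, p95_latency_ms=150, cpu_utilisation=0.40)

print(quiet)
print(noisy)
```

The first call returns an empty list even at 95% CPU, while the second returns one reason even though CPU looks healthy.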
Responding to Incidents
What should you do when incidents happen? This chapter covers:
- incident commanders
- communication
- postmortems
- blameless culture
The most important thing that struck me is the blameless postmortem philosophy.
This has become an industry standard: fix the issue rather than blame someone.
Capacity Planning
How should we (not only Google) predict the following before any system fails?
- storage growth
- CPU usage
- memory requirements
- traffic growth
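One simple way to make such a prediction is to fit a straight line to recent usage and see when it crosses capacity. The sketch below does that for storage with a plain least-squares fit. The function name and the sample numbers are made up, and real growth is rarely this linear, so treat it as a starting point rather than a forecasting method from the book.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate days from the last sample until usage reaches capacity.

    Fits a least-squares line through one sample per day. Returns None
    when there are too few samples or usage is not growing.
    """
    n = len(daily_usage_gb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in enumerate(daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (n - 1))


usage = [500, 510, 520, 530, 540]  # made-up daily storage usage in GB
remaining = days_until_full(usage, capacity_gb=1000)
print(f"days until the disk is full: {remaining:.0f}")
```

With growth of 10 GB per day and 460 GB of headroom left, the estimate is 46 days. The same idea applies to CPU, memory, or traffic if you swap in the matching measurements and limit.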
Summary
The biggest lesson isn't technical.
It's this:
Reliability is an engineering problem—not an operations problem.
Instead of hiring more operators, Google writes more software, and so should we.
Instead of writing more runbooks:
- automate deployments
- automate failover
- automate recovery
- automate scaling
That's a philosophy you can apply whether you run ten servers or ten thousand.