Rodel Talampas

Posted on Jul 2

Google SRE Review - Points Taken

#devops #google #reviews #sre

This is a continuation of my Google SRE Review - Cheat Sheet. You can keep reading here or jump to the cheat sheet first.

Every person will have their own ideas and perceptions. These are my own.

A few chapters and points that stuck in my mind were:

Service Level Objectives (SLOs)

Probably the most important chapter to me: defining measurable metrics that tell you about the service you are providing.

Many teams say:

"We need 100% uptime."

Google instead asks:

"How reliable does the service actually need to be?"

This introduces:

SLIs (Service Level Indicators)
SLOs (Service Level Objectives)
Error Budgets

Example:

Instead of saying

"Never have downtime"

you might define

Availability SLO = 99.95%
Latency SLO = 95% of requests under 200 ms

Now engineering decisions become measurable rather than emotional.
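To make the two example SLOs above concrete, here is a minimal Python sketch. The function names, the 30-day window, and the sample latencies are my own assumptions for illustration; the book does not prescribe this code. It turns an availability SLO into an error budget (the downtime you are allowed to "spend") and checks a latency SLO against a list of request latencies.

```python
from typing import Sequence


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    The 30-day default window is an assumption; pick whatever window
    your team reports on.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that finished under the threshold (the SLI)."""
    if not latencies_ms:
        return 1.0  # no traffic means nothing violated the threshold
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: Sequence[float],
    threshold_ms: float = 200.0,
    target: float = 0.95,
) -> bool:
    """True when at least `target` of requests were under `threshold_ms`."""
    return latency_sli(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    # A 99.95% availability SLO leaves roughly 21.6 minutes per 30 days.
    print(round(error_budget_minutes(0.9995), 1))
    # Hypothetical sample: 19 fast requests and 1 slow one.
    print(meets_latency_slo([100.0] * 19 + [500.0]))
```

The useful part is the framing: once the budget is a number, "can we ship this risky change?" becomes a question about how much of that number is left.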

Eliminating Toil

Removing repetitive manual work before it consumes the team - sounds familiar?

Google defines toil as repetitive operational work that:

is manual
provides no lasting value
scales linearly with service growth
can be automated

Examples:

restarting services
cleaning queues
rotating logs manually
clicking deployment buttons
manual failovers

If someone performs the same task every week, the system—not the person—is the problem.

This chapter changed how many engineering organizations think about operational work.

Monitoring Distributed Systems

This chapter provides excellent practical advice.

It explains the famous "Golden Signals":

latency
traffic
errors
saturation

Even today these are the foundation of monitoring systems.
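As a rough illustration, the four signals can be derived from a window of request records. This is only a sketch under my own assumptions: the `Request` record, the `golden_signals` function, treating HTTP 5xx as errors, and using a CPU utilization fraction as the saturation measure are all hypothetical choices, not something the book specifies.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    cpu_utilization: float,
) -> Dict[str, float]:
    """Summarize one time window into the four golden signals."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)

    # Latency: nearest-rank 95th percentile, using integer math for the rank.
    if count:
        rank = (95 * count + 99) // 100  # ceil(0.95 * count)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0

    # Errors: assume server-side failures (5xx) are what counts.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_seconds,
        "error_rate": errors / count if count else 0.0,
        # Saturation: passed in as-is; which resource to watch is service-specific.
        "saturation": cpu_utilization,
    }


if __name__ == "__main__":
    sample = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 19)]
    sample += [Request(duration_ms=190.0, status=500), Request(duration_ms=200.0, status=503)]
    print(golden_signals(sample, window_seconds=10.0, cpu_utilization=0.6))
```

In practice a metrics system computes these for you; the point of the sketch is that all four come from a small amount of data per request plus one resource measurement.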

This chapter also explains why dashboards aren't enough—you need actionable alerts.

When to have Alerts

One of the most applicable chapters for all engineers. The book argues against alerting on infrastructure metrics like:

CPU > 80%
Memory > 90%

Instead, alert when users are actually affected.

For example:

Alert if:

checkout success rate drops
API latency exceeds the SLO
login failures increase

rather than merely because CPU is high.
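Here is a small sketch of what symptom-based alerting could look like in code. The `WindowStats` fields, the `alerts_for` function, and the thresholds (99% checkout success, 200 ms p95 latency) are assumptions I picked for illustration. Note that CPU is recorded but deliberately never triggers an alert on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    checkout_attempts: int
    checkout_successes: int
    p95_latency_ms: float
    cpu_percent: float  # kept for dashboards and debugging, not for paging


def alerts_for(
    stats: WindowStats,
    min_success_rate: float = 0.99,
    latency_slo_ms: float = 200.0,
) -> List[str]:
    """Return the user-facing reasons to alert for this window (may be empty)."""
    reasons: List[str] = []

    if stats.checkout_attempts > 0:
        success_rate = stats.checkout_successes / stats.checkout_attempts
        if success_rate < min_success_rate:
            reasons.append(f"checkout success rate dropped to {success_rate:.2%}")

    if stats.p95_latency_ms > latency_slo_ms:
        reasons.append(f"p95 latency {stats.p95_latency_ms:.0f} ms exceeds the SLO")

    # cpu_percent is intentionally not checked: high CPU with happy users
    # is a capacity question, not a reason to wake someone up.
    return reasons


if __name__ == "__main__":
    busy_but_healthy = WindowStats(1000, 998, 120.0, cpu_percent=95.0)
    print(alerts_for(busy_but_healthy))  # no user impact, so no alert

    failing = WindowStats(1000, 950, 120.0, cpu_percent=30.0)
    print(alerts_for(failing))
```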

Responding to Incidents

What should you do when incidents happen? This covers:

incident commanders
communication
postmortems
blameless culture

The most important thing that struck me is the blameless postmortem philosophy.
This has become an industry standard: fix the issue rather than blame someone.

Capacity Planning

How should we (not only Google) predict:

storage growth
CPU usage
memory requirements
traffic growth

before any system fails?
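One very simple way to approach the storage-growth question is to fit a straight line to recent usage and see when it crosses capacity. The sketch below is my own illustration, not the book's method: `days_until_full` and the sample numbers are hypothetical, and it assumes growth is roughly linear, which real traffic often is not.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    Fits a least-squares line through one sample per day. Returns None
    when usage is flat or shrinking, since the line never reaches capacity.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth in GB per day

    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope  # day index where the line hits capacity
    return max(0.0, full_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical disk growing 10 GB per day with a 200 GB limit.
    print(days_until_full([100, 110, 120, 130, 140], capacity_gb=200))
```

The same idea applies to CPU, memory, or traffic: replace the daily samples with the metric you care about and the capacity with the limit you cannot exceed.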

Summary

The biggest lesson isn't technical.

It's this:
Reliability is an engineering problem—not an operations problem.

Instead of hiring more operators, Google writes more software, and so should we.

Instead of more runbooks:

automate deployments
automate failover
automate recovery
automate scaling

That's a philosophy you can apply whether you run ten servers or ten thousand.

DEV Community