DEV Community

Rodel Talampas

Google SRE Review - Points Taken

This is a continuation of my Google SRE Review - Cheat Sheet. You can keep reading here or jump to the cheat sheet first.

Every person will have their own ideas and perceptions. These are my own.

A few chapters and points that stuck in my mind were:

Service Level Objectives (SLOs)

Probably the most important chapter to me. It is about defining measurable metrics that tell you how well the service you provide is actually doing.

Many teams say:

"We need 100% uptime."

Google instead asks:

"How reliable does the service actually need to be?"

This introduces:

  • SLIs (Service Level Indicators)
  • SLOs (Service Level Objectives)
  • Error Budgets

Example:

Instead of saying

"Never have downtime"

you might define

  • Availability SLO = 99.95%
  • Latency SLO = 95% of requests under 200 ms

Now engineering decisions become measurable rather than emotional.
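To make the numbers above concrete, an availability SLO can be turned into an error budget, meaning the amount of downtime you are allowed to "spend" in a period. The sketch below is only an illustration: the 30-day window, the function names, and the choice to treat an empty window as compliant are my assumptions, not something the book prescribes.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Return the minutes of unavailability an availability SLO allows in a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def latency_slo_met(latencies_ms, threshold_ms=200.0, target_fraction=0.95):
    """Return True if at least target_fraction of requests finished under threshold_ms."""
    if not latencies_ms:
        return True  # assumption of this sketch: no traffic means nothing was violated
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms) >= target_fraction


if __name__ == "__main__":
    # Budget for the 99.95% availability SLO over a 30-day window.
    print(round(error_budget_minutes(99.95), 1))
    # Only 4 of these 5 requests are under 200 ms, so the 95% target is missed.
    print(latency_slo_met([120, 150, 180, 90, 450]))
```

With a 99.95% SLO the budget works out to roughly 21.6 minutes per 30 days. Once that is spent, the conversation shifts from shipping features to fixing reliability.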

Eliminating Toil

Removing repetitive manual work before it consumes the team - sounds familiar?

Google defines toil as repetitive operational work that:

  • is manual
  • provides no lasting value
  • scales linearly with service growth
  • can (and should) be automated

Examples:

  • restarting services
  • cleaning queues
  • rotating logs manually
  • clicking deployment buttons
  • manual failovers

If someone performs the same task every week, the system—not the person—is the problem.

This chapter changed how many engineering organizations think about operational work.

Monitoring Distributed Systems

This chapter provides excellent practical advice.

It explains the famous "Golden Signals":

  • latency
  • traffic
  • errors
  • saturation

Even today these are the foundation of monitoring systems.

This chapter also explains why dashboards aren't enough—you need actionable alerts.
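As a rough illustration, the four golden signals can be derived from a window of request records. This is a minimal sketch under my own assumptions: the `Request` record, the `golden_signals` function, the 95th percentile as the latency measure, and "in-flight requests divided by capacity" as the saturation measure are all hypothetical choices, not definitions from the book.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # False means the request ended in an error


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise the four golden signals for one time window."""
    total = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1] if total else 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,                           # latency
        "traffic_rps": total / window_seconds,           # traffic
        "error_rate": errors / total if total else 0.0,  # errors
        "saturation": in_flight / capacity,              # saturation
    }
```

In a real system these values would usually come from your metrics stack rather than raw request lists, but the arithmetic behind each signal is the same.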

When to have Alerts

One of the most applicable chapters for all engineers. The book argues against alerting on infrastructure metrics like:

  • CPU > 80%
  • Memory > 90%

Instead, alert when users are actually affected.

For example:

Alert if:

  • checkout success rate drops
  • API latency exceeds the SLO
  • login failures increase

rather than merely because CPU is high.
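Here is a small sketch of that idea: the alert fires only on a user-facing symptom (success rate below the SLO), while CPU is attached as context and never triggers the alert by itself. The function name, the 99.95% default, and the minimum-traffic guard are assumptions I made for the example.

```python
def evaluate_alert(successes, total, cpu_percent,
                   slo_success_rate=0.9995, min_requests=100):
    """Return an alert message when users are affected, otherwise None."""
    if total < min_requests:
        return None  # too little traffic to judge (assumed guard against noise)
    success_rate = successes / total
    if success_rate >= slo_success_rate:
        return None  # users are fine, so high CPU alone does not page anyone
    return (
        f"success rate {success_rate:.2%} is below SLO {slo_success_rate:.2%} "
        f"(cpu at {cpu_percent:.0f}% for context)"
    )


if __name__ == "__main__":
    print(evaluate_alert(1000, 1000, cpu_percent=95.0))  # busy CPU, happy users
    print(evaluate_alert(990, 1000, cpu_percent=30.0))   # idle CPU, failing users
```

The first call stays quiet even at 95% CPU because every request succeeded. The second one alerts at 30% CPU because 1% of requests failed.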

Responding to Incidents

What should you do when incidents happen? This chapter covers:

  • incident commanders
  • communication
  • postmortems
  • blameless culture

What struck me most is the blameless postmortem philosophy, which has become an industry standard: fix the issue rather than blame someone.

Capacity Planning

How should we (not only Google) predict the following before any system fails?

  • storage growth
  • CPU usage
  • memory requirements
  • traffic growth
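A very simple way to approach the storage question is to fit a straight line through recent daily usage and see when it crosses your capacity. The sketch below assumes linear growth and uses a hypothetical `days_until_full` helper. Real capacity planning also has to account for seasonality, launches, and safety margins, so treat this as a starting point only.

```python
def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate days until capacity is reached, from a least-squares linear trend."""
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2  # mean of the day indexes 0..n-1
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in enumerate(daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth in GB per day
    if slope <= 0:
        return None  # usage is flat or shrinking, so no fill date on this trend
    intercept = mean_y - slope * mean_x
    # Day index where the trend line reaches capacity, relative to the latest sample.
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    # Usage grows 10 GB per day from 100 GB; a 200 GB disk is hit at day index 10.
    print(days_until_full([100, 110, 120, 130, 140], capacity_gb=200))
```

The same pattern works for CPU, memory, or traffic as long as you have a measured value per day and a known limit.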

Summary

The biggest lesson isn't technical.

It's this:
Reliability is an engineering problem—not an operations problem.

Instead of hiring more operators, Google writes more software, and so should we.

Instead of more runbooks:

  • automate deployments
  • automate failover
  • automate recovery
  • automate scaling

That's a philosophy you can apply whether you run ten servers or ten thousand.
