DEV Community

Hercules Lemke Merscher

Posted on • Originally published at bitmaybewise.substack.com

SRE book notes: Introduction to Site Reliability Engineering

Encouraged by my manager at GitLab, Rachel Nienaber, I’m taking notes on the book Site Reliability Engineering: How Google Runs Production Systems. I decided to share the quotes I find most interesting here, along with the occasional comment on my own thoughts and perspectives.


This is the first post of a series, so stay tuned. You’re welcome to interact via the comments; I’d love to know your thoughts.

Without further ado, here are the notes from the first chapters:


when systems are “reliable enough,” we instead invest our efforts in adding features or building new products.


even though a small organization has many pressing concerns and the software choices you make may differ from those Google made, it’s still worth putting lightweight reliability support in place early on, because it’s less costly to expand a structure later on than it is to introduce one that is not present.


the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.

In my own experience, companies seldom worry about maintenance, whether that means the quality of their systems or the cost of keeping everything running.

Do they need a cultural shift? Someone to defy the status quo? Better-prepared professionals? More knowledge? Courage? A bit of all of the above?


please bear the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it.


Hope is not a strategy!

I love this one!

As Murphy’s law states: “If anything can go wrong, it will.”


SRE is what happens when you ask a software engineer to design an operations team.

In practice, SREs are engineers too: they do not just maintain the systems and keep them running, they also build them.

More below.


By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.

Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc.

Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.
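To make the 50% cap concrete, here is a minimal sketch, not from the book, of how a team could check its own split between ops and engineering work. The function names and the idea of tracking hours per category are assumptions for illustration only.

```python
OPS_CAP = 0.50  # the 50% cap on aggregate ops work quoted above


def ops_share(ops_hours: float, engineering_hours: float) -> float:
    """Fraction of total tracked time spent on ops work (tickets, on-call, manual tasks)."""
    total = ops_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours recorded")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, engineering_hours: float, cap: float = OPS_CAP) -> bool:
    """True when ops work takes up more than the allowed share of the team's time."""
    return ops_share(ops_hours, engineering_hours) > cap


# Hypothetical week: 30 hours of ops work and 10 hours of engineering work.
print(ops_share(30, 10))        # 0.75
print(exceeds_ops_cap(30, 10))  # True
```

A team that sits above the cap for a while has a signal that operational load is crowding out the engineering work meant to reduce it.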


In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.

Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
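The arithmetic behind that last 0.001% is easy to work out. The sketch below is my own illustration, not from the book; it assumes a 365-day year and converts an availability target into the downtime it allows per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

At 99.999% the budget is roughly 5.26 minutes per year, so going from there to 100% buys back only about five minutes of downtime annually.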


Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
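One way to picture software doing the interpreting is a small routing rule that decides whether a signal deserves a human at all. This is only a sketch under my own assumptions: the `Signal` type, the `route` function, and the thresholds are hypothetical and not taken from the book or from any monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human needs to act right now
    TICKET = "ticket"  # a human needs to act, but not immediately
    LOG = "log"        # recorded for later; nobody is notified


@dataclass
class Signal:
    name: str
    error_rate: float  # fraction of failed requests, from 0.0 to 1.0


def route(signal: Signal, page_above: float = 0.05, ticket_above: float = 0.01) -> Output:
    """Decide what to do with a signal so that no human has to interpret raw numbers."""
    if signal.error_rate > page_above:
        return Output.ALERT
    if signal.error_rate > ticket_above:
        return Output.TICKET
    return Output.LOG


# Hypothetical signals with made-up error rates.
for s in (Signal("checkout", 0.12), Signal("search", 0.02), Signal("profile", 0.001)):
    print(s.name, "->", route(s).value)
```

The thresholds are arbitrary here; the point is that the decision lives in code, and a person is only pulled in when the outcome calls for action.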


The book is full of food for thought. It’s not just about Google. The ideas and practices presented are valuable to all software engineers out there.

I’m enjoying every single chapter. Keep an eye out for new posts; more will come regularly as I progress through the book.


If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.
