SRE book notes: Embracing Risk

#sre #books #risk

These are the notes from Chapter 3: Embracing Risk.

This is a post of a series. If you missed the previous post, it contains notes for the introductory part:

SRE book notes: Introduction to Site Reliability Engineering

Hercules Lemke Merscher ・ Jan 10 ・ 3 min read

#sre #books #notes #reliability

It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer.

From this chapter onwards, you notice how it’s all about trade-offs on a daily basis for SREs.

an incremental improvement in reliability may cost 100x more than the previous increment.

The cost of redundant machine/compute resources, and the opportunity cost.

when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs

A similar note has been mentioned in the previous post, but it’s worth highlighting again, to demonstrate how the opportunity costs are relevant.

It keeps reverberating in my mind!

availability = successful requests / total requests

Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change. Information asymmetry between the two teams further amplifies this inherent tension.

This tension is crystal clear in companies where there are two well-separated teams, one for operations, and one for development, and they do not engage very well.

our goal is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way. The more data-based the decision can be, the better.

When it mentions both sides, it refers to infrastructure vs product teams.

Google SRE’s unofficial motto is "Hope is not a strategy."

Again, it’s a good thing to reverberate impactful sentences like that, to keep us on our toes.

The more data-based the decision can be, the better.

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

Wrapping up, this chapter is not that long but covers a lot of relevant content about the measurement of availability, identifying the risk tolerance for your systems, and the benefits of an error budget.

If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.

You can also follow me on Twitter and Mastodon.

Photo by Loic Leray on Unsplash