Page It to the Limit
SRE Right Answers/Wrong Answers With Dueling Brians
What is SRE?
Brian Weber kicks the conversation off with an overview of Site Reliability Engineering (SRE).
Brian Weber: “I look for things that are outliers and efficiency that are health problems, not so much product health, but developmental health and try to impose those “standards” on my team. I find myself talking to other SREs, both in the company and out of the company to try and get an idea of what those kinds of things look like.”
Brian Rutkin gives us the wrong answer, discussing how SRE is more than just a cool title and how SRE is not DevOps.
Brian Rutkin: “Taking your engineers that are DevOps and suddenly rebranding them as SRE is not necessarily the right thing either. SRE kind of falls in toward the middle of those terms of how you would use them.”
Delving into DevOps and SRE
The Brians talk to us about how DevOps and SRE work together.
Brian Rutkin: “SRE is an implementation of DevOps…. SRE is the understanding that operational work is required, and the goal should be to remove absolutely as much of it as you possibly can by a human.”
Brian Weber counters with the misconception about how much software development someone with an SRE title should be doing. He goes on to talk about applied implementation and researching components of SRE.
Metrics: SLOs and SLAs
We talk about setting and publishing Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and best practices for setting and meeting them.
Brian Rutkin: “This is really going to vary for every organization and every service at least by a little bit. I think that most people would agree that you want to focus on a very few number of SLOs to drive and accomplish your SLAs.”
Rutkin continues to talk to us about setting the metrics and what you want to know from your service: success rate, latency, and accuracy.
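As a rough illustration of two of the metrics Rutkin names, the sketch below checks a batch of requests against a success-rate target and a latency target. It is not taken from the episode: the `Request` type, the function names, and the 99% and 300 ms targets are all hypothetical, and accuracy is left out because it depends on what a correct answer means for a given service.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request succeed?
    latency_ms: float   # how long it took to answer


def success_rate(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency, in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(requests, min_success=0.99, max_p95_ms=300.0):
    """True when both hypothetical objectives hold for this batch."""
    return (
        success_rate(requests) >= min_success
        and latency_percentile(requests, 95) <= max_p95_ms
    )


# 99 fast successes and one slow failure in a window of 100 requests.
window = [Request(True, 100.0)] * 99 + [Request(False, 500.0)]
print(success_rate(window), latency_percentile(window, 95), meets_slo(window))
```

Keeping the check to two objectives mirrors the advice to focus on a small number of SLOs; each extra objective is one more thing that can fail the window.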
Brian Weber discusses the misconceptions about what a customer is, what a customer should be, and what people should be paying attention to with SLAs.
Weber: “Your SLA is the amount of uptime and availability… it has everything to do with what is your end state.”
Tuning Alerts
The conversation turns to creating and tuning alerts and what the role of SRE is within that area. Brian Weber talks about how to make noise levels appropriate for new products and how tuning alerts is an ongoing process.
Weber: “Tuning alerts ends up being an ongoing process, that’s why it’s tuning alerts and not setting alerts.”
Brian Rutkin walks us through antipatterns in alert tuning and how going to extremes is a common mistake. He also talks about how alerts should be used for two purposes: flagging immediate problems and watching trends over time.
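One way to picture the two purposes is a rule that pages a human only for an immediate problem and files a lower-urgency ticket when a metric is drifting. The sketch below is an assumption for illustration, not something described in the episode: the function name `classify_alert`, the labels, and the example thresholds are hypothetical.

```python
def classify_alert(samples, page_threshold, trend_slope):
    """Classify a window of error-rate samples, oldest first.

    Returns "page" for an immediate problem, "ticket" for a worsening
    trend, and "ok" otherwise. Thresholds are supplied by the caller.
    """
    if not samples:
        return "ok"
    # Immediate problem: the latest sample is already over the limit.
    if samples[-1] >= page_threshold:
        return "page"
    # Trend: average change per step across the window.
    if len(samples) >= 2:
        slope = (samples[-1] - samples[0]) / (len(samples) - 1)
        if slope >= trend_slope:
            return "ticket"
    return "ok"


print(classify_alert([0.01, 0.02, 0.20], page_threshold=0.10, trend_slope=0.005))
print(classify_alert([0.01, 0.02, 0.03, 0.04], page_threshold=0.10, trend_slope=0.005))
print(classify_alert([0.01, 0.01, 0.01], page_threshold=0.10, trend_slope=0.005))
```

Both thresholds are the knobs that get revisited as a product matures, which is the sense in which tuning is ongoing rather than a one-time setting.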
Sharing Learnings
The Brians discuss how to share learnings across the organization.
Rutkin goes over how you can determine the correct ways to communicate within your organization and define thresholds for when a postmortem is required.
Brian Weber illuminates where many postmortems go wrong: “Blame, blame, blame.” He goes on to talk about using the right words in postmortems and how a problem is rarely caused by a single person.
Both Brians discuss the “5 Whys” and how you can prevent future outages through changes to systems and culture.
Additional Resources
- PagerDuty Home Page
- PagerDuty’s Postmortem Resources
- Episode transcribed by Rev