Originally posted at the Blameless blog.
Traditionally, DevOps engineers have had their hands tied when asking for investments in reducing technical debt. Architecture changes, code refactoring, and slowing down feature development for the sake of product availability are mostly wishful thinking, even though these efforts inevitably become mission-critical for every company. If tech debt is not handled or timed well, both the company's product and business health will suffer. When should you deal with your technical debt? How can you bring the leadership team on board? What do SLOs have to do with these questions?
Blameless chatted with SRE leader Garrett Plasky to get the answers. Garrett's SRE team of 15-20 engineers is responsible for keeping the lights on, a.k.a. running the production infrastructure, for Evernote's more than 220M users and 5 billion resources. Their product is a cross-platform SaaS application designed to enable people to organize, personalize, consume, and share thoughts from any device at any time.
Anytime you go to the leaders with “Here's my problem. We should invest in our architecture,” they say, “Where's the data?” You go, “Here's my SLO graph.”
To that end, SLOs are a critical part of shipping code to your production environment. Having SLOs in place for your production services allows you to remove the emotion and ambiguity from figuring out the impact of an unplanned outage or a bug released to production.
Blameless: What is an SLO? How does it relate to SLI and SLA?
SLOs, SLIs, and SLAs are used exclusively for metrics that capture your users' experience, such as availability, request latency, throughput, and error rate.
An SLO, or Service Level Objective, is an internal target for a metric that you are measuring.
An SLI, or Service Level Indicator, is the name for the metric. For example, if an SLI you are measuring is availability, then a corresponding SLO you might set would be 99.95%. Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e., SLOs should not be aspirational).
An SLA, or Service Level Agreement, is an external target that you are legally obligated to meet, such as 99% for availability. When you don't meet your SLA, you compensate customers with dollars. That's why you want to set your SLOs to be more stringent than your SLAs.
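To make the relationship concrete, here is a minimal Python sketch. The 99.95% and 99% thresholds come from the examples above; the `classify` function and its messages are purely illustrative, not part of any real tooling:

```python
# A minimal sketch of how a measured SLI relates to the SLO and SLA.
SLO = 0.9995  # internal availability target (99.95%)
SLA = 0.99    # external, contractually binding threshold (99%)

def classify(sli: float) -> str:
    """Classify a measured availability SLI against both targets."""
    if sli >= SLO:
        return "meeting SLO: product engineering can keep shipping"
    if sli >= SLA:
        return "missed SLO but within SLA: prioritize reliability work"
    return "breached SLA: contractual remedies (credits/refunds) owed"

print(classify(0.9992))  # -> missed SLO but within SLA
```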
Both your DevOps and product development teams are responsible for meeting the SLO. This puts pressure in both directions. Are we meeting the SLO? If so, product engineering can run faster. If not, the operations side applies back pressure: how do we fix whatever is keeping us from meeting our SLO?
What is an error budget?
An error budget is 1 minus the SLO. Continuing the example, a 0.05% monthly error budget for availability (based on a 99.95% uptime target) means your service can be unavailable for around 22 minutes in that month:

Error budget = 43,800 min/month × 0.05% ≈ 22 min/month

SRE practices encourage you to strategically burn the budget to zero every month, whether it's for feature launches or architectural changes. This way you know you are running as fast as you can without compromising availability.
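The same arithmetic as a short, runnable check. The 43,800-minute average month matches the figure above (365 days × 1,440 min/day ÷ 12 months); the variable names are illustrative:

```python
# Error budget for a 99.95% monthly availability SLO.
MINUTES_PER_MONTH = 43_800  # 365 days x 1,440 min/day / 12 months

slo = 0.9995                 # 99.95% availability target
error_budget = 1 - slo       # 0.05% allowable unavailability
budget_minutes = MINUTES_PER_MONTH * error_budget

print(f"Monthly error budget: {budget_minutes:.1f} minutes")  # ~21.9
```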
How have SLOs changed the way your company operates?
Every month, I present how well we have been meeting our SLOs. One bar graph I use shows our performance for availability against our SLO for each of the most recent 6 months.

The SLO graph helps us drive product roadmap decisions. When a piece of architecture is prone to failure and causing us to miss our SLO, we can make a conscious decision to invest time (or not) to make it better. Without SLO tracking, we wouldn't have the data to justify the time investment in the architecture.
How has measuring against SLOs affected your leadership team?
It's been very useful for leadership to gain visibility into our SLO performance.

Recently, our new SVP of Engineering came to the service review meeting where I was presenting our performance against our SLOs. We had a not-so-great month. He jumped right in and asked, “So what are we doing about it?” He was asking all the relevant questions without needing any context. He just got it!
The SLO graph sparked a great conversation that led to him saying, “Let's double down on some of these architecture investment efforts that we've been talking about!”
What is the most important metric that teams should have an SLO for? How do you measure that metric?
Ultimately, it comes back to availability. People expect that when they do things on a phone, it's immediately available across the world for someone else to see. If that's the expectation of your users, then you have to serve that. In order to serve that, you have to be available. That's why it's important to set and meet the SLO for availability.
At Evernote, we currently measure availability at the shard level as well as for the overall service. We use external probes that hit a health check endpoint on our service and return a success or failure. This probe runs every minute against every host. We determine availability as the proportion of successful health checks across all data points in a fixed time period.
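A minimal sketch of this measurement, under the assumption that each probe reduces to a host and a pass/fail flag. The `ProbeResult` type and sample data are hypothetical, not Evernote's actual pipeline:

```python
# Probe-based availability: fraction of successful health checks
# across all (host, minute) data points in a fixed window.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    host: str
    ok: bool  # True when the health check endpoint returned success

def availability(results: list[ProbeResult]) -> float:
    """Fraction of successful health checks over a fixed window."""
    if not results:
        return 1.0  # treating "no data" as available is a policy choice
    return sum(r.ok for r in results) / len(results)

# Example: 3 hosts probed once a minute for an hour = 180 data points,
# with one failed probe per host.
window = [ProbeResult(host=f"shard-{i % 3}", ok=(i >= 3)) for i in range(180)]
print(f"Availability: {availability(window):.4%}")  # 98.3333%
```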
When the service becomes unavailable, a.k.a. when an incident burns the error budget, are fingers ever pointed?
Absolutely not. I have been passionate about blameless postmortems since before I even knew about SRE. It's not about “who tripped over the power cord?”, but “how can we prevent people from tripping over the power cord next time?”.
What insights can we derive from tracking the error budget?
I have also created a step-function graph for error budget burn. The line on the graph starts from the top left and steps down like a staircase with every instance of error budget burn. By tracking and aggregating the key causes of these incidents, and analyzing which ones burn more budget than others, we can take proactive measures to prevent or manage the riskiest causes in the future.
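A rough sketch of that staircase in code, with invented incident data purely for illustration; the per-cause rollup mirrors the aggregation described above:

```python
# Error-budget "staircase": start the month with the full budget and
# subtract each incident's downtime as it occurs.
from collections import Counter

budget_minutes = 21.9  # monthly budget from the 99.95% SLO above

incidents = [  # (day of month, cause, downtime in minutes) -- made up
    (4, "bad deploy", 6.0),
    (12, "shard failover", 9.5),
    (25, "network blip", 3.0),
]

remaining = budget_minutes
for day, cause, downtime in incidents:
    remaining -= downtime  # each incident is one downward step on the graph
    print(f"day {day:>2}: {cause:<15} -{downtime:.1f} min, {remaining:.1f} min left")

# Aggregate burn by cause to see which failure modes cost the most budget.
burn_by_cause = Counter()
for _, cause, downtime in incidents:
    burn_by_cause[cause] += downtime
print(burn_by_cause.most_common())
```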
How long did it take you to set up the monitoring system for SLOs?
The upfront investment of setting up the instrumentation, visualization, and data collection took 2-3 engineers about a month to complete. Every month, I spend a day pulling together a presentation with SQL queries and graphs.
How did you get leadership buy-in on SLOs?
I'm grateful that my VP at the time was equally keen to implement SLOs and error budgets. He helped amplify the ideas within the company. Our SLOs are codified in an internal Evernote doc that everyone in the company has access to.

The first time I presented my SLO graph and error budget burn was to the SRE team and a few peers in engineering. Over time, more people joined, including a rotating crew of people like our client leads, other engineering managers, and even our CFO.
As you lead your SRE team on this journey to reliability, how do you advocate for the importance of reliability at Evernote?
As a SaaS company, our users trust us not to lose their data and to store it forever. They trust that they can use Evernote whenever they want, wherever they are. We are committed to upholding that trust.
Written by Charlie Taylor