Jill Ann

What is Site Reliability Engineering? A short intro

I recently started learning about Site Reliability Engineering (SRE), a discipline that began at Google in the early 2000s and is now popular at companies around the world. Since one of the best ways to properly understand a topic is to explain it to someone else, I decided to write this article explaining some of SRE's core concepts. As such, this article is partly for me, and partly for others who are curious about SRE or just starting out on their SRE journey.

This is by no means intended to be a definitive guide. Instead, I'm going to focus on three of the core concepts that I found the most interesting and insightful: toil and how to eliminate it, SLOs, and the error budget. I'll start with a short intro on what SRE is and the problems it addresses, and then dive into each of the core concepts in turn.

What is SRE? Is it DevOps?

SRE is an approach that uses software engineering concepts to solve operations problems. It focuses heavily on automation and its key principles help align the goals of the development and operations teams.

SRE is not exactly DevOps, although they do share many common traits and have similar aims. For example, DevOps and SRE were both brought in to address the problem of conflict between the development and operations teams.

Both aim to break down the barrier between the two teams and foster a healthier, more effective, and more productive working environment. But where DevOps is a fairly general term, SRE is a definite set of principles. SRE can be thought of as a specific way of doing DevOps.

What problems does SRE fix?

To really understand SRE, we have to look at the problems organisations faced before SRE (and DevOps) were part of the picture.

Conflict between development and operations teams

In traditional organisations, product development and operations were two quite distinct teams. Development was responsible for writing code and building new features, whereas the goal of operations was to keep everything stable and running smoothly in production.

The problem is that this setup inherently creates tension between the two teams. The development teams want to move quickly and ship as many new features as possible. Operations, however, wants to move slowly and limit changes, because changes can break things and risk system downtime.

Over time, this conflict results in a number of problems such as bad communication, different goals regarding service reliability, and ultimately a lack of trust and respect between the two teams.

Scalability

Another problem that SRE addresses is scalability. As a company launches new services, or as those services start getting more traffic, a traditional operations team has to scale with them, because its work is mostly manual. The more services and traffic there are, the more people are needed to run those services and keep them stable.

At a company the size of Google, the operations team would therefore have to grow to a size that quickly becomes unmanageable and extremely expensive.

How does SRE fix these problems?

SRE has a set of principles that focuses on aligning the goals of development and operations as much as possible.

"SRE is what happens when you ask a software engineer to design an operations team."

So what might this look like? Let's take a look at some of the core concepts now.

Eliminating toil

Toil is the manual, repetitive work that comes with running a production service. So toil is not just any manual work, but work related specifically to keeping the site up and running. Toil is not something that moves you forward: if, after completing a task, you're in the same place as you started, chances are that task is toil. It's also important to note that toil is something that can be automated, so any task that relies on human judgement is not toil.

So how does the SRE team eliminate toil? SRE tackles this by making automation a top priority. At Google, SRE team members have a cap of 50% on the time they can spend on operations work such as manual tasks and being on-call. And what do they do with the rest of their time? They spend it on engineering projects (hence the "Engineering" part of SRE). A big part of this involves automating manual tasks, with the goal of automating away that year's work. With more and more tasks automated, more toil is eliminated.
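
To make the 50% cap a little more concrete, here's a minimal Python sketch with made-up numbers (the `toil_fraction` helper and the weekly hours are purely illustrative, not anything from the SRE book) showing how a team might check whether its operational work stays under the cap:

```python
# Hypothetical example: checking whether a team's operational work
# stays under the 50% cap described above. All numbers are made up.
TOIL_CAP = 0.5  # at most half of the team's time may go to ops work


def toil_fraction(ops_hours: float, total_hours: float) -> float:
    """Fraction of total working time spent on operational (toil) work."""
    return ops_hours / total_hours


# One week of a hypothetical 4-person team (4 * 40 = 160 hours).
week_toil = toil_fraction(ops_hours=92, total_hours=160)

if week_toil > TOIL_CAP:
    print(f"Toil at {week_toil:.0%} - over the cap, redirect the excess to dev")
else:
    print(f"Toil at {week_toil:.0%} - within budget, keep automating")
```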

And how exactly does eliminating toil help solve the problems mentioned above? First of all, it directly tackles the problem of scalability, because with more and more tasks automated, the SRE team doesn’t need to scale in line with more services or traffic. Now the company can expand its services, but the size of the SRE team can stay the same.

Secondly, it intelligently addresses the issue of conflict between development and operations. You might be wondering what happens if there's simply more operational work to be done and the SRE team exceeds its 50% limit. If this happens, the excess operational work gets redirected back to the development team.

Now, instead of being in conflict, the values of both teams are aligned and focused on reducing the overall amount of manual operations work. The development team will be more careful with the code they hand over to SRE, because they know that any badly-tested code that causes problems will only increase the SRE team's workload, and if that workload goes over the 50% limit, the developers themselves will have to pick up the excess.

SLOs

SLO stands for Service Level Objective. In short, it's the level of reliability or availability a service aims to offer its users.

It's a common misconception that a company should aim to offer services with 100% availability. If you think about it, though, this is actually pointless: a user's computer or internet connection may only be 99% reliable, so they wouldn't even notice the difference between a service that's 99.99% reliable and one that's 100% reliable.

One major downside of aiming for 100% reliability is that it stifles innovation. If the system can't be down even for a second, it's going to be very hard for developers to launch new features, as this risks breaking things. Also, making a service 100% reliable requires enormous effort (if it's even possible), and that effort is almost certainly better spent on other things.

An SLO is therefore an agreement on what level of reliability is acceptable. It can be measured in terms of how often a service is available, as well as things like how long it takes to return a response to a request. These measurements are called SLIs (Service Level Indicators).
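
As a rough illustration of how an SLI relates to an SLO, availability is often measured as the ratio of successful requests to total requests and compared against the SLO target. The sketch below uses made-up request counts and isn't tied to any particular monitoring tool:

```python
# Hypothetical example: computing an availability SLI from request counts
# and checking it against an SLO target. Numbers are made up.
SLO_TARGET = 0.999  # 99.9% availability


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests served successfully."""
    return successful_requests / total_requests


sli = availability_sli(successful_requests=998_750, total_requests=1_000_000)

print(f"SLI: {sli:.4%}, SLO target: {SLO_TARGET:.1%}")
print("SLO met" if sli >= SLO_TARGET else "SLO missed")  # here: SLO missed
```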

SLOs are an important way of communicating to users what level of reliability they should expect. However, they're also key to getting developers and operations on the same page. Since the SLO is negotiated in advance, there's less room for conflict between operations pushing for more reliability and developers pushing for more changes.

Error budgets

The error budget is another clever way that SRE aligns the incentives of the developers and those concerned with reliability. It helps find a balance between releasing new features and making sure those features are reliable. The error budget is the other side of the coin to the SLO: if the product team decides that the SLO of a service should be 99.9%, then the error budget is the remainder, in this case 0.1%.

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

The easiest way to think about this is in terms of time. If the SLO is 99.9%, then the remaining 0.1% is time that the service is allowed to fail. In this way, the development and SRE teams agree in advance what the acceptable level of unreliability is, thereby reducing conflict.
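
To make the time-based view concrete, here's a small sketch that converts a 99.9% SLO into allowed downtime per quarter, assuming a 90-day quarter (the numbers are illustrative):

```python
# Hypothetical example: how much downtime a 99.9% SLO allows per quarter.
# A quarter is taken as 90 days here purely for illustration.
slo = 0.999
quarter_hours = 90 * 24            # 2160 hours in a 90-day quarter
error_budget = 1 - slo             # 0.1% of the time the service may fail
allowed_downtime_hours = quarter_hours * error_budget

print(f"Error budget: {error_budget:.1%}")
print(f"Allowed downtime per quarter: {allowed_downtime_hours:.2f} hours")
# -> roughly 2.16 hours of downtime per 90-day quarter
```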

New features can be added until that quarter’s error budget is spent. For example, if the error budget is 0.1% and a change causes the system to fail 0.01% of the time, that problem uses up 10% of that quarter's error budget. Once this limit is reached, no more features can be launched.
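
Using the same made-up numbers as above, here's a quick sketch of how a single bad change eats into the quarter's error budget:

```python
# Hypothetical example: how a single bad change consumes the error budget.
error_budget = 0.001        # 0.1% of the quarter, from a 99.9% SLO
incident_downtime = 0.0001  # the change broke the service 0.01% of the time

budget_consumed = incident_downtime / error_budget
budget_remaining = 1 - budget_consumed

print(f"Budget consumed by this incident: {budget_consumed:.0%}")   # 10%
print(f"Budget remaining this quarter:    {budget_remaining:.0%}")  # 90%
```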

This gets the developers thinking like SREs. If they know the error budget is almost used up, they'll write better, more thoroughly tested code. This works to their advantage too, because if their code causes fewer problems, they can keep publishing new features.

However if the developers are having a hard time launching new features because of a strict error budget, the SLO can be relaxed. This would increase the error budget and encourage more innovation. The important thing is to find a balance that works.

Of course, there are things other than buggy code that can consume the error budget: the failure of a data center, for example. Since this isn't the development team's fault, should it still affect their remaining error budget? The answer is that anything that causes the system to go down eats into the error budget. However, this can be handled by splitting the budget into parts: one part can be reserved for the development team and another set aside for other types of outages.

Further Reading

I hope you've found this article useful. If you want to learn more about SRE, then check out this free online book by Google, and also this YouTube playlist. Both of them explore the concepts mentioned here in more detail, plus a lot more.

All quotes from the book 'Site Reliability Engineering' by Google.
