Blameless

Learn How to Apply SRE Outside of Engineering with Dave Rensin

Hannah Culver ・42 min read

In this talk at Blameless Summit '19, Dave Rensin, Sr. Director of Engineering at Google, shares how to apply SRE outside of engineering within our organizations, transforming the way we think about traditional IT operations.

The following transcript has been lightly edited for clarity.

Christina: Our first keynote speaker is a senior director of engineering at Google. You might know him as the guy who founded and leads the customer reliability engineering function at Google. CRE, this is a team that teaches the world SRE principles and practices. Now I want to tell you a bit more about him, because I think he has a very unique view and perspective. He is deeply compassionate and intuitive as a teacher, not just a lecturer. He makes sure that whatever he's trying to transfer, the knowledge actually gets received. And when he's ... I've heard him speak, and he shared that as we invest more and more into machines as they become more and more powerful, are we humans working for the machines or are they working for us?

Now if we get woken up at 3:00 AM by a pager, then that's humans working for the machines. But how can we use a principle like SRE to preserve our humanity? Maybe that gets automatically responded to. Maybe we don't get woken up in the middle of the night. And it's that idea of fundamentally preserving our humanity that very early on helped me see the bigger picture of SRE. So without further ado, please help me welcome Dave Rensin for how do we apply SRE outside of engineering.

Dave Rensin: Well, thank you, Christina. That was entirely more kind than I deserve. Good morning everyone. Yep, I am Dave, and I do work at Google. The topic of the talk is what might it look like to apply SRE principles outside of engineering in our companies? I mean, a good set of principles shouldn't, or at least should be transferable in some sense across domains. Otherwise, they're fairly brittle, as is said of a principle of free speech is not restricted to someone's fear of our life, right? Before we can kind of answer that question, we have to maybe narrow it a bit and ask, well, which SRE principles are we talking about? Well, there are several SRE principles and a lot of practices. At Google, we wrote two really large books of 800 and some odd pages each on both of them, which you can read for free at google.com/SRE. So we clearly couldn't cover that in an hour.

So, I really want to focus on what I think are the two most important core SRE principles, and then let's have a discussion about how we might apply that outside of engineering. So the first principle is the principle of error budgets, okay? And then error budget policies, but error budgets, the idea that things won't be perfect. It's dumb to expect they will be perfect. So how much imperfect is acceptable? And the second principle we want to focus on, shockingly for this audience, are blameless postmortems, what would it mean to apply that outside of an engineering culture. Okay, so before we dig into it, let me ask this question. Who here has written error budgets or error budget policies and then been governed by them, had to live under them? Okay. We will generously call that a minority of hands.

So how about we take two minutes and do the super-fast review, okay? Then we're all talking from the same page. Let's start with this. Perfect is the wrong goal for I think everything, but I'm open to being convinced that there is something, but I've yet to find one. So zero errors, 100% customer happiness, 100% reliability. Any of those goals are bad goals. Bad, like not good. If you labor under those, very terrible things will happen. Alright, why is that true? Well, in the first place, they're unattainable. No system has ever been created by humans, and I don't just mean computing systems. No political system, no economic system, no whatever, philosophical framework. There's literally nothing that's ever been constructed by humans that is perfect, or perfectly reliable or perfect on any axis. Actually, there are very few things ... this will be maybe 30% controversial to say, but there are very few things that I can think of that are actually perfectly terrible, have zero redeeming value ... And I can think of some really awful things, some unbelievably small margin maybe. Doesn't mean we should adopt any of them.

Okay, so we're bad at doing perfect. Don't feel bad about it. That's okay. Nature doesn't do perfect very well either. I can't think of a single physical system in nature that is perfectly reliable, including the sun. It's just, all of its downtime minutes will come at once, and forever. And this is okay. It turns out humans are terrible at following directions, perfectly awful at it. I have this thing I like to do when I'm speaking at a really big crowd, a couple, whatever, thousand people or something, where I ask everyone to raise their left hand, and then I take a picture. Okay, cool. Then we move on, we talk for a while, and I get to this point somewhere in whatever presentation about humans are terrible at following instructions, and I will point out, because it's true without fail, that something like 20% of the audience will have raised the wrong hand.

That's the simplest thing I can think to ask someone to do, raise your left hand. But 20% of the people will have raised their right hand. And a non-zero fraction, much of the time, will have raised both hands, and I don't know why that's true. But it does happen a lot. And this is not a new revelation, that humans make mistakes. I think we are designed this way, or we've evolved, or however you care to think about it. But we are in this place in nature, and our purpose is to make well-intentioned mistakes. And I couldn't exactly tell you why we always do it. Maybe we're always interposing some evaluation or judgments in everything before we go do it. Beats me. I don't really know.

But this is not a new thought. This thought is thousands of years old. Seneca, people who have heard me speak have heard me quote him before. It's one of my very favorite sayings, errare humanum est, sed perseverare diabolicum. We all know what that means. It just means to err is human, but to persist knowingly in error is diabolical, right? Or maybe I would say more simply, the sin is not in the failing. The sin is in the failing to notice, okay? So we screw up even little things all the time. That's okay. That is working as intended, as we would say at Google. That is somehow by design. Therefore, we're not going to create perfect things. That is an unattainable goal. Therefore, if we set standards of perfection for our organizations, for our systems, so I'm using the very broad definition of system here, not just computing system, we know we will not attain that goal. That is a metaphysical certainty.

So, what happens as we start to get to whatever the point is reasonably attainable, and yet still ask people to get better. They lie to us. That's what happens. They lie to us. And you can see it everywhere in history. So that's why perfection is a really terrible goal. Don't have that. So then the question is ... Oh, and by the way, also our users, however we care to define users, our customers, don't care. If we think of it in our mind's eye, if we think of a graph, and on the Y axis is our users' happiness, and on the X axis is sort of our number of errors, right? And we start at zero. Let's see, I'll do this in reverse so it makes sense. We start at zero, hey, our users are as happy as they're going to be with our system, right? And if we get a little above zero errors, well, they're going to have basically the same level of happiness. It's not linear. In fact, it's flat for a while.

And then you get some level of error. I don't know what that is. It's different for every system. And then it falls. It really drops off a cliff at that point. So there is some amount of error, some distance between perfect and some line, in which our users are entirely indifferent. And you can prove this to yourself anecdotally. Do you care if your internet is working in your home if you're not home, or if it's working on your phone if you're not looking at your phone? Do you care if the BART is on time if you are not riding the BART? Right? Your attention and time are divided across a lot of things, and most of your day, you're not paying attention to most of the things you are using, so they don't have to be perfectly reliable because that's just a waste. And in fact ... So not only is it wrong to try to be much better at that sort of philosophically, because we will turn our people into liars, which is terrible, but our users don't care.

So clearly, if we're greater than whatever that error threshold is for any extended period of time, we'll start losing users. That's bad. We should all try to avoid doing that. But if we are very much under that error threshold for an extended period of time, we are wasting time and resources, because our users are indifferent. So if we say that we have a budget of 30 bad minutes in some computing system a month, and we're like, hey, this month we only had 10 bad minutes, woo hoo for us. The answer's no, not woo hoo for us. Why are we wasting 20 bad minutes? What could we be doing faster? Could we be doing more innovation? So our goal is to run our systems, our computing systems, our marketing systems, our legal systems, et cetera, just a little above, just a little better than that margin, right? And so that margin's called the error budget, how much budget. And we like to say at Google, take your nines for a spin.

The other thing people get confused about, just really quickly, is, hey, guess what? Our users decide our reliability for whatever system we're talking about, not our logs. No user ever said, "I'm having a problem with your system," and then we said, "Well, our logs look fine." And the user said, "Okay, never mind." And trust me, I built a global support org. I know. Our users care about symptoms, not causes. No user ever cared that the CPU spiked. They do care that their transaction's taking a long time or it failed or some such thing, right? This becomes important in a moment. And then of course with error budgets, the question is, what do we do when we blow our budget? And we'll talk a little bit about that too.

Like in a software world, we might say if we had an error budget, let's see, a four nine system, your budget's going to be 4.32 bad minutes every 30 days, so if I have a nine minute outage, what do I do now that I've blown my error budget, I've spent twice my error budget? Easiest thing to do is freeze new roll outs for 30 days for the period it takes you to pay off that debt. We can talk about what would that mean in a legal system? What would that mean in a marketing system? What would that mean for sales humans?
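To make the arithmetic in that example concrete, here is a minimal Python sketch. The function names are illustrative assumptions, not anything from Google's tooling; the sketch only reproduces the numbers quoted above (a four-nines target over a 30-day window, and a nine-minute outage).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window for an availability target such as 0.9999."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_burn(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's budget consumed; values above 1.0 mean the budget is blown."""
    return outage_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(0.9999)   # four nines over 30 days
    burn = budget_burn(9, 0.9999)           # the nine-minute outage from the talk
    print(f"budget: {budget:.2f} bad minutes, burn: {burn:.2f}x")
```

A burn of roughly 2x matches the "spent twice my error budget" remark; what a team does in response (for example, a rollout freeze) is a policy choice that the sketch deliberately leaves out.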

Okay, that's the first thing, error budgets. Yes, a question. Alright, we have our first question. Please stand up, tell us your first name and yell or talk loudly your question.

Jacob: My first name is Jacob. My question is, Google Cloud this summer had an incident that greatly exceeded its error budget. GCNET-19009, I think maybe 50X, because Google Compute has a four nines SLA I think rather than SLO or error budget. I'm wondering sort of what ... If that's 50X, that's five years of a feature freeze that I assume-

Dave Rensin: That's a lot.

Jacob: Google probably won't be doing for-

Dave Rensin: Probably won't.

Jacob: So what does Google do for incidents of that magnitude?

‍Dave Rensin: It's ... Sorry, I'm going to pause for a minute. I know the answer to your question. I now have to put it through the filter of stuff I'm allowed to say. There's a lot of latency in that filter. It goes deep, so give me a second. Yes, from time to time in any system, right, you're going to experience outages of the system that are much larger, where you'll have blown your budget by so much, you're never going to pay it back. Let me ask you something. If you accrued debt that was so high you couldn't pay it back, what would you do? You'd declare bankruptcy, right? I mean, it's awful. It's painful. And people don't want to lend you money again, et cetera. We do analogous things for systems like that, right? So we declare bankruptcy. We say, okay, we're not going to pay this back. But something is fundamentally broken here.

Why did we ... Was this a black swan? Did an actual meteor actually hit a data center? Oh, no? Oh, okay, so this was preventable in some way. How is it we constructed a system that was so brittle that in its limit ... Again, it's not that there was a problem. That's not the fault. It's that it lasted so long. Now, what were the contributing factors for it lasting so long? Did we not notice for too long? Was mitigating the problem too complicated, like maybe we couldn't do a rollback or something. Maybe we could only roll forward, which is a terrible way to build a system. For something that large, we'll ask, are there communication barriers between the teams? We're not trying to get blameful here, but it's ... Sometimes, it is true in a large organization that how you organize your teams does affect the velocity with which they do what they do. I'd prefer that not to be true, but in a 100,000-person org, sometimes that matters.

I know the incident; I am deeply familiar with the incident you are talking about. I am not going to get into the details past the public postmortem because I'm not. But the general answer to your question is you have to declare bankruptcy. And like in a financial bankruptcy, you need to have pretty serious penalties. Penalties is maybe the wrong word, but you have to have a very serious reaction to that. That is an event you'd like to not have happen hardly ever. Now, that having been said, statistically speaking, every five to seven years in some large system, you're going to have that. I mean, it's on the tail and it will happen. That's the thing about one in a million events. That's not very rare until you do things a billion times.

Audience Member: Would you adjust your error budget?

Dave Rensin: You might. Now ... Well, okay, so actually, the question you're asking, you're asking a fine question. You're asking about the SLO, the service level objective, like what is the magic line? Okay, the only answer to that question about adjusting the error budget is, what do our users care about? So you adjust your error budget if you have an outage and you find that that exceeds your error budget, right? But you find that your users didn't care and that you had too strict of an error budget, right? Or if you have an outage that doesn't blow your error budget that people get really annoyed, they care a lot about it. You've miscalibrated where that line is. But at the end of the day, the fact that I had a huge outage doesn't mean I'm going to adjust my error budget. The only thing, the only vote that matters is the user's vote, right? Does that make sense? Because you can imagine, it's pathologically terrible to say, "This system for dollar reasons can't be more reliable than X. Therefore, I'm going to set my error budget to X." No, that doesn't encourage good engineering.

It's fine to say, "This system can't be more reliable than two nines." I see your hand. "This system can't be more reliable than two nines because of dollar reasons." My user requirement is that the system run at three nines. They get super grumpy if it's not running there. Now I know that for the period of time it takes me to figure out whatever I have to do to get this thing from two nines to three nines, I'm going to have grumpy users, right? And that's part of declaring a bankruptcy and doing some other things. But, so the user is the most important vote in where the error budget sits. But yeah, but you have to explicitly recognize there's a gap. Yes, young lady. Sorry, I have the lights in my eyes, so I have to sort of-

‍Audience Member: So continuing off the last question, I had a thought, which is, what happens when those gigantic events happen, right? And they do happen, and one of the things that I'm wondering is, for each individual team that is responsible for their own SLOs, for their own product, right? Obviously loads and loads of these teams and these products are impacted by this black swan event, right? And one of the key concepts that people talk about in SLOs is being very careful to define your SLO and understanding the SLOs that are beneath you, that you depend on, right?

Dave Rensin: Yeah.

‍Audience Member: And so, does your system kind of account for that and say, look, you blew your SLO for five years' worth, but that's because everything was down and therefore, your team is not expected to magically have ... Does that kind of conversation happen?

‍Dave Rensin: Yeah, we have a philosophy about that, and our philosophy can be summed up as this. The user doesn't care. The user doesn't care. The user said, "This system X didn't work for me," and the user experienced this terrible outage, right? So you're asking the classic systems dependency question. So let's say some fundamental storage layer failed, right, and that was the cause of all the things. You might say, "Well, I'm the front end developer of this web application, right? I depend on the server and it depends on some stuff and it depends on some stuff, and all we ... Peel away all the turtles all the way down, we're at a storage layer, right? That ain't got nothing to do with me." And my answer is, "Wrong." Sure it does. The user doesn't care.

So, you need to ask a question, Mr. and Mrs. Developer, are your assumptions about the reliability of this layer that you depend on accurate? If they're not accurate, great, because it doesn't change the expectation the user has. Do you need to do something different? Do you need to decouple your dependency in some way on that deeper layer? I have never seen an outage between dependent systems where let's say A depends on B where there was nothing A could have done to lessen the blast radius of an outage of B. And so actually, no, we will apportion that error budget penalty across all the layers of the stack until we get to the user.

So, if it's a ... I'm trying to think of what would be the power system. I'm trying to think of the most basic thing I can think of. A power system, a power failure. Oh, here's a good one, because we had this happen a couple of years ago. People just didn't notice it, thank goodness, because of ... It's the way we distribute stuff. There was a bug in a firmware of a power controller. We didn't know. We buy this power controller from a vendor. And the bug showed up in three data centers at exactly the same moment and threatened the power in three data centers. Three data centers, depending on size, is on the order of tens and tens of thousands of machines. That was bad.

Okay, so we might say ... I mean, it's coding in the firmware of a power controller, which feeds the power system, which controls the hardware, et cetera, et cetera, et cetera all the way up the stack. Not my fault. Well I mean, yeah, you didn't introduce the bug, but to the extent it burned error budget, it burned some error budget, but not all of it, thankfully. The user doesn't care. So everybody in the stack, which basically meant everyone who was running services in that data center, which is almost everybody in Google, got apportioned. Their individual error budgets, their individual SLOs for their stuff took a hit somewhere. And so it was like ... And the teams that had been good about having geographic distribution of their serving load and quick recognition and switching over to other data centers, well, they didn't get much of a hit because they could slosh over and it didn't really matter. It didn't affect anyone.

So, our philosophy is no, it's everybody's ... Oh, wow, they cut me off. It's everyone's responsibility all the way down. All the turtles, all the way down. Because the user doesn't care. Let's move on here real quick, right? The second principle, because we actually probably want to get into how to apply these things outside of engineering, blameless postmortems. I think we all understand why this is important. If you blame people, they won't tell you the things you have to learn, which means what you can learn from problems. There's a ceiling. It's an asymptote to it. You learn something, or you learn the wrong things, right, because they hide the truth. That's terrible. Which means you will make the mistake again. That's doubly awful. So that's a terrible thing. If you have a blame-y culture you will go slower, at the end of the day. And the cost of fixing your problems will go up dramatically. This is why we have blameless postmortems.

The key thing about a blameless postmortem ... Key principles of a blameless postmortem. Number one, you have to start from the assumption that your co-workers are not stupid, lazy, or evil. If you really think you have stupid, lazy, or evil co-workers, your hiring process is broken. If you can't change your hiring process, you should quit, because that company will fail. But it is the truth in any company that survives for more than a few years. The overwhelming majority, you might as well think of it as everyone in a practical sense, nobody is stupid, lazy, or evil. They're all well-intentioned humans who will make well-intentioned mistakes.

So when I, the human, I don't know, push code or something and it causes a problem or tweet that thing that seemed oh so clever but causes a PR nightmare, the question is not why is that lazy, stupid, or evil human doing lazy, stupid, and evil things? Because I am not that. It's how did that person come to become unlucky? Why was that the person who tripped over this? What is it in the system that permitted a human, who is going to make mistakes, even on simple things, what permitted this person to be unlucky in that situation? And is it possible to put guardrails around that so people can't be unlucky in that situation? And of course, you're trading against velocity and stuff.

But the reason we ask this question, how do we make it harder for people to experience bad luck or the results of their bad luck to have smaller blast radius is when you frame it that way, then people will be very forthcoming. They're like, yes, I was the unlucky victim. It happened to me. Here are my suggestions about how maybe it could not happen to me in the future and not happen to the next unlucky Dave in the future.

Okay, and again, this is not new to Google in any way. Doctors, at least if they're doing well, do M&M conferences, mortality and morbidity conferences. They get together and ask, why did this patient die? Because they would prefer everyone learn that lesson from one incident than have to learn it from a lot of patients. The most important thing about a postmortem besides being blameless is it actually has actions, right? It has concrete actions that are around what could we do to make the monitoring for a thing like this better? What could we do to make the alerting better? What could we do to make it easier to mitigate, right? It's all about reducing the blast radius. And then of course, what would we do to fix it so that we hopefully don't repeat this class of error again. But again, the goal is not to not fail. The goal is to not fail to notice, right, to notice as quickly as you can and then mitigate as quickly as you can.

Okay, so those are the two basic concepts we want to start with, error budgets and blameless postmortems. So, let's talk about how we might apply these outside of engineering, because they seem relatively straightforward to apply in engineering. Okay, well, let's talk about the different functions of a business. Let's pick everyone's favorite function, marketing. How could we apply SRE principles to marketing? I don't know. That's a good question. Well, let's ask some foundational questions first. Who are the customers of our marketing team? I'm going to put PR and comms in with marketing here, right? Just people responsible for shaping our image to external humans, if that's reasonable. Well, the sales team? Sure, they seem like they are a consumer of the marketing team's work. Customer support? Yeah, I think so. They consume marketing's work. Our executroids, our leadership? Yeah, sure. Our shareholders, our investors, or ... And then of course our customers. Our customers consume the work of our marketing. So let's just go and say everybody. Everybody is a customer of marketing.

So, let's look at a couple of these in order and ask, could we reason about how to apply some SRE principles? So, our supplier in this case, our system is marketing, our marketing system, a collection of well-intentioned humans who want to say good things about the company, and our customer in this case is, let's say the sales team. Let's start there, okay? Starting from the principle of our customers only care about what our customers only care about, and you should only care about what our customers only care about. What do our sales customers care about from a marketing perspective? Why do they care about marketing? Well, okay, the sales team wants to beat their revenue targets. That's their job is to go sell stuff. What are the things they care about? Well, they care about deal friction, like how hard is it to convince a company to go do a thing, to go buy a thing? And they care about their inbound funnel. How quickly can I find prospects that I can go talk to and then what kind of friction do I have when I go do that?

So how can we apply SRE principles to those things? Well, okay, what are they sensitive to that the marketing team might have some impact on? Well, bad PR. Bad PR is not good. If you have a tweet storm about something, some bug in a product or whatever, you can imagine that would definitely cause deal friction, right? Some customer will say, "I read about this terrible thing that happened to you. Please tell me why I should buy your stuff." So that's no good. You can imagine it would impact the funnel. If I hear a bad thing about a product, maybe I'm less inclined to reach out to the company and say, "Please come talk to me about your product." What else does marketing do, brand? Brand ID and perception in the marketplace. Forget good bad. It's harder to sell something to someone if they've never heard of you before you walk in the door. If I walk in the door and say, "I'm from Acme Widget Company," and their first question is, "What is Acme Widget Company and why am I talking to you?", you have an uphill battle when you go. So, I see an important contribution from the sales team's perspective of the marketing team.

And then of course the funnel. Are people proactively reaching out to me to express their interest that I come talk to them because maybe they want to buy my stuff? That is a much easier thing to manage and much higher velocity to manage than having to do cold calling where you dial random humans and say, "Have you heard of us, and would you like to buy our things?"

Okay, so our system is marketing, our customers in this case are sales. We have identified the shape of some stuff the sales team cares about. How do we go put some metrics around it? So for example, how do we figure out how much bad PR the sales team can tolerate? I mean, we're not going to have zero bad PR. That's not going to happen. People have seen Twitter. So everyone has an opinion, and so zero's not the goal, because that's always not the goal here. So we're going to have non zero bad PR, let's say. How much can we tolerate as the sales team? Because we can clearly tolerate more than zero. We have more than zero now. Well, the answer is, we can tolerate just enough bad PR before it starts to add friction to our sales process. That is the general answer to that question.

Well, how would we measure that? What kind of metrics would we set up that we would actually track on a dashboard? Okay, well, there are a bunch of things. How about impacted customer minutes, preferably scaled by revenue? What I mean is, one source of bad PR in a company is going to be, we promise to do a thing, like we promised our product would work, and it didn't. Downtime or an outage or something, right? That's not just a technical thing, that's a PR thing and a marketing thing too. Well, we can measure how many bad minutes did our customers experience? And I think we should all acknowledge the reality that a customer who’s worth, whose economic value to a company is, I don't know, $1,000 a minute, is worth more than a customer whose economic value to your company is $1 a minute, right?

So, you should scale those minutes by money, by dollars, because at the end of the day, the sales team cares about revenue. That's a measure we can do now. We can look at support tickets, we'll get impact logs, and we can actually measure that right now. How much money did we put at risk because of this bad PR moment? So that's the thing we can measure and track and do SLOs around. How about social media impact, which is something like number of events times their reach times the time. So if I tweet something bad about, I don't know, Comcast. I haven't, but let's say I did. That's different if then Kanye tweets something bad about Comcast. That, I would gather, would get just a teeny tiny bit more attention than the, I don't know, 300 followers or something I have. So that's much more reach, right? So how long is that out there in the void unaddressed by the company?
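One way to picture "impacted customer minutes scaled by revenue" is the short Python sketch below. The `Impact` record, its field names, and the sample figures are hypothetical; in practice the inputs would come from whatever support tickets and impact logs a company already keeps.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Impact:
    customer: str               # hypothetical customer identifier
    bad_minutes: float          # minutes this customer was affected
    revenue_per_minute: float   # assumed estimate of the customer's economic value


def revenue_at_risk(impacts: Iterable[Impact]) -> float:
    """Sum of bad minutes weighted by each customer's revenue per minute."""
    return sum(i.bad_minutes * i.revenue_per_minute for i in impacts)


if __name__ == "__main__":
    incident = [
        Impact("big-customer", bad_minutes=10, revenue_per_minute=1000.0),
        Impact("small-customer", bad_minutes=10, revenue_per_minute=1.0),
    ]
    print(revenue_at_risk(incident))
```

Both customers saw the same ten bad minutes, but the weighted total is dominated by the $1,000-a-minute customer, which is the point of scaling by dollars.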

So there's some equation about that, which again, we can measure social media impact and other kinds of things. The thing is here is you don't want to just measure that as one quantity. It needs a denominator, right? So let's think of those as bad social media minutes. We'll call it BSMM ... Or, fine, so let's find another acronym. Let's just call it bad minutes, how's that? You've got to divide it by something. You want to divide it by total minutes, right? So how many external communication events, how many PR events do you have? How many publicly facing tweets, announcements, whatever, what was their reach? How long did they last in the zeitgeist, et cetera? That becomes your denominator. How many of them were bad? That becomes your numerator. Now you have a fraction, you have a percentage that you can compute and track on a graph. Most people, your customers are going to be indifferent, or they're all going to be indifferent to some number greater than zero. It's how much more is a percentage, right? And by the way, it is a relative percentage in PR. It's not an absolute thing.
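The numerator and denominator described here can be sketched as follows. This is only one possible reading: each event's weight is assumed to be reach multiplied by how long it stayed in circulation, and the `PrEvent` fields and sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PrEvent:
    reach: int            # assumed audience size for the tweet, announcement, etc.
    hours_visible: float  # assumed time it stayed in circulation
    is_bad: bool          # whether the event counts as bad PR


def weight(event: PrEvent) -> float:
    """Impact of one event, taken here as reach times duration."""
    return event.reach * event.hours_visible


def bad_pr_ratio(events: Iterable[PrEvent]) -> float:
    """Bad impact divided by total impact; 0.0 when there were no events."""
    events = list(events)
    total = sum(weight(e) for e in events)
    if total == 0:
        return 0.0
    bad = sum(weight(e) for e in events if e.is_bad)
    return bad / total


if __name__ == "__main__":
    week = [
        PrEvent(reach=1000, hours_visible=5, is_bad=True),
        PrEvent(reach=3000, hours_visible=5, is_bad=False),
    ]
    print(f"{bad_pr_ratio(week):.0%} of this week's PR impact was bad")
```

Because the result is a ratio, the same bad event shrinks in the metric when it lands in a week full of good coverage, which mirrors the "relative percentage" remark.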

So, you could, let's say, your service, your business, you could have some kind of outage, some kind of down time for 10 minutes. Well, the network outage that I was being asked about earlier. That's a bad PR thing, right? If it shows up in the context of a week of a bunch of really good PR stuff, I don't know, a bunch of new features people want or great earnings or whatever. You can imagine what that ... Then your overall lasting impression in the public consciousness may be much smaller than if it's the only thing that's there during that same period of time. So ratios actually matter here. This would be a thing that we would measure in a computing system, bad minutes versus total minutes, right? I mean, bad minute percentiles. You can do the same thing with PR. And so these are things we can lie into.

How about this? Number of times, just a straight count, that a sales team encounters an objection in the sales process that's related to a bad PR thing. Let's have the sales team count. I went in to this customer today and they asked me about that network outage. One. I also went in to a customer today. Two. Right? And we start counting, and that metric by itself won't necessarily tell us what the impact on revenue is, but in combination with those other things, we can start to get a shape of these contours. There's no such thing as a sales call that has no objections. That doesn't happen unless you're in a consumer business and people just go into a retailer to buy your thing, but that's not what we're talking about here. So, you're always going to have some non-zero value there. What you're looking for are first derivatives. You're looking for changes in slope for this.
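Watching the first derivative of a count like this can be as simple as differencing consecutive periods. The sketch below assumes weekly objection tallies and an arbitrary jump threshold; both the data and the threshold are made up for illustration.

```python
from typing import List, Sequence


def slopes(counts: Sequence[int]) -> List[int]:
    """Week-over-week change in the objection count (a discrete first derivative)."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


def weeks_with_jump(counts: Sequence[int], threshold: int) -> List[int]:
    """Indexes of weeks whose count rose by at least `threshold` over the prior week."""
    return [i + 1 for i, change in enumerate(slopes(counts)) if change >= threshold]


if __name__ == "__main__":
    weekly_objections = [3, 4, 3, 4, 9, 15]  # hypothetical PR-related objections per week
    print(slopes(weekly_objections))
    print(weeks_with_jump(weekly_objections, threshold=3))
```

The baseline of three or four objections a week is just noise; the jump in the last two weeks is the kind of change in slope worth investigating.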

But again, it's measurable. Number of companies I know who measure these things? Let's see, carry the two, divide by six, zero. Nobody. I don't know anybody who measures these things and puts them in graphs and applies SLOs to them, including my own. Oh, just to be complete. Funnel measurements, that we can work into this. How many people need to come proactively to us as a company and express interest for us to meet our sales plan? Some of them will turn into sales, and then they will turn into some distribution of values, of revenue. Some will be small, some will become maybe medium or whatever. How many of those do we have to have across time and what distribution lets us meet our revenue plan for the year, right?

And then, same thing, we can find out coming in how many of them have heard about or been impacted by it. So we can actually track bad PR minutes, good minutes and bad minutes, and how it impacts the funnel. Same thing with blameless postmortems. I don't want to read a postmortem from the marketing team that said, "We fired Clyde because he said something stupid on Twitter." Hi, I have a question. Does your hiring process suck? No. Cool, glad to hear it, because you hired Clyde. So, if your hiring process doesn't suck, you don't think you necessarily made a mistake hiring Clyde, but you've fired Clyde, so that seems really bad, and it's expensive to recruit the new Clyde.

So, I have an easier question. How is it that Clyde got to say the dumb thing on Twitter? What is it about the system? Is no one reviewing stuff that goes out under the corporate Twitter account? And if people are, well how did that happen? Do we need better training for people? Do we need a second set of eyes? Do we need systems that are doing sentiment analysis on the tweets we just tweeted so we can untweet them if they are bad? Alright, those are the kinds of things we want to ask. The question isn't to ask, "Why did Clyde make a bad decision?" The question is, "Why did the system permit it to happen?", right? How do we prevent the next unlucky Clyde? And there's no reason not to apply that there.

Okay, so that's marketing as ... Oh, no, I guess that's another thing. Who's another customer of marketing? How about customer support? They're a customer of marketing, although it's not always obvious how. Here's one. The marketing team says, "New feature Q will be available Wednesday," and Wednesday comes, and new feature Q is not available. It doesn't become available till Friday. What do we think will happen between Wednesday and Friday? We will hear from our users, loudly, through our customer support channels. Or, somebody posts a bad review, let's say, of our product on, I don't know, wherever people post bad reviews about products, Yelp or something. Probably not Yelp. And it goes unanswered. Maybe it's not correct. I tried to do this, and this product clearly doesn't do this, and it's garbage. Actually, the product clearly does do this, but it's an RTFM issue or something.

Okay, but it goes unresponded to by the marketing team. That's going to generate inbound calls to the support team. There were actually a lot of things. Or if we just push out the wrong information about a thing. That's going to generate inbound calls. So the customer support team is impacted by a lot of the same kind of metrics that we've talked about. These are things that we can instrument. And now, now we have falsifiable inbound data called support tickets that we can tie to these things. Now we have two sets of measurable metrics that we can correlate over time to one another. That actually gives us ... We can start plotting curves and actually know really interesting things. We can start to alert ... Can you imagine a system where rather than alerting everyone in marketing when some celebrity says something bad obliquely about a thing, where everything's a fire drill, how's that? And we can't distinguish important from not important. Or a system where the marketing team can say, "Yeah, but you know what? This is under our error budget. We're watching it, but it'll be cool," and just move on?

How many of you interact with your marketing teams, and your PR and comms teams especially regularly? Yeah, I mean, a few of you. Okay, for those of you who don't, you think being on call is hard? Ha, ha, ha, ha. Try being on call for opinions, because that's what they're on call for. You have graphs and data. They just have some random human's opinion that might not even be a human. It could be a bot on Twitter. They don't know. So their hair is on fire constantly. You think you get no sleep? Try doing a stint in comms. I'll tell you a dirty secret. I've never worked in marketing at Google, not because I have no interest or I couldn't learn the skill, or not because it's not intellectually interesting or I don't have the skill. It's, I know what it's like to be on call, and I don't want to have to do a 10X for things I can't control or for opinion-based things. No, no, thank you. I'll go do something else.

So, these are things we need to look at, right? Okay, so, alright, that's kind of the ... That's a deep dive, and marketing is a system. Alright, let's say yes, we could probably apply SRE principles to marketing. How about to the ... How are we doing on time? Okay. How about to the least tractable, least comprehensible, most onerous and scary part of any business? Legal. Could we apply SRE principles to legal? Yeah is the answer, or I think the answer's yes. First things first, who are legal's customers? Okay, let's just make this easy. Everybody. Everybody one way or the ... directly or indirectly interacts with legal. That's just how it is. Our end users? They are also customers of legal, right? That's our end user click through agreements and our privacy policies, et cetera. Our leadership certainly are customers of our legal department. They give them advice about what they can or cannot do.

Okay, so let's do more quickly the same kind of analysis from the legal perspective. Let's start with the customer of employees. What, as an employee, what do I care about as an employee? I'm not a lawyer. I'm not in the legal department. What do I care about? I probably look at legal as a speed bump. They're the people who add friction. They make me take mandatory training for X or Y, or I have to go talk to them if I want to go do Z about a thing. If I want to sign a contract, I've got to run it by legal. Right? I mean, that's how we think about it. Most people think of legal as just speed bumps. I don't think it's true, but that is how most people think of it. Hey, we can measure that, right? How long does it ... Here's a thing we can measure. How long does it take on average for someone who has an inquiry to a lawyer to get a useful answer? That's a measure of latency. Not any different than request-response latency in a system.

Since every interaction I am forced to do with a lawyer feels like a bug, like no one wakes up and says, "Woo hoo, I want to go talk to the lawyers." No one wakes up and says, "Woo hoo, I want to call customer support." We can count how many of those interactions are we having? How many of these inbound requests, just a straight up count, just like our ticket count in a customer support system. A big concern with engineers particularly talking to lawyers, or anyone talking to lawyers, is the lawyer will say no, and that will impact my velocity to do a thing. I want to go launch this project. We've been through legal review, you can't use that trademark. Oh, son of a bitch. Okay, now what? Right?

So, same thing with good minutes bad minutes. How often are we getting requests where our answer is, "No." If that number is too low, you think, holy cow. 100% of our requests we're answering yes to. That's fantastic. No it's not. That's awful. Strictly speaking, all of those requests were useless. There wasn't a form or a really easy doc that people could read so they don't even ask you. You couldn't just send them to a thing where they're like, are you X? Are you Y? Are you Z? Are you condition A? No? You're good to go. Congratulations. You have been blessed by legal.

So that's a value too low. On the other hand, a value too high is also probably a problem. If 86% of the inbound requests are being answered with, "No," that's a bad user experience. And again, that's probably a training thing. There's something like, people have unreasonable expectations, or we have ... Well, one of us has unreasonable expectations about how this is supposed to go. We can measure those things in a legal department, and we can report those metrics into a dashboard. They'll fit in my Datadog just like my other count and latency and good minutes and bad minutes will from my software systems or my marketing systems. There's actually no reason not to instrument those things and put them right next to all the other things so that everyone can consume them as monitoring dashboards. Why should it just be my software that gets instrumented this way?
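
As a rough sketch of those three legal metrics (request count, latency to a useful answer, and the share of "no" answers), something like the following would do. The `low` and `high` band edges are placeholder assumptions; the talk only says that near 0% and 86% are both unhealthy.

```python
from statistics import mean
from typing import List, Tuple

# Each request: (hours until a useful answer, whether the answer was "no").
Request = Tuple[float, bool]


def legal_summary(requests: List[Request], low: float = 0.05, high: float = 0.50) -> dict:
    """Summarize count, average latency, and whether the "no" rate sits in a healthy band."""
    if not requests:
        return {"count": 0, "avg_latency_hours": 0.0, "no_rate": 0.0, "status": "no data"}
    no_rate = sum(1 for _, refused in requests if refused) / len(requests)
    if no_rate < low:
        status = "too low: consider a self-serve doc or form"
    elif no_rate > high:
        status = "too high: consider training or clearer expectations"
    else:
        status = "ok"
    return {
        "count": len(requests),
        "avg_latency_hours": mean(hours for hours, _ in requests),
        "no_rate": no_rate,
        "status": status,
    }


always_yes = [(2.0, False)] * 10
mostly_no = [(4.0, True)] * 86 + [(4.0, False)] * 14
print(legal_summary(always_yes)["status"])
print(legal_summary(mostly_no)["status"])
```

The returned dictionary is shaped so each value could be sent to whatever dashboard already holds the software metrics.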

Our leadership, our CEO, et cetera, our board, they're customers of the legal department, right? They would like to know, hey, are we getting sued? Are we in trouble somehow? That's a thing they would like to know. Or, very specifically, what is the expected value of cost for these things, right? Because it is essentially an expected value problem. We're being sued. What's it going to cost us? What's it going to cost us to defend if it's a thing we need to defend? What will it cost us if we choose not to defend it, if we choose to not respond? That's a real thing. Companies really ask that question, et cetera. That's how you get settlements and other things.

But again, that's an area where we could apply SRE principles. We can monitor and measure those things. And those look, you hope, Jesus you hope, look a little more like the black swan things, like rarely occurring, high impact things, like a big fine from a European country you hope is pretty rare. But as you become popular, you're going to get sued a lot. That's just how it is, I'm sorry to say. We live in a world where it's very easy to sue people. You're going to get sued all the time, alright, if you become even kind of popular as a company. None of you in this room may have to see it, but it's good for you to know that it's happening and there's no reason not to instrument it. Just like I don't need all the individual details of an incident in the dashboard, because maybe there's customer PII. I don't need all the individual details of the complaint, let's say, in my dashboard.

But I'd like to know we're being sued, and our dollar risk overhang looks like this, therefore our error budget is being burned down. By the way, before somebody asks me, no, I don't have a great answer of what a good error budget policy should be if we get sued a lot. That's an excellent question. I don't know. But something will have to change. Something will change. Okay, how about this last customer of the legal department, our users? Like it or not, our users, our end users, are customers of our legal department. We all have click through agreements, we all have to negotiate contracts. Our users consume our legal department. So things we can measure. How much time are people spending in our click through agreements, right? Again, it's a curve. It's a sigmoid of some kind. If they're spending no time, you have a wall of text that nobody understands. If they're spending a ton of time, you have, I don't know, a small wall of text that could be comprehended if they spend a lot of time on it. You're looking for something in between. It's an inverted U somewhere. It's a thing we should measure.

We should be aiming for our agreements to be comprehensible. Same thing with our privacy agreements. How long do business contracts take to negotiate? Say you're a sales human, dum dum dum dum dum dum dum. You go to the lawyers and go, "I have closed the Glengarry deal. Give me the leads." And they're like, "Fantastic. We're going to read this agreement and red line it and it'll take you four and a half months to close the Patels," if anyone's ever seen the movie. No, we can't do that. It's got to be much faster than that. So we can measure latency. How long does it take to process a contract? That's a kind of a latency measurement that we can say how much of this thing turns out to be red lines, et cetera.

A customer cares about, do I the customer clearly understand my obligations as a customer and your liability as a company, right? Ashar talked about a company he didn't name, and I don't know who it is, about having some number of millions of dollars in SLA credits. Those are liabilities, right? Do I understand what your SLA says? Do I know when I'm entitled to service credits or not? So again, things we can measure. And then we can measure how many SLA credits we have and so forth. Same thing about blameless postmortems. My father was a lawyer for a long time, and I can say for absolute certainty, there is no such thing as a blameless culture in a law firm. You lose a case, you lost the case, you're a terrible lawyer, hit the bricks. Or, or maybe there's this other thing. Let's look at it as, you put it, I just got really unlucky. How did I get really unlucky? Because you spent a lot of money recruiting me and training me to do this and making me fit in. You gave me a lot of support for this thing.

So, let's go look at it a different kind of a way. Okay, so those are ... I can see some eyes glazing over. This is fantastic. You know you're really succeeding when people are falling asleep. Those are some ways we could just frame it so we can start thinking about how we can apply some core SRE principles outside of engineering. I like to say that SRE is a business philosophy that grew up in a technical environment. SRE is not a technical discipline. It's not. It's a business discipline that grew up in a technical context. So, it sounds technical because it has technical terms, but it can be applied everywhere.

I'll tell you a quick anecdote. My first job at Google was to build and run customer support for Google Cloud. And when I got there, one of the things that I was displeased about was that our dashboard, our incident response dashboard, was very robotic. It was all, I would say it's like cop speak. You ever seen a police officer talk on TV? Instead of saying, "We caught the guy crawling out of the window with the TV," it's, "We observed the perpetrator egressing the building through the non-door horizontal entrance with the picture viewing device." Right, you saw him crawling out with the TV. Great. Our support responses were a little bit like that. They were kind of robotic. They're very unhuman. So I went to our small number of support engineers at the time and I'm like, "Hey, why do these suck so bad?" And they said, "Oh, we're not allowed to write our own words. We have to pick from canned text that the lawyers pre-approve."

Oh, is that right? Hi legal team. This is to let you know I'm putting you on a pager. I'm sorry, you're what? You're going on a pager. They'll be delivered at the end of the week. Hi, yeah, could we step back for a minute? Who are you and what? Yeah, yeah, hi. New guy, building, running customer support. I'm putting you on a pager. Why? I understand that my engineers aren't allowed to post anything without your approval, and you don't want to be woken up at 3:00 in the morning because, I mean, all incidents happen at 3:00 in the morning, duh. Unlike, I don't know, where socks go, but it's not my drawer. So, all incidents happen at 3:00 in the morning.

So, you don't want to get woken up at 3:00 in the morning, so you gave them this really crappy canned language. Well, we're thinking ... That's cool. Cool, cool, cool. That might be minimizing our liability. You know what else it's doing? It's minimizing our customers, because they're pissed off. So I'm going to put you on a pager. Not a problem. We're going to run all these by you as we want to post them. We get X number a week. This shouldn't be any big deal. We'll run a pager queue, I'll teach you how to go on call. It'll be fine. They're like, "Hey, that sounds awful. I don't want to do that." Oh, yeah, you don't? Alright. Well, let's see. Let's see if we can think of something else.

Hey, here's an idea. Why don't we have an error budget? A what? My support engineers are going to post anything they want to, whatever they think is going to be helpful to the customers. And when you come in the next day, you can read all the things. And then you can say, "Passes muster, passes muster, doesn't pass muster," and we'll have a budget, just a count, straight count per month, of number of things that we're allowed to say that don't quite pass muster. If we exceed the budget, then I don't know, we'll find some other plan that makes you happy. Well, we don't know. I'm like, "Okay, here are the pagers." Fine. So we had an error budget. I forget what it was. It was something small. It was like 10 dumb things a month. It was a pretty small number. First month, we used zero. Next month, zero. Every month thereafter for five months, zero.

Now, you might think this is strictly better. We're getting human, humane responses in colloquially understandable English that convey information to users. Woo hoo. Legal's not unhappy with it, we're good. Wrong. We are consuming zero of the error budget. This is bad. How can we consume more error budget? Went back to the team. Hey team, how do we consume more error budget? And they're like, "Well, I mean, I guess we could give less useful answers." No, let's not do that. Inaccurate? No, let's not do that. Alright. Then, one of my engineers was like, "Well, I have an idea." Ooh, I like how this is starting. What's your idea? He said, "I'd like to start answering a small percentage of some class of questions with AI. I don't know how long it'll take, but I actually think we could take some load off the humans and provide the same technical quality." Really? I would like to know more about that please.

So he went and did a little prototype, a tiny fraction of our support cases. We didn't start with public incident posts first. We just did the support cases first. Tiny little fraction, and at first, the CSAT, the customer satisfaction for that fraction was awful. It was 35 points lower than the human interactions. We're like, yeah, we're not going to put that in production. But we refined it and refined it and refined it. After a course of about 18 months, Emily Gage, as she came to be known, engage ... I didn't name it, but cool. Brian wrote it, he can name it, became our highest rated support engineer in all of Google.

And then we started actually autosuggesting ... We still do human review, but we actually started having the AIs autosuggest text for interstitial updates. They still go through human review, but looking at the data and autosuggesting more than templates. It's an AI based system. And we started to take up our error budget a little bit. Now every month, we can count on sort of two or three out of the 10 are getting flagged by legal. But that's a much healthier kind of number. So, sorry, that's an anecdote. You can really actually apply these systems any place.
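
The count-based error budget from this anecdote can be sketched as below. The budget of 10 comes from the talk ("something like 10 dumb things a month"); the monthly counts and the wording of each status are illustrative assumptions.

```python
def budget_status(flagged: int, budget: int = 10) -> str:
    """Classify one month's count of posts that legal flagged against the budget."""
    if flagged > budget:
        return "over budget: revisit the plan with legal"
    if flagged == 0:
        return "unused: room to take more risk"
    return "healthy"


monthly_flags = [0, 0, 0, 2, 3]  # illustrative counts, not real data
for month, flagged in enumerate(monthly_flags, start=1):
    print(month, flagged, budget_status(flagged))
```

Treating zero consumption as its own state, rather than as success, reflects the point that an untouched budget is a signal to innovate.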

Okay, I saw your hand first.

‍Audience Member: How do you approach Goodhart's law, this idea that when you make a target or a metric that then people sort of game it, right? In the legal case, maybe they'd give looser or weaker or less scrutiny because they want to hit this latency target or whatever.

‍Dave Rensin: Yeah, for those of you who don't know Goodhart’s law, if I'm remembering it completely correctly, is whenever, as soon as you turn a measure into a metric, it ceases being useful as a measure, right? Because people game it. It's an excellent question. The best way I know how to do it is to take away people's incentives for gaming. Why do people game things? Because they either think some punishment, blame, is going to be assigned to poor performance in the metric, or they think some reward is going to be ... to superlative performance in a metric. That's the best way I know to avoid Goodhart's law. Otherwise, you just get into a vicious cycle, and you're constantly having to be ... And then your metrics become, they go from simple and relatively straightforward to Byzantine. There's an incentive to create Byzantine metrics so that people can't game them, and then the whole thing just collapses.

You know where Goodhart's law really becomes a problem, but not my problem because it's not my part of the business? Well, it's all my problem, but nothing I can really do anything about. Sales quotas. A perfect example of Goodhart's law. As you interact more with your sales teams, you will find people who will, for example, have the opportunity to close some piece of business on the last day of Q2, but they've already hit all of their targets for Q2. So they slow walk the deal so it doesn't close till Q3 and then get a head start on their quota for Q3. That's a sadly common way people ... But anyway, the only way I know how to mitigate that is to do everything I can to de-incentivize, to take all of the actual incentives, not just tell people, "Don't do that," but actually take all of the incentives out of gaming it.

Yes, did you have a question?

‍Audience Member: I was going to ask, do you tell them to do dumb things?

‍Dave Rensin: No, I don't tell them to explicitly do more dumb things. I think I would not have my job for long I think if I did that. But I do tell them, "Take more risk." That's actually how I phrased it. We're underconsuming our error budget. Can we use that space to innovate and take more risk? How can we ... And we have a lot of ideas, like maybe we could hire contractors who will probably be less technically deep in some of these things but could do a higher volume than we can and shove ... shunt some of it. I didn't mean shove. That was a misspeak. Shunt some of this stuff to them. And my answer was, "Yeah, I don't want to do that. Find something else." And then it was, hey, we could probably trade speed for quality, which is where the AI came in. And that's a reasonable trade, because again, there's some volume that your users kind of don't care about. I can probably do two more questions before we're at time. Or I can do no more questions. That's also fine. Whichever you would prefer. Okay, yes.

Alright, the question for those of you that didn't hear it is how often do you evaluate if the metric you are measuring is the right metric to measure? Okay, actually two bits to that. The target you would like the metric to be, your error budget, your SLO, do that at least once a quarter, okay, the target. Now, is it the right metric? Once a year is probably a good pace. It's friction to change a metric, because you've got to change all the downstream monitoring. Doesn't mean you shouldn't do it. And the way you do that is ... So the way that would normally come about is I have some metric latency, let's say, right? And then I have a target, I have an SLO target which means I have an error budget. And I'm finding that this month, I blew the heck out of the error budget, but nobody complained. Oh okay, maybe the error budget needs to be looser.

And then next month ... So that's something in the back of my head. Maybe I don't adjust it. Then next month I find I'm way under my error budget, and man, people are unhappy. Huh. Those two results don't match. I probably picked the wrong metric, because it doesn't seem to correlate with my user happiness. So, the answer is that you want to do it in an interval that you can really see if the metric actually correlates to user happiness. Probably that's not more than once a year. But the target you're doing at least once a quarter, maybe once a month.
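
The review logic in this answer can be written down as a small heuristic. This is a sketch under assumptions: each month is reduced to two booleans, and the "tighten" branch is an inference the talk does not state explicitly.

```python
from typing import List, Tuple

# One entry per month: (error budget was blown, users were noticeably unhappy).
Month = Tuple[bool, bool]


def review(history: List[Month]) -> str:
    """Suggest whether to revisit the target or the metric itself."""
    blown_but_happy = any(blown and not unhappy for blown, unhappy in history)
    fine_but_unhappy = any(not blown and unhappy for blown, unhappy in history)
    if blown_but_happy and fine_but_unhappy:
        return "metric does not track user happiness: reconsider the metric"
    if blown_but_happy:
        return "target may be too tight: consider loosening the budget"
    if fine_but_unhappy:
        # Assumption: the mirror-image case, not spelled out in the talk.
        return "target may be too loose: consider tightening the budget"
    return "metric and target look consistent"


print(review([(True, False)]))                 # blew the budget, nobody complained
print(review([(True, False), (False, True)]))  # the two results don't match
```

A real review would look at more than two months and at how strongly the metric correlates with user happiness, which is why the metric itself is revisited far less often than the target.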

‍Audience Member: What if there's one of those catastrophic events that drives you to needing to change those metrics?

‍Dave Rensin: So, the question is what about a catastrophic event, a big black swan once every five-year kind of a deal? That may or may not cause me to change a metric. Probably what that will make me do is ask, how could I ... Descartes' rule, you can't go from zero to two without passing through one, right? So how did I get from zero to a billion so fast? What could I do to mitigate that, and the first question I have is could I have noticed earlier? And that's where maybe the question of the metric shows up. Are we just measuring the wrong things that we didn't notice; we were blind to this growing tsunami of terrible? Most of the time, it's actually we just had the wrong threshold set. We underestimated the fan out or the blow up of the incident, but we had the right metric. But that might be a case where we would look at the metric if we've concluded, no, we were just blind. No target we would've picked on any metrics would've caught this in time. We need to measure something different. That answer your question?

Alright. Yes, one more. You sir.

‍Audience Member: So you mentioned sometime you have the wrong threshold, so in case you notice the thresholds being wrong, do we manually correct it or do we have any auto-adjustment mechanism?

‍Dave Rensin: Ooh. In the beginning, it's manual, because it's a business decision usually. When you have a system that is mature enough, then you can automate it. But even if you do automate it, you have to periodically review, usually quarterly, is the automation doing a good job of setting the right metric? I'll give you an analogous example, autoscaling of a system, right? Actually, it's not just an analog, it's almost a perfect analog. In the beginning, I have a system. I'm not quite sure what my demand curve looks like, so I statically scale it, and maybe it has ... Then I have units of dynamic scaling, add 10 when I see the spike kind of a deal. And then eventually, I build systems that are very good at tracking the patterns because I have a mature system and knowing when to scale up and scale down the targets.

It is exactly the same thing with error budget targets. So if I'm a retailer and I have a system, the error budget I want on February 23rd is probably different than the error budget I want over Thanksgiving. And so you can start to automate those things. The secret to any automation is finding patterns, having forecast accuracy with small enough cones of uncertainty, and that really just means doing enough measurement over an extended ... You can only do that with very mature systems, because you have to have enough history, and it has to be stable, meaning the metrics have to be fairly stable before you can automate those things.
