Hannah Culver for Blameless

Bringing Operational Excellence to Dev with GitHub's Lauren Rubin

Originally published on Failure is Inevitable.

At the 2019 Blameless Summit, Lauren Rubin spoke about how to bring operational expertise to development teams.

The following transcript has been lightly edited for clarity.

Lauren Rubin: I was going to ask for a show of hands of how many people here are on call right this minute. I am actually on call right this minute. I like to live dangerously. If my phone beeps, the specific noise that means I have been paged, I'm sorry, I am going to look at it.
I titled this talk "Putting the Ops in Dev" or "How to Stop Worrying and Love the Pager." And a lot of what I was thinking about when I wrote this talk was trying to start a grassroots effort to encourage developers to want to learn these operational things because, in my career, I've worked at that tiny scrappy startup where there are 10 people and you're that one Ops person. They're constantly like, "Hey, we're having a problem with something that has an LED in it, can you come help please?" Other places ranged all the way up to what we now call global scale or planet scale companies that have huge, huge moving systems with thousands of employees touching thousands and thousands of moving parts.
The rest of us who are more operation-focused can take this back to our own organizations and to our own open source projects that we might be working on, and we can evangelize why it actually is helpful for the developers to learn operational concepts. In fact, it's going to reduce the overall amount of pain for all of the human beings at your company if developers understand and think about certain operational practices.

I spent three years at a very large company doing basically just postmortem meetings. I don't care for that term, but I will continue to use it because everybody knows what I mean when I say it. Let's just say that this company graciously allowed me, for three years straight, to lead a very high-stakes incident review a couple of times a month. The amount of skill and craft that I developed from that is very interesting. And it's very interesting to know that one person in your postmortem meeting can start making a difference.

So, what I learned collectively is something that I actually noticed I couldn't find when I started to take computer science classes.

I actually have a nontraditional career path. I decided in high school that I could be making a lot of money in the first dot com boom. I'm dating myself, but you already know how long I've been doing this. And two weeks out of high school, I left and I had an assistant job. I started taking college classes on the side. That's what your parents tell you to do, or you're going to end up living in a box.
I went to the local community college, which actually has quite an interesting computer science staff because of their geographic location. A lot of the teachers wrote the programming language, or worked for HP and designed these really low-level, really interesting things that we all depend on. And it was a really good curriculum. What I noticed is that, when I would go to classes, none of what they talked about was particularly relevant to what I was doing every day. I would sit there and think, "Oh, okay, maybe this Unix class is going to be the one that teaches me more about being a Unix admin in the business world." No.

And I sat there and thought, "Why doesn't it apply to my job?" The answer is because my job in operations at a company is basically to sit around going, "What the heck is going on, and what do we do about it?" And they don't teach that in college. So I learned this the hard way, over and over again, being "the people in the basement," as we say.

I also learned that this is an underappreciated skill, and I think that's part of why it's not taught academically very often. But accidents, events, incidents, whatever you want to call them, have a 100% probability of happening. I will bet my entire year's salary that there is a 100% probability of an incident in your environment. We can get into some shades of gray of what's an incident. That's a whole rabbit hole of academic study, right? But I'm going to tell you things are going to go wrong, guaranteed. When you respond to that thing that is wrong, you're never going to sit back and go, "You know what, everything that we did in response to that unexpected situation, it was all awesome. Nobody made any mistakes, all of the tools worked. The power didn't go out. Nobody was sick, everything was just magic." That cannot happen.

And I think that operations and incident response is significantly different from other areas of business because you can, should, and will plan about what to do when you have an incident or an adverse event. But unlike many other things, there's absolutely no way to actually know what you are planning for. You will never, ever, ever prevent incidents by sitting down and writing out every single thing that went wrong in the last one.

Guess what, you're still going to have incidents.
And that's actually a really tough human concept: no matter how hard I try and no matter how much I plan, incidents are still coming, and they're not going to run smoothly. But what you can do is understand that and say, "Okay, I know for sure that someday, probably soon, something is going to be really not good. And I know that I'm going to need multiple human beings to notice this and decide they need to do something."

We need to plan for the human parts of this to say, "Hey, people involved in this. We support you. We have spent time understanding where the people have struggled during the incidents." We all have probably, even within the last week, had some conversations with our leadership at the company, maybe even somebody with a C in front of their title. And we say, "Hey, we need this thing."

Whatever this is requires resources, changes your..., or does something with your..., right? The executives are like, "Nope, that doesn't sound exciting."

The people in the basement who are watching something catch on fire are saying, "Hey, hey, hey, it's on fire." But while we're doing that, you would think most rational leadership would look and say, "Whoa, that's bad. It's been on fire 10 times this week, we don't like that. We don't want it to be on fire." The problem is that while you're jumping up and down, "It's on fire. Look, there's smoke, what's going on?", the marketing department has a stage with a band and a crowd that they've paid to show up, and this marketing organization walks out on the stage looking great.

The production value of this is so good and they say, "We have a life changing thing that if you give us all of the money from next year's budget, it will be amazing, unbelievable."

And obviously marketing is going to win that discussion. But hope is not lost because we can change the attitudes of the people at the company. And once enough of the developer people that have lived through doing operations make it up into management, they will say, "Oh, you're telling me that operationally this is not tenable. Now, I'm listening."
I spent most of those 20 years as the very traditional Ops gating role: "Oh, these developers, they keep throwing this code over the wall. The only thing that we can do to get our bonus and keep the site up is to stop them from doing it because every time they do it, it's going to break." The rules don't help that. In fact, they make it worse because your idea for a grand process and tooling system that prevents people from making mistakes is not going to prevent them from making mistakes, but it's going to prevent them from being flexible and creative. And they're not going to like it.

So it becomes this adversarial thing where people want more rules to make operations better. But you're not actually changing anything; you're not changing the dynamic between the teams. You're not changing who is responsible for and paying attention to these things.

When assessing things in your process or in your company that you think could be better, if your first idea is to look at a bullet list of problems and come up with a bullet list of rules, you're probably doing it wrong. But here's the positive side of when anyone, whether it's all of us who have learned this the hard way or if it's a developer, wants to learn: you become able to quickly detect, identify and diagnose problems so that you're catching them when they're smaller problems, and they're easier to work on. You become able to prioritize things that are going on.

Every company that starts to track what's going wrong is just going to have a long list of things that will never be actioned. That's the nature of it, right? How do you pick which of those things is actually important enough for you to respond to right now at 3:00 AM? As I mentioned, you can study all kinds of architectural principles in school, and obviously studying is important. We're just talking about learning. This is all just a system of getting better and actually learning from our real actions instead of waiting for someone to write us a textbook.

Having an actual understanding of how some of these architecture paradigms work in real life is extremely eye-opening. And the only way you're going to really, really understand the exact trade-offs in your environment is by looking very closely at what is going wrong in very small places. It makes everyone's job easier if the problems are being found more quickly because people are attuned to this.

That's a habit that we all have developed by being constantly in situations where somebody expects us to have known something was wrong. You have a spooky feeling of doom all the time: "Oh man, I know it's coming." And, "Service guarantees citizenship": you pick up that pager and you're part of us now, you're one of us. A lot of what I wanted to put in this slide deck is, again, general concepts that are base level, because no solution is correct for every company, even if they're the same size. And I find that line of thinking a little bit frustrating: we need rules, we need programs, here's a program that will fix this. Management loves this.

What do you do when you want to change something? Well, you decide what you want to change, you decide how you think you're going to change it, and then you move this massive thing into place and it starts changing all of these things. And a lot of humans don't really enjoy that part of it.
Stick to the basic concepts, whether you're a one-person passion project posting this as open source, or growing from your one-person passion project and now have four people that are helping you, or even maybe 50 people. And when you decide to start a company because you can sell this, and now maybe you have a hundred people, all of these things are true the entire time.

What you do with that information and that understanding obviously is going to change quite a bit. But someone, somewhere, is responsible for what code is doing in production. If you're a developer and you don't know who is running the code in production, then some other poor person is now learning a lot of stuff about what your code does when users use it, and it interacts with other systems. But you, the developer, are not learning that.

The next point here is humans need to make real time decisions. Again, regardless of your size, when something isn't right, and it's especially something that you haven't been able to plan for, you want people because people have this amazing ability. We're trying to spend all this money across the industry to do AI, to just replicate like the reasoning of a five-year-old human.

When you look at an outage, don't say, "Oh no, a person made a mistake. How can we cut the person out of this loop?" That is not a good first step, right? So responding to these operational surprises, as I like to call them, is a skill we've all practiced. The skill is that you're looking for problems and figuring out how to solve them, and then finding them faster and solving them faster.

One thing I will say is the less of this operational expectation that you're receiving from outside, the more you need to bring from yourself. And that can be extremely effective, because you all have teams in your companies who are doing very poorly. We have Dev teams, and their product is not doing well. And then there are some developer teams that are running their own stuff, and their availability is great. And how does that work?

The answer is they have their own structure. They have somehow figured out how to learn these things quickly and take care of them. So, the first thing that I'd like to promote as part of my grassroots effort here is DevOps culture.

This is a great time for an anecdote that is very recent and top of mind for me. Again, I come from this Ops background, Ops for life. When we all get together, those of us who are basement dwellers share our pain about all of the times developers have done something wrong and the system has failed. And we talk about those things. You'll tell some story, and some other person who's done operations is going to go, "Wow, it's like you've lived my life."

If anyone has read the book The Phoenix Project: I read that, and I was like, "It's made up, but this is literally so true. This has happened so many times, it's uncanny." I get up here and I'm like, "Let me tell you about the time a developer made a change that was not tracked in any way to a very important system, and then left their desk. Three hours later, after the entire site has been down and CNN is talking about it, somebody figures out that this person changed something, and finally figures out how to get ahold of this person."

Once that thing is done, it's fixed in like two minutes. And when I tell that story in this room, everybody's just going to say, "Yeah, that happened to me last Thursday."
But when I go talk to people who are Dev only and not Ops people, especially in my role when I was doing postmortems, I would say, "Hey, how's it going? Let me explain what I do. I analyze what went wrong in an incident and what we can learn from it and how we can do better." And then I would always get this story from them. They'd be like, "Oh, let me get this off my chest. You're the person I can tell about this. This one time I accidentally pushed a change that caused a 50% error spike."

And of course, I'm just like, "Oh, well, I've never even heard of that event, that doesn't rate on my scale of caring. That's too low level." But every time, they're very circumspect and they're like, "Wow, I feel terrible that I had this effect." And I would always ask, "Well, did you tell your friends that you took something down or broke something?" "No, no."
I find it interesting that the operations folks are happy to vent with each other and share this information. So DevOops culture is something that I want to promote amongst developers like, "Hey, when you make a mistake, you should tell everyone. I want to hear about your mistakes before your successes."

And that's not how the business world normally works, right? So that's why this is grassroots. So here's how you do this in very broad strokes.
There is very little advice I would give you regardless of what company you're coming from, but this is the line we're going to start giving to our developer coworkers: you should be the first to know that something is not going right with your software because you are the person who is the expert. This is literally what we pay you for: to make this, to know what it does, and to know when it's doing something it's not supposed to be doing, right?

What is deployment confidence? Depending on your organization size and the size of your project, that changes a lot. Maybe you have a giant dashboard or 20 dashboards. Maybe you have a crazy cool CI system that, when somebody deploys, puts it into canary or some other limited testing, and automatically looks at what's happening and says go or no-go. That's the end stage, everybody wants that kind of thing, right? You don't need that.
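Even that end-stage automated gate usually boils down to a simple comparison, which is the same question you can answer by hand on a small project. Here is a minimal Python sketch of that kind of go/no-go check; the metric names, numbers, and threshold are hypothetical, and a real pipeline would pull the rates from your monitoring system rather than pass them in by hand.

```python
# Minimal sketch of a canary go/no-go decision, with made-up numbers.
# A real pipeline would pull these rates from your monitoring system.

def canary_go_no_go(baseline_error_rate: float,
                    canary_error_rate: float,
                    max_relative_increase: float = 0.10) -> bool:
    """Return True (go) if the canary's error rate is not meaningfully
    worse than the baseline's; False (no-go) means roll back."""
    if baseline_error_rate == 0:
        # With a clean baseline, any canary errors are suspicious.
        return canary_error_rate == 0
    relative_increase = (canary_error_rate - baseline_error_rate) / baseline_error_rate
    return relative_increase <= max_relative_increase

# Example: baseline serves 0.2% errors, canary serves 0.5% errors -> no-go.
print(canary_go_no_go(0.002, 0.005))  # False
```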

What you need to know is "What happened when I pushed this, what changed? Did it change the way I expected it to?"

Meaningful alerting. This is something where I have recently had a really interesting set of conversations with developers. A lot of the time they say, "Oh, I don't want to be on call because everybody who's on call right now in real life, their pager is constantly going off, and they're constantly annoyed."

And we've all lived through noisy alerts, paging me every time there are five errors at 7:01. So if you as the software developer are writing your own alerts that monitor your own software, A, you know what to monitor, hopefully a little better than some random SRE or some random operations human. And B, if it pages you too many times, great, fix it. That's it, nobody needs to suffer. If you just do it yourself, then you can just reduce the amount of paging, because otherwise you have to wait for somebody else to get so annoyed at your alert that they come to you. Just skip that whole thing and figure it out.

It's how you learn to make better alerts; it’s how you learn what your software is doing.
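For illustration, here is a rough Python sketch of the difference between "page me whenever there are five errors" and a rate-based rule over a window. The window size, threshold, and minimum sample size are placeholders you would tune against your own traffic.

```python
# Sketch of a rate-based paging rule instead of "page on any five errors".
# Thresholds and window sizes are placeholders to tune for your own traffic.
from collections import deque
import time

class ErrorRatePager:
    def __init__(self, window_seconds=300, error_rate_threshold=0.05, min_requests=100):
        self.window_seconds = window_seconds
        self.error_rate_threshold = error_rate_threshold
        self.min_requests = min_requests          # don't page on tiny samples
        self.events = deque()                     # (timestamp, was_error)

    def record(self, was_error: bool, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, was_error))
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def should_page(self) -> bool:
        total = len(self.events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / total >= self.error_rate_threshold
```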

When you're writing a really small project, even if you're a member of a big company and just starting something that's greenfield, it's tempting to kind of skip all of that observability stuff: real error messages that actually tell you where to look for the problem. Otherwise, you're sort of like, "Oh no, I got a generic error. Let me search through the code base or search through Stack Overflow and see what this does."
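As a small, made-up example of what "real error messages" buys you, here is the same failure raised generically versus with context; the function and service names are hypothetical stand-ins.

```python
# Hypothetical example: the same failure raised with and without context.

def save_quota(user_id: str, new_quota: int) -> None:
    # Stand-in for a real call to a quota service.
    raise ConnectionError("quota service unreachable")

# Generic version -- all you learn is that "something failed":
#   raise RuntimeError("update failed")

# More useful version -- the error says what was being attempted, to what,
# and with which inputs, so whoever gets paged knows where to start looking:
def update_user_quota(user_id: str, new_quota: int) -> None:
    try:
        save_quota(user_id, new_quota)
    except ConnectionError as exc:
        raise RuntimeError(
            f"update_user_quota: could not save quota={new_quota} "
            f"for user_id={user_id}; quota service unreachable"
        ) from exc
```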

The first part is your cycle that you set up because you know that's coming. The second part is also accepting you're going to get some surprises. And what do you do when those surprises happen? The thing that is important and that makes for successful faster remediation of these incidents, faster restoration to the customer, is communication. You need to support the people who are responding so that they can bring their full attention to this incident.

While you're doing all of this, think of ways to improve that. After you've responded to a few of these incidents, come back and close the learning loop. That's a huge core SRE concept. You cannot prevent incidents, but you can make dealing with them better.

No matter what size your project is, list out your tools, and your tools might be a pen and paper if it's just a one-person project. Yeah, I write stuff down, I keep notes, I decide what my feature list is later. And then when something doesn't work, I write down that a friend or another team told me this thing is not working. And you stick your post-it on your screen. That may be all you need at a smaller size. But know what the tools are.

It's easy to know that there's an entire microservice environment at your company that's just dedicated to incident response. There's 50 million of them. But also consider everything that you might use or touch or read or look at during responding to an incident. Look at that as a tool that you can evaluate whether it's helping you or hindering you or maybe it's 50/50 and you could just pivot it a little bit. Write your process down.

This is a super, super interesting thing to do because if you work for a larger organization, there's probably some kind of written incident response process. If you work for yourself, just one person, you still have an incident response process, you probably just haven't written it down before. Go look at your written process if it exists and write down what you think the process is based on your experience of doing it. Compare it to what's written down. If it's not written down, maybe you can write that down. Which part is not working? Which part is working great?

You need to know which parts are working well so that you can evaluate those also, because something that works for you really well right now won't work for you forever. And it's impossible to know in advance where that line is, but you can just be aware of it, and it's one of the things that you're trying to learn as you go through these steps. Maybe if you have multiple people on your team, see where people don't know about something, see where the gaps are. Maybe you sit down with the developers that you're working with if you're an embedded SRE, and you ask them to join you in this.

These are some kind of super simple examples of what you might get when you write this down: this is my one-person passion project. Maybe somebody hits me up in Slack and says, "I tried out that thing you pasted me a link to, it didn't work." Whether or not you open a ticket for that, probably nobody is forcing you to do any of this.

So maybe your real process is you file that away in your mind or a little post-it note or something. But it doesn't require or involve anything particularly costly or even annoying. There's not a lot of steps here, it's all very simple. But just to understand that that's where you're at right now is very valuable, because continuing to look at this and comparing what you're writing down here against your actual incidents is how you eventually end up at the other end of the spectrum. This is your global scale.

No offense to Google, this is the way that Google does it because they have been on this journey from this very first one-person project, and they're at the end stage of that journey. And they're a really big company with a lot of money, and there's a huge difference. We're talking about pen and paper versus again a cloud, an entire cloud just for incidents. Don't think that that is everyone's end state. You don't need all this big company stuff possibly ever depending on some of your approaches to things.

So one thing that I like to use as an example of introducing people to incident management in general is the FEMA incident command system. And here are the reasons: it is cross-disciplinary and it is based on actual real emergency response. I was trained in FEMA incident command when I was an EMT. I trained for mass casualty incidents. I trained for these things to happen because we know they're going to happen. We're not going to prevent awful tragedies from occurring. But what are we doing when we get there?
I worked one mass casualty incident, which I am happy to tell you didn't really involve many casualties. If anyone remembers when Sully Sullenberger landed that jet on the Hudson River, I was a volunteer EMT. I showed up, and I was not even wearing a uniform. But because I am so skilled at operations and because I am so calm during those crazy events, people saw me in my civilian clothes looking very calm, and they assumed I must be really important. And I was talking to the head of the state police and the FBI and all this stuff and helping them out and doing things.

This is a great system to understand the concept. I do not think you should just replicate ICS; I don't think anyone should just do ICS for computer stuff, because it was designed to help fight wildfires and plane crashes and terrorist attacks. These are some of the key concepts that the incident command system I think encapsulates very well. I like the completeness of the incident command system because it addresses a lot of these things that are very, very crucial and/or just helpful in responding to incidents.

Here's the breakdown: once you've read ICS, seen a real system at work, it's a great example of an actual usable incident command system. What it's providing for you is, number one, a facilitator. This is what a lot of people refer to as the incident commander or the incident lead. And this role is super important because the facilitator or multiple facilitators of your incident actually can kind of make or break your response.

This is an area where, again, there's not a lot of specific training that explains the concept of facilitating an incident response. I've read so many internal company guides of how they do it at this one company. But obviously, they're all very different, and they don't apply to anyone else's use case. So you need a facilitator, you need an appropriate engineering response. And what I mean by this is the facilitator exists there to keep track of what's happening, what is needed, where things are at. And the appropriate engineering response is whatever subject matter experts might be required, they need to be gettable by the incident commander. We need to be able to find the people who can solve our problems.

It turns out it's really tough to facilitate and fix at the same time. So that's viable in one-person situations only. Once you get larger than that, if you can possibly bring more people into your incident, it's important to do so. You need to communicate to upper management because they want to know why they're losing money right now. You need to communicate possibly to the public. Part of my role as an incident commander right now is literally to tweet on behalf of my entire company to say I'm really sorry to tell you that it's broken right now. And we need to foster the communications between the people that are that appropriate engineering response.

But there needs to be an understanding; this is the agreement on the plan. The people who you're going to page need to understand, "Hey, I'm going to page you, here's what it means. Here's what we need from you when you get paged." I'm only going to page you if it's really important, if I can help it. And then I expect the following things to make it go better. Accurate timeline and data.

Let me tell you how many actual fights I've seen in an office because we started a postmortem meeting without a detailed timeline and some teams thought that their activities and such were not accurately represented. Having the ability to easily collect that information while you're going is extremely helpful. A lot of times places will just record the phone conversations or the Zoom chats or the Slacks and then some poor human being has to go through that entire thing and write a transcript or go pick out the things that are meaningful. That's a waste of time.

I have personally never done that, having run a million of these high-stakes postmortem meetings. I've never reviewed an eight-hour transcript. Nobody wants to relive an eight-hour crappy incident. Make it so that while you're doing it, you already know what you're going to talk about later in those meetings. You've been to a meeting or two. You already know the kinds of things to call out, like, "Yeah, we've talked about this sort of thing, didn't go so well last time, seems like that's happening again." This is all operational mindset; which things stick out in your mind is a result of experience. And all of our developer coworkers are just as capable of learning these things.

So you need that data to be able to review. And when you're reviewing here, you're not looking for, did we stop all the incidents from happening? What we're looking for is, "Is anything from this incident like the last one? Do we have other things that keep coming up? Are we noticing trends that maybe, if we address them, will improve things overall?" And how do we just make people feel better? That's a big thing.

Postmortems, super hot right now. Everybody likes to talk about them. Since we've already heard a little bit about it, most of what I wanted to say is be careful with your postmortem process because there's a business need to find out what happened and what we're going to do. That's just a regular business need. But when you get into these meetings, if people are not participating in these fact finding types of meetings on purpose, if you start with some management person saying, "I don't see anyone representing this team, this team, and this team, you go get them in here right now," I can tell you that that meeting is a waste of time. Just don't even have it because all you're doing is making a bunch of your employees really mad, assigning them a bunch of work they don't want to do.

They will let those corrective action tickets fester in their queue or their backlog or their icebox or whatever it is they're using forever. And even if you get their VP to tell them to fix it, they still don't want to fix it because in their mind they were called into a meeting, yelled at, and then told how to do their job a different way.

So the postmortem is not a place for accountability. You need accountability in your business, and it doesn't belong in the postmortem because if people feel like they are defending themselves and their product, they're not willing to raise their hand and say, "Actually, my product wasn't even really involved in this particular incident. But I think if we significantly changed how we're doing things, we would help a couple of these other teams."

That only happens when they feel like they're here to solve a problem and not here to account for their actions or to defend themselves or to defend their product and their honor. I know it can be difficult to sell management on caring about the humans, as it sounds very wishy washy, and it doesn't sound like a thing that has a place in business. But having experienced the difference between a meeting where people are pointing fingers at each other for real and screaming and having those same people come back, even though upper management wouldn't let me stop blaming them, I just stopped blaming them anyway. I did some guerrilla warfare stuff here, I just made things happen. Those same human beings who seemed like they weren't capable of having a civil conversation were suddenly volunteering to work with each other and giving themselves work.

And guess what, when you volunteer to do a thing because you think it's helpful, you're pretty motivated to do it. You don't need some person to come nag you and say, "You have 50 of these, why haven't you done them?" because you're interested in improving things and you want to learn about them, and you want to help.

So another big thing SRE, operations, and development talk a lot about is change versus stability. This is the tension that we have, right? And the reason that the friction comes up between Dev and Ops is because Ops wants no change because no change is stability. Guess what, if you don't touch things, they break less. Great, that's what we want. Change is how you get the new cool features, developers want that, management wants that. Everybody wants that except whatever human being has to deal with the fallout of that and has no meaningful input into making any changes. So it's definitely a very tricky conversation to have.

Depending on your organizational size, it may be appropriate that you have entire teams of people who are dedicated to making tools and processes that track what changes are happening, make it possible for you to know what changed. But in a lot of cases, that is not a well-developed area of the company.

This is something that's been changing business-wise. It used to be that I got to submit my change board and look at all the changes that we're trying to schedule for the week. And now with continuous deployment, your management will come to you and say, "Hey, you can't stop my developers from shipping or they'll quit. Do you know how much it costs me to get those developers? How dare you? No, I want them to ship constantly."

So here's what's interesting about continuous deployment is, we went from a model in business where we were pretty sure we knew what was going to change. And if we didn't know what was going to change, that's the fault of the person who changed it. We see you, we're coming for you. We're going to make it so you can't make changes without letting us know first. But that doesn't scale to the current level of what we're doing. So how can we allow rapid change and mitigate some of that risk?

Well, part of the problem is all of this change. There's so much change and so many more moving parts, and people want even faster change. We have entire teams that we pay to make it faster to change things now.

You may not even be able to fully understand the complexities of your own service, the way that many software packages work these days. Just while you're making it, it gets really complex. And as it grows over time, people have been doing different things. You don't know how this section of it works. You start to pair it with new services that come online in your environment, and they've never talked to each other before. Now you have a bunch of services and nobody quite knows how they really interact with each other. You have users using them too, most of the time.

This is a recipe for absolute confusion. You need everybody that writes code to understand and be responsible for their own code because it is absolutely not possible to staff an Ops department that receives the software and just runs it in production. It cannot work that way anymore. Cross-team chats and meetings are super helpful for this.

Say you become aware that your software doesn’t work well with another team’s, and it's an important flow and customers care about it. You keep having these problems, so you start a Slack channel together with them. When you see something a little weird that isn't worth opening a ticket about, ask there. Encourage that kind of behavior that, hey, you need to know what's going on, you need to know what your product is doing, and you need to support other people in using it.

Also let me just say, it won't kill you to write a document. Everybody's like, "Oh, nobody's ever going to fill in the document." You can get people to fill in a pretty bare minimum amount of documentation, but you do need a bare minimum amount of documentation even for yourself.

Please encourage people and thank them when they do it. Help them update it when you find something is wrong in a very helpful way like, "Hey, I heard this thing and it doesn't match what's in here, would you like me to update your document for you?" These are the sorts of things that, if you're a small team and you maybe don't have the benefit of a massive set of magical tools, you need to be able to figure out what just changed. Like, "Hey, I see this graph was like this and now it's like that, and that's not what I want. What was happening right before this, what do we do? How do we find that?"

Somewhere, regardless of the size of your project, there's probably a bunch of different logs and trails. Then you have the problem of “Okay fine, I just found 10,000 things that changed in the last five minutes. Which of those 10,000 things do I care about?” So you need to be able to quickly discern.

You need to develop a system that lets you know what the blast radius of each of these types of changes is. Am I looking for the ones where storage is involved? Am I looking for the ones where some other thing is involved? And if this is a YAML update for a non-production, noncritical system, I probably would not start there as my first place to look for what's going on.
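As a rough sketch of that triage, assuming your change feed can be boiled down to records that already carry environment and blast-radius tags (attaching that metadata is the real work), the filtering itself is simple:

```python
# Sketch of triaging a flood of recent changes by blast radius.
# Assumes each change record already carries environment / system tags;
# getting that metadata attached is the hard part, not this filter.

changes = [
    {"id": 1, "system": "storage",   "env": "production", "blast_radius": "high"},
    {"id": 2, "system": "docs-site", "env": "staging",    "blast_radius": "low"},
    {"id": 3, "system": "frontend",  "env": "production", "blast_radius": "medium"},
]

def likely_suspects(changes, affected_systems):
    """Surface production changes that touch the affected systems first,
    then any other high-blast-radius production changes."""
    prod = [c for c in changes if c["env"] == "production"]
    direct = [c for c in prod if c["system"] in affected_systems]
    broad = [c for c in prod if c["blast_radius"] == "high" and c not in direct]
    return direct + broad

print(likely_suspects(changes, affected_systems={"storage"}))
```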

Then you need to know who changed it: was it a person, was it a script? If it's a script, you need to turn that off. If it's a person, you might want to ask them what their change was supposed to be doing and whether we can roll it back, because it might not be possible. In some cases, you have to do some migration, and it's helpful to know that before you start. Have we done this before? Did it go well? Does it traditionally not go well? Do we need to spend a little more time paying attention to this one? And which of our types of changes have the most failure?

I want to be really clear here, I am not in any way suggesting that we just take a bunch of lists of what went down and say, "Your team fails the most, you're in trouble." What I want is for us to take a look at the types of changes like, when we change things, how will these two things talk to each other? What has gone really poorly every time in the past? Maybe you see that in a few areas of the business if you're all talking to each other. Then you start to realize, maybe we need a culture shift, maybe we need a new tool. Maybe we want a new process. Imagine if developers were asking you for a new process for how to ship things. Can you even imagine that? Totally a thing that's possible, it has happened to me.

One thing I wanted to point out specifically is code freezes, coming from a deep ops background, are the best. You're like, "Yay, code is frozen! I can go on vacation, I can actually sleep. This could be so good."

Let’s say there's a code freeze because you are some sort of business that has a very peak season or day and these events are so crucial to your bottom line. Black Friday is a great example. It's called Black Friday because it puts companies in the black. That's a big deal.

It’s totally normal that management is going to say, "Okay, this is the one time we'll let you stop people from changing things because we absolutely cannot have our stuff go down on Black Friday. This is our one day of the year we make all the money. This is how you get your bonuses, this is how I get my stock.”

And then what happens in real life is, "Oh, there's a code freeze. Hey, I need to make this change. Well, you need VP approval. You need senior exec approval.” Guess what, they never say no. In fact, I've experienced where not only did they never say no to the exceptions, they came back in and said, "The developers are kind of annoyed that you keep making them get exceptions and that you won't let them work for this week, and they don't like it."

The executives paid all this money to recruit these developers, the best developers; that's what people were focused on. They don't like it when their employees complain, because they quit. Do you know how much that developer cost? So we can't freeze code anymore. Code freezes don't work because nobody wants to participate in them other than the people who are like, "Excuse me, it is my problem when this doesn't work."

If you feel that the only thing that you have left in your organization to protect yourself on a crucial day is to just freeze all of your changes, that's a great sign that you need to really, really look at how this is working, how you can do better, how you can improve your incident response and problem detection, and the people who need to do all of those things. Computers are not going to detect all of your problems for you. They're certainly not likely to detect the ones where it turns out you were down for 10 hours.

Those are the weird anomaly events that you could not have predicted, and those are the ones that you care about the most. So, chaos engineering is my one quick fix solution. You want to get good at ops? Good, do it the hard way. I know what happens when I unplug that database. It's really simple. Stuff can't access the database. I'll tell you what else happens, and I'm positive about this: not only can things not access the database, a whole bunch of other things that don't talk to that database also don't work anymore, and people are really confused by that.

With chaos engineering, you're scheduling your outages in advance. You're saying, "Listen, we don't know what happens when this particular database goes down, and we need to know. Let's pick a time. We'll talk about what we think is going to happen when we unplug the database. We'll talk about what we can do to sort of mitigate this. What can we do in advance? What planning works?" And then you go pull the plug.

Then you write down all this stuff that was a surprise, how exciting. We didn't know any of that was going to happen. But doing that over and over again is a great way to do what we've all done the hard way, except in a supported environment where you know 4:00 PM on Tuesday is when your database is going to be down, you know when your outage is.
So, you're thinking, okay, great, canceling my dinner plans. I'll make sure I have a babysitter or whatever, I'm going to come to work late, I'm going to sleep in. It doesn't have to be fancy. You don't need tools. You can buy lots of tools, you don't need tools. You just need to form a hypothesis about how the software works together. Figure out how to test that hypothesis, every single time you're going to be wrong. Then you learn from that and you do it over and over again. And if you do it on purpose, that's better than doing it the hard way. You're still going to do it the hard way, but you're not going to have to do it as often if you've done it the slightly easier way.
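If it helps to see the shape of it, here is a minimal Python sketch of that loop, with the hypothesis, the failure injection, and the checks all stubbed out; the structure (hypothesize, inject, observe, record the surprises) is the point, not the tooling, and the names in the commented usage are hypothetical.

```python
# Minimal sketch of a chaos experiment: hypothesize, inject, observe, record.
# The injection and checks are stubs; the shape of the loop is the point.

def run_experiment(hypothesis, inject_failure, restore, checks):
    print(f"Hypothesis: {hypothesis}")
    surprises = []
    inject_failure()                      # e.g. stop the database at 4:00 PM Tuesday
    try:
        for name, expected, observe in checks:
            actual = observe()
            if actual != expected:
                surprises.append((name, expected, actual))
    finally:
        restore()                         # always bring the system back
    for name, expected, actual in surprises:
        print(f"Surprise: {name} expected {expected!r}, observed {actual!r}")
    return surprises

# Hypothetical usage (stop_container and probe are stand-ins):
# run_experiment(
#     "Only the checkout service degrades when the orders DB is down",
#     inject_failure=lambda: stop_container("orders-db"),
#     restore=lambda: start_container("orders-db"),
#     checks=[("homepage responds", True, lambda: probe("https://example.com"))],
# )
```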

I like to summarize this up with the Mark Twain quote, “If it's your job to eat a frog, it's best to do it first thing in the morning.”

So at the beginning of your project, when it has two components, you have a Ruby on Rails thing that you stood up on a MySQL instance, and that's it. And it does very simple things. Turn off the database, what actually happens? What can you do to make it so your users can still do some of the stuff while that database is off? Now's a great time to figure that out, right?
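One hedged sketch of what "users can still do some of the stuff" might look like for that two-component app: fall back to a cached, read-only view when the database is unreachable. The query function and cache here are stand-ins for whatever your app actually uses.

```python
# Sketch of degrading gracefully when the database is off:
# serve a cached, read-only view instead of an error page.
# The query and cache are stand-ins for your app's real storage layer.

_cache = {}

def query_database(product_id):
    # Stand-in for the real MySQL query; raises when the DB is down.
    raise ConnectionError("database unavailable")

def get_product(product_id):
    try:
        product = query_database(product_id)
        _cache[product_id] = product          # refresh cache on success
        return product, "live"
    except ConnectionError:
        if product_id in _cache:
            return _cache[product_id], "cached (read-only)"
        return None, "unavailable"

print(get_product("sku-123"))  # -> (None, 'unavailable') until the cache is warm
```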

If all the developers are embracing chaos engineering for themselves and making the time to practice these things, they're going to learn what we have learned by doing this the hard way: the unplanned, the surprises. You just wait for the surprises and then you learn or you can make the surprises happen on a schedule, and you learn even more.

One area that I think is very tricky to discuss, especially with smaller teams, is the concept of operational maturity. I think this is an area where developers have often heard of it, especially in a larger company, but they don't really know what it means. They might think, you've operated for a long time and you're mature in it.

No, operational maturity is a way for you to figure out what you are okay to depend on based on your needs. If I'm a passion project developer, I need some EC2 instances, I need RDS or something. I'm going to think, well, all right, I can get this from Amazon. Is there a cheaper provider for this because I don't need it to be up all this time? Understanding what your needs are and being able to evaluate your choices based on those needs is really crucial. And that's difficult to do in a complicated environment because at big companies, you get all these platforms to make it easy to make tools. And if we make a new thing, it does something we wanted, it's great.

Let's say I'm going to develop an application to get the color of my coworkers' socks. The operational maturity I require for that is like zero. I don't care if it goes down. But I stand up the service and I use all the automated tools, and therefore other people that work in my company can find this service, and they see it. And they go, "Huh, I think I could use this in my project." So they build it into their project, and they've had this project in alpha secret mode. They're not even out of beta, and they've already been working on this thing and tested it. And they're like, "Yep, we got this. We've done everything, this is so awesome. We're ready to go, put it in production."

A month later, the database it relies on goes down and it doesn't come back. And they're like, "What was going on? Oh, where's the database?” It turns out the people running that database, well, that was their alpha project, and it never went down while these other people were using it. But there it is.

You need to be able to understand where you are in the cycle and to understand where other products are, whether that's outside or inside of your company. And you need to be able to express that in some language that everyone at your company can use to compare these things. Don't make them checklists where you have to get a score on minimum compliance. This should be a conversation for everyone that says, okay, do I know who owns this, is someone backing this with a budget? Is that important to me in this case?

The one last thing I wanted to touch on was SLOs. This is one of these terms that gets overloaded.

An SLI is a service level indicator. Can I ping it? Can I log into it? Is it up? How many HTTP 200 responses did I serve? An SLO is how much of that I want for my customers: my customers want HTTP 200 responses, so how many of them can be non-200 responses before they get mad and stop paying me? That's what your SLO is.
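As a simplified worked example of that arithmetic, with made-up numbers (the real math, as discussed later, gets more involved, with windows, burn rates, and per-journey targets):

```python
# Simplified SLI / SLO / error-budget arithmetic for an availability target.
# Real SLO work involves windows, burn rates, and per-user-journey targets;
# this just shows the basic relationship, with made-up numbers.

total_requests = 10_000_000          # requests served over the SLO window
failed_requests = 7_500              # non-200 (or otherwise bad) responses

sli = (total_requests - failed_requests) / total_requests   # measured availability
slo = 0.999                                                  # target: 99.9% good responses

error_budget = (1 - slo) * total_requests                    # allowed bad responses
budget_remaining = error_budget - failed_requests

print(f"SLI: {sli:.4%}")                                     # 99.9250%
print(f"Error budget: {error_budget:.0f} bad responses allowed")
print(f"Budget remaining: {budget_remaining:.0f}")
```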

SLA is something that is hopefully much more generous with downtime than your SLO because that is where the pain money comes from. That is the thing that upper management cares about.

SLOs are awesome for developers because if you are thinking about what your customers care about and you're instrumenting to monitor that, you know when you're making changes, you're not doing wrong things for your customer. That's it, that's the whole reason. It's super important that you understand that as a developer.

When I've explained this to developers, they're like, "Oh yeah, you're right, I care." Understand what your users want. It might take you some time to do the math on getting it right, and that's understandable. This involves calculus. I don't expect every developer to become an expert, I want them to understand it is an area of expertise. I want them to seek that expertise and know that it's something they need to think about.

In summary, everything is complex, everything is crazy, everything is chaos. This is a whole new world, we haven't quite learned to adapt to it yet. But all of us are sitting around thinking, how can I know what my portion of this is doing, how can I make it better, and, most importantly, how can I learn as I go and not miss out on that learning by dumping it on a different human?

Since you're all next-level ops, please go Google this. If you just Google it, you're going to find stuff. Go look up John Allspaw from Etsy. You're going to get down that rabbit hole and you'll start to understand that there is actually academic research that backs up caring for the humans. It's crazy, but it's here, and this is how we move the needle.
