Hannah Culver for Blameless

Posted on Sep 11, 2020

SRE Leaders Panel: Embracing Resilience During Crises

#sre #devops #techtalks

Originally published on Failure is Inevitable.

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during the pandemic, and how the principles of SRE intersect with global trends. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Amy Tobey: Thanks for joining us today to talk about how SREs can help organizations adapt in this time. A lot of us have been talking about this. I've had conversations with Alex and Liz about it, and we keep wondering “What can we do?” And we thought we should get these very smart people together to talk about that very topic.

So for now, introductions. I'm Amy Tobey, and I’ll be moderating today. I'm a staff SRE at Blameless. I've been an SRE and DevOps practitioner since before those names existed. I love this community and believe that SREs are uniquely positioned to change the world in small and large ways. I'll let my panelists introduce themselves, starting with Alex.

Alex Hidalgo: Hey everyone. My name is Alex Hidalgo. I've been an SRE for about a decade now and it is something that truly speaks to me. I kind of wonder how I ever did anything else before. I'm currently at Squarespace and currently in the process of writing Implementing Service Level Objectives, which I hope will be well received.

Dave Rensin: Hello everyone. My name is Dave Rensin. I'm an SRE director at Google. I've only been an SRE five years, so I guess I'm the puppy of the bunch. I’m one of the principal editors of the SRE workbook and am pleased to be contributing to Alex's book as well. And it's lovely to be here and see everyone, for some value of “see.”

Liz Fong-Jones: Hi, I'm Liz Fong-Jones. I am a principal developer advocate for honeycomb.io. And I've also been an SRE. I consider myself both an SRE and dev advocate. And I've been around SRE for the past 12 years and I've been working in the DevOps space for about the past 15 or 16 years.

Amy Tobey: Awesome. Getting started, I wanted to start with a question about error budgets and development velocity. And it seems like there's a lot of teams out there that are running with reduced personnel and available spoons or cognitive capacity, right? How can these practices help teams cope and adapt?

Liz Fong-Jones: I've seen a couple of bad examples of how to cope with this, and I've seen a couple of good examples of how to cope with this. The bad example is an unnamed bank in Europe that divided their teams up into the blue team and the green team and mandated that the blue team come to work on alternate days with the green team so that you would only lose half of your team. And that way you could still keep the services running and keep the deployments running. And that's a bad example. I don't think that we should be doing that. I think that we need to be a lot more kind of continuous with our resilience practices. That we need to accept that people are potentially going to have less throughput, that people are potentially going to be less available.

And instead of kind of dividing things into the blue team and the green team, let's instead think about how we actually make it so that people can swap in and out of on-call, depending upon their parenting needs. How do we make sure that people can continue to push out releases according to automated release trains so that if someone does feel like writing code, they can still get that code pushed to production? I think that's kind of the direction that we need to be headed instead.

Dave Rensin: I think I've seen a mix of things that went really well and things that are going not well and then some realities. So I agree with Liz that the scenario she described is generally an anti-pattern. But I would like to acknowledge that there are times when it's kind of a requirement, particularly when you have people who have to have physical contact with some infrastructure. We have had double and triple split teams in our data centers at Google because the humans have to go in and do human things to the equipment that we haven't invented robots to do yet. So it's not optimal in all the ways Liz described, but it's just a requirement. Things that have gone well. Well, the same things have gone well and have gone badly.

Teams that have been good about paying attention to their technical debt and their toil, right? Meaning keeping their toil volume low. So being very proactive about giving to the computers the things that computers are capable of doing, are generally faring better having to work remote than other teams who didn't. The teams who haven't been paying as much attention to their automation or keeping their toil under control are finding they're having a really hard time because doing toil really doesn't scale when everybody is remote. There is communication friction that being remote creates, especially for teams that are not used to having to interact via video conference principally. So that's both a pattern and an anti-pattern.

One of the conversations we were having internally today about SRE leadership is that we’re in a global pandemic and it's awful. But as long as it exists, it provides a really interesting natural experiment to ask how many of the principles that we practice in SRE at Google and other places are scalable to the degree of global scale pandemics, and what edges they expose: the things we have to rethink, the things that are or aren’t as true as we thought they were. And so, if you're looking for sort of a silver lining in an otherwise really gray cloud, I think that'll be pretty interesting as it emerges over the next few weeks.

Alex Hidalgo: I think one of the most important things that everyone has to keep in mind, especially in framing this in context of error budgets, is what you think is a tolerable amount of failure in times like this, especially in terms of what your release cadence looks like? What are you expecting from your humans? And often when people think about SLOs or error budgets, they think about the windows that these are calculated over. The classic example is: if you’re out of error budget, focus on the reliability. Have an error budget remaining? Release, move, do whatever you want. But that's not really how to best use those numbers. I think the best way to use the concept of an error budget isn’t that you have to actually have measurements, but rather that the concepts behind it give you a different way of thinking about things. And to have good discussions with people with that data and to help you make decisions based upon that.

So I often tell people that you should revisit what a target is whenever you need to. Sometimes it's because you had an incident, sometimes it's because your code base changed or your dependencies change. Things change about the world and sometimes you need to change your expectations. But I also tell people, “Do that whenever you need to,” and right now might be one of those times. Right now it might be one of those times where you have to stop and say, "What makes sense? What makes sense for our users, for our engineers, for the product team?"

In some cases I could see this being an example of where you need to make things a little more stringent. Zoom is very important right now. Netflix is very important right now. But there's also perhaps chances where maybe you don't need to focus as much on something, because you need to prioritize correctly in this current world.
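As a rough illustration of the arithmetic behind the "classic example" Alex mentions, the sketch below computes how much of an error budget is left in a window and shows how revisiting the target changes the answer. The function name, the request counts, and the two targets are hypothetical and chosen only for illustration. The panel's point stands that the number is a prompt for discussion, not an automatic release gate.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Return the unspent fraction of the error budget for one window.

    slo_target is a fraction such as 0.999. The result is 1.0 when no
    budget has been spent and goes negative once the budget is exceeded.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # The budget is the number of bad events the target tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Hypothetical window: one million requests, 400 of them failed.
strict = error_budget_remaining(0.999, 1_000_000, 400)   # roughly 0.6 left
relaxed = error_budget_remaining(0.998, 1_000_000, 400)  # roughly 0.8 left
```

Loosening the target from 99.9% to 99.8% doubles the tolerated failures, which is the kind of deliberate trade-off a team might make when people have less capacity than usual.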

Liz Fong-Jones: Yeah. People first, right? Like it turns out that if your service is going down 1% more, it’s probably actually acceptable if it means that people stay home, or that people take their kids to the hospital if they're getting high fevers.

Dave Rensin: Also, this is a moment where teams have to make cognitively uncomfortable choices: that maybe this product or this feature I was going to launch, that users will really like, just isn't that important in the context of everything else that's going on, and we're just going to stop and divert resources to other things. And it's painful in the sense that it was never frivolous. But maybe it's relatively frivolous to the current time.

Liz Fong-Jones: We can think about global prioritization as well. I actually saw this morning and have signed up for the New York state tech SWAT team that is being dispatched to deal with coronavirus. So that's almost like doing a USDS rotation or similar, but for 90 days, or even just volunteering services. These are things that we can all be doing to make sure that we as a group of human beings survive.

Alex Hidalgo: I also like building off the point that Dave made in terms of perhaps you don't need to ship this feature. Maybe you do. There's this new service called My Bodega that wasn't planning on launching until later on in the summer. And from the start, their whole concept was about allowing bodegas and corner stores to more profitably deliver things. And so their plan was always to only charge 50 cents for each delivery as opposed to things like Caviar and Seamless that can charge upwards of 30%.

So they always had the owners of these small businesses in mind in the first place. So they launched early. And they said, "Look, everything may not be perfect. But we're launching early." They're not charging anyone. They're a 100% free platform to use for as long as things stay in the state they're currently in. And so yeah, sometimes for the greater good, perhaps you do throw something out there as long as people understand that it may not have been as polished as you had originally hoped.

Dave Rensin: Liz and Alex know that when you have a service outage, one of the things you worry about is what is going to happen when you restart the service. “What's the crush of built up demand and how do you measure it and moderate?” And there's some really interesting things for example in the health system that we have to start paying attention to, like what happens when this pandemic abates. We're paying a lot of attention obviously to intensive care beds and ventilators and respirators and those are obviously things we need to be in triaged space. Those are things we have to be paying attention to.

But as we approach half the US population and a billion people worldwide all sheltering in place (don't laugh when I say this), it's probably a good idea to also start paying attention to what our maternity ward capacity is going to be 40 weeks from now. Because I think it would be foolish to assume there won't be, shall we say, an uptick in that case. And that problem, if you will, is generalizable to a lot of things. Though we're in the height of this crisis and at least in the US, we're expecting things to get a lot worse over the next couple of weeks, now is actually probably the time to start thinking about that future. When we restart, what are the ways that we want to restart things so that we don't cause immediate overload and crash everything?

Liz Fong-Jones: It's like when you turn the servers on, right? You have to turn it on with exponential backoff. You have to turn it on with jitter. If you don't do that, then you just get immediately inundated the second everything comes back.
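For readers who have not seen the pattern Liz refers to, here is a minimal sketch of exponential backoff with jitter. It uses the "full jitter" variant, where each retry waits a random time between zero and an exponentially growing ceiling. That variant is one common choice, not something the panel specified, and the function name and default values are illustrative assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    The ceiling doubles with every attempt (exponential backoff) up to
    `cap`, and the actual wait is drawn uniformly below that ceiling
    (jitter) so that many clients do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Each client computes its own schedule, so retries spread out over time.
schedule = backoff_delays(6)
```

Without the random component, every client that failed at the same moment would also retry at the same moments, which recreates the surge the backoff was meant to avoid.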

Alex Hidalgo: I've already been thinking about what it is going to be like the first day that New Yorkers are allowed to go back out to the bar. And not just in terms of how busy it's going to be, but what kind of safety measures should be in place to ensure that people who are probably going to drink a bit more than they normally would can get home safe. These are all things that we need to start thinking about now.

Amy Tobey: And there's capacity turned down almost everywhere. Even the rideshare systems are running at lower capacity right now, aren't they?

Liz Fong-Jones: They are. And in New York it's sufficiently bad that many drivers are driving around and not able to find work such that the city actually stepped in and said, "We're going to hire rideshare drivers to carry critical medical supplies from place to place rather than have them drive around looking for passengers that won't turn up."

Dave Rensin: And there have been fascinating frenemy things happening, right? So some of the ride share services who might compete with food delivery services are starting to partner with them so that they can still give rides and hours to their drivers. That way they don't lose them permanently, except now they're delivering food to people instead of people to people.

Amy Tobey: That's really good adaptability there. So we talked a little bit about the need to plan for the capacity as we emerge from the lockdowns and quarantines. Going back to where we started, how do we create the accountability while still balancing that with compassion? Because we need to create that pressure, right? This is why we have SLOs and things to do that capacity planning. So what are the successful strategies that you all have seen in the world for kind of starting that process and getting that process going? Because as SREs, we often are the people crying in the dark and going, "We really got to turn up capacity before it's too late." And sometimes we're not heard. What can folks do?

Alex Hidalgo: I don't know. This is a tricky question in terms of our current situation because it's kind of uncharted territory. You can do your best when you're talking about any kind of capacity planning or any kind of feature launch or any kind of product launch and you can try to use the data that you have, right? That's all you can do. No one can really predict the future. You can throw all sorts of stats at what numbers you do have and perform regression analysis, but that's still not really telling the future. And sometimes I think the best you can do is make a guess but be ready to change it. That's what resilience really means, right? Do you know what to do and can you?

As an example, yesterday I cut myself pretty badly. I was opening an avocado because I wanted to make some guacamole. Robustness is the fact that I have this other hand. Resilience is the fact that I am first aid-trained, Red Cross-certified, and I knew exactly what to do and had the tools on hand to do it. So even though this was unexpected, I certainly wasn't planning on cutting myself while I was trying to make some guac. It happens. But the resilience aspect of that is the fact that I knew what to do. I was prepared for it. So we have to kind of take a similar view on how we turn both services and society back on in these times. We're not going to know. Take a guess, but try to be as prepared as you can possibly be.

Liz Fong-Jones: And have feedback loops, right? Like maybe we don't let all of New York City come back to work at the same time. Maybe we increase capacity a little bit and then increase capacity a little bit and then discover we need to back off. The more you have adaptability and flexibility in your system, the better prepared you'll be.
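The staged approach Liz describes can be sketched as a small loop: raise capacity one step at a time, check a health signal after each step, and hold at the last healthy level when the signal turns bad. The stage percentages and the `is_overloaded` callback are hypothetical stand-ins for whatever monitoring a real system or city would rely on.

```python
def staged_reopen(is_overloaded, stages=(10, 25, 50, 75, 100)):
    """Ramp up through `stages` and return the last level that stayed healthy.

    `is_overloaded(level)` is the feedback loop: it reports whether the
    system could not cope at that level. On the first bad reading the
    ramp stops and falls back to the previous healthy level.
    """
    current = 0
    for target in stages:
        if is_overloaded(target):
            # Feedback says this stage is too much: back off and hold.
            return current
        current = target
    return current


# Hypothetical system that starts struggling above 60% of normal load.
safe_level = staged_reopen(lambda level: level > 60)
```

A real rollout would re-check over time and resume ramping once conditions improve. This sketch only shows the increase, observe, back off cycle.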

Dave Rensin: Here's where well designed incentives play a role. We were talking about error budgets and that the value of the error budgets is around the incentive alignments they create. If you're over your error budget a lot, the SRE team is going to hand back the pagers. That's a super over-simplification. So we're starting to see some of these things.

In the early days, in most countries you started to see hoarding, and that was terrible. People were cleaning out stores. They'd buy 14 years of toilet paper because, I don’t know, they expected COVID-19 to make them all have to poop more or something. Or they bought three years of perishable goods, which made no sense. And so now stores are starting to respond and do really good things. Now that first unit of hand sanitizer is the normal $4, and the second unit is two for $80.

That's a good incentive response, which will keep people from hoarding. Or another example, stores opening early only to service elderly and at-risk patrons. Those are good responses. But I also want to piggyback on something Alex said, which I think may be the most important thing in this. Let's tell people that it's not a sin to fail. The sin is in failing to notice. Whatever we do, it's going to be wrong the first time. That is a metaphysical certainty. It's all about the feedback loop and the monitoring, and noticing that things are going off the rails, and having the levers to adjust and try something new. If, let's say, the blast radius of your mistake is sufficiently small, you can afford to discover the right thing to do by first discovering all the wrong things to do.

Alex Hidalgo: I also really like that point about how people were really hoarding things and in some cases still are depending on the store. It also leads to the need for you to ensure that you're communicating things well. In my neighborhood, suddenly all the toilet paper was back and someone I know just a few neighborhoods over in Brooklyn still couldn't find any at all. And I didn't know that, or else I would've let people know ahead of time. But as soon as I found that out, it was just on Twitter. I was like, "Oh, go to this store at this intersection and this store at this intersection", and this person was able to find the toilet paper they needed. Maybe that could have happened much earlier if we knew what we had to communicate to each other. Now, you can't always know what you need to communicate, but that's just another important aspect of keeping things reliable; by making sure that everyone who needs to know knows.

Liz Fong-Jones: I've been following the efforts by a group of aviation professionals, called an ops group. They're a set of people across private pilots, commercial pilots, and also people who work for major airlines. And despite the fact that they work for competitors, they share information about, “here are the airports that are currently closed, here's what's going on with the missile strike in Iran,” and such. Things like that. They share that information and it makes all of them much more adaptable, because they have the information that they need rather than siloing it.

Amy Tobey: A thing that struck me while you all were talking was how the toilet paper thing reminded me instantly of the thundering herd problem we were talking about earlier. When we turn up services at the edge, and then they hammer all the services on the inside, the whole system kind of goes into a flapping state for a long time. It's really kind of weird to see that out in the real world.

I want to jump ahead to a question from one of our audience members because it's really relevant to where we are right now. How do you all see the world of disaster response and business continuity planning changing as we move forward into recovery? I think right now a lot of folks are finding out that their disaster plans were incomplete because nobody really planned for a pandemic, or a lot of organizations didn't. And there's a lot of need for using some of our recovery, our disaster planning, that maybe hasn't been tested that well before.

Liz Fong-Jones: I think you can't enumerate every single possible thing that's going to go wrong. The playbook strategy is not necessarily going to work super well because you cannot anticipate what the next black swan is going to be. So we have to focus on making our organizations of people more resilient. That's the lesson I hope people will take away from this. I was reading an article this morning in the New York Times about how American Airlines had a plan for dealing with the pandemic in China. And they were like, "Yeah, we'll shut down two flights to China." And then it spread to Italy. And they were like, "Well, we can shut that down too, but suddenly our plans no longer work."

Dave Rensin: I completely agree with Liz. You cannot game out every contingency; the permutations are insane. But failure modes cluster across just a handful of axes. So you can plan generally about how you're going to think through and respond to sharp swings in capacity requirements, whether that's technical capacity like computer networking or human capacity, like when we have to surge humans to one place or another. And those are generalized techniques and things you can plan for that are pretty portable across different situations. You can generally plan for a communication partition.

What if the East Coast can't talk to the West Coast? What do we do there? Which is what do we have to do if ... and that rhymes with this problem of, oh, we can't send all the staff to a physical place at the same time because there's infection risk. That looks like a partition, a staff partition.

So there are classes of things that you can drill into, so that you at least have the mental framework of how to take these classes of things and apply them to the specifics of where you are. I'll tell you a funny anecdote. I think some people might know, and obviously Liz and Alex do, this thing at Google we do called DiRT, Disaster Recovery Testing, where we try to simulate the most existentially awful but plausible thing we can think of. Earthquakes or giant outages or whatever.

Liz Fong-Jones: Or zombie plagues.

Dave Rensin: Well, we did in fact simulate that one because zombies could happen. And when I would talk to people in the industry and they would find out about DiRT, the reactions usually kind of went like this. First reaction was, “Wow, that's kind of neat and it sounds kind of fun” because it is kind of neat and it is kind of fun. And then the second reaction is “Man, that is such a luxury item.” It's not even ... like of course it's Google and whatever. The good news is a lot of what we've learned in DiRT, we're actually having to apply. Not because we can, but because we're having to.

The bad news is we are discovering all the vectors of terribleness that 20 years of DiRT testing did not prepare us for. Or worse, miseducated us about. So we're unlearning some lessons rapidly too. It's a weird sort of dynamic. You would think a company that's spent 20 years thinking about crazy meteor strike kinds of things might be better prepared. On some axes we definitely are, but on other axes it actually kind of hurt us. I don't know what lesson we'll learn from that, but we definitely need to learn some lessons from it.

Alex Hidalgo: And not only can you not plan for every potential outcome, you can't plan for every potential problem. You also can't plan for the scale. A good example, I was reading last night that Waffle House is shutting down. Waffle House never shuts down. And part of that is capitalist greed of course blah blah blah. We don't have to get into that. But they take this seriously. Every store or every restaurant does training. Everyone that works there knows how to do this. They have reduced menus that they fall back onto. They are ready to help deliver things to people if they can't leave their houses. This is part of how Waffle House has set themselves up. They have tried to ensure that they can be as resilient as a restaurant business possibly could. They're closing. Because sometimes even if you are as prepared as you possibly think you can be, you can't prepare for the scale of what you're dealing with either. I think that's kind of what's going on there.

Liz Fong-Jones: It's kind of an interesting situation where Waffle House is used to hurricanes taking out like one or two or three states at a time, but they're not used to something affecting the entire nation at once with no possibility of serving people a limited menu.

Amy Tobey: What I found really fascinating about the Waffle House case is, as a signal to the people who live in those states who get hit by hurricanes frequently, that was a stronger signal that they had to batten down than the official warnings that came out. I was wondering if anybody had thoughts about these sorts of unofficial or colloquial alert systems that maybe we already have in our organizations too.

Dave Rensin: People love tea leaves; it’s something about human nature. I don't know exactly what it is and some day there's maybe some body of research. But I think there's a fundamental human distrust of authority, particularly authority that seems very abstracted from you in your locality. And also I think it maybe makes people feel a little more in control when they're like, "Oh yes, I see this tea leaf that I can read." And so yeah, things like the Waffle House example are a good local signal. In a lot of small towns and communities, whether a Walmart is open or the Walmart parking capacity is another kind of proxy measure. People use it. There's just something about human nature where they love that. It's the same thing in companies. People love anecdotes, and anecdotes are more viral than data for sure.

In a previous life, a long time ago, I worked in the US intelligence community. We used to say that the great sources of intelligence for any group were SIGINT, signals intelligence, and HUMINT, human intelligence, the stuff you get from spies or whatever. But the most powerful was RUMINT: what you heard in rumors. And so as leaders, it's really interesting and hard because on the one hand, you want to be making decisions based on well curated and aggregated data. On the other hand, you have this human instinct to want to pay attention to anecdotes, and you have to find some filtration mechanism to mix and munge them together. The higher you are, the more challenging the problem becomes.

Liz Fong-Jones: So I guess to spin that on its head, what can we do as leaders to kind of communicate doing the right things to our people when our people are disinclined to believe what we say at face value?

Alex Hidalgo: I think part of that is you use narratives. Not only do people like these signals that Dave was talking about, people like stories. We've always been storytellers. For the vast majority of the time humans have existed, that's how we passed information: via stories. And that's why the best postmortems are narratives and not timelines.

Liz Fong-Jones: Goodness. After this pandemic, we're going to have to just really stop using the word postmortem. We're going to have to start using retrospective, because people are going to have relatives who have died recently. And postmortem is a little bit too on the nose.

Alex Hidalgo: Yeah. I've actually renamed things as incident retrospectives internally. And I like it a lot better, but still, we've been calling them postmortems for decades in this industry. It can be difficult to jump onto a new phrase, but I totally agree.

Dave Rensin: There's an old saying among marketers and people who have to persuade people for a living. So this is to really piggyback on Alex's point because he makes a really important point: identity beats analogy, analogy beats logic, and logic beats nothing, right? When you're making an argument, if you're arguing against nothing, if you're arguing against a vacuum, you can use facts and data and logic and you're going to win the argument. And if you're making an argument, and what you're arguing against is a set of facts that can be interpreted a bunch of ways, then analogy turns out to be a stronger way to persuade people. And the strongest way to persuade people is identity. In fact this is going to change the way you look at advertising. But if you look at any longform advertising, it always goes in what's known as the up-and-down pyramid.

It starts with identity. “Don't you want to be known as a handsome, sophisticated, suave human being?” And then it'll go down to an analogy after that. “Imagine you are an elegant animal in a herd”, I don't know, some crappy analogy people make for men's clothing or something. And then data. “46% of all employers love people with blue ties.” And then it works back up. It reinforces with an analogy and then an identity again. That is the arc of every piece of successful long form presentation.

There's something to Alex's point about human nature where we value stories. Maybe they are easier to compress and store in our brain or something about the way they interact with our ability to abstract, think. Or maybe they're sort of pre-chewed in the sense that if you give me data I'd have to chew through it and look for knowledge and then look for wisdom and then synthesize it into a thing and store it. But whereas if you just give me a story, I can just pull a moral out of it.

Liz Fong-Jones: What's particularly interesting here though is we had heard those stories for months, right? We'd been hearing from countries in Asia who were impacted, and countries outside of Asia chose to ignore those stories, which I think brings us back to the idea of identity. If you think that your identity is such that that story has no relevance to you, you're not going to pay attention to that story.

Alex Hidalgo: Yeah. But I think part of that too is the way those stories were presented to us here was numbers. There were just news reports like X number of people now are infected. Y number have died, Z number have recovered. I'm sure there was much better data out there and there were better stories being told out there, but that's how I was receiving the news at least. It was just numbers. Again, numbers from a place I've never been. And I think that's one of the reasons it was probably easy for people to not take it seriously because, as Dave was saying, stories work better. I'm always telling people a good SRE understands marketing because you need to get other people to buy into what you're selling either in terms of a system, how you think about reliability, or “I think this tool's a really good idea” or “We should be focusing on this”.

You need to know how to market, and to do that you need to be able to tell stories. Last year I spent a lot of time trying to convince people to spend a lot of money on a certain vendor. And I tried using numbers at first. Like, “Hey, we need this. We need X number of engineers to build the same functionality internally.” That salary cost would far outweigh what it cost to just pay this vendor to do it. That didn't get me anywhere. But then during our trial phase when someone was able to actually solve the problem, when I told that story about how, "This human on this team did this thing," that was what was able to help convince leadership.

Liz Fong-Jones: Yeah. It's hoping people will have that ‘aha’ moment.

Amy Tobey: I think another thing that we probably deal with a lot in our spaces in work is we are very biased toward understanding the numbers. And a book I keep thinking about a lot lately is Innumeracy and how if I said to the three of you, "Oh yeah, it's going on an exponential curve," you would go, "Oh yeah, that's an exponential curve. It looks like this." But for the majority of people, if we say it's going to go exponential, they go like, "Is that like adding more or?" for a large population broadcast. I think that's a really key skill for SREs.

Dave Rensin: Yeah. It's also worse than that. All humans, even really math-savvy humans, are terrible at internalizing and dealing with probabilities. I believe firmly that humans really only understand two probabilities, zero and one, and everything else is kind of a mental coin toss. Most people reduce risks to either zeros or ones. Entrepreneurs too often reduce risks to zero and pessimists, let's say, too often reduce risks to one. And then that's it. That's all people understand. So you see this thing even in the press reporting and the public discussion over the pandemic. You'll have some people saying, "Well, we could be through it by this date." And other people saying, "No, no, no, no, no. It'll take them much, much longer than this.”

That's insane. Actually the truth of the matter is that the probability of any one of those scenarios being true is roughly the same. No one knows which one's going to happen. And so whether you tend to be an optimistic person or a pessimistic person just determines which scenario you gravitate to. People have a really hard time accepting that two or three or four different outcomes are all equally plausible, and therefore that things that feel like they're conflicting can all be simultaneously plausible. And they have a really hard time figuring out, "What do I do in that scenario?"

Liz Fong-Jones: This is where we have a role as people who think about preparedness: to prepare for each of those three or four scenarios. What would it take to adapt under this scenario versus that scenario? Once you have that concrete menu of options, that feels like a much more reducible problem than dealing with black swans. But we do have to deal with black swans too.

Alex Hidalgo: I also notice that a lot. I've been trying to introduce better knowledge of basic statistics and using probability. These are very powerful tools for calculating meaningful SLIs and picking good SLO targets and stuff. What I learned is that some of the most technical people, yeah, they have real trouble grasping simple probability concepts. It's just not easy for human brains to do. You may accept the fact that there is a chance that no deck of cards has ever been shuffled in the same way, right. If you do the math there are just trillions or, I can't remember what the number is now, but there's so many possible outcomes that there's a chance that no two decks have ever been shuffled in the same way. You might know that via math, but I don't even really believe that. It's one thing to have numbers and it's another thing to actually convince your brain of things.
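To put a number on the card-shuffle point, here is a minimal sketch. It assumes every shuffle is a uniformly random ordering of 52 cards (real shuffles are not), and the figure of 10**20 total shuffles in history is a deliberately generous, hypothetical guess. The function names are illustrative only. The count of orderings is 52 factorial, roughly 8 × 10^67, so even that many shuffles leaves the chance of any repeat vanishingly small.

```python
import math


def deck_orderings(cards: int = 52) -> int:
    """Number of distinct orderings of a deck: cards! (factorial)."""
    return math.factorial(cards)


def repeat_probability_upper_bound(shuffles: int, cards: int = 52) -> float:
    """Birthday-style upper bound on the chance that any two shuffles match.

    Assumes each shuffle is a uniformly random permutation, which is an
    idealization. The bound is (number of pairs) / (number of orderings).
    """
    pairs = shuffles * (shuffles - 1) // 2
    return pairs / deck_orderings(cards)


if __name__ == "__main__":
    print(f"orderings of 52 cards: {deck_orderings():.3e}")
    # 10**20 is a hypothetical, very generous count of all shuffles ever made.
    print(f"upper bound on any repeat: {repeat_probability_upper_bound(10**20):.3e}")
```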

The Monty Hall problem I think is another great one. It was based on a game show, and there was a gift behind one door and behind two other doors were donkeys I believe, or something like that. Basically once you pick a door, one of the two remaining doors would be opened and you'd get to find out what was behind that door. If it was a donkey as opposed to the gift, you were allowed to switch which door you had picked. Intuitively, we all want to say that it doesn't make any difference, but it does. You increase your chances of getting the gift if you switch doors. It's incredibly counterintuitive, but you can find this on Wikipedia and you can find the mathematical proofs. From a probability standpoint, if a door's opened and there's a donkey behind it and you've already made your selection, you actually have a better chance if you choose the other door. This seems to make no sense. It's very difficult for humans.
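For anyone who would rather check the Monty Hall claim by simulation than by proof, here is a minimal sketch. It assumes the standard version of the puzzle, in which the host knows where the gift is and always opens a door that is neither the contestant's pick nor the gift. The function name and parameters are illustrative.

```python
import random


def monty_hall_win_rate(trials: int, switch: bool, seed: int = 0) -> float:
    """Estimate the win rate for a 'stay' or 'switch' strategy by simulation."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that is neither the pick nor the prize.
        opened = rng.choice([d for d in range(3) if d != pick and d != prize])
        if switch:
            # Move to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == prize
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", monty_hall_win_rate(100_000, switch=False))
    print("switch:", monty_hall_win_rate(100_000, switch=True))
```

With enough trials, staying should win about one time in three and switching about two times in three, which matches the usual analysis.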

Dave Rensin: The other thing is I find people have a pretty good intuition about expected value. If there's a 30% chance of this outcome and a 70% chance of this other outcome, we weigh them together and then the expected value is some number that actually doesn't appear in any of the outcome tables. They have a pretty good intuition about expected value, but it turns out that in a lot of these situations, the expected value is not useful for making a decision. Maybe you have a 30% chance of living and a 70% chance of dying, for this one particular patient. And we say living is a value of one, and dying is a value of zero. Then you have an expected value of 0.3, which does nothing for the decision process because it's not one of the actual outcomes.
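The arithmetic in that example can be written out as a small sketch. The helper name and the list-of-pairs representation are assumptions made for illustration; the numbers are the ones from the example above.

```python
def expected_value(outcomes):
    """Probability-weighted average of (probability, value) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# 30% chance of living (value 1), 70% chance of dying (value 0).
patient = [(0.3, 1.0), (0.7, 0.0)]
ev = expected_value(patient)
possible_values = {v for _, v in patient}

if __name__ == "__main__":
    print("expected value:", ev)
    print("is it an actual outcome?", ev in possible_values)
```

The expected value lands between the two real outcomes, which is why it summarizes the odds without describing anything that can actually happen to the patient.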

Liz Fong-Jones: Exactly. I've been having to tell people that a 3% death rate, or even a 1% death rate among people who are able-bodied and younger, means that every company is going to either lose someone or lose a relative of someone working at the company. That makes it a lot more concrete. It makes it more than just math. But all of this is kind of coming back around to the idea that when we think about reliability, when we think about risks, a lot of it revolves around persuading people rather than necessarily just the math.

Amy Tobey: Exactly. Let’s move on now to Q&A. The first one I have overlaps a little bit with what we were talking about earlier, but it has a little twist. Given the view that humans and technology are part of the same system, sudden staffing reductions can deeply impact these systems, such as through the loss of expertise. How can SREs proactively help prepare our organizations to adapt for this?

Liz Fong-Jones: I talk a lot in my talks about the idea of collaboration. That’s the idea: We need to not silo knowledge, which means that you have to have more than one person with a working knowledge of how something works. How do we do things, why do we do things? That's one kind of redundancy mechanism that we can employ to guard against the possibility of someone going to the hospital and not being available.

Alex Hidalgo: Part of it is also just proactive-ness from people who know that they can be thought of as the subject matter expert, or a single point of failure. People often know that, and when you see yourself as that person, when you know that you're the only one that holds this data, you need to proactively share this.

Dave Rensin: There's an exercise you can do as a team, which I do on my teams. I talked about this a few months ago at Chaos Conference. One of the exercises is called Wheel of Stay-cation where every week you randomly pick a person. I'd go into much more detail but you can look it up. You randomly pick a person and they stay at work but they have no work communication. They can't answer any work emails. They can't have any IMs, no one is asking questions, nothing. And the point of the exercise is to discover your information SPOFs. Like what questions couldn't get answered? Just because that one person randomly wasn't there that one day, that one week.

Then, actively engineer away the information SPOFs. There are other things you can do, but the only way you can discover things like expertise SPOFs or information SPOFs is to regularly and routinely exercise them before the emergency shows up. There is a set of exercises you can do to make this happen.
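One way a team might run the weekly pick and keep track of what it uncovers is sketched below. This is not a description of how the exercise is run at Google; the function names, the seed string, the log format, and the threshold are all hypothetical choices for illustration.

```python
import random
from collections import Counter


def pick_staycation(team, week, seed="wheel"):
    """Pick this week's person at random, repeatably for a given week label."""
    rng = random.Random(f"{seed}-{week}")  # string seeds are deterministic
    return rng.choice(sorted(team))


def information_spofs(blocked_questions, threshold=2):
    """People whose absence left at least `threshold` questions unanswered.

    `blocked_questions` is a hypothetical log of (absent_person, question).
    """
    counts = Counter(person for person, _question in blocked_questions)
    return sorted(p for p, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    team = ["ana", "bo", "chen", "dee"]
    print("week 2020-W14:", pick_staycation(team, "2020-W14"))
    log = [
        ("bo", "How do we rotate the TLS certs?"),
        ("bo", "Where is the failover runbook?"),
        ("dee", "Who owns the billing cron job?"),
    ]
    print("candidates to document or cross-train:", information_spofs(log))
```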

Amy Tobey: The other thing that this made me think of, is how a lot of SREs are those experts that have a lot of critical knowledge in their head on how the infrastructure is held together and how all the pieces are connected. Our community is likely especially loaded with this kind of information.

Alex Hidalgo: And from a technical standpoint, something I've been thinking a lot about is that I'm going to work from home more often, not because I necessarily love it, but because I need to make sure I can. So many companies reported running out of VPN licenses because they never expected every single employee to have to log in at the same time. Even if they had enough VPN licenses, was the subnet used to assign addresses to the clients large enough? And even if that subnet was large enough, were there enough DHCP leases available? There are so many different things that you need to think about. If you are being a little bit more proactive, and you're testing scenarios and you have exercises like what Dave mentioned, these are all good things that can help you learn.
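The chain of questions above (licenses, then subnet size, then DHCP leases) can be turned into a simple pre-flight check. This is a minimal sketch under stated assumptions: the function name, parameters, and example numbers are hypothetical, the client pool is assumed to be a single IPv4 subnet, and two addresses are subtracted for the network and broadcast addresses.

```python
import ipaddress


def remote_access_shortfalls(employees, vpn_licenses, client_subnet, dhcp_leases):
    """Return each capacity that falls short if every employee connects at once.

    The result maps a capacity name to how many seats it is missing.
    An empty dict means none of the three checks failed.
    """
    network = ipaddress.ip_network(client_subnet)
    # Assumes IPv4: reserve the network and broadcast addresses.
    usable_addresses = max(network.num_addresses - 2, 0)
    capacities = {
        "vpn_licenses": vpn_licenses,
        "subnet_addresses": usable_addresses,
        "dhcp_leases": dhcp_leases,
    }
    return {
        name: employees - capacity
        for name, capacity in capacities.items()
        if capacity < employees
    }


if __name__ == "__main__":
    # Hypothetical company: 500 people, 300 licenses, a /24 pool, 200 leases.
    print(remote_access_shortfalls(500, 300, "10.8.0.0/24", 200))
```

A check like this only covers the three limits named above; a real exercise, such as a company-wide work-from-home day, will surface limits nobody thought to list.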

Liz Fong-Jones: We had a really interesting thing that prepared us for this at Honeycomb, in that we were concerned about potentially losing our office lease back in September or October, and not having the cash to put down a deposit on a new office. This was before we raised our most recent round. As a result, the company instituted a practice where, one week out of every eight, we would all work from home to get us ready for potentially losing our office. We didn't lose our office the way we expected to lose our office. I'll tell you that.

Amy Tobey: I have another question about remote work leading on from that. When all the forced remote people are told that they can return to their office, what proportion of them do you think will say, "No thanks, I'm used to this now and I like it better. Please don't make me go back to the office"?

Alex Hidalgo: This is incredibly anecdotal, but I have this social Slack. It's just me and my friends. It's been around for a few years. It's like a glorified group chat at this point. We appear to be split 50/50 on who misses the office and who doesn't. We're split just about 50/50 on who has always wanted to work from home or does and who would really like for things to get back to normal, like in that sense. Purely anecdotal, very small sample size. But I wouldn't be surprised if it's something close to that. A lot of people have noticed that after a week or so, they really miss the human interaction, and there's other people who are very happy just to be indoors.

Liz Fong-Jones: I'm really curious to see whether this expands the range of people hiring outside of San Francisco. Overall though, this crisis is really exposing the need for people's tooling to support remote workflows, for people to not need to be shoulder surfing each other in order to collaborate. The sooner companies can adapt to that reality with solutions that enable people to collaborate without being physically in the same place, the better prepared they'll be for any scenario, including hiring remote employees, or this crisis dragging on longer. And yes, including people not wanting to go back to the office.

Dave Rensin: I expect something like 100% of the people, or a very, very high percentage of the people who are working remotely, will go back to the office that first week. I'm a raging introvert. I'm fine not dealing with humans. But even I'm going kind of nuts. And so I'm definitely going to go back to the office the first week, and then I think we'll see a double boomerang. So a bunch of people will likely say “I'm going to work from home three days a week if the company will permit it.” I also share Liz's hope that this will make companies bolder about where they are hiring people, and where they'll allow people to be hired.

That would be awesome for de-justification and decongestion and just generally making things more affordable and better for people. Where I work, Google, there is a very strong culture of in-person. That's just the way the company was built. Sometimes it drives me nuts too. Some roles are Bay Area only. When I interviewed, I was living on the East Coast near Washington DC. I said to the person hiring, "Hey, you have an office 15 minutes from my house and the job you're hiring for is a globally distributed job with teams across the world. How about I just do the job from Reston?"

Her answer was, "That is a really fantastic and interesting point. By the way the job is in Mountain View." So it'll be really interesting for a company like ours with that strong geographic affinity to see if we learn any of these lessons or take this as an opportunity to relax some of those constraints. That's a conversation I'm really looking forward to.

Liz Fong-Jones: It's also interesting seeing the kind of back and forth, at least in the larger SRE orgs that are multi-homed. That they already have some of these skills in order to collaborate across time zones, just not necessarily being remote in the same time zone.

Amy Tobey: There are two questions here that I think are very similar, so I'm going to combine them. The first one is: if people think that this pandemic is something that could happen only once, do you think they're capable of learning from this incident, if we call this all one big incident? The second one is: in two years' time, when we aren't directly talking about this anymore, what impacts to the way we work do you hope will have stuck? What do you feel like people are going to learn from this all? Everybody here has a lot of experience with trying to educate people coming out of incidents, and sometimes very large incidents. What do you think is going to stick?

Alex Hidalgo: I know this is contrary to what a lot of people think. But in my experience, having spent 18 years in tech, people actually expect things to be the same. Things that they've seen happen before, they expect to happen again. Even with something of this magnitude, I think people are going to remember that this can happen. So I don't think people will actually view a problem as large as this as a one-off. People have trouble categorizing things as black swan events. When a new problem happens, they will always go back to “Oh, was this the same problem we saw back in October?” or “Is this the same problem we saw last year?” Even if that problem never pops its head up again, people instinctively expect the future to look like the past.

Liz Fong-Jones: We can still pick out micro-patterns that might happen again. Like people rushing to the grocery store to hoard. People all getting on Zoom at the same time from the same country. These are things that we potentially can learn from. When Alex, Dave and I were all at Google at the same time, there was this Wiki page on the Google internal Wiki docs that said something like formative outages. You can hear about all of the black swan outages that Google has had over the past 10 years and which ones were the most influential. Which ones should you know as a new SRE. In that way I'm actually less worried about what people who've lived through this are going to learn, and more worried about what happens to the next generation of SREs who have not lived through this, how can we communicate what we've learned from this to them.

Dave Rensin: That's the most valuable thing I think. I have no doubt that people will learn lessons from this. Everyone's tendency towards recency bias is going to keep this fresh in mind. Also, this isn't rare. Just in the last 10 years, we've had SARS and MERS and H1N1 and this. This one broke out faster and the reaction was more swift and loud than, let's say, swine flu. But every couple of years, something like this either happens or is on the verge of happening. So this isn't nearly as black swan as people think. I 100% think companies are going to learn these lessons. I really and truly do. The ones who don't will self-select for extinction. What you want to see happen is that the lessons they learn go into the culture of being at the company.

So to Liz's point, the next generation of SREs or employees in general, it should be part of the culture. That remote work, or capacity planning, or whatever it may be is part of the culture. That's the most useful thing we can do from here, is to ask what principles, what practices, what cultural norms do we want to drive into our companies, so that the next generations don't have to remember the specifics of this incident in order to get the value of what we learned from it.

Liz Fong-Jones: Kind of like an immune system; how the immune system remembers things.

Alex Hidalgo: I don't think it's necessarily the case that every company is going to learn anything from this, or maybe they're going to learn the opposite lesson. There are plenty of examples of companies who are just waiting on the bailout, and it always works. You're not going to convince me that the current setup of our airlines is actually going to change any of their practices moving forward unless we put regulations in place to change how they make decisions. Same with our banks; until we can regulate things better, they're just always going to wait for a bailout the next time that they fail.

That's why all the buybacks are happening in aviation. And Boeing knows that the government's always going to bail them out because our economy cannot survive without them unfortunately. And there's ways around this. But I just wanted to point out that it's also not necessarily going to be the case that everyone currently impacted is actually going to change their practices. There's always going to be outliers. There's always going to be problem actors.

Amy Tobey: So what you're saying is we need to hand the pager back to those organizations and stop bailing them out? Essentially what you made me think of when you were talking about that, was that's our equivalent in SRE. If we keep bailing out service teams, they never are going to really learn to swim for themselves. Sometimes it has to be “Here's your pager, good luck. We're here to support you.”

Dave Rensin: The good news is they're not independent actors. They act in an ecosystem. They have externalities in the form of government intervention and regulation. So a lot of the loopholes that we saw in, for example, the 2008-2009 bailout, TARP, don't exist in the set of bills that just passed the US Congress. Because people remember and they learned some lessons and they were like, "Yeah, no. You don't get to do that. We saw you do that last time. You don't get to do that this time." And companies will invent new and creative ways to do something else. But the good news is there is a feedback loop for a bunch of these things.

Liz Fong-Jones: The other element is reducing over the long term the amount of Too Big to Fail. Can we save specific things that are essential while not saving the things that are frivolous?

Dave Rensin: Yeah. Sharding for the win.

Liz Fong-Jones: As far as commerce is concerned, yes.

Alex Hidalgo: I think it's important to remember we use the term complex system a lot in this industry. And when we do, people's minds often go to some kind of microservice mesh with a bunch of things that are talking to each other. But that's a very old concept that other engineering disciplines have been using for a very long time. Everything is a complex system. And it doesn't even have to be a thing. I'm not just talking about the fact that an airplane is a complex system, or a bridge, but our political system is, our economic system is, individual organisms are. You can apply the same ways of thinking about your complex microservice system to our political system and beyond.

Dave Rensin: We're also going to have to unlearn some interesting lessons we thought we had learned. For example the distribution of global manufacturing across a lot of different countries rather than concentrating heavily in one country. Another was a thought about resiliency. I remember very clearly in the late eighties and early nineties, the debates around NAFTA and other trade agreements were, “Hey, this is going to create resiliency so that we're not just dependent on wherever in the United States or someplace in Taiwan.”

Liz Fong-Jones: We should have learned that lesson a long time ago. Like the floods in Thailand that took out a whole portion of hard drive manufacturing.

Dave Rensin: Yeah, but that was a lesson only the tech industry learned. That was too localized to cause a global conversation. So one of the lessons we're learning now is about actually having some capacity sharding. Maybe we don't want to hand the production of all of our antibiotics to the lowest cost provider, because it tends to concentrate in one place geographically. Maybe making it a little more expensive, but sharding it across a lot of different places, will be better. So we're learning that some things we thought we understood about global supply chains were wrong, and now we have to adjust them. That'll be a really interesting thing to watch happen.

Amy Tobey: Awesome. Well, that's time for us everyone. And thank you so much to our attendees and especially a huge thanks to our panelists, Liz, Dave and Alex. I certainly appreciate your insights and had a great time. And I hope everyone had a great time on this panel. We're hoping to do more of these in the future. If you have more ideas for how we can connect with our community, catch us on Twitter. Stay safe, healthy, and stay home.
