<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ancy Dow</title>
    <description>The latest articles on DEV Community by Ancy Dow (@ancy_dow).</description>
    <link>https://dev.to/ancy_dow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F407926%2Fb197d1b3-141a-4703-b533-08ffb8499114.jpg</url>
      <title>DEV Community: Ancy Dow</title>
      <link>https://dev.to/ancy_dow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ancy_dow"/>
    <language>en</language>
    <item>
      <title>How to Improve On-Call with Better Practices and Tools</title>
      <dc:creator>Ancy Dow</dc:creator>
      <pubDate>Fri, 31 Jul 2020 15:31:16 +0000</pubDate>
      <link>https://dev.to/blameless/how-to-improve-on-call-with-better-practices-and-tools-18kp</link>
      <guid>https://dev.to/blameless/how-to-improve-on-call-with-better-practices-and-tools-18kp</guid>
      <description>&lt;p&gt;In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage to respond to incidents have become a requirement for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task. Establishing equitable on-call rotations, putting the right guardrails and automation in place, and practicing incident response regularly are key. In this blog, we’ll share key tools and practices to ensure your on-call engineers are set up for success.&lt;/p&gt;

&lt;h2&gt;On-call practices and policies&lt;/h2&gt;

&lt;p&gt;When setting up your on-call system, it is important to define clear and consistent policies and practices. When taking on on-call responsibilities, engineers shouldn’t need to reinvent the wheel when the pager goes off; ideally, the planning around severity, incident playbooks, and more should take place during peacetime. The team should work together to create rules that dictate when and how on-call escalations happen. Make sure you have the following worked out before implementing an on-call system.&lt;/p&gt;

&lt;h3&gt;Creating Rotation Schedules&lt;/h3&gt;

&lt;p&gt;First, you’ll need to build your on-call schedule. Work out which engineers would need to be available for the different system areas where incidents could occur by looking at where each engineer has ownership and domain expertise. Create teams to maximize diversity and coverage, allowing each team to respond effectively to many different types of incidents. Fill out a calendar with these teams, making sure every shift is covered for your rotation period.&lt;/p&gt;

&lt;p&gt;During all of this, consult with your engineers to ensure that your schedules are reasonable and fair. How long should an on-call shift last? How frequently should a team go on-call? What should the procedure be if an engineer has to change shifts? To keep morale high and teams responding effectively, make sure every engineer has a fair say in these choices.&lt;/p&gt;

&lt;p&gt;Be prepared to change your rotation schedule frequently, even after implementation. The reality of working on-call shifts is often very different from what was predicted, so look at on-call data to uncover whether certain individuals are overburdened with off-hours interruptions or critical incidents, and load balance accordingly. Be flexible in hearing out people’s concerns as they develop. External business changes and stages in development cycles can also drastically change the nature of on-call shifts, so be prepared to reflect those with adjustments to shift lengths and rotation frequencies.&lt;/p&gt;

&lt;p&gt;Because of these constant changes, it’s important to keep the rotation schedule up-to-date. Make sure it’s kept in a place where it’s convenient to make changes, automated and easy to integrate with different systems, and accessible to anyone. Many on-call platforms also offer scheduling tools to make this process easier and more robust.&lt;/p&gt;
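
&lt;p&gt;To make the round-robin idea concrete, here’s a minimal Python sketch of a rotation generator. The team names, start date, and one-week shift length are hypothetical placeholders, not recommendations; a real schedule would live in your on-call platform, but the core logic is just modular arithmetic over a calendar:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import date, timedelta

# Hypothetical teams and shift length; adjust to what your engineers agree on.
TEAMS = ["team-payments", "team-search", "team-infra"]
SHIFT_LENGTH_DAYS = 7

def rotation(start, weeks):
    """Yield (shift_start, shift_end, team) for a simple round-robin rotation."""
    for i in range(weeks):
        shift_start = start + timedelta(days=i * SHIFT_LENGTH_DAYS)
        shift_end = shift_start + timedelta(days=SHIFT_LENGTH_DAYS - 1)
        yield shift_start, shift_end, TEAMS[i % len(TEAMS)]

# Print six weeks of coverage so gaps are easy to spot.
for start_day, end_day, team in rotation(date(2020, 8, 3), weeks=6):
    print(f"{start_day} to {end_day}: {team}")
&lt;/code&gt;&lt;/pre&gt;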

&lt;h3&gt;Defining Escalation and Response Policies&lt;/h3&gt;

&lt;p&gt;The next set of policies to define determines when your on-call teams are actually contacted and how they respond. To combat alert fatigue, you’ll want to be judicious about when your teams are notified, but also ensure that critical incidents are not overlooked.&lt;/p&gt;

&lt;p&gt;You should have a system to &lt;a href="https://www.blameless.com/blog/incident-classification"&gt;classify incidents&lt;/a&gt;, sorting them into established classifications based on severity and affected area. These classifications will determine who is alerted and what response is necessary. The response policy should also include timelines for how quickly incidents of each severity need to be resolved before you violate &lt;a href="https://www.blameless.com/blog/service-level-objectives-slos-lessons-learned"&gt;SLOs or SLAs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can determine severity by looking at the business impact of an incident — issues preventing customers from using services or violating SLAs require a much faster and larger response than a small component loading slightly slower than usual. &lt;/p&gt;
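
&lt;p&gt;As a hedged illustration of codifying such a policy (the severity names, paging targets, and response windows below are invented for the example; real thresholds come from your SLOs and SLAs), the mapping from business impact to classification and response can be as small as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative severity policy; every value here is a placeholder.
SEVERITY_POLICY = {
    "sev1": {"page": "primary and secondary on-call", "respond_within_min": 5},
    "sev2": {"page": "primary on-call", "respond_within_min": 15},
    "sev3": {"page": "ticket only, business hours", "respond_within_min": 240},
}

def classify(customer_facing, sla_at_risk):
    """Map the business impact of an incident to an established severity."""
    if customer_facing and sla_at_risk:
        return "sev1"
    if customer_facing:
        return "sev2"
    return "sev3"

severity = classify(customer_facing=True, sla_at_risk=False)
print(severity, SEVERITY_POLICY[severity])
# sev2 {'page': 'primary on-call', 'respond_within_min': 15}
&lt;/code&gt;&lt;/pre&gt;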

&lt;p&gt;You’ll also need to prepare a defined response to each category of incidents. Engineers should be equipped with tools like &lt;a href="https://www.blameless.com/blog/runbook-automation-best-practices"&gt;runbooks&lt;/a&gt; to begin tackling an incident as soon as they’re alerted. These runbooks can also include checks for triggering further escalation. Make sure your on-call engineers are familiar with these runbooks, and confident about executing them when the time comes. Schedule regular review sessions to update runbooks based on incident retrospectives.&lt;/p&gt;
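
&lt;p&gt;Runbook formats vary by tooling, but a sketch of the structure described above, steps paired with escalation checks, might look like the following. The steps and conditions are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical runbook: each step carries a condition for escalating further.
RUNBOOK = [
    {"step": "Check the service health dashboard",
     "escalate_if": "error rate stays elevated for 10 minutes"},
    {"step": "Restart the affected workers",
     "escalate_if": "errors persist after restart"},
    {"step": "Fail over to the standby region",
     "escalate_if": "failover does not restore service"},
]

def walk_runbook(condition_met):
    """Execute steps in order; condition_met decides when to escalate."""
    for item in RUNBOOK:
        print(f"Next step: {item['step']}")
        if condition_met(item["escalate_if"]):
            print(f"Escalating, condition met: {item['escalate_if']}")
            return
    print("Runbook completed without escalation")

# Example run where no escalation condition is ever met.
walk_runbook(lambda condition: False)
&lt;/code&gt;&lt;/pre&gt;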

&lt;h3&gt;Cultivating On-Call Culture&lt;/h3&gt;

&lt;p&gt;Between being called out of bed in the wee hours, having to handle incidents with fewer teammates and resources than normal, and facing extreme pressure to restore service as business reputation is on the line, on-call can be an extremely stressful experience. Being overwhelmed by on-call responsibilities, believing that on-call duties are assigned unfairly, or generally feeling under-appreciated can quickly destroy engineers’ morale and accelerate burnout. &lt;/p&gt;

&lt;p&gt;Combat these challenges by cultivating an empathetic on-call culture that puts people first.&lt;/p&gt;

&lt;p&gt;Involve engineers in setting schedules and other policies. Hear out their experiences, celebrating their successes and addressing their struggles. Make sure you hear these concerns blamelessly; instead of attributing setbacks or miscommunications to individuals, look at the systems behind them. Protect against a ‘hero’ culture, and make on-call sustainable by eliminating single points of failure and embracing smaller and more frequent changes, distributed rotations, and continuous learning.&lt;/p&gt;

&lt;p&gt;Reframe incidents from failures and setbacks to investments in future reliability — every incident, when properly addressed, makes the response to each future incident better. Likewise, each on-call shift is an investment in making future on-call shifts better. When there are challenges in load balancing, preparing effective responses, or escalating properly, embrace them as opportunities to refine and grow.&lt;br&gt;
For more tips on how to implement empathetic and effective on-call practices, check out our top 5 on-call practices here.&lt;/p&gt;

&lt;h2&gt;On-call Software&lt;/h2&gt;

&lt;p&gt;Implementing on-call practices is a complicated process, but fortunately there are great paid and free on-call tools and platforms to help. The most popular tools include PagerDuty, OpsGenie, VictorOps, Cabot, and LinkedIn On-Call (open source).&lt;/p&gt;

&lt;p&gt;When selecting an on-call tool, some important requirements to consider include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting through phone, SMS, Hipchat, or email&lt;/li&gt;
&lt;li&gt;Breadth of integrations across the tech stack, from cloud monitoring to source control&lt;/li&gt;
&lt;li&gt;Alert grouping, filtering, and de-duplication (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Team-based management&lt;/li&gt;
&lt;li&gt;Simple visualization of teams’ statuses across the calendar&lt;/li&gt;
&lt;li&gt;Rock-solid reliability&lt;/li&gt;
&lt;/ul&gt;
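
&lt;p&gt;Alert de-duplication in particular is easy to reason about with a sketch. Assuming alerts arrive as simple dictionaries (the field names here are hypothetical; real on-call tools expose richer keys), grouping by a stable key is the core of it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

def group_alerts(alerts):
    """Group alerts by (service, check) so responders see one incident, not a flood."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["service"], alert["check"])].append(alert)
    return grouped

alerts = [
    {"service": "api", "check": "latency", "at": "12:00"},
    {"service": "api", "check": "latency", "at": "12:01"},  # duplicate of the first
    {"service": "db", "check": "disk", "at": "12:02"},
]
for key, group in group_alerts(alerts).items():
    print(key, f"{len(group)} alert(s)")
&lt;/code&gt;&lt;/pre&gt;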

&lt;p&gt;On-call is an essential component of a reliable system. To take your on-call and reliability practice to the next level, you’ll need to codify context into guardrails and automation, minimize toil, and foster a culture that is inclined toward curiosity instead of blame. Blameless can help you get more out of your on-call and broader reliability efforts by integrating valuable data from SLOs, incident checklists, postmortems, follow-up action items, and much more. To find out how to empower your SRE solution with Blameless, join us for a &lt;a href="https://www.blameless.com/schedule-demo"&gt;demo&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Originally posted on the &lt;a href="https://www.blameless.com/blog"&gt;Blameless blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>oncall</category>
    </item>
    <item>
      <title>Enabling the Stripe and Lyft Platforms Through Modern Safety Science</title>
      <dc:creator>Ancy Dow</dc:creator>
      <pubDate>Thu, 30 Jul 2020 15:27:48 +0000</pubDate>
      <link>https://dev.to/blameless/enabling-the-stripe-and-lyft-platforms-through-modern-safety-science-4ena</link>
      <guid>https://dev.to/blameless/enabling-the-stripe-and-lyft-platforms-through-modern-safety-science-4ena</guid>
      <description>&lt;p&gt;&lt;a href="https://www.linkedin.com/in/jacobhscott/"&gt;Jacob Scott&lt;/a&gt; is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.&lt;/p&gt;

&lt;p&gt;Blameless SRE Darrell Pappa recently interviewed Jacob to delve into how his research has informed his career journey and experiences to date, especially in his latest role at Stripe where he helps operate the economic infrastructure of the Internet.&lt;/p&gt;

&lt;p&gt;The following transcript has been lightly edited for clarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; Jacob, it’s great to connect. How's it going?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; There's actually some interesting &lt;a href="https://www.adaptivecapacitylabs.com/"&gt;Adaptive Capacity Labs&lt;/a&gt; and &lt;a href="https://www.learningfromincidents.io/"&gt;Learning from Incidents&lt;/a&gt; stuff on this topic, of “How's it going” during the coronavirus. When you meet someone, it's usually like, "How are you?" And, "I'm good." I think it's helpful to be like, "Well, I'm good, given the circumstances." Right? The world is in an odd and unfortunate place, but given where it is, I'm doing pretty well. How about yourself?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; Yeah, especially with the circumstances, it's been challenging for sure, but I'm happy we can connect. I wanted to first kick us off by diving into your background and some highlights from your journey to date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; A long time ago I was actually a theoretical computer scientist: approximation algorithms, combinatorial optimization. I did research in that domain in undergrad. I went to grad school. I dropped out with a master's. And then in 2008, when I left grad school and started my career in industry, I joined Palantir. This was an interesting company to join. It was a different president, different time, but with everything that's happening in the intersection of society and technology today, it’s really interesting to reflect on working in a company at the center of that.&lt;/p&gt;

&lt;p&gt;I was there from 2008 to 2013, and saw it go from 100-200 people to over 1,000. I was a backend generalist in Java, frequently a tech lead or lead engineer on a variety of projects. Then after Palantir, a friend from grad school convinced me to join as a very early engineer at a startup doing productized machine learning on sales and marketing data, solving the same problem for many SMBs. That was pretty interesting. I had a horizontal portfolio as there weren't that many engineers—maybe there were 10 or 20 engineers max—and so there I did a lot more. It was Python instead of Java. I did a lot of Postgres stuff, data pipelines, external data integrations, all sorts of random things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; Looks like a pretty big change from your previous role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; It was, with the intent of optimizing for learning new experiences. It's an approach that I have. Who knows what will make me the most energized and fulfilled? But if I triangulate, I'll learn more by doing something sort of different. You learn a lot, not just technically, but about how the industry works, how Silicon Valley works.&lt;/p&gt;

&lt;p&gt;An interesting thing throughout my career, both at Palantir, and then at Infer, Lyft and Stripe, has been the relationship between the business and the technology. There’s always someone on Hacker News saying, "Well, why didn't you use this?" Or, "I could build this thing in a weekend with my buddy." But you can't necessarily build the support arm, the data science arm, the regulatory compliance arm, everything you need to actually have a functioning business. When you're growing and you're the darling of Silicon Valley and you're getting lots of money, you can attract all the right people. This lifts all your boats. &lt;/p&gt;

&lt;p&gt;I was at Palantir for about five years, Infer for four years. I left Infer and ended up going to Lyft. At Infer, if you had the AWS root [access], and the log-in, if you were one of the early people setting up the TSDB, basic observability stack, you just kind of did it and hopefully the site wouldn't crash. I went to Lyft with that context.&lt;/p&gt;

&lt;p&gt;Lyft obviously has a very sophisticated observability and service mesh setup. &lt;a href="https://www.blameless.com/blog/talking-with-matt-klein-about-a-culture-of-reliability"&gt;Matt Klein&lt;/a&gt; wrote &lt;a href="https://www.envoyproxy.io/"&gt;Envoy&lt;/a&gt; at Lyft having seen a lot of stuff at Twitter and AWS that helped inform that. He wrote it with the team at Lyft, and it was interesting to see that technology. Palantir had been early, maybe early 2008 to 2013. People were doing whatever made sense. Then Infer was small. Lyft had an incident program. But despite the sophisticated measures in place, incidents would still arise. Whether they're related to your project or not, they overlap your sphere of ownership and things derail a bit. &lt;/p&gt;

&lt;p&gt;So I got curious about the fact that there were so many smart people — the number two ride sharing company, raised lots of money, very sophisticated technology — but the incidents would sometimes be kind of like, “We didn't have this graph, this detector, we didn't notice this thing."&lt;/p&gt;

&lt;p&gt;It went on. Not like a hard failure, not like 500s and everything smashed. But especially if you're dealing with machine learning, something can be wrong and cause a financial or user-impacting failure. You can notice these sorts of grey failures later than you'd like, and you end up with an impactful incident.&lt;/p&gt;

&lt;p&gt;You wish it wouldn't have happened, and so it's like, "How did we get here?" How do you make sense out of the fact that there's so much success, and so much on fire? That led me at Lyft to move teams. I joined initially to work on problems in the mapping space, which is interesting for Lyft because time and distance play an important role in which driver you dispatch to what passenger, and they also play a role in pricing. They're low-level Lego bricks. I moved from there to a team focused on chaos engineering and cross-cutting reliability, working on bot-based load testing that was actually a very successful source of reliability for Lyft. Eventually in the summer of 2019, I ended up leaving Lyft, and was lucky enough to spend about six months exploring resilience engineering.&lt;/p&gt;

&lt;p&gt;Amy Tobey is at Blameless now, and she's obviously in that community and amazing. Something I got really interested in was: Whether it's socio-technical systems or cognitive systems engineering or the intersection of modern safety science and software systems, there’s this sort of explanatory power or interesting perspective that could say, “Well... If you look at this dynamic system in a much larger sense, maybe you could see how you could get such a bad break. How could Google go down for six hours last summer? How could CloudFlare ship a regex bomb in their WAF rules?” People want functionality, right? You can't test things perfectly.&lt;/p&gt;

&lt;p&gt;And so after being lucky enough to spend some time at the &lt;a href="https://www.southparkcommons.com/"&gt;South Park Commons&lt;/a&gt;, which is a group in San Francisco for people in between, I decided it was time to go and reapply all this. And so I joined Stripe in February just in time, just before Coronavirus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; Nice. What is it that fascinates you and draws you to resilience engineering? It seems to me it expands outside of software. Amy just did a relevant &lt;a href="https://www.blameless.com/blog/sre-leaders-panel-managing-systems-complexity"&gt;panel discussion&lt;/a&gt; with Ward Cunningham and Tim Tischler from New Relic and Jessica Kerr.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; Yeah. I need to watch that. I try to keep up with talks online with various folks in the wider Learning from Incidents community, and ground myself in thinking about modern safety science as applied to tech industry startups, unicorns, etc. Resilience engineering actually is one flavor of modern safety science. There's also, for example, high reliability organizations. &lt;/p&gt;

&lt;p&gt;Resilience engineering is this sort of academic practitioner core of scientists and researchers such as &lt;a href="https://erikhollnagel.com/ideas/resilience-engineering.html"&gt;Hollnagel&lt;/a&gt;, &lt;a href="https://www.adaptivecapacitylabs.com/david-woods/"&gt;Woods&lt;/a&gt;, and &lt;a href="https://sidneydekker.com/"&gt;Dekker&lt;/a&gt;, who write about it and have studied it for a couple of decades. There's an evolving view of what safety means; &lt;a href="http://sunnyday.mit.edu/"&gt;Leveson&lt;/a&gt; is another key figure there. It’s all there in factories, cars, medicine, power plants, software systems. We want systems to do some things and not others, sometimes implicitly. That's where edge cases come from. "I would like A plus B to equal C." It's like, "Well, what about overflow, underflow, floats, conversion?" I don't want to think about that; infer what I want. But computers aren't that smart.&lt;/p&gt;

&lt;p&gt;When I think about resilience engineering or modern safety science in general, it sprawls very quickly. It’s first really embracing the totality, a larger view of a system. &lt;/p&gt;

&lt;p&gt;So rather than saying, "I want a service mesh, so I should use Envoy. And if I have my retries and circuit breakers and all these things set correctly, and I have the right dashboards and all of these things, then I'll have a reliable system"...&lt;/p&gt;

&lt;p&gt;Resilience engineering would say, "Okay, but who are the people? Who's going to get paged? What's going to happen when they get paged? How ergonomic will the dashboards be? Do you understand these systems? If there's a high-priority feature request from the business, will that draw resources away? You plan to deploy Envoy, does it sort of pause well? Can you go partway into this migration? Are you setting yourself up for a fragile, brittle, all-or-nothing sort of situation?”&lt;/p&gt;

&lt;p&gt;Specifically, questions in resilience engineering that I find myself relying on a lot...one is, “How did it make sense at the time?”&lt;/p&gt;

&lt;p&gt;Which is maybe cognitive systems engineering. People generally don't show up to work to mess up and ruin things for customers or their coworkers. If something bad happened then it's because someone tried to do the right thing, or took a series of actions that they thought would have a positive outcome, but it did not. How did it make sense to them? This is the cognitive perspective.&lt;/p&gt;

&lt;p&gt;Another is this &lt;a href="https://github.com/lorin/resilience-engineering#safety-i-vs-safety-ii"&gt;Safety-II idea&lt;/a&gt; that the work is the work. Based on what leads to promotions, what we see leadership not only say but what they do...people decide every day which corners to cut. If there's an aggressive deadline, what tradeoffs to make. A lot of the time that leads to success, some of the time it leads to failure, but it's not that someone flipped a coin at the start of their day and said, "I'm going to do success work or failure work." They did work, and all those sort of latent variables ended up one way or the other.&lt;/p&gt;

&lt;p&gt;I think it's from &lt;a href="https://twitter.com/this_hits_home?lang=en"&gt;Ryan Kitchens&lt;/a&gt; (and the overall Learning from Incidents community) where we get the idea of a perfect storm, or the “&lt;a href="https://qconnewyork.com/ny2019/presentation/how-did-things-go-right-learning-more-incidents"&gt;nines are useful&lt;/a&gt;” line. You think about what sort of incidents you get into, or what you want to improve or avoid from a reliability perspective. There’s this idea that with some of the highest-profile incidents, 50 things went wrong at once and they won't go wrong the same way again. It's about contributing factors, and not root causes.&lt;/p&gt;

&lt;p&gt;And I don't know when this shifted...obviously people like &lt;a href="https://www.kitchensoap.com/about-me/"&gt;John Allspaw&lt;/a&gt; have been working on this for a long time, and there’s been an increasing presence of folks giving talks on these sorts of things at SREcons and various other places. The interesting challenge now, and one that I'm interested to see manifest in my job at Stripe, is how to take that perspective and map it to success in how to evaluate a process or outcome, in actual systems. Safety is not as interesting to me in the abstract. If you're talking about safety of a hospital, theoretical hospital safety is interesting, but if you can actually figure out how to get better outcomes, even better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; Absolutely. I feel that SRE is kind of opening the door to some of this conversation around understanding the human aspect, and it seems it could be a nice gateway into broadening into this huge field like resilience engineering. I am trying to grapple with where to define the boundaries. It seems like it can continue to push out and grow based on the broad field of psychology, and extend into how we can drill into distributed causes and effects. Can you share how to apply this specifically in the field?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; Yeah. That's a great question. There’s the &lt;a href="https://accelerateinstitute.org/blog/fallacy-best-practices"&gt;fallacy of best practices&lt;/a&gt;, which is to say that what successful organizations have adopted and what works is highly context-specific. In terms of applications and where you draw these boundaries, it's going to depend on many factors: where you are in your company, how your leadership is thinking about reliability, for example.&lt;/p&gt;

&lt;p&gt;That's a good challenge, to take this down from the abstract. The advice that Richard Cook and John Allspaw gave, which I’ve come to embrace, is “catch a wave” or “start small”. A clear thing that you can do, which intersects with resilience engineering, is to learn from incidents. The &lt;a href="https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf"&gt;Etsy debriefing facilitation guide&lt;/a&gt; is a great place to start. The &lt;a href="https://www.adaptivecapacitylabs.com/blog/"&gt;Adaptive Capacity Labs blog&lt;/a&gt; has a lot of great resources on what it means to learn from an incident, as opposed to the many other ways that a blameless postmortem could be used in an organization.&lt;/p&gt;

&lt;p&gt;So rather than stopping something or telling people what they're doing is wrong, try to find a cohort of people in your organization who are curious and then do the leading work of exploring perspectives in discussion.&lt;/p&gt;

&lt;p&gt;This reminds me of something I thought was really interesting from &lt;a href="https://m.subbu.org/?gi=e35a17c11dc"&gt;Subbu Allamaraju&lt;/a&gt; at Expedia. He did &lt;a href="https://m.subbu.org/if-only-production-incidents-could-speak-555d68b6abfc"&gt;a review&lt;/a&gt; of all the incidents that Expedia had had, and classified them. An interesting thing for resilience engineering to reflect on is chaos engineering and learning from incidents. It helps you bootstrap your ability to learn, in this cognitive way: how to succeed at ensuring this incident isn’t worse.&lt;/p&gt;

&lt;p&gt;If a new hire was involved in this incident, instead of the senior seasoned person who stepped in, how would that have been different? The fact that you're looking at a real incident, makes it context-specific to your organization. And it's actually a real thing that happened, which is such a rich source. If you look at chaos engineering or other approaches like continuous verification, there's an advanced mode where you're trying to have things fail a little bit all the time so that you can better understand what those modes of work are. But if you think about chaos engineering game days, you're stressing the system, you make a hypothesis, you're trying to see what happens when you have it failing a certain way. That's a hypothetical failure, in QA or whatnot.&lt;/p&gt;

&lt;p&gt;Of course, you can actually see it play out live. In your incidents, you have such a richness of data that is concrete. This is no longer abstract. Your customers are impacted, someone was paged at two in the morning. That's the place to start. I think a lot about high alignment and loose coupling and back pressure. A place where I think resilience engineering is interesting is where people make local decisions. “I increased the 9s of this system. I invested in reliability.” But what gives you confidence that actually improves the reliability that your customers see, or your overall goal?&lt;/p&gt;

&lt;p&gt;Mikey Dickerson has &lt;a href="https://www.oreilly.com/library/view/seeking-sre/9781491978856/ch31.html"&gt;an essay in Seeking SRE&lt;/a&gt;, where the storage team cranks for a quarter and really improves the reliability of the backend. And then the application teams say, "We'll improve our latency; we no longer have to make this many retries because the storage is more reliable." You can consume this safety margin generated somewhere in some other part of this complex system, where no one can fully comprehend it all.&lt;/p&gt;

&lt;p&gt;That to me feels like one of the ultimate goals: to help understand how a local change can track to an end-to-end result. But this runs through how an entire organization operates. So it becomes both very complicated on a social or a political level, and also very context specific. If you're some executive who's eager or open in a certain way, then it may be easier. If you're crunching to hit some crazy deadline, you may not get much out of it.&lt;/p&gt;

&lt;p&gt;It's leading and lagging indicators. You train and do the marathon and hopefully your time improves. Find like-minded people, and learn from incidents in the time that you can budget or free up for leading work. In terms of the outcome, there it becomes much harder to give generic best practices. You're learning a bunch, so keep an eye on what's happening in the organization or when you can track a path to leverage those learnings. That's the “catch a wave” suggestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; I definitely agree; I feel like incidents are the entry point. It’s the area where most people can kind of come together in a sense and really feel the same pain, and that gets the conversation started. Now I wanted to bring us back to your time at Lyft, where the team developed some really cutting edge solutions to highly complex problems that come from operating distributed systems. What are some of the most interesting technology challenges you encountered there?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; A really interesting technology challenge, especially for larger organizations, is this bifurcation that happens between infrastructure teams and product teams, for example what it looks like to migrate to Kubernetes. Should you do it, if you have a group that's responsible for providing strong technical primitives around networking and compute and storage and other resources?  &lt;/p&gt;

&lt;p&gt;A pattern that I would hypothesize is frequent for high growth companies, any Decacorn in the Precambrian era, is that your business is exploding and you have five people on the infrastructure team, and then it’s just “by hook or by crook.” That is very mutable infrastructure, &lt;a href="https://medium.com/@Joachim8675309/devops-concepts-pets-vs-cattle-2380b5aab313"&gt;many pets not cattle&lt;/a&gt;, because that works during that time. But then you may get to a point where the properties of this configuration management or service discovery are not as safe, or it's kind of clunky. It's probably not doing enough for the customers in product engineering, but also the people who own and operate it are not that happy with it anymore.&lt;/p&gt;

&lt;p&gt;But then how do you build the new thing and keep the old thing running? Because that's where all product is running. And how do you simultaneously deliver stuff that is actually what the customers want? From the outside, it's the hype cycle. It's like, "Okay, I'm an infrastructure engineer. So now VMs are old and busted, I'm onto containers, let's do Kubernetes. Kubernetes is going to be awesome. It has its own inertia." But what is the platform that you're providing to product engineers? Is it going to solve their business problems? The value of the infrastructure team is actually observed at the end-to-end impact it has via product teams delivering features to customers.&lt;/p&gt;

&lt;p&gt;While I was at Lyft, I saw startups building service mesh products like control planes built on top of Envoy. However, given Envoy’s roots at Lyft, our control plane was one of those first ones. I think about it the same way you think about cell phones and landlines. If you get cell phones before you have a landline infrastructure, you just put towers up everywhere and then you have cell phones. If you have landline infrastructure, it's like, "Well, when do I get a cell phone? The landline is probably working okay. Do I want to really pay for a cell phone and a landline? The landline does half the stuff the cell phone does."&lt;/p&gt;

&lt;p&gt;There's a complexity shell game. You have all these stressors, you have all this growth. Well, how do you handle storage? It's probably not the right moment to migrate from Mongo to DynamoDB. But at some point, your use of Mongo may be getting very long in the tooth. And how do you make those calls? &lt;/p&gt;

&lt;p&gt;When I think about technology challenges or problems, it can feel like a frog in the pot. You come into a company that's been operating for a decade that has hundreds or thousands of engineers, and all this stuff has been built up. And so it's like, "Is the technical challenge to build a new system? Is the technical challenge to migrate to that system? Is the technical challenge to build a new primitive?" &lt;/p&gt;

&lt;p&gt;It may be my own bias based on where I've worked, but frequently, most companies don't want to build their database; they want to use a database. So the technical challenge is, how do you rapidly adapt technology to let people focus on the business logic as much as they can?&lt;/p&gt;

&lt;p&gt;And now we move back to resilience engineering. It's the triangle of safety, efficiency, workload. People jump on new gadgets to try and get their work done. And then this pulls you towards unsustainability. You're like, "Oh no, don't use it this way." But if you make it possible for someone to make a query without an index and it works for them today, then they will do that. And then in three months, you'll be like, "My costs are up." Or "I can't chart in this way." And they'll say, "Well, but I had to ship this thing." &lt;/p&gt;

&lt;p&gt;When I think of technology challenges, I think the industry is good at framing problems like: where should we place our Kubernetes clusters so that we tolerate some AZ or region failure? Or, how should we build a CRD? Or, how should our auto-scaler work based on latency or CPU? I think an underappreciated challenge is: how well do new hires understand this? Does this actually have a net positive impact for people trying to deliver value to customers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; That's a good point. Amy has said, "If the short answer isn't some version of 'it depends', it's not really the right answer. Or it's not really speaking the truth." That completely embodies what you've been saying here; it really depends on your goals and the situation that your team is trying to work towards collectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; The other thing that's interesting is the role. If you have an incident, there's a default to not talk about it that much. If it's an unimportant incident, just sort of put a blog post out for something small. If it actually impacted customers, what gets written is something that is for consumption primarily by people who are paying you money. It has a lot more considerations than learning.&lt;/p&gt;

&lt;p&gt;There's a challenge because you're thinking about all this stuff and you're writing code and you're shipping software, building a product. It is challenging. Avoiding an incident is challenging, but you can open source a library and put it on GitHub. But when you're talking about the concrete application of these things, how we go from “It depends” to “I wrote this thing last week and it made something faster or better”, it's challenging to take those learnings and make them legible globally.&lt;/p&gt;

&lt;p&gt;It might be the quality of a dashboard, or you're changing some property of a system, like splitting up a service into two sub-services that have different traffic patterns. Your choice there, how it depended, and why it was the right choice just pulls on threads. The threads there go to this deep knot that's existed for the past decade or a few years, or maybe they live in the intuition of senior architects in the company. If you think about the concrete path that connects this philosophy or perspective to clear outcomes, in the same way that shipping a new feature is a clear outcome, it is very context-specific. So it can be hard to talk about outside the organization in which it exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; You’ve also spent some time between Lyft and Stripe working on research. Can you tell me a little bit about that?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; Between Lyft and Stripe, I went to this place called South Park Commons, which is a community in San Francisco. I was there for about six months. It's for people who are sort of figuring out what they want to spend the next decade doing. That community is great, and I had a lot of fun. While I was there, I had the freedom for self-directed exploration, soaking in material from Twitter, Slack, academic papers, books, talks, stuff outside of software, &lt;a href="https://press.princeton.edu/books/paperback/9780691095189/friendly-fire"&gt;Friendly Fire by Snook&lt;/a&gt;. There are these great examples of complex system failures, like the friendly fire shootdown of helicopters in Iraq by fighter jets.&lt;/p&gt;

&lt;p&gt;This is a thing that really should not happen, right? The military has many protocols surrounding how to know who should be flying where and when, and how you coordinate between different groups of people. How you do visual identification, how to determine what you're seeing before you fire a missile. And yet, this actually happened. When you dig in, these local pieces are not perfectly coordinated. The helicopters, the people who approve the helicopters, the fighter pilots, people who approve the fighter pilots, the AWACS people who are supposed to be discovering everything. So it's a perfect storm. &lt;/p&gt;

&lt;p&gt;What many people work on with software is really important. I'm happy that I'm not working in a place where mistakes cause people to die. But it's really powerful to notice and learn from these commonalities or parallels from other domains.&lt;/p&gt;

&lt;p&gt;That's another big learning of resilience engineering: the human component. Everyone is making these trade-offs. Everyone is cutting corners. Everyone is looking at graphs and doesn't understand all of them. So there is this tremendous amount of knowledge, this history of incidents that we can learn from.&lt;/p&gt;

&lt;p&gt;My time at South Park Commons was a really great opportunity to sort of take a pause from being at a high growth, high velocity software company, to just be curious and swim in the ocean and mull over these big topics in a more relaxed environment. And then, again, to be lucky enough to take that perspective back into industry. How do I take this high-level perspective and what does it take for it to become legible, and how does it really improve things for engineers and customers at a real company?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; The South Park Commons looks like a really cool place to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; Yes, it's great. I recommend it to anyone who's at that point in their career. It's quite interesting as a community, especially in times like COVID, because historically the physical space has played an important role. But that's resilience engineering. Adaptation is fundamental. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Darrell Pappa:&lt;/strong&gt; If people reading want to get started with resilience engineering, what’s the best way to do that at their companies? Do you have any books or other resources you’d recommend?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jacob Scott:&lt;/strong&gt; My favorite resources are the Adaptive Capacity Labs blog, and the Learning from Incidents community blog. Twitter is good; there's a few dozen people to follow who are at the center of this community of resilience engineering intersecting with software. Friendly Fire by Snook is also interesting because it's a story. Which resources help most may depend on where you are in your resilience engineering journey. &lt;/p&gt;

&lt;p&gt;Are you someone who's like, "Yes, I understand." Or someone who’s like, "Through my lived experience, I understand blame and sanction and the ways in which this can go wrong." In which case, John Allspaw has many videos on YouTube: pick your favorite one.&lt;/p&gt;

&lt;p&gt;If you’re someone who's like, "I've heard about this, I'm curious. It doesn't quite click for me", then try the blogs, or a book like &lt;em&gt;Friendly Fire&lt;/em&gt; or the &lt;a href="https://www.nrc.gov/reading-rm/doc-collections/fact-sheets/3mile-isle.html"&gt;Three Mile Island Report&lt;/a&gt;, which walk you through real case studies. The places where this is most stark are environments where it’s like, “How could this happen? We tried so hard. The people here are so smart. We followed all the best practices. How are we still having this incident again? How come what we did last time didn’t work?” Catastrophes happen, cascading failures happen. So the stories where people have gone very deep and which can be teased apart can bring valuable learnings to light.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you’re interested in joining Jacob and the amazing infrastructure team at Stripe, check out their careers &lt;a href="https://stripe.com/jobs/search?t=engineering"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted on the &lt;a href="https://www.blameless.com/blog"&gt;Blameless blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>A Culture of Reliability with Matt Klein, Creator of Envoy</title>
      <dc:creator>Ancy Dow</dc:creator>
      <pubDate>Fri, 26 Jun 2020 16:25:40 +0000</pubDate>
      <link>https://dev.to/blameless/sdf-1c6e</link>
      <guid>https://dev.to/blameless/sdf-1c6e</guid>
      <description>&lt;p&gt;A discussion with &lt;a href="https://www.linkedin.com/in/dnblankedelman"&gt;David N. Blank-Edelman&lt;/a&gt; and &lt;a href="https://medium.com/@mattklein123"&gt;Matt Klein&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Matt Klein is a software engineer at Lyft and the creator of &lt;a href="https://www.envoyproxy.io/"&gt;Envoy&lt;/a&gt;, an open source edge and service proxy designed for cloud-native applications. Envoy is now used by organizations such as Google, Microsoft, Netflix, Stripe, IBM, and Airbnb. He's a big fan of service mesh architectures as a way of improving the reliability and observability of environments, especially those that are microservices-based (Envoy is at the heart of a number of service mesh projects like Istio).&lt;/p&gt;

&lt;p&gt;Though it is always a pleasure to have a deep dive technical discussion on service meshes with Matt, I've also found his observations on SRE culture to be a good starting point for conversations about the field. That's the direction we took for this discussion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Let's do a level set. How do you define SRE?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: I think a lot of people have different definitions about what SRE is. The way that I think about SRE is that it's really a software engineering and a thought discipline that is focused on system and product reliability, and not necessarily building features for a specific product. So obviously it's pretty nuanced. But I think at the end of the day, that's the only way to really cleanly break it down. Traditional software engineers are focused on building particular product features, typically driven by product managers and business requirements. And SREs or production engineers, or reliability engineers, or however you want to call them, I think their focus is not necessarily on building features, it's on making sure that those features have a baseline level of reliability for the customers who end up using those features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Do you feel that the culture for SREs is the same as the culture for feature producing software engineers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: So I guess it depends. This topic is obviously fairly complicated. We can talk about the culture of a product engineer, or the culture of a reliability engineer. But I find it more useful to think about culture in the sense of “what is a reliability culture?” I think that spans product engineers, and it spans reliability engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Okay, so that's a great distinction. So what does a culture of reliability entail?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: I think that is actually what is most interesting about this discussion. Because if you look at the differences in terms of how companies have thought about this problem—dealt with this problem of building features and having those features actually end up being reliable, there isn't one answer. The industry has actually been all over the place.&lt;/p&gt;

&lt;p&gt;We have gone from having the very traditional siloed roles of software engineers, test engineers, systems engineers, release engineers. And then we've gone full stop the other way, where we have this theoretical concept of DevOps. DevOps means different things to different people, but at least at a high level, the idea of DevOps is a much more agile development process—the idea that we're doing infrastructure through code, the idea that everyone can do their own operations.&lt;/p&gt;

&lt;p&gt;And then we have somewhere in the middle of this a reliability engineering discipline where we try to enable product engineers to have operational excellence. Provide them the tooling and the culture. So I think it's important to separate culture from discipline. At a super high level, reliability culture is, honestly, just respecting the user. It's respecting production. And I actually think it's that simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: When you say “respecting the user” or “respecting production” what do you mean?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: We can have a much more nuanced conversation about it, but there's a way to think about software development or product development where we're going to add features X, Y, or Z. We don't do that in complete isolation without thinking about how our actions actually go through and affect the user. If we have a bug, what impact does that have on the user? If we for example deploy our new feature at a peak time, and we break the website, is that the most user respecting way that we could have done that? If we don't use feature flags, so that we can incrementally test things, is that respecting the user? Are we doing development in a way where we reduce the chance of impacting the user experience as much as possible?&lt;/p&gt;

&lt;p&gt;SRE culture is balancing velocity. Balancing feature development with risk assessment. And it's always a balance of how do we have as much feature velocity as possible, while having as little risk as possible. And that's why this is such an interesting topic. Because there's no one answer—there's no one way of actually solving this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: And respecting production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: Respecting users is a superset of respecting production—if production goes down the user doesn't have anything to use. I think they're kind of synonyms, just in the sense that I think implicitly by respecting production, we are respecting the user who uses production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: If I were to walk into an environment and I was trying to determine whether that environment had the sort of reliability culture that you were speaking of, what would be some of the indications of that?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: Right, that's where we can pop down a level. I think as an industry, we have evolved towards a set of best practices that allow us to have high velocity, while maintaining respect for the user or respecting production. We have a good understanding of things like using feature flags and doing incremental rollouts, how to do canarying, SLAs/SLOs, actually alarming on issues, dashboards that are not only infrastructure metrics but also business metrics to understand the impact to the user, a strong culture of automated testing, how we do config management, can we rollback, code reviews… And I can go on, and on, and on. There's a set of industry best practices. And these are DevOps best practices also.&lt;/p&gt;

&lt;p&gt;But the idea here is that the more of these best practices that an organization uses, the more likely it is that they can maintain high velocity, but still respect production, respect the user, and hopefully have as few issues as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Are there best practices that are human based—related to how we treat the humans and what the humans do that you think are important as part of a culture of reliability?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: Yeah. And so that's where I actually think this conversation gets pretty interesting. Because as an industry, I think that is the area that we have the least agreement on. I think most people, if you sit them down, and ask them about best practices, would broadly agree. Like I would be amazed if anyone disagrees at this point that you should have a canary process, you should probably do feature flagging, continuous integration, etc. There's some basic table stakes most people agree with.&lt;/p&gt;

&lt;p&gt;Where I think it gets a lot more interesting, and there's a lot less agreement, is on things like “are all developers on call? Or, are only a subset of people on call? Who is responsible for documentation? And who's responsible for keeping that documentation up to date? Who is responsible for new hire education, continuing education, teaching engineers how to understand the concepts of canary and feature flagging?”&lt;/p&gt;

&lt;p&gt;Because we don't teach this in college, right? You can't assume that people just come in and know these things. We have to actually teach them. So there is a strong component of education, of documentation, of mentorship. And I think as an industry, there is the least amount of agreement on how we go about doing that.&lt;/p&gt;

&lt;p&gt;As an industry we haven't decided on the right people layout. By people layout I mean, “do we have all engineers being on call, doing deploys and those types of things?” Or do we have a set of SREs who tend to do those things? Or do we have a completely siloed software engineering team versus a systems engineering team? Very siloed roles in terms of who touches production, who has gates on how we monitor things, how we deploy, when we deploy. There's a range all the way from the silo to true “DevOps” with an everyone is on call culture, where everyone deploys, everyone manages their own runbooks, everyone manages their own systems.&lt;/p&gt;

&lt;p&gt;I've been thinking about this area, this human stuff, a lot recently and wrote this blog post called The Human Scalability of “DevOps”. I think that this is the area where as an industry, we have to do the most evolution. I've become increasingly convinced that agile DevOps development practices for all engineers are an absolute no brainer. We should do CI, we should allow everyone to deploy. These are very basic things. But at the same time, I feel that as an industry, we don't invest in the human side of things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Can you give an example?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: I think increasingly we have human scalability issues where we're not respecting the fact that people don't know how to necessarily use modern infrastructure, or don't know off the top of their head all of these site reliability best practices. You can go and read about them in the Google Site Reliability book, or the new book that's coming out. But these things are just not obvious to people. And as an industry, I don't think we educate folks enough, and do enough continuous education and support them. And I don't think we do enough mentorship. There's really good agreement on the things that people need to do, but there's not good agreement on how we educate people, and how we help them actually do those things.&lt;/p&gt;

&lt;p&gt;I think there are two main problems: first, as an industry, we have an over-inflated view of how easy cloud native infrastructure actually is to use. We like to claim that we can go and use all this modern technology, infrastructure as code, etc., and that the days of yore in which you needed an old-style system admin just don't apply anymore.&lt;/p&gt;

&lt;p&gt;We are very far ahead of where we were 10 years ago. But I think it is a gross exaggeration to say that it's easy. If you look at the current state of the world, including Envoy, Kubernetes, all of the tooling that we've built—it's still really hard to use. So expecting people who don't have the domain experience, either in core infrastructure or in reliability engineering, to come in and just know how to do networking, or know how to do provisioning, or know how to do containers…to me, it's a little nutty. So that's number one.&lt;/p&gt;

&lt;p&gt;I think the other side of things is that we are engineers, and engineers like to build things. And a lot of engineers don't necessarily value softer but important things. The two things that I'm thinking of are continuing education and documentation.&lt;/p&gt;

&lt;p&gt;The dream of the cloud native experience is that it's self-service…we're expecting everyone to come in and through APIs and code, and through documentation, build amazing applications. I think we're getting there, but at a lot of companies, particularly larger ones that have infrastructures that have scaled beyond what some of the current cloud abstractions can give people (Lyft is one of those companies) there are engineers that like to build stuff and solve problems. We don't like to write documentation. We don't like to build new hire or continuing education classes. And I think we grossly under invest in those two things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Do you feel that the industry has come to consensus on how to handle the situations where things break, where the reliability is compromised?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: There's no consensus on how organizations do postmortems. Not only postmortems, but how people treat follow up actions. Like do companies actually fix their follow-up actions? How do we prevent the same incident from happening again? Or even during an incident, how do we communicate? Is there an incident leader? Do we get on a call? Do we use Slack? There isn't consensus, there isn't common tooling that people use around this. There's a common theme here. We're running before we're actually crawling.&lt;/p&gt;

&lt;p&gt;You can look at older organizations like Amazon or Google, where they have a strong set of procedures born out of not only years of experience, but also the fact that they started when the industry was in a very different place. They're going to have an incident management team that knows how to deal with an incident and run it, open a call bridge, do follow-up, and all of those things. Whereas in newer companies there typically isn't an incident leader role—it can be chaos when an incident happens. Or there's no incident command center, there's no central monitoring command center.&lt;/p&gt;

&lt;p&gt;These are just areas where we as an industry want to do things with as few specialized people as possible because, rightly so, we want people to write business logic; that's how we make money. But we see this theme, again, and again, and again, where the automated tooling that we assume will save us—it's not quite there yet. And there are still a lot of these human issues. We don't have consensus on “what are the right roles that we have to hire for? How do those roles actually interoperate with other folks at the company? What roles are required? Do you need an incident command center? Do you need a monitoring command center?” Most modern newer companies would say no. But then you wind up with issues during incident handling, so it's complicated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;David&lt;/strong&gt;: Do you think we're going to get there? Do you think that the industry will coalesce?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt&lt;/strong&gt;: I do think we will get there. I think it's going to take longer than a lot of people would like. If you look at the way most cloud native infrastructure is heading, and look out 5, 10, 15 years, I can totally buy that people are mostly going to be writing their applications as functions. You know, they're using lambdas and serverless, and all that stuff. At a visionary level, that sounds great. People write application logic snippets, you can talk to a database without really needing to know how it works, and you can make network calls to functions, and it just works.&lt;/p&gt;

&lt;p&gt;Unfortunately, we can barely run our container-based infrastructures today, let alone serverless, which I would argue is an order of magnitude more complicated in terms of the concerns around auto scaling and dynamic infrastructure, and networking, and observability. We're just not there yet.&lt;/p&gt;

&lt;p&gt;So, if I look out 10 years, I do think that eventually we will be using a lot more pre-canned solutions, where a lot more will be automatic: write some code snippets, have them run, and the operational concerns around deploying and all of those things just work. That doesn't mean, though, that 10, 15, or 20 years from now, when people are writing their applications on amazing functional substrates, there won't have to be a reliability culture. Because you can write a functional snippet that breaks all of production.&lt;/p&gt;

&lt;p&gt;I do believe that over time, cloud native infrastructure will become easier to use. A lot of the stuff that we struggle with today in terms of how people do data storage, multi-region, routing, and all of these things that are a mess today—I think a lot of it is going to become just built into cloud platforms. But there are still all the human concerns around things like “okay, just because the cloud platform supports this amazing canary system, do people use it?” You can't force people to do things. You can't force people to use feature flags. You can't force people to think about how to test their software.&lt;/p&gt;

&lt;p&gt;So I don't think we're ever going to reach a point in which we get rid of this reliability engineering role. I think there always is going to have to be this duality of roles, where we have people that are thinking about product, we have people that are thinking about reliability. It may be that in the future, what took 200 people on an infrastructure team takes four, but it doesn't mean that we don't still have to be thinking about these basic human things.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted at the &lt;a href="https://www.blameless.com/blog"&gt;Blameless blog&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How DevOps and SRE Fit Together</title>
      <dc:creator>Ancy Dow</dc:creator>
      <pubDate>Tue, 23 Jun 2020 15:16:15 +0000</pubDate>
      <link>https://dev.to/blameless/devops-and-sre-defined-20ek</link>
      <guid>https://dev.to/blameless/devops-and-sre-defined-20ek</guid>
      <description>&lt;p&gt;I've had the pleasure of talking to many organizations about their operations practices and how they think about the challenges they face around maintaining a production environment that adequately serves their business. I have yet to meet an organization that hasn't had to balance often conflicting needs around feature velocity and operational stability. Simply put, this is the classic "devs gotta make stuff (customers want features and functionality) and the ops gotta keep things running (systems that aren't up can't serve customers)" dichotomy that everyone wrestles with.&lt;/p&gt;

&lt;p&gt;Two "movements" (for lack of a better word) sprung up in direct response to this challenge: DevOps and Site Reliability Engineering (SRE). The former is better known because it grew up in the more public sphere. The second, until relatively recently, was cloistered in larger organizations as part of their efforts to scale operations beyond anything the planet had seen before. As a result, public understanding and adoption of DevOps is significantly further ahead of SRE. This post is an attempt to provide a very brief introduction to SRE and to hopefully suggest how it could relate to your existing DevOps practices.&lt;/p&gt;

&lt;h1&gt;
  
  
  What's SRE?
&lt;/h1&gt;

&lt;p&gt;Just like there's no one canonical definition for DevOps, I won't pretend to be able to give you the final word on SRE. You will hear a range of answers starting from "SRE is what happens when you ask a software engineer to design an operations team" to "Site reliability engineering (SRE) is the application of scripting and automation to IT operations tasks such as maintenance and support."&lt;/p&gt;

&lt;p&gt;When I speak about SRE I usually describe it as an engineering discipline devoted to helping an organization achieve the appropriate level of reliability in their systems, services, and products. There are two crucial parts to that definition: first, SRE is specifically focused on reliability as a fundamental property (perhaps the fundamental property). The rationale behind this is pretty straightforward. You can expend a huge amount of effort and resources adding features and functionality to your service or product. Kerjillions of dollars and countless hours could be expended to create something incredibly rich in features and functionality. But if it is not up, if it is not available when your customers attempt to use it, it doesn't do them, your business, or your profits a lick of good.&lt;/p&gt;

&lt;p&gt;The second, slightly more subtle part of my definition hangs on the word "appropriate" when speaking about the level of reliability. An important observation made by the SRE world early on was that there are actually very few systems and services that have to be 100% reliable. In fact, there are very few situations where that is even desirable, because the cost and effort of achieving greater reliability almost always rise at a very steep rate. And as friends at Google are fond of pointing out, sometimes it's not even possible to hit certain levels of reliability. SRE seeks to not only acknowledge this gap between perfect reliability and desired reliability but, in many cases, exploit it for the greater good of an organization's engineering priorities.&lt;/p&gt;
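&lt;p&gt;To make that trade-off concrete, here is a minimal Python sketch (illustrative only, assuming an average month of roughly 43,800 minutes) that converts a few common availability targets into allowed downtime per month. Each extra "nine" shrinks the allowance roughly tenfold, while the effort to achieve it typically grows much faster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Allowed monthly downtime at a few common availability targets.
# An average month is about 43,800 minutes (365 days / 12).
MINUTES_PER_MONTH = 43_800

for target in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - target) * MINUTES_PER_MONTH
    print(f"{target:.3%} availability allows {downtime:,.1f} min/month down")
&lt;/code&gt;&lt;/pre&gt;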

&lt;h1&gt;
  
  
  Whither DevOps?
&lt;/h1&gt;

&lt;p&gt;Many (most?) of the same challenges SRE was created to address were also the motivation for the formation of DevOps. I think of SRE and DevOps as parallel tracks, both attempting to solve the same problems. As a result, it is no coincidence that some of the same best practices are requirements for both. For example, both SRE and DevOps require that you bring automation to bear as a way of addressing scaling (and other) problems. Sound release engineering processes (including CI/CD) are required to create a manageable production environment. And everybody's favorite subjects, monitoring and observability, are core to both SRE and DevOps practices.&lt;/p&gt;

&lt;h1&gt;
  
  
  So, I can just rename my existing team?
&lt;/h1&gt;

&lt;p&gt;Well, not so much. While the two share many practices, they do not share an equivalent philosophy, attitude, and approach to those practices; the emphasis can often be different. Plus, in order for SRE to succeed en masse, crucial parts of the organization have to be willing to accept some of the values and priorities that permit SRE to properly operate. At the very least, there has to be buy-in in the right places around the value of reliability to the business, as discussed earlier.&lt;/p&gt;

&lt;h1&gt;
  
  
  Do I need to hire all new people?
&lt;/h1&gt;

&lt;p&gt;It's really important to distinguish between dedicated SRE roles (people who call themselves site reliability engineers) and SRE practices. In many organizations, as they grow, there comes an inflection point where it becomes appropriate to hire people (and then form teams of them) whose expertise and primary focus is reliability. It is important to note that when you have such people on the payroll, they are not the only people in the organization responsible for paying attention to reliability (everybody is responsible for constructing reliable software and infrastructure). SREs are the people who have a specialization that can be brought to bear on these challenges. In this regard there is a direct analogy to security. At a certain point, it makes sense for a business to hire people who focus primarily on security. They are not the only people in the organization paying attention to security (they had better not be; security is everyone's responsibility), but they do serve a crucial role in regards to that domain.&lt;/p&gt;

&lt;p&gt;That's SRE roles, but what about SRE practices? Before that inflection point takes place, before you've hired dedicated SREs, it absolutely makes sense to introduce some of the more congruent SRE practices and SRE tools into your organization.&lt;/p&gt;

&lt;p&gt;Often DevOps organizations are the perfect place to plant these seeds because they already have a culture that values modern operations practices (as mentioned in Whither DevOps above). An easy example is one near and dear to the people who run this blog: incident response and post-incident follow up (the blameless postmortems from those incidents). Another might be around the creation of service level objectives (more on this in a later piece). As these practices become popular, as the culture changes as a result, and as some individuals start to gravitate towards a reliability-centric viewpoint, it becomes natural to consider introducing SRE as a full-time role for these people.&lt;/p&gt;

&lt;h1&gt;
  
  
  Yes, have both
&lt;/h1&gt;

&lt;p&gt;SRE offers a set of principles, practices, and a particular focus. If you already have a DevOps culture and practices in place in your organization, there is no good reason to walk the halls ripping up people's business cards and handing out new titles... In fact, don't do that (that's a subject for a future blog post).&lt;/p&gt;

&lt;p&gt;But if you do look at SRE and find its ability to deftly address the feature velocity vs. operational stability conundrum compelling, by all means, consider exploring those principles. Experiment with some of the practices and tooling that make adoption easier. See if it can offer you the same benefits enjoyed by the many organizations who have already walked this path. And be sure to let me know how it goes for you.&lt;/p&gt;

&lt;p&gt;If you'd like to see how a platform can help you adopt SRE (and DevOps) best practices via tools, the Blameless team can show you how. Sign up for a trial at &lt;a href="http://www.blameless.com"&gt;www.blameless.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Written by David Blank-Edelman&lt;br&gt;
Originally posted at the &lt;a href="https://www.blameless.com/blog"&gt;Blameless blog&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How SLOs Help Evernote's SRE Team Manage Tech Debt</title>
      <dc:creator>Ancy Dow</dc:creator>
      <pubDate>Sat, 13 Jun 2020 22:00:22 +0000</pubDate>
      <link>https://dev.to/blameless/garrett-plasky-shares-how-slos-transformed-evernote-429g</link>
      <guid>https://dev.to/blameless/garrett-plasky-shares-how-slos-transformed-evernote-429g</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted at the &lt;a href="https://www.blameless.com/blog"&gt;Blameless blog&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Traditionally, DevOps engineers have had their hands tied when they ask for investments in reducing technical debt. Architecture changes, code refactoring, and slowing down feature development for the sake of product availability are mostly wishful thinking, even though these efforts inevitably become mission-critical for every company. If fixing tech debt is not handled or timed well, the company's product and business health will both suffer. When should you deal with your technical debt? How can you bring the leadership team on board? What do SLOs have to do with these questions?&lt;/p&gt;

&lt;p&gt;Blameless chatted with SRE leader Garrett Plasky to get the answers. Garrett's SRE team of 15-20 engineers is responsible for keeping the lights on, a.k.a. running the production infrastructure, for Evernote's over 220M users across 5 billion resources. Their product is a cross-platform SaaS application designed to enable people to organize, personalize, consume, and share thoughts from any device at any time.&lt;/p&gt;

&lt;p&gt;Anytime you go to the leaders with “Here's my problem. We should invest in our architecture,” they say, “Where's the data?” You go, “Here's my SLO graph.”&lt;/p&gt;

&lt;p&gt;To that end, SLOs are a critical part of shipping code to your production environment. Having SLOs in place for your production services allows you to remove the emotion and ambiguity when it comes to figuring out the impact of an unplanned outage or a bug released to production.&lt;/p&gt;

&lt;h1&gt;
  
  
  Blameless: What is an SLO? How does it relate to SLI and SLA?
&lt;/h1&gt;

&lt;p&gt;SLOs, SLIs, and SLAs are used exclusively for metrics that capture your users' experience, such as availability, request latency, throughput, and error rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLO&lt;/strong&gt;, Service Level Objective, is an internal target for a metric that you are measuring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLI&lt;/strong&gt;, Service Level Indicator, is the name for the metric. For example, if an SLI you are measuring is availability, then a corresponding SLO you might set would be 99.95%. Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLA&lt;/strong&gt;, Service Level Agreement, is an external target that you are legally obligated to meet, such as 99% availability. When you don't meet your SLA, you compensate customers with dollars. That's why you want to set your SLOs to be more stringent than your SLAs.&lt;/p&gt;
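&lt;p&gt;As a minimal illustration of how the three relate (a hypothetical sketch, not Evernote's actual tooling), the SLI is the measured value, the SLO is the internal target it is checked against, and the SLA is the looser external floor:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical availability figures for one month.
sli_availability = 0.9991  # SLI: what was actually measured
slo_target = 0.9995        # SLO: internal objective
sla_floor = 0.99           # SLA: external, contractual commitment

# The SLO should be stricter than the SLA, so an SLO miss is an
# early warning long before customers are owed compensation.
assert slo_target &gt; sla_floor

print("SLO met:", sli_availability &gt;= slo_target)  # False: investigate
print("SLA met:", sli_availability &gt;= sla_floor)   # True: no penalty owed
&lt;/code&gt;&lt;/pre&gt;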

&lt;p&gt;Both your DevOps and product development teams are responsible for meeting the SLO. This puts pressure in both directions. Are we meeting the SLO? If so, product engineering can run faster. Are we not? Then we need to apply back pressure from the operations side: how do we fix the reason we are not meeting our SLO?&lt;/p&gt;

&lt;h1&gt;
  
  
  What is an error budget?
&lt;/h1&gt;

&lt;p&gt;An error budget is 1 minus the SLO. Continuing the example, a 0.05% monthly error budget for availability (based on a 99.95% uptime target) means your service can be unavailable for around 22 minutes in that month:&lt;/p&gt;

&lt;p&gt;Error budget = 43,800 min/month x 0.05% = ~22 min/month&lt;/p&gt;

&lt;p&gt;SRE practices encourage you to strategically burn the budget to zero every month, whether it's for feature launches or architectural changes. This way you know you are running as fast as you can without compromising availability.&lt;/p&gt;
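&lt;p&gt;The same arithmetic is easy to reproduce in a few lines of Python (a sketch of the calculation above, not production code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Error budget for a 99.95% monthly availability SLO.
MINUTES_PER_MONTH = 43_800  # average month: 365 days / 12

slo = 0.9995
budget_fraction = 1 - slo                          # 0.05% = 0.0005
budget_minutes = MINUTES_PER_MONTH * budget_fraction

print(f"Monthly error budget: {budget_minutes:.1f} minutes")  # ~21.9
&lt;/code&gt;&lt;/pre&gt;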

&lt;h1&gt;
  
  
  How have SLOs changed the way your company operates?
&lt;/h1&gt;

&lt;p&gt;Every month, I present how well we have been meeting our SLOs. One bar graph I use shows our performance for availability on top of our SLO for each month over the most recent 6 months. The SLO graph helps us drive product roadmap decisions. When a piece of architecture is prone to failure and causing us to not meet our SLO, we can make a conscious decision to invest time (or not) to make it better. Without SLO tracking, we wouldn't have the data to justify the time investment in the architecture.&lt;/p&gt;

&lt;h1&gt;
  
  
  How has measuring against SLOs affected your leadership team?
&lt;/h1&gt;

&lt;p&gt;It's been very useful for the leadership to gain visibility into our SLO performance. Recently, our new SVP of Engineering came to the service review meeting where I was presenting our performance against our SLOs. We had a not-so-great month. He jumped right in and asked, “So what are we doing about it?” He was asking all the relevant questions without needing any context. He just got it!&lt;/p&gt;

&lt;p&gt;The SLO graph sparked a great conversation that led to him saying, “Let's double down on some of these architecture investment efforts that we've been talking about!”&lt;/p&gt;

&lt;h1&gt;
  
  
  What is the most important metric that teams should have an SLO for? How do you measure that metric?
&lt;/h1&gt;

&lt;p&gt;Ultimately, it comes back to availability. People expect that when they do things on a phone, it's immediately available across the world for someone else to see. If that's the expectation of your users, then you have to serve that. In order to serve that, you have to be available. That's why it's important to set and meet the SLO for availability.&lt;/p&gt;

&lt;p&gt;At Evernote, we currently measure availability at the shard level as well as for the overall service bucket. We use external probes that hit a health check endpoint on our service and return a success or failure. This probe runs every minute on every host. We measure the number of successes over all the health check data points across a fixed time period to determine availability.&lt;/p&gt;
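&lt;p&gt;A simplified version of that aggregation might look like the sketch below (hypothetical code; the probe results are stand-ins for real health-check data):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Availability = successful probes / total probes over a fixed window.
# Probes run once per minute on every host; True means the health
# check endpoint returned success.

def availability(probe_results):
    """probe_results: list of booleans, one per probe, across all hosts."""
    if not probe_results:
        return 1.0  # no data: treat the window as fully available
    return sum(probe_results) / len(probe_results)

# Example: one host, one day of minute-level probes, 3 failures.
results = [True] * 1437 + [False] * 3
print(f"Availability: {availability(results):.4%}")  # 99.7917%
&lt;/code&gt;&lt;/pre&gt;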

&lt;h1&gt;
  
  
  When service becomes unavailable, a.k.a. when an incident burns the error budget, are fingers ever pointed?
&lt;/h1&gt;

&lt;p&gt;Absolutely not. I have been passionate about blameless postmortems since before I even knew about SRE. It's not about “who tripped over the power cord?”, but “how can we prevent people from tripping over the power cord next time?”&lt;/p&gt;

&lt;h1&gt;
  
  
  What insights can we derive from tracking the error budget?
&lt;/h1&gt;

&lt;p&gt;I have also created a step function graph for error budget burn. The line on the graph starts from the top left and dips down like a staircase for every instance of error budget burn. By tracking and aggregating the key causes of these incidents, and analyzing which ones burn more budget than others, we can take proactive measures to prevent or manage the riskiest causes in the future.&lt;/p&gt;
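&lt;p&gt;That staircase is simple to reconstruct from incident data. The sketch below (with made-up incidents) computes the remaining budget after each burn, which is exactly what the descending line plots:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Remaining error budget after each incident, i.e. the Y values of
# the descending "staircase". Incidents are (cause, minutes burned).
budget_minutes = 21.9  # from the 99.95% monthly SLO above

incidents = [
    ("shard failover", 6.0),
    ("bad deploy", 9.5),
    ("network blip", 2.0),
]

remaining = budget_minutes
for cause, burned in incidents:
    remaining -= burned
    print(f"{cause}: burned {burned} min, {remaining:.1f} min left")

# Aggregating burn by cause shows which failure modes cost the most
# budget and deserve proactive investment.
&lt;/code&gt;&lt;/pre&gt;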

&lt;h1&gt;
  
  
  How long did it take you to set up the monitoring system for SLOs?
&lt;/h1&gt;

&lt;p&gt;The upfront investment of setting up the instrumentation, visualization, and data collection took 2-3 engineers about a month to complete. Every month, I spend a day pulling together a presentation with SQL queries and graphs.&lt;/p&gt;

&lt;h1&gt;
  
  
  How did you get leadership buy-in on SLOs?
&lt;/h1&gt;

&lt;p&gt;I'm grateful that my VP at the time was equally keen to implement SLOs and error budgets. He helped amplify the ideas within the company. Our SLOs are codified in an internal Evernote doc that everyone in the company has access to. The first time I presented my SLO graph and error budget burn was within the SRE team, plus a few peers in engineering. Over the course of time, more people joined, including a rotating crew of people like our client leads, other engineering managers, and even our CFO.&lt;/p&gt;

&lt;h1&gt;
  
  
  As you lead your SRE team on this journey to reliability, how do you advocate the importance of reliability to Evernote?
&lt;/h1&gt;

&lt;p&gt;As a SaaS company, our users trust us not to lose their data, and to store it forever. They trust that they can use Evernote whenever they want, wherever they are. We are committed to upholding that trust.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by &lt;a href="https://www.linkedin.com/in/christina-xiao-tan/"&gt;Charlie Taylor&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>slo</category>
      <category>devops</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
