Page It to the Limit
Thinking About Your Humans With J. Paul Reed
COVID as an Incident
J. Paul talks to us about being on the critical operations team at Netflix, what that has been like during the quarantine, and the pressure they all felt at Netflix to make sure the service is stable for their customers.
J. Paul: “What we were looking at real early on, is are we able to serve streams to our customers, are we able to provide those moments of joy?”
J. Paul continues to discuss the impact and coordination required to overcome technical challenges, and how looking at COVID as an incident helped with planning exercises.
J. Paul: “We started to shift our perspective, less around technology and systems and making sure they’re stable and all that, because we had some good evidence that things were going to be fine right at that point and we started looking more at the people impact.”
Socio-Technical Systems
The conversation shifts to the impact of people being on-call and being required to work from home.
J. Paul: “We started to really look at, from the operations perspective, if that operations team was understaffed and underwater when COVID happened, now you’ve got a whole other set of problems to think about with that.”
He continues to talk about socio-technical thinking, and how the “socio” part is really about the people in the system who are responsible for getting systems up and running and operating them.
The Impact of Operating Systems
J. Paul brings up the deeper levels of impact on people, beyond the surface-level effects of being at home.
J. Paul: “If you’re on an engineering team you are likely going to be on an on-call rotation for your team. So the core team will page you into an incident, where we use PagerDuty for that. And so one of the interesting things is that means we have a really large data set of what people are experiencing or what we’re seeing with paging rotations and that sort of thing. So we’ve been starting to parse through that. We actually have a monthly kind of socio-technical systemic risk meeting, so we’ve started actually talking about the impacts of working from home.”
Time Has No Meaning?
J. Paul moves on to discuss the difference between capacity and availability, and how that distinction applies to people just as it does to systems.
J. Paul: “We may be highly available or as available as people expect us to be, so we might be eight hours, you know, online in our home office or whatever the case may be. But people’s capacity is reduced during this because of the stress of COVID.”
The conversation around availability vs. capacity continues and J. Paul encourages us to give our team members more grace.
Changing the Way Teams are Built
Mandi and J. Paul talk about the biggest changes we see with remote conversations and the need to be onsite, as well as the value of being remote.
He then mentions ways distributed teams can increase the cost of managing incidents, and how they combat this at Netflix by practicing and doing incident management in Slack. J. Paul continues to discuss how folks are changing the way they work due to a lack of in-person meetings.
Thinking with Stories
The conversation moves to a discussion around how humans think through stories and why stress levels are higher during incidents.
J. Paul introduces us to Jabe Bloom’s (@cyetain) research at IBM’s Red Hat Global Transformation Office on how humans process the world through stories that make sense. He explains that incidents are moments when those “stories” have broken down.
J. Paul: “And the reason that it’s stressful is because all of the inferences that we made about the future and the stories that basically reduce the cognitive load for us are not true, which means we have to pay attention in the moment. And the bandwidth to do that on our brain is incredibly high. We have to pay attention to every little detail because we can’t rely on the stories that were told to us about these systems anymore.”
Exit Criteria
J. Paul explains that Netflix couldn’t keep the COVID incident open forever, and how they needed to learn and become increasingly adaptive in the new environment.
J. Paul: “The requirement for that adaptive capacity has actually gone down right, because they figured out, but for other team they’re still having to be adaptive and innovative in the way that they do work, but they know that now, so they know what they need to do to keep that adaptive capacity level.”
Season 2
Just a reminder: if there is a series you want to beg Netflix to bring back, J. Paul invited you to tweet him @jpaulreed with your requests.
Additional Resources
- PagerDuty Home Page
- Episode transcribed by Rev