Writing and operating software is hard.
We've seen huge changes in how we build reliable software over the past two decades, driven by the DevOps movement and, in the last few years, SRE practices. One of the great lessons to come out of the community’s years of experience is that operating software at scale is very hard, and that it’s not just about code and processes. We have learned that people are also a big part of the puzzle. High-performing organizations and teams make sure to prioritize people. We need a lot more of that; it’s common for DevOps and SRE teams to burn out and suffer from pager fatigue or even on-call PTSD.
Today, I’m only focusing on SREs. What if we train SREs to lead with empathy? The job is not just about keeping servers healthy and green and our users and CTOs happy. But what is empathy? While there are formal definitions of empathy, I think of it as the ability to relate to other human beings by being curious, listening, and offering help while building trust.
First, what is SRE?
SRE stands for Site Reliability Engineering. This practice has been around for a while, but under different names through the years, like Operations Engineering, Systems Engineering, or Production Engineering. SRE became more established when Google released its book on SRE, which I refer to as the SRE bible. You can read it for free online.
An SRE is an engineer who focuses on system reliability. Some folks like to think of SREs as engineers who spend 50% of their time writing code and the other 50% operating and optimizing software for the purpose of reliability. When we talk about SRE, we often get asked how SRE and DevOps differ from one another. The best explanation I’ve seen is Google Cloud’s “What's the Difference Between DevOps and SRE?” with Liz Fong-Jones and Seth Vargo.
SRE is an implementation of DevOps.
It’s DevOps with the purpose of reliability.
What does an SRE do?
SRE duties vary from company to company. Some of the duties an SRE might take on include (but are not limited to):
- Writing code
- Being on-call
- Setting up tooling (CI/CD, Observability, Incident Response tooling, Testing pipelines, etc.)
- Setting up alerting
- Setting up SLOs
- Capacity Planning
- Running Chaos Engineering Experiments
- Advocating for modern and best practices
- Running blameless postmortems
- Reviewing the company’s testing practices
- Auditing services
- Interviewing other teams
- Reading other companies’ postmortems
- Improving processes
- Reviewing Architecture decisions
- Drafting new architecture plans or PRDs (Product Requirements Documents) on vendor/tooling changes
When we talk to seasoned SREs, we learn that their empathy has allowed them to be more successful. This is because, as an SRE, you have empathy for the service teams you support and understand that they are doing their work to the best of their ability. But it’s not just them; this also extends to the other SREs in your organization. As I said above, SRE varies from company to company, and every SRE comes from a different background. Empathy also carries over to other duties, like educating the rest of the organization on reliability best practices.
How can SREs embody empathy?
Empathy goes a long way; being empathetic builds engineers' knowledge and improves engineering morale and company culture. For SREs, it begins with listening to the needs of the teams they support and the issues those teams are working on. Within their own teams, this might mean supporting one another when trading on-call shifts, triaging bugs, dealing with incidents, or leading incident review meetings.
Building empathy takes practice, and every SRE might embody it differently. For some folks, it might look like this:
- curiosity
- patience
- open communication
- ability to admit being wrong
SREs should stay curious and ask as many questions as they can when reviewing documentation, incidents, or data. This lets them gather the most context around decisions and understand the “Why.” Curiosity lets an SRE embark on their continuous learning journey.
We know that the work SREs do can be very time-demanding and stressful; the ability to stay calm under pressure is one of the best qualities an SRE can have. As people are the ones who operate our systems, one needs to understand that the full context of a system might not be visible through a dashboard, and one must be patient enough to research what exactly happened and debug a bit further. We also see that when an SRE practice is still in its early days, one needs a lot of patience to work with leadership and other teams so that the organization prioritizes reliability action items every sprint.
Additionally, as in any work or personal relationship, the more one can communicate, the better; backing that communication with data is helpful, especially when dealing with incidents. Over-communication is appreciated and goes a long way. Take, for example, the early signs of an incident. When Jewel sees that their system’s performance is starting to degrade, they should say something. Jewel can still start researching on their own, but speaking up gives the rest of the team a heads-up that something could be happening, and it also opens the door for another team to share any insights they might have. It’s common for two people to be working on the same system at the same time. In another scenario, take an ongoing incident. There is a Slack channel set up and a Zoom call ongoing; engineers are researching, debugging, and trying to bring a system to a better state. It’s good practice to bring in as much relevant data from other sources as possible to provide more context into why the system might have left its healthy state. With communication practices like this, one might be able to resolve an incident in less time or just be able to provide more context when a postmortem is written up.
As we think about building distributed systems across distributed teams, we see that over-communication helps build empathy. This helps build trust and allows room for failure. After all, we are humans working on complex systems. Additionally, the ability to be vulnerable, open, honest, and transparent with our peers is important. Our systems are changing rapidly, and our mental models will become outdated. Cultivating a culture that embraces failure and experimentation is essential. This will also be very helpful for an honest postmortem. One gets to show up for their team.
As an SRE, leading with empathy means that you become an individual who has established a lot of trust in the organization. This has several benefits:
- Teams might ask you for feedback on the issues they are facing earlier in the development or issue process (approachability)
- Discourse at incident reviews is less about blaming others and more about learning and sharing and moving forward together (trust)
What world are we creating with more empathy?
It is common to find empathy in SRE and DevOps communities, to the extent that there is a huge movement behind #HugOps. For all of our favorite companies, applications, and tools, there’s always a team of folks working hard to keep things running or firefighting when things go wrong. When those sites and applications go down, we acknowledge that the outage and downtime suck, but we send our condolences and support to those working to bring things back. We know the work is hard, and we stand together in support as a community. Hugs are sent just like packets across the internet to our friends.
_We must continue cultivating this culture and ensuring that all engineers follow it._
Disclaimer: Emotional SRE Toil
In SRE, we talk about toil. Toil is the manual work needed to keep systems running in production. More formally, toil is defined as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” When we think about our SREs embodying more empathy, we need to look at the other side of the coin. We are asking our teammates to take on a larger kind of work that cannot be quantified: emotional work.
Understanding other people and being in someone else’s shoes eventually catches up to you.
We already know SREs sometimes struggle with mental health, which is worth mentioning in these discussions. How do we get ahead of the possible emotional toil that comes from being more empathetic? We first need to accept that this is already happening in our organizations, but we are not talking about it.
Looking at the workload of SREs during the pandemic, the number of incidents they fought, and the number of services they helped with, we can see that they’ve been a massive help to our organizations and in keeping our users happy. How do we protect the humans that support our systems and keep our services up?
This is a call to action for our leaders to check in with their teams more. This entails listening to quarterly retrospectives, attending postmortem reviews, and revising the information we collect about our services, requests, and incidents. It also includes other manager-y duties like listening to ideas your team brings to the table, proactively tracking and ensuring your team takes time off (I’m looking at you, managers at unlimited PTO companies), ensuring that on-call schedules are balanced across the organization, ensuring your SMEs (Subject Matter Experts) are not close to burning out, and ensuring that other people are also getting trained to be on-call.
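As a rough illustration of what tracking on-call balance could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the shift log, the roster, the names other than Jewel, and the 1.5x threshold are hypothetical, and a real team would pull this data from its own paging or scheduling tool.

```python
from collections import Counter

# Hypothetical on-call log: one entry per shift, naming who carried the pager.
shifts = ["jewel", "kai", "jewel", "sam", "jewel", "kai", "jewel", "jewel"]

# Hypothetical roster of everyone who could share the on-call load.
roster = ["jewel", "kai", "sam", "lee"]


def overloaded_engineers(shift_log, roster, threshold=1.5):
    """Return people whose shift count exceeds threshold x the roster average."""
    counts = Counter(shift_log)
    average = len(shift_log) / len(roster)
    return sorted(name for name in roster if counts[name] > threshold * average)


def never_on_call(shift_log, roster):
    """Return people on the roster who have not taken a single shift yet."""
    counts = Counter(shift_log)
    return sorted(name for name in roster if counts[name] == 0)


print("Carrying too much:", overloaded_engineers(shifts, roster))
print("Not yet on-call:", never_on_call(shifts, roster))
```

A count like this is only a conversation starter. It says nothing about how rough each shift was, so it works best alongside the check-ins described above rather than instead of them.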
To make an impact, our SRE leaders need to lead with empathy and help the rest of the organization engineer with empathy.
Stay tuned as we continue the conversation and talk about empathy with our users next. Do you have any thoughts? Feel free to reach out via Twitter at @ana_m_medina or find me on LinkedIn.
This post was originally posted on the Lightstep Blog