In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given rise to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the number of ‘unknown unknowns’ grows. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met. According to Gartner, the average cost of downtime is $300,000 per hour. For some companies, this number is considerably higher; for example, Amazon lost approximately $90 million during its Prime Day outage in 2018, and the outage only lasted 75 minutes.
Organizations need to prioritize reliability so they can innovate as quickly as possible on top of a strong foundation that won’t compromise customer experience. This will become even more critical as more businesses move toward distributed systems with high reliability requirements. That’s where site reliability engineering (SRE) comes in. The SRE function is growing quickly (30-70% YoY growth in job listings), but there is not enough skilled talent in the market to meet demand. In other words, it is important to understand not just how to hire SREs, but how to grow your existing organization so it adopts the practices and mindsets required for production excellence. With the shortage of SREs for hire, what can you do to ensure your service’s reliability? To answer this question, you’ll need a deeper understanding of what SRE actually is.
SRE is a practice first coined by Google in 2003 that seeks to create systems and services that are reliable enough to satisfy customer expectations. Since then, many large organizations, such as LinkedIn and Netflix, have adopted SRE best practices. In recent years, organizations around the world have adopted SRE more widely in pursuit of reliability and resilience, as customer expectations and system complexity continue to grow.
SRE is based on a customer-first mentality. This means that SRE efforts are all tied to customer satisfaction, even if the customers using the service are actually internal users. Each decision should result in an increase in customer satisfaction. Teams work together to determine which factors and experiences affect customer happiness, measure them, set goals, and balance reliability requirements with the innovation velocity required to stay viable in an increasingly competitive digital landscape.
To achieve this lofty goal, SREs and teams that have adopted SRE best practices refer to several key tenets of SRE. According to Google, these include:
* Ensuring a durable focus on engineering
* Pursuing maximum change velocity without violating a service level objective (SLO)
* Monitoring, including alerts, ticketing, and logging
* Demand forecasting and capacity planning
* Provisioning, efficiency, and performance
According to Forrester, 46% of the tenets can be applied out-of-the-box for most software teams in the enterprise, but the rest require customizations or won’t make sense for the vast majority of organizations. The important question to ask yourself is how these tenets fit in with what you’re already doing, and how your teams can improve.
Think of SRE as the practice that brings life to the DevOps philosophy. The core principles of DevOps and SRE are nearly identical, an idea Google sums up as “class SRE implements DevOps.” According to Google’s Coursera course on SRE, the five DevOps principles, and the ways SRE puts each into practice, are as follows:
- Reduce organizational silos: SRE helps by sharing ownership across developers and production teams, and unifying tooling.
- Accept failure as normal: Blameless postmortems are an SRE best practice that ensures that all incidents are used as learning opportunities. SRE also creates a safe space and guardrails for failure through SLOs and error budgets.
- Implement gradual change: This is done by canarying rollouts to a small subset of customers before allowing all users to interact with new features. Smaller changes are easier and safer to dissect and iterate on.
- Leverage tooling and automation: SREs work to eliminate toil by measuring it and creating automation to do repetitive tasks without needing human intervention. This way, humans can focus on higher-value work.
- Measure everything: SRE specifically focuses on measuring toil and reliability to make sure that both customers and software teams are happy with the service.
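The SLO and error budget guardrail mentioned in the list can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLO; the class name, field names, and the 10% reserve policy are illustrative assumptions, not something prescribed by Google or by this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO (illustrative names)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the SLO window
    failed_requests: int  # requests that failed during the SLO window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_ship_risky_change(self, reserve: float = 0.1) -> bool:
        # Hypothetical policy: keep some budget in reserve before risky rollouts.
        return self.remaining_fraction > reserve


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
    print(f"ok to ship risky change: {budget.can_ship_risky_change()}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. A team might slow feature rollouts as the remaining fraction approaches zero and speed up again once it recovers, which is how the budget balances reliability against velocity.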
With these common principles defined, it’s easy to see how SRE and DevOps fit really well together, with SRE codifying practices that make it easier to achieve the promises of DevOps. In fact, you could say that SRE is the human side of DevOps, the culture-building function that approaches systems with the people who run them in mind.
Resiliency engineering as a practice looks at systems holistically, considering not only infrastructure but also human, process, and cultural factors. Without adopting the culture and mindset behind SRE, you’ll simply have new processes with no uniting value at the center to keep the initiative in place. Focusing on the human approach to systems requires reevaluating your organization’s attitude towards three crucial things.
The notion of on-call is important in SRE for several reasons. It establishes clear ownership so that software problems are addressed immediately, and it inherently incentivizes developers to ship more performant code. But while going on-call is now a fairly common practice, establishing a healthy, balanced process is crucial to prevent burnout. Nobody can be on-call 24/7, especially when incidents during the on-call period actively disrupt engineers' personal lives. People need uninterrupted time away from work to be at their best, so on-call load needs to be carefully monitored. If someone is waking up at 2 AM every night for a full month, something is wrong; it’s simply unsustainable. The burden should also never fall on one person. The whole development team should be empowered to be on-call so that the responsibility is shared, which reinforces the incentive to ship code that doesn’t wake anyone up at 2 AM.
SRE best practices encourage a stronger proactive approach backed by a robust reactive one. Being proactive means fostering a community of constant learning and improvement. When your engineers are better prepared and learn from previous incidents, the same mistakes are less likely to be made again, so the number of incidents falls as your SRE practice matures. From a reactive perspective, better incident management practices streamline communication during an incident and provide a foundation for treating incidents as ‘unplanned investments’ that become important learning opportunities. Postmortems from past incidents also give engineers a place to begin looking when the root cause of a new incident is evading them. SRE gives those who hold the pager more agency.
Constant firefighting, especially with a tough on-call schedule, can leave engineers feeling burnt out. Over time, burnout leads to high turnover, meaning senior engineers will need to pick up additional slack while new hires are ramped up. This only increases burnout, leading to a vicious cycle of dissatisfied engineers who have little capacity to think about improvements, and new hires who are clueless about where to begin.
In this situation, the SRE approach would encourage improved visibility into engineering hours, on-call periods, and repeat incidents. Each of these issues directly contributes to burnout, yet many organizations aren’t tracking them. By knowing which engineers have logged abnormally high hours over an extended period of time, team leads can suggest vacation time to curb burnout. Knowing who has been on-call every weekend for the last month allows teams to better manage the rotation so everyone gets a break. Monitoring repeat incidents and incidents of a similar class can give insight into what’s burning through engineering hours, as well as whether previous postmortems uncovered improvements or follow-up items that were never acted on. These are issues that should be fixed promptly in order to give teams a break from firefighting and more time for strategic work.
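As a rough illustration of that kind of visibility, the sketch below finds each engineer's longest run of consecutive on-call weekends and counts incidents that recur for the same service and category. The `Shift` and `Incident` records, their fields, and the repeat threshold are hypothetical; in practice this data would come from whatever scheduling and incident tooling a team already uses.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    engineer: str
    day: date  # one calendar day spent on call


@dataclass(frozen=True)
class Incident:
    service: str
    category: str  # hypothetical label, e.g. "disk-full"


def weekend_streaks(shifts: list[Shift]) -> dict[str, int]:
    """Longest run of consecutive weekends each engineer spent on call."""
    # Normalize every weekend shift to the Saturday that starts that weekend.
    weekends: dict[str, set[date]] = {}
    for shift in shifts:
        weekday = shift.day.weekday()  # Monday=0 ... Saturday=5, Sunday=6
        if weekday >= 5:
            saturday = shift.day - timedelta(days=weekday - 5)
            weekends.setdefault(shift.engineer, set()).add(saturday)

    streaks: dict[str, int] = {}
    for engineer, saturdays in weekends.items():
        best = 0
        for saturday in saturdays:
            # Only start counting from the first weekend of a run.
            if saturday - timedelta(weeks=1) in saturdays:
                continue
            run = 1
            while saturday + timedelta(weeks=run) in saturdays:
                run += 1
            best = max(best, run)
        streaks[engineer] = best
    return streaks


def repeat_incidents(
    incidents: list[Incident], threshold: int = 2
) -> dict[tuple[str, str], int]:
    """Incident classes (service, category) seen at least `threshold` times."""
    counts = Counter((i.service, i.category) for i in incidents)
    return {key: n for key, n in counts.items() if n >= threshold}
```

A team lead could review the output weekly: a streak of four or more weekends suggests the rotation needs rebalancing, and a recurring incident class points to postmortem follow-up items worth revisiting.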
Failure will happen, incidents will occur, and SLOs will be breached. These things may be difficult to face, but part of adopting SRE is to acknowledge that they are the norm. Systems are made by humans, and humans are imperfect. What’s important is learning from these failures and celebrating the opportunity to grow.
One way to foster this culture is to prioritize psychological safety in the workplace. The power of safety may seem obvious, but it is often overlooked. Industry thought leaders like Gene Kim have been promoting the importance of feeling safe to fail. He addresses the issue of psychological insecurity in his novel, “The Unicorn Project.” Main character Maxine has been shunted from a highly functional team to Project Phoenix, where mistakes are punishable by firing. Kim writes, “She’s [Maxine] seen the corrosive effects that a culture of fear creates, where mistakes are routinely punished and scapegoats fired. Punishing failure and ‘shooting the messenger’ only cause people to hide their mistakes, and eventually, all desire to innovate is completely extinguished.”
Getting the most out of your teams and systems cannot be achieved if blame exists. Blamelessness is at the core of SRE. To fully adopt this practice, you need to acknowledge that people are not a source of failure. Each team member is simply doing their best with the knowledge at hand, making the decisions they believe are right and in the best interests of the organization. Punishment or blame takes away the desire to try, fix, and continuously learn.
Fear is an innovation killer, but failure is an innovation inspiration. Creating safety and trust within your organization is key to fully realizing and unleashing your team’s potential.
Any organization can adopt SRE best practices, and it can begin in small increments. The most important change you will make will be the cultural one. As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.