Hannah Culver for Blameless

Posted on Sep 1, 2020

How to Build Your SRE Team

#sre #devops

Originally published on Failure is Inevitable.

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs, to management upholding the virtue of blamelessness, to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

However, while it’s not necessary to have people with the title of ‘SRE’ in order to successfully adopt the best practices of SRE, having people who are dedicated to stewardship of SRE practices is important to achieve reliability excellence. In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

Common pathways to becoming an SRE

When looking for people to fill your SRE team, looking only for self-described SREs may be too limited in scope. People from a wide variety of backgrounds can learn the tenets of SRE while drawing on their own unique expertise. Here are some examples of career paths that could make people a great fit for SRE:

Software developers understand the value of reliability metrics and are accustomed to solving optimization problems based around them
System administrators take a holistic perspective to entire system architecture and proactively address reliability issues, such as potential downtime
System design engineers create efficient procedures through complex systems, helpful for coming up with runbooks and other incident responses
Quality assurance engineers have a test-oriented mindset, ensuring that systems stay reliable in the most adverse conditions
Database administrators are accustomed to optimizing the storage and reliability of huge data systems, making them well suited for the challenges of reliably scaling

Although SREs commonly emerge from other tech disciplines, the mindset of SRE can be appreciated and embraced by anyone. SREs can emerge from backgrounds as diverse as communications, business studies, and the arts. By looking at their own challenges of reliability through the lens of SRE, they can contribute unique insights.

SREs as engineers of reliability

SRE is a holistic discipline that involves many skills outside of writing code. Nevertheless, a major role your SREs will play is that of a software engineer, building systems and software to improve reliability. Ensure that your prospective SREs understand the languages and architecture of your systems. Even if they aren’t writing much code, they need to understand how development decisions impact systems.

SREs can work with development teams to “develop for reliability.” This involves considering how development will impact key reliability metrics, measured by SLOs and error budgets. These metrics reflect the most fundamental levels of coding and architecture configuration. To work with them, SREs will have to understand how potential development directions will impact the entire stack, from top to bottom.

Because of this “big picture” approach to development, looking for SREs with a strong systems engineering background can be helpful. In “Hiring Site Reliability Engineers” for ;login:, the USENIX magazine, Google employees Chris Jones, Todd Underwood, and Shylaja Nukala detail their technical hiring process for SREs. They break down how SREs with the ability to form connections throughout complex systems can make up for missing expertise in specific software systems. Through a combination of holistic systems analysis and detailed examination of ramifications in the lowest level of code, SREs can fully understand the relationship between development and reliability.

SREs as stewards of reliability

At the heart of decision-making is data. Without complete and accurate data about how your system operates, it’s impossible to know where to prioritize development efforts. Another key role of SREs is collecting, refining, and analyzing this data. There are many monitoring tools available that can help extract and visualize data from your system. SREs can transform this data into something actionable.

A key example of this transformation is creating SLIs. SLIs combine low-level monitoring data into a single metric that reflects business impact, which is then used to set SLOs. SLIs and SLOs should be determined and reviewed by large teams of people, but SREs can be a bridge of knowledge for those teams. SREs can connect different domains of technical and business expertise to find the most impactful indicators and guardrails of reliability.
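As a rough illustration of that transformation, here is a minimal sketch in Python of turning raw request counts into an availability SLI and checking it against an SLO. The names (RequestWindow, availability_sli, error_budget_remaining), the sample counts, and the 99.9% target are all assumptions made for this example, not figures from any particular monitoring tool or team.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Hypothetical monitoring counts for one measurement window."""
    total_requests: int
    good_requests: int  # e.g. served without error and under a latency threshold


def availability_sli(window: RequestWindow) -> float:
    """SLI: the fraction of requests in the window that were good."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return window.good_requests / window.total_requests


def error_budget_remaining(window: RequestWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no room for bad requests.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 600 bad requests out of one million.
window = RequestWindow(total_requests=1_000_000, good_requests=999_400)
sli = availability_sli(window)
budget = error_budget_remaining(window, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

The point of the sketch is the shape of the calculation rather than the specific numbers: many low-level events collapse into one ratio that non-engineers can reason about, and the gap between that ratio and the SLO target becomes a budget the team can spend or protect.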

Monitoring data and the indicators built from it, like SLOs, should be readily accessible and comprehensible to your entire team, but SREs will have a special relationship with them. Acting as the stewards of reliability, they advocate for maintaining SLOs and other best practices to shift quality left so teams can scale sustainably. When hiring SREs for this role, don’t just look for expertise in your particular monitoring or other tools, as your tech stack tomorrow will look different from the one today. Instead, look for people who understand the importance of putting data in the right context, and how to persuade others to adopt best practices.

SREs as leaders who align reliability with business needs

As you develop your SRE solution, you’ll find yourself building up a framework of policies and practices: review and revision cycles, ownership maps, incident classifications and response procedures, and more. These should be understood, agreed upon, and adopted by the entire team, but SREs can serve as leaders for fine-tuning this framework and keeping it operating. SREs can help develop and implement procedures in many areas of SRE, including:

SLO and error budget review
Incident classification review
Runbook creation and review
Incident retrospective practices
On-call scheduling policies
Security audits
Chaos engineering test procedures

For each of these categories, all stakeholders should be consulted. SREs serve as a holistic bridge between the stakeholders’ domain expertise and the greater impact on reliability metrics for the entire business.

Once policies are in place, SREs can take the lead on ensuring they’re followed. As “reliability educators,” SREs can conduct internal audits to make sure that incident retrospectives contain the necessary data, that follow-up tasks are being completed, that runbooks receive their scheduled updates, and so on. Of course, these audits would be conducted blamelessly: in socio-technical systems, if certain procedures aren’t being upheld, the fault likely lies not with the individuals who didn’t follow them but with how the procedures themselves were set up.
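To make the idea of an internal audit a little more concrete, here is a minimal sketch in Python of two such checks. Everything in it is hypothetical: the Retrospective record, the list of required sections, the 90-day runbook review interval, and the sample data are assumptions chosen for illustration, and a real team would pull these from its own incident and documentation tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta

# Assumed policy: sections every retrospective should contain.
REQUIRED_SECTIONS = ("timeline", "impact", "contributing_factors")


@dataclass
class Retrospective:
    """Hypothetical retrospective record: section name -> written text."""
    incident_id: str
    sections: dict[str, str]
    open_follow_ups: int = 0


def audit_retrospective(retro: Retrospective) -> list[str]:
    """Return findings about the document and process, not about people."""
    findings = []
    for name in REQUIRED_SECTIONS:
        if not retro.sections.get(name, "").strip():
            findings.append(f"{retro.incident_id}: missing '{name}'")
    if retro.open_follow_ups:
        findings.append(
            f"{retro.incident_id}: {retro.open_follow_ups} follow-up task(s) still open"
        )
    return findings


def stale_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Names of runbooks whose last review is older than the assumed interval."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Illustrative data only.
retro = Retrospective(
    incident_id="INC-101",
    sections={"timeline": "Alert fired at 09:02; mitigated at 09:40.", "impact": ""},
    open_follow_ups=2,
)
findings = audit_retrospective(retro)
stale = stale_runbooks(
    {"db-failover": date(2020, 1, 15), "cache-flush": date(2020, 8, 1)},
    today=date(2020, 9, 1),
)
for line in findings + [f"runbook due for review: {name}" for name in stale]:
    print(line)
```

Note that the findings describe gaps in documents and schedules rather than naming whoever was responsible, which keeps the audit output consistent with a blameless approach.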

SREs in this role don’t need to be experts on every category listed above, although some familiarity is necessary. Each team’s adoption of best practices will be unique, and teams should embrace context over control so team members are empowered to make the best decisions they can in dynamic situations. Most importantly, SREs need a collaborative attitude and a willingness to consider the concerns of others. Ask prospective SREs how they’d handle a disagreement over how a policy should be developed, or an incident where people were found to be negligent in following policy. Understanding their attitude in such situations can be just as useful as understanding their technical expertise.

SREs as ambassadors of reliability culture

SREs should embody the cultural lessons of SRE in every role they play. Decisions they make around policy or development should always reflect these values—not just implicitly, but as a stated element of their decision-making process. That means fostering an environment of empathy, ownership, and trust.

It can be difficult to determine how a prospective hire will align to these cultural values. These beliefs are unlikely to show up on someone’s resume or transcript. To start diving deeper into their attitudes, here are some questions to consider when evaluating a prospective SRE:

How do they approach reliability goal setting?
What value do they place on failure?
How do they work to attribute error without blame?
What about incident retrospectives do they find most valuable?
How do they approach situations where others may come to them with concerns about teams, reliability, or other subjects?

Try asking about hypothetical situations where these beliefs would be tested. Ask them to explain the values behind their decisions, then probe even further, asking why these values are beneficial to the team. Experienced SREs should be able to connect policies with cultural and business outcomes, and advocate for healthy reliability practices. For example, they’ll be able to connect the dots between investments in things like SLOs, documentation, and toil automation, and how they ultimately lead to shared context and improved morale.

Common team structures

As you start an SRE team, where it sits in the context of the rest of your engineering organization will depend on your organization's operational maturity, culture, and needs. Here are a few of the common structures:

SRE model with dedicated engineers focused on infrastructure and/or tooling (shared services, observability, etc.)
This configuration has the SRE team reinforcing efforts across the organization. They maintain services used by many different development teams without focusing on any specific project. With this configuration, the productivity and reliability of many projects can be improved at once. However, there may not be the resources to address specific reliability needs.

Embedded SRE model where full time SREs are assigned to a product/service
In this configuration, each product or service team is assigned some number of SREs to address their specific reliability requirements. This allows greater flexibility in allocating SRE resources—you can focus on areas with the biggest business impact. Embedded SREs should still take care to communicate to maintain consistency in their practices.

Distributed SRE model of SREs as consultants or stewards of reliability standards
This configuration has your SREs serving as consultants for reliability issues across the organization. SREs can still be centers of knowledge for particular products or services, but aren’t embedded into development teams. Instead, they work to keep services to agreed-upon reliability standards, and consult with engineers to achieve them.

No matter how large your team grows, you’ll find that good tools will empower your SREs to respond to incidents effectively and develop for reliability proactively. To see how Blameless can help level up your SRE solution, check out our demo!
