Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.
As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially with growing system complexity and reliance on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, all cases where an incident is involved), incident runbooks can help guide you through the process and make the most of your time.
According to Chris Taylor at Taksati Consulting, good incident runbooks help you cover all your bases. They typically include flowcharts and checklists to depict both the big picture and the minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list of environmental influences that are unique to your system. To create your incident runbook, Chris recommends aggregating the following information:
* An inventory of relevant tools
* The right personnel/subject matter experts to engage in response
* The problem to solve, or the workflow you’re trying to document
* The current state (whether this is a new process or an update to an old one)
By developing incident runbooks and practicing running through them, you’ll be more prepared for the inevitable.
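As a sketch, the information Chris lists could be captured in a structured template that makes it obvious when a runbook is still missing a section. All field names and the example runbook below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    """A minimal incident-runbook skeleton (names are illustrative)."""
    problem: str                                      # the problem or workflow being documented
    current_state: str                                # "new" process or "update" to an old one
    tools: list = field(default_factory=list)         # inventory of relevant tools
    responders: list = field(default_factory=list)    # SMEs to engage in response
    raci: dict = field(default_factory=dict)          # step -> responsible/accountable/consulted/informed

    def is_complete(self) -> bool:
        # A runbook is only actionable once every core section is filled in.
        return all([self.problem, self.current_state, self.tools, self.responders])

# Hypothetical example: a database-failover runbook.
db_failover = Runbook(
    problem="Primary database failover",
    current_state="new",
    tools=["pg_ctl", "monitoring dashboard", "paging system"],
    responders=["on-call DBA", "service owner"],
    raci={"execute failover": {"responsible": "on-call DBA", "accountable": "service owner"}},
)
print(db_failover.is_complete())  # True
```

Practicing against a template like this makes the gaps visible before the 3 a.m. page, not during it.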
Change management is often done haphazardly, if at all. This means that organizations are unable to manage the risk of pushing new code, possibly leading to more incidents. Rather than employ ITIL’s arduous CAB method, SRE seeks to empower teams to push code according to their own schedule while still managing risk. To do this, SRE uses SLOs and error budgets.
SLOs, or service level objectives, are internal goals for service availability and speed, set according to customer needs. These SLOs serve as a benchmark for safety. Each month, your SLO determines a certain allowable amount of downtime: your error budget. You can spend this budget pushing new features. If a feature is at risk of exceeding your error budget, it cannot be pushed until the next window; if it poses little to no risk to your SLO, you can push it. Each month, teams should aspire to use the entirety of, but not exceed, their error budgets. This way, your organization can optimize for innovation while staying safely within acceptable levels of customer impact.
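The arithmetic behind error budgets can be sketched as follows, assuming a 30-day month and an availability SLO expressed as a fraction. A 99.9% SLO implies roughly 43 minutes of allowable downtime per month, and a change is safe to push only while its worst-case impact fits in the remaining budget. Function names are illustrative:

```python
def monthly_error_budget_seconds(slo: float, days: int = 30) -> float:
    """Allowable downtime per month implied by an availability SLO."""
    return days * 24 * 3600 * (1 - slo)

def can_push(feature_risk_seconds: float, budget_spent_seconds: float, slo: float) -> bool:
    """Push only if the feature's worst-case downtime fits the remaining budget."""
    remaining = monthly_error_budget_seconds(slo) - budget_spent_seconds
    return feature_risk_seconds <= remaining

budget = monthly_error_budget_seconds(0.999)  # 99.9% SLO
print(round(budget))  # 2592 seconds, about 43.2 minutes
```

The same calculation scales to stricter SLOs: each extra nine cuts the budget by a factor of ten, which is why release risk management matters more as reliability targets rise.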
Black Friday surges, rapid scaling, moving to the cloud: all of these big events require heightened capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to make sure that your team has everything it needs to be successful. There are two types of demand that require additional capacity: the first is organic demand (your organization’s natural growth), and the second is inorganic demand (growth driven by a marketing campaign or holiday). To prepare for these events, you’ll need to forecast the demand and plan time for acquisition.
Important facets of capacity planning include regular load testing and accurate provisioning. Regular load testing allows you to see how your system is operating under the average strain of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so sometimes it’s important to know where those limits are.” If your service is struggling to load balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the event of increased demand. That’s where provisioning comes in.
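A minimal load-test harness along these lines can be sketched with the standard library. Here `fake_request` just sleeps and is a stand-in for a real HTTP call to your service; concurrency levels and percentiles are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i: int) -> float:
    """Stand-in for a real request; replace with your HTTP client of choice."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated service latency
    return time.perf_counter() - start

def load_test(concurrency: int, total_requests: int):
    """Fire requests at a fixed concurrency and report latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(fake_request, range(total_requests)))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    return p50, p99

p50, p99 = load_test(concurrency=20, total_requests=200)
print(f"p50={p50 * 1000:.1f}ms p99={p99 * 1000:.1f}ms")
```

Ramping `concurrency` upward until p99 latency degrades is one simple way to locate the boundary conditions Thorne describes before production traffic finds them for you.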
Adding capacity in any form can be expensive, so knowing where you need additional resources is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly. Adding capacity can be a lengthy effort, especially in the case of a cloud migration. You’ll also need to know how many hands you’ll need on deck for these momentous occasions.
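The forecasting side of provisioning can be sketched as a simple model: compound the organic growth rate, apply a spike multiplier for inorganic events like Black Friday, and add safety headroom. All numbers and the 30% headroom default below are illustrative assumptions, not recommendations:

```python
def required_capacity(baseline_qps: float,
                      monthly_growth: float,
                      months_ahead: int,
                      event_multiplier: float = 1.0,
                      headroom: float = 0.3) -> float:
    """Peak QPS to provision for: organic growth compounded monthly,
    times an inorganic spike multiplier, plus safety headroom."""
    organic = baseline_qps * (1 + monthly_growth) ** months_ahead
    return organic * event_multiplier * (1 + headroom)

# Illustrative numbers: 1000 QPS today, 5% monthly organic growth,
# planning 6 months out for a 3x holiday spike.
peak = required_capacity(1000, 0.05, 6, event_multiplier=3.0)
print(round(peak))
```

Even a rough model like this turns “we might need more servers” into a concrete provisioning target with a lead time attached.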
Resiliency doesn’t just exist in your processes; it also exists in your people. Capacity planning is an important part of having a resilient system because in thinking about the allocation of resources, your team members matter. They need time off for holidays, personal vacations, and the obligatory annual cold. When you fail to plan for time off, you won’t have enough hands on deck to handle incidents as they occur. Denying people time off is obviously not the answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that can accommodate people being, well, people.
Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate staffing insecurity:
- Establish all necessary processes with the appropriate staff – from top management to team leaders. Decide how often you will need to revise/revisit this process and make sure that everyone is in agreement on this.
- Provide for complete and up-to-date project data and prioritize your projects. What projects are the most important, and which can be put on the back burner for now? Additionally, how long will each project take? You’ll need this data to be able to move forward with accurate plans.
- Identify the capacities across your existing team, as well as your infrastructure and services. Is the team equipped, and the system architected, in a way that minimizes performance regressions, protecting efficiency and capacity?
- Consolidate the requirements (step 2) and the capacities (step 3). Identify underload as well as overload and try to balance them.
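Step 4, consolidating requirements against capacities, can be sketched as a simple per-team comparison. Team names and person-week figures here are illustrative assumptions:

```python
def balance(requirements: dict, capacities: dict) -> dict:
    """Compare project requirements (person-weeks) against team capacities
    and flag overload and underload per team."""
    report = {}
    for team, supply in capacities.items():
        demand = requirements.get(team, 0)
        status = ("overload" if demand > supply
                  else "underload" if demand < supply
                  else "balanced")
        report[team] = {"demand": demand, "supply": supply, "status": status}
    return report

report = balance(
    requirements={"platform": 12, "frontend": 4},  # person-weeks needed
    capacities={"platform": 8, "frontend": 8},     # person-weeks available
)
print(report["platform"]["status"], report["frontend"]["status"])  # overload underload
```

An overloaded team is where incidents and burnout brew; an underloaded one is where work can be rebalanced to, which is exactly the trade Strasser’s final step asks you to make.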
So, now you’ve got the people and the process, but how can you learn and improve on your resilience? For that, you’ll need great retrospective practices in place that facilitate real introspection, psychological safety, and forward-looking accountability.
When something goes wrong, it’s important to learn from it to prevent the same mistake from happening again. To do this, it’s important to craft and analyze retrospectives (or post-incident reviews, RCA reports, or whatever you like to call them). To have retrospectives worthy of analysis, applying SRE best practices will be key. In fact, retrospectives are a great place to begin your SRE adoption journey.
As Steve McGhee, SRE Leader at Google, shares, “Conducting blameless retrospectives will enable you to see gaps in your current monitoring as well as operational processes. Armed with better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More effective incident resolution will then free up time and mental bandwidth for more in-depth learning during retrospectives, leading to even better monitoring.
In other words, building a retrospective practice will eventually enable you to identify and tackle classes of issues, including fixing deeply rooted technical debt. With time, you’ll be able to practice SRE, directly improving the systems continuously.”
One of the most important elements of a retrospective, and of SRE as a whole, is the notion of blamelessness. To learn from retrospectives, there needs to be total transparency. Opening up about mistakes can often be frightening, and requires a psychologically safe space to do so. Positive intent should always be assumed in order to foster the trust that allows for true openness. Blaming team members or defining people as the root cause for failure will only lead to more insecurity, covering up the important truths that retrospectives are meant to uncover.
To craft great retrospectives, there are four other best practices that will ensure your incidents are being used to their full advantage:
*Use visuals in your retrospectives: As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Graphs give an engineer a quickly readable yet in-depth picture of what was happening during the incident, whether they revisit it days, weeks, or even years later.
*Be a historian: Timelines can be invaluable for parsing through a particularly dense incident. Chat logs can be cluttered, and it’s difficult to quickly find what you’re looking for. Thus, it’s important to have a centralized timeline that gives a clean, clear summary of the events. This also provides the context that helps relevant team members analyze what happened.
*Tell a story: An incident is a story. To tell a story well, many components must work together. Without sufficient background knowledge, this story loses depth and context. Without a timeline dictating what happened during an incident, the story loses its plot. Without a plan to rectify outstanding action items, the story loses a resolution.
*Publish promptly: Promptness has two main benefits: first, it allows the authors of the retrospective to report on the incident with a clear mind, and second, it soothes affected customers. Best-in-class companies like Google, Uber, and others have internal SLOs around publishing their retrospectives within 48 hours.
Creating incident runbooks, utilizing change management and capacity planning, and following retrospective best practices will all contribute to your system’s resilience, but that’s not all that SRE seeks to do.
Happy engineers mean happy customers, as engineers won’t build the best products possible without support from the organization. There are two major ways that SRE can help brighten engineering’s day.
- Elimination of toil: One of the main focuses of SRE is automation. Toil is a waste of precious engineering time, and by creating frameworks, processes, and internal tooling to eliminate it, SREs help engineers get back to innovating.
- Elimination of tech debt: SREs create accountability around retrospective follow-up action items to make sure that old issues aren’t buried under new code. SREs also put together frameworks to help developers deliver more performant code, prioritizing what matters most to the customer experience. Pinpointing the tech debt build-up that hurts customer experience is important to guide refactoring initiatives and other practices to reduce tech debt. This establishes a baseline for healthy engineering practices to help minimize future accrual of tech debt.
Additionally, SREs invest in cultural change that prevents more tech debt from accruing in the future, while still making way for innovation. Jean Hsu wrote about her experience refactoring Medium’s codebase, and realized that the most important thing she could do for her team wasn’t just to fix spaghetti code; it was to create a culture that fixes technical debt as it goes along, deleting old code as needed. Jean wrote “I realized that if I always did this type of work myself, I would be constantly refactoring, and the rest of the team would take away the lesson that I'd clean up after them. Though I did enjoy it myself, I really wanted to foster a long-term culture where engineers felt pride and ownership over this type of work.”
SREs are often the cultural drivers for this sort of work, improving the way engineering teams function as a whole rather than simply going from project to project fixing bugs. These changes are long-term initiatives that spark growth and adoption of best practices for the entire organization.
As you can see, SRE can positively impact each engineer’s day-to-day productivity. In fact, SRE is not about tooling or job titles; it is a more human-centric approach to systems as a whole. With this context in mind, adopting a resiliency mindset brings positive business benefits for everyone in the organization.