Hannah Culver for Blameless

How to Improve the Reliability of a System

Originally published on Failure is Inevitable.

Site reliability engineering is a multifaceted movement that combines many practices, mentalities, and cultural values. It looks holistically at how an organization can become more resilient, operating on every level from server hardware to team morale. At each level, SRE is applied to improve the reliability of relevant systems.

With such wide-reaching impact, it can be helpful to take time to reevaluate how to improve the reliability of a system. In this blog post, we’ll work through some helpful steps to take when improving a system’s reliability. We’ll use a development project as an example, but the essence of this advice can be applied anywhere SRE is being implemented.

Analyze customer pains

When analyzing the reliability of a system, just looking at abstract metrics like availability or response speed only tells half the story — the data also needs to be in context. You also need to consider the impact different hotspots of performance issues will have on the users of your service. A consistent delay of a second when logging into your service might matter more than an occasional total unavailability of service with less traffic.

To improve the reliability of your system in a meaningful way, you need to determine which user journeys are most critical. This will allow you to create service level indicators (SLIs) that reflect where reliability issues have the greatest business impact. Techniques such as user journey mapping and black-box monitoring can help you understand the customer’s perspective of your service and focus work on the areas that matter most.
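To make that concrete, here is a minimal black-box probe sketch that measures a critical user journey (logging in) from the outside. The endpoint URL, credentials, and one-second threshold are illustrative assumptions, not part of any particular product.

```python
import time
import requests  # third-party HTTP client (pip install requests); any client works

# Hypothetical black-box probe for a critical user journey: logging in.
# The endpoint, credentials, and 1-second threshold are illustrative only.
LOGIN_URL = "https://example.com/api/login"
LATENCY_SLI_THRESHOLD_S = 1.0

def probe_login() -> dict:
    """Measure one login attempt from the customer's perspective."""
    start = time.monotonic()
    try:
        resp = requests.post(
            LOGIN_URL,
            json={"user": "probe", "password": "probe"},
            timeout=5,
        )
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    latency = time.monotonic() - start
    return {
        "success": ok,
        "latency_s": latency,
        "within_sli": ok and latency <= LATENCY_SLI_THRESHOLD_S,
    }

if __name__ == "__main__":
    print(probe_login())
```

Running a probe like this on a schedule gives you a stream of good/bad events for the journey, which is exactly what an SLI needs.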

Engineer for reliability

Now that you understand where the most impactful reliability issues occur, you can prioritize properly when developing for reliability. Your SLIs can be used to set SLOs and error budgets: standards for how unreliable your system is allowed to be. New development projects can be assessed for their expected impact on reliability and then accounted for within the error budget. When developing your system, treat reliability as a feature, and in fact your most important feature, because customers simply expect that things ‘just work’.
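As a rough illustration of how an SLO translates into an error budget, here is a small sketch; the 99.9% target and request counts are made up for the example.

```python
# Minimal sketch of turning an SLO into an error budget; the 99.9% target
# and request counts below are illustrative, not prescriptive.

def error_budget(slo_target: float, good_events: int, total_events: int) -> dict:
    """Return how much of the error budget has been spent in a window."""
    allowed_bad = (1 - slo_target) * total_events   # failures the SLO permits
    actual_bad = total_events - good_events
    return {
        "allowed_bad_events": allowed_bad,
        "actual_bad_events": actual_bad,
        "budget_remaining": allowed_bad - actual_bad,
        "budget_spent_pct": 100 * actual_bad / allowed_bad if allowed_bad else float("inf"),
    }

# Example: a 99.9% SLO over a 30-day window of 1,000,000 requests,
# of which 999,400 met the SLI (60% of the budget spent).
print(error_budget(0.999, 999_400, 1_000_000))
```

A project expected to burn more budget than remains in the window is a signal to slow feature work and invest in reliability instead.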

Once you have your objectives, reliability engineering provides practices to help you reach them, including:

  • Redundancy systems: Such as contingencies for using backup servers
  • Fault tolerance: Such as error correction algorithms for incoming network data
  • Preventative maintenance: Such as cycling out hardware resources before they fail from overuse
  • Human error prevention: Such as cleaning and validating human input into the system
  • Reliability optimization: Such as writing code optimized for quick and consistent loading

Keeping these reliability practices in mind as you develop will make your code acceptably reliable. At the same time, you can confidently accelerate development by evaluating each change against your SLOs.
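As one hedged example of these practices, the sketch below shows fault tolerance through retries with exponential backoff and jitter; the attempt counts and delays are arbitrary example values, and the flaky function stands in for any unreliable dependency.

```python
import random
import time

# Illustrative fault-tolerance helper: retry a flaky call with exponential
# backoff and jitter. The retry count and delays are example values only.

def call_with_retries(fn, max_attempts: int = 3, base_delay_s: float = 0.2):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

# Example usage: a stand-in dependency that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

print(call_with_retries(flaky))  # succeeds on the third attempt
```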

Grow from incidents

No matter how reliably you design your systems, unforeseen issues will always arise. This is the golden rule of SRE: failure is inevitable. How you respond to and learn from these incidents determines how reliable your systems are after deployment. SRE encourages seeing incidents as an unplanned investment in reliability.

There are three major components of incident response that lead to a more reliable system:

  • Classify your incident. Create an incident classification scheme based on severity and service/incident type (a minimal sketch follows this list). Ensure that your classifications are unambiguous and consistent. Regularly review them to ensure incidents are being correctly classified.
  • Respond to the incident. Each classification should point to a specified response, such as a runbook, that the respondent can follow to begin solving the problem. Your runbook should specify who is alerted and how they respond.
  • Record the details of your incident. Have an incident retrospective document that responders create as they work through the problem, containing the timeline, steps taken, and communication.
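To illustrate the classification and routing steps above, here is a minimal sketch; the severity levels, service names, and runbook URLs are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical classification scheme; your severities, services, and
# runbook mapping will differ.

class Severity(Enum):
    SEV1 = "customer-facing outage"
    SEV2 = "degraded performance"
    SEV3 = "internal-only impact"

@dataclass
class Incident:
    title: str
    severity: Severity
    service: str

# Each classification points at a specified response, e.g. a runbook URL.
RUNBOOKS = {
    (Severity.SEV1, "login"): "https://runbooks.example.com/login-outage",
    (Severity.SEV2, "checkout"): "https://runbooks.example.com/checkout-latency",
}

def route(incident: Incident) -> str:
    """Return the runbook to follow, falling back to a generic response."""
    return RUNBOOKS.get((incident.severity, incident.service),
                        "https://runbooks.example.com/generic")

print(route(Incident("Login outage", Severity.SEV1, "login")))
```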

Thoroughly responding to incidents helps system reliability in multiple ways. At the most basic level, responding to incidents efficiently means the issue is mitigated faster, lessening customer impact. As you log the classification of incidents, you’ll find patterns of consistently impactful incident types, which are areas where reliability engineering efforts could be focused. Finally, reviewing incident retrospectives can reveal where procedures can be made more effective, speeding up future responses.

Test yourself with chaos engineering

Now that you’ve established your incident response procedures, it’s time to put them to the test. Testing in production is the practice of using techniques like A/B testing to safely find issues in deployed code. Because you’re using the actual production environment, the customer impact of reliability issues is more apparent, and you’re guaranteed that the tested environment matches the one customers use. However, if you want to test for reliability catastrophes, you’ll need a technique more removed from customer impact, such as chaos engineering.
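Before getting to that, here is a minimal sketch of the testing-in-production side: a percentage-based rollout gate. The flag name, hashing scheme, and 5% rollout are assumptions chosen purely for illustration.

```python
import hashlib

# Illustrative percentage-based rollout gate for testing in production.
# The flag name and 5% rollout fraction are assumptions for the sketch.

def in_rollout(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically bucket a user so they see a stable experience."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return bucket < percent / 100

def handle_checkout(user_id: str) -> str:
    if in_rollout(user_id, "new-checkout-path", 5.0):
        return "new code path (watch its SLIs closely)"
    return "stable code path"

print(handle_checkout("user-123"))
```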

Chaos engineering is a discipline where engineers intentionally introduce failure into a production system to stress test and prepare for hypothetical scenarios such as server outages, intensive network load, or failure of third party integrations.

Chaos engineering teaches many lessons about the reliability of your system. As incident response teams treat the experiment as if it were a real system failure, you’ll be able to assess the effectiveness of your responses. This allows for iteration and improvement of your response procedures and system resilience. You can further simulate failures within responses, such as key engineers being unavailable, creating worst-case scenarios to see if your system weathers the storm. These experiments can provide useful data for future reliability engineering focus.
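A toy version of this kind of fault injection might look like the following; real chaos experiments rely on dedicated tooling and careful blast-radius controls, and the failure rate and added latency here are hypothetical experiment parameters.

```python
import random
import time

# Toy fault-injection wrapper in the spirit of chaos engineering.
# FAILURE_RATE and EXTRA_LATENCY_S are hypothetical experiment parameters.
FAILURE_RATE = 0.05
EXTRA_LATENCY_S = 2.0

def chaos(fn):
    """Make a small fraction of calls to a dependency fail or slow down."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < FAILURE_RATE:
            raise ConnectionError("chaos: injected dependency failure")
        if roll < FAILURE_RATE * 2:
            time.sleep(EXTRA_LATENCY_S)  # simulate a slow third-party integration
        return fn(*args, **kwargs)
    return wrapper

@chaos
def fetch_recommendations(user_id: str) -> list:
    return ["item-1", "item-2"]  # stand-in for a real downstream call

# Callers must tolerate the injected failures, just as they would real ones.
try:
    print(fetch_recommendations("user-123"))
except ConnectionError as exc:
    print(f"fell back to defaults: {exc}")
```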

Monitor and log everything

Incidents, whether real or simulated, provide a wealth of information about how your system behaves. On top of that, the normal functioning of your system is constantly churning out data on its use and response. Without collecting and understanding this data, you’ll be unable to make meaningful decisions about where and how reliability should be improved. Thus, having an observability system set up to ingest and contextualize all the data your system produces is essential.

Monitoring tools are a good place to start. These tools gather data from across your system and present it in a way that makes patterns apparent. For example, a histogram of one component’s response times correlated with a histogram of server load can point to the cause of lag. The monitoring data can be combined with incident data: when an incident occurs, a snapshot of the data being monitored can be included with the retrospective.
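For a sense of what that data collection looks like at its simplest, here is a sketch of a latency histogram; a real monitoring stack (Prometheus, StatsD, and the like) handles this for you, and the bucket edges are example values.

```python
from collections import Counter

# Minimal latency-histogram sketch. Bucket edges are example values;
# a real monitoring stack would collect and graph this automatically.
BUCKET_EDGES_MS = [50, 100, 250, 500, 1000]

histogram = Counter()

def observe(latency_ms: float) -> None:
    """Count the sample into the first bucket whose upper edge contains it."""
    for edge in BUCKET_EDGES_MS:
        if latency_ms <= edge:
            histogram[f"<={edge}ms"] += 1
            return
    histogram[">1000ms"] += 1

# Simulated samples; in practice these come from request handlers and can
# be snapshotted into an incident retrospective.
for sample_ms in [42, 87, 310, 95, 1200, 64]:
    observe(sample_ms)

print(dict(histogram))
```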

Continually review, revise, and learn

To truly close the loop from the data you gather to improvements in reliability, you need to dedicate time to studying it. A core tenet of SRE is that perfect isn’t the objective; rather, improvement is. Just as you should expect incidents to arise in even the most meticulously designed code, you should expect that your responses will hit speed bumps, your experiments will reveal weak points, and your monitoring will be incomplete. This shouldn’t be discouraging, but motivating: a continuously improving system is one that is truly resilient in the face of any change.

Set regularly scheduled meetings to review incident retrospectives, SLIs and SLOs, monitoring dashboards, runbooks, and any other SRE procedures or practices you’ve implemented. These meetings shouldn’t be siloed to the particular engineers involved in an incident or a project. For example, if the team reviewing the on-call schedule conferred with the team reviewing the remaining error budget on a development project, both would better understand the resources and pressures of the whole system.

The tone and mindset of these meetings is especially important. Blame shouldn’t be assigned to any particular individuals. Instead, systemic issues should be uncovered collaboratively. Punishing a single person does nothing to improve the reliability of a system, and in fact likely has the opposite effect. By working blamelessly and examining the contributing factors of incidents, systemic unreliability can be addressed and improved upon.

Want to see how Blameless can help improve the reliability of your systems? We’ve got tools for collaborative responses, incident retrospectives, SLOs and error budgeting, and more, helping teams such as Iterable reduce critical incidents by 43%. Check out a demo!
