Originally published on Failure is Inevitable.
After getting managerial approval for incident management, your SRE buy-in program is well underway. But how do you prove that it's effective, and that adopting more best practices is worthwhile? In part 2 of this blog series, we share how to convince a VP or director to invest in more SRE practices: automated metrics and continuous learning.
The situation
Your team has implemented incident management and can react to incidents and resolve them faster than ever. But you aren't learning as much from these incidents as you could be.
Manually figuring out what to measure (let alone how to do so) is time-consuming. You need to find a better way to report data so your team can stay focused on learning, improving, and innovating. This requires buy-in at the VP or director level. Before approaching this conversation, you need to take two considerations into account.
First, you need to understand that this is a big undertaking for your VP/director. They will need to get support from the entire engineering and DevOps team, as well as members of the product team, for this initiative to succeed. Thus, it’s important that your appeal takes their point of view and goals into account.
Second, you need to define what continuous learning looks like. We define it as a collection of capabilities that drive shared context and focus. This includes but is not limited to:
- Automated and aggregated data measurement (MTTR, customers affected, etc.); see the sketch after this list
- Standardized means of reporting and dashboards
- Incident retrospectives
- Team training
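To make the first capability concrete, here is a minimal sketch of how MTTR might be computed once incident data is exported automatically. The record shape (detected/resolved timestamps) is an assumption for illustration, not the format of any particular tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice these would come from your
# incident management tool's API or a scheduled export.
incidents = [
    {"detected": datetime(2023, 5, 1, 9, 0),  "resolved": datetime(2023, 5, 1, 10, 30)},
    {"detected": datetime(2023, 5, 7, 14, 0), "resolved": datetime(2023, 5, 7, 14, 45)},
    {"detected": datetime(2023, 5, 19, 2, 0), "resolved": datetime(2023, 5, 19, 5, 0)},
]

# MTTR = mean time from detection to resolution, here in minutes.
mttr_minutes = mean(
    (i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents
)
print(f"MTTR: {mttr_minutes:.0f} minutes across {len(incidents)} incidents")
```

The point of automation is that a number like this stays current without anyone pulling timestamps by hand.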
Now that you’ve got the basics of your proposal, it’s time to articulate the incentives.
The incentives
There are several major incentives for automating metrics and continuous learning. These are framed in a way that your VP or director will care about.
- Aggregating data from product teams, customer support teams, and others removes organizational silos and promotes cross-team collaboration, and it replaces a communication process built on painful, time-consuming manual data collection.
- Automated reporting eliminates tedious hours spent pulling database queries. If reporting isn't automated, engineering teams might not get to it at all, which makes it harder to show the progress and benefits of investments in SRE.
- Retrospectives help teams capture learning and distribute it across software teams. That deep analysis prevents the same mistake from happening again, and it helps resolve similar incidents faster in the future. While experienced engineers may have an instinct about what's wrong, junior engineers or new hires might not. Capturing that crucial knowledge in retrospectives lets you codify it, so you can train new engineers and get them up to speed faster.
Even with these incentives laid out, resistance is still likely. Here are some common rebuttals you should prepare for.
The resistance
If you’ve adopted incident management best practices, your VP/director might say that’s good enough. They could also argue that reliability isn’t a pressing issue at the moment and that new features are a higher priority. In other words, the incentives favor immediate-term goals over longer-term ones.
Another objection is that retrospectives vary, are hard to review, and are usually one-and-done: after completion, they are filed away and forgotten. Many might not be completed at all, because they take too much time to write and don’t command the same immediate urgency as resolving incidents or shipping new features.
Though these concerns seem difficult to counter, by making both emotional and logical appeals we can present the reasons why adopting continuous learning and automation is necessary, and provide the metrics to prove it.
The emotional appeal
To connect with VPs and directors, it's important to illustrate ‘hair on fire’ moments that prove reliability is a pressing issue. You can begin by addressing team stress. Engineering teams are dealing with incidents, but it’s a continuous battle. Without the ability to aggregate system, incident, and retrospective data and see patterns, the extent of reliability issues remains hidden. That frustrates engineering teams: they get bogged down by manual, repetitive work, which leads to burnout and churn.
If engineering stress levels aren't a concern for your VP or director (they always should be!), you can speak to customer satisfaction. Reliability is now the most important feature: it is the net sum of all the features you've already shipped. If any shipped feature is unreliable, the value of all the others is moot, so that sum matters more than any single new feature you're about to ship. And if customers are unhappy with the sum, they will leave for a competitor that delivers a better experience.
These appeals are important, but you need the data to back them up. To prove to your VP or director that automation efforts are worthwhile, quantify the number of incidents, bugs, and regressions caused by new feature work, and the time required to fix them. How many fires and on-call incidents happen within a month, and how do they correlate with feature and project work? How much money and how many resources go into new features that can’t hold up to customer standards? The numbers will likely surprise them.
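As a rough illustration of that kind of tally, the sketch below groups hypothetical issue records by month and by the project that introduced them. The field names (month, cause_project, hours_to_fix) are assumptions about what your ticketing data might contain, not a prescribed schema.

```python
from collections import defaultdict

# Hypothetical issue records exported from ticketing and incident tools.
issues = [
    {"month": "2023-05", "type": "incident",   "cause_project": "checkout-v2",   "hours_to_fix": 6},
    {"month": "2023-05", "type": "regression", "cause_project": "checkout-v2",   "hours_to_fix": 3},
    {"month": "2023-05", "type": "bug",        "cause_project": "search-revamp", "hours_to_fix": 2},
    {"month": "2023-06", "type": "incident",   "cause_project": "search-revamp", "hours_to_fix": 8},
]

# Count issues and remediation hours per originating project, per month.
summary = defaultdict(lambda: {"count": 0, "hours": 0})
for issue in issues:
    key = (issue["month"], issue["cause_project"])
    summary[key]["count"] += 1
    summary[key]["hours"] += issue["hours_to_fix"]

for (month, project), stats in sorted(summary.items()):
    print(f"{month} {project}: {stats['count']} issues, {stats['hours']}h spent fixing")
```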
The logical appeal
Let’s focus on the argument that incident management is already enough. It’s crucial to point out that SRE is not only about incident recovery; it’s also about maximizing learning from the patterns across incidents. Without continuous learning, you're not improving and the situation will worsen. Look at a past set of incidents, write retrospectives for them, and see how they correlate. It’s likely that many of those incidents could have been prevented if you had been able to automate metrics and track the patterns behind them.
Once you do a trial run of automation and continuous learning, you’ll need to prove its effectiveness. To do this, collect company-wide metrics on all incidents. Show the time saved by automated incident reporting and the rate of follow-up action items completed. Additionally, measure the errors and issues hit by new hires.
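The follow-up completion rate in particular is a simple ratio. Here is a minimal sketch, assuming your retrospective tool can export action items with a done flag; the records below are made up for illustration.

```python
# Hypothetical follow-up action items exported from retrospectives.
action_items = [
    {"incident": "INC-101", "done": True},
    {"incident": "INC-101", "done": False},
    {"incident": "INC-107", "done": True},
    {"incident": "INC-112", "done": True},
]

completed = sum(1 for item in action_items if item["done"])
rate = completed / len(action_items)
print(f"Follow-up action items completed: {completed}/{len(action_items)} ({rate:.0%})")
```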
These metrics alone should demonstrate the need for automation and continuous learning. Retrospectives vary, yet patterns lurk under the surface, and your team needs the ability to uncover them. Still, there is one more common objection to address.
Retrospectives are too varied. Experienced teams have a gut instinct that incidents may be related, but until a formal process for aggregating and using the data is in place, the metrics needed to drive organizational change are time-consuming to produce. Generally, retrospectives are freeform and difficult to go back and analyze.
To combat this, suggest tooling that automates retrospective information gathering. From a reporting perspective, you can create a metadata schema (e.g. services impacted, customers affected, contributing factors) to surface underlying patterns, then map that metadata from retrospectives to feature and project work to show the correlation.
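One possible shape for such a schema is sketched below, with hypothetical fields and sample data; aggregating contributing factors and originating projects across retrospectives is what surfaces the repeat offenders.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

# A minimal retrospective metadata schema; these fields mirror the examples
# above (services impacted, customers affected, contributing factors) and are
# illustrative assumptions, not a standard.
@dataclass
class RetroMetadata:
    incident_id: str
    services_impacted: list[str]
    customers_affected: int
    contributing_factors: list[str]
    related_project: str | None = None

retros = [
    RetroMetadata("INC-101", ["payments"], 1200, ["config change", "missing alert"], "checkout-v2"),
    RetroMetadata("INC-107", ["payments", "api"], 300, ["config change"], "checkout-v2"),
    RetroMetadata("INC-112", ["search"], 40, ["capacity"], "search-revamp"),
]

# Surface recurring contributing factors and the project work they trace back to.
factor_counts = Counter(f for r in retros for f in r.contributing_factors)
project_counts = Counter(r.related_project for r in retros if r.related_project)

print("Most common contributing factors:", factor_counts.most_common(3))
print("Incidents by originating project:", project_counts.most_common())
```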
With these metrics in hand, you’re prepared to propose automated metrics and continuous learning to your VP or director. But, there’s one more level of leadership you need to convince to empower SRE adoption. In part 3, we discuss how to get buy-in from your CEO or CTO.