Heloisa Moraes for Codacy

Posted on May 25, 2022

How to measure Time to recover?

#performance #productivity #programming #software

Time to recover is one of the most interesting metrics you should use to evaluate your team’s speed, quality, and overall efficiency of your Engineering performance. This metric is also known as Mean time to recover, Mean time to restore, or MTTR.

There are four key Accelerate metrics that we will cover in detail:

Lead time for changes;
Deployment frequency;
Time to recover; and
Change failure rate

In this article, we are focusing on Time to recover. So keep on reading to know what Time to recover is, how we measure it in Pulse, and how your team can start using it to achieve elite engineering performance.

What is Time to recover?

In a nutshell, Time to recover measures how long it takes to recover from failure. This metric is correlated with both the speed and the quality of your engineering team.

“Traditionally, reliability is measured as time between failures. However, in modern software products and services, which are rapidly changing complex systems, failure is inevitable, so the key question becomes: How quickly can service be restored?”
Accelerate: The science of lean software and DevOps: Building and scaling high performing technology organizations

What does Time to recover measure?

Time to recover is a good indicator of teams’ response time and overall development process efficiency. It is one of the most interesting quality metrics for an organization to work on to ensure the availability and correct functioning of their software.

This metric analyzes how quickly your team can identify incidents and their root causes (e.g., a damaged database or a deployment that breaks an existing feature), notify the appropriate people to deal with those incidents, and resolve them fast.

How to read Time to recover?

On the one hand, a Time to recover that is too high indicates that there may be broader issues, like inefficient processes, lack of people, or inadequate team structure. On the other hand, a low Time to recover shows that your team can quickly respond to and resolve incidents.

Your teams’ goal should be to continuously reduce your Time to recover because it means that, when there is an incident, your team can fix it in a timely fashion, and you are not compromising the availability of your software.

Time to recover in Pulse

Products like Pulse make it very easy for your team to know how your Time to recover looks like. Pulse connects seamlessly with GitHub, PagerDuty, or your incident management tool to give you Time to recover and other Engineering metrics out-of-the-box. Your team only needs to focus on making informed decisions to improve your results.

How is Pulse measuring Time to recover?

Pulse measures Time to recover by calculating how long it takes your organization to recover from a failure in production (e.g., service impairment or unplanned outrage). It’s computed with the following formula:

average(incident resolved timestamp - incident created timestamp)

Time to recover inside Pulse — Time to recover in Pulse

Pulse takes care of all the details to ensure your metrics are accurate and reliable. For example, Pulse automatically associates incidents with the causing deployments, enabling you to explore the metrics by service and each of your GitHub’s team.

What Time to recover should a team have?

The best practice performance level for this metric, published every year on the Accelerate State of DevOps 2021 report:

Elite: Less than 1 hour
High: Less than 1 day
Medium: Less than 1 week
Low: More than or equal to 1 week

See what's your Time to recover

How to use Time to recover

Over time, Time to recover should be getting smaller, while your team increases in the performance level until reaching best practice levels.

Having a small Time to recover allows your team to have a sound overall recovery process, deliver high-quality software quickly, and improve the availability of your software.

“When she first looked at her dashboard, she was surprised by how big the MTTR was. It was too high for her liking. She waited for the next outage to see first-hand how the team would react. Quickly, she understood what was going on. They had neglected documenting incident management as part of the onboarding for the new members. She provided training for the new members of her team and provided them with new tools so that they could respond to new incidents faster.”
A tale of four metrics - Codacy

Causes of high Time to recover

Tests being carried out manually;
Shortage of people or a change in the organizational structure;
Inefficiencies in the incident management process (e.g., blocks, dependencies, not using the right tools).

How to reduce Time to recover

Improve your organizational structure: the right people can quickly identify the problem’s root cause and resolve it promptly.
Use test automation: incorporating automated tests at every stage of the CI/CD pipeline can help you reduce delivery times and help you recover quicker.
Clearly document the incident management process: continuously train team members on the process and on how to react in case of incidents.
Find potential steps of the incident management process to be automated: any step your team needs to perform manually - assigning a responsible, or creating a document - adds to Time to recover.

Conclusion

Time to recover is a fundamental metric to analyze and improve the efficiency of your Engineering team. It is a valuable metric to understand the capabilities of your team and how quickly they respond to problems and outrages.

Together with Lead time for changes, Deployment frequency, and Change failure rate, this metric provides valuable insights for your team to achieve full engineering performance.

With Pulse, you can easily measure and analyze the Time to recover in your projects and make better decisions.

Top comments (1)

Andrew Baisden • May 25 '22

Great article so informative.

DEV Community