MTTR: Mean time to recovery

#monitoring #devops #webdev

What is Mean time to recovery (MTTR)?

Mean time to recovery (or mean time to restore) is the average time it takes to recover from a product or system failure.

It is an essential metric in incident management because it shows how quickly you resolve downtime incidents and get your systems back up and running.

Calculating mean time to recovery

Time to recovery (TTR) is the full duration of an outage, from the moment the system fails to the moment it is fully functioning again. The average of all recovery times for a given system is its MTTR.

MTTR = sum of all time to recovery periods / number of incidents

For example, if a system was down for a total of 20 minutes across 2 separate incidents, its MTTR would be 10 minutes.

10 minutes = 20 minutes / 2 incidents
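The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the 12-minute and 8-minute split is an assumed way of arriving at the 20 minutes of total downtime from the example.

```python
from datetime import timedelta


def mean_time_to_recovery(recovery_times):
    """Return the average of a list of outage durations (timedelta objects)."""
    if not recovery_times:
        raise ValueError("at least one incident is required")
    # sum() needs a timedelta start value, otherwise it starts from the int 0.
    return sum(recovery_times, timedelta()) / len(recovery_times)


# Two hypothetical incidents that add up to 20 minutes of downtime.
outages = [timedelta(minutes=12), timedelta(minutes=8)]
mttr = mean_time_to_recovery(outages)
print(mttr)  # 0:10:00
```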

Other meanings of MTTR

MTTR usually stands for mean time to recovery, but it can also represent other KPIs (key performance indicators) in the incident management process. Because of those multiple meanings, it is recommended to use the full names to prevent any misunderstandings. The other possible meanings of MTTR are:

Mean time to respond (MTTR)

Mean time to respond is the average time it takes to recover from a product or service failure, measured from the moment the first failure alert is received. The difference between mean time to recovery and mean time to respond is the average time it takes for an alert to come in.

This metric helps you see how much of the recovery period is down to the alerting system and how much is down to the actual work of the repair team.

Calculating mean time to respond

The time to respond is the period between the moment an alert is received and the moment the system has recovered. The average of all incident response times then gives the mean time to respond.

MTTR = sum of all time to respond periods / number of incidents

Mean time to repair (MTTR)

Mean time to repair is the average time it takes to repair a system. Unlike mean time to respond, it is measured not from the moment an alert is received, but from the moment the repairs actually begin.

It is useful to compare it with mean time to respond, as the difference shows how much time the team spends on diagnostics versus the actual repairs.

Calculating mean time to repair

The time to repair is the period between the moment the repairs begin and the moment they finish and the system is fully operational again. The average of all incident repair times then gives the mean time to repair.

MTTR = sum of all time to repair periods / number of incidents

Mean time to resolve (MTTR)

Mean time to resolve is the average time it takes to resolve a product or service failure. Resolution is defined as the point in time when the cause of an incident is identified and fixed. Resolving an incident this way helps prevent similar incidents from occurring in the future.

The mean time to resolve metric gives insight into the full scope of fixing and resolving incidents, as it goes beyond the downtime and includes the work done after the system is back up.

Calculating mean time to resolve

The time to resolve is the period between the moment the incident begins and the moment that incident is resolved. The average of all incident resolve times then gives the mean time to resolve.

MTTR = sum of all time to resolve periods / number of incidents
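The four readings of MTTR differ only in which timestamps they start and end at. Here is a minimal Python sketch, assuming each incident record carries five timestamps. The Incident class, its field names, and the sample data are hypothetical and not taken from any particular tool. It also assumes that time to respond and time to repair both end when the system has recovered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime          # system stops working
    alerted_at: datetime         # first failure alert is received
    repair_started_at: datetime  # repair work actually begins
    recovered_at: datetime       # system is fully functioning again
    resolved_at: datetime        # cause of the incident identified and fixed


def mean(durations):
    """Average a non-empty list of timedelta objects."""
    return sum(durations, timedelta()) / len(durations)


def mttr_variants(incidents):
    """Return the four MTTR readings for a non-empty list of incidents."""
    return {
        "recovery": mean([i.recovered_at - i.failed_at for i in incidents]),
        "respond": mean([i.recovered_at - i.alerted_at for i in incidents]),
        "repair": mean([i.recovered_at - i.repair_started_at for i in incidents]),
        "resolve": mean([i.resolved_at - i.failed_at for i in incidents]),
    }


# Made-up sample data for illustration only.
incidents = [
    Incident(
        failed_at=datetime(2024, 1, 1, 12, 0),
        alerted_at=datetime(2024, 1, 1, 12, 2),
        repair_started_at=datetime(2024, 1, 1, 12, 7),
        recovered_at=datetime(2024, 1, 1, 12, 12),
        resolved_at=datetime(2024, 1, 1, 13, 0),
    ),
    Incident(
        failed_at=datetime(2024, 1, 2, 9, 0),
        alerted_at=datetime(2024, 1, 2, 9, 4),
        repair_started_at=datetime(2024, 1, 2, 9, 5),
        recovered_at=datetime(2024, 1, 2, 9, 8),
        resolved_at=datetime(2024, 1, 2, 9, 30),
    ),
]

metrics = mttr_variants(incidents)
for name, value in metrics.items():
    print(f"mean time to {name}: {value}")
```

With this sample data, mean time to recovery minus mean time to respond (3 minutes) is the average alert delay, and mean time to respond minus mean time to repair (also 3 minutes) is the average time spent on diagnostics.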

Other incident management KPIs

Mean time to acknowledge (MTTA)

Mean time to acknowledge is the average time it takes for the team responsible for the given product or service to acknowledge the incident from when the alert is triggered.

The main use of MTTA is to track team responsiveness and alert system effectiveness. If your team receives too many alerts, they might become overwhelmed and get to important alerts later than is desirable. This situation is called alert fatigue and is one of the major problems in incident management. Tracking MTTA lets you spot and assess it before it becomes a serious issue.
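One way to track MTTA over time is to compute it per week and watch for an upward trend. The sketch below is only an illustration: the function name, the (triggered_at, acknowledged_at) pair format, and the sample alerts are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def weekly_mtta(alerts):
    """Map (ISO year, ISO week) to the mean acknowledge delay for that week.

    `alerts` is an iterable of (triggered_at, acknowledged_at) datetime pairs.
    """
    delays = defaultdict(list)
    for triggered_at, acknowledged_at in alerts:
        year, week, _ = triggered_at.isocalendar()
        delays[(year, week)].append(acknowledged_at - triggered_at)
    return {
        key: sum(values, timedelta()) / len(values)
        for key, values in sorted(delays.items())
    }


# Made-up alerts: two in ISO week 1 of 2024, one in week 2.
alerts = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 2)),
    (datetime(2024, 1, 3, 15, 0), datetime(2024, 1, 3, 15, 4)),
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 9)),
]

mtta_by_week = weekly_mtta(alerts)
for (year, week), mtta in mtta_by_week.items():
    print(f"{year}-W{week:02d}: {mtta}")
```

A weekly MTTA that keeps rising while the number of alerts also grows could be a sign of alert fatigue.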

Mean time between failures (MTBF)

Mean time between failures is the average time between one product or system failure and the next. It gives an overview of system reliability, which makes it a good long-term indicator of how well your team is preventing incidents.
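As a rough Python sketch, MTBF can be computed from the start times of consecutive failures. This assumes the gap is measured from the start of one failure to the start of the next. Some teams count only the uptime between failures, which would give a slightly smaller number. The function name and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_starts):
    """Return the average gap between consecutive failure start times."""
    ordered = sorted(failure_starts)
    if len(ordered) < 2:
        raise ValueError("at least two failures are required")
    # Pair each failure with the one that follows it.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Three hypothetical failures: gaps of 10 days and 20 days.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]
mtbf = mean_time_between_failures(failures)
print(mtbf)  # 15 days, 0:00:00
```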

How to lower your MTTR

Use faster monitors: Monitoring for incidents is the first part of any incident resolution process, as it tells your team that something is not working properly. Using monitors with a high check frequency (30 seconds is considered the best practice for general uptime checks) can decrease the time between the moment a downtime starts affecting your users and the moment your team gets alerted.
Improve your alerting: Prevent alert fatigue in your team by setting alerts that reflect the importance of the monitored systems. For example, phone call alerts are great for vital systems, but for lower-priority systems, Slack or Microsoft Teams messages or push notifications might be enough. Improving alerting this way can significantly reduce your mean time to acknowledge (MTTA).
Understand your incidents: Improving the information that your team is getting in the incident alerts could significantly decrease the time they spend on diagnostics. Quality debugging data like helpful event logs, error screenshots, and system performance graphs can make the diagnostics process noticeably easier.
Automate the on-call management process: It is crucial to set up an automated on-call scheduling process integrated into your team members’ calendars. This ensures that the right person, on the right team, in the right timezone, is always alerted at the right time.
Create an action plan: Sometimes, the assigned on-call team members might not answer the alert or might not be able to solve it on their own. In those cases, it is important to have a solid action plan for escalating incidents so that they are solved as soon as possible. Smaller organizations often have an ad hoc response process when solving incidents, while enterprises employ rigorous procedures. It is, however, recommended even for smaller teams to create an actionable troubleshooting plan for when incidents happen.
Designate team roles and responsibilities: On-call duties are often a dreaded responsibility. Because of that, it is important to set clear responsibilities for each team member, so that when an incident occurs, everyone knows what they should do.
Don’t underestimate postmortems: Postmortems are often overlooked, as they are only written after everything is back to normal and no immediate action is necessary. But in-depth postmortems and incident analysis can make the difference between solving an incident once and preventing it from ever occurring again.
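To get a feel for how much the check frequency from the first tip matters, here is a back-of-the-envelope sketch in Python. It rests on simplifying assumptions that are not from any specific monitoring tool: an outage is equally likely to start at any point between two checks, a single failed check is enough to raise an alert, and the check itself takes no time. The function name is hypothetical.

```python
def detection_delay_seconds(check_interval_seconds):
    """Return (average, worst-case) delay before a check notices an outage.

    Assumes outages start at a uniformly random moment between two checks.
    """
    return check_interval_seconds / 2, check_interval_seconds


for interval in (30, 60, 180):
    average, worst = detection_delay_seconds(interval)
    print(f"{interval}s checks: ~{average:.0f}s average, {worst}s worst case")
```

Under these assumptions, moving from 180-second checks to 30-second checks cuts the worst-case detection delay from 3 minutes to 30 seconds, and that time comes straight off every time to recovery.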