When using cloud-native services, you will undoubtedly have cloud incidents that disrupt the normal operation of your systems. No SRE team believes they can achieve 100% uptime. Instead, they plan ahead, anticipating what could go wrong (or has gone wrong in the past) and creating runbooks (sometimes called pipelines or workflows) to get things back to normal as quickly as possible.
MTTR is a metric SRE teams use to better understand how quickly incidents are repaired once they occur. The first three letters always stand for Mean Time To, but the R is used interchangeably for Repair, Respond, Resolve, Remediate, and Recover. MTTR can also appear in customer contracts, with consequences when it is exceeded. Keep in mind that MTTR represents a typical repair time, not a guaranteed one, so when reviewing a vendor's MTTR know that some incidents will resolve more quickly and others will take longer.
Organizations calculate MTTR in different ways. The key is consistency within your organization, and that the metric is meaningful enough to help the SRE team improve their results (and reduce the MTTR). When you hear someone talk about MTTR, it's always a good idea to ask for clarification so you're sure you're on the same page and the discussion makes sense.
However it's calculated, a low MTTR is obviously a good thing: it indicates a robust and resilient set of services, a sharp and quick-to-respond team, or both. MTTR can be measured in whatever units make sense (minutes, hours, days).
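To make that concrete, here is a minimal sketch of one common way to compute MTTR: average the repair time (recovery timestamp minus detection timestamp) across all incidents in a period. The incident records and field names below are hypothetical; your incident-management tool will have its own schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with detection and recovery timestamps.
incidents = [
    {"detected": datetime(2021, 6, 1, 9, 0),   "recovered": datetime(2021, 6, 1, 9, 45)},
    {"detected": datetime(2021, 6, 8, 14, 30), "recovered": datetime(2021, 6, 8, 16, 0)},
    {"detected": datetime(2021, 6, 20, 3, 15), "recovered": datetime(2021, 6, 20, 3, 40)},
]

# MTTR = mean of (recovery time - detection time), reported here in minutes.
repair_minutes = [(i["recovered"] - i["detected"]).total_seconds() / 60 for i in incidents]
print(f"MTTR over {len(incidents)} incidents: {mean(repair_minutes):.1f} minutes")
```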
What contributes to MTTR?
A typical SRE workflow after an outage or service disruption is detected involves multiple steps, each of which adds to the end-to-end response time.
The first step is troubleshooting to determine the root cause of the problem. ZK Research found that 90% of MTTR is spent identifying the source of the problem.
When an incident occurs, the responder usually has to acknowledge the alert first (unless you don't have monitoring and alerting, in which case you learn about outages from your users), then gather the appropriate system information, then find the runbook that should be used to start the repair. At this point you hope the person following the runbook is using the right one, for the right environment, and has the right permissions to run it.
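To illustrate just the lookup step, here is a minimal sketch that maps an incoming alert to the runbook registered for its type and environment. The alert fields, runbook paths, and registry are assumptions for the example, not any particular vendor's schema.

```python
# Hypothetical registry mapping (alert type, environment) to a runbook.
RUNBOOKS = {
    ("high_cpu", "production"):  "runbooks/prod/scale_out.md",
    ("high_cpu", "staging"):     "runbooks/staging/scale_out.md",
    ("disk_full", "production"): "runbooks/prod/expand_volume.md",
}

def select_runbook(alert: dict) -> str:
    """Return the runbook registered for this alert, or fail loudly if none exists."""
    key = (alert["type"], alert["environment"])
    try:
        return RUNBOOKS[key]
    except KeyError:
        raise LookupError(f"No runbook registered for {key}") from None

# Example alert payload (fields are placeholders, not a real monitoring schema).
alert = {"type": "high_cpu", "environment": "production", "host": "web-042"}
print(select_runbook(alert))  # runbooks/prod/scale_out.md
```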
How do you reduce your MTTR?
As you can see above, responding to an event involves multiple steps, each requiring the SRE to interact with multiple services. SREs should continually look for repeatable processes they can automate with code. By doing so they reduce human error, take a consistent approach to incident remediation regardless of who is handling it, and in many cases greatly speed up the time to resolve, thereby reducing MTTR. By tying together monitoring, alerting, and data collection, the SRE can have everything they need at their fingertips to decide on the next remediation steps. They can go even further by having a Slack channel spun up or a Zoom meeting created automatically, adding the right people for the severity and type of issue that occurs.
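As a rough sketch of what that glue automation can look like, the snippet below gathers basic diagnostics from an affected host and posts them to an incident channel through a Slack incoming webhook. The webhook URL, host name, and diagnostic commands are placeholders; substitute whatever your monitoring and chat tooling actually expose.

```python
import json
import subprocess
import urllib.request

# Placeholder webhook URL -- replace with your own Slack incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def gather_diagnostics(host: str) -> str:
    """Collect basic system information from the affected host (assumes SSH access)."""
    commands = ["uptime", "df -h", "free -m"]
    sections = []
    for cmd in commands:
        result = subprocess.run(["ssh", host, cmd], capture_output=True, text=True, timeout=30)
        sections.append(f"$ {cmd}\n{result.stdout}")
    return "\n".join(sections)

def notify_slack(message: str) -> None:
    """Post the collected context to the incident channel via the incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

if __name__ == "__main__":
    diagnostics = gather_diagnostics("web-042")
    notify_slack(f"Incident diagnostics for web-042:\n{diagnostics}")
```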
Fylamynt has created the world's first enterprise-ready, low-code platform for building, running, and analyzing SRE cloud workflows. With Fylamynt, an SRE can automate the most time-consuming parts of a runbook, leaving them free to make the decisions where their expertise is needed.