Must-Have metrics for successful Site Reliability

#sre #devops #metrics

Originally posted on Code-Artisan

An overall Site Reliability strategy should always be based on data and metrics. However, having the right metrics is a key challenge when evolving services and practices internally. The rise of Agile, DevOps and SRE introduced metrics at different scales, which I propose to review together. The three main topics presented here are Incidents, Service and Teams.

Incident metrics focus on how incidents are handled and on what to look at when improving service levels. Incidents can be failures and production outages, but also service degradations, depending on the maturity of the product/solution.

Service metrics focus on the service itself and its lifecycle: DevOps concepts like deployment metrics, and also the operational metrics to watch on running services.

Last but not least, Team metrics are the human and business side of the coin. Yes, you have a flawless service, but at what cost? I believe this is an important part of how we should look at metrics when dealing with Service Levels.

Service Levels

Before diving into those metrics, I would like to propose a quick reminder on Service Levels and their variants, as they are often misused or not well understood:

Service Level Indicator (SLI): This is the measure itself, for instance the availability (uptime) of a service. It’s a raw value with a formula for how to calculate it.
Service Level Objective (SLO): This is the objective set for one or more SLI values. For instance the famous four nines: 99.99% availability of the service through the year, which leaves roughly 52 minutes of downtime per year.
Service Level Agreement (SLA): Finally, this is the agreement between two parties on how to deliver the chosen SLOs and, where applicable, the consequences of not meeting the objectives on these indicators. Some agreements, for instance, impose penalties on the provider if it fails to meet its objectives and therefore breaks the agreement.
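To make the difference between an SLI and an SLO concrete, here is a minimal sketch in Python. It assumes availability is measured as the share of successful requests, which is only one possible way to define it, and the function names are mine for illustration.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption: a period without traffic counts as fully available.
        return 1.0
    return successful_requests / total_requests


def allowed_downtime_minutes(slo: float, period_days: int = 365) -> float:
    """Downtime budget implied by an availability SLO over a period."""
    return (1.0 - slo) * period_days * 24 * 60


def slo_met(sli: float, slo: float) -> bool:
    """The SLO is met when the measured SLI reaches the objective."""
    return sli >= slo


# Made-up numbers, for illustration only.
sli = availability_sli(successful_requests=999_950, total_requests=1_000_000)
print(f"SLI: {sli:.4%}")
print(f"SLO of 99.99% met: {slo_met(sli, 0.9999)}")
print(f"Yearly downtime budget: {allowed_downtime_minutes(0.9999):.1f} min")
```

The SLA does not appear in the code on purpose: it is the contract built around these numbers, not a measurement.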

Now that we have the concepts defined, the rest of the article will mainly focus on the SLIs I believe are important when running a modern service, to help it reach maturity and excellence in the long term. The SLOs and SLAs, however, will depend on your current maturity and on the objectives set with your customers and internally. On this subject I highly recommend reading the SRE book published freely by Google.

With that said, let’s dive in.

Incident : Don’t be Mean to me

Incidents will unfortunately be an important part of the lifecycle of the services you maintain, whether we define an incident as full unavailability or just a degradation of service (unsuccessful requests, latency, etc.). The metrics related to these incidents are often overlooked, but they are key to pinpointing organizational and technology problems and to spotting trends in the overall service level. The speed at which incidents are registered, handled and resolved is also a key indicator of the maturity of both the team and the service. The main objective here is to spot trends in incident handling, so the mathematical tool will be mean values. I propose to define them and explain visually what each value means:

Mean Time to Detection (MTTD): Fairly straightforward: how much time it takes to detect a problem. This can influence decisions at the monitoring level, such as refining the collection and evaluation intervals of rules. It might look simple, but tuning monitoring thresholds so that they trigger only true-positive alerts you can act on is a huge challenge and constant work.
Mean Time to Acknowledge (MTTA): This is the time between the detection of the problem and someone acknowledging it and assigning it for investigation. This metric is often a key part of SLAs, with SLOs to acknowledge events within 1 or 4 hours, even outside business hours. In practice almost nobody expects incident resolution within such a time frame, because you never know what will occur. The time to take responsibility and acknowledge the incident, however, will often appear in an SLA.
Mean Time Between Failure (MTBF): This one is really interesting because it lets you spot patterns in failures in order to predict and correct long-term errors. A service failing every Friday is most likely a sign of an underlying cause that needs to be fixed. The other benefit of defining this metric is that it forces you to identify incidents uniquely, because you need to know whether this is the same incident happening again or a new one. This capability may look trivial, but it also shows a certain level of maturity at the organizational and technological level.
Mean Time to Resolve (MTTR): This is obviously the one managers look at the most: it determines the length of an outage and often the impact on customers and service. Digging into it can bring a lot of improvement in traceability, debugging and how investigations are done. It can also help decide how to handle incidents: do we troubleshoot the current problem, or just deploy a quick fix to restore service and then investigate the affected service in depth in the background?
Mean Time to Failure (MTTF): This last one is a little bit special. In my view it mainly applies to hardware components, meaning it may drive logistics projects on how to replace components or potentially move to a new iteration of the component. Although this metric is rarely used for software, never forget that despite what cloud providers may say, software always runs on hardware. People involved in projects like IoT and hardware solutions might learn interesting lessons by digging into this metric.
[Figure: mean times definition]
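To make these definitions concrete, here is a minimal sketch that computes the means from a list of incident records. The `Incident` class and its field names are hypothetical, not taken from any ticketing tool. It also makes two assumptions that you should adapt to your own conventions: MTTR is measured from the start of the failure to its resolution, and MTBF is measured from the resolution of one incident to the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; a real ticketing tool will use other field names.
    started_at: datetime
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def _mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def incident_means(incidents: list[Incident]) -> dict[str, timedelta]:
    """Compute MTTD, MTTA, MTTR and (with 2+ incidents) MTBF."""
    ordered = sorted(incidents, key=lambda i: i.started_at)
    result = {
        "MTTD": _mean_delta([i.detected_at - i.started_at for i in ordered]),
        "MTTA": _mean_delta([i.acknowledged_at - i.detected_at for i in ordered]),
        # Assumption: resolution time is counted from the start of the failure.
        "MTTR": _mean_delta([i.resolved_at - i.started_at for i in ordered]),
    }
    # Assumption: the gap runs from one resolution to the next failure start.
    gaps = [nxt.started_at - prev.resolved_at
            for prev, nxt in zip(ordered, ordered[1:])]
    if gaps:
        result["MTBF"] = _mean_delta(gaps)
    return result


# Two made-up incidents, one week apart.
t0 = datetime(2024, 1, 5, 10, 0)
t1 = t0 + timedelta(days=7)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=15),
             t0 + timedelta(minutes=60)),
    Incident(t1, t1 + timedelta(minutes=15), t1 + timedelta(minutes=20),
             t1 + timedelta(minutes=30)),
]
for name, value in incident_means(incidents).items():
    print(name, value)
```

A plain mean is easily skewed by a single very long incident, so in practice it may be worth looking at the median or percentiles alongside it.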

Service: DevOps and operations

We now dive into the so-called service metrics. Listed here are the generic metrics that should apply to any service operating in your system. Although only operational metrics were considered until recently, the DevOps methodology introduced some new ones that indicate the flexibility and agility of the service.

Deployment Time & Frequency : These are the most common examples of DevOps metrics, and they are indicators of agility and flexibility. The time it takes to deploy a service may indicate the effort still needed to automate tasks. It may also be linked to time to recovery, if an incident requires a new deployment to replace a faulty service. Frequency, in my opinion, shows the maturity and confidence you have in both the service and its orchestration, rather than productivity or feature completion. A team that deploys a service 100 times a day is obviously confident that deployment is a non-event and that its orchestration and automation are mature.
Change Failure Rate: Often linked to deployment failures, this is the percentage of changes that fail and need to be fixed or re-deployed. It is ultimately an indicator of code quality and of the deployment process. It can drive an automated testing strategy and a review process for deployments.
Four Golden Signals : These are the main signals that every service’s indicators should revolve around. Although they are simple, any failure will show up as a trend change in one or more of them.
1. Latency : time it takes to answer a request
2. Traffic : volume of incoming requests; a value of 0 most likely means the service is dead
3. Errors : rate of failed requests; any increase there should trigger an investigation
4. Saturation : how full the service’s capacity is. If performance tests have been done, it can drive automatic scaling policies for the service.
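The deployment metrics and the change failure rate described above can be derived from a simple deployment log. The sketch below assumes such a log exists. The `Deployment` record and its fields are hypothetical, and in practice the data would come from your CI/CD tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    # Hypothetical record; adapt the fields to what your pipeline exposes.
    day: date
    duration_seconds: float
    failed: bool  # True if the change had to be fixed or re-deployed


def deployments_per_day(deployments: list[Deployment], period_days: int) -> float:
    """Deployment frequency over an observation period."""
    return len(deployments) / period_days


def mean_deployment_time(deployments: list[Deployment]) -> float:
    """Average deployment duration in seconds (0.0 if there is no data)."""
    if not deployments:
        return 0.0
    return sum(d.duration_seconds for d in deployments) / len(deployments)


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that failed, between 0.0 and 1.0."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


# Made-up log covering two days.
log = [
    Deployment(date(2024, 1, 8), 120.0, failed=False),
    Deployment(date(2024, 1, 8), 90.0, failed=True),
    Deployment(date(2024, 1, 9), 150.0, failed=False),
    Deployment(date(2024, 1, 9), 120.0, failed=False),
]
print(f"Frequency: {deployments_per_day(log, period_days=2):.1f} per day")
print(f"Mean deployment time: {mean_deployment_time(log):.0f} s")
print(f"Change failure rate: {change_failure_rate(log):.0%}")
```

What counts as a failed change is a decision to make up front, for example a rollback, a hotfix within a given delay, or a linked incident. The sketch leaves that decision to whoever fills in the `failed` flag.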

Team:

Last but not least, any work deployed is still made by humans, ultimately to deliver value to another party. Any team working in Agile, Waterfall or any other form of project management should be aware of the cost involved in taking a service from start to production. Being efficient on top of being effective is a key goal that any business will be keen on improving. To do so, raw data on team performance is necessary to drive the value offer and optimization, and to assess the maturity of teams over time.

Costs : Not necessarily money, although this is definitely an objective of most businesses. Evaluating the costs of developing and running services at scale is a very difficult task, but it is the one that will drive most of the business. Optimization and frugality as constraints can lead to innovation and even governance changes, such as moving to cloud environments or refining the scaling of services.
Spent Time : This is actually a subset of cost, but I would like to insist on it since I believe it represents most of the cost and is also a main driver of project performance. I would mainly use it afterwards to improve time estimation based on previous data. It also helps to break it down by task to find optimizations, for example by putting senior people on certain key tasks to gain time while more junior profiles get their hands on the well-known patterns.
Unplanned Work : Unplanned work is any activity in the IT organization that can’t be mapped to an authorized project, procedure or change request. This is also a nice indicator of maturity and validates a lot of choices, human or technological. If the amount of unplanned work is huge, it means one of two things. Either there were a lot of operational troubles to work on, in which case it may be time to stabilize the technology and processes, or estimates and actual work differ too much, in which case the estimates may have been too optimistic or even irrelevant at some point. Once again, unfortunately, in both cases this is a hard metric to gather and it requires significant effort, either at the post-mortem after delivery or after an incident, to track the actual time spent. Obviously the ideal would be to have as little unplanned work as possible, but it will never reach zero.
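As a rough illustration of the spent time and unplanned work metrics, here is a small sketch that computes the share of unplanned work and the estimation overrun from tracked hours. The `WorkItem` record is hypothetical and the numbers are made up. The hard part, as said above, is collecting the actual hours in the first place.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    # Hypothetical record of tracked work, in hours.
    name: str
    estimated_hours: float  # 0 for unplanned work, which had no estimate
    actual_hours: float
    planned: bool


def unplanned_work_ratio(items: list[WorkItem]) -> float:
    """Share of the total time spent that went to unplanned work."""
    total = sum(i.actual_hours for i in items)
    if total == 0:
        return 0.0
    unplanned = sum(i.actual_hours for i in items if not i.planned)
    return unplanned / total


def estimation_overrun(items: list[WorkItem]) -> float:
    """Relative gap between actual and estimated time on planned work."""
    planned = [i for i in items if i.planned]
    estimated = sum(i.estimated_hours for i in planned)
    if estimated == 0:
        return 0.0
    actual = sum(i.actual_hours for i in planned)
    return (actual - estimated) / estimated


items = [
    WorkItem("new feature", estimated_hours=20, actual_hours=26, planned=True),
    WorkItem("database migration", estimated_hours=10, actual_hours=10, planned=True),
    WorkItem("production hotfix", estimated_hours=0, actual_hours=4, planned=False),
]
print(f"Unplanned work: {unplanned_work_ratio(items):.0%}")
print(f"Estimation overrun: {estimation_overrun(items):.0%}")
```

Keeping the two numbers separate helps tell apart the two causes mentioned above: operational trouble shows up in the first one, optimistic estimates in the second.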