Must-Have metrics for successful Site Reliability

Shenril — Thu, 17 Jun 2021 05:34:39 +0000

Originally posted on Code-Artisan

An overall Site Reliability strategy should always be based on data and metrics. However, choosing the right metrics is a key challenge when evolving services and practices internally. The rise of Agile, DevOps and SRE introduced metrics at different scales, which I propose to review together. The three main topics presented here are Incidents, Service and Teams.

Incident metrics focus on how incidents are handled and on what to measure when improving service levels. Incidents can be failures and production outages, but also service degradations, depending on the maturity of the product or solution.

Service metrics focus on the service itself and its lifecycle: DevOps concepts like deployment metrics, as well as the operational metrics to watch on running services.

Last but not least, Team metrics are the human and business side of the coin. Yes, you have a flawless service, but at what cost? I believe this is an important part of how we should look at metrics when dealing with service levels.

Service Levels

Before diving into those metrics, I would like to propose a quick reminder on Service Levels and their variants, as they are often misused or misunderstood:

Service Level Indicator (SLI): This is the measure itself, for instance the availability (uptime) of a service. It’s a raw value together with the formula used to calculate it.
Service Level Objective (SLO): This is the target set for one or more SLI values. For instance the famous four nines: 99.99% availability of the service over the year.
Service Level Agreement (SLA): Finally, this is the agreement between two parties on how the chosen SLOs are delivered and, eventually, on the consequences of not meeting the objectives for these indicators. Some agreements, for instance, specify penalties for the provider if it misses its objectives and therefore breaks the agreement.
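To make the relation between an SLI and an SLO concrete, here is a minimal Python sketch. The function names are hypothetical, and it assumes availability is measured as the ratio of successful requests to total requests, which is only one possible way to define the indicator.

```python
from datetime import timedelta


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully (assumed definition)."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return successful_requests / total_requests


def allowed_downtime(slo: float, period: timedelta) -> timedelta:
    """Downtime an availability SLO tolerates over a period (the error budget)."""
    return period * (1.0 - slo)


def slo_met(sli: float, slo: float) -> bool:
    """An SLO is simply a target value that the measured SLI is compared against."""
    return sli >= slo


# A 99.99% objective over one year leaves roughly 52.6 minutes of downtime.
yearly_budget = allowed_downtime(0.9999, timedelta(days=365))
measured = availability_sli(successful_requests=999_950, total_requests=1_000_000)
print(yearly_budget, measured, slo_met(measured, 0.9999))
```

The SLA does not appear in the code on purpose: it is the contract built on top of these numbers, not a measurement.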

Now that the concepts are defined, the rest of the article will mainly focus on the SLIs I believe are important when running a modern service, to make it reach maturity and excellence in the long term. The SLOs and SLAs, however, will depend on your current maturity and on the objectives set with your customers and internally. On the subject overall, I highly recommend reading the SRE books published freely by Google.

With that said, let’s dive into the subject.

Incident : Don’t be Mean to me

Incidents will unfortunately be an important part of the lifecycle of the services you maintain, whether we define an incident as full unavailability or just a degradation of service (unsuccessful requests, latency, etc.). The metrics related to these incidents are often overlooked, but they are key to pinpointing organizational and technology problems and to spotting trends in the overall service level. The speed at which incidents are registered, handled and resolved is also a key indicator of the maturity of both the team and the service. The main objective here is to spot trends in incident handling, so the mathematical tool will be mean values. I propose to define them and explain visually what each value means:

Mean Time to Detection (MTTD): Fairly straightforward: how much time it takes to detect a problem. This can influence decisions at the monitoring level, such as refining the collection and evaluation intervals of rules. It might look simple, but adjusting thresholds just right so that monitoring triggers only true-positive alerts you can act on is a huge challenge and constant work.
Mean Time to Acknowledge (MTTA): This is the time between the detection of the problem and someone acknowledging it and assigning it for investigation. This metric is often a key part of SLAs, with SLOs to acknowledge events within 1 or 4 hours, even outside business hours. In business reality almost nobody expects incident resolution within such a time, because you never know what will occur; the time to take responsibility and acknowledge the incident, however, will often appear in the SLA.
Mean Time Between Failures (MTBF): This one is really interesting because it lets you spot patterns in failures in order to predict and correct long-term errors. A service failing every Friday is most likely a sign of an underlying cause that needs to be fixed. The other benefit of defining this metric is that it forces you to identify incidents uniquely, because you need to know whether the same incident is happening again or a new one. This capability may look trivial, but it also shows a certain level of maturity at the organizational and technological level.
Mean Time to Resolve (MTTR): This is obviously the one managers watch the most; it determines the length of an outage and often the impact on customers and service. Digging into it can bring a lot of improvement in traceability, debugging and how investigations are done. It can also help decide how to handle incidents: do we troubleshoot the current problem, or just deploy a quick fix to restore service and then investigate the affected service in depth in the background?
Mean Time to Failure (MTTF): This last one is a little bit special. To my mind it mainly applies to hardware components, meaning it may drive logistics projects on how to replace components or potentially move to a new iteration of the component. Although this metric is rarely used for software, never forget that despite what cloud providers may say, software always runs on hardware. People involved in IoT and hardware projects might learn interesting lessons by digging into this metric.
[Figure: definition of the mean time metrics]
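As a rough illustration of how these means could be derived from incident records, here is a short Python sketch. The `Incident` fields are hypothetical, and the exact boundaries are assumptions: MTTR is measured here from the start of the failure to its resolution, and MTBF as the gap between one resolution and the start of the next failure. Other conventions exist, so adapt the boundaries to your own definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime       # when the failure actually began
    detected_at: datetime      # when monitoring raised the alert
    acknowledged_at: datetime  # when someone took ownership
    resolved_at: datetime      # when service was restored


def _mean_delta(deltas: list[timedelta]) -> timedelta:
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mean_times(incidents: list[Incident]) -> dict[str, timedelta]:
    if not incidents:
        raise ValueError("at least one incident is required")
    ordered = sorted(incidents, key=lambda i: i.started_at)
    result = {
        "MTTD": _mean_delta([i.detected_at - i.started_at for i in ordered]),
        "MTTA": _mean_delta([i.acknowledged_at - i.detected_at for i in ordered]),
        "MTTR": _mean_delta([i.resolved_at - i.started_at for i in ordered]),
    }
    # MTBF needs at least two incidents: time from one resolution to the next failure.
    gaps = [nxt.started_at - prev.resolved_at for prev, nxt in zip(ordered, ordered[1:])]
    if gaps:
        result["MTBF"] = _mean_delta(gaps)
    return result


# Two made-up incidents, one week apart, both on a Friday morning.
incidents = [
    Incident(datetime(2021, 6, 4, 10, 0), datetime(2021, 6, 4, 10, 5),
             datetime(2021, 6, 4, 10, 15), datetime(2021, 6, 4, 11, 0)),
    Incident(datetime(2021, 6, 11, 10, 0), datetime(2021, 6, 11, 10, 15),
             datetime(2021, 6, 11, 10, 20), datetime(2021, 6, 11, 10, 30)),
]
times = mean_times(incidents)
for name, value in times.items():
    print(name, value)
```

MTTF is left out on purpose, since it concerns components that are replaced rather than repaired.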

Service: DevOps and operations

We now dive into the so-called service metrics. Listed here are the generic metrics that should apply to any service operating in your system. Although only operational metrics were considered until recently, the DevOps methodology introduced new ones to indicate the flexibility and agility of the service.

Deployment Time & Frequency: These are the most common examples of metrics added by DevOps, and they indicate agility and flexibility. The time it takes to deploy a service may indicate the effort still needed to automate tasks. It may also be linked to time to recovery if an incident requires a new deployment to replace a faulty service. Frequency, in my opinion, shows the maturity and confidence you have in both the service and its orchestration rather than productivity or feature completion. A team deploying a service 100 times a day is obviously confident that deployment is a non-event and that its orchestration and automation are mature.
Change Failure Rate: Often linked to deployment failures, this is the percentage of changes that fail and need to be fixed or re-deployed. It is ultimately an indicator of code quality and of the deployment process. It can drive an automated testing strategy as well as the review process for deployments.
Four Golden Signals: These are the main signals that every service’s indicators should revolve around. Although simple, any failure will show up as a trend change in one or more of them.
1. Latency: the time it takes to answer a request
2. Traffic: the volume of incoming traffic; 0 means the service is most likely dead
3. Errors: obviously, any increase here should trigger an investigation
4. Saturation: how full the service capacity is. If performance tests were done, it can drive automatic scaling policies for the service.
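The service metrics above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a standard: the `Deployment` and `Request` records, the choice of the 95th percentile for latency, counting only HTTP 5xx responses as errors, and measuring saturation as in-flight requests over a capacity found during performance tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    failed: bool  # True if the change had to be fixed or re-deployed


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def deployment_frequency(deployments: list[Deployment], days_observed: int) -> float:
    """Average number of deployments per day over the observed period."""
    return len(deployments) / days_observed


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that failed, between 0.0 and 1.0."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def golden_signals(requests: list[Request], window_seconds: float,
                   in_flight: int, capacity: int) -> dict[str, Optional[float]]:
    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95 = latencies[(95 * n + 99) // 100 - 1] if n else None
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": n / window_seconds,  # 0 means the service is likely dead
        "error_rate": errors / n if n else 0.0,
        "saturation": in_flight / capacity,
    }


# Made-up sample: 4 deployments in a week, 20 requests in a 10-second window.
deployments = [Deployment(False), Deployment(True), Deployment(False), Deployment(False)]
requests = [Request(latency_ms=10.0 * (i + 1), status=500 if i < 2 else 200)
            for i in range(20)]
signals = golden_signals(requests, window_seconds=10, in_flight=30, capacity=40)
print(deployment_frequency(deployments, 7), change_failure_rate(deployments), signals)
```

In practice these values usually come from a monitoring system rather than raw records, but the arithmetic stays the same.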

Team:

Last but not least, any work deployed is still done by humans, ultimately to deliver value to another party. Any team working in Agile, Waterfall or any other form of project management should be aware of the cost of bringing a service from start to production. Being efficient on top of being effective is a key goal that any business will be keen on improving. To do so, raw data on team performance is necessary to drive the value offered, guide optimization and assess the maturity of teams over time.

Costs: Not necessarily money, although this is definitely an objective for most businesses. Evaluating the costs of developing and running services at scale is a very difficult task, but it is the one that will drive most of the business. Optimization and frugality as constraints can lead to innovation and even governance changes, such as moving to cloud environments or refining the scaling of services.
Spent Time: This is actually a subset of cost, but I would like to insist on it since I believe it makes up most of the cost and is also a main driver of project performance. I would mainly use it afterwards to improve time estimates based on previous data. It also helps to break the time down by task to find optimizations, for example by putting senior people on certain key tasks to gain time while more junior profiles get their hands on well-known patterns.
Unplanned Work: Unplanned work is any activity in the IT organization that can’t be mapped to an authorized project, procedure or change request. It is also a nice indicator of maturity and validates a lot of human and technology choices. If the amount of unplanned work is huge, it means either that there were a lot of operational troubles to work on, in which case it may be time to stabilize the technology and processes, or that the estimates and the actual work differ too much, in which case the estimates may have been too optimistic or even irrelevant at some point. Unfortunately, in both cases this is a hard metric to gather and it requires significant effort to track the actual time spent, either at a post-mortem after delivery or after an incident. Obviously the ideal is to have as little unplanned work as possible, but it will never reach zero.
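Once time is tracked per task, the two team signals above (share of unplanned work, and the gap between estimates and actual time) reduce to simple ratios. The sketch below is a hypothetical Python model: the `WorkItem` record and the convention that a missing `change_request` marks unplanned work are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    title: str
    spent_hours: float
    estimated_hours: Optional[float] = None  # None when no estimate existed
    change_request: Optional[str] = None     # None marks unplanned work (assumed)


def unplanned_ratio(items: list[WorkItem]) -> float:
    """Share of all spent time that went to work not mapped to a change request."""
    total = sum(i.spent_hours for i in items)
    if total == 0:
        return 0.0
    unplanned = sum(i.spent_hours for i in items if i.change_request is None)
    return unplanned / total


def estimation_overrun(items: list[WorkItem]) -> float:
    """Relative gap between spent and estimated time on estimated items.

    A positive value means the estimates were too optimistic.
    """
    estimated_items = [i for i in items if i.estimated_hours is not None]
    estimated = sum(i.estimated_hours for i in estimated_items)
    if estimated == 0:
        return 0.0
    spent = sum(i.spent_hours for i in estimated_items)
    return (spent - estimated) / estimated


# Made-up sprint: two planned features and one hotfix nobody planned for.
items = [
    WorkItem("Feature A", spent_hours=12, estimated_hours=8, change_request="CR-1"),
    WorkItem("Feature B", spent_hours=6, estimated_hours=8, change_request="CR-2"),
    WorkItem("Production hotfix", spent_hours=6),
]
print(unplanned_ratio(items), estimation_overrun(items))
```

Tracking both numbers over time matters more than any single value, since the goal is to spot a trend.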

References

https://dev.to/microtica/13-devops-metrics-for-increased-productivity-5084
https://sre.google/books/

Morning Routine and Healthy Habits in a hostile environment (Aka Parenting)

Shenril — Wed, 20 Feb 2019 07:55:05 +0000

One year ago I became a father.
While this event made me the happiest man alive, it also carried a bunch of challenges.

I've always been an advocate of morning routines, meditation, sports practices and other healthy habits regarding productivity and personal balance.

Despite my efforts, all these routines have been paused this last year because I couldn't find the right way to make them work. I've been wondering about this for a long time and would like to know your opinions or silver bullets for keeping these healthy habits while keeping your family balance in check.

Here is what a typical day looks like for me:

6:00 Waking Up, shower, perform some home chores
6:30 <-> 7:30 Leave for work: I have a one-hour train commute, which lets me check newsfeeds, catch up on some discussions on Slack, read a book or even play my Switch from time to time, but no substantial work (coding)
12:30 <-> 13:30 Lunch break: One hour basically trying to learn something while eating (currently Japanese). My company network policy prevents me from doing much and I don't have a personal hotspot... Even with Pomodoro stuff, the amount of coding you can do within one hour with controlled network access is fairly limited.
18:00 <-> 19:00 Go home. The commute is basically the same as in the morning: watch videos and read. Once I arrive I take over from my wife to take care of our baby while she prepares dinner. Sometimes it's the other way around (I love cooking)
20:00 Dinner: Try to catch up with my wife about her day and mine, and about upcoming plans for the weekend or the baby
20:30 <-> 21:30 Baby Time: I play with my daughter, bathe her, try to calm her down for sleep while my wife cleans up and does some chores (and vice versa)
21:30 <-> 23:00 Finish the chores, continue the discussions with my wife and, if the stars align and it's a full moon, the baby doesn't wake up and I can get some personal time to do real coding, writing and other things that need concentration (in practice, 2 hours a week). Finally, going to sleep before another day.

As a result, this year as a father has been an amazing experience and my daughter is the cutest thing in the world. But I did make some big tradeoffs on my personal routines. My personal projects and blog have been on hold for a while, I used to run marathons but training has been a challenge and I'm getting out of shape. My meditation routine is often canceled within 10 minutes by tears or unforeseen events...
I did look online for "Routines for parents" but the results mainly focus on building a routine for your child... Not on how to manage the parents' routines while parenting

Here are my challenges :

How do you handle stuff like morning/night routines as parents? My wife would gladly do routines too but someone has to take care of the baby. Do you do rotation throughout the week and weekend?
I live in Japan, where the concept of babysitting does not exist. Even if we manage to get the baby handled by someone, it's a one-off thing and hardly becomes a routine
I do read a LOT of content about routines, deep work, and healthy habits, but when I read these authors saying they have a family I can't help but think "Where are your kids while you're doing 4 hours of deep work and going to the gym?" Does this mean that their partner takes the full load of family life while they're doing their things?
Even if someday I do find some time to do personal development or training, it's very unlikely that I can reproduce this miracle the next day and be back to a routine.

Am I missing something? Am I doing something wrong, or should I just get up even earlier to get those things done? Am I aiming too high as the father of a one-year-old baby girl?

Related remarks

You out there having 1, 2, 3 kids or even more, I DEEPLY RESPECT YOU AND LOVE YOU FROM THE BOTTOM OF MY HEART

DEV Community: Shenril