James Heggs

Posted on Oct 10, 2020 • Edited on May 9, 2021

GCP DevOps Certification - Pomodoro Nine

#googlecloud #certification #sre #devops

A pause

I had to pause my learning for a few days. I think personally I just needed to let other things take priority - delivering on Tech Returners and making sure our forthcoming backend lessons are structured well ended up taking over my "tech" priorities.

The remaining time was focused on being a Dad - my current "personal project" is teaching my boy to ride his bike, which is certainly reminding me of my natural fixed mindset. Consciously, day by day, I'm remembering to adopt a growth mindset and pass those thoughts of "yet" down to my son.

But enough about that. Time now for further SRE, and these next few days are about metrics, measuring and SLIs.

Characteristics of a good SLI

The first concept is that Service Level Indicators should have a predictable relationship with the happiness of your application users.

Most ops teams will be monitoring some form of system metrics like load average, CPU utilisation, memory usage etc.

Are they good SLIs?

Probably not - users don't care about processor usage, but they do care if your site/application is responding slowly.

OK, so what about correlations? You might see thread pool usage correlate with unhappy users, so it seems like an SLI over the thread pools could be a good one?

Probably not - a correlation doesn't tell you which way cause and effect runs, so jumping from a system trend to a conclusion about user happiness could result in picking the wrong SLI.

Side note....I've done this....on multiple occasions. Maybe not as specifically for defining an SLO but definitely for when to page people and wake them up out of bed.

Cut to the chase

So the characteristics of a good SLI are:

Has a predictable relationship with user happiness
Shows service is working as users expect it to
Expressed as a ratio of two numbers: good events / valid events (resulting in a value between 0 and 100%)
Aggregated over a long time window
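
The ratio in the third characteristic can be sketched in a few lines of Python. This is only an illustration: the function name `sli_percentage` and the event counts are made up for the example, not taken from the course material.

```python
def sli_percentage(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("an SLI needs at least one valid event")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be between 0 and valid events")
    return 100.0 * good_events / valid_events


# Hypothetical counts: 9,950 good responses out of 10,000 valid requests.
availability = sli_percentage(good_events=9_950, valid_events=10_000)
print(f"{availability:.2f}%")
```

Because good events can never exceed valid events, the result always lands between 0 and 100%.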

The last point, aggregation over a long time window, is visualised really well in the example below.

Notice that whilst both metrics capture a downward trend in user happiness, the top metric suffers from a lot more variance. In fact, at one stage we might see the percentage start to increase and give us a false dawn that we have improved reliability (and in turn happiness), only to find it decrease again shortly after.

Also notice that during "happy times" the bottom example SLI has a narrow range of values - predictable and trending.
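
To get a feel for why the longer window is calmer, here is a rough sketch that computes the same SLI over a short and a long trailing window. The per-minute counts are invented for the example and `windowed_sli` is a hypothetical helper, not something from a monitoring library.

```python
from collections import deque


def windowed_sli(per_minute, window):
    """Return the SLI (%) over a trailing window of `window` minutes.

    `per_minute` is a list of (good_events, valid_events) tuples.
    """
    recent = deque(maxlen=window)  # old minutes drop off automatically
    result = []
    for good, valid in per_minute:
        recent.append((good, valid))
        total_good = sum(g for g, _ in recent)
        total_valid = sum(v for _, v in recent)
        result.append(100.0 * total_good / total_valid if total_valid else None)
    return result


# Made-up traffic: 100 valid requests a minute with a noisy number of good ones.
counts = [(99, 100), (90, 100), (100, 100), (92, 100), (99, 100), (91, 100)]

short = windowed_sli(counts, window=1)
long = windowed_sli(counts, window=3)

print("1 minute window spread:", max(short) - min(short))
print("3 minute window spread:", max(long) - min(long))
```

The longer window smooths out individual bad minutes, which is the trade-off: less noise, but a slower reaction to a real change.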

Ways of Measuring

There are five approaches to measuring your SLI:

Request Logs

This approach allows you to track the reliability of long user journeys, such as a journey that navigates through multiple services. It also gives you the option of backfilling your SLI data if you still have your server side logs.

It will likely need some engineering effort, especially if there is some form of logic for identifying good user journeys (through multiple services).

Another potential drawback is that if it takes time to process the logs in order to find out whether an event was good or bad, then you are risking an increase to your Time to Detect.
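
As a small sketch of the idea, the snippet below counts good and valid events from request logs. The log format (`METHOD PATH STATUS LATENCY_MS`), the `/checkout` path and the 500 ms threshold are all assumptions made for the example; real logs will need their own parsing and their own definition of "good".

```python
# Hypothetical log lines in the form: METHOD PATH STATUS LATENCY_MS
LOG_LINES = [
    "GET /checkout 200 120",
    "GET /checkout 500 30",
    "GET /checkout 200 900",
    "GET /healthz 200 2",
    "POST /checkout 200 250",
]


def sli_from_logs(lines, path="/checkout", max_latency_ms=500):
    """Count (good, valid) events for one path from request log lines."""
    good = valid = 0
    for line in lines:
        method, request_path, status, latency_ms = line.split()
        if request_path != path:
            continue  # not part of the journey being measured
        valid += 1
        # Assumed definition of "good": no server error and fast enough.
        if int(status) < 500 and int(latency_ms) <= max_latency_ms:
            good += 1
    return good, valid


good, valid = sli_from_logs(LOG_LINES)
print(f"{good} good out of {valid} valid events")
```

Because the logs are kept, the same function could be re-run over older files to backfill the SLI.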

Application Metrics

This has the same engineering challenge as logs, in that metrics might not tell the full story of the user journey (without engineering effort). However, they are much easier to implement and you can get started exporting them relatively quickly.

Frontend Infrastructure Metrics

Stats from things like your frontend load balancer provide metrics that are the closest to the user.

Cloud providers might also have historical data that you can utilise to check if your SLI is predictable and aligns with happiness.

Downside is that your load balancer is likely stateless so cannot track a full user journey/session.

Synthetic Clients

Essentially a tool that acts like a user. Crucially it should live outside of your infrastructure, acting and behaving exactly like a user. The theory is that happy synthetic clients === user happiness. (Yes the triple equals was intentional 🤦‍♂️)

The challenge is of course that a synthetic client is only a best guess of the average user. And users (humans) do unexpected things.

After the engineering effort this approach can often devolve into Integration Testing.
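
A very small synthetic client might look like the sketch below. It is only an assumption of how such a probe could be written: `http_status` and `probe` are hypothetical helpers, the URL is a placeholder, and the "good" rule (status below 400 within one second) is invented for the example.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Fetch the URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch_status=http_status, max_latency_s=1.0):
    """Return True if one synthetic request counts as a good event."""
    started = time.monotonic()
    try:
        status = fetch_status(url)
    except Exception:
        return False  # timeouts, DNS failures etc. count as bad events
    elapsed = time.monotonic() - started
    return 200 <= status < 400 and elapsed <= max_latency_s


# Stand-in fetchers so the sketch runs without touching the network.
results = [
    probe("https://example.com/checkout", fetch_status=lambda url: 200),
    probe("https://example.com/checkout", fetch_status=lambda url: 503),
]
good, valid = sum(results), len(results)
print(f"{good} good out of {valid} synthetic requests")
```

Leaving `fetch_status` at its default makes a real HTTP request, and the probe would then be run on a schedule from somewhere outside your own infrastructure.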

Client Side Instrumentation

Provides the most accurate measure of user experience.

Challenges are that it could have an impact on things like the battery life of the user's device, page performance etc.

Relying on these for quick operational response might also be risky due to the reliance on your clients' usage of the application.

Another challenge with this approach is outside noise, such as a bad experience caused by users being out of signal range. To give an example, you might find out that mobile clients suffer poor latency and higher error rates, but because you can't do a whole lot about it, you have to relax your SLO targets to accommodate it instead.
