
James Heggs

GCP DevOps Certification - Pomodoro Six

Well kind of day six...

The eagle-eyed might have spotted that I missed a day.

I could trot out excuses like work got busy, but in reality I just chose to prioritise doing very little in my downtime. In fact I watched The Social Dilemma - super interesting on how hooked on social media some of us find ourselves.

I find that taking these breaks lets me re-approach self-development with renewed interest. It also gives me time to consolidate what I've already learnt.

Well that is how I'll backwards justify it on this occasion anyway!

Edge Cases

Sometimes I think of the world as a consistent pattern, but real life is very different. The impact of an outage is one of those things that doesn't fit a pattern and is affected by the real world.

For example, imagine the impact of an outage during the release of a brand new episode or title on Netflix. It's their busiest time, with everyone scrambling to get their watching fix.

Suddenly you might want your application or site to be even MORE reliable than usual - moving from 3 nines to 4 nines. You might consider implementing change freezes during that time, or over-provisioning beyond what you normally need - notice that this prioritises the reliability SLO over feature development.

The SRE course goes on to explain that it's entirely reasonable to set more than one SLO target to capture the distribution of users, because not all users are equal. For example, you might find that a longer latency target covering three nines of your responses works for most of your requests, but some users might find that too slow.
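To make the idea of more than one target a bit more concrete, here's a minimal sketch in Python. The thresholds, percentages and function names are all made up by me for illustration - they aren't from the course.

```python
from typing import Dict, List, Tuple


def fraction_within(latencies_ms: List[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    fast_enough = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast_enough / len(latencies_ms)


def check_targets(
    latencies_ms: List[float],
    targets: List[Tuple[float, float]],
) -> Dict[Tuple[float, float], bool]:
    """Check each (threshold_ms, required_fraction) target against the data.

    Returns a mapping of target -> whether that target was met.
    """
    return {
        (threshold_ms, required): fraction_within(latencies_ms, threshold_ms) >= required
        for threshold_ms, required in targets
    }


# Hypothetical sample: 90 fast requests, 9 slower ones and 1 very slow one.
sample = [100.0] * 90 + [500.0] * 9 + [3000.0]

# Hypothetical targets: a tight one for the bulk of users and a
# looser one for the long tail.
results = check_targets(sample, [(200.0, 0.95), (1000.0, 0.98)])
```

With this made-up sample the looser target passes but the tighter one fails, which is exactly the kind of thing a single target would hide.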

Right now whilst working through the content I'm trying to battle my inner brain telling me that things fit neatly in a box.

Error Budgets

Basically the inverse of reliability. Imagine the system is failing or proving to be unreliable for users - your error budget tells you how unreliable a service is allowed to be.

(I know it seems odd to read/write/say that)

Taking request status, if your SLO says that 99.9 percent of requests should be successful in a given quarter, your error budget allows 0.1 percent of requests to fail.

Or if we take downtime...

0.1 percent unavailability x
28 days in the four-week window x
24 hours in a day x 60 minutes in an hour =
40.32 minutes of downtime per month.

This is just about enough time for your monitoring systems to surface an issue, and for a human to investigate and fix it.
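The arithmetic above is easy to play with in code. This is just a small sketch of the same sums; the function names are my own, and it assumes the SLO is given as a fraction (0.999 for three nines).

```python
def downtime_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    error_budget = 1.0 - slo  # e.g. 0.999 -> 0.001 (0.1 percent)
    return error_budget * window_days * 24 * 60


def failed_request_budget(slo: float, total_requests: int) -> float:
    """How many requests may fail while still meeting a success-rate SLO."""
    return (1.0 - slo) * total_requests


# Three nines over a four-week window, as in the calculation above.
budget = downtime_budget_minutes(0.999, 28)
print(f"{budget:.2f} minutes")  # prints "40.32 minutes"
```

Try swapping in 0.9999 for four nines and the same window shrinks to roughly four minutes.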

[Image: Unavailability]

Roughly 40 minutes is not actually that much time. If you have a single stretch of unavailability that month, you will likely have burnt through your entire monthly budget.

This is where agreeing the error budget and SLO upfront with all required stakeholders and business leadership becomes important.

The error budget can be thought of as a tool for spending time on the things you want, such as rolling out new features, software experiments etc.

Spending the error budget is actually useful!
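One way I picture "spending" the budget is as a running balance. The sketch below is my own illustration, not something from the course: the function names are hypothetical, and the rule of "only ship risky changes while there is budget left" is an assumed policy that you'd agree with your stakeholders.

```python
from typing import List


def remaining_budget_minutes(budget_minutes: float, incident_minutes: List[float]) -> float:
    """Budget left after subtracting the downtime from each incident."""
    return budget_minutes - sum(incident_minutes)


def can_ship_features(budget_minutes: float, incident_minutes: List[float]) -> bool:
    """Assumed policy: keep releasing only while some budget remains."""
    return remaining_budget_minutes(budget_minutes, incident_minutes) > 0


# Two small incidents leave room to keep experimenting...
print(can_ship_features(40.32, [12.0, 8.0]))  # prints "True"

# ...but one long outage uses up the whole window's budget.
print(can_ship_features(40.32, [45.0]))  # prints "False"
```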

[Image: Error Budget]

In turn we get some useful side effects...

[Image: Side Effects]

Credit to Cheryl Kang of Google for the tips in this blog post. Here's another useful blog from Cheryl.
