<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James Heggs</title>
    <description>The latest articles on DEV Community by James Heggs (@eggsy84).</description>
    <link>https://dev.to/eggsy84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F464367%2F0d310081-a88f-429c-8661-67fc65d194da.png</url>
      <title>DEV Community: James Heggs</title>
      <link>https://dev.to/eggsy84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eggsy84"/>
    <language>en</language>
    <item>
      <title>Technical Leadership Katas - 002 Basing on credibility</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Sat, 11 Sep 2021 08:45:31 +0000</pubDate>
      <link>https://dev.to/eggsy84/technical-leadership-katas-002-basing-on-credibility-1l1h</link>
      <guid>https://dev.to/eggsy84/technical-leadership-katas-002-basing-on-credibility-1l1h</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp2houuyscidy8kj6u2o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp2houuyscidy8kj6u2o.jpeg" alt="Laptop and notebook to indicate someone learning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For many of us, our tech roles feel vocational. We have that moment where we realise that we can &lt;em&gt;actually get paid&lt;/em&gt; for doing the things we enjoy, such as coding, tinkering with servers or designing things.&lt;/p&gt;

&lt;p&gt;So when we consider management, it is often viewed as a fork in the road in our careers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Route One - do we become more technical, exploring routes like Architect, "Senior" {insert role here} or Principal Consultant - those kinds of things. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Route Two - do we become managers, scaling that hierarchy, leading people and managing their mindset?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is that we tend to apply reductive decision making when choosing which route is right for us. As a result, route one and route two become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Route One - "I can continue learning and get better at {insert skill here}"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Route Two - "I don't want to manage people's holidays"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kata is here to debunk that myth and provide an exercise around how you can &lt;strong&gt;base your technical leadership on technical credibility&lt;/strong&gt; - tips to continue your technical development whilst acknowledging that you'll have further human responsibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Past experience and how it played out
&lt;/h2&gt;

&lt;p&gt;Around 12 years ago, in an interview, I was asked where I saw my future going and whether I wanted to lead people. I answered with a definite NO - I just wanted to code.&lt;/p&gt;

&lt;p&gt;Fast forward and things look a little different now. Much of my work involves management in some form, whether that be decision making or developing our existing engineers. &lt;/p&gt;

&lt;p&gt;But...I always wanted to remain credible. I didn't (and still don't) want to lose those technical skills. Being an engineer, whether in software development or more recently platform engineering, was part of my identity. &lt;/p&gt;

&lt;p&gt;For a period of time I noticed my engineering work declining. I also felt those pangs of not knowing certain aspects - I remember one specific time when I'd not refreshed my Java skills: one of our team had implemented a function using &lt;a href="https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html" rel="noopener noreferrer"&gt;Lambda Expressions&lt;/a&gt; and I hadn't encountered that type of syntax before. &lt;/p&gt;

&lt;p&gt;Don't confuse that feeling with ego. That feeling is one of "How can I lead people if I don't have credibility?"&lt;/p&gt;

&lt;p&gt;At that point I put some actions into place to ensure I had a level of knowledge that I was comfortable with around the tools, languages and practices being utilised.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kata instructions
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Step 1
&lt;/h4&gt;

&lt;p&gt;Admittedly, your time will be more restricted than it used to be. Create a table highlighting areas where you want to be &lt;strong&gt;Aware&lt;/strong&gt; and areas where you want to be &lt;strong&gt;Efficient&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This will help you highlight where time needs to be spent. Efficiency takes longer to achieve than awareness.&lt;/p&gt;

&lt;p&gt;This will be a working document and will change over time. Keeping it in markdown means you will be able to add links to items. &lt;/p&gt;

&lt;p&gt;Here's an example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aware&lt;/th&gt;
&lt;th&gt;Efficient&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.pulumi.com/" rel="noopener noreferrer"&gt;Pulumi&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.infoworld.com/article/3569150/jdk-16-the-new-features-in-java-16.html" rel="noopener noreferrer"&gt;Java 16 language features&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design Systems&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
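
&lt;p&gt;Since the suggestion is to keep this as a markdown file, here is a sketch of the same table as markdown source (the topics and links are just the example ones above):&lt;/p&gt;

```markdown
| Aware                             | Efficient                  |
| --------------------------------- | -------------------------- |
| [Pulumi](https://www.pulumi.com/) | [Java 16 language features](https://www.infoworld.com/article/3569150/jdk-16-the-new-features-in-java-16.html) |
| Design Systems                    |                            |
```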

&lt;h4&gt;
  
  
  Step 2
&lt;/h4&gt;

&lt;p&gt;Moving into leadership means becoming comfortable with your calendar no longer being your own. People will need your support, and that means holding slots in which to provide that support in whatever form it takes. &lt;/p&gt;

&lt;p&gt;Counteract that by putting focus time in your calendar. Some examples I've seen work are 1 hour before things get "busy", 1 hour after lunchtime, or 1 hour at the end of the day. You'll know what time is best for you (your mindset) and what is realistic for your working environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMPORTANT&lt;/strong&gt; Make sure to highlight your focus time as &lt;strong&gt;BUSY&lt;/strong&gt; in your calendar software.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3
&lt;/h4&gt;

&lt;p&gt;Pick a project. Tapping into the "Purpose" aspect of Dan Pink's great book &lt;a href="https://www.danpink.com/books/drive/" rel="noopener noreferrer"&gt;Drive&lt;/a&gt;, we all need a purpose to our work. If you set out to learn the new Java 16 language features without giving yourself a purpose, you'd likely have a peak of motivation that would then wane.&lt;/p&gt;

&lt;p&gt;In step 3, identify a project that you could code, research or create that will improve your life, and align that project with your "Efficient" topics. &lt;/p&gt;

&lt;p&gt;Some candidates for this might be automating things that take up your (or the team's) time, or researching and implementing new processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REALLY IMPORTANT&lt;/strong&gt; Do NOT pick a project that should be implemented by one of the teams. Everyone hates a leader who disappears into their coding hole and then emerges with a brand new tool or process that they suddenly expect people to use.&lt;/p&gt;

&lt;p&gt;If anyone wants advice on this one please do drop in the comments!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4
&lt;/h4&gt;

&lt;p&gt;Earlier steps highlighted that time is now at a premium, so you need a few cheats when it comes to staying up to date on technical practices and tools. &lt;/p&gt;

&lt;p&gt;Newsletters can be one of those cheats.&lt;/p&gt;

&lt;p&gt;Using your aware/efficient table, sign up for a &lt;a href="https://github.com/zudochkin/awesome-newsletters" rel="noopener noreferrer"&gt;few newsletters&lt;/a&gt; from this list.&lt;/p&gt;

&lt;p&gt;I particularly like &lt;a href="https://www.devopsweekly.com/" rel="noopener noreferrer"&gt;DevOps Weekly&lt;/a&gt; and my favourite cheat is the &lt;a href="https://www.thoughtworks.com/radar" rel="noopener noreferrer"&gt;ThoughtWorks Tech Radar&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5
&lt;/h4&gt;

&lt;p&gt;Identify which calendar slots you'll dedicate to working on your project and which ones you'll use to update yourself via your newsletters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submission process
&lt;/h2&gt;

&lt;p&gt;The following write-ups should be recorded in your GitHub repo within the corresponding kata directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Your Aware/Efficient markdown file
&lt;/h3&gt;

&lt;p&gt;Create a file called PERSONALDEVELOPMENTFOCUS.md containing your aware/efficient table.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Your project
&lt;/h3&gt;

&lt;p&gt;Create a markdown file that identifies and explains the project you'll be working on.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Your focus time
&lt;/h3&gt;

&lt;p&gt;Create a markdown file called FOCUSTIME.md that outlines your calendar plan and how that time will be spent. Apply the information from this file to your personal calendar.&lt;/p&gt;

</description>
      <category>leadership</category>
      <category>career</category>
      <category>tutorial</category>
      <category>techlead</category>
    </item>
    <item>
      <title>Technical Leadership Katas - 001 Your situational starting point</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Tue, 31 Aug 2021 21:59:42 +0000</pubDate>
      <link>https://dev.to/eggsy84/tech-lead-katas-001-your-situational-starting-point-171b</link>
      <guid>https://dev.to/eggsy84/tech-lead-katas-001-your-situational-starting-point-171b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=pykuvuA-QFU"&gt;Situational leadership&lt;/a&gt; is a great tool for leading technical teams. Utilising situational leadership, you can transform an individual's progress and development.&lt;/p&gt;

&lt;p&gt;To apply situational leadership, I think you have to first start with yourself: understanding your own take on the leadership approaches and where you naturally start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Past experience and how it played out
&lt;/h2&gt;

&lt;p&gt;In the past I was unaware of my natural starting point and the impact it had on my colleagues.&lt;/p&gt;

&lt;p&gt;One of my teams was undertaking a new project to implement &lt;a href="https://en.wikipedia.org/wiki/Single_sign-on"&gt;single sign on&lt;/a&gt; across our web application stack.&lt;/p&gt;

&lt;p&gt;It was the first project for a brand new tech lead who had joined our team. Not only was it their first project, it was also their first tech lead role EVER!&lt;/p&gt;

&lt;p&gt;My natural starting point, although I was unaware at the time, is that of &lt;strong&gt;supporting&lt;/strong&gt;. I tend to go straight to "Here's a project, I'm here if you need me".&lt;/p&gt;

&lt;p&gt;It took me a number of weeks to realise that for the given task, in the given context, I'd started at the wrong point whilst guiding the team.&lt;/p&gt;

&lt;p&gt;Thankfully the tech lead in question had enough relationship confidence to bring this to my attention. I think the words they used were something like: "Please can you just tell me exactly how to go about this". Seeing their fear of failure, and realising they'd never faced a task this large or cross-cutting, I realised that I'd started at the wrong point for the tech lead (and subsequently their team).&lt;/p&gt;

&lt;p&gt;If I'd been aware of my own starting point and then responded to the situation (ironically, as the framework points out), I could have saved the technical lead a few weeks of anguish, fearing they would be unable to get the task done and internally questioning their skillset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kata instructions
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Step 1
&lt;/h4&gt;

&lt;p&gt;Gather an understanding of what situational leadership is by watching &lt;a href="https://www.youtube.com/watch?v=pykuvuA-QFU"&gt;this video&lt;/a&gt;. If you've already come across it then jump on to step 2.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2
&lt;/h4&gt;

&lt;p&gt;Identify which quadrant you naturally fall into. At around the 2m20s mark, the video outlines the 4 different leadership styles. Review those styles and identify your most common "go to" approach. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Really important&lt;/strong&gt; Do not think of any one approach as bad or good. Each of them has its own place, and actually you'll want to leverage all of them depending on the situation. For this step, just focus on identifying what &lt;strong&gt;YOUR&lt;/strong&gt; most common approach is. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TIP&lt;/strong&gt; If you find it hard to identify it yourself then ask a friend or colleague. Often people will share insight with you that you hadn't considered.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3
&lt;/h4&gt;

&lt;p&gt;Reflect on a time you have provided guidance on a task. Label it as either "Directing", "Coaching", "Supporting" or "Delegating".&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;/p&gt;

&lt;p&gt;"In a recent discussion with an &lt;strong&gt;{role}&lt;/strong&gt; on our team, they wanted a discussion about &lt;strong&gt;{task}&lt;/strong&gt;. The supporting detail around the task was &lt;strong&gt;{task context}&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I approached the task in a &lt;strong&gt;{leadership style}&lt;/strong&gt; style."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt; In this step, make sure only to record the approach taken. Label it, but do not judge it just yet.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4
&lt;/h4&gt;

&lt;p&gt;You can't change the past but you can reflect on it.&lt;/p&gt;

&lt;p&gt;Reflect on your notes from step 3. &lt;/p&gt;

&lt;p&gt;Did you take what was the best situational approach? &lt;/p&gt;

&lt;p&gt;Did you just go with your natural situational starting point and if so was that correct or would you change anything now?&lt;/p&gt;

&lt;p&gt;If you had taken a different approach, what might the outcome have been - positive or negative?&lt;/p&gt;

&lt;h2&gt;
  
  
  Submission process
&lt;/h2&gt;

&lt;p&gt;The following write-ups should be recorded in your GitHub repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identify your natural situational starting point
&lt;/h3&gt;

&lt;p&gt;Write up detail around your natural starting point. Why do you start there? What past experiences have had an impact on you and led you to that starting point?&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Share the story
&lt;/h3&gt;

&lt;p&gt;Write up notes from your own personal story relating to step 3 above.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Iterate and improve
&lt;/h3&gt;

&lt;p&gt;Write up the review notes relating to step 4 above. Reflect as much as possible on different situational approaches. &lt;/p&gt;

</description>
      <category>leadership</category>
      <category>career</category>
      <category>techlead</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Technical Leadership Katas</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Tue, 31 Aug 2021 21:59:24 +0000</pubDate>
      <link>https://dev.to/eggsy84/tech-lead-katas-99l</link>
      <guid>https://dev.to/eggsy84/tech-lead-katas-99l</guid>
      <description>&lt;h2&gt;
  
  
  Soft skills - You're joking right?
&lt;/h2&gt;

&lt;p&gt;Technical leadership is hard. In fact let's put it out there, working with people in any capacity is hard. &lt;/p&gt;

&lt;p&gt;People have feelings, emotions, outside influences, past experiences, current experiences, families, friends, lost families, lost friends and quite literally everything in between. Not to mention the whole "social media" thing and how that impacts us.&lt;/p&gt;

&lt;p&gt;Given that knowledge, pushing yourself into any position where you are responsible for those people is a very vulnerable and scary place to be. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When I say vulnerable I totally mean the relation that &lt;a href="https://www.ted.com/talks/brene_brown_the_power_of_vulnerability?language=en"&gt;vulnerability has to courage&lt;/a&gt; and Brené Brown's amazing work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those types of skills - the ones needed to lead people - are described in many different ways. Whether you agree with the term 'soft skills' or not (aside: some people don't disagree with the term but feel our cultural interpretation of being 'soft' is wrong), one thing I hope we can all agree on is that acquiring leadership skills is really tough. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interestingly, whilst writing this I thought I'd try to find out where the phrase &lt;a href="https://code.joejag.com/2018/the-origin-of-soft-skills.html"&gt;'soft skills' originated&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Managing (yep I said the M word) people with all those different emotions is really tough. &lt;/p&gt;

&lt;p&gt;The type of immediate feedback we get from a failed unit test or failed deployment carries little to zero emotion (ok maybe not the deployment one 🙈). It's cold, hard and objective. The type of immediate feedback we get from an idea, approach, bad news, good news tends to be very different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kata exercises
&lt;/h2&gt;

&lt;p&gt;Why not approach managerial development the same way we might approach learning or perfecting a language: through a series of katas?&lt;/p&gt;

&lt;p&gt;Over this "Tech Lead Kata" series, I'm going to share a series of Kata exercises for growing yourself as a technical leader.&lt;/p&gt;

&lt;p&gt;Based on past managerial experiences I'll share some of the things that have got me through. I'll also share my stuff-ups - the things I got wrong - so you can avoid being as stupid and foolish as me.&lt;/p&gt;

&lt;p&gt;My aim is to share a kata a week - let's get to 10 exercises and see how we get on. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother listening to me?
&lt;/h2&gt;

&lt;p&gt;Fair question! Your initial instinct will quite rightly be skeptical, and anything persuasive I write will have zero impact on your choice. &lt;/p&gt;

&lt;p&gt;So I'll be all political and ask a question back: what's the alternative? Compare the kata exercises to the alternatives. If they look good, go for them; if your alternative personal development approach looks like it'll yield better results, go for that one. &lt;/p&gt;

&lt;p&gt;For the purposes of your comparison, &lt;a href="https://www.linkedin.com/in/jheggs/"&gt;here is my LinkedIn&lt;/a&gt; if you understandably want to make some judgements 🤣&lt;/p&gt;

&lt;p&gt;And you've also got the comments section - please please 🙏 share your thoughts in there to help others build up a picture on whether they should bother investing their time on the katas.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to start?
&lt;/h2&gt;

&lt;p&gt;Here's a suggested way of getting involved:&lt;/p&gt;

&lt;p&gt;Spin up a repository under your own account called "tech-lead-katas".&lt;/p&gt;

&lt;p&gt;Within that repository, create a directory for each kata with a README.md file inside it. Something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tech-lead-katas
    ├── 001-natural-situational-starting-point
    │   └── README.md
    └── 002-some-other-kata
        └── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those that are really brave, I'd encourage learning in the open - share the links to your kata submissions as comments on the posts! Or even start discussions around what you thought about the Kata - we'll all learn from each other.&lt;/p&gt;

&lt;p&gt;Learning in the open will increase your connection and accountability to what you are working through, much in the way that tools like &lt;a href="https://www.100daysofcode.com/"&gt;#100daysOfCode&lt;/a&gt; or &lt;a href="https://www.100daysofcloud.com/"&gt;#100DaysOfCloud&lt;/a&gt; do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kata 1 - Your natural situational starting point
&lt;/h2&gt;

&lt;p&gt;Got this far and still reading??! Phew! &lt;/p&gt;

&lt;p&gt;Here's the first Kata - &lt;a href="https://dev.to/eggsy84/tech-lead-katas-001-your-situational-starting-point-171b"&gt;Understanding your natural situational starting point&lt;/a&gt;&lt;/p&gt;

</description>
      <category>leadership</category>
      <category>tutorial</category>
      <category>techlead</category>
      <category>career</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Twelve</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Sat, 15 May 2021 13:14:08 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-pomodoro-twelve-4nhn</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-pomodoro-twelve-4nhn</guid>
      <description>&lt;h1&gt;
  
  
  System Complexity
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://www.coursera.org/learn/site-reliability-engineering-slos/home/welcome"&gt;Coursera SRE programme&lt;/a&gt; shifts on to discussing system complexity and the introduction of the initial Service Level Indicators. Google recommends that 1 to a maximum of 3 SLIs per user journey should be enough.&lt;/p&gt;

&lt;p&gt;Thoughts behind limiting the number to a maximum of 3 SLIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all metrics make good SLIs&lt;/li&gt;
&lt;li&gt;Each SLI increases the cognitive load on the operations team&lt;/li&gt;
&lt;li&gt;More SLIs lower the signal-to-noise ratio (which can impact resolution time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may also have lots of user journeys through your complex system, still resulting in many SLIs - however, each journey should be assessed as to whether or not it is "important enough" to be tracked by an SLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  An important caveat
&lt;/h3&gt;

&lt;p&gt;Other metrics you might already be recording still have value. The above recommendation isn't one that should be used to ditch your existing metrics.&lt;/p&gt;

&lt;p&gt;A deterioration in SLIs is an indicator that something is wrong; once that deterioration is bad enough to provoke an operational response, the other monitoring systems will really help in ascertaining a cause.&lt;/p&gt;

&lt;h1&gt;
  
  
  Managing complexity with aggregation
&lt;/h1&gt;

&lt;p&gt;You might have multiple user journeys.&lt;/p&gt;

&lt;p&gt;Take, for example, an online store. People can view a home page that lists products. They can search for products, browse products by category and see individual product details.&lt;/p&gt;

&lt;p&gt;Each of those could be a separate user journey and result in multiple SLIs. However, if you aggregated what you collect (in terms of SLIs) from each journey, the result could be an overall "Browse" SLI.&lt;/p&gt;

&lt;p&gt;The Google course provides this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MeJSBgyf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7breejm1dur0uem89rtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MeJSBgyf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7breejm1dur0uem89rtc.png" alt="Google SLI aggregation" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If all of them have availability and latency SLIs, then those could be aggregated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Another important caveat
&lt;/h3&gt;

&lt;p&gt;Summing events together can work well for similar user journeys. However, it might not fit a scenario where there is a large disparity between the rates of the user journeys, such as request rates differing significantly. &lt;/p&gt;

&lt;p&gt;IE. the number of &lt;strong&gt;valid&lt;/strong&gt; events (thinking back to previous pomodoros) for a small &lt;em&gt;but&lt;/em&gt; significant user journey could get lost in the noise of a higher-rate user journey.&lt;/p&gt;

&lt;p&gt;If you face that, then multiplying the SLIs by a weight based upon their portion of the whole might be an option for normalising data across an aggregation.&lt;/p&gt;
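
&lt;p&gt;As a minimal sketch of that weighting idea (my own illustrative journey names and numbers, not from the course):&lt;/p&gt;

```python
# Sketch: weighting per-journey SLIs by an explicit weight, so a low-volume
# but significant journey isn't lost in the noise of a high-rate one.
# Journey names and event counts are illustrative.

journeys = {
    # name: (good_events, valid_events)
    "browse":   (995_000, 1_000_000),
    "checkout": (900, 1_000),  # low traffic, but business critical
}

def weighted_sli(journeys, weights):
    """Combine per-journey SLIs using weights that sum to 1."""
    total = 0.0
    for name, (good, valid) in journeys.items():
        total += weights[name] * (good / valid)
    return total

# Equal weighting treats checkout as important as browse,
# even though browse has 1000x the request rate.
weights = {"browse": 0.5, "checkout": 0.5}
print(round(weighted_sli(journeys, weights), 4))  # prints 0.9475
```

A plain sum of events would give roughly the browse SLI alone; the weights are the knob for how much each journey should count.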

</description>
      <category>googlecloud</category>
      <category>devops</category>
      <category>sre</category>
      <category>certification</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Eleven</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Sun, 09 May 2021 16:15:45 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-eleven-4njh</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-eleven-4njh</guid>
      <description>&lt;h1&gt;
  
  
  Data processing SLIs
&lt;/h1&gt;

&lt;p&gt;There is a high likelihood that you'll be working with a platform that uses user-provided or user-generated information to provide a service.&lt;/p&gt;

&lt;p&gt;Google recommends four different types of SLIs for that use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Freshness&lt;/li&gt;
&lt;li&gt;Correctness&lt;/li&gt;
&lt;li&gt;Coverage&lt;/li&gt;
&lt;li&gt;Throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's review each of them. Much along the lines of previous blogs, Google uses the term "valid", this time in regard to the data, and "proportions" to express things as percentages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Freshness
&lt;/h2&gt;

&lt;p&gt;The data "freshness" can be considered as the proportion of valid data updated more recently than a give threshold. &lt;/p&gt;

&lt;p&gt;Given that definition, an implementation requires making two choices: which of the data this system processes is &lt;strong&gt;valid&lt;/strong&gt; for the SLI, and when the timer for measuring the freshness of the data starts and stops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0klxy1y45naa2fh7is0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0klxy1y45naa2fh7is0q.png" alt="Image showing freshness calculation"&gt;&lt;/a&gt;&lt;/p&gt;
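
&lt;p&gt;A minimal sketch of that calculation in Python (the records and threshold are illustrative, not from the course):&lt;/p&gt;

```python
# Freshness SLI sketch: the proportion of valid records updated more
# recently than a given threshold. Timestamps here are illustrative.
from datetime import datetime, timedelta

now = datetime(2021, 5, 9, 12, 0, 0)
threshold = timedelta(minutes=10)

# Last-update times for the records we decided are "valid" for this SLI.
last_updated = [
    now - timedelta(minutes=2),
    now - timedelta(minutes=8),
    now - timedelta(minutes=25),  # stale: older than the threshold
    now - timedelta(minutes=5),
]

stale = sum(1 for t in last_updated if now - t > threshold)
freshness_sli = 1 - stale / len(last_updated)
print(f"{freshness_sli:.0%}")  # prints 75% - 3 of 4 valid records are fresh
```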

&lt;h2&gt;
  
  
  Correctness
&lt;/h2&gt;

&lt;p&gt;When speaking of data correctness, it's important to note that users often have independent ways of checking the validity of data from your systems. As a result, it's important to consider an SLI for data correctness to maintain trust with your users.&lt;/p&gt;

&lt;p&gt;The data "correctness" can be considered as the proportion of valid data producing correct output. &lt;/p&gt;

&lt;p&gt;Given that definition, an implementation requires making two choices: which of the data this system processes is &lt;strong&gt;valid&lt;/strong&gt; for the SLI, and how to determine whether the data is "correct".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq3s2gn532hdyss7hesb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flq3s2gn532hdyss7hesb.png" alt="Correctness of data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Coverage
&lt;/h2&gt;

&lt;p&gt;A coverage SLI is useful for scenarios where users expect the data to be made available to them by a certain point.&lt;/p&gt;

&lt;p&gt;Data "coverage" can be considered as the proportion of valid data processed successfully. &lt;/p&gt;

&lt;p&gt;Given that definition, an implementation requires making two choices: which of the data this system processes is &lt;strong&gt;valid&lt;/strong&gt;, and whether a piece of data was processed successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66suuvj3b6gtw1p2pu0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66suuvj3b6gtw1p2pu0k.png" alt="Coverage SLI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Throughput
&lt;/h2&gt;

&lt;p&gt;Throughput SLIs for data are useful in scenarios where a latency SLI might not be right. EG. if throughput varies a lot (peak times versus quiet times) then a data throughput SLI might be applicable.&lt;/p&gt;

&lt;p&gt;Data "throughput" can be considered as the proportion of time where the data processing rate is faster than a threshold.&lt;/p&gt;

&lt;p&gt;For this SLI to work you have to turn an event into a portion of time. IE. how long did that one event take to process - such as bytes per second. &lt;/p&gt;

&lt;p&gt;Any metric that scales at the same rate as the cost of processing should work for tracking the SLI. EG. a big data file would need more time or more processing power to process.&lt;/p&gt;
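
&lt;p&gt;Sketching that as a proportion of time windows (the rates and threshold below are my own illustrative numbers, not the course's):&lt;/p&gt;

```python
# Throughput SLI sketch: the proportion of time windows in which the
# processing rate was faster than a threshold. Numbers are illustrative.

threshold_bytes_per_sec = 1_000_000

# Average processing rate (bytes per second) seen in each one-minute window.
rates = [1_200_000, 950_000, 1_500_000, 1_100_000, 800_000]

good_windows = sum(1 for r in rates if r > threshold_bytes_per_sec)
throughput_sli = good_windows / len(rates)
print(f"{throughput_sli:.0%}")  # prints 60% - 3 of 5 windows beat the threshold
```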

</description>
      <category>googlecloud</category>
      <category>devops</category>
      <category>sre</category>
      <category>certification</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Ten</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Mon, 08 Feb 2021 18:33:36 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-ten-5d0g</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-ten-5d0g</guid>
      <description>&lt;h1&gt;
  
  
  Formalising the SLI definition
&lt;/h1&gt;

&lt;p&gt;In a previous post I shared the learning that an SLI should be expressed as a ratio between two numbers: that of &lt;strong&gt;good events&lt;/strong&gt; over &lt;strong&gt;valid events&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ezUosRjq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z7kym0u6lu2kx7bf068w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ezUosRjq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z7kym0u6lu2kx7bf068w.png" alt="Alt Text" width="363" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Working that way ensures that SLIs fall as a percentage between 0% and 100%.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0% - nothing is working&lt;/li&gt;
&lt;li&gt;100% - nothing is broken&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means it is intuitive and directly translates to SLO targets and the concept of error budgets.&lt;/p&gt;

&lt;p&gt;Also, because of the consistent percentage format, building tooling to track your SLIs becomes easier. Alerting, SLO reporting etc. can all be written to expect that same structure: good events, valid events and your SLO threshold(s).&lt;/p&gt;

&lt;h1&gt;
  
  
  Valid Events
&lt;/h1&gt;

&lt;p&gt;It might be tempting to consider ALL events. However, the phrasing of &lt;strong&gt;valid&lt;/strong&gt; is important, as it allows for explicit declarations of events that should not be considered.&lt;/p&gt;

&lt;p&gt;E.g. bots may access your site, dragging down the measured performance of requests. As you learn about your SLI you can choose to exclude those from valid events. Another example: you might have hundreds of possible HTTPS API calls but narrow the scope of SLI monitoring to specific request paths, so the &lt;strong&gt;valid&lt;/strong&gt; requests are the ones within that scope.&lt;/p&gt;
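&lt;p&gt;A small illustration of that filtering (my own sketch; the request fields and paths are made up):&lt;/p&gt;

```python
# Hypothetical request records: exclude bot traffic and restrict the SLI
# scope to specific request paths before counting good/valid events.
requests = [
    {"agent": "browser", "path": "/checkout", "ok": True},
    {"agent": "bot",     "path": "/checkout", "ok": False},
    {"agent": "browser", "path": "/health",   "ok": True},
    {"agent": "browser", "path": "/checkout", "ok": False},
]

MONITORED_PATHS = {"/checkout"}  # the in-scope paths for this SLI

def is_valid(req):
    """Valid events: not bot traffic, and within the monitored scope."""
    return req["agent"] != "bot" and req["path"] in MONITORED_PATHS

valid_events = [r for r in requests if is_valid(r)]
good_events = [r for r in valid_events if r["ok"]]
sli = 100.0 * len(good_events) / len(valid_events)
```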

&lt;h1&gt;
  
  
  Working example for request/response interaction
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;To use an SLI for availability, there are two choices to make:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which requests are to be tracked as &lt;strong&gt;valid&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;What makes a response &lt;strong&gt;successful&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the terms already covered, availability can be expressed as the proportion of &lt;strong&gt;valid&lt;/strong&gt; requests served &lt;strong&gt;successfully&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You might need to write complex logic to determine how available a system is, such as checking whether a user completed a full user journey while discounting sessions where they voluntarily exited the process.&lt;/p&gt;

&lt;p&gt;For example an e-commerce application might have a journey of:&lt;/p&gt;

&lt;p&gt;Search =&amp;gt; View =&amp;gt; Add to basket =&amp;gt; Checkout =&amp;gt; Purchase =&amp;gt; Confirmation&lt;/p&gt;

&lt;p&gt;However, people can "drop out" at any stage (irrespective of how available the system is), so the SLI should only consider completed user journeys.&lt;/p&gt;
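&lt;p&gt;One way to sketch that journey logic (my own illustration; the session data and helper names are invented):&lt;/p&gt;

```python
# Classify each session: voluntary drop-outs are excluded from valid events,
# completed journeys are good, and system errors are valid but not good.
JOURNEY_STEPS = 6  # Search, View, Add to basket, Checkout, Purchase, Confirmation

def journey_outcome(steps_reached, system_error):
    if system_error:
        return False   # valid event, not good: the system failed the user
    if steps_reached == JOURNEY_STEPS:
        return True    # valid and good: full journey completed
    return None        # voluntary drop-out: not a valid event

sessions = [(6, False), (3, False), (6, False), (4, True)]
outcomes = [journey_outcome(steps, err) for steps, err in sessions]
valid = [o for o in outcomes if o is not None]
availability_sli = 100.0 * sum(valid) / len(valid)
```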

&lt;h2&gt;
  
  
  Latency
&lt;/h2&gt;

&lt;p&gt;For a web application, much like availability we can define it as the proportion of &lt;strong&gt;valid&lt;/strong&gt; requests served &lt;strong&gt;faster&lt;/strong&gt; than the threshold.&lt;/p&gt;

&lt;p&gt;So yet again there are two choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which requests are to be tracked as &lt;strong&gt;valid&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;When the timer should start and stop for those valid requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setting a threshold for &lt;em&gt;fast enough&lt;/em&gt; is dependent on how accurately measured latency translates to the user experience and is more closely related to the SLO target. For example you can engineer a system to give a perception of speed with techniques like pre-fetching or caching.&lt;/p&gt;

&lt;p&gt;Commonly you might set a target that 95% of all requests respond faster than the threshold. However, people will often still be happy at a lower percentage, and latency distributions are generally long-tailed: a small percentage of individuals will get a very slow experience. So it can be worth setting targets that cover 75% to 90% of requests.&lt;/p&gt;
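&lt;p&gt;A quick sketch of a latency SLI against a threshold (again my own, with made-up numbers):&lt;/p&gt;

```python
# Proportion of valid requests served faster than the latency threshold.
latencies_ms = [120, 180, 90, 450, 210, 95, 160, 700, 130, 110]
THRESHOLD_MS = 300  # what counts as "fast enough" for this SLI

slow = sum(1 for t in latencies_ms if t > THRESHOLD_MS)
latency_sli = 100.0 * (len(latencies_ms) - slow) / len(latencies_ms)

SLO_TARGET_PCT = 90.0  # e.g. "90% of valid requests respond under 300ms"
meets_slo = latency_sli >= SLO_TARGET_PCT
```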

&lt;p&gt;Latency isn't just request/response. There are scenarios, such as data pipeline processing, where latency also comes into play.&lt;/p&gt;

&lt;p&gt;E.g. if you have a batch processing pipeline that executes daily, then it should not take more than 24 hours to complete.&lt;/p&gt;

&lt;p&gt;One note on tracking job latency concerns when alerts are triggered. If you only report after a batch job has completed and missed its latency target, the window between the threshold and job completion becomes a problem.&lt;/p&gt;

&lt;p&gt;Let us assume a threshold of 60 minutes for a batch job, but your job takes 90 minutes and triggers the SLO alert. That leaves a 30-minute window during which we were operationally unaware that something had broken the SLO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality
&lt;/h2&gt;

&lt;p&gt;Back to our percentages: quality can be expressed by understanding two values, giving the proportion of &lt;strong&gt;valid&lt;/strong&gt; requests served without &lt;strong&gt;degraded quality&lt;/strong&gt;. This leaves our choices as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which requests are to be tracked as &lt;strong&gt;valid&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;How to determine whether the response was served with &lt;strong&gt;degraded quality&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with latency, it can be worthwhile to set SLO targets across a spectrum, because quality interacts with any availability SLO target.&lt;/p&gt;

&lt;p&gt;The programme I am studying gives the example of a web application that fans out requests across 10 backend servers, each of which has a 99.9% availability SLO and the ability to reject requests when overloaded.&lt;/p&gt;

&lt;p&gt;So you might say that 99% of service responses have no missing backend responses and, further, that 99.9% have no more than one missing backend response. Illustrated below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AIRd6asR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/i8uvlf2a31q6bsdpa07e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AIRd6asR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/i8uvlf2a31q6bsdpa07e.png" alt="Quality example" width="365" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>certification</category>
      <category>googlecloud</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Creativity in the Ops</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Mon, 12 Oct 2020 21:46:05 +0000</pubDate>
      <link>https://dev.to/eggsy84/creativity-in-the-ops-15ii</link>
      <guid>https://dev.to/eggsy84/creativity-in-the-ops-15ii</guid>
      <description>&lt;h1&gt;
  
  
  Why a career in cloud engineering?
&lt;/h1&gt;

&lt;p&gt;I recently asked the &lt;a href="https://www.linkedin.com/posts/jheggs_kubernetes-devopsculture-devops-activity-6709714457393688577--42K"&gt;following question to my LinkedIn network&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Platform engineers / SRE folk / Ops people&lt;/p&gt;

&lt;p&gt;What is it that brings us into those careers/interests? What gets us out of bed each day?&lt;/p&gt;

&lt;p&gt;For me it's the diversity of tech alongside being on the operational frontline&lt;/p&gt;

&lt;p&gt;Or maybe it's just because of #kubernetes right?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Some of the community thoughts
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Personally, the ability to provide a foundation for other people to create amazing things always pushes me to get out of bed in the morning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;....there is a lot of team work involved and adapting and overcoming any obstacles that are faced.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On the platform side, I'm all about elegance, and security.  Building and maintaining the infrastructure that makes thing hyper-secure, enables hyper-scale, and is simple to understand, yet elegant in its design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The drive ultimately became the metrics; measuring and optimising the KPIs, tweaking and tweaking until the graphs represented perfection.&lt;/p&gt;

&lt;p&gt;Now it’s just the people.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Why share this?
&lt;/h1&gt;

&lt;p&gt;Those answers show the passion that people have for DevOps. Each of those leaders is really passionate about their craft and will encourage even more people with diverse backgrounds into DevOps based environments. And wow do we need it....&lt;/p&gt;

&lt;p&gt;In the latest &lt;a href="https://cloud.google.com/devops/state-of-devops"&gt;DORA State of DevOps report&lt;/a&gt; only 10% of individuals identified as female (of over 1000 surveyed).&lt;/p&gt;

&lt;p&gt;Compare that to last year, when it was 12%. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NtdUMJ7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0uo43xifdsk1ec2ywaa0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NtdUMJ7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0uo43xifdsk1ec2ywaa0.png" alt="DORA Report Gender" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at Women in Tech as a whole, it currently sits around 18% as an industry wide average.&lt;/p&gt;

&lt;p&gt;We quite clearly have an even bigger gender diversity issue within DevOps.&lt;/p&gt;

&lt;p&gt;It doesn't stop there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eqU95NS2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ek5e7p55m7p3whbg6vmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eqU95NS2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ek5e7p55m7p3whbg6vmg.png" alt="DORA disability diversity" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What can we do
&lt;/h1&gt;

&lt;p&gt;Over the past few years of working with our &lt;a href="https://techreturners.com"&gt;Tech Returners&lt;/a&gt; I have always asked what brings people back to their technical careers, or what brought them to tech in the first place.&lt;/p&gt;

&lt;p&gt;The majority of answers fall under two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creativity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working in a team to create things (or solve problems) for people&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a DevOps community, I believe that we can do more to encourage people to return or enter platform/cloud/SRE based roles. &lt;/p&gt;

&lt;p&gt;We often focus on the complexities of architectures or massage our own egos around our operational skills/war stories. We've all been to a meetup where the speaker asks for any questions, only to receive some rhetoric about how they should have done something differently.&lt;/p&gt;

&lt;p&gt;If we focus on the creativity of ops roles paired with those human connections I believe we can build a more inclusive DevOps community. &lt;/p&gt;

&lt;p&gt;In turn we'll be so much more productive because of being able to leverage all of that diverse thinking!&lt;/p&gt;

&lt;h1&gt;
  
  
  Signposting
&lt;/h1&gt;

&lt;p&gt;Hopefully you are now a bit more inspired by cloud engineering roles. If so, here are a few more community links you might find interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.womenindevops.com/"&gt;Women in DevOps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.meetup.com/DevOps-Manchester/"&gt;DevOps Manchester&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.leedsdevops.org.uk/"&gt;Leeds DevOps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techbeacon.com/devops/20-influential-women-devops"&gt;20 women in DevOps&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Nine</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Sat, 10 Oct 2020 10:26:43 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-nine-56lh</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-nine-56lh</guid>
      <description>&lt;h1&gt;
  
  
  A pause
&lt;/h1&gt;

&lt;p&gt;I had to pause my learning for a few days. I think I just needed to let other things take priority - delivering on &lt;a href="https://techreturners.com"&gt;Tech Returners&lt;/a&gt; and making sure our forthcoming backend lessons are well structured ended up taking over my "tech" priorities.&lt;/p&gt;

&lt;p&gt;The remaining time was focused on being a Dad - my current "personal project" is teaching my boy to ride his bike. Which is certainly reminding me of my natural fixed mindset. Consciously, day by day, remembering to be &lt;a href="https://www.amazon.co.uk/Mindset-Updated-Changing-Fulfil-Potential/dp/147213995X/ref=sr_1_1?crid=25DUGSN6ZABYO&amp;amp;dchild=1&amp;amp;keywords=growth+mindset&amp;amp;qid=1602324134&amp;amp;sprefix=growth+mind%2Caps%2C147&amp;amp;sr=8-1"&gt;growth mindset&lt;/a&gt; and pass those thoughts of &lt;strong&gt;yet&lt;/strong&gt; down to my son.&lt;/p&gt;

&lt;p&gt;But enough about that. Time now for further SRE, and the next few days are about metrics, measurement and SLIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Characteristics of a good SLI
&lt;/h1&gt;

&lt;p&gt;The first concept is that Service Level Indicators should have a predictable relationship with the happiness of your application users.&lt;/p&gt;

&lt;p&gt;Most ops teams will be monitoring some form of system metrics such as load average, CPU utilisation, memory usage etc.&lt;/p&gt;

&lt;p&gt;Are they good SLIs?&lt;/p&gt;

&lt;p&gt;Probably not - the user doesn't care about processor usage; they do care if your site/application is responding slowly.&lt;/p&gt;

&lt;p&gt;Ok so what about correlations - you might see thread pool usage correlate with unhappy users so it seems like an SLI over the thread pools could be a good one?&lt;/p&gt;

&lt;p&gt;Probably not - there could be hidden cause/effect assumptions around the thread pool, and jumping from a system trend to a conclusion about user happiness could result in picking the wrong SLI.&lt;/p&gt;

&lt;p&gt;Side note....I've done this....on multiple occasions. Maybe not as specifically for defining an SLO but definitely for when to page people and wake them up out of bed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cut to the chase
&lt;/h1&gt;

&lt;p&gt;So the characteristics of a good SLI are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qeEnuXXt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/nbp4ts4myzr8pe15b1ua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qeEnuXXt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/nbp4ts4myzr8pe15b1ua.png" alt="Good SLI characteristics" width="381" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has a predictable relationship with user happiness&lt;/li&gt;
&lt;li&gt;Shows service is working as users expect it to&lt;/li&gt;
&lt;li&gt;Expressed as a ratio of two numbers, &lt;strong&gt;good events&lt;/strong&gt; / &lt;strong&gt;valid events&lt;/strong&gt; (resulting in a value between 0% and 100%)&lt;/li&gt;
&lt;li&gt;Aggregated over a long time window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is visualised really well in the example below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8u1YazyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/2xx2hdbnmfo3r8ibo7hi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8u1YazyQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/2xx2hdbnmfo3r8ibo7hi.png" alt="Good SLI trend" width="671" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that whilst both metrics capture a downward trend in user happiness, the top metric suffers from a lot more variance. In fact at one stage the percentage starts to increase, giving a false dawn that reliability (and in turn happiness) has improved, only for it to decrease again shortly after.&lt;/p&gt;

&lt;p&gt;Also notice that during "happy times" the bottom example SLI has a narrow range of values - predictable and trending.&lt;/p&gt;

&lt;h1&gt;
  
  
  Ways of Measuring
&lt;/h1&gt;

&lt;p&gt;There are five approaches to measuring your SLI:&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Logs
&lt;/h3&gt;

&lt;p&gt;This approach allows you to track the reliability of long user journeys, such as a journey that navigates through multiple services. It also gives you the option of backfilling your SLI data if you still have your server-side logs.&lt;/p&gt;

&lt;p&gt;It will likely need a portion of engineering effort especially if there is some form of logic for identifying good user journeys (through multiple services).&lt;/p&gt;

&lt;p&gt;Another potential drawback is that if it takes time to process logs in order to find out whether an event was good or bad, you risk increasing your &lt;strong&gt;Time to Detect&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Metrics
&lt;/h3&gt;

&lt;p&gt;These share the engineering challenge of logs in that they might not tell the full story of the user journey (without engineering effort); however, they are much easier to implement and you can start exporting them relatively quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Frontend Infrastructure Metrics
&lt;/h3&gt;

&lt;p&gt;Stats from things like your frontend load balancer provide metrics that are the closest to the user.&lt;/p&gt;

&lt;p&gt;Cloud providers might also have historical data that you can utilise to check if your SLI is predictable and aligns with happiness.&lt;/p&gt;

&lt;p&gt;Downside is that your load balancer is likely stateless so cannot track a full user journey/session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synthetic Clients
&lt;/h3&gt;

&lt;p&gt;Essentially a tool acting like a user. Crucially it should live outside of your infrastructure, acting and behaving exactly like a user. The theory: happy synthetic clients === user happiness. (Yes, the triple equals was intentional 🤦‍♂️)&lt;/p&gt;

&lt;p&gt;The challenge is of course that a synthetic client is only a best guess of the average user. And users (humans) do unexpected things.&lt;/p&gt;

&lt;p&gt;After the engineering effort, this approach can often devolve into &lt;a href="https://martinfowler.com/bliki/IntegrationTest.html"&gt;Integration Testing&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Client Side Instrumentation
&lt;/h3&gt;

&lt;p&gt;Provides the most accurate measure of user experience.&lt;/p&gt;

&lt;p&gt;Challenges are that it could have an impact on things like the battery life of your device, page performance etc.&lt;/p&gt;

&lt;p&gt;Relying on these for quick operational response can also be risky because of the reliance on your clients' usage of the application. &lt;/p&gt;

&lt;p&gt;Another challenge is outside noise, such as a bad experience caused by users being out of signal range. For example, you might find that mobile clients suffer poor latency and higher error rates, but because you can't do a whole lot about it, you have to relax your SLO targets to accommodate it instead.&lt;/p&gt;

</description>
      <category>certification</category>
      <category>googlecloud</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Eight</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Mon, 05 Oct 2020 11:04:29 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-eight-794</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-eight-794</guid>
      <description>&lt;h1&gt;
  
  
  Improving reliability
&lt;/h1&gt;

&lt;p&gt;Shifting away from impacts on reliability, it's time to move to thinking about how to improve reliability.&lt;/p&gt;

&lt;p&gt;There are lots of options for improvement and prioritising them requires some thought. I'll share the options in a moment but the Google SRE team share this calculation for measuring the impact on your error budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EWQQjBdz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7x5utw11wfvfn65oa8ks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EWQQjBdz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7x5utw11wfvfn65oa8ks.png" alt="Error Impact" width="341" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTD&lt;/strong&gt;&lt;br&gt;
Time to Detect - The time between the user being impacted and someone in your team being informed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTR&lt;/strong&gt;&lt;br&gt;
Time to resolution - The time between being informed and a fix being presented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;impact%&lt;/strong&gt;&lt;br&gt;
Impact percentage - how many users will this particular failure impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTF&lt;/strong&gt;&lt;br&gt;
Time to failure (sometimes called TBF time between failures) is how frequently you expect something to happen.&lt;/p&gt;

&lt;h4&gt;
  
  
  All together...
&lt;/h4&gt;

&lt;p&gt;The expected impact of a particular type of failure on your error budget is proportional to the time-to-detect plus the time-to-resolution multiplied by the percentage of impact over the time-to-failure. This last value TTF expresses how frequently you expect this particular failure to occur.&lt;/p&gt;
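&lt;p&gt;As arithmetic, that calculation can be sketched like so (my own helper, not from the course; the numbers are illustrative):&lt;/p&gt;

```python
# Expected error-budget impact of one failure mode over a 30-day window:
# (TTD + TTR) minutes of outage, scaled by the fraction of users affected,
# multiplied by how many times the failure is expected in the window.
def expected_impact_minutes(ttd_min, ttr_min, impact_pct, ttf_days, window_days=30):
    outage_min = (ttd_min + ttr_min) * (impact_pct / 100.0)
    failures_per_window = window_days / ttf_days
    return outage_min * failures_per_window

# E.g. detect in 10 min, fix in 50 min, 20% of users hit, failing every 15 days:
# (10 + 50) * 0.20 * (30 / 15) = 24 expected bad minutes per month.
```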

&lt;h1&gt;
  
  
  To improve reliability
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tasQePKc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/rixqn7xgghwvzszvcrwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tasQePKc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/rixqn7xgghwvzszvcrwk.png" alt="Areas of improvement" width="761" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So to improve reliability you could focus on &lt;strong&gt;reducing the TTD&lt;/strong&gt;. Maybe setting up quicker alerting or more frequent monitoring checks.&lt;/p&gt;

&lt;p&gt;Or even introduce automated alerting in an environment that previously relied upon humans spotting things on graphical dashboards.&lt;/p&gt;

&lt;p&gt;Spotting &lt;a href="https://en.wikipedia.org/wiki/Single_point_of_failure#:~:text=A%20single%20point%20of%20failure%20(SPOF)%20is%20a%20part%20of,application%2C%20or%20other%20industrial%20system."&gt;Single Points of Failure&lt;/a&gt; (SPOFs) in your architecture and replicating them is another option to lower the &lt;strong&gt;impact%&lt;/strong&gt;, as is doing &lt;a href="https://martinfowler.com/bliki/CanaryRelease.html"&gt;canary releases&lt;/a&gt; with dedicated groups of users.&lt;/p&gt;

&lt;h1&gt;
  
  
  The key point
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nv1n0Vjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hntykuemjvm8hln6ydel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nv1n0Vjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hntykuemjvm8hln6ydel.png" alt="Lowering impact percentage" width="313" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having this calculation means we can start to prioritise which areas of the SLO impact we focus on.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>devops</category>
      <category>certification</category>
      <category>sre</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Seven</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Sat, 03 Oct 2020 17:05:11 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-seven-2pkf</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-seven-2pkf</guid>
      <description>&lt;h1&gt;
  
  
  The largest single source
&lt;/h1&gt;

&lt;p&gt;of unreliability in a system is &lt;strong&gt;change&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now we take the learning back to DevOps and understand why that friction occurred: new features and a progressive application that delivers them, competing with a stable, reliable system that doesn't change. Dev and Ops.&lt;/p&gt;

&lt;p&gt;For me, SLOs remove that personal aspect. Agreeing the error budget and the level of reliability upfront means both groups allow for each other. It is &lt;a href="https://www.psychologies.co.uk/managing-expectations-0"&gt;managing expectations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And key to success is alignment of incentives between development and operations.&lt;/p&gt;

&lt;p&gt;If a service is within SLO then you could...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wbe3HAl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/y8idawtg9gixs71m0k6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wbe3HAl4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/y8idawtg9gixs71m0k6h.png" alt="Aligning Dev and Ops incentives" width="650" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One approach is to release features only while error budget remains, then focus development on reliability improvements until the budget is refilled. &lt;/p&gt;

&lt;h1&gt;
  
  
  No silver bullets...or are there?
&lt;/h1&gt;

&lt;p&gt;Picture a real-world scenario: you've spent all your error budget - you can't incur any more outages or downtime - but the product team really want to push out a new feature. &lt;/p&gt;

&lt;p&gt;We have all been there and it's a reality of development. Hint - It's not personal, don't treat it as such!&lt;/p&gt;

&lt;p&gt;But if this situation occurs teams can furnish stakeholders with a &lt;a href="https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows"&gt;silver bullet&lt;/a&gt; token. A token that allows the bearer to propose ignoring the rules. &lt;/p&gt;

&lt;p&gt;The tokens don't refresh and if the release is desired to still go ahead, the token bearer provides their token to the SRE's in order to enable the release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lL_fDrcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ncxr5yqsf94p0htzy01g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lL_fDrcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ncxr5yqsf94p0htzy01g.png" alt="SRE Silver bullets" width="662" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>sre</category>
      <category>devops</category>
      <category>certification</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Six</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Fri, 02 Oct 2020 20:15:18 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-six-b5n</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-six-b5n</guid>
      <description>&lt;h1&gt;
  
  
  Well kind of day six...
&lt;/h1&gt;

&lt;p&gt;The eagle eyed might have spotted that I missed a day. &lt;/p&gt;

&lt;p&gt;I could drop out excuses like work got busy or other things but in reality I just chose to prioritise doing very little in my down time. In fact I watched &lt;a href="https://www.netflix.com/gb/title/81254224"&gt;The Social Dilemma&lt;/a&gt; - super interesting on how hooked to social media some of us find ourselves.&lt;/p&gt;

&lt;p&gt;I find that if I take these breaks, it allows me to re-approach self development with a renewed interest. It also gives me time to consolidate what I've already learnt.&lt;/p&gt;

&lt;p&gt;Well that is how I'll backwards justify it on this occasion anyway!&lt;/p&gt;

&lt;h1&gt;
  
  
  Edge Cases
&lt;/h1&gt;

&lt;p&gt;Sometimes I think of the world as a consistent pattern, but real life is much different. The impact of an outage is one of those things that doesn't fit a pattern and is affected by the real world.&lt;/p&gt;

&lt;p&gt;For example imagine the impact of an outage during the release of a brand new episode or title on Netflix. Their busiest time, everyone scrambling to get their watching fix.&lt;/p&gt;

&lt;p&gt;Suddenly you might want your application or site to be even MORE reliable than usual - moving from three nines to four nines. You might consider implementing change freezes during that time, or over-provisioning capacity - notice this is prioritising the reliability SLO over feature development.&lt;/p&gt;

&lt;p&gt;The SRE course goes on to explain how it's entirely reasonable to set more than one SLO target to capture the distribution of users, explaining that not all users are equal. For example, you might find that a longer latency SLO covering three nines of your responses is good for most requests, but some users might find that too slow.&lt;/p&gt;

&lt;p&gt;Right now whilst working through the content I'm trying to battle my inner brain telling me that things fit neatly in a box.&lt;/p&gt;

&lt;h1&gt;
  
  
  Error Budgets
&lt;/h1&gt;

&lt;p&gt;Basically an inverse of reliability. Imagine the system is failing or proving to be unreliable for users - your &lt;a href="https://www.atlassian.com/incident-management/kpis/error-budget"&gt;error budget&lt;/a&gt; tells you how unreliable a service is allowed to be. &lt;/p&gt;

&lt;p&gt;(I know it seems odd to read/write/say that)&lt;/p&gt;

&lt;p&gt;Taking request success: if your SLO says that 99.9 percent of requests should be successful in a given quarter, your error budget allows 0.1 percent of requests to fail. &lt;/p&gt;

&lt;p&gt;Or if we take downtime...&lt;/p&gt;

&lt;p&gt;0.1 percent unavailability x &lt;br&gt;
28 days in the four-week window x &lt;br&gt;
24 hours in a day x 60 minutes in an hour =&lt;br&gt;
40.32 minutes of downtime per month. &lt;/p&gt;

&lt;p&gt;This is just about enough time for your monitoring systems to surface an issue, and for a human to investigate and fix it.&lt;/p&gt;
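&lt;p&gt;The same downtime arithmetic as a small helper (my own sketch, using the 28-day window above):&lt;/p&gt;

```python
# Allowed downtime, in minutes, for a given SLO percentage over the window.
def downtime_budget_minutes(slo_pct, window_days=28):
    error_fraction = (100.0 - slo_pct) / 100.0
    return error_fraction * window_days * 24 * 60

# downtime_budget_minutes(99.9) is roughly 40.32 minutes per four-week window.
```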

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bjf70CEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hjp1t118l6wek47oqkkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bjf70CEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hjp1t118l6wek47oqkkr.png" alt="Unavailability" width="428" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is not actually much time, and if you have one period of unavailability that month you will likely have burnt through your entire monthly budget.&lt;/p&gt;

&lt;p&gt;This is why it is important to agree the error budget and SLO upfront with all required stakeholders and business leadership.&lt;/p&gt;

&lt;p&gt;The error budget can be thought of as a tool for spending time on the things you want, such as rolling out new features, software experiments etc.&lt;/p&gt;

&lt;p&gt;Spending the error budget is actually useful!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GVOXK-E6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8meoi28sl5ni1bzbmqi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GVOXK-E6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8meoi28sl5ni1bzbmqi9.png" alt="Error Budget" width="707" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In turn we get some useful side effects...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p8cRlB2M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/6lgw0fvp8ba25ojqrqwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p8cRlB2M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/6lgw0fvp8ba25ojqrqwl.png" alt="Side Effects" width="412" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Credit to Cheryl Kang of Google for the tips in this blog post. Here's another &lt;a href="https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles"&gt;useful blog from Cheryl&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>sre</category>
      <category>devops</category>
      <category>certification</category>
    </item>
    <item>
      <title>GCP DevOps Certification - Pomodoro Five</title>
      <dc:creator>James Heggs</dc:creator>
      <pubDate>Wed, 30 Sep 2020 19:17:04 +0000</pubDate>
      <link>https://dev.to/eggsy84/gcp-devops-certification-day-five-5ck6</link>
      <guid>https://dev.to/eggsy84/gcp-devops-certification-day-five-5ck6</guid>
      <description>&lt;h1&gt;
  
  
  SLO in latency terms
&lt;/h1&gt;

&lt;p&gt;A common SLA metric is response time. &lt;/p&gt;

&lt;p&gt;Let's say your SLA states that every request will be resolved within 300ms - you might then set an SLO of 200ms.&lt;/p&gt;

&lt;p&gt;Crucially, notice that you will find out you're breaking your SLO before breaking the customer's trust and the service level agreement.&lt;/p&gt;

&lt;p&gt;Going back to the concept of &lt;a href="https://dev.to/eggsy84/gcp-devops-certification-day-four-1a35"&gt;reliability being considered a feature&lt;/a&gt; - if you have a clear SLO on response times and you start to break it, that's essentially an indicator that feature velocity should slow and reliability/performance investments should be prioritised. Yay!&lt;/p&gt;

&lt;h1&gt;
  
  
  Measuring Reliability
&lt;/h1&gt;

&lt;p&gt;Now this part I love! The key concept is that the thing you measure is called the &lt;a href="https://www.bmc.com/blogs/service-level-indicator-metrics/"&gt;Service Level Indicator&lt;/a&gt; (SLI).&lt;/p&gt;

&lt;p&gt;So for example - measuring the response time would be done by checking the latency. So latency is your SLI.&lt;/p&gt;

&lt;p&gt;SLIs tend to be expressed as the proportion of events that were good, e.g. how many requests were within the 200ms mark versus how many requests in total.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--73WyAawz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ky9z81phd0x38u0zzpy0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--73WyAawz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/ky9z81phd0x38u0zzpy0.png" alt="SLI valid events versus bad events" width="406" height="167"&gt;&lt;/a&gt;&lt;/p&gt;
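&lt;p&gt;As a rough sketch, the "good events over valid events" ratio might be computed like this (the latency figures are made up purely for illustration):&lt;/p&gt;

```python
# SLI = good events / valid events.
# Here a "good" event is a request served within the 200ms target.
latencies_ms = [120, 180, 250, 90, 310, 150, 199, 205, 170, 160]
threshold_ms = 200

good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
sli = good / len(latencies_ms)
print(f"SLI: {sli:.1%}")  # 70.0% of requests were good
```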

&lt;h1&gt;
  
  
  Setting the SLO for the SLI
&lt;/h1&gt;

&lt;p&gt;Wow even I'm now hating the acronyms but let us go on.&lt;/p&gt;

&lt;p&gt;So we're measuring response times and we know how many requests were marked as good (200ms or less) from all our requests. &lt;/p&gt;

&lt;p&gt;But what SLO target might we set?&lt;/p&gt;

&lt;p&gt;SLOs have a few key features.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generally a percentage-based value&lt;/li&gt;
&lt;li&gt;Utilises the SLI&lt;/li&gt;
&lt;li&gt;Covers a timeframe (e.g. the last 4 weeks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;99% of requests will fall within 200ms over the last 4 weeks.&lt;/p&gt;
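&lt;p&gt;Putting those three features together, a minimal (hypothetical) SLO check over a window of request latencies might look like this:&lt;/p&gt;

```python
def slo_met(latencies_ms, threshold_ms=200, target=0.99):
    """Return True if the proportion of requests at or under
    threshold_ms meets the SLO target over the window."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms) >= target

# A window of 1000 requests: 990 fast, 10 slow -> exactly 99% good.
window = [150] * 990 + [400] * 10
print(slo_met(window))  # True
```

&lt;p&gt;In practice the window would come from your monitoring system rather than an in-memory list, but the shape of the calculation is the same.&lt;/p&gt;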

</description>
      <category>googlecloud</category>
      <category>devops</category>
      <category>sre</category>
      <category>certification</category>
    </item>
  </channel>
</rss>
