GCP DevOps Certification - Pomodoro Three

#googlecloud #devops #certification #sre

How Google see DevOps

In the video content presented by @lizthegrey and @sethvargo, they discuss the potential friction between developers and operations: change vs stability.

Thank god I haven't been explaining it wrong! 🤓 Well, according to this video at least.

Google outline DevOps as having 5 characteristics

Reduce organisational silos
Accept failure as normal
Implement gradual change
Leverage tooling and automation
Measure everything

Moving on to SRE

Google think of Site Reliability Engineering in a similar manner to the way an object-oriented class might implement an interface:

class SRE implements DevOps

Specifically, SREs will:

Share ownership of environments with developers (Reducing organisational silos)
Use Service Level Objectives and blameless post-mortems (Accepting failure as normal)
Reduce the cost of failure with techniques such as canary releases (Implementing gradual change)
Eliminate as much manual work as possible (Tooling and automation)
Measure toil and system reliability (Measuring everything)
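To make the "class SRE implements DevOps" analogy a bit more concrete, here is a minimal Python sketch. The class and method names are my own invention for illustration (they are not from Google's material), and Python has no `implements` keyword, so an abstract base class stands in for the interface.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': the five characteristics, with no implementation."""

    @abstractmethod
    def reduce_organisational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way of satisfying the DevOps 'interface'."""

    def reduce_organisational_silos(self) -> str:
        return "Share ownership of environments with developers"

    def accept_failure_as_normal(self) -> str:
        return "Service Level Objectives and blameless post-mortems"

    def implement_gradual_change(self) -> str:
        return "Reduce the cost of failure, e.g. canary releases"

    def leverage_tooling_and_automation(self) -> str:
        return "Eliminate as much manual work as possible"

    def measure_everything(self) -> str:
        return "Measure toil and system reliability"


if __name__ == "__main__":
    sre = SRE()
    print(sre.accept_failure_as_normal())
```

The point of the analogy is that `DevOps` only says what should happen, while `SRE` is one opinionated answer to how. Trying to instantiate `DevOps()` directly raises a `TypeError`, which fits rather nicely.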

Let's talk reliability

This part I love. Actually question how reliable your systems need to be, and the inverse: how much unreliability (your error budget) is acceptable.

3 nines = 99.9% = 40 mins over 28 day period

So that's just about enough for a monitoring system to spot an issue, alert someone and for the human to take action. Depending on the root cause, of course.

4 nines = 99.99% = 4 mins over 28 day period

Now you're into the world of machine-based detection and self-healing. Software updates and roll-outs probably need to be isolated to decoupled areas.

5 nines = 99.999% = 24 secs over 28 day period

Good luck! There is even potential for your monitoring system to miss this amount of downtime altogether. If you're checking uptime every minute, an outage could start and finish between two checks, and you'd falsely report that you are 'up'.
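The arithmetic behind those numbers is simple enough to sketch. This is a rough Python example, assuming a 28 day window and a once-a-minute uptime probe; the function names `error_budget` and `probe_can_miss` are made up for illustration. A 28 day window is 40,320 minutes, so the budgets come out at roughly 40.3 minutes, 4.0 minutes and 24.2 seconds.

```python
from datetime import timedelta

# Assumption: a rolling 28 day measurement window.
WINDOW = timedelta(days=28)


def error_budget(availability_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Allowed downtime within the window for a given availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return timedelta(seconds=window.total_seconds() * unavailable_fraction)


def probe_can_miss(budget: timedelta, probe_interval: timedelta) -> bool:
    """True when an outage as long as the whole budget fits between two probes."""
    return budget < probe_interval


if __name__ == "__main__":
    probe_interval = timedelta(minutes=1)  # assumption: one uptime check per minute
    for target in (99.9, 99.99, 99.999):
        budget = error_budget(target)
        print(
            f"{target}% -> {budget.total_seconds():.1f}s of downtime allowed, "
            f"could slip past a 1 minute probe: {probe_can_miss(budget, probe_interval)}"
        )
```

With a one minute probe, only the five nines budget is short enough to disappear entirely between two checks, which is exactly the false 'up' problem.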

Now extend those thoughts to consuming public cloud services like Google Cloud. Introduce a support request round trip and you've likely consumed your reliability error budget.

Super interesting thoughts on just how "available" a system needs to be and what implications that has.