James Heggs

GCP DevOps Certification - Pomodoro Three

How Google see DevOps

In the video content presented by @lizthegrey and @sethvargo, they discuss the potential friction between developers and operations: change vs stability.

Thank god I haven't been explaining it wrong! 🤓 Well, by this video at least.

Google outline DevOps as having 5 characteristics:

  • Reduce organisation silos

  • Accept failure as normal

  • Implement gradual change

  • Leverage tooling and automation

  • Measure everything

Moving on to SRE

Google think of Site Reliability Engineering in a similar manner to the way an object-oriented class might implement an interface:

class SRE implements DevOps

Specifically, SREs will:

  • Share ownership of environments with developers (Reducing organisation silos)

  • Use Service Level Objectives (SLOs) and blameless post-mortems (Accepting failure as normal)

  • Reduce the cost of failure with techniques such as canary releases (Implementing gradual change)

  • Eliminate as much manual work as possible (Tooling and automation)

  • Measure toil and system reliability (Measuring everything)
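To make the interface analogy a bit more concrete, here is a minimal Python sketch of the mapping in the list above. The method names are my own invention (they are not from the video or any Google material), and the return values are just the practices from the list as strings.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps characteristics as an abstract interface.

    Method names are hypothetical, chosen only to mirror the list above.
    """

    @abstractmethod
    def reduce_organisation_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way of implementing each characteristic."""

    def reduce_organisation_silos(self) -> str:
        return "Share ownership of environments with developers"

    def accept_failure_as_normal(self) -> str:
        return "Service Level Objectives and blameless post-mortems"

    def implement_gradual_change(self) -> str:
        return "Reduce the cost of failure, e.g. canary releases"

    def leverage_tooling_and_automation(self) -> str:
        return "Eliminate as much manual work as possible"

    def measure_everything(self) -> str:
        return "Measure toil and system reliability"


if __name__ == "__main__":
    sre = SRE()
    print(sre.implement_gradual_change())
```

The point of the abstract base class is that `DevOps` on its own cannot be instantiated. It only says what should happen, and `SRE` is one opinionated answer to how.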

Let's talk reliability

This part I love. Actually question how reliable your systems need to be and, inversely, how much unreliability (your error budget) you can tolerate.

3 nines = 99.9% = 40 mins of downtime over a 28 day period

So that's just about enough for a monitoring system to spot an issue, alert someone and for the human to take action. Depending on the root cause, of course.

4 nines = 99.99% = 4 mins of downtime over a 28 day period

Now you're into the world of machine-based detection and self-healing. Software updates and roll-outs probably need to be isolated to decoupled areas.

5 nines = 99.999% = 24 secs of downtime over a 28 day period

Good luck! Your monitoring system could even miss this amount of downtime entirely. If you're checking uptime every minute, the whole outage could fall between two checks and you would falsely report that you are 'up'.
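The arithmetic behind those numbers is simple enough to sketch. This is a rough Python version, assuming a 28 day window and a monitoring probe that fires every 60 seconds. The function names and the 60 second interval are my own assumptions for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def error_budget_seconds(availability_percent: float, window_days: int = 28) -> float:
    """Downtime allowed in the window while still meeting the availability target."""
    window_seconds = window_days * SECONDS_PER_DAY
    return window_seconds * (100.0 - availability_percent) / 100.0


def outage_can_hide_between_probes(
    availability_percent: float,
    probe_interval_seconds: float,
    window_days: int = 28,
) -> bool:
    """True if an outage as long as the entire budget is shorter than the gap
    between two consecutive probes, so it could go completely unnoticed."""
    budget = error_budget_seconds(availability_percent, window_days)
    return budget < probe_interval_seconds


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = error_budget_seconds(target)
        hidden = outage_can_hide_between_probes(target, probe_interval_seconds=60)
        print(
            f"{target}%: {budget:.1f} s ({budget / 60:.2f} min), "
            f"could hide between 60 s probes: {hidden}"
        )
```

Running it gives roughly 40 minutes for three nines, 4 minutes for four nines and 24 seconds for five nines. Only at five nines is the whole budget shorter than a one-minute probe interval, which is exactly the "you might never see it" problem.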

Now extend those thoughts to consuming public cloud services like Google Cloud. Introduce a support request round trip and you've likely consumed your reliability error budget.
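To put a number on that, here is a small Python sketch that subtracts outage time from the 28 day budget. The 30 minute support round trip is purely a made-up figure for illustration, not a Google Cloud support time, and the function name is my own.

```python
def remaining_budget_seconds(
    availability_percent: float,
    outage_seconds: list[float],
    window_days: int = 28,
) -> float:
    """Error budget left in the window after the given outages.

    A negative result means the availability target has been missed.
    """
    window_seconds = window_days * 24 * 60 * 60
    budget = window_seconds * (100.0 - availability_percent) / 100.0
    return budget - sum(outage_seconds)


if __name__ == "__main__":
    # Hypothetical: one outage that lasts as long as a 30 minute support round trip.
    support_round_trip = 30 * 60
    for target in (99.9, 99.99, 99.999):
        left = remaining_budget_seconds(target, [support_round_trip])
        status = "within budget" if left >= 0 else "budget blown"
        print(f"{target}%: {left:.1f} s remaining ({status})")
```

With that assumed 30 minutes, three nines survives with about 10 minutes to spare, while four and five nines are blown by a single incident.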

Super interesting thoughts on just how "available" a system needs to be and what implications that has.

The StayPuft Man

OK, I admit it, I couldn't help myself... today I ran one of the mock tests. You know, just to understand my gaps.

I managed to get 7 out of 13 correct - about 54%.

My gaps at this stage were around specific Google APIs, particularly within Stackdriver, and recommended security practices.

To anyone who has worked with me - that probably isn't new information #cowboy
