Bala K

Posted on Apr 24, 2021 • Edited on May 9, 2021 • Originally published at krishnam-bala.Medium

Reliability Engineering: Two Mistakes High

#sre #design #developer #reliability

This post is about two questions: what is "Two Mistakes High", and how is it relevant to the IT industry?

What does it mean?
Why TWO mistakes?
Is it relevant to us?
How do we apply it in an application?

I stumbled on this expression a couple of years back while reading a tech book. When I read it, it sounded OK; nothing too exciting or remarkable about it. Later, I noticed that it was a seed that had taken root. Now I can't get it out of my head: every time I make a decision about reliability or stability, whether it is for work or for fixing something at home (be it electrical, plumbing, carpentry, or an IoT project), I almost hear someone whispering in my ear, "are yooou flying two mistakes high?"
Hold on ✋, before you call this paranoia, let's see what "flying two mistakes high" means.

What does it mean?

This expression is used while kids are learning to fly remote control aeroplanes.

When you learn to fly, you will be doing some manoeuvres and learning acrobatics. Of course, you will try out a stunt of some sort. And you will quickly learn this lesson: if you make a mistake, your plane will naturally lose some altitude.
In other words, every mistake costs altitude.

So, keeping your plane "two mistakes high" means keeping it high enough that you have enough altitude to recover from two independent mistakes.

Why TWO mistakes?

While you are recovering from the first mistake, and you are already lower in altitude, what happens if you make another slip-up? If your plane isn't high enough to recover from the second mistake, that's terrible news: lose too much altitude and you end up with a broken toy.

So you always want to stay high enough to recover from a second mistake, even while you are still recovering from the first. That way, a slip-up during recovery doesn't end in a crash.

Is it relevant to us (IT)?

We saw where the expression came from, and it is a good analogy for maintaining availability in our most critical applications.
In our critical applications, it means that even when something is going wrong, we want to keep the application running reliably enough that we can afford for something else to go wrong while we are still recovering from the first problem. Think about it: during recovery we are typically stressed, perhaps in a tricky situation, and doing potentially ad-hoc things. That is just the type of situation that can cause us to make another mistake.

While I was researching more on this topic (just a glorified way of saying "I googled it"), I was able to find only a handful of articles that relate this philosophy to the IT field, and most of them are about how it connects to availability.
But, to me, it is more than that. It applies to many more scenarios.

I believe that, inherently, most of us are risk-takers, and we like to push the envelope now and then to test our limits (in this case, our application's limits, keeping the error budget in mind). I consider the R/C plane analogy a meta-thinking tool: it nudges you to take calculated risks in any given situation.

It is a lesson about redundancy, and it's a lesson about resiliency, and it is a lesson about … you get the idea. It applies to modern application development, change management, and operations. It even applies in many other areas: from dealing with hardware failures to data redundancy, capacity planning, performing retries in your service calls, reducing toil, risk management, and disaster planning.
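To make the capacity-planning case concrete, here is a minimal sketch of how one might count "mistakes" of headroom for a fleet of identical nodes. The function names (`mistakes_high`, `is_two_mistakes_high`) and the simple model (every node has the same capacity, and each failure removes exactly one node) are assumptions made for illustration, not something taken from a particular tool.

```python
import math


def mistakes_high(nodes: int, peak_load: float, capacity_per_node: float) -> int:
    """Return how many node failures the fleet can absorb at peak load.

    Hypothetical model: identical nodes, one node lost per "mistake".
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    # Nodes needed just to carry the peak load, with no spare at all.
    needed = math.ceil(peak_load / capacity_per_node)
    # Spare nodes are the "altitude"; a fleet below the ground has none.
    return max(nodes - needed, 0)


def is_two_mistakes_high(nodes: int, peak_load: float, capacity_per_node: float) -> bool:
    """True if the fleet survives two independent node failures at peak."""
    return mistakes_high(nodes, peak_load, capacity_per_node) >= 2


if __name__ == "__main__":
    # 900 requests/s at peak, 300 requests/s per node: 3 nodes are needed.
    print(mistakes_high(5, 900, 300))         # 5 nodes leave 2 spare
    print(is_two_mistakes_high(4, 900, 300))  # 4 nodes leave only 1 spare
```

With these made-up numbers, five nodes are two mistakes high and four nodes are only one mistake high, so losing a second node during recovery would push the fleet below peak capacity.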

For those curious minds who might ask, "Why stop at two? Why not three or more mistakes high?"

Short answer: to keep it simple. I don't want to go full Inception here and end up in Limbo. To start with, two sounds reasonable enough.

How do we apply it in an application?

For starters, when we identify the failure scenarios that we anticipate, we should walk through the ramifications of those scenarios and our recovery plan for each. We make sure the recovery plan itself does not have the potential for mistakes or other shortcomings built into it. In short, we check that the recovery plan can work and that it has a backup for its own shortcomings.
Sounds simple, right? A big no: it is easier said than done. But we can practice it wherever and whenever we can, until it becomes a habit.
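One way to picture a recovery plan that has a backup for its own shortcomings is an ordered list of recovery steps, where the next one is tried only if the previous one fails. The sketch below is a hypothetical illustration: `run_with_backups`, `AllPlansFailed`, and the `min_headroom` parameter are names invented here, and a real recovery procedure would involve far more than calling functions in order.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllPlansFailed(Exception):
    """Raised when the primary plan and every backup have failed."""


def run_with_backups(
    plans: Sequence[Callable[[], T]], min_headroom: int = 2
) -> Tuple[T, int]:
    """Try each plan in order and return (result, failures absorbed).

    Refuses to start unless there are at least `min_headroom` backups
    behind the primary plan, i.e. unless we are "two mistakes high".
    """
    if len(plans) < 1 + min_headroom:
        raise ValueError(
            f"need a primary plan plus {min_headroom} backups, got {len(plans)} plans"
        )
    errors: List[Exception] = []
    for plan in plans:
        try:
            return plan(), len(errors)
        except Exception as exc:  # a sketch; real code would catch narrower errors
            errors.append(exc)
    raise AllPlansFailed(errors)


if __name__ == "__main__":
    def restore_from_primary() -> str:
        raise ConnectionError("primary is down")       # first mistake

    def restore_from_replica() -> str:
        raise TimeoutError("replica is lagging")       # second mistake

    def restore_from_snapshot() -> str:
        return "restored from snapshot"

    result, absorbed = run_with_backups(
        [restore_from_primary, restore_from_replica, restore_from_snapshot]
    )
    print(result, "after absorbing", absorbed, "failures")
```

The up-front length check is the interesting part: it turns "are you flying two mistakes high?" into a question the code asks before the recovery starts, rather than one we discover the answer to halfway through.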

Conclusion

A few years back, when one of our applications was facing frequent stability issues, my mentor gave me some advice for bringing things under control. This is the gist of what he said: "Hey Bala, it is ALL about asking the right questions, and there is no need to get overwhelmed".

I can firmly say that "are you flying two mistakes high?" belongs on that list of right questions to ask.

For site reliability engineering, the word "mindset" is critical. Being an effective SRE is as much about how you think as it is about your technical skills.
— Kevin Casey

Latest comments (1)

GrahamTheDev • Apr 24 '21 • Edited

I like the "two mistakes high" analogy - now I need to go fiddle with some of my applications which are "trimming the grass" they are flying that low and see if I can get some altitude!