Discussion on: Even the Big Ones Mess Up

You should always make a big fuss about mistakes that happen in production. Of course mistakes happen, but why didn't this mistake show up in Test or Acceptance? What do you need to fix in the way you go to production so that it does not happen again next time?

Quite often production issues are the result of a "good enough" mentality. The quality should be good; it does not have to be perfect. Something is good when it works and you have proof that it works. It might not be the best in performance or scalability, but you know where its limitations are. And this is where something that is good can still fail in production. For example, in the case of the Amazon issue, they probably had an incorrect estimate of the surge of new devices. And that's fine. But if they did not even consider it and test for it, then they deserve all the fuss that should be made about it.

In production there should only be two kinds of issues that are (kind of) acceptable.

"Oh fuck!" Usually the result of somebody performing an explicit action, like deleting a file on the wrong server.
"That's interesting." Something happens which defies the world as you have defined it. These are usually the result of a user performing a combination or series of actions which were not accounted for in the logic.

Neither of these kinds of issue is really solvable. You can only reduce the number of occurrences. This is what defines your software/process maturity.
You can attempt to expose these problems with things like chaos engineering and fuzz testing, but that only gets you so far. In fuzz testing you generally only try to find the edge cases of a single unit. For the "That's interesting" kind, you probably need to trigger a whole series of edge cases.
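
To make that last point concrete, here is a rough Python sketch. Everything in it is made up for illustration: `Cart` is a hypothetical toy system, and the bug is planted on purpose. Checking each action on its own finds nothing, while random sequences of actions hit the "That's interesting" case.

```python
import random


class Cart:
    """Hypothetical toy system; the bug only shows after a series of actions."""

    def __init__(self):
        self.items = 0
        self.discount = 0

    def add(self):
        self.items += 1

    def remove(self):
        if self.items > 0:
            self.items -= 1

    def coupon(self):
        # Planted bug: the discount is fixed here and never revisited by remove().
        if self.items > 0:
            self.discount = 5

    def total(self):
        return self.items * 10 - self.discount


ACTIONS = ["add", "remove", "coupon"]


def fuzz_single_actions():
    """Run each action alone on a fresh cart: the 'single unit' view."""
    failures = []
    for action in ACTIONS:
        cart = Cart()
        getattr(cart, action)()
        if cart.total() < 0:
            failures.append(action)
    return failures


def fuzz_sequences(runs=1000, max_len=6, seed=0):
    """Apply random action sequences; return the first one that breaks the invariant."""
    rng = random.Random(seed)
    for _run in range(runs):
        cart = Cart()
        history = []
        for _step in range(rng.randint(1, max_len)):
            action = rng.choice(ACTIONS)
            getattr(cart, action)()
            history.append(action)
            if cart.total() < 0:  # invariant: a total is never negative
                return history
    return None


if __name__ == "__main__":
    print("single-action failures:", fuzz_single_actions())
    print("failing sequence:", fuzz_sequences())
```

Any sequence it reports has to contain an add, a coupon and a remove, which is exactly the kind of combination nobody accounted for in the logic. Real systems have far more actions and state, so this only reduces the number of surprises; it does not remove them.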

Al Romano • Feb 12 '19

100% this. ^^

I'm now changing incident report reasons to "Oh Fuck" and "That's interesting...".

So. Much. Yes.