Testing For Success vs. Failure

punch — Fri, 08 Feb 2019 19:42:48 +0000

When you build something, and tweak it to satisfy all of the scenarios it can cover, how/when/how much do you test it for failure?

Let's say that I build a monitor to watch the ratio between two specific metrics, and I want it to alert me when that ratio drops below 0.8, rather than at 1 (which indicates that there is no issue) or at 0.9 (which indicates that something might be righting itself, e.g. an autoscaling host being killed off because it's no longer needed).
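To make the rule concrete, here is a minimal sketch of the threshold check described above. The function name, the argument names, and the handling of a zero denominator are my own assumptions for illustration; the real monitor would live in whatever monitoring tool is in use.

```python
ALERT_THRESHOLD = 0.8  # from the post: alert only when the ratio drops below 0.8


def should_alert(metric_a: float, metric_b: float,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when metric_a / metric_b is below the alert threshold.

    metric_a and metric_b are hypothetical names for the two metrics.
    """
    if metric_b == 0:
        # Assumption: the post does not say what should happen here,
        # so this sketch refuses to guess.
        raise ValueError("ratio is undefined when the second metric is zero")
    return (metric_a / metric_b) < threshold


# A ratio of 1 (no issue) or 0.9 (possibly self-correcting) stays quiet;
# only a ratio below 0.8 alerts.
for a, b in [(100, 100), (90, 100), (79, 100)]:
    print(a / b, should_alert(a, b))
```

Note that a ratio of exactly 0.8 does not alert in this sketch, since the post says "drops below 0.8".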

I've built this monitor and tweaked the thresholds based on historical examples of:

1. Times when we wanted to be alerted, based on what was going on, and what the ratio looked like at that time
2. Times when we expect the ratio to not be 1, but we don't need to alert, as we have scheduled a change during that time period
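Checking a threshold against historical examples like these could be sketched roughly as below. Everything here is hypothetical: the `HistoricalCase` type, the `backtest` helper, and especially the sample ratios, which are invented for illustration and are not real data from my system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HistoricalCase:
    label: str          # free-text description of what was going on
    ratio: float        # the metric ratio observed at that time
    expect_alert: bool  # did we want to be alerted?


def backtest(cases: List[HistoricalCase],
             threshold: float = 0.8) -> List[HistoricalCase]:
    """Return the cases where the monitor's decision differs from what we wanted."""
    return [c for c in cases if (c.ratio < threshold) != c.expect_alert]


# Invented values, for illustration only.
cases = [
    HistoricalCase("incident we wanted to hear about", 0.65, True),
    HistoricalCase("scheduled change, ratio dipped", 0.85, False),
    HistoricalCase("normal operation", 1.0, False),
]

mismatches = backtest(cases)
print(mismatches)  # an empty list means every recorded case matched
```

An empty result only says that the threshold agrees with the cases that were recorded. It says nothing about situations that are not in the list.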

I've researched this, tested it, and even ran some new example tests of #1 and #2. Based on everything I've tested thus far, the new monitor I've built satisfies everything: it would have alerted every time we wanted it to, and would have ignored the metric ratio every time we wanted it to. I present the results of my testing, my research, my reasoning, and the monitor to my manager, who says:

"You need to come up with an example of where this monitor fails."

Is he right?

Remember, I have:

tested different metrics, and combinations/ratios, to find the optimal way to monitor these scenarios
tweaked my thresholds to satisfy when I do and do NOT want the monitor to alert us

Testing for unknown unknowns is always difficult. In this case, I'm being asked to make a monitor that is completely perfect, and will not need to be tweaked in the future even if our infrastructure changes.

Can/should this be done? How/why/why not?

DEV Community: punch
