Healthy systems are not the same as resilient systems

#programming #devops #systems #discuss

A few years ago, before I even knew the term “chaos engineering,” I accidentally practiced it.

We had a small container orchestration cluster running several applications. Nothing huge. A couple of nodes. Everything looked healthy most of the time.

But there was one annoying category of issue nobody could fully explain:
occasionally, some applications would fail in strange ways after deployment.

The failures looked random.
Transient.
“Probably flaky.”

One day I got curious and started doing something very simple:
manually killing nodes and restarting workloads to see what actually happened.

From the outside, it probably looked pointless.

“If the cluster is designed for failures, of course it should recover.”

But something interesting happened.

Certain applications consistently broke only when scheduled onto one specific node.

The “random” bug suddenly became deterministic.

The cluster wasn’t truly homogeneous. One node had a subtle configuration difference that surfaced only when a failure forced workloads to be rescheduled onto it. Under normal operation, the issue showed up rarely enough to be dismissed as noise.
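To make the "random becomes deterministic" point concrete, here is a minimal Python sketch of the kind of check that exposes this pattern: group deployment outcomes by the node each workload landed on. Everything in it is hypothetical and for illustration only (the event tuples, the app and node names, the `failure_rate_by_node` and `suspicious_nodes` helpers, and the 0.5 threshold); none of it comes from the original incident.

```python
from collections import defaultdict


def failure_rate_by_node(events):
    """Return {node: failure_rate} for (app, node, succeeded) tuples."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for _app, node, succeeded in events:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return {node: failures[node] / totals[node] for node in totals}


def suspicious_nodes(events, threshold=0.5):
    """Return the sorted names of nodes whose failure rate meets the threshold."""
    rates = failure_rate_by_node(events)
    return sorted(node for node, rate in rates.items() if rate >= threshold)


# Hypothetical placement log: which app ran on which node, and whether it came up.
events = [
    ("billing", "node-a", True),
    ("billing", "node-b", True),
    ("billing", "node-c", False),
    ("search", "node-a", True),
    ("search", "node-c", False),
    ("search", "node-b", True),
    ("reports", "node-c", True),
    ("reports", "node-a", True),
]

print(failure_rate_by_node(events))
print(suspicious_nodes(events))
```

Across the whole cluster, two failures out of eight placements looks like flakiness. Split by node, both failures sit on the same machine, which is exactly the kind of signal that deliberate rescheduling makes visible.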

That experience stayed with me because it changed how I think about systems.

Healthy systems are not the same as resilient systems.

A system can look perfectly stable right until the moment reality forces it into an unusual state.

And I suspect many organizations avoid these kinds of experiments for understandable reasons: