Harish

You can't test your way to certainty — so falsify instead

I'm currently building a proof of concept (PoC) that rebuilds one of our core services from scratch: same job, completely different architecture. The design didn't come from a whiteboard moment; it accreted. I'd solved one issue in the existing system, then another, then another, and one day I looked back at the pile of fixes and something in me said: this doesn't seem right anymore. The old shape couldn't hold all the patches. So I started over.

And that's where the trouble began.

The fear of infinite examples

As I implemented the new architecture, I worked the way most of us do: run it through an example, watch it break, add a tweak to handle that case, repeat. It felt productive. Each example I fixed made the system a little more capable.

Then a quiet dread crept in: what if this never ends?

I had 20 examples to test the PoC. But 20 was only the start — soon it would be 500, then eventually a million. I could feel myself on a treadmill. Every increment in scale would surface new cases, and I'd be patching forever, never able to point at the system and say "it's done." I couldn't see the end. The space of possible inputs was just too vast, and I was trying to map it one example at a time.

The worst part wasn't the work. It was not knowing how much work was left — or whether "left" even had a bottom.

Engineers have empirical bias

Here's the thing I eventually named. We engineers make a huge number of design decisions based on the examples we happen to observe. We see a few hundred cases, spot the patterns, and bake those patterns into the system. I call this trap empirical bias. Data scientists know close relatives of it as sampling bias and overfitting: fitting your conclusions to the sample you happened to look at.

In a small project that's fine; you've basically seen everything. But in a large project, the unknowns are unknown. Nobody has observed them. They're not on any list of edge cases because no one knows they exist yet.

Which leads to the question that was actually keeping me up:

How do you know the things that you don't know?

A surprising answer: Karl Popper's falsification

The answer didn't come from an engineering blog. It came from philosophy of science.

Karl Popper was a 20th-century philosopher who asked what separates real science from things that merely sound scientific. His answer was falsifiability. A claim is scientific only if it makes a prediction that could, in principle, be proven wrong.

His famous example is swans. No matter how many white swans you observe, you can never prove the statement "all swans are white" — there could always be another swan around the corner. A thousand confirmations don't make it true. But a single black swan disproves it instantly. So the logic of discovery is asymmetric: confirmation is weak, refutation is decisive.
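
The asymmetry is easy to see in a few lines of code. This is only a toy sketch: the function name `find_counterexample` and the string-valued "swans" are made up for illustration.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def find_counterexample(claim: Callable[[T], bool],
                        observations: Iterable[T]) -> Optional[T]:
    """Return the first observation that violates the claim, or None."""
    for obs in observations:
        if not claim(obs):
            return obs
    return None


def is_white(swan: str) -> bool:
    # The universal claim under test: "all swans are white".
    return swan == "white"


# A thousand confirmations: the claim survives, but it is still not proven.
thousand_white = ["white"] * 1000
survived = find_counterexample(is_white, thousand_white)

# The same thousand plus one black swan: the claim is refuted.
refuted_by = find_counterexample(is_white, thousand_white + ["black"])

print(survived)    # None
print(refuted_by)  # black
```

A `None` result never means "true". It only means "not falsified by anything seen so far".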

From this, Popper drew a sharp line. A good theory is a bold conjecture that sticks its neck out — it tells you exactly what observation would kill it, and then survives every attempt to kill it. A theory that can't be falsified by any conceivable observation — one that's compatible with literally any outcome — isn't strong, it's empty. That's not science. That's pseudo-science.

This is, more or less, how mainstream science actually works. Anyone can cook up a theory from observations. What earns it the name "scientific" is that it can be falsified and hasn't been — yet.

How this applies to engineering

Once I saw it, I couldn't unsee the parallel. I had been doing the unscientific thing: accumulating confirming examples ("look, it handles this one too!") and hoping the pile would eventually feel tall enough. It never would. That's the white-swan game, and it has no end.

The shift was to invert it. Instead of trying to confirm my system on more and more cases, try to falsify it. Concretely:

  1. Write down every design decision, heuristic, and assumption the system rests on. Make the implicit explicit. You can't test a belief you haven't articulated.
  2. For each one, name its black swan. Ask: what would disprove this? What exact thing has to happen for this decision to be wrong? If you genuinely can't think of anything that would break it, be suspicious — that's the pseudo-science smell. A decision that can't fail usually isn't saying anything.
  3. Instrument the black swan. Put a log line or a metric on the precise condition that would prove the assumption false. Now the system itself watches for its own counterexamples.
  4. If you're big enough, go hunting. This is exactly what Netflix did with Chaos Monkey. Rather than wait for a server to die at 3 a.m. and hope the system coped, they wrote a tool that randomly kills production instances during business hours — deliberately manufacturing the black swan while engineers are awake to watch it. The assumption under test is "we can lose any single instance and survive," and Chaos Monkey tries to falsify it on a schedule. It later grew into a whole "Simian Army" (latency injection, zone failures, and so on). The philosophy is pure Popper: don't trust that you're resilient because nothing has broken yet — actively try to break it, and let the failures find you before your customers do.
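
Steps 1 to 3 can be sketched as a small registry of assumptions, each carrying its own black-swan check. This is a minimal illustration, not the code from my PoC: the `Assumption` and `Tripwires` names, the order-shaped events, and the two example assumptions are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

logger = logging.getLogger("tripwires")


@dataclass
class Assumption:
    name: str                            # step 1: the belief, written down
    black_swan: str                      # step 2: what would disprove it
    is_falsified: Callable[[Any], bool]  # step 3: the check that detects it
    hits: int = 0                        # counterexamples seen so far


class Tripwires:
    def __init__(self) -> None:
        self._assumptions: List[Assumption] = []

    def register(self, assumption: Assumption) -> None:
        self._assumptions.append(assumption)

    def observe(self, event: Any) -> List[str]:
        """Check one event against every assumption; log and count violations."""
        fired = []
        for a in self._assumptions:
            if a.is_falsified(event):
                a.hits += 1
                logger.warning("assumption falsified: %s | black swan: %s | event: %r",
                               a.name, a.black_swan, event)
                fired.append(a.name)
        return fired

    def report(self) -> Dict[str, int]:
        """Map each assumption name to the number of counterexamples seen."""
        return {a.name: a.hits for a in self._assumptions}


# Hypothetical assumptions about a hypothetical order-processing service.
tripwires = Tripwires()
tripwires.register(Assumption(
    name="orders have at least one item",
    black_swan="an order arrives with an empty item list",
    is_falsified=lambda order: len(order["items"]) == 0,
))
tripwires.register(Assumption(
    name="order totals are never negative",
    black_swan="an order arrives with total < 0",
    is_falsified=lambda order: order["total"] < 0,
))

fired_ok = tripwires.observe({"items": ["book"], "total": 12.5})  # nothing fires
fired_bad = tripwires.observe({"items": [], "total": 0})          # first tripwire fires
```

In a real service, the `logger.warning` call is where a metric or alert would hang. The checks should also stay cheap enough to run on every event.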

The point isn't to predict every unknown. It's to set a tripwire on each assumption so the unknown announces itself the moment it arrives.
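
Step 4, going hunting, can be shown in miniature. This is a toy simulation under invented names (`Instance`, `route`, `chaos_round`). It does not use Chaos Monkey or its API. It only mimics the idea of killing one instance at random and checking whether "we can lose any single instance and survive" still holds.

```python
import random
from typing import List


class Instance:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def route(instances: List[Instance], request: str) -> str:
    """Try each instance in turn; fail only if every instance is down."""
    for inst in instances:
        try:
            return inst.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instance left")


def chaos_round(instances: List[Instance], rng: random.Random) -> bool:
    """Kill one random instance, then report whether the assumption survived."""
    victim = rng.choice(instances)
    victim.alive = False
    try:
        route(instances, "ping")
        return True    # assumption survived this attempt
    except RuntimeError:
        return False   # black swan: losing one instance took the service down
    finally:
        victim.alive = True  # restore the victim so rounds stay independent


rng = random.Random(0)

# Three instances: losing any single one should be survivable.
fleet = [Instance("a"), Instance("b"), Instance("c")]
fleet_survived = all(chaos_round(fleet, rng) for _ in range(100))

# One instance: the same experiment falsifies the assumption immediately.
single = [Instance("only")]
single_survived = all(chaos_round(single, rng) for _ in range(100))

print(fleet_survived, single_survived)  # True False
```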

The peace it gave me

This reframe did something I didn't expect: it gave me peace.

I stopped trying to imagine the million examples in advance, because I finally understood that I couldn't, and that chasing them was the wrong game anyway. Instead: state the assumptions, instrument the black swans, and ship it. Unless and until a tripwire fires, there's nothing to worry about. And when one does fire, I'll know instantly — and I'll fix it then, with a real counterexample in hand instead of a hypothetical fear.

The treadmill stopped. Not because the unknowns went away, but because I'd outsourced the worrying to my logs.

Over to you

I'm genuinely curious how others think about this — building large-scale systems that have to handle countless cases you can't enumerate up front. Does the falsification lens resonate, or do you think I'm forcing a philosophy metaphor onto something that needs plain engineering? Tell me where this breaks. And if a particular idea ever gave you peace in the face of the unknown, I'd love to hear it.

Top comments (1)

algorhymer

Thx, you received a like.

Yeah yeah Netflix is cool, check out this jepsen.io/ too! It is not closed source like Netflix.

I personally like these Google search phrases too: "Knuth", "invariant", "Log4j", "Dijkstra", "program synthesis", "ISTQB", "Therac-25", "scientific method", "Oracle investor lawsuit", "Coq", "property-based stateful testing", "static analysis", "Richard S. Bird", "Agda", "QuickCheck/fast-check/Hypothesis/jqwik", "CompCert", "seL4", "Ariane flight V88".

Also here's a really simple fun activity:

  1. Choose a simple thing like bubble sort and a dynamic array/vector as input.
  2. Think up a random dumb proposition: after bubble sort the input's length does not change.
  3. Take a pen and sit down at a table.
  4. Take a sheet of paper and put it on the left side.
  5. Take another sheet and put it on the right side.
  6. On the left, try convincing yourself that the proposition holds; on the right, try convincing yourself that it is a lie.
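
The right-hand sheet can also be handed to a machine. Below is a minimal sketch that assumes plain random input generation from the standard library rather than any of the property-testing libraries named above. The helper `hunt_counterexample` and both propositions are illustrative.

```python
import random
from typing import Callable, List, Optional


def bubble_sort(items: List[int]) -> List[int]:
    """Plain bubble sort, working on a copy of the input."""
    result = list(items)
    n = len(result)
    for i in range(n):
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
    return result


def length_preserved(xs: List[int]) -> bool:
    # Proposition: after bubble sort the input's length does not change.
    return len(bubble_sort(xs)) == len(xs)


def never_reorders(xs: List[int]) -> bool:
    # A deliberately false proposition, to show what a refutation looks like.
    return bubble_sort(xs) == xs


def hunt_counterexample(prop: Callable[[List[int]], bool],
                        trials: int = 1000,
                        seed: int = 0) -> Optional[List[int]]:
    """Throw random lists at the proposition; return the first one that breaks it."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not prop(xs):
            return xs
    return None


no_swan = hunt_counterexample(length_preserved)   # None: survived 1000 attempts
black_swan = hunt_counterexample(never_reorders)  # some unsorted list
```

Surviving a thousand random attempts is the left-hand sheet: it is not a proof. A single returned list is the right-hand sheet, and it settles the matter.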

It is not really a challenge; it is more like a grab-a-beer-let's-talk session with yourself.
Kind of like playing a turn-based strategy game with two players, where you are hotseating both sides.
If it gets boring, you can change the topic by picking a different proposition, different algo, different data structure.
Don't do it too much though, because it'll make you look like me:

[Image: milton]

Yeah... back to Netflix...

"This reframe did something I didn't expect: it gave me peace."

We can learn from analyzing the technical details of Netflix's prod env.
But I think we can also learn something if we simply look at the pixels dancing in prod.
Netflix prod at one time - allegedly - showed these data points:

[Image: netflix]

Allegedly, aka it could be a deepfake, I don't know.

Related to this Netflix subject, here's an oldie but goodie.
If any of the stuff in that pdf feels like it is directly describing 2026... I assure you... that report was done around 1968, give or take a year or two.

All in all, your article was nice, and genuinely new (aka not a generic dev.to level "What was your win this week?" interaction farm), until the question, where I dropped the ball and wasn't able to follow:

"Does the falsification lens resonate, or do you think I'm forcing a philosophy metaphor onto something that needs plain engineering?"

[Image: eek]

What's plain engineering?