Harish

Posted on Jun 21

You can't test your way to certainty — so falsify instead

#software #architecture #computerscience

I'm currently building a PoC that rebuilds one of our core services from scratch: same job, completely different architecture. The design didn't come from a whiteboard moment; it accreted. I'd solved one issue in the existing system, then another, then another, and one day I looked back at the pile of fixes and it hit me: this doesn't seem right anymore. The old shape couldn't hold all the patches. So I started over.

And that's where the trouble began.

The fear of infinite examples

As I implemented the new architecture, I worked the way most of us do: run it through an example, watch it break, add a tweak to handle that case, repeat. It felt productive. Each example I fixed made the system a little more capable.

Then a quiet dread crept in: what if this never ends?

I had 20 examples to test the PoC. But 20 was only the start — soon it would be 500, then eventually a million. I could feel myself on a treadmill. Every increment in scale would surface new cases, and I'd be patching forever, never able to point at the system and say "it's done." I couldn't see the end. The space of possible inputs was just too vast, and I was trying to map it one example at a time.

The worst part wasn't the work. It was not knowing how much work was left — or whether "left" even had a bottom.

Engineers have empirical bias

Here's the thing I eventually named. We engineers make a huge number of design decisions based on the examples we happen to observe. We see a few hundred cases, spot the patterns, and bake those patterns into the system. I call this trap empirical bias. Data scientists know close relatives of it as sampling bias and overfitting: fitting your conclusions to the sample you happened to look at.

In a small project that's fine; you've basically seen everything. But in a large project, the unknowns are unknown. Nobody has observed them. They're not on any list of edge cases because no one knows they exist yet.

Which leads to the question that was actually keeping me up:

How do you know the things that you don't know?

A surprising answer: Karl Popper's falsification

The answer didn't come from an engineering blog. It came from philosophy of science.

Karl Popper was a 20th-century philosopher who asked what separates real science from things that merely sound scientific. His answer was falsifiability. A claim is scientific only if it makes a prediction that could, in principle, be proven wrong.

His famous example is swans. No matter how many white swans you observe, you can never prove the statement "all swans are white" — there could always be another swan around the corner. A thousand confirmations don't make it true. But a single black swan disproves it instantly. So the logic of discovery is asymmetric: confirmation is weak, refutation is decisive.

From this, Popper drew a sharp line. A good theory is a bold conjecture that sticks its neck out — it tells you exactly what observation would kill it, and then survives every attempt to kill it. A theory that can't be falsified by any conceivable observation — one that's compatible with literally any outcome — isn't strong, it's empty. That's not science. That's pseudo-science.

This is, more or less, how mainstream science actually works. Anyone can cook up a theory from observations. What earns it the name "scientific" is that it can be falsified and hasn't been — yet.

How this applies to engineering

Once I saw it, I couldn't unsee the parallel. I had been doing the unscientific thing: accumulating confirming examples ("look, it handles this one too!") and hoping the pile would eventually feel tall enough. It never would. That's the white-swan game, and it has no end.

The shift was to invert it. Instead of trying to confirm my system on more and more cases, try to falsify it. Concretely:

Write down every design decision, heuristic, and assumption the system rests on. Make the implicit explicit. You can't test a belief you haven't articulated.
For each one, name its black swan. Ask: what would disprove this? What exact thing has to happen for this decision to be wrong? If you genuinely can't think of anything that would break it, be suspicious — that's the pseudo-science smell. A decision that can't fail usually isn't saying anything.
Instrument the black swan. Put a log line or a metric on the precise condition that would prove the assumption false. Now the system itself watches for its own counterexamples.
If you're big enough, go hunting. This is exactly what Netflix did with Chaos Monkey. Rather than wait for a server to die at 3 a.m. and hope the system coped, they wrote a tool that randomly kills production instances during business hours — deliberately manufacturing the black swan while engineers are awake to watch it. The assumption under test is "we can lose any single instance and survive," and Chaos Monkey tries to falsify it on a schedule. It later grew into a whole "Simian Army" (latency injection, zone failures, and so on). The philosophy is pure Popper: don't trust that you're resilient because nothing has broken yet — actively try to break it, and let the failures find you before your customers do.

The point isn't to predict every unknown. It's to set a tripwire on each assumption so the unknown announces itself the moment it arrives.
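A minimal sketch of the first three steps, assuming a Python service and the standard `logging` module. The `Assumption` class, the `payload_under_1mb` example, and `handle_request` are all hypothetical names made up for illustration; they are not taken from the PoC described in this post.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable

logger = logging.getLogger("tripwires")


@dataclass
class Assumption:
    """One explicit design assumption plus the condition that would falsify it."""
    name: str                           # step 1: the belief, written down
    black_swan: str                     # step 2: the observation that would disprove it
    is_violated: Callable[[Any], bool]  # step 3: detects that observation at runtime
    violations: int = 0

    def check(self, observation: Any) -> bool:
        """Return True if the assumption survived this observation."""
        if self.is_violated(observation):
            self.violations += 1
            logger.warning(
                "TRIPWIRE %s fired: saw %s (observation=%r)",
                self.name, self.black_swan, observation,
            )
            return False
        return True


# Hypothetical assumption: request bodies stay under 1 MiB.
payload_is_small = Assumption(
    name="payload_under_1mb",
    black_swan="a request body larger than 1 MiB",
    is_violated=lambda size_bytes: size_bytes > 1_048_576,
)


def handle_request(size_bytes: int) -> str:
    # The tripwire only reports; the request is still processed.
    payload_is_small.check(size_bytes)
    return "processed"
```

Calling `handle_request(2_000_000)` still returns `"processed"`, but it also logs a warning and increments `payload_is_small.violations`. Whether a fired tripwire should only log, or also reject the input, is a separate design choice that this sketch leaves open.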

The peace it gave me

This reframe did something I didn't expect: it gave me peace.

I stopped trying to imagine the million examples in advance, because I finally understood that I couldn't, and that chasing them was the wrong game anyway. Instead: state the assumptions, instrument the black swans, and ship it. Unless and until a tripwire fires, there's nothing to worry about. And when one does fire, I'll know instantly — and I'll fix it then, with a real counterexample in hand instead of a hypothetical fear.

The treadmill stopped. Not because the unknowns went away, but because I'd outsourced the worrying to my logs.

Over to you

I'm genuinely curious how others think about this — building large-scale systems that have to handle countless cases you can't enumerate up front. Does the falsification lens resonate, or do you think I'm forcing a philosophy metaphor onto something that needs plain engineering? Tell me where this breaks. And if a particular idea ever gave you peace in the face of the unknown, I'd love to hear it.

Top comments (7)

algorhymer • Jun 21

Thx, you received a like.

Yeah yeah Netflix is cool, check out this jepsen.io/ too! It is not closed source like Netflix.

I personally like these google search phrases too: "Knuth", "invariant", "log4j", "Dijkstra", "program synthesis", "istqb", "therac-25", "scientific method", "Oracle investor lawsuit", "Coq", "property based stateful testing", "static analysis", "Richard S. Bird", "Agda", "quickcheck/fastcheck/hypothesis/jqwik", "Compcert", "sel4", "ariane flight V88".

Also here's a really simple fun activity:

Choose a simple thing like bubble sort and a dynamic array/vector as input.
Think up a random dumb proposition: after bubble sort the input's length does not change.
Take a pen and sit down at a table.
Take a sheet of paper and put it on the left side.
Take another sheet and put it on the right side.
On the left, try convincing yourself that the proposition holds; on the right, try convincing yourself that it is a lie.
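The right-hand sheet can also be handed to a machine. Below is a rough sketch, assuming Python and only the standard library, that hunts for a counterexample to "after bubble sort the input's length does not change" by throwing random lists at it. The names `bubble_sort` and `find_counterexample` are made up for this illustration. Libraries from the search list above, such as Hypothesis or QuickCheck, do the same job far more thoroughly.

```python
import random
from typing import Optional


def bubble_sort(items: list[int]) -> list[int]:
    """Plain bubble sort that returns a new sorted list."""
    result = list(items)
    n = len(result)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:
            break  # no swaps in a full pass means the list is sorted
    return result


def find_counterexample(trials: int = 1000, seed: int = 0) -> Optional[list[int]]:
    """Try to falsify 'sorting does not change the length'.

    Returns the first input that breaks the proposition, or None if
    every random trial survived.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        if len(bubble_sort(data)) != len(data):
            return data
    return None
```

A result of `None` only means the proposition survived these trials. It is not a proof, which is the same asymmetry as in the swan example from the article.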

It is not really a challenge, it is more like a grab-beer-lets-talk session with your own self.
Kind of like playing a turn-based strategy game with two players where you are hotseating both sides.
If it gets boring, you can change the topic by picking a different proposition, different algo, different data structure.
Don't do it too much though, because it'll make you look like me:

Yeah... back to Netflix...

"This reframe did something I didn't expect: it gave me peace."

We can learn from analyzing the technical details of Netflix's prod env.
But I think we can also learn something, if we literally just simply look at the pixels dancing in prod.
Netflix prod at one time - allegedly - showed these data points:

Allegedly, aka it could be deep fake, I don't know.

Related to this Netflix subject, here's an oldie but goodie.
If any of the stuff in that pdf feels like it is directly describing 2026... I assure you... that report was done around 1968, give or take a year or two.

All in all, your article was nice, and genuinely new (aka not a generic dev.to level "What was your win this week?" interaction farm), until the question, where I dropped the ball and wasn't able to follow:

"Does the falsification lens resonate, or do you think I'm forcing a philosophy metaphor onto something that needs plain engineering?"

What's plain engineering?

Harish • Jun 24 • Edited

Thank you so much for sharing insights and useful resources!
Oh, by plain engineering I meant going through countless examples and thinking deeply to come up with a general solution. I am also realising the Netflix example I shared is not the best one here. What I was referring to is more about edge cases. Let's say you are scraping a web page, and the first thing you would expect is to get the response in one direct fetch. But then there are client-side rendered apps, firewalls, cookie blocks, etc.
If you ask how the heck I am going to come up with a metric or log to detect that, I don't know yet. But I feel actively looking for that case is the right direction. The crux of the Netflix example is to simulate that environment well before it happens so you're ready 😅

The experiment you described is what I feel is lacking in my thought process. I feel like I don't spend enough time recognising the bounds and defying them too.
I also like this idea: Distributed Systems Safety Research as a Service. There is going to be huge demand for this now that we are pushing a lot of AI-generated code...

algorhymer • Jun 24

"Oh by plain engineering, I meant going through countless examples thinking deeply to come up with a general solution."

Can you give me an example?

Harish • Jun 27

I am building a scraper to download PDFs from a website. I started with a simple direct HTTP fetch, then looked up .pdf anchor tags to download them.
But every day I encounter a new case that hasn't been handled, like:

PDF links without an explicit .pdf extension
PDF links that are hydrated through an API call and not available in the HTML
PDF links that are sometimes found in a script tag
files that cannot be downloaded without the right headers and cookies in place
IP rate limits from the website, WAF blocks, etc.

The whole article was inspired by this project. Every time I hit a new case, I would feel the doubt in my head: have I built something generic, and if not, how do I make this resilient? Falsification gave me some kind of an answer. I have yet to figure out how to implement these specific signals in this specific case...
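One possible shape for those signals, offered as a sketch and not as what this scraper actually does. It uses only the Python standard library and treats the core assumption as "PDF links are anchor tags ending in .pdf in the static HTML". The tripwire names, the status codes chosen, and the helpers `scan` and `looks_like_pdf` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PdfAnchorParser(HTMLParser):
    """Collects hrefs of <a> tags whose path ends in .pdf."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string before checking the extension.
        if href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(href)


def tripwires(status: int, html: str, pdf_links: list[str]) -> list[str]:
    """Return the names of the assumptions this response falsified."""
    fired = []
    if status in (403, 429):
        # Assumption: a plain fetch is allowed (no WAF block, no rate limit).
        fired.append("blocked_or_rate_limited")
    if not pdf_links and "pdf" in html.lower():
        # Assumption: every PDF on the page is a plain .pdf anchor.
        # "pdf" shows up somewhere (script tag, API URL), yet no anchor matched.
        fired.append("pdf_mentioned_but_no_anchor")
    return fired


def looks_like_pdf(body: bytes) -> bool:
    """PDF files normally start with the bytes %PDF-; an HTML error page will not."""
    return body.startswith(b"%PDF-")


def scan(status: int, html: str) -> tuple[list[str], list[str]]:
    """Parse one fetched page and report (pdf_links, fired_tripwires)."""
    parser = PdfAnchorParser()
    parser.feed(html)
    return parser.pdf_links, tripwires(status, html, parser.pdf_links)
```

For example, `scan(200, '<script>var u="/api/file?type=pdf&id=3"</script>')` finds no links but fires `pdf_mentioned_but_no_anchor`, which is the cue to go look at that page by hand. Counting how often each tripwire fires per site would show which assumption is breaking most often.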

algorhymer • Jun 27

Interesting problem! Yeah!
It requires resilience and creativity to solve a problem such as that.
Please understand that I think you are a skilled individual.
Below parts are not a critique of your work at all, but rather about the zeitgeist of this activity we call programming.

On the other hand, I'd not classify this as engineering.

You are forced to work around ill-defined, man-made, arbitrary, undocumented decisions made by others.
Some have reasons, some are simply others' negligence.
I would classify this as intelligently shooting in the dark and judging by the echo of the bullet how to adjust your aim: Educated Guessing

For me, subjectively, engineering is a different thing.
For example, mechanical engineering originally started off just like what you've described, hacking away at it in the dark.
Eventually though, that engineering field matured.
Those people have standards, documented laws, openly available design principles which are equations, not mere words.
All of this has benefits, because that field has been defanged to a certain degree.
Example: Say you need a ball bearing. Manufacturers don't sell you 'Try this framework it is good' in a conniving way.
You visit the site, enter load conditions and other parameters, and - based on equations freely available - they predict for example how long that thing will last for you, of course approximately and with error range.
Mechanical engineering is not fully mature, and lobbying by companies such as Schaeffler is always trying to undermine the profession's integrity for a short term market gain, but it is on the right track.

Software engineering, on the other hand, is still in its infancy.
Manufacturers are not reined in, court cases are rare, financial fines are low, so bad-faith actors pay it out from their pocket change.
Warranties are non-existent.
Standards are not enforced.
Therefore it often feels like what your cool example above described:
Educated Guessing

This is exactly the reason why I stopped doing new Advent of Code puzzles at:
2024 Day 12 Part 2
It was no longer engineering, just Educated Guessing, for me.

Please understand this is my own subjective perspective.
For example Google prefers a more lenient approach.
They deleted the infra from under UniSuper, which was an attack against the Australian public's financial well-being.
Imagine if someone did that scale of damage to physical infrastructure.
It would be bigger than the two towers and the planes.

Good luck in the crazy world of scraping.

Harish • Jun 28

I do understand why you describe this as "educated guessing" or "shooting in the dark". It does feel that way sometimes.
I love Advent of Code, simply because it exposes me to the broad set of tools available in a language so I can get familiar with them. I use it for that, not for problem solving.

But I'm still unclear on how you would define engineering.
Also dude we should connect. Email me arishh2@gmail.com or x.com/HarishTeens
Whichever you prefer.

algorhymer • Jun 28

I'd say this might be an example.
You should talk to the author @arashkabiri

TLDR on my low level: He makes cars able to move. Grease/oil/lubricant. Otherwise gears and whatnot eat each other out etc. and start not working, like Netflix's UI after opening 10 previews.
Besides his main job, he did JS to create a form which encodes the principles, and presents it in a Q&A kind of hand-held approach.
Aka the grease/lubricant/idk is the Engineering, while the js is packaging.

I think he might be able to shed light on the subject for you better than I can...

Because you missed my point about Advent of Code 2024 Day 12 Part 2 which summarizes my stance, and gives an actual reproducible case.
Before that, in each year:

You were given a problem statement.
Algorithmically solvable, deterministic.
The question was about correctness/time complexity/space complexity/generalizability. Which are Engineering level properties.

That day though. I solved it ofc, I can. Do you know how I named the function which did it?

juanJoyaBorja, google it, it is a guy who became a meme.

I chose that name because that problem broke the social contract: it introduced eyeballing.

Jobs are about eyeballing. So when I go home, trying to enjoy my free time... in my last place of normalcy... I am forced to adapt to eyeballer culture?! Nope. Nono. No more Kate. Kate goes home

You truly deeply, deep down in your heart know, you must know:
Dude, it is a webscraper. It is not engineering. You are crawling random graphs in a specific webdriver at most which says it is a Protocol but in reality it is pinky promise. Browser? Hell, Brave used to turn your machine into a cryptominer.
And above that layer?
You are getting bugged out because you have zero followable spec.
You are managing it well, fighting against the waves, but come on...
Would you like a bike like that?
Even the dirt under it has no guarantee or warranty, and zero legal liability, just a wall-of-text EULA about telemetry and copyright.
I do not claim it isn't intelligent, or whatever to pull it off, mind you.
I am just saying there's a difference between 'it works' and 'This product has warranty, and we as a team, stand by that warranty, with our money and our full legal liability. And despite how confident we are in it: We do have damage containment protocols in case vis maior happens.'.
Compare that to Facebook, yanking out redundancy in its new data center.

Here's a last rhetorical question: Are you peer-reviewed? Are you properly QA-d by people who do not have vested interest in just 'flagging it green'?

See conflict of interest in its full glory, right here

This is our job: Ship it.