DEV Community

When Bad Code Crashes a Billion Windows Computers 🚨

Arjun Vijay Prakash on July 21, 2024

Recently, a significant "bad-code attack" affected many organizations due to a problematic update from CrowdStrike Falcon, a well-regarded cybersec...
 
spO0q 🐒 • Edited

The software supply chain is broken.

One company ships updates that affect the kernel of machines that control critical systems. I mean, you can blame it on the bad commits, but these are only the loud consequences.

Paul J. Lucas
  • Supply chains have nothing to do with this.
  • CrowdStrike does not update the Windows kernel. The only company that updates the kernel is Microsoft.
  • Whether a bad commit was involved is irrelevant. As I pointed out in my other comment, the problem is that CrowdStrike's driver apparently does no file validation.
 
spO0q 🐒

I choose my words. Whether you validate me or not is irrelevant.

I never said CrowdStrike updates the kernel, only "that affect". Cheers.

Paul J. Lucas • Edited

CrowdStrike doesn't "affect" the kernel either. Their code did not cause the kernel code to crash nor affect the kernel in any way. It was CrowdStrike's code that crashed because it dereferenced a null pointer.

The only reason the word "kernel" is being used at all here is because CrowdStrike's code, as a driver, runs in "kernel mode" as opposed to "user mode." Just because code runs in kernel mode doesn't mean it affects the kernel. "Kernel mode" and "kernel" are two different things.

Perhaps choose words more carefully.

spO0q 🐒

"Perhaps choose words more carefully"

The supply chain is globally broken, to me, whether it's because of Windows or not:

"CrowdStrike's code that crashed because it dereferenced a null pointer"

Their crash crashed the machines. It should not be possible, and it means any other solution with the same patterns/privileges would have caused this mess, which means it's an attack vector (complex but possible).

My bad for the "kernel mode" vs "kernel," though. While affecting the core with the same privileges as the kernel and damaging the kernel itself are two different things, it does not change my point.

Paul J. Lucas

A "supply chain" isn't what you think it is. This is. Windows is wholly made in Redmond; CrowdStrike's software is wholly made by them. Neither is sourcing raw materials to make software from all over the globe.

"Their crash crashed the machines. It should not be possible, and it means any other solution with the same patterns/privileges would have caused this mess, which means it's an attack vector (complex but possible)."

True, but not relevant to the only points I made in my original post.

spO0q 🐒

When I say stupid things, I do recognize it.

Supply chain is that for me in this case, though.

Even if it's unfortunate here, the "threat" comes from a vendor, which is enough to bring the term "supply chain" into the debate:

"A supply chain attack, also called a value-chain or third-party attack, occurs when someone infiltrates your system through an outside partner or provider with access to your systems and data."

Paul J. Lucas

Except in this case, there is no "someone" that infiltrated any systems. No person or group used CrowdStrike's software to infiltrate Windows systems. It was entirely CrowdStrike's fault.

spO0q 🐒

Yes, but the result is much the same. A supply chain problem can be intentional or the result of an accident.

Paul J. Lucas

I view the "someone" as a necessary condition. Anyway, I'm not going to argue it further. Believe whatever you want.

spO0q 🐒

"I view the "someone" as a necessary condition"

no, it's part of the risk.

"Anyway, I'm not going to argue it further. Believe whatever you want."

too bad :(

Paul J. Lucas

The driver itself wasn't updated, so while drivers should of course be tested, that's irrelevant here. The problem was that their driver apparently does no file validation to ensure the format is sane.
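To make "file validation" concrete, here is a minimal sketch in Python. The file layout (magic bytes, version, record count, fixed-size records) and the names `parse_content` and `load_or_keep` are invented for illustration; they are assumptions, not CrowdStrike's actual format or code. The idea is only that every structural claim in the file is checked before anything is read from it, and that a bad file is rejected rather than allowed to crash the loader.

```python
import struct

# Hypothetical layout, invented for illustration (not a real vendor format):
# 4-byte magic, 2-byte version, 2-byte record count, then fixed-size records.
MAGIC = b"CFG1"
HEADER = struct.Struct("<4sHH")
RECORD_SIZE = 8


class InvalidContentFile(ValueError):
    """Raised when a content file fails a sanity check."""


def parse_content(data: bytes) -> list[bytes]:
    """Validate the whole file up front and return its records, or raise."""
    if len(data) < HEADER.size:
        raise InvalidContentFile("file shorter than header")
    magic, version, count = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise InvalidContentFile("bad magic bytes")
    if version != 1:
        raise InvalidContentFile(f"unsupported version {version}")
    expected = HEADER.size + count * RECORD_SIZE
    if len(data) != expected:
        raise InvalidContentFile(f"expected {expected} bytes, got {len(data)}")
    body = data[HEADER.size:]
    return [body[i:i + RECORD_SIZE] for i in range(0, len(body), RECORD_SIZE)]


def load_or_keep(data: bytes, last_good: list[bytes]) -> list[bytes]:
    """Fall back to the last known-good records instead of crashing."""
    try:
        return parse_content(data)
    except InvalidContentFile:
        return last_good
```

The fallback in `load_or_keep` is one possible design choice: a component that runs in kernel mode cannot afford to fail hard on bad input, so keeping the previous content is safer than trusting the new file.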

Their apology is pretty worthless. It's just damage control. Their issue is one of naïveté, arrogance, or otherwise good developers being forced to “ship it” by management.

Arjun Vijay Prakash

"No file validation" - agreed, that's a key issue.

But damage control is important too, though, for managing the fallout and maintaining trust, right?

Thanks for the insight, by the way!

Paul J. Lucas

Damage control is important to prevent the stock price from tanking too much.

Trust? They've lost that. Time and the market will tell if they can regain it. It depends on how many competitors they have and how lazy companies are about switching to another vendor.

leob • Edited

Great overview, especially with regard to the technical details (kernel mode drivers etc), but this leaves one big question:

HOW on earth could their testing not have picked this up? I mean, if this update crashed all of their customers' computers, it would surely crash any of their "test" computers when the update got installed on them ...

There's one theory I can come up with: they produced the update, and tested it extensively, and then there was one "cowboy" in their organization who thought:

"Oh, let me just quickly add this last-minute 'improvement' that I've put together - for sure it's harmless, and I'm sure it works, no need to go through full QA/testing again!"

Meaning they probably need to further tighten their (presumably already tight) procedures ...

By the way, I totally agree with the "incremental rollout" suggestion - I was astonished when I realized that they rolled out this update to ALL of their customers at the same moment ...

I mean, it's so obvious:

Roll it out to a few customers first ... let it simmer for a day or two ... roll it out to a bigger group ... only then roll it out to everyone else.
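A rough sketch of that kind of staged rollout, in Python. The stage sizes, the crash-rate threshold, and the function names are all assumptions made up for this example, not anything CrowdStrike has described. Each host is hashed into a stable bucket, a stage admits every bucket below its limit, and the rollout only widens while the observed crash rate stays low.

```python
import hashlib

# Hypothetical stage limits in basis points of the fleet: 1%, 10%, 100%.
STAGES = [100, 1_000, 10_000]


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..9999 using a hash."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def eligible(host_id: str, stage: int) -> bool:
    """A host receives the update once its bucket falls under the stage limit."""
    return bucket(host_id) < STAGES[stage]


def next_stage(stage: int, crash_rate: float, max_crash_rate: float = 0.001) -> int:
    """Advance only if the current stage looks healthy; otherwise stay put."""
    if crash_rate > max_crash_rate:
        return stage  # halt: someone should investigate before widening
    return min(stage + 1, len(STAGES) - 1)
```

Because the bucket depends only on the host ID, a host that was eligible in an early stage stays eligible in every later one, so the groups grow without reshuffling.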

P.S. was it really a billion computers? I read somewhere that one percent (1%) of the Windows computers worldwide was affected by this - does this mean that ONE HUNDRED BILLION Windows computers exist globally?

Paul J. Lucas

The rationale for doing an update to every customer simultaneously is that you want to protect them from threats ASAP.

Suppose there's a new threat in the wild. CrowdStrike codes up an update to neutralize the new threat. But suppose CrowdStrike instead did gradual updates. Suppose you're one of the customers who did not get the update first. Your computers remain vulnerable. Now suppose your computers are compromised by the threat. You blame CrowdStrike for not updating your computers in a timely manner.

Basically, CrowdStrike is damned either way.

leob

They're not damned either way - they should simply ALWAYS do a manual (not just automated) sanity check by installing any update, no matter how tiny, on a few of their test computers before releasing it - that's what went wrong here ... but yeah it's always easy in hindsight :D

P.S. They've now announced that they're considering more gradual rollouts, even if spaced apart by just a few hours rather than days or weeks. It's probably worth the tradeoff re vulnerabilities.

Paul J. Lucas

I agree that they should have done better in-house testing. However, that's not what my comment was about. My comment was only about the policy of rolling out updates simultaneously to customers.

As for their updated policy to do staggered roll-outs, only time will tell if it's a better policy. Certainly if the rollout is spread across hours and not days, the likelihood that some customer somewhere will (a) get compromised during that roll-out window because they didn't get the update yet, (b) realize they didn't get the update as soon as they possibly could have, and (c) sue CrowdStrike as a result is extremely small, but non-zero.

Of course, if I were CrowdStrike, I'd put something into their customer contract that guards against that possibility just to be legally bullet-proof.

leob

Yeah just put that into their contract ... there will ALWAYS be a vulnerability window no matter how short, so yeah :)

Arjun Vijay Prakash

I had the same thought. Adding to that: or was it the work of an intern?

Loved that you pointed it out. Actually, the term "billion" in the article was metaphorical, meant to emphasize the scale of the issue. I know the real number is far lower.

Thanks for the feedback!

leob

Two days later, and the mystery is solved - it slipped through their QA because there was a bug in their test/QA software itself! So their QA program/software said "yes it's good", but in reality it wasn't ...

Still baffles me that they didn't do a quick manual sanity check as well, and ONLY relied on the automated testing, but okay ... they (the supplier) have now also indicated that in the future they want to move away from "big bang" updates, and towards more gradual rollouts - sounds like a good idea to me ;-)

leob

It must have slipped through their QA procedures SOMEHOW but yeah it's baffling ...

Best Codes

It's estimated that about 8.5 million computers were directly affected by the CrowdStrike crash, which is a long way off from 1 billion. 😂

Other users would have been indirectly affected by the services that were directly affected, like AWS, Gmail, etc. but the estimates are still far less than a billion.

Nice article though. :)

Arjun Vijay Prakash • Edited

Actually, the term "billion" in the article was indeed metaphorical and meant to emphasize the scale of the issue.

I know the real number is far lower.

Thanks!

Anmol Baranwal

Loved reading the complete details :)

Arjun Vijay Prakash

Appreciate the kind words, Anmol!

PS: I enjoy reading your articles as well. Keep up the great work, man.

Alex

This is why monkey patching an insecure-by-design system is a bad idea. MS is famous for being cheap on development, and CrowdStrike is upholding the tradition.

Paul J. Lucas

This isn't really a case of monkey patching; that applies mostly to non-compiled languages. The CrowdStrike case is closer either to dynamic loading or to using data as code, i.e., pseudo-code.
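To illustrate what "data as code" means here, a toy sketch in Python. The rule format and the function names are hypothetical, invented for this example; this is not CrowdStrike's interpreter. The rules are plain data, yet they steer what the interpreter does, so a malformed rule can misbehave just like a bug in the code itself unless the interpreter checks it first.

```python
# Hypothetical rule format: (input_index, expected_value). The rules are plain
# data, but the interpreter's behavior depends on them, so they act like code.
Rule = tuple[int, str]


def matches(rule: Rule, inputs: list[str]) -> bool:
    """Evaluate one rule, rejecting indexes the inputs cannot satisfy."""
    index, expected = rule
    if not 0 <= index < len(inputs):
        # In C, reading inputs[index] here would be an out-of-bounds read.
        raise ValueError(f"rule refers to input {index}, only {len(inputs)} supplied")
    return inputs[index] == expected


def evaluate(rules: list[Rule], inputs: list[str]) -> list[bool]:
    """Run every rule against the inputs."""
    return [matches(rule, inputs) for rule in rules]
```

For example, `evaluate([(0, "a"), (1, "x")], ["a", "b"])` returns `[True, False]`, while a rule pointing at index 2 raises `ValueError` instead of reading past the end of the inputs.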