Arjun Vijay Prakash

Posted on Jul 21, 2024

When Bad Code Crashes a Billion Windows Computers 🚨

#watercooler #microsoft #cybersecurity #news

Recently, a significant "bad-code" incident hit many organizations when a problematic update to CrowdStrike Falcon, a well-regarded cybersecurity product, crashed Windows PCs.

This incident caused widespread issues, including business interruptions, delayed flights, and disrupted news broadcasts.

This blog aims to examine the technical aspects of what happened, how it impacted systems, and the measures taken to resolve the issue. Let's get into it:

Where and When did it start?

Troy Hunt, the creator of Have I Been Pwned, first brought attention to the issue on Friday, July 19, 2024.

Imagine waking up on deployment day and seeing the screen full of this shade of #0664e4. I don't really care if it has a name.

Initially, there were concerns about a potential cyberattack from a hacker, but it became clear that the problem came from a faulty update issued by CrowdStrike.

This update caused numerous computers to show the Blue Screen of Death (BSOD), sarcastically turning the day into the International Day of BSOD.

CrowdStrike Falcon: A Technical Overview

Yes, CrowdStrike protects 298 of the Fortune 500 companies!

Let's take a look at Falcon now:

CrowdStrike Falcon is a sophisticated Endpoint Detection and Response (EDR) solution designed to protect enterprise systems from cybersecurity threats.

Unlike traditional antivirus software that operates primarily at the user level, Falcon integrates deeply with the operating system, especially Windows, using "kernel-mode drivers" to monitor and intercept potential threats at a very low level.

I mention the Fortune 500 figure just for the sake of the "statistics."

The Role of Kernel-Mode Drivers

Kernel-mode drivers operate at a privileged level within the operating system, providing them with direct access to hardware and system resources.

This allows them to perform critical tasks efficiently. However, any issues with these drivers can lead to severe system instability, as they interact closely with the core components of the operating system.

And this was the sole reason for this world drama.
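To make the contrast concrete, here is a loose user-mode analogy in Python (not kernel code, and not how Falcon works). A fault inside a separate process is contained by the operating system, so the caller survives and only sees an exit code. A fault in kernel mode has no such containment, which is why it ends in a BSOD. The helper name `run_isolated` is made up for this sketch.

```python
import subprocess
import sys


def run_isolated(snippet: str) -> int:
    """Run a Python snippet in a separate process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return result.returncode


# A simulated fault: the child process dies, the parent keeps running.
faulty_exit = run_isolated("raise RuntimeError('simulated fault')")
healthy_exit = run_isolated("print('ok')")

print("faulty snippet exit code:", faulty_exit)
print("healthy snippet exit code:", healthy_exit)
print("parent process is still alive")
```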

The Faulty Update

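The problematic update from CrowdStrike Falcon was not a new driver but a content configuration file, a so-called "channel file" that carries a .sys extension even though it is not a driver. Early reports described the file as filled with zeroes; CrowdStrike later stated that the crash came from a logic error in how the Falcon sensor processed the file's contents, not from null bytes.

When Falcon's driver, running in kernel mode, read this file, it crashed and took Windows down with it, leading to the BSOD.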

This error multiplied across many systems due to the widespread use of CrowdStrike Falcon in business environments.
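A common defense against a malformed file like this is to validate it before using it, a point also raised in the comments. The sketch below shows the idea on an invented binary format: the `CHNL` signature, the field layout, and the function name are all assumptions for illustration and have nothing to do with CrowdStrike's real file format.

```python
import struct
from typing import List

MAGIC = b"CHNL"                  # hypothetical 4-byte signature
HEADER = struct.Struct("<4sH")   # signature + number of fields
FIELD = struct.Struct("<I")      # each field is one unsigned 32-bit integer


def parse_content_file(data: bytes, expected_fields: int) -> List[int]:
    """Validate a (made-up) content file and return its fields.

    Raises ValueError instead of reading past the end of the buffer.
    """
    if len(data) < HEADER.size:
        raise ValueError("file too short for header")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad signature")
    if count != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, file declares {count}")
    if len(data) != HEADER.size + count * FIELD.size:
        raise ValueError("file length does not match declared field count")
    return [
        FIELD.unpack_from(data, HEADER.size + i * FIELD.size)[0]
        for i in range(count)
    ]


good_file = MAGIC + struct.pack("<H", 3) + struct.pack("<3I", 1, 2, 3)
zero_file = bytes(len(good_file))  # same size, but all zeroes

print(parse_content_file(good_file, expected_fields=3))
try:
    parse_content_file(zero_file, expected_fields=3)
except ValueError as err:
    print("rejected:", err)
```

The point of the design is that every read is checked against the buffer length and the declared counts first, so a bad file becomes a rejected update rather than a crash.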

Widespread Impact

The faulty update had far-reaching consequences:

Business Operations: Many businesses experienced interruptions, leading to cancelled meetings and halted workflows.
News Broadcasts: News networks faced significant disruptions in their broadcasting capabilities.
Flight Operations: Airports encountered delays as critical systems used for managing flights were rendered inoperative.
Retail Operations: Stores relying on computer systems for sales and inventory management faced operational challenges.

Globally, 5,078 flights, 4.6% of those scheduled that day, were cancelled.

Everywhere, this blue screen was found.

CrowdStrike's stock is down by ~20% this month.

Response from Key Figures

George Kurtz, CEO of CrowdStrike, addressed the issue, emphasizing their efforts to rectify the situation. However, from my point of view (of course), his communication lacked a "direct apology", which some interpreted as a failure to acknowledge the severity of the problem.

Later, in an interview, he did apologize.

In contrast, Satya Nadella, CEO of Microsoft, provided a clear and concise statement, reassuring users that Microsoft was working closely with CrowdStrike to resolve this issue.

Technical Resolution Steps

https://en.wikipedia.org/wiki/2024_CrowdStrike_incident#Remedy
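The remedy at the link above follows the manual workaround CrowdStrike published: boot into Safe Mode or the Windows Recovery Environment, delete the file matching C-00000291*.sys in the CrowdStrike drivers directory, and reboot. Below is a minimal sketch of the "find and delete" step only. The function names are hypothetical, it defaults to a dry run, and it is not an official tool; on an affected machine this step was typically done by hand.

```python
from pathlib import Path
from typing import List

# File pattern and directory as given in the public workaround.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def find_faulty_channel_files(directory: Path) -> List[Path]:
    """Return the files in `directory` that match the faulty channel file pattern."""
    if not directory.is_dir():
        return []
    return sorted(directory.glob(CHANNEL_FILE_PATTERN))


def remove_faulty_channel_files(directory: Path, dry_run: bool = True) -> List[Path]:
    """List matching files, and delete them only when dry_run is False."""
    matches = find_faulty_channel_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    for path in remove_faulty_channel_files(DEFAULT_DIR, dry_run=True):
        print("would delete:", path)
```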

Lessons Learned and Future Precautions

This incident underscores the importance of rigorous testing for software updates, particularly those involving kernel-mode drivers.

It also highlights the need for clear and empathetic communication from companies when issues arise.

Moving forward, companies can adopt several best practices:

Comprehensive Testing: Implement thorough pre-release testing procedures to identify potential issues.
Incremental Rollouts: Deploy updates gradually to monitor for issues before widespread distribution.
Clear Communication: Provide transparent and empathetic communication to affected users, including detailed steps for resolution.
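The incremental rollout idea can be sketched in a few lines. This is an illustration under my own assumptions, not CrowdStrike's actual process: `deploy` and `is_healthy` are placeholder callbacks, and the 1% / 10% / 100% stages are arbitrary.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    cumulative_percents: Sequence[int],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy in growing stages and stop as soon as a stage has an unhealthy host.

    Returns (hosts that received the update, whether the rollout completed).
    """
    updated: List[str] = []
    for percent in cumulative_percents:
        # Ceiling of len(hosts) * percent / 100, using integers only.
        target = min(len(hosts), (len(hosts) * percent + 99) // 100)
        batch = list(hosts[len(updated):target])
        for host in batch:
            deploy(host)
        updated.extend(batch)
        if any(not is_healthy(host) for host in batch):
            return updated, False  # halt: later stages never get the update
    return updated, True


fleet = [f"host-{i}" for i in range(100)]

# Simulate a bad update: every host that receives it becomes unhealthy.
crashed = set()
touched, completed = staged_rollout(
    fleet, (1, 10, 100), deploy=crashed.add, is_healthy=lambda h: h not in crashed
)
print(len(touched), completed)
```

With a bad update, only the first stage is hit before the rollout halts, instead of the whole fleet at once. The tradeoff, discussed in the comments, is that later stages stay unprotected against a new threat for a little longer.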

The third one is the most important. When a significant issue occurs, it's essential for company leaders to publicly apologize.

While an apology alone cannot undo the damage, it shows accountability and a commitment to addressing the problem.

Beyond apologizing, they should work towards resolving the issue by engaging with customers directly, for example through technical meetings.

Conclusion

In conclusion, this situation shows how important it is for companies to test their updates carefully before releasing them.

Mistakes can have huge impacts, as we’ve seen with the problems caused by the bad update.

Companies should fix problems clearly and honestly, and always be ready to help their customers through the mess.

As mentioned earlier, even though they can’t undo what’s happened, they should work hard to make things better and keep everyone informed.

Comment your thoughts on this billion computers outage drama.

Connect with me @ Linktree. Follow me on @ Twitter.

Happy Coding! Thanks for 26498!

Top comments (28)

Paul J. Lucas • Jul 22 '24

The driver itself wasn't updated, so while of course such should be tested, it's irrelevant here. The problem was their driver apparently does no file validation to ensure the format is sane.

Their apology is pretty worthless. It's just damage control. Their issue is one of naïveté, arrogance, or otherwise good developers being forced to “ship it” by management.

Arjun Vijay Prakash • Jul 22 '24

"No file validation" - agreed, that's a key issue.

But damage control is important too, though, for managing the fallout and maintaining trust, right?

Thanks for the insight, by the way!

Paul J. Lucas • Jul 22 '24

Damage control is important to prevent the stock price from tanking too much.

Trust? They've lost that. Time and the market will tell if they can regain it. It depends on how many competitors they have and how lazy companies are about switching to another vendor.

leob • Jul 22 '24 • Edited

Great overview, especially with regards to the technical details (kernel mode drivers etc), but this leaves one big question:

HOW on earth could their testing not have picked this up? I mean, if this update crashed all of their customers' computers, it would surely crash any of their "test" computers when the update got installed on them ...

There's one theory I can come up with: they produced the update, and tested it extensively, and then there was one "cowboy" in their organization who thought:

"Oh, let me just quickly add this last-minute 'improvement' that I've put together - for sure it's harmless, and I'm sure it works, no need to go through full QA/testing again!"

Meaning they probably need to further tighten their (presumably already tight) procedures ...

By the way, I totally agree with the "incremental rollout" suggestion - I was astonished when I realized that they rolled out this update to ALL of their customers at the same moment ...

I mean, it's so obvious:

Roll it out to a few customers first ... let it simmer for a day or two ... roll it out to a bigger group ... only then roll it out to everyone else.

P.S. was it really a billion computers? I read somewhere that one percent (1%) of the Windows computers worldwide was affected by this - does this mean that ONE HUNDRED BILLION Windows computers exist globally?

Arjun Vijay Prakash • Jul 22 '24

I too had the same thought, adding to that - or was it some work of an intern?

Loved that you pointed it out. Actually, the term "billion" in the article was metaphorical and meant to emphasize the scale of the issue. I already know that the real number is far lower.

Thanks for the feedback!

leob • Jul 24 '24

Two days later, and the mystery is solved - it slipped through their QA because there was a bug in their test/QA software itself ! So their QA program/software said "yes it's good", but in reality it wasn't ...

Still baffles me that they didn't do a quick manual sanity check as well, and ONLY relied on the automated testing, but okay ... they (the supplier) have now also indicated that in the future they want to move away from "big bang" updates, and towards more gradual rollouts - sounds like a good idea to me ;-)

leob • Jul 22 '24

It must have slipped through their QA procedures SOMEHOW but yeah it's baffling ...

Paul J. Lucas • Jul 25 '24

The rationale for doing an update to every customer simultaneously is that you want to protect them from threats ASAP.

Suppose there's a new threat in the wild. CrowdStrike codes up an update to neutralize the new threat. But suppose CrowdStrike instead did gradual updates. Suppose you're one of the customers who did not get the update first. Your computers remain vulnerable. Now suppose your computers are compromised by the threat. You blame CrowdStrike for not updating your computers in a timely manner.

Basically, CrowdStrike is damned either way.

leob • Jul 26 '24

They're not damned either way - they should simply ALWAYS do a manual (not just automated) sanity check by installing any update, no matter how tiny, on a few of their test computers before releasing it - that's what went wrong here ... but yeah it's always easy in hindsight :D

P.S. they've now announced that they do consider more gradual rollouts, even if just spaced apart by a few hours, not days or weeks, it's probably worth the tradeoff re vulnerabilities

Paul J. Lucas • Jul 26 '24

I agree that they should have done better in-house testing. However, that's not what my comment was about. My comment was only about the policy of rolling out updates simultaneously to customers.

As for their updated policy to do skewed roll-outs, only time will tell if it's a better policy. Certainly if the rollout is spread across hours and not days, the likelihood that some customer somewhere will (a) get compromised during that roll-out window because they didn't get the update yet, (b) realize they didn't get the update as soon as they possibly could have, and (c) sue CrowdStrike as a result is extremely small, but non-zero.

Of course, if I were CrowdStrike, I'd put something into their customer contract that guards against that possibility just to be legally bullet-proof.

leob • Jul 26 '24

Yeah just put that into their contract ... there will ALWAYS be a vulnerability window no matter how short, so yeah :)

Paul J. Lucas • Jul 29 '24

A "supply chain" isn't what you think it is. This is. Windows is wholly made in Redmond; CrowdStrike's software is wholly made by them. Neither is sourcing raw materials to make software from all over the globe.

Their crash crashed the machines. It should not be possible, and it means any other solution with the same patterns/privileges would have caused this mess, which means it's an attack vector (complex but possible).

True, but not relevant to my only points that I made in my original post.

BestCodes • Jul 22 '24

It's estimated that about 8.5 million computers were directly affected by the CrowdStrike crash, which is a far way off from 1 Billion. 😂

Other users would have been indirectly affected by the services that were directly affected, like AWS, Gmail, etc. but the estimates are still far less than a billion.

Nice article though. :)

Arjun Vijay Prakash • Jul 22 '24 • Edited

Actually, the term "billion" in the article was metaphorical and meant to emphasize the scale of the issue.

I already know that the real number is far lower.

Thanks!

Paul J. Lucas • Jul 25 '24

Supply chains have nothing to do with this.
CrowdStrike does not update the Windows kernel. The only company that updates the kernel is Microsoft.
Whether a bad commit was involved is irrelevant. As I pointed out in my other comment, the problem is that CrowdStrike's driver apparently does no file validation.

Paul J. Lucas • Jul 27 '24 • Edited

CrowdStrike doesn't "affect" the kernel either. Their code did not cause the kernel code to crash nor affect the kernel in any way. It was CrowdStrike's code that crashed because it dereferenced a null pointer.

The only reason the word "kernel" is being used at all here is because CrowdStrike's code, as a driver, runs in "kernel mode" as opposed to "user mode." Just because code runs in kernel mode doesn't mean it affects the kernel. "Kernel mode" and "kernel" are two different things.

Perhaps choose words more carefully.

Anmol Baranwal • Jul 22 '24

Loved reading the complete details :)

Arjun Vijay Prakash • Jul 22 '24

Appreciate the kind words, Anmol!

PS: i too have fun reading your articles as well. keep up the great work, man.

Paul J. Lucas • Jul 29 '24

Except in this case, there is no "someone" that infiltrated any systems. No person or group used CrowdStrike's software to infiltrate Windows systems. It was entirely CrowdStrike's fault.

Alex • Jul 23 '24

This is why monkey patching an insecure-by-design system is a bad idea. MS is famous for being cheap on development, and CrowdStrike is upholding the tradition.

Paul J. Lucas • Jul 25 '24

This isn't really a case of monkey patching, which applies mostly to non-compiled languages. The CrowdStrike case is closer either to dynamic loading or to using data as code, i.e., pseudo-code.

Paul J. Lucas • Jul 29 '24

I view the "someone" as a necessary condition. Anyway, I'm not going to argue it further. Believe whatever you want.
