
CodeWithCaen

Originally published at blog.desilva.se

What we can learn from the #CrowdStrike meltdown.

A short reflection on the recent CrowdStrike IT disaster.

I feel like there's a lot to learn from the #CrowdStrike meltdown, where a bug in a software update is causing havoc across the world. Here's what immediately comes to mind, both from the company's perspective and from ours as a society.

1. Don't put all your update eggs in one basket.

If you're a global service provider, unless you're sending a critical security patch, do you really need to go for a global rollout, or can you do it in batches? That way, if something goes wrong, you limit the fallout.
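
To make that concrete, here's a rough sketch of what a batched rollout loop could look like. The wave sizes, bake time, and the `deploy_to` / `fleet_is_healthy` hooks are hypothetical stand-ins for whatever deployment and telemetry tooling a real provider would have; this illustrates the idea, not anyone's actual pipeline.

```python
import time

# Hypothetical wave sizes: start small, widen only while everything stays healthy.
ROLLOUT_WAVES = [0.01, 0.05, 0.25, 1.00]  # fraction of the fleet per wave
BAKE_TIME_SECONDS = 3600                  # let each wave "bake" before expanding


def staged_rollout(update, fleet, deploy_to, fleet_is_healthy):
    """Push `update` to the fleet in widening waves, halting on any regression."""
    already_updated = 0
    for fraction in ROLLOUT_WAVES:
        target = int(len(fleet) * fraction)
        for host in fleet[already_updated:target]:
            deploy_to(host, update)
        already_updated = target

        time.sleep(BAKE_TIME_SECONDS)  # wait for crash reports / telemetry
        if not fleet_is_healthy(fleet[:already_updated]):
            # Stop before the blast radius grows any further.
            raise RuntimeError(f"Rollout halted after {already_updated} hosts")
```

The exact numbers don't matter; the point is that each wave gets a chance to fail loudly before the next one starts.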

2. The importance of testing, and being responsible.

Why did the faulty code pass the CI/CD checks? As an avid software tester, I can't help but wonder how CrowdStrike's systems are set up. If you are a service provider for critical societal infrastructure like hospitals and aviation, I feel that you have a responsibility to have solid testing pipelines in place before releasing an update. Of course, things can still be missed, but even so, the robustness of their delivery pipelines is hard not to question.
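
Even a very cheap automated gate can catch the most catastrophic failure modes before a release goes out. Here's a toy example of the kind of pre-release sanity check I mean; the file format, size limit, and function name are made up, and a check this simple wouldn't necessarily have caught the actual bug, but it shows how low the bar for a first line of defense can be.

```python
from pathlib import Path

# Arbitrary upper bound for a content/configuration file, purely illustrative.
MAX_SIZE_BYTES = 50 * 1024 * 1024


def validate_content_file(path: Path) -> None:
    """Reject obviously broken content files before they ever ship.

    Deliberately simple checks: non-empty, not all null bytes, sane size.
    A real pipeline would go further and actually parse and load the file
    the same way the production code does.
    """
    data = path.read_bytes()
    if not data:
        raise ValueError(f"{path} is empty")
    if all(byte == 0 for byte in data):
        raise ValueError(f"{path} contains only null bytes")
    if len(data) > MAX_SIZE_BYTES:
        raise ValueError(f"{path} is suspiciously large ({len(data)} bytes)")
```

A gate like this would sit in CI alongside the heavier tests: load the file with the real parser, run it against a matrix of supported OS versions, and only then let the release proceed.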

3. The single point of failure problem.

Why aren't we more concerned about relying on single points of failure? It's quite honestly frightening how much chaos a single mistake can cause, and the direction we're heading in is even more terrifying. People in our industry have been moving towards relying on outside entities, and now everyone is paying the price.

Looking to the future.

This essay got dark. So let's try to end it on a more positive note. What can we as IT professionals do to prevent this from happening again? How can we make sure that we're not the next #CrowdStrike? I think that's a conversation worth having.

For now, I'm going to go back to my testing, and make sure that my code is as solid as it can be. I hope you do the same. And do let me know your thoughts on this whole situation. I'm curious to hear what you think and what you would do differently.

Top comments (18)

Ben Halpern

Thanks for the post

CodeWithCaen

Glad you liked it! Feel free to share it if you think others would find it interesting :)

Jenny Phan

This is exactly what I have been telling people. This should have been caught in QA testing, and after the production deployment there should have been verification that it was working; if not, roll back immediately. If you're a provider of software for any company, you should be doing phased rollouts and automated testing. I'm wondering: was this test scenario just missed? Seems like a big one :)

Eduardo Patrick

Thanks for sharing, I really appreciated the points you mentioned. They remind us how careful we need to be in so many aspects; even with a robust delivery system or a great team, shit happens and we need to be prepared to handle it in the best way.

Red Ochsenbein (he/him) • Edited

Also. Sometimes the cure is worse than the problem.

Charles F. Munat

Did you mean "worse"?

Red Ochsenbein (he/him)

Yes. Thanks.

S. Ben Ali

I heard on some news channel that they couldn't test the update on every machine out there, since there are so many makes and models: Windows PCs, servers, workstations and whatnot. Something to that effect. I'm not defending them, just stating.

Then again, I’m totally for the incremental delivery/rollout. Maybe they should’ve targeted a specific zone first.

CodeWithCaen

Thanks for giving this context; I do disagree with them, though. They provide services to critical parts of society, and evidently have the power to grind our lives to a halt. They have a responsibility to ensure they have adequate measures to prevent these things from happening. That's the bare minimum.

They are (were?) a multi-billion-dollar corporation. They should have a fleet of the hundreds, if not thousands, of the most common device configurations they provide services to, so they can actually test their software before releasing it.

Liz Wait

As a tester, I had very similar questions! Thanks for the post.

Lucas@QAComet

It's amazing how transparent CrowdStrike has been in their post-incident report. I'm glad to see they are now implementing a comprehensive QA process.

CodeWithCaen

Feels a bit too late, don't you think?

Lucas@QAComet

Don't get me wrong, the damage has been done and I'm sure we'll continue to see new updates about the fallout. I just see this as a positive because it sets some standards for publicly traded companies to reference.

Matthew O. Persico

The big problem here is Windows Update and how it is configured. In a corporation, Windows Update should be configured to go against a local update server. The local update server then calls home for the various pieces of infrastructure and software updates. Then you can stage the update yourself across your fleet.

Theo Oliveira • Edited

Bro, what test? By the analysis they made on Twitter, the file was only a bunch of zeros. That makes no sense at all.

Lucas@QAComet • Edited

Turns out the null bytes were caused by CrowdStrike's crash in the middle of the update; the content update they released wasn't actually full of null bytes. What happened was they released a configuration file that caused an out-of-bounds memory error, setting off a chain of odd behavior on the computer.

CodeWithCaen

Link?

Michael Mior

The retro posted by CrowdStrike seems to indicate that wasn't the problem.