A short reflection on the recent CrowdStrike IT disaster.
I feel like there's a lot to learn from the #CrowdStrike meltdown where a bug ...
Thanks for the post
Glad you liked it! Feel free to share it if you think others would find it interesting :)
This is exactly what I have been telling people. This should have been caught in QA testing, and after the production deployment there should have been verification that it was working; if not, roll back immediately. If you're a provider of software for any company, you should be doing phased rollouts and automated testing. I'm wondering: was this test scenario just missed? Seems like a big one to miss :)
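Roughly what I mean, as a sketch in Python — the rings, the soak time, and function names like `deploy_to` / `health_check` / `rollback` are all made up for illustration, not anything CrowdStrike actually runs:

```python
import time

# Hypothetical rollout rings: start tiny, widen only while the previous ring stays healthy.
ROLLOUT_RINGS = [
    ("internal", 0.001),   # the vendor's own machines first
    ("canary",   0.01),    # ~1% of customer hosts
    ("early",    0.10),
    ("broad",    1.00),
]

def deploy_to(ring: str, fraction: float, version: str) -> None:
    """Placeholder: push `version` to the hosts in this ring."""
    print(f"deploying {version} to ring '{ring}' ({fraction:.1%} of fleet)")

def health_check(ring: str) -> bool:
    """Placeholder: check crash/boot-loop telemetry from the ring after the push."""
    return True

def rollback(ring: str, version: str) -> None:
    """Placeholder: revert the ring to the last known-good version."""
    print(f"rolling back ring '{ring}' from {version}")

def phased_rollout(version: str, soak_minutes: int = 30) -> bool:
    for ring, fraction in ROLLOUT_RINGS:
        deploy_to(ring, fraction, version)
        time.sleep(soak_minutes * 60)   # let telemetry accumulate before widening
        if not health_check(ring):
            rollback(ring, version)     # stop the rollout right here
            return False
    return True

if __name__ == "__main__":
    # soak_minutes=0 only so the sketch runs instantly
    phased_rollout("content-update-example", soak_minutes=0)
```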
Devs making software are never perfect, so bugs happen. Testers doing QA are never perfect either, so bugs can go unspotted and make it into production. IMHO, it's very hard to measure the quality of your tests. If you have no bugs, you can only measure how expensive the tests are to write and maintain. When you wreak havoc in production, it suggests there is a flaw in your QA. My point is: it's easier to measure test quality when it is low.
Thanks for sharing, I really appreciated the points you mentioned. They remind us that we need to be careful in a lot of aspects: even with a robust delivery system and a great team, shit happens, and we need to be prepared to handle it in the best way possible.
Also. Sometimes the cure is worse than the problem.
Did you mean "worse"?
Yes. Thanks.
I heard on some news channel that they couldn't test the update on every machine out there, since there are so many makes and models: Windows PCs, servers, workstations, and whatnot. Something to that effect. I'm not defending them, just stating what was said.
Then again, I’m totally for the incremental delivery/rollout. Maybe they should’ve targeted a specific zone first.
Thanks for giving this context, I do disagree with them though. They provide services to critical parts of society, and evidently have the power to grind our lives to a halt. They have a responsibility to ensure they have adequate measures to prevent these things from happening. That's the bare minimum.
They are (were?) a multi-billion dollar corporation. They should have a test fleet of the hundreds, if not thousands, of most common device configurations they provide services to, so they can actually test their software before releasing it.
As a tester, I had very similar questions! Thanks for the post.
The big problem here is Windows Update and how it is configured. In a corporate environment, Windows Update should be configured to go against a local update server. The local update server then calls home for the various pieces of infrastructure and software updates, and from there you can stage the updates yourself across your fleet.
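To make that concrete, here's a toy Python model of the idea of a local update server gating what the fleet sees. The class, method names, ring names, and version strings are invented for illustration; this is not a real WSUS or Windows Update API:

```python
# Toy model: clients never pull straight from the vendor; they pull whatever
# the local update server has approved for their ring.

class LocalUpdateServer:
    def __init__(self):
        self.upstream_latest = None   # newest version the vendor published
        self.approved = {}            # ring name -> version admins approved

    def sync_with_vendor(self, vendor_version: str) -> None:
        """'Call home': fetch the newest version but do NOT release it yet."""
        self.upstream_latest = vendor_version

    def approve(self, ring: str, version: str) -> None:
        """Admin decision: machines in this ring may now install this version."""
        self.approved[ring] = version

    def version_for(self, ring: str):
        """What a client in this ring is allowed to install (None = stay put)."""
        return self.approved.get(ring)

# Staging is driven by the organisation, not the vendor:
server = LocalUpdateServer()
server.sync_with_vendor("agent-7.16")          # hypothetical version string

server.approve("it-test-lab", "agent-7.16")    # lab machines first
# ...watch the lab for a while, then widen...
server.approve("head-office", "agent-7.16")
# ...and only then the machines you really can't afford to brick.
server.approve("point-of-sale", "agent-7.16")

print(server.version_for("point-of-sale"))     # agent-7.16 once approved
```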
It's amazing how transparent CrowdStrike has been in their post-incident report. I'm glad to see they are now implementing a comprehensive QA process.
Feels a bit too late don't you think?
Don't get me wrong, the damage has been done and I'm sure we'll continue to see new updates about the fallout. I just see this as a positive because it sets some standards for publicly traded companies to reference.
Bro, what test? According to the analysis people posted on Twitter, the file was only a bunch of zeros. That makes no sense at all.
Turns out the null bytes were caused by the crash happening in the middle of the update; the content update CrowdStrike released wasn't actually full of null bytes. What happened was they released a configuration file that triggered an out-of-bounds memory read, setting off a chain of odd behavior on the machine.
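To make the mechanism concrete, here is a rough Python illustration (not the actual sensor code, and the field counts are just examples): the parser expects the content file to carry a fixed number of fields, and reading a field that isn't there is an out-of-bounds access. Python turns it into a clean exception; a memory-unsafe kernel driver just reads past the end of the buffer and crashes the machine.

```python
# Illustration only: a parser that trusts the config file to carry a fixed
# number of fields. The field counts are examples, not the real format.
EXPECTED_FIELDS = 21

def parse_channel_file(raw: str) -> list:
    fields = raw.split(",")
    # The bug pattern: read a field the file never actually provided.
    return [fields[i] for i in range(EXPECTED_FIELDS)]

good = ",".join(f"v{i}" for i in range(21))
bad = ",".join(f"v{i}" for i in range(20))   # one field short

parse_channel_file(good)       # fine
try:
    parse_channel_file(bad)    # IndexError here...
except IndexError:
    # ...but in a memory-unsafe kernel component the same mistake is an
    # out-of-bounds read: no exception, just a crash (BSOD) at boot.
    print("out-of-bounds access on a malformed config")
```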
Link?
The retro posted by CrowdStrike seems to indicate that wasn't the problem.