DEV Community

Lucas@QAComet


CrowdStrike is overhauling their QA process

A recent faulty software update from CrowdStrike caused an outage affecting roughly 8.5 million Windows devices, and the company is now overhauling its QA process into one with many layers of quality checks. CrowdStrike has released a preliminary report detailing the planned changes. The incident, which resulted from an erroneous update passing through their automated validation system, caused widespread BSODs (Windows "blue screen of death" crashes). By one estimate, the failed deployment cost Fortune 500 companies $5.4 billion.

The root cause was traced back to a change in a configuration file that led to an out-of-bounds memory read. CrowdStrike admitted that their testing process for configuration file updates was relatively lax: it relied solely on a single automated content validation system, which seemed more akin to a JSON Schema validator than a comprehensive testing suite.
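To make the gap concrete, here is a minimal Python sketch of how a shape-only check can accept a configuration that the code consuming it cannot safely handle. Everything here is hypothetical: `validate_shape`, `interpret`, `interpret_checked`, and the rule format are invented for illustration and are not CrowdStrike's code. Python also raises an `IndexError` where native code would read past the end of a buffer, so this only mirrors the logic of the failure, not its mechanics.

```python
from typing import List, Optional

# Hypothetical illustration: none of these names come from CrowdStrike's code.

def validate_shape(rule: dict) -> bool:
    """Schema-style check: right keys and types, nothing about runtime context."""
    return (
        isinstance(rule.get("name"), str)
        and isinstance(rule.get("field_index"), int)
        and rule["field_index"] >= 0
    )

def interpret(rule: dict, inputs: List[str]) -> str:
    """Naive interpreter: trusts the rule and indexes straight into inputs."""
    return inputs[rule["field_index"]]

def interpret_checked(rule: dict, inputs: List[str]) -> Optional[str]:
    """Defensive interpreter: bounds-checks before reading."""
    index = rule["field_index"]
    if not 0 <= index < len(inputs):
        return None  # reject the rule instead of reading past the end
    return inputs[index]

inputs = ["a", "b", "c"]                          # only three fields are available
bad_rule = {"name": "example", "field_index": 3}  # well-formed, but points at a fourth
```

With these definitions, `validate_shape(bad_rule)` passes, `interpret(bad_rule, inputs)` raises `IndexError`, and `interpret_checked(bad_rule, inputs)` returns `None`. The validator alone cannot catch the problem because it never sees how many inputs the interpreter will actually have.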

How CrowdStrike will implement their QA

To prevent such incidents in the future, CrowdStrike is implementing several new QA processes:

  • Local developer testing: Surprisingly, their previous development process didn't include a final manual check on a developer machine before releasing updates. They will now incorporate human oversight in the process.
  • Content update and rollback testing: CrowdStrike plans to rigorously test their content updates and rollback procedures, simulating various fault scenarios to ensure their systems can handle broken updates effectively.
  • Stress testing: The company will expand its stress testing to cover configuration file updates, not just new components of their Sensor system. This comes after realizing that updates may have overwhelmed parts of the machines being updated, causing fail-safes to break.
  • Fuzzing: CrowdStrike will enhance their fuzzing efforts, particularly for the "Interpreter" system that reads configuration files, since those files may carry logic of their own.
  • Fault injection: The company will adopt this common testing technique, deliberately introducing errors into different parts of their system to evaluate how other components respond to failure.
  • Stability testing: CrowdStrike will implement stability testing, gradually removing portions of the functioning system to study how different components break – a crucial technique for mission-critical systems.
  • Content interface testing: Finally, they will bolster testing for their underlying content interface, focusing on making their interpreter more resilient to unforeseen issues.
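As a rough idea of what the fuzzing item above could look like in practice, the sketch below throws random byte strings at a toy interpreter and records any failure that is not a controlled rejection. The interpreter, the `ContentError` type, and the `fuzz` harness are all assumptions made up for this example. Real fuzzing of native code would more likely use a coverage-guided tool such as libFuzzer or AFL++.

```python
import random

class ContentError(ValueError):
    """Controlled rejection of a malformed content update (hypothetical)."""

def interpret(content: bytes, inputs: list) -> bytes:
    """Toy interpreter: each byte of content selects one input field."""
    out = bytearray()
    for index in content:
        if index >= len(inputs):
            raise ContentError(f"field index {index} out of range")
        out += inputs[index]
    return bytes(out)

def fuzz(iterations: int = 1000, seed: int = 0) -> list:
    """Feed random byte strings to interpret() and collect crashing inputs."""
    rng = random.Random(seed)  # fixed seed so any finding can be reproduced
    inputs = [b"a", b"b", b"c", b"d"]
    crashes = []
    for _ in range(iterations):
        length = rng.randrange(0, 16)
        content = bytes(rng.randrange(256) for _ in range(length))
        try:
            interpret(content, inputs)
        except ContentError:
            pass  # a controlled rejection is the desired outcome
        except Exception as exc:  # anything else is a finding worth triaging
            crashes.append((content, exc))
    return crashes

if __name__ == "__main__":
    print(f"{len(fuzz())} unexpected failures")
```

The property being checked is that malformed content is either processed or rejected cleanly, never allowed to crash the interpreter. Removing the bounds check in `interpret` makes the harness report `IndexError` findings almost immediately, which is the kind of signal a fuzzer is meant to give.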

By implementing these comprehensive QA processes, CrowdStrike aims to significantly reduce the risk of future outages and ensure a more robust and reliable service for users all around the world.

