Trust me, breaking stuff and digging into the intricate details has been a favorite pastime of developers for years. Chaos Engineering is not so different, and it's even more fun: it lets you break stuff, but in production.
The idea might not sound too great at first because of the risks involved, but stick around a little longer and I'll show you just how interesting it can be without harming your production environment at all.
Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
You are literally breaking things on purpose here to check the resilience of your applications. Just as scientists conduct experiments to study physical and social phenomena, Chaos Engineering uses experiments to learn about a particular system.
Have you ever been called out of bed because the application you work on had stopped working? Or have you spent a Saturday doing manual failover tests? If so, you are probably eager to learn how to avoid this.
Companies pour a lot of resources into hosting and managing their microservice and distributed cloud architectures. Have you ever imagined what would happen if the systems we depend on fail? These failures not only cause costly outages for companies but also hurt customers trying to shop, transact business, and get work done. A single hour of downtime can cost a company thousands of dollars, so companies can't afford to let this slide.
Companies need a solution to this challenge; waiting for the next costly outage is not an option. To meet the challenge head-on, more and more companies are turning to Chaos Engineering.
Chaos Engineering should not be misunderstood as a replacement for the unit or end-to-end testing that you are already doing. Think of it instead as extra reliability and resilience testing that you add to your arsenal, covering failure modes that neither unit nor e2e tests can simulate.
When you want to explore the many ways a complex system can misbehave, injecting communication failures like latency and errors is one good approach. But we also want to explore things like a large increase in traffic, race conditions, Byzantine failures, and unplanned or uncommon combinations of messages. At the end of the day, nobody wants an unhappy customer.
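To make the idea of injecting latency and errors a bit more concrete, here is a minimal sketch in Python. It is only an illustration: `inject_faults`, `InjectedFault`, and `fetch_price` are made-up names for this post, not part of any chaos tool, and a real experiment would usually inject faults at the network or infrastructure level rather than inside the application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so every call is delayed and some calls fail.

    Hypothetical helper for illustration only, not a real library API.
    latency_s  -- extra delay added before each call, in seconds
    error_rate -- probability (0.0 to 1.0) that a call raises InjectedFault
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow network hop
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_faults(latency_s=0.01, error_rate=0.3)
def fetch_price(item_id):
    # Stand-in for a real remote call (hypothetical).
    return {"item": item_id, "price": 9.99}


if __name__ == "__main__":
    ok = failed = 0
    for item_id in range(20):
        try:
            fetch_price(item_id)
            ok += 1
        except InjectedFault:
            failed += 1
    print(f"succeeded={ok} failed={failed}")
```

The interesting part is not the wrapper itself but what the callers do when it kicks in: do they retry, time out, fall back to something sensible, or fall over?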
- We can only do Chaos Engineering in production
This is simply a myth; you can of course do chaos injection locally as well. Although Chaos Engineering is often executed in production, that is probably not the place to start. Your first experiments can usually run in an acceptance or test environment. For certain cases, though, such as testing larger parts of your application landscape, production is the only place you can do this, because it is often impossible to emulate a fully distributed application landscape in a test or acceptance environment.
- Chaos Engineering is only for large distributed systems
This is not true. You can run Chaos Engineering on anything from the tiniest application to the largest system. Even a small attack, such as blocking an API call or intentionally disrupting the network, might be enough to check the resilience of a small system, and you don't even need to know the source code for this. Chaos Engineering shines brighter in a large system, but the claim that you can't use it for smaller systems is false.
- Chaos Engineering is tethered to a certain type of technology only
Nope. It's not just for the cool kids. Your monolith can use Chaos Engineering. You have probably seen the terms Chaos Engineering and microservice architecture linked in a lot of places, but you can of course use it with any sort of architecture you want. That legacy system you have can also benefit from Chaos Engineering.
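As a small illustration of the "even tiny applications" point from the myths above, here is a rough Python sketch of blocking a single API call and checking that the application still responds sensibly. Everything in it is an assumption made for the example: the URL, the `get_recommendations` function, and the fallback list are all hypothetical. It also blocks the call in-process with `unittest.mock`, which is a simplification; a real experiment would more likely block the traffic at the network level.

```python
import json
import urllib.error
import urllib.request
from unittest import mock

# Hypothetical fallback served when the recommendations API is unreachable.
DEFAULT_RECOMMENDATIONS = ["bestsellers"]


def get_recommendations(user_id, url="http://recs.internal.example/api"):
    """Fetch recommendations, falling back to a default list on failure."""
    try:
        with urllib.request.urlopen(f"{url}?user={user_id}", timeout=1) as resp:
            return json.load(resp)
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return DEFAULT_RECOMMENDATIONS


def run_blocked_api_experiment():
    """Block the outgoing API call and return what the application serves."""
    blocked = urllib.error.URLError("blocked by chaos experiment")
    with mock.patch("urllib.request.urlopen", side_effect=blocked):
        return get_recommendations(user_id=42)


if __name__ == "__main__":
    result = run_blocked_api_experiment()
    print("served while API was blocked:", result)
```

If the function raised instead of returning the fallback, the experiment would have revealed a weakness worth fixing before a real outage does the same thing.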
Chaos Monkey is the original chaos engineering tool created at Netflix. It’s still being maintained and is currently integrated into Spinnaker, the continuous delivery platform that originated at Netflix.
Litmus is a toolset for cloud-native chaos engineering by MayaData. It provides tools to orchestrate chaos on Kubernetes, along with a tightly scoped UI that has lots of features to get you started.
Gremlin is a company started by some of Netflix’s and Amazon’s Chaos Engineers who productized Chaos as a Service (CaaS). Gremlin is a paid service that gives you a CLI, agent and website that will help you set up chaos experiments.
Chaos Mesh is a Cloud Native Computing Foundation (CNCF) hosted project. It is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments, with an amazing Chaos Dashboard for designing, managing, and monitoring chaos experiments.
“Chaos Engineering doesn’t cause problems, it just reveals them” – Nora Jones
With that said, all I'd like to add is that Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system’s failure modes, you will reduce your operational burden, increase your availability, and sleep better at night. So go ahead and give it a try if you haven't already.
Don't forget to share these resources with anyone you think might benefit from them. Peace out. ✌🏼