I’ve done quite a few posts recently about resiliency. It’s a topic that keeps getting more important to everyone as you build out solutions in the cloud.
The new buzzword that’s found its way onto the scene is chaos engineering. Really, this is the practice of building solutions that are more resilient: solutions that can survive the faults and issues that arise, and ensure the best possible delivery to end customers. The simple fact is that software solutions are critical to nearly every element of most operations, and having them go down can ultimately break a whole business if resiliency is not done properly.
At its core, Chaos engineering is about pessimism :). Things are going to fail.
Like other movements such as Agile and DevOps, chaos engineering embraces a reality. In this case that reality is that failures will happen and should be expected. The goal is that you assume there will be failures, and architect your solution to support resiliency.
So what does that actually mean? It means that you determine the strength of the application by running controlled experiments that inject faults into it and observing the impact. The intention is that the application grows stronger and is able to handle faults and issues while maintaining the highest resiliency possible.
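To make the idea of a controlled experiment a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `get_price` stands in for a real dependency call, `with_fault_injection` is a hand-rolled wrapper rather than any particular chaos tool, and the failure rate is an arbitrary number. The experiment injects faults into the dependency and counts how often the caller had to fall back.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment in place of a real dependency failure."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def get_price(item):
    # Hypothetical dependency call; a real one would hit a service or database.
    return {"widget": 9.99}.get(item, 0.0)


def get_price_with_fallback(fetch, item, default=None):
    """The code under test: it should degrade gracefully, not crash."""
    try:
        return fetch(item)
    except InjectedFault:
        return default


def run_experiment(trials=1000, failure_rate=0.2, seed=42):
    """Return how many calls ended up on the fallback path."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(get_price, failure_rate, rng)
    degraded = 0
    for _ in range(trials):
        if get_price_with_fallback(flaky, "widget") is None:
            degraded += 1
    return degraded
```

The point of the sketch is the shape of the experiment: a known fault goes in, and you measure what the application did about it.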
Now a lot of people will read the above and say that “chaos engineering” is just the latest buzzword to cover something everyone’s already doing. There is an element of truth to that, but the details are what matter.
What I mean by that is that there is a defined approach to doing this, and to doing it in a productive manner, much like Agile and DevOps. In my experience, some teams are probably doing elements of this already, but by putting a name and a methodology to it, we call attention to the practice for those who aren’t, and provide a guide of sorts for how to approach the problem.
There are several key elements that you should keep in mind as you find ways to grow your solution by going down this path.
- Embrace the idea that failures happen.
- Find ways to be proactive about failures.
- Embrace monitoring and visibility.
Much as Agile embraced the reality that “Requirements change,” and DevOps embraces that “All code must be deployed,” chaos engineering embraces that the application will experience failures. This is a fact. We need to assume that any dependency can break, or that components will fail or be unavailable. So what do we mean at a high level for each of these?
Embracing failure starts with accepting that elements of your solution will fail; we know this will happen. Servers go down, service interruptions occur, and to steal a quote from Batman Begins, “Sometimes things just go bad.”
I was in a situation once where an entire network connection was taken down by a squirrel.
So we should build our code and applications in a way that accepts that failures will eventually occur, and build resiliency into our applications to accommodate that. You can’t solve a problem until you know there is one.
How do we do that at a code level? Really this comes down to looking at your application or microservice and doing a failure mode analysis: taking an objective look at your code and asking key questions:
- What is required to run this code?
- What kind of SLA is offered for that service?
- What dependencies does the service call?
- What happens if a dependency call fails?
That analysis will help to inform how you handle those faults.
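One way to keep that analysis honest is to write it down as data. The sketch below is only an illustration: the `Dependency` fields, the service names, and the SLA figures are all made up. It relies on one general fact: if your service needs every one of its dependencies to be up, and they fail independently, the best availability you can expect is the product of their SLAs.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Dependency:
    name: str
    sla: float       # availability as a fraction, e.g. 0.999 for "three nines"
    on_failure: str  # the planned behavior, or "" if nobody has decided yet


def composite_availability(deps):
    """Upper bound when every dependency is required and failures are independent."""
    return prod(d.sla for d in deps)


def unhandled(deps):
    """Names of dependencies with no answer to 'what happens if this call fails?'"""
    return [d.name for d in deps if not d.on_failure]


# Hypothetical inventory for a single microservice.
deps = [
    Dependency("orders-db", 0.9995, "serve cached data, queue writes"),
    Dependency("payment-api", 0.999, "retry, then open a circuit breaker"),
    Dependency("email-service", 0.99, ""),
]

print(f"composite availability: {composite_availability(deps):.4%}")
print("no failure plan for:", unhandled(deps))
```

Anything that shows up in the second list is a question from the analysis that still needs an answer.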
In a lot of ways, this comes down to leveraging established patterns and practices to ensure resiliency.
After you’ve done that failure mode analysis, you need to figure out what happens when those failures occur:
- Can we implement patterns like circuit breaker, retry logic, and load leveling, with the help of libraries like Polly?
- Can we implement multi-zone, multi-region, cluster based solutions to lower the probability of a fault?
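To show what two of those patterns look like in practice, here is a hand-rolled sketch of retry with exponential backoff and a simple circuit breaker. This is not Polly's API (Polly is a .NET library, and in real code you would likely reach for a maintained library); the class names, thresholds, and timeouts are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit is pointless
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The two are meant to be combined, for example `retry(lambda: breaker.call(fetch_order))` with a hypothetical `fetch_order`: retries absorb short blips, and the breaker stops you from hammering a dependency that is clearly down.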
Also at this stage, you can start thinking about how you would classify a failure. Some failures are transient, others are more severe. And you want to make sure you respond appropriately to each.
For example, a momentary network outage is very different from a database being down for an extended period. So another key element to consider is how long the fault lasts.
Now, based on the above, the next question is: how do I even know this is happening? With microservice architectures, applications are becoming more and more decentralized, which means there are more moving parts that need monitoring.
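As a rough illustration of classifying by duration, the sketch below sorts a fault into one of three buckets and maps each bucket to a response. The bucket names, the 5-second and 5-minute cutoffs, and the responses are all assumptions; the right values depend entirely on your application.

```python
from enum import Enum


class Severity(Enum):
    TRANSIENT = "transient"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def classify(duration_seconds, transient_limit=5.0, outage_limit=300.0):
    """Classify a fault by how long it has lasted (the limits are illustrative)."""
    if duration_seconds < transient_limit:
        return Severity.TRANSIENT
    if duration_seconds < outage_limit:
        return Severity.DEGRADED
    return Severity.OUTAGE


# Hypothetical responses; yours will come out of the failure mode analysis.
RESPONSES = {
    Severity.TRANSIENT: "retry with backoff",
    Severity.DEGRADED: "serve a fallback and raise an alert",
    Severity.OUTAGE: "page on-call and consider failover",
}


def respond(duration_seconds):
    return RESPONSES[classify(duration_seconds)]
```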
So for me, the best next step is to go over all the failures and identify how you will monitor and alert for those events, and what your mitigations are. Say, for example, you want to do a manual failover for your database: you need to determine how long a dependency service can keep returning failures before you are notified to do the failover.
Or how long does something have to be down before an alert is sent? And how do you log these events so that your engineers have visibility into the behavior? Sending an alert after a threshold does no one any good if they can’t see when the behavior started to happen.
Personally I’m a fan of the concept here, as it calls out a very important practice that I find gets overlooked more often than not.
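Here is one way that could look, as a sketch rather than a recommendation. The hypothetical `OutageTracker` logs the moment the first failure is seen, only alerts once the dependency has been failing for a configurable threshold, and includes the start time in the alert so engineers can see when the behavior began. The names and the threshold are assumptions.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("dependency_health")


@dataclass
class OutageTracker:
    alert_after_seconds: float
    first_failure_at: Optional[float] = None
    alerted: bool = False

    def record(self, ok: bool, now: float) -> Optional[str]:
        """Record one health check; return an alert message when one is due."""
        if ok:
            if self.first_failure_at is not None:
                logger.info("recovered at %.1f after failing since %.1f",
                            now, self.first_failure_at)
            self.first_failure_at = None
            self.alerted = False
            return None
        if self.first_failure_at is None:
            # Log the start of the incident, not just the eventual alert.
            self.first_failure_at = now
            logger.warning("first failure observed at %.1f", now)
        down_for = now - self.first_failure_at
        if down_for >= self.alert_after_seconds and not self.alerted:
            self.alerted = True  # alert once per incident
            return (f"dependency failing since t={self.first_failure_at:.0f}s "
                    f"(down for {down_for:.0f}s)")
        return None
```

In use, you would call `record` from whatever health check or request path you already have, passing the result and a timestamp, and forward any returned message to your alerting channel.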