Thus begins the prologue to Antifragile by Nassim Nicholas Taleb, one of the most popular authors of recent years.
What does randomness have to do with our work as engineers? What does chaos have to do with cloud computing? Much more than you might imagine, and in this story we will see why.
Generally, a cloud application is composed of various components: virtual machines, databases, load balancers and other services that, by communicating with each other, support our business. These are complex, distributed systems which, as such, can suddenly fail.
The answer to this type of problem has often been a greater number of tests; mostly end-to-end tests that exercise all the layers of an application. A sort of black-box testing where, given an input A, an output B is expected. If the response is C, something has gone wrong somewhere in the system.
In some cases, such tests are run at regular intervals against production systems to verify them around the clock, but in case of a failure, they will not tell us much about the error.
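Such a black-box check can be sketched in a few lines. Here `place_order` is a hypothetical stand-in for a call to the real system (in practice it would be, say, an HTTP request to a production endpoint):

```python
# Black-box end-to-end check: given input A, we expect output B.
# `place_order` is a hypothetical stand-in for the deployed system under test.

def place_order(item: str, quantity: int) -> dict:
    # In a real check this would call the live application.
    return {"status": "confirmed", "item": item, "quantity": quantity}

def end_to_end_check() -> bool:
    """Input A -> expected output B; anything else means something went wrong."""
    response = place_order("book", 2)
    return response.get("status") == "confirmed"

if __name__ == "__main__":
    print("OK" if end_to_end_check() else "FAILURE: something went wrong in the system")
```

Note that when such a check fails, all we learn is that *some* layer misbehaved, not which one: exactly the limitation described above.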
Even worse, this practice exposes us to another risk: the illusion of control. Each time a test is successfully completed, we become more and more convinced of the robustness of the system; day after day we become more and more proud of the excellent work done.
A metaphor by Bertrand Russell, later adapted by Taleb himself, tells of the «great turkey problem»:
This surprise represents what is called a Black Swan: an unexpected event, with catastrophic effects.
The illusion of control forced the turkey to revise its beliefs about the comfort of its life just as they were at their peak.
Likewise, repeated positive end-to-end tests could lead us to the equally fallacious conclusion that our systems are indeed foolproof.
This superficiality of judgment, to use a lexicon dear to Taleb, exposes us to the Black Swan: the possibility that an unforeseen problem could ruin our plans for the weekend.
Possible examples of Black Swans:
- a server shutting down;
- a database failure;
- CPU overload;
- memory exhaustion;
- disk-space exhaustion;
- high network latency;
- insufficient permissions.
To protect your system from these and other issues, cloud providers offer various options. In AWS, to counter the sudden shutdown of a machine, you could exploit the EC2 Auto Scaling service; to mitigate the impact of a database failure, you could take advantage of Amazon RDS's automatic failover mechanism, and so on.
These tricks will certainly make our application less fragile, but will they reduce the chance of a Black Swan?
By the term «fragile», in general, we mean an object that could be damaged – even irremediably – due to random events (e.g. a crystal glass that breaks after accidentally falling on the floor).
On the contrary, terms such as «robust» or «resistant» are used to define objects which, upon the occurrence of random events, maintain the exact same initial properties.
Randomness has made these objects neither better nor worse.
With the term «antifragile», however, Taleb refers to everything that benefits from randomness and stress factors. The muscles of our body, for example, benefit from the right amount of stress, as this is a prerequisite for their growth.
Returning to our systems: if everything we have done to counteract their fragility amounts to configuring Auto Scaling, we will be robust but not yet antifragile, and therefore still exposed to a Black Swan.
Taleb offers us the key suggestion for avoiding nasty surprises. To be antifragile, and reduce the chances of a Black Swan as much as possible, we have to crave error.
But what if we are in a situation where our end-to-end tests keep running smoothly and the error just doesn’t come?
By «chaos engineering» we mean the practice of deliberately injecting an error into a system, in order to observe, in vivo, the consequences.
The main advantage of this approach, compared to the more classic end-to-end tests, is that we no longer have to depend on our assumptions: the impact of the error will be right before our eyes. By doing so, we have the opportunity to surface aspects of the system we were not aware of: chain reactions, performance problems, metrics that escaped the monitoring tools, and so on.
Knowing the weak points of the system is already a great start, as it forces us not to let our guard down; what we must aim for, however, is the resolution of those weaknesses. Each fix will make the system a little more «antifragile».
The more error cases we are able to inject – and solve – the lower the chance of a Black Swan.
The choice of the word «engineering» next to the word «chaos» might sound like an oxymoron, but it isn't; indeed, it well represents the essence of the process underlying the practice itself. Just as a scientific experiment has its well-codified phases, a chaos engineering experiment has its own.
Definition of the steady-state
In this phase we make clear how the system should behave when there are no errors. This step is crucial, and it will be essential to have monitoring infrastructure that lets us accurately collect the metrics needed to define it. System metrics (e.g. CPU, memory) are certainly useful, but it is even more useful to collect business metrics (e.g. orders placed per minute).
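As a minimal sketch (the metric and thresholds are assumptions), the steady state could be summarised as a baseline computed over a window of a business metric such as orders per minute:

```python
from statistics import mean, stdev

def steady_state(samples: list[float]) -> tuple[float, float]:
    """Summarise a business metric (e.g. orders per minute) as mean and std dev."""
    return mean(samples), stdev(samples)

def within_steady_state(value: float, baseline: float, spread: float,
                        k: float = 3.0) -> bool:
    """True if an observed value stays within k standard deviations of the baseline."""
    return abs(value - baseline) <= k * spread

# Hypothetical samples collected by the monitoring tool during normal operation.
orders_per_minute = [98, 102, 101, 97, 103, 99, 100, 102]
baseline, spread = steady_state(orders_per_minute)
```

During an experiment, each fresh sample can then be checked against this baseline to tell "normal fluctuation" apart from "the system has left its steady state".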
Formulation of a hypothesis
Once this is done, we formulate a hypothesis; that is, we try to imagine how the system might behave following a specific error.
A useful exercise at this stage is to ask each team member to answer the above questions independently. The purpose is not to rank anyone first or last in the class but simply to show how, in some cases, knowledge of the system can vary between members of the same team.
Another way to remind everyone to stay prepared.
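One possible way to make a hypothesis explicit and shareable across the team (the field names and example values here are assumptions, not a standard) is a small structured record:

```python
from dataclasses import dataclass

@dataclass
class ChaosHypothesis:
    """Pairs an injected error with the behaviour we expect from the system."""
    error_injected: str        # the fault we will introduce
    expected_behaviour: str    # what the countermeasures should do
    metric: str                # business metric to watch during the experiment
    max_expected_impact: str   # how far the metric may deviate, and for how long

hypothesis = ChaosHypothesis(
    error_injected="primary database instance stopped",
    expected_behaviour="automatic failover promotes the standby",
    metric="orders per minute",
    max_expected_impact="orders below baseline for at most 2 minutes",
)
```

Writing hypotheses down in this form makes it easy to compare each member's expectations after the exercise described above.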
Definition of a stop-condition
It will be crucial to understand when and how to stop the experiment.
The when can be manual, timed or dynamic. We resort to a manual stop if we realise that the experiment is having unexpected consequences. A timed stop is an upper bound: the maximum time within which, for better or worse, the experiment must end. Finally, a dynamic stop-condition blocks the experiment if our monitoring tool signals an alarm (e.g. the average number of orders has fallen by 70% for more than five minutes).
How to stop the experiment is a bit more complex and depends on the available tooling.
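The dynamic stop-condition above can be sketched as a check over recent monitoring samples (the drop ratio and the one-minute sampling interval are assumptions taken from the example):

```python
def should_stop(recent_orders: list[float], baseline: float,
                drop_ratio: float = 0.7, min_samples: int = 5) -> bool:
    """Stop the experiment if the order rate has stayed below
    (1 - drop_ratio) * baseline for the last `min_samples` samples
    (e.g. five one-minute samples = more than five minutes of alarm)."""
    if len(recent_orders) < min_samples:
        return False
    threshold = (1 - drop_ratio) * baseline
    return all(v < threshold for v in recent_orders[-min_samples:])
```

With a baseline of 100 orders per minute, the condition fires only once the rate has stayed below 30 for five consecutive samples; a single bad sample is not enough.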
Run the experiment
Once this is done, we just have to launch our experiment and inject the errors we foresaw when defining the hypotheses. Easier said than done. Until recently, the practice of chaos engineering required good scripting skills to make the occurrence of the error programmatic.
This made chaos engineering less accessible to less structured teams, reducing the practice to the prerogative of a handful of large companies. Not surprisingly, some of the most famous chaos engineering scripts were those created within Netflix in the Simian Army project, such as Chaos Monkey: a script that randomly knocked down production machines.
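The idea behind such scripts is simple. Here is a minimal sketch in the spirit of Chaos Monkey (not Netflix's actual code); `terminate` is a hypothetical hook standing in for the real termination call to the cloud provider:

```python
import random

def chaos_monkey(instances: list[str], terminate, rng=None) -> str:
    """Pick one instance at random and knock it down via the provided hook."""
    rng = rng or random.Random()
    victim = rng.choice(instances)
    terminate(victim)  # in real life: a call to the cloud provider's API
    return victim

# Usage with a stand-in terminate hook that just records its victim:
killed: list[str] = []
chaos_monkey(["i-0a", "i-0b", "i-0c"], killed.append, random.Random(42))
```

Injecting the termination hook keeps the chaos logic separate from the provider-specific call, and makes the script trivially testable.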
Nowadays the need for this type of practice is growing, and advanced tools for chaos engineering are starting to become available. At re:Invent 2020, AWS unveiled Fault Injection Simulator (FIS), a fully managed service for injecting errors into our systems.
The most interesting work comes after the experiment, when we can learn more about our system by asking questions like:
- How did our system behave?
- Did the countermeasures (e.g. autoscaling, failover, circuit breaker) manage the error?
- Were there any unexpected errors in other parts of the system?
- Were there any performance problems?
- Was the error detected by our monitoring tools?
- How long did recovery take?
The answers to these questions, and any fixes they suggest, should then be prioritised and applied as soon as possible. By doing this we will be a little safer from a Black Swan.
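One of the countermeasures mentioned above, the circuit breaker, fits in a few lines. This is a deliberately minimal sketch (the failure threshold and the lack of a half-open recovery state are simplifying assumptions): after N consecutive failures it stops calling the failing dependency and fails fast instead.

```python
class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures;
    while open, fail fast instead of calling the dependency."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

A chaos experiment is precisely the kind of test that reveals whether a breaker like this actually opens under the injected error, or whether retries keep hammering the failing dependency.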
To take full advantage of chaos engineering, the experiments must be launched in production. Full stop.
However, this practice carries risks of its own, and starting directly in production would be reckless. Better to begin with experiments in development environments and then, as we grow more confident and practised, move towards staging and, why not, production.
The conclusions to this article are obvious and can be summarized once again by quoting Taleb’s words: «Don’t be a turkey!».
That is, we must avoid the simplistic belief that, in the absence of errors for an extended period, we are safe from any risk. A Black Swan can always be on the prowl, ready to show up as soon as we let our guard down.
Instead, we have to stress our system, injecting the most probable errors and observing its behaviour in search of anything unpredictable. Only in this way can we truly know our system and be a little more confident both in its ability to withstand error and in our ability to manage it properly.
In a 2011 talk, Jesse Robbins, a volunteer firefighter then hired at AWS with the title of Master of Disaster, reported the phrase they teach every firefighter on the first day of training school:
Let’s never forget that.
The ideas, the concepts and this article itself would not have seen the light of day without the precious sources listed below.
- Nassim Nicholas Taleb – The Black Swan
- Nassim Nicholas Taleb – Antifragile
- Casey Rosenthal & Nora Jones – Chaos Engineering
- Principles of Chaos Engineering
- AWS Blog – Building Resilient services at Prime Video with Chaos Engineering
- Adrian Hornsby – Chaos Engineering pt.1
- Pavlos Ratis – Chaos Engineering resources
- Jesse Robbins – GameDay: Creating Resiliency Through Destruction
If you liked this post, please consider supporting me.
Commissions may be earned from some of the links above.