Sayan Mondal for LitmusChaos

Posted on Mar 27, 2021 • Edited on Sep 19, 2022

How Litmus Orchestrates Chaos

#tutorial #cloudnative #kubernetes #litmuschaos

Litmus is a Cross-Cloud Chaos Orchestration framework for practising chaos engineering in cloud-native environments. Litmus provides a chaos operator, a large set of chaos experiments on its hub, detailed documentation, and a friendly community.

In this blog, we'll take a look at how Litmus orchestrates chaos experiments and break down the individual components and resources to get a step-by-step understanding of what happens under the hood.

The Basics First

Litmus takes a cloud-native approach to create, manage and monitor chaos. Litmus orchestrates Chaos using the following Primary Kubernetes Custom Resources:

Chaos Experiment: This is a low-level definition of the chaos experiment itself. In this CR you'd find the chaos parameters, the libraries used to implement chaos, the permissions associated with the experiment, and operational characteristics such as whether privileged mode is enabled and the security context of the experiment. Experiment tunables like TOTAL_CHAOS_DURATION and LIB_IMAGE are set to default values by this CR. (The default experiment parameters are pulled from the ChaosHub.)

Chaos Engine: This user-facing CR binds the application instance to the ChaosExperiment. It defines the run policies and also holds the status of your experiment. It lets you customize the experiment to your needs, since it can override some of the default characteristics/tunables in the experiment CR. This CR is reconciled by the Chaos Operator (we'll get to more details about the Chaos Operator soon, but for now think of it as a chaos generator that takes your tuned settings and applies the chaos).

Chaos Result: This resource is created by the experiment and stores the details of the experiment run. It contains important experiment-level details like the current Verdict (Awaited, Passed, Failed), the current status and the nature of the experiment results, the Chaos Engine reference, and salient application/result attributes. It is also a source for metrics collection. It is updated/patched with the status of the experiment run, and it is not removed as part of the default cleanup procedures, so that it stays available for extended reference.
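To make the relationship between the experiment defaults and the engine overrides a bit more concrete, here is a minimal Python sketch. It is a toy model, not Litmus code: the function name `effective_tunables` and the sample values are made up for illustration, and the only thing it assumes is what is described above, namely that values set in the Chaos Engine win over the defaults in the Chaos Experiment.

```python
# Toy model: defaults come from the ChaosExperiment CR, overrides come
# from the ChaosEngine. Names and values here are illustrative only.

def effective_tunables(experiment_defaults, engine_overrides):
    """Return the tunables a single experiment run would see.

    Engine values win over experiment defaults; keys the engine does
    not mention keep their default value.
    """
    merged = dict(experiment_defaults)   # copy so the defaults stay untouched
    merged.update(engine_overrides)      # engine-level values take precedence
    return merged


defaults = {"TOTAL_CHAOS_DURATION": "15", "LIB_IMAGE": "example/lib:latest"}
overrides = {"TOTAL_CHAOS_DURATION": "60"}

tunables = effective_tunables(defaults, overrides)
print(tunables["TOTAL_CHAOS_DURATION"])  # overridden by the engine
print(tunables["LIB_IMAGE"])             # default kept from the experiment
```

The defaults are copied rather than modified in place, which mirrors the idea that the experiment CR stays reusable across several engines.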

The Run Time Flow

The first and foremost step is to install Litmus, either by directly applying the Kubernetes manifest or by doing a helm install of the litmus chart.

helm install litmuschaos --namespace litmus ./charts/litmus-2-0-0-beta/

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/litmus-portal/cluster-k8s-manifest.yml

NOTE: Take a look at the detailed installation instructions to install Litmus step by step on your system.

Once Litmus is installed in your cluster, the CRDs and the Chaos Operator are created as part of the installation. With the basic CRDs in place, you can pull your experiments from the ChaosHub, the hub of experiments, which contains 30+ experiments for you to try out.

After successfully pulling the experiment into your cluster as a CR, you create the ChaosEngine, which binds your application instance to the chaos experiment you just pulled. All this time, the Chaos Operator keeps watch on the ChaosEngine, and as soon as it detects a new one it spawns a Chaos Runner, which is responsible for executing the Chaos Experiment with the help of Experiment Jobs. (The Jobs are spawned by the Chaos Runner based on how you have created the Chaos Engine.)

These Experiment Jobs then create the Chaos Results, which are continually updated as the experiment is being executed. The Chaos Runner listens to these results and patches them to the Chaos Engine accordingly.
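The hand-offs in this flow can be easier to follow as a tiny simulation. The sketch below is only a toy walk-through in Python: the `run_flow` function, the engine dictionary layout and the sample names are invented for illustration and do not reflect how the Litmus components are actually implemented.

```python
# Toy walk-through of the run-time flow: operator -> runner -> jobs -> results.
# The data layout and names are illustrative assumptions, not Litmus internals.

def run_flow(engine):
    """Return the ordered steps of one chaos run for a toy engine dict."""
    steps = ["operator: detected ChaosEngine " + engine["name"]]
    steps.append("operator: spawned chaos runner")

    results = {}
    for experiment in engine["experiments"]:
        steps.append("runner: launched job for " + experiment)
        # Each experiment job creates its own result; the verdict starts
        # out as Awaited and is updated while the experiment executes.
        results[experiment] = "Awaited"
        steps.append("job: created chaos result for " + experiment)

    # The runner reads the results and patches them onto the engine.
    engine["status"] = dict(results)
    steps.append("runner: patched results to engine status")
    return steps


engine = {"name": "nginx-chaos", "experiments": ["pod-delete"]}
for step in run_flow(engine):
    print(step)
```

Reading the printed steps top to bottom gives the same order as the paragraphs above: the operator reacts to the engine, the runner launches the jobs, and the results travel back to the engine.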

How the Chaos Operator works

At this point, we already know that the Chaos Engine is one of the important Custom Resources in this project, but there's much more to it than just binding application instances. The Chaos Engine CR is what holds the crucial information about which target application to induce chaos on!

The Operator, on the other hand, is responsible for identifying whether the target application named in the Chaos Engine exists on the cluster (it does so with the help of the app namespace and app label, which are defined as part of the Chaos Engine CRD). Once it has confirmed that these applications are present on the cluster, it does a routine annotation check (this can be turned off, but if you are worried about controlling the blast radius, do keep it enabled).
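As a rough illustration of this target check, here is a small Python sketch. It is not the operator's code: the function `is_valid_target` and the dictionary shapes are hypothetical, and the `appns`/`applabel` field names, the `key=value` label format and the `litmuschaos.io/chaos` annotation key are assumptions on my part that are not spelled out in this post.

```python
# Hypothetical sketch of the target identification and annotation check.
# Field names and the annotation key are assumptions for illustration.

CHAOS_ANNOTATION = "litmuschaos.io/chaos"


def is_valid_target(app, appinfo, annotation_check=True):
    """Decide whether a toy 'app' dict qualifies as the chaos target."""
    # 1. The app has to live in the namespace named by the engine.
    if app["namespace"] != appinfo["appns"]:
        return False

    # 2. The app has to carry the label named by the engine ("key=value").
    key, _, value = appinfo["applabel"].partition("=")
    if app["labels"].get(key) != value:
        return False

    # 3. Optional opt-in: only annotated apps may be subjected to chaos.
    if annotation_check:
        return app.get("annotations", {}).get(CHAOS_ANNOTATION) == "true"
    return True


appinfo = {"appns": "default", "applabel": "app=nginx"}
nginx = {"namespace": "default", "labels": {"app": "nginx"}, "annotations": {}}

print(is_valid_target(nginx, appinfo))                          # not annotated
print(is_valid_target(nginx, appinfo, annotation_check=False))  # check disabled
```

Keeping the annotation check on means an application has to opt in explicitly, which is what keeps the blast radius under control.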

Next comes validating the policies/run properties that have been constructed as part of the Chaos Engine (the defaults as well as any modifications). The Engine holds the service account that is responsible for creating the Chaos Runner and Experiment Pods (RBAC). Role Bindings are very helpful when it comes to assessing the permissions you have given to a workflow or to individual experiments. This is followed by a validation of the experiment tunables that have been overridden or added, for example adding a new ConfigMap to push some secret to it, overriding the default experiment parameters, or pushing instance-specific data dynamically during the experiment run.

The operator also validates the Chaos Runner attributes: it checks whether the pull policy, the set of images, and the arguments are correct. This comes in really handy when you are targeting infra changes.

Job Cleanup Policy

This is a tunable property of the workflow that decides whether to clear all the resources used for inducing chaos in your application and revert it to its original state, or to retain the created resources and use them to export metrics to your own analytics solutions.
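A short Python sketch of that decision is shown below. It is a simplified model with made-up names: the function `resources_after_run`, the resource labels, and the policy values `delete` and `retain` are assumptions for illustration. The one rule it borrows from earlier in this post is that the Chaos Result survives the default cleanup.

```python
# Simplified model of the cleanup decision. The policy values and the
# resource names are illustrative assumptions, not taken from Litmus code.

def resources_after_run(policy):
    """Return the chaos resources left in the cluster after a run."""
    created = ["chaos-runner pod", "experiment job", "ChaosResult"]

    if policy == "retain":
        # Keep everything around, e.g. to export metrics later on.
        return created
    if policy == "delete":
        # Clean up the chaos pods, but keep the result for reference.
        return [resource for resource in created if resource == "ChaosResult"]
    raise ValueError("unknown cleanup policy: " + repr(policy))


print(resources_after_run("retain"))
print(resources_after_run("delete"))
```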

Monitoring

The operator is also responsible for monitoring your application state. If this tunable is set to true, the experiment metrics are monitored: the operator ensures that these metrics are collected and enables the exporter to scrape them from the Chaos Engine.

In future releases, Litmus won't have this tunable anymore. Since it will no longer be available in the Chaos Engine, users who don't want to collect metrics for their workflows will instead have to skip the installation of the chaos-exporter.

Apart from all the above responsibilities, the Chaos Operator also reconciles the state of the Chaos Engine and takes certain steps based on the result of the Chaos Experiment.

The Three modes of the Operator

Litmus provides several operational modes that widen the range of chaos operations for different personas, making it suitable for the use cases of SREs, developers, service owners, etc., so that they can induce chaos on their specific cluster (individual or shared).

There are three modes that the Chaos Operator supports:

Standard
Admin
Namespaced

Standard - The Standard Mode is where Litmus runs in a centralized namespace (typically litmus) and watches for Chaos Engines created in different namespaces across the cluster. It is mostly suited for SREs, DevOps Admins, etc.

Admin - Admin mode is one of the ways the chaos orchestration is set up in Litmus, wherein all chaos resources (i.e., install time resources like the operator, Chaos Experiment CRs, Chaos Service Account and runtime resources like Chaos Engine, Chaos Runner, etc) are set up in a single admin namespace (typically, litmus). This mode typically needs a "wider" & "stronger" ClusterRole, albeit one that is still just a superset of the individual experiment permissions. In this mode, the applications in their respective namespaces are subjected to chaos while the chaos job runs elsewhere, i.e., admin namespace.

Namespaced - This mode constrains developers to very strict environments. It is meant for setups where they are very conscious about the blast radius, and the policies are set in such a way that they get no visibility into chaos resources in any other namespace.
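One way to compare the three modes is to ask two questions: which Chaos Engines does the operator react to, and which application namespaces can be targeted? The Python sketch below encodes my reading of the descriptions above and nothing more. The function names are hypothetical, and the real behaviour depends on how the operator and its RBAC are configured in your cluster.

```python
# Toy comparison of the three operator modes, based only on the
# descriptions in this post. Function names are hypothetical.

MODES = ("standard", "admin", "namespaced")


def operator_sees_engine(mode, operator_ns, engine_ns):
    """Would the operator pick up a ChaosEngine created in engine_ns?"""
    if mode not in MODES:
        raise ValueError("unknown mode: " + repr(mode))
    if mode == "standard":
        return True  # central operator, engines may live in any namespace
    # admin and namespaced: chaos resources sit next to the operator
    return engine_ns == operator_ns


def can_target_app(mode, operator_ns, app_ns):
    """Can an application in app_ns be subjected to chaos in this mode?"""
    if mode not in MODES:
        raise ValueError("unknown mode: " + repr(mode))
    if mode == "namespaced":
        return app_ns == operator_ns  # strictly scoped to one namespace
    return True  # standard and admin can reach apps in other namespaces


for mode in MODES:
    print(mode,
          operator_sees_engine(mode, "litmus", "shop"),
          can_target_app(mode, "litmus", "shop"))
```

In this reading, Admin mode differs from Namespaced mode only in that the application may live elsewhere while the chaos resources stay in the admin namespace.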

Chaos Orchestration

Now that you are aware of all the Chaos Resources and the primary details that make up the Chaos Experiment, it's time to take a look at the main orchestration. There are three main states of Chaos Orchestration, namely:

Initialization
Completed
Stopped

Initialization

Once the operator triggers chaos and spawns the Chaos Runner Pod, the experiment is in the Initialization phase. The Chaos Operator uses Kubernetes Finalizers, which ensure that the supporting chaos resources are deleted only once all the child resources have reached a logical conclusion and your experiment has finished.

Completed

The execution of the experiment is verified by the Chaos Runner with the help of the Chaos Result resource which then states/validates that the experiment has been completed. At this point, re-applying the Chaos Engine would repeat the same steps and the experiment would be repeated seamlessly. Depending upon your job cleanup policy the Chaos Operator might either remove or retain your experiment Pods. If you decide to remove all the Chaos Resources, the Finalizer would be removed as well.

Stopped

Suppose that, once the experiment is initialized, you have second thoughts about it and want to abruptly stop it in the middle! You can do so either by patching the Engine state to stop or by deleting the Chaos Engine resource completely. If you have patched the Engine, the Operator sets the Engine Status to Stopped. As a part of this abort operation, it forcefully deletes the Chaos Pods. If you re-apply the experiment at this point, it'll re-trigger the experiment run.
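The three states and the moves between them can be summed up as a small state machine. The Python sketch below is a simplification for illustration: the `EngineStatus` enum, the event names and the `next_status` helper are invented here, and the real operator tracks more detail than this.

```python
# Minimal state machine for the orchestration states described above.
# Enum members and event names are illustrative, not Litmus identifiers.
from enum import Enum


class EngineStatus(Enum):
    INITIALIZED = "initialized"
    COMPLETED = "completed"
    STOPPED = "stopped"


# (current status, event) -> next status
TRANSITIONS = {
    (EngineStatus.INITIALIZED, "experiment_finished"): EngineStatus.COMPLETED,
    (EngineStatus.INITIALIZED, "patched_to_stop"): EngineStatus.STOPPED,
    (EngineStatus.COMPLETED, "reapplied"): EngineStatus.INITIALIZED,
    (EngineStatus.STOPPED, "reapplied"): EngineStatus.INITIALIZED,
}


def next_status(status, event):
    """Return the status that follows 'event', or raise if it is not allowed."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            "event %r is not valid in state %s" % (event, status.name)
        ) from None


status = EngineStatus.INITIALIZED
status = next_status(status, "patched_to_stop")  # abort in the middle
print(status.name)
status = next_status(status, "reapplied")        # re-applying re-triggers the run
print(status.name)
```

Both Completed and Stopped lead back to Initialized on a re-apply, which matches the point that re-applying the engine repeats the experiment.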

Conclusion

That's all, folks 👨‍🏫! Thank you for reading till the end. I hope you had a productive time learning about Litmus and how it works internally.

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join Our Community On Slack For Detailed Discussion, Feedback & Regular Updates On Chaos Engineering For Kubernetes: https://kubernetes.slack.com/messages/CNXNB0ZTN
(#litmus channel on the Kubernetes workspace)