
K8s Chaos Dive: Chaos-Mesh Part 1

Craig Morten ・11 min read

In this series I walk through several different open source offerings for performing chaos testing / engineering within your Kubernetes clusters.

In K8s Chaos Dive: Kube-Monkey I covered Kube-Monkey, a simple implementation of the Netflix Chaos Monkey for Kubernetes which allows you to randomly kill pods.

This tool is great for getting off the ground with Chaos testing in Kubernetes but has a couple of failings:

  1. It is only able to kill pods; it can't impact the cluster in any other way.
  2. It requires you to modify the system under test (SUT) by adding labels. This adds extra overhead pre-test for the engineering team and means you need to redeploy applications to enable / disable chaos testing.

In this post we cover a different tool that offers a richer set of features without the need to modify or redeploy existing applications.

Chaos-Mesh

Introduction

Chaos-Mesh is a chaos engineering toolkit that offers a wide range of testing capabilities, from simple pod killing to IO and Network disruption, for the purpose of validating the failure-resiliency of your services.

The tool runs as two main components in the cluster:

  • controller-manager - used to schedule and manage the lifecycle of chaos experiments.
  • chaos-daemon - a daemonset (runs on every node) with privileged system permissions over a node's network, cgroup, etc.

For some experiments the controller-manager also uses admission webhooks to dynamically inject a chaos-sidecar into pods, for example, in order to hijack the I/O of the application container.

The tests themselves are defined using Kubernetes manifests based on one of the six custom resource definitions that Chaos-Mesh provides:

  1. PodChaos
    • pod-kill - Killing pods.
    • pod-failure - Pods becoming unavailable.
    • container-kill - Killing pods' containers.
  2. NetworkChaos
    • netem chaos - Create network delay, duplication, loss, or corruption.
    • network-partition - Simulate network partition through separating pods into several independent subnets by blocking communication between them.
  3. IOChaos - Simulate file system faults such as I/O delay or read / write errors.
  4. TimeChaos - Inject clock skew into pods.
  5. StressChaos
    • cpu-burn - Simulate pod CPU stress.
    • memory-burn - Simulate pod memory stress.
  6. KernelChaos - Inject kernel errors into pods.

To create and run an experiment, you create a Kubernetes manifest file and deploy it to the cluster. The controller-manager will then detect the new experiment object and execute the defined chaos experiment.
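
As a rough illustration of what such a manifest can look like, here is a minimal sketch that renders a PodChaos object from a small shell function. The function name render_pod_kill and its arguments are hypothetical, and the field layout (chaos-mesh.org/v1alpha1, spec.scheduler.cron) is assumed from the v1.0-era API used in this post, so it may differ in other releases.

```shell
# Hypothetical helper: prints a minimal PodChaos manifest to stdout.
# Usage: render_pod_kill <experiment-name> <target-namespace> <app-label>
render_pod_kill() {
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: $1
  namespace: $2
spec:
  action: pod-kill        # one of the PodChaos actions listed above
  mode: one               # pick a single matching pod per run
  selector:
    namespaces:
      - $2
    labelSelectors:
      app.kubernetes.io/name: $3
  scheduler:
    cron: "@every 1m"     # assumed v1.0-style schedule field
EOF
}
```

Something like render_pod_kill my-pod-kill my-namespace my-app | kubectl apply -f - would then submit the rendered manifest to the cluster.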

In addition to deploying chaos experiments using kubectl / helm, Chaos-Mesh also comes with its own dashboard through which you can create and monitor experiments - useful if you prefer a GUI!

See below for a high level overview of the setup:

Chaos-Mesh architecture diagram: custom resource definitions for each chaos experiment type exist in cluster, controller-manager and chaos-dashboard exist as deployments in cluster, chaos-daemon runs on every node and applications can have chaos-sidecar injected by controller-manager.

Walk-through

Further details on Chaos-Mesh can be found on its GitHub repository and in the documentation.

Here we'll walk through setting up the first of three tests:

  1. A pod killing test using the Chaos-Mesh Dashboard - similar to the one covered in K8s Chaos Dive: Kube-Monkey for comparison.
  2. A CPU stress test using Kubernetes manifest files - covered in K8s Chaos Dive: Chaos-Mesh Part 2.
  3. A Memory stress test using Kubernetes manifest files - covered in K8s Chaos Dive: Chaos-Mesh Part 2.

Setting Up A Cluster

I have covered local Minikube Kubernetes cluster setup in a previous tutorial, so I will not revisit it here in full. Please refer to the link for details.

Once you're ready, start your cluster:

minikube start --driver=virtualbox

And this time we will also enable the Kubernetes Metrics Server so we can monitor pod resources later on:

minikube addons enable metrics-server

Deploying A Target Application

Let's deploy some hello-world like nginx pods to target in our experiments (but feel free to use your own applications!). For this we're going to use Helm - a CLI that provides repository management, templating and deployment capabilities for Kubernetes manifests.

If you're on macOS you can install Helm using Homebrew. Installation instructions for other operating systems are available in the Helm Installation Docs.

brew install helm

We can now create a new Helm chart (a collection of templated Kubernetes manifests) which we will call nginx:

helm create nginx

The default chart created by Helm in the create command is for an nginx image, and we will use this out-of-the-box setup as it suits us just fine!

Next we create a new namespace for our target application(s):

kubectl create ns nginx

And finally we deploy 10 replicas of our nginx application, using Helm, to our nginx namespace:

helm upgrade --install nginx ./nginx \
  -n nginx \
  --set replicaCount=10

We can check whether the deployment was successful using both Helm and kubectl:

helm ls -n nginx
kubectl get pod -n nginx

We should see our release is deployed and there should be 10 pods running in the cluster 🎉.
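
If you would rather not count rows by eye, a small filter can do it for you. This is only a sketch: count_ready_pods is a hypothetical helper that assumes the default kubectl get pod column order (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: counts pods whose READY column is fully satisfied
# (e.g. 1/1) and whose STATUS is Running. Reads `kubectl get pod` rows on stdin.
count_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ } END { print n + 0 }'
}
```

Piping kubectl get pod -n nginx --no-headers into count_ready_pods should report 10 once every replica is up.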

Deploying Chaos-Mesh

Let's now deploy Chaos-Mesh. In this tutorial I'm going to use the latest version direct from the Chaos-Mesh GitHub repository using Helm, but you can also install using an installation script provided by the Chaos-Mesh team - check out the Installation Documentation for further details.

First we clone the repository:

git clone https://github.com/chaos-mesh/chaos-mesh

We can then install the Chaos-Mesh custom resource definitions to our cluster which allow us to define and install our chaos experiments:

$ kubectl apply -f ./chaos-mesh/manifests/crd.yaml

customresourcedefinition.apiextensions.k8s.io/iochaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/kernelchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/networkchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/podchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/podnetworkchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/stresschaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/timechaos.chaos-mesh.org created

Next we create a namespace for the Chaos-Mesh deployments:

kubectl create ns chaos-mesh

And finally, install Chaos-Mesh into the cluster:

helm upgrade --install chaos-mesh ./chaos-mesh/helm/chaos-mesh \
  -n chaos-mesh \
  --set dashboard.create=true

Note the --set dashboard.create=true flag which lets Chaos-Mesh know you wish to use the new (experimental) dashboard.

And that's it! We can check that our installation worked successfully:

$ helm ls -n chaos-mesh

NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
chaos-mesh      chaos-mesh      1               2020-08-20 17:02:34.347893 +0100 BST    deployed        chaos-mesh-v0.1.0       v1.0.0    

$ kubectl get pods -n chaos-mesh -l app.kubernetes.io/instance=chaos-mesh

NAME                                      READY   STATUS    RESTARTS   AGE
chaos-controller-manager-fd568948-qzvv2   1/1     Running   0          12s
chaos-daemon-sdkq6                        1/1     Running   0          12s
chaos-dashboard-6d8466f445-2dgk4          1/1     Running   0          12s
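
The pods can take a little while to start, so in a script you may prefer to block until they are ready rather than re-running the command above. A minimal sketch, where wait_for_pods is a hypothetical wrapper around kubectl wait and the 120s default timeout is an arbitrary choice:

```shell
# Hypothetical helper: blocks until every pod in the namespace reports Ready,
# or fails once the timeout (default 120s) expires.
# Usage: wait_for_pods <namespace> [timeout]
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all -n "$1" --timeout="${2:-120s}"
}
```

Calling wait_for_pods chaos-mesh would then return once the controller-manager, daemon and dashboard pods are all Ready.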

Chaos-Mesh Dashboard

Let's get the dashboard we installed opened up in a browser and have an explore! First we can get Minikube to tell us where it is running and launch it:

minikube service chaos-dashboard -n chaos-mesh

This should load the Chaos-Mesh Dashboard in a browser with an Overview page open showing "Total Experiments" and various other widgets.

Chaos-Mesh Dashboard Overview page open in a browser

Given we haven't created or run any experiments yet it isn't particularly exciting, so let's get cracking and start our first chaos experiment.

Experiment 1: Killing Pods

In this experiment we will create and run a new chaos experiment that will kill a random percentage of our Nginx pods every 30 seconds using the Chaos-Mesh Dashboard.

First we click the "New Experiment" button on the left side-menu.

Chaos-Mesh Dashboard with "New Experiment" button highlighted

This opens up a "Create A New Experiment" page. Fill in the name of your experiment (e.g. kill-percentage-nginx-pods) and choose the nginx namespace.

There are also options to add labels and annotations to your chaos experiment object which we will leave blank in this tutorial, but you might find useful if your setup requires either for audit, automation or other purposes.

You will also notice on the right hand side of the screen there are options to load from previous experiments, archives (deleted experiments) as well as upload from a yaml file. These can be useful if you want to re-run an old test, or upload an existing Kubernetes manifest so you can modify the experiment using the GUI.

Once you have filled in the form, click the "Next" button to proceed.

First "Create A New Experiment" page with filled in form and "Next" button highlighted

The second experiment creation page allows you to set the scope of your experiment, i.e. which pods should be impacted. Here we will set the "Namespace Selectors" to nginx (this may already be pre-populated for you!) and for the second "Label Selectors" field, we will choose app.kubernetes.io/name: nginx.

These selectors ensure that our experiment will only target the nginx namespace, and pods in that namespace that have the nginx name label. For your future experiments you can target several namespaces and / or labels for more complex scenarios.

For the "Mode" dropdown, choose Random Max Percent - this should cause a new "Mode Value" input field to appear in which we will enter "100". These two fields will configure our experiment to target a random percentage of the eligible pods between 0 and 100%.
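
To get a feel for what this mode means for our 10 replicas, the arithmetic can be sketched as below. Note the assumption: I'm treating the drawn percentage as being rounded down to a whole number of pods, which is my reading rather than documented behaviour, and pods_selected is a hypothetical helper.

```shell
# Hypothetical helper: how many pods a given drawn percentage would cover,
# assuming the result is rounded down to a whole pod.
# Usage: pods_selected <eligible-pod-count> <drawn-percentage>
pods_selected() {
  echo $(( $1 * $2 / 100 ))
}

pods_selected 10 37    # prints 3
pods_selected 10 100   # prints 10
pods_selected 10 9     # prints 0
```

Under that assumption a low draw can select no pods at all, so don't be surprised if some rounds of the experiment appear to do nothing.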

Second "Create A New Experiment" page with filled in form

Navigating down the page, you may notice there are some additional options which allow you to also select pods based on annotation as well as by phase, e.g. only "Running" or "Pending" pods. There is also a section in which you can manually exclude eligible pods from the experiment which we will leave as-is with all pods selected.

Second "Create A New Experiment" page additional options

Click the "Next" button to navigate to the "Target" page. Here we can choose the exact type of chaos experiment we want to run from the six available offerings.

For this experiment we will use the default selected option of "Pod Lifecycle" and in the dropdown we will choose the "Pod Kill" PodChaos action. This will configure the experiment to target the previously selected pods for killing. Click the "Next" button.

Third "Create A New Experiment" page with filled in form

The final page allows us to define a schedule for our pod killing experiment. Here we will type @every 30s into the "Cron" form field so that our experiment schedules the random pod killing every 30 seconds. The schedule accepts any valid cron syntax supported by the robfig/cron Go library.

Fourth "Create A New Experiment" page with filled in form

Let's complete the experiment creation and click the "Finish" button! This will open an "All steps are complete" confirmation page from which you can either navigate back to previous steps, reset the config or submit. Let's submit our experiment by clicking the "Submit" button 🎉.

Final "Create A New Experiment" confirmation page with "Submit" button

If we now navigate to the "Experiments" tab in the left side menu we can see our new PodChaos experiment listed.

Experiments page listing new PodChaos kill-percentage-nginx-pods experiment
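
Because experiments are ordinary Kubernetes objects, the same experiment should also be visible from the command line. A minimal sketch, assuming the dashboard created the PodChaos object in the namespace chosen on the first page (nginx here); show_pod_chaos is a hypothetical wrapper:

```shell
# Hypothetical helper: lists PodChaos objects in a namespace, or describes one
# when an experiment name is given.
# Usage: show_pod_chaos <namespace> [experiment-name]
show_pod_chaos() {
  if [ -n "${2:-}" ]; then
    kubectl describe podchaos "$2" -n "$1"
  else
    kubectl get podchaos -n "$1"
  fi
}
```

For example, show_pod_chaos nginx lists the experiments, and show_pod_chaos nginx kill-percentage-nginx-pods prints the details of the one we just created.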

Clicking on the experiment we are taken to a details page where we can see the key experiment configuration, a timeline showing experiment execution and an events table allowing you to view details on a particular scheduled event - in this case a scheduled pod kill every 30 seconds.

Experiment details showing configuration and timeline sections

Experiment details showing events section

At the top there are also some options to pause the experiment and archive it (which will delete the experiment). In the "Configuration" section there is also an "Update" button which allows you to modify the experiment yaml in an editor modal.

Experiment details update experiment modal

From the timeline we can see that our experiment is running every 30 seconds, and we can confirm this by watching our Nginx pods in the cluster where we can see a random percentage of the pods are being killed every 30s:

$ kubectl get pods -n nginx -w

NAME                     READY   STATUS    RESTARTS   AGE
nginx-5c96c8f58b-7cstm   1/1     Running   0          100s
nginx-5c96c8f58b-8f9n2   0/1     Running   0          10s
nginx-5c96c8f58b-8htvx   1/1     Running   0          100s
nginx-5c96c8f58b-9vw8v   1/1     Running   0          70s
nginx-5c96c8f58b-cczvx   1/1     Running   0          2m10s
nginx-5c96c8f58b-dnxbz   1/1     Running   0          100s
nginx-5c96c8f58b-p8svr   1/1     Running   0          10s
nginx-5c96c8f58b-plzf6   1/1     Running   0          10s
nginx-5c96c8f58b-ptlsz   1/1     Running   0          100s
nginx-5c96c8f58b-rk4ht   0/1     Running   0          10s

Awesome! We have set up a pod killing chaos experiment and can see it successfully killing our pods. Let's pause the experiment and archive it to remove the experiment from the cluster using buttons on the experiment details page (also available on the "Experiments" page).

You can still find information on your experiment by visiting the "Archives" tab on the left side menu which provides you with a full report on every chaos experiment you have run.

Archives page listing past experiments and links to reports

Archived experiment report details page showing information on the kill-percentage-nginx-pods experiment

Clean-up

Let's clean-up and remove everything we've created today (skip this if you are progressing onto part 2 of this tutorial!).

helm delete chaos-mesh -n chaos-mesh
kubectl delete ns chaos-mesh
kubectl delete crd iochaos.chaos-mesh.org
kubectl delete crd kernelchaos.chaos-mesh.org 
kubectl delete crd networkchaos.chaos-mesh.org
kubectl delete crd podchaos.chaos-mesh.org 
kubectl delete crd podnetworkchaos.chaos-mesh.org
kubectl delete crd stresschaos.chaos-mesh.org
kubectl delete crd timechaos.chaos-mesh.org
helm delete nginx -n nginx
kubectl delete ns nginx
minikube stop
minikube delete
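
If you don't fancy typing out each CRD deletion by hand, the list can be derived from the cluster instead. A minimal sketch, where filter_chaos_crds is a hypothetical helper that keeps only names in the chaos-mesh.org group:

```shell
# Hypothetical helper: keeps only CRD names in the chaos-mesh.org group.
# Reads `kubectl get crd -o name` output on stdin; prints nothing if none match.
filter_chaos_crds() {
  grep '\.chaos-mesh\.org$' || true
}
```

Running kubectl get crd -o name | filter_chaos_crds | xargs kubectl delete should then remove the same CRDs as the individual commands above.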

That's all folks for this tutorial!

There's a lot to take in, so I have chosen to separate the CPU and memory based experiments into a second follow-up post, K8s Chaos Dive: Chaos-Mesh Part 2.

Enjoy the tutorial? Have questions or comments? Or do you have an awesome way to run chaos experiments in your Kubernetes clusters? Drop me a message in the section below or tweet me @CraigMorten!

Till next time 💥
