Craig Morten

K8s Chaos Dive: Kube-Monkey

In this series I walk through several different open source offerings for performing chaos testing / engineering within your Kubernetes clusters.

Kube-Monkey

Introduction

Kube-Monkey is a simple implementation of the Netflix Chaos Monkey for Kubernetes which allows you to randomly delete pods during scheduled time-windows (there has to be some manner of control, right? 😏), enabling you to test and validate the failure-resiliency of your services.

The tool runs as a deployment in your cluster and deletes pods via the Kube API. Unlike some other, more complex chaos offerings, it doesn't offer the ability to disrupt the Nodes themselves or impact network or IO - it is purely a pod-killing tool. Nevertheless, it is quick to configure and deploy, and allows you to simulate and test your product's resiliency to pod failure - when you have an outage it can be valuable to know how quickly everything will come back online (if at all, in the face of a multi-micro-service outage)!

The pod termination schedule is created once a day on weekdays (no weekend callouts, phew! 😅) at a configurable time (default 8am). For each target deployment (configured by labels and an allowlist), the scheduler flips a biased coin to determine whether its pods should be killed, and if so a random time is selected from the daily window (default 10am to 4pm).

When the time comes to terminate the pod, the eligibility of the pod and other settings are double checked for changes (which are honoured), and if all is still in order, the pod is terminated.
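
For reference, these windows map onto a handful of scheduling options that Kube-Monkey's Helm chart exposes under config. The snippet below is a rough sketch of the defaults described above - the key names follow the chart's values.yaml (we touch on them again in the Next Steps section), so do double-check them against the repository before relying on them:

config:
  runHour: 8                  # Weekday hour at which the day's termination schedule is generated
  startHour: 10               # Earliest hour a scheduled kill can fire
  endHour: 16                 # Latest hour a scheduled kill can fire
  timeZone: America/New_York  # Example value only - set your own time zone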

Let's Get Hands-On

Further details about Kube-Monkey can be found on its GitHub repository, but for now let's get going deploying some chaos and seeing how it all works out!

Setting Up A Cluster

The first thing we will need is a Kubernetes cluster to play with. You may already have one set up (please don't follow this tutorial in a Production cluster! 😱), or you might have a favourite local setup you want to use - the exact cluster and provider shouldn't make much difference (unless you've already locked down security pretty tight!).

If you don't have a cluster to hand then I recommend using Minikube for a local Kubernetes development setup.

The installation instructions are in the link, but I will cover the highlights here as well:

  1. Check virtualization is supported by your computer.
  2. Install kubectl.
  3. Install a hypervisor - I recommend VirtualBox.

For macOS users who have Homebrew set up, this looks something like:

# Check if virtualization is supported on macOS.
# If you see VMX in the output (should be colored) then you
# are good to go!
sysctl -a | grep -E --color 'machdep.cpu.features|VMX'

# Install kubectl CLI for interacting with Kubernetes clusters.
brew install kubectl

# Install VirtualBox which we will use as our hypervisor.
# (On older Homebrew versions this was `brew cask install virtualbox`.)
brew install --cask virtualbox

Finally you can install Minikube, e.g.

brew install minikube

Once your installation is complete, you can create your local cluster with the following command:

minikube start --driver=virtualbox

This starts Minikube, instructing it to use the VirtualBox driver. It will spin up the Minikube VM, then download and install the latest stable Kubernetes version onto it to act as our single-node cluster.

Once complete you should have a new local cluster, and Minikube will have already configured kubectl to use the Minikube cluster. You can confirm using:

$ kubectl config current-context

minikube
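
It's also worth a quick sanity check that the cluster itself is up - a fresh Minikube cluster should report a single node in the Ready state:

kubectl get nodes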

Now we're good to start deploying some applications!

Deploying A Target Application

We can't kill pods if there are no pods to kill! 😅

Let's deploy some hello-world like nginx pods (but equally feel free to use your own applications!). For this we're going to use Helm - a CLI that provides repository management, templating and deployment capabilities for Kubernetes manifests.

If you're on macOS you can install it using Homebrew; installation instructions for other OSes are available in the Helm Installation Docs.

brew install helm

We can now create a new Helm chart (a collection of templated Kubernetes manifests) which we will call nginx:

helm create nginx

The default chart created by Helm in the create command is for an nginx image, and we will use this out-of-the-box setup as it suits us just fine!
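
If you're curious what that out-of-the-box setup looks like, peek at the generated ./nginx/values.yaml. At the time of writing, helm create scaffolds something along these lines (heavily abridged - treat it as a sketch rather than the exact file contents):

# ./nginx/values.yaml (abridged)
replicaCount: 1          # We'll override this to 10 at install time

image:
  repository: nginx      # The default chart already points at the nginx image
  pullPolicy: IfNotPresent
  tag: ""                # Empty means the chart's appVersion is used

service:
  type: ClusterIP
  port: 80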

Next we create a new namespace for our target application(s):

kubectl create ns nginx

And finally we deploy 10 replicas of our nginx application, using Helm, to our nginx namespace:

helm upgrade --install nginx ./nginx \
  -n nginx \
  --set replicaCount=10

We can check whether the deployment was successful using both Helm and kubectl:

helm ls -n nginx
kubectl get pod -n nginx

We should see our release is deployed and there should be 10 pods running in the cluster 🎉.

Making Our Application A Target

In order for pods to be considered by Kube-Monkey we need to add specific labels to the Kubernetes deployment manifest file.

Open up the ./nginx/templates/deployment.yaml Helm template in your favourite IDE and modify it to include new kube-monkey labels to both the metadata.labels and spec.template.metadata.labels sections as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "nginx.fullname" . }}
  labels:
    {{- include "nginx.labels" . | nindent 4 }}
    kube-monkey/enabled: "enabled"              # Enable termination of this deployment
    kube-monkey/identifier: "nginx-victim"      # Custom name for our target
    kube-monkey/mtbf: "1"                       # Average number of days between targeting one of these pods
    kube-monkey/kill-mode: "random-max-percent" # The killing method
    kube-monkey/kill-value: "100"               # Killing values, depends on chosen killing method
spec:
{{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
{{- end }}
  selector:
    matchLabels:
      {{- include "nginx.selectorLabels" . | nindent 6 }}
  template:
    metadata:
    {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
    {{- end }}
      labels:
        {{- include "nginx.selectorLabels" . | nindent 8 }}
        kube-monkey/enabled: "enabled"          # See here also
        kube-monkey/identifier: "nginx-victim"  # See here also
    spec:
    ... rest of file

Let's break our additions down:

  1. First we've added kube-monkey/enabled: "enabled" in two locations. This tells Kube-Monkey that this deployment should be considered a target in its termination schedule.
  2. Next is kube-monkey/identifier: "nginx-victim". This is a unique identifier label which is used by Kube-Monkey to determine which pods belong to which deployment (because deployment labels are inherited by the pods they create). Generally it's advised to use the same value as the deployment's name, but you don't have to (for instance we haven't here!).
  3. kube-monkey/mtbf: "1" is the next label. mtbf stands for "Mean Time Between Failure" and determines the average number of days between which the Kubernetes deployment can expect to have one of its pods killed. We've set this value to 1, which means our nginx pods would be considered every day. Note this isn't an exact number of days between kills, only an average used when determining the likelihood of killing pods during the schedule phase.
  4. kube-monkey/kill-mode: "random-max-percent" is an option which allows you to detail how this deployment should be attacked by Kube-Monkey. There are several options including:
    1. kill-all which will result in all pods being killed;
    2. fixed which will result in a fixed number of pods being killed;
    3. random-max-percent which allows you to define a percentage range for the number of pods to be killed - good if you want truly random behaviour. The value provided in the kube-monkey/kill-value label determines the maximum percentage that can / will be killed during a scheduled period.
    4. fixed-percent which is similar to fixed, but defined as a percentage - better if you use horizontal pod autoscaling and want the number of pods killed to be relative to the total number available.
  5. Lastly there is kube-monkey/kill-value: "100" which works alongside the kube-monkey/kill-mode option to determine the number / percentage of pods to be killed (see the alternative example just after this list).
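
To make the kill-mode / kill-value pairing a little more concrete, here's a hypothetical alternative label set for a deployment that should lose exactly two pods on any day it gets picked (the values are illustrative only):

    kube-monkey/enabled: "enabled"
    kube-monkey/identifier: "nginx-victim"
    kube-monkey/mtbf: "2"          # Targeted roughly every other day on average
    kube-monkey/kill-mode: "fixed" # Kill a fixed number of pods...
    kube-monkey/kill-value: "2"    # ...in this case, exactly 2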

Now we've added our labels, let's upgrade our deployment in the cluster and check its status using the same commands as before:

helm upgrade --install nginx ./nginx -n nginx
helm ls -n nginx
kubectl get pod -n nginx

Let There Be Chaos

Now let's introduce Kube-Monkey into our cluster and start creating some chaos.

First we clone the repo:

git clone https://github.com/asobti/kube-monkey

We can then create a new namespace for the Kube-Monkey deployment and deploy using Helm, same as we did for our Nginx application:

# Create the namespace
kubectl create ns kube-monkey

# Deploy Kube-Monkey - be careful, with dryRun=false pods really will be killed!
helm upgrade --install kube-monkey ./kube-monkey/helm/kubemonkey \
  -n kube-monkey \
  --set config.debug.enabled=true \
  --set config.debug.schedule_immediate_kill=true \
  --set config.dryRun=false \
  --set config.whitelistedNamespaces="{nginx}"

# Check the deployment status
helm ls -n kube-monkey
kubectl get pod -n kube-monkey

You may notice that we provided a few additional configuration options when we deployed Kube-Monkey with Helm:

  1. We enabled debug - in your clusters you won't want to do this as it will generate verbose logs, but for this tutorial / testing it can be useful to see what is going on.
  2. We also set schedule_immediate_kill to true. This is a debug option that, instead of scheduling a new chaos window every day, schedules a new window every 30s so you can test out your configuration easily without having to wait a day between tests!
  3. We have set dryRun to false - this means Kube-Monkey will actually kill the target pods. In any important cluster, be sure to test first with dryRun set to true so you can be confident that the correct pods will be targeted - see the safer variant just after this list. (We want controlled chaos, bringing down prod is not the goal 😂)
  4. Finally we have set the whitelistedNamespaces array to add our nginx namespace to the allowed list.
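
If you'd prefer to watch a few scheduling cycles before letting anything actually get deleted, the same release can first be rolled out with dryRun left as true - a sketch of that safer variant (identical flags, only dryRun differs):

# Kube-Monkey will log what it *would* have killed, but delete nothing
helm upgrade --install kube-monkey ./kube-monkey/helm/kubemonkey \
  -n kube-monkey \
  --set config.debug.enabled=true \
  --set config.debug.schedule_immediate_kill=true \
  --set config.dryRun=true \
  --set config.whitelistedNamespaces="{nginx}"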

Because we have set schedule_immediate_kill to true, Kube-Monkey will immediately start applying the configured kill instructions. We can see this working by checking out the Kube-Monkey logs:

$ kubectl logs -n kube-monkey -l release=kube-monkey -f

...
I0819 13:08:47.192768       1 kubemonkey.go:19] Debug mode detected!
I0819 13:08:47.192853       1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec
I0819 13:09:17.193689       1 schedule.go:64] Status Update: Generating schedule for terminations
I0819 13:09:17.215602       1 schedule.go:57] Status Update: 1 terminations scheduled today
I0819 13:09:17.215809       1 schedule.go:59] v1.Deployment nginx scheduled for termination at 08/19/2020 09:09:22 -0400 EDT
        ********** Today's schedule **********
        k8 Api Kind     Kind Name               Termination Time
        -----------     ---------               ----------------
        v1.Deployment   nginx           08/19/2020 09:09:22 -0400 EDT
        ********** End of schedule **********
I0819 13:09:17.218029       1 kubemonkey.go:62] Status Update: Waiting to run scheduled terminations.
I0819 13:09:22.632324       1 request.go:481] Throttling request took 103.486967ms, request: DELETE:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-fclgb
I0819 13:09:22.836478       1 request.go:481] Throttling request took 151.926056ms, request: GET:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-xkzg8
I0819 13:09:23.033219       1 request.go:481] Throttling request took 181.015105ms, request: DELETE:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-xkzg8
I0819 13:09:23.049849       1 kubemonkey.go:70] Termination successfully executed for v1.Deployment nginx
I0819 13:09:23.049869       1 kubemonkey.go:73] Status Update: 0 scheduled terminations left.
I0819 13:09:23.049876       1 kubemonkey.go:76] Status Update: All terminations done.
I0819 13:09:23.049999       1 kubemonkey.go:19] Debug mode detected!
I0819 13:09:23.050006       1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec
...

Here we can see that Kube-Monkey generated a termination schedule, targeted our Nginx deployment, and then killed 2 of our 10 pods - it looks like it's working!

Let's just check that our pods are actually being killed:

$ kubectl get pod -n nginx -w

NAME                     READY   STATUS    RESTARTS   AGE
nginx-6bb5bbd776-b5drk   1/1     Running   0          6m53s
nginx-6bb5bbd776-b6gqj   1/1     Running   0          5m28s
nginx-6bb5bbd776-b7pkj   1/1     Running   0          5m29s
nginx-6bb5bbd776-bl88b   1/1     Running   0          12s
nginx-6bb5bbd776-cswxk   1/1     Running   0          12s
nginx-6bb5bbd776-f84mr   1/1     Running   0          11s
nginx-6bb5bbd776-krkgn   1/1     Running   0          2m44s
nginx-6bb5bbd776-nvf42   1/1     Running   0          2m44s
nginx-6bb5bbd776-s4n7l   1/1     Running   0          12s
nginx-6bb5bbd776-w2chn   0/1     Running   0          11s

There we have it - it looks like 5 of our 10 pods have been killed and replaced within the last 12 seconds or so. I think we can call that a success 🎉.

Clean Up

In its current state, Kube-Monkey will continue to kill our Nginx pods every 30s until the end of time. Let's save some energy and do some clean-up!

helm delete kube-monkey -n kube-monkey
kubectl delete ns kube-monkey
helm delete nginx -n nginx
kubectl delete ns nginx

That should be both the Nginx and Kube-Monkey deployments removed from our cluster. We can also tear down the Minikube cluster by running:

minikube stop
minikube delete

And that should be us back to square one.

Next Steps

So today we've successfully:

  1. Created a new Kubernetes cluster.
  2. Deployed an Nginx application configured to be disrupted by Kube-Monkey.
  3. Deployed Kube-Monkey into the cluster to kill our Nginx pods on a scheduled basis.
  4. Cleaned up after our experiment by deleting all of our deployments and tearing down the Kubernetes cluster.

What's next is to use Kube-Monkey for chaos experiments in your pre-production (or even production if you're brave!) Kubernetes clusters and start reviewing and validating your applications' resiliency. Here are some pointers:

  1. Update your existing Kubernetes manifests or Helm charts with the appropriate kube-monkey labels. Perhaps you can start with kube-monkey/enabled: "disabled" to gain confidence that your application still deploys without issue.
  2. Add the Kube-Monkey Helm chart to your collection of Helm charts. Update the values.yaml and various templates to meet your needs - for instance you might want to set an appropriate value for timeZone and logLevel as well as your own custom schedule windows using the runHour, startHour and endHour options.
  3. Deploy your newly configured Kube-Monkey chart to your Kubernetes cluster - perhaps with dryRun set to true initially so you can follow the logs and make sure that it is going to behave as expected.
  4. Set Kube-Monkey dryRun to false and start regularly chaos testing! You should configure alerts to capture any undesired behaviour and monitor cluster and application health regularly - I recommend checking out Prometheus for cluster telemetry and alerting (using Alertmanager) and Grafana for monitoring dashboards. A sketch of one such alert follows this list.
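
As an illustration of the kind of alert that pairs nicely with regular chaos runs, here's a minimal sketch of a prometheus-operator PrometheusRule that fires when pods keep restarting. It assumes you're running the Prometheus Operator and kube-state-metrics; the rule name, thresholds and labels are made up for the example:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-resilience-alerts
spec:
  groups:
    - name: chaos.rules
      rules:
        - alert: PodRestartingTooOften
          # More than 3 container restarts in 15 minutes suggests the service
          # is not recovering cleanly from pod terminations.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"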

That's It!

That's all folks - hope that was a quick and useful tutorial into setting up Kube-Monkey for simple pod-killing based chaos testing.

What are you guys using for chaos testing in Kubernetes? Have any cool suggestions, questions or comments? Drop them in the section below!

Till next time y'all! 👋
