Sanjay Nathani

Scheduling Chaos: An introduction to the Litmus Chaos Scheduler

Introduction

Hey all! I am Sanjay Nathani, a contributor to the LitmusChaos project and a Software Engineer at MayaData. By now, I assume you are already familiar with the concept of cloud-native chaos engineering and how the LitmusChaos project enables you to practice it.

As members of the larger chaos engineering community, we observed, while examining the use cases of different adopters, that chaos needs to be available as a background service. Random injection via manual execution of experiments in pre-prod/production (read: gamedays) and CI-driven execution on dev environments is still the norm in many cases. However, many organizations are adopting a continuous-chaos strategy as part of a shift-left paradigm, in which staging clusters (or equivalent environments that mimic prod characteristics and traffic) are subjected to service and infrastructure faults repeatedly, in a periodic or random fashion. The goal, in most of these cases, is to observe the resilience of the microservices at various times and operational states. The load on the microservices in a cluster varies over its lifetime: there are peak traffic periods, which may last for a few hours in a day or a few days in a month, and it is necessary to compare how the KPIs (key performance indicators) fare at different periods when failures occur.

Based on this, we decided to create the chaos-scheduler to inject chaos repeatedly. It provides a flexible schema with which developers and SREs can automate chaos runs while defining, for example, a minimum interval between two instances of chaos or the total number of chaos instances across a time range.

What is Chaos Scheduler?

The Chaos Scheduler is a Kubernetes controller (built using the Operator-SDK framework) that reconciles a custom resource called ChaosSchedule. This resource is essentially a higher-level abstraction that embeds the (now-familiar) ChaosEngine template along with a schedule specification. While still an alpha component today, the Chaos Scheduler is already seeing adoption and is poised to become an optional component in the Litmus deployment bundle (helm chart).

In this blog, let’s take a closer look at the scheduling options provided by the chaos scheduler and how you can give it a spin in your cluster.

Dissecting the ChaosSchedule Custom Resource

The ChaosSchedule is the core schema that defines the chaos workflow for a given Application Under Test (AUT) or Node Under Test (NUT). It defines the following:

  • Execution Schedule for the experiments
  • Template Spec of ChaosEngine detailing the chaos action

As mentioned earlier, one of the goals of the Chaos Scheduler was to provide a flexible and rich set of configuration options; after all, the standard Kubernetes CronJob already exists if the requirement is only to repeat the chaos action. As of today, there are three ways to schedule the chaos to be injected:

  • Now: This will trigger the chaos as soon as the ChaosSchedule CR is created and is similar to the on-demand execution model available today.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
  • Once: This will schedule the chaos at a specific time denoted by executionTime.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    once:
      executionTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
  • Repeat: This type of schedule ensures the repeated execution of chaos over a time range. We define a startTime and an endTime, along with a minChaosInterval that enforces a mandatory cool-off period in which to observe adherence to MTTR (Mean-Time-To-Recover). This option also allows whitelisting/blacklisting days of the week for chaos. Here is a sample of how to inject chaos in this way.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
      endTime: "2020-05-12T05:52:00Z"   #should be modified according to current UTC Time
      minChaosInterval: "2m"   #format should be like "10m" or "2h" accordingly for minutes and hours
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
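
The executionTime, startTime and endTime values in the samples above have to be adjusted to the current UTC time. As a convenience, here is a minimal sketch that prints timestamps in the same format. The helper name utc_after_minutes is hypothetical, and the snippet assumes GNU date (the -d flag); BSD/macOS date needs a different invocation.

```shell
#!/usr/bin/env bash
# Print a UTC timestamp N minutes from now, in the format used by the
# ChaosSchedule samples (for example 2020-05-12T05:47:00Z).
# Assumption: GNU date, which accepts relative times through -d.
utc_after_minutes() {
  date -u -d "+$1 minutes" +"%Y-%m-%dT%H:%M:%SZ"
}

# Example: a five-minute repeat window that opens two minutes from now.
START_TIME="$(utc_after_minutes 2)"
END_TIME="$(utc_after_minutes 7)"
echo "startTime: \"${START_TIME}\""
echo "endTime: \"${END_TIME}\""
```

The two printed lines can be pasted into the repeat section of the manifest; for a Once schedule, a single call gives the executionTime.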

Needless to say, the ChaosSchedule is set as the owner of the secondary resources it creates (the ChaosEngine), so Kubernetes deletion propagation removes them too when the ChaosSchedule CR is deleted.
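
To see this ownership on a live cluster, one option is to print the owner references of the generated ChaosEngine. This is only a sketch: the function name engine_owners is hypothetical, and it assumes the engine carries the same name as the schedule, as in the describe output later in this post.

```shell
#!/usr/bin/env bash
# Print "<kind>/<name>" for every owner reference on a ChaosEngine.
# Usage: engine_owners <engine-name> [namespace]   (namespace defaults to "default")
engine_owners() {
  kubectl get chaosengine "$1" -n "${2:-default}" \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
}
```

Calling engine_owners schedule-nginx should list the ChaosSchedule if the owner reference is set as described; deleting the schedule with kubectl delete chaosschedule schedule-nginx should then remove the engine as well.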

Getting Started

In this section, let us view the steps involved in setting up a demo environment to try out the Chaos Scheduler.

Install Litmus Chaos Operator, RBAC and CRDs

kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-latest.yaml

namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created

Install Chaos Scheduler and its CRDs

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/crds/chaosschedule_crd.yaml

customresourcedefinition.apiextensions.k8s.io/chaosschedules.litmuschaos.io created

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/chaos-scheduler.yaml

deployment.apps/chaos-scheduler created

Create the pod-delete Chaos Experiment in the default namespace

NOTE: In this example, I intend to inject chaos on a single replica Nginx deployment running in the default namespace. Modify according to your environment.

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/experiment.yaml

chaosexperiment.litmuschaos.io/pod-delete created

Set up the RBAC to execute the pod-delete chaos

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/rbac.yaml

serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created

Annotate your application to enable chaos

kubectl annotate deploy/nginx-deployment litmuschaos.io/chaos="true"

deployment.extensions/nginx-deployment annotated
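
Since annotationCheck is set to 'true' in the schedule samples, it can be worth confirming that the annotation actually landed on the deployment. A small sketch follows; the function name chaos_annotation is hypothetical, and the dot in the annotation key is escaped because kubectl's jsonpath would otherwise treat it as a path separator.

```shell
#!/usr/bin/env bash
# Print the value of the litmuschaos.io/chaos annotation on a deployment.
# Usage: chaos_annotation <deployment> [namespace]   (namespace defaults to "default")
chaos_annotation() {
  kubectl get deployment "$1" -n "${2:-default}" \
    -o jsonpath='{.metadata.annotations.litmuschaos\.io/chaos}'
}
```

Running chaos_annotation nginx-deployment prints the annotation value, or nothing if the annotation is missing.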

Before proceeding, let's check that everything is up and running.

  • Chaos Scheduler and Chaos Operator should both be in the Running state
   kubectl get po -n litmus

   chaos-operator-ce-5cd5894879-k7wgz   1/1     Running   0          10m
   chaos-scheduler-84fcccb5bd-mjpnj     1/1     Running   0          10m
  • Ensure the Service Accounts for scheduler and operator are created
   kubectl get sa -n litmus

   default     1         10m
   litmus      1         10m
   scheduler   1         10m
  • Ensure the service account for the intended experiment is created successfully
   kubectl get sa

   default         1         10m
   pod-delete-sa   1         10m
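
The first check above can also be scripted instead of eyeballed. The sketch below waits for both deployments to finish rolling out; the function name wait_for_litmus is hypothetical, while the deployment names and the litmus namespace are taken from the install output earlier in this section.

```shell
#!/usr/bin/env bash
# Wait until the operator and scheduler deployments have rolled out.
# Usage: wait_for_litmus [timeout]   (timeout defaults to 120s)
# Returns non-zero as soon as one of the rollouts fails or times out.
wait_for_litmus() {
  for deploy in chaos-operator-ce chaos-scheduler; do
    kubectl rollout status "deployment/${deploy}" -n litmus \
      --timeout="${1:-120s}" || return 1
  done
}
```

A call such as wait_for_litmus 60s can gate the remaining steps in a setup script.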

Now we can safely move on.

Create a ChaosSchedule YAML file (chaos-schedule.yaml) with the application and experiment information along with the scheduling logic

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
  namespace: litmus
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
      endTime: "2020-05-12T05:52:00Z"   #should be modified according to current UTC Time
      minChaosInterval: "2m"   #format should be like "10m" or "2h" accordingly for minutes and hours
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'

Create a ChaosSchedule custom resource

kubectl apply -f chaos-schedule.yaml

Watch the injection of chaos at any point in time

watch kubectl get pod 

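Describe the ChaosSchedule for the details of chaos injection. The sample output below comes from a schedule in the default namespace; add -n <namespace> if yours was created elsewhere.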

kubectl describe chaosschedule schedule-nginx

Name:         schedule-nginx
Namespace:    default
Labels:       <none>
Annotations:
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosSchedule
Metadata:
  Creation Timestamp:  2020-05-14T08:44:32Z
  Generation:          3
  Resource Version:    899464
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/default/chaosschedules/schedule-nginx
  UID:                 347fb7e6-2c9d-428e-9ce1-42bdcfdab37d
Spec:
  Chaos Service Account:  
  Engine Template Spec:
    Appinfo:
      Appkind:              deployment
      Applabel:             app=nginx
      Appns:                default
    Chaos Service Account:  litmus
    Components:
      Runner:
    Experiments:
      Name:  pod-delete
      Spec:
        Components:
        Rank:             0
    Job Clean Up Policy:  retain
  Schedule:
    Repeat:
      End Time:            2020-05-12T05:52:00Z
      Included Days:       Mon,Tue,Wed
      Instance Count:      2
      Min Chaos Interval:  2m
      Start Time:          2020-05-12T05:47:00Z
  Schedule State:        active
Status:
  Active:
    API Version:       litmuschaos.io/v1alpha1
    Kind:              ChaosEngine
    Name:              schedule-nginx
    Namespace:         default
    Resource Version:  899463
    UID:               14f49857-8879-4129-a5b9-a3a592149725
  Last Schedule Time:  2020-05-14T08:44:32Z
  Schedule:
    Start Time:              2020-05-14T08:44:32Z
    Status:                  running
    Total Instances:         1
Events:
  Type    Reason            Age   From             Message
  ----    ------            ----  ----             -------
  Normal  SuccessfulCreate  39s   chaos-scheduler  Created engine schedule-nginx

Halting ChaosSchedule

At any point in time we can halt a ChaosSchedule, which simply means stopping any further execution of chaos. Halting is useful when we do not want to disturb the production cluster or an application during some important activity (such as a migration): we can pause the schedule without the effort of deleting and recreating it.

To halt the schedule, change spec.scheduleState to halt:

spec:
    scheduleState: halt
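
Instead of editing the manifest by hand, the same field can be changed in place with kubectl patch. The sketch below is one way to do it, not something shipped with the scheduler: the function name set_schedule_state is hypothetical, the value active is taken from the Schedule State shown in the describe output above, and resuming by patching back to active is an assumption.

```shell
#!/usr/bin/env bash
# Set spec.scheduleState on a ChaosSchedule using a JSON merge patch.
# Usage: set_schedule_state <schedule-name> <halt|active> [namespace]
# The namespace defaults to "default".
set_schedule_state() {
  kubectl patch chaosschedule "$1" -n "${3:-default}" --type merge \
    -p "{\"spec\":{\"scheduleState\":\"$2\"}}"
}
```

For example, set_schedule_state schedule-nginx halt pauses the schedule from this post. A merge patch is used because custom resources do not support kubectl's default strategic merge patch.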

Conclusion

With the Chaos Scheduler, users no longer have to re-apply ChaosEngine manifests or remember to run chaos at different times themselves; they only have to compare execution results! As you read this, the Chaos Scheduler is being improved to support randomized execution within a time range. So, more power coming your way! Do try out the steps and let us know what you think of the scheduler and which use cases it should support!

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community in the #litmus channel on the Kubernetes Slack.
Contribute to LitmusChaos and share your feedback on GitHub.
If you like LitmusChaos, become one of the many stargazers on GitHub.

litmuschaos / litmus

Litmus helps Kubernetes SREs and developers practice chaos engineering in a Kubernetes native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes is at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q

LitmusChaos

Litmus

Cloud-Native Chaos Engineering

Overview

Litmus is a toolset to do cloud-native chaos engineering. Litmus provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments initially in the staging environment and eventually in production to find bugs and vulnerabilities. Fixing the weaknesses leads to increased resilience of the system.

Litmus takes a cloud-native approach to create, manage and monitor chaos. Chaos is orchestrated using the following Kubernetes Custom Resource Definitions (CRDs):

  • ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator which then invokes Chaos-Experiments
  • ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
  • ChaosResult: A resource to hold the results of a chaos-experiment. The Chaos-exporter reads the…
