One of the foremost principles of chaos engineering is to build a hypothesis around the steady-state (desired) behavior of a system, and to automate the process of gauging, analyzing & reporting on it over the course of an experiment. This is indispensable for anyone running these experiments to gain insights into the resilience of their applications & deployment infrastructure. In other words, chaos engineering frameworks/toolsets must let users define their hypotheses & validate the system against them.
Further, in the world of cloud-native chaos engineering, the hypothesis definition is expected to be declarative in nature, much in the same way the chaos intent is. Another very interesting factor here is the diversity in the way the steady-state is defined. Most of the time, it consists of systemic behavior patterns: metrics around error rates, latency percentiles, etc. At other times, it is the liveness of a crucial downstream service, the state (values within) of a database, or even the state of Kubernetes resources. Typically, these can be mapped to operational SLOs (Service Level Objectives) that have been agreed upon. These are the parameters & conditions that verify that the system “works” (as a discerning reader, you might have noticed the subtle difference here between chaos experimentation and standard failure testing: verifying that the system works, rather than validating how it works. Anyway, I digress...).
So, to summarize our understanding (courtesy: standard chaos principles & learnings from the awesome chaos community):
- (a) Hypotheses are crucial to chaos experiments.
- (b) Chaos frameworks need to burn them into experiments (remember automated chaos!).
- (c) Hypothesis definitions should be declarative in nature & thereby easily tuned & scaled.
- (d) The schema & architecture should accommodate the diverse nature of hypothesis definitions.
As a community, we identified the need for “steady-state” validations pretty early & introduced the pre-chaos & post-chaos checks in the experiments, with appropriate Kubernetes events generated in the respective stages. In the “generic” category of experiments, this corresponds to a vanilla check that verifies whether all pods bearing the specified application label in a given namespace are in the “Running” state (the same check also verifies that all the containers within these pods have the right status).
In application-specific chaos experiments (such as Kafka-broker-failure or Cassandra-node-failure), health checks native to the applications have been included in the experiment business logic. There are also examples and documentation resources available for users to add custom checks that implement a hypothesis within a newly bootstrapped/scaffolded chaos experiment using the Litmus SDK. In some ways, this ticks off points (a) & (b) in our summary.
However, the aforementioned approach also places the onus on the chaos developers/users to create and maintain new experiments or modify existing ones with additional logic every time the hypothesis changes (for example, new criteria might be added to indicate steady-state).
This ties in directly to (c), i.e., we needed a declarative way to specify the checks to be carried out while reusing the standard (generic) chaos experiments from the ChaosHub, and also to (d), i.e., accommodating standard interfaces to describe these checks.
The aforementioned reasons prompted us to come up with the “Litmus Probes” feature in the 1.7.0 release!
In this blog, I will talk about how you can use Litmus Probes to build custom checks against your application/infra in a reasonably declarative fashion, i.e., by providing additional inputs in the ChaosEngine YAML manifest.
Litmus probes are “pluggable” checks that you can define within the ChaosEngine for any chaos experiment. The experiment pods execute these checks based on the mode they are defined in & treat their success as a necessary condition in determining the verdict of the experiment (along with the standard “in-built” checks).
Litmus currently supports four types of probes:
- httpProbe: To query health/downstream URIs
- cmdProbe: To execute any user-desired health-check function implemented as a shell command
- k8sProbe: To perform CRUD operations against native & custom Kubernetes resources
- promProbe: To validate SLOs by running queries against a Prometheus endpoint
These probes can be used in isolation or in several combinations to achieve the desired checks. As we will see in subsequent sections, while the httpProbe, promProbe & k8sProbe are fully declarative in the way they are conceived, the cmdProbe expects the user to provide a shell command to implement checks that are highly specific to the application use case. Does it sound similar to the “command” or “shell” module in the Ansible world? The intent is similar too!
The probes can be set up to run in different modes:
- SoT: Executed at the Start of Test as a pre-chaos check
- EoT: Executed at the End of Test as a post-chaos check
- Edge: Executed both before and after the chaos
- Continuous: The probe is executed continuously, with a specified polling interval, during the chaos injection.
- OnChaos: The probe is executed continuously, with a specified polling interval, strictly for the duration of the chaos.
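To make the relationship between modes and experiment phases concrete, here is a minimal Python sketch. It is not Litmus code: the phase labels and the helpers `phases_for_mode` and `is_polled` are illustrative assumptions that simply follow the descriptions above. Mode names are matched case-insensitively, since both the "EoT" and "EOT" spellings appear in this post.

```python
# Illustrative mapping only; the phase labels are assumptions, not Litmus identifiers.
PHASES_BY_MODE = {
    "sot": ("pre-chaos",),
    "eot": ("post-chaos",),
    "edge": ("pre-chaos", "post-chaos"),
    "continuous": ("during-chaos",),
    "onchaos": ("during-chaos",),
}

# Modes that re-run the check at every polling interval instead of once per phase.
POLLED_MODES = {"continuous", "onchaos"}


def phases_for_mode(mode: str) -> tuple:
    """Return the phases in which a probe with the given mode is evaluated."""
    key = mode.lower()
    if key not in PHASES_BY_MODE:
        raise ValueError(f"unknown probe mode: {mode!r}")
    return PHASES_BY_MODE[key]


def is_polled(mode: str) -> bool:
    """True for modes that poll repeatedly rather than run a one-shot check."""
    return mode.lower() in POLLED_MODES
```

For instance, `phases_for_mode("Edge")` yields both the pre-chaos and post-chaos phases, which is why an Edge probe has two chances to fail.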
All probes share some common attributes:
- probeTimeout: Represents the time limit for the probe to execute the check specified and return the expected data.
- retry: The number of times a check is re-run upon failure in the first attempt, before declaring the probe status as failed.
- interval: The period between subsequent retries.
- probePollingInterval: The time for which a continuous probe sleeps after each iteration.
- initialDelaySeconds: The initial waiting time before the probe starts executing.
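The interplay of retry, interval and initialDelaySeconds can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the Litmus implementation: `run_with_retries` and its injectable `sleep` parameter are hypothetical names, the total number of attempts is assumed to be one initial run plus `retry` re-runs (as the description of retry suggests), and probeTimeout and probePollingInterval are deliberately not modeled.

```python
import time
from typing import Callable


def run_with_retries(check: Callable[[], bool], retry: int, interval: float,
                     initial_delay: float = 0.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `check` once, then re-run it up to `retry` times on failure.

    `interval` is the pause between attempts and `initial_delay` mirrors
    initialDelaySeconds. `sleep` is injectable so the loop can be tested
    without actually waiting.
    """
    sleep(initial_delay)
    attempts = 1 + retry  # assumption: first attempt plus `retry` re-runs
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next retry
    return False
```

With `retry: 1` and `interval: 5`, a check that fails once and then succeeds is reported as passed after a single five-second pause.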
Let us take a look at the different probe categories in some more detail.
httpProbe allows developers to specify a URL which the experiment uses to gauge health/service availability (or other custom conditions) as part of the entry/exit criteria. The received status code is matched against an expected status. It supports the HTTP Get and Post methods.
With the HTTP Get method, the probe sends an HTTP Get request to the provided URL and matches the response code based on the given criteria (==, !=, oneOf).
With the HTTP Post method, the probe sends an HTTP Post request to the provided URL. The HTTP body can be provided in the body field. For a complex Post request in which the body spans multiple lines, the bodyPath attribute can be used to provide the path to a file containing the body. This file can be made available to the experiment pod via a ConfigMap resource, with the ConfigMap name being defined in the ChaosEngine OR the ChaosExperiment CR.
The httpProbe can be defined at .spec.experiments.spec.probe inside the ChaosEngine.
NOTE: body and bodyPath are mutually exclusive.
    probe:
      - name: "check-frontend-access-url"
        type: "httpProbe"
        httpProbe/inputs:
          url: "<url>"
          insecureSkipVerify: false
          method:
            get:
              criteria: == # supports ==, != and oneOf operations
              responseCode: "<response code>"
        mode: "Continuous"
        runProperties:
          probeTimeout: 5
          interval: 5
          retry: 1
          probePollingInterval: 2
The httpProbe is best used in the Continuous mode of operation, as a parallel liveness indicator of a target or downstream application. It uses the probePollingInterval property to specify the polling interval for the access checks.
insecureSkipVerify can be set to true to skip the certificate checks.
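The response-code matching described for the httpProbe can be sketched in a few lines of Python. This is only an illustration of the three criteria (==, != and oneOf): the function name `response_code_matches` is hypothetical, and the format assumed for a oneOf list (comma-separated codes, with optional square brackets) is an assumption rather than something this post specifies.

```python
def response_code_matches(actual: int, criteria: str, expected: str) -> bool:
    """Compare a received HTTP status code against the expected value(s)."""
    if criteria == "==":
        return actual == int(expected)
    if criteria == "!=":
        return actual != int(expected)
    if criteria.lower() == "oneof":
        # Assumption: codes arrive as "200,201,202" or "[200, 201, 202]".
        allowed = {int(code) for code in expected.strip("[]").split(",")}
        return actual in allowed
    raise ValueError(f"unsupported criteria: {criteria!r}")
```

A liveness-style probe would typically use `==` with 200, while `!=` is handy for asserting that an endpoint does not return a specific error code during chaos.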
The cmdProbe allows developers to run shell commands and match the resulting output as part of the entry/exit criteria. The intent behind this probe was to allow users to implement a non-standard & imperative way of expressing their hypothesis. For example, the cmdProbe enables you to check for specific data within a database, parse the value out of a JSON blob being dumped into a certain path, or check for the existence of a particular string in the service logs.
In order to enable this behavior, the probe supports an inline mode, in which the command is run from within the experiment image, as well as a source mode, where the command is executed from within a new pod whose image can be specified. While inline is preferred for simple shell commands, source mode can be used when application-specific binaries are required. The cmdProbe can be defined at the .spec.experiments.spec.probe path inside the ChaosEngine.
    probe:
      - name: "check-database-integrity"
        type: "cmdProbe"
        cmdProbe/inputs:
          command: "<command>"
          comparator:
            type: "string" # supports: string, int, float
            criteria: "contains" # supports >=, <=, >, <, ==, != for int and contains, equal, notEqual, matches, notMatches for string values
            value: "<value-for-criteria-match>"
          source:
            image: "<repo>/<tag>" # it can be any image
        mode: "Edge"
        runProperties:
          probeTimeout: 5
          interval: 5
          retry: 1
          initialDelaySeconds: 5
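To illustrate how the comparator could evaluate a command's output, here is a small Python sketch. It is not the Litmus implementation, and `compare` is a hypothetical helper. Two assumptions are made: float values use the same numeric operators that the comment above lists for int, and "matches"/"notMatches" treat the value as a regular expression.

```python
import operator
import re

# Numeric criteria listed in the cmdProbe example above.
NUMERIC_OPS = {
    ">=": operator.ge, "<=": operator.le, ">": operator.gt,
    "<": operator.lt, "==": operator.eq, "!=": operator.ne,
}

# String criteria; regex semantics for matches/notMatches are an assumption.
STRING_OPS = {
    "contains": lambda out, val: val in out,
    "equal": lambda out, val: out == val,
    "notEqual": lambda out, val: out != val,
    "matches": lambda out, val: re.search(val, out) is not None,
    "notMatches": lambda out, val: re.search(val, out) is None,
}


def compare(output: str, type_: str, criteria: str, value: str) -> bool:
    """Evaluate command output against the comparator's type, criteria and value."""
    output = output.strip()  # drop the trailing newline most commands emit
    if type_ in ("int", "float"):
        cast = int if type_ == "int" else float
        return NUMERIC_OPS[criteria](cast(output), cast(value))
    if type_ == "string":
        return STRING_OPS[criteria](output, value)
    raise ValueError(f"unsupported comparator type: {type_!r}")
```

For example, a command that prints a row count could be checked with type "int" and criteria ">=", while a log grep is more naturally expressed with "contains".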
With the proliferation of custom resources & operators, especially in the case of stateful applications, the steady-state is manifested as status parameters/flags within Kubernetes resources. k8sProbe addresses verification of the desired resource state by allowing users to define the Kubernetes GVR (group-version-resource) with appropriate filters (field selectors/label selectors). The experiment makes use of the Kubernetes Dynamic Client to achieve this. The k8sProbe can be defined at the .spec.experiments.spec.probe path inside the ChaosEngine.
It supports the following CRUD operations, which can be defined in the operation field:
- create: Creates a Kubernetes resource based on the data provided inside the probe.data field.
- delete: Deletes the matching Kubernetes resource via the GVR and filters (field selectors/label selectors).
- present: Checks for the presence of a Kubernetes resource based on the GVR and filters (field selectors/label selectors).
- absent: Checks for the absence of a Kubernetes resource based on the GVR and filters (field selectors/label selectors).
    probe:
      - name: "check-app-cluster-cr-status"
        type: "k8sProbe"
        k8sProbe/inputs:
          command:
            group: "<appGroup>"
            version: "<appVersion>"
            resource: "<appResource>"
            namespace: "default"
            fieldSelector: "metadata.name=<appResourceName>,status.phase=Running"
            labelSelector: "<app-labels>"
          operation: "present" # it can be present, absent, create, delete
        mode: "EOT"
        runProperties:
          probeTimeout: 5
          interval: 5
          retry: 1
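The present and absent operations boil down to "does any resource match the filters?". The Python sketch below illustrates that idea on plain dictionaries that stand in for resources already fetched from the cluster. It does not talk to Kubernetes, all helper names are hypothetical, and it only understands simple `key=value` selector terms, whereas real Kubernetes selectors support more operators.

```python
def parse_selector(selector: str) -> dict:
    """Parse 'a=b,c=d' into {'a': 'b', 'c': 'd'} (equality terms only)."""
    terms = {}
    for term in filter(None, (t.strip() for t in selector.split(","))):
        key, _, value = term.partition("=")
        terms[key.strip()] = value.strip()
    return terms


def get_path(obj: dict, dotted: str):
    """Resolve a dotted field path such as 'status.phase'; None if missing."""
    current = obj
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(resource: dict, field_selector: str, label_selector: str) -> bool:
    """True if the resource satisfies every field and label term."""
    labels = resource.get("metadata", {}).get("labels", {})
    fields_ok = all(get_path(resource, k) == v
                    for k, v in parse_selector(field_selector).items())
    labels_ok = all(labels.get(k) == v
                    for k, v in parse_selector(label_selector).items())
    return fields_ok and labels_ok


def evaluate(operation: str, resources: list,
             field_selector: str = "", label_selector: str = "") -> bool:
    """Evaluate a 'present' or 'absent' check over already-fetched resources."""
    found = any(matches(r, field_selector, label_selector) for r in resources)
    if operation == "present":
        return found
    if operation == "absent":
        return not found
    raise ValueError(f"unsupported operation in this sketch: {operation!r}")
```

In the example manifest above, the field selector combines a resource name with `status.phase=Running`, so the check passes only while that specific resource reports a Running phase.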
promProbe allows users to run Prometheus queries and match the resulting output against specific conditions. The intent behind this probe is to allow users to define metrics-based SLOs in a declarative way and determine the experiment verdict based on its success. The probe runs the query on a Prometheus server defined by the endpoint, and checks whether the output satisfies the specified criteria.
The PromQL query can be provided in the query field. For complex queries that span multiple lines, the queryPath attribute can be used to provide the path to a file containing the query. This file can be made available in the experiment pod via a ConfigMap resource, with the ConfigMap being passed in the ChaosEngine OR the ChaosExperiment CR.
NOTE: query and queryPath are mutually exclusive.
    probe:
      - name: 'check-probe-success'
        type: 'promProbe'
        promProbe/inputs:
          endpoint: '<prometheus-endpoint>'
          query: '<promql-query>'
          comparator:
            criteria: '==' # supports >=, <=, >, <, ==, != comparison
            value: '<value-for-criteria-match>'
        mode: 'Edge'
        runProperties:
          probeTimeout: 5
          interval: 5
          retry: 1
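As a rough illustration of the comparison step, the Python sketch below takes the JSON body that the Prometheus HTTP API returns for an instant query and checks the value against the comparator. How Litmus itself obtains and reduces the query result is not described in this post, so using the first sample of a vector result is an assumption, and `first_sample_value` and `slo_met` are hypothetical names.

```python
import json
import operator

OPS = {
    ">=": operator.ge, "<=": operator.le, ">": operator.gt,
    "<": operator.lt, "==": operator.eq, "!=": operator.ne,
}


def first_sample_value(response_body: str) -> float:
    """Extract one numeric value from a Prometheus instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    data = payload["data"]
    if data["resultType"] == "scalar":
        # Scalar results look like [<timestamp>, "<value>"].
        return float(data["result"][1])
    if data["resultType"] == "vector":
        if not data["result"]:
            raise ValueError("query returned no samples")
        # Assumption: only the first series of the vector is considered.
        return float(data["result"][0]["value"][1])
    raise ValueError(f"unsupported result type: {data['resultType']!r}")


def slo_met(response_body: str, criteria: str, value: str) -> bool:
    """Apply the comparator's criteria to the queried value."""
    return OPS[criteria](first_sample_value(response_body), float(value))
```

A query returning an error ratio of 0.02 would, for instance, satisfy a comparator with criteria `<=` and value `0.05`.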
The Litmus chaos experiments run the probes defined in the ChaosEngine and update their stage-wise success in the ChaosResult custom resource, with details including the overall probeSuccessPercentage (the ratio of successful checks vs. total probes) and the failure step, where applicable. The success of a probe depends on whether the expected status/results are met, and also on whether it is successful in all the experiment phases defined by the probe’s execution mode. For example, probes that are executed in “Edge” mode need the checks to be successful in both the pre-chaos & post-chaos phases to be declared successful.
The pass criteria for the experiment is a logical AND function of all the probes defined in the ChaosEngine as well as inbuilt entry/exit criteria. Failure of either indicates a failed hypothesis and is deemed experiment failure. And an opportunity to fix the underlying problem!
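The verdict logic described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, each probe is represented as a dictionary of per-phase results, and the integer rounding of the percentage is an assumption.

```python
def probe_passed(phase_results: dict) -> bool:
    """A probe passes only if it succeeded in every phase it ran in."""
    return bool(phase_results) and all(phase_results.values())


def probe_success_percentage(probes: dict) -> int:
    """Share of passed probes as an integer percentage (rounding is assumed)."""
    if not probes:
        raise ValueError("at least one probe is required")
    passed = sum(probe_passed(results) for results in probes.values())
    return passed * 100 // len(probes)


def verdict(probes: dict, builtin_checks_ok: bool) -> str:
    """Logical AND of the in-built checks and every defined probe."""
    ok = builtin_checks_ok and all(probe_passed(r) for r in probes.values())
    return "Pass" if ok else "Fail"
```

Note that a single failed phase of a single probe is enough to turn the verdict to "Fail", even when the in-built pre-chaos and post-chaos checks succeed.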
Provided below is a ChaosResult snippet containing the probe status for a ChaosEngine with mixed probes.
    Name:         app-pod-delete
    Namespace:    test
    Labels:       name=app-pod-delete
    Annotations:  <none>
    API Version:  litmuschaos.io/v1alpha1
    Kind:         ChaosResult
    Metadata:
      Creation Timestamp:  2020-08-29T08:28:26Z
      Generation:          36
      Resource Version:    50239
      Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/test/chaosresults/app-pod-delete
      UID:                 b9e3638a-b7a4-4b93-bfea-bd143d91a5e8
    Spec:
      Engine:      probe
      Experiment:  pod-delete
    Status:
      Experimentstatus:
        Fail Step:                 N/A
        Phase:                     Completed
        Probe Success Percentage:  100
        Verdict:                   Pass
      Probe Status:
        Name:  check-frontend-access-url
        Status:
          Continuous:  Passed 👍
        Type:          HTTPProbe
        Name:          check-app-cluster-cr-status
        Status:
          Post Chaos:  Passed 👍  #EoT
        Type:          K8sProbe
        Name:          check-database-integrity
        Status:
          Post Chaos:  Passed 👍  #Edge
          Pre Chaos:   Passed 👍
        Type:          CmdProbe
    Events:
      Type    Reason   Age  From                     Message
      ----    ------   ---- ----                     -------
      Normal  Summary  7s   pod-delete-0s2jt6-s4rdx  pod-delete experiment has been Passed
The probes are an effective mechanism to burn hypotheses into chaos experiments and arm them with more meaning & context. In some ways, they address the “opacity” that exists today with respect to what constitutes an experiment pass or failure. Probes are also useful analytics aids that indicate resiliency over a period of time. For example, the probeSuccessPercentage for a given experiment (with a set of mandatory probes) against a specific application can be tracked over time to gauge the progress being made in feature development & deployment practices.
One of the questions we got from the community as we set out to build this feature was whether the probes are a replacement for application-specific chaos experiments. The answer is “Not really”. While the probes do enable reuse of the generic experiments and give them an app context, they are not intended to perform deep application-level verification, much less inject app-specific faults. Take the Kafka chaos experiments on the hub, for example. The focus there is on identifying a certain type of app replica to inject chaos on & using test downstream apps (to simulate the producer/consumer) for validation purposes. The probes can be useful in enhancing these experiments and therefore work more as aids than replacements.
Hope this feature helps you practice Chaos Engineering in an even better way. Do try it & let us know what you think.
Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community:
- #litmus channel in Kubernetes Slack
- Contribute to LitmusChaos and share your feedback on GitHub
- If you like LitmusChaos, become one of the many stargazers on GitHub