DEV Community

ChaosToolkit Basics

mikotian
not your usual QA guy
・4 min read

The aim of a resiliency experiment is always the same:

To observe a steady state
Do something that attempts to disrupt that steady state
Observe if we still have a steady state

These three steps are always constant in any resiliency experiment.

What changes are the ways we measure or cause these steps to happen.

For instance,

To see if a service is up, we can use several methods, such as health checks, monitoring checks or request checks.
To simulate service downtime, we can shut down either the machine or the specific process (each has its own use case).
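The three constant steps, with the measurement and the disruption swapped in as parameters, can be sketched in plain Python. This is only an illustration of the idea, not how Chaos Toolkit is implemented; the in-memory `replicas` "system" and every function name here are hypothetical.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    disrupt: Callable[[], None],
) -> bool:
    """Return True if the steady state survives the disruption."""
    # Step 1: observe a steady state; without one there is no baseline.
    if not steady_state():
        raise RuntimeError("steady state not met before the disruption")
    # Step 2: do something that attempts to disrupt the steady state.
    disrupt()
    # Step 3: observe whether we still have a steady state.
    return steady_state()


# Hypothetical system: two replicas, healthy while at least one is running.
replicas = {"a": True, "b": True}


def at_least_one_replica_up() -> bool:
    return any(replicas.values())


def stop_replica_a() -> None:
    replicas["a"] = False


survived = run_experiment(at_least_one_replica_up, stop_replica_a)
print(survived)
```

Swapping in a different check or a different disruption changes what is measured or caused, while the three steps stay the same.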

To run these experiments correctly when automated, and to keep the tests consistent with each other, we needed a tool that provides a uniform way of doing this.

We evaluated Gremlin and Chaos Toolkit for these experiments.

This document describes the working of Chaos Toolkit, keeping in mind what is listed above.
Chaos Toolkit is a declarative framework for Chaos Engineering experiments which can be extended for any system by using drivers. It has open source drivers for most major systems.

We will take a look at the declarative framework and its basics:

Each Experiment will have the following sections:

Description Section:

This is just metadata for the test. Version, title and description state the specifics of the experiment. The tags describe what the system under test is composed of, or how the test is categorized. There can be multiple tags, but only a single version, title and description.

[Image: Description Section]
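As a rough sketch of what such a section can contain, the snippet below builds the metadata as a Python dictionary and prints it as JSON. The title, description and tag values are made up for illustration; only the four key names come from the experiment format described here.

```python
import json

# Hypothetical metadata values, for illustration only.
description_section = {
    "version": "1.0.0",
    "title": "Service survives the loss of one pod",
    "description": "The service should keep responding when one pod is terminated.",
    "tags": ["kubernetes", "pod"],  # several tags are allowed
}

print(json.dumps(description_section, indent=4))
```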

Steady State Hypothesis Section:

This is the heart of the experiment. This section is executed twice: once when the experiment starts, and again after the "method" section has run. The first run ensures that the steady state hypothesis is met before we take any action. If it is not, the experiment fails. The second run checks whether the steady state still holds after the disruption.

A state or hypothesis is verified by using a probe. ChaosToolkit provides a set of probes that can be used to check on services or URLs. Each probe has a provider, which specifies the type of the probe and its arguments, if applicable. Each probe is different and may take several arguments or none at all.

In the example below, there are two probes.

The first one is a Python function that returns a boolean. The returned value is checked against the "tolerance" parameter.

The second one is a built-in HTTP probe that checks a URL and verifies that the response code is 200.

Writing a good steady state hypothesis is essential for making your resiliency tests repeatable and accurate. Too simple and you risk missing issues. Too complex and you risk false negatives.

[Image: Steady State Hypothesis]
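A hypothesis with two probes of the kinds described above might look roughly like the sketch below, written as a Python dictionary and printed as JSON. This is not the author's original example: the module `my_probes`, the function `service_is_healthy` and the probe names are hypothetical, and the URL is borrowed from the full experiment at the end of this post.

```python
import json

steady_state_hypothesis = {
    "title": "Service is healthy",
    "probes": [
        {
            # A Python function probe: its return value must equal the tolerance.
            "type": "probe",
            "name": "service-is-healthy",
            "tolerance": True,
            "provider": {
                "type": "python",
                "module": "my_probes",         # hypothetical module
                "func": "service_is_healthy",  # hypothetical function returning a bool
            },
        },
        {
            # An HTTP probe: the response status code must equal the tolerance.
            "type": "probe",
            "name": "healthcheck-responds",
            "tolerance": 200,
            "provider": {
                "type": "http",
                "url": "http://localhost:8080/healthcheck",
            },
        },
    ],
}

print(json.dumps(steady_state_hypothesis, indent=4))
```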

"Action" Section:

This section, which appears under the "method" key in the experiment file, is where we perform our "disruptions".

Depending on the drivers, we can have many types of disruptions, for example:

The AWS driver can shut down specific instances, random machines in availability zones, etc.

Kubernetes drivers have the ability to kill random pods.

In the example below, the action terminates a randomly chosen pod whose name matches a given pattern. After the action is performed, we can pause execution for a given number of seconds to let the system stabilize, and then proceed to probe for our expected values. In this case we probe by referencing a probe that was already defined earlier in the experiment.

[Image: Action Section]
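A method along these lines can be sketched as follows, again as a Python structure printed as JSON. It is not the author's original example. It assumes the Kubernetes driver's `terminate_pods` action in `chaosk8s.pod.actions` with `name_pattern`, `rand` and `ns` arguments, and the `ref` form for reusing a probe by name; the pod name pattern, namespace and referenced probe name are hypothetical. Check the driver documentation for the exact arguments of your installed version.

```python
import json

method = [
    {
        "type": "action",
        "name": "terminate-one-pod",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",  # assumed Kubernetes driver module
            "func": "terminate_pods",
            "arguments": {
                "name_pattern": "my-service-[0-9]$",  # hypothetical pod name pattern
                "rand": True,     # pick one matching pod at random
                "ns": "default",  # hypothetical namespace
            },
        },
        # Wait 30 seconds after the action so the system can stabilize.
        "pauses": {"after": 30},
    },
    # Reuse a probe defined earlier in the experiment, by name (hypothetical name).
    {"ref": "healthcheck-responds"},
]

print(json.dumps(method, indent=4))
```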

We can see the other probes defined as well. Just as there can be multiple probes, there can be multiple actions within the "method" section.

It might be better to keep the action itself brief and limited so that we can objectively identify what that action does to our system. If there are multiple actions, determining what caused the failure might be difficult.

This is not always the right approach, though, as many issues only show up when multiple actions are performed simultaneously. In short, YMMV.

Rollback Section:

This section is to undo any permanent change to the system we may have caused during the course of the experiment. For example, we could start an instance that we shut down in the experiment.
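For the instance-stopping experiment shown at the end of this post, a rollback could look roughly like the sketch below. It assumes the AWS driver exposes a `start_instances` action in `chaosaws.ec2.actions` that accepts an `instance_ids` list, and that the section sits under a "rollbacks" key; verify both against the driver version you use. The instance ID is the one stopped in the experiment.

```python
import json

rollbacks = [
    {
        "type": "action",
        "name": "start-the-stopped-instance",  # hypothetical name
        "provider": {
            "type": "python",
            "module": "chaosaws.ec2.actions",
            "func": "start_instances",  # assumed counterpart of stop_instance
            "arguments": {
                # Same instance that the experiment's action stops.
                "instance_ids": ["i-0e1f0c1d97589b5e9"],
            },
        },
    }
]

# In the experiment file, this list would sit under the "rollbacks" key.
print(json.dumps({"rollbacks": rollbacks}, indent=4))
```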

Overall, an experiment descriptor file looks like this:

Simple ChaosToolkit Experiment
{
    "version": "1.0.0",
    "title": "Stop an instance",
    "description": "it should stop",
    "tags": [
        "stop"
    ],
    "steady-state-hypothesis": {
        "title": "Application responds",
        "probes": [
            {
                "name": "count-instances",
                "type": "probe",
                "tolerance": 1,
                "provider": {
                    "func": "count_instances",
                    "type": "python",
                    "arguments": {
                        "filters": []
                    },
                    "module": "chaosaws.ec2.probes"
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "stop-an-ec2-instance",
            "provider": {
                "type": "python",
                "module": "chaosaws.ec2.actions",
                "func": "stop_instance",
                "arguments": {
                    "instance_id": "i-0e1f0c1d97589b5e9"
                }
            },
            "pauses": {
                "after": 30
            }
        },
        {
            "name": "count-instances",
            "type": "probe",
            "tolerance": 0,
            "provider": {
                "func": "count_instances",
                "type": "python",
                "arguments": {
                    "filters": []
                },
                "module": "chaosaws.ec2.probes"
            }
        },
        {
            "type": "probe",
            "name": "healthcheck-service-must-still-respond",
            "provider": {
                "type": "http",
                "url": "http://localhost:8080/healthcheck"
            }
        }
    ]
}