Shardul Srivastava for AWS Community Builders

Posted on Aug 9, 2021 • Edited on Aug 12, 2021 • Originally published at shardul.dev

Cloud Native Chaos Engineering with Chaos Mesh

#kubernetes #chaosengineering #chaosmesh #awscommunitybuilder

In the cloud, distributed architectures have grown even more complex, and with that complexity comes uncertainty about how a system could fail.

Chaos Engineering aims to test system resiliency by injecting faults to identify weaknesses before they cause massive outages. Typical weaknesses include improper fallback settings for a service, cascading failures due to a single point of failure, and retry storms due to misconfigured timeouts.

History

Chaos Engineering started at Netflix back in 2010, when the company moved from on-prem servers to AWS and needed a way to test the resiliency of its new infrastructure.

In 2012, Netflix open-sourced Chaos Monkey under the Apache 2.0 license. It randomly terminates instances to ensure that services are resilient to instance failures.

Cloud Native Chaos Engineering in CNCF Landscape

CNCF focuses on Cloud Native Chaos Engineering, defined as engineering practices focused on (and built on) Kubernetes environments, applications, microservices, and infrastructure.

Cloud Native Chaos Engineering has 4 core principles:

Open source
CRDs for Chaos Management
Extensible and pluggable
Broad Community adoption

CNCF has two sandbox projects for Cloud Native Chaos Engineering

Chaos Mesh

Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. It is based on the Kubernetes Operator pattern and provides a Chaos Operator to inject faults into applications and Kubernetes infrastructure in a manageable way.

Chaos Operator uses Custom Resource Definitions (CRDs) to define chaos objects. It provides a variety of these CRDs for fault injection, such as NetworkChaos and HTTPChaos, both of which are used later in this post.

Chaos Mesh Installation

Chaos Mesh can be installed quickly using the installation script. However, the Helm 3 chart is recommended for production environments.

To install Chaos Mesh using Helm:

Add the Chaos Mesh repository to your Helm repositories:

   helm repo add chaos-mesh https://charts.chaos-mesh.org
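After adding the repository, it can help to refresh the local chart index and list the chart versions on offer, so you can confirm the version you intend to pin. A minimal sketch, assuming the repository alias chaos-mesh from the command above (the function name refresh_chaos_mesh_repo is just an illustrative helper):

```shell
# Sketch: refresh the local Helm index, then list available Chaos Mesh chart versions.
# Assumption: the repo was added under the alias "chaos-mesh".
refresh_chaos_mesh_repo() {
  helm repo update
  helm search repo chaos-mesh/chaos-mesh --versions
}
```

Call it as refresh_chaos_mesh_repo before running the install command.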

It's recommended to install Chaos Mesh in a separate namespace. You can either create the chaos-testing namespace manually or let Helm create it automatically if it doesn't exist:

   helm upgrade \
        --install \
        chaos-mesh \
        chaos-mesh/chaos-mesh \
        -n chaos-testing \
        --create-namespace \
        --version v2.0.0 \
        --wait

Note: If you're using GKE or EKS with containerd, set the chaos daemon's runtime and socket path accordingly:

   helm upgrade \
        --install \
        chaos-mesh \
        chaos-mesh/chaos-mesh \
        -n chaos-testing \
        --create-namespace \
        --set chaosDaemon.runtime=containerd \
        --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
        --version v2.0.0 \
        --wait

Verify that the pods are running:

   kubectl get pods -n chaos-testing
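Instead of polling the pod list by hand, you can block until the pods report Ready. The sketch below wraps kubectl wait in a small helper; the function name wait_for_chaos_mesh and the 120s default timeout are assumptions for illustration, not part of Chaos Mesh:

```shell
# Sketch: block until every pod in the Chaos Mesh namespace is Ready.
# Arguments (both optional): namespace (default chaos-testing), timeout (default 120s).
wait_for_chaos_mesh() {
  ns="${1:-chaos-testing}"
  timeout="${2:-120s}"
  kubectl wait --for=condition=Ready pod --all -n "$ns" --timeout="$timeout"
}
```

Running wait_for_chaos_mesh with no arguments waits on the chaos-testing namespace and returns a non-zero status if the pods are not Ready in time.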

Run First Chaos Mesh Experiment

A Chaos Experiment describes what type of fault is injected and how.

Set up an Nginx pod and expose it on port 80:

  kubectl run nginx --image=nginx --labels="app=nginx" --port=80

Get the IP of the nginx pod:

  kubectl get pods nginx -ojsonpath="{.status.podIP}"

Open another terminal and set up a test pod to check connectivity to the nginx pod:

  kubectl run -it test-connection --image=radial/busyboxplus:curl -- sh
  ping <IP of the Nginx Pod> -c 2

This should show the round-trip time of each ping to the pod IP.

Create your first Chaos Experiment by running :

  kubectl apply -f - <<EOF
  apiVersion: chaos-mesh.org/v1alpha1
  kind: NetworkChaos
  metadata:
    name: nginx-network-delay
  spec:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'nginx'
    delay:
      latency: '1s'
    duration: '60s'
  EOF

This creates a custom resource of kind NetworkChaos that introduces a latency of 1 second in the network of pods labeled app: nginx (i.e. the nginx pod) for the next 60 seconds.

Ping the nginx pod again from the test pod to see the 1 second delay.
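To check that the experiment was accepted and see what Chaos Mesh recorded for it, you can describe the resource. A minimal sketch, assuming the experiment lives in the default namespace as in the manifest above (describe_chaos is a hypothetical helper name):

```shell
# Sketch: show the spec, status, and events of a chaos experiment.
# Arguments: kind, name, namespace (optional, default "default").
describe_chaos() {
  kind="$1"
  name="$2"
  ns="${3:-default}"
  kubectl describe "$kind" "$name" -n "$ns"
}
```

For the experiment above, that would be describe_chaos networkchaos nginx-network-delay.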

Run HTTPChaos Experiment

HTTPChaos allows you to inject faults into the requests and responses of an HTTP server. It supports the abort, delay, replace, and patch fault types.

Note: Before proceeding, delete the NetworkChaos experiment created earlier.
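Deleting the experiment is an ordinary kubectl delete on the custom resource. A minimal sketch, assuming the NetworkChaos object was created in the default namespace under the name used earlier (delete_chaos is a hypothetical helper name):

```shell
# Sketch: delete a chaos experiment so its fault injection stops.
# Arguments: kind, name, namespace (optional, default "default").
delete_chaos() {
  kind="$1"
  name="$2"
  ns="${3:-default}"
  kubectl delete "$kind" "$name" -n "$ns"
}
```

Here that means delete_chaos networkchaos nginx-network-delay.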

Check the response time of the nginx pod:

   kubectl exec -it test-connection -- sh
   time curl <IP of the Nginx Pod>

Create an HTTPChaos experiment by running:

   kubectl apply -f - <<EOF
   apiVersion: chaos-mesh.org/v1alpha1
   kind: HTTPChaos
   metadata:
     name: nginx-http-delay
   spec:
     mode: all
     selector:
       labelSelectors:
         app: nginx
     target: Request
     port: 80
     delay: 1s
     method: GET
     path: /
     duration: 5m
   EOF

This creates a custom resource of kind HTTPChaos that adds a latency of 1 second to GET requests sent to pods labeled app: nginx (i.e. the nginx pod) on port 80 for the next 5 minutes.

Note: If you get an error like admission webhook "vauth.kb.io" denied the request, this is a known problem as of version 2.0 (open issue 2187). A temporary fix is to delete the validating webhook:

   kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validate-auth

Test the response time of the nginx pod again:

   time curl <IP of the Nginx Pod>

You will see the additional 1 second of latency in the response.
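A single time curl reading can be noisy, so it may be worth taking a few samples. The sketch below relies on curl's -w '%{time_total}' write-out variable to print only the total request time in seconds; the helper name sample_latency and the default of 3 samples are assumptions for illustration. It is meant to be pasted into the shell of the test-connection pod:

```shell
# Sketch: print the total time (in seconds) of several GET requests, one per line.
# Arguments: URL, number of samples (optional, default 3).
sample_latency() {
  url="$1"
  count="${2:-3}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # -s silences progress output, -o /dev/null discards the body,
    # -w prints curl's measured total time for the request.
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done
}
```

For example, sample_latency http://<IP of the Nginx Pod> 5 prints five timings; while the HTTPChaos experiment is active, each should be roughly 1 second higher than the baseline.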
