With the cloud, distributed architectures have grown even more complex, and with that complexity comes uncertainty about how a system could fail.
Chaos Engineering tests system resiliency by injecting faults to identify weaknesses before they cause massive outages. Typical weaknesses include improper fallback settings for a service, cascading failures due to a single point of failure, and retry storms due to misconfigured timeouts.
History
Chaos Engineering started at Netflix in 2010, when the company moved from on-prem servers to AWS, as a way to test the resiliency of its infrastructure.
In 2012, Netflix open-sourced Chaos Monkey under the Apache 2.0 license. It randomly terminates instances to ensure that services are resilient to instance failures.
Cloud Native Chaos Engineering in CNCF Landscape
CNCF focuses on Cloud Native Chaos Engineering, defined as engineering practices focused on (and built on) Kubernetes environments, applications, microservices, and infrastructure.
Cloud Native Chaos Engineering has 4 core principles:
- Open source
- CRDs for Chaos Management
- Extensible and pluggable
- Broad Community adoption
CNCF has two sandbox projects for Cloud Native Chaos Engineering
Chaos Mesh
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. It is based on the Kubernetes Operator pattern and provides a Chaos Operator to inject faults into applications and Kubernetes infrastructure in a manageable way.
The Chaos Operator uses Custom Resource Definitions (CRDs) to define chaos objects. It provides a variety of these CRDs for fault injection, such as:
- PodChaos
- NetworkChaos
- DNSChaos
- HTTPChaos
- StressChaos
- IOChaos
- TimeChaos
- KernelChaos
- AWSChaos
- GCPChaos
- JVMChaos
Chaos Mesh Installation
Chaos Mesh can be installed quickly using the installation script. However, the Helm 3 chart is recommended for production environments.
To install Chaos Mesh using Helm:
- Add the Chaos Mesh repository to the Helm repository.
helm repo add chaos-mesh https://charts.chaos-mesh.org
- It's recommended to install Chaos Mesh in a separate namespace, so you can either create the chaos-testing namespace manually or let Helm create it automatically if it doesn't exist:
helm upgrade \
--install \
chaos-mesh \
chaos-mesh/chaos-mesh \
-n chaos-testing \
--create-namespace \
--version v2.0.0 \
--wait
Note: If you're using GKE or EKS with containerd, then use:
helm upgrade \
--install \
chaos-mesh \
chaos-mesh/chaos-mesh \
-n chaos-testing \
--create-namespace \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--version v2.0.0 \
--wait
- Verify that the pods are running:
kubectl get pods -n chaos-testing
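If you'd rather script this check than eyeball the pod list, a small helper along these lines might work. It is only a sketch: the function name verify_chaos_mesh and the 120s timeout are my own choices, not something Chaos Mesh ships. It waits for every pod in the namespace to become Ready and then lists the chaos kinds registered under the chaos-mesh.org API group.

```shell
# Sketch: check that a Chaos Mesh install looks healthy.
# verify_chaos_mesh is a hypothetical helper name; the namespace
# defaults to chaos-testing, the one used in the Helm command above.
verify_chaos_mesh() {
  # Block until every pod in the namespace is Ready, or give up after 120s.
  kubectl wait --for=condition=Ready pods --all \
    -n "${1:-chaos-testing}" --timeout=120s || return 1
  # List the resource kinds registered under the chaos-mesh.org API group.
  kubectl api-resources --api-group=chaos-mesh.org -o name
}
```

Call it as verify_chaos_mesh, or pass another namespace as the first argument if you installed the chart somewhere else.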
Run First Chaos Mesh Experiment
A Chaos Experiment describes what type of fault is injected and how.
- Set up an Nginx pod and expose it on port 80:
kubectl run nginx --image=nginx --labels="app=nginx" --port=80
- Get the IP of the nginx pod
kubectl get pods nginx -ojsonpath="{.status.podIP}"
- Open another terminal and set up a test pod to test connectivity to the nginx pod:
kubectl run -it test-connection --image=radial/busyboxplus:curl -- sh
ping <IP of the Nginx Pod> -c 2
This should show you the round-trip time of each ping to the pod IP.
- Create your first Chaos Experiment by running :
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nginx-network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'nginx'
  delay:
    latency: '1s'
  duration: '60s'
EOF
This creates a NetworkChaos resource (an instance of the NetworkChaos CRD) that introduces a latency of 1 second in the network of pods labeled app: nginx, i.e. the nginx pod, for the next 60 seconds.
- Ping the nginx pod again from the test pod to see the 1 second delay.
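If the delay does not show up, it can help to look at the experiment object itself. The sketch below wraps kubectl describe in a tiny helper; the name inspect_experiment is hypothetical, and it assumes the experiment was created in your current namespace, as in the kubectl apply above.

```shell
# Sketch: show the spec, status, and recent events of a chaos experiment.
# inspect_experiment is a hypothetical helper name.
# $1 is the resource kind (e.g. networkchaos), $2 is the experiment name.
inspect_experiment() {
  kubectl describe "$1" "$2"
}
```

For the experiment above, that would be inspect_experiment networkchaos nginx-network-delay.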
Run HTTPChaos Experiment
HTTPChaos allows you to inject faults into the requests and responses of an HTTP server. It supports the abort, delay, replace, and patch fault types.
Note: Before proceeding, delete the NetworkChaos experiment created earlier.
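One way to do that cleanup is sketched below. The helper name delete_experiment is hypothetical; the --ignore-not-found flag simply keeps the command from failing if the object has already been removed.

```shell
# Sketch: delete a chaos experiment by kind and name.
# delete_experiment is a hypothetical helper name.
# $1 is the resource kind (e.g. networkchaos), $2 is the experiment name.
delete_experiment() {
  # --ignore-not-found makes the call succeed even if the object is gone.
  kubectl delete "$1" "$2" --ignore-not-found
}
```

For the earlier experiment, run delete_experiment networkchaos nginx-network-delay.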
- Check the response time of nginx pod :
kubectl exec -it test-connection -- sh
time curl <IP of the Nginx Pod>
- Create an HTTPChaos experiment by running:
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: nginx-http-delay
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  target: Request
  port: 80
  delay: 1s
  method: GET
  path: /
  duration: 5m
EOF
This creates an HTTPChaos resource that introduces a latency of 1 second to GET requests sent to pods labeled app: nginx, i.e. the nginx pod, on port 80 for the next 5 minutes.
Note: If you get an error like admission webhook "vauth.kb.io" denied the request, this is an open issue (2187) as of version 2.0. A temporary fix is to delete the validating webhook:
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validate-auth
- Test the response time of nginx pod :
time curl <IP of the Nginx Pod>
You will see the additional 1 second latency in the response.
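If the output of time is hard to compare between runs, curl can report the request duration by itself. The following is a minimal sketch meant to be pasted into the shell of the test-connection pod; the function name request_time is hypothetical.

```shell
# Sketch: print the total time, in seconds, that one GET request takes.
# request_time is a hypothetical helper name; $1 is the URL to fetch.
request_time() {
  # -s hides the progress meter, -o /dev/null discards the body,
  # -w prints curl's own time_total measurement after the transfer.
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}
```

Run request_time http://<IP of the Nginx Pod> before and after applying the experiment and compare the two numbers.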