DEV Community

Lakshya Gupta


Introduction To Cloud Native Chaos Engineering

In this blog, I will share what I learned from a live stream about chaos engineering, the link to which can be found [here].

The problems before Chaos Engineering

Before Chaos Engineering was adopted, there was no systematic way to tackle unpredictable problems. For example, high traffic on an app could cause a system outage. Such an outage, even a brief one, could cost a company millions and damage its reputation. This is where it became clear that conventional, streamlined testing is not enough.

There are many services running in the system and the interaction between them can be unpredictable. These interactions often result in downtime of the system.

What is Chaos Engineering?

Chaos engineering can be thought of as running experiments on a system by exposing it to real-life scenarios to see whether it can withstand unexpected disruptions. These include scenarios such as high traffic or an outage in part of the system. By running these experiments, we try to find the system's weaknesses and make it more resilient. Any event capable of disrupting the steady state is a potential variable in a chaos experiment.

There are four steps in Chaos Engineering:

  • Steady-state
  • Hypothesis
  • Experiment
  • Adapt

More details on these can be found on
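
As a rough illustration of how the four steps fit together, here is a minimal Python sketch. It is not how Litmus implements experiments; the `ChaosExperiment` class, the toy `replicas` dictionary, and the `kill_one_replica` fault are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """One chaos experiment, organised around the four steps (hypothetical class)."""
    name: str
    steady_state: Callable[[], bool]   # 1. Steady-state: is the system healthy?
    hypothesis: str                    # 2. Hypothesis: what we expect to keep holding
    inject_fault: Callable[[], None]   # 3. Experiment: the disruption to apply

    def run(self) -> str:
        # Never inject a fault into a system that is already unhealthy.
        if not self.steady_state():
            return "aborted: system was not in steady state before the fault"
        self.inject_fault()
        if self.steady_state():
            return "hypothesis held: " + self.hypothesis
        # 4. Adapt: a failed hypothesis points to a weakness to fix.
        return "hypothesis failed: adapt the system and rerun"


# Toy stand-in for a real system: a service with some running replicas.
replicas = {"running": 3}


def kill_one_replica() -> None:
    """Simulated fault: one replica disappears."""
    replicas["running"] -= 1


experiment = ChaosExperiment(
    name="replica-loss",
    steady_state=lambda: replicas["running"] >= 2,
    hypothesis="the service stays available after losing one replica",
    inject_fault=kill_one_replica,
)
result = experiment.run()
print(result)
```

In a real setup the steady-state check would query monitoring data and the fault would be injected by a chaos tool, but the shape of the experiment stays the same.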

What is resilience?

Resilience is a system's ability to keep working even when a fault occurs. The resilience of a system can be challenged in many ways, such as when services become unhealthy, when a node in a Kubernetes cluster goes into the NotReady state, or when there is a memory leak in the system. Chaos engineering addresses the problem of systems that do not stay resilient under such conditions.

How to build resilience

  1. Identify the steady-state conditions, i.e. the state we ideally want the system to be in.
  2. Introduce a fault related to what you want to test.
  3. Check whether the system regains its steady state after the fault is introduced.
  4. If yes, the system is resilient to that fault.
  5. If no, fix the weakness and introduce the fault again.
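
The loop described in these steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `build_resilience` and `ToyService` are hypothetical names invented here, and the "fix" is reduced to adding a replica.

```python
from typing import Callable


def build_resilience(
    is_steady: Callable[[], bool],
    inject_fault: Callable[[], None],
    fix_weakness: Callable[[], None],
    max_rounds: int = 5,
) -> int:
    """Inject a fault repeatedly until the system survives it.

    Returns the number of rounds needed. fix_weakness is expected to
    also bring the system back to its steady state before the next round.
    """
    if not is_steady():
        raise RuntimeError("system is not in steady state; do not inject faults")
    for round_number in range(1, max_rounds + 1):
        inject_fault()
        if is_steady():
            return round_number  # the system is resilient to this fault
        fix_weakness()           # work on the weakness, then try again
    raise RuntimeError("system is still not resilient after max_rounds")


class ToyService:
    """Hypothetical service that is steady while at least one replica is healthy."""

    def __init__(self) -> None:
        self.replicas = 1
        self.healthy = 1

    def is_steady(self) -> bool:
        return self.healthy >= 1

    def kill_one(self) -> None:
        self.healthy -= 1

    def add_replica_and_restore(self) -> None:
        self.replicas += 1
        self.healthy = self.replicas


service = ToyService()
rounds = build_resilience(
    service.is_steady, service.kill_one, service.add_replica_and_restore
)
print(rounds)
```

Here the single-replica service fails the first round, gets a second replica as the fix, and survives the same fault in the next round.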

How are Litmus and Chaos Engineering introduced in the DevOps space?

The general idea is that we should focus on operations as well, not just the development side. Instead of writing tests for each microservice individually, we can automate the entire process using Litmus, which makes the workflow much smoother.

Why Kubernetes?

  • Kubernetes is the de facto standard in the industry.
  • Kubernetes is highly scalable.
  • Kubernetes provides high availability.

Cloud Native Chaos Engineering

Cloud native refers to building and running applications as microservices on cloud infrastructure. Some principles of Cloud Native Chaos Engineering are:

  • Projects are mainly open source
  • Community support
  • Open observability
  • Open API
  • GitOps


  • Run services without an outage
  • Run services to meet the business SLAs and SLOs
  • Scale your services on demand
  • Upgrade your services without an outage
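
To make the SLA/SLO point a little more concrete, an availability target can be turned into a downtime budget. The sketch below is a simple back-of-the-envelope calculation; the function names and the 99.9% over 30 days figures are illustrative assumptions, not values from the live stream.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime budget (in minutes) implied by an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def meets_slo(downtime_minutes: float, slo_percent: float, window_days: int = 30) -> bool:
    """True if the observed downtime fits inside the budget."""
    return downtime_minutes <= allowed_downtime_minutes(slo_percent, window_days)


budget = allowed_downtime_minutes(99.9)
print(round(budget, 1))        # about 43.2 minutes per 30 days
print(meets_slo(30, 99.9))     # 30 minutes of downtime fits the budget
print(meets_slo(60, 99.9))     # 60 minutes does not
```

A budget this small is one reason to find weaknesses with controlled chaos experiments before they show up as real outages.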

Some advantages of using Litmus:

  • Litmus is cross-cloud
  • It takes a cloud-native approach to create, manage and monitor chaos
  • It is a complete framework to implement Chaos Engineering within a cloud-native ecosystem
  • It helps both developers and SREs automate chaos experiments.

Useful Resources

Beginner friendly issues to contribute:

Blogs and videos to learn more about Litmus:

You can join the Slack community too! That is where you can ask questions and learn more about chaos and contributing to Litmus. To join, follow these steps:
Step 1: Join the Kubernetes slack using the following link:

Step 2: Join the #litmus channel on the Kubernetes slack or use this link after joining the Kubernetes slack:

Thank you! :)
