Maxime Guilbert

Chaos Engineering - A world of destruction?

When we talk about Chaos Engineering without really knowing what it is, it can be scary.

We heard that Netflix is killing nodes in their prod clusters!

That sentence is generally what people retain from a short talk about chaos engineering. And for sure (especially for managers and less technical people) it's something to be afraid of.


So today, we will start a discussion about chaos engineering: what it is, how we can implement it without killing clusters, and some tools which will help you do that.

This will be a compact introduction to Chaos Engineering. If you want more information or a dedicated post on a particular subject covered here, please let me know in the comments.


Origins

As we said previously, Netflix is where it all started: they tested their infrastructure and its resiliency by killing instances.

Here is the project link and its documentation.
Github project
Online documentation

From this, chaos engineering was born.


Chaos Engineering

As defined in the Principles of Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

With this idea, we keep Netflix's mindset of "disturbing live instances to see what happens and become resilient", but without directly killing live instances.

The objective is to inject some issues at random moments to see if everything continues to work, like:

- adding latency

Add a little bit of latency to see if everything still works. Or add enough latency to exceed your timeout and test your retry mechanism (if you have one).

- returning HTTP code 500 from APIs

It's another way to test your retry mechanism, and to check that an occasional false-positive error is absorbed by a retry.

- killing pods

Assuming you always have at least 2 pods in prod, see what happens if a pod is killed for whatever reason. Is the traffic correctly load balanced? Does the new instance start quickly enough? Should you have more instances?

So it's far from killing cluster nodes, but it's a first step to add perturbations to your system and see if it's resilient.
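
To make the pod-killing case more concrete, here is a minimal sketch, assuming your application runs as a Kubernetes Deployment. The name `my-api`, the image, the port and the `/healthz` path are hypothetical placeholders. It keeps 2 replicas and declares a readiness probe, so a replacement pod only receives traffic once it answers.

```yaml
# Sketch only: names, image, port and probe path are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2                 # at least 2 pods, so one can be killed safely
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:     # traffic is sent only once the pod is ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 5
```

With something like this in place, killing one pod lets you observe whether the remaining one handles the traffic and how long the replacement takes to become ready.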

In prod or not in prod?

For sure, if you want to see what happens in real cases, and if you want to be sure to have something that fits your customers, doing it in prod is the best.

But not every system, team or company is ready to do it in production. So don't worry, you can do it outside production, but you MUST be able to simulate real traffic.

You must have load testing tools ready with real user scenarios. Load tests that only run GET requests always retrieving the same data (or that only ever add new elements) can help find some issues, but real scenarios can put your system in particular situations you never thought about.


How to test it

In this last part, we will see some tools which can help you do it automatically.

The first one is maybe a tool you already use: Istio.

More generally, check your service mesh documentation to see if it has some tools for this.

Istio

Among its features, Istio has Fault Injection. With it, you can add a delay to an API call, for all calls or only for a percentage of them:

  - fault:
      delay:
        fixedDelay: 7s
        percentage:
          value: 100

Or you can directly return a specific HTTP status:

  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 100

So if you already have the tool, it's a quick configuration to add in a VirtualService.
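
For context, here is a sketch of a complete VirtualService where such a fragment could live. The `ratings` host and resource name are placeholders I chose for the example, and the `apiVersion` may differ depending on your Istio version. It delays half of the calls by 7 seconds, then routes them normally.

```yaml
# Sketch only: "ratings" is a hypothetical service name.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
    - ratings
  http:
    - fault:
        delay:
          fixedDelay: 7s      # should be longer than the caller's timeout to be useful
          percentage:
            value: 50         # only half of the requests are delayed
      route:
        - destination:
            host: ratings
```

A 7s delay is only interesting if it is longer than the timeout of the calling service, so adapt the value to your own settings.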

Chaos Mesh

Chaos Mesh is a tool dedicated to chaos engineering, and it is a CNCF project.

Designed for Kubernetes and physical nodes, it will help you simulate a bunch of faults like:

  • pod faults
  • network faults
  • stress scenarios
  • DNS faults
  • JVM application faults
  • cloud communication faults
  • ...

With this you will be able to check if your Kubernetes cluster is resilient enough.
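
As an illustration, here is what two small experiments could look like, based on the PodChaos and NetworkChaos resources of Chaos Mesh. The resource names, the `default` namespace and the `app: my-api` label are hypothetical, so adapt the selectors to your own workloads and check the documentation of the version you run.

```yaml
# Sketch only: names, namespace and labels are hypothetical.
# 1) Kill one pod matching the selector.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one                   # pick a single pod among the matching ones
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-api
---
# 2) Add network latency to one matching pod for 30 seconds.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-api
  delay:
    latency: '200ms'
  duration: '30s'
```

Once applied with `kubectl apply -f`, the first experiment covers the "killing pods" case and the second one the "adding latency" case described earlier.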

But you may use other tools around your Kubernetes cluster: databases, queue services...

Litmus Chaos

And that's where Litmus Chaos comes in. Also a CNCF project, it is, for me, a better fit if you have more than just a Kubernetes cluster.

Like the other projects, it has a lot of features to test a Kubernetes or physical cluster. But it has more features to test the things around it: simulating a full disk, scenarios for other tools like Kafka, Cassandra or CoreDNS, and also for cloud providers (AWS, Azure & GCP)!

Displayed like a marketplace, the LitmusChaos Hub will show you everything that can be done.
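
To give an idea of what running an experiment looks like, here is a sketch of a ChaosEngine for the `pod-delete` experiment. It assumes the experiment is already installed in the namespace and that a service account with the required permissions exists. The `nginx` application and the `pod-delete-sa` account are hypothetical names, and fields can change between Litmus versions.

```yaml
# Sketch only: application, namespace and service account are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'     # which application to target
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds
              value: '30'
            - name: CHAOS_INTERVAL         # seconds between two pod deletions
              value: '10'
            - name: FORCE
              value: 'false'
```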


One important thing: those 3 tools are open source projects. So if you want to add new configurations or specific scenarios which are not implemented yet, or open an issue to ask the community, don't hesitate!


As I said earlier, I didn't go through all the tools or how they work, but if there are some elements you are curious about or want more information on, please leave a comment to let me know.

I hope it will help you! 🍺
