Ogonna Nnamani

The Role Of Chaos Engineering in Building Anti-Fragile Systems

Intro
Welcome back to the Antifragile series, guys!
In this post we will discuss the role of chaos engineering in designing antifragile systems.

First things first:

What is Chaos Engineering?

Chaos engineering is the practice of introducing controlled chaos into a system. It involves deliberately injecting failures and unexpected events into a system to see how it responds. The goal is to uncover weaknesses and vulnerabilities before they cause major issues in real-world scenarios.
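To make "deliberately injecting failures" a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration (`with_chaos`, `call_with_retry`, `run_experiment`, the 30% failure rate): it wraps a dependency so that some calls fail on purpose, then checks how well a simple retry policy copes.

```python
import random


class InjectedFailure(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected by chaos experiment")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """The behaviour under test: retry a flaky call up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFailure:
            if attempt == attempts:
                raise


def run_experiment(calls=1000, failure_rate=0.3, attempts=3, seed=42):
    """Return how many calls still succeeded despite the injected failures."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_lookup = with_chaos(lambda: "ok", failure_rate, rng)
    succeeded = 0
    for _ in range(calls):
        try:
            call_with_retry(flaky_lookup, attempts)
            succeeded += 1
        except InjectedFailure:
            pass
    return succeeded


if __name__ == "__main__":
    survived = run_experiment()
    print(f"{survived} of 1000 calls survived a 30% injected failure rate")
```

Real chaos tools inject failures at the infrastructure level rather than inside your code, but the loop is the same: break something on purpose, observe, and fix what you find.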

A system that recovers from failure almost immediately is resilient, and deliberately exercising that recovery so the system keeps getting stronger is what antifragility is really about.

Who needs Chaos Engineering?

Implementing chaos engineering in an architecture takes a lot of planning, because nobody wants to build something only to destroy it. Certain use cases and industries call for this method of systems design, such as:

  • Tech Companies
    Especially those providing online services, cloud computing, or software as a service (SaaS) platforms.

  • Financial Services
    Banks, stock exchanges, payment processors, and other financial institutions rely on highly available and secure systems to process transactions.

  • Healthcare
    With the increasing digitization of medical records and telemedicine, healthcare organizations need reliable systems to provide critical services to patients.

  • Energy and Utilities
    Power plants, oil refineries, and utility companies use complex systems for monitoring and managing infrastructure.

These industries require constant uptime, and design methods like chaos engineering can be used to test for resiliency.

Tools used for Chaos Engineering
Chaos Mesh
An open-source chaos engineering platform for Kubernetes-based applications. It lets users orchestrate chaos experiments to test the resilience of their Kubernetes clusters. Experiment types include pod failure, network latency, and resource stress.
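As a rough idea of what a pod failure experiment looks like, the sketch below builds a Chaos Mesh `PodChaos` object as a Python dictionary and prints it as JSON (which `kubectl apply -f` accepts). The experiment name, namespace, and `app` label are placeholders I made up, and the field layout follows the `chaos-mesh.org/v1alpha1` API as I understand it, so check it against the Chaos Mesh docs for your version before running it on a real cluster.

```python
import json


def pod_failure_manifest(name, namespace, app_label, duration):
    """Build a Chaos Mesh PodChaos object as a plain dict (illustrative values)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-failure",  # make the selected pod unavailable
            "mode": "one",            # pick a single matching pod
            "duration": duration,     # how long the failure lasts
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


if __name__ == "__main__":
    # "demo-pod-failure", "default" and "web" are hypothetical names.
    manifest = pod_failure_manifest("demo-pod-failure", "default", "web", "30s")
    print(json.dumps(manifest, indent=2))
```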

[Image: types of chaos supported by Chaos Mesh]

Pumba
A chaos testing tool specifically designed for Docker containers. It lets users introduce network latency, packet loss, and other disruptions to Docker containers to simulate real-world failures.
Pumba can kill, stop, or remove running containers. It can also pause all processes within a running container for a specified period of time.
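Here is a small sketch of what those Pumba actions look like on the command line. It only builds and prints the commands, it does not run them. The container name `web` is a placeholder, and the flags shown (`--duration`, `--signal`, `--time`) are the ones I believe the Pumba CLI uses, so confirm them with `pumba --help` for your installed version.

```python
import shlex


def pumba_pause(container, duration):
    """Pause all processes in a container for the given duration (e.g. '30s')."""
    return ["pumba", "pause", "--duration", duration, container]


def pumba_kill(container, signal="SIGKILL"):
    """Send a signal to the main process of a container."""
    return ["pumba", "kill", "--signal", signal, container]


def pumba_delay(container, duration, delay_ms):
    """Add network latency (in milliseconds) to a container's egress traffic."""
    return ["pumba", "netem", "--duration", duration,
            "delay", "--time", str(delay_ms), container]


if __name__ == "__main__":
    # "web" is a hypothetical container name.
    for argv in (pumba_pause("web", "30s"),
                 pumba_kill("web"),
                 pumba_delay("web", "1m", 500)):
        print(shlex.join(argv))
```

Printing the commands first is a cheap way to review an experiment before pointing it at real containers.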

Chaos Monkey
Developed by Netflix, Chaos Monkey is one of the earliest chaos engineering tools. It randomly terminates virtual machine instances to ensure that engineers design systems that can withstand failures.
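The idea is simple enough to simulate in a few lines. The sketch below is not Chaos Monkey's code, just an in-memory toy with made-up instance IDs: each round it terminates one random instance, and an `autoscaling` flag (my assumption, standing in for something like an auto scaling group) decides whether a replacement comes up.

```python
import random


def run_rounds(instance_ids, rounds, minimum_healthy, autoscaling, seed=7):
    """Terminate one random instance per round and record service health.

    Returns a list of (terminated_instance, service_still_healthy) tuples.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    running = set(instance_ids)
    log = []
    for round_number in range(1, rounds + 1):
        if not running:
            break
        victim = rng.choice(sorted(running))
        running.remove(victim)
        if autoscaling:
            # Simulate the platform launching a replacement instance.
            running.add(f"replacement-{round_number}")
        healthy = len(running) >= minimum_healthy
        log.append((victim, healthy))
    return log


if __name__ == "__main__":
    fleet = ["i-a", "i-b", "i-c"]  # hypothetical instance IDs
    for autoscaling in (False, True):
        print(f"autoscaling={autoscaling}")
        for victim, healthy in run_rounds(fleet, 3, 2, autoscaling):
            print(f"  terminated {victim}, healthy={healthy}")
```

Without replacements the service drops below its minimum after the second termination. With them it stays healthy every round, which is exactly the kind of design this tool pushes engineers towards.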

Litmus Chaos
An open-source chaos engineering platform for Kubernetes. It provides a framework and a set of pre-defined chaos experiments for testing Kubernetes resilience.
Litmus was accepted into the CNCF on June 25, 2020, and moved to the Incubating maturity level on January 11, 2022.

Apache Bench
Apache Bench (ab is the actual program file name) is a single-threaded command-line program for benchmarking, that is, measuring the performance of, HTTP web servers. It is a load-testing tool rather than a fault-injection tool, but it can be used to stress test your APIs or endpoints to confirm they can withstand heavy concurrent traffic before deployment.
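With ab itself, a typical run looks like `ab -n 1000 -c 100 http://localhost:8080/`, where `-n` is the total number of requests and `-c` is how many run concurrently (the URL is a placeholder). The Python sketch below is a rough stand-in for the same idea, not a replacement for ab: `run_load` and `fake_endpoint` are made-up names, and it uses threads where ab uses a single-threaded event loop.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(request_fn):
    """Call request_fn once and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        request_fn()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load(request_fn, total_requests, concurrency):
    """Fire total_requests calls with at most `concurrency` in flight.

    Assumes total_requests >= 1.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(request_fn),
                                range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "median_seconds": latencies[len(latencies) // 2],
        "slowest_seconds": latencies[-1],
    }


def fake_endpoint():
    """Hypothetical stand-in for an HTTP request to your API."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(run_load(fake_endpoint, total_requests=200, concurrency=20))
```

Swap `fake_endpoint` for a function that actually calls your endpoint, and watch how the failure count and slowest latency change as you raise the concurrency.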

These are some of the real-world tools you can use to test your products for resiliency on the way to an antifragile infrastructure.

There are some special use cases where chaos engineering is automated and run continuously.

A real-life example is FINBOURNE, a financial technology company that provides a cloud-based investment management platform called LUSID. LUSID is designed to help asset managers, wealth managers, and financial institutions streamline their investment operations.

FINBOURNE hosts its infrastructure on AWS and runs an automated, special type of chaos engineering that terminates an application every seventeen (17) minutes, terminates an EC2 instance every six (6) hours, and fails an Availability Zone twice weekly, just to continually evaluate how quickly it recovers from a failure.
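To get a feel for how much chaos that is, here is a quick back-of-the-envelope calculation in Python. The only assumption I am adding is that the two weekly Availability Zone failures are evenly spaced. The intervals themselves come from the schedule above.

```python
from datetime import timedelta

WEEK = timedelta(weeks=1)

# Intervals from the schedule described above; "twice weekly" is assumed
# to mean evenly spaced, i.e. once every 3.5 days.
SCHEDULE = {
    "application termination": timedelta(minutes=17),
    "EC2 instance termination": timedelta(hours=6),
    "Availability Zone failure": WEEK / 2,
}


def occurrences(interval, window=WEEK):
    """How many times an event that fires every `interval` fits in `window`."""
    return window // interval


if __name__ == "__main__":
    total = 0
    for event, interval in SCHEDULE.items():
        count = occurrences(interval)
        total += count
        print(f"{event}: {count} per week")
    print(f"total injected failures: {total} per week")
    # application termination: 592 per week
    # EC2 instance termination: 28 per week
    # Availability Zone failure: 2 per week
    # total injected failures: 622 per week
```

That is over six hundred deliberate failures every week, each one a chance to measure recovery time.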

Mind-blowing, right?!

These are some of the extreme measures companies adopt just to ensure optimal performance and resiliency.
That will be all on chaos engineering today!

If you enjoyed this read, connect with me on LinkedIn.
HAPPY CLOUD COMPUTING!
