DEV Community: ChaosMeta

Automated Chaos Engineering ChaosMeta V0.6 Officially Released

ChaosMeta — Thu, 02 Nov 2023 06:06:35 +0000

Version V0.6.0 of ChaosMeta, the chaos engineering platform, is now officially released. This version contains many new features and enhancements. The orchestration interface now supports additional node types, including traffic injection and measurement, and visualizes the entire drill process. Together, these address the last of the principles of chaos engineering: continuously running experiments automatically.

Introduction

ChaosMeta is a cloud-native chaos engineering platform designed for automated exercises. It provides platform functions such as visual orchestration and scheduling, data isolation, and multi-cloud management, as well as rich fault injection capabilities, covering the entire life cycle of the drill. It embodies the methodology, technical capabilities and product capabilities that Ant Group has accumulated over many years in large-scale red and blue offensive and defensive drills at the company level.

New Features

In the new version, fault capabilities such as DNS anomalies and log injection are added, and the visual orchestration interface provides support for various nodes such as traffic injection and measurement, providing support capabilities for automated chaos engineering.

▌Lossless injection

Log injection is a simple fault capability that essentially appends text content to files. But the idea of lossless injection extended from this is more important.

As the name suggests, lossless (non-destructive) injection means conducting drills without actually affecting the business, in order to discover shortcomings in the application's emergency procedures such as monitoring alarms, damage control, and self-healing. It is a risk-minimized way of running drills that is well suited to production environments.

There are generally two implementation solutions for lossless injection:

If an application's monitoring metrics are derived from log content, inject the corresponding content into the application's log file to verify, non-destructively, whether its emergency processes are complete;
Directly tamper with the monitoring data of the target monitoring item (such as CPU usage) to verify whether the subsequent emergency process is complete.

Below are two walkthrough scenarios for log injection:

（1）Common Error

We usually monitor the number of occurrences of keywords such as "Error" and "Exception" in a log file to determine whether an application is in an abnormal state. A sudden increase most likely means that the application has failed. This type of failure can therefore be simulated with log injection (file append).
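The idea can be sketched in a few lines of Python. This is a minimal illustration only, not ChaosMeta's implementation: the log line format, the keyword list, and the function names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Keywords the monitoring rule is assumed to count (taken from the text above).
KEYWORDS = ("Error", "Exception")


def count_keyword_lines(log_path: Path) -> int:
    """Count log lines that contain at least one monitored keyword."""
    with log_path.open(encoding="utf-8") as f:
        return sum(1 for line in f if any(k in line for k in KEYWORDS))


def inject_error_lines(log_path: Path, count: int) -> None:
    """Append synthetic error lines; existing log content is left untouched."""
    with log_path.open("a", encoding="utf-8") as f:
        for i in range(count):
            # Hypothetical line format, chosen only for illustration.
            f.write(f"[drill] Error: synthetic Exception #{i}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "app.log"
        log.write_text("INFO service started\n", encoding="utf-8")
        before = count_keyword_lines(log)
        inject_error_lines(log, 50)
        after = count_keyword_lines(log)
        print(f"keyword lines before={before}, after={after}")
```

Because the file is only appended to, the application itself keeps running normally; only the monitoring rule that counts keywords sees the spike.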

（2）Interface request latency

Because monitoring collection and reporting affect performance, some applications use an asynchronous collection scheme: the RPC framework writes the request latency and return code of each interface call to a log, and a collection agent then asynchronously collects the data from the log file and reports it.

For example, in latency monitoring of a message push interface, the latency data for each interface call is collected from the log file and reported to the monitoring platform.

In this case, you can also use the fault capability of log injection (file append) to simulate a fault scenario where the request takes too long, without actually injecting a network fault into the application.
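A rough sketch of this scenario follows. The log line format (`interface=... cost_ms=... code=...`), the interface name, and the averaging rule are all hypothetical; real RPC frameworks and collection agents use their own formats and alert rules.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Hypothetical RPC log format: "interface=<name> cost_ms=<int> code=<int>"
LINE_RE = re.compile(r"interface=(\S+) cost_ms=(\d+) code=(\d+)")


def format_rpc_line(interface: str, cost_ms: int, code: int = 200) -> str:
    """Render one log line in the assumed format."""
    return f"interface={interface} cost_ms={cost_ms} code={code}\n"


def collect_costs(log_path: Path, interface: str) -> List[int]:
    """What a collection agent might do: parse latencies for one interface."""
    costs = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m and m.group(1) == interface:
                costs.append(int(m.group(2)))
    return costs


def inject_slow_requests(log_path: Path, interface: str, cost_ms: int, count: int) -> None:
    """Log injection: append fake slow-request records without touching the network."""
    with log_path.open("a", encoding="utf-8") as f:
        for _ in range(count):
            f.write(format_rpc_line(interface, cost_ms))


def latency_alarm(costs: List[int], threshold_ms: int, window: int = 5) -> bool:
    """Fire when the mean of the most recent `window` samples exceeds the threshold."""
    recent = costs[-window:]
    return bool(recent) and sum(recent) / len(recent) > threshold_ms
```

Appending a handful of records with a large `cost_ms` is enough to push the recent average over the threshold, so the alarm and the processes behind it can be exercised while real requests stay fast.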

Lossless injection is very efficient when the goal is only to quickly verify standardized emergency response capabilities, such as monitoring alarms, fault localization, and contingency plans, across a large number of applications.

▌Automate running experiments

The industry-recognized principles of chaos engineering are:

Build a hypothesis around steady-state behavior
Vary real-world events
Run experiments in production
Minimize blast radius
Automate experiments to run continuously

The industry has established ways to implement the first four, but the last one, automating experiments to run continuously, has never had a satisfactory implementation.

Of course, many well-known chaos engineering projects have tried to solve this problem, and most of them provide product capabilities for scheduled execution. However, it is doubtful whether this scheduled execution capability can be used on a large scale in a production environment.

The reason is twofold. On the one hand, fault injection is a high-risk action; without sufficient admission checks and other pre-drill inspections, there is not enough confidence to trigger it automatically. On the other hand, a drill is not just "fault injection". We often need to do a lot of additional "manual analysis" work: before injecting a fault, checking whether the target application's status and the current environment meet the preset conditions and whether there is sufficient traffic; after injecting it, determining how long discovery, localization, and recovery took, analyzing emergency efficiency, and so on.

ChaosMeta breaks this kind of "manual analysis" work down into different types of atomic execution tasks, represented as nodes of various types such as "fault injection", "measurement", "traffic injection", and "waiting". Flexible orchestration then combines these nodes into automated drill scenarios with various business semantics. Here are a few simple examples:

Keeping service high availability fresh

We all have high-availability requirements for online applications, such as running multiple replicas with automatic load balancing. Regular drills are a way to keep these high-availability capabilities verified.

Because this is a production environment, we cannot simply trigger drills automatically, since there is no guarantee that multiple replicas of an application are available at all times. For example, if only one replica happens to be available just before the drill while a large amount of user traffic is arriving, an automatically launched drill could have severe consequences.

In this example, several checks can increase your confidence in automated drills: confirming that the application has multiple replicas, that the service's user traffic is within an acceptable range, and that the application returns to the multi-replica state after the drill.

The purpose of our exercise is to verify the high availability of the service, so the corresponding service availability measurement is also necessary.

So once all of these concerns are configured into the orchestration, a successful run means the drill met expectations. Conversely, if execution fails, the person in charge is notified through an alarm so they can intervene. This greatly reduces the manpower invested in drills.
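The flow described in this example can be sketched as a small pipeline of nodes. This is a conceptual illustration only: the `Node` class, the node kinds, the thresholds, and the in-memory `env` dictionary are hypothetical and do not reflect ChaosMeta's actual orchestration API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    kind: str  # e.g. "measure" or "fault" (illustrative labels)
    run: Callable[[Dict[str, int]], bool]


def run_drill(nodes: List[Node], env: Dict[str, int], notify: Callable[[str], None]) -> bool:
    """Run nodes in order; stop and alert the owner at the first failing node."""
    for node in nodes:
        if not node.run(env):
            notify(f"drill stopped at {node.kind} node '{node.name}'")
            return False
    return True


def build_ha_drill(min_replicas: int = 2, max_qps: int = 1000) -> List[Node]:
    """High-availability drill: admission checks, fault, measurement, recovery."""
    def kill_one(env: Dict[str, int]) -> bool:
        env["replicas"] -= 1  # stand-in for a real fault injection
        return True

    def recover(env: Dict[str, int]) -> bool:
        env["replicas"] += 1  # stand-in for injection recovery
        return True

    return [
        Node("multiple replicas available", "measure", lambda e: e["replicas"] >= min_replicas),
        Node("user traffic acceptable", "measure", lambda e: e["qps"] <= max_qps),
        Node("kill one replica", "fault", kill_one),
        Node("service still available", "measure", lambda e: e["replicas"] >= 1),
        Node("recover injection", "fault", recover),
        Node("replicas restored", "measure", lambda e: e["replicas"] >= min_replicas),
    ]
```

With an environment of three replicas the drill runs to completion. With a single replica it stops at the first admission check, before any fault is injected, and the `notify` callback receives the alert.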

Red and blue attack and defense automation

When conducting red and blue offensive and defensive drills, the blue army is generally responsible for designing scenarios for the drill, and finally makes an objective evaluation of the red army's emergency efficiency (personnel, platforms), in order to guide the direction of the red army's defense capability building.

A common evaluation standard is whether the Red Army's discovery, localization, and recovery times for a fault meet the requirements of 1, 5, and 10 minutes respectively; if not, points are deducted. Since elapsed time is involved, an accurate starting time point (when the fault takes effect) and accurate target time points (discovery, localization, and recovery) are necessary.

The starting time point is the moment at which the Red Army considers the fault criteria to have been met, which is not necessarily the moment of fault injection in the traditional sense. For example, if the target service promises a service latency below 3000 ms, then only a latency above 3000 ms counts as a fault that the Red Army should respond to. Likewise, the business recovery target is latency back below 3000 ms. Calculating directly from the time of the fault injection operation would introduce a large error, and the injection may not even cause what the Red Army considers a fault, so measuring fault effectiveness is also a necessary step.
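A small worked sketch shows why the starting point matters. The latency samples, event times, and function names below are made up for illustration; only the 3000 ms threshold and the 1/5/10-minute limits come from the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (seconds since drill start, latency in ms)

SLO_MS = 3000  # latency promised by the target service
LIMITS_S = {"discovery": 60, "location": 300, "recovery": 600}  # 1 / 5 / 10 minutes


def first_time(samples: List[Sample], predicate: Callable[[float], bool],
               not_before: float = 0.0) -> Optional[float]:
    """Return the first sample time at or after `not_before` whose latency matches."""
    for t, latency_ms in samples:
        if t >= not_before and predicate(latency_ms):
            return t
    return None


def evaluate(start: float, events: Dict[str, float]) -> Dict[str, bool]:
    """Check each stage's elapsed time, measured from `start`, against its limit."""
    return {stage: events[stage] - start <= limit for stage, limit in LIMITS_S.items()}


# Illustrative data: the fault is injected at t=10 s but only breaks the SLO at t=30 s.
samples = [(0, 200), (10, 250), (20, 1500), (30, 3200), (40, 3500), (400, 3100), (420, 900)]
injected_at = 10.0
effective_at = first_time(samples, lambda ms: ms > SLO_MS)
recovered_at = first_time(samples, lambda ms: ms < SLO_MS, not_before=effective_at)
events = {"discovery": 80.0, "location": 320.0, "recovery": recovered_at}
```

Measured from the fault-effective moment, all three stages are within their limits in this data. Measured from the injection time, discovery and location would both be scored as failures, which is the error the text warns about.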

In order to evaluate the emergency response efficiency of the Red Army for each failure, the Blue Army needs to collect data from various emergency platforms (monitoring platform, positioning platform, self-healing platform, etc.). Manual collection and analysis is a very arduous task. However, traditional chaos engineering platforms only have fault simulation capabilities, and these "manual operations" must be performed repeatedly for the same drill scenario. ChaosMeta hopes to configure these "manual operations" into the platform to improve drill efficiency.

Network fault attack and defense drills

This is a simple red and blue attack and defense drill example for an application whose network latency becomes too high.

Since network traffic monitoring is involved, service traffic is a necessary condition. If network delay is injected while there is no service traffic, no alarm for excessive service delay will be triggered. Therefore, a node that generates mock traffic needs to be configured;

An admission check is also required to measure whether the current traffic level meets expectations; otherwise, the results of the subsequent steps will most likely not be accepted by the Red Army;

What remains is to measure the moment the fault takes effect and the moment the business recovers, which provides the data for analyzing emergency efficiency.

Future Direction

Next, we will continue to improve our capabilities in all aspects:

Support multi-cloud and non-cloud management, that is, manage cross-cluster pods/nodes and non-k8s machines/bare containers;
Improve the data analysis capabilities of measurement. Currently, it can only measure single moments, such as when a fault takes effect, is located, or is recovered; it cannot yet combine the moments from multiple nodes for analysis (for example: recovery time - effective time < target duration);
Support more atomic capabilities for the various node types, such as business-level fault capabilities for mainstream open source projects like MySQL, OceanBase, and Redis;
Support fault and measurement capabilities related to stability risks in large-model training and inference architectures, such as GPU high-load injection.

Join Us

As an open project, we recognize the open source R&D model and are committed to building the ChaosMeta community into an open and creative community. In the future, all R&D, discussions and other related work will be run transparently in the community.

We welcome any form of participation, including but not limited to questions, code contributions, technical discussions, demand suggestions, etc. Looking forward to receiving community ideas and feedback to drive the project forward.

If you are interested in our project or design concept, please star our project to support it.

GitHub：https://github.com/traas-stack/chaosmeta
Documentation：https://chaosmeta.gitbook.io/chaosmeta-en
Email：chaosmeta.io@gmail.com
Twitter：AntChaosMeta
DingTalk Group：21765030887

Ant Group Chaos Engineering - ChaosMeta V0.5 Release

ChaosMeta — Wed, 06 Sep 2023 03:00:53 +0000

Version V0.5.0 of ChaosMeta has been officially released. This version includes many new features and enhancements, offering users platform capabilities that support the various stages of chaos engineering, along with a user-friendly interface that lowers the entry barrier.

Introduction to Core New Features of ChaosMeta V0.5

The current version primarily introduces platform interface components (chaosmeta-platform), metric components (chaosmeta-measure-operator), and traffic injection components (chaosmeta-flow-operator).

▌Platform Interface

The platform interface provides a user-friendly operational interface for users to conveniently utilize the capabilities of ChaosMeta. The current platform capabilities primarily include:

1. Space Management: Data isolation based on organizations or activities, ensuring data security and privacy.
2. User Permission Management: Controls access permissions for different roles, effectively managing the usage of chaos engineering experiments.
3. Experiment Orchestration: Utilizing drag-and-drop visual operations to make experiment orchestration more user-friendly and flexible, improving user efficiency.
4. Experiment Results: Provides the ability to trace experiment execution details, allowing users to monitor experiment progress and results in real-time. This facilitates data analysis and decision-making for users.

▌Metric Engine

The current metric engine includes four types of metric capabilities:

monitor: Performs expected value judgments on monitoring metrics, such as whether the CPU usage of a particular machine exceeds 90%. It supports Prometheus by default.
pod: Performs expected value judgments on pod-related data, such as whether the number of instances of a particular application's pod is greater than 3.
http: Performs expected value judgments on HTTP requests, such as whether a specified HTTP request returns a status code of 200.
tcp: Performs expected value judgments on TCP requests, such as testing whether the 8080 port of a specific server is reachable.
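To make "expected value judgment" concrete, here is a minimal sketch of the `tcp` and `http` style checks plus a generic comparison, using only the Python standard library. It is an illustration of the idea, not the chaosmeta-measure-operator implementation; the function names and the operator set are assumptions.

```python
import operator
import socket
import urllib.error
import urllib.request

# Comparison operators a judgment might support (an assumed set).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def judge(actual: float, op: str, expected: float) -> bool:
    """Generic expected-value judgment, e.g. judge(cpu_usage, '>', 90)."""
    return OPS[op](actual, expected)


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """tcp-style check: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status_is(url: str, expected: int = 200, timeout: float = 2.0) -> bool:
    """http-style check: does the request return the expected status code?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expected
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return e.code == expected
    except (urllib.error.URLError, OSError):
        return False
```

The `monitor` and `pod` types follow the same pattern: fetch a value (from Prometheus or the Kubernetes API respectively) and pass it through a comparison like `judge`.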

▌Traffic Engine

The current traffic injection capability only supports injecting HTTP traffic. In the future, we will gradually add support for other types of traffic injection, such as RPC, DB client, Redis client, and more.
The underlying implementation is based on the open-source component JMeter, where each traffic injection task initiates a JMeter job for execution.

Core Design Principles of ChaosMeta

ChaosMeta is designed to address several commonly encountered industry challenges.

1. Integration of Chaos Engineering Practice Stages

In the current industry, mainstream Chaos Engineering projects mainly focus on how to create failures. However, engineers who frequently conduct Chaos Engineering drills understand the pain points of repetitive tasks in each drill, such as checking if the current environment meets the prerequisites for the drill (drill admission), verifying if the business traffic is sufficient (traffic injection), assessing if the injected failures meet the expected outcomes (failure measurement), ensuring if the business services are restored within the predefined time (recovery measurement), and conducting retrospective analysis to identify risk points.

Based on the industry landscape and the aforementioned problem analysis, combined with Ant Group's years of experience in the field of Chaos Engineering, the ChaosMeta platform is designed to cover the technical support for various stages, including "admission check," "traffic injection," "failure injection," "failure measurement," "recovery measurement," and "injection recovery." This design aims to free up human resources in each stage, streamlining Chaos Engineering practices.

2. Reusability of Fault Experiment Design Experience

Before conducting Chaos Engineering drills, there is another labor-intensive task involved, which is the design of experiment scenarios. Currently, this part heavily relies on human design capabilities, making it challenging to fully automate the process. However, we can systematically abstract and document reusable design experiences, allowing for quick reuse when conducting Chaos Engineering drills on similar components. This is the original intention behind the design of a risk catalog.

Initially, the risk catalog was primarily open-sourced in a theoretical manner. In the future, it will be integrated into the ChaosMeta project as a platform capability.

3. Complexity of Cloud-Native Infrastructure Environment

The infrastructure environments of most companies are currently built on Kubernetes. The stability of both the cloud itself and cloud-native applications is of paramount importance. Traditional fault injection methods may struggle to address these challenges. Therefore, in the design of ChaosMeta, we aim to address the following problems:

Fault Injection Capabilities for Kubernetes Stability
The focus is on the stability of Kubernetes itself, including core components such as APIServer and Scheduler, as well as the handling of abnormal states for various resources and exceptions in Operator applications.

Platform Support for Cloud-Native Deployment
ChaosMeta is built on the Operator development model (see the user documentation), which makes it naturally compatible with deployment in cloud-native environments.

Platform Support for Managing and Injecting Faults into Containers in the Cloud
With traditional methods, injecting faults into containers requires transferring the single-machine fault injection tool into the target container and executing commands. However, the base images of most business containers are minimalistic, lacking support for various command tools such as tc and fallocate. This limitation restricts container fault injection due to environmental factors.

ChaosMeta uses a "containerized injection" approach to inject faults into pods and nodes within the cluster. The single-machine fault injection tool, chaosmetad, supports fault injection into containers on the host machine without the need to copy chaosmetad into the container. This is achieved by selectively entering the target container's Linux namespace on the host machine, enabling the use of host machine tools to simulate abnormalities in the corresponding namespace of the container.
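The namespace technique can be illustrated with the standard `nsenter` utility from util-linux. The sketch below only builds the command line; it is an assumption about how such an injection could look, not a description of how chaosmetad is implemented. The PID, the interface name `eth0`, and the delay value are placeholders.

```python
from typing import List, Sequence

# Long options of util-linux nsenter for the namespaces we may want to enter.
NS_FLAGS = {"net": "--net", "pid": "--pid", "mount": "--mount",
            "uts": "--uts", "ipc": "--ipc"}


def nsenter_command(target_pid: int, tool_argv: Sequence[str],
                    namespaces: Sequence[str] = ("net",)) -> List[str]:
    """Build a command that runs a host tool inside selected namespaces of a container.

    `target_pid` is the host-side PID of a process in the target container.
    Only the listed namespaces are entered; the mount namespace is left out by
    default, so the tool binary is resolved from the host filesystem.
    """
    cmd = ["nsenter", "--target", str(target_pid)]
    cmd += [NS_FLAGS[ns] for ns in namespaces]
    return cmd + ["--"] + list(tool_argv)


# Example: add 3000 ms of network delay inside the container's network namespace
# using the host's `tc`, even if the container image does not ship `tc`.
delay_cmd = nsenter_command(
    4242,  # placeholder PID
    ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "3000ms"],
)
```

Running such a command requires root privileges on the host, for example via `subprocess.run(delay_cmd, check=True)`. Entering only the network namespace is what allows a minimal container image to be targeted without copying any tools into it.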

Platform Support for Managing and Injecting Faults into Containers Outside the Cloud

Many companies still have business processes that are not yet migrated to the cloud. These processes are either deployed on ordinary physical/virtual machines or launched as bare containers on top of the existing infrastructure (e.g., Docker containers). In such cases, the platform needs to support managing these targets outside the cloud. ChaosMeta's single-machine fault injection tool, chaosmetad, supports agent mode startup, which periodically reports container information from the machines to the platform. The platform can directly select targets outside the cluster and issue fault injection tasks.

Multi-Cluster Management
While it is recommended to deploy one control platform per cluster, there are still users who prefer centralized management. ChaosMeta's platform design supports managing different clusters' kubeconfig and performing fault injection across clusters.

4. Automatable Paths

The ultimate goal of platform technology is to liberate human resources and evolve towards automation and intelligence. While we may not have fully achieved this yet, it is crucial to at least move in the right direction.
ChaosMeta's automation-focused Chaos Engineering philosophy is primarily supported by the platform capabilities at each stage of Chaos Engineering drills, with the "risk catalog" serving as a theoretical foundation. This allows ChaosMeta to progressively evolve towards automated Chaos Engineering.

2023 Roadmap

This year's goal is primarily to improve the capabilities of the platform and ensure that the basic capabilities at each stage are fully developed. We will also collaborate further with other open-source communities, such as OceanBase and SOFA, to strive for the achievement of a complete version 1.0.

Platform capabilities are as follows:

Ability to support all types of orchestration nodes
Built-in generic experiment templates for some open-source components
Provision of an Agent management interface to manage physical machines and containers in the cloud and on-premises
Support for cross-cluster management

Basic capabilities at each stage include:

Traffic Injection Capability: Enhancing traffic injection capabilities based on JMeter to provide more advanced traffic injection scenarios.
Measurement Capability: Offering advanced state measurement capabilities in the direction of cloud-native technology.
Fault Injection Capability: Progressing towards component-level fault injection capabilities, such as integrating fault injection capabilities for open-source components like OceanBase, MySQL, Redis, Etcd, and more.

Risk Catalog

Formally release the open-source theoretical version of the risk catalog, and integrate two main capabilities into the ChaosMeta platform: "Generic Experiment Templates" and "Component-level Fault Injection Capability".

Join ChaosMeta Community

As an open project, we embrace the open-source development model and strive to build the ChaosMeta community as an open and creative space. Going forward, all development, discussions, and related work will be transparently conducted within the community.
We welcome all forms of participation, including but not limited to asking questions, contributing code, engaging in technical discussions, and providing suggestions for improvements. We look forward to receiving community ideas and feedback to drive the project further.
If you are interested in our project or design philosophy, please star our project to show your support.

Project GitHub Repository:
https://github.com/traas-stack/chaosmeta
Official Documentation:
https://chaosmeta.gitbook.io/chaosmeta-cn

WeChat Group, please add the group owner as a friend (WeChat ID: KingsonKai) to receive an invitation to join the group.

DingTalk Group: 21765030887

WeChat Public Account: ChaosMeta Chaos Engineering

Ant Group's automated chaos engineering ChaosMeta is officially open source

ChaosMeta — Tue, 13 Jun 2023 02:29:01 +0000

ChaosMeta Introduction

ChaosMeta is a chaos engineering platform designed for cloud-native and automated drills. It is the external open source version of XMonkey, an internal chaos engineering platform of Ant Group, which embodies the methodology, technical capabilities and product capabilities accumulated by Ant Group in the practice of large-scale red and blue offensive and defensive drills at the company level for many years.
Driven by years of complex fault drill scenarios within the company, XMonkey has accumulated a great deal of unique experience in the field of chaos engineering. It is an important platform for Ant Group's R&D, testing, quality, SRE, and other personnel to replay historical faults in drills and uncover potential system risks. We very much hope that this experience can be discussed and evolved together with the industry, which is why we open sourced ChaosMeta.
ChaosMeta is designed to be a one-stop drill solution that covers the complete chaos engineering life cycle, helping users quickly discover potential risks in business applications and systems. It also has a built-in "risk catalog" that condenses Ant Group's many years of accumulated experience in the field of technical risk: a summary of basic, general risks in various technical fields.
With the platform capabilities of the chaos engineering life cycle as the technical support and the "risk catalog" as the theoretical support, ChaosMeta can evolve towards automated chaos engineering.

Core Features

▌Chaos Engineering Lifecycle

Current mainstream chaos engineering projects in the industry mainly focus on how to create faults. Engineers who often do drill-related work will recognize the following pain points in each drill: checking whether the current environment meets the preset conditions of the drill (Admission Check), whether business traffic is sufficient (Traffic Injection), whether the effect of the injected fault is in line with expectations (Fault Measurement), and whether the business service is restored within the preset time (Recovery Measurement), and then analyzing and summarizing the risk points.
Based on the current situation of the industry and the analysis of the above problems, combined with Ant Group's years of experience in the field of chaos engineering, ChaosMeta proposes a Chaos Engineering Lifecycle model that provides technical support for stages such as "Admission Check", "Traffic Injection", "Fault Injection", "Fault Measurement", "Recovery Measurement", and "Injection Recovery", laying the technical foundation for automated chaos engineering.

▌Risk Catalog

Ant Group holds company-level, large-scale red and blue offensive and defensive drills every year covering all of the company's businesses, and many businesses also run 24/7 drills and routine monthly drills.
Drill targets cover cloud products, Kubernetes, Operator applications, databases (OceanBase, Etcd, etc.), middleware (message queues, distributed scheduling, configuration centers, etc.), and business applications (trading systems, marketing systems, etc.).
For each type of application, some of the most common risks can always be abstracted. For message queues, for example, there are risks such as message backlog, message loss, out-of-order messages, and unstable network connections to dependencies; for transaction systems, there are risks around distributed transactions, amount consistency, and concurrency conflicts.
The "Risk Catalog" is a "Risk Encyclopedia" that Ant Group has refined over the years in the context of a large-scale financial Internet architecture, and we will build the parts that are common to the open source world into the ChaosMeta project as the theoretical basis of automated chaos engineering.

▌Abundant cloud-native fault injection capabilities

Large-scale and high-frequency drills have also promoted the construction of various fault injection capabilities. In addition to common system resource faults, kernel faults, network faults, JVM injection, etc. in the industry, we also provide a wealth of cloud-native fault injection capabilities.
For example:

accumulating a large number of pending pods in the kubernetes cluster, dragging down the scheduling system;
injecting a validating admission webhook into the creation process of a resource object, prolonging the verification time and affecting the creation efficiency;
injecting a mutating admission webhook to mutate specific fields;
building a large number of List & Watch connections, increasing the load on the APIServer when it notifies Operators.
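As a rough illustration of the first item, the sketch below generates manifests for Pods that can never be scheduled, so they stay in the Pending state. This is one possible way to build such a backlog, assumed for the example rather than taken from ChaosMeta; the label key, namespace, and Pod names are hypothetical.

```python
import json
from typing import Dict, List


def unschedulable_pod(name: str, namespace: str = "default") -> Dict:
    """Return a Pod manifest whose nodeSelector matches no node, so it stays Pending."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"drill": "pending-pods"},  # makes cleanup by label easy
        },
        "spec": {
            # Hypothetical label that no node is expected to carry.
            "nodeSelector": {"drill.example/nonexistent-node": "true"},
            "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
        },
    }


def pending_pod_manifests(count: int, prefix: str = "pending-drill") -> List[Dict]:
    """Generate `count` unschedulable Pod manifests with unique names."""
    return [unschedulable_pod(f"{prefix}-{i}") for i in range(count)]


if __name__ == "__main__":
    # Print a JSON List that could be piped to `kubectl apply -f -` in a test cluster.
    items = pending_pod_manifests(3)
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Recovery in this sketch is a matter of deleting the Pods by their `drill` label. The number of Pods needed to put real pressure on a scheduler depends on the cluster and is left open here.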

Cloud Native Architecture

ChaosMeta's core platform capabilities are implemented based on the Operator development model, so it naturally supports cloud native. ChaosMeta is designed in three layers (see the official documentation for details):

The user layer consists of the chaosmeta-platform component. Its main task is to lower the barrier to entry by providing a visual interface through which users can use platform functions such as planning, orchestration, experiment configuration, and experiment record details;
The engine layer includes ChaosMeta's core platform capabilities such as remote injection, orchestration, and measurement, as well as the implementation of some cloud-native fault capabilities;
The kernel layer mainly implements single-machine fault injection capabilities. It centers on the chaosmetad component, which can run either as a resident HTTP service or via command-line execution, and it is also packaged as a corresponding DaemonSet component (chaosmeta-daemonset), so it can be flexibly matched to drill platforms with different needs.

RoadMap

The planning of ChaosMeta is divided into two main routes: platform capabilities and fault injection capabilities. The current main task is to finish adapting the main capabilities shown in the architecture diagram and release them in the open source version.

▌Platform Capability

The future evolution of ChaosMeta platform capabilities is divided into three stages.

Phase 1 - Manual Configuration
The goal is to open source all the components in the architecture diagram. At that point, ChaosMeta can support the complete chaos engineering life cycle and enter the initial stage of automated chaos engineering, with the "risk catalog" serving as a theoretical reference: configure once manually, then run automatically many times.

Phase 2 - Automation
At this stage, the "Risk Catalog" will play a greater role. It provides not only the risks for a class of applications, but also the corresponding prevention and emergency recommendations and a score for each item. ChaosMeta will integrate the "risk catalog" into risk "health check" packages for common components, providing a one-click "physical examination" capability: input the target application information, and it directly outputs a risk score and a risk analysis report.

Phase 3 - Intelligence
Explore the direction of combining artificial intelligence.

▌Fault Injection Capability

Fault capabilities are grouped by category. For the specific atomic fault capabilities provided, please refer to the description of fault capabilities (you are welcome to submit issues proposing new capabilities; those in higher demand are given priority).

Join ChaosMeta Community

As an open project, we recognize the open source R&D model and are committed to making the ChaosMeta community an open and creative community. In the future, all R&D, discussion and other related work will be carried out transparently in the community.
We welcome any form of participation, including but not limited to questions, code contributions, technical discussions, requirements suggestions, etc. Looking forward to receiving community ideas and feedback to move the project forward further.

GitHub address：https://github.com/traas-stack/chaosmeta
Official documentation：https://chaosmeta.gitbook.io/chaosmeta-en
Twitter：AntChaosMeta
Slack group：https://app.slack.com/client/T057ERYMS8J/C057883SM38?geocode=zh-cn
DingTalk Group：21765030887