ChaosMeta V0.5.0 has been officially released! This version adds many new features and enhancements: platform capabilities that support the various stages of chaos engineering, and a user-friendly interface that lowers the entry barrier.
Introduction to Core New Features of ChaosMeta V0.5
This version introduces three components: a platform interface component (chaosmeta-platform), a metric component (chaosmeta-measure-operator), and a traffic injection component (chaosmeta-flow-operator).
▌Platform Interface
The platform interface gives users a convenient way to work with ChaosMeta's capabilities. The platform currently provides:
1. Space Management: Isolates data by organization or activity, ensuring data security and privacy.

2. User Permission Management: Controls access permissions for different roles, so that the use of chaos engineering experiments can be managed effectively.

3. Experiment Orchestration: Uses drag-and-drop visual operations to make experiment orchestration more flexible and efficient.

4. Experiment Results: Traces experiment execution details so that users can monitor progress and results in real time, which supports data analysis and decision-making.


▌Metric Engine
The current metric engine includes four types of metric capabilities:
- monitor: Checks a monitoring metric against an expected value, such as whether the CPU usage of a particular machine exceeds 90%. Prometheus is supported by default.
- pod: Checks pod-related data against an expected value, such as whether a particular application has more than 3 pod instances.
- http: Checks the result of an HTTP request against an expected value, such as whether a specified request returns status code 200.
- tcp: Checks the result of a TCP request against an expected value, such as whether port 8080 of a specific server is reachable.
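To make the idea of an expected-value judgment concrete, here is a minimal Python sketch. It is not the chaosmeta-measure-operator implementation; the function names `judge` and `tcp_reachable` and the operator strings are assumptions made up for illustration. Each check reduces to comparing an observed value with an expected one.

```python
import operator
import socket

# Hypothetical operator table; these symbols are illustrative only and are
# not the ones chaosmeta-measure-operator uses.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def judge(actual, op, expected):
    """Return True when `actual <op> expected` holds."""
    return COMPARATORS[op](actual, expected)


def tcp_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# monitor-style check: does CPU usage exceed 90%?
cpu_over_limit = judge(93.5, ">", 90)
# pod-style check: are there more than 3 instances?
enough_pods = judge(3, ">", 3)
# http-style check: did the request return status code 200?
status_ok = judge(200, "==", 200)
```

A tcp-style check would call something like `tcp_reachable("10.0.0.5", 8080)` (a made-up address) and judge the boolean result against the expected value.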
▌Traffic Engine
The current traffic injection capability only supports injecting HTTP traffic. In the future, we will gradually add support for other types of traffic injection, such as RPC, DB client, Redis client, and more.
The underlying implementation is based on the open-source component JMeter, where each traffic injection task initiates a JMeter job for execution.
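As a rough illustration of what starting a JMeter run for one task can look like, the sketch below builds a non-GUI JMeter command line. The helper name `jmeter_command`, the file names, and the property names are assumptions; how chaosmeta-flow-operator actually launches its JMeter jobs is not described here. The flags themselves are standard JMeter options: `-n` for non-GUI mode, `-t` for the test plan, `-l` for the results file, and `-J` to set a JMeter property.

```python
def jmeter_command(test_plan, result_file, properties=None):
    """Build a non-GUI JMeter invocation as an argument list."""
    cmd = ["jmeter", "-n", "-t", test_plan, "-l", result_file]
    # Pass each property as -Jname=value, in a stable (sorted) order.
    for key, value in sorted((properties or {}).items()):
        cmd.append(f"-J{key}={value}")
    return cmd


# Hypothetical task: the test plan is assumed to read these properties,
# for example through ${__P(threads)}.
cmd = jmeter_command("http_load.jmx", "result.jtl", {"threads": 10, "duration": 60})
print(" ".join(cmd))
# prints: jmeter -n -t http_load.jmx -l result.jtl -Jduration=60 -Jthreads=10
```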

Core Design Principles of ChaosMeta
ChaosMeta is designed to address several commonly encountered industry challenges.
1. Integration of Chaos Engineering Practice Stages
Most mainstream Chaos Engineering projects focus on how to create failures. Engineers who run drills regularly, however, know that every drill repeats the same chores: checking whether the environment meets the prerequisites for the drill (drill admission), verifying that business traffic is sufficient (traffic injection), assessing whether the injected failure has the expected effect (failure measurement), confirming that business services recover within the predefined time (recovery measurement), and running a retrospective to identify risk points.
Based on the industry landscape and the aforementioned problem analysis, combined with Ant Group's years of experience in the field of Chaos Engineering, the ChaosMeta platform is designed to cover the technical support for various stages, including "admission check," "traffic injection," "failure injection," "failure measurement," "recovery measurement," and "injection recovery." This design aims to free up human resources in each stage, streamlining Chaos Engineering practices.
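The stages listed above can be pictured as one ordered pipeline. The following Python sketch is only an illustration under assumptions: the `run_drill` function, the placeholder steps, and the rule that "injection recovery" always runs even after a failed stage are ours, not ChaosMeta's orchestration logic (in the platform, the flow is composed by the user).

```python
def run_drill(stages, always_run=("injection recovery",)):
    """Run (name, step) pairs in order.

    After the first failing step, remaining steps are skipped, except those
    named in `always_run`, so that injected faults are still cleaned up.
    """
    results = []
    failed = False
    for name, step in stages:
        if failed and name not in always_run:
            results.append((name, "skipped"))
            continue
        ok = bool(step())
        results.append((name, "passed" if ok else "failed"))
        failed = failed or not ok
    return results


# Placeholder steps: each returns True (passed) or False (failed).
stages = [
    ("admission check", lambda: True),
    ("traffic injection", lambda: True),
    ("failure injection", lambda: True),
    ("failure measurement", lambda: False),  # the fault had no visible effect
    ("recovery measurement", lambda: True),
    ("injection recovery", lambda: True),
]
results = run_drill(stages)
```

In this run the drill stops measuring once the failure measurement fails, but the injected fault is still recovered.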

2. Reusability of Fault Experiment Design Experience
Another labor-intensive task comes before the drill itself: designing the experiment scenarios. This work still relies heavily on human design skill, which makes it hard to automate fully. What we can do is systematically abstract and document reusable design experience, so that it can be applied quickly when running Chaos Engineering drills on similar components. This is the original intention behind the risk catalog.
The risk catalog is initially open-sourced as a theoretical reference. In the future, it will be integrated into the ChaosMeta project as a platform capability.

3. Complexity of Cloud-Native Infrastructure Environment
The infrastructure environments of most companies are currently built on Kubernetes. The stability of both the cloud itself and cloud-native applications is of paramount importance. Traditional fault injection methods may struggle to address these challenges. Therefore, in the design of ChaosMeta, we aim to address the following problems:
Fault Injection Capabilities for Kubernetes Stability
The focus is on the stability of Kubernetes itself, including core components such as APIServer and Scheduler, as well as the handling of abnormal states for various resources and exceptions in Operator applications.
Platform Support for Cloud-Native Deployment
ChaosMeta is built on a cloud-native architecture developed with Kubernetes Operators (see the user documentation), so it deploys naturally in cloud-native environments.

Platform Support for Managing and Injecting Faults into Containers in the Cloud
With traditional methods, injecting faults into containers requires transferring the single-machine fault injection tool into the target container and executing commands. However, the base images of most business containers are minimalistic, lacking support for various command tools such as tc and fallocate. This limitation restricts container fault injection due to environmental factors.
ChaosMeta uses a "containerized injection" approach to inject faults into pods and nodes within the cluster. The single-machine fault injection tool, chaosmetad, supports fault injection into containers on the host machine without the need to copy chaosmetad into the container. This is achieved by selectively entering the target container's Linux namespace on the host machine, enabling the use of host machine tools to simulate abnormalities in the corresponding namespace of the container.
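The namespace technique can be illustrated with the standard Linux `nsenter` utility. The sketch below is an assumption-laden example, not chaosmetad's actual implementation: the helper `nsenter_command` and the PID are hypothetical. It builds a command that enters only the network namespace of a container's process, so the `tc` binary still comes from the host while its effect lands inside the container.

```python
# nsenter flags for the namespaces this sketch knows about.
NS_FLAGS = {"net": "-n", "mnt": "-m", "pid": "-p", "uts": "-u", "ipc": "-i"}


def nsenter_command(target_pid, tool_args, namespaces=("net",)):
    """Build an nsenter invocation that runs tool_args inside the selected
    namespaces of the process with PID target_pid."""
    cmd = ["nsenter", "-t", str(target_pid)]
    cmd += [NS_FLAGS[ns] for ns in namespaces]
    # "--" separates nsenter's own options from the command to run.
    return cmd + ["--"] + list(tool_args)


# Hypothetical example: add 100 ms of network delay inside the container
# whose main process has PID 12345 on the host.
cmd = nsenter_command(
    12345,
    ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "100ms"],
)
print(" ".join(cmd))
# prints: nsenter -t 12345 -n -- tc qdisc add dev eth0 root netem delay 100ms
```

Because the mount namespace is not entered, the container image does not need to ship `tc` at all, which is the point of selecting namespaces one by one.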

Platform Support for Managing and Injecting Faults into Containers Outside the Cloud
Many companies still have business processes that are not yet migrated to the cloud. These processes are either deployed on ordinary physical/virtual machines or launched as bare containers on top of the existing infrastructure (e.g., Docker containers). In such cases, the platform needs to support managing these targets outside the cloud. ChaosMeta's single-machine fault injection tool, chaosmetad, supports agent mode startup, which periodically reports container information from the machines to the platform. The platform can directly select targets outside the cluster and issue fault injection tasks.

Multi-Cluster Management
While we recommend deploying one control platform per cluster, some users prefer centralized management. ChaosMeta's platform design supports managing the kubeconfig of multiple clusters and performing fault injection across clusters.
4. Automatable Paths
The ultimate goal of platform technology is to liberate human resources and evolve towards automation and intelligence. While we may not have fully achieved this yet, it is crucial to at least move in the right direction.
ChaosMeta's automation-focused Chaos Engineering philosophy is primarily supported by the platform capabilities at each stage of Chaos Engineering drills, with the "risk catalog" serving as a theoretical foundation. This allows ChaosMeta to progressively evolve towards automated Chaos Engineering.
2023 Roadmap
This year's main goal is to improve the platform and to make sure the basic capabilities at each stage are fully developed. We will also collaborate further with other open-source communities, such as OceanBase and SOFA, and work towards a complete version 1.0.
Platform capabilities:
- Ability to support all types of orchestration nodes
- Built-in generic experiment templates for some open-source components
- Provision of an Agent management interface to manage physical machines and containers in the cloud and on-premises
- Support for cross-cluster management
Basic capabilities at each stage:
- Traffic Injection Capability: Enhancing traffic injection capabilities based on JMeter to provide more advanced traffic injection scenarios.
- Measurement Capability: Offering advanced state measurement capabilities in the direction of cloud-native technology.
- Fault Injection Capability: Progressing towards component-level fault injection capabilities, such as integrating fault injection capabilities for open-source components like OceanBase, MySQL, Redis, Etcd, and more.
Risk Catalog
- Formal release of the open-source theoretical version, and integration of two main capabilities into the ChaosMeta platform: "Generic Experiment Templates" and "Component-level Fault Injection Capability".
Join ChaosMeta Community
As an open project, we embrace the open-source development model and strive to build the ChaosMeta community as an open and creative space. Going forward, all development, discussions, and related work will be transparently conducted within the community.
We welcome all forms of participation, including but not limited to asking questions, contributing code, engaging in technical discussions, and providing suggestions for improvements. We look forward to receiving community ideas and feedback to drive the project further.
If you are interested in our project or design philosophy, please star our project to show your support.
Project GitHub Repository:
https://github.com/traas-stack/chaosmeta
Official Documentation:
https://chaosmeta.gitbook.io/chaosmeta-cn
WeChat Group: add the group owner as a friend (WeChat ID: KingsonKai) to receive an invitation to join the group.


