This article introduces Alibaba Cloud's open-source RCA Benchmark for evaluating AI agents in IT operations.
Alibaba Cloud has released RCA Benchmark to build a standardized root cause analysis (RCA) evaluation dataset and evaluation protocol for Agentic Ops. It is also the industry's first open source benchmark project that evaluates, at the system level, how well AI agents diagnose distributed system failures. Alibaba Cloud has partnered with institutions working in observability, artificial intelligence for IT operations, and cloud-native infrastructure (including CAICT, the Institute of Software/Computer Network Information Center of the Chinese Academy of Sciences, Tsinghua University, Fudan University, and Nankai University) to jointly build an industry ecosystem and establish a standardized, trustworthy evaluation system for O&M agents, laying a solid foundation for large-scale industry adoption.
Based on long-term product and service experience in observability and artificial intelligence for IT operations, Alibaba Cloud has recognized that root cause analysis is the most complex and hardest-to-standardize part of O&M agent capability evaluation. Unlike tasks with fixed inputs and ground truth, such as text Q&A and code generation, RCA agents operate against continuously running, complex distributed architectures. They must proactively filter valid information from multi-source observability data such as metrics, logs, traces, and system events, trace anomaly propagation paths along service dependencies and entity topology relationships, and ultimately locate the root cause of a failure. The industry has not yet established a unified, systematic evaluation benchmark, which makes it impossible to objectively compare the fault diagnosis capabilities of different AI agents or to quantify the effect of technology evolution and capability iteration.
Industry urgently needs a unified RCA evaluation standard
As enterprise Agentic Ops enters the stage of large-scale implementation, the lack of an evaluation system has become a key constraint on industry development, and the traditional evaluation paradigm can no longer meet the development demands of artificial intelligence for IT operations:
Traditional evaluation mode completely fails
Root cause analysis is not a simple text processing task. AI agents must query metrics in real time, analyze logs and traces, assess change management events, and coordinate diagnostics across tools. Traditional evaluation methods that rely on static log fragments and a single label cannot tell whether an agent completed a full, reasoning-based diagnosis or merely scored an accidental hit based on surface alert symptoms, which significantly undermines evaluation validity.
Multi-source observable data is difficult to standardize
RCA evaluation involves multi-source observability signals such as metrics, logs, traces, and system events. These data types are coupled across time and entity dimensions, and failure impact propagates layer by layer along business call chains. Take a database slow query failure as an example. It triggers a chain reaction: increased MySQL query time, increased latency in the dependent service, upstream service timeouts, and frontend 5xx errors. Single-dimension observability data shows only partial symptoms and cannot reconstruct the complete failure propagation logic.
Causal propagation chains easily lead to evaluation misjudgment
The industry commonly confuses abnormal symptoms with failure root causes. Frontend alerts mostly reflect the end of a failure propagation path, while the real root cause often lies in downstream databases, caches, message queues, or the container scheduling layer. If a dataset does not fully describe the causal propagation path, and a diagnosis is deemed correct simply because it names a service near the alert, evaluation results are highly likely to be distorted.
Cross-domain entity identity lacks a unified specification
Naming conventions for the same business entity are fragmented across different O&M systems, such as the application performance management, Kubernetes, and cloud resource layers. Evaluation can then rely only on string matching or subjective manual judgment, which leads to unstable scoring, non-reproducible results, and unauditable processes.
In this context, Alibaba Cloud states clearly that building a systematic, standardized RCA Agent evaluation benchmark has evolved from an academic research topic into essential infrastructure for the large-scale implementation of Agentic Ops.
RCA Benchmark Core Definition
RCA Benchmark is not a single-file dataset, but a benchmark suite with a complete architecture and closed-loop logic. It consists of three modules: runtime environment, structured sample set, and evaluation protocol.
- Runtime Environment: Builds a microservice simulation system that generates real failure signals and supports interactive diagnostic queries by AI agents, replacing the traditional pattern of providing only static log fragments.
- Structured Sample Set: Builds a fault sample library with four-layer structured ground truth. Each case covers four core elements: fault type, normalized root cause entity, causal propagation chain, and key evidence checkpoints.
- Evaluation Protocol: Defines standardized scoring rules that convert AI agent outputs into quantitative scores for side-by-side comparison. Because it is centered on deterministic rules, it minimizes dependency on Large Language Model (LLM) review and keeps scoring fair and objective.
The project covers all mainstream scenarios, including microservice failures, database and middleware failures, container orchestration and cloud-native platform failures, cloud resource layer failures, and LLM and agent runtime failures.
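To make the four core elements of a structured sample concrete, here is a minimal Python sketch of what one ground truth record might look like. The class name, field names, and entity names are all hypothetical illustrations, not the project's actual schema. The example values follow the database slow query scenario described earlier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical four-layer ground truth record (not the official schema)."""
    fault_type: str                         # layer 1: fault type
    root_cause_entity: str                  # layer 2: normalized root cause entity
    propagation_chain: Tuple[str, ...]      # layer 3: causal chain, root cause first
    evidence_checkpoints: Tuple[str, ...]   # layer 4: key evidence checkpoints

    def validate(self) -> List[str]:
        """Return a list of structural problems; an empty list means well-formed."""
        problems = []
        if not self.fault_type:
            problems.append("fault_type is empty")
        if not self.propagation_chain:
            problems.append("propagation_chain is empty")
        elif self.propagation_chain[0] != self.root_cause_entity:
            problems.append("propagation_chain must start at the root cause entity")
        if not self.evidence_checkpoints:
            problems.append("no evidence checkpoints")
        return problems


# Illustrative sample: all names below are made up for this sketch.
sample = GroundTruth(
    fault_type="database.slow_query",
    root_cause_entity="mysql-orders",
    propagation_chain=("mysql-orders", "order-service", "checkout-service", "frontend"),
    evidence_checkpoints=("MySQL query time increased", "frontend 5xx errors"),
)

# A sample whose chain starts at the symptom instead of the root cause.
symptom_first = GroundTruth(
    fault_type="database.slow_query",
    root_cause_entity="mysql-orders",
    propagation_chain=("frontend", "checkout-service"),
    evidence_checkpoints=("frontend 5xx errors",),
)

print(sample.validate())
print(symptom_first.validate())
```

Keeping the propagation chain and the evidence checkpoints as explicit fields, rather than a single root cause label, is what would let a scorer check whether an agent reached the root cause through the actual causal path.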
RCA Benchmark Core Design Principles and Overall Technical Architecture
RCA Benchmark takes realistic, production-like simulation as its core design concept. Its foundation is an e-commerce microservice architecture deployed in Kubernetes clusters, containing over 40 business services with call chains up to 7 layers deep. It does not use synthetic data, and it covers typical business dependencies such as synchronous RPC, asynchronous messaging, databases, caches, message queues, and gateways. Backed by a full observability stack, it lets agents retrieve seven categories of observability data: metrics, logs, traces, alerts, resource topology, Kubernetes events, and performance profiles. By continuously injecting varied background traffic, it reproduces production load patterns such as day-night fluctuations, business peaks, and scheduled batch processing, establishing a reliable baseline for comparing system behavior before and after a failure.
The project introduces a four-layer structured ground truth system, abandoning the traditional single root cause label. It provides standardized definitions for failure types, normalized entities, causal propagation chains, and key evidence checkpoints. This is complemented by a three-dimension weighted scoring framework covering root cause identification, boundary demarcation, and diagnostic process, which are combined into a composite score at weights of 40%, 30%, and 30% respectively. Nearly 70% of the score is computed deterministically from failure-type semantic distance and entity topology distance. Multi-dimensional graded evaluation covers failure semantics matching, topology positioning accuracy, diagnostic evidence, and completeness of causal logic, systematically avoiding evaluation bias from random hits. The entire process features transparent rules, reproducible results, and auditable flows.
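The 40%, 30%, and 30% weighting above amounts to a simple weighted sum. The following Python sketch shows the arithmetic only. The dimension keys, the assumption that each sub-score is normalized to the range 0 to 1, and the function name are illustrative choices, not taken from the project's scoring protocol.

```python
# Weights from the article: root cause identification 40%,
# boundary demarcation 30%, diagnostic process 30%.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {"root_cause": 0.40, "boundary": 0.30, "process": 0.30}


def composite_score(sub_scores: dict) -> float:
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one score."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Example: the agent names the right root cause but only partially
# bounds the blast radius and documents half of the expected evidence.
score = composite_score({"root_cause": 1.0, "boundary": 0.5, "process": 0.5})
print(f"{score:.2f}")  # prints 0.70
```

With these weights, an agent that hits the root cause by luck but provides no boundary or process evidence tops out at 0.40, which illustrates how a weighted scheme can limit the payoff of random hits.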
The platform covers over 40 failure types across 6 categories at the application layer, middleware layer, container platform layer, and cloud resource layer through four injection channels: chaos engineering tools, Kubernetes-native O&M operations, configuration switches, and Alibaba Cloud service APIs. It builds a failure coverage graph across vertical and horizontal dimensions to keep the evaluation scope comprehensive and balanced. To address the industry pain point of fragmented cross-domain entity identities, the platform incorporates a unified entity model (UModel) that assigns a cross-domain unique primary key to every entity. A standardized normalization flow then performs multi-domain entity mapping and topology distance calculation, enabling end-to-end traceability, reproducibility, and auditability.
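The idea of mapping domain-specific names to one unified key and then measuring topology distance can be sketched as follows. This is an illustration under stated assumptions, not UModel's actual data model or API: the alias table, the key format, the entity names, and the use of hop count over an undirected graph as the distance are all hypothetical.

```python
from collections import deque

# Hypothetical alias table: (domain, local name) -> unified entity key.
ALIASES = {
    ("apm", "order-service"): "svc:order",
    ("k8s", "deployment/order-svc"): "svc:order",
    ("apm", "checkout-service"): "svc:checkout",
    ("k8s", "deployment/checkout"): "svc:checkout",
    ("cloud", "rds-orders-01"): "db:orders",
    ("apm", "frontend"): "svc:frontend",
}

# Hypothetical dependency topology over unified keys, treated as undirected.
EDGES = [
    ("svc:frontend", "svc:checkout"),
    ("svc:checkout", "svc:order"),
    ("svc:order", "db:orders"),
]


def normalize(domain, name):
    """Map a domain-specific name to its unified key, or None if unknown."""
    return ALIASES.get((domain, name))


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def topology_distance(adjacency, source, target):
    """Breadth-first search hop count; None if the entities are not connected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


adjacency = build_adjacency(EDGES)
predicted = normalize("k8s", "deployment/checkout")  # what the agent reported
actual = normalize("cloud", "rds-orders-01")         # ground truth root cause
print(predicted, actual, topology_distance(adjacency, predicted, actual))
```

Because both names resolve to unified keys before comparison, the scorer no longer depends on string matching between an APM service name and a Kubernetes or cloud resource name, and a near miss can be graded by its distance from the true root cause rather than marked simply right or wrong.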
The system also establishes a four-layer GSTO Quality Gate with multiple admission checks covering structure specifications, signal validity, time windows, and open adaptability settings, strictly filtering out invalid samples whose failure chains are ambiguous. Over 200 compliant samples have been accumulated to date, covering all failure type categories and classified into four difficulty levels (L1–L4), with the medium-to-high difficulty L2 and L3 scenarios serving as the core evaluation focus.
The project adheres to the principle of open source co-construction. Core capabilities including the evaluation framework, failure catalog, scoring protocol, and Quality Gate are fully open source, with contribution channels open to observability vendors, Agentic Ops developers, and enterprise SRE teams. Reserved non-public test samples and compliance gates prevent data contamination and protect the fairness and credibility of industry evaluation rankings.
Alibaba Cloud's open source RCA Benchmark establishes a standardized, reproducible, and auditable yardstick for Agentic Ops, enabling objective benchmarking and quantitative measurement of diagnostic capabilities across different agents. With its tiered difficulty system and full-scenario failure coverage, it supports enterprises in technology selection and implementation iterations. By open-sourcing core capabilities, it significantly reduces the cost of building in-house evaluation systems. Through dynamic dataset updates, saturation monitoring, and a closed-loop scenario feedback mechanism, the benchmark continues to evolve, helping build an open, shared, and long-lived ecosystem for intelligent O&M agents.
Make every failure assessment evidence-based, and every diagnostics capability quantifiable, benchmarkable, and evolvable.